[SPARK-22735][ML][DOC] Added VectorSizeHint docs and examples. #20285

MrBago · 2018-01-17T01:58:03Z

What changes were proposed in this pull request?

Added documentation for new transformer.

SparkQA · 2018-01-17T02:29:38Z

Test build #86223 has finished for PR 20285 at commit b4f2c71.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-17T05:27:54Z

Test build #86221 has finished for PR 20285 at commit 85d0db0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler

Just some minor things, but this looks pretty good. Could you post a screenshot of the doc after building? Are you planning on adding a Java example?

BryanCutler · 2018-01-17T18:45:14Z

docs/ml-features.md

@@ -1283,6 +1283,48 @@ for more details on the API.
 </div>
 </div>

+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors an a column of


typo 'an a column' -> 'in a column'

BryanCutler · 2018-01-17T18:49:55Z

docs/ml-features.md

+meatadata.
+
+`VectorSizeHint` can also take an optional `handleInvalid` parameter which controls its
+behaviour when the vector column contains nulls for vectors of the wrong size. By default


typo: 'nulls for vectors..' -> 'nulls or vectors'

BryanCutler · 2018-01-17T18:52:30Z

examples/src/main/python/ml/vector_size_hint_example.py

+if __name__ == "__main__":
+    spark = SparkSession\
+        .builder\
+        .appName("VectorAssemblerExample")\


should be "VectorSizeHintExample" - same with other apis

BryanCutler · 2018-01-17T18:54:49Z

examples/src/main/python/ml/vector_size_hint_example.py

+
+    sizeHint = VectorSizeHint(
+        inputCol="userFeatures",
+        handleInvalid="sip",


typo "sip" -> "skip"

SparkQA · 2018-01-17T19:47:20Z

Test build #86287 has finished for PR 20285 at commit c0a53de.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MrBago · 2018-01-18T00:22:14Z

MrBago · 2018-01-18T00:22:32Z

MrBago · 2018-01-18T00:23:33Z

Thanks for the review @BryanCutler. I've added a Java example and uploaded 2 screenshots.

SparkQA · 2018-01-18T00:37:06Z

Test build #86302 has finished for PR 20285 at commit 0cdfc1b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

BryanCutler

thanks, LGTM!

viirya · 2018-01-18T00:53:42Z

examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java

+import static org.apache.spark.sql.types.DataTypes.*;
+
+// $example on$
+// $example off$


Do we need the above two lines?

viirya · 2018-01-18T00:55:04Z

examples/src/main/python/ml/vector_size_hint_example.py

+        inputCols=["hour", "mobile", "userFeatures"],
+        outputCol="features")
+
+    # This dataframe can be used by used by downstream transformers as before


I think there are some typos here ("used by used by").

viirya · 2018-01-18T00:56:14Z

examples/src/main/scala/org/apache/spark/examples/ml/VectorSizeHintExample.scala

+      .setInputCols(Array("hour", "mobile", "userFeatures"))
+      .setOutputCol("features")
+
+    // This dataframe can be used by used by downstream transformers as before


viirya · 2018-01-18T00:57:52Z

docs/ml-features.md

@@ -1283,6 +1283,56 @@ for more details on the API.
 </div>
 </div>

+## VectorSizeHint
+
+It can sometimes be useful to explicitly specify the size of the vectors a column of


a column of Vector -> for a column of Vector?

viirya · 2018-01-18T01:03:47Z

docs/ml-features.md

+`VectorType`. For example, `VectorAssembler` uses size information from its input columns to
+produce size information and metadata for its output column. While in some cases this information
+can be obtained by inspecting the contents of the column, in a streaming dataframe the contents are
+not available until the stream is started. `VectorSizeHint` allows a user to explicitly specify the


nit: a user -> an user

I don't know if the Spark style guide covers this, but I believe "a user" is generally the preferred form: https://english.stackexchange.com/a/105117.

viirya · 2018-01-18T01:04:13Z

docs/ml-features.md

+vector size for a column so that `VectorAssembler`, or other transformers that might
+need to know vector size, can use that column as an input.
+
+To use `VectorSizeHint` a user must set the `inputCol` and `size` parameters. Applying this


a user -> an user

"a user" is correct because the pronunciation of "user" starts with a "y" sound.

SparkQA · 2018-01-19T01:58:56Z

Test build #86366 has finished for PR 20285 at commit 6228902.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MrBago · 2018-01-22T21:23:49Z

I'd like to prioritize getting this merged to ensure our documentation is complete for the 2.3 release. @viirya and @WeichenXu123 would you mind having another look at it?

dongjoon-hyun · 2018-01-23T18:37:57Z

examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java

+      createStructField("userFeatures", new VectorUDT(), false),
+      createStructField("clicked", DoubleType, false)
+    });
+    Row row = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);


Hi, @MrBago. It seems that we need to add one more row here.

RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0), 0.0);

WeichenXu123

LGTM except a minor comment. 👍

WeichenXu123 · 2018-01-23T18:43:04Z

docs/ml-features.md

+`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
+also be set to "skip", indicating that rows containing invalid values should be filtered out from
+the resulting dataframe, or `optimistic` indicating that all rows should be kept. When
+`handleInvalid` is set to `optimistic` the user takes responsibility for ensuring that the column


`optimistic` --> "optimistic"
Backquotes should only be used for code variables.

dongjoon-hyun · 2018-01-23T18:50:51Z

examples/src/main/java/org/apache/spark/examples/ml/JavaVectorSizeHintExample.java

+      createStructField("clicked", DoubleType, false)
+    });
+    Row row0 = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);
+    Row row1 = RowFactory.create(0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0);


Can we use the same data set as the other examples?
I mean the second row is different from what I suggested and from the other examples.

Sorry, yes, it should be fixed now.

SparkQA · 2018-01-23T19:07:35Z

Test build #86541 has finished for PR 20285 at commit 4de3f81.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-23T19:12:16Z

Test build #86543 has finished for PR 20285 at commit 6055a8c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun

+1, LGTM.

WeichenXu123

LGTM.

mengxr · 2018-01-23T21:11:10Z

docs/ml-features.md

+`handleInvalid` is set to "error", indicating an exception should be thrown. This parameter can
+also be set to "skip", indicating that rows containing invalid values should be filtered out from
+the resulting dataframe, or "optimistic" indicating that all rows should be kept. When
+`handleInvalid` is set to "optimistic" the user takes responsibility for ensuring that the column


It's not clear to me what the expected behavior of "optimistic" is. How is it different from "error"? Does it output null?

I've updated it. Let me know if you think we can make it clearer.
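
For readers following this thread, here is a minimal sketch of how the three `handleInvalid` modes described in the doc text differ. It is a plain-Python toy model, not Spark code: the function `apply_size_hint` and its `handle_invalid` argument are hypothetical names invented for illustration, and a list of Python lists (with `None` for nulls) stands in for the vector column. It assumes "optimistic" performs no validation and keeps every row unchanged, which is how the doc wording above reads.

```python
from typing import List, Optional, Sequence

# A "vector" here is just a sequence of floats; None models a null entry.
Vector = Optional[Sequence[float]]


def apply_size_hint(rows: List[Vector], size: int,
                    handle_invalid: str = "error") -> List[Vector]:
    """Toy model of the handleInvalid modes (hypothetical helper, not Spark API)."""

    def is_valid(v: Vector) -> bool:
        # A row is invalid if it is null or has the wrong length.
        return v is not None and len(v) == size

    if handle_invalid == "optimistic":
        # No validation: all rows are kept, and the caller is responsible
        # for making sure they really have the declared size.
        return list(rows)
    if handle_invalid == "skip":
        # Invalid rows are filtered out of the result.
        return [v for v in rows if is_valid(v)]
    if handle_invalid == "error":
        # The first invalid row raises an exception.
        for v in rows:
            if not is_valid(v):
                raise ValueError(f"expected a vector of size {size}, got {v!r}")
        return list(rows)
    raise ValueError(f"unknown handle_invalid value: {handle_invalid!r}")
```

With the two example rows discussed earlier in this review (one vector of length 3, one of length 2) and `size=3`, "skip" would keep only the first row, "error" would raise on the second, and "optimistic" would return both rows untouched, so a downstream consumer would see a vector whose length contradicts the declared size.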

mengxr · 2018-01-23T21:11:40Z

docs/ml-features.md

+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+Refer to the [VectorSizeHint Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSizeHint)


minor: Do we need to mention Scala explicitly here?

I don't think so :), but I think we should leave it to be consistent with other examples.

mengxr · 2018-01-23T22:12:11Z

LGTM. Merged into master and branch-2.3. Thanks!

## What changes were proposed in this pull request?

Added documentation for new transformer.

Author: Bago Amirbekian <bago@databricks.com>

Closes #20285 from MrBago/sizeHintDocs.

(cherry picked from commit 05839d1)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

SparkQA · 2018-01-23T22:12:30Z

Test build #86548 has finished for PR 20285 at commit 3055eec.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2018-01-23T22:33:27Z

Hi, @mengxr.
Could you resolve the JIRA, too?

https://issues.apache.org/jira/browse/SPARK-22735

Thanks!

MrBago added 2 commits January 16, 2018 17:55

"Added VectorSizeHint docs and examples."

85d0db0

Fix comments

b4f2c71

BryanCutler reviewed Jan 17, 2018

Clean up typos.

c0a53de

MrBago added 2 commits January 17, 2018 15:46

Added java example for VectorSizeHint.

00f93a6

Add java size hint example to spark.ml docs.

0cdfc1b

BryanCutler approved these changes Jan 18, 2018

viirya reviewed Jan 18, 2018

PR feedback.

6228902

dongjoon-hyun reviewed Jan 23, 2018

WeichenXu123 reviewed Jan 23, 2018

dongjoon-hyun reviewed Jan 23, 2018

MrBago added 2 commits January 23, 2018 10:52

Added another row to Java example.

43b0baf

Updated doc style.

6055a8c

MrBago force-pushed the sizeHintDocs branch from 4de3f81 to 6055a8c January 23, 2018 18:53

dongjoon-hyun reviewed Jan 23, 2018

WeichenXu123 approved these changes Jan 23, 2018

mengxr requested changes Jan 23, 2018

Update language for "optimistic" handleInvalid option.

3055eec

mengxr approved these changes Jan 23, 2018

asfgit closed this in 05839d1 Jan 23, 2018

MrBago deleted the sizeHintDocs branch January 24, 2018 19:27
