
[SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher #19024

Conversation

BryanCutler
Member

What changes were proposed in this pull request?

This PR adds ML examples for the FeatureHasher transformer in Scala, Java, and Python.

How was this patch tested?

Manually ran the examples and verified that the output is consistent across the different APIs.

@BryanCutler BryanCutler changed the title [SPARK-21810][ML][EXAMPLES] Adding Examples for FeatureHasher [SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher Aug 22, 2017
@SparkQA

SparkQA commented Aug 22, 2017

Test build #81002 has finished for PR 19024 at commit 1f1f31f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class JavaFeatureHasherExample

@SparkQA

SparkQA commented Aug 22, 2017

Test build #81003 has finished for PR 19024 at commit 962f658.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* limitations under the License.
*/

// scalastyle:off println
Contributor

why do you need this?

Member Author

Yeah, it's not necessary since this example doesn't use println, and it looks like all the other Scala examples are wrapped with this too. Not sure if there was a reason, but it doesn't really hurt anything either.

Contributor

All the other examples use println, also to explain what is happening. I think it'd be good to add some prints explaining what is happening: it is an example, so people reading it will understand it better if you print some details. But if you don't add any, I'd remove this since it is useless.

Contributor

I think we can remove it.

It's not uncommon for examples not to have any illustrative println statements. In fact, many examples use show to display the results instead.

Member Author

I agree with that. I was wondering if I should add something, but I think it's best described in the scaladoc, letting the user see some results. I'll remove the scalastyle:off println then.

@@ -211,6 +211,65 @@ for more details on the API.
</div>
</div>

## FeatureHasher
Member Author

I put the doc under the Feature Extractors section; let me know if it's not the right place.

@BryanCutler
Member Author

@MLnick , I pretty much copied your example in the scaladoc, just added a couple more rows of data. Please take a look when you can, thanks!

@SparkQA

SparkQA commented Aug 22, 2017

Test build #81007 has finished for PR 19024 at commit ee8afcf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

categorical features to strings first.
- String columns: For categorical features, the hash value of the string "column_name=value"
is used to map to the vector index, with an indicator value of `1.0`. Thus, categorical features
are "one-hot" encoded (similarly to using `OneHotEncoder` with `dropLast=false`).
Contributor

Should link to OneHotEncoder section within the guide here.


Null (missing) values are ignored (implicitly zero in the resulting feature vector).

Since a simple modulo is used to transform the hash function to a vector index,
Contributor

We should probably say something to the effect that the hashing mechanism is the same as used for HashingTF

Member Author
@BryanCutler BryanCutler Aug 23, 2017

Sure, I can do that.
Would it be more correct to say here "Since a simple modulo is used to determine the vector index for the hashed value,.."?

@BryanCutler
Member Author

Updated doc screenshot


The hash function used here is also the [MurmurHash 3](https://en.wikipedia.org/wiki/MurmurHash)
used in [HashingTF](ml-features.html#tf-idf). Since a simple modulo is used to transform the hash
function to a vector index, it is advisable to use a power of two as the numFeatures parameter;
Member Author
@BryanCutler BryanCutler Aug 23, 2017

I read this as the hash function itself is transformed instead of the output of it. Would it be more correct to say here

"Since a simple modulo on the hashed value is used to determine the vector index,.."?

Contributor

Yeah that sounds better

Member Author

ok, I'll change the wording in HashingTF also to be consistent
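The "power of two" advice behind this wording can be illustrated with a small plain-Python sketch (an assumption of this toy: hash outputs are modeled as a uniform 8-bit range rather than MurmurHash 3's 32-bit output; since the real range is also a power of two, the same bias argument applies).

```python
from collections import Counter

def bucket_counts(n_buckets: int, hash_range: int = 256) -> Counter:
    # Treat every value in a uniform power-of-two hash range as a hashed
    # feature and map it to a vector index with a simple modulo.
    return Counter(h % n_buckets for h in range(hash_range))

# With a power of two, the 256 hash values spread perfectly evenly
# (every bucket receives exactly 256 / 8 = 32 values):
print(bucket_counts(8))

# With a non-power-of-two, buckets are uneven (modulo bias):
# 256 = 6 * 42 + 4, so four buckets get 43 values and two get 42.
print(bucket_counts(6))
```

With a power-of-two number of buckets the modulo also reduces to a bitmask on the low bits of the hash, which is why power-of-two feature dimensions map hashed values evenly to vector indices.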

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81052 has finished for PR 19024 at commit a61fb07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

it is advisable to use a power of two as the feature dimension, otherwise the features will
- not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
+ not be mapped evenly in the vector. The default feature dimension is `$2^{18} = 262,144$`.
Member Author

@MLnick , as I read this section on TF-IDF, it didn't seem right how it referred to columns, so I made some changes. Does it seem ok?

Contributor

"to vector indices" perhaps?

Member Author

yeah that's better
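As a quick sanity check on the quoted default, the feature dimension 2^18 does come out to 262,144 and is itself a power of two (plain Python, not Spark code):

```python
default_dim = 2 ** 18
print(default_dim)

# Power-of-two check: a power of two has exactly one bit set, so
# ANDing it with (itself - 1) yields zero.
assert default_dim & (default_dim - 1) == 0
```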

@SparkQA

SparkQA commented Aug 25, 2017

Test build #81136 has finished for PR 19024 at commit a7a439c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 29, 2017

Test build #81234 has finished for PR 19024 at commit a7a439c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Aug 30, 2017

Ran example and checked doc locally. LGTM, merged to master. Thanks!

@asfgit asfgit closed this in 4133c1b Aug 30, 2017
@BryanCutler BryanCutler deleted the ml-examples-FeatureHasher-SPARK-21810 branch March 6, 2018 23:36