
[SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher #19024

Conversation

BryanCutler
Member

What changes were proposed in this pull request?

This PR adds ML examples for the FeatureHasher transformer in Scala, Java, and Python.

How was this patch tested?

Manually ran the examples and verified that the output is consistent across the different APIs.

@BryanCutler BryanCutler changed the title [SPARK-21810][ML][EXAMPLES] Adding Examples for FeatureHasher [SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher Aug 22, 2017
@SparkQA

SparkQA commented Aug 22, 2017

Test build #81002 has finished for PR 19024 at commit 1f1f31f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class JavaFeatureHasherExample

@SparkQA

SparkQA commented Aug 22, 2017

Test build #81003 has finished for PR 19024 at commit 962f658.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* limitations under the License.
*/

// scalastyle:off println
Contributor

why do you need this?

Member Author

Yeah, it's not necessary since this example doesn't use println, and it looks like all the other Scala examples are wrapped with this too. Not sure if there was a reason, but it doesn't really hurt anything either.

Contributor

All the other examples use println, also to explain what is happening. I think it'd be good to add some prints explaining what is happening: it is an example, so people reading it will understand it better if you print some details. But if you don't add any, I'd remove this since it is useless.

Contributor

I think we can remove it.

It's not uncommon for examples not to have any illustrative println statements. In fact, many examples use show to display the results instead.

Member Author

I agree with that. I was wondering if I should add something, but I think it's best described in the scaladoc, letting the user see some results. I'll remove the scalastyle:off println then.

@@ -211,6 +211,65 @@ for more details on the API.
</div>
</div>

## FeatureHasher
Member Author

I put the doc under the Feature Extractors section; let me know if it's not the right place.

@BryanCutler
Member Author

@MLnick , I pretty much copied your example in the scaladoc, just added a couple more rows of data. Please take a look when you can, thanks!

@SparkQA

SparkQA commented Aug 22, 2017

Test build #81007 has finished for PR 19024 at commit ee8afcf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

categorical features to strings first.
- String columns: For categorical features, the hash value of the string "column_name=value"
is used to map to the vector index, with an indicator value of `1.0`. Thus, categorical features
are "one-hot" encoded (similarly to using `OneHotEncoder` with `dropLast=false`).
Contributor

Should link to OneHotEncoder section within the guide here.


Null (missing) values are ignored (implicitly zero in the resulting feature vector).

Since a simple modulo is used to transform the hash function to a vector index,
Contributor

We should probably say something to the effect that the hashing mechanism is the same as used for HashingTF

Member Author
@BryanCutler BryanCutler Aug 23, 2017

Sure, I can do that.
Would it be more correct to say here "Since a simple modulo is used to determine the vector index for the hashed value,.."?

@BryanCutler
Member Author

Updated doc screenshot


The hash function used here is also the [MurmurHash 3](https://en.wikipedia.org/wiki/MurmurHash)
used in [HashingTF](ml-features.html#tf-idf). Since a simple modulo is used to transform the hash
function to a vector index, it is advisable to use a power of two as the numFeatures parameter;
Member Author
@BryanCutler BryanCutler Aug 23, 2017

I read this as the hash function itself is transformed instead of the output of it. Would it be more correct to say here

"Since a simple modulo on the hashed value is used to determine the vector index,.."?

Contributor

Yeah that sounds better

Member Author

ok, I'll change the wording in HashingTF also to be consistent
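The "power of two" advice behind this wording can be illustrated with a small plain-Python sketch (an assumption of this toy: hash outputs are modeled as a uniform 8-bit range rather than MurmurHash 3's 32-bit output; since the real range is also a power of two, the same bias argument applies).

```python
from collections import Counter

def bucket_counts(n_buckets: int, hash_range: int = 256) -> Counter:
    # Treat every value in a uniform power-of-two hash range as a hashed
    # feature and map it to a vector index with a simple modulo.
    return Counter(h % n_buckets for h in range(hash_range))

# With a power of two, the 256 hash values spread perfectly evenly
# (every bucket receives exactly 256 / 8 = 32 values):
print(bucket_counts(8))

# With a non-power-of-two, buckets are uneven (modulo bias):
# 256 = 6 * 42 + 4, so four buckets get 43 values and two get 42.
print(bucket_counts(6))
```

With a power-of-two number of buckets the modulo also reduces to a bitmask on the low bits of the hash, which is why power-of-two feature dimensions map hashed values evenly to vector indices.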

@SparkQA

SparkQA commented Aug 23, 2017

Test build #81052 has finished for PR 19024 at commit a61fb07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

it is advisable to use a power of two as the feature dimension, otherwise the features will
- not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`.
+ not be mapped evenly in the vector. The default feature dimension is `$2^{18} = 262,144$`.
Member Author

@MLnick , as I read this section on TF-IDF, it didn't seem right how it referred to columns, so I made some changes. Does it seem ok?

Contributor

"to vector indices" perhaps?

Member Author

yeah that's better
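As a quick sanity check on the quoted default, the feature dimension 2^18 does come out to 262,144 and is itself a power of two (plain Python, not Spark code):

```python
default_dim = 2 ** 18
print(default_dim)

# Power-of-two check: a power of two has exactly one bit set, so
# ANDing it with (itself - 1) yields zero.
assert default_dim & (default_dim - 1) == 0
```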

@SparkQA

SparkQA commented Aug 25, 2017

Test build #81136 has finished for PR 19024 at commit a7a439c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 29, 2017

Test build #81234 has finished for PR 19024 at commit a7a439c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Aug 30, 2017

Ran example and checked doc locally. LGTM, merged to master. Thanks!

@asfgit asfgit closed this in 4133c1b Aug 30, 2017
@BryanCutler BryanCutler deleted the ml-examples-FeatureHasher-SPARK-21810 branch March 6, 2018 23:36