-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher #19024
[SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher #19024
Conversation
Test build #81002 has finished for PR 19024 at commit
|
Test build #81003 has finished for PR 19024 at commit
|
* limitations under the License. | ||
*/ | ||
|
||
// scalastyle:off println |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's not necessary since this example doesn't use println
and it looks like all other scala examples are wrapped with this also. Not sure if there was a reason, but doesn't really hurt anything either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the other examples use println
, also to explain what is happening. I think it'd be good to add some print
to explain what is happening. It is an example, thus people reading it would understand better what is happening if you print some details on it. But if you don't add, I'd remove this since it is useless.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can remove it.
It's not uncommon for examples to not have any illustrative println
statements. In fact many example rather use show
to show the results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with that. I was wondering if I should add something but I think it's best described in the scaladoc and letting the user see some results. I'll remove the scalastyle:off println
then
@@ -211,6 +211,65 @@ for more details on the API. | |||
</div> | |||
</div> | |||
|
|||
## FeatureHasher |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put the doc under the Feature Extractors section, let me know if it's not the right place
@MLnick , I pretty much copied your example in the scaladoc, just added a couple more rows of data. Please take a look when you can, thanks! |
Test build #81007 has finished for PR 19024 at commit
|
docs/ml-features.md
Outdated
categorical features to strings first. | ||
- String columns: For categorical features, the hash value of the string "column_name=value" | ||
is used to map to the vector index, with an indicator value of `1.0`. Thus, categorical features | ||
are "one-hot" encoded (similarly to using `OneHotEncoder` with `dropLast=false`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should link to OneHotEncoder
section within the guide here.
docs/ml-features.md
Outdated
|
||
Null (missing) values are ignored (implicitly zero in the resulting feature vector). | ||
|
||
Since a simple modulo is used to transform the hash function to a vector index, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably say something to the effect that the hashing mechanism is the same as used for HashingTF
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can do that.
Would it be more correct to say here "Since a simple modulo is used to determine the vector index for the hashed value,.."?
docs/ml-features.md
Outdated
|
||
The hash function used here is also the [MurmurHash 3](https://en.wikipedia.org/wiki/MurmurHash) | ||
used in [HashingTF](ml-features.html#tf-idf). Since a simple modulo is used to transform the hash | ||
function to a vector index, it is advisable to use a power of two as the numFeatures parameter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read this as the hash function itself is transformed instead of the output of it. Would it be more correct to say here
"Since a simple modulo on the hashed value is used to determine the vector index,.."?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah that sounds better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I'll change the wording in HashingTF also to be consistent
Test build #81052 has finished for PR 19024 at commit
|
docs/ml-features.md
Outdated
it is advisable to use a power of two as the feature dimension, otherwise the features will | ||
not be mapped evenly to the columns. The default feature dimension is `$2^{18} = 262,144$`. | ||
not be mapped evenly in the vector. The default feature dimension is `$2^{18} = 262,144$`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MLnick , as I read this section on TF-IDF, it didn't seem right how it referred to columns, so I made some changes. Does it seem ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"to vector indices" perhaps?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah that's better
Test build #81136 has finished for PR 19024 at commit
|
Test build #81234 has finished for PR 19024 at commit
|
Ran example and checked doc locally. LGTM, merged to master. Thanks! |
What changes were proposed in this pull request?
This PR adds ML examples for the FeatureHasher transform in Scala, Java, Python.
How was this patch tested?
Manually ran examples and verified that output is consistent for different APIs