[SPARK-8455][ML] Implement n-gram feature transformer #6887
Conversation
 * Default: 2, bigram features
 * @group param
 */
val NGramLength: IntParam = new IntParam(this, "NGramLength", "number of elements per n-gram (>=1)",
NGramLength -> nGramLength. Actually, I think calling it n should be sufficient given the context.
OK.
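For reference, a minimal sketch of the param after the suggested rename, assuming the standard spark.ml IntParam/ParamValidators pattern; the HasN trait wrapper, setter, and getter below are illustrative additions, not quoted from the diff:

import org.apache.spark.ml.param.{IntParam, ParamValidators, Params}

private[feature] trait HasN extends Params {
  /**
   * Number of elements per n-gram, must be >= 1.
   * Default: 2, bigram features
   * @group param
   */
  val n: IntParam = new IntParam(this, "n", "number of elements per n-gram (>=1)",
    ParamValidators.gtEq(1))

  /** @group setParam */
  def setN(value: Int): this.type = set(n, value)

  /** @group getParam */
  def getN: Int = $(n)

  setDefault(n -> 2)
}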
Test build #35173 has finished for PR 6887 at commit
Test build #35214 timed out for PR 6887 at commit
)))
testNGram(NGramTransformer, dataset)
}
test("input array < n yields a single n-gram consisting of input array") {
add a blank line here
OK.
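A hedged sketch of how the truncated corner-case test above might read in full; NGramTestData and testNGram are assumed suite helpers, and the expected output reflects the behavior at this point in the review (a later commit changed n > input length to yield empty output instead):

test("input array < n yields a single n-gram consisting of input array") {
  // Assumes: case class NGramTestData(inputTokens: Array[String], wantedNGrams: Array[String])
  // and a testNGram helper that checks the transformer's output column against wantedNGrams.
  val nGramTransformer = new NGram()
    .setInputCol("inputTokens")
    .setOutputCol("nGrams")
    .setN(6)
  val dataset = sqlContext.createDataFrame(Seq(
    NGramTestData(
      Array("a", "b", "c", "d", "e"),
      Array("a b c d e")
    )))
  testNGram(nGramTransformer, dataset)
}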
Test build #35288 has finished for PR 6887 at commit
Jenkins test this please
Test build #35404 has finished for PR 6887 at commit
2 thoughts:
I actually think @mengxr's suggestion makes sense; it guarantees that the output is an actual n-gram rather than being dependent on whether the input length > n or not. Another possibility could be to pad with some null word. The use case I am imagining is an n-gram HMM language model, in which case partial sequences with k words (k < n) should be represented by padding null words on the left.
Yeah, I guess I'm OK either way. I figure the normal user won't care if it's a strict n-gram but might be upset if a non-empty document produces an all-zero feature vector.
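To make the trade-off concrete, the two behaviors map directly onto Scala's sliding iterator: partial windows keep a too-short input as a single truncated group, while withPartial(false) drops it entirely (the strict route the merged commit took). A minimal plain-Scala sketch, independent of the PR's code:

object NGramWindowDemo {
  def main(args: Array[String]): Unit = {
    val tokens = Seq("a", "b")
    val n = 3

    // Partial windows: a too-short input comes back as one truncated "n-gram".
    val partial = tokens.iterator.sliding(n).map(_.mkString(" ")).toList
    println(partial) // List(a b)

    // Strict n-grams: windows shorter than n are dropped, so the
    // too-short input yields empty output.
    val strict = tokens.iterator.sliding(n).withPartial(false).map(_.mkString(" ")).toList
    println(strict) // List()
  }
}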
LGTM. Merging with master.
Implementation of n-gram feature transformer for ML.

Author: Feynman Liang <fliang@databricks.com>

Closes apache#6887 from feynmanliang/ngram-featurizer and squashes the following commits:

d2c839f [Feynman Liang] Make n > input length yield empty output
9fadd36 [Feynman Liang] Add empty and corner test cases, fix names and spaces
fe93873 [Feynman Liang] Implement n-gram feature transformer