[Spark-8169] [ML] Add StopWordsRemover as a transformer #6742
Test build #34592 has finished for PR 6742 at commit
*/
@Experimental
object StopWords{
val EnglishSet = ("a an and are as at be by for from has he in is it its of on that the to " +
It would be great if this was some standard stopwords list like the SKLearn list you mentioned or this one from CoreNLP. Please include a reference in the docs if you use it.
Please address @feynmanliang's comment.
I will use the one from SKLearn. Please let me know if there are other suggestions.
Also, it may be useful to think about how this will interact with #7084, since we could just supply a vocabulary without stop words.
* stop words list
*/
@Experimental
object StopWords{
Please make this private. We only need to mention where we get the default list of stop words.
OK. I was just thinking that users may occasionally want to write
val stopWords = StopWords.EnglishStopWords ++ Array("python", "scala")
Users can get the default value from the StopWordsRemover instance.
@hhbyyh Sorry for the late review! If you don't have time to address my comments before the feature freeze, please let me know :)
@mengxr I've started working on it and will try to send an update within one hour.
Sorry for the delay due to the time difference. Update sent.
Test build #39141 has finished for PR 6742 at commit
/**
* stop words list
*/
private object StopWords{
space before {
It might be faster. We can compare the performance during QA. Let's use
See my inline comments. I would recommend treating
@mengxr Update sent: separated the udf, added a note about preserving null, and made other style fixes. Thanks.
Test build #39310 has finished for PR 6742 at commit
Test build #39314 has finished for PR 6742 at commit
LGTM. Merged into master. Thanks!
Thanks for taking the time to finish the review. It must have been a long day. Shall I create JIRAs for the Python API and the documentation? @mengxr
Yes, please create JIRAs for Python API and doc. Thanks!
jira: https://issues.apache.org/jira/browse/SPARK-8169
stop words: http://en.wikipedia.org/wiki/Stop_words
StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as default.
Currently I use a minimal stop words set, since in some cases a small set of stop words is preferred.
ASCII chars have been tested, yet I cannot check that test in due to the style check.
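For illustration, the behaviour described above (a string array in, a string array out with the stop words removed) can be sketched in plain Scala without Spark. This is a minimal sketch and not the code in this PR: the object name StopWordsSketch, the function removeStopWords, and the value visibleStopWords are hypothetical, and the word list contains only the words visible in the truncated diff excerpt above, not the full default list.

```scala
object StopWordsSketch {
  // Assumption: only the words visible in the truncated diff excerpt;
  // the real default list in the PR is longer.
  val visibleStopWords: Array[String] =
    "a an and are as at be by for from has he in is it its of on that the to".split(" ")

  // Hypothetical helper: drops every token that exactly matches a stop word.
  def removeStopWords(
      tokens: Seq[String],
      stopWords: Array[String] = visibleStopWords): Seq[String] = {
    val stopSet = stopWords.toSet // set lookup instead of scanning the array per token
    tokens.filterNot(stopSet.contains)
  }
}
```

A caller who wants extra stop words can extend the list before passing it in, for example StopWordsSketch.visibleStopWords ++ Array("python", "scala"), which mirrors the use case raised in the review.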
Further thought,