[SPARK-18374][ML]Incorrect words in StopWords/english.txt #16103
@srowen Thanks for the comments in jira.
Test build #69489 has finished for PR 16103 at commit
Oh @hhbyyh, I am not very familiar with this area, but I wanted to leave a comment about a stop words list that I considered for my Elasticsearch cluster before. I made a list of the missing ones via

    target = open("mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt").read().split("\n")
    source = open("mylist.txt").read().split("\n")
    # Print every word from my list that the Spark list does not contain
    for w in source:
        if w not in target:
            print(w)

and I got these below:

I am not sure if they should be added here or not, but I just wanted to let you know what I considered before for stop words.
Seems reasonable to me too.
Thanks @HyukjinKwon. I'll add them.
Test build #69583 has finished for PR 16103 at commit
wasn
weren
won
wouldn
You would then remove the other stems like "wasn" "weren" etc right?
I'm fine with both options, leaving them or removing them.
I think we should remove the spurious stems; the info on this PR suggests they aren't supposed to be there.
Test build #69675 has finished for PR 16103 at commit
Test build #69677 has finished for PR 16103 at commit
Thanks for the review.
Merged to master
## What changes were proposed in this pull request?

Currently the English stop words list in MLlib contains only the word fragments that remain after removing all apostrophes, so "wouldn't" becomes "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes.

This change adds the original forms to the stop words list to match the behavior of Tokenizer and StopWordsRemover. It also removes "won" from the list.

See more discussion in the jira: https://issues.apache.org/jira/browse/SPARK-18374

## How was this patch tested?

Existing unit tests.

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes apache#16103 from hhbyyh/addstopwords.
What changes were proposed in this pull request?
Currently the English stop words list in MLlib contains only the word fragments that remain after removing all apostrophes, so "wouldn't" becomes "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes.
This change adds the original forms to the stop words list to match the behavior of Tokenizer and StopWordsRemover. It also removes "won" from the list.
See more discussion in the jira: https://issues.apache.org/jira/browse/SPARK-18374
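To illustrate the mismatch described above, here is a minimal sketch in plain Python. It does not use the Spark API: `tokenize` and `remove_stop_words` are hypothetical stand-ins that only imitate lowercasing plus whitespace splitting and a simple stop word filter, and both stop word sets are tiny made-up samples rather than the contents of english.txt.

```python
def tokenize(text):
    # Hypothetical stand-in for a whitespace tokenizer: lowercase, then split
    # on whitespace only, so apostrophes stay inside the tokens.
    return text.lower().split()


def remove_stop_words(tokens, stop_words):
    # Hypothetical stand-in for stop word removal: drop exact matches only.
    return [t for t in tokens if t not in stop_words]


# Made-up sample lists, not the real english.txt.
fragments_only = {"i", "wouldn", "t", "that"}        # apostrophes stripped
with_original_forms = fragments_only | {"wouldn't"}  # original form added

tokens = tokenize("I wouldn't eat that soup")

kept_old = remove_stop_words(tokens, fragments_only)
kept_new = remove_stop_words(tokens, with_original_forms)

print(kept_old)  # "wouldn't" survives: it matches neither "wouldn" nor "t"
print(kept_new)  # "wouldn't" is filtered once the original form is listed
```

Because the token keeps its apostrophe, the fragments "wouldn" and "t" never match it, which is why the list needs the original form.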
How was this patch tested?
Existing unit tests.