[SPARK-18374][ML]Incorrect words in StopWords/english.txt #16103

hhbyyh · 2016-12-01T17:21:15Z

What changes were proposed in this pull request?

Currently the English stop words list in MLlib contains only the fragments left after removing all apostrophes, so "wouldn't" becomes "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes.

This PR adds the original forms to the stop words list to match the behavior of Tokenizer and StopWordsRemover. It also removes "won" from the list.

See more discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-18374
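
To illustrate the mismatch described above, here is a minimal pure-Python sketch. It does not call Spark: `whitespace_tokenize` and `remove_stop_words` are hypothetical stand-ins that only mimic a lowercase-and-split-on-whitespace tokenizer and an exact-match stop word filter, and both word sets are tiny made-up samples rather than the real english.txt.

```python
def whitespace_tokenize(text):
    """Stand-in tokenizer (assumption): lowercase, then split on whitespace only."""
    return text.lower().split()


def remove_stop_words(tokens, stop_words):
    """Drop every token that appears verbatim in the stop word set."""
    return [t for t in tokens if t not in stop_words]


# Old-style sample list: only the fragments left after stripping apostrophes.
fragment_list = {"i", "wouldn", "t", "that"}
# Fixed sample list: the original contraction is present.
contraction_list = {"i", "wouldn't", "that"}

tokens = whitespace_tokenize("I wouldn't say that")
kept_with_fragments = remove_stop_words(tokens, fragment_list)
kept_with_contractions = remove_stop_words(tokens, contraction_list)

print(kept_with_fragments)     # "wouldn't" is still present
print(kept_with_contractions)  # "wouldn't" has been filtered out
```

Because the tokenizer never produces "wouldn" or "t" on their own, those entries match nothing, which is why the list needs the full contraction.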

How was this patch tested?

Existing unit tests.

hhbyyh · 2016-12-01T17:24:00Z

@srowen Thanks for the comments in jira.

SparkQA · 2016-12-01T18:18:45Z

Test build #69489 has finished for PR 16103 at commit efeae8c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2016-12-02T01:36:21Z

Oh @hhbyyh, I am not very familiar with this area, but I wanted to leave a comment about a stop words list that I considered for my Elasticsearch cluster before. I made a list of the missing ones via

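# Print every word in mylist.txt that is missing from Spark's English stop words list.
with open("mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt") as f:
    target = set(f.read().split("\n"))
with open("mylist.txt") as f:
    source = f.read().split("\n")
for w in source:
    if w not in target:
        print(w)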

and I got these below:

cannot
could
here's
how's
let's
ought
that's
there's
what's
when's
where's
who's
why's
would

I am not sure whether they should be added here, but I wanted to let you know what I considered before for stop words.

srowen · 2016-12-02T12:39:24Z

Seems reasonable to me too.

hhbyyh · 2016-12-02T19:10:17Z

Thanks @HyukjinKwon. I'll add them.

SparkQA · 2016-12-02T20:27:41Z

Test build #69583 has finished for PR 16103 at commit 4cec57a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2016-12-03T10:28:58Z

mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt

 wasn
 weren
-won
 wouldn


You would then remove the other stems like "wasn", "weren", etc., right?

hhbyyh · 2016-12-03

I'm fine with both options, leaving them or removing them.

srowen · 2016-12-04

I think we should remove the spurious stems; the info on this PR suggests they aren't supposed to be there.
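
One possible way to locate such stems is sketched below. This is an assumption-laden heuristic, not the procedure used in this PR: `candidate_stems` is a hypothetical helper that flags any list entry equal to an "n't" contraction with its trailing "'t" removed, and `sample` is a made-up excerpt rather than the real file.

```python
def candidate_stems(stop_words):
    """Return entries that equal an "n't" contraction minus its "'t" suffix."""
    words = set(stop_words)
    # "wasn't" -> "wasn", "can't" -> "can"
    stems = {w[:-2] for w in words if w.endswith("n't")}
    return sorted(stems & words)


# Made-up excerpt for illustration only.
sample = ["wasn", "wasn't", "weren", "weren't", "wouldn", "wouldn't",
          "can", "can't", "would"]

candidates = candidate_stems(sample)
print(candidates)
```

The result is only a list of candidates for manual review: "can" is flagged alongside "wasn", "weren", and "wouldn" even though it is a legitimate stop word in its own right.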

SparkQA · 2016-12-05T18:26:46Z

Test build #69675 has finished for PR 16103 at commit aa8c72a.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-05T20:19:02Z

Test build #69677 has finished for PR 16103 at commit bd62396.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hhbyyh · 2016-12-06T07:45:31Z

Thanks for the review.

srowen · 2016-12-06T21:12:30Z

Merged to master

## What changes were proposed in this pull request?

Currently the English stop words list in MLlib contains only the fragments left after removing all apostrophes, so "wouldn't" becomes "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes.

This PR adds the original forms to the stop words list to match the behavior of Tokenizer and StopWordsRemover. It also removes "won" from the list.

See more discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-18374

## How was this patch tested?

Existing unit tests.

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes apache#16103 from hhbyyh/addstopwords.

add stop words

efeae8c

add more stop words

4cec57a

srowen reviewed Dec 3, 2016


hhbyyh added 2 commits December 5, 2016 09:24

Merge remote-tracking branch 'upstream/master' into addstopwords

5498738

remove argumented words

aa8c72a

fix suite

bd62396

srowen approved these changes Dec 5, 2016


asfgit closed this in fac5b75 Dec 6, 2016
