@hhbyyh
Contributor

@hhbyyh hhbyyh commented Dec 1, 2016

What changes were proposed in this pull request?

Currently the English stop words list in MLlib contains only the fragments left after removing all apostrophes, so "wouldn't" becomes "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes, so the intact token "wouldn't" is never matched by the list.

This PR adds the original forms to the stop words list to match the behavior of Tokenizer and StopWordsRemover. It also removes "won" from the list.

See more discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-18374
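
To make the mismatch concrete, here is a minimal pure-Python sketch. It does not call Spark; `tokenize` and `remove_stop_words` are hypothetical stand-ins that only mimic the default lowercase-and-split-on-whitespace tokenization and a plain membership filter, and both word lists are toy excerpts rather than the real `english.txt`.

```python
def tokenize(text):
    # Stand-in for the default tokenizer behavior: lowercase, split on whitespace.
    # Apostrophes are left inside the token.
    return text.lower().split()


def remove_stop_words(tokens, stop_words):
    # Stand-in for stop word filtering: drop tokens found in the list.
    stop = set(stop_words)
    return [t for t in tokens if t not in stop]


# Toy excerpt of the old list: only the apostrophe-stripped fragments.
old_list = ["i", "wouldn", "t", "do", "that"]
# Toy excerpt after this change: the original contraction is included.
new_list = old_list + ["wouldn't"]

tokens = tokenize("I wouldn't do that today")
print(remove_stop_words(tokens, old_list))  # "wouldn't" survives the filter
print(remove_stop_words(tokens, new_list))  # "wouldn't" is removed
```

With the fragment-only list the contraction passes through untouched, because no token ever equals "wouldn" or "t" unless the tokenizer splits on apostrophes.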

How was this patch tested?

Existing unit tests.

@hhbyyh
Contributor Author

hhbyyh commented Dec 1, 2016

@srowen Thanks for the comments in jira.

@SparkQA

SparkQA commented Dec 1, 2016

Test build #69489 has finished for PR 16103 at commit efeae8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Oh @hhbyyh, I'm not very familiar with this area, but I wanted to leave a comment about a stop words list that I once considered for my Elasticsearch cluster. I listed the words it contains that are missing from the Spark list via

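# Print every word in mylist.txt that is missing from Spark's English stop words list.
with open("mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt") as f:
    target = set(f.read().split("\n"))
with open("mylist.txt") as f:
    source = f.read().split("\n")
for w in source:
    if w not in target:
        print(w)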

and I got these below:

cannot
could
here's
how's
let's
ought
that's
there's
what's
when's
where's
who's
why's
would

I am not sure whether they should be added here, but I wanted to share what I had considered as stop words before.

@srowen
Member

srowen commented Dec 2, 2016

Seems reasonable to me too.

@hhbyyh
Contributor Author

hhbyyh commented Dec 2, 2016

Thanks @HyukjinKwon. I'll add them.

@SparkQA

SparkQA commented Dec 2, 2016

Test build #69583 has finished for PR 16103 at commit 4cec57a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

wasn
weren
won
wouldn
Member

You would then remove the other stems like "wasn", "weren", etc., right?

Contributor Author

I'm fine with both options, leaving them or removing them.

Member

I think we should remove the spurious stems; the info on this PR suggests they aren't supposed to be there.
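
One way to locate such stems could be sketched as follows. This is only an illustration: `spurious_stems` is a hypothetical helper, the word list is a toy excerpt rather than the real file, and the hits are candidates for manual review, since a stem such as "can" (from "can't") is a legitimate stop word in its own right.

```python
def spurious_stems(stop_words):
    # Collect what is left of each "...'t" contraction once the "'t" is cut off,
    # then report the stems that also appear in the list as separate entries.
    words = set(stop_words)
    stems = {w[:-2] for w in words if w.endswith("'t")}
    return sorted(words & stems)


# Toy excerpt mixing contractions, their stems, and an ordinary stop word.
sample = [
    "wasn", "wasn't",
    "weren", "weren't",
    "won", "won't",
    "wouldn", "wouldn't",
    "would",
]

print(spurious_stems(sample))  # ['wasn', 'weren', 'won', 'wouldn']
```

"would" is not reported because no contraction in the sample reduces to it.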

@SparkQA

SparkQA commented Dec 5, 2016

Test build #69675 has finished for PR 16103 at commit aa8c72a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2016

Test build #69677 has finished for PR 16103 at commit bd62396.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@hhbyyh
Contributor Author

hhbyyh commented Dec 6, 2016

Thanks for the review.

@srowen
Member

srowen commented Dec 6, 2016

Merged to master

@asfgit asfgit closed this in fac5b75 Dec 6, 2016
robert3005 pushed a commit to palantir/spark that referenced this pull request Dec 15, 2016
## What changes were proposed in this pull request?

Currently the English stop words list in MLlib contains only the fragments left after removing all apostrophes, so "wouldn't" becomes "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes, so the intact token "wouldn't" is never matched by the list.

This PR adds the original forms to the stop words list to match the behavior of Tokenizer and StopWordsRemover. It also removes "won" from the list.

See more discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-18374

## How was this patch tested?
Existing unit tests.

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes apache#16103 from hhbyyh/addstopwords.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017