Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark-8169] [ML] Add StopWordsRemover as a transformer #6742

Closed
wants to merge 4 commits into from

Conversation

hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented Jun 10, 2015

jira: https://issues.apache.org/jira/browse/SPARK-8169

stop words: http://en.wikipedia.org/wiki/Stop_words

StopWordsRemover takes a string array column and outputs a string array column with all defined stop words removed. The transformer should also come with a standard set of stop words as default.

Currently I used a minimum stop words set since on some case, small set of stop words is preferred.
ASCII char has been tested, Yet I cannot check it in due to style check.

Further thought,

  1. Maybe I should use OpenHashSet. Is it recommended?
  2. Currently I leave the null in input array untouched, i.e. Array(null, null) => Array(null, null).
  3. If the current stop words set looks too limited, any suggestion for replacement? We can have something similar to the one in SKlearn.

@SparkQA
Copy link

SparkQA commented Jun 10, 2015

Test build #34592 has finished for PR 6742 at commit b3aa957.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StopWordsRemover(override val uid: String)

*/
@Experimental
object StopWords{
val EnglishSet = ("a an and are as at be by for from has he in is it its of on that the to " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if this was some standard stopwords list like the SKLearn list you mentioned or this one from CoreNLP. Please include a reference in the docs if you use it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please address @feynmanliang 's comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will use the one from SKLearn. Please let me know if there's other suggestion.

@feynmanliang
Copy link
Contributor

  1. Maybe I should use OpenHashSet. Is it recommended?

    In my opinion set should suffice. OpenHashSet optimizes for Long and Int keys, but you are using String keys here. Perhaps some perf benchmarks may help here.

  2. Currently I leave the null in input array untouched, i.e. Array(null, null) => Array(null, null).

    I imagine this will be used downstream of Tokenizer, in which case you should not need to worry about nulls. Perhaps document the behavior in the ScalaDocs.

  3. If the current stop words set looks too limited, any suggestion for replacement? We can have something similar to the one in SKlearn.

    See in-line comments.

Also, it may be useful to think how this will interact with #7084 since we could just supply a vocabulary without stop words.

* stop words list
*/
@Experimental
object StopWords{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make this private. We only need to mention where we get the default list of stop words.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK. I'm just thinking users may write
val stopWords = StopWords.EnglishStopWords ++ Array("python", "scala") in some occasion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users can get the default value from the StopWordsRemover instance.

@mengxr
Copy link
Contributor

mengxr commented Jul 31, 2015

@hhbyyh Sorry for late review! If you don't have time to address my comments before the feature freeze, please let me know:)

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Jul 31, 2015

@mengxr I've started working on it and will try to send an update within one hour.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Jul 31, 2015

Sorry for the delay due to time difference. Update sent.

@SparkQA
Copy link

SparkQA commented Jul 31, 2015

Test build #39141 has finished for PR 6742 at commit f190217.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StopWordsRemover(override val uid: String)

/**
* stop words list
*/
private object StopWords{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

space before {

@mengxr
Copy link
Contributor

mengxr commented Jul 31, 2015

Further thought,

  1. Maybe I should use OpenHashSet. Is it recommended?

It might be faster. We can compare the performance during QA. Let's use Set for simplicity in this PR.

  1. Currently I leave the null in input array untouched, i.e. Array(null, null) => Array(null, null).

See my inline comments. I would recommend treating null as a normal word.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Aug 1, 2015

@mengxr Update sent: Separating udf, adding note about preserving null, and other style fix. Thanks.

@SparkQA
Copy link

SparkQA commented Aug 1, 2015

Test build #39310 has finished for PR 6742 at commit f190217.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StopWordsRemover(override val uid: String)

@SparkQA
Copy link

SparkQA commented Aug 1, 2015

Test build #39314 has finished for PR 6742 at commit fa959d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class StopWordsRemover(override val uid: String)

@mengxr
Copy link
Contributor

mengxr commented Aug 1, 2015

LGTM. Merged into master. Thanks!

@asfgit asfgit closed this in 8765665 Aug 1, 2015
@hhbyyh
Copy link
Contributor Author

hhbyyh commented Aug 1, 2015

Thanks for taking time finish the review. It must have been a long day.

Shall I create jira for Python and document? @mengxr

@mengxr
Copy link
Contributor

mengxr commented Aug 5, 2015

Yes, please create JIRAs for Python API and doc. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants