[SPARK-9679][ML][PYSPARK] Add Python API for Stop Words Remover #8118
jenkins, retest this please.
Test build #40679 has finished for PR 8118 at commit
jenkins, retest this please
jenkins, retest this please.
Test build #40830 has finished for PR 8118 at commit
Test build #41169 has finished for PR 8118 at commit
@keyword_only
def __init__(self, inputCol=None, outputCol=None, stopWords=[]):
    """
    Initialize this instance of the StopWordsRemover.
Is there a reason why this __init__ docstring breaks the pattern of just repeating the method with default args seen elsewhere in feature.py?
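The convention the reviewer refers to can be sketched as follows: the docstring of `__init__` simply repeats the signature with its defaults. This is a minimal sketch, not the code from the PR. The `keyword_only` decorator below is a simplified stand-in for the PySpark decorator of the same name (the real one may differ in detail), and `StopWordsRemoverSketch` is a hypothetical class that does not touch Spark.

```python
import functools


def keyword_only(func):
    """Simplified stand-in (assumption) for PySpark's keyword_only decorator.

    It rejects positional arguments so callers must pass every parameter by name.
    """
    @functools.wraps(func)
    def wrapper(self, *args, **kwargs):
        if args:
            raise TypeError("Method %s forces keyword arguments." % func.__name__)
        return func(self, **kwargs)
    return wrapper


class StopWordsRemoverSketch(object):
    """Hypothetical class, used only to illustrate the docstring convention."""

    @keyword_only
    def __init__(self, inputCol=None, outputCol=None, stopWords=[]):
        """
        __init__(self, inputCol=None, outputCol=None, stopWords=[])
        """
        self.inputCol = inputCol
        self.outputCol = outputCol
        # Copy the list so instances never share the default argument object.
        self.stopWords = list(stopWords)


remover = StopWordsRemoverSketch(inputCol="raw", outputCol="filtered",
                                 stopWords=["a", "the"])
print(StopWordsRemoverSketch.__init__.__doc__.strip())
```

Because the wrapper uses `functools.wraps`, the signature-style docstring is still what `help()` shows for the decorated method.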
Test build #41654 has finished for PR 8118 at commit
"sensitive comparison over the stop words")
stopWordsObj = _jvm().org.apache.spark.ml.feature.StopWords
defaultStopWords = stopWordsObj.ENGLISH_STOP_WORDS()
print "Constructing java param pair for value "+str(defaultStopWords)
Are these prints intentional?
oh no, I was checking the type when debugging something
some small comments, LGTM after they're fixed
Test build #41672 has finished for PR 8118 at commit
jenkins, retest this please.
Test build #41679 has finished for PR 8118 at commit
@@ -29,14 +29,14 @@ import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructTyp
 /**
  * stop words list
  */
-private object StopWords {
+protected[spark] object StopWords {
private[spark] should be the same but appears more often
   /**
    * Use the same default stopwords list as scikit-learn.
    * The original list can be found from "Glasgow Information Retrieval Group"
    * [[http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words]]
    */
-  val EnglishStopWords = Array( "a", "about", "above", "across", "after", "afterwards", "again",
+  val ENGLISH_STOP_WORDS = Array( "a", "about", "above", "across", "after", "afterwards", "again",
minor: Since the object is already StopWords, would English be sufficient? We didn't use ENGLISH_STOP_WORDS because it is a mutable array.
Test build #41761 has finished for PR 8118 at commit
Test build #41764 has finished for PR 8118 at commit
LGTM except a minor issue on the test code style.
Test build #41848 has finished for PR 8118 at commit
Merged into master. Thanks!
Add a Python API for the Stop Words Remover.
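As a rough illustration of what a stop words remover does to a list of tokens, the filtering step can be sketched in plain Python without Spark. This is not the PR's implementation: the function name `remove_stop_words` and its parameters are hypothetical, the `case_sensitive` flag is modeled on the case-sensitivity parameter hinted at in the diff fragment above, and its default of `False` is an assumption.

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    """Return the tokens that are not stop words, preserving their order.

    Hypothetical helper for illustration only; it mimics the filtering idea,
    not the Spark transformer itself.
    """
    if case_sensitive:
        blocked = set(stop_words)
        return [t for t in tokens if t not in blocked]
    # Case-insensitive comparison: normalize both sides to lower case.
    blocked = set(w.lower() for w in stop_words)
    return [t for t in tokens if t.lower() not in blocked]


tokens = ["The", "quick", "fox", "and", "a", "dog"]
stop_words = ["a", "and", "the"]

print(remove_stop_words(tokens, stop_words))
print(remove_stop_words(tokens, stop_words, case_sensitive=True))
```

With case-insensitive matching the capitalized "The" is dropped along with "and" and "a"; with `case_sensitive=True` it is kept because only the lower-case form is in the list.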