[SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA #10242
Conversation
Test build #47480 has finished for PR 10242 at commit
Test build #47483 has finished for PR 10242 at commit
Test build #47573 has finished for PR 10242 at commit
@yanboliang Could you help review it?
```python
                  subsamplingRate=0.05, optimizeDocConcentration=True,
                  checkpointInterval=10, maxIter=20, seed=None):
        """
        ssetParams(self, featuresCol="features", k=2,
```
Add `\` at the end of each line, otherwise the API doc cannot be generated correctly.

Typo: `ssetParams` -> `setParams`
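Both fixes can be sketched together in plain Python. The class and parameter list below are illustrative stand-ins, not the actual `pyspark.ml.clustering.LDA`: the docstring repeats the corrected `setParams` signature, and the trailing backslashes join it into one logical line so the doc generator can render it.

```python
# Hypothetical sketch; FakeLDA and its defaults are stand-ins for the real
# pyspark.ml.clustering.LDA, which is assumed unavailable here.
class FakeLDA(object):
    def setParams(self, featuresCol="features", k=2,
                  subsamplingRate=0.05, optimizeDocConcentration=True,
                  checkpointInterval=10, maxIter=20, seed=None):
        """setParams(self, featuresCol="features", k=2, \
                     subsamplingRate=0.05, optimizeDocConcentration=True, \
                     checkpointInterval=10, maxIter=20, seed=None)

        Sets params for this FakeLDA. The trailing "\\" above joins the
        signature onto one line inside the docstring, which the API doc
        generator expects; note the typo fix (setParams, not ssetParams).
        """
        self.k = k
        return self
```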
@yanboliang Pushed another commit to address the comments. BTW, for the unit test, I get a different result if I use Python 2.7; is that expected?
Test build #47788 has finished for PR 10242 at commit
Test build #47790 has finished for PR 10242 at commit
Test build #47797 has finished for PR 10242 at commit
Test build #47801 has finished for PR 10242 at commit
@zjffdu Sorry for the slow response! Could you update the `since` versions in this PR and address @yanboliang's comment? Next version will be
@yanboliang Could you make another pass after the update? Thanks!
Sorry for the late response, I will update this PR in the next few days.
Test build #52180 has finished for PR 10242 at commit
```python
>>> from pyspark.mllib.linalg import Vectors, SparseVector
>>> from pyspark.ml.clustering import LDA
>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \
```
Here we usually make the next line start with `...` instead of ending the previous line with `\`; you can refer to the existing doctests for examples.
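The doctest convention the reviewer points to can be shown without Spark (plain lists stand in for the DataFrame rows, since pyspark is assumed unavailable here): a statement that spans lines continues with `...` at the start of the next line rather than a trailing `\`.

```python
def make_rows():
    """Build toy (id, vector) rows, doctest-style.

    >>> rows = [[1, [0.0, 1.0]],
    ...         [2, [1.0, 0.0]]]
    >>> len(rows)
    2
    """
    return [[1, [0.0, 1.0]], [2, [1.0, 0.0]]]

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```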
Test build #52277 has finished for PR 10242 at commit
```python
@@ -291,6 +292,284 @@ def _create_model(self, java_model):
        return BisectingKMeansModel(java_model)


class LDAModel(JavaModel):
    """ A clustering model derived from the LDA method.
```
Should we also expose `estimatedDocConcentration` for `LDAModel`?
Test build #56318 has finished for PR 10242 at commit
Test build #56323 has finished for PR 10242 at commit
```python
@@ -59,6 +59,8 @@ def since(version):
    indent_p = re.compile(r'\n( +)')

    def deco(f):
        if not f.__doc__:
```
This is a good idea, but can you please do it in a separate PR? This is a broad change, so separating it out would be helpful (in case of conflicts, etc.).
Created SPARK-14834 for this.
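For reference, a hedged sketch of the behavior SPARK-14834 asks for. This is an approximation, not the actual code in `python/pyspark/__init__.py`: the `since` decorator appends a `.. versionadded::` directive to the docstring, and raises an informative error when the decorated function has no docstring instead of crashing on `None`.

```python
import re


def since(version):
    # Sketch of a @since decorator; details differ from the real pyspark one.
    indent_p = re.compile(r'\n( +)')

    def deco(f):
        if not f.__doc__:
            # The fix discussed here: complain clearly rather than failing
            # with an obscure error when f.__doc__ is None.
            raise ValueError("Missing docstring on %s; @since needs a "
                             "docstring to attach the version note to."
                             % f.__name__)
        indents = indent_p.findall(f.__doc__)
        indent = ' ' * (min(len(m) for m in indents) if indents else 0)
        f.__doc__ = f.__doc__.rstrip() + \
            "\n\n%s.. versionadded:: %s" % (indent, version)
        return f
    return deco
```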
@zjffdu thanks for the updates! BTW, can you please add the "[ML]" tag to the PR title?
One more high-level request: Could you please add persistence to this? I'd like to start adding persistence to Python wrappers immediately since we now have full Python coverage. You should be able to extend MLReadable, MLWritable and add a simple test.
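Spark's MLWritable/MLReadable wrappers delegate to JVM writers, so as a stand-in (pyspark is assumed unavailable here) a toy round-trip that only mimics the shape of the requested persistence API, persisting params as JSON:

```python
import json
import os
import tempfile


class ToyModel(object):
    """Toy stand-in for the MLWritable/MLReadable pattern: save a model's
    params to JSON and load them back. The real pyspark wrappers delegate
    to the JVM MLWriter/MLReader instead of writing JSON themselves."""

    def __init__(self, **params):
        self.params = params

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.params, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))


# Round-trip, as the simple test mentioned above would do.
path = os.path.join(tempfile.mkdtemp(), "lda_params.json")
model = ToyModel(k=2, maxIter=20)
model.save(path)
restored = ToyModel.load(path)
```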
```python
    @since("2.0.0")
    def vocabSize(self):
        """Vocabulary size (number of terms or terms in the vocabulary)"""
```
"terms or terms" must be a mistake from a search-and-replace. I bet it's supposed to be "terms or words"
Could you fix that here and in the Scala doc too please?
@zjffdu Do you mind if I take over this PR? I'd really like to get this API in for 2.0. You'll still be the primary author on the commit.
@jkbradley I made some updates based on your comments before, but I don't have time to implement the model persistence feature. Please take over this PR.
Test build #57050 has finished for PR 10242 at commit
OK thanks! I'll update it.
Thanks, I did the rebase and updated it. It's in this new PR: #12723. Could you please close this issue, and if you have time, take a look at the new PR? Thanks!
## What changes were proposed in this pull request?

pyspark.ml API for LDA:
* LDA, LDAModel, LocalLDAModel, DistributedLDAModel
* includes persistence

This replaces #10242.

## How was this patch tested?

* doc test for LDA, including Param setters
* unit test for persistence

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Jeff Zhang <zjffdu@apache.org>

Closes #12723 from jkbradley/zjffdu-SPARK-11940.
Besides this issue, it also fixes another issue in `python/pyspark/__init__.py`: the `since` decorator should provide a more informative message when no doc is defined but the `since` annotation is added.