
[SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA #10242

Closed
Wants to merge 4 commits.

Conversation

zjffdu
Contributor

@zjffdu zjffdu commented Dec 10, 2015

Besides this issue, this PR also fixes another issue in python/pyspark/__init__.py: it should provide a more informative message when no doc is defined but the @since annotation is added.

@SparkQA

SparkQA commented Dec 10, 2015

Test build #47480 has finished for PR 10242 at commit 792b883.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModel):
    • class DistributedLDAModel(LDAModel):
    • class LocalLDAModel(LDAModel):
    • class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval):

@SparkQA

SparkQA commented Dec 10, 2015

Test build #47483 has finished for PR 10242 at commit 5b5f091.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModel):
    • class DistributedLDAModel(LDAModel):
    • class LocalLDAModel(LDAModel):
    • class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval):

@SparkQA

SparkQA commented Dec 11, 2015

Test build #47573 has finished for PR 10242 at commit 9c2bf31.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModel):
    • class DistributedLDAModel(LDAModel):
    • class LocalLDAModel(LDAModel):
    • class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval):

@zjffdu
Contributor Author

zjffdu commented Dec 14, 2015

@yanboliang Could you help review it?

subsamplingRate=0.05, optimizeDocConcentration=True,
checkpointInterval=10, maxIter=20, seed=None):
"""
ssetParams(self, featuresCol="features", k=2,
Contributor

Add `\` at the end of each line; otherwise the API doc cannot be generated correctly.
Typo: ssetParams -> setParams
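For context, the style the reviewer is asking for can be sketched like this (a minimal standalone sketch, not the actual Spark source): a trailing backslash inside the triple-quoted docstring escapes the newline, so the generated API doc renders the signature and description on a single line.

```python
class LDASketch:
    # A hypothetical stand-in class; the real code lives in
    # python/pyspark/ml/clustering.py.
    def setParams(self, featuresCol="features", k=2):
        """
        setParams(self, featuresCol="features", k=2): \
        Sets params for LDA.
        """
        return self
```

Without the backslash, the signature and the description would land on separate lines of `__doc__`, which the doc generator does not render correctly.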

@zjffdu
Contributor Author

zjffdu commented Dec 16, 2015

@yanboliang Pushed another commit to address the comments. BTW, for the unit test I get a different result when I use Python 2.7; is that expected?

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47788 has finished for PR 10242 at commit d189853.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModel):
    • class DistributedLDAModel(LDAModel):
    • class LocalLDAModel(LDAModel):
    • class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval):

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47790 has finished for PR 10242 at commit ebf5e35.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModel):
    • class DistributedLDAModel(LDAModel):
    • class LocalLDAModel(LDAModel):
    • class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval):

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47797 has finished for PR 10242 at commit bef9c91.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModel):
    • class DistributedLDAModel(LDAModel):
    • class LocalLDAModel(LDAModel):
    • class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval):

@SparkQA

SparkQA commented Dec 16, 2015

Test build #47801 has finished for PR 10242 at commit 592f50b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModel):
    • class DistributedLDAModel(LDAModel):
    • class LocalLDAModel(LDAModel):
    • class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval):

@mengxr
Contributor

mengxr commented Feb 12, 2016

@zjffdu Sorry for the slow response! Could you update the `since` versions in this PR and address @yanboliang's comment? The next version will be 2.0.0 instead of 1.7.0.

@yanboliang Could you make another pass after the update? Thanks!

@zjffdu
Contributor Author

zjffdu commented Feb 25, 2016

Sorry for the late response; I will update this PR in the next few days.

@SparkQA

SparkQA commented Feb 29, 2016

Test build #52180 has finished for PR 10242 at commit b27c275.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LDAModel(JavaModel):
    • class DistributedLDAModel(LDAModel):
    • class LocalLDAModel(LDAModel):
    • class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval):


>>> from pyspark.mllib.linalg import Vectors, SparseVector
>>> from pyspark.ml.clustering import LDA
>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \
Contributor

Here we usually make the next line start with `...`; you can refer to the existing doctests for examples.

@SparkQA

SparkQA commented Mar 2, 2016

Test build #52277 has finished for PR 10242 at commit e8723db.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -291,6 +292,284 @@ def _create_model(self, java_model):
return BisectingKMeansModel(java_model)


class LDAModel(JavaModel):
""" A clustering model derived from the LDA method.
Contributor

Should we also expose estimatedDocConcentration for LDAModel?

@SparkQA

SparkQA commented Apr 20, 2016

Test build #56318 has finished for PR 10242 at commit 16ea17d.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 20, 2016

Test build #56323 has finished for PR 10242 at commit 372d5a5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

@zjffdu Please update this PR according to the changes at #11663 (type conversion) and #11939 (param setters using the `_set` method).

@@ -59,6 +59,8 @@ def since(version):
indent_p = re.compile(r'\n( +)')

def deco(f):
if not f.__doc__:
Member

This is a good idea, but can you please do it in a separate PR? This is a broad change, so separating it out would be helpful (in case of conflicts, etc.).

Contributor Author

create SPARK-14834 for this.
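The improvement discussed above (moved to SPARK-14834) can be sketched in isolation. This is a hypothetical version of the decorator, not the actual Spark source, showing the fail-fast check for a missing docstring:

```python
import re

def since(version):
    # Sketch of PySpark's @since decorator: it appends a
    # ".. versionadded::" note to the docstring, so the docstring must
    # exist. Raising a clear error here is the "more informative
    # message" the PR description asks for.
    indent_p = re.compile(r"\n( +)")

    def deco(f):
        if not f.__doc__:
            raise ValueError(
                "@since requires a docstring on %s; add one before "
                "applying the annotation" % f.__name__)
        indents = indent_p.findall(f.__doc__)
        indent = " " * (min(len(i) for i in indents) if indents else 0)
        f.__doc__ = f.__doc__.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, version)
        return f
    return deco
```

With this check, annotating an undocumented function fails immediately with a message naming the function, instead of failing later inside the docstring manipulation.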

@jkbradley
Member

@zjffdu Thanks for the updates! BTW, can you please add the "[ML]" tag to the PR title?

@jkbradley
Member

One more high-level request: Could you please add persistence to this? I'd like to start adding persistence to Python wrappers immediately since we now have full Python coverage. You should be able to extend MLReadable, MLWritable and add a simple test.


@since("2.0.0")
def vocabSize(self):
"""Vocabulary size (number of terms or terms in the vocabulary)"""
Member

"terms or terms" must be a mistake from a search-and-replace; I bet it's supposed to be "terms or words".
Could you fix that here, and in the Scala doc too, please?

@zjffdu zjffdu changed the title [SPARK-11940][PYSPARK] Python API for ml.clustering.LDA [SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA Apr 22, 2016
@jkbradley
Member

@zjffdu Do you mind if I take over this PR? I'd really like to get this API in for 2.0. You'll still be the primary author on the commit.

@zjffdu
Contributor Author

zjffdu commented Apr 26, 2016

@jkbradley I made some updates based on your comments, but I don't have time to implement the model persistence feature. Please take over this PR.

@SparkQA

SparkQA commented Apr 26, 2016

Test build #57050 has finished for PR 10242 at commit 2b2bafe.

  • This patch fails Python style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@jkbradley
Member

OK thanks! I'll update it

@jkbradley
Member

Thanks, I did the rebase and updated it. It's in this new PR: [https://github.com//pull/12723]

Could you please close this issue, and if you have time take a look at the new PR? Thanks!

@zjffdu zjffdu closed this Apr 27, 2016
asfgit pushed a commit that referenced this pull request Apr 29, 2016
## What changes were proposed in this pull request?

pyspark.ml API for LDA
* LDA, LDAModel, LocalLDAModel, DistributedLDAModel
* includes persistence

This replaces [#10242]

## How was this patch tested?

* doc test for LDA, including Param setters
* unit test for persistence

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Jeff Zhang <zjffdu@apache.org>

Closes #12723 from jkbradley/zjffdu-SPARK-11940.