[SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA #10242
Conversation
Test build #47480 has finished for PR 10242 at commit
Test build #47483 has finished for PR 10242 at commit
Test build #47573 has finished for PR 10242 at commit
@yanboliang Could you help review it?
```python
                  subsamplingRate=0.05, optimizeDocConcentration=True,
                  checkpointInterval=10, maxIter=20, seed=None):
        """
        ssetParams(self, featuresCol="features", k=2,
```
Add `\` at the end of each line, otherwise the API doc cannot be generated correctly.

Typo: `ssetParams` -> `setParams`
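Both fixes can be sketched together in plain Python. The class and parameter list below are illustrative stand-ins, not the actual `pyspark.ml.clustering.LDA`: the docstring repeats the corrected `setParams` signature, and the trailing backslashes join it into one logical line so the doc generator can render it.

```python
# Hypothetical sketch; FakeLDA and its defaults are stand-ins for the real
# pyspark.ml.clustering.LDA, which is assumed unavailable here.
class FakeLDA(object):
    def setParams(self, featuresCol="features", k=2,
                  subsamplingRate=0.05, optimizeDocConcentration=True,
                  checkpointInterval=10, maxIter=20, seed=None):
        """setParams(self, featuresCol="features", k=2, \
                     subsamplingRate=0.05, optimizeDocConcentration=True, \
                     checkpointInterval=10, maxIter=20, seed=None)

        Sets params for this FakeLDA. The trailing "\\" above joins the
        signature onto one line inside the docstring, which the API doc
        generator expects; note the typo fix (setParams, not ssetParams).
        """
        self.k = k
        return self
```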
@yanboliang Pushed another commit to address the comments. BTW, for the unit test, I get a different result if I use Python 2.7; is that expected?
Test build #47788 has finished for PR 10242 at commit
Test build #47790 has finished for PR 10242 at commit
Test build #47797 has finished for PR 10242 at commit
Test build #47801 has finished for PR 10242 at commit
@zjffdu Sorry for the slow response! Could you update the `since` versions in this PR and address @yanboliang's comment? Next version will be
@yanboliang Could you make another pass after the update? Thanks!
Sorry for the late response, I will update this PR in the next few days.
Test build #52180 has finished for PR 10242 at commit
```python
>>> from pyspark.mllib.linalg import Vectors, SparseVector
>>> from pyspark.ml.clustering import LDA
>>> df = sqlContext.createDataFrame([[1, Vectors.dense([0.0, 1.0])], \
```
Here we usually make the next line start with `...` instead of ending the previous line with `\`; you can refer to the existing doctests for examples.
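The doctest convention the reviewer points to can be shown without Spark (plain lists stand in for the DataFrame rows, since pyspark is assumed unavailable here): a statement that spans lines continues with `...` at the start of the next line rather than a trailing `\`.

```python
def make_rows():
    """Build toy (id, vector) rows, doctest-style.

    >>> rows = [[1, [0.0, 1.0]],
    ...         [2, [1.0, 0.0]]]
    >>> len(rows)
    2
    """
    return [[1, [0.0, 1.0]], [2, [1.0, 0.0]]]

if __name__ == "__main__":
    import doctest
    doctest.testmod()
```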
Test build #52277 has finished for PR 10242 at commit
```python
@@ -291,6 +292,284 @@ def _create_model(self, java_model):
        return BisectingKMeansModel(java_model)


class LDAModel(JavaModel):
    """ A clustering model derived from the LDA method.
```
Should we also expose `estimatedDocConcentration` for `LDAModel`?
Test build #56318 has finished for PR 10242 at commit
Test build #56323 has finished for PR 10242 at commit
```python
@@ -59,6 +59,8 @@ def since(version):
    indent_p = re.compile(r'\n( +)')

    def deco(f):
        if not f.__doc__:
```
This is a good idea, but can you please do it in a separate PR? This is a broad change, so separating it out would be helpful (in case of conflicts, etc.).
Created SPARK-14834 for this.
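For reference, a hedged sketch of the behavior SPARK-14834 asks for. This is an approximation, not the actual code in `python/pyspark/__init__.py`: the `since` decorator appends a `.. versionadded::` directive to the docstring, and raises an informative error when the decorated function has no docstring instead of crashing on `None`.

```python
import re


def since(version):
    # Sketch of a @since decorator; details differ from the real pyspark one.
    indent_p = re.compile(r'\n( +)')

    def deco(f):
        if not f.__doc__:
            # The fix discussed here: complain clearly rather than failing
            # with an obscure error when f.__doc__ is None.
            raise ValueError("Missing docstring on %s; @since needs a "
                             "docstring to attach the version note to."
                             % f.__name__)
        indents = indent_p.findall(f.__doc__)
        indent = ' ' * (min(len(m) for m in indents) if indents else 0)
        f.__doc__ = f.__doc__.rstrip() + \
            "\n\n%s.. versionadded:: %s" % (indent, version)
        return f
    return deco
```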
@zjffdu thanks for the updates! BTW, can you please add the "[ML]" tag to the PR title?
One more high-level request: Could you please add persistence to this? I'd like to start adding persistence to Python wrappers immediately since we now have full Python coverage. You should be able to extend MLReadable, MLWritable and add a simple test.
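Spark's MLWritable/MLReadable wrappers delegate to JVM writers, so as a stand-in (pyspark is assumed unavailable here) a toy round-trip that only mimics the shape of the requested persistence API, persisting params as JSON:

```python
import json
import os
import tempfile


class ToyModel(object):
    """Toy stand-in for the MLWritable/MLReadable pattern: save a model's
    params to JSON and load them back. The real pyspark wrappers delegate
    to the JVM MLWriter/MLReader instead of writing JSON themselves."""

    def __init__(self, **params):
        self.params = params

    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.params, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))


# Round-trip, as the simple test mentioned above would do.
path = os.path.join(tempfile.mkdtemp(), "lda_params.json")
model = ToyModel(k=2, maxIter=20)
model.save(path)
restored = ToyModel.load(path)
```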
```python
    @since("2.0.0")
    def vocabSize(self):
        """Vocabulary size (number of terms or terms in the vocabulary)"""
```
"terms or terms" must be a mistake from a search-and-replace. I bet it's supposed to be "terms or words"
Could you fix that here and in the Scala doc too please?
@zjffdu Do you mind if I take over this PR? I'd really like to get this API in for 2.0. You'll still be the primary author on the commit.
@jkbradley I made some updates based on your comments before, but I don't have time to implement the model persistence feature. Please take over this PR.
Test build #57050 has finished for PR 10242 at commit
OK thanks! I'll update it.
Thanks, I did the rebase and updated it. It's in this new PR: #12723. Could you please close this issue, and if you have time, take a look at the new PR? Thanks!
## What changes were proposed in this pull request?

pyspark.ml API for LDA:
* LDA, LDAModel, LocalLDAModel, DistributedLDAModel
* includes persistence

This replaces #10242.

## How was this patch tested?

* doc test for LDA, including Param setters
* unit test for persistence

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Jeff Zhang <zjffdu@apache.org>

Closes #12723 from jkbradley/zjffdu-SPARK-11940.
Besides this issue, it also fixes another issue in `python/pyspark/__init__.py`: the `since` decorator should provide a more informative message when no doc is defined but the `since` annotation is added.