[SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA PR2 #12723

jkbradley · 2016-04-27T02:44:09Z

What changes were proposed in this pull request?

pyspark.ml API for LDA

- LDA, LDAModel, LocalLDAModel, DistributedLDAModel
- includes persistence
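For readers unfamiliar with the classes listed above, here is a minimal usage sketch. It assumes Spark 2.0 or later (for `SparkSession` and `pyspark.ml.linalg`) and a local Spark installation; the function name `run_lda_example`, the toy data, and the temporary path are illustrative and not taken from this PR.

```python
import tempfile


def run_lda_example():
    """Fit a tiny LDA model, then save and reload it (needs pyspark >= 2.0)."""
    # Imported lazily so this module can be loaded without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.clustering import LDA, DistributedLDAModel

    spark = (SparkSession.builder
             .master("local[1]")
             .appName("lda-sketch")
             .getOrCreate())
    try:
        # Two toy "documents" as term-count vectors over a 2-word vocabulary.
        df = spark.createDataFrame(
            [(1, Vectors.dense([0.0, 1.0])), (2, Vectors.sparse(2, {0: 1.0}))],
            ["id", "features"],
        )
        lda = LDA(k=2, seed=1, optimizer="em")
        model = lda.fit(df)  # the EM optimizer yields a DistributedLDAModel

        # save() expects a path that does not exist yet.
        path = tempfile.mkdtemp() + "/lda_model"
        model.save(path)
        reloaded = DistributedLDAModel.load(path)
        return model.vocabSize(), reloaded.vocabSize(), reloaded.isDistributed()
    finally:
        spark.stop()
```

Calling `run_lda_example()` starts a local Spark session, so it is best run outside the regular test suite.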

This replaces [https://github.com//pull/10242]

How was this patch tested?

- doc test for LDA, including Param setters
- unit test for persistence

jkbradley · 2016-04-27T02:45:12Z

CC: @yanboliang Would you have time to take a look at this PR? Thanks!

SparkQA · 2016-04-27T02:49:08Z

Test build #57081 has finished for PR 12723 at commit 4f807e8.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class DistributedLDAModel(LDAModel, JavaMLReadable, JavaMLWritable):
- class LocalLDAModel(LDAModel, JavaMLReadable, JavaMLWritable):
- class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval,

yanboliang · 2016-04-27T13:14:50Z

python/pyspark/ml/clustering.py

+    pass
+
+
+class LDA(JavaEstimator, HasFeaturesCol, HasMaxIter, HasSeed, HasCheckpointInterval,


@inherit_doc
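As background for this suggestion: a decorator of this kind lets a subclass reuse its parent's docstrings instead of repeating them. The sketch below is a simplified stand-in written for illustration, not PySpark's implementation; `inherit_doc_sketch`, `BaseModel`, and `ToyModel` are hypothetical names.

```python
def inherit_doc_sketch(cls):
    """Copy missing docstrings on public members from the nearest base class."""
    for name, member in vars(cls).items():
        # Skip private names, non-callables, and members already documented.
        if name.startswith("_") or not callable(member) or member.__doc__:
            continue
        for base in cls.__mro__[1:]:
            parent = getattr(base, name, None)
            if parent is not None and getattr(parent, "__doc__", None):
                member.__doc__ = parent.__doc__
                break
    return cls


class BaseModel:
    def vocab_size(self):
        """Vocabulary size (number of terms)."""
        raise NotImplementedError


@inherit_doc_sketch
class ToyModel(BaseModel):
    def vocab_size(self):  # no docstring here; it is inherited from BaseModel
        return 2
```

With the decorator applied, `ToyModel.vocab_size.__doc__` carries the parent's text, so generated API docs stay complete without duplicated docstrings.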

yanboliang · 2016-04-27T13:36:34Z

Made one pass, thanks.

jkbradley · 2016-04-27T19:24:05Z

@yanboliang Thanks! I think I addressed everything so far.

SparkQA · 2016-04-27T20:06:14Z

Test build #57163 has finished for PR 12723 at commit 46979e8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-04-28T14:19:49Z

I saw you put the save/load tests in tests.py rather than in the doctest. I think doctests are not only unit tests but also illustrate how to use this Estimator/Model, so it would be better to have a simple save/load test in the doctest. LGTM otherwise. Thanks!

jkbradley · 2016-04-28T18:25:24Z

I've wondered about the save/load in doc tests. On the one hand, it's nice to have that example, but on the other hand, it's going to be an extra save/load test for every run of the Jenkins tests (once we beef up the actual unit tests). I'll add save/load for now but we should re-evaluate in the future.
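To make the trade-off concrete, here is a small, Spark-free sketch of a doctest that doubles as a save/load usage example. `ToyTopicModel` and `run_doc_examples` are hypothetical stand-ins, not part of this PR; the point is only that the file round trip runs every time the doctests do.

```python
import doctest
import json
import os
import tempfile


class ToyTopicModel:
    """Hypothetical stand-in for a model with save/load.

    >>> model = ToyTopicModel(k=2)
    >>> path = os.path.join(tempfile.mkdtemp(), "toy_model.json")
    >>> model.save(path)
    >>> ToyTopicModel.load(path).k
    2
    """

    def __init__(self, k):
        self.k = k

    def save(self, path):
        with open(path, "w") as f:
            json.dump({"k": self.k}, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))


def run_doc_examples():
    """Run the docstring examples above and return (failed, attempted)."""
    globs = {"ToyTopicModel": ToyTopicModel, "os": os, "tempfile": tempfile}
    test = doctest.DocTestParser().get_doctest(
        ToyTopicModel.__doc__, globs, "ToyTopicModel", None, 0
    )
    return doctest.DocTestRunner(verbose=False).run(test)
```

Each call to `run_doc_examples()` writes and reads a temporary file, which is the per-run cost being weighed here against the value of the example.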

jkbradley · 2016-04-28T19:06:59Z

Updated!

SparkQA · 2016-04-28T19:36:52Z

Test build #57268 has finished for PR 12723 at commit f37c1c1.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-04-28T21:20:53Z

test this please

SparkQA · 2016-04-28T22:03:23Z

Test build #57281 has finished for PR 12723 at commit f37c1c1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2016-04-28T22:27:14Z

Ready now?

yanboliang · 2016-04-29T14:01:10Z

LGTM, thanks!

jkbradley · 2016-04-29T17:41:02Z

Thanks @yanboliang and @zjffdu!
Merging with master.

zjffdu and others added 7 commits April 26, 2016 16:37

[SPARK-11940][PYSPARK] Python API for ml.clustering.LDA

c0367f4

address comments

417de17

code style fix

66f265f

address comments

09d5ca7

added type converter and used set instead of paramMap directly

4f9bdaa

remaining PR cleanups, plus fixing use of :py:attr:

0d12924

Added persistence to LDA in Python

4f807e8

jkbradley mentioned this pull request Apr 27, 2016

[SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA #10242

Closed

yanboliang reviewed Apr 27, 2016

Added keepLastCheckpoint Param to python LDA. Added Experimental, inherit_doc to LDA and models

46979e8

added save/load to doc test in Python LDA

f37c1c1

asfgit closed this in 775772d Apr 29, 2016

jkbradley deleted the zjffdu-SPARK-11940 branch April 29, 2016 18:49
