[SPARK-6255] [MLLIB] Support multiclass classification in Python API #5137

yanboliang · 2015-03-23T14:49:53Z

Python API parity check for classification, plus multiclass classification support. The following major disparities need to be added to the Python API:

LogisticRegressionWithLBFGS
    setNumClasses
    setValidateData
LogisticRegressionModel
    getThreshold
    numClasses
    numFeatures
SVMWithSGD
    setValidateData
SVMModel
    getThreshold

For users, the greatest benefit of this PR is that the Python API now supports multiclass classification.
Users can train a multiclass classification model and use it for prediction in pyspark.


SparkQA · 2015-03-23T14:53:10Z

Test build #28998 has started for PR 5137 at commit ded847c.

This patch merges cleanly.

SparkQA · 2015-03-23T16:15:03Z

Test build #28998 has finished for PR 5137 at commit ded847c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class LinearClassificationModel(LinearModel):
- class LogisticRegressionModel(LinearClassificationModel):
- class SVMModel(LinearClassificationModel):

AmplabJenkins · 2015-03-23T16:15:07Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28998/
Test PASSed.

SparkQA · 2015-03-24T14:23:26Z

Test build #29091 has started for PR 5137 at commit b0d9c63.

This patch merges cleanly.

SparkQA · 2015-03-24T15:48:14Z

Test build #29091 has finished for PR 5137 at commit b0d9c63.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class LinearClassificationModel(LinearModel):
- class LogisticRegressionModel(LinearClassificationModel):
- class SVMModel(LinearClassificationModel):

AmplabJenkins · 2015-03-24T15:48:18Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29091/
Test PASSed.

yanboliang · 2015-03-24T15:54:22Z

For multiclass classification, I ran the iris dataset, which has 3 classes, and it works as expected.

>>> import numpy
>>> from pyspark.mllib.classification import LogisticRegressionWithLBFGS
>>> from pyspark.mllib.util import MLUtils
>>> data = MLUtils.loadLibSVMFile(sc, "/Users/ybliang/Data/tmp/iris.scale")
>>> model = LogisticRegressionWithLBFGS.train(data=data, numClasses=3)
>>> model.predict(numpy.array([-0.83, 0.16, -0.86, -0.83]))
1
>>> model.predict(numpy.array([0.11, -0.58, 0.33, 0.17]))
2
>>> model.predict(numpy.array([-0.11, -0.16, 0.39, 0.4]))
0

Is it appropriate to include a real dataset in our doctests?
@jkbradley @mengxr

jkbradley · 2015-03-25T20:27:42Z

python/pyspark/mllib/classification.py

@@ -31,13 +31,13 @@
           'SVMModel', 'SVMWithSGD', 'NaiveBayesModel', 'NaiveBayes']


-class LinearBinaryClassificationModel(LinearModel):
+class LinearClassificationModel(LinearModel):
    """
    Represents a linear binary classification model that predicts to whether an
    example is positive (1.0) or negative (0.0).


Update this doc please. Also, can you please add a note that this is a private abstract class? (That's not because of your PR, but just because we should make that clear here.)

jkbradley · 2015-03-25T20:31:04Z

@yanboliang That's great to test on real datasets, but we should not include them. It's better to generate data or use tiny datasets, as in existing doc tests.

I've added my main comments above. My only other one is about adding doc tests, but it sounds like you're working on that. Thanks!
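As an illustration of the "generate data or use tiny datasets" suggestion, here is a minimal sketch of a deterministic three-class toy dataset. The helper name `make_toy_multiclass`, the cluster centers, and the jitter range are assumptions made up for this example; they are not taken from the PR or from the existing doc tests.

```python
import random


def make_toy_multiclass(points_per_class=5, seed=42):
    """Return (label, features) pairs for three well-separated 2-D clusters."""
    rng = random.Random(seed)  # fixed seed keeps the generated data reproducible
    # Hypothetical cluster centers, one per class label.
    centers = {0.0: (0.0, 0.0), 1.0: (5.0, 0.0), 2.0: (0.0, 5.0)}
    data = []
    for label in sorted(centers):
        cx, cy = centers[label]
        for _ in range(points_per_class):
            # Jitter of at most 0.5 per coordinate, so clusters cannot overlap.
            features = [cx + rng.uniform(-0.5, 0.5),
                        cy + rng.uniform(-0.5, 0.5)]
            data.append((label, features))
    return data


toy = make_toy_multiclass()
```

In pyspark, each pair could presumably be wrapped as `LabeledPoint(label, features)` and passed to `sc.parallelize(...)` before calling `LogisticRegressionWithLBFGS.train(..., numClasses=3)`. Because the data is generated in-line, no external file is needed in the doc test.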

SparkQA · 2015-03-27T09:53:14Z

Test build #29298 has started for PR 5137 at commit fc7990b.

This patch merges cleanly.

SparkQA · 2015-03-27T11:18:39Z

Test build #29298 has finished for PR 5137 at commit fc7990b.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class LinearClassificationModel(LinearModel):
- class LogisticRegressionModel(LinearClassificationModel):
- class SVMModel(LinearClassificationModel):

AmplabJenkins · 2015-03-27T11:18:43Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29298/
Test PASSed.

jkbradley · 2015-03-27T22:20:46Z

python/pyspark/mllib/classification.py

+        return self._numClasses
+
+    @property
+    def dataWithBiasSize(self):


I don't think we want to expose dataWithBiasSize or weightsMatrix. Can you please define them in init and add underscores in front of their names to make it clear they are private?
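A rough sketch of what this request could look like: the helper values are computed once in `__init__`, stored under underscore-prefixed names, and only `numFeatures` and `numClasses` remain public properties. The class name `MultinomialLRSketch` is hypothetical and this is not the code from the PR. It assumes a weight layout of `numClasses - 1` rows, each optionally ending in a bias term, with class 0 acting as the pivot class whose margin is fixed at 0.

```python
class MultinomialLRSketch:
    """Illustrative stand-in for a multinomial logistic regression model."""

    def __init__(self, weights, numFeatures, numClasses):
        if numClasses < 3:
            raise ValueError("this sketch only covers the multiclass case")
        self._numFeatures = numFeatures
        self._numClasses = numClasses
        # Private helpers, defined once here instead of exposed as properties.
        self._dataWithBiasSize = len(weights) // (numClasses - 1)
        size = self._dataWithBiasSize
        self._weightsMatrix = [list(weights[i * size:(i + 1) * size])
                               for i in range(numClasses - 1)]

    @property
    def numFeatures(self):
        return self._numFeatures

    @property
    def numClasses(self):
        return self._numClasses

    def predict(self, x):
        """Return the class index with the largest margin (0 is the pivot)."""
        n = len(x)
        with_bias = (n + 1 == self._dataWithBiasSize)
        best_class, max_margin = 0, 0.0
        for i, row in enumerate(self._weightsMatrix):
            # zip stops at len(x), so a trailing bias term is not multiplied.
            margin = sum(w * v for w, v in zip(row, x))
            if with_bias:
                margin += row[n]
            if margin > max_margin:
                best_class, max_margin = i + 1, margin
        return best_class


# Two weight rows (classes 1 and 2), each with two features plus a bias term.
model = MultinomialLRSketch([1.0, 0.0, -0.5, 0.0, 1.0, -0.5],
                            numFeatures=2, numClasses=3)
```

With these hand-picked weights, `model.predict([1.0, 0.0])` returns 1, `model.predict([0.0, 1.0])` returns 2, and `model.predict([0.0, 0.0])` falls back to the pivot class 0 because no margin exceeds zero.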

jkbradley · 2015-03-27T22:55:31Z

Thanks for the updates! I believe those are my last comments. (Also, I realized that some of the loop optimizations in predict() I'm recommending are not in the Scala version and probably should be! I'll make a JIRA...)

SparkQA · 2015-03-29T07:08:18Z

Test build #29359 has started for PR 5137 at commit 444d5e2.

This patch merges cleanly.

SparkQA · 2015-03-29T08:29:24Z

Test build #29359 has finished for PR 5137 at commit 444d5e2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class LinearClassificationModel(LinearModel):
- class LogisticRegressionModel(LinearClassificationModel):
- class SVMModel(LinearClassificationModel):

AmplabJenkins · 2015-03-29T08:29:29Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29359/
Test PASSed.

jkbradley · 2015-03-30T19:45:44Z

python/pyspark/mllib/classification.py

    """
-    Represents a linear binary classification model that predicts to whether an
-    example is positive (1.0) or negative (0.0).
+    A private abstract class represents a classification model that predicts to


Better phrasing: "A private abstract class represents a classification model that predicts to which of a set of categories an example belongs."
-->
"A private abstract class representing a multiclass classification model."

jkbradley · 2015-03-30T19:46:18Z

Ok, just a few small comments related to that last change, so almost ready. Thanks!

SparkQA · 2015-03-31T05:22:41Z

Test build #29448 has started for PR 5137 at commit 0bd531e.

SparkQA · 2015-03-31T06:43:15Z

Test build #29448 has finished for PR 5137 at commit 0bd531e.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class LinearClassificationModel(LinearModel):
- class LogisticRegressionModel(LinearClassificationModel):
- class SVMModel(LinearClassificationModel):
This patch does not change any dependencies.

AmplabJenkins · 2015-03-31T06:43:19Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29448/
Test PASSed.

yanboliang · 2015-03-31T17:17:10Z

@jkbradley Thank you for your comments.

jkbradley · 2015-03-31T18:31:47Z

@yanboliang LGTM Thanks for the PR! Merging into master

Python API parity check for classification (support multiclass classi…

ded847c

…fication)

yanboliang changed the title ~~[SPARK-6255] [MLLIB] Python API parity check for classification~~ [WIP] [SPARK-6255] [MLLIB] Python API parity check for classification Mar 23, 2015

Support Multinomial LR model predict in Python API

b0d9c63

yanboliang changed the title ~~[WIP] [SPARK-6255] [MLLIB] Python API parity check for classification~~ [SPARK-6255] [MLLIB] Python API parity check for classification Mar 24, 2015

yanboliang changed the title ~~[SPARK-6255] [MLLIB] Python API parity check for classification~~ [SPARK-6255] [MLLIB] Support multiclass classification in Python API Mar 24, 2015

jkbradley reviewed Mar 25, 2015

address comments

fc7990b

jkbradley reviewed Mar 27, 2015

LogisticRegressionModel.predict() optimization

444d5e2

jkbradley reviewed Mar 30, 2015

address comments

0bd531e

asfgit closed this in b5bd75d Mar 31, 2015

yanboliang mentioned this pull request Apr 1, 2015

[SPARK-6580] [MLLIB] Optimize LogisticRegressionModel.predictPoint #5249

Closed

yanboliang deleted the spark-6255 branch April 24, 2015 10:02
