[SPARK-6255] [MLLIB] Support multiclass classification in Python API #5137

Closed
wants to merge 5 commits into from

yanboliang
Contributor

Python API parity check for classification, plus multiclass classification support. The following major gaps need to be filled in the Python API:

LogisticRegressionWithLBFGS
    setNumClasses
    setValidateData
LogisticRegressionModel
    getThreshold
    numClasses
    numFeatures
SVMWithSGD
    setValidateData
SVMModel
    getThreshold

For users, the greatest benefit of this PR is that the Python API now supports multiclass classification.
Users can train a multiclass classification model and use it for prediction in PySpark.
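
As a rough illustration of what the accessors listed above are meant to expose on a binary model, here is a minimal pure-Python sketch. `ThresholdedBinaryModel` and `sigmoid` are hypothetical names invented for this note, and returning the raw probability when the threshold is `None` is an assumption of the sketch. It is not the code in this PR.

```python
import math


def sigmoid(margin):
    """Numerically stable logistic function."""
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


class ThresholdedBinaryModel:
    """Hypothetical stand-in for a binary linear classifier (not pyspark code)."""

    def __init__(self, weights, intercept, threshold=0.5):
        self._weights = list(weights)
        self._intercept = intercept
        self._threshold = threshold

    @property
    def numFeatures(self):
        # One weight per feature.
        return len(self._weights)

    @property
    def numClasses(self):
        # This sketch only covers the binary case.
        return 2

    def getThreshold(self):
        return self._threshold

    def predict(self, x):
        margin = sum(w * v for w, v in zip(self._weights, x)) + self._intercept
        prob = sigmoid(margin)
        if self._threshold is None:
            # Assumption: with no threshold, return the raw score.
            return prob
        return 1 if prob > self._threshold else 0


model = ThresholdedBinaryModel([1.0, -1.0], intercept=0.0)
print(model.numFeatures, model.numClasses, model.getThreshold())
print(model.predict([2.0, 0.0]), model.predict([0.0, 2.0]))
```

The getter lets callers check which cutoff turns a score into a 0/1 label without reaching into private state.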

@SparkQA

SparkQA commented Mar 23, 2015

Test build #28998 has started for PR 5137 at commit ded847c.

  • This patch merges cleanly.

@yanboliang changed the title from "[SPARK-6255] [MLLIB] Python API parity check for classification" to "[WIP] [SPARK-6255] [MLLIB] Python API parity check for classification" on Mar 23, 2015
@SparkQA

SparkQA commented Mar 23, 2015

Test build #28998 has finished for PR 5137 at commit ded847c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LinearClassificationModel(LinearModel):
    • class LogisticRegressionModel(LinearClassificationModel):
    • class SVMModel(LinearClassificationModel):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/28998/

@SparkQA

SparkQA commented Mar 24, 2015

Test build #29091 has started for PR 5137 at commit b0d9c63.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 24, 2015

Test build #29091 has finished for PR 5137 at commit b0d9c63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LinearClassificationModel(LinearModel):
    • class LogisticRegressionModel(LinearClassificationModel):
    • class SVMModel(LinearClassificationModel):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29091/

@yanboliang changed the title from "[WIP] [SPARK-6255] [MLLIB] Python API parity check for classification" to "[SPARK-6255] [MLLIB] Python API parity check for classification" on Mar 24, 2015
@yanboliang changed the title from "[SPARK-6255] [MLLIB] Python API parity check for classification" to "[SPARK-6255] [MLLIB] Support multiclass classification in Python API" on Mar 24, 2015
@yanboliang
Contributor Author

For multiclass classification, I ran the model on the iris dataset, which has 3 classes, and it works as expected.

>>> # `sc` is the SparkContext provided by the pyspark shell.
>>> import numpy
>>> from pyspark.mllib.classification import LogisticRegressionWithLBFGS
>>> from pyspark.mllib.util import MLUtils
>>> data = MLUtils.loadLibSVMFile(sc, "/Users/ybliang/Data/tmp/iris.scale")
>>> model = LogisticRegressionWithLBFGS.train(data=data, numClasses=3)
>>> model.predict(numpy.array([-0.83, 0.16, -0.86, -0.83]))
1
>>> model.predict(numpy.array([0.11, -0.58, 0.33, 0.17]))
2
>>> model.predict(numpy.array([-0.11, -0.16, 0.39, 0.4]))
0

Is it appropriate to use a real dataset in our doctests?
@jkbradley @mengxr

@@ -31,13 +31,13 @@
            'SVMModel', 'SVMWithSGD', 'NaiveBayesModel', 'NaiveBayes']


-class LinearBinaryClassificationModel(LinearModel):
+class LinearClassificationModel(LinearModel):
     """
     Represents a linear binary classification model that predicts to whether an
     example is positive (1.0) or negative (0.0).
Member

Update this doc please. Also, can you please add a note that this is a private abstract class? (That's not because of your PR, but just because we should make that clear here.)

@jkbradley
Member

@yanboliang That's great to test on real datasets, but we should not include them. It's better to generate data or use tiny datasets, as in existing doc tests.

I've added my main comments above. My only other one is about adding doc tests, but it sounds like you're working on that. Thanks!
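
One way to follow the "generate data or use tiny datasets" advice is to keep the data inside the doctest itself and seed any randomness so the expected output is stable. The sketch below is pure Python and only illustrates the pattern: `make_tiny_dataset` and `run_doctests` are hypothetical helpers made up for this note, not part of pyspark or of this PR.

```python
import doctest
import random


def make_tiny_dataset(points_per_class=2, seed=42):
    """Generate a tiny, reproducible 3-class dataset of (label, features) pairs.

    >>> data = make_tiny_dataset()
    >>> len(data)
    6
    >>> sorted({label for label, _ in data})
    [0, 1, 2]
    >>> all(len(features) == 2 for _, features in data)
    True
    """
    rng = random.Random(seed)  # fixed seed keeps the doctest deterministic
    centers = {0: (0.0, 0.0), 1: (5.0, 0.0), 2: (0.0, 5.0)}
    data = []
    for label, (cx, cy) in sorted(centers.items()):
        for _ in range(points_per_class):
            # Small jitter around each class center.
            features = [cx + rng.uniform(-0.5, 0.5), cy + rng.uniform(-0.5, 0.5)]
            data.append((label, features))
    return data


def run_doctests():
    """Run the doctest in make_tiny_dataset and return doctest's TestResults."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(
        make_tiny_dataset.__doc__,
        {"make_tiny_dataset": make_tiny_dataset},
        "make_tiny_dataset",
        None,
        0,
    )
    runner = doctest.DocTestRunner(verbose=False)
    return runner.run(test)


results = run_doctests()
print(results.failed, results.attempted)
```

The doctest asserts only on properties that hold for any seed (size, label set, feature width), so it does not depend on exact floating-point values.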

@SparkQA

SparkQA commented Mar 27, 2015

Test build #29298 has started for PR 5137 at commit fc7990b.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 27, 2015

Test build #29298 has finished for PR 5137 at commit fc7990b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LinearClassificationModel(LinearModel):
    • class LogisticRegressionModel(LinearClassificationModel):
    • class SVMModel(LinearClassificationModel):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29298/

        return self._numClasses

    @property
    def dataWithBiasSize(self):
Member

I don't think we want to expose dataWithBiasSize or weightsMatrix. Can you please define them in __init__ and add underscores in front of their names to make it clear they are private?

@jkbradley
Member

Thanks for the updates! I believe those are my last comments. (Also, I realized that some of the loop optimizations in predict() I'm recommending are not in the Scala version and probably should be! I'll make a JIRA...)
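
To make the review points above concrete, here is a minimal pure-Python sketch of a multiclass `predict()` that keeps its helper state private and sets it up once in `__init__`. `TinyMulticlassModel` is a hypothetical class written for this note. The pivot-class layout (class 0 has an implicit margin of 0, and the flat weight vector holds one row per remaining class, optionally ending in a bias term) is an assumption of the sketch, not a quote from the PR.

```python
class TinyMulticlassModel:
    """Hypothetical multiclass linear model (illustration only, not pyspark code)."""

    def __init__(self, weights, num_features, num_classes):
        if num_classes < 3:
            raise ValueError("this sketch only covers the multiclass case")
        flat = [float(w) for w in weights]
        rows = num_classes - 1  # class 0 is the pivot and has no weight row
        if len(flat) % rows != 0:
            raise ValueError("weights length must be a multiple of num_classes - 1")
        size = len(flat) // rows
        if size not in (num_features, num_features + 1):
            raise ValueError("each row needs num_features weights, plus an optional bias")
        self._num_features = num_features
        self._num_classes = num_classes
        # Private helpers, computed once here rather than on every predict() call.
        self._data_with_bias_size = size
        self._weights_matrix = [flat[i * size:(i + 1) * size] for i in range(rows)]

    @property
    def numFeatures(self):
        return self._num_features

    @property
    def numClasses(self):
        return self._num_classes

    def predict(self, x):
        if len(x) != self._num_features:
            raise ValueError("expected %d features" % self._num_features)
        has_bias = self._data_with_bias_size == self._num_features + 1
        best_class, max_margin = 0, 0.0  # the pivot class wins unless a margin is positive
        for i, row in enumerate(self._weights_matrix):
            # zip stops at len(x), so a trailing bias term is not multiplied in.
            margin = sum(w * v for w, v in zip(row, x))
            if has_bias:
                margin += row[-1]
            if margin > max_margin:
                max_margin, best_class = margin, i + 1
        return best_class


# Two features, three classes, one bias per row: rows for class 1 and class 2.
model = TinyMulticlassModel([1.0, 0.0, -0.5, 0.0, 1.0, -0.5], num_features=2, num_classes=3)
print(model.predict([0.0, 0.0]), model.predict([2.0, 0.0]), model.predict([0.0, 3.0]))  # 0 1 2
```

Slicing the flat weights into rows in the constructor is the kind of work that can be hoisted out of the prediction loop. The public surface stays limited to `numFeatures`, `numClasses`, and `predict()`.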

@SparkQA

SparkQA commented Mar 29, 2015

Test build #29359 has started for PR 5137 at commit 444d5e2.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Mar 29, 2015

Test build #29359 has finished for PR 5137 at commit 444d5e2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LinearClassificationModel(LinearModel):
    • class LogisticRegressionModel(LinearClassificationModel):
    • class SVMModel(LinearClassificationModel):

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29359/

     """
-    Represents a linear binary classification model that predicts to whether an
-    example is positive (1.0) or negative (0.0).
+    A private abstract class represents a classification model that predicts to
Member

Better phrasing: "A private abstract class represents a classification model that predicts to which of a set of categories an example belongs."
-->
"A private abstract class representing a multiclass classification model."

@jkbradley
Member

Ok, just a few small comments related to that last change, so almost ready. Thanks!

@SparkQA

SparkQA commented Mar 31, 2015

Test build #29448 has started for PR 5137 at commit 0bd531e.

@SparkQA

SparkQA commented Mar 31, 2015

Test build #29448 has finished for PR 5137 at commit 0bd531e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class LinearClassificationModel(LinearModel):
    • class LogisticRegressionModel(LinearClassificationModel):
    • class SVMModel(LinearClassificationModel):
  • This patch does not change any dependencies.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/29448/

@yanboliang
Contributor Author

@jkbradley Thank you for your comments.

@jkbradley
Member

@yanboliang LGTM Thanks for the PR! Merging into master
