[SPARK-6255] [MLLIB] Support multiclass classification in Python API #5137
Test build #28998 has started for PR 5137 at commit
Test build #28998 has finished for PR 5137 at commit
Test PASSed.
Test build #29091 has started for PR 5137 at commit
Test build #29091 has finished for PR 5137 at commit
Test PASSed.
For multiclass classification, I ran the iris dataset, which has 3 classes, and it works as expected.
>>> import numpy
>>> from pyspark.mllib.classification import LogisticRegressionWithLBFGS
>>> from pyspark.mllib.util import MLUtils
>>> data = MLUtils.loadLibSVMFile(sc, "/Users/ybliang/Data/tmp/iris.scale")
>>> model = LogisticRegressionWithLBFGS.train(data=data, numClasses=3)
>>> model.predict(numpy.array([-0.83, 0.16, -0.86, -0.83]))
1
>>> model.predict(numpy.array([0.11, -0.58, 0.33, 0.17]))
2
>>> model.predict(numpy.array([-0.11, -0.16, 0.39, 0.4]))
0
I would like to know whether it is appropriate to include a real dataset in our doctests.
@@ -31,13 +31,13 @@
     'SVMModel', 'SVMWithSGD', 'NaiveBayesModel', 'NaiveBayes']


-class LinearBinaryClassificationModel(LinearModel):
+class LinearClassificationModel(LinearModel):
     """
     Represents a linear binary classification model that predicts to whether an
     example is positive (1.0) or negative (0.0).
Update this doc please. Also, can you please add a note that this is a private abstract class? (That's not because of your PR, but just because we should make that clear here.)
@yanboliang That's great to test on real datasets, but we should not include them. It's better to generate data or use tiny datasets, as in existing doc tests.
I've added my main comments above. My only other one is about adding doc tests, but it sounds like you're working on that. Thanks!
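One way to follow this suggestion is to hard-code a handful of points per class instead of loading a file. The sketch below is pure Python and only builds the data; the helper name `tiny_multiclass_data` and the chosen coordinates are illustrative assumptions, not part of the PR.

```python
def tiny_multiclass_data():
    """Return a small, deterministic 3-class dataset as (label, features) pairs.

    Hypothetical helper for illustration only: each class sits around its
    own well-separated center so that a linear model can tell them apart.
    """
    centers = {
        0.0: (0.0, 0.0),
        1.0: (5.0, 0.0),
        2.0: (0.0, 5.0),
    }
    # Two points per class, placed symmetrically around the class center.
    offsets = [(-0.5, -0.5), (0.5, 0.5)]
    data = []
    for label, (cx, cy) in sorted(centers.items()):
        for dx, dy in offsets:
            data.append((label, [cx + dx, cy + dy]))
    return data


points = tiny_multiclass_data()
```

In a doctest, each pair would presumably be wrapped in a LabeledPoint, distributed with sc.parallelize, and passed to LogisticRegressionWithLBFGS.train with numClasses=3, mirroring the iris session earlier in this thread.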
Test build #29298 has started for PR 5137 at commit
Test build #29298 has finished for PR 5137 at commit
Test PASSed.
        return self._numClasses

    @property
    def dataWithBiasSize(self):
I don't think we want to expose dataWithBiasSize or weightsMatrix. Can you please define them in __init__ and add underscores in front of their names to make it clear they are private?
Thanks for the updates! I believe those are my last comments. (Also, I realized that some of the loop optimizations in predict() I'm recommending are not in the Scala version and probably should be! I'll make a JIRA...)
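For readers following the review, the multiclass predict() being discussed can be sketched in plain Python. This is a simplified illustration, not the PR's code: it assumes the model holds a flat weight list covering numClasses - 1 classes, with class 0 acting as a pivot whose margin is 0, and the names `MulticlassSketch`, `_weights_matrix`, and `_data_with_bias_size` are hypothetical stand-ins for the private fields requested in the comment.

```python
class MulticlassSketch(object):
    """Hypothetical stand-in for a multinomial linear model (not the PR's class)."""

    def __init__(self, weights, num_features, num_classes):
        self._num_classes = num_classes
        # Private fields computed once in __init__ instead of public properties.
        size = len(weights) // (num_classes - 1)
        self._data_with_bias_size = size
        self._weights_matrix = [
            weights[i * size:(i + 1) * size] for i in range(num_classes - 1)
        ]
        # Assumption: each row holds num_features weights, optionally
        # followed by one bias term in the last position.
        self._has_bias = size == num_features + 1

    def predict(self, x):
        # Class 0 is the pivot: it wins unless another class has margin > 0.
        best_class, max_margin = 0, 0.0
        for i, row in enumerate(self._weights_matrix):
            # zip stops at len(x), so a trailing bias entry is skipped here.
            margin = sum(w * v for w, v in zip(row, x))
            if self._has_bias:
                margin += row[-1]
            if margin > max_margin:
                best_class, max_margin = i + 1, margin
        return best_class


# Two weight rows (classes 1 and 2), each with two feature weights and a bias.
model = MulticlassSketch([1.0, 0.0, -1.0, 0.0, 1.0, -1.0],
                         num_features=2, num_classes=3)
```

Deciding whether a bias term is present once in the constructor, rather than on every call, is the kind of per-prediction saving the review comments seem to be aiming at.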
Test build #29359 has started for PR 5137 at commit
Test build #29359 has finished for PR 5137 at commit
Test PASSed.
     """
-    Represents a linear binary classification model that predicts to whether an
-    example is positive (1.0) or negative (0.0).
+    A private abstract class represents a classification model that predicts to
Better phrasing: "A private abstract class represents a classification model that predicts to which of a set of categories an example belongs."
-->
"A private abstract class representing a multiclass classification model."
OK, just a few small comments related to that last change, so almost ready. Thanks!
Test build #29448 has started for PR 5137 at commit
Test build #29448 has finished for PR 5137 at commit
Test PASSed.
@jkbradley Thank you for your comments.
@yanboliang LGTM. Thanks for the PR! Merging into master.
Python API parity check for classification and multiclass classification support, major disparities need to be added for Python:
For users, the greatest benefit of this PR is that the Python API now supports multiclass classification.
Users can train a multiclass classification model and use it for prediction in PySpark.