New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-31681][ML][PySpark] Python multiclass logistic regression evaluate should return LogisticRegressionSummary #28503
Conversation
…uate should return LogisticRegressionSummary
Test build #122520 has finished for PR 28503 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, it will make the py side like the scala side:
@Since("2.0.0")
def evaluate(dataset: Dataset[_]): LogisticRegressionSummary = {
// Handle possible missing or invalid prediction columns
val (summaryModel, probabilityColName, predictionColName) = findSummaryModel()
if (numClasses > 2) {
new LogisticRegressionSummaryImpl(summaryModel.transform(dataset),
probabilityColName, predictionColName, $(labelCol), $(featuresCol))
} else {
new BinaryLogisticRegressionSummaryImpl(summaryModel.transform(dataset),
probabilityColName, predictionColName, $(labelCol), $(featuresCol))
}
}
It makes sense. It's a minor behavior change, even if it's arguably a bug fix. I'm trying to judge: should this go into 3.0, as a less surprising point to introduce that, instead of 3.1? |
I think it's better for this to go into 3.0. |
Am I right that currently, this would fail if you fit a multi-class model and then tried to call one of the methods on BinaryLogisticRegressionSummary, like roc()? because the Scala impl on the other side would not be a BinaryLogisticRegressionSummary and would not have that method? If so that's almost not a behavior change, as nothing that this change 'takes away' would have worked. |
You are right. It would fail with Exception
|
OK I added release notes to the JIRA just for completeness. I don't think a migration guide element is needed, as no code that used the methods on BinaryLogisticRegressionSummary would have worked before anyway. |
…uate should return LogisticRegressionSummary ### What changes were proposed in this pull request? Return LogisticRegressionSummary for multiclass logistic regression evaluate in PySpark ### Why are the changes needed? Currently we have ``` since("2.0.0") def evaluate(self, dataset): if not isinstance(dataset, DataFrame): raise ValueError("dataset must be a DataFrame but got %s." % type(dataset)) java_blr_summary = self._call_java("evaluate", dataset) return BinaryLogisticRegressionSummary(java_blr_summary) ``` we should return LogisticRegressionSummary for multiclass logistic regression ### Does this PR introduce _any_ user-facing change? Yes return LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python ### How was this patch tested? unit test Closes #28503 from huaxingao/lr_summary. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com> (cherry picked from commit e10516a) Signed-off-by: Sean Owen <srowen@gmail.com>
Merged to master/3.0 |
Thanks! @srowen @zhengruifeng |
What changes were proposed in this pull request?
Return LogisticRegressionSummary for multiclass logistic regression evaluate in PySpark
Why are the changes needed?
Currently we have
we should return LogisticRegressionSummary for multiclass logistic regression
Does this PR introduce any user-facing change?
Yes
return LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python
How was this patch tested?
unit test