Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-31681][ML][PySpark] Python multiclass logistic regression evaluate should return LogisticRegressionSummary #28503

Closed
wants to merge 2 commits into from

Conversation

huaxingao
Copy link
Contributor

What changes were proposed in this pull request?

Return LogisticRegressionSummary for multiclass logistic regression evaluate in PySpark

Why are the changes needed?

Currently we have

    @since("2.0.0")
    def evaluate(self, dataset):
        if not isinstance(dataset, DataFrame):
            raise ValueError("dataset must be a DataFrame but got %s." % type(dataset))
        java_blr_summary = self._call_java("evaluate", dataset)
        return BinaryLogisticRegressionSummary(java_blr_summary)

we should return LogisticRegressionSummary for multiclass logistic regression

Does this PR introduce any user-facing change?

Yes
return LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python

How was this patch tested?

unit test

…uate should return LogisticRegressionSummary
@SparkQA
Copy link

SparkQA commented May 11, 2020

Test build #122520 has finished for PR 28503 at commit 5f7c160.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

@zhengruifeng zhengruifeng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, it will make the py side like the scala side:

  @Since("2.0.0")
  def evaluate(dataset: Dataset[_]): LogisticRegressionSummary = {
    // Handle possible missing or invalid prediction columns
    val (summaryModel, probabilityColName, predictionColName) = findSummaryModel()
    if (numClasses > 2) {
      new LogisticRegressionSummaryImpl(summaryModel.transform(dataset),
        probabilityColName, predictionColName, $(labelCol), $(featuresCol))
    } else {
      new BinaryLogisticRegressionSummaryImpl(summaryModel.transform(dataset),
        probabilityColName, predictionColName, $(labelCol), $(featuresCol))
    }
  }

@srowen
Copy link
Member

srowen commented May 13, 2020

It makes sense. It's a minor behavior change, even if it's arguably a bug fix. I'm trying to judge: should this go into 3.0, as a less surprising point to introduce that, instead of 3.1?

@huaxingao
Copy link
Contributor Author

I think it's better for this to go into 3.0.

@srowen
Copy link
Member

srowen commented May 14, 2020

Am I right that currently, this would fail if you fit a multi-class model and then tried to call one of the methods on BinaryLogisticRegressionSummary, like roc()? because the Scala impl on the other side would not be a BinaryLogisticRegressionSummary and would not have that method?

If so that's almost not a behavior change, as nothing that this change 'takes away' would have worked.

@huaxingao
Copy link
Contributor Author

You are right. It would fail with Exception

py4j.Py4JException: Method roc([]) does not exist

@srowen
Copy link
Member

srowen commented May 14, 2020

OK I added release notes to the JIRA just for completeness. I don't think a migration guide element is needed, as no code that used the methods on BinaryLogisticRegressionSummary would have worked before anyway.

@srowen srowen closed this in e10516a May 14, 2020
srowen pushed a commit that referenced this pull request May 14, 2020
…uate should return LogisticRegressionSummary

### What changes were proposed in this pull request?
Return LogisticRegressionSummary for multiclass logistic regression evaluate in PySpark

### Why are the changes needed?
Currently we have
```
    since("2.0.0")
    def evaluate(self, dataset):
        if not isinstance(dataset, DataFrame):
            raise ValueError("dataset must be a DataFrame but got %s." % type(dataset))
        java_blr_summary = self._call_java("evaluate", dataset)
        return BinaryLogisticRegressionSummary(java_blr_summary)
```
we should return LogisticRegressionSummary for multiclass logistic regression

### Does this PR introduce _any_ user-facing change?
Yes
return LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python

### How was this patch tested?
unit test

Closes #28503 from huaxingao/lr_summary.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit e10516a)
Signed-off-by: Sean Owen <srowen@gmail.com>
@srowen
Copy link
Member

srowen commented May 14, 2020

Merged to master/3.0

@huaxingao
Copy link
Contributor Author

Thanks! @srowen @zhengruifeng

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants