[SPARK-31681][ML][PySpark] Python multiclass logistic regression evaluate should return LogisticRegressionSummary #28503

huaxingao · 2020-05-11T22:17:38Z

What changes were proposed in this pull request?

Return a LogisticRegressionSummary from evaluate for multiclass logistic regression models in PySpark.

Why are the changes needed?

Currently we have:

    @since("2.0.0")
    def evaluate(self, dataset):
        if not isinstance(dataset, DataFrame):
            raise ValueError("dataset must be a DataFrame but got %s." % type(dataset))
        java_blr_summary = self._call_java("evaluate", dataset)
        return BinaryLogisticRegressionSummary(java_blr_summary)

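This always wraps the result in BinaryLogisticRegressionSummary, regardless of the number of classes. We should return LogisticRegressionSummary for multiclass logistic regression.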
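A minimal sketch of the intended dispatch, assuming the Python side branches on numClasses the same way the Scala side does. The classes below are simplified stand-ins, not the real pyspark.ml wrappers; SketchModel and its placeholder _call_java are hypothetical and exist only so the example runs without Spark.

```python
# Stand-in classes: hypothetical simplifications of the PySpark wrappers,
# used only to illustrate the dispatch. They are not the real pyspark.ml classes.
class LogisticRegressionSummary:
    def __init__(self, java_obj):
        self._java_obj = java_obj


class BinaryLogisticRegressionSummary(LogisticRegressionSummary):
    pass


class SketchModel:
    """Hypothetical model holding only what the dispatch needs."""

    def __init__(self, num_classes):
        self.numClasses = num_classes

    def _call_java(self, name, dataset):
        # Placeholder for the Py4J call; returns an opaque token here.
        return (name, dataset)

    def evaluate(self, dataset):
        java_summary = self._call_java("evaluate", dataset)
        # Binary wrapper only when there are at most two classes.
        if self.numClasses <= 2:
            return BinaryLogisticRegressionSummary(java_summary)
        return LogisticRegressionSummary(java_summary)


binary = SketchModel(num_classes=2).evaluate("df")
multi = SketchModel(num_classes=3).evaluate("df")
print(type(binary).__name__)
print(type(multi).__name__)
```

With this branch, a multiclass model yields the general summary type, so binary-only members are simply absent on the Python wrapper.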

Does this PR introduce any user-facing change?

Yes. evaluate now returns LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python.

How was this patch tested?

Unit test.

SparkQA · 2020-05-11T22:46:15Z

Test build #122520 has finished for PR 28503 at commit 5f7c160.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng

LGTM. It will make the Python side match the Scala side:

  @Since("2.0.0")
  def evaluate(dataset: Dataset[_]): LogisticRegressionSummary = {
    // Handle possible missing or invalid prediction columns
    val (summaryModel, probabilityColName, predictionColName) = findSummaryModel()
    if (numClasses > 2) {
      new LogisticRegressionSummaryImpl(summaryModel.transform(dataset),
        probabilityColName, predictionColName, $(labelCol), $(featuresCol))
    } else {
      new BinaryLogisticRegressionSummaryImpl(summaryModel.transform(dataset),
        probabilityColName, predictionColName, $(labelCol), $(featuresCol))
    }
  }

srowen · 2020-05-13T17:39:10Z

It makes sense. It's a minor behavior change, even if it's arguably a bug fix. I'm trying to judge: should this go into 3.0, as a less surprising point to introduce it, instead of 3.1?

huaxingao · 2020-05-13T17:56:37Z

I think it's better for this to go into 3.0.

srowen · 2020-05-14T14:23:03Z

Am I right that currently, this would fail if you fit a multiclass model and then tried to call one of the methods on BinaryLogisticRegressionSummary, like roc()? That is, the Scala implementation on the other side would not be a BinaryLogisticRegressionSummary and would not have that method?

If so that's almost not a behavior change, as nothing that this change 'takes away' would have worked.

huaxingao · 2020-05-14T15:02:07Z

You are right. It would fail with this exception:

    py4j.Py4JException: Method roc([]) does not exist
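To illustrate why that call fails, here is a rough simulation that needs neither Spark nor Py4J. Everything in it is a hypothetical stand-in: the "Java" classes are plain Python objects, BinaryWrapper mimics a Python wrapper that forwards calls by name, and AttributeError plays the role of Py4JException.

```python
class JavaMulticlassSummary:
    """Stand-in for the JVM-side multiclass summary: it has no roc method."""

    def accuracy(self):
        return 0.9  # arbitrary illustrative value


class JavaBinarySummary(JavaMulticlassSummary):
    """Stand-in for the JVM-side binary summary, which does have roc."""

    def roc(self):
        return [(0.0, 0.0), (1.0, 1.0)]  # arbitrary illustrative curve


class BinaryWrapper:
    """Hypothetical Python-side wrapper that forwards calls by name."""

    def __init__(self, java_obj):
        self._java_obj = java_obj

    def _call_java(self, name):
        # The method is resolved on the wrapped object at call time.
        # getattr raises AttributeError here, standing in for Py4JException.
        return getattr(self._java_obj, name)()

    def roc(self):
        return self._call_java("roc")


ok = BinaryWrapper(JavaBinarySummary()).roc()
try:
    BinaryWrapper(JavaMulticlassSummary()).roc()
    failed = False
except AttributeError:
    failed = True
print(ok, failed)
```

The wrapper type promises roc, but the wrapped object decides whether the call succeeds, which is why the old return type never gave multiclass users a working binary method.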

srowen · 2020-05-14T15:54:06Z

OK, I added release notes to the JIRA just for completeness. I don't think a migration guide entry is needed, as no code that used the methods on BinaryLogisticRegressionSummary would have worked before anyway.

…uate should return LogisticRegressionSummary

### What changes were proposed in this pull request?
Return LogisticRegressionSummary for multiclass logistic regression evaluate in PySpark

### Why are the changes needed?
Currently we have

    @since("2.0.0")
    def evaluate(self, dataset):
        if not isinstance(dataset, DataFrame):
            raise ValueError("dataset must be a DataFrame but got %s." % type(dataset))
        java_blr_summary = self._call_java("evaluate", dataset)
        return BinaryLogisticRegressionSummary(java_blr_summary)

we should return LogisticRegressionSummary for multiclass logistic regression

### Does this PR introduce _any_ user-facing change?
Yes
return LogisticRegressionSummary instead of BinaryLogisticRegressionSummary for multiclass logistic regression in Python

### How was this patch tested?
unit test

Closes #28503 from huaxingao/lr_summary.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit e10516a)
Signed-off-by: Sean Owen <srowen@gmail.com>

srowen · 2020-05-14T15:55:06Z

Merged to master/3.0

huaxingao · 2020-05-14T15:59:14Z

Thanks! @srowen @zhengruifeng

[SPARK-31681][ML][PySpark] Python multiclass logistic regression eval…

e5c01ef

…uate should return LogisticRegressionSummary

probot-autolabeler bot added ML PYTHON labels May 11, 2020

nit

5f7c160

zhengruifeng approved these changes May 13, 2020

srowen closed this in e10516a May 14, 2020

huaxingao deleted the lr_summary branch May 14, 2020 15:59

zero323 mentioned this pull request Jul 18, 2020

[SPARK-31681] Python multiclass logistic regression evaluate should return LogisticRegressionSummary zero323/pyspark-stubs#427

Closed
