[SPARK-28514][ML] Remove the redundant transformImpl method in RF & GBT #25256

zhengruifeng · 2019-07-25T10:43:52Z

What changes were proposed in this pull request?

Remove the redundant and confusing transformImpl method in RF & GBT.
In GBTClassifier & RandomForestClassifier, the real transform methods are inherited from ProbabilisticClassificationModel, which can deal with multiple output columns.
The transformImpl method, which deals with only one column (predictionCol), is never used there and so does nothing. This is quite confusing.

How was this patch tested?

existing suites

SparkQA · 2019-07-25T11:50:52Z

Test build #108167 has finished for PR 25256 at commit 41c0f63.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

xuanyuanking · 2019-07-27T00:42:51Z

Seems like a partial revert of #6300, cc @BryanCutler for a further review.

BryanCutler · 2019-07-30T18:41:01Z

Yes, from what I can remember the point of these methods was to broadcast the model. It's been a while since I looked at this and it has gotten a little confusing over time. I'm not sure if this is still needed or can be removed cc @mengxr @WeichenXu123

srowen · 2019-07-31T14:48:38Z

I think I'd leave this, as it's on purpose and probably for performance reasons. I wonder if we can just always broadcast the model here? What's the downside? The model is already serialized in the closure by default, so it should serialize. There's overhead to broadcasting a tiny model I guess, but maybe that's fine.

BryanCutler · 2019-07-31T18:30:03Z

> I wonder if we can just always broadcast the model here?

This sounds reasonable to me and would make the code easier to follow

zhengruifeng · 2019-08-01T10:08:20Z

@BryanCutler @srowen I am neutral on model broadcasting. I notice that there are three approaches for broadcastable/small models to perform transformation:
1. directly serialize the model in the closure (most cases);
2. broadcast the model in the transform method every time (like Word2Vec/GBTRegressor);
3. broadcast the model if it has not been broadcast yet, so that the broadcast model can be reused across calls (like CountVectorizer).
If model broadcasting is better, can we apply it to all algorithms?

As for this PR, if broadcasting can improve performance, I am OK with leaving GBTRegressor & RandomForestRegressor as they are.
However, the transformImpl methods in GBTClassifier & RandomForestClassifier are never used, so I tend to remove them.

srowen · 2019-08-01T15:31:45Z

Is it really not used in GBTClassifier, for example? It overrides a method in Predictor, and that is still called in transform.

zhengruifeng · 2019-08-02T08:31:27Z

@srowen The transform method defined in PredictionModel (which requires overriding transformImpl) is overridden by the one defined in ProbabilisticClassificationModel (which requires overriding predict/predictRaw/...).
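
The override chain described in this comment can be illustrated with a minimal sketch. It uses simplified stand-in classes over plain `Seq[Double]` input, not the real Spark classes or signatures; every class name and the `implCalls` counter are hypothetical and exist only for illustration.

```scala
// Stand-in for a base model whose transform delegates to transformImpl.
abstract class PredictionModelSketch {
  def transform(input: Seq[Double]): Seq[String] = transformImpl(input)

  // Handles a single output (the prediction) only.
  protected def transformImpl(input: Seq[Double]): Seq[String] =
    input.map(x => s"prediction=${predict(x)}")

  def predict(x: Double): Double
}

// Stand-in for a probabilistic model: it replaces transform with a version
// that produces several outputs and never calls transformImpl.
abstract class ProbabilisticModelSketch extends PredictionModelSketch {
  override def transform(input: Seq[Double]): Seq[String] =
    input.map(x => s"raw=${predictRaw(x)},prediction=${predict(x)}")

  def predictRaw(x: Double): Double
}

// A leaf classifier that still overrides transformImpl. The counter shows
// that this override is never reached through transform.
class ForestClassifierSketch extends ProbabilisticModelSketch {
  var implCalls: Int = 0

  override protected def transformImpl(input: Seq[Double]): Seq[String] = {
    implCalls += 1
    super.transformImpl(input)
  }

  def predictRaw(x: Double): Double = x - 0.5
  def predict(x: Double): Double = if (predictRaw(x) > 0) 1.0 else 0.0
}
```

In this sketch, calling `transform` on a `ForestClassifierSketch` leaves `implCalls` at zero, which is the sense in which the leaf-level `transformImpl` is dead code.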

SparkQA · 2019-08-02T09:47:21Z

Test build #108562 has finished for PR 25256 at commit 41ca50d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-08-02T15:45:56Z

I see. I wonder if we should be extra safe and override transformImpl in ProbabilisticClassificationModel to throw an error, and mark it final, to ensure that subclasses don't inherit it. On the other hand... is it possible a future subclass would further override transform and then have a use for the base class transformImpl?

I'm OK with the current approach or further restricting inheritance of transformImpl. Up to your taste.

zhengruifeng · 2019-08-04T03:27:33Z

Good idea. It is reasonable to restrict transformImpl in ProbabilisticClassificationModel.

SparkQA · 2019-08-04T03:55:42Z

Test build #108618 has finished for PR 25256 at commit 6c61f60.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-08-04T22:46:35Z

mllib/src/main/scala/org/apache/spark/ml/classification/Classifier.scala

@@ -210,6 +210,9 @@ abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[Featur
    outputData.toDF
  }

+  final override private def transformImpl(dataset: Dataset[_]): DataFrame =


Oh, this just can't be private
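
For illustration, here is a minimal sketch of the restriction under discussion. It assumes simplified stand-in classes with a `protected` base method; the class names and the `callImplDirectly` helper are hypothetical, and this is not the code that was merged. Scala does not allow an override to narrow a member's access, so a `private` override of a `protected` method fails to compile, whereas `final override protected` compiles and blocks further overriding.

```scala
// Stand-in base class: transform delegates to a protected transformImpl.
abstract class BaseModelSketch {
  def transform(xs: Seq[Double]): Seq[Double] = transformImpl(xs)
  protected def transformImpl(xs: Seq[Double]): Seq[Double]
}

// Stand-in classification model: transform no longer uses transformImpl,
// so transformImpl is sealed off and fails loudly if it is ever reached.
abstract class ClassificationModelSketch extends BaseModelSketch {
  override def transform(xs: Seq[Double]): Seq[Double] = xs.map(predict)

  // Writing `final override private def transformImpl ...` here would be
  // rejected by the compiler because it weakens the inherited access level.
  final override protected def transformImpl(xs: Seq[Double]): Seq[Double] =
    throw new UnsupportedOperationException(
      s"transformImpl is not supported in ${getClass.getName}")

  def predict(x: Double): Double
}

class ThresholdModelSketch extends ClassificationModelSketch {
  def predict(x: Double): Double = if (x > 0.5) 1.0 else 0.0

  // Hypothetical helper that exposes the sealed method for demonstration.
  def callImplDirectly(xs: Seq[Double]): Seq[Double] = transformImpl(xs)
}
```

With this shape, `transform` keeps working through `predict`, while any direct call to `transformImpl` throws and no subclass can reintroduce its own version.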

SparkQA · 2019-08-05T02:39:36Z

Test build #108636 has finished for PR 25256 at commit f6a10d0.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-08-06T20:12:54Z

Merged to master

zhengruifeng added 2 commits July 25, 2019 18:07

init

db9de7c

init

41c0f63

dongjoon-hyun added the ML label Jul 25, 2019

revert the regression side

41ca50d

zhengruifeng added 2 commits August 4, 2019 11:41

disable transformImpl in classification

0136a83

mark private

6c61f60

srowen reviewed Aug 4, 2019

not private

f6a10d0

srowen approved these changes Aug 5, 2019

srowen closed this in c17fa13 Aug 6, 2019

zhengruifeng deleted the del_ensamble_transformImpl branch August 7, 2019 00:41
