Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-28514][ML] Remove the redundant transformImpl method in RF & GBT #25256

Closed

Conversation

zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Jul 25, 2019

What changes were proposed in this pull request?

Remove the redundant and confusing transformImpl method in RF & GBT;
In GBTClassifier & RandomForestClassifier, the real transform methods inherit from ProbabilisticClassificationModel which can deal with multi output columns.
The transformImpl method, which deals with only one column - predictionCol, completely does nothing. This is quite confusing.

How was this patch tested?

existing suites

@SparkQA
Copy link

SparkQA commented Jul 25, 2019

Test build #108167 has finished for PR 25256 at commit 41c0f63.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@xuanyuanking
Copy link
Member

Seem like a partial revert of #6300, cc @BryanCutler for a further review.

@BryanCutler
Copy link
Member

Yes, from what I can remember the point of these methods was to broadcast the model. It's been a while since I looked at this and it has gotten a little confusing over time. I'm not sure if this is still needed or can be removed cc @mengxr @WeichenXu123

@srowen
Copy link
Member

srowen commented Jul 31, 2019

I think I'd leave this, as it's on purpose and probably for performance reasons. I wonder if we can just always broadcast the model here? What's the downside? the model is already by default serialized in the closure, so it should serialize. There's overhead to broadcasting a tiny model I guess, but maybe that's fine.

@BryanCutler
Copy link
Member

I wonder if we can just always broadcast the model here?

This sounds reasonable to me and would make the code easier to follow

@zhengruifeng
Copy link
Contributor Author

@BryanCutler @srowen I am neutral on model broadcasting, I notice that there are three approachs for broadcastable/small models to performance transformation:
1, directly serialize the model in the closure (the most cases);
2, broadcast the model in the transform method every time (like Word2Vec/GBTRegressor);
3, broadcast the model if it is not broadcasted yet, the the broadcasted model can be reused among calls (like CountVectorizer);
If the model broadcasting is better, can we apply it for all algs?

As to this pr, if it can improve performance, I am OK to leave GBTRegressor & RandomForestRegressor;
However, the transformImpl methods in GBTClassifier & RandomForestClassifier are never used, so I tend to remove them.

@srowen
Copy link
Member

srowen commented Aug 1, 2019

Is it really not used in GBTClassifier for example? it overrides a method in Predictor and that is called in transform, still.

@zhengruifeng
Copy link
Contributor Author

@srowen The transform method (which need override transformImpl) defined in PredictionModel is overrided by the the one (which need override predict/predictRaw/...) define in ProbabilisticClassificationModel.

图片

@SparkQA
Copy link

SparkQA commented Aug 2, 2019

Test build #108562 has finished for PR 25256 at commit 41ca50d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 2, 2019

I see. I wonder if we should be extra safe and override transformImpl in ProbabilisticClassificationModel to throw an error, and mark it final, to ensure that subclasses don't inherit it. On the other hand... is it possible a future subclass would further override transform and then have a use for the base class transformImpl?

I'm OK with the current approach or further restricting inheritance of transformImpl. Up to your taste.

@zhengruifeng
Copy link
Contributor Author

good idea, It is reasonable to restrict the transformImpl in ProbabilisticClassificationModel

@SparkQA
Copy link

SparkQA commented Aug 4, 2019

Test build #108618 has finished for PR 25256 at commit 6c61f60.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -210,6 +210,9 @@ abstract class ClassificationModel[FeaturesType, M <: ClassificationModel[Featur
outputData.toDF
}

final override private def transformImpl(dataset: Dataset[_]): DataFrame =
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this just can't be private

@SparkQA
Copy link

SparkQA commented Aug 5, 2019

Test build #108636 has finished for PR 25256 at commit f6a10d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 6, 2019

Merged to master

@srowen srowen closed this in c17fa13 Aug 6, 2019
@zhengruifeng zhengruifeng deleted the del_ensamble_transformImpl branch August 7, 2019 00:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants