
[SPARK-19953][ML] Random Forest Models use parent UID when being fit #17296

Conversation

BryanCutler
Member

@BryanCutler BryanCutler commented Mar 14, 2017

What changes were proposed in this pull request?

The ML RandomForestClassificationModel and RandomForestRegressionModel were not using the parent estimator's UID when being fit. This change fixes that so the models can be properly identified with their parents.
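For illustration, here is a minimal toy sketch of the pattern being fixed. The ToyEstimator/ToyModel classes below are made up for this example and are not the actual Spark classes; in the PR itself the equivalent change is in the train() methods of the two random forest estimators.

```scala
// Toy illustration of the UID issue -- not the actual Spark classes.
// The key point: fit() should build the model with the estimator's own UID,
// so that model.uid == estimator.uid.
import java.util.UUID

class ToyModel(val uid: String)

class ToyEstimator {
  val uid: String = "toyEstimator_" + UUID.randomUUID().toString.take(8)

  // Buggy pattern: the model generates its own fresh UID, losing the link to its parent.
  def fitBuggy(): ToyModel = new ToyModel("toyModel_" + UUID.randomUUID().toString.take(8))

  // Fixed pattern: the estimator's UID is passed through to the model it produces.
  def fitFixed(): ToyModel = new ToyModel(uid)
}

object UidDemo extends App {
  val est = new ToyEstimator
  assert(est.fitFixed().uid == est.uid)   // holds after this fix
  assert(est.fitBuggy().uid != est.uid)   // the behaviour this PR removes
}
```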

How was this patch tested?

Existing tests. Added a check to verify that the model UID matches that of its parent, then renamed checkCopy to checkCopyAndUids and verified that it is called by one test for each ML algorithm.

@SparkQA

SparkQA commented Mar 14, 2017

Test build #74557 has finished for PR 17296 at commit a801ef9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

ping @jkbradley @MLnick

@MLnick
Contributor

MLnick commented Mar 15, 2017

Seems fine - are there other instances of this happening?

I'm wondering why test cases did not pick this up... seems like we should have a standard test case for it?

@BryanCutler
Member Author

Thanks @MLnick! I checked and didn't see this happening anywhere else. It's awkward to put into a standalone test case because it requires training to get a model. It could be tacked onto an existing test, but I'm not sure it's really worth it. The Python tests will eventually cover this after SPARK-10931, which is where I discovered it.

@MLnick
Contributor

MLnick commented Mar 16, 2017

Hmm, I would prefer to test it, though I do get that it's pretty tricky to do generically. I don't think training a tiny model on a couple of data points will add too much overhead.

@BryanCutler
Member Author

BryanCutler commented Mar 16, 2017

@MLnick, I found an existing MLTestingUtils.checkCopy that is used to check that the copied model UIDs match, and it can easily be extended to include the check needed here. I went through and added these checks to any ML suite that wasn't already using it, but that led to another issue that I felt should be covered in a separate PR at #17326. Can you take a look at that first and merge it if it looks OK? Then I'll update this and push the regression tests I made. Thanks!

@BryanCutler
Member Author

cc @MLnick @jkbradley, I updated with the latest changes and added a check that the model UID matches its parent's (roughly sketched below). I don't think it's great that this check is tacked onto various other tests, because that makes it easy to forget when adding new algorithms. Hopefully this is good enough for now to get the fix in, and I can follow up with another JIRA to refactor basic checks like this so they are more consistent.
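A rough sketch of the kind of check being described, simplified rather than the exact MLTestingUtils implementation (the real helper also compares the copied model's params):

```scala
// Simplified sketch of a checkCopyAndUids-style helper -- only the UID/parent
// assertions discussed in this PR, not the full MLTestingUtils logic.
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap

object UidChecks {
  def checkCopyAndUids[M <: Model[M]](estimator: Estimator[M], model: M): Unit = {
    // The fitted model should carry its parent estimator's UID ...
    assert(model.uid == estimator.uid, "Model UID does not match the parent estimator's UID")
    // ... and copying the model should preserve both the UID and the parent.
    val copied = model.copy(ParamMap.empty)
    assert(copied.uid == model.uid, "copy() changed the model UID")
    assert(copied.parent == model.parent, "copy() changed the model's parent")
  }
}
```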

@SparkQA

SparkQA commented Apr 3, 2017

Test build #75488 has finished for PR 17296 at commit dd1e3bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Apr 5, 2017

High-level seems good now, though there are new conflicts in FPGrowthSuite that need to be resolved.

Did you create a JIRA to track the broader issue of trying to make the testing more generic?

Or at least, we could perhaps try to "enforce" the tests through a test trait (e.g. EstimatorModelTest) with a test that takes generated data, fits, and performs the check. The trait could define an abstract generateData method; each concrete test suite could then implement the data generator - most have some form of data generator method already anyway.

Of course we would still need to ensure new tests implement the trait, but at least if all existing tests are adapted in this way it provides the blueprint going forward.

The only other way I can think of would be some reflection-based approach (but the correct form of dataset would need to be generated for each estimator...).
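A hypothetical sketch of that trait idea - EstimatorModelTest, generateData, and estimatorToTest are names from the suggestion above, not existing Spark test code, and it assumes a ScalaTest FunSuite-based suite mixes the trait in:

```scala
// Hypothetical sketch of the suggested trait -- not existing Spark test code.
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.sql.DataFrame
import org.scalatest.FunSuite

trait EstimatorModelTest[M <: Model[M]] { self: FunSuite =>
  // Each concrete suite supplies a tiny dataset and the estimator under test.
  def generateData(): DataFrame
  def estimatorToTest: Estimator[M]

  test("fitted model uses parent estimator UID") {
    val model = estimatorToTest.fit(generateData())
    assert(model.uid == estimatorToTest.uid)
    assert(model.parent == estimatorToTest)
  }
}
```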

@SparkQA

SparkQA commented Apr 5, 2017

Test build #75551 has finished for PR 17296 at commit 7c7ce13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

@MLnick this should be good to go. I filed https://issues.apache.org/jira/browse/SPARK-20234 to track making these basic checks more consistent.

@MLnick
Contributor

MLnick commented Apr 6, 2017

LGTM, merged to master. Thanks for creating the follow-up JIRA.

@asfgit closed this in e156b5d on Apr 6, 2017