
[SPARK-19953][ML] Random Forest Models use parent UID when being fit #17296

Conversation

BryanCutler
Member

@BryanCutler BryanCutler commented Mar 14, 2017

What changes were proposed in this pull request?

The ML RandomForestClassificationModel and RandomForestRegressionModel were not using the parent estimator's UID when being fit. This change fixes that so the models can be properly identified with their parents.
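For illustration, here is a minimal toy sketch of the pattern being fixed. The ToyEstimator/ToyModel classes below are made up for this example and are not the actual Spark classes; in the PR itself the equivalent change is in the train() methods of the two random forest estimators.

```scala
// Toy illustration of the UID issue -- not the actual Spark classes.
// The key point: fit() should build the model with the estimator's own UID,
// so that model.uid == estimator.uid.
import java.util.UUID

class ToyModel(val uid: String)

class ToyEstimator {
  val uid: String = "toyEstimator_" + UUID.randomUUID().toString.take(8)

  // Buggy pattern: the model generates its own fresh UID, losing the link to its parent.
  def fitBuggy(): ToyModel = new ToyModel("toyModel_" + UUID.randomUUID().toString.take(8))

  // Fixed pattern: the estimator's UID is passed through to the model it produces.
  def fitFixed(): ToyModel = new ToyModel(uid)
}

object UidDemo extends App {
  val est = new ToyEstimator
  assert(est.fitFixed().uid == est.uid)   // holds after this fix
  assert(est.fitBuggy().uid != est.uid)   // the behaviour this PR removes
}
```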

How was this patch tested?

Existing tests. Added a check to verify that the model UID matches that of its parent, then renamed checkCopy to checkCopyAndUids and verified that it is called by one test for each ML algorithm.

@SparkQA

SparkQA commented Mar 14, 2017

Test build #74557 has finished for PR 17296 at commit a801ef9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

ping @jkbradley @MLnick

@MLnick
Contributor

MLnick commented Mar 15, 2017

Seems fine - are there other instances of this happening?

I'm wondering why test cases did not pick this up... seems like we should have a standard test case for it?

@BryanCutler
Member Author

Thanks @MLnick! I checked and didn't see this happening anywhere else. It's awkward to put into a standalone test case because it requires training to get a model. It could be tacked onto an existing test, but I'm not sure it's really worth it. The Python tests will eventually cover this after SPARK-10931, which is where I discovered it.

@MLnick
Contributor

MLnick commented Mar 16, 2017

Hmm, I would prefer to test it, though I do get that it's pretty tricky to do generically. I don't think training a tiny model on a couple of data points will add too much overhead.

@BryanCutler
Member Author

BryanCutler commented Mar 16, 2017

@MLnick, I found an existing MLTestingUtils.checkCopy that is used to check that the copied model UIDs match, and it can easily be extended to include the check needed here. I went through and added these checks to any ML suite that wasn't already using it, but that led to another issue that I felt should be covered in a separate PR at #17326. Can you take a look at that first and merge it if it looks OK? Then I'll update this and push the regression tests I made. Thanks!

@BryanCutler
Member Author

cc @MLnick @jkbradley, I updated with the latest changes and added a check that the model UID matches its parent's (roughly sketched below). I don't think it's great that this check is tacked onto various other tests, because that makes it easy to forget when adding new algorithms. Hopefully this is good enough for now to get the fix in, and I can follow up with another JIRA to refactor basic checks like this so they are more consistent.
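A rough sketch of the kind of check being described, simplified rather than the exact MLTestingUtils implementation (the real helper also compares the copied model's params):

```scala
// Simplified sketch of a checkCopyAndUids-style helper -- only the UID/parent
// assertions discussed in this PR, not the full MLTestingUtils logic.
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.param.ParamMap

object UidChecks {
  def checkCopyAndUids[M <: Model[M]](estimator: Estimator[M], model: M): Unit = {
    // The fitted model should carry its parent estimator's UID ...
    assert(model.uid == estimator.uid, "Model UID does not match the parent estimator's UID")
    // ... and copying the model should preserve both the UID and the parent.
    val copied = model.copy(ParamMap.empty)
    assert(copied.uid == model.uid, "copy() changed the model UID")
    assert(copied.parent == model.parent, "copy() changed the model's parent")
  }
}
```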

@SparkQA

SparkQA commented Apr 3, 2017

Test build #75488 has finished for PR 17296 at commit dd1e3bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Apr 5, 2017

High-level seems good now, though there are new conflicts in FPGrowthSuite that need to be resolved.

Did you create a JIRA to track the broader issue of trying to make the testing more generic?

Or at least, we could perhaps try to "enforce" the tests through a test trait (e.g. EstimatorModelTest) with a test that takes generated data, fits, and performs the check. The trait could define an abstract generateData method; each concrete test suite could then implement the data generator - most have some form of data generator method already anyway.

Of course we would still need to ensure new tests implement the trait, but at least if all existing tests are adapted in this way it provides the blueprint going forward.

The only other way I can think of would be some reflection-based approach (but the correct form of dataset would need to be generated for each estimator...).
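A hypothetical sketch of that trait idea - EstimatorModelTest, generateData, and estimatorToTest are names from the suggestion above, not existing Spark test code, and it assumes a ScalaTest FunSuite-based suite mixes the trait in:

```scala
// Hypothetical sketch of the suggested trait -- not existing Spark test code.
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.sql.DataFrame
import org.scalatest.FunSuite

trait EstimatorModelTest[M <: Model[M]] { self: FunSuite =>
  // Each concrete suite supplies a tiny dataset and the estimator under test.
  def generateData(): DataFrame
  def estimatorToTest: Estimator[M]

  test("fitted model uses parent estimator UID") {
    val model = estimatorToTest.fit(generateData())
    assert(model.uid == estimatorToTest.uid)
    assert(model.parent == estimatorToTest)
  }
}
```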

@SparkQA

SparkQA commented Apr 5, 2017

Test build #75551 has finished for PR 17296 at commit 7c7ce13.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@BryanCutler
Member Author

@MLnick this should be good to go. I filed https://issues.apache.org/jira/browse/SPARK-20234 to track making these basic checks more consistent.

@MLnick
Contributor

MLnick commented Apr 6, 2017

LGTM, merged to master. Thanks for creating the follow-up JIRA.

@asfgit closed this in e156b5d on Apr 6, 2017