[SPARK-3726] [MLlib] Allow sampling_rate not equal to 1.0 in RandomForests #4073

Closed
wants to merge 4 commits into from

MechCoder
Contributor

I've added support for sampling_rate not equal to 1.0. I have two major questions.

  1. A Scala style test is failing, since the number of parameters now exceeds 10.
  2. I would like suggestions on how to test this.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25666 has started for PR 4073 at commit 6685b44.

  • This patch merges cleanly.

@MechCoder
Contributor Author

@jkbradley @mengxr it would be great if you could have a look.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25666 has finished for PR 4073 at commit 6685b44.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25666/

@MechCoder
Contributor Author

I've made changes so that this does not break anything, i.e., everything is backward compatible.

@MechCoder
Contributor Author

@jkbradley Oops, the comments got deleted somehow. I meant that this is because there are 10 arguments in trainClassifier and trainRegressor.

@jkbradley
Member

@MechCoder Taking a closer look, I now realize that part of this functionality is already there...see the JIRA & let me know what you think.

@MechCoder MechCoder closed this Jan 16, 2015
@MechCoder MechCoder deleted the spark-3726 branch January 16, 2015 18:58
@MechCoder
Contributor Author

Oh well, but still, if I'm not mistaken, the subsamplingRate is overridden by the condition numTrees > 1. This should not be the case, as a lower sampling rate might help here too (when using the entire dataset is expensive, assuming of course that the sample is a good representation of the data itself). IMO, that should be fixed, right?
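
To make the concern concrete, here is a minimal sketch in plain Python (not Spark code). The function names `effective_rate_before` and `effective_rate_after` are hypothetical, and the "before" behavior is an assumption based on this comment and on the later remark in this thread that the rate was hardcoded to 1.0. It is not copied from the MLlib source.

```python
def effective_rate_before(subsampling_rate: float, num_trees: int) -> float:
    """Assumed old behavior: the configured rate is ignored once num_trees > 1."""
    if num_trees > 1:
        return 1.0  # forest case: the user's setting is silently overridden
    return subsampling_rate


def effective_rate_after(subsampling_rate: float, num_trees: int) -> float:
    """Proposed behavior: always honor the configured rate."""
    return subsampling_rate


# Compare both behaviors for a single tree and for a forest of 10 trees.
for rate, trees in [(0.5, 1), (0.5, 10)]:
    print(rate, trees,
          effective_rate_before(rate, trees),
          effective_rate_after(rate, trees))
```

The two functions only disagree in the forest case (num_trees > 1), which is the case being questioned here.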

@MechCoder MechCoder restored the spark-3726 branch January 16, 2015 19:03
@MechCoder MechCoder reopened this Jan 16, 2015
@SparkQA

SparkQA commented Jan 16, 2015

Test build #25672 has started for PR 4073 at commit 6685b44.

  • This patch merges cleanly.

@jkbradley
Member

Good point, yes, I think it's worth fixing.

@SparkQA

SparkQA commented Jan 16, 2015

Test build #25672 has finished for PR 4073 at commit 6685b44.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25672/

@jkbradley
Member

Also, as far as testing goes... it's hard. One way might be to:

  • Run RF with a random seed and subsampling rate 1.0.
  • Run it the same way, but with rate < 1.0.
  • Make sure it learns different forests.
  • To make this test robust, you'll need to use the right (small but not too small) dataset size (maybe 5 features and 20 instances?), and also use a fixed random seed.
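
The steps above can be sketched with a toy stand-in for the real test. Everything in this block is an assumption made for illustration: `make_ordered_points`, `fit_stump`, and `train_forest` are hypothetical helpers, each "tree" is a single split threshold, the data has one feature instead of five, and rows are subsampled without replacement. The real implementation may sample differently, and none of this uses the Spark API.

```python
import random
from typing import List, Tuple

Row = Tuple[float, int]  # (feature value, binary label)


def make_ordered_points(n: int) -> List[Row]:
    """Toy data: feature i, label 0 for the first half and 1 for the second."""
    return [(float(i), 0 if i < n // 2 else 1) for i in range(n)]


def fit_stump(rows: List[Row]) -> float:
    """Learn one split threshold: the midpoint between the two class means."""
    zeros = [x for x, y in rows if y == 0]
    ones = [x for x, y in rows if y == 1]
    if not zeros or not ones:
        # Degenerate sample with a single class: fall back to the sample mean.
        return sum(x for x, _ in rows) / len(rows)
    return (sum(zeros) / len(zeros) + sum(ones) / len(ones)) / 2.0


def train_forest(data: List[Row], num_trees: int, rate: float, seed: int) -> List[float]:
    """Train num_trees stumps, each on a seeded subsample of the rows."""
    rng = random.Random(seed)  # fixed seed makes the run reproducible
    sample_size = max(1, round(rate * len(data)))
    return [fit_stump(rng.sample(data, sample_size)) for _ in range(num_trees)]


data = make_ordered_points(20)
forest_full = train_forest(data, num_trees=3, rate=1.0, seed=42)
forest_half = train_forest(data, num_trees=3, rate=0.5, seed=42)

print("rate 1.0:", forest_full)
print("rate 0.5:", forest_half)
print("forests differ:", forest_full != forest_half)
```

Whether the two forests differ depends on which rows the seed happens to select, which is why the dataset size and the fixed seed matter: once a seed is chosen for which the forests differ, the comparison gives the same answer on every run.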

@MechCoder
Contributor Author

Thanks. Also, a design question: is it worth adding this as an option to train, given that it is now within the "style limit"?

@jkbradley
Member

I'd vote for not adding it to train since that part of the API is so unwieldy.

This reverts commit 6685b4494d2cb1ec72dbc540d2d747c75c6939ee.
@SparkQA

SparkQA commented Jan 17, 2015

Test build #25704 has started for PR 4073 at commit a7bfc70.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 17, 2015

Test build #25704 has finished for PR 4073 at commit a7bfc70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25704/

@MechCoder
Contributor Author

@jkbradley I've added a test modeled on the other tests in RandomForestSuite. Let me know if there is anything left.

@SparkQA

SparkQA commented Jan 17, 2015

Test build #25705 has started for PR 4073 at commit d1df1b2.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 17, 2015

Test build #25705 has finished for PR 4073 at commit d1df1b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25705/

@MechCoder MechCoder changed the title from "[SPARK-3726] [MLlib] Allow sampling_rate not equal to 1.0" to "[SPARK-3726] [MLlib] Allow sampling_rate not equal to 1.0 in RandomForests" Jan 18, 2015
test("subsampling rate in RandomForest"){
val arr = EnsembleTestHelper.generateOrderedLabeledPoints(5, 20)
val rdd = sc.parallelize(arr)
val strategy1 = new Strategy(algo = Classification, impurity = Gini, maxDepth = 2,
Member

I would make 1 instance of strategy, train rf1, modify the strategy's subsamplingRate, and train rf2. That is simpler and more clearly uses the same settings for the other parameters.
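
A small sketch of why the single-instance approach makes the comparison cleaner. The `Strategy` dataclass below is a hypothetical Python stand-in, not the MLlib class. Its field names and defaults are assumptions loosely based on the test snippet quoted above.

```python
from dataclasses import asdict, dataclass


@dataclass
class Strategy:
    """Hypothetical stand-in for a training configuration object."""
    algo: str = "Classification"
    impurity: str = "Gini"
    max_depth: int = 2
    subsampling_rate: float = 1.0


strategy = Strategy()
settings_rf1 = asdict(strategy)   # settings the first forest would be trained with

strategy.subsampling_rate = 0.5   # change only the parameter under test
settings_rf2 = asdict(strategy)   # settings the second forest would be trained with

# Collect the names of every setting that differs between the two runs.
changed = {name for name in settings_rf1 if settings_rf1[name] != settings_rf2[name]}
print(changed)
```

Because both runs read from the same object, the only setting that can differ is the one that was explicitly modified. With two separately constructed strategies, any other argument could drift between them unnoticed.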

@MechCoder
Contributor Author

@jkbradley Thanks for the tip. Fixed. Anything more?

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25953 has started for PR 4073 at commit 8a0acb5.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25953 has finished for PR 4073 at commit 8a0acb5.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25953/

@MechCoder
Contributor Author

Repushed after fixing the style checks.

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25955 has started for PR 4073 at commit d5d68e7.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25955 has finished for PR 4073 at commit d5d68e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25955/

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25961 has started for PR 4073 at commit e0e0d9c.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 22, 2015

Test build #25961 has finished for PR 4073 at commit e0e0d9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/25961/

@MechCoder
Contributor Author

ping @jkbradley Could you please have a final look?

@jkbradley
Member

@MechCoder This is an addition instead of a correction, but I just realized that Strategy.assertValid() does not check subsamplingRate. Would you mind adding that check? The rest looks good to me. Thanks!
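
A minimal sketch of the kind of check being requested, in plain Python rather than the Scala of `Strategy.assertValid()`. The function name `assert_valid_subsampling_rate` is hypothetical, and the accepted range (0, 1] is an assumption about what a valid fraction of the training data would be. The thread itself does not state the exact bounds.

```python
def assert_valid_subsampling_rate(rate: float) -> None:
    """Reject rates outside (0, 1]; the bounds are assumed, not taken from MLlib."""
    if not (0.0 < rate <= 1.0):
        raise ValueError(
            f"subsamplingRate must be > 0 and <= 1, but was given {rate}"
        )


# Try a few candidate values and report which ones pass the check.
for candidate in (0.5, 1.0, 0.0, 1.5):
    try:
        assert_valid_subsampling_rate(candidate)
        print(candidate, "accepted")
    except ValueError as err:
        print(candidate, "rejected:", err)
```

Validating up front means a bad rate fails at configuration time with a clear message, instead of surfacing later as an empty or oversized sample during training.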

@MechCoder
Contributor Author

@jkbradley Fixed. I can haz merge?

@SparkQA

SparkQA commented Jan 25, 2015

Test build #26061 has started for PR 4073 at commit 8012fb2.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 25, 2015

Test build #26061 has finished for PR 4073 at commit 8012fb2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26061/

@jkbradley
Member

@MechCoder Thanks! LGTM

CC: @mengxr Note this is sort of an API change: RandomForest can now be run with subsampled rows. (But this seems fine to me since users could set subsamplingRate before---it just wouldn't do anything.)

@MechCoder
Contributor Author

@mengxr This can also be viewed as a bugfix which prevents overwriting of the param subsamplingRate, which was hardcoded to 1.0.

@asfgit asfgit closed this in d6894b1 Jan 27, 2015
@mengxr
Contributor

mengxr commented Jan 27, 2015

Merged into master. Thanks!

@MechCoder MechCoder deleted the spark-3726 branch January 27, 2015 06:57

5 participants