[SPARK-3724][ML] RandomForest: More options for feature subset size. #11989

Closed
wants to merge 11 commits into from
Conversation

yongtang
Contributor

What changes were proposed in this pull request?

This PR tries to support more options for feature subset size in the RandomForest implementation. Previously, RandomForest only supported "auto", "all", "sqrt", "log2", and "onethird". This PR tries to support any given value to allow model search.

In this PR, featureSubsetStrategy can also be set to:
a) a real number in the range (0.0, 1.0] that represents the fraction of the number of features in each subset,
b) an integer (> 0) that represents the number of features in each subset.
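
As a rough illustration of how these options could map to a per-node feature count, here is a minimal Java sketch. The class and method names (`FeatureSubsetSketch`, `featuresPerNode`) are hypothetical, the rounding and capping follow the `ceil`-based expressions that appear in the diff hunks of this PR, and "auto" is left out because its resolution depends on settings not shown here. It is not the Spark implementation.

```java
// Hypothetical helper, not part of Spark: maps a featureSubsetStrategy string
// to the number of features considered at each tree node.
final class FeatureSubsetSketch {

    static int featuresPerNode(String strategy, int numFeatures) {
        switch (strategy.toLowerCase(java.util.Locale.ROOT)) {
            case "all":
                return numFeatures;
            case "sqrt":
                return (int) Math.ceil(Math.sqrt(numFeatures));
            case "log2":
                return Math.max(1, (int) Math.ceil(Math.log(numFeatures) / Math.log(2)));
            case "onethird":
                return (int) Math.ceil(numFeatures / 3.0);
            default:
                break;
        }
        if (strategy.contains(".")) {
            // Option a): a fraction in (0.0, 1.0] of the features, rounded up.
            double fraction = Double.parseDouble(strategy);
            if (!(fraction > 0.0 && fraction <= 1.0)) {
                throw new IllegalArgumentException("fraction must be in (0.0, 1.0]: " + strategy);
            }
            return (int) Math.ceil(fraction * numFeatures);
        }
        // Option b): a positive feature count, capped at the number of features.
        int count = Integer.parseInt(strategy);
        if (count <= 0) {
            throw new IllegalArgumentException("count must be > 0: " + strategy);
        }
        return Math.min(count, numFeatures);
    }

    public static void main(String[] args) {
        int numFeatures = 15;
        for (String s : new String[] {"all", "sqrt", "onethird", "0.1", "1.0", "10", "16"}) {
            System.out.println(s + " -> " + featuresPerNode(s, numFeatures));
        }
    }
}
```

With 15 features, a fraction such as "0.1" rounds up to 2 features, and an integer larger than the feature count, such as "16", is capped at 15.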

How was this patch tested?

Two tests, JavaRandomForestClassifierSuite and JavaRandomForestRegressorSuite, have been updated to check the additional parameter options in this PR.
An additional test has been added to org.apache.spark.mllib.tree.RandomForestSuite to cover the cases in this PR.

@yongtang
Contributor Author

cc @jkbradley

@yongtang
Contributor Author

Hi @jkbradley, any chance you could take a look at this pull request? It falls under the "RandomForest improvement umbrella" (SPARK-14046), which was added recently. Any feedback or comments would be greatly appreciated.

@yongtang yongtang changed the title from "[SPARK-3724][MLLIB] RandomForest: More options for feature subset size." to "[SPARK-3724][ML] RandomForest: More options for feature subset size." on Mar 29, 2016
@@ -360,7 +362,9 @@ private[ml] trait RandomForestParams extends TreeEnsembleParams {
 "The number of features to consider for splits at each tree node." +
 s" Supported options: ${RandomForestParams.supportedFeatureSubsetStrategies.mkString(", ")}",
 (value: String) =>
-RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase))
+RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase)
+|| (try { value.toInt > 0 } catch { case _ : Throwable => false })
Contributor

This could be simplified using a regex. I think the following pattern should work: "0?\\.\\d+|1.0+|\\d+"
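
To see how the suggested pattern behaves under a full-string match, a small probe can help. This is only a sketch: the class name `SuggestedPatternProbe` and the sample inputs are made up for illustration, and the pattern is copied verbatim from the comment above.

```java
// Hypothetical probe, not part of the PR: checks which strings the suggested
// pattern accepts when the whole string must match.
final class SuggestedPatternProbe {
    // Pattern exactly as written in the review comment.
    static final String SUGGESTED = "0?\\.\\d+|1.0+|\\d+";

    static boolean accepts(String value) {
        // String.matches succeeds only if the entire string matches.
        return value.matches(SUGGESTED);
    }

    public static void main(String[] args) {
        for (String s : new String[] {".25", "0.25", "1.0", "12", "0", "0.0", "1x0", "1.5"}) {
            System.out.println("\"" + s + "\" accepted: " + accepts(s));
        }
    }
}
```

The pattern accepts ".25", "0.25", "1.0" and "12" as intended, but also "0" and "0.0" (all-zero values) and "1x0", because the `.` in `1.0+` is unescaped and matches any character.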

@yongtang
Contributor Author

yongtang commented Apr 1, 2016

Hi @sethah, thanks for the review. There were some issues with the regex, but I managed to get it working. The test has also been moved to ML. Let me know if there are any other issues.

@yongtang
Contributor Author

yongtang commented Apr 1, 2016

Hi @sethah by the way, could you help start a Jenkins test if possible?

@sethah
Contributor

sethah commented Apr 1, 2016

@yongtang I believe only Spark committers can do that. Maybe @jkbradley could help?

@MLnick
Contributor

MLnick commented Apr 1, 2016

ok to test

@SparkQA

SparkQA commented Apr 1, 2016

Test build #54692 has finished for PR 11989 at commit b9416d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val numFeaturesPerNode: Int = _featureSubsetStrategy match {
case "all" => numFeatures
case "sqrt" => math.sqrt(numFeatures).ceil.toInt
case "log2" => math.max(1, (math.log(numFeatures) / math.log(2)).ceil.toInt)
case "onethird" => (numFeatures / 3.0).ceil.toInt
case isIntRegex(number) => if (number.toInt > numFeatures) numFeatures else number.toInt
Contributor

This cast will fail if the string integer is greater than Integer.MAX_VALUE. Can you add a check for this before doing the cast?
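
One way to guard against this overflow, sketched in Java under stated assumptions: the input is a plain decimal integer with no sign and no leading zeros (the form an integer regex like the one in this PR would accept), and any value above the feature count is clamped to it, as in the `match` expression above. The names `ClampedCountSketch` and `clampToFeatures` are hypothetical.

```java
// Hypothetical helper, not the code from the PR: converts a digit string to a
// feature count without overflowing, clamping to the number of features.
final class ClampedCountSketch {

    static int clampToFeatures(String digits, int numFeatures) {
        // Integer.MAX_VALUE (2147483647) has 10 digits, so a longer digit string
        // cannot fit in an int and is certainly larger than numFeatures.
        if (digits.length() > 10) {
            return numFeatures;
        }
        // At most 10 digits always fits in a long, so this parse cannot overflow.
        long requested = Long.parseLong(digits);
        return requested > numFeatures ? numFeatures : (int) requested;
    }

    public static void main(String[] args) {
        System.out.println(clampToFeatures("7", 15));
        System.out.println(clampToFeatures("2147483648", 15));
        System.out.println(clampToFeatures("99999999999999999999", 15));
    }
}
```

By contrast, calling `Integer.parseInt("2147483648")` directly throws a `NumberFormatException`, which is the failure the comment above warns about.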

@SparkQA

SparkQA commented Apr 2, 2016

Test build #54750 has finished for PR 11989 at commit a23de1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yongtang
Contributor Author

yongtang commented Apr 2, 2016

@sethah Thanks for the review. I just updated the pull request with issues addressed. Let me know if there are any further issues.

@SparkQA

SparkQA commented Apr 2, 2016

Test build #54754 has finished for PR 11989 at commit 047e850.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -360,7 +362,8 @@ private[ml] trait RandomForestParams extends TreeEnsembleParams {
 "The number of features to consider for splits at each tree node." +
 s" Supported options: ${RandomForestParams.supportedFeatureSubsetStrategies.mkString(", ")}",
 (value: String) =>
-RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase))
+RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase)
+|| value.matches("^(?:[1-9]\\d*|0\\.\\d*[1-9]\\d*\\d*|1\\.0+)$"))
Contributor

This regex is repeated twice. Does it perhaps make sense to move it to a constant (it would probably have to be private[spark] so that the mllib package can read it)? @sethah, what do you think?

Contributor

Yes, that sounds best to me. Also, I am not an expert in regex, but in the pattern 0\\.\\d*[1-9]\\d*\\d*, is the last \\d* redundant? I also think that you should be allowed to set the fraction with ".25", but that doesn't work currently. Can we change the middle option to "0?\\.\\d*[1-9]\\d*"?

Contributor Author

Thanks @MLnick @sethah. In 0\\.\\d*[1-9]\\d*\\d*, the last \\d* is meant to capture trailing zeros, as in "0.0250". I didn't take the ".25" case into consideration initially.

Let me update the pull request and address those issues.

Contributor

I was able to remove the last \\d* and still match "0.0250". Since \\d* matches 0 or more occurrences of a digit [0-9], I don't see why there needs to be two consecutively.

Contributor Author

Oh the \\d*\\d* was a typo. I didn't notice that there are two \\d*s. Sorry about that. Let me update the pull request accordingly.
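
The points raised in this thread can be checked with a short Java sketch. `ORIGINAL` is the pattern from the diff under review; `SIMPLIFIED` is a variant assembled here from the review suggestions (redundant `\\d*` dropped, leading `0` made optional) and is an assumption about what the updated pattern could look like, not a quote from the final commit. The class name `FractionRegexCheck` is hypothetical.

```java
// Hypothetical check, not part of the PR: compares the reviewed pattern with a
// variant that applies both suggestions from the thread.
final class FractionRegexCheck {
    // Pattern from the diff under review (note the doubled \\d*).
    static final String ORIGINAL = "^(?:[1-9]\\d*|0\\.\\d*[1-9]\\d*\\d*|1\\.0+)$";
    // Assumed variant: one \\d* removed and the leading 0 made optional.
    static final String SIMPLIFIED = "^(?:[1-9]\\d*|0?\\.\\d*[1-9]\\d*|1\\.0+)$";

    public static void main(String[] args) {
        String[] samples = {"0.0250", "0.25", ".25", "1.0", "16", "0", "0.0", "1.1", "-1"};
        for (String s : samples) {
            System.out.println("\"" + s + "\": original=" + s.matches(ORIGINAL)
                + ", simplified=" + s.matches(SIMPLIFIED));
        }
    }
}
```

"0.0250" matches both patterns, so a single trailing `\\d*` already covers trailing zeros, and ".25" matches only the simplified one. Both patterns reject "0", "0.0", "1.1" and "-1".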

This PR tries to support more options for feature subset size in the RandomForest
implementation. Previously, RandomForest only supported "auto", "all", "sqrt",
"log2", and "onethird". This PR tries to support any given value to allow model
search.

In this PR, `featureSubsetStrategy` could be passed with:
a) a real number in the range of `(0.0-1.0]` that represents the fraction of
the number of features in each subset,
b)  an integer number (`>0`) that represents the number of features in each
subset.
Add one additional test in org.apache.spark.mllib.tree.RandomForestSuite
to cover the changes in options for feature subset size.
Fix a couple of issues in JavaRandomForestRegressorSuite and
JavaRandomForestClassifierSuite.
Fix a typo in comment.
Move tests from mllib to ml. Replace extractors with regex.
Move repeated regex to a constant (@MLnick).
Remove redundant `\\d*` from the end of the regex (@sethah).
Rewording the documentation for better explanation (@sethah).
@yongtang
Contributor Author

yongtang commented Apr 5, 2016

@sethah @MLnick Thanks so much for detailed review. The pull request has been updated with the issues addressed. Let me know if there are any other issues.

@SparkQA

SparkQA commented Apr 5, 2016

Test build #54947 has finished for PR 11989 at commit c2b662b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -27,6 +27,7 @@ import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.{DecisionTreeSuite => OldDTSuite, EnsembleTestHelper}
import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, QuantileStrategy, Strategy => OldStrategy}
import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, GiniCalculator}
import org.apache.spark.mllib.tree.model.RandomForestModel
Contributor

Not sure why this import was added. It can be removed.

@sethah
Contributor

sethah commented Apr 5, 2016

This LGTM other than one small comment about imports. @MLnick could you make a final pass?

@yongtang
Contributor Author

yongtang commented Apr 5, 2016

@sethah Thanks. The import has been removed.

@SparkQA

SparkQA commented Apr 5, 2016

Test build #54995 has finished for PR 11989 at commit bebd544.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -329,6 +329,8 @@ private[ml] trait HasFeatureSubsetStrategy extends Params {
* - "onethird": use 1/3 of the features
* - "sqrt": use sqrt(number of features)
* - "log2": use log2(number of features)
* - "(0.0-1.0]": use the specified fraction of features
Contributor

I'm wondering if we can simply consolidate the doc into something like:

- "n": when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features.

@yongtang
Contributor Author

yongtang commented Apr 6, 2016

Thanks @MLnick. The pull request has been updated. Please let me know if there are other issues.

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55071 has finished for PR 11989 at commit 13edc07.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -422,6 +422,13 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
checkFeatureSubsetStrategy(numTrees = 1, "log2",
(math.log(numFeatures) / math.log(2)).ceil.toInt)
checkFeatureSubsetStrategy(numTrees = 1, "onethird", (numFeatures / 3.0).ceil.toInt)
checkFeatureSubsetStrategy(numTrees = 1, "0.1", (0.1 * numFeatures).ceil.toInt)
Contributor

Is there a particular reason these test cases differ from the Java ones? I notice in the Java tests we also test some of the regex like ".1" and ".10" and "0.10".

I'm wondering if we shouldn't just have a couple test cases for the regex edge cases here (to ensure it gets translated correctly).

Contributor Author

Hi @MLnick, that is probably because of the repetitive lines being copied around. I consolidated the test cases so that it is easier to track what is covered. Let me know if additional test cases are needed.

Consolidate test cases so that both Java and Scala are properly covered.
@SparkQA

SparkQA commented Apr 6, 2016

Test build #55121 has finished for PR 11989 at commit 5678ac3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 6, 2016

Test build #55122 has finished for PR 11989 at commit 08feaaa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

for (String strategy: realStrategies) {
rf.setFeatureSubsetStrategy(strategy);
}
String integerStrategies[] = {"1", "10", "100", "1000", "10000"};
Contributor

Passing 0 should round up to 1, yes? We should test this edge case.

Contributor

Also, what happens with negative values? Those should not be allowed - just want to confirm the regex excludes that (we should add some test cases)

@yongtang
Contributor Author

yongtang commented Apr 7, 2016

Hi @MLnick, here is the complete list of the scenarios:

Real numbers (must contain "."), assuming the number of features is 15

 .0        not allowed by regex (should not be all zeros)
 .1        OK
 .10       OK
0.0        not allowed by regex (should not be all zeros)
0.1        OK
0.10       OK
1.0        OK (use 15)
1.1        not allowed (greater than 1.0)

Integer numbers (must not contain "."), assuming the number of features is 15

0          not allowed by regex (should not be all zeros)
1          OK
...
...
15         OK
16         OK (use 15)
...

Let me take a look at the not allowed case and see if I could add test cases to cover it.

@yongtang
Contributor Author

yongtang commented Apr 8, 2016

Hi @MLnick, I updated the pull request to add test cases that cover invalid values. Let me know if there are any other issues. Thanks.

@MLnick
Contributor

MLnick commented Apr 12, 2016

jenkins retest this please

@SparkQA

SparkQA commented Apr 12, 2016

Test build #55606 has finished for PR 11989 at commit 8a4c298.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MLnick
Contributor

MLnick commented Apr 12, 2016

LGTM. Merged to master. Thanks @yongtang and also @sethah for reviewing!

@asfgit asfgit closed this in da60b34 Apr 12, 2016
@yongtang yongtang deleted the SPARK-3724 branch April 12, 2016 15:17
@mengxr
Contributor

mengxr commented Apr 12, 2016

@yongtang Sorry, I only just noticed that this was merged. I think it might be better to use Int.parseInt and parseDouble instead of regexes, which are less robust. For example, what if I entered 1e-2? I created https://issues.apache.org/jira/browse/SPARK-14565. Could you send a follow-up PR? Thanks!
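
A parse-based check along the lines suggested here might look like the following Java sketch. It is one possible shape for such validation, not the code from the follow-up issue; the names `ParseBasedValidation` and `isValid` are hypothetical, and the accepted ranges (integer > 0, fraction in (0.0, 1.0]) are taken from the PR description.

```java
// Hypothetical validator, not the follow-up implementation: accepts a value if
// it parses as a positive int or as a double in (0.0, 1.0].
final class ParseBasedValidation {

    static boolean isValid(String value) {
        try {
            // Integer form: the number of features, which must be positive.
            return Integer.parseInt(value) > 0;
        } catch (NumberFormatException notAnInt) {
            // Not an int (or out of int range): fall through and try a fraction.
        }
        try {
            // Fraction form, including scientific notation such as "1e-2".
            double fraction = Double.parseDouble(value);
            return fraction > 0.0 && fraction <= 1.0;
        } catch (NumberFormatException notANumber) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1e-2", "0.1", "1.0", "16", "0", "1.1", "-1", "abc"}) {
            System.out.println("\"" + s + "\" valid: " + isValid(s));
        }
    }
}
```

Unlike the regex, this accepts "1e-2". One trade-off of this particular sketch is that an integer too large for an `int` falls through to the fraction check and is rejected rather than clamped.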

@yongtang
Contributor Author

@mengxr Sure I will create a pull request shortly. Thanks.

holdenk added a commit to holdenk/spark that referenced this pull request May 10, 2016
5 participants