[SPARK-3724][ML] RandomForest: More options for feature subset size. #11989

yongtang · 2016-03-27T23:28:00Z

What changes were proposed in this pull request?

This PR adds more options for the feature subset size in the RandomForest implementation. Previously, RandomForest only supported "auto", "all", "sqrt", "log2", and "onethird". This PR allows an arbitrary value to be given, which makes model search possible.

With this PR, featureSubsetStrategy can also be set to:
a) a real number in the range (0.0, 1.0], representing the fraction of the features to use in each subset,
b) an integer (> 0), representing the number of features to use in each subset.
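
As a rough illustration of the two new option kinds, here is a minimal sketch of how a numeric strategy string could be turned into a per-node feature count. The object and method names are hypothetical and this is not the code in the patch; the rounding (ceiling) and the cap at the total number of features are assumptions taken from the test expectations and the diff discussed in this thread.

```scala
import scala.util.Try

// Hypothetical helper for illustration only; not the PR's implementation.
object FeatureSubsetSketch {
  /** Resolve a numeric strategy string to a feature count, or None if it is not a valid number. */
  def numFeaturesPerNode(strategy: String, numFeatures: Int): Option[Int] = {
    require(numFeatures > 0, "numFeatures must be positive")
    if (strategy.contains(".")) {
      // Fraction in (0.0, 1.0]: take that share of the features, rounded up.
      Try(strategy.toDouble).toOption
        .filter(f => f > 0.0 && f <= 1.0)
        .map(f => math.ceil(f * numFeatures).toInt)
    } else {
      // Positive integer: use that many features, capped at numFeatures.
      Try(strategy.toInt).toOption
        .filter(_ > 0)
        .map(n => math.min(n, numFeatures))
    }
  }
}
```

With 15 features, "0.1" would resolve to 2 features and "16" would be capped at 15, while "0" and "1.1" would be rejected.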

How was this patch tested?

Two tests JavaRandomForestClassifierSuite and JavaRandomForestRegressorSuite have been updated to check the additional options for params in this PR.
An additional test has been added to org.apache.spark.mllib.tree.RandomForestSuite to cover the cases in this PR.

yongtang · 2016-03-28T17:05:09Z

cc @jkbradley

yongtang · 2016-03-29T19:27:58Z

Hi @jkbradley, any chance you could take a look at this pull request? It falls under the "RandomForest improvement umbrella" (SPARK-14046), which was added recently. Any feedback or comments would be greatly appreciated.

sethah · 2016-03-30T22:55:55Z

mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala

@@ -360,7 +362,9 @@ private[ml] trait RandomForestParams extends TreeEnsembleParams {
    "The number of features to consider for splits at each tree node." +
      s" Supported options: ${RandomForestParams.supportedFeatureSubsetStrategies.mkString(", ")}",
    (value: String) =>
-      RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase))
+      RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase)
+      || (try { value.toInt > 0 } catch { case _ : Throwable => false })


This could be simplified using a regex. I think the following pattern should work: "0?\\.\\d+|1.0+|\\d+"

yongtang · 2016-04-01T01:48:34Z

Hi @sethah, thanks for the review. There were some issues with the regex, but I managed to get it working. The tests have also been moved to ML. Let me know if there are any other issues.

yongtang · 2016-04-01T01:50:09Z

Hi @sethah by the way, could you help start a Jenkins test if possible?

sethah · 2016-04-01T03:07:17Z

@yongtang I believe only Spark committers can do that. Maybe @jkbradley could help?

MLnick · 2016-04-01T07:45:53Z

ok to test

SparkQA · 2016-04-01T08:27:25Z

Test build #54692 has finished for PR 11989 at commit b9416d6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-04-01T17:21:05Z

mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala

    val numFeaturesPerNode: Int = _featureSubsetStrategy match {
      case "all" => numFeatures
      case "sqrt" => math.sqrt(numFeatures).ceil.toInt
      case "log2" => math.max(1, (math.log(numFeatures) / math.log(2)).ceil.toInt)
      case "onethird" => (numFeatures / 3.0).ceil.toInt
+      case isIntRegex(number) => if (number.toInt > numFeatures) numFeatures else number.toInt


This cast will fail if the string integer is greater than Integer.MAX_VALUE. Can you add a check for this before doing the cast?
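
One way to guard against the overflow described above is to parse the digits into an unbounded integer type before comparing. The following is only a sketch under that assumption; the object and method names are hypothetical, and the thread does not show which check the patch finally used.

```scala
// Hypothetical helper for illustration only; not the PR's implementation.
object SafeIntSketch {
  /** Cap a string of decimal digits at numFeatures without overflowing Int. */
  def cappedFeatureCount(digits: String, numFeatures: Int): Int = {
    require(digits.nonEmpty && digits.forall(c => c >= '0' && c <= '9'),
      s"not a non-negative integer: $digits")
    // BigInt has no upper bound, so "99999999999" parses where toInt would throw.
    val n = BigInt(digits)
    // Only convert back to Int once we know the value fits below numFeatures.
    if (n > BigInt(numFeatures)) numFeatures else n.toInt
  }
}
```

Here `cappedFeatureCount("2147483648", 15)` returns 15, whereas `"2147483648".toInt` throws a NumberFormatException.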

SparkQA · 2016-04-02T03:13:43Z

Test build #54750 has finished for PR 11989 at commit a23de1f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yongtang · 2016-04-02T03:21:11Z

@sethah Thanks for the review. I just updated the pull request with the issues addressed. Let me know if there are any further issues.

SparkQA · 2016-04-02T04:04:00Z

Test build #54754 has finished for PR 11989 at commit 047e850.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-04-04T08:09:40Z

mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala

@@ -360,7 +362,8 @@ private[ml] trait RandomForestParams extends TreeEnsembleParams {
    "The number of features to consider for splits at each tree node." +
      s" Supported options: ${RandomForestParams.supportedFeatureSubsetStrategies.mkString(", ")}",
    (value: String) =>
-      RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase))
+      RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase)
+      || value.matches("^(?:[1-9]\\d*|0\\.\\d*[1-9]\\d*\\d*|1\\.0+)$"))


This regex is repeated twice. Does it perhaps make sense to move it to a constant (it would probably have to be private[spark] so that the mllib package can read it)? @sethah, what do you think about that?

Yes, that sounds best to me. Also, I am not an expert in regex, but in this pattern 0\\.\\d*[1-9]\\d*\\d*, is the last \\d* redundant? I also think that you should be allowed to set the fraction with ".25", but that doesn't work currently. Can we change the middle option to be "0?\\.\\d*[1-9]\\d*"?

Thanks @MLnick @sethah. In 0\\.\\d*[1-9]\\d*\\d* the last \\d* is meant to capture cases with a trailing zero, such as "0.0250". I didn't take the ".25" case into consideration initially.

Let me update the pull request and address those issues.

I was able to remove the last \\d* and still match "0.0250". Since \\d* matches 0 or more occurrences of a digit [0-9], I don't see why there needs to be two consecutively.

Oh the \\d*\\d* was a typo. I didn't notice that there are two \\d*s. Sorry about that. Let me update the pull request accordingly.
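
The two points raised in this exchange can be checked with a small sketch. The first pattern is the one quoted in the diff above; the second is an assumed revision that applies both suggestions (dropping the doubled `\\d*` and making the leading zero optional) and may differ from what was finally committed. The object and method names are hypothetical.

```scala
// Hypothetical helper for illustration only; not the PR's implementation.
object RegexSketch {
  // Pattern quoted in the review, with the doubled \d* left in.
  val original = "^(?:[1-9]\\d*|0\\.\\d*[1-9]\\d*\\d*|1\\.0+)$"
  // Assumed revision: redundant \d* dropped and leading zero made optional.
  val revised = "^(?:[1-9]\\d*|0?\\.\\d*[1-9]\\d*|1\\.0+)$"

  /** True if the whole value matches the given pattern. */
  def isValid(pattern: String, value: String): Boolean = value.matches(pattern)
}
```

Under these assumptions, "0.0250" matches both patterns, ".25" matches only the revised one, and "0.0", "1.1", and "-1" match neither.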

yongtang · 2016-04-05T03:41:13Z

@sethah @MLnick Thanks so much for the detailed review. The pull request has been updated with the issues addressed. Let me know if there are any other issues.

SparkQA · 2016-04-05T04:21:59Z

Test build #54947 has finished for PR 11989 at commit c2b662b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-04-05T15:24:52Z

mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala

@@ -27,6 +27,7 @@ import org.apache.spark.mllib.regression.LabeledPoint
 import org.apache.spark.mllib.tree.{DecisionTreeSuite => OldDTSuite, EnsembleTestHelper}
 import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, QuantileStrategy, Strategy => OldStrategy}
 import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, GiniCalculator}
+import org.apache.spark.mllib.tree.model.RandomForestModel


Not sure why this import was added. It can be removed.

sethah · 2016-04-05T15:25:41Z

This LGTM other than one small comment about imports. @MLnick could you make a final pass?

yongtang · 2016-04-05T16:14:52Z

@sethah Thanks. The import has been removed.

SparkQA · 2016-04-05T16:56:44Z

Test build #54995 has finished for PR 11989 at commit bebd544.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-04-05T18:38:59Z

mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala

@@ -329,6 +329,8 @@ private[ml] trait HasFeatureSubsetStrategy extends Params {
   *  - "onethird": use 1/3 of the features
   *  - "sqrt": use sqrt(number of features)
   *  - "log2": use log2(number of features)
+   *  - "(0.0-1.0]": use the specified fraction of features


I'm wondering if we can simply consolidate the doc into something like:

- "n": when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features.

yongtang · 2016-04-06T01:54:40Z

Thanks @MLnick. The pull request has been updated. Please let me know if there are other issues.

SparkQA · 2016-04-06T02:32:44Z

Test build #55071 has finished for PR 11989 at commit 13edc07.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-04-06T06:43:05Z

mllib/src/test/scala/org/apache/spark/ml/tree/impl/RandomForestSuite.scala

@@ -422,6 +422,13 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext {
    checkFeatureSubsetStrategy(numTrees = 1, "log2",
      (math.log(numFeatures) / math.log(2)).ceil.toInt)
    checkFeatureSubsetStrategy(numTrees = 1, "onethird", (numFeatures / 3.0).ceil.toInt)
+    checkFeatureSubsetStrategy(numTrees = 1, "0.1", (0.1 * numFeatures).ceil.toInt)


Is there a particular reason these test cases differ from the Java ones? I notice in the Java tests we also test some of the regex like ".1" and ".10" and "0.10".

I'm wondering if we shouldn't just have a couple test cases for the regex edge cases here (to ensure it gets translated correctly).

Hi @MLnick, that is probably because of repetitive lines being copied around. I consolidated the test cases so that it is easier to track what is covered. Let me know if additional test cases are needed.

SparkQA · 2016-04-06T16:02:11Z

Test build #55121 has finished for PR 11989 at commit 5678ac3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-04-06T16:13:11Z

Test build #55122 has finished for PR 11989 at commit 08feaaa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-04-07T07:56:09Z

mllib/src/test/java/org/apache/spark/ml/regression/JavaRandomForestRegressorSuite.java

+    for (String strategy: realStrategies) {
+      rf.setFeatureSubsetStrategy(strategy);
+    }
+    String integerStrategies[] = {"1", "10", "100", "1000", "10000"};


Passing 0 should round up to 1, yes? We should test this edge case.

Also, what happens with negative values? Those should not be allowed - just want to confirm the regex excludes that (we should add some test cases)

yongtang · 2016-04-07T14:27:58Z

Hi @MLnick, here is the complete list of the scenarios:

Real numbers (must contain "."), assuming the number of features is 15:

.0         not allowed by regex (must not be all zeros)
.1         OK
.10        OK
0.0        not allowed by regex (must not be all zeros)
0.1        OK
0.10       OK
1.0        OK (use 15)
1.1        not allowed (greater than 1.0)

Integer numbers (must not contain "."), assuming the number of features is 15:

0          not allowed by regex (must not be all zeros)
1          OK
...
...
15         OK
16         OK (use 15)
...

Let me take a look at the not-allowed cases and see if I can add test cases to cover them.

yongtang · 2016-04-08T15:24:56Z

Hi @MLnick, I updated the pull request to add test cases that cover invalid values. Let me know if there are any other issues. Thanks.

MLnick · 2016-04-12T13:41:26Z

jenkins retest this please

SparkQA · 2016-04-12T14:17:51Z

Test build #55606 has finished for PR 11989 at commit 8a4c298.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MLnick · 2016-04-12T14:55:14Z

LGTM. Merged to master. Thanks @yongtang and also @sethah for reviewing!

mengxr · 2016-04-12T17:01:59Z

@yongtang Sorry that I just found this was merged. I think it might be better to use Int.parseInt and parseDouble instead of regexes, which are less robust. For example, what if I entered 1e-2? I created https://issues.apache.org/jira/browse/SPARK-14565. Could you send a follow-up PR? Thanks!
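
A parse-based check along the lines suggested here might look like the sketch below. It uses Scala's `toInt` and `toDouble` string conversions wrapped in `Try`; the object and method names are hypothetical, and the follow-up under SPARK-14565 may have been written differently.

```scala
import scala.util.Try

// Hypothetical helper for illustration only; not the follow-up PR's code.
object ParseSketch {
  /** True for a positive integer or a real number in (0.0, 1.0], including forms like "1e-2". */
  def isValidNumericStrategy(value: String): Boolean =
    Try(value.toInt).filter(_ > 0).isSuccess ||
      Try(value.toDouble).filter(d => d > 0.0 && d <= 1.0).isSuccess
}
```

With this approach "1e-2" is accepted as the fraction 0.01, which the regex-based validation would reject, while "0", "1.1", and "-3" are still refused.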

yongtang · 2016-04-12T18:33:34Z

@mengxr Sure I will create a pull request shortly. Thanks.

yongtang changed the title ~~[SPARK-3724][MLLIB] RandomForest: More options for feature subset size.~~ [SPARK-3724][ML] RandomForest: More options for feature subset size. Mar 29, 2016

sethah reviewed Mar 30, 2016

sethah reviewed Apr 1, 2016

MLnick reviewed Apr 4, 2016

yongtang added 7 commits April 5, 2016 02:49

[SPARK-3724][MLLIB] RandomForest: More options for feature subset size.

326f5a0

Add one additional test in org.apache.spark.mllib.tree.RandomForestSuite to cover the changes in options for feature subset size.

[SPARK-3724][MLLIB] RandomForest: More options for feature subset size.

e154354

Fix a couple of issues in JavaRandomForestRegressorSuite and JavaRandomForestClassifierSuite. Fix a typo in comment.

[SPARK-3724][MLLIB] RandomForest: More options for feature subset size.

de3d7ac

Move tests from mllib to ml. Replace extractors with regex.

[SPARK-3724][MLLIB] RandomForest: More options for feature subset size.

704a8f0

Update pull request based on @sethah's feedback.

[SPARK-3724][MLLIB] RandomForest: More options for feature subset size.

f02604b

Reduce unneeded tests based on feedbacks from @sethah.

[SPARK-3724][MLLIB] RandomForest: More options for feature subset size.

c2b662b

Move repeated regex to a constant (@MLnick). Remove redundant `\\d*` from the end of the regex (@sethah). Rewording the documentation for better explanation (@sethah).

sethah reviewed Apr 5, 2016

[SPARK-3724][MLLIB] RandomForest: More options for feature subset size.

bebd544

Remove unneeded import.

MLnick reviewed Apr 5, 2016

[SPARK-3724][MLLIB] RandomForest: More options for feature subset size.

13edc07

Reorganize the wording in the comment (@MLnick).

MLnick reviewed Apr 6, 2016

[SPARK-3724][MLLIB] RandomForest: More options for feature subset size.

08feaaa

Consolidate test cases so that both Java and Scala are properly covered.

MLnick reviewed Apr 7, 2016

[SPARK-3724][MLLIB] RandomForest: More options for feature subset size.

8a4c298

Add test cases to cover invalid strategies.

asfgit closed this in da60b34 Apr 12, 2016

yongtang deleted the SPARK-3724 branch April 12, 2016 15:17

holdenk added a commit to holdenk/spark that referenced this pull request May 10, 2016

Add new option to PyDoc from apache#11989

3cfc996
