-
Notifications
You must be signed in to change notification settings - Fork 28.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3724][ML] RandomForest: More options for feature subset size. #11989
Conversation
cc @jkbradley |
Hi @jkbradley, any chance to take a look at this pull request? This pull request is under the "RandomForest improvement umbrella" (SPARK-14046) which was just recently added. Any feedbacks or comments would be greatly appreciated. |
@@ -360,7 +362,9 @@ private[ml] trait RandomForestParams extends TreeEnsembleParams { | |||
"The number of features to consider for splits at each tree node." + | |||
s" Supported options: ${RandomForestParams.supportedFeatureSubsetStrategies.mkString(", ")}", | |||
(value: String) => | |||
RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase)) | |||
RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase) | |||
|| (try { value.toInt > 0 } catch { case _ : Throwable => false }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be simplified using a regex. I think the following pattern should work: "0?\\.\\d+|1.0+|\\d+"
Hi @sethah thanks for the review. There are some issues with the regex but managed to get it done. The test also has been moved to ML. Let me know if there are any other issues. |
Hi @sethah by the way, could you help start a Jenkins test if possible? |
@yongtang I believe only Spark committers can do that. Maybe @jkbradley could help? |
ok to test |
Test build #54692 has finished for PR 11989 at commit
|
val numFeaturesPerNode: Int = _featureSubsetStrategy match { | ||
case "all" => numFeatures | ||
case "sqrt" => math.sqrt(numFeatures).ceil.toInt | ||
case "log2" => math.max(1, (math.log(numFeatures) / math.log(2)).ceil.toInt) | ||
case "onethird" => (numFeatures / 3.0).ceil.toInt | ||
case isIntRegex(number) => if (number.toInt > numFeatures) numFeatures else number.toInt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This cast will fail if the string integer is greater than Integer.MAX_VALUE
. Can you add a check for this before doing the cast?
Test build #54750 has finished for PR 11989 at commit
|
@sethah Thanks for the review. I just updated the pull request with issues addressed. Let me know if there are any further issues. |
Test build #54754 has finished for PR 11989 at commit
|
@@ -360,7 +362,8 @@ private[ml] trait RandomForestParams extends TreeEnsembleParams { | |||
"The number of features to consider for splits at each tree node." + | |||
s" Supported options: ${RandomForestParams.supportedFeatureSubsetStrategies.mkString(", ")}", | |||
(value: String) => | |||
RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase)) | |||
RandomForestParams.supportedFeatureSubsetStrategies.contains(value.toLowerCase) | |||
|| value.matches("^(?:[1-9]\\d*|0\\.\\d*[1-9]\\d*\\d*|1\\.0+)$")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This regex is repeated twice, does it perhaps make sense to move it to a constant (would have to be private[spark]
probably to enable mllib
package to read it). @sethah what do you think about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that sounds best to me. Also, I am not an expert in regex, but in this pattern 0\\.\\d*[1-9]\\d*\\d*
is the last \\d*
redundant? I also think that you should be allowed to set the fraction with ".25"
but that doesn't work currently. Can we change the middle option to be "0?\\.\\d*[1-9]\\d*"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was able to remove the last \\d*
and still match "0.0250"
. Since \\d*
matches 0 or more occurrences of a digit [0-9], I don't see why there needs to be two consecutively.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh the \\d*\\d*
was a typo. I didn't notice that there are two \\d*
s. Sorry about that. Let me update the pull request accordingly.
This PR tries to support more options for feature subset size in RandomForest implementation. Previously, RandomForest only support "auto", "all", "sqrt", "log2", "onethird". This PR tries to support any given value to allow model search. In this PR, `featureSubsetStrategy` could be passed with: a) a real number in the range of `(0.0-1.0]` that represents the fraction of the number of features in each subset, b) an integer number (`>0`) that represents the number of features in each subset.
Add one additional test in org.apache.spark.mllib.tree.RandomForestSuite to cover the changes in options for feature subset size.
Fix a couple of issues in JavaRandomForestRegressorSuite and JavaRandomForestClassifierSuite. Fix a typo in comment.
Move tests from mllib to ml. Replace extractors with regex.
Update pull request based on @sethah's feedback.
Reduce unneeded tests based on feedbacks from @sethah.
Test build #54947 has finished for PR 11989 at commit
|
@@ -27,6 +27,7 @@ import org.apache.spark.mllib.regression.LabeledPoint | |||
import org.apache.spark.mllib.tree.{DecisionTreeSuite => OldDTSuite, EnsembleTestHelper} | |||
import org.apache.spark.mllib.tree.configuration.{Algo => OldAlgo, QuantileStrategy, Strategy => OldStrategy} | |||
import org.apache.spark.mllib.tree.impurity.{Entropy, Gini, GiniCalculator} | |||
import org.apache.spark.mllib.tree.model.RandomForestModel |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure why this import was added. It can be removed.
This LGTM other than one small comment about imports. @MLnick could you make a final pass? |
Remove unneeded import.
@sethah Thanks. The import has been removed. |
Test build #54995 has finished for PR 11989 at commit
|
@@ -329,6 +329,8 @@ private[ml] trait HasFeatureSubsetStrategy extends Params { | |||
* - "onethird": use 1/3 of the features | |||
* - "sqrt": use sqrt(number of features) | |||
* - "log2": use log2(number of features) | |||
* - "(0.0-1.0]": use the specified fraction of features |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm wondering if we can simply consolidate the doc into something like:
- "n": when n is in the range (0, 1.0], use n * number of features. When n is in the range (1, number of features), use n features.
Reorganize the wording in the comment (@MLnick).
Thanks @MLnick. The pull request has been updated. Please let me know if there are other issues. |
Test build #55071 has finished for PR 11989 at commit
|
@@ -422,6 +422,13 @@ class RandomForestSuite extends SparkFunSuite with MLlibTestSparkContext { | |||
checkFeatureSubsetStrategy(numTrees = 1, "log2", | |||
(math.log(numFeatures) / math.log(2)).ceil.toInt) | |||
checkFeatureSubsetStrategy(numTrees = 1, "onethird", (numFeatures / 3.0).ceil.toInt) | |||
checkFeatureSubsetStrategy(numTrees = 1, "0.1", (0.1 * numFeatures).ceil.toInt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a particular reason these test cases differ from the Java ones? I notice in the Java tests we also test some of the regex like ".1" and ".10" and "0.10".
I'm wondering if we shouldn't just have a couple test cases for the regex edge cases here (to ensure it gets translated correctly).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @MLnick That might be because of the repetitive lines to copy around. I consolidated the test cases so that it is easier to track what are covered. Let me know if additional test cases are needed.
Consolidate test cases so that both Java and Scala are properly covered.
Test build #55121 has finished for PR 11989 at commit
|
Test build #55122 has finished for PR 11989 at commit
|
for (String strategy: realStrategies) { | ||
rf.setFeatureSubsetStrategy(strategy); | ||
} | ||
String integerStrategies[] = {"1", "10", "100", "1000", "10000"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Passing 0
should round up to 1
, yes? We should test this edge case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, what happens with negative values? Those should not be allowed - just want to confirm the regex excludes that (we should add some test cases)
Hi @MLnick, here is the complete list of the scenarios: Real numbers (has to contain
Integer numbers (should not container
Let me take a look at the not allowed case and see if I could add test cases to cover it. |
Add test cases to cover invalid strategies.
Hi @MLnick I updated the pull request to add additional test cases to cover invalid values. Let me know if there is any other issues. Thanks. |
jenkins retest this please |
Test build #55606 has finished for PR 11989 at commit
|
@yongtang Sorry that I just found this was merged. I think it might be better to use |
@mengxr Sure I will create a pull request shortly. Thanks. |
What changes were proposed in this pull request?
This PR tries to support more options for feature subset size in RandomForest implementation. Previously, RandomForest only support "auto", "all", "sort", "log2", "onethird". This PR tries to support any given value to allow model search.
In this PR,
featureSubsetStrategy
could be passed with:a) a real number in the range of
(0.0-1.0]
that represents the fraction of the number of features in each subset,b) an integer number (
>0
) that represents the number of features in each subset.How was this patch tested?
Two tests
JavaRandomForestClassifierSuite
andJavaRandomForestRegressorSuite
have been updated to check the additional options for params in this PR.An additional test has been added to
org.apache.spark.mllib.tree.RandomForestSuite
to cover the cases in this PR.