Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini #17407

Conversation

facaiy
Copy link
Contributor

@facaiy facaiy commented Mar 24, 2017

Fix bug: DecisionTreeModel can't recongnize Impurity "Gini" when loading

TODO:

  • add unit test
  • fix the bug

@facaiy facaiy force-pushed the BUG/decision_tree_loader_failer_with_Gini_impurity branch from 8951fe1 to 78aa380 Compare March 24, 2017 05:37
@facaiy facaiy changed the title [SPARK-20043][ML][WIP] DecisionTreeModel can't recongnize Impurity "Gini" when loading [SPARK-20043][ML] DecisionTreeModel can't recongnize Impurity "Gini" when loading Mar 24, 2017
@jkbradley
Copy link
Member

Thanks for catching this bug! I recommend we not modify the value when the setter is called, but rather convert to lowercase internally. In this case, it looks like we can just change ImpurityCalculator.getCalculator.

@SparkQA
Copy link

SparkQA commented Mar 25, 2017

Test build #3611 has finished for PR 17407 at commit 604dcd1.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@facaiy
Copy link
Contributor Author

facaiy commented Mar 25, 2017

@jkbradley Thanks. I agree with your advice. Modifying the value is a little aggressive, while changing ImpurityCalculator.getCalculator is more moderate. However, I'm afraid that the similar bugs, because of mismatch, won't leave if we don't fix it at first.

Anyway, I'd like to take a try according to your advice.
However, some unit tests are broken, though I have tested them before pushing. Perhaps there is some wrong with my IDE, and I need to take a check at first on next Monday.

@facaiy facaiy force-pushed the BUG/decision_tree_loader_failer_with_Gini_impurity branch from 604dcd1 to 05c998c Compare March 25, 2017 23:49
@facaiy facaiy force-pushed the BUG/decision_tree_loader_failer_with_Gini_impurity branch from 05c998c to 80f3306 Compare March 25, 2017 23:52
@jkbradley
Copy link
Member

I agree it'd be nice to come up with a generic fix for making String Params robust to case, but I don't have a good solution right now. I'll think about how we might put some generic testing in place...

Copy link
Member

@jkbradley jkbradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! My comments are just for small cleanups.

val rdd = TreeTests.getTreeReadWriteData(sc)

val categoricalData: DataFrame =
TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To simplify this, you can write TreeTests.setMetadata(rdd, Map.empty[Int, Int], numClasses = 2).
Since this is not testing categorical features, there's no need to throw them in.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

val categoricalData: DataFrame =
TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2)

// BUG: see SPARK-20043
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put the JIRA number in the test title.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the comment.

@@ -385,6 +385,20 @@ class DecisionTreeClassifierSuite
testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ Map("maxDepth" -> 0),
allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
}

test("read/write: ImpurityCalculator builder did not recognize impurity type: Gini") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be more specific, how about:
"SAPRK-20043: ImpurityCalculator builder fails for uppercase impurity type Gini in model read/write"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

TreeTests.setMetadata(rdd, Map(0 -> 2, 1 -> 3), numClasses = 2)

// BUG: see SPARK-20043
val dt = new DecisionTreeClassifier().setImpurity("Gini")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this faster, set maxDepth = 2 (something small)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set maxDepth = 2.

@SparkQA
Copy link

SparkQA commented Mar 26, 2017

Test build #3613 has finished for PR 17407 at commit 80f3306.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -385,6 +385,22 @@ class DecisionTreeClassifierSuite
testEstimatorAndModelReadWrite(dt, continuousData, allParamSettings ++ Map("maxDepth" -> 0),
allParamSettings ++ Map("maxDepth" -> 0), checkModelData)
}

test("SAPRK-20043: " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I had a typo: SPARK, not SAPRK

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, copy & paste. thanks for correction.

@facaiy facaiy changed the title [SPARK-20043][ML] DecisionTreeModel can't recongnize Impurity "Gini" when loading [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type GiniImpurityCalculator builder fails for uppercase impurity type Gini Mar 26, 2017
@facaiy facaiy changed the title [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type GiniImpurityCalculator builder fails for uppercase impurity type Gini [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini Mar 26, 2017
@SparkQA
Copy link

SparkQA commented Mar 26, 2017

Test build #3616 has finished for PR 17407 at commit 55fe603.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@facaiy
Copy link
Contributor Author

facaiy commented Mar 26, 2017

@jkbradley Hi, tests passed. Is it good enough to be merged?

By the way, String Params are fragile when saving/loading model, as setParam and getParam methods are useless in such case.

.setMaxDepth(2)

val model = dt.fit(data)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the second unit test seems redundant for this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed.

.setMaxDepth(2)

val model = dt.fit(data)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The blank lines kinda stand out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

delete some blank lines to keep compact.


val model = dt.fit(data)

testDefaultReadWrite(model, false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about setting testParams=true for this case.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@jkbradley
Copy link
Member

LGTM pending tests
Thank you!

@SparkQA
Copy link

SparkQA commented Mar 28, 2017

Test build #3618 has finished for PR 17407 at commit 22ee03d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

Merging with master and branch-2.1

asfgit pushed a commit that referenced this pull request Mar 28, 2017
… for uppercase impurity type Gini

Fix bug: DecisionTreeModel can't recongnize Impurity "Gini" when loading

TODO:
+ [x] add unit test
+ [x] fix the bug

Author: 颜发才(Yan Facai) <facai.yan@gmail.com>

Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity.

(cherry picked from commit 7d432af)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@asfgit asfgit closed this in 7d432af Mar 28, 2017
@facaiy facaiy deleted the BUG/decision_tree_loader_failer_with_Gini_impurity branch March 29, 2017 00:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants