[SPARK-15699] [ML] Implement a Chi-Squared test statistic option for measuring split quality #13440

Open
@erikerlandson
Contributor

What changes were proposed in this pull request?

Using a test statistic to measure decision tree split quality also provides a natural criterion for halting splits, which can yield improved model quality. I propose adding the chi-squared test statistic as a new impurity option (in addition to "gini" and "entropy") for classification decision trees and ensembles.

https://issues.apache.org/jira/browse/SPARK-15699

http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/
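
To illustrate the idea, here is a minimal sketch in plain Python (not the code in this patch) of how a chi-squared statistic could be computed for one candidate split. It assumes the split is summarized as per-class label counts in the left and right child nodes, and it computes the standard Pearson chi-squared statistic on that 2 x K contingency table. The function name `chi_squared_split_statistic` and its signature are hypothetical and are not part of Spark's API.

```python
from typing import Sequence, Tuple


def chi_squared_split_statistic(
    left: Sequence[float], right: Sequence[float]
) -> Tuple[float, int]:
    """Pearson chi-squared statistic for a 2 x K table of class counts.

    `left` and `right` hold the per-class label counts in the two child
    nodes of a candidate split (hypothetical inputs for illustration).
    Returns (statistic, degrees_of_freedom).
    """
    if len(left) != len(right):
        raise ValueError("left and right must have one count per class")

    n_left = sum(left)
    n_right = sum(right)
    total = n_left + n_right

    # A split that leaves one child empty carries no evidence of association.
    if n_left == 0 or n_right == 0:
        return 0.0, 0

    statistic = 0.0
    classes_present = 0
    for l_count, r_count in zip(left, right):
        class_total = l_count + r_count
        if class_total == 0:
            continue  # skip classes that do not occur in this node
        classes_present += 1
        # Expected counts if class labels were independent of the split side.
        expected_left = n_left * class_total / total
        expected_right = n_right * class_total / total
        statistic += (l_count - expected_left) ** 2 / expected_left
        statistic += (r_count - expected_right) ** 2 / expected_right

    # (rows - 1) * (cols - 1) with 2 rows and `classes_present` columns.
    dof = max(classes_present - 1, 0)
    return statistic, dof


# A split that separates two classes perfectly gives a large statistic,
# while a split that leaves the class mix unchanged gives zero.
print(chi_squared_split_statistic([10, 0], [0, 10]))  # (20.0, 1)
print(chi_squared_split_statistic([5, 5], [5, 5]))    # (0.0, 1)
```

A larger statistic (equivalently, a smaller p-value under the chi-squared distribution with the returned degrees of freedom) indicates stronger evidence that the split separates the classes. A split whose p-value exceeds a chosen significance threshold could then be rejected, which is what makes the measure usable as a halting criterion.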

How was this patch tested?

I added unit tests to verify that the chi-squared "impurity" measure functions as expected when used for decision tree training.

@erikerlandson
Contributor

This is a re-submission of #13438 to fix the target branch.

@SparkQA
SparkQA commented Jun 1, 2016

Test build #59740 has finished for PR 13440 at commit 04c1316.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA
SparkQA commented Jun 1, 2016

Test build #59745 has finished for PR 13440 at commit 1136518.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA
SparkQA commented Jun 1, 2016

Test build #59751 has finished for PR 13440 at commit 6d38cfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA
SparkQA commented Aug 23, 2016

Test build #64309 has finished for PR 13440 at commit 6d38cfd.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.
@holdenk
Contributor
holdenk commented Oct 7, 2016

Is this something you're still working on? If so, it would be good to merge in the latest master. We can also check with @jkbradley to see if he has some review bandwidth.

@erikerlandson
Contributor

@holdenk yes, I'll rebase it this week.

@SparkQA
SparkQA commented Oct 10, 2016

Test build #66679 has finished for PR 13440 at commit b199ae3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA
SparkQA commented Oct 11, 2016

Test build #66756 has finished for PR 13440 at commit 83f5e83.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@erikerlandson
Contributor

test this please

@SparkQA
SparkQA commented Oct 12, 2016

Test build #66766 has finished for PR 13440 at commit 83f5e83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@erikerlandson
Contributor

@holdenk @jkbradley looks like it's clean again

@wangmiao1981
Contributor

@erikerlandson Are you still working on this PR? Thanks! Miao

@erikerlandson
Contributor

Hi @wangmiao1981,

I am still interested in this, but I don't have any sense about whether upstream has any interest. Does upstream have any intention to accept it?

@SparkQA
SparkQA commented Feb 16, 2017

Test build #73006 has started for PR 13440 at commit 61cbf7c.

@shaneknapp
Contributor

i stopped the build as i need to restart jenkins... i'll retrigger this when we're back up and running.

@wangmiao1981
Contributor

@erikerlandson I am just helping clear the stale PRs. :) I have no idea whether they intend to accept it.

@shaneknapp
Contributor

test this please

@SparkQA
SparkQA commented Feb 16, 2017

Test build #73008 has finished for PR 13440 at commit 61cbf7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@wangmiao1981
Contributor

@thunterdb Can you take a look? Thanks!
