[SPARK-15699] [ML] Implement a Chi-Squared test statistic option for measuring split quality #13440

Open: wants to merge 6 commits into base: master

Contributor

erikerlandson commented Jun 1, 2016

What changes were proposed in this pull request?

Using test statistics as a measure of decision tree split quality is a useful split halting measure that can yield improved model quality. I am proposing to add the chi-squared test statistic as a new impurity option (in addition to "gini" and "entropy") for classification decision trees and ensembles.

https://issues.apache.org/jira/browse/SPARK-15699

http://erikerlandson.github.io/blog/2016/05/26/measuring-decision-tree-split-quality-with-test-statistic-p-values/

How was this patch tested?

I added unit testing to verify that the chi-squared "impurity" measure functions as expected when used for decision tree training.

Contributor

erikerlandson commented Jun 1, 2016

This is a re-submission of #13438 to fix target branch

SparkQA commented Jun 1, 2016

Test build #59740 has finished for PR 13440 at commit 04c1316.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 1, 2016

Test build #59745 has finished for PR 13440 at commit 1136518.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jun 1, 2016

Test build #59751 has finished for PR 13440 at commit 6d38cfd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 23, 2016

Test build #64309 has finished for PR 13440 at commit 6d38cfd.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

Contributor

holdenk commented Oct 7, 2016

Is this something you're still working on? If so, it would be good to merge in the latest master. We can also check with @jkbradley to see if he has some review bandwidth.

Contributor

erikerlandson commented Oct 9, 2016

@holdenk yes, I'll rebase it this week.

SparkQA commented Oct 10, 2016

Test build #66679 has finished for PR 13440 at commit b199ae3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 11, 2016

Test build #66756 has finished for PR 13440 at commit 83f5e83.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

erikerlandson commented Oct 11, 2016

test this please

SparkQA commented Oct 12, 2016

Test build #66766 has finished for PR 13440 at commit 83f5e83.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

erikerlandson commented Oct 12, 2016

@holdenk @jkbradley looks like it's clean again

Contributor

wangmiao1981 commented Feb 16, 2017

@erikerlandson Are you still working on this PR? Thanks! Miao

Contributor

erikerlandson commented Feb 16, 2017

Hi @wangmiao1981,

I am still interested in this, but I don't have any sense about whether upstream has any interest. Does upstream have any intention to accept it?

SparkQA commented Feb 16, 2017

Test build #73006 has started for PR 13440 at commit 61cbf7c.

Contributor

shaneknapp commented Feb 16, 2017

i stopped the build as i need to restart jenkins... i'll retrigger this when we're back up and running.

Contributor

wangmiao1981 commented Feb 16, 2017

@erikerlandson I am just helping clear the stale PRs. :) I have no idea whether they intend to accept it.

Contributor

shaneknapp commented Feb 16, 2017

test this please

SparkQA commented Feb 16, 2017

Test build #73008 has finished for PR 13440 at commit 61cbf7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

wangmiao1981 commented Feb 21, 2017

@thunterdb Can you take a look? Thanks!

* Utility functions for Impurity measures
*/
@Since("2.0.0")
@DeveloperApi

thunterdb Mar 7, 2017

Contributor

there is no need for this object to be publicly exposed?

erikerlandson Mar 8, 2017

Contributor

I don't think so. I don't recall any specific motivation to keep it private, but historically Spark seems to default things to "minimum visibility." The only method currently defined here is an implementation detail for hacking p-values into the existing 'gain' system, where larger is assumed to be better.

Contributor

erikerlandson commented Apr 13, 2017

@thunterdb apologies for the latency. I removed the default method defs and rebased.

SparkQA commented Apr 13, 2017

Test build #75782 has finished for PR 13440 at commit 6762a18.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

erikerlandson commented Apr 14, 2017

test this please

SparkQA commented Apr 14, 2017

Test build #75812 has finished for PR 13440 at commit 6762a18.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

erikerlandson commented Apr 16, 2017

@thunterdb not sure what is failing in the CI.

SparkQA commented Apr 17, 2017

Test build #75868 has finished for PR 13440 at commit a75a01b.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 19, 2017

Test build #75923 has finished for PR 13440 at commit 4e367b1.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

erikerlandson commented Apr 19, 2017

@thunterdb still can't diagnose what the source of this "fails to generate doc" error is. I don't see anything wrong with the scaladoc.

SparkQA commented Apr 19, 2017

Test build #75944 has finished for PR 13440 at commit d206c6b.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 21, 2017

Test build #76041 has finished for PR 13440 at commit d2a2381.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

erikerlandson commented Apr 21, 2017

@thunterdb patch is clean again. @HyukjinKwon found my problem.

Contributor

erikerlandson commented May 5, 2017

@thunterdb this is ready for any further review

Contributor

erikerlandson commented May 26, 2017

Ready for review

Contributor

willb commented Jul 21, 2017

@thunterdb can you take a look at this now that 2.2 is out?

Contributor

erikerlandson commented Jul 31, 2017

rebased to latest head of master

SparkQA commented Aug 1, 2017

Test build #80093 has finished for PR 13440 at commit bb2f660.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

willb commented Aug 23, 2017

@HyukjinKwon @thunterdb can you all take a look at this? It's been under review for quite a long time!

Member

HyukjinKwon commented Aug 23, 2017

I don't have enough ML knowledge to review this. I can cc the ML committers who, judging from git blame, probably have the relevant expertise, but I hope some others here can leave sign-offs first.

Member

srowen commented Aug 24, 2017

I've not seen chi-squared used as a split statistic; when is it theoretically better than entropy, or something a bit more fundamental like KL divergence? It makes some sense but does require some assumptions about the data.

Contributor

erikerlandson commented Aug 24, 2017

@srowen I discuss some of these questions in the blog post, but the tl;dr is that split quality measures based on statistical tests with p-values are in some sense "less arbitrary." Specifying a p-value as a split quality halting condition has essentially the same semantics regardless of the test. Most such tests also intrinsically take into account decreasing population sizes: as the splitting progresses and population sizes decrease, it inherently takes a larger and larger population difference to meet the p-value threshold.

On the more pragmatic side, in that post I also demonstrate chi-squared split quality generating a more parsimonious tree than other metrics, which does a better job at ignoring poor quality features.
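
The population-size effect described above can be illustrated with a small sketch. This is not the code from the PR: the object name `ChiSquaredSplitDemo` and its helper are made up for illustration, and it compares the Pearson chi-squared statistic of a 2x2 table against the standard 0.05 critical value for one degree of freedom (3.841) instead of computing a p-value.

```scala
object ChiSquaredSplitDemo {
  // Critical value of the chi-squared distribution with 1 degree of freedom
  // at significance level 0.05 (standard table value).
  val Critical05: Double = 3.841

  /** Pearson chi-squared statistic for a 2x2 table of class counts.
    * `left` and `right` hold (class0, class1) counts for the two children
    * of a candidate split. Hypothetical helper, not part of Spark.
    */
  def chiSquared(left: (Long, Long), right: (Long, Long)): Double = {
    val (a, b) = left
    val (c, d) = right
    val n = (a + b + c + d).toDouble
    val denom = (a + b).toDouble * (c + d) * (a + c) * (b + d)
    if (denom == 0.0) 0.0 // degenerate table: an empty row or column
    else {
      val diff = (a * d - b * c).toDouble
      n * diff * diff / denom
    }
  }

  def main(args: Array[String]): Unit = {
    // Same 60/40 versus 40/60 class proportions, at two population sizes.
    val large = chiSquared((60L, 40L), (40L, 60L))
    val small = chiSquared((6L, 4L), (4L, 6L))
    println(s"large populations: chi2 = $large, significant at 0.05: ${large > Critical05}")
    println(s"small populations: chi2 = $small, significant at 0.05: ${small > Critical05}")
  }
}
```

With identical class proportions, the statistic scales with the population size (roughly 8.0 for the larger children versus 0.8 for the smaller ones), so only the larger split clears the 0.05 threshold.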

Member

felixcheung commented Oct 5, 2017

@srowen @thunterdb any more thoughts on this?
how about @sethah @yanboliang @jkbradley?

Contributor

willb commented Oct 18, 2017

I agree with @felixcheung -- @srowen or @thunterdb, can you take a look at this?

jsigee87 commented Apr 8, 2018

Is this still being considered?

SparkQA commented Aug 2, 2018

Test build #94042 has finished for PR 13440 at commit bb2f660.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Aug 21, 2018

Test build #95018 has finished for PR 13440 at commit bb2f660.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

willb commented Sep 17, 2018

@srowen @thunterdb this PR passes all tests and merges cleanly -- can you take another look? It's been open for quite a while now.

// a larger-is-better gain value for the minimum-gain threshold
val minGain =
if (metadata.impurity.isTestStatistic) Impurity.pValToGain(metadata.minInfoGain)
else metadata.minInfoGain

srowen Sep 17, 2018

Member

Kind of a design question here... right now the caller has to switch logic based on what's inside metadata. Can methods like metadata.minInfoGain just implement different logic when the impurity is a test statistic, and so on, pushing this down towards the impurity implementation? I wonder if isTestStatistic can go away with the right API, but I am not familiar with the details of what that requires.

erikerlandson Sep 17, 2018

Contributor

The main issue I recall was that all of the existing metrics assume some kind of "larger is better" gain, and p-values are "smaller is better." I'll take another pass over it and see if I can push that distinction down so it doesn't require exposing new methods.
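
To make the "larger is better" versus "smaller is better" mismatch concrete, here is a minimal sketch built around the `pValToGain` formula quoted in this review. The wrapper object `PValueGainDemo` and the `passes` helper are hypothetical names for illustration, not part of the patch.

```scala
object PValueGainDemo {
  // Same formula as the pValToGain method quoted in this review:
  // clamp the p-value away from zero, then take the negative log,
  // so that a smaller p-value maps to a larger gain.
  def pValToGain(pval: Double): Double = -math.log(math.max(1e-20, pval))

  // Hypothetical helper: a candidate split passes when its gain is at least
  // the gain derived from the user-specified p-value threshold.
  def passes(splitPVal: Double, thresholdPVal: Double): Boolean =
    pValToGain(splitPVal) >= pValToGain(thresholdPVal)

  def main(args: Array[String]): Unit = {
    println(passes(0.001, 0.01)) // p-value below the threshold
    println(passes(0.2, 0.01))   // p-value above the threshold
    println(pValToGain(0.0))     // clamping keeps the gain finite
  }
}
```

Because the mapping is monotonically decreasing, comparing converted gains is equivalent to comparing p-values in the opposite direction, which is why the p-value threshold can be expressed as a minimum gain.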

* :: Experimental ::
* Class for calculating Chi Squared as a split quality metric during binary classification.
*/
@Since("2.2.0")

srowen Sep 17, 2018

Member

This will have to be 2.5.0 for the moment

erikerlandson Sep 17, 2018

Contributor

I'll update those. 3.0 might be a good target, especially if I can't do this without new isTestStatistic

* Get this impurity instance.
* This is useful for passing impurity parameters to a Strategy in Java.
*/
@Since("1.1.0")

srowen Sep 17, 2018

Member

I think I'd label all these as Since 2.5.0 even if they override a method that existed earlier.

*/
@Since("2.0.0")
@DeveloperApi
def pValToGain(pval: Double): Double = -math.log(math.max(1e-20, pval))

srowen Sep 17, 2018

Member

private to spark?

*/
@Since("2.2.0")
@DeveloperApi
def isTestStatistic: Boolean

srowen Sep 17, 2018

Member

Adding methods to a public trait is technically an API breaking change. This might be considered a Developer API even though it's not labeled that way. Still if we can avoid adding to the API here, it'd be better.

erikerlandson Sep 17, 2018

Contributor

Can this be customized or extended externally to spark? I'm wondering why it is public.

*/
@Since("2.2.0")
@DeveloperApi
def calculate(calcL: ImpurityCalculator, calcR: ImpurityCalculator): Double =

srowen Sep 17, 2018

Member

It looks like this new method doesn't make sense to implement for existing implementations, only the new one. That kind of suggests to me it isn't part of the generic API for an impurity. Is this really something that belongs inside the logic of the implementations? maybe there's a more general method that needs to be exposed, that can then be specialized for all implementations.

erikerlandson Sep 17, 2018

Contributor

I'll consider if there's a unifying idea here. pval-based metrics require integrating information across the new split children, which I believe was not the case for existing methods.

erikerlandson Sep 17, 2018

Contributor

I suspect that the generalization is closer to my newer signature
val pval = imp.calculate(leftImpurityCalculator, rightImpurityCalculator)
where you have all the context from the left and right nodes. The existing gain-based calculation should fit into this framework, just doing its current weighted average of purity gain.

erikerlandson Sep 19, 2018

Contributor

@srowen @willb
I cached the design of the metrics back in. In general, Impurity already uses methods that are only defined on certain impurity sub-classes, and so this new method does not change that situation.

My take on the "problem" is that the existing measures are all based on a localized concept of "purity" (or impurity) that can be calculated using only the data at a single node. Splitting based on statistical tests (p-values) breaks that model, since it is making use of a more generalized concept of split quality that requires the sample populations of both children from a candidate split. A maximally general signature would probably involve the parent and both children.

Another kink in the current design is that ImpurityCalculator is essentially parallel to Impurity, and in fact ImpurityCalculator#calculate() is how impurity measures are currently requested. Impurity seems somewhat redundant, and might be factored out in favor of ImpurityCalculator. The current signature calculate() might be generalized into a more inclusive concept of split quality that expects to make use of {parent,left,right}.

Calls to calculate() are not very widespread, but threading that change through is outside the scope of this particular PR. If people are interested in that kind of refactoring, I could look into it in the near future, but probably not in the next couple of weeks.

That kind of change would also be API breaking and so a good target for 3.0
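
One way the {parent,left,right} generalization discussed above might look is sketched below. Everything here is hypothetical: `NodeStats`, `SplitQuality`, `GiniGain` and `ChiSquaredStatistic` are illustrative names, not Spark's `Impurity` or `ImpurityCalculator` API, and the sketch assumes "larger is better" for every measure.

```scala
object SplitQualityDemo {
  // Hypothetical stand-in for per-node class counts (not Spark's ImpurityCalculator).
  final case class NodeStats(counts: Vector[Double]) {
    def total: Double = counts.sum
  }

  // Hypothetical generalized interface: larger values mean a better split.
  trait SplitQuality {
    def quality(parent: NodeStats, left: NodeStats, right: NodeStats): Double
  }

  // A node-local impurity (Gini) expressed in the generalized form:
  // parent impurity minus the weighted average of the child impurities.
  object GiniGain extends SplitQuality {
    private def gini(s: NodeStats): Double =
      if (s.total == 0.0) 0.0
      else 1.0 - s.counts.map(c => (c / s.total) * (c / s.total)).sum

    def quality(parent: NodeStats, left: NodeStats, right: NodeStats): Double = {
      val n = parent.total
      if (n == 0.0) 0.0
      else gini(parent) - (left.total / n) * gini(left) - (right.total / n) * gini(right)
    }
  }

  // A test-statistic measure that needs both children at once: the Pearson
  // chi-squared statistic of the (left/right) by (class) contingency table.
  object ChiSquaredStatistic extends SplitQuality {
    def quality(parent: NodeStats, left: NodeStats, right: NodeStats): Double = {
      val n = parent.total
      val terms = for {
        child <- Seq(left, right)
        (observed, k) <- child.counts.zipWithIndex
        expected = child.total * parent.counts(k) / n
        if expected > 0.0 // skip cells with no expected mass
      } yield (observed - expected) * (observed - expected) / expected
      terms.sum
    }
  }
}
```

In this shape the gain-based measure only uses the three nodes to form its weighted average, while the chi-squared measure genuinely needs the joint left/right populations. A p-value-based variant would additionally convert the statistic to a p-value and then to a gain.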
