[SPARK-15699] [ML] Implement a Chi-Squared test statistic option for measuring split quality #13440
What changes were proposed in this pull request?
Using a statistical test as the measure of decision tree split quality provides a principled split-halting criterion that can yield improved model quality. I am proposing to add the chi-squared test statistic as a new impurity option (in addition to "gini" and "entropy") for classification decision trees and ensembles.
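To make the proposal concrete, here is a minimal sketch of a chi-squared statistic computed over the class counts on the two sides of a candidate binary split. The object and method names are hypothetical illustrations, not the actual Spark implementation:

```scala
object ChiSquaredSplit {
  // Chi-squared statistic for a binary split, computed from the
  // per-class counts in the left and right child populations.
  // (Hypothetical sketch, not the Spark ML code.)
  def chiSquared(left: Array[Double], right: Array[Double]): Double = {
    val leftTotal = left.sum
    val rightTotal = right.sum
    val total = leftTotal + rightTotal
    // Marginal count of each class across both children
    val classTotals = left.zip(right).map { case (l, r) => l + r }
    var stat = 0.0
    for (k <- classTotals.indices) {
      // Expected counts under the null hypothesis that the split
      // is independent of the class label
      val expL = leftTotal * classTotals(k) / total
      val expR = rightTotal * classTotals(k) / total
      if (expL > 0) stat += math.pow(left(k) - expL, 2) / expL
      if (expR > 0) stat += math.pow(right(k) - expR, 2) / expR
    }
    stat
  }

  def main(args: Array[String]): Unit = {
    // A perfectly separating split: every left sample is class 0,
    // every right sample is class 1
    println(chiSquared(Array(10.0, 0.0), Array(0.0, 10.0)))
  }
}
```

For a perfectly separating split the statistic equals the total population size (here 20.0), the maximum attainable value for a 2x2 table.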
How was this patch tested?
I added unit testing to verify that the chi-squared "impurity" measure functions as expected when used for decision tree training.
@srowen I discuss some of these questions in the blog post, but the tl;dr is that split quality measures based on statistical tests that yield p-values are in some senses "less arbitrary." Specifying a p-value as a split-halting condition has essentially the same semantics regardless of the test. Most such tests also intrinsically account for decreasing population sizes: as the splitting progresses and populations shrink, it takes a larger and larger population difference to meet the p-value threshold.
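The population-size effect described above can be illustrated with the same kind of chi-squared computation: holding the class proportions fixed, the statistic scales linearly with the node's population, so a smaller population needs a proportionally larger class imbalance to clear a fixed significance threshold. (The helper below is a hypothetical sketch, not the Spark code.)

```scala
object SampleSizeEffect {
  // Same chi-squared statistic as a standalone helper (hypothetical).
  def chiSquared(left: Array[Double], right: Array[Double]): Double = {
    val leftTotal = left.sum
    val rightTotal = right.sum
    val total = leftTotal + rightTotal
    val classTotals = left.zip(right).map { case (l, r) => l + r }
    var stat = 0.0
    for (k <- classTotals.indices) {
      val expL = leftTotal * classTotals(k) / total
      val expR = rightTotal * classTotals(k) / total
      if (expL > 0) stat += math.pow(left(k) - expL, 2) / expL
      if (expR > 0) stat += math.pow(right(k) - expR, 2) / expR
    }
    stat
  }

  def main(args: Array[String]): Unit = {
    // Identical 60/40 vs 40/60 class mix at two population sizes
    val small = chiSquared(Array(6.0, 4.0), Array(4.0, 6.0))    // n = 20
    val large = chiSquared(Array(60.0, 40.0), Array(40.0, 60.0)) // n = 200
    println(small) // 0.8
    println(large) // 8.0
  }
}
```

With one degree of freedom, the p = 0.05 critical value is about 3.84: the small node (statistic 0.8) would stop splitting while the large node (statistic 8.0) would split, even though the class proportions are identical.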
On the more pragmatic side, in that post I also demonstrate chi-squared split quality generating a more parsimonious tree than the other metrics, one that does a better job of ignoring poor-quality features.