[SPARK-31185][ML] Implement VarianceThresholdSelector #27954
Conversation
This selector shares quite a bit of code with the other selectors. I will refactor all the selectors in #27882.
// if varianceThreshold not set, remove the features that have the same value in all samples.
val features = if (!isSet(varianceThreshold)) {
  // use max and min to avoid numeric precision issues for constant features
  result.filter { case (((vari, max), min), _) => ((max != min) && (vari != 0)) }
I followed the scikit-learn implementation, using max and min to avoid numeric precision issues for constant features.
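To see why that guard matters, here is a small, hypothetical Python sketch (not the PR's Scala code): a naive one-pass variance of the form E[x²] − E[x]² can suffer cancellation error, so a constant feature is not guaranteed to come out as exactly 0.0, while a max == min check identifies constancy exactly.

```python
# Hypothetical illustration (not the PR's Scala code): a naive one-pass
# variance E[x^2] - E[x]^2 is prone to cancellation error, so a constant
# feature may not get exactly 0.0, while max == min is an exact test.

def one_pass_variance(xs):
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean

constant = [10000.1] * 3                 # a constant feature
print(one_pass_variance(constant))       # may be a tiny nonzero value
print(max(constant) == min(constant))    # exact constancy test: True
```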
Test build #120015 has finished for PR 27954 at commit
Test build #120022 has finished for PR 27954 at commit
val result = variances.toArray.zip(maxs.toArray).zip(mins.toArray).zipWithIndex
// if varianceThreshold not set, remove the features that have the same value in all samples.
val features = if (!isSet(varianceThreshold)) {
Why not give varianceThreshold a default value of 0?
I thought we keep the features >= threshold, so I can't default it to 0 (features with variance 0 would be kept too).
Maybe I should change the definition to "keep the features > threshold"?
I will change it to "keep the features > threshold". I looked at the sklearn code and did a little test; sklearn removes the features with variance <= threshold.
def _get_support_mask(self):
    check_is_fitted(self)
    return self.variances_ > self.threshold
>>> data = [[0, 2, 7, 1],
... [1, 4, 9, 5]]
>>> np.var([0, 1])
0.25
>>> np.var([2, 4])
1.0
>>> np.var([7, 9])
1.0
>>> np.var([1, 5])
4.0
>>> VarianceThreshold(threshold=0.2499).fit_transform(data)
array([[0, 2, 7, 1],
[1, 4, 9, 5]])
>>> VarianceThreshold(threshold=0.25).fit_transform(data)
array([[2, 7, 1],
[4, 9, 5]])
>>> VarianceThreshold(threshold=0.9999).fit_transform(data)
array([[2, 7, 1],
[4, 9, 5]])
>>> VarianceThreshold(threshold=1.0).fit_transform(data)
array([[1],
[5]])
result.filter { case (((vari, _), _), _) => !(vari < getVarianceThreshold) }
}
val indices = features.map { case (((_, _), _), index) => index }
we can simplify this logic like:
val numFeatures = max.size
val indices = Array.tabulate(numFeatures) { i =>
(i, if (max(i) == min(i)) 0.0 else variance(i))
}.filter(_._2 >= getVarianceThreshold).map(_._1)
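For illustration, the suggested tabulate-and-filter logic translates to Python roughly as follows. This is a hypothetical sketch, not the actual Spark code; note that the suggestion above filters with >=, while the sketch below uses strict >, matching the semantics the thread later settled on.

```python
# Rough Python translation (hypothetical, for illustration only) of the
# suggested Scala simplification: score constant features (max == min)
# as 0.0, then keep the indices whose score passes the threshold.

def select_indices(variances, maxs, mins, threshold):
    return [
        i
        for i, v in enumerate(variances)
        if (0.0 if maxs[i] == mins[i] else v) > threshold
    ]

# Per-column stats of the sklearn example data [[0, 2, 7, 1], [1, 4, 9, 5]]
variances = [0.25, 1.0, 1.0, 4.0]
maxs = [1, 4, 9, 5]
mins = [0, 2, 7, 1]
print(select_indices(variances, maxs, mins, 0.25))  # -> [1, 2, 3]
```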
I guess in sklearn checking max == min is just to make the variance computation of constant values more stable?
Yes. It seems sklearn is trying to get around the floating point comparison issue.
Test build #120113 has finished for PR 27954 at commit
/**
 * Feature selector that removes all low-variance features. Features with a
 * variance lower than the threshold will be removed. The default is to keep
'lower' -> 'not greater than'
val numFeatures = maxs.size
val indices = Array.tabulate(numFeatures) { i =>
  (i, if (maxs(i) == mins(i)) 0.0 else variances(i))
I'd like to keep the previous comment here:
"# Use peak-to-peak to avoid numeric precision issues for constant features"
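For reference, "peak-to-peak" is simply max − min per feature; a zero peak-to-peak flags a constant column without relying on the computed variance being exactly 0.0. A small illustrative NumPy sketch (not Spark code):

```python
import numpy as np

# Illustrative only: sklearn-style peak-to-peak (max - min) per column.
# A zero peak-to-peak flags a constant feature exactly, without relying
# on the computed variance being exactly 0.0.
X = np.array([[0.1, 2.0, 7.0],
              [0.1, 4.0, 9.0]])
peak_to_peak = X.max(axis=0) - X.min(axis=0)
constant = peak_to_peak == 0  # [True, False, False]
```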
val numFeatures = maxs.size
val indices = Array.tabulate(numFeatures) { i =>
  (i, if (maxs(i) == mins(i)) 0.0 else variances(i))
} .filter(_._2 > getVarianceThreshold).map(_._1)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: no space after '}'
Test build #120121 has finished for PR 27954 at commit
Merged to master
Thank you very much!
What changes were proposed in this pull request?
Implement a feature selector that removes all low-variance features. Features with a variance lower than the threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples.

Why are the changes needed?
VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. The idea is that when a feature doesn't vary much within itself, it generally has very little predictive power. scikit-learn has implemented this selector: https://scikit-learn.org/stable/modules/feature_selection.html#variance-threshold

Does this PR introduce any user-facing change?
Yes.

How was this patch tested?
Added a new test suite.

Closes apache#27954 from huaxingao/variance-threshold.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
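To summarize the semantics settled in this thread, here is a small pure-Python reference sketch (illustrative names only, not the Spark ML implementation or API): compute per-column population variance, treat constant columns (max == min) as zero-variance, and keep columns with variance strictly greater than the threshold.

```python
# Pure-Python reference sketch of the selector's behaviour (illustrative,
# not the actual Spark ML implementation or API).

def variance(column):
    # population variance, matching np.var's default
    n = len(column)
    mean = sum(column) / n
    return sum((x - mean) ** 2 for x in column) / n

def variance_threshold_select(rows, threshold=0.0):
    columns = list(zip(*rows))
    # constant columns (max == min) are dropped regardless of computed
    # variance; otherwise keep variance strictly above the threshold
    keep = [i for i, col in enumerate(columns)
            if max(col) != min(col) and variance(col) > threshold]
    return [[row[i] for i in keep] for row in rows]

data = [[0, 2, 7, 1],
        [1, 4, 9, 5]]
print(variance_threshold_select(data, 0.25))  # -> [[2, 7, 1], [4, 9, 5]]
print(variance_threshold_select(data, 1.0))   # -> [[1], [5]]
```

This reproduces the sklearn REPL session quoted earlier in the thread, including the strict-inequality behavior at threshold = 0.25.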