
[SPARK-31185][ML] Implement VarianceThresholdSelector #27954

Closed

huaxingao wants to merge 5 commits into master from huaxingao:variance-threshold

Conversation
Conversation

huaxingao (Contributor)

What changes were proposed in this pull request?

Implement a feature selector that removes all low-variance features. Features with a variance lower than the threshold will be removed. The default is to keep all features with non-zero variance, i.e., remove the features that have the same value in all samples.
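
For illustration, a rough usage sketch of the new selector (setter and column names as in this PR's final API; an active SparkSession named spark is assumed):

import org.apache.spark.ml.feature.VarianceThresholdSelector
import org.apache.spark.ml.linalg.Vectors

// Two rows, four features; the per-feature variances are 0.25, 1.0, 1.0 and 4.0.
val df = spark.createDataFrame(Seq(
  (1, Vectors.dense(0.0, 2.0, 7.0, 1.0)),
  (2, Vectors.dense(1.0, 4.0, 9.0, 5.0))
)).toDF("id", "features")

// Keep only the features whose variance is greater than the threshold.
val selector = new VarianceThresholdSelector()
  .setVarianceThreshold(0.25)
  .setFeaturesCol("features")
  .setOutputCol("selectedFeatures")

selector.fit(df).transform(df).show(false)

With this data and threshold, only the last three features are kept, which matches the scikit-learn behaviour shown later in this thread.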

Why are the changes needed?

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. The idea is that when a feature doesn't vary much within itself, it generally has very little predictive power.
scikit-learn has implemented this selector:
https://scikit-learn.org/stable/modules/feature_selection.html#variance-threshold

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Add new test suite.

@huaxingao huaxingao changed the title [SPARK-31885][ML] Implement VarianceThresholdSelector [SPARK-31185][ML] Implement VarianceThresholdSelector Mar 19, 2020
@huaxingao (Contributor, Author)

This selector shares quite a bit of code with the other selectors; I will refactor all of the selectors in #27882.

// if varianceThreshold not set, remove the features that have the same value in all samples.
val features = if (!isSet(varianceThreshold)) {
// use max and min to avoid numeric precision issues for constant features
result.filter { case (((vari, max), min), _) => ((max != min) && (vari != 0)) }
huaxingao (Contributor, Author)

I followed the scikit-learn implementation in using max and min to avoid numeric precision issues for constant features.
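
As a rough standalone sketch (hypothetical numbers, not from this PR) of why an exact max == min check is more robust than testing variance == 0.0 for a constant feature:

// A constant feature; mathematically its variance is exactly 0.
val xs = Array.fill(1000)(0.1)

// One-pass formula E[x^2] - E[x]^2, similar in spirit to what a streaming
// summarizer computes; it is prone to cancellation and rounding error.
val meanOfSquares = xs.map(x => x * x).sum / xs.length
val mean = xs.sum / xs.length
val variance = meanOfSquares - mean * mean

println(variance)          // may be a tiny non-zero value rather than exactly 0.0
println(xs.max == xs.min)  // true: an exact test for a truly constant column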

@SparkQA

SparkQA commented Mar 19, 2020

Test build #120015 has finished for PR 27954 at commit 5016cf9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 19, 2020

Test build #120022 has finished for PR 27954 at commit 1455987.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val result = variances.toArray.zip(maxs.toArray).zip(mins.toArray).zipWithIndex
// if varianceThreshold not set, remove the features that have the same value in all samples.
val features = if (!isSet(varianceThreshold)) {
Contributor

Why not give varianceThreshold a default value of 0?

huaxingao (Contributor, Author)

I thought we keep the features with variance >= threshold, so I can't default it to 0 (features with variance 0 would be kept too).
Maybe I should change the definition to "keep the features with variance > threshold"?

huaxingao (Contributor, Author)

I will change it to "keep the features with variance > threshold". I looked at the sklearn code and did a quick test; sklearn removes the features with variance <= threshold.

    def _get_support_mask(self):
        check_is_fitted(self)

        return self.variances_ > self.threshold
>>> import numpy as np
>>> from sklearn.feature_selection import VarianceThreshold
>>> data = [[0, 2, 7, 1],
...         [1, 4, 9, 5]]
>>> np.var([0, 1])
0.25
>>> np.var([2, 4])
1.0
>>> np.var([7, 9])
1.0
>>> np.var([1, 5])
4.0
>>> VarianceThreshold(threshold=0.2499).fit_transform(data)
array([[0, 2, 7, 1],
       [1, 4, 9, 5]])
>>> VarianceThreshold(threshold=0.25).fit_transform(data)
array([[2, 7, 1],
       [4, 9, 5]])
>>> VarianceThreshold(threshold=0.9999).fit_transform(data)
array([[2, 7, 1],
       [4, 9, 5]])
>>> VarianceThreshold(threshold=1.0).fit_transform(data)
array([[1],
       [5]])

result.filter { case (((vari, _), _), _) => !(vari < getVarianceThreshold) }
}

val indices = features.map { case (((_, _), _), index) => index }
Contributor

we can simplify this logic like:

val numFeatures = max.size
val indices = Array.tabulate(numFeatures) { i => 
    (i, if (max(i) == min(i)) 0.0 else variance(i))
}.filter(_._2 >= getVarianceThreshold).map(_._1)

Contributor

I guess in sklearn the max == min check is just there to make the variance computation of constant values more stable?

huaxingao (Contributor, Author)

Yes. It seems to be a way of getting around the floating-point comparison issue.

@SparkQA

SparkQA commented Mar 20, 2020

Test build #120113 has finished for PR 27954 at commit 886b2dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* Feature selector that removes all low-variance features. Features with a
* variance lower than the threshold will be removed. The default is to keep
Contributor

'lower' -> 'not greater than'


val numFeatures = maxs.size
val indices = Array.tabulate(numFeatures) { i =>
(i, if (maxs(i) == mins(i)) 0.0 else variances(i))
Contributor

I'd like to keep the previous comment here:
"# Use peak-to-peak to avoid numeric precision issues for constant features"

val numFeatures = maxs.size
val indices = Array.tabulate(numFeatures) { i =>
(i, if (maxs(i) == mins(i)) 0.0 else variances(i))
} .filter(_._2 > getVarianceThreshold).map(_._1)
Contributor

nit: no space after '}'
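
Putting the reviewed pieces together, a minimal self-contained sketch of the selection logic discussed above (plain Scala arrays stand in for the summarizer output; the values are the example columns from the sklearn test earlier in the thread):

// Per-feature statistics as they would come back from the summarizer.
val variances = Array(0.25, 1.0, 1.0, 4.0)
val maxs      = Array(1.0, 4.0, 9.0, 5.0)
val mins      = Array(0.0, 2.0, 7.0, 1.0)
val varianceThreshold = 0.25

// Use peak-to-peak (max == min) to force the variance of constant features to
// exactly 0.0, then keep only the features with variance strictly greater than
// the threshold.
val numFeatures = maxs.length
val indices = Array.tabulate(numFeatures) { i =>
  (i, if (maxs(i) == mins(i)) 0.0 else variances(i))
}.filter(_._2 > varianceThreshold).map(_._1)

println(indices.mkString(", "))  // 1, 2, 3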

@SparkQA

SparkQA commented Mar 21, 2020

Test build #120121 has finished for PR 27954 at commit 3f5ecef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor)

Merged to master

@huaxingao (Contributor, Author)

Thank you very much!

@huaxingao huaxingao deleted the variance-threshold branch March 22, 2020 06:02
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020

Closes apache#27954 from huaxingao/variance-threshold.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>