
[SPARK-31185][ML] Implement VarianceThresholdSelector #27954

Closed

huaxingao wants to merge 5 commits into master from huaxingao:variance-threshold

Conversation
Conversation

huaxingao (Contributor)

What changes were proposed in this pull request?

Implement a feature selector that removes all low-variance features. Features with a variance lower than the threshold will be removed. The default is to keep all features with non-zero variance, i.e., remove the features that have the same value in all samples.
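
For illustration, a rough usage sketch of the new selector (setter and column names as in this PR's final API; an active SparkSession named spark is assumed):

import org.apache.spark.ml.feature.VarianceThresholdSelector
import org.apache.spark.ml.linalg.Vectors

// Two rows, four features; the per-feature variances are 0.25, 1.0, 1.0 and 4.0.
val df = spark.createDataFrame(Seq(
  (1, Vectors.dense(0.0, 2.0, 7.0, 1.0)),
  (2, Vectors.dense(1.0, 4.0, 9.0, 5.0))
)).toDF("id", "features")

// Keep only the features whose variance is greater than the threshold.
val selector = new VarianceThresholdSelector()
  .setVarianceThreshold(0.25)
  .setFeaturesCol("features")
  .setOutputCol("selectedFeatures")

selector.fit(df).transform(df).show(false)

With this data and threshold, only the last three features are kept, which matches the scikit-learn behaviour shown later in this thread.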

Why are the changes needed?

VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn't meet some threshold. The idea is that when a feature doesn't vary much within itself, it generally has very little predictive power.
scikit-learn has implemented this selector:
https://scikit-learn.org/stable/modules/feature_selection.html#variance-threshold

Does this PR introduce any user-facing change?

Yes.

How was this patch tested?

Add new test suite.

@huaxingao huaxingao changed the title [SPARK-31885][ML] Implement VarianceThresholdSelector [SPARK-31185][ML] Implement VarianceThresholdSelector Mar 19, 2020
@huaxingao (Contributor, Author)

This selector shares quite a bit of code with the other selectors; I will refactor all of the selectors in #27882.

// if varianceThreshold not set, remove the features that have the same value in all samples.
val features = if (!isSet(varianceThreshold)) {
// use max and min to avoid numeric precision issues for constant features
result.filter { case (((vari, max), min), _) => ((max != min) && (vari != 0)) }
huaxingao (Contributor, Author)

I followed the scikit-learn implementation in using max and min to avoid numeric precision issues for constant features.
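
As a rough standalone sketch (hypothetical numbers, not from this PR) of why an exact max == min check is more robust than testing variance == 0.0 for a constant feature:

// A constant feature; mathematically its variance is exactly 0.
val xs = Array.fill(1000)(0.1)

// One-pass formula E[x^2] - E[x]^2, similar in spirit to what a streaming
// summarizer computes; it is prone to cancellation and rounding error.
val meanOfSquares = xs.map(x => x * x).sum / xs.length
val mean = xs.sum / xs.length
val variance = meanOfSquares - mean * mean

println(variance)          // may be a tiny non-zero value rather than exactly 0.0
println(xs.max == xs.min)  // true: an exact test for a truly constant column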

@SparkQA

SparkQA commented Mar 19, 2020

Test build #120015 has finished for PR 27954 at commit 5016cf9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 19, 2020

Test build #120022 has finished for PR 27954 at commit 1455987.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val result = variances.toArray.zip(maxs.toArray).zip(mins.toArray).zipWithIndex
// if varianceThreshold not set, remove the features that have the same value in all samples.
val features = if (!isSet(varianceThreshold)) {
Contributor

Why not give varianceThreshold a default value of 0?

huaxingao (Contributor, Author)

I thought we keep the features with variance >= threshold, so I can't default it to 0 (features with variance 0 would be kept too).
Maybe I should change the definition to "keep the features with variance > threshold"?

huaxingao (Contributor, Author)

I will change it to "keep the features with variance > threshold". I looked at the sklearn code and did a quick test; sklearn removes the features with variance <= threshold.

    def _get_support_mask(self):
        check_is_fitted(self)

        return self.variances_ > self.threshold
>>> import numpy as np
>>> from sklearn.feature_selection import VarianceThreshold
>>> data = [[0, 2, 7, 1],
...         [1, 4, 9, 5]]
>>> np.var([0, 1])
0.25
>>> np.var([2, 4])
1.0
>>> np.var([7, 9])
1.0
>>> np.var([1, 5])
4.0
>>> VarianceThreshold(threshold=0.2499).fit_transform(data)
array([[0, 2, 7, 1],
       [1, 4, 9, 5]])
>>> VarianceThreshold(threshold=0.25).fit_transform(data)
array([[2, 7, 1],
       [4, 9, 5]])
>>> VarianceThreshold(threshold=0.9999).fit_transform(data)
array([[2, 7, 1],
       [4, 9, 5]])
>>> VarianceThreshold(threshold=1.0).fit_transform(data)
array([[1],
       [5]])

result.filter { case (((vari, _), _), _) => !(vari < getVarianceThreshold) }
}

val indices = features.map { case (((_, _), _), index) => index }
Contributor

we can simplify this logic like:

val numFeatures = max.size
val indices = Array.tabulate(numFeatures) { i => 
    (i, if (max(i) == min(i)) 0.0 else variance(i))
}.filter(_._2 >= getVarianceThreshold).map(_._1)

Contributor

I guess in sklearn the max == min check is just there to make the variance computation of constant values more stable?

huaxingao (Contributor, Author)

Yes. It seems to be a way of getting around the floating-point comparison issue.

@SparkQA

SparkQA commented Mar 20, 2020

Test build #120113 has finished for PR 27954 at commit 886b2dc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


/**
* Feature selector that removes all low-variance features. Features with a
* variance lower than the threshold will be removed. The default is to keep
Contributor

'lower' -> 'not greater than'


val numFeatures = maxs.size
val indices = Array.tabulate(numFeatures) { i =>
(i, if (maxs(i) == mins(i)) 0.0 else variances(i))
Contributor

I'd like to keep the previous comment here:
"# Use peak-to-peak to avoid numeric precision issues for constant features"

val numFeatures = maxs.size
val indices = Array.tabulate(numFeatures) { i =>
(i, if (maxs(i) == mins(i)) 0.0 else variances(i))
} .filter(_._2 > getVarianceThreshold).map(_._1)
Contributor

nit: no space after '}'
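
Putting the reviewed pieces together, a minimal self-contained sketch of the selection logic discussed above (plain Scala arrays stand in for the summarizer output; the values are the example columns from the sklearn test earlier in the thread):

// Per-feature statistics as they would come back from the summarizer.
val variances = Array(0.25, 1.0, 1.0, 4.0)
val maxs      = Array(1.0, 4.0, 9.0, 5.0)
val mins      = Array(0.0, 2.0, 7.0, 1.0)
val varianceThreshold = 0.25

// Use peak-to-peak (max == min) to force the variance of constant features to
// exactly 0.0, then keep only the features with variance strictly greater than
// the threshold.
val numFeatures = maxs.length
val indices = Array.tabulate(numFeatures) { i =>
  (i, if (maxs(i) == mins(i)) 0.0 else variances(i))
}.filter(_._2 > varianceThreshold).map(_._1)

println(indices.mkString(", "))  // 1, 2, 3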

@SparkQA

SparkQA commented Mar 21, 2020

Test build #120121 has finished for PR 27954 at commit 3f5ecef.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng (Contributor)

Merged to master

@huaxingao (Contributor, Author)

Thank you very much!

@huaxingao huaxingao deleted the variance-threshold branch March 22, 2020 06:02
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020

Closes apache#27954 from huaxingao/variance-threshold.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>