
[SPARK-17645][MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) #15212

Closed
wants to merge 14 commits into from

Conversation

mpjlu

@mpjlu mpjlu commented Sep 23, 2016

What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are popular univariate statistical tests for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. In this PR, FDR uses the Benjamini-Hochberg procedure. https://en.wikipedia.org/wiki/False_discovery_rate
In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypothesis tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add FDR and FWE methods for ChiSqSelector in this PR, as implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
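The two selection rules described above can be sketched outside Spark. This is a minimal illustration, not Spark's implementation: the function names and data are made up, and the per-feature p-values are assumed to come from a chi-squared test computed elsewhere.

```python
def select_fdr(p_values, alpha):
    """Benjamini-Hochberg: find the largest rank k (1-based) with
    p_(k) <= alpha * k / n, then keep the k features with the smallest p-values."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / n:
            max_rank = rank
    return sorted(order[:max_rank])

def select_fwe(p_values, alpha):
    """Bonferroni-style family-wise error control: keep features with p < alpha / n."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / n]
```

For example, with p-values `[0.01, 0.2, 0.03, 0.5]` and `alpha = 0.1`, the FDR rule keeps features 0 and 2, while the more conservative FWE rule (threshold 0.1/4 = 0.025) keeps only feature 0.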

How was this patch tested?

Unit tests will be added soon.

@mpjlu mpjlu changed the title [MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) [SPARK-17645][MLLIB][ML][WIP]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) Sep 23, 2016
@SparkQA

SparkQA commented Sep 23, 2016

Test build #65817 has finished for PR 15212 at commit 2c07179.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65972 has finished for PR 15212 at commit 27aead9.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • // ('scaled' = +Infinity). However in the case that this class also has
    • // 0 probability, the class will not be selected ('scaled' is NaN).
    • final val thresholds: DoubleArrayParam = new DoubleArrayParam(this, "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold", (t: Array[Double]) => t.forall(_ >= 0) && t.count(_ == 0) <= 1)
    • thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.", typeConverter=TypeConverters.toListFloat)
    • case class SortOrder(child: Expression, direction: SortDirection, nullOrdering: NullOrdering)
    • trait Offset extends Serializable

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65974 has finished for PR 15212 at commit 9c7fae3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Sep 27, 2016

Hi @srowen @yanboliang, I have updated this PR. Thanks.

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65975 has finished for PR 15212 at commit 2e97c55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

@mpjlu I made some changes to improve ChiSqSelector performance at #15277. Let's work together to get that in first, and then we can work on this. Thanks!

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66633 has finished for PR 15212 at commit e141c68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66634 has finished for PR 15212 at commit d05d7de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2016

Test build #67045 has finished for PR 15212 at commit f4a0a14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Oct 17, 2016

Hi @yanboliang @srowen, these are the last two feature selection methods based on the chi-squared test, similar to the methods in scikit-learn. But there is a bug in SelectFDR in scikit-learn; I have submitted a PR to scikit-learn: scikit-learn/scikit-learn#7490.
Thanks very much.

@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
Contributor

Add docs to clarify This uses the Benjamini-Hochberg procedure.

@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
Contributor

Unrelated to this PR: we could eliminate the zipWithIndex at L235, since no one uses it. Would you mind doing this cleanup in your PR?

Member

It's definitely used -- you have to keep the original index in order to pass them to the model.

.zipWithIndex
.filter { case ((res, _), index) =>
res.pValue <= alpha * (index + 1) / chiSqTestResult.length }
.map { case (_, index) => index}
Contributor

`index}` to `index }`

@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
val tempRDD = chiSqTestResult
Contributor

Use another name, since it's not an RDD.

@@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends Params
def getPercentile: Double = $(percentile)

/**
* The highest p-value for features to be kept.
* Only applicable when selectorType = "fpr".
* alpha means the highest p-value for features to be kept when select type is "fpr".
Contributor

We should keep `Only applicable when selectorType = "fpr", "fdr" or "fwe".`, since it's not applicable to other selector types such as "kbest" and "percentile".

Author

updated, thanks

* The highest p-value for features to be kept.
* Only applicable when selectorType = "fpr".
* alpha means the highest p-value for features to be kept when select type is "fpr".
* alpha means the highest uncorrected p-value for features to be kept when select type
Contributor

, or the highest ...

Author

updated, thanks

final val alpha = new DoubleParam(this, "alpha", "The highest p-value for features to be kept.",
final val alpha = new DoubleParam(this, "alpha",
"alpha means the highest p-value for features to be kept when select type is fpr, " +
"alpha means the highest uncorrected p-value for features to be kept when select type " +
Contributor

, or the highest ...

.select("filtered", "preFilteredData").collect().foreach {
case Row(vec1: Vector, vec2: Vector) =>
assert(vec1 ~== vec2 absTol 1e-1)
}
Contributor

Extend this test case to higher-dimensional data, and have the different selector types output different selected features as far as possible.

@@ -2624,7 +2624,9 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
"will select, ordered by statistics value descending.",
typeConverter=TypeConverters.toFloat)

alpha = Param(Params._dummy(), "alpha", "The highest p-value for features to be kept.",
alpha = Param(Params._dummy(), "alpha", "alpha means the highest p-value for features " +
"to be kept when select type is fpr, alpha means the highest uncorrected " +
Contributor

Ditto.

@@ -2700,7 +2702,6 @@ def getPercentile(self):
def setAlpha(self, value):
"""
Sets the value of :py:attr:`alpha`.
Only applicable when selectorType = "fpr".
Contributor

Ditto.

@@ -104,6 +108,9 @@ private[feature] trait ChiSqSelectorParams extends Params
* `kbest` chooses the `k` top features according to a chi-squared test.
* `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose false positive rate meets some threshold.
* `fpr` select features based on a false positive rate test.
Member

This has two lines for fpr. The existing text is more descriptive.

Author

thanks, updated

@@ -127,7 +134,9 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str

/** @group setParam */
@Since("2.1.0")
def setAlpha(value: Double): this.type = set(alpha, value)
def setAlpha(value: Double): this.type = {
Member

Why this change?

Author

change back, thanks

@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
Member

It's definitely used -- you have to keep the original index in order to pass them to the model.

val tempRDD = chiSqTestResult
.sortBy { case (res, _) => res.pValue }
val maxIndex = tempRDD
.zipWithIndex
Member

This zipWithIndex is however not correct it seems.

Author

I will validate the results carefully; we can compare the results with sklearn.

Author

I have added a large data sample in the test suite, and updated the contingency tables in the test suite comments. The degrees of freedom, statistic, and p-value for each feature are also added, so it is easy to validate the results.

SelectFDR in sklearn is not exact; my PR to fix that bug was merged today.
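The review thread above hinges on a subtlety of the Benjamini-Hochberg procedure: a plain per-rank filter keeps only the ranks whose own p-value passes, whereas BH keeps every feature up to the largest passing rank. A small illustration (names and numbers here are made up, not Spark's or sklearn's code), working on already-sorted p-values:

```python
def bh_filter_only(p_sorted, alpha):
    """Keeps exactly the ranks whose own p-value passes -- NOT the BH procedure."""
    n = len(p_sorted)
    return [k for k, p in enumerate(p_sorted, start=1) if p <= alpha * k / n]

def bh_correct(p_sorted, alpha):
    """BH keeps every rank up to the largest passing rank, even if an
    intermediate rank fails its own per-rank threshold."""
    n = len(p_sorted)
    passing = [k for k, p in enumerate(p_sorted, start=1) if p <= alpha * k / n]
    return list(range(1, max(passing) + 1)) if passing else []

# With p = [0.01, 0.04, 0.049] and alpha = 0.05 (n = 3), the per-rank
# thresholds are 0.0167, 0.0333, 0.05: ranks 1 and 3 pass, rank 2 fails.
# The plain filter drops rank 2; the correct BH procedure keeps all three.
```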

@@ -263,9 +278,15 @@ object ChiSqSelector {
/** String name for `fpr` selector type. */
private[spark] val FPR: String = "fpr"

/** String name for `fdr` selector type. */
private[spark] val FDR: String = "fdr"
Member

I know this applies to the existing line above too, but, this comment isn't descriptive. You can spell out what all of these mean if there's javadoc at all here.

Author

How about "Selector type name for False Discovery Rate, which chooses all features whose false discovery rate meets some threshold."?

@SparkQA

SparkQA commented Oct 20, 2016

Test build #67249 has finished for PR 15212 at commit d51b78b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 20, 2016

Test build #67261 has finished for PR 15212 at commit 92530ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu mpjlu changed the title [SPARK-17645][MLLIB][ML][WIP]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) [SPARK-17645][MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) Oct 20, 2016
@mpjlu
Author

mpjlu commented Oct 24, 2016

Hi @yanboliang and @srowen, could you please check whether this PR addresses all your comments. Thanks.

@SparkQA

SparkQA commented Nov 22, 2016

Test build #68999 has finished for PR 15212 at commit 2625208.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Nov 22, 2016

Hi @yanboliang, @srowen, @jkbradley, I have updated this PR, thanks.

@yanboliang
Contributor

@mpjlu Sorry for the late response; I just finished QA work for 2.1 and am starting the ordinary review. Could you resolve the merge conflicts first? I will take a look tomorrow. Thanks.

@SparkQA

SparkQA commented Dec 21, 2016

Test build #70454 has finished for PR 15212 at commit 83a429e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@yanboliang yanboliang left a comment

Only some docs need to be improved; otherwise, looks good. Thanks.

* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` chooses all features whose false discovery rate meets some threshold.
Contributor

Would `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold be better?

* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` chooses all features whose false discovery rate meets some threshold.
* `fwe` chooses all features whose family-wise error rate meets some threshold.
Contributor

whose p-values is below a threshold, thus controlling the family-wise error rate of selection


* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` chooses all features whose false discovery rate meets some threshold.
* `fwe` chooses all features whose family-wise error rate meets some threshold.
Contributor

Update according to the above suggestion.

@@ -92,8 +92,36 @@ private[feature] trait ChiSqSelectorParams extends Params
def getFpr: Double = $(fpr)

/**
* The highest uncorrected p-value for features to be kept.
Contributor

I think the doc is incorrect even though it's consistent with sklearn; actually we don't compare the fdr value with the p-value directly. I'd prefer to change it to `The upper bound of the expected false discovery rate.`, which is more accurate and easier to understand.

* Default value is 0.05.
* @group param
*/
@Since("2.1.0")
Contributor

Update version to 2.2.0.

* - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
* - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
* positive rate of selection.
* - `fdr` chooses all features whose false discovery rate meets some threshold.
Contributor

Ditto

@@ -245,6 +264,20 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < fpr }
case ChiSqSelector.FDR =>
// This uses the Benjamini-Hochberg procedure.
Contributor

/**
* String name for `numTopFeatures` selector type.
*/
/** String name for `numTopFeatures` selector type. */
val NumTopFeatures: String = "numTopFeatures"
Contributor

private[spark]

/**
* String name for `percentile` selector type.
*/
/** String name for `percentile` selector type. */
val Percentile: String = "percentile"
Contributor

private[spark]

*
* Use chi-squared calculator from Internet
*/

test("ChiSqSelector transform test (sparse & dense vector)") {
test("ChiSqSelector transform by KBest test (sparse & dense vector)") {
val labeledDiscreteData = sc.parallelize(
Contributor

Many test functions need labeledDiscreteData; we can refactor it out of the function so that the other functions share the same dataset instance.

@SparkQA

SparkQA commented Dec 23, 2016

Test build #70536 has finished for PR 15212 at commit 5a7cc2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@yanboliang yanboliang left a comment

LGTM except some very minor comments.
cc @srowen for another pass if you have time. Thanks.

* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
Contributor

Remove duplicated whose.


* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
Contributor

Remove duplicated whose.

* - `fdr` uses the [Benjamini-Hochberg procedure]
* (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
* to choose all features whose false discovery rate is below a threshold.
* - `fwe` chooses all features whose whose p-values is below a threshold,
Contributor

Remove duplicated whose.

* - `fdr` uses the [Benjamini-Hochberg procedure]
* (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
* to choose all features whose false discovery rate is below a threshold.
* - `fwe` chooses all features whose whose p-values is below a threshold,
Contributor

Remove duplicated whose.

LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0))),
LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)))), 2)

test("ChiSqSelector transform by KBest test (sparse & dense vector)") {
Contributor

by numTopFeatures

`fdr` uses the [Benjamini-Hochberg procedure]
(https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
to choose all features whose false discovery rate is below a threshold.
`fwe` chooses all features whose whose p-values is below a threshold,
Contributor

Remove duplicated whose.

`fdr` uses the [Benjamini-Hochberg procedure]
(https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
to choose all features whose false discovery rate is below a threshold.
`fwe` chooses all features whose whose p-values is below a threshold,
Contributor

Remove duplicated whose.

.setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6)
ChiSqSelectorSuite.testSelector(selector, dataset)
}

Contributor

Add a simple test for fdr selector.

Member

Pinging on @yanboliang 's comment about adding a test for FDR

@SparkQA

SparkQA commented Dec 23, 2016

Test build #70549 has finished for PR 15212 at commit aa5f2cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val filteredData = labeledDiscreteData.map { lp =>
LabeledPoint(lp.label, model.transform(lp.features))
}.collect().toSet
assert(filteredData == preFilteredData)
Contributor

Creates a ChiSquared feature selector.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
`fdr`, `fwe`.
`numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
Contributor

Organize the following items as a list in the generated API docs; please refer to https://github.com/apache/spark/blob/master/python/pyspark/ml/regression.py#L1300

`percentile` is similar but chooses a fraction of all features instead of a fixed number.
`fpr` chooses all features whose p-value is below a threshold, thus controlling the false
positive rate of selection.
`fdr` uses the [Benjamini-Hochberg procedure]
Contributor

`Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_

This is the Python style of doc links. You can refer to https://github.com/apache/spark/blob/master/python/pyspark/ml/regression.py#L1308

@@ -274,11 +274,17 @@ def transform(self, vector):
class ChiSqSelector(object):
"""
Creates a ChiSquared feature selector.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
`fdr`, `fwe`.
Contributor

Ditto; organize the following items as a list in the generated API docs.

`numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
`percentile` is similar but chooses a fraction of all features instead of a fixed number.
`fpr` chooses all features whose p-value is below a threshold, thus controlling the false
positive rate of selection.
`fdr` uses the [Benjamini-Hochberg procedure]
Contributor

Ditto.

@SparkQA

SparkQA commented Dec 27, 2016

Test build #70635 has finished for PR 15212 at commit da6ac35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

Merged into master, thanks!

@asfgit asfgit closed this in 79ff853 Dec 28, 2016
Member

@jkbradley jkbradley left a comment

@mpjlu @yanboliang Thanks for the PR! I know this is late, but I had a couple of small comments. Could you please send a little follow-up PR against the same JIRA?

* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
Member

This doc sounds the same as fpr. Could you state that the threshold is scaled by 1/numFeatures to clarify?

Also fix: "p-values is" -> "p-values are"
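The distinction requested above can be sketched as follows (illustrative only, not Spark's code): `fpr` compares each p-value against the threshold directly, while `fwe` scales the threshold by 1/numFeatures, a Bonferroni-style correction.

```python
def select_fpr(p_values, threshold):
    """Keep features whose p-value is below the threshold directly."""
    return [i for i, p in enumerate(p_values) if p < threshold]

def select_fwe(p_values, threshold):
    """Keep features whose p-value is below threshold / numFeatures."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold / n]

# With p-values [0.004, 0.02, 0.04, 0.3] and threshold 0.05:
# fpr keeps indices [0, 1, 2], while fwe compares against 0.05/4 = 0.0125
# and keeps only index [0].
```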

.setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6)
ChiSqSelectorSuite.testSelector(selector, dataset)
}

Member

Pinging on @yanboliang 's comment about adding a test for FDR

@@ -2629,8 +2629,28 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
"""
.. note:: Experimental

Chi-Squared feature selection, which selects categorical features to use for predicting a
categorical label.
Creates a ChiSquared feature selector.
Member

Please add back in the fact that this selects categorical features for predicting a categorical label. It's useful to state expected input data types.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Dec 29, 2016
…Discovery Rate (FDR) and Family wise error rate (FWE)

## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are a popular univariate statistical test for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add FDR and FWE methods to ChiSqSelector in this PR, as implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
## How was this patch tested?

Unit tests will be added soon.

Author: Peng <peng.meng@intel.com>
Author: Peng, Meng <peng.meng@intel.com>

Closes apache#15212 from mpjlu/fdr_fwe.
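The Benjamini-Hochberg procedure mentioned in the description can be sketched in plain Python — a hypothetical illustration of the selection rule, not the Spark code: sort the p-values ascending, find the largest rank i such that p_(i) <= alpha * i / m, and keep every feature at or below that rank.

```python
# Hypothetical sketch of the "fdr" selector rule via Benjamini-Hochberg:
# sort p-values ascending, find the largest rank i with
# p_(i) <= alpha * i / m, and select all features up to that rank.
def select_by_fdr(p_values, alpha):
    m = len(p_values)
    indexed = sorted(enumerate(p_values), key=lambda t: t[1])
    max_rank = 0
    for rank, (_, p) in enumerate(indexed, start=1):
        if p <= alpha * rank / m:
            max_rank = rank
    # Return the original indices of the selected features, in order.
    return sorted(idx for idx, _ in indexed[:max_rank])

print(select_by_fdr([0.001, 0.2, 0.01, 0.04], alpha=0.05))  # [0, 2]
```

Note that 0.04 fails its own threshold (0.05 * 3 / 4 = 0.0375), so only the two smallest p-values survive; this step-up behavior is what distinguishes `fdr` from a fixed per-feature cutoff.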
mpjlu commented Dec 29, 2016

Thanks @jkbradley , I will send a follow-up PR for your comments.

mpjlu commented Dec 29, 2016

hi @jkbradley @yanboliang , I have created a follow up PR for this PR. #16434
I have not added an FDR test in the ML suite. The main reason is that the current dataset is not a good case for an FDR test (FDR can only select 0 features or all 3 features on it). Do you think I should write a new dataset, like the one in MLlib, to test FDR in ML?

@yanboliang

@mpjlu Thanks for the follow-up PR. @jkbradley Please feel free to shepherd that PR, since I'm on travel these days. Thanks.

@jkbradley

Will do!

asfgit pushed a commit that referenced this pull request Jan 10, 2017
## What changes were proposed in this pull request?
Add FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up PR for #15212.

## How was this patch tested?
ut

Author: Peng, Meng <peng.meng@intel.com>

Closes #16434 from mpjlu/fdr_fwe_update.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…Discovery Rate (FDR) and Family wise error rate (FWE)
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017