
[SPARK-17645][MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) #15212

Closed
wants to merge 14 commits into from

Conversation

mpjlu

@mpjlu mpjlu commented Sep 23, 2016

What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are popular univariate statistical tests for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. In this PR, FDR uses the Benjamini-Hochberg procedure. https://en.wikipedia.org/wiki/False_discovery_rate
In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypothesis tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add FDR and FWE methods for ChiSqSelector in this PR, as implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
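The two selection rules described above can be sketched outside Spark. This is a minimal illustration, not Spark's implementation: the function names and data are made up, and the per-feature p-values are assumed to come from a chi-squared test computed elsewhere.

```python
def select_fdr(p_values, alpha):
    """Benjamini-Hochberg: find the largest rank k (1-based) with
    p_(k) <= alpha * k / n, then keep the k features with the smallest p-values."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / n:
            max_rank = rank
    return sorted(order[:max_rank])

def select_fwe(p_values, alpha):
    """Bonferroni-style family-wise error control: keep features with p < alpha / n."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / n]
```

For example, with p-values `[0.01, 0.2, 0.03, 0.5]` and `alpha = 0.1`, the FDR rule keeps features 0 and 2, while the more conservative FWE rule (threshold 0.1/4 = 0.025) keeps only feature 0.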

How was this patch tested?

Unit tests will be added soon.

@mpjlu mpjlu changed the title [MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) [SPARK-17645][MLLIB][ML][WIP]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) Sep 23, 2016
@SparkQA

SparkQA commented Sep 23, 2016

Test build #65817 has finished for PR 15212 at commit 2c07179.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65972 has finished for PR 15212 at commit 27aead9.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • // ('scaled' = +Infinity). However in the case that this class also has
    • // 0 probability, the class will not be selected ('scaled' is NaN).
    • final val thresholds: DoubleArrayParam = new DoubleArrayParam(this, "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold", (t: Array[Double]) => t.forall(_ >= 0) && t.count(_ == 0) <= 1)
    • thresholds = Param(Params._dummy(), "thresholds", "Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0, excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold.", typeConverter=TypeConverters.toListFloat)
    • case class SortOrder(child: Expression, direction: SortDirection, nullOrdering: NullOrdering)
    • trait Offset extends Serializable

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65974 has finished for PR 15212 at commit 9c7fae3.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Sep 27, 2016

Hi @srowen @yanboliang, I have updated this PR. Thanks.

@SparkQA

SparkQA commented Sep 27, 2016

Test build #65975 has finished for PR 15212 at commit 2e97c55.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

@mpjlu I made some changes to improve ChiSqSelector performance at #15277. Let's work together to get that in first, and then we can work on this. Thanks!

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66633 has finished for PR 15212 at commit e141c68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 10, 2016

Test build #66634 has finished for PR 15212 at commit d05d7de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 17, 2016

Test build #67045 has finished for PR 15212 at commit f4a0a14.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Oct 17, 2016

Hi @yanboliang @srowen, these are the last two feature selection methods based on the chi-squared test, similar to the methods in scikit-learn. But there is a bug in SelectFDR in scikit-learn; I have submitted a PR to scikit-learn: scikit-learn/scikit-learn#7490.
Thanks very much.

@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
Contributor

Add docs to clarify This uses the Benjamini-Hochberg procedure.

@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
Contributor

Unrelated to this PR: we could eliminate the zipWithIndex at L235, since no one uses it. Would you mind doing this cleanup in your PR?

Member

It's definitely used -- you have to keep the original index in order to pass them to the model.

.zipWithIndex
.filter { case ((res, _), index) =>
res.pValue <= alpha * (index + 1) / chiSqTestResult.length }
.map { case (_, index) => index}
Contributor

`index}` to `index }`

@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
val tempRDD = chiSqTestResult
Contributor

Use another name, since it's not an RDD.

@@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends Params
def getPercentile: Double = $(percentile)

/**
* The highest p-value for features to be kept.
* Only applicable when selectorType = "fpr".
* alpha means the highest p-value for features to be kept when select type is "fpr".
Contributor

We should keep `Only applicable when selectorType = "fpr", "fdr" or "fwe".`, since it's not applicable to other selector types such as "kbest" and "percentile".

Author

updated, thanks

* The highest p-value for features to be kept.
* Only applicable when selectorType = "fpr".
* alpha means the highest p-value for features to be kept when select type is "fpr".
* alpha means the highest uncorrected p-value for features to be kept when select type
Contributor

, or the highest ...

Author

updated, thanks

final val alpha = new DoubleParam(this, "alpha", "The highest p-value for features to be kept.",
final val alpha = new DoubleParam(this, "alpha",
"alpha means the highest p-value for features to be kept when select type is fpr, " +
"alpha means the highest uncorrected p-value for features to be kept when select type " +
Contributor

, or the highest ...

.select("filtered", "preFilteredData").collect().foreach {
case Row(vec1: Vector, vec2: Vector) =>
assert(vec1 ~== vec2 absTol 1e-1)
}
Contributor

Extend this test case to higher-dimensional data, and have the different selector types output different selected features as far as possible.

@@ -2624,7 +2624,9 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
"will select, ordered by statistics value descending.",
typeConverter=TypeConverters.toFloat)

alpha = Param(Params._dummy(), "alpha", "The highest p-value for features to be kept.",
alpha = Param(Params._dummy(), "alpha", "alpha means the highest p-value for features " +
"to be kept when select type is fpr, alpha means the highest uncorrected " +
Contributor

Ditto.

@@ -2700,7 +2702,6 @@ def getPercentile(self):
def setAlpha(self, value):
"""
Sets the value of :py:attr:`alpha`.
Only applicable when selectorType = "fpr".
Contributor

Ditto.

@@ -104,6 +108,9 @@ private[feature] trait ChiSqSelectorParams extends Params
* `kbest` chooses the `k` top features according to a chi-squared test.
* `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose false positive rate meets some threshold.
* `fpr` select features based on a false positive rate test.
Member

This has two lines for fpr. The existing text is more descriptive.

Author

thanks, updated

@@ -127,7 +134,9 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str

/** @group setParam */
@Since("2.1.0")
def setAlpha(value: Double): this.type = set(alpha, value)
def setAlpha(value: Double): this.type = {
Member

Why this change?

Author

change back, thanks

@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
Member

It's definitely used -- you have to keep the original index in order to pass them to the model.

val tempRDD = chiSqTestResult
.sortBy { case (res, _) => res.pValue }
val maxIndex = tempRDD
.zipWithIndex
Member

This zipWithIndex is however not correct it seems.

Author

I will validate the results carefully; we can compare the results with sklearn.

Author

I have added a large data sample in the test suite, and updated the contingency tables in the test suite comments. The degrees of freedom, statistic, and p-value for each feature are also added, so it is easy to validate the results.

SelectFDR in sklearn is not exact; my PR to fix that bug was merged today.
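The review thread above hinges on a subtlety of the Benjamini-Hochberg procedure: a plain per-rank filter keeps only the ranks whose own p-value passes, whereas BH keeps every feature up to the largest passing rank. A small illustration (names and numbers here are made up, not Spark's or sklearn's code), working on already-sorted p-values:

```python
def bh_filter_only(p_sorted, alpha):
    """Keeps exactly the ranks whose own p-value passes -- NOT the BH procedure."""
    n = len(p_sorted)
    return [k for k, p in enumerate(p_sorted, start=1) if p <= alpha * k / n]

def bh_correct(p_sorted, alpha):
    """BH keeps every rank up to the largest passing rank, even if an
    intermediate rank fails its own per-rank threshold."""
    n = len(p_sorted)
    passing = [k for k, p in enumerate(p_sorted, start=1) if p <= alpha * k / n]
    return list(range(1, max(passing) + 1)) if passing else []

# With p = [0.01, 0.04, 0.049] and alpha = 0.05 (n = 3), the per-rank
# thresholds are 0.0167, 0.0333, 0.05: ranks 1 and 3 pass, rank 2 fails.
# The plain filter drops rank 2; the correct BH procedure keeps all three.
```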

@@ -263,9 +278,15 @@ object ChiSqSelector {
/** String name for `fpr` selector type. */
private[spark] val FPR: String = "fpr"

/** String name for `fdr` selector type. */
private[spark] val FDR: String = "fdr"
Member

I know this applies to the existing line above too, but, this comment isn't descriptive. You can spell out what all of these mean if there's javadoc at all here.

Author

How about "Selector type name for False Discovery Rate, which chooses all features whose false discovery rate meets some threshold."?

@SparkQA

SparkQA commented Oct 20, 2016

Test build #67249 has finished for PR 15212 at commit d51b78b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 20, 2016

Test build #67261 has finished for PR 15212 at commit 92530ab.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu mpjlu changed the title [SPARK-17645][MLLIB][ML][WIP]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) [SPARK-17645][MLLIB][ML]add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) Oct 20, 2016
@mpjlu
Author

mpjlu commented Oct 24, 2016

Hi @yanboliang and @srowen, could you please check whether this PR addresses all your comments. Thanks.

@SparkQA

SparkQA commented Nov 22, 2016

Test build #68999 has finished for PR 15212 at commit 2625208.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mpjlu
Author

mpjlu commented Nov 22, 2016

Hi @yanboliang, @srowen, @jkbradley, I have updated this PR, thanks.

@yanboliang
Contributor

@mpjlu Sorry for the late response; I just finished QA work for 2.1 and am starting the ordinary review. Could you resolve the merge conflicts first? I will take a look tomorrow. Thanks.

@SparkQA

SparkQA commented Dec 21, 2016

Test build #70454 has finished for PR 15212 at commit 83a429e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@yanboliang yanboliang left a comment

Only some docs need to be improved; otherwise, looks good. Thanks.

* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` chooses all features whose false discovery rate meets some threshold.
Contributor

Would `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold be better?

* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` chooses all features whose false discovery rate meets some threshold.
* `fwe` chooses all features whose family-wise error rate meets some threshold.
Contributor

whose p-values is below a threshold, thus controlling the family-wise error rate of selection


* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` chooses all features whose false discovery rate meets some threshold.
* `fwe` chooses all features whose family-wise error rate meets some threshold.
Contributor

Update according to the above suggestion.

@@ -92,8 +92,36 @@ private[feature] trait ChiSqSelectorParams extends Params
def getFpr: Double = $(fpr)

/**
* The highest uncorrected p-value for features to be kept.
Contributor

I think the doc is incorrect even though it's consistent with sklearn; actually we don't compare the fdr value with the p-value directly. I'd prefer to change it to `The upper bound of the expected false discovery rate.`, which is more accurate and easier to understand.

* Default value is 0.05.
* @group param
*/
@Since("2.1.0")
Contributor

Update version to 2.2.0.

* - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
* - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
* positive rate of selection.
* - `fdr` chooses all features whose false discovery rate meets some threshold.
Contributor

Ditto

@@ -245,6 +264,20 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < fpr }
case ChiSqSelector.FDR =>
// This uses the Benjamini-Hochberg procedure.
Contributor

/**
* String name for `numTopFeatures` selector type.
*/
/** String name for `numTopFeatures` selector type. */
val NumTopFeatures: String = "numTopFeatures"
Contributor

private[spark]

/**
* String name for `percentile` selector type.
*/
/** String name for `percentile` selector type. */
val Percentile: String = "percentile"
Contributor

private[spark]

*
* Use chi-squared calculator from Internet
*/

test("ChiSqSelector transform test (sparse & dense vector)") {
test("ChiSqSelector transform by KBest test (sparse & dense vector)") {
val labeledDiscreteData = sc.parallelize(
Contributor

Many test functions need labeledDiscreteData; we can refactor it out of the function so that the other functions share the same dataset instance.

@SparkQA

SparkQA commented Dec 23, 2016

Test build #70536 has finished for PR 15212 at commit 5a7cc2c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@yanboliang yanboliang left a comment

LGTM except some very minor comments.
cc @srowen for another pass if you have time. Thanks.

* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
Contributor

Remove duplicated whose.


* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
Contributor

Remove duplicated whose.

* - `fdr` uses the [Benjamini-Hochberg procedure]
* (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
* to choose all features whose false discovery rate is below a threshold.
* - `fwe` chooses all features whose whose p-values is below a threshold,
Contributor

Remove duplicated whose.

* - `fdr` uses the [Benjamini-Hochberg procedure]
* (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
* to choose all features whose false discovery rate is below a threshold.
* - `fwe` chooses all features whose whose p-values is below a threshold,
Contributor

Remove duplicated whose.

LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0))),
LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)))), 2)

test("ChiSqSelector transform by KBest test (sparse & dense vector)") {
Contributor

by numTopFeatures

`fdr` uses the [Benjamini-Hochberg procedure]
(https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
to choose all features whose false discovery rate is below a threshold.
`fwe` chooses all features whose whose p-values is below a threshold,
Contributor

Remove duplicated whose.

`fdr` uses the [Benjamini-Hochberg procedure]
(https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
to choose all features whose false discovery rate is below a threshold.
`fwe` chooses all features whose whose p-values is below a threshold,
Contributor

Remove duplicated whose.

.setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6)
ChiSqSelectorSuite.testSelector(selector, dataset)
}

Contributor

Add a simple test for fdr selector.

Member

Pinging on @yanboliang 's comment about adding a test for FDR

@SparkQA

SparkQA commented Dec 23, 2016

Test build #70549 has finished for PR 15212 at commit aa5f2cc.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val filteredData = labeledDiscreteData.map { lp =>
LabeledPoint(lp.label, model.transform(lp.features))
}.collect().toSet
assert(filteredData == preFilteredData)
Contributor

Creates a ChiSquared feature selector.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
`fdr`, `fwe`.
`numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
Contributor

Organize the following items as a list in the generated API docs; please refer to https://github.com/apache/spark/blob/master/python/pyspark/ml/regression.py#L1300

`percentile` is similar but chooses a fraction of all features instead of a fixed number.
`fpr` chooses all features whose p-value is below a threshold, thus controlling the false
positive rate of selection.
`fdr` uses the [Benjamini-Hochberg procedure]
Contributor

`Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_

This is the Python style of doc links. You can refer to https://github.com/apache/spark/blob/master/python/pyspark/ml/regression.py#L1308

@@ -274,11 +274,17 @@ def transform(self, vector):
class ChiSqSelector(object):
"""
Creates a ChiSquared feature selector.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
`fdr`, `fwe`.
Contributor

Ditto; organize the following items as a list in the generated API docs.

`numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
`percentile` is similar but chooses a fraction of all features instead of a fixed number.
`fpr` chooses all features whose p-value is below a threshold, thus controlling the false
positive rate of selection.
`fdr` uses the [Benjamini-Hochberg procedure]
Contributor

Ditto.

@SparkQA

SparkQA commented Dec 27, 2016

Test build #70635 has finished for PR 15212 at commit da6ac35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

Merged into master, thanks!

@asfgit asfgit closed this in 79ff853 Dec 28, 2016
Member

@jkbradley jkbradley left a comment

@mpjlu @yanboliang Thanks for the PR! I know this is late, but I had a couple of small comments. Could you please send a little follow-up PR against the same JIRA?

* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.

* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
Member

This doc sounds the same as fpr. Could you state that the threshold is scaled by 1/numFeatures to clarify?

Also fix: "p-values is" -> "p-values are"
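The distinction requested above can be sketched as follows (illustrative only, not Spark's code): `fpr` compares each p-value against the threshold directly, while `fwe` scales the threshold by 1/numFeatures, a Bonferroni-style correction.

```python
def select_fpr(p_values, threshold):
    """Keep features whose p-value is below the threshold directly."""
    return [i for i, p in enumerate(p_values) if p < threshold]

def select_fwe(p_values, threshold):
    """Keep features whose p-value is below threshold / numFeatures."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p < threshold / n]

# With p-values [0.004, 0.02, 0.04, 0.3] and threshold 0.05:
# fpr keeps indices [0, 1, 2], while fwe compares against 0.05/4 = 0.0125
# and keeps only index [0].
```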

.setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6)
ChiSqSelectorSuite.testSelector(selector, dataset)
}

Member

Pinging on @yanboliang 's comment about adding a test for FDR

@@ -2629,8 +2629,28 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
"""
.. note:: Experimental

Chi-Squared feature selection, which selects categorical features to use for predicting a
categorical label.
Creates a ChiSquared feature selector.
Member

Please add back in the fact that this selects categorical features for predicting a categorical label. It's useful to state expected input data types.

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Dec 29, 2016
…Discovery Rate (FDR) and Family wise error rate (FWE)

## What changes were proposed in this pull request?

Univariate feature selection works by selecting the best features based on univariate statistical tests.
FDR and FWE are a popular univariate statistical test for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate.
In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests.
https://en.wikipedia.org/wiki/Family-wise_error_rate

We add FDR and FWE methods to ChiSqSelector in this PR, as implemented in scikit-learn.
http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
## How was this patch tested?

Unit tests will be added soon.

Author: Peng <peng.meng@intel.com>
Author: Peng, Meng <peng.meng@intel.com>

Closes apache#15212 from mpjlu/fdr_fwe.
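The Benjamini-Hochberg procedure mentioned in the description can be sketched in plain Python — a hypothetical illustration of the selection rule, not the Spark code: sort the p-values ascending, find the largest rank i such that p_(i) <= alpha * i / m, and keep every feature at or below that rank.

```python
# Hypothetical sketch of the "fdr" selector rule via Benjamini-Hochberg:
# sort p-values ascending, find the largest rank i with
# p_(i) <= alpha * i / m, and select all features up to that rank.
def select_by_fdr(p_values, alpha):
    m = len(p_values)
    indexed = sorted(enumerate(p_values), key=lambda t: t[1])
    max_rank = 0
    for rank, (_, p) in enumerate(indexed, start=1):
        if p <= alpha * rank / m:
            max_rank = rank
    # Return the original indices of the selected features, in order.
    return sorted(idx for idx, _ in indexed[:max_rank])

print(select_by_fdr([0.001, 0.2, 0.01, 0.04], alpha=0.05))  # [0, 2]
```

Note that 0.04 fails its own threshold (0.05 * 3 / 4 = 0.0375), so only the two smallest p-values survive; this step-up behavior is what distinguishes `fdr` from a fixed per-feature cutoff.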
mpjlu commented Dec 29, 2016

Thanks @jkbradley , I will send a follow-up PR for your comments.

mpjlu commented Dec 29, 2016

hi @jkbradley @yanboliang , I have created a follow up PR for this PR. #16434
I have not added an FDR test in the ML suite. The main reason is that the current dataset is not a good case for an FDR test (FDR can only select 0 features or all 3 features on it). Do you think I should write a new dataset, like the one in MLlib, to test FDR in ML?

@yanboliang

@mpjlu Thanks for the follow-up PR. @jkbradley Please feel free to shepherd that PR, since I'm on travel these days. Thanks.

@jkbradley

Will do!

asfgit pushed a commit that referenced this pull request Jan 10, 2017
## What changes were proposed in this pull request?
Add FDR test case in ml/feature/ChiSqSelectorSuite.
Improve some comments in the code.
This is a follow-up PR for #15212.

## How was this patch tested?
ut

Author: Peng, Meng <peng.meng@intel.com>

Closes #16434 from mpjlu/fdr_fwe_update.
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
…Discovery Rate (FDR) and Family wise error rate (FWE)
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017