
[SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection #27322

Closed
huaxingao wants to merge 1 commit into apache:master from huaxingao:spark-fvalue

Conversation

@huaxingao
Contributor

What changes were proposed in this pull request?

Add FValue selection for continuous distribution features.

Why are the changes needed?

Currently, Spark only supports the selection of categorical features, while there is strong demand for selecting features with continuous distributions.

The ANOVA F-value is one way to select features with continuous distributions, and it is important to support it in Spark.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Add unit tests

@huaxingao
Contributor Author

@srowen
Hi Sean, I am thinking of adding a selector for continuous distribution features, and I want to ask your opinion before I go any further. I will also ask Ruifeng after the Chinese New Year holiday; I bet he is on vacation, so I don't want to ping him now.

Currently, Spark only supports selection of categorical features (ChiSqSelector). I am thinking of adding two new selectors for continuous distribution features:

  1. FValueRegressionSelector for continuous features and continuous labels.
  2. FValueClassificationSelector for continuous features and categorical labels.
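For the regression case (item 1), the statistic behind scikit-learn's f_regression converts the per-feature Pearson correlation into an F-value with (1, n - 2) degrees of freedom. A minimal Python sketch of that statistic (illustrative only; function name is mine, and the PR's actual implementation is Scala):

```python
def f_regression_single(x, y):
    """F-test for one continuous feature against a continuous label.

    The Pearson correlation r between feature and label is converted to
    F = r^2 / (1 - r^2) * (n - 2), with (1, n - 2) degrees of freedom.
    Illustrative sketch, not the PR's Scala implementation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    r2 = sxy * sxy / (sxx * syy)          # squared Pearson correlation
    return r2 / (1.0 - r2) * (n - 2)      # F statistic with df (1, n - 2)

# a nearly linear relationship gives a large F-value
f = f_regression_single([1, 2, 3, 4, 5], [1, 2, 3, 4, 6])
```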

Currently, this WIP PR only has FValueRegressionSelector implemented. FValueClassificationSelector is very similar, but the calculation of the classification F-value is a little more complicated. I wrote the pseudo code here along with an example:

  // pseudo code:
  // for each feature:
  //     separate feature values into array of arrays by label (call it arr)
  //       e.g. if feature is [3.3, 2.5, 1.0, 3.0, 2.0] and labels are [1, 2, 1, 3, 3]
  //       then output should be arr = [[3.3, 1.0], [2.5], [3.0, 2.0]]
  //     n_classes = len(arr) (num. of distinct label categories)
  //     n_samples_per_class = [len(a) for a in arr]
  //     n_samples = sum(n_samples_per_class)    (= num. of rows in feature column)
  //       e.g. in above example, n_classes = 3, n_samples_per_class = [2, 1, 2], n_samples = 5
  //     ss_all = sum of squares of all in feature (e.g. 3.3^2+2.5^2+1.0^2+3.0^2+2.0^2)
  //     sq_sum_all = square of sum of all data (e.g. (3.3+2.5+1.0+3.0+2.0)^2)
  //     sq_sum_classes = [sum(a) ** 2 for a in arr]  (e.g. [(3.3+1.0)^2, 2.5^2, (3.0+2.0)^2])
  //     sstot = ss_all - (sq_sum_all / n_samples)
  //     ssbn = sum( sq_sum_classes[k] / n_samples_per_class[k] for k in range(n_classes)) - (sq_sum_all / n_samples)
  //       e.g. ((3.3+1.0)^2 / 2 + 2.5^2 / 1 + (3.0+2.0)^2 / 2) - sq_sum_all / 5
  //     sswn = sstot - ssbn
  //     dfbn = n_classes - 1
  //     dfwn = n_samples - n_classes
  //     msb = ssbn / dfbn
  //     msw = sswn / dfwn
  //     f = msb / msw
  //     pvalue = 1 - FDistribution(dfbn, dfwn).cdf(f)
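The pseudo code above can be turned into a small runnable check (plain Python, illustrative only; the final p-value step additionally needs an F-distribution CDF such as scipy.stats.f.sf, so it is omitted here):

```python
def f_classif_single(values, labels):
    """One-way ANOVA F-value for one feature, following the pseudo code above.

    Returns the F statistic only; the p-value would come from the CDF of the
    F(dfbn, dfwn) distribution. Illustrative sketch, not the PR's Scala code.
    """
    # separate feature values into arrays by label
    groups = {}
    for v, lbl in zip(values, labels):
        groups.setdefault(lbl, []).append(v)
    arr = list(groups.values())

    n_classes = len(arr)
    n_samples_per_class = [len(a) for a in arr]
    n_samples = sum(n_samples_per_class)

    ss_all = sum(v * v for v in values)
    sq_sum_all = sum(values) ** 2
    sq_sum_classes = [sum(a) ** 2 for a in arr]

    sstot = ss_all - sq_sum_all / n_samples
    ssbn = sum(sq / n for sq, n in zip(sq_sum_classes, n_samples_per_class)) \
        - sq_sum_all / n_samples
    sswn = sstot - ssbn

    dfbn = n_classes - 1
    dfwn = n_samples - n_classes
    return (ssbn / dfbn) / (sswn / dfwn)

# the worked example from the pseudo code
f = f_classif_single([3.3, 2.5, 1.0, 3.0, 2.0], [1, 2, 1, 3, 3])
```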

scikit-learn has both f_regression and f_classif. Here are the links:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif

@srowen
Member

srowen commented Jan 22, 2020

I think it's fairly reasonable, yeah. F-value is pretty standard and not hard to compute.

@SparkQA

SparkQA commented Jan 22, 2020

Test build #117248 has finished for PR 27322 at commit fd2a25f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class FRegressionSelector @Since("3.1.0") (@Since("3.1.0") override val uid: String)
  • class FRegressionSelectorModelWriter(instance: FRegressionSelectorModel) extends MLWriter
  • case class FRegressionTestResult(

*/
@Since("3.1.0")
final class FRegressionSelectorModel private[ml] (
@Since("3.1.0") override val uid: String,
Contributor

I found that f_regression in scikit-learn will return both arrays of F-values and P-values, can we also add them to FRegressionSelectorModel?

s"FRegressionSelectorModel: uid=$uid, numSelectedFeatures=${selectedFeatures.length}"
}

private[spark] def compressSparse(indices: Array[Int],
Contributor

lint

Contributor Author

Sorry I am not sure what you want me to do for this one and the next one. Could you please clarify?

filterIndicesIdx = filterIndices(j)
if (indicesIdx == filterIndicesIdx) {
newIndices += j
newValues += values(i)
Contributor

val value = values(i)
if (value != 0) {
...
}
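The compressSparse logic under discussion can be sketched in Python as a two-pointer walk over the vector's indices and the sorted filter indices (hypothetical helper names, incorporating the zero-skipping suggestion; the real code is Scala):

```python
def compress_sparse(indices, values, filter_indices):
    """Keep only the entries of a sparse vector whose index is selected.

    Two-pointer walk over the vector's indices and the (sorted)
    filter_indices; a kept entry is re-indexed to its position within
    filter_indices. Python sketch only, not the PR's Scala implementation.
    """
    new_indices, new_values = [], []
    i = j = 0
    while i < len(indices) and j < len(filter_indices):
        if indices[i] == filter_indices[j]:
            value = values[i]
            if value != 0:              # skip explicit zeros, per the review
                new_indices.append(j)   # position in the compressed vector
                new_values.append(value)
            i += 1
            j += 1
        elif indices[i] < filter_indices[j]:
            i += 1
        else:
            j += 1
    return new_indices, new_values

# entries at original indices 1 and 4 survive the filter [1, 3, 4]
idx, vals = compress_sparse([0, 1, 4], [5.0, 7.0, 9.0], [1, 3, 4])
```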

Contributor

Does this require that selectedFeatures in the model be sorted?

@Since("3.1.0")
final class FRegressionSelectorModel private[ml] (
@Since("3.1.0") override val uid: String,
val selectedFeatures: Array[Int])
Contributor

add a requirement to make sure that selectedFeatures is sorted?

    var prev = -1
    selectedFeatures.foreach { i =>
      require(prev < i, s"Index $i follows $prev and is not strictly increasing")
      prev = i
    }


var fTestResultArray = new Array[FRegressionTestResult](numOfFeatures)
val labels = rdd.map(d => d.label)
for (i <- 0 until numOfFeatures) {
Contributor

Could every column be computed at once? Looping over the data once per feature is inefficient; I guess only a single pass is needed.
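The one-pass idea can be sketched in plain Python: accumulate running sums for every feature while scanning the rows once, then derive each regression F-value from those sums (illustrative only; function name is mine, and the actual implementation would be RDD-based Scala):

```python
def f_regression_one_pass(rows, labels):
    """Regression F-values for all features in a single pass over the rows.

    Accumulates per-feature running sums (sum x, sum x^2, sum x*y) plus the
    label sums, then derives each squared Pearson correlation and
    F = r^2 / (1 - r^2) * (n - 2). Sketch of the one-pass idea only.
    """
    num_features = len(rows[0])
    n = 0
    sum_y = sum_y2 = 0.0
    sum_x = [0.0] * num_features
    sum_x2 = [0.0] * num_features
    sum_xy = [0.0] * num_features

    for row, y in zip(rows, labels):       # the single pass over the data
        n += 1
        sum_y += y
        sum_y2 += y * y
        for j, x in enumerate(row):
            sum_x[j] += x
            sum_x2[j] += x * x
            sum_xy[j] += x * y

    f_values = []
    for j in range(num_features):
        sxy = sum_xy[j] - sum_x[j] * sum_y / n
        sxx = sum_x2[j] - sum_x[j] ** 2 / n
        syy = sum_y2 - sum_y ** 2 / n
        r2 = sxy * sxy / (sxx * syy)
        f_values.append(r2 / (1.0 - r2) * (n - 2))
    return f_values

fs = f_regression_one_pass([[1, 1], [2, 3], [3, 2], [4, 5], [5, 4]],
                           [1, 2, 3, 4, 6])
```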

* @return Array containing the FRegressionTestResult for every feature against the label.
*/
@Since("3.1.0")
def test_regression(dataset: Dataset[_], featuresCol: String, labelCol: String):
Contributor

The method name test_regression should follow camelCase.

(newIndices.result(), newValues.result())
}

private[spark] def compressDense(values: Array[Double]): Array[Double] = {
Contributor

The logic here is simple; just inline it in the definition of the function above.

case errorType =>
throw new IllegalStateException(s"Unknown FRegressionSelector Type: $errorType")
}
val indices = features.map { case (_, index) => index }
Contributor

It seems that these indices need to be sorted?

@zhengruifeng
Contributor

Currently, this WIP PR only has FValueRegressionSelector implemented. FValueClassificationSelector is very similar. The calculation for classification f value is a little more complicated.

I think f_classif is different enough for another PR.

@huaxingao
Contributor Author

I will close this WIP and submit a new PR.

@zhengruifeng
I will address your comments in the new PR.
