
[SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection #27322

Closed
huaxingao wants to merge 1 commit into apache:master from huaxingao:spark-fvalue

Conversation

@huaxingao
Contributor

What changes were proposed in this pull request?

Add FValue selection for continuous distribution features.

Why are the changes needed?

Currently, Spark only supports the selection of categorical features, while there is strong demand for selecting features with continuous distributions.

The ANOVA F-value is one way to select features with continuous distributions, and it is important to support it in Spark.

Does this PR introduce any user-facing change?

Yes

How was this patch tested?

Add unit tests

@huaxingao
Contributor Author

@srowen
Hi Sean, I am thinking of adding a selector for continuous distribution features, and I want to ask your opinion before I go any further. I will also ask Ruifeng after the Chinese New Year holiday; I bet he is on vacation, so I don't want to ping him now.

Currently, Spark only supports selection of categorical features (ChiSqSelector). I am thinking of adding two new selectors for continuous distribution features:

  1. FValueRegressionSelector for continuous features and continuous labels.
  2. FValueClassificationSelector for continuous features and categorical labels.
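For the regression case (item 1), the statistic behind scikit-learn's f_regression converts the per-feature Pearson correlation into an F-value with (1, n - 2) degrees of freedom. A minimal Python sketch of that statistic (illustrative only; function name is mine, and the PR's actual implementation is Scala):

```python
def f_regression_single(x, y):
    """F-test for one continuous feature against a continuous label.

    The Pearson correlation r between feature and label is converted to
    F = r^2 / (1 - r^2) * (n - 2), with (1, n - 2) degrees of freedom.
    Illustrative sketch, not the PR's Scala implementation.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    r2 = sxy * sxy / (sxx * syy)          # squared Pearson correlation
    return r2 / (1.0 - r2) * (n - 2)      # F statistic with df (1, n - 2)

# a nearly linear relationship gives a large F-value
f = f_regression_single([1, 2, 3, 4, 5], [1, 2, 3, 4, 6])
```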

Currently, this WIP PR only has FValueRegressionSelector implemented. FValueClassificationSelector is very similar, but the calculation of the classification F-value is a little more complicated. I wrote the pseudo code here along with an example:

  // pseudo code:
  // for each feature:
  //     separate feature values into array of arrays by label (call it arr)
  //       e.g. if feature is [3.3, 2.5, 1.0, 3.0, 2.0] and labels are [1, 2, 1, 3, 3]
  //       then output should be arr = [[3.3, 1.0], [2.5], [3.0, 2.0]]
  //     n_classes = len(arr) (num. of distinct label categories)
  //     n_samples_per_class = [len(a) for a in arr]
  //     n_samples = sum(n_samples_per_class)    (= num. of rows in feature column)
  //       e.g. in above example, n_classes = 3, n_samples_per_class = [2, 1, 2], n_samples = 5
  //     ss_all = sum of squares of all in feature (e.g. 3.3^2+2.5^2+1.0^2+3.0^2+2.0^2)
  //     sq_sum_all = square of sum of all data (e.g. (3.3+2.5+1.0+3.0+2.0)^2)
  //     sq_sum_classes = [sum(a) ** 2 for a in arr]  (e.g. [(3.3+1.0)^2, 2.5^2, (3.0+2.0)^2])
  //     sstot = ss_all - (sq_sum_all / n_samples)
  //     ssbn = sum( sq_sum_classes[k] / n_samples_per_class[k] for k in range(n_classes)) - (sq_sum_all / n_samples)
  //       e.g. ((3.3+1.0)^2 / 2 + 2.5^2 / 1 + (3.0+2.0)^2 / 2) - sq_sum_all / 5
  //     sswn = sstot - ssbn
  //     dfbn = n_classes - 1
  //     dfwn = n_samples - n_classes
  //     msb = ssbn / dfbn
  //     msw = sswn / dfwn
  //     f = msb / msw
  //     pvalue = 1 - FDistribution(dfbn, dfwn).cdf(f)
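The pseudo code above can be turned into a small runnable check (plain Python, illustrative only; the final p-value step additionally needs an F-distribution CDF such as scipy.stats.f.sf, so it is omitted here):

```python
def f_classif_single(values, labels):
    """One-way ANOVA F-value for one feature, following the pseudo code above.

    Returns the F statistic only; the p-value would come from the CDF of the
    F(dfbn, dfwn) distribution. Illustrative sketch, not the PR's Scala code.
    """
    # separate feature values into arrays by label
    groups = {}
    for v, lbl in zip(values, labels):
        groups.setdefault(lbl, []).append(v)
    arr = list(groups.values())

    n_classes = len(arr)
    n_samples_per_class = [len(a) for a in arr]
    n_samples = sum(n_samples_per_class)

    ss_all = sum(v * v for v in values)
    sq_sum_all = sum(values) ** 2
    sq_sum_classes = [sum(a) ** 2 for a in arr]

    sstot = ss_all - sq_sum_all / n_samples
    ssbn = sum(sq / n for sq, n in zip(sq_sum_classes, n_samples_per_class)) \
        - sq_sum_all / n_samples
    sswn = sstot - ssbn

    dfbn = n_classes - 1
    dfwn = n_samples - n_classes
    return (ssbn / dfbn) / (sswn / dfwn)

# the worked example from the pseudo code
f = f_classif_single([3.3, 2.5, 1.0, 3.0, 2.0], [1, 2, 1, 3, 3])
```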

scikit-learn has both f_regression and f_classif. Here are the links:
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_regression.html#sklearn.feature_selection.f_regression
https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.f_classif.html#sklearn.feature_selection.f_classif

@srowen
Member

srowen commented Jan 22, 2020

I think it's fairly reasonable, yeah. F-value is pretty standard and not hard to compute.

@SparkQA

SparkQA commented Jan 22, 2020

Test build #117248 has finished for PR 27322 at commit fd2a25f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • final class FRegressionSelector @Since("3.1.0") (@Since("3.1.0") override val uid: String)
  • class FRegressionSelectorModelWriter(instance: FRegressionSelectorModel) extends MLWriter
  • case class FRegressionTestResult(

*/
@Since("3.1.0")
final class FRegressionSelectorModel private[ml] (
@Since("3.1.0") override val uid: String,
Contributor

I found that f_regression in scikit-learn will return both arrays of F-values and P-values, can we also add them to FRegressionSelectorModel?

s"FRegressionSelectorModel: uid=$uid, numSelectedFeatures=${selectedFeatures.length}"
}

private[spark] def compressSparse(indices: Array[Int],
Contributor

lint

Contributor Author

Sorry I am not sure what you want me to do for this one and the next one. Could you please clarify?

filterIndicesIdx = filterIndices(j)
if (indicesIdx == filterIndicesIdx) {
newIndices += j
newValues += values(i)
Contributor

val value = values(i)
if (value != 0) {
...
}
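The compressSparse logic under discussion can be sketched in Python as a two-pointer walk over the vector's indices and the sorted filter indices (hypothetical helper names, incorporating the zero-skipping suggestion; the real code is Scala):

```python
def compress_sparse(indices, values, filter_indices):
    """Keep only the entries of a sparse vector whose index is selected.

    Two-pointer walk over the vector's indices and the (sorted)
    filter_indices; a kept entry is re-indexed to its position within
    filter_indices. Python sketch only, not the PR's Scala implementation.
    """
    new_indices, new_values = [], []
    i = j = 0
    while i < len(indices) and j < len(filter_indices):
        if indices[i] == filter_indices[j]:
            value = values[i]
            if value != 0:              # skip explicit zeros, per the review
                new_indices.append(j)   # position in the compressed vector
                new_values.append(value)
            i += 1
            j += 1
        elif indices[i] < filter_indices[j]:
            i += 1
        else:
            j += 1
    return new_indices, new_values

# entries at original indices 1 and 4 survive the filter [1, 3, 4]
idx, vals = compress_sparse([0, 1, 4], [5.0, 7.0, 9.0], [1, 3, 4])
```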

Contributor

Does this require that selectedFeatures in the model be sorted?

@Since("3.1.0")
final class FRegressionSelectorModel private[ml] (
@Since("3.1.0") override val uid: String,
val selectedFeatures: Array[Int])
Contributor

add a requirement to make sure that selectedFeatures is sorted?

    var prev = -1
    selectedFeatures.foreach { i =>
      require(prev < i, s"Index $i follows $prev and is not strictly increasing")
      prev = i
    }


var fTestResultArray = new Array[FRegressionTestResult](numOfFeatures)
val labels = rdd.map(d => d.label)
for (i <- 0 until numOfFeatures) {
Contributor

Could every column be computed at once? Looping over the data once per feature is inefficient; I guess only a single pass is needed.
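The one-pass idea can be sketched in plain Python: accumulate running sums for every feature while scanning the rows once, then derive each regression F-value from those sums (illustrative only; function name is mine, and the actual implementation would be RDD-based Scala):

```python
def f_regression_one_pass(rows, labels):
    """Regression F-values for all features in a single pass over the rows.

    Accumulates per-feature running sums (sum x, sum x^2, sum x*y) plus the
    label sums, then derives each squared Pearson correlation and
    F = r^2 / (1 - r^2) * (n - 2). Sketch of the one-pass idea only.
    """
    num_features = len(rows[0])
    n = 0
    sum_y = sum_y2 = 0.0
    sum_x = [0.0] * num_features
    sum_x2 = [0.0] * num_features
    sum_xy = [0.0] * num_features

    for row, y in zip(rows, labels):       # the single pass over the data
        n += 1
        sum_y += y
        sum_y2 += y * y
        for j, x in enumerate(row):
            sum_x[j] += x
            sum_x2[j] += x * x
            sum_xy[j] += x * y

    f_values = []
    for j in range(num_features):
        sxy = sum_xy[j] - sum_x[j] * sum_y / n
        sxx = sum_x2[j] - sum_x[j] ** 2 / n
        syy = sum_y2 - sum_y ** 2 / n
        r2 = sxy * sxy / (sxx * syy)
        f_values.append(r2 / (1.0 - r2) * (n - 2))
    return f_values

fs = f_regression_one_pass([[1, 1], [2, 3], [3, 2], [4, 5], [5, 4]],
                           [1, 2, 3, 4, 6])
```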

* @return Array containing the FRegressionTestResult for every feature against the label.
*/
@Since("3.1.0")
def test_regression(dataset: Dataset[_], featuresCol: String, labelCol: String):
Contributor

The method name test_regression should follow camelCase.

(newIndices.result(), newValues.result())
}

private[spark] def compressDense(values: Array[Double]): Array[Double] = {
Contributor

The logic here is simple; just inline it in the definition of the function above.

case errorType =>
throw new IllegalStateException(s"Unknown FRegressionSelector Type: $errorType")
}
val indices = features.map { case (_, index) => index }
Contributor

It seems that these indices need to be sorted?

@zhengruifeng
Contributor

Currently, this WIP PR only has FValueRegressionSelector implemented. FValueClassificationSelector is very similar. The calculation for classification f value is a little more complicated.

I think f_classif is different enough for another PR.

@huaxingao
Contributor Author

I will close this WIP and submit a new PR.

@zhengruifeng
I will address your comments in the new PR.
