[SPARK-26111][ML][WIP] Support F-value between label/feature for continuous distribution feature selection#27322
huaxingao wants to merge 1 commit into apache:master
|
@srowen Currently, Spark only supports selection of categorical features (
Currently, this WIP PR only has sklearn has both f_regression and f_classif. Here are the links: |
|
I think it's fairly reasonable, yeah. F-value is pretty standard and not hard to compute. |
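For reference, a minimal sketch of the regression F-value for a single feature, assuming the same definition that scikit-learn's f_regression uses with centering: take the Pearson correlation r between the feature and the label, then F = r² / (1 − r²) · (n − 2). The object and method names are hypothetical and not part of this PR. The p-value step is left out because it needs an F-distribution CDF.

```scala
object FValueSketch {
  /** Pearson correlation between two equally sized arrays. */
  def pearson(x: Array[Double], y: Array[Double]): Double = {
    require(x.length == y.length && x.length > 2,
      "need matching arrays with more than 2 rows")
    val n = x.length
    val meanX = x.sum / n
    val meanY = y.sum / n
    var sxy = 0.0
    var sxx = 0.0
    var syy = 0.0
    var i = 0
    while (i < n) {
      val dx = x(i) - meanX
      val dy = y(i) - meanY
      sxy += dx * dy
      sxx += dx * dx
      syy += dy * dy
      i += 1
    }
    // NaN if either column is constant (zero variance).
    sxy / math.sqrt(sxx * syy)
  }

  /** F-value with (1, n - 2) degrees of freedom. */
  def fValue(feature: Array[Double], label: Array[Double]): Double = {
    val r = pearson(feature, label)
    val r2 = r * r
    r2 / (1.0 - r2) * (feature.length - 2)
  }
}
```

A larger F-value means the feature explains more of the label's variance under a univariate linear fit, which is what the selector would rank on.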
|
Test build #117248 has finished for PR 27322 at commit
|
| */ | ||
| @Since("3.1.0") | ||
| final class FRegressionSelectorModel private[ml] ( | ||
| @Since("3.1.0") override val uid: String, |
I found that f_regression in scikit-learn returns both an array of F-values and an array of p-values. Can we also add them to FRegressionSelectorModel?
| s"FRegressionSelectorModel: uid=$uid, numSelectedFeatures=${selectedFeatures.length}" | ||
| } | ||
|
|
||
| private[spark] def compressSparse(indices: Array[Int], |
Sorry I am not sure what you want me to do for this one and the next one. Could you please clarify?
| filterIndicesIdx = filterIndices(j) | ||
| if (indicesIdx == filterIndicesIdx) { | ||
| newIndices += j | ||
| newValues += values(i) |
val value = values(i)
if (value != 0) {
...
}
Does this require selectedFeatures in the model to be sorted?
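To make the two review points on this helper concrete, here is a sketch of a two-pointer merge modeled on the fragment in the diff, with the suggested zero check added. It is an illustration under assumptions, not the PR's code: the object name and the exact signature are hypothetical, and it relies on both `indices` and `filterIndices` being strictly increasing, which is why the sortedness of selectedFeatures matters.

```scala
import scala.collection.mutable.ArrayBuilder

object CompressSketch {
  /**
   * Keeps only the entries of a sparse vector whose index appears in
   * `filterIndices`, re-indexing each kept entry by its position in
   * `filterIndices`. Both index arrays must be strictly increasing.
   */
  def compressSparse(
      indices: Array[Int],
      values: Array[Double],
      filterIndices: Array[Int]): (Array[Int], Array[Double]) = {
    val newIndices = new ArrayBuilder.ofInt
    val newValues = new ArrayBuilder.ofDouble
    var i = 0
    var j = 0
    while (i < indices.length && j < filterIndices.length) {
      val indicesIdx = indices(i)
      val filterIndicesIdx = filterIndices(j)
      if (indicesIdx == filterIndicesIdx) {
        val value = values(i)
        if (value != 0.0) { // drop explicitly stored zeros
          newIndices += j
          newValues += value
        }
        i += 1
        j += 1
      } else if (indicesIdx > filterIndicesIdx) {
        j += 1 // selected feature is absent from this sparse row
      } else {
        i += 1 // this row entry was not selected
      }
    }
    (newIndices.result(), newValues.result())
  }
}
```

If `filterIndices` were unsorted, the merge would advance past matches and silently drop selected features, so the sortedness check belongs wherever the model is constructed.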
| @Since("3.1.0") | ||
| final class FRegressionSelectorModel private[ml] ( | ||
| @Since("3.1.0") override val uid: String, | ||
| val selectedFeatures: Array[Int]) |
Add a requirement to make sure that selectedFeatures is sorted?
var prev = -1
selectedFeatures.foreach { i =>
require(prev < i, s"Index $i follows $prev and is not strictly increasing")
prev = i
}
|
||
| var fTestResultArray = new Array[FRegressionTestResult](numOfFeatures) | ||
| val labels = rdd.map(d => d.label) | ||
| for (i <- 0 until numOfFeatures) { |
Can all columns be computed at once?
Looping over the features one by one should be inefficient; I guess only one pass is needed.
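One way to read this suggestion is to accumulate the per-column sums needed for the correlation in a single traversal instead of one traversal per feature. The sketch below does that on a plain Scala `Seq` of (label, features) rows so that it stays self-contained. On an RDD the same `seqOp` and `combOp` could presumably be passed to `treeAggregate`, but that is an assumption about how the PR might be restructured, and every name here is hypothetical.

```scala
object OnePassSketch {
  /** Running sums for the label and for every feature column. */
  final case class Sums(
      n: Long,
      sumY: Double,
      sumYY: Double,
      sumX: Array[Double],
      sumXX: Array[Double],
      sumXY: Array[Double])

  def zero(numFeatures: Int): Sums =
    Sums(0L, 0.0, 0.0, new Array[Double](numFeatures),
      new Array[Double](numFeatures), new Array[Double](numFeatures))

  /** Folds one row into the accumulator (updates the arrays in place). */
  def seqOp(acc: Sums, row: (Double, Array[Double])): Sums = {
    val (y, x) = row
    var k = 0
    while (k < x.length) {
      acc.sumX(k) += x(k)
      acc.sumXX(k) += x(k) * x(k)
      acc.sumXY(k) += x(k) * y
      k += 1
    }
    acc.copy(n = acc.n + 1, sumY = acc.sumY + y, sumYY = acc.sumYY + y * y)
  }

  /** Merges two partial accumulators, as a distributed aggregate would. */
  def combOp(a: Sums, b: Sums): Sums = {
    def add(u: Array[Double], v: Array[Double]): Array[Double] =
      u.zip(v).map { case (p, q) => p + q }
    Sums(a.n + b.n, a.sumY + b.sumY, a.sumYY + b.sumYY,
      add(a.sumX, b.sumX), add(a.sumXX, b.sumXX), add(a.sumXY, b.sumXY))
  }

  /** F-value per feature, derived from the accumulated sums. */
  def fValues(s: Sums): Array[Double] = {
    val n = s.n.toDouble
    val syy = s.sumYY - s.sumY * s.sumY / n
    s.sumX.indices.map { k =>
      val sxx = s.sumXX(k) - s.sumX(k) * s.sumX(k) / n
      val sxy = s.sumXY(k) - s.sumX(k) * s.sumY / n
      val r2 = sxy * sxy / (sxx * syy)
      r2 / (1.0 - r2) * (n - 2)
    }.toArray
  }
}
```

Raw sums of squares are less numerically stable than centered sums, so a production version would likely center the data first or use a streaming-variance update; the point here is only that one traversal covers every feature.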
| * @return Array containing the FRegressionTestResult for every feature against the label. | ||
| */ | ||
| @Since("3.1.0") | ||
| def test_regression(dataset: Dataset[_], featuresCol: String, labelCol: String): |
This method name, test_regression, should follow camelCase.
| (newIndices.result(), newValues.result()) | ||
| } | ||
|
|
||
| private[spark] def compressDense(values: Array[Double]): Array[Double] = { |
The logic here is simple; just put it in the function definition above.
| case errorType => | ||
| throw new IllegalStateException(s"Unknown FRegressionSelector Type: $errorType") | ||
| } | ||
| val indices = features.map { case (_, index) => index } |
It seems that these indices need to be sorted?
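A small sketch of why the sort matters: ranking features by score yields indices in score order, not index order, so they have to be sorted again before they are stored in the model. Ranking by F-value and the helper name `selectTopK` are assumptions made for illustration; the PR's selector types may rank differently.

```scala
object SelectSketch {
  /** Indices of the `k` largest F-values, returned in ascending index order. */
  def selectTopK(fValues: Array[Double], k: Int): Array[Int] = {
    fValues.zipWithIndex
      .sortBy { case (f, _) => -f }     // best score first
      .take(k)
      .map { case (_, index) => index } // now in score order, e.g. 3 before 1
      .sorted                           // restore ascending index order
  }
}
```

Without the final `.sorted`, a downstream merge over sorted sparse indices would receive the selected features out of order.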
I think |
|
I will close this WIP and submit a new PR. @zhengruifeng |
What changes were proposed in this pull request?
Add FValue selection for continuous distribution features.
Why are the changes needed?
Currently, Spark only supports the selection of categorical features, while there is considerable demand for selecting continuous distribution features.
The ANOVA F-value is one way to select features with a continuous distribution, and it's important to support it in Spark.
Does this PR introduce any user-facing change?
Yes
How was this patch tested?
Add unit tests