[SPARK-17645][MLLIB][ML] Add feature selector methods based on False Discovery Rate (FDR) and Family-wise error rate (FWE) #15212
Conversation
Test build #65817 has finished for PR 15212 at commit
Test build #65972 has finished for PR 15212 at commit
Test build #65974 has finished for PR 15212 at commit
Hi @srowen @yanboliang, I have updated this PR. Thanks.
Test build #65975 has finished for PR 15212 at commit
Test build #66633 has finished for PR 15212 at commit
Test build #66634 has finished for PR 15212 at commit
Test build #67045 has finished for PR 15212 at commit
Hi @yanboliang @srowen, these are the last two feature selection methods based on the chi-squared test, similar to the methods in scikit-learn. But there is a bug in SelectFdr in scikit-learn; I have submitted a PR to scikit-learn: scikit-learn/scikit-learn#7490.
@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
  chiSqTestResult
    .filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
Add docs to clarify that this uses the Benjamini-Hochberg procedure.
@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
  chiSqTestResult
    .filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
Irrelevant to this PR, but we can eliminate the zipWithIndex at L235, since no one uses it. Would you mind doing this cleanup in your PR?
It's definitely used -- you have to keep the original index in order to pass them to the model.
.zipWithIndex
.filter { case ((res, _), index) =>
  res.pValue <= alpha * (index + 1) / chiSqTestResult.length }
.map { case (_, index) => index}
Change `index}` to `index }`.
@@ -243,6 +245,19 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
  chiSqTestResult
    .filter { case (res, _) => res.pValue < alpha }
case ChiSqSelector.FDR =>
  val tempRDD = chiSqTestResult
Use another name, since it's not an RDD.
@@ -72,11 +72,15 @@ private[feature] trait ChiSqSelectorParams extends Params
def getPercentile: Double = $(percentile)

/**
 * The highest p-value for features to be kept.
 * Only applicable when selectorType = "fpr".
 * alpha means the highest p-value for features to be kept when select type is "fpr".
We should keep `Only applicable when selectorType = "fpr", "fdr" or "fwe"`, since it's not applicable to other selector types such as "kbest" and "percentile".
updated, thanks
* The highest p-value for features to be kept.
* Only applicable when selectorType = "fpr".
* alpha means the highest p-value for features to be kept when select type is "fpr".
* alpha means the highest uncorrected p-value for features to be kept when select type
, or the highest ...
updated, thanks
final val alpha = new DoubleParam(this, "alpha", "The highest p-value for features to be kept.",
final val alpha = new DoubleParam(this, "alpha",
  "alpha means the highest p-value for features to be kept when select type is fpr, " +
  "alpha means the highest uncorrected p-value for features to be kept when select type " +
, or the highest ...
.select("filtered", "preFilteredData").collect().foreach {
  case Row(vec1: Vector, vec2: Vector) =>
    assert(vec1 ~== vec2 absTol 1e-1)
}
Extend this test case to larger-dimensional data, and have it output different selected features for each selector type as far as possible.
@@ -2624,7 +2624,9 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
    "will select, ordered by statistics value descending.",
    typeConverter=TypeConverters.toFloat)

alpha = Param(Params._dummy(), "alpha", "The highest p-value for features to be kept.",
alpha = Param(Params._dummy(), "alpha", "alpha means the highest p-value for features " +
    "to be kept when select type is fpr, alpha means the highest uncorrected " +
Ditto.
@@ -2700,7 +2702,6 @@ def getPercentile(self):
def setAlpha(self, value):
    """
    Sets the value of :py:attr:`alpha`.
    Only applicable when selectorType = "fpr".
Ditto.
@@ -104,6 +108,9 @@ private[feature] trait ChiSqSelectorParams extends Params
 * `kbest` chooses the `k` top features according to a chi-squared test.
 * `percentile` is similar but chooses a fraction of all features instead of a fixed number.
 * `fpr` chooses all features whose false positive rate meets some threshold.
 * `fpr` select features based on a false positive rate test.
This has two lines for fpr. The existing text is more descriptive.
thanks, updated
@@ -127,7 +134,9 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str

/** @group setParam */
@Since("2.1.0")
def setAlpha(value: Double): this.type = set(alpha, value)
def setAlpha(value: Double): this.type = {
Why this change?
Changed it back, thanks.
val tempRDD = chiSqTestResult
  .sortBy { case (res, _) => res.pValue }
val maxIndex = tempRDD
  .zipWithIndex
This zipWithIndex, however, does not seem correct.
I will validate the results carefully; they can be compared with sklearn's results.
I have added a large data sample to the test suite and updated the contingency tables in the test suite comments. The degrees of freedom, statistic, and p-value for each feature are also included, so it is easy to validate the results.
SelectFdr in sklearn is not exact; my PR to fix the bug was merged today.
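To make the procedure under discussion easy to validate outside Spark, here is a minimal plain-Python sketch of Benjamini-Hochberg selection (the p-values in the example are hypothetical, not the ones from the test suite):

```python
def select_fdr(p_values, alpha):
    """Benjamini-Hochberg: return original indices of features kept at FDR level alpha."""
    n = len(p_values)
    # Rank features by ascending p-value, remembering their original indices.
    ranked = sorted((p, i) for i, p in enumerate(p_values))
    # Find the largest 1-based rank k with p(k) <= alpha * k / n.
    max_rank = max((k for k, (p, _) in enumerate(ranked, start=1)
                    if p <= alpha * k / n), default=0)
    # Keep *all* features up to that rank, not only those individually below
    # their own rank threshold -- the subtlety behind the zipWithIndex discussion.
    return sorted(i for _, i in ranked[:max_rank])

# 0.04 fails its rank-2 threshold (0.0333) but is kept because rank 3 passes.
print(select_fdr([0.04, 0.001, 0.045], 0.05))  # -> [0, 1, 2]
```

Note the keep-all-up-to-max-rank step: a per-feature filter alone would wrongly drop features whose p-value exceeds its own rank threshold but sits below a later passing rank.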
@@ -263,9 +278,15 @@ object ChiSqSelector {
/** String name for `fpr` selector type. */
private[spark] val FPR: String = "fpr"

/** String name for `fdr` selector type. */
private[spark] val FDR: String = "fdr"
I know this applies to the existing line above too, but this comment isn't descriptive. You could spell out what each of these means if there's javadoc here at all.
How about "Selector type name for False Discovery Rate, which chooses all features whose false discovery rate meets some threshold."?
Test build #67249 has finished for PR 15212 at commit
Test build #67261 has finished for PR 15212 at commit
Hi @yanboliang and @srowen, could you please review whether this PR addresses all your comments? Thanks.
Test build #68999 has finished for PR 15212 at commit
Hi @yanboliang @srowen @jkbradley, I have updated this PR, thanks.
@mpjlu Sorry for the late response; I just finished QA work for 2.1 and am starting ordinary reviews. Could you resolve the merge conflicts first? I will take a look tomorrow. Thanks.
Test build #70454 has finished for PR 15212 at commit
Only some docs need to be improved; otherwise, looks good. Thanks.
* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` chooses all features whose false discovery rate meets some threshold.
`fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold
Would that be better?
* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` chooses all features whose false discovery rate meets some threshold.
* `fwe` chooses all features whose family-wise error rate meets some threshold.
whose p-values is below a threshold, thus controlling the family-wise error rate of selection
* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` chooses all features whose false discovery rate meets some threshold.
* `fwe` chooses all features whose family-wise error rate meets some threshold.
Update according to the above suggestion.
@@ -92,8 +92,36 @@ private[feature] trait ChiSqSelectorParams extends Params
def getFpr: Double = $(fpr)

/**
 * The highest uncorrected p-value for features to be kept.
I think the doc is incorrect even though it's consistent with sklearn; we don't actually compare the fdr value with the p-value directly. I'd prefer to change it to "The upper bound of the expected false discovery rate", which is more accurate and easier to understand.
 * Default value is 0.05.
 * @group param
 */
@Since("2.1.0")
Update version to 2.2.0.
* - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
* - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
*   positive rate of selection.
* - `fdr` chooses all features whose false discovery rate meets some threshold.
Ditto
@@ -245,6 +264,20 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
  chiSqTestResult
    .filter { case (res, _) => res.pValue < fpr }
case ChiSqSelector.FDR =>
  // This uses the Benjamini-Hochberg procedure.
Add a link to explain the Benjamini-Hochberg procedure: https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
/**
 * String name for `numTopFeatures` selector type.
 */
/** String name for `numTopFeatures` selector type. */
val NumTopFeatures: String = "numTopFeatures"
private[spark]
/**
 * String name for `percentile` selector type.
 */
/** String name for `percentile` selector type. */
val Percentile: String = "percentile"
private[spark]
 *
 * Use chi-squared calculator from Internet
 */

test("ChiSqSelector transform test (sparse & dense vector)") {
test("ChiSqSelector transform by KBest test (sparse & dense vector)") {
val labeledDiscreteData = sc.parallelize(
Many test functions need labeledDiscreteData; we can refactor it out of the function and let the other functions share the same dataset instance.
Test build #70536 has finished for PR 15212 at commit
LGTM except some very minor comments.
cc @srowen for another pass if you have time. Thanks.
* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
Remove the duplicated `whose`.
* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
Remove the duplicated `whose`.
* - `fdr` uses the [Benjamini-Hochberg procedure]
*   (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
*   to choose all features whose false discovery rate is below a threshold.
* - `fwe` chooses all features whose whose p-values is below a threshold,
Remove the duplicated `whose`.
* - `fdr` uses the [Benjamini-Hochberg procedure]
*   (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
*   to choose all features whose false discovery rate is below a threshold.
* - `fwe` chooses all features whose whose p-values is below a threshold,
Remove the duplicated `whose`.
LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0))),
LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)))), 2)

test("ChiSqSelector transform by KBest test (sparse & dense vector)") {
by numTopFeatures
`fdr` uses the [Benjamini-Hochberg procedure]
(https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
to choose all features whose false discovery rate is below a threshold.
`fwe` chooses all features whose whose p-values is below a threshold,
Remove the duplicated `whose`.
`fdr` uses the [Benjamini-Hochberg procedure]
(https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
to choose all features whose false discovery rate is below a threshold.
`fwe` chooses all features whose whose p-values is below a threshold,
Remove the duplicated `whose`.
.setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6)
ChiSqSelectorSuite.testSelector(selector, dataset)
}
Add a simple test for the `fdr` selector.
Pinging on @yanboliang's comment about adding a test for FDR.
Test build #70549 has finished for PR 15212 at commit
val filteredData = labeledDiscreteData.map { lp =>
  LabeledPoint(lp.label, model.transform(lp.features))
}.collect().toSet
assert(filteredData == preFilteredData)
Use `===`; see the difference at http://stackoverflow.com/questions/10489548/what-is-the-triple-equals-operator-in-scala-koans.
Creates a ChiSquared feature selector.
The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
`fdr`, `fwe`.
`numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
Organize the following items as a list in the generated API docs; please refer to https://github.com/apache/spark/blob/master/python/pyspark/ml/regression.py#L1300
`percentile` is similar but chooses a fraction of all features instead of a fixed number.
`fpr` chooses all features whose p-value is below a threshold, thus controlling the false
positive rate of selection.
`fdr` uses the [Benjamini-Hochberg procedure]
`Benjamini-Hochberg procedure <https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure>`_
This is the Python style of doc links. You can refer to https://github.com/apache/spark/blob/master/python/pyspark/ml/regression.py#L1308
@@ -274,11 +274,17 @@ def transform(self, vector):
class ChiSqSelector(object):
    """
    Creates a ChiSquared feature selector.
    The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
    The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
    `fdr`, `fwe`.
Ditto, organize following items to make it as a list in the generated API docs.
`numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
`percentile` is similar but chooses a fraction of all features instead of a fixed number.
`fpr` chooses all features whose p-value is below a threshold, thus controlling the false
positive rate of selection.
`fdr` uses the [Benjamini-Hochberg procedure]
Ditto.
Test build #70635 has finished for PR 15212 at commit
Merged into master, thanks!
@mpjlu @yanboliang Thanks for the PR! I know this is late, but I had a couple of small comments. Could you please send a little follow-up PR against the same JIRA?
* `numTopFeatures` chooses a fixed number of top features according to a chi-squared test. This is akin to yielding the features with the most predictive power.
* `percentile` is similar to `numTopFeatures` but chooses a fraction of all features instead of a fixed number.
* `fpr` chooses all features whose p-value is below a threshold, thus controlling the false positive rate of selection.
* `fdr` uses the [Benjamini-Hochberg procedure](https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure) to choose all features whose false discovery rate is below a threshold.
* `fwe` chooses all features whose p-values is below a threshold, thus controlling the family-wise error rate of selection.
This doc sounds the same as fpr. Could you state that the threshold is scaled by 1/numFeatures to clarify?
Also fix: "p-values is" -> "p-values are"
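The scaling jkbradley describes can be sketched as follows, assuming the comparison is `p < fwe / numFeatures` (a Bonferroni-style correction; the p-values in the example are hypothetical):

```python
def select_fwe(p_values, fwe):
    """Keep features whose p-value is below fwe / numFeatures, a Bonferroni-style
    scaling that controls the family-wise error rate at level fwe."""
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p < fwe / n]

# With 3 features and fwe = 0.06, each p-value is compared against ~0.02.
print(select_fwe([0.001, 0.03, 0.5], 0.06))  # -> [0]
```

This makes the contrast with `fpr` concrete: `fpr` compares each p-value against the raw threshold, while `fwe` divides the threshold by the number of features before comparing.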
@@ -2629,8 +2629,28 @@ class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, Ja
    """
    .. note:: Experimental

    Chi-Squared feature selection, which selects categorical features to use for predicting a
    categorical label.
    Creates a ChiSquared feature selector.
Please add back in the fact that this selects categorical features for predicting a categorical label. It's useful to state expected input data types.
…Discovery Rate (FDR) and Family wise error rate (FWE) ## What changes were proposed in this pull request? Univariate feature selection works by selecting the best features based on univariate statistical tests. FDR and FWE are a popular univariate statistical test for feature selection. In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate. In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests. https://en.wikipedia.org/wiki/Family-wise_error_rate We add FDR and FWE methods for ChiSqSelector in this PR, like it is implemented in scikit-learn. http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection ## How was this patch tested? ut will be added soon (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Peng <peng.meng@intel.com> Author: Peng, Meng <peng.meng@intel.com> Closes apache#15212 from mpjlu/fdr_fwe.
Thanks @jkbradley, I will send a follow-up PR for your comments.
Hi @jkbradley @yanboliang, I have created a follow-up PR for this PR: #16434
@mpjlu Thanks for the follow-up PR. @jkbradley Please feel free to shepherd that PR, since I'm traveling these days. Thanks.
Will do!
## What changes were proposed in this pull request? Add FDR test case in ml/feature/ChiSqSelectorSuite. Improve some comments in the code. This is a follow-up pr for apache#15212. ## How was this patch tested? ut Author: Peng, Meng <peng.meng@intel.com> Closes apache#16434 from mpjlu/fdr_fwe_update.
What changes were proposed in this pull request?
Univariate feature selection works by selecting the best features based on univariate statistical tests. FDR and FWE are popular univariate statistical tests for feature selection.
In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR implementation in this PR uses the Benjamini-Hochberg procedure: https://en.wikipedia.org/wiki/False_discovery_rate
In statistics, FWE is the probability of making one or more false discoveries (type I errors) among all the hypotheses when performing multiple hypothesis tests: https://en.wikipedia.org/wiki/Family-wise_error_rate
This PR adds FDR and FWE methods to ChiSqSelector, as implemented in scikit-learn: http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection
How was this patch tested?
Unit tests will be added soon.
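To summarize how the three p-value-based rules relate, here is a small self-contained comparison. The p-values are hypothetical, and this is a sketch of the selection rules rather than Spark's actual code (which filters chi-squared test results); `fdr` follows the Benjamini-Hochberg procedure:

```python
P_VALUES = [0.001, 0.009, 0.02, 0.04, 0.3, 0.8]  # hypothetical chi-squared p-values

def fpr(alpha, p_values=P_VALUES):
    # Keep every feature whose raw p-value is below alpha.
    return [i for i, p in enumerate(p_values) if p < alpha]

def fwe(alpha, p_values=P_VALUES):
    # Bonferroni-style: compare against alpha scaled by 1/numFeatures.
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / n]

def fdr(alpha, p_values=P_VALUES):
    # Benjamini-Hochberg: keep all features up to the largest 1-based rank k
    # (p-values ascending) with p(k) <= alpha * k / n.
    n = len(p_values)
    ranked = sorted((p, i) for i, p in enumerate(p_values))
    max_rank = max((k for k, (p, _) in enumerate(ranked, start=1)
                    if p <= alpha * k / n), default=0)
    return sorted(i for _, i in ranked[:max_rank])

# On these values the rules are increasingly strict: fwe <= fdr <= fpr.
print(fpr(0.05))  # -> [0, 1, 2, 3]
print(fdr(0.05))  # -> [0, 1, 2]
print(fwe(0.05))  # -> [0]
```

At the same nominal level, `fwe` is the most conservative rule and `fpr` the most permissive, which is why the PR exposes all three as separate selector types.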