
[MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square feature selection #1484

Closed
wants to merge 16 commits

Conversation

avulanov
Contributor

The following is implemented:

  1. generic traits for feature selection and filtering
  2. trait for feature selection of LabeledPoint with discrete data
  3. traits for calculation of contingency table and chi squared
  4. class for chi-squared feature selection
  5. tests for the above

Needs some optimization in matrix operations.

This pull request is an attempt to implement feature selection for MLlib; the previous work by the issue author @izendejas was not finished (https://issues.apache.org/jira/browse/SPARK-1473). It is also related to the data discretization issues https://issues.apache.org/jira/browse/SPARK-1303 and https://issues.apache.org/jira/browse/SPARK-1216, which weren't merged.
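For reference, the chi-squared statistic that this kind of selector ranks features by can be sketched in plain Scala, assuming a dense contingency table of observed counts (the function name and table layout here are illustrative, not the PR's actual code):

```scala
// Chi-squared statistic for one feature against the class label.
// observed(i)(j) = number of samples with feature value i and label j.
def chiSquared(observed: Array[Array[Double]]): Double = {
  val rowSums = observed.map(_.sum)
  val colSums = observed.transpose.map(_.sum)
  val total   = rowSums.sum
  var stat = 0.0
  for (i <- observed.indices; j <- observed(i).indices) {
    val expected = rowSums(i) * colSums(j) / total // count under independence
    val diff = observed(i)(j) - expected
    stat += diff * diff / expected
  }
  stat
}
```

A higher statistic means the feature's distribution deviates more from independence with the label, so the top-scoring features are the ones kept.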

@SparkQA

SparkQA commented Jul 18, 2014

QA tests have started for PR 1484. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16826/consoleFull

@SparkQA

SparkQA commented Jul 18, 2014

QA results for PR 1484:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class ChiSquaredFeatureSelection(labeledDiscreteData: RDD[LabeledPoint], numTopFeatures: Int)
trait FeatureSelection[T] extends java.io.Serializable {
sealed trait FeatureFilter[T] extends FeatureSelection[T] {
trait LabeledPointFeatureFilter extends FeatureFilter[LabeledPoint] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16826/consoleFull

@SparkQA

SparkQA commented Jul 18, 2014

QA tests have started for PR 1484. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16827/consoleFull

@SparkQA

SparkQA commented Jul 18, 2014

QA results for PR 1484:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class ChiSquaredFeatureSelection(labeledDiscreteData: RDD[LabeledPoint], numTopFeatures: Int)
trait FeatureSelection[T] extends java.io.Serializable {
sealed trait FeatureFilter[T] extends FeatureSelection[T] {
trait LabeledPointFeatureFilter extends FeatureFilter[LabeledPoint] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16827/consoleFull

@avulanov
Contributor Author

avulanov commented Aug 4, 2014

@mengxr Could you review or comment on this? Thanks!

@mengxr
Contributor

mengxr commented Aug 4, 2014

Sure. We have some transformers implemented under mllib.feature, similar to scikit-learn's approach. For feature selection, we can follow the same approach if we view feature selection as a transformation: 1) fit a dataset and select a subset of features, 2) transform a dataset by picking out the selected features. So for the API, I suggest the following:

class ChiSquaredFeatureSelector(numFeatures: Int) extends Serializable {
  def fit(dataset: RDD[LabeledPoint]): this.type
  def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
} 

and we can hide the implementation from public interfaces. Please let me know whether this sounds good to you.
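The fit-then-transform contract can be illustrated with a toy pure-Scala stand-in, where plain arrays take the place of RDD[LabeledPoint] (class and method names here are hypothetical, not part of the actual proposal):

```scala
// fit computes state (the selected indices); transform applies it.
class TopKSelector(k: Int) {
  private var selected: Array[Int] = Array.empty
  def fit(scores: Array[Double]): this.type = {
    // keep the indices of the k highest scores, in ascending index order
    selected = scores.zipWithIndex.sortBy(-_._1).take(k).map(_._2).sorted
    this
  }
  def transform(row: Array[Double]): Array[Double] =
    selected.map(i => row(i))
}
```

Note that `transform` is a no-op filter until `fit` has been called, which is exactly the misuse window discussed below.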

@avulanov
Contributor Author

avulanov commented Aug 4, 2014

@mengxr

  1. Do I understand correctly that you propose that fit(dataset: RDD[LabeledPoint]) should compute feature scores according to the feature selection algorithm, and transform(dataset: RDD[LabeledPoint]) should return the filtered dataset?
  2. It seems that such an interface allows misuse when someone calls transform before fit. In a sense this is similar to calling predict before actually training the model. MLlib's classification models avoid this by means of the ClassificationModel interface, which has predict only; each individual classifier has a companion object that returns an instance (and does the training as well). I like this approach more because it is less error-prone from the user's perspective, but it is a little implicit from the developer's perspective (you need to know to implement a factory). Long story short, why not seal fit inside the constructor or inside a companion object?
trait FeatureSelector extends Serializable {
  def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
}
// EITHER: fit in the constructor
class ChiSquaredFeatureSelector(dataset: RDD[LabeledPoint], numFeatures: Int) extends FeatureSelector {
  // perform chi-squared computations here...
  override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint] = ???
}
// OR: fit in a companion object (like the classification models)
class ChiSquaredFeatureSelector extends FeatureSelector {
  private def fit(dataset: RDD[LabeledPoint]): Unit = ???
  override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint] = ???
}
object ChiSquaredFeatureSelector {
  def fit(dataset: RDD[LabeledPoint], numFeatures: Int): ChiSquaredFeatureSelector = {
    val chi = new ChiSquaredFeatureSelector
    chi.fit(dataset)
    chi
  }
}

@mengxr
Contributor

mengxr commented Aug 4, 2014

@avulanov I have the same concern about calling transform before fit. There are two options: 1) throw an error, or 2) fit on the same dataset and then transform (fit_transform in scikit-learn). But I don't have a strong preference for either one.

I want to add another candidate to what you proposed:

class ChiSquaredFeatureSelection {
   def fit(dataset: RDD[LabeledPoint], numFeatures: Int): ChiSquaredFeatureSelector
}

class ChiSquaredFeatureSelector {
  def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
}

We can discuss the class hierarchy later since they are not user-facing.

A problem with all the candidates here is we cannot apply the same transformation on RDD[Vector], which is required for prediction. I'm thinking about something like the following:

class ChiSquaredFeatureSelection {
   def fit[T <: Vectorized with Labeled](dataset: RDD[T], numFeatures: Int): ChiSquaredFeatureSelector
}

class ChiSquaredFeatureSelector {
  def transform[T <: Vectorized](dataset: RDD[T]): RDD[T]
}

@avulanov
Contributor Author

avulanov commented Aug 5, 2014

@mengxr

  1. I also have concerns regarding the two options mentioned. Throwing an error means having a method that fails when called with valid parameters. Calling fit inside transform raises the question of what the next fit call would do.
  2. Could you explain how an upper bound like [T <: Vectorized with Labeled] can be implemented? LabeledPoint is a case class with no class hierarchy or traits.
  3. It seems that all implementations of transform will do the same thing: filter features by index. I propose to implement such a filter. It will also solve the problem of filtering both LabeledPoint and Vector:
trait FeatureFilter {
  val indices: Set[Int]
  def transform(data: RDD[LabeledPoint]): RDD[LabeledPoint] =
    data.map { lp => new LabeledPoint(lp.label, Compress(lp.features, indices)) }
  def transform(data: RDD[Vector]): RDD[Vector] =
    data.map { v => Compress(v, indices) }
}

object Compress {
  def apply(features: Vector, indices: Set[Int]): Vector = {
    val (values, _) =
      features.toArray.zipWithIndex.filter { case (_, index) =>
        indices.contains(index)
      }.unzip
    Vectors.dense(values.toArray)
  }
}

class ChiSquaredFeatureSelection(data: RDD[LabeledPoint], numFeatures: Int) extends FeatureFilter {
  // compute chi-squared and select the feature indices
  override val indices: Set[Int] = ???
}
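The index-based filtering itself can be checked without Spark vectors; a plain-array equivalent (illustrative only, not the PR's code):

```scala
// Keep only the feature positions listed in `indices`.
def compress(features: Array[Double], indices: Set[Int]): Array[Double] =
  features.zipWithIndex.collect { case (v, i) if indices.contains(i) => v }
```

The mllib version is the same idea wrapped in `features.toArray` / `Vectors.dense`.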


@avulanov
Contributor Author

avulanov commented Aug 7, 2014

@mengxr Btw, discretization is needed for feature selection. Do you plan to merge https://issues.apache.org/jira/browse/SPARK-1303?

@mengxr
Contributor

mengxr commented Aug 7, 2014

  1. [SPARK-2852][MLLIB] Separate model from IDF/StandardScaler algorithms #1814 separates model from algorithm.
  2. It is just an idea such that a feature transformer doesn't need to worry about other data in the instance.
  3. It won't work, because both transform methods have the same type signature after erasure. This is also related to 2): we want to apply the filter to the vector, and we don't care about the other data in the instance, labeled or not. I'm now working on the API design to make feature transformation easier. Let's discuss more after v1.1.

Btw, I will re-visit the discretization PR after v1.1 to make sure it doesn't have performance issues.

@avulanov
Contributor Author

avulanov commented Aug 7, 2014

@mengxr

  1. Looks good. I should probably implement LabeledPointTransformer, analogous to VectorTransformer, and then implement ChiSquared, which returns a ChiSquaredModel after calling fit; the latter will extend both transformers. One thing only: I need a different method name for transform in LabeledPointTransformer.
  2. Regarding your points 2) and 3): OK!
  3. I will test discretization in a few days as well. I need it for the project I am working on.

@mengxr
Contributor

mengxr commented Sep 26, 2014

@avulanov In 1.1, we have chiSqTest implemented in mllib.stat.Statistics. Could you update this PR to use chiSqTest and implement ChiSqSelector under mllib.feature, similar to StandardScaler and IDF? Thanks!

For the transformer name: chi-squared is a heavily overloaded term, so I suggest ChiSqSelector to be more precise.
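The selection step on top of the per-feature test statistics (Statistics.chiSqTest returns one result per feature) can be sketched like this; a sketch under that assumption, not the PR's final code:

```scala
// Keep the numTopFeatures indices with the largest chi-squared statistics,
// returned in ascending index order so the output vector layout is stable.
def selectTop(statistics: Array[Double], numTopFeatures: Int): Array[Int] =
  statistics.zipWithIndex
    .sortBy { case (stat, _) => -stat }
    .take(numTopFeatures)
    .map { case (_, index) => index }
    .sorted
```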

@avulanov
Contributor Author

@mengxr Sure! Thanks for the suggestion.

@SparkQA

SparkQA commented Nov 12, 2014

Test build #23232 has started for PR 1484 at commit f660728.

  • This patch merges cleanly.

@avulanov
Contributor Author

@mengxr Just to clarify: I'll implement ChiSqSelector with the method fit(data: RDD[LabeledPoint]): ChiSqSelectorModel. The latter extends VectorTransformer. Right?

@SparkQA

SparkQA commented Nov 12, 2014

Test build #23232 has finished for PR 1484 at commit f660728.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ChiSquaredFeatureSelection(labeledDiscreteData: RDD[LabeledPoint], numTopFeatures: Int)
    • trait FeatureSelection[T] extends java.io.Serializable
    • sealed trait FeatureFilter[T] extends FeatureSelection[T]
    • trait LabeledPointFeatureFilter extends FeatureFilter[LabeledPoint]

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23232/

@mengxr
Contributor

mengxr commented Nov 12, 2014

@avulanov We have ChiSq tests implemented under "mllib.stat.Statistics":

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166

Could you please call the method there and select top features based on the test statistics? That way we keep the ChiSq implementation in a single place.

@avulanov
Contributor Author

@mengxr ChiSqTest is private to the stat package, so I cannot access it from the feature package: private[stat] object ChiSqTest. Should I change [stat] to [mllib], or should I make it public?

@mengxr
Contributor

mengxr commented Nov 13, 2014

No, Statistics.chiSqTest is public. Please check the link I mentioned above.

@avulanov
Contributor Author

Ok, thanks! Sorry, I didn't understand the API at first sight :)

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23329 has started for PR 1484 at commit e972e07.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23329 has finished for PR 1484 at commit e972e07.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23329/

@avulanov
Contributor Author

@mengxr For some reason I cannot see the trace of the build; it seems I need to log in to Jenkins, but I don't have an account there.

@mengxr
Contributor

mengxr commented Nov 13, 2014

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23329/console

I saw sbt/sbt mllib/test:compile fail.

@avulanov
Contributor Author

avulanov commented Dec 5, 2014

@mengxr Could you suggest why the test fails?

@SparkQA

SparkQA commented Jan 8, 2015

Test build #559 has started for PR 1484 at commit e972e07.

  • This patch merges cleanly.

@mengxr
Contributor

mengxr commented Jan 31, 2015

test this please

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26451 has started for PR 1484 at commit a6ad82a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26451 has finished for PR 1484 at commit a6ad82a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ChiSqSelector (val numTopFeatures: Int)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26451/

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

import scala.collection.mutable.ArrayBuilder
Contributor


organize imports (if you use IntelliJ IDEA, there is a useful plugin: https://plugins.jetbrains.com/plugin/7350)

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26524 has started for PR 1484 at commit 755d358.

  • This patch merges cleanly.

@avulanov
Contributor Author

avulanov commented Feb 2, 2015

@mengxr Thank you for your comments! Done! Do you have any plans to add feature discretization capabilities to MLlib? There are a few links at the top of this thread.

@mengxr
Contributor

mengxr commented Feb 2, 2015

LGTM pending Jenkins ...

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26524 has finished for PR 1484 at commit 755d358.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ChiSqSelectorModel (val selectedFeatures: Array[Int]) extends VectorTransformer
    • class ChiSqSelector (val numTopFeatures: Int)
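A sketch of how the merged API might be used, assuming a live SparkContext `sc` and discretized feature values (not executed here; the toy data is illustrative):

```scala
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// assumes an existing SparkContext `sc` and discrete features
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 2.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0, 0.0))))

val model = new ChiSqSelector(numTopFeatures = 1).fit(data)  // ChiSqSelectorModel
val reduced = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))
```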

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26524/

@mengxr
Contributor

mengxr commented Feb 2, 2015

Yes, it would be nice to add feature discretization to MLlib. We had a PR, but as you've found, it doesn't scale well. I don't have concrete scalable algorithms in mind right now. We can discuss more on the JIRA page.

@mengxr
Contributor

mengxr commented Feb 2, 2015

Merged into master. Thanks!

@asfgit asfgit closed this in c081b21 Feb 2, 2015
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…lling policy (apache#1484)

### What changes were proposed in this pull request?

This PR aims to support two new executor rolling policies.
- `PEAK_JVM_ONHEAP_MEMORY` policy chooses an executor with the biggest peak JVM on-heap memory.
- `PEAK_JVM_OFFHEAP_MEMORY` policy chooses an executor with the biggest peak JVM off-heap memory.

### Why are the changes needed?

Although peak memory is a kind of historic value, these two new policies add a capability to maintain the memory usage of Spark jobs minimally as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, but this is a new feature.

### How was this patch tested?

Pass the CIs.

Closes apache#37418 from dongjoon-hyun/SPARK-39987.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 3df7124)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 84cd907)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>