
[MLLIB] SPARK-5491 (ex SPARK-1473): Chi-square feature selection #1484

Closed
wants to merge 16 commits

Conversation

avulanov
Contributor

The following is implemented:

  1. generic traits for feature selection and filtering
  2. trait for feature selection of LabeledPoint with discrete data
  3. traits for calculation of contingency table and chi squared
  4. class for chi-squared feature selection
  5. tests for the above

Needs some optimization in matrix operations.

This pull request is an attempt to implement feature selection for MLlib; the previous work by the issue author @izendejas was not finished (https://issues.apache.org/jira/browse/SPARK-1473). It is also related to the data discretization issues https://issues.apache.org/jira/browse/SPARK-1303 and https://issues.apache.org/jira/browse/SPARK-1216, which weren't merged.
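For reference, the chi-squared statistic that this kind of selector ranks features by can be sketched in plain Scala, assuming a dense contingency table of observed counts (the function name and table layout here are illustrative, not the PR's actual code):

```scala
// Chi-squared statistic for one feature against the class label.
// observed(i)(j) = number of samples with feature value i and label j.
def chiSquared(observed: Array[Array[Double]]): Double = {
  val rowSums = observed.map(_.sum)
  val colSums = observed.transpose.map(_.sum)
  val total   = rowSums.sum
  var stat = 0.0
  for (i <- observed.indices; j <- observed(i).indices) {
    val expected = rowSums(i) * colSums(j) / total // count under independence
    val diff = observed(i)(j) - expected
    stat += diff * diff / expected
  }
  stat
}
```

A higher statistic means the feature's distribution deviates more from independence with the label, so the top-scoring features are the ones kept.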

@SparkQA

SparkQA commented Jul 18, 2014

QA tests have started for PR 1484. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16826/consoleFull

@SparkQA

SparkQA commented Jul 18, 2014

QA results for PR 1484:
- This patch FAILED unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class ChiSquaredFeatureSelection(labeledDiscreteData: RDD[LabeledPoint], numTopFeatures: Int)
trait FeatureSelection[T] extends java.io.Serializable {
sealed trait FeatureFilter[T] extends FeatureSelection[T] {
trait LabeledPointFeatureFilter extends FeatureFilter[LabeledPoint] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16826/consoleFull

@SparkQA

SparkQA commented Jul 18, 2014

QA tests have started for PR 1484. This patch merges cleanly.
View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16827/consoleFull

@SparkQA

SparkQA commented Jul 18, 2014

QA results for PR 1484:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds the following public classes (experimental):
class ChiSquaredFeatureSelection(labeledDiscreteData: RDD[LabeledPoint], numTopFeatures: Int)
trait FeatureSelection[T] extends java.io.Serializable {
sealed trait FeatureFilter[T] extends FeatureSelection[T] {
trait LabeledPointFeatureFilter extends FeatureFilter[LabeledPoint] {

For more information see test output:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/16827/consoleFull

@avulanov
Contributor Author

avulanov commented Aug 4, 2014

@mengxr Could you review or comment on this? Thanks!

@mengxr
Contributor

mengxr commented Aug 4, 2014

Sure. We have some transformers implemented under mllib.feature, similar to scikit-learn's approach. For feature selection, we can follow the same approach if we view feature selection as a transformation: 1) fit a dataset and select a subset of features, 2) transform a dataset by picking out the selected features. So for the API, I suggest the following:

class ChiSquaredFeatureSelector(numFeatures: Int) extends Serializable {
  def fit(dataset: RDD[LabeledPoint]): this.type
  def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
} 

and we can hide the implementation from public interfaces. Please let me know whether this sounds good to you.
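The fit-then-transform contract can be illustrated with a toy pure-Scala stand-in, where plain arrays take the place of RDD[LabeledPoint] (class and method names here are hypothetical, not part of the actual proposal):

```scala
// fit computes state (the selected indices); transform applies it.
class TopKSelector(k: Int) {
  private var selected: Array[Int] = Array.empty
  def fit(scores: Array[Double]): this.type = {
    // keep the indices of the k highest scores, in ascending index order
    selected = scores.zipWithIndex.sortBy(-_._1).take(k).map(_._2).sorted
    this
  }
  def transform(row: Array[Double]): Array[Double] =
    selected.map(i => row(i))
}
```

Note that `transform` is a no-op filter until `fit` has been called, which is exactly the misuse window discussed below.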

@avulanov
Contributor Author

avulanov commented Aug 4, 2014

@mengxr

  1. Do I understand correctly that you propose that fit(dataset: RDD[LabeledPoint]) should compute feature scores according to the feature selection algorithm, and transform(dataset: RDD[LabeledPoint]) should return the filtered dataset?
  2. It seems that such an interface allows misuse when someone calls transform before fit. In a sense this is similar to calling predict before actually training the model. MLlib's classification models avoid this by means of the ClassificationModel interface, which has predict only; each individual classifier has a companion object that returns an instance (and does the training as well). I like this approach more because it is less error-prone from the user's perspective, but it is a little implicit from the developer's perspective (you need to know to implement a factory). Long story short, why not seal fit inside the constructor or inside a companion object?
trait FeatureSelector extends Serializable {
  def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
}
// EITHER: fit in the constructor
class ChiSquaredFeatureSelector(dataset: RDD[LabeledPoint], numFeatures: Int) extends FeatureSelector {
  // perform chi-squared computations here...
  override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint] = ???
}
// OR: fit in a companion object (like the classification models)
class ChiSquaredFeatureSelector extends FeatureSelector {
  private def fit(dataset: RDD[LabeledPoint]): Unit = ???
  override def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint] = ???
}
object ChiSquaredFeatureSelector {
  def fit(dataset: RDD[LabeledPoint], numFeatures: Int): ChiSquaredFeatureSelector = {
    val chi = new ChiSquaredFeatureSelector
    chi.fit(dataset)
    chi
  }
}

@mengxr
Contributor

mengxr commented Aug 4, 2014

@avulanov I have the same concern about calling transform before fit. There are two options: 1) throw an error, or 2) fit on the same dataset and then transform (fit_transform in scikit-learn). But I don't have a strong preference for either one.

I want to add another candidate to what you proposed:

class ChiSquaredFeatureSelection {
   def fit(dataset: RDD[LabeledPoint], numFeatures: Int): ChiSquaredFeatureSelector
}

class ChiSquaredFeatureSelector {
  def transform(dataset: RDD[LabeledPoint]): RDD[LabeledPoint]
}

We can discuss the class hierarchy later since they are not user-facing.

A problem with all the candidates here is we cannot apply the same transformation on RDD[Vector], which is required for prediction. I'm thinking about something like the following:

class ChiSquaredFeatureSelection {
   def fit[T <: Vectorized with Labeled](dataset: RDD[T], numFeatures: Int): ChiSquaredFeatureSelector
}

class ChiSquaredFeatureSelector {
  def transform[T <: Vectorized](dataset: RDD[T]): RDD[T]
}

@avulanov
Contributor Author

avulanov commented Aug 5, 2014

@mengxr

  1. I also have concerns regarding the two options mentioned. Throwing an error means having a method that fails when called with valid parameters. Calling fit inside transform raises the question of what the next fit call would do.
  2. Could you explain how an upper bound like [T <: Vectorized with Labeled] can be implemented? LabeledPoint is a case class with no class hierarchy or traits.
  3. It seems that all implementations of transform will do the same thing: filter features by index. I propose to implement such a filter. It will also solve the problem of filtering both LabeledPoint and Vector:
trait FeatureFilter {
  val indices: Set[Int]
  def transform(data: RDD[LabeledPoint]): RDD[LabeledPoint] =
    data.map { lp => new LabeledPoint(lp.label, Compress(lp.features, indices)) }
  def transform(data: RDD[Vector]): RDD[Vector] =
    data.map { v => Compress(v, indices) }
}

object Compress {
  def apply(features: Vector, indices: Set[Int]): Vector = {
    val (values, _) =
      features.toArray.zipWithIndex.filter { case (_, index) =>
        indices.contains(index)
      }.unzip
    Vectors.dense(values.toArray)
  }
}

class ChiSquaredFeatureSelection(data: RDD[LabeledPoint], numFeatures: Int) extends FeatureFilter {
  // compute chi-squared and select the feature indices
  override val indices: Set[Int] = ???
}
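The index-based filtering itself can be checked without Spark vectors; a plain-array equivalent (illustrative only, not the PR's code):

```scala
// Keep only the feature positions listed in `indices`.
def compress(features: Array[Double], indices: Set[Int]): Array[Double] =
  features.zipWithIndex.collect { case (v, i) if indices.contains(i) => v }
```

The mllib version is the same idea wrapped in `features.toArray` / `Vectors.dense`.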


@avulanov
Contributor Author

avulanov commented Aug 7, 2014

@mengxr Btw, discretization is needed for feature selection. Do you plan to merge https://issues.apache.org/jira/browse/SPARK-1303?

@mengxr
Contributor

mengxr commented Aug 7, 2014

  1. [SPARK-2852][MLLIB] Separate model from IDF/StandardScaler algorithms #1814 separates model from algorithm.
  2. It is just an idea such that a feature transformer doesn't need to worry about other data in the instance.
  3. It won't work, because both transform methods have the same type signature after erasure. This is also related to 2): we want to apply the filter to the vector, and we don't care about the other data in the instance, labeled or not. I'm now working on the API design to make feature transformation easier. Let's discuss more after v1.1.

Btw, I will re-visit the discretization PR after v1.1 to make sure it doesn't have performance issues.

@avulanov
Contributor Author

avulanov commented Aug 7, 2014

@mengxr

  1. Looks good. I should probably implement LabeledPointTransformer, analogous to VectorTransformer, and then implement ChiSquared, which returns a ChiSquaredModel after calling fit; the latter will extend both transformers. One thing only: I need a different method name for transform in LabeledPointTransformer.
  2. Regarding your points 2) and 3): OK!
  3. I will test discretization in a few days as well. I need it for the project I am working on.

@mengxr
Contributor

mengxr commented Sep 26, 2014

@avulanov In 1.1, we have chiSqTest implemented in mllib.stat.Statistics. Could you update this PR to use chiSqTest and implement ChiSqSelector under mllib.feature, similar to StandardScaler and IDF? Thanks!

For the transformer name: chi-squared is a heavily overloaded term, so I suggest ChiSqSelector to be more precise.
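The selection step on top of the per-feature test statistics (Statistics.chiSqTest returns one result per feature) can be sketched like this; a sketch under that assumption, not the PR's final code:

```scala
// Keep the numTopFeatures indices with the largest chi-squared statistics,
// returned in ascending index order so the output vector layout is stable.
def selectTop(statistics: Array[Double], numTopFeatures: Int): Array[Int] =
  statistics.zipWithIndex
    .sortBy { case (stat, _) => -stat }
    .take(numTopFeatures)
    .map { case (_, index) => index }
    .sorted
```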

@avulanov
Contributor Author

@mengxr Sure! Thanks for the suggestion.

@SparkQA

SparkQA commented Nov 12, 2014

Test build #23232 has started for PR 1484 at commit f660728.

  • This patch merges cleanly.

@avulanov
Contributor Author

@mengxr Just to clarify: I'll implement ChiSqSelector with the method fit(data: RDD[LabeledPoint]): ChiSqSelectorModel. The latter extends VectorTransformer. Right?

@SparkQA

SparkQA commented Nov 12, 2014

Test build #23232 has finished for PR 1484 at commit f660728.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ChiSquaredFeatureSelection(labeledDiscreteData: RDD[LabeledPoint], numTopFeatures: Int)
    • trait FeatureSelection[T] extends java.io.Serializable
    • sealed trait FeatureFilter[T] extends FeatureSelection[T]
    • trait LabeledPointFeatureFilter extends FeatureFilter[LabeledPoint]

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23232/

@mengxr
Contributor

mengxr commented Nov 12, 2014

@avulanov We have ChiSq tests implemented under "mllib.stat.Statistics":

https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/stat/Statistics.scala#L166

Could you please call the method there and select top features based on the test statistics? That way we keep the ChiSq implementation in a single place.

@avulanov
Contributor Author

@mengxr ChiSqTest is private to the stat package, so I cannot access it from the feature package: private[stat] object ChiSqTest. Should I change [stat] to [mllib], or should I make it public?

@mengxr
Contributor

mengxr commented Nov 13, 2014

No, Statistics.chiSqTest is public. Please check the link I mentioned above.

@avulanov
Contributor Author

Ok, thanks! Sorry, I didn't understand the API at first sight :)

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23329 has started for PR 1484 at commit e972e07.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 13, 2014

Test build #23329 has finished for PR 1484 at commit e972e07.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ChiSqSelectorModel(indices: IndexedSeq[Int]) extends VectorTransformer

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23329/

@avulanov
Contributor Author

@mengxr For some reason I cannot see the trace of the build; it seems I need to log in to Jenkins, but I don't have an account there.

@mengxr
Contributor

mengxr commented Nov 13, 2014

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23329/console

I saw sbt/sbt mllib/test:compile fail.

@avulanov
Contributor Author

avulanov commented Dec 5, 2014

@mengxr Could you suggest why the test fails?

@SparkQA

SparkQA commented Jan 8, 2015

Test build #559 has started for PR 1484 at commit e972e07.

  • This patch merges cleanly.

@mengxr
Contributor

mengxr commented Jan 31, 2015

test this please

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26451 has started for PR 1484 at commit a6ad82a.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Jan 31, 2015

Test build #26451 has finished for PR 1484 at commit a6ad82a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ChiSqSelector (val numTopFeatures: Int)

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26451/

import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

import scala.collection.mutable.ArrayBuilder
Contributor


organize imports (if you use IntelliJ IDEA, there is a useful plugin: https://plugins.jetbrains.com/plugin/7350)

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26524 has started for PR 1484 at commit 755d358.

  • This patch merges cleanly.

@avulanov
Contributor Author

avulanov commented Feb 2, 2015

@mengxr Thank you for your comments! Done! Do you have any plans to add feature discretization capabilities to MLlib? There are a few links at the top of this thread.

@mengxr
Contributor

mengxr commented Feb 2, 2015

LGTM pending Jenkins ...

@SparkQA

SparkQA commented Feb 2, 2015

Test build #26524 has finished for PR 1484 at commit 755d358.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ChiSqSelectorModel (val selectedFeatures: Array[Int]) extends VectorTransformer
    • class ChiSqSelector (val numTopFeatures: Int)
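A sketch of how the merged API might be used, assuming a live SparkContext `sc` and discretized feature values (not executed here; the toy data is illustrative):

```scala
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// assumes an existing SparkContext `sc` and discrete features
val data = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0, 2.0)),
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0, 0.0))))

val model = new ChiSqSelector(numTopFeatures = 1).fit(data)  // ChiSqSelectorModel
val reduced = data.map(lp => LabeledPoint(lp.label, model.transform(lp.features)))
```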

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26524/

@mengxr
Contributor

mengxr commented Feb 2, 2015

Yes, it would be nice to add feature discretization to MLlib. We had a PR, but as you've found, it doesn't scale well. I don't have concrete scalable algorithms in mind right now. We can discuss more on the JIRA page.

@mengxr
Contributor

mengxr commented Feb 2, 2015

Merged into master. Thanks!

@asfgit asfgit closed this in c081b21 Feb 2, 2015
sunchao pushed a commit to sunchao/spark that referenced this pull request Jun 2, 2023
…lling policy (apache#1484)

### What changes were proposed in this pull request?

This PR aims to support two new executor rolling policies.
- `PEAK_JVM_ONHEAP_MEMORY` policy chooses an executor with the biggest peak JVM on-heap memory.
- `PEAK_JVM_OFFHEAP_MEMORY` policy chooses an executor with the biggest peak JVM off-heap memory.

### Why are the changes needed?

Although peak memory is a kind of historic value, these two new policies add a capability to maintain the memory usage of Spark jobs minimally as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, but this is a new feature.

### How was this patch tested?

Pass the CIs.

Closes apache#37418 from dongjoon-hyun/SPARK-39987.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 3df7124)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 84cd907)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>