
[SPARK-17772][ML][TEST] Add test functions for ML sample weights #15721

Closed
wants to merge 4 commits

Conversation

@sethah (Contributor) commented Nov 1, 2016

What changes were proposed in this pull request?

More and more ML algos are accepting sample weights, and they have been tested rather heterogeneously and with code duplication. This patch adds extensible helper methods to MLTestingUtils that can be reused by various algorithms accepting sample weights. So far, a few tests have been commonly implemented:

  • Check that oversampling is the same as giving the instances sample weights proportional to the number of samples
  • Check that outliers with tiny sample weights do not affect the algorithm's performance

This patch adds an additional test:

  • Check that algorithms are invariant to constant scaling of the sample weights, i.e. uniform sample weights with w_i = 1.0 are effectively the same as uniform sample weights with w_i = 10000 or w_i = 0.0001

The instances of these tests occurred in LinearRegression, NaiveBayes, and LogisticRegression. Those tests have been removed/modified to use the new helper methods. These helper functions will be of use when SPARK-9478 is implemented.
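
To give a concrete picture of the kind of helper being added, here is a minimal sketch of the constant-scaling check (a hedged illustration, not the exact code in MLTestingUtils; the method name, the weight values, and the assumption that the estimator's label/features/weight columns are already set to "label"/"features"/"weight" are illustrative):

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.lit

// Fit the same estimator with several constant weight values and verify
// that the resulting models are equivalent up to the given comparison.
def checkConstantWeightScaling[M <: Model[M], E <: Estimator[M]](
    data: Dataset[LabeledPoint],
    estimator: E,
    modelEquals: (M, M) => Unit): Unit = {
  val models = Seq(0.001, 1.0, 1000.0).map { w =>
    estimator.fit(data.withColumn("weight", lit(w)))
  }
  models.zip(models.tail).foreach { case (m1, m2) => modelEquals(m1, m2) }
}
```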

How was this patch tested?

This patch only involves modifying test suites.

Other notes

Both IsotonicRegression and GeneralizedLinearRegression also extend HasWeightCol. I did not modify those test suites, both to keep this patch easier to review and because they do not duplicate the same tests as the three modified suites. If we want to change them later, we can create a JIRA for it now; this is open for debate.

@sethah (Contributor Author) commented Nov 1, 2016

This issue also brings up a more general point of how sample weights should be tested. It seems there are some common rules that all algorithms incorporating sample weights are thought to follow (mentioned above), but there are also algorithm-specific details in some cases. I am of the opinion that we ought to use a mixture of the two: common sample weight tests and algorithm-specific tests. This is the approach followed in linear/logistic/generalized linear regression, which all compare weighted datasets to R's output. Other algorithms that can't be compared directly with R will make use of the common helper functions. Random forests and decision trees will also heavily incorporate these into their testing.

I would appreciate others' thoughts on this issue.

@SparkQA commented Nov 1, 2016

Test build #67922 has finished for PR 15721 at commit e10be45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor Author) commented Dec 19, 2016

ping @jkbradley @MLnick @yanboliang

@yanboliang (Contributor)

Sorry for the late response; I like this change and will make a pass tomorrow. Thanks.

@yanboliang (Contributor) left a comment

@sethah I made a pass and left some comments. Thanks for working on it.

* @param noiseLevel A number in [0.0, 1.0] indicating how much noise to add to the label.
* @return Generated sequence of noisy instances.
*/
def generateNoisyData(
yanboliang (Contributor):

I am a bit unsure whether we should provide this general noisy data generation function:

  • It would be better to generate data following the rules of specific algorithms; for example, for LogisticRegression, users provide the coefficients and the mean and variance of the generated features.
  • Some generators, such as LinearDataGenerator.generateLinearInput, already account for the noise level.

Just like LinearDataGenerator.generateLinearInput, I think we should add an argument eps to other generators such as LogisticRegressionSuite.generateLogisticInput, LogisticRegressionSuite.generateMultinomialLogisticInput, and NaiveBayesSuite.generateNaiveBayesInput, so they output noisy labels natively.
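
For illustration, something like the following for the binary classification generators (a hypothetical sketch; the helper name and signature are made up, and the actual change could simply thread eps through the existing generators; roughly speaking, generateLinearInput already uses eps to scale Gaussian noise added to the continuous label):

```scala
import scala.util.Random

// Hypothetical helper: flip a 0/1 label with probability eps so a
// generator can emit noisy labels natively.
def noisyBinaryLabel(cleanLabel: Double, eps: Double, rng: Random): Double = {
  if (rng.nextDouble() < eps) 1.0 - cleanLabel else cleanLabel
}
```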

sethah (Contributor Author):

Fair point. Actually, the noise is not strictly necessary for this patch in the other cases. I can use the existing datasets (for the most part). I removed this generator and passed the test data to the testing util methods.

modelEquals: (M, M) => Unit,
seed: Long): Unit = {
import spark.implicits._
val df = generateNoisyData(numPoints, numClasses, numFeatures, categoricalFeaturesInfo,
yanboliang (Contributor):

If we add noise in the native data generators (see my comment above), we should remove this line and pass in the generated dataset (which already includes noise) directly.

sethah (Contributor Author):

done

* to assigning a sample weight proportional to the number of samples for each point.
*/
def testOversamplingVsWeighting[M <: Model[M], E <: Estimator[M]](
spark: SparkSession,
yanboliang (Contributor):

Indent.

sethah (Contributor Author):

done

* model despite the outliers.
*/
def testOutliersWithSmallWeights[M <: Model[M], E <: Estimator[M]](
spark: SparkSession,
yanboliang (Contributor):

Indent.

sethah (Contributor Author):

done

import spark.implicits._
val df = generateNoisyData(numPoints, numClasses, numFeatures, categoricalFeaturesInfo,
seed).toDF()
val outlierFunction = getRandomLinearPredictionFunction(numFeatures, numClasses, seed - 1)
yanboliang (Contributor):

I'd prefer to implement outlierFunction as a simple function, such as:

  • class -> numClass - class - 1 for classification.
  • label -> -label for regression.

which would be more intuitive and easier for developers/contributors to understand.
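
In code, something like this (illustrative names, not the ones in the final patch):

```scala
// Classification: map each class label to its "mirror" class.
val numClasses = 3 // illustrative
val flipClass: Double => Double = label => numClasses - label - 1

// Regression: negate the label.
val flipLabel: Double => Double = label => -label
```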

sethah (Contributor Author):

Yeah, this works too and is simpler. Thanks!

.set(estimator.labelCol, "label")
.set(estimator.featuresCol, "features")
.set(estimator.weightCol, "weight")
val models = Seq(0.001, 1.0, 1000.0).map { w =>
yanboliang (Contributor):

I think Seq(1.0, 1000.0) should be enough.

sethah (Contributor Author):

I disagree. 1.0 and 1000.0 are both integer-valued, and I have previously run into algorithms that did not properly handle fractional weights. With these three values we cover (tiny, unit, large) weights as well as (fractional, integer) weights.

yanboliang (Contributor):

Yeah, I understand your concern, and I remember it came up in #16149. Makes sense.

@sethah (Contributor Author) commented Dec 21, 2016

@yanboliang Thanks for reviewing, I addressed your comments.

@@ -47,6 +49,11 @@ class LinearRegressionSuite
datasetWithDenseFeature = sc.parallelize(LinearDataGenerator.generateLinearInput(
intercept = 6.3, weights = Array(4.7, 7.2), xMean = Array(0.9, -1.3),
xVariance = Array(0.7, 1.2), nPoints = 10000, seed, eps = 0.1), 2).map(_.asML).toDF()

weightedDatasetWithDenseFeature = sc.parallelize(LinearDataGenerator.generateLinearInput(
sethah (Contributor Author):

I added this small dataset with a higher noise value for weighted testing. It's necessary because when we test oversampling vs weighting, we need the noise to be high enough that the model learns incorrect coefficients when the weights are not applied. The coefficients used to generate each point are the same, but some points are emphasized more with weights. This dataset needs to be small enough and have enough noise that it doesn't still learn the true coefficients when the weights are not applied, if that makes sense.

yanboliang (Contributor):

datasetWithStrongNoise? I think weighted* is really misleading.

sethah (Contributor Author):

Done.

@SparkQA commented Dec 21, 2016

Test build #70481 has finished for PR 15721 at commit 8f287f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) left a comment

Only some minor comments, otherwise, looks good. Thanks.

import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.feature.{Instance, LabeledPoint}
import org.apache.spark.ml.linalg.{BLAS, DenseMatrix, DenseVector, Vector, Vectors}
yanboliang (Contributor):

BLAS, DenseMatrix, DenseVector were not used and can be removed.

sethah (Contributor Author):

done

* Given a dataframe, generate two output dataframes: one having the original rows oversampled
* an integer number of times, and one having the original rows but with a column of weights
* proportional to the number of oversampled instances in the oversampled dataframe.
*/
yanboliang (Contributor):

Nit: dataframe -> DataFrame for all occurrences.

sethah (Contributor Author):

done

* Given a dataframe, generate two output dataframes: one having the original rows oversampled
* an integer number of times, and one having the original rows but with a column of weights
* proportional to the number of oversampled instances in the oversampled dataframe.
*/
def genEquivalentOversampledAndWeightedInstances(
yanboliang (Contributor):

For this and the following three functions, some take data: DataFrame, labelCol: String, featuresCol as arguments, while others take data: Dataset[LabeledPoint] or data: Dataset[Instance]. Could we make the arguments consistent? I prefer the latter.

sethah (Contributor Author):

I made them all take Dataset[LabeledPoint]. Good suggestion.
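
Roughly, the oversampling/weighting generation now looks like this (a simplified sketch with a fixed, hypothetical replication factor, not the helper's exact implementation):

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.lit

// Build an oversampled copy (every row replicated `factor` times) and an
// equivalent weighted copy (every row once, with weight = factor).
def genOversampledAndWeighted(
    data: Dataset[LabeledPoint],
    factor: Int): (DataFrame, DataFrame) = {
  import data.sparkSession.implicits._
  val oversampled = data.flatMap(p => Seq.fill(factor)(p)).toDF()
  val weighted = data.toDF().withColumn("weight", lit(factor.toDouble))
  (oversampled, weighted)
}
```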

* model despite the outliers.
*/
def testOutliersWithSmallWeights[M <: Model[M], E <: Estimator[M]](
ds: Dataset[Instance],
yanboliang (Contributor):

I'd prefer to change this to data: Dataset[LabeledPoint] (passing in the dataset without weights) and move .withColumn("weight", lit(1.0)) (which is currently duplicated in each algorithm's test cases) inside this function.

sethah (Contributor Author):

done

assert(label === pred)
test("logistic regression with sample weights") {
def modelEquals(m1: LogisticRegressionModel, m2: LogisticRegressionModel): Unit = {
assert(m1.coefficientMatrix ~== m2.coefficientMatrix absTol 0.01)
yanboliang (Contributor):

Do we also need to check interceptVector?

sethah (Contributor Author):

Done, thanks.
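
For reference, the comparison now checks both, roughly like this (the tolerance shown is illustrative):

```scala
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.util.TestingUtils._

// Compare both the coefficient matrix and the intercept vector.
def modelEquals(m1: LogisticRegressionModel, m2: LogisticRegressionModel): Unit = {
  assert(m1.coefficientMatrix ~== m2.coefficientMatrix absTol 0.05)
  assert(m1.interceptVector ~== m2.interceptVector absTol 0.05)
}
```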


)
testParams.foreach { case (family, dataset) =>
// NaiveBayes is sensitive to constant scaling of the weights unless smoothing is set to 0
val estimator = new NaiveBayes().setSmoothing(0.0).setModelType(family)
yanboliang (Contributor):

I think it's not practical to set smoothing to 0.0, so it's better to test NB with a non-zero smoothing value. If that doesn't work for testArbitrarilyScaledWeights, we can omit it. Or we could create two estimators, with smoothing values of 0.0 and 1.0 respectively, for the different test functions.

sethah (Contributor Author):

I think the test with smoothing set to 0.0 is a nice check on the weighting algorithm for Naive Bayes, so I prefer to keep it. I made separate smoothing/no-smoothing estimators.
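
Roughly along these lines (a sketch; the family value and the non-zero smoothing of 1.0 are illustrative):

```scala
import org.apache.spark.ml.classification.NaiveBayes

val family = "multinomial" // illustrative

// No smoothing: arbitrary constant scaling of the weights should leave the model unchanged.
val nbNoSmoothing = new NaiveBayes().setSmoothing(0.0).setModelType(family)

// Non-zero smoothing: used for the other sample-weight tests.
val nbWithSmoothing = new NaiveBayes().setSmoothing(1.0).setModelType(family)
```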

Array(0.10, 0.10, 0.70, 0.10) // label 2
Array(0.30, 0.30, 0.30, 0.30), // label 0
Array(0.30, 0.30, 0.30, 0.30), // label 1
Array(0.40, 0.40, 0.40, 0.40) // label 2
yanboliang (Contributor):

Could you let me know why you changed this?

sethah (Contributor Author):

Yeah, this is changed so that, when we set smoothing to zero for the weighted tests, we don't get theta values of negative infinity (theta holds log probabilities, so an unsmoothed zero count becomes log(0)).

@SparkQA commented Dec 27, 2016

Test build #70621 has finished for PR 15721 at commit 9f2eaf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 27, 2016

Test build #70623 has finished for PR 15721 at commit f2dbdd9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor)

LGTM, merged into master. Thanks.

@asfgit asfgit closed this in 6a475ae Dec 28, 2016
@sethah (Contributor Author) commented Dec 28, 2016

Thanks @yanboliang!

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Dec 29, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017