
[SPARK-17772][ML][TEST] Add test functions for ML sample weights #15721

Closed
wants to merge 4 commits

Conversation

@sethah (Contributor) commented Nov 1, 2016

What changes were proposed in this pull request?

More and more ML algos are accepting sample weights, and they have been tested rather heterogeneously and with code duplication. This patch adds extensible helper methods to MLTestingUtils that can be reused by various algorithms accepting sample weights. So far, a few tests have been commonly implemented:

  • Check that oversampling is the same as giving the instances sample weights proportional to the number of samples
  • Check that outliers with tiny sample weights do not affect the algorithm's performance

This patch adds an additional test:

  • Check that algorithms are invariant to constant scaling of the sample weights, i.e. uniform sample weights with w_i = 1.0 are effectively the same as uniform sample weights with w_i = 10000 or w_i = 0.0001

The instances of these tests occurred in LinearRegression, NaiveBayes, and LogisticRegression. Those tests have been removed/modified to use the new helper methods. These helper functions will be of use when SPARK-9478 is implemented.
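
To give a concrete picture of the kind of helper being added, here is a minimal sketch of the constant-scaling check (a hedged illustration, not the exact code in MLTestingUtils; the method name, the weight values, and the assumption that the estimator's label/features/weight columns are already set to "label"/"features"/"weight" are illustrative):

```scala
import org.apache.spark.ml.{Estimator, Model}
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.lit

// Fit the same estimator with several constant weight values and verify
// that the resulting models are equivalent up to the given comparison.
def checkConstantWeightScaling[M <: Model[M], E <: Estimator[M]](
    data: Dataset[LabeledPoint],
    estimator: E,
    modelEquals: (M, M) => Unit): Unit = {
  val models = Seq(0.001, 1.0, 1000.0).map { w =>
    estimator.fit(data.withColumn("weight", lit(w)))
  }
  models.zip(models.tail).foreach { case (m1, m2) => modelEquals(m1, m2) }
}
```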

How was this patch tested?

This patch only involves modifying test suites.

Other notes

Both IsotonicRegression and GeneralizedLinearRegression also extend HasWeightCol. I did not modify those test suites, both to keep this patch easier to review and because they do not duplicate the same tests as the three modified suites. If we want to change them later, we can create a JIRA for it now; this is open for debate.

@sethah (Contributor Author) commented Nov 1, 2016

This issue also brings up a more general point of how sample weights should be tested. It seems there are some common rules that all algorithms incorporating sample weights are thought to follow (mentioned above), but there are also algorithm-specific details in some cases. I am of the opinion that we ought to use a mixture of the two: common sample weight tests and algorithm-specific tests. This is the approach followed in linear/logistic/generalized linear regression, which all compare weighted datasets to R's output. Other algorithms that can't be compared directly with R will make use of the common helper functions. Random forests and decision trees will also heavily incorporate these into their testing.

I would appreciate others' thoughts on this issue.

@SparkQA commented Nov 1, 2016

Test build #67922 has finished for PR 15721 at commit e10be45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah (Contributor Author) commented Dec 19, 2016

ping @jkbradley @MLnick @yanboliang

@yanboliang (Contributor)

Sorry for the late response; I like this change and will make a pass tomorrow. Thanks.

@yanboliang (Contributor) left a comment

@sethah I made a pass and left some comments. Thanks for working on it.

* @param noiseLevel A number in [0.0, 1.0] indicating how much noise to add to the label.
* @return Generated sequence of noisy instances.
*/
def generateNoisyData(
yanboliang (Contributor):

I am a bit unsure whether we should provide this general noisy data generation function:

  • It would be better to generate data following the rules of specific algorithms; for example, for LogisticRegression, users provide the coefficients and the mean and variance of the generated features.
  • Some generators, such as LinearDataGenerator.generateLinearInput, already account for the noise level.

Just like LinearDataGenerator.generateLinearInput, I think we should add an argument eps to other generators such as LogisticRegressionSuite.generateLogisticInput, LogisticRegressionSuite.generateMultinomialLogisticInput, and NaiveBayesSuite.generateNaiveBayesInput, so they output noisy labels natively.
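
For illustration, something like the following for the binary classification generators (a hypothetical sketch; the helper name and signature are made up, and the actual change could simply thread eps through the existing generators; roughly speaking, generateLinearInput already uses eps to scale Gaussian noise added to the continuous label):

```scala
import scala.util.Random

// Hypothetical helper: flip a 0/1 label with probability eps so a
// generator can emit noisy labels natively.
def noisyBinaryLabel(cleanLabel: Double, eps: Double, rng: Random): Double = {
  if (rng.nextDouble() < eps) 1.0 - cleanLabel else cleanLabel
}
```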

sethah (Contributor Author):

Fair point. Actually, the noise is not strictly necessary for this patch in the other cases. I can use the existing datasets (for the most part). I removed this generator and passed the test data to the testing util methods.

modelEquals: (M, M) => Unit,
seed: Long): Unit = {
import spark.implicits._
val df = generateNoisyData(numPoints, numClasses, numFeatures, categoricalFeaturesInfo,
yanboliang (Contributor):

If we add noise in the native data generators (see my comment above), we should remove this line and pass in the generated dataset (which already includes noise) directly.

sethah (Contributor Author):

done

* to assigning a sample weight proportional to the number of samples for each point.
*/
def testOversamplingVsWeighting[M <: Model[M], E <: Estimator[M]](
spark: SparkSession,
yanboliang (Contributor):

Indent.

sethah (Contributor Author):

done

* model despite the outliers.
*/
def testOutliersWithSmallWeights[M <: Model[M], E <: Estimator[M]](
spark: SparkSession,
yanboliang (Contributor):

Indent.

sethah (Contributor Author):

done

import spark.implicits._
val df = generateNoisyData(numPoints, numClasses, numFeatures, categoricalFeaturesInfo,
seed).toDF()
val outlierFunction = getRandomLinearPredictionFunction(numFeatures, numClasses, seed - 1)
yanboliang (Contributor):

I'd prefer to implement outlierFunction as a simple function, such as:

  • class -> numClass - class - 1 for classification.
  • label -> -label for regression.

which would be more intuitive and easier for developers/contributors to understand.
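
In code, something like this (illustrative names, not the ones in the final patch):

```scala
// Classification: map each class label to its "mirror" class.
val numClasses = 3 // illustrative
val flipClass: Double => Double = label => numClasses - label - 1

// Regression: negate the label.
val flipLabel: Double => Double = label => -label
```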

sethah (Contributor Author):

Yeah, this works too and is simpler. Thanks!

.set(estimator.labelCol, "label")
.set(estimator.featuresCol, "features")
.set(estimator.weightCol, "weight")
val models = Seq(0.001, 1.0, 1000.0).map { w =>
yanboliang (Contributor):

I think Seq(1.0, 1000.0) should be enough.

sethah (Contributor Author):

I disagree. 1.0 and 1000.0 are both integer-valued, and I have previously run into algorithms that did not properly handle fractional weights. With these three values we cover (tiny, unit, large) weights as well as (fractional, integer) weights.

yanboliang (Contributor):

Yeah, I understand your concern, and I remember it came up in #16149. Makes sense.

@sethah (Contributor Author) commented Dec 21, 2016

@yanboliang Thanks for reviewing, I addressed your comments.

@@ -47,6 +49,11 @@ class LinearRegressionSuite
datasetWithDenseFeature = sc.parallelize(LinearDataGenerator.generateLinearInput(
intercept = 6.3, weights = Array(4.7, 7.2), xMean = Array(0.9, -1.3),
xVariance = Array(0.7, 1.2), nPoints = 10000, seed, eps = 0.1), 2).map(_.asML).toDF()

weightedDatasetWithDenseFeature = sc.parallelize(LinearDataGenerator.generateLinearInput(
sethah (Contributor Author):

I added this small dataset with a higher noise value for weighted testing. It's necessary because when we test oversampling vs weighting, we need the noise to be high enough that the model learns incorrect coefficients when the weights are not applied. The coefficients used to generate each point are the same, but some points are emphasized more with weights. This dataset needs to be small enough and have enough noise that it doesn't still learn the true coefficients when the weights are not applied, if that makes sense.

yanboliang (Contributor):

datasetWithStrongNoise? I think weighted* is really misleading.

sethah (Contributor Author):

Done.

@SparkQA commented Dec 21, 2016

Test build #70481 has finished for PR 15721 at commit 8f287f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor) left a comment

Only some minor comments, otherwise, looks good. Thanks.

import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.feature.{Instance, LabeledPoint}
import org.apache.spark.ml.linalg.{BLAS, DenseMatrix, DenseVector, Vector, Vectors}
yanboliang (Contributor):

BLAS, DenseMatrix, DenseVector were not used and can be removed.

sethah (Contributor Author):

done

* Given a dataframe, generate two output dataframes: one having the original rows oversampled
* an integer number of times, and one having the original rows but with a column of weights
* proportional to the number of oversampled instances in the oversampled dataframe.
*/
yanboliang (Contributor):

Nit: dataframe -> DataFrame for all occurrences.

sethah (Contributor Author):

done

* Given a dataframe, generate two output dataframes: one having the original rows oversampled
* an integer number of times, and one having the original rows but with a column of weights
* proportional to the number of oversampled instances in the oversampled dataframe.
*/
def genEquivalentOversampledAndWeightedInstances(
yanboliang (Contributor):

For this and the following three functions, some take data: DataFrame, labelCol: String, featuresCol as arguments, while others take data: Dataset[LabeledPoint] or data: Dataset[Instance]. Could we make the arguments consistent? I prefer the latter.

sethah (Contributor Author):

I made them all take Dataset[LabeledPoint]. Good suggestion.
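
Roughly, the oversampling/weighting generation now looks like this (a simplified sketch with a fixed, hypothetical replication factor, not the helper's exact implementation):

```scala
import org.apache.spark.ml.feature.LabeledPoint
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.lit

// Build an oversampled copy (every row replicated `factor` times) and an
// equivalent weighted copy (every row once, with weight = factor).
def genOversampledAndWeighted(
    data: Dataset[LabeledPoint],
    factor: Int): (DataFrame, DataFrame) = {
  import data.sparkSession.implicits._
  val oversampled = data.flatMap(p => Seq.fill(factor)(p)).toDF()
  val weighted = data.toDF().withColumn("weight", lit(factor.toDouble))
  (oversampled, weighted)
}
```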

* model despite the outliers.
*/
def testOutliersWithSmallWeights[M <: Model[M], E <: Estimator[M]](
ds: Dataset[Instance],
yanboliang (Contributor):

I'd prefer to change this to data: Dataset[LabeledPoint] (passing in the dataset without weights) and move .withColumn("weight", lit(1.0)) (which is currently duplicated in each algorithm's test cases) inside this function.

sethah (Contributor Author):

done

assert(label === pred)
test("logistic regression with sample weights") {
def modelEquals(m1: LogisticRegressionModel, m2: LogisticRegressionModel): Unit = {
assert(m1.coefficientMatrix ~== m2.coefficientMatrix absTol 0.01)
yanboliang (Contributor):

Do we also need to check interceptVector?

sethah (Contributor Author):

Done, thanks.
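
For reference, the comparison now checks both, roughly like this (the tolerance shown is illustrative):

```scala
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.util.TestingUtils._

// Compare both the coefficient matrix and the intercept vector.
def modelEquals(m1: LogisticRegressionModel, m2: LogisticRegressionModel): Unit = {
  assert(m1.coefficientMatrix ~== m2.coefficientMatrix absTol 0.05)
  assert(m1.interceptVector ~== m2.interceptVector absTol 0.05)
}
```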


)
testParams.foreach { case (family, dataset) =>
// NaiveBayes is sensitive to constant scaling of the weights unless smoothing is set to 0
val estimator = new NaiveBayes().setSmoothing(0.0).setModelType(family)
yanboliang (Contributor):

I think it's not practical to set smoothing to 0.0, so it's better to test NB with a non-zero smoothing value. If that doesn't work for testArbitrarilyScaledWeights, we can omit it. Or we could create two estimators, with smoothing values of 0.0 and 1.0 respectively, for the different test functions.

sethah (Contributor Author):

I think the test with smoothing set to 0.0 is a nice check on the weighting algorithm for Naive Bayes, so I prefer to keep it. I made separate smoothing/no-smoothing estimators.
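
Roughly along these lines (a sketch; the family value and the non-zero smoothing of 1.0 are illustrative):

```scala
import org.apache.spark.ml.classification.NaiveBayes

val family = "multinomial" // illustrative

// No smoothing: arbitrary constant scaling of the weights should leave the model unchanged.
val nbNoSmoothing = new NaiveBayes().setSmoothing(0.0).setModelType(family)

// Non-zero smoothing: used for the other sample-weight tests.
val nbWithSmoothing = new NaiveBayes().setSmoothing(1.0).setModelType(family)
```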

Array(0.10, 0.10, 0.70, 0.10) // label 2
Array(0.30, 0.30, 0.30, 0.30), // label 0
Array(0.30, 0.30, 0.30, 0.30), // label 1
Array(0.40, 0.40, 0.40, 0.40) // label 2
yanboliang (Contributor):

Could you let me know why you changed this?

sethah (Contributor Author):

Yeah, this is changed so that, when we set smoothing to zero for the weighted tests, we don't get theta values of negative infinity (theta holds log probabilities, so an unsmoothed zero count becomes log(0)).

@SparkQA commented Dec 27, 2016

Test build #70621 has finished for PR 15721 at commit 9f2eaf3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Dec 27, 2016

Test build #70623 has finished for PR 15721 at commit f2dbdd9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang (Contributor)

LGTM, merged into master. Thanks.

@asfgit asfgit closed this in 6a475ae Dec 28, 2016
@sethah (Contributor Author) commented Dec 28, 2016

Thanks @yanboliang!

cmonkey pushed a commit to cmonkey/spark that referenced this pull request Dec 29, 2016
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017