[SPARK-14599][ML] BaggedPoint should support sample weights. #12370

sethah · 2016-04-13T22:09:15Z

What changes were proposed in this pull request?

This PR changes BaggedPoint to store both the per-subsample counts and the sample weight of each Datum. Specifically:

- subsampleWeights: Array[Double] is changed to subsampleCounts: Array[Int].
- A sampleWeight: Double field is added to the BaggedPoint constructor.
- convertToBaggedPointRDD gains a parameter: a function that extracts the sample weight from a datum. This will be helpful when we add weights to decision trees, because it lets us extract the instance weight from an RDD[Instance].
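
The resulting shape can be sketched roughly as follows. This is a simplified, Spark-free illustration and not the code in this PR: `BaggedPointSketch` and `toBaggedPoints` are hypothetical stand-ins that operate on a local `Seq` and perform no real subsampling. Only the names `subsampleCounts`, `sampleWeight`, and `extractSampleWeight` come from the description above.

```scala
// Simplified stand-in for BaggedPoint: integer counts per subsample plus one weight.
final case class BaggedPointSketch[Datum](
    datum: Datum,
    subsampleCounts: Array[Int], // how many times datum appears in each subsample
    sampleWeight: Double = 1.0)  // weight of the datum itself

object BaggedPointSketch {
  // Hypothetical local analogue of convertToBaggedPointRDD.
  // Assumption: every datum is counted exactly once per subsample,
  // i.e. no actual sampling takes place in this sketch.
  def toBaggedPoints[Datum](
      input: Seq[Datum],
      numSubsamples: Int,
      extractSampleWeight: Datum => Double = (_: Datum) => 1.0)
    : Seq[BaggedPointSketch[Datum]] =
    input.map { d =>
      BaggedPointSketch(d, Array.fill(numSubsamples)(1), extractSampleWeight(d))
    }
}
```

With the default extractor, every point carries a weight of 1.0, so callers that do not use weights need not change.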

How was this patch tested?

This PR introduces no functional changes, so no tests were added.

sethah · 2016-04-13T22:11:37Z

In a previous "TODO" it was proposed that we could incorporate sample weights by simply multiplying the subsample counts by the sample weight and storing them in an array. I chose not to do this because of the need to have both raw counts and weighted counts when adding weights to decision trees. If we simply store the weighted counts, we lose the information about the raw counts, which makes it impossible to track minInstancesPerNode for decision trees.
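
A small sketch with made-up numbers may make the trade-off concrete. `minInstancesPerNode` is the decision-tree parameter mentioned above; `NodeStats`, its methods, and the example values are hypothetical and exist only for illustration.

```scala
// Hypothetical per-node bookkeeping: (rawCount, sampleWeight) pairs for the
// points that reached a node within one subsample.
final case class NodeStats(points: Seq[(Int, Double)]) {
  // Number of instances, ignoring weights.
  def rawCount: Int = points.map(_._1).sum

  // Sum of count * weight: all that would survive if only weighted counts were stored.
  def weightedCount: Double = points.map { case (c, w) => c * w }.sum

  // A minInstancesPerNode-style check needs the raw count.
  def hasEnoughInstances(minInstancesPerNode: Int): Boolean =
    rawCount >= minInstancesPerNode
}

object NodeStatsExample {
  // One heavily weighted instance.
  val single: NodeStats = NodeStats(Seq((1, 4.0)))
  // Four instances, each with weight 1.0.
  val many: NodeStats = NodeStats(Seq((1, 1.0), (1, 1.0), (1, 1.0), (1, 1.0)))
}
```

Both nodes have the same weighted count, yet only `many` satisfies a threshold of two instances. From the weighted count alone the two cases cannot be told apart.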

SparkQA · 2016-04-13T22:49:14Z

Test build #55748 has finished for PR 12370 at commit a673658.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2016-04-18T14:46:15Z

cc @jkbradley If you get a chance to review it would be much appreciated.

sethah · 2016-04-20T15:54:31Z

cc @MLnick could you take a look? This is blocking SPARK-9478 which I have a PR ready to submit for.

sethah · 2016-04-28T20:00:55Z

ping @jkbradley @MLnick

I created this PR and #12374 to make SPARK-9478 easier to review. Alternatively, I could submit them all as one PR. It would be nice to get sample weights for trees into Spark 2.0. Thoughts?

Also ping @holdenk

MLnick · 2016-04-29T09:30:18Z

mllib/src/main/scala/org/apache/spark/ml/tree/impl/BaggedPoint.scala

@@ -60,20 +68,24 @@ private[spark] object BaggedPoint {
      subsamplingRate: Double,
      numSubsamples: Int,
      withReplacement: Boolean,
+      extractSampleWeight: (Datum => Double) = (_: Datum) => 1.0,


Just checking my understanding here: is the intention to support, in the future, something like WeightedTreePoint (or to amend TreePoint to include a weight), which is constructed in turn from Instance rather than LabeledPoint, so that the function passed can be ... => point.weight or similar?

sethah · 2016-04-29

Yes, that is exactly the use case. I could not think of a better way to implement this while still keeping BaggedPoint generic (i.e. not requiring Datum to have a weight property or something similar).
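
As a rough illustration of why passing a function keeps the code generic, consider the sketch below. `WeightedPoint` and `sampleWeights` are hypothetical names and not part of Spark or this PR; only the default extractor `(_: Datum) => 1.0` mirrors the diff above.

```scala
// Hypothetical weighted datum, standing in for a future WeightedTreePoint.
final case class WeightedPoint(label: Double, features: Array[Double], weight: Double)

object ExtractorExample {
  // Generic helper: knows nothing about Datum except how to ask for its weight.
  def sampleWeights[Datum](
      input: Seq[Datum],
      extractSampleWeight: Datum => Double = (_: Datum) => 1.0): Seq[Double] =
    input.map(extractSampleWeight)

  val points: Seq[WeightedPoint] = Seq(
    WeightedPoint(1.0, Array(0.5), weight = 2.0),
    WeightedPoint(0.0, Array(1.5), weight = 0.25))

  // Weighted data: pass an extractor along the lines of `point => point.weight`.
  val weighted: Seq[Double] = sampleWeights(points, (p: WeightedPoint) => p.weight)

  // Unweighted data: rely on the default, so every datum gets weight 1.0.
  val unweighted: Seq[Double] = sampleWeights(Seq("a", "b"))
}
```

The helper never requires `Datum` to expose a weight member, which is the property the reply above wants to preserve.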

MLnick · 2016-04-29T13:55:28Z

@sethah just to confirm, is SPARK-9478 about sample weights, or class weights? The title is for class weights but I think the actual idea and PR etc is for sample weights, yes?

sethah · 2016-04-29T14:46:08Z

@MLnick Yes, SPARK-9478 is for sample weighting.

MechCoder · 2016-06-07T00:45:58Z

mllib/src/main/scala/org/apache/spark/ml/tree/impl/BaggedPoint.scala

+  /**
+   * Subsample counts weighted by the sample weight.
+   */
+  def weightedCounts: Array[Double] = subsampleCounts.map(_ * sampleWeight)


Should this be a val?

sethah · 2016-06-07

I added this as a convenience method. Making it a val would add redundant storage overhead to the class. If preferable, we could remove it entirely.

MechCoder · 2016-06-07T00:54:43Z

Should there be a sanity check that passes an input RDD of Instance objects, with extractSampleWeight as a function that simply returns the weight of each instance?

sethah added 3 commits April 13, 2016 11:26

bagged point supports sample weight

383956a

added a test

6f6c2a1

removing test and style

a673658

sethah mentioned this pull request Apr 14, 2016

[SPARK-9478] [ml] Add class weights to Random Forest #9008

Closed

MLnick reviewed Apr 29, 2016

MechCoder reviewed Jun 7, 2016

sethah closed this Oct 10, 2016
