[SPARK-21623][ML]fix RF doc #18832

mpjlu · 2017-08-03T08:30:10Z

What changes were proposed in this pull request?

The comments on parentStats in RF are wrong.
parentStats is not used only in the first iteration; it is used in every iteration for unordered features.

How was this patch tested?

srowen · 2017-08-03T09:02:03Z

@sethah

SparkQA · 2017-08-03T09:39:03Z

Test build #80199 has finished for PR 18832 at commit 83c7504.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

sethah · 2017-08-03T15:10:05Z

The comment is not wrong. It was added for when we are finding the best split, to compute the right child stats from the left child stats. We would have just used the stats that are already available on the node, node.stats, but those aren't available on the first iteration. In the method binsToBestSplit:

var gainAndImpurityStats: ImpurityStats = if (level == 0) {
  null
} else {
  node.stats
}

Otherwise, instead of parent stats we would just use node.stats.impurityCalculator.
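To illustrate the subtraction described above, here is a minimal sketch. ClassCounts and RightChildFromParent are hypothetical names, a simplified stand-in for an impurity calculator that only holds per-class label counts; they are not Spark's actual classes.

```scala
// Hypothetical, simplified stand-in for an impurity calculator:
// it only holds per-class label counts for the samples at a node.
final case class ClassCounts(counts: Vector[Double]) {
  // Element-wise subtraction; both sides must cover the same classes.
  def subtract(other: ClassCounts): ClassCounts = {
    require(counts.length == other.counts.length, "class count mismatch")
    ClassCounts(counts.zip(other.counts).map { case (a, b) => a - b })
  }
}

object RightChildFromParent {
  // Right-child statistics are whatever reached the parent minus what went left.
  def rightChild(parent: ClassCounts, left: ClassCounts): ClassCounts =
    parent.subtract(left)

  def main(args: Array[String]): Unit = {
    val parent = ClassCounts(Vector(6.0, 4.0)) // 10 samples reached the node
    val left = ClassCounts(Vector(5.0, 1.0))   // 6 of them go to the left child
    println(rightChild(parent, left))
  }
}
```

The point of the sketch is only that the right child cannot be derived without some source of parent statistics, whether that is node.stats or an explicitly tracked parentStats.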

mpjlu · 2017-08-03T15:24:53Z

node.stats is an ImpurityStats and parentStats is an Array[Double]; they are different. Maybe this comment should apply to node.stats rather than to parentStats. Is my understanding wrong?

mpjlu · 2017-08-03T15:30:50Z

parentStats is used in this code: binAggregates.getParentImpurityCalculator(), which is called in every iteration.
So that comment seems very misleading.
} else if (binAggregates.metadata.isUnordered(featureIndex)) {
  // Unordered categorical feature
  val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx)
  val (bestFeatureSplitIndex, bestFeatureGainStats) =
    Range(0, numSplits).map { splitIndex =>
      val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex)
      val rightChildStats = binAggregates.getParentImpurityCalculator()
        .subtract(leftChildStats)
      gainAndImpurityStats = calculateImpurityStats(gainAndImpurityStats,
        leftChildStats, rightChildStats, binAggregates.metadata)
      (splitIndex, gainAndImpurityStats)
    }.maxBy(_._2.gain)
  (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats)
}

sethah · 2017-08-03T15:38:30Z

I don't agree the comment is misleading. It might be confusing, but that's something different.

The reason the DTStatsAggregator needs to keep track of parentStats is so that we can get an ImpurityCalculator for the parent node when we are finding best splits. However, an ImpurityCalculator for the parent node already exists via node.stats.impurityCalculator on every iteration except the first. It is precisely for this reason that we had to add parentStats at all. It's unnecessary otherwise.

If you want to change it, then something like "Parent stats need to be explicitly tracked in the DTStatsAggregator because the parent [[Node]] object does not have ImpurityStats on the first iteration."

I doubt that's much clearer. Just to note, this comment was intended for developers anyway, since it's all private APIs.

mpjlu · 2017-08-03T16:02:47Z

I see your point.
I am confused because the code doesn't work that way.
The code updates parentStats in every iteration. Actually, we only need to update parentStats in the first iteration.
So should we update the code?
Thanks.

sethah · 2017-08-03T16:18:41Z

No, I don't think so. Computing parent stats is a very small fraction of the time and memory compared with the overall allStats array. That's why we decided to just add it in the first place.
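As a rough illustration of that size difference, the sketch below compares the two array lengths. It assumes allStats holds one stats vector per (feature, bin) pair and parentStats holds a single stats vector; StatsSizeSketch, its method names, and the example numbers are hypothetical.

```scala
object StatsSizeSketch {
  // Assumed layout: one stats vector of length statsSize per (feature, bin) pair.
  def allStatsSize(numBinsPerFeature: Seq[Int], statsSize: Int): Long =
    numBinsPerFeature.map(_.toLong * statsSize).sum

  // The parent needs only a single stats vector.
  def parentStatsSize(statsSize: Int): Long = statsSize.toLong

  def main(args: Array[String]): Unit = {
    val numBins = Seq.fill(100)(32) // hypothetical: 100 features, 32 bins each
    val statsSize = 3               // hypothetical: 3-class classification
    println(s"allStats: ${allStatsSize(numBins, statsSize)} doubles")
    println(s"parentStats: ${parentStatsSize(statsSize)} doubles")
  }
}
```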

mpjlu · 2017-08-03T16:30:34Z

I agree with you. Do you think we should update the comment to help others understand the code,
since parentStats is updated and used in each iteration?
Thanks.

sethah · 2017-08-03T16:51:44Z

If you want to change it, that's fine. I think it's fine either way.

mpjlu · 2017-08-04T02:24:12Z

Thanks @sethah.
I strongly think we should either update the comment or just delete it, as the current PR does.
Another reason: there are three kinds of features: unordered categorical, ordered categorical, and continuous.
Only the first iteration for unordered categorical features needs parentStats; the other two kinds don't need it. The comment reads as if every first iteration needs parentStats.

srowen · 2017-08-07T08:13:47Z

@mpjlu can you either close or update the change to reflect your input and Seth's?

mpjlu · 2017-08-07T08:45:05Z

Thanks @srowen , I revised the comments per Seth's suggestion: "Parent stats need to be explicitly tracked in the DTStatsAggregator because the parent [[Node]] object does not have ImpurityStats on the first iteration."

SparkQA · 2017-08-07T09:47:16Z

Test build #80331 has finished for PR 18832 at commit 814e6b2.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-08-07T09:54:23Z

Test build #80333 has finished for PR 18832 at commit 04e5abd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2017-08-07T10:03:26Z

merged to master

fix RF doc

83c7504

Peng Meng added 2 commits August 7, 2017 16:33

add comment for parentStats

814e6b2

fix RF doc

04e5abd

srowen approved these changes Aug 7, 2017

asfgit closed this in 1426eea Aug 7, 2017
