Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-21623][ML]fix RF doc #18832

Closed
wants to merge 3 commits into from
Closed

[SPARK-21623][ML]fix RF doc #18832

wants to merge 3 commits into from

Conversation

mpjlu
Copy link

@mpjlu mpjlu commented Aug 3, 2017

What changes were proposed in this pull request?

comments of parentStats in RF are wrong.
parentStats is not only used for the first iteration, it is used with all the iteration for unordered features.

How was this patch tested?

@srowen
Copy link
Member

srowen commented Aug 3, 2017

@sethah

@SparkQA
Copy link

SparkQA commented Aug 3, 2017

Test build #80199 has finished for PR 18832 at commit 83c7504.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sethah
Copy link
Contributor

sethah commented Aug 3, 2017

The comment is not wrong. It's added for when we are finding the best split, to compute the right child stats from the left child stats. We would have just used the stats that are already available on the node node.stats, but those aren't available on the first iteration. In the method binsToBestSplit:

var gainAndImpurityStats: ImpurityStats = if (level == 0) {
      null
    } else {
      node.stats
    }

Otherwise, instead of parent stats we would just use node.stats.impurityCalculator.

@mpjlu
Copy link
Author

mpjlu commented Aug 3, 2017

node.stats is ImpurityStats, and parentStats is Array[Double], there are different. Maybe this comment should be used on node.stats, but not on parentStats. Is my understanding wrong?

@mpjlu
Copy link
Author

mpjlu commented Aug 3, 2017

parentStats is used in this code: binAggregates.getParentImpurityCalculator(), this is used in all iteration.
So that comment seems very misleading.
} else if (binAggregates.metadata.isUnordered(featureIndex)) { // Unordered categorical feature val leftChildOffset = binAggregates.getFeatureOffset(featureIndexIdx) val (bestFeatureSplitIndex, bestFeatureGainStats) = Range(0, numSplits).map { splitIndex => val leftChildStats = binAggregates.getImpurityCalculator(leftChildOffset, splitIndex) val rightChildStats = binAggregates.getParentImpurityCalculator() .subtract(leftChildStats) gainAndImpurityStats = calculateImpurityStats(gainAndImpurityStats, leftChildStats, rightChildStats, binAggregates.metadata) (splitIndex, gainAndImpurityStats) }.maxBy(_._2.gain) (splits(featureIndex)(bestFeatureSplitIndex), bestFeatureGainStats) }

@sethah
Copy link
Contributor

sethah commented Aug 3, 2017

I don't agree the comment is misleading. It might be confusing, but that's something different.

The reason that the DTStatsAggregator needs to keep track of parentStats is so that we can get an ImpurityCalculator for the parent node when we are finding best splits. However, an ImpurityCalculator for the parent node already exists via node.stats.impurityCalculator on every iteration except the first. It is precisely this reason that we had to add parentStats at all. It's unnecessary otherwise.

If you want to change it, then something like "Parent stats need to be explicitly tracked in the DTStatsAggregator because the parent [[Node]] object does not have ImpurityStats on the first iteration."

I doubt that's much clearer. Just to note, this comment was intended for developers anyway, since it's all private APIs.

@mpjlu
Copy link
Author

mpjlu commented Aug 3, 2017

I know your point.
I am confusing the code doesn't work that way.
The code update parentStats for each iteration. Actually, we only need to update parentStats for the first Iteration.
So we should update the code?
Thanks.

@sethah
Copy link
Contributor

sethah commented Aug 3, 2017

No, I don't think so. Computing parent stats is a very small fraction of the time and memory compared with the overall allStats array. That's why we decided to just add it in the first place.

@mpjlu
Copy link
Author

mpjlu commented Aug 3, 2017

I agree with you. Do you think we should update the comment to help others understand the code.
Since parantStats is updated and used in each iteration.
Thanks.

@sethah
Copy link
Contributor

sethah commented Aug 3, 2017

If you want to change it, that's fine. I think it's fine either way.

@mpjlu
Copy link
Author

mpjlu commented Aug 4, 2017

Thanks @sethah .
I strongly think we should update the commend or just delete the comment as the current PR.
Another reason is: there are three kinds of feature: categorical, ordered categorical, and continuous
Only the first iteration of categorical feature need parentStats, the other two don't need. The comment seems all first iteration need parentStats.

@srowen
Copy link
Member

srowen commented Aug 7, 2017

@mpjlu can you either close or update the change to reflect your input and Seth's?

@mpjlu
Copy link
Author

mpjlu commented Aug 7, 2017

Thanks @srowen , I revised the comments per Seth's suggestion: "Parent stats need to be explicitly tracked in the DTStatsAggregator because the parent [[Node]] object does not have ImpurityStats on the first iteration."

@SparkQA
Copy link

SparkQA commented Aug 7, 2017

Test build #80331 has finished for PR 18832 at commit 814e6b2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 7, 2017

Test build #80333 has finished for PR 18832 at commit 04e5abd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Aug 7, 2017

merged to master

@asfgit asfgit closed this in 1426eea Aug 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants