[SPARK-14351] [MLlib] [ML] Optimize findBestSplits method for decision trees (and random forest) #13959

Closed
wants to merge 8 commits into from

MechCoder
Contributor

@MechCoder MechCoder commented Jun 29, 2016

What changes were proposed in this pull request?

The current findBestSplits method creates an instance of ImpurityCalculator and ImpurityStats for every possible split and feature in the search for the bestSplit. Every instance of ImpurityCalculator creates an array of size statsSize, which is unnecessary and takes a non-negligible amount of time. This pull request tackles the problem with the following changes.

  1. Remove the impurityCalculator instantiation for every possible split and feature. Replace this by a calculateGain method for each impurity that computes the gain directly from the allStats attribute of the DTStatsAggregator which holds all the necessary information.
  2. Replace returning an instance of ImpurityStats for every possible split and feature with just the information gain returned from the calculateGain method, since the gain is sufficient to determine the bestSplit. An instance of ImpurityStats is returned only once, for the bestSplit.
  3. Remove the not-so-useful calculateImpurityStats method.
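
The idea behind steps 1 and 2 can be illustrated with a small, language-neutral sketch. This is not the Spark code from this PR: the function names, the flat `all_stats` list, and the per-bin layout `[count, sum, sum_of_squares]` for variance impurity are assumptions made for the example. It shows the gain of each candidate split being computed from running scalars over the flat statistics array, with no per-split object or array allocated.

```python
from typing import List, Tuple

# Assumed layout for this sketch: each bin owns 3 consecutive slots,
# [count, sum, sum_of_squares], in one flat list (hypothetical, not Spark's API).
STATS_SIZE = 3


def variance_from_stats(count: float, total: float, total_sq: float) -> float:
    """Variance impurity from sufficient statistics; 0.0 for an empty node."""
    if count == 0:
        return 0.0
    mean = total / count
    # Clamp at 0.0 to absorb tiny negative values from rounding.
    return max(total_sq / count - mean * mean, 0.0)


def best_split_gain(all_stats: List[float], num_bins: int) -> Tuple[int, float]:
    """Return (best split index, best gain) for one feature.

    Split i sends bins 0..i to the left child and the remaining bins to the
    right child. Only scalars are updated inside the loop.
    """
    # Parent totals, read directly from the flat array.
    p_count = sum(all_stats[b * STATS_SIZE] for b in range(num_bins))
    p_sum = sum(all_stats[b * STATS_SIZE + 1] for b in range(num_bins))
    p_sq = sum(all_stats[b * STATS_SIZE + 2] for b in range(num_bins))
    parent_impurity = variance_from_stats(p_count, p_sum, p_sq)

    best_index, best_gain = -1, float("-inf")
    l_count = l_sum = l_sq = 0.0
    for split in range(num_bins - 1):
        off = split * STATS_SIZE
        # Running left-side totals; the right side is parent minus left.
        l_count += all_stats[off]
        l_sum += all_stats[off + 1]
        l_sq += all_stats[off + 2]
        r_count, r_sum, r_sq = p_count - l_count, p_sum - l_sum, p_sq - l_sq
        if l_count == 0 or r_count == 0:
            continue  # a split with an empty child is not a valid candidate
        gain = (
            parent_impurity
            - (l_count / p_count) * variance_from_stats(l_count, l_sum, l_sq)
            - (r_count / p_count) * variance_from_stats(r_count, r_sum, r_sq)
        )
        if gain > best_gain:
            best_index, best_gain = split, gain
    return best_index, best_gain


# Three bins holding labels [1, 1], [1] and [5, 5].
example_stats = [2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 2.0, 10.0, 50.0]
print(best_split_gain(example_stats, num_bins=3))
```

In this sketch a full statistics object would only need to be built once, for the winning split, which mirrors the intent of step 2.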

How was this patch tested?

Since this is a performance improvement, tests are necessary. Here are the improvements for a RandomForestRegressor with maxDepth set to 30, subSamplingRate set to 1 and maxBins set to 20 on synthetic data. The timings were measured locally, and the mean of 3 attempts was taken.

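| n_trees | n_samples | n_features | time in master | total time in this branch |
|---------|-----------|------------|----------------|---------------------------|
| 1       | 10000     | 500        | 8.954          | 7.786                     |
| 10      | 10000     | 500        | 9.44           | 6.825                     |
| 100     | 10000     | 500        | 18.457         | 16.498                    |
| 1       | 500       | 10000      | 8.718          | 6.783                     |
| 10      | 500       | 10000      | 8.579          | 6.853                     |
| 100     | 500       | 10000      | 17.593         | 15.905                    |
| 1       | 1000      | 1000       | 8.323          | 6.456                     |
| 10      | 1000      | 1000       | 8.841          | 6.633                     |
| 100     | 1000      | 1000       | 17.834         | 16.077                    |
| 500     | 1000      | 1000       | 64.3           | 58.94                     |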
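
To read these timings as relative improvements, the short script below computes, for each row, the fraction of time saved by this branch compared with master. It is only a reading aid: the rows are copied from the timings above, the helper name `relative_reduction` is made up for this example, and no time unit is assumed.

```python
# Rows copied from the timings above:
# (n_trees, n_samples, n_features, time_in_master, time_in_branch)
rows = [
    (1, 10000, 500, 8.954, 7.786),
    (10, 10000, 500, 9.44, 6.825),
    (100, 10000, 500, 18.457, 16.498),
    (1, 500, 10000, 8.718, 6.783),
    (10, 500, 10000, 8.579, 6.853),
    (100, 500, 10000, 17.593, 15.905),
    (1, 1000, 1000, 8.323, 6.456),
    (10, 1000, 1000, 8.841, 6.633),
    (100, 1000, 1000, 17.834, 16.077),
    (500, 1000, 1000, 64.3, 58.94),
]


def relative_reduction(master: float, branch: float) -> float:
    """Fraction of the master time saved by the branch (0.1 means 10% faster)."""
    return (master - branch) / master


for n_trees, n_samples, n_features, master, branch in rows:
    saved = relative_reduction(master, branch)
    print(f"{n_trees:>4} trees, {n_samples:>5} x {n_features:<5} -> {saved:.1%} less time")
```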

@MechCoder
Contributor Author

@jkbradley @sethah Please have a look when free!

@SparkQA

SparkQA commented Jun 29, 2016

Test build #61425 has finished for PR 13959 at commit e8b8914.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 29, 2016

Test build #61427 has finished for PR 13959 at commit af1ff66.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MechCoder
Contributor Author

The test failure is just due to binary incompatibility. I can fix that once we decide that the current PR is the way to proceed.

@HyukjinKwon
Member

I think we should fix the compatibility issue first rather than leaving this PR incomplete. If it is inactive, I would propose closing this for now.

@MechCoder
Contributor Author

I don't understand. If you don't have time to review, that is fine (I've been there too), but there is no need to close a PR due to the unavailability of committers.

This is one of the reasons that I am happy to have stopped contributing to Spark and to focus my energy elsewhere...

Thanks!

@srowen
Member

srowen commented May 18, 2017

I think the problem is that this PR was incomplete, and left open. We generally only leave open PRs that are active. There was evidently no interest in proceeding with it; I don't know if it was lack of attention.

@sethah
Contributor

sethah commented May 18, 2017

The lack of bandwidth in MLlib means that sometimes good code that would make an impact just gets ignored. This is kind of the reality of things. However, if we are going to close the PR simply because committers could not or did not get to it - this is the case here IMO - then we should also close the JIRA. Closing a PR for this reason essentially means "we don't see this as an issue worth spending time on." That's a reason to close a JIRA as well. Closing the JIRA will at least prevent others from wasting their time on this issue like @MechCoder did.

If we don't close the JIRA, then it seems like we are closing the PR merely because we don't want the "clutter" of long-waiting PRs. But if a PR is still valid, well-written, and solves a real problem, why would we not keep it open? This sends a bad message to contributors IMO.

@srowen
Member

srowen commented May 18, 2017

True, and I'd probably close the JIRA too. Maybe we can draw @jkbradley 's attention for a comment?

A closed PR still exists and can be examined or reopened, so it doesn't go away. I'd prefer to close it if it's almost surely not going to be merged, as a minor courtesy to the contributor, rather than leave it open. It's not so much the clutter, though that's a factor. If there are always 500 open PRs, what's one more? And being open carries virtually no information.

I both want more closing of things to reflect the fact that, at this stage, not a lot is going to change -- and want more committers.

@HyukjinKwon
Member

HyukjinKwon commented May 18, 2017

FWIW, I think we have already had a few discussions on the mailing list about the last resort: automatic closing. I was strongly against it. This is my effort to prevent it for now, for the reason described above.

@sethah
Contributor

sethah commented May 18, 2017

This is fine, but are we not also policing JIRAs? I've argued above that the reason this PR has been inactive is simply lack of interest in this issue. If that's the case, then the JIRA must also be closed, since we've implicitly decided that this is no longer or never was a problem worth solving.

If it is not the case, then we must absolutely leave this open - since this is a.) still unsolved b.) good code c.) still of interest. The reason for closing PRs but not their corresponding JIRA would be that the PR is either poorly implemented or the author is non-responsive. While it is no doubt frustrating for contributors to submit code that is well-written and solves a problem that a project committer asked for, I also don't know that leaving it open indefinitely is a solution either. I guess I don't understand why we'd be willing to leave JIRAs open indefinitely but not PRs. At any rate, in this case I would have proposed we ask for Joseph's (issue creator) input for a few days, and if we hear nothing close both JIRA and PR. We surely do not want others to submit patches for this issue if it will not be reviewed and merged.

@HyukjinKwon
Member

> The reason for closing PRs but not their corresponding JIRA would be that the PR is either poorly implemented or the author is non-responsive.

Yes, I tried to identify this case.

For this PR (or such PRs), the author still looks responsive and active, so I personally do not disagree with reopening it, because this was the point in #18017. I probably should have left a comment about this in each PR for clarification, though.

@sethah
Contributor

sethah commented May 18, 2017

Yes, this is a tough issue. Let's wait and see if @jkbradley has thoughts on this issue. If we don't hear anything, then I'd leave it up to @MechCoder on whether to reopen. Thanks, btw, for taking the time to do the cleanups. It is important and justified in many cases.

@HyukjinKwon
Member

HyukjinKwon commented May 18, 2017

@MechCoder, I apologise that the reason for my suggestion was probably not clear initially, and if it came across as disrespectful.
