-
-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Rewrite approx #7214
Rewrite approx #7214
Conversation
Codecov Report
@@ Coverage Diff @@
## master #7214 +/- ##
=======================================
Coverage 83.71% 83.71%
=======================================
Files 13 13
Lines 3892 3892
=======================================
Hits 3258 3258
Misses 634 634 Continue to review full report at Codecov.
|
The PR is pretty much ready other than some cleanups. I won't merge it until 1.5 is made. |
@ShvetsKS I did some quick tests with external memory: The usability of external memory:
I have 64 GB of memory, so the kernel has done lots of caching for the data. Are you interested in implementing it for Hist? This PR has prepared most of the foundational work. |
9bbbaa2
to
0012969
Compare
@RAMitchell @hcho3 @ShvetsKS Could you please give a high-level review on this rewrite? I will split up the PR for more detailed reviews, just want to know your thoughts about the idea of rewriting. |
Close #7244 . |
2705c57
to
29c32fa
Compare
29c32fa
to
a583d51
Compare
a583d51
to
068dd79
Compare
068dd79
to
7140ba6
Compare
Categorical data support is implemented here. |
The perf change for hist:
|
c56cbab
to
1d7a620
Compare
1d7a620
to
aed9d27
Compare
Since all parameters are supported, we can start to drop the experimental tag on parameter validation once this PR is fully merged. |
58f62e8
to
9b0dde0
Compare
b03ccfb
to
dc6359e
Compare
384fdd6
to
555acf5
Compare
One special case is epsilon, which is significantly slower due to the number of features of this dataset. For the accuracy they are pretty much the same, I tuned the sketching algorithm to match the result of the original implementation but then removed the tuning to avoid specialization. At some point we might want to rewrite the CPU sketching algorithm, but I don't want to go down that rabbit hole at this point. |
49c4ce0
to
dca1572
Compare
dca1572
to
f4b8f31
Compare
Save cuts. Prototype on fetching. Copy the code. Simple test. Add gpair to batch parameter. Add hessian to batch parameter. Move. Pass hessian into sketching. Extract a push page function. Make private. Lint. Revert debug. Simple DMatrix. Regenerate the index. ama. Clang tidy. Retain page. Fix. Lint. Tidy. Integer backed enum. Convert to uint32_t. Prototype for saving gidx. Save cuts. Prototype on fetching. Copy the code. Simple test. Add gpair to batch parameter. Add hessian to batch parameter. Move. Pass hessian into sketching. Extract a push page function. Make private. Lint. Revert debug. Simple DMatrix. Initial port. Pass in hessian. Init column sampler. Unused code. Use ctx. Merge sampling. Use ctx in partition. Fix init root. Force regenerate the sketch. Create a ctx. Get it compile. Don't use const method. Use page id. Pass in base row id. Pass the cut instead. Small fixes. Debug. Fix bin size. Debug. Fixes. Debug. Fix empty partition. Remove comment. Lint. Fix tests compilation. Remove check. Merge some fixes. fix. Fix fetching. lint. Extract expand entry. Lint. Fix unittests. Fix windows build. Fix comparison. Make const. Note. const. Fix reduce hist. Fix sparse data. Avoid implicit conversion. private. mem leak. Remove skip initialization. Use maximum space. demo. lint. File link tags. ama. Fix redefinition. Fix ranking. use npy. Comment. Tune it down. Specify the tree method. Get rid of the duplicated partitioner. Allocate task. Tests. make batches. Log. Remove span. Revert "make batches." This reverts commit 33f7072. small cleanup. Lint. Revert demo. Better make batches. Demo. Test for grow policy. Test feature weights. small cleanup. Remove iterator in evaluation. Fix dask test. Pass n_threads. Start implementation for categorical data. Fix. Add apply split. Enumerate splits. Enable sklearn. Works. d_step. update. Pass feature types into index. Search cut. Add test. As cat. Fix cut. Extract some tests. Fix. Interesting case. Add Python tests. Cleanup. Revert "Interesting case." This reverts commit 6bbaac2. Bin. Fix. Dispatch. Remove subtraction trick. Lint Use multiple buffers. Revert "Use multiple buffers." This reverts commit 2849f57. Test for external memory. Format. Partition based categorical split. Remove debug code. Fix. Lint. Fix test. Fix demo. Fix. Add test. Remove use of omp func. name. Fix. test. Make LCG impl compliant to std. Fix test. Constexpr. Use unsigned type. osx More test. Rebase error. Rebase error. Rebase error. Reverse unused changes. Config. Remove weird set thread. External memory test. Revert changes. Cleanup. wording. Fix doc. Test monotone constraint. Extract test for gamma. typo. Safe guard. Cleanup && comments. Update Python documents. Add push col page. hack. Port the sketch. Opt search bin. Cleanup. Reduce the gap. Fix sum hessian. Start cleaning up. Duplicated. Cleanup. lint. Test. Port the changes. test. Port the changes. Fixes && cleanup. Decide whether should sorted sketch be used. tests. Use regen. Lint. Revert. init. empty dataset. Handle empty dataset directly in quantile. empty. Update tests. Fix approx test. Revert "Fix approx test." This reverts commit d690afb.
bac7485
to
865b04e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome work. This will make life a lot easier.
@RAMitchell @hcho3 Thanks for all the reviews, I'm so excited to have this merged. |
Based on #7183
This PR rewrites the approx tree method to use codebase from hist for better performance and prepare for categorical data support.
The rewrite has many benefits:
max_leaves
andmax_depth
.grow_policy
.max_bin
).TODOs:
MatchThreadsToNodes
.Close #7244 .