Rewrite approx #7214

Merged 2 commits on Jan 10, 2022

@trivialfis
Member

trivialfis commented Sep 8, 2021

Based on #7183

This PR rewrites the approx tree method to use the code base from hist, for better performance and to prepare for categorical data support.

The rewrite has many benefits:

  • Support for both max_leaves and max_depth.
  • Support for grow_policy.
  • Support for monotone constraints.
  • Support for feature weights.
  • Support for easier bin configuration (max_bin).
  • Support for categorical data.
  • Faster training on most datasets, in several cases many times faster.
  • Support for prediction cache.
  • Significantly better performance for external memory.
  • Unites the code base between approx and hist.
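
As a rough illustration of how the options listed above might be combined from Python, here is a minimal sketch. The parameter names (`tree_method`, `max_bin`, `grow_policy`, `max_leaves`, `max_depth`, `monotone_constraints`) are existing XGBoost training parameters; the chosen values, the helper names `approx_params` and `train_if_available`, and the "first feature is increasing" constraint are hypothetical and only serve the example. `X` and `y` are assumed to be NumPy arrays.

```python
def approx_params(n_features: int = 2) -> dict:
    """Return an illustrative parameter set for the approx tree method."""
    # Hypothetical constraint: first feature increasing, the rest unconstrained.
    constraints = "(" + ",".join(["1"] + ["0"] * (n_features - 1)) + ")"
    return {
        "tree_method": "approx",
        "max_bin": 256,               # bin configuration
        "grow_policy": "lossguide",   # leaf-wise growth
        "max_leaves": 32,             # used together with max_depth
        "max_depth": 6,
        "monotone_constraints": constraints,
        "objective": "reg:squarederror",
    }


def train_if_available(X, y, num_boost_round: int = 10):
    """Train with XGBoost when it is installed; return None otherwise."""
    try:
        import xgboost as xgb
    except ImportError:
        return None
    dtrain = xgb.DMatrix(X, label=y)
    params = approx_params(X.shape[1])
    return xgb.train(params, dtrain, num_boost_round=num_boost_round)
```

Calling `train_if_available(X, y)` on a small array should be enough to check whether a given build accepts this combination of parameters.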

TODOs:

  • Figure out how to handle MatchThreadsToNodes.
  • Handle different sizes of batches.
  • Handle empty partition gracefully.
  • Add tests.
  • Distributed training.
  • Tests for build policies and related parameters (max_leaves, max_depth).
  • Fix task initialization.

Close #7244.

@codecov-commenter

codecov-commenter commented Sep 10, 2021

Codecov Report

Merging #7214 (8f1882b) into master (d434942) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #7214   +/-   ##
=======================================
  Coverage   83.71%   83.71%           
=======================================
  Files          13       13           
  Lines        3892     3892           
=======================================
  Hits         3258     3258           
  Misses        634      634           

@trivialfis
Member Author

The PR is pretty much ready other than some cleanups. I won't merge it until 1.5 is made.

@trivialfis
Member Author

@ShvetsKS I did some quick tests with external memory:

The usability of external memory:

  • Current approx on the master branch: unusable. It iterates over the data for every node. With 40 GB of data it has been running for 40 minutes and still has not finished the first iteration.
  • Rewritten approx: better; the bottleneck is regenerating the histogram index page.
  • Hist: not implemented yet, but I think it is the best place for external memory since it only needs to read the histogram index.
  • GPU Hist: I tested a prototype implementation before, but copying from disk to CPU and then to GPU seems to be too much overhead.

I have 64 GB of memory, so the kernel has done lots of caching for the data. Are you interested in implementing it for Hist? This PR has prepared most of the foundational work.
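
To make the "iterates the data for every node" point concrete, here is a minimal pure-Python sketch of one way to avoid a pass per node: a single sweep over the pages of a pre-binned index accumulates gradient histograms for all nodes at once. This is not the code in this PR; the page layout (bin indices, gradients, hessians, and a node id per row) and the name `build_histograms` are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Tuple

# One page: (bin index per row and feature, gradients, hessians, node id per row).
Page = Tuple[Sequence[Sequence[int]], Sequence[float], Sequence[float], Sequence[int]]


def build_histograms(
    pages: Iterable[Page], n_features: int, n_bins: int, n_nodes: int
) -> List[List[List[List[float]]]]:
    """Accumulate hist[node][feature][bin] = [sum_grad, sum_hess] in one pass."""
    hist = [
        [[[0.0, 0.0] for _ in range(n_bins)] for _ in range(n_features)]
        for _ in range(n_nodes)
    ]
    for bin_index, grad, hess, node_of_row in pages:  # each page is read once
        for row, g, h, node in zip(bin_index, grad, hess, node_of_row):
            for feature, b in enumerate(row):
                cell = hist[node][feature][b]
                cell[0] += g
                cell[1] += h
    return hist
```

With this layout the number of passes over the data no longer grows with the number of nodes being expanded; the remaining cost is producing the binned pages in the first place.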

@trivialfis
Member Author

@RAMitchell @hcho3 @ShvetsKS Could you please give a high-level review of this rewrite? I will split up the PR for more detailed reviews; for now I just want to know your thoughts on the idea of rewriting.

@trivialfis
Member Author

Close #7244.

@trivialfis
Member Author

Categorical data support is implemented here.

@trivialfis
Member Author

The perf change for hist:

Dataset   Master                PR
Airline   1367.8661037589918    1365.7433312300127
Bosch     96.10689159599133     100.5928801370028
Covtype   72.04649400700873     71.20271111401962
Higgs     173.99613595800474    174.35436162599945
Year      27.867479905995424    30.441671705979388

@trivialfis
Member Author

Since all parameters are supported, we can start to drop the experimental tag on parameter validation once this PR is fully merged.

@trivialfis trivialfis changed the title [WIP] Rewrite approx Rewrite approx Dec 5, 2021
@trivialfis trivialfis marked this pull request as ready for review December 6, 2021 00:34
@trivialfis
Member Author

trivialfis commented Dec 10, 2021

  • max_bin: 256
  • sketch_eps: 0.0078125

Dataset  AUC                 AUC (rewrite)       Accuracy            Accuracy (rewrite)  MAE                 MAE (rewrite)       Time                Time (rewrite)
airline  0.8083021582698621  0.8083021582679778  0.6653586256318165  0.6653586256318165  -                   -                   7452.502085362998   1946.756329375
bosch    0.7032129451430552  0.7064355238462672  0.8514382259767688  0.8494741288278775  -                   -                   249.46239526499994  183.57548371099983
covtype  -                   -                   0.8589967556775643  0.8574993760918392  -                   -                   155.14399087498896  85.87701512000058
epsilon  0.9258295171219089  0.9259069769846513  0.82366             0.82496             -                   -                   667.7028755839856   2270.314828159986
higgs    0.8224143371680539  0.8223955953797959  0.7132495454545454  0.7133218181818182  -                   -                   1277.4466711200075  347.96909062200575
year     -                   -                   -                   -                   6.341546058654785   6.340557098388672   35.443501809000736  39.575479084975086

(- = metric not reported for this dataset)

One special case is epsilon, which is significantly slower due to the number of features in this dataset. Accuracy is pretty much the same. I tuned the sketching algorithm to match the results of the original implementation, but then removed the tuning to avoid specialization.
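
The timing columns reported above can be turned into speedup ratios with a few lines of Python. The numbers are copied from the Time and Time (rewrite) columns; the names `timings` and `speedups` are only for this example. A ratio above 1 means the rewrite is faster.

```python
# (time before, time after the rewrite) per dataset, copied from the table.
timings = {
    "airline": (7452.502085362998, 1946.756329375),
    "bosch": (249.46239526499994, 183.57548371099983),
    "covtype": (155.14399087498896, 85.87701512000058),
    "epsilon": (667.7028755839856, 2270.314828159986),
    "higgs": (1277.4466711200075, 347.96909062200575),
    "year": (35.443501809000736, 39.575479084975086),
}


def speedups(table: dict) -> dict:
    """Return old_time / new_time for every dataset."""
    return {name: old / new for name, (old, new) in table.items()}


if __name__ == "__main__":
    for name, ratio in sorted(speedups(timings).items(), key=lambda kv: -kv[1]):
        print(f"{name:8s} {ratio:.2f}x")
```

Airline and higgs come out well above 3x, while epsilon and year fall below 1x, which matches the remark that epsilon is the notable regression.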

At some point we might want to rewrite the CPU sketching algorithm, but I don't want to go down that rabbit hole at this point.
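
For readers unfamiliar with the sketching step, the following is a simplified sketch of the idea of hessian-weighted cut points for one feature. It computes exact weighted quantiles in memory, which is a stand-in for the streaming quantile sketch the library actually uses, not a reproduction of it. The function names `weighted_cuts` and `search_bin`, and the convention that a value equal to a cut falls into the lower bin, are assumptions made for this example.

```python
from bisect import bisect_left
from typing import List, Sequence


def weighted_cuts(
    values: Sequence[float], hessians: Sequence[float], max_bin: int
) -> List[float]:
    """Pick up to max_bin - 1 cut values at evenly spaced weighted quantiles."""
    pairs = sorted(zip(values, hessians))
    total = sum(h for _, h in pairs)
    if total <= 0:
        return []
    cuts: List[float] = []
    cumulative = 0.0
    k = 1  # index of the next quantile target, total * k / max_bin
    for v, h in pairs:
        cumulative += h
        while k < max_bin and cumulative >= total * k / max_bin:
            if not cuts or cuts[-1] != v:  # skip duplicate cut values
                cuts.append(v)
            k += 1
    return cuts


def search_bin(cuts: Sequence[float], value: float) -> int:
    """Return the bin of value: the number of cuts strictly below it."""
    return bisect_left(cuts, value)
```

Rows with a larger hessian pull the cut points toward themselves, so regions where the loss is more sensitive get finer bins.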

Save cuts.

Prototype on fetching.

Copy the code.

Simple test.

Add gpair to batch parameter.

Add hessian to batch parameter.

Move.

Pass hessian into sketching.

Extract a push page function.

Make private.

Lint.

Revert debug.

Simple DMatrix.

Regenerate the index.

ama.

Clang tidy.

Retain page.

Fix.

Lint.

Tidy.

Integer backed enum.

Convert to uint32_t.

Prototype for saving gidx.

Save cuts.

Prototype on fetching.

Copy the code.

Simple test.

Add gpair to batch parameter.

Add hessian to batch parameter.

Move.

Pass hessian into sketching.

Extract a push page function.

Make private.

Lint.

Revert debug.

Simple DMatrix.

Initial port.

Pass in hessian.

Init column sampler.

Unused code.

Use ctx.

Merge sampling.

Use ctx in partition.

Fix init root.

Force regenerate the sketch.

Create a ctx.

Get it compile.

Don't use const method.

Use page id.

Pass in base row id.

Pass the cut instead.

Small fixes.

Debug.

Fix bin size.

Debug.

Fixes.

Debug.

Fix empty partition.

Remove comment.

Lint.

Fix tests compilation.

Remove check.

Merge some fixes.

fix.

Fix fetching.

lint.

Extract expand entry.

Lint.

Fix unittests.

Fix windows build.

Fix comparison.

Make const.

Note.

const.

Fix reduce hist.

Fix sparse data.

Avoid implicit conversion.

private.

mem leak.

Remove skip initialization.

Use maximum space.

demo.

lint.

File link tags.

ama.

Fix redefinition.

Fix ranking.

use npy.

Comment.

Tune it down.

Specify the tree method.

Get rid of the duplicated partitioner.

Allocate task.

Tests.

make batches.

Log.

Remove span.

Revert "make batches."

This reverts commit 33f7072.

small cleanup.

Lint.

Revert demo.

Better make batches.

Demo.

Test for grow policy.

Test feature weights.

small cleanup.

Remove iterator in evaluation.

Fix dask test.

Pass n_threads.

Start implementation for categorical data.

Fix.

Add apply split.

Enumerate splits.

Enable sklearn.

Works.

d_step.

update.

Pass feature types into index.

Search cut.

Add test.

As cat.

Fix cut.

Extract some tests.

Fix.

Interesting case.

Add Python tests.

Cleanup.

Revert "Interesting case."

This reverts commit 6bbaac2.

Bin.

Fix.

Dispatch.

Remove subtraction trick.

Lint

Use multiple buffers.

Revert "Use multiple buffers."

This reverts commit 2849f57.

Test for external memory.

Format.

Partition based categorical split.

Remove debug code.

Fix.

Lint.

Fix test.

Fix demo.

Fix.

Add test.

Remove use of omp func.

name.

Fix.

test.

Make LCG impl compliant to std.

Fix test.

Constexpr.

Use unsigned type.

osx

More test.

Rebase error.

Rebase error.

Rebase error.

Reverse unused changes.

Config.

Remove weird set thread.

External memory test.

Revert changes.

Cleanup.

wording.

Fix doc.

Test monotone constraint.

Extract test for gamma.

typo.

Safe guard.

Cleanup && comments.

Update Python documents.

Add push col page.

hack.

Port the sketch.

Opt search bin.

Cleanup.

Reduce the gap.

Fix sum hessian.

Start cleaning up.

Duplicated.

Cleanup.

lint.

Test.

Port the changes.

test.

Port the changes.

Fixes && cleanup.

Decide whether should sorted sketch be used.

tests.

Use regen.

Lint.

Revert.

init.

empty dataset.

Handle empty dataset directly in quantile.

empty.

Update tests.

Fix approx test.

Revert "Fix approx test."

This reverts commit d690afb.

@RAMitchell
Member

RAMitchell left a comment

Awesome work. This will make life a lot easier.

@trivialfis trivialfis merged commit 0015031 into dmlc:master Jan 10, 2022
@trivialfis trivialfis deleted the rewrite-approx branch January 10, 2022 13:15
@trivialfis
Member Author

trivialfis commented Jan 10, 2022

@RAMitchell @hcho3 Thanks for all the reviews, I'm so excited to have this merged.

Development

Successfully merging this pull request may close these issues.

Column sampling not working with approximate tree building
3 participants