Testing hist_util #5251

RAMitchell · 2020-01-31T02:28:12Z

This PR adds more tests for quantile generation in hist/gpu_hist. Part of the motivation is to create some absolute tests of accuracy for GPU sketching, instead of comparing against the existing CPU algorithms.

So far I have tested the following expected behaviours with respect to a single feature:

The histogram cut min_value should be less than all inputs
The last histogram cut value should be greater than all inputs
The rank of each quantile cut from the sketch should differ from its rank in the sorted values by an error of no more than ~0.01
An input of k < num_bins unique values should output k unique bins, i.e. it should be possible to split over each unique value
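
As a rough illustration, three of the expectations above (the two bounds and the one-bin-per-unique-value property) could be written as standalone checks. This is a minimal sketch, not the test code from this PR: `FeatureCuts`, `BinIndex`, `BoundsEncloseInputs` and `EachUniqueValueHasOwnBin` are hypothetical names, and assigning a value to the first cut strictly greater than it is an assumed convention.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for the cuts of a single feature.
struct FeatureCuts {
  float min_value;                // expected to be below every input
  std::vector<float> cut_values;  // ascending; last entry is the upper bound
};

// Assumed convention: a value belongs to the first cut strictly greater than it.
inline std::size_t BinIndex(const FeatureCuts& cuts, float value) {
  auto it = std::upper_bound(cuts.cut_values.begin(), cuts.cut_values.end(), value);
  return static_cast<std::size_t>(it - cuts.cut_values.begin());
}

// min_value is less than all inputs and the last cut is greater than all inputs.
inline bool BoundsEncloseInputs(const FeatureCuts& cuts,
                                const std::vector<float>& inputs) {
  if (inputs.empty() || cuts.cut_values.empty()) return false;
  auto mm = std::minmax_element(inputs.begin(), inputs.end());
  return cuts.min_value < *mm.first && cuts.cut_values.back() > *mm.second;
}

// Every unique input value lands in a bin of its own, so it can be split on.
inline bool EachUniqueValueHasOwnBin(const FeatureCuts& cuts,
                                     const std::vector<float>& inputs) {
  std::set<float> unique_values(inputs.begin(), inputs.end());
  std::set<std::size_t> bins;
  for (float v : unique_values) bins.insert(BinIndex(cuts, v));
  return bins.size() == unique_values.size();
}
```

With inputs {1, 2, 3, 2} and cuts {1.5, 2.5, 3.5}, both checks pass. Dropping the 1.5 cut makes the second check fail because 1 and 2 then share a bin.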

I also expected that the number of cuts (excluding the minimum value) should equal the number of bins requested if the unique input size is greater than num_bins. This is true for categorical features with <= unique values but not for continuous features. This currently means that if you ask for 256 bins you get 254 actual bins, which seems like a bug. Two of the samples get lost in this loop (xgboost/src/common/hist_util.h, line 208 in 472ded5):

    for (size_t i = 2; i < summary.size; ++i) {

One value then gets added afterwards to be larger than everything. Another sample seems to get lost here (xgboost/src/common/hist_util.cc, line 319 in 472ded5):

    a.SetPrune(summary_array[fid], max_num_bins);

Here 256 samples go into SetPrune and 255 come out, despite max_num_bins being set at 256.
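
The arithmetic behind 254 can be checked by counting loop iterations. The sketch below is only a counting aid under one assumption: each iteration of the loop contributes at most one cut value. `LoopCuts` and `MaxCuts` are hypothetical helpers and not part of hist_util.

```cpp
#include <cstddef>

// Iterations of `for (i = first; i < summary_size; ++i)`.
// Assumption: each iteration emits at most one cut value.
inline std::size_t LoopCuts(std::size_t summary_size, std::size_t first) {
  std::size_t count = 0;
  for (std::size_t i = first; i < summary_size; ++i) ++count;
  return count;
}

// Upper bound on the total: loop cuts plus the one value appended afterwards
// so that the final cut is larger than every input.
inline std::size_t MaxCuts(std::size_t summary_size, std::size_t first) {
  return LoopCuts(summary_size, first) + 1;
}
```

With the 255 entries that come out of SetPrune and a loop starting at 2, `MaxCuts(255, 2)` gives 253 + 1 = 254, matching the observed bin count. Even if all 256 entries survived pruning, the same loop would give at most 255.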

After resolving this I also want to add tests comparing the rank of the output quantiles against the correct rank by sorting the input.

I would also like to clarify the need for specialised cut points for categorical variables, discussed in #5095 and implemented here (xgboost/src/common/hist_util.h, line 200 in 472ded5):

    /* specialized code categorial / ordinal data -- use midpoints */

@hcho3 any comments would be much appreciated.

RAMitchell · 2020-02-03T03:35:25Z

I added tests for the eps accuracy of the sketch for combinations of number of bins and input sizes. I also fixed an off-by-one error where, for categorical data, the number of histogram bins would always be one less than the number of unique inputs, so certain categorical values were never used.
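
One way to express such an eps check is sketched below. It is an illustration under stated assumptions, not the test added in this PR: `MaxRankError` is a hypothetical helper, the ideal rank of cut i is taken to be (i + 1) * n / num_bins (equally populated bins), and the last cut is skipped because it only serves as an upper bound.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest normalised rank error over all cuts except the last.
// `sorted_inputs` must be in ascending order.
inline double MaxRankError(const std::vector<float>& sorted_inputs,
                           const std::vector<float>& cut_values) {
  if (sorted_inputs.empty() || cut_values.empty()) return 0.0;
  const double n = static_cast<double>(sorted_inputs.size());
  const double num_bins = static_cast<double>(cut_values.size());
  double worst = 0.0;
  for (std::size_t i = 0; i + 1 < cut_values.size(); ++i) {
    // Actual rank: number of inputs strictly below this cut.
    auto it = std::lower_bound(sorted_inputs.begin(), sorted_inputs.end(),
                               cut_values[i]);
    const double actual = static_cast<double>(it - sorted_inputs.begin());
    // Ideal rank if the cuts split the data into equally populated bins.
    const double expected = (static_cast<double>(i) + 1.0) * n / num_bins;
    worst = std::max(worst, std::abs(actual - expected) / n);
  }
  return worst;
}
```

A test would then assert `MaxRankError(sorted, cuts) <= eps` for each combination of bin count and input size. For the inputs 0..99 and the cuts {25, 50, 75, 100} the error is zero, while moving the first cut to 10 gives an error of 0.15.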

It still seems possible for summary.SetPrune() to unnecessarily remove cuts in some cases, but this is not a major issue and I'm not confident enough to change that function, so I will leave that for another day.

I also did some experiments with the specialised logic for categorical splits. I don't think it has any effect on accuracy. Maybe someone else can provide an example of why this is necessary?

trivialfis · 2020-02-04T02:34:49Z

src/common/hist_util.h

-        if (i == 2 || cpt > p_cuts_->cut_values_.back()) {
-          p_cuts_->cut_values_.push_back(cpt);
-        }
+    for (size_t i = 1; i < summary.size; ++i) {


@hcho3 I vaguely remembered that you have some use cases for this specialization.

I'm still working through a bunch of corner cases for this.

@RAMitchell My preference is either we don't support it, or provide full blown implementation.

#5095 and #5096 may be relevant.

The specialisation is unnecessary: we can use the same code to get correct results for both categorical and continuous features. After this PR, if the number of unique feature values is less than or equal to max_bin, then each value will get its own bin and can be used for splitting. Before this PR, if the number of unique values was > 16 but less than max_bin, categorical feature values would get lost.

RAMitchell · 2020-02-11T02:40:13Z

So WXQSummary.SetPrune() seems to be quite broken. Sometimes it unnecessarily removes elements from a set before it reaches capacity, and my tests show incorrect distributions of elements in its quantile ranges.

I replaced this with the simpler function WQSummary.SetPrune() for the hist and gpu_hist algorithms and everything is working as expected. I believe this function only gets called once towards the end of quantile calculation, so I am not expecting this to affect speed, but I will check with some benchmarks shortly.

RAMitchell · 2020-02-11T02:46:11Z

@rongou I noticed your sampling tests were quite volatile. The reason is that quantiles calculated on external memory end up slightly different from those calculated in memory.

The fix is to set max_bins equal to the number of training rows so that the quantiles end up the same for both external memory and in memory (each row has its own bin). Your tests comparing predictions from external memory/in memory now pass to a much higher accuracy.

rongou · 2020-02-11T02:56:08Z

@RAMitchell that's great! Thanks!

RAMitchell · 2020-02-12T01:28:57Z

I checked the performance of the different pruning methods used to compute DenseCuts and there does not seem to be a meaningful difference. If anything, WQSketch is faster.

RAMitchell force-pushed the testing-quantiles branch from b3ddadc to 02e68d1 Compare February 2, 2020 23:37

RAMitchell changed the title ~~[WIP] Testing hist_util~~ Testing hist_util Feb 3, 2020

RAMitchell requested review from trivialfis and hcho3 February 3, 2020 03:40

RAMitchell force-pushed the testing-quantiles branch from 1c18d6b to 9f3fca7 Compare February 3, 2020 20:55

trivialfis approved these changes Feb 4, 2020

RAMitchell force-pushed the testing-quantiles branch from 9061c84 to 9621278 Compare February 5, 2020 01:24

RAMitchell added 8 commits February 12, 2020 13:32

Testing quantiles (bc09c61)
Fix number of bins (06cc6b3)
Rank tests (faba460)
Remove categorical split specialisation (969ebbc)
Extend tests to multiple features, switch to WQSketch (6185305)
Add tests for SparseCuts (d0667b3)
Add external memory quantile tests, fix some existing tests (57340e1)
Add some comments (cb1b2c8)

RAMitchell force-pushed the testing-quantiles branch from 58675d0 to cb1b2c8 Compare February 12, 2020 00:35

RAMitchell force-pushed the testing-quantiles branch from 4b58811 to 648641c Compare February 12, 2020 22:21

Relax test (375d37a)

RAMitchell force-pushed the testing-quantiles branch from 648641c to 375d37a Compare February 13, 2020 22:48

RAMitchell merged commit 24ad9de into dmlc:master Feb 14, 2020

RAMitchell mentioned this pull request Feb 14, 2020

[Improvement] Support of discrete vs continuous metadata + AddCutPoint() adaptation #5096

Closed

zhangzhang10 mentioned this pull request Apr 6, 2020

Array out-of-bound access bug #5492

Closed

lock bot locked as resolved and limited conversation to collaborators May 20, 2020
