Testing hist_util #5251

Merged
merged 9 commits into from Feb 14, 2020

Member

@RAMitchell RAMitchell commented Jan 31, 2020

This PR is for adding more tests for quantile generation in hist/gpu_hist. Part of the motivation is to create some absolute tests of accuracy for GPU sketching, instead of comparing against existing CPU algorithms.

So far I have tested the following expected behaviours with respect to a single feature:

  • The histogram cut min_value should be less than all inputs
  • The last histogram cut value should be greater than all inputs
  • The rank of each quantile cut from the sketch should not differ by more than an error of ~0.01 from its rank in the sorted values
  • An input of k < num_bins unique values should output k unique bins, i.e. it should be possible to split over each unique value
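
The first, second and fourth expectations above can be sketched as standalone checks. This is a minimal illustration, not the test code from this PR: the FeatureCuts struct, the function names, and the convention that a value belongs to the first bin whose cut is strictly greater than it are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical container for the cuts of a single feature.
// This is NOT XGBoost's HistogramCuts; it only illustrates the checks.
struct FeatureCuts {
  float min_value;            // lower bound, expected to be below every input
  std::vector<float> values;  // ascending upper boundaries of the bins
};

// True if min_value is below every input and the last cut is above every input.
bool CutsBracketInputs(const FeatureCuts& cuts, const std::vector<float>& x) {
  if (x.empty() || cuts.values.empty()) return false;
  auto mm = std::minmax_element(x.begin(), x.end());
  return cuts.min_value < *mm.first && cuts.values.back() > *mm.second;
}

// Assumed binning convention: index of the first cut strictly greater than v.
std::size_t BinIndex(const FeatureCuts& cuts, float v) {
  auto it = std::upper_bound(cuts.values.begin(), cuts.values.end(), v);
  return static_cast<std::size_t>(it - cuts.values.begin());
}

// True if every distinct input value falls into its own valid bin,
// so that a split is possible between any two distinct values.
bool UniqueValuesGetUniqueBins(const FeatureCuts& cuts,
                               const std::vector<float>& x) {
  std::set<float> unique_values(x.begin(), x.end());
  std::set<std::size_t> bins;
  for (float v : unique_values) {
    std::size_t b = BinIndex(cuts, v);
    if (b >= cuts.values.size()) return false;  // value above the last cut
    bins.insert(b);
  }
  return bins.size() == unique_values.size();
}
```

With inputs {1, 2, 3} and cuts {1.5, 2.5, 3.5}, both checks pass; dropping the 1.5 cut makes 1 and 2 share a bin, so the second check fails.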

I also expected that the number of cuts (excluding the minimum value) should equal the number of bins requested if the unique input size is greater than num_bins. This is true for categorical features with <= unique values but not for continuous features. This currently means that if you ask for 256 bins you get 254 actual bins, which seems like a bug. Two of the samples get lost here ( for (size_t i = 2; i < summary.size; ++i) { ), and one value gets added afterwards to be larger than everything. Another sample seems to get lost here ( a.SetPrune(summary_array[fid], max_num_bins); ), where 256 samples go into SetPrune and 255 come out, despite max_num_bins being set to 256.

After resolving this I also want to add tests comparing the rank of the output quantiles against the correct rank by sorting the input.
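
A rank comparison of this kind could look roughly like the sketch below. It is an illustration under stated assumptions, not the test added in this PR: the data is unweighted, the cuts are ascending, the last cut lies above every input, and the i-th of m cuts is expected to sit at rank (i + 1) / m of the sorted input. MaxRankError is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest normalised rank error over all cuts, for unweighted data.
// The rank of a cut is the number of inputs strictly below it.
double MaxRankError(std::vector<float> x, const std::vector<float>& cuts) {
  if (x.empty() || cuts.empty()) return 0.0;
  std::sort(x.begin(), x.end());
  const double n = static_cast<double>(x.size());
  const double m = static_cast<double>(cuts.size());
  double worst = 0.0;
  for (std::size_t i = 0; i < cuts.size(); ++i) {
    // Actual rank: position of the cut in the sorted input.
    const double actual = static_cast<double>(
        std::lower_bound(x.begin(), x.end(), cuts[i]) - x.begin());
    // Expected rank under the equal-frequency assumption stated above.
    const double expected = n * static_cast<double>(i + 1) / m;
    worst = std::max(worst, std::abs(actual - expected) / n);
  }
  return worst;
}
```

A test would then assert that the returned value stays below the chosen eps (for example the ~0.01 mentioned above) for each combination of bin count and input size.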

I would also like to clarify the need for specialised cut points for categorical variables, implemented here ( /* specialized code categorial / ordinal data -- use midpoints */ ) and discussed in #5095.

@hcho3 any comments would be much appreciated.

@RAMitchell
Member Author

I added tests for the eps accuracy of the sketch for combinations of number of bins and input sizes. I also fixed an off-by-one error where, for categorical data, the number of histogram bins would always be one less than the number of unique inputs, so certain categorical values were never used.

It still seems possible for summary.SetPrune() to unnecessarily remove cuts in some cases, but this is not a major issue and I'm not confident enough to change that function, so I will leave that for another day.

I also did some experiments with specialised logic for categorical splits. I don't think it has any effect on accuracy; maybe someone else can provide an example of why it is necessary?

@RAMitchell RAMitchell changed the title [WIP] Testing hist_util Testing hist_util Feb 3, 2020
if (i == 2 || cpt > p_cuts_->cut_values_.back()) {
p_cuts_->cut_values_.push_back(cpt);
}
for (size_t i = 1; i < summary.size; ++i) {
Member

@hcho3 I vaguely remember that you have some use cases for this specialization.

Member Author

I'm still working through a bunch of corner cases for this.

Member

@RAMitchell My preference is that we either don't support it or provide a full-blown implementation.

Collaborator

#5095 and #5096 may be relevant.

Member Author

The specialisation is unnecessary; we can use the same code to get correct results for both categorical and continuous features. After this PR, if the number of unique feature values is less than or equal to max_bin, then each value will get its own bin and can be used for splitting. Before this PR, if the number of unique values was greater than 16 but less than max_bin, categorical features would get lost.
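
To illustrate the "each value gets its own bin" property, here is a small sketch of one way cuts with that property could be built when the unique values fit within max_bin. It is not the code from this PR, which works on the sketch summary. The function names, the final-cut formula, and the convention that a value belongs to the first bin whose cut is strictly greater than it are assumptions for the example.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: with k unique values u0 < u1 < ... < u(k-1), use
// u1..u(k-1) as cuts plus one final cut above the maximum, giving k bins.
std::vector<float> CutsForFewUniqueValues(std::vector<float> x) {
  std::sort(x.begin(), x.end());
  x.erase(std::unique(x.begin(), x.end()), x.end());
  if (x.empty()) return {};
  std::vector<float> cuts(x.begin() + 1, x.end());
  const float last = x.back();
  // Assumed way to obtain a value strictly larger than the maximum.
  cuts.push_back(last + (std::abs(last) + 1e-5f));
  return cuts;
}

// Assumed binning convention: index of the first cut strictly greater than v.
std::size_t BinOf(const std::vector<float>& cuts, float v) {
  return static_cast<std::size_t>(
      std::upper_bound(cuts.begin(), cuts.end(), v) - cuts.begin());
}
```

For inputs with the unique values {1, 2, 3, 7}, this yields four cuts, and the values map to bins 0, 1, 2 and 3, so a split can separate any pair of them. Under these assumptions the same procedure serves categorical and continuous inputs alike.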

@RAMitchell
Member Author

So WXQSummary.SetPrune() seems to be quite broken. Sometimes it unnecessarily removes elements from a set before it reaches capacity and my tests show incorrect distributions of elements in its quantile ranges.

I replaced this with the simpler function WQSummary.SetPrune() for the hist and gpu_hist algorithms and everything is working as expected. I believe this function only gets called once towards the end of quantile calculation, so I am not expecting this to affect speed but will check with some benchmarks shortly.

@RAMitchell
Member Author

@rongou I noticed your sampling tests were quite volatile. The reason is that quantiles calculated on external memory end up slightly different from those calculated in memory.

The fix is to set max_bins equal to the number of training rows so that the quantiles end up the same for both external memory and in memory (each row has its own bin). Your tests comparing predictions from external memory/in memory now pass to a much higher accuracy.

@rongou
Contributor

rongou commented Feb 11, 2020

@RAMitchell that's great! Thanks!

@RAMitchell
Member Author

[Figure: WQSketch vs WXQSketch, 5 features]
I checked the performance of the different pruning methods to compute DenseCuts and there does not seem to be a meaningful difference. If anything WQSketch is faster.

@RAMitchell RAMitchell merged commit 24ad9de into dmlc:master Feb 14, 2020
@lock lock bot locked as resolved and limited conversation to collaborators May 20, 2020