
ARROW-8504: [C++] Add BitRunReader and use it in parquet #7143

Closed · wants to merge 7 commits

Conversation

@emkornfield (Contributor) commented May 10, 2020

Adds two implementations of a BitRunReader, which returns whether bits
are set or unset and the number of consecutive identical bits in each run.

  • Adds benchmarks comparing the two implementations under different
    distributions.

  • Adds the reader for use in the Parquet writer (there is a second
    location, the nullable terminal node, that I left unchanged because
    it showed a performance drop of 30%; I think this is due to the issue
    described in the next bullet point, or because BitVisitor is getting
    vectorized to something more efficient).

  • Refactors GetBatchedSpaced and GetBatchedSpacedWithDict:

    1. Uses a single templated method that adds a template parameter
       so the code can be shared.
    2. Does all checking for out-of-bounds indices in one go instead
       of on each pass through the literal (this is a slight behavior
       change, as the index returned will be different).
    3. Makes use of the BitRunReader.

Based on the BM_ColumnRead benchmarks, this seems to make performance worse
by 50-80%. This was surprising to me; my current working theory is that
the nullable benchmarks present the worst-case scenario (every other
element is null), so the overhead of invoking the call relative to the
existing code is high (in perf, calls to NextRun() jump to the top after
this change). Before deciding whether to revert the use of BitRunReader
here, I'd like to implement a more comprehensive set of benchmarks to
test this hypothesis.

Other TODOs

  • Need to revert change to testing submodule hash
  • Add attribution to Wikipedia for InvertRemainingBits
  • Fix some unintelligible comments.
  • Resolve performance issues.

@wesm (Member) commented May 10, 2020

A fair chunk of the RLE-related code came out of Impala originally; it might not be a bad idea to peek at what's in apache/impala to see if it has gotten worked on perf-wise since the beginning of 2016. In any case, we wouldn't want to take on a change that causes a perf regression in cases with many short null/non-null runs.

@emkornfield (Contributor Author):

Agreed. I think I'll have to add a special case for this. The code can also be simplified a bit further for cases with runs.

@emkornfield emkornfield changed the title ARROW-8504: [C++] Add BitRunReader and use it in parquet ARROW-8504: [C++] [wip]Add BitRunReader and use it in parquet May 11, 2020
@emkornfield (Contributor Author) commented May 11, 2020

@wesm interesting data point. I updated the performance benchmarks to generate random values/nullability (and kept the deterministic one). It seems like the bad regression really only occurs with deterministically alternating nullability; randomly generated nullability (at 50%) shows improvements. I still have some cleanup to do, but I'd be curious about people's thoughts on whether we need to detect/special-case deterministic patterns in nullability (it's possible I just chose a bad seed?):
[benchmark results screenshot]

Also, https://github.com/apache/impala/blob/19a4d8fe794c9b17e69d6c65473f9a68084916bb/be/src/kudu/util/rle-encoding.h is the file I found in Impala. Maybe we did all the additions for GetBatchSpaced etc.? Was there another source you were thinking of?

@emkornfield (Contributor Author):

OK, any code cleanup I've tried seems to make performance worse. So I think the one remaining question is whether I should try to special-case alternating null/non-null values, because the regression there is bad.

I think the run calculation is still more expensive in some cases, but it looks very bad when nulls follow a pattern the CPU branch predictor can get right all the time. So the only way of special-casing this is to estimate when alternating values are occurring and fall back to the old one-at-a-time code path. I'm -0 on making such a special case because it isn't really clear to me how much this occurs in "real" data.

I think another way of framing this is: if we had pseudo-random benchmark data to begin with, would anybody notice the particularly bad performance? Would we try to special-case it under that scenario?

@wesm (Member) commented May 13, 2020

I agree that we shouldn't be basing the decision on the performance of random / worst-case scenario data. If the perf is better on "realistic" datasets (> 80/90% non-null data with consistent consecutive runs), then it doesn't seem like a bad idea to me. Bending over backwards for the special / "unrealistic" case also doesn't seem worthwhile.

@emkornfield (Contributor Author):

@wesm OK, I did a little more in-depth sampling. It looks like this new algorithm is a win for 0-5% nulls, then a regression until somewhere between 45-50% nulls, then likely a win at a larger percentage of nulls. I'll add a special case to estimate which algorithm to use (this one or one-by-one) based on the percentage of nulls, sampling the first N elements of the bitmap vector.

[benchmark results screenshot]

@emkornfield (Contributor Author):

OK, I was actually able to make this faster with a little bit of refactoring. The only major regression now is the exactly alternating case (~50% worse). I'll wait until #7175 is merged to publish new benchmarks, but I no longer think special-casing is necessary.

@emkornfield emkornfield force-pushed the ARROW-8504 branch 2 times, most recently from 95873c1 to 64731c4 Compare May 22, 2020 03:20
@emkornfield emkornfield changed the title ARROW-8504: [C++] [wip]Add BitRunReader and use it in parquet ARROW-8504: [C++] Add BitRunReader and use it in parquet May 22, 2020
@emkornfield (Contributor Author):

[benchmark results screenshot]

OK, I think this is ready for review. CC @pitrou @wesm

Benchmarks show mostly a 20% improvement for random null values across the board. The worst-case scenario shows a 40-60% regression; this occurs when we have deterministically alternating null/not-null values. I believe this is because modern branch predictors can detect this pattern, so there is minimal branch cost relative to the extra work happening inside BitRunReader.

@wesm (Member) commented May 22, 2020

Cool, I will try to look at this tomorrow and kick the tires a bit with benchmarks

@emkornfield (Contributor Author):

Will fix CI issues tomorrow

@emkornfield (Contributor Author):

Need to look into the windows failures some more.

@wesm (Member) commented May 26, 2020

I will review and do some perf tire-kicking in the meantime

@pitrou (Member) commented May 26, 2020

If this is not urgent, you can wait for next week and I'll review as well. As you prefer, of course :-)

@wesm (Member) commented May 26, 2020

@pitrou yes, I can wait for you to code review. I will have a brief look; I'm mostly interested in trying it out on some datasets (such as the ones from https://ursalabs.org/blog/2019-10-columnar-perf/).

@wesm (Member) commented May 26, 2020

I ran this on the "Fannie Mae 2016 Q4" dataset from that blog post.

before

In [1]: %timeit pq.read_table('/home/wesm/code/notebooks/20190919file_benchmarks/2016Q4_uncompressed.parquet', use_threads=True)                                                               
1.01 s ± 18.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]: %timeit pq.read_table('/home/wesm/code/notebooks/20190919file_benchmarks/2016Q4_uncompressed.parquet', use_threads=False)                                                              
5.27 s ± 104 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

after

In [1]: %timeit pq.read_table('/home/wesm/code/notebooks/20190919file_benchmarks/2016Q4_uncompressed.parquet', use_threads=True)                                                               
1.01 s ± 9.35 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [2]: %timeit pq.read_table('/home/wesm/code/notebooks/20190919file_benchmarks/2016Q4_uncompressed.parquet', use_threads=False)                                                              
4.86 s ± 39.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

So there seems to be a perf improvement in single-threaded decoding, but multithreaded decoding appears to be effectively bottlenecked on other issues.

@wesm (Member) commented May 26, 2020

@emkornfield if it's OK I'm going to let @pitrou review this when he is back from vacation next week since he's better with these code paths

@emkornfield (Contributor Author):

No rush on review. The failures on Windows are a little concerning to me (i.e., there still might be a bug in the code). @wesm, not sure if it is noise, but it looks like variance is reduced with the new code. Can you share the encodings/schema for the Parquet file that you are running? Would it pay to check in the file (or a sampled version) and write a C++ benchmark against it?

@emkornfield (Contributor Author):

Also @wesm have you looked at improvements from #6985?

@wesm (Member) commented May 26, 2020

Ah, I will try to look next week on my machine with AVX512

@emkornfield (Contributor Author):

It should be a performance win on SSE4 or AVX as long as the data isn't nested.

@pitrou (Member) commented Jun 2, 2020

BitRunReader benchmark here (with gcc 7.5):

BitRunReader/-1            10689 ns        10687 ns       196758 bytes_per_second=45.6911M/s
BitRunReader/0               243 ns          243 ns      8648058 bytes_per_second=1.96541G/s
BitRunReader/10             2315 ns         2315 ns       902899 bytes_per_second=210.938M/s
BitRunReader/25             4543 ns         4542 ns       469378 bytes_per_second=107.5M/s
BitRunReader/50             6284 ns         6283 ns       331741 bytes_per_second=77.7125M/s
BitRunReader/60             6453 ns         6452 ns       325743 bytes_per_second=75.6844M/s
BitRunReader/75             4952 ns         4951 ns       423942 bytes_per_second=98.622M/s
BitRunReader/99              449 ns          449 ns      4676786 bytes_per_second=1086.96M/s

BitRunReaderScalar/-1       6132 ns         6132 ns       339553 bytes_per_second=79.6324M/s
BitRunReaderScalar/0        2955 ns         2955 ns       707136 bytes_per_second=165.263M/s
BitRunReaderScalar/10       4080 ns         4079 ns       513884 bytes_per_second=119.696M/s
BitRunReaderScalar/25       5348 ns         5347 ns       380446 bytes_per_second=91.313M/s
BitRunReaderScalar/50       6427 ns         6426 ns       319813 bytes_per_second=75.9863M/s
BitRunReaderScalar/60       6143 ns         6142 ns       340548 bytes_per_second=79.5007M/s
BitRunReaderScalar/75       5037 ns         5036 ns       403359 bytes_per_second=96.9589M/s
BitRunReaderScalar/99       2913 ns         2912 ns       716137 bytes_per_second=167.664M/s

@pitrou (Member) left a review comment:

Ok, I find the implementation quite convoluted. I would like to see the following changes:

  • Make all critical functions inline
  • Load a single byte at a time (it probably doesn't bring much to process an entire 64-bit word at a time, and it adds complication)
  • Try to simplify mask handling


void BitRunReader::LoadWord() {
word_ = 0;
// On the initial load if there is an offset we need to account for this when
pitrou (Member):

Ok, why don't you have a separate function for the initial load? You're paying the price for this on every subsequent load (especially as the function may not be inlined at all).

emkornfield (Contributor Author):

Changing to have a separate function. This is intentionally not inlined as part of the main loop (it was a performance improvement to move AdvanceTillChange out-of-line).

int64_t new_bits = BitUtil::CountTrailingZeros(word_) - (start_position % 64);
position_ += new_bits;

if (ARROW_PREDICT_FALSE(position_ % 64 == 0) &&
pitrou (Member):

Be aware that signed modulo is slightly slower than unsigned modulo (the compiler doesn't know that position_ is positive).

pitrou (Member):

You could instead maintain a bit offset and/or a word mask. Could also perhaps help deconvolute InvertRemainingBits...

pitrou (Member):

You could do position_ & 63 too

emkornfield (Contributor Author):

Thanks for the tip; changing to position_ & 63. Antoine, I'm not sure what you have in mind for maintaining a bitmask; could you provide some more details?

ARROW_PREDICT_TRUE(position_ < length_)) {
// Continue extending position while we can advance an entire word.
// (updates position_ accordingly).
AdvanceTillChange();
pitrou (Member):

If you folded AdvanceTillChange() here, the code would probably be more readable (less duplicated conditions).

emkornfield (Contributor Author):

By "folded" do you mean inlined? That makes a significant performance hit on the Parquet benchmarks. I added a comment.

private:
void AdvanceTillChange();
void InvertRemainingBits() {
// Mask applied everying above the lowest bit.
pitrou (Member):

???

emkornfield (Contributor Author):

Tried to clarify.

@emkornfield (Contributor Author) commented Jun 3, 2020

@pitrou thank you for the review. Comments inline.

Ok, I find the implementation quite convoluted. I would like to see the following changes:

  • Make all critical functions inline

If this refers to AdvanceTillChange, inlining it hurts performance. LoadInitialWord in the constructor can be done, but I don't think it will have a large impact on performance.

  • Load a single byte at a time (it probably doesn't bring much to process an entire 64-bit word at a time, and it adds complication)

I'm not sure I understand what direction you would like me to take here. I would expect word-level handling to be much faster for the cases this is intended for (when there are actual runs). In practice, once integrated into a system, you are likely correct, though. But I'm not sure exactly what algorithm you have in mind for byte-level handling. If you provide more detail, I can try to implement it and compare.

  • Try to simplify mask handling

Is this mostly InvertRemainingBits? Could you provide some more feedback on the algorithm structure you were thinking of?

@emkornfield (Contributor Author):

I could try to adapt the Parquet code to use BitBlockCounter and see what the benchmarks look like?

@wesm I think if you can take a look and potentially revise the benchmarks at

->Args({/*null_percentage=*/kAlternatingOrNa, /*first_value_percentage=*/0})
to make sure we are aligned on what we are trying to improve, I can update this PR accordingly. I think there are really two options:

  1. Remove BitRunReader entirely and use BitBlockCounter
  2. Use BitBlockCounter in addition to BitRunReader

The way to go really depends on what percentage of values we expect to be null. My intuition is that very high rates and very low rates are likely, but I think you probably have a better intuition as to the exact definition of high or low.

@wesm (Member) commented Jun 5, 2020

@emkornfield I can take a look at this once #7356 is merged which adds a single-word function to BitBlockCounter

@wesm (Member) commented Jun 22, 2020

@emkornfield I'm sorry that I've been neglecting this PR. I will try to rebase this and investigate the perf questions a little bit

@wesm (Member) commented Jun 22, 2020

I rebased. I'm going to look at the benchmarks locally and then probably merge this so further performance work (including comparing with BitBlockCounter) can be pursued as follow up

@wesm (Member) commented Jun 22, 2020

Here's what I see with gcc-8:

                                benchmark          baseline         contender  change %  regression
1       BM_ReadColumn<true,Int32Type>/1/1     1.336 GiB/sec     1.880 GiB/sec    40.701       False
17      BM_ReadColumn<true,Int64Type>/1/1     2.224 GiB/sec     2.926 GiB/sec    31.567       False
14     BM_ReadColumn<true,Int32Type>/99/0     1.599 GiB/sec     2.001 GiB/sec    25.128       False
23    BM_ReadColumn<true,Int32Type>/99/50     1.601 GiB/sec     1.990 GiB/sec    24.300       False
21    BM_ReadColumn<true,Int64Type>/50/50  1002.325 MiB/sec     1.206 GiB/sec    23.171       False
12    BM_ReadColumn<true,Int32Type>/50/50   619.919 MiB/sec   758.845 MiB/sec    22.410       False
33    BM_ReadColumn<true,Int64Type>/45/25  1016.908 MiB/sec     1.212 GiB/sec    22.095       False
30     BM_ReadColumn<true,Int32Type>/50/0   620.068 MiB/sec   757.008 MiB/sec    22.085       False
34      BM_ReadColumn<true,Int64Type>/5/5     1.586 GiB/sec     1.935 GiB/sec    21.981       False
27     BM_ReadColumn<true,Int64Type>/99/0     2.539 GiB/sec     3.081 GiB/sec    21.308       False
13     BM_ReadColumn<true,Int64Type>/10/5     1.361 GiB/sec     1.644 GiB/sec    20.791       False
18     BM_ReadColumn<true,Int64Type>/50/1     1.002 GiB/sec     1.204 GiB/sec    20.174       False
26     BM_ReadColumn<true,Int64Type>/75/1     1.182 GiB/sec     1.419 GiB/sec    20.005       False
15    BM_ReadColumn<true,Int64Type>/99/50     2.532 GiB/sec     3.036 GiB/sec    19.890       False
11    BM_ReadColumn<true,Int64Type>/30/10     1.123 GiB/sec     1.322 GiB/sec    17.754       False
35    BM_ReadColumn<true,Int64Type>/25/10     1.187 GiB/sec     1.396 GiB/sec    17.558       False
25     BM_ReadColumn<true,Int32Type>/25/5   741.612 MiB/sec   870.530 MiB/sec    17.384       False
3     BM_ReadColumn<true,Int32Type>/10/10   859.032 MiB/sec  1005.688 MiB/sec    17.072       False
31    BM_ReadColumn<true,Int64Type>/35/10     1.095 GiB/sec     1.254 GiB/sec    14.528       False
20   BM_ReadColumn<true,BooleanType>/-1/1   712.571 MiB/sec   808.480 MiB/sec    13.460       False
5    BM_ReadColumn<true,DoubleType>/25/25     1.244 GiB/sec     1.399 GiB/sec    12.397       False
9    BM_ReadColumn<true,BooleanType>/5/10   460.741 MiB/sec   484.654 MiB/sec     5.190       False
7   BM_ReadColumn<false,BooleanType>/1/20   397.103 MiB/sec   412.517 MiB/sec     3.882       False
29  BM_ReadColumn<false,BooleanType>/-1/0   984.767 MiB/sec   994.483 MiB/sec     0.987       False
19   BM_ReadColumn<false,DoubleType>/-1/0    13.609 GiB/sec    13.738 GiB/sec     0.944       False
28    BM_ReadColumn<false,Int32Type>/-1/1     4.910 GiB/sec     4.702 GiB/sec    -4.236       False
8     BM_ReadColumn<false,Int64Type>/-1/1     8.631 GiB/sec     8.000 GiB/sec    -7.310        True
6    BM_ReadColumn<false,Int64Type>/-1/10     2.872 GiB/sec     2.329 GiB/sec   -18.912        True
32   BM_ReadColumn<false,Int32Type>/-1/10     1.457 GiB/sec     1.099 GiB/sec   -24.615        True
22   BM_ReadColumn<true,DoubleType>/10/50     1.433 GiB/sec   964.501 MiB/sec   -34.267        True
24     BM_ReadColumn<true,Int64Type>/-1/0     2.325 GiB/sec     1.400 GiB/sec   -39.804        True
2     BM_ReadColumn<true,DoubleType>/-1/0     2.412 GiB/sec     1.422 GiB/sec   -41.059        True
10     BM_ReadColumn<true,Int32Type>/-1/0     1.454 GiB/sec   864.019 MiB/sec   -41.959        True
16  BM_ReadColumn<false,DoubleType>/-1/20     3.826 GiB/sec     2.184 GiB/sec   -42.899        True
0    BM_ReadColumn<false,Int64Type>/-1/50     4.669 GiB/sec     1.865 GiB/sec   -60.055        True
4    BM_ReadColumn<false,Int32Type>/-1/50     3.184 GiB/sec   862.581 MiB/sec   -73.546        True

I agree that the alternating NA/not-NA case is pretty pathological, so IMHO this performance regression doesn't seem like a big deal.

@wesm (Member) commented Jun 22, 2020

Here's with clang; it seems to be even less of an issue there:

                                benchmark         baseline        contender  change %  regression
32      BM_ReadColumn<true,Int32Type>/1/1    1.656 GiB/sec    2.689 GiB/sec    62.384       False
15     BM_ReadColumn<true,Int64Type>/99/0    2.861 GiB/sec    4.184 GiB/sec    46.245       False
21    BM_ReadColumn<true,Int32Type>/99/50    1.890 GiB/sec    2.744 GiB/sec    45.220       False
20    BM_ReadColumn<true,Int64Type>/99/50    2.898 GiB/sec    4.199 GiB/sec    44.891       False
6      BM_ReadColumn<true,Int64Type>/50/1    1.530 GiB/sec    2.210 GiB/sec    44.402       False
30    BM_ReadColumn<true,Int64Type>/50/50    1.586 GiB/sec    2.233 GiB/sec    40.787       False
28     BM_ReadColumn<true,Int32Type>/50/0  881.102 MiB/sec    1.196 GiB/sec    39.025       False
22    BM_ReadColumn<true,Int64Type>/45/25    1.598 GiB/sec    2.221 GiB/sec    38.951       False
14      BM_ReadColumn<true,Int64Type>/1/1    2.967 GiB/sec    4.122 GiB/sec    38.919       False
29    BM_ReadColumn<true,Int32Type>/50/50  887.206 MiB/sec    1.203 GiB/sec    38.854       False
35     BM_ReadColumn<true,Int32Type>/99/0    1.928 GiB/sec    2.661 GiB/sec    38.045       False
17    BM_ReadColumn<true,Int64Type>/35/10    1.656 GiB/sec    2.248 GiB/sec    35.689       False
0    BM_ReadColumn<false,Int32Type>/-1/50    2.765 GiB/sec    3.750 GiB/sec    35.628       False
26    BM_ReadColumn<true,Int64Type>/30/10    1.692 GiB/sec    2.221 GiB/sec    31.244       False
19     BM_ReadColumn<true,Int64Type>/75/1    1.712 GiB/sec    2.217 GiB/sec    29.505       False
2      BM_ReadColumn<true,Int32Type>/25/5  991.022 MiB/sec    1.229 GiB/sec    27.031       False
31    BM_ReadColumn<true,Int64Type>/25/10    1.744 GiB/sec    2.164 GiB/sec    24.091       False
16   BM_ReadColumn<false,Int64Type>/-1/50    5.892 GiB/sec    7.252 GiB/sec    23.084       False
24    BM_ReadColumn<true,Int32Type>/10/10    1.003 GiB/sec    1.225 GiB/sec    22.073       False
23   BM_ReadColumn<true,DoubleType>/25/25    1.790 GiB/sec    2.159 GiB/sec    20.659       False
10      BM_ReadColumn<true,Int64Type>/5/5    1.995 GiB/sec    2.371 GiB/sec    18.853       False
5      BM_ReadColumn<true,Int64Type>/10/5    1.796 GiB/sec    2.122 GiB/sec    18.128       False
3   BM_ReadColumn<false,DoubleType>/-1/20    3.488 GiB/sec    4.062 GiB/sec    16.453       False
1    BM_ReadColumn<false,Int64Type>/-1/10    2.740 GiB/sec    3.055 GiB/sec    11.497       False
11   BM_ReadColumn<false,Int32Type>/-1/10    1.362 GiB/sec    1.486 GiB/sec     9.086       False
9    BM_ReadColumn<false,DoubleType>/-1/0   12.984 GiB/sec   13.310 GiB/sec     2.509       False
13   BM_ReadColumn<true,BooleanType>/5/10  515.781 MiB/sec  523.838 MiB/sec     1.562       False
7     BM_ReadColumn<false,Int32Type>/-1/1    4.505 GiB/sec    4.559 GiB/sec     1.198       False
34  BM_ReadColumn<false,BooleanType>/1/20  400.254 MiB/sec  403.780 MiB/sec     0.881       False
25  BM_ReadColumn<false,BooleanType>/-1/0  870.109 MiB/sec  875.827 MiB/sec     0.657       False
18    BM_ReadColumn<false,Int64Type>/-1/1    8.237 GiB/sec    8.045 GiB/sec    -2.331       False
27   BM_ReadColumn<true,BooleanType>/-1/1  816.651 MiB/sec  793.763 MiB/sec    -2.803       False
8    BM_ReadColumn<true,DoubleType>/10/50    1.649 GiB/sec    1.587 GiB/sec    -3.792       False
12     BM_ReadColumn<true,Int64Type>/-1/0    3.044 GiB/sec    1.432 GiB/sec   -52.958        True
33    BM_ReadColumn<true,DoubleType>/-1/0    3.161 GiB/sec    1.479 GiB/sec   -53.209        True
4      BM_ReadColumn<true,Int32Type>/-1/0    1.863 GiB/sec  754.619 MiB/sec   -60.448        True

@wesm (Member) commented Jun 23, 2020

I'm not sure what the MSVC failure is about but I'll debug locally

@wesm (Member) commented Jun 23, 2020

I'm able to reproduce the error in VS and set breakpoints. I got far enough to see that GetBatchWithDictSpaced has decoded more values than it was asked to.

[debugger screenshot]

@wesm (Member) commented Jun 23, 2020

There seems to be a situation where the bit run has more values than are needed to fulfill the call to GetSpaced.

[debugger screenshot]

@wesm (Member) commented Jun 23, 2020

@emkornfield I'm sort of at a dead end here; hopefully the above gives you some clues about where there might be a problem.

@wesm (Member) commented Jun 23, 2020

The bug seems to be in the BitRunReader

[debugger screenshot]

@emkornfield (Contributor Author):

Thanks, I'll take a look tonight. Hopefully this should be enough of a clue.

Commit messages (first message truncated):

  • …and number of bits in a row.

    Adds benchmarks comparing the two implementations under different
    distributions.

    - Adds the BitRunReader for use in Parquet writing.

    - Refactors GetBatchedSpaced and GetBatchedSpacedWithDict:
      uses a single templated method that adds a template parameter
      so the code can be shared; does all checking for out-of-bounds
      indices in one go instead of on each pass through the literal
      (this is a slight behavior change, as the index returned will be
      different); and makes use of the BitRunReader. With exactly
      alternating bits this shows a big performance drop, but is
      generally positive across any random and/or skewed nullability.

  • fix type
  • cast to make appveyor happy
  • add predict false
  • one more cast for windows
  • remove redundant using
  • try to fix builds
  • address some comments
  • inline all methods
  • remove InvertRemainingBits and use LeastSignificantBitMask (renamed from PartialWordMask)
  • remove LoadInitialWord, fix compile error
  • fix lint
  • Pre-rebase work
  • iwyu
  • Fix MSVC warning
@emkornfield (Contributor Author):

I fixed the bug and some other issues. I also incorporated BitBlockCounter in rle_encoder.h. Latest benchmarks:

[benchmark results screenshot]

@emkornfield (Contributor Author):

This is with SSE4_2

@emkornfield (Contributor Author):

Travis CI seems flaky. My branch: https://travis-ci.org/github/emkornfield/arrow/builds/701926099

@wesm (Member) commented Jun 25, 2020

FWIW gcc benchmarks (sse4.2) on my machine (ubuntu 18.04 on i9-9960X)

$ archery benchmark diff --cc=gcc-8 --cxx=g++-8 emkornfield/ARROW-8504 master --suite-filter=parquet-arrow
                                benchmark         baseline        contender  change %            counters
35     BM_ReadColumn<true,Int64Type>/75/1  267.930 MiB/sec  275.083 MiB/sec     2.670   {'iterations': 2}
6      BM_ReadColumn<true,Int32Type>/99/0  238.381 MiB/sec  244.286 MiB/sec     2.477   {'iterations': 3}
10   BM_ReadColumn<true,DoubleType>/10/50  270.775 MiB/sec  275.249 MiB/sec     1.652   {'iterations': 2}
31     BM_ReadColumn<true,Int32Type>/25/5  171.549 MiB/sec  174.237 MiB/sec     1.567   {'iterations': 2}
29    BM_ReadColumn<false,Int32Type>/-1/1  856.194 MiB/sec  866.945 MiB/sec     1.256  {'iterations': 11}
7        BM_WriteColumn<true,BooleanType>   31.593 MiB/sec   31.845 MiB/sec     0.797   {'iterations': 1}
44             BM_ReadIndividualRowGroups  849.469 MiB/sec  856.209 MiB/sec     0.793   {'iterations': 6}
36    BM_ReadColumn<true,DoubleType>/-1/0  408.388 MiB/sec  411.490 MiB/sec     0.759   {'iterations': 3}
3      BM_ReadColumn<true,Int32Type>/50/0  151.600 MiB/sec  152.399 MiB/sec     0.527   {'iterations': 2}
32    BM_ReadColumn<true,Int64Type>/30/10  268.512 MiB/sec  269.849 MiB/sec     0.498   {'iterations': 2}
8   BM_ReadColumn<false,DoubleType>/-1/20  772.878 MiB/sec  776.716 MiB/sec     0.497   {'iterations': 5}
17   BM_ReadColumn<true,DoubleType>/25/25  277.705 MiB/sec  278.957 MiB/sec     0.451   {'iterations': 2}
23    BM_ReadColumn<false,Int64Type>/-1/1    1.472 GiB/sec    1.478 GiB/sec     0.414  {'iterations': 13}
30      BM_ReadColumn<true,Int32Type>/1/1  235.809 MiB/sec  236.730 MiB/sec     0.391   {'iterations': 3}
45         BM_WriteColumn<true,Int32Type>   45.067 MiB/sec   45.232 MiB/sec     0.366   {'iterations': 1}
41    BM_ReadColumn<true,Int64Type>/45/25  241.545 MiB/sec  242.427 MiB/sec     0.365   {'iterations': 2}
40   BM_ReadColumn<false,Int64Type>/-1/10  710.485 MiB/sec  712.181 MiB/sec     0.239   {'iterations': 5}
22  BM_ReadColumn<false,BooleanType>/-1/0  211.939 MiB/sec  212.433 MiB/sec     0.233  {'iterations': 15}
39   BM_ReadColumn<false,Int32Type>/-1/50  629.632 MiB/sec  630.955 MiB/sec     0.210   {'iterations': 9}
15  BM_ReadColumn<false,BooleanType>/1/20  145.490 MiB/sec  145.701 MiB/sec     0.145  {'iterations': 10}
9    BM_ReadColumn<true,BooleanType>/-1/1  162.373 MiB/sec  162.497 MiB/sec     0.077   {'iterations': 4}
11     BM_ReadColumn<true,Int32Type>/-1/0  257.190 MiB/sec  257.278 MiB/sec     0.034   {'iterations': 3}
13        BM_WriteColumn<false,Int32Type>   32.378 MiB/sec   32.386 MiB/sec     0.025   {'iterations': 1}
37    BM_ReadColumn<true,Int32Type>/99/50  238.918 MiB/sec  238.975 MiB/sec     0.024   {'iterations': 3}
20       BM_WriteColumn<false,DoubleType>   54.200 MiB/sec   54.209 MiB/sec     0.017   {'iterations': 1}
19      BM_ReadColumn<true,Int64Type>/1/1  369.522 MiB/sec  369.199 MiB/sec    -0.087   {'iterations': 2}
0    BM_ReadColumn<false,Int64Type>/-1/50    1.144 GiB/sec    1.143 GiB/sec    -0.090  {'iterations': 10}
34    BM_ReadColumn<true,Int64Type>/35/10  262.271 MiB/sec  261.890 MiB/sec    -0.145   {'iterations': 2}
21     BM_ReadColumn<true,Int64Type>/50/1  242.510 MiB/sec  242.067 MiB/sec    -0.183   {'iterations': 2}
5     BM_ReadColumn<true,Int32Type>/50/50  152.344 MiB/sec  152.000 MiB/sec    -0.225   {'iterations': 2}
1     BM_ReadColumn<true,Int32Type>/10/10  179.063 MiB/sec  178.583 MiB/sec    -0.268   {'iterations': 2}
28        BM_WriteColumn<false,Int64Type>   64.162 MiB/sec   63.990 MiB/sec    -0.269   {'iterations': 1}
38        BM_WriteColumn<true,DoubleType>   69.641 MiB/sec   69.446 MiB/sec    -0.280   {'iterations': 1}
18      BM_WriteColumn<false,BooleanType>   26.643 MiB/sec   26.557 MiB/sec    -0.324   {'iterations': 2}
16   BM_ReadColumn<false,DoubleType>/-1/0    2.291 GiB/sec    2.281 GiB/sec    -0.397  {'iterations': 20}
43         BM_WriteColumn<true,Int64Type>   75.453 MiB/sec   75.050 MiB/sec    -0.534   {'iterations': 1}
33   BM_ReadColumn<true,BooleanType>/5/10  115.534 MiB/sec  114.574 MiB/sec    -0.832   {'iterations': 3}
14    BM_ReadColumn<true,Int64Type>/50/50  242.071 MiB/sec  240.005 MiB/sec    -0.854   {'iterations': 2}
4                BM_ReadMultipleRowGroups  826.026 MiB/sec  818.709 MiB/sec    -0.886   {'iterations': 5}
24   BM_ReadColumn<false,Int32Type>/-1/10  386.383 MiB/sec  382.213 MiB/sec    -1.079   {'iterations': 6}
12     BM_ReadColumn<true,Int64Type>/-1/0  410.877 MiB/sec  404.700 MiB/sec    -1.503   {'iterations': 3}
25     BM_ReadColumn<true,Int64Type>/10/5  289.612 MiB/sec  284.777 MiB/sec    -1.670   {'iterations': 2}
2     BM_ReadColumn<true,Int64Type>/25/10  277.045 MiB/sec  271.925 MiB/sec    -1.848   {'iterations': 2}
26      BM_ReadColumn<true,Int64Type>/5/5  311.137 MiB/sec  295.823 MiB/sec    -4.922   {'iterations': 2}
42     BM_ReadColumn<true,Int64Type>/99/0  371.803 MiB/sec  347.872 MiB/sec    -6.436   {'iterations': 2}
27    BM_ReadColumn<true,Int64Type>/99/50  381.029 MiB/sec  351.329 MiB/sec    -7.795   {'iterations': 3}

Here's clang-11

$ archery benchmark diff --cc=clang-11 --cxx=clang++-11 emkornfield/ARROW-8504 master --suite-filter=parquet-arrow
                                benchmark         baseline        contender  change %            counters
37    BM_ReadColumn<true,Int64Type>/35/10  242.396 MiB/sec  259.801 MiB/sec     7.180   {'iterations': 2}
17   BM_ReadColumn<false,DoubleType>/-1/0    2.160 GiB/sec    2.276 GiB/sec     5.363  {'iterations': 20}
0       BM_ReadColumn<true,Int64Type>/5/5  304.868 MiB/sec  313.829 MiB/sec     2.939   {'iterations': 2}
10     BM_ReadColumn<true,Int64Type>/-1/0  398.515 MiB/sec  408.179 MiB/sec     2.425   {'iterations': 3}
3      BM_ReadColumn<true,Int64Type>/10/5  286.868 MiB/sec  289.629 MiB/sec     0.962   {'iterations': 2}
18             BM_ReadIndividualRowGroups  845.304 MiB/sec  851.353 MiB/sec     0.716   {'iterations': 5}
33     BM_ReadColumn<true,Int64Type>/99/0  372.125 MiB/sec  374.716 MiB/sec     0.696   {'iterations': 2}
12     BM_ReadColumn<true,Int32Type>/50/0  151.242 MiB/sec  152.137 MiB/sec     0.591   {'iterations': 2}
8                BM_ReadMultipleRowGroups  819.229 MiB/sec  823.423 MiB/sec     0.512   {'iterations': 5}
1     BM_ReadColumn<true,Int64Type>/25/10  270.997 MiB/sec  272.218 MiB/sec     0.450   {'iterations': 2}
16         BM_WriteColumn<true,Int32Type>   44.766 MiB/sec   44.946 MiB/sec     0.402   {'iterations': 1}
11     BM_ReadColumn<true,Int64Type>/50/1  241.717 MiB/sec  242.539 MiB/sec     0.340   {'iterations': 2}
45    BM_ReadColumn<true,Int64Type>/50/50  241.083 MiB/sec  241.646 MiB/sec     0.233   {'iterations': 2}
27      BM_WriteColumn<false,BooleanType>   26.050 MiB/sec   26.095 MiB/sec     0.170   {'iterations': 2}
20    BM_ReadColumn<false,Int32Type>/-1/1  864.068 MiB/sec  865.475 MiB/sec     0.163  {'iterations': 11}
7         BM_WriteColumn<true,DoubleType>   69.439 MiB/sec   69.545 MiB/sec     0.152   {'iterations': 1}
19   BM_ReadColumn<true,DoubleType>/25/25  275.396 MiB/sec  275.811 MiB/sec     0.151   {'iterations': 2}
23    BM_ReadColumn<true,Int64Type>/99/50  370.614 MiB/sec  371.017 MiB/sec     0.109   {'iterations': 2}
9      BM_ReadColumn<true,Int32Type>/25/5  173.728 MiB/sec  173.861 MiB/sec     0.076   {'iterations': 2}
5     BM_ReadColumn<true,DoubleType>/-1/0  410.275 MiB/sec  410.482 MiB/sec     0.051   {'iterations': 3}
44       BM_WriteColumn<false,DoubleType>   54.559 MiB/sec   54.586 MiB/sec     0.050   {'iterations': 1}
36       BM_WriteColumn<true,BooleanType>   31.427 MiB/sec   31.426 MiB/sec    -0.003   {'iterations': 1}
28   BM_ReadColumn<true,BooleanType>/5/10  115.041 MiB/sec  114.909 MiB/sec    -0.115   {'iterations': 3}
2     BM_ReadColumn<true,Int64Type>/45/25  242.615 MiB/sec  242.257 MiB/sec    -0.148   {'iterations': 2}
6     BM_ReadColumn<false,Int64Type>/-1/1    1.471 GiB/sec    1.469 GiB/sec    -0.157  {'iterations': 13}
4   BM_ReadColumn<false,BooleanType>/1/20  145.277 MiB/sec  145.008 MiB/sec    -0.185  {'iterations': 10}
24    BM_ReadColumn<true,Int32Type>/99/50  238.739 MiB/sec  238.281 MiB/sec    -0.192   {'iterations': 3}
34        BM_WriteColumn<false,Int32Type>   32.013 MiB/sec   31.937 MiB/sec    -0.237   {'iterations': 1}
30  BM_ReadColumn<false,BooleanType>/-1/0  213.488 MiB/sec  212.948 MiB/sec    -0.253  {'iterations': 14}
39     BM_ReadColumn<true,Int32Type>/-1/0  259.936 MiB/sec  259.218 MiB/sec    -0.276   {'iterations': 3}
14    BM_ReadColumn<true,Int64Type>/30/10  270.662 MiB/sec  269.891 MiB/sec    -0.285   {'iterations': 2}
26        BM_WriteColumn<false,Int64Type>   62.818 MiB/sec   62.576 MiB/sec    -0.385   {'iterations': 1}
13     BM_ReadColumn<true,Int64Type>/75/1  275.538 MiB/sec  274.461 MiB/sec    -0.391   {'iterations': 2}
32   BM_ReadColumn<false,Int32Type>/-1/50  631.052 MiB/sec  627.655 MiB/sec    -0.538   {'iterations': 9}
38  BM_ReadColumn<false,DoubleType>/-1/20  771.752 MiB/sec  766.959 MiB/sec    -0.621   {'iterations': 5}
22   BM_ReadColumn<false,Int64Type>/-1/10  712.603 MiB/sec  708.037 MiB/sec    -0.641   {'iterations': 5}
21      BM_ReadColumn<true,Int64Type>/1/1  370.007 MiB/sec  367.621 MiB/sec    -0.645   {'iterations': 2}
25    BM_ReadColumn<true,Int32Type>/10/10  182.975 MiB/sec  181.714 MiB/sec    -0.689   {'iterations': 2}
29   BM_ReadColumn<false,Int64Type>/-1/50    1.144 GiB/sec    1.135 GiB/sec    -0.713  {'iterations': 10}
41   BM_ReadColumn<true,DoubleType>/10/50  271.556 MiB/sec  269.579 MiB/sec    -0.728   {'iterations': 2}
42   BM_ReadColumn<false,Int32Type>/-1/10  390.676 MiB/sec  386.926 MiB/sec    -0.960   {'iterations': 6}
35         BM_WriteColumn<true,Int64Type>   74.108 MiB/sec   73.364 MiB/sec    -1.004   {'iterations': 1}
40      BM_ReadColumn<true,Int32Type>/1/1  236.583 MiB/sec  233.838 MiB/sec    -1.160   {'iterations': 3}
43   BM_ReadColumn<true,BooleanType>/-1/1  162.242 MiB/sec  160.101 MiB/sec    -1.320   {'iterations': 4}
31     BM_ReadColumn<true,Int32Type>/99/0  236.270 MiB/sec  232.278 MiB/sec    -1.690   {'iterations': 3}
15    BM_ReadColumn<true,Int32Type>/50/50  152.712 MiB/sec  148.673 MiB/sec    -2.645   {'iterations': 2}

@wesm (Member) commented Jun 25, 2020

+1, thanks for fixing this @emkornfield!

@wesm wesm closed this in e0a9d0f Jun 25, 2020
sgnkc pushed a commit to sgnkc/arrow that referenced this pull request Jul 2, 2020
Adds two implementations of a BitRunReader, which returns set/not-set
and number of bits in a row.

- Adds benchmarks comparing the two implementations under different
distributions.

- Adds the reader for use in the ParquetWriter (there is a second
location on the Nullable terminal node that I left unchanged because
it showed a performance drop of 30%; I think this is due to the issue
described in the next bullet point, or because BitVisitor is getting
vectorized to something more efficient).

- Refactors GetBatchedSpaced and GetBatchedSpacedWithDict:
  1.  Uses a single templated method with an added template parameter
      so the code can be shared.
  2.  Does all checking for out-of-bounds indices in one go instead
      of on each pass through the literal (this is a slight behavior
      change, as the returned index will be different).
  3.  Makes use of the BitRunReader.

Based on the BM_ReadColumn benchmarks this seems to make performance worse by
50-80%.  This was surprising to me, and my current working theory is
that the nullable benchmarks present the worst-case scenario (every
other element is null), so the overhead of invoking the call
relative to the existing code is high (in perf, calls to NextRun()
jump to the top after this change).  Before deciding whether to revert
the use of BitRunReader here, I'd like to implement a more comprehensive
set of benchmarks to test this hypothesis.

Other TODOs
- [x] Need to revert change to testing submodule hash
- [x] Add attribution to wikipedia for InvertRemainingBits
- [x] Fix some unintelligible comments.
- [x] Resolve performance issues.

Closes apache#7143 from emkornfield/ARROW-8504

Authored-by: Micah Kornfield <emkornfield@gmail.com>
Signed-off-by: Wes McKinney <wesm@apache.org>