ARROW-6450: [C++] Use 2x reallocation strategy in BufferBuilder instead of 1.5x #5270

Closed
wants to merge 3 commits into from

Member

@wesm wesm commented Sep 4, 2019

As Antoine noted, 2x seems to have a small benefit with jemalloc and a more significant benefit with the system allocator.

1.5x with jemalloc (i9-9960X on Ubuntu 18.04)

---------------------------------------------------------------------------
Benchmark                                    Time           CPU Iterations
---------------------------------------------------------------------------
BufferBuilderTinyWrites/real_time    197653056 ns  197652025 ns          7   1.26484GB/s
BufferBuilderSmallWrites/real_time    82063937 ns   82062700 ns         17    3.0464GB/s
BufferBuilderLargeWrites/real_time    94425574 ns   94423358 ns         15   2.63343GB/s
BuildBooleanArrayNoNulls              64521268 ns   64519003 ns         22   3.87483GB/s
BuildIntArrayNoNulls                  93079445 ns   93053984 ns         15   2.68661GB/s
BuildAdaptiveIntNoNulls               34491212 ns   34479458 ns         40   7.25069GB/s
BuildAdaptiveIntNoNullsScalarAppend  127306096 ns  127302694 ns         11   1.96382GB/s
BuildBinaryArray                     809044598 ns  809023401 ns          2   316.431MB/s
BuildChunkedBinaryArray              637293688 ns  637266310 ns          2   401.716MB/s
BuildFixedSizeBinaryArray            252201859 ns  252194570 ns          6   1015.09MB/s
BuildDecimalArray                    561307314 ns  560996750 ns          2   912.661MB/s
BuildInt64DictionaryArrayRandom      266231767 ns  266228961 ns          5   961.578MB/s
BuildInt64DictionaryArraySequential  257900722 ns  257894477 ns          5   992.654MB/s
BuildInt64DictionaryArraySimilar     444990975 ns  444979666 ns          3   575.307MB/s
BuildStringDictionaryArray           748622297 ns  748614312 ns          2   456.223MB/s
ArrayDataConstructDestruct               38763 ns      38763 ns      36426

2x and jemalloc

---------------------------------------------------------------------------
Benchmark                                    Time           CPU Iterations
---------------------------------------------------------------------------
BufferBuilderTinyWrites/real_time    197711155 ns  197709990 ns          7   1.26447GB/s
BufferBuilderSmallWrites/real_time    81069040 ns   81033678 ns         17   3.08379GB/s
BufferBuilderLargeWrites/real_time    94150970 ns   94148357 ns         15   2.64111GB/s
BuildBooleanArrayNoNulls              61140830 ns   61135379 ns         23   4.08929GB/s
BuildIntArrayNoNulls                  92607631 ns   92604219 ns         15   2.69966GB/s
BuildAdaptiveIntNoNulls               33907949 ns   33880894 ns         41   7.37879GB/s
BuildAdaptiveIntNoNullsScalarAppend  117096729 ns  117093371 ns         12   2.13505GB/s
BuildBinaryArray                     704963079 ns  704624550 ns          2   363.314MB/s
BuildChunkedBinaryArray              623889554 ns  623874319 ns          2   410.339MB/s
BuildFixedSizeBinaryArray            247338429 ns  247333048 ns          6   1035.04MB/s
BuildDecimalArray                    409234807 ns  409223276 ns          3   1.22183GB/s
BuildInt64DictionaryArrayRandom      296006132 ns  295998723 ns          5   864.869MB/s
BuildInt64DictionaryArraySequential  285353620 ns  285339768 ns          5   897.176MB/s
BuildInt64DictionaryArraySimilar     457288271 ns  457281346 ns          3    559.83MB/s
BuildStringDictionaryArray           729509721 ns  729493850 ns          2   468.181MB/s
ArrayDataConstructDestruct               40037 ns      40036 ns      35186

1.5x and system allocator

---------------------------------------------------------------------------
Benchmark                                    Time           CPU Iterations
---------------------------------------------------------------------------
BufferBuilderTinyWrites/real_time    673175457 ns  673154074 ns          2   380.287MB/s
BufferBuilderSmallWrites/real_time   418318011 ns  418314964 ns          3   611.974MB/s
BufferBuilderLargeWrites/real_time   534294790 ns  534282517 ns          3   476.574MB/s
BuildBooleanArrayNoNulls             135128411 ns  135122284 ns         12   1.85018GB/s
BuildIntArrayNoNulls                 448668680 ns  448661666 ns          3   570.586MB/s
BuildAdaptiveIntNoNulls              109891141 ns  109889005 ns         10   2.27502GB/s
BuildAdaptiveIntNoNullsScalarAppend  148289318 ns  148286607 ns          9   1.68592GB/s
BuildBinaryArray                    1162657389 ns 1162619171 ns          1   220.192MB/s
BuildChunkedBinaryArray              606754226 ns  606740819 ns          2   421.926MB/s
BuildFixedSizeBinaryArray            657576337 ns  657397552 ns          2   389.414MB/s
BuildDecimalArray                   1156559933 ns 1156545644 ns          1   442.698MB/s
BuildInt64DictionaryArrayRandom      414707956 ns  414695071 ns          3   617.321MB/s
BuildInt64DictionaryArraySequential  405631113 ns  405557929 ns          3   631.229MB/s
BuildInt64DictionaryArraySimilar     566810196 ns  566795202 ns          2   451.662MB/s
BuildStringDictionaryArray           847338199 ns  847322935 ns          2   403.075MB/s
ArrayDataConstructDestruct               40224 ns      40223 ns      35377

2x and system allocator

---------------------------------------------------------------------------
Benchmark                                    Time           CPU Iterations
---------------------------------------------------------------------------
BufferBuilderTinyWrites/real_time    362125402 ns  362125418 ns          4   706.937MB/s
BufferBuilderSmallWrites/real_time   370158964 ns  370156920 ns          4   691.594MB/s
BufferBuilderLargeWrites/real_time   380047623 ns  380032175 ns          4   669.998MB/s
BuildBooleanArrayNoNulls              92829889 ns   92827630 ns         15   2.69316GB/s
BuildIntArrayNoNulls                 208102367 ns  208099591 ns          6   1.20135GB/s
BuildAdaptiveIntNoNulls               57545256 ns   57543182 ns         24   4.34456GB/s
BuildAdaptiveIntNoNullsScalarAppend  121124555 ns  121122124 ns         11   2.06403GB/s
BuildBinaryArray                     698486781 ns  698459415 ns          2   366.521MB/s
BuildChunkedBinaryArray              634176049 ns  634149980 ns          2    403.69MB/s
BuildFixedSizeBinaryArray            386775119 ns  386764276 ns          4   661.902MB/s
BuildDecimalArray                    583660353 ns  583645899 ns          2   877.244MB/s
BuildInt64DictionaryArrayRandom      352273499 ns  352250693 ns          4   726.755MB/s
BuildInt64DictionaryArraySequential  341524466 ns  341520605 ns          4   749.589MB/s
BuildInt64DictionaryArraySimilar     500344696 ns  500323404 ns          3   511.669MB/s
BuildStringDictionaryArray           793840167 ns  793819318 ns          2   430.243MB/s
ArrayDataConstructDestruct               39403 ns      39403 ns      35585

// Otherwise overallocate by 1.5 to keep a linear amortized cost.
// TODO: revisit this? See comment in BufferOutputStream::Reserve.
return std::max(new_capacity, current_capacity * 3 / 2);
// Doubling capacity except for large Reserve requests
Contributor

Consider adding a comment that doubling works well with jemalloc?

Member

It seems to work well with the system allocator as well. I guess the 1.5 factor is a good choice for smallish allocations, but less so once the allocator (re)allocates entire pages.

Member

Also, why "except for large Reserve requests"?

Contributor

@emkornfield emkornfield left a comment

For some reason I thought we got rid of this a long time ago.

Member

pitrou commented Sep 4, 2019

What are the benchmark results for builder-benchmark?

Member

pitrou commented Sep 4, 2019

Btw it's impressive how far behind the system allocator is.

Member

pitrou commented Sep 4, 2019

Here are some benchmarks (Ubuntu 18.04, Ryzen 7 1700):

1.5x and jemalloc

BufferBuilderTinyWrites/real_time    333148186 ns    332245497 ns            5 bytes_per_second=768.427M/s
BufferBuilderSmallWrites/real_time    93194685 ns     92631361 ns           23 bytes_per_second=2.68255G/s
BufferBuilderLargeWrites/real_time    89324782 ns     88918139 ns           24 bytes_per_second=2.78381G/s

BuildBooleanArrayNoNulls              71730467 ns     71541662 ns           29 bytes_per_second=3.49447G/s
BuildIntArrayNoNulls                  93380546 ns     92866418 ns           23 bytes_per_second=2.69204G/s
BuildAdaptiveIntNoNulls               39836144 ns     39452195 ns           52 bytes_per_second=6.33678G/s
BuildAdaptiveIntNoNullsScalarAppend  259067249 ns    258988727 ns            8 bytes_per_second=988.46M/s
BuildBinaryArray                     653744521 ns    652748528 ns            3 bytes_per_second=392.188M/s
BuildChunkedBinaryArray              504499969 ns    504009419 ns            4 bytes_per_second=507.927M/s
BuildFixedSizeBinaryArray            405099083 ns    404745441 ns            5 bytes_per_second=632.496M/s
BuildDecimalArray                    789990787 ns    788466731 ns            3 bytes_per_second=649.362M/s
BuildInt64DictionaryArrayRandom      569856849 ns    569493069 ns            4 bytes_per_second=449.523M/s
BuildInt64DictionaryArraySequential  553514226 ns    552983410 ns            4 bytes_per_second=462.943M/s
BuildInt64DictionaryArraySimilar     789256859 ns    788815926 ns            3 bytes_per_second=324.537M/s
BuildStringDictionaryArray          1460393214 ns   1459601695 ns            2 bytes_per_second=233.992M/s

2x and jemalloc

BufferBuilderTinyWrites/real_time    328798897 ns    328095132 ns            5 bytes_per_second=778.591M/s
BufferBuilderSmallWrites/real_time    87949666 ns     87567775 ns           24 bytes_per_second=2.84253G/s
BufferBuilderLargeWrites/real_time    85086232 ns     84828381 ns           25 bytes_per_second=2.92248G/s

BuildBooleanArrayNoNulls              65633284 ns     65505032 ns           32 bytes_per_second=3.8165G/s
BuildIntArrayNoNulls                  86031153 ns     85691751 ns           25 bytes_per_second=2.91743G/s
BuildAdaptiveIntNoNulls               33889643 ns     33816280 ns           62 bytes_per_second=7.39289G/s
BuildAdaptiveIntNoNullsScalarAppend  254664440 ns    254587733 ns            8 bytes_per_second=1005.55M/s
BuildBinaryArray                     598989691 ns    597710112 ns            4 bytes_per_second=428.301M/s
BuildChunkedBinaryArray              518256106 ns    517907817 ns            4 bytes_per_second=494.296M/s
BuildFixedSizeBinaryArray            407843852 ns    407561172 ns            5 bytes_per_second=628.127M/s
BuildDecimalArray                    686072412 ns    685079344 ns            3 bytes_per_second=747.359M/s
BuildInt64DictionaryArrayRandom      564424718 ns    564247012 ns            4 bytes_per_second=453.702M/s
BuildInt64DictionaryArraySequential  550441876 ns    550010612 ns            4 bytes_per_second=465.446M/s
BuildInt64DictionaryArraySimilar     793397531 ns    792774122 ns            3 bytes_per_second=322.917M/s
BuildStringDictionaryArray          1456951448 ns   1456270538 ns            2 bytes_per_second=234.527M/s

Member Author

wesm commented Sep 4, 2019

I think the results I showed last night were affected by CPU throttling. I'm going to update the PR description with higher-quality results and use min_time=1.0 when running the benchmarks. Sorry for my sloppiness.

Member Author

wesm commented Sep 4, 2019

Updated the PR description. The jemalloc performance difference I show is similar to what @pitrou shows. The system allocator perf is much better for 2x than 1.5x, though, which supports the decision (though we obviously want people to be using jemalloc).

Member Author

wesm commented Sep 4, 2019

+1
