
Experiment: Potential speed up strategies #399

Draft: svilupp wants to merge 9 commits into main

Conversation

@svilupp (Contributor) commented Mar 12, 2023

This PR showcases several strategies for improving the performance of Arrow.jl and makes them easier to test. The intention is to break it into separate, modular PRs for the actual contributions (if there is interest).

TL;DR Arrow.jl beats everyone except in one case (loading large uncompressed strings in Task 1, where Polars+PyArrow is so fast - 5ms - that it clearly uses some lazy strategy and skips materializing the strings).

Changes:

  • Project deps: added a direct dependency on InlineStrings
  • Arrow.Table: added kwarg useinlinestrings to load strings as InlineStrings whenever possible (defaults to true); see the usage sketch after this list
  • Arrow.write: added chunking/partitioning as a default, same as PyArrow, via kwarg chunksize (defaults to 64000 like PyArrow; could be changed to consider available threads)
  • Compression: added a thread-safe implementation with locks for decompression, plus locks for compression (see Add locks for de-/compressor + initialize decompressor #397)
  • TranscodingStreams (type piracy involved): added an in-place mutating transcode to avoid unnecessary output buffer resizing
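
For illustration, here is how the two new keywords described above would be used (the keyword names come from this PR and are not part of a released Arrow.jl; the file name and data are made up):

```julia
using Arrow, DataFrames

df = DataFrame(id = 1:100_000, text = string.(rand(UInt32, 100_000)))

# Write with the chunking default proposed here: the table is split into
# record batches of at most `chunksize` rows (64_000 by default, like PyArrow).
Arrow.write("data.arrow", df; chunksize = 64_000, compress = :zstd)

# Read back; `useinlinestrings = true` (the proposed default) converts eligible
# string columns to InlineStrings for cheaper materialization.
tbl = Arrow.Table("data.arrow"; useinlinestrings = true)
```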

Future:

  • separate parsing the structure from materializing the buffers within RecordBatches. Currently, they are done together (VectorIterator), but separating them would enable better multithreading (it seems to be the secret ingredient of Polars/PyArrow when data is not partitioned)
  • allow native serialization/deserialization of InlineStrings (it wouldn't be readable in other languages / would look like Ints; see the bit-level sketch after this list)
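
To illustrate the second point: InlineStrings are fixed-size primitive types, so a native serialization would essentially write their raw bits, and a reader without InlineStrings support would see integers rather than strings. A minimal sketch (illustrative, not part of this PR):

```julia
using InlineStrings

s = String7("arrow")             # String7 is an 8-byte primitive type
bits = reinterpret(UInt64, s)    # the raw bits another Arrow reader would see as an integer
s2 = reinterpret(String7, bits)  # round-tripping requires knowing the original type
@assert s2 == s
```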

Timings (copied from the original thread for comparison):
Task 1: 10x count nonmissing elements in the first column of a table
Data: 2 columns of 5K-long strings each, 10% of data missing, 10K rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)

  • Pandas: 1.2s, 1.5s, 1.6s
  • Polars: 5ms, 1.5s, 2.05s
  • Polars+PyArrow: 4.8ms, 0.26s, 0.42s
  • Arrow+32Threads: 0.17s, 2.3s, 1.6s
  • Arrow+1Thread: 0.2s, 2.25s, 1.9s

Data: 32 partitions (!), 2 columns of 5K-long strings each, 10% of data missing, 10K rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)

  • Pandas: 1.2s, 1.0s, 1.2s
  • Polars: 9ms, 2.1s, 2.8s
  • Polars+PyArrow: 1.1s, 1.3s, 1.5s
  • Arrow+32Threads: 0.22s, 0.44s, 0.4s
    (Arrow.jl timing also benefits from a quick fix to TranscodingStreams)

NEW: partitioned using the new defaults+keywords write_out_compressions(df, fn; chunksize = cld(nrow(df), Threads.nthreads()));

  • Arrow+32Threads: 0.21s, 0.26s, 0.25s
  • Arrow+1Thread: 0.15s, 0.45s, 1.18s

Task 2: 10x mean value of Int column in the first column of a table
Data: 10 columns, Int64, 10M rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)

  • Pandas: 5.4s, 5.9s, 5.84s
  • Polars: 0.23s, 8s, 8.1s
  • Polars+PyArrow: 0.2s, 0.7s, 0.6s
  • Arrow+32Threads: 0.1s, 17.2s, 6.1s
  • Arrow+1Thread: 0.1s, 16.3s, 6.3s

Data: 32 partitions (!), 10 columns, Int64, 10M rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)

  • Pandas: 5.6s, 2.8s, 2.6s
  • Polars: 0.23s, 12.8s, 12.6s
  • Polars+PyArrow: 6.5s, 6.5s, 6.4s
  • Arrow+32Threads: 0.1s, 1.2s, 0.7s
    (Arrow.jl timing also benefits from a quick fix to TranscodingStreams)

NEW: partitioned using the new defaults+keywords write_out_compressions(df, fn; chunksize = min(64000, cld(nrow(df), Threads.nthreads())));

  • Arrow+32Threads: 0.23s, 2.0s, 1.8s (clearly bigger chunks are better here -- see the case above)
  • Arrow+1Thread: 0.24s, 4.3s, 4.05s

Added a new task to test out the automatic string inlining
Task 3: 10x count nonmissing elements in the first column of a table
Data: 2 columns of strings 10 codeunits long each, 10% of data missing, 1M rows
partitioned using the new defaults+keywords write_out_compressions(df, fn; chunksize = min(64000, cld(nrow(df), Threads.nthreads())));

Timings: (ordered by Uncompressed, LZ4, ZSTD)

  • Arrow+32Threads (useinlinestrings=false): 0.38s, 0.44s, 0.44s
  • Arrow+1Thread (useinlinestrings=false): 0.45s, 0.48s, 0.74s
  • Arrow+32Threads (useinlinestrings=true): 0.1s, 0.15s, 0.2s
  • Pandas: 4.8s, 5s, 5s
  • Polars: 0.47s, 0.64s, 0.87s
  • Polars+PyArrow: 0.32s, 0.36s, 0.4s
    (for comparison, I also ran a string-length check on Polars+PyArrow: 0.32s, 0.36s, 0.41s)

And the best part? On my MacBook, I can get timings for Task 3 around 40ms (uncompressed) and 90ms (compressed), which is better than what I could have hoped for :)

Setup:

  • Machine: m5dn.8xl, NVME local drive, 32 vCPU, 96GB RAM
  • Julia 1.8.5 / Arrow 2.4.3 + this PR
  • Python 3.10

Benchmarking code is provided in #393
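
For context, the partitioned runs above call a helper named write_out_compressions, whose actual code lives in #393. The following is only a rough guess at its shape, based on how it is invoked here (one file per codec, using the chunksize keyword proposed in this PR):

```julia
using Arrow

# Hypothetical reconstruction of the #393 helper, not the real thing:
# write the same table once per compression setting so each variant can be
# benchmarked separately.
function write_out_compressions(df, fn; chunksize = 64_000)
    Arrow.write(fn * "_uncompressed.arrow", df; chunksize)
    Arrow.write(fn * "_lz4.arrow", df; chunksize, compress = :lz4)
    Arrow.write(fn * "_zstd.arrow", df; chunksize, compress = :zstd)
end

# The measured operation in Tasks 1 and 3 is then roughly:
#   count(!ismissing, Tables.getcolumn(Arrow.Table(fn * "_zstd.arrow"), 1))
```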

Commits:

  • Arrow project: added deps for InlineStrings
  • Arrow.Table: added kwarg useinlinestrings to load strings as InlineStrings whenever possible
  • Arrow.write: added chunking as a default (à la PyArrow), kwarg chunksize
  • Compression: added thread-safe implementation with locks for decompression, added locks for compression
  • TranscodingStreams: added in-place mutating transcode to avoid unnecessary resizing
@svilupp (Contributor, Author) commented Mar 12, 2023

Two additional thoughts on InlineStrings:

  • We could add support for native serialization/deserialization of InlineStrings (but it wouldn't be readable in other languages / would look like Integers)
  • At the moment I find the smallest suitable type by searching the _whole_ vector of offsets, which is expensive (c. 1ms for 10M elements). We could try sampling only a subset, but we would then have to introduce safety checks in case a bigger string suddenly shows up. I decided to pay the cost upfront and make sure no strings are larger (see the sketch after this list).
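
A minimal sketch of that upfront scan (illustrative, not the PR's actual code), assuming Arrow-style offsets where element i occupies offsets[i+1] - offsets[i] bytes:

```julia
using InlineStrings

# Scan the whole offsets buffer once, find the longest string, and pick the
# smallest InlineString type that can hold it (or nothing to keep regular Strings).
function smallest_inline_type(offsets::AbstractVector{<:Integer})
    maxlen = 0
    @inbounds for i in 1:length(offsets) - 1
        len = offsets[i + 1] - offsets[i]
        len > maxlen && (maxlen = len)
    end
    for (cap, T) in zip((1, 3, 7, 15, 31, 63, 127, 255),
                        (String1, String3, String7, String15,
                         String31, String63, String127, String255))
        maxlen <= cap && return T
    end
    return nothing  # longest string exceeds 255 bytes; don't inline
end
```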

@svilupp (Contributor, Author) commented Mar 12, 2023

One side note on compression: I was quite surprised that despite LZ4 being known as the "fastest" option, ZSTD (which compressed the data to smaller file sizes) was mostly on par with it and in some cases even faster! It depends on your data and its size (repetition etc.), but I was positively surprised by ZSTD -- I'll be using it much more from now on.

@@ -27,6 +27,7 @@ CodecLz4 = "5ba52731-8f18-5e0d-9241-30f10d1ec561"
CodecZstd = "6b39b394-51ab-5f42-8807-6242bab2b4c2"
DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
InlineStrings = "842dd82b-1e85-43dc-bf29-5d0ee9dffc48"
Member commented:
Missing compat entry for InlineStrings?

svilupp (Contributor, Author) replied:
Added. I set it to 1.4, as I don't understand its evolution up to this point, but I have a suspicion that "1" would work as well.

To be clear, I'm not sure if this PR could ever be merged.
I've hijacked (copy&pasted) the transcode method from TranscodingStreams to create the mutating version, which is probably not best practice :-/

JoaoAparicio added a commit to JoaoAparicio/arrow-julia that referenced this pull request Apr 11, 2023
If we let transcode do its own allocation, it will allocate a small
vector, start filling it, resize the vector, fill it some more, resize
the vector again, and so on.

Instead, in this commit we pre-allocate a vector of the correct size and
pass it to transcode().

Inspired by apache#399
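
A minimal sketch of that idea (illustrative, not the commit's actual code; it assumes the three-argument transcode that accepts a pre-allocated output buffer, available in recent TranscodingStreams versions):

```julia
using CodecZstd: ZstdDecompressor
using TranscodingStreams: transcode

# Illustrative: when the uncompressed length is known up front (the Arrow IPC
# format prefixes each compressed buffer with it), allocate the output once
# instead of letting transcode grow a small vector repeatedly.
# `decompressor` is assumed to be already initialized (TranscodingStreams.initialize).
function decompress_into(decompressor::ZstdDecompressor,
                         compressed::Vector{UInt8}, uncompressedlen::Integer)
    output = Vector{UInt8}(undef, uncompressedlen)  # single allocation at the final size
    transcode(decompressor, compressed, output)     # assumed 3-arg form fills `output`
    return output
end
```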
@JoaoAparicio mentioned this pull request Apr 11, 2023
baumgold pushed a commit that referenced this pull request Apr 11, 2023
Moelf pushed a commit to Moelf/arrow-julia that referenced this pull request Apr 13, 2023