
Experiment: Potential speed up strategies #399

Draft: svilupp wants to merge 9 commits into main

Conversation

@svilupp (Contributor) commented Mar 12, 2023

This PR showcases several strategies for improving the performance of Arrow.jl and makes them easier to test. The intention is to break it into separate, modular PRs for the actual contributions (if there is interest).

TL;DR Arrow.jl beats everyone except in one case (loading large uncompressed strings in Task 1, where Polars+PyArrow is so fast - 5ms - that it clearly uses some lazy strategy and skips materializing the strings).

Changes:

  • Project deps: added a direct dependency on InlineStrings
  • Arrow.Table: added kwarg useinlinestrings to load strings as InlineStrings whenever possible (defaults to true); see the usage sketch after this list
  • Arrow.write: added chunking/partitioning as a default, same as PyArrow, via kwarg chunksize (defaults to 64000 like PyArrow; could be changed to consider available threads)
  • Compression: added a thread-safe implementation with locks for decompression, plus locks for compression (see Add locks for de-/compressor + initialize decompressor #397)
  • TranscodingStreams (type piracy involved): added an in-place mutating transcode to avoid unnecessary output buffer resizing
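
For illustration, here is how the two new keywords described above would be used (the keyword names come from this PR and are not part of a released Arrow.jl; the file name and data are made up):

```julia
using Arrow, DataFrames

df = DataFrame(id = 1:100_000, text = string.(rand(UInt32, 100_000)))

# Write with the chunking default proposed here: the table is split into
# record batches of at most `chunksize` rows (64_000 by default, like PyArrow).
Arrow.write("data.arrow", df; chunksize = 64_000, compress = :zstd)

# Read back; `useinlinestrings = true` (the proposed default) converts eligible
# string columns to InlineStrings for cheaper materialization.
tbl = Arrow.Table("data.arrow"; useinlinestrings = true)
```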

Future:

  • separate parsing the structure from materializing the buffers within RecordBatches. Currently, they are done together (VectorIterator), but separating them would enable better multithreading (it seems to be the secret ingredient of Polars/PyArrow when data is not partitioned)
  • allow native serialization/deserialization of InlineStrings (it wouldn't be readable in other languages / would look like Ints; see the bit-level sketch after this list)
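
To illustrate the second point: InlineStrings are fixed-size primitive types, so a native serialization would essentially write their raw bits, and a reader without InlineStrings support would see integers rather than strings. A minimal sketch (illustrative, not part of this PR):

```julia
using InlineStrings

s = String7("arrow")             # String7 is an 8-byte primitive type
bits = reinterpret(UInt64, s)    # the raw bits another Arrow reader would see as an integer
s2 = reinterpret(String7, bits)  # round-tripping requires knowing the original type
@assert s2 == s
```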

Timings (copied from the original thread for comparison):
Task 1: 10x count nonmissing elements in the first column of a table
Data: 2 columns of 5K-long strings each, 10% of data missing, 10K rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)

  • Pandas: 1.2s, 1.5s, 1.6s
  • Polars: 5ms, 1.5s, 2.05s
  • Polars+PyArrow: 4.8ms, 0.26s, 0.42s
  • Arrow+32Threads: 0.17s, 2.3s, 1.6s
  • Arrow+1Thread: 0.2s, 2.25s, 1.9s

Data: 32 partitions (!), 2 columns of 5K-long strings each, 10% of data missing, 10K rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)

  • Pandas: 1.2s, 1.0s, 1.2s
  • Polars: 9ms, 2.1s, 2.8s
  • Polars+PyArrow: 1.1s, 1.3s, 1.5s
  • Arrow+32Threads: 0.22s, 0.44s, 0.4s
    (Arrow.jl timing also benefits from a quick fix to TranscodingStreams)

NEW: partitioned using the new defaults+keywords write_out_compressions(df, fn; chunksize = cld(nrow(df), Threads.nthreads()));

  • Arrow+32Threads: 0.21s, 0.26s, 0.25s
  • Arrow+1Thread: 0.15s, 0.45s, 1.18s

Task 2: 10x mean value of Int column in the first column of a table
Data: 10 columns, Int64, 10M rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)

  • Pandas: 5.4s, 5.9s, 5.84s
  • Polars: 0.23s, 8s, 8.1s
  • Polars+PyArrow: 0.2s, 0.7s, 0.6s
  • Arrow+32Threads: 0.1s, 17.2s, 6.1s
  • Arrow+1Thread: 0.1s, 16.3s, 6.3s

Data: 32 partitions (!), 10 columns, Int64, 10M rows
Timings: (ordered by Uncompressed, LZ4, ZSTD)

  • Pandas: 5.6s, 2.8s, 2.6s
  • Polars: 0.23s, 12.8s, 12.6s
  • Polars+PyArrow: 6.5s, 6.5s, 6.4s
  • Arrow+32Threads: 0.1s, 1.2s, 0.7s
    (Arrow.jl timing also benefits from a quick fix to TranscodingStreams)

NEW: partitioned using the new defaults+keywords write_out_compressions(df, fn; chunksize = min(64000, cld(nrow(df), Threads.nthreads())));

  • Arrow+32Threads: 0.23s, 2.0s, 1.8s (clearly bigger chunks are better here -- see the case above)
  • Arrow+1Thread: 0.24s, 4.3s, 4.05s

Added a new task to test out the automatic string inlining
Task 3: 10x count nonmissing elements in the first column of a table
Data: 2 columns of strings 10 codeunits long each, 10% of data missing, 1M rows
partitioned using the new defaults+keywords write_out_compressions(df, fn; chunksize = min(64000, cld(nrow(df), Threads.nthreads())));

Timings: (ordered by Uncompressed, LZ4, ZSTD)

  • Arrow+32Threads (useinlinestrings=false): 0.38s, 0.44s, 0.44s
  • Arrow+1Thread (useinlinestrings=false): 0.45s, 0.48s, 0.74s
  • Arrow+32Threads (useinlinestrings=true): 0.1s, 0.15s, 0.2s
  • Pandas: 4.8s, 5s, 5s
  • Polars: 0.47s, 0.64s, 0.87s
  • Polars+PyArrow: 0.32s, 0.36s, 0.4s
    (for comparison, I also ran a string-length check on Polars+PyArrow: 0.32s, 0.36s, 0.41s)

And the best part? On my MacBook, I can get timings for Task 3 around 40ms (uncompressed) and 90ms (compressed), which is better than what I could have hoped for :)

Setup:

  • Machine: m5dn.8xl, NVME local drive, 32 vCPU, 96GB RAM
  • Julia 1.8.5 / Arrow 2.4.3 + this PR
  • Python 3.10

Benchmarking code is provided in #393
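
For context, the partitioned runs above call a helper named write_out_compressions, whose actual code lives in #393. The following is only a rough guess at its shape, based on how it is invoked here (one file per codec, using the chunksize keyword proposed in this PR):

```julia
using Arrow

# Hypothetical reconstruction of the #393 helper, not the real thing:
# write the same table once per compression setting so each variant can be
# benchmarked separately.
function write_out_compressions(df, fn; chunksize = 64_000)
    Arrow.write(fn * "_uncompressed.arrow", df; chunksize)
    Arrow.write(fn * "_lz4.arrow", df; chunksize, compress = :lz4)
    Arrow.write(fn * "_zstd.arrow", df; chunksize, compress = :zstd)
end

# The measured operation in Tasks 1 and 3 is then roughly:
#   count(!ismissing, Tables.getcolumn(Arrow.Table(fn * "_zstd.arrow"), 1))
```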

Commits:

  • Arrow project: added deps for InlineStrings
  • Arrow.Table: added kwarg useinlinestrings to load strings as InlineStrings whenever possible
  • Arrow.write: added chunking as a default (à la PyArrow), kwarg chunksize
  • Compression: added thread-safe implementation with locks for decompression, added locks for compression
  • TranscodingStreams: added in-place mutating transcode to avoid unnecessary resizing
@svilupp (Contributor, Author) commented Mar 12, 2023

Two additional thoughts on InlineStrings:

  • We could add support for native serialization/deserialization of InlineStrings (but it wouldn't be readable in other languages / would look like Integers)
  • At the moment I find the smallest suitable type by searching the _whole_ vector of offsets, which is expensive (c. 1ms for 10M elements). We could try sampling only a subset, but we would then have to introduce safety checks in case a bigger string suddenly shows up. I decided to pay the cost upfront and make sure no strings are larger (see the sketch after this list).
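
A minimal sketch of that upfront scan (illustrative, not the PR's actual code), assuming Arrow-style offsets where element i occupies offsets[i+1] - offsets[i] bytes:

```julia
using InlineStrings

# Scan the whole offsets buffer once, find the longest string, and pick the
# smallest InlineString type that can hold it (or nothing to keep regular Strings).
function smallest_inline_type(offsets::AbstractVector{<:Integer})
    maxlen = 0
    @inbounds for i in 1:length(offsets) - 1
        len = offsets[i + 1] - offsets[i]
        len > maxlen && (maxlen = len)
    end
    for (cap, T) in zip((1, 3, 7, 15, 31, 63, 127, 255),
                        (String1, String3, String7, String15,
                         String31, String63, String127, String255))
        maxlen <= cap && return T
    end
    return nothing  # longest string exceeds 255 bytes; don't inline
end
```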

@svilupp (Contributor, Author) commented Mar 12, 2023

One side note on compression: I was quite surprised that despite LZ4 being known as the "fastest" option, ZSTD (which compressed the data to smaller file sizes) was mostly on par with it and in some cases even faster! It depends on your data and its size (repetition etc.), but I was positively surprised by ZSTD -- I'll be using it much more from now on.

@@ -27,6 +27,7 @@ CodecLz4 = "5ba52731-8f18-5e0d-9241-30f10d1ec561"
CodecZstd = "6b39b394-51ab-5f42-8807-6242bab2b4c2"
DataAPI = "9a962f9c-6df0-11e9-0e5d-c546b8b5ee8a"
Dates = "ade2ca70-3891-5945-98fb-dc099432e06a"
InlineStrings = "842dd82b-1e85-43dc-bf29-5d0ee9dffc48"
Member commented:
Missing compat entry for InlineStrings?

svilupp (Contributor, Author) replied:
Added. I set it to 1.4, as I don't understand its evolution up to this point, but I have a suspicion that "1" would work as well.

To be clear, I'm not sure if this PR could ever be merged.
I've hijacked (copy&pasted) the transcode method from TranscodingStreams to create the mutating version, which is probably not best practice :-/

JoaoAparicio added a commit to JoaoAparicio/arrow-julia that referenced this pull request Apr 11, 2023
If we let transcode do its own allocation, it will allocate a small
vector, start filling it, resize the vector, fill it some more, resize
the vector again, and so on.

Instead, in this commit we pre-allocate a vector of the correct size and
pass it to transcode().

Inspired by apache#399
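
A minimal sketch of that idea (illustrative, not the commit's actual code; it assumes the three-argument transcode that accepts a pre-allocated output buffer, available in recent TranscodingStreams versions):

```julia
using CodecZstd: ZstdDecompressor
using TranscodingStreams: transcode

# Illustrative: when the uncompressed length is known up front (the Arrow IPC
# format prefixes each compressed buffer with it), allocate the output once
# instead of letting transcode grow a small vector repeatedly.
# `decompressor` is assumed to be already initialized (TranscodingStreams.initialize).
function decompress_into(decompressor::ZstdDecompressor,
                         compressed::Vector{UInt8}, uncompressedlen::Integer)
    output = Vector{UInt8}(undef, uncompressedlen)  # single allocation at the final size
    transcode(decompressor, compressed, output)     # assumed 3-arg form fills `output`
    return output
end
```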
@JoaoAparicio mentioned this pull request Apr 11, 2023
baumgold pushed a commit that referenced this pull request Apr 11, 2023
Moelf pushed a commit to Moelf/arrow-julia that referenced this pull request Apr 13, 2023