Reading FixedSizeList from parquet is slower than reading values into more rows #34510
Comments
Can you provide some sample code to reproduce the problem here? And are the FixedSizeList and the double within it nullable? |
Everything was nullable; I'll check with non-null values and provide a minimal example as well tomorrow. |
The same happens with not null values (I'm not sure how to define the not null list correctly, but looks like it doesn't matter):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

arr_random = np.random.default_rng().standard_normal(size=[8000000], dtype='float64')
arr1 = pa.array(arr_random)
arr2 = pa.FixedSizeListArray.from_arrays(arr_random, 80)

t1 = pa.Table.from_arrays([arr1], schema=pa.schema([('A', pa.float64(), False)]))
t2 = pa.Table.from_arrays([arr2], schema=pa.schema([('A', pa.list_(pa.field('A', pa.float64(), False), 80), False)]))
t3 = pa.Table.from_arrays([arr2], schema=pa.schema([pa.field('A', pa.list_(pa.float64(), 80), False)]))

pq.write_table(t1, 't1.parquet')
pq.write_table(t2, 't2.parquet')
pq.write_table(t3, 't3.parquet')

t1 = pq.read_table('t1.parquet')  # 30ms
t2 = pq.read_table('t2.parquet')  # 100ms
t3 = pq.read_table('t3.parquet')  # 100ms

print(t1.get_total_buffer_size(), t2.get_total_buffer_size(), t3.get_total_buffer_size())
# (64000000, 64000000, 64000000)

print(t1.schema, t2.schema, t3.schema)
# (A: double not null,
#  A: fixed_size_list<A: double not null>[80] not null
#    child 0, A: double not null,
#  A: fixed_size_list<item: double>[80] not null
#    child 0, item: double)
```
|
Thanks, I'll test this tonight. Currently I guess constructing the FixedSizeList may use some extra space and consume some time. |
In Parquet C++, we have "parquet-arrow-reader-writer-benchmark", and I found that during Parquet schema conversion, FixedSizeList is handled the same as List:
So it will probably allocate rep-levels for it, and produce rep-levels when decoding; decoding and encoding rep-levels may consume some time. A non-null double has no List, so it isn't necessary to write a rep-level or def-level for it. I guess there are some other reasons, but I'm not familiar with the Python-to-C++ code path. |
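To make the rep/def-level point concrete, here is a small pyarrow sketch (my own illustration against the files written in the reproduction above, not something from the original comment) that inspects the Parquet column metadata; the flat double column should report zero maximum levels, while the shredded FixedSizeList column should not:

```python
import pyarrow.parquet as pq

# Compare the column-level metadata of the flat-double file and the FixedSizeList
# file written above. The flat column needs no levels at all; the list column is
# shredded like a regular List, so every leaf value carries a repetition and a
# definition level that must be encoded on write and decoded on read.
for name in ('t1.parquet', 't2.parquet'):
    col = pq.ParquetFile(name).schema.column(0)
    print(name, col.path,
          'max_def =', col.max_definition_level,
          'max_rep =', col.max_repetition_level)
```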
2x allocation / computation time for rep/def levels compared to the array values sounds excessive as well (and they could be optimized to use a single static value, right?). I agree that what you've found is the main reason, and I also share your suspicion that it's still too slow; there might be other reasons, or the rep-levels might cascade somehow into even worse perf :) |
Yes, in the future, developers may optimize it.
I've profiled the C++ part, on my macOS with a release build (O2):
The benchmark uses List rather than FixedSizeList, but I think it's similar. I'm not so familiar with the Python part; maybe someone can profile that path. |
Looks like I have to learn a lot about repetition and definition levels, but it also looks like they can be RLE encoded, which means practically zero overhead if not many nulls are used - in the best case it can be equal or similar to the non-nullable read. I'm not a C++ coder, but summarizing the above discussion, there are 2-3 fast paths missing at different hierarchy levels:
|
We don't support FixedSizeList in arrow-rs AFAIK. Parquet to my knowledge does not have an equivalent logical construct, and so it isn't particularly clear to me what support would mean other than implicitly casting between a regular list and a fixed size list.
Assuming the doubles are PLAIN encoded, this is not surprising to me: you are comparing against what is effectively a memcpy. In the Rust implementation we have a couple of tricks that help here, but it is still relatively expensive (at least compared to primitive decoding):
The definition levels will actually all be 2, unless the doubles are themselves not nullable
These repetition levels will be RLE encoded. Theoretically a reader could preserve this, but the record shredding logic is extremely fiddly, so it might run the risk of adding complexity to an already very complex piece of code. At least in arrow-rs we always decode repetition levels to an array of i16. |
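As a back-of-the-envelope illustration of that cost (my own sketch, not from the thread): with 80 non-null doubles per row, the shredded column carries one repetition and one definition level per leaf value, and a reader that materializes them as 16-bit integers touches roughly 50% more bytes than the values themselves:

```python
# Rough per-row cost model for FixedSizeList<double>[80] after Dremel-style shredding.
# The level values follow the usual rules for a list with a nullable leaf; treat this
# as an illustration rather than the output of any particular reader.
list_size = 80
values_bytes = list_size * 8                 # 80 float64 values per row
rep_levels = [0] + [1] * (list_size - 1)     # 0 starts a new record, 1 continues it
def_levels = [2] * list_size                 # present, non-null leaf inside the list
levels_bytes = 2 * list_size * 2             # rep + def, 2 bytes each if kept as int16
print(levels_bytes / values_bytes)           # 0.5 -> ~50% extra memory traffic
# On disk these runs are RLE-encoded and nearly free; the overhead appears when a
# reader expands them back into per-value arrays during decoding.
```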
Thanks, this is an amazing explanation. The other day I saw a Tensor canonical extension type discussion on the Arrow mailing list, which builds on top of FixedSizeList. It looks like that means only partial Parquet support; in most cases it'll be cheaper to simply denormalize the data. A slightly faster alternative is using fixed-size binary data (it's still slower than plain doubles, but much better than the FixedSizeList). |
If denormalizing is an option, it will definitely be faster, at least until Parquet adds native support for fixed size repeated elements (relatively unlikely - lots of readers don't even support v2, which is a decade old now). However, I believe the use-case for tensor types is to serialize other columns alongside, at which point denormalizing may not be possible.

That being said, whilst FixedSizeList may be slow compared to native primitives, compared to decoding byte arrays, or even some of the other primitive encodings such as the deeply flawed DELTA_BINARY_PACKED, it should still be pretty fast. Certainly compared to other commonly used serialization formats for tensors such as protobuf or JSON, it will be significantly faster.

To be honest, Parquet's tagline could be "It's good enough". You can almost certainly do 2-3x better than Parquet for any given workload, but you really need orders of magnitude improvements to overcome ecosystem inertia. I suspect most workloads will also mix in byte arrays and/or object storage or block compression, at which point those will easily be the tall pole in decode performance. |
Agreed; also, I'm closing this issue as it's not really actionable in this format. |
Maybe you can use FixedSizeBinary and have a transmute rule for it.
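A minimal pyarrow sketch of that fixed-size-binary "transmute" idea (my own illustration, with made-up file names and an assumed 80-double row width; the commenter's original example is not in the thread):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

rows, width = 10_000, 80
data = np.random.default_rng().standard_normal((rows, width))

# "Transmute" each row of 80 float64 values into one 640-byte FixedSizeBinary value.
fsb = pa.array([row.tobytes() for row in data], type=pa.binary(width * 8))
pq.write_table(pa.table({'A': fsb}), 'fsb.parquet')

# Read the column back and reinterpret the raw bytes as doubles again.
arr = pq.read_table('fsb.parquet')['A'].chunk(0)   # assumes a single chunk
buf = arr.buffers()[1]                             # data buffer (no nulls in this sketch)
restored = np.frombuffer(buf, dtype=np.float64).reshape(rows, width)
assert np.array_equal(restored, data)
```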
Hi @tustvold. To be honest, I wonder why DELTA_BINARY_PACKED is considered deeply flawed.
The paper they link to actually explains why the approach is problematic - http://arxiv.org/pdf/1209.2137v5.pdf. The whole paper is on why not to implement delta compression in this way 😂 |
Learned a lot, thanks! |
@AlenkaF the above is still relevant for the new Tensor canonical extension type. What do you think overall?
Though it is unfortunate that the values get blown up and writing to Parquet becomes slower, I do still think it is "good enough", as already mentioned, due to the complexity involved. What if applications used custom metadata to hold the schema and tensor type while writing only the storage values (floats, for example) in the Parquet files? It would need some custom logic to construct the tensor again when reading, but it might be a good alternative (buffers should still be the same after read, not copied). cc @rok
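A rough pyarrow sketch of that alternative (my own illustration; the `tensor_width` metadata key and file name are made up): write only the flat storage column plus custom schema metadata, then wrap the values back into a FixedSizeList after reading, which reuses the values buffer rather than copying it:

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

width = 80
flat = pa.array(np.random.default_rng().standard_normal(800_000))

# Write only the cheap flat column; record the intended row width in custom metadata.
schema = pa.schema([pa.field('A', pa.float64(), nullable=False)],
                   metadata={'tensor_width': str(width)})  # hypothetical key
pq.write_table(pa.Table.from_arrays([flat], schema=schema), 'flat.parquet')

# Read the flat column (plain doubles), then rebuild the nested view without copying.
table = pq.read_table('flat.parquet')
w = int(table.schema.metadata[b'tensor_width'])
values = table['A'].chunk(0)                     # assumes a single chunk
nested = pa.FixedSizeListArray.from_arrays(values, w)
print(nested.type)                               # fixed_size_list<item: double>[80]
```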
Given the current activity in the Parquet community, I think it might be worth proposing adding a fixed-size list type to the Parquet format. Also, I wonder if optimized take (#39798) would improve the performance somewhat once all the PRs land.
This would be my recommendation, as this would allow for encoding non-nullable tensors without the need for any definition or repetition levels at all. Given the growing prevalence of workloads using such types, I think this would be broadly valuable. |
Yes. But yes, proposing adding a fixed-size list type to the Parquet format seems like the way to go.
I've started a discussion on dev@parquet and will open a PR against parquet-format soon. |
Describe the bug, including details regarding any error messages, version, and platform.
I'm not 100% sure if it's a bug, but I don't understand the differences between the two cases:
Nested arrays:
Exploded:
Reading the first table from Parquet (version 2.6, zstd compression, single file) is surprisingly slower than reading the second table. I'd assume it's the same task; a few columns are even shorter. The file sizes are almost equal.
I used pyarrow 11 from conda and local SSD.
Component(s)
Parquet
Python