
Reading FixedSizeList from parquet is slower than reading values into more rows #34510

Closed

alippai opened this issue Mar 9, 2023 · 22 comments

@alippai
Contributor

alippai commented Mar 9, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I'm not 100% sure this is a bug, but I don't understand the performance difference between the two cases:
Nested arrays:

a table with 30 columns, a 2-level "index" (so 2 columns) and 29 FixedSizeList<double, 80> columns, 100k rows

Exploded:

a table with 31 columns, a 3-level "index" (so 3 columns) and 29 double columns, 8000k rows

Reading the first table from Parquet (version 2.6, zstd compression, single file) is surprisingly slower than reading the second. I'd assume it's the same task; a few columns are even shorter. The file sizes are almost equal.

I used pyarrow 11 from conda and a local SSD.

Component(s)

Parquet
Python

@mapleFU
Member

mapleFU commented Mar 9, 2023

Can you provide some sample code to reproduce the problem here? And are the FixedSizeList and the double within it nullable?

@alippai
Contributor Author

alippai commented Mar 9, 2023

Everything was nullable. I'll check with non-null values and provide a minimal example tomorrow.

@alippai
Contributor Author

alippai commented Mar 9, 2023

The same happens with non-null values (I'm not sure how to define the non-null list correctly, but it looks like it doesn't matter):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

arr_random = np.random.default_rng().standard_normal(size=[8000000], dtype='float64')
arr1 = pa.array(arr_random)
arr2 = pa.FixedSizeListArray.from_arrays(arr_random, 80)
t1 = pa.Table.from_arrays([arr1], schema=pa.schema([('A', pa.float64(), False)]))
t2 = pa.Table.from_arrays([arr2], schema=pa.schema([('A', pa.list_(pa.field('A', pa.float64(), False), 80), False)]))
t3 = pa.Table.from_arrays([arr2], schema=pa.schema([pa.field('A', pa.list_(pa.float64(), 80), False)]))

pq.write_table(t1, 't1.parquet')
pq.write_table(t2, 't2.parquet')
pq.write_table(t3, 't3.parquet')

%%timeit

t1 = pq.read_table('t1.parquet') # 30ms

%%timeit

t2 = pq.read_table('t2.parquet') # 100ms

%%timeit

t3 = pq.read_table('t3.parquet') # 100ms
print(t1.get_total_buffer_size(), t2.get_total_buffer_size(), t3.get_total_buffer_size()) # (64000000, 64000000, 64000000)
print(t1.schema, t2.schema, t3.schema)
# (A: double not null,
# A: fixed_size_list<A: double not null>[80] not null
#   child 0, A: double not null,
# A: fixed_size_list<item: double>[80] not null
#   child 0, item: double)

@mapleFU
Member

mapleFU commented Mar 10, 2023

Thanks, I'll test this tonight. My current guess is that constructing the FixedSizeList may use some extra space and consume some time.

@mapleFU
Member

mapleFU commented Mar 10, 2023

In Parquet C++, we have "parquet-arrow-reader-writer-benchmark", and I found that BM_ReadListColumn is slower than BM_ReadColumn<..., Int64Type>. That's mainly because of the handling of List.

In the Parquet schema conversion, FixedSizeList is handled the same as List:

    case ArrowTypeId::FIXED_SIZE_LIST:
    case ArrowTypeId::LARGE_LIST:
    case ArrowTypeId::LIST: {
      auto list_type = std::static_pointer_cast<::arrow::BaseListType>(field->type());
      return ListToNode(list_type, name, field->nullable(), field_id, properties,
                        arrow_properties, out);
    }

So it will allocate rep-levels for it, and when decoding it will produce rep-levels. Decoding/encoding rep-levels may consume some time.

For a non-null double column there is no List, so it's not necessary to write rep-levels or def-levels at all.
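
This can be seen from Python too (a minimal sketch, reusing the t1.parquet / t2.parquet files from the repro above; the physical Parquet schema has no fixed-size notion):

import pyarrow.parquet as pq

# The FixedSizeList column is written as an ordinary repeated group,
# exactly like a variable-size List; the flat double column stays primitive.
print(pq.ParquetFile('t2.parquet').schema)
print(pq.ParquetFile('t1.parquet').schema)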

I guess there are some other reasons too, but I'm not familiar with the Python-to-C++ code path.

@alippai
Contributor Author

alippai commented Mar 10, 2023

2x allocation/computation time for rep/def levels compared to the array values sounds excessive as well (and it could be optimized to use a single static value, right?)

I agree that what you've found is the main reason, and I share your suspicion that it's still too slow; there might be other reasons, or the rep-levels cascade somehow into even worse performance :)

@mapleFU
Member

mapleFU commented Mar 10, 2023

and it could be optimized to use a single static value, right?

Yes, in the future a developer may optimize it. If the FixedSizeList is non-nullable, Parquet can use a single static value, but if it is nullable, it cannot.

other reasons or replevels cascade somehow into even worse perf

I've profiled the C++ part on my macOS machine with a release build (O2):

[profiler screenshot]

  1. Decoding doubles is fast
  2. Decoding levels takes nearly the same time as decoding doubles
  3. Constructing the List costs a little time

The benchmark uses List rather than FixedSizeList, but I think the behavior is similar.

I'm not so familiar with the Python part; maybe someone can profile that path.

@alippai
Contributor Author

alippai commented Mar 10, 2023

Looks like I have to learn a lot about repetition and definition levels. It also looks like they can be RLE encoded, which means practically zero overhead if there aren't many nulls - in the best case it can be equal or similar to the non-nullable path.

I'm not a C++ coder, but summarizing the above discussion, there are 2-3 fast paths missing at different hierarchy levels:

  1. Calculating the def and rep levels for 100k rows with all non-null values takes 2x as long as reading 8M doubles (this is suspicious, but might be correct)
  2. Definition level data is all 1s and supposed to be RLE encoded; we might want to skip expanding it (maybe decode it as non-nullable, since checking every value is not needed?)
  3. Repetition level data is one 0 followed by 79x 1, repeated 100k times in our case (see the sketch after this list). I'm not sure RLE will help here; it sounds like an unnecessarily complex structure for fixed size lists. Would ZSTD / ZLIB work better here, generally speaking (it's not part of the parquet spec yet)? On the other hand, reading or decoding it might be skipped entirely, as it can be derived from the metadata.
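
A minimal sketch of what those repetition levels look like for this case (illustration only, not pyarrow internals):

import numpy as np

# One row of FixedSizeList<double, 80>: level 0 opens the list, then 79 ones.
rep_levels = np.tile(np.array([0] + [1] * 79, dtype=np.int16), 100_000)
print(rep_levels[:82])  # 0, then 79x 1, then the next row's 0, 1, ...
# The pattern is fully determined by the schema (list size 80), so for a
# non-nullable fixed size list a reader could in principle derive it instead
# of decoding it.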

@alippai
Contributor Author

alippai commented Mar 10, 2023

@alamb @tustvold I saw your blog post about this for arrow-rs. Do you handle this differently in Rust?

@tustvold
Contributor

tustvold commented Mar 10, 2023

@alamb @tustvold I saw your blog post about this for arrow-rs. Do you handle this differently in Rust?

We don't support FixedSizeList in arrow-rs AFAIK. Parquet to my knowledge does not have an equivalent logical construct, and so it isn't particularly clear to me what support would mean other than implicitly casting between a regular list and a fixed size list.

Calculating the def and rep levels for 100k rows with all non-null values takes 2x as long as reading 8M doubles

Assuming the doubles are PLAIN encoded, this is not surprising to me: you are comparing the performance of what is effectively a memcpy that runs at memory bandwidth against a fairly complex bit-packing scheme used for the definition and repetition levels.

In the Rust implementation we have a couple of tricks that help here, but it is still relatively expensive (at least compared to primitive decoding):

  • We decode definition levels directly to the null buffer if there are only nulls at the leaf level (i.e. no lists or nested nulls), allowing us to preserve the bit-packing
  • We have vectorised unpack implementations specialised for each bit width (I believe arrow C++ does this also)

Definition level data is all 1s and supposed to be RLE encoded

It will actually all be 2, unless the doubles are themselves not nullable

Repetition level data is one 0 followed by 79x 1, repeated 100k times in our case. I'm not sure RLE will help here; it sounds like an unnecessarily complex structure for fixed size lists

These repetition levels will be RLE encoded. Theoretically a reader could preserve this, but the record shredding logic is extremely fiddly and so might run the risk of adding complexity to an already very complex piece of code. At least in arrow-rs we always decode repetition levels to an array of i16
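
For the files from the repro above, the level depth can be checked directly from Python (a minimal sketch; the exact numbers depend on the declared nullability):

import pyarrow.parquet as pq

# Max definition/repetition levels of the first leaf column.
# A flat required double is 0/0; the fixed size list leaf is not.
for name in ('t1.parquet', 't2.parquet'):
    col = pq.ParquetFile(name).schema.column(0)
    print(name, col.max_definition_level, col.max_repetition_level)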

@alippai
Contributor Author

alippai commented Mar 10, 2023

Thanks, this is an amazing explanation.

The other day I saw a Tensor canonical extension type discussion on the Arrow mailing list; that type builds on top of FixedSizeList. It looks like this means only partial Parquet support; in most cases it'll be cheaper to simply denormalize the data.

A slightly faster alternative is using fixed size binary data (it's still slower than doubles, but much better than the FixedSizeList<double>).
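
For reference, a minimal sketch of that fixed size binary variant (t4.parquet is just an illustrative name; pa.Array.from_buffers wraps the raw bytes without copying):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

arr_random = np.random.default_rng().standard_normal(size=[8000000], dtype='float64')
# Pack each run of 80 little-endian doubles into one 640-byte value.
buf = pa.py_buffer(arr_random.tobytes())
fsb = pa.Array.from_buffers(pa.binary(640), 100_000, [None, buf])
pq.write_table(pa.table({'A': fsb}), 't4.parquet')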

@tustvold
Contributor

It looks like this means only partial Parquet support; in most cases it'll be cheaper to simply denormalize the data.

If denormalizing is an option, it will definitely be faster, at least until parquet adds native support for fixed size repeated elements (relatively unlikely - lots of readers don't even support v2 which is a decade old now). However, I believe the use-case for tensor types is to serialize other columns alongside, at which point denormalizing may not be possible.

That being said, whilst FixedSizeList may be slow compared to native primitives, it should still be pretty fast compared to decoding byte arrays, or even some of the other primitive encodings such as the deeply flawed DELTA_BINARY_PACKED. Certainly it will be significantly faster than other commonly used serialization formats for tensors such as protobuf or JSON.

To be honest parquet's tag line could be "It's good enough". You can almost certainly do 2-3x better than parquet for any given workload, but you really need orders of magnitude improvements to overcome ecosystem inertia. I suspect most workloads will also mix in byte arrays and/or object storage or block compression, at which point those will easily be the tall pole in decode performance.

@alippai
Contributor Author

alippai commented Mar 10, 2023

Agreed. I'm also closing this issue, as in this form it's not really actionable.
Arrow-based fixed size lists of primitive values (e.g. tensors) shouldn't be converted to nested Parquet data; they are better stored as BYTE_ARRAY in Parquet (while I think it'd be important, sadly there is no fixed-size BYTE_ARRAY in the Parquet spec, so it'll still be slightly slower than it could be). Some fast paths for never-null data - data that was not marked as non-nullable when it was saved - could be useful too, but that's all.

@alippai alippai closed this as completed Mar 10, 2023
@mapleFU
Member

mapleFU commented Mar 11, 2023

Maybe you can use fixed size binary and have a transmute rule for it, like FixedLengthBinary(sizeof(double) * 80), and use zero-copying here. But it's still tricky...
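
A sketch of that transmute from Python (assuming the t4.parquet file from the fixed size binary example above; np.frombuffer reinterprets the binary buffer as doubles without copying):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

fsb = pq.read_table('t4.parquet')['A'].combine_chunks().chunk(0)
# View the contiguous 640-byte values as doubles (zero-copy),
# then wrap them back into a FixedSizeList<double, 80>.
values = np.frombuffer(fsb.buffers()[1], dtype='<f8')
fsl = pa.FixedSizeListArray.from_arrays(pa.array(values), 80)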

or even some of the other primitive encodings such as the deeply flawed DELTA_BINARY_PACKED

Hi @tustvold. To be honest, I wonder why DELTA_BINARY_PACKED is deeply flawed - is there any reference or commentary on this? I think even though it doesn't have packing or FOR, it's still OK for delta data.

@tustvold
Contributor

why DELTA_BINARY_PACKED is deeply flawed

The paper they link to actually explains why the approach is problematic - http://arxiv.org/pdf/1209.2137v5.pdf. The whole paper is on why not to implement delta compression in this way 😂

@mapleFU
Member

mapleFU commented Mar 11, 2023

Learned a lot, thanks!

@alippai
Contributor Author

alippai commented May 13, 2024

@AlenkaF the above is still relevant for the new FixedShapeTensorType and similar types as of version 16.
I'm not sure whether the metadata of the Arrow FixedSizeList should be used for an optimized Parquet read, or whether the right solution is extending Parquet so it natively supports dense, fixed-length lists, not only sparse nested types.

What do you think overall?

@AlenkaF
Member

AlenkaF commented May 14, 2024

Though it is unfortunate that the values get blown up and writing to Parquet becomes slower, I still think it is "good enough", as already mentioned, given the complexity involved.

What if applications used custom metadata to hold the schema and tensor type, while writing only the storage values (floats, for example) to Parquet files? It would need some custom logic to reconstruct the tensor when reading, but it might be a good alternative (the buffers should still be the same after the read, not copied).
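
A minimal sketch of that idea (the metadata key fixed_size_list_size is hypothetical, just to carry the list size alongside the flat values):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

values = pa.array(np.random.default_rng().standard_normal(8_000_000))
# Write only the flat storage values; keep the tensor shape in custom metadata.
flat = pa.table({'A': values}).replace_schema_metadata({'fixed_size_list_size': '80'})
pq.write_table(flat, 'flat.parquet')

# On read, rebuild the FixedSizeList from the metadata; the value buffer is
# reused, not copied.
t = pq.read_table('flat.parquet')
size = int(t.schema.metadata[b'fixed_size_list_size'])
restored = pa.FixedSizeListArray.from_arrays(t['A'].combine_chunks().chunk(0), size)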

cc @rok

@rok
Member

rok commented May 14, 2024

What if applications used custom metadata to hold the schema and tensor type, while writing only the storage values (floats, for example) to Parquet files? It would need some custom logic to reconstruct the tensor when reading, but it might be a good alternative (the buffers should still be the same after the read, not copied).

I think FixedShapeTensor gets stored as a FixedSizeList plus some metadata, so the overhead comes from storing the FixedSizeList. I'm not sure, but maybe there's a clean way to cast FixedSizeList to FixedSizeBinary or similar when writing a FixedShapeTensor and do the inverse on reading. I don't think we have a clean option here though.

Given the current activity in the Parquet community, it might be worth proposing adding FixedSizeList to Parquet?

Also I wonder if optimized take (#39798) would improve the performance somewhat once all the PRs land.

@tustvold
Contributor

Given the current activity in the Parquet community, it might be worth proposing adding FixedSizeList to Parquet?

This would be my recommendation, as this would allow for encoding non-nullable tensors without the need for any definition or repetition levels at all. Given the growing prevalence of workloads using such types, I think this would be broadly valuable.

@AlenkaF
Member

AlenkaF commented May 15, 2024

I think FixedShapeTensor gets stored as a FixedSizeList plus some metadata, so the overhead comes from storing the FixedSizeList. I'm not sure, but maybe there's a clean way to cast FixedSizeList to FixedSizeBinary or similar when writing a FixedShapeTensor and do the inverse on reading. I don't think we have a clean option here though.

Yes, FixedShapeTensor gets stored as its storage, which is a FixedSizeList, plus metadata. I was thinking about (silly, I know) storing the values from the storage list, which would be a primitive layout.

But yes, proposing adding FixedSizeList to Parquet is a great idea.

@rok
Member

rok commented May 15, 2024

I've started a discussion on dev@parquet and will open a PR against parquet-format soon.
