
Reading FixedSizeList from parquet is slower than reading values into more rows #34510

Closed

alippai opened this issue Mar 9, 2023 · 22 comments

@alippai
Contributor

alippai commented Mar 9, 2023

Describe the bug, including details regarding any error messages, version, and platform.

I'm not 100% sure this is a bug, but I don't understand the performance difference between the two cases:
Nested arrays:

a table with 30 columns, a 2-level "index" (so 2 columns) and 29 FixedSizeList<double, 80> columns, 100k rows

Exploded:

a table with 31 columns, a 3-level "index" (so 3 columns) and 29 double columns, 8000k rows

Reading the first table from Parquet (version 2.6, zstd compression, single file) is surprisingly slower than reading the second. I'd assume it's the same task; a few columns are even shorter. The file sizes are almost equal.

I used pyarrow 11 from conda and a local SSD.

Component(s)

Parquet
Python

@mapleFU
Member

mapleFU commented Mar 9, 2023

Can you provide some sample code to reproduce the problem here? And are the FixedSizeList and the double within it nullable?

@alippai
Contributor Author

alippai commented Mar 9, 2023

Everything was nullable. I'll check with non-null values and provide a minimal example tomorrow.

@alippai
Contributor Author

alippai commented Mar 9, 2023

The same happens with non-null values (I'm not sure how to define the non-null list correctly, but it looks like it doesn't matter):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

arr_random = np.random.default_rng().standard_normal(size=[8000000], dtype='float64')
arr1 = pa.array(arr_random)
arr2 = pa.FixedSizeListArray.from_arrays(arr_random, 80)
t1 = pa.Table.from_arrays([arr1], schema=pa.schema([('A', pa.float64(), False)]))
t2 = pa.Table.from_arrays([arr2], schema=pa.schema([('A', pa.list_(pa.field('A', pa.float64(), False), 80), False)]))
t3 = pa.Table.from_arrays([arr2], schema=pa.schema([pa.field('A', pa.list_(pa.float64(), 80), False)]))

pq.write_table(t1, 't1.parquet')
pq.write_table(t2, 't2.parquet')
pq.write_table(t3, 't3.parquet')

%%timeit

t1 = pq.read_table('t1.parquet') # 30ms

%%timeit

t2 = pq.read_table('t2.parquet') # 100ms

%%timeit

t3 = pq.read_table('t3.parquet') # 100ms
print(t1.get_total_buffer_size(), t2.get_total_buffer_size(), t3.get_total_buffer_size()) # (64000000, 64000000, 64000000)
print(t1.schema, t2.schema, t3.schema)
# (A: double not null,
# A: fixed_size_list<A: double not null>[80] not null
#   child 0, A: double not null,
# A: fixed_size_list<item: double>[80] not null
#   child 0, item: double)

@mapleFU
Member

mapleFU commented Mar 10, 2023

Thanks, I'll test this tonight. My current guess is that constructing the FixedSizeList may use some extra space and consume some time.

@mapleFU
Member

mapleFU commented Mar 10, 2023

In Parquet C++, we have "parquet-arrow-reader-writer-benchmark", and I found that BM_ReadListColumn is slower than BM_ReadColumn<..., Int64Type>. That's mainly because of the handling of List.

In the Parquet schema conversion, FixedSizeList is handled the same as List:

    case ArrowTypeId::FIXED_SIZE_LIST:
    case ArrowTypeId::LARGE_LIST:
    case ArrowTypeId::LIST: {
      auto list_type = std::static_pointer_cast<::arrow::BaseListType>(field->type());
      return ListToNode(list_type, name, field->nullable(), field_id, properties,
                        arrow_properties, out);
    }

So it will allocate rep-levels for it, and when decoding it will produce rep-levels. Decoding/encoding rep-levels may consume some time.

For a non-null double column there is no List, so it's not necessary to write rep-levels or def-levels at all.
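
This can be seen from Python too (a minimal sketch, reusing the t1.parquet / t2.parquet files from the repro above; the physical Parquet schema has no fixed-size notion):

import pyarrow.parquet as pq

# The FixedSizeList column is written as an ordinary repeated group,
# exactly like a variable-size List; the flat double column stays primitive.
print(pq.ParquetFile('t2.parquet').schema)
print(pq.ParquetFile('t1.parquet').schema)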

I guess there are some other reasons too, but I'm not familiar with the Python-to-C++ code path.

@alippai
Contributor Author

alippai commented Mar 10, 2023

2x allocation/computation time for rep/def levels compared to the array values sounds excessive as well (and it could be optimized to use a single static value, right?)

I agree that what you've found is the main reason, and I share your suspicion that it's still too slow; there might be other reasons, or the rep-levels cascade somehow into even worse performance :)

@mapleFU
Member

mapleFU commented Mar 10, 2023

and it could be optimized to use a single static value, right?

Yes, in the future a developer may optimize it. If the FixedSizeList is non-nullable, Parquet can use a single static value, but if it is nullable, it cannot.

other reasons or replevels cascade somehow into even worse perf

I've profiled the C++ part on my macOS machine with a release build (O2):

[profiler screenshot]

  1. Decoding doubles is fast
  2. Decoding levels takes nearly the same time as decoding doubles
  3. Constructing the List costs a little time

The benchmark uses List rather than FixedSizeList, but I think the behavior is similar.

I'm not so familiar with the Python part; maybe someone can profile that path.

@alippai
Contributor Author

alippai commented Mar 10, 2023

Looks like I have to learn a lot about repetition and definition levels. It also looks like they can be RLE encoded, which means practically zero overhead if there aren't many nulls - in the best case it can be equal or similar to the non-nullable path.

I'm not a C++ coder, but summarizing the above discussion, there are 2-3 fast paths missing at different hierarchy levels:

  1. Calculating the def and rep levels for 100k rows with all non-null values takes 2x as long as reading 8M doubles (this is suspicious, but might be correct)
  2. Definition level data is all 1s and supposed to be RLE encoded; we might want to skip expanding it (maybe decode it as non-nullable, since checking every value is not needed?)
  3. Repetition level data is one 0 followed by 79x 1, repeated 100k times in our case (see the sketch after this list). I'm not sure RLE will help here; it sounds like an unnecessarily complex structure for fixed size lists. Would ZSTD / ZLIB work better here, generally speaking (it's not part of the parquet spec yet)? On the other hand, reading or decoding it might be skipped entirely, as it can be derived from the metadata.
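
A minimal sketch of what those repetition levels look like for this case (illustration only, not pyarrow internals):

import numpy as np

# One row of FixedSizeList<double, 80>: level 0 opens the list, then 79 ones.
rep_levels = np.tile(np.array([0] + [1] * 79, dtype=np.int16), 100_000)
print(rep_levels[:82])  # 0, then 79x 1, then the next row's 0, 1, ...
# The pattern is fully determined by the schema (list size 80), so for a
# non-nullable fixed size list a reader could in principle derive it instead
# of decoding it.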

@alippai
Contributor Author

alippai commented Mar 10, 2023

@alamb @tustvold I saw your blog post about this for arrow-rs. Do you handle this differently in Rust?

@tustvold
Contributor

tustvold commented Mar 10, 2023

@alamb @tustvold I saw your blog post about this for arrow-rs. Do you handle this differently in Rust?

We don't support FixedSizeList in arrow-rs AFAIK. Parquet to my knowledge does not have an equivalent logical construct, and so it isn't particularly clear to me what support would mean other than implicitly casting between a regular list and a fixed size list.

Calculating the def and rep levels for 100k rows with all non-null values takes 2x as long as reading 8M doubles

Assuming the doubles are PLAIN encoded, this is not surprising to me: you are comparing the performance of what is effectively a memcpy that runs at memory bandwidth against a fairly complex bit-packing scheme used for the definition and repetition levels.

In the Rust implementation we have a couple of tricks that help here, but it is still relatively expensive (at least compared to primitive decoding):

  • We decode definition levels directly to the null buffer if there are only nulls at the leaf level (i.e. no lists or nested nulls), allowing us to preserve the bit-packing
  • We have vectorised unpack implementations specialised for each bit width (I believe arrow C++ does this also)

Definition level data is all 1s and supposed to be RLE encoded

It will actually all be 2, unless the doubles are themselves not nullable

Repetition level data is one 0 followed by 79x 1, repeated 100k times in our case. I'm not sure RLE will help here; it sounds like an unnecessarily complex structure for fixed size lists

These repetition levels will be RLE encoded. Theoretically a reader could preserve this, but the record shredding logic is extremely fiddly and so might run the risk of adding complexity to an already very complex piece of code. At least in arrow-rs we always decode repetition levels to an array of i16
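
For the files from the repro above, the level depth can be checked directly from Python (a minimal sketch; the exact numbers depend on the declared nullability):

import pyarrow.parquet as pq

# Max definition/repetition levels of the first leaf column.
# A flat required double is 0/0; the fixed size list leaf is not.
for name in ('t1.parquet', 't2.parquet'):
    col = pq.ParquetFile(name).schema.column(0)
    print(name, col.max_definition_level, col.max_repetition_level)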

@alippai
Contributor Author

alippai commented Mar 10, 2023

Thanks, this is an amazing explanation.

The other day I saw a Tensor canonical extension type discussion on the Arrow mailing list; that type builds on top of FixedSizeList. It looks like this means only partial Parquet support; in most cases it'll be cheaper to simply denormalize the data.

A slightly faster alternative is using fixed size binary data (it's still slower than doubles, but much better than the FixedSizeList<double>).
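
For reference, a minimal sketch of that fixed size binary variant (t4.parquet is just an illustrative name; pa.Array.from_buffers wraps the raw bytes without copying):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

arr_random = np.random.default_rng().standard_normal(size=[8000000], dtype='float64')
# Pack each run of 80 little-endian doubles into one 640-byte value.
buf = pa.py_buffer(arr_random.tobytes())
fsb = pa.Array.from_buffers(pa.binary(640), 100_000, [None, buf])
pq.write_table(pa.table({'A': fsb}), 't4.parquet')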

@tustvold
Contributor

It looks like this means only partial Parquet support; in most cases it'll be cheaper to simply denormalize the data.

If denormalizing is an option, it will definitely be faster, at least until parquet adds native support for fixed size repeated elements (relatively unlikely - lots of readers don't even support v2 which is a decade old now). However, I believe the use-case for tensor types is to serialize other columns alongside, at which point denormalizing may not be possible.

That being said, whilst FixedSizeList may be slow compared to native primitives, it should still be pretty fast compared to decoding byte arrays, or even some of the other primitive encodings such as the deeply flawed DELTA_BINARY_PACKED. Certainly it will be significantly faster than other commonly used serialization formats for tensors such as protobuf or JSON.

To be honest parquet's tag line could be "It's good enough". You can almost certainly do 2-3x better than parquet for any given workload, but you really need orders of magnitude improvements to overcome ecosystem inertia. I suspect most workloads will also mix in byte arrays and/or object storage or block compression, at which point those will easily be the tall pole in decode performance.

@alippai
Contributor Author

alippai commented Mar 10, 2023

Agreed. I'm also closing this issue, as in this form it's not really actionable.
Arrow-based fixed size lists of primitive values (e.g. tensors) shouldn't be converted to nested Parquet data; they are better stored as BYTE_ARRAY in Parquet (while I think it'd be important, sadly there is no fixed-size BYTE_ARRAY in the Parquet spec, so it'll still be slightly slower than it could be). Some fast paths for never-null data - data that was not marked as non-nullable when it was saved - could be useful too, but that's all.

@alippai alippai closed this as completed Mar 10, 2023
@mapleFU
Member

mapleFU commented Mar 11, 2023

Maybe you can use fixed size binary and have a transmute rule for it, like FixedLengthBinary(sizeof(double) * 80), and use zero-copying here. But it's still tricky...
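
A sketch of that transmute from Python (assuming the t4.parquet file from the fixed size binary example above; np.frombuffer reinterprets the binary buffer as doubles without copying):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

fsb = pq.read_table('t4.parquet')['A'].combine_chunks().chunk(0)
# View the contiguous 640-byte values as doubles (zero-copy),
# then wrap them back into a FixedSizeList<double, 80>.
values = np.frombuffer(fsb.buffers()[1], dtype='<f8')
fsl = pa.FixedSizeListArray.from_arrays(pa.array(values), 80)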

or even some of the other primitive encodings such as the deeply flawed DELTA_BINARY_PACKED

Hi @tustvold. To be honest, I wonder why DELTA_BINARY_PACKED is deeply flawed - is there any reference or commentary on this? I think even though it doesn't have packing or FOR, it's still OK for delta data.

@tustvold
Contributor

why DELTA_BINARY_PACKED is deeply flawed

The paper they link to actually explains why the approach is problematic - http://arxiv.org/pdf/1209.2137v5.pdf. The whole paper is on why not to implement delta compression in this way 😂

@mapleFU
Member

mapleFU commented Mar 11, 2023

Learned a lot, thanks!

@alippai
Contributor Author

alippai commented May 13, 2024

@AlenkaF the above is still relevant for the new FixedShapeTensorType and similar types as of version 16.
I'm not sure whether the metadata of the Arrow FixedSizeList should be used for an optimized Parquet read, or whether the right solution is extending Parquet so it natively supports dense, fixed-length lists, not only sparse nested types.

What do you think overall?

@AlenkaF
Member

AlenkaF commented May 14, 2024

Though it is unfortunate that the values get blown up and writing to Parquet becomes slower, I still think it is "good enough", as already mentioned, given the complexity involved.

What if applications used custom metadata to hold the schema and tensor type, while writing only the storage values (floats, for example) to Parquet files? It would need some custom logic to reconstruct the tensor when reading, but it might be a good alternative (the buffers should still be the same after the read, not copied).
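
A minimal sketch of that idea (the metadata key fixed_size_list_size is hypothetical, just to carry the list size alongside the flat values):

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

values = pa.array(np.random.default_rng().standard_normal(8_000_000))
# Write only the flat storage values; keep the tensor shape in custom metadata.
flat = pa.table({'A': values}).replace_schema_metadata({'fixed_size_list_size': '80'})
pq.write_table(flat, 'flat.parquet')

# On read, rebuild the FixedSizeList from the metadata; the value buffer is
# reused, not copied.
t = pq.read_table('flat.parquet')
size = int(t.schema.metadata[b'fixed_size_list_size'])
restored = pa.FixedSizeListArray.from_arrays(t['A'].combine_chunks().chunk(0), size)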

cc @rok

@rok
Member

rok commented May 14, 2024

What if applications used custom metadata to hold the schema and tensor type, while writing only the storage values (floats, for example) to Parquet files? It would need some custom logic to reconstruct the tensor when reading, but it might be a good alternative (the buffers should still be the same after the read, not copied).

I think FixedShapeTensor gets stored as a FixedSizeList plus some metadata, so the overhead comes from storing the FixedSizeList. I'm not sure, but maybe there's a clean way to cast FixedSizeList to FixedSizeBinary or similar when writing a FixedShapeTensor and do the inverse on reading. I don't think we have a clean option here though.

Given the current activity in the Parquet community, it might be worth proposing adding FixedSizeList to Parquet?

Also I wonder if optimized take (#39798) would improve the performance somewhat once all the PRs land.

@tustvold
Contributor

Given the current activity in the Parquet community, it might be worth proposing adding FixedSizeList to Parquet?

This would be my recommendation, as this would allow for encoding non-nullable tensors without the need for any definition or repetition levels at all. Given the growing prevalence of workloads using such types, I think this would be broadly valuable.

@AlenkaF
Member

AlenkaF commented May 15, 2024

I think FixedShapeTensor gets stored as a FixedSizeList plus some metadata, so the overhead comes from storing the FixedSizeList. I'm not sure, but maybe there's a clean way to cast FixedSizeList to FixedSizeBinary or similar when writing a FixedShapeTensor and do the inverse on reading. I don't think we have a clean option here though.

Yes, FixedShapeTensor gets stored as its storage, which is a FixedSizeList, plus metadata. I was thinking about (silly, I know) storing the values from the storage list, which would be a primitive layout.

But yes, proposing adding FixedSizeList to Parquet is a great idea.

@rok
Member

rok commented May 15, 2024

I've started a discussion on dev@parquet and will open a PR against parquet-format soon.
