[Python][Parquet] Column statistics incorrect for Dictionary Column in Parquet #15042
Comments
Tested on pyarrow 10.0.1.
This is very odd. It might have more to do with something odd in unifying dictionaries than with Parquet, but I'm not 100% sure. If we create the two chunks as slices of the same original dictionary array, it works fine:

```python
import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})
arr = pa.array(["rusty", "sean", "aa", "zzz", "frank"]).dictionary_encode()
arr_1 = arr.slice(0, 3)
arr_2 = arr.slice(3, 5)
t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
)
with pq.ParquetWriter("example.parquet", schema) as writer:
    writer.write_table(t)
metadata = pq.ParquetFile("example.parquet").metadata
print(f"Has {metadata.num_row_groups} row groups")
stats = metadata.row_group(0).column(0).statistics
print(stats)
```

which outputs correct statistics in that case.
Oh, but if I add …
It appears the null count is also wrong. From a glance at the dictionary writing path (…), the behavior makes sense given that algorithm: the statistics and null counts are based only on the first chunk in the column. We need to add "update null count and potentially update min/max" to that middle branch.
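To make the shape of that concrete, here is a self-contained toy model of the branching just described. This is purely illustrative Python (names and structure are invented), not the actual C++ in Arrow's column writer:

```python
def write_column(chunks):
    """chunks: list of (indices, dictionary_values) pairs."""
    stats = {"min": None, "max": None}
    first_dict = None
    for _indices, dictionary in chunks:
        if first_dict is None:
            # First chunk: statistics are derived from its dictionary values.
            first_dict = dictionary
            stats["min"], stats["max"] = min(dictionary), max(dictionary)
        elif dictionary != first_dict:
            # The "middle branch": a later chunk carries a different
            # dictionary. Its indices still get written, but the statistics
            # are never updated, which is the bug; the fix is to update the
            # null count and potentially the min/max here as well.
            pass
        # Chunks that share the first dictionary need no extra update: their
        # values are already covered by the statistics computed above.
    return stats

# Two independently encoded chunks: "zzz" never reaches the statistics.
print(write_column([([0, 1, 2], ["rusty", "sean", "aa"]),
                    ([0, 1], ["zzz", "frank"])]))
# -> {'min': 'aa', 'max': 'sean'}

# Two slices over one shared dictionary: the first chunk's dictionary already
# contains every value, so the statistics come out right.
shared = ["rusty", "sean", "aa", "zzz", "frank"]
print(write_column([([0, 1, 2], shared), ([3, 4], shared)]))
# -> {'min': 'aa', 'max': 'zzz'}
```

This also shows why the sliced-dictionary variant above works: when every chunk shares the first chunk's dictionary, the statistics computed from that dictionary already cover all values.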
I am going to remove the milestone in preparation for the release. If this is a blocker, please add the label …
Describe the bug, including details regarding any error messages, version, and platform.
When writing a column of type `pa.dictionary(pa.int32(), pa.string())` to a Parquet file, incorrect statistics are produced. I believe the statistics are calculated from the contents of the first chunk rather than from all chunks. Since the CSV parser produces chunks that may not contain all dictionary rows, this results in incorrect statistics being produced. The test case below demonstrates the problem without using the CSV parser:
Workaround: if you call `t = t.combine_chunks()` before calling `write_table()`, the proper column statistics are written.

The difficulty with improper column statistics is that query engines (Athena, Trino) use column statistics for predicate pushdown as part of their query execution. If the statistics are incorrect, data that exists in the Parquet file may not be returned as part of the query result.
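Applied to the sketch above, the workaround is a single extra line before the write:

```python
# Merging chunks unifies the dictionaries, so the statistics are computed
# over the full column rather than just the first chunk.
t = t.combine_chunks()
with pq.ParquetWriter("example.parquet", schema) as writer:
    writer.write_table(t)
```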
Component(s)
C++, Parquet, Python