
[Python][Parquet] Column statistics incorrect for Dictionary Column in Parquet #15042

Closed
rustyconover opened this issue Dec 19, 2022 · 5 comments · Fixed by #15179
Labels: Component: C++, Component: Parquet, Component: Python, Critical Fix, Priority: Blocker, Type: bug

Comments

@rustyconover

Describe the bug, including details regarding any error messages, version, and platform.

When writing a column of type pa.dictionary(pa.int32(), pa.string()) to a Parquet file, incorrect statistics are produced. I believe the statistics are calculated from the contents of the first chunk rather than from all chunks. Since the CSV parser produces chunks that may not contain all dictionary values, this results in incorrect statistics being written.

A test case is below that demonstrates the problem without using the CSV parser:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

# The ordering of the values doesn't matter, but they must
# be stored in separate chunks (as they would be if they were parsed from CSV)
arr_1 = pa.array(["rusty", "sean", "aa"]).dictionary_encode()
arr_2 = pa.array(["zzz", "frank"]).dictionary_encode()

t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
)

# If this line is commented out, the bug does not occur.
t = t.unify_dictionaries()

with pq.ParquetWriter("example.parquet", schema) as writer:
    writer.write_table(t)

# The Parquet stats will be:
# ┌─────────────────┬─────────────────┐
# │ stats_min_value │ stats_max_value │
# ├─────────────────┼─────────────────┤
# │ aa              │ sean            │
# └─────────────────┴─────────────────┘
#
# The stats should be:
#
# ┌─────────────────┬─────────────────┐
# │ stats_min_value │ stats_max_value │
# ├─────────────────┼─────────────────┤
# │ aa              │ zzz             │
# └─────────────────┴─────────────────┘
#

Workaround: if you call

t = t.combine_chunks()

before calling write_table(), the proper column statistics are written.
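
For reference, here is a minimal sketch of the workaround applied to the reproducer above; the output filename example_workaround.parquet is arbitrary, and everything else matches the original script:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

arr_1 = pa.array(["rusty", "sean", "aa"]).dictionary_encode()
arr_2 = pa.array(["zzz", "frank"]).dictionary_encode()
t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
).unify_dictionaries()

# Merging all chunks into one before writing avoids the per-chunk statistics issue.
t = t.combine_chunks()

with pq.ParquetWriter("example_workaround.parquet", schema) as writer:
    writer.write_table(t)

stats = pq.ParquetFile("example_workaround.parquet").metadata.row_group(0).column(0).statistics
print(stats.min, stats.max)  # expected: aa zzz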

The difficulty with incorrect column statistics is that query engines (e.g. Athena, Trino) use them for predicate pushdown as part of query execution. If those statistics are wrong, the resulting query plans can skip data that actually matches, so rows that exist in the Parquet file are not returned in the query result.
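
To make the pushdown concern concrete, here is a rough sketch of the kind of row-group pruning a query engine performs, written against pyarrow's metadata API; the predicate field_1 = 'zzz' and the skip logic are illustrative, not any particular engine's implementation:

import pyarrow.parquet as pq

# Illustrative predicate: field_1 = 'zzz'. An engine can skip any row group
# whose [min, max] statistics range does not contain the predicate value.
predicate_value = "zzz"

metadata = pq.ParquetFile("example.parquet").metadata
for i in range(metadata.num_row_groups):
    stats = metadata.row_group(i).column(0).statistics
    if stats is not None and stats.has_min_max and not (stats.min <= predicate_value <= stats.max):
        # With the buggy statistics (max = 'sean'), this row group is pruned
        # even though it actually contains 'zzz', so the row is silently lost.
        print(f"row group {i}: skipped")
    else:
        print(f"row group {i}: scanned")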

Component(s)

C++, Parquet, Python

@rustyconover
Author

Tested on pyarrow 10.0.1

@kou kou changed the title Column statistics incorrect for Dictionary Column in Parquet [Python][Parquet] Column statistics incorrect for Dictionary Column in Parquet Dec 20, 2022
@wjones127
Member

This is very odd. It might have more to do with something odd in unify_dictionaries than with Parquet, but I'm not 100% sure. If we create the two chunks as slices of the same original dictionary array, it works fine:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

arr = pa.array(["rusty", "sean", "aa", "zzz", "frank"]).dictionary_encode()
arr_1 = arr.slice(0, 3)
arr_2 = arr.slice(3, 5)

t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
)


with pq.ParquetWriter("example.parquet", schema) as writer:
    writer.write_table(t)

metadata = pq.ParquetFile("example.parquet").metadata
print(f"Has {metadata.num_row_groups} row groups")
stats = metadata.row_group(0).column(0).statistics
print(stats)

outputs

Has 1 row groups
<pyarrow._parquet.Statistics object at 0x11f76fb50>
  has_min_max: True
  min: aa
  max: zzz
  null_count: 0
  distinct_count: 0
  num_values: 5
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8

@wjones127
Member

Oh, but if I add t = t.unify_dictionaries() it has the same behavior. 😞
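
For clarity, that is a one-line change to the slicing script above, followed by re-checking the statistics (on an affected release such as pyarrow 10.0.1):

# Same setup as the slicing example above, with one line added before writing.
t = t.unify_dictionaries()

with pq.ParquetWriter("example.parquet", schema) as writer:
    writer.write_table(t)

stats = pq.ParquetFile("example.parquet").metadata.row_group(0).column(0).statistics
# On an affected version this again reports min/max derived only from the
# first chunk, i.e. the same incorrect statistics as the original reproducer.
print(stats.min, stats.max)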

@wjones127 wjones127 self-assigned this Jan 2, 2023
@westonpace
Member

westonpace commented Jan 2, 2023

It appears the null count is also wrong. At a glance, the dictionary writing path (WriteArrowDictionary in column_writer.cc) looks something like this:

if (AlreadyHaveDictionaryForColumn()) {
  if (IsDictionaryChanged()) {
    // Fall back to plain encoding, or maybe unify, or something
  } else {
    // Write another batch of indices
  }
} else {
  // Calculate statistics / null count, set up the column, and store the
  // dictionary for future writes to the column
}

So I suppose the behavior makes sense given the above algorithm. The statistics and null counts are based only on the first chunk in the column. We need to add "update null count and potentially update min/max" to that middle branch.
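
As a quick check of the null-count observation, one could vary the original reproducer so the only null sits in the second chunk; the filename example_nulls.parquet is arbitrary, and the expected-vs-reported counts follow from the behavior described above:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

# The null value is only in the second chunk.
arr_1 = pa.array(["rusty", "sean", "aa"]).dictionary_encode()
arr_2 = pa.array(["zzz", None, "frank"]).dictionary_encode()

t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
).unify_dictionaries()

with pq.ParquetWriter("example_nulls.parquet", schema) as writer:
    writer.write_table(t)

stats = pq.ParquetFile("example_nulls.parquet").metadata.row_group(0).column(0).statistics
# The true null count is 1; if the statistics come only from the first chunk,
# the file reports 0 instead.
print(stats.null_count)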

@raulcd
Member

raulcd commented Jan 11, 2023

I am going to remove the milestone in preparation for the release. If this is a blocker, please add the label Priority: Blocker.

@raulcd raulcd removed this from the 11.0.0 milestone Jan 11, 2023
@wjones127 wjones127 added the Priority: Blocker Marks a blocker for the release label Jan 11, 2023
@wjones127 wjones127 added this to the 11.0.0 milestone Jan 11, 2023
wjones127 added a commit that referenced this issue Jan 11, 2023
…naries (#15179)

* Closes: #15042

Authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Jan 25, 2023