
[Python][Parquet] Column statistics incorrect for Dictionary Column in Parquet #15042

Closed
rustyconover opened this issue Dec 19, 2022 · 5 comments · Fixed by #15179
Labels: Component: C++, Component: Parquet, Component: Python, Critical Fix, Priority: Blocker, Type: bug

Comments

@rustyconover

Describe the bug, including details regarding any error messages, version, and platform.

When writing a column of type pa.dictionary(pa.int32(), pa.string()) to a Parquet file, incorrect statistics are produced. I believe the statistics are calculated from the contents of the first chunk rather than from all chunks. Since the CSV parser produces chunks that may not contain all dictionary values, this results in incorrect statistics being written.

A test case is below that demonstrates the problem without using the CSV parser:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

# The ordering of the values doesn't matter, but they must
# be stored in separate chunks (as they would be if they were parsed from CSV)
arr_1 = pa.array(["rusty", "sean", "aa"]).dictionary_encode()
arr_2 = pa.array(["zzz", "frank"]).dictionary_encode()

t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
)

# If this line is commented out, the bug does not occur.
t = t.unify_dictionaries()

with pq.ParquetWriter("example.parquet", schema) as writer:
    writer.write_table(t)

# The Parquet stats will be:
# ┌─────────────────┬─────────────────┐
# │ stats_min_value │ stats_max_value │
# ├─────────────────┼─────────────────┤
# │ aa              │ sean            │
# └─────────────────┴─────────────────┘
#
# The stats should be:
#
# ┌─────────────────┬─────────────────┐
# │ stats_min_value │ stats_max_value │
# ├─────────────────┼─────────────────┤
# │ aa              │ zzz             │
# └─────────────────┴─────────────────┘
#

Workaround: if you call

t = t.combine_chunks()

before calling write_table(), the proper column statistics are written.
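
For reference, here is a minimal sketch of the workaround applied to the reproducer above; the output filename example_workaround.parquet is arbitrary, and everything else matches the original script:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

arr_1 = pa.array(["rusty", "sean", "aa"]).dictionary_encode()
arr_2 = pa.array(["zzz", "frank"]).dictionary_encode()
t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
).unify_dictionaries()

# Merging all chunks into one before writing avoids the per-chunk statistics issue.
t = t.combine_chunks()

with pq.ParquetWriter("example_workaround.parquet", schema) as writer:
    writer.write_table(t)

stats = pq.ParquetFile("example_workaround.parquet").metadata.row_group(0).column(0).statistics
print(stats.min, stats.max)  # expected: aa zzz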

The difficulty with incorrect column statistics is that query engines (e.g. Athena, Trino) use them for predicate pushdown as part of query execution. If those statistics are wrong, the resulting query plans can skip data that actually matches, so rows that exist in the Parquet file are not returned in the query result.
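
To make the pushdown concern concrete, here is a rough sketch of the kind of row-group pruning a query engine performs, written against pyarrow's metadata API; the predicate field_1 = 'zzz' and the skip logic are illustrative, not any particular engine's implementation:

import pyarrow.parquet as pq

# Illustrative predicate: field_1 = 'zzz'. An engine can skip any row group
# whose [min, max] statistics range does not contain the predicate value.
predicate_value = "zzz"

metadata = pq.ParquetFile("example.parquet").metadata
for i in range(metadata.num_row_groups):
    stats = metadata.row_group(i).column(0).statistics
    if stats is not None and stats.has_min_max and not (stats.min <= predicate_value <= stats.max):
        # With the buggy statistics (max = 'sean'), this row group is pruned
        # even though it actually contains 'zzz', so the row is silently lost.
        print(f"row group {i}: skipped")
    else:
        print(f"row group {i}: scanned")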

Component(s)

C++, Parquet, Python

@rustyconover
Author

Tested on pyarrow 10.0.1

@kou kou changed the title Column statistics incorrect for Dictionary Column in Parquet [Python][Parquet] Column statistics incorrect for Dictionary Column in Parquet Dec 20, 2022
@wjones127
Member

This is very odd. It might have more to do with something odd in unify_dictionaries than with Parquet, but I'm not 100% sure. If we create the two chunks as slices of the same original dictionary array, it works fine:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

arr = pa.array(["rusty", "sean", "aa", "zzz", "frank"]).dictionary_encode()
arr_1 = arr.slice(0, 3)
arr_2 = arr.slice(3, 5)

t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
)


with pq.ParquetWriter("example.parquet", schema) as writer:
    writer.write_table(t)

metadata = pq.ParquetFile("example.parquet").metadata
print(f"Has {metadata.num_row_groups} row groups")
stats = metadata.row_group(0).column(0).statistics
print(stats)

outputs

Has 1 row groups
<pyarrow._parquet.Statistics object at 0x11f76fb50>
  has_min_max: True
  min: aa
  max: zzz
  null_count: 0
  distinct_count: 0
  num_values: 5
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8

@wjones127
Member

Oh, but if I add t = t.unify_dictionaries() it has the same behavior. 😞
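
For clarity, that is a one-line change to the slicing script above, followed by re-checking the statistics (on an affected release such as pyarrow 10.0.1):

# Same setup as the slicing example above, with one line added before writing.
t = t.unify_dictionaries()

with pq.ParquetWriter("example.parquet", schema) as writer:
    writer.write_table(t)

stats = pq.ParquetFile("example.parquet").metadata.row_group(0).column(0).statistics
# On an affected version this again reports min/max derived only from the
# first chunk, i.e. the same incorrect statistics as the original reproducer.
print(stats.min, stats.max)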

@wjones127 wjones127 self-assigned this Jan 2, 2023
@westonpace
Member

westonpace commented Jan 2, 2023

It appears the null count is also wrong. At a glance, the dictionary writing path (WriteArrowDictionary in column_writer.cc) looks something like this:

if (AlreadyHaveDictionaryForColumn()) {
  if (IsDictionaryChanged()) {
    // Fall back to plain encoding, or maybe unify, or something
  } else {
    // Write another batch of indices
  }
} else {
  // Calculate statistics / null count, set up the column, and store the
  // dictionary for future writes to the column
}

So I suppose the behavior makes sense given the above algorithm. The statistics and null counts are based only on the first chunk in the column. We need to add "update null count and potentially update min/max" to that middle branch.
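
As a quick check of the null-count observation, one could vary the original reproducer so the only null sits in the second chunk; the filename example_nulls.parquet is arbitrary, and the expected-vs-reported counts follow from the behavior described above:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema({"field_1": pa.dictionary(pa.int32(), pa.string())})

# The null value is only in the second chunk.
arr_1 = pa.array(["rusty", "sean", "aa"]).dictionary_encode()
arr_2 = pa.array(["zzz", None, "frank"]).dictionary_encode()

t = pa.Table.from_batches(
    [
        pa.record_batch([arr_1], names=["field_1"]),
        pa.record_batch([arr_2], names=["field_1"]),
    ]
).unify_dictionaries()

with pq.ParquetWriter("example_nulls.parquet", schema) as writer:
    writer.write_table(t)

stats = pq.ParquetFile("example_nulls.parquet").metadata.row_group(0).column(0).statistics
# The true null count is 1; if the statistics come only from the first chunk,
# the file reports 0 instead.
print(stats.null_count)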

@raulcd
Member

raulcd commented Jan 11, 2023

I am going to remove the milestone in preparation for the release. If this is a blocker, please add the label Priority: Blocker.

@raulcd raulcd removed this from the 11.0.0 milestone Jan 11, 2023
@wjones127 wjones127 added the Priority: Blocker Marks a blocker for the release label Jan 11, 2023
@wjones127 wjones127 added this to the 11.0.0 milestone Jan 11, 2023
wjones127 added a commit that referenced this issue Jan 11, 2023
…naries (#15179)

* Closes: #15042

Authored-by: Will Jones <willjones127@gmail.com>
Signed-off-by: Will Jones <willjones127@gmail.com>
@wjones127 wjones127 added the Critical Fix Bugfixes for security vulnerabilities, crashes, or invalid data. label Jan 25, 2023