
[C++][Parquet] Reading dict pages is not reading all values? #29148

Open
asfimport opened this issue Jul 29, 2021 · 3 comments
While round-tripping dictionary-encoded arrays through dictionary-encoded parquet files in arrow2, I have been unable to get pyarrow to read all values from the dictionary page. This contrasts with (py)spark, which can read them.

Attached to this issue is a parquet file generated with Rust's arrow2: I read the IPC "generated_dictionary" file and write it to parquet (v1) with dictionary encoding, i.e. two pages, one with the values and the other with the indices.
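Not part of the original report, but as a quick sanity check one can inspect the column chunk metadata of such a file with pyarrow to confirm it is dictionary-encoded (the file name "generated_dictionary.parquet" is an assumption here):

import pyarrow.parquet as pq

# hypothetical path for the attached file
pf = pq.ParquetFile("generated_dictionary.parquet")
col_meta = pf.metadata.row_group(0).column(0)
print(col_meta.encodings)   # should include a dictionary encoding, e.g. PLAIN_DICTIONARY
print(col_meta.num_values)  # number of values stored in the "dict0" column chunk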

The expected result for column 0, "dict0", is:

import pyarrow

path = "generated_dictionary"
golden_path = f"../testing/arrow-testing/data/arrow-ipc-stream/integration/1.0.0-littleendian/{path}.arrow_file"
column = ("dict0", 0)

table = pyarrow.ipc.RecordBatchFileReader(golden_path).read_all()
expected = next(c for i, c in enumerate(table) if i == column[1])
expected = expected.combine_chunks().tolist()
print(expected)
# ['nwg€6d€', None, None, None, None, None, None, None, None, 'e£a5µ矢a', None, None, 'rpc£µ£3', None, None, None, None]


# read with pyspark
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.config(
    # see https://stackoverflow.com/a/62024670/931303
    "spark.sql.parquet.enableVectorizedReader",
    "false",
).getOrCreate()

df = spark.read.parquet(f"{golden_path}.parquet")

r = df.select(column[0]).collect()

result = [row[column[0]] for row in r]
assert expected == result

However, I have been unable to read it correctly with pyarrow. This is the result I get:

import pyarrow.parquet as pq

table = pq.read_table(f"{path}.parquet")
result = table[0]
print(result.combine_chunks().dictionary)
print(result.combine_chunks().indices)
[
  "2lf4µµr",
  "",
  "nwg€6d€",
  "rpc£µ£3",
  "e£a5µ矢a"
]
[
  2,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  null,
  8,
  null,
  null,
  4,
  null,
  null,
  null,
  null
]

which is incorrect, as the largest index (8) is out of range for the length (5) of the dictionary values.

The indices are being read correctly, but not all of the values are. For clarity, the buffer of the dictionary page (PLAIN-encoded, as per the spec) in the attached parquet file is:

# ["2lf4µµr", "", "nwg€6d€", "", "rpc£µ£3", "", "", "", "e£a5µ矢a", ""]

[
9, 0, 0, 0, 50, 108, 102, 52, 194, 181, 194, 181, 114,
0, 0, 0, 0, 
11, 0, 0, 0, 110, 119, 103, 226, 130, 172, 54, 100, 226, 130, 172, 
0, 0, 0, 0,
10, 0, 0, 0, 114, 112, 99, 194, 163, 194, 181, 194, 163, 51, 
0, 0, 0, 0, 
0, 0, 0, 0, 
0, 0, 0, 0, 
11, 0, 0, 0, 101, 194, 163, 97, 53, 194, 181, 231, 159, 162, 97, 
0, 0, 0, 0
]

and the number of values reported in the dictionary page header is 10. I would expect all of these values to be read directly into the dictionary.
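For reference, a minimal decoding sketch (not part of the original report): in the PLAIN encoding for BYTE_ARRAY, each value is a 4-byte little-endian length followed by that many bytes, so decoding the buffer above by hand yields 10 values, duplicates included.

import struct

# the dictionary page buffer quoted above
buf = bytes([
    9, 0, 0, 0, 50, 108, 102, 52, 194, 181, 194, 181, 114,
    0, 0, 0, 0,
    11, 0, 0, 0, 110, 119, 103, 226, 130, 172, 54, 100, 226, 130, 172,
    0, 0, 0, 0,
    10, 0, 0, 0, 114, 112, 99, 194, 163, 194, 181, 194, 163, 51,
    0, 0, 0, 0,
    0, 0, 0, 0,
    0, 0, 0, 0,
    11, 0, 0, 0, 101, 194, 163, 97, 53, 194, 181, 231, 159, 162, 97,
    0, 0, 0, 0,
])

values = []
offset = 0
while offset < len(buf):
    # each PLAIN-encoded BYTE_ARRAY value: a 4-byte little-endian length, then the bytes
    (length,) = struct.unpack_from("<i", buf, offset)
    offset += 4
    values.append(buf[offset:offset + length].decode("utf-8"))
    offset += length

print(len(values))  # 10
print(values)
# ['2lf4µµr', '', 'nwg€6d€', '', 'rpc£µ£3', '', '', '', 'e£a5µ矢a', '']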

We cannot rule out the possibility that I am doing something wrong when writing. So far I have been able to round-trip these files within arrow2, and arrow2 can read dictionary-encoded files written by both pyarrow and pyspark, which suggests that the arrow2 reader is correct.

Reporter: Jorge Leitão / @jorgecarleitao

Original Issue Attachments:

PRs and other links:

Note: This issue was originally created as ARROW-13487. Please see the migration documentation for further details.


Micah Kornfield / @emkornfield:
What I believe is happening is that somewhere in the Arrow decoding path we assume that dictionary values are unique, and we don't remap the indices being read when they aren't.


Micah Kornfield / @emkornfield:

PARQUET_THROW_NOT_OK(binary_builder->InsertMemoValues(*arr));
is the problematic line. It appears the documentation isn't too specific, but InsertMemoValues appears to de-duplicate the values. So I think the options are:

1.  Make that method act more like a multimap.

2.  Bypass that method and construct a dictionary array without deduping.

3.  Recalculate the indices when reading them in (likely the biggest performance hit, but likely the lowest amount of downstream impact; see the sketch after this comment). I suppose we could also special-case this to only apply when there are duplicates in the dictionary.

 

@wesm originally wrote this code, I think, and I seem to recall there being some issues here, so I am curious whether he recalls any downstream issues of not de-duplicating.
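A rough Python illustration of option 3 above (a hedged sketch of the idea only, not the actual C++ change): de-duplicate the dictionary while recording where each original position lands, then rewrite the indices through that mapping.

# values as they appear in the dictionary page (with duplicates), and the raw indices
raw_dictionary = ["2lf4µµr", "", "nwg€6d€", "", "rpc£µ£3", "", "", "", "e£a5µ矢a", ""]
raw_indices = [2, None, None, None, None, None, None, None, None, 8,
               None, None, 4, None, None, None, None]

deduped = []   # unique dictionary values, in first-seen order
memo = {}      # value -> position in deduped
remap = {}     # original dictionary position -> position in deduped
for pos, value in enumerate(raw_dictionary):
    if value not in memo:
        memo[value] = len(deduped)
        deduped.append(value)
    remap[pos] = memo[value]

indices = [None if i is None else remap[i] for i in raw_indices]
print([None if i is None else deduped[i] for i in indices])
# ['nwg€6d€', None, None, None, None, None, None, None, None, 'e£a5µ矢a', None, None, 'rpc£µ£3', None, None, None, None]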
