Parquet list columns read incorrectly #2557

bmschmidt · 2021-11-06T14:45:58Z

What happens?

When scanning parquet files with list columns, results with incorrect offsets can be returned.

To Reproduce

import numpy as np
import pyarrow as pa
import duckdb
import pyarrow.parquet as pq
R = 100_000
fake_data = pa.array([np.arange(i, i + 100).astype('float64') for i in range(R)])
tb = pa.table({'word0': pa.array([str(i) for i in range(0, R)]), 'year_counts': fake_data})
pq.write_table(tb, "test.parquet")

con = duckdb.connect(":memory:")

con.query("SELECT year_counts FROM parquet_scan('test.parquet') WHERE word0='90000'")

This should return a list-column of length 100 starting at 90,000; instead it starts with 912.

pq.read_table("test.parquet", columns = ['year_counts'], filters = [("word0", "=", "90000")])['year_counts']

returns the correct result.
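
For anyone checking a build against this report, here is a minimal sketch of the expected values, assuming only the data layout from the script above (row i holds the 100 floats i, i + 1, ..., i + 99). The helper names `expected_year_counts` and `is_correct_row` are made up for illustration and are not part of DuckDB or pyarrow.

```python
from typing import List, Sequence

LIST_LEN = 100  # each list written by the repro script holds 100 consecutive floats


def expected_year_counts(row: int, length: int = LIST_LEN) -> List[float]:
    """Values the repro script writes for a given row: row, row + 1, ..., row + length - 1."""
    return [float(row + k) for k in range(length)]


def is_correct_row(row: int, actual: Sequence[float]) -> bool:
    """True if a list read back for `row` matches what the repro script wrote."""
    return list(actual) == expected_year_counts(row)


if __name__ == "__main__":
    good = expected_year_counts(90000)  # what the query should return
    bad = expected_year_counts(912)     # a list starting at 912, as observed in the report
    print(is_correct_row(90000, good))
    print(is_correct_row(90000, bad))
```

The list fetched from the DuckDB query (for example via `con.execute(...).fetchall()[0][0]`) could be passed as `actual`; with the bug present, the check for row 90000 should fail because the returned list starts at 912.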

Environment (please complete the following information):

OS: OS X
DuckDB Version: 0.3.1-dev550
DuckDB Client: Python / WASM

Before Submitting

Have you tried this on the latest master branch?

Python: pip install duckdb --upgrade --pre
R: install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
Other Platforms: You can find binaries here or compile from source.

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?


hannes · 2021-11-08T07:53:50Z

Thanks for reporting, we will have a look

FenrirWillow · 2021-11-22T14:51:25Z

Hello everyone, this bug affected one of the prototypes for which we were exploring DuckDB + Node, so I took a stab at fixing the problem myself. I will add some regression tests to ensure that this doesn't return in some nasty form, but in the meantime, if there is anything else that needs doing to help this get merged, please let me know :)

Fix bug in parquet reader causing list columns to be parsed incorrectly (#2557)

Mytherin · 2021-12-22T16:23:28Z

This should be fixed now.

FenrirWillow mentioned this issue Nov 22, 2021

Fix bug in parquet reader causing list columns to be parsed incorrectly (#2557) #2650

Merged

Mytherin added a commit that referenced this issue Nov 23, 2021

Merge pull request #2650 from FenrirWillow/bugfix/parquet-list-columns

2cb6c7d

Fix bug in parquet reader causing list columns to be parsed incorrectly (#2557)

Mytherin closed this as completed Dec 22, 2021

Mytherin mentioned this issue Dec 22, 2021

Parquet Writer Rework: Support complex types #2832

Merged
