Parquet list columns read incorrectly #2557

Closed
2 tasks done
bmschmidt opened this issue Nov 6, 2021 · 3 comments · Fixed by #2832

Comments

@bmschmidt

bmschmidt commented Nov 6, 2021

What happens?

When scanning Parquet files that contain list columns, a query can return list values read from the wrong offset.
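
To make "incorrect offsets" concrete, here is a minimal pure-Python sketch of a list column stored as one flat values buffer plus an offsets array (the layout Arrow uses in memory). It is only an illustration of the symptom, not a description of the actual cause inside DuckDB's Parquet reader; the helper names `build_list_column` and `read_row` and the shift of 7 are made up for the example.

```python
# Illustrative only: a list column as a flat values buffer plus offsets.
# Row i is values[offsets[i]:offsets[i + 1]].

def build_list_column(rows):
    """Flatten a list of lists into (values, offsets)."""
    values, offsets = [], [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets


def read_row(values, offsets, i):
    """Slice row i back out of the flat buffer."""
    return values[offsets[i]:offsets[i + 1]]


# Small stand-in for the reproduction data: row i holds i, i+1, i+2.
rows = [[float(j) for j in range(i, i + 3)] for i in range(5)]
values, offsets = build_list_column(rows)

correct = read_row(values, offsets, 4)

# Hypothetical reader bug: every offset is shifted by a wrong amount.
# The row still has the right length, but its contents come from
# earlier rows in the buffer.
bad_offsets = [max(0, o - 7) for o in offsets]
wrong = read_row(values, bad_offsets, 4)

print(correct)
print(wrong)
```

A shifted offset produces a list of the expected length with values belonging to other rows, which matches the shape of the symptom reported here (a 100-element list that starts at the wrong value).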

To Reproduce

import numpy as np
import pyarrow as pa
import duckdb
import pyarrow.parquet as pq
R = 100_000
fake_data = pa.array([np.arange(i, i + 100).astype('float64') for i in range(R)])
tb = pa.table({'word0': pa.array([str(i) for i in range(0, R)]), 'year_counts': fake_data})
pq.write_table(tb, "test.parquet")

con = duckdb.connect(":memory:")

con.query("SELECT year_counts FROM parquet_scan('test.parquet') WHERE word0='90000'")

This should return a list-column of length 100 starting at 90,000; instead it starts with 912.

pq.read_table("test.parquet", columns = ['year_counts'], filters = [("word0", "=", "90000")])['year_counts']

returns the correct result.
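
The comparison above can be turned into an automated check along the following lines. This is a sketch, assuming `test.parquet` was written by the reproduction script; `expected_row`, `first_mismatch`, and `check_duckdb` are hypothetical helper names introduced here, not part of DuckDB or pyarrow.

```python
def expected_row(start, width=100):
    """Expected list for word0 == str(start) in the reproduction data."""
    return [float(v) for v in range(start, start + width)]


def first_mismatch(actual, expected):
    """Return the first index where the lists differ, or None if equal."""
    for idx, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return idx
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


def check_duckdb(path="test.parquet", word="90000"):
    """Run the failing query and compare it with the expected list.

    duckdb is imported lazily so the helpers above stay usable without it.
    """
    import duckdb

    con = duckdb.connect(":memory:")
    rows = con.execute(
        f"SELECT year_counts FROM parquet_scan('{path}') WHERE word0 = ?",
        [word],
    ).fetchall()
    assert len(rows) == 1, f"expected exactly one row, got {len(rows)}"
    return first_mismatch(list(rows[0][0]), expected_row(int(word)))


# Usage, after running the reproduction script:
#   check_duckdb()  # None when the list is read correctly
```

On an affected build the check would be expected to report a mismatch at index 0, since the returned list starts at 912 instead of 90,000.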

Environment (please complete the following information):

  • OS: OS X
  • DuckDB Version: 0.3.1-dev550
  • DuckDB Client: Python / WASM

Before Submitting

  • Have you tried this on the latest master branch?
  • Python: pip install duckdb --upgrade --pre
  • R: install.packages("https://github.com/duckdb/duckdb/releases/download/master-builds/duckdb_r_src.tar.gz", repos = NULL)
  • Other Platforms: You can find binaries here or compile from source.
  • Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

@hannes
Member

hannes commented Nov 8, 2021

Thanks for reporting, we will have a look

@FenrirWillow
Contributor

Hello everyone, this bug affected one of the prototypes we were exploring with DuckDB + Node, so I took a stab at fixing the problem myself. I will add some regression tests to ensure that this does not return in some nasty form, but in the meantime, if anything else needs doing to help this get merged, please let me know :)

Mytherin added a commit that referenced this issue Nov 23, 2021
Fix bug in parquet reader causing list columns to be parsed incorrectly (#2557)
@Mytherin
Collaborator

This should be fixed now.

4 participants