Parquet: Support repetition level >1 and multi-column fields #871

Open
rcaudy opened this issue Jul 20, 2021 · 4 comments
Labels: core, feature request, parquet

@rcaudy
Member

rcaudy commented Jul 20, 2021

Currently, we regard nested repetition and multi-column fields as uncommon and hard to map into a columnar data table like Deephaven's.
This feature request is intended to capture views to the contrary.

Linked to #294, although intended for a later effort.

rcaudy added the feature request, core, and parquet labels Jul 20, 2021
rcaudy added this to the Backlog milestone Jul 20, 2021
rcaudy changed the title from "Support repetition levels > 1 in Parquet files" to "Support repetition or definition levels > 1 in Parquet files" Jul 22, 2021
rcaudy changed the title from "Support repetition or definition levels > 1 in Parquet files" to "Parquet: Support repetition or definition levels > 1" Aug 3, 2021
rcaudy changed the title from "Parquet: Support repetition or definition levels > 1" to "Parquet: Support repetition level >1 and multi-column fields" Aug 9, 2021
@rcaudy
Member Author

rcaudy commented Aug 9, 2021

Likely the first step is some kind of "flattening", but this is contrary to the intent of the Dremel design, so maybe we can think of a better solution.
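
To make the "flattening" idea concrete, here is a minimal sketch in plain Python. It is not Deephaven code: `flatten_record` and the sample record are hypothetical, and nested data is modeled as dicts (groups) and lists (repeated fields). Each repeated element becomes its own row, and the parent values are copied onto every row.

```python
from itertools import product


def flatten_record(record):
    """Expand one nested record into flat rows keyed by slash-joined paths.

    Hypothetical helper for illustration only: dicts stand in for Parquet
    groups and lists stand in for repeated fields.
    """

    def expand(value, prefix):
        if isinstance(value, dict):
            # Group: combine the partial rows produced by each child field.
            parts = [
                expand(child, f"{prefix}/{name}" if prefix else name)
                for name, child in value.items()
            ]
            rows = []
            for combo in product(*parts):
                merged = {}
                for partial in combo:
                    merged.update(partial)
                rows.append(merged)
            return rows
        if isinstance(value, list):
            # Repeated field: every element contributes its own rows.
            rows = []
            for item in value:
                rows.extend(expand(item, prefix))
            # An empty list still yields one row, with a null placeholder.
            return rows or [{prefix: None}]
        # Leaf value.
        return [{prefix: value}]

    return expand(record, "")


record = {
    "name": "Ann",
    "phoneNumbers": [
        {"number": "555", "kind": "home"},
        {"number": "777", "kind": "work"},
    ],
}

for row in flatten_record(record):
    print(row)
```

In this sketch the `name` value is duplicated on every output row, and two independent repeated fields in the same record would produce a cross product. That loss of the original nesting is one way to read the concern that flattening runs against the Dremel design.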

@rcaudy
Member Author

rcaudy commented Aug 9, 2021

I'll be improving our error messages with a PR shortly. New messages:
For:

t = io.deephaven.db.tables.utils.ParquetTools.readTable("/data/parquetFiles/nonnullable_nested_v1_IMPALA_NULLS_NONE.parquet")

We'll see:

java.lang.UnsupportedOperationException: Unsupported maximum repetition level 2 in column int_array_array/list/element/list/element

For:

t = io.deephaven.db.tables.utils.ParquetTools.readTable("/data/parquetFiles/repeated_nested_RUST_NONE.parquet")

We'll see:

java.lang.UnsupportedOperationException: Encountered unsupported multi-column field phoneNumbers: found columns phoneNumbers/phone/number and phoneNumbers/phone/kind
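
The two conditions behind these messages can be sketched in plain Python. This is an illustration, not the code in `ParquetSchemaReader`: the helper names are hypothetical, and the repetition types written for each schema node are assumptions about the test files (only the column paths come from the messages above). A leaf's maximum repetition level is the number of `repeated` nodes on its path, and a multi-column field is a top-level field with more than one leaf column under it.

```python
from collections import defaultdict

# Each leaf column is a root-to-leaf list of (node_name, repetition) pairs.
# The repetition types here are assumed for illustration.
INT_ARRAY_ARRAY = [
    ("int_array_array", "required"),
    ("list", "repeated"),
    ("element", "required"),
    ("list", "repeated"),
    ("element", "required"),
]
PHONE_NUMBER = [
    ("phoneNumbers", "optional"),
    ("phone", "repeated"),
    ("number", "required"),
]
PHONE_KIND = [
    ("phoneNumbers", "optional"),
    ("phone", "repeated"),
    ("kind", "optional"),
]


def column_name(path):
    """Slash-joined column path, in the style used by the error messages."""
    return "/".join(name for name, _ in path)


def max_repetition_level(path):
    """Count the repeated nodes between the root and the leaf."""
    return sum(1 for _, repetition in path if repetition == "repeated")


def multi_column_fields(leaves):
    """Map each top-level field with more than one leaf to its leaf columns."""
    by_field = defaultdict(list)
    for path in leaves:
        by_field[path[0][0]].append(column_name(path))
    return {field: cols for field, cols in by_field.items() if len(cols) > 1}


print(column_name(INT_ARRAY_ARRAY), max_repetition_level(INT_ARRAY_ARRAY))
print(multi_column_fields([INT_ARRAY_ARRAY, PHONE_NUMBER, PHONE_KIND]))
```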

@devinrsmith
Member

It might be nice to be able to specify which columns you care about for your Table; the user could then choose not to include the nested columns.

There's a mechanism right now to provide column instructions:

from deephaven.parquet import read, ColumnInstruction

t = read(
    path="/snappy.parquet",
    col_instructions=[
        ColumnInstruction(column_name="date", parquet_column_name="date")
    ],
)

but this currently throws the error:

java.lang.UnsupportedOperationException: Encountered unsupported multi-column field outputs: found columns outputs/list/element/address and outputs/list/element/index
	at io.deephaven.parquet.table.ParquetSchemaReader.lambda$readParquetSchema$1(ParquetSchemaReader.java:174)
	at java.base/java.util.HashMap.compute(HashMap.java:1316)
	at io.deephaven.parquet.table.ParquetSchemaReader.readParquetSchema(ParquetSchemaReader.java:169)
	at io.deephaven.parquet.table.ParquetTools.convertSchema(ParquetTools.java:647)
	at io.deephaven.parquet.table.ParquetTools.readTableInternal(ParquetTools.java:384)
	at io.deephaven.parquet.table.ParquetTools.readTable(ParquetTools.java:94)
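
One way to picture the requested behavior is to apply the column selection before the schema check, so that unrequested nested fields are never inspected. The sketch below is plain Python with a hypothetical `check_supported` helper; it does not reflect how `ParquetSchemaReader` is written, and the leaf paths are taken from the error above plus the `date` column from the example.

```python
def check_supported(leaf_columns, wanted=None):
    """Return the usable top-level fields, checking only the requested ones.

    leaf_columns: slash-joined leaf column paths from the file schema.
    wanted: set of top-level field names to read, or None to read all.
    Hypothetical helper for illustration only.
    """
    by_field = {}
    for column in leaf_columns:
        field = column.split("/", 1)[0]
        if wanted is not None and field not in wanted:
            # Not requested, so it is skipped before any validation.
            continue
        by_field.setdefault(field, []).append(column)

    for field, columns in by_field.items():
        if len(columns) > 1:
            raise ValueError(
                f"unsupported multi-column field {field}: found columns "
                + " and ".join(columns)
            )
    return sorted(by_field)


LEAVES = [
    "date",
    "outputs/list/element/address",
    "outputs/list/element/index",
]

# Selecting only "date" succeeds, because "outputs" is never examined.
print(check_supported(LEAVES, wanted={"date"}))
```

Calling `check_supported(LEAVES)` with no selection raises, which mirrors the current behavior described above, where the whole schema is rejected even though only `date` was requested.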

@devinrsmith
Member

A user has hit this with the parquet viewer; see devinrsmith/deephaven-parquet-viewer#9

devinrsmith added a commit to devinrsmith/deephaven-core that referenced this issue Nov 15, 2023
Additionally, adds explicit entry points for single, flat-partitioned, and kv-partitioned reads.

Fixes deephaven#4746
Partial workaround for deephaven#871
devinrsmith added a commit that referenced this issue Nov 16, 2023
Additionally, adds explicit entry points for single, flat-partitioned, and kv-partitioned reads.

Fixes #4746
Partial workaround for #871