
Add support for dictionary stripes #68

Draft
wants to merge 1 commit into base: main

Conversation

@progval (Contributor) commented Mar 13, 2024

In Arrow, dictionary columns are a separate data type, while in ORC they are a per-stripe encoding. This means we cannot get an Arrow schema for a whole ORC file; the Arrow schema is only valid per stripe.

Unfortunately, this breaks a feature of this crate (which I assume is important for datafusion), and I don't see a way out. Thoughts?

This changes the failure in test1 from an "Incorrect datatype" error to a difference in serialized output.

(Note: test1.orc has a binary column, so you should apply #67 first if you want to see the change.)
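The mismatch described above can be sketched with some hypothetical, simplified types (these are not the actual arrow or orc-rust APIs): Arrow bakes dictionary encoding into the data type itself, while ORC chooses the encoding per column, per stripe, so a naive per-stripe mapping yields different Arrow types for the same column.

```rust
// Hypothetical, simplified types for illustration -- not the actual
// arrow or orc-rust APIs.

// In Arrow, dictionary encoding is part of the data type itself:
#[derive(Debug, PartialEq)]
enum ArrowDataType {
    Utf8,
    Dictionary(Box<ArrowDataType>),
}

// In ORC, the encoding is chosen per string column, per stripe:
enum OrcStringEncoding {
    Direct,
    Dictionary,
}

// A naive per-stripe mapping produces a different Arrow type
// depending on which encoding a given stripe chose:
fn arrow_type_for(encoding: &OrcStringEncoding) -> ArrowDataType {
    match encoding {
        OrcStringEncoding::Direct => ArrowDataType::Utf8,
        OrcStringEncoding::Dictionary => {
            ArrowDataType::Dictionary(Box::new(ArrowDataType::Utf8))
        }
    }
}

fn main() {
    // Two stripes of the same ORC string column, different encodings:
    let stripe_types = [
        arrow_type_for(&OrcStringEncoding::Direct),
        arrow_type_for(&OrcStringEncoding::Dictionary),
    ];
    // No single file-level Arrow schema covers both stripes:
    assert_ne!(stripe_types[0], stripe_types[1]);
    println!("{:?} vs {:?}", stripe_types[0], stripe_types[1]);
}
```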

@Jefffrey (Collaborator)

> In Arrow, dictionary columns are a separate data type, while in ORC they are a per-stripe encoding.

Oh this is a very good pickup 👀

Thanks for this, I'll try review soon 👍

@Jefffrey (Collaborator)

I'm thinking we'll have to read all strings into a String-typed array and forgo copying dictionary-encoded string stripe columns directly into Arrow dictionary arrays:

  • We want to be able to read an entire file as consistent record batches (with the same schema), since otherwise it could confuse the consumer (e.g. datafusion)
  • We don't want cases where reading only a particular stripe (e.g. due to pruning) gives a record batch with a different schema depending on which stripe is read
  • If multiple files are queried by datafusion, we wouldn't want some to have a dictionary type array whilst others have a string type array

So I think we will need to change the logic for decoding dictionary-encoded string stripes to just decode to a regular StringArray
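The decode step being proposed can be sketched in pure std (this is not the crate's actual decoder): a dictionary-encoded string column stores each distinct value once in a dictionary, plus one index per row, so decoding to a plain string column is just a gather.

```rust
// Pure-std sketch, not the crate's actual decoder: decode a
// dictionary-encoded string stripe to a plain string column by
// gathering dictionary[index] for every row.
fn decode_dictionary(dictionary: &[&str], indexes: &[usize]) -> Vec<String> {
    indexes.iter().map(|&i| dictionary[i].to_string()).collect()
}

fn main() {
    // Stripe stores two distinct values and four row indexes:
    let dictionary = ["apple", "banana"];
    let indexes = [1, 0, 1, 1];
    let decoded = decode_dictionary(&dictionary, &indexes);
    assert_eq!(decoded, ["banana", "apple", "banana", "banana"]);
}
```

Decoding this way gives every stripe (and hence every record batch) the same plain string schema, at the cost of materializing repeated values.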

@progval (Contributor, Author) commented Mar 14, 2024

So this means datafusion won't be able to use ORC dictionaries for predicate pushdown, right?

@Jefffrey (Collaborator)

> So this means datafusion won't be able to use ORC dictionaries for predicate pushdown, right?

I haven't thought that far ahead yet, honestly. The main takeaway is that we'll always decode to StringArrays. As for how that's done internally, and how it affects predicate pushdown, that remains to be seen.

@Jefffrey (Collaborator)

So I realized there is an arrow kernel for casting/converting from dictionary to primitive, so I used it as a quick fix: 7f66552

My understanding of arrow dictionary encoding and how it interacts with datafusion is still immature so I'm continuing to do some reading on this matter (some of my assumptions in above comments are probably incorrect).

I noticed parquet had a similar issue which I have been reading through: apache/arrow-rs#171
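Conceptually, a dictionary-to-string cast like the one that kernel performs is a gather that also propagates null keys; here is a minimal pure-std sketch (this is not the arrow-rs `cast` API, and the function name is made up for illustration):

```rust
// Sketch of what a dictionary-to-string cast does conceptually
// (NOT the arrow-rs `cast` kernel API): gather dictionary values
// by key, carrying null keys through as null output slots.
fn cast_dictionary_to_utf8(
    dictionary: &[&str],
    keys: &[Option<usize>], // None models a null row
) -> Vec<Option<String>> {
    keys.iter()
        .map(|key| key.map(|i| dictionary[i].to_string()))
        .collect()
}

fn main() {
    let dictionary = ["x", "y"];
    let keys = [Some(0), None, Some(1)];
    let out = cast_dictionary_to_utf8(&dictionary, &keys);
    assert_eq!(out, [Some("x".to_string()), None, Some("y".to_string())]);
}
```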

@Jefffrey (Collaborator)

I've created an issue here for tracking: #72

I probably won't spend too much more time investigating this until the crate becomes more feature-complete (will focus on correctness over performance for now), but will appreciate any further insights/contributions 👍
