
Add support for dictionary stripes #68

Draft
wants to merge 1 commit into base: main

Conversation

@progval (Contributor) commented Mar 13, 2024

In Arrow, dictionary columns are a separate data type, while in ORC they are a per-stripe encoding. This means we cannot get an Arrow schema for a whole ORC file; the Arrow schema is only valid per stripe.

Unfortunately, this breaks a feature of this crate (which I assume is important for datafusion), and I don't see a way out. Thoughts?

This changes the failure in test1 from an "Incorrect datatype" error to a difference in serialized output.

(Note: test1.orc has a binary column, so you should apply #67 first if you want to see the change.)
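The mismatch described above can be sketched with some hypothetical, simplified types (these are not the actual arrow or orc-rust APIs): Arrow bakes dictionary encoding into the data type itself, while ORC chooses the encoding per column, per stripe, so a naive per-stripe mapping yields different Arrow types for the same column.

```rust
// Hypothetical, simplified types for illustration -- not the actual
// arrow or orc-rust APIs.

// In Arrow, dictionary encoding is part of the data type itself:
#[derive(Debug, PartialEq)]
enum ArrowDataType {
    Utf8,
    Dictionary(Box<ArrowDataType>),
}

// In ORC, the encoding is chosen per string column, per stripe:
enum OrcStringEncoding {
    Direct,
    Dictionary,
}

// A naive per-stripe mapping produces a different Arrow type
// depending on which encoding a given stripe chose:
fn arrow_type_for(encoding: &OrcStringEncoding) -> ArrowDataType {
    match encoding {
        OrcStringEncoding::Direct => ArrowDataType::Utf8,
        OrcStringEncoding::Dictionary => {
            ArrowDataType::Dictionary(Box::new(ArrowDataType::Utf8))
        }
    }
}

fn main() {
    // Two stripes of the same ORC string column, different encodings:
    let stripe_types = [
        arrow_type_for(&OrcStringEncoding::Direct),
        arrow_type_for(&OrcStringEncoding::Dictionary),
    ];
    // No single file-level Arrow schema covers both stripes:
    assert_ne!(stripe_types[0], stripe_types[1]);
    println!("{:?} vs {:?}", stripe_types[0], stripe_types[1]);
}
```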

@Jefffrey (Collaborator)

> In Arrow, dictionary columns are a separate data type, while in ORC they are a per-stripe encoding.

Oh this is a very good pickup 👀

Thanks for this, I'll try review soon 👍

@Jefffrey (Collaborator)

I'm thinking we'll have to read all strings into a String-typed array and forgo copying dictionary-encoded string stripe columns directly into Arrow dictionary arrays:

  • We want to be able to read an entire file as consistent record batches (with the same schema), since otherwise it could confuse the consumer (e.g. datafusion)
  • We don't want cases where reading only a particular stripe (e.g. due to pruning) gives a record batch with a different schema depending on which stripe is read
  • If multiple files are queried by datafusion, we wouldn't want some to have a dictionary type array whilst others have a string type array

So I think we will need to change the logic for decoding dictionary-encoded string stripes to just decode to a regular StringArray
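The decode step being proposed can be sketched in pure std (this is not the crate's actual decoder): a dictionary-encoded string column stores each distinct value once in a dictionary, plus one index per row, so decoding to a plain string column is just a gather.

```rust
// Pure-std sketch, not the crate's actual decoder: decode a
// dictionary-encoded string stripe to a plain string column by
// gathering dictionary[index] for every row.
fn decode_dictionary(dictionary: &[&str], indexes: &[usize]) -> Vec<String> {
    indexes.iter().map(|&i| dictionary[i].to_string()).collect()
}

fn main() {
    // Stripe stores two distinct values and four row indexes:
    let dictionary = ["apple", "banana"];
    let indexes = [1, 0, 1, 1];
    let decoded = decode_dictionary(&dictionary, &indexes);
    assert_eq!(decoded, ["banana", "apple", "banana", "banana"]);
}
```

Decoding this way gives every stripe (and hence every record batch) the same plain string schema, at the cost of materializing repeated values.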

@progval (Contributor, Author) commented Mar 14, 2024

So this means datafusion won't be able to use ORC dictionaries for predicate pushdown, right?

@Jefffrey (Collaborator)

> So this means datafusion won't be able to use ORC dictionaries for predicate pushdown, right?

I haven't thought that far ahead yet, honestly. The main takeaway is that we'll always decode to StringArrays. As for how that's done internally, and how it affects predicate pushdown, that remains to be seen.

@Jefffrey (Collaborator)

So I realized there is an arrow kernel for casting/converting from dictionary to primitive, so I used it as a quick fix: 7f66552

My understanding of arrow dictionary encoding and how it interacts with datafusion is still immature so I'm continuing to do some reading on this matter (some of my assumptions in above comments are probably incorrect).

I noticed parquet had a similar issue which I have been reading through: apache/arrow-rs#171
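Conceptually, a dictionary-to-string cast like the one that kernel performs is a gather that also propagates null keys; here is a minimal pure-std sketch (this is not the arrow-rs `cast` API, and the function name is made up for illustration):

```rust
// Sketch of what a dictionary-to-string cast does conceptually
// (NOT the arrow-rs `cast` kernel API): gather dictionary values
// by key, carrying null keys through as null output slots.
fn cast_dictionary_to_utf8(
    dictionary: &[&str],
    keys: &[Option<usize>], // None models a null row
) -> Vec<Option<String>> {
    keys.iter()
        .map(|key| key.map(|i| dictionary[i].to_string()))
        .collect()
}

fn main() {
    let dictionary = ["x", "y"];
    let keys = [Some(0), None, Some(1)];
    let out = cast_dictionary_to_utf8(&dictionary, &keys);
    assert_eq!(out, [Some("x".to_string()), None, Some("y".to_string())]);
}
```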

@Jefffrey (Collaborator)

I've created an issue here for tracking: #72

I probably won't spend too much more time investigating this until the crate becomes more feature-complete (will focus on correctness over performance for now), but will appreciate any further insights/contributions 👍
