Dask Orc Files - Mismatched Schema Feature Request #5957

Open
LongBeach805 opened this issue Feb 28, 2020 · 2 comments

LongBeach805 commented Feb 28, 2020

It would be nice to be able to select multiple ORC files at once that contain different schemas. For instance, suppose the user has three files with structs:

  1. struct<id, name, city, state>
  2. struct<id, city, state>
  3. struct<id, name, state>

And then uses the following code:

data = dd.read_orc(path, columns=['id', 'name', 'state'])

Dask should be able to read from all of the files and fill in nulls for any columns that are missing from a given file, instead of raising ValueError: Incompatible schemas while parsing ORC files.
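
For reference, one way to confirm where the schemas diverge is to inspect each file with pyarrow (which dask's ORC reader uses under the hood). A minimal sketch; the paths are hypothetical stand-ins for the three files above:

```python
from pyarrow import orc

# Hypothetical stand-ins for the three ORC files described above.
paths = ["part1.orc", "part2.orc", "part3.orc"]

# Print each file's column names to confirm where the schemas diverge.
for path in paths:
    schema = orc.ORCFile(path).schema
    print(path, schema.names)
```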

@mrocklin
Member

cc'ing @rjzamora for triage

From my perspective something like this sounds nice, but I suspect it also introduces some complexity. I wouldn't be surprised to see some resistance from people who think about this kind of thing, but in general @rjzamora or others who are more familiar with this space will likely have better-informed thoughts.


LongBeach805 commented Feb 28, 2020

My hack is basically to dd.concat([d1, d2, d3, ...]) after processing each dataframe into a common structure. Thank you for taking a look into this.
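
A minimal sketch of that kind of workaround, assuming hypothetical file names: each file is read on its own, missing columns are added as nulls, and the parts are concatenated.

```python
import dask.dataframe as dd

# Hypothetical paths; each file carries a different subset of columns.
paths = ["part1.orc", "part2.orc", "part3.orc"]
wanted = ["id", "name", "state"]

parts = []
for p in paths:
    df = dd.read_orc(p)
    # Add any column this file is missing, filled with nulls,
    # then select a common column order so the schemas line up.
    for col in wanted:
        if col not in df.columns:
            df[col] = None
    parts.append(df[wanted])

data = dd.concat(parts)
```

This is essentially the behaviour the feature request asks read_orc to perform automatically.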
