Saving and loading partitioned parquet with empty partitions is inconsistent #5252
Is it fair to summarize the issue as "DataFrame.repartition sometimes(?) fails on a parquet dataset with some(?) empty partitions"?
|
Can you tell, is this an issue in pyarrow, or in how Dask is using pyarrow?
|
One thing I note (for pyarrow): the divisions are different before and after reading.
|
(Justin Waugh, @bluecoconut, Aug 15, 2019:) Yeah, there's something odd happening in partitions / divisions on loads of parquet files; this is new behavior relative to previous versions of dask. Specifically, I'm getting the error E ValueError: New division must be unique, except for the last element, likely caused by the division rewriting before/after reading that you identified. (All saves/reloads should, in my opinion, result in the same dataframe.) I can put together tests for a PR, but the bigger question is what changed in how parquet divisions are being handled in 2.2.0, and why? I haven't spent the time digging through the changes exactly yet. I tried digging, but I couldn't easily parse and follow how the schema is being generated and passed around.
|
Tests would be very welcome. Perhaps @rjzamora could take a look at things afterwards.
|
Thanks for raising @bluecoconut - Sorry about this! The goal of the parquet refactor was to break things down into a clear core-engine structure (consistent between arrow and fastparquet). We tried not to change/break existing behavior, but much of the code was ultimately rewritten and the test coverage was certainly not perfect. I will take a look at the code snippets you shared here. Any other tests/information you can provide would certainly help me resolve the problems. |
@rjzamora Any update on this? I don't have any better examples or tests than the ones provided. Specifically, the two key issues are: […]
(This throws an error due to schema mismatch.)
|
Thanks for the nudge @bluecoconut, and thanks for providing a specific example for (1) - I will take a look today. Can you confirm that both cases behave as you want in 2.1.0 and not in 2.3.0? I ran some quick tests last week and found that your repartition test behaved exactly the same for both new and old versions of dask; however, I'll need to check this again (it might have been an environment mishap on my end). |
I think the second case (2), with the partition number changing, specifically showed up in the newer dask. For the schema changing in pyarrow (1), I'm not entirely sure when the behavior started, but it seemed to appear after we updated from 0.12 (pinned due to a serialization bug) to the newest (0.14) version of arrow. But this happened at the same time as our dask update, so without building out a compatibility matrix, I can't be sure exactly when this particular behavior appeared. I can try to find some time soon to pinpoint which versions and changes caused what. |
Just started looking at this - the first problem is definitely related to changes in pyarrow (rather than dask); both new and old versions of dask seem to return the same error here. I'll see what I can do to smooth this over :) |
@bluecoconut - Just an update here... I looked into the pyarrow problem for a bit and raised #5307 as a possible fix. However, I am not totally convinced that everything in that PR is ideal/appropriate. That is, I'm not sure that we should set […] by default.

In the current master branch, you should not get the schema error if you explicitly define the schema when writing the parquet dataset. From your example above:

```python
schema = pa.schema(
    [
        ("x", pa.float64()),
        ("timestamp", pa.timestamp("ns")),
        ("id", pa.int64()),
        ("name", pa.string()),
        ("y", pa.float64()),
    ]
)
df.to_parquet("wow.pq", schema=schema, engine="pyarrow")
dd.read_parquet("wow.pq", engine="pyarrow")
```

Note that the […]. If we decide not to set […] by default, schema validation can also be skipped at read time:

```python
dd.read_parquet("wow.pq", dataset={"validate_schema": False}, engine="pyarrow")
```
|
I definitely understand the challenge with the pyarrow schema validation - thanks for chasing this! For now I'm going to generate […]. Is that a possible avenue here for a more general solution in dask? (e.g. dask dataframes should be able to reliably use their meta to help inform a universal schema on save?) Also, separate from that bug but part of this same issue ticket: what should the path be to fix the problem with loading partitioned parquet? (example below)
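A hypothetical sketch of that "schema from meta" avenue (the helper name and the dtype-to-arrow mapping below are illustrative, not dask's actual API): derive one uniform schema from the dataframe's meta so that every partition is written consistently, even when a partition is empty or all-null.

```python
import pandas as pd

# Illustrative mapping from pandas dtype names to arrow type names.
# (Hypothetical; the real pandas/pyarrow conversion covers far more types.)
_ARROW_NAMES = {
    "float64": "double",
    "int64": "int64",
    "object": "string",
    "datetime64[ns]": "timestamp[ns]",
}

def schema_from_meta(meta):
    # Build (column, arrow-type-name) pairs from the meta's dtypes, so
    # empty partitions don't fall back to per-partition type inference.
    return [(col, _ARROW_NAMES.get(str(dtype), str(dtype)))
            for col, dtype in meta.dtypes.items()]

# An empty "meta" frame still carries the dtypes we need:
meta = pd.DataFrame({"x": pd.Series(dtype="float64"),
                     "name": pd.Series(dtype="object")})
print(schema_from_meta(meta))  # [('x', 'double'), ('name', 'string')]
```

The point of the sketch is only that the meta already holds enough type information to pin a schema before any partition is materialized.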
Is this something I should put in a separate issue / try to come up with a simpler, more straightforward test for? The issue here is that the behavior changes depending on the initialization of the timeseries dataset itself (and therefore the distribution of the time column), which affects the loaded partition divisions. Effectively, I think there should be something that validates that saving a partitioned dataframe and loading it results in the same divisions across many different odd inputs (entirely empty dataframe partitions, etc.). |
Yes - good suggestion. I think that is probably doable.
I didn't get to spend much time on this one yet. I understand your point about wanting the same partitions throughout the parquet round trip. Unfortunately, this may be tricky: when the dataframe is written to a parquet dataset, the metadata only contains row-group statistics about the data that is actually written. During […]. With that said, I do feel that a following […] |
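A toy illustration of why the round trip is tricky (purely hypothetical logic, not dask's implementation): if divisions are rebuilt from per-row-group (min, max) statistics, empty row-groups contribute no statistics and simply vanish, so the reloaded divisions cannot match the original partitioning.

```python
def divisions_from_stats(stats):
    # stats: one (min, max) tuple per row-group, or None for an empty
    # row-group that carries no statistics in the parquet metadata.
    stats = [s for s in stats if s is not None]
    if not stats:
        return (None, None)
    # Divisions are the row-group minimums plus the final maximum.
    return tuple(mn for mn, _ in stats) + (stats[-1][1],)

# Four original partitions, two of them empty:
print(divisions_from_stats([(0, 4), None, (5, 9), None]))
# (0, 5, 9) -- three boundaries survive instead of the original five
```

Any reader working only from written statistics loses the empty partitions, which is why recovering identical divisions would need extra information recorded at write time.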
From #5307 (comment)
|
Problems
There seem to be a few issues with saving and loading partitioned dataframes where some partitions are empty or where columns have null values. I'm also finding that the behavior is inconsistent and semi-unpredictable.
Environment
Python environment:
Example Code
Fastparquet
Fastparquet Output (be sure to scroll right, and note how they fail at different times that seem unrelated(?) to the df length)
Pyarrow Input
(Almost exactly the same code; note the change to repartition(npartitions=5), to give it a fighting chance.)
Pyarrow Output
Actual new error
(From previous versions of dask, our tests were passing (with a test that caused something like this), but now they are failing.) This is due to code in dataframe.core.check_divisions:

```python
if len(divisions[:-1]) != len(list(unique(divisions[:-1]))):
```

which raises:

```
ValueError: New division must be unique, except for the last element
```
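For reference, a simplified standalone version of that check (dask's actual implementation lives in dask.dataframe.core and uses toolz's unique; this sketch only shows the invariant being enforced): all division boundaries except the last must be unique, because the final boundary is inclusive and may legitimately repeat.

```python
def check_divisions(divisions):
    # All boundaries except the last must be unique; the last boundary
    # is inclusive and may repeat the second-to-last one.
    if len(divisions[:-1]) != len(set(divisions[:-1])):
        raise ValueError(
            "New division must be unique, except for the last element"
        )

check_divisions((0, 5, 10, 10))      # fine: only the final boundary repeats
try:
    check_divisions((0, 5, 5, 10))   # duplicate interior boundary
except ValueError as err:
    print(err)
```

Divisions rebuilt from a dataset with empty partitions can easily contain such interior duplicates, which is exactly when this error fires.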
Expected behavior
In all of the cases above I wouldn't expect any errors. I don't know ahead of time whether dataframes will be empty on individual partitions, and on reload I want to assume I can "split" to a certain degree (even if that means creating empty partitions).
Final notes:
Of particular worry are the new(?) schema errors across the dataframes (I hadn't seen them before, on pyarrow 0.12, which we were pinned to until now), which complain about nulls existing. This makes me think that the metadata is getting written "per partition". I was able to show this behavior purely by editing data on a single partition (also intermittently, so it's hard to be sure this will reproduce).
Separate (but related?) issue with Nones in the schema: the read raises

```
ValueError: Schema in /home/jawaugh/code/notebooks/justin/dask-partitions/wow.pq/part.12.parquet was different.
```