Add schema keyword argument to to_parquet (#5150)
Conversation
|
cc @rjzamora |
|
It looks like there is a failing test |
|
@birdsarah Thanks for taking this on! Feel free to have at it! |
I have verified that this test fails if the code change is not present.
    ("partition_column", pa.int64()),
    ]
)
engine = "pyarrow" |
Let's just inline "pyarrow" rather than have an engine variable perhaps
Co-Authored-By: Matthew Rocklin <mrocklin@gmail.com>
…irdsarah/dask into fixes-inconsistent-schema-part-pyarrow
|
@mmccarty sorry for all the pings. I've decided to take ownership properly and am building on your tests. |
|
@mrocklin note the assortment of assertions that are now present. Also note that setting tz to, say, 'US/Eastern' in place of 'UTC' makes the timezone test fail. I have now spent more time than I have trying to debug this, and I don't understand timezone support very well. The result is that the times come back out wrong. If anyone wants to jump in and fix it, great. Otherwise we should note the limitations of timezone support somewhere, written by someone who understands what's gone wrong. |
|
Need some help. I have a test failing on travis https://travis-ci.org/dask/dask/jobs/563346766#L1074 that does not fail locally. Any suggestions on how to fix gratefully received. |
|
@birdsarah Sorry I can't be very helpful now, but I should be able to take a closer look at this tomorrow morning. My quick suggestion would be to set |
|
Thanks @rjzamora, I've given that a try. We'll see how it fares. I'd obviously prefer to know why there's a discrepancy between local and travis. The only obvious thing I can see is that travis has distributed pinned. But I can't imagine why that would affect this code path. |
|
Sigh. I'm back to an error I battled with extensively locally. Would appreciate a snippet of code from someone with the best way to test the timezone columns in and out. |
|
I don't have a great understanding of timestamp types, but the difference in indices before and after the partition seems to be causing the issue. Resetting the index seems to work for me:

    assert df.timestamps.equals(
        ddf_after_write.timestamps.compute().reset_index(drop=True)
    )
|
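The index point is easy to see with plain pandas (a toy sketch independent of parquet; the values and names are illustrative):

```python
import pandas as pd

# Same values, but the second series keeps the index it had inside its partition.
before = pd.Series([10, 20, 30], name="timestamps")
after = pd.Series([10, 20, 30], index=[0, 0, 1], name="timestamps")

# Series.equals compares indices as well as values, so this is False...
print(before.equals(after))                          # False
# ...but dropping the partition-local index restores equality.
print(before.equals(after.reset_index(drop=True)))   # True
```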
Note timezones are not supported: they come back from a read tz-naive.
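For reference, "tz-naive" in pandas terms looks like this (a toy sketch; the dates are illustrative and no parquet round trip is involved):

```python
import pandas as pd

# A tz-aware series, as the test writes it out.
aware = pd.Series(pd.date_range("2019-07-01", periods=2, tz="UTC"))
print(aware.dt.tz)   # UTC

# What the round trip effectively does today: the tz information is dropped.
naive = aware.dt.tz_localize(None)
print(naive.dt.tz)   # None
```

The wall-clock values still match after localizing away the timezone, which is why value-level assertions can pass even though the dtypes differ.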
|
Fixed this up so I hope it reliably passes. It's a little verbose, but it also now documents that timezones are a problem: they come out of the write-read cycle set up in the test as timezone-naive objects. |
|
@birdsarah Wow! You've been busy on this one! Sorry I was slow to respond about the tests. Looks like you've got it now. I'll get caught up on this PR and close the old one. |
|
@birdsarah It would be helpful to get this fix into the next release. If you can't get back to it soon and want some help let me know. I can try to find help getting that unit test wrapped up. |
|
@rjzamora is taking a look. Also @birdsarah, I notice a number of black formatting commits. You may be interested in https://docs.dask.org/en/latest/develop.html#code-formatting, which recommends the following: |
|
working on it now |
|
PR #5157 now includes these changes (or similar), and hopefully the CI will pass. If @birdsarah does have time to push this through, I can certainly sync again with this PR after it is merged. [EDIT: Scratch that - 5157 does not have correct timestamp assertions at the moment.] |
|
This should be it. I have no idea why locally these tests would pass but not on CI. Things should be super explicit now. |
|
This is now passing. Let me know if you'd like me to add the compute test. |
No need to add it. The changes look good to me. My only suggestion might be to use shorter names for the new tests, but that's probably just a personal preference. |
|
I'm afraid I prefer long test names. So we'll have to agree to disagree or someone can point me to a style guide. |
|
Thanks to all!
|
|
What's the status here? Is this good to go in? |
|
+1 from me |
|
Great! Merging in. Thanks @birdsarah and @mmccarty for the work and @rjzamora for review. |
|
Is there a way to print out the pyarrow field list / schema so I can save it in a file? Specifically, I'd like to:
Some files have hundreds of columns, so I don't want to categorize each one manually. The inference from Schema.from_pandas() is good for 99% of the columns; there are just a few I need to tweak, and it'd be nice to have a hard-coded reference (similar to a CREATE TABLE script in SQL). |
Fixes #4194
This supersedes @mmccarty's PR #4851. It takes his tests but works with the new parquet code.
@mmccarty, hope you don't mind me just doing this; I was feeling motivated this morning.
black dask
flake8 dask