
Add schema keyword argument to to_parquet #5150

Merged
mrocklin merged 15 commits into dask:master from birdsarah:fixes-inconsistent-schema-part-pyarrow
Jul 31, 2019

Conversation

@birdsarah
Contributor

@birdsarah birdsarah commented Jul 24, 2019

Fixes #4194

This supersedes @mmccarty's PR #4851. It takes his tests but works with the new parquet code.

@mmccarty hope you don't mind me just doing this; I was feeling motivated this morning.

  • Tests added / passed
  • Passes black dask / flake8 dask

@mrocklin
Member

cc @rjzamora

@mrocklin
Member

It looks like there is a failing test

@mmccarty
Member

@birdsarah Thanks for taking this on! Feel free to have at it!

I have verified this test fails if the code change is not present.
("partition_column", pa.int64()),
]
)
engine = "pyarrow"
Member


Let's just inline "pyarrow" rather than have an engine variable perhaps

birdsarah and others added 4 commits July 24, 2019 18:46
@birdsarah
Contributor Author

@mmccarty sorry for all the pings. I've decided to take ownership properly and am building on your tests.

@birdsarah
Contributor Author

@mrocklin the assortment of assertions now present (df.timestamps.equals and np.array_equal) are the only ones I could get working. Any different assertions would need to be added by someone with more experience than me.

Note that if you set tz to, say, 'US/Eastern' in place of 'UTC', the timezone test fails. I have now spent more time than I have trying to debug this. I don't understand timezone support very well. The result is that the times come back out wrong. If anyone wants to jump in and fix it, great. Otherwise we should note the limitations of timezone support somewhere, written by someone who understands what's gone wrong.
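
For illustration of where non-UTC zones bite (a pandas-only sketch, not the dask round-trip itself): a tz-aware timestamp carries both a wall-clock time and an instant, and dropping the zone has to choose one of the two. If a read path returns UTC wall times as naive values, they will not match the original local times, which is the symptom described above.

```python
import pandas as pd

aware = pd.Timestamp("2019-07-24 12:00", tz="US/Eastern")

# Dropping the zone directly keeps the local wall-clock time:
naive_wall = aware.tz_localize(None)

# Converting to UTC first keeps the instant but changes the wall time:
naive_utc = aware.tz_convert("UTC").tz_localize(None)
```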

@birdsarah
Contributor Author

Need some help. I have a test failing on travis https://travis-ci.org/dask/dask/jobs/563346766#L1074 that does not fail locally. Any suggestions on how to fix gratefully received.

@rjzamora
Member

@birdsarah Sorry I can't be very helpful now, but I should be able to take a closer look at this tomorrow morning. My quick suggestion would be to set gather_statistics=False in the read_parquet call (because you probably don't need statistics/divisions for this test anyway).

@birdsarah
Contributor Author

Thanks @rjzamora, I've given that a try. We'll see how it fares. I'd obviously prefer to know why there's a discrepancy between local and travis. The only obvious thing I can see is that travis has distributed pinned. But I can't imagine why that would affect this code path.

@birdsarah
Contributor Author

Sigh. I'm back to an error I battled with extensively locally. Would appreciate a snippet of code from someone with the best way to test the timezone columns in and out.

@rjzamora
Member

I don't have a great understanding of timestamp types, but the difference in indices before and after the partition seems to be causing the issue. Resetting the index seems to work for me:

assert df.timestamps.equals(
    ddf_after_write.timestamps.compute().reset_index(drop=True)
)

Note timezones are not supported. They come out from read tz-naive.
@birdsarah
Contributor Author

Fixed this up so I hope it reliably passes. It's a little verbose but NaT != NaT so this seems clear at least.

Also now documents that timezones are a problem. They come out of the write-read cycle setup in the test as timezone naive objects.

@mmccarty
Member

@birdsarah Wow! You've been busy on this one! Sorry I was slow to respond about the tests. Looks like you've got it now. I'll get caught up on this PR and close the old one.

@mmccarty
Member

@birdsarah It would be helpful to get this fix into the next release. If you can't get back to it soon and want some help let me know. I can try to find help getting that unit test wrapped up.

@mrocklin
Member

@rjzamora is taking a look

Also @birdsarah I notice a number of black formatting commits. You may be interested in https://docs.dask.org/en/latest/develop.html#code-formatting

Which recommends the following:

pip install pre-commit
pre-commit install

@birdsarah
Contributor Author

working on it now

@rjzamora
Member

rjzamora commented Jul 29, 2019

PR#5157 now includes these changes (or similar), and hopefully the CI will pass... If @birdsarah does have time to push this through, I can certainly sync again with this PR after it is merged

[EDIT: Scratch that - 5157 does not have correct timestamp assertions at the moment.]

@birdsarah
Contributor Author

This should be it. I have no idea why locally these tests would pass but not on CI. Things should be super explicit now.

@birdsarah
Contributor Author

This is now passing. Let me know if you'd like me to add the compute test.

@rjzamora
Member

This is now passing. Let me know if you'd like me to add the compute test.

No need to add compute here. I will add it to #5157 if it really is unavoidable.

The changes look good to me. My only suggestion might be to use shorter names for the new tests - but that's probably just a personal preference

@birdsarah
Contributor Author

I'm afraid I prefer long test names. So we'll have to agree to disagree or someone can point me to a style guide.

@mmccarty
Member

mmccarty commented Jul 29, 2019 via email

@mrocklin
Member

What's the status here? Is this good to go in?

@rjzamora
Member

+1 from me

@mrocklin
Member

Great! Merging in. Thanks @birdsarah and @mmccarty for the work and @rjzamora for review.

@ldacey

ldacey commented Mar 15, 2020

Is there a way to print out the pyarrow field list / schema so I can save it in a file?

Specifically, I'd like to:

  1. Clean up my dataframe and call schema = pa.Schema.from_pandas(df)
  2. Print out a list of pa.fields like this:
fields = [
    pa.field('id', pa.int64()),
    pa.field('secondaryid', pa.int64()),
    pa.field('date', pa.timestamp('ms')), 
    pa.field('status', pa.dictionary(pa.int8(), pa.string(), ordered=False)),
]
  3. Edit those fields, if needed, and then save them in a file
  4. Refer to the list of fields when I save a parquet file (to help eliminate issues where a partition might have all null values or mixed dtypes for a certain column) when reading a bunch of files in with Dask

Some files have hundreds of columns so I don't want to categorize each one manually. The inference from Schema.from_pandas() is good for 99% of the columns; there are just a few I need to tweak, and it'd be nice to have a hard-coded reference (similar to a create table script in SQL).



Development

Successfully merging this pull request may close these issues.

Inconsistent Schema with partitioned pyarrow files

5 participants