
Parquet metadata-handling improvements #5218

Merged
martindurant merged 4 commits into dask:master from rjzamora:fp-improvements on Aug 13, 2019

Conversation

@rjzamora (Contributor) commented Aug 2, 2019

This PR is intended to address some components of #5212 and #5200.

  • I have added simple logic to drop statistics for any column with None values for min or max in any column chunk. I decided not to drop all metadata when None is found, because it is possible that the bad statistics are isolated to column(s) that will not be used as the index (and thus do not affect the divisions). This will hopefully help with #5212

  • I am including changes recommended by @martindurant here (note that I didn't make any real effort to "clean up" the code here, but it might be fine as is)

  • Tests added / passed

  • Passes black dask / flake8 dask

cs_min = column.statistics.min
cs_max = column.statistics.max
if None in [cs_min, cs_max]:
    # A missing min or max in any column chunk makes this column's
    # statistics unusable, so exclude it from the gathered stats.
    skip_cols.add(name)
    continue

@martindurant (Member) commented Aug 2, 2019

Perfect. This should probably be done in fastparquet too when generating the stats values, but I suppose in principle, people might want to use them for something other than making dask divisions lists.
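For context, a minimal sketch of how a check like this might sit inside a full statistics pass; the row-group and column-chunk objects and names below are illustrative only, not the actual dask or fastparquet internals:

def columns_with_usable_stats(row_groups):
    # Return column names whose min/max statistics are present in every
    # column chunk; columns missing either value anywhere are skipped so
    # they cannot be used to build divisions.
    seen, skip_cols = set(), set()
    for rg in row_groups:
        for column in rg.columns:
            name = column.name
            seen.add(name)
            stats = column.statistics
            if stats is None or None in [stats.min, stats.max]:
                skip_cols.add(name)
    return seen - skip_cols

Only columns surviving such a filter would keep their statistics when divisions are computed.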

@martindurant (Member) commented Aug 2, 2019

I guess it would be good to measure the bytes size of the serialised complete graph for a dd.read_parquet like the one in the linked issue with reasonable number of columns and groups, with and without this change. That would be a decent proxy for the cost of doing that de/serialisation.
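The measurement being suggested might look roughly like the following (the path is a placeholder; the idea is to compare the pickled size of the collection with and without the dropped statistics):

import cloudpickle
import dask.dataframe as dd

df = dd.read_parquet("dataset.parquet", engine="fastparquet")
print(len(cloudpickle.dumps(df)))  # serialised size of the graph plus metadata, in bytes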

@martindurant (Member) commented Aug 2, 2019

Had a moment and did this: len(cloudpickle.dumps(df)) gives 60002 bytes with this change, 87668 without.

@rjzamora (Contributor, Author) commented Aug 5, 2019

A separate problem? https://stackoverflow.com/questions/57345240/problem-with-opening-parquet-files-with-fastparquet-after-updating-to-latest-ver

I'm actually having trouble reproducing this. It doesn't really make much sense to me why the "partitioned_on" columns would need to be specified in the columns argument, or why the index would not be detected correctly. Is the @ notation in '@timestamp' significant? (i.e. I don't personally know whether the @ has some important meaning.)

@martindurant (Member) commented Aug 5, 2019

No, the "@" is just a character.

I think the point is that the _meta should include the columns arising from directory-based partitioning, whether or not they are specified in the columns= kwarg; but these columns do not show up in the schema of _metadata or _common_metadata (though they should be in the pandas JSON metadata).

The index thing I don't understand, since presumably the dataframe being written should have specified this. But certainly, a test that looks like this example would be a useful addition.

@rjzamora (Contributor, Author) commented Aug 5, 2019

I think the point is that the _meta should include the columns arising from directory-based partitioning, whether or not they are specified in the columns= kwarg; but these columns do not show up in the schema of _metadata or _common_metadata (though they should be in the pandas JSON metadata).

Ah, I think I'm understanding a bit (although still unable to reproduce the problem). These columns should indeed be in the JSON metadata, but they are excluded from meta right here. During compute(), however, fastparquet "knows" to include the partitioned-on columns, even if they are not specified in columns.
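As a hypothetical illustration of the situation (the path, column names, and partition column below are made up), the problem would show up as the lazy columns disagreeing with the computed ones:

import pandas as pd
import dask.dataframe as dd

# Write a small dataset with directory-based partitioning, so "year" lives in
# the directory names and the pandas JSON metadata rather than the file schema.
pdf = pd.DataFrame({"year": [2018, 2018, 2019, 2019], "value": [1.0, 2.0, 3.0, 4.0]})
dd.from_pandas(pdf, npartitions=2).to_parquet(
    "partitioned_data", engine="fastparquet", partition_on=["year"]
)

# Read back while omitting the partition column from columns=.
ddf = dd.read_parquet("partitioned_data", engine="fastparquet", columns=["value"])
print(list(ddf.columns))            # what _meta promises
print(list(ddf.compute().columns))  # what fastparquet actually returns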

@martindurant (Member) commented Aug 5, 2019

Aha, exactly, those columns should include directory-derived ones, whether inferred from the filenames or from pandas metadata

@martindurant (Member) commented Aug 5, 2019

(the original meta should already include them, so this is simple to change)

@rjzamora (Contributor, Author) commented Aug 5, 2019

Aha, exactly, those columns should include directory-derived ones, whether inferred from the filenames or from pandas metadata

Just to be clear, this statement suggests that this current assertion/test should not be true, is that correct? In test_partition_on_cats_2, we are making the assumption that the columns argument really does mean that all other columns should be left out of the resulting dataframe.

@martindurant (Member) commented Aug 5, 2019

Well, in previous versions you could not unselect a directory-derived column; but I suppose you are right and it is reasonable to do this, and then instead of changing how meta is made, we need the reader to exclude the inferred column (perhaps only possible after it was created anyway).

@martindurant (Member) commented Aug 5, 2019

Plus in that test, we didn't actually ensure that the data could be loaded, which I think would have found this issue.

@rjzamora (Contributor, Author) commented Aug 5, 2019

Plus in that test, we didn't actually ensure that the data could be loaded, which I think would have found this issue.

Are you saying that we aren't actually using compute()?

@martindurant (Member) commented Aug 5, 2019

Are you saying that we aren't actually using compute()?

Right, for the df that explicitly checks this columns= case

@mrocklin (Member) commented Aug 12, 2019

@rjzamora @martindurant what is the status here? It looks like this PR has gotten a bit stale.

@rjzamora (Contributor, Author) commented Aug 12, 2019

what is the status here? It looks like this PR has gotten a bit stale.

Initially, I was leaving this open in case we found any other problems. At this point, we should probably just wrap this up and open a new PR if another issue surfaces.

@martindurant (Member) commented Aug 12, 2019

Agreed, I think this is an improvement

@TomAugspurger (Member) commented Aug 12, 2019

@rjzamora I merged master and pushed to your branch to resolve the CI issues.

@TomAugspurger (Member) commented Aug 12, 2019

OK to remove the "WIP" in the title?

@martindurant are you comfortable with the PR, or does it need another review / touch up?

@rjzamora (Contributor, Author) commented Aug 12, 2019

I merged master and pushed to your branch to resolve the CI issues.

Thanks @TomAugspurger !

@martindurant (Member) commented Aug 12, 2019

@rjzamora , up to you when to remove "WIP", I'll have a look then.

@rjzamora changed the title from "[WIP] Parquet metadata-handling improvements" to "Parquet metadata-handling improvements" on Aug 12, 2019

@martindurant (Member) commented Aug 12, 2019

Does this resolve https://stackoverflow.com/questions/57345240/problem-with-opening-parquet-files-with-fastparquet-after-updating-to-latest-ver ? Maybe the specific example there could become a test.

@rjzamora (Contributor, Author) commented Aug 12, 2019

Does this resolve https://stackoverflow.com/questions/57345240/problem-with-opening-parquet-files-with-fastparquet-after-updating-to-latest-ver ? Maybe the specific example there could become a test.

I am now remembering that one concern I had with this PR was that I was not able to reproduce the original error reported on Stack Overflow. So, although I believe the changes should address that problem, I am not actually "sure".

Note that I also made the decision here that a call to read_parquet should indeed include "only" the columns specified by the columns argument (if not None), even if some missing columns correspond to partitions, etc.

@martindurant (Member) commented Aug 12, 2019

I also made the decision here that a call to read_parquet should indeed include "only" the columns specified by the columns argument (if not None), even if some missing columns correspond to partitions, etc.

I am OK with that

@rjzamora (Contributor, Author) commented Aug 12, 2019

Maybe the specific example there could become a test.

I am hoping that test_partition_on has been properly expanded to cover this.

Note that I also added a compute() call to test_partition_on_cats_2 for the case where one of the "partitioned-on" columns is not specified in the columns argument. Prior to these changes, the assertion would have failed because the pre- and post-compute columns would have disagreed (which seems to have caused the Stack Overflow error).
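A rough sketch of the kind of check this adds (not the literal test code; the dataset layout is the hypothetical one sketched earlier): after these changes, omitting a partitioned-on column from columns= should give the same columns before and after compute().

import dask.dataframe as dd

ddf = dd.read_parquet("partitioned_data", engine="fastparquet", columns=["value"])
assert list(ddf.columns) == ["value"]            # meta excludes the partition column
assert list(ddf.compute().columns) == ["value"]  # and so does the computed result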

@martindurant (Member) commented Aug 12, 2019

OK, good with me

@martindurant (Member) commented Aug 13, 2019

+1

@martindurant (Member) commented Aug 13, 2019

@rjzamora , just to check you were done here.

@rjzamora (Contributor, Author) commented Aug 13, 2019

just to check you were done here.

Yes - we should open something new if any other problems show up. Thanks for your help @martindurant !

@martindurant merged commit f3eadf3 into dask:master on Aug 13, 2019

2 checks passed:
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/travis-ci/pr: The Travis CI build passed

@rjzamora deleted the rjzamora:fp-improvements branch on Aug 19, 2019

@apiszcz commented Aug 21, 2019

I am seeing the same error as #5218 (comment) when reading from a parquet file created with append=True using fastparquet. I have not tested the very latest dask.
