
Parquet metadata-handling improvements #5218

Merged: 4 commits into dask:master on Aug 13, 2019

Conversation

@rjzamora (Member) commented Aug 2, 2019

This PR is intended to address some components of issues #5212 and #5200.

  • I have added simple logic to drop statistics for any column with a None value for min or max in any column chunk. I decided not to drop all metadata when a None is found, because it is possible that the bad statistics are isolated to columns that will not be used as the index (and thus do not affect the divisions). This will hopefully help with #5212

  • I am also including changes recommended by @martindurant (note that I didn't make any real effort to "clean up" that code, but it might be fine as is)

  • Tests added / passed

  • Passes black dask / flake8 dask

cs_min = column.statistics.min
cs_max = column.statistics.max
if None in [cs_min, cs_max]:
    skip_cols.add(name)
    continue
@martindurant (Member) commented on this snippet Aug 2, 2019

Perfect. This should probably be done in fastparquet too when generating the stats values, but I suppose in principle, people might want to use them for something other than making dask divisions lists.
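The rule in the snippet above can be sketched roughly as follows. This is a hedged illustration with hypothetical helper names and data shapes, not dask's actual internals: a column's statistics are treated as unusable for building divisions if any row-group chunk reports None for min or max.

```python
# Hypothetical sketch of the stats-dropping rule (not dask's real code):
# a column is skipped if ANY row-group chunk has a None min or max.
def columns_with_usable_stats(row_groups):
    skip_cols = set()
    all_cols = set()
    for rg in row_groups:
        for name, stats in rg.items():
            all_cols.add(name)
            if stats["min"] is None or stats["max"] is None:
                skip_cols.add(name)
    return all_cols - skip_cols

row_groups = [
    {"a": {"min": 0, "max": 5}, "b": {"min": None, "max": 9}},
    {"a": {"min": 6, "max": 9}, "b": {"min": 1, "max": 2}},
]
print(columns_with_usable_stats(row_groups))  # {'a'}
```

Column "b" is dropped even though its second chunk has valid stats, while "a" survives; this mirrors the per-column (rather than all-or-nothing) behavior described in the PR description.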

@martindurant (Member) commented Aug 2, 2019

I guess it would be good to measure the byte size of the serialised complete graph for a dd.read_parquet like the one in the linked issue, with a reasonable number of columns and row-groups, with and without this change. That would be a decent proxy for the cost of the de/serialisation.

@martindurant (Member) commented Aug 2, 2019

Had a moment and did this: len(cloudpickle.dumps(df)) gives 60,002 bytes with this change and 87,668 without.
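The kind of measurement above can be reproduced roughly with a toy graph. This sketch uses stdlib pickle as a stand-in for cloudpickle and a hypothetical task-graph shape; the actual comparison serialised a real dd.read_parquet frame.

```python
import pickle

# Toy stand-in for a read_parquet task graph: one task per row-group,
# optionally carrying per-column min/max statistics. All names here are
# hypothetical; the thread measured a real dask graph with cloudpickle.
def make_graph(n_parts, with_stats):
    graph = {}
    for i in range(n_parts):
        meta = {"path": f"part.{i}.parquet", "columns": list("abcdefgh")}
        if with_stats:
            meta["statistics"] = {
                c: {"min": 0, "max": 10**9} for c in meta["columns"]
            }
        graph[("read-parquet", i)] = ("read_chunk", meta)
    return graph

size_with = len(pickle.dumps(make_graph(100, with_stats=True)))
size_without = len(pickle.dumps(make_graph(100, with_stats=False)))
# Dropping the embedded statistics shrinks the serialized graph.
assert size_without < size_with
print(size_with, size_without)
```

The point of the proxy metric: every task in the graph carries its metadata, so per-column statistics multiply across partitions and columns, which is why dropping unusable stats pays off at serialisation time.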

@rjzamora (Member, Author) commented Aug 5, 2019

A separate problem? https://stackoverflow.com/questions/57345240/problem-with-opening-parquet-files-with-fastparquet-after-updating-to-latest-ver

I'm actually having trouble reproducing this. It doesn't make much sense to me why the "partitioned_on" columns would need to be specified in the columns argument, or why the index would not be detected correctly. Is the @ notation in '@timestamp' significant? (i.e., I don't know whether the @ has some special meaning.)

@martindurant (Member) commented Aug 5, 2019

No, the "@" is just a character.

I think the point is that the _meta should include the columns derived from directory-based partitioning whether or not they are specified in the columns= kwarg; those columns do not show up in the schema of _metadata or _common_metadata (but they should be in the pandas JSON metadata).

The index thing I don't understand, since presumably the dataframe being written should have specified this. But certainly, a test that looks like this example would be a useful addition.

@rjzamora (Member, Author) commented Aug 5, 2019

I think the point is that the _meta should include the columns derived from directory-based partitioning whether or not they are specified in the columns= kwarg; those columns do not show up in the schema of _metadata or _common_metadata (but they should be in the pandas JSON metadata).

Ah, I think I'm understanding a bit (although I'm still unable to reproduce the problem). These columns should indeed be in the JSON metadata, but they are excluded from meta right here. During compute(), however, fastparquet "knows" to include the partitioned-on columns, even if they are not specified in columns.

@martindurant (Member) commented Aug 5, 2019

Aha, exactly: those columns should include directory-derived ones, whether inferred from the filenames or from pandas metadata.

@martindurant (Member) commented Aug 5, 2019

(the original meta should already include them, so this is simple to change)

@rjzamora (Member, Author) commented Aug 5, 2019

Aha, exactly: those columns should include directory-derived ones, whether inferred from the filenames or from pandas metadata.

Just to be clear, this statement suggests that the current assertion/test should not hold, is that correct? In test_partition_on_cats_2, we assume that the columns argument really does mean that all other columns should be left out of the resulting dataframe.

@martindurant (Member) commented Aug 5, 2019

Well, in previous versions you could not unselect a directory-derived column; but I suppose you are right that this is reasonable. In that case, instead of changing how meta is made, we need the reader to exclude the inferred column (perhaps only possible after it was created anyway).

@martindurant (Member) commented Aug 5, 2019

Plus in that test, we didn't actually ensure that the data could be loaded, which I think would have found this issue.

@rjzamora (Member, Author) commented Aug 5, 2019

Plus in that test, we didn't actually ensure that the data could be loaded, which I think would have found this issue.

Are you saying that we aren't actually using compute()?

@martindurant (Member) commented Aug 5, 2019

Are you saying that we aren't actually using compute()?

Right, for the df that explicitly checks this columns= case

@mrocklin (Member) commented Aug 12, 2019

@rjzamora @martindurant what is the status here? It looks like this PR has gotten a bit stale.

@rjzamora (Member, Author) commented Aug 12, 2019

what is the status here? It looks like this PR has gotten a bit stale.

Initially, I was leaving this open in case we found any other problems. At this point, we should probably just wrap this up and open a new PR if another issue surfaces.

@martindurant (Member) commented Aug 12, 2019

Agreed, I think this is an improvement

@TomAugspurger (Member) commented Aug 12, 2019

@rjzamora I merged master and pushed to your branch to resolve the CI issues.

@TomAugspurger (Member) commented Aug 12, 2019

OK to remove the "WIP" in the title?

@martindurant are you comfortable with the PR, or does it need another review / touch up?

@rjzamora (Member, Author) commented Aug 12, 2019

I merged master and pushed to your branch to resolve the CI issues.

Thanks @TomAugspurger !

@martindurant (Member) commented Aug 12, 2019

@rjzamora , up to you when to remove "WIP", I'll have a look then.

@rjzamora rjzamora changed the title [WIP] Parquet metadata-handling improvements Parquet metadata-handling improvements Aug 12, 2019
@martindurant (Member) commented Aug 12, 2019

Does this resolve https://stackoverflow.com/questions/57345240/problem-with-opening-parquet-files-with-fastparquet-after-updating-to-latest-ver ? Maybe the specific example there could become a test.

@rjzamora (Member, Author) commented Aug 12, 2019

Does this resolve https://stackoverflow.com/questions/57345240/problem-with-opening-parquet-files-with-fastparquet-after-updating-to-latest-ver ? Maybe the specific example there could become a test.

I am now remembering that one concern I had with this PR was that I was not able to reproduce the original error reported on Stack Overflow. So, although I believe the changes should address that problem, I am not actually sure.

Note that I also made the decision here that a call to read_parquet should indeed include "only" the columns specified by the columns argument (if not None), even if some missing columns correspond to partitions, etc.
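The rule being settled on here can be sketched as follows. This is a hypothetical helper illustrating the decision, not the actual dask code path: an explicit columns= selection wins, even over partition-derived columns.

```python
# Hypothetical sketch (not dask's real implementation) of the rule:
# read_parquet(..., columns=cols) returns exactly `cols`, even when
# directory-derived (partitioned-on) columns exist on disk.
def select_columns(file_columns, partition_columns, columns=None):
    all_columns = list(file_columns) + [
        c for c in partition_columns if c not in file_columns
    ]
    if columns is None:
        # No selection: include file columns plus directory-derived ones.
        return all_columns
    # An explicit selection excludes everything else, including
    # partitioned-on columns that were not requested.
    return [c for c in all_columns if c in columns]

print(select_columns(["x", "y"], ["year"]))                 # ['x', 'y', 'year']
print(select_columns(["x", "y"], ["year"], columns=["x"]))  # ['x']
```

With columns=None the partition column appears alongside the file columns (the _meta fix discussed earlier); with an explicit selection it is left out, matching the behavior test_partition_on_cats_2 asserts.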

@martindurant (Member) commented Aug 12, 2019

I also made the decision here that a call to read_parquet should indeed include "only" the columns specified by the columns argument (if not None), even if some missing columns correspond to partitions, etc.

I am OK with that

@rjzamora (Member, Author) commented Aug 12, 2019

Maybe the specific example there could become a test.

I am hoping that test_partition_on has been properly expanded to cover this.

Note that I also added a compute call to test_partition_on_cats_2 for the case where one of the "partitioned-on" columns is not specified in the columns argument. Prior to these changes, that assertion would have failed because the pre- and post-compute columns would have disagreed (which seems to be what caused the Stack Overflow error).

@martindurant (Member) commented Aug 12, 2019

OK, good with me

@martindurant (Member) commented Aug 13, 2019

+1

@martindurant (Member) commented Aug 13, 2019

@rjzamora , just to check you were done here.

@rjzamora (Member, Author) commented Aug 13, 2019

just to check you were done here.

Yes - we should open something new if any other problems show up. Thanks for your help @martindurant !

@martindurant martindurant merged commit f3eadf3 into dask:master Aug 13, 2019
2 checks passed
@rjzamora rjzamora deleted the fp-improvements branch Aug 19, 2019
@apiszcz commented Aug 21, 2019

I am seeing the same error as #5218 (comment) when reading from a parquet file created with append=True using fastparquet. I have not tested the very latest dask.

5 participants