
Issue while reading parquet files from s3 having single root partition & multiple leaf partitions #182

Closed
yackoa opened this issue Jul 19, 2017 · 9 comments

Comments

@yackoa

yackoa commented Jul 19, 2017

The root partition information gets omitted in the absence of a _metadata file in s3 and when the list of paths has only a single value for the root partition.
I am using the following code to read from s3:

    import s3fs
    import fastparquet as fp

    s3 = s3fs.S3FileSystem()
    myopen = s3.open
    fp_obj = fp.ParquetFile(s3_path_list, open_with=myopen)
    df = fp_obj.to_pandas()

If the list of paths (the s3_path_list variable) had the following two file paths:

root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet
root_dir_in_s3/my_table/unique_id=600/my_date=2017-03-25/part.1.parquet

I will get both the unique_id & my_date column information in the df.

But if the root partition, i.e. unique_id, had only one value, i.e. our file path list is like the following:

root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet
root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-25/part.1.parquet

fastparquet omits the root partition information (unique_id in our case) from the final dataframe.

Spark handles this pretty well, without a _metadata file.

I believe the above scenario is a bug.
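
My guess at what is happening (just an illustration with standard-library path handling, not fastparquet's actual code): when every path shares the same unique_id value, that directory becomes part of the common base path, so it is no longer recognised as a partition level:

    # Illustration only: if partition fields are inferred from whatever differs
    # between the paths and their common prefix, a root partition with a single
    # value gets absorbed into the base path.
    import os

    def inferred_partitions(paths):
        base = os.path.dirname(os.path.commonprefix(paths))
        fields = []
        for p in paths:
            rel = p[len(base):].lstrip("/")
            for d in rel.split("/")[:-1]:      # directory components only
                if "=" in d:
                    field = d.split("=", 1)[0]
                    if field not in fields:
                        fields.append(field)
        return base, fields

    single_value = [
        "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet",
        "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-25/part.1.parquet",
    ]
    two_values = [
        "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet",
        "root_dir_in_s3/my_table/unique_id=600/my_date=2017-03-25/part.1.parquet",
    ]

    print(inferred_partitions(single_value))  # unique_id=500 swallowed by the base -> ['my_date']
    print(inferred_partitions(two_values))    # -> ['unique_id', 'my_date']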

@yackoa
Author

yackoa commented Jul 19, 2017

Adding some more information:
If the list of paths were the following:

root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet
root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-25/part.1.parquet
root_dir_in_s3/my_table/unique_id=700/my_date=2017-02-12/part.1.parquet

That means we had one instance of unique_id=700 and two instances of unique_id=500, yet the root partition, i.e. unique_id, is still omitted. I feel the path list kind of needs an even distribution of the root partition values, of sorts.

@martindurant
Member

Thank you for reporting.
I would call it an unimplemented corner-case, rather than a bug :)

  • Are you certain this needs to be fixed for you, since the following would be a simpler solution?

    df = pf.to_pandas().assign(unique_id=500)

Would you say it is reasonable to add an optional parameter to ParquetFile specifying the dataset root for cases like this, or to improve fastparquet.util.analyse_paths to assume (field=val)-like directories are always part of the data-set? Note that drill-style data-sets only include the val part and not the fields, so there would be no way of telling for those.
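
To make the first option concrete, the behaviour would be roughly this (the explicit root argument here is hypothetical, not part of the current ParquetFile signature):

    def split_with_root(paths, root):
        # Sketch only: "root" pins the dataset base, so every field=value
        # directory below it is treated as a partition, even if the value
        # never varies across the file list.
        partitions = []
        for p in paths:
            rel = p[len(root):].lstrip("/")
            dirs = [d for d in rel.split("/")[:-1] if "=" in d]
            partitions.append(dict(d.split("=", 1) for d in dirs))
        return partitions

    paths = [
        "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet",
        "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-25/part.1.parquet",
    ]
    print(split_with_root(paths, "root_dir_in_s3/my_table"))
    # [{'unique_id': '500', 'my_date': '2017-03-16'},
    #  {'unique_id': '500', 'my_date': '2017-03-25'}]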

@yackoa
Author

yackoa commented Jul 19, 2017

Yes, I believe this can be called an unimplemented corner-case or a good-to-have feature :)

I think it could be reasonable to add an optional parameter to assume (field=val)-like directories are part of the dataset, since that's how partitions work in the parquet format (I'm not an expert). I think it would make it more compliant with parquet files generated by Spark.

I believe that if I use the solution you have provided, it might end up being slower when I have multiple root partitions, since it means I will have to iterate over all the files one by one and add all the partition values in the style you have mentioned (unique_id & my_date in this case).
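
That workaround would look roughly like the sketch below, I think (reusing the s3fs/fastparquet names from my snippet above; parse_partitions is just a helper I made up):

    # Rough sketch of the per-file workaround, assuming s3_path_list from the
    # earlier snippet; parse_partitions is a hypothetical helper.
    import pandas as pd
    import fastparquet as fp
    import s3fs

    s3 = s3fs.S3FileSystem()

    def parse_partitions(path):
        # Pull field=value pairs out of the directory components of a path.
        return dict(d.split("=", 1) for d in path.split("/")[:-1] if "=" in d)

    frames = []
    for path in s3_path_list:                  # s3_path_list as in my snippet above
        part = fp.ParquetFile(path, open_with=s3.open).to_pandas()
        frames.append(part.assign(**parse_partitions(path)))

    df = pd.concat(frames, ignore_index=True)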

@martindurant
Member

If there are multiple values of unique_id, it seems to work for me:

In [31]: files = glob('my_table/*/*/*.parquet')

In [32]: files
Out[32]:
['my_table/unique_id=500/my_date=2017-03-16/part.0.parquet',
 'my_table/unique_id=500/my_date=2017-03-16/part.1.parquet',
 'my_table/unique_id=700/my_date=2017-02-12/part.1.parquet']

In [33]: pf = fastparquet.ParquetFile(files)

In [34]: pf.to_pandas()
Out[34]:
    a    my_date unique_id
0   5 2017-03-16       500
1   6 2017-03-16       500
2   1 2017-03-16       500
3   4 2017-03-16       500
4   7 2017-03-16       500
5   0 2017-03-16       500
...
24  7 2017-02-12       700
25  0 2017-02-12       700
26  8 2017-02-12       700
27  0 2017-02-12       700
28  8 2017-02-12       700
29  0 2017-02-12       700

This worked on s3 too.

@yackoa
Author

yackoa commented Jul 19, 2017

OK, my bad for not checking the entire dataframe thoroughly! :)
I just verified that the issue occurs only when we have a single root partition!

@yackoa
Author

yackoa commented Jul 21, 2017

Hi @martindurant, so would this be addressed in an upcoming release?

@martindurant
Member

I'll try to make a fix today, and then consider when I might release. No reason it should take too long.

@martindurant
Member

v0.1.1 available now on conda-forge.

@yackoa
Author

yackoa commented Jul 22, 2017

Awesome, thank you for fixing! :-)
