
Issue while reading parquet files from s3 having single root partition & multiple leaf partitions #182

Closed
yackoa opened this issue Jul 19, 2017 · 9 comments

Comments

@yackoa

yackoa commented Jul 19, 2017

The root partition information gets omitted in the absence of a _metadata file in s3 and when the list of paths has only a single value for the root partition.
I am using the following code to read from s3:

    import s3fs
    import fastparquet as fp

    s3 = s3fs.S3FileSystem()
    myopen = s3.open
    fp_obj = fp.ParquetFile(s3_path_list, open_with=myopen)
    df = fp_obj.to_pandas()

If the list of paths (the s3_path_list variable) had the following two file paths:

root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet
root_dir_in_s3/my_table/unique_id=600/my_date=2017-03-25/part.1.parquet

I will get both the unique_id & my_date column information in the df.

But if the root partition, i.e. unique_id, had only one value, i.e. our file path list is like the following:

root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet
root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-25/part.1.parquet

fastparquet omits the root partition information (unique_id in our case) from the final dataframe.

Spark handles this pretty well, without a _metadata file.

I believe the above scenario is a bug.
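
My guess at what is happening (just an illustration with standard-library path handling, not fastparquet's actual code): when every path shares the same unique_id value, that directory becomes part of the common base path, so it is no longer recognised as a partition level:

    # Illustration only: if partition fields are inferred from whatever differs
    # between the paths and their common prefix, a root partition with a single
    # value gets absorbed into the base path.
    import os

    def inferred_partitions(paths):
        base = os.path.dirname(os.path.commonprefix(paths))
        fields = []
        for p in paths:
            rel = p[len(base):].lstrip("/")
            for d in rel.split("/")[:-1]:      # directory components only
                if "=" in d:
                    field = d.split("=", 1)[0]
                    if field not in fields:
                        fields.append(field)
        return base, fields

    single_value = [
        "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet",
        "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-25/part.1.parquet",
    ]
    two_values = [
        "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet",
        "root_dir_in_s3/my_table/unique_id=600/my_date=2017-03-25/part.1.parquet",
    ]

    print(inferred_partitions(single_value))  # unique_id=500 swallowed by the base -> ['my_date']
    print(inferred_partitions(two_values))    # -> ['unique_id', 'my_date']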

@yackoa
Author

yackoa commented Jul 19, 2017

Adding some more information:
If the list of paths were the following:

root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet
root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-25/part.1.parquet
root_dir_in_s3/my_table/unique_id=700/my_date=2017-02-12/part.1.parquet

That means we had one instance of unique_id=700 and two instances of unique_id=500, yet the root partition, i.e. unique_id, is still omitted. I feel the path list kind of needs an even distribution of the root partition values, of sorts.

@martindurant
Member

Thank you for reporting.
I would call it an unimplemented corner-case, rather than a bug :)

  • Are you certain this needs to be fixed for you, since the following would be a simpler solution?

    df = pf.to_pandas().assign(unique_id=500)

Would you say it is reasonable to add an optional parameter to ParquetFile specifying the dataset root for cases like this, or to improve fastparquet.util.analyse_paths to assume (field=val)-like directories are always part of the data-set? Note that drill-style data-sets only include the val part and not the fields, so there would be no way of telling for those.
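
To make the first option concrete, the behaviour would be roughly this (the explicit root argument here is hypothetical, not part of the current ParquetFile signature):

    def split_with_root(paths, root):
        # Sketch only: "root" pins the dataset base, so every field=value
        # directory below it is treated as a partition, even if the value
        # never varies across the file list.
        partitions = []
        for p in paths:
            rel = p[len(root):].lstrip("/")
            dirs = [d for d in rel.split("/")[:-1] if "=" in d]
            partitions.append(dict(d.split("=", 1) for d in dirs))
        return partitions

    paths = [
        "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-16/part.0.parquet",
        "root_dir_in_s3/my_table/unique_id=500/my_date=2017-03-25/part.1.parquet",
    ]
    print(split_with_root(paths, "root_dir_in_s3/my_table"))
    # [{'unique_id': '500', 'my_date': '2017-03-16'},
    #  {'unique_id': '500', 'my_date': '2017-03-25'}]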

@yackoa
Author

yackoa commented Jul 19, 2017

Yes, I believe this can be called an unimplemented corner-case or a good-to-have feature :)

I think it could be reasonable to add an optional parameter to assume (field=val)-like directories are part of the dataset, since that's how partitions work in the parquet format (I'm not an expert). I think it would make it more compliant with parquet files generated by Spark.

I believe that if I use the solution you have provided, it might end up being slower when I have multiple root partitions, since it means I will have to iterate over all the files one by one and add all the partition values in the style you have mentioned (unique_id & my_date in this case).
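
That workaround would look roughly like the sketch below, I think (reusing the s3fs/fastparquet names from my snippet above; parse_partitions is just a helper I made up):

    # Rough sketch of the per-file workaround, assuming s3_path_list from the
    # earlier snippet; parse_partitions is a hypothetical helper.
    import pandas as pd
    import fastparquet as fp
    import s3fs

    s3 = s3fs.S3FileSystem()

    def parse_partitions(path):
        # Pull field=value pairs out of the directory components of a path.
        return dict(d.split("=", 1) for d in path.split("/")[:-1] if "=" in d)

    frames = []
    for path in s3_path_list:                  # s3_path_list as in my snippet above
        part = fp.ParquetFile(path, open_with=s3.open).to_pandas()
        frames.append(part.assign(**parse_partitions(path)))

    df = pd.concat(frames, ignore_index=True)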

@martindurant
Member

If there are multiple values of unique_id, it seems to work for me:

In [31]: files = glob('my_table/*/*/*.parquet')

In [32]: files
Out[32]:
['my_table/unique_id=500/my_date=2017-03-16/part.0.parquet',
 'my_table/unique_id=500/my_date=2017-03-16/part.1.parquet',
 'my_table/unique_id=700/my_date=2017-02-12/part.1.parquet']

In [33]: pf = fastparquet.ParquetFile(files)

In [34]: pf.to_pandas()
Out[34]:
    a    my_date unique_id
0   5 2017-03-16       500
1   6 2017-03-16       500
2   1 2017-03-16       500
3   4 2017-03-16       500
4   7 2017-03-16       500
5   0 2017-03-16       500
...
24  7 2017-02-12       700
25  0 2017-02-12       700
26  8 2017-02-12       700
27  0 2017-02-12       700
28  8 2017-02-12       700
29  0 2017-02-12       700

This worked on s3 too.

@yackoa
Author

yackoa commented Jul 19, 2017

OK, my bad for not checking the entire dataframe thoroughly! :)
I just verified that the issue occurs only when we have a single root partition!

@yackoa
Author

yackoa commented Jul 21, 2017

Hi @martindurant, so would this be addressed in an upcoming release?

@martindurant
Member

I'll try to make a fix today, and then consider when I might release. No reason it should take too long.

@martindurant
Member

v0.1.1 available now on conda-forge.

@yackoa
Author

yackoa commented Jul 22, 2017

Awesome, thank you for fixing! :-)
