Issue while reading parquet files from s3 having single root partition & multiple leaf partitions #182
Comments
Adding some more information:

This means that with one instance of unique_id=700 and two instances of unique_id=500 in the path list, the root partition (i.e. unique_id) is still omitted. I feel the path list somehow needs a roughly even distribution of root-partition values; an illustrative path list follows.
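For illustration, such a path list might look like this (hypothetical bucket, prefix, and file names):

```python
s3_path_list = [
    "mybucket/table/unique_id=700/my_date=2020-01-01/part-00000.parquet",
    "mybucket/table/unique_id=500/my_date=2020-01-01/part-00000.parquet",
    "mybucket/table/unique_id=500/my_date=2020-01-02/part-00000.parquet",
]
```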
Thank you for reporting.
Would you say it is reasonable to add an optional parameter to assume that (field=value)-like directories are partitions of the dataset?
Yes, I believe this could be called an unimplemented corner case, or a nice-to-have feature :) I think it would be reasonable to add an optional parameter to assume that (field=val)-like directories are part of the dataset, since that is how partitions work in the parquet format (not an expert). I think it would make fastparquet more compliant with parquet files generated by Spark. If I use the solution you have provided, it might end up being slower when I have multiple root partitions, since I would have to iterate over all the files one by one and add all the partition values in the style you mentioned (unique_id and my_date in this case); see the sketch below.
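The suggested workaround itself is not preserved on this page; a hedged sketch of the per-file approach described above (read_adding_partitions is a hypothetical helper) would read each file individually and re-attach the partition values parsed from its path, which is why the cost scales with the number of files:

```python
import re

import pandas as pd
from fastparquet import ParquetFile


def read_adding_partitions(paths, open_with):
    """Hypothetical helper: read each file on its own and add a column
    for every field=value component found in its path."""
    frames = []
    for path in paths:
        df = ParquetFile(path, open_with=open_with).to_pandas()
        for field, value in re.findall(r"([^/=]+)=([^/=]+)", path):
            df[field] = value
        frames.append(df)
    # one read (and one round-trip to S3) per file, hence the slowdown
    return pd.concat(frames, ignore_index=True)
```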
If there are multiple values of unique_id, it seems to work for me:
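The snippet that followed is not preserved on this page; a minimal sketch of such a check, assuming hypothetical local paths, might be:

```python
from fastparquet import ParquetFile

# two different unique_id values across the path list
paths = [
    "data/unique_id=500/my_date=2020-01-01/part-00000.parquet",
    "data/unique_id=700/my_date=2020-01-01/part-00000.parquet",
]
df = ParquetFile(paths).to_pandas()
print("unique_id" in df.columns)  # expected: True
```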
This worked on S3 too.
OK, my bad for not checking the entire dataframe thoroughly! :)
Hi @martindurant, so would this be addressed in an upcoming release?
I'll try to make a fix today, and then consider when I might release. No reason it should take too long. |
v0.1.1 available now on conda-forge. |
Awesome, thank you for fixing! :-)
Original issue description:

The root partition information gets omitted when there is no _metadata file on S3 and the list of paths contains only one root-partition value.
I am using the following code to read from S3:
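The original snippet is not preserved on this page; a minimal sketch of that kind of reading code, assuming hypothetical bucket and file names, is:

```python
import s3fs
from fastparquet import ParquetFile

s3 = s3fs.S3FileSystem()

s3_path_list = [
    # hypothetical example paths; see the lists further below
    "mybucket/table/unique_id=500/my_date=2020-01-01/part-00000.parquet",
    "mybucket/table/unique_id=700/my_date=2020-01-02/part-00000.parquet",
]

# open_with tells fastparquet how to open each path on S3
pf = ParquetFile(s3_path_list, open_with=s3.open)
df = pf.to_pandas()
```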
If the list of paths (the s3_path_list variable) had the following two file paths:
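The exact paths are not preserved here; based on the partition names mentioned, they would have looked something like this (hypothetical bucket and file names):

```python
s3_path_list = [
    "mybucket/table/unique_id=500/my_date=2020-01-01/part-00000.parquet",
    "mybucket/table/unique_id=700/my_date=2020-01-02/part-00000.parquet",
]
```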
I get both the unique_id and my_date columns in the resulting dataframe.
But if the root partition, i.e. unique_id, has only one value, i.e. the file path list looks like the following:
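Again with hypothetical names, but now with a single unique_id value across the whole list:

```python
s3_path_list = [
    "mybucket/table/unique_id=500/my_date=2020-01-01/part-00000.parquet",
    "mybucket/table/unique_id=500/my_date=2020-01-02/part-00000.parquet",
]
```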
fastparquet omits the root partition information (unique_id in our case) from the final dataframe.
Spark handles this pretty well, even without a _metadata file.
I believe the above scenario is a bug.