So the actual issue is that aws-data-wrangler creates many unnecessary files for empty partitions (8 per partition). For small dataframes like the one here, this creates a large I/O overhead (the empty partitions amount to 10x the actual data).
I'm not sure whether this is an easy fix (or whether it should be fixed at all).
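The mechanism is plain pandas behavior: slicing a dataframe does not drop unused categories from a categorical column, so the writer still sees every category as a partition value. A minimal pandas-only illustration (the `year` column name is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"year": pd.Categorical([1990, 1990, 2000, 2010])})
view = df[df["year"] == 1990]

# The view keeps all categories of the original column, including
# those with no remaining rows -- each unused one becomes an empty
# partition downstream.
print(view["year"].cat.categories)  # [1990, 2000, 2010]

# Possible workaround on the caller's side: drop unused categories
# before writing.
view = view.assign(year=view["year"].cat.remove_unused_categories())
print(view["year"].cat.categories)  # [1990]
```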
vfilimonov changed the title from "pandas.write_parquet: when writing view with categoricals - the whole dataframe is written" to "pandas.write_parquet creates unnecessary partitions when writing views with categoricals" on Jan 20, 2020.
I fixed it with a few code updates, and also added your exact test case to our test bench.
P.S. About the file size: AWS Data Wrangler tries to parallelize everything it can, so by default the number of files will be roughly Number of Cores x Number of Partitions. But you can control this parallelism with the procs_cpu_bound parameter: if you set procs_cpu_bound=1, it will keep only one file per partition.
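For illustration, a minimal sketch of how this might look with the pre-1.0 Session-based API (the bucket, dataframe, and partition column are placeholders, and procs_cpu_bound may also be accepted by the write call itself; exact names may differ from your version):

```python
import pandas as pd
import awswrangler

df = pd.DataFrame({"year": [1990, 1990, 2000], "value": [1.0, 2.0, 3.0]})

# procs_cpu_bound=1 disables CPU-bound parallelism, so the writer
# should emit a single file per partition instead of one per core.
session = awswrangler.Session(procs_cpu_bound=1)
session.pandas.to_parquet(
    dataframe=df,
    path="s3://my-bucket/my-table/",  # hypothetical destination
    partition_cols=["year"],
)
```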
P.P.S. This will be released in our new version this weekend.
Please let me know if you have more feedback about this issue.
Hello @igorborgest,
It looks like views with a selection along a categorical column are not properly respected.
For example, writing a view of the dataframe results in only the Year=1990 part being written, as expected. However, if we first convert the year column to categorical, the whole original dataset will be written (see the reconstruction sketched below).
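The original code snippets did not survive; what follows is a hedged reconstruction of the reproduction, assuming the pre-1.0 Session-based API and a hypothetical year column and bucket:

```python
import pandas as pd
import awswrangler

df = pd.DataFrame({
    "year": [1990, 1990, 2000, 2010],
    "value": [1.0, 2.0, 3.0, 4.0],
})

session = awswrangler.Session()

# Writing a plain-integer view stores only the Year=1990 partition,
# as expected.
view = df[df["year"] == 1990]
session.pandas.to_parquet(
    dataframe=view,
    path="s3://my-bucket/plain/",  # hypothetical destination
    partition_cols=["year"],
)

# After converting `year` to categorical, the same view still carries
# the unused categories (2000, 2010), so partitions covering the whole
# original dataset get written.
df["year"] = df["year"].astype("category")
view = df[df["year"] == 1990]
session.pandas.to_parquet(
    dataframe=view,
    path="s3://my-bucket/categorical/",  # hypothetical destination
    partition_cols=["year"],
)
```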
Versions:
aws-data-wrangler: 0.2.5
pandas: 0.25.3
pyarrow: 0.15.1
P.S. And since we're on this page: why are the chunks so small (5 KB in this case) when writing a dataframe to S3?