
pandas.write_parquet creates unnecessary partitions when writing views with categoricals #112

Closed
vfilimonov opened this issue Jan 18, 2020 · 3 comments
Labels: bug (Something isn't working), enhancement (New feature or request)


vfilimonov commented Jan 18, 2020

Hello @igorborgest,

it looks like views with a selection along a categorical column are not properly respected.

For example, the following code:

import awswrangler as wr
import numpy as np
import pandas as pd

d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)  # pd.np is deprecated; use numpy directly
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year

wr.pandas.to_parquet(x[x.Year == 1990], path='s3://BUCKET/x_1.parquet', partition_cols=['Year'])

Results in writing only Year=1990 part of the dataframe:

['s3://BUCKET/x_1.parquet/year=1990/c07aa0158dbc4018b78092af2d194b02.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/06c8c8643a5746bc89ffff305491d7b3.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/ee79fe3bf1a8432586d8378f7ea54e00.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/6b7aab3133fe457e849cbf26ac4b9db7.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/a9f7c03ddeed4442bdb1e82c1634bb4a.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/9b5cc05305244825b2967ead3076077c.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/688b467989064e9a8fa99fb1755afa59.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/843732aec4d949ee984ef6e1265507cb.snappy.parquet']

However, if we convert the Year column to categorical:

d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year
x['Year'] = x['Year'].astype('category')

wr.pandas.to_parquet(x[x.Year == 1990], path='s3://BUCKET/x_2.parquet', partition_cols=['Year'])

The whole original dataset will be written:

 ['s3://BUCKET/x_2.parquet/year=1990/440ff0157ad243a7953f3acc33019e39.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1991/fa3f5971573142eda26fae3b119bd393.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1992/8a7febc167a14c7eb28031f85833be8e.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1993/4ce2bfb7a0e64666acf3d51459da434f.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1994/9ebe8fd7e21744799f2236185c7f7651.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1995/866e31c2a8f54731b022b64d56f2442a.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1996/a8fde35191bf4dd0b864fa0c3933662f.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1997/2c0ff6446e7c4671b446921aa70a4f1c.snappy.parquet',
...

Versions:
aws-data-wrangler: 0.2.5
pandas: 0.25.3
pyarrow: 0.15.1

P.S. And since we're on this page: why are the chunks so small (5 KB in this case) when writing a dataframe to S3?


vfilimonov commented Jan 20, 2020

Correction: the whole dataframe is not actually written; empty partitions are just created for all categories.

It looks like this is partially inherited from a pyarrow issue (https://issues.apache.org/jira/browse/ARROW-7617), which creates a partition folder for every categorical value.
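The root cause is visible in plain pandas, without the writer involved: filtering a frame does not drop unused categories, so a partitioner that walks the categories still sees every year. A quick sketch (the data setup mirrors the repro above):

```python
# Filtering a DataFrame does NOT shrink the categories of a categorical
# column; every original category survives the boolean mask.
import numpy as np
import pandas as pd

d = pd.date_range('1990-01-01', freq='D', periods=10000)
x = pd.DataFrame(np.random.randn(len(d), 4), index=d, columns=list('ABCD'))
x['Year'] = x.index.year.astype('category')

view = x[x.Year == 1990]
print(len(view))                          # 365 rows survive the filter...
print(len(view['Year'].cat.categories))   # ...but all 28 year categories remain
```

A category-driven partitioned write therefore has one folder per category, 27 of them empty here.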

So the actual issue is that aws-data-wrangler creates many unnecessary files for empty partitions (8 per partition). For small dataframes like this one, that is a big I/O overhead (the empty partitions add up to 10x the actual data).

I'm not sure whether this is an easy fix (or whether it should be fixed at all).
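A possible user-side workaround in the meantime (my sketch, not the library's fix) is to drop the unused categories from the view before handing it to the writer:

```python
# Drop unused categories on the filtered view so only the surviving
# category produces a partition folder.
import numpy as np
import pandas as pd

d = pd.date_range('1990-01-01', freq='D', periods=10000)
x = pd.DataFrame(np.random.randn(len(d), 4), index=d, columns=list('ABCD'))
x['Year'] = x.index.year.astype('category')

view = x[x.Year == 1990].copy()
view['Year'] = view['Year'].cat.remove_unused_categories()
print(list(view['Year'].cat.categories))  # [1990] -> only one partition left

# then e.g.:
# wr.pandas.to_parquet(view, path='s3://BUCKET/x_1.parquet', partition_cols=['Year'])
```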

@vfilimonov vfilimonov changed the title pandas.write_parquet: when writing view with categoricals - the whole dataframe is written pandas.write_parquet creates unnecessary partitions when writing views with categoricals Jan 20, 2020
@igorborgest igorborgest self-assigned this Jan 21, 2020
@igorborgest igorborgest added enhancement New feature or request WIP Work in progress labels Jan 21, 2020
igorborgest (Contributor) commented
@vfilimonov thanks for all the troubleshooting, it helped a lot.

Actually this is something inherited from Pandas.

I fixed it with a few code updates, and also added your exact test case to our test bench.

P.S. About the file size: AWS Data Wrangler tries to parallelize everything it can, so by default the number of files will be roughly: number of cores x number of partitions. You can control this parallelism with the procs_cpu_bound parameter; if you set procs_cpu_bound=1, it will keep only 1 file per partition.
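To make that arithmetic concrete (assuming, hypothetically, an 8-core machine, which would match the 8 files per partition seen in the first listing):

```python
# files written ~= number_of_cores x number_of_partitions
cores = 8        # hypothetical machine, matching the 8 files per partition above
partitions = 1   # only Year=1990 survives the filter

print(cores * partitions)  # 8 small files by default
print(1 * partitions)      # 1 file per partition with procs_cpu_bound=1
```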

P.P.S. This will be released in our new version over the weekend.

Please let me know if you have more feedback about this issue.

@igorborgest igorborgest added bug Something isn't working and removed WIP Work in progress labels Jan 21, 2020
vfilimonov (Author) commented

Wow, that was fast, thanks a lot, @igorborgest! It’s nice that the fix was so straightforward!
