
pandas.write_parquet creates unnecessary partitions when writing views with categoricals #112

Closed
vfilimonov opened this issue Jan 18, 2020 · 3 comments
Labels: bug (Something isn't working), enhancement (New feature or request)


vfilimonov commented Jan 18, 2020

Hello @igorborgest,

it looks like views with a selection along a categorical column are not properly respected.

For example, the following code:

import awswrangler as wr
import numpy as np
import pandas as pd

d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)  # pd.np is deprecated; use numpy directly
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year

wr.pandas.to_parquet(x[x.Year == 1990], path='s3://BUCKET/x_1.parquet', partition_cols=['Year'])

Results in writing only Year=1990 part of the dataframe:

['s3://BUCKET/x_1.parquet/year=1990/c07aa0158dbc4018b78092af2d194b02.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/06c8c8643a5746bc89ffff305491d7b3.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/ee79fe3bf1a8432586d8378f7ea54e00.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/6b7aab3133fe457e849cbf26ac4b9db7.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/a9f7c03ddeed4442bdb1e82c1634bb4a.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/9b5cc05305244825b2967ead3076077c.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/688b467989064e9a8fa99fb1755afa59.snappy.parquet',
 's3://BUCKET/x_1.parquet/year=1990/843732aec4d949ee984ef6e1265507cb.snappy.parquet']

However, if we convert the Year column to categorical:

d = pd.date_range('1990-01-01', freq='D', periods=10000)
vals = np.random.randn(len(d), 4)
x = pd.DataFrame(vals, index=d, columns=['A', 'B', 'C', 'D'])
x['Year'] = x.index.year
x['Year'] = x['Year'].astype('category')

wr.pandas.to_parquet(x[x.Year == 1990], path='s3://BUCKET/x_2.parquet', partition_cols=['Year'])

The whole original dataset will be written:

 ['s3://BUCKET/x_2.parquet/year=1990/440ff0157ad243a7953f3acc33019e39.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1991/fa3f5971573142eda26fae3b119bd393.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1992/8a7febc167a14c7eb28031f85833be8e.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1993/4ce2bfb7a0e64666acf3d51459da434f.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1994/9ebe8fd7e21744799f2236185c7f7651.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1995/866e31c2a8f54731b022b64d56f2442a.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1996/a8fde35191bf4dd0b864fa0c3933662f.snappy.parquet',
  's3://BUCKET/x_2.parquet/year=1997/2c0ff6446e7c4671b446921aa70a4f1c.snappy.parquet',
...

Versions:
aws-data-wrangler: 0.2.5
pandas: 0.25.3
pyarrow: 0.15.1

P.S. And since we're on this page: why are the chunks so small (5 KB in this case) when writing a dataframe to S3?


vfilimonov commented Jan 20, 2020

Correction: the whole dataframe is not actually written; empty partitions are just created for all categories.

It looks like this is partially inherited from a pyarrow issue (https://issues.apache.org/jira/browse/ARROW-7617), which creates a partition folder for every categorical value.
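The root cause is visible in plain pandas, without the writer involved: filtering a frame does not drop unused categories, so a partitioner that walks the categories still sees every year. A quick sketch (the data setup mirrors the repro above):

```python
# Filtering a DataFrame does NOT shrink the categories of a categorical
# column; every original category survives the boolean mask.
import numpy as np
import pandas as pd

d = pd.date_range('1990-01-01', freq='D', periods=10000)
x = pd.DataFrame(np.random.randn(len(d), 4), index=d, columns=list('ABCD'))
x['Year'] = x.index.year.astype('category')

view = x[x.Year == 1990]
print(len(view))                          # 365 rows survive the filter...
print(len(view['Year'].cat.categories))   # ...but all 28 year categories remain
```

A category-driven partitioned write therefore has one folder per category, 27 of them empty here.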

So the actual issue is that aws-data-wrangler creates many unnecessary files for empty partitions (8 per partition). For small dataframes like this one, that is a big I/O overhead (the empty partitions add up to 10x the actual data).

I'm not sure whether this is an easy fix (or whether it should be fixed at all).
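A possible user-side workaround in the meantime (my sketch, not the library's fix) is to drop the unused categories from the view before handing it to the writer:

```python
# Drop unused categories on the filtered view so only the surviving
# category produces a partition folder.
import numpy as np
import pandas as pd

d = pd.date_range('1990-01-01', freq='D', periods=10000)
x = pd.DataFrame(np.random.randn(len(d), 4), index=d, columns=list('ABCD'))
x['Year'] = x.index.year.astype('category')

view = x[x.Year == 1990].copy()
view['Year'] = view['Year'].cat.remove_unused_categories()
print(list(view['Year'].cat.categories))  # [1990] -> only one partition left

# then e.g.:
# wr.pandas.to_parquet(view, path='s3://BUCKET/x_1.parquet', partition_cols=['Year'])
```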

@vfilimonov vfilimonov changed the title pandas.write_parquet: when writing view with categoricals - the whole dataframe is written pandas.write_parquet creates unnecessary partitions when writing views with categoricals Jan 20, 2020
@igorborgest igorborgest self-assigned this Jan 21, 2020
@igorborgest igorborgest added enhancement New feature or request WIP Work in progress labels Jan 21, 2020
igorborgest (Contributor) commented
@vfilimonov thanks for all the troubleshooting, it helped a lot.

Actually this is something inherited from Pandas.

I fixed it with a few code updates, and also added your exact test case to our test bench.

P.S. About the file size: AWS Data Wrangler tries to parallelize everything it can, so by default the number of files will be roughly: number of cores x number of partitions. You can control this parallelism with the procs_cpu_bound parameter; if you set procs_cpu_bound=1, it will keep only 1 file per partition.
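To make that arithmetic concrete (assuming, hypothetically, an 8-core machine, which would match the 8 files per partition seen in the first listing):

```python
# files written ~= number_of_cores x number_of_partitions
cores = 8        # hypothetical machine, matching the 8 files per partition above
partitions = 1   # only Year=1990 survives the filter

print(cores * partitions)  # 8 small files by default
print(1 * partitions)      # 1 file per partition with procs_cpu_bound=1
```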

P.P.S. This will be released in our new version over the weekend.

Please let me know if you have more feedback about this issue.

@igorborgest igorborgest added bug Something isn't working and removed WIP Work in progress labels Jan 21, 2020
vfilimonov (Author) commented

Wow, that was fast, thanks a lot, @igorborgest! It’s nice that the fix was so straightforward!
