
Appending parquet file from python to s3 #327

Closed
Jeeva-Ganesan opened this issue Apr 17, 2018 · 5 comments

Jeeva-Ganesan commented Apr 17, 2018

Here is my snippet in spark-shell:

jdbcDF.write.mode("append").partitionBy("date").parquet("s3://bucket/Data/")

Problem description

Now, I am trying to do the same thing in Python with fastparquet.

import s3fs
from fastparquet import write
s3 = s3fs.S3FileSystem()
myopen = s3.open
write('****/20180101.parq', data, compression='GZIP', open_with=myopen)

First, I tried to save with Snappy compression:
write('****/20180101.snappy.parquet', data, compression='SNAPPY', open_with=myopen)

but got this error:

Compression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']
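
(Side note: fastparquet only lists SNAPPY among the available codecs when the optional python-snappy package is installed in the environment; a minimal sketch, assuming that package is added, reusing the myopen opener from the snippet above:)

# Assumes the optional dependency is installed: pip install python-snappy
import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()
myopen = s3.open

# With python-snappy present, 'SNAPPY' appears in the available compression options.
write('****/20180101.snappy.parquet', data, compression='SNAPPY',
      open_with=myopen)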

Then I tried GZIP, which worked, but I am not sure how to append or create partitions here. Here is an issue I created in pandas: https://github.com/pandas-dev/pandas/issues/20638

Thanks.

@martindurant
Member

The write function has append and partition_on keyword arguments; see the documentation.
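
For illustration, a minimal sketch of how those keywords might be combined with the S3 opener shown earlier (the bucket/prefix path and the df_day1/df_day2 dataframes are placeholders, not taken from the issue):

import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()

# First write creates the partitioned dataset; file_scheme='hive' is needed
# so that partition_on can split files by column value.
write('bucketname/user/data', df_day1, file_scheme='hive',
      partition_on=['date'], compression='GZIP', open_with=s3.open)

# Later writes with append=True add new data to the existing dataset
# instead of overwriting it.
write('bucketname/user/data', df_day2, file_scheme='hive',
      partition_on=['date'], compression='GZIP', append=True,
      open_with=s3.open)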

@Jeeva-Ganesan
Author

Thanks, that works fine. However, while writing to S3 this also creates a copy of the folder structure on my machine. Is that expected?

@martindurant
Member

I'm afraid I don't follow you - can you please describe exactly what you did and what happened?

@Jeeva-Ganesan
Author

Jeeva-Ganesan commented Apr 18, 2018

OK, let me explain. I have this folder structure in S3: s3://bucketname/user/data/. And this is my code to write my partitions into it.

import s3fs
from fastparquet import write
s3 = s3fs.S3FileSystem()
myopen = s3.open
write('bucketname/user/data/', dataframe, file_scheme='hive', partition_on=['date'], open_with=myopen)

I am running this in a Jupyter notebook. When I run it, everything works fine and the S3 path looks like this:

bucketname/user/data/date=2018-01-01/part-o.parquet.

However, on my local machine this folder structure gets created automatically (bucketname/user/data/date=2018-01-01/), but there is no parquet file in it. I am wondering if it is creating a local copy before moving the file to S3.

@martindurant
Member

OK, understood. No, the files are not first created locally and then copied.

As documented, you should supply not only the function to open files, but also the function to make directories. In the case of S3 there is no such concept as directories, so the function you provide should not actually do anything, but you still must provide it to avoid using the default, which makes local directories.
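
For concreteness, a minimal sketch of the snippet above with a no-op directory maker passed in via the mkdirs keyword described in the fastparquet documentation (the noop_mkdirs name is just an illustration):

import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()
myopen = s3.open

# S3 has no real directories, so make directory creation a no-op; otherwise
# the default mkdirs creates empty folders on the local disk.
noop_mkdirs = lambda path: None

write('bucketname/user/data/', dataframe, file_scheme='hive',
      partition_on=['date'], open_with=myopen, mkdirs=noop_mkdirs)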
