
Appending parquet file from python to s3 #327

Closed
Jeeva-Ganesan opened this issue Apr 17, 2018 · 5 comments

Jeeva-Ganesan commented Apr 17, 2018

Here is my snippet in spark-shell:

jdbcDF.write.mode("append").partitionBy("date").parquet("s3://bucket/Data/")

Problem description

Now, I am trying to do the same thing in Python with fastparquet.

import s3fs
from fastparquet import write
s3 = s3fs.S3FileSystem()
myopen = s3.open
write('****/20180101.parq', data, compression='GZIP', open_with=myopen)

First, I tried to save with Snappy compression:
write('****/20180101.snappy.parquet', data, compression='SNAPPY', open_with=myopen)

but got this error:

Compression 'SNAPPY' not available. Options: ['GZIP', 'UNCOMPRESSED']
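
(Side note: fastparquet only lists SNAPPY among the available codecs when the optional python-snappy package is installed in the environment; a minimal sketch, assuming that package is added, reusing the myopen opener from the snippet above:)

# Assumes the optional dependency is installed: pip install python-snappy
import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()
myopen = s3.open

# With python-snappy present, 'SNAPPY' appears in the available compression options.
write('****/20180101.snappy.parquet', data, compression='SNAPPY',
      open_with=myopen)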

Then I tried GZIP, which worked, but I am not sure how to append or create partitions here. Here is an issue I created in pandas: https://github.com/pandas-dev/pandas/issues/20638

Thanks.

@martindurant
Member

The write function has append and partition_on keyword arguments; see the documentation.
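
For illustration, a minimal sketch of how those keywords might be combined with the S3 opener shown earlier (the bucket/prefix path and the df_day1/df_day2 dataframes are placeholders, not taken from the issue):

import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()

# First write creates the partitioned dataset; file_scheme='hive' is needed
# so that partition_on can split files by column value.
write('bucketname/user/data', df_day1, file_scheme='hive',
      partition_on=['date'], compression='GZIP', open_with=s3.open)

# Later writes with append=True add new data to the existing dataset
# instead of overwriting it.
write('bucketname/user/data', df_day2, file_scheme='hive',
      partition_on=['date'], compression='GZIP', append=True,
      open_with=s3.open)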

@Jeeva-Ganesan
Author

Thanks, that works fine. However, while writing to S3 this also creates a copy of the folder structure on my machine. Is that expected?

@martindurant
Member

I'm afraid I don't follow you - can you please describe exactly what you did and what happened?

@Jeeva-Ganesan
Author

Jeeva-Ganesan commented Apr 18, 2018

OK, let me explain. I have this folder structure in S3: s3://bucketname/user/data/. And this is my code to write my partitions into it.

import s3fs
from fastparquet import write
s3 = s3fs.S3FileSystem()
myopen = s3.open
write('bucketname/user/data/', dataframe, file_scheme='hive', partition_on=['date'], open_with=myopen)

I am running this in a Jupyter notebook. When I run it, everything works fine and the S3 path looks like this:

bucketname/user/data/date=2018-01-01/part-o.parquet.

However, on my local machine this folder structure gets created automatically (bucketname/user/data/date=2018-01-01/), but there is no parquet file in it. I am wondering if it is creating a local copy before moving the file to S3.

@martindurant
Member

OK, understood. No, the files are not first created locally and then copied.

As documented, you should supply not only the function to open files, but also the function to make directories. In the case of S3 there is no such concept as directories, so the function you provide should not actually do anything, but you still must provide it to avoid using the default, which makes local directories.
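
For concreteness, a minimal sketch of the snippet above with a no-op directory maker passed in via the mkdirs keyword described in the fastparquet documentation (the noop_mkdirs name is just an illustration):

import s3fs
from fastparquet import write

s3 = s3fs.S3FileSystem()
myopen = s3.open

# S3 has no real directories, so make directory creation a no-op; otherwise
# the default mkdirs creates empty folders on the local disk.
noop_mkdirs = lambda path: None

write('bucketname/user/data/', dataframe, file_scheme='hive',
      partition_on=['date'], open_with=myopen, mkdirs=noop_mkdirs)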
