ARROW-1213: [Python] Support s3fs filesystem for Amazon S3 in ParquetDataset #916

Closed (wants to merge 8 commits)

Conversation

@wesm (Member) commented Jul 31, 2017

cc @yackoa

@wesm (Member Author) commented Jul 31, 2017

@martindurant I am using a private API from s3fs to create an os.walk-alike, could you let me know if what I'm doing is reasonable? https://github.com/apache/arrow/pull/916/files#diff-5120a0b2b8aec06d58e53ff1067b83c5R229
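
As an illustration only (not the PR's actual implementation), an os.walk-alike can also be built on the public listing API. This sketch assumes fs.ls(..., detail=True) returns entries with 'name' and 'type' keys, as current s3fs releases do; the bucket/prefix names are hypothetical:

import s3fs

def s3_walk(fs, path):
    """Yield (directory, subdirectories, files) tuples, roughly like os.walk.

    Unlike os.walk, subdirectories and files are yielded as full keys
    rather than basenames.
    """
    directories, files = [], []
    for entry in fs.ls(path, detail=True):
        if entry['type'] == 'directory':
            directories.append(entry['name'])
        else:
            files.append(entry['name'])
    yield path, directories, files
    for directory in directories:
        for item in s3_walk(fs, directory):
            yield item

# fs = s3fs.S3FileSystem()
# for root, dirs, filenames in s3_walk(fs, 'my_bucket/my_data'):
#     print(root, dirs, filenames)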

@martindurant

Certainly you can do this, and it looks about right. I'm surprised that you would want to, though; I have always used something like glob or the existing s3fs walk to get a plain list of files rather than iterating the os.walk tree.
Note that it is totally possible to have a "directory" and a "file" with exactly the same name, since the directories are only virtual anyway (i.e., just key prefixes, not an actual hierarchical structure). This suggests to me that this kind of walk is not a natural fit for S3 - but I understand that you would like consistency between the backends.
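
A minimal sketch of that flat-listing approach, with hypothetical bucket/prefix names and credentials taken from the environment:

import s3fs

fs = s3fs.S3FileSystem()

# Every file key below a prefix, as a plain list (fs.find in current s3fs releases):
keys = fs.find('my_bucket/my_data')
parquet_keys = [key for key in keys if key.endswith('.parquet')]

# Or match a pattern directly within a single "directory" level:
parquet_keys = fs.glob('my_bucket/my_data/*.parquet')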

@wesm (Member Author) commented Jul 31, 2017

@fjetter @xhochy this will hopefully fix the problem in dask/dask#2527 -- that patch is still needed in part to pass the filesystem object to ParquetDataset

wesm added 8 commits July 31, 2017 15:11
… Rename to HadoopFilesystem. Add walk implementation for HDFS, base Parquet directory walker on that

@asfgit asfgit closed this in af2aeaf Jul 31, 2017
@wesm wesm deleted the ARROW-1213 branch July 31, 2017 22:47
@yackoa commented Aug 1, 2017

Thank you for adding the feature!

@DrChrisLevy

So if I have an EC2 instance or, say, an EMR cluster on AWS, does this fix allow reading a directory of multiple Parquet files in S3 from pyarrow? I still can't find an example of this fix in action, and I can't get pq.ParquetDataset("path to s3 directory") working; I have tried importing s3fs too. Is there an example of using this new feature in the docs? Cheers.

@wesm (Member Author) commented Oct 18, 2017

@DrChrisLevy I opened ARROW-1682 (https://issues.apache.org/jira/browse/ARROW-1682) about adding some documentation for this. Are you passing the s3fs filesystem as an argument to ParquetDataset?

@DrChrisLevy

Thanks @wesm!
I figured it out by looking through the commit changes. If anyone comes across this thread, here is how you can read Parquet files from an S3 directory using pyarrow.

Make sure you have the packages installed:

pip install pyarrow
pip install s3fs

Python Code:

import s3fs
from pyarrow.filesystem import S3FSWrapper  # pyarrow's wrapper class for s3fs filesystems
import pyarrow.parquet as pq

access_key = '<your aws_access_key_id>'      # string with your AWS access key ID
secret_key = '<your aws_secret_access_key>'  # string with your AWS secret access key
fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)

# Suppose you have some Parquet files stored under the
# S3 path: s3://my_bucket/my_data/my_favorite_data
bucket = 'my_bucket'
path = 'my_data/my_favorite_data'
bucket_uri = 's3://{bucket}/{path}'.format(bucket=bucket, path=path)

dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
table = dataset.read()
df = table.to_pandas()

@martindurant

Great to see arrow and s3fs working together, thanks for looking into it.
Note that you can also give your credentials via files (typically in ~/.aws) or environment variables, if you don't want them to be stored within your code. Also, if you are on AWS hardware, then credentials should generally be available via the IAM service - see the s3fs docs.
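
A minimal sketch of those alternatives, with no secrets in the code:

import s3fs

# With no key/secret passed, s3fs falls back to the usual AWS credential chain:
# ~/.aws/credentials, AWS_* environment variables, or the instance's IAM role.
fs = s3fs.S3FileSystem()

# Public buckets can also be read anonymously.
fs_public = s3fs.S3FileSystem(anon=True)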

@DrChrisLevy

@martindurant yeah, of course; better not to hard-code the credentials. Just wanted to get a working example. Thanks.

@martindurant

OK, just being sure :)

@martindurant

@wesm, you may want to include documentation with examples like this not just for s3fs, but also for the other Pythonic file-system implementations I know about (gcsfs, adlfs, perhaps hdfs3, although Arrow already supports HDFS, of course).

@yackoa commented Oct 30, 2017

@wesm does the wrapper take care of writing to S3 as well, using s3fs?

@wesm (Member Author) commented Oct 30, 2017

@yackoa yes, though if you are using write_to_dataset then you will need ARROW-1555 (part of Arrow 0.8.0): 4db0046
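
For reference, a hedged write-side sketch under the same assumptions as the read example above (hypothetical bucket/prefix, credentials picked up from the environment, Arrow >= 0.8.0 for the write_to_dataset path):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
table = pa.Table.from_pandas(pd.DataFrame({'part': ['a', 'a', 'b'], 'value': [1, 2, 3]}))

# A single Parquet file, written through an s3fs file handle:
with fs.open('my_bucket/my_data/single_file.parquet', 'wb') as f:
    pq.write_table(table, f)

# A partitioned dataset; this is the code path that needs ARROW-1555:
pq.write_to_dataset(table, root_path='my_bucket/my_data/dataset',
                    partition_cols=['part'], filesystem=fs)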

@gsakkis (Contributor) commented Jan 25, 2018

Sorry to resurrect this, but has there been a regression since then? I am trying the code sample from @DrChrisLevy above and I am getting an IndexError. It looks like it doesn't like the s3:// scheme; passing bucket/path works.
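
For anyone hitting the same error, the workaround described here is simply to drop the s3:// scheme and pass bucket/path; the names below reuse the hypothetical layout from the earlier example:

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()

# With the affected versions this raised IndexError:
# dataset = pq.ParquetDataset('s3://my_bucket/my_data/my_favorite_data', filesystem=fs)

# while the scheme-less form worked:
dataset = pq.ParquetDataset('my_bucket/my_data/my_favorite_data', filesystem=fs)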

@martindurant

On the s3fs side, paths starting with s3:// are still supported; there are tests for that.

@gsakkis (Contributor) commented Jan 25, 2018

Looks like the (or one) issue is in S3FSWrapper.isfile: the condition contents[0] == path is true without the scheme and false with it.

@wesm (Member Author) commented Jan 25, 2018

OK, let's open a new JIRA so we can fix and add a test for this

@wesm (Member Author) commented Jan 25, 2018

https://issues.apache.org/jira/browse/ARROW-2038

@AlekseyYuvzhikVB

> Great to see arrow and s3fs working together, thanks for looking into it.
> Note that you can also give your credentials via files (typically in ~/.aws) or environment variables, if you don't want them to be stored within your code. Also, if you are on AWS hardware, then credentials should generally be available via the IAM service - see the s3fs docs.

I'm using pyarrow with several AWS profiles in ~/.aws/credentials. My code works fine with the default profile, but when I use a non-default profile to access the S3 bucket it returns:

data_set = pq.ParquetDataset(paths, filesystem=fs)
  File "/Library/Python/3.7/site-packages/pyarrow/parquet.py", line 1170, in __init__
    open_file_func=partial(_open_dataset_file, self._metadata)
  File "/Library/Python/3.7/site-packages/pyarrow/parquet.py", line 1365, in _make_manifest
    .format(path))
OSError: Passed non-file path: s3://<valid path to parquet file>

Details are here: https://stackoverflow.com/questions/64565926/getting-oserror-passed-non-file-path-using-pyarrow-parquetdataset
Do you know how to fix this issue?
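
A hedged sketch of pointing s3fs at a non-default profile; the profile name is hypothetical, and the profile keyword is the one documented for current s3fs releases (older releases may need a pre-configured session instead):

import s3fs
import pyarrow.parquet as pq

# 'my-other-profile' must exist in ~/.aws/credentials
fs = s3fs.S3FileSystem(profile='my-other-profile')
dataset = pq.ParquetDataset('my_bucket/my_data/my_favorite_data', filesystem=fs)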

@martindurant

@AlekseyYuvzhikVB - answered on SO. Please avoid posting in multiple places.

@jorisvandenbossche (Member) commented Oct 30, 2020

@martindurant a follow-up of what you commented on SO (a bit easier here, also because it is off topic for SO):

> It is weird that you are passing an s3fs instance to arrow,

Why is that weird? Isn't that the whole reason that fsspec filesystems were subclassing from pyarrow.filesystem.FileSystem if pyarrow was installed (and a similar argument applies to the original changes in this PR): so that pyarrow would work with fsspec-based filesystems (like s3fs)?

@martindurant

I'm not sure I am remembering the correct question, but I think the weird thing was not passing the instance (which is expected), but also using boto directly to list and filter files.

@jorisvandenbossche (Member)

https://stackoverflow.com/questions/64565926/getting-oserror-passed-non-file-path-using-pyarrow-parquetdataset

I agree with what you say above, but that's not what you said on SO ;) It might be an oversight; I made an edit on SO.

@jorisvandenbossche (Member)

Ah, no, I see: I just misunderstood your sentence. The weird thing is not the first part (passing an s3fs instance to arrow) but the second part of the sentence. I think that's easy to misread (as I did ;)); I will try to clarify.

@martindurant

Ordering and commas...
The thought process would have been along the lines of: "OK, this is about s3fs with pyarrow, but wait, you're using other stuff too."
