ARROW-1213: [Python] Support s3fs filesystem for Amazon S3 in ParquetDataset #916

Closed (wants to merge 8 commits)

Conversation

@wesm (Member) commented Jul 31, 2017

cc @yackoa

@wesm (Member Author) commented Jul 31, 2017

@martindurant I am using a private API from s3fs to create an os.walk-alike, could you let me know if what I'm doing is reasonable? https://github.com/apache/arrow/pull/916/files#diff-5120a0b2b8aec06d58e53ff1067b83c5R229
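
As an illustration only (not the PR's actual implementation), an os.walk-alike can also be built on the public listing API. This sketch assumes fs.ls(..., detail=True) returns entries with 'name' and 'type' keys, as current s3fs releases do; the bucket/prefix names are hypothetical:

import s3fs

def s3_walk(fs, path):
    """Yield (directory, subdirectories, files) tuples, roughly like os.walk.

    Unlike os.walk, subdirectories and files are yielded as full keys
    rather than basenames.
    """
    directories, files = [], []
    for entry in fs.ls(path, detail=True):
        if entry['type'] == 'directory':
            directories.append(entry['name'])
        else:
            files.append(entry['name'])
    yield path, directories, files
    for directory in directories:
        for item in s3_walk(fs, directory):
            yield item

# fs = s3fs.S3FileSystem()
# for root, dirs, filenames in s3_walk(fs, 'my_bucket/my_data'):
#     print(root, dirs, filenames)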

@martindurant

Certainly you can do this, and it looks about right. I'm surprised that you would want to, though; I have always used something like glob or the existing s3fs walk to get a plain list of files rather than iterating the os.walk tree.
Note that it is totally possible to have a "directory" and a "file" with exactly the same name, since the directories are only virtual anyway (i.e., just key prefixes, not an actual hierarchical structure). This suggests to me that this kind of walk is not a natural fit for S3 - but I understand that you would like consistency between the backends.
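
A minimal sketch of that flat-listing approach, with hypothetical bucket/prefix names and credentials taken from the environment:

import s3fs

fs = s3fs.S3FileSystem()

# Every file key below a prefix, as a plain list (fs.find in current s3fs releases):
keys = fs.find('my_bucket/my_data')
parquet_keys = [key for key in keys if key.endswith('.parquet')]

# Or match a pattern directly within a single "directory" level:
parquet_keys = fs.glob('my_bucket/my_data/*.parquet')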

@wesm (Member Author) commented Jul 31, 2017

@fjetter @xhochy this will hopefully fix the problem in dask/dask#2527 -- that patch is still needed in part to pass the filesystem object to ParquetDataset

wesm added 8 commits July 31, 2017 15:11
… Rename to HadoopFilesystem. Add walk implementation for HDFS, base Parquet directory walker on that

@asfgit asfgit closed this in af2aeaf Jul 31, 2017
@wesm wesm deleted the ARROW-1213 branch July 31, 2017 22:47
@yackoa commented Aug 1, 2017

Thank you for adding the feature!

@DrChrisLevy

So if I have an EC2 instance or, say, an EMR cluster on AWS, does this fix allow reading a directory of multiple Parquet files in S3 from pyarrow? I still can't find an example of this fix in action, and I can't get pq.ParquetDataset("path to s3 directory") working; I have tried importing s3fs too. Is there an example of using this new feature in the docs? Cheers.

@wesm (Member Author) commented Oct 18, 2017

@DrChrisLevy I opened ARROW-1682 (https://issues.apache.org/jira/browse/ARROW-1682) about adding some documentation for this. Are you passing the s3fs filesystem as an argument to ParquetDataset?

@DrChrisLevy

Thanks @wesm!
I figured it out by looking through the commit changes. If anyone comes across this thread, here is how you can read Parquet files from an S3 directory using pyarrow.

Make sure you have the packages installed:

pip install pyarrow
pip install s3fs

Python Code:

import s3fs
from pyarrow.filesystem import S3FSWrapper  # pyarrow's wrapper class for s3fs filesystems
import pyarrow.parquet as pq

access_key = '<your aws_access_key_id>'      # string with your AWS access key ID
secret_key = '<your aws_secret_access_key>'  # string with your AWS secret access key
fs = s3fs.S3FileSystem(key=access_key, secret=secret_key)

# Suppose you have some Parquet files stored under the
# S3 path: s3://my_bucket/my_data/my_favorite_data
bucket = 'my_bucket'
path = 'my_data/my_favorite_data'
bucket_uri = 's3://{bucket}/{path}'.format(bucket=bucket, path=path)

dataset = pq.ParquetDataset(bucket_uri, filesystem=fs)
table = dataset.read()
df = table.to_pandas()

@martindurant

Great to see arrow and s3fs working together, thanks for looking into it.
Note that you can also give your credentials via files (typically in ~/.aws) or environment variables, if you don't want them to be stored within your code. Also, if you are on AWS hardware, then credentials should generally be available via the IAM service - see the s3fs docs.
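
A minimal sketch of those alternatives, with no secrets in the code:

import s3fs

# With no key/secret passed, s3fs falls back to the usual AWS credential chain:
# ~/.aws/credentials, AWS_* environment variables, or the instance's IAM role.
fs = s3fs.S3FileSystem()

# Public buckets can also be read anonymously.
fs_public = s3fs.S3FileSystem(anon=True)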

@DrChrisLevy

@martindurant yeah, of course; better not to hard-code the credentials. Just wanted to get a working example. Thanks.

@martindurant

OK, just being sure :)

@martindurant

@wesm, you may want to include documentation with examples like this not just for s3fs, but also for the other Pythonic file-system implementations I know about (gcsfs, adlfs, perhaps hdfs3, although Arrow already supports HDFS, of course).

@yackoa commented Oct 30, 2017

@wesm does the wrapper take care of writing to S3 as well, using s3fs?

@wesm (Member Author) commented Oct 30, 2017

@yackoa yes, though if you are using write_to_dataset then you will need ARROW-1555 (part of Arrow 0.8.0): 4db0046
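
For reference, a hedged write-side sketch under the same assumptions as the read example above (hypothetical bucket/prefix, credentials picked up from the environment, Arrow >= 0.8.0 for the write_to_dataset path):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()
table = pa.Table.from_pandas(pd.DataFrame({'part': ['a', 'a', 'b'], 'value': [1, 2, 3]}))

# A single Parquet file, written through an s3fs file handle:
with fs.open('my_bucket/my_data/single_file.parquet', 'wb') as f:
    pq.write_table(table, f)

# A partitioned dataset; this is the code path that needs ARROW-1555:
pq.write_to_dataset(table, root_path='my_bucket/my_data/dataset',
                    partition_cols=['part'], filesystem=fs)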

@gsakkis (Contributor) commented Jan 25, 2018

Sorry to resurrect this, but has there been a regression since then? I am trying the code sample from @DrChrisLevy above and I am getting an IndexError. It looks like it doesn't like the s3:// scheme; passing bucket/path works.
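
For anyone hitting the same error, the workaround described here is simply to drop the s3:// scheme and pass bucket/path; the names below reuse the hypothetical layout from the earlier example:

import s3fs
import pyarrow.parquet as pq

fs = s3fs.S3FileSystem()

# With the affected versions this raised IndexError:
# dataset = pq.ParquetDataset('s3://my_bucket/my_data/my_favorite_data', filesystem=fs)

# while the scheme-less form worked:
dataset = pq.ParquetDataset('my_bucket/my_data/my_favorite_data', filesystem=fs)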

@martindurant

On the s3fs side, paths starting with s3:// are still supported; there are tests for that.

@gsakkis (Contributor) commented Jan 25, 2018

Looks like the (or one) issue is in S3FSWrapper.isfile: the condition contents[0] == path is true without the scheme and false with it.

@wesm (Member Author) commented Jan 25, 2018

OK, let's open a new JIRA so we can fix and add a test for this

@wesm (Member Author) commented Jan 25, 2018

https://issues.apache.org/jira/browse/ARROW-2038

@AlekseyYuvzhikVB

> Great to see arrow and s3fs working together, thanks for looking into it.
> Note that you can also give your credentials via files (typically in ~/.aws) or environment variables, if you don't want them to be stored within your code. Also, if you are on AWS hardware, then credentials should generally be available via the IAM service - see the s3fs docs.

I'm using pyarrow with several AWS profiles in ~/.aws/credentials. My code works fine with the default profile, but when I use a non-default profile to access the S3 bucket it returns:

data_set = pq.ParquetDataset(paths, filesystem=fs)
  File "/Library/Python/3.7/site-packages/pyarrow/parquet.py", line 1170, in __init__
    open_file_func=partial(_open_dataset_file, self._metadata)
  File "/Library/Python/3.7/site-packages/pyarrow/parquet.py", line 1365, in _make_manifest
    .format(path))
OSError: Passed non-file path: s3://<valid path to parquet file>

Details are here: https://stackoverflow.com/questions/64565926/getting-oserror-passed-non-file-path-using-pyarrow-parquetdataset
Do you know how to fix this issue?
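
A hedged sketch of pointing s3fs at a non-default profile; the profile name is hypothetical, and the profile keyword is the one documented for current s3fs releases (older releases may need a pre-configured session instead):

import s3fs
import pyarrow.parquet as pq

# 'my-other-profile' must exist in ~/.aws/credentials
fs = s3fs.S3FileSystem(profile='my-other-profile')
dataset = pq.ParquetDataset('my_bucket/my_data/my_favorite_data', filesystem=fs)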

@martindurant

@AlekseyYuvzhikVB - answered on SO. Please avoid posting in multiple places.

@jorisvandenbossche (Member) commented Oct 30, 2020

@martindurant a follow-up of what you commented on SO (a bit easier here, also because it is off topic for SO):

> It is weird that you are passing an s3fs instance to arrow,

Why is that weird? Isn't that the whole reason that fsspec filesystems were subclassing from pyarrow.filesystem.FileSystem if pyarrow was installed (and a similar argument applies to the original changes in this PR): so that pyarrow would work with fsspec-based filesystems (like s3fs)?

@martindurant

I'm not sure I am remembering the correct question, but I think the weird thing was not passing the instance (which is expected), but also using boto directly to list and filter files.

@jorisvandenbossche (Member)

https://stackoverflow.com/questions/64565926/getting-oserror-passed-non-file-path-using-pyarrow-parquetdataset

I agree with what you say above, but that's not what you said on SO ;) It might be an oversight; I made an edit on SO.

@jorisvandenbossche (Member)

Ah, no, I see: I just misunderstood your sentence. The weird thing is not the first part (passing an s3fs instance to arrow) but the second part of the sentence. I think that's easy to misread (as I did ;)); I will try to clarify.

@martindurant

Ordering and commas...
The thought process would have been along the lines of: "OK, this is about s3fs with pyarrow, but wait, you're using other stuff too."
