ARROW-1213: [Python] Support s3fs filesystem for Amazon S3 in ParquetDataset #916
Conversation
@martindurant I am using a private API from s3fs to create an …
Certainly you can do this, and it looks about right. I'm surprised that you would want to; I have always preferred something like glob or the existing s3fs walk to get a plain list of files rather than iterating the os.walk tree.
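For illustration, a minimal sketch of that flat-listing approach (the bucket path is a placeholder, and AWS credentials are assumed to be available from the environment or ~/.aws):

```python
import s3fs

# Assumes credentials are picked up from the environment or ~/.aws
fs = s3fs.S3FileSystem()

# glob returns a flat list of matching keys -- no tree iteration required
files = fs.glob('my-bucket/dataset/*.parquet')
print(files)
```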
@fjetter @xhochy this will hopefully fix the problem in dask/dask#2527 -- that patch is still needed in part to pass the filesystem object to …
… Rename to HadoopFilesystem. Add walk implementation for HDFS; base Parquet directory walker on that.
Thank you for adding the feature!
So if I have an EC2 instance, or say an EMR cluster on AWS, does this fix allow for reading a directory of multiple Parquet files in S3 from pyarrow? I still can't find an example of this fix in action, and I can't get pq.ParquetDataset("path to s3 directory") working. I have tried importing s3fs too. Is there an example of using this new feature in the docs? Cheers.
@DrChrisLevy I opened ARROW-1682 (https://issues.apache.org/jira/browse/ARROW-1682) about adding some documentation for this. Are you passing the s3fs filesystem as an argument to …
Thanks @wesm! Make sure you have the `s3fs` and `pyarrow` packages installed.
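Python code -- a minimal sketch of the kind of example meant here (the bucket path is a placeholder, and credentials are assumed to come from the environment or ~/.aws/credentials):

```python
import s3fs
import pyarrow.parquet as pq

# s3fs picks up AWS credentials from the environment or ~/.aws/credentials
fs = s3fs.S3FileSystem()

# Point ParquetDataset at the S3 directory and pass the s3fs filesystem in
dataset = pq.ParquetDataset('my-bucket/path/to/parquet-dir', filesystem=fs)

# Read all pieces into a single Arrow table, then convert to pandas
table = dataset.read()
df = table.to_pandas()
```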
Great to see arrow and s3fs working together, thanks for looking into it.
@martindurant yea of course, better not to hard-code the credentials. Just wanted to get a working example. Thanks
OK, just being sure :)
@wesm, you may want to include documentation with examples like this not just for s3fs but also for the other Pythonic file-system implementations I know about (gcsfs, adlfs, perhaps hdfs3 - although arrow already supports HDFS, of course).
@wesm does the wrapper take care of writing to S3 as well using s3fs?
@yackoa yes -- though, if you are using …
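For reference, a sketch of one way the write path can look with an s3fs filesystem (not necessarily what the truncated reply above describes; the bucket path is hypothetical):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import s3fs

fs = s3fs.S3FileSystem()  # assumes credentials from the environment

# Build a small Arrow table and write it out as a Parquet dataset on S3
table = pa.Table.from_pandas(pd.DataFrame({'x': [1, 2, 3]}))
pq.write_to_dataset(table, 'my-bucket/output', filesystem=fs)
```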
Sorry to resurrect this, but has there been a regression since then? I am trying the code sample from @DrChrisLevy above and I am getting …
On the s3fs side, paths starting …
Looks like the (or one) issue is in S3FSWrapper.isfile: the condition …
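Whatever the exact condition, one plausible failure mode of this kind (purely illustrative, not the actual pyarrow code) is a scheme-prefix mismatch between the user's path and the keys that s3fs listings return:

```python
# Illustrative only: s3fs listings return bucket-relative keys without the
# 's3://' scheme, so a naive equality check against a scheme-prefixed input
# never matches, and an isfile()-style check would wrongly report False.
path = 's3://my-bucket/data/file.parquet'   # hypothetical user input
listing = ['my-bucket/data/file.parquet']   # shape of an fs.ls(path) result
print(listing[0] == path)                   # False, despite the same object
```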
OK, let's open a new JIRA so we can fix and add a test for this.
I'm using pyarrow and several AWS profiles in ~/.aws/credentials, and my code works fine with the default profile but it returns …
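One possible fix, sketched as an assumption rather than a confirmed answer: construct the s3fs filesystem against the desired profile explicitly and pass it through:

```python
import s3fs
import pyarrow.parquet as pq

# 'analytics' is a hypothetical profile name from ~/.aws/credentials;
# recent s3fs takes `profile=`, while some older releases used `profile_name=`.
fs = s3fs.S3FileSystem(profile='analytics')

dataset = pq.ParquetDataset('my-bucket/path/to/parquet-dir', filesystem=fs)
```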
@AlekseyYuvzhikVB - answered on SO. Please avoid posting in multiple places.
@martindurant a follow-up to what you commented on SO (a bit easier here, also because it is off-topic for SO):
Why is that weird? Isn't that the whole reason that fsspec filesystems were subclassing pyarrow.filesystem.FileSystem when pyarrow was installed (and a similar argument applies to the original changes in this PR): so that pyarrow would work with fsspec-based filesystems like s3fs?
I'm not sure I am remembering the correct question, but I think the weird thing was not passing the instance (which is expected), but also using boto directly to list and filter files. |
I agree with what you say above, but that's not what you said on SO ;) Might be an oversight; I made an edit on SO.
Ah, no, I see: I just misunderstood your sentence. The weird thing is not the first part (passing an s3fs instance to arrow) but the second part of the sentence. I think that's easy to misread (as I did ;)), will try to clarify.
Ordering and commas... |
cc @yackoa