[C++] Add filesystem / IO implementation for Google Cloud Storage #17070

Closed
asfimport opened this issue Jul 17, 2017 · 39 comments

asfimport commented Jul 17, 2017

See this example as a jumping-off point:

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core/platform/cloud

Reporter: Wes McKinney / @wesm
Assignee: Carlos O'Ryan / @coryan

Note: This issue was originally created as ARROW-1231. Please see the migration documentation for further details.

Antoine Pitrou / @pitrou:
It doesn't look like there are any locally-running GCS-compatible servers, or at least I haven't found any...

Wes McKinney / @wesm:
Ah, I'm really glad there's an official C++ library now:

https://github.com/googleapis/google-cloud-cpp

Note that this library began after I originally opened this issue, so it's good that we waited.

https://github.com/googleapis/google-cloud-cpp/graphs/contributors

This will probably need to be added to conda-forge. We'll also need libcurl in our build toolchain...

Lei (Eddy) Xu:
Hey @wesm, do we have a plan to implement this now?

We are very interested in this feature.

Wes McKinney / @wesm:
I think we need some information about the recommended (fastest, most robust) way to use GCS from C++. @brills or @emkornfield, do you know what the current state of the art is (is it google-cloud-cpp?)

Zhuo Peng / @brills:
I don't work on related stuff, but looking at our internal site, google-cloud-cpp seems to be the right choice.

Micah might know more.

https://googleapis.dev/cpp/google-cloud-storage/latest/ seems to be the documentation for https://googleapis.github.io/google-cloud-cpp/ ?

Micah Kornfield / @emkornfield:
I think Zhuo is correct.  I've reached out internally to the person I believe to be the owner to confirm.

Frank Natividad:
Hi folks,

I'm confirming that the Cloud Storage library in https://github.com/googleapis/google-cloud-cpp is the current state of the art.

Cheers

Antoine Pitrou / @pitrou:
Does it provide better performance than the S3 endpoint?

Frank Natividad:
Hi Antoine, could you clarify what you mean by better performance than the S3 endpoint? 

Antoine Pitrou / @pitrou:
Doesn't GCS provide an S3-compatible endpoint? Is it detrimental to use the AWS SDK as opposed to the native GCS APIs?

Wes McKinney / @wesm:
Even if the performance of uploads and downloads is equivalent, I would guess that the SDK provided by the GCP development team will provide the most comprehensive access to GCS's features. And there are utilities (such as parallel uploading [1]) developed with GCS's particular characteristics in mind.

Frank Natividad:
The XML API does exist, and compatibility with the S3 SDK is available, but not for all operations. This isn't exhaustive, but here are three things I think about when saying that:

  • Permission management with IAM isn't available through the XML API, so the S3 SDK would need to rely on ACL management. The GCS team aligned with GCP permission management by supporting IAM policies at the Organization/Project/Bucket level; IAM policies are the canonical way to manage permissions on GCP. For example, Uniform Bucket Level Access was introduced to disable ACLs at the Bucket and Object level and is required to support conditional permission policies (IAM Conditions). ACLs are still supported, but they are not the canonical permission management and are very specific to GCS.

  • The S3 API isn't fully supported through the XML API; for example, S3 multi-part uploads are not supported. You'd need to develop around these limitations when you hit them.

  • The GCS C++ library was designed with the user friction seen in existing libraries in other languages in mind.

In this case, I would recommend the GCS C++ library over the S3 SDK if you're using C++ and the Cloud Storage API.

HTH
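
For illustration, reading an object with the native client looks roughly like the following. This is only a minimal sketch against a recent google-cloud-cpp release; "my-bucket" and "my-object" are placeholder names, and credentials are assumed to come from the environment:

```cpp
// Sketch only: read one object with the native GCS C++ client.
// Assumes a recent google-cloud-cpp release and Application Default
// Credentials; "my-bucket" / "my-object" are placeholders.
#include "google/cloud/storage/client.h"
#include <iostream>
#include <iterator>
#include <string>

namespace gcs = ::google::cloud::storage;

int main() {
  gcs::Client client;  // picks up credentials from the environment
  gcs::ObjectReadStream is = client.ReadObject("my-bucket", "my-object");
  std::string contents(std::istreambuf_iterator<char>{is}, {});
  if (!is.status().ok()) {
    std::cerr << "Read failed: " << is.status() << "\n";
    return 1;
  }
  std::cout << "Read " << contents.size() << " bytes\n";
  return 0;
}
```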

Antoine Pitrou / @pitrou:
Thank you for the explanation. I agree we should use the GCS C++ SDK.
Note to self: there's a standalone GCS-compatible server at https://github.com/fsouza/fake-gcs-server, should be useful for testing.
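For example, pointing the client at such a server could look roughly like this. This is a sketch under stated assumptions: the docker invocation, port, and scheme below are taken from my reading of the fake-gcs-server README and should be double-checked, and the endpoint/credentials options shown are those exposed by recent google-cloud-cpp releases (older releases used ClientOptions::set_endpoint() instead):

```cpp
// Sketch: create a GCS client that talks to a locally running
// fake-gcs-server instance, e.g. one started with something like
//   docker run -p 4443:4443 fsouza/fake-gcs-server -scheme http
// (port and flags are assumptions to verify against the project docs).
#include "google/cloud/storage/client.h"
#include "google/cloud/credentials.h"
#include "google/cloud/options.h"
#include <utility>

namespace gcs = ::google::cloud::storage;

gcs::Client MakeFakeGcsClient() {
  auto options =
      google::cloud::Options{}
          // Point the client at the local fake server instead of GCS.
          .set<gcs::RestEndpointOption>("http://localhost:4443")
          // The fake server does not check credentials.
          .set<google::cloud::UnifiedCredentialsOption>(
              google::cloud::MakeInsecureCredentials());
  return gcs::Client(std::move(options));
}
```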

Clark Zinzow:
Where does this fit in y'all's priority queue at the moment? @pitrou, are you planning to take this on? I saw that you did the S3 implementation.

I'm mostly interested in realizing the benefits of ARROW-8031 in Ray, but if no one plans on taking this issue in the next few weeks, I could try to find a few spare cycles to take it on (following the patterns set by Antoine's S3 implementation) and tackle ARROW-8031 thereafter.

Antoine Pitrou / @pitrou:
[~clarkzinzow] You're welcome to take a look.

Wes McKinney / @wesm:
[~clarkzinzow] note that google-cloud-cpp does not seem to be available in conda-forge yet, so I'm opening a child issue about dealing with that. I don't know of anyone for whom this is a short-term priority before later this year, so we are happy to help and give advice / code review.

Clark Zinzow:
@pitrou @wesm  Great, thanks! Is it safe to say that adding the google-cloud-cpp conda-forge recipe and adding google-cloud-cpp to ThirdPartyToolchain are the only true blockers for adding the GCS external store implementation for Plasma? If that's the case and if this issue isn't of high priority for anyone ATM, then I would probably prefer to work on ARROW-8031 instead of this issue after ARROW-8147 and ARROW-8148 are done, if that's acceptable.

Wes McKinney / @wesm:
[~clarkzinzow] well, adding the thirdparty dependencies is a necessary condition to be able to add a Filesystem implementation that wraps google-cloud-cpp, like

https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc

If you want to work on a filesystem implementation for GCS without dealing with the packaging / toolchain issues, you are welcome to do that also. At some point all of this work (the filesystem wrapper and thirdparty toolchain support) has to be done properly so that we can package and deploy the software all the places it needs to go.

Clark Zinzow:
@wesm Ah, I don't think I was very clear, sorry about that. I'm mostly interested (as an Arrow user) in being able to use GCS as an external store for Plasma, ARROW-8031; I was offering to work on the GCS filesystem implementation issue since I thought that it was a prerequisite for the Plasma external store issue. AFAICT, the only real blockers for working on the external store implementation for Plasma are adding a google-cloud-cpp conda-forge recipe and adding google-cloud-cpp to ThirdPartyToolchain, ARROW-8147 and ARROW-8148; i.e., if those packaging/toolchain issues are broken out as separate from the GCS filesystem issue, then this GCS filesystem implementation is not a prerequisite for the external store implementation for Plasma.

If that is the case, I'm asking if I (or someone else, if they are interested) could take on the packaging/toolchain issues ARROW-8147 and ARROW-8148, and once those are finished, I could work on the GCS external store implementation for Plasma.  This would leave the much larger effort around the GCS filesystem implementation for later.

Does that make sense?  And is my judgement of the actual GCS filesystem implementation not being a prerequisite for the GCS external store implementation for Plasma correct?

Wes McKinney / @wesm:
Perhaps I'm not understanding ARROW-8031. Are you proposing to use the generic Filesystem API (instead of a GCS implementation thereof) to offload objects from Plasma? If that's the case then I agree. Otherwise if you need to read/write to GCS in particular, without this issue being resolved I'm not sure how you can proceed.

Clark Zinzow:
Maybe I don't have a correct understanding of the external store interface and semantics. It was my impression after looking at the interface and the first pass at an S3 external store implementation (#diff-c17d56d3503f18faacf739e160958f6e) that essentially only a put and get interface has to be implemented, where the Plasma objects can be put/get to/from GCS buckets as opaque blobs using the C++ GCS client. Am I understanding that correctly?

Wes McKinney / @wesm:
I guess we may be talking past each other. The Arrow C++ build system needs to be informed about how to build and/or link to the Google C++ client libraries. In other words, adding an option to the build system such as -DARROW_GCS=ON, like the -DARROW_S3=ON we currently have. You are welcome to tackle the problem in any order you wish. I will wait for your pull requests.

Clark Zinzow:
Sorry for the confusion. My current plan is to tackle the packaging/toolchain issues ARROW-8147 and ARROW-8148, along with anything else required for the Arrow C++ build system to be able to build and link against the GCP C++ SDK. Once that is working, I'm planning on developing an external store GCS implementation for Plasma, ARROW-8031, so that objects can be evicted to GCS. AFAICT, this shouldn't involve much more than implementing the Put and Get interfaces using the C++ GCS client WriteObject and ReadObject APIs, respectively.
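
For what it's worth, the Put/Get mapping onto WriteObject/ReadObject could look roughly like the sketch below. PutBlob/GetBlob and their signatures are hypothetical helpers for illustration, not the Plasma ExternalStore interface or the eventual ARROW-8031 design:

```cpp
// Sketch: store and fetch a Plasma-style object as an opaque blob in GCS.
// PutBlob/GetBlob are hypothetical helpers, not the ExternalStore API.
#include "google/cloud/storage/client.h"
#include <iterator>
#include <string>

namespace gcs = ::google::cloud::storage;

// Upload one blob, keyed by the (string-encoded) object id.
google::cloud::Status PutBlob(gcs::Client& client, std::string const& bucket,
                              std::string const& object_id,
                              std::string const& payload) {
  gcs::ObjectWriteStream os = client.WriteObject(bucket, object_id);
  os.write(payload.data(), static_cast<std::streamsize>(payload.size()));
  os.Close();
  return os.metadata().status();  // OK iff the upload succeeded
}

// Download the blob back into `payload`.
google::cloud::Status GetBlob(gcs::Client& client, std::string const& bucket,
                              std::string const& object_id,
                              std::string* payload) {
  gcs::ObjectReadStream is = client.ReadObject(bucket, object_id);
  payload->assign(std::istreambuf_iterator<char>{is},
                  std::istreambuf_iterator<char>{});
  return is.status();
}
```

Retry policies, resumable uploads, and error handling around eviction would still need to be layered on top for real use.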

Antoine Pitrou / @pitrou:
Looks like the Python GCS client is unusable for non-default endpoints:
googleapis/python-storage#102

Wes McKinney / @wesm:
I see that TileDB (MIT license) has built a wrapper for GCS that may be a helpful resource whenever this gets implemented in the future

https://github.com/TileDB-Inc/TileDB/blob/dev/tiledb/sm/filesystem/gcs.cc

Wenbing Bai:
Hi @wesm, is there any plan to support this?

Wes McKinney / @wesm:
We need a volunteer to build the GCS C++ Filesystem implementation (and associated unit tests). Here's another wrapper which could be used to aid development in this project:

https://github.com/BlazingDB/blazingsql/blob/branch-0.15/io/src/FileSystem/private/GoogleCloudStorage_p.cpp

Carlos O'Ryan / @coryan:
Hi, by way of introduction, I am the lead developer for the GCS C++ SDK (https://github.com/googleapis/google-cloud-cpp/graphs/contributors).

I am interested in helping with the GCS C++ Filesystem implementation. I will start going through the documentation on how to contribute to the project; that may take me a few days. If there are questions or suggestions in the interim, please do not hesitate to ask.

Micah Kornfield / @emkornfield:
Hi Carlos, I've added you as a contributor in JIRA and assigned the issue to you. Note there is an open pull request for adding the CMake dependencies to the project; you might want to check whether the author intends to finish it (they haven't been too responsive) to avoid duplicate work.
