S3 storage backend #1201

m-ildefons · 2022-10-28T09:23:51Z

Dear Mr. Rosdahl,

a few weeks ago, I implemented a secondary storage backend for S3 storage. The code is currently in proof-of-concept status and lives here.
The implementation is based on the aws-cpp-sdk at the moment and has been tested manually against a local S3 endpoint.
I would very much like to commit this feature upstream and appreciate any feedback.

I can think of several use cases where this feature would come in handy. For example, when using GitHub-hosted runners to build large code bases, the S3 backend would provide an economical way to keep the cached data longer than GitHub's built-in storage allows. This is useful when builds happen only sporadically, as GitHub's built-in storage has rather short retention policies. According to my napkin math, it's also more economical than using the existing HTTP or Redis backends, because S3 is much cheaper than an EC2 instance or a managed Redis instance with the same amount of storage, at least within AWS.

Best regards.

afbjorklund · 2022-10-31T08:56:12Z

There were some similar discussions regarding Azure Blob storage and azure-sdk-for-cpp here:

Azure blob secondary storage #1152

Something that would be nice-to-have would be a proxy, that could talk to the cloud storage backends.

This exists for the Redis backend today, but would be nice also for HTTP - and maybe even for File... ?

This proxy sets up the connection to the backend, handles the TLS overhead and the authentication etc.

Then the local communication can use some efficient unix socket, and just worry about get/put (remove).

It could also use some "plugin" system (or different servers?), to handle the bloat of these cloud SDK libraries.

Currently ccache doesn't do any SSL, because of this overhead (both in startup time and in code size):

HTTPS support for secondary storage? #894

Using SSL might be mandatory in these environments, though.

afbjorklund · 2022-10-31T08:59:32Z

> According to my napkin math, it's also more economic than using the existing HTTP or Redis backends,

A performance vs pricing comparison would also be great to have, between the different AWS alternatives.

Compare #1152 (comment)

Like a blog post or some such?

afbjorklund · 2022-10-31T09:01:40Z

You should also compare with s3-fuse (and FileStorage)

m-ildefons · 2022-10-31T12:10:11Z

Hey Andreas,

thanks for the input. I've seen the Azure blob storage discussion but afaik Azure blob storage uses a different protocol, which is why I created a separate discussion for S3. Also that discussion somewhat changed course away from adding support natively to creating a config using blobfuse.
I'm no fan of layering a file abstraction in between, because I don't think it's going to help performance, and the additional complexity would probably drive users away from actually putting it into production. Keep in mind that this should ideally be useful in situations like CI pipelines, where there are only limited possibilities for running additional daemons, mounting filesystems, etc. And not to forget, the FileStorage backend is basically just implementing a local object store on top of a file system, exposing a simple get/put/delete API internally to Ccache. Cutting that out of the equation when there is already object storage would be favorable IMO.
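To illustrate what "a local object store on top of a file system" with a get/put/delete API can look like, here is a minimal sketch. It is not ccache's actual FileStorage code: the class name `FileObjectStore`, the hashed two-level directory layout, and the method names are assumptions made up for this example.

```python
import hashlib
import os
import tempfile
from pathlib import Path
from typing import Optional


class FileObjectStore:
    """Hypothetical get/put/remove store that keeps one file per key."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def _path(self, key: str) -> Path:
        # Assumption: hash the key so any key maps to a safe file name, and
        # use the first two hex digits as a subdirectory to keep directories small.
        digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return self.root / digest[:2] / digest[2:]

    def put(self, key: str, value: bytes) -> None:
        path = self._path(key)
        path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary file and rename it into place, so that
        # concurrent readers never observe a partially written entry.
        fd, tmp_name = tempfile.mkstemp(dir=str(path.parent))
        with os.fdopen(fd, "wb") as tmp_file:
            tmp_file.write(value)
        os.replace(tmp_name, path)

    def get(self, key: str) -> Optional[bytes]:
        # Returns None on a cache miss.
        try:
            return self._path(key).read_bytes()
        except FileNotFoundError:
            return None

    def remove(self, key: str) -> bool:
        # Returns True if an entry was removed, False if there was none.
        try:
            self._path(key).unlink()
            return True
        except FileNotFoundError:
            return False
```

An S3 bucket already offers these three operations natively, which is the point made above: mounting a bucket as a file system only to rebuild the same API on top of it adds a layer without adding capability.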

The two advantages I see for the proxy/s3-fuse solution are a) it might allow re-use of a single TLS connection, eliminating the overhead of the handshake, and b) it would separate the code bases such that Ccache itself doesn't have to pull in so much protocol-specific code. I'd quite like to see a performance comparison.

In this implementation, TLS is handled by the aws-cpp-sdk. If needed, the part of the S3 protocol we need is simple enough to implement in Ccache directly, but that would require adding TLS support there. The overhead of the TLS handshake is unfortunate, but it often pales in comparison to the compilation time some sources require. This is especially true inside limited CI environments like GitHub Actions. So even if it's not as fast as reusing a single connection, there are gains to be made.
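As a rough sketch of how small the needed subset is, the three cache operations map onto three HTTP methods against an object URL. The helper below only builds the method and a path-style URL; it does not send anything. The names `S3Request` and `s3_request`, as well as the endpoint and bucket in the usage note, are hypothetical. Request signing (AWS Signature Version 4) and TLS are deliberately left out, and they are where most of the remaining work would be.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class S3Request:
    """Hypothetical container for the parts of a request we care about."""
    method: str
    url: str


def s3_request(operation: str, endpoint: str, bucket: str, key: str) -> S3Request:
    """Map a cache operation to an HTTP method and a path-style object URL."""
    methods = {"get": "GET", "put": "PUT", "remove": "DELETE"}
    if operation not in methods:
        raise ValueError(f"unsupported operation: {operation}")
    # Path-style addressing: <endpoint>/<bucket>/<key>.
    # The key is percent-encoded, but slashes inside it are kept.
    url = "{}/{}/{}".format(
        endpoint.rstrip("/"), quote(bucket, safe=""), quote(key, safe="/")
    )
    return S3Request(methods[operation], url)
```

For example, `s3_request("get", "http://localhost:9000", "ccache", "ab/cdef")` yields a `GET` for `http://localhost:9000/ccache/ab/cdef`. A `put` would additionally carry the cache entry as the request body.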
I'll admit that pulling in large parts of the aws-cpp-sdk of which we only need a very small subset isn't really nice.

The proxy daemon with Unix pipe to Ccache solution is something I haven't thought about yet. It sounds like it would provide the best performance at the cost of having to configure an additional daemon. Can you provide me with a link to that Redis proxy? I failed to find it myself.

I might do a blog post about this feature once there is a clear path to upstreaming it; the pricing argument would definitely be discussed in more detail there. Keep in mind that so far I've just done some napkin math for this.

One more thing to keep in mind is that S3 doesn't necessarily mean AWS. There are multiple solutions for providing self-hosted S3 endpoints, and these don't necessarily need to use TLS. Using such an S3 endpoint as a secondary cache can be very advantageous when you have the endpoint there already. And on a private network, TLS may not be a requirement.

afbjorklund · 2022-10-31T14:43:53Z

It would be "nice" with a native solution, just saying that there are alternatives...
(for completeness, you could also "just use a disk" and mount it with NFS etc* ?)

* as in EFS: https://docs.aws.amazon.com/efs/

And the solution could handle both, with the daemon being a performance add-on.
That is how it worked with the proxies for memcached and redis; they are optional.

Original one for memccache was couchbase "moxi"

> Can you provide me with a link to that Redis proxy? I failed to find it myself.

https://github.com/twitter/twemproxy (nutcracker)

https://github.com/ccache/ccache/wiki/Redis-storage

afbjorklund · 2022-10-31T15:14:33Z

> I'll admit that pulling in large parts of the aws-cpp-sdk of which we only need a very small subset isn't really nice.

To be honest I don't really know what these SDKs are providing in addition to the existing HTTP/HTTPS storage ?

m-ildefons · 2022-10-31T16:07:27Z

Thanks for pulling that link up.
I can only speak for the aws-sdk, as I haven't used the azure one. It provides a convenient way to access the various HTTP endpoints with the right headers set and the right request types and TLS handled and everything from C++. To do that they provide lots of classes modelling the data returned by various API calls and lots of utility functions and classes to create the objects that you send to the API. So naturally there's a lot more in there than just the GET/PUT/DELETE part that we need, e.g. lots of classes for handling ACLs and Lifecycles and Quotas etc.

In any case, I just stumbled across sccache (https://github.com/mozilla/sccache), which has storage backends for both Azure and S3.

afbjorklund · 2022-11-03T19:12:09Z

I think it will need a plugin system, before it can depend on external libraries like SSL or SDK (without being turned OFF)

HTTPS support for secondary storage? #894

jrosdahl · 2022-11-07T18:23:27Z

Hi @m-ildefons, thanks for the feature request. It would certainly be good to support S3 storage.

I am unfortunately not interested in adding an S3 storage backend to the current set of backends if the backend depends on AWS-SDK. I have now written some background and thoughts about it here: #1214

m-ildefons · 2022-11-08T07:57:54Z

Hi @jrosdahl ,
thanks for getting back to me. The proposal for the long-lived backend service sounds very reasonable to me. I've also taken a brief look at the https://github.com/mozilla/sccache implementation, and it seems they are doing exactly that, except they use a TCP socket instead.
Off the top of my head, there are several things we should think of when implementing that backend daemon:

- We should avoid situations where user/group permissions can cause trouble, e.g. in directories that are shared by multiple users.
- Parallel compilations would result in multiple ccache processes accessing the same backend process at the same time; this should not cause contention.
- We might want to consider shared memory instead of sockets for performance reasons, although that requires a carefully designed API for the backend daemon.
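To make the concurrency point concrete, here is a minimal sketch of a long-lived daemon that serves several clients at once over a Unix socket. Everything about it is an assumption for illustration: the line-based `PUT`/`GET`/`DEL` wire format, the hex-encoded values, the in-memory dictionary standing in for the remote backend, and the names `CacheDaemon`, `CacheHandler` and `request`. It is not the protocol proposed in #1214. It relies on `socketserver.ThreadingUnixStreamServer`, so it only runs on platforms with Unix domain sockets.

```python
import socket
import socketserver
import threading
from typing import Dict, List


class CacheHandler(socketserver.StreamRequestHandler):
    """Serves one client connection; each connection gets its own thread."""

    def handle(self) -> None:
        # One command per line, one reply line per command.
        for raw in self.rfile:
            parts = raw.decode("ascii").split()
            reply = self.server.execute(parts)
            self.wfile.write(reply.encode("ascii") + b"\n")


class CacheDaemon(socketserver.ThreadingUnixStreamServer):
    daemon_threads = True

    def __init__(self, socket_path: str) -> None:
        super().__init__(socket_path, CacheHandler)
        # Stand-in for the real backend connection (S3, Redis, ...).
        self._entries: Dict[str, str] = {}
        self._lock = threading.Lock()

    def execute(self, parts: List[str]) -> str:
        # A single lock keeps the sketch simple; a real daemon would
        # want finer-grained locking so slow requests do not block others.
        with self._lock:
            if len(parts) == 3 and parts[0] == "PUT":
                self._entries[parts[1]] = parts[2]
                return "OK"
            if len(parts) == 2 and parts[0] == "GET":
                value = self._entries.get(parts[1])
                return "MISS" if value is None else "VALUE " + value
            if len(parts) == 2 and parts[0] == "DEL":
                removed = self._entries.pop(parts[1], None)
                return "MISS" if removed is None else "OK"
        return "ERROR"


def request(socket_path: str, line: str) -> str:
    """Client helper: send one command and return the one-line reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(line.encode("ascii") + b"\n")
        with sock.makefile("rb") as reader:
            return reader.readline().decode("ascii").strip()
```

A server would be started with `CacheDaemon(path).serve_forever()`, typically in a background thread or a separate process, and each ccache-like client would call `request(path, "GET <key>")`. The permissions concern from the list above shows up here as the file mode and location of the socket path.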

m-ildefons added the feature New or improved feature label Oct 28, 2022

jrosdahl mentioned this issue Nov 7, 2022

New storage backend model #1214

Open

jrosdahl changed the title ~~S3 Storage Backend~~ S3 storage backend Dec 25, 2022
