
signed urls #283

Open
rabernat opened this issue Sep 3, 2020 · 7 comments


@rabernat
Contributor

rabernat commented Sep 3, 2020

At today's Pangeo meeting, we discussed the idea of using signed URLs to provide unauthenticated users with access to Google Cloud Storage, specifically for reading zarr stores. I'm creating this issue to track that idea.

I found this page in the gcs docs with some helpful advice.
https://cloud.google.com/storage/docs/access-control/signing-urls-manually
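The recipe in that doc can be sketched in a few lines of stdlib Python. This is only the canonical-request / string-to-sign construction from the V4 process; the final step (RSA-SHA256 signing with the service account's private key) is omitted, and all names and values below are illustrative:

```python
import hashlib
from urllib.parse import quote


def string_to_sign(verb, bucket, obj, query_params, headers, timestamp, datestamp):
    """Build the V4 "string to sign" per the GCS manual-signing doc.

    A real signer would RSA-SHA256-sign the returned string with the
    service account key (e.g. via google-auth) and append the result
    as the X-Goog-Signature query parameter.
    """
    canonical_uri = f"/{bucket}/{quote(obj, safe='/')}"
    canonical_query = "&".join(
        f"{quote(k, safe='')}={quote(str(v), safe='')}"
        for k, v in sorted(query_params.items())
    )
    # Each canonical header line ends with \n; the extra blank line
    # before the signed-headers list comes from the join below.
    canonical_headers = "".join(f"{k}:{v}\n" for k, v in sorted(headers.items()))
    signed_headers = ";".join(sorted(headers))
    canonical_request = "\n".join(
        [verb, canonical_uri, canonical_query,
         canonical_headers, signed_headers, "UNSIGNED-PAYLOAD"]
    )
    scope = f"{datestamp}/auto/storage/goog4_request"
    return "\n".join(
        ["GOOG4-RSA-SHA256", timestamp, scope,
         hashlib.sha256(canonical_request.encode()).hexdigest()]
    )
```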

How would we go about "proxying" access to a restricted bucket via signed urls? I can't quite wrap my head around it. Where would the signing happen? It couldn't happen from the user's notebook pod--it would have to happen from within some service running somewhere with enhanced credentials. Could we connect it to jupyterhub's auth? Or to our auth0 account?

@martindurant
Member

cc @danielballan @CJ-Wright
ref: #277

From the fsspec/gcsfs view, the implementation should certainly support producing signed URLs from current credentials, which is the PR above. You could also write an implementation which interfaces with some broker to do file listings and get signed HTTP URLs, and then uses those. This is not proxying; it is a redirect (automatic in HTTP-land, or explicit in code).

intake/intake#524 demonstrates a prototype of how you would rewrite the "urlpath" data source arg in the intake server to give signed URLs back.
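The broker/redirect flavour described above can be sketched without any gcsfs-specific code. Everything here is hypothetical — the broker service, its `/sign` route, the bearer token, and the JSON response shape — but it shows the redirect-in-code pattern: the client exchanges a path for a signed URL, then talks to storage directly.

```python
import json
import urllib.request


class SignedURLBroker:
    """Hypothetical client for a broker service that checks the caller's
    auth (e.g. a JupyterHub token) and answers with a signed URL.

    The `fetch` callable is injectable so the HTTP layer can be swapped
    out in tests (or replaced with aiohttp, etc.).
    """

    def __init__(self, broker_url, token, fetch=None):
        self.broker_url = broker_url
        self.token = token
        self.fetch = fetch or self._http_fetch

    def _http_fetch(self, url, headers):
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())

    def signed_url(self, path):
        # Redirect-in-code: ask the broker for a signed URL, then the
        # client fetches from storage directly -- nothing is proxied.
        reply = self.fetch(
            f"{self.broker_url}/sign?path={path}",
            {"Authorization": f"Bearer {self.token}"},
        )
        return reply["url"]
```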

@CJ-Wright

If anyone wants to take over those PRs please feel free. I'm not abandoning them, but they might not fit into my current time constraints.

@hadim

hadim commented Aug 24, 2023

Sorry to poke on that old issue but I have a use case that seems similar to what was discussed during that Pangeo meeting.

I am working on a backend that will allow users to download/upload files with a custom permission system that can't rely on GCP auth (the logic is custom to the backend). I haven't tried it yet, but signed URLs seem to be the perfect candidate for single files; I am not sure I can see that working for folders, though. One type of file we want users to download/upload is Zarr stores (also Parquet partitioned folders, for example).

My understanding is that we would need some kind of logic within the Zarr reader that asks the server to provide a signed URL for any file it wants to access within the Zarr folder.

Is that currently possible with zarr/parquet/fsspec/gcsfs? If not, where do you think would be the best place to contribute and add that logic (fsspec seems likely, since both zarr and pandas support fsspec filesystems)?

@martindurant
Member

Firstly: URL signing is indeed implemented in gcsfs (method .sign()).

To upload, you would need to generate a POST signed URL, but it would still need some HTTP payload formatting, as in gcsfs.core.simple_upload. There would be one URL per file, so possibly many for a zarr dataset.

Having said all that, it should not be too complicated to make a subclass of GCSFileSystem which, upon write, defers to some other system to get a signed URL and then uses that - all the calls in gcsfs are HTTP. You might do the same for read. Zarr only accesses the cat_file and pipe_file methods, I think, so it's not too much work. There would need to be no change to zarr or parquet (although the latter writes files using more, different code paths depending on arrow vs fastparquet).
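A minimal sketch of that idea, written as a plain MutableMapping (which zarr v2 accepts as a store) rather than an actual GCSFileSystem subclass. The three callables are hypothetical hooks standing in for the signing service and the HTTP layer; `__getitem__`/`__setitem__` play the role of cat_file/pipe_file:

```python
from collections.abc import MutableMapping


class SignedURLStore(MutableMapping):
    """Sketch of a zarr-v2-style store where every key access first
    obtains a signed URL, then transfers the bytes over plain HTTP.

    All three callables are hypothetical hooks: `sign(key, verb)`
    returns a signed URL, while `http_get(url)` and `http_put(url, data)`
    perform the actual transfer.
    """

    def __init__(self, sign, http_get, http_put):
        self.sign = sign
        self.http_get = http_get
        self.http_put = http_put

    def __getitem__(self, key):
        # Read path: one signed GET URL per object (cf. cat_file).
        return self.http_get(self.sign(key, "GET"))

    def __setitem__(self, key, value):
        # Write path: one signed PUT URL per object (cf. pipe_file).
        self.http_put(self.sign(key, "PUT"), value)

    def __delitem__(self, key):
        raise NotImplementedError("deletes would need their own signed call")

    def __iter__(self):
        # Signed URLs cannot list a bucket, so the store cannot
        # enumerate its keys -- the limitation raised later in this
        # thread, which .zmetadata works around.
        return iter(())

    def __len__(self):
        return 0
```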

@hadim

hadim commented Aug 25, 2023

Thanks @martindurant, for the input!

I did a POC and indeed, at least for cat and pipe, it works nicely.

One thing is that signed URLs do not support listing objects (at least from what I saw). My understanding is that without a "listing objects" feature, the zarr store will always need to have a .zmetadata file in order to know which children/subfolders it contains.

Am I correct about that requirement?

@hadim

hadim commented Aug 25, 2023

For example, this zarr HTTP url https://uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.3/9836842.zarr/ taken from https://github.com/ome/napari-ome-zarr does not seem to work when using zarr.open. I guess it works only with ome-zarr because the spec and lib know in advance the keys to look into?

Maybe the broader question is whether HTTP (without listing capability) can work with zarr at all, given that the protocol does not really support a MutableMapping interface (except when .zmetadata exists).

@martindurant
Member

You are quite right: in the absence of consolidated metadata, zarr (v2) has no way to know the children of a group except by listing. In some situations the expected arrays can be described elsewhere (such as OME), or you could use kerchunk to establish the layout and save it somewhere else.
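For the consolidated-metadata route, a single GET of `.zmetadata` (zarr v2, `zarr_consolidated_format: 1`) is enough to reconstruct the hierarchy without any bucket listing, since it carries every metadata key. A sketch of that discovery step — the sample metadata below is made up:

```python
import json

# Made-up consolidated metadata for a root group with two arrays.
SAMPLE_ZMETADATA = json.dumps({
    "zarr_consolidated_format": 1,
    "metadata": {
        ".zgroup": {"zarr_format": 2},
        "temp/.zarray": {"zarr_format": 2, "shape": [10], "chunks": [10],
                         "dtype": "<f8", "compressor": None, "fill_value": 0,
                         "order": "C", "filters": None},
        "precip/.zarray": {"zarr_format": 2, "shape": [10], "chunks": [10],
                           "dtype": "<f8", "compressor": None, "fill_value": 0,
                           "order": "C", "filters": None},
    },
})


def children(zmetadata_text):
    """List the immediate members of the root group from consolidated
    metadata -- no listing call against the bucket required."""
    meta = json.loads(zmetadata_text)["metadata"]
    return sorted({k.split("/", 1)[0] for k in meta if "/" in k})
```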

I haven't read the upstream documentation on URL signing recently, but it's possible that listing can be signed too. S3 does allow this, using a different call and different permissions than GET/PUT/POST for a single file; it is not currently implemented in s3fs.
