
Make storage adapter API return types consistent across Azure and GCP #26

Open
hannelita opened this issue Jun 3, 2020 · 4 comments
Labels: bug, design

Comments

@hannelita
Contributor

It looks like the streaming API under Gunicorn on Heroku does not support the tell() method. At the moment there is a workaround where I return the size of the stream, which is not fully compatible with the expected return type of put() in the provider implementation. What's a good way to abstract over different providers and WSGI servers? Should we revisit the API?

Note: the code below is bad

# google_cloud.py
def put(self, prefix: str, oid: str, data_stream: BinaryIO) -> int:
    bucket = self.storage_client.get_bucket(self.bucket_name)
    blob = bucket.blob(self._get_blob_path(prefix, oid))
    blob.upload_from_string(data_stream.read())
    # workaround: tell() is not supported here, and get_size() is not part of BinaryIO
    return data_stream.get_size()
@hannelita added the bug and design labels on Jun 3, 2020
@rufuspollock
Member

@hannelita I'm not sure I totally understand this ...

@shevron is this clear to you? If so, could you comment on what you think? 😄

@shevron
Contributor

shevron commented Jun 4, 2020

@hannelita are we talking about the GCP implementation of the streaming API or the built-in local storage one?

Specifically, about the code that doesn't work (I see it is GCP code): what is expected to be returned there is the number of uploaded bytes, or put differently, the size of the blob after all the data has been uploaded. For this, you can just do:

    return blob.size

I am not a GCP expert, but I think this will work - or perhaps you may need to call blob.reload() to get fresh data from the server.
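Putting it together, put() might look something like this (just a sketch on my side, not tested against GCP; I'm keeping the rest of the method as it is today and only changing the return value):

    # google_cloud.py - sketch, reusing the existing adapter attributes
    from typing import BinaryIO

    def put(self, prefix: str, oid: str, data_stream: BinaryIO) -> int:
        bucket = self.storage_client.get_bucket(self.bucket_name)
        blob = bucket.blob(self._get_blob_path(prefix, oid))
        blob.upload_from_string(data_stream.read())
        blob.reload()      # refresh metadata from the server so blob.size is populated
        return blob.size   # size of the stored blob in bytes, i.e. the number of uploaded bytes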

If this is not what you meant, please clarify. Specifically, "What's a good way to abstract over different providers and WSGI servers? Should we revisit the API?" is not clear to me - I find this API to be at just the right level of abstraction (get a file object, return the number of bytes uploaded as an int).

If the problem is that Gunicorn / Heroku doesn't give you the size in bytes of data_stream, we should look into that - just as we can solve the GCP case by asking GCP how many bytes we just uploaded, we can probably find ways to make this work on Heroku / Gunicorn with other backends too. If that is still the case, we can decide to change the API, or decide that we don't want to support Heroku.
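For example (purely a sketch - CountingStream is a made-up helper, not something in the codebase), we could count the bytes ourselves as the backend reads the stream, so we never depend on tell() or on the stream knowing its own size:

    from typing import BinaryIO

    class CountingStream:
        """Wraps a binary stream and counts the bytes read through it (hypothetical helper)."""

        def __init__(self, stream: BinaryIO) -> None:
            self._stream = stream
            self.bytes_read = 0

        def read(self, size: int = -1) -> bytes:
            chunk = self._stream.read(size)
            self.bytes_read += len(chunk)
            return chunk

    # in put(): wrap the incoming stream, upload from the wrapper,
    # then return wrapper.bytes_read instead of calling data_stream.tell()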

@shevron
Contributor

shevron commented Jun 4, 2020

As a side note, to the best of my understanding (again, not a GCP / Heroku expert), you should not be using blob.upload_from_string(data_stream.read()) but rather blob.upload_from_file(data_stream), so that we don't have to read the entire file into memory before uploading.
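In other words, something along these lines (again, just illustrating, not tested):

    # instead of buffering the whole stream in memory:
    # blob.upload_from_string(data_stream.read())
    # stream the file object directly to the blob:
    blob.upload_from_file(data_stream)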

@shevron changed the title [API return types] → Make storage adapter API return types consistent across Azure and GCP on Aug 5, 2020
@shevron
Contributor

shevron commented Aug 18, 2020

Also, we should not be using download_as_string() in get(); the # type: ignore comment there is a good indication that it returns an unexpected type.

get methods of storage backends are expected to return an Iterable[bytes] (which could be anything that, when iterated over, yields bytes: a list of bytes objects, an open file in binary mode, a stream, etc.).
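To illustrate, a get() along these lines would satisfy that contract (a sketch only - I'm assuming an in-memory buffer is acceptable here, and using blob.download_to_file() to avoid download_as_string()):

    import io
    from typing import Iterable

    def get(self, prefix: str, oid: str) -> Iterable[bytes]:
        bucket = self.storage_client.get_bucket(self.bucket_name)
        blob = bucket.blob(self._get_blob_path(prefix, oid))
        buffer = io.BytesIO()
        blob.download_to_file(buffer)                     # download the object into a binary buffer
        buffer.seek(0)
        return iter(lambda: buffer.read(64 * 1024), b"")  # iterate over the content in 64 KiB chunks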
