
Add method to set metadata #38

Closed
yan-hic opened this issue Nov 29, 2017 · 21 comments
@yan-hic
Contributor

yan-hic commented Nov 29, 2017

Feature request (or current workaround)

  1. set metadata on an existing object, like https://cloud.google.com/storage/docs/gsutil/commands/setmeta
  2. accept a dict parameter at file creation (open in 'w' mode)
@martindurant
Member

martindurant commented Nov 29, 2017

The relevant end-point appears to be [objects/patch](https://cloud.google.com/storage/docs/json_api/v1/objects/patch) with a body like '{"metadata": {"key": "value"}}'. It should not be difficult to implement set/get metadata using that.
It is also possible to set the metadata at file creation, but that is probably a bit more complicated.
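As a sketch of the request shape that objects/patch expects (the helper name is hypothetical, and gcsfs would send this through its authenticated session rather than building it by hand):

```python
import json
from urllib.parse import quote_plus


def build_metadata_patch(bucket, key, metadata):
    """Build the URL and JSON body for an objects/patch call.

    Hypothetical helper: only the request shape is shown; auth headers
    and the actual HTTP call are left to the caller's session.
    """
    # the object name must be fully percent-encoded, including '/'
    url = ('https://www.googleapis.com/storage/v1/b/%s/o/%s'
           % (quote_plus(bucket), quote_plus(key, safe='')))
    body = json.dumps({'metadata': metadata})
    return url, body


url, body = build_metadata_patch('my-bucket', 'dir/file.csv', {'test': '2.7'})
```

PATCH (unlike PUT on the full resource) merges the given metadata keys into the object's existing custom metadata.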

@yan-hic
Contributor Author

yan-hic commented Nov 29, 2017

Actually it is simple to do at file creation - see below for the resumable case, but the same applies in simple_upload:

    def _initiate_upload(self):
        r = requests.post(
            'https://www.googleapis.com/upload/storage/v1/b/%s/o'
            % quote_plus(self.bucket),
            params={'uploadType': 'resumable'},
            headers=self.gcsfs.header,
            json={'name': self.key, 'metadata': {'test': '2.7'}})
        self.location = r.headers['Location']

The metadata can be passed as parameter in the GCSFile instantiation.

@martindurant
Member

OK, then I guess both should be done.
In s3fs we have getxattr, setxattr for existing files, so we can also do calls for individual keys alongside the whole metadata.

@yan-hic
Contributor Author

yan-hic commented Nov 29, 2017

https://cloud.google.com/storage/docs/json_api/v1/how-tos/simple-upload indicates that simple upload is no good if metadata needs to be added. It can actually be used, but through a different URL call.
I suggest limiting this to resumable uploads, which fits my use case.
Will submit a PR

@martindurant
Member

You would be doubling the time for uploads on small files without metadata. Could you not stay with simple upload, but call the PATCH method after close in the case that there is metadata?

@yan-hic
Contributor Author

yan-hic commented Nov 29, 2017

Not sure I understand: I have no use for simple upload, as none of my writes are < 5 MB. The proposed change only affects (benefits) resumable uploads in _initiate_upload, which is the main reason to use GCSFS and a file object.

For simple upload, a metadata parameter passed to open() would be ignored. To support that scenario, an after-the-fact PATCH is indeed best.

The change is very minor. Feel free to reject or bundle it otherwise, but it's working nicely for me.

@martindurant
Member

Sorry, I thought you meant avoiding simple upload altogether. I would like metadata added in both cases, rather than silently ignoring them in some cases.

So the plan:

  • open() should take a metadata=dict parameter
  • on _initiate_upload, set the metadata, if there is any
  • add set_metadata and get_metadata methods for existing files; should getting use the cache?
  • if the upload is simple and there is metadata, call set_metadata after close
  • (optional) add set_xattr and get_xattr for accessing a single key.
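The plan above can be sketched as a mixin; the method names follow the bullet list, but the `_call` and `info` signatures here are assumptions for illustration, not the real gcsfs internals:

```python
from urllib.parse import quote_plus


class MetadataMixin:
    """Sketch of the planned metadata helpers (hypothetical internals)."""

    def set_metadata(self, path, metadata):
        bucket, key = path.split('/', 1)
        # objects/patch merges the given keys into the custom metadata
        return self._call('PATCH', 'b/%s/o/%s' % (
            quote_plus(bucket), quote_plus(key, safe='')),
            json={'metadata': metadata})

    def get_metadata(self, path):
        # info() already returns the object resource, so this is just a
        # convenience accessor (and inherits whatever caching info() does)
        return self.info(path).get('metadata', {})

    def set_xattr(self, path, key, value):
        # single-key variants, mirroring s3fs
        return self.set_metadata(path, {key: value})

    def get_xattr(self, path, key):
        return self.get_metadata(path)[key]
```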

@yan-hic
Contributor Author

yan-hic commented Nov 29, 2017

Sorry, but I can only help with the first two bullets.
get_metadata might be redundant, as GCSFSFile.details already returns everything; it could still be a helper, though.

@martindurant
Member

See #39

@martindurant
Member

@yiga2 , I'll leave this issue open despite your PR until the rest of the functionality is complete.

@yan-hic
Contributor Author

yan-hic commented Nov 30, 2017

@martindurant I have found an easy way to get simple_upload to work, but it's using the XML API instead of the JSON API.
It uses the x-goog-meta- headers, which unfortunately are ignored by the JSON API endpoint.

        if self.metadata is not None:
            for k, v in self.metadata.items():
                head['x-goog-meta-' + k] = v
        path = 'https://storage.googleapis.com/%s/%s' % (quote_plus(self.bucket), self.key)
        r = requests.put(path, headers=head, data=data)

Since the response is not JSON, the subsequent code that reads the size and md5 needs a bit of rewriting.

Are you fine with that, or do you want to stick with a JSON API approach?

BTW, Google's XML API syntax is apparently compatible with AWS: https://cloud.google.com/storage/docs/migrating

@martindurant
Member

Have you tried setting the same header keys with the original end-point? It would be odd if the two syntaxes had different capabilities. I don't mind particularly, but I suspect it would make the code less friendly.
On S3 compatibility: it's not really true, in my experience :) I can't remember what didn't work.

@yan-hic
Contributor Author

yan-hic commented Nov 30, 2017

I tried, and as I wrote, it is ignored. Odd indeed, and there is no indication that either endpoint will be deprecated.
As for the additional change, it would be:

size, md5 = int(r.headers['x-goog-stored-content-length']), \
            re.findall('md5=(.+)', r.headers['x-goog-hash'])[0]

The md5 retrieval is indeed more cumbersome, but the path and r declarations are shorter and more readable.

Also, this is the only way to set the metadata at object creation, which saves an API call and a possible point of failure.

@yan-hic
Contributor Author

yan-hic commented Dec 1, 2017

@martindurant
I have just changed over to the (new) Google client libraries, as I badly needed service account support (across projects and accounts). They support metadata, prefixes and more.
For me there is no need for a GCS-level file object anymore, since I managed to get the input sent in compressed format and will compress the output back to GCS. Hence the read/write iterations are on the zip/gz objects, and both compressed blobs fit in memory.

Considering the existence of Google's abstraction layer and their move to gRPC under the hood, I wonder if GCSFS should focus only on helpers and gaps, e.g. the IOBase gap. On the latter, you have already gone through the effort of defining the required methods, and there is no sign Google is working on it. It would require adapting to use https://googlecloudplatform.github.io/google-resumable-media-python/latest/google.resumable_media.requests.html (not the simple down/uploads, which are covered by the standard library).

Granted, the changes are profound and beyond my knowledge. The fact that GCSFS relies on the deprecated oauth2client could be a trigger to revisit.

See you on the other projects I am following, fastparquet and fastavro!

@yan-hic yan-hic closed this as completed Dec 1, 2017
@asford
Collaborator

asford commented Jan 2, 2018

@martindurant Would you mind reopening this issue? I've a use case that likely requires metadata on small uploads and this seems like a natural place to plan that work.

@mrocklin mrocklin reopened this Jan 2, 2018
@yan-hic
Contributor Author

yan-hic commented Nov 7, 2018

@martindurant et al. - I am back! It turned out I needed gcsfs and metadata on _simple_upload after all.
So, per Google's docs, I have switched uploadType from media to multipart as follows:

    '''  old
    r = self.gcsfs.session.post(
        path, params={'uploadType': 'media', 'name': self.key}, data=data)
    '''

    metadata = {'name': self.key}
    if self.metadata is not None:
        metadata['metadata'] = self.metadata
    metadata = str(metadata)
    data = ('--==0=='
            '\nContent-Type: application/json; charset=UTF-8'
            '\n\n' + metadata +
            '\n--==0=='
            '\nContent-Type: application/octet-stream'
            '\n\n').encode() + data + b'\n--==0==--'

    r = self.gcsfs.session.post(
        path,
        headers={'Content-Type': 'multipart/related; boundary="==0=="'},
        params={'uploadType': 'multipart'},
        data=data)

Sorry, I haven't touched Git for a while and am very unfamiliar with PRs. The above code works and we use it in production, so I thought I'd share.

Hope someone will take this through and close the loop.
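The multipart body construction above can be sketched as a standalone helper. Using json.dumps rather than str() is an assumption on my part here: Python's dict repr uses single quotes, which is not valid JSON.

```python
import json


def multipart_body(key, data, metadata=None, boundary='==0=='):
    """Build a multipart/related body for uploadType=multipart.

    Same shape as the snippet above, but the JSON part is produced
    with json.dumps so quoting and non-ASCII values are handled.
    """
    meta = {'name': key}
    if metadata is not None:
        meta['metadata'] = metadata
    return ('--%s\n'
            'Content-Type: application/json; charset=UTF-8\n\n'
            '%s\n'
            '--%s\n'
            'Content-Type: application/octet-stream\n\n'
            % (boundary, json.dumps(meta), boundary)
            ).encode() + data + ('\n--%s--' % boundary).encode()
```

The request would then be posted with `Content-Type: multipart/related; boundary="==0=="` and `uploadType=multipart`, as in the snippet above.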

@martindurant
Member

Thanks, @yiga2!
I'll try to make a little time to push it through. I could do with a simple test case that demonstrates it working, and I can sort out the business of making VCR recordings.

@martindurant
Member

`metadata = str(metadata)`: should this not be JSON encoding, which is not necessarily the same thing?

@yan-hic
Contributor Author

yan-hic commented Nov 7, 2018

The encode() will format it to JSON. Merging name with metadata is just easier with dict.update() than inserting into the JSON (without the json library). But it's your call, really; I am by no means a Python expert.

On the test case, count on me !

But reflecting on this as I was playing with different chunk (block) sizes for optimal performance (which for GCP-to-GCP is the 8 MB maximum): there is no real performance benefit of simple upload over resumable upload. Actually, as Google indicates here, resumable is even recommended for data integrity.

Hence the question: what about skipping simple upload altogether and using resumable for any file size?
If you want to keep it, to minimize side effects (if any), you could introduce a top-level boolean (or GCSFile argument), e.g. use_simple_upload.

On the other hand, resumable with a 5 MB blocksize gives you the same, plus you can easily pass custom metadata (already supported by GCSFS).

@yan-hic
Contributor Author

yan-hic commented Mar 31, 2021

Not sure I would reopen this, but if anyone is reading this claiming it doesn't work: make sure the dict values are strings, as gcsfs (or fsspec) will not error out. I figured this out when passing a nested dict and no metadata got saved.

One should do as follows:

..., metadata={'address': json.dumps({'street': '1 my avenue', 'county': None, ...

and deserialize back when reading the metadata.
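A minimal sketch of that workaround, with hypothetical helper names: JSON-encode non-string values on write, and decode on read on a best-effort basis.

```python
import json


def encode_metadata_values(metadata):
    """Coerce every value to str, JSON-encoding non-strings, since GCS
    custom metadata only stores string values (hypothetical helper)."""
    return {k: v if isinstance(v, str) else json.dumps(v)
            for k, v in metadata.items()}


def decode_metadata_value(value):
    # best-effort inverse: try JSON first, fall back to the raw string
    try:
        return json.loads(value)
    except (ValueError, TypeError):
        return value
```

Note the fallback is lossy for plain strings that happen to be valid JSON (e.g. "42"), so this is a convenience, not a round-trip guarantee.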

@martindurant I am not sure whether wrapping this in the core code (i.e. for each value, JSON-encode if not a str) is worth it, but raising an error would be good.

@martindurant
Member

Sounds reasonable to either make this work as expected, or to document it for those who come across the current behaviour and would otherwise be surprised.
