Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow configuring bucket location #455

Merged
merged 19 commits into from
Mar 22, 2022
Merged

Conversation

janjagusch
Copy link
Contributor

Closes #454

@janjagusch janjagusch changed the title Allow to configure bucket location Allow configuring bucket location Mar 16, 2022
@janjagusch janjagusch marked this pull request as draft March 16, 2022 12:13
Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this so quickly.

I would appreciate adding a docstring entry describing this, and that this only applies to bucket creation via mkdir.

Do you think it might make more sense to add the argument to mkdir instead of init? Or maybe, in init we have a default_location, and mkdir also takes an optional location as an override.

Unfortunately, the CI does not have multiple locations available (of course), but we should add variants on the location with mkdir to ensure that the passed arguments are valid.

gcsfs/core.py Outdated Show resolved Hide resolved
@janjagusch
Copy link
Contributor Author

Thanks for adding this so quickly.

Thanks for the quick review!

I would appreciate adding a docstring entry describing this, and that this only applies to bucket creation via mkdir.

Definitely! I have this in a local commit already and will push this now.

Do you think it might make more sense to add the argument to mkdir instead of init? Or maybe, in init we have a default_location, and mkdir also takes an optional location as an override.

I would prefer having this as part of the __init__, as in my use case I only have control over constructing the GCSFileSystem instance but not over the mkdir call. For me personally having the ability to overwrite the location with a function argument in mkdir is not useful, but I wouldn't oppose adding it.

Unfortunately, the CI does not have multiple locations available (of course), but we should add variants on the location with mkdir to ensure that the passed arguments are valid.

I have added a test case locally and was surprised to see that it fails (all buckets are still created in the US, even though I specify the location differently). That's why I moved the PR back to a draft, until I figure out what's going on here.

@martindurant
Copy link
Member

surprised to see that it fails

OK, please let me know.

@martindurant
Copy link
Member

Location should be one of the keys in this table? https://cloud.google.com/storage/docs/locations#available-locations

(if yes, this link should also be in the docstring)

@janjagusch
Copy link
Contributor Author

Location should be one of the keys in this table? https://cloud.google.com/storage/docs/locations#available-locations

(if yes, this link should also be in the docstring)

Yes, I believe so (we will know for certain once the tests are working). And yes, the link is part of the docstring.

@janjagusch janjagusch mentioned this pull request Mar 16, 2022
gcsfs/core.py Outdated Show resolved Hide resolved
@janjagusch
Copy link
Contributor Author

surprised to see that it fails

OK, please let me know.

Okay, it's working now. I didn't know about the fake GCS server, which apparently can only create buckets in the US-CENTRAL1 region. After setting export STORAGE_EMULATOR_HOST="https://storage.googleapis.com" everything was working (at least after making some more fixes).

@janjagusch janjagusch marked this pull request as ready for review March 17, 2022 10:07
@janjagusch
Copy link
Contributor Author

@martindurant, did you have some time to take a look at this PR already? 😇

Copy link
Member

@martindurant martindurant left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for moving around the code of _list_buckets?

I think the plan we had would be the best way to finish this off:

  • rename to the argument in __init__ to default_location (=None)
  • allow _mkdir to take a location argument (=None)
  • in _mkdir, set location = location or self.location, and the rest is the same.

@janjagusch
Copy link
Contributor Author

What is the reason for moving around the code of _list_buckets?

Good question. In my test_bucket_location test I need a way to access the location metadata about a bucket, which unfortunately gets removed in the _list_buckets method (it drops all metadata but the name). So I've introduced _list_bucket_dicts, which contains all metadata about a bucket and refactored _list_buckets to use _list_bucket_dicts to avoid code duplication.

I think the plan we had would be the best way to finish this off:

  • rename to the argument in __init__ to default_location (=None)
  • allow _mkdir to take a location argument (=None)
  • in _mkdir, set location = location or self.location, and the rest is the same.

True, I kind of forgot about this, sorry. I will get to it now.

@janjagusch
Copy link
Contributor Author

ready for review.

@martindurant
Copy link
Member

it drops all metadata but the name

so perhaps it's a good idea to let this information come through to the user too

@janjagusch
Copy link
Contributor Author

it drops all metadata but the name

so perhaps it's a good idea to let this information come through to the user too

actually, you're making a good point here. i was thinking about leaving the API of _list_buckets unchanged but considering that

  • it's not part of the public API of the GCSFileSystem class and
  • that the original return dictionary can still be processes in the same way (we just add a couple more keys)

it should be okay to do so.

I'll get right to it.

gcsfs/tests/conftest.py Outdated Show resolved Hide resolved
@janjagusch
Copy link
Contributor Author

janjagusch commented Mar 22, 2022

it drops all metadata but the name

so perhaps it's a good idea to let this information come through to the user too

actually, you're making a good point here. i was thinking about leaving the API of _list_buckets unchanged but considering that

  • it's not part of the public API of the GCSFileSystem class and
  • that the original return dictionary can still be processes in the same way (we just add a couple more keys)

it should be okay to do so.

I'll get right to it.

done. again, I'd suggest squashing, as the commit history has turned out to be quite messy, @martindurant. if there's anything else you'd like to see changed, please let me know.

@martindurant
Copy link
Member

github seems to be having trouble today, seems like it might take a while for tests to show up.

@janjagusch
Copy link
Contributor Author

github seems to be having trouble today, seems like it might take a while for tests to show up.

yeah, I've also had issues pushing earlier today.
btw, if you could release a new tag after this has been merged, that would be fantastic. 👼

@janjagusch
Copy link
Contributor Author

i think the CI error is related to me creating new GCSFileSystem instances for the location tests but i'm not entirely sure what to do about it. what exactly is tested here? that the session is a singleton?

    def test_current(gcs):
        assert GCSFileSystem.current() is gcs
        gcs2 = GCSFileSystem(endpoint_url=gcs._endpoint)
>       assert gcs2.session is gcs.session
E       assert <aiohttp.client.ClientSession object at 0x7f34e2af6d90> is <aiohttp.client.ClientSession object at 0x7f34e2cdba90>
E        +  where <aiohttp.client.ClientSession object at 0x7f34e2af6d90> = <gcsfs.core.GCSFileSystem object at 0x7f34e2ad2b90>.session
E        +  and   <aiohttp.client.ClientSession object at 0x7f34e2cdba90> = <gcsfs.core.GCSFileSystem object at 0x7f34e2a48a90>.session

@martindurant
Copy link
Member

that the session is a singleton?

It's the GCSFileSystem that is the singleton - if you instantiate with the same arguments, you should get the same instance. The change was to pass default_location here, even though it has the default value of None.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow to configure bucket location
2 participants