Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy url params in format_parsed_parts #101

Closed

Conversation

rahuliyer95
Copy link
Contributor

@rahuliyer95 rahuliyer95 commented May 12, 2023

Similar to #95 I noticed that when creating sub-paths using / or __truediv__ we are not copying the url params.

Example,

>>> from upath import UPath
>>> base = UPath("s3://bucket/dir?param1=value1")
>>> base._url
SplitResult(scheme='s3', netloc='bucket', path='/dir', query='param1=value1', fragment='')
>>> file = base / "file1.txt"
>>> file._url
SplitResult(scheme='s3', netloc='bucket', path='/dir/file1.txt', query='', fragment='')

This PR attempts to resolve this by copying the query parameters when creating sub-paths using /.

Results with the fix,

>>> from upath import UPath
>>> base = UPath("s3://bucket/dir?param1=value1")
>>> base._url
SplitResult(scheme='s3', netloc='bucket', path='/dir', query='param1=value1', fragment='')
>>> file = base / "file1.txt"
>>> file._url
SplitResult(scheme='s3', netloc='bucket', path='/dir/file1.txt', query='param1=value1', fragment='')

@rahuliyer95 rahuliyer95 force-pushed the issue-95/update-url-in-joinpath branch from 53eac9a to ded54d6 Compare May 12, 2023 18:21
@rahuliyer95
Copy link
Contributor Author

rahuliyer95 commented May 12, 2023

Requesting Review: @ap-- @normanrz

Copy link
Collaborator

@ap-- ap-- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rahuliyer95

Thank you for opening the PR! ❤️

I think we'll have to check how this fits together with: #88 and #92

As it stands now, I believe the test assumption is wrong, because query parameters have to be provided on a per "file" basis and should not carry over.

I understand though, that this might be a common use case, so maybe we could provide some other way to globally set query parameters...

Cheers,
Andreas 😃

path1 = self.path / f"test_joinpath_dir1?{query}"
assert path1._url.query == query
path2 = path1 / "file1.txt"
assert path2._url.query == query
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assumption is probably incorrect. See:

>>> import urllib.parse
>>> urllib.parse.urljoin("http://www.example.com/dir/?p1=v1&p2=v2", "file.txt")
'http://www.example.com/dir/file.txt'

@rahuliyer95
Copy link
Contributor Author

Hi @rahuliyer95

Thank you for opening the PR! ❤️

I think we'll have to check how this fits together with: #88 and #92

As it stands now, I believe the test assumption is wrong, because query parameters have to be provided on a per "file" basis and should not carry over.

I understand though, that this might be a common use case, so maybe we could provide some other way to globally set query parameters...

Cheers, Andreas 😃

Thanks for getting back on this, we are using upath to read data from different namespaces in our cloud backend and the credentials for these namespaces are different. We are trying to add an identifier for these namespaces via query params, example

from upath import UPath

dir1 = UPath("scheme://bucket/dir1?namespace=ns1")
dir2 = UPath("scheme://bucket/dir2?namespace=ns2")

now dir1 and dir2 will use proper credentials for ns1 and ns2 respectively because that's how we have written our custom fsspec.Filesystem implementation which uses the _get_kwargs_from_urls to parse the namespace parameter and set the appropriate credentials. This helps us a lot abstract away from some this internals of the filesystem and just pass around paths in our code which works seamlessly.

@ap--
Copy link
Collaborator

ap-- commented May 18, 2023

Thank you for the additional context!

Is there a reason why you chose to implement it via query parameters?

Will there ever be multiple namespaces for the same bucket/dir combination?
Or is a mapping from (bucket, dir) to namespace unique?

As I understand right now: you were basically looking for a way to encode which credentials to use into the string representation of the url.

@rahuliyer95
Copy link
Contributor Author

Is there a reason why you chose to implement it via query parameters?

For our use-case this path parameter can be provided by the user and the API contract we set is that the user only needs to provide the right path with a namespace so that the right credentials can be automatically used and the user of the API need not worry about additional configuration.

Will there ever be multiple namespaces for the same bucket/dir combination?
Or is a mapping from (bucket, dir) to namespace unique?

There could be multiple namespace for the same bucket/dir combination. e.g: bucket/dir can exist in both ns1 and ns2 but at a given time the user would only accessing bucket/dir from either ns1 and ns2 and we would request the user to create two paths one for each namespace so that proper credentials are used for accessing them.

As I understand right now: you were basically looking for a way to encode which credentials to use into the string representation of the url.

Yes, without having to expose actual credentials in the URL itself.

@drernie
Copy link

drernie commented Jul 12, 2023

FYI, I had a very similar situation. We ultimately ended up requiring the user to specify the credentials via an environment variable, because there were too many places the URI could be logged, which would create a security vulnerability.

@rahuliyer95
Copy link
Contributor Author

To clarify I am not proposing to share the actual credentials in the URL params, just an id or a reference to where the credentials are present; e.g: name of the environment variable to read for the credentials.

@rahuliyer95
Copy link
Contributor Author

Managed to solve this using kwargs of the UPath by subclassing the S3Path

class MyS3PathWithNamespace(S3Path):
    @classmethod
    def _from_parts(
        cls,
        args: List[Union[str, os.PathLike]],
        url: Optional[SplitResult] = None,
        **kwargs: Any,
    ):
        if url:
            query = parse_qs(url.query)
            if namespace := next(iter(query.get("namespace", [])), ""):
                kwargs["namespace"] = namespace
        return super()._from_parts(args, url=url, **kwargs)

@rahuliyer95 rahuliyer95 closed this Nov 1, 2023
@rahuliyer95 rahuliyer95 deleted the issue-95/update-url-in-joinpath branch November 1, 2023 17:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants