Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pathlib.Path type #434

Open
clbarnes opened this issue Oct 6, 2020 · 27 comments
Open

pathlib.Path type #434

clbarnes opened this issue Oct 6, 2020 · 27 comments

Comments

@clbarnes
Copy link

clbarnes commented Oct 6, 2020

pathlib is great for interacting with paths on the local file system. It's expressive, terse, ergonomic, and typed. I'd love to use fsspec going forward, but don't want to give up the power and convenience of pathlib.

Could AbstractFileSystems contain a factory method which produced a PurePath subclass with the correct separator, prefix etc.? This concrete class could also contain the methods implemented by pathlib.Path or it could just be passed back to the filesystem.

@martindurant
Copy link
Member

pathlib is great

(maybe my bias against pathlib keeps showing through)

Could AbstractFileSystems contain

I am not opposed in principle, but I don't immediately see how this would work. It sounds like a pain to implement?

@clbarnes
Copy link
Author

clbarnes commented Oct 6, 2020

Agreed that it may be a pain to implement all of the FS-touching operations present in Path. The instance would need to keep around a reference to the creating FileSystem which wouldn't be very tidy, and a lot of the methods wouldn't abstract well over other FSs.

However, the API of PurePath (which basically handles splitting on separators, relative paths and so on) is not too bad and IMO would add some ergonomics and type helpfulness. AbstractFileSystems could have a PurePath inner class which inherits from a fsspec.PurePath and basically just allows implementors to set their separators, disallowed characters and so on. You'd then instantiate it with my_fs.PurePath(), could do type annotations with MyFileSystem.PurePath. Downstream users could thus abstract further over fsspec implementations because they don't need to know about the FS' separators and so on to split out path components and file extensions and so on - they just use the PurePath API as they're used to doing on their own filesystem.

@martindurant
Copy link
Member

Is this something you are proposing to take on? :)

@clbarnes
Copy link
Author

clbarnes commented Oct 6, 2020

If anyone else thinks it would be valuable and it would make it through review, I may be able to find time!

@martindurant
Copy link
Member

If anyone else thinks it would be valuable

Let's see if other people chime in. pathlib has certainly been mentioned before in the issues.

@michcio1234
Copy link

Hello! At Data Revenue we have crafted a drfs package that does more or less what you described.

Example:

from drfs import DRPath
dir_path = DRPath("s3://bucket/key")
file_path = dir_path / "file.txt"
with file_path.open("rb") as f:
    f.read()

Another example (presenting another functionality of the package) is here.

It is an internal package and was opensourced just yesterday (actually because of this thread), so I realise that the documentation is not perfect and there are bugs here and there. The package will be probably refactored in the near future (possibly as you described - by integrating with fsspec), but maybe it can already be of use for you. :)

@raybellwaves
Copy link
Contributor

I stumbled across this yesterday https://github.com/drivendataorg/cloudpathlib. Could be used as inspiration here

@martindurant
Copy link
Member

Hurray for the extra functionality, @michcio1234 ! We would gladly see something like this included directly in fsspec, or indeed as a separate package.

@raybellwaves , I will post an issue on that repo to see if they are interested in contributing - or as inspiration, as you say.

@martindurant
Copy link
Member

@michcio1234 , also I'd note that your second example starts to look something like an intake catalog spec

@hadim
Copy link

hadim commented Nov 10, 2020

I would also love to have a pathlib.Path like-object for remote paths directly in fsspec. I would prefer to not rely on yet-another-lib mentioned here for this. And pathlib.PurePath does not work because it strips out the first // after the name of the protocol:

import pathlib
my_path = pathlib.Path("s3://my-bucket/experiment-1") / "results.csv"
my_path

returns:

PosixPath('s3:/my-bucket/experiment-1/results.csv')

When used with fsspec, the filesystem selected is the local one:

import fsspec
with fsspec.open(my_path, "w") as f:
    f.write("my results")

Locally a folder s3: is created.

@martindurant
Copy link
Member

Locally a folder s3: is created.

Ooops...

Again, I'd be really happy for someone to contribute this, but it likely won't be me.

@jamesmyatt
Copy link
Contributor

jamesmyatt commented Nov 16, 2020

This is something I've been thinking about and wanting too. The particular advantage is that it then allows you to use fsspec with (almost) any function that expects a pathlib.Path object.

I think that it could be quite a simple class that acts as an adapter between the pathlib.Path API and the fsspec.AbstractFileSystem API.

@hadim
Copy link

hadim commented Nov 16, 2020

The particular advantage is that it then allows you to use fsspec with (almost) any function that expects a pathlib.Path object.

Agree. I use pathlib everywhere but recently had to revert my code back because I also wanted to support fsspec paths...

@mraspaud
Copy link
Contributor

One more vote here, we have just started looking into this. One very simple way we tried is just to implement the __fspath__ method, that goes already a long way.

@hadim
Copy link

hadim commented Nov 16, 2020

Can you develop on fspath?

@martindurant
Copy link
Member

My reading of fspath (with or without underscores) if that it implicitly applies to local paths. Currently, local and smb implementations support fspath.

@hadim
Copy link

hadim commented Nov 16, 2020

fspath does not seem to solve the issue that the protocol is stripped out of the original path:

import pathlib

p = pathlib.PurePath('s3://bucket-data/setup.py')
print(str(p), p.__fspath__())
s3:/bucket-data/setup.py s3:/bucket-data/setup.py

s3:// is replaced by s3:/.

One way would be to monkey-patch pathlib.PurePath() but this is obviously not a good idea to monkey patch the stdlib.

@mraspaud
Copy link
Contributor

ok, sorry I wasn't clear, I just mean that supporting the path protocol (not the full pathlib.PurePath) help us a lot already.

@hadim
Copy link

hadim commented Jan 26, 2021

It looks like someone is building a lib for this: https://github.com/Quansight/universal_pathlib

Friendly ping to @tonyfast and @andrewfulton9: would you consider contributing directly here instead of in an external library?

@martindurant any thoughts on this?

@martindurant
Copy link
Member

Sure! Since it needs no further dependencies, it might well be hosted within fsspec - but maybe it doesn't matter. I'd be happy to put something into fsspec's docs pointing to the examples of upath for those looking for Path support, if the package is to remain separate. As far as I can see, upath has only been up for six days, so we'll see how it goes.

@jamesmyatt
Copy link
Contributor

I think https://github.com/Quansight/universal_pathlib is pretty much the kind of adapter I had in mind. It looks like it's far too early to include in fsspec now, but I think it ought to be integrated when it's more mature.

@hadim
Copy link

hadim commented Jan 26, 2021

Even if it's an early project, it would make sense to add early in fsspec but I guess it depends on what the dev wants to do here.

@martindurant
Copy link
Member

Let's wait for the quansight people to comment :)

@jamesmyatt
Copy link
Contributor

@martindurant , agreed.

@hadim , if it were me, this early in a project, then I'd want to be able to iterate faster than a more established project like fsspec.

@andrewfulton9
Copy link

Thanks for pinging me. I'm glad to see the interest in upath from you all. I would lean towards waiting to moving it into fsspec for the reason @jamesmyatt mentions above. I built this package to use in internal client projects, and I am still coming across bugs in it pretty frequently, so being able to push up releases with bug fixes as soon as they are fixed is pretty valuable to me. Once it is more stable though, I am definitely open to merging it into fsspec, if that still looks like the right path forward.

@martindurant
Copy link
Member

Do you think it's worth mentioning somewhere in the fsspec documentation yet?

@andrewfulton9
Copy link

I think it is worth mentioning, maybe with the caveat that it's still in the early stage of development. So far my test suite covers most of the relevant pathlib.Path methods using PyarrowHDFS, S3FS, and LocalFileSystem (via a mock) as back-ends. I have also tried GithubFileSystem and MemoryFilesystem and they both seem to work as well. Any filesystem that follows similar path paradigms should work too as far as I can tell. The HTTPFilesystem backend doesn't currently work since its methods expect a a full URL rather than just the path part of a URL. I am about to push a fix up for that this afternoon though. I think for most users it should be useable, and extra exposure could help me move it along faster by increasing issues/contributions. This evening I can add more examples to the examples notebook and I'll add more details to the README as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants