New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Standardize filesystem interface between dask and arrow #2880

Open
jcrist opened this Issue Nov 10, 2017 · 19 comments

Comments

Projects
None yet
5 participants
@jcrist
Copy link
Member

jcrist commented Nov 10, 2017

In #2801 a discussion started around how to handle the differences between arrow and dask filesystem classes in a more robust way.

Current Behavior

Dask and arrow have differing interfaces for filesystem objects. To handle this, pyarrow contains a few wrappers for dask filesystems, and applies them if the string of the class name matches a couple known ones (https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L786-L792). This seems hacky and not-robust. It also doesn't support all of the dask filesystems, only a few special cased ones.

A few ideas

  • Dask is responsible for creating wrappers for each of its filesystems that subclass from pyarrow.filesystem.FileSystem. This is much more robust (we fulfill the required interface, no inference hacks), but may require us to maintain wrappers for each fileystem.

  • Arrow and dask standardize on the same interface, and the check moves to one checking for an abstract base class. This means that dask doesn't need to depend on arrow (nor arrow on dask), as long as we fill the same interface.

    From inspection, it looks like the differences amount to:

    • Lack of a walk method on a few of our classes
    • Lack of isfile/isdir in s3fs
    • Difference in walk method between s3fs and the arrow wrapper
    • delete vs rm for method name

Of these two solutions, the second is definitely more desirable, as it would simplify the interaction between these two projects.

@jcrist

This comment has been minimized.

Copy link
Member

jcrist commented Nov 10, 2017

Ping @martindurant, @mrocklin, @wesm (and probably others on the arrow team?) for comment.

@mrocklin

This comment has been minimized.

Copy link
Member

mrocklin commented Nov 10, 2017

I suspect that a good step forward would be for someone (I nominate @jcirst) to write an abstract base class for the full interface, to have some discussion over that interface, and then to include it (and necessary changes to subclasses) within each project.

If we wanted to be more serious we could also publish it as a proper Python module with a tiny doc page to which we could refer others who come into this space in the future.

@jcrist

This comment has been minimized.

Copy link
Member

jcrist commented Nov 10, 2017

Sure, I can handle that. I'll most likely need hdfs arrow parquet support before then though, so I may patch #2801 separately.

@mrocklin

This comment has been minimized.

Copy link
Member

mrocklin commented Nov 30, 2017

Moving conversation from #2801 over here (cc @wesm @xhochy )

@wesm suggested the following:

We could create a pure Python library to define a abstract public API for these sorts of things. Then we aren't relying on duck typing

I agree. Some decisions that need to be made

  1. Name of this library and class (I'll throw out filesystem_interface.FileSystem as a strawman)
  2. Home of this library (I'll suggest the pydata org)
  3. Mechanisms for subclassing, do we want to use the abc module to enforce things?
  4. The precise methods that are required and their names. Do we have any current conflicts? Which projects are missing which methods?
@jcrist

This comment has been minimized.

Copy link
Member

jcrist commented Nov 30, 2017

Location of the library makes sense to me, name seems a bit rough but I can't think of a better one.

Mechanisms for subclassing, do we want to use the abc module to enforce things?

If we expect downstream filesystem libraries to use this library as a dependency, then I think subclassing makes sense. If we expect them to use this library only for a spec/testing, then I think an abc makes sense (things can be subclasses of an abc without ever subclassing them explicitly).

The precise methods that are required and their names. Do we have any current conflicts? Which projects are missing which methods?

One main issue is that most of the dask filesystems implement walk incorrectly. I fixed this in hdfs3 (dask/hdfs3#141) and filed issues in s3fs and gcsfs.


I should have some time to get started on this in the next couple days (if people are fine with me taking this on). Right now my intent is that

  • The library has an abc showing all expected methods and their signatures
  • The library documents the spec clearly
  • The library has some basic tests to check that a downstream library matches the spec
@martindurant

This comment has been minimized.

Copy link
Member

martindurant commented Nov 30, 2017

Fixing walk is, of course, fairly straight-forward, as we could always provide a walk_list with the old behaviour. However, we (from dask) should be careful not to be biased in defining the standard.

The first of the libraries I made (hdfs3) has methods mostly matching posix, which in turn match most of the nomenclature in HDFS, but I know some would have preferred closer consistency with the python standard library. Other FSs are less posix-like in nature.

I would say that the posix conventions are indeed a useful model to go from (it really is a standard) and would mean the least changes for the existing FS back-ends used by Dask; some FSs (azure) may be resistant to change - that's one vote.

@xhochy

This comment has been minimized.

Copy link
Contributor

xhochy commented Dec 7, 2017

Is there already a (design) document that describes what dask expects from a filesystem implementation? That would be a good starter for a common definition.

@martindurant

This comment has been minimized.

@mrocklin

This comment has been minimized.

Copy link
Member

mrocklin commented Dec 7, 2017

We might consider publishing this as a prose design document somewhere that is more human readable and less likely to change as Dask changes. This would, presumably, be part of the documentation of any interface library.

@xhochy

This comment has been minimized.

Copy link
Contributor

xhochy commented Mar 18, 2018

There is also https://www.pyfilesystem.org/ with a slightly different API then dask/arrow. Would that be also a candidate project that may use the same APIs or does someone see a clear differentiator why we would need a different base API?

@wesm

This comment has been minimized.

Copy link
Contributor

wesm commented Mar 22, 2018

I think it's worth assessing if that one meets our requirements.

@xhochy

This comment has been minimized.

Copy link
Contributor

xhochy commented Mar 23, 2018

Another thing to add that I encounter during implementing a FileSystem for AzureStorageBlob is there is a lot of Python code that can be shared between the object store based implementations. The code of S3FS and GCSFS is very similar in many situations and I can easily use a copy&paste approach to quickly get started with Azure. But copying the code (even if the License allows it) is not feeling right. This in addition to a set of a abstract base classes, we would also benefit from an object store filesystem base class.

@mrocklin

This comment has been minimized.

Copy link
Member

mrocklin commented Mar 23, 2018

I'm entirely in favor of code sharing and a super-class where appropriate.

I have to admit that I haven't yet investigated pyfilesystem sufficiently well to have an opinion on it.

@martindurant

This comment has been minimized.

Copy link
Member

martindurant commented Mar 23, 2018

+hdfs3 and azure-data-late.

Yes, it's very true, that each library has a very similar structure, but also there are many differences in the specific implementations. The base class wouldn't make building any of the libraries any easier, there isn't much to be shared, I don't think, but it would make their usage easier - it would require guarantees of the same set of methods, behaving in the same prescribed way. This is largely true already.

@martindurant

This comment has been minimized.

Copy link
Member

martindurant commented Apr 24, 2018

I have begun a repo which can be used to define a spec for filesystems: https://github.com/martindurant/filesystem_spec
(note that it has many personal comments from me, but that is just to get conversation going, everything is up for discussion)

Everyone here is welcome to read and comment on it in issues and PRs, if you agree that it is a location we can work on together to build consensus. If useful, it should be moved from my personal account to somewhere shared.

cc @xhochy @fjetter @wesm @mrocklin @jcrist

@mrocklin

This comment has been minimized.

Copy link
Member

mrocklin commented Apr 26, 2018

@xhochy @fjetter if you get a chance it would be useful to get your thoughts on that repository, and what might need to change for Arrow to use it, either formally or informally

@martindurant it's nice to see that there is some common code that might cut down on replication between the various filesystem libraries :)

@wesm

This comment has been minimized.

Copy link
Contributor

wesm commented May 8, 2018

I'm pretty backlogged until later this month -- it may be useful in the meantime to make the Arrow dev mailing list aware of this effort, though

@martindurant

This comment has been minimized.

Copy link
Member

martindurant commented May 8, 2018

How would I do that - or would you mind doing it for me?

@mrocklin

This comment has been minimized.

Copy link
Member

mrocklin commented May 8, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment