Enabling glob with HTTP file-system#3926

Merged
martindurant merged 8 commits into dask:master from martindurant:http_with_fs
Apr 30, 2019

Conversation

@martindurant
Member

@martindurant martindurant commented Aug 31, 2018

(not meant to be merged, just for discussion)

I have implemented a simplistic way to do ls for HTTP, which has often been requested. It loads the target page and looks for HREFs that look like they are children. I'm sure there are many cases that would break it, but the point is to demonstrate reuse of code in fsspec, so we get walk and glob for free by defining ls.
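
To make the idea concrete, here is a minimal sketch of an HREF-based listing. It is not the code in this PR: the names `_HrefCollector` and `children_from_html` are hypothetical, it uses the standard library's `html.parser` rather than whatever the PR uses, and it works on already-downloaded HTML so that no network access is needed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def children_from_html(base_url, html):
    """Return sorted, unique absolute URLs linked from html that sit below base_url."""
    parser = _HrefCollector()
    parser.feed(html)
    prefix = base_url.rstrip("/") + "/"
    found = set()
    for href in parser.hrefs:
        # Sort-order links ("?C=N") and in-page anchors ("#top") are not children.
        if href.startswith(("?", "#")):
            continue
        absolute = urljoin(prefix, href)
        # Keep only links strictly below the listed URL; this drops the
        # parent directory, the page itself, and links to other sites.
        if absolute.startswith(prefix) and absolute != prefix:
            found.add(absolute)
    return sorted(found)
```

Given a page that links to `a.csv`, `../` and `sub/`, calling `children_from_html("http://example.com/data", page)` keeps only the two entries under `http://example.com/data/`. Pages that build their links with JavaScript, or that link to children through redirects, would still defeat this approach.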

This kind of thing raises the question: do file-systems like this belong in dask? What about the rest of the bytes functionality, some of which is very dask-specific and some of which is not?

This derives from fsspec's abstract file-system, to show reusability of
glob and walk code.
@martindurant
Member Author

cc @mmccarty , who wanted glob for HTTP

@mrocklin
Member

mrocklin commented Sep 2, 2018

I tried to give this a shot but ran into fsspec not being available. I tried to install it but it looks like it's also not easily pip installable from github. Happy to give it another go in a while.

@martindurant
Member Author

Sorry about that, @mrocklin , it seems I didn't successfully upload the versioneer pieces. It should install now from github.
Note that fsspec also includes a reworked copy of memoryfs, which is still an open PR somewhere here in dask, but doesn't belong here (and its utility isn't obvious in a distributed setting). If I were to work on this now, I would begin by putting a lot of tests around the code in fsspec; however, it's also not completely obvious that that repo, if it is to be a specification, should contain generic functionality or implementations.

@martindurant
Member Author

I guess there's no impetus for moving this code out to fsspec; should I add the ls and (partial) glob directly to HTTPFileSystem? I can keep that in this PR.

@martindurant
Member Author

@mrocklin , I guess this idea has stalled; happy to close as "uninteresting". I may, at a later date, do a more thorough job in fsspec, but it's not a priority.

@mrocklin
Member

mrocklin commented Sep 10, 2018 via email

@mmccarty
Member

mmccarty commented Sep 10, 2018

I think folks are interested in seeing this implemented. I thought the question was where it should go, in dask or fsspec?

@martindurant
Member Author

martindurant commented Sep 10, 2018

Yes, the idea was to get a conversation going on a couple of interrelated things, which is why this makes for a poor PR:

  • the usage of ls/walk/glob in HTTP: is it a decent implementation, and should it be updated in dask using the existing code in fsspec? This question (from Intake) started things.
  • an example of depending on fsspec, which is unreleased and largely unvetted; this could act as a motivator for pushing that forward
  • the general question of whether we should move some/all FS implementation code over to fsspec, since it doesn't really belong in dask. That idea could more generally apply to much of dask.bytes; e.g., open_files is very useful everywhere, as Intake has found, but some stuff in there really is dasky.
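
The reuse argument in the first point can be sketched as follows. This is an illustration only, not fsspec's actual abstract class: `ListingFileSystem` and `DictFileSystem` are hypothetical names, and the glob here is deliberately partial (it matches segment by segment with `fnmatch` and does not support `**`).

```python
import fnmatch


class ListingFileSystem:
    """Hypothetical base class: subclasses implement ls(), the rest is shared."""

    def ls(self, path):
        raise NotImplementedError

    def walk(self, path):
        """Return every path reachable below path, sorted and unique."""
        found, seen, stack = set(), set(), [path]
        while stack:
            current = stack.pop()
            if current in seen:  # guard against listings that link in a cycle
                continue
            seen.add(current)
            for child in self.ls(current):
                found.add(child)
                stack.append(child)
        return sorted(found)

    def glob(self, pattern):
        """Return walked paths whose segments match the pattern's segments."""
        # Start the walk at the deepest directory that precedes any wildcard.
        hits = [i for i in (pattern.find(c) for c in "*?[") if i >= 0]
        cut = min(hits) if hits else len(pattern)
        root = pattern[:cut].rsplit("/", 1)[0]
        wanted = pattern.split("/")
        matches = []
        for path in self.walk(root):
            parts = path.split("/")
            if len(parts) == len(wanted) and all(
                fnmatch.fnmatchcase(part, pat) for part, pat in zip(parts, wanted)
            ):
                matches.append(path)
        return matches


class DictFileSystem(ListingFileSystem):
    """Toy backend for the example: maps a directory path to its child paths."""

    def __init__(self, tree):
        self.tree = tree

    def ls(self, path):
        return sorted(self.tree.get(path, []))
```

An HTTP backend would only need to replace `DictFileSystem.ls` with a method that fetches a page and extracts child links; `walk` and `glob` would come along unchanged. Walking the whole subtree before filtering is simple but can mean many requests over HTTP.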

@martindurant
Member Author

Here is an effort to turn fsspec from just a demonstration class into a usable file-system, to which open_files and its requirements, as well as HTTPFileSystem, could be added: https://github.com/martindurant/filesystem_spec/pull/12 (it still requires a lot of testing and docs before anyone should use it for anything)

@martindurant
Member Author

Pulled out fsspec, so that this can work with the current dask bytes infrastructure. Uses some code in fsspec and will contribute some back too.

@martindurant martindurant changed the title RFC: Example of enabling glob with HTTP file-system Enabling glob with HTTP file-system Feb 7, 2019
@martindurant
Member Author

@jcrist , you might have some opinions here

        raise NotImplementedError

    def isdir(self, path):
        return True
Member Author

This is odd, but in HTTP, any URL can contain more files at paths below it

def test_parquet():
    from distutils.version import LooseVersion
    if LooseVersion(requests.__version__) < LooseVersion("2.21.0"):
        pytest.skip()
Member Author

This is because older versions of requests seem unable to make a successful HEAD request to find out how big the file is.
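
For context, finding a remote file's size with a HEAD request might look like the sketch below. It assumes the server reports a `Content-Length` header, and it uses the standard library's `urllib.request` instead of requests, so it does not reproduce the version-specific behaviour mentioned above. The function names `content_length` and `remote_size` are hypothetical.

```python
import urllib.request


def content_length(headers):
    """Return the size in bytes from response headers, or None if absent or invalid."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        size = int(value)
    except ValueError:
        return None
    return size if size >= 0 else None


def remote_size(url, timeout=10):
    """Issue a HEAD request and read the reported size without downloading the body."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return content_length(response.headers)
```

Returning `None` when the header is missing lets the caller decide whether to skip, as the test above does, or to fall back to reading the whole file.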

@martindurant
Member Author

I think this can be merged, and it may be useful to some. I will still want to push for separating such stuff into fsspec following the permanent adoption of py3. Does anyone have any objections?

@martindurant martindurant merged commit 871067f into dask:master Apr 30, 2019
@martindurant martindurant deleted the http_with_fs branch April 30, 2019 21:25
jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019
* Example of enabling glob with HTTP file-system

This derives from fsspec's abstract file-system, to show reusability of
glob and walk code.

* ls should return sorted, unique list

* more backporting

* revert options

* remove stray print

* remove import for flake

* Skip filesize on older requests