Enabling glob with HTTP file-system #3926

Merged
merged 8 commits into from Apr 30, 2019

3 participants
@martindurant
Member

commented Aug 31, 2018

(not meant to be merged, just for discussion)

I have implemented a simplistic way to do ls for HTTP, which has often been requested. It loads the target page and looks for HREFs that look like they are children. I'm sure there are many cases that would break it, but the point is to demonstrate reuse of code in fsspec: we get walk and glob for free by defining ls.

This kind of thing begs the question: do file-systems like this belong in dask? What about the rest of the bytes functionality, some of which is very dask specific, some of which is not?
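For readers who want to see the shape of the ls idea described above, here is a minimal sketch using only the standard library. It is not the code from this PR: the names `ls_from_html` and `glob_from_listing`, the example URLs, and the rule for what counts as a "child" link are all assumptions made for illustration. It parses a page that has already been fetched, keeps the links that resolve to somewhere below the base URL, and returns them sorted and de-duplicated.

```python
from fnmatch import fnmatchcase
from html.parser import HTMLParser
from urllib.parse import urljoin


class _HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def ls_from_html(base_url, html):
    """Return the sorted, unique links in ``html`` that sit below ``base_url``."""
    parser = _HrefCollector()
    parser.feed(html)
    base = base_url.rstrip("/") + "/"
    children = set()
    for href in parser.hrefs:
        # Assumed heuristic: sort-order links and in-page anchors are not children.
        if href.startswith(("?", "#")):
            continue
        full = urljoin(base, href)  # resolve relative links against the base
        if full.startswith(base) and full != base:
            children.add(full)
    return sorted(children)


def glob_from_listing(listing, pattern):
    """Filter an ls result with a shell-style pattern."""
    return [url for url in listing if fnmatchcase(url, pattern)]


# A made-up index page: two files, a parent link, and a duplicate entry.
page = (
    '<a href="a.csv">a</a> <a href="b.csv">b</a> '
    '<a href="../">up</a> <a href="a.csv">dup</a>'
)
listing = ls_from_html("http://example.com/data", page)
print(listing)
print(glob_from_listing(listing, "http://example.com/data/a*"))
```

Real index pages vary a lot (absolute links, redirects, JavaScript-generated listings), so a heuristic like this will miss cases, as the comment above already warns.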

Example of enabling glob with HTTP file-system
This derives from fsspec's abstract file-system, to show reusability of
glob and walk code.
@martindurant

Member Author

commented Aug 31, 2018

cc @mmccarty , who wanted glob for HTTP

@mrocklin

Member

commented Sep 2, 2018

I tried to give this a shot but ran into fsspec not being available. I tried to install it but it looks like it's also not easily pip installable from github. Happy to give it another go in a while.

@martindurant

Member Author

commented Sep 4, 2018

Sorry about that, @mrocklin , it seems I didn't successfully upload the versioneer pieces. It should install now from github.
Note that fsspec also includes a reworked copy of memoryfs, which is still an open PR somewhere here in dask, but doesn't belong here (and its utility isn't obvious in a distributed setting). If I were to work on this now, I would begin by putting a lot of tests around the code in fsspec; however, it's also not completely obvious that that repo, if it is to be a specification, should contain generic functionality or implementations.

@martindurant

Member Author

commented Sep 7, 2018

I guess there's no impetus for moving this code out to fsspec; should I add the ls and (partial) glob directly to HTTPFileSystem? I can keep that in this PR.

@martindurant

Member Author

commented Sep 10, 2018

@mrocklin , I guess this idea has stalled; happy to close as "uninteresting". I may, at a later date, do a more thorough job in fsspec, but it's not a priority.

@mrocklin

Member

commented Sep 10, 2018

@mmccarty

Member

commented Sep 10, 2018

I think folks are interested in seeing this implemented. I thought the question was where it should go, in dask or fsspec?

@martindurant

Member Author

commented Sep 10, 2018

Yes, the idea was to get a conversation going on a couple of interrelated things, which is why this makes for a poor PR:

  • the usage of ls/walk/glob in HTTP: is it a decent implementation, and should it be updated in dask using the existing code in fsspec? This question (from Intake) started things.
  • an example of depending on fsspec, which is unreleased and largely unvetted; this could act as a motivator for pushing that forward
  • the general question of whether we should move some or all of the FS implementation code over to fsspec, since it doesn't really belong in dask. That idea could apply more generally to much of dask.bytes; e.g., open_files is very useful everywhere, as Intake has found, but some stuff in there really is dasky.
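
To make the reuse argument in the first bullet concrete, here is a rough sketch of a base class where walk and glob are written once in terms of ls, so a backend only has to supply ls. The class names `ListingFileSystem` and `DictFileSystem` and the `http://host/...` paths are hypothetical; this is not fsspec's actual API, just an illustration of the pattern.

```python
import fnmatch
from abc import ABC, abstractmethod


class ListingFileSystem(ABC):
    """Hypothetical base class: subclasses only supply ``ls``."""

    @abstractmethod
    def ls(self, path):
        """Return the immediate children of ``path`` (empty list if none)."""

    def walk(self, path):
        """Yield every path reachable below ``path``, depth first."""
        for child in self.ls(path):
            yield child
            yield from self.walk(child)

    def glob(self, pattern):
        """Match a shell-style pattern against everything under its literal prefix."""
        # Start walking from the directory part before the first wildcard.
        cut = min(
            (pattern.find(c) for c in "*?[" if c in pattern),
            default=len(pattern),
        )
        root = pattern[:cut].rsplit("/", 1)[0]
        return sorted(
            p for p in self.walk(root) if fnmatch.fnmatchcase(p, pattern)
        )


class DictFileSystem(ListingFileSystem):
    """Toy backend: a dict mapping each parent path to its children."""

    def __init__(self, tree):
        self.tree = tree

    def ls(self, path):
        return sorted(set(self.tree.get(path.rstrip("/"), [])))


fs = DictFileSystem({
    "http://host/data": [
        "http://host/data/2018",
        "http://host/data/readme.txt",
    ],
    "http://host/data/2018": [
        "http://host/data/2018/a.csv",
        "http://host/data/2018/b.csv",
    ],
})
print(list(fs.walk("http://host/data")))
print(fs.glob("http://host/data/2018/*.csv"))
```

Two caveats on the sketch: `fnmatch` lets `*` match across `/`, which is looser than a real glob, and a walk over HTTP links would need protection against cycles, which the toy dict backend never produces.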
@martindurant

Member Author

commented Sep 10, 2018

An effort to turn fsspec from just a demonstration class into a usable file-system, to which open_files and its requirements, as well as HTTPFileSystem, could be added: martindurant/filesystem_spec#12 (this still requires a lot of testing and docs before anyone should use it for anything)

martindurant added some commits Feb 7, 2019

Merge branch 'master' into http_with_fs
Backport changes from fsspec
@martindurant

Member Author

commented Feb 7, 2019

Pulled out fsspec, so that this can work with the current dask bytes infrastructure. Uses some code in fsspec and will contribute some back too.

@martindurant martindurant changed the title from "RFC: Example of enabling glob with HTTP file-system" to "Enabling glob with HTTP file-system" Feb 7, 2019

martindurant added some commits Feb 7, 2019

@martindurant

Member Author

commented Feb 8, 2019

@jcrist , you might have some opinions here


    def mkdirs(self, url):
        """Make any intermediate directories to make path writable"""
        raise NotImplementedError

    def isdir(self, path):
        return True

@martindurant

martindurant Feb 8, 2019

Author Member

This is odd, but in HTTP, any URL can contain more files at paths below it

@pytest.mark.network
def test_parquet():
    from distutils.version import LooseVersion
    if LooseVersion(requests.__version__) < LooseVersion("2.21.0"):
        pytest.skip()

@martindurant

martindurant Feb 8, 2019

Author Member

This is because older versions of requests seem unable to make a successful HEAD request to find out how big the file is.
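
For context on the HEAD request mentioned here, a minimal sketch of asking a server for a file's size without downloading the body. It uses the standard library's `urllib.request` rather than requests, so it is a stand-in for illustration and not the code in this PR; the helper names `size_from_headers` and `file_size` and the timeout default are assumptions.

```python
import urllib.request


def size_from_headers(headers):
    """Return the Content-Length as an int, or None if absent or not a number."""
    value = headers.get("Content-Length")
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None


def file_size(url, timeout=10):
    """Issue a HEAD request and report the size the server advertises."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return size_from_headers(response.headers)
```

Not every server reports a length (chunked or dynamically generated responses often omit it), so callers should be ready for `None`.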

@martindurant

Member Author

commented Apr 30, 2019

I think this can be merged, and may be useful to some - still will want to push for separation of such stuff into fsspec following the permanent adoption of py3. Anyone have any objections?

@martindurant martindurant merged commit 871067f into dask:master Apr 30, 2019

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details

@martindurant martindurant deleted the martindurant:http_with_fs branch Apr 30, 2019

jorge-pessoa pushed a commit to jorge-pessoa/dask that referenced this pull request May 14, 2019

Enabling glob with HTTP file-system (dask#3926)
* Example of enabling glob with HTTP file-system

This derives from fsspec's abstract file-system, to show reusability of
glob and walk code.

* ls should return sorted, unique list

* more backporting

* revert options

* remove stray print

* remove import for flake

* Skip filesize on older requests

Thomas-Z added a commit to Thomas-Z/dask that referenced this pull request May 17, 2019

Enabling glob with HTTP file-system (dask#3926)