
base dask on fsspec #5064

Merged: 28 commits merged into dask:master from the fsspec branch on Jul 18, 2019
Conversation

@martindurant (Member) commented Jul 3, 2019

  • Tests added / passed
  • Passes black dask / flake8 dask

lots of red in the +/-!

@martindurant changed the title from "WIP: base dask on sspec" to "WIP: base dask on fsspec" on Jul 3, 2019
@mrocklin (Member) commented Jul 4, 2019

Side thought: You should write about this. The message might be something like "We (I) have invested a lot of time building out the bytes handling in Dask in a nicely modular way that can be helpful to others. We found that, surprisingly, it was helpful to others. This includes Intake, X, Y, and Z (zarr, rapids, ...?). We've now moved the last bits from Dask so that you too can use these systems without depending on Dask explicitly (though of course it's a very lightweight dependency)."

@martindurant (Member, Author)

@mrocklin : yep, eventually. I posted the code here more for backup and to show the dev group that this thing really would work.

@martindurant (Member, Author)

Merging here requires merges on other repos, and possibly releases; but all tests pass locally now.

@martindurant (Member, Author)

OK, so errors are due to two main things:

  • s3fs not yet having a release that is based on fsspec
  • GzipFile et al. being seekable on py37, but not raising "operation not supported" as they did on earlier versions

@mrocklin (Member) commented Jul 9, 2019

Merging here requires merges on other repos, and possibly releases; but all tests pass locally now.

Woot!

@martindurant (Member, Author)

So it would be a good idea to make a decision now on read_* functions with compression. Should blocksize!=None be a straight exception? None of the compression engines support real random access normally (even if some can). Currently, random access is allowed and works on py37, but it is scanning through the file to find the right offset. That could be useful in some scenarios, but terrible in most.
If an early exception, I would probably put it in logical_size, since that's where the first scan would occur, to find the end of the uncompressed data.
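
To illustrate the cost of that scanning, here is a small standalone demonstration with the standard library's GzipFile (illustrative only, not dask code):

    import gzip
    import io

    # A toy "file": 1 MB of data, gzip-compressed in memory.
    raw = io.BytesIO(gzip.compress(b"x" * 1_000_000))
    f = gzip.GzipFile(fileobj=raw)

    print(f.seekable())  # True on py37 -- the object claims random access...
    f.seek(900_000)      # ...but this decompresses ~900 kB just to reach the offset
    print(f.read(5))     # b'xxxxx'

The seek succeeds, but only because the reader decompresses everything before the target offset, which is exactly the cost in question.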

@martindurant (Member, Author)

Just a single failure on https://travis-ci.org/dask/dask/jobs/557843869#L1002, around compression with chunksize; but quite a few Windows path errors that I'll need to do some work on.

@mrocklin (Member)

So it would be a good idea to make a decision now on read_* functions with compression. Should blocksize!=None be a straight exception? None of the compression engines support real random access normally (even if some can). Currently, random access is allowed and works on py37, but it is scanning through the file to find the right offset. That could be useful in some scenarios, but terrible in most.

When you say it like that, raising an error seems reasonable to me. I haven't thought about this very deeply though. cc'ing @jcrist who might have thoughts?

@jcrist (Member) commented Jul 12, 2019

For read_csv we already warn and then switch to using blocksize=None:

if blocksize and compression not in seekable_files:
    warn(
        "Warning %s compression does not support breaking apart files\n"
        "Please ensure that each individual file can fit in memory and\n"
        "use the keyword ``blocksize=None to remove this message``\n"
        "Setting ``blocksize=None``" % compression
    )
    blocksize = None

Seems reasonable to me to convert this to an error everywhere.
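
For concreteness, the error-everywhere version might look roughly like this; an illustrative sketch reusing the names from the snippet above, not necessarily the code that ends up in this PR:

    if blocksize and compression not in seekable_files:
        # Hypothetical replacement: fail loudly instead of warning and
        # silently resetting blocksize to None.
        raise ValueError(
            "%s compression does not support breaking apart files; "
            "pass blocksize=None to read each file as a single partition."
            % compression
        )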

@martindurant (Member, Author)

OK, this actually passes Travis for py37 (linting not yet done).
Now back to Windows paths...

@martindurant (Member, Author)

Green on Windows, only flake8 errors on py37 :)

Martin Durant added 2 commits July 14, 2019 12:53
Leaving imports in dask.bytes.core, so as not to break code that uses these outside of dask. Will later mark for deprecation.
@martindurant (Member, Author) commented Jul 17, 2019

py37: passing (with s3fs from master)
py36: failing s3 only (s3fs from release)
py35: failing s3 (s3fs from release) and something in kwargs to compression
appveyor: only failure is an unrelated partd thing

@martindurant changed the title from "WIP: base dask on fsspec" to "base dask on fsspec" on Jul 17, 2019
@martindurant (Member, Author)

Summary:

  • I recommend merging this.

Outstanding:

  • the test failure on appveyor is due to something with partd, which I'm pretty sure is unrelated
  • the appveyor and py37 tests use master versions, which currently are the same as released versions. This can be reverted, but wanted to be sure that's the consensus.

@TomAugspurger (Member)

the test failure on appveyor is due to something with partd, which I'm pretty sure is unrelated

Most likely #5104

the appveyor and py37 tests use master versions, which currently are the same as released versions. This can be reverted, but wanted to be sure that's the consensus.

Could the master versions be moved to the nightly upstream build? I've almost got that passing; just one remaining pandas failure that I'll try to fix today.

@martindurant (Member, Author)

Could the master versions be moved to the nightly upstream build?

Sure, I'm OK with that; then for this PR, we should go back to released versions only in all builds.

@rjzamora (Member)

PR #4995 now has these changes merged in, and all tests are passing.

@martindurant (Member, Author)

Will merge this evening, unless there are objections.

@martindurant merged commit 77d51a1 into dask:master on Jul 18, 2019
@martindurant deleted the fsspec branch on July 18, 2019 at 23:27
@jrbourbeau (Member)

Woo! Thanks for all your work on this @martindurant :)
