
Remove write_bytes #3116

Merged

jcrist merged 3 commits into dask:master from jcrist:write-bytes-cleanup__TEST_HDFS__ on Jan 31, 2018

Conversation

@jcrist (Member) commented Jan 30, 2018

This function was overly complicated and tried to handle too many different input types. It is easier and cleaner to have the output functions work with the file objects directly. Since this function was never part of the public API, no deprecation period is used.

  • Removes usage of write_bytes in to_csv. This allows writing directly to the file object, reducing graph size and memory usage (no need to write to in-memory bytes first).
  • Removes usage of write_bytes in to_textfiles. Cleans up internals of this function, removes duplicate calls to __dask_optimize__, reduces LOC.
  • Removes the write_bytes function from dask.bytes.core. Ports the relevant tests over to writing to files directly from open_files.
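The memory argument in the first bullet can be illustrated with a small, dask-free sketch. The function names and data here are hypothetical stand-ins, not the PR's code:

```python
import io

# Hypothetical, dask-free sketch of the two strategies the PR contrasts.

def via_intermediate_bytes(rows):
    # Old write_bytes-style path: render the whole partition to one
    # in-memory blob, which a separate task would then write out.
    return ("\n".join(rows) + "\n").encode("utf8")

def via_file_object(rows, fobj):
    # New style: each partition task streams rows straight into its
    # open file object, so no intermediate blob appears in the graph.
    for row in rows:
        fobj.write(row + "\n")

buf = io.StringIO()
via_file_object(["a,1", "b,2"], buf)
blob = via_intermediate_bytes(["a,1", "b,2"])
```

With the direct-write style, peak memory per task is bounded by a row rather than a whole serialized partition.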

Skip intermediate bytes/string generation. This reduces graph size,
reduces memory usage, and is simpler overall to understand.
Use `open_files` directly instead of `write_bytes`. This allows handling
`dask.bag` I/O more cleanly. Also removes the double optimization in
`to_textfiles` (once in `to_delayed` and once before compute) by
creating a bag and calling `compute` on that instead.
@jcrist jcrist force-pushed the write-bytes-cleanup__TEST_HDFS__ branch from 44c17e8 to cef3935 on January 30, 2018 23:31
This function was overly complicated and tried to handle too many
different input types. It is easier and cleaner to have output functions
work with the file objects directly.
@jcrist jcrist force-pushed the write-bytes-cleanup__TEST_HDFS__ branch from cef3935 to 626e218 on January 30, 2018 23:39
@jcrist (Member, Author) commented Jan 31, 2018

cc @mrocklin, @martindurant - this could use a quick review. In short, I think all uses of write_bytes are better served by using open_files directly. It's more efficient and results in a smaller API to maintain.

@mrocklin (Member) commented Jan 31, 2018 via email

@martindurant (Member) commented

Looks very nice, actually. I wanted to do something like this about a year ago :)

I wonder, is there any scope/need for a generalized write_to function as a utility for other settings? I don't have anything in mind, but I'm thinking that the to_csv pattern should be pretty common. I'm not suggesting doing anything about it; this is purely curiosity.

+1

name = 'to-textfiles-' + uuid.uuid4().hex
dsk = {(name, i): (_to_textfiles_chunk, (b.name, i), f)
for i, f in enumerate(files)}
out = type(b)(merge(dsk, b.dask), name, b.npartitions)
Review comment (Member):

This makes a bag?
So if compute is false, we unwrap the bag and give back the delayed _to_textfiles_chunks?

jcrist (Member, Author) replied:

Correct. But in that case we only call __dask_optimize__ once, rather than twice as we're currently doing. This potentially allows more inlining/fusing to happen, and avoids doing extra work. Seemed cleaner overall to me.

@jcrist (Member, Author) commented Jan 31, 2018

I wonder, is there any scope/need for a generalized write_to function as a utility for other settings?

Maybe, but I'd rather wait until we have a sufficient number of them before trying to factor out a generalization.

@jcrist jcrist merged commit d4e3c59 into dask:master Jan 31, 2018
@jcrist jcrist deleted the write-bytes-cleanup__TEST_HDFS__ branch January 31, 2018 17:17