
Remove write_bytes #3116

Merged

jcrist merged 3 commits into dask:master from jcrist:write-bytes-cleanup__TEST_HDFS__ on Jan 31, 2018

Conversation

@jcrist (Member) commented Jan 30, 2018

This function was overly complicated and tried to handle too many different input types. It is easier and cleaner to have the output functions work with the file objects directly. Since this function was never part of the public API, no deprecation period is used.

  • Removes usage of write_bytes in to_csv. This allows writing directly to the file object, reducing graph size and memory usage (no need to write to in-memory bytes first).
  • Removes usage of write_bytes in to_textfiles. Cleans up internals of this function, removes duplicate calls to __dask_optimize__, reduces LOC.
  • Removes the write_bytes function from dask.bytes.core. Ports the relevant tests over to writing to files directly from open_files.
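The memory argument in the first bullet can be illustrated with a small, dask-free sketch. The function names and data here are hypothetical stand-ins, not the PR's code:

```python
import io

# Hypothetical, dask-free sketch of the two strategies the PR contrasts.

def via_intermediate_bytes(rows):
    # Old write_bytes-style path: render the whole partition to one
    # in-memory blob, which a separate task would then write out.
    return ("\n".join(rows) + "\n").encode("utf8")

def via_file_object(rows, fobj):
    # New style: each partition task streams rows straight into its
    # open file object, so no intermediate blob appears in the graph.
    for row in rows:
        fobj.write(row + "\n")

buf = io.StringIO()
via_file_object(["a,1", "b,2"], buf)
blob = via_intermediate_bytes(["a,1", "b,2"])
```

With the direct-write style, peak memory per task is bounded by a row rather than a whole serialized partition.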

Skip intermediate bytes/string generation. This reduces graph size,
reduces memory usage, and is simpler overall to understand.
Use `open_files` directly instead of `write_bytes`. This allows handling
`dask.bag` I/O more cleanly. Also removes the double optimization in
`to_textfiles` (once in `to_delayed` and once before compute) by
creating a bag and calling `compute` on that instead.
@jcrist jcrist force-pushed the write-bytes-cleanup__TEST_HDFS__ branch from 44c17e8 to cef3935 on January 30, 2018 23:31
This function was overly complicated and tried to handle too many
different input types. It is easier and cleaner to have output functions
work with the file objects directly.
@jcrist jcrist force-pushed the write-bytes-cleanup__TEST_HDFS__ branch from cef3935 to 626e218 on January 30, 2018 23:39
@jcrist (Member, Author) commented Jan 31, 2018

cc @mrocklin, @martindurant - this could use a quick review. In short, I think all uses of write_bytes are better served by using open_files directly. It's more efficient and results in a smaller API to maintain.

@mrocklin (Member) commented Jan 31, 2018 via email

@martindurant (Member) commented

Looks very nice, actually. I wanted to do something like this about a year ago :)

I wonder, is there any scope/need for a generalized write_to function as a utility for other settings? I don't have anything in mind, but I'm thinking that the to_csv pattern should be pretty common. I'm not suggesting doing anything about it; this is purely curiosity.

+1

name = 'to-textfiles-' + uuid.uuid4().hex
dsk = {(name, i): (_to_textfiles_chunk, (b.name, i), f)
for i, f in enumerate(files)}
out = type(b)(merge(dsk, b.dask), name, b.npartitions)
Review comment (Member):

This makes a bag?
So if compute is false, we unwrap the bag and give back the delayed _to_textfiles_chunks?

jcrist (Member, Author) replied:

Correct. But in that case we only call __dask_optimize__ once, rather than twice as we're currently doing. This potentially allows more inlining/fusing to happen, and avoids doing extra work. Seemed cleaner overall to me.

@jcrist (Member, Author) commented Jan 31, 2018

I wonder, is there any scope/need for a generalized write_to function as a utility for other settings?

Maybe, but I'd rather wait until we have a sufficient number of them before trying to factor out a generalization.

@jcrist jcrist merged commit d4e3c59 into dask:master Jan 31, 2018
@jcrist jcrist deleted the write-bytes-cleanup__TEST_HDFS__ branch January 31, 2018 17:17