Skip intermediate bytes/string generation. This reduces graph size, reduces memory usage, and is simpler overall to understand.
Use `open_files` directly instead of `write_bytes`. This allows cleaner handling of `dask.bag` io. Also remove the double optimization in `to_textfiles` (once in `to_delayed` and once before compute) by creating a bag and calling `compute` on that instead.
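As a rough illustration of the difference these two commits describe, the stdlib-only sketch below contrasts the two write paths. It does not use dask: `write_via_bytes` and `write_direct` are hypothetical names, and plain `open` stands in for the file objects that `open_files` would hand to the output function.

```python
import csv
import io
import os
import tempfile


def write_via_bytes(rows, path):
    """Old pattern (illustrative): render to in-memory bytes, then write them."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    data = buf.getvalue().encode("utf-8")  # whole output is held in memory first
    with open(path, "wb") as f:
        f.write(data)


def write_direct(rows, path):
    """New pattern (illustrative): stream rows straight into the open file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, lineterminator="\n").writerows(rows)


rows = [("a", 1), ("b", 2)]
tmpdir = tempfile.mkdtemp()
via_bytes_path = os.path.join(tmpdir, "via_bytes.csv")
direct_path = os.path.join(tmpdir, "direct.csv")

write_via_bytes(rows, via_bytes_path)
write_direct(rows, direct_path)

# Both paths produce the same file; only the intermediate buffer differs.
with open(via_bytes_path, "rb") as f1, open(direct_path, "rb") as f2:
    same_output = f1.read() == f2.read()
```

The direct version has one fewer step per partition, which in a task graph would correspond to one fewer task and no intermediate bytes object.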
44c17e8 to cef3935
This function was overly complicated and tried to handle too many different input types. It is easier and cleaner to have output functions work with the file objects directly.
cef3935 to 626e218
cc @mrocklin, @martindurant - this could use a quick review. In short I think all uses of `write_bytes` can be better served by using `open_files` directly. More efficient and results in a smaller api to maintain.
I gave this a quick once-over earlier today and things seemed fine to me.
@martindurant probably has the final say here though.
Looks very nice, actually. I wanted to do something like this about a year ago :) I wonder, is there any scope/need for a generalized `write_to` function as a utility for other settings? I don't have anything in mind, but I'm thinking that the `to_csv` pattern should be pretty common. I'm not suggesting to do anything about it, this is purely curiosity. +1
    name = 'to-textfiles-' + uuid.uuid4().hex
    dsk = {(name, i): (_to_textfiles_chunk, (b.name, i), f)
           for i, f in enumerate(files)}
    out = type(b)(merge(dsk, b.dask), name, b.npartitions)
This makes a bag?
So if compute is false, we unwrap the bag and give back the delayed _to_textfiles_chunks?
Correct. But in that case we only call __dask_optimize__ once, rather than twice as we're currently doing. This potentially allows more inlining/fusing to happen, and avoids doing extra work. Seemed cleaner overall to me.
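To make the graph construction discussed in this thread more concrete, here is a toy sketch of the same shape of graph: one write task per partition, keyed `(name, i)`, each depending on the upstream partition key. It is only an illustration. `toy_get` is a hypothetical stand-in for a scheduler, `io.StringIO` objects stand in for the opened files, and no optimization step is modelled.

```python
import io
from uuid import uuid4


def to_textfiles_chunk(lines, f):
    """Write one partition's lines to an already-open file-like object."""
    for line in lines:
        f.write(line + "\n")


def toy_get(dsk, key):
    """Tiny recursive evaluator for a dask-style task graph (illustration only)."""
    task = dsk[key]
    if isinstance(task, tuple) and callable(task[0]):
        func, *args = task
        # Arguments that are themselves keys in the graph are computed first.
        resolved = [toy_get(dsk, a) if isinstance(a, tuple) and a in dsk else a
                    for a in args]
        return func(*resolved)
    return task  # plain data, e.g. a partition's list of lines


# Upstream "bag" graph: two partitions of text lines.
bag_name = "lines"
bag_graph = {(bag_name, 0): ["a", "b"], (bag_name, 1): ["c"]}
files = [io.StringIO(), io.StringIO()]

# One write task per (partition, file) pair, as in the snippet under review.
name = "to-textfiles-" + uuid4().hex
write_graph = {(name, i): (to_textfiles_chunk, (bag_name, i), f)
               for i, f in enumerate(files)}

# Merging the two graphs gives a single graph that can be handled in one pass.
dsk = {**bag_graph, **write_graph}
for i in range(len(files)):
    toy_get(dsk, (name, i))

written = [f.getvalue() for f in files]
```

Because the write tasks live in the same graph as the partitions they consume, whatever processes the graph sees everything at once instead of handling the upstream graph and the write step separately.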
Maybe, but I'd rather wait until we have a sufficient number of them before trying to generalize something out.
This function was overly complicated and tried to handle too many different input types. It is easier and cleaner to have output functions work with the file objects directly. Since this function was never public api, no deprecation period is used.
- Remove use of `write_bytes` in `to_csv`. This allows writing directly to the file object, reducing graph size and memory usage (no need to write to in-memory bytes first).
- Remove use of `write_bytes` in `to_textfiles`. Cleans up internals of this function, removes duplicate calls to `__dask_optimize__`, reduces LOC.
- Remove the `write_bytes` function from `dask.bytes.core`. Ports over any necessary relevant tests to write to files directly from `open_files`.