
Add bag.read_avro #4000

Merged
martindurant merged 3 commits into dask:master from martindurant:bag_read_avro
Sep 22, 2018

Conversation

@martindurant
Member

Involves a little copying of code from uavro, but stands alone.

  • Tests added / passed
  • Passes flake8 dask

@martindurant
Member Author

I would not advise trying to add a to_avro counterpart to bag, since in general we cannot know whether the items in a bag all share the same schema.

dask/bag/avro.py Outdated
files = open_files(urlpath, **storage_options)
with copy.copy(files[0]) as f:
    # we assume the same header for all files
    head = read_header(f)

Please forgive the naive question, but what exactly does this header represent?

Mainly wondering if it could change if the avro files in a given urlpath had different schemas..

Member Author

The header has a name, description, namespace, schema (fields and types) and a block sync marker. You are right that, in the case of bags, it may be OK to read files where each has a different schema; I actually think the code comment is wrong and this would work fine as things stand. There is no test for different schemas, though; I hadn't considered it.
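For reference, a minimal sketch of what parsing such a header involves, written against the Avro container-file layout rather than the actual uavro/dask code (`read_long` and `read_header` here are illustrative names, not the PR's API): the file starts with the magic `Obj\x01`, then a metadata map (which carries `avro.schema` and friends), then the 16-byte sync marker.

```python
import io

MAGIC = b"Obj\x01"

def read_long(fo):
    # Decode one Avro zig-zag varint long from a file-like object.
    b = fo.read(1)[0]
    n = b & 0x7F
    shift = 7
    while b & 0x80:
        b = fo.read(1)[0]
        n |= (b & 0x7F) << shift
        shift += 7
    return (n >> 1) ^ -(n & 1)

def read_header(fo):
    # Return (metadata dict, 16-byte sync marker) for one container file.
    if fo.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    count = read_long(fo)
    while count != 0:
        if count < 0:           # negative block count: a byte-size long follows
            read_long(fo)
            count = -count
        for _ in range(count):
            key = fo.read(read_long(fo)).decode()
            meta[key] = fo.read(read_long(fo))
        count = read_long(fo)
    return meta, fo.read(16)    # metadata map, then the sync marker
```

Because the sync marker sits right after the metadata map, any reader that wants to split a file into blocks has to parse at least this much of it first.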

Member Author

Ah, actually I've made a mistake: we need the sync marker up front, or else the read_bytes call above goes wrong. This will take a little bit of work.


I see what you are saying, a bit of a circular dependency. In the dask-avro library @rmax used the fastavro reader to grab the sync marker before calling read_bytes, if that helps.

Either way, if we can get by with just reading the header from one file, that would be great; I think doing it for each file significantly slowed things down in dask-avro.

Member Author

(See my comment below.) In general, the sync markers are all different. I think it is possible to have identical headers for every file, but that is not the normal way to do things.

@martindurant
Member Author

@mrocklin, do you think it's reasonable to touch every file in the client part of the code? Each will have a different sync marker, and there's no way to load in chunks without it. The amount loaded from each file would be small, but this is exactly the situation we are trying to avoid in the parquet case. The reads could be parallelised, at least.

@mrocklin
Member

> @mrocklin, do you think it's reasonable to touch every file in the client part of the code

Obviously it would be nice to avoid this. From what you say, though, it sounds like that's not possible? Seems like an easy decision in that case :)

@rmax

rmax commented Sep 20, 2018

I already commented, but want to add that according to the Avro spec (and the reference implementation), every file will have a different sync marker.

@mrocklin
Member

Out of curiosity, how expensive would it be to also write avro?

@martindurant
Member Author

> how expensive would it be to also write avro?

Using fastavro, with one file per partition and no checks on the consistency of the items, it would be easy. The hardest part would be inferring the schema; I don't think fastavro has a way to do that automatically (in the test it is hard-coded).

@mrocklin
Member

mrocklin commented Sep 21, 2018 via email

@martindurant
Member Author

> Is it reasonable to ask users to provide the schema?

I think that @rmax and @ian-whitestone might have a better feeling for that.

@ian-whitestone

Full disclosure: I haven't done any avro writing before.

I think it's reasonable, and I feel like it enforces good practices. That said, schema inference could be a nice separate feature, or an option on the write function, for those looking to prototype pipelines quickly. I know Spark lets you dump a dataframe to Avro without specifying a schema.

@martindurant
Member Author

@mrocklin, I'd be happy to attempt both provided-schema and inferred-schema writing. For the latter I would imagine using the first N records, where N is user-specified. Things like "enum" (category) would not be inferable.

Maybe that should be in a new PR, and maybe lower priority than dataframe.read_avro. I have already put a bit of work into uavro to incorporate the things learned in this PR so far.
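The first-N-records inference could look something like this sketch (`infer_schema` is a hypothetical name, not dask or fastavro API; it maps plain Python types to Avro primitives and, as noted above, cannot recover refinements like "enum"):

```python
def infer_schema(records, name="Record", n=100):
    # Map the Python types seen in the first n records to Avro primitives.
    avro_types = {bool: "boolean", int: "long", float: "double",
                  str: "string", bytes: "bytes", type(None): "null"}
    fields = {}
    for rec in records[:n]:
        for key, value in rec.items():
            fields.setdefault(key, set()).add(avro_types[type(value)])
    return {
        "type": "record",
        "name": name,
        "fields": [
            # A field seen with several types becomes an Avro union.
            {"name": k, "type": ts.pop() if len(ts) == 1 else sorted(ts)}
            for k, ts in fields.items()
        ],
    }
```

Sampling only the first N records keeps inference cheap, at the cost of missing types that first appear later in the bag.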

@mrocklin
Member

mrocklin commented Sep 21, 2018 via email

@martindurant
Member Author

I think this is pretty simple and good like this, and I can come back for the other features in the future.

@mrocklin
Member

mrocklin commented Sep 22, 2018 via email

@martindurant martindurant merged commit 15eb83f into dask:master Sep 22, 2018
@martindurant martindurant deleted the bag_read_avro branch September 22, 2018 13:26