Conversation
Involves a little copying of code from uavro, but stands alone.
I would not advise trying to add a to_avro counterpart to bag, since in general we cannot know whether the items in a bag all share the same schema.
dask/bag/avro.py
    files = open_files(urlpath, **storage_options)
    with copy.copy(files[0]) as f:
        # we assume the same header for all files
        head = read_header(f)
Please forgive the naive question, but what exactly does this header represent?
Mainly wondering if it could change if the avro files in a given urlpath had different schemas.
The header has a name, description, namespace, schema (fields and types) and a block sync marker. You are right: in the case of bags, it may be OK to read files where each has a different schema, and I actually think the comment is wrong and this would work fine as things stand. There is no test for different schemas, though; I hadn't considered it.
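For reference, the container-file header is small and easy to parse by hand. The sketch below is pure Python and illustrative only (not the code in this PR); it reads the magic bytes, the metadata map that carries the JSON schema under the `avro.schema` key, and the trailing 16-byte sync marker, as laid out in the Avro spec:

```python
import io

MAGIC = b"Obj\x01"  # Avro object-container-file magic bytes


def read_long(fo):
    """Decode one zigzag variable-length long from a file-like object."""
    b = fo.read(1)[0]
    n = b & 0x7F
    shift = 7
    while b & 0x80:
        b = fo.read(1)[0]
        n |= (b & 0x7F) << shift
        shift += 7
    return (n >> 1) ^ -(n & 1)


def read_header(fo):
    """Return (metadata dict, 16-byte sync marker) from an Avro file."""
    if fo.read(4) != MAGIC:
        raise ValueError("not an Avro container file")
    meta = {}
    count = read_long(fo)
    while count != 0:
        if count < 0:  # negative count: a byte-size hint follows
            count = -count
            read_long(fo)
        for _ in range(count):
            key = fo.read(read_long(fo)).decode("utf-8")
            meta[key] = fo.read(read_long(fo))
        count = read_long(fo)
    return meta, fo.read(16)
```

With a real file, `meta["avro.schema"]` holds the schema as JSON bytes, and every data block in the file ends with the returned sync marker.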
Ah, actually I've made a mistake - we need the sync marker up front, else the read_bytes call above goes wrong. This'll take a little bit of work.
I see what you are saying; it's a bit of a circular dependency. In the dask-avro library, @rmax used the fastavro reader to grab the sync marker before calling read_bytes, if that helps.
Either way, if we can get by with just reading the header from one file, that would be great, as I think doing it for each file significantly slowed things down in dask_avro.
(see my comment below)
In general, the sync markers are all different. I think it is possible to have identical headers for every file, but that is not the normal way to do things.
@mrocklin, do you think it's reasonable to touch every file in the client part of the code, since each will have a different sync marker, and there's no way to load in chunks without that? The amount loaded for each file would be small, but this is exactly the case we are trying to avoid in the parquet case. The reads could be parallelised, at least.
Obviously it would be nice to avoid this. From what you say, though, it sounds like that's not possible? Seems like an easy decision in that case :)
Already commented, but I want to add that according to the avro spec (and the reference implementation), every file will have a different sync marker.
Out of curiosity, how expensive would it be to also write avro? |
Using fastavro, with one file per partition and no checks around consistency of the items, it would be easy. The hardest part would be inferring the schema; I don't think fastavro has a way to do that automatically (in the test it is hard-coded).
Is it reasonable to ask users to provide the schema?
I think that @rmax and @ian-whitestone might have a better feeling for that.
Full disclosure: I haven't done any avro writing before. I think it's reasonable, and I feel like it enforces good practices. That said, schema inference could be a nice separate feature, or an option when calling the write function, for those looking to prototype pipelines quickly. I know spark lets you dump a dataframe to avro without specifying a schema.
@mrocklin, I'd be happy to attempt both provided-schema and inference writing. For the latter I would imagine using the first N records, where N is user-specified. Things like "enum" (category) would not be inferable. Maybe that should be in a new PR, and maybe lower priority than dataframe.read_avro. I have already put a bit of work into uavro to update it with the things learned in this PR so far.
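A rough sketch of what first-N-records inference might look like (the function and the Python-to-Avro type mapping are hypothetical, though the type names follow the Avro spec; as noted, refinements like "enum" cannot be recovered this way):

```python
# Hypothetical mapping from Python types to Avro primitive type names.
_AVRO_TYPES = {bool: "boolean", int: "long", float: "double",
               str: "string", bytes: "bytes"}


def infer_schema(records, name="Record", n=100):
    """Guess an Avro record schema from the first n records.

    A field whose values are sometimes None becomes a union with
    "null"; categorical ("enum") and other refined types cannot be
    recovered from the values alone.
    """
    fields = {}
    for rec in records[:n]:
        for key, value in rec.items():
            types = fields.setdefault(key, set())
            types.add("null" if value is None
                      else _AVRO_TYPES[type(value)])
    return {
        "type": "record",
        "name": name,
        "fields": [
            {"name": k,
             "type": next(iter(t)) if len(t) == 1 else sorted(t)}
            for k, t in fields.items()
        ],
    }
```

The resulting dict has the shape fastavro's writer expects for a record schema, but a real implementation would also need to handle nested records, lists, and fields absent from the sample.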
Agreed. No pressure from me on write_avro. It was just a question. I'd be more than happy to wait until someone actually asks for it.
I think this is pretty simple and good like this, and I can come back for the other features in the future.
Sounds good to me