
ARROW-2843: [Format] Removing field layout from Schema.fbs breaks backward compatibility #2263

Closed

yufeldman wants to merge 1 commit into apache:master from yufeldman:ARROW-2843

Conversation

@yufeldman

Fix backward compatibility of Schema

- Add the removed field back and mark it deprecated, to prevent its use
  and to reduce the generated code footprint
Contributor

@jacques-n left a comment


Let's review the flatbuf spec. I believe a simple bytes field is serialization-compatible with a layout, which should allow us to remove VectorLayout from the definition.

@wesm
Member

wesm commented Jul 14, 2018

I think adding this field back in will make metadata written with 0.8.0 or 0.9.0 unreadable. I can confirm if that would be helpful. I don't think we should break the metadata two more times (adding this would take us to V5, then removing it again would take us to V6).

@wesm
Member

wesm commented Jul 14, 2018

You cannot delete fields you don't use anymore from the schema, but you can simply stop writing them into your data for almost the same effect. Additionally you can mark them as deprecated as in the example above, which will prevent the generation of accessors in the generated C++, as a way to enforce the field not being used any more. (careful: this may break code!).

This suggests that maintaining backward compatibility would require keeping a deprecated field in the schema forever. We should confirm, though.
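The pattern the FlatBuffers documentation describes looks roughly like this in schema form (a minimal sketch; the table and field names are illustrative, not Arrow's actual Schema.fbs):

```
// Sketch: instead of deleting an obsolete field, keep it and mark it
// deprecated. No accessors are generated for it, but its vtable slot
// stays reserved, so older and newer readers/writers keep agreeing on
// the positions of the remaining fields.
table Example {
  a: int;
  old_field: int (deprecated);  // kept forever for wire compatibility
  c: int;
}
```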

@yufeldman
Author

yufeldman commented Jul 14, 2018

Regarding typing for deprecated fields: yes, it looks like it could be a placeholder, so to speak:

google/flatbuffers#4385

There was a question:

> Would it be safe then to simply put some "placeholder" with an arbitrary type? I.e., go from
> table { a:int; b:int; c:int } to table { deprecated0: empty (deprecated); b:int; c:int },
> where "empty" is some type (an empty table, for example).

And the answer:

> Yes, a placeholder would be fine. In fact that was one of the suggestions above: the ability to have placeholders without a name or a type. Using some kind of type specifically for this purpose sounds like an ok alternative.
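Applied to the table from that exchange, the placeholder approach would look roughly like this (a sketch; `Empty` is a hypothetical table defined solely to serve as a placeholder type):

```
// Placeholder table with no fields, used only to occupy deprecated slots.
table Empty {}

// Before: table T { a:int; b:int; c:int; }
table T {
  deprecated0: Empty (deprecated);  // occupies a's slot; no accessors generated
  b: int;
  c: int;
}
```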

@wesm
Member

wesm commented Jul 14, 2018

Here is some test code in Python to write a schema to a file, then read it, so different codebase revisions can be compared:

import pyarrow as pa

fields = [
    pa.field('foo', pa.int32()),
    pa.field('bar', pa.string()),
    pa.field('baz', pa.list_(pa.int8()))
]
metadata = {b'foo': b'bar', b'pandas': b'badger'}

schema = pa.schema(fields, metadata=metadata)

# Serialize the schema to a buffer and write it to disk
buf = schema.serialize()

with open('test.schema', 'wb') as f:
    f.write(buf.to_pybytes())

# Read it back, potentially with a different codebase revision
with open('test.schema', 'rb') as f:
    schema_serialized = f.read()
    schema2 = pa.read_schema(pa.py_buffer(schema_serialized))

I will double check but it seems that these changes do not impact data written with 0.9.x. I'm concerned about having a lot of cruft lying around in our protocol, particularly when the justification for the cruft is based on a contraindicated use for Arrow that has not reached a stage of maturity where we have committed to maintaining backward compatibility.

@yufeldman
Author

Regarding data generated with Arrow 0.8 and 0.9: very good point. We definitely need to test it; I expect it to have issues.
FlatBuffers guarantees both backward and forward compatibility, but only as long as fields are not removed.
Ironically, this is not the last field; if it were, removing it wouldn't be a problem.
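A toy illustration of why removing a middle field breaks readers (plain Python; the field order shown is hypothetical, not the actual layout of Arrow's Field table — the point is only that FlatBuffers addresses fields by declared position, not by name):

```python
# Hypothetical field order for a FlatBuffers table, before and after
# removing a middle field. A writer and reader must agree on each
# field's declared position (its vtable slot).
fields_old = ["name", "nullable", "type", "children", "layout", "metadata"]
fields_new = ["name", "nullable", "type", "children", "metadata"]  # layout removed

# A writer built against the old schema puts metadata in slot 5, but a
# reader built against the new schema looks for it in slot 4 -- so the
# reader silently picks up the wrong data (or none at all).
print(fields_old.index("metadata"))  # 5
print(fields_new.index("metadata"))  # 4
```

Removing the *last* field (or marking a field deprecated instead of deleting it) leaves every remaining slot unchanged, which is why it would not cause this problem.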

@wesm
Member

wesm commented Jul 14, 2018

Sorry, the test I gave is not sufficient, because it's the field metadata that's impacted. I'll post an updated one.

@wesm
Member

wesm commented Jul 14, 2018

import pyarrow as pa

# Same exercise as above, but attaching metadata to each field rather
# than to the schema, since field-level metadata is what's impacted
field_metadata = {b'foo': b'bar', b'pandas': b'badger'}

fields = [
    pa.field('foo', pa.int32(), metadata=field_metadata),
    pa.field('bar', pa.string(), metadata=field_metadata),
    pa.field('baz', pa.list_(pa.int8()), metadata=field_metadata)
]

schema = pa.schema(fields)

buf = schema.serialize()

with open('test.schema', 'wb') as f:
    f.write(buf.to_pybytes())

with open('test.schema', 'rb') as f:
    schema_serialized = f.read()
    schema2 = pa.read_schema(pa.py_buffer(schema_serialized))

For data written by 0.9.0 or the master branch, after this patch the metadata seems to be dropped on read (though the process doesn't crash, in this example at least).

@yufeldman
Author

You mean when you write with 0.9 and read with my patch, right?
I would imagine that since a field is added, we have one more field, and somehow (probably not in every case) it considers the metadata absent, since it reads the length of the metadata first. I would not trust it much; in some cases it should throw an IndexOutOfBoundsException.

@wesm
Member

wesm commented Jul 14, 2018

Yes: write with 0.9.0, then read with this patch. The generated C++ bindings don't validate by default; it's possible that using a Verifier would catch the vtable mismatch.

@yufeldman
Author

OK, let me see what to do about versions 0.8 and 0.9; we can't abandon them.
But at the very least we should keep in mind that blindly removing fields from FlatBuffers can cause a lot of trouble.

@wesm
Member

wesm commented Jul 14, 2018

I'm personally fine with breaking the metadata if it's the right thing to do, because we have warned users not to use Arrow for persistence yet. We're discussing two options right now:

  • Break < 0.8.0 (the status quo as of right now)
  • Break 0.8.0 and 0.9.0, and add cruft to the Flatbuffers schemas

I agree that neither is ideal. I would prefer not to have the cruft.

@yufeldman
Author

Yes, let me get back to you on that. I need to think about if/how we can keep the wolves fed and the sheep safe :).

@wesm
Member

wesm commented Jul 16, 2018

One option to help with migrating persisted data: you can write these schemas to JSON with Arrow 0.7.x, then parse the JSON (which has not changed, save for the omission of the vector layout) with 0.9.x/0.10.x and write out the new binary data.

@yufeldman
Author

Thank you for the suggestion, though I'm not sure why I would need to write it out to JSON first. Couldn't I write it directly to binary, with the omission of the vector layout, after reading with the earlier version? Am I missing anything?

@wesm
Member

wesm commented Jul 16, 2018

Ah, I was suggesting it as an alternative to having to fashion a reader for the old protocol and a writer for the new protocol in the same process. You could make a simple CLI for JSON <-> binary (like the integration tests do) and run it against two JARs (one to read, the other to write).

@yufeldman
Author

Understood. Thank you for the suggestion. Will see.

@yufeldman
Author

I am going to close this PR and the JIRA.
I just want to make sure that in the future we do not remove fields from the .fbs files, but mark them deprecated instead.

@yufeldman yufeldman closed this Jul 20, 2018
