
ARROW-2843: [Format] Removing field layout from Schema.fbs breaks backward compatibility #2263

Closed

yufeldman wants to merge 1 commit into apache:master from yufeldman:ARROW-2843

Conversation

@yufeldman

Fix backward compatibility of Schema

- Add the removed field back and mark it deprecated, to prevent its use
  and to reduce the generated code footprint
Contributor

@jacques-n left a comment


Let's review the flatbuf spec. I believe a simple bytes field is serialization-compatible with a layout, which should allow us to remove VectorLayout from the definition.

@wesm
Member

wesm commented Jul 14, 2018

I think adding this field back in will make metadata written with 0.8.0 or 0.9.0 unreadable. I can confirm if that would be helpful. I don't think we should break the metadata two more times (adding this would take us to V5, then removing it again would take us to V6).

@wesm
Member

wesm commented Jul 14, 2018

You cannot delete fields you don't use anymore from the schema, but you can simply stop writing them into your data for almost the same effect. Additionally you can mark them as deprecated as in the example above, which will prevent the generation of accessors in the generated C++, as a way to enforce the field not being used any more. (careful: this may break code!).

This suggests that maintaining backward compatibility would require keeping a deprecated field in the schema forever. We should confirm, though.
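The pattern the FlatBuffers documentation describes looks roughly like this in schema form (a minimal sketch; the table and field names are illustrative, not Arrow's actual Schema.fbs):

```
// Sketch: instead of deleting an obsolete field, keep it and mark it
// deprecated. No accessors are generated for it, but its vtable slot
// stays reserved, so older and newer readers/writers keep agreeing on
// the positions of the remaining fields.
table Example {
  a: int;
  old_field: int (deprecated);  // kept forever for wire compatibility
  c: int;
}
```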

@yufeldman
Author

yufeldman commented Jul 14, 2018

Regarding typing for deprecated fields: yes, it looks like it could be a placeholder, so to speak:

google/flatbuffers#4385

There was a question:

> Would it be safe then to simply put some "placeholder" with an arbitrary type? I.e., go from
> table { a:int; b:int; c:int } to table { deprecated0: empty (deprecated); b:int; c:int },
> where "empty" is some type (an empty table, for example).

And the answer:

> Yes, a placeholder would be fine. In fact that was one of the suggestions above: the ability to have placeholders without a name or a type. Using some kind of type specifically for this purpose sounds like an ok alternative.
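Applied to the table from that exchange, the placeholder approach would look roughly like this (a sketch; `Empty` is a hypothetical table defined solely to serve as a placeholder type):

```
// Placeholder table with no fields, used only to occupy deprecated slots.
table Empty {}

// Before: table T { a:int; b:int; c:int; }
table T {
  deprecated0: Empty (deprecated);  // occupies a's slot; no accessors generated
  b: int;
  c: int;
}
```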

@wesm
Member

wesm commented Jul 14, 2018

Here is some test code in Python to write a schema to a file, then read it, so different codebase revisions can be compared:

import pyarrow as pa

fields = [
    pa.field('foo', pa.int32()),
    pa.field('bar', pa.string()),
    pa.field('baz', pa.list_(pa.int8()))
]
metadata = {b'foo': b'bar', b'pandas': b'badger'}

schema = pa.schema(fields, metadata=metadata)

# Serialize the schema to a buffer and write it to disk
buf = schema.serialize()

with open('test.schema', 'wb') as f:
    f.write(buf.to_pybytes())

# Read it back, potentially with a different codebase revision
with open('test.schema', 'rb') as f:
    schema_serialized = f.read()
    schema2 = pa.read_schema(pa.py_buffer(schema_serialized))

I will double check but it seems that these changes do not impact data written with 0.9.x. I'm concerned about having a lot of cruft lying around in our protocol, particularly when the justification for the cruft is based on a contraindicated use for Arrow that has not reached a stage of maturity where we have committed to maintaining backward compatibility.

@yufeldman
Author

Regarding data generated with Arrow 0.8 and 0.9: very good point. We definitely need to test it; I expect it to have issues.
FlatBuffers guarantees both backward and forward compatibility, but only as long as fields are not removed.
Ironically, this is not the last field; if it were, removing it wouldn't be a problem.
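A toy illustration of why removing a middle field breaks readers (plain Python; the field order shown is hypothetical, not the actual layout of Arrow's Field table — the point is only that FlatBuffers addresses fields by declared position, not by name):

```python
# Hypothetical field order for a FlatBuffers table, before and after
# removing a middle field. A writer and reader must agree on each
# field's declared position (its vtable slot).
fields_old = ["name", "nullable", "type", "children", "layout", "metadata"]
fields_new = ["name", "nullable", "type", "children", "metadata"]  # layout removed

# A writer built against the old schema puts metadata in slot 5, but a
# reader built against the new schema looks for it in slot 4 -- so the
# reader silently picks up the wrong data (or none at all).
print(fields_old.index("metadata"))  # 5
print(fields_new.index("metadata"))  # 4
```

Removing the *last* field (or marking a field deprecated instead of deleting it) leaves every remaining slot unchanged, which is why it would not cause this problem.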

@wesm
Member

wesm commented Jul 14, 2018

Sorry, the test I gave is not sufficient, because it's the field metadata that's impacted. I'll post an updated one.

@wesm
Member

wesm commented Jul 14, 2018

import pyarrow as pa

# Same exercise as above, but attaching metadata to each field rather
# than to the schema, since field-level metadata is what's impacted
field_metadata = {b'foo': b'bar', b'pandas': b'badger'}

fields = [
    pa.field('foo', pa.int32(), metadata=field_metadata),
    pa.field('bar', pa.string(), metadata=field_metadata),
    pa.field('baz', pa.list_(pa.int8()), metadata=field_metadata)
]

schema = pa.schema(fields)

buf = schema.serialize()

with open('test.schema', 'wb') as f:
    f.write(buf.to_pybytes())

with open('test.schema', 'rb') as f:
    schema_serialized = f.read()
    schema2 = pa.read_schema(pa.py_buffer(schema_serialized))

For data written by 0.9.0 or the master branch, after this patch the metadata seems to be dropped on read (though the process doesn't crash, in this example at least).

@yufeldman
Author

You mean when you write with 0.9 and read with my patch, right?
I would imagine that since a field is added, we have one more field, and somehow (probably not in every case) it considers the metadata absent, since it reads the length of the metadata first. I would not trust it much; in some cases it should throw an IndexOutOfBoundsException.

@wesm
Member

wesm commented Jul 14, 2018

Yes: write with 0.9.0, then read with this patch. The generated C++ bindings don't validate by default; it's possible that using a Verifier would catch the vtable mismatch.

@yufeldman
Author

OK, let me see what to do about versions 0.8 and 0.9; we can't abandon them.
But at the very least we should keep in mind that blindly removing fields from FlatBuffers can cause a lot of trouble.

@wesm
Member

wesm commented Jul 14, 2018

I'm personally fine with breaking the metadata if it's the right thing to do, because we have warned users not to use Arrow for persistence yet. We're discussing two options right now:

  • Break < 0.8.0 (the status quo as of right now)
  • Break 0.8.0 and 0.9.0, and add cruft to the Flatbuffers schemas

I agree that neither is ideal. I would prefer not to have the cruft.

@yufeldman
Author

Yes, let me get back to you on that. I need to think about if/how we can keep the wolves fed and the sheep safe :).

@wesm
Member

wesm commented Jul 16, 2018

One option to help with migrating persisted data: you can write these schemas to JSON with Arrow 0.7.x, then parse the JSON (which has not changed, save for the omission of the vector layout) with 0.9.x/0.10.x and write out the new binary data.

@yufeldman
Author

Thank you for the suggestion, though I'm not sure why I would need to write it out to JSON first. Couldn't I write it directly to binary, with the omission of the vector layout, after reading with the earlier version? Am I missing anything?

@wesm
Member

wesm commented Jul 16, 2018

Ah, I was suggesting it as an alternative to having to fashion a reader for the old protocol and a writer for the new protocol in the same process. You could make a simple CLI for JSON <-> binary (like the integration tests do) and run it against two JARs (one to read, the other to write).

@yufeldman
Author

Understood. Thank you for the suggestion. Will see.

@yufeldman
Author

I am going to close this PR and the JIRA.
I just want to make sure that in the future we do not remove fields from the .fbs files, but mark them deprecated instead.

@yufeldman yufeldman closed this Jul 20, 2018
