Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python] Do not sort schema fields by default #12232

Merged
merged 3 commits into from
Sep 29, 2021

Conversation

merlimat
Copy link
Contributor

Motivation

In Avro schema, the order of fields is used in the validation process, so if we are sorting the fields, that will generate an unexpected schema for a python producer/consumer and it will make it not interoperable with Java and other clients.

@merlimat merlimat added type/bug The PR fixed a bug or issue reported a bug release/2.8.2 labels Sep 29, 2021
@merlimat merlimat added this to the 2.9.0 milestone Sep 29, 2021
@merlimat merlimat self-assigned this Sep 29, 2021
@@ -60,6 +60,9 @@ class Record(with_metaclass(RecordMeta, object)):
# This field is used to set namespace for Avro Record schema.
_avro_namespace = None

# Generate a schema where fields are sorted alphabetically
_sorted_fields = False
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to set this value to True to avoid compatibility problems?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only concern is that we'll keep getting users stuck with this issue if we don't fix the default behavior.

Copy link
Contributor

@gaoran10 gaoran10 Sep 29, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe other users will meet new issues? This user is ok, but other users may be stuck.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we updated Avro in 2.8.0 we saw the same thing happened on Java.

This is not a problem because the consumer has the original writer schema while reading.

This should not be a problem

@sijie can give more context, he gave me lot of useful advice during the Avro update

@hangc0276
Copy link
Contributor

@merlimat Some tests run failed, Please take a look, thanks.

FAIL: test_avro_required_default (schema_test.SchemaTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/schema_test.py", line 856, in test_avro_required_default
    "default": None
AssertionError: {'fields': [{'default': None, 'type': ['null', 'int'], 'name': 'a'}, {'default': [truncated]... != {'fields': [{'default': None, 'type': ['null', 'int'], 'name': 'a'}, {'default': [truncated]...
Diff is 2313 characters long. Set self.maxDiff to None to see it.

======================================================================
FAIL: test_complex (schema_test.SchemaTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/schema_test.py", line 106, in test_complex
    "type": ["null", 'MySubRecord']
AssertionError: {'fields': [{'type': ['null', 'string'], 'name': 'a'}, {'type': ['null', {'field [truncated]... != {'fields': [{'type': ['null', 'string'], 'name': 'a'}, {'type': ['null', {'field [truncated]...
Diff is 1195 characters long. Set self.maxDiff to None to see it.

@sijie
Copy link
Member

sijie commented Sep 29, 2021

@eolivelli Putting the compatibility conversation aside, the PR here is valuable. Because it provides a flag to control whether to sort schema fields or not. So it will ensure python clients and java clients can generate the schema in the same order, which will improve debuggability and readability.

Copy link
Contributor

@eolivelli eolivelli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good for me

+1

Copy link
Contributor

@gaoran10 gaoran10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't think about the compatibility problem, I'm OK.

@hangc0276
Copy link
Contributor

hangc0276 commented Sep 29, 2021

I found a case about the schema without sort field.
If we define the class like:

class T2(Record):
            b = Integer()
            a = Integer()
            d = String()
            c = Double()

But the schema fields of T2 without sort is:

[{'type': ['null', 'int'], 'name': 'a'}, {'type': ['null', 'double'], 'name': 'c'}, {'type': ['null', 'int'], 'name': 'b'}, {'type': ['null', 'string'], 'name': 'd'}]

However we expected order is the field define order of the class (I am not sure), such as b, a, d, c, but the output order is a, c, b, d

@codelipenghui codelipenghui merged commit 2f3ad4d into apache:master Sep 29, 2021
@codelipenghui
Copy link
Contributor

@hangc0276 Could you please help create a separate issue to track the issue you have mentioned?

codelipenghui pushed a commit that referenced this pull request Sep 29, 2021
### Motivation

In Avro schema, the order of fields is used in the validation process, so if we are sorting the fields, that will generate an unexpected schema for a python producer/consumer and it will make it not interoperable with Java and other clients.

(cherry picked from commit 2f3ad4d)
@codelipenghui codelipenghui added the cherry-picked/branch-2.8 Archived: 2.8 is end of life label Sep 29, 2021
@hangc0276
Copy link
Contributor

@hangc0276 Could you please help create a separate issue to track the issue you have mentioned?

@codelipenghui I create an issue to track it apache/pulsar-client-python#48

@merlimat merlimat deleted the py-schema-sorted branch September 29, 2021 15:54
bharanic-dev pushed a commit to bharanic-dev/pulsar that referenced this pull request Mar 18, 2022
### Motivation

In Avro schema, the order of fields is used in the validation process, so if we are sorting the fields, that will generate an unexpected schema for a python producer/consumer and it will make it not interoperable with Java and other clients.
@BewareMyPower
Copy link
Contributor

BewareMyPower commented Mar 29, 2023

This PR actually leads to an incompatibility between the Java client and the Python client. See the issue here and the discussion in the mail list: https://lists.apache.org/thread/wl5rws7m0gqxc9n512llnpzf7kq5sp0j

In short, to be compatible with the POJO like:

class User {
    String name;
    int age;
    double score;
}

We have to give the following definition in Python:

class User(Record):
    _sorted_fields = True
    name = String()
    age = Integer(required=True)
    score = Double(required=True)

The default values are not compatible with the Java client so I suggest changing these default values. Please continue the discussion in the mail list if any of you has objection.

/cc @merlimat @gaoran10 @congbobo184 @codelipenghui @eolivelli @hangc0276 @mattisonchao

(BTW, these Python options are not documented well and I found the solution by reading the source code)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.8 Archived: 2.8 is end of life release/2.8.2 release/2.9.0 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants