Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AVRO-1646: Print/parse consistency for nested null namespace #724

Merged

Conversation

bplommer
Copy link
Contributor

@bplommer bplommer commented Nov 25, 2019

Fixes print-parse consistency for named types with null namespace,
enclosed in named fields with null namespace, which are themselves
enclosed in named fields with non-null namespace.

Currently, if a type specifies the null namespace, this overrides the
enclosing non-null namespace for the purpose of naming that type, but
not for the purposes of determining the closest enclosing namespace for
more deeply nested types. This fixes that by removing special-casing of
null when setting the default namespace for recursively parsed schemas.

For example:

{
    "type": "record",
    "name": "Outer",
    "space": "space",
    "fields": [
        {
            "name": "f",
            "type": {
                "type": "record",
                "name": "Inner",
                "namespace": "",
                "fields": [
                    {
                        "name": "x",
                        "type": {
                            "type": "record",
                            "name": "Deeper",
                            "fields": [
                                {
                                    "name": "y",
                                    "type": "int"
                                }
                            ]
                        }
                    }
                ]
            }
        }
    ]
}

when parsed and re-printed becomes

{
  "type": "record",
  "name": "Outer",
  "namespace": "space",
  "fields": [
    {
      "name": "f",
      "type": {
        "type": "record",
        "name": "Inner",
        "namespace": "",
        "fields": [
          {
            "name": "x",
            "type": {
              "type": "record",
              "name": "Deeper",
              "namespace": "space",
              "fields": [
                {
                  "name": "y",
                  "type": "int"
                }
              ]
            }
          }
        ]
      }
    }
  ]
}

Adapted from the patch that was submitted with the AVRO-1646 ticket.

Jira

Tests

  • My PR adds the following unit tests OR does not need testing for this extremely good reason:
    • TestSchema.testDeeplyNestedNullNamespace

Commits

  • My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters (not including Jira issue reference)
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

Documentation

  • In case of new functionality, my PR adds documentation that describes how to use it.
    • All the public functions and the classes in the PR contain Javadoc that explain what it does

@probot-autolabeler probot-autolabeler bot added the Java Pull Requests for Java binding label Nov 25, 2019
Fixes print-parse consistency for named types with null namespace,
enclosed in named fields with null namespace, which are themselves
enclosed in named fields with non-null namespace.

Currently, if a type specifies the null namespace, this overrides the
enclosing non-null namespace for the purpose of naming that type, but
not for the purposes of determining the closest enclosing namespace for
more deeply nested types. This fixes that by removing special-casing of
null when setting the default namespace for recursively parsed schemas.

Adapted from the patch that was submitted with the AVRO-1646 ticket.
@RyanSkraba
Copy link
Contributor

Thanks for the clear case in the description! I'm looking into this, but there's definitely something odd going on -- not in the description but the unit test.

Just reading the test and the Avro spec, I would expect the full name of the records to be space.Outer, space.Inner and space.Deeper. When run, the null namespace is applied to to Inner and Deeper, despite not specifying the null namespace anywhere in the test. Where does it come from?

Is this an ambiguity in the spec with regards to namespace inheritance? When you consider "canonical schemas", there's no way to distinguish the null namespace "" and a missing namespace. What rule is used to introduce the null namespace?

@bplommer
Copy link
Contributor Author

bplommer commented Nov 26, 2019

@RyanSkraba My understanding was originally the same as yours, but after digging into it I came to the following understanding:

  • In the JSON representation, "namespace": "" specifies the null namespace, whereas leaving out the namespace field means the enclosing namespace (if any), or otherwise null, is used. From the spec: "If there is no enclosing namespace then the null namespace is used."
  • In the Java representation, there is no such thing as a named Schema instance that doesn't specify a namespace. The parser keeps track of the enclosing namespace, and uses it to explicitly specify the namespace when recursively parsing named types with no specified naemspace. When a named JSON schema with no specified namespace at the top level is parsed, the resulting Schema instance specifies the null namespace - this accords with the rule of using the null namespace when there is no enclosing namespace, but it does mean that nesting of Schema instances has different semantics from nesting of Json schemas.

Thus, when Inner is parsed it explicitly specifies the null namespace.

That's a good point about canonical schemas - based on my reading of the spec, the canonical form of the name Foo with null namespace should be .Foo, not Foo, but that isn't how it's currently implemented. The spec could definitely be much clearer.

@RyanSkraba
Copy link
Contributor

Excellent and clear explanation, thanks! To paraphrase: in Java, the fact that inner exists and was parsed independently of outer gives it the NULL/"" namespace.

That sounds like a surprising rule in the current implementation, but it would avoid some problems with the same schema instance being reused in different name spaces. It doesn't seem quite right (for the semantics reason you give above), but it'd be hard to argue that it's wrong.

In any case, the current behaviour without this PR is actually addressing is definitely wrong and improved by this fix! It took me a while to unravel the way Names.space(), and Name.space() work during Schema parsing, but it looks good to me.

@RyanSkraba RyanSkraba merged commit 335acc5 into apache:master Feb 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Java Pull Requests for Java binding
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants