AVRO-1646: Print/parse consistency for nested null namespace #724

bplommer · 2019-11-25T11:07:36Z

Fixes print-parse consistency for named types with null namespace,
enclosed in named fields with null namespace, which are themselves
enclosed in named fields with non-null namespace.

Currently, if a type specifies the null namespace, this overrides the
enclosing non-null namespace for the purpose of naming that type, but
not for the purposes of determining the closest enclosing namespace for
more deeply nested types. This fixes that by removing special-casing of
null when setting the default namespace for recursively parsed schemas.

For example:

{
    "type": "record",
    "name": "Outer",
    "space": "space",
    "fields": [
        {
            "name": "f",
            "type": {
                "type": "record",
                "name": "Inner",
                "namespace": "",
                "fields": [
                    {
                        "name": "x",
                        "type": {
                            "type": "record",
                            "name": "Deeper",
                            "fields": [
                                {
                                    "name": "y",
                                    "type": "int"
                                }
                            ]
                        }
                    }
                ]
            }
        }
    ]
}

when parsed and re-printed becomes

{
  "type": "record",
  "name": "Outer",
  "namespace": "space",
  "fields": [
    {
      "name": "f",
      "type": {
        "type": "record",
        "name": "Inner",
        "namespace": "",
        "fields": [
          {
            "name": "x",
            "type": {
              "type": "record",
              "name": "Deeper",
              "namespace": "space",
              "fields": [
                {
                  "name": "y",
                  "type": "int"
                }
              ]
            }
          }
        ]
      }
    }
  ]
}

Adapted from the patch that was submitted with the AVRO-1646 ticket.

Jira

My PR addresses the following Avro Jira issues and references them in the PR title.
- https://issues.apache.org/jira/browse/AVRO-1646

Tests

My PR adds the following unit tests OR does not need testing for this extremely good reason:
- TestSchema.testDeeplyNestedNullNamespace

Commits

My commits all reference Jira issues in their subject lines. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters (not including Jira issue reference)
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Documentation

In case of new functionality, my PR adds documentation that describes how to use it.
- All the public functions and the classes in the PR contain Javadoc that explain what it does

Fixes print-parse consistency for named types with null namespace, enclosed in named fields with null namespace, which are themselves enclosed in named fields with non-null namespace. Currently, if a type specifies the null namespace, this overrides the enclosing non-null namespace for the purpose of naming that type, but not for the purposes of determining the closest enclosing namespace for more deeply nested types. This fixes that by removing special-casing of null when setting the default namespace for recursively parsed schemas. Adapted from the patch that was submitted with the AVRO-1646 ticket.

RyanSkraba · 2019-11-26T09:27:10Z

Thanks for the clear case in the description! I'm looking into this, but there's definitely something odd going on -- not in the description but the unit test.

Just reading the test and the Avro spec, I would expect the full name of the records to be space.Outer, space.Inner and space.Deeper. When run, the null namespace is applied to to Inner and Deeper, despite not specifying the null namespace anywhere in the test. Where does it come from?

Is this an ambiguity in the spec with regards to namespace inheritance? When you consider "canonical schemas", there's no way to distinguish the null namespace "" and a missing namespace. What rule is used to introduce the null namespace?

bplommer · 2019-11-26T10:46:38Z

@RyanSkraba My understanding was originally the same as yours, but after digging into it I came to the following understanding:

In the JSON representation, "namespace": "" specifies the null namespace, whereas leaving out the namespace field means the enclosing namespace (if any), or otherwise null, is used. From the spec: "If there is no enclosing namespace then the null namespace is used."
In the Java representation, there is no such thing as a named Schema instance that doesn't specify a namespace. The parser keeps track of the enclosing namespace, and uses it to explicitly specify the namespace when recursively parsing named types with no specified naemspace. When a named JSON schema with no specified namespace at the top level is parsed, the resulting Schema instance specifies the null namespace - this accords with the rule of using the null namespace when there is no enclosing namespace, but it does mean that nesting of Schema instances has different semantics from nesting of Json schemas.

Thus, when Inner is parsed it explicitly specifies the null namespace.

That's a good point about canonical schemas - based on my reading of the spec, the canonical form of the name Foo with null namespace should be .Foo, not Foo, but that isn't how it's currently implemented. The spec could definitely be much clearer.

RyanSkraba · 2019-11-26T15:13:38Z

Excellent and clear explanation, thanks! To paraphrase: in Java, the fact that inner exists and was parsed independently of outer gives it the NULL/"" namespace.

That sounds like a surprising rule in the current implementation, but it would avoid some problems with the same schema instance being reused in different name spaces. It doesn't seem quite right (for the semantics reason you give above), but it'd be hard to argue that it's wrong.

In any case, the current behaviour without this PR is actually addressing is definitely wrong and improved by this fix! It took me a while to unravel the way Names.space(), and Name.space() work during Schema parsing, but it looks good to me.

probot-autolabeler bot added the Java Pull Requests for Java binding label Nov 25, 2019

bplommer force-pushed the AVRO-1646-nested_null_namespace_pr branch from f3cbae8 to 2b45bd0 Compare November 25, 2019 11:42

bplommer mentioned this pull request Nov 25, 2019

Relax validation of reader/writer schema type names during decoding fd4s/vulcan#94

Closed

RyanSkraba approved these changes Nov 26, 2019

View reviewed changes

RyanSkraba merged commit 335acc5 into apache:master Feb 6, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

AVRO-1646: Print/parse consistency for nested null namespace #724

AVRO-1646: Print/parse consistency for nested null namespace #724

bplommer commented Nov 25, 2019 •

edited

Loading

RyanSkraba commented Nov 26, 2019

bplommer commented Nov 26, 2019 •

edited

Loading

RyanSkraba commented Nov 26, 2019

AVRO-1646: Print/parse consistency for nested null namespace #724

AVRO-1646: Print/parse consistency for nested null namespace #724

Conversation

bplommer commented Nov 25, 2019 • edited Loading

Jira

Tests

Commits

Documentation

RyanSkraba commented Nov 26, 2019

bplommer commented Nov 26, 2019 • edited Loading

RyanSkraba commented Nov 26, 2019

bplommer commented Nov 25, 2019 •

edited

Loading

bplommer commented Nov 26, 2019 •

edited

Loading