Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AvroEx.Schema.Parser - Hand-rolled schema parser #62

Merged
merged 16 commits into from Mar 4, 2022
Merged

Conversation

davydog187
Copy link
Member

@davydog187 davydog187 commented Feb 24, 2022

The goal of this PR is to

  1. Directly support parsing Elixir terms
  2. Seamlessly parse Elixir terms or json
  3. Detailed error messages for invalid Avro schemas
  4. Add better support for all schema types defined in the Avro 1.11.0 Specification.

Closes #40
Closes #60
Closes #61
Closes #55

Schema features

  • Primitives
  • Unions
  • Arrays
  • Maps
  • Fixed
  • Enums
  • Records
  • Name references
  • Namespace propagation

@davydog187
Copy link
Member Author

davydog187 commented Mar 2, 2022

I'm having some questions about the spec with respect to how full names are resolved, and what is and isn't allowed in terms of duplicates.

Testing with avro-js, I get the following results

Duplicate field names in a Record

❌ Not allowed

> avro.parse({type: "record", name: "a", fields: [{name: "a", type: "string"}, {name: "a", type: "string"}]})
Uncaught Error: duplicate a field name
    at new RecordType (/Users/dave/Development/avro_ex/javascript/node_modules/avro-js/lib/schemas.js:1429:11)
    at /Users/dave/Development/avro_ex/javascript/node_modules/avro-js/lib/schemas.js:136:14
    at Object.createType (/Users/dave/Development/avro_ex/javascript/node_modules/avro-js/lib/schemas.js:137:7)
    at Object.parse (/Users/dave/Development/avro_ex/javascript/node_modules/avro-js/lib/files.js:74:13)

Duplicate field names in separate Records

✅ Allowed

> avro.parse({type: "record", name: "a", fields: [{name: "a", type: "string"}, {name: "b", type: {type: "record", name: "inner", fields: [{name: "a", type: "string"}]}}]})
RecordType {
  _name: 'a',
  _aliases: [],
  _type: 'record',
  _fields: [
    Field { _name: 'a', _type: StringType {}, _aliases: [], _order: 1 },
    Field { _name: 'b', _type: [RecordType], _aliases: [], _order: 1 }
  ],
  _constructor: [Function: a] { getType: [Function (anonymous)] },
  _read: [Function: reada],
  _skip: [Function: skipa],
  _write: [Function: writea],
  _check: [Function: checka]
}

However, it does not seem to respect namespaces on Record fields, which tells me I'm either misunderstanding the spec, or this implementation has issues.

The spec reads

In record, enum and fixed definitions, the fullname is determined in one of the following ways:

A name and namespace are both specified. For example, one might use "name": "X", "namespace": "org.foo" to indicate the fullname org.foo.X.
A fullname is specified. If the name specified contains a dot, then it is assumed to be a fullname, and any namespace also specified is ignored. For example, use "name": "org.foo.X" to indicate the fullname org.foo.X.
A name only is specified, i.e., a name that contains no dots. In this case the namespace is taken from the most tightly enclosing schema or protocol. For example, if "name": "X" is specified, and this occurs within a field of the record definition of org.foo.Y, then the fullname is org.foo.X. If there is no enclosing namespace then the null namespace is used.
References to previously defined names are as in the latter two cases above: if they contain a dot they are a fullname, if they do not contain a dot, the namespace is the namespace of the enclosing definition.

Primitive type names have no namespace and their names may not be defined in any namespace.

A schema or protocol may not contain multiple definitions of a fullname. Further, a name must be defined before it is used ("before" in the depth-first, left-to-right traversal of the JSON parse tree, where the types attribute of a protocol is always deemed to come "before" the messages attribute.)

This leads me to believe the following assertions:

❌ Record fields should not contain duplicates
✅ Fields in different records can have the same name, only if they have a different namespace, and thus different fullname
✅ Namespaces are inherited from immediately enclosing definition
❌ Namespaces are not inherited from ancestors beyond parent

@davydog187
Copy link
Member Author

davydog187 commented Mar 2, 2022

It seems that apache/avro#1439 and apache/avro#1573 seem to clarify the rules around namespace resolution. I'll dig into that more

@davydog187
Copy link
Member Author

Asking for feedback here https://github.com/apache/avro/pull/1439/files#r818203665

@davydog187 davydog187 marked this pull request as ready for review March 3, 2022 14:40
@davydog187 davydog187 requested a review from a team as a code owner March 3, 2022 14:40
@davydog187
Copy link
Member Author

davydog187 commented Mar 3, 2022

Opening this for review while I get over the final hurdle, namespace propagation. @doomspork if you want to give some review please do. Note that once the tests are passing and credo is happy, I will likely merge so I can move onto removing ecto and the rest of the cleanup, which will be very noisy

{:ok, schema} -> schema
_ -> raise "Parsing schema failed"
def decode_schema!(schema) do
if is_binary(schema) and not Schema.Parser.primitive?(schema) do
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@doomspork this ended up being very elegant. Now you can pass anything to AvroEx.decode_schema!/1 and it will figure out if its json or not

@@ -305,13 +309,14 @@ defmodule AvroEx.Schema do
"Record<name=foo>"
"""
@spec type_name(schema_types()) :: String.t()
def type_name(%Primitive{type: nil}), do: "null"
def type_name(%Primitive{type: :null}), do: "null"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe for another ticket, but this function need a way to get the parent namespace and pass it down to full_name


defp extract_data({data, rest, {type, raw}}) do
if rest != %{} do
# TODO this violates the spec
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unfornuately I can't strictly validate here as the spec says that any additional fields should be treated as metadata. Maybe we can add an option in the future for strict parsing. I'll remove this validation for this PR and add a ticket for future work to add the option

put_in(context.names[name], schema)
end

defp aliases(%{aliases: aliases, namespace: namespace} = record, parent_namespace)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stole this from AvroEx.Schema. I think we will remove that function in a future PR as is only needed in this context

field(:qualified_names, {:array, :string}, default: [])
# TODO remove
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

metadata is valid

mix.lock Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@davydog187 davydog187 merged commit d7110e2 into master Mar 4, 2022
@davydog187 davydog187 deleted the new-parser branch March 4, 2022 22:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant