Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add meta schema: a JSON Schema which matches recap schemas #346

Merged
merged 12 commits into from
Aug 3, 2023
Merged

Add meta schema: a JSON Schema which matches recap schemas #346

merged 12 commits into from
Aug 3, 2023

Conversation

mjperrone
Copy link
Contributor

@mjperrone mjperrone commented Aug 2, 2023

What

This adds a JSON Schema describing valid recap schemas.

Why

There is a robust ecosystem around JSON Schema that provides tools like validation, example generation, code generation, etc out of the schema definitions. Defining a JSON Schema definition for Recap will allow Recap and its users to more easily hook into that ecosystem.

Closes #342

How

I manually created the JSON Schema file by reading closely the Recap Spec.

Known limitations

As mentioned in the issue, I don't know of a way to require that the alias strings are referencing defined types elsewhere in the document, so some invalid documents will pass. No documents that are valid should fail.

Validation

In order to help iterate on the JSON Schema definition (which I expect isn't perfect), I test up some unit tests. I define a few valid recap schema documents and use jsonschema to test the documents against the schema definition.

Potential future work

  • Define a meta schema in recap (maybe using recap converter from the json schema)
  • flesh out more positive or negative examples
  • use the JSON Schema definition for something like editor integrated validation

{
"name": "string_field",
"doc": "doc",
"type": "invalidtype",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I included this in the initial draft PR to show what the failing test UX is like:

E           jsonschema.exceptions.ValidationError: {'name': 'string_field', 'doc': 'doc', 'type': 'invalidtype', 'bytes': 255} is not valid under any of the given schemas
E           
E           Failed validating 'oneOf' in schema['items']:
E               {'oneOf': [{'allOf': [{'$ref': '#/$defs/RecapNull'},
E                                     {'$ref': '#/$defs/RecapCommonFields'}]},
E                          {'allOf': [{'$ref': '#/$defs/RecapBool'},
E                                     {'$ref': '#/$defs/RecapCommonFields'}]},
E                          {'allOf': [{'$ref': '#/$defs/RecapInt'},
E                                     {'$ref': '#/$defs/RecapCommonFields'}]},
E                          {'allOf': [{'$ref': '#/$defs/RecapFloat'},
E                                     {'$ref': '#/$defs/RecapCommonFields'}]},
E                          {'allOf': [{'$ref': '#/$defs/RecapString'},
E                                     {'$ref': '#/$defs/RecapCommonFields'}]},
E                          {'allOf': [{'$ref': '#/$defs/RecapBytes'},
E                                     {'$ref': '#/$defs/RecapCommonFields'}]},
E                          {'allOf': [{'$ref': '#/$defs/RecapList'},
E                                     {'$ref': '#/$defs/RecapCommonFields'}]},
E                          {'allOf': [{'$ref': '#/$defs/RecapMap'},
E                                     {'$ref': '#/$defs/RecapCommonFields'}]},
E                          {'allOf': [{'$ref': '#/$defs/RecapStruct'},
E                                     {'$ref': '#/$defs/RecapCommonFields'}]},
E                          {'allOf': [{'$ref': '#/$defs/RecapEnum'},
E                                     {'$ref': '#/$defs/RecapCommonFields'}]},
E                          {'allOf': [{'$ref': '#/$defs/RecapUnion'},
E                                     {'$ref': '#/$defs/RecapCommonFields'}]}]}
E           
E           On instance[0]:
E               {'bytes': 255,
E                'doc': 'doc',
E                'name': 'string_field',
E                'type': 'invalidtype'}

.venv/lib/python3.10/site-packages/jsonschema/validators.py:1302: ValidationError
==================================================================== short test summary info =====================================================================
FAILED tests/unit/meta_schema/test_json_schema.py::test_valid_schemas[tests/spec/invalid_field_type.json] - jsonschema.exceptions.ValidationError: {'name': 'string_field', 'doc': 'doc', 'type': 'invalidtype', 'bytes': 255} is not valid under any of the given schemas

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Can you add functionality to let us ignore expected failed tests (i.e. xfail)? I think it complicates the logic in test_json_schema.py, but it'll be useful as we build out and validate the harness, I think. If it's a pain, though, we can hack it (e.g. .json.failing filename)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the harness to find .unimplemented.invalid.json files and expect that they will throw as invalid schemas, but mark that test as expected to fail while unimplemented.

@@ -0,0 +1,377 @@
{
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this JSON Schema is reviewed and approved in this PR, I will remove this from this PR and add it to recap-website/specs/type/0.2.0.json, then switch from a file reference to a network call.

I'm including it in this PR draft to make it easier to iterate on the spec and tests in parallel.

tests/0.2.0.json Outdated
"$ref": "#/$defs/RecapNull"
},
{
"$ref": "#/$defs/RecapCommonFields"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to put the common fields alias and doc in its own def instead of repeating them in each of the base types (like RecapNull here).

I also chose to put the allOf in this RecapType oneOf union instead of inside each of the base types. I did this both to mirror the DRY way the existing spec introduces these fields, and to emphasize the unique fields when looking at the base types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up repeating myself so that the error messages are easier to read when validation fails.

tests/0.2.0.json Outdated
Comment on lines 177 to 178
"minimum": 1,
"maximum": 64
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about the correct ranges for some of these integer values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Integer bits field is a 32-bit signed int, so min is 1 and max is 2_147_483_647.

Same for bits in float.

bytes for string and bytes is 64-bit signed. The attributes section in the spec has some info there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated max for these fields

}
}
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I have represented Logical types in this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No prob.

I think there are two things to do.

  1. Allow "logical" to be defined as a string attribute in RecapCommonFields
  2. Add defs for built-in logical types (the ones defined in the spec)

I think (1) is worth doing in this PR. I think (2) should be put in a follow-on GH issue. I think it requires a bit more thought, since other non-builtin logical types might exist, and those wouldn't get validated (beyond just checking that logical is set).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done 1 and made a note to make an issue for 2 after this PR is finished

@criccomini criccomini marked this pull request as ready for review August 2, 2023 16:02
Copy link
Contributor

@criccomini criccomini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good. I left a couple comments here and there, but I agree with your decisions.

One thing I didn't catch that I think is a common gotcha: disallowing alias to be set when type reference an alias. See the "But alias inheritance is not allowed" section of the spec for an example.

{
"name": "string_field",
"doc": "doc",
"type": "invalidtype",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Can you add functionality to let us ignore expected failed tests (i.e. xfail)? I think it complicates the logic in test_json_schema.py, but it'll be useful as we build out and validate the harness, I think. If it's a pain, though, we can hack it (e.g. .json.failing filename)

@criccomini
Copy link
Contributor

One other thought:

I could see an argument for being more explicit about the recap types (removing allOf/oneOf), even if it means repeating yourself. The argument for this would be that it makes the JSON schema way more readable to humans.

I'm indifferent on that argument at this point, but I could see the spec being a bit confusing to humans that are trying to read (vs. validators like python-jsonschema). I was a bit disoriented when I looked at it; I didn't actually understand until you mentioned the allOf/oneOf in the comment.

@criccomini
Copy link
Contributor

Re: the black failure, try running pdm style to get things cleaned up.

@mjperrone
Copy link
Contributor Author

One thing I didn't catch that I think is a common gotcha: disallowing alias to be set when type reference an alias.

Did this by not repeating the alias definition for RecapAliasReference

I could see an argument for being more explicit about the recap types (removing allOf/oneOf), even if it means repeating yourself. The argument for this would be that it makes the JSON schema way more readable to humans.

I ended up doing as I was iterating on this because the validation errors were harder to understand.

@mjperrone mjperrone requested a review from criccomini August 2, 2023 18:42
@adrianisk
Copy link
Contributor

This is awesome 🚀

Copy link
Contributor

@criccomini criccomini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 This looks great.

Only request:

  1. Can you move unit/meta_schema/test_json_schema.py to tests/spec
  2. And then move spec/*.json to a subdir like schemas or something?
  3. And then update the .github/workflows/ci.yaml to have an extra job called spec-tests that runs pdm run pytest tests/spec -vv

I want to keep the spec enforcement separate from the standard unit tests.

@mjperrone
Copy link
Contributor Author

  1. Can you move unit/meta_schema/test_json_schema.py to tests/spec
  2. And then move spec/*.json to a subdir like schemas or something?
  3. And then update the .github/workflows/ci.yaml to have an extra job called spec-tests that runs pdm run pytest tests/spec -vv
  1. done
  2. moved them to spec/{in,}valid, so removed the .invalid. file sub-extension
  3. done. I just followed the pattern of the existing jobs, but it's doing checkout+python+pdm installs for each step. That could be cached so it only has to happen once per build, or really once per new dependencies

I also realized that these tests should test that recap is validating the schemas the same way as the JSON Schema, so I added the recap validation to each.

This made me realize that there's no reason we shouldn't take every positive and negative validation example in the recap test suites and evaluate that against the jsonschema validation as well. Given the current structure of the tests, how feasible does that sound?

@mjperrone mjperrone requested a review from criccomini August 2, 2023 22:59
@criccomini
Copy link
Contributor

This made me realize that there's no reason we shouldn't take every positive and negative validation example in the recap test suites and evaluate that against the jsonschema validation as well. Given the current structure of the tests, how feasible does that sound?

Ahh, yes I've been thinking about this.

What I was thinking was to replace the proto/avro/json schema/recap CST tests with this suite. You've already begun to define Recap CSTs (in JSON). We should be able to define matching .proto, .avsc, and .schema.json files that we expect to get when we call types.py :: .from_dict followed by *Converter.from_recap().

We could then go the other way as well, from proto/avsc/json schema back to Recap's CST to verify bidirectional conversion is as expected.

This doesn't let us verify every detail we might want to verify at the AST level, so I think we'll still want to have some unit tests for that part. It should get us VERY comfortable with Recap's behavior, though.

WDYT? Not for this PR, but more GH issues to open if you're into it.

@criccomini criccomini merged commit d53150e into gabledata:main Aug 3, 2023
@criccomini
Copy link
Contributor

💥 Merged! Thanks!!

Follow-on things I found:

  1. Migrating .json spec to recap-website
  2. Update to pull spec from the recap-website
  3. GH issue for logical type spec tests
  4. (Maybe?) GH issue to do validation when translating to/from recap CST for proto/avro/JSON schema.

@mjperrone
Copy link
Contributor Author

💥 Merged! Thanks!!

Follow-on things I found:

  1. Migrating .json spec to recap-website
  2. Update to pull spec from the recap-website
  3. GH issue for logical type spec tests
  4. (Maybe?) GH issue to do validation when translating to/from recap CST for proto/avro/JSON schema.
  1. Add spec test for built-in logical types #348

Another thought: this test suite could be extended to read specs defined in YAML or TOML. There ¿might? be some situations where those languages handle values differently like null

@criccomini
Copy link
Contributor

Yup, totally. I think it's worth defining in YAML, TOML, and JSON, and validating the CST using the python-jsonschema lib.

One gotcha with TOML, in particular, is that it doesn't have a null AFAIK. Makes null defaults tricky. At least we added optional for the common case, which will work with TOML.

@criccomini
Copy link
Contributor

Opened these follow-on issues:

#349
#350
#351

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add meta schema describing valid recap schemas
3 participants