Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RecordSchemaValidator can resolve $ref schemas #19625

Merged
merged 12 commits into from
Nov 30, 2022

Conversation

edgao
Copy link
Contributor

@edgao edgao commented Nov 18, 2022

Wire up the RecordSchemaValidator with the new data types protocol typedefs. Specifically:

  • Dockerfile now copies in WellKnownTypes.yaml and converts it to JSON
  • JsonSchemaValidator correctly resolves the $ref (the changes are necessary because otherwise it would require file:WellKnownTypes.json, which breaks in Python schema validation)

Existing behavior should remain unchanged.

Example jsonl file -> dev-null sync: (note that it correctly detects the int field as being the wrong type)
Screen Shot 2022-11-18 at 2 22 59 PM

This was the schema: (I ran this locally and edited the connection record manually)

{
  "int": {
    "$ref": "WellKnownTypes.json#/definitions/Integer"
  },
  "str": {
    "$ref": "WellKnownTypes.json#/definitions/Boolean"
  }
}

Reading order:

  1. build.gradle + Dockerfile
  2. JsonSchemaValidator.java
  3. JsonSchemaValidatorTest

@octavia-squidington-iv octavia-squidington-iv added area/platform issues related to the platform area/worker Related to worker labels Nov 18, 2022
@edgao edgao temporarily deployed to more-secrets November 18, 2022 22:47 Inactive
@edgao edgao marked this pull request as ready for review November 18, 2022 22:48
@edgao edgao requested a review from gosusnp November 18, 2022 22:48
@edgao edgao changed the title Edgao/wellknowntypes in worker RecordSchemaValidator can resolve $ref schemas Nov 18, 2022
@edgao edgao temporarily deployed to more-secrets November 18, 2022 22:51 Inactive
@edgao edgao temporarily deployed to more-secrets November 18, 2022 23:01 Inactive
@edgao edgao temporarily deployed to more-secrets November 18, 2022 23:19 Inactive
@edgao edgao temporarily deployed to more-secrets November 18, 2022 23:25 Inactive
airbyte-workers/Dockerfile Outdated Show resolved Hide resolved
airbyte-workers/Dockerfile Outdated Show resolved Hide resolved
this.schemaValidatorsConfig = new SchemaValidatorsConfig();
// This URI just needs to point at any path in the same directory as /app/WellKnownTypes.json
// It's required for the JsonSchema#validate method to resolve $ref correctly.
this("file:///app/nonexistent_file.json");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this BASE_URI for? Could be good to have some explanation here, it's not obvious why we only want to expose this for tests given its current usage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added some comments - lmk if this is still unclear

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand correctly, we pass a value so that the validation can generate local references that we ignore anyway?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the references aren't ignored - see the new unit test where the validator does read the referenced file

@edgao edgao temporarily deployed to more-secrets November 23, 2022 01:10 Inactive
Copy link
Contributor

@gosusnp gosusnp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

Left a couple suggestions. I don't think the docker would matter a lot given this file shouldn't change a lot. In general, I think it's a better practice to keep the copy of build artifacts closer to the end for better caching.


static {
try {
DEFAULT_BASE_URI = new URI("file:///app/nonexistent_file.json");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not use file:///app/WellKnownTypes.json instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to make it clear that it doesn't need to be the exact file being referenced. E.g. if we later added a second file, we wouldn't need to update this URI.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the part that always feels a bit unusual is referencing a non existing file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm. what if we appended the nonexistent_file.json in the constructor? so callers would just need new JsonSchemaValidator(new URI("file:///app/"))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bump @gosusnp

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure that appending this in the constructor helps understanding that much. It probably boils down to why do we actually need to pass a full URI rather than just the actual base URI.
I'd say that it doesn't feel ideal, at the same time, it doesn't seem important enough to block this PR. The comments helps with the understanding.

airbyte-workers/Dockerfile Outdated Show resolved Hide resolved
this.schemaValidatorsConfig = new SchemaValidatorsConfig();
// This URI just needs to point at any path in the same directory as /app/WellKnownTypes.json
// It's required for the JsonSchema#validate method to resolve $ref correctly.
this("file:///app/nonexistent_file.json");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if I understand correctly, we pass a value so that the validation can generate local references that we ignore anyway?

@edgao edgao temporarily deployed to more-secrets November 23, 2022 16:41 Inactive
@edgao edgao temporarily deployed to more-secrets November 23, 2022 17:04 Inactive
@edgao edgao merged commit 56bfdaa into master Nov 30, 2022
@edgao edgao deleted the edgao/wellknowntypes_in_worker branch November 30, 2022 15:59
gosusnp added a commit that referenced this pull request Dec 6, 2022
gosusnp added a commit that referenced this pull request Dec 6, 2022
edgao added a commit that referenced this pull request Dec 13, 2022
edgao added a commit that referenced this pull request Dec 15, 2022
…start EC2 runners (#20439)

* Revert "Revert "RecordSchemaValidator can resolve $ref schemas (#19625)" (#20113)"

This reverts commit 86f61a5.

* just hardcode build?

* sshable instance

* pass arg for release oss only

* also skip octavia + create PR

* update ec2 runner

* revert CI test changes

* whoops

* whoopswhoops
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/platform issues related to the platform area/worker Related to worker
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants