
feat(ingest): json-schema - add json schema support for files and kafka schema registry #7361

Merged: 4 commits merged into datahub-project:master on Feb 19, 2023

Conversation

shirshanka (Contributor) commented Feb 17, 2023

Adds support for JSON schemas stored in:

  • File
  • Directory
  • Kafka Schema Registry

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • [x] Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the `ingestion` label (PR or Issue related to the ingestion of metadata) on Feb 17, 2023
field_path = field_path.expand_type(discriminated_type, schema)

for field_name, field_schema in schema.get("properties", {}).items():
required_field: bool = field_name in schema.get("required", [])
Collaborator:

nice default accessors
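The "default accessors" pattern being praised can be exercised standalone; a minimal sketch with a hypothetical schema dict:

```python
# Minimal sketch of the pattern: dict.get with a default tolerates schemas
# that omit "properties" or "required" entirely.
schema = {"type": "object", "properties": {"id": {"type": "string"}}}

for field_name, field_schema in schema.get("properties", {}).items():
    required_field = field_name in schema.get("required", [])

# A schema with no "properties" key simply yields no fields, no KeyError.
empty_fields = list({}.get("properties", {}).items())
```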

}
(union_category, union_category_schema) = [
(k, v) for k, v in union_category_map.items() if v
][0]
Collaborator:

this [0] index access is confusing me; mind explaining?
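For what it's worth, the `[0]` access picks the first category whose schema is truthy; a sketch (with hypothetical `union_category_map` contents) of an equivalent using `next()` that avoids building the intermediate list:

```python
# Hypothetical example data: only one category is expected to be populated.
union_category_map = {"oneOf": None, "anyOf": [{"type": "string"}], "allOf": None}

# Equivalent to `[(k, v) for k, v in ... if v][0]`, without the throwaway
# list; raises StopIteration (rather than IndexError) if nothing matches.
union_category, union_category_schema = next(
    (k, v) for k, v in union_category_map.items() if v
)
```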

raw_schema_string: Optional[str] = None,
) -> SchemaMetadata:
json_schema_as_string = raw_schema_string or json.dumps(json_schema)
md5_hash: str = md5(json_schema_as_string.encode()).hexdigest()
Collaborator:

nice!
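The fingerprinting step above can be run standalone (schema contents are hypothetical); note the hash is computed over the exact serialized string, so `raw_schema_string`, when present, wins over re-serialization:

```python
import json
from hashlib import md5
from typing import Optional

json_schema = {"type": "object", "properties": {"id": {"type": "string"}}}
raw_schema_string: Optional[str] = None

# Prefer the registry's raw string if available; otherwise re-serialize.
json_schema_as_string = raw_schema_string or json.dumps(json_schema)
md5_hash: str = md5(json_schema_as_string.encode()).hexdigest()
```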


schema_ref: SchemaReference
for schema_ref in schema.references:
ref_subject: str = schema_ref["subject"]
Collaborator:

assuming both "subject" and "version" are required fields?

all_schemas.extend(
self.get_schemas_from_confluent_ref_json(
reference_schema.schema,
name=schema_ref["name"],
Collaborator:

assuming "name" is a required field? (or maybe this method can take Nones?)
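If those keys turned out not to be guaranteed, a defensive variant (hypothetical data and helper; the real Confluent client may well guarantee them) could fail with a readable message instead of a bare KeyError:

```python
# Hypothetical reference dict, shaped like a Confluent schema reference.
schema_ref = {"name": "com.acme.Address", "subject": "address-value", "version": 3}

def require(ref: dict, key: str):
    # Surface a readable error instead of a bare KeyError.
    if key not in ref:
        raise ValueError(f"schema reference missing {key!r}: {ref!r}")
    return ref[key]

ref_subject = require(schema_ref, "subject")
ref_name = require(schema_ref, "name")
```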

@@ -228,6 +301,21 @@ def _get_schema_fields(
imported_schemas,
is_key_schema=is_key_schema,
)
elif schema.schema_type == "JSON":
Collaborator:

nitpick: Constants for the raw strings
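The nitpick sketched out, using the schema-type strings that appear in the diff (the constant names themselves are hypothetical):

```python
# Hypothetical module-level constants replacing raw string comparisons.
SCHEMA_TYPE_AVRO = "AVRO"
SCHEMA_TYPE_PROTOBUF = "PROTOBUF"
SCHEMA_TYPE_JSON = "JSON"

def is_json_schema(schema_type: str) -> bool:
    # One place to change if the registry ever renames the type string.
    return schema_type == SCHEMA_TYPE_JSON
```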

)
@capability(
SourceCapability.SCHEMA_METADATA,
"Schemas associated with each topic are extracted from the schema registry. Avro and Protobuf (certified), JSON (incubating). Schema references are supported.",
Collaborator:

amazing!


class JsonSchemaSourceConfig(StatefulIngestionConfigBase):
path: Union[FilePath, DirectoryPath, AnyHttpUrl] = Field(
description="Set this to a single file-path or a directory-path (for recursive traversal) or a remote url. e.g. https://json.schemastore.org/petstore-v1.0.json"
Collaborator:

Really nice that you can point this to a JSON Web URL

if not JsonSchemaTranslator._get_id_from_any_schema(schema_dict):
schema_dict["$id"] = str(v)
with tempfile.NamedTemporaryFile(mode="w", delete=False) as tmp_file:
tmp_file.write(json.dumps(schema_dict))
Collaborator:

Nit: any try/except wrapping here when dealing with the file read/write?

Collaborator:

try-except-nice-error-message

Contributor (author):

👍
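A sketch of the wrapped version the thread converges on (schema contents are hypothetical; the goal here is a readable error message, not recovery):

```python
import json
import tempfile

schema_dict = {"$id": "https://example.com/petstore.json", "type": "object"}

try:
    # delete=False so the path outlives the context manager, as in the PR.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".json", delete=False) as tmp_file:
        tmp_file.write(json.dumps(schema_dict))
        tmp_path = tmp_file.name
except OSError as e:
    raise RuntimeError(f"failed to write schema to a temp file: {e}") from e
```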

Comment on lines +165 to +166
if reference.startswith("#/"):
parts = reference[2:].split("/")
Collaborator:

nitpick: would like to see this tedious logic pulled into smaller well named functions

Contributor (author):

this is already a function
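For readers following along, the local-reference walk being discussed boils down to something like this (standalone sketch; it ignores JSON-pointer escaping of `~0`/`~1`, which a full implementation would handle):

```python
def resolve_local_ref(schema: dict, reference: str) -> dict:
    """Resolve a local fragment like '#/definitions/Address' against a schema."""
    if not reference.startswith("#/"):
        raise ValueError(f"not a local reference: {reference!r}")
    node = schema
    for part in reference[2:].split("/"):
        node = node[part]
    return node

schema = {"definitions": {"Address": {"type": "object"}}}
```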

base_path = dirname(str(path))
base_uri = "file://{}/".format(base_path)

with open(path) as schema_file:
Collaborator:

Thoughts on wrapping this open in try/except?

shirshanka (author) commented Feb 18, 2023:

an outer try/except per file should be enough here

if os.path.isdir(self.config.path):
for root, dirs, files in os.walk(self.config.path, topdown=False):
for file_name in [f for f in files if f.endswith(".json")]:
yield from self._load_one_file(
Collaborator:

Can we try-except-warn in case a single file fails to be loaded? Instead of tossing the connector run completely

Contributor (author):

👍
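The agreed-on shape, as a standalone sketch (`load_one_file` and `report_warning` are hypothetical stand-ins for the source's own methods):

```python
import os
from typing import Callable, Iterable, Iterator

def walk_schema_files(
    root_dir: str,
    load_one_file: Callable[[str], Iterable[dict]],
    report_warning: Callable[[str], None],
) -> Iterator[dict]:
    for root, _dirs, files in os.walk(root_dir, topdown=False):
        for file_name in (f for f in files if f.endswith(".json")):
            full_path = os.path.join(root, file_name)
            try:
                # One bad file becomes a warning, not a failed connector run.
                yield from load_one_file(full_path)
            except Exception as e:
                report_warning(f"failed to load {full_path}: {e}")
```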

)
else:
ref_loader = jsonref.jsonloader
browse_prefix = f"/{self.config.env.lower()}/{self.config.platform}"
Collaborator:

Nitpick: we are hoping to eradicate "env" - thoughts on keeping it out of this in favor of platform instance?

}
fields: List[SchemaField] = list(
JsonSchemaTranslator.get_fields_from_schema(malformed_schema)
)
Collaborator:

Awesome tests!

@@ -376,7 +376,7 @@ def test_kafka_ignore_warnings_on_schema_type(
schema_id="schema_id_2",
schema=Schema(
schema_str="{}",
-schema_type="JSON",
+schema_type="UNKNOWN_TYPE",
Collaborator:

This is to prevent schema parsing? Anything we'd want to do here to try out the schema parsing?

Contributor (author):

Previously, we were asserting that if the Confluent schema registry sent us back JSON schemas, we would fail; we have a config flag to control how the connector behaves when it encounters a schema type we don't handle. Now that JSON schemas actually work, this test fails, so I had to change it to pass in a schema type that we don't handle.

jjoyce0510 (Collaborator) left a comment:

Left some comments. I know you want to move on this but please take a look!

@@ -194,7 +194,7 @@ def compute_job_id(cls, platform: Optional[str]) -> JobId:
return JobId(f"{platform}_{job_name_suffix}" if platform else job_name_suffix)

def _init_job_id(self) -> JobId:
-platform: Optional[str] = getattr(self.source, "platform")
+platform: Optional[str] = getattr(self.source, "platform", "default")
Contributor (author):

@treff7es: wanted to bring this diff to your attention since it is related to stateful ingestion. Previously this handler expected the source class to have a platform member variable and failed hard if it did not.
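The behavioral difference in that one-line diff, as a standalone check (class names are hypothetical):

```python
class SourceWithPlatform:
    platform = "kafka"

class SourceWithoutPlatform:
    pass

# Two-argument getattr raises AttributeError when the attribute is absent;
# the three-argument form returns the fallback instead.
with_platform = getattr(SourceWithPlatform(), "platform", "default")
without_platform = getattr(SourceWithoutPlatform(), "platform", "default")
```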

@shirshanka shirshanka merged commit 07e4d06 into datahub-project:master Feb 19, 2023