feat(ingest): Produce browse paths v2 on demand and with platform instance #8173

Merged

asikowitz merged 10 commits into datahub-project:master from asikowitz:browse-path-updates

Jun 9, 2023

Collaborator

asikowitz commented Jun 5, 2023

Also updates MetadataWorkUnit.get_aspects_of_type to only return a single aspect (the last one) for code simplicity.

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

asikowitz added 2 commits

June 5, 2023 14:59


          feat(ingest): Produce browse paths v2 on demand and with platform instance

09ee517


          add platform instance test

ab20401

asikowitz requested review from treff7es and hsheth2

June 5, 2023 22:05

github-actions bot added the ingestion label


          lint

24aa6a6

vercel bot had a problem deploying to Preview

June 5, 2023 22:45

Failure

asikowitz requested a review from chriscollins3456

June 6, 2023 16:50


          Merge branch 'master' into browse-path-updates

79d726e

vercel bot deployed to Preview

June 6, 2023 21:45


          simplify telemetry

da092cb

vercel bot deployed to Preview

June 7, 2023 09:17

asikowitz added release-0.10.4 and removed release-0.10.4 labels


          add if check

6c519ce

vercel bot deployed to Preview

June 7, 2023 18:39


          dry run; hide flags from docs

a14fc5b

vercel bot deployed to Preview

June 7, 2023 20:54


          lint

fb1fee7

vercel bot deployed to Preview

June 7, 2023 21:21


          Merge branch 'master' into browse-path-updates

1afd8fe

vercel bot deployed to Preview

June 8, 2023 18:41


          Merge branch 'master' into browse-path-updates

vercel bot deployed to Preview

June 9, 2023 11:07

pedro93 approved these changes

hsheth2 approved these changes

Collaborator

hsheth2 left a comment

overall pretty clean

none of these comments are blocking, so we can merge and then fix in a follow-up

metadata-ingestion/src/datahub/ingestion/api/source.py

		browse_path_processor,
		partial(auto_workunit_reporter, self.get_report()),

Collaborator

hsheth2 Jun 7, 2023

i think the original order made more sense

Collaborator Author

asikowitz Jun 9, 2023

Don't we want to report these browse path workunits?

Collaborator

hsheth2 Jun 9, 2023

Oh are these run top to bottom? For some reason I thought the bottom was the innermost one. In that case this is fine
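
The ordering question in this thread can be illustrated with a minimal sketch. It assumes, as the thread concludes, that processors run top to bottom, each one consuming the output of the processor listed before it. The names `apply_processors`, `add_browse_paths`, and `report` are hypothetical stand-ins, not DataHub's functions.

```python
from functools import partial
from typing import Callable, Iterable, List

Processor = Callable[[Iterable[str]], Iterable[str]]


def apply_processors(
    processors: List[Processor], stream: Iterable[str]
) -> Iterable[str]:
    # Assumption: processors are applied in list order, so each one
    # wraps the stream produced by the processor listed before it.
    for processor in processors:
        stream = processor(stream)
    return stream


def add_browse_paths(stream: Iterable[str]) -> Iterable[str]:
    # Hypothetical stand-in: emits one extra item per incoming work unit.
    for wu in stream:
        yield wu
        yield f"{wu}-browsePathsV2"


def report(seen: List[str], stream: Iterable[str]) -> Iterable[str]:
    # Hypothetical stand-in for a reporter: records every item passing through.
    for wu in stream:
        seen.append(wu)
        yield wu


seen: List[str] = []
result = list(
    apply_processors([add_browse_paths, partial(report, seen)], ["wu1"])
)
```

With the reporter listed after the browse path processor, `seen` also contains the generated browse path items. With the two swapped, the reporter would only record the original work units.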

metadata-ingestion/src/datahub/ingestion/run/pipeline_config.py

@@ -58,7 +66,7 @@ class PipelineConfig(ConfigModel):
                   source: SourceConfig
                   sink: DynamicTypedConfig
                   transformers: Optional[List[DynamicTypedConfig]]
-                  flags: FlagsConfig = Field(default=FlagsConfig())
+                  flags: FlagsConfig = Field(default=FlagsConfig(), hidden_from_docs=True)

Collaborator

hsheth2 Jun 9, 2023

the pipeline config docs aren't autogenerated anywhere so this is a no-op, but still nice to have

metadata-ingestion/src/datahub/ingestion/api/source_helpers.py

+                              parent_urn = container_aspect.container
+                              containers_used_as_parent.add(parent_urn)
+                              paths[urn] = [
+                                  *paths.setdefault(parent_urn, []),  # Guess parent has no parents

Collaborator

hsheth2 Jun 9, 2023

i think it'd be more clear from a code readability perspective to split this into two lines

also, do we need paths.setdefault here, or can we just use paths.get(parent_urn, [])? the mutation is throwing me off a bit

parent_path = paths.setdefault(parent_urn, [])
paths[urn] = [...]

Collaborator Author

asikowitz Jun 9, 2023

I think this used to have an impact but looking at the code, doesn't seem like it anymore. I'll just replace with the .get(...)
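
For readers following the `setdefault` versus `get` point, here is a small standalone illustration of the difference in plain Python. The `"parent"` key is a placeholder, not a value from the PR.

```python
from typing import Dict, List

paths: Dict[str, List[str]] = {}

# setdefault inserts the default under the missing key and returns it,
# so the lookup itself mutates the dict.
parent_path = paths.setdefault("parent", [])
keys_after_setdefault = list(paths)

paths.clear()

# get returns the default without touching the dict.
parent_path = paths.get("parent", [])
keys_after_get = list(paths)
```

After the `setdefault` call the dict holds an empty list under `"parent"`; after the `get` call it is still empty.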

metadata-ingestion/src/datahub/ingestion/api/source_helpers.py

+                          if browse_path_aspect and browse_path_aspect.paths:
+                              legacy_path = [
+                                  BrowsePathEntryClass(id=p.strip())
+                                  for p in browse_path_aspect.paths[0].strip("/").split("/")

Collaborator

hsheth2 Jun 9, 2023

slight nit - we call p.strip() three times, so maybe we should have an inner generator that breaks those apart

or maybe a helper method for _split_legacy_browse_path_entry?

Collaborator Author

asikowitz Jun 9, 2023

Yeahh I didn't like this much, but it's solved by upgrading to python 3.8 and using the walrus operator, which we will have to do quite soon. I can add a TODO to update once we upgrade though
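
A rough sketch of what the walrus-operator version might look like once Python 3.8 is available. The helper name `split_legacy_browse_path` and the "drop empty segments" filter are assumptions for illustration; the full filter from the PR is not shown in the excerpt above, and the real code wraps each segment in `BrowsePathEntryClass`.

```python
from typing import List


def split_legacy_browse_path(path: str) -> List[str]:
    # Hypothetical helper: strip each segment exactly once via the
    # walrus operator (Python 3.8+) and drop segments that end up empty.
    return [
        stripped
        for p in path.strip("/").split("/")
        if (stripped := p.strip())
    ]
```

For example, `split_legacy_browse_path("/prod/ my_db /schema/")` yields the three cleaned segments `prod`, `my_db`, and `schema`.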

metadata-ingestion/src/datahub/ingestion/api/source_helpers.py

+                  """
+                  # For telemetry, to see if our sources violate assumptions
+                  num_out_of_order = 0

Collaborator

hsheth2 Jun 9, 2023

Suggested change

      
-                  num_out_of_order = 0
+                  num_containers_out_of_order = 0

Collaborator Author

asikowitz Jun 9, 2023

I think this name can be confusing because it's num_container_aspects_out_of_order rather than num_container_entities_out_of_order. I'll think about it

Collaborator

hsheth2 Jun 9, 2023

That's fine too - I just wanted the word container in there

metadata-ingestion/src/datahub/ingestion/api/source_helpers.py

+                  # For telemetry, to see if our sources violate assumptions
+                  num_out_of_order = 0
+                  num_out_of_batch = 0

Collaborator

hsheth2 Jun 9, 2023

Suggested change

      
-                  num_out_of_batch = 0
+                  num_aspects_out_of_batch = 0

metadata-ingestion/src/datahub/ingestion/api/source_helpers.py

+                  # Set for all containers and urns with a Container aspect
+                  # Used to construct container paths while iterating through stream
+                  # Assumes topological order of entities in stream
+                  paths: Dict[str, List[BrowsePathEntryClass]] = {}

Collaborator

hsheth2 Jun 9, 2023

maybe note that this one does not contain platform instance details
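
To make the topological-order assumption concrete, here is a simplified sketch of building container paths while iterating through a stream. It uses plain strings instead of `BrowsePathEntryClass`, leaves out platform instance details, and the function name `build_container_paths` and the `(urn, parent_urn)` stream shape are assumptions for illustration, not the PR's code.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_container_paths(
    stream: Iterable[Tuple[str, str]],
) -> Tuple[Dict[str, List[str]], int]:
    """Each stream item is (urn, urn of its parent container)."""
    paths: Dict[str, List[str]] = {}
    containers_used_as_parent: Set[str] = set()
    num_out_of_order = 0  # telemetry: parent seen after its child

    for urn, parent_urn in stream:
        if urn in containers_used_as_parent:
            # This urn was already used as a parent before its own
            # container aspect arrived, so the stream was not in
            # topological order.
            num_out_of_order += 1
        containers_used_as_parent.add(parent_urn)
        # If the parent has no recorded path, guess it has no parents.
        paths[urn] = [*paths.get(parent_urn, []), parent_urn]

    return paths, num_out_of_order
```

When parents arrive first, as in `[("schema", "db"), ("table", "schema")]`, the table gets the full path `["db", "schema"]`. When the order is reversed, the table's path is truncated to `["schema"]` and the out-of-order counter records the violation.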

asikowitz merged commit f2c66fd into datahub-project:master

asikowitz deleted the browse-path-updates branch

June 9, 2023 17:35
