-
Notifications
You must be signed in to change notification settings - Fork 2.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(ingest): Produce browse paths v2 on demand and with platform instance #8173
feat(ingest): Produce browse paths v2 on demand and with platform instance #8173
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall pretty clean
none of these comments are blocking, so we can merge and then fix in a follow-up
browse_path_processor, | ||
partial(auto_workunit_reporter, self.get_report()), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the original order made more sense
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we want to report these browse path workunits?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh are these run top to bottom? For some reason I thought the bottom was the innermost one. In that case this is fine
@@ -58,7 +66,7 @@ class PipelineConfig(ConfigModel): | |||
source: SourceConfig | |||
sink: DynamicTypedConfig | |||
transformers: Optional[List[DynamicTypedConfig]] | |||
flags: FlagsConfig = Field(default=FlagsConfig()) | |||
flags: FlagsConfig = Field(default=FlagsConfig(), hidden_from_docs=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the pipeline config docs aren't autogenerated anywhere so this is a no-op, but still nice to have
parent_urn = container_aspect.container | ||
containers_used_as_parent.add(parent_urn) | ||
paths[urn] = [ | ||
*paths.setdefault(parent_urn, []), # Guess parent has no parents |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think it'd be more clear from a code readability perspective to split this into two lines
also, do we need paths.setdefault here or can we just use paths.get(parent_urn, [])
. the mutation is throwing me off a bit
parent_path = paths.setdefault(parent_urn, [])
paths[urn] = [...]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this used to have an impact but looking at the code, doesn't seem like it anymore. I'll just replace with the .get(...)
if browse_path_aspect and browse_path_aspect.paths: | ||
legacy_path = [ | ||
BrowsePathEntryClass(id=p.strip()) | ||
for p in browse_path_aspect.paths[0].strip("/").split("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
slight nit - we call p.strip()
three times, so maybe we should have an inner generator that breaks those apart
or maybe a helper method for _split_legacy_browse_path_entry
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeahh I didn't like this much, but it's solved by upgrading to python 3.8 and using the walrus operator, which we will have to do quite soon. I can add a TODO to update once we upgrade though
""" | ||
|
||
# For telemetry, to see if our sources violate assumptions | ||
num_out_of_order = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_out_of_order = 0 | |
num_containers_out_of_order = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this name can be confusing because it's num_container_aspects_out_of_order
rather than num_container_entities_out_of_order
. I'll think about it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine too - I just wanted the word container in there
|
||
# For telemetry, to see if our sources violate assumptions | ||
num_out_of_order = 0 | ||
num_out_of_batch = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
num_out_of_batch = 0 | |
num_aspects_out_of_batch = 0 |
# Set for all containers and urns with a Container aspect | ||
# Used to construct container paths while iterating through stream | ||
# Assumes topological order of entities in stream | ||
paths: Dict[str, List[BrowsePathEntryClass]] = {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe note that this one does not contain platform instance details
Also updates MetadataWorkUnit.get_aspects_of_type to only return a single aspect (the last one) for code simplicity.
Checklist