Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(ingest): stateful redundant run skip handler #8467

Merged
merged 33 commits into from
Aug 28, 2023

Conversation

mayurinehate
Copy link
Collaborator

@mayurinehate mayurinehate commented Jul 20, 2023

Summary of changes -

  1. Add separate redundant run timestamps for lineage and usage. Renamed stateful configs
    store_last_profiling_timestamps → enable_stateful_profiling
    store_last_usage_extraction_timestamp → enable_stateful_usage_ingestion
    store_last_lineage_extraction_timestamp → enable_stateful_lineage_ingestion
  2. Fix the redundant run time check issues described here. Current run is skipped only if past run time window already included current run time window. Current run time window is reduced if the past run time window already covered some part of it.
  3. Do not commit usage/lineage state in case of failures. - track status of extraction - especially(and currently only) failure status to determine if current run was successful.
  4. Added stateful ingestion for snowflake lineage.
  5. Refractored IngestionStageReport to make it available to use in other sources. Added this report to snowflake and redshift (Not directly related to the main theme of PR).
  6. Added unit tests for redundant run skip handler.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jul 20, 2023
@hsheth2
Copy link
Collaborator

hsheth2 commented Jul 20, 2023

@mayurinehate looks like we're using X | Y syntax for typing that isn't supported with python 3.7 yet

@mayurinehate
Copy link
Collaborator Author

@mayurinehate looks like we're using X | Y syntax for typing that isn't supported with python 3.7 yet

Oops. fixed. IDE autocomplete got me.

@mayurinehate
Copy link
Collaborator Author

mayurinehate commented Aug 22, 2023

@asikowitz I have addressed the review comments. Can you please take another pass at review ?

Changes:

  1. report config specified start time and end time and final start time, end time used for usage, lineage
  2. refractor around update_time_window / get_time_window
  3. renamed extractor variable lineage_start_time/lineage_end_time and usage_start_time/usage_end_time to start_time/end_time
  4. fixed is_current_run_succeessful check and moved it inside update_state, added tests for this
  5. refractored redundant run checks should_skip_this_run and suggest_run_time_window to improve readability by using separate TimeWindow class.
  6. removed forced bucket start time alignment for start_time if absolute start_time is specified in config.

@@ -213,19 +213,6 @@ def ensure_top_n_queries_is_not_too_big(cls, v: int) -> int:
)
return v

@pydantic.validator("start_time")
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is removed to allow user to specify exact absolute start_time if necessary for some reason. In case of scheduled ingestions (default/relative start time), aligning start_time with bucket start time is already taken care of elsewhere.

Copy link
Collaborator

@asikowitz asikowitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like how you've refactored around TimeWindow. Have a couple nits here and there but I think this is good to go. How do you think we should go about testing this?

Comment on lines 1105 to 1106
self.usage_start_time,
self.usage_end_time,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where are you doing this?

metadata-ingestion/src/datahub/utilities/time.py Outdated Show resolved Hide resolved
metadata-ingestion/src/datahub/utilities/time.py Outdated Show resolved Hide resolved
@hsheth2 hsheth2 added merge-pending-ci A PR that has passed review and should be merged once CI is green. and removed merge-pending-ci A PR that has passed review and should be merged once CI is green. labels Aug 22, 2023
@mayurinehate
Copy link
Collaborator Author

For testing -

  1. Cases for adjustments to expand/reduce time window are well covered in unit tests
  2. Case for failed extraction runs is also covered in unit tests

I've tested this manually against real Snowflake and BigQuery instance for multiple cases to make sure start, end times are correctly picked up. About actual workunits there was an issue when emitting zero usage aspects for configured time window and I've fixed it.

As a follow up - we can add integration tests for snowflake, redshift, bigquery for scheduled ingestion cases

  1. ingestion repeated on same day as scheduled ingestion
  2. ingestion repeated on next day
    for both lineage and usage to make sure generated mcps are as expected for common scenarios.

@@ -36,7 +36,7 @@ def right_intersects(self, other: "TimeWindow") -> bool:

def starts_after(self, other: "TimeWindow") -> bool:
"""Whether current window starts after other window ends"""
return other.start_time < other.end_time <= self.start_time
return other.start_time < other.end_time < self.start_time
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meant to say that this should be other.start_time <= other.endtime like the change to contains

@asikowitz asikowitz added the merge-pending-ci A PR that has passed review and should be merged once CI is green. label Aug 23, 2023
@anshbansal anshbansal merged commit cc94ffb into datahub-project:master Aug 28, 2023
51 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata merge-pending-ci A PR that has passed review and should be merged once CI is green.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants