feat/1681 collects load job metrics and adds remote uri #1708
Conversation
✅ Deploy Preview for dlt-hub-docs canceled.
@@ -143,8 +143,8 @@ def create_load_id() -> str:

# folders to manage load jobs in a single load package
TJobState = Literal["new_jobs", "failed_jobs", "started_jobs", "completed_jobs"]
good!
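For context, a minimal sketch of how this Literal pins down the folder names; the job_folder helper is hypothetical and not part of the PR:

```python
from typing import Literal

# The folder names that manage load jobs inside a single load package
TJobState = Literal["new_jobs", "failed_jobs", "started_jobs", "completed_jobs"]

def job_folder(load_id: str, state: TJobState) -> str:
    # Hypothetical helper: a type checker rejects any string outside TJobState
    return f"{load_id}/{state}"

print(job_folder("1721234567.123", "started_jobs"))  # 1721234567.123/started_jobs
```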
@@ -392,6 +398,11 @@ def complete_jobs(
# create followup jobs
self.create_followup_jobs(load_id, state, job, schema)

# preserve metrics
metrics = job.metrics()
if metrics:
when does it happen that a job does not have metrics?
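For illustration, the guard from the diff in a self-contained sketch; the StubJob class and the assumption that metrics() may return None for a job that produced nothing are mine, not the PR's:

```python
from typing import Dict, Optional

class StubJob:
    # Stand-in for dlt's LoadJob; assumes metrics() may return None for a
    # job that produced nothing, which is what the guard below handles
    def __init__(self, job_id: str, metrics: Optional[dict] = None) -> None:
        self._job_id, self._metrics = job_id, metrics

    def job_id(self) -> str:
        return self._job_id

    def metrics(self) -> Optional[dict]:
        return self._metrics

preserved: Dict[str, dict] = {}
job = StubJob("items.1a2b3c.0.parquet", {"state": "completed"})

# preserve metrics, following the pattern in the diff above
metrics = job.metrics()
if metrics:
    preserved[job.job_id()] = metrics
```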
"""Job metrics aggregated by table""" | ||
|
||
|
||
class LoadJobMetrics(NamedTuple): |
I think it might be of interest to also track whether the data from this job was loaded to the staging destination or the staging dataset
It would also be cool to track whether this was a table chain followup job. With these three flags you could make any platform UI much more legible.
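A sketch of the three flags from these two comments as they might sit on a metrics tuple; every field name here is an assumption, none comes from the PR itself:

```python
from typing import NamedTuple

# Hypothetical fields illustrating the review suggestion above
class JobUIFlags(NamedTuple):
    loaded_to_staging_destination: bool  # data went to the staging destination
    loaded_to_staging_dataset: bool      # data went to the staging dataset
    is_table_chain_followup: bool        # job was created as a table chain followup
```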
Thinking a bit more about it, maybe a job_type would be interesting too in the context of displaying data to the user (see the sketch after this list). Types could be:

Single table jobs:
- copy (all jobs that just move files around)
- load (all jobs that move data into a database)

Table chain jobs:
- merge (merge jobs)
- upsert (upsert merge jobs)
- scd2 (scd2 merge)
- replace (replace via insert from staging)
- whatever we call this chunking update
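The taxonomy above as a Literal sketch; this is purely an illustration of the suggestion, not an existing dlt type:

```python
from typing import Literal, Union

# Single table jobs: move files around vs. load data into a database
TSingleTableJobType = Literal["copy", "load"]

# Table chain jobs; the "chunking update" job has no settled name yet,
# so it is left out of the sketch
TTableChainJobType = Literal["merge", "upsert", "scd2", "replace"]

TJobType = Union[TSingleTableJobType, TTableChainJobType]
```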
started_at: datetime.datetime
finished_at: datetime.datetime
state: Optional[str]
remote_uri: Optional[str]
I was thinking that LoadJobs could have a space for arbitrary data (defined as Dict[str, str]) for information such as the remote_uri. What comes to mind is that, for example, BigQuery jobs have an internal BigQuery job id which we could store here. Many SQL databases also have transaction_ids which could be saved in the traces, and the contract would be future-proof for other information we might come up with.
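A sketch of that idea; the slot itself and the example keys are illustrative only:

```python
from typing import Dict

# Free-form slot for destination-specific identifiers, as the comment
# above proposes; nothing here is from the PR
job_extra_info: Dict[str, str] = {
    "remote_uri": "s3://bucket/load_id/items.parquet",
    "bigquery_job_id": "job_1a2b3c",  # BigQuery's internal job id
    "transaction_id": "txn-42",       # transaction id from a SQL destination
}
```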
Cool PR! I think it is good, but I have some more ideas about what to track in the load metrics when I think about how to display trace data to the user in a useful way in some tool.
* collects basic load job metrics in LoadJob
* adds remote uri to filesystem copy jobs metrics
* adds job id to load package info
* adds table name to job metrics
* skips run step when serializing trace
* adds trace shape test with trace schema
* tests job file name too long
* docs running pipelines with the same name for different envs
* extracts step metrics in common, renames followup jobs
* fixes tests
* fixes tests
* tests delta filesystem for remote_uri
* adds exec_info to trace contract test
* tests remote_uri for filesystem copy
* fixes platform test
Description
fixes #1681
also changes the shape of the trace and puts a contract on it (in tests):
- trace__steps__step_info__load_packages__jobs
- run step removed (it is a copy of the last step in the pipeline) + corresponding columns in trace
- table_name in job_metrics
- trace__steps__load_info__job_metrics

also refactors code
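A hedged usage sketch of where the new metrics should surface, assuming they travel with the load step info as the description implies; the exact attribute layout may differ from the merged code:

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="metrics_demo", destination="duckdb")
load_info = pipeline.run([{"id": 1}], table_name="items")

# Per the description, job metrics now ride along in the trace; the layout
# assumed here (metrics keyed by load_id, each entry carrying a job_metrics
# mapping) is read off this PR and may change in later releases
for load_id, entries in load_info.metrics.items():
    for entry in entries:
        for m in entry["job_metrics"].values():
            print(m.table_name, m.started_at, m.finished_at, m.remote_uri)
```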