feat/1681 collects load job metrics and adds remote uri #1708

rudolfix · 2024-08-19T14:56:58Z

Description

Also changes the shape of the trace and puts a contract on it (in tests):

* job_id added to trace__steps__step_info__load_packages__jobs
* run step removed (it is a copy of the last step in the pipeline), along with the corresponding columns in the trace
* added table_name in job_metrics
* added trace__steps__load_info__job_metrics

Also refactors code.

netlify · 2024-08-19T14:57:13Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`18a848e`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/66c6eb1780985700081137e4

sh-rp · 2024-08-23T08:36:30Z

dlt/common/storages/load_package.py

@@ -143,8 +143,8 @@ def create_load_id() -> str:


 # folders to manage load jobs in a single load package
-TJobState = Literal["new_jobs", "failed_jobs", "started_jobs", "completed_jobs"]


sh-rp · 2024-08-23T08:58:03Z

dlt/load/load.py

@@ -392,6 +398,11 @@ def complete_jobs(
                # create followup jobs
                self.create_followup_jobs(load_id, state, job, schema)

+                # preserve metrics
+                metrics = job.metrics()
+                if metrics:


when does it happen that a job does not have metrics?
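To make the guard in the hunk above concrete, here is a minimal sketch of collecting metrics only from jobs that return some. `SketchJob` and `collect_job_metrics` are hypothetical names, not dlt's classes, and the `job_id` and `table_name` fields are assumed from the PR description. The stand-in job returns `None` when it never started; that is purely an illustrative assumption and not an answer to the question above.

```python
import datetime
from typing import Dict, List, NamedTuple, Optional


class LoadJobMetrics(NamedTuple):
    # job_id and table_name are assumed from the PR description;
    # the remaining fields mirror the diff in dlt/common/metrics.py
    job_id: str
    table_name: str
    started_at: datetime.datetime
    finished_at: datetime.datetime
    state: Optional[str]
    remote_uri: Optional[str]


class SketchJob:
    """Hypothetical stand-in for a load job, not dlt's LoadJob."""

    def __init__(
        self, job_id: str, table_name: str, started_at: Optional[datetime.datetime]
    ) -> None:
        self.job_id = job_id
        self.table_name = table_name
        self.started_at = started_at

    def metrics(self) -> Optional[LoadJobMetrics]:
        # illustrative assumption only: a job that never started reports nothing
        if self.started_at is None:
            return None
        now = datetime.datetime.now(datetime.timezone.utc)
        return LoadJobMetrics(
            self.job_id, self.table_name, self.started_at, now, "completed", None
        )


def collect_job_metrics(jobs: List[SketchJob]) -> Dict[str, List[LoadJobMetrics]]:
    """Keeps metrics only for jobs that returned some, keyed by job id."""
    collected: Dict[str, List[LoadJobMetrics]] = {}
    for job in jobs:
        metrics = job.metrics()
        if metrics:
            collected.setdefault(metrics.job_id, []).append(metrics)
    return collected
```

With this shape, a job without metrics simply leaves no entry, so the caller never has to handle `None` later on.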

sh-rp · 2024-08-23T09:17:41Z

dlt/common/metrics.py

+    """Job metrics aggregated by table"""
+
+
+class LoadJobMetrics(NamedTuple):


I think it might be of interest to also track whether the data from this job was loaded to the staging destination or the staging dataset.

It would also be cool to track if this was a table chain followup job. With these three flags you could make the/any platform UI much more legible.

Thinking a bit more about it, maybe a job_type would be interesting too in the context of displaying data to the user. Types could be:

* single table jobs
  * copy (all jobs that just move files around)
  * load (all jobs that move data into a database)
* table chain jobs
  * merge (merge jobs)
  * upsert (upsert merge jobs)
  * scd2 (scd2 merge)
  * replace (replace via insert from staging)
  * whatever we call this chunking update
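A rough sketch of how the three flags and the proposed job_type could be typed. Everything here is hypothetical: `JobDisplayInfo`, the two `Literal` aliases, and the flag names are made up for illustration and are not part of this PR. The unnamed chunking update type is left out because it has no name yet.

```python
from typing import Literal, NamedTuple, get_args

# hypothetical type names following the grouping proposed above
TSingleTableJobType = Literal["copy", "load"]
TTableChainJobType = Literal["merge", "upsert", "scd2", "replace"]


class JobDisplayInfo(NamedTuple):
    """Hypothetical extra fields a UI could use to describe a job."""

    job_type: str
    is_table_chain_followup: bool
    loaded_to_staging_destination: bool
    loaded_to_staging_dataset: bool


def is_table_chain_job(job_type: str) -> bool:
    """True if job_type belongs to the table chain group."""
    return job_type in get_args(TTableChainJobType)


def is_single_table_job(job_type: str) -> bool:
    """True if job_type belongs to the single table group."""
    return job_type in get_args(TSingleTableJobType)
```

For example, `JobDisplayInfo("merge", True, False, True)` would describe a merge followup job whose data went through the staging dataset.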

sh-rp · 2024-08-23T09:23:24Z

dlt/common/metrics.py

+    started_at: datetime.datetime
+    finished_at: datetime.datetime
+    state: Optional[str]
+    remote_uri: Optional[str]


I was thinking that LoadJobs could have a space for arbitrary data (defined as Dict[str, str]) for information such as the remote_uri. For example, BigQuery jobs have an internal BigQuery job ID that we could store here. Many SQL databases also have transaction IDs, which could be saved in the traces, and the contract would be future-proof for other information we might come up with.
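A minimal sketch of that idea, assuming the four fields from the diff above plus one hypothetical `extras` field. `LoadJobMetricsWithExtras`, `with_extra`, and the `bigquery_job_id` key are illustrative names only; nothing like them exists in this PR.

```python
import datetime
from typing import Dict, NamedTuple, Optional


class LoadJobMetricsWithExtras(NamedTuple):
    started_at: datetime.datetime
    finished_at: datetime.datetime
    state: Optional[str]
    remote_uri: Optional[str]
    # hypothetical free-form field for destination-specific identifiers
    extras: Dict[str, str]


def with_extra(
    metrics: LoadJobMetricsWithExtras, key: str, value: str
) -> LoadJobMetricsWithExtras:
    """Returns a copy of metrics with one more entry in extras."""
    return metrics._replace(extras={**metrics.extras, key: value})
```

Because every value is a string, the trace contract stays stable: new identifiers only add keys inside `extras` and never add columns to the metrics tuple itself.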

sh-rp

Cool PR! I think it is good, but I have some more ideas of what to track in the load metrics when I think about how to display trace data in a useful way to the user in some tool.

rudolfix · 2024-08-25T21:06:52Z

@sh-rp we follow up here: dlt-hub/dlt-plus#40 and here: #1744

* collects basic load job metrics in LoadJob
* adds remote uri to filesystem copy jobs metrics
* adds job id to load package info
* adds table name to job metrics
* skips run step when serializing trace
* adds trace shape test with trace schema
* tests job file name too long
* docs running pipelines with the same name for different envs
* extracts step metrics in common, renames followup jobs
* fixes tests
* fixes tests
* tests delta filesystem for remote_uri
* adds exec_info to trace contract test
* tests remote_uri for filesystem copy
* fixes platform test

rudolfix added 9 commits August 19, 2024 15:06

collects basic load job metrics in LoadJob

7191af8

adds remote uri to filesystem copy jobs metrics

8079bf3

adds job id to load package info

de3edc7

adds table name to job metrics

20142fb

skips run step when serializing trace

2226653

adds trace shape test with trace schema

b2f75d7

tests job file name too long

537efb8

docs running pipelines with the same name for different envs

3b4c06d

extracts step metrics in common, renames followup jobs

359c09e

rudolfix requested a review from sh-rp August 19, 2024 14:56

rudolfix self-assigned this Aug 19, 2024

rudolfix marked this pull request as draft August 19, 2024 14:57

rudolfix added 5 commits August 20, 2024 00:13

fixes tests

bb6424b

fixes tests

0fcb34b

tests delta filesystem for remote_uri

6c4e0c2

adds exec_info to trace contract test

3365bb1

tests remote_uri for filesystem copy

8e7064e

rudolfix marked this pull request as ready for review August 21, 2024 19:48

fixes platform test

18a848e

sh-rp reviewed Aug 23, 2024

sh-rp approved these changes Aug 23, 2024

rudolfix merged commit 011d7ff into devel Aug 25, 2024
56 checks passed

rudolfix deleted the feat/1681-collects-load-job-metrics branch August 25, 2024 21:07

rudolfix mentioned this pull request Sep 15, 2024

bumps to 1.0.0 + docs cleanup #1809

Merged
