Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat/1681 collects load job metrics and adds remote uri #1708

Merged
merged 15 commits into from
Aug 25, 2024

Conversation

rudolfix
Copy link
Collaborator

@rudolfix rudolfix commented Aug 19, 2024

Description

fixes #1681

also changes shape of the trace and puts contract on it (in tests)

  • job_id added to trace__steps__step_info__load_packages__jobs
  • run step removed (is a copy of last step in the pipeline) + corresponding columns in trace
  • added table_name in job_metrics
  • added trace__steps__load_info__job_metrics

also refactors code

@rudolfix rudolfix requested a review from sh-rp August 19, 2024 14:56
@rudolfix rudolfix self-assigned this Aug 19, 2024
Copy link

netlify bot commented Aug 19, 2024

Deploy Preview for dlt-hub-docs canceled.

Name Link
🔨 Latest commit 18a848e
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66c6eb1780985700081137e4

@rudolfix rudolfix marked this pull request as draft August 19, 2024 14:57
@rudolfix rudolfix marked this pull request as ready for review August 21, 2024 19:48
@@ -143,8 +143,8 @@ def create_load_id() -> str:


# folders to manage load jobs in a single load package
TJobState = Literal["new_jobs", "failed_jobs", "started_jobs", "completed_jobs"]
Copy link
Collaborator

@sh-rp sh-rp Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good!

@@ -392,6 +398,11 @@ def complete_jobs(
# create followup jobs
self.create_followup_jobs(load_id, state, job, schema)

# preserve metrics
metrics = job.metrics()
if metrics:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when does it happen that a job does not have metrics?

"""Job metrics aggregated by table"""


class LoadJobMetrics(NamedTuple):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be of interested to also track whether the data from this job was loaded to the staging destination or the staging dataset

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would also be cool to track if this was a table chain followup job. With these three flags you could make the/any platform UI much more legible.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking a bit more about it, maybe a job_type would be interesting too in the context of displaying data to the user. Types could be:

single table jobs

  • copy (all jobs that just move files around)
  • load (all jobs that move data into a database)

table chain jobs

  • merge (merge jobs)
  • upsert (upsert merge jobs)
  • scd2 (scd2 merge)
  • replace (replace via insert from staging)
  • whatever we call this chunking update

started_at: datetime.datetime
finished_at: datetime.datetime
state: Optional[str]
remote_uri: Optional[str]
Copy link
Collaborator

@sh-rp sh-rp Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking that LoadJobs could have a space for arbitrary data (defined as Dict[str, str]) for information such as the remote_uri. What comes to mind is, that for example BigQuery Jobs have an internal BigQuery Job Id which we could store here. Many SQL Databases also have transaction_ids which could be saved in the traces and the contract would be future proof for other information we might come up with.

Copy link
Collaborator

@sh-rp sh-rp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool PR! I think it is good, but I have some more ideas of what to track in the load metrics when I think about how to display trace data in a useful way to the user in some tool.

@rudolfix
Copy link
Collaborator Author

@sh-rp we followup here dlt-hub/dlt-plus#40 and here #1744

@rudolfix rudolfix merged commit 011d7ff into devel Aug 25, 2024
56 checks passed
@rudolfix rudolfix deleted the feat/1681-collects-load-job-metrics branch August 25, 2024 21:07
willi-mueller pushed a commit that referenced this pull request Sep 2, 2024
* collects basic load job metrics in LoadJob

* adds remote uri to filesystem copy jobs metrics

* adds job id to load package info

* adds table name to job metrics

* skips run step when serializing trace

* adds trace shape test with trace schema

* tests job file name too long

* docs running pipelines with the same name for different envs

* extracts step metrics in common, renames followup jobs

* fixes tests

* fixes tests

* tests delta filesystem for remote_uri

* adds exec_info to trace contract test

* tests remote_uri for filesystem copy

* fixes platform test
willi-mueller pushed a commit that referenced this pull request Sep 2, 2024
* collects basic load job metrics in LoadJob

* adds remote uri to filesystem copy jobs metrics

* adds job id to load package info

* adds table name to job metrics

* skips run step when serializing trace

* adds trace shape test with trace schema

* tests job file name too long

* docs running pipelines with the same name for different envs

* extracts step metrics in common, renames followup jobs

* fixes tests

* fixes tests

* tests delta filesystem for remote_uri

* adds exec_info to trace contract test

* tests remote_uri for filesystem copy

* fixes platform test
willi-mueller pushed a commit that referenced this pull request Sep 2, 2024
* collects basic load job metrics in LoadJob

* adds remote uri to filesystem copy jobs metrics

* adds job id to load package info

* adds table name to job metrics

* skips run step when serializing trace

* adds trace shape test with trace schema

* tests job file name too long

* docs running pipelines with the same name for different envs

* extracts step metrics in common, renames followup jobs

* fixes tests

* fixes tests

* tests delta filesystem for remote_uri

* adds exec_info to trace contract test

* tests remote_uri for filesystem copy

* fixes platform test
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Add actual file paths to traces in case of filesystem destination
2 participants