Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add load_id to _dlt_versions table #320

Closed
rudolfix opened this issue May 9, 2023 · 5 comments · Fixed by #466
Closed

add load_id to _dlt_versions table #320

rudolfix opened this issue May 9, 2023 · 5 comments · Fixed by #466
Labels
enhancement New feature or request

Comments

@rudolfix
Copy link
Collaborator

rudolfix commented May 9, 2023

Background
We want the users to be able to link the loads in _dlt_loads to _dlt_versions to identify which load updated the schema without enabling the trace (https://dlthub.com/docs/running-in-production/running#inspect-and-save-the-load-info-and-trace).

Tasks

    • insert load_id into the _dlt_versions, see SqlJobClientBase
    • you need to change the _dlt_versions schema definition (schema/utils.py) - make sure it is NULLABLE so old dlt installations can migrate
    • we need to increase schema engine version and write migration that will add the column to existing schema (schema/utils.py)

Tests

  1. Extend existing tests for updating schema and make sure the load_id is stored (test_job_client.py)
  2. Test the schema migration: mock the schema to not have this column, load data, then load the schema again and load the data again to make sure the schema was added
@rudolfix rudolfix added enhancement New feature or request good first issue Good for newcomers labels May 9, 2023
@codingcyclist
Copy link
Collaborator

Hi @rudolfix thanks for opening this ticket 🙌 I recently ran into another use case wrt. schema version names: I wanted to identify which schema version was used for which load (i.e. the reverse of what you're describing above). Have you also considered adding the schema version to the _dlt_loads table?

@rudolfix
Copy link
Collaborator Author

why not? we can do it in this ticket. there will be duplicated information (you can always join the tables) but I assume it is more user friendly
@codingcyclist btw. could you check if maybe this https://dlthub.com/docs/running-in-production/running#inspect-and-save-the-load-info-and-trace
solves your problem? dlt produces very rich trace of what got loaded including all the information that you need. you can load this trace into your destination.

@rudolfix
Copy link
Collaborator Author

@tarunsamanta2k20 there was github outage and several copies of this ticket were created. glad to hear that you want to contribute! what kind of help do you need?

@codingcyclist
Copy link
Collaborator

@rudolfix I just filed a small PR for this, let me know what you think. Instead of adding the load_id to the version table, I decided to add the version number to the loads table. The reason was that the _update_schema_in_storage() function, which updates the version table, is not aware of the load id, whereas the complete_load() function, which updates the load table, is aware of the current schema version.

@rudolfix
Copy link
Collaborator Author

rudolfix commented Jul 6, 2023

@codingcyclist thanks for the PR. I'll come back with the review soon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants