Docs: schema evolution #1078
✅ Deploy Preview for dlt-hub-docs ready!
|
||
We’ll review some examples here and figure out how `dlt` creates initial schema and how normalisation works. Let’s start by running a simple pipeline with organizations and department details in the data resource. Here’s the resource: | ||
|
||
```python |
Make this a pipeline that loads a data record, so the user understands how it becomes the tables. Do not mention resources, as they add unneeded complexity. Do not use `yield`, for the same reason.
updated
```python | ||
# Define a data resource using 'dlt.resource' with a schema contract set to 'evolve' | ||
|
||
yield { |
Same feedback about `yield`: just load the data, as in the Data Talks Club notebook, for example.
updated
Please note how `dlt` infers deeply nested schema. | ||
::: | ||
|
||
## What did the schema evolution engine do? |
This block got replaced by "what happened?" above - so this can now be deleted
updated
Let’s load the data and look at the tables: | ||
<iframe width="560" height="315" src='https://dbdiagram.io/e/65e80303cd45b569fba28e9d/65e80556cd45b569fba2b8ab'> </iframe> | ||
|
||
What happened? |
Replaces the block on the bottom.
done
As the structure of data changes, such as the addition of new columns, changing data types, etc., `dlt` handles these schema changes, enabling you to adapt to changes without losing velocity. | ||
|
||
## Inferring a schema from nested data | ||
The first run of a pipeline will scan the data that goes through it and generate a schema. To convert nested data into relational format, dlt flattens dictionaries and unpacks nested lists into sub-tables. |
Format dlt as code: `dlt`. You missed one.
done
|
||
By separating the technical process of loading data from curation, you free the data engineer to do engineering, and the analytics to curate data without technical obstacles. So, the analyst must be kept in the loop. | ||
|
||
**Keeping track of column lineage** |
Tracking column lineage
updated
|
||
**Keeping track of column lineage** | ||
|
||
By loading the ‘load_info’ to the destination, info about the column ‘data types’, ‘add times’ and ‘load id’. To read more please see [the data lineage article](https://dlthub.com/docs/blog/dlt-data-lineage) we have on the blog. |
The column lineage can be tracked by loading the 'load_info' to the destination. The 'load_info' contains information about columns ‘data types’, ‘add times’, and ‘load id’
done
|
||
By loading the ‘load_info’ to the destination, info about the column ‘data types’, ‘add times’ and ‘load id’. To read more please see [the data lineage article](https://dlthub.com/docs/blog/dlt-data-lineage) we have on the blog. | ||
|
||
**Getting notifications** |
Either make the 'We' after 'Getting notifications' lowercase or use ':'
Getting notifications: We ..
An extra empty line is missing at line 105.
Yes, an extra line was added.
) | ||
) | ||
``` | ||
This script sends Slack notifications for data schema updates using the 'send_slack_message' function from the dlt library. It provides details on the updated table and column. |
Remove the word "data":
This script sends Slack notifications for schema updates
Use code formatting for code, for example function names: `send_slack_message`.
done
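For readers following this thread, the notification logic described in the hunk can be sketched without a live pipeline. This is a minimal sketch, not the snippet from the PR: the shape of the `schema_update` dictionary (table name mapped to a `columns` mapping with a `data_type` per column) and the helper names `schema_update_messages` and `notify` are assumptions made for illustration. In a real pipeline, the `send` callable would presumably wrap `send_slack_message` together with a webhook URL.

```python
from typing import Any, Callable, Dict, List

# Assumed shape: {table_name: {"columns": {column_name: {"data_type": ...}}}}
SchemaUpdate = Dict[str, Dict[str, Any]]


def schema_update_messages(schema_update: SchemaUpdate) -> List[str]:
    """Build one message per updated column (hypothetical helper)."""
    messages: List[str] = []
    for table_name, table in schema_update.items():
        for column_name, column in table["columns"].items():
            messages.append(
                f"Table updated: {table_name}: "
                f"Column changed: {column_name}: {column['data_type']}"
            )
    return messages


def notify(schema_update: SchemaUpdate, send: Callable[[str], None]) -> int:
    """Pass every message to `send` and return how many were sent."""
    messages = schema_update_messages(schema_update)
    for message in messages:
        send(message)
    return len(messages)


# Stand-in for a Slack sender: collect the messages in a list.
update: SchemaUpdate = {"org": {"columns": {"ceo": {"data_type": "text"}}}}
sent: List[str] = []
count = notify(update, sent.append)
print(sent)
```

Keeping the sender injectable means the message formatting can be checked without a Slack webhook.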
- Added column: | ||
- A new column named “ceo” is added to the “org” table. | ||
- Variant column: | ||
- A new column named `inventory_nr__v_text` is added as the datatype of the column was changed from “integer” to “string”. | ||
- Removed column stopped loading: | ||
- New data to column “room” is not loaded. | ||
- Column stopped loading and new one was added: | ||
- A new column “address__building” was added and now data will be loaded to that and stop loading in the column “address__main_block”. |
The column `inventory_nr__v_text` is formatted as code while the rest are in quotes. I would format the columns as code or bold in this block, as it makes for easier reading.
Yes changed all column names as code.
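The four cases listed in this hunk (added column, variant column, column that stopped loading, column replaced by a new one) can be illustrated with a small comparison routine. This is a sketch under stated assumptions, not how `dlt` implements schema evolution: the `evolve` function, the `TYPE_NAMES` mapping, and the single flat table are hypothetical. Only the `__v_<type>` variant suffix is taken from the `inventory_nr__v_text` example in the diff.

```python
from typing import Any, Dict, List

# Hypothetical mapping from Python types to column data types.
TYPE_NAMES = {bool: "bool", int: "bigint", float: "double", str: "text"}


def evolve(columns: Dict[str, str], record: Dict[str, Any]) -> List[str]:
    """Update `columns` (name -> data type) in place and describe the changes."""
    # Columns already in the schema that the new record no longer provides.
    stopped = [name for name in columns if name not in record]
    changes: List[str] = []
    for name, value in record.items():
        data_type = TYPE_NAMES.get(type(value), "text")
        if name not in columns:
            columns[name] = data_type
            changes.append(f"added column {name} ({data_type})")
        elif columns[name] != data_type:
            # Type changed: keep the old column and add a variant column.
            variant = f"{name}__v_{data_type}"
            if variant not in columns:
                columns[variant] = data_type
                changes.append(f"added variant column {variant} ({data_type})")
    # Existing columns are kept; they simply receive no new data.
    changes.extend(f"column {name} stopped loading" for name in stopped)
    return changes


columns = {
    "name": "text",
    "inventory_nr": "bigint",
    "room": "bigint",
    "address__main_block": "text",
}
record = {
    "name": "Plasma ray",
    "inventory_nr": "AR2411",
    "ceo": "Alice Smith",
    "address__building": "r&d",
}
changes = evolve(columns, record)
for change in changes:
    print(change)
```

Note that nothing is dropped from `columns`: a column that stops receiving data stays in the schema, which matches the "stopped loading" wording in the diff.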
|
||
The schema evolution engine in the `dlt` library is designed to handle changes in the structure of your data over time. For example: | ||
|
||
- As above in continuation of the inferred schema, the “specifications” are nested in ‘details” which are nested in “Inventory” all under table name “org”. So the table created for projects is ‘org__inventory__details__specifications’. |
- use "" or '' not both.
- The 'inventory' in the ‘org__inventory__details__specifications’ is being formatted as bold in the doc preview
corrected
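To make the table name discussed in this thread concrete, here is a minimal sketch of how nested dictionaries can become prefixed columns and nested lists can become sub-tables joined with double underscores. It is an illustration only, not `dlt`'s normaliser: the `normalize` function and the sample record are hypothetical, lowercasing stands in for the real naming convention, and the parent/child linking keys are left out.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def normalize(record: Row, table: str, tables: Dict[str, List[Row]]) -> None:
    """Flatten nested dicts into columns and unpack lists into sub-tables."""
    row: Row = {}

    def visit(value: Row, prefix: str) -> None:
        for key, item in value.items():
            name = f"{prefix}{key}".lower()
            if isinstance(item, dict):
                # Nested dictionary: flatten into columns like address__building.
                visit(item, f"{name}__")
            elif isinstance(item, list):
                # Nested list: each element becomes a row in a sub-table.
                child_table = f"{table}__{name}"
                for element in item:
                    child = element if isinstance(element, dict) else {"value": element}
                    normalize(child, child_table, tables)
            else:
                row[name] = item

    visit(record, "")
    tables.setdefault(table, []).append(row)


# Hypothetical sample record, loosely modelled on the names used in the diff.
data = {
    "organization": "Tech Innovations Inc.",
    "address": {"building": "r&d", "room": 7890},
    "Inventory": [
        {
            "name": "Plasma ray",
            "inventory_nr": 2411,
            "details": {"specifications": [{"key": "power", "value": "5kW"}]},
        }
    ],
}

tables: Dict[str, List[Row]] = {}
normalize(data, "org", tables)
for table_name in sorted(tables):
    print(table_name, tables[table_name])
```

Here `specifications` sits inside the `details` dictionary of an `Inventory` list element, so the sub-table name is assembled as `org__inventory__details__specifications`.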
## Inferring a schema from nested data | ||
The first run of a pipeline will scan the data that goes through it and generate a schema. To convert nested data into relational format, dlt flattens dictionaries and unpacks nested lists into sub-tables. | ||
|
||
We’ll review some examples here and figure out how `dlt` creates initial schema and how normalisation works. Let's begin by creating a pipeline that loads the following data: |
As a reader, I am expecting some code when I read the following
Let's begin by creating a pipeline that loads the following data:
Either add some basic code or change the sentence.
"Consider a pipeline that loads the following schema"
changed
A couple of general things that I observed:
- Use either "" or '' quotes. You keep switching between the two.
- Is there a way to lock the db-diagrams? If possible, do it.
I left some other comments
Nice article!
|
||
The data in the pipeline mentioned above is modified. | ||
|
||
- The updated data pipeline now includes the key 'specifications' within 'details', nested in 'Inventory'. |
You don't need this - bullet point.
removed
The data in the pipeline mentioned above is modified. | ||
|
||
- The updated data pipeline now includes the key 'specifications' within 'details', nested in 'Inventory'. | ||
```python |
Please delete the extra spaces in the code snippet.
done
|
||
By loading the ‘load_info’ to the destination, info about the column ‘data types’, ‘add times’ and ‘load id’. To read more please see [the data lineage article](https://dlthub.com/docs/blog/dlt-data-lineage) we have on the blog. | ||
|
||
**Getting notifications** |
Thanks @zem360 @AstrakhantsevaAA |
Well done Aman! Just one comment about column names
|
||
The schema evolution engine in the `dlt` library is designed to handle changes in the structure of your data over time. For example: | ||
|
||
- As above in continuation of the inferred schema, the “specifications” are nested in "details” which are nested in “Inventory” all under table name “org”. So the table created for projects is "org__inventory__details__specifications". |
done.
Very Good! Thank you Aman <3
Description
Added documentation on schema evolution.
Related Issues
Additional Context