
Docs: schema evolution #1078

Merged
merged 11 commits into from Mar 18, 2024

Conversation

dat-a-man (Collaborator):

Description

Added documentation on schema evolution.

Related Issues

  • Fixes #...
  • Closes #...
  • Resolves #...

Additional Context


netlify bot commented Mar 12, 2024

Deploy Preview for dlt-hub-docs ready!

Latest commit: b000b25
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/65f8290a42d8900008c05c0a
Deploy Preview: https://deploy-preview-1078--dlt-hub-docs.netlify.app


We’ll review some examples here and figure out how `dlt` creates initial schema and how normalisation works. Let’s start by running a simple pipeline with organizations and department details in the data resource. Here’s the resource:

Contributor:
Make this a pipeline that loads a data record, so the user understands how this becomes the tables. Do not mention resources, as that's not needed (extra complexity). Do not use yield for the same reason.

dat-a-man (Collaborator, Author) commented Mar 13, 2024:
updated

```python
# Define a data resource using 'dlt.resource' with a schema contract set to 'evolve'
yield {
```
Contributor:
Same feedback about yield: just load the data, like in the Data Talks Club notebook, for example.

dat-a-man (Collaborator, Author) commented Mar 13, 2024:
updated

Please note how `dlt` infers deeply nested schema.

## What did the schema evolution engine do?
Contributor:
This block got replaced by "What happened?" above, so it can now be deleted.

dat-a-man (Collaborator, Author):
updated

Let’s load the data and look at the tables:
<iframe width="560" height="315" src='https://dbdiagram.io/e/65e80303cd45b569fba28e9d/65e80556cd45b569fba2b8ab'> </iframe>

What happened?
Contributor:
Replaces the block at the bottom.

dat-a-man (Collaborator, Author):
done

As the structure of data changes, such as the addition of new columns, changing data types, etc., `dlt` handles these schema changes, enabling you to adapt to changes without losing velocity.
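As an illustration of what "handling schema changes" can mean, here is a hedged pure-Python sketch. The mode names mirror dlt's schema contract settings (`evolve`, `freeze`, `discard_row`, `discard_value`), but the logic is a deliberate simplification, not dlt's implementation:

```python
from typing import Optional

# Simplified sketch of schema-contract behaviour. The mode names mirror
# dlt's schema contract settings, but this logic is illustrative only.
def apply_contract(known_columns: set, record: dict, mode: str) -> Optional[dict]:
    new_cols = set(record) - known_columns
    if mode == "evolve":
        known_columns |= new_cols  # accept and remember new columns
        return record
    if mode == "freeze":
        if new_cols:
            raise ValueError(f"Schema frozen; unexpected columns: {sorted(new_cols)}")
        return record
    if mode == "discard_value":
        # keep the row, drop unknown keys
        return {k: v for k, v in record.items() if k in known_columns}
    if mode == "discard_row":
        # drop the whole row if it carries unknown keys
        return None if new_cols else record
    raise ValueError(f"Unknown mode: {mode}")
```

With known columns `{"id", "name"}` and an incoming record that adds a `ceo` key, `evolve` keeps everything, `discard_value` strips `ceo`, and `discard_row` drops the record.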

## Inferring a schema from nested data
The first run of a pipeline will scan the data that goes through it and generate a schema. To convert nested data into relational format, dlt flattens dictionaries and unpacks nested lists into sub-tables.
Contributor:
Format dlt as code: `dlt`. You missed one.

dat-a-man (Collaborator, Author):
done
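The dictionary flattening described in the excerpt can be sketched in pure Python. The double-underscore naming matches the column names in this doc (e.g. `address__main_block`), but the function is illustrative, not dlt's actual normaliser (which also unpacks nested lists into sub-tables):

```python
# Illustrative sketch: flatten nested dictionaries into column names joined
# with "__", the separator used in the examples in this doc. Not dlt's
# actual normaliser; nested lists would become sub-tables instead.
def flatten(record: dict, parent: str = "") -> dict:
    flat = {}
    for key, value in record.items():
        name = f"{parent}__{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # recurse into nested dicts
        else:
            flat[name] = value
    return flat
```

For example, `flatten({"name": "A", "address": {"main_block": "B"}})` yields the columns `name` and `address__main_block`.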


By separating the technical process of loading data from curation, you free the data engineer to do engineering, and the analyst to curate data without technical obstacles. So, the analyst must be kept in the loop.

**Keeping track of column lineage**
Contributor:
Tracking column lineage

dat-a-man (Collaborator, Author):
updated



By loading the ‘load_info’ to the destination, info about the column ‘data types’, ‘add times’ and ‘load id’. To read more please see [the data lineage article](https://dlthub.com/docs/blog/dlt-data-lineage) we have on the blog.
Contributor:
The column lineage can be tracked by loading the 'load_info' to the destination. The 'load_info' contains information about the columns' 'data types', 'add times', and 'load id'.

dat-a-man (Collaborator, Author):
done
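The lineage idea discussed above, turning schema-update metadata (data types, add times, load id) into queryable rows, can be sketched as follows. Note that the shape of the `info` dict here is a hypothetical example chosen only for illustration; dlt's real `load_info` object has a different structure:

```python
# Hypothetical shape for schema-update metadata; dlt's real load_info differs.
info = {
    "load_id": "1710000000.123",  # assumed example value
    "tables": {
        "org": {
            "ceo": {"data_type": "text", "add_time": "2024-03-12"},
        },
    },
}

def lineage_rows(info: dict) -> list:
    """Flatten schema-update metadata into one lineage row per column."""
    rows = []
    for table, columns in info["tables"].items():
        for column, meta in columns.items():
            rows.append({
                "load_id": info["load_id"],
                "table": table,
                "column": column,
                "data_type": meta["data_type"],
                "add_time": meta["add_time"],
            })
    return rows
```

Rows in this shape can then be loaded to the destination alongside the data, which is what makes the lineage queryable.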



**Getting notifications**
Contributor:
Either make the 'We' after 'Getting notifications' lowercase, or use ':', i.e. "Getting notifications: We ..".

Contributor:
An empty line is missing at line 105.

Contributor:
[image]

dat-a-man (Collaborator, Author):
Yes, an extra line was added.

This script sends Slack notifications for data schema updates using the 'send_slack_message' function from the dlt library. It provides details on the updated table and column.
Contributor:
Remove 'data': "This script sends Slack notifications for schema updates"

Contributor:
Use code formatting for code, for example function names: `send_slack_message`.

dat-a-man (Collaborator, Author):
done
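dlt ships a `send_slack_message` helper for this; as a library-free sketch of the same idea, the snippet below builds the notification text and posts it to a Slack incoming webhook. The webhook URL and the exact message format are assumptions for illustration, not dlt's implementation:

```python
import json
import urllib.request

def build_schema_update_message(table: str, columns: dict) -> str:
    """Build a human-readable notification for a schema update."""
    lines = [f"Schema update on table `{table}`:"]
    for column, data_type in columns.items():
        lines.append(f"- new column `{column}` ({data_type})")
    return "\n".join(lines)

def notify_slack(webhook_url: str, message: str) -> None:
    """POST the message to a Slack incoming webhook (URL is an assumption)."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)  # add error handling/retries in practice
```

Usage would be `notify_slack(hook_url, build_schema_update_message("org", {"ceo": "text"}))` after inspecting the load's schema updates.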

Comment on lines 87 to 94
- Added column:
  - A new column named “ceo” is added to the “org” table.
- Variant column:
  - A new column named `inventory_nr__v_text` is added, as the datatype of the column was changed from “integer” to “string”.
- Removed column stopped loading:
  - New data to column “room” is not loaded.
- Column stopped loading and a new one was added:
  - A new column “address__building” was added; data will now be loaded to it, and loading stops in the column “address__main_block”.
Contributor:
The column inventory_nr__v_text is formatted as code while the rest are in "". I would format the columns as code or bold in this block, as it makes for an easier reading experience.

Contributor:
Also, always look at the deploy preview.
[image]

dat-a-man (Collaborator, Author):
Yes, changed all column names to code.


The schema evolution engine in the `dlt` library is designed to handle changes in the structure of your data over time. For example:

- As above in continuation of the inferred schema, the “specifications” are nested in ‘details” which are nested in “Inventory” all under table name “org”. So the table created for projects is ‘org__inventory__details__specifications’.
zem360 (Contributor) commented Mar 13, 2024:

  • Use "" or '', not both.
  • The 'inventory' in ‘org__inventory__details__specifications’ is being formatted as bold in the doc preview.

dat-a-man (Collaborator, Author):
corrected
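The sub-table naming discussed in the bullet above can be sketched as joining the nesting path with double underscores (lower-cased), which reproduces the `org__inventory__details__specifications` name. The helper is illustrative only:

```python
def child_table_name(parent: str, path: list) -> str:
    """Join a nesting path into a sub-table name, e.g. org__inventory__details."""
    return "__".join([parent] + [p.lower() for p in path])
```

So `child_table_name("org", ["Inventory", "details", "specifications"])` gives `org__inventory__details__specifications`.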


We’ll review some examples here and figure out how `dlt` creates initial schema and how normalisation works. Let's begin by creating a pipeline that loads the following data:
Contributor:
As a reader, I am expecting some code when I read "Let's begin by creating a pipeline that loads the following data:". Either add some basic code or change the sentence, e.g. "Consider a pipeline that loads the following schema".

dat-a-man (Collaborator, Author):
changed

zem360 (Contributor) left a comment:

A couple of general things that I observed:

  • Either use "" or '' quotes; you keep switching between the two.
  • Is there a way to lock the db-diagrams? If possible, do it.

I left some other comments.

AstrakhantsevaAA (Contributor) left a comment:

Nice article!






The data in the pipeline mentioned above is modified.

- Updated data pipeine now includes key 'specifications' within 'details', nested in 'Inventory'.
Contributor:
You don't need this bullet point.

dat-a-man (Collaborator, Author):
removed

Contributor:
Please delete the extra spaces in the code snippet.

dat-a-man (Collaborator, Author):
done




dat-a-man (Collaborator, Author):

Thanks @zem360 @AstrakhantsevaAA
Changed and updated.

AstrakhantsevaAA (Contributor) left a comment:

Well done Aman! Just one comment about column names.



- As above in continuation of the inferred schema, the “specifications” are nested in "details” which are nested in “Inventory” all under table name “org”. So the table created for projects is "org__inventory__details__specifications".
Contributor:
Again the mess with column names: all column names in the article should be formatted as code, because it's not plain text.
[image]

dat-a-man (Collaborator, Author):
done.

AstrakhantsevaAA (Contributor) left a comment:

Very Good! Thank you Aman <3

@AstrakhantsevaAA AstrakhantsevaAA changed the base branch from master to devel March 18, 2024 11:43
@AstrakhantsevaAA AstrakhantsevaAA dismissed their stale review March 18, 2024 11:43

The base branch was changed.

@AstrakhantsevaAA AstrakhantsevaAA changed the base branch from devel to master March 18, 2024 11:43
@AstrakhantsevaAA AstrakhantsevaAA merged commit b09d336 into master Mar 18, 2024
44 checks passed
@AstrakhantsevaAA AstrakhantsevaAA deleted the docs/schema_evolution_docs branch March 18, 2024 13:08

4 participants