
Docs: schema evolution #1078

Merged
merged 11 commits into from Mar 18, 2024

Conversation

dat-a-man (Collaborator):

Description

Added documentation on schema evolution.

Related Issues

  • Fixes #...
  • Closes #...
  • Resolves #...

Additional Context


netlify bot commented Mar 12, 2024

Deploy Preview for dlt-hub-docs ready!

Latest commit: b000b25
Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/65f8290a42d8900008c05c0a
Deploy Preview: https://deploy-preview-1078--dlt-hub-docs.netlify.app


We’ll review some examples here and figure out how `dlt` creates initial schema and how normalisation works. Let’s start by running a simple pipeline with organizations and department details in the data resource. Here’s the resource:

Contributor:
Make this a pipeline that loads a data record, so the user understands how this becomes the tables. Do not mention resources, as that's not needed (extra complexity). Do not use yield for the same reason.

dat-a-man (Collaborator, Author) commented Mar 13, 2024:
updated

```python
# Define a data resource using 'dlt.resource' with a schema contract set to 'evolve'
yield {
```
Contributor:
Same feedback about yield: just load the data, like in the Data Talks Club notebook, for example.

dat-a-man (Collaborator, Author) commented Mar 13, 2024:
updated

Please note how `dlt` infers deeply nested schema.

## What did the schema evolution engine do?
Contributor:
This block got replaced by "What happened?" above, so it can now be deleted.

dat-a-man (Collaborator, Author):
updated

Let’s load the data and look at the tables:
<iframe width="560" height="315" src='https://dbdiagram.io/e/65e80303cd45b569fba28e9d/65e80556cd45b569fba2b8ab'> </iframe>

What happened?
Contributor:
Replaces the block at the bottom.

dat-a-man (Collaborator, Author):
done

As the structure of data changes, such as the addition of new columns, changing data types, etc., `dlt` handles these schema changes, enabling you to adapt to changes without losing velocity.
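As an illustration of what "handling schema changes" can mean, here is a hedged pure-Python sketch. The mode names mirror dlt's schema contract settings (`evolve`, `freeze`, `discard_row`, `discard_value`), but the logic is a deliberate simplification, not dlt's implementation:

```python
from typing import Optional

# Simplified sketch of schema-contract behaviour. The mode names mirror
# dlt's schema contract settings, but this logic is illustrative only.
def apply_contract(known_columns: set, record: dict, mode: str) -> Optional[dict]:
    new_cols = set(record) - known_columns
    if mode == "evolve":
        known_columns |= new_cols  # accept and remember new columns
        return record
    if mode == "freeze":
        if new_cols:
            raise ValueError(f"Schema frozen; unexpected columns: {sorted(new_cols)}")
        return record
    if mode == "discard_value":
        # keep the row, drop unknown keys
        return {k: v for k, v in record.items() if k in known_columns}
    if mode == "discard_row":
        # drop the whole row if it carries unknown keys
        return None if new_cols else record
    raise ValueError(f"Unknown mode: {mode}")
```

With known columns `{"id", "name"}` and an incoming record that adds a `ceo` key, `evolve` keeps everything, `discard_value` strips `ceo`, and `discard_row` drops the record.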

## Inferring a schema from nested data
The first run of a pipeline will scan the data that goes through it and generate a schema. To convert nested data into relational format, dlt flattens dictionaries and unpacks nested lists into sub-tables.
Contributor:
Format dlt as code: `dlt`. You missed one.

dat-a-man (Collaborator, Author):
done
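The dictionary flattening described in the excerpt can be sketched in pure Python. The double-underscore naming matches the column names in this doc (e.g. `address__main_block`), but the function is illustrative, not dlt's actual normaliser (which also unpacks nested lists into sub-tables):

```python
# Illustrative sketch: flatten nested dictionaries into column names joined
# with "__", the separator used in the examples in this doc. Not dlt's
# actual normaliser; nested lists would become sub-tables instead.
def flatten(record: dict, parent: str = "") -> dict:
    flat = {}
    for key, value in record.items():
        name = f"{parent}__{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name))  # recurse into nested dicts
        else:
            flat[name] = value
    return flat
```

For example, `flatten({"name": "A", "address": {"main_block": "B"}})` yields the columns `name` and `address__main_block`.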


By separating the technical process of loading data from curation, you free the data engineer to do engineering, and the analyst to curate data without technical obstacles. So, the analyst must be kept in the loop.

**Keeping track of column lineage**
Contributor:
Tracking column lineage

dat-a-man (Collaborator, Author):
updated



By loading the ‘load_info’ to the destination, info about the column ‘data types’, ‘add times’ and ‘load id’. To read more please see [the data lineage article](https://dlthub.com/docs/blog/dlt-data-lineage) we have on the blog.
Contributor:
The column lineage can be tracked by loading the 'load_info' to the destination. The 'load_info' contains information about the columns' 'data types', 'add times', and 'load id'.

dat-a-man (Collaborator, Author):
done
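The lineage idea discussed above, turning schema-update metadata (data types, add times, load id) into queryable rows, can be sketched as follows. Note that the shape of the `info` dict here is a hypothetical example chosen only for illustration; dlt's real `load_info` object has a different structure:

```python
# Hypothetical shape for schema-update metadata; dlt's real load_info differs.
info = {
    "load_id": "1710000000.123",  # assumed example value
    "tables": {
        "org": {
            "ceo": {"data_type": "text", "add_time": "2024-03-12"},
        },
    },
}

def lineage_rows(info: dict) -> list:
    """Flatten schema-update metadata into one lineage row per column."""
    rows = []
    for table, columns in info["tables"].items():
        for column, meta in columns.items():
            rows.append({
                "load_id": info["load_id"],
                "table": table,
                "column": column,
                "data_type": meta["data_type"],
                "add_time": meta["add_time"],
            })
    return rows
```

Rows in this shape can then be loaded to the destination alongside the data, which is what makes the lineage queryable.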



**Getting notifications**
Contributor:
Either make the 'We' after 'Getting notifications' lowercase, or use ':', i.e. "Getting notifications: We ..".

Contributor:
An empty line is missing at line 105.

Contributor:
[image]

dat-a-man (Collaborator, Author):
Yes, an extra line was added.

This script sends Slack notifications for data schema updates using the 'send_slack_message' function from the dlt library. It provides details on the updated table and column.
Contributor:
Remove 'data': "This script sends Slack notifications for schema updates"

Contributor:
Use code formatting for code, for example function names: `send_slack_message`.

dat-a-man (Collaborator, Author):
done
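dlt ships a `send_slack_message` helper for this; as a library-free sketch of the same idea, the snippet below builds the notification text and posts it to a Slack incoming webhook. The webhook URL and the exact message format are assumptions for illustration, not dlt's implementation:

```python
import json
import urllib.request

def build_schema_update_message(table: str, columns: dict) -> str:
    """Build a human-readable notification for a schema update."""
    lines = [f"Schema update on table `{table}`:"]
    for column, data_type in columns.items():
        lines.append(f"- new column `{column}` ({data_type})")
    return "\n".join(lines)

def notify_slack(webhook_url: str, message: str) -> None:
    """POST the message to a Slack incoming webhook (URL is an assumption)."""
    payload = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url, data=payload, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)  # add error handling/retries in practice
```

Usage would be `notify_slack(hook_url, build_schema_update_message("org", {"ceo": "text"}))` after inspecting the load's schema updates.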

Comment on lines 87 to 94
- Added column:
  - A new column named “ceo” is added to the “org” table.
- Variant column:
  - A new column named `inventory_nr__v_text` is added, as the datatype of the column was changed from “integer” to “string”.
- Removed column stopped loading:
  - New data to column “room” is not loaded.
- Column stopped loading and a new one was added:
  - A new column “address__building” was added; data will now be loaded to it, and loading stops in the column “address__main_block”.
Contributor:
The column inventory_nr__v_text is formatted as code while the rest are in "". I would format the columns as code or bold in this block, as it makes for an easier reading experience.

Contributor:
Also, always look at the deploy preview.
[image]

dat-a-man (Collaborator, Author):
Yes, changed all column names to code.


The schema evolution engine in the `dlt` library is designed to handle changes in the structure of your data over time. For example:

- As above in continuation of the inferred schema, the “specifications” are nested in ‘details” which are nested in “Inventory” all under table name “org”. So the table created for projects is ‘org__inventory__details__specifications’.
zem360 (Contributor) commented Mar 13, 2024:

  • Use "" or '', not both.
  • The 'inventory' in ‘org__inventory__details__specifications’ is being formatted as bold in the doc preview.

dat-a-man (Collaborator, Author):
corrected
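The sub-table naming discussed in the bullet above can be sketched as joining the nesting path with double underscores (lower-cased), which reproduces the `org__inventory__details__specifications` name. The helper is illustrative only:

```python
def child_table_name(parent: str, path: list) -> str:
    """Join a nesting path into a sub-table name, e.g. org__inventory__details."""
    return "__".join([parent] + [p.lower() for p in path])
```

So `child_table_name("org", ["Inventory", "details", "specifications"])` gives `org__inventory__details__specifications`.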


We’ll review some examples here and figure out how `dlt` creates initial schema and how normalisation works. Let's begin by creating a pipeline that loads the following data:
Contributor:
As a reader, I am expecting some code when I read "Let's begin by creating a pipeline that loads the following data:". Either add some basic code or change the sentence, e.g. "Consider a pipeline that loads the following schema".

dat-a-man (Collaborator, Author):
changed

zem360 (Contributor) left a comment:

A couple of general things that I observed:

  • Either use "" or '' quotes; you keep switching between the two.
  • Is there a way to lock the db-diagrams? If possible, do it.

I left some other comments.

AstrakhantsevaAA (Contributor) left a comment:

Nice article!






The data in the pipeline mentioned above is modified.

- Updated data pipeine now includes key 'specifications' within 'details', nested in 'Inventory'.
Contributor:
You don't need this bullet point.

dat-a-man (Collaborator, Author):
removed

Contributor:
Please delete the extra spaces in the code snippet.

dat-a-man (Collaborator, Author):
done




dat-a-man (Collaborator, Author):

Thanks @zem360 @AstrakhantsevaAA
Changed and updated.

AstrakhantsevaAA (Contributor) left a comment:

Well done Aman! Just one comment about column names.



- As above in continuation of the inferred schema, the “specifications” are nested in "details” which are nested in “Inventory” all under table name “org”. So the table created for projects is "org__inventory__details__specifications".
Contributor:
Again the mess with column names: all column names in the article should be formatted as code, because it's not plain text.
[image]

dat-a-man (Collaborator, Author):
done.

AstrakhantsevaAA (Contributor) left a comment:

Very Good! Thank you Aman <3

@AstrakhantsevaAA AstrakhantsevaAA changed the base branch from master to devel March 18, 2024 11:43
@AstrakhantsevaAA AstrakhantsevaAA dismissed their stale review March 18, 2024 11:43

The base branch was changed.

@AstrakhantsevaAA AstrakhantsevaAA changed the base branch from devel to master March 18, 2024 11:43
@AstrakhantsevaAA AstrakhantsevaAA merged commit b09d336 into master Mar 18, 2024
44 checks passed
@AstrakhantsevaAA AstrakhantsevaAA deleted the docs/schema_evolution_docs branch March 18, 2024 13:08

4 participants