Enable schema evolution for `merge` write disposition with `delta` table format #1742

jorritsandbrink · 2024-08-25T14:23:59Z

Description

- Enables schema evolution (adding new columns) for the `merge` write disposition with the `delta` table format.
- Increases the minimum `deltalake` version to gain access to the `add_columns` method.
- Allows passing a schema name to `get_delta_tables` for pipelines with multiple schemas (this may explain the problem with "missing" delta tables).

Related Issues

netlify · 2024-08-25T14:24:14Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`e169455`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/66ccf9932c7d810009062456

jorritsandbrink · 2024-08-25T16:38:05Z

poetry.lock

@@ -6660,63 +6659,52 @@ files = [

 [[package]]
 name = "pyarrow"
-version = "14.0.2"
+version = "16.1.0"


Note that pyarrow gets upgraded.

tests/libs/test_deltalake.py

rudolfix

Thanks for working on this! It looks good! Two things:

1. If `arrow_ds` is empty, you do not evolve the schema. IMO that should happen. Please add a test for it (`if arrow_ds.head(1).num_rows == 0:`).
2. Should we update all table schemas like in other destinations, where it happens in `update_stored_schema`? If you agree, let's create a ticket for that.

Same thing for truncating tables before the load; this is actually used by the refresh option.

jorritsandbrink · 2024-08-26T12:48:31Z

@rudolfix

> if arrow_ds is empty you do not evolve the schema. IMO that should happen. please add a test for it (if arrow_ds.head(1).num_rows == 0:)

Done.
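The empty-source case can be illustrated with a minimal, library-free sketch. Schemas are modeled here as plain dicts of column name to type string, and `columns_to_add` and `evolve` are hypothetical helpers, not dlt or delta-rs functions (the actual change relies on the `add_columns` method from `deltalake`). The point it tries to show is that the decision to add columns depends only on the two schemas, so it still applies when the source has zero rows.

```python
from typing import Dict, List, Tuple

# Simplified stand-in for an Arrow schema: column name -> type name.
Schema = Dict[str, str]


def columns_to_add(source: Schema, target: Schema) -> List[Tuple[str, str]]:
    """Return (name, type) pairs present in the source but missing from the target."""
    return [(name, typ) for name, typ in source.items() if name not in target]


def evolve(source: Schema, target: Schema) -> Schema:
    """Return a copy of the target schema with the missing source columns appended."""
    evolved = dict(target)
    evolved.update(columns_to_add(source, target))
    return evolved


# Hypothetical example: the source carries a new column but no rows at all.
target_schema = {"id": "int64", "name": "string"}
source_schema = {"id": "int64", "name": "string", "email": "string"}
source_rows: list = []

# Evolution is computed from the schemas only, so an empty batch still adds "email".
evolved_schema = evolve(source_schema, target_schema)
```

In this sketch the row count is never consulted, which is the behavior requested in the review: an empty source should not short-circuit schema evolution.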

> should we update all table schemas like in other destinations where it happens in update_stored_schema? if you agree let's create a ticket for that

Three options:

1. We let delta-rs do automatic schema evolution.
   - This already works for `write_deltalake` (which we use for the `append` and `replace` write dispositions).
   - This does not yet work for `DeltaTable.merge` (which we use for the `merge` write disposition).
   - This does not yet work for the "empty batch case".
2. We manually manage schema evolution.
   - In this case I think using `update_stored_schema` is a good idea.
3. A mix of 1 and 2.

We currently do option 3. Option 1 is not possible yet, but might become possible once the linked tickets are done (they are already assigned, so that could be soon). Option 2 is possible, but puts a bigger burden on our side. Which do you prefer?
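As a rough sketch of what the "mix" amounts to, the function below maps a load to either automatic or manual schema evolution, based only on the cases listed above. `evolution_strategy` is a hypothetical helper written for illustration; it is not part of dlt or delta-rs, and the real code may structure this decision differently.

```python
def evolution_strategy(write_disposition: str, source_is_empty: bool) -> str:
    """Pick who evolves the Delta table schema for a load (hypothetical helper)."""
    if source_is_empty:
        # The "empty batch case": automatic evolution does not cover it yet.
        return "manual"
    if write_disposition == "merge":
        # DeltaTable.merge does not evolve the schema automatically yet.
        return "manual"
    if write_disposition in ("append", "replace"):
        # write_deltalake handles schema evolution on its own.
        return "automatic"
    raise ValueError(f"unknown write disposition: {write_disposition!r}")


# Enumerate every combination to see where manual handling is needed.
strategies = {
    (disposition, empty): evolution_strategy(disposition, empty)
    for disposition in ("append", "replace", "merge")
    for empty in (False, True)
}
```

Under these assumptions, only non-empty `append` and `replace` loads rely on delta-rs; everything else falls back to manual handling.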

> same thing for truncating tables before the load. this is actually used by refresh option

Okay, then we should probably use it.

rudolfix

OK, this is top. Let me add a Windows fix.

rudolfix · 2024-08-26T15:25:25Z

@jorritsandbrink

So what I'd do:

1. In `update_stored_schema`: make sure that table prefix == table dir for delta (then we know that each table has a separate "folder"). Then the folder structure is good and weird layouts are eliminated.
2. In truncate / drop tables:
   - disable for delta (i.e. with an error message), OR
   - implement dropping and truncating tables properly (all refresh options should work after that).
3. Migrating schema: you already have all the building blocks for option (2) from your list, and IMO it makes sense to migrate tables before we start loading, but the priority is low.
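The "table prefix == table dir" condition from the first point can be sketched as a small path check. This is a minimal illustration under assumptions: `has_own_folder` is a hypothetical function, the prefixes are made-up examples, and how dlt actually computes table prefixes and directories is not shown in this thread.

```python
import posixpath


def has_own_folder(table_prefix: str) -> bool:
    """Return True if the prefix points at a directory dedicated to one table.

    Hypothetical check: the prefix must end with a path separator, so that the
    directory part of the prefix is the prefix itself.
    """
    table_dir = posixpath.dirname(table_prefix)
    return table_prefix.endswith("/") and table_prefix.rstrip("/") == table_dir


# Hypothetical prefixes derived from two different file layouts.
folder_layout = "my_dataset/items/"  # files live inside a per-table folder
flat_layout = "my_dataset/items."    # files share the dataset folder with other tables
```

A layout like `flat_layout` would be rejected by such a check, since the table's files would share a directory with other tables and could not be treated as one Delta table folder.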

jorritsandbrink · 2024-08-27T07:40:10Z

dlt/destinations/impl/filesystem/filesystem.py

-                partition_by=self._partition_columns,
-                storage_options=self._storage_options,
-            )
+        with self.arrow_ds.scanner().to_reader() as arrow_rbr:  # RecordBatchReader


Curious why you inserted a with context here. Is it because arrow_rbr gets exhausted and is effectively useless after the context?

rudolfix

It has a `close` method, so it holds internal unmanaged resources that we should free ASAP. Otherwise the garbage collector does it much later.
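The reasoning above applies to any object that exposes `close`. The sketch below uses a hypothetical `BatchReader` class as a stand-in (it is not pyarrow's `RecordBatchReader`) to show that a `with` block releases the resource as soon as the block exits, rather than whenever garbage collection happens to run.

```python
from typing import Iterator, List


class BatchReader:
    """Hypothetical stand-in for a reader that holds unmanaged resources."""

    def __init__(self, batches: List[List[int]]) -> None:
        self._batches = batches
        self.closed = False

    def __iter__(self) -> Iterator[List[int]]:
        return iter(self._batches)

    def close(self) -> None:
        # A real reader would release file handles or native buffers here.
        self.closed = True

    def __enter__(self) -> "BatchReader":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Called when the with block ends, even if an exception was raised.
        self.close()


reader = BatchReader([[1, 2], [3]])
with reader as rbr:
    total_rows = sum(len(batch) for batch in rbr)

# At this point close() has already run; no garbage collection was needed.
```

The same pattern is what the `with self.arrow_ds.scanner().to_reader() as arrow_rbr:` line in the diff relies on.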

…ble format (#1742)

* black format
* increase minimum deltalake version dependency
* enable schema evolution for delta table merge
* extract delta table merge logic into separate function
* remove big decimal exclusion due to upstream bugfix
* evolve delta table schema in empty source case
* refactor DeltaLoadFilesystemJob
* uses right table path format in delta lake load job
* allows to pass schema name when getting delta tables and computing table counts
* cleansup usage of remote paths and uris in filesystem load jobs
* removes tempfile from file_storage

---------

Co-authored-by: Marcin Rudolf <rudolfix@rudolfix.org>

jorritsandbrink added 4 commits August 25, 2024 17:36

black format

d743086

increase minimum deltalake version dependency

a661aeb

enable schema evolution for delta table merge

4720083

extract delta table merge logic into separate function

34d903f

jorritsandbrink linked an issue Aug 25, 2024 that may be closed by this pull request

No schema evolution with delta table format on filesystem destination with merge write disposition #1739

Closed

remove big decimal exclusion due to upstream bugfix

ae29363

jorritsandbrink commented Aug 25, 2024

jorritsandbrink requested a review from rudolfix August 25, 2024 16:46

jorritsandbrink marked this pull request as ready for review August 25, 2024 16:46

rudolfix requested changes Aug 25, 2024

rudolfix added the support This issue is monitored by Solution Engineer label Aug 25, 2024

jorritsandbrink added 3 commits August 26, 2024 13:33

evolve delta table schema in empty source case

9423ae8

refactor DeltaLoadFilesystemJob

c676800

Merge branch 'devel' of https://github.com/dlt-hub/dlt into fix/1739-schema-evolution-delta-filesystem-merge

d192ee7

jorritsandbrink requested a review from rudolfix August 26, 2024 12:49

rudolfix previously approved these changes Aug 26, 2024

uses right table path format in delta lake load job

00e46e9

rudolfix dismissed their stale review via 00e46e9 August 26, 2024 14:31

rudolfix added 3 commits August 26, 2024 23:53

allows to pass schema name when getting delta tables and computing table counts

7fd560a

cleansup usage of remote paths and uris in filesystem load jobs

03480d0

removes tempfile from file_storage

e169455

jorritsandbrink commented Aug 27, 2024

rudolfix merged commit 08e5e7a into devel Aug 27, 2024
55 of 56 checks passed

rudolfix deleted the fix/1739-schema-evolution-delta-filesystem-merge branch August 27, 2024 07:59

rudolfix restored the fix/1739-schema-evolution-delta-filesystem-merge branch August 27, 2024 07:59

rudolfix deleted the fix/1739-schema-evolution-delta-filesystem-merge branch August 27, 2024 07:59

jorritsandbrink mentioned this pull request Sep 2, 2024

Proper pipeline refresh with delta table format #1777

Open
