Add Delta table support for filesystem destination #1382

Merged
rudolfix merged 50 commits from 978-filesystem-delta-table into devel on Jun 6, 2024

Conversation

jorritsandbrink (Collaborator, Author) commented on May 17, 2024

Description

This PR enables writing datasets to Delta tables in the filesystem destination.

A user can specify delta as table_format in a resource definition:

@dlt.resource(table_name="a_delta_table", table_format="delta")
def a_resource():
    ...
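
For context, here is a minimal end-to-end sketch of how the feature would be used. The pipeline name, dataset name, local bucket_url, and sample rows are made up for illustration; only the table_format="delta" argument is the feature added by this PR.

import dlt

@dlt.resource(table_name="a_delta_table", table_format="delta")
def a_resource():
    # sample rows; any data that dlt can normalize into a table works here
    yield [{"id": 1, "value": "foo"}, {"id": 2, "value": "bar"}]

pipeline = dlt.pipeline(
    pipeline_name="delta_example",
    destination=dlt.destinations.filesystem("file:///tmp/delta_example"),  # any supported bucket_url
    dataset_name="example_data",
)
pipeline.run(a_resource())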

Related Issues

Contributes to #978

jorritsandbrink added the enhancement (New feature or request) label on May 17, 2024
jorritsandbrink self-assigned this on May 17, 2024

netlify bot commented May 17, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: a75a0f9
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/6660c9a216f2bd000830fc81

@@ -309,8 +309,12 @@ def restore_file_load(self, file_path: str) -> LoadJob:
"""Finds and restores already started loading job identified by `file_path` if destination supports it."""
pass

def can_do_logical_replace(self, table: TTableSchema) -> bool:
jorritsandbrink (Collaborator, Author) commented:

Perhaps this can become a destination capability if we turn Delta into a full destination.

rudolfix (Collaborator) replied:

IMO we do not need this on the highest abstraction level. This belongs only to the filesystem client, and there it is enough to override should_truncate_table_before_load.

I'm trying to keep the abstract classes as simple as possible; the two methods below are already a stretch (but I do not have an idea where else to move them).


assert isinstance(self.config.credentials, AwsCredentials)
storage_options = self.config.credentials.to_session_credentials()
storage_options["AWS_REGION"] = self.config.credentials.region_name
jorritsandbrink (Collaborator, Author) commented:

The deltalake library requires that AWS_REGION is provided. We need to add it to DLT_SECRETS_TOML under [destination.filesystem.credentials] to make s3 tests pass on CI.

rudolfix (Collaborator) replied:

Yeah! This may also come from the machine's default credentials. Nevertheless, we should warn or exit when it is not set.
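
A rough sketch of such a guard (hypothetical, not the PR's actual code; using AWS_DEFAULT_REGION as a fallback source is an assumption):

import os

region = self.config.credentials.region_name or os.environ.get("AWS_DEFAULT_REGION")
if not region:
    # fail loudly instead of letting deltalake error out later
    raise ValueError(
        "AWS_REGION is required by deltalake; set it under "
        "[destination.filesystem.credentials] or provide machine default credentials."
    )
storage_options["AWS_REGION"] = region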


assert isinstance(self.config.credentials, GcpServiceAccountCredentials)
gcs_creds = self.config.credentials.to_gcs_credentials()
gcs_creds["token"]["private_key_id"] = "921837921798379812"
jorritsandbrink (Collaborator, Author) commented:

This must be changed so that private_key_id is fetched from configuration.

rudolfix (Collaborator) replied:

Hmmm, OK. When you authenticate in Python you do not need to do that... we can add this as an optional field. This also means that OAuth authentication will not work? I think that is fine.

Btw, can delta-rs find default Google credentials? You can check has_default_credentials() and, if so, leave the token as None; that works for fsspec.
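
A sketch combining both suggestions (hypothetical, not the PR's code; private_key_id as a configuration field is the optional field that would still have to be added):

if self.config.credentials.has_default_credentials():
    gcs_creds["token"] = None  # let delta-rs resolve machine default credentials, as fsspec does
else:
    # hypothetical optional field, read from configuration instead of hardcoded
    gcs_creds["token"]["private_key_id"] = self.config.credentials.private_key_id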

storage_options = self.config.credentials.to_session_credentials()
storage_options["AWS_REGION"] = self.config.credentials.region_name
# https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/#enable-unsafe-writes-in-s3-opt-in
storage_options["AWS_S3_ALLOW_UNSAFE_RENAME"] = "true"
jorritsandbrink (Collaborator, Author) commented:

Setting AWS_S3_ALLOW_UNSAFE_RENAME to true is the simplest setup. Perhaps we can later extend and let the user configure a locking provider.

Context: https://delta-io.github.io/delta-rs/usage/writing/writing-to-s3-with-locking-provider/.
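
For reference, a locking-provider setup would look roughly like this (not part of this PR; the option names are taken from the linked delta-rs docs and may differ between delta-rs versions, and the table name is only an example):

storage_options["AWS_S3_LOCKING_PROVIDER"] = "dynamodb"
storage_options["DELTA_DYNAMO_TABLE_NAME"] = "delta_log"  # DynamoDB table used for commit locking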

rudolfix (Collaborator) replied:

Interesting. We have a locking provider, but it is probably not compatible with Delta. It is called transactional_file.py.

jorritsandbrink (Collaborator, Author) commented:

@rudolfix Can you review?

Delta tables are managed at the folder level, not the file level. Hence, they are treated differently from the csv, jsonl, and parquet file formats.

I'll add docs after we've settled on the user interface.

rudolfix (Collaborator) left a review:

Ohhh, I was sure we could have Delta tables as a single file. Looking at your code, I think we should do something else:

Make a destination out of this. It could be based on filesystem and use the same credentials. Of course we support only append and replace.

  1. We do not use file format; we should use table_format and add delta to it (alongside pyiceberg).
  2. You can create different jobs within the filesystem destination depending on table_format; check how we do it in the athena destination. I think here it will be much simpler (see the sketch after this list).
  3. You can then separate the Delta code from the regular file code, and in the future we can add pyiceberg support easily.
  4. I'm not sure we can add merge support this way... however, we can always abuse the existing merge mechanism (when there is a merge write disposition, the delta job does nothing but requests a followup job, so at the end we process all table files at once).
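
A rough sketch of item 2 (hypothetical, not the PR's code; the method signature mirrors the existing start_file_load, and the job class names and constructor arguments are illustrative):

def start_file_load(self, table: TTableSchema, file_path: str, load_id: str) -> LoadJob:
    # dispatch on the resource's table_format, similar to how the athena destination picks job types
    if table.get("table_format") == "delta":
        return DeltaLoadFilesystemJob(table, file_path, load_id)  # delta-specific job
    return LoadFilesystemJob(table, file_path, load_id)  # regular single-file copy job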

def _write_delta_table(
    self, path: str, table: "pa.Table", write_disposition: TWriteDisposition  # type: ignore[name-defined]  # noqa
) -> None:
    """Writes in-memory Arrow table to on-disk Delta table."""
rudolfix (Collaborator) commented:

A few questions here:

  1. We can have many files for a given table. Are we able to write them all at once?
  2. Related to the above: is writing several tables at once, in parallel, supported? (It really should be :))
  3. Do we really need to load the parquet file into memory? I know that you clean it up, but we could implement parquet alignment differently, i.e. via another "flavour" of parquet that a given destination can request.

jorritsandbrink (Collaborator, Author) replied:

I introduced the concept of a "directory job", so 1 is possible now. 2 is also possible. Both are tested in test_pipeline_delta_filesystem_destination. 3 seems not possible, as discussed in chat (screenshot omitted).
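
A minimal sketch of what the directory job does for Delta (hypothetical, not the PR's exact code; it assumes every file in the job directory is parquet):

import pyarrow.dataset as ds
from deltalake import write_deltalake

def write_dir_to_delta(dir_path: str, delta_table_path: str, storage_options: dict) -> None:
    arrow_ds = ds.dataset(dir_path, format="parquet")  # every parquet file produced for this table
    write_deltalake(
        delta_table_path,
        arrow_ds.to_table(),  # all files land in a single Delta commit
        mode="append",
        storage_options=storage_options,
    )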



import pyarrow as pa
from deltalake import write_deltalake

def adjust_arrow_schema(
rudolfix (Collaborator) commented:

All of those look like utility functions that could be made available independently and also unit tested.

jorritsandbrink (Collaborator, Author) replied:

I moved them to utils.py and made them independent. Haven't added unit tests yet.

rudolfix mentioned this pull request on May 22, 2024
@@ -214,6 +214,20 @@ def exception(self) -> str:
    pass


class DirectoryLoadJob:
jorritsandbrink (Collaborator, Author) commented:

Very minimal for now. Want to get some feedback before further polishing.

@@ -177,6 +178,15 @@ def __str__(self) -> str:
    return self.job_id()


class ParsedLoadJobDirectoryName(NamedTuple):
jorritsandbrink (Collaborator, Author) commented:

Also very minimal. Same as above.


def restore_file_load(self, file_path: str) -> LoadJob:
    return EmptyLoadJob.from_file_path(file_path, "completed")

def start_dir_load(self, table: TTableSchema, dir_path: str, load_id: str) -> DirectoryLoadJob:
jorritsandbrink (Collaborator, Author) commented:

Perhaps iceberg will also be a directory job if we add it.

jorritsandbrink (Collaborator, Author) commented:

@rudolfix Can you review once more?

I addressed some of your feedback. The biggest changes since the last review:

  1. Introduction of the "directory job", so multiple files can be loaded at once. Thus far, the concept of a job has been tightly coupled to a single file (if I understand correctly). But if there are multiple Parquet files, we want to be able to load them into the Delta table in a single commit. That is now possible. The code is still rough/minimal, but I want some feedback before polishing it up (see the sketch after this list).
  2. A dedicated job for Delta table loads: DeltaLoadFilesystemJob.
  3. Delta-related methods turned into static utility methods.
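
To illustrate point 1, a bare-bones sketch of the directory-job idea (hypothetical; the actual DirectoryLoadJob in the diff may look different):

class DirectoryLoadJob:
    """A load job that receives a whole directory of files for one table rather than a single file."""

    def __init__(self, table: TTableSchema, dir_path: str, load_id: str) -> None:
        self.table = table
        self.dir_path = dir_path  # folder holding all files produced for this table in this load
        self.load_id = load_id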

@@ -216,6 +222,244 @@ def some_source():
    assert table.column("value").to_pylist() == [1, 2, 3, 4, 5]


@pytest.mark.essential
jorritsandbrink (Collaborator, Author) commented:

I only marked this test as essential. Perhaps the other tests also need the mark. Do we have a guideline for this?

rudolfix (Collaborator) replied:

  1. If we add a new feature, it makes sense to write 1-2 happy-path tests and mark them as essential.
  2. If tests finish quickly (like all the local pipeline tests here), they can also be marked as essential.

essential is used for fast smoke-testing of destinations in normal CI runs.

rudolfix previously approved these changes on Jun 5, 2024

rudolfix (Collaborator) left a review:

LGTM! amazing work!

I added a few small fixes, mostly to handle failing pydantic tests. See the commits for details.

rudolfix merged commit 1c1ce7e into devel on Jun 6, 2024
48 of 50 checks passed
rudolfix deleted the 978-filesystem-delta-table branch on June 6, 2024 at 08:27
Labels: enhancement (New feature or request)
Project status: Done
2 participants