Feat/fsspec destination #342

steinitzu · 2023-05-19T17:31:34Z

WIP

netlify · 2023-05-19T17:31:46Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`c530da4`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/6470d9b5252f190008b5027a

rudolfix

this looks good! the decorator for creating dynamic type hints seems to work well.

we'll need a few tests for it covering a few mock credentials cases, somewhere in tests/common/configuration

dlt/common/configuration/specs/aws_credentials.py

dlt/common/configuration/specs/base_configuration.py

rudolfix

I made a few changes:

default buckets are defined in tests/config.toml; ENV overrides will still work
changed the scheme from gcs to gs
some fixes for passing gcp credentials

my biggest issue is the lack of test coverage. the client is tested only in replace mode:

you do not test whether append adds more files
I was not sure whether you test that replace deletes files; you do
I do not see a test that the file layout is exactly as in the ticket spec
I do not see tests for the merge special case

Please add tests where filesystem is used as the destination in pipelines. Maybe some of the existing tests in load/pipelines could be converted (separate out the parts that require sql_client; you could check whether files are present instead). It is some work, but without it, it is hard to say whether this thing works.
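The append and replace checks requested above could look roughly like the sketch below. It is only an illustration under stated assumptions: `FakeBucket`, `load_table`, and the `{table}.{load_id}.{file_id}.jsonl` layout are hypothetical stand-ins, not the dlt client or the layout from the ticket spec.

```python
from typing import Dict, List


class FakeBucket:
    """Hypothetical in-memory stand-in for a bucket: maps file path -> content."""

    def __init__(self) -> None:
        self.files: Dict[str, bytes] = {}

    def list_table_files(self, table: str) -> List[str]:
        # All files that belong to one table, in a stable order.
        return sorted(p for p in self.files if p.startswith(f"{table}."))


def load_table(
    bucket: FakeBucket,
    table: str,
    load_id: str,
    payloads: List[bytes],
    write_disposition: str,
) -> None:
    """Toy loader (assumption, not dlt code): replace clears the table first."""
    if write_disposition not in ("append", "replace"):
        raise ValueError(f"unsupported write disposition: {write_disposition}")
    if write_disposition == "replace":
        for path in bucket.list_table_files(table):
            del bucket.files[path]
    for file_id, payload in enumerate(payloads):
        # Hypothetical layout; the real one is defined in the ticket spec.
        bucket.files[f"{table}.{load_id}.{file_id}.jsonl"] = payload


def test_append_adds_files() -> None:
    bucket = FakeBucket()
    load_table(bucket, "items", "load_1", [b"a"], "append")
    load_table(bucket, "items", "load_2", [b"b", b"c"], "append")
    # Files from both loads must be present after two appends.
    assert bucket.list_table_files("items") == [
        "items.load_1.0.jsonl",
        "items.load_2.0.jsonl",
        "items.load_2.1.jsonl",
    ]


def test_replace_deletes_previous_files() -> None:
    bucket = FakeBucket()
    load_table(bucket, "items", "load_1", [b"a", b"b"], "append")
    load_table(bucket, "other", "load_1", [b"x"], "append")
    load_table(bucket, "items", "load_2", [b"c"], "replace")
    # Only the files from the replacing load remain for the replaced table.
    assert bucket.list_table_files("items") == ["items.load_2.0.jsonl"]
    # Other tables are left untouched.
    assert bucket.list_table_files("other") == ["other.load_1.0.jsonl"]
```

A pipeline-level version would run a real pipeline against a test bucket and list the files there instead of inspecting a dictionary.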

rudolfix · 2023-05-23T15:52:45Z

.github/workflows/test_destinations.yml


      # - name: Install self
      #   run: poetry install --no-interaction

      - run: |
-          poetry run pytest tests/load tests/cli --ignore=tests/load/bigquery -k '(not bigquery)'
+          poetry run pytest tests/load/filesystem tests/cli --ignore=tests/load/bigquery -k '(not bigquery)'


please remove this change once all tests pass

rudolfix · 2023-05-23T15:53:50Z

dlt/common/configuration/resolve.py

@@ -117,6 +117,9 @@ def _resolve_config_fields(
    unresolved_fields: Dict[str, Sequence[LookupTrace]] = {}

    for key, hint in fields.items():
+        if key in config.__hint_resolvers__:
+            # Type hint for this field is created dynamically
+            hint = config.__hint_resolvers__[key](config)


please raise a clear configuration error when the key is not available. it should include the key name, the available hint resolvers, and the configuration type being resolved. the message should be clear enough that the user can fix the problem

steinitzu

I added an exception with a message for when the decorator is used without matching fields, e.g. the message:

The config spec ConfigWithInvalidDynamicType has dynamic type resolvers for fields: ['a', 'b', 'c'] but these fields are not defined in the spec. When using @resolve_type() decorator, Add the fields with 'Any' or another common type hint, example:
>>> class ConfigWithInvalidDynamicType(BaseConfiguration)
>>>     a: Any
>>>     b: Any
>>>     c: Any

I'm not clear on what you mean by the other cases. The standard missing-config-fields exception is raised as normal when e.g. aws credentials are missing.

rudolfix · 2023-05-26T07:26:00Z

it looks good now, thanks!
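The mechanism discussed in this thread can be illustrated with a minimal, self-contained sketch. It is not dlt's implementation: `resolve_type` here takes the field name as an argument, and `collect_hint_resolvers`, `resolved_hints`, `InvalidDynamicTypeError`, and the placeholder credential classes are all hypothetical. Only the `__hint_resolvers__` attribute name and the idea of declaring the field with `Any` come from the diff and the error message above.

```python
from typing import Any, Callable, Dict, get_type_hints


class InvalidDynamicTypeError(Exception):
    """Hypothetical error: a resolver names a field the spec does not declare."""


def resolve_type(field_name: str) -> Callable[[Callable[..., Any]], Callable[..., Any]]:
    """Mark a method as the dynamic type-hint resolver for `field_name`."""

    def decorator(func: Callable[..., Any]) -> Callable[..., Any]:
        func.__hint_for_field__ = field_name  # type: ignore[attr-defined]
        return func

    return decorator


def collect_hint_resolvers(cls: type) -> type:
    """Class decorator (assumption): gather marked methods into __hint_resolvers__."""
    resolvers: Dict[str, Callable[..., Any]] = {}
    for attr in vars(cls).values():
        field = getattr(attr, "__hint_for_field__", None)
        if field is not None:
            resolvers[field] = attr
    declared = get_type_hints(cls)
    missing = sorted(name for name in resolvers if name not in declared)
    if missing:
        # Name the spec and the offending fields so the user can fix the problem.
        raise InvalidDynamicTypeError(
            f"The config spec {cls.__name__} has dynamic type resolvers for "
            f"fields: {missing} but these fields are not defined in the spec."
        )
    cls.__hint_resolvers__ = resolvers  # type: ignore[attr-defined]
    return cls


def resolved_hints(config: Any) -> Dict[str, Any]:
    """Return the field hints, replacing static hints with dynamic ones."""
    hints = dict(get_type_hints(type(config)))
    for key in hints:
        if key in config.__hint_resolvers__:
            # Type hint for this field is created dynamically
            hints[key] = config.__hint_resolvers__[key](config)
    return hints


class FakeAwsCredentials:
    """Placeholder credentials type used only in this sketch."""


class FakeGcpCredentials:
    """Placeholder credentials type used only in this sketch."""


@collect_hint_resolvers
class DestinationConfig:
    bucket_url: str
    credentials: Any  # static placeholder; the real type depends on bucket_url

    def __init__(self, bucket_url: str) -> None:
        self.bucket_url = bucket_url

    @resolve_type("credentials")
    def resolve_credentials_type(self) -> type:
        if self.bucket_url.startswith("s3://"):
            return FakeAwsCredentials
        return FakeGcpCredentials
```

In this sketch the check for resolvers without a matching field runs once, when the class is defined, so a misconfigured spec fails before any configuration is resolved.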

dlt/destinations/filesystem/configuration.py

dlt/destinations/filesystem/filesystem.py

tests/common/configuration/test_credentials.py

tests/load/filesystem/conftest.py

tests/load/filesystem/test_filesystem_client.py

rudolfix · 2023-05-25T14:08:08Z

dlt/destinations/filesystem/filesystem.py

+    def create_merge_job(self, table_chain: Sequence[TTableSchema]) -> NewLoadJob:
+        return None
+
+    def complete_load(self, load_id: str) -> None:


I was looking at our use cases, and it would actually be useful to create a file here. The reason: when the file is present, we know that the load to the bucket was finalized and we can use the loaded data. You could probably execute LoadFileSystemJob with write_disposition: append; the file can be empty, and the table name is LOADS_TABLE_NAME (_dlt_loads)

steinitzu

63a97ec#diff-524215919bce1cfd1cd9bae6d4ae82c9f753d4b3ffb1215a418dd06277113a5dR107-R110
Will this do? It just creates an empty file named {schema_name}.{_dlt_loads}.{load_id}. I figure an extension and file ID are not needed since it's just a marker per load ID?

rudolfix · 2023-05-26T07:28:21Z

no, it is not needed. thx!
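The marker-file idea from this thread can be sketched as follows. This is a simplified illustration, not the code from 63a97ec: it writes to a local directory with `pathlib` instead of going through fsspec, and `completion_marker_name`, `is_load_complete`, and the `dataset_dir` parameter are hypothetical. The file name pattern and the `_dlt_loads` table name are taken from the comments above.

```python
import tempfile
from pathlib import Path

LOADS_TABLE_NAME = "_dlt_loads"


def completion_marker_name(schema_name: str, load_id: str) -> str:
    """Build the marker name: {schema_name}.{_dlt_loads}.{load_id}, no extension."""
    return f"{schema_name}.{LOADS_TABLE_NAME}.{load_id}"


def complete_load(dataset_dir: Path, schema_name: str, load_id: str) -> Path:
    """Create an empty marker file once all files of a load are in place."""
    marker = dataset_dir / completion_marker_name(schema_name, load_id)
    marker.touch()
    return marker


def is_load_complete(dataset_dir: Path, schema_name: str, load_id: str) -> bool:
    """A consumer checks for the marker before reading the loaded data."""
    return (dataset_dir / completion_marker_name(schema_name, load_id)).is_file()


# Small demonstration in a temporary directory (local filesystem only).
with tempfile.TemporaryDirectory() as tmp:
    dataset = Path(tmp)
    before = is_load_complete(dataset, "my_schema", "1685000000.123")
    created = complete_load(dataset, "my_schema", "1685000000.123")
    after = is_load_complete(dataset, "my_schema", "1685000000.123")
    marker_size = created.stat().st_size
```

Because the marker is written last, a reader that sees it can assume every data file for that load ID is already in the bucket.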

rudolfix reviewed May 20, 2023

steinitzu added 11 commits May 20, 2023 20:15

Pass load ID to file load jobs

ebfa44d

add dynamic type hints for config fields with resolve_type decorator

c7b4a4b

fsspec dependency (TODO groups)

f212554

Add gcsfs dependency

50dae8b

Basic fsspec destination supporting gcs and local filesystem

32ded25

Aws credentials and pass gcp credentials spec to gcsfs

2e5ce5f

Update lockfile

9832e22

merge wd fallback

849384b

Add load_id to start_file_load in tests

0bb29f0

Init destination client tests

0198fa0

Config tests, aws credentials match env vars

61b5629

steinitzu force-pushed the feat/fsspec-destination branch from 8f918c2 to 61b5629 Compare May 21, 2023 00:15

steinitzu and others added 9 commits May 20, 2023 20:18

Dependency extras gcs and s3

a1db9fc

Raise missing dependency exceptions

21c9172

Fix storage in tests

ac843d2

adds config.toml with test settings

0a26aa6

runs only filesystem tests - temporarily

4e1b097

correctly passes service credentials to fsspec

b0900a4

initializes test config provides early

7508ec8

fixes config tests to use only env provider

e6a81ed

renames gcs to cs in code and extras in pyproject

951f0b4

rudolfix requested changes May 23, 2023

installs filesystemextras in github workflows

1f1c489

rudolfix force-pushed the feat/fsspec-destination branch from bba25c2 to 1f1c489 Compare May 23, 2023 16:12

rudolfix and others added 4 commits May 23, 2023 18:16

adds gcp credentials to test filesystem dest

1463524

fixes telemetry test

235fc14

Let gcsfs resolve default credentials

3929ea5

Verify destination filename in test

c7407ed

steinitzu added 5 commits May 24, 2023 23:03

fs pipeline test with merge fallback and file contents verified

dbe63f8

Skipt filesystem in bigquery tests

ef5cfaa

Type import

aff2530

Fix pytest argument

78497a5

Depend on boto3 for aws credentials

e74df63

steinitzu force-pushed the feat/fsspec-destination branch from d115f6d to e74df63 Compare May 25, 2023 03:47

adds fsspect to main dependencies

827d3da

rudolfix requested changes May 25, 2023

rudolfix added 2 commits May 25, 2023 16:18

updates types

de6c259

fixes order of jobs in replace filesytem test

a0fe702

rudolfix force-pushed the feat/fsspec-destination branch from 14ec1b8 to a0fe702 Compare May 25, 2023 17:29

No filesystem extra dependency

3d43582

steinitzu force-pushed the feat/fsspec-destination branch from 0f6315e to 18c4a1e Compare May 25, 2023 21:36

Test windows posix paths

2597381

steinitzu force-pushed the feat/fsspec-destination branch from 18c4a1e to 2597381 Compare May 25, 2023 21:42

steinitzu added 4 commits May 25, 2023 21:08

Implement complete_load

63a97ec

Add exceptions when @resolve_type is used without matching field

13881fc

Run all load tests

0c242bc

Fix is_filesystem when no destination

4bfb268

steinitzu marked this pull request as ready for review May 26, 2023 03:58

rudolfix previously approved these changes May 26, 2023

rudolfix added 3 commits May 26, 2023 16:58

fixes extract pipe config section

551821b

increases pool size to 50 for dlt http adapter + adds this as argument

1e1fe00

allows for any filesystem to be used with fsspec, uses empty credentials for unknown + tests for memory fs.

609c446

rudolfix dismissed their stale review via 609c446 May 26, 2023 14:59

rudolfix added 3 commits May 26, 2023 17:41

uses posix paths in fsspec code

191edc0

fixes ls for memory f

4288bea

enables all tests

c530da4

rudolfix merged commit 00f94cb into devel May 26, 2023

rudolfix deleted the feat/fsspec-destination branch May 26, 2023 18:51
