
Generic destination / sink decorator #1065

Merged
merged 51 commits into devel from d#/generic_destination_decorator on Mar 14, 2024

Conversation

Collaborator

@sh-rp sh-rp commented Mar 7, 2024

Description

Implementation of #752

This is a resubmission of PR #891, which I could not reopen after an accidental merge.

Notes:

  • We also introduce the concept of a load package state here

ToDos for the alpha release:

  • Tests for locked injection context
  • Tests for resolved partial
  • Tests for migration from no load package state to having load package state

Follow up tasks:

  • Nice example of loading to BigQuery or Kafka with the generic destination
  • Update the CLI scaffolding scripts to create a sink function if "destination" is selected as the destination
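For orientation, a minimal sketch of the interface this PR introduces (hedged; the exact parameter names and types follow the tests and review discussion below, and may differ in the final code):

import dlt
from dlt.common.typing import TDataItems
from dlt.common.schema import TTableSchema

# a custom destination is just a decorated function receiving a batch of
# items together with the schema of the table they belong to
@dlt.destination(batch_size=10)
def my_sink(items: TDataItems, table: TTableSchema) -> None:
    print(f"received {len(items)} items for table {table['name']}")

# the decorated function is passed to a pipeline like any built-in destination
pipeline = dlt.pipeline("sink_test", destination=my_sink, full_refresh=True)
pipeline.run([{"id": 1}, {"id": 2}], table_name="items")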

Merge commits resolved conflicts in: dlt/destinations/__init__.py, tests/utils.py, dlt/pipeline/current.py, dlt/common/storages/load_package.py, dlt/destinations/impl/athena/athena.py, dlt/destinations/impl/bigquery/bigquery.py, dlt/load/load.py

netlify bot commented Mar 7, 2024

Deploy Preview for dlt-hub-docs canceled.

🔨 Latest commit: 246df72
🔍 Latest deploy log: https://app.netlify.com/sites/dlt-hub-docs/deploys/65f2ac5d1bc1930008b15772

@sh-rp
Collaborator Author

sh-rp commented Mar 7, 2024

@rudolfix current questions:

  • I have created a method to create a partial from a function decorated with "with_config". Can you let me know if this looks good? Then I can add some tests.
  • Can you show me an example of how a sink function with Google Cloud credentials should look? Do you want the spec passed as a single argument, or what user-facing interface do you have in mind? That will help me understand what you are after.

Generally you can have a look at my changes; I addressed most of the points from your last review.
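For context, a hedged sketch of the kind of function in question (with_config lives in dlt/common/configuration/inject.py per this PR's diffs; the section name here is illustrative):

import dlt
from dlt.common.configuration.inject import with_config
from dlt.common.configuration.specs import GcpServiceAccountCredentials

# with_config derives a SPEC from the signature; arguments defaulting to
# dlt.secrets.value are injected from secret providers at call time
@with_config(sections=("my_gcp_sink",))
def my_gcp_sink(items, table, credentials: GcpServiceAccountCredentials = dlt.secrets.value):
    ...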

@sh-rp sh-rp marked this pull request as ready for review March 7, 2024 11:03
Collaborator

@rudolfix rudolfix left a comment

pls see review. I'll have more comments, but let's fix those first. Overall this is so good.

dlt/common/configuration/inject.py (resolved)
dlt/common/reflection/spec.py (resolved)
dlt/common/storages/load_package.py (outdated, resolved)
@@ -334,8 +408,13 @@ def create_package(self, load_id: str) -> None:
        self.storage.create_folder(os.path.join(load_id, PackageStorage.COMPLETED_JOBS_FOLDER))
        self.storage.create_folder(os.path.join(load_id, PackageStorage.FAILED_JOBS_FOLDER))
        self.storage.create_folder(os.path.join(load_id, PackageStorage.STARTED_JOBS_FOLDER))
        # ensure created timestamp is set in state when load package is created
        state = self.get_load_package_state(load_id)
Collaborator

We must make this change backward compatible or explicitly upgrade all storages that hold load packages internally.

See the test test_pipeline_with_dlt_update and similar: they actually upgrade dlt from 0.3 to 0.4 and test packages in process.

My worry: people will have old completed load packages or packages in process, they upgrade dlt, and it stops working. So you need to test it.
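To illustrate the concern, a hedged sketch of the tolerant behavior being requested (the state file name and the empty-state fallback are assumptions for illustration, not the actual implementation):

import json
from pathlib import Path

def read_load_package_state(package_dir: Path) -> dict:
    """Tolerantly read a package's state: packages created before this PR
    carry no state file, so a missing file means 'empty state', not an error."""
    state_file = package_dir / "load_package_state.json"  # hypothetical file name
    try:
        return json.loads(state_file.read_text())
    except FileNotFoundError:
        return {}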

dlt/common/storages/load_package.py (outdated, resolved)
try:
    state_ctx = container[LoadPackageStateInjectableContext]
except ContextDefaultCannotBeCreated:
    raise Exception("Load package state not available")
Collaborator

Please derive the exception properly; see other similar exceptions.
Provide an explanation of why this happens, i.e. when you access this outside of a dlt.destination decorated function (we inject the package only in the loader; btw, we can start injecting it in normalize and extract as well).
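A hedged sketch of what a properly derived exception could look like (assuming DltException from dlt.common.exceptions as the base, per dlt's conventions; the class name is illustrative):

from dlt.common.exceptions import DltException

class LoadPackageStateNotAvailable(DltException):
    """Raised when load package state is accessed outside of an active load package."""

    def __init__(self) -> None:
        super().__init__(
            "Load package state is not available. It is injected only while a load "
            "package is being processed (currently in the loader), e.g. inside a "
            "dlt.destination decorated function."
        )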

Collaborator Author

done

    naming_convention: str = "direct",
    spec: Type[SinkClientConfiguration] = SinkClientConfiguration,
) -> Any:
    def decorator(destination_callable: TSinkCallable) -> TDestinationReferenceArg:
Collaborator

The decorator returns Destination, but the user should see the signature of the decorated function as the return type. We'll add __call__ to Destination to simulate it being a function; this is the same trick we do with DltResource.

I can do the above if you want.
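A hedged, generic sketch of the trick (not dlt's actual Destination class): the factory implements __call__ and forwards explicit arguments into its config kwargs, exactly as if they were passed to __init__:

from typing import Any

class Destination:
    """Sketch: a factory that is also callable like the function it wraps."""

    def __init__(self, **config_kwargs: Any) -> None:
        self.config_kwargs = config_kwargs

    def __call__(self, **explicit_config: Any) -> "Destination":
        # calling the factory yields a new factory with the explicit arguments
        # merged into its config, mirroring __init__
        return Destination(**{**self.config_kwargs, **explicit_config})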

Collaborator Author

Ah, now I think I get what you want here. Yes, why not; maybe you can add that one small change.

Collaborator Author

Actually, it is not quite clear to me what I should do here. I changed the decorator typings a bit and made it so a wrapped function is returned with the args in place (I think, at least), but maybe you need to make some changes here.


def __init__(
    self,
    destination_callable: t.Union[TSinkCallable, str] = None,  # noqa: A003
Collaborator

we can still configure this, right?

Collaborator Author

As in, give a function name / module path in a config var? Then yes, see the tests.

dlt/destinations/impl/destination/factory.py (resolved)
dlt/load/load.py (resolved)
add test for nested configs
@sh-rp
Collaborator Author

sh-rp commented Mar 7, 2024

@rudolfix thanks for the comments, will continue on this later. There is just one thing I'd need your input on: I added a lock on the method that resolves the partial. This is a solution for the problem that the sink function and the client can run in parallel threads, so the context for the generated spec can be injected simultaneously and produce mangled config context errors. The fixes for this could be:

  • Create a third affinity which is thread based and not pipeline based; this would be quite easy to do.
  • Allow locking when resolving config vars; I had a version before where you could say that it should lock in the with_config provider.
  • Have a setting that somehow prevents stacking of the same context type and retains the first one.

I think the best way would be a true thread-based affinity, but I'm not 100% sure; let me know.
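For illustration, a self-contained sketch of the per-function locking approach discussed here (names are illustrative; the func_id keying matches the snippet in the thread below):

import threading
from typing import Dict

# one reentrant lock per wrapped function: two threads injecting the context
# for the same generated spec serialize instead of mangling each other's config
_config_locks: Dict[int, threading.RLock] = {}
_locks_guard = threading.Lock()

def lock_for(func_id: int) -> threading.RLock:
    # func_id is id(f) of the wrapped function, as in the snippet below
    with _locks_guard:
        return _config_locks.setdefault(func_id, threading.RLock())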

lock injection context for wrapped functions
small pr fixes
    else:
        SPEC = spec

    if SPEC is None:
        return f

    func_id = id(f)
Collaborator Author

We use the id of the wrapped function as part of the key to create the thread lock. If the same function is wrapped multiple times, the keys will collide, so we could also consider a unique id.

Collaborator

It looks like we do not need it; we can just pass a flag to create a lock on a thread context.

@@ -58,6 +65,7 @@ def with_config(
    include_defaults: bool = True,
    accept_partial: bool = False,
    initial_config: Optional[BaseConfiguration] = None,
Collaborator Author

Should we have a "thread_safe" parameter so we only lock on decorated functions where we know there might be issues?

Collaborator

I think we will always lock when resolving configuration, so not optional.

Collaborator

@rudolfix rudolfix left a comment

OK, we are really close here! The changes seem easy to make, and test coverage is good. My take: implement the tests you mention:

  • We should add tests for the inject context with lock (when you fix the code) where we make sure two threads cannot be in the critical section at the same time.
  • Test that we actually resolve the config once in create_resolved_partial (and possibly can pass an external one; see my review).
  • Add a test for the callable destination factory.

After that we can merge and do the alpha release. A dlt migration test for load packages is a must before the stable release.
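A hedged sketch of the first test (a generic lock stands in for the injection-context lock; the real test would exercise with_config directly):

import threading
import time

def test_lock_serializes_injection() -> None:
    in_section = 0
    max_concurrent = 0
    lock = threading.RLock()  # stand-in for the injection context lock

    def resolve_with_lock() -> None:
        nonlocal in_section, max_concurrent
        with lock:
            in_section += 1
            max_concurrent = max(max_concurrent, in_section)
            time.sleep(0.05)  # widen the window so any overlap becomes visible
            in_section -= 1

    threads = [threading.Thread(target=resolve_with_lock) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # with the lock in place, the two threads never overlap in the critical section
    assert max_concurrent == 1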

@@ -2,6 +2,8 @@
import inspect
import contextlib
import dataclasses
import threading
Collaborator

?


    nonlocal config

    # Do we need an exception here?
    if spec_arg and spec_arg.name in kwargs:
Collaborator

I do not think we care at all? spec_arg did its job in resolve_config and created the config from itself, right? If it is present in subsequent calls we can simply ignore it. (Or am I missing something?)

Collaborator Author

Yeah, but in the case of partial resolving, resolve_config is called before the function args are even available, do you know what I mean? It's a bit complicated but also quite clear. We can also remove it if you like; it's just a warning that made sense given the mechanics of the code.

    lock: AbstractContextManager[Any]

    # if there is a lock_id, we need a lock for the lock_id in the scope of the current context
    if lock_id:
Collaborator

We just need a boolean flag to create the lock in the given thread context, as discussed.

        return self._spec

    @property
    def client_class(self) -> t.Type["SinkClient"]:
Collaborator

To make it work with the sink decorator, it should implement __call__ so the factory behaves like a function. The arguments should be passed to config_kwargs exactly like in the init method.

We should also test it. If I have:

@dlt.destination()
def my_gcp_sink(
    file_path,
    table,
    credentials: Union[GcpOAuthCredentials, GcpServiceAccountCredentials] = dlt.secrets.value,
    use_stream_insert: bool = False,
):

I should be able to

sink = my_gcp_sink(credentials=GcpCredentials(...), use_stream_insert=True)

and then pass it to dlt.pipeline as destination.
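Spelled out, the intended end-to-end usage would look something like this (continuing the sketch above; the pipeline name, table name, and data are illustrative):

import dlt

data = [{"id": 1}, {"id": 2}]  # stand-in for any iterable of items
pipeline = dlt.pipeline("gcp_sink_pipeline", destination=sink)
pipeline.run(data, table_name="events")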

Collaborator Author

I did not even need to implement the call function; I just forward the kwargs to the destination and then use the pre-resolved config in the create_resolved_partial function (see the code; I have a test for this too). Very nice!

        super().__init__(schema, config)
        self.config: SinkClientConfiguration = config
        # create pre-resolved callable to avoid multiple config resolutions during execution of the jobs
        self.destination_callable = create_resolved_partial(self.config.destination_callable)
Collaborator

Here you should pass config to create_resolved_partial; config is at this point already resolved via configuration() of the factory. So maybe create_resolved_partial could take an optional resolved-config parameter that is used instead of resolving inside create_resolved_partial (the inner one)?
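A hedged sketch of the suggested shape (generic code, not dlt's actual implementation; resolve_config_for is a hypothetical stand-in for the inner resolution):

from functools import partial
from typing import Any, Callable, Optional

def resolve_config_for(f: Callable[..., Any]) -> Any:
    """Stand-in for dlt's actual config resolution machinery."""
    raise NotImplementedError

def create_resolved_partial(
    f: Callable[..., Any], config: Optional[Any] = None
) -> Callable[..., Any]:
    # resolve once, or reuse a config the factory has already resolved,
    # so running the jobs never triggers a second resolution
    resolved = config if config is not None else resolve_config_for(f)
    return partial(f, config=resolved)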

Collaborator Author

this works, nice!

@@ -197,3 +253,10 @@ def last_config(**kwargs: Any) -> Any:


def get_orig_args(**kwargs: Any) -> Tuple[Tuple[Any], DictStrAny]:
    return kwargs[_ORIGINAL_ARGS]  # type: ignore


def create_resolved_partial(f: AnyFun) -> AnyFun:
Collaborator

Could you rename this or the inner function with the same name? I'm conflating them even while doing the review...

dlt/load/load.py (outdated)
@@ -1,4 +1,4 @@
-import contextlib
+import contextlib, threading
Collaborator

We do not need threading here; unused import.



@pytest.mark.parametrize("loader_file_format", SUPPORTED_LOADER_FORMATS)
def test_all_datatypes(loader_file_format: TLoaderFileFormat) -> None:
Collaborator

Would it be possible to move all tests that do not require BigQuery or special credentials to tests/destinations? Then we can run them in the common tests:

- name: Install pipeline dependencies
  run: poetry install --no-interaction -E duckdb -E cli -E parquet --with sentry-sdk --with pipeline

- run: |
    poetry run pytest tests/extract tests/pipeline tests/libs tests/cli/common tests/destinations

and those tests are runnable from forks, which is a big win.

Collaborator Author

I removed the dependency on the Google stuff and moved the tests into the common test area.

)

# we may give the value via the __call__ function
dlt.pipeline("sink_test", destination=my_sink(my_val="something"), full_refresh=True).run(
Collaborator Author

fyi: callable my_sink is tested here

rudolfix previously approved these changes Mar 13, 2024
Collaborator

@rudolfix rudolfix left a comment

LGTM!

I'm not entirely sure that calling the former sinks GenericDestination is what we want; maybe DecoratedDestination? Anyway, let's merge this and write a follow-up ticket.

@sh-rp sh-rp merged commit 7f43e76 into devel Mar 14, 2024
52 of 57 checks passed
@sh-rp sh-rp deleted the d#/generic_destination_decorator branch March 14, 2024 08:43