implement destination sinks #752

Closed
rudolfix opened this issue Nov 10, 2023 · 3 comments · Fixed by #891
Labels: community (this issue came from the Slack community workspace), support (this issue is monitored by a Solution Engineer)

Comments

rudolfix (Collaborator) commented Nov 10, 2023

Background
Sinks are the reverse of a source. A Python function decorated with @dlt.sink should generate a destination factory (#746) that may be passed to a pipeline via the destination argument. The sink will consume data sent to it during the load stage.
The sink will be fed data in configurable chunks and will keep state of what has already been consumed, so it will be "transactional" when the chunk size == 1.
Sink state is separate from source state but will initially be handled together with it and preserved in a file. For "transactional" sinks there will be tons of state updates, so the general mechanism we use now (storing state at the destination) does not make sense. We may support bucket storage for load.py and load storage for fully resumable, transactional sinks.
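
For illustration, a minimal sketch of the proposed interface, assuming the @dlt.sink decorator name and signature described in this issue (a proposal, not a final API):

```python
import dlt
from dlt.common.typing import TDataItems
from dlt.common.schema import TTableSchema

# hypothetical decorator name from this proposal
@dlt.sink(batch_size=10)
def my_sink(items: TDataItems, table: TTableSchema) -> None:
    # called with chunks of up to `batch_size` rows for the given table
    for item in items:
        print(f"{table['name']}: {item}")

# the decorated function acts as the destination factory for the pipeline
pipeline = dlt.pipeline("sink_demo", destination=my_sink)
pipeline.run([{"id": 1}, {"id": 2}], table_name="events")
```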

Tasks
This is a rather big ticket; the tasks below are just an outline:

    • implement the sink destination (like any other destination) with a proper destination factory
    • implement the @dlt.sink decorator, working similarly to @dlt.transformer; it returns a configured destination factory
    • the sink decorator should accept a sink function whose first argument is TDataItems and whose second is a table schema (and maybe something else), plus optional parameters (same as a parametrized transformer)
    • the sink decorator should allow specifying the batch size and file format (json, parquet): batch_size=1 sends single rows, >1 sends lists of objects (or arrow tables if parquet), and batch_size=0 sends the FileDict as the data item so the user can handle file processing as they want
    • the sink should accept the naming convention to be used (a lot of people will use direct)
    • we should decorate the function properly and allow for additional parameters, which become the SPEC of this destination and are used as such (ideally we would derive this spec from DestinationClientConfiguration; a trivial change in with_config may be needed) (see the sketch after this list)
    • expose the current load package context via dlt.current.load_package, which should at least contain the load id and a state dict (I think we have it)
    • (optional) allow for 3 callbacks in the decorator: on_init (called on schema update), on_completed (called on completion; mind the abort flag), and on_garbage_collected (called when the factory is garbage collected, allowing for some kind of cleanup)
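
A rough sketch of how these decorator parameters might combine; the parameter names (loader_file_format, naming_convention) and the config/secrets injection pattern are assumptions based on this outline, not a confirmed API:

```python
import dlt
from dlt.common.typing import TDataItems
from dlt.common.schema import TTableSchema

# parameter names below are assumptions based on this outline
@dlt.sink(batch_size=100, loader_file_format="parquet", naming_convention="direct")
def stream_to_api(
    items: TDataItems,
    table: TTableSchema,
    api_url: str = dlt.config.value,   # resolved from config, like a parametrized transformer
    api_key: str = dlt.secrets.value,  # resolved from secrets; both would become part of the SPEC
) -> None:
    # with parquet and batch_size > 1, `items` would arrive as arrow tables
    ...
```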

Implementation
Basic implementation may be quite simple. The sink destination implements a copy job that opens a job file and streams it into the sink function in the configured chunks. After each chunk it emits the sink state.

On restart, the state must be retrieved and already-processed rows must be skipped (see the sketch below).

State handling could be part of LoadStorage, and we may make LoadStorage compatible with fsspec (maybe in separate tickets; it depends a lot on "file move" being fast and atomic).
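
A minimal sketch of the resumable copy job described above; the state dict and helpers are hypothetical (the state would be persisted between runs, e.g. by LoadStorage), and only the skip-on-restart logic is the point:

```python
from typing import Callable, Iterable, Iterator, List

def chunked(rows: Iterable[dict], size: int) -> Iterator[List[dict]]:
    # yield lists of up to `size` rows
    buf: List[dict] = []
    for row in rows:
        buf.append(row)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf

def run_copy_job(
    rows: Iterable[dict],
    sink_fn: Callable[[List[dict]], None],
    state: dict,  # hypothetical: persisted between runs by the load storage
    batch_size: int = 10,
) -> None:
    done = state.get("chunks_completed", 0)
    for idx, chunk in enumerate(chunked(rows, batch_size)):
        if idx < done:
            continue  # already consumed in a previous run, skip on restart
        sink_fn(chunk)
        state["chunks_completed"] = idx + 1  # emit state after each chunk
```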

rudolfix (Collaborator, Author) commented:

Optionally, sinks could be chained after the main destination consumes the data, rather than just replacing it. There is surely a clever way of doing that without complicating the user interface.

sh-rp (Collaborator) commented Jan 12, 2024

Questions:

  • Should we support awaitable sink functions?
  • Should we add a flag to guarantee sink function execution on the main thread for each item, probably using multiprocessing queues?
  • I don't get what the DltSource interaction should be; we should talk about it :)
  • Where should we store the pipeline and the schemas, plus the execution state of the current load package (if at all)?

Other ideas:

  • Should we send sentinel values as an extra function parameter on special occasions, for example at the start and end of the load (or the start and end of a file)? This would be useful for creating and destroying database or queue connections.
  • Should we allow the user to return certain sentinel values to indicate that a batch should be retried, that the load should fail, or something like that?
  • Should we provide a context object (a simple dictionary) that can carry over state between sink function calls? It would be useful for storing open database or queue connections. It could also be one context object per thread, for example, so we can enable multithreaded processing (see the sketch below).
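
A sketch of the per-thread context idea using threading.local; open_connection and its Conn class are hypothetical stand-ins for a real database or queue client:

```python
import threading

_local = threading.local()

def open_connection():
    # hypothetical stand-in for opening a real database or queue connection
    class Conn:
        def write(self, table_name, items):
            print(f"writing {len(items)} items to {table_name}")
    return Conn()

def get_context() -> dict:
    # lazily create one context dict per worker thread
    if not hasattr(_local, "ctx"):
        _local.ctx = {}
    return _local.ctx

def sink_fn(items, table):
    ctx = get_context()
    if "conn" not in ctx:
        ctx["conn"] = open_connection()  # opened once per thread, then reused
    ctx["conn"].write(table["name"], items)
```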

sh-rp (Collaborator) commented Mar 4, 2024

@rudolfix what do you mean by "FileDict"? In my current PR I am simply passing the full file path as a string into the sink function.

sh-rp closed this as completed in #891 on Mar 7, 2024.