implement destination sinks #752
Comments
Optionally, sinks could be chained after the main destination consumes data, and not just replace it. There's surely a clever way of doing that without complicating the user interface.
VioletM added the `support` (This issue is monitored by Solution Engineer) and `community` (This issue came from slack community workspace) labels on Feb 26, 2024.
@rudolfix what do you mean by "filedict"? In my current PR I am simply passing the full filepath as a string into the sink function.
Background
Sinks are the reverse of a source. A Python function decorated with `@dlt.sink` should generate a destination factory (#746) that may be passed to a pipeline in the `destination` argument. The sink will consume data that is sent to it during the `load` stage. It will be fed data in configurable chunks and will keep state of what was already consumed, so it will be "transactional" when the chunk size == 1.
Sink state is separate from source state but will initially be handled together with it and preserved in a file. For "transactional" sinks there will be tons of state updates, so the general mechanism we use now (store state at the destination) does not make sense. We may support bucket storage for `load.py` and load storage for fully resumable, transactional sinks.

Tasks
This is rather a big ticket; the tasks below are just an outline:
- `@dlt.sink` decorator working similarly to `@dlt.transformer`. It returns a configured destination factory.
- The sink function takes `TDataItems` first, second a table schema (and maybe something else), and optional parameters (same as a parametrized transformer).
- Configuration injected via `DestinationClientConfiguration` (a trivial change in `with_config` may be needed).
- `dlt.current.load_package`, which should at least contain the load id and a state dict (I think we have it).

Implementation
The basic implementation may be quite simple. The sink destination implements a copy job that opens a job file and then streams it into the sink function in the chunks that were configured. After each chunk it emits sink state. On restart the state must be retrieved and already-processed rows must be skipped.
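A minimal sketch of that copy-job loop, assuming a plain list of rows and a dict for sink state (the real job and state interfaces are not defined yet, so every name here is illustrative):

```python
from typing import Callable, Dict, List


def run_sink_job(
    rows: List[dict],
    sink_fn: Callable[[List[dict]], None],
    state: Dict[str, int],
    chunk_size: int = 1,
) -> None:
    """Stream rows into sink_fn in chunks, recording progress after
    each chunk so a restart can skip rows that were already consumed."""
    start = state.get("rows_done", 0)  # skip processed rows on restart
    pending = rows[start:]
    for i in range(0, len(pending), chunk_size):
        chunk = pending[i : i + chunk_size]
        sink_fn(chunk)
        # Emit sink state after each chunk; with chunk_size == 1 this
        # makes the job "transactional" at row granularity.
        state["rows_done"] = start + i + len(chunk)


# Usage: simulate resuming after a partial run that finished 2 rows.
consumed: List[dict] = []
state: Dict[str, int] = {"rows_done": 2}
rows = [{"id": n} for n in range(5)]
run_sink_job(rows, consumed.extend, state, chunk_size=1)
# consumed holds only rows 2..4; state["rows_done"] ends at 5
```

In the real implementation the state write after each chunk would have to be durable (that is what makes the frequent-update concern above real), not just a dict mutation as in this sketch.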
State handling could be part of `LoadStorage`, and we may make `LoadStorage` compatible with fsspec (maybe in separate tickets; it depends a lot on "file move" being fast and atomic).