Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add end_value support to incremental #467

Merged
merged 3 commits into from
Jul 8, 2023
Merged

Conversation

steinitzu
Copy link
Collaborator

@steinitzu steinitzu commented Jul 1, 2023

Implements chunked loading support described in dlt-hub/verified-sources#197 (comment)

Usage e.g.

import dlt
from dlt.common import pendulum


@dlt.resource
def some_data(
    updated: dlt.sources.incremental[pendulum.DateTime] = dlt.sources.incremental(
        'updated', initial_value=pendulum.DateTime(2000, 1, 1)
    )
):
    ...


# Backfill a time range

some_data(updated=dlt.sources.incremental(
    initial_value=pendulum.DateTime(2023, 1, 1), end_value=pendulum.DateTime(2023, 1, 2)
))

When end_value is reached the generator is stopped and incremental state gets updated with either end_value or previous last_value (whichever is higher (or lower, etc, depending on last_value_func)).
So it's possible to load chunks in any order and then stop sending end_value arg to continue incremental load and vice versa.

Decided on using initial_value instead of adding another "start" argument. When end_value is supplied, the initial acts as an override.

@netlify
Copy link

netlify bot commented Jul 1, 2023

Deploy Preview for dlt-hub-docs ready!

Name Link
🔨 Latest commit 5f31a42
🔍 Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/64a9955c5671ce000799a60a
😎 Deploy Preview https://deploy-preview-467--dlt-hub-docs.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

dlt/extract/incremental.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR looks good but the whole concept raises some questions:

  1. when having a concrete chunk to load we should IMO not modify the state at all. we may read the state to get ie. last_value or initial_value but I think it's better if we require initial value to be always present when end value is present and always create a mock state
  2. this let's you to run several chunks in parallel. the state is never written to (and possibly also not read)
  3. the incremental loading will start when user calls the resource without passing end date. it is user responsibility to pass the correct initial value (which is the highest end date used by back loading). interestingly the incremental load also can happen in parallel

what do you think?

@steinitzu
Copy link
Collaborator Author

PR looks good but the whole concept raises some questions:

1. when having a concrete chunk to load we should IMO not modify the state at all. we may read the state to get ie. last_value or initial_value but I think it's better if we require initial value to be always present when end value is present and always create a mock state

2. this let's you to run several chunks in parallel. the state is never written to (and possibly also not read)

3. the incremental loading will start when user calls the resource without passing end date. it is user responsibility to pass the correct initial value (which is the highest end date used by back loading). interestingly the incremental load also can happen in parallel

what do you think?

Yeah, I agree.
The only benefit is you don't need to forward the last end_value to initial_value. Incremental just works after.
But that's not such a big deal compared to the weirdness you get from a bunch of tasks overwriting eachother's state.

And 👍 on requiring initial value. Should error also when initial value is higher, i.e. last_value_func(end_value, initial_value) == initial_value

@steinitzu
Copy link
Collaborator Author

@rudolfix updated like this in last commit. Using dummy state dict for this, and added a test to make sure the state isn't updated.

steinitzu and others added 3 commits July 8, 2023 18:56
* Stop the generator when end_value is reached
* Override last_value with initial_value while loading end_value
* Add `Incremental.merge` method for simpler overrides
@rudolfix rudolfix force-pushed the feat/incremental-backloading branch from 9fb9971 to 5f31a42 Compare July 8, 2023 16:56
Copy link
Collaborator

@rudolfix rudolfix left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

excellent! I've added a short paragraph on end_value to our docs.

@rudolfix rudolfix merged commit f7ddc7c into devel Jul 8, 2023
@rudolfix rudolfix deleted the feat/incremental-backloading branch July 8, 2023 18:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants