Fix incremental hitting end_value throwing out whole batches #495

steinitzu · 2023-07-14T18:46:28Z

Raise a custom StopGenerator exception after filter so whole batch isn't thrown out.
StopIteration turns into a RuntimeError when raised from generator so use custom exception instead.
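A minimal sketch of the behaviour described above, not the code from this PR. Since Python 3.7 (PEP 479), a `StopIteration` raised inside a generator body is converted into a `RuntimeError`, whereas a custom exception propagates unchanged. The `StopGenerator` class below is a hypothetical stand-in that only shares its name with the exception mentioned in the description; the helper functions are illustrative as well.

```python
class StopGenerator(Exception):
    """Hypothetical stand-in for the custom exception named in the PR description."""


def stop_with_stop_iteration(rows, end_value):
    for row in rows:
        if row >= end_value:
            # PEP 479: this surfaces to the caller as RuntimeError
            raise StopIteration
        yield row


def stop_with_custom_exception(rows, end_value):
    for row in rows:
        if row >= end_value:
            # A custom exception reaches the caller unchanged
            raise StopGenerator
        yield row


def drain(gen):
    """Collect items and report the name of the exception that ended iteration, if any."""
    items = []
    try:
        for item in gen:
            items.append(item)
    except Exception as exc:
        return items, type(exc).__name__
    return items, None


result_stop_iteration = drain(stop_with_stop_iteration([1, 2, 3, 4], 3))
result_custom = drain(stop_with_custom_exception([1, 2, 3, 4], 3))
print(result_stop_iteration)  # ([1, 2], 'RuntimeError')
print(result_custom)          # ([1, 2], 'StopGenerator')
```

In both cases the items yielded before the stop signal are still delivered; the difference is only in which exception the caller has to catch.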

netlify · 2023-07-14T18:46:33Z

✅ Deploy Preview for dlt-hub-docs canceled.

Name	Link
🔨 Latest commit	`d1dce53`
🔍 Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/64b5fb9466aeb50008cb3b64

rudolfix · 2023-07-15T16:19:52Z

dlt/extract/incremental.py

@@ -236,7 +240,8 @@ def transform(self, row: TDataItem) -> bool:
        if self.end_value is not None and (


what I would do instead of this PR:

    # Check whether end_value has been reached
    if self.end_value is not None and (
        self.last_value_func((row_value, self.end_value)) != self.end_value
    ):
        return False

so we just return False when we are out of range. we do the same with the start_value - we do not close the gen.
why

we do not assume that list is ordered

if someone wants to exit earlier s/he can request data properly from the endpoint

I removed row_value == self.end_value because we should not compare row_value with end/start values. that should happen only after processing by last_value_func. as a result we process data inclusively. which IMO is OK

we can add a special exception to close the pipe from outside the generator, but only once we have a first real case :)
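As a rough illustration of the suggestion above (a standalone sketch, not the actual `transform` method in `dlt/extract/incremental.py`), the function below applies the same comparison outside the class. The name `in_range` and its parameters are assumptions made for this example. Because each row is judged on its own, the input does not have to be ordered, and a row equal to `end_value` is kept (inclusive).

```python
from typing import Any, Callable, Iterable, Optional


def in_range(
    row_value: Any,
    end_value: Optional[Any],
    last_value_func: Callable[[Iterable[Any]], Any] = max,
) -> bool:
    """Return False when row_value lies beyond end_value, True otherwise."""
    if end_value is not None and (
        last_value_func((row_value, end_value)) != end_value
    ):
        return False
    return True


unordered = [5, 1, 9, 3, 7, 2]

# Ascending cursor (max): values above 7 are skipped, the rest are kept
kept_max = [v for v in unordered if in_range(v, end_value=7)]
print(kept_max)  # [5, 1, 3, 7, 2]

# Descending cursor (min): values below 3 are skipped
kept_min = [v for v in unordered if in_range(v, end_value=3, last_value_func=min)]
print(kept_min)  # [5, 9, 3, 7]
```

Returning `False` only drops the single row, so the generator keeps running and later in-range rows in the same batch are not lost.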

Hmm, yes, maybe the "stop generator" thing could be better thought out, and should at least be opt-in for sources we know are ordered.

It makes a lot of assumptions the way it is now, also assumes that you have no steps that need to execute after yielding the records.

I like it for endpoints that don't have any "end value" filter. A "start" param and ascending order is pretty common, but there's not always an end param.
So this "stop" logic is code that doesn't need to be added to sources specifically, you can just throw this into most existing incremental sources.

What if we have something like an incremental.end_value_reached property?
Resources that know better can ignore it, but ones that are ordered can check and make use of it as a "stop" signal.

hmmm interesting. this way we could help the resources that have ordered data. I actually have this case and it is github which is incremental decreasing so I could use some kind of incremental.is_out_of_range.

    for page in _get_rest_pages(access_token, repos_path + "?per_page=100"):
        yield page
        # stop requesting pages if the last element was already older than initial value
        # note: incremental will skip those items anyway, we just do not want to use the api limits
        if page and page[-1]["created_at"] < last_created_at.initial_value:
            # do not get more pages, we overlap with previous run
            print(
                f"Overlap with previous run created at {last_created_at.initial_value}"
            )
            break

still I'm not sure that it is really worth it:

case with min function and start_value

case with max function and end_value

any other function - we do not know what kind of order it represents so probably we can't set it

we possibly are adding a lot of code that will be executed for each single item in the iterator, not sure we gain so much to really bother

Yes, similar to what I did with pipedrive leads as well. Fetch in descending order and check every page, https://github.com/dlt-hub/verified-sources/blob/master/sources/pipedrive/__init__.py#L173-L199

Imo best would be to add both start/end_out_of_range flags and set them anyway on the first out of range item (the places we return False from the filter). The only extra cost is one self.something = True assignment. Wouldn't add any other checks for which last value func, etc.
Just document what they mean and that they're invalid for unordered results.

But for now, should we put a pin in this? Maybe best to revisit after we implement a few more use cases with end value?
(will remove the stop generator stuff too)
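The flag idea from this comment could look roughly like the sketch below. `RangeFilter`, `load_pages`, and `fake_pages` are hypothetical names invented for the example; this is not dlt's `Incremental` class, and the comparisons here are inclusive at both ends for simplicity. The flags are set at the points where the filter returns `False`, and a resource that knows its pages arrive in ascending order may use `end_out_of_range` as a stop signal.

```python
class RangeFilter:
    """Hypothetical, simplified range filter with out-of-range flags."""

    def __init__(self, start_value=None, end_value=None, last_value_func=max):
        self.start_value = start_value
        self.end_value = end_value
        self.last_value_func = last_value_func
        # Only meaningful for ordered results; set on the first out-of-range item
        self.start_out_of_range = False
        self.end_out_of_range = False

    def __call__(self, row_value) -> bool:
        if self.end_value is not None and (
            self.last_value_func((row_value, self.end_value)) != self.end_value
        ):
            self.end_out_of_range = True
            return False
        if self.start_value is not None and (
            self.last_value_func((row_value, self.start_value)) != row_value
        ):
            self.start_out_of_range = True
            return False
        return True


def load_pages(pages, range_filter):
    """Yield the in-range items of each page; stop fetching after crossing end_value."""
    for page in pages:
        yield [v for v in page if range_filter(v)]
        if range_filter.end_out_of_range:
            # Valid only because the pages are assumed to be in ascending order
            break


fetched = []


def fake_pages():
    """Stand-in for a paginated API; records which pages were requested."""
    for page in ([1, 2, 3], [4, 5, 6], [7, 8, 9]):
        fetched.append(page)
        yield page


flt = RangeFilter(start_value=2, end_value=5)
loaded = [item for batch in load_pages(fake_pages(), flt) for item in batch]
print(loaded)   # [2, 3, 4, 5]
print(fetched)  # [[1, 2, 3], [4, 5, 6]]
```

The page that crosses `end_value` still contributes its in-range items (4 and 5), and the third page is never requested. A resource with unordered results would simply ignore the flags.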

OK! please set this flag whenever an item was filtered out due to being out of range so it works both for end and start value (I can use it for github then) - I hope this makes sense. and drop the StopIteration. please document it in the code.

Added both flags and wrote a section in the docs too. Hope it's not too confusing.

I kept the last_value_func((row_value, )) == self.end_value check as well, so the range is exclusive at the end. Imo that makes this a lot more convenient for chunked loading, so you can chain start, end ranges: (a, b), (b, c), (c, d), ... with no overlap
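To illustrate why an exclusive end makes chained ranges convenient, here is a small sketch using plain integers. The helper names are made up for the example and the comparison operators stand in for the `last_value_func` logic, so this shows the interval arithmetic only, not dlt's implementation.

```python
def load_chunk(values, start, end):
    """Keep values with start <= value < end (inclusive start, exclusive end)."""
    return [v for v in values if start <= v < end]


def load_chunk_inclusive(values, start, end):
    """Keep values with start <= value <= end (inclusive at both ends)."""
    return [v for v in values if start <= v <= end]


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
boundaries = [1, 4, 7, 10]
ranges = list(zip(boundaries, boundaries[1:]))  # (1, 4), (4, 7), (7, 10)

chunks = [load_chunk(data, a, b) for a, b in ranges]
print(chunks)  # [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

overlapping = [load_chunk_inclusive(data, a, b) for a, b in ranges]
print(overlapping)  # [[1, 2, 3, 4], [4, 5, 6, 7], [7, 8, 9]]
```

With the half-open ranges every value is loaded exactly once; with inclusive ends the boundary values 4 and 7 are loaded twice.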


rudolfix

LGTM! I hope our Incremental class is (almost) complete now - it is becoming really complex :)

rudolfix · 2023-07-18T10:58:53Z

@steinitzu I will do a pre-release of dlt when this is merged so we can update the zenpy source

…#495)

* Fix incremental hitting end_value throwing out whole batches

  Raise a custom StopGenerator exception after filter so whole batch isn't thrown out. StopIteration turns into a RuntimeError when raised from generator so use custom exception instead.
* Test with 2 runs
* Remove "stop generator" exception, add start/end_out_of_range flags
* Document start/end_of_range usage and add + backloading info
* Test out_of_range flags
* Typo
* Range inclusive at start, more tests

steinitzu requested a review from rudolfix July 14, 2023 19:04

steinitzu mentioned this pull request Jul 14, 2023

Zendesk backload and custom fields dlt-hub/verified-sources#215

Merged

1 task

rudolfix requested changes Jul 15, 2023


steinitzu added 2 commits July 17, 2023 11:26

Fix incremental hitting end_value throwing out whole batches

d320336

Raise a custom StopGenerator exception after filter so whole batch isn't thrown out. StopIteration turns into a RuntimeError when raised from generator so use custom exception instead.

Test with 2 runs

0eee10e

steinitzu force-pushed the fix/incremental-end-value-with-batches branch from 4f8b48c to cca6d6b Compare July 18, 2023 00:33

steinitzu added 4 commits July 17, 2023 20:36

Remove "stop generator" exception, add start/end_out_of_range flags

48f2bf8

Document start/end_of_range usage and add + backloading info

6693f34

Test out_of_range flags

f8d81a1

Typo

eb15918

steinitzu force-pushed the fix/incremental-end-value-with-batches branch from 4e6d38a to eb15918 Compare July 18, 2023 00:37

Range inclusive at start, more tests

d1dce53

rudolfix approved these changes Jul 18, 2023


rudolfix merged commit 7fa667b into devel Jul 18, 2023
36 checks passed

rudolfix deleted the fix/incremental-end-value-with-batches branch July 18, 2023 12:05


rudolfix Jul 15, 2023

rudolfix Jul 15, 2023

steinitzu Jul 15, 2023

steinitzu Jul 15, 2023

rudolfix Jul 15, 2023

steinitzu Jul 16, 2023 •


rudolfix Jul 16, 2023

steinitzu Jul 18, 2023

rudolfix left a comment

rudolfix commented Jul 18, 2023
