🎉 Source Salesforce: add checkpointing #24888
Conversation
(Resolved review comment on airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py)
I have a few high-level questions about our existing approach and how it relates to the new slice behavior. The new sliced queries make sense, but I have some general concerns about how we still swallow rate limit errors and return a successful sync, which will be compounded now that we do checkpointing.
```diff
@@ -633,36 +647,62 @@ def get_updated_state(self, current_stream_state: MutableMapping[str, Any], late

 class BulkIncrementalSalesforceStream(BulkSalesforceStream, IncrementalRestSalesforceStream):
+    STREAM_SLICE_STEP = 120
```
How did we arrive at the decision to slice over 120-day windows?
There is still a risk that we will drop records even with the slicing. Alex details this pretty well here: https://airbytehq-team.slack.com/archives/C04UY2A9Z53/p1680717935424479?thread_ts=1680710973.672509&cid=C04UY2A9Z53. The larger the window, the higher the chance that we hit a rate limit midway through a slice and checkpoint with records missing.
If we make it smaller, the integration tests will run for about 4 hours until the process gets killed. While testing we could immediately see that a small step leads to a big performance issue. Another disadvantage is that the number of queries also increases. So after some testing I decided to aim for ~3 checkpoints per year, because that is better than one checkpoint at the end of the synchronization, and the performance is not so bad.
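For illustration, here is a minimal sketch (not the connector's actual code; `count_slices` is a hypothetical helper) of how the step size translates into checkpoint frequency:

```python
import pendulum

STREAM_SLICE_STEP = 120  # days per slice, i.e. per checkpoint

def count_slices(start: pendulum.DateTime, end: pendulum.DateTime, step_days: int) -> int:
    """Count how many date-window slices (checkpoints) a sync would produce."""
    slices, cursor = 0, start
    while cursor < end:
        cursor = cursor.add(days=step_days)
        slices += 1
    return slices

# One year of data with a 120-day step yields 4 slices, i.e. roughly 3-4
# checkpoints per year, matching the trade-off described above.
start = pendulum.datetime(2022, 1, 1, tz="UTC")
end = pendulum.datetime(2023, 1, 1, tz="UTC")
print(count_slices(start, end, STREAM_SLICE_STEP))  # -> 4
```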
That's a fair point, and I trust your analysis of the impact on performance. Can you please add a comment in the code with what you just mentioned, so there is future context on how this number was determined?
nit: we could make this configurable with an optional parameter in the spec
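A hypothetical sketch of what that could look like on the connector side, assuming an optional `stream_slice_step` key in the spec (the key name is an assumption, not part of this PR):

```python
# Hypothetical: read an optional `stream_slice_step` from the connector config,
# falling back to the 120-day default discussed above.
DEFAULT_STREAM_SLICE_STEP = 120

def get_slice_step(config: dict) -> int:
    return int(config.get("stream_slice_step", DEFAULT_STREAM_SLICE_STEP))
```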
Can we include these changes in the further refactoring, together with the changes requested in this #24888 (comment)? I will do it as part of common improvements to this connector.
```diff
 if error.response.status_code == codes.FORBIDDEN and error_code == "REQUEST_LIMIT_EXCEEDED":
-    logger.warn(f"API Call limit is exceeded. Error message: '{error_data.get('message')}'")
+    logger.warning(f"API Call {url} limit is exceeded. Error message: '{error_data.get('message')}'")
     raise AirbyteStopSync()  # if got 403 rate limit response, finish the sync with success.
```
I am wondering whether this is the right behavior to continue. Due to the 24-hour rate limit, we didn't want to block future syncs, so we just marked the sync successful. This has its drawbacks and we could lose records. However, now that we have checkpointing at date-slice windows, maybe we should throw back an error instead of swallowing it, and on the next sync pick up where the previous bookmark left off.
And now with slices, we can still make incremental progress even if we hit the rate limit issue again, instead of retrying the whole sync.
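As a rough sketch of this suggestion (not the connector's actual code; the helper name and error-payload shape are assumptions), the handler could re-raise instead of stopping with success, relying on per-slice state messages to checkpoint the progress already made:

```python
from requests import codes

def handle_rate_limit(error, error_code: str, logger) -> None:
    # Hypothetical: error payload shape mirrors the Salesforce error body used above.
    error_data = error.response.json()[0]
    if error.response.status_code == codes.FORBIDDEN and error_code == "REQUEST_LIMIT_EXCEEDED":
        logger.warning(f"API call limit exceeded. Error message: '{error_data.get('message')}'")
        # Fail the sync instead of swallowing the error; the next run resumes
        # from the last checkpointed slice rather than re-reading everything.
        raise error
```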
We have checkpointing now, but not for full refresh syncs. I also wanted to remove it but decided to leave it as is.
Failing due to daily rate limits will trigger alerts if 3 workspaces start moving more data than they can. I'm not sure what the best way is to expose this kind of limitation without introducing a new status type.
(Resolved review comment on airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py)
/test connector=connectors/source-salesforce
Build Passed. Test summary info:
```python
return None

def stream_slices(
```
@roman-yermilov-gl we want this logic to apply to both IncrementalRestSalesforceStream and BulkIncrementalSalesforceStream, so it should be moved up. That should also allow for cleanup of the slice-related logic that was added to `request_params`, since much of it is duplicated between IncrementalRestSalesforceStream and BulkIncrementalSalesforceStream.
Can you also add unit tests for the IncrementalRestSalesforceStream case?
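A schematic sketch of the suggested layout (the class names are from the connector; the bases are stubs and the bodies are placeholders, not the real implementation):

```python
class RestSalesforceStream:   # stub for illustration
    ...

class BulkSalesforceStream:   # stub for illustration
    ...

class IncrementalRestSalesforceStream(RestSalesforceStream):
    def stream_slices(self, *, sync_mode=None, cursor_field=None, stream_state=None):
        # The shared date-window slicing would live here once...
        yield from ()

class BulkIncrementalSalesforceStream(BulkSalesforceStream, IncrementalRestSalesforceStream):
    # ...and be inherited here via the MRO, letting the duplicated
    # slice-related logic in each request_params be removed.
    pass
```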
@clnoll
This PR fixes the following P1 issues: #20471, #19947, #19014, https://github.com/airbytehq/oncall/issues/1735.
If these code changes look OK to you, can we approve/merge/close this PR so we don't keep those P1s open? After that I will start working on the refactoring and testing.
@roman-yermilov-gl the request isn't just for refactoring - with the current implementation, only bulk streams support checkpointing
Hey @roman-yermilov-gl Just wanted to circle back here. If we address these changes we should be able to merge this in :)
@girarda
Moved the slicer to IncrementalRestSalesforceStream, so all the incremental streams use slicing now.
@bnchrch
What should I do?)
/test connector=connectors/source-salesforce
Build Passed. Test summary info:
```diff
 if self.name not in UNSUPPORTED_FILTERING_STREAMS:
-    query += f"ORDER BY {self.cursor_field} ASC"
+    order_by_clause = f"ORDER BY {self.cursor_field} ASC"
```
Why is it that we're only ordering by `self.cursor_field` in the incremental sync case, but ordering by cursor & primary key in the bulk case? Should they be consistent?
We are using the primary key in bulk operation queries, in the `WHERE` clause and in the `ORDER BY` clause, whenever a primary key exists. For this type of query we need to handle pagination ourselves, so we slice by primary key inside the base stream slicer. REST stream pagination works differently, so we don't need it there.

**Why we need the primary key for bulk streams.** Given a table:

| id | date |
|----|------------|
| 1  | 01.01.2023 |
| 2  | 01.01.2023 |
| 3  | 01.01.2023 |
| 4  | 01.01.2023 |
| 5  | 01.01.2023 |
| 6  | 01.03.2023 |

with page size = 2, cursor field = `date`, and primary key = `id`, the query for the first slice would be:

```sql
SELECT fields FROM table WHERE date >= 01.01.2023 AND date < 01.02.2023 ORDER BY date LIMIT 15000;
```

Salesforce prepares the data (up to 15000 records, but imagine it returns only 2 for the purpose of this example):

| id | date |
|----|------------|
| 1  | 01.01.2023 |
| 2  | 01.01.2023 |

So far we have only 2 of the 5 records that satisfy the first query, which means we are not ready to move on to the second slice. We can also see that all 5 records share the same date, 01.01.2023. This is where the primary key comes in handy: to get the next two records we make a second query like this:

```sql
SELECT fields FROM table WHERE date >= 01.01.2023 AND date < 01.02.2023 AND id > 2 ORDER BY date, id LIMIT 15000;
```

which returns:

| id | date |
|----|------------|
| 3  | 01.01.2023 |
| 4  | 01.01.2023 |

**Why we don't need the primary key for REST streams.** In the first lines of the current method:

```python
if next_page_token:
    """
    If `next_page_token` is set, subsequent requests use `nextRecordsUrl`, and do not include any parameters.
    """
    return {}
```

we can see that Salesforce provides the link itself, and we just use it as-is to fetch the next page.
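To make the keyset-pagination idea above concrete, here is a minimal, self-contained sketch (a hypothetical helper, not the connector's actual code): page through one date slice by remembering the last primary key seen and adding an `id > last_key` filter.

```python
from typing import Optional

def build_query(start: str, end: str, last_key: Optional[str], page_size: int = 15000) -> str:
    """Build the query for one page of a date slice, using the primary key as a tie-breaker."""
    where = f"WHERE date >= {start} AND date < {end}"
    order_by = "ORDER BY date"
    if last_key is not None:
        where += f" AND id > {last_key}"  # skip records already received
        order_by += ", id"                # deterministic order among equal dates
    return f"SELECT fields FROM table {where} {order_by} LIMIT {page_size}"

print(build_query("01.01.2023", "01.02.2023", last_key=None))
print(build_query("01.01.2023", "01.02.2023", last_key="2"))
```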
Makes sense, thank you for clarifying!
/test connector=connectors/source-salesforce
Build Passed. Test summary info:
left a few small comments, but the flow looks good to me. Thank you for the hard work!
```python
now = pendulum.now(tz="UTC")
initial_date = pendulum.parse((stream_state or {}).get(self.cursor_field, self.start_date), tz="UTC")
period_end = initial_date.add(days=now.diff(initial_date).in_days())
```
It's not clear to me why we need to diff `now` with `initial_date`. Is this equivalent to `period_end = pendulum.today(tz="UTC")`?
You are right, we don't need those calculations. I am going to get rid of `period_end` and `base` (going to replace `base` with `initial_date`).
```python
slice_number = 1
while not end == now:
    base = period_end.subtract(days=period_end.diff(initial_date).in_days())
```
base can be computed outside of the while loop
```python
period_end = initial_date.add(days=now.diff(initial_date).in_days())

slice_number = 1
while not end == now:
```
nit: `while end <= now` makes the intent clearer
This is not possible with the current logic, because `end` is equal to `None` before the first iteration while `now` is a datetime, so only the `==` comparison can be applied here.
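A simplified, runnable sketch of the loop shape being discussed (the variable names come from the diff; the body is illustrative, assuming `end` starts as `None`):

```python
import pendulum

STREAM_SLICE_STEP = 120
now = pendulum.now(tz="UTC")
start = pendulum.parse("2023-01-01", tz="UTC")

end = None
while not end == now:  # `end` is None on the first pass, so `end <= now` would raise a TypeError
    end = min(start.add(days=STREAM_SLICE_STEP), now)
    print({"start_date": start.isoformat(), "end_date": end.isoformat()})
    start = end
```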
```python
if self.name not in UNSUPPORTED_FILTERING_STREAMS:
    order_by_fields = [self.cursor_field, self.primary_key] if self.primary_key else [self.cursor_field]
    query += f"ORDER BY {','.join(order_by_fields)} ASC LIMIT {self.page_size}"
primary_key = (next_page_token or {}).get("primary_key", "")
```
nit: can you rename this variable to `last_key` for clarity?
done
/test connector=connectors/source-salesforce
Build Passed. Test summary info:
LGTM @roman-yermilov-gl!
@clnoll Thanks
Awesome, thanks @roman-yermilov-gl!
/publish connector=connectors/source-salesforce
if you have connectors that successfully published but failed definition generation, follow step 4 here
* Source Salesforce: add checkpointing
* Source-Iterable: fix integration tests
* Source Salesforce: fix integration test slices
* Source Salesforce: wait for latest record to be accessible
* Source Salesforce: retry 10 times for everything
* Source Salesforce: refactoring; add checkpointing for all incremental streams
* Source Salesforce: small fixes
* auto-bump connector version

Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
What