Source Salesforce: change the sequence of requests #23610
Conversation
/test connector=connectors/source-salesforce
@davydov-d nice! To make sure I understand: when we query Salesforce, we send a separate request for each chunk of properties. Do we do any checks to make sure we have all the properties for a record, e.g. if the page for one chunk comes back with fewer records?
@clnoll each set of properties (chunk) has its own pagination. Regarding your first question: first we request the first pages of all the chunks; after that, the sequence of requests depends on the page size returned for each chunk.
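In other words, per-chunk pagination can be pictured with this minimal hypothetical sketch (the names are illustrative, not the connector's actual code):

```python
# Hypothetical sketch: each property chunk paginates independently, and
# the next request goes to whichever chunk has returned the fewest rows.
rows_read = {"chunk_ab": 0, "chunk_cd": 0}  # rows fetched per chunk so far

def next_chunk_to_request():
    """Pick the chunk that is furthest behind."""
    return min(rows_read, key=rows_read.get)
```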
I looked over the code trying to functionally understand what it's doing. I think it makes sense and should perform the grouping as @davydov-d clearly explained in the PR description and comments. I also don't see how this would be affecting the abnormal sync behavior, so it is a little bit puzzling why that test is starting to fail all of a sudden. I will continue reviewing the code itself tomorrow too. The Salesforce rate limits are still posing a serious problem, and I'm nervous about how often we will be able to test this without exceeding them again and blocking us for another 24 hours.
/test connector=connectors/source-salesforce
fix infinite loop for streams with no records and refactor properties into a helper object to organize state
The changes look good @davydov-d, nice work on what was a really challenging implementation of the grouping logic.
I pushed a fix for the infinite loop and consolidated the chunk states into a new object to clean the code up a bit (that was originally going to be my main suggestion during the review). Feel free to look over those changes; otherwise, I think this is in a good state to publish as long as we can get a passing CAT run.
airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py
/test connector=connectors/source-salesforce
Build Passed. Test summary info:
airbyte-integrations/connectors/source-salesforce/source_salesforce/streams.py
/publish connector=connectors/source-salesforce
If you have connectors that successfully published but failed definition generation, follow step 4 here.
* #1571 source salesforce: change the sequence of requests
* #1571 source Salesforce: format
* #1571 source salesforce: fix endless loop
* #1571 source salesforce: update unit tests
* fix infinite loop for streams with no records and refactor properties into a helper object to organize state
* auto-bump connector version

---------

Co-authored-by: brianjlai <brian.lai@airbyte.io>
Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
What
https://github.com/airbytehq/oncall/issues/1571
How
Version 2.0.1 introduced new functionality: support for non-bulk streams with so many properties that they cannot fit into a single HTTP request because of the maximum request length. The idea was to split all the properties of an entity into chunks and make multiple requests to fetch the data, one request per chunk per page. Requests were made in this sequence:

```
# Page 1
select id, a, b from table order by id
select id, c, d from table order by id
# Page 2 (offset and limit are handled by the page number)
select id, a, b from table order by id
select id, c, d from table order by id
```

Parts of the records were then stitched together by primary key value within one page. This solution did not take into account that the Salesforce API does not apply a constant page size to all queries: if the first query returns 2k records and the second returns 200, we end up with 1.8k incomplete records.
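As a rough illustration of that old scheme, here is a minimal sketch; the names `chunk_properties`, `read_page`, and `run_query` are hypothetical, not the connector's actual code:

```python
from itertools import islice

def chunk_properties(properties, chunk_size):
    """Split the full property list into chunks that fit in one request."""
    it = iter(properties)
    while chunk := list(islice(it, chunk_size)):
        yield chunk

def read_page(page, property_chunks, run_query):
    """Old approach: query every chunk for the same page number and merge
    the partial records by primary key before yielding them."""
    merged = {}
    for chunk in property_chunks:
        # e.g. "select id, a, b from table order by id" for this page
        for row in run_query(page, ["id"] + chunk):
            merged.setdefault(row["id"], {}).update(row)
    # Assumes every chunk's page covers the same ids; that assumption
    # breaks when Salesforce shrinks the page size per query.
    yield from merged.values()
```

The closing comment marks exactly the assumption that broke: nothing guarantees that every chunk's page covers the same set of ids.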
To fix this, we change the sequence of requests. From now on, each subsequent request retrieves data from the chunk that has had the fewest records read so far, independently of pagination (although we still store the current and next page for each chunk). After stitching a record together, we can yield it as soon as we know it contains all the required properties. This lets us support streams with a huge number of properties while avoiding high memory consumption and emitting consistent data.
Let's look at an example: say we have 6 properties split into 3 chunks.
```
1st request: select id, a, b from table order by id
-> returns [{id: 1, a: 1, b: 1}, {id: 2, a: 2, b: 2}, {id: 3, a: 3, b: 3}, {id: 4, a: 4, b: 4}]

2nd request: select id, c, d from table order by id
-> returns [{id: 1, c: 1, d: 1}, {id: 2, c: 2, d: 2}]
   # Salesforce decided to decrease the page size compared to the first request in favour of keeping performance high

3rd request: select id, e, f from table order by id
-> returns [{id: 1, e: 1, f: 1}]
   # page size decreased even more
```

Now we have one complete record (id=1) that can be emitted. The next steps: find the chunk with the fewest records read (#3) and request its next page. Once we have that response, we will be able to emit one or more records. Repeat until we have the full set of properties for each primary key.
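Putting the algorithm together, here is a minimal sketch of the new sequencing under assumed, hypothetical names (`Chunk`, `stitch_records`, `fetch_page`); the connector's real implementation in streams.py differs, but the rule is the same: always advance the chunk with the fewest records read, and emit a record as soon as every chunk has contributed its slice.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    """Per-chunk pagination state (hypothetical names, for illustration)."""
    properties: list
    records_read: int = 0
    page: int = 0
    done: bool = False

def stitch_records(chunks, fetch_page):
    """Always advance the chunk with the fewest records read; yield a
    record as soon as every chunk has contributed its properties."""
    partial = {}   # primary key -> properties merged so far
    seen_in = {}   # primary key -> how many chunks returned this id
    while not all(c.done for c in chunks):
        chunk = min((c for c in chunks if not c.done),
                    key=lambda c: c.records_read)
        rows = fetch_page(chunk.properties, chunk.page)
        chunk.page += 1
        if not rows:
            # Empty page: this chunk is exhausted. Without this guard, a
            # stream with no records would keep requesting pages forever.
            chunk.done = True
            continue
        for row in rows:
            chunk.records_read += 1
            partial.setdefault(row["id"], {}).update(row)
            seen_in[row["id"]] = seen_in.get(row["id"], 0) + 1
            if seen_in[row["id"]] == len(chunks):
                # All chunks delivered their slice: the record is complete.
                yield partial.pop(row["id"])
```

Because every query is ordered by id, records complete in id order, so `partial` only ever holds the ids between the fastest and the slowest chunk, which is what keeps memory consumption low.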