🎉 Source Salesforce: add checkpointing #24888
Changes from all commits: a7589f1, 906a890, 9683446, da4e771, 1f1a223, e48f17d, 8ccadd5, 565eae5, c5e08bf
@@ -579,6 +579,7 @@ def transform_empty_string_to_none(instance: Any, schema: Any):

class IncrementalRestSalesforceStream(RestSalesforceStream, ABC):
    state_checkpoint_interval = 500
+    STREAM_SLICE_STEP = 120

    def __init__(self, replication_key: str, start_date: Optional[str], **kwargs):
        super().__init__(**kwargs)
@@ -592,6 +593,20 @@ def format_start_date(start_date: Optional[str]) -> Optional[str]:
            return pendulum.parse(start_date).strftime("%Y-%m-%dT%H:%M:%SZ")  # type: ignore[attr-defined,no-any-return]
        return None

+    def stream_slices(
+        self, *, sync_mode: SyncMode, cursor_field: List[str] = None, stream_state: Mapping[str, Any] = None
+    ) -> Iterable[Optional[Mapping[str, Any]]]:
+        start, end = (None, None)
+        now = pendulum.now(tz="UTC")
+        initial_date = pendulum.parse((stream_state or {}).get(self.cursor_field, self.start_date), tz="UTC")
+
+        slice_number = 1
+        while not end == now:

Comment: nit: …

Reply: This is not possible with current logic because …

+            start = initial_date.add(days=(slice_number - 1) * self.STREAM_SLICE_STEP)
+            end = min(now, initial_date.add(days=slice_number * self.STREAM_SLICE_STEP))
+            yield {"start_date": start.isoformat(timespec="milliseconds"), "end_date": end.isoformat(timespec="milliseconds")}
+            slice_number = slice_number + 1

    def request_params(
        self,
        stream_state: Mapping[str, Any],
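For orientation, here is a minimal sketch (not part of the diff; the start date is invented) of the slice windows the new `stream_slices` generates with the 120-day `STREAM_SLICE_STEP` above:

```python
import pendulum

STREAM_SLICE_STEP = 120  # days, mirroring the constant added above

# Invented start date for illustration; in the connector this comes from
# stream state or the configured start_date.
initial_date = pendulum.parse("2023-01-01T00:00:00Z", tz="UTC")
now = pendulum.now(tz="UTC")

end = None
slice_number = 1
while end != now:
    start = initial_date.add(days=(slice_number - 1) * STREAM_SLICE_STEP)
    end = min(now, initial_date.add(days=slice_number * STREAM_SLICE_STEP))
    # Each window becomes one slice: {"start_date": ..., "end_date": ...}
    print(start.isoformat(timespec="milliseconds"), "->", end.isoformat(timespec="milliseconds"))
    slice_number += 1
```

Because every window is its own slice, state can be emitted after each 120-day chunk rather than only at the end of the sync, which appears to be what the checkpointing in the PR title refers to.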
@@ -607,14 +622,28 @@ def request_params(

        property_chunk = property_chunk or {}

-        stream_date = stream_state.get(self.cursor_field)
-        start_date = stream_date or self.start_date
+        start_date = max(
+            (stream_state or {}).get(self.cursor_field, self.start_date),
+            (stream_slice or {}).get("start_date", ""),
+            (next_page_token or {}).get("start_date", ""),
+        )
+        end_date = (stream_slice or {}).get("end_date", pendulum.now(tz="UTC").isoformat(timespec="milliseconds"))

+        select_fields = ",".join(property_chunk.keys())
+        table_name = self.name
+        where_conditions = []
+        order_by_clause = ""

-        query = f"SELECT {','.join(property_chunk.keys())} FROM {self.name} "
        if start_date:
-            query += f"WHERE {self.cursor_field} >= {start_date} "
+            where_conditions.append(f"{self.cursor_field} >= {start_date}")
+        if end_date:
+            where_conditions.append(f"{self.cursor_field} < {end_date}")
        if self.name not in UNSUPPORTED_FILTERING_STREAMS:
-            query += f"ORDER BY {self.cursor_field} ASC"
+            order_by_clause = f"ORDER BY {self.cursor_field} ASC"

Comment: Why is it that we're only ordering by …

Reply: We are using the primary key in the bulk operations query in …

Why we need the primary key for Bulk streams:

Page size = 2. The query for the first slice would be: …

Salesforce prepares the data (max 15,000 records, but imagine it handles only 2 for example purposes): …

So far only 2 of the 5 records have satisfied the first query, which means we are not ready to move to the second slice. We also see that all 5 records have the same date …

This will return: …

Why we don't need the primary key in the REST stream: … we can see that there is a link made by Salesforce and we just use it as is for getting the next page.

Reply: Makes sense, thank you for clarifying!
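To make that example concrete, here is a hypothetical sketch (object name, field names, dates, and Ids are all invented) of the keyset-style pagination described above: order by the cursor field plus the primary key, and have the next page also filter on the primary key of the last record seen, so paging advances even when many records share the same cursor value.

```python
# Invented example values; real streams substitute their own cursor field and primary key.
cursor_field = "SystemModstamp"
primary_key = "Id"
page_size = 2

slice_start = "2021-01-01T00:00:00.000+00:00"
slice_end = "2021-05-01T00:00:00.000+00:00"

# Page 1 of the slice: ORDER BY (cursor_field, primary_key) gives a total order,
# so LIMIT-based paging is deterministic.
page_1 = (
    f"SELECT Id, Name, {cursor_field} FROM Account "
    f"WHERE {cursor_field} >= {slice_start} AND {cursor_field} < {slice_end} "
    f"ORDER BY {cursor_field}, {primary_key} ASC LIMIT {page_size}"
)

# Suppose five records share the same SystemModstamp and page 1 ended at Id '001B'.
# A filter on the cursor value alone could never move past those five records, so the
# next page additionally filters on the last primary key seen:
last_id = "001B"
page_2 = (
    f"SELECT Id, Name, {cursor_field} FROM Account "
    f"WHERE {cursor_field} >= {slice_start} AND {cursor_field} < {slice_end} "
    f"AND {primary_key} > '{last_id}' "
    f"ORDER BY {cursor_field}, {primary_key} ASC LIMIT {page_size}"
)

print(page_1)
print(page_2)
```

As the reply above notes, the REST streams don't need this because Salesforce returns a ready-made next-page link (nextRecordsUrl) that the connector follows as-is.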

+        where_clause = f"WHERE {' AND '.join(where_conditions)}"
+        query = f"SELECT {select_fields} FROM {table_name} {where_clause} {order_by_clause}"

        return {"q": query}

    @property
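For contrast, with the same invented stream and slice bounds, the REST incremental query assembled here would look roughly like this; there is no LIMIT and no primary-key condition because REST pagination follows the next-page link instead:

```python
cursor_field = "SystemModstamp"  # invented example values, as above
select_fields = "Id,Name,SystemModstamp"
table_name = "Account"
slice_start = "2021-01-01T00:00:00.000+00:00"
slice_end = "2021-05-01T00:00:00.000+00:00"

where_clause = f"WHERE {cursor_field} >= {slice_start} AND {cursor_field} < {slice_end}"
order_by_clause = f"ORDER BY {cursor_field} ASC"
print(f"SELECT {select_fields} FROM {table_name} {where_clause} {order_by_clause}")
```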
@@ -635,34 +664,33 @@ def get_updated_state(self, current_stream_state: MutableMapping[str, Any], late
class BulkIncrementalSalesforceStream(BulkSalesforceStream, IncrementalRestSalesforceStream):
    def next_page_token(self, last_record: Mapping[str, Any]) -> Optional[Mapping[str, Any]]:
        if self.name not in UNSUPPORTED_FILTERING_STREAMS:
-            page_token: str = last_record[self.cursor_field]
-            res = {"next_token": page_token}
-            # use primary key as additional filtering param, if cursor_field is not increased from previous page
-            if self.primary_key and self.prev_start_date == page_token:
-                res["primary_key"] = last_record[self.primary_key]
-            return res
+            return {"next_token": last_record[self.cursor_field], "primary_key": last_record.get(self.primary_key)}
        return None

    def request_params(
        self, stream_state: Mapping[str, Any], stream_slice: Mapping[str, Any] = None, next_page_token: Mapping[str, Any] = None
    ) -> MutableMapping[str, Any]:
-        selected_properties = self.get_json_schema().get("properties", {})
+        start_date = max(
+            (stream_state or {}).get(self.cursor_field, ""),
+            (stream_slice or {}).get("start_date", ""),
+            (next_page_token or {}).get("start_date", ""),
+        )
+        end_date = stream_slice["end_date"]

-        stream_date = stream_state.get(self.cursor_field)
-        next_token = (next_page_token or {}).get("next_token")
-        primary_key = (next_page_token or {}).get("primary_key")
-        start_date = next_token or stream_date or self.start_date
-        self.prev_start_date = start_date
+        select_fields = ", ".join(self.get_json_schema().get("properties", {}).keys())
+        table_name = self.name
+        where_conditions = [f"{self.cursor_field} >= {start_date}", f"{self.cursor_field} < {end_date}"]
+        order_by_clause = ""

-        query = f"SELECT {','.join(selected_properties.keys())} FROM {self.name} "
-        if start_date:
-            if primary_key and self.name not in UNSUPPORTED_FILTERING_STREAMS:
-                query += f"WHERE ({self.cursor_field} = {start_date} AND {self.primary_key} > '{primary_key}') OR ({self.cursor_field} > {start_date}) "
-            else:
-                query += f"WHERE {self.cursor_field} >= {start_date} "
        if self.name not in UNSUPPORTED_FILTERING_STREAMS:
-            order_by_fields = [self.cursor_field, self.primary_key] if self.primary_key else [self.cursor_field]
-            query += f"ORDER BY {','.join(order_by_fields)} ASC LIMIT {self.page_size}"
+            last_primary_key = (next_page_token or {}).get("primary_key", "")
+            if last_primary_key:
+                where_conditions.append(f"{self.primary_key} > '{last_primary_key}'")
+            order_by_fields = ", ".join([self.cursor_field, self.primary_key] if self.primary_key else [self.cursor_field])
+            order_by_clause = f"ORDER BY {order_by_fields} ASC LIMIT {self.page_size}"

+        where_clause = f"WHERE {' AND '.join(where_conditions)}"
+        query = f"SELECT {select_fields} FROM {table_name} {where_clause} {order_by_clause}"
        return {"q": query}

Comment: I am thinking about whether this is the right behavior to continue. Due to the 24-hour rate limit, we didn't want to block future syncs, so we just marked the sync successful. This has its drawbacks, and we could lose records. However, now that we have checkpointing at date-slice windows, maybe we should throw an error instead of swallowing it, and on the next sync pick up where the previous bookmark left off. With slices, we can still make incremental progress even if we hit the rate-limit issue again, instead of retrying the whole sync.

Reply: We have checkpointing now, but not for full refresh syncs. I also wanted to remove it but decided to leave it as is.

Reply: Failing due to daily rate limits will trigger alerts if 3 workspaces start moving more data than they can. I'm not sure what the best way to expose this kind of limitation is without introducing a new status type.