Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent CDK: support partitioned states #36811

Merged
merged 12 commits into from
Apr 9, 2024

Conversation

maxi297
Copy link
Contributor

@maxi297 maxi297 commented Apr 3, 2024

What

Addresses https://github.com/airbytehq/airbyte-internal-issues/issues/6865

How

By having the CDK support partitioned states. This implies a couple of changes:

  • the cursor should generate the slices (see ConcurrentCursor.generate_slices)
  • sequential state is not a given anymore and should be slowly removed from the concurrent cursor (see various changes to state_converters and ConcurrentCursor)

Once this is done, we have that errors should not break the sync

Note: this PR serves as a CDK feature implementation and the changes on source-salesforce are only to show usage. The changes to source-salesforce will be moved to another PR

Migration

Following the point above: Right now if there is a state, we can only assume that everything from self._start to <cursor value> was synced. If there is a state, we therefore need to create a slice from {"start": self._start, "end": }. I think we had already thought about this logic but I didn't see it in action in my round of testing today (2024-04-05)

Copy link

vercel bot commented Apr 3, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Apr 9, 2024 8:32pm

@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit connectors/source/salesforce labels Apr 3, 2024
yield self._slice

slice_start += self.stream_slice_step
if self._cursor:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really hate this but for full refresh, we create a FinalStateCursor and not a ConcurrentCursor. It probably means that the generation of slices based on a state is not valid. In other words, with or without a state, we generate slices that relies on incrementality and the fact that the stream is full_refresh is irrelevant. I think this will be more true with Resumable Full Refresh but it is a bit annoying right now

@maxi297 maxi297 requested a review from girarda April 3, 2024 20:50
}
else:
now = pendulum.now(tz="UTC")
assert LOOKBACK_SECONDS is not None and LOOKBACK_SECONDS >= 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know. It indeed does not make sense as we are fully owner of LOOKBACK_SECONDS. This is code that was there before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding to that, some tests were patching this constant with invalid values. I just removed them as it is not a real case scenario

@@ -25,6 +25,10 @@ class Comparable(Protocol):
def __lt__(self: "Comparable", other: "Comparable") -> bool:
pass

@abstractmethod
def __add__(self: "Comparable", other: Any) -> "Comparable":
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: let's add a docstring

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this does not always make sense. I think the typing is kind of wrong for this class. I think we need a generic with two "variables": one that represents cursor values and the other that represents gap between those. For a cursor where values are datetimes, the first would be datetime and the second would be timedelta

@maxi297 maxi297 force-pushed the issue-6865/move-to-partitioned-state branch from 8461a12 to f187529 Compare April 4, 2024 20:43
@@ -118,7 +122,9 @@ def __init__(
connector_state_converter: AbstractStreamStateConverter,
cursor_field: CursorField,
slice_boundary_fields: Optional[Tuple[str, str]],
start: Optional[Any],
start: Optional[Comparable],
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚨 🚨 🚨 Breaking change alert: The type was changed for start as it was either a Python object when being passed a partitioned state or the serialized value when being passed a sequential state. As I'm trying to remove sequential state from this class, we needed to remove the second case

@maxi297 maxi297 requested review from clnoll and girarda April 5, 2024 14:57
@maxi297 maxi297 marked this pull request as ready for review April 5, 2024 14:58
@maxi297 maxi297 requested a review from a team as a code owner April 5, 2024 14:58
@maxi297 maxi297 changed the title [TMP] want feedback on having salesforce rely on concurrent state Concurrent CDK: support partitioned states Apr 5, 2024
@octavia-squidington-iv octavia-squidington-iv requested a review from a team April 5, 2024 14:59
@girarda girarda requested a review from bazarnov April 5, 2024 17:15
Copy link
Contributor

@clnoll clnoll left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maxi297 I'm not seeing anything that I think would pose a problem for file-based sources. I can do a follow up to update the S3 connector after this is deployed for Salesforce if that would be helpful.

What I'm not clear on is why the cursor needs to be updated in the same PR that's changing the thread shutdown logic, which I thought was the primary issue that we were trying to solve in this PR. Can you help me understand that?

@maxi297 maxi297 force-pushed the issue-6865/move-to-partitioned-state branch from 0a673c9 to 675b544 Compare April 9, 2024 19:12
@octavia-squidington-iii octavia-squidington-iii removed the area/connectors Connector related issues label Apr 9, 2024
@maxi297 maxi297 merged commit bbf69ae into master Apr 9, 2024
27 checks passed
@maxi297 maxi297 deleted the issue-6865/move-to-partitioned-state branch April 9, 2024 20:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CDK Connector Development Kit connectors/source/salesforce
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants