Emit state when no partitions are generated for ccdk #34605

maxi297 · 2024-01-29T15:26:49Z

What

CCDK does not emit the state message when no partitions are generated. The platform expects this message, so CCDK should support it.

Note that until we have clarity on the platform's expectations regarding state, we will always emit the state at the end of a sync, even if this duplicates a state message that was already emitted.

How

I was struggling to make this fit in our domain as I don't get why the platform needs this. I've asked for more information on https://airbytehq-team.slack.com/archives/C03GD9SV36E/p1706540517839939.

Given the information I have, this is the best way I have of formalizing the domain. I don't like this because:

it adds a dependency between the cursor and the stream + read processor (previously the dependency was only with the partition)
the interface is kind of weird with emit_state_given_no_partitions_generated. If we decide to go with something more generic like emit_state, we are not explicit as to why this is needed (which might be fine, but I don't like that when I check the Cursor class, I don't see everything related to state management). The other name I can think of is ensure_state_emitted_at_least_once, which is less generic but still relates to the platform expectations. Once we understand that platform expectation, we can simply emit a state even if one has already been submitted, or track whether Cursor._emit_state_message has been called. Whatever the solution, I would like to think about this through the lens of the Sync data accuracy project.
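
A minimal sketch of the "track whether the state message has been emitted" idea mentioned above. Everything here is hypothetical and only for illustration: `SketchCursor`, `close_partition`, and the list standing in for the message repository are assumptions, not the CDK's actual `Cursor` API.

```python
from typing import Any, Dict, List


class SketchCursor:
    """Hypothetical cursor used for illustration; not the CDK's Cursor class."""

    def __init__(self, stream_name: str, emitted: List[Dict[str, Any]]) -> None:
        self._stream_name = stream_name
        self._state: Dict[str, Any] = {}
        self._emitted = emitted  # stands in for the message repository
        self._has_emitted_state = False

    def close_partition(self, partition_state: Dict[str, Any]) -> None:
        # Normal path: a partition finished, so update the state and emit it.
        self._state.update(partition_state)
        self._emit_state_message()

    def ensure_state_emitted_at_least_once(self) -> None:
        # Only emit when no partition has triggered an emission so far.
        if not self._has_emitted_state:
            self._emit_state_message()

    def _emit_state_message(self) -> None:
        self._has_emitted_state = True
        self._emitted.append({"stream": self._stream_name, "state": dict(self._state)})
```

With this shape, the caller does not need to know whether partitions were generated: it calls `ensure_state_emitted_at_least_once` once at the end of the stream, and the cursor decides whether anything is still owed.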

I thought of other solutions, which were also not optimal.

Keep the logic in DefaultStream

The benefit is that there is no interface change and we can do something like:

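    def generate_partitions(self) -> Iterable[Partition]:
        has_generated_partitions = False
        for partition in self._stream_partition_generator.generate():
            has_generated_partitions = True
            yield partition
        # A `for ... else` runs after every loop that ends without `break`, so track emptiness explicitly.
        if not has_generated_partitions:
            ...  # something like `cursor.emit_state()`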

The big drawback is that the solution only applies to sources using DefaultStream, and this logic would need to be implemented in the FCDK as well. Hence, it feels like a bad way to formalize our domain, and it suggests that the interfaces should change.

Having ConcurrentReadProcessor do all the work

We could have passed the ConnectorStateManager to the ConcurrentReadProcessor and exposed AbstractStream.namespace so that when no partitions are generated (ConcurrentReadProcessor or PartitionEnqueuer can track that), we return self._connector_state_manager.create_state_message(...).

For this solution, the interfaces still do not change a lot and this would be supported by all the sources. What I don't like about this is:

if I make a change to state management, there is a use case that lives outside of Cursor, and hence as a dev, I could miss some crucial information
once we work on the Sync data accuracy project, if the record count in the state message is by stream, cursor.observe would be able to keep track of the number of records, and the solution would be very easy for us if the cursor emits that state. In the "Having ConcurrentReadProcessor do all the work" solution, it is not the cursor that emits the state, so observe can't be used. Note that if the record count is expected to be across all streams, our domain does not capture that very well today and it might be challenging...

clnoll

Thanks for the detailed explanation @maxi297.

I think it would be ideal not to have state being stored on the sentinel but it feels pretty safe overall (i.e. it doesn't look like anything will be able to or be tempted to mutate that state anywhere).

Regarding your "Keep the logic in DefaultStream" option - is it worth revisiting now that the file-based approach doesn't have its own DefaultStream?

clnoll · 2024-01-29T15:42:00Z

airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py

+                sentinel.stream.cursor.emit_state_given_no_partitions_generated()
+                yield from self._message_repository.consume_queue()


I'm probably being slow, but what is the reasoning behind emitting the state before consuming from the queue? If the previous lines have generated messages that go into the repository it feels like they should be emitted first.

Emitting the state only pushes it to the message repository queue. Whoever owns the ordering/printing of the records decides when to actually emit these messages.

I don't really like this logic because as a user, I would assume that emit state would actually send it. I don't remember why we did it like that. That being said, it has advantages: we can send messages to the message repository in a multi threaded way and emit them only on the main thread which allows us to avoid issues regarding messages order

Ah right, that makes sense!

we can send messages to the message repository in a multi threaded way and emit them only on the main thread which allows us to avoid issues regarding messages order

+1. super important to avoid emitting state messages before the records
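
A minimal sketch of the queueing behaviour discussed in this thread, assuming a repository that only buffers messages until the main thread drains it. `QueueingMessageRepository`, `emit_message`, and `read_empty_stream` are hypothetical names for illustration and not the CDK's implementation; only `consume_queue` mirrors a name from the diff above.

```python
import threading
from collections import deque
from typing import Deque, Iterable, List


class QueueingMessageRepository:
    """Hypothetical stand-in for a message repository that buffers messages."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()
        self._lock = threading.Lock()

    def emit_message(self, message: str) -> None:
        # Safe to call from any thread: the message is only buffered here.
        with self._lock:
            self._queue.append(message)

    def consume_queue(self) -> Iterable[str]:
        # Intended for the main thread: drains messages in insertion order.
        while True:
            with self._lock:
                if not self._queue:
                    return
                message = self._queue.popleft()
            yield message


def read_empty_stream(repository: QueueingMessageRepository) -> List[str]:
    # A worker thread buffers a message first...
    worker = threading.Thread(target=repository.emit_message, args=("LOG: no partitions",))
    worker.start()
    worker.join()
    # ...then the state is "emitted", which only appends it to the same queue.
    repository.emit_message("STATE: {}")
    # Draining afterwards yields the earlier message before the state.
    return list(repository.consume_queue())
```

Under these assumptions, calling the cursor before `consume_queue` does not put the state ahead of messages that were already buffered: the state is appended to the end of the queue and comes out last when the queue is drained.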

clnoll · 2024-01-29T15:48:20Z

airbyte-cdk/python/airbyte_cdk/sources/streams/concurrent/partition_enqueuer.py

@@ -50,6 +51,7 @@ def generate_partitions(self, stream: AbstractStream) -> None:
                while self._thread_pool_manager.prune_to_validate_has_reached_futures_limit():
                    time.sleep(self._sleep_time_in_seconds)
                self._queue.put(partition)
-            self._queue.put(PartitionGenerationCompletedSentinel(stream))
+                has_generated_partitions = True
+            self._queue.put(PartitionGenerationCompletedSentinel(stream, has_generated_partitions))


Would it be more useful to have the partition count instead of a bool, down the line?

I asked myself the same question and didn't see any benefit to that today. There is a (very minor) drawback: an int takes more space than a bool, and that space can grow as the int gets bigger.

Okay, yeah, not much of a drawback. Given that this feels like something we'll want when we add record counts and debug logging, I feel like it makes sense to go ahead and implement it as an int.

See #34605 (comment) which seems somewhat relevant to this comment

Right - I read the drawbacks that you listed for that approach and agree that the sentinel approach seems preferable.
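
A rough sketch of what carrying a partition count on the sentinel could look like, as suggested in this thread. `CompletedSentinelSketch` and `enqueue_partitions` are hypothetical names, and the sentinel here holds a stream name rather than the stream object; this is not the merged implementation.

```python
import queue
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass(frozen=True)
class CompletedSentinelSketch:
    """Hypothetical sentinel; frozen so the count cannot be mutated after creation."""

    stream_name: str
    partition_count: int

    @property
    def has_generated_partitions(self) -> bool:
        # The boolean from the diff can be derived from the count.
        return self.partition_count > 0


def enqueue_partitions(stream_name: str, partitions: Iterable[Any], out: "queue.Queue[Any]") -> None:
    count = 0
    for partition in partitions:
        out.put(partition)
        count += 1
    # The sentinel is always enqueued last, even when no partition was generated.
    out.put(CompletedSentinelSketch(stream_name, count))
```

The count keeps the boolean check available through the derived property, so a consumer that only cares about "were there any partitions" would not need to change if the count is later used for logging.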

maxi297 · 2024-01-29T16:00:29Z

Regarding your "Keep the logic in DefaultStream" option - is it worth revisiting now that the file-based approach doesn't have its own DefaultStream?

Any source that re-implements AbstractStream is at risk though, and would have to re-implement the logic of emitting at least one state message.

maxi297 · 2024-01-29T16:11:30Z

I think it would be ideal not to have state being stored on the sentinel

We don't need to. The PartitionEnqueuer could call stream.cursor.emit_state_given_no_partitions_generated if no partitions are generated. We would just have ConcurrentReadProcessor consume the queue and no state would be needed
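
A minimal sketch of that alternative, assuming the enqueuer calls the cursor itself so the sentinel carries no extra state. `RecordingCursor`, `StreamSketch`, `SentinelSketch`, and the free function `generate_partitions` are hypothetical stand-ins; only the method name `emit_state_given_no_partitions_generated` comes from this PR.

```python
import queue
from typing import Any, Iterable, List


class RecordingCursor:
    """Hypothetical cursor that counts how often the fallback emission is requested."""

    def __init__(self) -> None:
        self.calls = 0

    def emit_state_given_no_partitions_generated(self) -> None:
        self.calls += 1


class StreamSketch:
    """Hypothetical stream exposing its partitions and its cursor."""

    def __init__(self, partitions: List[Any], cursor: RecordingCursor) -> None:
        self._partitions = partitions
        self.cursor = cursor

    def generate_partitions(self) -> Iterable[Any]:
        return iter(self._partitions)


class SentinelSketch:
    """Hypothetical sentinel that only identifies the stream."""

    def __init__(self, stream: StreamSketch) -> None:
        self.stream = stream


def generate_partitions(stream: StreamSketch, out: "queue.Queue[Any]") -> None:
    has_generated_partitions = False
    for partition in stream.generate_partitions():
        out.put(partition)
        has_generated_partitions = True
    if not has_generated_partitions:
        # The enqueuer asks the cursor directly, so the flag never leaves this function.
        stream.cursor.emit_state_given_no_partitions_generated()
    out.put(SentinelSketch(stream))
```

In this variant the read processor would only need to consume the message queue when it sees the sentinel, at the cost of the enqueuer knowing about the cursor.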

girarda

girarda · 2024-01-29T18:29:46Z

airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py

+                sentinel.stream.cursor.emit_state_given_no_partitions_generated()
+                yield from self._message_repository.consume_queue()


we can send messages to the message repository in a multi threaded way and emit them only on the main thread which allows us to avoid issues regarding messages order

+1. super important to avoid emitting state messages before the records

One solution to emit state when no partitions are generated for ccdk

6976240

maxi297 requested a review from a team as a code owner January 29, 2024 15:26

octavia-squidington-iii added the CDK Connector Development Kit label Jan 29, 2024

clnoll reviewed Jan 29, 2024

clnoll approved these changes Jan 29, 2024

maxi297 mentioned this pull request Jan 29, 2024

✨ Source Stripe: Enable concurrency on incremental syncs for balance_transactions, events, files, file_links and shipping_rates #34619

Merged

girarda approved these changes Jan 29, 2024

maxi297 added 3 commits January 29, 2024 16:14

Always emit at the end of the sync

23407c0

mypy

ab9dab0

flake

2e4df07

maxi297 merged commit 2c8b47b into master Jan 30, 2024
20 checks passed

maxi297 deleted the maxi297/emit-state-when-no-partition-generated-solution1 branch January 30, 2024 13:45

maxi297 mentioned this pull request Jan 30, 2024

Source S3: updates for compatibility with the concurrent CDK #34591

Merged

maxi297 mentioned this pull request Jan 31, 2024

✨ Source Stripe: Enable concurrency on incremental syncs for balance_transactions, files, file_links and shipping_rates #34696

Merged

jbfbell pushed a commit that referenced this pull request Feb 1, 2024

Emit state when no partitions are generated for ccdk (#34605)

8cef869

maxi297 mentioned this pull request Feb 5, 2024

File-based CDK: make incremental syncs concurrent #34540

Merged

jbfbell mentioned this pull request Feb 21, 2024

Destination Oracle - Remove Normalization #35470

Closed

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 21, 2024

Emit state when no partitions are generated for ccdk (airbytehq#34605)

e8bce81

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024

Emit state when no partitions are generated for ccdk (airbytehq#34605)

5672e54

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024

Emit state when no partitions are generated for ccdk (airbytehq#34605)

fbccc9d

jbfbell mentioned this pull request Mar 7, 2024

Destination MSSQL - Remove Normalization #35874

Closed
