File-based CDK: make full refresh concurrent #34411

clnoll · 2024-01-22T15:13:33Z

Uses the thread-based concurrent CDK to add concurrency to the file-based CDK.

This is done by creating a FileBasedStreamFacade and associated file-based classes to use as partitions.

For now this just uses a FileBasedNoopCursor; once incremental is put into place we will also have a FileBasedConcurrentCursor.

Note: as this PR stands, it relies on the file-based streams creating new connections to the source when a read is being done. On the face of it this is not entirely optimal, as described in the ticket for this work, but we can control the number of StreamReaders created using the concurrency limits, and this may be sufficient. Rather than putting the optimization in with this PR I plan to do some production testing to determine whether it's needed.
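As a rough illustration of the note above, the sketch below shows how a worker limit caps the number of readers open at once when each read creates its own reader. It is not the CDK's implementation: `CountingReader`, `read_file`, and `read_all` are hypothetical names, and a plain `ThreadPoolExecutor` stands in for the concurrent CDK's thread pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List


class CountingReader:
    """Hypothetical stand-in for a stream reader that opens its own connection."""

    _lock = threading.Lock()
    open_now = 0  # readers currently open
    peak = 0      # highest number of readers open at the same time

    def __enter__(self) -> "CountingReader":
        with CountingReader._lock:
            CountingReader.open_now += 1
            CountingReader.peak = max(CountingReader.peak, CountingReader.open_now)
        return self

    def __exit__(self, *exc_info) -> bool:
        with CountingReader._lock:
            CountingReader.open_now -= 1
        return False  # do not swallow exceptions

    def read(self, file_name: str) -> str:
        return f"records from {file_name}"


def read_file(file_name: str) -> str:
    # Each task creates a fresh reader, mirroring "a new connection per read".
    with CountingReader() as reader:
        return reader.read(file_name)


def read_all(file_names: List[str], max_workers: int) -> List[str]:
    # The pool size is the concurrency limit: at most max_workers tasks run at
    # once, so at most max_workers readers can be open at once.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(read_file, file_names))
```

With this shape, the cost of per-read connections scales with the concurrency limit rather than with the number of files, which is why production measurements can reasonably come before any connection-sharing optimization.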

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

PR name follows PR naming conventions
Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
Secrets in the connector's spec are annotated with airbyte_secret
All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

Check for hidden checklists in your PR description
Toggle the github label checklist-action-run on/off to re-run the checklist CI.

maxi297

My big concern is related to creating a new StreamFacade. Can we make explicit why we decided to go this way?

maxi297 · 2024-01-26T15:45:04Z

airbyte-cdk/python/airbyte_cdk/sources/concurrent_source/concurrent_source_adapter.py

@@ -58,6 +59,6 @@ def _select_abstract_streams(self, config: Mapping[str, Any], configured_catalog
                    f"The stream {configured_stream.stream.name} no longer exists in the configuration. "
                    f"Refresh the schema in replication settings and remove this stream from future sync attempts."
                )
-            if isinstance(stream_instance, StreamFacade):
+            if isinstance(stream_instance, (StreamFacade, FileBasedStreamFacade)):


I haven't checked much yet, but it seems odd that the CCDK knows about the FCDK and vice versa. Given that, it feels like both will be tightly coupled.

Agree but I'd like to address this separately (see comment below).

Update: I actually decided to pull the change in from a comment in the incremental sync PR (this commit).
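One way to remove the two-sided coupling discussed in this thread is a shared base type that the adapter checks against, which is what the later commit title "Make stream facades implement AbstractStreamFacade" suggests. The sketch below is a simplified assumption of that shape, not the merged code: the class bodies, the `get_underlying_stream` method, and the `select_abstract_streams` helper are illustrative.

```python
from abc import ABC, abstractmethod
from typing import Any, List


class AbstractStreamFacade(ABC):
    """Illustrative base type: anything that wraps a concurrent stream."""

    @abstractmethod
    def get_underlying_stream(self) -> Any:
        """Return the wrapped concurrent stream (method name is an assumption)."""


class StreamFacade(AbstractStreamFacade):
    def __init__(self, stream: Any) -> None:
        self._stream = stream

    def get_underlying_stream(self) -> Any:
        return self._stream


class FileBasedStreamFacade(AbstractStreamFacade):
    def __init__(self, stream: Any) -> None:
        self._stream = stream

    def get_underlying_stream(self) -> Any:
        return self._stream


def select_abstract_streams(stream_instances: List[Any]) -> List[Any]:
    # The adapter only needs the base type; it no longer has to import or
    # name the file-based facade explicitly.
    return [
        instance.get_underlying_stream()
        for instance in stream_instances
        if isinstance(instance, AbstractStreamFacade)
    ]
```

The isinstance check then stays the same no matter how many facade flavors are added later.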

maxi297 · 2024-01-26T15:49:53Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/file_based_source.py

        try:
            parsed_config = self._get_parsed_config(config)
            self.stream_reader.config = parsed_config
-            streams: List[Stream] = []
+            streams: List[AbstractFileBasedStream] = []


Why change the type here?

We specifically require file-based streams since they have additional methods. I think this makes sense since this is a file-based source - probably should have been this way from the beginning.

maxi297 · 2024-01-26T15:59:42Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/concurrent/adapters.py

+
+
+@deprecated("This class is experimental. Use at your own risk.")
+class FileBasedStreamFacade(Stream):


There seems to be a lot of duplicated code between this class and StreamFacade. Is there a reason why we don't instantiate a StreamFacade for the FCDK instead of creating a new one? For each item identified, can we make it part of the StreamFacade?

I see:

primary_key being different without being sure why the implementation that calls as_airbyte_stream would not be sufficient

read_records using state = str(self._cursor.state) instead of state = self._cursor.state which seems odd to me

I hear you on the duplicated code. I'm unable to use StreamFacade here because there are file-specific methods and types that are required by the callers.

I've played around a lot with this to try to clean it up (e.g. creating an interface that both StreamFacade and FileBasedStreamFacade can use, avoiding the StreamFacade-like pattern altogether, and avoiding use of DefaultStream), but did not come up with something that I felt worked with the existing framework.

What I'd like to do is keep this as-is and consider a separate refactor that would make the concurrency interface a little more flexible, but I'd prefer not to block this on it. In the meantime, I'm not worried about having a parallel file-based flow with some duplicated code, since it still fits into the concurrency patterns that we've put in place rather than reinventing them from scratch.

@clnoll can you create an issue specifying exactly what problems the refactor should resolve so we don't lose track of it?

The bit I'm particularly surprised about is that the file-based classes need to know about the stream facade. Maybe FileBasedStreamFacade should implement AbstractFileBasedStream? This way, the mess is kept to the facade and the file-based framework doesn't know anything about it.

Update: I was at least able to get rid of the need for DefaultFileBasedStream.

@girarda I wrote that last comment before I saw yours.

Good call re having FileBasedStreamFacade implement AbstractFileBasedStream; I made that update and it cleaned up some of the issues with types.
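The idea of having the facade implement the file-based interface can be sketched as follows. This is a minimal, assumed shape: the real AbstractFileBasedStream has a much larger surface, and `InMemoryFileBasedStream` and `describe` are hypothetical helpers added only for illustration.

```python
from abc import ABC, abstractmethod
from typing import List


class AbstractFileBasedStream(ABC):
    """Simplified stand-in for the file-based stream interface."""

    @property
    @abstractmethod
    def name(self) -> str:
        ...

    @abstractmethod
    def list_files(self) -> List[str]:
        ...


class InMemoryFileBasedStream(AbstractFileBasedStream):
    """Hypothetical concrete stream used only for this example."""

    def __init__(self, name: str, files: List[str]) -> None:
        self._name = name
        self._files = files

    @property
    def name(self) -> str:
        return self._name

    def list_files(self) -> List[str]:
        return list(self._files)


class FileBasedStreamFacade(AbstractFileBasedStream):
    """Facade that satisfies the same interface by delegating to the wrapped stream."""

    def __init__(self, wrapped: AbstractFileBasedStream) -> None:
        self._wrapped = wrapped

    @property
    def name(self) -> str:
        return self._wrapped.name

    def list_files(self) -> List[str]:
        return self._wrapped.list_files()


def describe(streams: List[AbstractFileBasedStream]) -> List[str]:
    # File-based code is typed against the interface only, so it never
    # needs to know whether it was handed a facade.
    return [f"{s.name}: {len(s.list_files())} file(s)" for s in streams]
```

Because callers depend only on the abstract type, a list annotated as `List[AbstractFileBasedStream]` can hold plain streams and facades alike, which is consistent with the type change discussed for file_based_source.py.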

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/concurrent/cursor.py

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/default_file_based_stream.py

girarda

Looks good! The main question is whether we can avoid leaking the facade into the file_based module.

airbyte-cdk/python/airbyte_cdk/sources/file_based/file_based_source.py

girarda · 2024-01-28T20:12:19Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/concurrent/adapters.py

+
+
+@deprecated("This class is experimental. Use at your own risk.")
+class FileBasedStreamFacade(Stream):


@clnoll can you create an issue specifying exactly what problems the refactor should resolve so we don't lose track of it?

The bit I'm particularly surprised about is that the file-based classes need to know about the stream facade. Maybe FileBasedStreamFacade should implement AbstractFileBasedStream? This way, the mess is kept to the facade and the file-based framework doesn't know anything about it.

girarda · 2024-01-28T20:15:59Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/concurrent/adapters.py

+                state = str(self._cursor.state)
+            else:
+                # This shouldn't happen if the ConcurrentCursor was used
+                state = "unknown; no state attribute was available on the cursor"


Should this fail loudly?

This was copied over from the non-file-based adapters but is no longer needed now that we're inheriting from AbstractFileBasedStream.
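For context on the "fail loudly" question, the two behaviors can be contrasted in a small sketch. The function names and the choice of TypeError are assumptions made for illustration; only the fallback string is taken from the diff above.

```python
from typing import Any


def describe_cursor_state_lenient(cursor: Any) -> str:
    # Mirrors the fallback in the diff: a missing attribute is hidden behind
    # a placeholder string that only shows up later, in logs or messages.
    if hasattr(cursor, "state"):
        return str(cursor.state)
    return "unknown; no state attribute was available on the cursor"


def describe_cursor_state_strict(cursor: Any) -> str:
    # "Fail loudly": surface the programming error at the point of use.
    if not hasattr(cursor, "state"):
        raise TypeError(
            f"{type(cursor).__name__} does not expose a 'state' attribute"
        )
    return str(cursor.state)
```

When the cursor type is guaranteed to expose state, neither branch is needed, which matches the reply that the fallback was dropped.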

maxi297

I haven't played with/manually tested the solution but from reading the code, I don't see any issue. 🚢

maxi297 · 2024-01-29T13:49:48Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/concurrent/adapters.py

+        return f"FileBasedStreamPartition({self._stream.name}, {self._slice})"
+
+
+class FileBasedStreamPartitionGenerator(PartitionGenerator):


Should we have unit tests for the classes in this file?

They do have coverage via scenario-based tests, but unit tests are a good idea too. Added.
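A unit test for the `__repr__` shown in the diff above might look like the sketch below. The partition class here is a simplified stand-in whose constructor signature is assumed, and the test is illustrative rather than one of the tests actually added in this PR.

```python
import unittest
from types import SimpleNamespace
from typing import Any, Mapping, Optional


class FileBasedStreamPartition:
    """Simplified stand-in; only __repr__ mirrors the diff above."""

    def __init__(self, stream: Any, _slice: Optional[Mapping[str, Any]]) -> None:
        self._stream = stream
        self._slice = _slice

    def __repr__(self) -> str:
        return f"FileBasedStreamPartition({self._stream.name}, {self._slice})"


class PartitionReprTest(unittest.TestCase):
    def test_repr_includes_stream_name_and_slice(self) -> None:
        # SimpleNamespace is a lightweight fake that only provides .name.
        stream = SimpleNamespace(name="stream1")
        partition = FileBasedStreamPartition(stream, {"files": ["a.csv"]})
        self.assertEqual(
            repr(partition),
            "FileBasedStreamPartition(stream1, {'files': ['a.csv']})",
        )
```

Scenario-based tests exercise the whole read path, while a test like this pins down one class in isolation, so a failure points directly at the adapter.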

maxi297 · 2024-01-29T13:52:13Z

airbyte-cdk/python/airbyte_cdk/sources/file_based/stream/concurrent/adapters.py

+    def generate(self) -> Iterable[FileBasedStreamPartition]:
+        pending_partitions = []
+        for _slice in self._stream.stream_slices(sync_mode=self._sync_mode, cursor_field=self._cursor_field, stream_state=self._state):
+            if _slice is not None:


In which case can this occur? What does it mean? It feels odd that the stream is creating a slice but we won't read records for that slice.

The signature for stream_slices is Iterable[Optional[Mapping[str, Any]]] so we have to handle that case. I don't expect it to ever be None for file-based.
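The None check follows directly from that signature. A minimal sketch of the filtering, using a hypothetical helper name (`non_empty_slices`) rather than the generator's actual code:

```python
from typing import Any, Iterable, Iterator, Mapping, Optional


def non_empty_slices(
    slices: Iterable[Optional[Mapping[str, Any]]],
) -> Iterator[Mapping[str, Any]]:
    """Yield only the slices that are not None."""
    for _slice in slices:
        # Compare against None explicitly: an empty mapping is falsy but is
        # still a slice and must not be dropped.
        if _slice is not None:
            yield _slice
```

The explicit `is not None` comparison matters because a truthiness check would also discard empty mappings.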

sentry-io · 2024-01-30T17:09:14Z

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

‼️ airbyte_cdk.sources.file_based.exceptions.SchemaInferenceError: Error inferring schema from files. Are the files valid? Contact Support if you need assistance. /usr/local/lib/python3.9/site-packages/airbyte_... View Issue
‼️ RuntimeError: No sync mode was found for bounce. /usr/local/lib/python3.9/site-packages/airbyte_... View Issue


clnoll requested a review from a team as a code owner January 22, 2024 15:13

clnoll marked this pull request as draft January 22, 2024 15:13

octavia-squidington-iii added area/connectors Connector related issues CDK Connector Development Kit connectors/source/s3 labels Jan 22, 2024

octavia-squidington-iv requested a review from a team January 22, 2024 15:16

octavia-squidington-iii removed the area/connectors Connector related issues label Jan 22, 2024

clnoll force-pushed the fcdk-with-ccdk branch from ff0cb0b to 741e1b6 Compare January 23, 2024 01:11

clnoll requested review from girarda and maxi297 January 23, 2024 02:08

clnoll marked this pull request as ready for review January 23, 2024 02:08

maxi297 reviewed Jan 26, 2024

clnoll force-pushed the fcdk-with-ccdk branch from 01b2cc5 to fbce5ce Compare January 28, 2024 19:30

girarda reviewed Jan 28, 2024

clnoll mentioned this pull request Jan 28, 2024

Source S3: updates for compatibility with the concurrent CDK #34591

Merged

clnoll force-pushed the fcdk-with-ccdk branch from defff5d to bcef56f Compare January 29, 2024 02:05

clnoll requested review from girarda and maxi297 January 29, 2024 13:55

maxi297 approved these changes Jan 29, 2024

girarda approved these changes Jan 29, 2024

clnoll added 7 commits January 29, 2024 19:14

File-based CDK: make full refresh concurrent

1e92032

Revert S3 changes; these will be done in a separate PR

a8707c4

Some reorg

9c02d8c

Actually use concurrent source

966fa6f

mypy fix

3e1d853

Use DefaultStream instead of DefaultFileBasedStream

6c1d3d4

Make sources set concurrency level

670cab6

clnoll added 4 commits January 29, 2024 19:14

Make FileBasedStreamFacade implement AbstractFileBasedStream

9932817

CR comments

a526357

Make stream facades implement AbstractStreamFacade

8bf3263

reorg

afb9578

clnoll force-pushed the fcdk-with-ccdk branch from c9b2752 to afb9578 Compare January 30, 2024 00:14

clnoll merged commit eb31e4d into master Jan 30, 2024
22 checks passed

clnoll deleted the fcdk-with-ccdk branch January 30, 2024 00:33

clnoll added a commit that referenced this pull request Jan 30, 2024

Pin file-based sources to airbyte-cdk version 0.59.2 until they can be

1e12d90

made compatible with #34411

clnoll mentioned this pull request Jan 30, 2024

Pin file-based sources to airbyte-cdk version 0.59.2 #34661

Merged

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 21, 2024

File-based CDK: make full refresh concurrent (airbytehq#34411)

ffb9cbc

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024

File-based CDK: make full refresh concurrent (airbytehq#34411)

cc4f2e3

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024

File-based CDK: make full refresh concurrent (airbytehq#34411)

197771c

jatinyadav-cc pushed a commit to ollionorg/datapipes-airbyte that referenced this pull request Feb 26, 2024

File-based CDK: make full refresh concurrent (airbytehq#34411)

a127f92
