AirbyteLib: add null cache and null writer #34587

aaronsteers · 2024-01-28T03:13:26Z

Note: This should merge after AirbyteLib: Progress Printer #34588.

This is born out of a desire to better understand performance bottlenecks. This PR adds a "NullCache", which does nothing at all with incoming records.

The goal with this is to be able to get performance benchmarks on a source connector, with little or no slowdown from AirbyteLib, the file writer, and/or the SQL cache.

alafanechere · 2024-01-29T13:09:11Z

airbyte-lib/airbyte_lib/_file_writers/__init__.py

Should a _null_writers package be introduced instead of adding this writer to _file_writers?

alafanechere · 2024-01-29T13:10:27Z

airbyte-lib/airbyte_lib/_file_writers/null.py

+class NullWriter(FileWriterBase):
+    """A Null (no-op) file writer implementation."""
+
+    config_class = NullWriterConfig


config_class: Final[ClassVar] = NullWriterConfig
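
For illustration, a minimal sketch of annotating a class-level attribute so that type checkers treat it as belonging to the class rather than to instances. It uses plain `ClassVar` on its own, since PEP 591 says a variable should not be annotated with both `ClassVar` and `Final`. The classes below are simplified stand-ins (assumptions), not the real AirbyteLib classes.

```python
from typing import ClassVar


class FileWriterConfigBase:
    """Stand-in for a writer config base class (hypothetical)."""


class NullWriterConfig(FileWriterConfigBase):
    """Stand-in config for the no-op writer (hypothetical)."""


class FileWriterBase:
    """Stand-in base class; subclasses override `config_class`."""

    # Class-level attribute: shared by all instances, not set per instance.
    config_class: ClassVar[type[FileWriterConfigBase]] = FileWriterConfigBase


class NullWriter(FileWriterBase):
    """A null (no-op) file writer."""

    config_class: ClassVar[type[FileWriterConfigBase]] = NullWriterConfig
```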

alafanechere · 2024-01-29T13:11:33Z

airbyte-lib/airbyte_lib/_file_writers/null.py

+        stream_name: str,
+        batch_id: str | None = None,  # ULID of the batch


If the params are not used, can you just accept *args, **kwargs?
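
A rough sketch of what this suggestion could look like. The class and the method name `write_batch` are hypothetical stand-ins, since the real method signature is only partly visible in the diff above.

```python
from typing import Any


class NullWriter:
    """Simplified stand-in for the PR's no-op writer (hypothetical)."""

    def write_batch(self, *args: Any, **kwargs: Any) -> None:
        """Accept any arguments and discard them without doing any work."""
        return None


writer = NullWriter()
# Positional and keyword arguments are all accepted and ignored.
result = writer.write_batch("users", batch_id=None, record_batch=[{"id": 1}])
```

The trade-off is that the signature no longer documents which arguments callers are expected to pass.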

alafanechere · 2024-01-29T13:53:28Z

airbyte-lib/airbyte_lib/_file_writers/null.py

+    ) -> FileWriterBatchHandle:
+        """Process a record batch.
+
+        Return the path to the cache file.


This is not returning the path to a cache file but a FileWriterBatchHandle, if I'm not mistaken.

alafanechere · 2024-01-29T13:54:30Z

airbyte-lib/airbyte_lib/_file_writers/null.py

+        Return the path to the cache file.
+        """
+        _ = batch_id, record_batch  # unused
+        output_file_path = self.get_new_cache_file_path(stream_name)


Is the call to get_new_cache_file_path necessary here, given that it's returning a dummy object?

alafanechere · 2024-01-29T13:55:50Z

airbyte-lib/airbyte_lib/_file_writers/null.py

+
+        batch_handle = FileWriterBatchHandle()
+        batch_handle.files.append(output_file_path)
+        return batch_handle


I'm not sure I get the purpose of returning a mutated FileWriterBatchHandle, as its content will be a dummy path. Could we create a NullWriterBatchHandle instead?
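
One possible shape for such a handle, as a sketch only. `FileWriterBatchHandle` is reduced here to a stand-in with just the `files` list seen in the diff, and `NullWriterBatchHandle` is the hypothetical class proposed in this comment, not existing AirbyteLib code.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class FileWriterBatchHandle:
    """Stand-in for the real handle: tracks files written for a batch."""

    files: list[Path] = field(default_factory=list)


@dataclass
class NullWriterBatchHandle(FileWriterBatchHandle):
    """Hypothetical handle for a no-op writer: never references any file."""

    def __post_init__(self) -> None:
        # Guard against accidentally attaching (dummy) paths at construction.
        if self.files:
            raise ValueError("A null batch handle cannot reference files.")


handle = NullWriterBatchHandle()
```

With a handle like this, the writer would not need to generate a dummy cache file path at all.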

alafanechere · 2024-01-29T13:57:24Z

airbyte-lib/airbyte_lib/_file_writers/null.py

+
+    def _table_exists(self, table_name: str) -> bool:
+        """Check if a table exists."""
+        _ = table_name


Why are you assigning to _ here (and in other places) if the parameter values are not used?

alafanechere · 2024-01-29T14:12:31Z

airbyte-lib/airbyte_lib/caches/null.py

+    """A DuckDB implementation of the cache.
+
+    Parquet is used for local file storage before bulk loading.
+    Unlike the Snowflake implementation, we can't use the COPY command to load data
+    so we insert as values instead.


I think this docstring is not up to date

alafanechere · 2024-01-29T14:13:12Z

airbyte-lib/airbyte_lib/caches/null.py

+    def _execute_sql(self, sql: str) -> None:
+        """Execute SQL."""
+        _ = sql
+        # Do nothing


I'd appreciate an explicit "return None"

alafanechere · 2024-01-29T14:14:08Z

airbyte-lib/airbyte_lib/caches/null.py

+        pass
+
+    @overrides
+    def _write_files_to_new_table(self, files: list[Path], stream_name: str, batch_id: str) -> str:


Please change the type hint if None is returned here.
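
A sketch of how the last two comments could be addressed together: an explicit `return None` and return hints that match what the no-op methods actually return. `CacheBase` is a hypothetical stand-in, and widening its return hint to `str | None` is an assumption about how the base class would need to change for the override to stay type-compatible.

```python
from __future__ import annotations

from pathlib import Path


class CacheBase:
    """Stand-in for a SQL cache base class (hypothetical)."""

    def _execute_sql(self, sql: str) -> None:
        raise NotImplementedError

    def _write_files_to_new_table(
        self, files: list[Path], stream_name: str, batch_id: str
    ) -> str | None:
        # Assumed widened hint: real caches return a table name, null returns None.
        raise NotImplementedError


class NullCache(CacheBase):
    """No-op cache: every operation is accepted and discarded."""

    def _execute_sql(self, sql: str) -> None:
        return None  # Explicitly a no-op.

    def _write_files_to_new_table(
        self, files: list[Path], stream_name: str, batch_id: str
    ) -> None:
        return None  # No table is created, so there is no name to return.


cache = NullCache()
```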

flash1293 · 2024-01-29T14:18:53Z

Based on the stated goal of this PR, the best you should be able to do is to just use list(source.get_records("...")). It's basically "with little or no slowdown from AirbyteLib, the file writer, and/or the SQL cache."

I'm always for taking a chance to not write some code :)
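
A minimal sketch of that kind of measurement. `fake_records` is a hypothetical stand-in for `source.get_records("...")` so the snippet runs without a connector installed, and it counts records with `sum(...)` rather than `list(...)` to avoid holding every record in memory.

```python
import time
from collections.abc import Iterable, Iterator
from typing import Any


def fake_records(count: int) -> Iterator[dict[str, Any]]:
    """Hypothetical stand-in for `source.get_records("...")`."""
    for i in range(count):
        yield {"id": i}


def measure_throughput(records: Iterable[dict[str, Any]]) -> tuple[int, float]:
    """Drain the iterable and return (record count, elapsed seconds)."""
    start = time.perf_counter()
    count = sum(1 for _ in records)  # Consume every record, keep none.
    elapsed = time.perf_counter() - start
    return count, elapsed


count, elapsed = measure_throughput(fake_records(10_000))
print(f"{count} records in {elapsed:.4f}s")
```

Swapping `fake_records(10_000)` for a real record iterator would time the source alone, with no writer or cache in the path.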

natikgadzhi · 2024-04-20T23:04:33Z

I'm closing because I assume you already made this change in the PyAirbyte repo. If not, apologies! Seems like a huge lift to rebase and rename anyway.

aaronsteers added 30 commits January 26, 2024 09:59

new exception type: AirbyteConnectorNotRegisteredError

abbb256

make constructors more resilient

3845f5c

print stderr in exception text, cleanup failed install, remove editable flag during install

9fccace

move auto-install out of venv constructor, for easier debugging

a217a6e

add test to assert that install failure includes pip log text

6aa85d6

update docs

dddbc78

auto-format

b1d966b

update docs

f61152a

refactor version handling, control for side effects

d665088

fix exception handling in _get_installed_version()

809918b

fix tests

4a41ffb

improve thread safety

bab5e06

handle quoted spaces in pip_url

10ce077

fix import sorts

063bba3

standalone validate_config() method

ab75be4

add Source.yaml_spec property

8880b0b

make _yaml_spec a protected member

3773149

fix too-limited json package_data glob

90918c8

basic progress reporting

5f0bcb3

remove raw=True

9ed1929

add progress tracker class

ec4d8dd

update docs

5d4eb45

bug fixes

a164168

fix separator

0c62f04

bug fixes

bec8d11

bug fix

df24520

improved logs

6cc9f50

fix progress bugs, add unit tests

6d6708c

add reset() at beginning of sync

bf11816

add null cache and null writer

f28e88c

aaronsteers added 2 commits January 28, 2024 08:39

fix tests, make tests more flexible

0b6895b

reorder import

6675233

aaronsteers mentioned this pull request Jan 28, 2024

AirbyteLib: DuckDB Perf Boost #34589

Merged

Merge branch 'aj/airbyte-lib/progress-print' into aj/airbyte-lib/null-cache

66b73ab

aaronsteers commented Jan 28, 2024 on airbyte-lib/airbyte_lib/_file_writers/null.py (outdated, resolved)

Update airbyte-lib/airbyte_lib/_file_writers/null.py

c913159

aaronsteers commented Jan 28, 2024 on airbyte-lib/airbyte_lib/_file_writers/null.py (outdated, resolved)

fix docstrings

ca2f1a2

octavia-squidington-iv requested review from a team January 28, 2024 16:59

aaronsteers added the airbyte-lib Related to AirbyteLib label Jan 28, 2024

aaronsteers added 11 commits January 28, 2024 09:05

fix missing copyright str

9197728

docstring

f73f288

update docs

dd9ac99

revert source-github change

a2bed01

updated comment

2e49154

Merge branch 'aj/airbyte-lib/install-failure-handling' into aj/airbyte-lib/progress-print

0772a68

remove redundant strings

f975282

Merge remote-tracking branch 'origin/master' into aj/airbyte-lib/install-failure-handling

ace7208

update docs (removes empty cloud page)

8775c1b

Merge branch 'aj/airbyte-lib/install-failure-handling' into aj/airbyte-lib/progress-print

34cb485

Merge remote-tracking branch 'origin/aj/airbyte-lib/progress-print' into aj/airbyte-lib/null-cache

edbdfab

vercel bot deployed to Preview January 28, 2024 20:03

alafanechere requested changes Jan 29, 2024

aaronsteers marked this pull request as draft January 30, 2024 06:17

Base automatically changed from aj/airbyte-lib/progress-print to master January 30, 2024 06:39

aaronsteers mentioned this pull request Feb 6, 2024

🐛 Bug: Read from GitHub is *really* slow (~3 records per second) airbytehq/PyAirbyte#13

Closed

natikgadzhi closed this Apr 20, 2024
