Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Source S3: speed up discovery #22500

Merged
merged 5 commits into from
Feb 9, 2023

Conversation

davydov-d
Copy link
Collaborator

@davydov-d davydov-d commented Feb 7, 2023

What

https://github.com/airbytehq/oncall/issues/1470

How

When discover is called, the connector reads all the files one by one and tries to infer the schema of each of them and then merge them into a single final schema. In case there are too many files, it takes much time. This PR introduces a threaded approach to reading the files.

It used to take the connector mentioned in the oncall issue about 1 second to read each of the 3.3k files they have in the bucket + 60-90 seconds to fetch the list of files. Now the whole discover takes about 6.5 minutes

@davydov-d
Copy link
Collaborator Author

davydov-d commented Feb 7, 2023

/test connector=connectors/source-s3

🕑 connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/4116035029
❌ connectors/source-s3 https://github.com/airbytehq/airbyte/actions/runs/4116035029
🐛 https://gradle.com/s/f6misk3mbo5jc

Build Failed

Test summary info:

=========================== short test summary info ============================
FAILED test_core.py::TestBasicRead::test_read[inputs2] - Failed: Please check...
SKIPPED [1] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:98: The previous and actual specifications are identical.
SKIPPED [6] ../usr/local/lib/python3.9/site-packages/connector_acceptance_test/tests/test_core.py:507: The previous and actual discovered catalogs are identical.
============= 1 failed, 107 passed, 7 skipped in 262.87s (0:04:22) =============

@@ -52,6 +55,7 @@ def fileformatparser_map(self) -> Mapping[str, type]:
ab_file_name_col = "_ab_source_file_url"
airbyte_columns = [ab_additional_col, ab_last_mod_col, ab_file_name_col]
datetime_format_string = "%Y-%m-%dT%H:%M:%S%z"
parallel_tasks_size = 256
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

128 results in 6 min 50 seconds
256 results in 6 min 30 seconds
512 ~ almost the same

@davydov-d
Copy link
Collaborator Author

davydov-d commented Feb 7, 2023

☝️ the failing test is not related. It's about the actual and expected schema mismatch (avro)

@octavia-squidington-iii octavia-squidington-iii added the area/frontend Related to the Airbyte webapp label Feb 9, 2023
@davydov-d davydov-d closed this Feb 9, 2023
@davydov-d davydov-d force-pushed the ddavydov/#1470-source-s3-speedup-discovery branch from 9789607 to a70d6e8 Compare February 9, 2023 17:01
@octavia-squidington-iii octavia-squidington-iii removed area/documentation Improvements or additions to documentation area/connectors Connector related issues area/frontend Related to the Airbyte webapp labels Feb 9, 2023
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 9, 2023 17:15 — with GitHub Actions Inactive
@octavia-squidington-iii octavia-squidington-iii temporarily deployed to more-secrets February 9, 2023 17:36 — with GitHub Actions Inactive
@davydov-d davydov-d reopened this Feb 9, 2023
@octavia-squidington-iii octavia-squidington-iii added area/connectors Connector related issues area/documentation Improvements or additions to documentation labels Feb 9, 2023
@davydov-d
Copy link
Collaborator Author

davydov-d commented Feb 9, 2023

/publish connector=connectors/source-s3 run-tests=false

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@davydov-d
Copy link
Collaborator Author

davydov-d commented Feb 9, 2023

/publish connector=connectors/source-s3 run-tests=false

🕑 Publishing the following connectors:
connectors/source-s3
https://github.com/airbytehq/airbyte/actions/runs/4137162039


Connector Did it publish? Were definitions generated?
connectors/source-s3

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@airbyteio airbyteio temporarily deployed to more-secrets February 9, 2023 18:28 — with GitHub Actions Inactive
@airbyteio airbyteio temporarily deployed to more-secrets February 9, 2023 18:28 — with GitHub Actions Inactive
@github-actions
Copy link
Contributor

github-actions bot commented Feb 9, 2023

Airbyte Code Coverage

There is no coverage information present for the Files changed

Total Project Coverage 24.67%

@davydov-d davydov-d temporarily deployed to more-secrets February 9, 2023 19:06 — with GitHub Actions Inactive
@davydov-d davydov-d temporarily deployed to more-secrets February 9, 2023 19:06 — with GitHub Actions Inactive
@davydov-d davydov-d merged commit 3dc79f5 into master Feb 9, 2023
@davydov-d davydov-d deleted the ddavydov/#1470-source-s3-speedup-discovery branch February 9, 2023 19:44
marcosmarxm pushed a commit that referenced this pull request Feb 9, 2023
* #1470 source S3: speed up discovery

* #1470 source s3: upd changelog

* auto-bump connector version

---------

Co-authored-by: Octavia Squidington III <octavia-squidington-iii@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation connectors/source/s3
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants