Destinations V2 - T&D Streams in parallel #30020

Merged
merged 31 commits into from Sep 6, 2023

@jbfbell jbfbell commented Aug 31, 2023

We were running typing and deduping operations per stream sequentially. This change makes them execute concurrently.

@evantahler @edgao @cynthiaxyin: One thing I have not had time to track down is why Snowflake does not show queries in the logs despite seemingly doing the tasks.

In my testing with source faker this was a little bit faster: the Snowflake sync that was taking 45-49 seconds took 30-37 seconds, and the BigQuery sync that was taking 1m40s-1m45s took 1m30s.

I have not yet set up a test with many streams, which is what exacerbates the sequential T&D issue - will set this up tomorrow.

UPDATE

After testing changes locally for BigQuery GCS Staging, I noticed a small speedup, 25 min -> 20 min on 30 streams. By removing incremental typing and deduping I saw a significant speed improvement, 25 min -> 11 min. This convinces me that we should hold off on incremental typing and deduping until we have a better story around it.

While testing Snowflake, I encountered several OOM errors; however, the syncs do eventually succeed. I saw the same behavior with the non-parallel T&D, so I'm inclined to think this is an existing bug.

I created a platform PR which allows us to configure the number of threads used by typing and deduping, but it will fall back to 8 if nothing is provided.
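
For illustration, a minimal sketch of the approach described above: one T&D task per stream is submitted to a fixed-size pool, and the pool size falls back to 8 when nothing is configured. The class name, method names, and the `Optional<Integer>` setting are hypothetical stand-ins, not the actual connector or platform code, and the per-stream work is a placeholder for the real SQL.

```java
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;

// Hypothetical sketch; not the real TyperDeduper implementation.
final class ParallelTypeAndDedupeSketch {

  // Fallback pool size when no thread count is configured (8, per the description above).
  static final int DEFAULT_THREADS = 8;

  // Assumed shape of the setting: an optional positive integer.
  static int resolveThreadCount(Optional<Integer> configured) {
    return configured.filter(n -> n > 0).orElse(DEFAULT_THREADS);
  }

  // Placeholder for the per-stream typing and deduping work.
  static String typeAndDedupeStream(String stream) {
    return stream + ":typed-and-deduped";
  }

  // Submits every stream first, then waits for all of them, so the tasks overlap.
  static List<String> typeAndDedupeAll(List<String> streams, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<CompletableFuture<String>> futures = streams.stream()
          .map(stream -> CompletableFuture.supplyAsync(() -> typeAndDedupeStream(stream), pool))
          .collect(Collectors.toList());
      return futures.stream().map(CompletableFuture::join).collect(Collectors.toList());
    } finally {
      pool.shutdown();
    }
  }
}
```

Collecting the futures into a list before joining matters here: joining inside the same stream pipeline would process one stream at a time and lose the concurrency.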

SECOND UPDATE
The tests for Standard Inserts were failing after this change, so I've removed this for Standard Inserts. SI is already slower than GCS staging, so I think this is acceptable.

@jbfbell jbfbell marked this pull request as ready for review August 31, 2023 05:28
@jbfbell jbfbell requested a review from a team as a code owner August 31, 2023 05:28
@edgao edgao left a comment

supplyAsync makes me nervous, otherwise I think this looks right. Had a question around correctness wrt the final dropStage / T+D interaction, plus a few nits.
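
The comment does not say which aspect of supplyAsync is the concern. Two properties of `CompletableFuture.supplyAsync` that are worth keeping in mind in general: called without an executor argument it runs on the shared `ForkJoinPool.commonPool()`, and an exception thrown inside the task only surfaces later, wrapped in a `CompletionException`, when the future is joined. The sketch below passes an explicit pool and collects per-stream failures instead of stopping at the first one. All names are hypothetical and this is not the code under review.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.function.Supplier;

// Hypothetical sketch of failure handling around supplyAsync.
final class SupplyAsyncSketch {

  // Runs one task per stream on the given pool and returns "stream: message"
  // for every task that threw, in submission order.
  static List<String> runAll(Map<String, Supplier<String>> tasksByStream, ExecutorService pool) {
    Map<String, CompletableFuture<String>> futures = new LinkedHashMap<>();
    // Explicit executor: the tasks do not land on the shared common pool.
    tasksByStream.forEach((stream, task) ->
        futures.put(stream, CompletableFuture.supplyAsync(task, pool)));

    List<String> failures = new ArrayList<>();
    futures.forEach((stream, future) -> {
      try {
        future.join();
      } catch (CompletionException e) {
        // join() wraps the task's original exception; unwrap it for reporting.
        failures.add(stream + ": " + e.getCause().getMessage());
      }
    });
    return failures;
  }
}
```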

@jbfbell jbfbell requested a review from a team as a code owner September 1, 2023 02:49
edgao commented Sep 1, 2023

previous commit timed out on bigquery integration tests. I think we need to call cleanup in the standard inserts record consumer as well. will push that commit in a few minutes.

jbfbell commented Sep 1, 2023

> previous commit timed out on bigquery integration tests. I think we need to call cleanup in the standard inserts record consumer as well. will push that commit in a few minutes.

:doh: good catch, thanks @edgao

@@ -104,6 +104,7 @@ public static OnCloseFunction onCloseFunction(final JdbcDatabase database,
// After moving data from staging area to the target table (airybte_raw) clean up the staging
// area (if user configured)
log.info("Cleaning up destination started for {} streams", writeConfigs.size());
typerDeduper.typeAndDedupe();

@benmoriceau / @tryangul fyi we're tweaking the T+D interface again 😅. There's now a typeAndDedupe() method, which runs T+D on all streams, along with a new cleanup method (under the hood, it shuts down a thread pool; it must run at the end of a sync to allow the process to exit).
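
A rough sketch of the lifecycle this describes, assuming a class that owns its thread pool. `SketchTyperDeduper` and its fields are hypothetical; only the method names `typeAndDedupe()` and `cleanup()` come from the comment above. The reason cleanup is mandatory: threads created by `Executors.newFixedThreadPool` are non-daemon, so the JVM will not exit while the pool is still alive.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

// Hypothetical sketch; the real interface and implementation may differ.
final class SketchTyperDeduper {

  // Pool size of 8 is an assumption borrowed from the fallback mentioned in the PR description.
  private final ExecutorService pool = Executors.newFixedThreadPool(8);
  private final List<String> streams;
  // Records which streams were processed; stands in for the real T+D side effects.
  final List<String> processed = new CopyOnWriteArrayList<>();

  SketchTyperDeduper(List<String> streams) {
    this.streams = streams;
  }

  // Runs T+D on all streams concurrently and returns once every stream is done.
  void typeAndDedupe() {
    List<CompletableFuture<Void>> futures = streams.stream()
        .map(stream -> CompletableFuture.runAsync(() -> processed.add(stream), pool))
        .collect(Collectors.toList());
    futures.forEach(CompletableFuture::join);
  }

  // Must run at the end of a sync: shuts the pool down so its threads stop blocking JVM exit.
  void cleanup() {
    pool.shutdown();
    try {
      pool.awaitTermination(1, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  boolean isShutDown() {
    return pool.isShutdown();
  }
}
```

A caller would typically put `cleanup()` in a `finally` block so the pool is shut down even when `typeAndDedupe()` throws.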

@edgao edgao disabled auto-merge September 1, 2023 21:10
github-actions bot commented Sep 1, 2023

destination-bigquery test report (commit 49246f89a8) - ✅

⏲️ Total pipeline duration: 14mn53s

Step Result
Java Connector Unit Tests
Build connector tar
Build destination-bigquery docker image for platform linux/x86_64
Java Connector Integration Tests
Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml
Connector version semver check
Connector version increment check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=destination-bigquery test

github-actions bot commented Sep 1, 2023

destination-snowflake test report (commit 49246f89a8) - ✅

⏲️ Total pipeline duration: 14mn16s

Step Result
Java Connector Unit Tests
Build connector tar
Build destination-snowflake docker image for platform linux/x86_64
Java Connector Integration Tests
Validate airbyte-integrations/connectors/destination-snowflake/metadata.yaml
Connector version semver check
Connector version increment check
QA checks

🔗 View the logs here

☁️ View runs for commit in Dagger Cloud

Please note that tests are only run on PR ready for review. Please set your PR to draft mode to not flood the CI engine and upstream service on following commits.
You can run the same pipeline locally on this branch with the airbyte-ci tool with the following command

airbyte-ci connectors --name=destination-snowflake test

github-actions bot commented Sep 5, 2023

destination-snowflake test report (commit ca5fa0ab62) - ✅

⏲️ Total pipeline duration: 13mn45s

Step Result
Java Connector Unit Tests
Build connector tar
Build destination-snowflake docker image for platform linux/x86_64
Java Connector Integration Tests
Validate airbyte-integrations/connectors/destination-snowflake/metadata.yaml
Connector version semver check
Connector version increment check
QA checks

github-actions bot commented Sep 5, 2023

destination-bigquery test report (commit ca5fa0ab62) - ✅

⏲️ Total pipeline duration: 13mn44s

Step Result
Java Connector Unit Tests
Build connector tar
Build destination-bigquery docker image for platform linux/x86_64
Java Connector Integration Tests
Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml
Connector version semver check
Connector version increment check
QA checks

edgao commented Sep 5, 2023

ran a test to dest-bigquery with 11 streams and got this error during table setup:

Exceeded rate limits: too many dataset metadata update operations for this dataset.

... which I had to go into the bq cli for, because job.waitFor() actually never terminated, so the sync just got stuck. Very exciting.
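
One generic way to keep a blocking wait from hanging a sync forever is to run it on a separate thread and give up after a deadline. The sketch below uses only the JDK; `BoundedWaitSketch` and `callWithTimeout` are hypothetical names, and the `Callable` stands in for whatever blocking call is being guarded. It is not the fix applied in this PR.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of bounding a blocking call with a timeout.
final class BoundedWaitSketch {

  // Returns the call's result, or Optional.empty() if it did not finish within the timeout.
  static <T> Optional<T> callWithTimeout(Callable<T> blockingCall, Duration timeout) {
    ExecutorService executor = Executors.newSingleThreadExecutor();
    Future<T> future = executor.submit(blockingCall);
    try {
      return Optional.ofNullable(future.get(timeout.toMillis(), TimeUnit.MILLISECONDS));
    } catch (TimeoutException e) {
      // Best effort: this only interrupts the worker. A call that ignores
      // interruption keeps running in the background.
      future.cancel(true);
      return Optional.empty();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("blocking call failed", e.getCause());
    } finally {
      executor.shutdownNow();
    }
  }
}
```

With a guard like this, the caller can turn an empty result into an explicit error instead of leaving the sync stuck.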

github-actions bot commented Sep 5, 2023

destination-bigquery test report (commit fb48e50a47) - ✅

⏲️ Total pipeline duration: 15mn01s

Step Result
Java Connector Unit Tests
Build connector tar
Build destination-bigquery docker image for platform linux/x86_64
Java Connector Integration Tests
Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml
Connector version semver check
Connector version increment check
QA checks

github-actions bot commented Sep 5, 2023

destination-snowflake test report (commit fb48e50a47) - ✅

⏲️ Total pipeline duration: 03mn05s

Step Result
Java Connector Unit Tests
Build connector tar
Build destination-snowflake docker image for platform linux/x86_64
Java Connector Integration Tests
Validate airbyte-integrations/connectors/destination-snowflake/metadata.yaml
Connector version semver check
Connector version increment check
QA checks

edgao commented Sep 5, 2023

snowflake prerelease publish running here https://github.com/airbytehq/airbyte/actions/runs/6090550929

github-actions bot commented Sep 6, 2023

destination-snowflake test report (commit 232b2cfc0f) - ✅

⏲️ Total pipeline duration: 14mn42s

Step Result
Java Connector Unit Tests
Build connector tar
Build destination-snowflake docker image for platform linux/x86_64
Java Connector Integration Tests
Validate airbyte-integrations/connectors/destination-snowflake/metadata.yaml
Connector version semver check
Connector version increment check
QA checks

github-actions bot commented Sep 6, 2023

destination-bigquery test report (commit 232b2cfc0f) - ❌

⏲️ Total pipeline duration: 14mn12s

Step Result
Java Connector Unit Tests
Build connector tar
Build destination-bigquery docker image for platform linux/x86_64
Java Connector Integration Tests
Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml
Connector version semver check
Connector version increment check
QA checks

github-actions bot commented Sep 6, 2023

destination-bigquery test report (commit 232b2cfc0f) - ✅

⏲️ Total pipeline duration: 10mn00s

Step Result
Java Connector Unit Tests
Build connector tar
Build destination-bigquery docker image for platform linux/x86_64
Java Connector Integration Tests
Validate airbyte-integrations/connectors/destination-bigquery/metadata.yaml
Connector version semver check
Connector version increment check
QA checks

github-actions bot commented Sep 6, 2023

destination-snowflake test report (commit 232b2cfc0f) - ✅

⏲️ Total pipeline duration: 03mn37s

Step Result
Java Connector Unit Tests
Build connector tar
Build destination-snowflake docker image for platform linux/x86_64
Java Connector Integration Tests
Validate airbyte-integrations/connectors/destination-snowflake/metadata.yaml
Connector version semver check
Connector version increment check
QA checks


@edgao edgao merged commit 11068c1 into master Sep 6, 2023
24 checks passed
@edgao edgao deleted the joseph.bell/28812/t+d-in-parallel branch September 6, 2023 17:04
4 participants