Performance Harness: Support for pseudo-parallel streams #26219

ryankfu · 2023-05-18T05:54:10Z

What

Introduces pseudo-parallel streams in the performance harness to mock the behavior of CDC (Change Data Capture).

How

Uses the same dataset and reuses the same catalog definition for every stream. A random function selects which stream's metadata to attach when generating each AirbyteRecordMessage.
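A minimal sketch of the idea, for readers who want to see the shape of it. This is not the harness code from this PR: the class name, the method names, and the numeric-suffix naming scheme are all assumptions made for illustration. It only shows one dataset being fanned out across N stream names, with a random pick per record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

// Hypothetical illustration only; none of these names come from the harness source.
final class PseudoParallelStreams {

  private PseudoParallelStreams() {}

  /** Derives N stream names from one base name (assumed scheme: base + index). */
  static List<String> buildStreamNames(final String baseName, final int numberOfStreams) {
    if (numberOfStreams < 1) {
      throw new IllegalArgumentException("numberOfStreams must be at least 1");
    }
    final List<String> names = new ArrayList<>(numberOfStreams);
    for (int i = 0; i < numberOfStreams; i++) {
      names.add(baseName + i);
    }
    return names;
  }

  /** Picks, uniformly at random, the stream whose metadata a record should carry. */
  static String pickStream(final List<String> streamNames, final Random random) {
    return streamNames.get(random.nextInt(streamNames.size()));
  }

  /** Assigns recordCount records to streams and returns how many each stream received. */
  static Map<String, Integer> countRecordsPerStream(final List<String> streamNames,
                                                    final int recordCount,
                                                    final Random random) {
    final Map<String, Integer> counts = new TreeMap<>();
    for (int i = 0; i < recordCount; i++) {
      counts.merge(pickStream(streamNames, random), 1, Integer::sum);
    }
    return counts;
  }

  public static void main(final String[] args) {
    final List<String> names = buildStreamNames("users", 11);
    // A fixed seed keeps the run reproducible, which is convenient for a regression test.
    System.out.println(countRecordsPerStream(names, 10_000, new Random(42)));
  }
}
```

Passing the `Random` in as a parameter, rather than creating it inside the method, is what would make a unit test for the random selection deterministic.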

🚨 User Impact 🚨

No breaking changes

davinchia · 2023-05-19T19:21:26Z

airbyte-integrations/connectors-performance/destination-harness/README.md

@@ -4,3 +4,8 @@ Performance harness for destination connectors.

 This component is used by the `/connector-performance` GitHub action and is used in order to test throughput of
 destination connectors on a number of datasets.
+
+Associated files are:
+<li>Main.java - the main entrypoint for the harness


davinchia · 2023-05-19T19:21:41Z

.../destination-harness/src/main/java/io/airbyte/integrations/destination_performance/Main.java

@@ -26,16 +28,23 @@ public class Main {
  private static final String CREDENTIALS_PATH = "secrets/%s_%s_credentials.json";

  public static void main(final String[] args) {
+    // If updating args for Github Actions, also update the run-performance-test.yml file


...mance/destination-harness/src/main/resources/catalogs/destination-snowflake/10m_catalog.json

...n-harness/src/main/java/io/airbyte/integrations/destination_performance/PerformanceTest.java

davinchia · 2023-05-19T19:27:14Z

Ryan, how do we know the random logic works? Is it worth adding a quick test?

ryankfu · 2023-05-19T19:30:29Z

I did a manual test here and confirmed it creates the correct number of streams. I also got the list of tables by looking at the Snowflake connection. Did you want a unit test added for completeness?

EDIT: a quick sniff test is to see which streams Snowflake flushes and whether the count gets up to 11 (the number of streams allocated for this run).

davinchia · 2023-05-19T19:31:10Z

Yes. Probably a good idea to avoid regressions.

rodireich · 2023-05-20T00:24:13Z

.github/workflows/connector-performance-command.yml

-            #### Note: The following `dataset=` values are supported: `1m`<sub>(default)</sub>, `10m`, `20m`, `bottleneck_stream1`
-            > :runner: ${{github.event.inputs.connector}} https://github.com/${{github.repository}}/actions/runs/${{github.run_id}}
+            #### Note: The following `dataset=` values are supported: `1m`<sub>(default)</sub>, `10m`, `20m`, `bottleneck_stream1`.
+            For destination performance only: you can also use `stream-numbers=N` to simulate N number of parallel streams.


@ryankfu Is this ok?

ryankfu · 2023-05-20T00:31:53Z

/connector-performance connector=connectors/destination-snowflake

Note: The following `dataset=` values are supported: `1m` (default), `10m`, `20m`, `bottleneck_stream1`

🏃 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/5029414329

Performance test Result:

at io.airbyte.integrations.destination_performance.Main.main(Main.java:47)

rodireich · 2023-05-20T00:43:41Z

/connector-performance connector=connectors/destination-snowflake ref=ryan/parallel-stream-performance

Note: The following `dataset=` values are supported: `1m` (default), `10m`, `20m`, `bottleneck_stream1`.

For destination performance only: you can also use `stream-numbers=N` to simulate N number of parallel streams.

🏃 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/5029469891

Performance test Result:

{"type":"LOG","log":{"level":"INFO","message":"INFO i.a.i.d.PerformanceHarness(computeThroughput):193 total secs: 66.325. total MB read: 344.5901937484741, rps: 15077.271013946474, throughput: 5.195479739894068
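As a sanity check on the numbers in that log line: they are consistent with throughput being total MB divided by total seconds, and with `rps` being records divided by total seconds for a run of 1,000,000 records. The record count is an inference from the figures, not something the log states, and the helper below is a hypothetical sketch rather than the harness's own `computeThroughput`.

```java
// Hypothetical helper for re-deriving the logged figures; not the harness implementation.
final class ThroughputCheck {

  private ThroughputCheck() {}

  /** Megabytes per second, given the totals reported in the log line. */
  static double megabytesPerSecond(final double totalMegabytes, final double totalSeconds) {
    requirePositive(totalSeconds);
    return totalMegabytes / totalSeconds;
  }

  /** Records per second for an assumed record count. */
  static double recordsPerSecond(final long totalRecords, final double totalSeconds) {
    requirePositive(totalSeconds);
    return totalRecords / totalSeconds;
  }

  private static void requirePositive(final double totalSeconds) {
    if (totalSeconds <= 0) {
      throw new IllegalArgumentException("totalSeconds must be positive");
    }
  }

  public static void main(final String[] args) {
    final double seconds = 66.325;            // "total secs" from the log
    final double megabytes = 344.5901937484741; // "total MB read" from the log
    final long assumedRecords = 1_000_000L;   // assumption: inferred, not logged
    System.out.println("MB/s: " + megabytesPerSecond(megabytes, seconds));
    System.out.println("rps:  " + recordsPerSecond(assumedRecords, seconds));
  }
}
```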

ryankfu · 2023-05-22T19:12:56Z

/approve-and-merge reason="not in critical path and only affect performance harness, also automake is borked"

octavia-approvington · 2023-05-22T19:13:36Z

This is really good

ryankfu · 2023-05-22T20:56:07Z

/connector-performance connector=connectors/destination-snowflake ref=ryan/parallel-stream-performance

Note: The following `dataset=` values are supported: `1m` (default), `10m`, `20m`, `bottleneck_stream1`.

For destination performance only: you can also use `stream-numbers=N` to simulate N number of parallel streams.

🏃 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/5050323466

Performance test Result:

{"type":"LOG","log":{"level":"INFO","message":"INFO i.a.i.d.PerformanceHarness(computeThroughput):193 total secs: 67.631. total MB read: 344.5901937484741, rps: 14786.118791678373, throughput: 5.095151539212404

* Adds support for pseudo-parallel datasets
* Ran ./gradlew :spotlessJavaApply
* Automated Change
* Fixes issue with parallel datasets credentials
* Fixes filter for parallel credentials
* Adds a new configurable property to build a pseudo-parallel catalog
* Fixes Github Actions variable to be processed properly with the K8s harness yaml
* Adds unit test for random streams and generating streams within the same configured catalog
* Ran ./gradlew :spotlessJavaApply
* Added additional description for GitHub Actions
* Update connector-performance-command.yml
  Moved text up to connect with other argument discussion
* Fixes spotBugs issue
* Automated Commit - Formatting Changes
* Update GitHub Action description

---------

Co-authored-by: ryankfu <ryankfu@users.noreply.github.com>
Co-authored-by: Rodi Reich Zilberman <867491+rodireich@users.noreply.github.com>

ryankfu force-pushed the ryan/parallel-stream-performance branch from 3f07f68 to b42380c Compare May 19, 2023 00:53

ryankfu requested a review from rodireich May 19, 2023 04:57

ryankfu marked this pull request as ready for review May 19, 2023 04:57

ryankfu requested review from cgardens and davinchia May 19, 2023 05:08

davinchia reviewed May 19, 2023

...mance/destination-harness/src/main/resources/catalogs/destination-snowflake/10m_catalog.json

davinchia reviewed May 19, 2023

...n-harness/src/main/java/io/airbyte/integrations/destination_performance/PerformanceTest.java

rodireich reviewed May 19, 2023

ryankfu requested a review from rodireich May 19, 2023 23:15

rodireich reviewed May 20, 2023

rodireich approved these changes May 20, 2023

davinchia approved these changes May 20, 2023

ryankfu added 2 commits May 20, 2023 19:01

Adds support for pseudo-parallel datasets

72160c7

Ran ./gradlew :spotlessJavaApply

f33da6d

ryankfu and others added 10 commits May 20, 2023 19:01

Automated Change

7385dbe

Fixes issue with parallel datasets credentials

001dd31

Fixes filter for parallel credentials

bc863ab

Adds a new configurable property to build a pseudo-parallel catalog

910ac3b

Fixes Github Actions variable to be processed properly with the K8s harness yaml

7191727

Adds unit test for random streams and generating streams within the same configured catalog

7160cac

Ran ./gradlew :spotlessJavaApply

f884ab5

Added additional description for GitHub Actions

4729803

Update connector-performance-command.yml

a25b53c

Moved text up to connect with other argument discussion

Fixes spotBugs issue

5c2cd98

ryankfu force-pushed the ryan/parallel-stream-performance branch from d477b11 to 5c2cd98 Compare May 21, 2023 02:02

ryankfu and others added 2 commits May 21, 2023 02:06

Automated Commit - Formatting Changes

9e39792

Update GitHub Action description

3db00f3

octavia-approvington approved these changes May 22, 2023

octavia-approvington merged commit 3689ff3 into master May 22, 2023
12 of 13 checks passed

octavia-approvington deleted the ryan/parallel-stream-performance branch May 22, 2023 19:13

ryankfu restored the ryan/parallel-stream-performance branch May 22, 2023 20:55

ryankfu deleted the ryan/parallel-stream-performance branch May 23, 2023 00:19

ryankfu mentioned this pull request May 30, 2023

[Epic] Destination Snowflake to 10 MB/s #24788

Closed

3 tasks
