Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Harness: Support for pseudo-parallel streams #26219

Merged
merged 14 commits into from
May 22, 2023

Conversation

ryankfu
Copy link
Contributor

@ryankfu ryankfu commented May 18, 2023

What

Introduced pseudo-parallel streams to mock the behavior of CDC (Change Data Capture)

How

Uses the same dataset but uses the same catalog across the same dataset and uses a random function to select which Stream to add metadata to when generating the AirbyteRecordMessage

Recommended reading order

  1. PerformanceTest.java - Added random stream assignment for each record
  2. Main.java - Added ability to duplicate the stream catalog to mock having multiple streams
  3. run-harness-process.yaml - Adds configurability for new parameter to have multiple streams
  4. connector-performance-command.yaml - Adds new parameter for Github Actions

🚨 User Impact 🚨

No breaking changes

For connector PRs, use this section to explain which type of semantic versioning bump occurs as a result of the changes. Refer to our Semantic Versioning for Connectors guidelines for more information. Breaking changes to connectors must be documented by an Airbyte engineer (PR author, or reviewer for community PRs) by using the Breaking Change Release Playbook.

If there are breaking changes, please merge this PR with the 🚨🚨 emoji so changelog authors can further highlight this if needed.

Pre-merge Actions

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Connector version is set to 0.0.1
    • Dockerfile has version 0.0.1
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/integrations/<source or destination>/<name>.md including changelog with an entry for the initial version. See changelog example
    • docs/integrations/README.md

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Unit & integration tests added

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
Connector Generator
  • Issue acceptance criteria met
  • PR name follows PR naming conventions
  • If adding a new generator, add it to the list of scaffold modules being tested
  • The generator test modules (all connectors with -scaffold in their name) have been updated with the latest scaffold by running ./gradlew :airbyte-integrations:connector-templates:generator:testScaffoldTemplates then checking in your changes
  • Documentation which references the generator is updated as needed

@ryankfu

This comment was marked as outdated.

@ryankfu

This comment was marked as outdated.

@ryankfu

This comment was marked as outdated.

@ryankfu ryankfu force-pushed the ryan/parallel-stream-performance branch from 3f07f68 to b42380c Compare May 19, 2023 00:53
@ryankfu

This comment was marked as outdated.

@ryankfu

This comment was marked as outdated.

@ryankfu

This comment was marked as outdated.

@ryankfu

This comment was marked as outdated.

@ryankfu ryankfu requested a review from rodireich May 19, 2023 04:57
@ryankfu ryankfu marked this pull request as ready for review May 19, 2023 04:57
@ryankfu

This comment was marked as outdated.

@@ -4,3 +4,8 @@ Performance harness for destination connectors.

This component is used by the `/connector-performance` GitHub action and is used in order to test throughput of
destination connectors on a number of datasets.

Associated files are:
<li>Main.java - the main entrypoint for the harness
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@@ -26,16 +28,23 @@ public class Main {
private static final String CREDENTIALS_PATH = "secrets/%s_%s_credentials.json";

public static void main(final String[] args) {
// If updating args for Github Actions, also update the run-performance-test.yml file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

@davinchia
Copy link
Contributor

davinchia commented May 19, 2023

Ryan, how do we know the random logic works? Is it worth adding a quick test?

@ryankfu
Copy link
Contributor Author

ryankfu commented May 19, 2023

I did a manual test here and confirmed it creates the correct number of streams. Also got the list of tables by looking at the Snowflake connection. Did you want to have a unit test added for brevity?

EDIT: quick sniff test is too see when Snowflake flushes which streams it's flushes and whether it gets up to 11 (the number of streams allocated for this run)

@davinchia
Copy link
Contributor

Yes. Probably a good idea to avoid regressions.

@ryankfu ryankfu requested a review from rodireich May 19, 2023 23:15
#### Note: The following `dataset=` values are supported: `1m`<sub>(default)</sub>, `10m`, `20m`, `bottleneck_stream1`
> :runner: ${{github.event.inputs.connector}} https://github.com/${{github.repository}}/actions/runs/${{github.run_id}}
#### Note: The following `dataset=` values are supported: `1m`<sub>(default)</sub>, `10m`, `20m`, `bottleneck_stream1`.
For destination performance only: you can also use `stream-numbers=N` to simulate N number of parallel streams.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryankfu Is this ok?

@ryankfu

This comment was marked as outdated.

@ryankfu
Copy link
Contributor Author

ryankfu commented May 20, 2023

/connector-performance connector=connectors/destination-snowflake

Note: The following dataset= values are supported: 1m(default), 10m, 20m, bottleneck_stream1

🏃 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/5029414329

Performance test Result:

at io.airbyte.integrations.destination_performance.Main.main(Main.java:47)

@rodireich
Copy link
Contributor

rodireich commented May 20, 2023

/connector-performance connector=connectors/destination-snowflake ref=ryan/parallel-stream-performance

Note: The following dataset= values are supported: 1m(default), 10m, 20m, bottleneck_stream1.

For destination performance only: you can also use stream-numbers=N to simulate N number of parallel streams.

🏃 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/5029469891.

Performance test Result:

{"type":"LOG","log":{"level":"INFO","message":"INFO i.a.i.d.PerformanceHarness(computeThroughput):193 total secs: 66.325. total MB read: 344.5901937484741, rps: 15077.271013946474, throughput: 5.195479739894068

@ryankfu ryankfu force-pushed the ryan/parallel-stream-performance branch from d477b11 to 5c2cd98 Compare May 21, 2023 02:02
@ryankfu
Copy link
Contributor Author

ryankfu commented May 22, 2023

/approve-and-merge reason="not in critical path and only affect performance harness, also automake is borked"

@octavia-approvington
Copy link
Contributor

This is really good
simply the best

@octavia-approvington octavia-approvington merged commit 3689ff3 into master May 22, 2023
17 of 18 checks passed
@octavia-approvington octavia-approvington deleted the ryan/parallel-stream-performance branch May 22, 2023 19:13
@ryankfu

This comment was marked as outdated.

@ryankfu ryankfu restored the ryan/parallel-stream-performance branch May 22, 2023 20:55
@ryankfu
Copy link
Contributor Author

ryankfu commented May 22, 2023

/connector-performance connector=connectors/destination-snowflake ref=ryan/parallel-stream-performance

Note: The following dataset= values are supported: 1m(default), 10m, 20m, bottleneck_stream1.

For destination performance only: you can also use stream-numbers=N to simulate N number of parallel streams.

🏃 connectors/destination-snowflake https://github.com/airbytehq/airbyte/actions/runs/5050323466.

Performance test Result:

{"type":"LOG","log":{"level":"INFO","message":"INFO i.a.i.d.PerformanceHarness(computeThroughput):193 total secs: 67.631. total MB read: 344.5901937484741, rps: 14786.118791678373, throughput: 5.095151539212404

@ryankfu ryankfu deleted the ryan/parallel-stream-performance branch May 23, 2023 00:19
nguyenaiden pushed a commit that referenced this pull request May 25, 2023
* Adds support for pseudo-parallel datasets

* Ran ./gradlew :spotlessJavaApply

* Automated Change

* Fixes issue with parallel datasets credentials

* Fixes filter for parallel credentials

* Adds a new configurable property  to build a pseudo-parallel catalog

* Fixes Github Actions variable to be processed properly with the K8s harness yaml

* Adds unit test for random streams and generating streams within the same configured catalog

* Ran ./gradlew :spotlessJavaApply

* Added additional description for GitHub Actions

* Update connector-performance-command.yml

Moved text up to connect with other argument discussion

* Fixes spotBugs issue

* Automated Commit - Formatting Changes

* Update GitHub Action description

---------

Co-authored-by: ryankfu <ryankfu@users.noreply.github.com>
Co-authored-by: Rodi Reich Zilberman <867491+rodireich@users.noreply.github.com>
marcosmarxm pushed a commit to natalia-miinto/airbyte that referenced this pull request Jun 8, 2023
…6219)

* Adds support for pseudo-parallel datasets

* Ran ./gradlew :spotlessJavaApply

* Automated Change

* Fixes issue with parallel datasets credentials

* Fixes filter for parallel credentials

* Adds a new configurable property  to build a pseudo-parallel catalog

* Fixes Github Actions variable to be processed properly with the K8s harness yaml

* Adds unit test for random streams and generating streams within the same configured catalog

* Ran ./gradlew :spotlessJavaApply

* Added additional description for GitHub Actions

* Update connector-performance-command.yml

Moved text up to connect with other argument discussion

* Fixes spotBugs issue

* Automated Commit - Formatting Changes

* Update GitHub Action description

---------

Co-authored-by: ryankfu <ryankfu@users.noreply.github.com>
Co-authored-by: Rodi Reich Zilberman <867491+rodireich@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants