This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Scale Test Shared gRPC server-based Implementations #233

Closed
ulucinar opened this issue Feb 14, 2022 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@ulucinar (Collaborator) commented Feb 14, 2022

What problem are you facing?

We have produced a shared gRPC server-based implementation for provider-jet-azure in the context of #38. The provider-jet-azure packages ulucinar/provider-jet-azure-arm64:shared-grpc and ulucinar/provider-jet-azure-amd64:shared-grpc are modified to run the terraform-provider-azurerm binary plugin in the background as a shared gRPC server, so that the Terraform CLI does not have to fork the binary plugin for each of its requests.
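To make the two process models concrete, here is a minimal, purely illustrative Python sketch (not Terrajet's actual code): `fork_per_request` spawns a fresh worker process per request, the way the Terraform CLI forks the binary plugin, while `shared_server` keeps one long-lived worker answering every request over a pipe, analogous to the shared gRPC server the modified packages keep running in the background.

```python
import subprocess
import sys

def fork_per_request(requests):
    """Fork-per-request model: every request spawns a fresh 'provider'
    process, paying process-startup cost each time. Illustrative only."""
    results = []
    for req in requests:
        out = subprocess.run(
            [sys.executable, "-c",
             "import sys; print('handled ' + sys.argv[1])", req],
            capture_output=True, text=True,
        ).stdout.strip()
        results.append(out)
    return results

def shared_server(requests):
    """Shared-server model: one long-lived process answers all requests
    over a pipe, so startup cost is paid only once."""
    proc = subprocess.Popen(
        [sys.executable, "-c",
         "import sys\n"
         "for line in sys.stdin:\n"
         "    print('handled ' + line.strip(), flush=True)"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    results = []
    for req in requests:
        proc.stdin.write(req + "\n")
        proc.stdin.flush()
        results.append(proc.stdout.readline().strip())
    proc.stdin.close()
    proc.wait()
    return results

print(fork_per_request(["plan", "apply"]))  # ['handled plan', 'handled apply']
print(shared_server(["plan", "apply"]))     # ['handled plan', 'handled apply']
```

Both functions produce the same results; the difference is that the shared-server model amortizes process startup across all requests, which is the effect the experiments below try to quantify.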

How could Terrajet help solve your problem?

Similar to what we did previously in #55, we need to re-evaluate the performance of provider-jet-azure@v0.7.0 and of the shared gRPC implementation using the provider packages above. This will allow us to assess and quantify any performance improvements from the shared gRPC implementation. Some of the scripts previously used for #55 are available in https://github.com/ulucinar/terrajet-scale.

@ulucinar ulucinar added the enhancement New feature or request label Feb 14, 2022
@sergenyalcin sergenyalcin self-assigned this Feb 14, 2022
@sergenyalcin (Member) commented Mar 3, 2022

Here are the results from two experiments on the provider-jet-azure.

Experiment Setup:

On a GKE cluster with the following specs:

  • Machine family: General purpose e2-standard-4 (4 vCPU, 16 GB memory)
  • Worker nodes: 3
  • Control plane version: v1.20.11-gke.1300

Note: For previous tests and general context, please see this issue: #55

Case 1: Test provider-jet-azure v0.8.0 without the shared gRPC implementation

For this case, the following image was used: crossplane/provider-jet-azure:v0.8.0

Firstly, provider-jet-azure v0.8.0 was deployed to the cluster. Then 50 VirtualNetwork and 50 LoadBalancer MRs were created simultaneously (100 MRs in total). An example invocation of the generator script looks like the following:

$ ./manage-mr.sh create ./loadbalancer.yaml $(seq 1 50)
$ ./manage-mr.sh create ./virtualnetwork.yaml $(seq 1 50)
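For concreteness, a VirtualNetwork MR of the kind the script generates might look like the following. This is a hypothetical example: the field values are illustrative, and the exact apiVersion/group should be checked against the provider-jet-azure CRDs.

```yaml
apiVersion: network.azure.jet.crossplane.io/v1alpha2
kind: VirtualNetwork
metadata:
  name: example-vn-1          # the script presumably varies this suffix
spec:
  forProvider:
    addressSpace:
      - 10.0.0.0/16
    location: East US
    resourceGroupNameRef:
      name: example-rg        # hypothetical resource group reference
  providerConfigRef:
    name: default
```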

The following graphs were captured from the Grafana dashboard:

[Graph: MR counts and CPU/Memory utilization, Case 1]

The chart above shows the MR counts and CPU/memory utilization. As can be seen, CPU usage peaked at the beginning of the resource creation process, reaching close to 40%. Although there are fluctuations afterward, CPU usage averaged 25-30% from the beginning to the end of the test.

[Histogram: time-to-Ready statistics, Case 1]

The data and histogram above show the time it took for the 100 created resources to become Ready.
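As a sketch of how such time-to-Ready figures can be computed: each MR records a creationTimestamp, and its status carries a Ready condition with a lastTransitionTime, so the difference between the two gives the readiness latency. The sample object below is illustrative, not data from the actual test run.

```python
from datetime import datetime

def time_to_ready(mr):
    """Seconds from MR creation until its Ready condition turned True."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(mr["metadata"]["creationTimestamp"], fmt)
    ready = next(
        c for c in mr["status"]["conditions"]
        if c["type"] == "Ready" and c["status"] == "True"
    )
    became_ready = datetime.strptime(ready["lastTransitionTime"], fmt)
    return (became_ready - created).total_seconds()

# Hypothetical MR, shaped like the objects `kubectl get -o json` returns.
sample_mr = {
    "metadata": {"creationTimestamp": "2022-03-03T20:00:00Z"},
    "status": {"conditions": [
        {"type": "Synced", "status": "True",
         "lastTransitionTime": "2022-03-03T20:00:05Z"},
        {"type": "Ready", "status": "True",
         "lastTransitionTime": "2022-03-03T20:02:30Z"},
    ]},
}

print(time_to_ready(sample_mr))  # 150.0
```

Running this over all 100 MRs and aggregating (min/max/mean) would yield statistics like those in the histograms.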

Case 2: Test provider-jet-azure with the shared gRPC implementation

For this case, the following image was used: ulucinar/provider-jet-azure-amd64:shared-grpc

Firstly, provider-jet-azure (with a custom image that contains the shared gRPC implementation) was deployed to the cluster. Then 50 VirtualNetwork and 50 LoadBalancer MRs were created simultaneously (100 MRs in total). An example invocation of the generator script looks like the following:

$ ./manage-mr.sh create ./loadbalancer.yaml $(seq 1 50)
$ ./manage-mr.sh create ./virtualnetwork.yaml $(seq 1 50)

The following graphs were captured from the Grafana dashboard:

[Graph: MR counts and CPU/Memory utilization, Case 2]

The chart above shows the MR counts and CPU/memory utilization. As can be seen, CPU usage peaked at the beginning of the resource creation process, reaching roughly 21-22%. Although there are fluctuations afterward, CPU usage averaged around 15% from the beginning to the end of the test.

Note: No stability issues, such as provider pod restarts, were observed while testing this case.

[Histogram: time-to-Ready statistics, Case 2]

The data and histogram above show the time it took for the 100 created resources to become Ready.

Result:

  • When we check the CPU/memory utilization, we see a decrease: both average and peak values are lower in the gRPC-based implementation case.

  • For readiness time, all of the statistics show an improvement in the gRPC-based implementation case.

In light of the above results, it is possible to say that the gRPC implementation makes a significant difference both in terms of resource consumption (CPU/memory) and in the time it takes for resources to become Ready.
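Taking the midpoints of the CPU ranges quoted above (my interpretation of the figures, not exact measurements), the relative improvement can be roughly quantified as follows:

```python
# Midpoints of the reported CPU figures (interpreted, not measured):
baseline_peak, grpc_peak = 40.0, 21.5   # % CPU peak, Case 1 vs Case 2
baseline_avg, grpc_avg = 27.5, 15.0     # % CPU average, Case 1 vs Case 2

peak_drop = (baseline_peak - grpc_peak) / baseline_peak * 100
avg_drop = (baseline_avg - grpc_avg) / baseline_avg * 100
print(f"peak CPU reduced by ~{peak_drop:.0f}%")     # ~46%
print(f"average CPU reduced by ~{avg_drop:.0f}%")   # ~45%
```

So under these assumptions, the shared gRPC implementation roughly halved CPU utilization in this workload.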

@ulucinar (Collaborator, Author) commented Mar 4, 2022

Thank you @sergenyalcin for carrying out these experiments, excellent work! Could you please also record in your comment the shared gRPC implementation image you used in the experiments?

@ulucinar (Collaborator, Author) commented Mar 4, 2022

@sergenyalcin it may also be helpful to record in your comment that we did not observe any stability issues with the shared gRPC server in your experiments, as it relies on a non-production (testing) Terraform configuration. One important aspect of these experiments is observing the stability of the shared gRPC implementation under load.

@sergenyalcin (Member) commented

@ulucinar thank you for your comments. Both comments have been addressed!

@muvaf (Member) commented Mar 5, 2022

Thanks @sergenyalcin ! I think we can conclude and close this issue, and also #38. The only risk seems to be that we'll be using an undocumented path, but it's quite easy to turn on/off with a config, so provider maintainers can choose whether they'd like to take the risk. @sergenyalcin @ulucinar do you agree?

The next step could be to open an issue targeting implementation of gRPC usage. Once an example usage is in provider-jet-template, we can update the guide and the Jet providers we're maintaining to that method.
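As a sketch of what such an on/off switch could look like: a Crossplane ControllerConfig can pass extra arguments to the provider pod, so a hypothetical flag (the name `--use-shared-grpc-server` is invented here for illustration, not an actual documented option) could gate the behavior per provider installation.

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: jet-azure-shared-grpc
spec:
  args:
    - --use-shared-grpc-server   # hypothetical flag; actual name TBD
```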

@sergenyalcin (Member) commented

@muvaf I think we can close this issue as you suggest.

To summarize these scale tests: when the gRPC server-based implementation is used, there are significant improvements both in resource consumption (CPU/memory) and in the time it takes for managed resources to become Ready.

We can open another issue for tracking the implementation of gRPC usage.
