This repository has been archived by the owner on Dec 15, 2022. It is now read-only.

Scale Test Shared gRPC server-based Implementations #233

Closed
ulucinar opened this issue Feb 14, 2022 · 6 comments
Assignees
Labels
enhancement New feature or request

Comments

@ulucinar (Collaborator) commented Feb 14, 2022

What problem are you facing?

We have produced a shared gRPC server-based implementation for provider-jet-azure in the context of #38. The provider-jet-azure packages ulucinar/provider-jet-azure-arm64:shared-grpc and ulucinar/provider-jet-azure-amd64:shared-grpc are modified to run the terraform-provider-azurerm binary plugin in the background as a shared gRPC server, so that the Terraform CLI does not have to fork the binary plugin for each of its requests.
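To make the two process models concrete, here is a minimal, purely illustrative Python sketch (not Terrajet's actual code): `fork_per_request` spawns a fresh worker process per request, the way the Terraform CLI forks the binary plugin, while `shared_server` keeps one long-lived worker answering every request over a pipe, analogous to the shared gRPC server the modified packages keep running in the background.

```python
import subprocess
import sys

def fork_per_request(requests):
    """Fork-per-request model: every request spawns a fresh 'provider'
    process, paying process-startup cost each time. Illustrative only."""
    results = []
    for req in requests:
        out = subprocess.run(
            [sys.executable, "-c",
             "import sys; print('handled ' + sys.argv[1])", req],
            capture_output=True, text=True,
        ).stdout.strip()
        results.append(out)
    return results

def shared_server(requests):
    """Shared-server model: one long-lived process answers all requests
    over a pipe, so startup cost is paid only once."""
    proc = subprocess.Popen(
        [sys.executable, "-c",
         "import sys\n"
         "for line in sys.stdin:\n"
         "    print('handled ' + line.strip(), flush=True)"],
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
    )
    results = []
    for req in requests:
        proc.stdin.write(req + "\n")
        proc.stdin.flush()
        results.append(proc.stdout.readline().strip())
    proc.stdin.close()
    proc.wait()
    return results

print(fork_per_request(["plan", "apply"]))  # ['handled plan', 'handled apply']
print(shared_server(["plan", "apply"]))     # ['handled plan', 'handled apply']
```

Both functions produce the same results; the difference is that the shared-server model amortizes process startup across all requests, which is the effect the experiments below try to quantify.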

How could Terrajet help solve your problem?

Similar to what we did previously in #55, we need to re-evaluate the performance of provider-jet-azure@v0.7.0 and of the shared gRPC implementation using the provider packages above. This will allow us to assess and quantify any performance improvements from the shared gRPC implementation. Some of the scripts previously used for #55 are available in https://github.com/ulucinar/terrajet-scale.

@ulucinar ulucinar added the enhancement New feature or request label Feb 14, 2022
@sergenyalcin sergenyalcin self-assigned this Feb 14, 2022
@sergenyalcin (Member) commented Mar 3, 2022

Here are the results from two experiments on the provider-jet-azure.

Experiment Setup:

On a GKE cluster with the following specs:

  • Machine family: General purpose e2-standard-4 (4 vCPU, 16 GB memory)
  • Worker nodes: 3
  • Control plane version: v1.20.11-gke.1300

Note: For previous tests and general context, please see this issue: #55

Case 1: Test provider-jet-azure v0.8.0 without the shared gRPC implementation

For this case, the following image was used: crossplane/provider-jet-azure:v0.8.0

Firstly, provider-jet-azure v0.8.0 was deployed to the cluster. Then 50 VirtualNetwork and 50 LoadBalancer MRs were created simultaneously (100 MRs in total). An example invocation of the generator script looks like the following:

$ ./manage-mr.sh create ./loadbalancer.yaml $(seq 1 50)
$ ./manage-mr.sh create ./virtualnetwork.yaml $(seq 1 50)
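For concreteness, a VirtualNetwork MR of the kind the script generates might look like the following. This is a hypothetical example: the field values are illustrative, and the exact apiVersion/group should be checked against the provider-jet-azure CRDs.

```yaml
apiVersion: network.azure.jet.crossplane.io/v1alpha2
kind: VirtualNetwork
metadata:
  name: example-vn-1          # the script presumably varies this suffix
spec:
  forProvider:
    addressSpace:
      - 10.0.0.0/16
    location: East US
    resourceGroupNameRef:
      name: example-rg        # hypothetical resource group reference
  providerConfigRef:
    name: default
```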

The following graphs were captured from the Grafana dashboard:

[Graph: MR counts and CPU/Memory utilization, Case 1]

The chart above shows the MR counts and CPU/memory utilization. As can be seen, CPU usage peaked at the beginning of the resource creation process, reaching close to 40%. Although there are fluctuations afterward, CPU usage averaged 25-30% from the beginning to the end of the test.

[Histogram: time-to-Ready statistics, Case 1]

The data and histogram above show the time it took for the 100 created resources to become Ready.
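As a sketch of how such time-to-Ready figures can be computed: each MR records a creationTimestamp, and its status carries a Ready condition with a lastTransitionTime, so the difference between the two gives the readiness latency. The sample object below is illustrative, not data from the actual test run.

```python
from datetime import datetime

def time_to_ready(mr):
    """Seconds from MR creation until its Ready condition turned True."""
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    created = datetime.strptime(mr["metadata"]["creationTimestamp"], fmt)
    ready = next(
        c for c in mr["status"]["conditions"]
        if c["type"] == "Ready" and c["status"] == "True"
    )
    became_ready = datetime.strptime(ready["lastTransitionTime"], fmt)
    return (became_ready - created).total_seconds()

# Hypothetical MR, shaped like the objects `kubectl get -o json` returns.
sample_mr = {
    "metadata": {"creationTimestamp": "2022-03-03T20:00:00Z"},
    "status": {"conditions": [
        {"type": "Synced", "status": "True",
         "lastTransitionTime": "2022-03-03T20:00:05Z"},
        {"type": "Ready", "status": "True",
         "lastTransitionTime": "2022-03-03T20:02:30Z"},
    ]},
}

print(time_to_ready(sample_mr))  # 150.0
```

Running this over all 100 MRs and aggregating (min/max/mean) would yield statistics like those in the histograms.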

Case 2: Test provider-jet-azure with the shared gRPC implementation

For this case, the following image was used: ulucinar/provider-jet-azure-amd64:shared-grpc

Firstly, provider-jet-azure (with a custom image that contains the shared gRPC implementation) was deployed to the cluster. Then 50 VirtualNetwork and 50 LoadBalancer MRs were created simultaneously (100 MRs in total). An example invocation of the generator script looks like the following:

$ ./manage-mr.sh create ./loadbalancer.yaml $(seq 1 50)
$ ./manage-mr.sh create ./virtualnetwork.yaml $(seq 1 50)

The following graphs were captured from the Grafana dashboard:

[Graph: MR counts and CPU/Memory utilization, Case 2]

The chart above shows the MR counts and CPU/memory utilization. As can be seen, CPU usage peaked at the beginning of the resource creation process, reaching roughly 21-22%. Although there are fluctuations afterward, CPU usage averaged around 15% from the beginning to the end of the test.

Note: No stability issues, such as provider pod restarts, were observed while testing this case.

[Histogram: time-to-Ready statistics, Case 2]

The data and histogram above show the time it took for the 100 created resources to become Ready.

Result:

  • When we check the CPU/memory utilization, we see a decrease: both average and peak values are lower in the gRPC-based implementation case.

  • For readiness time, all of the statistics show an improvement in the gRPC-based implementation case.

In light of the above results, it is possible to say that the gRPC implementation makes a significant difference both in terms of resource consumption (CPU/memory) and in the time it takes for resources to become Ready.
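Taking the midpoints of the CPU ranges quoted above (my interpretation of the figures, not exact measurements), the relative improvement can be roughly quantified as follows:

```python
# Midpoints of the reported CPU figures (interpreted, not measured):
baseline_peak, grpc_peak = 40.0, 21.5   # % CPU peak, Case 1 vs Case 2
baseline_avg, grpc_avg = 27.5, 15.0     # % CPU average, Case 1 vs Case 2

peak_drop = (baseline_peak - grpc_peak) / baseline_peak * 100
avg_drop = (baseline_avg - grpc_avg) / baseline_avg * 100
print(f"peak CPU reduced by ~{peak_drop:.0f}%")     # ~46%
print(f"average CPU reduced by ~{avg_drop:.0f}%")   # ~45%
```

So under these assumptions, the shared gRPC implementation roughly halved CPU utilization in this workload.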

@ulucinar (Collaborator, Author) commented Mar 4, 2022

Thank you @sergenyalcin for carrying out these experiments, excellent work! Could you please also record in your comment the shared gRPC implementation image you used in the experiments?

@ulucinar (Collaborator, Author) commented Mar 4, 2022

@sergenyalcin it may also be helpful to record in your comment that we did not observe any stability issues with the shared gRPC server in your experiments, as it relies on a non-production (testing) Terraform configuration. One important aspect of these experiments is observing the stability of the shared gRPC implementation under load.

@sergenyalcin (Member) commented

@ulucinar thank you for your comments. Both comments have been addressed!

@muvaf (Member) commented Mar 5, 2022

Thanks @sergenyalcin ! I think we can conclude and close this issue, and also #38. The only risk seems to be that we'll be using an undocumented path, but it's quite easy to turn on/off with a config, so provider maintainers can choose whether they'd like to take the risk. @sergenyalcin @ulucinar do you agree?

The next step could be to open an issue targeting implementation of gRPC usage. Once an example usage is in provider-jet-template, we can update the guide and the Jet providers we're maintaining to that method.
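As a sketch of what such an on/off switch could look like: a Crossplane ControllerConfig can pass extra arguments to the provider pod, so a hypothetical flag (the name `--use-shared-grpc-server` is invented here for illustration, not an actual documented option) could gate the behavior per provider installation.

```yaml
apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: jet-azure-shared-grpc
spec:
  args:
    - --use-shared-grpc-server   # hypothetical flag; actual name TBD
```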

@sergenyalcin (Member) commented

@muvaf I think we can close this issue as you suggest.

To summarize these scale tests: when the gRPC server-based implementation is used, there are significant improvements both in resource consumption (CPU/memory) and in the time it takes for managed resources to become Ready.

We can open another issue for tracking the implementation of gRPC usage.
