This code is designed for testing purposes only and should never be used in production environments.
- This code will DELETE TOPICS on your Kafka clusters
- Data loss will occur during testing
- Changes are irreversible
Use this code ONLY on:
- Dedicated test clusters
- Development environments
- Isolated sandbox environments
- Production clusters
- Live data environments
- Shared development clusters
- Any system containing important data
This performance testing framework evaluates the maximum throughput of a Kafka cluster, and compares the put latency of different broker, producer, and consumer configurations.
To run a test, you need to specify the different parameters that should be tested, and the tool will iterate through all different combinations of the parameters.
The test results can be visualized in a comparative performance graph similar to the graph below.
The performance testing framework is based on the kafka-producer-perf-test.sh
and kafka-consumer-perf-test.sh
tools that are part of the Apache Kafka distribution.
This framework adds AWS-powered automation and visualization around Kafka's native performence testing utilities, leveraging AWS Step Functions for scheduling and orchestration, and AWS Batch for running individual tests.
To deploy the CDK template and run the performance test, you have two options:
Ensure you have the following tools installed and configured on your local machine:
- docker >= 24.0.0
- node >= 18.0.0
- typescript >= 5.0.0
- cdk >= 2.171.1
- npm >= 10.9.1
- git >= 2.40.1
If you prefer not to configure your local environment, you can use an Amazon EC2 instance. The create environment section contains detailed instructions on how to install the dependencies required to successfully deploy the CDK template.
You have to clone the remote repository to create a local copy on your development environment
# Clone the remote repository
git clone https://github.com/aws-samples/sample-apache-kafka-performance-testing-framework.git
# Navigate to project directory
cd sample-apache-kafka-performance-testing-framework
# Verify repository contents
ls -la
Once your environment is configured, you need to edit the defaults
variable in cdk/bin/cdk.ts
so that it matches your AWS account details:
# Navigate to 'cdk/bin'
cd cdk/bin
# Edit the `cdk.ts` using the nano text editor
nano cdk.ts
Update the AWS account ID and target AWS Region
const defaults : { 'env': Environment } = {
env: {
account: '123456789012', // Replace with your AWS account ID
region: 'us-east-1' // Replace with your target region
}
};
Finally, you need to install all dependencies of the CDK template by running the following command in the cdk
folder.
# Navigate back to the 'cdk' folder
cd ..
# Install CDK template dependencies
npm install
# Bootstrap CDK envitonment, if you have never executed CDK in this AWS account and region.
cdk bootstrap
You can now start deploying the CDK stacks to create the resouces in your AWS account.
# List the available stacks
cdk ls
You can observe 9 CDK stacks:
MSK-Perf-Test-VPC
MSK-Perf-Test-M7g-Large
MSK-Perf-Test-M7g-XLarge
MSK-Perf-Test-M7g-2XLarge
MSK-Perf-Test-M7g-2XLarge-PST (Provisioned Storage Throughput enabled at 312 MB/s)
MSK-Perf-Test-M7g-4XLarge
MSK-Perf-Test-M7g-4XLarge-PST (Provisioned Storage Throughput enabled at 625 MB/s)
MSK-Perf-Test-M7g-8XLarge
MSK-Perf-Test-M7g-8XLarge-PST (Provisioned Storage Throughput enabled at 1000 MB/s)
The VPC stack will create a new VPC in your AWS account with a 10.0.0.0/16 CIDR, 3 public subnets, 3 private subnets, 3 NAT Gateways and 1 Internet Gateway.
The VPC stack will be deployed by the MSK stasks as a dependency, and it's not required to deploy it independently. However, in case you'd like to deploy the VPC stack individually, you can run the below command:
# Deploy the CDK VPC stack
cdk deploy MSK-Perf-Test-VPC
The CDK template can either create a new MSK cluster for the performance tests, or you can specify an existing MSK cluster. The framework is creating and deleting topics on the MSK cluster to run the performance tests, so it's highly recommended to use a dedicated cluster to avoid any data loss.
The default configuration of the CDK template includes three MSK provisioned clusters leveraging standard brokers backed by AWS Graviton processors. You can customize the CDK stack name, the count of brokers, the instance type, instance size, the Apache Kafka version, and more, by editing the cdk/bin/cdk.ts
file.
new CdkStack(app, 'MSK-Perf-Test-M7g-Large', {
...defaults,
vpc: vpc,
clusterProps: {
numberOfBrokerNodes: 1,
instanceType: InstanceType.of(InstanceClass.M7G, InstanceSize.LARGE),
ebsStorageInfo: {
volumeSize: 6000
},
encryptionInTransit: {
enableInCluster: false,
clientBroker: ClientBrokerEncryption.TLS_PLAINTEXT
},
kafkaVersion: KafkaVersion.V3_8_X,
},
});
new CdkStack(app, 'stack-prefix--', {
...defaults,
vpc: vpc,
clusterName: '...',
bootstrapBrokerString: '...',
sg: '...',
});
You can then deploy the MSK CDK templates to create the performance test resources into your account.
# List the available stacks
cdk ls
# Deploy a specific stack
cdk deploy MSK-Perf-Test-M7g-Large
- Resources created by the CDK stacks will incur AWS charges immediately upon creation
- Costs continue to accumulate even when no tests are running
After the infrastructure is deployed, you can start the performance tests using an AWS Step Functions workflow that was created during deployment. The workflow uses a simplified test specification that controls how each individual test will run and what parameters will be used.
In the Step Functions console, navigate to State machines, and select {CDK-stack-name}-State-Machine-main
. Start the execution by providing the following input:
{
"test_specification": {
"parameters": {
"cluster_throughput_mb_per_sec": [ 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168 ],
"consumer_groups" : [ { "num_groups": 0, "size": 0 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
"num_producers": [ 6 ],
"client_props": [
{
"producer": "acks=all linger.ms=5 batch.size=262114 buffer.memory=2147483648 security.protocol=PLAINTEXT",
"consumer": "security.protocol=PLAINTEXT"
}
],
"num_partitions": [ 72 ],
"record_size_byte": [ 1024 ],
"replication_factor": [ 3 ],
"duration_sec": [ 3600 ]
}
}
}
Using the test specification above, the workflow will execute 42 performance tests to measure put latency under varying throughput conditions.
The test configuration is using the following parameters:
- Producers: 6
- Partitions: 72
- Record Size: 1024 bytes
- Replication Factor: 3
- Duration: 3600 seconds (1 hour)
- Throughput: 12 → 168 MB/sec
- Consumer Groups:
- 0 groups (no consumers)
- 1 group (6 consumers)
- 2 groups (6 consumers each)
Phase 1: No consumer groups
- Throughput: 16 → 168 MB/sec
- Measure put latency
Phase 2: One consumer group
- Throughput: 16 → 168 MB/sec
- 6 consumers active
- Measure put latency
Phase 3: Two consumer groups
- Throughput: 16 → 168 MB/sec
- 12 consumers total (6 per group)
- Measure put latency
The framework supports additional test parameters through list configuration. Any parameter defined as a list will be automatically included in the test iterations.
Since you may not know a cluster's maximum throughput, you can specify a stop condition. This condition tells the test framework when to reset the cluster throughput to the lowest value and select the next test option (or stop if there are no more options).
To automatically stop increasing throughput when the cluster becomes saturated, start the {CDK-stack-name}-State-Machine-main
execution by providing the following input:
{
"test_specification": {
"parameters": {
"cluster_throughput_mb_per_sec": [ 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168 ],
"consumer_groups" : [ { "num_groups": 0, "size": 0 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
"num_producers": [ 6 ],
"client_props": [
{
"producer": "acks=all linger.ms=5 batch.size=262114 buffer.memory=2147483648 security.protocol=PLAINTEXT",
"consumer": "security.protocol=PLAINTEXT"
}
],
"num_partitions": [ 72 ],
"record_size_byte": [ 1024 ],
"replication_factor": [ 3 ],
"duration_sec": [ 3600 ]
},
"skip_remaining_throughput": {
"less-than": [ "sent_div_requested_mb_per_sec", 0.995 ]
}
}
}
The framework will skip remaining throughput values if the actual throughput sent to the cluster by producers falls below 99.5% of the specified value.
The testing process follows this pattern:
- The first test uses 0 consumer groups and sends 12 MB/s into the cluster.
- The next test maintains 0 consumer groups but increases throughput to 24 MB/s.
- This pattern continues, incrementing throughput, until the cluster becomes saturated.
For example, if the cluster can only absorb 154 MB/s instead of the specified 156 MB/s, the skip condition is met (as the difference exceeds 0.5%).
When this occurs:
- The framework stops increasing throughput for the current consumer group setting.
- It resets the throughput to 12 MB/s.
- It increments to the next consumer group setting (in this case, from 0 to 1).
Amazon EC2 networking, Amazon EBS, and Amazon EBS networking all leverage a credit system that allows the performance to burst over a given baseline for a certain amount of time. This means that the cluster is able to ingest much more throughput for a certain amount of time.
To obtain stable results for the baseline performance of a cluster, the test framework can deplete credits before a test is carried out. To this end it generates traffic that exceeds the peak performance of a cluster and waits until the throughput drops below configurable thresholds.
{
"test_specification": {
"parameters": {
"cluster_throughput_mb_per_sec": [ 12, 24, 36, 48, 60, 72, 84, 96, 108, 120, 132, 144, 156, 168 ],
"consumer_groups" : [ { "num_groups": 0, "size": 0 }, { "num_groups": 1, "size": 6 }, { "num_groups": 2, "size": 6 } ],
"num_producers": [ 6 ],
"client_props": [
{
"producer": "acks=all linger.ms=5 batch.size=262114 buffer.memory=2147483648 security.protocol=PLAINTEXT",
"consumer": "security.protocol=PLAINTEXT"
}
],
"num_partitions": [ 72 ],
"record_size_byte": [ 1024 ],
"replication_factor": [ 3 ],
"duration_sec": [ 3600 ]
},
"skip_remaining_throughput": {
"less-than": [ "sent_div_requested_mb_per_sec", 0.995 ]
},
"depletion_configuration": {
"upper_threshold": {
"mb_per_sec": 185
},
"lower_threshold": {
"mb_per_sec": 150
},
"approximate_timeout_hours": 1
}
}
}
With this configuration, the framework will generate a high load to consume burst credits.
Then, it monitors the throughput to detect when credits are depleted by observing two conditions in sequence:
- First, the cluster throughput must exceed the
upper_threshold
, indicating the cluster is operating at burst performance levels - Then, the throughput must drop below the
lower_threshold
, indicating burst credits have been depleted
Only after both conditions are met (exceeding upper threshold, then falling below lower threshold) will the actual performance test begin. This ensures the test measures baseline performance without the influence of burst credits.
The approximate_timeout_hours
value of 1 hour sets the maximum duration for the credit depletion phase.
-
upper_threshold
:- Set to approximately 70-80% of your cluster's maximum burst throughput
- Should be achievable during burst credit usage
- Example: If your cluster can burst to 250 MB/sec, set to ~185 MB/sec
-
lower_threshold
- Set to approximately 80% of your upper threshold
- Should be above your expected baseline performance
- Example: If upper threshold is 185 MB/sec, set to ~150 MB/sec
-
For clusters with different broker counts:
- Scale thresholds linearly with broker count
- Example: For 6 brokers, multiply the 3-broker values by 2
You can visualize the performance test results through Jupyter notebooks once the AWS Step Functions workflow is running. You can find the URL for the SageMaker Notebook in the Outputs section of the CloudFormation stack.
Note Initial test execution is required before any output becomes visible. With credit depletion enabled, this process may take up to several hours to complete.
To install the dependencies, you have to run the 01-install-dependencies.ipynb
Jupyter notebook.
To create a visualization for the performance test results:
- Open the
02-create-visualization-template.ipynb
Jupyter notebook. - Add the execution ARN of your Step Functions workflow to the second cell
test_params = []
test_params.extend([
{'execution_arn': 'your-execution-arn-here'} # Replace with your actual ARN
])
- Execute all cells in the notebook sequentially.
The Jupyter notebook processes test results in three steps:
- Data Collection: Retrieves performance metrics from CloudWatch Logs using one or more Step Functions state machines execution ARNs
- Data Processing: Groups and aggregates results based on test parameters
- Visualization: Generates comparative plots using a user-defined
partitions
variable
The partitions
variable controls how test results are grouped and displayed. Here's a basic configuration example:
partitions = {
'ignore_keys': [ 'topic_id', 'cluster_name', 'test_id', 'client_props.consumer', 'cluster_id', 'duration_sec', 'provisioned_throughput', 'throughput_series_id', 'brokers_type_numeric', ],
'title_keys': [ 'kafka_version', 'broker_storage', 'in_cluster_encryption', 'producer.security.protocol', ],
'row_keys': [ 'num_producers', 'consumer_groups.num_groups', ],
'column_keys': [ 'producer.acks', 'producer.batch.size', 'num_partitions', ],
'metric_color_keys': [ 'brokers_type_numeric', 'brokers', ],
}
To customize this configuration, assign test parameters to different visualization roles:
Configuration Keys:
ignore_keys
: Parameters to exclude from visualizationtitle_keys
: Parameters shown in the first row of the plot title (cluster configuration)column_keys
: Parameters shown in the second row of the plot title (test conditions)row_keys
: Parameters to display in the third row of the plot title (test variations)metric_color_keys
: Parameters shown as different colored lines in the plot together with the metric choosen (comparison criteria)
Available Parameters
cluster_name
: Cluster name used in the testbroker_type
: Broker instance type (e.g., kafka.m7g.large)kafka_version
: MSK cluster Kafka version (e.g. 3.6.0)broker_storage
: Provisioned broker storage in GBin_cluster_encryption
: MSK in-cluster encryption status (true/false)num_partitions
: Partition countproducer.security.protocol
: Client-broker protocol type (e.g. SSL)producer.acks
: Producer acks configuration (e.g.: all)producer.batch.size
: Producer batch sizenum_producers
: Producer countnum_brokers
: Broker countbroker_storage.pt
: Storage provisioned throughput in MB/s
- For comparing tests with different Consumer Groups and same MSK Cluster size and instance types
partitions = {
'ignore_keys': ['topic_id', 'cluster_id', 'test_id', 'cluster_name'],
'title_keys': ['kafka_version','broker_storage','broker_type'],
'row_keys': ['producer.acks','producer.batch.size','num_partitions'],
'column_keys': ['num_producers','in_cluster_encryption','producer.security.protocol'],
'metric_color_keys': ['consumer_groups.num_groups']
}
- For comparing tests with different in-cluster encryption and client-broker protocols while keeping same MSK Cluster size and instance types
partitions = {
'ignore_keys': ['topic_id', 'cluster_id', 'test_id', 'cluster_name'],
'title_keys': ['kafka_version','broker_storage','broker_type'],
'row_keys': ['producer.acks','producer.batch.size','num_partitions'],
'column_keys': ['num_producers','consumer_groups.num_groups'],
'metric_color_keys': ['in_cluster_encryption','producer.security.protocol']
}
To remove all CDK deployed resources for the performance test and avoid unnecessary AWS charges, follow these steps:
-
Run the destroy command:
cdk destroy {stack-name}
-
Important Notes:
- CloudWatch log streams containing experiment outputs will be retained. These logs are preserved for future analysis and visualization
- To completely eliminate all costs, manually delete the following:
- CloudWatch Log Groups
- Any remaining CloudWatch Log Streams
- Any manually created resources not managed by CDK
-
Verify Cleanup:
- Check your AWS Console to ensure all resources are properly removed
- Pay special attention to regions where you deployed resources
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.