This repository contains a network latency test stack that consists of a Java-based trading client and Ansible playbooks to coordinate distributed tests.
The Java-based trading client is designed to send limit and cancel orders, allowing you to measure network round-trip times.
This repository also contains a mock trading server developed in Rust that responds to limit and cancel orders.
- Introduction
- Prerequisites
- Getting Started
- Deployment with Ansible
- Start Running Tests
- Fetch and Analyze Logs
- Available Commands
- Generating Self-Signed Certificates for Testing SSL Connections
- Optimizations Used for the Java Client
- Contributing
- License
Before you can use this network latency test stack, you'll need to ensure that you have the following prerequisites in place:
Ansible: Make sure Ansible is installed on your system. Installation instructions are available in the Ansible documentation.
Java Development Kit (JDK): You'll need a JDK installed on your machine to compile and run the Java client. You can download the JDK from OpenJDK or Oracle.
OpenSSL: Required for generating self-signed certificates for SSL/TLS connections.
- Generate SSH key pairs for the instances.
- Update the `.aws_ec2.yml` inventory files under `deployment/ansible/inventory` with your EC2 instance names. An example inventory file can be found in that folder.
- Open the `deploy.sh` file and update `INVENTORY` and `SSH_KEY_FILE`. `SSH_KEY_FILE` is the SSH key pair that you use to connect to the EC2 instances.
- Run `deploy.sh`. The script handles deploying the application and its dependencies to the EC2 instances, using Ansible to provision the instances and run the deployment tasks.
The `deploy.sh` script will:
- Provision the EC2 instances defined in the inventory file and deploy the client and server to them
- Build the HFT Java client and Rust mock trading server applications on the instances
- Create self-signed SSL files on the remote EC2 instances
- Copy key scripts and config files across for both the client and the server
The `start_latency_test.yaml` playbook is used to start the client processes for performance testing on AWS EC2 instances.
The playbook defines tasks to:
- Stop the exchange client
- Start the exchange client processes in tmux on the client instances
You can monitor client logs at `/home/ec2-user/output.log` and server logs at `/home/ec2-user/mock-trading-server/target/release/output.log`.
The `deployment/show_latency_reports.sh` script fetches latency histogram logs from the EC2 instances and analyzes them locally.
You can run the script with the following options:
./deployment/show_latency_reports.sh [--inventory INVENTORY_FILE] [--key SSH_KEY_FILE] [--output OUTPUT_DIR]
For example:
./deployment/show_latency_reports.sh --inventory ./ansible/inventory/virginia_inventory.aws_ec2.yml --key ~/.ssh/virginia_keypair.pem
Or you can manually:
- Open `show_latency_reports.sh`
- Set `INVENTORY`
- Set `SSH_KEY_FILE`
- Run `show_latency_reports.sh`
The script performs the following steps:
- Runs the fetch_histogram_logs.yaml Ansible playbook to copy logs from instances
- Loops through fetched log files
- Calls a Java program to analyze each log
- The program outputs latency reports
- Generates a summary report in Markdown format
The Java application supports the following commands:
java -jar ExchangeFlow-1.0-SNAPSHOT.jar <command> [<args>]
Available commands:
- `latency-test`: Run a round-trip latency test between client and server
- `ping-latency`: Run a ping latency test to measure network round-trip time
- `latency-report <path>`: Generate and print a latency report from a log file
- `help`: Print the help message
Examples:
# Run latency test
java -jar ExchangeFlow-1.0-SNAPSHOT.jar latency-test
# Generate latency report from log file
java -jar ExchangeFlow-1.0-SNAPSHOT.jar latency-report ./histogram_logs/latency.hlog
- Generate a 2048-bit RSA private key (localhost.key) and a Certificate Signing Request (localhost.csr) using OpenSSL. The private key is kept secret and is used to digitally sign documents. The CSR contains information about the key and identity of the requestor and is used to apply for a certificate.
openssl genrsa -out localhost.key 2048
openssl req -new -key localhost.key -out localhost.csr
- Self-sign the CSR to generate a localhost test certificate (localhost.crt) that is valid for 365 days. This acts as a certificate authority to sign our own certificate for testing SSL connections.
openssl x509 -req -days 365 -in localhost.csr -signkey localhost.key -out localhost.crt
- Export the private key and self-signed certificate to a PKCS#12 keystore (keystore.p12). This bundles the private key and certificate together in a format usable by the Java high-frequency trading (HFT) client. A password protects the keystore.
openssl pkcs12 -export -out keystore.p12 -inkey localhost.key -in localhost.crt
- Configure the Java high-frequency trading (HFT) client to use the keystore for SSL connections by setting `USE_SSL=true` and providing the `KEY_STORE_PATH` and `KEY_STORE_PASSWORD` in the `config.properties` file (a minimal keystore-loading sketch in Java follows this list).
USE_SSL=true
KEY_STORE_PATH=keystore.p12
KEY_STORE_PASSWORD=YOUR_PASSWORD
- Similarly, configure the Rust mock exchange server to use `localhost.key` as its private key and `localhost.crt` as its certificate chain. Enable SSL by setting `use_ssl=true` in its `configuration.toml` file.
private_key = "/path/to/localhost.key"
cert_chain = "/path/to/localhost.crt"
use_ssl = true
cipher_list = "ECDHE-RSA-AES128-GCM-SHA256"
port = 8888
host = "0.0.0.0"
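As a reference for the client-side configuration above, loading a PKCS#12 keystore into a Netty SslContext typically looks like the sketch below. This is a minimal illustration under assumptions, not the repository's actual code: the class and method names are hypothetical, and only the keystore path and password correspond to the `config.properties` entries.

```java
import io.netty.handler.ssl.SslContext;
import io.netty.handler.ssl.SslContextBuilder;

import javax.net.ssl.KeyManagerFactory;
import java.io.FileInputStream;
import java.security.KeyStore;

public final class ClientSslContextFactory {

    // Hypothetical helper: builds a Netty SslContext from the PKCS#12 keystore
    // referenced by KEY_STORE_PATH / KEY_STORE_PASSWORD in config.properties.
    public static SslContext create(String keyStorePath, char[] password) throws Exception {
        KeyStore keyStore = KeyStore.getInstance("PKCS12");
        try (FileInputStream in = new FileInputStream(keyStorePath)) {
            keyStore.load(in, password);
        }

        KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(keyStore, password);

        return SslContextBuilder.forClient()
                .keyManager(kmf)
                .build();
    }
}
```

Note that with a self-signed server certificate the client also needs to trust `localhost.crt` (for example via `SslContextBuilder.trustManager(...)`); that part is omitted from the sketch.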
Testing was conducted on Amazon EC2 c6id.metal instances running Amazon Linux. Various application-layer optimisations and techniques were implemented for the HFT client:
Each time a thread is assigned to a core, the processor copies the thread's data and instructions into that core's cache. Since this affects latency, one optimization technique used in this space is core pinning. Once enabled, the operating system ensures that a given thread executes only on its assigned cores, which avoids those unnecessary copy operations.
Our HFT client uses Netty as its underlying networking framework, so to enable this feature the following steps were taken:
Add the OpenHFT Affinity library (which provides AffinityThreadFactory) to the classpath:
<dependency>
<groupId>net.openhft</groupId>
<artifactId>affinity</artifactId>
<version>3.0.6</version>
</dependency>
Create separate thread factories for business logic and I/O operations:
private static final ThreadFactory NETTY_IO_THREAD_FACTORY = new AffinityThreadFactory("netty-io", AffinityStrategies.DIFFERENT_CORE);
private static final ThreadFactory NETTY_WORKER_THREAD_FACTORY = new AffinityThreadFactory("netty-worker", AffinityStrategies.DIFFERENT_CORE);
A composite buffer is a virtual buffer in Netty that reduces unnecessary object allocations and copy operations when merging multiple frames of data. For example:
Unpooled.wrappedBuffer(
ExchangeProtocolImpl.HEADER,
pair.getBytes(StandardCharsets.UTF_8), ExchangeProtocolImpl.SYMBOL_END,
clientId.getBytes(StandardCharsets.UTF_8), ExchangeProtocolImpl.CLIENT_ID_END,
ExchangeProtocolImpl.buySide, ExchangeProtocolImpl.SIDE_END,
ExchangeProtocolImpl.dummyType, ExchangeProtocolImpl.TYPE_END,
ExchangeProtocolImpl.dummyBuyPrice, ExchangeProtocolImpl.PRICE_END,
ExchangeProtocolImpl.dummyAmount, ExchangeProtocolImpl.AMOUNT_END,
ExchangeProtocolImpl.dummyTimeInForce, ExchangeProtocolImpl.TIME_IN_FORCE_END
)
This exposes a single ByteBuf that Netty uses to send messages, while the underlying data can consist of several different byte arrays or ByteBuf objects.
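To make the zero-copy aspect explicit, the same kind of frame could also be assembled with a `CompositeByteBuf` directly. The sketch below is illustrative only; the field values are placeholders rather than the repository's protocol constants.

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.CompositeByteBuf;
import io.netty.buffer.Unpooled;

import java.nio.charset.StandardCharsets;

public final class CompositeBufferExample {
    public static void main(String[] args) {
        // Placeholder frame fragments; in the client these would be protocol
        // constants and per-order fields.
        ByteBuf header = Unpooled.wrappedBuffer("HDR|".getBytes(StandardCharsets.UTF_8));
        ByteBuf symbol = Unpooled.wrappedBuffer("BTC-USD|".getBytes(StandardCharsets.UTF_8));
        ByteBuf side = Unpooled.wrappedBuffer("BUY".getBytes(StandardCharsets.UTF_8));

        // addComponents(true, ...) advances the writer index so the composite is
        // readable as one contiguous buffer; no bytes are copied, the composite
        // only references the original buffers.
        CompositeByteBuf frame = Unpooled.compositeBuffer();
        frame.addComponents(true, header, symbol, side);

        System.out.println(frame.toString(StandardCharsets.UTF_8)); // HDR|BTC-USD|BUY
        frame.release();
    }
}
```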
io_uring is an asynchronous I/O interface built into the Linux kernel. It uses ring buffers in shared memory as queues between application user space and kernel space: the application puts messages on the submission queue and consumes responses from the completion queue.
Netty is a fully asynchronous networking library and uses its EventLoop mechanism to achieve that. To adopt io_uring, Netty provides a specific EventLoop implementation that integrates seamlessly with the underlying OS kernel. We use it as follows; notice that the interfaces are the same:
this.nettyIOGroup = USE_IOURING ? new IOUringEventLoopGroup(NETTY_THREAD_COUNT, NETTY_IO_THREAD_FACTORY) : new NioEventLoopGroup(NETTY_THREAD_COUNT, NETTY_IO_THREAD_FACTORY);
this.workerGroup = USE_IOURING ? new IOUringEventLoopGroup(NETTY_THREAD_COUNT, NETTY_WORKER_THREAD_FACTORY) : new NioEventLoopGroup(NETTY_THREAD_COUNT, NETTY_WORKER_THREAD_FACTORY);
Single responsibility is a common technique that helps make applications modular and reusable, but the same technique can also help make applications faster. In this application there are two separate responsibilities: the network I/O layer, and measuring round-trip latencies and accumulating the results in HDR histograms. The second part is business logic that can be expensive and shouldn't keep the network I/O threads busy. Therefore the network I/O threads only receive and send messages (stamping them with timestamps), while the worker threads calculate round-trip times and save HDR histograms to disk. In the code snippet below, workerGroup is the worker event loop group supplied when adding the business logic handler to the pipeline, which forces that handler to run on the worker event loop.
ChannelPipeline pipeline = channel.pipeline();
pipeline.addLast("http-codec", new HttpClientCodec());
pipeline.addLast("aggregator", new HttpObjectAggregator(65536));
pipeline.addLast(workerGroup, "ws-handler", handler);
Because we tested such high message rates, the HDR histogram can grow quickly and consume a lot of memory. To prevent that, we used HistogramLogWriter to keep intermediate results on disk and then merged them with HistogramLogReader to produce the final report in a separate process. The application's main function therefore supports two commands: one runs the latency test, and the other generates the latency report, taking a histogram log file as input and producing percentiles as output.
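A minimal sketch of this write-then-merge flow with the HdrHistogram library is shown below; the file name, histogram bounds, and recorded value are illustrative rather than the values used by the client.

```java
import org.HdrHistogram.EncodableHistogram;
import org.HdrHistogram.Histogram;
import org.HdrHistogram.HistogramLogReader;
import org.HdrHistogram.HistogramLogWriter;

import java.io.File;

public final class HistogramLogExample {
    public static void main(String[] args) throws Exception {
        File logFile = new File("latency.hlog"); // illustrative path

        // During the test: periodically flush interval histograms to disk
        // instead of letting a single in-memory histogram grow unbounded.
        HistogramLogWriter writer = new HistogramLogWriter(logFile);
        Histogram interval = new Histogram(3_600_000_000_000L, 3); // track up to ~1 hour, 3 significant digits
        interval.recordValue(42_000); // e.g. a 42 microsecond round trip, recorded in nanoseconds
        writer.outputIntervalHistogram(interval);
        writer.close();

        // In a separate process (the latency-report command): merge all
        // intervals from the log and print the percentile distribution.
        Histogram merged = new Histogram(3_600_000_000_000L, 3);
        HistogramLogReader reader = new HistogramLogReader(logFile);
        EncodableHistogram next;
        while ((next = reader.nextIntervalHistogram()) != null) {
            merged.add((Histogram) next);
        }
        merged.outputPercentileDistribution(System.out, 1000.0); // scale nanoseconds to microseconds
    }
}
```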
Before starting tests, we warmed up the JVM processes by sending and receiving orders without measuring time. This lets the JIT compiler compile and optimize the code at runtime. We also restarted the processes every hour to reduce memory fragmentation; since the HDR histogram logs accumulate across runs, this can be done without affecting results. In addition, to reduce GC pressure, the following JVM parameters are applied at the client level:
# Add NUMA binding to separate from server memory
numactl --localalloc -- taskset -c 2-4 chrt -f 80 java \
-Xms7g -Xmx7g \
-XX:+AlwaysPreTouch \
-XX:+UnlockExperimentalVMOptions \
-XX:+UseZGC \
-XX:ConcGCThreads=2 \
-XX:ZCollectionInterval=300 \
-XX:+UseNUMA \
-XX:+UnlockDiagnosticVMOptions \
-XX:GuaranteedSafepointInterval=0 \
-XX:+UseCountedLoopSafepoints \
-XX:+DisableExplicitGC \
-XX:+DoEscapeAnalysis \
-XX:+OptimizeStringConcat \
-XX:+UseCompressedOops \
-XX:+UseTLAB \
-XX:+UseThreadPriorities \
-XX:ThreadPriorityPolicy=1 \
-XX:CompileThreshold=1000 \
-XX:+TieredCompilation \
-XX:CompileCommand=inline,com.aws.trading.*::* \
-XX:-UseBiasedLocking \
-Djava.nio.channels.spi.SelectorProvider=sun.nio.ch.EPollSelectorProvider \
-Dsun.rmi.dgc.server.gcInterval=0x7FFFFFFFFFFFFFFE \
-Dsun.rmi.dgc.client.gcInterval=0x7FFFFFFFFFFFFFFE \
-Dfile.encoding=UTF-8 \
-Dio.netty.allocator.numDirectArenas=3 \
-Dio.netty.allocator.numHeapArenas=0 \
-Dio.netty.allocator.tinyCacheSize=256 \
-Dio.netty.allocator.smallCacheSize=64 \
-Dio.netty.allocator.normalCacheSize=32 \
-Dio.netty.buffer.checkBounds=false \
-Dio.netty.buffer.checkAccessible=false \
-Dio.netty.leakDetection.level=DISABLED \
-Dio.netty.recycler.maxCapacity=32 \
-Dio.netty.eventLoop.maxPendingTasks=1024 \
-server \
-jar ExchangeFlow-1.0-SNAPSHOT.jar latency-test
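The warmup described above does not need to be elaborate. The sketch below illustrates the idea only; the order-sending hook is a hypothetical placeholder for the client's send/receive path, not a class in this repository.

```java
import java.util.function.LongSupplier;

public final class WarmupExample {

    /**
     * Exercises the send/receive path enough times for the JIT compiler to
     * compile and optimize the hot code before real measurements start.
     *
     * @param roundTrip  hypothetical hook that sends one order and returns its
     *                   round-trip time in nanoseconds
     * @param iterations number of warmup orders to send before measuring
     */
    static void warmUp(LongSupplier roundTrip, int iterations) {
        for (int i = 0; i < iterations; i++) {
            // The round-trip time is deliberately discarded: nothing is recorded
            // into the histogram during warmup.
            roundTrip.getAsLong();
        }
    }
}
```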
We implemented two different versions of this application. The first version waits for matching engine acks and then reacts to them; the second doesn't wait for acks and keeps sending orders in parallel, using the LMAX RingBuffer to achieve that. Since this created too many parallel flows, we concluded it does not reflect real-world behaviour. Even though we observed performance benefits from the LMAX RingBuffer, we stuck with the initial version, which does not use it, because it fits the real-life business use case better.
In the interest of maintaining a simple, vanilla baseline, many additional stack optimisations that are typically implemented for specific workload types were NOT applied for this testing (IRQ handling, CPU P-state and C-state controls, network buffers, kernel bypass, receive-side scaling, transmit packet steering, Linux scheduler policies, and AWS Elastic Network Adapter tuning). Results could be improved by revisiting a combination of these areas to further tune the test setup.
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.