Run performance test #154

Merged: Siegrift merged 47 commits into main from performance-test on Dec 18, 2023

Conversation

@Siegrift (Collaborator) commented Dec 3, 2023

Closes #40

Rationale

Used https://pool.nodary.io/ as an API and created a small script that uses all endpoints from https://pool.nodary.io/<AIRNODE> and treats them as different dAPIs.

There are ~25 different Airnodes, each has ~100 beacons giving an upper limit of ~2500 for perf test. I immediately noticed I am rate limited so I deployed a bunch of cloud workers to avoid sending requests from a single URL (which worked).
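The setup above can be sketched roughly as follows (helper names are hypothetical; the actual script is part of this PR): flatten each Airnode's endpoints into one beacon list, then split the list round-robin across several workers so that no single origin gets rate limited.

```typescript
type Beacon = { airnode: string; endpoint: string };

// Build the flat list of "dAPIs": ~25 Airnodes x ~100 beacons => ~2500.
export const buildBeacons = (airnodes: string[], endpointsPerAirnode: string[][]): Beacon[] =>
  airnodes.flatMap((airnode, i) =>
    endpointsPerAirnode[i].map((endpoint) => ({ airnode, endpoint }))
  );

// Round-robin split across N workers so requests originate from different IPs.
export const splitAcrossWorkers = <T>(items: T[], workerCount: number): T[][] => {
  const buckets: T[][] = Array.from({ length: workerCount }, () => []);
  items.forEach((item, i) => buckets[i % workerCount].push(item));
  return buckets;
};
```

Each worker then only requests the beacons in its own bucket, which is what made the rate limiting go away.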

Results overview

  1. For our expected load we want to use:
     • Signed API => cpu=512, mem=1024
     • Airnode feed => cpu=2048, mem=4096
  2. The cost of logging is negligible.
  3. Both services are CPU bound.
  4. I noticed Signed API latency issues (requests taking seconds to complete), but they were caused by network roundtrips because I deployed the services in the US.
  5. Logs were quite usable.

Details

I deployed everything on us-east-1 or us-east-2 with one exception at the end when I debugged the latency issues of Signed API.

~2500 dAPIs across 25 triggers each with 60s fetchInterval

cpu=256, mem=512, logLevel=debug

Airnode feed
image

Signed API
image

cpu=512, mem=1024, logLevel=debug

Airnode feed
image

Signed API
image

cpu=512, mem=1024, logLevel=warn

Airnode feed
image

Signed API
image

cpu=1024, mem=2048, logLevel=warn

Airnode feed
image

Signed API
image

The stats look OK on these machines, but the Signed API is slow: requests take ~5s and the response size is ~1MB. Half of the requests fail with a 502 error.

cpu=2048, mem=4096, logLevel=warn

Airnode feed
image

Signed API
image

"Only" 1/4 of requests fail with 502, so it seems the bottleneck is somewhere else. I also redeployed with logLevel=debug and the results were similar - so the cost of logging is negligible.

399 dAPIs across 4 triggers each with 60s fetchInterval

cpu=2048, mem=4096, logLevel=debug

Airnode feed
image

Signed API
image

All requests were successful; response time is ~200ms (of which ~120ms is inherent network latency).

cpu=256, mem=512, logLevel=debug

Airnode feed
image

Signed API
image

1/4 of requests fail; some requests take ~500ms and some ~5s, probably depending on whether the machine is busy processing something or not.

cpu=512, mem=1024, logLevel=debug

Airnode feed
image

Signed API
image

30 requests, 1s delay between them:

  • Success: 29, error: 1
  • Min time: 799.29ms
  • Max time: 4470.17ms
  • Average time: 2168.47ms (this is not nice)
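The request stats in sections like the one above come from a loop along these lines (a minimal sketch, not the actual script; `measure` and its parameters are assumptions): send N sequential requests with a fixed delay and summarize the timings of the successful ones.

```typescript
// Summarize a list of request durations (milliseconds).
export const summarize = (timesMs: number[]) => ({
  min: Math.min(...timesMs),
  max: Math.max(...timesMs),
  avg: timesMs.reduce((a, b) => a + b, 0) / timesMs.length,
});

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Fire `count` sequential requests with `delayMs` between them; failed
// requests (network error or non-2xx status) count as errors.
export const measure = async (url: string, count: number, delayMs: number) => {
  const times: number[] = [];
  let errors = 0;
  for (let i = 0; i < count; i++) {
    const start = performance.now();
    try {
      const res = await fetch(url);
      if (!res.ok) throw new Error(String(res.status));
      times.push(performance.now() - start);
    } catch {
      errors++;
    }
    await sleep(delayMs);
  }
  return { ...summarize(times), success: times.length, errors };
};
```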

399 dAPIs across 4 triggers each with 1s fetchInterval

cpu=512, mem=1024, logLevel=debug

Airnode feed
image

Signed API
image

Request stats:

  • Success: 29, Fail: 1
  • Min time: 479.02ms
  • Max time: 3341.96ms
  • Average time: 1702.89ms

cpu=2048, mem=4096, logLevel=debug

Airnode feed
image

Signed API
image

Weaker machines were causing 100% CPU in Airnode feed. Request stats:

  • Success: 30, Fail: 0
  • Min time: 628.25ms
  • Max time: 2912.61ms
  • Average time: 1079.99ms (this is not nice)

cpu=8192, mem=16384, logLevel=debug

Signed API

image

Redeployed just the Signed API to see if bigger machine helps. Stats:

  • Success: 30, Fail: 0
  • Min time: 639.11ms
  • Max time: 1726.84ms
  • Average time: 1055.00ms

Debugging the Signed API latency

From the AWS logs, the server spends ~50ms processing the query. I ran a smaller Airnode feed (40 dAPIs) and decreased the response payload to 15kB (from 180kB); request times were stable at around ~180ms, which was expected. I suspected the latency was caused by a combination of network distance and response payload size.

I confirmed the hypothesis by deploying the Signed API (cpu=512, mem=1024) in Europe. This should not be an issue in prod, because we will put a CDN in front of the Signed API. Stats:

  • Success: 30, Fail: 0
  • Min time: 98.73ms
  • Max time: 2730.29ms
  • Median time: 139.68ms
  • Average time: 391.16ms

I ran the script a bit longer and out of 2000 calls only 2 failed. The Signed API CPU and memory usage was stable.
image

@Siegrift Siegrift self-assigned this Dec 3, 2023
@Siegrift Siegrift changed the title WIP: Performance test Performance test Dec 3, 2023
@bbenligiray bbenligiray requested a review from bdrhn9 December 4, 2023 07:59
@bbenligiray (Member) commented:

Thanks, interesting data. It's surprising that CPU is that much of a bottleneck for Airnode (though the number of triggers is a bit low). Fortunately, Airnode can be scaled linearly by simply deploying more of them (and the same applies to signed APIs, API providers deploying their own distributes the load in a natural way, and we can put ours behind a sort of a load balancer). We must find out the limits of a single instance first though.

For our expected load

I'm not sure what you refer to as the expected load. I would look at the following.

  • 25 Airnodes, 300 Beacons each, one trigger per Beacon, 1s fetch interval (APIs don't support batching and ideal specs)

  • 25 Airnodes, 300 Beacons each, one trigger per 100 Beacons, 1s fetch interval (APIs are nicely batchable and ideal specs)

  • 25 Airnodes, 300 Beacons each, one trigger per Beacon, 5s fetch interval (APIs don't support batching and relaxed specs)

  • 25 Airnodes, 300 Beacons each, one trigger per 100 Beacons, 5s fetch interval (APIs are nicely batchable and relaxed specs)

We should have headroom in both the number of Airnodes and the number of Beacons per Airnode, 100 Beacons is potentially on the low side. Similarly, 25 triggers is not necessarily the upper limit. Even the best machine configuration will probably fail to perform well in the most pessimistic scenario, in which case we can adjust the specs.
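For reference, the four scenarios above translate into these Signed API ingest rates (simple arithmetic on the numbers given, nothing more):

```typescript
// Calls per second hitting the Signed API:
// airnodes * beacons / batchSize / fetchIntervalSeconds.
// (Roughly double these if API providers deploy Airnodes redundantly.)
const rate = (airnodes: number, beacons: number, batchSize: number, intervalS: number): number =>
  (airnodes * beacons) / batchSize / intervalS;

const scenarios = [
  rate(25, 300, 1, 1),   // no batching, 1s interval  => 7500 calls/s
  rate(25, 300, 100, 1), // batched per 100, 1s       => 75 calls/s
  rate(25, 300, 1, 5),   // no batching, 5s interval  => 1500 calls/s
  rate(25, 300, 100, 5), // batched per 100, 5s       => 15 calls/s
];
```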

@Siegrift (Collaborator, Author) commented Dec 4, 2023

CPU is that much of a bottleneck for Airnode

I guess you mean Airnode feed.

though the number of triggers is a bit low

I tried to mimic 300 unique dAPIs updated every 1s, with an API that supports batching.

I'm not sure what you refer to as the expected load.

Sorry, I meant that those numbers are the minimum. I consider 399 dAPIs across 4 triggers each with 1s fetchInterval to be the upper bound of our "300 unique dAPIs updated every 1s, with an API supporting batching" scenario.

I would look at the following.

Sure, it's fairly quick for me to run these, but I didn't get this one:

25 Airnodes, 300 Beacons each, one trigger per Beacon, 1s fetch interval

What do you mean by the "25 Airnodes" part? Do you mean that 25 Airnode feeds pushing to single Signed API?

@bbenligiray (Member) commented:

It's not clear to me what you mean by a dAPI, Airnode and signed API only cares about Beacons

What do you mean by the "25 Airnodes" part? Do you mean that 25 Airnode feeds pushing to single Signed API?

Yes. So the signed API receives 25 * 300 individual API calls a second in the case that no API supports batching (or rather double that if we have API providers deploy Airnodes redundantly, which we will).

@Siegrift (Collaborator, Author) commented Dec 4, 2023

It's not clear to me what you mean by a dAPI, Airnode and signed API only cares about Beacons

I mean beacons.

@bbenligiray (Member) commented:

Let's talk about it on tomorrow's call

@Siegrift Siegrift force-pushed the payload-limit branch 2 times, most recently from 12e4a80 to 7ba8b6d on December 4, 2023 13:09
@Siegrift (Collaborator, Author) commented Dec 6, 2023

FYI: I did some tests like we talked about, but the results are pretty bad and I didn't even run the fully scaled one... My intention is to properly identify what is causing the performance issues, because I think the performance should be much better.

Base automatically changed from payload-limit to main December 6, 2023 15:34
@Siegrift Siegrift force-pushed the performance-test branch 2 times, most recently from 649b7c9 to 4476f99 on December 6, 2023 16:51
@Siegrift Siegrift changed the base branch from main to perf-test-fixes December 6, 2023 17:12
@Siegrift Siegrift force-pushed the performance-test branch 4 times, most recently from 52d03eb to 97f316e on December 8, 2023 17:01
@Siegrift (Collaborator, Author) commented Dec 8, 2023

Context update

The bottleneck mentioned in this comment is signed data verification, which takes 2-3ms per verification. That doesn't scale to the number of API calls we want the Signed API to handle.

We discussed possible options internally, but in the meantime I continued the performance tests with the verification removed from the Signed API (which then has good performance).

Because the Signed API currently has no authentication and no signed data verification, there is not much point in stress testing it in its current state (although the added auth should be lightweight, so the results should be similar). We agreed to wrap up the performance test and focus on different Airnode feed scenarios. So here are the results.
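A quick back-of-the-envelope on why the per-item verification is a bottleneck (arithmetic on the 2-3ms figure above; not a benchmark, and the exact per-core throughput is an assumption):

```typescript
// At 2-3ms per signature verification on one core, a single instance
// cannot keep up with the unbatched scenario discussed earlier.
const verifyMs = 2.5;                          // midpoint of the reported 2-3ms
const verifiesPerCorePerSec = 1000 / verifyMs; // 400 verifications/s/core
const targetCallsPerSec = 25 * 300;            // 7500 calls/s, unbatched scenario
const coresNeeded = targetCallsPerSec / verifiesPerCorePerSec; // 18.75 cores
```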

Results

Airnode feed (cpu=2048, mem=4096), 300 beacons, 1s fetchInterval, no batching

image

A smaller machine results in 100% CPU. As a note, the target Signed API without verification (cpu=1024, mem=2048) was at 12% CPU, 6% memory. This scaled linearly with the number of Airnode feeds deployed.

Airnode feed (cpu=1024, mem=2048), 300 beacons, 5s fetchInterval, no batching

image

I also tried to deploy cpu=512, mem=1024, but the result was 100% CPU (and it never went down).

Airnode feed (cpu=1024, mem=2048), 100 beacons, 1s fetchInterval, no batching

image

Airnode feed (cpu=512, mem=1024), 100 beacons, 5s fetchInterval, no batching

image

Airnode feed (cpu=1024, mem=2048), 300 beacons, 1s fetchInterval, batched per 100 (3 triggers)

image

As a note, increasing to 400 beacons ends up at 100% CPU. I also ran a similar test with 300 beacons yesterday against a less optimal Signed API (larger responses), and that hit 100% CPU as well.

Airnode feed (cpu=1024, mem=2048), 100 beacons, 1s fetchInterval, batched per 100 (1 trigger)

image

Airnode feed (cpu=1024, mem=2048), 100 beacons, 1s fetchInterval, batched per 100 (1 trigger)

image

Final notes

  • I used a Signed API (cpu=2048, mem=4096) as a data source for an Airnode feed with no batching and 300 requests per second. The Signed API, which served large responses (300 signed data entries), was killed with a fatal error and shortly restarted by AWS: FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory. I updated the machine to cpu=2048, mem=16384 and it failed with a different error, Error: write ENOBUFS, for which I found limited information.
  • When scaling the machine to higher memory, it's very important to set --max-old-space-size (the Node.js heap size) to ~70% of the host machine's memory. Node's default limit is 4GB, which makes a larger host machine useless otherwise.
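The second note can be sketched as a container entrypoint (the entrypoint path and variable names are hypothetical; the echo stands in for actually exec'ing the process): derive the V8 heap flag from the task's memory instead of relying on Node's default cap.

```shell
# Size the Node.js heap to ~70% of the ECS task memory before starting the
# service, so a larger machine is actually usable.
MEM_MB=16384                      # ECS task memory, e.g. mem=16384
HEAP_MB=$((MEM_MB * 70 / 100))    # ~70% for the V8 old space => 11468
echo "node --max-old-space-size=${HEAP_MB} dist/index.js"
```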

@Siegrift (Collaborator, Author) commented:

I understand that these instructions might be hard to follow, but it's quite complex to set the configurations right. If you are interested, I can explain it on a call or in a short Q/A.

@Siegrift (Collaborator, Author) commented:

The templates here have a risk of going stale, but it's convenient to have them here. In the README I explicitly mention checking whether the CF template is correct.

@Siegrift Siegrift requested a review from andreogle December 8, 2023 20:33
@Siegrift Siegrift changed the title Performance test Run performance test Dec 8, 2023
@andreogle (Member) left a comment:

I read through everything and it doesn't seem too complicated. I didn't replicate it locally though.

apiCredentials: [],
nodeSettings: {
nodeVersion: '0.1.0',
airnodeWalletMnemonic: 'destroy manual orange pole pioneer enemy detail lady cake bus shed visa',
@andreogle (Member) commented:

Where does this come from?

@Siegrift (Collaborator, Author) replied:

Randomly generated wallet. I hardcoded it for convenience.

},
{
"urlPath": "/30s-delay",
"delaySeconds": 20
@andreogle (Member) suggested a change: "delaySeconds": 20 → "delaySeconds": 30

@Siegrift (Collaborator, Author) replied:

Hehe, I actually never ended up using these.

@Siegrift Siegrift force-pushed the perf-test-fixes branch 2 times, most recently from e93fc8e to f537f35 on December 11, 2023 14:38
@Siegrift Siegrift merged commit e92fd55 into main Dec 18, 2023
@Siegrift Siegrift deleted the performance-test branch December 18, 2023 20:40
Linked issue: Performance test

3 participants