Run performance test #154

Merged: Siegrift merged 47 commits into main from performance-test on Dec 18, 2023

Conversation

@Siegrift (Collaborator) commented Dec 3, 2023

Closes #40

Rationale

Used https://pool.nodary.io/ as an API and created a small script that uses all endpoints from https://pool.nodary.io/<AIRNODE> and treats them as different dAPIs.

There are ~25 different Airnodes, each has ~100 beacons giving an upper limit of ~2500 for perf test. I immediately noticed I am rate limited so I deployed a bunch of cloud workers to avoid sending requests from a single URL (which worked).
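The setup above can be sketched roughly as follows (helper names are hypothetical; the actual script is part of this PR): flatten each Airnode's endpoints into one beacon list, then split the list round-robin across several workers so that no single origin gets rate limited.

```typescript
type Beacon = { airnode: string; endpoint: string };

// Build the flat list of "dAPIs": ~25 Airnodes x ~100 beacons => ~2500.
export const buildBeacons = (airnodes: string[], endpointsPerAirnode: string[][]): Beacon[] =>
  airnodes.flatMap((airnode, i) =>
    endpointsPerAirnode[i].map((endpoint) => ({ airnode, endpoint }))
  );

// Round-robin split across N workers so requests originate from different IPs.
export const splitAcrossWorkers = <T>(items: T[], workerCount: number): T[][] => {
  const buckets: T[][] = Array.from({ length: workerCount }, () => []);
  items.forEach((item, i) => buckets[i % workerCount].push(item));
  return buckets;
};
```

Each worker then only requests the beacons in its own bucket, which is what made the rate limiting go away.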

Results overview

  1. For our expected load we want to use:
     • Signed API => cpu=512, mem=1024
     • Airnode feed => cpu=2048, mem=4096
  2. The cost of logging is negligible.
  3. Both services are CPU bound.
  4. I noticed Signed API latency issues (requests taking seconds to complete), but they were caused by network roundtrips because I deployed the services in the US.
  5. Logs were quite usable.

Details

I deployed everything on us-east-1 or us-east-2 with one exception at the end when I debugged the latency issues of Signed API.

~2500 dAPIs across 25 triggers each with 60s fetchInterval

cpu=256, mem=512, logLevel=debug

Airnode feed
image

Signed API
image

cpu=512, mem=1024, logLevel=debug

Airnode feed
image

Signed API
image

cpu=512, mem=1024, logLevel=warn

Airnode feed
image

Signed API
image

cpu=1024, mem=2048, logLevel=warn

Airnode feed
image

Signed API
image

The stats look OK on these machines, but the Signed API is slow: requests take ~5s and the response size is ~1MB. Half of the requests fail with a 502 error.

cpu=2048, mem=4096, logLevel=warn

Airnode feed
image

Signed API
image

"Only" 1/4 of requests fail with 502, so it seems the bottleneck is somewhere else. I also redeployed with logLevel=debug and the results were similar - so the cost of logging is negligible.

399 dAPIs across 4 triggers each with 60s fetchInterval

cpu=2048, mem=4096, logLevel=debug

Airnode feed
image

Signed API
image

All requests were successful; response time is ~200ms (of which ~120ms is inherent network latency).

cpu=256, mem=512, logLevel=debug

Airnode feed
image

Signed API
image

1/4 of requests fail; some requests take ~500ms and some ~5s, probably depending on whether the machine is busy processing something or not.

cpu=512, mem=1024, logLevel=debug

Airnode feed
image

Signed API
image

30 requests, 1s delay between them:

  • Success: 29, error: 1
  • Min time: 799.29ms
  • Max time: 4470.17ms
  • Average time: 2168.47ms (this is not nice)
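The request stats in sections like the one above come from a loop along these lines (a minimal sketch, not the actual script; `measure` and its parameters are assumptions): send N sequential requests with a fixed delay and summarize the timings of the successful ones.

```typescript
// Summarize a list of request durations (milliseconds).
export const summarize = (timesMs: number[]) => ({
  min: Math.min(...timesMs),
  max: Math.max(...timesMs),
  avg: timesMs.reduce((a, b) => a + b, 0) / timesMs.length,
});

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Fire `count` sequential requests with `delayMs` between them; failed
// requests (network error or non-2xx status) count as errors.
export const measure = async (url: string, count: number, delayMs: number) => {
  const times: number[] = [];
  let errors = 0;
  for (let i = 0; i < count; i++) {
    const start = performance.now();
    try {
      const res = await fetch(url);
      if (!res.ok) throw new Error(String(res.status));
      times.push(performance.now() - start);
    } catch {
      errors++;
    }
    await sleep(delayMs);
  }
  return { ...summarize(times), success: times.length, errors };
};
```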

399 dAPIs across 4 triggers each with 1s fetchInterval

cpu=512, mem=1024, logLevel=debug

Airnode feed
image

Signed API
image

Request stats:

  • Success: 29, Fail: 1
  • Min time: 479.02ms
  • Max time: 3341.96ms
  • Average time: 1702.89ms

cpu=2048, mem=4096, logLevel=debug

Airnode feed
image

Signed API
image

Weaker machines were causing 100% CPU in Airnode feed. Request stats:

  • Success: 30, Fail: 0
  • Min time: 628.25ms
  • Max time: 2912.61ms
  • Average time: 1079.99ms (this is not nice)

cpu=8192, mem=16384, logLevel=debug

Signed API

image

Redeployed just the Signed API to see if bigger machine helps. Stats:

  • Success: 30, Fail: 0
  • Min time: 639.11ms
  • Max time: 1726.84ms
  • Average time: 1055.00ms

Debugging the Signed API latency

From the AWS logs, the server spends ~50ms processing the query. I ran a smaller Airnode feed (40 dAPIs) and decreased the response payload to 15kB (from 180kB); request times were stable at around ~180ms, which was expected. I suspected the latency was caused by a combination of network distance and response payload size.

I confirmed the hypothesis by deploying the Signed API (cpu=512, mem=1024) in Europe. This should not be an issue in prod, because we will put a CDN in front of the Signed API. Stats:

  • Success: 30, Fail: 0
  • Min time: 98.73ms
  • Max time: 2730.29ms
  • Median time: 139.68ms
  • Average time: 391.16ms

I ran the script a bit longer and out of 2000 calls only 2 failed. The Signed API CPU and memory usage was stable.
image

@Siegrift Siegrift self-assigned this Dec 3, 2023
@Siegrift Siegrift changed the title WIP: Performance test Performance test Dec 3, 2023
@bbenligiray bbenligiray requested a review from bdrhn9 December 4, 2023 07:59
@bbenligiray (Member) commented:

Thanks, interesting data. It's surprising that CPU is that much of a bottleneck for Airnode (though the number of triggers is a bit low). Fortunately, Airnode can be scaled linearly by simply deploying more of them (and the same applies to signed APIs, API providers deploying their own distributes the load in a natural way, and we can put ours behind a sort of a load balancer). We must find out the limits of a single instance first though.

For our expected load

I'm not sure what you refer to as the expected load. I would look at the following.

  • 25 Airnodes, 300 Beacons each, one trigger per Beacon, 1s fetch interval (APIs don't support batching and ideal specs)

  • 25 Airnodes, 300 Beacons each, one trigger per 100 Beacons, 1s fetch interval (APIs are nicely batchable and ideal specs)

  • 25 Airnodes, 300 Beacons each, one trigger per Beacon, 5s fetch interval (APIs don't support batching and relaxed specs)

  • 25 Airnodes, 300 Beacons each, one trigger per 100 Beacons, 5s fetch interval (APIs are nicely batchable and relaxed specs)

We should have headroom in both the number of Airnodes and the number of Beacons per Airnode, 100 Beacons is potentially on the low side. Similarly, 25 triggers is not necessarily the upper limit. Even the best machine configuration will probably fail to perform well in the most pessimistic scenario, in which case we can adjust the specs.
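For reference, the four scenarios above translate into these Signed API ingest rates (simple arithmetic on the numbers given, nothing more):

```typescript
// Calls per second hitting the Signed API:
// airnodes * beacons / batchSize / fetchIntervalSeconds.
// (Roughly double these if API providers deploy Airnodes redundantly.)
const rate = (airnodes: number, beacons: number, batchSize: number, intervalS: number): number =>
  (airnodes * beacons) / batchSize / intervalS;

const scenarios = [
  rate(25, 300, 1, 1),   // no batching, 1s interval  => 7500 calls/s
  rate(25, 300, 100, 1), // batched per 100, 1s       => 75 calls/s
  rate(25, 300, 1, 5),   // no batching, 5s interval  => 1500 calls/s
  rate(25, 300, 100, 5), // batched per 100, 5s       => 15 calls/s
];
```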

@Siegrift (Collaborator, Author) commented Dec 4, 2023

CPU is that much of a bottleneck for Airnode

I guess you mean Airnode feed.

though the number of triggers is a bit low

I tried to mimic 300 unique dAPIs updated every 1s, with an API that supports batching.

I'm not sure what you refer to as the expected load.

Sorry, I meant that those numbers are the minimum. I consider 399 dAPIs across 4 triggers each with 1s fetchInterval to be the upper bound of our "300 unique dAPIs updated every 1s, with an API supporting batching" scenario.

I would look at the following.

Sure, it's fairly quick for me to run these, but I didn't get this one:

25 Airnodes, 300 Beacons each, one trigger per Beacon, 1s fetch interval

What do you mean by the "25 Airnodes" part? Do you mean that 25 Airnode feeds pushing to single Signed API?

@bbenligiray (Member) commented:

It's not clear to me what you mean by a dAPI, Airnode and signed API only cares about Beacons

What do you mean by the "25 Airnodes" part? Do you mean that 25 Airnode feeds pushing to single Signed API?

Yes. So the signed API receives 25 * 300 individual API calls a second in the case that no API supports batching (or rather double that if we have API providers deploy Airnodes redundantly, which we will).

@Siegrift (Collaborator, Author) commented Dec 4, 2023

It's not clear to me what you mean by a dAPI, Airnode and signed API only cares about Beacons

I mean beacons.

@bbenligiray (Member) commented:

Let's talk about it on tomorrow's call

@Siegrift Siegrift force-pushed the payload-limit branch 2 times, most recently from 12e4a80 to 7ba8b6d on December 4, 2023 13:09
@Siegrift (Collaborator, Author) commented Dec 6, 2023

FYI: I did some tests like we talked about, but the results are pretty bad and I didn't even run the fully scaled one... My intention is to properly identify what is causing the performance issues, because I think the performance should be much better.

Base automatically changed from payload-limit to main December 6, 2023 15:34
@Siegrift Siegrift force-pushed the performance-test branch 2 times, most recently from 649b7c9 to 4476f99 on December 6, 2023 16:51
@Siegrift Siegrift changed the base branch from main to perf-test-fixes December 6, 2023 17:12
@Siegrift Siegrift force-pushed the performance-test branch 4 times, most recently from 52d03eb to 97f316e on December 8, 2023 17:01
@Siegrift (Collaborator, Author) commented Dec 8, 2023

Context update

The bottleneck mentioned in this comment is signed data verification, which takes 2-3ms per verification. That doesn't scale to the number of API calls we want the Signed API to handle.

We discussed possible options internally, but in the meantime I continued the performance tests with the verification removed from the Signed API (which then has good performance).

Because the Signed API currently has no authentication and no signed data verification, there is not much point in stress testing it in its current state (although the added auth should be lightweight, so the results should be similar). We agreed to wrap up the performance test and focus on different Airnode feed scenarios. So here are the results.
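A quick back-of-the-envelope on why the per-item verification is a bottleneck (arithmetic on the 2-3ms figure above; not a benchmark, and the exact per-core throughput is an assumption):

```typescript
// At 2-3ms per signature verification on one core, a single instance
// cannot keep up with the unbatched scenario discussed earlier.
const verifyMs = 2.5;                          // midpoint of the reported 2-3ms
const verifiesPerCorePerSec = 1000 / verifyMs; // 400 verifications/s/core
const targetCallsPerSec = 25 * 300;            // 7500 calls/s, unbatched scenario
const coresNeeded = targetCallsPerSec / verifiesPerCorePerSec; // 18.75 cores
```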

Results

Airnode feed (cpu=2048, mem=4096), 300 beacons, 1s fetchInterval, no batching

image

A smaller machine results in 100% CPU. As a note, the target Signed API without verification (cpu=1024, mem=2048) was at 12% CPU, 6% memory. This scaled linearly with the number of Airnode feeds deployed.

Airnode feed (cpu=1024, mem=2048), 300 beacons, 5s fetchInterval, no batching

image

I also tried to deploy cpu=512, mem=1024, but the result was 100% CPU (and it never went down).

Airnode feed (cpu=1024, mem=2048), 100 beacons, 1s fetchInterval, no batching

image

Airnode feed (cpu=512, mem=1024), 100 beacons, 5s fetchInterval, no batching

image

Airnode feed (cpu=1024, mem=2048), 300 beacons, 1s fetchInterval, batched per 100 (3 triggers)

image

As a note, increasing to 400 beacons ends up at 100% CPU. I also ran a similar test with 300 beacons yesterday against a less optimal Signed API (larger responses), and that hit 100% CPU as well.

Airnode feed (cpu=1024, mem=2048), 100 beacons, 1s fetchInterval, batched per 100 (1 trigger)

image

Airnode feed (cpu=1024, mem=2048), 100 beacons, 1s fetchInterval, batched per 100 (1 trigger)

image

Final notes

  • I used a Signed API (cpu=2048, mem=4096) as a data source for an Airnode feed with no batching and 300 requests per second. The Signed API, which served large responses (300 signed data entries), was killed with a fatal error and shortly restarted by AWS: FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory. I updated the machine to cpu=2048, mem=16384 and it failed with a different error, Error: write ENOBUFS, for which I found limited information.
  • When scaling the machine to higher memory, it's very important to set --max-old-space-size (the Node.js heap size) to ~70% of the host machine's memory. Node's default limit is 4GB, which makes a larger host machine useless otherwise.
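The second note can be sketched as a container entrypoint (the entrypoint path and variable names are hypothetical; the echo stands in for actually exec'ing the process): derive the V8 heap flag from the task's memory instead of relying on Node's default cap.

```shell
# Size the Node.js heap to ~70% of the ECS task memory before starting the
# service, so a larger machine is actually usable.
MEM_MB=16384                      # ECS task memory, e.g. mem=16384
HEAP_MB=$((MEM_MB * 70 / 100))    # ~70% for the V8 old space => 11468
echo "node --max-old-space-size=${HEAP_MB} dist/index.js"
```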

@Siegrift (Collaborator, Author) commented:

I understand that these instructions might be hard to follow, but it's quite complex to set the configurations right. If you are interested, I can explain it on a call or in a short Q/A.

@Siegrift (Collaborator, Author) commented:

The templates here have a risk of going stale, but it's convenient to have them here. In the README I explicitly mention checking whether the CF template is correct.

@Siegrift Siegrift requested a review from andreogle December 8, 2023 20:33
@Siegrift Siegrift changed the title Performance test Run performance test Dec 8, 2023
@andreogle (Member) left a comment:

I read through everything and it doesn't seem too complicated. I didn't replicate it locally though.

apiCredentials: [],
nodeSettings: {
nodeVersion: '0.1.0',
airnodeWalletMnemonic: 'destroy manual orange pole pioneer enemy detail lady cake bus shed visa',
@andreogle (Member) commented:

Where does this come from?

@Siegrift (Collaborator, Author) replied:

Randomly generated wallet. I hardcoded it for convenience.

},
{
"urlPath": "/30s-delay",
"delaySeconds": 20
@andreogle (Member) suggested a change: "delaySeconds": 20 → "delaySeconds": 30

@Siegrift (Collaborator, Author) replied:

Hehe, I actually never ended up using these.

@Siegrift Siegrift force-pushed the perf-test-fixes branch 2 times, most recently from e93fc8e to f537f35 on December 11, 2023 14:38
@Siegrift Siegrift merged commit e92fd55 into main Dec 18, 2023
@Siegrift Siegrift deleted the performance-test branch December 18, 2023 20:40
Linked issue: Performance test

3 participants