
feat: use redis for rate limiting & next caching to resolve memory issues #2078

Merged — 23 commits merged into main from shubham/for-1862-reproduce-flix-scalability-issues on Apr 2, 2024

Conversation

@ShubhamPalriwala (Contributor) commented on Feb 14, 2024

What does this PR do?

  • Resolves the memory leak caused by the middleware LRU cache
  • Simplifies rate limiting in Cloud by using a simple integer counter rather than an array
  • Introduces Redis for internal caching + rate limiting instead of in-heap storage, reducing memory-consumption spikes
  • Introduces a new env var, REDIS_CLIENT_URL, that users can set

How should this be tested?

Check memory usage when running both:

  1. Deployed on Vercel
  2. Deployed on a self-hosted instance, with & without the Redis env var.

Memory usage should improve significantly, and the GC should also periodically bring it down when the load drops!
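For reference, a minimal load-generation sketch (the target URL, rate, duration, and header values are placeholders, not part of the PR; any HTTP load tool works just as well):

    // loadtest.ts — fire a steady request stream; watch the server's memory alongside.
    const TARGET_URL = process.env.TARGET_URL ?? "http://localhost:3000/api/v1/client/sync"; // placeholder endpoint
    const REQS_PER_SEC = 300;
    const DURATION_MS = 10 * 60 * 1000; // 10 minutes

    const start = Date.now();
    const timer = setInterval(() => {
      if (Date.now() - start > DURATION_MS) return clearInterval(timer);
      for (let i = 0; i < REQS_PER_SEC; i++) {
        // Vary the forwarded IP so each request hits the rate limiter as a distinct
        // client — this is what surfaced the LRU-cache growth described below.
        fetch(TARGET_URL, {
          headers: { "x-forwarded-for": `10.0.${i % 256}.${(i * 7) % 256}` },
        }).catch(() => {});
      }
    }, 1000);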

Fixes the memory leak, based on the findings below:

  1. We load tested the current self-hosting infra at around 300 reqs/sec with 10 concurrent users and noticed a SIGNIFICANT BUMP in memory (it went from 200 MB to 1 GB in a few minutes!).

     (video: Capture_2024-02-09-13-23-51.mp4)

  2. On dropping the load to 0 reqs/sec, the Node Garbage Collector (GC) cleaned things up but could NOT bring memory under 600 MB.

     (screenshot: Screenshot_2024-02-09-13-31-21_1920x1080)

  3. We then commented out the code of a couple of endpoints to narrow down where the memory leak was actually happening, i.e. infra vs. code. Our finding: even when the sync endpoint returned an empty {} JSON response, with all logic/service calls removed, RAM usage still increased gradually under the same load.

     (screenshot: Screenshot_2024-02-09-13-59-12_1920x1080)

  4. From this, we inferred that the memory leak isn't in our core business logic.

  5. On looking through previous reports of memory leaks in Next.js on GitHub/StackOverflow, many pointers led us to image optimisation and the sharp package dependency. We tested this and reached the following conclusion: memory still increases gradually under constant load, but more slowly than with next/image in use (it took 2x as long, 16 minutes, to reach 1 GB). On stopping the load, memory consumption drops to around half, i.e. 512 MB. This did not feel like our root cause either!

  6. We decided to go back to basics and use Node's own debugger to inspect things at runtime. Starting with the JS heap size, we found that as the load increased and over time, the total-memory-usage-to-heap-usage ratio kept decreasing, meaning memory kept growing!

     (video: Capture_2024-02-12-14-05-13.mp4)

  7. On further debugging inside the heap, looking at the deltas between snapshots, we found something unusual:

     (image)

     Our middleware.js had a ton of closures & array calls!

  8. Going deeper down the rabbit hole, the next thing to find out was why this was happening: if these allocations are per-request, the GC should clean them up at a decent rate after each response. We ran Node with --trace-gc to see how the GC was performing (a sketch of the tooling we used follows this list). It turned out the GC was working fine.

  9. Back in the heap snapshots, we then found that the heap size grew as the number of distinct IP users increased, hinting at the internal LRU cache we used!

  10. The next debugging step was to comment out the middleware.js logic and see how the memory behaved — and boooom, memory handled the load pretty well without going crazy! We found our CULPRIT!

  11. After a lot of ideation and internal discussion, we implemented webdis (a Redis HTTP client + server). The Next.js middleware is compiled to run on the edge runtime (serverless, Cloudflare-Workers-style), so it cannot connect to Redis over its own protocol on top of raw TCP, but it can reach it over HTTP. On self-hosted instances, we now recommend using Redis for the rate-limiting buckets!

     (image (5))
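For anyone reproducing the investigation, here is a minimal sketch of the Node tooling involved. The flags and APIs are standard Node.js; the file names and interval are placeholders:

    // Run the server with GC tracing to watch collection behaviour:
    //   node --trace-gc server.js
    // Or attach the inspector and take heap snapshots from Chrome DevTools:
    //   node --inspect server.js
    // Snapshots and memory stats can also be captured programmatically:
    import v8 from "node:v8";

    // Log heap vs. total (RSS) memory every 10 seconds to watch the ratio drift.
    setInterval(() => {
      const { heapUsed, rss } = process.memoryUsage();
      console.log(`heapUsed=${(heapUsed / 1e6).toFixed(1)} MB rss=${(rss / 1e6).toFixed(1)} MB`);
    }, 10_000);

    // Writes a .heapsnapshot file that Chrome DevTools can load for delta comparison.
    v8.writeHeapSnapshot(`./snapshot-${Date.now()}.heapsnapshot`);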

With this change in place,

  • RAM usage is stable (it still increases gradually, but far more slowly than before, and drops when the GC runs every few minutes)
  • Webdis (Redis server + client) does not take more than 10 MB of RAM
  • Even after around 1-2 hours of constant load with around 800 concurrent users and 150 reqs/sec, memory did not shoot up

    (screenshot: Screenshot_2024-02-14-21-49-30_1920x1080 (1))

  • And once the load is dropped, the GC cleans things up within a few minutes!

(screenshot: Screenshot_2024-02-14-21-51-04_1920x1080)

All end users need to do is run the below service in their existing docker-compose.yml:

  webdis:
    image: nicolas/webdis
    ports:
      - 7379:7379
    restart: unless-stopped
    <<: *environment

And pass this env var to Formbricks: REDIS_HTTP_CLIENT_URL: http://webdis:7379
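For illustration, this is roughly what a round-trip to webdis looks like from application code (a sketch; the key name is a placeholder, and the {"INCR": n} response shape matches the rate-limiter code reviewed below):

    // Sketch of a webdis round-trip, assuming the compose service above is running.
    const WEBDIS_URL = "http://webdis:7379";

    async function incrementCounter(key: string): Promise<number> {
      // GET /INCR/<key> maps to the Redis INCR command; webdis replies with {"INCR": <n>}.
      const res = await fetch(`${WEBDIS_URL}/INCR/${key}`);
      const data = await res.json();
      return parseInt(data["INCR"]);
    }

    // e.g. await incrementCounter("rate-limit:203.0.113.7");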

Checklist

Required

  • Filled out the "How to test" section in this PR
  • Read How we Code at Formbricks
  • Self-reviewed my own code
  • Commented on my code in hard-to-understand bits
  • Ran pnpm build
  • Checked for warnings, there are none
  • Removed all console.logs
  • Merged the latest changes from main onto my branch with git pull origin main
  • My changes don't cause any responsiveness issues

Appreciated

  • If a UI change was made: Added a screen recording or screenshots to this PR
  • Updated the Formbricks Docs if changes were necessary



apps/web/app/middleware/rateLimit.ts

Instead of using fetch for Redis operations in the selfHostingRateLimiter function, consider using a Redis client library like ioredis. This would provide better performance and error handling capabilities.

    import Redis from 'ioredis';

    const redis = new Redis(REDIS_HTTP_CLIENT_URL);

    const selfHostingRateLimiter = (options: Options) => {
      return async (token: string) => {
        // INCR atomically creates the key at 1 or increments it.
        const tokenCount = await redis.incr(token);
        if (tokenCount === 1) {
          // First hit in this window: set the TTL so the bucket resets.
          await redis.expire(token, options.interval / 1000);
        }
        if (tokenCount > options.allowedPerInterval) {
          throw new Error("Rate limit exceeded for IP: " + token);
        }
      };
    };
Checkout the fix: git fetch origin && git checkout -b ReviewBot/The-c-9d1eqsu origin/ReviewBot/The-c-9d1eqsu

docker-compose.yml

Consider adding a network configuration to the webdis service to restrict its access to only the necessary services. This would enhance the security of the application by limiting the potential attack surface.

    webdis:
      image: nicolas/webdis
      ports:
        - 7379:7379
      restart: unless-stopped
      networks:
        - formbricks
Checkout the fix: git fetch origin && git checkout -b ReviewBot/The-d-7f2pe77 origin/ReviewBot/The-d-7f2pe77

Comment on lines 25 to 38 (apps/web/app/middleware/rateLimit.ts):

    const selfHostingRateLimiter = (options: Options) => {
      return async (token: string) => {
        const tokenCountResponse = await fetch(`${REDIS_HTTP_CLIENT_URL}/INCR/${token}`);
        const tokenCountData = await tokenCountResponse.json();
        const tokenCount = parseInt(tokenCountData["INCR"]);
        if (tokenCount === 1) {
          await fetch(`${REDIS_HTTP_CLIENT_URL}/EXPIRE/${token}/${options.interval / 1000}`);
        }
        if (tokenCount > options.allowedPerInterval) {
          throw new Error("Rate limit exceeded for IP: " + token);
        }
      };
    };
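For context, a hypothetical invocation of this limiter — the option values and the clientIp variable are illustrative, not taken from the PR:

    // Hypothetical usage (illustrative values; `clientIp` is assumed to come from the request).
    const limiter = selfHostingRateLimiter({ interval: 60000, allowedPerInterval: 100 });
    try {
      await limiter(clientIp);
    } catch {
      // Rate limit exceeded: respond with HTTP 429.
      return new Response("Too many requests", { status: 429 });
    }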

The fetch API has been replaced with the ioredis library for Redis operations in the selfHostingRateLimiter function. This change improves performance and error handling.

Suggested change:

    import Redis from 'ioredis';

    const redis = new Redis(REDIS_HTTP_CLIENT_URL);

    const selfHostingRateLimiter = (options: Options) => {
      return async (token: string) => {
        const tokenCount = await redis.incr(token);
        if (tokenCount === 1) {
          await redis.expire(token, options.interval / 1000);
        }
        if (tokenCount > options.allowedPerInterval) {
          throw new Error("Rate limit exceeded for IP: " + token);
        }
      };
    };

Comment on lines 77 to 81 (docker-compose.yml):

      webdis:
        image: nicolas/webdis
        ports:
          - 7379:7379
        restart: unless-stopped

Adding a network configuration to the webdis service restricts its access to only the necessary services. This enhances the security of the application by limiting the potential attack surface.

Suggested change:

      webdis:
        image: nicolas/webdis
        ports:
          - 7379:7379
        restart: unless-stopped
        networks:
          - formbricks

@mattinannt mattinannt changed the title feat: (resolve memory leak) use redis for self hosting rate limiting feat: use redis for self hosting rate limiting to resolve memory issues Feb 15, 2024
@mattinannt (Member) left a comment

@ShubhamPalriwala thanks for the PR and the investigation 💪🚀 A few comments as we are moving to a new infrastructure where we might also use this in Formbricks Cloud:

    export default function rateLimit(options: Options) {
      if (IS_FORMBRICKS_CLOUD || !REDIS_HTTP_CLIENT_URL) {

@mattinannt: Do we need the IS_FORMBRICKS_CLOUD check here, or could we just check if Redis is configured?

    const selfHostingRateLimiter = (options: Options) => {

@mattinannt: I would rename them, as we shouldn't distinguish between cloud and self-hosting, but rather between whether Redis is configured or not.
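A minimal sketch of that idea (hypothetical: redisRateLimiter and inMemoryRateLimiter are illustrative names, not the PR's final identifiers):

    // Hypothetical sketch: choose the limiter from configuration, not deployment flavour.
    // `redisRateLimiter` and `inMemoryRateLimiter` are assumed helper factories.
    export default function rateLimit(options: Options) {
      return REDIS_HTTP_CLIENT_URL ? redisRateLimiter(options) : inMemoryRateLimiter(options);
    }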

@ShubhamPalriwala ShubhamPalriwala changed the title feat: use redis for self hosting rate limiting to resolve memory issues feat: use redis for rate limiting & next caching to resolve memory issues Mar 19, 2024
@ShubhamPalriwala ShubhamPalriwala marked this pull request as ready for review March 19, 2024 16:29
@mattinannt mattinannt added this pull request to the merge queue Apr 2, 2024
Merged via the queue into main with commit 20eb679 Apr 2, 2024
12 of 14 checks passed
@mattinannt mattinannt deleted the shubham/for-1862-reproduce-flix-scalability-issues branch April 2, 2024 14:52