
feat: use redis for rate limiting & next caching to resolve memory issues #2078

Merged — 23 commits merged into main from shubham/for-1862-reproduce-flix-scalability-issues on Apr 2, 2024

Conversation

@ShubhamPalriwala (Contributor) commented on Feb 14, 2024

What does this PR do?

  • Resolves the memory leak caused by the middleware LRU cache
  • Simplifies rate limiting in Cloud by using a simple integer counter rather than an array
  • Introduces Redis for internal caching + rate limiting instead of in-heap storage, reducing memory-consumption spikes
  • Introduces a new env var, REDIS_CLIENT_URL, that users can set

How should this be tested?

Check memory usage when running both:

  1. Deployed on Vercel
  2. Deployed on a self-hosted instance, with & without the Redis env var.

Memory usage should improve significantly, and the GC should also periodically bring it down when the load drops!
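For reference, a minimal load-generation sketch (the target URL, rate, duration, and header values are placeholders, not part of the PR; any HTTP load tool works just as well):

    // loadtest.ts — fire a steady request stream; watch the server's memory alongside.
    const TARGET_URL = process.env.TARGET_URL ?? "http://localhost:3000/api/v1/client/sync"; // placeholder endpoint
    const REQS_PER_SEC = 300;
    const DURATION_MS = 10 * 60 * 1000; // 10 minutes

    const start = Date.now();
    const timer = setInterval(() => {
      if (Date.now() - start > DURATION_MS) return clearInterval(timer);
      for (let i = 0; i < REQS_PER_SEC; i++) {
        // Vary the forwarded IP so each request hits the rate limiter as a distinct
        // client — this is what surfaced the LRU-cache growth described below.
        fetch(TARGET_URL, {
          headers: { "x-forwarded-for": `10.0.${i % 256}.${(i * 7) % 256}` },
        }).catch(() => {});
      }
    }, 1000);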

Fixes the memory leak, based on the findings below:

  1. We load tested the current self-hosting infra at around 300 reqs/sec with 10 concurrent users and noticed a SIGNIFICANT BUMP in memory (it went from 200 MB to 1 GB in a few minutes!).

     (video: Capture_2024-02-09-13-23-51.mp4)

  2. On dropping the load to 0 reqs/sec, the Node Garbage Collector (GC) cleaned things up but could NOT bring memory under 600 MB.

     (screenshot: Screenshot_2024-02-09-13-31-21_1920x1080)

  3. We then commented out the code of a couple of endpoints to narrow down where the memory leak was actually happening, i.e. infra vs. code. Our finding: even when the sync endpoint returned an empty {} JSON response, with all logic/service calls removed, RAM usage still increased gradually under the same load.

     (screenshot: Screenshot_2024-02-09-13-59-12_1920x1080)

  4. From this, we inferred that the memory leak isn't in our core business logic.

  5. On looking through previous reports of memory leaks in Next.js on GitHub/StackOverflow, many pointers led us to image optimisation and the sharp package dependency. We tested this and reached the following conclusion: memory still increases gradually under constant load, but more slowly than with next/image in use (it took 2x as long, 16 minutes, to reach 1 GB). On stopping the load, memory consumption drops to around half, i.e. 512 MB. This did not feel like our root cause either!

  6. We decided to go back to basics and use Node's own debugger to inspect things at runtime. Starting with the JS heap size, we found that as the load increased and over time, the total-memory-usage-to-heap-usage ratio kept decreasing, meaning memory kept growing!

     (video: Capture_2024-02-12-14-05-13.mp4)

  7. On further debugging inside the heap, looking at the deltas between snapshots, we found something unusual:

     (image)

     Our middleware.js had a ton of closures & array calls!

  8. Going deeper down the rabbit hole, the next thing to find out was why this was happening: if these allocations are per-request, the GC should clean them up at a decent rate after each response. We ran Node with --trace-gc to see how the GC was performing (a sketch of the tooling we used follows this list). It turned out the GC was working fine.

  9. Back in the heap snapshots, we then found that the heap size grew as the number of distinct IP users increased, hinting at the internal LRU cache we used!

  10. The next debugging step was to comment out the middleware.js logic and see how the memory behaved — and boooom, memory handled the load pretty well without going crazy! We found our CULPRIT!

  11. After a lot of ideation and internal discussion, we implemented webdis (a Redis HTTP client + server). The Next.js middleware is compiled to run on the edge runtime (serverless, Cloudflare-Workers-style), so it cannot connect to Redis over its own protocol on top of raw TCP, but it can reach it over HTTP. On self-hosted instances, we now recommend using Redis for the rate-limiting buckets!

     (image (5))
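For anyone reproducing the investigation, here is a minimal sketch of the Node tooling involved. The flags and APIs are standard Node.js; the file names and interval are placeholders:

    // Run the server with GC tracing to watch collection behaviour:
    //   node --trace-gc server.js
    // Or attach the inspector and take heap snapshots from Chrome DevTools:
    //   node --inspect server.js
    // Snapshots and memory stats can also be captured programmatically:
    import v8 from "node:v8";

    // Log heap vs. total (RSS) memory every 10 seconds to watch the ratio drift.
    setInterval(() => {
      const { heapUsed, rss } = process.memoryUsage();
      console.log(`heapUsed=${(heapUsed / 1e6).toFixed(1)} MB rss=${(rss / 1e6).toFixed(1)} MB`);
    }, 10_000);

    // Writes a .heapsnapshot file that Chrome DevTools can load for delta comparison.
    v8.writeHeapSnapshot(`./snapshot-${Date.now()}.heapsnapshot`);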

With this change in place,

  • RAM usage is stable (it still increases gradually, but far more slowly than before, and drops when the GC runs every few minutes)
  • Webdis (Redis server + client) does not take more than 10 MB of RAM
  • Even after around 1-2 hours of constant load with around 800 concurrent users and 150 reqs/sec, memory did not shoot up

    (screenshot: Screenshot_2024-02-14-21-49-30_1920x1080 (1))

  • And once the load is dropped, the GC cleans things up within a few minutes!

(screenshot: Screenshot_2024-02-14-21-51-04_1920x1080)

All end users need to do is run the below service in their existing docker-compose.yml:

  webdis:
    image: nicolas/webdis
    ports:
      - 7379:7379
    restart: unless-stopped
    <<: *environment

And pass this env var to Formbricks: REDIS_HTTP_CLIENT_URL: http://webdis:7379
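For illustration, this is roughly what a round-trip to webdis looks like from application code (a sketch; the key name is a placeholder, and the {"INCR": n} response shape matches the rate-limiter code reviewed below):

    // Sketch of a webdis round-trip, assuming the compose service above is running.
    const WEBDIS_URL = "http://webdis:7379";

    async function incrementCounter(key: string): Promise<number> {
      // GET /INCR/<key> maps to the Redis INCR command; webdis replies with {"INCR": <n>}.
      const res = await fetch(`${WEBDIS_URL}/INCR/${key}`);
      const data = await res.json();
      return parseInt(data["INCR"]);
    }

    // e.g. await incrementCounter("rate-limit:203.0.113.7");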

Checklist

Required

  • Filled out the "How to test" section in this PR
  • Read How we Code at Formbricks
  • Self-reviewed my own code
  • Commented on my code in hard-to-understand bits
  • Ran pnpm build
  • Checked for warnings, there are none
  • Removed all console.logs
  • Merged the latest changes from main onto my branch with git pull origin main
  • My changes don't cause any responsiveness issues

Appreciated

  • If a UI change was made: Added a screen recording or screenshots to this PR
  • Updated the Formbricks Docs if changes were necessary



apps/web/app/middleware/rateLimit.ts

Instead of using fetch for Redis operations in the selfHostingRateLimiter function, consider using a Redis client library like ioredis. This would provide better performance and error handling capabilities.

    import Redis from 'ioredis';

    const redis = new Redis(REDIS_HTTP_CLIENT_URL);

    const selfHostingRateLimiter = (options: Options) => {
      return async (token: string) => {
        // INCR atomically creates the key at 1 or increments it.
        const tokenCount = await redis.incr(token);
        if (tokenCount === 1) {
          // First hit in this window: set the TTL so the bucket resets.
          await redis.expire(token, options.interval / 1000);
        }
        if (tokenCount > options.allowedPerInterval) {
          throw new Error("Rate limit exceeded for IP: " + token);
        }
      };
    };
Checkout the fix: git fetch origin && git checkout -b ReviewBot/The-c-9d1eqsu origin/ReviewBot/The-c-9d1eqsu

docker-compose.yml

Consider adding a network configuration to the webdis service to restrict its access to only the necessary services. This would enhance the security of the application by limiting the potential attack surface.

    webdis:
      image: nicolas/webdis
      ports:
        - 7379:7379
      restart: unless-stopped
      networks:
        - formbricks
Checkout the fix: git fetch origin && git checkout -b ReviewBot/The-d-7f2pe77 origin/ReviewBot/The-d-7f2pe77

Comment on lines 25 to 38 (apps/web/app/middleware/rateLimit.ts):

    const selfHostingRateLimiter = (options: Options) => {
      return async (token: string) => {
        const tokenCountResponse = await fetch(`${REDIS_HTTP_CLIENT_URL}/INCR/${token}`);
        const tokenCountData = await tokenCountResponse.json();
        const tokenCount = parseInt(tokenCountData["INCR"]);
        if (tokenCount === 1) {
          await fetch(`${REDIS_HTTP_CLIENT_URL}/EXPIRE/${token}/${options.interval / 1000}`);
        }
        if (tokenCount > options.allowedPerInterval) {
          throw new Error("Rate limit exceeded for IP: " + token);
        }
      };
    };
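For context, a hypothetical invocation of this limiter — the option values and the clientIp variable are illustrative, not taken from the PR:

    // Hypothetical usage (illustrative values; `clientIp` is assumed to come from the request).
    const limiter = selfHostingRateLimiter({ interval: 60000, allowedPerInterval: 100 });
    try {
      await limiter(clientIp);
    } catch {
      // Rate limit exceeded: respond with HTTP 429.
      return new Response("Too many requests", { status: 429 });
    }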

The fetch API has been replaced with the ioredis library for Redis operations in the selfHostingRateLimiter function. This change improves performance and error handling.

Suggested change:

    import Redis from 'ioredis';

    const redis = new Redis(REDIS_HTTP_CLIENT_URL);

    const selfHostingRateLimiter = (options: Options) => {
      return async (token: string) => {
        const tokenCount = await redis.incr(token);
        if (tokenCount === 1) {
          await redis.expire(token, options.interval / 1000);
        }
        if (tokenCount > options.allowedPerInterval) {
          throw new Error("Rate limit exceeded for IP: " + token);
        }
      };
    };

Comment on lines 77 to 81 (docker-compose.yml):

      webdis:
        image: nicolas/webdis
        ports:
          - 7379:7379
        restart: unless-stopped

Adding a network configuration to the webdis service restricts its access to only the necessary services. This enhances the security of the application by limiting the potential attack surface.

Suggested change:

      webdis:
        image: nicolas/webdis
        ports:
          - 7379:7379
        restart: unless-stopped
        networks:
          - formbricks

@mattinannt mattinannt changed the title feat: (resolve memory leak) use redis for self hosting rate limiting feat: use redis for self hosting rate limiting to resolve memory issues Feb 15, 2024
@mattinannt (Member) left a comment

@ShubhamPalriwala thanks for the PR and the investigation 💪🚀 A few comments as we are moving to a new infrastructure where we might also use this in Formbricks Cloud:

    export default function rateLimit(options: Options) {
      if (IS_FORMBRICKS_CLOUD || !REDIS_HTTP_CLIENT_URL) {

@mattinannt: Do we need the IS_FORMBRICKS_CLOUD check here, or could we just check if Redis is configured?

    const selfHostingRateLimiter = (options: Options) => {

@mattinannt: I would rename them, as we shouldn't distinguish between cloud and self-hosting, but rather between whether Redis is configured or not.
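A minimal sketch of that idea (hypothetical: redisRateLimiter and inMemoryRateLimiter are illustrative names, not the PR's final identifiers):

    // Hypothetical sketch: choose the limiter from configuration, not deployment flavour.
    // `redisRateLimiter` and `inMemoryRateLimiter` are assumed helper factories.
    export default function rateLimit(options: Options) {
      return REDIS_HTTP_CLIENT_URL ? redisRateLimiter(options) : inMemoryRateLimiter(options);
    }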

@ShubhamPalriwala ShubhamPalriwala changed the title feat: use redis for self hosting rate limiting to resolve memory issues feat: use redis for rate limiting & next caching to resolve memory issues Mar 19, 2024
@ShubhamPalriwala ShubhamPalriwala marked this pull request as ready for review March 19, 2024 16:29
@mattinannt mattinannt added this pull request to the merge queue Apr 2, 2024
Merged via the queue into main with commit 20eb679 Apr 2, 2024
12 of 14 checks passed
@mattinannt mattinannt deleted the shubham/for-1862-reproduce-flix-scalability-issues branch April 2, 2024 14:52