Limit the number of concurrent dead link requests #4333
Conversation
I'm undrafting this now because it works, and I need input from others on whether it's worth moving forward with this approach.
The primary difference from my initial version is that this is a request-local limit rather than a worker-global limit. That is, the concurrency of dead link requests is limited per search request, so the effective per-worker ceiling depends on how many search requests the worker is handling at once. uvicorn already enforces that with its own flow control, so in theory our own guard against runaway sockets in our code should work together with the worker's overall limit.
When we configure the environment variable to enable the limit, we need to consider it within that context.
Pulling some production stats, estimating per worker requirements based on the canary service (which is a single worker and sent a proportional number of requests), these are the numbers we should keep in mind:
- 50 requests per second
- 2 second head request timeout
- 400 works max pagination depth, ~300 p95, ~170 average
- 50% assumed dead link ratio
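To put those numbers in perspective, here's a rough back-of-envelope estimate (my own arithmetic, not from the PR discussion) of how many dead link HEAD requests could be in flight on a single worker at once if nothing limits them:

```python
# Rough worst-case estimate of in-flight HEAD requests per worker with no limit.
# All figures come from the stats above; the dead-link ratio is ignored here,
# which makes this an upper bound.
searches_per_second = 50   # per-worker search rate (canary estimate)
head_timeout_seconds = 2   # HEAD request timeout
urls_per_search = {"max": 400, "p95": 300, "avg": 170}

for label, fan_out in urls_per_search.items():
    # Little's-law style estimate: arrival rate x duration x fan-out
    in_flight = searches_per_second * head_timeout_seconds * fan_out
    print(f"{label}: ~{in_flight:,} concurrent HEAD requests")
# max: ~40,000   p95: ~30,000   avg: ~17,000
```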
        return url, -1


    # https://stackoverflow.com/q/55259755
    @async_to_sync
    async def _make_head_requests(urls: list[str]) -> list[tuple[str, int]]:
        tasks = [asyncio.ensure_future(_head(url)) for url in urls]
        session = await get_aiohttp_session()
Moving the session retrieval here significantly reduces the number of log lines. `get_aiohttp_session` logs whether it's reusing or creating a session for the loop on every call. Even in log storage we could save money, let alone the real cost of calling `get_aiohttp_session`, which isn't free anyway.
It does not have any effect on this flow control issue, however (just to clarify).
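To illustrate the point about log volume, here's a minimal hypothetical sketch (not the exact PR code; the `_head(url, session)` signature is my assumption) of fetching the session once per batch and handing it to each request:

```python
import asyncio

import aiohttp


async def _head(url: str, session: aiohttp.ClientSession) -> tuple[str, int]:
    try:
        async with session.head(url, allow_redirects=False) as response:
            return url, response.status
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return url, -1


async def _make_head_requests(urls: list[str]) -> list[tuple[str, int]]:
    # One get_aiohttp_session() call (the project helper referenced above),
    # and therefore one "reusing/creating session" log line, per search
    # request instead of one per URL.
    session = await get_aiohttp_session()
    tasks = [asyncio.ensure_future(_head(url, session)) for url in urls]
    return await asyncio.gather(*tasks)
```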
    if not settings.LINK_VALIDATION_MAX_CONCURRENT_REQUESTS:
        yield
        return
A default of None for the setting means this won't have any effect when deployed until we actually configure a value via the environment variable. I'll share my thoughts elsewhere on what an appropriate starting point should be.
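For context, a minimal sketch of how the unset default falls through to a no-op, assuming Django-style settings and that the limiter is an async context manager (as the `yield`/`return` above suggests); names other than the setting are illustrative:

```python
import asyncio
from contextlib import asynccontextmanager

from django.conf import settings


@asynccontextmanager
async def _limit_concurrency(semaphore: asyncio.Semaphore):
    if not settings.LINK_VALIDATION_MAX_CONCURRENT_REQUESTS:
        # The setting defaults to None, so in deployment this is a no-op until
        # the environment variable is configured.
        yield
        return
    # Otherwise, gate the caller on the per-search-request semaphore.
    async with semaphore:
        yield
```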
    @@ -55,5 +55,6 @@ IS_PROXIED=False
    #SENTRY_DSN=

    FILTER_DEAD_LINKS_BY_DEFAULT=False
    LINK_VALIDATION_MAX_CONCURRENT_REQUESTS=10
Default for local usage to make it easier to test.
I'll look at this again tomorrow. I haven't worked much with semaphores, but it looks interesting (and challenging at the same time!).
    # https://stackoverflow.com/q/55259755
    @async_to_sync
    async def _make_head_requests(urls: list[str]) -> list[tuple[str, int]]:
        tasks = [asyncio.ensure_future(_head(url)) for url in urls]
        # Limits total concurrent dead link requests per search request
        # Must be created (and therefore bound to loop) within `_make_head_request`
        # This prevents
This prevents... ?
Also, I noticed that the link in the comment mentions aiohttp already limits the number of simultaneous connections to 100.
It does this by setting a default limit on the TCPConnector object used by ClientSession.
Is that something you've already considered, meaning this new code would be a second limit? Or could it be a simpler/more direct way to limit the number of requests?
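For reference, a sketch of what that built-in limit looks like; this mirrors aiohttp's documented defaults rather than anything in this PR:

```python
import aiohttp


async def make_session() -> aiohttp.ClientSession:
    # TCPConnector defaults to limit=100 total simultaneous connections for the
    # whole session; limit_per_host=0 means no per-host cap. Passing them
    # explicitly here just makes the defaults visible.
    connector = aiohttp.TCPConnector(limit=100, limit_per_host=0)
    return aiohttp.ClientSession(connector=connector)
```

Note this is a session-wide cap on open connections, not a per-search-request cap, so it limits sockets globally rather than per incoming request.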
Oh, I wasn't aware of that aiohttp feature. I need to look into it again! Maybe there's a simpler way to do this than manually tracking with a semaphore? 100 is rather far below the limit I would have proposed, so that's interesting 🤔
So, looking at it, it should already be using a limit of 100 like you said.
This makes me far less confident this change would be meaningful. I think I'll close this PR, and instead we should think about what logging we could add to reveal where the issues are coming from.
What do you think, @krysal? Any ideas for logs we could add to try to reveal this?
Fixes
Fixes #4332 by @sarayourfriend
Description
Limits the number of concurrent dead link requests by acquiring a semaphore before sending each request; the semaphore enforces the limit.
This is a draft because I need to think about how to write tests for this. Manual testing appears to prove this works in the happy path, but I'm not convinced that's enough!
I would also love input on the maximum to enforce.
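A simplified sketch of the semaphore pattern described above (illustrative names and limit value, not the exact PR code):

```python
import asyncio


async def _check(url: str, semaphore: asyncio.Semaphore) -> tuple[str, int]:
    async with semaphore:  # block here once the per-request limit is reached
        await asyncio.sleep(0.1)  # stand-in for the real HEAD request
        return url, 200


async def _make_head_requests(urls: list[str]) -> list[tuple[str, int]]:
    # Created inside the coroutine so the semaphore is bound to the running loop
    semaphore = asyncio.Semaphore(10)  # limit value is illustrative
    tasks = [asyncio.ensure_future(_check(url, semaphore)) for url in urls]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(50)]
    print(len(asyncio.run(_make_head_requests(urls))))
```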
The new logs are set to `info`, but I'll reduce them to `debug` after testing. It's easier to keep them at `info` right now; otherwise, if debug logs are enabled, there's too much to sort through to find the semaphore ones.

Testing Instructions

Run the API locally with `just api/up` and run `just logs web` to view the logs.

Make a search request with `filter_dead=True` and a large page size. The large page size helps illustrate the issue.

The following patch makes this easier to test by bypassing cached values and reducing the number of unrelated logs that make it hard to follow the semaphore logs:
Checklist

- My pull request has a descriptive title (not a vague title like `Update index.md`).
- My pull request targets the default branch of the repository (`main`) or a parent feature branch.
- I ran the DAG documentation generator (`just catalog/generate-docs` for catalog PRs) or the media properties generator (`just catalog/generate-docs media-props` for the catalog or `just api/generate-docs` for the API) where applicable.
Developer Certificate of Origin