Limit the number of concurrent dead link requests #4333
Conversation
I'm undrafting this now because it works, and I need input from others on whether it's worth moving forward with this approach.
The primary difference from my initial version is that this is a request-local limit rather than a worker-global limit. That is, the concurrency of dead link requests is limited per search request, so the effective per-worker ceiling depends on how many search requests the worker is handling at once. uvicorn already enforces that with its own flow control, so in theory our own guard against runaway sockets in our code should work together with the worker's overall limit.
When we configure the environment variable to enable the limit, we need to consider it within that context.
Pulling some production stats, estimating per worker requirements based on the canary service (which is a single worker and sent a proportional number of requests), these are the numbers we should keep in mind:
- 50 requests per second
- 2 second head request timeout
- 400 works max pagination depth, ~300 p95, ~170 average
- 50% assumed dead link ratio
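To put those numbers in perspective, here's a rough back-of-envelope estimate (my own arithmetic, not from the PR discussion) of how many dead link HEAD requests could be in flight on a single worker at once if nothing limits them:

```python
# Rough worst-case estimate of in-flight HEAD requests per worker with no limit.
# All figures come from the stats above; the dead-link ratio is ignored here,
# which makes this an upper bound.
searches_per_second = 50   # per-worker search rate (canary estimate)
head_timeout_seconds = 2   # HEAD request timeout
urls_per_search = {"max": 400, "p95": 300, "avg": 170}

for label, fan_out in urls_per_search.items():
    # Little's-law style estimate: arrival rate x duration x fan-out
    in_flight = searches_per_second * head_timeout_seconds * fan_out
    print(f"{label}: ~{in_flight:,} concurrent HEAD requests")
# max: ~40,000   p95: ~30,000   avg: ~17,000
```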
        return url, -1


    # https://stackoverflow.com/q/55259755
    @async_to_sync
    async def _make_head_requests(urls: list[str]) -> list[tuple[str, int]]:
        tasks = [asyncio.ensure_future(_head(url)) for url in urls]
        session = await get_aiohttp_session()
Moving the session retrieval here significantly reduces the number of log lines. `get_aiohttp_session` logs whether it's reusing or creating a session for the loop on every call. Even in log storage we could save money, let alone the real cost of calling `get_aiohttp_session`, which isn't free anyway.
It does not have any effect on this flow control issue, however (just to clarify).
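To illustrate the point about log volume, here's a minimal hypothetical sketch (not the exact PR code; the `_head(url, session)` signature is my assumption) of fetching the session once per batch and handing it to each request:

```python
import asyncio

import aiohttp


async def _head(url: str, session: aiohttp.ClientSession) -> tuple[str, int]:
    try:
        async with session.head(url, allow_redirects=False) as response:
            return url, response.status
    except (aiohttp.ClientError, asyncio.TimeoutError):
        return url, -1


async def _make_head_requests(urls: list[str]) -> list[tuple[str, int]]:
    # One get_aiohttp_session() call (the project helper referenced above),
    # and therefore one "reusing/creating session" log line, per search
    # request instead of one per URL.
    session = await get_aiohttp_session()
    tasks = [asyncio.ensure_future(_head(url, session)) for url in urls]
    return await asyncio.gather(*tasks)
```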
    if not settings.LINK_VALIDATION_MAX_CONCURRENT_REQUESTS:
        yield
        return
A default of None for the setting means this won't have any effect when deployed until we actually configure a value via the environment variable. I'll share my thoughts elsewhere on what an appropriate starting point should be.
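For context, a minimal sketch of how the unset default falls through to a no-op, assuming Django-style settings and that the limiter is an async context manager (as the `yield`/`return` above suggests); names other than the setting are illustrative:

```python
import asyncio
from contextlib import asynccontextmanager

from django.conf import settings


@asynccontextmanager
async def _limit_concurrency(semaphore: asyncio.Semaphore):
    if not settings.LINK_VALIDATION_MAX_CONCURRENT_REQUESTS:
        # The setting defaults to None, so in deployment this is a no-op until
        # the environment variable is configured.
        yield
        return
    # Otherwise, gate the caller on the per-search-request semaphore.
    async with semaphore:
        yield
```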
    @@ -55,5 +55,6 @@ IS_PROXIED=False
    #SENTRY_DSN=

    FILTER_DEAD_LINKS_BY_DEFAULT=False
    LINK_VALIDATION_MAX_CONCURRENT_REQUESTS=10
Default for local usage to make it easier to test.
I'll look at this again tomorrow. I haven't worked much with semaphores, but it looks interesting (and challenging at the same time!).
    # https://stackoverflow.com/q/55259755
    @async_to_sync
    async def _make_head_requests(urls: list[str]) -> list[tuple[str, int]]:
        tasks = [asyncio.ensure_future(_head(url)) for url in urls]
        # Limits total concurrent dead link requests per search request
        # Must be created (and therefore bound to loop) within `_make_head_request`
        # This prevents
This prevents... ?
Also, I noticed that the link in the comment mentions aiohttp already limits the number of simultaneous connections to 100.
It does this by setting a default limit on the TCPConnector object used by ClientSession.
Is that something you've already considered, meaning this new code would be a second limit? Or could it be a simpler/more direct way to limit the number of requests?
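For reference, a sketch of what that built-in limit looks like; this mirrors aiohttp's documented defaults rather than anything in this PR:

```python
import aiohttp


async def make_session() -> aiohttp.ClientSession:
    # TCPConnector defaults to limit=100 total simultaneous connections for the
    # whole session; limit_per_host=0 means no per-host cap. Passing them
    # explicitly here just makes the defaults visible.
    connector = aiohttp.TCPConnector(limit=100, limit_per_host=0)
    return aiohttp.ClientSession(connector=connector)
```

Note this is a session-wide cap on open connections, not a per-search-request cap, so it limits sockets globally rather than per incoming request.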
Oh, I wasn't aware of that aiohttp feature. I need to look into it again! Maybe there's a simpler way to do this than manually tracking with a semaphore? 100 is rather far below the limit I would have proposed, so that's interesting 🤔
So, looking at it, it should already be using a limit of 100 like you said.
This makes me far less confident this change would be meaningful. I think I'll close this PR, and instead we should think about what logging we could add to reveal where the issues are coming from.
What do you think, @krysal? Any ideas for logs we could add to try to reveal this?
Fixes
Fixes #4332 by @sarayourfriend
Description
Limits the number of concurrent dead link requests by acquiring a semaphore before sending each request; the semaphore enforces the limit.
This is a draft because I need to think about how to write tests for this. Manual testing appears to prove this works in the happy path, but I'm not convinced that's enough!
I would also love input on the maximum to enforce.
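A simplified sketch of the semaphore pattern described above (illustrative names and limit value, not the exact PR code):

```python
import asyncio


async def _check(url: str, semaphore: asyncio.Semaphore) -> tuple[str, int]:
    async with semaphore:  # block here once the per-request limit is reached
        await asyncio.sleep(0.1)  # stand-in for the real HEAD request
        return url, 200


async def _make_head_requests(urls: list[str]) -> list[tuple[str, int]]:
    # Created inside the coroutine so the semaphore is bound to the running loop
    semaphore = asyncio.Semaphore(10)  # limit value is illustrative
    tasks = [asyncio.ensure_future(_check(url, semaphore)) for url in urls]
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    urls = [f"https://example.com/{i}" for i in range(50)]
    print(len(asyncio.run(_make_head_requests(urls))))
```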
The new logs are set to `info`, but I'll reduce them to `debug` after testing. It's easier to keep them at `info` right now; otherwise, if debug logs are enabled, there's too much to sort through to find the semaphore ones.

Testing Instructions

Run the API locally with `just api/up` and run `just logs web` to view the logs.

Make a search request with `filter_dead=True` and a large page size. The large page size helps illustrate the issue.

The following patch makes this easier to test by bypassing cached values and reducing the number of unrelated logs that make it hard to follow the semaphore logs:
Checklist

- My pull request has a descriptive title (not a vague title like `Update index.md`).
- My pull request targets the default branch of the repository (`main`) or a parent feature branch.
- I ran the DAG documentation generator (`just catalog/generate-docs` for catalog PRs) or the media properties generator (`just catalog/generate-docs media-props` for the catalog or `just api/generate-docs` for the API) where applicable.
Developer Certificate of Origin