
fix: reset RequestQueue state after 5 minutes of inactivity#1324

Merged
B4nan merged 1 commit into master from zero-concurrency on Apr 7, 2022

Conversation

Member

@B4nan B4nan commented Apr 6, 2022

We now track last activity done on a RequestQueue instance:

  • added new request
  • started processing a request (added to inProgress cache)
  • marked request as handled
  • reclaimed request

If we don't detect one of those actions in the last 5 minutes, and we have some requests in the inProgress cache, we try to reset the state. This limit can be overridden via the APIFY_INTERNAL_TIMEOUT env var. If the request handler timeout value is higher, we use that instead. The same internal timeout is also used when executing the user-provided request handler, so we know there has to be some activity within this interval (the handler either finished or timed out).
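A minimal sketch of the bookkeeping described above (the class and method names here are illustrative, not the actual SDK API):

```javascript
// Minimal sketch of the activity tracking described above.
// Class and method names are illustrative, not the actual SDK API.
class ActivityTrackedQueue {
    constructor({ handleRequestTimeoutMillis = 60e3 } = {}) {
        const fromEnv = process.env.APIFY_INTERNAL_TIMEOUT;
        // Env var wins; otherwise allow at least 5 minutes,
        // or the request handler timeout if that is higher.
        this.internalTimeoutMillis = fromEnv != null
            ? +fromEnv
            : Math.max(handleRequestTimeoutMillis, 300e3);
        this.lastActivity = new Date();
        this.inProgress = new Set();
    }

    touch() { this.lastActivity = new Date(); }

    // Each tracked action bumps the activity timestamp.
    addRequest(id) { this.touch(); }
    startProcessing(id) { this.touch(); this.inProgress.add(id); }
    markHandled(id) { this.touch(); this.inProgress.delete(id); }
    reclaimRequest(id) { this.touch(); }

    // Stuck = something is in progress, but no tracked activity
    // happened within the internal timeout window.
    isStuck() {
        return this.inProgress.size > 0
            && (Date.now() - +this.lastActivity) > this.internalTimeoutMillis;
    }
}
```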

This should finally resolve the 0 concurrency bug, as it was always about stuck requests in the inProgress cache.

Closes #997

Member Author

B4nan commented Apr 6, 2022

Output from the E2E test:

➜  apify-js git:(zero-concurrency) ✗ node test/request_queue_reset.test-e2e.js
WARN  Neither APIFY_LOCAL_STORAGE_DIR nor APIFY_TOKEN environment variable is set, defaulting to APIFY_LOCAL_STORAGE_DIR="/Users/adamek/htdocs/apify/apify-js/apify_storage"
INFO  System info {"apifyVersion":"2.3.0","apifyClientVersion":"2.2.0","osType":"Darwin","nodeVersion":"v16.10.0"}
WARN  Neither APIFY_LOCAL_STORAGE_DIR nor APIFY_TOKEN environment variable is set, defaulting to APIFY_LOCAL_STORAGE_DIR="/Users/adamek/htdocs/apify/apify-js/apify_storage"
INFO  CheerioCrawler:AutoscaledPool:Snapshotter: Setting max memory of this run to 4096 MB. Use the APIFY_MEMORY_MBYTES environment variable to override it.
DEBUG CheerioCrawler:AutoscaledPool: scaling up {"oldConcurrency":1,"newConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":null}}}
INFO  CheerioCrawler:AutoscaledPool: state {"currentConcurrency":0,"desiredConcurrency":2,"systemStatus":{"isSystemIdle":true,"memInfo":{"isOverloaded":false,"limitRatio":0.2,"actualRatio":null},"eventLoopInfo":{"isOverloaded":false,"limitRatio":0.7,"actualRatio":null},"cpuInfo":{"isOverloaded":false,"limitRatio":0.4,"actualRatio":null},"clientInfo":{"isOverloaded":false,"limitRatio":0.3,"actualRatio":null}}}
DEBUG CheerioCrawler:SessionPool: Created new Session - session_cffdMdWFM4
DEBUG CheerioCrawler:SessionPool: Created new Session - session_NiDeZFYe1C
INFO  pvjOOriUMWoJNqA
INFO  35bn3sNHmEkLZ5B
WARN  RequestQueue: Queue head returned a request that is already in progress?! {"nextRequestId":"QkJTmIh3PUZUkQs","inProgress":true,"recentlyHandled":false}
DEBUG CheerioCrawler:SessionPool: Created new Session - session_0wAmzeXEWj
WARN  RequestQueue: The request queue seems to be stuck for 30s, resetting internal state. {"inProgress":["QkJTmIh3PUZUkQs"]}
DEBUG CheerioCrawler:SessionPool: Created new Session - session_TCguiWXgZF
INFO  QkJTmIh3PUZUkQs
INFO  CheerioCrawler: All the requests from request list and/or request queue have been processed, the crawler will shut down.
DEBUG CheerioCrawler:SessionPool: Persisting state {"persistStateKey":"SDK_SESSION_POOL_STATE"}
DEBUG Statistics: Persisting state {"persistStateKey":"SDK_CRAWLER_STATISTICS_0"}
INFO  CheerioCrawler: Final request statistics: {"requestsFinished":3,"requestsFailed":0,"retryHistogram":[3],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":753,"requestsFinishedPerMinute":6,"requestsFailedPerMinute":0,"requestTotalDurationMillis":2258,"requestsTotal":3,"crawlerRuntimeMillis":31834}

One of the 3 requests is added to the inProgress cache to trigger the 0 concurrency state. We don't explicitly check whether we are at 0 concurrency; instead, we check that the RQ is being actively used. The test gets stuck after the first 2 requests are processed, waits for the internal timeout to pass, then automatically resets the RQ, reloads its state, and processes the missing third request.
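The reset-and-reload step the test exercises can be sketched like this (the names below are assumptions for illustration, not the SDK's real internals):

```javascript
// Sketch of the stuck-queue reset the E2E test exercises.
// Names are assumptions for illustration, not the SDK's real internals.
class ResettableQueue {
    constructor(internalTimeoutMillis) {
        this.internalTimeoutMillis = internalTimeoutMillis;
        this.lastActivity = new Date();
        this.inProgress = new Set();
        this.queueHeadCache = null; // cached queue head, dropped on reset
    }

    // Called periodically; returns true if the state was reset.
    resetIfStuck() {
        const idleFor = Date.now() - +this.lastActivity;
        if (this.inProgress.size > 0 && idleFor > this.internalTimeoutMillis) {
            this.inProgress.clear();     // forget stuck request(s) so they get fetched again
            this.queueHeadCache = null;  // force reloading the queue head
            this.lastActivity = new Date();
            return true;
        }
        return false;
    }
}
```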

Will test this on the platform now via patch-package on some real-world example, with browsers involved too.

…tivity

We now track last activity done on a `RequestQueue` instance:
- added new request
- started processing a request (added to `inProgress` cache)
- marked request as handled
- reclaimed request

If we don't detect one of those actions in the last 5 minutes, and we have some
requests in the `inProgress` cache, we try to reset the state. We can override
this limit via the `APIFY_INTERNAL_TIMEOUT` env var.

This should finally resolve the 0 concurrency bug, as it was always about
stuck requests in the `inProgress` cache.

Closes #997
@B4nan B4nan force-pushed the zero-concurrency branch from 55b75be to 8cad475 Compare April 6, 2022 14:55
@B4nan B4nan requested review from metalwarrior665 and mnmkng April 6, 2022 14:58
```diff
-this.internalTimeoutMillis = Math.max(this.handleRequestTimeoutMillis, 300e3); // allow at least 5min for internal timeouts
+const tryEnv = (val) => (val == null ? val : +val);
+// allow at least 5min for internal timeouts
+this.internalTimeoutMillis = tryEnv(process.env.APIFY_INTERNAL_TIMEOUT) ?? Math.max(this.handleRequestTimeoutMillis, 300e3);
```
Member

I would go with something like this.handleRequestTimeoutMillis * 2 + 10. There are all the internal SDK function calls that need to complete as well.

Member Author

The timeout was already there; all I changed was making it configurable via env vars. It's been in production for about 4 months, so I'd say the value is fine. But we can double it for the new RQ reset check to be really sure.

Note that this value is used as a timeout for several things separately; it is not meant to be some "master timeout" that fits the whole request life cycle. It is used for the internal API calls, and also as a timeout for the user handler.
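Using one value as a timeout around both internal calls and the user handler boils down to a generic wrapper along these lines (a sketch; `withTimeout` is not the SDK's actual helper):

```javascript
// Generic promise-timeout wrapper illustrating how a single timeout value
// can guard both internal API calls and the user-provided handler.
// This is a sketch; the SDK's actual helper may differ.
const withTimeout = (promise, millis, message) => {
    let timer;
    const timeout = new Promise((_, reject) => {
        timer = setTimeout(() => reject(new Error(message)), millis);
    });
    // Whichever settles first wins; always clear the timer afterwards.
    return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
};
```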

Member

Even if it was there, that does not change the validity of my argument that such a timeout has a race condition 😄

```js
async isFinished() {
    if (this.inProgressCount() > 0 && (Date.now() - +this.lastActivity) > this.internalTimeoutMillis) {
        const message = `The request queue seems to be stuck for ${this.internalTimeoutMillis / 1e3}s, resetting internal state.`;
        this.log.warning(message, { inProgress: [...this.inProgress] });
        // ...
```
Member

It would be nice to know where it got stuck here: whether it was while fetching a request, handling a request, marking a request as handled, and so on. Can we do this?

Member Author

@B4nan B4nan Apr 6, 2022


I can save the last action that was done next to the time, but I already know it won't help us. The problem was never about the last action done on the queue, but about a stuck request - and that one was processed much earlier, not at the very end. We can detect this only once all other requests are processed.

Member

Yeah, I meant to have this tied to the stuck request, not the one currently being processed.

@B4nan B4nan changed the title fix: automatically reset RequestQueue state after 5 minutes of inactivity fix: reset RequestQueue state after 5 minutes of inactivity Apr 7, 2022
@B4nan B4nan merged commit fc6e4bf into master Apr 7, 2022
@B4nan B4nan deleted the zero-concurrency branch April 7, 2022 08:21
B4nan added a commit that referenced this pull request Apr 7, 2022


Successfully merging this pull request may close these issues.

Actor stuck on 0 concurrency for infinite time
