Cannot resume if the request queue is too big #1990
dragospopa420 started this conversation in General
Replies: 1 comment
-
What do you mean by pausing? And how do you resume? Can you provide an actual repro? There were various small improvements in recent versions, so things might behave better now. The memory storage persists things to the file system, and is generally not suited for millions of requests. In the long run, we'd like to introduce more storage clients, including some backed by regular databases like postgres, which should generally help. You could also try https://github.com/apify/apify-storage-local-js which uses sqlite as the backend.
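The sqlite-backed client mentioned above can be plugged into a crawler roughly like this. This is a sketch based on the two packages' documented APIs, assuming both `crawlee` and `@apify/storage-local` are installed; the start URL is illustrative:

```javascript
// Sketch: replace Crawlee's default memory storage with the
// sqlite-backed @apify/storage-local client, so the request queue
// lives in a database file instead of memory + JSON files.
import { CheerioCrawler, Configuration } from 'crawlee';
import { ApifyStorageLocal } from '@apify/storage-local';

// Route all storage operations (request queues, datasets,
// key-value stores) through the sqlite client.
Configuration.getGlobalConfig().useStorageClient(new ApifyStorageLocal());

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']); // illustrative URL
```

With the queue in sqlite, its size is no longer bounded by process memory, which is the failure mode the discussion describes.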
-
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/memory-storage
Issue description
Steps to reproduce:
It doesn't matter whether the scraper uses Cheerio, Playwright, or Puppeteer; I tested all three. (I'll add sample code from a Cheerio one.)
From what I saw, after roughly 1 million requests, if I pause the scraper for any reason it cannot be resumed... it never resumes.
Maybe it would be useful to have the option to use a Redis backend for the request queues.
Code sample
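The sample code itself was not captured in this page. Below is a minimal sketch of the kind of Cheerio crawl described, assuming `crawlee` is installed; the start URL is illustrative, and "pausing" here means interrupting the process, with resuming done by re-running it with `CRAWLEE_PURGE_ON_START=0` so the persisted queue is reused rather than purged:

```javascript
// Sketch of a repro: a CheerioCrawler driving a large request queue.
// To "pause", interrupt the process (Ctrl+C); to resume, re-run with
// CRAWLEE_PURGE_ON_START=0 so the stored queue is not purged on start.
// The start URL is illustrative, not from the original report.
import { CheerioCrawler } from 'crawlee';

const crawler = new CheerioCrawler({
    async requestHandler({ request, enqueueLinks, log }) {
        log.info(`Processing ${request.url}`);
        // Keep enqueueing discovered links so the queue grows
        // toward the ~1M requests at which resuming reportedly fails.
        await enqueueLinks();
    },
});

await crawler.run(['https://example.com']);
```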
Package version
3.3.0
Node.js version
18.14.2
Operating system
Fedora 37
Apify platform
I have tested this on the next release
No response
Other context
No response