Running the Crawl in the same process multiple times #1970
Unanswered
germanattanasio asked this question in Q&A
Replies: 1 comment
-
@B4nan mentioned that crawlee 3.4.0 does this automatically already.
-
Hi, first of all, this library is amazing. Thank you! I'm writing this Q&A because I've spent some time looking at the source code, issues, and existing discussions, and I think there is an opportunity to help the community behind this project understand how the crawling works.
Crawlee: 3.4.0
Problem
I consume a Kafka message and run a Playwright crawler based on the configuration from that message. Each page that is found is then published into another topic.
Steps to reproduce
Based on the following function, I wrote a simple test that tries to call crawlPage multiple times. When running the code, onDocument is called ~11 times on the first run, and the output includes the expected documents. So far so good.
When I try a second time, I call crawlPage again, which creates a new PlaywrightCrawler. I would expect the state between the two instances of PlaywrightCrawler not to be shared by default. The test fails and says that onDocumentSecond has not been called.
Workaround
Digging through the issues, I found that someone was calling requestQueue.drop() to clean up the already-visited pages, so I'm now doing the following.
Question
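A sketch of that workaround, assuming the queue is opened explicitly and dropped after the run (the wrapper name and callback are illustrative):

```typescript
import { PlaywrightCrawler, RequestQueue } from 'crawlee';

// Illustrative sketch of the workaround: open the request queue
// explicitly, run the crawl, then drop the queue so the next run
// does not see the previous run's "already handled" requests.
async function crawlPage(
  startUrl: string,
  onDocument: (url: string) => Promise<void>,
): Promise<void> {
  const requestQueue = await RequestQueue.open();
  const crawler = new PlaywrightCrawler({
    requestQueue,
    async requestHandler({ request, enqueueLinks }) {
      await onDocument(request.url);
      await enqueueLinks();
    },
  });
  await crawler.run([startUrl]);
  // Delete the queue's storage so its state is not reused.
  await requestQueue.drop();
}
```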
Is requestQueue.drop() the right thing to do here? Can I call the crawlPage() method multiple times in parallel?
Related issues/discussions
I also tried Configuration.getGlobalConfig().getStorageClient()?.purge?.(), but I get a file-not-found exception.
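An alternative to purging is to avoid sharing state in the first place by giving each run its own named queue. The helper below is a hypothetical illustration (only the name generation itself is shown; passing the name to Crawlee is noted in the comment):

```typescript
import { randomUUID } from 'node:crypto';

// Hypothetical helper: generate a unique queue name per run, so each
// crawl can open its own isolated queue instead of the shared default,
// e.g. `await RequestQueue.open(uniqueQueueName())` in crawlee.
function uniqueQueueName(prefix = 'crawl'): string {
  return `${prefix}-${randomUUID()}`;
}
```

Each name is unique, so two concurrent crawlPage() calls would never touch the same queue, which also addresses the parallelism question.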