Replies: 2 comments
-
Yeah, plus one! Is there really no way to do this? The crawler defaults to storing its output in storage provided by Crawlee, writing it to disk, but ideally the crawler would just return the data from the handler function and leave the developer the choice of how and where to persist it.
-
You just need to collect the results into a variable yourself. The crawler cannot simply return the data, because it does not hold arbitrary results in memory.

```js
import { PlaywrightCrawler, Configuration } from 'crawlee';

// disable writing to disk
Configuration.getGlobalConfig().set('persistStorage', false);

let result;
const playwrightCrawler = new PlaywrightCrawler({
    // some options,
    async requestHandler({ request, page, log, parseWithCheerio }) {
        result = getData(); // getData() stands in for your own extraction logic
    },
});
await playwrightCrawler.run([{ url: 'https://example.com' }]);
return result; // or resolve(result)
```
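Since the handler above runs once per request, a single `result` variable would be overwritten when crawling several URLs. Here is a crawlee-free sketch of the same pattern using an array instead; the `crawl` helper is hypothetical and merely stands in for `playwrightCrawler.run()`, which in reality drives a browser and passes `page` etc. to the handler.

```javascript
// Hypothetical stand-in for crawler.run(): invokes the request handler
// once per URL, the way a crawler does for each enqueued request.
async function crawl(urls, requestHandler) {
  for (const url of urls) {
    await requestHandler({ request: { url } });
  }
}

// Collect one result object per crawled URL in memory,
// instead of letting the crawler persist anything to disk.
async function collectResults(urls) {
  const results = [];
  await crawl(urls, async ({ request }) => {
    // in a real handler this would be data parsed from the page
    results.push({ url: request.url, title: `title of ${request.url}` });
  });
  return results; // caller decides how and where to persist
}
```

The same idea works unchanged with the real `PlaywrightCrawler`: declare the array before constructing the crawler, push into it from `requestHandler`, and read it after `run()` resolves.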
-
I have an application that does the following:
I've read the docs, and there is an example of how to parse a single URL, but no way to parse a single URL with the help of Puppeteer/Playwright. It would be very nice if it were possible to parse a single URL, for example like this: