
feat: adaptive playwright crawler #2316

Merged
merged 29 commits into from Feb 21, 2024

Conversation

@janbuchar (Contributor) commented Jan 31, 2024

This uses the newly added restricted crawling contexts to execute request handlers. It allows us to compare browser and HTTP-only request handler runs for a request and to switch to HTTP-only crawling on sites that we predict to be static. More information will be added here.

The intended usage is as follows:

import { AdaptivePlaywrightCrawler } from 'crawlee';

const startUrls = [{url: 'https://warehouse-theme-metal.myshopify.com/collections', label: 'START'}];

const crawler = new AdaptivePlaywrightCrawler({
    requestHandler: async ({ request, enqueueLinks, pushData, querySelector }) => {
        console.log(`Processing: ${request.url} (${request.label})`);

        if (request.label === 'DETAIL') {
            const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
            const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'

            const title = (await querySelector('.product-meta h1')).text();
            const sku = (await querySelector('span.product-meta__sku-number')).text();

            const $prices = await querySelector('span.price');
            const currentPriceString = $prices.filter(':contains("$")').first().text();

            const rawPrice = currentPriceString.split('$')[1];
            const price = Number(rawPrice.replaceAll(',', ''));

            const inStockElements = await querySelector('span.product-form__inventory');
            const inStock = inStockElements.filter(':contains("In stock")').length > 0;

            const results = {
                url: request.url,
                manufacturer,
                title,
                sku,
                currentPrice: price,
                availableInStock: inStock,
            };

            await pushData(results);
        } else if (request.label === 'CATEGORY') {
            await enqueueLinks({
                selector: '.product-item > a',
                label: 'DETAIL', // <= note the different label
            });

            await enqueueLinks({
                selector: 'a.pagination__next',
                label: 'CATEGORY', // <= note the same label
            });
        } else if (request.label === 'START') {
            await enqueueLinks({
                selector: '.collection-block-item',
                label: 'CATEGORY',
            });
        }
    },
    renderingTypeDetectionRatio: 0.1,
    maxRequestsPerCrawl: 100,
    maxRequestRetries: 0,
    minConcurrency: 1,
    maxConcurrency: 1,
    headless: true,
});

await crawler.run(startUrls);

When handling a request from the queue, the crawler

  1. tries to predict the rendering type (static / client-only) based on the URL, label, and potentially other criteria, using a logistic regression model that is updated on the fly
  2. for static pages, performs an HTTP-only scrape, and the request handler works with a Cheerio-based portadom instance
  3. for client-only pages, performs a Playwright scrape, and the request handler receives a portadom instance that uses Playwright locators (hence it implicitly waits for content to appear)
  4. for a configurable percentage of requests (and also when we are not confident about the prediction), performs a detection: both an HTTP-only and a Playwright scrape are done and the results are compared. If (and only if) the HTTP-only scrape behaves the same, we conclude the page is static and update the logistic regression model.
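The on-the-fly prediction in step 1 can be sketched as a small online logistic regression. The following is a minimal illustration, not the actual Crawlee implementation: the feature choice (label plus URL path depth), the learning rate, and all names are assumptions made purely for the example.

```typescript
// Sketch of an online rendering-type predictor: a logistic regression over
// simple URL/label features, updated with one SGD step per detection result.
type Sample = { features: number[]; isStatic: boolean };

class RenderingTypePredictor {
    private weights: number[];
    private bias = 0;

    constructor(private dimensions: number, private learningRate = 0.1) {
        this.weights = new Array(dimensions).fill(0);
    }

    // Probability that the page is static (sigmoid of the linear score).
    predict(features: number[]): number {
        const score = features.reduce((acc, x, i) => acc + x * this.weights[i], this.bias);
        return 1 / (1 + Math.exp(-score));
    }

    // One SGD step on the logistic loss, called after each detection run.
    update({ features, isStatic }: Sample): void {
        const error = (isStatic ? 1 : 0) - this.predict(features);
        for (let i = 0; i < this.dimensions; i++) {
            this.weights[i] += this.learningRate * error * features[i];
        }
        this.bias += this.learningRate * error;
    }
}

// Toy featurization (illustrative only): [has "DETAIL" label, URL path depth].
const featurize = (url: string, label?: string): number[] => [
    label === 'DETAIL' ? 1 : 0,
    new URL(url).pathname.split('/').filter(Boolean).length,
];

// Feed in detection results as they arrive; predictions sharpen over time.
const predictor = new RenderingTypePredictor(2);
for (let i = 0; i < 200; i++) {
    predictor.update({ features: featurize('https://example.com/products/a', 'DETAIL'), isStatic: true });
    predictor.update({ features: featurize('https://example.com/app', 'CATEGORY'), isStatic: false });
}
```

A crawler could then run the cheap HTTP-only path whenever `predict(...)` exceeds a confidence threshold, falling back to full detection otherwise.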

@github-actions github-actions bot added this to the 82nd sprint - Tooling team milestone Feb 6, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Feb 6, 2024
@github-actions bot commented

⚠️ Pull Request Toolkit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!


@B4nan B4nan added the adhoc Ad-hoc unplanned task added during the sprint. label Feb 9, 2024
@github-actions github-actions bot added the tested Temporary label used only programmatically for some analytics. label Feb 12, 2024
@janbuchar janbuchar marked this pull request as ready for review February 19, 2024 12:28
@janbuchar janbuchar requested a review from B4nan February 19, 2024 12:29
@barjin (Contributor) left a comment
Damn, I haven't checked this in a while... good job!

Given we mark this as experimental, I have no problem merging and releasing it, and maybe having the marketing team do some promo for this so we get some real-life feedback.

I'm just still a bit unsure about portadom - IMO it's good for the experiment right now, but a package that's been stale for six months, with no dependents other than us... it just scares me a bit 😄

@B4nan (Member) commented Feb 19, 2024

> I'm just still a bit unsure about the portadom - IMO it's good for the experiment right now, but 6 month stale package with no other dependents than us... just scares me a bit 😄

Same

@janbuchar (Contributor, Author) commented
@barjin @B4nan agreed, do you think that the querySelector function added in the last couple of commits is good enough? Would you do something else? Extend it somehow?

@barjin (Contributor) commented Feb 19, 2024

(I'm sure there was some reason, but it was too long ago, sorry) - why don't we use parseWithCheerio? imo the cheerio / jQuery syntax is better known among devs + we have this in all our crawlers already... We're not doing anything with the calls / returned values from this function right now, right?

@janbuchar (Contributor, Author) commented Feb 19, 2024

> (I'm sure there was some reason, but it was too long ago, sorry) - why don't we use parseWithCheerio? imo the cheerio / jQuery syntax is better known among devs + we have this in all our crawlers already... We're not doing anything with the calls / returned values from this function right now, right?

querySelector actually returns a Cheerio object. The difference from parseWithCheerio is that it also waits for the selector to appear, which is necessary if we want to use the same request handler for browser-based and HTTP-only scraping... which is the entire point of this.
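The "waits for the selector to appear" semantics described above can be sketched as a polling loop. This is not the Crawlee implementation: instead of Cheerio or Playwright, it polls an abstract snapshot() callback so the example stays self-contained. In the real crawler, the browser variant would delegate to Playwright locators, while the HTTP-only variant resolves immediately against the already-parsed DOM.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Resolves with the first non-undefined result of snapshot(), or throws after
// timeoutMs - mimicking how a waiting querySelector behaves on client-rendered pages.
async function waitForSelector<T>(
    snapshot: () => T | undefined, // returns a match, or undefined if not present yet
    timeoutMs = 5_000,
    pollIntervalMs = 50,
): Promise<T> {
    const deadline = Date.now() + timeoutMs;
    for (;;) {
        const match = snapshot();
        if (match !== undefined) return match;
        if (Date.now() >= deadline) throw new Error('Timed out waiting for selector');
        await sleep(pollIntervalMs);
    }
}

// Usage: content that appears only after a delay, as with client-side rendering.
let pageTitle: string | undefined;
setTimeout(() => { pageTitle = 'Sennheiser MKE 440'; }, 100);

waitForSelector(() => pageTitle).then((title) => {
    console.log(title); // logs once the "element" has appeared
});
```

With a static page, snapshot() succeeds on the first poll and the call resolves immediately, which is why the same handler can serve both crawling modes.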

@barjin (Contributor) commented Feb 19, 2024

A-ha, I see now... clever! It's a bit of a shame that there is a new API now (I thought we could switch between crawlers just by changing the constructor name), but given the circumstances, this is IMO a good compromise.

@B4nan (Member) left a comment

I've got just a few nits and documentation notes, plus:

  • let's remove portadom now, since it's not tightly coupled to the implementation and we have a simple but good enough alternative
  • we might want to add querySelector to all the crawlers, just like we have parseWithCheerio (this can happen later, once we see this is the right direction)
  • as you mentioned yourself, the JSDoc for the adaptive crawler needs some work before we merge
  • I ran the e2e tests and they are passing; it would be nice to have one for adaptive crawling too

@B4nan (Member) commented Feb 20, 2024

btw can you fix the lock file so the tests can run here?

@B4nan (Member) commented Feb 21, 2024

so did you modify the lock file by hand or what was it about? :D

@janbuchar (Contributor, Author) commented

> so did you modify the lock file by hand or what was it about? :D

rebased on master, dropped the original yarn-updating commit, ran yarn 🤷

@janbuchar janbuchar merged commit 8e4218a into master Feb 21, 2024
8 checks passed
@janbuchar janbuchar deleted the adaptive-crawling-2 branch February 21, 2024 10:10
Labels
adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics.