
feat: adaptive playwright crawler #2316

Merged
merged 29 commits into from Feb 21, 2024

Conversation

@janbuchar (Contributor) commented Jan 31, 2024

This uses the newly added restricted crawling contexts to execute request handlers. It allows us to compare browser and HTTP-only request handler runs for a request and to switch to HTTP-only crawling on sites that we predict to be static. More information will be added here.

The intended usage is as follows:

import { AdaptivePlaywrightCrawler } from 'crawlee';

const startUrls = [{url: 'https://warehouse-theme-metal.myshopify.com/collections', label: 'START'}];

const crawler = new AdaptivePlaywrightCrawler({
    requestHandler: async ({ request, enqueueLinks, pushData, querySelector }) => {
        console.log(`Processing: ${request.url} (${request.label})`);

        if (request.label === 'DETAIL') {
            const urlPart = request.url.split('/').slice(-1); // ['sennheiser-mke-440-professional-stereo-shotgun-microphone-mke-440']
            const manufacturer = urlPart[0].split('-')[0]; // 'sennheiser'

            const title = (await querySelector('.product-meta h1')).text();
            const sku = (await querySelector('span.product-meta__sku-number')).text();

            const $prices = await querySelector('span.price');
            const currentPriceString = $prices.filter(':contains("$")').first().text();

            const rawPrice = currentPriceString.split('$')[1];
            const price = Number(rawPrice.replaceAll(',', ''));

            const inStockElements = await querySelector('span.product-form__inventory');
            const inStock = inStockElements.filter(':contains("In stock")').length > 0;

            const results = {
                url: request.url,
                manufacturer,
                title,
                sku,
                currentPrice: price,
                availableInStock: inStock,
            };

            await pushData(results);
        } else if (request.label === 'CATEGORY') {
            await enqueueLinks({
                selector: '.product-item > a',
                label: 'DETAIL', // <= note the different label
            });

            await enqueueLinks({
                selector: 'a.pagination__next',
                label: 'CATEGORY', // <= note the same label
            });
        } else if (request.label === 'START') {
            await enqueueLinks({
                selector: '.collection-block-item',
                label: 'CATEGORY',
            });
        }
    },
    renderingTypeDetectionRatio: 0.1,
    maxRequestsPerCrawl: 100,
    maxRequestRetries: 0,
    minConcurrency: 1,
    maxConcurrency: 1,
    headless: true,
});

await crawler.run(startUrls);

When handling a request from the queue, the crawler

  1. tries to predict the rendering type (static / client-only) based on the URL, label, and potentially other criteria, using a logistic regression model that is updated on the fly
  2. for static pages, performs an HTTP-only scrape, and the request handler works with a Cheerio-based portadom instance
  3. for client-only pages, performs a Playwright scrape, and the request handler receives a portadom instance that uses Playwright locators (hence it implicitly waits for content to appear)
  4. for a configurable percentage of requests (and also when we are not confident about the prediction), performs a detection: both an HTTP-only and a Playwright scrape are done and the results are compared. If (and only if) the HTTP-only scrape behaves the same, we conclude the page is static and update the logistic regression model.
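The on-the-fly prediction in step 1 can be sketched as a small online logistic regression. The following is a minimal illustration, not the actual Crawlee implementation: the feature choice (label plus URL path depth), the learning rate, and all names are assumptions made purely for the example.

```typescript
// Sketch of an online rendering-type predictor: a logistic regression over
// simple URL/label features, updated with one SGD step per detection result.
type Sample = { features: number[]; isStatic: boolean };

class RenderingTypePredictor {
    private weights: number[];
    private bias = 0;

    constructor(private dimensions: number, private learningRate = 0.1) {
        this.weights = new Array(dimensions).fill(0);
    }

    // Probability that the page is static (sigmoid of the linear score).
    predict(features: number[]): number {
        const score = features.reduce((acc, x, i) => acc + x * this.weights[i], this.bias);
        return 1 / (1 + Math.exp(-score));
    }

    // One SGD step on the logistic loss, called after each detection run.
    update({ features, isStatic }: Sample): void {
        const error = (isStatic ? 1 : 0) - this.predict(features);
        for (let i = 0; i < this.dimensions; i++) {
            this.weights[i] += this.learningRate * error * features[i];
        }
        this.bias += this.learningRate * error;
    }
}

// Toy featurization (illustrative only): [has "DETAIL" label, URL path depth].
const featurize = (url: string, label?: string): number[] => [
    label === 'DETAIL' ? 1 : 0,
    new URL(url).pathname.split('/').filter(Boolean).length,
];

// Feed in detection results as they arrive; predictions sharpen over time.
const predictor = new RenderingTypePredictor(2);
for (let i = 0; i < 200; i++) {
    predictor.update({ features: featurize('https://example.com/products/a', 'DETAIL'), isStatic: true });
    predictor.update({ features: featurize('https://example.com/app', 'CATEGORY'), isStatic: false });
}
```

A crawler could then run the cheap HTTP-only path whenever `predict(...)` exceeds a confidence threshold, falling back to full detection otherwise.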

@github-actions github-actions bot added this to the 82nd sprint - Tooling team milestone Feb 6, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Feb 6, 2024
@github-actions bot commented

⚠️ Pull Request Toolkit has failed!

Pull request is neither linked to an issue or epic nor labeled as adhoc!


@B4nan B4nan added the adhoc Ad-hoc unplanned task added during the sprint. label Feb 9, 2024
@github-actions github-actions bot added the tested Temporary label used only programmatically for some analytics. label Feb 12, 2024
@janbuchar janbuchar marked this pull request as ready for review February 19, 2024 12:28
@janbuchar janbuchar requested a review from B4nan February 19, 2024 12:29
@barjin (Contributor) left a comment
Damn, I haven't checked this in a while... good job!

Given we mark this as experimental, I have no problem merging and releasing it, and maybe having the marketing team do some promo for this so we get some real-life feedback.

I'm just still a bit unsure about portadom - IMO it's good for the experiment right now, but a package that's been stale for six months, with no dependents other than us... it just scares me a bit 😄

@B4nan (Member) commented Feb 19, 2024

> I'm just still a bit unsure about the portadom - IMO it's good for the experiment right now, but 6 month stale package with no other dependents than us... just scares me a bit 😄

Same

@janbuchar (Contributor, Author) commented
@barjin @B4nan agreed, do you think that the querySelector function added in the last couple of commits is good enough? Would you do something else? Extend it somehow?

@barjin (Contributor) commented Feb 19, 2024

(I'm sure there was some reason, but it was too long ago, sorry) - why don't we use parseWithCheerio? imo the cheerio / jQuery syntax is better known among devs + we have this in all our crawlers already... We're not doing anything with the calls / returned values from this function right now, right?

@janbuchar (Contributor, Author) commented Feb 19, 2024

> (I'm sure there was some reason, but it was too long ago, sorry) - why don't we use parseWithCheerio? imo the cheerio / jQuery syntax is better known among devs + we have this in all our crawlers already... We're not doing anything with the calls / returned values from this function right now, right?

querySelector actually returns a Cheerio object. The difference from parseWithCheerio is that it also waits for the selector to appear, which is necessary if we want to use the same request handler for browser-based and HTTP-only scraping... which is the entire point of this.
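The "waits for the selector to appear" semantics described above can be sketched as a polling loop. This is not the Crawlee implementation: instead of Cheerio or Playwright, it polls an abstract snapshot() callback so the example stays self-contained. In the real crawler, the browser variant would delegate to Playwright locators, while the HTTP-only variant resolves immediately against the already-parsed DOM.

```typescript
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Resolves with the first non-undefined result of snapshot(), or throws after
// timeoutMs - mimicking how a waiting querySelector behaves on client-rendered pages.
async function waitForSelector<T>(
    snapshot: () => T | undefined, // returns a match, or undefined if not present yet
    timeoutMs = 5_000,
    pollIntervalMs = 50,
): Promise<T> {
    const deadline = Date.now() + timeoutMs;
    for (;;) {
        const match = snapshot();
        if (match !== undefined) return match;
        if (Date.now() >= deadline) throw new Error('Timed out waiting for selector');
        await sleep(pollIntervalMs);
    }
}

// Usage: content that appears only after a delay, as with client-side rendering.
let pageTitle: string | undefined;
setTimeout(() => { pageTitle = 'Sennheiser MKE 440'; }, 100);

waitForSelector(() => pageTitle).then((title) => {
    console.log(title); // logs once the "element" has appeared
});
```

With a static page, snapshot() succeeds on the first poll and the call resolves immediately, which is why the same handler can serve both crawling modes.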

@barjin (Contributor) commented Feb 19, 2024

A-ha, I see now... clever! It's a bit of a shame that there is a new API now (I thought we could switch between crawlers just by changing the constructor name), but given the circumstances, this is IMO a good compromise.

@B4nan (Member) left a comment

I've got just a few nits and documentation notes, plus:

  • let's remove portadom now, since it's not tightly coupled to the implementation and we have a simple but good enough alternative
  • we might want to add querySelector to all the crawlers, just like we have parseWithCheerio (this can happen later, once we see this is the right direction)
  • as you mentioned yourself, the JSDoc for the adaptive crawler needs some work before we merge
  • I ran the e2e tests and they are passing; it would be nice to have one for adaptive crawling too

@B4nan (Member) commented Feb 20, 2024

btw can you fix the lock file so the tests can run here?

@B4nan (Member) commented Feb 21, 2024

so did you modify the lock file by hand or what was it about? :D

@janbuchar (Contributor, Author) commented

> so did you modify the lock file by hand or what was it about? :D

rebased on master, dropped the original yarn-updating commit, ran yarn 🤷

@janbuchar janbuchar merged commit 8e4218a into master Feb 21, 2024
8 checks passed
@janbuchar janbuchar deleted the adaptive-crawling-2 branch February 21, 2024 10:10
Labels
adhoc Ad-hoc unplanned task added during the sprint. t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programmatically for some analytics.