
Playwright Headless Crawler Crashes After Multiple Successive Runs #2031

Closed

obiknows opened this issue Aug 11, 2023 · 9 comments
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@obiknows

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

When running a Crawlee PlaywrightCrawler behind a task queue, after about two days of successive runs the crawler begins failing requests with the following error (see screenshot).

[Screenshot: failing request error output, 2023-08-11 09:46:49]

Code sample

No response

Package version

3.4.2

Node.js version

16.18.1

Operating system

No response

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

This is the text output from the server on crash:

WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. browserType.launch: Browser closed.
==================== Browser output: ====================
<launching> /root/.cache/ms-playwright/chromium-1071/chrome-linux/chrome --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --proxy-server=http://127.0.0.1:35791/ --proxy-bypass-list=<-loopback> --user-agent=Mozilla/5.0 (iPhone14,6; U; CPU iPhone OS 15_4 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/19E241 Safari/602.1 --disable-blink-features=AutomationControlled --user-data-dir=/tmp/playwright_chromiumdev_profile-j1WoRf --remote-debugging-pipe --no-startup-window
<launched> pid=27551
[pid=27551][err] [0811/134940.797141:FATAL:zygote_host_impl_linux.cc(184)] Check failed: process.IsValid(). Failed to launch zygote process
[pid=27551][err] #0 0x5597d5370302 base::debug::CollectStackTrace()
[pid=27551][err] #1 0x5597d535d753 base::debug::StackTrace::StackTrace()
[pid=27551][err] #2 0x5597d52b1c54 logging::LogMessage::~LogMessage()
[pid=27551][err] #3 0x5597d52b270e logging::LogMessage::~LogMessage()
[pid=27551][err] #4 0x5597d529c497 logging::CheckError::~CheckError()
[pid=27551][err] #5 0x5597d3a2d627 content::ZygoteHostImpl::LaunchZygote()
[pid=27551][err] #6 0x5597d4767faf content::(anonymous namespace)::LaunchZygoteHelper()
[pid=27551][err] #7 0x5597d313b55f content::ZygoteCommunication::Init()
[pid=27551][err] #8 0x5597d22ae562 content::CreateUnsandboxedZygote()
[pid=27551][err] #9 0x5597d476726c content::ContentMainRunnerImpl::Initialize()
[pid=27551][err] #10 0x5597d4764b66 content::RunContentProcess()
[pid=27551][err] #11 0x5597d4764fbd content::ContentMain()
[pid=27551][err] #12 0x5597d4dd5ffe headless::HeadlessShellMain()
[pid=27551][err] #13 0x5597d0f2d265 ChromeMain
[pid=27551][err] #14 0x7f6daffe6d90 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x29d8f)
[pid=27551][err] #15 0x7f6daffe6e40 __libc_start_main
[pid=27551][err] #16 0x5597d0f2d02a _start
[pid=27551][err]
=========================== logs ===========================
(identical launch command and stack trace repeated)
============================================================
@obiknows obiknows added the bug Something isn't working. label Aug 11, 2023
@gippy gippy added the t-tooling Issues with this label are in the ownership of the tooling team. label Aug 14, 2023
@barjin
Contributor

barjin commented Sep 11, 2023

Hello @obiknows and thank you for submitting this issue!

The issue you are describing sounds like some sort of memory leak (Playwright / Puppeteer behaves similarly when low on memory, see the identical issue here).

Can you please share more details about your solution (ideally the whole project as a GitHub repo)? It's possible that you are leaking memory somewhere in the task queue you mentioned. I briefly checked different scenarios with Crawlee and in none of them was I able to reproduce any sort of memory leak.

Thank you!

@obiknows
Author

Thanks for the included link and the response, @barjin.

Here's the server endpoint that calls the Crawlee crawler.

Server Code:

app.post('/v1/scrape/instagram/:collectiveId', async (req: Request, res: Response) => {
    // get collectiveId from request
    const { collectiveId } = req.params;
    const { instagramURL } = req.body;

    // ensure that collectiveId is not blank
    if (collectiveId === undefined) {
        return res.status(400).json({
            success: false,
            message: "collectiveId is required."
        });
    }
    // ensure that instagramURL is not blank
    if (instagramURL === undefined) {
        return res.status(400).json({
            success: false,
            message: "instagramURL is required."
        });
    }

    // run the IG crawler
    if (IGCrawler.running) {
        await IGCrawler.addRequests([{ url: instagramURL, userData: { collectiveId }}], { waitForAllRequestsToBeAdded: true })
    } else {
        await IGCrawler.run([{ url: instagramURL, userData: { collectiveId }}])
    }

    return res.json({
        success: true,
        message: `Successfully Started Instagram Scrape for ${instagramURL} [${collectiveId}].`
    });
});

Our requests are created and serialized by a BullMQ task server, so only one instance of the crawler is ever running at any given time; we never reach IGCrawler.addRequests(). I would like to run multiple instances, but that proved too difficult.

Only IGCrawler.run() is ever triggered, and this is where we see the memory leak after about two days of running.
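
For context, the BullMQ side looks roughly like this (a simplified, hypothetical sketch rather than our exact code; the queue name, payload shape, and Redis connection details are made up):

import { Worker } from 'bullmq';

// Hypothetical sketch: a single-concurrency worker, so only one scrape job
// (and therefore one crawler run) is processed at a time.
const worker = new Worker(
    'instagram-scrapes',
    async (job) => {
        const { collectiveId, instagramURL } = job.data;
        // calls the Express endpoint above, which triggers IGCrawler.run()
        await API.post(`/v1/scrape/instagram/${collectiveId}`, { instagramURL });
    },
    { connection: { host: 'localhost', port: 6379 }, concurrency: 1 },
);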

Crawler Code:

import { PlaywrightCrawler, ProxyConfiguration, purgeDefaultStorages } from 'crawlee';

// IGCrawler: Crawls the Instagram profile page and extracts all links to Instagram posts
export const IGCrawler = new PlaywrightCrawler({
    proxyConfiguration: new ProxyConfiguration({ 
        proxyUrls: [ process.env.PROXY_URL ]  
    }),
    launchContext: {
        useIncognitoPages: true,
        // set userAgent to emulate mobile device (iPhone SE)
        userAgent: "Mozilla/5.0 (iPhone14,6; U; CPU iPhone OS 15_4 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/19E241 Safari/602.1"
    },
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({  request,  response, page, enqueueLinks, log, proxyInfo, sendRequest  }) {
        // 0. setup
        const userData = request.userData
        const collectiveId = userData.collectiveId;
        const url = page.url();

        // Wait for the element with id="splash-screen" to be present in the DOM
        await page.waitForSelector('#splash-screen');

        // Wait for the element with id="splash-screen" to have style "display:none"
        await page.waitForFunction(() => {
            const splashScreen = document.getElementById('splash-screen');
            if (!splashScreen) {
                console.log("Splash screen not found.");
                return false;
            }
            return window.getComputedStyle(splashScreen).getPropertyValue('display') === 'none';
        }, { timeout: 10000, polling: 1000 });
        console.log("Splash screen should now be hidden.");

        // wait for 1 second
        await page.waitForTimeout(1000);
        
        // Now that the splash-screen has disappeared, you can perform other actions or crawls here
        // For example, you can find all links on the page:

        // INSTAGRAM PROFILE PAGE
        log.info(`URL for IG page: ${url}`);

        // 1. extract data (find all links to IG posts)
        // Get all the anchor elements on the page
        const links = await page.$$('a');
        // console.log("Total Link Count (empty + valid):", links.length);
        // get each href attribute from the links
        let linkHrefs: string[] = []
        await Promise.all(links.map(async (link) => {
            const href = await link.getAttribute('href');
            // filter out null values
            if (href !== null) {
                linkHrefs.push(href);
            }
        }));

        // filter out links that don't point to IG posts
        let IGPostIds: string[] = [];
        const IGPostLinks = linkHrefs.filter((href) => {
            return href.startsWith('/p/');
        });
        IGPostLinks.forEach((link) => {
            const postId = link.split('/')[2];
            IGPostIds.push(postId);
        });

        // wait another second
        await page.waitForTimeout(1000);
        
        // if posts are found call core API to create IG posts
        if (IGPostIds.length > 0) {
            await API.post(`/v1/scrape/${collectiveId}/instagram`, {
                postIds: IGPostIds
            });
        } else {
            console.log("No IG Post Ids to send to core API. Scraper couldn't find any posts.");
        }

        // wait another second, then purge the default storages
        await page.waitForTimeout(1000);
        await purgeDefaultStorages();
    },
    // Crawler Options
    maxRequestRetries: 5,
    maxConcurrency: 1,
});

This is the code for the crawler. It scrapes an Instagram profile page, gets the post IDs, and sends them to our core API service for processing. This pipeline works for us; we are just seeing the memory leak issue after about two days.

From the Stack Overflow link you sent, @barjin, I will try to find the browser instance and explicitly call browser.close() as they did in their example, after our await purgeDefaultStorages() call.

I assume I can get the browser instance from the destructured params of the crawler's requestHandler, but I've yet to confirm this is possible.
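
A minimal sketch of that idea (assuming the browserController exposed on the crawling context is the right handle to close; unverified at this point in the thread):

import { PlaywrightCrawler, purgeDefaultStorages } from 'crawlee';

// Sketch only: close the browser explicitly once a request has been handled.
// Assumes browserController (exposed on the crawling context) can be closed
// here without breaking retries of subsequent requests.
const crawler = new PlaywrightCrawler({
    maxConcurrency: 1,
    async requestHandler({ page, browserController, log }) {
        log.info(`Processing ${page.url()}`);
        // ... scraping logic from the IGCrawler handler above ...
        await purgeDefaultStorages();
        await browserController.close();
    },
});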

@B4nan
Member

B4nan commented Sep 11, 2023

FYI the purgeDefaultStorages call most probably won't do anything, as it is executed only once per context, and it is already fired via the Actor.init() call (and in several other places; generally speaking, the first async method that touches a storage will trigger it, so this holds even outside the Apify platform, even without the Actor SDK in use). We just changed this behavior in the next release, so explicit calls will actually purge the storages. Be sure to update to crawlee 3.5.4, it will be out in a few minutes, I just started the release process.

edit: it's out https://github.com/apify/crawlee/releases/tag/v3.5.4
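
For illustration, an explicit purge between runs on 3.5.4+ might look like this (a sketch; it reuses the IGCrawler and request variables from the server code above):

import { purgeDefaultStorages } from 'crawlee';

// On crawlee >= 3.5.4, each explicit call actually purges the default
// storages instead of being a no-op after the first invocation.
await IGCrawler.run([{ url: instagramURL, userData: { collectiveId } }]);
await purgeDefaultStorages();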

@obiknows
Author

> so the explicit calls will actually purge the storages. So be sure to update to crawlee 3.5.4, it will be out in a few minutes, I just started the release process.

thanks for the heads up @B4nan, I will definitely update to 3.5.4 across our stack right now.

And yeah, I noticed it didn't solve our issue, because it's not explicitly a disk-related issue, more of a memory problem, but I left it in there just in case.

@barjin
Contributor

barjin commented Sep 12, 2023

> I would like to run multiple instances, but that proved too difficult.

Actually, we fixed that one or two weeks ago: you can now run multiple crawler instances in one process by instantiating them with separate Configuration instances, something like this:

const a = new PlaywrightCrawler({
      ...
   },
   new Configuration({ persistStorage: false }),
);

const b = new PlaywrightCrawler({
      ...
   },
   new Configuration({ persistStorage: false }),
);

// `a` and `b` can now run simultaneously, without affecting each other
a.run([...]);
b.run([...]);

The persistStorage: false option is there so the crawlers don't write their data to disk. If you want the storage to be persistent, you can instead pass localDataDirectory: [folder name] with a different folder name for each crawler.

This way, you can always instantiate a new crawler instance on a new express request and even run multiple crawlers at the same time.
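
If persistence is needed, the same pattern with a per-crawler folder might look roughly like this (a sketch; it assumes localDataDirectory is accepted directly by Configuration, as described above, and the folder names and handlers are placeholders):

import { PlaywrightCrawler, Configuration } from 'crawlee';

// Assumption: each crawler gets its own storage folder so their data
// doesn't collide; the request handlers are minimal placeholders.
const a = new PlaywrightCrawler(
    { requestHandler: async ({ page, log }) => log.info(await page.title()) },
    new Configuration({ localDataDirectory: './storage/crawler-a' }),
);

const b = new PlaywrightCrawler(
    { requestHandler: async ({ page, log }) => log.info(await page.title()) },
    new Configuration({ localDataDirectory: './storage/crawler-b' }),
);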


After looking into this more closely, instantiating a new crawler (with persistStorage: false) in every express request might actually fix the memory leak as well. Running one crawler over and over again causes the SDK_SESSION_POOL_STATE.json to grow infinitely (it never gets purged).

I did a little experiment with two CheerioCrawler instances: in one case, I was running the same crawler instance repeatedly; in the other, I always instantiated a new crawler instance for each run. The code and the memory utilization graphs are below.
const c = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        console.log(request.url);
    },
});

setInterval(() => {
    c.run([`https://jindrich.bar/${Math.random().toString(36).substring(7)}`]);
}, 1000);

[Memory utilization graph: same crawler instance reused]

setInterval(() => {
    const c = new CheerioCrawler({
        requestHandler: async ({ request }) => {
            console.log(request.url);
        },
    }, new Configuration({
        persistStorage: false,
    }));

    c.run([`https://jindrich.bar/${Math.random().toString(36).substring(7)}`]);
}, 1000);

[Memory utilization graph: new crawler instance per run]

I'll create a separate issue for the growing SDK_SESSION_POOL_STATE KVS record, but in the meantime, you can try this trick with a new crawler for each request; make sure to pass the persistStorage: false option as well. And definitely let us know how it went :)
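
Applied to the Express route shared earlier in the thread, the per-request pattern might look roughly like this (a sketch; igRequestHandler is a hypothetical name standing in for the requestHandler shown above):

import { PlaywrightCrawler, Configuration } from 'crawlee';
import type { Request, Response } from 'express';

app.post('/v1/scrape/instagram/:collectiveId', async (req: Request, res: Response) => {
    const { collectiveId } = req.params;
    const { instagramURL } = req.body;

    // A fresh crawler per request; persistStorage: false keeps it from
    // writing (and endlessly growing) on-disk state such as
    // SDK_SESSION_POOL_STATE.json.
    const crawler = new PlaywrightCrawler(
        {
            maxConcurrency: 1,
            requestHandler: igRequestHandler, // hypothetical: the handler shown earlier
        },
        new Configuration({ persistStorage: false }),
    );

    await crawler.run([{ url: instagramURL, userData: { collectiveId } }]);

    return res.json({ success: true, message: `Scraped ${instagramURL} [${collectiveId}].` });
});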

@obiknows
Author

wow, thank you @barjin! yeah, I will look into running multiple instances; that would be the ideal case for us.

also, yeah, calling await browserController.close() did not work, so I will test out the persistStorage flag now.

barjin added a commit that referenced this issue Sep 20, 2023
Fixes #2074
Related #2031

---------

Co-authored-by: Martin Adámek <banan23@gmail.com>
@barjin
Contributor

barjin commented Sep 25, 2023

hi @obiknows, can you please let us know if you are still experiencing the issue? cheers! :)

@obiknows
Author

obiknows commented Sep 25, 2023

Hey @barjin, this seems to have solved the issue. I think we're now just running into a separate memory issue, but in the 5 days since implementing this, our servers have been able to restart if they reach an out-of-memory condition.

I believe this fixed the Crawlee side of things. Thanks a bunch, and thanks to you too, @B4nan.

Much appreciated 👌🏿

@barjin
Contributor

barjin commented Sep 25, 2023

Cheers, we're glad we could help. I'll close this issue now, but feel free to let us know in case of any additional questions.

Thanks!
