
Playwright Headless Crawler Crashes After Multiple Successive Runs #2031

Closed

obiknows opened this issue Aug 11, 2023 · 9 comments
Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

Comments

@obiknows

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/playwright (PlaywrightCrawler)

Issue description

When running a Crawlee PlaywrightCrawler behind a task queue, after about two days of successive runs the crawler begins failing requests with the following error (see screenshot).

[Screenshot: failing request error output, 2023-08-11 09:46:49]

Code sample

No response

Package version

3.4.2

Node.js version

16.18.1

Operating system

No response

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

No response

Other context

This is the text output from the server on crash:

WARN  PlaywrightCrawler: Reclaiming failed request back to the list or queue. browserType.launch: Browser closed.
==================== Browser output: ====================
<launching> /root/.cache/ms-playwright/chromium-1071/chrome-linux/chrome --disable-field-trial-config --disable-background-networking --enable-features=NetworkService,NetworkServiceInProcess --disable-background-timer-throttling --disable-backgrounding-occluded-windows --disable-back-forward-cache --disable-breakpad --disable-client-side-phishing-detection --disable-component-extensions-with-background-pages --disable-component-update --no-default-browser-check --disable-default-apps --disable-dev-shm-usage --disable-extensions --disable-features=ImprovedCookieControls,LazyFrameLoading,GlobalMediaControls,DestroyProfileOnBrowserClose,MediaRouter,DialMediaRouteProvider,AcceptCHFrame,AutoExpandDetailsElement,CertificateTransparencyComponentUpdater,AvoidUnnecessaryBeforeUnloadCheckSync,Translate --allow-pre-commit-input --disable-hang-monitor --disable-ipc-flooding-protection --disable-popup-blocking --disable-prompt-on-repost --disable-renderer-backgrounding --force-color-profile=srgb --metrics-recording-only --no-first-run --enable-automation --password-store=basic --use-mock-keychain --no-service-autorun --export-tagged-pdf --headless --hide-scrollbars --mute-audio --blink-settings=primaryHoverType=2,availableHoverTypes=2,primaryPointerType=4,availablePointerTypes=4 --no-sandbox --proxy-server=http://127.0.0.1:35791/ --proxy-bypass-list=<-loopback> --user-agent=Mozilla/5.0 (iPhone14,6; U; CPU iPhone OS 15_4 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/19E241 Safari/602.1 --disable-blink-features=AutomationControlled --user-data-dir=/tmp/playwright_chromiumdev_profile-j1WoRf --remote-debugging-pipe --no-startup-window
<launched> pid=27551
[pid=27551][err] [0811/134940.797141:FATAL:zygote_host_impl_linux.cc(184)] Check failed: process.IsValid(). Failed to launch zygote process
[pid=27551][err] #0 0x5597d5370302 base::debug::CollectStackTrace()
[pid=27551][err] #1 0x5597d535d753 base::debug::StackTrace::StackTrace()
[pid=27551][err] #2 0x5597d52b1c54 logging::LogMessage::~LogMessage()
[pid=27551][err] #3 0x5597d52b270e logging::LogMessage::~LogMessage()
[pid=27551][err] #4 0x5597d529c497 logging::CheckError::~CheckError()
[pid=27551][err] #5 0x5597d3a2d627 content::ZygoteHostImpl::LaunchZygote()
[pid=27551][err] #6 0x5597d4767faf content::(anonymous namespace)::LaunchZygoteHelper()
[pid=27551][err] #7 0x5597d313b55f content::ZygoteCommunication::Init()
[pid=27551][err] #8 0x5597d22ae562 content::CreateUnsandboxedZygote()
[pid=27551][err] #9 0x5597d476726c content::ContentMainRunnerImpl::Initialize()
[pid=27551][err] #10 0x5597d4764b66 content::RunContentProcess()
[pid=27551][err] #11 0x5597d4764fbd content::ContentMain()
[pid=27551][err] #12 0x5597d4dd5ffe headless::HeadlessShellMain()
[pid=27551][err] #13 0x5597d0f2d265 ChromeMain
[pid=27551][err] #14 0x7f6daffe6d90 (/usr/lib/x86_64-linux-gnu/libc.so.6+0x29d8f)
[pid=27551][err] #15 0x7f6daffe6e40 __libc_start_main
[pid=27551][err] #16 0x5597d0f2d02a _start
[pid=27551][err]
=========================== logs ===========================
(identical launch command and stack trace repeated)
============================================================
@obiknows obiknows added the bug Something isn't working. label Aug 11, 2023
@gippy gippy added the t-tooling Issues with this label are in the ownership of the tooling team. label Aug 14, 2023
@barjin
Contributor

barjin commented Sep 11, 2023

Hello @obiknows and thank you for submitting this issue!

The issue you are describing sounds like some sort of memory leak (Playwright / Puppeteer behaves similarly when low on memory, see the identical issue here).

Can you please share more details about your solution (ideally the whole project as a GitHub repo)? It's possible that you are leaking memory somewhere in the task queue you mentioned. I briefly checked different scenarios with Crawlee and in none of them was I able to reproduce any sort of memory leak.

Thank you!

@obiknows
Author

Thanks for the included link and the response, @barjin.

Here's the server endpoint that calls the Crawlee crawler.

Server Code:

app.post('/v1/scrape/instagram/:collectiveId', async (req: Request, res: Response) => {
    // get collectiveId from request
    const { collectiveId } = req.params;
    const { instagramURL } = req.body;

    // ensure that collectiveId is not blank
    if (collectiveId === undefined) {
        return res.status(400).json({
            success: false,
            message: "collectiveId is required."
        });
    }
    // ensure that instagramURL is not blank
    if (instagramURL === undefined) {
        return res.status(400).json({
            success: false,
            message: "instagramURL is required."
        });
    }

    // run the IG crawler
    if (IGCrawler.running) {
        await IGCrawler.addRequests([{ url: instagramURL, userData: { collectiveId }}], { waitForAllRequestsToBeAdded: true })
    } else {
        await IGCrawler.run([{ url: instagramURL, userData: { collectiveId }}])
    }

    return res.json({
        success: true,
        message: `Successfully Started Instagram Scrape for ${instagramURL} [${collectiveId}].`
    });
});

Our requests are created and serialized by a BullMQ task server, so only one instance of the crawler is ever running at any given time; we never reach IGCrawler.addRequests(). I would like to run multiple instances, but that proved too difficult.

Only IGCrawler.run() is ever triggered, and this is where we see the memory leak after about two days of running.
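
For context, the BullMQ side looks roughly like this (a simplified, hypothetical sketch rather than our exact code; the queue name, payload shape, and Redis connection details are made up):

import { Worker } from 'bullmq';

// Hypothetical sketch: a single-concurrency worker, so only one scrape job
// (and therefore one crawler run) is processed at a time.
const worker = new Worker(
    'instagram-scrapes',
    async (job) => {
        const { collectiveId, instagramURL } = job.data;
        // calls the Express endpoint above, which triggers IGCrawler.run()
        await API.post(`/v1/scrape/instagram/${collectiveId}`, { instagramURL });
    },
    { connection: { host: 'localhost', port: 6379 }, concurrency: 1 },
);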

Crawler Code:

import { PlaywrightCrawler, ProxyConfiguration, purgeDefaultStorages } from 'crawlee';

// IGCrawler: Crawls the Instagram profile page and extracts all links to Instagram posts
export const IGCrawler = new PlaywrightCrawler({
    proxyConfiguration: new ProxyConfiguration({ 
        proxyUrls: [ process.env.PROXY_URL ]  
    }),
    launchContext: {
        useIncognitoPages: true,
        // set userAgent to emulate mobile device (iPhone SE)
        userAgent: "Mozilla/5.0 (iPhone14,6; U; CPU iPhone OS 15_4 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Mobile/19E241 Safari/602.1"
    },
    // Use the requestHandler to process each of the crawled pages.
    async requestHandler({  request,  response, page, enqueueLinks, log, proxyInfo, sendRequest  }) {
        // 0. setup
        const userData = request.userData
        const collectiveId = userData.collectiveId;
        const url = page.url();

        // Wait for the element with id="splash-screen" to be present in the DOM
        await page.waitForSelector('#splash-screen');

        // Wait for the element with id="splash-screen" to have style "display:none"
        await page.waitForFunction(() => {
            const splashScreen = document.getElementById('splash-screen');
            if (!splashScreen) {
                console.log("Splash screen not found.");
                return false;
            }
            return window.getComputedStyle(splashScreen).getPropertyValue('display') === 'none';
        }, { timeout: 10000, polling: 1000 });
        console.log("Splash screen should now be hidden.");

        // wait for 1 second
        await page.waitForTimeout(1000);
        
        // Now that the splash-screen has disappeared, you can perform other actions or crawls here
        // For example, you can find all links on the page:

        // INSTAGRAM PROFILE PAGE
        log.info(`URL for IG page: ${url}`);

        // 1. extract data (find all links to IG posts)
        // Get all the anchor elements on the page
        const links = await page.$$('a');
        // console.log("Total Link Count (empty + valid):", links.length);
        // get each href attribute from the links
        let linkHrefs: string[] = []
        await Promise.all(links.map(async (link) => {
            const href = await link.getAttribute('href');
            // filter out null values
            if (href !== null) {
                linkHrefs.push(href);
            }
        }));

        // filter out links that don't point to IG posts
        let IGPostIds: string[] = [];
        const IGPostLinks = linkHrefs.filter((href) => {
            return href.startsWith('/p/');
        });
        IGPostLinks.forEach((link) => {
            const postId = link.split('/')[2];
            IGPostIds.push(postId);
        });

        // wait another second
        await page.waitForTimeout(1000);
        
        // if posts are found call core API to create IG posts
        if (IGPostIds.length > 0) {
            await API.post(`/v1/scrape/${collectiveId}/instagram`, {
                postIds: IGPostIds
            });
        } else {
            console.log("No IG Post Ids to send to core API. Scraper couldn't find any posts.");
        }

        // wait another second, then purge the default storages
        await page.waitForTimeout(1000);
        await purgeDefaultStorages();
    },
    // Crawler Options
    maxRequestRetries: 5,
    maxConcurrency: 1,
});

This is the code for the crawler. It scrapes an Instagram profile page, gets the post IDs, and sends them to our core API service for processing. This pipeline works for us; we are just seeing the memory leak issue after about two days.

From the Stack Overflow link you sent, @barjin, I will try to find the browser instance and explicitly call browser.close() as they did in their example, after our await purgeDefaultStorages() call.

I assume I can get the browser instance from the destructured params of the crawler's requestHandler, but I've yet to confirm this is possible.
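
A minimal sketch of that idea (assuming the browserController exposed on the crawling context is the right handle to close; unverified at this point in the thread):

import { PlaywrightCrawler, purgeDefaultStorages } from 'crawlee';

// Sketch only: close the browser explicitly once a request has been handled.
// Assumes browserController (exposed on the crawling context) can be closed
// here without breaking retries of subsequent requests.
const crawler = new PlaywrightCrawler({
    maxConcurrency: 1,
    async requestHandler({ page, browserController, log }) {
        log.info(`Processing ${page.url()}`);
        // ... scraping logic from the IGCrawler handler above ...
        await purgeDefaultStorages();
        await browserController.close();
    },
});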

@B4nan
Member

B4nan commented Sep 11, 2023

FYI the purgeDefaultStorages call most probably won't do anything, as it is executed only once per context, and it is already fired via the Actor.init() call (and in several other places; generally speaking, the first async method that touches a storage will trigger it, so this holds even outside the Apify platform, even without the Actor SDK in use). We just changed this behavior in the next release, so explicit calls will actually purge the storages. Be sure to update to crawlee 3.5.4, it will be out in a few minutes, I just started the release process.

edit: it's out https://github.com/apify/crawlee/releases/tag/v3.5.4
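
For illustration, an explicit purge between runs on 3.5.4+ might look like this (a sketch; it reuses the IGCrawler and request variables from the server code above):

import { purgeDefaultStorages } from 'crawlee';

// On crawlee >= 3.5.4, each explicit call actually purges the default
// storages instead of being a no-op after the first invocation.
await IGCrawler.run([{ url: instagramURL, userData: { collectiveId } }]);
await purgeDefaultStorages();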

@obiknows
Author

> so the explicit calls will actually purge the storages. So be sure to update to crawlee 3.5.4, it will be out in a few minutes, I just started the release process.

thanks for the heads up @B4nan, I will definitely update to 3.5.4 across our stack right now.

And yeah, I noticed it didn't solve our issue, because it's not explicitly a disk-related issue, more of a memory problem, but I left it in there just in case.

@barjin
Contributor

barjin commented Sep 12, 2023

> I would like to run multiple instances, but that proved too difficult.

Actually, we fixed that one or two weeks ago: you can now run multiple crawler instances in one process by instantiating them with separate Configuration instances, something like this:

const a = new PlaywrightCrawler({
      ...
   },
   new Configuration({ persistStorage: false }),
);

const b = new PlaywrightCrawler({
      ...
   },
   new Configuration({ persistStorage: false }),
);

// `a` and `b` can now run simultaneously, without affecting each other
a.run([...]);
b.run([...]);

The persistStorage: false option is there so the crawlers don't write their data to disk. If you want the storage to be persistent, you can instead pass localDataDirectory: [folder name] with a different folder name for each crawler.

This way, you can always instantiate a new crawler instance on a new express request and even run multiple crawlers at the same time.
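
If persistence is needed, the same pattern with a per-crawler folder might look roughly like this (a sketch; it assumes localDataDirectory is accepted directly by Configuration, as described above, and the folder names and handlers are placeholders):

import { PlaywrightCrawler, Configuration } from 'crawlee';

// Assumption: each crawler gets its own storage folder so their data
// doesn't collide; the request handlers are minimal placeholders.
const a = new PlaywrightCrawler(
    { requestHandler: async ({ page, log }) => log.info(await page.title()) },
    new Configuration({ localDataDirectory: './storage/crawler-a' }),
);

const b = new PlaywrightCrawler(
    { requestHandler: async ({ page, log }) => log.info(await page.title()) },
    new Configuration({ localDataDirectory: './storage/crawler-b' }),
);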


After looking into this more closely, instantiating a new crawler (with persistStorage: false) in every express request might actually fix the memory leak as well. Running one crawler over and over again causes the SDK_SESSION_POOL_STATE.json to grow infinitely (it never gets purged).

I did a little experiment with two CheerioCrawler instances: in one case, I was running the same crawler instance repeatedly; in the other, I always instantiated a new crawler instance for each run. The code and the memory utilization graphs are below.
const c = new CheerioCrawler({
    requestHandler: async ({ request }) => {
        console.log(request.url);
    },
});

setInterval(() => {
    c.run([`https://jindrich.bar/${Math.random().toString(36).substring(7)}`]);
}, 1000);

[Memory utilization graph: same crawler instance reused]

setInterval(() => {
    const c = new CheerioCrawler({
        requestHandler: async ({ request }) => {
            console.log(request.url);
        },
    }, new Configuration({
        persistStorage: false,
    }));

    c.run([`https://jindrich.bar/${Math.random().toString(36).substring(7)}`]);
}, 1000);

[Memory utilization graph: new crawler instance per run]

I'll create a separate issue for the growing SDK_SESSION_POOL_STATE KVS record, but in the meantime, you can try this trick with a new crawler for each request; make sure to pass the persistStorage: false option as well. And definitely let us know how it went :)
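
Applied to the Express route shared earlier in the thread, the per-request pattern might look roughly like this (a sketch; igRequestHandler is a hypothetical name standing in for the requestHandler shown above):

import { PlaywrightCrawler, Configuration } from 'crawlee';
import type { Request, Response } from 'express';

app.post('/v1/scrape/instagram/:collectiveId', async (req: Request, res: Response) => {
    const { collectiveId } = req.params;
    const { instagramURL } = req.body;

    // A fresh crawler per request; persistStorage: false keeps it from
    // writing (and endlessly growing) on-disk state such as
    // SDK_SESSION_POOL_STATE.json.
    const crawler = new PlaywrightCrawler(
        {
            maxConcurrency: 1,
            requestHandler: igRequestHandler, // hypothetical: the handler shown earlier
        },
        new Configuration({ persistStorage: false }),
    );

    await crawler.run([{ url: instagramURL, userData: { collectiveId } }]);

    return res.json({ success: true, message: `Scraped ${instagramURL} [${collectiveId}].` });
});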

@obiknows
Author

wow, thank you @barjin! yeah, I will look into running multiple instances; that would be the ideal case for us.

also, yeah, calling await browserController.close() did not work, so I will test out the persistStorage flag now.

barjin added a commit that referenced this issue Sep 20, 2023
Fixes #2074
Related #2031

---------

Co-authored-by: Martin Adámek <banan23@gmail.com>
@barjin
Contributor

barjin commented Sep 25, 2023

hi @obiknows, can you please let us know if you are still experiencing the issue? cheers! :)

@obiknows
Author

obiknows commented Sep 25, 2023

Hey @barjin, this seems to have solved the issue. I think we're now just running into a separate memory issue, but in the 5 days since implementing this, our servers have been able to restart if they reach an out-of-memory condition.

I believe this fixed the Crawlee side of things. Thanks a bunch, and thanks to you too, @B4nan.

Much appreciated 👌🏿

@barjin
Contributor

barjin commented Sep 25, 2023

Cheers, we're glad we could help. I'll close this issue now, but feel free to let us know in case of any additional questions.

Thanks!
