Playwright Headless Crawler Crashes After Multiple Successive Runs #2031
Comments
Hello @obiknows, and thank you for submitting this issue! What you are describing sounds like some sort of memory leak (Playwright / Puppeteer does similar things when low on memory; see the identical issue linked here). Can you please share more details about your solution (best case, the whole project as a GitHub repo)? It's possible that you are leaking memory somewhere in the task queue you mentioned. I briefly checked different scenarios with Crawlee, and in none of them was I able to reproduce any sort of memleak. Thank you!
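A quick way to check for a leak like this is to watch heap usage between runs. Below is a hedged sketch (not a Crawlee API; `runOnce` is a placeholder for whatever kicks off one crawl) of such a check:

```javascript
// Sample heap usage before and after a crawl run; a steadily positive
// delta across many runs hints at a leak.
function heapUsedMB() {
  return process.memoryUsage().heapUsed / 1024 / 1024;
}

async function measureRun(runOnce, label = 'run') {
  global.gc?.(); // only present when Node is started with --expose-gc
  const before = heapUsedMB();
  await runOnce();
  global.gc?.();
  const after = heapUsedMB();
  console.log(`${label}: heap ${before.toFixed(1)} -> ${after.toFixed(1)} MB`);
  return after - before;
}
```

Running this around each queued crawl and logging the deltas over a day or two should show whether the heap really grows monotonically.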
Thanks for the included link and response, @barjin. Here's the server instance that calls the Crawlee crawler.

Server Code:

Our requests are created and serialized by a BullMQ task server, so only one instance of the crawler is ever running at any given time; we never reach …

Crawler Code:
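In our setup BullMQ enforces the one-at-a-time behavior; purely as an illustration (this is not our actual BullMQ code, and the names are made up), the serialization idea can be sketched as a promise chain:

```javascript
// Serialize jobs so only one runs at a time: each enqueued job waits
// for the previous one to finish, so two crawl runs can never overlap.
function createSerialQueue() {
  let tail = Promise.resolve();
  return function enqueue(job) {
    const result = tail.then(() => job());
    tail = result.catch(() => {}); // keep the chain alive after failures
    return result;
  };
}

const enqueue = createSerialQueue();
// enqueue(() => runCrawl(urlsA)); enqueue(() => runCrawl(urlsB));
```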
This is the code for the crawler. It scrapes an Instagram post, gets the post IDs, and sends them to our core API service for processing. This pipeline works successfully for us; we just started seeing the memory leak issue after 2 days. From the Stack Overflow link you sent, @barjin, I will try to find the browser instance and explicitly call … I assume I can get the …
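The "explicitly close the browser" idea discussed above boils down to the usual acquire/release pattern. A generic sketch, where `launchBrowser` and `crawl` are placeholders rather than Playwright or Crawlee APIs:

```javascript
// Whatever acquires the browser also releases it, even when the crawl
// throws, so nothing lingers in memory between runs.
async function crawlOnce(launchBrowser, crawl) {
  const browser = await launchBrowser();
  try {
    return await crawl(browser);
  } finally {
    await browser.close(); // explicit close; never rely on GC for this
  }
}
```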
FYI, the … edit: it's out: https://github.com/apify/crawlee/releases/tag/v3.5.4
Thanks for the heads up @B4nan, I will definitely update to 3.5.4 across our stack right now. And yeah, I noticed it didn't solve our issue, because it's not explicitly a disk-related issue, more a memory problem, but I left it in there just in case.
Actually, we fixed that one or two weeks ago - you can now run multiple crawler instances in one process by instantiating them with separate `Configuration` instances:

```javascript
const a = new PlaywrightCrawler({
    // ...
  },
  new Configuration({ persistStorage: false }),
);

const b = new PlaywrightCrawler({
    // ...
  },
  new Configuration({ persistStorage: false }),
);

// `a` and `b` can now run simultaneously, without affecting each other
a.run([...]);
b.run([...]);
```

The … This way, you can always instantiate a new crawler instance on a new …

After looking into this more closely, instantiating a new crawler (with …) … I did a little experiment with two CheerioCrawler instances: in one case, I was running the same crawler instance repeatedly; in the other one, I always instantiated a new crawler instance for each new run:
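As a loose analogy only (this is not Crawlee's internals), the reason separate per-crawler configuration objects avoid cross-talk is the same reason per-instance storage does in this toy example:

```javascript
// Toy model: instances that fall back to one shared default store see
// each other's state; instances given their own store stay isolated.
const defaultStore = new Map();

class ToyCrawler {
  constructor(store = defaultStore) {
    this.store = store;
  }
  run(urls) {
    for (const url of urls) this.store.set(url, 'done');
  }
}

const sharedA = new ToyCrawler();
const sharedB = new ToyCrawler();
sharedA.run(['https://a.example']);
// sharedB.store now also contains 'https://a.example'

const isolated = new ToyCrawler(new Map());
isolated.run(['https://b.example']);
// isolated.store holds only its own entry; defaultStore is untouched by it
```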
I'll create a separate issue for the growing …
Wow, thank you @barjin! Yeah, I will look into running multiple instances; that would be the most ideal case for us. Also, yeah, calling …
Hi @obiknows, can you please let us know if you are still experiencing the issue? Cheers! :)
Hey @barjin, this seems to have solved the issue. Now I think we're just running into a separate memory issue, but in the 5 days since implementing this, our servers have been able to restart when they reach an Out of Memory condition. I believe this fixed the Crawlee side of things. Thanks a bunch. Also, thanks to you too @B4nan. Much appreciated 👌🏿
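For the remaining OOM restarts, one common pattern (a hypothetical sketch, not code from this thread) is to bail out early when the heap crosses a threshold and let the process supervisor restart with a clean slate:

```javascript
// Exit non-zero when heap usage crosses a limit so the supervisor
// (Docker, PM2, systemd, ...) restarts the process with a fresh heap.
// HEAP_LIMIT_MB is an assumed value; tune it to the container size.
const HEAP_LIMIT_MB = 1024;

function heapCheck(limitMb = HEAP_LIMIT_MB) {
  const heapMb = process.memoryUsage().heapUsed / 1024 / 1024;
  if (heapMb > limitMb) {
    console.error(`heap ${heapMb.toFixed(0)} MB exceeds ${limitMb} MB, exiting for restart`);
    process.exit(1);
  }
  return heapMb;
}
```

Calling something like `heapCheck()` between queued crawl jobs turns a hard OOM kill into a controlled restart at a job boundary.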
Cheers, we're glad we could help. I'll close this issue now, but feel free to let us know in case of any additional questions. Thanks!
Which package is this bug report for? If unsure which one to select, leave blank
@crawlee/playwright (PlaywrightCrawler)
Issue description
When running a Playwright crawler behind a task queue, after 2 days of successive runs, the crawler will begin failing requests with the following error (see picture):
Code sample
No response
Package version
3.4.2
Node.js version
16.18.1
Operating system
No response
Apify platform
I have tested this on the `next` release
No response
Other context
This is the text output from the server on crash: