cheerio+v8 "leaks" memory from original HTML #263
Wow, thanks for the detailed report. I definitely think this is something that should be fixed in V8, as there's a considerable number of modules that deal with manipulating large strings. I'm going to keep an eye on the ticket as it looks like there's already some momentum. If nothing is done we'll look into a solution specific to cheerio.
That's definitely interesting! Having slices of the original HTML in the DOM isn't a real issue and shouldn't add too much memory overhead for normal use-cases. Most people won't notice anyway. If you consume multiple documents and actually have a problem with memory consumption, you should add the fix yourself. I was hoping Golang already had a solution for this, but apparently, they share the same issue:
It'll be interesting to see what the V8 people come up with. The linked paper already looks promising; I'm looking forward to reading it.
Agree with everything you guys are saying :). Just to clarify: the challenge with consuming multiple documents is that 1) HTML files are verbose and 2) the default behavior is leaky. There's basically no way to do it without instrumenting your code (e.g., with memwatch). Oh, and I should quantify the problem: if you have a bunch of 21kb pages and you want to extract 1kb from each, you leak 20kb per page (in code like my example above). That's 20MB per 1,000 pages, and 2GB per 100,000 pages. Node tends to crash when you get that far. So cheerio's practical limit is 50,000 pages' worth of strings (unless you purge your memory somehow, such as consuming the strings and dropping references to them), instead of 1,000,000.
Well, this is also a structural problem: when you're dealing with that kind of data, you should seriously consider using streams. In almost all cases, pages are loaded from some IO source, processed, and then written to an IO destination. htmlparser2 is intended to be used with streams; you can start working with the DOM as it's built. (The parser actually outputs a DOM traversal, which is then turned into a DOM by the domhandler module. If you care about speed and memory, you might want to skip building the DOM.) cheerio doesn't support querying a stream yet; a proof-of-concept implementation that also builds a DOM and can be queried with CSS selectors can be found here. My project that processes the most pages is fb55/ReadableFeeds, which not only processes pages as they are streamed in, but also sends them on as soon as they are done. Due to bandwidth limitations, I'll never process 50k pages at once, although I spend just as much time (even less, given a reasonably fast CPU) reading and processing the pages.
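For illustration, a minimal sketch of that streaming approach, using htmlparser2's event callbacks directly (the URL and the title-grabbing logic are made up for the example):

```js
var http = require('http');
var htmlparser = require('htmlparser2');

// Grab the <title> text without ever buffering the whole document.
var title = '';
var inTitle = false;

var parser = new htmlparser.Parser({
  onopentag: function (name) { if (name === 'title') inTitle = true; },
  ontext: function (text) { if (inTitle) title += text; },
  onclosetag: function (name) { if (name === 'title') inTitle = false; }
});

http.get('http://example.com/', function (res) {
  res.setEncoding('utf8');
  res.on('data', function (chunk) { parser.write(chunk); });
  res.on('end', function () {
    parser.end();
    console.log('title:', title);
  });
});
```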
Yep, I've been thinking a lot about a cheerio 2 lately, which fixes a lot of the jquery annoyances and supports streams. Not sure when I'll have time to work on it though :-)
I was talking to one of the node core developers and it sounds like they do this in a few places in node. I'm down to do the string copying, but I'd like some benchmarks to see what kind of impact this change will have on performance.
As this is a V8-specific bug, I'm closing it.
I ran into this one as well! The unleak method suggested in https://code.google.com/p/v8/issues/detail?id=2869 doesn't work; I ended up having to do: (' ' + string).replace(/^\s/, '') This is a very annoying issue, as I'm trying to get everything to work on a 512MB instance.
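For reference, that one-liner wrapped as a helper; a sketch, with `$` assumed to be a loaded cheerio document:

```js
// Force V8 to allocate a fresh string instead of keeping a slice
// that pins the original huge HTML in memory.
function unleak(string) {
  return (' ' + string).replace(/^\s/, '');
}

// Copy the little you keep; let the huge HTML be collected.
var title = unleak($('title').text());
```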
I have run into this issue attempting to scrape data from 200k+ large HTML pages. It gets to about 10k and uses 15GB of RAM and 4 CPUs at 100%. Has anyone found a workaround?
@DavidHooper You can force V8 to create a copy of all persistent strings (as described above).
Thanks @fb55, I'll try to implement a fix using @icodeforlove's example: (' ' + string).replace(/^\s/, ''). It worked.
I have just spent several hours trying to debug a leak like this. Debugging it was not easy; it took a lot of trial and error. Here's what helped:
In my case it was (string) that was accumulating memory. I didn't have any experience with memory heap profiling in Chrome dev tools, but with some trial and error, I isolated the leak to the use of a global variable.
@adamhooper Sorry, what do you mean? I made some simple code to test the leak; I am using that trick, but the memory is still growing. This code goes to Wikipedia and visits all the URLs one by one:
This is the trace:
The memory grows endlessly. In my real code, where I need to scrape many URLs from a site, the memory grows faster; in that case I extract many attributes and then store them in Mongo.
@developez you don't need … Also, you should put a …
@adamhooper Thank you. However, using … Edit: Problem solved using the queue pattern from the async.js package.
When you create a closure (a …) … Good call with async.js. It makes for an elegant solution.
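A minimal sketch of that queue pattern with async.js (the `scrapePage` worker and `urls` list are placeholders):

```js
var async = require('async');

// Scrape at most two pages at a time; each worker calls done() when
// finished, so only a bounded number of HTML strings are alive at once.
var q = async.queue(function (url, done) {
  scrapePage(url, done); // hypothetical: fetch, cheerio.load, extract, drop
}, 2);

q.drain = function () { console.log('all pages processed'); };

urls.forEach(function (url) { q.push(url); });
```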
I solved this problem in a relatively simple way. I have a web crawler that processes over 1,000 links per minute and uses cheerio as its data-extraction module. To work around the problem, I start the application with the --expose-gc flag and always call global.gc() after using cheerio.
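Concretely, that looks something like the sketch below; note the caveat in the next comment that gc() can only reclaim strings nothing references anymore:

```js
// Start the process as: node --expose-gc crawler.js
var cheerio = require('cheerio');

function scrape(html) {
  var $ = cheerio.load(html);
  var data = $('h1').text();
  $ = null;                            // drop references into the DOM first
  if (typeof global.gc === 'function') {
    global.gc();                       // only defined under --expose-gc
  }
  return data;
}
```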
@andrehrf I doubt this actually fixes the problem.
To be even clearer: Maybe you weren't experiencing this problem at all.
I understand your position. It's not a true resolution of the problem, but it solves most cases; at least in my case it worked, and I have turned over 20 million records per day on a machine with 4GB of memory.
@andrehrf @adamhooper In my experience …
I am also having a memory leak issue. I noticed this post is a couple of years old. Would upgrading to a newer version of Node fix the issue in V8 (and therefore in cheerio), or would V8 still have this issue?
@danieltjewett To my knowledge, v8 still has this issue. (Actually, lots of platforms do.) See #263 (comment) for a workaround.
I had the same issue, and I think that using …
To be clear, for confused readers: If you have a problem and …
@adamhooper Thank you for a really good explanation.

```js
function forceFreeMemory(fn) {
  return function wrapper() {
    // Convert the result to a string, to drop all references to leaky data.
    var str = JSON.stringify(fn.apply(this, arguments));
    // Be paranoid and modify the string to force a copy.
    str = (' ' + str).substr(1);
    if (typeof global.gc !== 'function')
      throw Error('You should expose GC; run Node with "--expose-gc".');
    global.gc();
    return JSON.parse(str);
  };
}
```

It's meant to be used like this:

```js
.then(forceFreeMemory(function (data) {
  var $ = cheerio.load(data);
  return /* scraped data */;
}))
```

Maybe you can suggest something to make it better?
@IvanGoncharov That's a good idea, but your implementation looks like it's not very maintainable: six months from now, you won't remember how it works. I'd suggest instead (untested):

```js
// Return a copy of the given object, taking a predictable amount of space.
function recreateObject(plainOldDataObject) {
  return JSON.parse(JSON.stringify(plainOldDataObject));
}

function maybeGcCollect() {
  if (typeof global.gc === 'function') {
    global.gc();
  }
}

// And if you really must make one function that does two things:
function recreateObjectAndMaybeGcCollect(plainOldDataObject) {
  var ret = recreateObject(plainOldDataObject);
  maybeGcCollect();
  return ret;
}
```

Emphasize simple. Make sure you're very clear on what each method does. And don't use these workarounds normally: only use them when you have identified a problem, you are certain this fixes it, and you have commented why it fixes it. You may find that …
I have encountered the same issue. After digging into it, it turned out that the sliced-string mechanism keeps the memory from being released as soon as it otherwise could be.
(If you forget even one call to …)
The issue about leaking memory is more than 7 years old, and I am wondering whether it is still necessary (in 2021, with cheerio v1.0.0-rc.5) to use an unleak-style workaround. I am asking because I built a crawler with cheerio which runs into the following error from time to time:
I believe it is still an issue in V8, so yes, it affects cheerio as well.
This is actually a bug in v8 (and probably any other JS engine), but it is particularly damaging in cheerio.
The v8 bug: https://code.google.com/p/v8/issues/detail?id=2869
In brief: let's say I have code like this:
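(Roughly something like the following, where each page's HTML is huge and only a sliver of it is kept; `fetchHtml` and `urls` stand in for any HTTP helper and URL list:)

```js
var cheerio = require('cheerio');

var strings = [];
urls.forEach(function (url) {
  fetchHtml(url, function (html) {   // html is a huge string
    var $ = cheerio.load(html);
    strings.push($('h1').text());    // a tiny slice of html
  });
});
```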
Then the `strings` array is going to hold a tiny substring of each huge HTML string. Unfortunately, v8 will use that tiny substring as a reason to keep the entire HTML string in memory. As a result, the process runs out of memory.

You can see this in action using memwatch, at https://github.com/lloyd/node-memwatch, and dropping code like this at the top of your script:
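(A sketch of such a hookup; the event and field names follow node-memwatch's API:)

```js
var memwatch = require('memwatch');

memwatch.on('stats', function (stats) {
  // current_base is the heap size after a full GC; it keeps climbing
  // even though we only retain tiny substrings.
  console.log('heap after gc:', stats.current_base);
});

memwatch.on('leak', function (info) {
  console.log('leak suspected:', info.reason);
});
```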
This is obviously a frustrating bug, as it isn't cheerio's place to second-guess v8. However, as it stands, huge memory leaks are the norm in cheerio.
A workaround: create an unleak function (such as the one at https://code.google.com/p/v8/issues/detail?id=2869) and use it by default in `.text()`, `.attr()` and similar methods. (To maintain speed, cheerio could use and provide a `.leakyText()` method or some such which does what the current `.text()` method does.)

A more comprehensive workaround is to rebuild the DOM, so even if people access `children[0].attribs.data` they won't leak memory. I imagine that would slow down the parser considerably.

Whatever the solution, I think the normal examples (the stuff people would write in howto guides) should not leak memory. To me, that's more important than saving a few milliseconds.