
cheerio+v8 "leaks" memory from original HTML #263

Closed
adamhooper opened this issue Sep 4, 2013 · 32 comments

@adamhooper

This is actually a bug in v8 (and probably any other JS engine), but it is particularly damaging in cheerio.

The v8 bug: https://code.google.com/p/v8/issues/detail?id=2869

In brief: let's say I have code like this:

var cheerio = require('cheerio');
var hugeHtmlProducer = require('./hugeHtmlProducer');

var strings = [];

function handlePage(hugeHtml) {
  var $ = cheerio.load(hugeHtml);
  strings.push($('#tiny-string').text());
}

// hugeHtmlProducer.forEachAsync() loops like this:
// 1. fetch a huge HTML string
// 2. call the first callback on that HTML string
// 3. loop, dropping all references to the huge HTML string
// 4. calls the second callback
hugeHtmlProducer.forEachAsync(handlePage, function() { process.exit(); });

Then the strings array is going to hold a tiny substring of each huge HTML string. Unfortunately, v8 will use that tiny substring as a reason to keep the entire HTML string in memory. As a result, the process runs out of memory.

You can see this in action using memwatch (https://github.com/lloyd/node-memwatch) by dropping code like this at the top of your script:

var lastHeap = null;
var memwatch = require('memwatch');
memwatch.on('stats', function(info) {
  if (lastHeap) {
    var hd = lastHeap.end();
    console.log(JSON.stringify(hd, null, '  '));
  }
  console.log("base:" + (info.current_base / 1024 / 1024).toFixed(1) + "M fullGCs:" + info.num_full_gc + " incrGCs:" + info.num_inc_gc);
  lastHeap = new memwatch.HeapDiff();
});

This is obviously a frustrating bug, as it isn't cheerio's place to second-guess v8. However, as it stands, huge memory leaks are the norm in cheerio.

A workaround: create an unleak function (such as the one at https://code.google.com/p/v8/issues/detail?id=2869) and use it by default in .text(), .attr() and similar methods. (To maintain speed, cheerio could use and provide a .leakyText() method or some-such which does what the current .text() method does.)
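
For illustration, a minimal sketch of that workaround applied to the example above (the unleak name is only a suggestion; it is not a cheerio method):

var cheerio = require('cheerio');

// Force V8 to allocate a flat copy of the string, so the parent HTML
// string it was sliced from can be garbage-collected.
function unleak(s) {
  return (' ' + s).substr(1);
}

var strings = [];

function handlePage(hugeHtml) {
  var $ = cheerio.load(hugeHtml);
  // Without unleak(), this tiny substring would keep all of hugeHtml in memory.
  strings.push(unleak($('#tiny-string').text()));
}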

A more comprehensive workaround is to rebuild the DOM, so even if people access children[0].attribs.data they won't leak memory. I imagine that would slow down the parser considerably.

Whatever the solution, I think the normal examples (the stuff people would write in how-to guides) should not leak memory. To me, that's more important than saving a few milliseconds.

@matthewmueller
Member

Wow, thanks for the detailed report. I definitely think this is something that should be fixed in V8 as there's a considerable number of modules that deal with manipulating large strings.

I'm going to keep an eye on the ticket as it looks like there's already some momentum. If nothing is done we'll look into a solution specific to cheerio.

@fb55
Member

fb55 commented Sep 4, 2013

That's definitely interesting!

Having slices of the original HTML in the DOM isn't a real issue and shouldn't add too much memory overhead for normal use-cases. Most people won't notice anyway.

If you consume multiple documents and actually have a problem with memory consumption, you should add the fix yourself.

I was hoping Golang already had a solution for this, but apparently, they share the same issue:

Since the slice references the original array, as long as the slice is kept around the garbage collector can't release the array; the few useful bytes of the file keep the entire contents in memory.

It'll be interesting to see what the V8 people come up with. The linked paper already looks promising, I'm looking forward to reading it.

@adamhooper
Author

Agree with everything you guys are saying :).

Just to clarify: the challenge with consuming multiple documents is that 1) HTML files are verbose and 2) the default behavior is leaky. There's basically no way to spot the problem without instrumenting your code (e.g., with memwatch).

Oh, and I should quantify the problem: if you have a bunch of 21kB pages and you want to extract 1kB from each, you leak 20kB per page (in code like my example above). That's 20MB per 1,000 pages, and 2GB per 100,000 pages. Node tends to crash when you get that far. So cheerio's practical limit is around 50,000 pages' worth of strings (unless you purge your memory somehow, such as consuming the strings and dropping references to them), instead of 1,000,000.

@fb55
Member

fb55 commented Sep 4, 2013

Well, this is also a structural problem: When you're dealing with that kind of data, you should seriously consider using streams.

In almost all cases, pages are loaded from some IO source, processed, and then sent on to an IO destination. htmlparser2 is intended to be used with streams; you can start working with the DOM as it's built. (The parser actually outputs a DOM traversal, which is then turned into a DOM by the domhandler module. If you care about speed and memory, you might want to skip building the DOM.)
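
For illustration, a rough sketch of that streaming approach using htmlparser2 directly (the URL, element id, and extraction logic are hypothetical, and the tag tracking is deliberately naive):

var htmlparser = require('htmlparser2');
var http = require('http');

// Build a parser that collects only the text of the element with the given id,
// without ever keeping the full HTML document in memory.
function makeParser(id, done) {
  var inTarget = false;
  var text = '';
  return new htmlparser.Parser({
    onopentag: function (name, attribs) {
      if (attribs.id === id) inTarget = true;
    },
    ontext: function (data) {
      if (inTarget) text += data;
    },
    onclosetag: function () {
      inTarget = false; // naive: assumes the target element has no child tags
    },
    onend: function () {
      done(text);
    }
  });
}

http.get('http://example.com/huge-page.html', function (res) {
  var parser = makeParser('tiny-string', function (text) {
    console.log(text);
  });
  res.setEncoding('utf8');
  res.on('data', function (chunk) { parser.write(chunk); });
  res.on('end', function () { parser.end(); });
});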

cheerio doesn't support querying a stream yet; a proof-of-concept implementation that also builds a DOM and can be queried with CSS selectors can be found here.

My project that processes the most pages is fb55/ReadableFeeds, which not only processes pages as they are streamed in, but also sends them on as soon as they are done. Due to bandwidth limitations, I'll never process 50k pages at once, although I spend just as much time (even less, given a reasonably fast CPU) reading and processing the pages.

@matthewmueller
Member

Yep, I've been thinking a lot about a cheerio 2 lately, which fixes a lot of the jquery annoyances and supports streams. Not sure when I'll have time to work on it though :-)

@matthewmueller
Member

I was talking to one of the node core developers and it sounds like they do this in a few places in node. I'm down to do the string copying, but I'd like some benchmarks to see what kind of performance impact this change would have.
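
For what it's worth, a very rough micro-benchmark sketch for that (the fixture path, selector, and iteration count are arbitrary placeholders):

var fs = require('fs');
var cheerio = require('cheerio');

var html = fs.readFileSync('./fixtures/page.html', 'utf8'); // hypothetical fixture
var $ = cheerio.load(html);

// The copy this thread proposes adding to .text(), .attr(), etc.
function unleak(s) {
  return (' ' + s).substr(1);
}

console.time('.text() as-is');
for (var i = 0; i < 100000; i++) {
  $('#tiny-string').text();
}
console.timeEnd('.text() as-is');

console.time('.text() + copy');
for (var j = 0; j < 100000; j++) {
  unleak($('#tiny-string').text());
}
console.timeEnd('.text() + copy');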

@fb55
Member

fb55 commented Apr 8, 2014

As this is a V8-specific bug, I'm closing it.

@fb55 fb55 closed this as completed Apr 8, 2014
@fb55 fb55 mentioned this issue Jun 2, 2014
@icodeforlove

I ran into this one as well!

The unleak method suggested in https://code.google.com/p/v8/issues/detail?id=2869 doesn't work; I ended up having to do:

(' ' + string).replace(/^\s/, '')

This is a very annoying issue, as I'm trying to get everything to work on a 512MB instance.

@DavidHooper

I have run into this issue while attempting to scrape data from 200k+ large HTML pages. It gets to about 10k pages and uses 15 GB of RAM with 4 CPUs at 100%. Has anyone found a workaround?

@fb55
Member

fb55 commented Feb 23, 2015

@DavidHooper You can force V8 to create a copy of all persistent strings (as described above).

@DavidHooper

Thanks @fb55, I'll try to implement a fix using @icodeforlove's example:

(' ' + string).replace(/^\s/, '')

It worked.

@adamhooper
Author

(' ' + string).substr(1) is shorter, easier to read, and faster. (And if you make a typo and write .substring(), it'll work just as well.)

@diffen

diffen commented Dec 2, 2015

I have just spent several hours trying to debug a leak like this. Here's what helped:
I was using a global variable to track some results of the scrape, and that seemed to be the cause of the memory leak. I changed it to a local variable inside request(url, function() { ...put the variable here... }).

Debugging the memory leak was not easy. It was a lot of trial and error. But what helped was:

  1. Installing the memwatch and heapdump modules.
  2. Using the code in this blog post http://www.nearform.com/nodecrunch/self-detect-memory-leak-node/ to dump the heap when a leak is detected (a sketch of this setup follows below).
  3. Opening the heap dumps in Chrome Dev Tools to see what could possibly be the culprit.

In my case it was (string) that was accumulating memory. I didn't have any experience with heap profiling in Chrome Dev Tools, but with some trial and error I isolated the leak to the use of a global variable.
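
For reference, a minimal sketch of steps 1 and 2 (the output path is arbitrary; memwatch emits 'leak' after the heap has grown over several consecutive garbage collections):

var memwatch = require('memwatch');
var heapdump = require('heapdump');

memwatch.on('leak', function (info) {
  console.error('memwatch detected a leak:', info);
  // Write a snapshot you can load in the Chrome Dev Tools Memory tab (step 3).
  heapdump.writeSnapshot('/tmp/' + Date.now() + '.heapsnapshot', function (err, filename) {
    if (err) console.error(err);
    else console.error('heap dump written to', filename);
  });
});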

@developez

@adamhooper Sorry, what do you mean by (' ' + string).substr(1)? I am suffering from this problem but I cannot fix the memory leak.

I made some simple code to test the leak. I am using that trick, but the memory is still growing. This code goes to Wikipedia and visits all the URLs one by one:

var request = require('request');
var cheerio = require('cheerio');

// Start from the Wikipedia front page, as in the trace below.
var urls = ['https://es.wikipedia.org/wiki/Wikipedia:Portada'];

function unleak(s)
{
    return (" " + s).substr(1);
}

function memory_leak_test(index, urls)
{       
    if(index == 350)
        return;

    var url = urls[index];
    request({
        method: 'GET',
        url: url
    }, function(err, response, body) {      

        // Print used memory
        console.log(new Date().toISOString() + ' ' + process.memoryUsage().rss + ' BEGIN [' + index + '] ' + url);

        // If we got an error, we need to connect to the URL again
        if (!(!err && response.statusCode == 200)) {
            memory_leak_test(index, urls)   
            return;
        }       

        var $ = cheerio.load(body);

        $('a[href^="/"]').each(function() {

            var new_url;
            // Here I take the href and do some cleanup to get the correct URL; this is not important
            var aux = unleak($(this).attr("href"));
            if(aux.substr(0, 2) == "//")
                new_url = "https:" + aux;
            else 
                new_url = "https://es.wikipedia.org" + aux;
            // end cleaning url
            urls.push(new_url);
        });        

        index = index + 1;
        memory_leak_test(index, urls);
        return;
    });
}

console.log("BEGIN app.js");
memory_leak_test(0, urls);

This is the trace:

2015-12-13T15:00:05.310Z **122327040** BEGIN [0] https://es.wikipedia.org/wiki/Wikipedia:Portada
2015-12-13T15:00:05.835Z 122781696 BEGIN [1] https://es.wikipedia.org/wiki/Wikipedia:Bienvenidos
2015-12-13T15:00:06.311Z 122970112 BEGIN [2] https://es.wikipedia.org/wiki/Ayuda:Introducci%C3%B3n
2015-12-13T15:00:06.751Z 123092992 BEGIN [3] https://es.wikipedia.org/wiki/Wikipedia:Contacto
...
2015-12-13T15:04:45.573Z 193650688 BEGIN [349] https://es.wikipedia.org/wiki/Enciclopedia
2015-12-13T15:04:46.292Z 193654784 BEGIN [350] https://es.wikipedia.org/wiki/Wikipedia:Punto_de_vista_neutral
2015-12-13T15:04:46.861Z 196263936 BEGIN [351] https://es.wikipedia.org/wiki/Licencia_de_documentaci%C3%B3n_libre_de_GNU
2015-12-13T15:04:47.472Z 196534272 BEGIN [352] https://es.wikipedia.org/wiki/Wikipedia:Derechos_de_autor
2015-12-13T15:04:48.166Z 196546560 BEGIN [353] https://es.wikipedia.org/wiki/Wikipedia:Etiqueta
2015-12-13T15:04:48.696Z **196550656** BEGIN [354] https://es.wikipedia.org/wiki/Wikipedia:S%C3%A9_valiente_al_editar_p%C3%A1ginas

The memory grows endlessly. In my real code, where I need to scrape many URLs from a site, the memory grows even faster; in that case I extract many attributes and then store them in Mongo.

@adamhooper
Author

@developez you don't need unleak() in that code. The problem is infinite recursion. Stop the infinite recursion by using the event loop -- e.g., process.nextTick().

Also, you should put a var new_url somewhere to plug an insignificant memory leak.

@developez

@adamhooper Thank you. However, using process.nextTick(function() { memory_leak_test(index, urls) }) instead of memory_leak_test(index, urls) doesn't make much difference in the rss value reported by process.memoryUsage(). I thought there would be no recursive pyramid of memory retention when the recursion happens through functions with asynchronous calls inside.

Edit:

Problem solved using the queue pattern from the async.js package.
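
For reference, a rough sketch of that queue pattern applied to the Wikipedia example above (concurrency, retries, and URL cleanup are simplified):

var async = require('async');
var request = require('request');
var cheerio = require('cheerio');

function unleak(s) {
  return (' ' + s).substr(1);
}

// Process at most two pages at a time; each body goes out of scope as soon as
// its worker finishes, so the old HTML can be garbage-collected.
var q = async.queue(function (url, done) {
  request({ method: 'GET', url: url }, function (err, response, body) {
    if (err || response.statusCode !== 200) return done(err);
    var $ = cheerio.load(body);
    $('a[href^="/"]').each(function () {
      var href = unleak($(this).attr('href'));
      var next = href.substr(0, 2) === '//' ? 'https:' + href : 'https://es.wikipedia.org' + href;
      q.push(next); // enqueue instead of recursing
    });
    done();
  });
}, 2);

q.push('https://es.wikipedia.org/wiki/Wikipedia:Portada');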

@adamhooper
Author

When you create a closure (a function(){} block in JavaScript), the closure retains the environment that created it. None of the body values you ever created could be freed, because there was always a closure that relied on them. (I could be wrong. I haven't taken the time to really inspect your code, but this seems like a plausible problem -- it's certainly a common one.)

Good call with async.js. It makes for an elegant solution.

@andrehrf

I solved this problem in a relatively simple way. I have a web crawler that processes over 1,000 links per minute and uses cheerio as its data-extraction module. To solve the problem, I start the application with the --expose-gc parameter and always call global.gc() after using cheerio.

@adamhooper
Author

@andrehrf I doubt this actually fixes the problem.

global.gc() clears unused objects from memory. It does not resize the objects that are still in memory.

To be even clearer: global.gc() will never fix a bug. Heuristically, if you call it before a code path that code path may complete more quickly. And heuristically, if you call it regularly your program may consume less memory. But it won't fix a bug, and this is a bug: https://bugs.chromium.org/p/v8/issues/detail?id=2869

Maybe you weren't experiencing this problem at all.

@andrehrf

I understand your position; it is not a resolution of the problem, but it solves most cases. At least in my case it resolved the issue, and I have processed over 20 million records per day on a machine with 4 GB of memory.

@diffen

diffen commented Feb 29, 2016

@andrehrf @adamhooper In my experience gc() can address some memory leaks, but it slows down the program by a not-insignificant amount. So it's a trade-off.

@danieltjewett

I am also having a memory leak issue. I noticed this post is a couple years old. Would upgrading to a newer version of Node fix the issue in V8 (therefore fixing the issue in Cheerio), or would V8 still have this issue?

@adamhooper
Author

@danieltjewett To my knowledge, v8 still has this issue. (Actually, lots of platforms do.)

See #263 (comment) for a workaround.

@Vayvala

Vayvala commented Mar 2, 2016

I had the same issue and (' ' + string).substr(1) didn't do the trick, however global.gc() did.

I think that using global.gc() is a bad practice, so I found another solution:

node --optimize_for_size app.js

@adamhooper
Author

To be clear, for confused readers:

If you have a problem and global.gc() fixes it, then this issue -- issue #263 -- is not your problem.

@IvanGoncharov

@adamhooper Thank you for a really good explanation.
It saved my sanity during the debugging process.
I have many scrapers which are affected by this issue and also by ineffective GC.
So I tried to create a "universal" hack, and came up with this:

function forceFreeMemory(fn) {
  return function wrapper() {
    //Convert the result to a string, to drop all references to leaky data.
    var str = JSON.stringify(fn.apply(this, arguments));
    //Be paranoid and modify the string to force a copy.
    str = (' ' + str).substr(1);

    if(typeof global.gc !== 'function')
      throw Error('You should expose GC, run Node with "--expose-gc".');
    global.gc();

    return JSON.parse(str);
  }
}

It's meant to be used like this:

.then(forceFreeMemory(function (data) {
  var $ = cheerio.load(data);
  return /* scraped data */;
}))

Maybe you can suggest something to make it better?

@adamhooper
Author

@IvanGoncharov That's a good idea, but your implementation looks like it's not very maintainable -- six months from now, you won't remember how it works. I'd suggest instead (untested):

// Return a copy of the given object, taking a predictable amount of space.
function recreateObject(plainOldDataObject) {
  return JSON.parse(JSON.stringify(plainOldDataObject));
}

function maybeGcCollect() {
  if (typeof(global.gc) === 'function') {
    global.gc();
  }
}

// And if you really must make one function that does two things:
function recreateObjectAndMaybeGcCollect(plainOldDataObject) {
  var ret = recreateObject(plainOldDataObject);
  maybeGcCollect();
  return ret;
}

Emphasize simple. Make sure you're very clear on what each method does.

And don't use these workarounds normally. Only use them when you have identified a problem and you are certain this fixes the problem, and you have commented why it fixes the problem.

You may find that recreateObject() ends up increasing your app's memory footprint. If you don't understand how, stop using it. Don't name it forceFreeMemory(), because that isn't what it does. (Personally, I'd shy away from global.gc() in all cases, pending overwhelming evidence that it solves a real problem.)

@yanjunz

yanjunz commented Apr 3, 2019

I have encountered the same issue. After digging into it, it turned out that the sliced-string mechanism keeps the memory from being released as soon as possible.
To fix it, use parseInt or console.log to flatten the sliced string.
Here is a post about my case:
https://juejin.im/post/5ca32dc86fb9a05e3a344344

@huiming1313

huiming1313 commented Jan 1, 2020

(' ' + string).replace(/^\s/, '')
Can you guide me on where I should apply this in my code?

var page = await axios.get(url);
$ = cheerio.load(page.data)
$('.products-grid').children('li').each(async(i,e) => {
            let productName = $(e).find('.product-name').text().trim()
            let productPrice = $(e).find('.price').text().trim()
            let productUrl = $(e).find('a').attr('href')
            let productImage = $(e).find('img').attr('src')
            let productId = $(e).find('img').attr('id').split('-')[3]
})

@adamhooper
Author

adamhooper commented Jan 1, 2020

@huiming1313

function unleak(string) {
    return (' ' + string).substr(1)
}

$('.products-grid').children('li').each(async(i,e) => {
            let productName = unleak($(e).find('.product-name').text().trim())
            let productPrice = unleak($(e).find('.price').text().trim())
            let productUrl = unleak($(e).find('a').attr('href'))
            let productImage = unleak($(e).find('img').attr('src'))
            let productId = unleak($(e).find('img').attr('id').split('-')[3])
})

(If you forget even one call to unleak(), the entire HTML will stay in memory, even after $ and page drop out of scope.)

@bennycode

This memory-leak issue is more than 7 years old, and I am wondering whether it is still necessary (in 2021, with cheerio v1.0.0-rc.5) to use an unleak function when accessing text() or attr().

I am asking because I built a crawler with cheerio which runs into the following error from time to time:

2021-03-28T12:28:08.456666+00:00 app[web.1]:
2021-03-28T12:28:08.456708+00:00 app[web.1]: <--- Last few GCs --->
2021-03-28T12:28:08.456709+00:00 app[web.1]:
2021-03-28T12:28:08.456712+00:00 app[web.1]: [28:0x58b5880]   207272 ms: Scavenge 250.9 (257.7) -> 250.4 (257.7) MB, 1.6 / 0.0 ms  (average mu = 0.995, current mu = 0.997) allocation failure
2021-03-28T12:28:08.456712+00:00 app[web.1]: [28:0x58b5880]   207311 ms: Scavenge (reduce) 251.0 (257.7) -> 251.0 (257.9) MB, 6.9 / 0.0 ms  (average mu = 0.995, current mu = 0.997) allocation failure
2021-03-28T12:28:08.456712+00:00 app[web.1]: [28:0x58b5880]   207330 ms: Scavenge (reduce) 251.2 (257.9) -> 251.2 (257.9) MB, 5.7 / 0.0 ms  (average mu = 0.995, current mu = 0.997) allocation failure
2021-03-28T12:28:08.456713+00:00 app[web.1]:
2021-03-28T12:28:08.456713+00:00 app[web.1]:
2021-03-28T12:28:08.456713+00:00 app[web.1]: <--- JS stacktrace --->
2021-03-28T12:28:08.456713+00:00 app[web.1]:
2021-03-28T12:28:08.456720+00:00 app[web.1]: FATAL ERROR: Scavenger: semi-space copy Allocation failed - JavaScript heap out of memory
2021-03-28T12:28:08.457317+00:00 app[web.1]: 1: 0xa877f0 node::Abort() [node]
2021-03-28T12:28:08.457782+00:00 app[web.1]: 2: 0x9abe29 node::FatalError(char const*, char const*) [node]
2021-03-28T12:28:08.458289+00:00 app[web.1]: 3: 0xc6ea6e v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [node]
2021-03-28T12:28:08.458803+00:00 app[web.1]: 4: 0xc6ede7 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [node]
2021-03-28T12:28:08.459393+00:00 app[web.1]: 5: 0xe38865  [node]
2021-03-28T12:28:08.460029+00:00 app[web.1]: 6: 0xeb3d9e v8::internal::SlotCallbackResult v8::internal::Scavenger::EvacuateShortcutCandidate<v8::internal::FullHeapObjectSlot>(v8::internal::Map, v8::internal::FullHeapObjectSlot, v8::internal::ConsString, int) [node]
2021-03-28T12:28:08.460649+00:00 app[web.1]: 7: 0xeb5a25 v8::internal::SlotCallbackResult v8::internal::Scavenger::ScavengeObject<v8::internal::FullHeapObjectSlot>(v8::internal::FullHeapObjectSlot, v8::internal::HeapObject) [node]
2021-03-28T12:28:08.461260+00:00 app[web.1]: 8: 0xebacab v8::internal::Scavenger::Process(v8::internal::OneshotBarrier*) [node]
2021-03-28T12:28:08.461877+00:00 app[web.1]: 9: 0xebade9 v8::internal::ScavengingTask::ProcessItems() [node]
2021-03-28T12:28:08.462491+00:00 app[web.1]: 10: 0xebaf91 v8::internal::ScavengingTask::RunInParallel(v8::internal::ItemParallelJob::Task::Runner) [node]
2021-03-28T12:28:08.463097+00:00 app[web.1]: 11: 0xe52b69 v8::internal::ItemParallelJob::Run() [node]
2021-03-28T12:28:08.463709+00:00 app[web.1]: 12: 0xebc24e v8::internal::ScavengerCollector::CollectGarbage() [node]
2021-03-28T12:28:08.464343+00:00 app[web.1]: 13: 0xe38f54 v8::internal::Heap::Scavenge() [node]
2021-03-28T12:28:08.464946+00:00 app[web.1]: 14: 0xe474e8 v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [node]
2021-03-28T12:28:08.465573+00:00 app[web.1]: 15: 0xe4aa9c v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [node]
2021-03-28T12:28:08.466170+00:00 app[web.1]: 16: 0xe0e9ba v8::internal::Factory::NewFillerObject(int, bool, v8::internal::AllocationType, v8::internal::AllocationOrigin) [node]
2021-03-28T12:28:08.466849+00:00 app[web.1]: 17: 0x116579b v8::internal::Runtime_AllocateInYoungGeneration(int, unsigned long*, v8::internal::Isolate*) [node]
2021-03-28T12:28:08.467592+00:00 app[web.1]: 18: 0x14fc8f9  [node]

@5saviahv
Contributor

I believe it is still an issue in V8, so yes, it affects Cheerio as well.
