
Crawler instances are not disposed #1670

Closed
matjaeck opened this issue Nov 11, 2022 · 9 comments · Fixed by #1679
Labels: bug Something isn't working.


matjaeck commented Nov 11, 2022

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/core

Issue description

With the addition of the "Pausing" feature and its corresponding new warning messages, I now have to press CTRL+C twice each time I rebuild my crawler during development. This made me notice that crawler instances don't seem to be disposed unless you stop the Node process.

In the example below, the old crawler instances should, by my understanding of garbage collection in JS, have been garbage collected, unless the library somehow keeps references across instantiations. But I'm happy to learn if my understanding or the example is flawed.

Usage:
Install and run the example, then after some cycles stop the process in the terminal with CTRL+C.

You will notice that all crawler instances with their dynamically assigned names are still there. The logging uses an instance method, not a static method:

this.log.warning('Pausing... Press CTRL+C again to force exit. To resume, do: CRAWLEE_PURGE_ON_START=0 npm run start');

This is a (potential) memory leak; the impact of course depends on how much data you store on crawler instances and how long your process runs.
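The suspected mechanism can be shown without crawlee at all. The following is a minimal, self-contained sketch (the `FakeCrawler` class and emitter are illustrative, not crawlee's code): a long-lived emitter whose listeners close over per-instance objects keeps those objects reachable even after every other reference is dropped.

```javascript
import { EventEmitter } from "node:events";

// A long-lived emitter, standing in for something process-wide.
const longLivedEmitter = new EventEmitter();

class FakeCrawler {
  constructor(id) {
    this.id = id;
    this.state = new Array(1024).fill(id); // stand-in for per-crawler data
    // The listener closes over `this`, so the emitter now retains the instance.
    longLivedEmitter.on("SIGINT", () => console.log(`pausing #${this.id}`));
  }
}

for (let i = 0; i < 5; i++) {
  let crawler = new FakeCrawler(i);
  crawler = null; // our reference is dropped, but the listener's is not
}

// One listener (and therefore one FakeCrawler) per iteration is still held:
console.log(longLivedEmitter.listenerCount("SIGINT")); // 5
```

As long as those listeners stay registered, none of the five instances is eligible for garbage collection, which matches the behavior described above.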

Ideas?

Code sample

{
  "name": "crawlee-reproduction",
  "type": "module",
  "dependencies": {
    "crawlee": "^3.1.1",
    "playwright": "^1.27.1"
  }
}
import Crawlee from "crawlee"
// Simulate a long running task, like a worker process
let i = 0
while(true) {
  await crawlerDisposalTest(i)
  i++
}
async function crawlerDisposalTest(i) {
  const navigationQueue = await Crawlee.RequestQueue.open()
  await navigationQueue.addRequest({ url: "https://crawlee.dev/" })
  // For illustrative purposes only
  Object.defineProperty(Crawlee.PlaywrightCrawler, "name", {
    writable: true,
    value: `PlaywrightCrawler#${i}`
  })
  let crawler = new Crawlee.PlaywrightCrawler({
    requestQueue: navigationQueue,
    postNavigationHooks: [
      ctx => { 
        console.log(`Visit #${i}`)
      }
    ],
    requestHandler: async ctx => { },
  })
  await crawler.run()
  await crawler.teardown()
  crawler = null
  console.log(`No more references to PlaywrightCrawler#${i} in my program!`)
  await navigationQueue.drop()
}

Package version

3.1.1

Node.js version

v16.13.1

Operating system

Ubuntu 18.04.6 LTS

Priority this issue should have

High

matjaeck added the bug label on Nov 11, 2022

matjaeck commented Nov 14, 2022

Just for confirmation: below is an updated example that leaks enough memory to make it quickly observable. This will trigger a built-in memory leak warning at around 50 cycles:

Possible AsyncEventEmitter memory leak detected. 51 migrating listeners added to AsyncEventEmitter. Use emitter.setMaxListeners() to increase the limit.
Possible AsyncEventEmitter memory leak detected. 51 aborting listeners added to AsyncEventEmitter. Use emitter.setMaxListeners() to increase the limit.

WARNING: This will eat up your memory if you let it run too long!

import os from "os"

import Crawlee from "crawlee"
// Simulate a long running task, like a worker process
let i = 0
while(true) {
  await crawlerDisposalTest(i)
  i++
}
async function crawlerDisposalTest(i) {
  const navigationQueue = await Crawlee.RequestQueue.open()
  await navigationQueue.addRequest({ url: "https://google.de" })
  // For illustrative purposes only
  Object.defineProperty(Crawlee.PlaywrightCrawler, "name", {
    writable: true,
    value: `PlaywrightCrawler#${i}`
  })
  let crawler = new Crawlee.PlaywrightCrawler({
    requestQueue: navigationQueue,
    requestHandler: async ctx => { }
  })
  await crawler.run()
  await crawler.teardown()
  crawler = null
  console.log(`Available system memory: ${os.freemem()} bytes.`)
  await navigationQueue.drop()
}

matjaeck (Author) commented:

So the code that was intended to store images never runs, as Chromium does not assign the resource type the way I expected, and we probably shouldn't crawl unsplash.com without good reason anyway; but none of that is important here.

The relevant takeaway of my flawed attempt to increase memory consumption is that the crawler instances have quite a sizeable memory footprint even if you don't store any data on them. The above (updated!) example leaks about 1.117 GB on my machine over 100 cycles.
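As a side note on measurement (my suggestion, not from the issue itself): `os.freemem()` reports free memory for the whole machine, so the numbers are noisy. `process.memoryUsage().heapUsed` tracks just this process's JS heap, which makes per-cycle growth easier to read. A small illustrative sketch:

```javascript
// Measure JS-heap growth for this process only, instead of system-wide
// free memory. The retained array stands in for leaked crawler state.
const before = process.memoryUsage().heapUsed;

const retained = [];
for (let i = 0; i < 100; i++) {
  retained.push(new Array(10_000).fill(i)); // ~80 KB of numbers per cycle
}

const after = process.memoryUsage().heapUsed;
console.log(`heap grew by ~${((after - before) / 1024 / 1024).toFixed(1)} MiB`);
```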

matjaeck (Author) commented:

After some profiling with node --inspect and the allocation profiler in chrome://inspect, it seems to me that the problem is located in event handling.

  • Event sources might not be closed, which could prevent garbage collection (under V8's default event plumbing). One example is

    process.once('SIGINT', async () => {

    which I assume is not removed when the crawler is done. Perhaps there are more listeners to SIGINT, as it shows up multiple times in the retainers. Since the process keeps running, everything that listens to process-directed signals will not be eligible for garbage collection. But I think that once the object itself can't be reached anymore, the listeners defined on it should altogether be eligible for garbage collection.

  • The custom AsyncEventEmitter might interfere with the V8 garbage collector, or it keeps references in its _events map. Going by the profiler data, it seems to be another retainer. This could even be the sole cause, since the rules for the default event emitter probably don't apply to a custom implementation (of a pretty fundamental runtime thing...):

    "The event listener functions of an events value are recorded in a map of event listeners of the object: if the object can't be reached in code, the event listeners can't be reached either. If the only way of reaching them was through the event handler map (they were anonymous) they can be garbage collected."

    (https://stackoverflow.com/questions/71439730/are-event-listeners-from-an-eventsource-automatically-garbage-collected)

    This probably does not apply to references held in _events in AsyncEventEmitter.

  • Maybe something different involving the global Configuration object, but I think it only appears in the retainer tree because the custom event emitter is defined on it.
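The cleanup pattern these observations point toward can be sketched as follows (the `Crawler` class and `"migrating"` event here are illustrative, not crawlee's actual internals): keep a reference to each handler so teardown can detach it from the long-lived emitters again.

```javascript
import { EventEmitter } from "node:events";

const events = new EventEmitter(); // stand-in for the shared event manager
const sigintBaseline = process.listenerCount("SIGINT");

class Crawler {
  constructor() {
    // Both handlers close over `this`, so the emitters retain the
    // instance for as long as the listeners stay registered.
    this.onMigrating = () => this.pause();
    this.onSigint = () => this.pause();
    events.on("migrating", this.onMigrating);
    process.on("SIGINT", this.onSigint);
  }

  pause() {}

  teardown() {
    // Detaching the listeners removes the last paths from the long-lived
    // emitters back to this instance, so it becomes eligible for GC.
    events.removeListener("migrating", this.onMigrating);
    process.removeListener("SIGINT", this.onSigint);
  }
}

for (let i = 0; i < 10; i++) {
  new Crawler().teardown();
}

// No listeners (and therefore no Crawler instances) are retained afterwards:
console.log(events.listenerCount("migrating"));                // 0
console.log(process.listenerCount("SIGINT") - sigintBaseline); // 0
```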


matjaeck commented Nov 15, 2022

So it seems that removing the listeners fixes the issue:

diff --git a/packages/basic-crawler/src/internals/basic-crawler.ts b/packages/basic-crawler/src/internals/basic-crawler.ts
index 0bd87c44..592f15bf 100644
--- a/packages/basic-crawler/src/internals/basic-crawler.ts
+++ b/packages/basic-crawler/src/internals/basic-crawler.ts
@@ -1174,6 +1174,8 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
         }

         await this.autoscaledPool?.abort();
+        this.events.removeAllListeners();
+        process.removeAllListeners('SIGINT');
     }

     protected _handlePropertyNameChange<New, Old>({
diff --git a/packages/core/src/events/event_manager.ts b/packages/core/src/events/event_manager.ts
index 184abf1a..28027e95 100644
--- a/packages/core/src/events/event_manager.ts
+++ b/packages/core/src/events/event_manager.ts
@@ -105,4 +105,8 @@ export abstract class EventManager {
     waitForAllListenersToComplete() {
         return this.events.waitForAllListenersToComplete();
     }
+
+    removeAllListeners() {
+        return this.events.removeAllListeners();
+    }
 }

Here are the heap profiles for the example code below:

import os from "os"
import Crawlee from "crawlee"

// Give the inspector time to attach before the first cycle
await new Promise(resolve => setTimeout(resolve, 5000))

for (let i = 0; i < 3; i++)
  await crawlerDisposalTest(i)

async function crawlerDisposalTest(i) {
  const navigationQueue = await Crawlee.RequestQueue.open()
  await navigationQueue.addRequest({ url: "https://google.de" })
  // For illustrative purposes only
  Object.defineProperty(Crawlee.PlaywrightCrawler, "name", {
    writable: true,
    value: `PlaywrightCrawler#${i}`
  })
  let crawler = new Crawlee.PlaywrightCrawler({
    requestQueue: navigationQueue,
    requestHandler: async ctx => { }
  })
  await crawler.run()
  await crawler.teardown()
  crawler = null
  console.log(`Available system memory: ${os.freemem()} bytes.`)
  await navigationQueue.drop()
}
On master:

[heap profile screenshot]

Note the blue bars: they represent the retained memory of the crawler instances, which cannot be garbage collected due to the references held in the two event emitters' _events maps.

With the above fix:

[heap profile screenshot]

I'm happy to send a pull request, but I don't know if this conflicts with anything else, so feedback is welcome.

There still seem to be "smaller" leaks, but those are not connected to this issue.

szmarczak (Contributor) commented:

Thanks for the very detailed issue @matjaeck! Indeed the listeners aren't removed, which keeps a reference to the instance. I'll send a patch ASAP with proper attribution.

LeMoussel commented:

@matjaeck your analysis is very interesting.
Do you have any tips on how to use node --inspect and the allocation profiler in chrome://inspect?

matjaeck (Author) commented:

@LeMoussel I don't have much experience in using it so I can only give some basic hints:
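The original hints are not preserved in this excerpt. As a general sketch of the standard workflow (my own summary, not matjaeck's advice): start the script with `node --inspect`, open chrome://inspect in Chrome, attach DevTools, and use the Memory tab to take heap snapshots or run allocation sampling. Node can also dump snapshots programmatically, which is handy for before/after comparisons:

```javascript
import v8 from "node:v8";
import fs from "node:fs";

// Dump a heap snapshot of the current process. The returned filename looks
// like Heap.<date>.<time>.<pid>.<thread>.<n>.heapsnapshot.
const file = v8.writeHeapSnapshot();
console.log(`snapshot written to ${file}`);

// Load two such files in DevTools' Memory tab and use the "Comparison" view
// to see which objects survive between crawler cycles, then inspect their
// retainers to find what keeps them alive.
fs.unlinkSync(file); // remove the demo snapshot again
```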

LeMoussel commented:

@matjaeck Thanks

B4nan pushed a commit that referenced this issue Nov 30, 2022
Co-authored-by: matjaeck <80065804+matjaeck@users.noreply.github.com>

Closes #1670
opendeluxe commented:

I came across a similar thing in #2147

The crawler counts requests toward maxRequestsPerCrawl as a total across all requests ever made. Once that limit is reached, no further request will be made, even when a new instance of CheerioCrawler is used for every single request. So the request count seems to be stored globally somewhere and leads to failure once the limit is reached.
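A hypothetical mini-model of that symptom (illustrative only, not crawlee's actual code): if the handled-request count lives in module-level state instead of on the instance, a brand-new crawler still sees the old total.

```javascript
// Module-level counter, shared across every instance.
let handledCount = 0;

class MiniCrawler {
  constructor({ maxRequestsPerCrawl }) {
    this.maxRequestsPerCrawl = maxRequestsPerCrawl;
  }

  run(urls) {
    const visited = [];
    for (const url of urls) {
      // Checks the global count, not a per-crawl one: the bug.
      if (handledCount >= this.maxRequestsPerCrawl) break;
      handledCount += 1;
      visited.push(url);
    }
    return visited;
  }
}

const urls = ["https://a.test", "https://b.test", "https://c.test"];
console.log(new MiniCrawler({ maxRequestsPerCrawl: 2 }).run(urls).length); // 2
console.log(new MiniCrawler({ maxRequestsPerCrawl: 2 }).run(urls).length); // 0: cap already spent
```

The fix is the same shape as for the listener leak: the state has to be scoped to (and disposed with) the instance.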
