
Crawler instances are not disposed #1670

Closed
matjaeck opened this issue Nov 11, 2022 · 9 comments · Fixed by #1679
Labels: bug Something isn't working.


matjaeck commented Nov 11, 2022

Which package is this bug report for? If unsure which one to select, leave blank

@crawlee/core

Issue description

With the addition of the "Pausing" feature and its corresponding new warning messages, I now have to press CTRL+C twice each time I rebuild my crawler during development. This made me notice that crawler instances don't seem to be disposed unless you stop the Node process.

In the example below, the old crawler instances should, by my understanding of garbage collection in JS, have been garbage collected, unless the library somehow keeps references across instantiations. But I'm happy to learn if my understanding or the example is flawed.

Usage:
Install and run the example, then after some cycles stop the process in the terminal with CTRL+C.

You will notice that all crawler instances with their dynamically assigned names are still there. The logging uses an instance method, not a static method:

this.log.warning('Pausing... Press CTRL+C again to force exit. To resume, do: CRAWLEE_PURGE_ON_START=0 npm run start');

This is a (potential) memory leak; the impact of course depends on how much data you store on crawler instances and how long your process runs.
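The suspected mechanism can be shown without crawlee at all. The following is a minimal, self-contained sketch (the `FakeCrawler` class and emitter are illustrative, not crawlee's code): a long-lived emitter whose listeners close over per-instance objects keeps those objects reachable even after every other reference is dropped.

```javascript
import { EventEmitter } from "node:events";

// A long-lived emitter, standing in for something process-wide.
const longLivedEmitter = new EventEmitter();

class FakeCrawler {
  constructor(id) {
    this.id = id;
    this.state = new Array(1024).fill(id); // stand-in for per-crawler data
    // The listener closes over `this`, so the emitter now retains the instance.
    longLivedEmitter.on("SIGINT", () => console.log(`pausing #${this.id}`));
  }
}

for (let i = 0; i < 5; i++) {
  let crawler = new FakeCrawler(i);
  crawler = null; // our reference is dropped, but the listener's is not
}

// One listener (and therefore one FakeCrawler) per iteration is still held:
console.log(longLivedEmitter.listenerCount("SIGINT")); // 5
```

As long as those listeners stay registered, none of the five instances is eligible for garbage collection, which matches the behavior described above.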

Ideas?

Code sample

{
  "name": "crawlee-reproduction",
  "type": "module",
  "dependencies": {
    "crawlee": "^3.1.1",
    "playwright": "^1.27.1"
  }
}
import Crawlee from "crawlee"
// Simulate a long running task, like a worker process
let i = 0
while(true) {
  await crawlerDisposalTest(i)
  i++
}
async function crawlerDisposalTest(i) {
  const navigationQueue = await Crawlee.RequestQueue.open()
  await navigationQueue.addRequest({ url: "https://crawlee.dev/" })
  // For illustrative purposes only
  Object.defineProperty(Crawlee.PlaywrightCrawler, "name", {
    writable: true,
    value: `PlaywrightCrawler#${i}`
  })
  let crawler = new Crawlee.PlaywrightCrawler({
    requestQueue: navigationQueue,
    postNavigationHooks: [
      ctx => { 
        console.log(`Visit #${i}`)
      }
    ],
    requestHandler: async ctx => { },
  })
  await crawler.run()
  await crawler.teardown()
  crawler = null
  console.log(`No more references to PlaywrightCrawler#${i} in my program!`)
  await navigationQueue.drop()
}

Package version

3.1.1

Node.js version

v16.13.1

Operating system

Ubuntu 18.04.6 LTS

Priority this issue should have

High

matjaeck added the bug label on Nov 11, 2022

matjaeck commented Nov 14, 2022

Just for confirmation: below is an updated example that leaks enough memory to make it quickly observable. This will trigger a built-in memory leak warning at around 50 cycles:

Possible AsyncEventEmitter memory leak detected. 51 migrating listeners added to AsyncEventEmitter. Use emitter.setMaxListeners() to increase the limit.
Possible AsyncEventEmitter memory leak detected. 51 aborting listeners added to AsyncEventEmitter. Use emitter.setMaxListeners() to increase the limit.

WARNING: This will eat up your memory if you let it run too long!

import os from "os"

import Crawlee from "crawlee"
// Simulate a long running task, like a worker process
let i = 0
while(true) {
  await crawlerDisposalTest(i)
  i++
}
async function crawlerDisposalTest(i) {
  const navigationQueue = await Crawlee.RequestQueue.open()
  await navigationQueue.addRequest({ url: "https://google.de" })
  // For illustrative purposes only
  Object.defineProperty(Crawlee.PlaywrightCrawler, "name", {
    writable: true,
    value: `PlaywrightCrawler#${i}`
  })
  let crawler = new Crawlee.PlaywrightCrawler({
    requestQueue: navigationQueue,
    requestHandler: async ctx => { }
  })
  await crawler.run()
  await crawler.teardown()
  crawler = null
  console.log(`Available system memory: ${os.freemem()} bytes.`)
  await navigationQueue.drop()
}

matjaeck (Author) commented:

So the code that was intended to store images never runs, as Chromium does not assign the resource type the way I expected, and we probably shouldn't crawl unsplash.com without good reason anyway; but none of that is important here.

The relevant takeaway of my flawed attempt to increase memory consumption is that the crawler instances have quite a sizeable memory footprint even if you don't store any data on them. The above (updated!) example leaks about 1.117 GB on my machine over 100 cycles.
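As a side note on measurement (my suggestion, not from the issue itself): `os.freemem()` reports free memory for the whole machine, so the numbers are noisy. `process.memoryUsage().heapUsed` tracks just this process's JS heap, which makes per-cycle growth easier to read. A small illustrative sketch:

```javascript
// Measure JS-heap growth for this process only, instead of system-wide
// free memory. The retained array stands in for leaked crawler state.
const before = process.memoryUsage().heapUsed;

const retained = [];
for (let i = 0; i < 100; i++) {
  retained.push(new Array(10_000).fill(i)); // ~80 KB of numbers per cycle
}

const after = process.memoryUsage().heapUsed;
console.log(`heap grew by ~${((after - before) / 1024 / 1024).toFixed(1)} MiB`);
```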

matjaeck (Author) commented:

After some profiling with node --inspect and the allocation profiler in chrome://inspect, it seems to me that the problem is located in event handling.

  • Event sources might not be closed, which could prevent garbage collection (under V8's default event plumbing). One example is

    process.once('SIGINT', async () => {

    which I assume is not removed when the crawler is done. Perhaps there are more listeners to SIGINT, as it shows up multiple times in the retainers. Since the process keeps running, everything that listens to process-directed signals will not be eligible for garbage collection. But I think that once the object itself can't be reached anymore, the listeners defined on it should altogether be eligible for garbage collection.

  • The custom AsyncEventEmitter might interfere with the V8 garbage collector, or it keeps references in its _events map. Going by the profiler data, it seems to be another retainer. This could even be the sole cause, since the rules for the default event emitter probably don't apply to a custom implementation (of a pretty fundamental runtime thing...):

    "The event listener functions of an events value are recorded in a map of event listeners of the object: if the object can't be reached in code, the event listeners can't be reached either. If the only way of reaching them was through the event handler map (they were anonymous) they can be garbage collected."

    (https://stackoverflow.com/questions/71439730/are-event-listeners-from-an-eventsource-automatically-garbage-collected)

    This probably does not apply to references held in _events in AsyncEventEmitter.

  • Maybe something different involving the global Configuration object, but I think it only appears in the retainer tree because the custom event emitter is defined on it.
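The cleanup pattern these observations point toward can be sketched as follows (the `Crawler` class and `"migrating"` event here are illustrative, not crawlee's actual internals): keep a reference to each handler so teardown can detach it from the long-lived emitters again.

```javascript
import { EventEmitter } from "node:events";

const events = new EventEmitter(); // stand-in for the shared event manager
const sigintBaseline = process.listenerCount("SIGINT");

class Crawler {
  constructor() {
    // Both handlers close over `this`, so the emitters retain the
    // instance for as long as the listeners stay registered.
    this.onMigrating = () => this.pause();
    this.onSigint = () => this.pause();
    events.on("migrating", this.onMigrating);
    process.on("SIGINT", this.onSigint);
  }

  pause() {}

  teardown() {
    // Detaching the listeners removes the last paths from the long-lived
    // emitters back to this instance, so it becomes eligible for GC.
    events.removeListener("migrating", this.onMigrating);
    process.removeListener("SIGINT", this.onSigint);
  }
}

for (let i = 0; i < 10; i++) {
  new Crawler().teardown();
}

// No listeners (and therefore no Crawler instances) are retained afterwards:
console.log(events.listenerCount("migrating"));                // 0
console.log(process.listenerCount("SIGINT") - sigintBaseline); // 0
```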


matjaeck commented Nov 15, 2022

So it seems that removing the listeners fixes the issue:

diff --git a/packages/basic-crawler/src/internals/basic-crawler.ts b/packages/basic-crawler/src/internals/basic-crawler.ts
index 0bd87c44..592f15bf 100644
--- a/packages/basic-crawler/src/internals/basic-crawler.ts
+++ b/packages/basic-crawler/src/internals/basic-crawler.ts
@@ -1174,6 +1174,8 @@ export class BasicCrawler<Context extends CrawlingContext = BasicCrawlingContext
         }

         await this.autoscaledPool?.abort();
+        this.events.removeAllListeners();
+        process.removeAllListeners('SIGINT');
     }

     protected _handlePropertyNameChange<New, Old>({
diff --git a/packages/core/src/events/event_manager.ts b/packages/core/src/events/event_manager.ts
index 184abf1a..28027e95 100644
--- a/packages/core/src/events/event_manager.ts
+++ b/packages/core/src/events/event_manager.ts
@@ -105,4 +105,8 @@ export abstract class EventManager {
     waitForAllListenersToComplete() {
         return this.events.waitForAllListenersToComplete();
     }
+
+    removeAllListeners() {
+        return this.events.removeAllListeners();
+    }
 }

Here are the heap profiles for the example code below:

import os from "os"
import Crawlee from "crawlee"

// Give the inspector time to attach before the first cycle
await new Promise(resolve => setTimeout(resolve, 5000))

for (let i = 0; i < 3; i++)
  await crawlerDisposalTest(i)

async function crawlerDisposalTest(i) {
  const navigationQueue = await Crawlee.RequestQueue.open()
  await navigationQueue.addRequest({ url: "https://google.de" })
  // For illustrative purposes only
  Object.defineProperty(Crawlee.PlaywrightCrawler, "name", {
    writable: true,
    value: `PlaywrightCrawler#${i}`
  })
  let crawler = new Crawlee.PlaywrightCrawler({
    requestQueue: navigationQueue,
    requestHandler: async ctx => { }
  })
  await crawler.run()
  await crawler.teardown()
  crawler = null
  console.log(`Available system memory: ${os.freemem()} bytes.`)
  await navigationQueue.drop()
}
On master:

[heap profile screenshot]

Note the blue bars: they represent the retained memory of the crawler instances, which cannot be garbage collected due to the references held in the two event emitters' _events maps.

With the above fix:

[heap profile screenshot]

I'm happy to send a pull request, but I don't know if this conflicts with anything else, so feedback is welcome.

There still seem to be "smaller" leaks, but those are not connected to this issue.

szmarczak (Contributor) commented:

Thanks for the very detailed issue @matjaeck! Indeed the listeners aren't removed, which keeps a reference to the instance. I'll send a patch ASAP with proper attribution.

LeMoussel commented:

@matjaeck your analysis is very interesting.
Do you have any tips on how to use node --inspect and the allocation profiler in chrome://inspect?

matjaeck (Author) commented:

@LeMoussel I don't have much experience in using it so I can only give some basic hints:
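The original hints are not preserved in this excerpt. As a general sketch of the standard workflow (my own summary, not matjaeck's advice): start the script with `node --inspect`, open chrome://inspect in Chrome, attach DevTools, and use the Memory tab to take heap snapshots or run allocation sampling. Node can also dump snapshots programmatically, which is handy for before/after comparisons:

```javascript
import v8 from "node:v8";
import fs from "node:fs";

// Dump a heap snapshot of the current process. The returned filename looks
// like Heap.<date>.<time>.<pid>.<thread>.<n>.heapsnapshot.
const file = v8.writeHeapSnapshot();
console.log(`snapshot written to ${file}`);

// Load two such files in DevTools' Memory tab and use the "Comparison" view
// to see which objects survive between crawler cycles, then inspect their
// retainers to find what keeps them alive.
fs.unlinkSync(file); // remove the demo snapshot again
```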

LeMoussel commented:

@matjaeck Thanks

B4nan pushed a commit that referenced this issue Nov 30, 2022
Co-authored-by: matjaeck <80065804+matjaeck@users.noreply.github.com>

Closes #1670
opendeluxe commented:

I came across a similar thing in #2147

The crawler counts requests toward maxRequestsPerCrawl as a total across all requests ever made. Once that limit is reached, no further request will be made, even when a new instance of CheerioCrawler is used for every single request. So the request count seems to be stored globally somewhere and leads to failure once the limit is reached.
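A hypothetical mini-model of that symptom (illustrative only, not crawlee's actual code): if the handled-request count lives in module-level state instead of on the instance, a brand-new crawler still sees the old total.

```javascript
// Module-level counter, shared across every instance.
let handledCount = 0;

class MiniCrawler {
  constructor({ maxRequestsPerCrawl }) {
    this.maxRequestsPerCrawl = maxRequestsPerCrawl;
  }

  run(urls) {
    const visited = [];
    for (const url of urls) {
      // Checks the global count, not a per-crawl one: the bug.
      if (handledCount >= this.maxRequestsPerCrawl) break;
      handledCount += 1;
      visited.push(url);
    }
    return visited;
  }
}

const urls = ["https://a.test", "https://b.test", "https://c.test"];
console.log(new MiniCrawler({ maxRequestsPerCrawl: 2 }).run(urls).length); // 2
console.log(new MiniCrawler({ maxRequestsPerCrawl: 2 }).run(urls).length); // 0: cap already spent
```

The fix is the same shape as for the listener leak: the state has to be scoped to (and disposed with) the instance.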
