
Occasional stack traces from the CLI #798

Open
andrewdbate opened this issue Oct 14, 2021 · 15 comments
andrewdbate commented Oct 14, 2021

I am running the following command from a Bash shell (MinGW on Windows 10):

docker run --mount "type=bind,src=$PWD/cookiedir,dst=/cookiedir" --mount "type=bind,src=$PWD/sitedir,dst=/sitedir" singlefile --browser-cookies-file=/cookiedir/cookies.txt --urls-file="/sitedir/urls.txt" --output-directory="/sitedir" --dump-content=false --filename-template="{url-pathname-flat}.html"

Note that I am using the Docker image and the --urls-file option.

Sometimes I get the following error:

Execution context was destroyed, most likely because of a navigation. URL: <redacted>
Stack: Error: Execution context was destroyed, most likely because of a navigation.
    at rewriteError (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:265:23)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async ExecutionContext._evaluateInternal (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:219:60)
    at async ExecutionContext.evaluate (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/ExecutionContext.js:110:16)
    at async getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:139:10)
    at async Object.exports.getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:51:10)
    at async capturePage (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:248:20)
    at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:169:20)
    at async Promise.all (index 0)

Sometimes I get the following different error:

Navigation failed because browser has disconnected! URL: <redacted>
Stack: Error: Navigation failed because browser has disconnected!
    at /usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/LifecycleWatcher.js:51:147
    at /usr/src/app/node_modules/puppeteer-core/lib/cjs/vendor/mitt/src/index.js:51:62
    at Array.map (<anonymous>)
    at Object.emit (/usr/src/app/node_modules/puppeteer-core/lib/cjs/vendor/mitt/src/index.js:51:43)
    at CDPSession.emit (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/EventEmitter.js:72:22)
    at CDPSession._onClosed (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:256:14)
    at Connection._onMessage (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:99:25)
    at WebSocket.<anonymous> (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/node/NodeWebSocketTransport.js:13:32)
    at WebSocket.onMessage (/usr/src/app/node_modules/puppeteer-core/node_modules/ws/lib/event-target.js:132:16)
    at WebSocket.emit (events.js:315:20)

I can download the pages at the URLs that failed by trying again. However, I would usually only expect to get a stack trace from an internal error (not from a network connection error, or whatever the underlying cause might be here).

One difficulty I have is that there is no option to "resume" downloading pages should some pages fail to download. Utilities such as youtube-dl allow you to run them a second time to continue downloading files that were not downloaded in the previous run.

  1. It would be good if the above errors were more user-friendly (or explained what to do to fix the problem).
  2. It would also be good if downloads from a list of URLs could be resumed if interrupted or incomplete (similar to youtube-dl, for example).
  3. Finally, is it guaranteed that if there is an error, then no file will be produced (i.e. HTML files are only created after a successful download)?
    If partial files or zero-byte files can be left behind after an error, then one has to inspect the log to be sure that all pages downloaded correctly. (youtube-dl avoids this by creating .part files that are renamed only once the download is complete, which also allows downloads to be resumed.)
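In the meantime, retries can be scripted around the CLI from Bash. Below is a minimal sketch of a retry wrapper; the retry helper and the commented-out invocation are illustrative, not part of SingleFile:

```shell
# Hypothetical retry helper: run a command up to N times until it succeeds.
retry() {
  local attempts=$1; shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    echo "Attempt $i/$attempts failed" >&2
  done
  return 1
}

# Example (illustrative): wrap the docker invocation shown above.
# retry 3 docker run ... singlefile --urls-file="/sitedir/urls.txt" ...
```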

Many thanks!

andrewdbate changed the title from Occasional errors from the CLI to Occasional stack traces from the CLI on Oct 14, 2021
andrewdbate commented Oct 15, 2021

Here is another error that I sometimes get:

Protocol error: Connection closed. Most likely the page has been closed. URL: <redacted>
Stack: Error: Protocol error: Connection closed. Most likely the page has been closed.
    at Object.assert (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/assert.js:26:15)
    at Page.close (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Page.js:2069:21)
    at Object.exports.getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:54:15)
    at async capturePage (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:248:20)
    at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:169:20)
    at async Promise.all (index 0)
    at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:185:3)
    at async Promise.all (index 0)
    at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:185:3)
    at async Promise.all (index 0)

And another:

net::ERR_ABORTED at <redacted> URL: <redacted>
Stack: Error: net::ERR_ABORTED at <redacted>
    at navigate (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/FrameManager.js:116:23)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async FrameManager.navigateFrame (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/FrameManager.js:91:21)
    at async Frame.goto (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/FrameManager.js:417:16)
    at async Page.goto (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Page.js:1156:16)
    at async pageGoto (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:187:3)
    at async handleJSRedirect (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:164:3)
    at async getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:148:21)
    at async Object.exports.getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:51:10)

(I have had to redact the URL from the errors, but nothing else was changed.)

andrewdbate commented Oct 15, 2021

Would it be possible to have the SingleFile CLI automatically retry in case of an error?

All of these errors come from Puppeteer. Would it be more reliable to use --back-end=jsdom instead?

What are the advantages / disadvantages of using jsdom instead of Chrome with SingleFile?

gildas-lormeau (owner) commented Oct 18, 2021

Because you don't mention them and out of curiosity, did you try to use the options --crawl-save-session, --crawl-load-session or --crawl-sync-session?

andrewdbate commented Oct 18, 2021

I followed your other suggestion, which was to use the sitemap.xml so I do not need to use the crawl options.

(Also, to follow up on my comments about jsdom: it doesn't seem to work as well as using Chrome. When I tried using jsdom to download various Wikipedia pages, the images were missing.)

gildas-lormeau (owner) commented Oct 18, 2021

I recommend using --crawl-save-session, for example; it will allow you to identify which URLs failed when processing multiple URLs.

The errors you see are related to puppeteer. You could use playwright as an alternative, but you have to install it manually with npm by running npm install playwright in the SingleFile folder. Then pass --back-end=playwright to use it.

andrewdbate commented Oct 18, 2021

Here is another error I sometimes get:

Protocol error (Target.closeTarget): Target closed. URL: <redacted>
Stack: Error: Protocol error (Target.closeTarget): Target closed.
    at /usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:71:63
    at new Promise (<anonymous>)
    at Connection.send (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Connection.js:70:16)
    at Page.close (/usr/src/app/node_modules/puppeteer-core/lib/cjs/puppeteer/common/Page.js:2075:44)
    at Object.exports.getPageData (/usr/src/app/node_modules/single-file/cli/back-ends/puppeteer.js:54:15)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async capturePage (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:248:20)
    at async runNextTask (/usr/src/app/node_modules/single-file/cli/single-file-cli-api.js:169:20)
    at async Promise.all (index 0)

However, the error I posted in an earlier comment which says Execution context was destroyed, most likely because of a navigation is by far the most frequently occurring.

As mentioned above, I have been able to download the pages successfully by retrying; however, this requires manual intervention (although I am trying to use Bash scripts where possible). Hence my asking whether the SingleFile CLI could automatically retry in case of an error.

andrewdbate commented Oct 18, 2021

(I think we commented at the same time.)

Is playwright more reliable in your experience?

Can I use --crawl-save-session even though I do not want to crawl? (I prefer to use the sitemap.xml now because I know it is accurate, i.e., it contains all pages that need to be downloaded.)

gildas-lormeau (owner) commented Oct 18, 2021

I don't know if playwright is more reliable; I have not done any intensive testing. It's a very popular alternative to puppeteer, though.

Crawling in SingleFile CLI means processing multiple URLs in a batch. The option --crawl-save-session should work if you use --urls-file for example.

Regarding the intermittent errors you're encountering, maybe SingleFile consumes too much CPU. Did you try setting --max-parallel-workers to 2, for example? (It should not be higher than your number of logical CPU cores.)
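The cap suggested above can be derived from the machine's logical core count rather than hard-coded. A small Bash sketch (the --max-parallel-workers flag is from this thread; the rest is illustrative):

```shell
# Detect the logical core count, then cap the worker count at 2
# (or lower on machines with fewer cores).
cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
workers=$(( cores < 2 ? cores : 2 ))
echo "max-parallel-workers=$workers"

# Illustrative invocation:
# single-file --urls-file=urls.txt --max-parallel-workers="$workers"
```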

andrewdbate commented Oct 19, 2021

I thought I'd try out the --crawl-save-session option with --urls-file to see how it works. I used a URLs file with ~3800 URLs.

I did run out of memory at some point:

<--- Last few GCs --->

[29912:0000024F43C1BF10]  4559331 ms: Scavenge (reduce) 4093.2 (4102.8) -> 4093.2 (4105.0) MB, 4.1 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure
[29912:0000024F43C1BF10]  4559343 ms: Scavenge (reduce) 4093.9 (4108.0) -> 4093.7 (4108.8) MB, 5.6 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure
[29912:0000024F43C1BF10]  4559401 ms: Scavenge (reduce) 4094.3 (4104.0) -> 4094.2 (4106.0) MB, 5.1 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure


<--- JS stacktrace --->

FATAL ERROR: MarkCompactCollector: young object promotion failed Allocation failed - JavaScript heap out of memory
 1: 00007FF7D1F3052F napi_wrap+109311
 2: 00007FF7D1ED5256 v8::internal::OrderedHashTable<v8::internal::OrderedHashSet,1>::NumberOfElementsOffset+33302
<rest of trace elided>

But I was able to resume the downloads with the --crawl-sync-session option and download all URLs. So that's good!

Since the crawl session file is modified during the crawl, what happens if we get a crash (like the one above)? Is the file modified in a crash-proof way (i.e., one that won't leave the file in an inconsistent state, e.g. not valid JSON)?

Also, is it guaranteed that if there is an error, then no HTML file will be created (i.e. HTML files are only created after a successful download, no partial or zero byte files)?

gildas-lormeau (owner) commented Oct 19, 2021

I did run out of memory at some point:

<--- Last few GCs --->

[29912:0000024F43C1BF10]  4559331 ms: Scavenge (reduce) 4093.2 (4102.8) -> 4093.2 (4105.0) MB, 4.1 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure
[29912:0000024F43C1BF10]  4559343 ms: Scavenge (reduce) 4093.9 (4108.0) -> 4093.7 (4108.8) MB, 5.6 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure
[29912:0000024F43C1BF10]  4559401 ms: Scavenge (reduce) 4094.3 (4104.0) -> 4094.2 (4106.0) MB, 5.1 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure


<--- JS stacktrace --->

FATAL ERROR: MarkCompactCollector: young object promotion failed Allocation failed - JavaScript heap out of memory
 1: 00007FF7D1F3052F napi_wrap+109311
 2: 00007FF7D1ED5256 v8::internal::OrderedHashTable<v8::internal::OrderedHashSet,1>::NumberOfElementsOffset+33302
<rest of trace elided>

Was it the Node or Chrome processes?

But I was able to resume the downloads with the --crawl-sync-session option to download all URLs. So that's good!

Glad to hear it :)

Since the crawl session file is modified during the crawl, what happens if we get a crash (like the one above)? Is the file modified in a crash proof way (i.e., it won't leave the file in an inconsistent state, e.g. not a valid json file)?

I don't know yet; I need to read the doc.
Edit: It's not documented, cf. https://nodejs.org/api/fs.html#fswritefilesyncfile-data-options
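For reference, the usual way to make such updates crash-safe is the write-then-rename pattern that youtube-dl's .part files follow. A Bash sketch of the idea (write_atomic is a hypothetical helper, not something SingleFile provides):

```shell
# Write stdin to "$1.part", then atomically rename it over "$1".
# rename(2) is atomic on POSIX filesystems, so a crash mid-write leaves
# either the old file or the new one, never a truncated mixture.
write_atomic() {
  local dest=$1
  local tmp="${dest}.part"
  cat > "$tmp"
  mv -f "$tmp" "$dest"
}

# Example: update a session file safely.
echo '{"urls": []}' | write_atomic session.json
```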

Also, is it guaranteed that if there is an error, then no HTML file will be created (i.e. HTML files are only created after a successful download, no partial or zero byte files)?

Yes. However, I cannot guarantee the files will be complete, for the same reason as in the previous question.

andrewdbate commented Oct 19, 2021

It was the Node process that crashed. The full stack trace was:

<--- Last few GCs --->

[29912:0000024F43C1BF10]  4559331 ms: Scavenge (reduce) 4093.2 (4102.8) -> 4093.2 (4105.0) MB, 4.1 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure
[29912:0000024F43C1BF10]  4559343 ms: Scavenge (reduce) 4093.9 (4108.0) -> 4093.7 (4108.8) MB, 5.6 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure
[29912:0000024F43C1BF10]  4559401 ms: Scavenge (reduce) 4094.3 (4104.0) -> 4094.2 (4106.0) MB, 5.1 / 0.0 ms  (average mu = 0.625, current mu = 0.579) allocation failure


<--- JS stacktrace --->

FATAL ERROR: MarkCompactCollector: young object promotion failed Allocation failed - JavaScript heap out of memory
 1: 00007FF7D1F3052F napi_wrap+109311
 2: 00007FF7D1ED5256 v8::internal::OrderedHashTable<v8::internal::OrderedHashSet,1>::NumberOfElementsOffset+33302
 3: 00007FF7D1ED6026 node::OnFatalError+294
 4: 00007FF7D27A163E v8::Isolate::ReportExternalAllocationLimitReached+94
 5: 00007FF7D27864BD v8::SharedArrayBuffer::Externalize+781
 6: 00007FF7D263094C v8::internal::Heap::EphemeronKeyWriteBarrierFromCode+1516
 7: 00007FF7D261B58B v8::internal::NativeContextInferrer::Infer+59243
 8: 00007FF7D2600ABF v8::internal::MarkingWorklists::SwitchToContextSlow+57327
 9: 00007FF7D261470B v8::internal::NativeContextInferrer::Infer+30955
10: 00007FF7D260B82D v8::internal::MarkCompactCollector::EnsureSweepingCompleted+6269
11: 00007FF7D261395E v8::internal::NativeContextInferrer::Infer+27454
12: 00007FF7D26178EB v8::internal::NativeContextInferrer::Infer+43723
13: 00007FF7D2621142 v8::internal::ItemParallelJob::Task::RunInternal+18
14: 00007FF7D26210D1 v8::internal::ItemParallelJob::Run+641
15: 00007FF7D25F49D3 v8::internal::MarkingWorklists::SwitchToContextSlow+7939
16: 00007FF7D260BCDC v8::internal::MarkCompactCollector::EnsureSweepingCompleted+7468
17: 00007FF7D260A524 v8::internal::MarkCompactCollector::EnsureSweepingCompleted+1396
18: 00007FF7D2608088 v8::internal::MarkingWorklists::SwitchToContextSlow+87480
19: 00007FF7D26366D1 v8::internal::Heap::LeftTrimFixedArray+929
20: 00007FF7D26387B5 v8::internal::Heap::PageFlagsAreConsistent+789
21: 00007FF7D262DA61 v8::internal::Heap::CollectGarbage+2033
22: 00007FF7D2634855 v8::internal::Heap::GlobalSizeOfObjects+229
23: 00007FF7D266EC9B v8::internal::StackGuard::HandleInterrupts+891
24: 00007FF7D237DB26 v8::internal::interpreter::JumpTableTargetOffsets::iterator::operator=+8182
25: 00007FF7D2829FED v8::internal::SetupIsolateDelegate::SetupHeap+463949
26: 00007FF7D280B393 v8::internal::SetupIsolateDelegate::SetupHeap+337907
27: 000001CD8DDD17CF

gildas-lormeau (owner) commented Oct 19, 2021

Do you know if this memory leak error is more likely to occur when there are capture errors?

andrewdbate commented Oct 19, 2021

I didn't see any errors printed to standard error or output before the stack trace from Node.

gildas-lormeau (owner) commented Oct 19, 2021

I'll try to reproduce the issue. Do you use puppeteer or playwright?

andrewdbate commented Oct 19, 2021

I used the default, Puppeteer.
