Occasional stack traces from the CLI #798
Comments
Here is another error that I sometimes get: And another: (I have had to redact the URL from the errors, but nothing else was changed.)
Would it be possible to have the SingleFile CLI automatically retry in case of an error? All of these errors come from Puppeteer. Would it be more reliable to use jsdom? What are the advantages / disadvantages of using jsdom instead of Chrome with SingleFile?
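As far as I know, the SingleFile CLI has no built-in retry flag, but a small wrapper script can emulate one. A minimal sketch in Python; the command line shown in the usage comment is a placeholder, and the `attempts` and `delay` parameters are illustrative defaults of the wrapper, not SingleFile options:

```python
import subprocess
import time

def run_with_retries(cmd, attempts=3, delay=5):
    """Run a command, retrying on a non-zero exit code.

    `cmd` is an argument list; `attempts` and `delay` are
    illustrative wrapper parameters, not SingleFile options.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            time.sleep(delay)  # back off before the next attempt
    return False

# Hypothetical usage; the exact invocation depends on your setup:
# run_with_retries(["single-file", "https://example.com", "page.html"])
```

A wrapper like this also works around transient network failures without touching the CLI itself.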
Out of curiosity, and because you don't mention them: did you try to use the options
I followed your other suggestion. (Also, to follow up on my comments about jsdom: it doesn't seem to work as well as using Chrome. When I tried using jsdom to download various Wikipedia pages, the images were missing.)
The errors you see are related to Puppeteer. You could use Playwright as an alternative, but you have to install it manually with npm.
Here is another error I sometimes get. As mentioned above, I have been able to download the page successfully by retrying; however, this requires manual intervention (although I am trying to use bash scripts where possible). Hence my asking whether the SingleFile CLI could automatically retry in case of an error.
(I think we commented at the same time.) Is Playwright more reliable in your experience? Can I use
I don't know if Playwright is more reliable; I have not done any intensive testing. It's a very popular alternative to Puppeteer, though. Crawling in SingleFile CLI means processing multiple URLs in a batch. Regarding the intermittent errors you're encountering, maybe SingleFile consumes too much CPU; did you try to set
I thought I'd try out the crawl options. I did run out of memory at some point, but I was able to resume the downloads. Since the crawl session file is modified during the crawl, what happens if we get a crash (like the one above)? Is the file modified in a crash-proof way (i.e., it won't be left in an inconsistent state, such as not being valid JSON)? Also, is it guaranteed that if there is an error, then no HTML file will be created (i.e., HTML files are only created after a successful download, with no partial or zero-byte files)?
Was it the Node or Chrome processes?
Glad to hear it :)
I don't know yet, I need to read the doc.
Yes. However, I cannot guarantee they will be complete, for the same reason as in the previous question.
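On the crash-proofing question: a common technique (not necessarily what SingleFile actually does) is to write the session file to a temporary file and then atomically rename it over the old one, so a crash can never leave a half-written JSON file behind. A generic sketch:

```python
import json
import os
import tempfile

def save_json_atomically(path, data):
    """Write JSON so a crash never leaves a truncated file.

    The data lands in a temporary file first; os.replace() then
    swaps it into place atomically, so `path` always holds either
    the old complete file or the new complete file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes hit the disk
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)  # clean up the temporary file on failure
        raise
```

The temporary file is created in the same directory as the target because `os.replace` is only atomic within a single filesystem.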
It was the Node processes that crashed. The full stack trace was:
Do you know if this memory leak error is more likely to occur when there are capture errors?
I didn't see any errors printed to standard error or output before the stack trace from Node.
I'll try to reproduce the issue. Do you use Puppeteer or Playwright?
I used the default, Puppeteer.
I am running the following command from a Bash shell (MinGW on Windows 10):
Note that I am using the Docker image and the --urls-file option. Sometimes I get the following error:
Sometimes I get the following different error:
I can download the pages at the URLs that failed by trying again. However, I would only usually expect to get a stack trace from an internal error (not a network connection error, or whatever might be the underlying cause here).
One difficulty I have is that there is no option to "resume" downloading pages should some pages fail to download. Utilities such as youtube-dl allow you to run them a second time to continue downloading files that were not downloaded in the previous run. If partial files or zero-byte files can be left behind after an error, then one has to inspect the log to be sure that all pages have downloaded correctly (youtube-dl, for example, creates .part files that are renamed only once the file is fully downloaded, to avoid this problem and allow resuming of downloads). Many thanks!
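In the absence of such guarantees, a post-run check can at least flag suspicious output. This sketch lists zero-byte output files in a capture directory; the flat directory layout and the .html suffix are assumptions about how the captures are saved:

```python
import os

def find_empty_outputs(directory, suffix=".html"):
    """Return names of output files that are zero bytes,
    i.e. likely failed or interrupted captures."""
    empty = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(suffix) and os.path.isfile(path) and os.path.getsize(path) == 0:
            empty.append(name)
    return empty
```

Running this after a batch gives a list of pages to re-download without having to read through the whole log.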