
CLI option --crawl-replace-urls does not do anything #801

Open
andrewdbate opened this issue Oct 19, 2021 · 3 comments

andrewdbate commented Oct 19, 2021

When I run this command:

single-file --output-directory=outdir --dump-content=false --filename-template="{url-pathname-flat}.html" --crawl-links --crawl-save-session=session.json --crawl-replace-urls=true https://en.wikipedia.org/wiki/Thomas_Lipton

none of the files in the outdir directory have URLs of saved pages replaced with relative paths of other saved pages in outdir.

When I run this command, _wiki_Thomas_Lipton.html is downloaded to outdir. This is the file for the URL from which the crawl started.

The Wikipedia page https://en.wikipedia.org/wiki/Thomas_Lipton has a link to https://en.wikipedia.org/wiki/Self-made_man in the first sentence. This page was also downloaded by SingleFile as _wiki_Self-made_man.html.

I was expecting the href to https://en.wikipedia.org/wiki/Self-made_man in _wiki_Thomas_Lipton.html to be rewritten to _wiki_Self-made_man.html but it was not. Am I using the CLI options incorrectly?
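(For reference, a quick way to verify whether the rewrite happened is to search the saved page for the absolute URL. The snippet below is a self-contained illustration using a stand-in file with hypothetical content, not SingleFile itself:)

```shell
# Stand-in for the saved page (hypothetical content for illustration).
mkdir -p outdir
printf '<a href="https://en.wikipedia.org/wiki/Self-made_man">self-made man</a>\n' \
  > outdir/_wiki_Thomas_Lipton.html

# If --crawl-replace-urls had worked, this grep would count zero matches and
# the href would instead read "_wiki_Self-made_man.html".
grep -c 'href="https://en.wikipedia.org/wiki/Self-made_man"' outdir/_wiki_Thomas_Lipton.html
# → 1
```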

gildas-lormeau (Owner) commented Oct 19, 2021

Did you interrupt the command? URLs are replaced when all the pages have been crawled.
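(For context, the kind of post-crawl substitution the option is expected to perform can be sketched as below: once every page has been crawled, each crawled URL is replaced by the relative path of its saved file. This is a hypothetical sed-based illustration against a stand-in file, not SingleFile's actual implementation:)

```shell
# Hypothetical saved page containing an absolute link to another crawled page.
mkdir -p outdir
printf '<a href="https://en.wikipedia.org/wiki/Self-made_man">link</a>\n' \
  > outdir/_wiki_Thomas_Lipton.html

# After the whole crawl has finished, substitute each crawled URL with the
# flat filename of its saved copy (GNU sed; macOS sed needs -i '').
url='https://en.wikipedia.org/wiki/Self-made_man'
file='_wiki_Self-made_man.html'
sed -i "s|href=\"$url\"|href=\"$file\"|g" outdir/*.html

cat outdir/_wiki_Thomas_Lipton.html
# → <a href="_wiki_Self-made_man.html">link</a>
```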

andrewdbate (Author) commented Oct 19, 2021

No, I didn't interrupt the command.

amirrh6 commented Nov 8, 2021

Hi @gildas-lormeau!
First, I'd like to express my appreciation for this amazing extension.

I faced the very same issue @andrewdbate discussed.

I tested https://xmrig.com because of its simple hierarchy.

The following internal links on https://xmrig.com should be considered:

(screenshot: list of internal links on the page)

  • Some links are duplicated inside the page, so I used the --filename-conflict-action=skip flag.

This is the command I ran:

./single-file --output-directory=saved --filename-template="{url-pathname-flat}.html" --crawl-links=true --crawl-replace-urls=true --filename-conflict-action=skip https://xmrig.com

As a result, the following files were created inside the saved directory (as expected):

  • _.html
  • _benchmark.html
  • _download.html
  • _wizard.html

Everything worked well so far, but the links inside these files were not changed to relative links on the file system.

You may find these files useful:

saved.zip

Thanks
