Suggestion: Option to exclude certain webpaths from being crawled, or at least written, maybe both #1

Open
5000thinmints opened this issue Jan 18, 2023 · 2 comments
Labels
wontfix This will not be worked on

Comments

@5000thinmints

For example, in my run of http://www.someweb.com I would like to exclude everything under http://www.someweb.com/boringnotes/ from being crawled or written, since nothing there is of interest to me.

@ballerburg9005
Owner

Unfortunately this is not possible with wget itself, so the only point of such an option would be for the script to delete the directories afterwards in order to make the ZIM file smaller.

I am not sure if that would be so helpful. What do you think?
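
To make the "delete the directories afterwards" idea concrete, here is a minimal sketch of what such a cleanup step might look like. The function name `prune_mirror` and the directory layout are assumptions for illustration only; this is not code from the script itself.

```shell
# Hypothetical helper: remove excluded sub-directories from a finished mirror
# before the ZIM file is built.
# $1 = mirror root directory; remaining arguments = paths relative to that root.
prune_mirror() {
  root="$1"
  shift
  for rel in "$@"; do
    # Refuse absolute paths and anything containing "..", so the
    # deletion cannot escape the mirror root.
    case "$rel" in
      /*|*..*) echo "skipping unsafe path: $rel" >&2; continue ;;
    esac
    # ${root:?} aborts if the root is empty instead of deleting from "/".
    rm -rf -- "${root:?}/$rel"
  done
}

# Example call (directory name assumed):
# prune_mirror ./www.someweb.com boringnotes
```

Note that this only shrinks the output. The excluded pages are still downloaded, and links pointing at them from the remaining pages would presumably end up broken.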

@ballerburg9005
Owner

ballerburg9005 commented Jan 29, 2023

It is somewhat far-fetched, but you could set up Privoxy with "https-inspection" enabled and add "--no-check-certificate -e use_proxy=yes -e http_proxy=127.0.0.1:8118" to the wget command. The proxy would then be able to read your HTTP requests, and you could configure it to block the URL paths you want.

I suppose this is a much more desirable result than deleting the folders afterwards, if you really need it.
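
A rough sketch of how the proxy approach might be wired up follows. The helper name `privoxy_block_section`, the block reason text, and the `/etc/privoxy/user.action` path are assumptions, and the wget line shows only the proxy-related flags quoted above, not the script's full command. For `https://` URLs wget reads `https_proxy` rather than `http_proxy`, so that setting would likely be needed as well.

```shell
# Hypothetical helper: print a Privoxy action-file section that blocks
# every URL pattern passed as an argument.
privoxy_block_section() {
  printf '{ +block{Excluded from crawl} }\n'
  for pattern in "$@"; do
    printf '%s\n' "$pattern"
  done
}

# Example (file location is an assumption; adjust to your Privoxy install):
# privoxy_block_section 'www.someweb.com/boringnotes/' >> /etc/privoxy/user.action
#
# Then route wget through the proxy with the flags from the comment above:
# wget --no-check-certificate -e use_proxy=yes -e http_proxy=127.0.0.1:8118 \
#      http://www.someweb.com
```

Privoxy answers blocked requests with an error page, so wget should skip those paths rather than mirror their contents.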

@ballerburg9005 ballerburg9005 added the wontfix This will not be worked on label Mar 5, 2023