wget: Cannot write to page (Is a directory). #9

Open
ilyaigpetrov opened this issue Oct 7, 2023 · 1 comment

ilyaigpetrov commented Oct 7, 2023

I noticed that some index.html files were missing after scraping a site with your script.
The problem seems to be that once wget has downloaded some binary files into a directory, an HTML page served at that directory's own path can't be saved: wget tries to write it to a file with the directory's name instead of to index.html inside it. See the example below.

I suggest adding the --trust-server-names option to the wget call, but I haven't had time to test it yet.
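
One possible reason the option could matter, assuming the server redirects the slash-less directory URL: Python's `http.server` answers a request for a directory without a trailing slash with a 301 redirect to the slashed form, and by default wget names the local file after the original URL rather than the redirect target. The sketch below only inspects that redirect and does not run wget; the temporary site layout and the helper name `probe_redirect` are made up for illustration.

```python
import functools
import http.client
import http.server
import pathlib
import tempfile
import threading


def probe_redirect(path="/main"):
    """Serve a tiny site and return (status, Location header) for `path`."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout mirroring the example: a "main" directory
        # that has its own index.html.
        main = pathlib.Path(root, "main")
        main.mkdir()
        (main / "index.html").write_text("<!DOCTYPE html>\n")

        handler = functools.partial(
            http.server.SimpleHTTPRequestHandler, directory=root
        )
        # Port 0 lets the OS pick a free port.
        server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
        thread = threading.Thread(target=server.serve_forever, daemon=True)
        thread.start()
        try:
            # http.client does not follow redirects, so the raw answer is visible.
            conn = http.client.HTTPConnection(
                "127.0.0.1", server.server_address[1], timeout=5
            )
            conn.request("GET", path)
            resp = conn.getresponse()
            resp.read()
            status, location = resp.status, resp.getheader("Location")
            conn.close()
            return status, location
        finally:
            server.shutdown()
            server.server_close()


if __name__ == "__main__":
    print(probe_redirect("/main"))
```

If the redirect is indeed what wget sees here, then the name it would take from the original URL is `main`, while the name from the redirected URL would fall inside `main/`. Whether --trust-server-names fully fixes the scrape is still untested.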

$ tree
example.com
├── index.html
└── main
    ├── index.html
    └── logo.png
$ cat example.com/index.html
<!DOCTYPE html>
<a href="./main/logo.png">MAIN LOGO</a>
<a href="./main">MAIN PAGE</a>
$ cd example.com && python3 -m http.server
$ wget -r http://localhost:8000
‘localhost:8000/index.html’ saved
‘localhost:8000/main/logo.png’ saved
Cannot write to ‘localhost:8000/main’ (Is a directory).
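
The failure in the last line can be imitated without wget or a server. This is a minimal sketch, assuming a POSIX filesystem: it creates the directory first (as saving main/logo.png does) and then tries to write a file with the same name. The function name `save_page_after_asset` and the `site` directory are hypothetical.

```python
import pathlib
import tempfile


def save_page_after_asset():
    """Mimic the download order in the log; return the exception type name, if any."""
    with tempfile.TemporaryDirectory() as root:
        site = pathlib.Path(root, "site")
        # 1. Saving main/logo.png forces "main" to exist as a directory.
        (site / "main").mkdir(parents=True)
        (site / "main" / "logo.png").write_bytes(b"")
        # 2. Saving the page for the URL ".../main" under that same name fails,
        #    because a path cannot be both a directory and a regular file.
        try:
            (site / "main").write_text("<!DOCTYPE html>\n")
        except OSError as exc:
            return type(exc).__name__
        return None


if __name__ == "__main__":
    print(save_page_after_asset())
```

The order matters only in how the problem shows up: if the page were saved first as a file named main, creating main/logo.png would be the step that fails instead.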

example.com.zip

ilyaigpetrov (author) commented

I was able to reproduce this with non-binary files too.

non-binary.zip
