Can't crawl the website's images? #4

Closed
wnark opened this issue Apr 1, 2020 · 3 comments

wnark commented Apr 1, 2020

Hello. I downloaded a website from web.archive.org and deployed it on a server, but it does not look the same as the version displayed on web.archive.org.
For example, the background of the web page does not load.

wnark (Author) commented Apr 2, 2020

If the website uses an image hosting service, the images do not display properly. The website's PHP files cannot be called normally either, even though the site displays correctly on web.archive.org.

erlange (Owner) commented Apr 3, 2020

The Wayback Machine stores each crawled website and rewrites every resource reference in its pages to point at the archived copy, so that the resources remain available if the original site is removed.

So, if your original page has links to other pages or images, for example:

<img src="https://yoursite.com/image.jpg" />
<a href="https://yoursite.com/otherpage/">Other Page</a>

The Wayback Machine rewrites them to:

<img src="https://web.archive.org/web/20160919103156im_/https://yoursite.com/image.jpg" />
<a href="https://web.archive.org/web/20160919103156/https://yoursite.com/otherpage/">Other Page</a>
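
To make the rewritten form concrete, here is a minimal Python sketch (not part of wbm-dl) that splits an archived URL back into its capture timestamp, its optional two-letter type modifier, and the original URL. The function name `split_wayback_url` and the regular expression are illustrative assumptions based only on the URL shape shown above.

```python
import re

# Assumed shape: https://web.archive.org/web/<timestamp><optional modifier>/<original URL>
# The modifier, when present, is two lowercase letters followed by an underscore.
WAYBACK_RE = re.compile(
    r"^https?://web\.archive\.org/web/"
    r"(?P<timestamp>\d{1,14})(?P<modifier>[a-z]{2}_)?/(?P<original>.+)$"
)


def split_wayback_url(url):
    """Return (timestamp, modifier, original_url), or None if url is not an archived URL."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    # A missing modifier is reported as an empty string.
    return match.group("timestamp"), match.group("modifier") or "", match.group("original")
```

For example, `split_wayback_url("https://web.archive.org/web/20160919103156/https://yoursite.com/otherpage/")` returns `("20160919103156", "", "https://yoursite.com/otherpage/")`, while a URL that does not start with the archive prefix returns `None`.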

The files downloaded by wbm-dl are the original files, not the versions rewritten by the Wayback Machine.

The image files and the pages are still downloaded, but the references in the HTML files point to the original site rather than to the downloaded copies, so a resource fails to load if the original site no longer serves it.
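
One possible workaround is to post-process the downloaded HTML so that references to the original site become root-relative paths. The Python sketch below is only an illustration, not a wbm-dl feature. It assumes the downloaded files are served from the web root with the same path layout as the original site, and the function name `localize_references` is hypothetical. It only touches quoted `src` and `href` attributes, so CSS `url(...)` references and resources on other hosts (such as an external image host) are left unchanged.

```python
import re


def localize_references(html, original_host):
    """Rewrite absolute src/href URLs on original_host into root-relative paths."""
    pattern = re.compile(
        r'(?P<attr>\b(?:src|href)\s*=\s*)(?P<quote>["\'])https?://'
        + re.escape(original_host)
        + r'(?P<path>/[^"\']*)?(?P=quote)',
        re.IGNORECASE,
    )

    def to_relative(match):
        # A bare host such as https://yoursite.com maps to the site root.
        path = match.group("path") or "/"
        quote = match.group("quote")
        return f"{match.group('attr')}{quote}{path}{quote}"

    return pattern.sub(to_relative, html)
```

With `original_host="yoursite.com"`, the tag `<img src="https://yoursite.com/image.jpg" />` becomes `<img src="/image.jpg" />`, which the deployed server can resolve against its own copy of the file.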

wnark (Author) commented Apr 4, 2020

OK, I will look into handling this with a script.

wnark closed this as completed Apr 4, 2020