Can't crawl the website's images? #4

wnark · 2020-04-01T11:26:32Z

Hello. I downloaded a website from web.archive.org and deployed it on my own server, but it does not look the same as the version displayed on web.archive.org.

For example, the background of the web page does not load.

wnark · 2020-04-02T07:20:00Z

If the website uses an image hosting service, the images do not display properly. The website's PHP files cannot be called normally either, even though the website displays correctly on web.archive.org.

erlange · 2020-04-03T16:40:47Z

The Wayback Machine stores the crawled website and rewrites all the resources referenced by its pages so that they remain available if the original site is removed.

So, if your original page has links to other pages or images, for example:

<img src="https://yoursite.com/image.jpg" />
<a href="https://yoursite.com/otherpage/">Other Page</a>

The Wayback Machine rewrites it to:

<img src="https://web.archive.org/web/20160919103156cs_/http://yoursite.com/image.jpg" />
<a href="https://web.archive.org/web/20160919103156cs_/https://yoursite.com/otherpage/">Other Page</a>

The files downloaded by wbm-dl are the original ones, not the versions rewritten by the Wayback Machine.

The image files and pages are still downloaded, but the references in the HTML files point to the original site rather than to the downloaded copies.
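To make the mapping between the two URL forms concrete, here is a minimal Python sketch. The helper names `to_wayback` and `to_original` are hypothetical and are not part of wbm-dl; the prefix pattern (a timestamp followed by an optional two-letter modifier such as `cs_`) is inferred from the example above and may not cover every form the Wayback Machine emits.

```python
import re

# Assumed prefix shape, based on the example above:
#   https://web.archive.org/web/<timestamp><optional modifier like "cs_">/
WAYBACK_PREFIX = re.compile(
    r"^https?://web\.archive\.org/web/\d{1,14}(?:[a-z]{2}_)?/",
    re.IGNORECASE,
)


def to_wayback(url: str, timestamp: str, modifier: str = "") -> str:
    """Build a Wayback-style URL from an original URL (hypothetical helper)."""
    return f"https://web.archive.org/web/{timestamp}{modifier}/{url}"


def to_original(url: str) -> str:
    """Strip the Wayback prefix, if present, and return the original URL."""
    return WAYBACK_PREFIX.sub("", url, count=1)


if __name__ == "__main__":
    rewritten = to_wayback("https://yoursite.com/image.jpg", "20160919103156", "cs_")
    print(rewritten)
    print(to_original(rewritten))
```

A URL that carries no Wayback prefix passes through `to_original` unchanged, so the function is safe to apply to every reference in a page.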

wnark · 2020-04-04T15:43:40Z

Okay, I will think about how to handle this with a script.
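One possible shape for such a script, as a rough sketch only: rewrite absolute `src` and `href` values that point at the original host into root-relative paths, so a local server serves the downloaded copies. The function name `localize_links` is hypothetical, and the sketch assumes the downloaded files sit under the server root at the same paths they had on the original site. References to other hosts, such as an image hosting service, are left untouched.

```python
import re


def localize_links(html: str, host: str) -> str:
    """Rewrite absolute src/href URLs on `host` to root-relative paths.

    Assumption: the downloaded files keep the original site's path layout
    under the local server root.
    """
    pattern = re.compile(
        r'((?:src|href)\s*=\s*["\'])'      # attribute name and opening quote
        r"https?://" + re.escape(host) +   # scheme and the original host
        r'(/[^"\']*)?(?=["\'])',           # optional path, up to the closing quote
        re.IGNORECASE,
    )

    def replace(match: re.Match) -> str:
        # A bare host with no path becomes the site root "/".
        return match.group(1) + (match.group(2) or "/")

    return pattern.sub(replace, html)


if __name__ == "__main__":
    page = (
        '<img src="https://yoursite.com/image.jpg" />\n'
        '<a href="https://yoursite.com/otherpage/">Other Page</a>'
    )
    print(localize_links(page, "yoursite.com"))
```

A regular expression keeps the sketch short, but it only handles quoted `src` and `href` attributes; URLs inside CSS or JavaScript would need separate handling.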

wnark closed this as completed Apr 4, 2020
