Website scraper and archiver

This is a little node.js script which uses wget to scrape a website and back it up to S3 (although right now I haven't implemented the backup part yet...). It is designed to be run continuously as a cron job (or similar), using wget's ability to fetch only new content to save incremental backups.
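For illustration, here is a minimal sketch of how such a wget invocation could be driven from Node. The flags and the mirrorSite helper are assumptions for this example, not necessarily what the script actually runs; wget's --mirror option turns on timestamping (-N), which is what lets repeated runs fetch only content that has changed.

const { execFile } = require("child_process");
const path = require("path");

// Illustrative helper (not part of the script's API): mirror one site into
// a subdirectory of the backup directory, letting wget skip unchanged files.
function mirrorSite(url, dirname, backupDir) {
  const dest = path.join(backupDir, dirname);
  execFile(
    "wget",
    ["--mirror", "--no-parent", "--directory-prefix", dest, url],
    (err) => {
      if (err) console.error(`wget failed for ${url}: ${err.message}`);
    }
  );
}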

Usage

This script is meant to be run unattended, so it doesn't have a CLI. Instead, it expects a backup-manifest.json file to be present in the directory from which it is executed. The file looks like this:

{
  "websites": [
    {
      "url": "en.wikipedia.org",
      "dirname": "wikipedia"
    }
  ],
  "backup_dir": "~/my_huge_backup_directory"
}

(Note: please don't actually try to scrape Wikipedia.) The websites array lists all of the sites you'd like to back up, and backup_dir is an optional location where backups will be saved; if you don't specify one, ~/backups is used instead.
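As a rough sketch (the function and constant names below are illustrative, not the script's actual internals), reading the manifest and applying the default backup directory might look like this:

const fs = require("fs");
const os = require("os");
const path = require("path");

// Default used when the manifest omits backup_dir (per the README: ~/backups).
const DEFAULT_BACKUP_DIR = path.join(os.homedir(), "backups");

// Illustrative loader: reads backup-manifest.json from the working directory
// and fills in the default backup location.
function loadManifest(cwd = process.cwd()) {
  const raw = fs.readFileSync(path.join(cwd, "backup-manifest.json"), "utf8");
  const manifest = JSON.parse(raw);
  return {
    websites: manifest.websites || [],
    backupDir: manifest.backup_dir || DEFAULT_BACKUP_DIR,
  };
}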

Note that you need to have wget installed. I have only tested this with very recent versions of Node.js; if it's not working, get nvm and install whatever the most recent version is.
