daohoangson/go-sitemirror

go-sitemirror

Website mirror app that prioritizes response consistency.

Goal

Make it easy to set up and run a mirror that copies content from another site and provides a near-exact browsing experience if the source server or network goes down.

Ideas

  1. All web assets should be downloaded with their metadata intact (content type etc.).
  2. Links should be followed with some restrictions to save resources.
  3. Cached data should be refreshed periodically.
  4. A web server should be provided to serve visitors.
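Idea 2 (following links with a restriction) can be illustrated with a depth-limited breadth-first crawl. This is only a minimal sketch, not the project's actual downloader: the `crawl` function and the in-memory `links` map are hypothetical stand-ins for real HTML parsing, and treating depth 0 as "follow no links" is an assumption based on the `-auto-download-depth=0` description.

```go
package main

import "fmt"

// crawl returns the pages reachable from start by following at most maxDepth
// links, in breadth-first order. links is a hypothetical in-memory link graph
// that stands in for parsing real pages.
func crawl(links map[string][]string, start string, maxDepth int) []string {
	type item struct {
		url   string
		depth int
	}
	seen := map[string]bool{start: true}
	queue := []item{{start, 0}}
	var visited []string

	for len(queue) > 0 {
		cur := queue[0]
		queue = queue[1:]
		visited = append(visited, cur.url)

		if cur.depth >= maxDepth {
			continue // depth budget spent: do not follow this page's links
		}
		for _, next := range links[cur.url] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, item{next, cur.depth + 1})
			}
		}
	}
	return visited
}

func main() {
	links := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/a/deep"},
		"/b": {"/"},
	}
	fmt.Println(crawl(links, "/", 0)) // [/]
	fmt.Println(crawl(links, "/", 1)) // [/ /a /b]
}
```

The `seen` set keeps a page from being queued twice, which is what stops cycles such as `/b` linking back to `/` from wasting resources.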

Usage

Mirror everything at :8080

Go to http://localhost:8080/https/github.com/ to see GitHub home page.

This is quite dangerous, though: do NOT deploy this to the public internet, to avoid abuse.

go-sitemirror -port 8080
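The example URL above suggests that, in mirror-everything mode, a source URL is addressed as `/{scheme}/{host}{path}` on the mirror. The following sketch builds such a URL. The `mirrorURL` helper is hypothetical, and the layout is inferred from that single example only; how query strings or non-default ports are encoded is not covered here and should be checked against the running server.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// mirrorURL maps an absolute source URL onto a mirror-everything server.
// Assumption: the mirror path is /{scheme}/{host}{path}, as in the README example.
func mirrorURL(mirrorBase, source string) (string, error) {
	u, err := url.Parse(source)
	if err != nil {
		return "", err
	}
	if u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("source URL %q must be absolute", source)
	}
	path := u.EscapedPath()
	if path == "" {
		path = "/" // treat https://host and https://host/ the same way
	}
	return strings.TrimRight(mirrorBase, "/") + "/" + u.Scheme + "/" + u.Host + path, nil
}

func main() {
	out, err := mirrorURL("http://localhost:8080", "https://github.com/")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(out) // http://localhost:8080/https/github.com/
}
```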

Mirror GitHub at :8081

Go to http://localhost:8081/ to see GitHub home page.

go-sitemirror -mirror https://github.com \
  -mirror-port 8081 \
  -auto-download-depth=0 \
  -no-cross-host
  • -auto-download-depth=0 turns off the auto downloader.
  • -no-cross-host leaves asset URLs from other domains unmodified.

Docker

Run the same GitHub mirror with Docker.

docker run --rm -it \
  -p 8081:8081 \
  -v "$PWD/cache:/cache" \
  ghcr.io/daohoangson/go-sitemirror -mirror https://github.com \
  -mirror-port 8081 \
  -auto-download-depth=0 \
  -no-cross-host

Fly.io

See PR #8 for a couple of deployed demos. The fly.toml looks something like this:

app = "app-name"

[build]
  image = "ghcr.io/daohoangson/go-sitemirror:latest"

[experimental]
  cmd = ["go-sitemirror", "-mirror", "https://github.com", "-mirror-port", "80", "-auto-download-depth", "0", "-no-cross-host"]

[http_service]
  internal_port = 80
  force_https = true
  min_machines_running = 0

All flags

  -auto-download-depth=1:
    Maximum link depth for auto downloads, default=1

  -auto-refresh=0s:
    Interval for url auto refreshes, default=no refresh

  -cache-bump=1m0s:
    Validity of cache bump

  -cache-path="":
    HTTP Cache path (default working directory)

  -cache-ttl=10m0s:
    Validity of cached data

  -header=map[]:
    Custom request header, must be 'key=value'

  -http-timeout=10s:
    HTTP request timeout

  -log=4:
    Logging output level

  -mirror=[]:
    URL to mirror, multiple urls are supported

  -mirror-port=[]:
    Port to mirror a single site, each port number should immediately follow its URL.
    For url that doesn't have any port, it will still be mirrored but without a web server.

  -no-cross-host=false:
    Disable cross-host links

  -port=-1:
    Port to mirror all sites

  -rewrite=map[]:
    Link rewrites, must be 'source.domain.com=https://domain.com/some/path'

  -whitelist=[]:
    Restricted list of crawlable hosts

  -workers=4:
    Number of download workers