CLI and Go library for archiving webpages in WARC-format

aholstenson/webpage-archiver

webpage-archiver

Capture and archive webpages to WARC-files, available both as a command-line tool and as a Go library.

Features:

  • Customizable output
    • WARC
    • Single file support via Obelisk
  • Screenshot support
  • Archives using a headless Chrome instance
    • Will automatically download a compatible headless browser

Capturing pages

Store WARC files in a specific directory:

webpage-archiver --output directory/ urlToArchive

To archive as a single file instead:

webpage-archiver --output fileOrDirectory --single-file urlToArchive

To store a screenshot of each page, add --screenshot:

webpage-archiver --output directory/ --screenshot urlToArchive

Multiple URLs can be captured to the same archive:

webpage-archiver --output directory/ urlToArchive anotherUrlToArchive

Viewing pages

WARC-files captured with this tool need to be replayed to be viewed. The easiest way to replay a capture is to use a tool like ReplayWeb.page.

Using as a Go library

go get github.com/aholstenson/webpage-archiver

Create an archiver instance to start capturing pages:

archiver, err := archiver.NewArchiver(ctx)

Capture pages using Capture:

output, err := warc.NewOutput(warc.WithDirectory(directory))

err = archiver.Capture(ctx, url, output)

output.Close()

Close the archiver when it's no longer needed:

archiver.Close()

Tracking progress

Archiver can take an optional progress reporter that will be used to log actions, requests and responses:

reporter := progress.NewConsoleReporter()
archiver, err := archiver.NewArchiver(ctx, archiver.WithProgress(reporter))

To use a progress reporter for a specific capture:

err := archiver.Capture(ctx, url, output, archiver.WithProgress(reporter))

User agents

The user agent can be specified via WithUserAgent:

archiver.WithUserAgent("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible) Safari/537.36")

This option can be applied both to NewArchiver and to Archiver.Capture.

Screenshots

The option WithScreenshot can be passed to Capture to receive a screenshot of the page as it appears just before archiving finishes:

archiver.Capture(ctx, url, output, archiver.WithScreenshot(func(data []byte) error {
  // Handle screenshot data here
  return nil
}))
