Golang WARC (Web ARChive) Library
Switch branches/tags
Nothing to show
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
.github
testdata
.gitignore
LICENSE
README.md
capture.go
capture_test.go
digest.go
doc.go
encoding.go
encoding_test.go
fields.go
header.go
header_test.go
reader.go
reader_test.go
readme.md
record.go
record_test.go
records.go
sanitize.go
sanitize_test.go
writer.go
writer_test.go

README.md

warc

GitHub Slack GoDoc License

warc is an implementation of ISO28500 1.0, the WebARCive specfication. it provides readers, writers, and structs for working with warc records.

from the spec:

The WARC (Web ARChive) file format offers a convention for concatenating multiple resource records (data objects), each consisting of a set of simple text headers and an arbitrary data block into one long file. The WARC format is an extension of the ARC File Format [ARC] that has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. Each capture in an ARC file is preceded by a one-line header that very briefly describes the harvested content and its length. This is directly followed by the retrieval protocol response messages and content. The original ARC format file is used by the Internet Archive (IA) since 1996 for managing billions of objects, and by several national libraries. package warc

License & Copyright

Affero General Public License v3

Getting Involved

We would love involvement from more people! If you notice any errors or would like to submit changes, please see our Contributing Guidelines.

We use GitHub issues for tracking bugs and feature requests and Pull Requests (PRs) for submitting changes

Usage

import "github.com/datatogether/warc"