Common Crawl Index Server

This project deploys the pywb web archive replay and index server to provide an index query mechanism for the datasets published by Common Crawl.

Usage & Installation

To run locally, install the dependencies with: pip install -r requirements.txt

Common Crawl stores data on Amazon S3 and the index is publicly accessible from S3.

Currently, individual indexes for each crawl can be accessed under: s3://commoncrawl/cc-index/collections/[CC-MAIN-YYYY-WW]
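As a small illustration of the naming scheme above, the following Python sketch builds the S3 location for a given crawl identifier. The helper name collection_s3_path and the validation pattern are assumptions made for this example; they are not part of this repository.

```python
import re

# Crawl identifiers follow the CC-MAIN-YYYY-WW shape described above.
CRAWL_ID = re.compile(r"CC-MAIN-(\d{4})-(\d{2})")

# Prefix taken from the text above.
INDEX_PREFIX = "s3://commoncrawl/cc-index/collections/"


def collection_s3_path(crawl_id: str) -> str:
    """Return the S3 location of one crawl's index (hypothetical helper)."""
    if not CRAWL_ID.fullmatch(crawl_id):
        raise ValueError(f"not a CC-MAIN-YYYY-WW identifier: {crawl_id!r}")
    return INDEX_PREFIX + crawl_id


if __name__ == "__main__":
    print(collection_s3_path("CC-MAIN-2015-06"))
```

The pattern only checks the shape of the identifier; it does not confirm that an index for that crawl actually exists on S3.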

Most of the index is served directly from S3; however, a smaller secondary index must be installed locally for each collection.

This can be done automatically by running install-collections.sh, which installs all available collections locally. It uses the AWS CLI tool to sync the index.

If successful, there should be a collections directory containing at least one index.
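One way to confirm that the install step produced something is to list what is under the collections directory. The sketch below assumes each collection is stored as its own sub-directory of collections; that layout is an assumption for illustration, and list_local_collections is a hypothetical helper, not something shipped with this project.

```python
from pathlib import Path
from typing import List


def list_local_collections(root: str = "collections") -> List[str]:
    """Return the sorted names of sub-directories under `root`.

    Assumes one sub-directory per collection; returns an empty list
    if `root` does not exist or is not a directory.
    """
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(p.name for p in base.iterdir() if p.is_dir())


if __name__ == "__main__":
    found = list_local_collections()
    if found:
        print("Local collections:", ", ".join(found))
    else:
        print("No collections found; try running install-collections.sh")
```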

To start the index server, run cdx-server. Alternatively, run wayback to start the pywb replay system along with the CDX server.

Running with docker

If Docker is installed on your system, you can run the index server in a container:

git clone https://github.com/commoncrawl/cc-index-server.git
cd cc-index-server
docker build . -t cc-index
docker run --rm --publish 8080:8080 -ti cc-index

You can use install-collections.sh to download the indexes to your system and mount them into the Docker container.

CDX Server API

The API endpoints correspond to the index collections present in the collections directory.

For example, one currently available index is CC-MAIN-2015-06, which can be accessed via:

http://localhost:8080/CC-MAIN-2015-06-index?url=commoncrawl.org
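The same query can be issued from a script. The following Python sketch builds the endpoint URL shown above and parses a response, assuming the server accepts pywb's output=json parameter and returns one JSON object per line. The helper names, the default base URL, and the sample record are illustrative assumptions, not output captured from a real server.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(collection: str, url: str,
                    base: str = "http://localhost:8080", **params: str) -> str:
    """Build a query URL of the form <base>/<collection>-index?url=..."""
    query = {"url": url, **params}
    return f"{base}/{collection}-index?{urlencode(query)}"


def parse_json_lines(body: str) -> List[Dict]:
    """Parse a response body holding one JSON object per non-empty line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


def query_index(collection: str, url: str) -> List[Dict]:
    """Fetch and parse results; requires a running index server."""
    with urlopen(build_query_url(collection, url, output="json")) as resp:
        return parse_json_lines(resp.read().decode("utf-8"))


if __name__ == "__main__":
    print(build_query_url("CC-MAIN-2015-06", "commoncrawl.org"))
    # Made-up sample record, used only to exercise the parser offline.
    sample = '{"url": "http://commoncrawl.org/", "status": "200"}\n'
    print(parse_json_lines(sample))
```

Keeping URL construction and parsing separate from the network call lets both be checked without a server running.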

Refer to CDX Server API for more detailed instructions on the API itself.

The pywb README provides additional information about pywb.

Building the Index

Please see the webarchive-indexing repository for more info on how the index is built.
