Playing around with SavePageNow data.

This repository includes Jupyter notebooks that document research into the Internet Archive's Save Page Now web archive data. The project is a collaboration between Shawn Walker, Jess Ogden and Ed Summers.

Notebooks:

The notebooks are meant to be read in order, since some of them rely on data created in others. They are listed here as a table of contents in case you want to follow the path of exploration.

  • Sizes: how SPN data has changed over time
  • Sample: sampling the full SPN dataset
  • Spark: an example of using Spark with WARC data
  • Tracery: tracing SPN requests in WARC data
  • URLs: extracting metadata for SPN requests
  • UserAgents: analyzing the User-Agents in SPN requests
  • Domains: examining the most popularly archived domains
  • Archival Novelty: what does newness look like in SPN data
  • WSDL Diversity Index: analyzing the diversity of SPN requests
  • Known Sites: taking a close look at particular websites in SPN data
  • Liveliness: examining whether archived content is still live

Some of the notebooks depend on third-party Python packages, so you'll need to install those first. pipenv is a handy tool for managing a project's Python dependencies. These steps should get you up and running:

pip install pipenv
git clone https://github.com/edsu/spn
cd spn
pipenv install
pipenv shell
jupyter notebook

Note: if you are using a notebook that requires Spark you'll need to set these in your environment before starting Jupyter:

export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'
jupyter notebook Spark.ipynb
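
If the Spark notebook misbehaves, it can help to confirm that the two variables above actually reached the Python process. The following is a minimal sketch, not part of this repository; the helper name `missing_spark_settings` is made up for illustration, and it only checks the two values shown above.

```python
import os

# The two settings from the export lines above.
REQUIRED = {
    "PYSPARK_DRIVER_PYTHON": "jupyter",
    "PYSPARK_DRIVER_PYTHON_OPTS": "notebook",
}


def missing_spark_settings(env=None):
    """Return the names of settings that are unset or differ from REQUIRED.

    `env` defaults to the current process environment; pass a dict to
    check some other mapping (hypothetical helper, for illustration only).
    """
    env = os.environ if env is None else env
    return [name for name, value in REQUIRED.items() if env.get(name) != value]


problems = missing_spark_settings()
if problems:
    print("Not set as expected:", ", ".join(problems))
else:
    print("Spark driver settings look fine.")
```

Run it in the same shell where you exported the variables, or paste it into the first cell of the notebook.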

Utilities:

  • check.py: a utility to ensure that the downloaded files are complete
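
For readers curious what a completeness check can look like, here is a minimal sketch of one common approach: comparing each local file's size against an expected size. This is an assumption for illustration and is not necessarily how check.py works; the function name `incomplete_files` and the `expected_sizes` mapping are hypothetical.

```python
from pathlib import Path


def incomplete_files(directory, expected_sizes):
    """Return the names of files that are missing or have the wrong size.

    `expected_sizes` maps a file name to its expected size in bytes
    (a hypothetical manifest; where those sizes come from is up to you).
    """
    directory = Path(directory)
    bad = []
    for name, size in sorted(expected_sizes.items()):
        path = directory / name
        # A file counts as incomplete if it is absent or its size differs.
        if not path.is_file() or path.stat().st_size != size:
            bad.append(name)
    return bad
```

A size comparison catches truncated downloads cheaply, but it does not detect corrupted content; a checksum comparison would be needed for that.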