
Fetcher

Scripts used to fetch the HTML files from top Alexa sites.

Methodology

  • The top 1 million Alexa sites csv is downloaded, unzipped, and the URLs are extracted from it.

    Note: only the top 100,000 sites are kept for downloading.

  • The URLs are then fed to a Python script that downloads the HTML files and their HTTP headers using a process pool (to minimize waiting).

    Errors are reported to a log file (see Results).
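
The two steps above could be sketched in Python 3 roughly as follows. This is an illustrative sketch under stated assumptions, not the repository's actual script: the names `top_urls`, `fetch`, and `download_all` are hypothetical, the CSV is assumed to use `rank,domain` rows, and the `http://` prefix is an assumption about how bare domains become URLs.

```python
import csv
import io
import multiprocessing
import urllib.request


def top_urls(csv_text, limit=100000):
    """Return up to `limit` URLs from CSV text with assumed `rank,domain` rows."""
    urls = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(urls) >= limit:
            break
        if len(row) < 2:
            continue  # skip blank or malformed rows
        # Assumption: the list holds bare domains, so a scheme is prepended.
        urls.append("http://" + row[1].strip())
    return urls


def fetch(url):
    """Download one URL and return (url, headers, body, error)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, dict(resp.headers.items()), resp.read(), None
    except Exception as exc:  # the caller can write this to the log file
        return url, None, None, str(exc)


def download_all(urls, processes=64):
    """Fetch all URLs in a process pool so slow sites do not block the rest."""
    # `fetch` is a module-level function so the pool can pickle it.
    with multiprocessing.Pool(processes) as pool:
        return pool.map(fetch, urls)
```

A caller would write each body and header set to disk and append any non-empty error to the log file. The default of 64 processes simply mirrors the -P64 in the xargs example under Usage and is not a tuned value.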

Usage

If you're on Linux or OS X, simply run ./getData.sh and you should be good to go. If you're on Windows, Cygwin may be your best bet.

If you want to fetch resources other than Alexa's top HTML pages, feed your own list of URLs to the download script, for example:

    cat urls.txt | xargs -I % -n 1 -P64 ./downloadr.py download % webdevdata.org-2013-12-06-200358/

Dependencies

  • Python (tested with 2.7).
  • curl or wget (curl is preferred when both are available).
  • python-magic, which also requires libmagic (which you can install via Homebrew). The Debian "python-magic" package is not the same thing. We recommend the virtualenv-based approach below for all users.

If you use virtualenv, you can install the required Python package locally:

  • virtualenv venv
  • . venv/bin/activate
  • pip install -r requirements.txt

Whenever you want to run this script, use:

  • . venv/bin/activate
  • ./getData.sh

If you use autoenv, the activation step is done automatically when you enter the directory.

Results

The resulting directory structure is:

  • A root directory of the pattern "webdevdata.org-YYYY-MM-DD-HHMMSS"
  • A "log.txt" file within this directory contains a list of errors encountered across all downloads.
  • Sub-directories are named by 16-bit hashes of the URLs stored below them. This keeps any single directory from holding too many files.

The resulting files have an ".html.txt" extension for the data files and ".html.hdr.txt" extension for the header files.
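
As a rough illustration of this layout, the sketch below builds the root directory name and the two output paths for a URL. It is not the project's actual code: the helper names are hypothetical, the choice of the first 16 bits of an MD5 digest as the hash is an assumption (the source does not say which hash is used), and so is the way the URL is turned into a file name.

```python
import hashlib
import os
import re
from datetime import datetime


def root_dir_name(now):
    """Format a datetime as webdevdata.org-YYYY-MM-DD-HHMMSS."""
    return now.strftime("webdevdata.org-%Y-%m-%d-%H%M%S")


def output_paths(root, url):
    """Return (data_path, header_path) for a URL under `root`."""
    # Assumption: the 16-bit bucket is the first 4 hex digits of an MD5 digest.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    bucket = digest[:4]
    # Assumption: unsafe characters in the URL are replaced to form a file name.
    name = re.sub(r"[^A-Za-z0-9.-]", "_", url)
    base = os.path.join(root, bucket, name)
    return base + ".html.txt", base + ".html.hdr.txt"
```

A 16-bit hash allows 65,536 sub-directories, so 100,000 pages average fewer than two data files per directory.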

Queries

A Java-based script is available to get statistics on HTML tags/attributes with CSS-like queries.

See the Queries on WebDevData wiki.