Crawler based on a modified browser to detect online tracking. Used in the canvas fingerprinting and evercookie detection experiments in our CCS 2014 paper, The Web Never Forgets. Visit The Web Never Forgets website for more info.


We strongly suggest that you use a virtual machine, container, or similar isolation when installing modCrawler.

git clone
cd modCrawler

Running the tests

Please run the tests before running the crawler. For simplicity, you can run py.test from within the test directory; py.test will discover and run all the tests. Alternatively, you can run an individual test from the command line, for example: python -m test.runenv_test

Command line parameters

The parameters passed to the module are described below.

  • --urls: path to file that contains the list of URLs to crawl
  • --max_rank: maximum line number of the URLs to be crawled (if the URLs contain rank info)
  • --min_rank (optional): minimum line number of the URLs to be crawled (if the URLs contain rank info)
  • --max_proc: maximum number of browsers that will run in parallel
  • --flash: Flash support (0: disable, 1: enable (default))
  • --cookie: Cookie support (0: allow all (default), 1: allow first-party only, 2: disable, 3: allow third-party cookies from visited sites)
  • --upload: Upload crawl results to a remote server via SSH. 0: don't upload (default), 1: upload (SSH server info should be completed in crawler/
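The flags listed above could be modelled with a small argparse parser. This is only a sketch for illustration, not modCrawler's actual argument handling: the function name build_parser is hypothetical, and the types and the None defaults for --max_rank, --min_rank, and --max_proc are assumptions. The defaults for --flash, --cookie, and --upload follow the descriptions above.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the documented flags (not modCrawler's code)."""
    p = argparse.ArgumentParser(description="Sketch of the documented crawl options")
    p.add_argument("--urls", required=True,
                   help="path to the file that contains the list of URLs to crawl")
    # No defaults are documented for the next three flags; None is an assumption.
    p.add_argument("--max_rank", type=int, default=None,
                   help="maximum line number of the URLs to be crawled")
    p.add_argument("--min_rank", type=int, default=None,
                   help="minimum line number of the URLs to be crawled (optional)")
    p.add_argument("--max_proc", type=int, default=None,
                   help="maximum number of browsers that run in parallel")
    # Documented defaults: Flash enabled, all cookies allowed, no upload.
    p.add_argument("--flash", type=int, choices=[0, 1], default=1)
    p.add_argument("--cookie", type=int, choices=[0, 1, 2, 3], default=0)
    p.add_argument("--upload", type=int, choices=[0, 1], default=0)
    return p


if __name__ == "__main__":
    # Parse an explicit argument list instead of sys.argv for demonstration.
    demo = build_parser().parse_args(
        ["--urls", "etc/top-1m.csv", "--max_rank", "100", "--max_proc", "10", "--flash", "0"]
    )
    print(demo)
```

Restricting --flash, --cookie, and --upload with choices makes the parser reject values outside the documented ranges.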


  • To crawl the top 100 URLs in the etc/top-1m.csv file using 10 parallel crawlers (Flash disabled):

    • python --urls etc/top-1m.csv --max_rank 100 --max_proc 10 --flash 0
  • To crawl the URLs between ranks 100 and 1000 in the etc/top-1m.csv file using 5 parallel crawlers (Flash enabled):

    • python --urls etc/top-1m.csv --max_rank 1000 --min_rank 100 --max_proc 5
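To illustrate how a rank window such as the ones in the examples above might be applied, here is a minimal sketch. It assumes the URL file holds one "rank,domain" pair per line and that both bounds are inclusive; the select_urls helper is hypothetical and is not part of modCrawler.

```python
import csv


def select_urls(lines, max_rank, min_rank=1):
    """Return (rank, url) pairs with min_rank <= rank <= max_rank.

    `lines` is any iterable of "rank,url" strings, e.g. an open file.
    Inclusive bounds are an assumption made for this sketch.
    """
    selected = []
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        try:
            rank = int(row[0])
        except ValueError:
            continue  # skip lines whose first field is not a number
        if min_rank <= rank <= max_rank:
            selected.append((rank, row[1].strip()))
    return selected


if __name__ == "__main__":
    sample = ["1,example.com", "2,example.org", "3,example.net"]
    print(select_urls(sample, max_rank=3, min_rank=2))
```

With a real file, the same function could be called as select_urls(open("etc/top-1m.csv"), max_rank=1000, min_rank=100).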

After the crawl


modCrawler will store the data about the crawls in the jobs directory. For convenience, it places a symlink called latest that points to the directory of the most recent crawl.

During the crawl, you can follow the debug log with: tail -f jobs/latest/debug.log

Once the crawl has finished, you can find the crawl data in the jobs/latest/ directory.

  • crawl.sqlite: SQLite-based crawl database.
  • An HTML-based report that gives an overview of the results. The name of the file depends on the date and crawl parameters.
  • debug.log: Debug logs.
  • error.log: Error logs. This file is not created if no errors occur.

In addition, the crawl directory is gzipped and stored in the jobs directory.
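Since the schema of crawl.sqlite is not described here, one way to get a first look at the database is to list its tables and their row counts. The sketch below uses only Python's standard sqlite3 module and makes no assumptions about table names; the list_tables helper is hypothetical.

```python
import sqlite3


def list_tables(conn):
    """Return a {table_name: row_count} dict for an open SQLite connection."""
    names = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    counts = {}
    for name in names:
        # Quote the identifier so unusual table names do not break the query.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(
            "SELECT COUNT(*) FROM " + quoted).fetchone()[0]
    return counts


if __name__ == "__main__":
    # Self-contained demonstration on an in-memory database.
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE example (value TEXT)")
    demo.execute("INSERT INTO example VALUES ('a')")
    print(list_tables(demo))
    demo.close()
```

For a finished crawl, the connection would be opened with sqlite3.connect("jobs/latest/crawl.sqlite"). Note that sqlite3.connect creates an empty database if the path does not exist, so check the path first.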

Building your own browser

The script will download a modified Firefox that logs function calls related to canvas fingerprinting. Alternatively, you can build your own Firefox using the provided browser patch. Make sure you use the right .mozconfig file for building (e.g., export MOZCONFIG=~/path/to/gecko-dev/.mozconfig-ffstd). The following steps assume you checked out the Firefox repository into ~/dev/gecko-dev/:

cd ~/dev/gecko-dev/;
git fetch
git checkout GECKO4401_2016020518_RELBRANCH
git apply ~/dev/modCrawler/browser_patch/0001-Log-canvas-fingerprinting-related-function-calls.patch
./mach build
cd firefox-static
make package;
# copy it from dist dir to destination
cp dist/*.bz2 /path/to/modCrawler/bins

To make sure the crawler uses your freshly built browser, place it in the bins/ff-mod directory. Please consult the Mozilla documentation for any build errors you run into.

