Code and data release for our PETS 2018 paper: "I never signed up for this! Privacy implications of email tracking".

I never signed up for this! Privacy implications of email tracking

This is a public code and data release for the research paper "I never signed up for this! Privacy implications of email tracking", which will appear at PETS 2018. Portions of the code for this project borrow heavily from Jeffrey's undergraduate senior thesis, available here.

Authors: Steven Englehardt (@englehardt), Jeffrey Han (@itdelatrisu), and Arvind Narayanan (@randomwalker)

Paper: available here.

Components

Core components:

  • crawler_emails/ - A web crawler, built on OpenWPM, to simulate email views and link clicks.
  • crawler_mailinglists/ - A web crawler, built on OpenWPM, to find and submit mailing list sign-ups.
  • email-tracking-tester/ - A tool to test the privacy properties of a mail client.
  • mailserver/ - The mail server used to collect our corpus of emails.
  • analysis/ - Coming soon

Code Usage

Additional documentation is available in the README of each component subdirectory.

System Requirements

  • The framework is fully tested only on Ubuntu 16.04, and requires Java and Python runtime environments.
  • The processes (described below) can be run on separate machines. The mail server is OS-independent, but the web crawlers only run on Linux.
  • Depending on the number of registered sites, the mail server might store anywhere from a few hundred megabytes to tens of gigabytes of data on disk per month.
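
The requirements above can be checked before starting any of the processes. The following is a minimal sketch, not part of the released code: the function name `missing_requirements` is hypothetical, and the tool list assumes that `java` and `mvn` (used in the build commands below) are the executables that need to be on the PATH.

```python
import platform
import shutil


def missing_requirements(tools=("java", "mvn"), which=shutil.which,
                         system=platform.system):
    """Return a list of problems; an empty list means all checks passed.

    `which` and `system` are injectable so the check can be exercised
    without touching the real environment.
    """
    problems = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    # The crawlers only run on Linux; the mail server is OS-independent.
    if system() != "Linux":
        problems.append("web crawlers require Linux")
    return problems
```

Calling `missing_requirements()` with no arguments inspects the current machine. On a machine that will only host the mail server, the Linux warning can be ignored.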

Processes

The system consists of three long-running processes:

  • The mail server, which receives, stores, and analyzes incoming mail.
    $ cd mailserver
    $ mvn clean package
    $ java -jar target/mailserver.jar
    
  • The mailing list crawler, which crawls a list of input sites and searches for mailing lists.
    $ cd crawler_mailinglists
    $ python crawl_mailinglist_signup.py
    
  • The email crawler, which renders emails in a simulated webmail environment and visits links from those emails.
    $ cd crawler_emails
    $ python crawl_*.py
    

SMTP Configuration

Running the mail server requires a domain name with MX records pointing to the server. Additionally, if running the mailing list crawler from machines other than the mail server's machine, host records (A, CNAME) must also be set.
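
As a rough sanity check for the host records, a crawler machine can confirm that the mail server's hostname resolves. This is only a sketch under stated assumptions: `resolve_host` is a hypothetical helper and `mail.example.org` in the usage note is a placeholder hostname. Python's standard library cannot query MX records, so this covers only the A/CNAME part of the setup.

```python
import socket


def resolve_host(hostname, resolver=socket.gethostbyname):
    """Return the IPv4 address for hostname, or None if it does not resolve.

    `resolver` defaults to the system resolver and can be replaced in tests.
    """
    try:
        return resolver(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised on lookup failure.
        return None
```

For example, `resolve_host("mail.example.org")` returning `None` on a crawler machine would suggest that the host record is missing or has not propagated yet.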

Data

The following data used in the analysis is available for download:

Mailbox

Includes email metadata (subject, sender, etc.) and email body content.

Download link: mailbox.tar.bz2

Contents:

  • email_inbox.sqlite
    • users table -- Email address registration records. Maps email address to registration site and time.
    • inbox table -- Subject, sender, delivery time, and other metadata for each email.
  • mail/ -- Directory of raw .eml files saved by the mail server. Use the inbox table of the email_inbox.sqlite database to navigate.
  • html/ -- HTML bodies parsed from the corresponding raw email bodies. These are the HTML emails loaded by the crawlers.
  • html_after_filtering/ -- HTML bodies after filtering tracking tags using EasyList and EasyPrivacy. See Section 7 of the paper.
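
The column names of the users and inbox tables are not listed here, so a reasonable first step is to inspect the schema before writing queries. The sketch below is illustrative only: `describe_tables` and `html_body` are hypothetical helper names, and the second function shows one way to pull an HTML body out of a raw .eml file with Python's standard `email` package, which is not necessarily how the html/ directory was produced.

```python
import sqlite3
from email import policy
from email.parser import BytesParser


def describe_tables(db_path):
    """Map each table name in a SQLite file to its list of column names."""
    conn = sqlite3.connect(db_path)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        schema = {}
        for name in names:
            # PRAGMA table_info yields one row per column; index 1 is its name.
            quoted = name.replace('"', '""')
            columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            schema[name] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()


def html_body(eml_path):
    """Return the HTML body of a raw .eml file, or None if it has no HTML part."""
    with open(eml_path, "rb") as handle:
        message = BytesParser(policy=policy.default).parse(handle)
    part = message.get_body(preferencelist=("html",))
    return part.get_content() if part is not None else None
```

Running `describe_tables("email_inbox.sqlite")` from the extracted mailbox directory should report the columns of the users and inbox tables, which can then be used to look up individual files under mail/.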

Email view crawl

Crawl data generated by opening the HTML email bodies given in the html/ directory of the mailbox using a simulated webmail client. This is the primary dataset used for the results in Section 4.

Download link: 2017-05-17_email_tracking_view_crawl.sqlite.bz2
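
The crawl databases are distributed as bz2-compressed SQLite files. A minimal sketch for unpacking one and listing its tables follows; the helper names `decompress_bz2` and `list_tables` are hypothetical, and no particular table layout is assumed.

```python
import bz2
import shutil
import sqlite3


def decompress_bz2(src_path, dest_path, chunk_size=1024 * 1024):
    """Stream-decompress a .bz2 file so it is never held fully in memory."""
    with bz2.open(src_path, "rb") as src, open(dest_path, "wb") as dest:
        shutil.copyfileobj(src, dest, chunk_size)
    return dest_path


def list_tables(db_path):
    """Return the sorted names of all tables in a SQLite database file."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
        return [row[0] for row in rows]
    finally:
        conn.close()
```

For example, `decompress_bz2("2017-05-17_email_tracking_view_crawl.sqlite.bz2", "view_crawl.sqlite")` followed by `list_tables("view_crawl.sqlite")` shows which tables are available. The same steps apply to the other crawl downloads.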

Filtered email view crawl

Crawl data generated by opening the HTML email bodies given in the html_after_filtering/ directory of the mailbox using a simulated webmail client. This is the primary dataset used for the results in the "Server-side email content filtering" subsection of Section 7.

Download link: 2017-05-28_email_tracking_filtered_view_crawl.sqlite.bz2

Email click crawl

Crawl data generated by visiting a sample of links extracted from the HTML email bodies of each email in the html/ directory of the mailbox. This is the primary dataset used for the results in Section 5.

Download link: 2017-05-17_email_tracking_click_crawl.sqlite.bz2

Mailing list sign-up success rate crawl

Crawl data generated by running our mailing list sign-up procedure on the top sites, instrumenting the resulting pages to compute the overall level of successful sign-ups. This is the primary dataset used for the results in the "Form submission measurement" subsection of Section 3.

Download link: 2017-08-13_signup_success_measurement.sqlite.bz2

Funding

This project was funded by NSF Grant CNS 1526353, a research grant from Mozilla, and Amazon AWS Cloud Credits for Research.