
This is a work in progress; any help normalizing and extracting this data is much appreciated!

This repo contains code to transform Hillary Clinton's emails released through the FOIA request from raw PDF documents to CSV files and a SQLite database, making it easier to understand and analyze the documents.

A zip of the extracted data is available for download on Kaggle.

Check out some analytics on this data on Kaggle Scripts.

Note that conversion is very imprecise: there's plenty of room to improve the PDF conversion, the sender/receiver extraction, and the body text extraction.

Extracted data

This produces five main output files: four CSV files and one SQLite database.

Note that each table contains a numeric Id column. This Id column is only meant to be used to join the tables: it is internally consistent within one version of the data, but an entity may get a different Id when the data is updated.


Emails

This table currently contains the following fields:

  • Id - unique identifier for internal reference
  • DocNumber - FOIA document number
  • MetadataSubject - Email SUBJECT field (from the FOIA metadata)
  • MetadataTo - Email TO field (from the FOIA metadata)
  • MetadataFrom - Email FROM field (from the FOIA metadata)
  • SenderPersonId - PersonId of the email sender (linking to Persons table)
  • MetadataDateSent - Date the email was sent (from the FOIA metadata)
  • MetadataDateReleased - Date the email was released (from the FOIA metadata)
  • MetadataPdfLink - Link to the original PDF document (from the FOIA metadata)
  • MetadataCaseNumber - Case number (from the FOIA metadata)
  • MetadataDocumentClass - Document class (from the FOIA metadata)
  • ExtractedSubject - Email SUBJECT field (extracted from the PDF)
  • ExtractedTo - Email TO field (extracted from the PDF)
  • ExtractedFrom - Email FROM field (extracted from the PDF)
  • ExtractedCc - Email CC field (extracted from the PDF)
  • ExtractedDateSent - Date the email was sent (extracted from the PDF)
  • ExtractedCaseNumber - Case number (extracted from the PDF)
  • ExtractedDocNumber - Doc number (extracted from the PDF)
  • ExtractedDateReleased - Date the email was released (extracted from the PDF)
  • ExtractedReleaseInPartOrFull - Whether the email was partially censored (extracted from the PDF)
  • ExtractedBodyText - Attempt to only pull out the text in the body that the email sender wrote (extracted from the PDF)
  • RawText - Raw email text (extracted from the PDF)
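Since several fields exist in both a Metadata and an Extracted form, comparing the two is one rough way to spot where the PDF extraction went wrong. The sketch below does this for the subject field only. The sample rows are made up for illustration, and `normalize` and `subject_mismatches` are hypothetical helpers, not part of this repo.

```python
import csv
import io

# Made-up sample using three of the Emails columns listed above.
SAMPLE_CSV = """Id,MetadataSubject,ExtractedSubject
1,WOW,WOW
2,MEET THE RIGHT,Re: Meet the Right
3,SCHEDULE,
"""


def normalize(text):
    """Lowercase, collapse whitespace, and drop leading reply/forward prefixes."""
    text = " ".join(text.lower().split())
    for prefix in ("re:", "fw:", "fwd:"):
        if text.startswith(prefix):
            text = text[len(prefix):].strip()
    return text


def subject_mismatches(rows):
    """Return the Ids of rows whose extracted subject disagrees with the metadata."""
    return [
        int(row["Id"])
        for row in rows
        if normalize(row["MetadataSubject"]) != normalize(row["ExtractedSubject"])
    ]


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(subject_mismatches(rows))
```

To try this on real output, replace the `io.StringIO` sample with an open handle to the generated emails CSV file.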


Persons

  • Id - unique identifier for internal reference
  • Name - person's name


Aliases

  • Id - unique identifier for internal reference
  • Alias - text in the From/To email fields that refers to the person
  • PersonId - person that the alias refers to


EmailReceivers

  • Id - unique identifier for internal reference
  • EmailId - Id of the email
  • PersonId - Id of the person that received the email
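As a rough illustration of how the Id columns link the tables, the sketch below joins receiver rows (EmailId, PersonId) to person rows (Id, Name) and groups the receiver names by email. The rows are invented placeholders, and `receivers_by_email` is a hypothetical helper name.

```python
from collections import defaultdict

# Made-up rows shaped like the person and receiver field lists above.
persons = [
    {"Id": 1, "Name": "Person A"},
    {"Id": 2, "Name": "Person B"},
]
email_receivers = [
    {"Id": 1, "EmailId": 10, "PersonId": 1},
    {"Id": 2, "EmailId": 10, "PersonId": 2},
    {"Id": 3, "EmailId": 11, "PersonId": 2},
]


def receivers_by_email(email_receivers, persons):
    """Map each EmailId to the names of the people who received that email."""
    name_by_id = {person["Id"]: person["Name"] for person in persons}
    result = defaultdict(list)
    for row in email_receivers:
        name = name_by_id.get(row["PersonId"])
        if name is not None:  # skip receivers with no matching person row
            result[row["EmailId"]].append(name)
    return dict(result)


print(receivers_by_email(email_receivers, persons))
```

Because the Ids can change when the data is regenerated, a lookup like this should be rebuilt from one consistent set of files rather than cached across versions.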


The SQLite database contains all of the above tables (Emails, Persons, Aliases, and EmailReceivers) with their corresponding fields. You can see the schema and ingest code in scripts/sqlImport.sql.
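A query against the database might look like the sketch below, which counts emails per sender by joining Emails.SenderPersonId to Persons.Id. To stay self-contained it builds a tiny in-memory database: the two CREATE TABLE statements are a simplified assumption covering only the columns used here, not the actual schema from scripts/sqlImport.sql, and `top_senders` is a hypothetical helper.

```python
import sqlite3


def top_senders(conn, limit=5):
    """Return (Name, NumEmails) pairs for the most frequent senders."""
    query = """
        SELECT p.Name, COUNT(*) AS NumEmails
        FROM Emails e
        JOIN Persons p ON p.Id = e.SenderPersonId
        GROUP BY p.Id, p.Name
        ORDER BY NumEmails DESC, p.Name ASC
        LIMIT ?
    """
    return conn.execute(query, (limit,)).fetchall()


# Assumed, simplified stand-in for the real database (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Persons (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Emails (
        Id INTEGER PRIMARY KEY,
        SenderPersonId INTEGER,
        MetadataSubject TEXT
    );
    INSERT INTO Persons VALUES (1, 'Person A'), (2, 'Person B');
    INSERT INTO Emails VALUES (1, 1, 'x'), (2, 2, 'y'), (3, 2, 'z');
""")

print(top_senders(conn))
```

To run the same query on the real data, pass `top_senders` a connection opened on the generated database file instead of the in-memory one.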

Contributing: next steps

  • Improve the From/To address extraction mechanisms
  • Normalize various email address representations to people
  • Improve the BodyText extraction
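One possible starting point for the alias normalization step is sketched below: reduce each raw From/To string to a simple key, then look that key up in an alias table. The cleanup rules, the sample aliases, and the names `normalize_alias` and `resolve_person` are all assumptions made for illustration; the repo's actual extraction code may work differently.

```python
import re


def normalize_alias(raw):
    """Reduce a raw From/To string to a comparable key (hypothetical rules)."""
    key = raw.lower()
    key = re.sub(r"<[^>]*>", " ", key)       # drop "<address>" parts
    key = re.sub(r"[^a-z0-9@. ]", " ", key)  # drop quotes, commas, semicolons, etc.
    return " ".join(key.split())             # collapse runs of whitespace


def resolve_person(raw, alias_to_person_id):
    """Return the PersonId for a raw field value, or None if the alias is unknown."""
    return alias_to_person_id.get(normalize_alias(raw))


# Made-up alias table: normalized alias text -> PersonId.
aliases = {"person a": 1, "a@example.com": 1, "person b": 2}

print(resolve_person('"Person A" <a@example.com>;', aliases))
print(resolve_person("A@Example.com", aliases))
print(resolve_person("Someone Else", aliases))
```

Keeping the normalization in one function means new cleanup rules can be added and tested without touching the lookup.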

Running the download and extraction code

Running make all in the root directory will download the data (~162 MB total) and create the output files, assuming you have all the requirements installed.


Requirements

This has only been tested on OS X; it may or may not work on other operating systems.

  • python3
    • pandas
    • arrow
    • numpy
  • pdftotext (utility to transform a PDF document to text)
  • GNU make
  • sqlite3
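Before running the build, a quick check along these lines can report which of the items listed above appear to be missing. This is only a sketch: `check_requirements` and `missing_requirements` are hypothetical helpers, and finding a `make` executable on the PATH does not confirm that it is GNU make.

```python
import importlib.util
import shutil

# Python packages and command-line tools from the list above.
PYTHON_MODULES = ["pandas", "arrow", "numpy"]
COMMANDS = ["pdftotext", "make", "sqlite3"]


def check_requirements():
    """Return a dict mapping each requirement name to True if it was found."""
    status = {}
    for module in PYTHON_MODULES:
        # find_spec returns None when a top-level module is not installed.
        status[module] = importlib.util.find_spec(module) is not None
    for command in COMMANDS:
        # shutil.which returns None when the executable is not on the PATH.
        status[command] = shutil.which(command) is not None
    return status


def missing_requirements():
    """Return the sorted names of requirements that were not found."""
    return sorted(name for name, found in check_requirements().items() if not found)


print(missing_requirements())
```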


The source PDF documents for this repo were downloaded from the WSJ Clinton Inbox search.

I created this project before I realized the WSJ had also open-sourced some of the code they used to create the Inbox Search. I have since included some material from their open source project as well: I used their HRCEMAIL_names.csv to seed alias_person.csv, and I scraped the metadata in a similar fashion to theirs.








