This program reads an email .mbox file and writes a text file containing the Subject, Date, Sender, and body of every email as plaintext. A different output format might be easier to work with, and switching to a .csv or shelve would be simple. The output is human-readable and can be used for language analysis.

I signed my Gmail up for campaign emails from the Donald Trump campaign in 2017 and received roughly 380 emails in that time.

I had all the emails under a single tag, and I used Google Takeout to create a .mbox file that contains every email with that tag.
https://takeout.google.com/settings/takeout

I had tried a plugin that pulls emails with a given tag and exports them as PDFs to Google Drive, but I settled on the .mbox from Takeout because getting plain text out of the PDFs ended up being far more difficult than it should have been. Both parsers I tried (PyPDF2, Kita) produced output that was missing parts or had the text too mixed with formatting data.

Imports:
mailbox -- message and mbox classes for opening the .mbox file and extracting messages

re -- regular expression module, used to remove hyperlinks and HTML entities such as "&larr;"

os -- operating system module, for changing the working directory as needed

BeautifulSoup -- for pulling data from HTML.

In [23]:
import mailbox
import re
import os
from bs4 import BeautifulSoup

Set the working directory and the .mbox file to use, and set the variables that depend on the recipient of the emails. These identifying strings are replaced with generic placeholders in the output.

In [26]:
os.chdir('D:\\DataAnalysisProjects\\TrumpEmails\\mBoxFile\\Mail\\TEST\\')
mb = mailbox.mbox('Trump.mbox')
# the name of the output text file
outputFileName = 'mboxOutput1.txt'
RECIPIENT_EMAIL = 'nathanmkemp@gmail.com'
# every time the name appears in the emails I'm working with, it is in one of these formats
RECIPIENT_FIRNAME = '\nN,'
RECIPIENT_FULLNAME = '\nN K'

Create the regular expressions that will be used for cleaning up the messages

In [20]:
# regex for hyperlinks (matches both http:// and https://)
regexHyper = re.compile(r'https?://\S+', re.IGNORECASE)
# regex for quotes and apostrophes: 5-letter entities such as &rsquo; &ldquo; &rdquo;
regexQuote = re.compile(r'\&\w{5};')
# regex for arrows: 4-letter entities such as &larr; (this also matches &nbsp;)
regexArr = re.compile(r'\&\w{4};')
# regex for all other &...; html entities
regexOther = re.compile(r'\&\S{3,6};')
# regex for blocks of whitespace: 2+ line breaks with any number of spaces between them
regexWhiteSpace = re.compile(r'(\n+\s+\n)+')
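
As an alternative to matching entities by their length, the standard library's html.unescape() converts named entities to the characters they stand for. The following is only a sketch of that approach, not what this notebook does. The sample string and the asciiQuotes name are made up for illustration.

```python
import html
import re

# Hypothetical sample text, similar in shape to the campaign emails.
sample = "we&rsquo;ve seen it &larr; click http://example.com/?qs=abc now"

# Convert entities such as &rsquo; and &larr; to their Unicode characters.
unescaped = html.unescape(sample)

# Map curly quotes to plain ASCII so a later ASCII-only encode keeps them.
asciiQuotes = unescaped.replace("\u2019", "'").replace("\u2018", "'")
asciiQuotes = asciiQuotes.replace("\u201c", '"').replace("\u201d", '"')

# Hyperlink substitution in the style of the notebook's regexHyper.
regexHyper = re.compile(r'https?://\S+', re.IGNORECASE)
cleaned = regexHyper.sub('++HYPERLINK++', asciiQuotes)
print(cleaned)
```

The arrow is still a non-ASCII character at this point, so it would be dropped by an ASCII encode with errors set to ignore unless it is mapped to something else first.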


The unloadMessagePayload() function takes a mailbox.Message as message and returns a string containing the metadata for the message (subject, date, sender) followed by the body. The string is human-readable and without html formatting, but it still needs to be cleaned up before it is stored.

The keys included in the header were selected based on what I expect to be most useful. If you open the .mbox file in Notepad, you can read through it and find all the available keys ('X-Received', for example).
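
The available keys can also be listed from Python instead of reading the raw file. This is a minimal sketch that builds one message in memory. The raw text, the addresses, and the headerKeys and replyTo names are made up for illustration; with the real file, each message would come from iterating over mb.

```python
import mailbox

# Hypothetical raw message standing in for one entry of the .mbox file.
raw = (
    "From: sender@example.com\n"
    "Subject: Add your name\n"
    "Date: Mon, 07 Jan 2019 14:48:35 -0600\n"
    "X-Received: by 10.0.0.1\n"
    "\n"
    "Body text\n"
)
message = mailbox.mboxMessage(raw)

# keys() lists every header name on the message, in order of appearance.
headerKeys = message.keys()
print(headerKeys)

# A header that is not present returns None from message['...'];
# get() lets you supply a default instead.
replyTo = message.get('Reply-To', '(none)')
print(replyTo)
```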

The body has to be extracted; some messages are multipart, some are not. 

When messages are multipart, all parts are extracted from the "payload" of the message as a list of strings, and only the first value in the list is saved. In the case of the emails I am working with, the first part (index 0) is the message without html formatting and is exactly what we want. The rest are html formatting and the message as html.

When messages are single-part, the payload is extracted as a single string of html. BeautifulSoup() extracts the text from the html string, but this also removes hyperlinks. Single-part messages are stamped with "++LINKS_REMOVED_FOR_THIS_MESSAGE++" to reflect this and this should be kept in mind during analysis.

In [21]:
def unloadMessagePayload(message):
    payloadParts = []
    printString = ''

    # Create message header. str() guards against a missing header, which comes back as None.
    printString += ('**'*20 + '\n')
    printString += ('Subject:     ' + str(message['Subject']) + '\n')
    printString += ('Date:        ' + str(message['Date']) + '\n')
    printString += ('From:        ' + str(message['From']) + '\n')
    printString += ('Body:        ' + '\n')

    # Check to see if the message is a multi-part message.
    if message.is_multipart():
        # If it is multipart, get all parts of the payload, but only keep the first one.
        # In these emails the first part looks like plaintext and the second is the email as html.
        for part in message.get_payload():
            payloadParts.append(part.get_payload())
        printString += str(payloadParts[0])
    else:
        # Single-part messages are html; BeautifulSoup strips the tags, and the hyperlinks with them.
        payloadString = message.get_payload()
        payloadString = BeautifulSoup(payloadString, features="lxml").text
        # split(...)[2] keeps the text that follows the second occurrence of this CSS rule.
        printString += '\n++LINKS_REMOVED_FOR_THIS_MESSAGE++ ' + payloadString.split('{ display: none !important; }')[2]

    return printString
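
Taking index 0 works for the emails described here, but it depends on the order of the parts. A more defensive variant is to walk the message and pick the first part whose content type is text/plain. The sketch below is an alternative, not the method used in this notebook. The firstPlainTextPart helper and the sample message are hypothetical.

```python
from email import message_from_string

# Hypothetical two-part message: a plain-text part followed by an html part.
raw = (
    "From: sender@example.com\n"
    "Subject: Add your name\n"
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "\n"
    "Plain body\n"
    "--XYZ\n"
    "Content-Type: text/html; charset=utf-8\n"
    "\n"
    "<p>HTML body</p>\n"
    "--XYZ--\n"
)
message = message_from_string(raw)

def firstPlainTextPart(message):
    """Return the first text/plain part as a str, or None if there is none."""
    for part in message.walk():
        if part.get_content_type() == 'text/plain':
            # decode=True undoes base64 / quoted-printable and returns bytes.
            payloadBytes = part.get_payload(decode=True)
            charset = part.get_content_charset() or 'utf-8'
            return payloadBytes.decode(charset, errors='replace')
    return None

plainBody = firstPlainTextPart(message)
print(plainBody)
```

walk() also visits nested multipart containers, so this handles messages where a part is itself multipart, a case in which part.get_payload() would return a list rather than a string.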

The cleanUpMessageBody() function takes a string as input and cleans it up and outputs a string for storage in a text file.

It replaces the html entities with plaintext, and replaces terms that identify the user with generic terms. It also replaces hyperlinks with ++HYPERLINK++, but as mentioned in unloadMessagePayload(), this will only affect multi-part message payloads.

Finally, it encodes the string to ASCII bytes and decodes it back, dropping any non-ASCII characters so it can be stored as a text file. I'm not concerned with the characters that get dropped, so the error handling is set to ignore.

In [24]:
def cleanUpMessageBody(payloadString):
    payloadString = regexQuote.sub("'",payloadString)
    payloadString = regexArr.sub("--",payloadString)
    payloadString = regexOther.sub("",payloadString)
    payloadString = payloadString.replace(RECIPIENT_EMAIL,'++RECIPIENT_EMAIL++')
    payloadString = payloadString.replace(RECIPIENT_FIRNAME,'\n++RECIPIENT_FIRNAME++')
    payloadString = payloadString.replace(RECIPIENT_FULLNAME,'\n++RECIPIENT_FULLNAME++')
    # use the regular expression defined above to replace all hyperlinks with "++HYPERLINK++"
    payloadString = regexHyper.sub('++HYPERLINK++',payloadString)
    payloadString = regexWhiteSpace.sub('\n',payloadString)
    return payloadString.encode('ascii','ignore').decode('ascii')
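
To illustrate what the ASCII round trip drops, here is a small sketch on a made-up string. The text, replacements, dropped, and kept names are for illustration only. It also shows why any character worth keeping has to be mapped to an ASCII equivalent before the encode step.

```python
# Hypothetical text containing a curly apostrophe, an accented letter and an arrow.
text = "we\u2019ve seen caf\u00e9 prices \u2190 rise"

# Encoding with errors='ignore' silently drops every non-ASCII character.
dropped = text.encode('ascii', 'ignore').decode('ascii')
print(dropped)

# Mapping selected characters to ASCII first preserves them through the round trip.
replacements = str.maketrans({"\u2019": "'", "\u2190": "<-"})
kept = text.translate(replacements).encode('ascii', 'ignore').decode('ascii')
print(kept)
```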

This is the main program. First, a text file is created in write mode. It will overwrite whatever is currently in that file.

mb is the .mbox file identified. For each message in mb, the payload is extracted, cleaned up, and written to the output text document. When the loop is done, the text file is closed.

In [27]:
# create an output file in write mode; the with block closes it when the loop is done
with open(outputFileName,'w') as outputText:
    for message in mb:
        outputString = unloadMessagePayload(message)
        outputString = cleanUpMessageBody(outputString)
        outputText.write(outputString)
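
If a .csv is preferred over the plain text file, the same loop can write one row per message with the csv module. This is a rough sketch under a few assumptions: the two messages are made up, they are single-part, and an in-memory buffer stands in for the output file.

```python
import csv
import io
import mailbox

# Hypothetical messages standing in for the entries of the .mbox file.
rawMessages = [
    "From: a@example.com\nSubject: First\nDate: Mon, 07 Jan 2019 14:48:35 -0600\n\nBody one\n",
    "From: b@example.com\nSubject: Second\nDate: Tue, 08 Jan 2019 09:00:00 -0600\n\nBody two\n",
]
messages = [mailbox.mboxMessage(raw) for raw in rawMessages]

# An in-memory buffer is used here; a real file would be opened with newline=''.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['Subject', 'Date', 'From', 'Body'])
for message in messages:
    writer.writerow([
        message['Subject'],
        message['Date'],
        message['From'],
        message.get_payload(),  # single-part messages only in this sketch
    ])

# Read the rows back to check that commas and line breaks survived the quoting.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(rows[1])
```

The csv module quotes fields that contain commas or line breaks, so the Date header and multi-line bodies come back intact when the file is read with csv.reader.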

In [33]:
for message in mb:
    outputString = unloadMessagePayload(message)
    print('@@@@The message payload before it is cleaned up@@@@ \n' + outputString[:1000])
    outputString = cleanUpMessageBody(outputString)
    print('@@@@The cleaned up text that is written to .txt@@@@\n' + outputString[:1000])
    break

@@@@The message payload before it is cleaned up@@@@ 
****************************************
Subject:     Add your name
Date:        Mon, 07 Jan 2019 14:48:35 -0600
From:        "Donald J. Trump" <contact@victory.donaldtrump.com>
Body:        

  
 

 
http://click.campaigns.rnchq.com/?qs=9a3839cb826896667af03c002e9ae495990de1d3627655b9e3ffb3405532423c9ecd88bd1941637fe020f94dc094dcdf5d04263c02ee1779 

N, for the past two weeks we&rsquo;ve seen just how low the Democrats will sink in order to obstruct our agenda and everything we&rsquo;ve accomplished since 2016.


Nancy Pelosi and Chuck Schumer shut down YOUR government by choosing illegal immigrants over hard-working American patriots like YOU. NO MORE!


The only reason Democrats shut down the government is because they know they can&rsquo;t win in 2020. They don&rsquo;t care about your safety, they only care about Presidential Harassment!


We showed them in 2016, now let&rsquo;s show them again that the silent majority is back and