<DIV ALIGN=CENTER>

# Introduction to Social Media: Email
## Professor Robert J. Brunner
  
</DIV>  
-----
-----

## Introduction

In this IPython Notebook, we explore email messages as a data source for text analysis. Email is among the oldest forms of social media and has a fairly long history as an application area for text analysis. While an email message might seem fairly simple, it can actually be quite complicated. Part of this complication arises from the requirements to safely and securely transmit an electronic message through the Internet. The information needed for transmission, which is metadata about the message itself, is generally included in the email header.

The majority of the complication, however, arises because emails have moved beyond simple textual content to include multiple components within a single message. To handle multiple components, which can include images, documents, or just HTML-styled versions of the text content, email messages must enable multiple parts of a message to be identified and properly encoded and decoded (for example, binary or Unicode data) for safe and secure transmission.
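
To make the multipart structure concrete, the following minimal sketch builds a small message that carries both a plain text and an HTML version, using the standard library `EmailMessage` class. The addresses, subject, and body text are made up for illustration and are not taken from the course data.

```python
from email.message import EmailMessage

# Build a toy message; addresses and subject are hypothetical
demo = EmailMessage()
demo['From'] = 'instructor@example.com'
demo['To'] = 'student@example.com'
demo['Subject'] = 'Multipart demonstration'

# The first call sets a single text/plain body
demo.set_content('Hello, this is the plain text version.')

# Adding an alternative converts the message to multipart/alternative
demo.add_alternative(
    '<html><body><p>Hello, this is the <b>HTML</b> version.</p></body></html>',
    subtype='html')

print(demo.is_multipart())
print(demo.get_content_type())

# Each part carries its own content type
part_types = [part.get_content_type() for part in demo.iter_parts()]
print(part_types)
```

Printing `demo.as_string()` shows the boundary markers and per-part headers that a mail client uses to separate the two versions.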

In the rest of this Notebook, we first explore reading and parsing a simple email, including the header, the message itself, and the different parts (or payloads) within the message. Next, we develop a text classification pipeline from a public email corpus. Finally, we apply this pipeline to blind emails to quantify the accuracy of our simple classification pipeline.

---

## Email Text Parsing

Python provides built-in support for [processing email messages][pem], which are an often overlooked source of information in data science projects. The `email` library is part of the core Python distribution and includes support for parsing and constructing email messages; sending and receiving emails is handled by separate standard library modules such as `smtplib` and `imaplib`. For our purpose, we simply need to read in text and create an email `message`, which provides access to the basic email contents. The `message` instance provides access to the email header information as well as any payload data.

Normally the payload is the email message, but with multipart messages, like HTML email messages, an email can have multiple payloads. In the next several code cells, we create an email `message` by reading an email from a file (the email should look familiar). We subsequently explore the Python email message interface to extract email headers and the message payload, before grabbing the HTML message for later parsing.

First, we read in one demonstration email (that I sent out to a class) and display the message headers, which are accessible via dictionary keys. In the second code cell, we explicitly retrieve message header values by using the `msg` variable as a dictionary. Note that headers can be repeated (for example, the `Received` header, as different mail servers add additional header information while the message is transmitted). Finally, in the third code cell, we directly display part of the email message.
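
Because dictionary-style access returns only a single value, a repeated header such as `Received` needs the `get_all` method to recover every entry. The sketch below uses a small hand-written message (hypothetical content, not the course email) so that it runs without any data files.

```python
import email
from email import policy

# A tiny hand-written message with a repeated header (hypothetical content)
raw = (
    "Received: from mail1.example.com\n"
    "Received: from mail2.example.com\n"
    "From: Alice <alice@example.com>\n"
    "To: Bob <bob@example.com>\n"
    "Subject: Header demonstration\n"
    "\n"
    "Body text.\n"
)
toy = email.message_from_string(raw, policy=policy.default)

# Header names are matched case-insensitively; for a repeated header,
# dictionary access returns just one of the values
one_received = toy['received']

# get_all returns every value for a repeated header, as a list
all_received = toy.get_all('Received')

print(one_received)
print(all_received)

# A missing header yields None rather than raising KeyError
print(toy['X-Missing'])
```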

-----

[pem]: https://docs.python.org/3/library/email.html

In [1]:
# Import email library, policy controls how to process the data
import email as em
from email import policy

# For demonstration purposes, open one good email
with open("data/ham/prg.eml") as fin:
    msg = em.message_from_file(fin, policy=policy.default)

# Enable pretty printing of data structures
import pprint
pp = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=True)

# Display available header keys, 
# values can be displayed via dictionary access (next cell)
pp.pprint(msg.keys())

[ 'Received', 'Received', 'Received', 'Received', 'Received', 'Date', 'To',
  'From', 'Subject', 'Message-ID', 'X-Mailer', 'List-Id', 'List-Help',
  'X-Course-Id', 'X-Course-Name', 'Precedence', 'X-Auto-Response-Suppress',
  'Auto-Submitted', 'Content-Type', 'X-Spam-Reason', 'Return-Path',
  'X-MS-Exchange-Organization-AuthSource', 'X-MS-Exchange-Organization-AuthAs',
  'X-MS-Exchange-Organization-Antispam-Report',
  'X-MS-Exchange-Organization-SCL',
  'X-MS-Exchange-Organization-AVStamp-Mailbox', 'MIME-Version']


In [2]:
print('To:', msg['to'])
print('From:', msg['from'])
print('Subject:', msg['subject'])

To: Robert Brunner <xxxyyy@illinois.edu>
From: Robert Brunner <xxxyyy@illinois.edu>
Subject: INFO 490 RB2 SP16: Peer Review Grading


In [3]:
# Display subset of entire message
print(msg.as_string()[2340:2693])

-------------------------------------------------------------
Peer Review Grading
by Robert Brunner - Monday, 29 February 2016, 10:22 AM
---------------------------------------------------------------------
I have received several emails or private Moodle messages regarding peer
grading. The basic context seems to be concern about losing a few points



-----

After parsing, different components of the email message can be accessed via attributes or methods. This was demonstrated previously by accessing the header values via the `msg` variable, or the message text via the `as_string` method. These are just two of the available techniques, however, and we can display the entire suite by using `dir` to list the attributes and methods of the email message variable (excluding double-underscore names).

-----

In [4]:
# Print out message methods and attributes
pp.pprint([att for att in dir(msg) if '__' not in att])

[ '_add_multipart', '_body_types', '_charset', '_default_type', '_find_body',
  '_get_params_preserve', '_headers', '_make_multipart', '_payload',
  '_unixfrom', 'add_alternative', 'add_attachment', 'add_header', 'add_related',
  'as_bytes', 'as_string', 'attach', 'clear', 'clear_content', 'defects',
  'del_param', 'epilogue', 'get', 'get_all', 'get_body', 'get_boundary',
  'get_charset', 'get_charsets', 'get_content', 'get_content_charset',
  'get_content_disposition', 'get_content_maintype', 'get_content_subtype',
  'get_content_type', 'get_default_type', 'get_filename', 'get_param',
  'get_params', 'get_payload', 'get_unixfrom', 'is_attachment', 'is_multipart',
  'items', 'iter_attachments', 'iter_parts', 'keys', 'make_alternative',
  'make_mixed', 'make_related', 'policy', 'preamble', 'raw_items',
  'replace_header', 'set_boundary', 'set_charset', 'set_content',
  'set_default_type', 'set_param', 'set_payload', 'set_raw', 'set_type',
  'set_unixfrom', 'values', 'walk']


-----

An email message can contain multiple parts, for example attachments to an email, or both a plain text and an HTML-styled version of the message. To identify whether an email has multiple parts, we use the `is_multipart` method, after which we can extract the different parts (or _payloads_) from the message. For our demonstration email, both a plain text and an HTML-styled version of the text are included in the message (since the email was generated by Moodle). In the following code cells, we extract the plain text and HTML payloads and use them to display part of the message (for the plain text) and the entire message (for the HTML form).
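
Selecting parts by position assumes that the plain text part comes first and the HTML part second, which need not hold for every email. As a hedged alternative, the sketch below walks a toy two-part message (hypothetical content) and looks parts up by content type instead.

```python
from email.message import EmailMessage

# Toy two-part message (hypothetical content) standing in for the course email
toy = EmailMessage()
toy['Subject'] = 'Parts demonstration'
toy.set_content('Plain text body.')
toy.add_alternative('<p>HTML body.</p>', subtype='html')

# walk() visits the multipart container first, then each part in order
walked = [part.get_content_type() for part in toy.walk()]
print(walked)

# Look parts up by content type rather than by position
by_type = {part.get_content_type(): part
           for part in toy.walk() if not part.is_multipart()}
print(by_type['text/plain'].get_content())
print(by_type['text/html'].get_content())
```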

-----

In [5]:
# Display part of the message

if msg.is_multipart():
    data = msg.get_payload(0)
    html = msg.get_payload(1)

    print("Text Data:\n---------\n", data.get_content()[:941])

Text Data:
---------
 
INFO 490 RB2 SP16 -> Forums -> Announcements -> Peer Review Grading
https://learn.illinois.edu/mod/forum/discuss.php?d=938611
---------------------------------------------------------------------
Peer Review Grading
by Robert Brunner - Monday, 29 February 2016, 10:22 AM
---------------------------------------------------------------------
I have received several emails or private Moodle messages regarding peer
grading. The basic context seems to be concern about losing a few points
from one reviewer. 

To be clear, we do not intervene in Peer Assessment unless there has ben an
egregious violation. An example of this would be someone giving a zero and
saying 'The code does not run' when it clearly does run. Or 'The code is
empty' when the assignment was completed. Losing a few points occasionally
will not lower your grade. And other than write the best code you possibly
can and document it thoroughly, we do not have any advice!


In [6]:
# Grab content, as decoded text. Display as HTML
from IPython.display import HTML
HTML(html.get_content())

Peer Review Grading
by Robert Brunner - Monday, 29 February 2016, 10:22 AM

I have received several emails or private Moodle messages regarding peer grading. The basic context seems to be concern about losing a few points from one reviewer. To be clear, we do not intervene in Peer Assessment unless there has ben an egregious violation. An example of this would be someone giving a zero and saying 'The code does not run' when it clearly does run. Or 'The code is empty' when the assignment was completed. Losing a few points occasionally will not lower your grade. And other than write the best code you possibly can and document it thoroughly, we do not have any advice! The point of peer review is two-fold: 1) You get a chance to see how your peers completed an assignment. In some case, you might learn how to improve your own code, learn new coding tricks, or see ways that your coding style could improve. 2) You have your code reviewed by your peers and obtain comments from them that (hopefully) will be constructive and help you improve your work. In some cases, a review might grade you lower than you like, or in seeming contrast to others. Learning to deal with situations like this is part of the course. You will (probably) experience many of these situations later in your career, either via an interview (which is a form of peer assessment), in the workplace with pair coding or code reviews, or just via general interactions at conferences or public meetings. And if you ever release code online, expect comments and criticisms! Finally, please remember that this is a large online course, which Edward and I are building as the course is being offered.

Please refrain from frivolous questions or peer assessment regrade requests; otherwise, we will be forced to invoke point reductions: https://github.com/UI-DataScience/info490-sp16/blob/master/orientation/syllabus.md#point-reductions and also the special point reduction under the Peer Review section: https://github.com/UI-DataScience/info490-sp16/blob/master/orientation/syllabus.md#peer-review Having said this, please understand we encourage valid questions; we do not want to stifle the learning process. Robert

Reply  See this post in context


-----

We can also interact with a payload, either to obtain the payload data or to enable further processing of the payload (e.g., decoding). In the following cell, we demonstrate this by extracting the payload parameters and a subset of the raw, undecoded HTML payload.
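
Sequences such as `=3D` in a raw payload are characteristic of the quoted-printable transfer encoding, in which `=3D` stands for a literal `=` sign. The following sketch contrasts the raw payload with its decoded forms, using a hand-written part (hypothetical content) rather than the course email.

```python
import email
from email import policy

# Hand-written HTML part using quoted-printable encoding (hypothetical content)
raw = (
    "Content-Type: text/html; charset=UTF-8\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    '<body id=3D"email">caf=C3=A9</body>\n'
)
part = email.message_from_string(raw, policy=policy.default)

encoded = part.get_payload()               # Raw text, still quoted-printable
as_bytes = part.get_payload(decode=True)   # Transfer encoding removed, bytes
as_text = part.get_content()               # Bytes decoded to str via the charset

print(encoded)
print(as_bytes)
print(as_text)
```

For text parts, `get_content` is usually the most convenient choice since it applies both the transfer decoding and the declared character set.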

-----

In [7]:
# Display email parameters
print('Payload parameters: \n', html.get_params())
print(20*'-')

# Get raw payload, undecoded
print(html.get_payload()[:115])

Payload parameters: 
 [('text/html', ''), ('charset', 'UTF-8')]
--------------------
<html><head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dutf-8"></=
head>
<body id=3D"email">


-----

### Student Activity

In the preceding cells, we read an email message from a file, parsed the text, and extracted header information and payload content. Now that you have run the Notebook, go back and make the following changes to see how the results change.

1. Try reading other header content, such as 'Date', 'Message-ID', 'X-Mailer', and 'Precedence'. Do these values align with the message source (which you can see by viewing the file contents at the Unix command line)?
2. Try reading one of the other emails in the data directory for this week (you can use the JupyterHub server to browse directories and open the files, or view them directly on GitHub). Did anything change?
3. Save one of your own emails (from your mail client) and open it with this notebook. Can you parse and view the content?
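
For the third activity, an email saved from a mail client may contain characters that do not match the text encoding `open` assumes by default. One option, sketched here as an assumption rather than as the approach used in this Notebook, is to open the file in binary mode and let the parser apply the character set declared in the message. A temporary file with made-up content stands in for a saved email.

```python
import email
import os
import tempfile
from email import policy

# Write a small stand-in message to a temporary file (hypothetical content)
raw = (b"From: me@example.com\n"
       b"Subject: Saved message\n"
       b"Content-Type: text/plain; charset=utf-8\n"
       b"Content-Transfer-Encoding: 8bit\n"
       b"\n"
       b"na\xc3\xafve text\n")
with tempfile.NamedTemporaryFile(suffix='.eml', delete=False) as fout:
    fout.write(raw)
    path = fout.name

# Binary mode avoids guessing one text encoding for the whole file
with open(path, 'rb') as fin:
    saved = email.message_from_binary_file(fin, policy=policy.default)
os.remove(path)

print(saved['subject'])
print(saved.get_content())
```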

-----

## Email Classification

To demonstrate using email data in a text analysis project, we will build a [_spam_][sc] classification pipeline to identify emails that we do not wish to receive. For training data, we will use a public corpus of good (_ham_) and bad (_spam_) emails collected by an open-source spam classification tool known as [SpamAssassin][sa]. Each email is stored as a separate file in a directory that indicates the type of email. To access these data, we iterate through each file in the two directories (`ham` and `spam`).

One issue when performing text analysis is the memory requirement of dealing with large text data sets. Given the shared nature of this JupyterHub server, each Docker container has a limited amount of memory. As a result, we restrict the data size in this Notebook to only the first `max_files` emails in each folder. By default, this value is set to `500`, which should enable the notebook to work effectively on the JupyterHub server. But you can, of course, increase or decrease this value to study the effects of changing the quantity of training and testing data.
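
Directory listings are returned in arbitrary order, so which emails count as the "first" `max_files` can differ between machines. A small helper that sorts file names before applying the limit makes the subset reproducible; the function name `first_files` and the temporary directory below are hypothetical and used only for illustration.

```python
import os
import tempfile
from itertools import islice

def first_files(folder, max_files):
    """Return up to max_files file paths from folder, in sorted name order."""
    names = sorted(name for name in os.listdir(folder)
                   if os.path.isfile(os.path.join(folder, name)))
    return [os.path.join(folder, name) for name in islice(names, max_files)]

# Demonstrate on a temporary directory holding five empty files
with tempfile.TemporaryDirectory() as tmp:
    for i in range(5):
        open(os.path.join(tmp, 'msg{0}.eml'.format(i)), 'w').close()
    chosen = [os.path.basename(p) for p in first_files(tmp, 3)]

print(chosen)
```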

-----
[sc]: https://en.wikipedia.org/wiki/Email_spam
[sa]: https://spamassassin.apache.org/publiccorpus/

In [8]:
# https://spamassassin.apache.org/publiccorpus/readme.html
# Grabbed spam_2 and easyham_2

import os

mypath = '/home/data_scientist/data/email/'

ham = []
spam = []

# Max number of files to read
max_files = 500

# Read in good (ham) emails
for root, dirs, files in os.walk(os.path.join(mypath, 'ham')):
    for count, file in enumerate(files):
    
        # To control memory usage, we limit the number of files
        if count >= max_files:
            break
            
        with open(os.path.join(root, file), encoding='ISO-8859-1') as fin:
            msg = em.message_from_file(fin, policy=policy.default)

        # Reset for every message, so an email without a text/plain part
        # contributes an empty string instead of reusing the previous text
        data = b''
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                data = part.get_payload(None, decode=True)

        ham.append(data.decode(encoding='ISO-8859-1'))

# Read in bad (spam) emails
for root, dirs, files in os.walk(os.path.join(mypath, 'spam')):
    for count, file in enumerate(files):
        
        # To control memory usage, we limit the number of files
        if count >= max_files:
            break
           
        with open(os.path.join(root, file), encoding='ISO-8859-1') as fin:
            msg = em.message_from_file(fin, policy=policy.default)

        # Reset for every message (see the ham loop above)
        data = b''
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                data = part.get_payload(None, decode=True)

        spam.append(data.decode(encoding='ISO-8859-1'))



-----

To apply scikit-learn, we must convert the lists of emails to NumPy arrays. In the next cell we create arrays from these two email lists. We also create label arrays, before printing out the number of messages in each category and the overall storage requirements for each array. To reduce memory requirements, in the second cell we instruct the IPython kernel to delete the `ham` and `spam` Python variables and to also remove their values from the IPython cache.
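
The storage figures reported for these arrays can be much larger than the raw text, because a NumPy array of strings uses a fixed-width Unicode type sized for the longest element. The toy example below (made-up strings) illustrates the effect and, as one possible alternative, an object array that only stores references.

```python
import numpy as np

# Two short strings and one long one (toy data)
texts = ['spam', 'ham', 'x' * 1000]
arr = np.array(texts)

# Every element is padded to the longest string, at 4 bytes per character
print(arr.dtype)
print(arr.itemsize)   # 4000 bytes per element
print(arr.nbytes)     # 12000 bytes in total

# An object array stores references to the original Python strings instead
obj_arr = np.array(texts, dtype=object)
print(obj_arr.nbytes)
```

A single very long email therefore inflates the storage used for every other email in the same array.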

-----

In [9]:
# For text analysis, we need NumPy arrays.
# Convert the text lists to NumPy arrays
import numpy as np

pos_emails = np.array(ham)
neg_emails = np.array(spam) 

# Create label arrays
pos_labels = np.ones(pos_emails.shape[0])
neg_labels = np.zeros(neg_emails.shape[0])

# Display email counts, and memory usage
pbytes = pos_emails.nbytes / 1024**2
nbytes = neg_emails.nbytes / 1024**2

print('{0} Good Emails, requiring {1:6.1f} Mbs.'.format(pos_emails.shape[0], pbytes))
print('{0} Bad Emails, requiring {1:6.1f} Mbs.'.format(neg_emails.shape[0], nbytes))

500 Good Emails, requiring  170.5 Mbs.
500 Bad Emails, requiring   75.9 Mbs.


In [10]:
# Tell Python to delete ham and spam lists
# %xdel removes from IPython cache as well

%xdel ham
%xdel spam

-----

While we could use the `train_test_split` function in scikit-learn, in the next cell we instead create four new arrays by hand from our email data and labels. The following cell deletes the original two NumPy email arrays to reduce memory requirements. This approach gives us greater control over our memory usage, but it does limit our ability to easily rerun the pipeline with a different quantity of training data (since the NumPy arrays are deleted, we need to re-run almost the entire Notebook).
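
For comparison, here is a minimal sketch of the scikit-learn alternative, applied to toy stand-ins for the email and label arrays (hypothetical data and variable names). The `stratify` argument keeps the ham and spam proportions equal in the training and testing sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy stand-ins for the email and label arrays (hypothetical data)
emails = np.array(['ham {0}'.format(i) for i in range(5)] +
                  ['spam {0}'.format(i) for i in range(5)])
labels = np.concatenate((np.ones(5), np.zeros(5)))

# Hold out 40% for testing; stratify preserves the ham/spam ratio in both sets
xs_train, xs_test, ys_train, ys_test = train_test_split(
    emails, labels, test_size=0.4, stratify=labels, random_state=23)

print(xs_train.shape[0], xs_test.shape[0])
print(ys_test.sum())
```

Unlike the manual slicing, this shuffles the data before splitting, so the test set is not simply the last emails read from each folder.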

-----

In [11]:
# We split positive/negative emails into two groups test/train each. 
# This value must be less than max_files.
split_value = 300

# We combine neg and positive into four arrays.
x_train = np.concatenate((pos_emails[:split_value], 
                          neg_emails[:split_value]), axis = 0)

x_test = np.concatenate((pos_emails[split_value:],
                         neg_emails[split_value:]), axis = 0)

y_train = np.concatenate((pos_labels[:split_value], 
                          neg_labels[:split_value]), axis = 0)

y_test = np.concatenate((pos_labels[split_value:],
                         neg_labels[split_value:]), axis = 0)

In [12]:
# Remove the original two numpy arrays

%xdel pos_emails
%xdel neg_emails

-----

We now have our training and testing data and labels. As demonstrated previously, we can create a text analysis pipeline to tokenize the data before applying a simple Naive Bayes classifier. The following code cell shows that, even with a limited number of training emails, precision and recall on the held-out emails are both near 0.95.
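
To see what the tokenizing step produces, the sketch below applies a `CountVectorizer` with English stop words and unigrams plus bigrams to a single made-up sentence. Note that bigrams are formed after stop words are removed, so words that were not adjacent in the original text can end up paired.

```python
from sklearn.feature_extraction.text import CountVectorizer

# One toy sentence; 'the' and 'is' are in the built-in English stop word list
docs = ['The offer is free money']

cv = CountVectorizer(stop_words='english', ngram_range=(1, 2), lowercase=True)
cv.fit(docs)

# vocabulary_ maps each unigram and bigram to its column index
features = sorted(cv.vocabulary_)
print(features)
```

Here "offer free" appears as a bigram even though "is" separated the two words in the sentence.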

-----

In [13]:
# Now perform classification, via pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn import metrics

tools = [('cv', CountVectorizer()), ('nb', MultinomialNB())]
pclf = Pipeline(tools)

# Lowercase, bigrams, stop words.
pclf.set_params(cv__stop_words='english',
                cv__ngram_range=(1, 2),
                cv__lowercase=True)

pclf.fit(x_train, y_train)
y_pred = pclf.predict(x_test)

# Names follow sorted label order: 0 = spam (neg_labels), 1 = ham (pos_labels)
print(metrics.classification_report(y_test, y_pred, target_names=['Spam', 'Ham']))

             precision    recall  f1-score   support

       Spam       0.98      0.93      0.95       200
        Ham       0.93      0.98      0.95       200

avg / total       0.95      0.95      0.95       400



-----

### Blind Testing

While the above performance is impressive, the real test of our simple pipeline is its application to completely new emails. The GitHub repository for this week includes a small number of emails that have been labeled by the instructors. The good emails are emails sent out by Moodle, while the bad emails are spam messages that made it through the campus filters.

To apply our pipeline to these emails, we first need to read these messages. Since at least some of these messages are multipart, we need to carefully process the text data. First, we must read the files using the appropriate encoding. Second, we must grab the text payload, or else we will be classifying a message based in part on HTML markup. Finally, to simplify memory management, we classify each email with the pipeline immediately after it has been processed.
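
One way to isolate the text payload, offered as a sketch rather than as the method used in the next cell, is the `get_body` method, which returns the preferred body part or `None` when no such part exists. The helper name `plain_text_of` and the two toy messages are hypothetical.

```python
import email
from email import policy

def plain_text_of(message):
    """Return the text/plain body of a parsed message, or '' if there is none."""
    body = message.get_body(preferencelist=('plain',))
    if body is None:
        return ''
    return body.get_content()

# Toy messages (hypothetical content): one with plain text, one HTML-only
with_text = email.message_from_string(
    "Subject: a\nContent-Type: text/plain\n\nhello there\n",
    policy=policy.default)
html_only = email.message_from_string(
    "Subject: b\nContent-Type: text/html\n\n<p>hello</p>\n",
    policy=policy.default)

print(repr(plain_text_of(with_text)))
print(repr(plain_text_of(html_only)))
```

Returning an empty string for HTML-only messages makes the missing text explicit, so such emails can be skipped or handled separately before classification.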

-----

In [14]:
mypath = 'data'

print('Ham File Processing:')
print(20*'-')
for root, dirs, files in os.walk(os.path.join(mypath, 'ham')):
    for file in files:
        with open(os.path.join(root, file), encoding='ISO-8859-1') as fin:
            msg = em.message_from_file(fin, policy=policy.default)

        # Keep the text/plain payload; reset for every message
        data = b''
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                data = part.get_payload(None, decode=True)

        # Classify once per message (label 0 is spam, label 1 is ham)
        lst = [data.decode(encoding='ISO-8859-1')]
        value = pclf.predict(np.array(lst))
        if value[0] == 0:
            my_pred = 'spam'
        else:
            my_pred = 'ham'

        print('{0} classified as {1}'.format(file, my_pred))

print(20*'-')
print('Spam File Processing:')
print(20*'-')
for root, dirs, files in os.walk(os.path.join(mypath, 'spam')):
    for file in files:
        with open(os.path.join(root, file), encoding='ISO-8859-1') as fin:
            msg = em.message_from_file(fin, policy=policy.default)

        # Keep the text/plain payload; reset for every message
        data = b''
        for part in msg.walk():
            if part.get_content_type() == 'text/plain':
                data = part.get_payload(None, decode=True)

        # Classify once per message (label 0 is spam, label 1 is ham)
        lst = [data.decode(encoding='ISO-8859-1')]
        value = pclf.predict(np.array(lst))
        if value[0] == 0:
            my_pred = 'spam'
        else:
            my_pred = 'ham'

        print('{0} classified as {1}'.format(file, my_pred))

Ham File Processing:
--------------------
kmdb.eml classified as spam
prg.eml classified as ham
pvc.eml classified as spam
rw6.eml classified as spam
rw7.eml classified as spam
w7.eml classified as ham
--------------------
Spam File Processing:
--------------------
add.eml classified as spam
cnn.eml classified as spam
lcm.eml classified as ham
mws.eml classified as ham
prs.eml classified as spam
vsfih.eml classified as ham
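
Tallying the listing above gives a rough sense of how the pipeline fares on the blind emails. The short sketch below simply counts the predictions copied from that output; with only twelve emails, the resulting figure should be read as indicative rather than precise.

```python
# Predictions copied from the output above, grouped by the instructors' labels
ham_preds = ['spam', 'ham', 'spam', 'spam', 'spam', 'ham']
spam_preds = ['spam', 'spam', 'ham', 'ham', 'spam', 'ham']

ham_correct = ham_preds.count('ham')
spam_correct = spam_preds.count('spam')
total = len(ham_preds) + len(spam_preds)

accuracy = (ham_correct + spam_correct) / total
print('Ham correct: {0} of {1}'.format(ham_correct, len(ham_preds)))
print('Spam correct: {0} of {1}'.format(spam_correct, len(spam_preds)))
print('Blind accuracy: {0:.2f}'.format(accuracy))
```

Five of the twelve blind emails are classified correctly, well below the scores on the held-out corpus emails, which suggests the public corpus differs substantially from these course and campus messages.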


-----

### Student Activity

In the preceding cells, we developed and applied a simple Naive Bayes spam classifier. Now that you have run the Notebook, go back and make the following changes to see how the results change.

1. Change the `split_value`, both to lower and higher values. How does the validation sample perform? How does our blind data classification perform?
2. Change parameters in the classification pipeline, for example, the n-grams, the max number of features, stemming, or parameters in the Bayesian classifier. How does the validation sample perform? How does our blind data classification perform?
3. Try increasing the number of emails read, perhaps in increments of fifty, and see how the results change (for example, change `max_files` to 550, 600, ..., 750).  You can also modify the code to extract different emails, by using the if statement to only read in files in a specified range (like 500 to 1000).
4. Try employing a different classifier, such as Linear SVC with regularization. Do the results change?
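
As a starting point for the fourth activity, the sketch below swaps the Naive Bayes step for a `LinearSVC` in the same kind of pipeline. The toy corpus, the variable names, and the choice of `C=1.0` are assumptions for illustration; in the Notebook you would fit on `x_train` and `y_train` instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (hypothetical); labels follow the Notebook: 1 = ham, 0 = spam
toy_x = ['meeting notes attached', 'homework grading question',
         'win free money now', 'free prize click now']
toy_y = [1, 1, 0, 0]

# C is the inverse regularization strength; smaller C means stronger regularization
svc_clf = Pipeline([('cv', CountVectorizer()),
                    ('svc', LinearSVC(C=1.0, random_state=0))])
svc_clf.fit(toy_x, toy_y)

toy_pred = svc_clf.predict(['free money prize'])
print(toy_pred)
```

Varying `C` (for example 0.01, 0.1, 1, 10) is one simple way to explore how regularization affects both the validation and the blind results.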

-----