# Code for Chapter 4 

In this case study we will attempt to write a "priority inbox" algorithm for ranking email by some measures of importance. We will define these measures based on a set of email features, which moves beyond the simple work counts used in Chapter 3.

Set the global paths

In [1]:
import os

dataSpamDir = '../03-Classification/data/easy_ham/'
mailPaths = os.listdir(dataSpamDir)
mailPaths = [f'../03-Classification/data/easy_ham/{i}' for i in mailPaths]

We define a set of function that will extract the data for the feature set we have defined to rank email impportance. This includes the following: message body, message source, message subject, and date the message was sent.

In [2]:
#Simply returns the text of a given email message
def readMsg(path):
    lines = open(path, encoding="latin-1").readlines()
    return lines

In [3]:
#Similar to the function from Chapter 3, this returns only the message body for a given email.
def getBodyMsg(lines):
    startIndex = lines.index('\n')
    return ''.join(lines[startIndex+1 : len(lines)])

In [4]:
# Retuns the email address of the sender for a given email message
import re

def getSendersEmail(lines):
    sendersEmail = ''
    for line in lines:
        if line.startswith('From:'):
            sendersEmail = re.search(r'[\w\.-]+@[\w\.-]+', line).group(0)
            break

    return sendersEmail.lower()

In [5]:
# Retuns the subject string for a given email message
def getSubject(lines):
    prefix = 'Subject:'
    subject = ''
    for line in lines:
        if line.startswith(prefix):
            subject = line[len(prefix):].strip()
            break

    return subject.lower()

In [6]:
# Retuns the date a given email message was received
from datetime import datetime
import re

def tryParsingDate(text):
    date = text.rsplit('(edt)', 1)[0].strip()
    date = date.rsplit('(cest)', 1)[0].strip()
    date = date.rsplit('(pdt)', 1)[0].strip()
    date = date.rsplit('(bst)', 1)[0].strip()
    date = date.rsplit('(ist)', 1)[0].strip()
    date = date.rsplit('(cdt)', 1)[0].strip()
    date = date.rsplit('(est)', 1)[0].strip()
    date = date.rsplit('(eest)', 1)[0].strip()
    date = date.rsplit('(msd)', 1)[0].strip()
    date = date.rsplit('(gmt)', 1)[0].strip()
    date = date.rsplit('(pst)', 1)[0].strip()
    date = date.rsplit('ut', 1)[0].strip()
    date = date.rsplit('edt', 1)[0].strip()
    
    for fmt in ('%a, %d %b %Y %H:%M:%S %z', '%a, %d %b %Y %H:%M:%S %Z', '%d %b %Y %H:%M:%S %z', '%a, %d %b %Y %H:%M:%S'):
        try:
            return datetime.strptime(date, fmt)
        except ValueError:
            pass
    raise ValueError('no valid date format found ', date)

def getDate(lines):
    firstPrefix = 'Date:'
    secondPrefix = 'X-Original-Date:'
    date = ''
    for line in lines:
        if line.startswith(firstPrefix):
            date = line[len(firstPrefix):].strip()
            break
        elif line.startswith(secondPrefix):
            date = line[len(secondPrefix):].strip()
            break
    
    if date == '':
        return ''
    
    return tryParsingDate(date.lower())

## Create DataFrame with data

In [7]:
# This function ties all of the above helper functions together.
# It returns a vector of data containing the feature set
# used to categorize data as priority or normal HAM
import pandas as pd

df = pd.DataFrame({}, columns = ['Date','Email', 'Subject', 'Body', 'Path'])

for mailPath in mailPaths:
    if ('.ipynb_checkpoints' not in mailPath):
        msgLines = readMsg(mailPath)
        date = getDate(msgLines)
        email = getSendersEmail(msgLines)
        subject = getSubject(msgLines)
        bodyMsg = getBodyMsg(msgLines)
        df = df.append({'Date': date, 'Email': email, 'Subject': subject, 'Body': bodyMsg, 'Path': mailPath}, ignore_index=True)

In [8]:
# Order the messages chronologically
df['Date'] = pd.to_datetime(df.Date, utc=True)
df = df.sort_values(by=['Date'])

In [9]:
df.head()

Unnamed: 0,Date,Email,Subject,Body,Path
1060,2002-02-01 05:44:14+00:00,robinderbains@shaw.ca,please help a newbie compile mplayer :-),"\n Hello,\n \n I just installed ...",../03-Classification/data/easy_ham/01061.66101...
1061,2002-02-01 06:53:41+00:00,lance_tt@bellsouth.net,re: please help a newbie compile mplayer :-),Make sure you rebuild as root and you're in th...,../03-Classification/data/easy_ham/01062.ef795...
1062,2002-02-01 09:01:44+00:00,robinderbains@shaw.ca,re: please help a newbie compile mplayer :-),Lance wrote:\n\n>Make sure you rebuild as root...,../03-Classification/data/easy_ham/01063.ad344...
1063,2002-02-01 09:29:23+00:00,matthias@egwn.net,re: please help a newbie compile mplayer :-),"Once upon a time, rob wrote :\n\n> I dl'd gcc...",../03-Classification/data/easy_ham/01064.9f4fc...
1064,2002-02-01 13:00:22+00:00,harri.haataja@cs.helsinki.fi,http://apt.nixia.no/,\n--6sX45UoQRIJXqkqR\nContent-Type: text/plain...,../03-Classification/data/easy_ham/01065.b1ad1...


In [10]:
df.shape

(2500, 5)

Create train and test dataset

In [11]:
# We will use the first half of the priority.df to train our priority in-box algorithm.
# Later, we will use the second half to test.
import numpy as np

rows = int(df.shape[0] / 2)
df_train = pd.DataFrame(df.iloc[:rows])
df_test = pd.DataFrame(df.iloc[rows+1:])

print([df_train.shape, df_test.shape])

[(1250, 5), (1249, 5)]


Group messages by thread

In [12]:
def cleanSubject(subject):
    if subject.startswith('re: '):
        return subject[4:]
    else:
        subject
    
        
df_train['Clean_Subject'] = df_train.apply(lambda row: cleanSubject(row.Subject), axis = 1) 

uniqueSubjects = df_train['Clean_Subject'].unique()
uniqueSubjects = pd.DataFrame({'Clean_Subject': uniqueSubjects})
uniqueSubjects['Thread_Index'] = uniqueSubjects.index

df_train = pd.merge(df_train, uniqueSubjects, on='Clean_Subject')

Group messages by mail

In [13]:
uniqueMails = df_train['Email'].unique()
uniqueMails = pd.DataFrame({'Email': uniqueMails})
uniqueMails['Email_Index'] = uniqueMails.index

df_train = pd.merge(df_train, uniqueMails, on='Email')

Show thread and email popularity

In [14]:
df_train['Clean_Subject'].value_counts()[:10] 

java is for kiddies                                 27
selling wedded bliss (was re: ouch...)              25
sorting                                             21
hanson's sept 11 message in the national review     19
new sequences window                                18
alsa (almost) made easy                             18
[satalk] o.t. habeus -- why?                        18
the gov gets tough on net users.....er pirates..    13
secure sofware key                                  13
slaughter in the name of god                        13
Name: Clean_Subject, dtype: int64

In [15]:
df_train['Email'].value_counts()[:10] 

tim.one@comcast.net            45
tomwhore@slack.net             37
pudge@perl.org                 34
garym@canada.com               29
yyyy@spamassassin.taint.org    25
skip@pobox.com                 24
beberg@mithral.com             23
matthias@egwn.net              23
cdale@techmonkeys.net          21
cwg-exmh@deepeddy.com          20
Name: Email, dtype: int64

Calculating the length of each thread 

In [16]:
df_threads = pd.DataFrame({}, columns = ['Thread_Index', 'Length'])

for thread_index in df_train['Thread_Index'].unique():
    dates = df_train[df_train['Thread_Index'] == thread_index].Date
    length = (dates.max() - dates.min()).total_seconds()
    df_threads = df_threads.append({'Thread_Index': thread_index, 'Length': length}, ignore_index=True)

df_threads.sort_values(by=['Length'], ascending = False).head()

Unnamed: 0,Thread_Index,Length
0,0.0,20109455.0
225,141.0,1402129.0
160,11.0,1390105.0
132,150.0,1380639.0
25,183.0,792234.0


Calculating the number of mais per seconds in each thread

In [17]:
mails_in_threads = df_train['Thread_Index'].value_counts()

def calculateMailsPerSecond(row):
    thread_index = row['Thread_Index']
    thread_length = row['Length']
    
    if (thread_index == 0) or (thread_length == 0):
        mails_in_thread = 1
        mails_per_second = 0
    else:
        mails_in_thread = mails_in_threads[thread_index]
        mails_per_second = mails_in_thread / thread_length
    
    return mails_per_second
    
        
df_threads['Mails_Per_Second'] = df_threads.apply(lambda row: calculateMailsPerSecond(row), axis = 1) 
df_threads.head()

Unnamed: 0,Thread_Index,Length,Mails_Per_Second
0,0.0,20109455.0,0.0
1,1.0,9342.0,0.000321
2,2.0,259584.0,3.5e-05
3,6.0,239899.0,2.1e-05
4,8.0,11648.0,0.000515


## Calculating scoring
Scoring is a result of multiply two other scores:
 - scoring of mail popularity, if the mail is more popular (exists more often), the result is greater,
 - scoring of thread popularity, if the thread contains more messages per second, it's better.
 
We need to add '1' to basic value because we can't calculate logarithm for '0' value. Additionally, we add 10 to result value, because, for value (0,1), the logarithm result is negative.

In [18]:
import math
mails_popularity = df_train['Email'].value_counts()

def calculateScore(row):
    mail_popularity = mails_popularity[row['Email']]
    mail_score = 10 + math.log(mail_popularity + 1)
    
    mails_per_second = df_threads[df_threads['Thread_Index'] == row['Thread_Index']].Mails_Per_Second
    mails_per_second_score = 10 + math.log(mails_per_second + 1)
    
    return mail_score * mails_per_second_score

df_train['Score'] = df_train.apply(lambda row: calculateScore(row), axis = 1) 

Show the result. Messages grouped in the same thread and mail have bigger scoring

In [19]:
df_train.sort_values(by=['Score'], ascending = False).head()

Unnamed: 0,Date,Email,Subject,Body,Path,Clean_Subject,Thread_Index,Email_Index,Score
625,2002-09-08 20:46:47+00:00,tim.one@comcast.net,[spambayes] hammie.py vs. gbayes.py,"[Guido]\n> There seem to be two ""drivers"" for ...",../03-Classification/data/easy_ham/01724.9dd46...,,0,135,138.286414
628,2002-09-09 03:36:00+00:00,tim.one@comcast.net,[spambayes] testing results,[Neil Schemenauer]\n> Woops. I didn't have th...,../03-Classification/data/easy_ham/01729.01f5d...,,0,135,138.286414
618,2002-09-08 07:18:49+00:00,tim.one@comcast.net,[spambayes] test sets?,[Guido]\n> I *meant* to say that they were 0.9...,../03-Classification/data/easy_ham/01715.30f57...,,0,135,138.286414
619,2002-09-08 07:48:28+00:00,tim.one@comcast.net,[spambayes] test sets?,[Tim]\n> ...\n> I'd prefer to strip HTML tags ...,../03-Classification/data/easy_ham/01716.8e154...,,0,135,138.286414
620,2002-09-08 18:28:02+00:00,tim.one@comcast.net,[spambayes] testing results,[Neil Schemenauer]\n> These results are from t...,../03-Classification/data/easy_ham/01719.a401d...,,0,135,138.286414


## To be continue