# 3. Email Auto-forwarding Project - Application

Frankie Bromage

<h2> Introduction </h2>

- In my experience, re-routing emails to the correct person is mindless and time-consuming for the person forwarding them, and it delays the email reaching the person who needs to act on it. It also makes it more likely that emails are missed and requests are overlooked.
- In this project I aim to build a program that takes emails as input directly from an Outlook application and forwards them to the appropriate person based on past forwarding behaviour.
- To do this, I use Natural Language Processing to convert the text into a bag of words and train a neural network model to classify the emails.
- To simulate a real-world situation and avoid using personal data, I use emails forwarded by one employee in the Enron email dataset, a public dataset of 500,000 emails.
- I use 2019 code from Kaggle user DFOLY1 to pre-process the Enron email dataset. Accessed from: "https://www.kaggle.com/code/dfoly1/k-means-clustering-from-scratch".
- The model is based on a chat-bot model used in a 2020 video by NeuralNine which can be found here: "https://www.youtube.com/watch?v=1lwddP0KUEg".

<h3> This Notebook </h3>

In the previous notebook, I created a model (adapted from NeuralNine, 2020) to predict whether an email forwarded by one employee in the Enron email dataset would go to one of the four people she forwarded emails to most often, or to someone else.

I now imagine that this employee wants to auto-forward future emails from her Outlook inbox to the appropriate person and create a task list from all other emails. The following code will:
- read emails from an Outlook inbox
- classify them using the model created in the previous notebook
- use the model predictions to forward each email to the correct person, or assign it to an Outlook task list and move it to a user-defined folder.
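
The routing rule in the last step can be sketched as a small pure function. This is only an illustration under stated assumptions: the function name `route_email`, the `address_for_class` mapping, the class labels, the example addresses and the threshold value `0.9` are hypothetical and do not come from the notebook. The prediction dictionary mirrors the `intent`/`probability` shape returned by the prediction functions defined later.

```python
def route_email(prediction, prob_thresh, address_for_class):
    """Decide what to do with one classified email.

    prediction: dict with 'intent' (class label) and 'probability' (float or str).
    prob_thresh: minimum probability required before auto-forwarding.
    address_for_class: hypothetical mapping from class label to forwarding address.
    Returns ("forward", address) or ("task", None).
    """
    probability = float(prediction["probability"])
    address = address_for_class.get(prediction["intent"])
    # Forward only when the class has a known recipient AND the model is confident.
    if address is not None and probability > prob_thresh:
        return ("forward", address)
    # Everything else goes to the task list for manual handling.
    return ("task", None)


# Hypothetical mapping, predictions and threshold, for illustration only.
addresses = {"class_a": "person.a@example.com", "class_b": "person.b@example.com"}

confident = route_email({"intent": "class_a", "probability": "0.99"}, 0.9, addresses)
unsure = route_email({"intent": "class_a", "probability": "0.80"}, 0.9, addresses)
other = route_email({"intent": "other", "probability": "0.99"}, 0.9, addresses)
print(confident, unsure, other)
```

Keeping the decision separate from the Outlook calls means the rule can be checked without a mailbox.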

<h2> Library Imports </h2>

In [None]:
import pickle
import numpy as np
import pandas as pd
import win32com.client
import nltk
from nltk.stem import WordNetLemmatizer
from tensorflow.keras.models import load_model

<h2> Load Model and Files </h2>

In [None]:
#load the possible classifications from the previous notebook
classes = pickle.load(open('classes.pkl','rb'))
#load the test set target values and predictions from the previous notebook
y_v_y_hat = pd.read_csv("model_predictions_test.csv")
#load the emails that correspond to the classifications from the previous notebook
target_map = pickle.load(open('target_map.pkl','rb'))
#load the model created from the previous notebook
model = load_model('email_model_2.h5')
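#load the vocabulary used by bag_of_words(), which reads a global named `words`.
#ASSUMPTION: 'words.pkl' is a placeholder file name; this notebook does not show where the vocabulary was saved.
words = pickle.load(open('words.pkl','rb'))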


In [None]:
#instantiate lemmatizer to lemmatize new emails
lemmatizer = WordNetLemmatizer()

<h2> Self-Defined Functions </h2>

The following functions come from NeuralNine, 2020, who used them to create a chatbot. 

In [395]:
def clean_up_sentence(sentence):
    '''This function takes in a string and tokenizes and lemmatizes the words in the string'''
    sentence_words = nltk.word_tokenize(sentence)
    sentence_words = [lemmatizer.lemmatize(word) for word in sentence_words]
    return sentence_words

def bag_of_words(sentence):
    '''This function converts the sentence into an array of 1s and 0s that the model can use'''
    sentence_words = clean_up_sentence(sentence)
    bag = [0]*len(words)
    for w in sentence_words:
        for i, word in enumerate(words):
            if word == w:
                bag[i] = 1
    return np.array(bag)
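
To make the encoding concrete, here is a minimal sketch of the same idea on a toy vocabulary. It is a simplification: it lowercases and splits on whitespace instead of using NLTK tokenization and lemmatization, and the function name, vocabulary and sentence are made up for illustration.

```python
def toy_bag_of_words(sentence, vocabulary):
    """Return a list with 1 where a vocabulary word occurs in the sentence, else 0."""
    # Simplified tokenization (assumption): lowercase and split on whitespace.
    tokens = set(sentence.lower().split())
    return [1 if word in tokens else 0 for word in vocabulary]


# Hypothetical vocabulary and sentence, not taken from the Enron data.
vocabulary = ["contract", "invoice", "meeting", "schedule"]
vector = toy_bag_of_words("Please schedule a meeting about the contract", vocabulary)
print(vector)  # [1, 0, 1, 1]
```

Word order and repeated words are discarded; only presence matters. The vector always has one entry per vocabulary word, so its length has to match the input size the model was trained with.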

In [396]:
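def predict_class(sentence):
    '''This function takes in a sentence and returns the class and probability that the sentence belongs to this class'''
    bow = bag_of_words(sentence)
    res = model.predict(np.array([bow]))[0]
    #keep only the classes whose predicted probability exceeds the threshold
    ERROR_THRESHOLD = 0.25
    result = [[i,r] for i, r in enumerate(res) if r > ERROR_THRESHOLD]
    #if no class clears the threshold, fall back to the most probable class so that result[0] exists
    if not result:
        top = int(np.argmax(res))
        result = [[top, res[top]]]
    #sort so the most probable class comes first
    result.sort(key=lambda x:x[1], reverse = True)
    return_list = []
    return_list.append({'intent':classes[result[0][0]],'probability':str(result[0][1])})
    return return_list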

The following functions are used in the email forwarding code.

In [397]:
#I changed the predict_class function to take in an email as an object instead of a sentence.
def predict_email_class(email):
    '''This function is adapted from NeuralNine, 2020. It takes in an email object and returns the predicted classification and probability'''
    #extract the body as a string from the email object (Body is the Outlook MailItem property name)
    sentence = email.Body
    #turn string into bag of words
    bow = bag_of_words(sentence)
    #predict the class of the email by using the loaded model.
    res = model.predict(np.array([bow]))[0]
    #keep only the classes whose predicted probability exceeds the threshold
    ERROR_THRESHOLD = 0.25
    result = [[i,r] for i, r in enumerate(res) if r > ERROR_THRESHOLD]
    #if no class clears the threshold, fall back to the most probable class so that result[0] exists
    if not result:
        top = int(np.argmax(res))
        result = [[top, res[top]]]
    #sort so the most probable class comes first
    result.sort(key=lambda x:x[1], reverse = True)
    return_list = []
    return_list.append({'intent':classes[result[0][0]],'probability':str(result[0][1])})
    return return_list

In [398]:
def forward_email(message,email_address):
    '''This function takes in a message and forwarding email address as input and sends the message to the email address'''
    #Create a forward message item
    NewMsg = message.Forward()
    #Add the email address in the "To" box.
    NewMsg.To = email_address
    #Send the message. For testing I used 'Display' because I don't want to actually send any messages.
    #NewMsg.Send()
    NewMsg.Display()
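
Rather than commenting `Send()` in and out, one option is to switch between previewing and sending with a flag. The sketch below assumes that design; the function name `forward_email_with_flag`, the `send` parameter and the `FakeMessage`/`FakeForward` stand-in classes are hypothetical and exist only so the example runs without Outlook. It relies on the same `Forward`, `To`, `Display` and `Send` members that `forward_email` uses above.

```python
def forward_email_with_flag(message, email_address, send=False):
    """Forward `message` to `email_address`; preview it unless send=True."""
    new_msg = message.Forward()
    new_msg.To = email_address
    if send:
        new_msg.Send()      # actually sends the forwarded message
    else:
        new_msg.Display()   # only opens the draft for inspection
    return new_msg


class FakeForward:
    """Hypothetical stand-in for the item returned by Forward()."""
    def __init__(self):
        self.To = ""
        self.actions = []

    def Display(self):
        self.actions.append("displayed")

    def Send(self):
        self.actions.append("sent")


class FakeMessage:
    """Hypothetical stand-in for an Outlook message."""
    def Forward(self):
        return FakeForward()


preview = forward_email_with_flag(FakeMessage(), "person.a@example.com")
sent = forward_email_with_flag(FakeMessage(), "person.a@example.com", send=True)
print(preview.To, preview.actions, sent.actions)
```

Defaulting `send` to `False` keeps the safe behaviour used during testing, so nothing leaves the mailbox unless the caller asks for it explicitly.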

<h2> Determine Acceptable Classification Probability Threshold</h2>

I want to be quite conservative with the implementation of the model because I want to minimize auto-forwarding emails to the wrong person (i.e. only auto-forward the emails where I can be confident that they have been classified correctly). I looked at the misclassified emails from the test set and investigated what probabilities were generally associated with misclassified emails to determine an appropriate probability threshold.

In [399]:
#look at values where predictions were not correct to decide probability threshold for the rest of the code.
y_v_y_hat.loc[y_v_y_hat['Target'] != y_v_y_hat['intent']]['probability']


25     0.666629
65     0.390829
101    0.390829
117    0.634746
144    0.634128
148    0.976773
168    0.666629
211    0.730075
245    0.390829
291    0.634128
Name: probability, dtype: float64

In [400]:
#make the probability threshold the maximum probability of misclassified emails.
prob_thresh = max(y_v_y_hat.loc[y_v_y_hat['Target'] != y_v_y_hat['intent']]['probability'])

In [401]:
prob_thresh

0.9767735

I investigate the proportion of emails that are removed from consideration by setting the probability threshold to this value (about 97.7%). If most of the emails fall at or below this threshold, then it is probably not a useful threshold to use.

In [402]:
#Percentage of emails excluded because they do not meet the probability threshold
(len(y_v_y_hat.loc[y_v_y_hat['probability'] <= prob_thresh])/len(y_v_y_hat))*100

11.18421052631579

In [403]:
#Number of emails in testing set left after excluding those below the probability threshold
len(y_v_y_hat.loc[y_v_y_hat['probability'] > prob_thresh])

270

In [404]:
#Total number of emails in testing set
len(y_v_y_hat)

304
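
The threshold choice and the resulting coverage can be reproduced on a toy example without pandas. This is a sketch with made-up rows; only the meaning of the three fields (`Target`, `intent`, `probability`) follows the notebook.

```python
# Hypothetical (target, predicted intent, probability) rows.
rows = [
    ("a", "a", 0.99),
    ("a", "b", 0.70),   # misclassified
    ("b", "b", 0.95),
    ("b", "b", 0.60),
    ("c", "a", 0.90),   # misclassified
    ("c", "c", 0.98),
    ("c", "c", 0.93),
    ("a", "a", 0.97),
]

# Threshold = highest probability the model assigned to a wrong prediction.
prob_thresh = max(p for target, intent, p in rows if target != intent)

# Emails at or below the threshold are excluded from auto-forwarding.
excluded = [r for r in rows if r[2] <= prob_thresh]
kept = [r for r in rows if r[2] > prob_thresh]
pct_excluded = len(excluded) / len(rows) * 100

print(prob_thresh, len(kept), pct_excluded)
```

By construction, every row that clears the threshold is correctly classified on this data, but the threshold is chosen on the same set it is checked against, so it may be optimistic for new emails. The notebook's own figures are consistent with each other: 304 - 270 = 34 emails excluded, and 34 / 304 is about 11.18%.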