# 1. Email Auto-forwarding Project - Data Pre-Processing 

Frankie Bromage


<h2> Introduction </h2>

- In my experience, manually re-routing emails to the correct person is mindless and time-consuming for the person forwarding them, and it delays the email reaching the right recipient. It also increases the probability that emails are missed and requests are overlooked.
- In this project I aim to build a program that takes emails as input directly from an Outlook application and forwards them to the appropriate person based on past forwarding behaviour.
- To do this, I use Natural Language Processing to convert the text into a bag of words and train a neural network model to classify the emails.
- To simulate a real-world situation and avoid using personal data, I use emails forwarded by one employee in the Enron email dataset, a public dataset of about 500,000 emails.
- I use 2019 code from Kaggle user DFOLY1 to pre-process the Enron email dataset. Accessed from: "https://www.kaggle.com/code/dfoly1/k-means-clustering-from-scratch".
- The model is based on a chatbot model from a 2020 video by NeuralNine, which can be found here: "https://www.youtube.com/watch?v=1lwddP0KUEg".
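
To make the bag-of-words idea mentioned above concrete, here is a minimal sketch using only the standard library. The sample emails, the tokenizer and the function names are assumptions for illustration, not the code used later in this project. It stores word counts; a binary present/absent variant is also common.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(documents):
    """Return a sorted list of every distinct token across the documents."""
    return sorted({token for doc in documents for token in tokenize(doc)})


def bag_of_words(text, vocabulary):
    """Return one count per vocabulary word, in vocabulary order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Hypothetical example emails, not taken from the Enron dataset.
sample_emails = [
    "Please send the gas report",
    "Send the invoice, please send it today",
]
vocabulary = build_vocabulary(sample_emails)
vectors = [bag_of_words(text, vocabulary) for text in sample_emails]
```

Each email becomes a fixed-length numeric vector, which is the kind of input a neural network classifier can be trained on. Word order is discarded, which is the defining trade-off of this representation.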

<h2> This Notebook </h2>

In this notebook I clean the Enron email dataset so that I can use it for the purposes of my project.

<h2> Library Imports </h2>

In [1]:
import csv
import string
from string import Template
import os, sys, re
import email.parser  # import the submodule explicitly; email.parser.Parser is used below
import numpy as np 
import pandas as pd





<h2> Data Pre-Processing </h2>

In a real situation, an Outlook user can download their email information directly from Outlook in CSV format to access the body and header information of their emails. Instructions can be found here: "https://support.microsoft.com/en-us/office/back-up-your-email-e5845b0b-1aeb-424f-924c-aa1c33b18833"
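
As a rough illustration of reading such an export, the sketch below parses a small in-memory CSV with the standard `csv` module. The column names (`Subject`, `Body`, `From`, `To`) and the rows are assumptions made up for this example; a real export may use different headers, so check the first row of your own file.

```python
import csv
import io

# Hypothetical export: the column names and rows below are assumptions
# for illustration, not the guaranteed layout of an Outlook export.
sample_export = io.StringIO(
    'Subject,Body,From,To\n'
    '"Q3 invoice","Please find the invoice attached.",'
    '"alice@example.com","bob@example.com"\n'
    '"Pipeline outage","Line 4 is down,\nsee details below.",'
    '"carol@example.com","bob@example.com"\n'
)

# DictReader maps each row to {column name: value} and copes with
# quoted fields that contain commas or line breaks.
rows = list(csv.DictReader(sample_export))
subjects = [row['Subject'] for row in rows]
```

Email bodies routinely contain commas and newlines, so a CSV-aware reader is safer than splitting lines by hand.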

For the purposes of this project, I am using the Enron email dataset to simulate a real-world situation. This dataset can be found here: 'https://www.kaggle.com/datasets/wcukierski/enron-email-dataset'

<h3> Data Import </h3>

In [2]:
emails_df = pd.read_csv('C:/Users/frankiecheng/Downloads/enron email dataset/emails.csv')

In [3]:
emails_df.head()

| | file | message |
|---|---|---|
| 0 | allen-p/_sent_mail/1. | Message-ID: <18782981.1075855378110.JavaMail.e... |
| 1 | allen-p/_sent_mail/10. | Message-ID: <15464986.1075855378456.JavaMail.e... |
| 2 | allen-p/_sent_mail/100. | Message-ID: <24216240.1075855687451.JavaMail.e... |
| 3 | allen-p/_sent_mail/1000. | Message-ID: <13505866.1075863688222.JavaMail.e... |
| 4 | allen-p/_sent_mail/1001. | Message-ID: <30922949.1075863688243.JavaMail.e... |


The following code is taken from DFOLY1 (2019). It parses each raw message in the Enron email dataset and adds one column per header (such as "From", "To" and "Subject"), plus a "body" column, to the dataframe.

In [4]:
# Parse every raw message string into an email.message.Message object
emails = list(map(email.parser.Parser().parsestr, emails_df['message']))

# Header names (Subject, From, To, ...) are taken from the first email
headings = emails[0].keys()

# Add one column per header; doc[key] is None when an email lacks that header
for key in headings:
    emails_df[key] = [doc[key] for doc in emails]
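
To show what the parser returns for a single message, here is a small self-contained sketch. The raw message is a made-up example in the same header-then-body layout as the `message` column; the addresses and text are assumptions.

```python
from email.parser import Parser

# Hypothetical raw message; addresses and text are made up.
raw_message = (
    "Message-ID: <123.example@mail>\n"
    "Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)\n"
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Forecast\n"
    "\n"
    "Here is our forecast.\n"
)

parsed = Parser().parsestr(raw_message)

header_names = parsed.keys()   # header names in the order they appear
sender = parsed["From"]        # value of one header
missing = parsed["X-cc"]       # a header this message lacks gives None
body_text = parsed.get_payload()
```

Because the column list comes from the first email only, a header that appears only in later messages would not get its own column, and a header missing from a given message shows up as `None` in that row.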

In [5]:
def get_raw_text(msg):
    """Return the concatenated text/plain parts of one parsed email."""
    text_parts = []
    for part in msg.walk():
        if part.get_content_type() == 'text/plain':
            text_parts.append(part.get_payload())
    return ''.join(text_parts)

emails_df['body'] = list(map(get_raw_text, emails))
# The first component of the file path is the mailbox owner, e.g. 'allen-p'
emails_df['user'] = emails_df['file'].map(lambda x: x.split('/')[0])
emails_df.head()
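
The reason for walking the message rather than reading its payload directly is easier to see on a multipart email. The sketch below builds a hypothetical two-part message (plain text plus HTML) with the standard library and keeps only the `text/plain` part. The helper name and message content are assumptions for illustration.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def plain_text_parts(message):
    """Join the payloads of every text/plain part found by walk()."""
    return "".join(
        part.get_payload()
        for part in message.walk()
        if part.get_content_type() == "text/plain"
    )


# Hypothetical message with a plain-text and an HTML alternative.
message = MIMEMultipart("alternative")
message.attach(MIMEText("Hello Bob", "plain"))
message.attach(MIMEText("<p>Hello Bob</p>", "html"))

body = plain_text_parts(message)
```

For a simple single-part message, `walk()` yields only the message itself, so the same function covers both cases.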

In [6]:
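# Parse the Date header. infer_datetime_format is deprecated in pandas 2.x, so it is omitted.
# The headers carry mixed UTC offsets, which is likely why the dtype below is still object.
emails_df['Date'] = pd.to_datetime(emails_df['Date'])
emails_df.dtypes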

file                         object
message                      object
Message-ID                   object
Date                         object
From                         object
To                           object
Subject                      object
Mime-Version                 object
Content-Type                 object
Content-Transfer-Encoding    object
X-From                       object
X-To                         object
X-cc                         object
X-bcc                        object
X-Folder                     object
X-Origin                     object
X-FileName                   object
body                         object
user                         object
dtype: object
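
The `Date` column is still reported as `object` above, most likely because the headers mix different UTC offsets. In pandas, passing `utc=True` to `pd.to_datetime` is the usual way to get a single timezone-aware column. The sketch below shows the same normalisation with the standard library instead of pandas, on hypothetical header values written in the style of the dataset.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime

# Hypothetical Date header values in the style of the Enron headers.
date_headers = [
    "Mon, 14 May 2001 16:39:00 -0700 (PDT)",
    "Mon, 4 Dec 2000 08:15:00 -0800 (PST)",
]

# parsedate_to_datetime returns an offset-aware datetime; converting each
# one to UTC puts all timestamps on a single comparable clock.
utc_dates = [
    parsedate_to_datetime(value).astimezone(timezone.utc)
    for value in date_headers
]
iso_dates = [d.isoformat() for d in utc_dates]
```

The `Date` column is not among the columns kept below, so this only matters if timestamps are needed later.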

In [7]:
emails_df_2 = emails_df[['From','To','Subject','body']]

In [8]:
# Save the cleaned-up data file (the dataframe index is written as an unnamed first column)
emails_df_2.to_csv('enron_clean.csv')