## Enron Email Classification using Machine Learning

The data-cleaning notebook for the Enron email dataset is available at:

[https://www.kaggle.com/ankur561999/data-cleaning-enron-email-dataset](https://www.kaggle.com/ankur561999/data-cleaning-enron-email-dataset)

In [2]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/enron-email-classification-using-machine-learning/f1_score.csv
/kaggle/input/enron-email-classification-using-machine-learning/__results__.html
/kaggle/input/enron-email-classification-using-machine-learning/__notebook_source__.ipynb
/kaggle/input/enron-email-classification-using-machine-learning/__notebook__.ipynb
/kaggle/input/enron-email-classification-using-machine-learning/accuracy.csv
/kaggle/input/enron-email-classification-using-machine-learning/jacc_score.csv
/kaggle/input/enron-email-classification-using-machine-learning/__output__.json
/kaggle/input/enron-email-classification-using-machine-learning/custom.css
/kaggle/input/data-cleaning-enron-email-dataset/cleaned_data.csv
/kaggle/input/data-cleaning-enron-email-dataset/__results__.html
/kaggle/input/data-cleaning-enron-email-dataset/__notebook_source__.ipynb
/kaggle/input/data-cleaning-enron-email-dataset/__resultx__.html
/kaggle/input/data-cleaning-enron-email-dataset/__notebook__.ipynb
/kaggle/input/data-cle

### Import necessary libraries

In [3]:
import matplotlib.pyplot as plt
import re
import string
import time
pd.set_option('display.max_rows', 50)

from nltk.corpus import stopwords
stop = stopwords.words('english')

from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate

from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

### Load Data

In [5]:
df = pd.read_csv("/kaggle/input/data-cleaning-enron-email-dataset/cleaned_data.csv")

# view first 5 rows of the dataframe
df.head()

| | subject | X-Folder | body |
|---|---|---|---|
| 0 | Re: | 'sent mail | Traveling to have a business meeting takes the... |
| 1 | Re: test | 'sent mail | test successful. way to go!!! |
| 2 | Re: Hello | 'sent mail | Let's shoot for Tuesday at 11:45. |
| 3 | Re: Hello | 'sent mail | Greg,\n\n How about either next Tuesday or Thu... |
| 4 | Re: PRC review - phone calls | 'sent mail | any morning between 10 and 11:30 |


### Data Pre-processing

#### Remove Folders
Remove folders that contain too few e-mails, since they provide too little signal for training a classifier. Folders with very few e-mails were also likely created but never really used.

In [6]:
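def remove_folders(emails, n):
    # return only the emails whose folder contains more than 'n' emails
    email_count = dict(emails['X-Folder'].value_counts())
    small_folders = [key for key, val in email_count.items() if val <= n]
    # filter the frame that was passed in (not the global df) and return a copy,
    # so that later column assignments do not operate on a slice of the original
    emails = emails.loc[~emails['X-Folder'].isin(small_folders)].copy()
    return emails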

In [7]:
n = 150
df = remove_folders(df, n)

In [8]:
print("Total folders: ", len(df['X-Folder'].unique()))
print("df.shape: ", df.shape)

Total folders:  82
df.shape:  (460141, 3)


**Combine subject and body columns**

In [9]:
df['text'] = df['subject'] + " " + df['body']

In [10]:
# drop the columns 'subject' and 'body'
df.drop(['subject','body'], axis=1, inplace=True)
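
One caveat when concatenating string columns in pandas: if either value is missing, the result is missing too. The sketch below uses a hypothetical two-row frame (`toy`) to show the effect and one possible guard with `fillna`. Whether `cleaned_data.csv` actually contains missing subjects or bodies is not confirmed by this notebook, so treat this as an optional precaution.

```python
import pandas as pd

# Hypothetical toy frame; the second e-mail has no subject.
toy = pd.DataFrame({
    "subject": ["Re: test", None],
    "body": ["test successful", "any morning"],
})

# Plain concatenation propagates the missing value to the combined text.
naive = toy["subject"] + " " + toy["body"]

# Filling missing values with empty strings keeps the part that is present.
safe = (toy["subject"].fillna("") + " " + toy["body"].fillna("")).str.strip()

print(naive.isna().tolist())
print(safe.tolist())
```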

Next, preprocess the text:
- Lowercase all words
- Remove extra new lines
- Remove tabs and punctuation (including commas)
- Remove extra white spaces
- Remove stopwords

In [11]:
def preprocess(x):
    # lowercasing all the words
    x = x.lower()
    
    # remove extra new lines
    x = re.sub(r'\n+', ' ', x)
    
    # replace every punctuation character with a space
    x = re.sub("[" + re.escape(string.punctuation) + "]", " ", x)
    
    # remove extra white spaces
    x = re.sub(r'\s+', ' ', x)
    
    return x

In [12]:
start = time.time()
df.loc[:,'text'] = df.loc[:, 'text'].map(preprocess)

# remove stopwords
df.loc[:, 'text'] = df.loc[:, 'text'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop]))
end = time.time()
print("Execution time (sec): ",(end - start))

Execution time (sec):  305.9135808944702
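
Much of this running time is likely spent in the stopword filter, because `stop` is a Python list and every `word not in stop` check scans it linearly. A minimal sketch of the same cleaning steps with a set lookup follows. `STOPWORDS` is a hypothetical stand-in for nltk's English list, and `clean` is an illustrative name, not a function from the notebook.

```python
import re
import string

# Hypothetical mini stopword set; the notebook uses nltk's English list instead.
STOPWORDS = frozenset({"the", "to", "a", "for", "at", "is", "and"})

# Character class matching any single punctuation character.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def clean(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace, drop stopwords."""
    text = PUNCT_RE.sub(" ", text.lower())
    # str.split() with no arguments also collapses newlines, tabs and repeated spaces.
    tokens = text.split()
    # Membership tests on a frozenset take constant time on average.
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


sample = "Let's shoot for Tuesday at 11:45.\n\nSee you at the   office!"
print(clean(sample))  # let s shoot tuesday 11 45 see you office
```

In the notebook itself, the equivalent change would be to build `stop = set(stopwords.words('english'))` once before the `apply` call.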


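- Select 20 folders to classify. Note that the selection below is not random: it takes the folders at positions 50 to 69 when all 82 folders are sorted by e-mail count in ascending order.
- Only 20 folders are used because training on all of them would be too slow and computationally expensive.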

In [13]:
start = time.time()
# positional slice of the folders sorted by e-mail count (ascending)
folders_dict = dict(df['X-Folder'].value_counts().sort_values().iloc[50:70])
data = df[df['X-Folder'].isin(folders_dict.keys())]
end = time.time()
print("Execution time (sec): ",(end - start))

Execution time (sec):  0.06492829322814941
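
To make the selection rule concrete, here is a small sketch on hypothetical folder labels. It contrasts a positional slice of the sorted counts, which is what the cell above does, with a genuinely random choice via `Series.sample`. The toy labels and the `random_state` value are assumptions for illustration only.

```python
import pandas as pd

# Toy folder labels (hypothetical data, not from the Enron corpus).
folders = pd.Series(["a"] * 5 + ["b"] * 3 + ["c"] * 8 + ["d"] * 1 + ["e"] * 6)

# Ascending counts: d=1, b=3, a=5, e=6, c=8
counts = folders.value_counts().sort_values()

# Positional slice, analogous to [50:70] in the notebook: deterministic, rank-based.
by_rank = counts.iloc[1:3]
print(list(by_rank.index))  # ['b', 'a']

# A random selection would sample folder names instead; fixing the seed keeps it reproducible.
random_pick = counts.sample(n=2, random_state=0)
print(sorted(random_pick.index))
```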


In [14]:
# check number of rows in the 'data' dataframe
print("Number of instances: ", data.shape[0])
data.to_csv('preprocessed.csv', index=False)

Number of instances:  13586


In [16]:
data = pd.read_csv("preprocessed.csv")

**Encode class labels**

In [17]:
data['X-Folder'].value_counts()

logistics              1170
tw-commercial group    1150
california             1014
bill williams iii      1004
deal discrepancies      878
management              799
calendar                700
esvl                    663
tufco                   604
resumes                 599
e-mail bin              592
ces                     572
online trading          567
junk                    544
junk file               494
ooc                     473
genco-jv_ipo            465
projects                459
corporate               420
archives                419
Name: X-Folder, dtype: int64

In [18]:
def label_encoder(data):
    class_le = LabelEncoder()
    # apply label encoder on the 'X-Folder' column
    y = class_le.fit_transform(data['X-Folder'])
    return y

In [19]:
y = label_encoder(data)
input_data = data['text']
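
`LabelEncoder` assigns integer codes to the class names in sorted order. The plain-Python sketch below mimics that mapping on a few folder names taken from the counts above, and shows how codes can be turned back into names. It is an illustration of the idea, not scikit-learn's implementation.

```python
# A handful of folder names from the class counts above.
folders = ["logistics", "calendar", "junk", "calendar", "logistics"]

# Sorted unique class names; the position in this list is the integer code.
classes = sorted(set(folders))
to_code = {name: code for code, name in enumerate(classes)}

y_codes = [to_code[name] for name in folders]
decoded = [classes[code] for code in y_codes]

print(classes)  # ['calendar', 'junk', 'logistics']
print(y_codes)  # [2, 0, 1, 0, 2]
```

Because `label_encoder` above returns only `y`, the fitted encoder is discarded. Returning the encoder as well would keep its `classes_` attribute and `inverse_transform` method available for reading predictions as folder names.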

## 1. Bag-of-Words

In [8]:
start = time.time()
vectorizer = CountVectorizer(min_df=5, max_features=5000)
X = vectorizer.fit_transform(input_data)
end = time.time()
print("Execution time (sec): ",(end - start))

Execution time (sec):  2.647146463394165


In [9]:
start = time.time()
X = X.toarray()
print("X.shape: ",X.shape)
end = time.time()
print("Execution time (sec): ",(end - start))

X.shape:  (13586, 5000)
Execution time (sec):  0.31673240661621094


In [24]:
# create dataframe to store results
f1_data = {
    'Algorithm': ['Gaussian NB', 'Multinomial NB','Decision Tree','SVM','AdaBoost','ANN'],
    'BoW': ''
}
f1_df = pd.DataFrame(f1_data)

jaccard_data = {
    'Algorithm': ['Gaussian NB', 'Multinomial NB', 'Decision Tree','SVM','AdaBoost','ANN'],
    'BoW': ''
}
jacc_df = pd.DataFrame(jaccard_data)

acc_data = {
    'Algorithm': ['Gaussian NB', 'Multinomial NB','Decision Tree','SVM','AdaBoost','ANN'],
    'BoW': ''
}
acc_df = pd.DataFrame(acc_data)
acc_df

| | Algorithm | BoW |
|---|---|---|
| 0 | Gaussian NB | |
| 1 | Multinomial NB | |
| 2 | Decision Tree | |
| 3 | SVM | |
| 4 | AdaBoost | |
| 5 | ANN | |


### Training and Evaluation

In [27]:
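# Note: scikit-learn 1.2 renamed AdaBoost's `base_estimator` argument to `estimator`
# (the old name was removed in 1.4); adjust the keyword below on newer versions.
models = [GaussianNB(), MultinomialNB(), DecisionTreeClassifier(), LinearSVC(), 
          AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=5),
         MLPClassifier(hidden_layer_sizes=(10,))]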

names = ["Gaussian NB", "Multinomial NB", "Decision Tree", "SVM", "AdaBoost", "ANN"]

jacc_scores = []
acc_scores = []
f1_scores = []
exec_times = []

for model, name in zip(models, names):
    print(name)
    start = time.time()
    scoring = {
        'acc': 'accuracy',
        'f1_mac': 'f1_macro',
        'jacc_mac': 'jaccard_macro'
    }
    scores = cross_validate(model, X, y, cv=10, n_jobs=4, scoring=scoring)
    training_time = (time.time() - start)
    print("accuracy: ", scores['test_acc'].mean())
    print("f1_score: ", scores['test_f1_mac'].mean())
    print("Jaccard_index: ", scores['test_jacc_mac'].mean())
    print("time (sec): ", training_time)
    print("\n")
    
    jacc_scores.append(scores['test_jacc_mac'].mean())
    acc_scores.append(scores['test_acc'].mean())
    f1_scores.append(scores['test_f1_mac'].mean())
    exec_times.append(training_time)
    
acc_df['BoW'] = acc_scores
jacc_df['BoW'] = jacc_scores
f1_df['BoW'] = f1_scores
acc_df['time'] = exec_times
acc_df

Gaussian NB
accuracy:  0.5852325249983473
f1_score:  0.5621318716847321
Jaccard_index:  0.4130840479833916
time (sec):  11.612146615982056


Multinomial NB
accuracy:  0.7377434135166094
f1_score:  0.7038215928245647
Jaccard_index:  0.5770619060081098
time (sec):  42.23520231246948


Decision Tree
accuracy:  0.6620045710644469
f1_score:  0.6434088762702024
Jaccard_index:  0.4976720244008499
time (sec):  84.62758040428162


SVM
accuracy:  0.7374505424481528
f1_score:  0.7187119697058196
Jaccard_index:  0.5871597129360764
time (sec):  31.174070835113525


AdaBoost
accuracy:  0.6678924445224712
f1_score:  0.6477421097163432
Jaccard_index:  0.5058628299126986
time (sec):  424.4590353965759


ANN
accuracy:  0.7353897704822809
f1_score:  0.7154945658183888
Jaccard_index:  0.584506441602881
time (sec):  879.5331366062164




| | Algorithm | BoW | time |
|---|---|---|---|
| 0 | Gaussian NB | 0.585233 | 11.612147 |
| 1 | Multinomial NB | 0.737743 | 42.235202 |
| 2 | Decision Tree | 0.662005 | 84.62758 |
| 3 | SVM | 0.737451 | 31.174071 |
| 4 | AdaBoost | 0.667892 | 424.459035 |
| 5 | ANN | 0.73539 | 879.533137 |


In [28]:
# save the results
acc_df.to_csv("accuracy.csv", index=False)
f1_df.to_csv("f1_score.csv", index=False)
jacc_df.to_csv("jacc_score.csv", index=False)
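
For reference, the three reported metrics can be reproduced by hand. The sketch below computes accuracy, macro F1 and macro Jaccard from per-class true positives, false positives and false negatives on hypothetical labels. `macro_scores` is an illustrative helper, not part of scikit-learn.

```python
def macro_scores(y_true, y_pred):
    """Return (accuracy, macro F1, macro Jaccard) for two equal-length label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1s, jaccards = [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        # Every label occurs in y_true or y_pred, so tp + fp + fn is never zero.
        f1s.append(2 * tp / (2 * tp + fp + fn))
        jaccards.append(tp / (tp + fp + fn))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    # "Macro" means an unweighted mean over classes.
    return accuracy, sum(f1s) / len(f1s), sum(jaccards) / len(jaccards)


# Hypothetical predictions for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
acc, f1_macro, jacc_macro = macro_scores(y_true, y_pred)
print(round(acc, 4), round(f1_macro, 4), round(jacc_macro, 4))  # 0.6667 0.6556 0.5
```

For a single class the two scores are tied by J = F1 / (2 - F1), so the Jaccard index can never exceed F1. This is consistent with the outputs above, where the Jaccard value is always the lower of the two.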

## 2. Bag-of-Words Bigram

In [36]:
start = time.time()
vectorizer = CountVectorizer(min_df=5, max_features=5000, ngram_range=(2,2))
X = vectorizer.fit_transform(input_data)

X = X.toarray()
print("X.shape: ",X.shape)

end = time.time()
print("Execution time (sec): ",(end - start))

X.shape:  (13586, 5000)
Execution time (sec):  7.333747625350952
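
With `ngram_range=(2,2)` the features are pairs of adjacent words rather than single words. A minimal sketch of bigram extraction on a whitespace-tokenized string (`bigrams` is an illustrative helper name):

```python
def bigrams(text):
    """Return the list of adjacent word pairs in a whitespace-separated string."""
    tokens = text.split()
    # zip pairs each token with its successor; a single-word text yields no bigrams.
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


print(bigrams("prc review phone calls"))
# ['prc review', 'review phone', 'phone calls']
```

Since the vocabulary is still capped at 5000 features, only the most frequent pairs are kept, which may partly explain why the bigram scores below are lower than the unigram ones.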


### Training and Evaluation

In [37]:
models = [GaussianNB(), MultinomialNB(), DecisionTreeClassifier(), LinearSVC(), 
          AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=5),
         MLPClassifier(hidden_layer_sizes=(10,))]

names = ["Gaussian NB", "Multinomial NB", "Decision Tree", "SVM", "AdaBoost", "ANN"]

jacc_scores = []
acc_scores = []
f1_scores = []
exec_times = []

for model, name in zip(models, names):
    print(name)
    start = time.time()
    scoring = {
        'acc': 'accuracy',
        'f1_mac': 'f1_macro',
        'jacc_mac': 'jaccard_macro'
    }
    scores = cross_validate(model, X, y, cv=10, n_jobs=4, scoring=scoring)
    training_time = (time.time() - start)
    print("accuracy: ", scores['test_acc'].mean())
    print("f1_score: ", scores['test_f1_mac'].mean())
    print("Jaccard_index: ", scores['test_jacc_mac'].mean())
    print("time (sec): ", training_time)
    print("\n")
    
    jacc_scores.append(scores['test_jacc_mac'].mean())
    acc_scores.append(scores['test_acc'].mean())
    f1_scores.append(scores['test_f1_mac'].mean())
    exec_times.append(training_time)
    
acc_df['BoWBi'] = acc_scores
jacc_df['BoWBi'] = jacc_scores
f1_df['BoWBi'] = f1_scores
acc_df['BoWBi_time'] = exec_times
acc_df

Gaussian NB
accuracy:  0.5833930454364673
f1_score:  0.5621651556732388
Jaccard_index:  0.4068105548950894
time (sec):  11.270399570465088


Multinomial NB
accuracy:  0.6374178145803735
f1_score:  0.6170933752131809
Jaccard_index:  0.4707424107547659
time (sec):  42.18202495574951


Decision Tree
accuracy:  0.5911941987145101
f1_score:  0.5797069341612804
Jaccard_index:  0.4317389725588492
time (sec):  217.36713671684265


SVM
accuracy:  0.6324125098481621
f1_score:  0.619196736013025
Jaccard_index:  0.47206075684293436
time (sec):  21.832029104232788


AdaBoost
accuracy:  0.5783132360383674
f1_score:  0.5652402591763923
Jaccard_index:  0.41793531112035415
time (sec):  410.09173607826233


ANN
accuracy:  0.6169565575484877
f1_score:  0.604338923367805
Jaccard_index:  0.4571185691855167
time (sec):  1108.7890048027039




| | Algorithm | BoW | time | BoWBi | BoWBi_time |
|---|---|---|---|---|---|
| 0 | Gaussian NB | 0.585233 | 11.612147 | 0.583393 | 11.2704 |
| 1 | Multinomial NB | 0.737743 | 42.235202 | 0.637418 | 42.182025 |
| 2 | Decision Tree | 0.662005 | 84.62758 | 0.591194 | 217.367137 |
| 3 | SVM | 0.737451 | 31.174071 | 0.632413 | 21.832029 |
| 4 | AdaBoost | 0.667892 | 424.459035 | 0.578313 | 410.091736 |
| 5 | ANN | 0.73539 | 879.533137 | 0.616957 | 1108.789005 |


In [38]:
# save the results
acc_df.to_csv("accuracy.csv", index=False)
f1_df.to_csv("f1_score.csv", index=False)
jacc_df.to_csv("jacc_score.csv", index=False)

## 3. Tf-Idf (Term Frequency - Inverse Document Frequency)

In [20]:
start = time.time()
vectorizer = TfidfVectorizer(min_df=5, max_features=5000)
X = vectorizer.fit_transform(input_data)

X = X.toarray()
print("X.shape: ",X.shape)

end = time.time()
print("Execution time (sec): ",(end - start))

X.shape:  (13586, 5000)
Execution time (sec):  2.365476369857788
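
The weighting applied by `TfidfVectorizer` with default settings is: raw term count times a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The sketch below works through that formula on a hypothetical pre-tokenized corpus. It is a hand calculation for illustration, not the library code.

```python
import math
from collections import Counter

# Hypothetical pre-tokenized documents.
docs = [["gas", "deal", "gas"], ["deal", "meeting"], ["meeting", "lunch"]]
n_docs = len(docs)

vocab = sorted({term for doc in docs for term in doc})
doc_freq = Counter(term for doc in docs for term in set(doc))

# Smoothed idf: rarer terms receive larger weights.
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}


def tfidf_row(tokens):
    """Return the L2-normalized tf-idf vector of one document over `vocab`."""
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm if norm else 0.0 for v in raw]


rows = [tfidf_row(doc) for doc in docs]
print(vocab)
print([round(v, 3) for v in rows[0]])
```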


### Training and Evaluation

In [25]:
models = [GaussianNB(), MultinomialNB(), DecisionTreeClassifier(), LinearSVC(), 
          AdaBoostClassifier(base_estimator=DecisionTreeClassifier(), n_estimators=5),
         MLPClassifier(hidden_layer_sizes=(10,))]

names = ["Gaussian NB", "Multinomial NB", "Decision Tree", "SVM", "AdaBoost", "ANN"]

jacc_scores = []
acc_scores = []
f1_scores = []
exec_times = []

for model, name in zip(models, names):
    print(name)
    start = time.time()
    scoring = {
        'acc': 'accuracy',
        'f1_mac': 'f1_macro',
        'jacc_mac': 'jaccard_macro'
    }
    scores = cross_validate(model, X, y, cv=10, n_jobs=4, scoring=scoring)
    training_time = (time.time() - start)
    print("accuracy: ", scores['test_acc'].mean())
    print("f1_score: ", scores['test_f1_mac'].mean())
    print("Jaccard_index: ", scores['test_jacc_mac'].mean())
    print("time (sec): ", training_time)
    print("\n")
    
    jacc_scores.append(scores['test_jacc_mac'].mean())
    acc_scores.append(scores['test_acc'].mean())
    f1_scores.append(scores['test_f1_mac'].mean())
    exec_times.append(training_time)
    
acc_df['TfIdf'] = acc_scores
jacc_df['TfIdf'] = jacc_scores
f1_df['TfIdf'] = f1_scores
acc_df['TfIdf_time'] = exec_times
acc_df

Gaussian NB
accuracy:  0.6093018127120674
f1_score:  0.5877402363957523
Jaccard_index:  0.44084640698807825
time (sec):  10.88046383857727


Multinomial NB
accuracy:  0.7368567808999297
f1_score:  0.6967070564788325
Jaccard_index:  0.5701299709091912
time (sec):  6.833428621292114


Decision Tree
accuracy:  0.649639451602311
f1_score:  0.6336756930392328
Jaccard_index:  0.48894297941690895
time (sec):  95.03275084495544


SVM
accuracy:  0.7947884663526091
f1_score:  0.7771822256420796
Jaccard_index:  0.6613918628186176
time (sec):  7.603848695755005


AdaBoost
accuracy:  0.6595013226610141
f1_score:  0.6386553507155176
Jaccard_index:  0.4976178552852552
time (sec):  411.92880725860596


ANN
accuracy:  0.7534232591104306
f1_score:  0.7344935635448607
Jaccard_index:  0.6074949481016497
time (sec):  996.1522567272186




| | Algorithm | BoW | time | BoWBi | BoWBi_time | TfIdf | TfIdf_time |
|---|---|---|---|---|---|---|---|
| 0 | Gaussian NB | 0.585233 | 11.612147 | 0.583393 | 11.2704 | 0.609302 | 10.880464 |
| 1 | Multinomial NB | 0.737743 | 42.235202 | 0.637418 | 42.182025 | 0.736857 | 6.833429 |
| 2 | Decision Tree | 0.662005 | 84.62758 | 0.591194 | 217.367137 | 0.649639 | 95.032751 |
| 3 | SVM | 0.737451 | 31.174071 | 0.632413 | 21.832029 | 0.794788 | 7.603849 |
| 4 | AdaBoost | 0.667892 | 424.459035 | 0.578313 | 410.091736 | 0.659501 | 411.928807 |
| 5 | ANN | 0.73539 | 879.533137 | 0.616957 | 1108.789005 | 0.753423 | 996.152257 |


In [26]:
# save the results
acc_df.to_csv("accuracy.csv", index=False)
f1_df.to_csv("f1_score.csv", index=False)
jacc_df.to_csv("jacc_score.csv", index=False)
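
One methodological point worth noting: in all three experiments the vectorizer is fitted on the full data set before cross-validation, so the vocabulary (and, for Tf-Idf, the IDF weights) is learned from documents that later serve as test folds. A possible alternative, sketched here on a hypothetical toy corpus, is to wrap the vectorizer and classifier in a scikit-learn pipeline so that both are re-fitted inside each fold. The texts, labels and `cv=3` are assumptions for illustration; the effect on the scores reported above has not been measured.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; in the notebook this would be input_data and y.
texts = [
    "gas pipeline capacity", "pipeline gas schedule", "gas transport deal",
    "pipeline capacity deal", "gas schedule update", "transport capacity gas",
    "resume candidate interview", "candidate resume attached",
    "interview schedule candidate", "resume interview offer",
    "candidate offer letter", "resume offer interview",
]
labels = [0] * 6 + [1] * 6

# The pipeline fits the vectorizer on the training part of each fold only.
pipe = make_pipeline(TfidfVectorizer(), LinearSVC())
scores = cross_validate(
    pipe, texts, labels, cv=3,
    scoring={"acc": "accuracy", "f1_mac": "f1_macro"},
)
print(scores["test_acc"].mean(), scores["test_f1_mac"].mean())
```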

In [27]:
jacc_df

| | Algorithm | BoW | BoWBi | TfIdf |
|---|---|---|---|---|
| 0 | Gaussian NB | 0.413084 | 0.406811 | 0.440846 |
| 1 | Multinomial NB | 0.577062 | 0.470742 | 0.57013 |
| 2 | Decision Tree | 0.497672 | 0.431739 | 0.488943 |
| 3 | SVM | 0.58716 | 0.472061 | 0.661392 |
| 4 | AdaBoost | 0.505863 | 0.417935 | 0.497618 |
| 5 | ANN | 0.584506 | 0.457119 | 0.607495 |
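
To read the Jaccard table at a glance, the best algorithm per feature representation can be picked out programmatically. The snippet below copies the values from the table above into a plain dictionary (the variable names are illustrative) and reports the top scorer in each column.

```python
# Macro Jaccard scores copied from the jacc_df output above.
jaccard = {
    "Gaussian NB":    {"BoW": 0.413084, "BoWBi": 0.406811, "TfIdf": 0.440846},
    "Multinomial NB": {"BoW": 0.577062, "BoWBi": 0.470742, "TfIdf": 0.57013},
    "Decision Tree":  {"BoW": 0.497672, "BoWBi": 0.431739, "TfIdf": 0.488943},
    "SVM":            {"BoW": 0.58716,  "BoWBi": 0.472061, "TfIdf": 0.661392},
    "AdaBoost":       {"BoW": 0.505863, "BoWBi": 0.417935, "TfIdf": 0.497618},
    "ANN":            {"BoW": 0.584506, "BoWBi": 0.457119, "TfIdf": 0.607495},
}

best = {}
for feature in ("BoW", "BoWBi", "TfIdf"):
    # Pick the algorithm with the highest score in this column.
    winner = max(jaccard, key=lambda algo: jaccard[algo][feature])
    best[feature] = (winner, jaccard[winner][feature])
    print(feature, winner, jaccard[winner][feature])
```

On these numbers the linear SVM has the highest macro Jaccard index for all three representations, with Tf-Idf giving its best result (0.661392).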
