<h1><center>Email Spam Classifier - BERT</center></h1>

This project is based on Manning's book "Transfer Learning for NLP" (chapter 2).
The goal here is:

1. Curate a dataset of legitimate emails and spam, with 1000 random samples per class
2. Extract only the text from the emails, i.e., no headers.
3. Create a simple bag-of-words model from the above content. Simple because it is based on term frequency (tf) only.
4. Choose one baseline classifier from Logistic Regression and Gradient Boosting Machine
5. Accuracy is the metric of choice as the dataset is balanced and consists of two classes
6. Train a SPAM classifier based on BERT embeddings
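
Steps 3 and 5 can be sketched on toy data. The block below is a minimal illustration, not the baseline used in this project: it builds a term-frequency matrix with the standard library and defines accuracy as the fraction of correct predictions. The helper names (`build_vocabulary`, `term_frequency_matrix`, `accuracy`) and the two toy documents are hypothetical.

```python
from collections import Counter

def build_vocabulary(docs):
    """Map each distinct token to a column index, in sorted order."""
    vocab = sorted({token for doc in docs for token in doc})
    return {token: idx for idx, token in enumerate(vocab)}

def term_frequency_matrix(docs, vocab):
    """Return one row per document holding raw term counts (tf only, no idf)."""
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        matrix.append([counts.get(token, 0) for token in vocab])
    return matrix

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical, already tokenized documents
toy_docs = [
    ["win", "money", "now", "money"],
    ["meeting", "moved", "to", "monday"],
]
vocab = build_vocabulary(toy_docs)
tf_matrix = term_frequency_matrix(toy_docs, vocab)
```

Each row of `tf_matrix` is the feature vector a baseline classifier would receive. Accuracy is a reasonable summary here only because both classes have the same number of samples.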

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import email
import os
import pickle

Define the data file path, i.e., the location of the data used to train the classifier. As this is a CSV file, it is loaded with the pandas read_csv function. If loading succeeds, the number of rows and columns is printed, followed by the first 5 rows.

In [2]:
data_file_path = "/home/baosiek/Documents/deep_learning/transfer-learning/data/emails.csv"
emails = pd.read_csv(data_file_path)
print("Emails were loaded successfully containing {} rows and {} columns".format(emails.shape[0], emails.shape[1]))
print("Printing 5 first rows...")
print(emails.head(5))

Emails were loaded successfully containing 517401 rows and 2 columns
Printing 5 first rows...
                       file                                            message
0     allen-p/_sent_mail/1.  Message-ID: <18782981.1075855378110.JavaMail.e...
1    allen-p/_sent_mail/10.  Message-ID: <15464986.1075855378456.JavaMail.e...
2   allen-p/_sent_mail/100.  Message-ID: <24216240.1075855687451.JavaMail.e...
3  allen-p/_sent_mail/1000.  Message-ID: <13505866.1075863688222.JavaMail.e...
4  allen-p/_sent_mail/1001.  Message-ID: <30922949.1075863688243.JavaMail.e...


Print the content of the first email, under the message column, to get a feel for the data.

In [3]:
print(emails.loc[0, "message"])

Message-ID: <18782981.1075855378110.JavaMail.evans@thyme>
Date: Mon, 14 May 2001 16:39:00 -0700 (PDT)
From: phillip.allen@enron.com
To: tim.belden@enron.com
Subject: 
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-From: Phillip K Allen
X-To: Tim Belden <Tim Belden/Enron@EnronXGate>
X-cc: 
X-bcc: 
X-Folder: \Phillip_Allen_Jan2002_1\Allen, Phillip K.\'Sent Mail
X-Origin: Allen-P
X-FileName: pallen (Non-Privileged).pst

Here is our forecast

 


Let's get only the text part of the message, discarding the Date, From, To and Subject info.

In [4]:
def extract_text(emails):

    # Initializes a list of texts where each row will contain only the content of the email
    contents = []
    
    for item in emails["message"]:
        e = email.message_from_string(item)
        content = e.get_payload() # Gets only a string with the email content
        contents.append(content)
        
    return contents
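
To make the behaviour of `get_payload()` concrete, here is a small sketch on a hand-written message (the addresses and subject are made up for illustration). For a single-part message the payload is the body string with the headers removed. For a multipart message `get_payload()` returns a list of sub-messages instead, which is why the tokenizer further down guards against list inputs.

```python
import email

# Hypothetical raw message: headers, a blank line, then the body
raw_message = (
    "From: alice@example.com\n"
    "To: bob@example.com\n"
    "Subject: Forecast\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "\n"
    "Here is our forecast\n"
)

parsed = email.message_from_string(raw_message)
body = parsed.get_payload()            # str for a single-part message
subject = parsed["Subject"]            # headers remain accessible by name
is_multipart = parsed.is_multipart()   # True would mean get_payload() returns a list
```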

In [5]:
# This process may take some time, so we first check whether the emails were already
# processed and their contents cached. If the cache file is not found, this
# procedure extracts the content from the emails and stores it.
# Note: despite the .txt extension, the cache is a pickle file.
contents_path = '/home/baosiek/Documents/deep_learning/transfer-learning/data/contents.txt'
if not os.path.exists(contents_path):
    contents = extract_text(emails)
    with open(contents_path, "wb") as fp:   # Serializing
        pickle.dump(contents, fp)
else:
    with open(contents_path, "rb") as fp:   # Deserializing
        contents = pickle.load(fp)
        

# Prints the content at row 100
print(contents[100])

I tried the new address but I don't have access.  also, what do I need to 
enter under domain?
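
The load-or-compute pattern used in the cell above can be factored into a small helper. This is a sketch under assumptions: `load_or_compute` and `expensive` are hypothetical names, and a temporary directory stands in for the project's data folder.

```python
import os
import pickle
import tempfile

def load_or_compute(cache_path, compute):
    """Return the cached object if present, else compute, store and return it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fp:
            return pickle.load(fp)
    result = compute()
    with open(cache_path, "wb") as fp:
        pickle.dump(result, fp)
    return result

calls = []  # records how many times the expensive step actually runs

def expensive():
    calls.append(1)
    return ["Here is our forecast", "test successful"]

with tempfile.TemporaryDirectory() as tmp_dir:
    cache_file = os.path.join(tmp_dir, "contents.pkl")
    first = load_or_compute(cache_file, expensive)    # computes and writes the cache
    second = load_or_compute(cache_file, expensive)   # reads the cache instead
```

Using one path variable for the existence check, the write and the read avoids the situation where the check looks in one location and the file is written to another.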


Testing if number of rows in emails data frame and contents list data structure are equal.

In [6]:
if len(contents) == emails.shape[0]:
    print("Success!")

Success!


Convert the contents list into a data frame and print the contents of the first 5 emails.

In [7]:
contents_df = pd.DataFrame(contents)
print(contents_df.head(n=5))

                                                   0
0                          Here is our forecast\n\n 
1  Traveling to have a business meeting takes the...
2                     test successful.  way to go!!!
3  Randy,\n\n Can you send me a schedule of the s...
4                Let's shoot for Tuesday at 11:45.  


# The Spam dataset

In [8]:
data_file_path = "/home/baosiek/Documents/deep_learning/transfer-learning/data/fradulent_emails.txt"
with open(data_file_path, 'r', encoding='latin1') as file:
    spams = file.read()
# spams is one long string because the downloaded data is a single text file.
# We need to split this file into units where each one is a single email.
# Inspecting the downloaded file in a text editor (Gedit) shows that each email starts with
# the character sequence "From r", so we use it to split the big string into a list of emails.
# We keep the elements from index 1 onward rather than 0 because the original file starts
# with 'From r': splitting yields an empty first element, and the first email is
# the second element of the spams list.
spams = spams.split('From r')[1:]
print(f'Spams was successfully downloaded and contains {len(spams)} emails')

spams_df = extract_text(pd.DataFrame(spams, columns=['message']))
spams_df = pd.DataFrame(spams_df)
print(spams_df.head())

Spams was successfully downloaded and contains 3977 emails
                                                   0
0  FROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-2...
1  Dear Friend,\n\nI am Mr. Ben Suleman a custom ...
2  FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...
3  FROM HIS ROYAL MAJESTY (HRM) CROWN RULER OF EL...
4  Dear sir, \n \nIt is with a heart full of hope...
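
The splitting step can be tried on a tiny hand-made string. The two messages below are invented for illustration. The sketch shows why the first element is dropped, and also that a plain `str.split` cuts on every occurrence of the marker, including one inside a body. Anchoring the marker to the start of a line with a regular expression is one possible way to avoid that; it is an alternative, not what the cell above does.

```python
import re

# Hypothetical two-message file in the same layout as the spam corpus
raw = (
    "From r  Wed Oct 30 21:41:56 2002\n"
    "Subject: first\n"
    "\n"
    "Greetings From rich uncle\n"
    "From r  Thu Oct 31 08:11:39 2002\n"
    "Subject: second\n"
    "\n"
    "body two\n"
)

pieces = raw.split("From r")
leading = pieces[0]        # empty string, because the file starts with the marker
naive = pieces[1:]         # also splits inside "Greetings From rich uncle"

# Alternative: only split where the marker begins a line
anchored = re.split(r"(?m)^From r", raw)[1:]
```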


# Hyperparameters

Now let's create the dataset to train the classifiers. It will contain n_samples from each data frame (emails and spams). Each sample is truncated to at most max_tokens tokens, and each token to at most max_chars characters; these hyperparameters keep training and classification time acceptable.

In [9]:
n_samples = 1000 # number of samples from each data frame
max_tokens = 50 # maximum number of tokens in each email
max_chars = 20 # maximum length of each token
threshold = 0.7 # fraction of the dataset used for training

Function to tokenize emails

In [10]:
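def tokenize(row):
    
    # Multipart emails yield a list payload rather than a string; treat those,
    # like missing or empty payloads, as having no tokens.
    if row is None or row == '' or isinstance(row, list):
        tokens = []
    else:
        tokens = row.split(" ")[:max_tokens]
        
    return tokens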

Function to strip punctuation and digits, lowercase all tokens and limit token size

In [11]:
import re

def clean_text(token_list):
    
    tokens = []
    
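    try:
        for token in token_list:
            token = token.lower()
            # Strip non-word characters and digits, then cap the token length
            token = re.sub(r'[\W\d]', "", token)[:max_chars]
            tokens.append(token)
            
    except (AttributeError, TypeError):
        # token_list was not an iterable of strings; emit a single empty token
        tokens.append("")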
        
    return tokens    
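
A quick way to see what the tokenizing and cleaning steps do is to run them on one sentence. The sketch below re-implements the per-token cleaning as a hypothetical `clean_token` helper with the same regular expression. Two details are worth noticing: a token made only of digits and punctuation becomes an empty string, which is where the empty entries in the printed samples further down come from, and the underscore survives because `\W` treats it as a word character.

```python
import re

MAX_TOKENS = 50  # mirrors max_tokens above
MAX_CHARS = 20   # mirrors max_chars above

def clean_token(token):
    """Lowercase, drop non-word characters and digits, cap the length."""
    return re.sub(r"[\W\d]", "", token.lower())[:MAX_CHARS]

sample = "Call 555-1234 NOW!!! visit my_site.com"
tokens = sample.split(" ")[:MAX_TOKENS]
cleaned = [clean_token(t) for t in tokens]
truncated = clean_token("a" * 30)   # longer than MAX_CHARS, so it is cut
```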

# Build dataframe with emails and spams.

Removing stopwords

In [12]:
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))  # a set gives fast membership tests

def stopwords_removal(token_list):
    
    # Drop stopwords and empty tokens; return a list rather than a lazy filter object
    return [token for token in token_list if token and token not in stop_words]

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/baosiek/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [13]:
real_emails = contents_df.iloc[:, 0]
real_emails = real_emails.apply(tokenize)
real_emails = real_emails.apply(stopwords_removal)
real_emails = real_emails.apply(clean_text)
real_emails = real_emails.sample(n_samples)
print(f'real_emails dataframe contains {real_emails.shape[0]} samples. Listing first 5...')
print(real_emails.head(n=5))

real_emails dataframe contains 1000 samples. Listing first 5...
389288    [, forwarded, elizabeth, sagerhouect, , , am, ...
468101    [, crisis, and, opportunitypower, markets, mar...
326517    [maci, sent, directly, ahmed, via, email, than...
293169    [the, issue, respect, addressed, one, simple, ...
184065    [i, conv, wmark, marcus, nettleton, today, par...
Name: 0, dtype: object
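
The sample above still contains words such as "the", "and" and "i", plus empty strings. A likely explanation is the order of the steps: stopwords are removed before the tokens are lowercased and stripped, so "The" or "but," do not match the lowercase stopword list. The sketch below compares both orders on one sentence. It uses a tiny hypothetical stopword set in place of NLTK's list, and the function names are illustrative only.

```python
import re

STOP_WORDS = {"the", "is", "but", "i", "in"}  # tiny stand-in for NLTK's English list

def normalize(token):
    """Lowercase and strip non-word characters and digits."""
    return re.sub(r"[\W\d]", "", token.lower())

def filter_then_clean(tokens):
    """Order used in this notebook: drop stopwords first, normalize second."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [normalize(t) for t in kept]

def clean_then_filter(tokens):
    """Alternative order: normalize first, then drop stopwords and empty tokens."""
    cleaned = [normalize(t) for t in tokens]
    return [t for t in cleaned if t and t not in STOP_WORDS]

tokens = "The offer is real, but I need 100 dollars".split(" ")
before = filter_then_clean(tokens)
after = clean_then_filter(tokens)
```

With the alternative order, capitalized stopwords and tokens emptied by the cleaning step no longer reach the model. Whether that improves accuracy on this dataset has not been measured here.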


In [14]:
spam_emails = spams_df.iloc[:, 0]
spam_emails = spam_emails.apply(tokenize)
spam_emails = spam_emails.apply(stopwords_removal)
spam_emails = spam_emails.apply(clean_text)
spam_emails = spam_emails.sample(n_samples)
print(f'spam_emails dataframe contains {spam_emails.shape[0]} samples. Listing first 5...')
print(spam_emails.head(n=5))

spam_emails dataframe contains 1000 samples. Listing first 5...
2423                                                   []
345     [from, george, dukephone, , , , greetingthis, ...
2851    [th, floorguangxing, building, th, nanshan, ro...
1329    [from, mreric, udeno, , george, avenueoppst, s...
551     [attention, ceo, presidentfrom, mr, davis, tut...
Name: 0, dtype: object


Let's combine the two into one NumPy array

In [15]:
data = pd.concat([spam_emails, real_emails], axis=0).values
print(f'data is a {type(data)}. Shape is {data.shape}.')
print(data[1])

data is a <class 'numpy.ndarray'>. Shape is (2000,).
['from', 'george', 'dukephone', '', '', '', 'greetingthis', 'letter', 'might', 'surprise', 'met', 'neither', 'person', 'nor', 'correspondence', 'but', 'i', 'believe', 'one', 'day', 'get', 'know', 'somebody', 'either', 'physical', 'correspondence', 'iswhy', 'in', 'spain', 'making', 'contart']


Creating labels for the emails in data. The first 1000 entries [:1000] are spam (label=1) and the last 1000 [1000:] are real emails (label=0), matching the order in which the two series were concatenated.

In [16]:
categories = ['spam', 'real']
labels = ([1]*n_samples) # spams
labels.extend(([0]*n_samples)) # emails

print(f'labels has shape {len(labels)}')
print(labels[:5]) # printing first 5 spam labels
print(labels[1000: 1005]) # printing first 5 real email labels

labels has shape 2000
[1, 1, 1, 1, 1]
[0, 0, 0, 0, 0]


# Transfer Learning with BERT

Import all required libraries for tensorflow on CPU and keras

In [18]:
import tensorflow as tf
import tensorflow_hub as hub
from keras import backend as K

Using TensorFlow backend.


Here we will use BERT embeddings.

In [19]:
if tf.__version__ == "1.15.0":
    print(f'Tensorflow version is {tf.__version__}. Correct!')
else:
    print(f'Tensorflow version is {tf.__version__}. Should be 1.15.0! Failed')
    
sess = tf.Session()
K.set_session(sess)

Tensorflow version is 1.15.0. Correct!


# BERT layer

In [20]:
class BertLayer(tf.keras.layers.Layer):
    def __init__(self,
        n_fine_tune_layer = 10,
        pooling = 'mean',
        bert_path = 'https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1',
        **kwargs
    ):

        self.n_fine_tune_layer = n_fine_tune_layer
        self.trainable = True
        self.output_size = 768
        self.pooling = pooling
        self.bert_path = bert_path

        super(BertLayer, self).__init__(**kwargs)

    def build(self, input_features):

        self.bert_layer = hub.Module(
            self.bert_path, trainable=self.trainable, name=f'{self.name}_module')

        trainable_vars = self.bert_layer.variables
        if self.pooling == 'first':
            trainable_vars = [var for var in trainable_vars if not "/cls/" in var.name]
            trainable_layers = ["pooler/dense"]
        elif self.pooling == 'mean':
            trainable_vars = [
                var for var in trainable_vars
                if not "/cls/" in var.name 
                and not "/pooler/" in var.name]

            trainable_layers = []
        else:
            raise NameError('Undefined pooling type')

        for i in range(self.n_fine_tune_layer):
            trainable_layers.append(f'encoder/layer_{str(11 - i)}')

        trainable_vars = [
            var for var in trainable_vars
            if any([l in var.name for l in trainable_layers])
            ]

        for var in trainable_vars:
            self._trainable_weights.append(var)

        super(BertLayer, self).build(input_features)

    def call(self, inputs):
        inputs = [K.cast(x, dtype='int32') for x in inputs]
        input_ids, input_mask, segment_ids = inputs
        bert_inputs = dict(input_ids=input_ids, input_mask=input_mask, segment_ids=segment_ids)

        if self.pooling == 'first':
            pooled = self.bert_layer(
                inputs=bert_inputs, signature='tokens', as_dict=True)['pooled_output']

        elif self.pooling == 'mean':
            result = self.bert_layer(
                inputs=bert_inputs, signature='tokens', as_dict=True)['sequence_output']

            mul_mask = lambda x, m: x * tf.expand_dims(m, axis=-1)
            masked_reduce_mean = lambda x, m: tf.reduce_sum(mul_mask(x, m), axis=1) / (tf.reduce_sum(m, axis=1, keepdims=True) + 1e-10)
            input_mask = tf.cast(input_mask, tf.float32)
            pooled = masked_reduce_mean(result, input_mask)

        else:
            raise NameError('Undefined pooling type')

        return pooled

    def compute_output_shape(self, input_shape):
        return (input_shape[0], self.output_size)
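
The 'mean' pooling branch of `BertLayer` averages the token vectors of each sequence while ignoring padded positions. The NumPy sketch below reproduces that arithmetic on a tiny hand-made batch so the effect of the mask is visible; `masked_mean` is a hypothetical helper, and the numbers are arbitrary.

```python
import numpy as np

def masked_mean(sequence_output, input_mask, eps=1e-10):
    """Average token vectors over real (mask == 1) positions only."""
    mask = input_mask.astype(np.float32)[..., np.newaxis]   # (batch, seq, 1)
    summed = (sequence_output * mask).sum(axis=1)            # (batch, hidden)
    counts = mask.sum(axis=1)                                # (batch, 1)
    return summed / (counts + eps)                           # eps avoids division by zero

# One sequence of three positions with hidden size 2; the last position is padding
sequence_output = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
input_mask = np.array([[1, 1, 0]])
pooled = masked_mean(sequence_output, input_mask)
```

The padded vector `[100.0, 100.0]` does not influence the result, which is the mean of the first two vectors.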
            


Before converting the data to the format expected by BERT, I convert each email from a list of tokens to a single string. I start from the already processed data because its tokens were already lowercased and filtered for punctuation, stopwords, etc.

In [None]:
def convert_data_to_bert_feature_format(x, y):

    converted_data, converted_labels = [], []
    
    # From list of tokens to one string
    for index in range(x.shape[0]):
        text = ' '.join(x[index])
        converted_data.append(text)
        converted_labels.append(y[index])

    # Converted_data to np.array
    converted_data = np.array(converted_data, dtype=object)[:, np.newaxis]

    return converted_data, np.array(converted_labels)

data, labels = convert_data_to_bert_feature_format(data, labels)

print(f'{data[1]} -> {labels[1]}')

In [None]:
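# The data is ordered by class (spam first, then real emails), so shuffle it
# before splitting; otherwise the test set would contain a single class.
shuffled = np.random.RandomState(42).permutation(data.shape[0])  # fixed seed for reproducibility
data, labels = data[shuffled], labels[shuffled]

idx = int(data.shape[0]*threshold)
train_x, train_y = data[:idx], labels[:idx]
test_x, test_y = data[idx:], labels[idx:]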

print(f'train_x shape: {train_x.shape}, train_y shape: {train_y.shape}')
print(f'test_x shape: {test_x.shape}, test_y shape: {test_y.shape}')

In [None]:
def build_model(max_seq_length):

    in_id = tf.keras.layers.Input(shape=(max_seq_length,), name='input_ids')
    in_mask = tf.keras.layers.Input(shape=(max_seq_length,), name='input_masks')
    in_segment = tf.keras.layers.Input(shape=(max_seq_length,), name='segment_ids')
    bert_layer_inputs = [in_id, in_mask, in_segment]

    bert_output = BertLayer(n_fine_tune_layer=0)(bert_layer_inputs)
    dense = tf.keras.layers.Dense(256, activation='relu')(bert_output)
    output = tf.keras.layers.Dense(1, activation='sigmoid')(dense)

    model = tf.keras.models.Model(inputs=bert_layer_inputs, outputs=output)
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()

    return model

In [None]:
def initialize_vars(sess):
    sess.run(tf.local_variables_initializer())
    sess.run(tf.global_variables_initializer())
    sess.run(tf.tables_initializer())
    K.set_session(sess)

At the time of developing this notebook, bert-tensorflow version 1.0.4 had an open issue (https://github.com/google-research/bert/issues/1133). I had to downgrade it to version 1.0.1, as suggested in that issue.

In [None]:
from tqdm import tqdm
import kerasBert as kb
import bert.tokenization as tk
from tensorflow_hub import Module
import pkg_resources
import bert_utils as bu

bert_tf_version = pkg_resources.get_distribution("bert-tensorflow").version
if bert_tf_version != '1.0.1':
    raise RuntimeError(f'bert-tensorflow version {bert_tf_version} is wrong. Version 1.0.1 is required.')
else:
    print(f'bert-tensorflow version: {bert_tf_version}')

vocab_file_path = '/home/baosiek/Documents/deep_learning/transfer-learning/model/bert/assets/vocab.txt'
tokenizer = bu.create_tokenizer(vocab_file_path)

train_examples = kb.convert_text_to_examples(train_x, train_y)
test_examples = kb.convert_text_to_examples(test_x, test_y)

# (train_input_ids, train_input_masks, train_segment_ids, train_labels) = kb.convert_examples_to_features(
#     tokenizer, train_examples, max_seq_length=max_tokens)

# (test_input_ids, test_input_masks, test_segment_ids, test_labels) = kb.convert_examples_to_features(
#     tokenizer, test_examples, max_seq_length=max_tokens)

# model = build_model(max_tokens)

# initialize_vars(sess)
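
The three arrays that the model consumes (input ids, input masks and segment ids) follow the usual BERT layout for a single sentence. The sketch below illustrates that layout only; it is not the implementation in the local `kerasBert` module, whose internals are not shown in this notebook. `to_bert_features` and `toy_vocab` are hypothetical, and real BERT tokenization uses WordPiece rather than whole words.

```python
def to_bert_features(tokens, vocab, max_seq_length):
    """Build input_ids, input_mask and segment_ids for one single-sentence example."""
    # Reserve two positions for the [CLS] and [SEP] markers
    tokens = ["[CLS]"] + tokens[: max_seq_length - 2] + ["[SEP]"]
    input_ids = [vocab.get(tok, vocab["[UNK]"]) for tok in tokens]
    input_mask = [1] * len(input_ids)            # 1 marks a real token
    padding = [0] * (max_seq_length - len(input_ids))
    input_ids += padding                          # [PAD] has id 0 in this toy vocabulary
    input_mask += padding                         # 0 marks a padded position
    segment_ids = [0] * max_seq_length            # single sentence: every position is segment 0
    return input_ids, input_mask, segment_ids

# Hypothetical vocabulary, far smaller than the real vocab.txt
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "dear": 4, "friend": 5}
ids, mask, segments = to_bert_features(["dear", "friend", "hello"], toy_vocab, max_seq_length=8)
short_ids, short_mask, _ = to_bert_features(["dear"] * 10, toy_vocab, max_seq_length=4)
```

The mask produced here is the same kind of mask that the mean pooling in `BertLayer` relies on to ignore padding.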


In [None]:
import os
import matplotlib.pyplot as plt

# This cell assumes the feature-conversion lines commented out in the previous
# cell have been run, so that the train_*/test_* input arrays and labels exist.
model_path = 'model/bert_model.h5'
history_path = 'model/train_history_dict'

if not os.path.exists(model_path):
    print('Building and training model!')
    model = build_model(max_tokens)
    initialize_vars(sess)
    history = model.fit(
        [train_input_ids, train_input_masks, train_segment_ids], train_labels,
        validation_data=([test_input_ids, test_input_masks, test_segment_ids], test_labels),
        epochs=5, batch_size=32).history  # keep only the metrics dict, which is what gets pickled
    tf.keras.models.save_model(model, model_path)
    with open(history_path, 'wb') as file_pi:
        pickle.dump(history, file_pi)
else:
    print('Loading trained model!')
    model = tf.keras.models.load_model(model_path, custom_objects={'BertLayer': BertLayer})
    with open(history_path, "rb") as fp:
        history = pickle.load(fp)

# The accuracy key is 'acc' or 'accuracy' depending on the Keras version
acc_key = 'accuracy' if 'accuracy' in history else 'acc'

# plotting history for accuracy
plt.plot(history[acc_key])
plt.plot(history['val_' + acc_key])
plt.title('model accuracy')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()
# plotting history for loss
plt.plot(history['loss'])
plt.plot(history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'test'], loc='upper left')
plt.show()


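# The model expects BERT features, not raw strings, so the texts go through the
# same conversion as the training data. The zero labels are placeholders only.
toPredict = np.array([['This is an email'], ['The Royal family lives in Buckingham Palace']], dtype=object)
predict_examples = kb.convert_text_to_examples(toPredict, np.zeros(len(toPredict)))
(predict_input_ids, predict_input_masks, predict_segment_ids, _) = kb.convert_examples_to_features(
    tokenizer, predict_examples, max_seq_length=max_tokens)

predictions = model.predict([predict_input_ids, predict_input_masks, predict_segment_ids])
for text, prediction in zip(toPredict, predictions):
    if prediction[0] < 0.5:
        print(f'{text[0]} is an EMAIL!')
    else:
        print(f'{text[0]} is SPAM!')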

In my experience the above results are uncommon: at epoch 5/5 the training accuracy (0.9864) was lower than the validation accuracy (0.9883). The baseline model had an accuracy of 0.9550. The model appears to be neither underfitting nor overfitting. Note that the hyperparameters max_chars (the maximum number of characters per token) and max_tokens (the maximum number of tokens per email) can be tuned to reach even better performance.

In [None]:
# os._exit(00)