1. Import libraries
Import the Keras, NLTK, NumPy, pandas and scikit-learn modules used in the rest of the notebook. The data itself is read from CSV and pre-processed in step 3.

In [42]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.callbacks import ReduceLROnPlateau
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten, Conv1D, MaxPooling1D, Dropout, Activation, Input
from tensorflow.keras.layers import Embedding, LSTM
from tensorflow.keras.models import Model

# Others
import nltk
import string
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
import re
from sklearn.manifold import TSNE
from nltk.stem import SnowballStemmer
import codecs
import csv

2. Function to clean the text data

In [43]:
def text_to_wordlist(text):
    
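    ## Intended to remove punctuation. Note: str.translate() expects a table built
    ## with str.maketrans(); given string.punctuation directly it only remaps control
    ## characters (e.g. "\n" becomes "+", which is why "+" shows up in the cleaned text below).
    text = text.translate(string.punctuation)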
    
    ## Convert words to lower case and split them
    text = text.lower().split()
    
    ## Remove stop words
    #stops = set(stopwords.words("english"))
    #text = [w for w in text if not w in stops and len(w) >= 3]
    
    text = " ".join(text)
    ## Clean the text
    text = re.sub(r"[^A-Za-z0-9^,!.\/'+-=]", " ", text)
    text = re.sub(r"what's", "what is ", text)
    text = re.sub(r"\'s", " ", text)
    text = re.sub(r"\'ve", " have ", text)
    text = re.sub(r"n't", " not ", text)
    text = re.sub(r"i'm", "i am ", text)
    text = re.sub(r"\'re", " are ", text)
    text = re.sub(r"\'d", " would ", text)
    text = re.sub(r"\'ll", " will ", text)
    text = re.sub(r",", " ", text)
    text = re.sub(r"\.", " ", text)
    text = re.sub(r"!", " ! ", text)
    text = re.sub(r"\/", " ", text)
    text = re.sub(r"\^", " ^ ", text)
    text = re.sub(r"\+", " + ", text)
    text = re.sub(r"\-", " - ", text)
    text = re.sub(r"\=", " = ", text)
    text = re.sub(r"'", " ", text)
    text = re.sub(r"(\d+)(k)", r"\g<1>000", text)
    text = re.sub(r":", " : ", text)
    text = re.sub(r" e g ", " eg ", text)
    text = re.sub(r" b g ", " bg ", text)
    text = re.sub(r" u s ", " american ", text)
    text = re.sub(r"\0s", "0", text)
    text = re.sub(r" 9 11 ", "911", text)
    text = re.sub(r"e - mail", "email", text)
    text = re.sub(r"j k", "jk", text)
    text = re.sub(r"\s{2,}", " ", text)
    return text
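
A note on the first step of `text_to_wordlist`: in Python 3, `str.translate` takes a translation table, not a string of characters to delete. The sketch below, using only the standard library and a made-up sample string, contrasts the call as written in the function with a table built by `str.maketrans`. Swapping the table version into the function would change the cleaned text, so a model trained on the original cleaning would likely need retraining.

```python
import string

sample = "Hello, world!\nSecond line."

# string.punctuation used as a table is indexed by code point, so only the
# control characters 0-31 are remapped: "\n" (code point 10) becomes "+".
as_in_notebook = sample.translate(string.punctuation)

# A table from str.maketrans deletes the punctuation characters instead.
strip_table = str.maketrans("", "", string.punctuation)
stripped = sample.translate(strip_table)

print(repr(as_in_notebook))
print(repr(stripped))
```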

3. Read data
Read the data from CSV, apply the preprocessing steps, and then convert the text and labels into separate arrays.

In [44]:
# docs = [] 
# labels = []
# with codecs.open(r'data/all_email_data.csv', encoding='utf-8') as f:
#     reader = csv.reader(f, delimiter=',')
#     header = next(reader)
#     for values in reader:
#         docs.append(text_to_wordlist(values[0]))
#         labels.append((values[1]))

data = pd.read_csv(r"data/all_email_data.csv")

In [45]:
data['text'] = data['text'].apply(text_to_wordlist)

In [46]:
data

Unnamed: 0,text,spam
0,naturally irresistible your corporate identity...,1
1,the stock trading gunslinger fanny is merrill ...,1
2,unbelievable new homes made easy im wanting to...,1
3,4 color printing special request additional in...,1
4,do not have money get software cds from here !...,1
...,...,...
25695,preferred non - smoker + + just what the doct...,1
25696,dear subscriber + + if i could show you a way ...,1
25697,mid - summer customer appreciation sale ! + +...,1
25698,attn : sir madan + + strictly confidential + +...,1


4. Encoding label class as 1 if class is 'spam' else 0

In [47]:
# labels = [1 if x == 'spam' else 0 for x in labels]

5. Tokenize the text data

In [48]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['text'])

6. Calculate the vocabulary size

In [49]:
vocab_size = len(tokenizer.word_index) + 1

In [50]:
len(tokenizer.word_index)

125024
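
The `+ 1` in `vocab_size` is there because the tokenizer numbers words from 1, leaving index 0 free for padding. The sketch below illustrates this with `build_word_index`, a hypothetical simplified stand-in for `Tokenizer.word_index` (it is not the Keras implementation), applied to invented texts.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency, starting at 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

texts = ["free money now", "meeting notes for the team", "free free offer"]
word_index = build_word_index(texts)

# One extra slot so that the largest index is still a valid row number.
vocab_size = len(word_index) + 1
print(word_index["free"], vocab_size)
```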

7. Making sequence of the documents

In [51]:
sequences = tokenizer.texts_to_sequences(data['text'])
print('Found %s unique tokens' % len(tokenizer.word_index))

Found 125024 unique tokens


8. Padding sequences to a maximum length of 300

In [52]:
max_len = 300
padded_docs = pad_sequences(sequences, maxlen=max_len, padding='post')
labels = np.array(data['spam'])
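
With `padding='post'`, zeros are appended to short sequences. Truncation is a separate setting: `pad_sequences` defaults to `truncating='pre'`, so emails longer than `max_len` lose their earliest tokens. The helper `pad_post` below is a hypothetical pure-Python imitation of that behaviour for illustration, not part of Keras.

```python
def pad_post(sequences, maxlen, value=0):
    """Imitate pad_sequences(..., padding='post') with truncating='pre'."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                      # keep the last maxlen tokens
        padded.append(kept + [value] * (maxlen - len(kept)))  # zeros go at the end
    return padded

result = pad_post([[5, 3], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(result)
```

If the start of an email matters more than the end, passing `truncating='post'` to `pad_sequences` keeps the first tokens instead.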

9. Loading a pre-trained dictionary of word embeddings that maps each word to a 100-dimensional vector.
More info on the project that created this dataset: https://nlp.stanford.edu/projects/glove/

In [53]:
embeddings_index = dict()
# Each line holds a word followed by the components of its vector
with open(r'data/glove.6B.100d.txt', encoding="utf8") as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
print('Loaded %s word vectors.' % len(embeddings_index))

Loaded 400000 word vectors.
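
Each line of the GloVe text file is a word followed by its space-separated coefficients. The sketch below parses two invented three-dimensional lines from an in-memory file to show the resulting structure. It uses plain Python floats instead of NumPy arrays only to stay dependency-free.

```python
import io

# Stand-in for the real file; the words and numbers are made up.
fake_glove = io.StringIO("the 0.1 -0.2 0.3\nspam 0.4 0.5 -0.6\n")

embeddings = {}
for line in fake_glove:
    word, *coefs = line.split()          # first token is the word
    embeddings[word] = [float(c) for c in coefs]

print(len(embeddings), embeddings["spam"])
```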


In [54]:
tokenizer.word_index

{'the': 1,
 'to': 2,
 'and': 3,
 'of': 4,
 'a': 5,
 'you': 6,
 'for': 7,
 'in': 8,
 'i': 9,
 'is': 10,
 'ect': 11,
 'on': 12,
 'this': 13,
 'that': 14,
 'it': 15,
 'be': 16,
 'enron': 17,
 'com': 18,
 'your': 19,
 'with': 20,
 'have': 21,
 'not': 22,
 'we': 23,
 'will': 24,
 'are': 25,
 'from': 26,
 'at': 27,
 'hou': 28,
 'as': 29,
 'http': 30,
 'or': 31,
 'by': 32,
 'if': 33,
 's': 34,
 '1': 35,
 '2000': 36,
 'am': 37,
 'can': 38,
 'please': 39,
 '2': 40,
 'me': 41,
 'our': 42,
 'subject': 43,
 'all': 44,
 'do': 45,
 'an': 46,
 'www': 47,
 '3': 48,
 'my': 49,
 'would': 50,
 'was': 51,
 '0': 52,
 'but': 53,
 'has': 54,
 'any': 55,
 'cc': 56,
 '00': 57,
 '10': 58,
 'vince': 59,
 'pm': 60,
 're': 61,
 'get': 62,
 'so': 63,
 'no': 64,
 'new': 65,
 'one': 66,
 'net': 67,
 'more': 68,
 'they': 69,
 'there': 70,
 'up': 71,
 'email': 72,
 'time': 73,
 'e': 74,
 '2001': 75,
 'know': 76,
 'out': 77,
 'list': 78,
 '5': 79,
 'about': 80,
 'gas': 81,
 'thanks': 82,
 '4': 83,
 'what': 84,
 '000': 8

10. We only need the subset of these 400,000 words that appears in our docs, so we create a weight matrix for the words in the training docs.

In [55]:
embedding_matrix = np.zeros((vocab_size, 100))

In [56]:
for word, index in tokenizer.word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[index] = embedding_vector

In [57]:
print(embedding_matrix.shape)

(125025, 100)
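
Words with no pre-trained vector keep an all-zero row in the matrix, so it can be useful to check how much of the vocabulary is covered. A minimal sketch with toy dictionaries follows; the words, vectors and the `coverage` figure are illustrative, not measured on this dataset.

```python
word_index = {"the": 1, "enron": 2, "hpshum": 3}        # toy tokenizer vocabulary
pretrained = {"the": [0.1, 0.2], "enron": [0.3, 0.4]}    # toy embedding lookup
dim = 2

# Row 0 is the padding row; rows of unknown words stay at zero.
matrix = [[0.0] * dim for _ in range(len(word_index) + 1)]
missing = []
for word, idx in word_index.items():
    vec = pretrained.get(word)
    if vec is None:
        missing.append(word)
    else:
        matrix[idx] = list(vec)

coverage = 1 - len(missing) / len(word_index)
print(missing, round(coverage, 2))
```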


In [58]:
# def RNN():
#     inputs = Input(name='inputs',shape=[max_len])
#     layer = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_len, trainable=False)(inputs)
#     layer = LSTM(64)(layer)
#     layer = Dense(256,name='FC1')(layer)
#     layer = Activation('relu')(layer)
#     layer = Dropout(0.5)(layer)
#     layer = Dense(1,name='out_layer')(layer)
#     layer = Activation('sigmoid')(layer)
#     model = Model(inputs=inputs,outputs=layer)
#     return model

# def rnn_2():
#     inputs = Input(name='inputs',shape=[max_len])
#     embd = Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_len, trainable=False)(inputs)
#     X = LSTM(128, return_sequences=True)(embd)
#     # Add dropout with a probability of 0.5
#     X = Dropout(0.5)(X)
#     # Propagate X through another LSTM layer with a 128-dimensional hidden state
#     X = LSTM(128, return_sequences=False)(X)
#     X = Dropout(0.5)(X)
#     # Propagate X through a Dense layer with 1 unit and a sigmoid activation
#     X = Dense(1, activation="sigmoid")(X)
    
#     model = Model(inputs=inputs, outputs=X)

#     return model

Creating neural network layers

In [59]:
model = Sequential()
model.add(Embedding(vocab_size, 100, weights=[embedding_matrix], input_length=max_len, trainable=True))
model.add(Conv1D(64, 2, activation='relu'))
model.add(MaxPooling1D(2))
model.add(Dropout(0.3))
model.add(Conv1D(128, 2, activation='relu'))
model.add(MaxPooling1D(2))
model.add(Dropout(0.3))
model.add(Conv1D(256, 2, activation='relu'))
model.add(MaxPooling1D(2))
model.add(Dropout(0.3))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# model = rnn_2()
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
model.summary()  # prints the summary itself; wrapping it in print() would also print None

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 300, 100)          12502500  
_________________________________________________________________
conv1d_3 (Conv1D)            (None, 299, 64)           12864     
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 149, 64)           0         
_________________________________________________________________
dropout_3 (Dropout)          (None, 149, 64)           0         
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 148, 128)          16512     
_________________________________________________________________
max_pooling1d_4 (MaxPooling1 (None, 74, 128)           0         
_________________________________________________________________
dropout_4 (Dropout)          (None, 74, 128)          
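
The shapes and parameter counts in the summary can be checked by hand. The sketch below assumes the Keras defaults used by the model (`padding='valid'` and stride 1 for `Conv1D`, non-overlapping windows for `MaxPooling1D(2)`); the helper names are hypothetical.

```python
def conv1d_out_len(length, kernel_size):
    """Output length of a 'valid' convolution with stride 1."""
    return length - kernel_size + 1

def conv1d_params(in_channels, filters, kernel_size):
    """Kernel weights plus one bias per filter."""
    return kernel_size * in_channels * filters + filters

vocab_size, embed_dim, max_len = 125025, 100, 300

embedding_params = vocab_size * embed_dim
len_conv1 = conv1d_out_len(max_len, 2)
len_pool1 = len_conv1 // 2            # MaxPooling1D(2) halves, rounding down
len_conv2 = conv1d_out_len(len_pool1, 2)
len_pool2 = len_conv2 // 2

print(embedding_params, conv1d_params(embed_dim, 64, 2), conv1d_params(64, 128, 2))
print(len_conv1, len_pool1, len_conv2, len_pool2)
```

Almost all of the parameters sit in the embedding layer, which is trainable here (`trainable=True`).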

Dividing the data into train and test sets

In [60]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(padded_docs, labels, test_size=0.2, random_state=42)

In [61]:
y_train.shape

(20560,)

In [62]:
class_weight = {0: 1.,
                1: 3.3}
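
The weight of 3.3 for the spam class is set by hand, and the notebook does not say how it was chosen. One common heuristic, shown here only as a sketch, is to weight each class by the ratio of the majority count to its own count. The label distribution below is invented and is not the distribution of this dataset.

```python
from collections import Counter

labels = [0] * 30 + [1] * 10          # hypothetical labels: 30 ham, 10 spam
counts = Counter(labels)
majority = max(counts.values())

# The majority class gets weight 1.0; rarer classes get proportionally more.
derived_weight = {cls: majority / n for cls, n in counts.items()}
print(derived_weight)
```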

Training and evaluating the model

In [63]:
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(monitor='val_loss', restore_best_weights=True, patience=5)

model.fit(X_train, y_train, epochs=50, batch_size=32, validation_split=0.2, callbacks=[early_stop],
         class_weight=class_weight)
# evaluate the model
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print('Accuracy: %f' % (accuracy*100))

Train on 16448 samples, validate on 4112 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Accuracy: 96.089494
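
Accuracy alone can hide errors on the less frequent class, so precision and recall for the spam class are worth reporting alongside it. A minimal pure-Python sketch on invented labels and probabilities follows; `precision_recall` is a hypothetical helper, and in the notebook the inputs would be `y_test` and the flattened output of `model.predict(X_test)`.

```python
def precision_recall(y_true, y_prob, threshold=0.5):
    """Precision and recall of the positive (spam) class at a given threshold."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 0, 1, 1, 0, 0]                 # made-up ground truth
y_prob = [0.9, 0.8, 0.7, 0.2, 0.1, 0.4]     # made-up model outputs
precision, recall = precision_recall(y_true, y_prob)
print(precision, recall)
```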


In [64]:
not_spam = """Dear Counsel,
We acknowledge, with thanks, your email and the instructions contained therein. Our representative will collect the
hard copy documents.
We are proceeding to action the instructions and will keep you updated.
With best regards,
SF"""

In [65]:
not_spam = text_to_wordlist(not_spam)
not_spam

'dear counsel + we acknowledge with thanks your email and the instructions contained therein our representative will collect the + hard copy documents + we are proceeding to action the instructions and will keep you updated + with best regards + sf'

In [66]:
not_spam = tokenizer.texts_to_sequences([not_spam])

In [67]:
not_spam = pad_sequences(not_spam, maxlen=max_len, padding='post')

In [68]:
model.predict(not_spam)

array([[0.8551485]], dtype=float32)

In [69]:
spam = """Subject: congratulations hpshum you \' ve won !  congratulations !  official  notification  hpshum @ hotmail . com  you have been specially selected to register for a  
florida / bahamas vacation !  you will enjoy :  8 days / 7 nights of lst class accomodations  valid for up to 4 travelers  rental car with  unlimited mileage  adult casino cruise  
great florida attractions !  much much more . . .  click here !  ( limited availability )  to no longer receive this or any other offer from us , click here to unsubscribe .  
[ bjk 9 ^ " : } h & * tgobk 5 nkiys 5 ]"""

In [70]:
spam = text_to_wordlist(spam)
spam = tokenizer.texts_to_sequences([spam])
spam = pad_sequences(spam, maxlen=max_len, padding='post')

In [71]:
model.predict(spam)

array([[0.99990416]], dtype=float32)
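
The clean, tokenize, pad, predict sequence is repeated for every test email in this notebook. It could be wrapped in one helper, sketched below. `spam_probability` is a hypothetical name; the function takes its collaborators as arguments so that it does not depend on notebook globals.

```python
def spam_probability(text, clean, tokenizer, pad, model, max_len=300):
    """Run one raw email through cleaning, tokenizing, padding and prediction."""
    cleaned = clean(text)
    sequences = tokenizer.texts_to_sequences([cleaned])
    padded = pad(sequences, maxlen=max_len, padding="post")
    # predict() returns one row per input with a single sigmoid output
    return float(model.predict(padded)[0][0])
```

In this notebook the call would be `spam_probability(spam, text_to_wordlist, tokenizer, pad_sequences, model, max_len)`.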

In [72]:
not_spam = """Subject: congratulations  vince ,  congratulations on your promotion to managing director . you certainly deserve  it .  zhiyong"""

In [73]:
not_spam = text_to_wordlist(not_spam)
not_spam = tokenizer.texts_to_sequences([not_spam])
not_spam = pad_sequences(not_spam, maxlen=max_len, padding='post')

In [74]:
model.predict(not_spam)

array([[0.00056683]], dtype=float32)

In [75]:
spam = """Hello!

I am a hacker who has access to your operating system.
I also have full access to your account.

I've been watching you for a few months now.
The fact is that you were infected with malware through an adult site that you visited.

If you are not familiar with this, I will explain.
Trojan Virus gives me full access and control over a computer or other device.
This means that I can see everything on your screen, turn on the camera and microphone, but you do not know about it.

I also have access to all your contacts and all your correspondence.

Why your antivirus did not detect malware?
Answer: My malware uses the driver, I update its signatures every 4 hours so that your antivirus is silent.

I made a video showing how you satisfy yourself in the left half of the screen, and in the right half you see the video that you watched.
With one click of the mouse, I can send this video to all your emails and contacts on social networks.
I can also post access to all your e-mail correspondence and messengers that you use.

If you want to prevent this,
transfer the amount of $500 to my bitcoin address (if you do not know how to do this, write to Google: "Buy Bitcoin").

My bitcoin address (BTC Wallet) is:  15gyQqNaV7n6befX6gTvn1LHR8GQBPUc2A

After receiving the payment, I will delete the video and you will never hear me again.
I give you 50 hours (more than 2 days) to pay.
I have a notice reading this letter, and the timer will work when you see this letter.

Filing a complaint somewhere does not make sense because this email cannot be tracked like my bitcoin address.
I do not make any mistakes.

If I find that you have shared this message with someone else, the video will be immediately distributed.

Best regards!"""

In [76]:
spam = text_to_wordlist(spam)
spam = tokenizer.texts_to_sequences([spam])
spam = pad_sequences(spam, maxlen=max_len, padding='post')

In [77]:
result = model.predict(spam)
print(result)

[[0.9981748]]


In [78]:
spam = """Dear User,
Courtesy Notice from Admin Team
You have reached the storage limit for your Mailbox and database server.
You will be blocked from sending and receiving new messages. If your email is not verified within 48 hours.
Please click BELOW to verify and access the e-mail restore.

CLICK HERE
Thanks,
WINDOWS LIVE TEAM"""

In [79]:
spam = text_to_wordlist(spam)
spam = tokenizer.texts_to_sequences([spam])
spam = pad_sequences(spam, maxlen=max_len, padding='post')
result = model.predict(spam)
print(result)

[[0.606051]]


In [80]:
result[0][0]

0.606051
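
The model outputs a probability, so a threshold is needed to turn it into a label. The sketch below applies two thresholds to the scores printed above; the short names given to the emails and the helper `to_label` are illustrative. At 0.5 the legitimate acknowledgement email (0.8551485) is flagged as spam. At 0.9 it is not, but the mailbox phishing email (0.606051) is missed.

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to a class label."""
    return "spam" if probability >= threshold else "not spam"

# Scores copied from the predictions in this notebook.
scores = {
    "legal acknowledgement": 0.8551485,
    "vacation offer": 0.99990416,
    "promotion note": 0.00056683,
    "mailbox phishing": 0.606051,
}

at_050 = {name: to_label(p, 0.5) for name, p in scores.items()}
at_090 = {name: to_label(p, 0.9) for name, p in scores.items()}
print(at_050)
print(at_090)
```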

In [81]:
model.save(r'model/spam_detector.h5')

In [82]:
import joblib

joblib.dump(tokenizer, r'model/classifier.pkl', compress=True)

['model/classifier.pkl']
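
To reuse the saved artifacts in another process, both files need to be loaded back. A minimal sketch follows, assuming the paths used above and that TensorFlow and joblib are installed. `load_artifacts` is a hypothetical helper, and note that `classifier.pkl` holds the tokenizer, not a classifier.

```python
def load_artifacts(model_path="model/spam_detector.h5",
                   tokenizer_path="model/classifier.pkl"):
    """Return the saved Keras model and the fitted tokenizer."""
    # Imported here so that defining the helper does not require the libraries.
    import joblib
    from tensorflow.keras.models import load_model

    model = load_model(model_path)
    tokenizer = joblib.load(tokenizer_path)
    return model, tokenizer
```

New emails must be cleaned with the same `text_to_wordlist` function and padded to the same `max_len` before prediction.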