## Email Spam Classification (NLP/Word Embeddings)

Given *the raw text of emails*, let's try to predict whether a given email is **spam** or not.

We will use a TensorFlow/Keras neural network with word embeddings to make our predictions.

Data Source: https://www.kaggle.com/datasets/chandramoulinaidu/spam-classification-for-basic-nlp

### Importing Libraries

In [1]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
import tensorflow as tf



In [2]:
data = pd.read_csv('Spam Email raw text for NLP.csv')
data

| | CATEGORY | MESSAGE | FILE_NAME |
|---|---|---|---|
| 0 | 1 | `Dear Homeowner,\n\n \n\nInterest Rates are at ...` | 00249.5f45607c1bffe89f60ba1ec9f878039a |
| 1 | 1 | `ATTENTION: This is a MUST for ALL Computer Use...` | 00373.ebe8670ac56b04125c25100a36ab0510 |
| 2 | 1 | `This is a multi-part message in MIME format.\n...` | 00214.1367039e50dc6b7adb0f2aa8aba83216 |
| 3 | 1 | `IMPORTANT INFORMATION:\n\n\n\nThe new domain n...` | 00210.050ffd105bd4e006771ee63cabc59978 |
| 4 | 1 | `This is the bottom line. If you can GIVE AWAY...` | 00033.9babb58d9298daa2963d4f514193d7d6 |
| ... | ... | ... | ... |
| 5791 | 0 | `I'm one of the 30,000 but it's not working ver...` | 00609.dd49926ce94a1ea328cce9b62825bc97 |
| 5792 | 0 | `Damien Morton quoted:\n\n>W3C approves HTML 4 ...` | 00957.e0b56b117f3ec5f85e432a9d2a47801f |
| 5793 | 0 | `On Mon, 2002-07-22 at 06:50, che wrote:\n\n\n\...` | 01127.841233b48eceb74a825417d8d918abf8 |
| 5794 | 0 | `Once upon a time, Manfred wrote :\n\n\n\n> I w...` | 01178.5c977dff972cd6eef64d4173b90307f0 |


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5796 entries, 0 to 5795
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   CATEGORY   5796 non-null   int64 
 1   MESSAGE    5796 non-null   object
 2   FILE_NAME  5796 non-null   object
dtypes: int64(1), object(2)
memory usage: 136.0+ KB


### Preprocessing

In [4]:
df = data.copy()
df

| | CATEGORY | MESSAGE | FILE_NAME |
|---|---|---|---|
| 0 | 1 | `Dear Homeowner,\n\n \n\nInterest Rates are at ...` | 00249.5f45607c1bffe89f60ba1ec9f878039a |
| 1 | 1 | `ATTENTION: This is a MUST for ALL Computer Use...` | 00373.ebe8670ac56b04125c25100a36ab0510 |
| 2 | 1 | `This is a multi-part message in MIME format.\n...` | 00214.1367039e50dc6b7adb0f2aa8aba83216 |
| 3 | 1 | `IMPORTANT INFORMATION:\n\n\n\nThe new domain n...` | 00210.050ffd105bd4e006771ee63cabc59978 |
| 4 | 1 | `This is the bottom line. If you can GIVE AWAY...` | 00033.9babb58d9298daa2963d4f514193d7d6 |
| ... | ... | ... | ... |
| 5791 | 0 | `I'm one of the 30,000 but it's not working ver...` | 00609.dd49926ce94a1ea328cce9b62825bc97 |
| 5792 | 0 | `Damien Morton quoted:\n\n>W3C approves HTML 4 ...` | 00957.e0b56b117f3ec5f85e432a9d2a47801f |
| 5793 | 0 | `On Mon, 2002-07-22 at 06:50, che wrote:\n\n\n\...` | 01127.841233b48eceb74a825417d8d918abf8 |
| 5794 | 0 | `Once upon a time, Manfred wrote :\n\n\n\n> I w...` | 01178.5c977dff972cd6eef64d4173b90307f0 |


In [5]:
# Drop FILE_NAME column
df = df.drop('FILE_NAME', axis=1)
df

| | CATEGORY | MESSAGE |
|---|---|---|
| 0 | 1 | `Dear Homeowner,\n\n \n\nInterest Rates are at ...` |
| 1 | 1 | `ATTENTION: This is a MUST for ALL Computer Use...` |
| 2 | 1 | `This is a multi-part message in MIME format.\n...` |
| 3 | 1 | `IMPORTANT INFORMATION:\n\n\n\nThe new domain n...` |
| 4 | 1 | `This is the bottom line. If you can GIVE AWAY...` |
| ... | ... | ... |
| 5791 | 0 | `I'm one of the 30,000 but it's not working ver...` |
| 5792 | 0 | `Damien Morton quoted:\n\n>W3C approves HTML 4 ...` |
| 5793 | 0 | `On Mon, 2002-07-22 at 06:50, che wrote:\n\n\n\...` |
| 5794 | 0 | `Once upon a time, Manfred wrote :\n\n\n\n> I w...` |


In [6]:
# Split df into X and y
y = df['CATEGORY']
X = df['MESSAGE']

In [7]:
y

0       1
1       1
2       1
3       1
4       1
       ..
5791    0
5792    0
5793    0
5794    0
5795    0
Name: CATEGORY, Length: 5796, dtype: int64

In [8]:
X

0       Dear Homeowner,\n\n \n\nInterest Rates are at ...
1       ATTENTION: This is a MUST for ALL Computer Use...
2       This is a multi-part message in MIME format.\n...
3       IMPORTANT INFORMATION:\n\n\n\nThe new domain n...
4       This is the bottom line.  If you can GIVE AWAY...
                              ...                        
5791    I'm one of the 30,000 but it's not working ver...
5792    Damien Morton quoted:\n\n>W3C approves HTML 4 ...
5793    On Mon, 2002-07-22 at 06:50, che wrote:\n\n\n\...
5794    Once upon a time, Manfred wrote :\n\n\n\n> I w...
5795    If you run Pick, and then use the "New FTOC" b...
Name: MESSAGE, Length: 5796, dtype: object

In [9]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
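
As a quick sanity check on the split sizes, the arithmetic below shows how a fractional `train_size` can be turned into row counts: the training count is rounded down and the remainder is held out. This is a sketch of the arithmetic only, not scikit-learn's code, and `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, train_size):
    """Hypothetical helper: row counts for a fractional train_size."""
    n_train = math.floor(n_samples * train_size)  # round the training share down
    n_test = n_samples - n_train                  # everything else is held out
    return n_train, n_test

n_train, n_test = split_sizes(5796, 0.7)
print(n_train, n_test)
```

For the 5,796 emails in this dataset, that works out to 4,057 training rows and 1,739 test rows.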

In [10]:
X_train

4530    I don't know for skimming that article which D...
4478    I'm talking specifically about the last ~24 ho...
1156    <html>\n\n<body>\n\n<p align=3D"center"><br>\n...
5539    \n\nSo are PNGs still kosh?\n\n\n\n\n\n\n\n\n\...
2088    right Mike,\n\n\n\ni will agree to disagree bu...
                              ...                        
905     <html>\n\n\n\n<body>\n\n\n\n<font size="2" PTS...
5192    \n\n\n\nformail did the trick. Thanks to those...
3980    URL: http://www.askbjoernhansen.com/archives/2...
235     <html>\n\n<head>\n\n   <meta http-equiv=3D"Con...
5157    >>>>> "E" == Elias Sinderson <elias@cse.ucsc.e...
Name: MESSAGE, Length: 4057, dtype: object

In [11]:
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=30000)

In [12]:
tokenizer.fit_on_texts(X_train)

In [13]:
tokenizer.word_index

{'the': 1,
 'font': 2,
 '3d': 3,
 'to': 4,
 'a': 5,
 'and': 6,
 'of': 7,
 'td': 8,
 'br': 9,
 'you': 10,
 'in': 11,
 'http': 12,
 'is': 13,
 'i': 14,
 'size': 15,
 'for': 16,
 'com': 17,
 'b': 18,
 'that': 19,
 'p': 20,
 'this': 21,
 'tr': 22,
 'it': 23,
 'nbsp': 24,
 '0': 25,
 'color': 26,
 'face': 27,
 'width': 28,
 'on': 29,
 'your': 30,
 '1': 31,
 'www': 32,
 '2': 33,
 'be': 34,
 'with': 35,
 'are': 36,
 'align': 37,
 'or': 38,
 '20': 39,
 'have': 40,
 'from': 41,
 'not': 42,
 'arial': 43,
 'as': 44,
 'net': 45,
 'if': 46,
 'by': 47,
 'center': 48,
 'list': 49,
 'at': 50,
 'table': 51,
 'will': 52,
 'href': 53,
 'we': 54,
 'height': 55,
 'html': 56,
 'all': 57,
 'can': 58,
 'border': 59,
 'email': 60,
 'but': 61,
 'an': 62,
 '3': 63,
 'my': 64,
 'name': 65,
 'our': 66,
 'helvetica': 67,
 '4': 68,
 'do': 69,
 'was': 70,
 'one': 71,
 'more': 72,
 'serif': 73,
 'sans': 74,
 'no': 75,
 'linux': 76,
 '5': 77,
 'div': 78,
 'style': 79,
 'verdana': 80,
 'get': 81,
 'so': 82,
 'src': 83,
 

In [14]:
len(tokenizer.word_index)

79082
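
Note that `word_index` holds every word seen during fitting (79,082 here), even though the tokenizer was created with `num_words=30000`. As far as I understand the Keras tokenizer, the limit is only applied when texts are converted to sequences, where words ranked at or beyond `num_words` are dropped. The pure-Python sketch below imitates that behaviour on toy data; it is a simplification (no punctuation filtering), and `build_word_index` and `to_sequences` are hypothetical names.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Encode texts, dropping any word whose index is num_words or higher."""
    return [
        [word_index[w] for w in text.lower().split()
         if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

texts = ["the cat saw the dog", "the cat sat"]  # toy data, not from the dataset
word_index = build_word_index(texts)            # contains all five distinct words
sequences = to_sequences(texts, word_index, num_words=3)
print(len(word_index), sequences)
```

Index 0 is never assigned to a word, which leaves it free to serve as the padding value later.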

In [15]:
# The sequences have different lengths, so request an object array explicitly
np.array(tokenizer.texts_to_sequences(X_train), dtype=object)


array([list([14, 135, 188, 16, 18296, 19, 588, 118, 358, 358, 6, 4258, 6, 22861, 6, 70, 884, 61, 1, 22862, 5, 706, 18297, 58, 54, 2452, 1, 1622, 2868, 13691, 13, 1540, 64, 8777, 1766, 13, 42, 15436, 132, 133, 4, 993, 16, 2622, 14, 9386, 472, 12, 330, 17, 144, 109, 244]),
       list([183, 1452, 2119, 105, 1, 375, 405, 582, 273, 23, 625, 125, 500, 40, 42, 142, 504, 119, 13692, 142, 236, 1, 3664, 3810, 40, 142, 2655, 132, 3334, 142, 2143, 360, 809, 1131, 118, 296, 142, 22863, 4, 1, 566, 118, 40, 142, 5026, 110, 6, 29, 965, 323, 183, 42, 1268, 151, 1341, 42, 504, 38, 19, 23, 96, 75, 1591, 7, 285, 912, 222, 1028, 41, 103, 98, 19, 151, 42, 794, 504, 6, 115, 40, 142, 5, 290, 7, 18298, 7746, 11, 1, 375, 1302, 608, 714, 151, 597, 4, 6688, 19, 54, 135, 188, 634, 46, 1, 966, 494, 13, 434, 4, 178, 38, 42, 23, 106, 98, 472, 6689, 29, 6, 19, 22864, 625, 4, 40, 22865, 35, 119, 10082, 500, 40, 2869, 504, 130, 195, 29, 1805, 783, 206, 92, 50, 341, 1286, 639, 6690, 13693, 170, 29, 1078, 518, 498, 92, 5

In [16]:
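def get_sequences(texts, tokenizer, train=True, max_seq_length=None):
    # Map each text to a list of integer word indices
    sequences = tokenizer.texts_to_sequences(texts)
    # For training data, use the longest sequence as the padded length
    if train:
        max_seq_length = max(len(seq) for seq in sequences)
    # Pad with zeros at the end so every sequence has the same length
    sequences = tf.keras.preprocessing.sequence.pad_sequences(sequences, maxlen=max_seq_length, padding='post')
    return sequences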
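
To make the padding step concrete, here is a minimal pure-Python sketch of what `padding='post'` does. It assumes the Keras default of truncating from the front (`truncating='pre'`) for sequences longer than `maxlen`, which matters for any test email longer than the longest training email. `pad_post` is a hypothetical stand-in, not the Keras implementation.

```python
def pad_post(sequences, maxlen):
    """Zero-pad at the end; keep only the last maxlen tokens of longer sequences."""
    padded = []
    for seq in sequences:
        kept = seq[-maxlen:]                              # truncate from the front
        padded.append(kept + [0] * (maxlen - len(kept)))  # fill the tail with zeros
    return padded

print(pad_post([[5, 6, 7], [8]], maxlen=2))
```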

In [17]:
get_sequences(X_train, tokenizer, train=True).shape

(4057, 14804)

In [18]:
# Create tokenizer 
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=30000)

In [19]:
# Fit the tokenizer
tokenizer.fit_on_texts(X_train)

In [20]:
# Convert texts to sequences
X_train = get_sequences(X_train, tokenizer, train=True)
X_test = get_sequences(X_test, tokenizer, train=False, max_seq_length=X_train.shape[1])

In [21]:
X_train

array([[   14,   135,   188, ...,     0,     0,     0],
       [  183,  1452,  2119, ...,     0,     0,     0],
       [   56,   129,    20, ...,     0,     0,     0],
       ...,
       [  309,    12,    32, ...,     0,     0,     0],
       [   56,   179,   298, ...,     0,     0,     0],
       [   94,  4671, 11648, ...,     0,     0,     0]], dtype=int32)

In [22]:
X_train.shape

(4057, 14804)

In [23]:
y_train.value_counts()

CATEGORY
0    2738
1    1319
Name: count, dtype: int64
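
The classes are imbalanced, so it helps to know what accuracy a trivial model would reach. The short calculation below uses the counts printed above to work out the spam share and the accuracy of always predicting the majority class on the training set.

```python
counts = {0: 2738, 1: 1319}  # from y_train.value_counts() above
total = sum(counts.values())

spam_share = counts[1] / total                    # fraction of training emails that are spam
majority_baseline = max(counts.values()) / total  # accuracy of always predicting "not spam"

print(f"spam share: {spam_share:.1%}, majority baseline: {majority_baseline:.1%}")
```

Roughly a third of the training emails are spam, so any useful model needs to beat an accuracy of about two thirds.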

### Training

In [24]:
inputs = tf.keras.Input(shape=(X_train.shape[1],))  # 14804 tokens per padded email

embedding = tf.keras.layers.Embedding(
    input_dim=30000,
    output_dim=64
)(inputs)



In [25]:
inputs

<KerasTensor: shape=(None, 14804) dtype=float32 (created by layer 'input_1')>

In [26]:
embedding

<KerasTensor: shape=(None, 14804, 64) dtype=float32 (created by layer 'embedding')>

In [27]:
flatten = tf.keras.layers.Flatten()(embedding)
flatten

<KerasTensor: shape=(None, 947456) dtype=float32 (created by layer 'flatten')>
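
The shapes above follow from two simple operations: the embedding layer looks up one vector per token, and `Flatten` concatenates those vectors. The sketch below reproduces both steps with plain lists and toy sizes. The random table stands in for learned weights, and `embed` and `flatten` are hypothetical helpers.

```python
import random

vocab_size, embed_dim, seq_len = 10, 4, 3  # toy sizes; the notebook uses 30000, 64, 14804

random.seed(0)
# One row of embed_dim numbers per vocabulary index (random stand-in for trained weights)
table = [[random.uniform(-0.05, 0.05) for _ in range(embed_dim)] for _ in range(vocab_size)]

def embed(sequence):
    """Replace each word index with its embedding row."""
    return [table[idx] for idx in sequence]

def flatten(matrix):
    """Concatenate the rows into one long feature vector."""
    return [value for row in matrix for value in row]

padded_email = [7, 2, 0]  # index 0 is the padding value and also gets a row
embedded = embed(padded_email)
features = flatten(embedded)
print(len(embedded), len(embedded[0]), len(features))
```

With the notebook's sizes, the same multiplication gives 14804 × 64 = 947,456 features per email.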

In [28]:
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(flatten)

model = tf.keras.Model(inputs=inputs, outputs=outputs)

In [29]:
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy', tf.keras.metrics.AUC(name='auc')]
)
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 14804)]           0         
                                                                 
 embedding (Embedding)       (None, 14804, 64)         1920000   
                                                                 
 flatten (Flatten)           (None, 947456)            0         
                                                                 
 dense (Dense)               (None, 1)                 947457    
                                                                 
Total params: 2,867,457
Trainable params: 2,867,457
Non-trainable params: 0
_________________________________________________________________
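
The parameter counts in the summary can be checked by hand. The arithmetic below recomputes them from the vocabulary size, embedding dimension, and sequence length used in this notebook.

```python
vocab_size, embed_dim, seq_len = 30000, 64, 14804

embedding_params = vocab_size * embed_dim  # one 64-dimensional vector per vocabulary index
flatten_units = seq_len * embed_dim        # Flatten has no weights, it only reshapes
dense_params = flatten_units * 1 + 1       # one weight per input plus one bias
total_params = embedding_params + dense_params

print(embedding_params, flatten_units, dense_params, total_params)
```

About a third of the weights sit in the single dense unit, because the flattened input is so wide.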


In [30]:
history = model.fit(
    X_train, 
    y_train,
    validation_split=0.2,
    batch_size=32,
    epochs=100,
    callbacks=[tf.keras.callbacks.EarlyStopping(
        monitor='val_loss',
        patience=3,
        restore_best_weights=True
        )
    ]
)

Epoch 1/100
Epoch 2/100
Epoch 3/100
Epoch 4/100
Epoch 5/100
Epoch 6/100
Epoch 7/100
Epoch 8/100
Epoch 9/100
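
Training stopped after epoch 9 of a possible 100. With `patience=3`, that is consistent with the best validation loss occurring at epoch 6, although the per-epoch metrics are not shown here. The sketch below illustrates the stopping rule with made-up validation losses. `early_stopping_epochs` is a hypothetical helper, not the Keras callback.

```python
def early_stopping_epochs(val_losses, patience=3):
    """Return (epoch training stops at, epoch with the best validation loss)."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0  # improvement resets the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch                  # patience exhausted, stop here
    return len(val_losses), best_epoch

# Made-up losses for illustration only: improvement stops after epoch 6
losses = [0.40, 0.20, 0.12, 0.08, 0.06, 0.05, 0.055, 0.06, 0.07, 0.08]
stopped_at, best = early_stopping_epochs(losses, patience=3)
print(stopped_at, best)
```

With `restore_best_weights=True`, the model evaluated afterwards carries the weights from the best epoch rather than the last one.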


### Results

In [33]:
results = model.evaluate(X_test, y_test)

print("     Test Loss: {:.4f}".format(results[0]))
print("Test Accuracy: {:.2f}%".format(results[1]*100))
print("     Test AUC: {:.4f}".format(results[2]))

     Test Loss: 0.0254
Test Accuracy: 99.14%
     Test AUC: 0.9989


In [40]:
# Count misclassified test emails (np.int was removed from NumPy; use the built-in int)
(np.squeeze(np.array(model.predict(X_test) >= 0.5, dtype=int)) != y_test).sum()


15
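
The one-liner above thresholds the predicted probabilities at 0.5 and counts disagreements with the true labels. The same logic is spelled out below in plain Python on made-up values. `count_errors` is a hypothetical helper.

```python
def count_errors(probabilities, labels, threshold=0.5):
    """Count predictions that disagree with the labels after thresholding."""
    predictions = [1 if p >= threshold else 0 for p in probabilities]
    return sum(pred != label for pred, label in zip(predictions, labels))

probs = [0.97, 0.02, 0.61, 0.40]  # made-up model outputs
labels = [1, 0, 0, 1]             # made-up true labels

errors = count_errors(probs, labels)
accuracy = (len(labels) - errors) / len(labels)
print(errors, accuracy)
```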

In [38]:
len(y_test)

1739

In [41]:
round((len(y_test) - 15) / len(y_test), 4)

0.9914