Hi everyone!
<br>
I'm Aditya. This notebook presents an analysis of the data from <url>https://www.kaggle.com/datasets/mfaisalqureshi/spam-email</url>. For a detailed explanation, see <url>https://medium.com/@adityaadamf</url>.
<br>
Thank you!

---

# Install and import libraries

In [None]:
!pip install -q polars pyarrow plotly kaleido imbalanced-learn

In [None]:
import polars as pl
import pandas as pd
import numpy as np
import re, random, os, warnings
warnings.filterwarnings('ignore')
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import tensorflow as tf
random.seed(0)

In [None]:
pl.Config(fmt_str_lengths=1500)

# Dataset

In [None]:
df = pl.read_csv('/kaggle/input/spam-email/spam.csv')

In [None]:
print('dataset has {} rows and {} columns'.format(df.shape[0], df.shape[1]))

In [None]:
df.head()

# Visualization

Visualize the number of records for each spam label.

In [None]:
viz_spam_label = df.group_by('Category').count()
px.pie(viz_spam_label, title='(1) Proportion of Spam Label',
       values='count', names='Category').show(renderer='svg')

The pie chart above shows `76.1%` (4360) not spam and `23.9%` (1368) spam, which indicates that the data is imbalanced.

# Preprocessing
In this preprocessing step, I will:
1. Remove symbols `(such as !@#$%&*...etc)`
<br><br>
<i>For time efficiency, this notebook does not use spelling correction or stemming.</i>

In [None]:
def preprocess(texts):
    # remove symbol
    texts = re.sub(r'[\W_]+',' ',texts)
    return texts
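To make the effect of the regular expression concrete, here is a small standalone sketch. The function name `strip_symbols` and the sample sentence are made up for illustration; the pattern is the same one used in `preprocess`.

```python
import re

def strip_symbols(text: str) -> str:
    # Replace each run of non-word characters or underscores with one space
    return re.sub(r'[\W_]+', ' ', text)

# Hypothetical sample message, not taken from the dataset
sample = 'WINNER!! Claim your $1000 prize @ spam_site.com'
cleaned = strip_symbols(sample)
print(cleaned)  # WINNER Claim your 1000 prize spam site com
```

Letters and digits are kept and the case is unchanged; only symbols, underscores, and repeated whitespace collapse into single spaces.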

In [None]:
# Apply the cleaning function to every message; declare the output dtype explicitly
df = df.with_columns(pl.col('Message').map_elements(preprocess, return_dtype=pl.Utf8))

In [None]:
df.head()

# Transformation
In the transformation step, I build a vocabulary from the existing text data and fit a TensorFlow `TextVectorization` layer with `output_mode` set to `tf_idf`. The transformation model is then saved in the `tf` format.
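As a rough illustration of what a TF-IDF representation looks like, here is a minimal pure-Python sketch. It is not the Keras implementation: the helper `tfidf_matrix`, the toy documents, and the smoothed IDF formula `log(1 + N / (1 + df))` are assumptions chosen for the example.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Return (vocab, rows) where rows[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency (one common variant, assumed here)
    idf = {tok: math.log(1 + n_docs / (1 + doc_freq[tok])) for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term count in this document times the token's IDF weight
        rows.append([counts[tok] * idf[tok] for tok in vocab])
    return vocab, rows

# Hypothetical toy corpus
docs = ['free prize now', 'call me now', 'free free offer']
vocab, rows = tfidf_matrix(docs)
print(vocab)
print(rows[2])
```

Each message becomes a fixed-length numeric vector with one column per vocabulary token, which is the shape of input the dense network later expects.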

In [None]:
def transformers(X: list, is_train: bool=True, 
                 paths: dict={'model':'model/transformation', 
                              'vocab':'model/numpy'}, 
                 output: str='tf_idf', save_format: str='tf'):
    if is_train:
        # Create vocabulary data
        vocab_data = tf.data.Dataset.from_tensor_slices( 
            list( np.sort( np.unique(' '.join(X).split()) ) ) 
        )

        # TextVectorization
        tfidf_layer = tf.keras.layers.TextVectorization(
            standardize = 'lower_and_strip_punctuation',
            split = 'whitespace',
            max_tokens = len(vocab_data),
            output_mode = output
        )
        tfidf_layer.adapt(vocab_data.batch(64))

        # Save the vocabulary in numpy save compressed
        vocab_list = tfidf_layer.get_vocabulary()
        if not os.path.exists(paths['vocab']):
            os.makedirs(paths['vocab'])
        np.savez_compressed(paths['vocab']+'/vocab.npz', vocab=vocab_list)
        
        # Create transformation model
        model = tf.keras.models.Sequential()
        model.add(tf.keras.Input(shape=(1,), dtype=tf.string))
        model.add(tfidf_layer)
        
        # Save transformer model
        model.save(paths['model'])
        
        # Predict
        X_pred = model.predict(X, verbose=0)
    else:
        vocab_list = np.load(paths['vocab']+'/vocab.npz')['vocab']
        model = tf.keras.models.load_model(paths['model'], compile=False)
        
        X_pred = model.predict(X, verbose=0)
        
    return X_pred, vocab_list

In [None]:
X, vocab_list = transformers(df['Message'].to_list())

Next, I convert the target variable y to numeric form using LabelEncoder and then one-hot encode it.

In [None]:
def label_encoder(y: list, name: str, is_train: bool=True, paths: dict={'label':'model/numpy'}):
    lists = np.sort(np.unique(y)) if is_train else np.load(paths['label']+'/list_'+name+'.npy')
    if is_train:
        np.save(paths['label']+'/list_'+name+'.npy', lists)

    le = LabelEncoder()
    le_encode = le.fit(lists)
    le_encode = le_encode.transform(y)
    le_encode = tf.keras.utils.to_categorical(le_encode)
    
    return le_encode

In [None]:
y = label_encoder(df['Category'], df['Category'].name)
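The two encoding steps can be sketched without any library. The label values `'ham'` and `'spam'` below are assumed for illustration; the idea is the same for any set of class names.

```python
labels = ['ham', 'spam', 'ham', 'ham', 'spam']  # assumed label values

# Sorted unique classes, as LabelEncoder stores them in `classes_`
classes = sorted(set(labels))
class_to_index = {name: i for i, name in enumerate(classes)}

# Integer codes, then one-hot rows with a single 1.0 at the class index
codes = [class_to_index[name] for name in labels]
one_hot = [[1.0 if j == code else 0.0 for j in range(len(classes))]
           for code in codes]

# argmax recovers the integer code from a one-hot row
recovered = [row.index(max(row)) for row in one_hot]
print(codes)
print(one_hot[1])
```

This is why later cells call `argmax(1)` on `y_train` and `y_test`: it turns the one-hot rows back into integer class labels.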

# Split data
Before the next step, the <b>5572</b> records are split into three sets:
<br>
Following the `80:20` split rule<sup>[1]</sup>:
- `80%` <b>(4457)</b> Train: <i>this 80% is split again for model evaluation, `80%` <b>(3565)</b> for Train and `20%` <b>(892)</b> for Validation</i>
- `20%` <b>(1115)</b> Test.
<br>
Why split this way?
<br>
- Short answer: for better experiments :D
<br>
- Detailed answer: the `Train data` set is used to fit the model, the `Validation data` set provides an unbiased evaluation of the fitted model during development, and the `Test data` set provides an unbiased evaluation of the final model <sup>[2]</sup>
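The split sizes quoted above can be reproduced with a little arithmetic. This sketch assumes that the test share is rounded up and that the rest goes to training, which I believe matches how scikit-learn resolves fractional sizes; `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples: int, test_fraction: float = 0.2):
    # Assumption: the test size is rounded up, the remainder is the train size
    n_test = math.ceil(test_fraction * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

# First split: all records into train and test
n_train_full, n_test = split_sizes(5572)
# Second split: the train part into train and validation
n_train, n_valid = split_sizes(n_train_full)

print(n_train_full, n_test)  # 4457 1115
print(n_train, n_valid)      # 3565 892
```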

In [None]:
# Split into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.2, train_size = 0.8, 
                                                    stratify = y, random_state = 42)

print('size of train: {}, size of test: {}'.format(X_train.shape, X_test.shape))

In [None]:
# From Train, split again into Train and Valid
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train, 
                                                      test_size = 0.2, train_size = 0.8, 
                                                      stratify = y_train, random_state = 42)

print('size of train: {}, size of valid: {}'.format(X_train.shape, X_valid.shape))

# Resampling
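Why add a resampling step? The visualization output (img.1) shows that the spam labels are not equally represented, so the minority class needs to be oversampled until it matches the majority class. The method I use is SMOTE (Synthetic Minority Over-sampling Technique)<sup>[3]</sup>, which creates new minority samples by interpolating between existing ones instead of duplicating them at random. Resampling is applied only to the training split.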
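The core idea of SMOTE can be sketched in a few lines of plain Python. This is a simplified illustration, not the `imblearn` implementation: the helper `smote_like_sample`, the toy points, and the choice of `k=2` are assumptions.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point between a minority sample and one of its k nearest minority neighbours."""
    if rng is None:
        rng = random.Random(0)
    base = rng.choice(minority)
    # Sort the other minority points by squared Euclidean distance to the base point
    others = [p for p in minority if p is not base]
    others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
    neighbour = rng.choice(others[:k])
    lam = rng.random()  # interpolation factor in [0, 1)
    # The new point lies on the segment between base and neighbour
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]

# Hypothetical 2-D minority-class points
minority = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]]
rng = random.Random(42)
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Because every synthetic point sits between two real minority points, the new samples stay inside the region the minority class already occupies.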

In [None]:
res = SMOTE(random_state=42)
X_res, y_res = res.fit_resample(X_train, y_train.argmax(1))
print('Resampled dataset shape %s' % Counter(y_res))

In [None]:
y_res = tf.keras.utils.to_categorical(y_res)

# Modelling
In the modelling step, I use a simple NN (a fully connected network) as the training model.

In [None]:
# Detect hardware, return appropriate distribution strategy
try:
    # TPU detection. No parameters necessary if TPU_NAME environment variable is
    # set: this is always the case on Kaggle.
    tpu = tf.distribute.cluster_resolver.TPUClusterResolver()
    print('Running on TPU ', tpu.master())
except ValueError:
    tpu = None

if tpu:
    tf.config.experimental_connect_to_cluster(tpu)
    tf.tpu.experimental.initialize_tpu_system(tpu)
    strategy = tf.distribute.TPUStrategy(tpu)
else:
    # Default distribution strategy in Tensorflow. Works on CPU and single GPU.
    strategy = tf.distribute.get_strategy()

In [None]:
with strategy.scope():
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(units = 1024, activation = 'relu', input_dim = X_res.shape[1]))
    model.add(tf.keras.layers.Dense(units = 512, activation = 'relu'))
    model.add(tf.keras.layers.Dense(units = 256, activation = 'relu'))
    model.add(tf.keras.layers.Dense(units = y_res.shape[1], activation = 'softmax'))

model.summary()

In [None]:
# The last layer already applies softmax, so the loss receives probabilities, not logits
model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate=0.0001),
              metrics = ['accuracy'], loss = tf.keras.losses.CategoricalCrossentropy(from_logits=False))
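For intuition about the loss, here is a small sketch of softmax followed by categorical cross-entropy, written with the standard library only. The logit values are made up. It also shows why the `from_logits` setting should match whether the last layer already applies softmax: applying softmax twice changes the loss.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(y_true, probs, eps=1e-7):
    # y_true is one-hot; probs are expected to sum to 1
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, probs))

logits = [2.0, 0.5]            # hypothetical raw scores for (not spam, spam)
probs = softmax(logits)
loss = categorical_crossentropy([1.0, 0.0], probs)

# Applying softmax a second time flattens the distribution and inflates the loss
loss_double = categorical_crossentropy([1.0, 0.0], softmax(probs))
print(probs, loss, loss_double)
```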

In [None]:
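# Validate on the held-out validation split rather than on the tail of the SMOTE-resampled data
history = model.fit(X_res, y_res, batch_size = 10, 
                    epochs = 10, validation_data = (X_valid, y_valid), verbose=1,
                    callbacks=[tf.keras.callbacks.EarlyStopping(monitor = 'accuracy', min_delta = 0.0001)])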

In [None]:
valid_loss, valid_acc = model.evaluate(X_valid, y_valid)

print('Valid Loss:', valid_loss)
print('Valid Accuracy:', valid_acc)

In [None]:
# Create figure with secondary y-axis
fig_model = make_subplots(specs=[[{'secondary_y': True}]])

# Add traces
fig_model.add_trace(
    go.Scatter( y=history.history['val_loss'], name='val_loss'),
    secondary_y=False,
)

fig_model.add_trace(
    go.Scatter( y=history.history['loss'], name='loss'),
    secondary_y=False,
)

fig_model.add_trace(
    go.Scatter( y=history.history['val_accuracy'], name='val accuracy'),
    secondary_y=True,
)

fig_model.add_trace(
    go.Scatter( y=history.history['accuracy'], name='accuracy'),
    secondary_y=True,
)

# Add figure title
fig_model.update_layout(
    title_text='Loss/Accuracy of SimpleNN Model'
)

# Set x-axis title
fig_model.update_xaxes(title_text='Epoch')

# Set y-axes titles
fig_model.update_yaxes(title_text='<b>primary</b> Loss', secondary_y=False)
fig_model.update_yaxes(title_text='<b>secondary</b> Accuracy', secondary_y=True)

fig_model.show(renderer='svg')

In [None]:
# Save model as format `tf`
model.save('model/rnn')

# Evaluate
Now, evaluate the model by predicting on the `Test data`, comparing the predictions with the true labels, and showing the result with `classification_report`.

In [None]:
y_pred = model.predict(X_test)

print(classification_report(y_test.argmax(1), y_pred.argmax(1), target_names=['not spam','spam']))

According to the report, precision is 98% for each class, meaning that 98% of the predictions made for a class are correct. Recall is 92%, meaning that the model retrieves 92% of the actual members of a class.
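To show how these two metrics are derived, here is a minimal sketch that computes precision and recall for one class from raw predictions. The labels are hypothetical toy values, not results from this notebook, and `precision_recall` is an illustrative helper.

```python
def precision_recall(y_true, y_pred, positive):
    # Count true positives, false positives, and false negatives for the chosen class
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical toy labels (1 = spam, 0 = not spam)
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
precision, recall = precision_recall(y_true, y_pred, positive=1)
print(precision, recall)  # 0.75 0.75
```

Precision answers "of the messages flagged as spam, how many really were spam?", while recall answers "of the real spam messages, how many were caught?".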

# Conclusion
This notebook builds a spam email detector with SMOTE resampling and a simple NN model. The model reaches a predictive performance of 98% on validation and 98% on testing. With SMOTE, the model predicts each class better, and the results show no sign of overfitting or underfitting.
<br>
Overall, the model predicts well :)
<br>
<br>
<b>Thank you for reading my notebook!</b>

# References
1. Why 70/30 or 80/20 Relation Between Training and Testing Sets: A Pedagogical Explanation <i>(https://www.cs.utep.edu/vladik/2018/tr18-09.pdf)</i>
2. How to split data into three sets (train, validation, and test) And why? <i>(https://medium.com/towards-data-science/how-to-split-data-into-three-sets-train-validation-and-test-and-why-e50d22d3e54c)</i>
3. Effective Class-Imbalance Learning Based on SMOTE and Convolutional Neural Networks <i>(https://www.mdpi.com/2076-3417/13/6/4006)</i>