<h2 align="center">BERT_Email_Classification</h2>

In this tutorial, I build a spam detection model that classifies emails as spam or not spam, so that unwanted and unsolicited emails can be filtered out. The model is built with BERT and TensorFlow.

BERT generates a sentence encoding for each email, and TensorFlow is used to build the neural network on top of it, that is, the input and output layers of the machine learning model.

## Importing important packages

tensorflow_text: It allows us to work with text, since this tutorial solves a text-classification problem. Importing it also registers the custom TensorFlow ops that the BERT preprocessing model relies on.

In [8]:
#!pip install tensorflow-text==2.8.1

In [33]:
!pip install gradio


TensorFlow: It is the machine learning package used to build the neural network. It creates the input and output layers of my machine learning model.

TensorFlow Hub: It hosts pre-trained machine learning models. The pre-trained model used here is BERT. I will reuse the BERT model as it is and train a small classification head on top of it to meet my needs.

TensorFlow Text: It allows us to work with text.

Pandas: We will use Pandas to load our dataset. I will also use Pandas for data manipulation and analysis. It gives me a clear overview of how my dataset is structured.

In [34]:
# Importing required libraries
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
import numpy as np
import gradio as gr

In [10]:
# Read the dataset
df = pd.read_csv("/content/spam.csv")

In [11]:
# Display top 5 rows of the dataset
print(df.head())

  Category                                            Message
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...


The dataset has two categories: ham and spam. "Ham" represents legitimate emails that are not spam, while "Spam" represents unwanted, unsolicited emails.

The dataset also includes the Message column, which represents the email messages. Let's examine the individual value count for the spam and ham emails.

In [12]:
# Check the count of categories
category_counts = df['Category'].value_counts()
print("Category Counts:\n", category_counts)

Category Counts:
 ham     4825
spam     747
Name: Category, dtype: int64


The dataset has 4,825 ham emails and 747 spam emails. The number of ham emails is significantly higher.

In [13]:
# Calculate the ratio of spam to ham
spam_ratio = (category_counts['spam'] / (category_counts['spam'] + category_counts['ham'])) * 100
print("Spam Ratio: {:.2f}%".format(spam_ratio))

Spam Ratio: 13.41%


This result implies that about 13% of the emails are spam, while 87% are ham emails. This indicates a class imbalance, and I need to balance the two classes to reduce bias during model training.
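
To see why the imbalance matters, consider a baseline that labels every email as ham. The sketch below uses only the counts reported above; the always-ham baseline is hypothetical and is not part of the model built in this tutorial.

```python
# Counts taken from the value_counts() output above.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# Hypothetical baseline: predict "ham" for every email.
baseline_accuracy = ham_count / total   # every ham email counts as correct
baseline_spam_recall = 0 / spam_count   # no spam email is ever caught

print(f"Baseline accuracy: {baseline_accuracy:.2%}")
print(f"Baseline spam recall: {baseline_spam_recall:.2%}")
```

Such a baseline looks accurate while catching no spam at all, which is why the classes are balanced before training.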

## Balancing dataset
There are various techniques used to balance a dataset. I will employ the simplest approach, downsampling: randomly reducing the majority class (ham) from 4,825 to 747 samples, so that both classes are the same size.

In [14]:
#checking the shape of spam
df_spam = df[df['Category'] == 'spam']
#print shape
df_spam.shape

(747, 2)

In [15]:
#checking the shape of ham
df_ham = df[df['Category'] == 'ham']
#print shape
df_ham.shape

(4825, 2)

Now that I have created the two data frames, I will reduce the number of instances in the ham class to match that of the spam class.

In [16]:
# Downsample the ham messages to balance the dataset
df_ham_downsampled = df_ham.sample(df_spam.shape[0])

#print the shape
df_ham_downsampled.shape

(747, 2)

The downsampled ham class is saved in a variable called df_ham_downsampled. Next, I need to concatenate the two balanced classes into a single data frame.

In [17]:
# Concatenate the downsampled ham and spam messages
df_balanced = pd.concat([df_ham_downsampled, df_spam])

#print the shape
df_balanced.shape

(1494, 2)

The pd.concat function concatenates df_ham_downsampled and df_spam into a single DataFrame, which is saved in a variable called df_balanced.
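
The downsample-and-concatenate steps can be sketched on a small toy DataFrame. This is a minimal sketch, not the tutorial's exact code: the toy messages are made up, and the `random_state` argument and the final shuffle are assumptions added so that the result is repeatable and the two classes are interleaved.

```python
import pandas as pd

# Toy stand-in for the real dataset: 6 ham rows and 2 spam rows (made-up messages).
toy = pd.DataFrame({
    "Category": ["ham"] * 6 + ["spam"] * 2,
    "Message": [f"message {i}" for i in range(8)],
})

toy_spam = toy[toy["Category"] == "spam"]
toy_ham = toy[toy["Category"] == "ham"]

# random_state (an assumption, not in the original code) fixes the random draw,
# so the same ham rows are selected on every run.
toy_ham_downsampled = toy_ham.sample(n=len(toy_spam), random_state=42)

# Concatenate, then shuffle the rows so the classes are interleaved.
toy_balanced = (
    pd.concat([toy_ham_downsampled, toy_spam])
    .sample(frac=1, random_state=42)
    .reset_index(drop=True)
)

print(toy_balanced["Category"].value_counts())
```

Without a fixed seed, a different subset of ham messages is drawn on each run, so the reported metrics can vary slightly between runs.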

In [18]:
# Check the balanced count of categories
balanced_category_counts = df_balanced['Category'].value_counts()
print("Balanced Category Counts:\n", balanced_category_counts)

Balanced Category Counts:
 ham     747
spam    747
Name: Category, dtype: int64


## Adding labels
I need to label our dataset as 1 and 0. '1' will represent the data samples belonging to the spam class, while '0' will represent those belonging to the ham class.

I will use lambda to write the logic, and then the apply method will execute this logic, enabling us to label the dataset.

In [19]:
# Create a binary label for spam (1) and ham (0)
df_balanced['spam'] = df_balanced['Category'].apply(lambda x: 1 if x == 'spam' else 0)

df_balanced.head()

      Category                                            Message  spam
2922       ham    Yo, any way we could pick something up tonight?     0
1037       ham                 No my blankets are sufficient, thx     0
5257       ham     As usual..iam fine, happy &amp; doing well..:)     0
4554       ham  Sun ah... Thk mayb can if dun have anythin on....     0
2754       ham  Derp. Which is worse, a dude who always wants ...     0


The dataset is labeled into two categories: some data samples are labeled as 1, while others are labeled as 0. Now, I need to split the labeled dataset.
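
The same labels can also be produced without a lambda. The sketch below compares the apply approach with a vectorized comparison on a made-up toy column; the column names `spam_apply` and `spam_vectorized` are hypothetical and used only for this illustration.

```python
import pandas as pd

toy = pd.DataFrame({"Category": ["ham", "spam", "ham", "spam"]})

# Option 1: the lambda/apply approach used in this tutorial.
toy["spam_apply"] = toy["Category"].apply(lambda x: 1 if x == "spam" else 0)

# Option 2: a vectorized comparison, cast from bool to int.
toy["spam_vectorized"] = (toy["Category"] == "spam").astype(int)

# Both columns hold identical labels.
print(toy)
print((toy["spam_apply"] == toy["spam_vectorized"]).all())
```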

## Splitting the dataset into training and test sets

I split the dataset into two sets: the first set will be used for training, and the second set will be used for testing.

In [20]:
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df_balanced["Message"], df_balanced['spam'], stratify=df_balanced['spam'])

In [21]:
#print the head of X_train
X_train.head()

635     Dear Voucher Holder, 2 claim this weeks offer,...
27      Did you catch the bus ? Are you frying an egg ...
2562                              And maybe some pressies
2899          If you r @ home then come down within 5 min
4436    Don't b floppy... b snappy & happy! Only gay c...
Name: Message, dtype: object

I use stratify to keep the class proportions the same in the training and test sets. This ensures that each set contains equal shares of spam and ham emails after splitting. After splitting the dataset, we can start working with BERT.
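
The effect of stratify can be checked on stand-in data with the same class sizes as the balanced dataset. In this sketch the messages are placeholders, and `random_state=0` is an assumption added only for repeatability; the test size is left at scikit-learn's default of 25%, as in the call above.

```python
from sklearn.model_selection import train_test_split

# Stand-in data with the same class sizes as df_balanced: 747 ham (0) and 747 spam (1).
labels = [0] * 747 + [1] * 747
messages = [f"message {i}" for i in range(len(labels))]

# random_state is an assumption added here for repeatability; test_size is left
# at scikit-learn's default of 0.25, as in the tutorial's call.
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, labels, stratify=labels, random_state=0
)

print("train size:", len(y_tr), "spam share:", sum(y_tr) / len(y_tr))
print("test size: ", len(y_te), "spam share:", sum(y_te) / len(y_te))
```

Both splits keep a 50/50 class mix, which matches the 187 ham and 187 spam test samples reported later in the classification report.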

## Getting started with BERT
BERT stands for Bidirectional Encoder Representations from Transformers. BERT models help machines understand and interpret the meaning of text by using the words on both sides of each word (the text before and after it) to grasp the context, and by checking the relationships between the words within a sentence to determine their actual meaning.

BERT converts a given sentence into an embedding vector, a fixed-length list of numbers that represents the meaning of the sentence. Sentences with similar meanings end up with similar vectors.

Since machine learning operates effectively with numbers rather than text, BERT converts input text into embedding vectors, facilitating model processing.

The BERT process comprises two stages: Preprocessing and Encoding.

Preprocessing is the initial stage in BERT, where raw text is converted into the numeric inputs that the encoder expects: the text is split into tokens, each token is mapped to an ID, and every sequence is padded or truncated to a fixed length (128 tokens for the model used here). The result is three tensors: input_word_ids, input_mask, and input_type_ids.

Encoding, the subsequent stage, involves converting the preprocessed tokens into vectors of real numbers, which is crucial since machine learning algorithms work with numerical data. BERT accomplishes this by converting sentences into embedding vectors.
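
To give a feel for what the preprocessing stage hands to the encoder, here is a toy illustration of the three fixed-length inputs that appear in the model summary later on. It is only a sketch under loud assumptions: the whitespace tokenizer and the mini-vocabulary are hypothetical (the real preprocessor uses a WordPiece vocabulary), and `toy_preprocess` is not a TensorFlow function.

```python
SEQ_LENGTH = 128          # sequence length shown in the model summary
PAD_ID, CLS_ID, SEP_ID = 0, 101, 102
UNKNOWN_ID = 100

# Hypothetical mini-vocabulary; the real preprocessor uses a WordPiece vocabulary.
toy_vocab = {"free": 1001, "entry": 1002, "to": 1003, "win": 1004}

def toy_preprocess(sentence):
    """Mimic the shape of the preprocessor's output for one sentence."""
    tokens = sentence.lower().split()[: SEQ_LENGTH - 2]   # leave room for [CLS]/[SEP]
    ids = [CLS_ID] + [toy_vocab.get(t, UNKNOWN_ID) for t in tokens] + [SEP_ID]
    padding = SEQ_LENGTH - len(ids)
    return {
        "input_word_ids": ids + [PAD_ID] * padding,
        "input_mask": [1] * len(ids) + [0] * padding,      # 1 = real token, 0 = padding
        "input_type_ids": [0] * SEQ_LENGTH,                # single-segment input
    }

encoded = toy_preprocess("Free entry to win")
print(encoded["input_word_ids"][:8])
print(sum(encoded["input_mask"]), "real tokens out of", len(encoded["input_mask"]))
```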

## Downloading the BERT model
BERT models are typically pre-trained and available on TensorFlow Hub, a repository of pre-trained machine learning models that can be downloaded.

I will download two models: one for performing preprocessing and the other for encoding. The links for the models are provided below.

For bert_preprocess:<br>
https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3

For bert_encoder:<br>
https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4


In [22]:
# Download the pre-trained BERT models with hub.KerasLayer
bert_preprocess = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3')
bert_encoder = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4')

## Building model using TensorFlow
There are two types of models that one can build in TensorFlow: the Sequential model and the Functional model. In a Sequential model, layers are stacked on top of each other, one layer at a time. However, a Sequential model cannot have multiple inputs or outputs.

On the other hand, Functional models are more robust and flexible. They do not necessarily create layers in a strictly sequential order. Instead, in the Functional model, there can be multiple inputs and outputs. I will use the Functional approach to build the model, starting by initializing the BERT layers.

<h4>Build Model</h4>

The input layer is created using tf.keras.layers.Input. It accepts raw strings (text_input), which the bert_preprocess layer converts into preprocessed_text.

The bert_encoder layer then converts the preprocessed text into embedding vectors. These vectors serve as the output of the BERT layers and are fed into the neural network layers.

The neural network comprises two layers: the Dropout layer and the Dense layer.

Dropout Layer:
This layer is used to prevent model overfitting, which occurs when a model learns the training data too closely but performs poorly during testing. I will set the dropout rate to 0.1, meaning that 10% of the inputs to this layer are randomly set to zero during training.

Since I am using the functional approach to build the model, each layer is called like a function on its input. The input to the Dropout layer is outputs['pooled_output'], which is the output of the BERT layers.
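
What a dropout rate of 0.1 does during training can be sketched with NumPy. This is an illustration of the general "inverted dropout" idea rather than the Keras source code; the seed and the all-ones input are arbitrary choices made for the example. At prediction time the Keras Dropout layer passes its input through unchanged.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily for repeatability
rate = 0.1                            # same rate as in the tutorial

# Stand-in for a batch of pooled BERT outputs: 4 samples x 768 features of ones.
activations = np.ones((4, 768))

# Keep each unit with probability 1 - rate, then rescale the survivors so the
# expected value of each unit is unchanged.
keep_mask = rng.random(activations.shape) >= rate
dropped = activations * keep_mask / (1.0 - rate)

print("fraction zeroed:", 1.0 - keep_mask.mean())
print("mean after dropout:", dropped.mean())
```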

Dense Layer:
This layer contains only one neuron. I will initialize the activation function as sigmoid. Sigmoid is suitable when the output values need to be between 0 and 1. In this case, during predictions, the probability of prediction will range from 0 to 1, making sigmoid the most appropriate choice.

The model will take text_input as inputs and will produce only one output. I will display the model summary to visualize all the input and output layers used.

In [23]:
# Define input and output layers for BERT
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text_input')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Add neural network layers
dropout = tf.keras.layers.Dropout(0.1, name="dropout")(outputs['pooled_output'])
output = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(dropout)

# Construct the model
model = tf.keras.Model(inputs=text_input, outputs=output)

Printing the model summary.

In [24]:
#print the summary
model.summary()

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 text_input (InputLayer)        [(None,)]            0           []                               
                                                                                                  
 keras_layer (KerasLayer)       {'input_word_ids':   0           ['text_input[0][0]']             
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128),                                                          
                                 'input_type_ids':                                                
                                (None, 128)}                                                  

I have initialized all the input and output layers for our model. The output also displays the total params, trainable params, and non-trainable params.

Total params: This represents all the parameters in the model.

Trainable params: These represent the parameters that I will train.

Non-trainable params: These parameters are from the BERT model, and they are already trained. They stay frozen during training because hub.KerasLayer is not trainable by default.
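
The number of trainable parameters can be worked out by hand. The sketch below assumes the BERT layers are left frozen, so that only the Dense output layer is trained; the 768 comes from the "H-768" in the encoder's name.

```python
# Size of BERT's pooled_output, taken from the "H-768" in the encoder's name.
pooled_output_size = 768
dense_units = 1

# A Dense layer has one weight per (input, unit) pair plus one bias per unit.
dense_weights = pooled_output_size * dense_units
dense_biases = dense_units
trainable_params = dense_weights + dense_biases

# The Dropout layer has no parameters, so this is the whole trainable total
# as long as the BERT layers are left frozen.
print("Trainable params:", trainable_params)
```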

In [25]:
#print the len
len(X_train)

1120

The optimizer updates the model's weights during training in order to reduce the loss. I use the Adam optimizer.

Metrics are used to check the model's performance so that I can assess how well the model was trained. I set the BinaryAccuracy(name='accuracy') metric to calculate the accuracy score of the model, and I also track Precision and Recall.

The loss function is used to calculate the model error during the training phase. I use binary_crossentropy as my loss function because the output is binary; it can either be a 0 or 1.
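
For intuition, binary cross-entropy can be written out in a few lines of plain Python. This is a minimal sketch of the formula, not the Keras implementation; the labels and probabilities are made up, and the clipping constant `eps` is an assumption used to avoid taking the log of zero.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a list of labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Made-up labels and predictions for illustration.
confident_right = binary_crossentropy([1, 0], [0.9, 0.1])
confident_wrong = binary_crossentropy([1, 0], [0.1, 0.9])
print(confident_right, confident_wrong)
```

The loss is small when the predicted probability agrees with the label and grows quickly when a confident prediction is wrong.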

In [26]:
# Define evaluation metrics
METRICS = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall')
]

# Compile the model
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=METRICS
)

<h4>Train the model</h4>

The model learns from the training data samples, identifying patterns within the dataset to gain knowledge.

I will specify the number of epochs as 10. The model will iterate through the training dataset ten times and print the loss and the metrics (accuracy, precision, and recall) after each iteration.
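
Each epoch is processed in batches. The arithmetic below is a small sketch that assumes the batch size is left at the Keras default of 32, since no batch size is passed to fit in this tutorial.

```python
import math

num_training_samples = 1120   # len(X_train) reported above
batch_size = 32               # Keras default when fit() is called without batch_size
epochs = 10

steps_per_epoch = math.ceil(num_training_samples / batch_size)
total_weight_updates = steps_per_epoch * epochs

print("steps per epoch:", steps_per_epoch)
print("weight updates over all epochs:", total_weight_updates)
```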

In [27]:
# Train the model
model.fit(X_train, y_train, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7eff2f8d0130>

In [28]:
# Evaluate the model
model.evaluate(X_test, y_test)



[0.2719302177429199,
 0.9090909361839294,
 0.9090909361839294,
 0.9090909361839294]
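
The list returned by model.evaluate is easier to read with names attached. The sketch below hard-codes the values printed above; the name order is an assumption based on how the model was compiled (the loss first, then the metrics in the order they were listed), and in the notebook `model.metrics_names` should report the same names.

```python
# Values copied from the model.evaluate output above.
results = [0.2719302177429199, 0.9090909361839294,
           0.9090909361839294, 0.9090909361839294]

# Order assumed here: the loss first, then the metrics in the order they were
# passed to model.compile (accuracy, precision, recall).
metric_names = ["loss", "accuracy", "precision", "recall"]

named_results = dict(zip(metric_names, results))
for name, value in named_results.items():
    print(f"{name:>9}: {value:.4f}")
```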

## Evaluating model using the testing dataset

The model.predict method returns the predictions as a 2D array with one row per message and a single column, but I need them as a 1D array. To convert the 2D array into a 1D array, I use the y_predicted.flatten() method.

In [29]:
# Predict on the test set and convert the probabilities to 0/1 labels
y_predicted = model.predict(X_test)
y_predicted = y_predicted.flatten()
y_predicted = np.where(y_predicted > 0.5, 1, 0)
print("Predicted Labels:\n", y_predicted)

Predicted Labels:
 [0 0 1 1 0 1 0 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 0 1 1
 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 0 1 0 0 0 1 1 0 1 0 0 1 1 0
 1 0 0 0 0 0 0 1 0 1 0 0 1 1 1 1 0 1 1 0 1 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1
 1 0 1 0 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 0 1 0 1 0 0 1 0 1 0 0 1 1 0 1 0 0
 0 1 1 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 1 1 1 1 1 1 0 1 1 1 1 0 0 1 0 0 0 0 0
 1 1 0 0 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1
 1 0 0 0 1 1 0 0 0 0 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 0 0 1 1 1 0 0 0 0 0
 1 0 1 0 0 0 0 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0 0 0 1 0 1 1 1 1 1 1 0 0 1
 1 1 0 1 1 1 0 0 0 1 1 0 1 0 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 0
 0 1 0 0 0 0 1 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 0 1 1 1 1 1 0 0 0 1 0 1 0
 1 0 0 0]


Since I used a sigmoid activation function, the prediction probabilities lie between 0.0 and 1.0. Therefore, if the prediction result is greater than 0.5, the output is 1 (spam); otherwise, the output is 0 (ham).
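
The flatten-and-threshold step can be tried in isolation. The probabilities below are made up and only shaped like the output of model.predict; note that a value of exactly 0.5 falls into class 0 with this rule.

```python
import numpy as np

# Made-up probabilities shaped like model.predict output: one row per message.
probabilities = np.array([[0.92], [0.50], [0.07], [0.51]])

flat = probabilities.flatten()          # 2D (4, 1) -> 1D (4,)
labels = np.where(flat > 0.5, 1, 0)     # strictly greater than 0.5 -> 1, else 0

print(flat.shape, labels)
```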

In [30]:
# Calculate and display confusion matrix
cm = confusion_matrix(y_test, y_predicted)
print("Confusion Matrix:\n", cm)

Confusion Matrix:
 [[170  17]
 [ 17 170]]


In [31]:
# Print classification report
print("Classification Report:\n", classification_report(y_test, y_predicted))

Classification Report:
               precision    recall  f1-score   support

           0       0.91      0.91      0.91       187
           1       0.91      0.91      0.91       187

    accuracy                           0.91       374
   macro avg       0.91      0.91      0.91       374
weighted avg       0.91      0.91      0.91       374
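
The figures in the classification report can be reproduced by hand from the confusion matrix. The sketch below uses the four counts printed above, reading the rows as actual classes and the columns as predicted classes.

```python
# Confusion matrix from above: rows are actual classes, columns are predictions.
#            predicted ham  predicted spam
tn, fp = 170, 17    # actual ham
fn, tp = 17, 170    # actual spam

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)          # of the emails flagged as spam, how many were spam
recall = tp / (tp + fn)             # of the actual spam emails, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```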



## Try your inputs

You can change the inputs as you like:

In [36]:
# Sample messages for prediction
reviews = [
    'Enter a chance to win $5000, hurry up, offer valid until march 31, 2021',
    'You are awarded a SiPix Digital Camera! call 09061221061 from landline. Delivery within 28days. T Cs Box177. M221BP. 2yr warranty. 150ppm. 16 . p p£3.99',
    'it to 80488. Your 500 free text messages are valid until 31 December 2005.',
    'Hey Sam, Are you coming for a cricket game tomorrow',
    "Why don't you wait 'til at least wednesday to see if you get your."
]

# Predict spam probabilities for the sample messages
predicted_reviews = model.predict(reviews)
print("Predicted Probabilities for Sample Reviews:\n", predicted_reviews)

# Convert predicted probabilities to 'ham' or 'spam'
predicted_labels = ["ham" if pred < 0.5 else "spam" for pred in predicted_reviews.flatten()]
print("Predicted Labels for Sample Reviews:\n",predicted_labels)

Predicted Probabilities for Sample Reviews:
 [[0.66333693]
 [0.7872124 ]
 [0.71563506]
 [0.1036278 ]
 [0.06791556]]
Predicted Labels for Sample Reviews:
 ['spam', 'spam', 'spam', 'ham', 'ham']


From the output above, the first three email messages have been classified as spam, as they have a prediction probability greater than 0.5. The last two email messages have been classified as ham, with a prediction probability less than 0.5. These are the correct predictions and demonstrate that we have successfully built our text classification model.

In [38]:
# Function to predict the label for a single input message
def predict_spam_or_ham(message):
    # Predict the spam probability for the input message
    # (the parameter is not named `text`, to avoid shadowing the tensorflow_text import)
    predicted_reviews = model.predict([message])
    # Convert the predicted probability to 'ham' or 'spam'
    predicted_label = "ham" if predicted_reviews.flatten()[0] < 0.5 else "spam"
    return predicted_label

# Create a Gradio interface
iface = gr.Interface(fn=predict_spam_or_ham, inputs="text", outputs="text", title="Spam or Ham Mail Detector")
iface.launch()

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://50cbcac7ca0f1c7c04.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


