In [3]:

# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES
# TO THE CORRECT LOCATION (/kaggle/input) IN YOUR NOTEBOOK,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.

import os
import sys
from tempfile import NamedTemporaryFile
from urllib.request import urlopen
from urllib.parse import unquote, urlparse
from urllib.error import HTTPError
from zipfile import ZipFile
import tarfile
import shutil

CHUNK_SIZE = 40960
DATA_SOURCE_MAPPING = 'spam-text-message-classification:https%3A%2F%2Fstorage.googleapis.com%2Fkaggle-data-sets%2F2050%2F3494%2Fbundle%2Farchive.zip%3FX-Goog-Algorithm%3DGOOG4-RSA-SHA256%26X-Goog-Credential%3Dgcp-kaggle-com%2540kaggle-161607.iam.gserviceaccount.com%252F20240409%252Fauto%252Fstorage%252Fgoog4_request%26X-Goog-Date%3D20240409T105908Z%26X-Goog-Expires%3D259200%26X-Goog-SignedHeaders%3Dhost%26X-Goog-Signature%3Db3348b8b3b578a712f52615e00a7de2868390ea9635290bb60089158bfa4d05c970ee5a3d4b81fbf05d9e2ada3376efcf1589438173ef63c1a35962a730f039b30c06aec5d2118beb4ced9992a9fe3c8e800e8b8e800ebf07141526070f7825841e180834ba2c03e87d68a95d8fadb8aed22f47518518b369824807716b2ce77f72f69da77657786fbd957995b50ac09174a810f6b356bce95ed6124bf3ed83248feb9b4a015129e44dbd2e608a2e8470a30443bacf0dfd234897546b434202afb59a405a8cd04324467f386a22dc6ebc293985e9f9734b1bd220b9410041dfe8d30fd8ab6c98e78052ff9f77cae67d292a3798a5b943205961e0d564713ae2b'

KAGGLE_INPUT_PATH='/kaggle/input'
KAGGLE_WORKING_PATH='/kaggle/working'
KAGGLE_SYMLINK='kaggle'

!umount /kaggle/input/ 2> /dev/null
shutil.rmtree('/kaggle/input', ignore_errors=True)
os.makedirs(KAGGLE_INPUT_PATH, 0o777, exist_ok=True)
os.makedirs(KAGGLE_WORKING_PATH, 0o777, exist_ok=True)

try:
  os.symlink(KAGGLE_INPUT_PATH, os.path.join("..", 'input'), target_is_directory=True)
except FileExistsError:
  pass
try:
  os.symlink(KAGGLE_WORKING_PATH, os.path.join("..", 'working'), target_is_directory=True)
except FileExistsError:
  pass

for data_source_mapping in DATA_SOURCE_MAPPING.split(','):
    directory, download_url_encoded = data_source_mapping.split(':')
    download_url = unquote(download_url_encoded)
    filename = urlparse(download_url).path
    destination_path = os.path.join(KAGGLE_INPUT_PATH, directory)
    try:
        with urlopen(download_url) as fileres, NamedTemporaryFile() as tfile:
            total_length = fileres.headers['content-length']
            print(f'Downloading {directory}, {total_length} bytes compressed')
            dl = 0
            data = fileres.read(CHUNK_SIZE)
            while len(data) > 0:
                dl += len(data)
                tfile.write(data)
                done = int(50 * dl / int(total_length))
                sys.stdout.write(f"\r[{'=' * done}{' ' * (50-done)}] {dl} bytes downloaded")
                sys.stdout.flush()
                data = fileres.read(CHUNK_SIZE)
            if filename.endswith('.zip'):
              with ZipFile(tfile) as zfile:
                zfile.extractall(destination_path)
            else:
              with tarfile.open(tfile.name) as tar:
                tar.extractall(destination_path)
            print(f'\nDownloaded and uncompressed: {directory}')
    except HTTPError as e:
        print(f'Failed to load (likely expired) {download_url} to path {destination_path}')
        continue
    except OSError as e:
        print(f'Failed to load {download_url} to path {destination_path}')
        continue

print('Data source import complete.')


The system cannot find the path specified.


OSError: [WinError 1314] A required privilege is not held by the client: '/kaggle/input' -> '..\\input'

# **Introduction**
---

The **"Spam Text Message Classification"** dataset is a **collection of text messages** that have been **labeled** as either **spam or ham (not spam).** The dataset contains **5,572** messages in total, with **4,825 messages** labeled **as ham** and **747** messages **labeled as spam**.

**Each message** in the **dataset** is represented as a **string** and has a **corresponding label** indicating whether it is **spam or ham**. The dataset was created by **Team AI** and was **last updated** in **October 2020**.

The purpose of the **dataset** is to provide a resource for the **development** and **evaluation** of **text classification models** for **spam detection**. This is a **common task** in **natural language processing**, as **spam messages** can be a **significant problem** for **individuals and organizations alike**.

**Problem at hand**
---

---

The **problem at hand** is **spam detection** in **text messages**. **Spam messages** are **unsolicited messages** that are sent to a **large number of people**, often with the **intention of advertising** a **product or service** or of **committing fraud**. **Spam messages** can be a **nuisance** and can **even be dangerous**, as they may **contain links** to **malicious websites or phishing scams**.

The **goal of spam detection** is to develop models that can **accurately classify incoming text messages** as either **spam or legitimate (also known as "ham")**. This is a **challenging problem** because **spammers are constantly evolving** their **tactics to avoid detection**, and the **content of spam messages can be highly variable**.

# **Notebook Structure**
---

* **SetUp** : This section involves importing all the necessary modules required for the execution of the program. We also define the hyperparameters and constants that are required for the successful implementation of the program.

* **Data Loading** : In this step, we load the data into our program, which is a crucial first step towards working with and analyzing the data.

* **Text Preprocessing** : Currently, the data is in RAW format, so we focus on preprocessing the text so that it can be fed to the model. This section involves text vectorization, including steps such as tokenization, cleaning, and padding.

* **Transformer Network Architecture** : This section focuses on creating the layers required for the transformer network architecture: the token embedding, the positional embedding, and the transformer layer itself.

* **Transformer Model Training** : In this section, we train the model using the transformer network architecture created in the previous section. We also evaluate the model's performance on the testing data to understand how well the model is performing. Looking at the training curve, you might think that the model is diverging. But once we evaluate the model's performance on the testing data, we see that the model is performing great and able to generalize on new samples. This shows that the model weights are robust and will work on new samples. Although the architecture is simple, we are still achieving excellent performance.

* **Transformer Model Predictions** : In this section, we create a function that allows us to input text, and the transformer will be able to recognize whether the input is spam or ham. This section focuses on using the model to make predictions on new data.

By following this structure, we can build a comprehensive and organized notebook that is easy to follow and understand, making it easier to implement and modify the model as required.

# **SetUp**
---

Here, we are **importing all the necessary modules** required for the **execution of the program**. In addition to that, we are also **defining the hyperparameters and constants** that are required for the **successful implementation of the program**.

In [1]:
# Common imports
import numpy as np
import tensorflow as tf
from tensorflow import keras

# Data processing and visualization imports
import string
import pandas as pd
import plotly.express as px
import tensorflow.data as tfd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Model building imports
from sklearn.utils import class_weight
from tensorflow.keras import callbacks
from tensorflow.keras import Model, layers


ModuleNotFoundError: No module named 'tensorflow'

In [None]:
# Define hyperparameters
num_heads = 4
embed_dim = 256
ff_dim = 128
vocab_size = 10000
max_seq_len = 40

# Set constants
learning_rate = 1e-3
epochs = 100
batch_size = 32

# Define training callbacks
callbacks = [
    keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True),
    keras.callbacks.ModelCheckpoint("SpamDetector.h5", save_best_only=True)
]

In [None]:
# Set up random seed for reproducibility
random_seed = 123
np.random.seed(random_seed)
tf.random.set_seed(random_seed)

# **Data Loading**

---
In this step, we begin the **process of loading the data** into our program, which is a **crucial first step** towards being able to **work with and analyze the data**. Without **properly loading the data**, we **would not be able to perform any further data processing or analysis**.

In [None]:
# Specify the path to the SPAM text message dataset
# (raw string, so the backslashes in the Windows path are not treated as escape sequences)
data_path = r'C:\Users\VISHESH JAIN\OneDrive\Desktop\sms spam\SPAM text message 20170820 - Data.csv'

# Load the dataset with pandas
data_frame = pd.read_csv(data_path)

# Print the first five rows of the dataset
data_frame.head()

| | Category | Message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


Let's take a closer look at the data.

In [None]:
# Get the counts of each class and their names
class_dis = data_frame.Category.value_counts()
class_names = class_dis.index

# Create the Pie Chart
fig = px.pie(names=class_names,
             values=class_dis,
             color=class_names,
             hole=0.4,
             labels={'value': 'Count', 'names': 'Class'},
             title='Class Distribution of Spam Text Messages')

# Customize the layout
fig.update_layout(
    margin=dict(l=10, r=10, t=60, b=10),
    legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
)

# Show the plot
fig.show()


After analyzing the **data set**, it is **evident** that the **class distribution** is **highly skewed**. Specifically, **86.6% of the data points** belong to the **"ham"** class, while **only 13.4% of the data points** belong to the **"spam" class**. This **substantial class imbalance** means that a **trivial classifier that always predicts "ham"** would be **correct approximately 86.6%** of the **time** while **never detecting a single spam message**, so **accuracy alone is a misleading metric** here. We will deal with the problem of **Class Imbalance** in the next section.
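To make the effect of the imbalance concrete, here is a minimal sketch of the "always predict the majority class" baseline. The class counts are an assumption derived from the 5,572 rows and the 86.6% / 13.4% split shown in the pie chart; they are not read from the data frame.

```python
# Assumed class counts (derived from 5,572 rows and the 86.6% / 13.4% split).
counts = {"ham": 4825, "spam": 747}
total = sum(counts.values())

# A "classifier" that ignores its input and always predicts the most frequent class.
majority_class = max(counts, key=counts.get)
baseline_accuracy = counts[majority_class] / total

# Spam recall of that baseline: the fraction of spam messages it labels as spam.
spam_hits = counts["spam"] if majority_class == "spam" else 0
spam_recall = spam_hits / counts["spam"]

print(f"Majority class    : {majority_class}")
print(f"Baseline accuracy : {baseline_accuracy:.3f}")
print(f"Spam recall       : {spam_recall:.3f}")
```

Any model worth keeping has to beat this baseline on accuracy and, more importantly, reach a spam recall well above zero, which is why precision, recall, and AUC are tracked later in the notebook.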

In [None]:
# Data set size
N_SAMPLES = len(data_frame)

print(f"Total Number of Samples : {N_SAMPLES}")

Total Number of Samples : 5572


At **5,572 samples**, the dataset is **workable**, but it is still **quite small** for training a neural network from scratch.

In [None]:
max_len = max([len(text) for text in data_frame.Message])
print(f"Maximum Length Of Input Sequence(Chars) : {max_len}")

Maximum Length Of Input Sequence(Chars) : 910


In [None]:
# Extract X and y from the data frame
X = data_frame['Message'].tolist()
y = data_frame['Category'].tolist()


# Initialize label encoder
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(y)

# Print the first 5 elements of X and y
print(f'X[:5]: \n{X[:5]}\n')
print(f'y[:5]: {y[:5]}\n')
print(f"Label Mapping : {label_encoder.inverse_transform(y[:5])}")

X[:5]: 
['Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...', 'Ok lar... Joking wif u oni...', "Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's", 'U dun say so early hor... U c already then say...', "Nah I don't think he goes to usf, he lives around here though"]

y[:5]: [0 0 1 0 0]

Label Mapping : ['ham' 'ham' 'spam' 'ham' 'ham']


Great! We have completed the first step of loading the data. However, the data is currently in a raw format which needs to be preprocessed in order to make it compatible with the model.

---

There are **multiple ways** to create a **label mapping dictionary**, one of which is creating it **manually**. This process can be **time-consuming**, but it can provide **more control and flexibility** in terms of **label mapping**. However, there's also an **easier way** to do it using **label encoder**.

In this approach, the **label encoder** assigns a **unique numerical label to each class**, starting from **zero**. For example, in our case, the **label encoder encoded** the class **"ham" as 0** and **"spam" as 1**. This way, we can **quickly map each class** to its **corresponding numerical label** without the need for **manual dictionary creation**.

In this case, there are just **two classes**, so a **manual dictionary would be equally effective**.
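As a rough illustration of the two options, the sketch below builds the mapping by hand and then automatically by sorting the class names, which is the same ordering rule `LabelEncoder` applies. The variable names are hypothetical and the five labels are copied from the output above.

```python
# First five labels from the output above.
labels = ["ham", "ham", "spam", "ham", "ham"]

# Manual mapping: full control over which class gets which id.
manual_mapping = {"ham": 0, "spam": 1}

# Automatic mapping: sort the unique class names and number them from zero.
auto_mapping = {name: idx for idx, name in enumerate(sorted(set(labels)))}

# Encode with the manual mapping, then decode again with its inverse.
encoded = [manual_mapping[label] for label in labels]
inverse_mapping = {idx: name for name, idx in manual_mapping.items()}
decoded = [inverse_mapping[idx] for idx in encoded]

print(auto_mapping == manual_mapping)
print(encoded)
print(decoded)
```

With only two classes both routes give the same result; the encoder mainly pays off when the number of classes grows or the labels are not known in advance.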

# **Text Vectorization**
---

One way to handle **class imbalance** is to use the **SMOTE algorithm**. When applying **SMOTE** to **text data**, the **input features** are usually represented as **vectors of numerical values** using **techniques such as bag-of-words, TF-IDF, or word embeddings**. These **numerical vectors** can then be used as input to the **SMOTE algorithm** to **generate synthetic data points**.

However, it is **important** to **note** that **SMOTE may not always be the best technique** for **addressing class imbalance** in **text data**. **Text data** is often **high-dimensional and sparse**, which can make it **challenging to generate meaningful synthetic data points**. In some cases, **alternative techniques** such as **data augmentation** or **cost-sensitive learning** may be **more effective** for **improving the performance** of **machine learning models** trained on **imbalanced text data**.

In summary, **SMOTE** can work with text data, but its **effectiveness** will depend on the **specific characteristics** of the **data** and the project's needs and constraints.


Here we will use **Cost-Sensitive Learning**.
* **Cost-sensitive learning**: Modifying the **machine learning algorithm** to give **higher weights** to the **minority class**. This approach works by **penalizing** the **model more for misclassifying the minority class**.

In [None]:
# Compute class weights
class_weights = class_weight.compute_class_weight(class_weight='balanced', classes=np.unique(y), y=y)
# Map each encoded label (0 = ham, 1 = spam) to its weight
class_weights = dict(enumerate(class_weights))
# Show
print(f"Associated class weights: {class_weights}")

Associated class weights: {0: 0.5774093264248704, 1: 3.7295850066934406}


Based on the **computed class weights**, we can observe that the **second class (i.e., "spam")** has a **higher weight** of **3.72958501**, while the **first class (i.e., "ham")** has a **lower weight** of **0.57740933**. These weights are **inversely proportional to class frequency**: each **misclassified spam message** contributes roughly **6.5 times more** to the **loss** than a **misclassified ham message**, which **compensates** for spam being the **minority class**.

Therefore, we can say that the **class weights** are **aligned with our expectations**, where we **wanted to prioritize** the **spam class** as it is the **target class** we want to **detect accurately**.
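The "balanced" heuristic behind these numbers is `n_samples / (n_classes * class_count)`. The sketch below reproduces the two weights with plain arithmetic; the class counts (4,825 ham, 747 spam) are an assumption derived from the dataset size and class shares, not values read from the data frame.

```python
# Assumed class counts: 0 = ham, 1 = spam (5,572 messages in total).
class_counts = {0: 4825, 1: 747}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in class_counts.items()
}
print(balanced_weights)

# With these weights, each class contributes the same total weight to the loss.
total_weight_per_class = {
    label: balanced_weights[label] * count
    for label, count in class_counts.items()
}
print(total_weight_per_class)
```

The second dictionary shows the point of the heuristic: after weighting, the 747 spam messages carry as much total weight as the 4,825 ham messages.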

---
We have just dealt with **class imbalance**. It's time to move ahead to **natural language processing.**

In [None]:
# Define a function to preprocess the text
def preprocess_text(text: str) -> str:
    """
    Preprocesses the text by removing punctuation, lowercasing, and stripping whitespace.
    """
    # Replace punctuation with spaces
    text = tf.strings.regex_replace(text, f"[{string.punctuation}]", " ")

    # Lowercase the text
    text = tf.strings.lower(text)

    # Strip leading/trailing whitespace
    text = tf.strings.strip(text)

    return text


# Create a TextVectorization layer
text_vectorizer = layers.TextVectorization(
    max_tokens=vocab_size,                       # Maximum vocabulary size
    output_sequence_length=max_seq_len,          # Pad or truncate every sequence to this length
    standardize=preprocess_text,                 # Custom text preprocessing function
    pad_to_max_tokens=True,                      # Only relevant for 'multi_hot', 'count' and 'tf_idf' modes
    output_mode='int'                            # Output integer-encoded sequences
)

# Adapt the TextVectorization layer to the data
text_vectorizer.adapt(X)

Let's see the Text Vectorization working.

In [None]:
for _ in range(5):
    # Pick a random message.
    text_temp = X[np.random.randint(N_SAMPLES)]

    # Vectorize the text.
    text_vec_temp = text_vectorizer(text_temp)

    # Show the results
    print(f"Original Text: {text_temp}")
    print(f"Vectorized Text: {text_vec_temp}\n")

Original Text: Ard 4 lor...
Vectorized Text: [569  44  86   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0   0   0]

Original Text: Nowadays people are notixiquating the laxinorficated opportunity for bambling of entropication.... Have you ever oblisingately opted ur books for the masteriastering amplikater of fidalfication? It is very champlaxigating, i think it is atrocious.. Wotz Ur Opinion???? Junna
Vectorized Text: [3435  271   24 6074    6 6479 1767   14 8098   16 7302   19    4  372
 6045 5987   35 2822   14    6 6314 8267   16 7193   13   10  176 7823
    2  112   13   10 8162 4482   35 1233 6582    0    0    0]

Original Text: Que pases un buen tiempo or something like that
Vectorized Text: [5637 5901  831 7911 4868   31  200   59   18    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0]

In [None]:
# Get the vocabulary
VOCAB = text_vectorizer.get_vocabulary()

# Let's have a look at the tokens present in the vocabulary
print(f"Vocabulary size: {len(VOCAB)}")
print(f"Vocabulary: {VOCAB[150:200]}")

Vocabulary size: 8841
Vocabulary: ['number', 'message', 'e', 've', 'tomorrow', 'say', 'won', 'right', 'prize', 'already', 'after', 'said', 'ask', 'doing', 'cash', 'amp', '3', 'yeah', 'really', 'im', 'why', 'b', 'life', 'them', 'meet', 'find', 'very', 'miss', 'morning', 'let', 'babe', 'last', 'would', 'win', 'thanks', 'cos', 'anything', 'uk', 'lol', 'also', 'care', 'every', 'sure', 'pick', 'com', '150p', 'sent', 'nokia', 'keep', 'urgent']


The **vocabulary size** is **8841**, which is below the **`max_tokens` limit of 10,000**, so **no word was dropped**. This count includes **two reserved entries**, the **padding token** at index 0 and the **out-of-vocabulary token** at index 1, which leaves **8839 unique words** from the **corpus**. The **vocabulary** is **sorted in descending order** of **frequency**, with the **most frequent words** appearing at the **top**. The **words in the vocabulary** are **mostly related** to **text messaging**, such as **"message," "tomorrow," "said," "win," and "urgent,"** among others. The **vocabulary** also **contains some commonly used words** and **abbreviations**, such as **"ve," "amp," "lol," and "cos."**
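The following pure-Python sketch mimics, in simplified form, what the vectorizer does: standardize the text, build a frequency-sorted vocabulary with two reserved slots, and map each message to a fixed-length list of ids. The helper names (`standardize`, `build_vocab`, `vectorize`) are hypothetical, and this is an approximation of the idea, not the `TextVectorization` implementation.

```python
import string
from collections import Counter

def standardize(text):
    # Replace punctuation with spaces, lowercase, and strip (as preprocess_text does).
    table = str.maketrans({ch: " " for ch in string.punctuation})
    return text.translate(table).lower().strip()

def build_vocab(texts, max_tokens):
    # Index 0 is reserved for padding and index 1 for out-of-vocabulary tokens.
    counts = Counter(tok for text in texts for tok in standardize(text).split())
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def vectorize(text, vocab, seq_len):
    # Map tokens to ids, truncate to seq_len, then pad with zeros.
    index = {tok: i for i, tok in enumerate(vocab)}
    ids = [index.get(tok, 1) for tok in standardize(text).split()][:seq_len]
    return ids + [0] * (seq_len - len(ids))

toy_corpus = [
    "Ok lar... Joking wif u oni...",
    "U dun say so early hor... U c already then say...",
]
toy_vocab = build_vocab(toy_corpus, max_tokens=10)
print(toy_vocab)
print(vectorize("U say WIN!!!", toy_vocab, seq_len=6))
```

In the toy run, "win" is not in the vocabulary, so it is mapped to the out-of-vocabulary id 1, and the remaining positions are filled with the padding id 0.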

## **Data Splitting**
---

As we have our processing functions ready, let's split the data into **training and testing**, and also apply the **Text Vectorization**.
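Because the classes are imbalanced, the split should keep the spam share roughly equal in both parts, which is what a stratified split does. Below is a simplified sketch of that idea in plain Python; `stratified_split` is a hypothetical helper for illustration and not how scikit-learn implements it.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) so each class keeps roughly its share in both parts."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class separately and move the same fraction into the test set.
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels with an imbalance similar to the dataset (13% spam).
toy_labels = [1] * 13 + [0] * 87
train_idx, test_idx = stratified_split(toy_labels, test_size=0.2, seed=42)

spam_share_test = sum(toy_labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), round(spam_share_test, 2))
```

A plain random split of such a small, skewed dataset could by chance leave the test set with very few spam messages, which would make precision and recall estimates noisy.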

In [None]:
# Split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42, shuffle=True)

# Apply the Text Vectorization
X_train = text_vectorizer(X_train)
X_test = text_vectorizer(X_test)

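# One-hot vectors are not used by the model below. Materializing them takes about
# len(X_train) * max_seq_len * vocab_size floats (several GB), so they are left disabled.
# Xoh_train = tf.one_hot(X_train, depth=vocab_size)
# Xoh_test  = tf.one_hot(X_test, depth=vocab_size)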

# **Transformer Network**
---

![image.png](attachment:9d28e3f3-e714-4e43-8778-e1e31d0035f1.png)

The **Transformer network** is a **deep neural network architecture** that is **widely** used for **natural language processing tasks**, including **text classification**. The **network comprises** a **stack of Transformer blocks**, each containing **two sub-layers**:
* The *self-attention layer (more specifically, multi-head attention, MHA)*
* The *feedforward layer*.

The **self-attention layer** is the **key component** of the **Transformer network** that enables it to capture the dependencies **between different words** in a sentence. It works by computing, for each word, a **weighted sum** over **value projections** of **all the words** in the **sentence**, where the **weights** are determined by the **attention scores** between **pairs of words**. These **attention scores** are calculated by taking the **dot product** of learned **query and key projections** of the **word embeddings**, **scaling** it by the square root of the key dimension, and applying a **softmax function** to the **result**, resulting in a **weight distribution** that reflects the **relevance of each word to every other word in the sentence**.
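A minimal pure-Python sketch of scaled dot-product attention for a single sequence may help here. For simplicity it assumes the queries, keys, and values are the raw token vectors themselves; a real attention layer first applies learned linear projections, which are omitted. The function names are hypothetical.

```python
import math

def softmax(scores):
    # Subtract the maximum for numerical stability before exponentiating.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy token vectors; Q = K = V here (no learned projections).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is the attention distribution of one token over all three tokens, so every row sums to one.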

The **feedforward layer** is a **simple two-layer neural network** that processes the output of the **self-attention layer** and applies a **non-linear transformation** to it.

The **multi-head attention mechanism** is a **variation** of the **self-attention layer** in which the **attention scores** are **calculated multiple times** with **different linear projections** of the **input embeddings**. This **enables the network** to attend to **different aspects** of the **input embeddings in parallel**, which can lead to **better performance on certain tasks**.
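One practical consequence of using several heads is the size of the projection matrices. The sketch below estimates the number of trainable parameters in a multi-head attention layer, assuming the layout documented for Keras `MultiHeadAttention` (separate query, key, and value projections of shape `embed_dim x num_heads x key_dim` with biases, plus an output projection back to `embed_dim`, and `value_dim` equal to `key_dim`). The function name is hypothetical and the formula is an assumption based on that layout.

```python
def mha_param_count(embed_dim, num_heads, key_dim):
    """Weights plus biases, assuming value_dim == key_dim and biases enabled."""
    # One of the query / key / value projections.
    proj = embed_dim * num_heads * key_dim + num_heads * key_dim
    # Output projection from the concatenated heads back to embed_dim.
    out = num_heads * key_dim * embed_dim + embed_dim
    return 3 * proj + out

# This notebook passes key_dim=embedding_dims, so every head is 256-dimensional.
notebook_setting = mha_param_count(embed_dim=256, num_heads=4, key_dim=256)

# A common alternative splits the embedding across heads (256 // 4 = 64 per head).
per_head_split = mha_param_count(embed_dim=256, num_heads=4, key_dim=256 // 4)

print(notebook_setting)
print(per_head_split)
```

Under these assumptions the notebook's setting is about four times larger than the split-per-head variant, which is worth keeping in mind given how small the dataset is.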

For **text classification**, the **Transformer network** takes the **input text** as a **sequence of word embeddings** and passes it through **one or more Transformer blocks** (this notebook uses a **single block**). The **output** is then **pooled** into a **single vector representation** of the **input text**, which is used for **classification** by a **softmax layer** or, for a **binary task** like ours, a **single sigmoid unit**. By using the **self-attention mechanism** and the **multi-head attention mechanism**, the network can capture the **semantic relationships** between **different words** in the text, which helps it **make accurate predictions.**

Before creating the **transformer layer**, let's first create the **Word and Positional embedding layer**.

In [None]:
class TokenAndPositionalEmbedding(layers.Layer):

    def __init__(self, embedding_dims, vocab_size, seq_len, **kwargs):
        super(TokenAndPositionalEmbedding, self).__init__(**kwargs)

        # Initialize parameters
        self.seq_len = seq_len
        self.vocab_size = vocab_size
        self.embedding_dims = embedding_dims
        self.embed_scale = tf.math.sqrt(tf.cast(embedding_dims, tf.float32))

        # Define layers
        self.token_embedding = layers.Embedding(
            input_dim=vocab_size,
            output_dim=embedding_dims,
            name="token_embedding"
        )

        self.positional_embedding = layers.Embedding(
            input_dim=seq_len,
            output_dim=embedding_dims,
            name="positional_embedding"
        )

    def call(self, inputs):
        seq_len = tf.shape(inputs)[1]

        # Token Embedding
        token_embedding = self.token_embedding(inputs)
        token_embedding *= self.embed_scale

        # Positional Embedding
        positions = tf.range(start=0, limit=seq_len, delta=1)
        positional_embedding = self.positional_embedding(positions)

        # Add Token and Positional Embedding
        embeddings = token_embedding + positional_embedding

        return embeddings


    def get_config(self):
        config = super(TokenAndPositionalEmbedding, self).get_config()
        config.update({
            'embedding_dims': self.embedding_dims,
            'vocab_size': self.vocab_size,
            'seq_len': self.seq_len,
        })
        return config


In [None]:
# Let's look at what the layer does.
temp_embeds = TokenAndPositionalEmbedding(embed_dim, vocab_size, max_seq_len)(X_train[:1])
temp_embeds

The **Embedding layer** in a **neural network** is a **crucial component** that **plays a key role** in **converting text data** into **meaningful numerical representations**. Essentially, the **Embedding layer** creates a **word embedding** (also known as the **token embedding**), which **projects** the **input indexes** (i.e., the tokens) into a **feature space** (or vector space) that **contains an informative representation for each token** (or word).

Additionally, it computes the **positional embeddings,** which represent the **positions** of the **tokens** in the **input sequence**. The original Transformer achieved this with **fixed sine and cosine waves**; here we instead use a **second Embedding layer**, so the **position vectors are learned** during training, which keeps the implementation **simple**.
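For comparison, the fixed sine and cosine scheme can be sketched in a few lines of plain Python. This is an illustration of the classic formula and is not used anywhere in this notebook; the function name and the small `dims=8` are chosen only for readability.

```python
import math

def sinusoidal_positions(seq_len, dims):
    """PE[pos][2i] = sin(pos / 10000^(2i/dims)), PE[pos][2i+1] = cos(same angle)."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dims):
            # Even and odd columns of a pair share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / dims))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_positions(seq_len=40, dims=8)
print([round(x, 3) for x in pe[0]])
print([round(x, 3) for x in pe[1]])
```

The learned variant used in this notebook replaces this fixed table with trainable weights of the same shape, at the cost of being limited to positions below `max_seq_len`.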

## **Transformer Layer**

In [None]:
class TransformerLayer(layers.Layer):

    def __init__(self, num_heads: int, dropout_rate: float, embedding_dims: int, ff_dim: int, **kwargs):
        super(TransformerLayer, self).__init__(**kwargs)

        # Initialize Parameters
        self.num_heads = num_heads
        self.dropout_rate = dropout_rate
        self.embedding_dims = embedding_dims
        self.ff_dim = ff_dim

        # Initialize Layers
        self.mha = layers.MultiHeadAttention(num_heads=num_heads, key_dim=embedding_dims, dropout=dropout_rate)
        self.ln1 = layers.LayerNormalization(epsilon=1e-6)

        self.ffn = keras.Sequential([
            layers.Dense(ff_dim, activation='relu', kernel_initializer='he_normal'),
            layers.Dense(embedding_dims)
        ])
        self.ln2 = layers.LayerNormalization(epsilon=1e-6)

    def call(self, inputs):
        """Forward pass of the Transformer Layer.

        Args:
            inputs: Tensor with shape `(batch_size, seq_len, embedding_dims)` representing the input sequence.

        Returns:
            Tensor with shape `(batch_size, seq_len, embedding_dims)` representing the output sequence after applying the Transformer Layer.
        """

        # Multi-Head Attention
        attention = self.mha(inputs, inputs, inputs)

        # Layer Normalization and Residual Connection
        normalized1 = self.ln1(attention + inputs)

        # Feedforward Network
        ffn_out = self.ffn(normalized1)

        # Layer Normalization and Residual Connection
        normalized2 = self.ln2(ffn_out + normalized1)

        return normalized2

    def get_config(self):
        """Get the configuration of the Transformer Layer.

        Returns:
            Dictionary with the configuration of the layer.
        """
        config = super(TransformerLayer, self).get_config()
        config.update({
            "num_heads": self.num_heads,
            "dropout_rate": self.dropout_rate,
            "embedding_dims": self.embedding_dims,
            "ff_dim": self.ff_dim
        })
        return config


In [None]:
# Transformer layers execution
TransformerLayer(num_heads=num_heads, embedding_dims=embed_dim, ff_dim=ff_dim, dropout_rate=0.1)(temp_embeds)

# **Transformer Text Classification Model**
---

It's time to combine the **Token and Positional Embedding** layer and the **Transformer layer** to make a **Transformer Network architecture** for **text classification**.

In [None]:
# Input layer
InputLayer = layers.Input(shape=(max_seq_len,), name="InputLayer")

# Embedding Layer
embeddings = TokenAndPositionalEmbedding(embed_dim, vocab_size, max_seq_len, name="EmbeddingLayer")(InputLayer)

# Transformer Layer
encodings = TransformerLayer(num_heads=num_heads, embedding_dims=embed_dim, ff_dim=ff_dim, dropout_rate=0.1, name="TransformerLayer")(embeddings)

# Classifier
gap = layers.GlobalAveragePooling1D(name="GlobalAveragePooling")(encodings)
drop = layers.Dropout(0.5, name="Dropout")(gap)
OutputLayer = layers.Dense(1, activation='sigmoid', name="OutputLayer")(drop)

# Model
model = keras.Model(InputLayer, OutputLayer, name="TransformerNet")

# Model Architecture Summary
model.summary()

# **Transformer Training**
---

In [None]:
# Compile the Model
model.compile(
    loss='binary_crossentropy',
    optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
    metrics=[
        keras.metrics.BinaryAccuracy(name='accuracy'),
        keras.metrics.Precision(name='precision'),
        keras.metrics.Recall(name='recall'),
        keras.metrics.AUC(name='auc'),
    ]
)

# Train Model
history = model.fit(
    X_train, y_train,
    validation_split=0.1,
    batch_size=batch_size,
    epochs=epochs,
    callbacks=callbacks,
    class_weight=class_weights
)

In [None]:
# Plot metrics
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 8))
plt.subplots_adjust(hspace=0.5)

axes[0, 0].plot(history.history['loss'], label='Training Loss')
axes[0, 0].plot(history.history['val_loss'], label='Validation Loss')
axes[0, 0].set_title('Loss', fontsize=14)
axes[0, 0].set_xlabel('Epoch', fontsize=12)
axes[0, 0].set_ylabel('Loss', fontsize=12)
axes[0, 0].grid(True)
axes[0, 0].legend(fontsize=10)

axes[0, 1].plot(history.history['accuracy'], label='Training Accuracy')
axes[0, 1].plot(history.history['val_accuracy'], label='Validation Accuracy')
axes[0, 1].set_title('Accuracy', fontsize=14)
axes[0, 1].set_xlabel('Epoch', fontsize=12)
axes[0, 1].set_ylabel('Accuracy', fontsize=12)
axes[0, 1].grid(True)
axes[0, 1].legend(fontsize=10)

axes[1, 0].plot(history.history['precision'], label='Training Precision')
axes[1, 0].plot(history.history['val_precision'], label='Validation Precision')
axes[1, 0].set_title('Precision', fontsize=14)
axes[1, 0].set_xlabel('Epoch', fontsize=12)
axes[1, 0].set_ylabel('Precision', fontsize=12)
axes[1, 0].grid(True)
axes[1, 0].legend(fontsize=10)

axes[1, 1].plot(history.history['recall'], label='Training Recall')
axes[1, 1].plot(history.history['val_recall'], label='Validation Recall')
axes[1, 1].set_title('Recall', fontsize=14)
axes[1, 1].set_xlabel('Epoch', fontsize=12)
axes[1, 1].set_ylabel('Recall', fontsize=12)
axes[1, 1].grid(True)
axes[1, 1].legend(fontsize=10)

fig.suptitle('Model Performance Metrics', fontsize=16, y=1.05)
plt.show()


In [None]:
# Evaluate model performance on test data
loss, acc, precision, recall, auc = model.evaluate(X_test, y_test, verbose=0)

# Show the model performance
print('Test loss      :', loss)
print('Test accuracy  :', acc*100)
print('Test precision :', precision*100)
print('Test recall    :', recall*100)
print('Test AUC       :', auc*100)

Looking at the **training curve**, **one might think** that the **model is diverging** and the **performance is deteriorating.** However, upon **evaluating** the **model's performance** on the **testing data**, we see that the **model is actually performing great** and is able to **generalize on new samples**. This shows that the **model weights** are **robust** and **capable of working** on **new samples**.
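To make the reported metrics concrete, here is a small sketch of how accuracy, precision, and recall follow from the confusion-matrix counts. The labels and predictions below are made up for illustration and are not results from this model; `classification_metrics` is a hypothetical helper.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision and recall for binary labels where 1 = spam."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Of the messages flagged as spam, how many really are spam?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the real spam messages, how many were caught?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }

# Hypothetical labels and predictions (not taken from the notebook's model).
y_true_toy = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]

metrics = classification_metrics(y_true_toy, y_pred_toy)
print(metrics)
```

For a spam filter, low precision means legitimate messages get flagged, while low recall means spam slips through, so both should be read alongside accuracy.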

You might object that this is an **extremely simple architecture**. Although the **architecture is simple**, it **already performs well on the test set**. That's why **I'm not focusing** on **improving the model architecture** at this point. It's always a **good idea** to **start with a simple model** and then **gradually increase the complexity as required**.

# **Transformer Predictions**
---

In [None]:
def decode_tokens(tokens):
    """
    Convert a sequence of token ids back into text using the global VOCAB list.

    Args:
    - tokens: A sequence of integers representing tokenized text.

    Returns:
    - text: A string of decoded text (padding tokens decode to empty strings).
    """
    text = " ".join(VOCAB[int(token)] for token in tokens).strip()
    return text


In [None]:
for _ in range(10):
    # Randomly select a text from the testing data.
    index = np.random.randint(len(X_test))
    tokens = X_test[index:index+1]   # slice keeps the batch dimension
    label = y_test[index]            # label of the same sample

    # Feed the tokens to the model
    print(f"\nModel Prediction\n{'-'*100}")
    pred_class = 1 if model.predict(tokens, verbose=0)[0][0]>0.5 else 0
    pred = label_encoder.inverse_transform([pred_class])
    print(f"Message: '{decode_tokens(tokens[0])}' | Prediction: {pred[0].title()} | True : {label_encoder.inverse_transform([label])[0].title()}\n")

Note: You might see every message predicted as "ham" at first. This does not mean the model only predicts one class: about 87% of the messages are ham, so a random sample of ten often contains few or no spam messages. Run the cell several times and "spam" predictions will appear as well. Compare each prediction with the true label printed next to it to judge whether the model is right.
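A quick back-of-the-envelope check shows how often an all-ham sample should be expected. The sketch assumes a ham share of 0.866 (the figure reported earlier) and ten independent draws, which matches sampling with replacement.

```python
# Assumed share of ham messages (from the class distribution reported earlier).
ham_share = 0.866
# Number of random test messages shown per run of the cell above.
n_draws = 10

# Probability that all draws are ham, assuming independent draws.
p_all_ham = ham_share ** n_draws
p_at_least_one_spam = 1 - p_all_ham

print(f"P(all {n_draws} messages are ham) : {p_all_ham:.3f}")
print(f"P(at least one spam)        : {p_at_least_one_spam:.3f}")
```

So an all-ham batch is not rare, and seeing one says little about whether the model can detect spam.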

In [None]:
# Custom Input
text = input("Enter your Msg: ")

# Convert into tokens
tokens = text_vectorizer([text])

# Feed the tokens to the model
print(f"\nModel Predictions\n{'-'*100}")
proba = 1 if model.predict(tokens, verbose=0)[0][0]>0.5 else 0
pred = label_encoder.inverse_transform([proba])
print(f"Message: '{text}' | Prediction: {pred[0].title()}")

# This is not supported.

After completing the Transformer Predictions section, we have a fully functional model that can take input from users and classify them as spam or ham. Feel free to try out the model yourself by providing some sample texts and see how the model responds.

---
## **Summary**

Here is a summary of the notebook and its steps:

* **SetUp:** Importing necessary libraries and defining constants and hyperparameters.

* **Data Loading:** Loading the SMS spam/ham dataset.

* **Text Vectorization:** Preprocessing the text data using the TextVectorization layer from TensorFlow, which tokenizes the text and converts them to numerical values.

* **Transformer Network:** Creating the transformer network architecture using the Keras functional API. This includes creating the token embedding, the positional embedding, and a multi-head self-attention layer.

* **Transformer Training:** Compiling and training the transformer network on the preprocessed SMS data. The training is performed on the training set and validated on the validation set.

* **Evaluation:** Evaluating the model on the test set to check its generalization capability and plot different metrics.

* **Custom Input:** Taking input from the user and using the trained model to classify the input text as spam or ham.

---
**DeepNets**