In [1]:
!pip install gensim




In [2]:
# Import necessary libraries
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


In [3]:
from google.colab import files
uploaded = files.upload()


Saving Dataset.csv to Dataset.csv


In [4]:
import pandas as pd

# Assuming the file name is 'Dataset.csv'
df = pd.read_csv('Dataset.csv', encoding='latin1')

# Display the first few rows of the dataset
print(df.head())


  Label                                              Email
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


# Extract Email Texts and Labels

In [5]:
# Extract 'texts' and 'labels'
texts = df['Email'].tolist()  # convert the 'Email' column into a Python list
labels = df['Label'].map({'ham': 0, 'spam': 1}).tolist()  # map 'ham' -> 0 and 'spam' -> 1

# Print the total number of spam and ham emails
print("Total no. of spam emails:", sum(labels))
print("Total no. of ham emails:", len(labels) - sum(labels))

Total no. of spam emails: 747
Total no. of ham emails: 4825


# Split the Dataset

Training set: 70% of the original data

Validation set: 15% of the original data

Test set: 15% of the original data

In [6]:
# Split the dataset into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(texts, labels, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)


*  Here, 70% of the original dataset is assigned to the training set (X_train, y_train): test_size=0.3 holds back 30% of the data as a temporary set (X_temp, y_temp).
*  The temporary set (X_temp, y_temp), which contains 30% of the data, is then split in half (test_size=0.5). This means 15% of the original data goes to the validation set (X_val, y_val) and the remaining 15% goes to the test set (X_test, y_test).
* Setting random_state to a fixed number (like 42) makes the shuffle reproducible, so the split is the same every time you run the code.
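
The split sizes can be checked with a little arithmetic. The sketch below assumes that `train_test_split` rounds the held-out portion up to a whole number of samples and gives the remainder to the other side. The helper name `split_sizes` is made up for this illustration, and 5572 is the total from the class counts printed above (747 + 4825).

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_kept, n_held_out), assuming the held-out part is rounded up."""
    n_held_out = math.ceil(n_samples * test_size)
    return n_samples - n_held_out, n_held_out

n_total = 747 + 4825                           # spam + ham counts printed earlier
n_train, n_temp = split_sizes(n_total, 0.3)    # 70% train, 30% temporary
n_val, n_test = split_sizes(n_temp, 0.5)       # temporary set split in half

print("train / val / test sizes:", n_train, n_val, n_test)
print("fractions:", round(n_train / n_total, 3),
      round(n_val / n_total, 3), round(n_test / n_total, 3))
```

Under that assumption the test set holds 836 emails, which agrees with the support total in the classification report further down.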




# Tokenize and Pad Sequences


In [12]:
# Create a Tokenizer object to process the text
tokenizer = Tokenizer()

# Teach the tokenizer the vocabulary from the training, validation, and test sets combined
# This way, the tokenizer learns all the words used in the dataset
tokenizer.fit_on_texts(X_train + X_val + X_test)

# Convert the text in the training set to sequences of numbers (one number per word)
# Each word is replaced by its corresponding number from the tokenizer's learned vocabulary
sequences_train = tokenizer.texts_to_sequences(X_train)

# Do the same for the validation set
sequences_val = tokenizer.texts_to_sequences(X_val)

# Do the same for the test set
sequences_test = tokenizer.texts_to_sequences(X_test)

# Find the length of the longest sequence from all datasets (training, validation, and test)
# This tells us how much padding we need so that all sequences have the same length
max_sequence_length = max([len(seq) for seq in sequences_train + sequences_val + sequences_test])

# Determine the total number of unique words in the vocabulary
# Add 1 to include space for padding
vocab_size = len(tokenizer.word_index) + 1

# Pad the training sequences so they all have the same length (max_sequence_length)
# If a sequence is shorter, it gets padded with zeros
data_train = pad_sequences(sequences_train, maxlen=max_sequence_length)

# Pad the validation sequences
data_val = pad_sequences(sequences_val, maxlen=max_sequence_length)

# Pad the test sequences
data_test = pad_sequences(sequences_test, maxlen=max_sequence_length)





- **Tokenizer**: This is a tool that turns words into numbers. We first teach it the words from the dataset so it knows what words to expect.
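- **fit_on_texts()**: The tokenizer looks at the entire dataset (training, validation, and test) and builds a dictionary of all the unique words, so every word in every split gets an index. Note that this exposes validation and test vocabulary to the preprocessing step; fitting on the training set alone is the stricter choice.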
- **texts_to_sequences()**: After learning the words, the tokenizer replaces each word it knows with its corresponding number. A word it has never seen is skipped, because no `oov_token` was set.
- **max_sequence_length**: We find the longest sentence in the dataset so that we can pad all the shorter ones to this length. This way, all sentences will have the same number of words.
- **vocab_size**: This is simply the number of unique words the tokenizer learned from the dataset, plus one extra spot for padding.
- **pad_sequences()**: If a sequence is shorter than the longest one, it is padded with zeros (at the front, which is the Keras default) to make it the same length. This is necessary because a batch must form a single rectangular array, so every input sequence fed to the LSTM has to have the same length.
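
To make these steps concrete, here is a plain-Python sketch of the same pipeline on two toy messages. It is not the Keras implementation: the helper names (`fit_word_index`, `to_sequences`, `pad_pre`) are invented for this example, and it only lowercases and splits on whitespace, whereas the Keras `Tokenizer` also strips punctuation.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign indices starting at 1, most frequent word first (0 is kept for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Replace known words with their index; unknown words are skipped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

def pad_pre(sequences, maxlen):
    """Left-pad with zeros (and keep the last maxlen items if a sequence is too long)."""
    return [[0] * (maxlen - len(seq)) + seq[-maxlen:] for seq in sequences]

word_index = fit_word_index(["free prize now", "see you now"])
vocab_size = len(word_index) + 1          # +1 for the padding index 0

sequences = to_sequences(["free prize now", "claim prize"], word_index)
padded = pad_pre(sequences, maxlen=4)

print(word_index)
print(sequences)   # "claim" was never seen, so it is dropped
print(padded)
```

The second message shrinks to a single index because "claim" is not in the toy vocabulary, and the zeros are added on the left.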


# Train a Word2Vec Model

In [13]:
# Split the text from all datasets (training, validation, and test) into words
# Each sentence is turned into a list of words by splitting on spaces
# This gives us a list of sentences, where each sentence is a list of words
sentences = [text.split() for text in X_train + X_val + X_test]

# Train a Word2Vec model using the list of sentences
# The model learns to represent each word as a vector in a 100-dimensional space (vector_size=100)
# "window=5" sets the maximum distance (5 words on either side) between the current word and a context word
# "min_count=1" keeps every word, even one that appears only once (only words rarer than min_count are dropped)
# The "workers=4" means we’ll use 4 CPU cores to make the training faster
word2vec_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, workers=4)




- **sentences=sentences**: We break down each sentence into individual words from the training, validation, and test sets. This creates a list where every sentence is a list of words.

- **vector_size=100**: The model will learn to represent each word using a vector of 100 numbers. Individual dimensions are not directly interpretable; it is the vector as a whole that places words used in similar contexts close together.

- **window=5**: The model will consider up to 5 words before and 5 words after the current word to understand the word's context.

- **min_count=1**: Even if a word appears only once, the model will still learn it. No words will be ignored based on their frequency.

- **workers=4**: The model will use 4 worker threads for training. On a multi-core machine, more workers generally shorten training time.
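
The idea behind `window` can be shown without gensim. The sketch below lists the (target, context) word pairs that a fixed window would produce for one toy sentence. The function name `context_pairs` is made up for this illustration, and real Word2Vec training may also shrink the window at random for each word, which this sketch does not do.

```python
def context_pairs(tokens, window):
    """List every (target, context) pair whose positions differ by at most `window`."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:                      # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs

tokens = "win a free prize today".split()

narrow = context_pairs(tokens, window=1)
wide = context_pairs(tokens, window=5)

print(narrow)
print(len(narrow), len(wide))
```

With `window=5` and only five words, every word is a context word for every other word, so a larger window only matters for longer messages.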



# Prepare the Embedding Matrix

In [14]:
# Create an empty matrix to hold the word vectors
# The matrix has 'vocab_size' rows (one for each word in the vocabulary)
# Each row will hold a vector of size 'vector_size' (the dimensions of the word vectors)
embedding_matrix = np.zeros((vocab_size, word2vec_model.vector_size))

# Loop through all the words and their assigned indices in the tokenizer's word index
for word, i in tokenizer.word_index.items():
    # Check if this word exists in the Word2Vec model's vocabulary
    if word in word2vec_model.wv:
        # If the word exists, place its word vector in the embedding matrix
        # The word's vector is stored in the row that matches its index 'i'
        embedding_matrix[i] = word2vec_model.wv[word]





- **embedding_matrix = np.zeros((vocab_size, word2vec_model.vector_size))**:
  - We create an empty matrix (full of zeros) to store the word vectors. Each word in the vocabulary will have its own row in this matrix. The number of columns is the size of the word vector (100 dimensions, as defined before).

- **for word, i in tokenizer.word_index.items():**:
  - This loop goes through each word and its index in the tokenizer's vocabulary. The index tells us which row in the embedding matrix corresponds to the word.

- **if word in word2vec_model.wv:**:
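  - We check if this word exists in the Word2Vec model's learned vocabulary. If the word was seen during training, we’ll have its vector; otherwise its row stays all zeros. Because Word2Vec was trained on raw whitespace-split text while the Tokenizer lowercases words and strips punctuation by default, a word that only ever appears capitalized or attached to punctuation in the raw text (for example `Free` or `prize!`) has no vector under the tokenizer's key (`free`, `prize`), so some rows can remain zero even though the word occurs in the data.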

- **embedding_matrix[i] = word2vec_model.wv[word]:**:
  - If the word exists in the Word2Vec model, we place its vector in the corresponding row of the embedding matrix (at index 'i'). This way, each word gets represented by its vector in the matrix.
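
A quick way to see how many rows actually receive a vector is to count the misses while filling the matrix. The following is a minimal sketch with plain dictionaries and lists standing in for `tokenizer.word_index` and `word2vec_model.wv`; the function name `build_embedding_rows` and the toy data are hypothetical.

```python
def build_embedding_rows(word_index, vectors, dim):
    """Fill one row per index; rows without a matching vector stay all zeros."""
    rows = [[0.0] * dim for _ in range(len(word_index) + 1)]   # row 0 is padding
    missing = []
    for word, i in word_index.items():
        if word in vectors:
            rows[i] = list(vectors[word])
        else:
            missing.append(word)
    return rows, missing

# Toy stand-ins: tokenizer keys are lowercased with punctuation stripped,
# while the vector keys come from raw whitespace-split text.
word_index = {"free": 1, "prize": 2, "now": 3}
vectors = {"Free": [0.1, 0.2], "prize!": [0.3, 0.4], "now": [0.5, 0.6]}

rows, missing = build_embedding_rows(word_index, vectors, dim=2)
coverage = 1 - len(missing) / len(word_index)

print("words without a vector:", missing)
print(f"coverage: {coverage:.0%}")
```

In this toy case only one of three words finds its vector. Running the same count on the real objects would show how much of the vocabulary the frozen embedding layer can actually use.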


# Build an LSTM Model

In [15]:
# Build an LSTM model with Word2Vec embeddings
model = Sequential()

# Add an Embedding layer using the Word2Vec embeddings
# vocab_size: number of words in the vocabulary
# word2vec_model.vector_size: the size of the word vectors (100 dimensions)
# weights=[embedding_matrix]: use the pre-trained word vectors from the embedding matrix
# input_length: maximum length of the input sequences
# trainable=False: freeze the embeddings during training so they don't change
model.add(Embedding(vocab_size, word2vec_model.vector_size, weights=[embedding_matrix],
                    input_length=max_sequence_length, trainable=False))

# Add an LSTM layer with 100 units
# dropout=0.2: randomly drop 20% of the layer's inputs during training to reduce overfitting
# recurrent_dropout=0.2: apply the same kind of dropout to the recurrent (hidden-state) connections
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))

# Add a Dense layer with 1 unit for binary classification (spam or not spam)
# Use 'sigmoid' activation function, which outputs a value between 0 and 1
model.add(Dense(1, activation='sigmoid'))

# Print the summary of the model to see its structure
model.summary()






- **Sequential()**: We’re building the model step by step using Keras's Sequential API, where we add one layer at a time.

- **Embedding Layer**:
  - We're using the word vectors we trained earlier with Word2Vec.
  - The embedding layer takes the `vocab_size` (total words in the vocabulary) and `vector_size` (100 dimensions) to set up the word embeddings.
  - **weights=[embedding_matrix]**: This uses the pre-trained Word2Vec embeddings.
  - **trainable=False**: We freeze the word vectors so they don't get changed during training. This way, the model focuses on learning other things while keeping the word vectors fixed.

- **LSTM Layer**:
  - We use an LSTM layer with 100 units to process the sequence data.
  - **dropout=0.2**: During training, 20% of the inputs to the LSTM are randomly dropped, which helps reduce overfitting.
  - **recurrent_dropout=0.2**: Dropout is also applied to the recurrent (hidden-state) connections within the LSTM to add more regularization.

- **Dense Layer**:
  - The final layer is a Dense layer with 1 unit, which is used for binary classification (spam or ham).
  - **sigmoid activation**: The sigmoid function outputs a value between 0 and 1, which can be read as the predicted probability that the email is spam.

- **model.summary()**: This gives a quick overview of the model architecture, showing the number of parameters and how each layer is connected.
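
The notebook does not show the output of `model.summary()`, but the trainable parameter counts can be estimated by hand. The sketch below assumes the standard LSTM parameterization (four gates, each with input weights, recurrent weights, and a bias); the helper names are invented for this example. The frozen Embedding layer would add `vocab_size * 100` non-trainable parameters, a number that depends on the vocabulary size and is not printed in the notebook.

```python
def lstm_param_count(input_dim, units):
    """4 gates x (input weights + recurrent weights + bias)."""
    return 4 * (input_dim * units + units * units + units)

def dense_param_count(input_dim, units):
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units

lstm_params = lstm_param_count(input_dim=100, units=100)   # 100-dim embeddings into 100 LSTM units
dense_params = dense_param_count(input_dim=100, units=1)   # LSTM output into 1 sigmoid unit

print("LSTM parameters:", lstm_params)
print("Dense parameters:", dense_params)
print("Trainable total:", lstm_params + dense_params)
```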


# Compile the Model

In [17]:
# Compile the model for training
# loss='binary_crossentropy': This is the loss function used for binary classification (spam or not spam)
# optimizer='adam': Adam optimizer is a good default choice for training neural networks
# metrics=['accuracy']: We'll track accuracy during training to see how well the model is performing
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])


# Train the Model

In [18]:
# Train the model with training data, labels, and validation data
model.fit(data_train, np.array(y_train), epochs=10, batch_size=32, validation_data=(data_val, np.array(y_val)))


Epoch 1/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 19s 136ms/step - accuracy: 0.8510 - loss: 0.3637 - val_accuracy: 0.9187 - val_loss: 0.2226
Epoch 2/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 21s 145ms/step - accuracy: 0.8954 - loss: 0.2619 - val_accuracy: 0.9007 - val_loss: 0.2304
Epoch 3/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 17s 138ms/step - accuracy: 0.9027 - loss: 0.2460 - val_accuracy: 0.9127 - val_loss: 0.2077
Epoch 4/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 25s 173ms/step - accuracy: 0.9130 - loss: 0.2298 - val_accuracy: 0.9163 - val_loss: 0.2001
Epoch 5/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 16s 135ms/step - accuracy: 0.9085 - loss: 0.2323 - val_accuracy: 0.9199 - val_loss: 0.1988
Epoch 6/10
122/122 ━━━━━━━━━━━━━━━━━━━━ 20s 132ms/step - accuracy: 0.9060 - loss: 0.2292 - val_accuracy: 0.9246 - val_loss: 0.1956
Epoch 7/10

<keras.src.callbacks.history.History at 0x7943803f89a0>



- **Training process**: We're training the model for 10 cycles (epochs), meaning the model will see the full training data 10 times to learn the patterns.
  
- **Batch size of 32**: The model processes 32 samples at a time before updating its weights, which helps make training faster and more stable.

- **Validation data**: This is used to see how well the model performs on data it hasn't trained on, which helps us understand whether the model is overfitting or generalizing well.
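
The `122/122` in the training log is the number of batches per epoch. A small sketch of that arithmetic follows; it assumes a training set of 3900 emails (70% of 5572, with the held-out 30% rounded up) and a test set of 836, and the helper name `steps_per_epoch` is made up for this illustration.

```python
import math

def steps_per_epoch(n_samples, batch_size):
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

train_steps = steps_per_epoch(3900, 32)   # assumed training-set size
test_steps = steps_per_epoch(836, 32)     # assumed test-set size

print("training steps per epoch:", train_steps)
print("evaluation steps:", test_steps)
```

Both values agree with the step counts shown in the training and evaluation progress bars.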



# Evaluate the Model


In [19]:
# Evaluate the model's performance on the test set
evaluation_results = model.evaluate(data_test, np.array(y_test))

# Display the test loss and accuracy
print("Test Loss:", evaluation_results[0])
print("Test Accuracy:", evaluation_results[1])


27/27 ━━━━━━━━━━━━━━━━━━━━ 1s 38ms/step - accuracy: 0.9069 - loss: 0.2287
Test Loss: 0.23843520879745483
Test Accuracy: 0.9019138813018799




- **Evaluating the model**: This step checks how well the model performs on completely unseen data (the test set). We’re using the test data and its labels to calculate two key metrics:
  - **Test loss**: How far off the model's predictions are from the true labels.
  - **Test accuracy**: The percentage of correct predictions made by the model on the test set.
  
- **Why it's important**: Evaluating the model on the test set gives us a sense of how well it will perform in real-world scenarios, where the data will be similar to the test set.
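
The loss reported here is binary cross-entropy, since that is what the model was compiled with. As a rough illustration of what the number measures, here is a plain-Python sketch of the formula; the function name and the clipping constant `eps` are choices made for this example, not values taken from Keras.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical predictions for one spam (1) and one ham (0) email
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
undecided = binary_cross_entropy([1, 0], [0.5, 0.5])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])

print(round(confident_right, 4), round(undecided, 4), round(confident_wrong, 4))
```

Confident correct predictions give a loss near zero, undecided ones give about 0.69, and confident mistakes are penalized much more heavily.
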
### Interpretation:

- **Test Loss: 0.2384**: This is the average binary cross-entropy on the test set. It measures how much probability the model assigns to the wrong class, so confident mistakes are penalized heavily. Lower values mean the predicted probabilities are closer to the true labels.

- **Test Accuracy: 90.19%**: The model correctly classified about 90% of the emails in the test set. This number needs context: about 87% of the emails in the dataset are ham (4825 of 5572), so a model that always predicts "ham" would already reach roughly that accuracy.

### What it means:
- **Modest gain over the baseline**: A test accuracy of 90% is only a few points above the always-ham baseline, so accuracy alone does not show that the model is reliable at catching spam. Per-class precision and recall give a clearer picture.
- **Room for improvement**: You can still explore improvements by tuning the model, experimenting with more data, or trying different architectures.

In summary, the model reaches a reasonably low loss and about 90% accuracy, but on a dataset this imbalanced those numbers should be read alongside per-class metrics.
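
For context on the accuracy figure, the class counts printed earlier (747 spam, 4825 ham) give the accuracy of a model that always predicts the majority class. This is a small sketch of that calculation; the helper name `majority_baseline_accuracy` is invented for this example.

```python
def majority_baseline_accuracy(class_counts):
    """Accuracy of always predicting the most common class."""
    return max(class_counts.values()) / sum(class_counts.values())

full_dataset = {"ham": 4825, "spam": 747}      # counts printed when the labels were extracted
baseline = majority_baseline_accuracy(full_dataset)

test_accuracy = 0.9019                         # test accuracy reported by model.evaluate
print(f"always-ham baseline: {baseline:.3f}")
print(f"gain over baseline:  {test_accuracy - baseline:.3f}")
```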


# Generate Predictions

In [20]:
# Generate predicted probabilities for the test set
predictions = model.predict(data_test)

# Convert the probabilities into binary predictions
# If the probability is greater than 0.5, classify as 1 (spam); otherwise, classify as 0 (not spam)
predictions = (predictions > 0.5).astype(int)


27/27 ━━━━━━━━━━━━━━━━━━━━ 1s 46ms/step



- **predictions = model.predict(data_test)**: The model generates probabilities for each email in the test set. These probabilities represent the likelihood that an email is spam. A value close to 1 means it's likely spam, while a value close to 0 means it's likely not spam.

- **predictions = (predictions > 0.5).astype(int)**: We convert these probabilities into binary predictions (spam or not spam) by setting a threshold at 0.5. If the probability is greater than 0.5, we classify the email as spam (1). Otherwise, it's classified as not spam (0).
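
The thresholding step can be illustrated with plain Python lists. The probabilities below are made up, and `to_labels` is a hypothetical helper that mirrors the `> 0.5` comparison used above.

```python
def to_labels(probabilities, threshold=0.5):
    """Label an email 1 (spam) only if its probability is strictly above the threshold."""
    return [1 if p > threshold else 0 for p in probabilities]

probs = [0.05, 0.35, 0.50, 0.62, 0.91]   # hypothetical model outputs

default_labels = to_labels(probs)                  # threshold 0.5
lenient_labels = to_labels(probs, threshold=0.3)   # flags more emails as spam

print(default_labels)
print(lenient_labels)
```

A probability of exactly 0.5 is labeled not spam under the strict comparison. Lowering the threshold flags more emails as spam, which generally raises recall at the cost of precision.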


# Print the Classification Report

In [21]:
# Generate and print the classification report
print("Classification Report:")
print(classification_report(np.array(y_test), predictions))


Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.97      0.94       724
           1       0.71      0.46      0.55       112

    accuracy                           0.90       836
   macro avg       0.81      0.71      0.75       836
weighted avg       0.89      0.90      0.89       836





- **classification_report()**: This function gives a detailed summary of how well the model is performing. It shows key metrics like:
  - **Precision**: Out of all the emails predicted as spam, how many were actually spam?
  - **Recall**: Out of all the actual spam emails, how many did the model correctly identify as spam?
  - **F1-Score**: A balance between precision and recall, useful for understanding the model’s overall performance.
  - **Accuracy**: The overall percentage of correct predictions (both spam and not spam).
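
These definitions can be written out directly from counts of true positives, false positives, and false negatives. Below is a minimal sketch on ten made-up labels (not the notebook's data); `precision_recall_f1` is a name chosen for this example.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy labels: 4 spam (1) and 6 ham (0); the model catches 2 spam and raises 1 false alarm
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

In the toy example accuracy is 0.70 even though half of the spam is missed, which is why the per-class numbers matter on imbalanced data.
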
### Interpretation:

- **Class 0 (Not Spam)**:
  - **Precision (0.92)**: Out of all the emails predicted as **not spam**, 92% were actually **not spam**.
  - **Recall (0.97)**: Out of all the emails that were actually **not spam**, the model correctly identified 97% of them.
  - **F1-score (0.94)**: This is a balance between precision and recall, showing that the model performs very well in identifying emails that are not spam.

- **Class 1 (Spam)**:
  - **Precision (0.71)**: Out of all the emails predicted as **spam**, 71% were actually **spam**.
  - **Recall (0.46)**: Out of all the actual **spam** emails, the model correctly identified only 46%. This indicates the model struggles with catching all the spam emails.
  - **F1-score (0.55)**: This indicates that the model is less effective at identifying spam, with a noticeable gap between precision and recall.

- **Accuracy (0.90)**: Overall, the model correctly classified 90% of the emails, both spam and not spam.

- **Macro avg** (Unweighted mean of Class 0 and Class 1, so both classes count equally regardless of size):
  - **Precision (0.81)**: The mean of the two per-class precisions (0.92 and 0.71).
  - **Recall (0.71)**: The mean of the two per-class recalls (0.97 and 0.46).
  - **F1-score (0.75)**: The mean of the two per-class F1-scores (0.94 and 0.55).

- **Weighted avg** (Weighted by the number of samples in each class):
  - The weighted average takes into account that there are many more **not spam** emails (724) than **spam** emails (112), giving an overall performance score of the model as:
    - **Precision (0.89)**, **Recall (0.90)**, and **F1-score (0.89)**.
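
The two averages can be reproduced from the per-class rows of the report. The sketch below uses the rounded values printed above, so the results match the report only up to rounding; the dictionary layout and helper names are choices made for this illustration.

```python
# Per-class rows copied from the classification report above
report = {
    0: {"precision": 0.92, "recall": 0.97, "f1": 0.94, "support": 724},
    1: {"precision": 0.71, "recall": 0.46, "f1": 0.55, "support": 112},
}

def macro_avg(report, metric):
    """Plain mean over classes: each class counts equally."""
    return sum(row[metric] for row in report.values()) / len(report)

def weighted_avg(report, metric):
    """Mean over classes weighted by how many samples each class has."""
    total = sum(row["support"] for row in report.values())
    return sum(row[metric] * row["support"] for row in report.values()) / total

for metric in ("precision", "recall", "f1"):
    print(metric,
          "macro:", round(macro_avg(report, metric), 3),
          "weighted:", round(weighted_avg(report, metric), 3))
```

Because ham makes up most of the test set, the weighted averages stay close to the ham row, while the macro averages are pulled down by the weaker spam row.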

### Summary:
- The model is **very good** at correctly identifying **not spam** emails but **struggles** with identifying **spam** emails.
- While it reaches 90% accuracy overall, the low recall for spam (46%) means the model misses more than half of the actual spam emails in the test set, so most spam would still get through undetected.




# Classify a New Email

In [22]:
# Example email to classify
new_email = ["Congratulations! You've won a free gift card. Click here to claim it!"]

# Preprocess the new email (tokenize and pad the sequence)
new_email_sequence = tokenizer.texts_to_sequences(new_email)
new_email_padded = pad_sequences(new_email_sequence, maxlen=max_sequence_length)

# Make a prediction
prediction = model.predict(new_email_padded)

# Interpret the prediction
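# model.predict returns an array of shape (1, 1), so read out the single probability
# and apply the same > 0.5 threshold used for the test set
if prediction[0][0] > 0.5:
    print("The email is classified as SPAM.")
else:
    print("The email is classified as NOT SPAM.")


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 127ms/step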
The email is classified as NOT SPAM.
