# Deep Learning for Email Classification with LSTM and Word2vec

In this project, you’ll use the Email Spam Detection Dataset to classify emails as spam or ham (legitimate messages). The dataset comprises 5,572 emails, each labeled as
either spam or ham. The columns have been renamed to make them more meaningful. The dataset now contains the following columns:

Label: This is the label assigned to each email (spam or ham).

Email: This is the complete email text.


## Task 1: Import Libraries

In this project, you’ll use several libraries and modules to perform the tasks. So, let’s start off by importing the required libraries and modules.

To complete this task, import the following:

1. NumPy and pandas:

numpy: NumPy is a library for numerical operations in Python.

pandas: pandas is a library for data manipulation and analysis.

2. Gensim for Word2Vec:

Word2Vec from gensim.models: Gensim is a library for topic modeling and document similarity analysis. In this project, you’ll use it for Word2Vec, a popular word embedding technique.

3. scikit-learn for data splitting and evaluation:

train_test_split from sklearn.model_selection: scikit-learn is a machine learning library. You’ll use train_test_split to split the dataset into training, validation, and test sets.

classification_report from sklearn.metrics: This provides a comprehensive report on the classification performance, including precision, recall, and F1 score.

4. TensorFlow and Keras for deep learning:

Sequential from tensorflow.keras.models: Keras is a high-level neural networks API. Sequential is a linear stack of layers for building neural network models.

Embedding, LSTM, and Dense from tensorflow.keras.layers: These are layers used in a neural network.

   Embedding: This is the layer for word embeddings.

   LSTM: This is the LSTM layer for sequence modeling.

   Dense: This is the fully connected layer.

Tokenizer from tensorflow.keras.preprocessing.text: This is used for tokenizing text data into sequences.

pad_sequences from tensorflow.keras.preprocessing.sequence: This is used for padding sequences to ensure uniform length.

                              

In [1]:
import numpy as np
import pandas as pd
from gensim.models import Word2Vec
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences



## Task 2: Load the Dataset

In this task, you’ll load the dataset file into a pandas DataFrame using the latin1 encoding.

Follow the given steps to complete this task:

1. Read the dataset file named Dataset.csv into a pandas DataFrame.

Use latin1 encoding for reading the CSV file.

2. Print the first few rows of the DataFrame to verify the data inside it.



In [2]:
# Load the dataset and specify the correct encoding when reading the CSV file
df = pd.read_csv('/usercode/Dataset.csv', encoding='latin1')

# Print the DataFrame head
print(df.head())

  Label                                              Email
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


## Task 3: Extract Email Texts and Labels

In this task, you’ll extract the email texts and labels from the DataFrame. You’ll extract the
Email column as a list of texts, then map the Label column to binary numerical values and convert the result into a list of labels.

Follow the steps below to complete this task:

1. Extract the Email column as a list of texts.

2. Map the Label column to numerical values (0 for ham and 1 for spam) and convert it to a list.

3. Print the total number of spam and ham emails.



In [3]:
# Extract 'texts' and 'labels'
texts = df['Email'].tolist()
labels = df['Label'].map({'ham': 0, 'spam': 1}).tolist()

# Print the total number of spam and ham emails
print("Total no. of spam emails:", sum(labels))
print("Total no. of ham emails:", len(labels) - sum(labels))

Total no. of spam emails: 747
Total no. of ham emails: 4825
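
The counts above show a strong class imbalance, which matters when reading accuracy figures later on. As a minimal sketch (the variable names are illustrative and not part of the project code), the accuracy of a trivial classifier that always predicts ham can be worked out directly from the printed totals:

```python
# Counts taken from the output above
n_spam = 747
n_ham = 4825
n_total = n_spam + n_ham

# Share of spam in the dataset
spam_ratio = n_spam / n_total

# Accuracy of a trivial baseline that labels every email as ham
majority_baseline = n_ham / n_total

print(f"Spam share: {spam_ratio:.3f}")
print(f"Always-ham baseline accuracy: {majority_baseline:.3f}")
```

Any model trained on this data should be judged against that baseline rather than against 50%.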


## Task 4: Split the Dataset

In this task, you’ll split the dataset into training, validation, and test sets. The training set is used to fit the model, the validation
set to monitor performance during training and guide hyperparameter choices, and the test set to assess the final model. Keeping the test set
untouched until the end gives a more reliable estimate of how well the model generalizes to new, unseen data.

The three sets should have the following ratio:

Training set: 70% of the original data

Validation set: 15% of the original data

Test set: 15% of the original data

Follow the steps below to complete this task:

1. Split the data into a training set and a temporary set with a ratio of 70:30.

2. Split the temporary set into a validation set and a test set with a ratio of 50:50.

                                                            

In [4]:
# Split the dataset into training, validation, and test sets
X_train, X_temp, y_train, y_temp = train_test_split(texts, labels, test_size=0.3, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)
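
To see how the two-step split yields the 70:15:15 ratio, the arithmetic can be sketched as follows. This assumes that the held-out portion is rounded up and the remainder goes to the other part, and that the dataset has the 5,572 rows counted in Task 3. The helper split_sizes is hypothetical and is not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_first, n_second) for a fractional test_size.

    Assumption: the second part is rounded up and the first part
    receives the remaining samples.
    """
    n_second = math.ceil(test_size * n_samples)
    return n_samples - n_second, n_second

n_total = 5572  # 747 spam + 4825 ham

# Step 1: 70% training, 30% temporary
n_train, n_temp = split_sizes(n_total, 0.3)

# Step 2: split the temporary set 50:50 into validation and test
n_val, n_test = split_sizes(n_temp, 0.5)

for name, n in [("train", n_train), ("val", n_val), ("test", n_test)]:
    print(f"{name}: {n} samples ({n / n_total:.1%})")
```

Because spam makes up only about 13% of the data, passing the stratify argument of train_test_split is one way to keep the class ratio similar across the three sets. The project code in this task does not do so.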

## Task 5: Tokenize and Pad Sequences

In this task, you’ll fit a tokenizer on the email texts and convert each text into a sequence of integer word indices. You’ll also determine the maximum
sequence length, calculate the vocabulary size, and pad the sequences to a uniform length for your deep learning model.

Follow the steps below to complete this task:

1. Instantiate a Tokenizer object.

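2. Fit the tokenizer on the combined training, validation, and test datasets. (In a stricter setup, you would fit it on the training set only so that no vocabulary from the held-out sets leaks into preprocessing.)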

3. Use the tokenizer to convert text sequences to numerical sequences for training, validation, and test sets.

4. Find the maximum sequence length across all datasets.

5. Determine the vocabulary size based on the tokenizer’s word index.

6. Pad sequences to ensure uniform sequence length for training, validation, and test sets.

                                                  

In [5]:
# Tokenize and pad sequences for training, validation, and testing
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train + X_val + X_test)

sequences_train = tokenizer.texts_to_sequences(X_train)
sequences_val = tokenizer.texts_to_sequences(X_val)
sequences_test = tokenizer.texts_to_sequences(X_test)

max_sequence_length = max([len(seq) for seq in sequences_train + sequences_val + sequences_test])
vocab_size = len(tokenizer.word_index) + 1

data_train = pad_sequences(sequences_train, maxlen=max_sequence_length)
data_val = pad_sequences(sequences_val, maxlen=max_sequence_length)
data_test = pad_sequences(sequences_test, maxlen=max_sequence_length)
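
The padding step can be illustrated without Keras. The following is a simplified sketch of the default behavior of pad_sequences (zeros added at the front, and overly long sequences cut from the front). The helper pad_pre and the toy word index are hypothetical and exist only for illustration.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to exactly maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]              # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

# Toy word index; real indices start at 1, so 0 is free to mean "padding"
word_index = {"free": 1, "entry": 2, "ok": 3, "lar": 4}
sequences = [[1, 2], [3, 4, 1], [3]]

max_len = max(len(seq) for seq in sequences)
vocab_size = len(word_index) + 1   # +1 reserves index 0 for the padding value

data = pad_pre(sequences, max_len)
print(data)
print("vocab_size:", vocab_size)
```

This also shows why the vocabulary size is the length of the word index plus one: index 0 never belongs to a word, but it still needs a row in the embedding matrix.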

## Task 6: Train a Word2Vec Model

In this task, you’ll split each text into a list of words and train a Word2Vec model on these word lists using Gensim. The model will learn distributed representations (word embeddings) for the words it sees, using the vector size, window size, and other parameters you specify.

Follow the given steps to complete this task:

1. Split each text in the training, validation, and test sets into individual words, creating a list of sentences.

2. Train a Word2Vec model on the tokenized sentences. Specify the following parameters:

 vector_size for the dimensionality of the word vectors

 window for the maximum distance between the current and predicted word within a sentence

 min_count for ignoring words with a frequency lower than this parameter

 workers for the number of CPU cores to use



In [6]:
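# Split each text on whitespace to get one list of words per email.
# Note: unlike the Keras Tokenizer, str.split() keeps case and punctuation.
sentences = [text.split() for text in X_train + X_val + X_test]

# Train Word2Vec: 100-dimensional vectors, context window of 5,
# keep every word (min_count=1), and use 4 worker threads
word2vec_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, workers=4)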
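
One detail worth checking is whether the tokens used for Word2Vec match those produced by the Keras Tokenizer. str.split() keeps case and punctuation, while the Tokenizer, with its default settings, lowercases text and strips punctuation. The sketch below approximates that default normalization in plain Python to show how the two token lists can differ. The helper keras_like_tokens is hypothetical, and the filter string is assumed to match the Keras default. The sample text is the visible start of the first spam message printed in Task 2.

```python
# Assumed to match the default `filters` argument of the Keras Tokenizer
KERAS_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def keras_like_tokens(text):
    """Lowercase, replace filtered characters with spaces, then split."""
    table = str.maketrans({ch: " " for ch in KERAS_FILTERS})
    return text.lower().translate(table).split()

text = "Free entry in 2 a wkly comp to win FA Cup"

plain_tokens = text.split()
normalized_tokens = keras_like_tokens(text)

# Tokenizer-style words that never appear among the plain split() tokens
missing = [w for w in normalized_tokens if w not in plain_tokens]

print(plain_tokens)
print(normalized_tokens)
print("No exact match in split() output:", missing)
```

A word in this situation has no entry in the Word2Vec vocabulary unless its lowercase, punctuation-free form also occurs somewhere else in the corpus. One option is to train Word2Vec on sentences normalized the same way as the Tokenizer output.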

## Task 7: Prepare the Embedding Matrix

In this task, you’ll create an embedding matrix that can be used as pretrained word embeddings in a neural network. The matrix will be populated with
vectors from the Word2Vec model for words present in both the tokenizer’s word index and the Word2Vec model’s vocabulary.

1. Create a zero-filled embedding matrix with dimensions (vocab_size, vector_size).

2. Iterate over words in the tokenizer’s word index. Check if the word is present in the Word2Vec model’s vocabulary.

    If present, update the corresponding row in the embedding matrix with the word’s vector.

In [8]:
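# Start from zeros; row 0 (padding) and words without a vector stay zero
embedding_matrix = np.zeros((vocab_size, word2vec_model.vector_size))
for word, i in tokenizer.word_index.items():
    # Copy the Word2Vec vector into the row matching the tokenizer index
    if word in word2vec_model.wv:
        embedding_matrix[i] = word2vec_model.wv[word]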

In [11]:
print(embedding_matrix.shape)
print(embedding_matrix)

(8921, 100)
[[ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [-1.91809669e-01  8.61374915e-01  6.14513099e-01 ... -8.58217716e-01
   2.78657973e-01 -1.06515139e-01]
 [-1.97632715e-01  1.07230961e+00  7.87104070e-01 ... -1.10375166e+00
   3.37027788e-01 -9.47562903e-02]
 ...
 [-8.49248748e-03  2.64786626e-03  5.44923940e-04 ... -1.45863518e-02
   1.19863059e-02 -5.92502416e-04]
 [-7.76110636e-03  8.62307753e-03 -6.94647571e-03 ...  3.40426835e-04
  -1.41433452e-03  6.45614741e-03]
 [-5.26425801e-03  1.06634814e-02  1.07753314e-02 ... -1.20559307e-02
   8.59323423e-03  2.39065709e-03]]
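
Rows that stay all zero correspond to index 0 (reserved for padding) and to any word without a Word2Vec vector. A quick coverage check could look like the following sketch, which uses a small made-up matrix in place of the real embedding_matrix:

```python
import numpy as np

# Toy stand-in for embedding_matrix: 5 rows (indices 0-4), 3 dimensions
toy_matrix = np.array([
    [0.0, 0.0, 0.0],    # index 0: padding, always zero
    [0.2, -0.1, 0.5],
    [0.0, 0.0, 0.0],    # a word with no Word2Vec vector
    [-0.3, 0.4, 0.1],
    [0.7, 0.0, -0.2],
])

# A row counts as covered if at least one of its entries is non-zero
has_vector = np.any(toy_matrix != 0, axis=1)

# Skip row 0 because it never corresponds to a real word
n_words = toy_matrix.shape[0] - 1
n_covered = int(has_vector[1:].sum())

print(f"{n_covered} of {n_words} words have a vector")
```

Running the same check on the real matrix would show how many tokenizer words the frozen Embedding layer can actually represent.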


## Task 8: Build an LSTM Model

In this task, you’ll construct an LSTM model with Word2Vec embeddings. You’ll use the Word2Vec embeddings as pretrained weights in the Embedding
layer. The model architecture will include an LSTM layer followed by a Dense layer for classification.

Follow the steps below to complete this task:

1. Create a Sequential model.

2. Add an Embedding layer with Word2Vec embeddings.

  Use pretrained weights from the embedding matrix.

  Keep Word2Vec embeddings fixed during training.

3. Add an LSTM layer with 100 units, dropout, and recurrent dropout.

4. Add a Dense layer with 1 unit and a sigmoid activation function for binary classification.

5. Display the summary of the model.

                                                                  

In [12]:
# Build an LSTM model with Word2Vec embeddings
model = Sequential()
model.add(Embedding(vocab_size,
                    word2vec_model.vector_size,
                    weights=[embedding_matrix],
                    input_length=max_sequence_length,
                    trainable=False))
model.add(LSTM(100, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(1, activation='sigmoid'))
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 189, 100)          892100    
                                                                 
 lstm (LSTM)                 (None, 100)               80400     
                                                                 
 dense (Dense)               (None, 1)                 101       
                                                                 
Total params: 972601 (3.71 MB)
Trainable params: 80501 (314.46 KB)
Non-trainable params: 892100 (3.40 MB)
_________________________________________________________________
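
The parameter counts in the summary can be reproduced by hand. The sketch below uses the standard LSTM parameter formula (four gates, each with input weights, recurrent weights, and a bias); the variable names are illustrative.

```python
vocab_size = 8921      # rows of the embedding matrix
embedding_dim = 100    # Word2Vec vector size
lstm_units = 100

# Embedding: one vector per vocabulary index (frozen, so non-trainable)
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates x (input weights + recurrent weights + bias) per unit
lstm_params = 4 * (embedding_dim + lstm_units + 1) * lstm_units

# Dense: one weight per LSTM unit plus a single bias
dense_params = lstm_units * 1 + 1

print("Embedding:", embedding_params)
print("LSTM:", lstm_params)
print("Dense:", dense_params)
print("Trainable:", lstm_params + dense_params)
print("Total:", embedding_params + lstm_params + dense_params)
```

Since the Embedding layer is frozen, only the LSTM and Dense parameters are updated during training.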


## Task 9: Compile the Model

Now that you’ve completed the preprocessing and model building steps, it’s time to compile and train the model.

In this task, you’ll configure the model for training by choosing an appropriate loss function, optimizer, and metric for monitoring performance.
You’ll use the Adam optimizer, a popular choice for training neural networks efficiently. Training minimizes the
binary cross-entropy loss; accuracy is only reported as a metric and isn’t optimized directly.

To complete this task, configure the model for training using the following configuration:

1. Use binary cross-entropy as the loss function for binary classification.

2. Use the Adam optimizer for gradient descent.

3. Monitor accuracy as a metric during training.

                         

In [13]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
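
To make the loss function concrete, here is a minimal pure-Python sketch of mean binary cross-entropy. The function name, the clipping constant, and the toy probabilities are assumptions made for illustration; Keras computes the loss internally with its own implementation.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over a batch of predictions."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [0, 0, 1, 1]
confident = [0.1, 0.2, 0.9, 0.8]   # mostly right and fairly sure
unsure = [0.5, 0.5, 0.5, 0.5]      # no information at all

loss_confident = binary_cross_entropy(y_true, confident)
loss_unsure = binary_cross_entropy(y_true, unsure)

print(f"confident: {loss_confident:.4f}")
print(f"unsure:    {loss_unsure:.4f}")
```

Predicting 0.5 for every email gives a loss of ln 2 (about 0.693), so a useful model should end up well below that value.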

## Task 10: Train the Model

In this task, you’ll train the model for 10 epochs, with each epoch processing the entire
training dataset in batches of 32. You’ll also pass validation data so that the model’s performance on held-out data is reported at the end of each epoch.

To complete this task, train the model using the following configuration:

1. Provide the preprocessed training data and its corresponding labels as input.

2. Train the model for 10 epochs using a batch size of 32 samples per gradient update.

3. Include validation data along with its labels to monitor performance on a separate dataset during training.

                                                                                        

In [14]:
model.fit(data_train, np.array(y_train), epochs=10, batch_size=32, validation_data=(data_val, np.array(y_val)))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x72672064c490>

## Task 11: Evaluate the Model

In this task, you’ll evaluate the trained model’s performance on the test dataset. You’ll then print the test loss and accuracy metrics to gain
insights into how well the model generalized to the unseen data.

Follow the given steps to complete this task:

1. Evaluate the model’s performance on the test set.

2. Display the test loss and test accuracy obtained from evaluating the model on the test set.



In [15]:
evaluation_results = model.evaluate(data_test, np.array(y_test))
print("Test Loss:", evaluation_results[0])
print("Test Accuracy:", evaluation_results[1])

Test Loss: 0.23975703120231628
Test Accuracy: 0.8971291780471802


## Task 12: Generate Predictions

Now that you’ve trained and evaluated the model, it’s time to put it to the test.

In this task, you’ll use the trained model to generate predictions on the test set. You’ll then convert the predicted probabilities into 
binary predictions.

Follow the given steps to complete this task:

1. Obtain predicted probabilities for each instance in the test set.

2. Transform predicted probabilities into binary predictions. Set the threshold at 0.5, classifying values above as 1 and below as 0.



In [16]:
# Generate predictions on the test set
predictions = model.predict(data_test)
predictions = (predictions > 0.5).astype(int)
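
The thresholding step can be tried on toy values. In the sketch below, the probabilities are made up and shaped as one column per sample, which is the shape a model with a single sigmoid output unit returns.

```python
import numpy as np

# Toy probabilities shaped (n, 1), like the output of a single sigmoid unit
probabilities = np.array([[0.91], [0.08], [0.50], [0.63]])

# Strictly greater than 0.5 becomes 1; everything else (including 0.5) becomes 0
binary = (probabilities > 0.5).astype(int)

# Flatten from shape (n, 1) to (n,) so it lines up with a 1-D label array
binary_flat = binary.ravel()

print(binary.shape, binary_flat.shape)
print(binary_flat)
```

Flattening is optional here, but it keeps the predictions in the same one-dimensional shape as the true labels.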



## Task 13: Print the Classification Report

In this task, you’ll generate and print a classification report. The classification report is a useful tool for evaluating the performance of 
a classification model. It provides detailed metrics such as precision, recall, and F1 score for each class.

To complete this task, generate and print a comprehensive classification report. Pass the true labels and the predicted labels as arguments.



In [17]:
print("Classification Report:")
print(classification_report(np.array(y_test), predictions))

Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.96      0.94       724
           1       0.64      0.52      0.57       112

    accuracy                           0.90       836
   macro avg       0.79      0.74      0.76       836
weighted avg       0.89      0.90      0.89       836
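
The summary rows of the report follow from the per-class rows. As a rough check, the sketch below recomputes a few of them from the rounded figures printed above, so small rounding differences are expected. The dictionary and function names are illustrative.

```python
# Per-class figures copied from the report (already rounded to 2 decimals)
precision = {"ham": 0.93, "spam": 0.64}
recall = {"ham": 0.96, "spam": 0.52}
support = {"ham": 724, "spam": 112}

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

f1_scores = {c: f1(precision[c], recall[c]) for c in precision}

# Macro average: plain mean over classes
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: mean over classes weighted by support
total = sum(support.values())
weighted_precision = sum(precision[c] * support[c] for c in precision) / total

print({c: round(v, 2) for c, v in f1_scores.items()})
print("macro recall:", round(macro_recall, 2))
print("weighted precision:", round(weighted_precision, 2))
```

A spam recall of 0.52 means that roughly half of the spam emails in the test set were missed, which the overall accuracy of 0.90 hides. With 724 of the 836 test emails being ham, always predicting ham would already reach about 0.87 accuracy.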

