# LSTM classificator

Using the dataset `dataset_emails.csv` (or the same dataset you have used in S08_1) create a some text classificators:
* LSTM
* GRU 

Compare the results between LSTM and GRU. Compare the results with the S08_1 methods. 


In [39]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, GRU, Bidirectional
from tensorflow.keras.metrics import Precision, Recall

## 1. Load the Dataset

---

In [40]:
df = pd.read_csv("dataset_emails.csv")

df.head()

Unnamed: 0,prompt,label
0,"Can I send an email, please?",send
1,I'd like to compose an email.,send
2,I need to send an email.,send
3,Could you help me write an email?,send
4,Is it possible to send an email with you?,send


In [41]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   prompt  1000 non-null   object
 1   label   1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


## 2. Dividing in train and test split
---

In [42]:
X = df["prompt"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state= 42)

print("Train samples: ", len(X_train))
print("Test samples: ", len(X_test))

Train samples:  800
Test samples:  200


## 3. Tokenization and Padding

We need to convert the test to a sequence of numbers, using the Keras´tokenizer.

* *num_words*: define the maximum number of words to take into account. Example: 10000 most frequent words.
* *max_len*: maximum length of each sequence. If a sequence is shorter, it will be filled; if longer, will be clipped.

This step is important for having data prepared for LSTM nets.

In [43]:
# Tokenization parameters

max_words = 10000 # Maximum of words to take into consideration
max_len = 100 # Maximum len sequence

# Instanciate and adjust tokenizator over train set

tokenizer = Tokenizer(num_words = max_words)
tokenizer.fit_on_texts(X_train)

# Convert texts to numeric sequences
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Applying padding for obtaining fixed len sequences
X_train_pad = pad_sequences(X_train_seq,maxlen = max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen = max_len)



---

## 4. LSTM Construction

A basic model architecture would be the following: 

1. **An embedding layer**: converts each word, represented as an integer, into a dense vector.
2. **A LSTM layer**: process the sequences and captures dependencies over time.
3. **A final dense layer**: for class prediction (for example, sigmoid activation for binary classification or softmax for multi-class classification).

### 4.1 One-hot encoding

I will turn labels into binary vectors using loss function `categorical_crossentropy`.


In [44]:
# Codifying labels 

le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.fit_transform(y_test)

# Converting to one-hot
y_train_cat = to_categorical(y_train_enc)
y_test_cat = to_categorical(y_test_enc)

# Number of classes

num_classes = len(le.classes_)

print(f"Number of classes is {num_classes}")


Number of classes is 10


### 4.2 LSTM Architecture

Optional to put `bidirectional` in the LSTM layer. However, for this case, the precission decreased.


In [45]:
model = Sequential()
model.add(Embedding(input_dim = max_words, output_dim = 128, input_length = max_len)) # Embedding layer
model.add(LSTM(64)) 
model.add(Dense(num_classes, activation = "softmax")) # Output layer ith softmax activation


# Model compilation with categorical cross-entropy 

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy", Precision(), Recall()])

### 4.3 LSTM Training



In [46]:
history = model.fit(X_train_pad, y_train_cat,
                    epochs=10,         # epoch numbers
                    batch_size=32,     # batch size 
                    validation_split=0.1)  # percentage of train set used for validation


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### 4.4 LSTM Performance

In [47]:
loss, accuracy, precision, recall = model.evaluate(X_test_pad, y_test_cat)
print(f"Test Loss: {loss:.4f}, Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}")


Test Loss: 0.5736, Accuracy: 0.8250, Precision: 0.8846, Recall: 0.8050


## 5. GRU

GRU´s implementation is almost the same as the used in LSTM. The only difference is that now, instead of using an LSTM layer, this line should be changed to GRU.

Optional to put `bidirectional` in the GRU layer. However, for this case, the precission decreased.

### 5.1 Defining the architecture

In [48]:
# GRU´s construction
model_gru = Sequential()
model_gru.add(Embedding(input_dim=max_words, output_dim=128, input_length=max_len))
model_gru.add(GRU(64))  # This is the only changed line
model_gru.add(Dense(num_classes, activation='softmax'))  

# Compilación del modelo GRU
model_gru.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy", Precision(), Recall()])



### 5.2 Training GRU

In [49]:

# Training
history_gru = model_gru.fit(X_train_pad, y_train_cat,
                            epochs=10,
                            batch_size=32,
                            validation_split=0.1)


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Evaluating GRU

In [50]:
# Evaluation on test set 
loss_gru, acc_gru, prec_gru, rec_gru = model_gru.evaluate(X_test_pad, y_test_cat)
print(f"GRU Test Loss: {loss_gru:.4f}, Accuracy: {acc_gru:.4f}, Precision: {prec_gru:.4f}, Recall: {rec_gru:.4f}")

GRU Test Loss: 0.5857, Accuracy: 0.8350, Precision: 0.8864, Recall: 0.7800


--- 

## Comparging LSTM and GRU performance, best execution


| Model  | Test Loss | Accuracy | Precision | Recall | F1-score |
|--------|----------|----------|-----------|--------|----------|
| **LSTM** | 0.5140 | 84.5% | 91.8% | 78.5% | 0.847 |
| **GRU**  | 0.5697 | 85.5% | 88.9% | 80.0% | 0.842 |

Both models performed similarly. LSTM achieved a higher precision, while GRU had a slight advantage in accuracy and recall. The F1-scores are very close, indicating a balanced performance in both cases.


---

## Comparing LSTM and GRU with exercises_before metrics

``**Disclaimer**``: three last models do not have a test loss because these models do not output a loss value in the same way as neural networks.

| Model           | Test Loss | Accuracy | Precision | Recall | F1-score |
|-----------------|-----------|----------|-----------|--------|----------|
| **LSTM**        | 0.5140    | 84.5%    | 91.8%     | 78.5%  | 0.847    |
| **GRU**         | 0.5697    | 85.5%    | 88.9%     | 80.0%  | 0.842    |
| **Rule-based**  | N/A       | 38.0%    | 56.0%     | 37.0%  | 0.39     |
| **Naive Bayes** | N/A       | 74.7%    | 76.0%     | 77.0%  | 0.74     |
| **spaCy**       | N/A       | 84.3%    | 86.0%     | 86.0%  | 0.85     |

### Observations  
- **LSTM and GRU** achieved the highest performance, with **GRU slightly outperforming LSTM in accuracy and recall**, while LSTM had better precision.  
- **The Rule-Based Classifier performed poorly**, showing 38% accuracy and much lower performance in precision, recall, and F1-score.  
- **Naive Bayes performed decently**, but significantly worse than deep learning models, especially in recall.  
- **The spaCy classifier came close to LSTM/GRU** in accuracy and F1-score, making it a strong alternative with potentially lower computational cost.
