# LSTM classifier

Using the dataset `dataset_emails.csv` (or the same dataset you used in S08_1), create two text classifiers:
* LSTM
* GRU 

Compare the results of the LSTM and the GRU, then compare both with the S08_1 methods.


## 1. Load the Dataset

---

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset_emails.csv")

df.head()


Unnamed: 0,prompt,label
0,"Can I send an email, please?",send
1,I'd like to compose an email.,send
2,I need to send an email.,send
3,Could you help me write an email?,send
4,Is it possible to send an email with you?,send


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   prompt  1000 non-null   object
 1   label   1000 non-null   object
dtypes: object(2)
memory usage: 15.8+ KB


## 2. Splitting into train and test sets
---

In [3]:
X = df["prompt"]
y = df["label"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Train samples: ", len(X_train))
print("Test samples: ", len(X_test))

Train samples:  800
Test samples:  200


## 3. Tokenization and Padding

We need to convert the text to sequences of numbers using the Keras tokenizer.

* *num_words*: the maximum number of words to take into account, ranked by frequency. Example: the 10000 most frequent words.
* *max_len*: the maximum length of each sequence. If a sequence is shorter, it is padded with zeros; if it is longer, it is truncated. By default, `pad_sequences` does both at the start of the sequence.

This step prepares the data for the LSTM networks, which here receive fixed-length integer sequences.
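To make the two parameters concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: the helper names `fit_vocab`, `to_sequence` and `pad` are hypothetical, and the text is split on whitespace only, whereas the Keras `Tokenizer` also filters punctuation. It assumes the Keras defaults of reserving index 0 for padding, giving the most frequent word index 1, dropping unknown words, and padding/truncating at the start of the sequence.

```python
from collections import Counter


def fit_vocab(texts, num_words):
    """Map each word to an integer index, most frequent word first (index 1)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = [word for word, _ in counts.most_common()]
    # Index 0 is reserved for padding; only indices below num_words are kept.
    return {word: i for i, word in enumerate(ranked, start=1) if i < num_words}


def to_sequence(text, vocab):
    """Replace each known word by its index; unknown words are dropped."""
    return [vocab[word] for word in text.lower().split() if word in vocab]


def pad(seq, max_len):
    """Truncate at the start if too long, left-pad with zeros if too short."""
    seq = seq[-max_len:]
    return [0] * (max_len - len(seq)) + seq


# Toy corpus (illustrative, not taken from dataset_emails.csv)
toy_texts = ["send an email", "read an email now"]
toy_vocab = fit_vocab(toy_texts, num_words=10)

short_padded = pad(to_sequence("send an email", toy_vocab), max_len=5)
long_clipped = pad(to_sequence("read an email now", toy_vocab), max_len=3)
with_unknown = to_sequence("delete an email", toy_vocab)
```

In this toy run, the short sentence gains leading zeros, the long one loses its first word, and the unseen word "delete" disappears from its sequence.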

In [8]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Tokenization parameters

max_words = 10000  # Maximum number of words to take into account
max_len = 100  # Maximum sequence length

# Instantiate the tokenizer and fit it on the training set only

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

# Convert texts to numeric sequences
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Apply padding to obtain fixed-length sequences
X_train_pad = pad_sequences(X_train_seq, maxlen=max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen=max_len)

