# S09 Lab Exercise 

## Víctor Vega Sobral

### Explanations 
The attached files are a collection of tweets labelled with sentiment in 3 categories:

```python
sentiments = {
    "LABEL_0": "Bearish",
    "LABEL_1": "Bullish",
    "LABEL_2": "Neutral"
}
```

1. Train an LSTM network with the training file.

2. Validate the trained model with the valid file. 

3. Comment what you are doing in each part of your code. The better the code, comments and result validation, the better the grade.

Remember that you have to submit the final file for this exercise, and the file must be in your digital portfolio with all the proper commits done.

- ``sent_train.csv`` 
- ``sent_valid.csv`` 



In [46]:
__author__ = "Victor Vega Sobral"

---

### 1. Importing necessary Libraries and Constant Definitions

In [47]:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.metrics import Precision, Recall

In [48]:
NUM_EPOCHS = 10
BATCH_SIZE = 32
VAL_SPLIT = 0.1

---

### 2. Loading the Datasets

The next step is to load the two datasets into:

- `train_df`: Dataframe with the training set.

- `valid_df`: Dataframe with the validation set.

---

#### 2.1 Train dataset

In [49]:
# Training dataset
train_df = pd.read_csv("data/sent_train.csv")
train_df.info()

X_train = train_df["text"]
y_train = train_df["label"]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9543 entries, 0 to 9542
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    9543 non-null   object
 1   label   9543 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 149.2+ KB


#### 2.2 Validation dataset (used as the test set)

In [50]:
# Validation dataset
valid_df = pd.read_csv("data/sent_valid.csv")
valid_df.info()

X_test = valid_df["text"]
y_test = valid_df["label"]

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2388 entries, 0 to 2387
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   text    2388 non-null   object
 1   label   2388 non-null   int64 
dtypes: int64(1), object(1)
memory usage: 37.4+ KB


---

### 3. Train/test split (explanation)

Since the training and validation sets already come as two separate CSV files, this step is already done. Otherwise, scikit-learn's `train_test_split` function could be used.
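
For reference, a minimal sketch of how such a split could look when only one file is available. The toy `texts` and `labels` lists are placeholders, not the tweet data, and the 20% test size is an arbitrary choice.

```python
from sklearn.model_selection import train_test_split

# Toy data standing in for the tweets: 30 texts, 10 per class (0, 1, 2).
texts = [f"tweet {i}" for i in range(30)]
labels = [i % 3 for i in range(30)]

# Hold out 20% of the rows. stratify keeps the class proportions
# the same in both parts, and random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, stratify=labels, random_state=42
)

print(len(X_tr), len(X_te))
```

Stratifying matters for sentiment data like this, where the classes are usually not equally frequent.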

---

### 4. Tokenization and Padding

Before training the LSTM, we need to convert the training and test texts into sequences of integers using the Keras tokenizer.

* `max_words` (passed to the tokenizer as `num_words`): size of the vocabulary. Only the `num_words - 1` most frequent words are kept; the rest are dropped.
* `max_len`: length that every sequence is padded or truncated to. This is the *padding* step.
* `embedding_dim`: number of dimensions of each embedding vector.
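
To make these two steps concrete, here is a simplified pure-Python sketch of what they do. It is not the Keras implementation: the helper names `fit_word_index`, `to_sequence` and `pad` are made up for illustration, and the sketch only splits on whitespace, whereas the Keras `Tokenizer` also strips punctuation by default. It does follow the Keras defaults of padding and truncating at the start of the sequence.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; index 1 is the most frequent (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequence(text, word_index, num_words):
    """Map words to indices, dropping unseen words and those ranked num_words or higher."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

def pad(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items if the sequence is too long."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

word_index = fit_word_index(["stocks rally today", "stocks fall"])
seq = to_sequence("stocks rally hard", word_index, num_words=10)

print(word_index)
print(pad(seq, maxlen=5))
```

The word "hard" never appeared when the index was built, so it is silently dropped. The same happens to any word in the validation tweets that the tokenizer did not see in the training set.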

In [51]:
############ TOKENIZATION ############
max_words = 10000
max_len = 100
embedding_dim = 64 # Number of dimensions the embedding vectors will have

# Instantiating the tokenizer and fitting it on the training texts only
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train)

# Converting texts to numeric sequences.
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

##################### PADDING ####################
X_train_pad = pad_sequences(X_train_seq, maxlen = max_len)
X_test_pad = pad_sequences(X_test_seq, maxlen = max_len)

---

### 5. LSTM Construction

An LSTM classifier typically has this basic architecture:

1. **Embedding layer**: converts each word, represented as an integer, into a dense vector.

2. **LSTM layer**: processes the sequences and captures dependencies over time. Wrapping it in `Bidirectional` is often recommended but, depending on the case, the noise this adds can produce worse results.

3. **Final dense layer**: for class prediction. In this case, I'll use `softmax`.

---

#### 5.1 Loss function: categorical cross-entropy

Categorical cross-entropy is widely used for multi-class classification, including with LSTMs. To use it, however, we first need to encode the labels as binary vectors, that is, **one-hot encoding**.
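
As an illustration, the NumPy sketch below one-hot encodes two labels and computes the categorical cross-entropy by hand. The helper names and the probability values are made up for the example; Keras does the equivalent internally when `loss = "categorical_crossentropy"` is used.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels into rows with a single 1 at the label's position."""
    return np.eye(num_classes)[labels]

def categorical_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean over samples of -sum_k y_true[k] * log(y_prob[k])."""
    y_prob = np.clip(y_prob, eps, 1.0)  # avoid log(0)
    return float(-np.mean(np.sum(y_true * np.log(y_prob), axis=1)))

y_true = one_hot(np.array([0, 2]), num_classes=3)
y_prob = np.array([[0.7, 0.2, 0.1],   # true class 0 gets 0.7
                   [0.1, 0.6, 0.3]])  # true class 2 only gets 0.3

loss = categorical_cross_entropy(y_true, y_prob)
print(y_true)
print(round(loss, 4))
```

Because each one-hot row has a single 1, only the probability assigned to the true class contributes to the loss, so the second sample is penalized more than the first.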

In [52]:
label_enc = LabelEncoder()

y_train_encoded = label_enc.fit_transform(y_train)
# Reuse the mapping fitted on the training labels instead of refitting on the test labels
y_test_encoded = label_enc.transform(y_test)

# Converting to one-hot
y_train_cat = to_categorical(y_train_encoded)
y_test_cat = to_categorical(y_test_encoded)

# Verify the number of classes using the classes_ attribute
# of the fitted label encoder
num_clases = len(label_enc.classes_)

print("Number of classes: ", num_clases)

Number of classes:  3


#### 5.2 LSTM Architecture

In this cell I will build the LSTM architecture described in section 5.

In [53]:
########### Model Architecture ############
model = Sequential([
    # Embedding layer
    Embedding(input_dim = max_words, 
              output_dim = embedding_dim,
              input_length = max_len),
    #########

    # LSTM Layer
    Bidirectional(LSTM(embedding_dim)),
    ###########

    # Dense layer
    Dense(num_clases, activation = "softmax")
    ###########
])

############ Model Optimizer ###############

### Adam optimizer; Precision and Recall are tracked in addition to accuracy
model.compile(optimizer ="adam",
              loss = "categorical_crossentropy",
              metrics = ["accuracy", Precision(), Recall()])
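
As a sanity check on the size of this architecture, the trainable parameters of each layer can be estimated by hand. This is a sketch using the standard formulas for embedding, LSTM and dense layers; the helper function names are made up, and it assumes the default `Bidirectional` behaviour of concatenating the forward and backward outputs.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of length embedding_dim per vocabulary index.
    return vocab_size * embedding_dim

def lstm_params(input_dim, units):
    # Four weight sets (three gates plus the candidate cell state),
    # each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

max_words, embedding_dim, num_classes = 10000, 64, 3

emb = embedding_params(max_words, embedding_dim)
bilstm = 2 * lstm_params(embedding_dim, embedding_dim)  # forward + backward LSTM
dense = dense_params(2 * embedding_dim, num_classes)    # both directions concatenated

print(emb, bilstm, dense, emb + bilstm + dense)
```

Most of the parameters sit in the embedding layer, so `max_words` and `embedding_dim` are the main levers on model size.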

---

#### 5.3 LSTM Training

In [None]:


history = model.fit(X_train_pad, y_train_cat,
                    epochs = NUM_EPOCHS, # epoch numbers
                    batch_size = BATCH_SIZE, # batch size
                    validation_split = VAL_SPLIT, # percentage of training data 
                                                # used for validation
                    )

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
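
The object returned by `fit` keeps the per-epoch metrics in `history.history`, a dictionary of lists. A small helper like the one below could be used to find the epoch where the validation loss was lowest. The function name `best_epoch` and the numbers in `fake_history` are invented for illustration and are not results from this notebook.

```python
def best_epoch(history_dict, metric="val_loss"):
    """Return (1-based epoch, value) where the given metric is lowest."""
    values = history_dict[metric]
    idx = min(range(len(values)), key=values.__getitem__)
    return idx + 1, values[idx]

# Made-up numbers shaped like `history.history`; NOT results from this notebook.
fake_history = {
    "loss":     [0.90, 0.55, 0.35, 0.22, 0.15],
    "val_loss": [0.80, 0.62, 0.60, 0.71, 0.85],
}

epoch, value = best_epoch(fake_history)
print(f"Lowest val_loss {value:.2f} at epoch {epoch}")
```

With the real training run this would be called as `best_epoch(history.history)`. A validation loss that starts rising while the training loss keeps falling is a common sign of overfitting.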



---

#### 5.4 LSTM Performance on Test Set

In [None]:
loss, accuracy, precision, recall = model.evaluate(X_test_pad, y_test_cat)
print(f"Test Loss: {loss:.4f}, Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}")


Test Loss: 1.3546, Accuracy: 0.7567, Precision: 0.7599, Recall: 0.7542
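
A single precision and recall figure hides how each sentiment class behaves. The sketch below shows one way to break the results down per class with a confusion matrix. The `probs` array is made up for illustration and is not output from the trained model, and mapping the integer labels 0, 1, 2 to Bearish, Bullish, Neutral is an assumption based on the `LABEL_0` to `LABEL_2` names in the exercise statement.

```python
import numpy as np

# Assumed mapping from integer labels to sentiment names.
sentiments = {0: "Bearish", 1: "Bullish", 2: "Neutral"}

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Made-up softmax outputs for five tweets; NOT predictions from the trained model.
probs = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.5, 0.3],
                  [0.1, 0.2, 0.7],
                  [0.3, 0.3, 0.4],
                  [0.1, 0.6, 0.3]])
y_true = np.array([0, 1, 2, 0, 2])
y_pred = probs.argmax(axis=1)  # predicted class = highest probability

cm = confusion_matrix(y_true, y_pred, num_classes=3)
for k, name in sentiments.items():
    tp = cm[k, k]
    precision = tp / max(cm[:, k].sum(), 1)  # of all tweets predicted as k, how many were k
    recall = tp / max(cm[k, :].sum(), 1)     # of all tweets that are k, how many were found
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

In the notebook, `probs` would come from `model.predict(X_test_pad)` and `y_true` from the integer labels in `y_test`.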
