## Deep Learning Course Project (Obligatorio) - Semester 2 - 2022

## 1. Setup

### 1.1 Imports

In [None]:
import numpy as np

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Dropout, Embedding, LSTM
import tensorflow as tf

import utils

### 1.2 Set random seeds

In [None]:
np.random.seed(117)
tf.random.set_seed(117)

## 2. Data loading

In [None]:
hdfs_train, hdfs_test_kaggle = utils.read_data()

In [None]:
hdfs_train[:4]

In [None]:
hdfs_test_kaggle[:5]

## 3. Exploratory data analysis

### 3.1 General descriptive analysis: distributions, scatter plots, bar plots

In [None]:
hdfs_train.head()

In [None]:
utils.value_counts(hdfs_train,'class')

### 3.2 Sequence analysis

In [None]:
# TODO: add plots of sequence lengths, per-symbol distributions, etc.
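One possible starting point for this exploration, as a minimal sketch: it summarises sequence lengths and symbol frequencies using only the standard library. The function `describe_sequences` and the toy data are hypothetical; in this notebook the input would be `raw_sequences`, which is only loaded in the following cell.

```python
from collections import Counter
from statistics import mean, median


def describe_sequences(sequences):
    """Summarise lengths and symbol frequencies of a list of integer sequences."""
    lengths = [len(s) for s in sequences]
    # How many sequences have each length.
    length_counts = Counter(lengths)
    # How often each symbol occurs across all sequences.
    symbol_counts = Counter(sym for s in sequences for sym in s)
    return {
        "n_sequences": len(sequences),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "mean_len": mean(lengths),
        "median_len": median(lengths),
        "length_counts": dict(sorted(length_counts.items())),
        "symbol_counts": dict(symbol_counts.most_common()),
    }


# Toy data standing in for raw_sequences (hypothetical values).
toy_sequences = [[5, 5, 22], [5, 11, 9, 9], [22], [5, 5, 5, 11, 9, 26]]
summary = describe_sequences(toy_sequences)
print(summary["length_counts"])
print(summary["symbol_counts"])
```

The two dictionaries can be passed directly to a bar plot to visualise the length and symbol distributions.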

In [None]:
raw_sequences, data_y = utils.load_sequences_and_target(hdfs_train, one_hot=True)
#raw_sequences, data_y = utils.load_sequences_and_target(hdfs_train)

In [None]:
data_y.value_counts()

In [None]:
min([min(s) for s in raw_sequences])

In [None]:
max([max(s) for s in raw_sequences])

In [None]:
max([len(s) for s in raw_sequences])

#### The value of vocab_size matters because it is the dimensionality of the language (the size of the symbol vocabulary)

In [None]:
vocab_size = max([max(s) for s in raw_sequences]) + 1

#### We set the maximum sequence length arbitrarily (is this size reasonable?)

In [None]:
max_len = 10
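One way to approach the question of whether this length is reasonable is to measure how many sequences fit within it without truncation. The sketch below is illustrative only: `length_coverage`, `length_percentile`, and the toy lengths are hypothetical names and values, and in the notebook the input would be `raw_sequences`.

```python
import math


def length_coverage(sequences, max_len):
    """Fraction of sequences that fit within max_len without truncation."""
    fits = sum(1 for s in sequences if len(s) <= max_len)
    return fits / len(sequences)


def length_percentile(sequences, q):
    """Smallest length L such that at least a fraction q of sequences have len <= L."""
    lengths = sorted(len(s) for s in sequences)
    # Index of the first position where the cumulative fraction reaches q.
    idx = max(0, math.ceil(q * len(lengths)) - 1)
    return lengths[idx]


# Toy sequences with hypothetical lengths, standing in for raw_sequences.
toy_sequences = [[1] * n for n in (2, 3, 5, 8, 12, 20, 4, 6, 9, 10)]
coverage_at_10 = length_coverage(toy_sequences, 10)
median_length = length_percentile(toy_sequences, 0.5)
p95_length = length_percentile(toy_sequences, 0.95)
print(coverage_at_10, median_length, p95_length)
```

If the coverage for the chosen `max_len` is low, a higher percentile of the length distribution may be a less arbitrary choice, at the cost of longer padded inputs.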

#### We pad the sequences with 0 so that they all have the same length

In [None]:
padded_sequences = utils.pad_sequences(raw_sequences, max_len)
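The behaviour of `utils.pad_sequences` is not shown here, so the following is only a sketch of what zero-padding to a fixed length typically involves. The function `pad_to_length` and its `side` parameter are hypothetical, and whether the course helper pads or truncates at the start or at the end is an assumption to verify.

```python
def pad_to_length(sequences, max_len, pad_value=0, side="pre"):
    """Pad (or truncate) every sequence to exactly max_len items.

    side="pre": pad at the start and keep the last max_len items.
    side="post": pad at the end and keep the first max_len items.
    """
    padded = []
    for seq in sequences:
        seq = list(seq)
        if side == "pre":
            seq = seq[-max_len:]  # truncate from the front
            padded.append([pad_value] * (max_len - len(seq)) + seq)
        else:
            seq = seq[:max_len]  # truncate from the back
            padded.append(seq + [pad_value] * (max_len - len(seq)))
    return padded


# Toy input (hypothetical values).
toy = [[1, 2, 3], [4, 5, 6, 7, 8, 9]]
pre_padded = pad_to_length(toy, 4)
post_padded = pad_to_length(toy, 4, side="post")
print(pre_padded)
print(post_padded)
```

Note that if the minimum symbol id computed earlier is 0, the padding value and that symbol become indistinguishable to the model; shifting all symbol ids by one before padding is one way to avoid the collision.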

## 4. Training the language model

### 4.1. Data preprocessing
#### 4.1.1 Data splitting

In [None]:
X_train, X_test, X_val, y_train, y_test, y_val = utils.split(padded_sequences, data_y)
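The internals of `utils.split` are not shown, so the sketch below only illustrates one simple way to produce a train/test/validation partition in the same return order. The function `split_three_way`, its fractions, and the toy data are hypothetical, and this version is a plain random split, not a stratified one.

```python
import random


def split_three_way(X, y, val_frac=0.2, test_frac=0.2, seed=117):
    """Shuffle indices once, then cut them into test, validation and train parts."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_frac)
    n_val = round(len(X) * val_frac)
    test_idx = indices[:n_test]
    val_idx = indices[n_test:n_test + n_val]
    train_idx = indices[n_test + n_val:]

    def pick(data, idx):
        return [data[i] for i in idx]

    # Same return order as the call above: train, test, val for X, then for y.
    return (pick(X, train_idx), pick(X, test_idx), pick(X, val_idx),
            pick(y, train_idx), pick(y, test_idx), pick(y, val_idx))


# Toy data (hypothetical): 20 samples with alternating labels.
toy_X = list(range(20))
toy_y = [i % 2 for i in toy_X]
X_tr, X_te, X_va, y_tr, y_te, y_va = split_three_way(toy_X, toy_y)
print(len(X_tr), len(X_te), len(X_va))
```

With strongly imbalanced classes, a stratified split that preserves the class ratio in each partition is usually preferable to a purely random one.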

### 4.2 RNN

In [None]:
import math

optimizer = 'adam'
loss = 'categorical_crossentropy'
# Rule of thumb: embedding dimension is roughly the fourth root of the vocabulary size
embedding_size = math.ceil(vocab_size**0.25)
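To get a feel for the fourth-root rule used for `embedding_size`, the short sketch below evaluates it for a few vocabulary sizes. The sizes are hypothetical and `embedding_size_rule` is just a wrapper around the same expression.

```python
import math


def embedding_size_rule(vocab_size):
    """Fourth-root rule of thumb, rounded up to the next integer."""
    return math.ceil(vocab_size ** 0.25)


# Hypothetical vocabulary sizes, for illustration only.
sizes = {v: embedding_size_rule(v) for v in (30, 100, 1000)}
print(sizes)
```

The rule grows very slowly, so for a small vocabulary it yields only a handful of dimensions; the embedding size is therefore a natural candidate to include in the hyperparameter search.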

In [None]:
model = Sequential()
model.add(Embedding(vocab_size+1, embedding_size, input_length=max_len))
model.add(LSTM(64, return_sequences=False))
model.add(Dense(2, activation='softmax'))
model.compile(loss=loss, optimizer=optimizer, metrics=['accuracy'])
model.summary()

### 4.3 Training

#### 4.3.1 Hyperparameters

In [None]:
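batch_size = 10
epochs = 1      # a single epoch is only a smoke test; raise it for real training
patience = 10   # assuming this is early-stopping patience, it has no effect unless epochs > patience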

In [None]:
training1, model1 = utils.train(
    model,
    X_train,
    y_train,
    batch_size=batch_size,
    epochs=epochs,
    validation_data_X=X_val,
    validation_data_y=y_val,
    patience=patience,
    class_weights=None,
)
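The call above passes `class_weights = None`. If the classes turn out to be imbalanced, one common option is to weight each class inversely to its frequency. The sketch below computes such "balanced" weights; `balanced_class_weights` and the toy labels are hypothetical, and how `utils.train` consumes the weights is an assumption to check against the helper.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy integer labels (hypothetical): 90 samples of class 0, 10 of class 1.
toy_labels = [0] * 90 + [1] * 10
weights = balanced_class_weights(toy_labels)
print(weights)
```

Because the targets here are loaded with `one_hot=True`, they would first have to be converted back to integer class indices (for example with an argmax per row) before counting.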

### 4.4 Model evaluation

In [None]:
utils.eval_model(training1, model1, X_test, y_test)

## 5. Generating the output for the Kaggle competition

In [None]:
utils.load_test_sequences_and_generate_prediction_file(model1, hdfs_test_kaggle, max_len)

## 6. Assignment

### A) Participation in the Kaggle competition:
The goal of this item is to take part in the Kaggle competition and obtain, at minimum, a Macro Average Recall (or Weighted Accuracy) above 80%. [Link to the competition](https://www.kaggle.com/t/6d15e3a96bd049b2b4b2a491a69a0fc7).
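Since the model above is compiled with plain `accuracy`, it may help to track the competition metric directly. Macro average recall is the unweighted mean of the per-class recalls, which the sketch below computes from integer labels. The function name and the toy labels are hypothetical.

```python
def macro_average_recall(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    classes = sorted(set(y_true))
    recalls = []
    for cls in classes:
        n_true = sum(1 for t in y_true if t == cls)
        n_hit = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        recalls.append(n_hit / n_true)
    return sum(recalls) / len(recalls)


# Toy labels (hypothetical): 8 samples of class 0 and 2 of class 1.
y_true = [0] * 8 + [1] * 2
always_zero = [0] * 10
mixed = [0] * 6 + [1] * 2 + [1, 0]

accuracy_always_zero = sum(t == p for t, p in zip(y_true, always_zero)) / len(y_true)
recall_always_zero = macro_average_recall(y_true, always_zero)
recall_mixed = macro_average_recall(y_true, mixed)
print(accuracy_always_zero, recall_always_zero, recall_mixed)
```

The first toy prediction shows why accuracy can mislead on imbalanced data: always predicting the majority class reaches 80% accuracy but only 50% macro average recall.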

### B) Use of Grid Search (or equivalent):
To find optimal models, carry out a grid search that is as comprehensive and methodical as possible.
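A grid search can be organised as an exhaustive loop over the Cartesian product of candidate values. The sketch below shows only that skeleton: `grid_search`, the parameter names in `param_grid`, and `toy_evaluate` are all hypothetical. In practice `toy_evaluate` would be replaced by a function that builds, trains, and scores a model on the validation set.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Evaluate every combination in param_grid; return (score, params) sorted best first."""
    names = list(param_grid)
    results = []
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        results.append((evaluate(params), params))
    # Higher score is better; the sort is stable, so ties keep grid order.
    results.sort(key=lambda item: item[0], reverse=True)
    return results


def toy_evaluate(params):
    """Stand-in score (hypothetical): peaks at lstm_units=64 and max_len=50."""
    return (-abs(params["lstm_units"] - 64) / 64
            - abs(params["max_len"] - 50) / 50)


param_grid = {
    "lstm_units": [32, 64, 128],
    "max_len": [10, 50],
    "batch_size": [32, 128],
}
results = grid_search(param_grid, toy_evaluate)
best_score, best_params = results[0]
print(len(results), best_params)
```

Keeping the full sorted `results` list, rather than only the best entry, makes it easier to report the search methodically and to see which hyperparameters actually matter.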

### C) In addition, research and implement at least 2 of the following techniques:
#### 1. [Batch Normalization](https://machinelearningmastery.com/how-to-accelerate-learning-of-deep-neural-networks-with-batch-normalization/)
#### 2. [Data Augmentation through Windowing](https://blog.finxter.com/how-to-loop-through-a-python-list-in-batches/#Method_1_Iterating_over_Consecutive_Sliding_Windows)
#### 3. [Gradient Normalization y/o Gradient Clipping](https://machinelearningmastery.com/how-to-avoid-exploding-gradients-in-neural-networks-with-gradient-clipping/)
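As an illustration of the windowing option, the sketch below cuts each sequence into overlapping fixed-size windows and gives every window the label of its parent sequence. The function names, the toy data, and in particular the label-inheritance rule are assumptions made for this example, not part of the assignment.

```python
def sliding_windows(sequence, window_size, stride=1):
    """Return consecutive windows of window_size items, advancing by stride."""
    if len(sequence) <= window_size:
        # Too short to slide: keep the sequence as a single window.
        return [list(sequence)]
    return [list(sequence[i:i + window_size])
            for i in range(0, len(sequence) - window_size + 1, stride)]


def augment_with_windows(sequences, labels, window_size, stride=1):
    """Expand each sequence into its windows; each window inherits the parent label."""
    aug_X, aug_y = [], []
    for seq, label in zip(sequences, labels):
        for window in sliding_windows(seq, window_size, stride):
            aug_X.append(window)
            aug_y.append(label)
    return aug_X, aug_y


# Toy data (hypothetical values).
windows = sliding_windows([1, 2, 3, 4, 5], 3)
strided = sliding_windows([1, 2, 3, 4, 5], 3, stride=2)
aug_X, aug_y = augment_with_windows([[1, 2, 3, 4], [7, 8]], [1, 0], 3)
print(windows)
print(aug_X, aug_y)
```

Two caveats apply: a window may not contain the part of the sequence that justified the parent label, and all windows from one sequence should stay in the same partition to avoid leakage between training and evaluation data.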
