<a href="https://colab.research.google.com/github/albertofalco/M72/blob/main/M72_09_Actividad_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

```
ME72: Maestría en Métodos Cuantitativos para la Gestión y Análisis de Datos
M72109: Gestión de datos no estructurados
Universidad de Buenos Aires - Facultad de Ciencias Económicas (UBA-FCE)
Year: 2023

Instructor: Facundo Santiago

Student: Alberto Falco
```

Activity 2: Sequence-based models with Word2Vec
===============================================

Introduction
------------

Sequence-based models have the strength of taking a sequence of tokens (in a given order) and producing an output that depends on the type of problem:
 - Seq2Class: take a sequence of tokens and produce a class.
 - Seq2Seq: take a sequence of tokens and produce another sequence of tokens.

We saw how to build a sequence model using `Word2Vec` and LSTM networks. But do you think we achieved good performance?

In this activity we propose exploring how to improve this model's performance.

### To run this notebook

To run this notebook, download the dataset and the helper modules, and install the required libraries:

In [None]:
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/NLP/Datasets/mascorpus/tweets_marketing.csv \
    --quiet --no-clobber --directory-prefix ./Datasets/mascorpus/
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/normalization.py \
    --quiet --no-clobber --directory-prefix ./m72109/nlp/
!wget https://raw.githubusercontent.com/santiagxf/M72109/master/m72109/nlp/transformation.py \
    --quiet --no-clobber --directory-prefix ./m72109/nlp/

!wget https://raw.githubusercontent.com/albertofalco/M72/main/09/activity_2/requirements.txt \
    --quiet --no-clobber
!pip install -r requirements.txt --quiet

Download the Spanish Word2Vec vectors:

In [None]:
!mkdir -p ./Models/Word2Vec
!wget https://santiagxf.blob.core.windows.net/public/Word2Vec/model-es.bin \
    --quiet --no-clobber --directory-prefix ./Models/Word2Vec

In [None]:
import warnings
warnings.filterwarnings('ignore')

Install the remaining dependencies (the Spanish spaCy model):

In [None]:
!python -m spacy download es_core_news_sm

Load the dataset:

In [None]:
import pandas as pd

tweets = pd.read_csv('Datasets/mascorpus/tweets_marketing.csv')

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(tweets['TEXTO'], tweets['SECTOR'],
                                                    test_size=0.33,
                                                    stratify=tweets['SECTOR'])
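
To illustrate what `stratify` asks for, here is a minimal pure-Python sketch of a stratified split. It is a simplified stand-in, not scikit-learn's implementation: the function `stratified_split`, the seed, and the toy label proportions are all assumptions made for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.33, seed=0):
    """Return (train_idx, test_idx) keeping each class's share roughly equal in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)  # same fraction taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy labels with made-up proportions: 60 RETAIL, 30 BANCA, 10 TELCO.
labels = ["RETAIL"] * 60 + ["BANCA"] * 30 + ["TELCO"] * 10
train_idx, test_idx = stratified_split(labels)

train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
```

Because the test fraction is taken class by class, a rare sector such as TELCO cannot end up missing from either part, which matters when per-class metrics are reported.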

Directions
----------

How can you improve the performance of the original model we built in class? Explore different alternatives that lead to better performance. In particular:

- Replace the LSTM layer with a bidirectional layer. Does it improve?
- What about preprocessing? Would it help to change anything?
    - Hint: explore the parameters of `TweetTextNormalizer`.
    
Make the changes you consider appropriate and check which ones improve performance. Use the following solution structure as a guide, but feel free to explore others.

> **Important:** Hyperparameter tuning is not required to solve this exercise; just use your intuition to introduce changes that should lead to a better result.

Iteration 1: Original
---------

In [None]:
# Text preprocessing
from m72109.nlp.normalization import TweetTextNormalizer

normalizer = TweetTextNormalizer(preserve_case=False,
                                 return_tokens=True,
                                 language='spanish'
                                 )

In [None]:
# Word vectorization
from m72109.nlp.transformation import Word2VecVectorizer

w2v = Word2VecVectorizer(model='Models/Word2Vec/model-es.bin', sequence_to_idx=True)
embedding_weights = w2v.get_weights()

embedding_weights.shape
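
As a rough illustration of what "sequence to index" plus an embedding weight matrix means, the sketch below maps tokens to integer indices and looks up their vectors. The vocabulary, the 3-dimensional vectors, reserving index 0 for padding, and dropping unknown tokens are all assumptions for this toy example; the actual behavior of `Word2VecVectorizer` may differ.

```python
# Hypothetical 4-word vocabulary with 3-dimensional vectors
# (real Word2Vec models use hundreds of dimensions).
toy_vectors = {
    "banco": [0.1, 0.3, -0.2],
    "coche": [0.7, -0.1, 0.4],
    "futbol": [-0.5, 0.2, 0.9],
    "cerveza": [0.0, 0.6, -0.3],
}

# Assumption: index 0 is reserved for padding, so real words start at 1.
word_to_idx = {word: i + 1 for i, word in enumerate(toy_vectors)}

# Row i of the weight matrix holds the vector of the word with index i; row 0 is all zeros.
dim = 3
weights = [[0.0] * dim] + [toy_vectors[word] for word in word_to_idx]

def to_indices(tokens):
    """Map tokens to indices, dropping tokens missing from the vocabulary (an assumption)."""
    return [word_to_idx[t] for t in tokens if t in word_to_idx]

indices = to_indices(["banco", "nuevo", "coche"])
vectors = [weights[i] for i in indices]  # the lookup an Embedding layer performs
```

Here the weight matrix has one row per vocabulary entry plus the padding row, which is the kind of shape `embedding_weights.shape` reports for the real model.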

In [None]:
# Fixing the sequence length
from m72109.nlp.transformation import PadSequenceTransformer

max_seq_len = 100
seq2seq = PadSequenceTransformer(max_len=max_seq_len)
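
The idea behind padding to `max_len` can be sketched in a few lines. This is an illustrative stand-in, not the code of `PadSequenceTransformer`: whether the real transformer pads at the end or at the start, and how it truncates, is not shown in this notebook, so post-padding and tail truncation are assumptions here.

```python
def pad_sequence(indices, max_len, pad_value=0):
    """Truncate to max_len, then right-pad with pad_value (post-padding is an assumption)."""
    trimmed = indices[:max_len]
    return trimmed + [pad_value] * (max_len - len(trimmed))

short = pad_sequence([5, 9, 2], max_len=6)
long = pad_sequence([1, 2, 3, 4, 5, 6, 7, 8], max_len=6)
```

Every tweet ends up with the same length, so a batch can be stored as a rectangular array. Using 0 as the padding value fits the `mask_zero=True` argument of the `Embedding` layer in the model, which tells Keras to ignore those positions.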

In [None]:
# Build a sequence-based model
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input, SpatialDropout1D
from scikeras.wrappers import KerasClassifier

def build_model(sequence_len, vocab_size, emdedding_size, embedding_weights):
    model = Sequential([
        Embedding(vocab_size, emdedding_size,
                  weights=[embedding_weights],
                  trainable=False,
                  mask_zero=True),
        SpatialDropout1D(0.2),
        LSTM(emdedding_size),
        Dense(7, activation='softmax')
    ])

    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# SciKeras wrapper.
estimator = KerasClassifier(
    model=build_model,  # `build_fn` is deprecated in SciKeras
    epochs=50,
    sequence_len=max_seq_len,
    vocab_size=w2v.vocab_size,
    emdedding_size=w2v.emdedding_size,
    embedding_weights=embedding_weights)
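
To get a feel for the size of this architecture, the sketch below counts its trainable parameters using the standard formula for a Keras LSTM layer with bias. The embedding dimension of 300 is a hypothetical value chosen for the example; the real one is whatever `w2v.emdedding_size` returns.

```python
def lstm_params(input_dim, units):
    """Trainable parameters of one LSTM layer: four weight blocks (input, forget and
    output gates plus the cell candidate), each with input, recurrent and bias weights."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Weights plus biases of a fully connected layer."""
    return (input_dim + 1) * units

embedding_dim = 300  # hypothetical; the real value is w2v.emdedding_size
n_classes = 7        # Dense(7, activation='softmax')

lstm = lstm_params(embedding_dim, embedding_dim)
dense = dense_params(embedding_dim, n_classes)
trainable = lstm + dense  # the Embedding layer is frozen (trainable=False)
```

`SpatialDropout1D` adds no parameters, so under this assumption nearly all the trainable weights sit in the LSTM layer.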

In [None]:
# Build the pipeline
from sklearn.pipeline import Pipeline

pipeline = Pipeline(steps=[('normalizer', normalizer),
                           ('vectorizer', w2v),
                           ('padder', seq2seq),
                           ('estimator', estimator)])
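
A rough sketch of what fitting and predicting with such a pipeline involves: each transformer's output feeds the next step, and only the last step is trained as an estimator. The classes below are toy stand-ins, not the course's transformers, and `MiniPipeline` is a hypothetical simplification of scikit-learn's `Pipeline`.

```python
class Lowercase:
    """Toy transformer: lowercases every text."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.lower() for text in X]


class Tokenize:
    """Toy transformer: splits every text on whitespace."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [text.split() for text in X]


class FirstTokenClassifier:
    """Toy estimator: remembers the label seen with each first token."""
    def fit(self, X, y):
        self.table_ = {tokens[0]: label for tokens, label in zip(X, y)}
        return self

    def predict(self, X):
        return [self.table_.get(tokens[0], "UNKNOWN") for tokens in X]


class MiniPipeline:
    """Hypothetical, simplified stand-in for sklearn.pipeline.Pipeline."""
    def __init__(self, steps):
        self.transformers = [step for _, step in steps[:-1]]
        self.estimator = steps[-1][1]

    def fit(self, X, y):
        for transformer in self.transformers:
            X = transformer.fit(X, y).transform(X)  # fit, then pass the output on
        self.estimator.fit(X, y)
        return self

    def predict(self, X):
        for transformer in self.transformers:
            X = transformer.transform(X)  # no refitting at prediction time
        return self.estimator.predict(X)


pipe = MiniPipeline([("lower", Lowercase()),
                     ("split", Tokenize()),
                     ("clf", FirstTokenClassifier())])
pipe.fit(["Banco abre", "Cerveza fria"], ["BANCA", "BEBIDAS"])
predictions = pipe.predict(["BANCO cierra", "coche nuevo"])
```

The practical benefit is that raw text goes in at both `fit` and `predict`, so the test tweets pass through exactly the same normalization, vectorization and padding as the training tweets.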

In [None]:
# Training.
model = pipeline.fit(X=X_train, y=y_train)


In [None]:
# Get predictions.
predictions = model.predict(X_test)


In [None]:
# Compute metrics.
from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

ALIMENTACION       0.95      0.92      0.94       110
  AUTOMOCION       0.87      0.94      0.90       148
       BANCA       0.92      0.92      0.92       198
     BEBIDAS       0.85      0.93      0.89       223
    DEPORTES       0.93      0.95      0.94       216
      RETAIL       0.97      0.85      0.91       268
       TELCO       0.86      0.87      0.87        79

    accuracy                           0.91      1242
   macro avg       0.91      0.91      0.91      1242
weighted avg       0.91      0.91      0.91      1242



The baseline model achieved an overall accuracy of 0.91.
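
As a worked check of how the summary rows relate to the per-class rows, the sketch below recomputes the macro and weighted averages of recall from the values printed in the report above. The per-class figures are already rounded to two decimals, so the results only match the report up to rounding.

```python
# (recall, support) per class, copied from the report above.
report = {
    "ALIMENTACION": (0.92, 110),
    "AUTOMOCION": (0.94, 148),
    "BANCA": (0.92, 198),
    "BEBIDAS": (0.93, 223),
    "DEPORTES": (0.95, 216),
    "RETAIL": (0.85, 268),
    "TELCO": (0.87, 79),
}

total = sum(support for _, support in report.values())

# Macro average: every class counts the same, regardless of its size.
macro_recall = sum(recall for recall, _ in report.values()) / len(report)

# Weighted average: every class counts in proportion to its support.
weighted_recall = sum(recall * support for recall, support in report.values()) / total
```

The support-weighted recall is the fraction of all test tweets classified correctly, which is why it coincides with the accuracy row.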

Iteration 2: Preprocessing adjustments
---------

In [None]:
# Library imports.
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input, SpatialDropout1D
from scikeras.wrappers import KerasClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from m72109.nlp.normalization import TweetTextNormalizer
from m72109.nlp.transformation import Word2VecVectorizer
from m72109.nlp.transformation import PadSequenceTransformer

In [None]:
# Setup.
normalizer = TweetTextNormalizer(preserve_case=False,
                                 return_tokens=True,
                                 language='spanish',
                                 lemmatize=False, # Changed to False.
                                 stem=False,
                                 reduce_len=False, # Changed to False.
                                 strip_handles=True,
                                 strip_stopwords=True,
                                 strip_urls=True,
                                 strip_accents=True,
                                 token_min_len=-1
                                 )

w2v = Word2VecVectorizer(model='Models/Word2Vec/model-es.bin', sequence_to_idx=True)
embedding_weights = w2v.get_weights()

max_seq_len = 100
seq2seq = PadSequenceTransformer(max_len=max_seq_len)


In [None]:
# Build the model.
def build_model(sequence_len, vocab_size, emdedding_size, embedding_weights):
    model = Sequential([
        Embedding(vocab_size, emdedding_size,
                  weights=[embedding_weights],
                  trainable=False,
                  mask_zero=True),
        SpatialDropout1D(0.2),
        LSTM(emdedding_size),
        Dense(7, activation='softmax')
    ])

    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# SciKeras wrapper.
estimator = KerasClassifier(
    model=build_model,  # `build_fn` is deprecated in SciKeras
    epochs=50,
    sequence_len=max_seq_len,
    vocab_size=w2v.vocab_size,
    emdedding_size=w2v.emdedding_size,
    embedding_weights=embedding_weights)

# Build the pipeline.
pipeline = Pipeline(steps=[('normalizer', normalizer),
                           ('vectorizer', w2v),
                           ('padder', seq2seq),
                           ('estimator', estimator)])

In [None]:
# Training.
model = pipeline.fit(X=X_train, y=y_train)


In [None]:
# Get results.
predictions = model.predict(X_test)

print(classification_report(y_test, predictions))


              precision    recall  f1-score   support

ALIMENTACION       0.98      0.97      0.98       110
  AUTOMOCION       0.95      0.97      0.96       148
       BANCA       0.95      0.94      0.95       198
     BEBIDAS       0.89      0.96      0.92       223
    DEPORTES       0.98      0.92      0.95       216
      RETAIL       0.94      0.94      0.94       268
       TELCO       0.99      0.95      0.97        79

    accuracy                           0.95      1242
   macro avg       0.96      0.95      0.95      1242
weighted avg       0.95      0.95      0.95      1242



After changing the preprocessing (disabling lemmatization and length reduction with `lemmatize=False` and `reduce_len=False`), the new model achieved an overall accuracy of 0.95.

Iteration 3: Model adjustments. Additional dropout.
---------

In [None]:
# Library imports.
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input, SpatialDropout1D, Dropout
from scikeras.wrappers import KerasClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from m72109.nlp.normalization import TweetTextNormalizer
from m72109.nlp.transformation import Word2VecVectorizer
from m72109.nlp.transformation import PadSequenceTransformer

In [None]:
# Setup.
normalizer = TweetTextNormalizer(preserve_case=False,
                                 return_tokens=True,
                                 language='spanish',
                                 lemmatize=False, # Changed to False.
                                 stem=False,
                                 reduce_len=False, # Changed to False.
                                 strip_handles=True,
                                 strip_stopwords=True,
                                 strip_urls=True,
                                 strip_accents=True,
                                 token_min_len=-1
                                 )

# Instantiate the vectorizer and get the weights.
w2v = Word2VecVectorizer(model='Models/Word2Vec/model-es.bin', sequence_to_idx=True)
embedding_weights = w2v.get_weights()

# Padding.
max_seq_len = 100
seq2seq = PadSequenceTransformer(max_len=max_seq_len)


In [None]:
# Build the model.
def build_model(sequence_len, vocab_size, emdedding_size, embedding_weights):
    model = Sequential([
        Embedding(vocab_size, emdedding_size,
                  weights=[embedding_weights],
                  trainable=False,
                  mask_zero=True),
        SpatialDropout1D(0.2),
        LSTM(emdedding_size),
        Dropout(0.1), # A second, conventional dropout layer is added after the LSTM.
        Dense(7, activation='softmax')
    ])

    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# SciKeras wrapper.
estimator = KerasClassifier(
    model=build_model,  # `build_fn` is deprecated in SciKeras
    epochs=50,
    sequence_len=max_seq_len,
    vocab_size=w2v.vocab_size,
    emdedding_size=w2v.emdedding_size,
    embedding_weights=embedding_weights)

# Build the pipeline.
pipeline = Pipeline(steps=[('normalizer', normalizer),
                           ('vectorizer', w2v),
                           ('padder', seq2seq),
                           ('estimator', estimator)])

In [None]:
# Training.
model = pipeline.fit(X=X_train, y=y_train)


In [None]:
# Get results.
predictions = model.predict(X_test)

print(classification_report(y_test, predictions))


              precision    recall  f1-score   support

ALIMENTACION       0.95      0.95      0.95       110
  AUTOMOCION       0.97      0.97      0.97       148
       BANCA       0.91      0.94      0.93       198
     BEBIDAS       0.93      0.93      0.93       223
    DEPORTES       0.96      0.95      0.96       216
      RETAIL       0.93      0.93      0.93       268
       TELCO       0.97      0.96      0.97        79

    accuracy                           0.94      1242
   macro avg       0.95      0.95      0.95      1242
weighted avg       0.94      0.94      0.94      1242



With the modified preprocessing and an additional dropout layer after the LSTM layer, the new model scored slightly lower than the previous one, with an accuracy of 0.94.
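
For readers unfamiliar with what a conventional dropout layer does to the LSTM output, here is a minimal pure-Python sketch of inverted dropout, the scheme Keras's `Dropout` follows: during training each value is zeroed with probability `rate` and the survivors are scaled by `1 / (1 - rate)`; at inference the input passes through unchanged. The function name, the seed, and the toy activations are assumptions for this example.

```python
import random

def dropout(values, rate, training, rng):
    """Inverted dropout: zero each value with probability `rate`, scale survivors by 1/(1-rate)."""
    if not training or rate == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)   # fixed seed so the example is repeatable
activations = [1.0] * 10  # toy stand-in for the LSTM output

train_out = dropout(activations, rate=0.1, training=True, rng=rng)
infer_out = dropout(activations, rate=0.1, training=False, rng=rng)
```

With a rate as low as 0.1 only about one value in ten is dropped per training step, so a small effect on the final accuracy, in either direction, is not surprising.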

Iteration 4: Model adjustments. Bidirectional recurrent networks.
---------

In [None]:
# Library imports.
import tensorflow as tf
import tensorflow.keras as keras
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import Embedding, LSTM, Dense, Input, SpatialDropout1D, Dropout, Bidirectional
from scikeras.wrappers import KerasClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from m72109.nlp.normalization import TweetTextNormalizer
from m72109.nlp.transformation import Word2VecVectorizer
from m72109.nlp.transformation import PadSequenceTransformer

In [None]:
# Setup.
normalizer = TweetTextNormalizer(preserve_case=False,
                                 return_tokens=True,
                                 language='spanish',
                                 lemmatize=False, # Changed to False.
                                 stem=False,
                                 reduce_len=False, # Changed to False.
                                 strip_handles=True,
                                 strip_stopwords=True,
                                 strip_urls=True,
                                 strip_accents=True,
                                 token_min_len=-1
                                 )

# Instantiate the vectorizer and get the weights.
w2v = Word2VecVectorizer(model='Models/Word2Vec/model-es.bin', sequence_to_idx=True)
embedding_weights = w2v.get_weights()

# Padding.
max_seq_len = 100
seq2seq = PadSequenceTransformer(max_len=max_seq_len)

In [None]:
# Build the model.
def build_model(sequence_len, vocab_size, emdedding_size, embedding_weights):
    model = Sequential([
        Embedding(vocab_size, emdedding_size,
                  weights=[embedding_weights],
                  trainable=False,
                  mask_zero=True),
        SpatialDropout1D(0.2),
        Bidirectional(LSTM(emdedding_size)), # The LSTM layer is wrapped in Bidirectional.
        Dropout(0.1), # A second, conventional dropout layer is added after the LSTM.
        Dense(7, activation='softmax')
    ])

    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    return model

# SciKeras wrapper.
estimator = KerasClassifier(
    model=build_model,  # `build_fn` is deprecated in SciKeras
    epochs=50,
    sequence_len=max_seq_len,
    vocab_size=w2v.vocab_size,
    emdedding_size=w2v.emdedding_size,
    embedding_weights=embedding_weights)

# Build the pipeline.
pipeline = Pipeline(steps=[('normalizer', normalizer),
                           ('vectorizer', w2v),
                           ('padder', seq2seq),
                           ('estimator', estimator)])

In [None]:
# Training.
model = pipeline.fit(X=X_train, y=y_train)


In [None]:
# Get results.
predictions = model.predict(X_test)

print(classification_report(y_test, predictions))


              precision    recall  f1-score   support

ALIMENTACION       0.98      0.94      0.96       110
  AUTOMOCION       0.97      0.95      0.96       148
       BANCA       0.92      0.94      0.93       198
     BEBIDAS       0.91      0.92      0.92       223
    DEPORTES       0.94      0.94      0.94       216
      RETAIL       0.93      0.92      0.93       268
       TELCO       0.96      0.95      0.96        79

    accuracy                           0.94      1242
   macro avg       0.94      0.94      0.94      1242
weighted avg       0.94      0.94      0.94      1242



With the modified preprocessing, the additional dropout layer, and a bidirectional wrapper around the LSTM layer, the model reached an accuracy of 0.94, the same as the previous iteration.
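
To make the bidirectional idea concrete, the sketch below runs a toy recurrence over a sequence in both directions and concatenates the two final states, which is what Keras's `Bidirectional` does with its default `merge_mode='concat'`. The recurrence is a deliberately simple stand-in with fixed, made-up weights, not an LSTM, and unlike Keras it reuses the same weights for both directions.

```python
import math

def simple_rnn(sequence, hidden_size):
    """Toy recurrent pass with fixed weights (a stand-in for an LSTM, not an LSTM itself)."""
    state = [0.0] * hidden_size
    for x in sequence:
        # Each unit mixes its previous state with the current input.
        state = [math.tanh(0.5 * h + 0.1 * (j + 1) * x) for j, h in enumerate(state)]
    return state

def bidirectional(sequence, hidden_size):
    """Run the recurrence left-to-right and right-to-left, then concatenate both final states."""
    forward = simple_rnn(sequence, hidden_size)
    backward = simple_rnn(sequence[::-1], hidden_size)
    return forward + backward

sequence = [0.2, -0.4, 0.9]
uni = simple_rnn(sequence, hidden_size=4)
bi = bidirectional(sequence, hidden_size=4)
```

The concatenated output is twice as wide as the unidirectional one, so the `Dense` layer that follows receives twice as many features and the recurrent part of the model has roughly twice as many weights to train on the same data.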

Conclusions
---------

From the iterations performed and the results obtained with the number of training epochs used, we conclude that the main performance improvement came from changing the preprocessing parameters.

In particular, after disabling lemmatization and length reduction (`lemmatize=False`, `reduce_len=False`) in the preprocessing stage, the model improved its performance and achieved the best overall accuracy obtained, 0.95.