### ITCR - Escuela de Computación
### Course IC-6200 - Artificial Intelligence
### Supervised learning

### TP07
### Long Short-Term Memory networks with PyTorch
### (Long Short-Term Memory Networks - LSTM)

**Professor: María Auxiliadora Mora**

**Student: Fabian Vives**

In [70]:
# Required libraries
import torch
import torch.nn as nn
import pandas as pd
import numpy as np
import re
import spacy
from collections import Counter
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
import string
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence
from sklearn.metrics import mean_squared_error

# Load data

In [71]:
# Load the data
reviews = pd.read_csv("clothing_reviews.csv")
print(reviews.shape)
reviews.head()

(23486, 11)


Unnamed: 0.1,Unnamed: 0,Clothing ID,Age,Title,Review Text,Rating,Recommended IND,Positive Feedback Count,Division Name,Department Name,Class Name
0,0,767,33,,Absolutely wonderful - silky and sexy and comf...,4,1,0,Initmates,Intimate,Intimates
1,1,1080,34,,Love this dress! it's sooo pretty. i happene...,5,1,4,General,Dresses,Dresses
2,2,1077,60,Some major design flaws,I had such high hopes for this dress and reall...,3,0,0,General,Dresses,Dresses
3,3,1049,50,My favorite buy!,"I love, love, love this jumpsuit. it's fun, fl...",5,1,0,General Petite,Bottoms,Pants
4,4,847,47,Flattering shirt,This shirt is very flattering to all due to th...,5,1,6,General,Tops,Blouses


### Data normalization

Empty "Title" and "Review Text" fields are filled with an empty string to avoid errors later on.

Both fields are then joined into a single field called "review".



In [72]:
reviews['Title'] = reviews['Title'].fillna('')
reviews['Review Text'] = reviews['Review Text'].fillna('')
reviews['review'] = reviews['Title'] + ' ' + reviews['Review Text']

The columns that are not needed are dropped, and an extra column called "review_length" is added with the number of words in "review".

In [73]:
reviews = reviews[['review', 'Rating']].copy()  # copy so later column assignments do not trigger SettingWithCopyWarning
reviews.columns = ['review', 'rating']
reviews['review_length'] = reviews['review'].apply(lambda x: len(x.split()))
reviews.head()

Unnamed: 0,review,rating,review_length
0,Absolutely wonderful - silky and sexy and com...,4,8
1,Love this dress! it's sooo pretty. i happen...,5,62
2,Some major design flaws I had such high hopes ...,3,102
3,"My favorite buy! I love, love, love this jumps...",5,25
4,Flattering shirt This shirt is very flattering...,5,38


In [74]:
zero_numbering = {1:0, 2:1, 3:2, 4:3, 5:4}
reviews['rating'] = reviews['rating'].apply(lambda x: zero_numbering[x])

# ADJ (adjectives)

In [75]:
import nltk
# nltk.download('punkt')
from nltk.tokenize import word_tokenize

# Download the 'averaged_perceptron_tagger' corpora if it hasn't been downloaded already
# nltk.download('averaged_perceptron_tagger')

def extract_POS_tags(text):
    tokens = word_tokenize(text)
    pos_tags = nltk.pos_tag(tokens)
    return [tag[1] for tag in pos_tags]

reviews['POS_tags'] = reviews['review'].apply(lambda x: extract_POS_tags(x))

# Extracting ADJ

In [76]:
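def extract_ADJ_tags(pos_tags):
    # Build a 0/1 mask aligned with pos_tags: 1 where the tag is an adjective.
    # Note: only the base tag 'JJ' ends with 'JJ'; 'JJR' and 'JJS' are marked 0.
    adj_tags = []
    for tag in pos_tags:
        if tag.endswith('JJ'):
            adj_tags.append(1)
        else:
            adj_tags.append(0)
    return adj_tags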
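
The function above marks a token as an adjective only when its tag is exactly `JJ`. The Penn Treebank tag set used by default in `nltk.pos_tag` also has `JJR` (comparative) and `JJS` (superlative). A minimal sketch of a variant that counts all three, assuming that is the desired behaviour; `extract_all_ADJ_tags` is a hypothetical name and is not part of the original notebook:

```python
def extract_all_ADJ_tags(pos_tags):
    """Return 1 for every adjective tag (JJ, JJR, JJS) and 0 otherwise."""
    return [1 if tag.startswith('JJ') else 0 for tag in pos_tags]


# Hand-written tags for a toy phrase (not produced by the tagger here)
example_tags = ['DT', 'JJR', 'NN', 'VBZ', 'JJ']
adj_flags = extract_all_ADJ_tags(example_tags)
print(adj_flags)  # [0, 1, 0, 0, 1]
```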

In [77]:
reviews['ADJ_tags'] = reviews['POS_tags'].apply(extract_ADJ_tags)
reviews

Unnamed: 0,review,rating,review_length,POS_tags,ADJ_tags
0,Absolutely wonderful - silky and sexy and com...,3,8,"[RB, JJ, :, NN, CC, NN, CC, JJ]","[0, 1, 0, 0, 0, 0, 0, 1]"
1,Love this dress! it's sooo pretty. i happen...,4,62,"[VB, DT, NN, ., PRP, VBZ, JJ, RB, ., NN, VBD, ...","[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ..."
2,Some major design flaws I had such high hopes ...,2,102,"[DT, JJ, NN, NN, PRP, VBD, JJ, JJ, NNS, IN, DT...","[0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, ..."
3,"My favorite buy! I love, love, love this jumps...",4,25,"[PRP$, JJ, NN, ., PRP, VBP, ,, VBP, ,, VBP, DT...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."
4,Flattering shirt This shirt is very flattering...,4,38,"[VBG, NN, DT, NN, VBZ, RB, JJ, TO, DT, JJ, TO,...","[0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, ..."
...,...,...,...,...,...
23481,Great dress for many occasions I was very happ...,4,33,"[NNP, NN, IN, JJ, NNS, PRP, VBD, RB, JJ, TO, V...","[0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ..."
23482,Wish it was made of cotton It reminds me of ma...,2,44,"[VB, PRP, VBD, VBN, IN, NN, PRP, VBZ, PRP, IN,...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, ..."
23483,"Cute, but see through This fit well, but the t...",2,46,"[NN, ,, CC, VBP, IN, DT, NN, RB, ,, CC, DT, NN...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, ..."
23484,"Very cute dress, perfect for summer parties an...",2,95,"[RB, JJ, NN, ,, NN, IN, NN, NNS, CC, PRP, PRP,...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..."


To reduce the number of distinct words found in the review texts, the occurrences of each word are counted and the least frequent words are removed.

In [78]:
# Function that normalizes the text (removes digits, punctuation...)
tok = spacy.load('en_core_web_sm')
def tokenize (text):
    text = re.sub(r"[^\x00-\x7F]+", " ", text)
    regex = re.compile('[' + re.escape(string.punctuation) + '0-9\\r\\t\\n]')
    nopunct = regex.sub(" ", text.lower())
    return [token.text for token in tok.tokenizer(nopunct)]

# Count the occurrences of each word
counts = Counter()
for index, row in reviews.iterrows():
    counts.update(tokenize(row['review']))

# Remove the words that appear only once
print("Original number of words:",len(counts.keys()))
for word in list(counts):
    if counts[word] < 2:
        del counts[word]
print("Number of words excluding those that appear only once:",len(counts.keys()))

Original number of words: 14138
Number of words excluding those that appear only once: 8263
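
The pruning step can be seen on a toy corpus. This sketch uses a plain `str.split` instead of the spaCy tokenizer, and the `prune_counts` name and its `min_count` parameter are assumptions made for illustration:

```python
from collections import Counter


def prune_counts(counts, min_count=2):
    """Return a new Counter keeping only words seen at least min_count times."""
    return Counter({w: c for w, c in counts.items() if c >= min_count})


toy_reviews = ["love this dress", "love the fit", "this dress runs small"]
toy_counts = Counter()
for text in toy_reviews:
    toy_counts.update(text.split())

pruned = prune_counts(toy_counts)
print(sorted(pruned))  # ['dress', 'love', 'this']
```

Words dropped here are later mapped to the "UNK" index when the reviews are encoded.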


A vocabulary is built to map each word to an integer.

An "encoded" column is added that holds the text of the "review" column as its corresponding list of integers.

In [79]:
# Build a vocabulary with the most frequent words
vocab2index = {"":0, "UNK":1}
words = ["", "UNK"]
for word in counts:
    vocab2index[word] = len(words)
    words.append(word)

# Encode a text as a fixed-length list of integers according to the vocabulary
def encode_sentence(text, vocab2index, N=70):
    tokenized = tokenize(text)
    encoded = np.zeros(N, dtype=int)
    enc1 = np.array([vocab2index.get(word, vocab2index["UNK"]) for word in tokenized])
    length = min(N, len(enc1))
    encoded[:length] = enc1[:length]
    return encoded, length

# dtype=object stores the (ids, length) pair without building a ragged numeric array
reviews['encoded'] = reviews['review'].apply(lambda x: np.array(encode_sentence(x, vocab2index), dtype=object))
reviews.head(20)


Unnamed: 0,review,rating,review_length,POS_tags,ADJ_tags,encoded
0,Absolutely wonderful - silky and sexy and com...,3,8,"[RB, JJ, :, NN, CC, NN, CC, JJ]","[0, 1, 0, 0, 0, 0, 0, 1]","[[2, 3, 4, 5, 6, 7, 8, 7, 9, 0, 0, 0, 0, 0, 0,..."
1,Love this dress! it's sooo pretty. i happen...,4,62,"[VB, DT, NN, ., PRP, VBZ, JJ, RB, ., NN, VBD, ...","[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[2, 10, 11, 12, 5, 13, 14, 15, 16, 5, 17, 18,..."
2,Some major design flaws I had such high hopes ...,2,102,"[DT, JJ, NN, NN, PRP, VBD, JJ, JJ, NNS, IN, DT...","[0, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, ...","[[54, 55, 56, 57, 17, 58, 59, 60, 61, 62, 11, ..."
3,"My favorite buy! I love, love, love this jumps...",4,25,"[PRP$, JJ, NN, ., PRP, VBP, ,, VBP, ,, VBP, DT...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[68, 109, 110, 2, 17, 10, 2, 10, 2, 10, 11, 1..."
4,Flattering shirt This shirt is very flattering...,4,38,"[VBG, NN, DT, NN, VBZ, RB, JJ, TO, DT, JJ, TO,...","[0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, ...","[[122, 123, 11, 123, 52, 92, 122, 19, 124, 125..."
5,Not for the very petite I love tracy reese dre...,1,103,"[RB, IN, DT, RB, JJ, PRP, VBP, NN, JJ, NNS, ,,...","[0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, ...","[[78, 62, 37, 92, 33, 17, 10, 137, 138, 139, 2..."
6,Cagrcoal shimmer fun I aded this in my basket ...,4,104,"[NNP, NN, NN, PRP, VBD, DT, IN, PRP$, NN, IN, ...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, ...","[[1, 168, 112, 17, 1, 11, 21, 68, 169, 170, 17..."
7,"Shimmer, surprisingly goes with lots I ordered...",3,102,"[NNP, ,, RB, VBZ, IN, NNS, PRP, VBD, DT, IN, N...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[168, 2, 203, 204, 130, 205, 17, 31, 11, 21, ..."
8,Flattering I love this dress. i usually get an...,4,35,"[VBG, PRP, VBP, DT, NN, ., VB, RB, VBP, DT, NN...","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[122, 17, 10, 11, 12, 2, 17, 143, 118, 236, 1..."
9,"Such a fun dress! I'm 5""5' and 125 lbs. i orde...",4,76,"[JJ, DT, NN, NN, ., PRP, VBP, CD, '', CD, POS,...","[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[[59, 22, 112, 12, 2, 17, 24, 243, 7, 184, 244..."
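
To make the padding, truncation and unknown-word handling concrete, here is a small sketch of the same idea using plain lists and a whitespace tokenizer. The toy vocabulary, `N=5`, and the `encode_tokens` name are assumptions for illustration only:

```python
def encode_tokens(tokens, vocab2index, N=5):
    """Map tokens to ids, truncate to N, and right-pad with 0."""
    ids = [vocab2index.get(tok, vocab2index["UNK"]) for tok in tokens]
    length = min(N, len(ids))
    return ids[:length] + [0] * (N - length), length


toy_vocab = {"": 0, "UNK": 1, "love": 2, "this": 3, "dress": 4}

# "jumpsuit" is not in the toy vocabulary, so it becomes UNK (1)
short_ids, short_len = encode_tokens("love this jumpsuit".split(), toy_vocab)
# Six tokens are cut down to the first five
long_ids, long_len = encode_tokens("love this dress love this dress".split(), toy_vocab)

print(short_ids, short_len)  # [2, 3, 1, 0, 0] 3
print(long_ids, long_len)    # [2, 3, 4, 2, 3] 5
```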


In [80]:
# Check how the data is distributed across ratings
Counter(reviews['rating'])

Counter({3: 5077, 4: 13131, 2: 2871, 1: 1565, 0: 842})
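
The counts above show a strong imbalance: the highest rating accounts for more than half of the reviews. One common way to account for this is to weight each class inversely to its frequency. The sketch below only computes such weights from the counts printed above; whether they would improve this model was not tested in the notebook, and `inverse_frequency_weights` is a hypothetical helper:

```python
from collections import Counter

# Counts copied from the output above (ratings already zero-indexed)
rating_counts = Counter({3: 5077, 4: 13131, 2: 2871, 1: 1565, 0: 842})


def inverse_frequency_weights(counts):
    """Weight for class c is total / (num_classes * count_c)."""
    total = sum(counts.values())
    num_classes = len(counts)
    return {c: total / (num_classes * n) for c, n in sorted(counts.items())}


class_weights = inverse_frequency_weights(rating_counts)
for rating, weight in class_weights.items():
    print(rating, round(weight, 2))
```

The resulting values could be passed, as a tensor ordered by class index, to the `weight` argument of `F.cross_entropy`.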

# Data split
80% of the data is used for training and 20% for validation.

In [81]:
# Split the data
X = list(reviews['encoded'])
y = list(reviews['rating'])
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2)

In [82]:
# Dataset class that holds the encoded reviews and their ratings
class ReviewsDataset(Dataset):
    def __init__(self, X, Y):
        self.X = X
        self.y = Y
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        return torch.from_numpy(self.X[idx][0].astype(np.int32)), self.y[idx], self.X[idx][1]

train_ds = ReviewsDataset(X_train, y_train)
valid_ds = ReviewsDataset(X_valid, y_valid)

# Prepare the data to be fed into the model
# A batch size of 5000 is used
batch_size = 5000
vocab_size = len(words)
train_dl = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
val_dl = DataLoader(valid_ds, batch_size=batch_size)

# Model definition
### 001_LSTM_POST(Esp).ipynb

In [83]:
# Required libraries

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7f459be52e50>

In [84]:
class LSTM_fixed_len(torch.nn.Module) :
    def __init__(self, vocab_size, embedding_dim, hidden_dim) :
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, 5)
        self.dropout = nn.Dropout(0.2)
        
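    def forward(self, x, l):
        # l (the sequence lengths) is accepted but not used: the LSTM reads all
        # N positions of every review, including the zero padding.
        x = self.embeddings(x)
        x = self.dropout(x)
        lstm_out, (ht, ct) = self.lstm(x)
        # ht[-1] is the final hidden state of the last LSTM layer
        return self.linear(ht[-1])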
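
The `forward` method above receives the lengths `l` but does not use them, and `pack_padded_sequence` is imported at the top of the notebook without being called. A possible variant that uses the lengths so the LSTM stops at the last real token of each review is sketched below. `LSTMVariableLen` and `num_classes` are hypothetical names, this variant was not trained or evaluated in the notebook, and the clamp to 1 is an assumption to guard against empty reviews:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence


class LSTMVariableLen(nn.Module):
    """Hypothetical variant that skips padding positions using the lengths."""

    def __init__(self, vocab_size, embedding_dim, hidden_dim, num_classes=5):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.dropout = nn.Dropout(0.2)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.linear = nn.Linear(hidden_dim, num_classes)

    def forward(self, x, lengths):
        x = self.dropout(self.embeddings(x))
        # pack_padded_sequence rejects zero-length sequences, so clamp to 1;
        # the lengths must also live on the CPU.
        lengths = torch.as_tensor(lengths).clamp(min=1).cpu()
        packed = pack_padded_sequence(
            x, lengths, batch_first=True, enforce_sorted=False
        )
        _, (ht, _) = self.lstm(packed)
        # ht[-1]: hidden state at the last real token of each sequence
        return self.linear(ht[-1])


# Toy batch: two padded sequences with 3 and 2 real tokens
model = LSTMVariableLen(vocab_size=20, embedding_dim=8, hidden_dim=6)
batch = torch.tensor([[2, 3, 4, 0, 0], [5, 6, 0, 0, 0]])
logits = model(batch, torch.tensor([3, 2]))
print(logits.shape)
```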

## Model training

The model is trained for 30 epochs.

In [85]:
def train_model(model, epochs, lr):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optimizer = torch.optim.Adam(parameters, lr=lr)
    for i in range(epochs):
        model.train()
        sum_loss = 0.0
        total = 0
        for x, y, l in train_dl:
            x = x.long()
            y = y.long()
            y_pred = model(x, l)
            optimizer.zero_grad()
            loss = F.cross_entropy(y_pred, y)
            loss.backward()
            optimizer.step()
            sum_loss += loss.item()*y.shape[0]
            total += y.shape[0]
        print("Epoch " + str(i) + " complete")

model_lstm =  LSTM_fixed_len(vocab_size, 50, 50)
train_model(model_lstm, epochs = 30, lr = 0.01)

Epoch 0 complete
Epoch 1 complete
Epoch 2 complete
Epoch 3 complete
Epoch 4 complete
Epoch 5 complete
Epoch 6 complete
Epoch 7 complete
Epoch 8 complete
Epoch 9 complete
Epoch 10 complete
Epoch 11 complete
Epoch 12 complete
Epoch 13 complete
Epoch 14 complete
Epoch 15 complete
Epoch 16 complete
Epoch 17 complete
Epoch 18 complete
Epoch 19 complete
Epoch 20 complete
Epoch 21 complete
Epoch 22 complete
Epoch 23 complete
Epoch 24 complete
Epoch 25 complete
Epoch 26 complete
Epoch 27 complete
Epoch 28 complete
Epoch 29 complete
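
The training loop accumulates `sum_loss` and `total` but only prints that each epoch finished. A minimal sketch of how those two quantities combine into a per-epoch average loss follows; the batch losses and sizes are made-up numbers, not results from this notebook, and `epoch_average_loss` is a hypothetical helper:

```python
def epoch_average_loss(batch_losses, batch_sizes):
    """Sample-weighted mean of per-batch mean losses."""
    sum_loss = 0.0
    total = 0
    for loss, size in zip(batch_losses, batch_sizes):
        sum_loss += loss * size   # same update as in train_model
        total += size
    return sum_loss / total


# Hypothetical values: three full batches and a smaller last one
avg = epoch_average_loss([1.5, 1.0, 1.25, 0.5], [100, 100, 100, 50])
print(avg)
```

Inside `train_model`, printing `sum_loss/total` at the end of each epoch would make it possible to check how much the loss changes from one epoch to the next.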


In [86]:
model_lstm.eval()
correct = 0
total = 0
sum_loss = 0.0
sum_rmse = 0.0

# Confusion matrix (rows: true rating, columns: predicted rating).
# It is created once, outside the loop, so counts accumulate over all batches.
confusion = [[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0],[0,0,0,0,0]]

for x, y, l in val_dl:
    x = x.long()
    y = y.long()
    y_hat = model_lstm(x, l)
    loss = F.cross_entropy(y_hat, y)
    pred = torch.max(y_hat, 1)[1]

    # Update the confusion matrix with this batch
    i = 0
    register_len = len(pred)
    while i < register_len:
        confusion[int(y[i])][int(pred[i])] = confusion[int(y[i])][int(pred[i])] + 1
        i = i + 1

    correct += (pred == y).float().sum()
    total += y.shape[0]
    sum_loss += loss.item()*y.shape[0]

print("Model accuracy: " + str(float(correct/total)))

Model accuracy: 0.603022575378418


### Build the classification report from the confusion matrix

In [87]:
classification_report = [["rating", "precision", "recall", "f1-score", "support"],[1,0,0,0,0],[2,0,0,0,0],[3,0,0,0,0],[4,0,0,0,0],[5,0,0,0,0]]
i = 0

while i < 5:
    j = 0
    total_precision = 0
    total_recall = 0
    correct = 0
    while j < 5:
        total_precision += confusion[j][i]
        total_recall += confusion[i][j]
        if i == j:
            correct = confusion[j][i]
        j += 1
        
    if total_precision == 0:
        precision = 0
    else:
        precision = round((correct/total_precision),2)

    if total_recall == 0:
        recall = 0
    else:
        recall = round((correct/total_recall),2)

    if (precision+recall) == 0:
        f1_score = 0
    else:
        f1_score = round((2 * precision * recall) / (precision + recall), 2)

    classification_report[i+1][1] = precision
    classification_report[i+1][2] = recall
    classification_report[i+1][3] = f1_score
    classification_report[i+1][4] = total_recall
    i += 1



# Confusion matrix

In [88]:
from tabulate import tabulate
print(tabulate(confusion))

-  --  ---  ---  ----
1   6   90   34    52
3  19  147   68    82
3  21  193  162   205
1   6   58  164   793
0   1   19  114  2456
-  --  ---  ---  ----


### Classification report

In [89]:
print(tabulate(classification_report, headers='firstrow'))

  rating    precision    recall    f1-score    support
--------  -----------  --------  ----------  ---------
       1         0.12      0.01        0.02        183
       2         0.36      0.06        0.1         319
       3         0.38      0.33        0.35        584
       4         0.3       0.16        0.21       1022
       5         0.68      0.95        0.79       2590
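
As a cross-check, the overall accuracy and a macro-averaged F1 score can be derived directly from the confusion matrix printed above. This is a sketch that uses unrounded precision and recall, so per-class values may differ slightly from the table; `macro_f1` is a hypothetical helper name:

```python
# Confusion matrix copied from the output above (rows: true, columns: predicted)
confusion = [
    [1, 6, 90, 34, 52],
    [3, 19, 147, 68, 82],
    [3, 21, 193, 162, 205],
    [1, 6, 58, 164, 793],
    [0, 1, 19, 114, 2456],
]


def macro_f1(matrix):
    """Average the per-class F1 scores, giving every class equal weight."""
    n = len(matrix)
    scores = []
    for c in range(n):
        tp = matrix[c][c]
        predicted = sum(matrix[r][c] for r in range(n))  # column sum
        actual = sum(matrix[c])                          # row sum
        denom = predicted + actual
        # F1 = 2*TP / (predicted + actual); 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n


total = sum(sum(row) for row in confusion)
accuracy = sum(confusion[i][i] for i in range(5)) / total
print(round(accuracy, 3), round(macro_f1(confusion), 3))
```

The macro average comes out well below the accuracy because the four less frequent ratings count as much as the most frequent one.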


## Conclusions

* The model's accuracy (about 0.60) is lower than I would have liked; this is thought to be caused by the very uneven proportions of ratings in the data.

* The improvement from one epoch to the next is small.

* There is a pattern in which, in general, the more records a rating has, the better the model performs when predicting it.

* The models seen in this exercise were useful for practicing with data and behaviors different from those covered in the course; they let me see different cases and how having different data and functionality affects the results.