# **Augustin Crepin, Marcelo Rojas**
In this assignment you will build a neural network that classifies messages as spam or not spam. The first step is to download the data:

In [None]:
!wget https://www.ivan-sipiran.com/downloads/spam.csv

--2022-12-05 20:05:55--  https://www.ivan-sipiran.com/downloads/spam.csv
Resolving www.ivan-sipiran.com (www.ivan-sipiran.com)... 66.96.149.31
Connecting to www.ivan-sipiran.com (www.ivan-sipiran.com)|66.96.149.31|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 471781 (461K)
Saving to: ‘spam.csv’


2022-12-05 20:05:59 (211 KB/s) - ‘spam.csv’ saved [471781/471781]



The data comes as a CSV file with two columns, "text" and "label". The "text" column holds the message text and the "label" column holds the labels "ham" and "spam". A "ham" message is a message that is not considered spam.

# Assignment
The goal of the assignment is to build a neural network that classifies the given data. To do this you must:

*   Implement whatever data preprocessing you consider necessary.
*   Split the data into 70% training, 10% validation and 20% test.
*   Use the training and validation data for your experiments and use the test set only to report the final result.

For the network design you may use a recurrent neural network or a transformer-based network. The goal is not to reach the highest possible performance but to understand which design decisions affect the solution of a problem like this one. What is required (as always) is that you discuss the results and the decisions you made.



# Code

## Packages




In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import numpy as np
import matplotlib.pyplot as plt
from torchvision import datasets, models, transforms
import time
import os
import copy
from torchvision.io import read_image
from PIL import Image
import pandas as pd

## Data preprocessing

In [None]:
# Load the data from the CSV file

data = pd.read_csv("spam.csv")

# Drop the rows we cannot process: missing text/label and labels other than "spam"/"ham"
data = data.dropna(subset=["text", "label"])
data = data[(data["label"] == 'spam') | (data["label"] == 'ham')]



Here we keep only the rows that can be processed, that is, rows with a valid label (ham/spam) and a text that is not NaN. To do so we call dropna and keep only the rows labelled "ham" or "spam".
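The filtering rule described above can be illustrated on a few made-up rows. This is a minimal sketch using only the standard library; the sample rows and the helper name `keep_valid_rows` are assumptions for illustration and are not part of the notebook.

```python
import csv
import io

VALID_LABELS = {"ham", "spam"}

def keep_valid_rows(rows):
    """Keep rows with non-empty text and a label of 'ham' or 'spam'."""
    return [r for r in rows if r["text"].strip() and r["label"] in VALID_LABELS]

# Made-up sample standing in for spam.csv (hypothetical content)
sample_csv = (
    "text,label\n"
    "see you at 5,ham\n"
    "WIN a free prize now,spam\n"
    ",ham\n"
    "hello,unknown\n"
)

rows = list(csv.DictReader(io.StringIO(sample_csv)))
valid_rows = keep_valid_rows(rows)
print(len(rows), "rows read,", len(valid_rows), "kept")
```

The row with empty text and the row with an unknown label are both discarded, which is the same effect the pandas cell aims for.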

In [None]:
# Extract the messages and the labels as lists
list_text=data["text"].values.tolist()
list_label=data["label"].values.tolist()

# Join all messages from the dataframe into one lowercase string without punctuation
text = ' '.join([str(elem) for elem in list_text])

from string import punctuation
text = text.lower()

all_text = ''.join([c for c in text if c not in punctuation])
print(punctuation)


!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [None]:
# Remove line breaks and join the whole text back together
reviews_split = all_text.split('\n')
all_text = ''.join(reviews_split)

words = all_text.split()
# Add "i'd" to the word list so that the vocabulary lookup below does not fail on a token that still contains punctuation
words.append("i'd")
print(words[:20])

['go', 'until', 'jurong', 'point', 'crazy', 'available', 'only', 'in', 'bugis', 'n', 'great', 'world', 'la', 'e', 'buffet', 'cine', 'there', 'got', 'amore', 'wat']
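The character-by-character filter used above can also be written with `str.translate`, which removes all punctuation in one call. The following is a minimal sketch of that alternative; `clean_message` is a hypothetical helper name and the example messages are made up.

```python
from string import punctuation

# Translation table that deletes every ASCII punctuation character
_STRIP_PUNCT = str.maketrans("", "", punctuation)

def clean_message(message):
    """Lowercase a message and remove ASCII punctuation."""
    return message.lower().translate(_STRIP_PUNCT)

cleaned = clean_message("Hi John, what's up?")
tokens = clean_message("FREE entry!!! Text WIN to 87121").split()
print(cleaned)
print(tokens)
```

Both approaches produce the same cleaned text; the translation table only avoids rebuilding the string one character at a time.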


## Word encoding

In [None]:
from collections import Counter

counts = Counter(words) # Word-frequency dictionary: keys are words, values are their counts
vocab = sorted(counts, key=counts.get, reverse=True) # Sort the words by decreasing frequency
vocab_to_int = {word: ii for ii, word in enumerate(vocab, 1)} # Map each word to an integer index, starting at 1 (0 is reserved for padding)

# Lowercase every message and strip its punctuation.
# Iterate over the whole list so that the last message is processed too.
list_review = [''.join(c for c in str(message).lower() if c not in punctuation)
               for message in list_text]

print(list_review)
# Now convert every word of each review into its index

reviews_ints = []
for review in list_review:
    reviews_ints.append([vocab_to_int[word] for word in review.split()])



In [None]:
# Each review is now a sequence of integer indices; print the cleaned text and the number of encoded reviews
print(reviews_split[0])
print(len(reviews_ints))

go until jurong point crazy available only in bugis n great world la e buffet cine there got amore wat ok lar joking wif u oni free entry in 2 a wkly comp to win fa cup final tkts 21st may 2005 text fa to 87121 to receive entry questionstd txt ratetcs apply 08452810075over18s u dun say so early hor u c already then say nah i dont think he goes to usf he lives around here though freemsg hey there darling its been 3 weeks now and no word back id like some fun you up for it still tb ok xxx std chgs to send å£150 to rcv even my brother is not like to speak with me they treat me like aids patent as per your request melle melle oru minnaminunginte nurungu vettam has been set as your callertune for all callers press 9 to copy your friends callertune winner as a valued network customer you have been selected to receivea å£900 prize reward to claim call 09061701461 claim code kl341 valid 12 hours only had your mobile 11 months or more u r entitled to update to the latest colour mobiles with cam

In [None]:
# How many words are in the dictionary?
print('Unique words:', len(vocab_to_int))
print()

Unique words: 9198
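To make the word-to-index mapping concrete, here is a minimal sketch of the same construction on a toy word list. The helper name `build_vocab` and the toy words are assumptions for illustration; the logic mirrors the `Counter`/`sorted`/`enumerate(…, 1)` steps of the notebook.

```python
from collections import Counter

def build_vocab(words):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(words)
    ordered = sorted(counts, key=counts.get, reverse=True)
    return {word: index for index, word in enumerate(ordered, 1)}

# Toy corpus (made up)
toy_words = "free call now free win call free".split()
toy_vocab = build_vocab(toy_words)

# Encode a short message with the toy vocabulary
encoded = [toy_vocab[w] for w in "call free now".split()]
print(toy_vocab)
print(encoded)
```

Because the indices start at 1, the value 0 never appears in an encoded message and stays available as the padding symbol.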



## Label encoding

In [None]:
import numpy as np

labels_split = list_label
print(labels_split)
encoded_labels = np.array([1 if label == 'ham' else 0 for label in labels_split])
print(encoded_labels)

['ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'spam', 'ham', 'spam', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'spam', 'ham', 'spam', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'ham', 'spam', 'spam', 'ham', 'ham', 'ham', 'spam', 'ham', 'ham', 'ham', '

## Sequence lengths

In [None]:
# Compute some statistics about the data
review_lens = Counter([len(x) for x in reviews_ints]) # Count how many reviews have each length (in words)
print("Zero-length reviews:", review_lens[0])
print('Maximum length:', max(review_lens))

Zero-length reviews: 2
Maximum length: 171


In [None]:
print('Reviews before removal:', len(reviews_ints))

# Collect the indices of all reviews with length > 0
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review)!=0]

# Keep only the reviews with length > 0
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]

# Do the same with the labels
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print('Reviews after removal:', len(reviews_ints))

Reviews before removal: 5362
Reviews after removal: 5360


Now that the data has been cleaned, we can implement the padding and split the data to train the network.

## Padding

In [None]:
def pad_features(reviews_ints, seq_length):
  features = np.zeros((len(reviews_ints), seq_length), dtype=int)

  # Place each review in the matrix: left-padded with zeros, truncated to seq_length
  for i, row in enumerate(reviews_ints):
    features[i, -len(row):] = np.array(row)[:seq_length]
  
  return features
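To make the padding rule concrete, here is a pure-Python sketch of what happens to a single sequence: it is truncated to its first `seq_length` tokens and then left-padded with zeros. `left_pad` is a hypothetical helper written for illustration; it is not the notebook's function, and it also accepts an empty sequence.

```python
def left_pad(sequence, seq_length, pad_value=0):
    """Truncate to the first seq_length items, then pad on the left."""
    truncated = list(sequence)[:seq_length]
    return [pad_value] * (seq_length - len(truncated)) + truncated

short = left_pad([5, 6, 7], 5)            # shorter than seq_length: padded
long = left_pad([1, 2, 3, 4, 5, 6], 4)    # longer than seq_length: truncated
empty = left_pad([], 3)                   # empty: all padding
print(short, long, empty)
```

Left padding keeps the real tokens at the end of the sequence, which is where the LSTM output is read in the model below.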

In [None]:
# Lengths of a few reviews, to help choose the padding length:
print(f"Review length: {len(reviews_ints[0])}")
print(f"Review length: {len(reviews_ints[1])}")
print(f"Review length: {len(reviews_ints[20])}")
print(f"Review length: {len(reviews_ints[200])}")

Review length: 20
Review length: 6
Review length: 8
Review length: 13


In [None]:
# Try out the padding
seq_length = 20

features = pad_features(reviews_ints, seq_length=seq_length)

print(features.shape)
print(features[:30,:10])

(5360, 20)
[[  45  437 4221  775  693  731   64    8 1201   89]
 [   0    0    0    0    0    0    0    0    0    0]
 [  46  438    8   22    4  732  876    1  177 1778]
 [   0    0    0    0    0    0    0    0    0    6]
 [   0    0    0    0    0    0    0  924    2   49]
 [ 824  114   68 1543   42  102  194  576   21    7]
 [   0    0    0    0  199   11  604    9   25   57]
 [  72  213   13 1106 1359 1359 1783 2159 2160 2161]
 [ 661   72    4  776  395  200    3   17  102  396]
 [ 128   13   90 1004  777   27  120    6   87 1107]
 [  23  219   34   80  208    7    2   49   67    1]
 [1784 2817    1  177  164   47  694    1 2818  579]
 [ 189    3   17  168    4  165  112   46 2162    8]
 [ 163  102 1545   12    5  144  504    1  411    3]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0 4230    1  267   13  735 1208    5  928  929]
 [   0    0    0    0    0    0    0    0    0    0]
 [   0  826    6  367   52   22 2163  209  263  147]
 [   0    0    0    0    0    0    

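## Data split

As the instructions require, we split the data into training, validation and test sets. We first take 70% of the data for training and then divide the remaining 30% between validation and test. Note that with the 0.66 fraction used below, the validation set receives about two thirds of the remainder (about 20% of the data) and the test set about one third (about 10%), which is the reverse of the 10%/20% requested in the assignment.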

In [None]:
split_frac = 0.7

## split data into training, validation, and test data (features and labels, x and y)
split_idx = int(len(features)*0.7)
train_x, remaining_x = features[:split_idx], features[split_idx:]
train_y, remaining_y = encoded_labels[:split_idx], encoded_labels[split_idx:]

test_idx = int(len(remaining_x)*0.66)
val_x, test_x = remaining_x[:test_idx], remaining_x[test_idx:]
val_y, test_y = remaining_y[:test_idx], remaining_y[test_idx:]

## print out the shapes of your resultant feature data
print("\t\t\tFeatures:")
print("Train set: \t\t{}".format(train_x.shape),
      "\nValidation set: \t{}".format(val_x.shape),
      "\nTest set: \t\t{}".format(test_x.shape))

			Features:
Train set: 		(3751, 20) 
Validation set: 	(1061, 20) 
Test set: 		(548, 20)
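For reference, the index arithmetic for a 70% / 10% / 20% split can be sketched as follows. This is an illustration with a hypothetical helper `split_sizes`, not the code that produced the shapes above; it uses integer arithmetic to avoid floating-point rounding and, like the notebook, splits sequentially without shuffling.

```python
def split_sizes(n_samples, train_pct=70, val_pct=10):
    """Return (n_train, n_val, n_test); the test set takes the remainder."""
    n_train = n_samples * train_pct // 100
    n_val = n_samples * val_pct // 100
    n_test = n_samples - n_train - n_val
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(5360)

# Sequential index ranges for each subset
indices = list(range(5360))
train_idx = indices[:n_train]
val_idx = indices[n_train:n_train + n_val]
test_idx = indices[n_train + n_val:]
print(len(train_idx), len(val_idx), len(test_idx))
```

With 5360 samples this gives 3752, 536 and 1072 samples respectively.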


## Datasets and DataLoaders

In [None]:
import torch
from torch.utils.data import TensorDataset, DataLoader

# create the Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 32

# Set drop_last=True; otherwise the last, smaller batch causes a dimension mismatch during training (the hidden state is sized for a full batch)

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size, drop_last=True)
valid_loader = DataLoader(valid_data, shuffle=True, batch_size=batch_size,drop_last=True)
test_loader = DataLoader(test_data, shuffle=False, batch_size=batch_size, drop_last=True)
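With `drop_last=True`, each loader yields only full batches, so a few samples are skipped in every pass. The sketch below computes how many batches and how many skipped samples that implies for the three set sizes printed earlier; `batches_with_drop_last` is a hypothetical helper for illustration.

```python
def batches_with_drop_last(n_samples, batch_size):
    """Return (number of full batches, number of samples left out)."""
    return divmod(n_samples, batch_size)

summary = {}
for name, n_samples in [("train", 3751), ("validation", 1061), ("test", 548)]:
    n_batches, n_dropped = batches_with_drop_last(n_samples, 32)
    summary[name] = (n_batches, n_dropped)
    print(f"{name}: {n_batches} batches, {n_dropped} samples dropped")
```

One consequence is that a metric accumulated over the loader should be divided by the number of samples actually seen (`n_batches * batch_size`), not by `len(dataset)`. With `shuffle=True`, the dropped samples change from one epoch to the next.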





## RNN

In [None]:
# Check whether a GPU is available
train_on_gpu=torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


In [None]:
# Define the neural network
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # Embedding and LSTM layers
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers,
                            dropout=drop_prob, batch_first=True)
        
        # dropout
        self.dropout = nn.Dropout(drop_prob)
        
        # Linear layer and sigmoid
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()

    def forward(self, x, hidden):
        
        embeds = self.embedding(x)
        lstm_out, hidden = self.lstm(embeds, hidden)
                
        # Keep only the LSTM output at the last time step
        lstm_out = lstm_out[:,-1,:]

        # dropout and fully-connected layer
        out = self.dropout(lstm_out)
        out = self.fc(out)

        # sigmoid
        sig_out = self.sig(out)

        # return the sigmoid output and the last hidden state
        return sig_out, hidden
    
    
    def init_hidden(self, batch_size):
        # Create two new tensors of size n_layers x batch_size x hidden_dim,
        # initialized to zero, for the LSTM hidden state and cell state
        weight = next(self.parameters()).data
        
        if(train_on_gpu):
          hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                   weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
          hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                   weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        
        return hidden

In [None]:
# Instantiate the network
vocab_size = len(vocab_to_int) + 1 # +1 for zero padding + our word tokens
output_size = 1
embedding_dim = 400 
hidden_dim = 256
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (embedding): Embedding(9199, 400)
  (lstm): LSTM(400, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sig): Sigmoid()
)
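The size of the printed architecture can be checked by hand. The sketch below counts the trainable parameters from the layer dimensions alone, assuming the standard PyTorch LSTM layout (four gates, each with input weights, recurrent weights and two bias vectors); the helper names are hypothetical.

```python
def lstm_layer_params(input_dim, hidden_dim):
    """Parameters of one LSTM layer: 4 gates x (W_ih + W_hh + b_ih + b_hh)."""
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + 2 * hidden_dim)

def rnn_classifier_params(vocab_size, embedding_dim, hidden_dim, n_layers, output_size):
    """Embedding + stacked LSTM + final linear layer."""
    total = vocab_size * embedding_dim          # embedding table
    layer_input = embedding_dim
    for _ in range(n_layers):
        total += lstm_layer_params(layer_input, hidden_dim)
        layer_input = hidden_dim                # deeper layers read the hidden state
    total += hidden_dim * output_size + output_size  # linear weights + bias
    return total

embedding_params = 9199 * 400
total_params = rnn_classifier_params(9199, 400, 256, 2, 1)
print(f"embedding: {embedding_params:,}  total: {total_params:,}")
```

About three quarters of the parameters sit in the embedding table, so `embedding_dim` is the hyperparameter with the largest effect on model size here.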


### Training

In [None]:
# loss and optimization functions
lr=0.001

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

In [None]:
# training params

epochs = 4

counter = 0
print_every = 100
clip=5 # gradient clipping

# Move the network to the GPU
if(train_on_gpu):
    net.cuda()

net.train()
# Training loop
for e in range(epochs):
    # Initialize the hidden state
    h = net.init_hidden(batch_size)

    # Loop over batches
    for inputs, labels in train_loader:
        counter += 1

        if(train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Detach the hidden state from its history; otherwise backprop
        # would run through every previous step of the loop
        h = tuple([each.data for each in h])

        net.zero_grad()

        # Forward pass
        output, h = net(inputs, h)

        # Compute the loss and backpropagate
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        # gradient clipping
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # Progress messages
        if counter % print_every == 0:
            # Validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Epoch: 1/4... Step: 100... Loss: 0.092739... Val Loss: 0.124426
Epoch: 2/4... Step: 200... Loss: 0.018942... Val Loss: 0.125890
Epoch: 3/4... Step: 300... Loss: 0.001385... Val Loss: 0.130187
Epoch: 4/4... Step: 400... Loss: 0.001556... Val Loss: 0.113190
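The loss values above come from `nn.BCELoss`, which averages the binary cross-entropy over the batch. As a rough illustration of what those numbers mean, the following standard-library sketch computes the same quantity for a few made-up predictions; `bce` and `mean_bce` are hypothetical helper names.

```python
import math

def bce(p, y):
    """Binary cross-entropy for one predicted probability p and label y (0 or 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def mean_bce(probs, labels):
    """Average binary cross-entropy over a batch."""
    return sum(bce(p, y) for p, y in zip(probs, labels)) / len(labels)

confident_right = bce(0.99, 1)   # small loss
confident_wrong = bce(0.99, 0)   # large loss
batch_loss = mean_bce([0.9, 0.2, 0.8], [1, 0, 1])
print(round(confident_right, 4), round(confident_wrong, 4), round(batch_loss, 4))
```

A loss around 0.1 therefore corresponds to predictions that are, on average, about 90% confident in the correct class.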


### Testing

In [None]:
# Compute test accuracy

test_losses = [] # track loss
num_correct = 0
num_seen = 0 # samples actually evaluated (drop_last=True discards the last incomplete batch)

# Initialize the hidden state
h = net.init_hidden(batch_size)

net.eval()
for inputs, labels in test_loader:

    h = tuple([each.data for each in h])

    if(train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()
    
    output, h = net(inputs, h)
    
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())
    
    # Convert probabilities to classes (0, 1)
    pred = torch.round(output.squeeze())  
    
    # Compare predictions with labels
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)
    num_seen += labels.size(0)


# -- stats! -- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# Test accuracy over the samples the loader actually yielded
test_acc = num_correct/num_seen
print("Test accuracy: {:.3f}".format(test_acc))

# Approximate run time: 11 min


Test loss: 0.068
Test accuracy: 0.973


### Inference on messages

In [None]:
from string import punctuation

def tokenize_message(test_review):
    test_review = test_review.lower() 
    test_text = ''.join([c for c in test_review if c not in punctuation])
    
    test_words = test_text.split()
    
    # Map words to indices, skipping words that are not in the training vocabulary
    test_ints = []
    test_ints.append([vocab_to_int[word] for word in test_words if word in vocab_to_int])
    
    return test_ints

In [None]:
def predict(net, test_review, sequence_length=200):
      
    net.eval()
    
    test_ints = tokenize_message(test_review)
    
    seq_length = sequence_length
    features = pad_features(test_ints, seq_length)
    
    feature_tensor = torch.from_numpy(features)
    
    batch_size = feature_tensor.size(0)
    
    h = net.init_hidden(batch_size)
    
    if(train_on_gpu):
      feature_tensor = feature_tensor.cuda()
      
    output, h = net(feature_tensor, h)
    
    pred = torch.round(output.squeeze())
    print('Predicted value, before rounding: {:.6f}'.format(output.item()))
    
    # Print the predicted class: 1 means ham, 0 means spam
    if(pred.item()==1):
      print('ham')
    else:
      print('spam')

As a real-world check, we try the network on two example messages, one ham and one spam.

In [None]:
# example ham message
test_message_ham = "hi john, whats'up ?"

# example spam message
test_message_spam = 'click the link in the next txt message or click here>> '

In [None]:

seq_length=20
predict(net, test_message_ham, seq_length)
predict(net, test_message_spam, seq_length)

Predicted value, before rounding: 0.999495
ham
Predicted value, before rounding: 0.000436
spam
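The final decision is a threshold on the sigmoid output, which the network treats as the probability of "ham". A minimal sketch of that rule, with a hypothetical helper `label_from_probability` and an adjustable threshold (the notebook itself rounds at 0.5):

```python
def label_from_probability(p_ham, threshold=0.5):
    """Return 'ham' when the predicted ham probability exceeds the threshold."""
    return "ham" if p_ham > threshold else "spam"

first = label_from_probability(0.999495)
second = label_from_probability(0.000436)
strict = label_from_probability(0.7, threshold=0.9)
print(first, second, strict)
```

Raising the threshold makes the filter flag more messages as spam, at the cost of more legitimate messages being flagged; lowering it has the opposite effect.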


## BERT

In [None]:
! pip install datasets transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.7.1-py3-none-any.whl (451 kB)
Collecting transformers
  Downloading transformers-4.25.1-py3-none-any.whl (5.8 MB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting huggingface-hub<1.0.0,>=0.2.0
  Downloading huggingface_hub-0.11.1-py3-none-any.whl (182 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21

### Data preparation and tokenization

In [None]:
# Convert the data to lists and encode the labels as 1 for ham and 0 for spam
text=data["text"].values.tolist()
label=data["label"].values.tolist()
label = [1 if labels == 'ham' else 0 for labels in label]

In [None]:
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer
from datasets import load_dataset, Dataset

# Create the tokenizer for the model
model_name = "bert-base-uncased" # since we will use BERT
tokenizer = BertTokenizer.from_pretrained(model_name)

# Split the data: 70% train, 10% validation, 20% test
X_train, X_rem, y_train, y_rem = train_test_split(text, label, test_size=0.3)
X_val, X_test, y_val, y_test = train_test_split(X_rem, y_rem, test_size=0.666)
X_train_tokenized = tokenizer(X_train, padding=True, truncation=True, max_length=512)
X_val_tokenized = tokenizer(X_val, padding=True, truncation=True, max_length=512)
X_test_tokenized = tokenizer(X_test, padding=True, truncation=True, max_length=512)

loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/0a6aa9128b6194f4f3c4db429b6cb4891cdb421b/vocab.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/0a6aa9128b6194f4f3c4db429b6cb4891cdb421b/tokenizer_config.json
loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/0a6aa9128b6194f4f3c4db429b6cb4891cdb421b/config.json
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_positio

In [None]:
# Create torch dataset
class Dataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels=None):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        if self.labels:
            item["labels"] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.encodings["input_ids"])

train_dataset = Dataset(X_train_tokenized, y_train)
val_dataset = Dataset(X_val_tokenized, y_val)
test_dataset = Dataset(X_test_tokenized, y_test)

In [None]:
# Accuracy metric
import numpy as np
from datasets import load_metric

metric = load_metric("accuracy")
def compute_metrics(p):
    return metric.compute(predictions=np.argmax(p.predictions, axis=1), references=p.label_ids)
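`compute_metrics` turns the two logits per message into a class with `argmax` and then compares against the reference labels. The same computation can be sketched without any library on a few made-up logits; `argmax` and `accuracy_from_logits` are hypothetical helpers for illustration.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy_from_logits(logits, references):
    """Fraction of rows whose argmax matches the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

# Made-up logits: column 0 = spam, column 1 = ham
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_references = [1, 0, 0, 0]
toy_accuracy = accuracy_from_logits(toy_logits, toy_references)
print(toy_accuracy)
```

Only the third row is misclassified here, so three of the four predictions are correct.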

### Model

In [None]:
# Create the model: BertForSequenceClassification -> sequence classification
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels = 2,
    output_attentions = False,
    output_hidden_states = False,
)

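# Learning rates tried and the test accuracy reported for each: 100e-5 -> 0.9935, 5e-5 -> 0.9888, 2e-5 -> 0.9935.
# Note: this optimizer is only used if it is passed to the Trainer (optimizers=(optimizer, None)); otherwise the Trainer builds its own.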
optimizer = torch.optim.AdamW(model.parameters(), 
                              lr = 100e-5,
                              eps = 1e-08
                              )

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/0a6aa9128b6194f4f3c4db429b6cb4891cdb421b/config.json
Model config BertConfig {
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.25.1",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--bert-base-uncased/snapshots/0a6aa9128b6194f4f3c4db429b6cb4891cdb421b/pytorch_model.bin
Some weights of the model check

In [None]:
# The training parameters are configured in a TrainingArguments object

from transformers import TrainingArguments

training_args = TrainingArguments(output_dir="test_trainer", evaluation_strategy="epoch")

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).


In [None]:
# Create a Trainer object

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
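The `Trainer` above is built without an `optimizers` argument, in which case it creates its own AdamW optimizer from the values in `TrainingArguments` (the default `learning_rate` is 5e-5). A minimal sketch of how a learning rate could be set explicitly; the helper name `build_trainer`, the value 2e-5 and the output directory are illustrative assumptions, and the four objects passed in are the ones defined in the notebook.

```python
from transformers import Trainer, TrainingArguments

def build_trainer(model, train_dataset, val_dataset, compute_metrics, learning_rate=2e-5):
    """Build a Trainer whose internal AdamW optimizer uses the given learning rate."""
    args = TrainingArguments(
        output_dir="test_trainer_lr",   # hypothetical directory name
        learning_rate=learning_rate,    # read by the Trainer when it creates its optimizer
    )
    return Trainer(
        model=model,
        args=args,
        compute_metrics=compute_metrics,
        train_dataset=train_dataset,
        eval_dataset=val_dataset,
    )
```

Calling `build_trainer(model, train_dataset, val_dataset, compute_metrics, learning_rate=2e-5).train()` would then train with that rate. Alternatively, an already constructed optimizer can be handed over through the `optimizers=(optimizer, None)` argument of `Trainer`.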

### Train

In [None]:
# Run the training:

# Approximate run time: 7 min
train_results = trainer.train()
trainer.save_model()
trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
trainer.save_state()

***** Running training *****
  Num examples = 3754
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 1410
  Number of trainable parameters = 109483778




***** Running Evaluation *****
  Num examples = 537
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-500
Configuration saved in test_trainer/checkpoint-500/config.json
Model weights saved in test_trainer/checkpoint-500/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 537
  Batch size = 8
Saving model checkpoint to test_trainer/checkpoint-1000
Configuration saved in test_trainer/checkpoint-1000/config.json
Model weights saved in test_trainer/checkpoint-1000/pytorch_model.bin


| Epoch | Training Loss | Validation Loss | Accuracy |
|---|---|---|---|
| 1 | No log | 0.044764 | 0.990689 |
| 2 | 0.077800 | 0.067439 | 0.986965 |
| 3 | 0.015200 | 0.066944 | 0.988827 |


***** Running Evaluation *****
  Num examples = 537
  Batch size = 8


Training completed. Do not forget to share your model on huggingface.co/models =)


Saving model checkpoint to test_trainer
Configuration saved in test_trainer/config.json
Model weights saved in test_trainer/pytorch_model.bin


***** train metrics *****
  epoch                    =        3.0
  total_flos               =  1002530GF
  train_loss               =     0.0333
  train_runtime            = 0:07:24.07
  train_samples_per_second =      25.36
  train_steps_per_second   =      3.175


### Test

In [None]:
# Evaluate on the test data

metrics = trainer.evaluate(test_dataset)
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)

***** Running Evaluation *****
  Num examples = 1072
  Batch size = 8


***** eval metrics *****
  epoch                   =        3.0
  eval_accuracy           =     0.9935
  eval_loss               =      0.059
  eval_runtime            = 0:00:17.67
  eval_samples_per_second =     60.651
  eval_steps_per_second   =      7.581


Three different learning rates were tried, 100e-5, 5e-5 and 2e-5, giving test accuracies of 0.9935, 0.9888 and 0.9935 respectively.

# Comments

## RNN (LSTM)

To classify the data, we first used an RNN of the LSTM type.

With the LSTM, the best batch size for this particular problem was 32. We have no theoretical explanation for this; the value was found by trial and error (8 and 16 were also tried).

The padding length that gave the best results was 20. Longer paddings (100 and 200) were also tried, and in those cases the test accuracy dropped. A likely cause is the length of the messages themselves, which is closer to 20 than to 100 or 200 words, so longer sequences consist mostly of padding.
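One way to choose the padding length from the data rather than by trial and error is to look at a percentile of the message lengths. The sketch below is an illustration on a toy list of lengths (a few values taken from the printed examples, the rest made up); `length_percentile` is a hypothetical helper using the nearest-rank definition.

```python
def length_percentile(lengths, pct):
    """Nearest-rank percentile (pct is an integer from 1 to 100)."""
    ordered = sorted(lengths)
    rank = (pct * len(ordered) + 99) // 100   # ceiling of pct/100 * n
    return ordered[max(rank, 1) - 1]

# Toy message lengths in words (partly made up)
toy_lengths = [20, 6, 8, 13, 171, 9, 15, 22, 7, 11]
median_length = length_percentile(toy_lengths, 50)
p90_length = length_percentile(toy_lengths, 90)
max_length = length_percentile(toy_lengths, 100)
print(median_length, p90_length, max_length)
```

A single very long message (171 words here) dominates the maximum but barely moves the 90th percentile, which is why a padding length near the typical message length tends to waste less of the sequence on zeros.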



## BERT


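Training yields a high accuracy from the first epoch because the network is pre-trained.

BERT reached a higher test accuracy than the LSTM (0.9935 versus 0.973). We attribute this mainly to the pre-trained model and the Hugging Face tokenizer, which remove much of the dependence on hyperparameters such as the padding length, the embeddings, the batch size and the learning rate: reasonable defaults are available without detailed knowledge of the data. The two numbers come from different test splits (1072 and 548 messages), so the comparison is indicative rather than exact.

For BERT, the accuracy stayed roughly constant across the learning rates we tried. We attributed this in part to the `eps` term of the optimizer, which helps the numerical stability of training and so counteracts the instability of very high rates. A more direct explanation is that the `optimizer` object created above is never passed to the `Trainer`, so the `Trainer` builds its own AdamW optimizer with the default learning rate of `TrainingArguments` (5e-5). The runs therefore most likely shared the same learning rate, and the small differences in accuracy come from the random split and initialization.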

# Conclusion

In this assignment we compared two ways of classifying messages as "spam" or not.
Comparing the two, the transformer pipeline ran faster than the RNN one in our runs (approximately 7 minutes versus 11 minutes). The test accuracies also show that a transformer such as BERT achieves better accuracy. For this kind of text classification, transformers therefore appear to be more effective than an RNN (LSTM).