
# **Language models based on artificial neural networks**

## **Recurrent neural networks (RNN)**

### **LSTM**

AG_NEWS is a text classification dataset widely used in natural language processing (NLP). It contains news articles from four main categories:

1. **World**: News about global events, such as international politics, international relations, and world news in general.

2. **Sports**: News about sporting events, match results, and national and international competitions.

3. **Business**: News about business, finance, the economy, companies, earnings reports, and other economic topics.

4. **Sci/Tech**: News about science and technology, including scientific advances, new technology, gadgets, and research.

Each AG_NEWS instance consists of the title and body of a news article, together with a label indicating the category it belongs to. As loaded here through torchtext, each sample is a `(label, text)` pair with integer labels from 1 to 4.
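The label convention can be sketched in a few lines. The sample below is hypothetical (it is not taken from the dataset), and `CLASS_NAMES` and `describe` are illustrative names. The sketch assumes labels run from 1 to 4 in the order listed above, which is consistent with the `int(x) - 1` shift applied later in this notebook.

```python
# Class names in label order; AG_NEWS labels are assumed to run from 1 to 4.
CLASS_NAMES = ["World", "Sports", "Business", "Sci/Tech"]

def describe(sample):
    """Return (zero_based_label, class_name, text) for a (label, text) pair."""
    label, text = sample
    index = int(label) - 1  # shift 1..4 to 0..3, the range a classifier head uses
    return index, CLASS_NAMES[index], text

# Hypothetical sample, written for illustration only.
sample = (2, "Home side wins the cup final after extra time.")
print(describe(sample))
```

Shifting the labels to start at 0 matters because a loss such as `CrossEntropyLoss` expects class indices in the range 0 to `num_class - 1`.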

In [62]:
from torchtext import datasets
from torchtext.data import to_map_style_dataset
import numpy as np

# Load the dataset
train_iter, test_iter = datasets.AG_NEWS(split=('train', 'test'))

train_ds = to_map_style_dataset(train_iter)
test_ds = to_map_style_dataset(test_iter)

train = np.array(train_ds)
test = np.array(test_ds)

In [117]:
# Create vocabulary and embedding

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import GloVe

tokenizer = get_tokenizer("basic_english")
glove_vectors = GloVe(name='6B', dim=300)

vocab = build_vocab_from_iterator(map(lambda x: tokenizer(x[1]), train_iter), specials=['<pad>','<unk>'])
vocab.set_default_index(vocab["<unk>"])
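To make the role of the special tokens concrete, here is a rough pure-Python sketch of a frequency-ordered vocabulary with a default index. `build_vocab` and `lookup` are hypothetical helpers, not torchtext functions, and the alphabetical tie-breaking is an assumption of the sketch; the real `build_vocab_from_iterator` may order equally frequent tokens differently.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<pad>", "<unk>")):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Ties are broken alphabetically here; this is an assumption of the sketch.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + [tok for tok in ordered if tok not in specials]
    return {tok: i for i, tok in enumerate(itos)}

def lookup(stoi, tokens, default="<unk>"):
    """Convert tokens to indices, sending unseen tokens to the default index."""
    return [stoi.get(tok, stoi[default]) for tok in tokens]

# Toy corpus, made up for illustration.
corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
stoi = build_vocab(corpus)
print(lookup(stoi, ["the", "cat", "unicorn"]))
```

The unseen token `unicorn` falls back to index 1, the `<unk>` slot, which is the same behaviour the next cell shows for `supercalifragilisticexpialidocious`.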

In [64]:
print("Vocabulary size:", len(vocab), "tokens")
print("Tokenization of the sentence 'Here is an example sentence':", tokenizer("Here is an example sentence"))
print("Indices of the words 'here', 'is', 'an', 'example', 'supercalifragilisticexpialidocious':", vocab(['here', 'is', 'an', 'example', 'supercalifragilisticexpialidocious']))
print("Words corresponding to the indices 475, 21, 30, 5297, 0:", vocab.lookup_tokens([475, 21, 30, 5297, 0]))
print("First ten words of the vocabulary:", vocab.get_itos()[:10])

Vocabulary size: 95812 tokens
Tokenization of the sentence 'Here is an example sentence': ['here', 'is', 'an', 'example', 'sentence']
Indices of the words 'here', 'is', 'an', 'example', 'supercalifragilisticexpialidocious': [476, 22, 31, 5298, 1]
Words corresponding to the indices 475, 21, 30, 5297, 0: ['version', 'at', 'from', 'establish', '<pad>']
First ten words of the vocabulary: ['<pad>', '<unk>', '.', 'the', ',', 'to', 'a', 'of', 'in', 'and']


In [65]:
# Map raw text to a list of vocabulary indices
text_pipeline = lambda x: vocab(tokenizer(x))
# Shift labels from 1..4 to 0..3
label_pipeline = lambda x: int(x) - 1

print("Indices for the sentence 'Here is an example sentence':", text_pipeline("Here is an example sentence"))

Indices for the sentence 'Here is an example sentence': [476, 22, 31, 5298, 2994]


In [138]:
from torch.utils.data import DataLoader
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    """Turn a list of (label, text) samples into (labels, padded GloVe embeddings)."""
    # Tokenize and embed each text once: one (seq_len, embed_dim) tensor per sample
    embedded = [glove_vectors.get_vecs_by_tokens(tokenizer(text)) for _, text in batch]
    labels = [label_pipeline(label) for label, _ in batch]
    max_len = max(emb.size(0) for emb in embedded)

    text_list = []
    for emb in embedded:
        # Right-pad with zero vectors up to the longest sequence in the batch
        padding = torch.zeros(max_len - emb.size(0), emb.size(1))
        text_list.append(torch.cat((emb, padding), 0))

    # Shapes: labels (batch,), texts (batch, max_len, embed_dim)
    return torch.tensor(labels, dtype=torch.long), torch.stack(text_list)

# Use the map-style datasets so that shuffling works on any DataLoader version
train_dataloader = DataLoader(
    train_ds, batch_size=64, shuffle=True, collate_fn=collate_batch
)

# Shuffling is unnecessary for evaluation
test_dataloader = DataLoader(
    test_ds, batch_size=64, shuffle=False, collate_fn=collate_batch
)
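The padding step in the collate function can be illustrated without PyTorch. The sketch below uses plain lists and a tiny embedding dimension; `pad_batch` is a hypothetical helper and the input vectors are made up.

```python
def pad_batch(sequences, dim):
    """Right-pad each sequence of `dim`-sized vectors with zero vectors
    so that all sequences share the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = [
        seq + [[0.0] * dim for _ in range(max_len - len(seq))]
        for seq in sequences
    ]
    lengths = [len(seq) for seq in sequences]  # original lengths, before padding
    return padded, lengths

batch = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # three tokens
    [[0.7, 0.8]],                          # one token
]
padded, lengths = pad_batch(batch, dim=2)
print(padded[1])  # [[0.7, 0.8], [0.0, 0.0], [0.0, 0.0]]
print(lengths)    # [3, 1]
```

Returning the original lengths alongside the padded batch is optional, but it lets later code tell real tokens from padding.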

To check that the batches are built correctly, we print the first four instances:

In [139]:
for batch in train_dataloader:
    print(batch[1][:4])
    print("\n")
    print(batch[0][:4])
    print("\n")
    print(batch[1].shape)
    break

tensor([[[ 0.0148,  0.2784, -0.5527,  ..., -0.0333, -0.0765,  0.0316],
         [-0.2576, -0.0571, -0.6719,  ..., -0.1604,  0.0467, -0.0706],
         [-0.1674, -0.0937, -0.4510,  ..., -0.1884,  0.0358, -0.2040],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[ 0.2702,  0.2535,  0.4751,  ...,  0.3080,  0.0611, -0.3626],
         [-0.4440,  0.1282, -0.2525,  ..., -0.2004, -0.0822, -0.0626],
         [-0.3142, -0.3453,  0.0977,  ..., -0.3042, -0.1494, -0.3194],
         ...,
         [-0.6020,  0.0929, -0.8632,  ...,  0.0666,  0.1424,  0.1066],
         [-0.1741, -0.1192,  0.0176,  ...,  0.0382,  0.0282,  0.1074],
         [-0.1256,  0.0136,  0.1031,  ..., -0.3422, -0.0224,  0.1368]],

        [[ 0.0292,  0.1023,  0.1499,  ...,  0.2165, -0.2081,  0.2571],
         [ 0.4185,  0.3042,  0.0979,  ..., -0



------------------

In [140]:
import torch
import torch.nn as nn

class LSTMTextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        # vocab_size is unused: inputs arrive as precomputed GloVe embeddings
        # from collate_batch, so no nn.Embedding layer is needed here.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_class)

    def forward(self, embedded):
        # embedded: (batch, seq_len, embed_dim)
        lstm_out, _ = self.lstm(embedded)
        # Take the LSTM output at the last time step of each sequence
        last_output = lstm_out[:, -1, :]
        # Map the hidden state to one logit per class
        return self.fc(last_output)
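One consequence of right-padding is worth noting: `lstm_out[:, -1, :]` reads the state after the padding steps for every sequence shorter than the batch maximum. The toy below is not an LSTM. It uses a hypothetical scalar recurrence, h_t = 0.5 * h_{t-1} + x_t, purely to show that the value at the last padded position can differ from the value at the last real token. `run_recurrence` and the inputs are made up for illustration.

```python
def run_recurrence(inputs, decay=0.5):
    """Return the state after every step of h_t = decay * h_{t-1} + x_t."""
    states, h = [], 0.0
    for x in inputs:
        h = decay * h + x
        states.append(h)
    return states

padded = [1.0, 2.0, 0.0, 0.0]  # two real tokens followed by two padding steps
length = 2                     # number of real tokens

states = run_recurrence(padded)
print(states[-1])          # state read at the final (padded) position
print(states[length - 1])  # state read at the last real token
```

In PyTorch, a common remedy is to carry the sequence lengths with the batch and either pack the input with `torch.nn.utils.rnn.pack_padded_sequence` or index `lstm_out` at `length - 1` for each sequence. Neither is done in this notebook.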


In [141]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = LSTMTextClassificationModel(len(vocab), 300, 64, 4).to(device)
model.train()

# Run a single batch through the untrained model as a smoke test
for batch in train_dataloader:
    # Move the batch to the same device as the model
    label, text = batch[0].to(device), batch[1].to(device)
    print(text.shape)
    predicted_label = model(text)
    break

print(text[:4])
print(predicted_label[:4])
print(label[:4])

torch.Size([64, 108, 300])
tensor([[[ 0.1121,  0.3463, -0.1169,  ..., -0.1499,  0.2519, -0.2226],
         [-0.3620,  0.2663, -0.4980,  ...,  0.0977,  0.0702,  0.1810],
         [ 0.5321,  0.1240,  0.0354,  ..., -0.7439,  0.0667,  0.1422],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[-0.0470, -0.3525,  0.2789,  ...,  0.0649,  0.2298, -0.1721],
         [ 0.0561,  0.2004, -0.1366,  ..., -0.5013, -0.0210,  0.4466],
         [-0.0374, -0.3325, -0.5968,  ...,  0.8214,  0.6356,  0.3285],
         ...,
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000]],

        [[-0.2450, -0.1249, -0.2536,  ..., -0.4107, -0.6193,  0.0039],
         [-0.2853,

------------------

In [143]:
import time

# Hyperparameters
EPOCHS = 10  # number of passes over the training set
LR = 5  # initial learning rate

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)

def train(dataloader):
    """Train for one epoch; return the best accuracy over any logging interval."""
    model.train()
    total_acc, total_count, max_acc = 0, 0, 0
    log_interval = 500

    for idx, (label, text) in enumerate(dataloader):
        # Move the batch to the same device as the model
        label, text = label.to(device), text.to(device)
        optimizer.zero_grad()
        predicted_label = model(text)
        loss = criterion(predicted_label, label)
        loss.backward()
        # Clip the gradient norm to 0.1 to keep the updates bounded
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()

        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            interval_acc = total_acc / total_count
            print('| {:5d} batches '
                  '| accuracy {:8.3f}'.format(idx, interval_acc))
            max_acc = max(max_acc, interval_acc)
            # Reset the counters so each report covers only the latest interval
            total_acc, total_count = 0, 0
    return max_acc


def evaluate(dataloader):
    """Return the model's accuracy over the whole dataloader."""
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for label, text in dataloader:
            label, text = label.to(device), text.to(device)
            predicted_label = model(text)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count
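The call to `clip_grad_norm_` with a maximum norm of 0.1 rescales the gradients whenever their combined L2 norm exceeds that threshold, so with plain SGD each update has a norm of at most LR × 0.1 = 0.5. A minimal pure-Python sketch of norm clipping follows. `clip_by_norm` is a hypothetical helper working on a flat list of gradient values; it is not PyTorch's implementation.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale `grads` so that their L2 norm is at most `max_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the limit: leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]

grads = [3.0, 4.0]                  # L2 norm is 5.0
clipped = clip_by_norm(grads, 0.1)  # rescaled so the norm becomes 0.1
clipped_norm = math.sqrt(sum(g * g for g in clipped))
print(clipped, clipped_norm)
```

Clipping changes only the length of the gradient vector, not its direction, which is why the ratio between components is preserved.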

In [144]:
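# Hyperparameters (the batch size of 64 is set in the DataLoaders above)
EPOCHS = 10  # number of passes over the training set
LR = 5  # initial learning rate

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
# Each call to scheduler.step() multiplies the learning rate by gamma
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1, gamma=0.1)


for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()

    accu_train = train(train_dataloader)
    # The test split doubles as the validation set here
    accu_val = evaluate(test_dataloader)

    # Decay the learning rate when training accuracy exceeds validation accuracy
    if accu_train > accu_val:
        scheduler.step()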
    
    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 59)

|   500 batches | accuracy    0.248
|  1000 batches | accuracy    0.255
|  1500 batches | accuracy    0.354
-----------------------------------------------------------
| end of epoch   1 | time: 71.98s | valid accuracy    0.686 
-----------------------------------------------------------
|   500 batches | accuracy    0.702
|  1000 batches | accuracy    0.806
|  1500 batches | accuracy    0.845
-----------------------------------------------------------
| end of epoch   2 | time: 73.40s | valid accuracy    0.886 
-----------------------------------------------------------
|   500 batches | accuracy    0.886
|  1000 batches | accuracy    0.894
|  1500 batches | accuracy    0.912
-----------------------------------------------------------
| end of epoch   3 | time: 71.98s | valid accuracy    0.901 
-----------------------------------------------------------
|   500 batches | accuracy    0.902
|  1000 batches | accuracy    0.907
|  1500 batches | accuracy    0.928
-------------------------