
# **Language models based on artificial neural networks**

## **Recurrent neural networks (RNN)**

### **LSTM**

AG_NEWS is a text classification dataset widely used in natural language processing (NLP). It contains news articles grouped into four categories:

1. **World**: News about global events, such as international politics, international relations, and world news in general.

2. **Sports**: News about sporting events, match results, and national and international competitions.

3. **Business**: News about business, finance, the economy, companies, earnings reports, and other economic topics.

4. **Sci/Tech**: News about science and technology, including scientific advances, new technology, gadgets, and research.

Each instance of AG_NEWS consists of the title and body text of a news article, together with a label indicating the category it belongs to. In the torchtext version loaded below, each sample is a `(label, text)` pair, with labels numbered from 1 to 4.
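
As a rough illustration of the label convention, the sketch below maps raw labels to class names and to the zero-based indices that a loss such as `CrossEntropyLoss` expects. It assumes the labels run from 1 to 4 in the order of the list above. The names `AG_NEWS_CLASSES`, `to_class_index`, and `to_class_name` are hypothetical helpers and are not part of torchtext.

```python
# Hypothetical helpers; the 1..4 label order is an assumption based on
# the category list above.
AG_NEWS_CLASSES = ("World", "Sports", "Business", "Sci/Tech")


def to_class_index(raw_label):
    """Convert a raw 1-based label to a 0-based class index."""
    index = int(raw_label) - 1
    if not 0 <= index < len(AG_NEWS_CLASSES):
        raise ValueError(f"unexpected label: {raw_label!r}")
    return index


def to_class_name(raw_label):
    """Return the category name for a raw 1-based label."""
    return AG_NEWS_CLASSES[to_class_index(raw_label)]


print(to_class_name(3))   # Business
print(to_class_index(1))  # 0
```

Validating the range up front makes a mislabelled sample fail loudly instead of silently indexing the wrong class.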

In [2]:
from torchtext import datasets
from torchtext.data import to_map_style_dataset
import numpy as np

# Load the dataset
train_iter, test_iter = datasets.AG_NEWS(split=('train', 'test'))

train_ds = to_map_style_dataset(train_iter)
test_ds = to_map_style_dataset(test_iter)

train = np.array(train_ds)
test = np.array(test_ds)

In [3]:
# Create vocabulary and embedding

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import GloVe

tokenizer = get_tokenizer("basic_english")
glove_vectors = GloVe(name='6B', dim=300)

vocab = build_vocab_from_iterator(map(lambda x: tokenizer(x[1]), train_iter), specials=['<pad>','<unk>'])
vocab.set_default_index(vocab["<unk>"])

In [4]:
glove_vectors["sfsfsdf"]

tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 

In [5]:
glove_vectors.stoi["of"]

3

In [6]:
glove_vectors.itos[3]

'of'

In [7]:
print(len(vocab))
print(vocab["<pad>"])
print(vocab["<unk>"])
print(vocab["."])
print(vocab["the"])
print(vocab["according"])
print(vocab.lookup_token(210))
print(vocab.lookup_token(200))

95812
0
1
2
3
210
according
prime


In [8]:
print(glove_vectors.vectors.shape)  # 400000 words and 300 dimensions
print(glove_vectors['hello'].shape)  # 300 dimensions
print(glove_vectors.itos[0])
print(glove_vectors.itos[1])
print(glove_vectors.itos[2])
print(glove_vectors.itos[3])
print(glove_vectors.itos[4])
print(glove_vectors.itos[5])
print(glove_vectors.itos[200])
print(glove_vectors.stoi['according'])
print(glove_vectors['according'])
print(glove_vectors['ññññ'])  # If the word does not exist, a vector of zeros is returned

torch.Size([400000, 300])
torch.Size([300])
the
,
.
of
to
and
according
200
tensor([-0.2796,  0.1373,  0.0435,  0.3330, -0.1685,  0.0672, -0.1664,  0.1572,
        -0.1213, -1.7386, -0.0249, -0.2657,  0.1754,  0.1733, -0.0024,  0.1593,
        -0.1860,  0.2516, -0.3865, -0.3362,  0.1269,  0.0737,  0.2497,  0.4563,
        -0.2016,  0.0179, -0.0856,  0.0823,  0.2025, -0.1380,  0.0408,  0.5478,
        -0.0415,  0.1931, -0.8055, -0.2265,  0.2003, -0.0392, -0.1752, -0.1792,
         0.4521, -0.4672,  0.0034,  0.5178,  0.0486, -0.2500, -0.2938,  0.2332,
        -0.2310,  0.1783, -0.0716, -0.0211, -0.0837, -0.1419, -0.0739,  0.1935,
        -0.3890,  0.0122, -0.4115, -0.2794, -0.4985,  0.1327,  0.5072,  0.0141,
        -0.4911, -0.3892,  0.1637,  0.0691, -0.0703,  0.1932,  0.5417,  0.1424,
        -0.1569, -0.2428, -0.4837,  0.1394,  0.3138, -0.1732,  0.0480,  0.4464,
         0.2092,  0.1116, -0.2180,  0.4683,  0.0487,  0.0277, -0.2645,  0.1688,
        -0.0581, -0.0362, -0.6662, -0.0793, 

In [9]:
print("Vocabulary size:", len(vocab), "tokens")
print("Tokenization of the sentence 'Here is an example sentence':", tokenizer("Here is an example sentence"))
print("Indices of the words 'here', 'is', 'an', 'example', 'supercalifragilisticexpialidocious':", vocab(['here', 'is', 'an', 'example', 'supercalifragilisticexpialidocious']))
print("Words corresponding to the indices 475, 21, 30, 5297, 0:", vocab.lookup_tokens([475, 21, 30, 5297, 0]))
print("The first ten words of the vocabulary:", vocab.get_itos()[:10])

Vocabulary size: 95812 tokens
Tokenization of the sentence 'Here is an example sentence': ['here', 'is', 'an', 'example', 'sentence']
Indices of the words 'here', 'is', 'an', 'example', 'supercalifragilisticexpialidocious': [476, 22, 31, 5298, 1]
Words corresponding to the indices 475, 21, 30, 5297, 0: ['version', 'at', 'from', 'establish', '<pad>']
The first ten words of the vocabulary: ['<pad>', '<unk>', '.', 'the', ',', 'to', 'a', 'of', 'in', 'and']


In [10]:
# Text and label pipelines using the vocabulary and the GloVe vectors
def text_pipeline(x):
    tokens_vocab = tokenizer(x)  # Returns a list of tokens (words)
    # Map each token to its GloVe index; unknown words fall back to index 399999
    tokens_glove = [glove_vectors.stoi.get(w, 399999) for w in tokens_vocab]
    return tokens_glove

label_pipeline = lambda x: int(x) - 1

print(text_pipeline("Here is an example sentence dfgdfñ"))

[187, 14, 29, 880, 2422, 399999]
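
Because every word missing from GloVe collapses onto the same fallback index, it can be useful to check how often that happens. The sketch below measures the fraction of out-of-vocabulary tokens. To stay self-contained it uses a toy dictionary that stands in for `glove_vectors.stoi` and reuses the indices printed above. The names `oov_rate` and `toy_stoi` are hypothetical.

```python
def oov_rate(tokens, stoi):
    """Return the fraction of tokens that are missing from the mapping `stoi`."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in stoi)
    return missing / len(tokens)


# Toy stand-in for glove_vectors.stoi (an assumption for illustration only)
toy_stoi = {"here": 187, "is": 14, "an": 29, "example": 880, "sentence": 2422}
tokens = ["here", "is", "an", "example", "sentence", "dfgdfñ"]

rate = oov_rate(tokens, toy_stoi)
print(f"{rate:.3f}")  # one token out of six is unknown
```

With the real objects, the same function could be called as `oov_rate(tokenizer(text), glove_vectors.stoi)`.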


In [11]:
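from torch.utils.data import DataLoader
import torch


def collate_batch(batch):
    """Turn a list of (label, text) samples into a label tensor and a padded index tensor."""
    label_list, text_list = [], []
    for label, text in batch:
        text_list.append(torch.tensor(text_pipeline(text), dtype=torch.long))
        label_list.append(label_pipeline(label))
    labels = torch.tensor(label_list, dtype=torch.long)
    # Pad every sequence to the longest one in the batch, reusing index 399999
    texts = torch.nn.utils.rnn.pad_sequence(text_list, batch_first=True, padding_value=399999)
    return labels, texts

# Use the map-style datasets so that shuffling and repeated epochs behave predictably
train_dataloader = DataLoader(
    train_ds, batch_size=64, shuffle=True, collate_fn=collate_batch
)

# Shuffling is not needed for evaluation
test_dataloader = DataLoader(
    test_ds, batch_size=64, shuffle=False, collate_fn=collate_batch
)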
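
To make the padding step concrete, here is a minimal pure-Python sketch of what `pad_sequence(..., batch_first=True, padding_value=399999)` does to a batch of index lists. The helper `pad_batch` is hypothetical and only mirrors the behaviour for lists of integers.

```python
PAD_IDX = 399999  # the padding value passed to pad_sequence above


def pad_batch(sequences, padding_value=PAD_IDX):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]


batch = [[187, 14, 29], [880], [2422, 399999]]
padded = pad_batch(batch)
for row in padded:
    print(row)
# [187, 14, 29]
# [880, 399999, 399999]
# [2422, 399999, 399999]
```

In the last row the first `399999` is an unknown word and the second is padding. Since both share one index, the padded tensor alone cannot tell them apart, so the true lengths would have to be recorded before padding if they are needed later.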

To check that the batches are built correctly, we print the first four instances of one batch:

In [12]:
for batch in train_dataloader:
    print(batch[1][:4])
    print("\n")
    print(batch[0][:4])
    print("\n")
    print(batch[1].shape)
    break

tensor([[    95,     50,    163,    526,  10053,    524,      2,    409,     72,
              6,    375,     23,   1184,     24,   1184,     11,    526,      3,
             50,     95,   1288,   5086,      6,    375,     10,      0,    126,
           1362,    229,      1,    933,    896,      1,      7,   1100,      3,
         138728,    956,   9283,      0,    442, 118464,    211,      2, 399999,
         399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999,
         399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999,
         399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999,
         399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999,
         399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999,
         399999],
        [   361,   1555,      4,   1783,   5028,   8327,   1363,     23,   1582,
             24,     65,      7,    623,     78,   1908,     31,   1412,    559,
          

------------------

In [13]:
import torch
import torch.nn as nn

class LSTMTextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_class):
        super().__init__()
        # self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Frozen pretrained GloVe embeddings; index 399999 is used for <unk> and for padding
        self.glove_embedding = nn.Embedding.from_pretrained(
            glove_vectors.vectors, freeze=True, padding_idx=399999
        )
        # Zero vector for the <unk>/padding index
        self.glove_embedding.weight.data[399999] = torch.zeros(embed_dim)

        # dropout is omitted: nn.LSTM applies it only between stacked layers,
        # so it has no effect (and raises a warning) with num_layers=1
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, num_layers=1)
        self.fc = nn.Linear(hidden_dim, num_class)

    def forward(self, text):
        embedded = self.glove_embedding(text)
        lstm_out, _ = self.lstm(embedded)
        # Take the output at the last time step of the padded sequence
        # (for shorter texts this is a padding position)
        last_output = lstm_out[:, -1, :]
        output = self.fc(last_output)
        return output
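
The `forward` method above reads `lstm_out[:, -1, :]`, which for shorter texts is the state reached after a run of padding steps. A possible alternative, shown here only as a sketch, is to pick the output at each sequence's last real token. It assumes the true lengths are collected before padding, since index 399999 also marks unknown words and so cannot be used to recover them. The helper `last_valid_outputs` and the toy dimensions are hypothetical.

```python
import torch
import torch.nn as nn


def last_valid_outputs(lstm_out, lengths):
    """For each sequence, select the LSTM output at its last non-padded step.

    lstm_out: (batch, seq_len, hidden) tensor from a batch_first LSTM
    lengths:  (batch,) tensor with the true length of each sequence
    """
    batch_idx = torch.arange(lstm_out.size(0))
    return lstm_out[batch_idx, lengths - 1, :]


torch.manual_seed(0)
lstm = nn.LSTM(input_size=3, hidden_size=5, batch_first=True)

short = torch.randn(1, 2, 3)                              # a sequence of length 2
padded = torch.cat([short, torch.zeros(1, 4, 3)], dim=1)  # same sequence padded to 6

out_short, _ = lstm(short)
out_padded, _ = lstm(padded)

picked = last_valid_outputs(out_padded, torch.tensor([2]))
print(torch.allclose(picked, out_short[:, -1, :], atol=1e-5))
print(torch.allclose(out_padded[:, -1, :], out_short[:, -1, :], atol=1e-5))
```

The first comparison checks that the selected output matches the unpadded run, and the second checks whether the plain last step still does. `torch.nn.utils.rnn.pack_padded_sequence` is another way to keep padding out of the recurrence, at the cost of passing lengths through the model.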


In [14]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = LSTMTextClassificationModel(len(glove_vectors), 300, 64, 4).to(device)
model.train()

for batch in train_dataloader:
    print(batch[1].shape)
    # Move the inputs to the model's device before the forward pass
    predicted_label = model(batch[1].to(device))
    label = batch[0]
    break

print(batch[1][:4])
print(predicted_label[:4])
print(label[:4])

torch.Size([64, 76])
tensor([[  4361,    907,    211,   1604,      6,      7,   1604,  20359,  10977,
              3,      0,   9500,      2,  10108,   4707,      1,    544,      3,
            925,   1468,   1752,   4361,   5698,      6,     44,     58,    122,
              3,    198,    857,      2, 399999, 399999, 399999, 399999, 399999,
         399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999,
         399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999,
         399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999,
         399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999, 399999,
         399999, 399999, 399999, 399999],
        [   474,   5760,     19,  56714,  13439,   1590,   1891,     23,  10851,
             24,  10851,     11,    474,   4987,     17,   5973,     19,  11663,
         250510,  56714,   9756,     44,     58,   1638,     22,      0,   3264,
           1478,     13,    171,      5,     4



------------------

In [15]:
import time

# The loss function and optimizer are created in the next cell; train() and
# evaluate() look up `criterion`, `optimizer`, and `device` as globals when called.

def train(dataloader):
    model.train()
    total_acc, total_count, max_acc = 0, 0, 0
    log_interval = 500

    for idx, (label, text) in enumerate(dataloader):
        # Move the batch to the same device as the model
        label, text = label.to(device), text.to(device)
        optimizer.zero_grad()
        predicted_label = model(text)
        loss = criterion(predicted_label, label)
        loss.backward()
        # Rescale the gradients so that their global norm is at most 0.1
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()

        total_acc += (predicted_label.argmax(1) == label).sum().item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            print('| {:5d} batches '
                  '| accuracy {:8.3f}'.format(idx,
                                              total_acc / total_count))

            # Keep the best accuracy seen over any logging window
            max_acc = max(max_acc, total_acc / total_count)
            total_acc, total_count = 0, 0
    return max_acc


def evaluate(dataloader):
    model.eval()
    total_acc, total_count = 0, 0

    with torch.no_grad():
        for label, text in dataloader:
            label, text = label.to(device), text.to(device)
            predicted_label = model(text)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_count += label.size(0)
    return total_acc / total_count
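
The `clip_grad_norm_` call in the training loop limits the combined norm of all gradients to 0.1. The following pure-Python sketch illustrates the idea on plain lists: compute one global L2 norm and, if it exceeds the limit, scale every gradient by the same factor. The helper `clip_by_global_norm` is hypothetical and only approximates what the PyTorch function does.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradient vectors so that their joint L2 norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / (total_norm + eps))
    clipped = [[g * scale for g in vec] for vec in grads]
    return clipped, total_norm


grads = [[3.0, 4.0], [0.0, 12.0]]  # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=0.1)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))

print(norm_before)           # 13.0
print(round(norm_after, 4))  # 0.1
```

Because all gradients share one scale factor, clipping changes the step size but not the direction of the update. That is one reason a learning rate as large as 5 can remain usable with plain SGD.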

In [16]:
# Hyperparameters
EPOCHS = 5  # number of epochs
LR = 5  # initial learning rate for SGD
# The batch size (64) is fixed where the DataLoaders are created

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)


for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()

    accu_train = train(train_dataloader)
    # Reported below as "valid accuracy", but measured on the test split
    accu_val = evaluate(test_dataloader)

    scheduler.step()
    
    print("-" * 59)
    print(
        "| end of epoch {:3d} | time: {:5.2f}s | "
        "valid accuracy {:8.3f} ".format(
            epoch, time.time() - epoch_start_time, accu_val
        )
    )
    print("-" * 59)



|   500 batches | accuracy    0.254
|  1000 batches | accuracy    0.253
|  1500 batches | accuracy    0.269
-----------------------------------------------------------
| end of epoch   1 | time: 71.12s | valid accuracy    0.687 
-----------------------------------------------------------
|   500 batches | accuracy    0.774
|  1000 batches | accuracy    0.855
|  1500 batches | accuracy    0.880
-----------------------------------------------------------
| end of epoch   2 | time: 77.96s | valid accuracy    0.888 
-----------------------------------------------------------
|   500 batches | accuracy    0.882
|  1000 batches | accuracy    0.891
|  1500 batches | accuracy    0.909
-----------------------------------------------------------
| end of epoch   3 | time: 70.79s | valid accuracy    0.896 
-----------------------------------------------------------
|   500 batches | accuracy    0.899
|  1000 batches | accuracy    0.908
|  1500 batches | accuracy    0.925
-------------------------