Lab 3:
- Carlos Jarrin
- Fausto Yugcha

# NLP and Neural Networks

In this exercise, we'll apply our knowledge of neural networks to process natural language. As we did in the bigram exercise, the goal of this lab is to predict the next word, given the previous one.

### Data set

Load the text from "One Hundred Years of Solitude" that we used in our bigrams exercise. It's located in the data folder.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Important note:

Start with a smaller part of the text, perhaps the first 10 paragraphs, since the number of tokens grows rapidly as we add more text.

Later you can use a bigger corpus.

Don't forget to prepare the data by generating the corresponding tokens.

In [2]:
import torch
import torch.nn as nn
import torch.optim as optim

from nltk import bigrams
from nltk.tokenize import TreebankWordTokenizer

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score

import numpy as np

In [3]:
# Load the text and tokenize it
tokenizer = TreebankWordTokenizer()
# encoding='utf-8' is an assumption about how the file was saved
with open('/content/drive/MyDrive/NLP/datos/cap1.txt', 'r', encoding='utf-8') as f:
    text = f.read().lower()
tokens = tokenizer.tokenize(text)
print(f"tokens = {len(tokens)=}")

tokens = len(tokens)=6293


### Let's prepare the data set.

Our neural network needs an input X and an output y. Remember that these sets are numerical, so you need something to map the tokens to numbers, and vice versa.
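As a minimal sketch of such a two-way mapping, the snippet below builds a pair of dictionaries by hand. The names `toy_tokens`, `word_to_idx` and `idx_to_word` are hypothetical and are not used elsewhere in this lab, which relies on scikit-learn encoders instead.

```python
# Hypothetical toy tokens; in the lab these would come from the tokenizer.
toy_tokens = ["muchos", "años", "después", ",", "frente", "al", "pelotón", ",", "muchos"]

# Sorting the unique tokens gives a stable, reproducible vocabulary order.
vocab = sorted(set(toy_tokens))

# Token -> integer index, and the inverse mapping back to tokens.
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Encode the token sequence as integers, then decode it again.
encoded = [word_to_idx[w] for w in toy_tokens]
decoded = [idx_to_word[i] for i in encoded]

print(encoded)
print(decoded == toy_tokens)
```

Keeping both dictionaries makes it possible to turn a predicted index back into a word after the network has produced its output.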

In [4]:
# Generate bigrams (pairs of consecutive words)
bigram_list = list(bigrams(tokens))

In [5]:
X = [bigram[0] for bigram in bigram_list] # First word of the bigram
y = [bigram[1] for bigram in bigram_list] # Second word of the bigram

In [6]:
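# Convert the words to a numerical representation
# Note: CountVectorizer's default token_pattern keeps only tokens of two or more
# word characters, so punctuation and one-letter words become all-zero rows
vectorizer = CountVectorizer()
X_vectorized = vectorizer.fit_transform(X)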
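The following sketch illustrates how `CountVectorizer` behaves when each "document" is a single token, as in the cell above. The toy word list and the variable names are hypothetical. The second vectorizer, which passes a callable `analyzer`, is one possible workaround and is not what this lab uses.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical one-token "documents", mimicking the first words of bigrams.
toy_words = [",", "la", "de", "y"]

# Default settings: only tokens with two or more word characters are kept.
toy_vectorizer = CountVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_words).toarray()

# Tokens that were discarded have no column, so their rows are all zeros.
row_sums = toy_matrix.sum(axis=1)
empty_rows = [w for w, s in zip(toy_words, row_sums) if s == 0]
print(sorted(toy_vectorizer.vocabulary_))
print(empty_rows)

# Possible workaround (an assumption, not the lab's approach):
# treat every document as exactly one feature, whatever characters it contains.
whole_token_vectorizer = CountVectorizer(analyzer=lambda doc: [doc])
whole_matrix = whole_token_vectorizer.fit_transform(toy_words).toarray()
print(sorted(whole_token_vectorizer.vocabulary_))
```

With the default settings, every discarded token is fed to the network as the same all-zero vector, so the model cannot tell such inputs apart.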

In [7]:
# Split the data into training and test sets
# (this split is overwritten by the tensor-based split in the next cell)
X_train, X_test, y_train, y_test = train_test_split(X_vectorized, y, test_size = 0.2, random_state = 0)

In [8]:
# Encode the labels as integers
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y)

# Convert the data to tensors
X_tensor = torch.tensor(X_vectorized.toarray(), dtype = torch.float32)
y_tensor = torch.tensor(y_encoded, dtype=torch.long)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tensor, y_tensor, test_size = 0.2, random_state = 0)

In [None]:
# Note that our vectors are integers, which can be thought of as categorical variables.
# torch provides the one_hot function, which generates tensors suitable for our network.
# Make sure that the dtype of your tensor is float.
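A minimal sketch of the `one_hot` approach mentioned in the note, assuming integer-encoded tokens. The tensor `token_ids` and the value of `vocab_size` are hypothetical toy values.

```python
import torch
import torch.nn.functional as F

# Hypothetical integer-encoded tokens and vocabulary size.
token_ids = torch.tensor([2, 0, 1, 2])
vocab_size = 3

# one_hot returns an integer (int64) tensor, so cast it to float
# before passing it to layers such as nn.Linear.
one_hot_inputs = F.one_hot(token_ids, num_classes=vocab_size).float()

print(one_hot_inputs.shape)
print(one_hot_inputs.dtype)
```

Passing `num_classes` explicitly keeps the width equal to the vocabulary size even when the highest index does not occur in a given batch.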

In [10]:
type(X_tensor)
type(y_tensor)

torch.Tensor

### Network design
To start, we are going to have a very simple network. Define a single-layer network.
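For comparison with the deeper model defined in the cells that follow, a literal single-layer network could be sketched as below. The class name `SingleLayerNN` and the toy vocabulary size are hypothetical; this is an illustration under those assumptions, not the model trained in this lab.

```python
import torch
import torch.nn as nn

class SingleLayerNN(nn.Module):  # hypothetical name
    def __init__(self, vocab_size):
        super().__init__()
        # One linear layer mapping a one-hot input word to a score per next word.
        self.linear = nn.Linear(vocab_size, vocab_size)

    def forward(self, x):
        # Return raw logits; nn.CrossEntropyLoss applies log-softmax internally.
        return self.linear(x)

# Toy check with a hypothetical vocabulary of 5 words.
toy_model = SingleLayerNN(vocab_size=5)
toy_input = torch.eye(5)          # each row is the one-hot vector of one word
toy_logits = toy_model(toy_input)
print(toy_logits.shape)
```

With one-hot inputs, each row of the output is determined by a single column of the weight matrix plus the bias, which makes this layer a learned table of next-word scores.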

In [11]:
# Network parameters
input_size = X_train.shape[1]
hidden_size = 128  # Adjusted for the additional hidden layers
output_size = len(label_encoder.classes_)
dropout_rate = 0.3  # To reduce overfitting

In [12]:
# Define a deeper neural network
class ImprovedNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(ImprovedNN, self).__init__()
        # First dense layer
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu1 = nn.ReLU()
        # Second dense layer
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.relu2 = nn.ReLU()
        # Dropout to reduce overfitting (dropout_rate is the notebook-level variable defined above)
        self.dropout = nn.Dropout(dropout_rate)
        # Output layer
        self.fc3 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu1(x)
        x = self.fc2(x)
        x = self.relu2(x)
        x = self.dropout(x)
        x = self.fc3(x)
        return x

In [13]:
# Create the model
model = ImprovedNN(input_size, hidden_size, output_size)

# Define the loss criterion and the optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

In [14]:
# Train the model
n_epochs = 180

for epoch in range(n_epochs):
    model.train()

    # Make predictions and compute the loss
    outputs = model(X_train)
    loss = criterion(outputs, y_train)

    # Update the weights
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Compute accuracy on the training data
    # (outputs come from the training-mode forward pass, so dropout is active)
    _, predicted = torch.max(outputs, 1)
    train_accuracy = accuracy_score(y_train, predicted)

    # Evaluate on the test set
    model.eval()
    with torch.no_grad():
        outputs_test = model(X_test)
        _, predicted_test = torch.max(outputs_test, 1)
        test_accuracy = accuracy_score(y_test, predicted_test)

    if (epoch+1) % 20 == 0:
        print(f"Epoch [{epoch+1}/{n_epochs}], Loss: {loss.item():.4f}, Train Accuracy: {train_accuracy*100:.2f}%, Test Accuracy: {test_accuracy*100:.2f}%")


Epoch [20/180], Loss: 7.5192, Train Accuracy: 1.63%, Test Accuracy: 0.08%
Epoch [40/180], Loss: 6.6586, Train Accuracy: 4.61%, Test Accuracy: 6.75%
Epoch [60/180], Loss: 5.7641, Train Accuracy: 6.56%, Test Accuracy: 6.75%
Epoch [80/180], Loss: 5.5892, Train Accuracy: 7.59%, Test Accuracy: 6.91%
Epoch [100/180], Loss: 5.4992, Train Accuracy: 7.53%, Test Accuracy: 7.31%
Epoch [120/180], Loss: 5.4224, Train Accuracy: 8.88%, Test Accuracy: 7.07%
Epoch [140/180], Loss: 5.3007, Train Accuracy: 12.30%, Test Accuracy: 8.90%
Epoch [160/180], Loss: 5.1183, Train Accuracy: 16.49%, Test Accuracy: 9.13%
Epoch [180/180], Loss: 4.8893, Train Accuracy: 19.99%, Test Accuracy: 8.98%


In [15]:
# Predict the most probable next word for a given word, along with its probability
def predict_next_word(input_word):
    input_vectorized = vectorizer.transform([input_word]).toarray()
    input_tensor = torch.tensor(input_vectorized, dtype=torch.float32)

    model.eval()
    with torch.no_grad():
        output = model(input_tensor)
        probabilities = torch.softmax(output, dim=1)
        predicted_prob, predicted_idx = torch.max(probabilities, 1)

    predicted_word = label_encoder.inverse_transform(predicted_idx.numpy())[0]
    return predicted_word, predicted_prob.item()

In [18]:
# Test the prediction (greedy decoding is deterministic, so every repetition prints the same sentence)
max_n_pred = 10
for _ in range(10):
  word = 'aldea'
  full_pred = word
  for i in range(max_n_pred):
    word2 = predict_next_word(word)[0]
    full_pred = full_pred + ' ' + word2
    word = word2
  print(full_pred)

aldea , de la la la la la la la la
aldea , de la la la la la la la la
aldea , de la la la la la la la la
aldea , de la la la la la la la la
aldea , de la la la la la la la la
aldea , de la la la la la la la la
aldea , de la la la la la la la la
aldea , de la la la la la la la la
aldea , de la la la la la la la la
aldea , de la la la la la la la la
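The loop above always takes the single most probable word, which is why every sentence collapses into "la la la". A hedged sketch of one alternative, sampling the next word from the softmax distribution, is shown below. The logits, the `temperature` value and the variable names are hypothetical, and this is not part of the lab's `predict_next_word` function.

```python
import torch

torch.manual_seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical logits for a 4-word vocabulary (one row = one input word).
toy_logits = torch.tensor([[2.0, 1.0, 0.5, 0.1]])

# Temperature > 1 flattens the distribution, < 1 sharpens it (assumed knob).
temperature = 1.0
probs = torch.softmax(toy_logits / temperature, dim=1)

# Greedy decoding: always the same index for the same input.
greedy_idx = torch.argmax(probs, dim=1)

# Sampling: draw indices in proportion to their probabilities.
samples = torch.multinomial(probs, num_samples=10, replacement=True)

print(greedy_idx.item())
print(samples.tolist())
```

Sampling lets lower-probability words appear from time to time, so the generated text can move out of a repeated state instead of cycling on one word.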


### Analysis

1. Test your network with a few words

In [21]:
def pred_n_words(word = 'buendia', max_n_pred = 10):
  full_pred = word
  l1 = 0
  for i in range(max_n_pred):
    # Call the model once and unpack both the predicted word and its probability
    word2, pr = predict_next_word(word)
    full_pred = full_pred + ' ' + word2
    word = word2
    l1 += np.log(pr)

  # Mean log-probability per predicted word; the negative log-likelihood is its negation
  n_ll = l1/max_n_pred
  print(full_pred, '| neg log:', n_ll)

palabras = ['buendia', 'niño', 'posibilidad', 'casa', 'muchos']
for w in palabras:
  pred_n_words(word = w, max_n_pred =1)
print(' ')
for w in palabras:
  pred_n_words(word = w)


buendia de | neg log: -3.258164787668704
niño de | neg log: -1.6824598039681637
posibilidad de | neg log: -0.16344186687913714
casa , | neg log: -1.998403876553244
muchos , | neg log: -3.0167437219622677
 
buendia de la la la la la la la la la | neg log: -4.937202498653687
niño de la la la la la la la la la | neg log: -4.779632000283633
posibilidad de la la la la la la la la la | neg log: -4.62773020657473
casa , de la la la la la la la la | neg log: -4.588454281295556
muchos , de la la la la la la la la | neg log: -4.690288265836459


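2. What does each value in the tensor represent?

Each row of the input tensor encodes one word as a vector over the vocabulary: the column of that word holds 1 and every other column holds 0 (tokens that `CountVectorizer` discards, such as punctuation, become all-zero rows). Each value in the output tensor is a logit, an unnormalized score for one candidate next word, and softmax turns those scores into probabilities. The network uses fully connected layers rather than convolutions, so the input only has to be supplied as a matrix of shape (number of examples, vocabulary size).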


3. Why does it make sense to choose that number of neurons in our layer?


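The input layer needs one unit per column of the vectorized vocabulary (`input_size = X_train.shape[1]`) and the output layer one unit per candidate next word (`output_size = len(label_encoder.classes_)`), because that is how the bigram task is defined: one word in, a score for every possible next word out. The hidden width of 128 is a tunable choice.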

4. What's the negative likelihood for each example?

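It is a measure that quantifies, according to the model, how probable it is that the word is the true one: the negative logarithm of the probability assigned to that word, so lower values are better. The values printed above for the single-word examples are log-probabilities, so the corresponding negative log-likelihoods are their negations, for example about 0.16 for "posibilidad de" and about 3.26 for "buendia de".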

5. Try generating a few sentences?

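Sentences are generated with a loop that feeds each predicted word back in as the next input. Because the loop always takes the most probable word, the output gets stuck repeating the same word ("la la la ..."). Avoiding that closed loop requires drawing on a broader part of the vocabulary, for example by sampling from the predicted distribution.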


6. What's the negative likelihood for each sentence?

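It is the sum of the negative log-likelihoods of the individual words in the sentence. `pred_n_words` reports that sum divided by the number of predicted words and without flipping the sign, which is why the ten-word values above lie between roughly -4.6 and -4.9.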
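The relationship between per-word and per-sentence values can be sketched with a few lines of arithmetic. The probabilities in `word_probs` are hypothetical, chosen only to make the numbers easy to follow. They are not outputs of the trained model.

```python
import math

# Hypothetical probabilities the model assigned to each word of a sentence.
word_probs = [0.5, 0.25, 0.125]

# Negative log-likelihood of each word: -log p(word | previous word).
per_word_nll = [-math.log(p) for p in word_probs]

# Sentence NLL: the sum over words, which equals -log of the product of probabilities.
sentence_nll = sum(per_word_nll)

# Dividing by the sentence length gives a per-word average,
# which is easier to compare across sentences of different lengths.
mean_nll = sentence_nll / len(word_probs)

print(per_word_nll)
print(sentence_nll, mean_nll)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long sentences, since the product of many small probabilities quickly becomes too small to represent.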