<a href="https://colab.research.google.com/github/gattuzzo0/advanced_ml/blob/main/A3b_DL_TC5033_AD2023_text_classifier_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## TC 5033
### Word Embeddings

<br>

#### Activity 3b: Text Classification using RNNs and AG_NEWS dataset in PyTorch
<br>

- Objective:
    - Understand the basics of Recurrent Neural Networks (RNNs) and their application in text classification.
    - Learn how to handle a real-world text dataset, AG_NEWS, in PyTorch.
    - Gain hands-on experience in defining, training, and evaluating a text classification model in PyTorch.
    
<br>

- Instructions:
    - Data Preparation: Starter code will be provided that loads the AG_NEWS dataset and prepares it for training. Do not modify this part. However, you should be sure to understand it, and comment it, the use of markdown cells is suggested.

    - Model Setup: A skeleton code for the RNN model class will be provided. Complete this class and use it to instantiate your model.

    - Implementing Accuracy Function: Write a function that takes model predictions and ground truth labels as input and returns the model's accuracy.

    - Training Function: Implement a function that performs training on the given model using the AG_NEWS dataset. Your model should achieve an accuracy of at least 80% to get full marks for this part.

    - Text Sampling: Write a function that takes a sample text as input and classifies it using your trained model.

    - Confusion Matrix: Implement a function to display the confusion matrix for your model on the test data.

    - Submission: Submit your completed Jupyter Notebook. Make sure to include a markdown cell at the beginning of the notebook that lists the names of all team members. Teams should consist of 3 to 4 members.
    
<br>

- Evaluation Criteria:

    - Correct setup of all the required libraries and modules (10%)
    - Code Quality (30%): Your code should be well-organized, clearly commented, and easy to follow. Use also markdown cells for clarity. Comments should be given for all the provided code, this will help you understand its functionality.
    
   - Functionality (60%):
        - All the functions should execute without errors and provide the expected outputs.
        - RNN model class (20%)
        - Accuracy fucntion (10%)
        - Training function (10%)
        - Sampling function (10%)
        - Confucion matrix (10%)

        - The model should achieve at least an 80% accuracy on the AG_NEWS test set for full marks in this criterion.


Dataset

https://pytorch.org/text/stable/datasets.html#text-classification

https://paperswithcode.com/dataset/ag-news


### Import libraries

In [1]:
# conda install -c pytorch torchtext
# conda install -c pytorch torchdata
# conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

!pip install scikit-plot



In [2]:
# The following libraries are required for running the given code
# Please feel free to add any libraries you consider adecuate to complete the assingment.
import numpy as np
#PyTorch libraries
import torch
from torchtext.datasets import AG_NEWS
# Dataloader library
from torch.utils.data import DataLoader
from torch.utils.data.dataset import random_split
# Libraries to prepare the data
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.functional import to_map_style_dataset
# neural layers
from torch import nn
from torch.nn import functional as F

# These libraries are suggested to plot confusion matrix
# you may use others
import scikitplot as skplt
import gc

In [3]:
!pip install torchdata



In [4]:
!pip install torchtext



In [5]:
!pip install portalocker



In [6]:
!pip install torchvision torchaudio




In [7]:
# Use GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


### Get the train and the test datasets and dataloaders

Classes:

* 1 - World

* 2 - Sports

* 3 - Business

* 4 - Sci/Tech

We will convert them to:

* 0 - World

* 1 - Sports

* 2 - Business

* 3 - Sci/Tech

In [8]:
train_dataset, test_dataset = AG_NEWS()
train_dataset, test_dataset = to_map_style_dataset(train_dataset), to_map_style_dataset(test_dataset)

In [9]:
# Get the tokeniser
# tokeniser object
tokeniser = get_tokenizer('basic_english')

def yield_tokens(data):
    for _, text in data:
        yield tokeniser(text)

In [10]:
# Build the vocabulary
vocab = build_vocab_from_iterator(yield_tokens(train_dataset), specials=["<unk>"])
#set unknown token at position 0
vocab.set_default_index(vocab["<unk>"])

In [11]:
#test tokens
tokens = tokeniser('Welcome to TE3007')
print(tokens, vocab(tokens))

['welcome', 'to', 'te3007'] [3314, 4, 0]


In [12]:
NUM_TRAIN = int(len(train_dataset)*0.9)
NUM_VAL = len(train_dataset) - NUM_TRAIN

In [13]:
train_dataset, val_dataset = random_split(train_dataset, [NUM_TRAIN, NUM_VAL])

In [14]:
print(len(train_dataset), len(val_dataset), len(test_dataset))

108000 12000 7600


In [15]:
# function passed to the DataLoader to process a batch of data as indicated
def collate_batch(batch):
    # Get label and text
    y, x = list(zip(*batch))

    # Create list with indices from tokeniser
    x = [vocab(tokeniser(text)) for text in x]
    x = [t + ([0]*(max_tokens - len(t))) if len(t) < max_tokens else t[:max_tokens] for t in x]

    # Prepare the labels, by subtracting 1 to get them in the range 0-3
    return torch.tensor(x, dtype=torch.int32), torch.tensor(y, dtype=torch.int32) - 1

In [16]:
labels =  ["World", "Sports", "Business", "Sci/Tech"]
max_tokens = 50
BATCH_SIZE = 256

In [17]:
train_loader = DataLoader(train_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle = True)
val_loader = DataLoader(val_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle = True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, collate_fn=collate_batch, shuffle = True)

### Let us build our RNN model

In [18]:
EMBEDDING_SIZE = 300 # complete
NEURONS = 256 # complete
LAYERS = 2 # complete
NUM_CLASSES = 4 # complete

In [19]:
class RNN_Model_1(nn.Module):
    def __init__(self, embed_size, hidden, layers, num_classes):
        super().__init__()
        self.embedding_layer = nn.Embedding(num_embeddings=len(vocab),
                                            embedding_dim=embed_size)

        self.rnn = nn.GRU(input_size = embed_size,
                          hidden_size = hidden,
                          num_layers=layers,
                          batch_first=True,
                          bidirectional=True)

        self.fc = nn.Linear(in_features = 2 * hidden, out_features = num_classes) # complete output classifier layer using linear layer

    def forward(self, x):
      vector_embs = self.embedding_layer(x)
      y, h = self.rnn(vector_embs)
      return self.fc(y[:, -1])
        # implement forward pass. This function will be called when executing the model

In [20]:
gru_model = RNN_Model_1(EMBEDDING_SIZE, NEURONS, LAYERS, NUM_CLASSES)

In [51]:
def accuracy(model, loader):
  total = 0
  correct = 0
  cost = 0.

  with torch.no_grad():
      model.eval()
      for data in loader:
          # Desempaquetamos el dataset en np.arrays y etiquetas
          inputs, labels = data[0], data[1]
          #outputs = net(images)
          xi = inputs.to(device=device)
          yi = labels.to(device=device)

          yi = yi.type(torch.LongTensor) # <---- Here (casting)
          yi = yi.to(device=device)

          #Hacemos las predicciones
          scores = model(xi)

          criterion = F.nll_loss(F.log_softmax(scores, dim=1), yi)

          cost +=  criterion.item()
          _, pred = torch.max(scores.data, 1)

          # Append de resultados
          total += pred.size(0)
          correct += (pred == yi).sum().item()

  # Return del valor total de costo y precision del dataset (loss % accuracy)

  return cost/len(data), float(correct)/total

In [22]:
from torch import long, as_tensor, cat

In [55]:
def train(model, optimiser, epochs=100):
  model = model.to(device=device)
  train_cost = 0
  val_cost = 0.

  #Realizamos el entrenamiento cada epcoh, durante el numero definido de epochs en la definicion de hiperparametros.
  for epoch in range(epochs):  # loop over the dataset multiple times

    running_loss = 0.0
    train_correct_num  = 0.
    train_total = 0.
    train_cost_acum = 0

    #Los entrenamientos se corren sobre el dataset de entrenamiento (no test ni validacion)
    for i, data in enumerate(train_loader, 0):

      model.train()

      # get the inputs; data is a list of [inputs, labels]
      inputs, labels = data[0], data[1]

      xi = inputs.to(device=device)
      yi = labels.to(device=device)

      yi = yi.type(torch.LongTensor) # <---- Here (casting)
      yi = yi.to(device=device)

      # forward + backward + optimize
      scores = model(xi)

      # zero the parameter gradients
      # debemos de reiniciar el gradiente en cada epoch, para no caer en error de memory leak
      optimiser.zero_grad()

      criterion = nn.NLLLoss()

      cost = criterion(scores, yi)

      # funcion de costo
      cost.backward()
      optimiser.step()

      _, pred = torch.max(scores.data, 1)

      train_correct_num += (pred == yi).sum().item()
      train_total += scores.size(0)

      train_cost_acum += cost.item()

    #Calculamos el costo y la precision despues de cada epoch entrenada
    # Debemos de hacer la evaluacion en el dataset de validacion (no el de entrenamiento)
    # El dataset de test se deberia de usar despues, para hacer la matriz de confusion.
    # Utilizar el mismo dataset durante el calculo de precision y perdida conllevaria a un problema
    # de memory leak y no seria correcto.
    val_cost, val_acc = accuracy(model, val_loader)

    ### Train total -> tamano del dataset entrenado, en este caso 5000
    ### train_correct_num  -> Cantidad de items correctamente clasificados
    ### train_acc  -> Precision del modelo. Debe de tender a 1.
    ### train_cost  -> Funcion de costo (mayormente conocida como loss). Error acumulado. Debe de tender a 0
    train_acc = float(train_correct_num)/train_total
    train_cost = train_cost_acum/i

  # Vamos a monitorear la mejora cada 3 epochs
  # esto con el fin de lograr seleccionar el numero apropiado de epochs necesarios para lograr
  # el accuracy solicitado (evitar epcohs de sobra)
    if epoch%3 == 0:
      print(f'Epoch:{epoch}, train cost: {train_cost:.6f}, val cost: {val_cost:.6f},'
                      f' train acc: {train_acc:.4f}, val acc: {val_acc:4f},'
                      f' lr: {optimiser.param_groups[0]["lr"]:.6f}')

In [49]:
epochs = 40
lr = 0.0001
# instantiate model
rnn_model = RNN_Model_1(EMBEDDING_SIZE, NEURONS, LAYERS, NUM_CLASSES)
optimiser = torch.optim.Adam(gru_model.parameters(), lr=lr)

In [56]:
train(gru_model, optimiser=optimiser,  epochs=epochs)

Epoch:0, train cost: -592.317288, val cost: 39.001468, train acc: 0.2499, val acc: 0.251167, lr: 0.000100
Epoch:3, train cost: -657.033699, val cost: 38.716620, train acc: 0.2499, val acc: 0.251167, lr: 0.000100
Epoch:6, train cost: -721.757023, val cost: 38.748548, train acc: 0.2499, val acc: 0.251167, lr: 0.000100
Epoch:9, train cost: -786.477434, val cost: 38.673606, train acc: 0.2499, val acc: 0.251167, lr: 0.000100
Epoch:12, train cost: -851.189968, val cost: 38.971161, train acc: 0.2499, val acc: 0.251167, lr: 0.000100
Epoch:15, train cost: -915.903621, val cost: 38.714205, train acc: 0.2499, val acc: 0.251167, lr: 0.000100


KeyboardInterrupt: ignored

In [None]:
print(f'{accuracy(gru_model, test_loader):.4f}')

In [None]:
def sample_text(model, loader):
    pass

In [None]:
sample_text(rnn_model, test_loader)

In [None]:
# create confusion matrix
pass