[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/aldomunaretto/immune_deep_learning/blob/main/notebooks/04_NLP/22_NLP_text_classificatio_with_BERT.ipynb)

<h1><font color="#113D68" size=6>Natural Language Processing</font></h1>

<h1><font color="#113D68" size=5>Text classification with BERT</font></h1>

---

<a id="indice"></a>
<h2><font color="#004D7F" size=5>Index</font></h2>

* [0. Context](#section0)
* [1. Data processing](#section1)
* [2. Processing with Hugging Face](#section2)
* [3. Fine-tuning for text classification](#section3)

<a id="section0"></a>
# <font color="#004D7F" size=6>0. Context</font>

For this example we use the same corpus of movie plots as before, but only a subset of it: the plots that belong to comedy, drama, or western movies.

Our goal in this example is therefore to classify movie plots into these three genres.

In [None]:
#!pip install --upgrade accelerate

In [15]:
!pip3 install transformers[torch]


---

<div style="text-align: right"> <font size=5> <a href="#indice"><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></a></font></div>

---

<a id="section1"></a>
# <font color="#004D7F" size=6>1. Data processing</font>

The dataset file has the format:
```text
Plot | "tab" | Label
```
Our first step is therefore to separate the labels from the plots. We use the pandas library for this task because it makes CSV processing simple. Note that the call below reads the file with `sep=';'` and takes the labels from the `Genre` column:

In [1]:
import pandas as pd
df = pd.read_csv('movie_plots_tc.csv',sep=';',encoding='utf-8',encoding_errors='ignore')
plots=df['Plot']
labels=df['Genre']

In [2]:
plots[1]

'The film opens in a town on the Mexican border. A poker game is going on in the local saloon. One of the players cheats and is shot dead by another of the players, a Mexican named Pedro. In the uproar that follows Pedro is wounded as he escapes from the saloon. The sheriff is called, who tracks Pedro to his home but Pedro kills the sherriff too. While Pedro hides, his wife Juanita, is arrested on suspicion of murdering the sheriff. Pedro rescues her from the town jail and the two head for the Mexican border. Caught by the posse before they reach the border, Juanita is killed and the film ends with Pedro being arrested and taken back to town.'

In [3]:
labels[1]

'western'

From all the available samples, we keep only the first 500 to reduce the computational cost.

In [4]:
plots=plots[:500]
labels=labels[:500]

Neural networks cannot predict string labels directly. Since this is a multi-class classification problem, we need to encode the labels as integers.

In [4]:
import numpy as np
str2id={'western':0,'drama':1,'comedy':2}
id2str={0:'western',1:'drama',2:'comedy'}

list_plots=plots.fillna("CVxTz").values
indexed_labels=np.array([str2id[l] for l in labels])
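
The mapping above is written by hand. As a minimal sketch, the two dictionaries could instead be derived from the labels themselves, which keeps them consistent with each other. The helper name `build_label_maps` and the `toy_` variables are hypothetical and not part of the notebook. Note also that the ids produced here follow alphabetical order, so they differ from the hand-written mapping.

```python
def build_label_maps(labels):
    """Return (str2id, id2str) built from the distinct labels, in sorted order."""
    classes = sorted(set(labels))
    str2id = {name: idx for idx, name in enumerate(classes)}
    id2str = {idx: name for name, idx in str2id.items()}
    return str2id, id2str


# Toy labels standing in for df['Genre'] (illustrative only)
toy_labels = ["western", "drama", "comedy", "drama"]
toy_str2id, toy_id2str = build_label_maps(toy_labels)

# Encode every string label as its integer id
toy_indexed = [toy_str2id[label] for label in toy_labels]
```

Whichever mapping is used, the same `id2str` must be used later to decode the model's predictions.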


As usual, we need to split the data into two sets: training and validation.

In [5]:
from sklearn.model_selection import train_test_split
train_features, val_features, train_labels, val_labels = train_test_split(list_plots, indexed_labels, test_size=0.25, random_state=2000)
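
With only 500 samples and three classes, a purely random split can leave one genre under-represented in the validation set. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose. The dependency-free sketch below only illustrates the idea behind stratification; `stratified_split` and the toy data are hypothetical and are not what the notebook runs.

```python
import random
from collections import defaultdict


def stratified_split(features, labels, test_size=0.25, seed=2000):
    """Split so each class keeps roughly the same share in both subsets."""
    rng = random.Random(seed)

    # Group sample indices by class label
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_val = round(len(indices) * test_size)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(features, train_idx), pick(features, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))


# Toy data: 8 westerns (0), 8 dramas (1), 4 comedies (2)
toy_labels = [0] * 8 + [1] * 8 + [2] * 4
toy_features = [f"plot {i}" for i in range(len(toy_labels))]
tr_x, va_x, tr_y, va_y = stratified_split(toy_features, toy_labels)
```

Each class contributes a quarter of its samples to the validation set, so the class proportions are the same in both subsets.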

The `transformers` library, installed at the start of the notebook, provides access to the Hugging Face models used next.

---

<div style="text-align: right"> <font size=5> <a href="#indice"><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></a></font></div>

---

<a id="section2"></a>
# <font color="#004D7F" size=6>2. Processing with Hugging Face</font>

We load the Hugging Face components and use them to process the data.

In [6]:
import torch
from transformers.file_utils import is_tf_available, is_torch_available #, is_torch_tpu_available
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments

For this session we use the `bert-base-uncased` model, available on Hugging Face, with a maximum length of 256 tokens per sample.

<div class="alert alert-block alert-info">

<i class="fa fa-info-circle" aria-hidden="true"></i>
More information about the [model](https://huggingface.co/bert-base-uncased)

</div>

In [7]:
model_name = 'bert-base-uncased'
max_lenght = 256
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)



Now we tokenize the training and validation text samples.

__Careful!__ The input samples must be `str` or `List[str]`, which is why the NumPy arrays are converted with `.tolist()`.

In [8]:
train_encodings = tokenizer(train_features.tolist(), truncation=True, padding=True, max_length=max_lenght)
val_encodings = tokenizer(val_features.tolist(), truncation=True, padding=True, max_length=max_lenght)
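
To make the effect of `truncation`, `padding`, and `max_length` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: `pad_and_truncate`, the token ids, and the choice of `0` as padding id are assumptions for illustration. It shows how sequences longer than `max_length` are cut, how shorter ones are padded to the longest sequence in the batch, and how the attention mask marks the real tokens.

```python
def pad_and_truncate(token_ids, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest one in the batch."""
    truncated = [ids[:max_length] for ids in token_ids]
    longest = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in truncated]
    # 1 marks a real token, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids; a real tokenizer would also add special tokens such as [CLS] and [SEP]
toy_batch = pad_and_truncate([[7, 8, 9, 10, 11], [5, 6]], max_length=4)
```

Here the first sequence loses its last token and the second one receives two padding positions, so both end up with length 4.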

Now we can build a Torch `Dataset` from the computed encodings as follows:

In [9]:
class OurTorchDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)

In [10]:
train_dataset= OurTorchDataset(train_encodings, train_labels)
val_dataset= OurTorchDataset(val_encodings, val_labels)

By default, Hugging Face does not compute the metrics we are interested in, so we need to define a custom function that computes accuracy.

In [11]:
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # calculate accuracy using sklearn's function
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }
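
The metric function receives an object exposing `predictions` (one row of logits per sample) and `label_ids`. As a rough sketch of what happens inside, the example below mimics that object with a `SimpleNamespace` and plain lists instead of NumPy arrays. `accuracy_from_logits` and the logits are hypothetical values chosen for illustration.

```python
from types import SimpleNamespace


def accuracy_from_logits(pred):
    """Fraction of rows whose highest-scoring class equals the true label."""
    # argmax per row: index of the largest logit
    predicted = [max(range(len(row)), key=row.__getitem__) for row in pred.predictions]
    correct = sum(int(p == y) for p, y in zip(predicted, pred.label_ids))
    return {"accuracy": correct / len(pred.label_ids)}


# Hypothetical logits for four plots and three classes
toy_pred = SimpleNamespace(
    predictions=[[2.0, 0.1, -1.0], [0.3, 1.5, 0.2], [0.0, 0.2, 0.9], [1.1, 0.4, 0.3]],
    label_ids=[0, 1, 2, 1],
)
toy_metrics = accuracy_from_logits(toy_pred)
```

Three of the four toy predictions match their labels, so the accuracy is 0.75.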

---

<div style="text-align: right"> <font size=5> <a href="#indice"><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></a></font></div>

---

<a id="section3"></a>
# <font color="#004D7F" size=6>3. Fine-tuning for text classification</font>

Now we can load a pretrained base model from Hugging Face. Make sure to specify the correct number of labels!

In [12]:
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=3).to("cuda")

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


We define the training arguments used to fine-tune the model for text classification.

In [13]:
training_args = TrainingArguments(
    output_dir='.',                  # output directory
    num_train_epochs=3,              # total number of training epochs
    warmup_steps=100,                # number of warmup steps for the learning rate scheduler
    weight_decay=0.01,               # strength of the weight decay
    seed=1895,                       # random seed for reproducibility
)
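
To get a feeling for what `warmup_steps=100` means on this small subset, the sketch below estimates the number of optimizer steps and evaluates a linear warmup followed by linear decay. It rests on several assumptions that the notebook does not state: 375 training samples (75% of 500), a batch size of 8 on a single device, no gradient accumulation, a base learning rate of 5e-5, and a linear schedule. The function `linear_schedule` is a hypothetical helper, not a library call.

```python
import math


def linear_schedule(step, total_steps, warmup_steps, base_lr):
    """Learning rate with linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Assumed setup: 375 samples, batch size 8, 3 epochs (see the lead-in)
steps_per_epoch = math.ceil(375 / 8)
total_steps = steps_per_epoch * 3

# The learning rate reaches its peak exactly when warmup ends
peak_lr = linear_schedule(100, total_steps, warmup_steps=100, base_lr=5e-5)
```

Under these assumptions training lasts 141 steps, so the warmup phase covers roughly two thirds of the run. On a subset this small, a shorter warmup may be worth trying.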

In [None]:
trainer = Trainer(
    model=model,                         # the instantiated Transformers model to train
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=val_dataset,            # evaluation dataset
    compute_metrics=compute_metrics      # the callback that computes the metrics of interest
)
trainer.train()

| Step | Training Loss |
|------|---------------|
| 500  | 0.6614        |
| 1000 | 0.549         |


In [None]:
trainer.evaluate()

{'eval_loss': 0.00038344229687936604,
 'eval_accuracy': 1.0,
 'eval_runtime': 1.804,
 'eval_samples_per_second': 69.289,
 'eval_steps_per_second': 8.869,
 'epoch': 3.0}

In [None]:
def get_prediction(text):
    # prepare our text into tokenized sequence
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_lenght, return_tensors='pt').to("cuda")
    # perform inference to our model
    outputs=model(**inputs)
    # get output probabilities by doing softmax
    probs=outputs[0].softmax(1)
    # executing argmax function to get the candidate label
    return id2str[probs.argmax().item()]
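
The last two steps of `get_prediction` (softmax, then argmax) can be illustrated without a model. The sketch below uses hypothetical logits for a single plot and a copy of the notebook's id-to-label mapping under the name `toy_id2str`. It is a plain-Python illustration, not the Torch code path used above.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


toy_id2str = {0: "western", 1: "drama", 2: "comedy"}
toy_logits = [3.2, 0.4, -1.1]  # hypothetical model output for one plot
toy_probs = softmax(toy_logits)

# The predicted genre is the class with the highest probability
toy_label = toy_id2str[toy_probs.index(max(toy_probs))]
```

Because softmax preserves the ordering of the logits, taking the argmax of the raw logits would select the same label. The probabilities are only needed if a confidence value is reported.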

In [None]:
get_prediction("The duo decide to search for the gold together, but they are apprehended by Union forces shortly after leaving the mission - Tuco yells out Confederate-supportive statements at a group of Union soldiers, as they are covered in dust, obscuring the blue color of their uniforms. The two are brought to a prison camp which Angel infiltrated as a Union sergeant in his search for Bill Carson, getting his attention when Tuco poses as Bill Carson. Tuco reveals the name of the cemetery under torture and is sent away to be killed. Knowing that Blondie would not reveal the location, Angel Eyes recruits him into his search. Tuco escapes his fate by killing Angel Eyes' henchman, and soon finds himself in an evacuated town, where Blondie, Angel Eyes, and his gang have also arrived. ")

'western'

<div style="text-align: right"> <font size=5> <a href="#indice"><i class="fa fa-arrow-circle-up" aria-hidden="true" style="color:#004D7F"></i></a></font></div>

---

<div style="text-align: right"> <font size=6><i class="fa fa-coffee" aria-hidden="true" style="color:#004D7F"></i> </font></div>