# SpanBERTa: Named Entity Recognition with Transformers
## https://github.com/chriskhanhtran/spanish-bert

# Introduction

SpanBERTa is a Transformer-based language model for Spanish, trained from scratch on a large corpus. In this notebook, we fine-tune SpanBERTa for Named Entity Recognition (NER).

We use the `run_ner.py` script (inside the Drive folder) and the Güemes Documentado dataset for fine-tuning.

# Setup

In [3]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)
CARPETA_COLAB = '/content/drive/MyDrive/ProyectoNERC/NERC/'

Mounted at /content/drive


We install the required packages and import them.

In [1]:
!pip install transformers
!pip install seqeval

Collecting transformers
  Downloading transformers-4.19.2-py3-none-any.whl (4.2 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.6.0-py3-none-any.whl (84 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Installing collected packages: pyyaml, tokenizers, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Found existing installation: PyYAML 3.13
    Uninstalling PyYAML-3.13:
      Successfully uninstalled PyYAML-3.13
Successfully installed huggingface-hub-0.

In [2]:
import transformers

# Data


## 1. Loading the Dataset

In [30]:
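tokens = []
# Each annotated line has three whitespace-separated columns: keep the word (first)
# and the NER tag (third). Any other line, including blank ones, is skipped.
with open(CARPETA_COLAB + 'dataset/GD-1.conll', 'r', encoding="utf-8") as f:
  for line in f:
    columns = line.split()
    if len(columns) == 3:
      tokens.append((columns[0], columns[2]))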

Train and test (a and b) splits for Güemes Documentado:
   - testa: Spanish test data for the development phase
   - testb: final test data
   - train: training data

In [31]:
# 60/20/20 split. Done by hand so that each set ends at the end of a sentence
train = tokens[:31020]
testa = tokens[31020:41395]
testb = tokens[41395:]
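
The cut points above were chosen by hand. As a rough sketch of how similar cut points could be derived automatically, the helper below moves a target index forward to the next sentence-final token. The function names and the rule that a sentence ends at a `.`, `!` or `?` token are assumptions made for illustration, not part of the original notebook.

```python
def snap_to_sentence_end(tokens, target, enders=(".", "!", "?")):
    """Return the smallest cut index >= target whose preceding token ends a sentence."""
    for i in range(max(target, 1), len(tokens) + 1):
        if tokens[i - 1][0] in enders:
            return i
    return len(tokens)


def split_60_20_20(tokens):
    """Split (word, label) pairs roughly 60/20/20 without cutting a sentence in half."""
    n = len(tokens)
    cut1 = snap_to_sentence_end(tokens, n * 60 // 100)
    cut2 = snap_to_sentence_end(tokens, max(cut1, n * 80 // 100))
    return tokens[:cut1], tokens[cut1:cut2], tokens[cut2:]


# Toy data in the same (word, label) format as `tokens`: ten 3-token sentences.
toy = [("Güemes", "B-PER"), ("llegó", "O"), (".", "O")] * 10
train_toy, dev_toy, test_toy = split_60_20_20(toy)
print(len(train_toy), len(dev_toy), len(test_toy))
```

On real data the resulting indices would still deserve a manual check, since abbreviations and similar cases also produce `.` tokens.
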

In [37]:
with open("train_temp.txt", 'w', encoding="utf-8") as f:
  for token in train:
    f.write(token[0] + ' ' + token[1] + '\n')

with open("dev_temp.txt", 'w', encoding="utf-8") as f:
  for token in testa:
    f.write(token[0] + ' ' + token[1] + '\n')

with open("test_temp.txt", 'w', encoding="utf-8") as f:
  for token in testb:
    f.write(token[0] + ' ' + token[1] + '\n')

## 2. Preprocessing

In [38]:
MAX_LENGTH = 120 #@param {type: "integer"}
MODEL = "chriskhanhtran/spanberta" #@param ["chriskhanhtran/spanberta", "bert-base-multilingual-cased"]

We cap sentences at a maximum number of tokens (`MAX_LENGTH`).
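
As a minimal sketch of what this length capping amounts to, the generator below inserts a blank line (a sequence break) whenever the running subword count would exceed the limit. It is an illustration of the idea, not the code of the downloaded script. `split_long_sequences` and `fake_subword_count` are hypothetical names, and the stand-in counter (one subword per four characters) replaces a real tokenizer so the example runs offline.

```python
def split_long_sequences(lines, count_subwords, max_len):
    """Yield CoNLL lines, starting a new sequence before a token that would
    push the running subword count past max_len."""
    running = 0
    for line in lines:
        line = line.rstrip()
        if not line:                # existing sequence boundary: reset the counter
            running = 0
            yield ""
            continue
        n = count_subwords(line.split()[0])
        if n == 0:                  # token the tokenizer maps to nothing: drop it
            continue
        if running + n > max_len:   # would overflow: break the sequence here
            yield ""
            running = n
        else:
            running += n
        yield line


def fake_subword_count(word):
    """Stand-in for a real tokenizer (assumption): one subword per 4 characters."""
    return -(-len(word) // 4)


toy_lines = ["Martín B-PER", "Miguel I-PER", "de I-PER", "Güemes I-PER"]
chunks = list(split_long_sequences(toy_lines, fake_subword_count, max_len=3))
print(chunks)
```

With a real tokenizer, `count_subwords` could be something like `lambda w: len(tokenizer.tokenize(w))`, where `tokenizer` is loaded with `AutoTokenizer.from_pretrained(MODEL)`.
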

In [39]:
#%%capture
!wget "https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py"

--2022-05-17 16:54:58--  https://raw.githubusercontent.com/stefan-it/fine-tuned-berts-seq/master/scripts/preprocess.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 991 [text/plain]
Saving to: ‘preprocess.py’


2022-05-17 16:54:58 (46.9 MB/s) - ‘preprocess.py’ saved [991/991]



In [40]:
!python3 preprocess.py train_temp.txt $MODEL $MAX_LENGTH > train.txt
!python3 preprocess.py dev_temp.txt $MODEL $MAX_LENGTH > dev.txt
!python3 preprocess.py test_temp.txt $MODEL $MAX_LENGTH > test.txt

Downloading: 100% 16.0/16.0 [00:00<00:00, 15.1kB/s]
Downloading: 100% 487/487 [00:00<00:00, 293kB/s]
Downloading: 100% 932k/932k [00:00<00:00, 5.70MB/s]
Downloading: 100% 500k/500k [00:00<00:00, 3.44MB/s]


## 3. Labels

In the CoNLL-2002/2003 datasets, there are 9 classes of NER tags:

- O, outside of any entity
- B-MISC, beginning of a miscellaneous entity right after another miscellaneous entity
- I-MISC, miscellaneous entity
- B-PER, beginning of a person entity right after another person entity
- I-PER, person name
- B-ORG, beginning of an organization entity right after another organization entity
- I-ORG, organization
- B-LOC, beginning of a location entity right after another location entity
- I-LOC, location
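
To make the tagging scheme concrete, here is a small sketch that turns a tag sequence into entity spans. The function name `extract_entities` is hypothetical. It accepts both the convention described in the list (`B-` only between adjacent entities of the same type) and the common variant where every entity starts with `B-`.

```python
def extract_entities(tags):
    """Return (type, start, end) spans, end exclusive, from a B-/I-/O tag sequence."""
    spans, start, current = [], None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # trailing "O" flushes the last span
        prefix, _, etype = tag.partition("-")
        # A new entity starts on any B- tag, or on an I- tag whose type differs
        # from the entity currently open (including when none is open).
        begins_new = prefix == "B" or (prefix == "I" and etype != current)
        if current is not None and (tag == "O" or begins_new):
            spans.append((current, start, i))
            start, current = None, None
        if tag != "O" and begins_new:
            start, current = i, etype
    return spans


tags = ["I-PER", "I-PER", "O", "I-LOC", "B-LOC", "I-LOC"]
print(extract_entities(tags))
# [('PER', 0, 2), ('LOC', 3, 4), ('LOC', 4, 6)]
```

The `B-LOC` at position 4 is what separates the two adjacent locations. Without it they would be read as a single three-token entity.
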

If the dataset contains other labels, the command below extracts the tags actually present in the current dataset. They are saved to `labels.txt`, which is then used for fine-tuning.

In [41]:
!cat train.txt dev.txt test.txt | cut -d " " -f 2 | grep -v "^$"| sort | uniq > labels.txt
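
For readers less familiar with shell pipelines, the following is a rough Python equivalent of the command above. `collect_labels` is a hypothetical helper. Unlike `cut`, it skips lines that contain no space at all, and it sorts by code point rather than by locale. The toy file is created only for the demonstration.

```python
import os
import tempfile


def collect_labels(paths):
    """Return the sorted set of labels found in the second space-separated column."""
    labels = set()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split(" ")
                if len(parts) > 1 and parts[1]:   # skip blank and label-less lines
                    labels.add(parts[1])
    return sorted(labels)


# Demonstration on a throwaway file in the same "word label" format.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy.txt")
    with open(toy_path, "w", encoding="utf-8") as f:
        f.write("Güemes B-PER\nllegó O\n\nSalta B-LOC\n")
    toy_labels = collect_labels([toy_path])

print(toy_labels)
# ['B-LOC', 'B-PER', 'O']
```

In the notebook, the call would be `collect_labels(["train.txt", "dev.txt", "test.txt"])`, with the result written one label per line to `labels.txt`.
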

# Fine-tuning the Model


These are the scripts from the `transformers` repository used to fine-tune for NER. After 21/04/2020, Hugging Face updated its examples to use a new `Trainer` class. To avoid conflicts, we use the version from before those updates.

In [42]:
#%%capture
# We do not use run_ner.py from the original repo because of errors with newer versions of transformers
#!wget "https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/ner/run_ner.py"
!wget "https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/ner/utils_ner.py"

--2022-05-17 16:57:16--  https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/ner/utils_ner.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.108.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8603 (8.4K) [text/plain]
Saving to: ‘utils_ner.py’


2022-05-17 16:57:16 (41.2 MB/s) - ‘utils_ner.py’ saved [8603/8603]



Transfer learning phase. [This post](https://chriskhanhtran.github.io/posts/spanberta-bert-for-spanish-from-scratch/) shows an example of pre-training a RoBERTa model on a very large Spanish corpus to predict masked words. This process yields a model that has learned basic properties of the language.

The model's hyperparameters are shown below.

In [43]:
MAX_LENGTH = 128 #@param {type: "integer"}
MODEL = "chriskhanhtran/spanberta" #@param ["chriskhanhtran/spanberta", "bert-base-multilingual-cased"]
OUTPUT_DIR = "spanberta-ner" #@param ["spanberta-ner", "bert-base-ml-ner"]
BATCH_SIZE = 32 #@param {type: "integer"}
NUM_EPOCHS = 3 #@param {type: "integer"}
SAVE_STEPS = 100 #@param {type: "integer"}
LOGGING_STEPS = 100 #@param {type: "integer"}
SEED = 42 #@param {type: "integer"}
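
As a back-of-the-envelope sketch, these settings can be translated into optimizer steps and checkpoint counts. The calculation assumes a single GPU, no gradient accumulation, and that `SAVE_STEPS` counts optimizer steps. The figure of 1000 training sequences is a made-up placeholder, not the size of this dataset.

```python
import math


def training_schedule(num_sequences, batch_size, num_epochs, save_steps):
    """Return (steps per epoch, total steps, number of intermediate checkpoints)."""
    steps_per_epoch = math.ceil(num_sequences / batch_size)
    total_steps = steps_per_epoch * num_epochs
    checkpoints = total_steps // save_steps
    return steps_per_epoch, total_steps, checkpoints


# Hypothetical dataset size (assumption); the other values match the cell above.
schedule = training_schedule(num_sequences=1000, batch_size=32,
                             num_epochs=3, save_steps=100)
print(schedule)
# (32, 96, 0)
```

Under these assumptions, a small dataset can finish training before the first `SAVE_STEPS` boundary is reached, so no intermediate checkpoint would be written.
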

Start of training.

In [48]:
!python3 '/content/drive/MyDrive/ProyectoNERC/NERC/run_ner.py' \
  --data_dir ./ \
  --model_type bert \
  --labels ./labels.txt \
  --model_name_or_path $MODEL \
  --output_dir $OUTPUT_DIR \
  --max_seq_length  $MAX_LENGTH \
  --num_train_epochs $NUM_EPOCHS \
  --per_gpu_train_batch_size $BATCH_SIZE \
  --save_steps $SAVE_STEPS \
  --logging_steps $LOGGING_STEPS \
  --seed $SEED \
  --do_train \
  --do_eval \
  --do_predict \
  --overwrite_output_dir

05/17/2022 17:08:16 - INFO - __main__ -   Tokenizer arguments: {'do_lower_case': False}
Some weights of the model checkpoint at chriskhanhtran/spanberta were not used when initializing RobertaForTokenClassification: ['lm_head.layer_norm.bias', 'lm_head.decoder.weight', 'lm_head.dense.weight', 'lm_head.bias', 'lm_head.decoder.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight']
- This IS expected if you are initializing RobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForTokenClassification were not initialized from the model checkpoint at chriskhanhtran/spanberta a

Performance on the dev set:
```
21/04/2020 02:24:31 - INFO - __main__ -   ***** Eval results  *****
21/04/2020 02:24:31 - INFO - __main__ -     f1 = 0.831027443864822
21/04/2020 02:24:31 - INFO - __main__ -     loss = 0.1004064822183894
21/04/2020 02:24:31 - INFO - __main__ -     precision = 0.8207885304659498
21/04/2020 02:24:31 - INFO - __main__ -     recall = 0.8415250344510795
```
Performance on the test set:
```
21/04/2020 02:24:48 - INFO - __main__ -   ***** Eval results  *****
21/04/2020 02:24:48 - INFO - __main__ -     f1 = 0.8559533721898419
21/04/2020 02:24:48 - INFO - __main__ -     loss = 0.06848683688204177
21/04/2020 02:24:48 - INFO - __main__ -     precision = 0.845858475041141
21/04/2020 02:24:48 - INFO - __main__ -     recall = 0.8662921348314607
```

TensorBoard plots of the fine-tuning process for [spanberta](https://tensorboard.dev/experiment/Ggs7aCjWQ0exU2Nbp3pPlQ/#scalars&_smoothingWeight=0.265) and [bert-base-multilingual-cased](https://tensorboard.dev/experiment/M9AXw2lORjeRzFZzEJOxkA/#scalars) over 5 epochs. Overfitting sets in after 3 epochs.

![](https://raw.githubusercontent.com/chriskhanhtran/spanish-bert/master/img/spanberta-ner-tb-5.JPG)





**Classification Report**

In [49]:
def read_examples_from_file(file_path):
    """Read words and labels from a CoNLL-2002/2003 data file.
    
    Args:
      file_path (str): path to NER data file.

    Returns:
      examples (dict): a dictionary with two keys: `words` (list of lists)
        holding words in each sequence, and `labels` (list of lists) holding
        corresponding labels.
    """
    with open(file_path, encoding="utf-8") as f:
        examples = {"words": [], "labels": []}
        words = []
        labels = []
        for line in f:
            if line.startswith("-DOCSTART-") or line == "" or line == "\n":
                if words:
                    examples["words"].append(words)
                    examples["labels"].append(labels)
                    words = []
                    labels = []
            else:
                splits = line.split(" ")
                words.append(splits[0])
                if len(splits) > 1:
                    labels.append(splits[-1].replace("\n", ""))
                else:
                    # Examples could have no label for mode = "test"
                    labels.append("O")
    return examples

We load the labels of the original texts and the model's predictions:

In [50]:
y_true = read_examples_from_file("test.txt")["labels"]
y_pred = read_examples_from_file("spanberta-ner/test_predictions.txt")["labels"]

Report:

In [51]:
from seqeval.metrics import classification_report as classification_report_seqeval

print(classification_report_seqeval(y_true, y_pred))

              precision    recall  f1-score   support

        DATE       0.66      0.65      0.66        97
         LOC       0.79      0.62      0.69       226
         PER       0.70      0.69      0.69       246

   micro avg       0.72      0.65      0.69       569
   macro avg       0.72      0.65      0.68       569
weighted avg       0.73      0.65      0.69       569



The metrics in this report are designed specifically for NLP tasks such as NER and POS tagging, in which every word of an entity must be detected correctly for it to count as a hit. This is why the values are lower than those in the [scikit-learn classification report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html).

In [52]:
import numpy as np
from sklearn.metrics import classification_report

print(classification_report(np.concatenate(y_true), np.concatenate(y_pred)))

              precision    recall  f1-score   support

      B-DATE       0.74      0.67      0.70        90
       B-LOC       0.96      0.58      0.72       224
       B-PER       0.80      0.73      0.76       241
      I-DATE       0.97      0.82      0.89       206
       I-LOC       0.60      0.72      0.66        80
       I-PER       0.85      0.84      0.85       313
           O       0.98      0.99      0.99      9134

    accuracy                           0.96     10288
   macro avg       0.84      0.76      0.80     10288
weighted avg       0.96      0.96      0.96     10288
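
The gap between the two reports comes from the unit being scored: whole entities in the first, individual tokens in the second. The toy example below sketches both computations in plain Python. It follows the spirit of exact-match entity scoring rather than reproducing seqeval's implementation, and the helper names are hypothetical.

```python
def spans(tags):
    """Return the set of (type, start, end) entity spans in one tag sequence."""
    out, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # trailing "O" flushes the last span
        prefix, _, kind = tag.partition("-")
        new = prefix == "B" or (prefix == "I" and kind != etype)
        if etype is not None and (tag == "O" or new):
            out.add((etype, start, i))
            start, etype = None, None
        if tag != "O" and new:
            start, etype = i, kind
    return out


def entity_prf(y_true, y_pred):
    """Entity-level precision, recall and F1: a span counts only on exact match."""
    gold = {(s,) + e for s, t in enumerate(y_true) for e in spans(t)}
    pred = {(s,) + e for s, p in enumerate(y_pred) for e in spans(p)}
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


def token_accuracy(y_true, y_pred):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    pairs = [(t, p) for ts, ps in zip(y_true, y_pred) for t, p in zip(ts, ps)]
    return sum(t == p for t, p in pairs) / len(pairs)


# One sentence: the model truncates a 3-token person name to 2 tokens.
toy_true = [["B-PER", "I-PER", "I-PER", "O", "B-LOC"]]
toy_pred = [["B-PER", "I-PER", "O", "O", "B-LOC"]]
print(token_accuracy(toy_true, toy_pred))   # 0.8
print(entity_prf(toy_true, toy_pred))       # (0.5, 0.5, 0.5)
```

A single wrong token costs one fifth of the token-level score but invalidates the whole person entity at the entity level, which is why entity-level figures tend to be lower.
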

