<a href="https://colab.research.google.com/github/bpanny/nlp-hw4/blob/main/hw4_bert_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fine-tune BERT-based models from Hugging Face on CoNLL-2002 Spanish NER data

In this notebook, you will fine-tune and evaluate multiple BERT-based models on CoNLL-2002 Spanish NER data.

Code for loading and preprocessing the data is provided. You will provide code for training and evaluation using the Hugging Face `Trainer` or PyTorch.

Please copy this notebook and name it `{pitt email id}_hw4_bert_ner.ipynb`.

Run all the cells starting from the top, completing each section marked `FILL IN`.

**Note**: Please run on a GPU by going to Runtime > Change Runtime Type and selecting T4 GPU.

This notebook is based on:
* https://github.com/laxmimerit/NLP-Tutorials-with-HuggingFace/blob/main/NLP_with_HuggingFace_Tutorial_2_NER_Training.ipynb  
* https://skimai.com/how-to-fine-tune-bert-for-named-entity-recognition-ner/

# Set up environment, preprocess data

In [None]:
# Download and install needed Hugging Face packages

!pip install -U transformers
!pip install -U accelerate
!pip install -U datasets

In [None]:
# Load dataset, which contains splits for training, validation (dev), and test

import pandas as pd
from datasets import load_dataset

data = load_dataset('conll2002', 'es')
data

In [None]:
# Examine the tagset. Note the BIO scheme with 4 entity types (PER, ORG, LOC, MISC)

tags = data['train'].features['ner_tags'].feature

index2tag = {idx:tag for idx, tag in enumerate(tags.names)}
tag2index = {tag:idx for idx, tag in enumerate(tags.names)}
index2tag

In [None]:
# Put human-readable NER tags in data

def create_tag_names(batch):
  tag_name = {'ner_tags_str': [tags.int2str(idx) for idx in batch['ner_tags']]}
  return tag_name

data = data.map(create_tag_names)

In [None]:
# Take a look at the data
pd.DataFrame(data['train'])[['tokens', 'ner_tags', 'ner_tags_str']].head(3)
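
In the BIO scheme, a `B-` tag opens an entity, `I-` tags of the same type continue it, and `O` marks tokens outside any entity. The sketch below shows one way to group a tagged sentence into entity spans. The function name `bio_to_spans` and the example sentence and tags are hand-written for illustration; they are not taken from the dataset.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        # An I- tag continues a span only if it matches the open entity type
        if tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
            continue
        # Anything else closes the span that was open, if any
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        if tag == "O":
            current_type, current_tokens = None, []
        else:
            # A B- tag, or a stray I- tag treated as the start of a new span
            current_type, current_tokens = tag[2:], [token]
    # Close a span that runs to the end of the sentence
    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hand-written example, not a row from CoNLL-2002
tokens = ["Miguel", "Salgado", "trabaja", "en", "la", "Universidad", "de", "Pittsburgh", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

Entity-level metrics such as F1 are computed over spans like these rather than over individual tokens.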

## Metrics
Load NER-specific evaluation metrics
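
The `compute_metrics` function in this section drops every position labeled `-100` (special tokens added by the tokenizer) before scoring, and converts the remaining ids to tag strings. The sketch below illustrates that filtering step in plain Python on a toy sentence. The helper name `decode_for_seqeval` and the toy ids are assumptions made for illustration; the tag order follows the tagset used in this notebook.

```python
IGNORE_INDEX = -100  # label value that the loss and the metrics skip

label_names = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
               "B-LOC", "I-LOC", "B-MISC", "I-MISC"]


def decode_for_seqeval(pred_ids, label_ids, names):
    """Drop ignored positions and map ids to tag strings, sentence by sentence."""
    true_preds, true_labels = [], []
    for preds, labels in zip(pred_ids, label_ids):
        # Keep only positions whose gold label is a real tag
        kept = [(p, l) for p, l in zip(preds, labels) if l != IGNORE_INDEX]
        true_preds.append([names[p] for p, _ in kept])
        true_labels.append([names[l] for _, l in kept])
    return true_preds, true_labels


# Toy sentence: [special] Miguel Salgado vive [special]
label_ids = [[-100, 1, 2, 0, -100]]
pred_ids = [[0, 1, 2, 5, 0]]  # made-up predictions with one mistake on "vive"

preds, refs = decode_for_seqeval(pred_ids, label_ids, label_names)
print(preds)
print(refs)
```

Predictions at ignored positions never reach the metric, so whatever the model outputs for special tokens does not affect the scores.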

In [None]:
!pip install seqeval
!pip install evaluate

import evaluate
import numpy as np

metric = evaluate.load('seqeval')
ner_feature = data['train'].features['ner_tags']
label_names = ner_feature.feature.names
labels = data['train'][0]['ner_tags']
labels = [label_names[i] for i in labels]

def compute_metrics(eval_preds):
  logits, labels = eval_preds

  predictions = np.argmax(logits, axis=-1)

  true_labels = [[label_names[l] for l in label if l!=-100] for label in labels]

  true_predictions = [[label_names[p] for p,l in zip(prediction, label) if l!=-100]
                      for prediction, label in zip(predictions, labels)]

  all_metrics = metric.compute(predictions=true_predictions, references=true_labels)

  return {"precision": all_metrics['overall_precision'],
          "recall": all_metrics['overall_recall'],
          "f1": all_metrics['overall_f1'],
          "accuracy": all_metrics['overall_accuracy']}

# Fine-tune models
This section is where you choose models and fill in parts of the code to do fine-tuning.

You need to fine-tune at least 2 pretrained models from the Hugging Face platform on the preprocessed CoNLL-2002 Spanish data:
* One BERT-based model pretrained with a regular masked language modeling (MLM) objective on a Spanish corpus. Examples: `PlanTL-GOB-ES/roberta-base-bne`, `chriskhanhtran/spanberta`
* One model pretrained to perform NER on another language, such as English. Models pretrained on the CoNLL-2003 dataset often work. Examples: `elastic/distilbert-base-cased-finetuned-conll03-english`, `dbmdz/bert-large-cased-finetuned-conll03-english`

Make sure whichever pretrained model you choose is cased, since capitalization carries valuable information for NER.

In [None]:
# FILL IN which model you are fine-tuning and assign the name of it to the `pretrained_model` variable

pretrained_model =

In [None]:
# Tokenize the data with the pretrained model's tokenizer
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(pretrained_model, use_fast=True, add_prefix_space=True)

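def align_labels_with_tokens(labels, word_ids):
  # Expand word-level labels to subword tokens; -100 marks positions the loss ignores
  new_labels = []
  current_word = None
  for word_id in word_ids:
    if word_id != current_word:
      # First token of a new word (or a special token that follows a word)
      current_word = word_id
      label = -100 if word_id is None else labels[word_id]
      new_labels.append(label)

    elif word_id is None:
      # Special token added by the tokenizer
      new_labels.append(-100)

    else:
      # Continuation subword of the current word
      label = labels[word_id]

      # Odd ids are B- tags in this tagset; switch to the matching I- tag
      if label % 2 == 1:
        label = label + 1
      new_labels.append(label)

  return new_labels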

def tokenize_and_align_labels(examples):
  tokenized_inputs = tokenizer(examples['tokens'], truncation=True, is_split_into_words=True)

  all_labels = examples['ner_tags']

  new_labels = []
  for i, labels in enumerate(all_labels):
    word_ids = tokenized_inputs.word_ids(i)
    new_labels.append(align_labels_with_tokens(labels, word_ids))

  tokenized_inputs['labels'] = new_labels

  return tokenized_inputs

# Apply to every split, dropping the original columns
tokenized_datasets = data.map(tokenize_and_align_labels, batched=True, remove_columns=data['train'].column_names)
tokenized_datasets
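
To see what the alignment does, the sketch below restates the same rule compactly and applies it to a hand-written `word_ids` list. The function name `align_labels` and the subword split shown are hypothetical; a real tokenizer may split these words differently.

```python
def align_labels(labels, word_ids):
    """Same rule as align_labels_with_tokens, restated compactly for this demo."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(-100)             # special token: ignored by loss and metrics
        elif word_id != previous:
            aligned.append(labels[word_id])  # first subword keeps the word's label
        else:
            label = labels[word_id]
            # Continuation subword: odd ids are B- tags, so shift to the matching I- tag
            aligned.append(label + 1 if label % 2 == 1 else label)
        previous = word_id
    return aligned


# Words: ["Miguel", "Salgado", "vive"] with labels B-PER, I-PER, O
word_labels = [1, 2, 0]
# Hypothetical split: "Miguel" -> 2 subwords, "Salgado" -> 3, "vive" -> 1,
# with one special token at each end
word_ids = [None, 0, 0, 1, 1, 1, 2, None]

aligned = align_labels(word_labels, word_ids)
print(aligned)
```

Only the first subword of "Miguel" keeps `B-PER` (1); its continuation becomes `I-PER` (2), so each entity still starts with exactly one `B-` tag.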

In [None]:
# Build a data collator to handle batching

from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)

# Train (fine-tune) the model

In [None]:
id2label = {i:label for i, label in enumerate(label_names)}
label2id = {label:i for i, label in enumerate(label_names)}
print(id2label)

{0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG', 5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}


In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(pretrained_model,
                                                    id2label=id2label,
                                                    label2id=label2id)

## FILL IN code to train
Provide code to train (fine-tune) the pretrained model.
You can use the Hugging Face `Trainer` class or any other package you want, such as PyTorch.

See the [Hugging Face Trainer user guide](https://huggingface.co/learn/nlp-course/chapter3/3?fw=pt) or any other examples and resources you find online.
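
As one possible starting point, the sketch below wires the objects built earlier in this notebook into a `Trainer`. The function name `build_trainer`, the output directory, and every hyperparameter value are untuned assumptions chosen for illustration, not recommended settings.

```python
def build_trainer(model, train_dataset, eval_dataset, data_collator,
                  compute_metrics, output_dir="ner-finetuned"):
    """Assemble a Trainer; all hyperparameters here are untuned placeholders."""
    # Imported here so the helper can be defined before the setup cells have run
    from transformers import Trainer, TrainingArguments

    args = TrainingArguments(
        output_dir=output_dir,            # hypothetical checkpoint directory
        learning_rate=2e-5,               # assumption: a common fine-tuning value
        per_device_train_batch_size=16,
        per_device_eval_batch_size=16,
        num_train_epochs=3,
        weight_decay=0.01,
        report_to="none",                 # disable external logging integrations
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
        compute_metrics=compute_metrics,
    )


# Usage with the objects defined in this notebook:
# trainer = build_trainer(model, tokenized_datasets["train"],
#                         tokenized_datasets["validation"],
#                         data_collator, compute_metrics)
# trainer.train()
```

With these arguments the validation split is not scored during training; calling `trainer.evaluate()` afterwards reports the metrics on it.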

In [None]:
# Training code here

# Evaluate the fine-tuned model

## FILL IN code to evaluate performance of the model on the test set
Provide code to evaluate the fine-tuned model on the `test` portion of the dataset (`tokenized_datasets['test']`).

You'll need the F1 score for your report.
This is calculated automatically if you passed the `compute_metrics` function to the `Trainer` class.
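
When `Trainer.evaluate` or `Trainer.predict` returns metrics, the keys from `compute_metrics` are prefixed (by default `eval_` and `test_` respectively), so F1 appears as `eval_f1` or `test_f1`. The sketch below is a small helper for pulling a metric out regardless of prefix. The helper name `get_metric` is hypothetical, and the dictionary is a placeholder with dummy zeros, not actual results.

```python
def get_metric(metrics, name):
    """Return the value stored under `name` or under a prefixed key such as 'eval_f1'."""
    for key, value in metrics.items():
        if key == name or key.endswith("_" + name):
            return value
    raise KeyError(f"no '{name}' entry among {sorted(metrics)}")


# Placeholder shaped like Trainer.evaluate() output; the zeros are dummies, not results
example_metrics = {
    "eval_loss": 0.0,
    "eval_precision": 0.0,
    "eval_recall": 0.0,
    "eval_f1": 0.0,
    "eval_accuracy": 0.0,
}
print(get_metric(example_metrics, "f1"))

# With a trained Trainer (assumed to be named `trainer`):
# metrics = trainer.evaluate(eval_dataset=tokenized_datasets["test"])
# print(get_metric(metrics, "f1"))
```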

In [None]:
# Testing code here

Hooray, you're done evaluating a model!

Feel free to restart the runtime and evaluate another one, or test that model on an example in the section below (which you'll need to do for at least one model).

# Test the model on an example
Code is provided here to test your fine-tuned classifier on an example sentence.

You will need to fill in the path to a checkpoint of your fine-tuned model if it has been saved somewhere. Or feel free to run your model some other way on the example sentence.

You will need the output of running at least one of your models on the example sentence for your report.

In [None]:
# Test performance on an example

from transformers import pipeline

checkpoint = # FILL IN path to one of the checkpoints of your fine-tuned model
token_classifier = pipeline(
    "token-classification", model=checkpoint, aggregation_strategy="simple"
)

test_sentence = "Mi nombre es Miguel Salgado. Trabajo en la Universidad de Pittsburgh y vivo en Pittsburgh."
token_classifier(test_sentence)
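
With `aggregation_strategy="simple"`, the pipeline returns a list of dictionaries, one per detected entity, with keys including `entity_group`, `score`, and `word` (plus `start` and `end` character offsets). The sketch below shows one way to summarize such output by entity type for a report. The helper name `group_entities` is hypothetical, and the list is a hand-written placeholder in the pipeline's output format with dummy scores; it is not real model output.

```python
from collections import defaultdict


def group_entities(entities):
    """Collect the detected entity strings under their entity type."""
    grouped = defaultdict(list)
    for entity in entities:
        grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Hand-written placeholder, NOT real model output; scores are dummy values.
# Real pipeline output also carries "start" and "end" character offsets.
placeholder_output = [
    {"entity_group": "PER", "score": 0.0, "word": "Miguel Salgado"},
    {"entity_group": "LOC", "score": 0.0, "word": "Pittsburgh"},
]

grouped = group_entities(placeholder_output)
for entity_type, words in grouped.items():
    print(entity_type, words)

# With the real classifier from the cell above:
# group_entities(token_classifier(test_sentence))
```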