# Named Entity Recognition (HuggingFace)

Reference:  https://huggingface.co/learn

## 1. Load the data

CoNLL-2003 dataset

## 2. Preprocessing

Tokenize the words into sub-word token IDs (numericalization), then align the word-level labels with the resulting tokens.
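
The alignment step can be sketched as follows. This is a minimal illustration, not necessarily the exact function used in this notebook. It assumes the CoNLL-2003 label ids (odd ids are `B-` tags and the following even id is the matching `I-` tag) and a `word_ids` list like the one a fast tokenizer returns, where `None` marks a special token. The function name and the sample values are hypothetical.

```python
def align_labels_with_tokens(labels, word_ids):
    """Expand word-level label ids to token-level label ids.

    labels:   one label id per word
    word_ids: for each token, the index of the word it came from,
              or None for special tokens such as [CLS] and [SEP]
    """
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: -100 is ignored by the loss function.
            new_labels.append(-100)
        elif word_id != current_word:
            # First sub-token of a new word keeps the word's label.
            current_word = word_id
            new_labels.append(labels[word_id])
        else:
            # Later sub-token of the same word: turn B-XXX into I-XXX
            # (assumes odd ids are B- tags, as in CoNLL-2003).
            label = labels[word_id]
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels


# Hypothetical sentence of four words where the third word is split in two.
word_labels = [3, 0, 7, 0]                    # B-ORG, O, B-MISC, O
word_ids = [None, 0, 1, 2, 2, 3, None]
print(align_labels_with_tokens(word_labels, word_ids))
```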

## 3. Dataloader

## 4. Model

The model is the second stage of the pipeline: it comes after the tokenizer and before post-processing.

## 5. Metrics

We need to define `compute_metrics()`, which takes a list of predictions and a list of labels and returns a dictionary mapping metric names to values.

Note: the metric depends on `seqeval`, so install it first with `pip install seqeval`.
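
To make that contract concrete, here is a minimal sketch of the shape such a function can take. It is an assumption-laden stand-in: it works on plain lists of label ids, drops positions labelled `-100`, and reports only token-level accuracy. It does not reproduce the entity-level precision, recall and F1 that `seqeval` computes. The signature and the sample `label_names` are hypothetical.

```python
def compute_metrics(predictions, labels, label_names):
    """Return a dict of metric names and values.

    predictions, labels: lists of per-sentence lists of label ids.
    Positions whose label is -100 (special tokens, padding) are skipped.
    """
    true_labels = [
        [label_names[l] for l in sentence if l != -100]
        for sentence in labels
    ]
    true_predictions = [
        [label_names[p] for p, l in zip(prediction, sentence) if l != -100]
        for prediction, sentence in zip(predictions, labels)
    ]

    # Token-level accuracy only; seqeval would score whole entities instead.
    correct = sum(
        p == l
        for preds, refs in zip(true_predictions, true_labels)
        for p, l in zip(preds, refs)
    )
    total = sum(len(refs) for refs in true_labels)
    return {"accuracy": correct / total if total else 0.0}


label_names = ["O", "B-PER", "I-PER"]          # hypothetical label set
predictions = [[0, 1, 2, 0], [0, 0, 1]]
labels = [[-100, 1, 2, -100], [0, 1, -100]]
print(compute_metrics(predictions, labels, label_names))
```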

## 6. Optimizer

In [45]:
from torch.optim import AdamW

# AdamW: Adam with decoupled weight decay
optimizer = AdamW(model.parameters(), lr=2e-5)

## 7. Accelerator

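Normally you would write a plain training loop that runs on a single device.

Hugging Face provides a wrapper called `Accelerator` that lets the same loop run in parallel across whatever hardware is available, handling device placement and distribution for you.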

In [46]:
from accelerate import Accelerator

In [47]:
accelerator = Accelerator()

model, optimizer, train_loader, val_loader = \
    accelerator.prepare(model, optimizer, train_loader, val_loader)

## 8. Learning rate scheduler

In [48]:
from transformers import get_scheduler

num_train_epochs = 1
num_update_steps_per_epoch = len(train_loader)
num_train_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_train_steps)
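
As a rough illustration of what the `"linear"` schedule does to the learning rate, the sketch below computes the rate at a given step in plain Python: an optional linear warmup from 0, followed by a linear decay to 0 at the final step. The function name `linear_lr` and the step count of 1000 are hypothetical; the real scheduler applies the same kind of multiplier to the optimizer's learning rate each time `lr_scheduler.step()` is called.

```python
def linear_lr(step, base_lr, num_training_steps, num_warmup_steps=0):
    """Learning rate at `step` under linear warmup followed by linear decay."""
    if step < num_warmup_steps:
        # Warmup: ramp up from 0 to base_lr.
        factor = step / max(1, num_warmup_steps)
    else:
        # Decay: ramp down from base_lr to 0 at the last training step.
        remaining = num_training_steps - step
        factor = max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))
    return base_lr * factor


# With no warmup, the rate starts at base_lr and halves by the midpoint.
for step in (0, 500, 1000):
    print(step, linear_lr(step, base_lr=2e-5, num_training_steps=1000))
```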

## 9. Repository

A repository is free cloud storage hosted on the Hugging Face Hub.

It is very useful because the model can be uploaded at regular intervals during training. If something suddenly crashes, you can resume from the weights that were already pushed to the Hub.

In [49]:
from huggingface_hub import notebook_login

notebook_login()


In [50]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "bert-finetuned-ner-accelerate"
repo_name  = get_full_repo_name(model_name)
repo_name

'Chaklam/bert-finetuned-ner-accelerate'

In [51]:
# Requires git-lfs. Install it with one of:
#   sudo apt install git-lfs   (Debian/Ubuntu)
#   brew install git-lfs       (macOS)
# or download it from the git-lfs website.

import os

os.environ["TOKENIZERS_PARALLELISM"] = "true"

output_dir = "bert-finetuned-ner-accelerate"
repo       = Repository(output_dir, clone_from=repo_name)

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
/Users/chaklam/Github/Natural-Language-Processing/Code/04 - Huggingface/code-along/bert-finetuned-ner-accelerate is already a clone of https://huggingface.co/Chaklam/bert-finetuned-ner-accelerate. Make sure you pull the latest changes with `repo.git_pull()`.


## 10. Training

In [52]:
#convert predictions and labels into strings, like 
#what our metric object expects

def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels      = labels.detach().cpu().clone().numpy()
    
    true_labels = [[label_names[l] for l in label if l != -100]
                   for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_labels, true_predictions
    

In [53]:
from tqdm.auto import tqdm #progress bar
import torch

progress_bar = tqdm(range(num_train_steps))

for epoch in range(num_train_epochs):
    model.train()
    for batch in train_loader:
        outputs = model(**batch)  # ** unpacks the batch dict into keyword arguments (input_ids=..., labels=...)
        loss    = outputs.loss
        accelerator.backward(loss)  # replaces loss.backward()
        
        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)
    
    #evaluation
    model.eval()  # disables dropout; batchnorm layers use their running statistics
    for batch in val_loader:
        with torch.no_grad():
            outputs = model(**batch) 
        
        predictions = outputs.logits.argmax(dim = -1)
        labels      = batch["labels"]
        
        # pad predictions and labels to the same length on every process; gather() needs matching shapes
        predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
        labels      = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)
        
        predictions_gathered = accelerator.gather(predictions)
        labels_gathered      = accelerator.gather(labels)
        
        # postprocess returns (true_labels, true_predictions), in that order
        true_labels, true_predictions = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=true_predictions, references=true_labels)
        
    results = metric.compute()

    print(
        f"epoch {epoch}: ",
        {
            key: results[f"overall_{key}"]
            for key in ["precision", "recall", "f1", "accuracy"]
        },
    )
        
    #save and upload your model
    accelerator.wait_for_everyone()  # synchronize all processes before saving
    unwrapped_model = accelerator.unwrap_model(model)  # strip the wrapper added by accelerator.prepare()
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(commit_message=f"Training in progress epoch {epoch}", blocking=False)
    

  0%|          | 0/1756 [00:00<?, ?it/s]

## 11. Inference

In [1]:
from transformers import pipeline

checkpoint = "Chaklam/bert-finetuned-ner-accelerate"

clf        = pipeline("token-classification", model=checkpoint, aggregation_strategy="simple")

In [2]:
clf("Ayush and Chaklam are going to play soccer today at AIT, Bangkok, Thailand and eat some snacks")

[{'entity_group': 'PER',
  'score': 0.9725842,
  'word': 'Ayush',
  'start': 0,
  'end': 5},
 {'entity_group': 'PER',
  'score': 0.98298603,
  'word': 'Chaklam',
  'start': 10,
  'end': 17},
 {'entity_group': 'LOC',
  'score': 0.7392455,
  'word': 'AIT',
  'start': 52,
  'end': 55},
 {'entity_group': 'LOC',
  'score': 0.99720275,
  'word': 'Bangkok',
  'start': 57,
  'end': 64},
 {'entity_group': 'LOC',
  'score': 0.9977047,
  'word': 'Thailand',
  'start': 66,
  'end': 74}]
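
The pipeline returns a plain list of dictionaries, so it is easy to post-process. As one possible follow-up, the sketch below groups the detected words by entity type and drops low-confidence hits. The helper `group_entities` and the `min_score` threshold are hypothetical and not part of `transformers`; the sample data is copied from the output above, with the `start` and `end` offsets omitted.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.0):
    """Group entity words by type, keeping those scoring at least min_score."""
    grouped = defaultdict(list)
    for entity in entities:
        if entity["score"] >= min_score:
            grouped[entity["entity_group"]].append(entity["word"])
    return dict(grouped)


# Entities copied from the pipeline output shown above.
entities = [
    {"entity_group": "PER", "score": 0.9725842, "word": "Ayush"},
    {"entity_group": "PER", "score": 0.98298603, "word": "Chaklam"},
    {"entity_group": "LOC", "score": 0.7392455, "word": "AIT"},
    {"entity_group": "LOC", "score": 0.99720275, "word": "Bangkok"},
    {"entity_group": "LOC", "score": 0.9977047, "word": "Thailand"},
]

print(group_entities(entities, min_score=0.9))
```

With a threshold of 0.9, `AIT` is filtered out because its score is about 0.74.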