# Deploy BioSemantics NER model

In this notebook we: 
* Show how to use the fine-tuned NER model to do inference
* Rely on the Hugging Face `pipeline` object
* Demonstrate how to spin up a GUI using Gradio

In [None]:
%%capture
!pip install -r ./requirements.txt

### Library imports

In [367]:
from transformers import pipeline
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch
from transformers import DataCollatorForTokenClassification
from datasets import load_dataset
import gradio as gr
import os
from huggingface_hub import login

### Download checkpoint and tokenizer

In [368]:
%%capture
# Login to HuggingFace
hf_token = os.environ['HF_TOKEN']
login(hf_token)

In [373]:
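# Set up the device: prefer Apple MPS, then CUDA, and fall back to CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")   # for M-Series Mac users
elif torch.cuda.is_available():
    device = torch.device("cuda")  # CUDA
else:
    device = torch.device("cpu")   # no GPU available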

In [345]:
model_checkpoint = "camilothorne/distilbert-base-cased-finetuned-ner-biosem"

In [346]:
model     = AutoModelForTokenClassification.from_pretrained(model_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

### Run model on an example

In [347]:
token_classifier = pipeline(
    task='ner', model=model, tokenizer=tokenizer, device=device, aggregation_strategy=None
)

In [348]:
example1 = "The term 'thiol' or 'sulfhydryl', alone or in combination, means a —SH group."

In [349]:
out1 = token_classifier(example1)

In [350]:
print(out1)

[{'entity': 'B-G', 'score': 0.69100636, 'index': 4, 'word': 'th', 'start': 10, 'end': 12}, {'entity': 'B-G', 'score': 0.7451051, 'index': 5, 'word': '##iol', 'start': 12, 'end': 15}, {'entity': 'B-G', 'score': 0.7969378, 'index': 9, 'word': 'su', 'start': 21, 'end': 23}, {'entity': 'B-G', 'score': 0.81136304, 'index': 10, 'word': '##lf', 'start': 23, 'end': 25}, {'entity': 'B-G', 'score': 0.7791597, 'index': 11, 'word': '##hy', 'start': 25, 'end': 27}, {'entity': 'B-G', 'score': 0.8284253, 'index': 12, 'word': '##dr', 'start': 27, 'end': 29}, {'entity': 'B-G', 'score': 0.8167264, 'index': 13, 'word': '##yl', 'start': 29, 'end': 31}]
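With `aggregation_strategy=None`, the pipeline reports one prediction per subword piece, so `thiol` and `sulfhydryl` each appear as several entries. The sketch below shows one possible way to merge contiguous pieces back into word-level spans. The helper name `merge_wordpieces` is hypothetical, the mean score is an arbitrary choice made for illustration, and the sample scores are rounded from the output above. The pipeline also accepts other `aggregation_strategy` values (such as `"simple"` or `"first"`) that group tokens for you, which may be preferable in practice.

```python
def merge_wordpieces(predictions):
    """Merge consecutive subword predictions into word-level spans.

    Hypothetical helper: assumes each item carries the keys shown in the
    pipeline output ('entity', 'score', 'word', 'start', 'end').
    """
    merged = []
    for pred in predictions:
        is_continuation = pred["word"].startswith("##")
        # A piece continues the previous span only if it is marked with '##'
        # and starts exactly where the previous piece ended.
        if merged and is_continuation and merged[-1]["end"] == pred["start"]:
            span = merged[-1]
            span["word"] += pred["word"][2:]
            span["end"] = pred["end"]
            span["scores"].append(pred["score"])
        else:
            merged.append({
                "entity": pred["entity"],  # label of the first piece
                "word": pred["word"],
                "start": pred["start"],
                "end": pred["end"],
                "scores": [pred["score"]],
            })
    # Report the mean piece score per span (an assumption of this sketch).
    return [
        {
            "entity": s["entity"],
            "word": s["word"],
            "start": s["start"],
            "end": s["end"],
            "score": sum(s["scores"]) / len(s["scores"]),
        }
        for s in merged
    ]


# Subset of the fields printed above, with rounded scores
sample = [
    {"entity": "B-G", "score": 0.69, "word": "th", "start": 10, "end": 12},
    {"entity": "B-G", "score": 0.75, "word": "##iol", "start": 12, "end": 15},
    {"entity": "B-G", "score": 0.80, "word": "su", "start": 21, "end": 23},
    {"entity": "B-G", "score": 0.81, "word": "##lf", "start": 23, "end": 25},
    {"entity": "B-G", "score": 0.78, "word": "##hy", "start": 25, "end": 27},
    {"entity": "B-G", "score": 0.83, "word": "##dr", "start": 27, "end": 29},
    {"entity": "B-G", "score": 0.82, "word": "##yl", "start": 29, "end": 31},
]

spans = merge_wordpieces(sample)
print([(s["word"], s["entity"], s["start"], s["end"]) for s in spans])
# [('thiol', 'B-G', 10, 15), ('sulfhydryl', 'B-G', 21, 31)]
```

The merged span keeps the label of its first piece. If the pieces of one word disagree, a different rule (for example a majority vote) may be more appropriate.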


### Note on WordPiece tokenization and CoNLL

The BioSemantics subset used here follows the CoNLL 2003 format, with sentences tokenized
as streams of words, i.e. (mostly) around whitespace. BERT models, on the other hand, use WordPiece tokenization, i.e.
subword tokenization (based on commonly observed morpheme-like subword units).

This mismatch in tokenization
methodology means that some words and entities may get broken into smaller units. In such cases, the expected
behavior is that the label of the original word gets propagated onto its units.

Thus, a CoNLL phrase like
```text
ether radical wherein the term perfluoroalkyl
I-G   O       O       O   O    O
```
gets transformed into:
```text
et    _her    radical wherein the term per  _f  _lu  _oro _alk _yl
I-G   I-G     O       O       O   O    O    O   O    O    O    O
```
with `ether` broken into `et` and `_her` (here `_` marks a continuation piece, which the tokenizer itself writes as `##`), with the expectation that its original label `I-G`
propagates to its constituent units.
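The propagation rule described above can be sketched in a few lines. This is a minimal illustration, not the code used to fine-tune the model. The function name `propagate_labels` is hypothetical, and the `word_ids` list is written by hand to mirror the subword-to-word mapping that fast Hugging Face tokenizers expose through `word_ids()` on their encoding output (with `None` for special tokens).

```python
def propagate_labels(word_labels, word_ids):
    """Copy each word's label onto every subword unit derived from it.

    word_ids[i] is the index of the source word for subword i, or None
    for special tokens such as [CLS] and [SEP] (left unlabeled here).
    """
    return [None if w is None else word_labels[w] for w in word_ids]


# Word-level (CoNLL-style) input from the example above
words = ["ether", "radical", "wherein", "the", "term", "perfluoroalkyl"]
word_labels = ["I-G", "O", "O", "O", "O", "O"]

# Hand-written subword split and mapping (assumed for illustration)
subwords = ["et", "_her", "radical", "wherein", "the", "term",
            "per", "_f", "_lu", "_oro", "_alk", "_yl"]
word_ids = [0, 0, 1, 2, 3, 4, 5, 5, 5, 5, 5, 5]

subword_labels = propagate_labels(word_labels, word_ids)
for token, label in zip(subwords, subword_labels):
    print(f"{token:10s}{label}")
```

Both pieces of `ether` receive `I-G`, and all six pieces of `perfluoroalkyl` receive `O`. Some training setups instead label only the first piece of each word and mask the rest, which is a different design choice from the one described here.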

In [374]:
data_checkpoint = "camilothorne/biosemantics_uspto"

In [353]:
def print_example(datapoint, id2label):
    '''
    Pretty print sentence.
    '''
    words = datapoint["text"]
    labels = datapoint["labels"]
    line1 = ""
    line2 = ""
    for word, label in zip(words, labels):
        full_label = id2label[label]
        max_length = max(len(word), len(full_label))
        line1 += word + " " * (max_length - len(word) + 1)
        line2 += full_label + " " * (max_length - len(full_label) + 1)
    print(line1)
    print(line2)

In [371]:
# Test data
test_data = load_dataset(data_checkpoint, field='data', split='test')
# Test data labels
labs_test  = load_dataset(data_checkpoint, field='maps', split='test')

In [372]:
# Get labels
label_names    = labs_test['tag']
id2label       = {i: label for i, label in enumerate(label_names)}
label2id       = {v: k for k, v in id2label.items()}

In [363]:
# In CoNLL, tokenization is done through whitespaces
print_example(test_data[100], id2label)

The term “ perfluoroalkoxy ” alone or in combination , means a perfluoroalkyl ether radical wherein the term perfluoroalkyl is as defined above 
O   O    O B-G             O O     O  O  O           O O     O B-G            I-G   O       O       O   O    O              O  O  O       O     


In [364]:
# Entities detected
example2 = ' '.join(test_data[100]['text'])
out2 = token_classifier(example2)

In [359]:
print(out2)

[{'entity': 'B-G', 'score': 0.8485372, 'index': 19, 'word': 'per', 'start': 63, 'end': 66}, {'entity': 'B-G', 'score': 0.88328093, 'index': 20, 'word': '##f', 'start': 66, 'end': 67}, {'entity': 'B-G', 'score': 0.878745, 'index': 21, 'word': '##lu', 'start': 67, 'end': 69}, {'entity': 'B-G', 'score': 0.88304514, 'index': 22, 'word': '##oro', 'start': 69, 'end': 72}, {'entity': 'B-G', 'score': 0.8520078, 'index': 23, 'word': '##alk', 'start': 72, 'end': 75}, {'entity': 'B-G', 'score': 0.8779946, 'index': 24, 'word': '##yl', 'start': 75, 'end': 77}, {'entity': 'B-M', 'score': 0.5666113, 'index': 25, 'word': 'et', 'start': 78, 'end': 80}, {'entity': 'B-M', 'score': 0.5932003, 'index': 26, 'word': '##her', 'start': 80, 'end': 83}, {'entity': 'B-G', 'score': 0.63901865, 'index': 32, 'word': '##f', 'start': 112, 'end': 113}, {'entity': 'B-G', 'score': 0.70366, 'index': 33, 'word': '##lu', 'start': 113, 'end': 115}, {'entity': 'B-G', 'score': 0.6144724, 'index': 34, 'word': '##oro', 'start': 

In [360]:
# IOB tags overlaid on model tokens
tokens = tokenizer.tokenize(example2)
inputs = tokenizer.encode(example2, return_tensors="pt").to(device)
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(inputs)[0]
predictions = torch.argmax(outputs, dim=2)
# Drop the [CLS] and [SEP] positions so the labels align with `tokens`
preds = [id2label[p] for p in predictions[0].tolist()[1:-1]]
print(list(zip(tokens, preds)))

[('The', 'O'), ('term', 'O'), ('“', 'O'), ('per', 'O'), ('##f', 'O'), ('##lu', 'O'), ('##oro', 'O'), ('##alk', 'O'), ('##ox', 'O'), ('##y', 'O'), ('”', 'O'), ('alone', 'O'), ('or', 'O'), ('in', 'O'), ('combination', 'O'), (',', 'O'), ('means', 'O'), ('a', 'O'), ('per', 'B-G'), ('##f', 'B-G'), ('##lu', 'B-G'), ('##oro', 'B-G'), ('##alk', 'B-G'), ('##yl', 'B-G'), ('et', 'B-M'), ('##her', 'B-M'), ('radical', 'O'), ('wherein', 'O'), ('the', 'O'), ('term', 'O'), ('per', 'O'), ('##f', 'B-G'), ('##lu', 'B-G'), ('##oro', 'B-G'), ('##alk', 'B-G'), ('##yl', 'B-G'), ('is', 'O'), ('as', 'O'), ('defined', 'O'), ('above', 'O')]
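To compare these token-level predictions with the word-level CoNLL labels, the subword pieces have to be collapsed back into words. The sketch below uses a "first piece wins" rule, which is only one of several reasonable conventions and is an assumption here, not necessarily the one used to evaluate the model. The helper name `first_subword_labels` is hypothetical, and the input pairs are copied from part of the output above.

```python
def first_subword_labels(token_label_pairs):
    """Rebuild words from WordPiece tokens, keeping the first piece's label."""
    words, labels = [], []
    for token, label in token_label_pairs:
        if token.startswith("##") and words:
            # Continuation piece: extend the current word, ignore its label.
            words[-1] += token[2:]
        else:
            words.append(token)
            labels.append(label)
    return words, labels


# Excerpt of the (token, label) pairs printed above
pairs = [
    ("a", "O"),
    ("per", "B-G"), ("##f", "B-G"), ("##lu", "B-G"),
    ("##oro", "B-G"), ("##alk", "B-G"), ("##yl", "B-G"),
    ("et", "B-M"), ("##her", "B-M"),
    ("radical", "O"), ("wherein", "O"), ("the", "O"), ("term", "O"),
    ("per", "O"), ("##f", "B-G"), ("##lu", "B-G"),
    ("##oro", "B-G"), ("##alk", "B-G"), ("##yl", "B-G"),
    ("is", "O"),
]

words, labels = first_subword_labels(pairs)
print(list(zip(words, labels)))
```

Under this rule the second `perfluoroalkyl` is labeled `O`, because its first piece `per` was predicted as `O` even though the remaining pieces were predicted as `B-G`. A majority vote over the pieces would give `B-G` instead, so the choice of rule can change the word-level result. Note also that the tokenizer may split punctuation into separate tokens without a `##` prefix, so words rebuilt this way will not always match the whitespace tokens of the CoNLL file.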


### Create interface/GUI

We can use Gradio's support for Hugging Face pipelines to easily spin up a GUI for our NER model.

In [365]:
gr.Interface.from_pipeline(token_classifier).launch(share=False)

* Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.


