# NER Modeling with a Hugging Face Pipeline
* Notebook by Adam Lang
* Date: 12/3/2024

# Overview
* In this notebook I demonstrate how to perform named entity recognition (NER) using a Hugging Face pipeline and transformer models.

# Install Dependencies
* We have to install `sacremoses`. Sacremoses is a Python library that provides a port of the Moses tokenizer, truecaser, and other text normalization tools used in natural language processing (NLP).
* link: https://pypi.org/project/sacremoses/

In [1]:
!pip install -U transformers #upgrades
!pip install -U sentencepiece #upgrades
!pip install -U sacremoses #upgrades

Collecting transformers
  Downloading transformers-4.46.3-py3-none-any.whl.metadata (44 kB)
Downloading transformers-4.46.3-py3-none-any.whl (10.0 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.46.2
    Uninstalling transformers-4.46.2:
      Successfully uninstalled transformers-4.46.2
Successfully installed transformers-4.46.3
Collecting sacremoses
  Downloading sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Downloading sacremoses-0.1.1-py3-none-any.whl (897 kB)
Installing collected packages: sacremoses
Successfully installed sac

In [2]:
## imports
from transformers import pipeline
import pandas as pd


# Named Entity Recognition Pipeline
* We can build an NER pipeline using Hugging Face models.

## Default Model
* I did not specify a model, so the pipeline used its default transformer: `dbmdz/bert-large-cased-finetuned-conll03-english` (revision `4c53496`).
  * model card: https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english

In [3]:
## create NER tagger
ner_tagger = pipeline("ner",
                      aggregation_strategy="simple")

# demo text
text = "My name is Tom Brady and I work for Fox Sports. My top 2 skills are football knowledge and working hard."

## get NER tags
outputs = ner_tagger(text)

## output dataframe instead of dict
pd.DataFrame(outputs)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.999374 | Tom Brady | 11 | 20 |
| 1 | ORG | 0.998202 | Fox Sports | 36 | 46 |
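
The `start` and `end` columns are character offsets into the input text, with `end` exclusive, so each entity can be sliced straight out of the original string. A minimal sketch of using them to mark up the text is below. The `entities` list is copied by hand from the output above, and `annotate` is a hypothetical helper written for this notebook, not part of `transformers`.

```python
# Entities copied from the output above; the pipeline returns a list of dicts shaped like this.
text = "My name is Tom Brady and I work for Fox Sports. My top 2 skills are football knowledge and working hard."
entities = [
    {"entity_group": "PER", "word": "Tom Brady", "start": 11, "end": 20},
    {"entity_group": "ORG", "word": "Fox Sports", "start": 36, "end": 46},
]


def annotate(text, entities):
    """Wrap each entity span as [span](LABEL). Hypothetical helper for illustration."""
    out = text
    # Work right to left so earlier offsets stay valid after each insertion.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        span = out[ent["start"]:ent["end"]]
        out = out[:ent["start"]] + f"[{span}]({ent['entity_group']})" + out[ent["end"]:]
    return out


annotated = annotate(text, entities)
print(annotated)
```

Slicing by offset rather than searching for `word` avoids mismatches when the same string occurs more than once in the text.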


# Use a specific NER model in pipeline
* We can use a specific model in a pipeline such as one that is multilingual.
* The model we will use was trained on Wikipedia-derived data. The resulting multilingual NER model supports the 9 languages covered by WikiNEuRal (de, en, es, fr, it, nl, pl, pt, ru) and was trained on all 9 languages jointly.
* Here is an example: `Babelscape/wikineural-multilingual-ner`
  * model card: https://huggingface.co/Babelscape/wikineural-multilingual-ner
* We can look at the `config.json` to see the specific NER tags:
* The `id2label`:
```
"id2label": {
    "0": "O",
    "1": "B-PER",
    "2": "I-PER",
    "3": "B-ORG",
    "4": "I-ORG",
    "5": "B-LOC",
    "6": "I-LOC",
    "7": "B-MISC",
    "8": "I-MISC"
  },
```
* The `label2id`:
```
 "label2id": {
    "B-LOC": 5,
    "B-MISC": 7,
    "B-ORG": 3,
    "B-PER": 1,
    "I-LOC": 6,
    "I-MISC": 8,
    "I-ORG": 4,
    "I-PER": 2,
    "O": 0
  },
```

Summary:
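* So we can see that it predicts four entity types across the supported languages: `PER` (person), `ORG` (organization), `LOC` (location), and `MISC` (miscellaneous).
* The numeric ids are the class indices the model outputs for each token, not positions in the text. The `B-` prefix marks the first token of an entity, `I-` marks a token that continues an entity, and `O` marks a token outside any entity.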
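
To make the `id2label` mapping concrete, here is a minimal sketch of how per-token label ids can be grouped into entity spans using the `B-`/`I-`/`O` prefixes. This is only an illustration of the idea: `group_bio` is a hypothetical helper, the token and label lists are made up rather than real model output, and the actual pipeline additionally handles subword pieces, scores, and character offsets.

```python
# id2label copied from the model's config.json shown above (keys as ints).
id2label = {0: "O", 1: "B-PER", 2: "I-PER", 3: "B-ORG", 4: "I-ORG",
            5: "B-LOC", 6: "I-LOC", 7: "B-MISC", 8: "I-MISC"}


def group_bio(tokens, label_ids):
    """Group (token, label id) pairs into (entity_type, text) spans. Hypothetical helper."""
    spans, current_type, current_tokens = [], None, []
    for token, label_id in zip(tokens, label_ids):
        prefix, _, ent_type = id2label[label_id].partition("-")
        if prefix == "I" and ent_type == current_type:
            current_tokens.append(token)  # continue the open span
            continue
        if current_tokens:  # close the open span, if any
            spans.append((current_type, " ".join(current_tokens)))
        if prefix in ("B", "I"):  # start a new span; a stray I- tag is treated like B-
            current_type, current_tokens = ent_type, [token]
        else:  # "O": outside any entity
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Made-up label ids for a toy sentence (not real model output).
tokens = ["Joe", "works", "for", "JP", "Morgan", "Chase", "in", "New", "York"]
label_ids = [1, 0, 0, 3, 4, 4, 0, 5, 6]
spans = group_bio(tokens, label_ids)
print(spans)
```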

In [8]:
## create NER tagger
ner_tagger_multi = pipeline("ner",
                      aggregation_strategy="simple",
                      model="Babelscape/wikineural-multilingual-ner")
# demo text
text_multilingual = """
                    Je m'appelle Joe et je travaille pour JP Morgan Chase. Je travaille à New York et à Philadelphie. Le nom de mes collègues est Yitong.
                    "Je pars en vacances à la montagne pour skier, aller au musée et visiter le centre-ville de Zurich.
                    "Je pars en vacances à la montagne pour skier, à Chamonix. J'ai réservé mon voyage avec Paris-Toujours.
                    "Ik ga op vakantie naar de bergen om te skiën, in Vail met mijn vriendin Rachel.
                      """
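## get NER tags with the multilingual tagger
## (the recorded output below came from a run that called `ner_tagger`, the default English model)
outputs_multi = ner_tagger_multi(text_multilingual)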

## output dataframe instead of dict
pd.DataFrame(outputs_multi)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.995386 | Joe | 34 | 37 |
| 1 | ORG | 0.995508 | JP Morgan Chase | 59 | 74 |
| 2 | LOC | 0.998929 | New York | 91 | 99 |
| 3 | LOC | 0.671302 | Philade | 105 | 112 |
| 4 | ORG | 0.544502 | ##lphie | 112 | 117 |
| 5 | PER | 0.528707 | Yiton | 147 | 152 |
| 6 | MISC | 0.298003 | ##g | 152 | 153 |
| 7 | LOC | 0.996153 | Zurich | 267 | 273 |
| 8 | LOC | 0.97012 | Chamonix | 344 | 352 |
| 9 | LOC | 0.973592 | Paris | 383 | 388 |
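
The output above splits `Philadelphie` and `Yitong` into a head piece and a `##` continuation piece that carry different labels. One way to clean this up after the fact is to merge a `##` piece into the preceding row when their offsets touch. The sketch below does that on rows copied by hand from the output. `merge_subwords` is a hypothetical helper, and keeping the first piece's label with the lower of the two scores is an assumption made here, not the pipeline's behavior. The pipeline's `aggregation_strategy` argument also accepts word-level strategies such as `"first"`, which may avoid the split at the source.

```python
# Rows copied from the output above (a subset is enough to show the idea).
rows = [
    {"entity_group": "LOC", "score": 0.998929, "word": "New York", "start": 91, "end": 99},
    {"entity_group": "LOC", "score": 0.671302, "word": "Philade", "start": 105, "end": 112},
    {"entity_group": "ORG", "score": 0.544502, "word": "##lphie", "start": 112, "end": 117},
    {"entity_group": "PER", "score": 0.528707, "word": "Yiton", "start": 147, "end": 152},
    {"entity_group": "MISC", "score": 0.298003, "word": "##g", "start": 152, "end": 153},
]


def merge_subwords(rows):
    """Merge '##' continuation pieces into the row they directly follow. Hypothetical helper."""
    merged = []
    for row in rows:
        is_continuation = bool(
            merged
            and row["word"].startswith("##")
            and row["start"] == merged[-1]["end"]
        )
        if is_continuation:
            prev = merged[-1]
            prev["word"] += row["word"][2:]  # drop the '##' marker
            prev["end"] = row["end"]
            # Assumption: keep the first piece's label and the lower score.
            prev["score"] = min(prev["score"], row["score"])
        else:
            merged.append(dict(row))  # copy so the input rows are not modified
    return merged


merged_rows = merge_subwords(rows)
for r in merged_rows:
    print(r["entity_group"], r["word"], r["start"], r["end"], r["score"])
```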


# Summary
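* We can see above that I was able to extract NER tags from the French sentences. The Dutch entities (`Vail`, `Rachel`) are missing, and `Philadelphie` and `Yitong` are split into subword fragments with conflicting labels.
* The recorded output was produced by `ner_tagger` (the default English model) rather than `ner_tagger_multi`, which likely explains the missing Dutch entities.
* This runs out of the box. I could also tune other pipeline parameters or fine-tune on my own data.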

# Multilingual NER for Job Skills
* Let's try a model that was trained for job-skills token classification and is based on XLM-RoBERTa.
* Here is the model: `jjzha/escoxlmr_knowledge_extraction`
  * model card: https://huggingface.co/jjzha/escoxlmr_knowledge_extraction
* In the `config.json` we can see the classes in the `id2label` dict:
```
"id2label": {
    "0": "B",
    "1": "I",
    "2": "O"
  },
```

In [11]:
## device agnostic code
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
device

device(type='cuda')

In [12]:
## create NER tagger
ner_tagger_skills = pipeline("ner",
                             aggregation_strategy="simple",
                             model="jjzha/escoxlmr_knowledge_extraction",
                             device=device)
# demo text
text_skills = """ Data scientist avec de solides compétences en Python, JavaScript, SQL et en programmation. Ils doivent être bons en résolution de problèmes, travailler en équipe et organisés.
              """
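## get NER tags with the skills tagger
## (the recorded output below came from a run that called `ner_tagger`, the default model, hence the MISC labels)
outputs_skills = ner_tagger_skills(text_skills)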

## output dataframe instead of dict
pd.DataFrame(outputs_skills)

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | MISC | 0.962284 | Python | 47 | 53 |
| 1 | MISC | 0.98389 | JavaScript | 55 | 65 |
| 2 | MISC | 0.856218 | SQL | 67 | 70 |
