### ***Named Entity Recognition with BERT***

**Introduction**

This presentation walks through a code example that fine-tunes BERT for Named Entity Recognition (NER). NER is a core task in Natural Language Processing (NLP): it identifies spans of text that name entities and classifies each one by type, such as person, location, or organization.
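To make the task concrete, here is a minimal sketch of how token-level tags turn into entities. It assumes the common BIO scheme (`B-` starts an entity, `I-` continues it, `O` is outside any entity); the helper name `bio_to_spans` and the sample tags are illustrative and not part of the notebook.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans, current_tokens, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        )
        if starts_new:
            # A new entity begins (a stray I- tag is treated as a start).
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" closes any open entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


# Hand-written example tags (hypothetical, not model output)
tokens = ["Albert", "Einstein", "was", "born", "in", "Ulm", ",", "Germany", "."]
tags = ["B-PER", "I-PER", "O", "O", "O", "B-LOC", "O", "B-LOC", "O"]
spans = bio_to_spans(tokens, tags)
print(spans)
```

The model trained in this notebook predicts one such tag per token; grouping the tags into spans is a post-processing step.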

##Setup and Data Preparation


###**1. Installation of necessary libraries:**



In [1]:
!pip install seqeval
!pip install transformers
!pip install datasets
!pip install torch
!pip install opendatasets
!pip install evaluate
!pip  install openai
!pip install wandb


###**2. Importing necessary libraries**

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification, Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np
from seqeval.metrics import f1_score, precision_score, recall_score, classification_report
from evaluate import load
from datasets import Dataset, DatasetDict
import opendatasets as od
import pandas
import wandb
from wandb.integration.keras import WandbMetricsLogger
from wandb.integration.keras import WandbCallback
from transformers import TrainerCallback, TrainerControl, TrainerState



**The two cells above install and import the libraries used in this notebook:**
*  seqeval: for evaluating sequence labeling models.
*  transformers: for using pre-trained transformer models.
*  datasets: for loading and processing datasets.
*  torch: for using PyTorch, a deep learning framework.
*  opendatasets: for downloading datasets from online sources.
*  evaluate: for evaluating model performance.
*  openai: for the optional OpenAI API example at the end.
*  wandb: for logging training runs to Weights & Biases.

###**3. Mounting Google Drive**

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


###**4. Checking CUDA availability**

Checks if CUDA is available for GPU acceleration and prints information about the available CUDA devices.

In [4]:
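print("CUDA available:", torch.cuda.is_available())
# Query device details only when a GPU is present; these calls raise an error on CPU-only runtimes.
if torch.cuda.is_available():
    print("Current device index:", torch.cuda.current_device())
    print("Device name:", torch.cuda.get_device_name(torch.cuda.current_device()))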


CUDA available: True
Current device index: 0
Device name: Tesla T4


###**5. Data Loading and Preprocessing**

*   Defines functions to read data from CONLL format files and convert it to the Hugging Face Dataset format.
*   Loads train, validation, and test data from files in your Google Drive.
*   Creates a DatasetDict object containing the three datasets.


In [5]:
def read_conll_file(file_path):
    with open(file_path, "r") as f:
        content = f.read().strip()
        sentences = content.split("\n\n")
        data = []
        for sentence in sentences:
            tokens = sentence.split("\n")
            token_data = []
            for token in tokens:
                token_data.append(token.split())
            data.append(token_data)
    return data


train_data = read_conll_file("/content/drive/MyDrive/NLP/train.txt")
validation_data = read_conll_file("/content/drive/MyDrive/NLP/dev.txt")
test_data = read_conll_file("/content/drive/MyDrive/NLP/test.txt")


def convert_to_dataset(data, label_map):
    formatted_data = {"tokens": [], "ner_tags": []}
    for sentence in data:
        tokens = [token_data[0] for token_data in sentence]
        ner_tags = [label_map[token_data[1]] for token_data in sentence]
        formatted_data["tokens"].append(tokens)
        formatted_data["ner_tags"].append(ner_tags)
    return Dataset.from_dict(formatted_data)


label_list = sorted(list(set([token_data[1] for sentence in train_data for token_data in sentence])))
label_map = {label: i for i, label in enumerate(label_list)}


train_dataset = convert_to_dataset(train_data, label_map)
validation_dataset = convert_to_dataset(validation_data, label_map)
test_dataset = convert_to_dataset(test_data, label_map)


datasets = DatasetDict({
    "tarin": train_dataset,  # NOTE: misspelled key; it is renamed to "train" after tokenization in step 8
    "validation": validation_dataset,
    "test": test_dataset,
})
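The file layout that `read_conll_file` expects can be illustrated on an in-memory string. The sketch below assumes two whitespace-separated columns per line (token, then tag) and a blank line between sentences; the sample text and the helper name `parse_conll` are hypothetical, and the real files may carry more columns.

```python
# Hypothetical two-column CoNLL sample: "token tag", blank line between sentences
sample = """Albert B-PER
Einstein I-PER
was O
born O

Ulm B-LOC
, O
Germany B-LOC
"""


def parse_conll(text):
    """Split CoNLL-style text into sentences of [token, tag] rows."""
    sentences = []
    for block in text.strip().split("\n\n"):
        rows = [line.split() for line in block.split("\n") if line.strip()]
        sentences.append(rows)
    return sentences


parsed = parse_conll(sample)

# Build the label vocabulary the same way the notebook does: sorted unique tags
labels = sorted({row[1] for sentence in parsed for row in sentence})
label_to_id = {label: i for i, label in enumerate(labels)}

# Encode the first sentence as integer tag ids
first_sentence_ids = [label_to_id[row[1]] for row in parsed[0]]
print(labels)
print(first_sentence_ids)
```

Because the labels are sorted alphabetically, the integer ids depend only on the label names, which matters later when ids are mapped back to names.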




##Model and Tokenizer

###**6. Model and Tokenizer Initialization**
*  Specifies the pre-trained model name (bert-base-cased).
*  Loads the tokenizer and model using the transformers library.

In [6]:
model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(label_list))

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


###**7. Defining Evaluation Metrics and Tokenization Function**
*  Defines a function compute_metrics to calculate evaluation metrics.
*  Defines a function tokenize_and_align_labels to tokenize the input text and align the labels with the corresponding tokens.

In [7]:
def compute_metrics(eval_prediction):
    predictions, labels = eval_prediction
    predictions = np.argmax(predictions, axis=2)


    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]


    return {
        "precision": precision_score(true_labels, true_predictions),
        "recall": recall_score(true_labels, true_predictions),
        "f1": f1_score(true_labels, true_predictions),
        "classification_report": classification_report(true_labels, true_predictions),
    }
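The filtering step inside `compute_metrics` can be seen in isolation with plain lists. This is a minimal sketch under assumed inputs: the label names, the id values, and the helper name `strip_ignored` are made up for illustration, and it skips the `argmax` over logits that the real function performs first.

```python
IGNORE_INDEX = -100  # label id assigned to special tokens and non-first word pieces

# Hypothetical label vocabulary for this example only
label_names = ["B-LOC", "B-PER", "I-PER", "O"]


def strip_ignored(pred_ids, gold_ids, names):
    """Drop positions whose gold label is IGNORE_INDEX and map ids to names."""
    preds, golds = [], []
    for pred_row, gold_row in zip(pred_ids, gold_ids):
        kept = [(p, g) for p, g in zip(pred_row, gold_row) if g != IGNORE_INDEX]
        preds.append([names[p] for p, _ in kept])
        golds.append([names[g] for _, g in kept])
    return preds, golds


# One sentence of six token positions; three of them are ignored
pred_ids = [[3, 1, 2, 2, 3, 3]]
gold_ids = [[-100, 1, 2, -100, 3, -100]]
preds, golds = strip_ignored(pred_ids, gold_ids, label_names)
print(preds)
print(golds)
```

Only the positions that carry a real gold label reach seqeval, so predictions on `[CLS]`, `[SEP]`, padding, and continuation word pieces do not affect the scores.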


In [8]:



def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True, padding=True
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            else:
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
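The alignment rule is easier to follow on a single hand-written example. The sketch below mirrors the loop above without calling a tokenizer: the `word_ids` list is written by hand to mimic what a fast tokenizer would return, and the label ids are hypothetical.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give each word's label to its first word piece; ignore everything else."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            # Special token, or a continuation piece of the previous word
            aligned.append(ignore_index)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


# Pieces: [CLS] Albert Einstein was born in U ##lm [SEP]
# (hand-written word ids; "Ulm" is split into two pieces)
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, None]
# Hypothetical label ids for the six words
word_labels = [2, 5, 6, 6, 6, 0]

aligned = align_labels(word_ids, word_labels)
print(aligned)
```

The second piece of "Ulm" receives `-100`, so the loss and the metrics count each word exactly once.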


###**8. Data Tokenization**

*  Applies the tokenize_and_align_labels function to the datasets to tokenize the input and align the labels.

In [9]:
tokenized_datasets = datasets.map(tokenize_and_align_labels, batched=True)
# The DatasetDict above was built with the misspelled key "tarin", and map() keeps that key.
# Rebuild the dict so the training split is available under "train".
tokenized_datasets = DatasetDict({
    "train": tokenized_datasets["tarin"],
    "validation": tokenized_datasets["validation"],
    "test": tokenized_datasets["test"],
})

Map output: 20,000 training, 10,000 validation, and 10,000 test examples processed.

##Training

### **9. Training Arguments**
*  Defines the training arguments using the TrainingArguments class from the transformers library.
*  These arguments control various aspects of the training process, such as the output directory, evaluation strategy, and batch size.

In [10]:

training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="steps",
    eval_steps=500,
    save_steps=500,
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_steps=100,
    learning_rate=5e-5,
    load_best_model_at_end=True,
    metric_for_best_model="f1",
)
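A quick calculation shows how these arguments interact. It is a sketch that assumes a single device, no gradient accumulation, and the 20,000 training examples reported elsewhere in this notebook.

```python
import math

# Values taken from the notebook; single GPU and no gradient accumulation assumed
num_train_examples = 20000
per_device_train_batch_size = 8
num_train_epochs = 1
eval_steps = 500

steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Steps at which evaluation (and checkpoint saving) is triggered
eval_points = list(range(eval_steps, total_steps + 1, eval_steps))

print(total_steps)
print(eval_points)
```

The result agrees with the `global_step=2500` reported by `trainer.train()`. Keeping `save_steps` equal to `eval_steps` means every evaluated step also has a checkpoint, which `load_best_model_at_end=True` relies on.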




In [11]:
print(train_dataset)

Dataset({
    features: ['tokens', 'ner_tags'],
    num_rows: 20000
})


###**10. Data Collator**
*  Defines a data collator function to prepare the data for input to the model during training.
*  This function handles padding of sequences to ensure consistent input lengths and creates batches of data.


In [12]:
def data_collator(data):
    input_ids = [torch.tensor(item["input_ids"]) for item in data]
    attention_mask = [torch.tensor(item["attention_mask"]) for item in data]
    labels = [torch.tensor(item["labels"]) for item in data]


    input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
    attention_mask = torch.nn.utils.rnn.pad_sequence(attention_mask, batch_first=True, padding_value=0)
    labels = torch.nn.utils.rnn.pad_sequence(labels, batch_first=True, padding_value=-100)


    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }
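What the collator does to a batch can be shown without tensors. The following is a plain-Python sketch of the same padding logic; the token ids are invented, and `PAD_TOKEN_ID = 0` is an assumption standing in for `tokenizer.pad_token_id`.

```python
PAD_TOKEN_ID = 0      # assumption: stands in for tokenizer.pad_token_id
IGNORE_INDEX = -100   # padded label positions are ignored by the loss


def pad_batch(sequences, padding_value):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [padding_value] * (max_len - len(seq)) for seq in sequences]


# Two hypothetical examples of different lengths
batch = [
    {"input_ids": [101, 7, 8, 102], "labels": [-100, 2, 6, -100]},
    {"input_ids": [101, 9, 102], "labels": [-100, 0, -100]},
]

padded_ids = pad_batch([item["input_ids"] for item in batch], PAD_TOKEN_ID)
padded_labels = pad_batch([item["labels"] for item in batch], IGNORE_INDEX)
# Attention mask: 1 for real tokens, 0 for padding
attention_mask = pad_batch([[1] * len(item["input_ids"]) for item in batch], 0)

print(padded_ids)
print(padded_labels)
print(attention_mask)
```

Padding labels with `-100` rather than a real class id is the important detail: it keeps padded positions out of both the loss and the metrics.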

###**11. Model Training**
*  Creates a Trainer object from the transformers library, which manages the training process.
*  Provides the model, training arguments, training dataset, and other necessary components to the Trainer.
*  Starts the training process using trainer.train().

In [13]:

wandb.init(project='huggingface')
trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],  # Key should be 'train'
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

In [14]:
trainer.train()



| Step | Training Loss | Validation Loss | Precision | Recall | F1 |
|---|---|---|---|---|---|
| 500 | 0.364 | 0.326994 | 0.715072 | 0.771031 | 0.741998 |
| 1000 | 0.2839 | 0.307809 | 0.79256 | 0.799731 | 0.796129 |
| 1500 | 0.2769 | 0.2574 | 0.800178 | 0.82716 | 0.813445 |
| 2000 | 0.2766 | 0.244846 | 0.805856 | 0.838612 | 0.821907 |
| 2500 | 0.2498 | 0.235048 | 0.817963 | 0.840803 | 0.829226 |

Classification report per step (precision / recall / F1):

| Step | LOC (support 4834) | ORG (support 4677) | PER (support 4635) |
|---|---|---|---|
| 500 | 0.70 / 0.85 / 0.77 | 0.62 / 0.59 / 0.60 | 0.82 / 0.88 / 0.85 |
| 1000 | 0.80 / 0.83 / 0.81 | 0.71 / 0.69 / 0.70 | 0.87 / 0.88 / 0.87 |
| 1500 | 0.84 / 0.83 / 0.84 | 0.69 / 0.75 / 0.72 | 0.88 / 0.90 / 0.89 |
| 2000 | 0.82 / 0.86 / 0.84 | 0.73 / 0.75 / 0.74 | 0.87 / 0.91 / 0.89 |
| 2500 | 0.84 / 0.87 / 0.85 | 0.74 / 0.74 / 0.74 | 0.88 / 0.91 / 0.89 |

Over all 14146 entities, the micro average in each report equals the overall Precision, Recall, and F1 rounded to two decimals. The macro and weighted averages are the same except at step 500, where both report a precision of 0.71.


At each evaluation, Trainer also warns that it is attempting to log the classification report string (type `<class 'str'>`) for key "eval/classification_report" as a scalar. This invocation of Tensorboard's writer.add_scalar() is incorrect, so the attribute is dropped from the log.

TrainOutput(global_step=2500, training_loss=0.31389095458984373, metrics={'train_runtime': 1755.914, 'train_samples_per_second': 11.39, 'train_steps_per_second': 1.424, 'total_flos': 3871098627919776.0, 'train_loss': 0.31389095458984373, 'epoch': 1.0})

##Prediction and Evaluation


###**12. Model Evaluation**
*  Uses the trained model to make predictions on the test dataset.
*  Stores the predictions in the predictions variable.

In [15]:
predictions = trainer.predict(tokenized_datasets["test"])
metrics = predictions.metrics  # Access the metrics dictionary

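precision = metrics["test_precision"]
recall = metrics["test_recall"]
# Named `f1` so it does not overwrite the `f1_score` function imported from seqeval,
# which compute_metrics still needs for any later evaluation.
f1 = metrics["test_f1"]

In [17]:
print(precision, recall, f1)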

0.8215903547285525 0.8446052443043416 0.8329388490479387
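As a sanity check, the reported F1 can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The values below are copied from the output above; the helper name is illustrative.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Test-set values printed by the notebook
test_precision = 0.8215903547285525
test_recall = 0.8446052443043416
reported_f1 = 0.8329388490479387

recomputed_f1 = f1_from_precision_recall(test_precision, test_recall)
print(round(recomputed_f1, 6))
```

The recomputed value matches the reported one to well within rounding, which confirms that the three numbers describe the same evaluation run.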


###**13. Prediction on a Single Sentence**
*  Demonstrates how to use the trained model to make predictions on a single sentence.
*  Tokenizes the sentence, gets predictions from the model, and prints the predicted labels for each token.

In [18]:
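sentence = "Albert Einstein was born in Ulm, Germany."


tokenized_input = tokenizer(sentence, return_tensors="pt").to(model.device)


# Inference only: switch off dropout and gradient tracking
model.eval()
with torch.no_grad():
    outputs = model(**tokenized_input)


predicted_ids = outputs.logits.argmax(-1)[0]


# Reuse the id -> label mapping from training. label_list is sorted alphabetically,
# so a hand-written map in another order (e.g. 0: 'O', 1: 'B-PER', ...) mislabels every token.
id_to_label = dict(enumerate(label_list))

# Decode tokens and labels
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"][0])
predicted_labels = [id_to_label[label_id.item()] for label_id in predicted_ids]

# Print tokens with labels
for token, label in zip(tokens, predicted_labels):
    print(f"{token:15} -> {label}")


The recorded run decoded the ids with a hand-written map (0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-LOC', 4: 'I-LOC', 5: 'B-ORG', 6: 'I-ORG') whose order does not match the sorted label_list, so it printed scrambled tags (for example `[CLS] -> I-ORG` and `Germany -> O`). Assuming label_list is ['B-LOC', 'B-ORG', 'B-PER', 'I-LOC', 'I-ORG', 'I-PER', 'O'], the same predicted ids decode to:

[CLS]           -> O
Albert          -> B-PER
Einstein        -> I-PER
was             -> O
born            -> O
in              -> O
U               -> B-LOC
##lm            -> I-LOC
,               -> O
Germany         -> B-LOC
.               -> O
[SEP]           -> O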
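The printed output is per word piece, so "Ulm" appears as `U` and `##lm`. A small post-processing step can merge the pieces back into words. This is a sketch, assuming BERT's `##` continuation prefix and keeping the tag of the first piece (the only piece that carries a label during training); the tags below are illustrative, not recorded model output.

```python
def merge_word_pieces(tokens, tags, special_tokens=("[CLS]", "[SEP]", "[PAD]")):
    """Merge '##' continuation pieces into words, keeping the first piece's tag."""
    words = []
    for token, tag in zip(tokens, tags):
        if token in special_tokens:
            continue  # special tokens are not part of the sentence
        if token.startswith("##") and words:
            word, first_tag = words[-1]
            words[-1] = (word + token[2:], first_tag)
        else:
            words.append((token, tag))
    return words


tokens = ["[CLS]", "Albert", "Einstein", "was", "born", "in",
          "U", "##lm", ",", "Germany", ".", "[SEP]"]
# Hypothetical per-piece tags for illustration
tags = ["O", "B-PER", "I-PER", "O", "O", "O",
        "B-LOC", "I-LOC", "O", "B-LOC", "O", "O"]

words = merge_word_pieces(tokens, tags)
for word, tag in words:
    print(f"{word:15} -> {tag}")
```

Taking the first piece's tag matches how the labels were aligned for training; other strategies, such as a majority vote over pieces, are possible but are not what this notebook trains for.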


##**Using OpenAI API for Named Entity Recognition**
*  Shows an example of using the OpenAI API for named entity recognition.
*  Sets the OpenAI API key, defines a function to send prompts and get responses, and demonstrates how to use it.

In [19]:
import openai
import os

# Read the key from the environment (or Colab secrets); never hard-code an API key in a notebook.
openai.api_key = os.environ["OPENAI_API_KEY"]


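# Uses the pre-1.0 `openai.ChatCompletion` interface, so this cell requires openai==0.28
# (see the install cell further below).
def get_completion(prompt, model="gpt-3.5-turbo"):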
    messages = [{"role": "user", "content": prompt}]
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=0, # this is the degree of randomness of the model's output
    )
    return response.choices[0].message["content"]
text = f"""
Albert Einstein was born in Ulm, Germany.
"""
prompt = f"""
Extract the named entities from the text below. For each entity, specify its type \
(e.g., PERSON, LOCATION, ORGANIZATION, etc.). Return the result in a structured format:
```{text}```
"""
response = get_completion(prompt)
print(response)

{
  "entities": [
    {
      "text": "Albert Einstein",
      "type": "PERSON"
    },
    {
      "text": "Ulm",
      "type": "LOCATION"
    },
    {
      "text": "Germany",
      "type": "LOCATION"
    }
  ]
}
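Because the prompt only asks for "a structured format", the reply is not guaranteed to be valid JSON. A defensive parser is therefore useful before the entities are consumed. The sketch below assumes the reply has the shape shown above; `parse_entities` is a hypothetical helper, and the sample string is a shortened copy of that reply.

```python
import json

# Shortened copy of the reply shown above
raw_response = """
{
  "entities": [
    {"text": "Albert Einstein", "type": "PERSON"},
    {"text": "Ulm", "type": "LOCATION"},
    {"text": "Germany", "type": "LOCATION"}
  ]
}
"""


def parse_entities(raw):
    """Return (text, type) pairs, or an empty list if the reply is not usable JSON."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict):
        return []
    return [
        (entity["text"], entity["type"])
        for entity in payload.get("entities", [])
        if isinstance(entity, dict) and "text" in entity and "type" in entity
    ]


entities = parse_entities(raw_response)
print(entities)
```

Returning an empty list on malformed replies keeps downstream code simple; a stricter pipeline might raise an error or retry the request instead.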


In [7]:
pip install openai==0.28

Collecting openai==0.28
  Downloading openai-0.28.0-py3-none-any.whl.metadata (13 kB)
Downloading openai-0.28.0-py3-none-any.whl (76 kB)
Installing collected packages: openai
  Attempting uninstall: openai
    Found existing installation: openai 1.54.4
    Uninstalling openai-1.54.4:
      Successfully uninstalled openai-1.54.4
Successfully installed openai-0.28.0


#**Conclusion**

This presentation walked through NER with BERT in Google Colab: preparing CoNLL-format data, fine-tuning `bert-base-cased`, evaluating on the test set (F1 of about 0.83 after one epoch), and predicting tags for a single sentence. The optional OpenAI example shows a prompt-based alternative for comparison.