<a href="https://colab.research.google.com/github/bandiajay/Generative-AI/blob/main/02_Named_Entity_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> Named Entity Recognition </center>

<b> Objective: </b> <p> The purpose of this worksheet is to familiarize participants with the concept of identifying the entities involved in a statement through application of Named Entity Recognition (NER). Through hands-on exercises, learners will develop the ability to identify and categorize entities such as names, locations, and organizations within unstructured text. This worksheet aims to enhance participants' skills in extracting meaningful information from textual data, thereby improving their capabilities in data analysis and natural language processing applications. </p>

**Introduction**: Named Entity Recognition (NER) is an NLP task that involves identifying named entities in text and classifying them into predefined categories such as persons, organizations, locations, and dates. NER is widely used in applications such as information extraction, search engines, chatbots and virtual assistants, sentiment analysis, and opinion mining.

<img src = "https://miro.medium.com/v2/resize:fit:2000/format:webp/1*JNHlyK5-jQA6JBKj3nDYcA.png">

<b> Requirements: </b>
<ol>
<li> <i> Transformers </i> - A versatile library from Hugging Face providing state-of-the-art pre-trained models for natural language processing tasks </li>
<li> <i> TensorFlow </i> - A comprehensive and flexible machine learning library developed by Google, used for designing, training, and deploying complex deep learning models </li>
</ol>

<b> Steps: </b>
<ol>
    <li> Install <code> transformers</code>, <code>tensorflow</code> packages.</li>
     <li> Write source code
        <p> 2.1 Import <code>transformers</code>, <code>tensorflow</code> modules <br>
            2.2 Load Named Entity Recogniser into pipeline. <br>
            2.3 Enter an input sentence. <br>
            2.4 Recognise the entities. <br>
            2.5 Print the entities with their labels. <br>
        </p>
     </li>
</ol>

<h3> Step 1: Install <code> transformers</code>, <code>tensorflow</code> packages </h3>

**Note:** If the command below fails, run
`python -m pip install transformers`

In [None]:
pip install transformers



**Note:** If the command below fails, run
`python -m pip install tensorflow`

In [None]:
pip install tensorflow



<h3> Step 2: Write source code</h3>

<h4> Step 2.1 : Import the <code>transformers</code>, <code>tensorflow</code> modules required to recognise the labels for entities </h4>

In [None]:
from transformers import pipeline

<h3> Step 2.2 : Load Named Entity Recogniser into pipeline </h3>


Here:

1. **`ner`** is the task name; it specifies that the pipeline performs Named Entity Recognition.
2. **`framework`** specifies the deep learning framework to use: `tf` for TensorFlow or `pt` for PyTorch.
3. **`model`** specifies the pre-trained deep learning model to load into the pipeline. The identifier used here has three parts:
    * **`dbmdz`** is the organisation that trained the model.
    * **`bert-large-cased`** indicates that the model is based on the BERT-large architecture and is case-sensitive, so it distinguishes "London" from "london".
    * **`finetuned-conll03-english`** indicates that the model has been fine-tuned on the English CoNLL-2003 NER dataset.

**Note** : You can find more models at [Hugging Face](https://huggingface.co/)
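
The CoNLL-2003 fine-tuning also determines the labels that appear in this worksheet's output, such as `I-PER` or `I-LOC`. The following is a minimal sketch of decoding such a label, assuming the usual CoNLL-2003 type codes (PER, ORG, LOC, MISC). The `ENTITY_TYPES` table and the `describe_label` helper are illustrative names and are not part of `transformers`.

```python
# Human-readable names for the CoNLL-2003 entity type codes (illustrative table).
ENTITY_TYPES = {
    "PER": "person",
    "ORG": "organization",
    "LOC": "location",
    "MISC": "miscellaneous",
}


def describe_label(label):
    """Turn a tag such as 'I-PER' into a readable entity type.

    Hypothetical helper: the prefix before the hyphen (B or I) marks the
    token's position inside an entity, the suffix is the entity type.
    """
    if label == "O":  # "O" marks a token that is outside any entity
        return "not an entity"
    _, _, type_code = label.partition("-")
    return ENTITY_TYPES.get(type_code, "unknown")


print(describe_label("I-PER"))  # person
print(describe_label("I-LOC"))  # location
```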

In [None]:
# Load the NER pipeline
ner_classifier = pipeline("ner", framework="tf", model="dbmdz/bert-large-cased-finetuned-conll03-english")


All PyTorch model weights were used when initializing TFBertForTokenClassification.

All the weights of TFBertForTokenClassification were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForTokenClassification for predictions without further training.



<h3> Step 2.3: Enter an input sentence </h3>

In [None]:
sentence = "John Smith, CEO of XYZ Corporation, announced yesterday that the company will be opening a new office in London, England next month."

<h3> Step 2.4: Recognise the entities </h3>

In [None]:
result = ner_classifier(sentence)
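
The call returns a list of dictionaries, one per token that the model tagged as part of an entity, with keys that include `word`, `entity`, and `score`. The following is a minimal sketch of collecting the words under each label. To run without downloading the model, it uses `sample_result`, a hand-written stand-in in that shape whose values are taken from the output printed in Step 2.5. `group_words_by_label` is a hypothetical helper name.

```python
from collections import defaultdict

# Stand-in for part of the pipeline result (values copied from Step 2.5's output).
sample_result = [
    {"word": "John", "entity": "I-PER", "score": 0.9987553358078003},
    {"word": "Smith", "entity": "I-PER", "score": 0.9992994070053101},
    {"word": "London", "entity": "I-LOC", "score": 0.9991400241851807},
    {"word": "England", "entity": "I-LOC", "score": 0.9997513890266418},
]


def group_words_by_label(tokens):
    """Map each label to the list of words carrying it, in input order."""
    grouped = defaultdict(list)
    for token in tokens:
        grouped[token["entity"]].append(token["word"])
    return dict(grouped)


words_by_label = group_words_by_label(sample_result)
print(words_by_label)
# {'I-PER': ['John', 'Smith'], 'I-LOC': ['London', 'England']}
```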

<h3> Step 2.5: Print the entities with their labels </h3>

In [None]:
for entity in result:
  print(f"Entity: {entity['word']}, Label:{entity['entity']},Score: {entity['score']}")


Entity: John, Label:I-PER,Score: 0.9987553358078003
Entity: Smith, Label:I-PER,Score: 0.9992994070053101
Entity: X, Label:I-ORG,Score: 0.9995536208152771
Entity: ##Y, Label:I-ORG,Score: 0.9959398508071899
Entity: ##Z, Label:I-ORG,Score: 0.9982144832611084
Entity: Corporation, Label:I-ORG,Score: 0.9986222982406616
Entity: London, Label:I-LOC,Score: 0.9991400241851807
Entity: England, Label:I-LOC,Score: 0.9997513890266418
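
In the output above, `XYZ` appears as three pieces (`X`, `##Y`, `##Z`) because the BERT tokenizer splits words it does not know into WordPiece sub-tokens, and a leading `##` marks a piece that continues the previous one. The following is a minimal sketch of folding those pieces back into whole words. `merge_wordpieces` is a hypothetical helper, the sample data is copied from the output above, and keeping the lowest piece score for a merged word is just one possible choice. Recent versions of `transformers` also accept an `aggregation_strategy` argument on the `ner` pipeline (for example `"simple"`) that groups tokens for you.

```python
def merge_wordpieces(tokens):
    """Fold '##'-prefixed WordPiece continuations into the preceding token."""
    merged = []
    for token in tokens:
        word = token["word"]
        if word.startswith("##") and merged:
            # Continuation piece: append it (without '##') to the previous word.
            merged[-1]["word"] += word[2:]
            merged[-1]["scores"].append(token["score"])
        else:
            merged.append(
                {"word": word, "entity": token["entity"], "scores": [token["score"]]}
            )
    # Report the lowest piece score as a conservative score for the whole word.
    return [
        {"word": m["word"], "entity": m["entity"], "score": min(m["scores"])}
        for m in merged
    ]


# Token-level results copied from the output printed above.
sample_tokens = [
    {"word": "John", "entity": "I-PER", "score": 0.9987553358078003},
    {"word": "Smith", "entity": "I-PER", "score": 0.9992994070053101},
    {"word": "X", "entity": "I-ORG", "score": 0.9995536208152771},
    {"word": "##Y", "entity": "I-ORG", "score": 0.9959398508071899},
    {"word": "##Z", "entity": "I-ORG", "score": 0.9982144832611084},
    {"word": "Corporation", "entity": "I-ORG", "score": 0.9986222982406616},
    {"word": "London", "entity": "I-LOC", "score": 0.9991400241851807},
    {"word": "England", "entity": "I-LOC", "score": 0.9997513890266418},
]

merged_tokens = merge_wordpieces(sample_tokens)
for token in merged_tokens:
    print(f"Entity: {token['word']}, Label: {token['entity']}, Score: {token['score']:.4f}")
```

This sketch only rejoins sub-tokens; it does not combine separate words such as `John` and `Smith` into a single entity.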
