
# Lab 08 - Brandy Griffin - NER

What is NER?

1. Named Entity Recognition (NER) is an NLP technique that finds named entities in a text (mostly proper nouns such as the names of people, places, and organizations) and classifies them into predefined categories.
2. Each entity is assigned to a category such as person, place, or organization.
3. BIO notation marks each token as the beginning (B) of an entity, inside (I) an entity, or outside (O) any entity, which makes entity boundaries easy to recognize. O is the tag for non-entity tokens.
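
To make the BIO scheme concrete, here is a minimal sketch in plain Python. The helper name `to_bio`, the token list, and the span indices are made up for illustration; spaCy exposes the same information per token through `token.ent_iob_` and `token.ent_type_`.

```python
def to_bio(tokens, spans):
    """Return one BIO tag per token.

    spans is a list of (start_token, end_token_exclusive, label) tuples.
    """
    tags = ["O"] * len(tokens)            # default: outside any entity
    for start, end, label in spans:
        tags[start] = f"B-{label}"        # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"        # remaining tokens of the entity
    return tags


tokens = ["The", "World", "Health", "Organization", "is", "based", "in", "Geneva"]
spans = [(1, 4, "ORG"), (7, 8, "GPE")]    # hypothetical hand-made annotations
bio_tags = to_bio(tokens, spans)

for token, tag in zip(tokens, bio_tags):
    print(f"{token:15} {tag}")
```

A multi-token entity such as "World Health Organization" gets one B tag followed by I tags, which is how a reader (or a model) can tell where one entity ends and the next begins.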

Steps in NER:

1. Ingesting or loading the dataset
2. Extracting or detecting the entities from the text
3. Classifying them into their corresponding categories.
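
The three steps can be sketched without any NLP library by using a tiny hand-written lookup table in place of a trained model. Everything here (the `GAZETTEER` dictionary, the `find_entities` helper, the sample sentence) is an illustrative assumption; a real pipeline such as spaCy's relies on a statistical model rather than exact string matching.

```python
# Step 1: load the text (here just a string literal).
sample = "The WHO is headquartered in Geneva, Switzerland."

# Hypothetical lookup table standing in for a trained model.
GAZETTEER = {
    "WHO": "ORG",
    "Geneva": "GPE",
    "Switzerland": "GPE",
}


def find_entities(text, gazetteer):
    """Steps 2 and 3: detect known names and attach their category."""
    found = []
    for name, label in gazetteer.items():
        start = text.find(name)
        while start != -1:
            found.append((name, start, start + len(name), label))
            start = text.find(name, start + 1)
    return sorted(found, key=lambda item: item[1])  # order by position in text


detected = find_entities(sample, GAZETTEER)
for name, start, end, label in detected:
    print(name, start, end, label)
```

The tuples have the same shape as the (text, start, end, label) values printed later in this lab, so the rest of the workflow would look the same with a real model.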

Applications (uses) of NER:

1. Information retrieval
2. Content recommendation
3. Data analysis: extracting key patterns and trends (more information) from big datasets.

Categories in NER: Typical categories in NER are:

1. PER: for person (spaCy's English models use the label PERSON)
2. ORG: for organization
3. LOC: for location
4. GPE: for geopolitical entity (e.g., the U.K.)
5. DATE: for date
6. MONEY: for monetary values

Python Libraries for NER:

1. spacy
2. nltk

NLP Project Steps:

1. Introduction:

1. Background
Named Entity Recognition (NER) is a part of Natural Language Processing (NLP) that helps computers identify important information in text—like names of people, places, organizations, and dates. It’s used in everything from search engines to customer service, and especially in law enforcement, where pulling out key details quickly can make a big difference.
2. Objectives
The main goal of this lab was to get hands-on experience with how NER actually works. I wanted to learn how to use Python libraries like spaCy and nltk to load text, process it, identify named entities, and organize them in a way that makes sense. Another goal was to visualize and classify these entities so they’re easier to understand and use in real-life situations.
3. Dataset
For this lab, I used a paragraph of text about the World Health Organization (WHO). This sample text is rich in organizations, dates, and locations, which made it a good fit for testing how well the NER model picks up on those details.

2. Importing the Libraries

In [1]:
!pip install spacy



In [2]:
import spacy

In [3]:
!pip install nltk



In [4]:
import nltk

In [5]:
!python -m spacy download en_core_web_sm
# this downloads spaCy's small English pipeline, which is optimized for CPU

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 63.6 MB/s eta 0:00:00
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
⚠ Restart to reload dependencies
If you are in a Jupyter or Colab notebook, you may need to restart Python in
order to load all the package's dependencies. You can do this by selecting the
'Restart kernel' or 'Restart runtime' option.


In [6]:
# let's import the remaining libraries

import numpy as np
import pandas as pd
import spacy
import requests
from bs4 import BeautifulSoup
nlp = spacy.load('en_core_web_sm')

3. Loading the Dataset

In [7]:
text = "The World Health Organization (WHO)[1] is a specialized agency of the United Nations\
responsible for international public health.[2] The WHO Constitution states its main objective as\
'the attainment by all peoples of the highest possible level of health'.[3] Headquartered in\
Geneva, Switzerland, it has six regional offices and 150 field offices worldwide. The WHO was\
established on 7 April 1948.[4][5] The first meeting of the World Health Assembly (WHA), the\
agency's governing body, took place on 24 July of that year. The WHO incorporated the assets,\
personnel, and duties of the League of Nations' Health Organization and the Office International\
d'Hygiène Publique, including the International Classification of Diseases (ICD).[6] Its work\
began in 1951 after a significant infusion of financial and technical resources.[7]"

In [8]:
# converting my data into an nlp document
doc = nlp(text)

4. Extracting or Detecting the Entities from the Text

In [9]:
# in this step, we discover the entities and their corresponding categories
# let's use a for loop

for ent in doc.ents:
  print(ent.text, ent.start_char, ent.end_char, ent.label_)

The World Health Organization 0 29 ORG
the United Nationsresponsible 66 95 ORG
The WHO Constitution 132 152 WORK_OF_ART
health'.[3] 245 256 ORG
Switzerland 281 292 GPE
six 301 304 CARDINAL
150 326 329 CARDINAL
WHO 359 362 ORG
7 April 1948.[4][5 381 399 DATE
first 405 410 ORDINAL
the World Health Assembly 422 447 ORG
24 July of that year 497 517 DATE
WHO 523 526 ORG
the League of Nations' Health Organization 576 618 ORG
the International Classification of Diseases 677 721 ORG
1951 749 753 DATE
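
Several rows in this output look off: `the United Nationsresponsible`, `health'.[3]`, and `7 April 1948.[4][5`. Two things in the input string plausibly cause this. A trailing backslash inside a string literal joins the next line with no space in between, and the bracketed footnote markers such as `[3]` are left over from the source page. The sketch below shows one way to clean such text before passing it to `nlp`. The helper name `clean_text` and the regular expressions are assumptions added for illustration, not part of the lab, and the sample is only a short excerpt of the paragraph.

```python
import re

# Adjacent string literals inside parentheses are concatenated; each piece
# ends with an explicit space so words at line breaks stay separate.
raw = (
    "The World Health Organization (WHO)[1] is a specialized agency of the United Nations "
    "responsible for international public health.[2] The WHO was "
    "established on 7 April 1948.[4][5]"
)


def clean_text(text):
    """Remove footnote markers like [1] and collapse repeated whitespace."""
    without_refs = re.sub(r"\[\d+\]", "", text)
    return re.sub(r"\s+", " ", without_refs).strip()


cleaned = clean_text(raw)
print(cleaned)
```

Entity boundaries and character offsets would change after cleaning, so the output above would need to be regenerated rather than reused.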


5. Classifying and Displaying the Categories

In [10]:
# first importing the library
from spacy import displacy

# then let's display the categories within the text
displacy.render(doc, style='ent')

6. Converting the Text into Tabular (rows and columns) Dataset (using Pandas dataframe)

In [11]:
# let's store all the entities with their labels in a variable
# and name it entities
entities = [(ent.text, ent.label_, ent.lemma_) for ent in doc.ents]


# let's convert the data stored in entities into a pandas' dataframe
df = pd.DataFrame(entities, columns=['Text', 'Type', 'Lemma' ])
df

|  | Text | Type | Lemma |
|---|---|---|---|
| 0 | The World Health Organization | ORG | the World Health Organization |
| 1 | the United Nationsresponsible | ORG | the United Nationsresponsible |
| 2 | The WHO Constitution | WORK_OF_ART | the WHO Constitution |
| 3 | health'.[3] | ORG | health'.[3] |
| 4 | Switzerland | GPE | Switzerland |
| 5 | six | CARDINAL | six |
| 6 | 150 | CARDINAL | 150 |
| 7 | WHO | ORG | WHO |
| 8 | 7 April 1948.[4][5 | DATE | 7 April 1948.[4][5 |
| 9 | first | ORDINAL | first |
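
Once the entities are in rows like these, simple summaries become easy. As a minimal sketch, the snippet below counts entity types using only the standard library. The `rows` list is copied by hand from the ten rows displayed above; with the actual DataFrame, the same count should be available through `df['Type'].value_counts()`.

```python
from collections import Counter

# (Text, Type) pairs copied from the ten rows displayed above.
rows = [
    ("The World Health Organization", "ORG"),
    ("the United Nationsresponsible", "ORG"),
    ("The WHO Constitution", "WORK_OF_ART"),
    ("health'.[3]", "ORG"),
    ("Switzerland", "GPE"),
    ("six", "CARDINAL"),
    ("150", "CARDINAL"),
    ("WHO", "ORG"),
    ("7 April 1948.[4][5", "DATE"),
    ("first", "ORDINAL"),
]

# Count how many entities fall under each label.
type_counts = Counter(label for _, label in rows)
for label, count in type_counts.most_common():
    print(label, count)

# Keep only the organizations, e.g. for a follow-up lookup.
orgs = [text for text, label in rows if label == "ORG"]
```

A count like this is also a quick sanity check: a label that shows up far more often than expected can point to mislabeled spans such as `health'.[3]`.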


Conclusion:

This lab made everything I learned about Named Entity Recognition in the earlier assignment finally come together. Before, I understood NER more on a basic level—like what it does and how it can help in fields like law enforcement. But in this lab, I actually got to see it in action. I used spaCy to break down a real paragraph and extract named entities like people, places, dates, and organizations. It wasn’t just about theory anymore—it became hands-on.

Being able to take raw text, run it through an NLP pipeline, and visualize the important parts using tools like displacy really helped me understand how powerful NER is. I also saw how it can be turned into a structured dataset using pandas, which would be super useful in a real-world setting—especially for something like scanning police reports or large documents quickly.

After doing both the written report and this lab, I can definitely say that NER is more than just tagging names in a sentence—it’s about helping computers understand human language on a deeper level. Whether it's pulling out details from a crime report or scanning social media for threats, NER is a tool that saves time and improves accuracy.



Add 15 points that you learned from completing this lab.

1. NER stands for Named Entity Recognition and is used to detect important parts of text like names, places, and dates.

2. spaCy is one of the main Python libraries used for NER.

3. BIO tagging is a method used to label entities as Beginning, Inside, or Outside of a named entity.

4. The en_core_web_sm model is a small English model used in spaCy for NER tasks.

5. You can load a pipeline with nlp = spacy.load() and then process text into tokens by calling nlp(text).

6. The doc.ents attribute gives you the entities found in a processed document.

7. You can use displacy.render() to visualize named entities right inside the notebook.

8. pandas can turn the output of NER into a table, which is helpful for analyzing large amounts of text.

9. NER can recognize different categories like PERSON, ORG, LOC, DATE, MONEY, and more.

10. Law enforcement can use NER to quickly pull names, dates, and locations from reports and messages.

11. Deep learning systems such as IBM Watson can be trained to detect specific names or terms based on the field they’re used in.

12. NLTK is another Python library that works with natural language, but spaCy is more modern and generally faster for NER.

13. I learned how to install models and libraries using pip in Colab.

14. You can classify named entities based on context—for example, telling if “Marshall” is a person or an agency.

15. The combination of NER with tools like BeautifulSoup, pandas, and requests allows you to build much more powerful NLP systems.






End of Lab 08.