# Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying named entities (real-world objects) in text and classifying them into predefined categories such as Person, Organization, Location, Date, and Money.

In [48]:
!pip install spacy nltk
!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [39]:
# Sample news headlines for NER exploration (no external download needed)
headlines = [
    "Apple is looking at buying U.K. startup for $1 billion",
    "San Francisco considers banning sidewalk delivery robots",
    "Elon Musk's Tesla recalls 2 million cars over Autopilot concerns",
    "President Biden meets with European Union leaders in Brussels",
    "Google announces new AI research lab in Tokyo, Japan",
    "The United Nations General Assembly convenes in New York",
    "Amazon CEO Andy Jassy unveils new Alexa features at CES 2025",
    "NASA's Artemis III mission to land astronauts on the Moon",
    "Microsoft acquires Activision Blizzard for $69 billion",
    "India's ISRO launches Chandrayaan-4 mission from Sriharikota",
    "French President Macron hosts G7 summit in Paris",
    "OpenAI releases GPT-5 in partnership with Microsoft",
    "FIFA World Cup 2026 to be held in United States, Canada and Mexico",
    "Bank of England raises interest rates to combat inflation",
    "SpaceX Starship completes first orbital flight from Texas",
]

In [40]:
import spacy

# Load the small English pipeline installed above
nlp = spacy.load("en_core_web_sm")

In [47]:
# Run the pipeline over every headline; nlp.pipe processes the texts in batches
docs = list(nlp.pipe(headlines))

# Print each entity with its character offsets and label
for doc in docs[:5]:
    print("\nHeadline:", doc.text)
    for ent in doc.ents:
        print(ent.text, ent.start_char, ent.end_char, ent.label_)


Headline: Apple is looking at buying U.K. startup for $1 billion
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY

Headline: San Francisco considers banning sidewalk delivery robots
San Francisco 0 13 GPE

Headline: Elon Musk's Tesla recalls 2 million cars over Autopilot concerns
Elon Musk's 0 11 PERSON
Tesla 12 17 PERSON
2 million 26 35 CARDINAL
Autopilot 46 55 ORG

Headline: President Biden meets with European Union leaders in Brussels
Biden 10 15 PERSON
European Union 27 41 ORG
Brussels 53 61 GPE

Headline: Google announces new AI research lab in Tokyo, Japan
Google 0 6 ORG
AI 21 23 GPE
Tokyo 40 45 GPE
Japan 47 52 GPE
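
The two numbers printed after each entity are `ent.start_char` and `ent.end_char`: character offsets into the headline, with the end offset exclusive, so they can be used directly as a Python slice. A minimal sketch that checks this against the printed output, with no spaCy needed (the `printed` list is copied by hand from the output above, and the variable names are only illustrative):

```python
# Entities copied by hand from the printed output above: (text, start, end, label)
headline = "Apple is looking at buying U.K. startup for $1 billion"
printed = [
    ("Apple", 0, 5, "ORG"),
    ("U.K.", 27, 31, "GPE"),
    ("$1 billion", 44, 54, "MONEY"),
]

# Slicing the headline with the offsets should give back each entity's text
recovered = [headline[start:end] for _, start, end, _ in printed]
for (text, start, end, label), piece in zip(printed, recovered):
    print(f"{label:8s} [{start}:{end}] -> {piece!r}  match={piece == text}")
```

Offsets like these are what you would store to highlight entities in the original text or to align predictions with annotations.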


In [44]:
# All entity labels of this pipeline's NER component, with spaCy's explanation of each
for label in nlp.get_pipe("ner").labels:
    print(f"{label:15s} -    {spacy.explain(label)}")

CARDINAL        -    Numerals that do not fall under another type
DATE            -    Absolute or relative dates or periods
EVENT           -    Named hurricanes, battles, wars, sports events, etc.
FAC             -    Buildings, airports, highways, bridges, etc.
GPE             -    Countries, cities, states
LANGUAGE        -    Any named language
LAW             -    Named documents made into laws.
LOC             -    Non-GPE locations, mountain ranges, bodies of water
MONEY           -    Monetary values, including unit
NORP            -    Nationalities or religious or political groups
ORDINAL         -    "first", "second", etc.
ORG             -    Companies, agencies, institutions, etc.
PERCENT         -    Percentage, including "%"
PERSON          -    People, including fictional
PRODUCT         -    Objects, vehicles, foods, etc. (not services)
QUANTITY        -    Measurements, as of weight or distance
TIME            -    Times smaller than a day
WORK_OF_ART     -    Title

## Conclusion & Key Takeaways

### Observations from our NER results

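- Apple - correctly identified as ORG.
- Tesla - wrongly identified as PERSON; it should be ORG in this context. The model did not even pick PRODUCT.
- AI - wrongly identified as GPE. "AI" names a field rather than a place, so it arguably should not be tagged on its own; if anything, the full phrase "AI research lab" could be read as a facility or location.
- Autopilot - wrongly identified as ORG. It is a Tesla driver-assistance feature, so PRODUCT is the closest label.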
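
One way to make observations like these repeatable is to compare the predicted labels with a small hand-written reference. The sketch below is plain Python: `predicted` is copied from the printed output for the four entities discussed here, while the `expected` labels are assumptions made for illustration (in particular, `PRODUCT` for Autopilot and `None`, meaning "should not be tagged", for AI are judgment calls, not ground truth).

```python
# Predicted labels, copied by hand from the printed output for the discussed entities
predicted = {"Apple": "ORG", "Tesla": "PERSON", "AI": "GPE", "Autopilot": "ORG"}

# Hand-written reference labels (assumptions for illustration, not ground truth);
# None means "should not be tagged as an entity"
expected = {"Apple": "ORG", "Tesla": "ORG", "AI": None, "Autopilot": "PRODUCT"}

# Collect every entity whose predicted label differs from the reference
mismatches = {
    text: (label, expected[text])
    for text, label in predicted.items()
    if label != expected[text]
}

for text, (got, want) in mismatches.items():
    print(f"{text:10s} predicted={got:7s} expected={want}")

accuracy = 1 - len(mismatches) / len(predicted)
print(f"Label accuracy on these {len(predicted)} entities: {accuracy:.0%}")
```

Four entities are far too few for a meaningful score; the point is only that a written-down reference turns eyeballed impressions into something that can be re-checked after changing the model.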

### Common NER challenges demonstrated here

- **Polysemy** - Same word, different entity types depending on context
    - "Apple" = ORG vs fruit - model got it right
    - "Tesla" = ORG vs PERSON - model failed
- **Nested entities** - "Elon Musk's Tesla" contains both a `PERSON` and an `ORG`; spaCy's `doc.ents` holds flat, non-overlapping spans, and here both parts were tagged `PERSON`
- **Emerging entities** - Products and tech terms (Autopilot) that are rare or absent in the data the model was trained on
- **Boundary detection** - Deciding where an entity starts/ends (is it "AI" alone or "AI research lab"?); the model also pulled the possessive "'s" into the span "Elon Musk's"
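
A common way to handle terms that the statistical model gets wrong is to add rule-based patterns with spaCy's `entity_ruler` component. The sketch below uses a blank English pipeline, so it runs without the downloaded model and shows only the rule mechanism; the two patterns and their labels are assumptions chosen for this example, and `rule_nlp` and `rule_ents` are illustrative names.

```python
import spacy

# Blank English pipeline: tokenizer only, no statistical NER
rule_nlp = spacy.blank("en")

# Rule-based component that tags exact phrase matches
ruler = rule_nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Tesla"},          # assumed label for this context
    {"label": "PRODUCT", "pattern": "Autopilot"},  # assumed label for this context
])

doc = rule_nlp("Elon Musk's Tesla recalls 2 million cars over Autopilot concerns")
rule_ents = [(ent.text, ent.label_) for ent in doc.ents]
print(rule_ents)
```

In the notebook's own pipeline, the same component could be added with `nlp.add_pipe("entity_ruler", before="ner")`; according to spaCy's documentation, the statistical recognizer then respects the spans set by the ruler and predicts around them. The trade-off is that rules are context-blind: this pattern would also label the "Tesla" in "Nikola Tesla" as `ORG`.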