# Named Entity Recognition (NER) with spaCy

This notebook demonstrates Named Entity Recognition (NER) on English text with spaCy.  
It loads the `en_core_web_sm` language model, extracts entities from a sample text about Google, and visualizes them with spaCy's `displacy` tool.  
It also applies a simple preprocessing step (removing punctuation and converting the text to lowercase) to show how NER results can change after text cleaning.

**Main steps:**
- Load spaCy language model
- Analyze a sample text for named entities
- Display recognized entities and their labels
- Clean the text and repeat entity recognition
- Visualize results before and after text cleaning

This example is useful for understanding the basics of NER and the impact of text preprocessing on entity extraction: in the outputs below, the cleaned text yields 14 entities, compared with 47 for the original text.

In [22]:
# Download the English language model for spaCy (run only once)

!python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
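
Because the model only needs to be downloaded once, a small guard can tell you whether the download cell can be skipped. The following is a minimal sketch, assuming the model is installed as a regular Python package (as the wheel in the log above suggests). `model_installed` is a hypothetical helper written for this illustration, not part of spaCy.

```python
import importlib.util


def model_installed(package_name: str) -> bool:
    """Return True if a top-level package with this name can be located."""
    # Assumption: models installed via `python -m spacy download` are ordinary
    # pip packages, so the import system can find them by name.
    return importlib.util.find_spec(package_name) is not None


if model_installed("en_core_web_sm"):
    print("Model already installed; the download cell can be skipped.")
else:
    print("Model missing; run: python -m spacy download en_core_web_sm")
```

The check only looks the package up and does not import it, so it stays cheap even for larger models.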


In [23]:
# Import necessary libraries and load the small English language model from spaCy

import spacy
from spacy import displacy
import re

nlp = spacy.load("en_core_web_sm")

In [24]:
# Example text about Google to analyze.
# Note: this text already has its punctuation stripped, so sentence boundaries are not marked.

google_text = "Google was founded on September 4 1998 by American computer scientists Larry Page and Sergey Brin Together they own about 14 of its publicly listed shares and control 56 of its stockholder voting power through supervoting stock The company went public via an initial public offering IPO in 2004 In 2015 Google was reorganized as a wholly owned subsidiary of Alphabet Inc Google is Alphabets largest subsidiary and is a holding company for Alphabets internet properties and interests Sundar Pichai was appointed CEO of Google on October 24 2015 replacing Larry Page who became the CEO of Alphabet On December 3 2019 Pichai also became the CEO of Alphabet After the success of its original service Google Search often known simply as Google the company has rapidly grown to offer a multitude of products and services These products address a wide range of use cases including email Gmail navigation and mapping Waze Maps and Earth cloud computing Cloud web navigation Chrome video sharing YouTube productivity Workspace operating systems Android and ChromeOS cloud storage Drive language translation Translate photo storage Photos videotelephony Meet smart home Nest smartphones Pixel wearable technology Pixel Watch and Fitbit music streaming YouTube Music video on demand YouTube TV AI Google Assistant and Gemini machine learning APIs TensorFlow AI chips TPU and more Many of these products and services are dominant in their respective industries as is Google Search Discontinued Google products include gaming Stadia Glass Google Reader Play Music Nexus Hangouts and Inbox by Gmail Googles other ventures outside of internet services and consumer electronics include quantum computing Sycamore selfdriving cars Waymo smart cities Sidewalk Labs and transformer models Google DeepMind"

In [25]:
print(google_text)

Google was founded on September 4 1998 by American computer scientists Larry Page and Sergey Brin Together they own about 14 of its publicly listed shares and control 56 of its stockholder voting power through supervoting stock The company went public via an initial public offering IPO in 2004 In 2015 Google was reorganized as a wholly owned subsidiary of Alphabet Inc Google is Alphabets largest subsidiary and is a holding company for Alphabets internet properties and interests Sundar Pichai was appointed CEO of Google on October 24 2015 replacing Larry Page who became the CEO of Alphabet On December 3 2019 Pichai also became the CEO of Alphabet After the success of its original service Google Search often known simply as Google the company has rapidly grown to offer a multitude of products and services These products address a wide range of use cases including email Gmail navigation and mapping Waze Maps and Earth cloud computing Cloud web navigation Chrome video sharing YouTube produ

In [26]:
# Apply spaCy NLP pipeline to the text to create a Doc object

spacy_doc = nlp(google_text)

In [27]:
# Print named entities found in the text

for ent in spacy_doc.ents:
    print(ent.text, ent.label_)

Google ORG
September 4 1998 DATE
American NORP
Larry Page PERSON
Sergey Brin Together PERSON
about 14 CARDINAL
56 CARDINAL
IPO ORG
2004 DATE
2015 DATE
Alphabet Inc Google ORG
Alphabets ORG
Sundar Pichai PERSON
Google ORG
October 24 2015 DATE
Larry Page PERSON
Alphabet On ORG
December 3 2019 DATE
Alphabet After ORG
Google Search ORG
Google ORG
Gmail PERSON
Waze Maps GPE
Earth LOC
Cloud PERSON
Chrome PERSON
YouTube ORG
Workspace PERSON
Android ORG
Drive GPE
Translate ORG
Pixel PERSON
Pixel Watch PERSON
Fitbit NORP
YouTube Music ORG
YouTube TV AI Google ORG
Gemini ORG
TensorFlow AI ORG
TPU ORG
Google Search ORG
Stadia Glass ORG
Reader Play Music Nexus Hangouts ORG
Inbox ORG
Gmail Googles PERSON
Waymo PRODUCT
Sidewalk Labs PERSON
Google DeepMind PRODUCT
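
A flat list like the one above is easier to inspect when entities are grouped by label. The sketch below is one possible way to do that. It works on plain `(text, label)` pairs so that it runs without spaCy; `group_by_label` is a hypothetical helper, and the sample pairs are copied from the output above.

```python
from collections import defaultdict


def group_by_label(pairs):
    """Group entity texts by label, keeping first-seen order and dropping repeats."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)


# A few (text, label) pairs taken from the printed entities above.
sample = [
    ("Google", "ORG"),
    ("Larry Page", "PERSON"),
    ("Gmail", "PERSON"),
    ("Google", "ORG"),
    ("Earth", "LOC"),
]

grouped = group_by_label(sample)
for label, texts in grouped.items():
    print(label, texts)
```

With the real document, the pairs could be built as `[(ent.text, ent.label_) for ent in spacy_doc.ents]`. Grouping this way makes mislabeled items easy to spot, such as `Gmail` appearing under `PERSON`.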


In [28]:
# Visualize named entities in Jupyter Notebook

displacy.render(spacy_doc, style='ent', jupyter=True)

In [29]:
# Clean the text: remove punctuation and lowercase.
# The sample text already contains no punctuation, so only the lowercasing changes it here.

google_text_clean = re.sub(r'[^\w\s]', '', google_text).lower()
print(google_text_clean)

google was founded on september 4 1998 by american computer scientists larry page and sergey brin together they own about 14 of its publicly listed shares and control 56 of its stockholder voting power through supervoting stock the company went public via an initial public offering ipo in 2004 in 2015 google was reorganized as a wholly owned subsidiary of alphabet inc google is alphabets largest subsidiary and is a holding company for alphabets internet properties and interests sundar pichai was appointed ceo of google on october 24 2015 replacing larry page who became the ceo of alphabet on december 3 2019 pichai also became the ceo of alphabet after the success of its original service google search often known simply as google the company has rapidly grown to offer a multitude of products and services these products address a wide range of use cases including email gmail navigation and mapping waze maps and earth cloud computing cloud web navigation chrome video sharing youtube produ
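
Since the sample text contains no punctuation, the effect of the regular expression is not visible in the output above. The following sketch applies the same cleaning step to a short punctuated sentence. `clean_text` is a hypothetical wrapper around the expression used in the notebook, and the sentence is an illustrative input based on facts from the sample text.

```python
import re


def clean_text(text: str) -> str:
    """Drop every character that is not a word character or whitespace, then lowercase."""
    # Note: \w also matches digits and the underscore, so those are kept.
    return re.sub(r"[^\w\s]", "", text).lower()


punctuated = (
    "In 2015, Google was reorganized as a subsidiary of Alphabet Inc.; "
    "it is Alphabet's largest subsidiary."
)
cleaned = clean_text(punctuated)
print(cleaned)
```

Removing the apostrophe turns `Alphabet's` into `alphabets`, the same form that appears in the sample text.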

In [30]:
# Apply the same spaCy pipeline to the cleaned text

spacy_doc_clean = nlp(google_text_clean)

In [31]:
# Print named entities found in the cleaned text

for ent in spacy_doc_clean.ents:
    print(ent.text, ent.label_)

google ORG
september 4 1998 DATE
american computer ORG
about 14 CARDINAL
56 CARDINAL
2004 DATE
2015 DATE
alphabet inc google ORG
google ORG
october 24 2015 DATE
larry PERSON
december 3 2019 DATE
google ORG
google ORG
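
To see which entities disappear after cleaning, the two result lists can be compared case-insensitively. This is a minimal sketch rather than a full evaluation. `lost_after_cleaning` is a hypothetical helper, and the pairs are a small subset copied from the two outputs in this notebook.

```python
def lost_after_cleaning(original, cleaned):
    """Return (text, label) pairs whose text has no case-insensitive match in `cleaned`."""
    cleaned_texts = {text.lower() for text, _ in cleaned}
    lost = []
    for text, label in original:
        if text.lower() not in cleaned_texts and (text, label) not in lost:
            lost.append((text, label))
    return lost


# Subsets of the entities printed before and after cleaning.
original = [
    ("Google", "ORG"),
    ("Larry Page", "PERSON"),
    ("Sundar Pichai", "PERSON"),
    ("2004", "DATE"),
    ("YouTube", "ORG"),
]
cleaned = [("google", "ORG"), ("larry", "PERSON"), ("2004", "DATE")]

lost = lost_after_cleaning(original, cleaned)
for text, label in lost:
    print(text, label)
```

The comparison matches whole entity texts only, so `Larry Page` counts as lost even though `larry` was still recognized. With the real documents, both lists could be built from `spacy_doc.ents` and `spacy_doc_clean.ents`.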


In [32]:
# Visualize named entities found in the cleaned text

displacy.render(spacy_doc_clean, style="ent", jupyter=True)