
Named entity recognition (NER) is a natural language processing technique for identifying and classifying named entities in text. Named entities are typically people, organizations, locations, dates, times, and quantities.

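Taken as a raw string, the URL http://thedatacity.com does not contain any named entities that an NER model can pick up, because the model sees it as one opaque token rather than running text. A named entity must be a word or phrase that refers to a specific real-world object or concept. On their own, the words "the", "data", and "city" are not named entities, but together they spell out the name that the domain stands for, The Data City.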

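To use NER on a URL, you would first need to break the URL down into its constituent parts. The scheme, host, and path can be separated with a regular expression or a URL-parsing library; splitting a run-together host name such as "thedatacity" into words additionally requires some form of word segmentation. Once the URL has been broken down, you could then apply NER to each of the parts.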

For example, the URL http://thedatacity.com could be broken down into the following parts:

http://
the
data
city
.com

You could then apply NER to each of these parts to identify any named entities. In this case, there would be no named entities identified.
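
The breakdown above can be sketched in Python. This is a minimal illustration under stated assumptions, not a definitive implementation: `urlsplit` from the standard library separates the scheme and host, and a small recursive matcher splits the run-together host name into words. The `VOCAB` set and the `segment` and `url_parts` helpers are hypothetical names introduced here; a real system would need a far larger word list or a trained segmenter.

```python
from urllib.parse import urlsplit

# Hypothetical mini-vocabulary, used only for this illustration.
VOCAB = {"the", "data", "city", "google"}


def segment(label, vocab=VOCAB):
    """Split a run-together label into known words, longest match first.

    Returns a list of words, or None if the label cannot be fully
    covered by words from the vocabulary.
    """
    if not label:
        return []
    for end in range(len(label), 0, -1):
        head = label[:end]
        if head in vocab:
            rest = segment(label[end:], vocab)
            if rest is not None:
                return [head] + rest
    return None


def url_parts(url):
    """Break a URL into scheme, host-name words, and top-level domain."""
    parsed = urlsplit(url)
    labels = (parsed.hostname or "").split(".")
    # Drop a leading "www" label when a longer host name follows it.
    if len(labels) > 2 and labels[0] == "www":
        labels = labels[1:]
    *names, tld = labels
    words = []
    for name in names:
        # Fall back to the unsplit label when no segmentation is found.
        words.extend(segment(name) or [name])
    return {"scheme": parsed.scheme, "words": words, "tld": tld}


print(url_parts("http://thedatacity.com"))
# {'scheme': 'http', 'words': ['the', 'data', 'city'], 'tld': 'com'}
```

Each entry in `words` could then be passed to an NER model, either one at a time or joined with spaces.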

However, if the URL had contained a recognizable name, as in "www.google.com", an NER model could identify the named entity "Google" once the host name is isolated, provided the model recognizes the lowercased token "google".

Here are some of the challenges of using NER on URLs:

URLs can be very long and complex, making it difficult to identify named entities.
URLs can contain non-standard characters, which can make it difficult for NER algorithms to work properly.
URLs can be ambiguous, meaning that the same URL could refer to multiple different named entities.

Despite these challenges, NER can be a useful tool for extracting information from URLs. By identifying named entities in URLs, you can gain insights into the content of the website or resource that the URL points to.
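
One way to soften the second challenge (non-standard characters) is to decode percent-escapes and replace URL separators with spaces before running NER. The sketch below uses only the standard library; the `url_to_text` helper and the example URL are hypothetical and serve only to illustrate the idea.

```python
import re
from urllib.parse import unquote, urlsplit


def url_to_text(url):
    """Turn a URL's path and query into space-separated, decoded words."""
    parsed = urlsplit(url)
    raw = f"{parsed.path} {parsed.query}"
    decoded = unquote(raw)  # e.g. %20 becomes a space
    # Replace common URL separators with spaces, then collapse whitespace.
    spaced = re.sub(r"[/\-_+=&.?]+", " ", decoded)
    return " ".join(spaced.split())


# Made-up URL for illustration only.
example = "https://example.com/news/Sebastian_Thrun-joins%20Google?year=2007"
print(url_to_text(example))
# news Sebastian Thrun joins Google year 2007
```

The resulting string reads more like ordinary text, so it could be passed to `nlp(...)` in the same way as the sample sentence in the cells below. How well the model performs on such fragments would still need to be checked case by case.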

In [None]:
import spacy

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)

In [5]:
# Analyze syntax
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Noun phrases: ['Sebastian Thrun', 'self-driving cars', 'Google', 'few people', 'the company', 'him', 'I', 'you', 'very senior CEOs', 'major American car companies', 'my hand', 'I', 'Thrun', 'an interview', 'Recode']
Verbs: ['start', 'work', 'drive', 'take', 'tell', 'shake', 'turn', 'talk', 'say']


In [6]:
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)

Sebastian Thrun PERSON
Google ORG
2007 DATE
American NORP
Thrun PERSON
Recode ORG
earlier this week DATE


In [8]:
import requests

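# Fetch the raw HTML from the URL
url = "http://thedatacity.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors
content = response.text  # raw HTML, markup included

# Process the raw HTML with spaCy (tags and attributes are not stripped here)
doc = nlp(content)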


In [9]:
# Extract named entities as (text, label) pairs
named_entities = [(ent.text, ent.label_) for ent in doc.ents]

In [10]:
# Print the extracted named entities
for text, label in named_entities:
    print(f"Text: {text}, Label: {label}")

Text: GB, Label: GPE
Text: href="https://thedatacity.com, Label: ORG
Text: max, Label: PERSON
Text: Data City, Label: GPE
Text: UK&#039;s, Label: NORP
Text: sectors &amp, Label: ORG
Text: equiv="X-UA-Compatible, Label: ORG
Text: href="https://thedatacity.com, Label: ORG
Text: href="https://thedatacity.com, Label: ORG
Text: max, Label: PERSON
Text: max-snippet:-1, Label: PERSON
Text: max, Label: PERSON
Text: dataLayer, Label: PERSON
Text: UK, Label: GPE
Text: cluster &amp, Label: ORG
Text: 5, Label: CARDINAL
Text: 350, Label: CARDINAL
Text: Data City, Label: GPE
Text: UK&#039;s, Label: NORP
Text: sectors &, Label: ORG
Text: UK, Label: GPE
Text: cluster &amp, Label: ORG
Text: 5, Label: CARDINAL
Text: 350, Label: CARDINAL
Text: Data City, Label: GPE
Text: UK, Label: GPE
Text: sectors & companies","isPartOf":{"@id":"https://thedatacity.com/#website"},"datePublished":"2021-05-05T10:18:53, Label: ORG
Text: UK, Label: GPE
Text: cluster & company, Label: ORG
Text: The Data City's, Label: GPE
T
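
Much of the noise in the output above comes from feeding raw HTML to the model: attribute fragments such as `href="https://thedatacity.com` and undecoded character references such as `UK&#039;s` and `sectors &amp` are treated as if they were text. As a small illustration of the character-reference part only, the standard library's `html.unescape` turns such references back into the characters they stand for. The `samples` list reuses two strings from the output above, with the second completed by a closing semicolon and a following word for the sake of the example.

```python
import html

# Strings modelled on the entity texts printed above.
samples = ["UK&#039;s", "sectors &amp; companies"]

# Convert HTML character references back into plain characters.
decoded = [html.unescape(s) for s in samples]
print(decoded)
# ["UK's", 'sectors & companies']
```

Decoding references does not remove tags or attributes, so it is only one step toward clean input.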

In [11]:
from bs4 import BeautifulSoup

# Load the spaCy model
nlp = spacy.load("en_core_web_sm")

# Fetch the HTML content from the URL
url = "http://thedatacity.com"
response = requests.get(url, timeout=10)
response.raise_for_status()  # fail early on HTTP errors
html_content = response.text

# Parse the HTML content to extract the text
soup = BeautifulSoup(html_content, "html.parser")
text = soup.get_text()

# Process the text with spaCy NER
doc = nlp(text)

# Extract named entities as (text, label) pairs
named_entities = [(ent.text, ent.label_) for ent in doc.ents]

# Print the named entities
for entity, label in named_entities:
    print(f"Entity: {entity}, Label: {label}")

Entity: Data City, Label: GPE
Entity: UK, Label: GPE
Entity: GOVERNMENTPE, Label: ORG
Entity: VC & INVESTMENTB2B COMPANIESPlatformDirectoryGlobal PlatformAccreditationLocal Government Package, Label: ORG
Entity: 350, Label: CARDINAL
Entity: Real-Time Industrial Classifications &, Label: ORG
Entity: Featured RTICsAdvanced ManufacturingAgriTechArtificial IntelligenceCryptocurrency EconomyImmersive, Label: PERSON
Entity: ZeroQuantum, Label: ORG
Entity: SIC, Label: ORG
Entity: Role Opening: Business Development ExecutiveData Explorer Release NotesUncovering Life Sciences’ Innovation CommunitiesReviewing the Space Economy’s networkEmbracing Innovation and Empowering Teams The UK, Label: WORK_OF_ART
Entity: Top Artificial Intelligence, Label: ORG
Entity: Skills, Label: ORG
Entity: Midlands Engine, Label: PERSON
Entity: CityDiscover, Label: PRODUCT
Entity: UK, Label: GPE
Entity: UK, Label: GPE
Entity: over 5 million, Label: CARDINAL
Entity: 350, Label: CARDINAL
Entity: SIC, Label: ORG
Entity:
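
Several entities above, such as "GOVERNMENTPE" and "CityDiscover", look like adjacent page elements fused together. A likely cause is that `get_text()` joins text nodes with no separator by default. The sketch below, which assumes this is the cause, passes `separator=" "` and `strip=True` and removes `script` and `style` elements first. The `visible_text` helper and the HTML fragment are made up for illustration and are not taken from the real page.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment standing in for a fetched page.
html_content = """
<html><head><title>Example</title><script>var dataLayer = [];</script></head>
<body><ul><li>GOVERNMENT</li><li>PE, VC &amp; INVESTMENT</li><li>B2B COMPANIES</li></ul>
<p>Discover the UK economy.</p></body></html>
"""


def visible_text(html_content):
    """Extract human-readable text, keeping a space between elements."""
    soup = BeautifulSoup(html_content, "html.parser")
    # Remove elements whose contents are code or styling, not prose.
    for tag in soup.find_all(["script", "style", "noscript"]):
        tag.decompose()
    # Join text nodes with a space so neighbouring items do not fuse.
    return soup.get_text(separator=" ", strip=True)


text = visible_text(html_content)
print(text)
```

The returned string can be passed to `nlp(...)` as before. Menu labels and headings are still not sentences, so some mislabelled entities should be expected even after this cleanup.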