
## NLP Application: Named Entity Recognition (NER) in Python with spaCy
Source: https://www.analyticsvidhya.com/blog/2021/06/nlp-application-named-entity-recognition-ner-in-python-with-spacy/


In [1]:
#! conda install -y spacy
#! python -m spacy download en_core_web_sm

In [2]:
import spacy
from spacy import displacy

In [3]:
nlp = spacy.load("en_core_web_sm")

In [4]:
text = "The Indian Space Research Organisation or is the national space agency of India, headquartered in Bengaluru. It operates under Department of Space which is directly overseen by the Prime Minister of India while Chairman of ISRO acts as executive of DOS as well."
doc = nlp(text)
# doc.ents holds entity spans (possibly multi-word), not single words
for ent in doc.ents:
    print(ent.text, ent.label_)

The Indian Space Research Organisation ORG
the national space agency ORG
India GPE
Bengaluru GPE
Department of Space ORG
India GPE
ISRO ORG
DOS ORG
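
The flat listing above repeats some entities (India appears twice). As a minimal sketch, the pairs can be grouped by label in plain Python. The `pairs` list is copied by hand from the output above, and `group_by_label` is a hypothetical helper, not part of spaCy; with a live `Doc` the pairs would come from `[(ent.text, ent.label_) for ent in doc.ents]`.

```python
from collections import defaultdict

# (text, label) pairs copied from the output above.
pairs = [
    ("The Indian Space Research Organisation", "ORG"),
    ("the national space agency", "ORG"),
    ("India", "GPE"),
    ("Bengaluru", "GPE"),
    ("Department of Space", "ORG"),
    ("India", "GPE"),
    ("ISRO", "ORG"),
    ("DOS", "ORG"),
]

def group_by_label(pairs):
    """Map each label to its distinct entity texts, in first-seen order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

grouped = group_by_label(pairs)
for label, texts in grouped.items():
    print(label, texts)
```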


In [5]:
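# A notebook cell only displays its last expression, so print each explanation
print("ORG:", spacy.explain("ORG"))
print("GPE:", spacy.explain("GPE"))

ORG: Companies, agencies, institutions, etc.
GPE: Countries, cities, states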

In [6]:
displacy.render(doc, style="ent", jupyter=True)

In [7]:
displacy.render(doc, style="dep", jupyter=True)

## Long Text Example

In [8]:
text = """In ancient Rome, some neighbors live in three adjacent houses. In the center is the house of Senex, who lives there with wife Domina, son Hero, and several slaves, including head slave Hysterium and the musical's main character Pseudolus. A slave belonging to Hero, Pseudolus wishes to buy, win, or steal his freedom. One of the neighboring houses is owned by Marcus Lycus, who is a buyer and seller of beautiful women; the other belongs to the ancient Erronius, who is abroad searching for his long-lost children (stolen in infancy by pirates). One day, Senex and Domina go on a trip and leave Pseudolus in charge of Hero. Hero confides in Pseudolus that he is in love with the lovely Philia, one of the courtesans in the House of Lycus (albeit still a virgin)."""

doc = nlp(text)
sentence_spans = list(doc.sents)
displacy.render(sentence_spans, style="dep")

In [9]:
raw_text2 = "The Mars Orbiter Mission (MOM), informally known as Mangalyaan, was launched into Earth orbit on 5 November 2013 by the Indian Space Research Organisation (ISRO) and has entered Mars orbit on 24 September 2014. India thus became the first country to enter Mars orbit on its first attempt. It was completed at a record low cost of $74 million."
doc1 = nlp(raw_text2)
# doc1.ents holds entity spans (possibly multi-word), not single words
for ent in doc1.ents:
    print(ent.text, ent.label_)

The Mars Orbiter Mission ORG
Mangalyaan PERSON
Earth LOC
November 2013 DATE
the Indian Space Research Organisation ORG
ISRO ORG
Mars LOC
24 September 2014 DATE
India GPE
first ORDINAL
Mars LOC
first ORDINAL
$74 million MONEY
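
In the output above, Mangalyaan is tagged PERSON even though the text describes it as the informal name of the Mars Orbiter Mission. One possible workaround, sketched here in plain Python, is to post-process the `(text, label)` pairs with a hand-maintained override table. `LABEL_OVERRIDES` and `apply_overrides` are hypothetical names introduced for illustration, and the choice of PRODUCT as the replacement label is an assumption; this is not a spaCy API.

```python
# A few (text, label) pairs copied from the output above.
pairs = [
    ("The Mars Orbiter Mission", "ORG"),
    ("Mangalyaan", "PERSON"),
    ("Earth", "LOC"),
    ("India", "GPE"),
]

# Assumption: hand-written corrections for entities known to be mislabeled.
LABEL_OVERRIDES = {"Mangalyaan": "PRODUCT"}

def apply_overrides(pairs, overrides):
    """Return the pairs with labels replaced wherever the text has an override."""
    return [(text, overrides.get(text, label)) for text, label in pairs]

corrected = apply_overrides(pairs, LABEL_OVERRIDES)
for text, label in corrected:
    print(text, label)
```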


In [10]:
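# A notebook cell only displays its last expression, so print each explanation
for label in ("PRODUCT", "LOC", "DATE", "ORDINAL", "MONEY"):
    print(f"{label}: {spacy.explain(label)}")

PRODUCT: Objects, vehicles, foods, etc. (not services)
LOC: Non-GPE locations, mountain ranges, bodies of water
DATE: Absolute or relative dates or periods
ORDINAL: "first", "second", etc.
MONEY: Monetary values, including unit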

In [11]:
displacy.render(doc1, style="ent", jupyter=True)

## NER of a News Article

In [12]:
from bs4 import BeautifulSoup
import requests
import re

In [13]:
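URL = "https://www.zeebiz.com/markets/currency/news-cryptocurrency-news-today-june-12-bitcoin-dogecoin-shiba-inu-and-other-top-coins-prices-and-all-latest-updates-158490"
# A timeout keeps the cell from hanging if the site does not respond
html_content = requests.get(URL, timeout=30).text
# The "lxml" parser requires the lxml package to be installed
soup = BeautifulSoup(html_content, "lxml")
# Text of the whole <body>, including menus and other page chrome
body = soup.body.text
body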

'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n \n\nहिंदी में पढ़ें \xa0\n\n\n\n\n\n\n\n\n\n\n\n\nLive TV\n Live TV\n\n\n\n\n\n\nHome\n\nPersonal Finance\n\nPPF\nMutual Funds\nIncome tax\nEPFO\n\nIncome Tax Calculator\n\n\nPersonal Loan Calculator\n\n\nCar Loan Calculator\n\n\nHome Loan Calculator\n\n\nSIP calculator\n\n\nSWP Calculator\n\n\nMF Returns Calculator \n\nLumpsum Calculator\n\n\nIndia\n\nCompanies\nProperty\nStartups\nUidai\n\n\nEconomy\n\nAviation\n\n\n\nTech\n\nMobiles\nApps\n\n\nAuto\n\nCars\nBikes\n\n\nMarkets\n\nCommodities\nCurrency\n\n\n\nRailways\n\nWorld\n\nEconomy\nPolitics\nMarkets\n\n\nSurvey\nvideos\nphotos\nZNAA\'22\n\nMore ...\n\nVIDEOS\nPHOTOS\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n BREAKING NEWS\n  \n\n   \nSEBI issues consultation paper on insider trading for mutual funds; seeks feedback to tighten rules \n\nTCS Q1 Results FY2023: PAT up 5.2% YoY at Rs 9,478; declares interim dividend of Rs 8 per share \n\nDebt funds schemes bleed in June, see net outflo
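
The raw body text above is dominated by navigation menus and headlines rather than the article itself. One way to reduce that noise is to keep only text inside `<p>` elements. The sketch below uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra packages; `ParagraphCollector` and `sample_html` are made up for illustration, and whether the target page actually keeps its article text in `<p>` tags is an assumption. With BeautifulSoup, `soup.find_all("p")` would be the analogous starting point.

```python
from html.parser import HTMLParser

class ParagraphCollector(HTMLParser):
    """Collect the text found inside <p> elements, ignoring everything else."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._chunks = []

    def _flush(self):
        # Join the collected pieces and collapse internal whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:  # previous <p> was never closed
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)

# Made-up HTML standing in for a downloaded page.
sample_html = (
    "<body><nav><a>Home</a><a>Markets</a></nav>"
    "<p>Bitcoin fell on <b>Sunday</b>.</p><p>Dogecoin rose.</p></body>"
)

collector = ParagraphCollector()
collector.feed(sample_html)
collector.close()
paragraphs = collector.paragraphs
print(paragraphs)
```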

In [14]:
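# Replace newlines, tabs, carriage returns and non-breaking spaces with plain spaces
body = body.replace('\n', ' ')
body = body.replace('\t', ' ')
body = body.replace('\r', ' ')
body = body.replace('\xa0', ' ')
# Collapse runs of spaces into a single space
body = re.sub(r' +', ' ', body)
# Optional: strip punctuation (everything that is not a word character or whitespace)
#body = re.sub(r'[^\w\s]', '', body)
#body[1000:1500]
body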

' हिंदी में पढ़ें Live TV Live TV Home Personal Finance PPF Mutual Funds Income tax EPFO Income Tax Calculator Personal Loan Calculator Car Loan Calculator Home Loan Calculator SIP calculator SWP Calculator MF Returns Calculator Lumpsum Calculator India Companies Property Startups Uidai Economy Aviation Tech Mobiles Apps Auto Cars Bikes Markets Commodities Currency Railways World Economy Politics Markets Survey videos photos ZNAA\'22 More ... VIDEOS PHOTOS BREAKING NEWS SEBI issues consultation paper on insider trading for mutual funds; seeks feedback to tighten rules TCS Q1 Results FY2023: PAT up 5.2% YoY at Rs 9,478; declares interim dividend of Rs 8 per share Debt funds schemes bleed in June, see net outflows of Rs 92,247 cr; Overnight Fund, worst performer NSE Co-location case: CBI now registers phone tapping case against Chitra Ramakrishna | Details here Dalal Street Corner: Market ends 3% up this week as benchmarks extends rally for 3rd day; what should investors do on Monday? Rea
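
The chain of `replace` calls followed by `re.sub` can also be written as a single substitution, since `\s` in a Python `str` pattern matches newlines, tabs, carriage returns and the non-breaking space `\xa0`. This is a minimal sketch: `normalize_whitespace` is a hypothetical helper name, and unlike the cell above it also trims leading and trailing whitespace.

```python
import re

def normalize_whitespace(text):
    """Collapse every run of whitespace to one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()

# Made-up sample imitating the raw page text.
sample = "\n\n Live TV\n\tHome\xa0\r\n Markets  "
cleaned = normalize_whitespace(sample)
print(repr(cleaned))
```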

In [15]:
# nlp() returns a Doc, so name the variable accordingly
news_doc = nlp(body)
displacy.render(news_doc, style="ent", jupyter=True)

## Another Example Containing PII

In [16]:
raw_text3 = """
Dear Jason,

My name is David Johnson and I live in Maine and my email address is david@email.com.
My credit card number is 4095-2609-9393-4932 and 
my Bitcoin wallet id is 16Yeky6GMjeNkAiNcBY7ZhrLoMSgg1BoyZ and 
my Ethereum wallet is 0xF53E6adb81A661a2Edb3c5E4F39DACb641eC0Fc8.

On September 18, I visited microsoft.com and sent an email to test@presidio.site, from the IP 192.168.0.1.

My passport: 191280342 and my phone number: (212) 555-1234.
My national identification number: 880909-2538110

This is a valid International Bank Account Number: IL150120690000003111111 . 
Can you please check the status on bank account 954567876544?

Kate's social security number is 078-05-1126.  Her driver license is 1234567A.
"""

# nlp() returns a Doc, so name the variable accordingly
pii_doc = nlp(raw_text3)
displacy.render(pii_doc, style="ent", jupyter=True)
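
The label set of `en_core_web_sm` covers general categories such as PERSON, GPE and DATE and has no dedicated labels for email addresses, card numbers or IP addresses, so structured identifiers like these are often handled with pattern matching alongside NER. The sketch below is a deliberately simple, regex-only illustration. `PII_PATTERNS` and `find_pii` are hypothetical names, the patterns only cover the formats that appear in the sample letter, and they do no validation (no checksum for card numbers, no range check for IP octets).

```python
import re

# Assumption: simplified patterns matching only the formats in the sample text.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "CREDIT_CARD": re.compile(r"\b\d{4}-\d{4}-\d{4}-\d{4}\b"),
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def find_pii(text):
    """Return (value, label) pairs for every pattern match, in text order."""
    hits = []
    for label, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), match.group(), label))
    return [(value, label) for _, value, label in sorted(hits)]

sample = (
    "Mail david@email.com from 192.168.0.1, "
    "card 4095-2609-9393-4932, SSN 078-05-1126."
)
found = find_pii(sample)
for value, label in found:
    print(value, label)
```

Calling `find_pii(raw_text3)` on the letter above would apply the same patterns to the full text.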
