In [1]:
# Ref: 
# https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

In [None]:
# importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import sklearn
import numpy as np  
import re  # regular expressions like '+', '*'
import nltk  # The Natural Language Toolkit
from sklearn.datasets import load_files  
# nltk.download('popular')
# downloads stopwords, punkt, etc.; keep the default download directory, otherwise later lookups fail
import pickle  
from nltk.corpus import stopwords
from nltk import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet

In [37]:
txt = """
Donald John Trump (born June 14, 1946) is the 45th and current president of the United States. Before entering politics, he was a businessman and television personality.

Trump was born and raised in the New York City borough of Queens, and received a B.S. degree in economics from the Wharton School at the University of Pennsylvania. He took charge of his family's real-estate business in 1971, renamed it The Trump Organization, and expanded its operations from Queens and Brooklyn into Manhattan. The company built or renovated skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, mostly by licensing his name. He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015, and produced and hosted The Apprentice, a reality television show, from 2003 to 2015. Forbes estimates his net worth to be $3.1 billion.[a]

Trump entered the 2016 presidential race as a Republican and defeated 16 other candidates in the primaries. His political positions have been described as populist, protectionist, and nationalist. He was elected in a surprise victory over Democratic nominee Hillary Clinton, although he lost the popular vote.[b] He became the oldest first-term U.S. president,[c] and the first one without prior military or government service. His election and policies have sparked numerous protests. Trump has made many false or misleading statements during his campaign and presidency. The statements have been documented by fact-checkers, and the media have widely described the phenomenon as unprecedented in American politics. Many of his comments and actions have also been characterized as racially charged or racist.

During his presidency, Trump ordered a travel ban on citizens from several Muslim-majority countries, citing security concerns; after legal challenges, the Supreme Court upheld the policy's third revision. He enacted a tax-cut package for individuals and businesses, rescinding the individual health insurance mandate. He appointed Neil Gorsuch and Brett Kavanaugh to the Supreme Court. In foreign policy, Trump has pursued an America First agenda, withdrawing the U.S. from the Trans-Pacific Partnership trade negotiations, the Paris Agreement on climate change, and the Iran nuclear deal. He recognized Jerusalem as the capital of Israel, imposed import tariffs triggering a trade war with China, and started negotiations with North Korea toward their denuclearization.
"""

################################################################
#################### Data Cleaning #######################
################################################################

In [38]:
# visual inspection
print(txt)


Donald John Trump (born June 14, 1946) is the 45th and current president of the United States. Before entering politics, he was a businessman and television personality.

Trump was born and raised in the New York City borough of Queens, and received a B.S. degree in economics from the Wharton School at the University of Pennsylvania. He took charge of his family's real-estate business in 1971, renamed it The Trump Organization, and expanded its operations from Queens and Brooklyn into Manhattan. The company built or renovated skyscrapers, hotels, casinos, and golf courses. Trump later started various side ventures, mostly by licensing his name. He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015, and produced and hosted The Apprentice, a reality television show, from 2003 to 2015. Forbes estimates his net worth to be $3.1 billion.[a]

Trump entered the 2016 presidential race as a Republican and defeated 16 other candidates in the primaries. His political position

In [39]:
# remove special chars: . , ( )
# caveat: stripping every '.' also rewrites decimals and abbreviations,
# e.g. '$3.1 billion' -> '$31 billion', 'B.S.' -> 'BS', 'U.S.' -> 'US'
txt = [re.sub(r'[.,()]', '', x) for x in txt.split()]
txt = " ".join(txt)

# remove footnote markers like '[a]'
txt = txt.replace("[", " ")  # splits 'president[c]' into 'president' and 'c]'
txt = [x for x in txt.split() if not x.endswith("]")]  # drops the leftover 'a]', 'b]', 'c]' tokens
txt = " ".join(txt)
txt

"Donald John Trump born June 14 1946 is the 45th and current president of the United States Before entering politics he was a businessman and television personality Trump was born and raised in the New York City borough of Queens and received a BS degree in economics from the Wharton School at the University of Pennsylvania He took charge of his family's real-estate business in 1971 renamed it The Trump Organization and expanded its operations from Queens and Brooklyn into Manhattan The company built or renovated skyscrapers hotels casinos and golf courses Trump later started various side ventures mostly by licensing his name He owned the Miss Universe and Miss USA beauty pageants from 1996 to 2015 and produced and hosted The Apprentice a reality television show from 2003 to 2015 Forbes estimates his net worth to be $31 billion Trump entered the 2016 presidential race as a Republican and defeated 16 other candidates in the primaries His political positions have been described as populi

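The cleaning cell above strips every period, which is why `$3.1 billion` appears as `$31 billion` in its output. A minimal alternative sketch, assuming it is acceptable to keep a period only when it sits between two digits. The function name `clean_text` and the sample sentence are illustrative and not part of the original notebook:

```python
import re

def clean_text(raw: str) -> str:
    """Strip footnote markers and punctuation, but keep decimal points."""
    # drop bracketed footnote markers such as '[a]'
    no_notes = re.sub(r"\[[^\]]*\]", " ", raw)
    # remove commas and parentheses outright
    no_punct = re.sub(r"[,()]", "", no_notes)
    # remove a period unless it has a digit on both sides (e.g. '3.1')
    no_punct = re.sub(r"(?<!\d)\.|\.(?!\d)", "", no_punct)
    # collapse runs of whitespace into single spaces
    return " ".join(no_punct.split())

sample = "Forbes estimates his net worth to be $3.1 billion.[a] He was born (June 14, 1946)."
cleaned = clean_text(sample)
print(cleaned)
```

Abbreviations such as `B.S.` still lose their periods here, exactly as in the original cell; only decimals are protected.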
################################################################
############ Named Entity Recognition (Spacy) ##############
################################################################

In [40]:
# !pip install -U spacy
# !pip install 'spacy>=2.2.1'
# !pip install spacy && python -m spacy download en_core_web_sm
# the notebook may need a restart after installation

In [41]:
import spacy
print(spacy.__version__)
# should be >= 2.2.1 as per https://github.com/explosion/spaCy/issues/4372

2.2.2


In [42]:
from spacy import displacy
from collections import Counter
import en_core_web_sm
nlp = en_core_web_sm.load()

In [43]:
# Predict NER
from pprint import pprint
doc = nlp(txt)
pprint([(X.text, X.label_) for X in doc.ents])

[('Donald John Trump', 'PERSON'),
 ('June 14 1946', 'DATE'),
 ('45th', 'ORDINAL'),
 ('the United States', 'GPE'),
 ('Trump', 'PERSON'),
 ('New York City', 'GPE'),
 ('Queens', 'GPE'),
 ('the Wharton School', 'ORG'),
 ('the University of Pennsylvania', 'ORG'),
 ('1971', 'DATE'),
 ('The Trump Organization', 'ORG'),
 ('Queens', 'GPE'),
 ('Brooklyn', 'GPE'),
 ('Manhattan', 'GPE'),
 ('Trump', 'PERSON'),
 ('Miss USA', 'ORG'),
 ('1996', 'DATE'),
 ('2015', 'DATE'),
 ('The Apprentice', 'ORG'),
 ('2003', 'DATE'),
 ('2015', 'CARDINAL'),
 ('Forbes', 'PERSON'),
 ('$31 billion', 'MONEY'),
 ('Trump', 'PERSON'),
 ('2016', 'DATE'),
 ('Republican', 'NORP'),
 ('16', 'CARDINAL'),
 ('Democratic', 'NORP'),
 ('Hillary Clinton', 'PERSON'),
 ('first', 'ORDINAL'),
 ('US', 'GPE'),
 ('first', 'ORDINAL'),
 ('Trump', 'PRODUCT'),
 ('American', 'NORP'),
 ('Trump', 'PRODUCT'),
 ('Muslim', 'NORP'),
 ('the Supreme Court', 'ORG'),
 ('third', 'ORDINAL'),
 ('Neil Gorsuch', 'PERSON'),
 ('Brett Kavanaugh', 'PERSON'),
 ('the S
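`Counter` is imported above but not used in the cells shown. One way it could be applied is to tally the predicted labels. The sketch below is self-contained, so it hard-codes the first ten `(text, label)` pairs from the output above; in the notebook the list would instead come from `doc.ents`:

```python
from collections import Counter

# (text, label) pairs copied from the first rows of the output above;
# in the notebook this would be [(X.text, X.label_) for X in doc.ents]
entities = [
    ("Donald John Trump", "PERSON"),
    ("June 14 1946", "DATE"),
    ("45th", "ORDINAL"),
    ("the United States", "GPE"),
    ("Trump", "PERSON"),
    ("New York City", "GPE"),
    ("Queens", "GPE"),
    ("the Wharton School", "ORG"),
    ("the University of Pennsylvania", "ORG"),
    ("1971", "DATE"),
]

# how often each label occurs
label_counts = Counter(label for _, label in entities)

# how often each surface form occurs within a single label
gpe_counts = Counter(text for text, label in entities if label == "GPE")

print(label_counts.most_common(3))
print(gpe_counts)
```

Counting surface forms per label is also a quick way to spot inconsistent predictions, such as `Trump` being tagged `PERSON` in some sentences and `PRODUCT` in others.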

In [53]:
# Visualize in text
displacy.render(doc, style="ent", jupyter=True)

In [19]:
# 'unique entities' - view: distinct entity labels, sorted alphabetically
temp = [X.label_ for X in doc.ents]
entity_unique = sorted(set(temp))
entity_unique

['CARDINAL',
 'DATE',
 'GPE',
 'LAW',
 'LOC',
 'MONEY',
 'NORP',
 'ORDINAL',
 'ORG',
 'PERSON',
 'PRODUCT']

In [21]:
############## 'unique entities' (Spacy) - definition ##############

""" 
PERSON:      People, including fictional.
NORP:        Nationalities or religious or political groups.
FAC:         Buildings, airports, highways, bridges, etc.
ORG:         Companies, agencies, institutions, etc.
GPE:         Countries, cities, states.
LOC:         Non-GPE locations, mountain ranges, bodies of water.
PRODUCT:     Objects, vehicles, foods, etc. (Not services.)
EVENT:       Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART: Titles of books, songs, etc.
LAW:         Named documents made into laws.
LANGUAGE:    Any named language.
DATE:        Absolute or relative dates or periods.
TIME:        Times smaller than a day.
PERCENT:     Percentage, including “%”.
MONEY:       Monetary values, including unit.
QUANTITY:    Measurements, as of weight or distance.
ORDINAL:     “first”, “second”, etc.
CARDINAL:    Numerals that do not fall under another type.
"""

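The definitions list 18 labels, while the 'unique entities' cell found only 11 in this text. A small sketch of how the unused labels could be identified with a set difference. Both sets are typed in by hand from the cells above, and the names `seen_labels`, `all_labels` and `unused_labels` are illustrative:

```python
# labels found in this document (from the 'unique entities' cell)
seen_labels = {"CARDINAL", "DATE", "LAW", "PERSON", "NORP", "LOC",
               "MONEY", "GPE", "PRODUCT", "ORDINAL", "ORG"}

# every label listed in the definitions above
all_labels = {"PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
              "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT",
              "MONEY", "QUANTITY", "ORDINAL", "CARDINAL"}

# labels the model knows about but did not predict for this text
unused_labels = sorted(all_labels - seen_labels)
print(unused_labels)
```

Inside the notebook, `spacy.explain("GPE")` should return the same kind of one-line description for a given label, which avoids maintaining the list by hand.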

In [None]:
# Note: nearly every entity is a noun; NER tells us which type of noun it is (PERSON, ORG, GPE, ...)

################################################################
############ Named Entity Recognition (Spacy) ##############
############ Re-train with new entities ##############
################################################################

In [None]:
# ref:
# https://towardsdatascience.com/custom-named-entity-recognition-using-spacy-7140ebbb3718

In [None]:
# Note: an NER model can also be trained using BERT
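The referenced article covers re-training itself. As a rough sketch of the data preparation only, assuming the spaCy 2.x training format of `(text, {"entities": [(start, end, label)]})` tuples with character offsets. The helper `make_example` and the `TV_SHOW` label are hypothetical, chosen because the model above tagged 'The Apprentice' as `ORG`:

```python
from typing import Dict, List, Tuple

Annotation = Tuple[str, Dict[str, List[Tuple[int, int, str]]]]

def make_example(text: str, phrase: str, label: str) -> Annotation:
    """Build one training tuple by locating the first `phrase` inside `text`."""
    start = text.find(phrase)
    if start == -1:
        raise ValueError(f"{phrase!r} not found in text")
    end = start + len(phrase)  # end offset is exclusive
    return (text, {"entities": [(start, end, label)]})

sentence = "He produced and hosted The Apprentice from 2003 to 2015"
example = make_example(sentence, "The Apprentice", "TV_SHOW")
print(example)
```

Computing offsets from the text rather than typing them by hand avoids off-by-one spans, which are a common source of misaligned annotations.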

################################################################
############ Named Entity Recognition (Spacy) ##############
############ Working, More Info ##############
################################################################

In [None]:
# Use cases: extracting entities from resumes, improving search algorithms, improving recommendation engines
# https://towardsdatascience.com/a-review-of-named-entity-recognition-ner-using-automatic-summarization-of-resumes-5248a75de175


In [None]:
# How it works (neural network):
    # https://www.youtube.com/watch?v=sqDHBH9IjRU
    # https://stackoverflow.com/questions/44492430/how-does-spacy-use-word-embeddings-for-named-entity-recognition-ner
    