# Natural Language Processing (NLP) with spaCy

spaCy is a Python library for natural language processing, designed for building production-ready systems. It is fast and has an intuitive API, which makes it a strong choice for many NLP tasks.

https://spacy.io/


spaCy and its English language model must be installed before running this notebook. The commands below install both. Either run them in PowerShell or a terminal (without the leading `!`), or run them directly from the Jupyter Notebook by uncommenting the next two lines.

In [1]:
# Run this only the first time to install spaCy and download the English model
# !pip install -U spacy
# !python -m spacy download en_core_web_sm

## Importing libraries

In [2]:
import spacy
import pandas as pd

# Load the trained English pipeline
nlp = spacy.load("en_core_web_sm")

# Alternative: import the model package directly (equivalent to the line above)
# import en_core_web_sm
# nlp = en_core_web_sm.load()

We will use the following sample text as input.

In [3]:
# Adjacent string literals are concatenated, so the first one needs a trailing space
text = ("Friends, Romans, countrymen, lend me your ears. I come to bury Julius Caesar, not to praise him. "
        "The evil that men do lives after them. The good is oft interred with their bones.")

### Sentence splitting
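Split the sample text into separate sentences. `doc.sents` yields the sentence boundaries that the loaded pipeline assigned; with `en_core_web_sm` these come from the trained pipeline's analysis rather than from simply splitting on periods.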

In [4]:
doc = nlp(text)
for sent in doc.sents:
    print(sent)

Friends, Romans, countrymen, lend me your ears.
I come to bury Julius Caesar, not to praise him.
The evil that men do lives after them.
The good is oft interred with their bones.
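
For comparison, a purely punctuation-based splitter can be sketched with spaCy's rule-based `sentencizer` component on a blank English pipeline, which needs no model download. This is only a sketch: it assumes spaCy v3 (where `add_pipe` accepts the component name as a string), and the names `nlp_rules` and `sample` are illustrative, not part of the notebook above.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained components
nlp_rules = spacy.blank("en")
# Rule-based sentence splitter that breaks on sentence-final punctuation
# (assumes the spaCy v3 string-name form of add_pipe)
nlp_rules.add_pipe("sentencizer")

sample = ("Friends, Romans, countrymen, lend me your ears. "
          "I come to bury Julius Caesar, not to praise him.")
sentences = [sent.text for sent in nlp_rules(sample).sents]
for sentence in sentences:
    print(sentence)
```

A rule-based splitter like this is quick, but it can mis-split on abbreviations and other mid-sentence punctuation, which is one reason to prefer the boundaries set by a trained pipeline.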


### Tokenisation

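Splitting into words. The cell below splits `doc.text` on whitespace with `str.split()`, so punctuation stays attached to the neighbouring word. Iterating over `doc` instead yields spaCy's own tokens, in which punctuation marks are separate tokens.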

In [5]:
text = ("Friends, Romans, countrymen, lend me your ears.")
doc = nlp(text)
doc.text.split()

['Friends,', 'Romans,', 'countrymen,', 'lend', 'me', 'your', 'ears.']
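
To see the difference between a whitespace split and spaCy's tokenizer, the two can be compared side by side. This sketch uses a blank English pipeline (tokenizer only, so no model download is assumed); the variable names are illustrative.

```python
import spacy

# Blank English pipeline: only the tokenizer is needed here
tokenizer_nlp = spacy.blank("en")
doc_tokens = tokenizer_nlp("Friends, Romans, countrymen, lend me your ears.")

# Plain whitespace split: punctuation stays glued to the words
whitespace_split = doc_tokens.text.split()
# spaCy tokens: punctuation is split off into separate tokens
spacy_tokens = [token.text for token in doc_tokens]

print(whitespace_split)
print(spacy_tokens)
```

Each element of `doc_tokens` is a `Token` object, so attributes such as `token.is_punct` are available for filtering, which a list of plain strings does not offer.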

### Lemmatisation

In [6]:
doc = nlp("The dogs saw bats with best stripes hanging upside down by their feet")

for token in doc:
    print(token.text + " --> " + token.lemma_)

The --> the
dogs --> dog
saw --> see
bats --> bat
with --> with
best --> good
stripes --> stripe
hanging --> hang
upside --> upside
down --> down
by --> by
their --> their
feet --> foot


### Stop Words

In [7]:
stopwords = nlp.Defaults.stop_words
print(f"spaCy has a total of {len(stopwords)} stop words.")
print(stopwords)

spaCy has a total of 326 stop words.
{'then', 'by', 'always', 'an', 'who', 'besides', 'fifteen', '‘re', 'moreover', 'thru', 'off', 'their', 'we', 'any', 'except', 'the', 'against', 'into', 'about', 'hence', 'nevertheless', 'another', 'down', 'often', 'elsewhere', 'will', 'seemed', 'and', 'on', 'it', '’re', 'beside', 'are', 'seems', "'re", 'beforehand', 'say', 'top', 'here', "'s", 'not', 'both', 'wherever', 'forty', 'either', 'was', 'being', 'themselves', 'further', 'am', 'ourselves', "'ve", 'per', 'been', 'did', 'since', 'whereas', 'beyond', 'around', 'whose', 'whatever', 'yourselves', 'become', 'that', 'during', 'thence', 'towards', 'whoever', 'within', 'seem', 'whereby', 'because', 'were', 'rather', 'none', 'nowhere', 'mine', 'should', 'well', 'other', 'together', 'out', 'something', 'under', 'almost', 'give', 'hundred', 'sometimes', 'ten', 'can', 'only', 'noone', '’m', 'he', 'just', 'name', 'why', 'take', 'hereafter', 'regarding', 'even', 'again', 'else', 'although', "'d", 'next',

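Remove stop words from a sample text. Note that `not` is itself a stop word, so removing stop words from this sentence reverses its meaning.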

In [8]:
text = "This is not a good time to talk"
doc = nlp(text)

cleanedtext = []
for item in doc:
    if not item.is_stop:
        cleanedtext.append(item.text)
print(' '.join(cleanedtext))

good time talk


Check whether each token is a stop word.

In [9]:
for token in doc:
    # is_stop is a bool; with a numeric width it prints as
    # 1 (stop word) or 0 (not a stop word)
    print(f'{token.text:{10}} {token.is_stop:{4}}')

This          1
is            1
not           1
a             1
good          0
time          0
to            1
talk          0


Adding and removing stop words

Uncomment the following lines to add or remove one or more stop words.

In [10]:
# Adding a single stop word
# nlp.Defaults.stop_words.add("perfect")

# Adding multiple stop words
# nlp.Defaults.stop_words |= {"hot", "cold"}

# Removing a single stop word
# nlp.Defaults.stop_words.remove("what")

# Removing multiple stop words
# nlp.Defaults.stop_words -= {"who", "when"}
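
The lines above modify the default stop-word set in place. One alternative, sketched here, is to build a separate copy and filter against that, leaving the defaults untouched. The helper `remove_stop_words` and the name `custom_stop_words` are hypothetical, introduced only for illustration; the sketch assumes the English stop words are importable as `STOP_WORDS` from `spacy.lang.en.stop_words`.

```python
from spacy.lang.en.stop_words import STOP_WORDS

# Work on a copy so spaCy's default set is left unchanged:
# add "hot" and "cold", and keep "not" because it carries meaning
custom_stop_words = (set(STOP_WORDS) | {"hot", "cold"}) - {"not"}

def remove_stop_words(words, stop_words):
    """Keep only the words whose lowercase form is not in stop_words."""
    return [w for w in words if w.lower() not in stop_words]

words = "This is not a good time to talk".split()
filtered = remove_stop_words(words, custom_stop_words)
print(" ".join(filtered))
```

Filtering against a copy keeps the change local to one task, so other code that relies on the default list is unaffected.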

### POS Tagging

In [11]:
doc = nlp("Charles M.H.P. Leclerc wins Monaco F1 GP for Ferrari to delight of home crowd.")
# Collect the main per-token attributes into one row per token
pos = [
    (token.text, token.lemma_, token.pos_, token.tag_, spacy.explain(token.tag_),
     token.dep_, token.shape_, token.is_alpha, token.is_stop)
    for token in doc
]
df = pd.DataFrame(pos, columns=['Text', 'Lemma', 'POS', 'TAG', 'Explain', 'DEP', 'Shape', 'Alpha', 'Stop'])
df

| | Text | Lemma | POS | TAG | Explain | DEP | Shape | Alpha | Stop |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Charles | Charles | PROPN | NNP | noun, proper singular | compound | Xxxxx | True | False |
| 1 | M.H.P. | M.H.P. | PROPN | NNP | noun, proper singular | compound | X.X.X. | False | False |
| 2 | Leclerc | Leclerc | PROPN | NNP | noun, proper singular | nsubj | Xxxxx | True | False |
| 3 | wins | win | VERB | VBZ | verb, 3rd person singular present | ROOT | xxxx | True | False |
| 4 | Monaco | Monaco | PROPN | NNP | noun, proper singular | compound | Xxxxx | True | False |
| 5 | F1 | F1 | PROPN | NNP | noun, proper singular | compound | Xd | False | False |
| 6 | GP | GP | PROPN | NNP | noun, proper singular | dobj | XX | True | False |
| 7 | for | for | SCONJ | IN | conjunction, subordinating or preposition | mark | xxx | True | True |
| 8 | Ferrari | Ferrari | PROPN | NNP | noun, proper singular | nsubj | Xxxxx | True | False |
| 9 | to | to | PART | TO | infinitival "to" | aux | xx | True | True |


In [12]:
from spacy import displacy

doc = nlp("Charles M.H.P. Leclerc wins Monaco F1 GP for Ferrari to delight of home crowd.")
# doc = nlp("Emma plays tennis for the U.K.")
displacy.render(doc, style="dep", jupyter=True)


### Named Entity Recognition

In [13]:
doc = nlp("Google, headquartered in Mountain View (1600 Amphitheatre Pkwy, Mountain View, CA 940430), \
          unveiled the new Android phone for $999 at the Consumer Electronic Show. \
          Sundar Pichai said in his keynote that users love their new Android phones.")

# One row per recognised entity: text, character offsets, and label
ents = [(ent.text, ent.start_char, ent.end_char, ent.label, ent.label_, spacy.explain(ent.label_)) for ent in doc.ents]
df = pd.DataFrame(ents, columns=['Text', 'Start', 'End', 'Label index', 'Label', 'Explain'])
df

| | Text | Start | End | Label index | Label | Explain |
|---|---|---|---|---|---|---|
| 0 | Google | 0 | 6 | 383 | ORG | Companies, agencies, institutions, etc. |
| 1 | Mountain View | 25 | 38 | 384 | GPE | Countries, cities, states |
| 2 | 1600 | 40 | 44 | 397 | CARDINAL | Numerals that do not fall under another type |
| 3 | Mountain View | 64 | 77 | 384 | GPE | Countries, cities, states |
| 4 | CA 940430 | 79 | 88 | 383 | ORG | Companies, agencies, institutions, etc. |
| 5 | Android | 118 | 125 | 383 | ORG | Companies, agencies, institutions, etc. |
| 6 | 999 | 137 | 140 | 394 | MONEY | Monetary values, including unit |
| 7 | the Consumer Electronic Show | 144 | 172 | 383 | ORG | Companies, agencies, institutions, etc. |
| 8 | Android | 244 | 251 | 383 | ORG | Companies, agencies, institutions, etc. |


In [14]:
displacy.render(doc, style="ent", jupyter=True)