<a href="https://colab.research.google.com/github/anujsaxena/Python/blob/main/nlp_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Text analysis with spaCy**

spaCy is an open-source natural language processing library for Python. It is designed particularly for production use, and it can help us to build applications that process massive volumes of text efficiently. 

# **Tokenization**
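Tokenization is the process of breaking text into pieces called tokens, such as words, contractions, and punctuation marks. spaCy's tokenizer takes Unicode text as input and outputs a sequence of token objects. Punctuation marks and line breaks are kept as tokens of their own, as the output of the first tokenization cell below shows, so they can be filtered out later if needed.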

In [1]:
!pip install spacy



In [2]:
!python -m spacy download en

Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 8.1 MB/s 
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [3]:
from spacy.lang.en import English
# Create a blank English pipeline: it contains only the tokenizer
# (no tagger, parser, NER or word vectors)
nlp = English()
text = """When learning data science, you shouldn't get discouraged!
Challenges and setbacks aren't failures, they're just part of the journey. You've got this!"""
# Calling the "nlp" object on a string returns a document of tokens
my_doc = nlp(text)
# Create list of word tokens
token_list = [token.text for token in my_doc]
print(token_list)


['When', 'learning', 'data', 'science', ',', 'you', 'should', "n't", 'get', 'discouraged', '!', '\n', 'Challenges', 'and', 'setbacks', 'are', "n't", 'failures', ',', 'they', "'re", 'just', 'part', 'of', 'the', 'journey', '.', 'You', "'ve", 'got', 'this', '!']
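
For comparison, here is a minimal sketch of a naive regex tokenizer that uses only the standard library. It is not spaCy's algorithm, and the name `naive_tokenize` is made up for this illustration. It keeps punctuation as separate tokens, like the output above, but it leaves contractions such as "shouldn't" in one piece, whereas spaCy's English rules split them into 'should' and "n't".

```python
import re

# Hypothetical helper for illustration only; this is NOT how spaCy tokenizes.
# A token is either a run of word characters (optionally followed by
# apostrophe + letters, e.g. "shouldn't") or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)*|[^\w\s]")

def naive_tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    return TOKEN_PATTERN.findall(text)

sentence = "When learning data science, you shouldn't get discouraged!"
naive_tokens = naive_tokenize(sentence)
print(naive_tokens)
```

The difference matters for later steps: because spaCy separates "n't" from "should", each part can be tagged, lemmatized, or filtered on its own.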


In [4]:
# spaCy's built-in set of English stop words (imported for reference; not used below)
from spacy.lang.en.stop_words import STOP_WORDS
# Calling the "nlp" object on a string returns a document of tokens
doc = nlp(text)
# Keep only the tokens that are not stop words
filtered_sent = [word for word in doc if not word.is_stop]
print("Filtered Sentence:", filtered_sent)


Filtered Sentence: [learning, data, science, ,, discouraged, !, 
, Challenges, setbacks, failures, ,, journey, ., got, !]
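
The filtered list above still contains punctuation and a line break, because those are tokens too. The sketch below shows one way to drop them as well. It works on plain strings copied from that output and uses only the standard library; the helper name `keep_words` is made up for this illustration. Inside a spaCy pipeline, the same effect could be had by also checking the `token.is_punct` and `token.is_space` attributes.

```python
# Token texts copied from the "Filtered Sentence" output above.
filtered = ["learning", "data", "science", ",", "discouraged", "!", "\n",
            "Challenges", "setbacks", "failures", ",", "journey", ".", "got", "!"]

def keep_words(tokens):
    """Keep only purely alphabetic tokens (drops punctuation and whitespace)."""
    return [tok for tok in tokens if tok.isalpha()]

words = keep_words(filtered)
print(words)
```

Note that `str.isalpha()` would also drop tokens such as "2007" or "self-driving", so this is only a rough filter.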


In [14]:
# Implementing lemmatization
import spacy
sp = spacy.load('en_core_web_sm')
lem = sp(u'''They were walking in the garden when the earthquake hit the town. 
I had gone to the station. 
She is standing in the queue. 
It is raining outside''')
# finding lemma for each word
for word in lem:
    print(word.text, word.lemma_)

They -PRON-
were be
walking walk
in in
the the
garden garden
when when
the the
earthquake earthquake
hit hit
the the
town town
. .

 

I -PRON-
had have
gone go
to to
the the
station station
. .

 

She -PRON-
is be
standing stand
in in
the the
queue queue
. .

 

It -PRON-
is be
raining rain
outside outside


In [15]:
# Implementing lemmatization
lem = sp(u"computer computer computed computing computes")
# finding lemma for each word
for word in lem:
    print(word.text, word.lemma_)


computer computer
computer computer
computed compute
computing computing
computes compute


In [16]:
import spacy
nlp = spacy.load('en_core_web_sm')
# Process whole documents
text = ('''They were walking in the garden when the earthquake hit the town. 
I had gone to the station. 
She is standing in the queue. 
It is raining outside''')

doc = nlp(text)

# Extract the noun phrases (noun chunks) from the parsed document
print("Noun phrases:", [chunk.text for chunk in doc.noun_chunks])

Noun phrase:  ['They', 'the garden', 'the earthquake', 'the town', 'I', 'the station', 'She', 'the queue', 'It']


In [19]:
# Process whole documents
text = ("When Sebastian Thrun started working on self-driving cars at "
        "Google in 2007, few people outside of the company took him "
        "seriously. “I can tell you very senior CEOs of major American "
        "car companies would shake my hand and turn away because I wasn’t "
        "worth talking to,” said Thrun, in an interview with Recode earlier "
        "this week.")
doc = nlp(text)
print("Verbs:", [token.lemma_ for token in doc if token.pos_ == "VERB"])

Verbs: ['start', 'work', 'drive', 'take', 'can', 'tell', 'would', 'shake', 'turn', 'talk', 'say']


In [20]:
# Find named entities, phrases and concepts
for entity in doc.ents:
    print(entity.text, entity.label_)


Sebastian NORP
Google ORG
2007 DATE
American NORP
Recode ORG
earlier this week DATE


In [22]:
# Part-of-speech (POS) tagging

doc = nlp("All look yellow to the jaundiced eyes ")

for word in doc:
    print(word.text, word.pos_)

All DET
look VERB
yellow ADJ
to ADP
the DET
jaundiced VERB
eyes NOUN


# **Entity detection**

Entity detection, also called named entity recognition (NER), is a more advanced form of language processing that identifies important elements such as places, people, organizations, and languages within an input text. It is useful for extracting information, since you can quickly pick out important topics or identify key sections of a text.


In [24]:
import en_core_web_sm
# Load the small English model (includes the tagger, parser and NER)
nlp = en_core_web_sm.load()
nytimes= nlp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, 
becoming the latest national flash point over refusals to inoculate against dangerous diseases.
At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. 
The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.
The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, 
to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")

# Each entity is a span of tokens: label_ is the label name, label is its integer ID
entities = [(ent, ent.label_, ent.label) for ent in nytimes.ents]
print(entities)

[(New York City, 'GPE', 384), (Tuesday, 'DATE', 391), (At least 285, 'CARDINAL', 397), (September, 'DATE', 391), (Brooklyn, 'GPE', 384), (Williamsburg, 'GPE', 384), (four, 'CARDINAL', 397), (Bill de Blasio, 'PERSON', 380), (Tuesday, 'DATE', 391), (Orthodox, 'NORP', 381), (Jews, 'NORP', 381), (as young as 6 months old, 'DATE', 391), (up to $1,000, 'MONEY', 394)]
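
Once entities are extracted, a common next step is to group them by label so that all places, dates, or people can be read at a glance. The sketch below does this with the standard library only. The (text, label) pairs are copied from the output above, and the helper name `group_by_label` is made up for this illustration.

```python
from collections import defaultdict

# (entity text, label) pairs copied from the output above.
entity_pairs = [
    ("New York City", "GPE"), ("Tuesday", "DATE"), ("At least 285", "CARDINAL"),
    ("September", "DATE"), ("Brooklyn", "GPE"), ("Williamsburg", "GPE"),
    ("four", "CARDINAL"), ("Bill de Blasio", "PERSON"), ("Tuesday", "DATE"),
    ("Orthodox", "NORP"), ("Jews", "NORP"),
    ("as young as 6 months old", "DATE"), ("up to $1,000", "MONEY"),
]

def group_by_label(pairs):
    """Map each label to its entity texts, in order of first appearance, without duplicates."""
    grouped = defaultdict(list)
    for text, label in pairs:
        if text not in grouped[label]:
            grouped[label].append(text)
    return dict(grouped)

grouped = group_by_label(entity_pairs)
for label, texts in grouped.items():
    print(label, texts)
```

With a live document, the same pairs could be built as `[(ent.text, ent.label_) for ent in nytimes.ents]`.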


# **Dependency Parsing**

Dependency parsing is a language processing technique that analyzes how a sentence is constructed. It determines how the individual words relate to each other (for example, which noun is the subject of which verb), which helps us better determine the meaning of the sentence.

In [43]:
import en_core_web_sm
from spacy import displacy
nlp = en_core_web_sm.load()

nytimes= nlp(u"""New York City on Tuesday declared a public health emergency and ordered mandatory measles vaccinations amid an outbreak, 
becoming the latest national flash point over refusals to inoculate against dangerous diseases.
At least 285 people have contracted measles in the city since September, mostly in Brooklyn’s Williamsburg neighborhood. 
The order covers four Zip codes there, Mayor Bill de Blasio (D) said Tuesday.
The mandate orders all unvaccinated people in the area, including a concentration of Orthodox Jews, 
to receive inoculations, including for children as young as 6 months old. Anyone who resists could be fined up to $1,000.""")

# For each noun chunk, print its text, its root word, the root's dependency label and the root's head
for chunk in nytimes.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_, chunk.root.head.text)
# Highlight the named entities in the text
displacy.render(nytimes, style="ent", jupyter=True)


New York City City nsubj declared
Tuesday Tuesday pobj on
a public health emergency emergency dobj declared
mandatory measles vaccinations vaccinations dobj ordered
an outbreak outbreak pobj amid
the latest national flash point point attr becoming
refusals refusals pobj over
dangerous diseases diseases pobj against
At least 285 people people nsubj contracted
measles measles dobj contracted
the city city pobj in
September September pobj since
Brooklyn’s Williamsburg neighborhood neighborhood pobj in
The order order nsubj covers
four Zip codes codes dobj covers
Mayor Bill de Blasio Blasio nsubj said
(D D appos Blasio
The mandate orders orders ROOT orders
all unvaccinated people people dobj orders
the area area pobj in
a concentration concentration pobj including
Orthodox Jews Jews pobj of
inoculations inoculations dobj receive
children children pobj for
Anyone Anyone nsubjpass fined
who who nsubj resists
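
The dependency labels above can be combined into simple subject-verb-object facts: a chunk labelled `nsubj` and a chunk labelled `dobj` that share the same head verb form one triple. The sketch below illustrates the idea with the standard library only, on a few rows copied from the output above. The helper name `subject_verb_object` is made up for this illustration, and matching heads by their text is a simplification that would break if the same verb appeared twice.

```python
from collections import defaultdict

# (chunk text, dependency label of the chunk root, head word),
# copied from selected rows of the output above.
chunk_rows = [
    ("New York City", "nsubj", "declared"),
    ("a public health emergency", "dobj", "declared"),
    ("mandatory measles vaccinations", "dobj", "ordered"),
    ("At least 285 people", "nsubj", "contracted"),
    ("measles", "dobj", "contracted"),
    ("The order", "nsubj", "covers"),
    ("four Zip codes", "dobj", "covers"),
]

def subject_verb_object(rows):
    """Pair the subject and direct object that attach to the same head verb."""
    roles_by_head = defaultdict(dict)
    for text, dep, head in rows:
        if dep in ("nsubj", "dobj"):
            # Keep the first chunk seen for each role of a given head.
            roles_by_head[head].setdefault(dep, text)
    return [
        (roles["nsubj"], head, roles["dobj"])
        for head, roles in roles_by_head.items()
        if "nsubj" in roles and "dobj" in roles
    ]

triples = subject_verb_object(chunk_rows)
for triple in triples:
    print(triple)
```

The verb "ordered" yields no triple here because its subject is shared with "declared" through the conjunction, which the noun-chunk output alone does not show.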


In [42]:
import spacy
from spacy import displacy

text = "When Sebastian Thrun started working on self-driving cars at Google in 2007, few people outside of the company took him seriously."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
displacy.render(doc, style="dep", jupyter=True)