# SPACY.
-  spaCy is an open-source NLP library designed to handle NLP tasks efficiently, using the most efficient implementations of common algorithms.
-  For many NLP tasks it offers only one implemented method, choosing the most efficient algorithm currently available.

# NLTK.
-  NLTK is the **Natural Language Toolkit**, a very popular open-source library. It provides a lot of functionality but includes less efficient implementations.


In [1]:
# 1. conda install -c conda-forge spacy
# 2. python -m spacy download en_core_web_sm

# NLP.
 - Natural Language Processing is an area of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.
 - To process text data, a computer needs specialized processing techniques in order to **understand** raw text.
 - Text data is highly unstructured and can be in multiple languages.

### Examples
- Classifying emails as spam vs. legitimate
- Sentiment analysis of movie review text
- Analysing trends from written customer feedback
- Understanding text commands, e.g. "Hey Google, play this song".

## 1. Spacy Basics.

In [2]:
import spacy

In [3]:
# 'en_core_web_sm' names the model: 'en_core' is the core English language model,
# 'web' means it was trained on web text, and 'sm' is the small version.
# This loads the model.
nlp = spacy.load('en_core_web_sm')

In [4]:
# Create a Doc object by applying our model to our text.
doc = nlp(u'Tesla in looking at buying U.S starup for $6 million')
# Using the loaded language model, spaCy parses this string into separate components for us: tokens.
# Each word (and each symbol such as '$') becomes a token.
# The 'u' prefix marks a Unicode string literal (optional in Python 3).

In [5]:
for token in doc:
    print(token.text)

Tesla
in
looking
at
buying
U.S
starup
for
$
6
million


In [6]:
for token in doc:
    print(token.text, token.pos)
# Each number below is the integer ID of the token's part-of-speech tag.

Tesla 84
in 85
looking 100
at 85
buying 100
U.S 96
starup 92
for 85
$ 99
6 93
million 93


In [7]:
# To get the part of speech as a readable string, use pos_ (with a trailing underscore).
for token in doc:
    print(token.text, token.pos_)

Tesla ADJ
in ADP
looking VERB
at ADP
buying VERB
U.S PROPN
starup NOUN
for ADP
$ SYM
6 NUM
million NUM


In [8]:
# For more info, we can use dep_, which gives the syntactic dependency.
for token in doc:
    print(token.text, token.pos_, token.dep_)

Tesla ADJ ROOT
in ADP prep
looking VERB pcomp
at ADP prep
buying VERB pcomp
U.S PROPN compound
starup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


In [9]:
nlp.pipeline
# The basic nlp pipeline has a tagger, a parser and ner, i.e. the named entity recognizer.

[('tagger', <spacy.pipeline.pipes.Tagger at 0x16ea5c322e8>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x16ea5c226a8>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x16ea5c22708>)]

In [10]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [11]:
doc2 = nlp(u"Tesla isn't   looking into startups anymore.")
# Even the extra whitespace becomes a token.

In [12]:
for token in doc2:
    print(token.text, token.pos_,token.dep_)

Tesla PROPN nsubj
is AUX aux
n't PART neg
   SPACE 
looking VERB ROOT
into ADP prep
startups NOUN pobj
anymore ADV advmod
. PUNCT punct


In [13]:
# To print individual tokens.
print(doc2[0])
print(doc2[2])

Tesla
n't


In [14]:
print(doc2[1].pos_)

AUX


In [15]:
# syntactic dependency
print(doc2[4].dep_)

ROOT


In [16]:
spacy.explain('PROPN')

'proper noun'

In [17]:
spacy.explain('nsubj')

'nominal subject'

In [18]:
# The word shape – capitalization, punctuation, digits	
doc2[1].shape_

'xx'

In [19]:
doc2[1].lemma_

'be'

In [20]:
doc2[0]

Tesla

In [21]:
# The base form of the word.
doc2[0].lemma_

'Tesla'

In [22]:
doc2[0].tag_

'NNP'

In [23]:
spacy.explain('NNP')

'noun, proper singular'

### Spans
 - A Span is a slice of a Doc object, defined by a start and a stop token index.
 - doc[start:stop] (start is inclusive, stop is exclusive)

In [24]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [25]:
# Span
life_quote = doc3[16:30]

In [26]:
print(life_quote)

"Life is what happens to us while we are making other plans"


In [27]:
type(life_quote)

spacy.tokens.span.Span

In [28]:
type(doc3)

spacy.tokens.doc.Doc

In [29]:
doc4 = nlp(u"This is the first sentence. This is second sentence. This is the last sentence")

In [30]:
for sentence in doc4.sents:
    print(sentence) 
# doc.sents yields each sentence of the Doc as a Span.

This is the first sentence.
This is second sentence.
This is the last sentence


In [31]:
doc4[6]

This

In [32]:
# To check whether this token starts a sentence.
doc4[6].is_sent_start

True

In [33]:
doc4[6:18]

This is second sentence. This is the last sentence

In [34]:
doc4[7]

is

In [35]:
doc4[7].is_sent_start
# It returns None (so nothing is displayed), since this token is not the start of a sentence.

## 2. Tokenization.
 - Tokenization is the process of breaking up the original text into component pieces (tokens).
 - Tokens are the building blocks of a Doc object, i.e. everything that helps us understand the meaning of the text is derived from tokens and their relationships to one another.
 - Prefix: character(s) at the beginning, e.g. $, @
 - Suffix: character(s) at the end, e.g. km, rs
 - Infix: character(s) in between, e.g. ...., -----
 - Exception: a special-case rule to split a string into several tokens, or to prevent a token from being split, when punctuation rules are applied, e.g. let's, U.S.
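
The prefix, suffix, infix and exception rules can be illustrated with a small plain-Python sketch. This is not spaCy's tokenizer: the rule sets and the function name `toy_tokenize` are hypothetical and far smaller than the real ones. It only shows the order in which such rules might be applied to each whitespace-separated chunk.

```python
import re

# Hypothetical, heavily simplified rule sets (assumptions for illustration only).
EXCEPTIONS = {"U.S.": ["U.S."], "Let's": ["Let", "'s"], "let's": ["let", "'s"]}
PREFIXES = re.compile(r'^[$("\']')
SUFFIXES = re.compile(r'((?<=\d)km|[.,!?)"\'])$')
INFIXES = re.compile(r'(?<=[A-Za-z])(-)(?=[A-Za-z])')

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        trailing = []  # suffixes stripped from the right, kept in reading order
        while chunk:
            if chunk in EXCEPTIONS:            # special cases win over punctuation rules
                tokens.extend(EXCEPTIONS[chunk])
                break
            prefix = PREFIXES.search(chunk)
            if prefix:                          # peel one prefix character
                tokens.append(prefix.group())
                chunk = chunk[prefix.end():]
                continue
            suffix = SUFFIXES.search(chunk)
            if suffix:                          # peel one suffix
                trailing.insert(0, suffix.group())
                chunk = chunk[:suffix.start()]
                continue
            # nothing left to peel: split on infixes and stop
            tokens.extend(part for part in INFIXES.split(chunk) if part)
            break
        tokens.extend(trailing)
    return tokens

print(toy_tokenize('"Let\'s drive 5km in the U.S. for $10.30!"'))
print(toy_tokenize("snail-mail,"))
```

Checking exceptions before the punctuation rules is what keeps `U.S.` in one piece while the final `!` and `"` are still split off.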

In [36]:
import spacy

In [37]:
# Load the small English model
nlp = spacy.load('en_core_web_sm')

In [38]:
my_string = '"we\'re moving to L.A !"'

In [39]:
print(my_string)

"we're moving to L.A !"


In [40]:
doc = nlp(my_string)

In [41]:
for token in doc:
    print(token.text)

"
we
're
moving
to
L.A
!
"


In [42]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

In [43]:
for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


In [44]:
doc3 = nlp(u"A 5km NYC cab ride costs $10.30")

In [45]:
for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


#### Exceptions
- Punctuation that exists as part of a known abbreviation is kept as part of the token.

In [46]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

In [47]:
for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


In [48]:
# To check the number of tokens in document.
len(doc4)

11

In [49]:
doc4.vocab

<spacy.vocab.Vocab at 0x16ea5f5dea0>

#### Counting the Vocab Entries.

In [50]:
len(doc4.vocab)

523

In [51]:
doc5 = nlp(u"It is better to give than receive")

In [52]:
doc5[0]

It

### Retrieving tokens by index position.

In [53]:
doc5[2:5]
# Tokens in a Doc can't be reassigned (a Doc does not support item assignment).

better to give

### Named Entities.
-  The language model recognizes that certain words are organization names, others are locations, and still other combinations relate to money, dates, etc.
- These named entities are accessible through the `ents` property of a Doc object.

In [54]:
doc8 = nlp(u"Apple to build a Hong Kong factory for $6 million")

In [55]:
for token in doc8:
    print(token.text, end=' | ')

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [56]:
# Using ents property
for entity in doc8.ents:
    print(entity)

Apple
Hong Kong
$6 million


In [57]:
for entity in doc8.ents:
    print(entity)
    print(entity.label_)
    print(str(spacy.explain(entity.label_)))
    print('\n')

Apple
ORG
Companies, agencies, institutions, etc.


Hong Kong
GPE
Countries, cities, states


$6 million
MONEY
Monetary values, including unit




### Noun Chunks.
- `noun_chunks` is another Doc property. Noun chunks are "base noun phrases", i.e. flat phrases that have a noun as their head.

In [58]:
doc9 = nlp(u"Autonomous cars shift insurance liability towards manufactures")

In [59]:
for chunk in doc9.noun_chunks:
    print(chunk)

Autonomous cars
insurance liability
manufactures


For more info on noun_chunks visit https://spacy.io/usage/linguistic-features#noun-chunks

## Token Visualization.

In [60]:
from spacy import displacy

In [61]:
doc = nlp(u"Apple is going to build a U.K. factory for $6 million")

In [62]:
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
# 'distance' inside the options dict sets the distance between tokens.
# jupyter is True since we're rendering inside a Jupyter notebook.
# style selects the visualization type; 'dep' draws the syntactic dependencies.

In [63]:
# Visualizing the entity recognizer.
doc = nlp(u"Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.")

In [64]:
displacy.render(doc, style='ent', jupyter=True)
# style is 'ent', i.e. entity.
# In this style, displaCy identifies and highlights every named entity.

## 3. Stemming.
- Stemming is a method for cataloging related words; it essentially chops off letters from the end of a word until the stem is reached.
-  It works fairly well in most cases, but unfortunately English has many exceptions where a more sophisticated process is required. spaCy doesn't include a stemmer at all and instead relies entirely on lemmatization.
- Hence we use NLTK to learn about two different stemmers, the Porter Stemmer and the Snowball Stemmer.

### Porter Stemmer.
- It's one of the most common and effective stemming tools. The algorithm employs five phases of word reduction, each with its own set of mapping rules.
- In the first phase, simple suffix-mapping rules reduce plural endings, for example SSES → SS and IES → I, and a final S is removed.
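
The first-phase suffix rules can be sketched in plain Python. The four rules below are step 1a of the published Porter algorithm. The function name `porter_step_1a` is hypothetical, and the later phases (which NLTK's `PorterStemmer` also applies) are left out.

```python
# Step 1a of the Porter algorithm: the first matching suffix rule wins.
STEP_1A_RULES = [
    ("sses", "ss"),  # caresses -> caress
    ("ies", "i"),    # ponies   -> poni
    ("ss", "ss"),    # caress   -> caress (unchanged)
    ("s", ""),       # cats     -> cat
]

def porter_step_1a(word):
    """Apply only the first phase of suffix mapping to a lowercase word."""
    for suffix, replacement in STEP_1A_RULES:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

for word in ["caresses", "ponies", "caress", "cats", "run"]:
    print(word + '----->' + porter_step_1a(word))
```

The rules are tried longest suffix first, which is why `caress` is protected by the SS → SS rule before the bare S rule can strip its last letter.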

### Snowball Stemmer.
- Snowball is the name of a stemming language; the algorithm used here is more accurately called the "English Stemmer" or "Porter2 Stemmer".
- It's a slight improvement over the original Porter stemmer, in both logic and speed.


In [65]:
import nltk

In [66]:
from nltk.stem.porter import PorterStemmer

In [67]:
p_stemmer = PorterStemmer()

In [68]:
words= ['run','runner','ran', 'runs','easily','fairly']

In [69]:
for word in words:
    print(word + '----->' + p_stemmer.stem(word))

run----->run
runner----->runner
ran----->ran
runs----->run
easily----->easili
fairly----->fairli


In [70]:
from nltk.stem.snowball import SnowballStemmer

In [71]:
s_stemmer = SnowballStemmer(language='english')

In [72]:
for word in words:
    print(word + '----->' + s_stemmer.stem(word))

run----->run
runner----->runner
ran----->ran
runs----->run
easily----->easili
fairly----->fair


In [73]:
words = ['generous','generation','generously','generate']

In [74]:
for word in words:
    print(word + '----->'+ s_stemmer.stem(word))

generous----->generous
generation----->generat
generously----->generous
generate----->generat


## 4. Lemmatization.
- In contrast to stemming, lemmatization looks beyond word reduction and considers a language's full vocabulary to apply a morphological analysis to words.
- The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence.
- Lemmatization is typically much more informative than simple stemming, which is why spaCy has opted to offer only lemmatization instead of stemming.
- It looks at the surrounding text to determine a given word's part of speech; it does not categorize phrases.
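
To make the role of the part of speech concrete, here is a toy lookup in plain Python. It is only a sketch: the table `TOY_LEMMAS` and the function `toy_lemma` are hypothetical, and a real lemmatizer uses a much larger vocabulary plus morphological rules.

```python
# Hypothetical mini lookup table keyed by (lowercased word, part of speech).
TOY_LEMMAS = {
    ("was", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}

def toy_lemma(word, pos):
    """Return the lemma for (word, pos), falling back to the lowercased word."""
    return TOY_LEMMAS.get((word.lower(), pos), word.lower())

for word, pos in [("was", "AUX"), ("mice", "NOUN"), ("meeting", "VERB"), ("meeting", "NOUN")]:
    print(f"{word:10}{pos:6}{toy_lemma(word, pos)}")
```

The same surface form 'meeting' maps to two different lemmas because the key includes the part of speech.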

In [75]:
import spacy

In [76]:
nlp = spacy.load('en_core_web_sm')

In [77]:
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

In [78]:
for token in doc1:
    print(token.text,'\t', token.pos_,'\t', token.lemma,'\t', token.lemma_)
# '\t' inserts a tab between columns.

I 	 PRON 	 561228191312463089 	 -PRON-
am 	 AUX 	 10382539506755952630 	 be
a 	 DET 	 11901859001352538922 	 a
runner 	 NOUN 	 12640964157389618806 	 runner
running 	 VERB 	 12767647472892411841 	 run
in 	 ADP 	 3002984154512732771 	 in
a 	 DET 	 11901859001352538922 	 a
race 	 NOUN 	 8048469955494714898 	 race
because 	 SCONJ 	 16950148841647037698 	 because
I 	 PRON 	 561228191312463089 	 -PRON-
love 	 VERB 	 3702023516439754181 	 love
to 	 PART 	 3791531372978436496 	 to
run 	 VERB 	 12767647472892411841 	 run
since 	 SCONJ 	 10066841407251338481 	 since
I 	 PRON 	 561228191312463089 	 -PRON-
ran 	 VERB 	 12767647472892411841 	 run
today 	 NOUN 	 11042482332948150395 	 today


-  The number above is the hash value that identifies a specific lemma in the vocabulary of the 'en_core_web_sm' language model.

### Function to display lemmas.
- Since the display above is staggered and hard to read, we can write a function that displays the information we need in aligned columns.

In [79]:
def show_lemmas(text):
    for token in text:
        print(f'{token.text:{12}} {token.pos_:{6}} {token.lemma:<{22}}{token.lemma_}')

In [80]:
doc2 = nlp(u"I saw eighteen mice today!")
show_lemmas(doc2)

I            PRON   561228191312463089    -PRON-
saw          VERB   11925638236994514241  see
eighteen     NUM    9609336664675087640   eighteen
mice         NOUN   1384165645700560590   mouse
today        NOUN   11042482332948150395  today
!            PUNCT  17494803046312582752  !


<font color=green>From the above we can see that the lemma of 'saw' is 'see' and the lemma of 'mice' is 'mouse', yet 'eighteen' is its own lemma, *not* an expanded form of 'eight'.</font>

In [81]:
doc3 = nlp(u"I am meeting him tomorrow at the meeting.")

In [82]:
show_lemmas(doc3)

I            PRON   561228191312463089    -PRON-
am           AUX    10382539506755952630  be
meeting      VERB   6880656908171229526   meet
him          PRON   561228191312463089    -PRON-
tomorrow     NOUN   3573583789758258062   tomorrow
at           ADP    11667289587015813222  at
the          DET    7425985699627899538   the
meeting      NOUN   14798207169164081740  meeting
.            PUNCT  12646065887601541794  .


<font color =green> Here the lemma of 'meeting' is determined by its Part of Speech tag. </font>

## 5. Stop Words.
-  Words like "a" and "the" appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers.
- We call these stop words, and they can be filtered from the text to be processed.
-  spaCy holds a built-in list of English stop words (326 in the version used here).
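
Filtering stop words amounts to a set-membership test per token. The sketch below uses plain Python with a tiny hypothetical stop list (`TOY_STOP_WORDS`), not spaCy's built-in one, and splits on whitespace instead of using a real tokenizer.

```python
# A tiny, hypothetical stop list for illustration only.
TOY_STOP_WORDS = {"a", "the", "is", "to", "it", "than"}

def remove_stop_words(tokens, stop_words=TOY_STOP_WORDS):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]

tokens = "It is better to give than receive".split()
print(remove_stop_words(tokens))  # ['better', 'give', 'receive']
```

Lowercasing before the lookup is a design choice here, so that 'It' at the start of a sentence is treated the same as 'it'.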

In [83]:
import spacy 

In [84]:
# Load the small English model (en_core_web_sm).
nlp = spacy.load('en_core_web_sm')

In [85]:
print(nlp.Defaults.stop_words)

{'using', 'becomes', 'anywhere', 'together', 'anything', 'never', 'themselves', 'last', 'elsewhere', 'meanwhile', 'his', 'hers', 'various', 'fifty', 'whole', 'still', '’d', 'their', 'amount', 'beyond', 'had', '’m', 'nowhere', 're', 'and', 'hence', 'someone', 'all', 'former', 'neither', 'otherwise', 'well', 'done', 'make', 'via', 'other', '‘s', 'have', 'besides', 'do', 'already', 'does', 'between', 'nothing', 'show', 'beside', 'across', 'am', 'there', '’s', 'afterwards', 'please', 'next', 'we', 'being', 'both', 'full', 'yet', 'enough', 'others', 'onto', 'everything', 'formerly', 'moreover', 'ours', 'seem', 'unless', 'yours', "'ll", "'ve", 'else', 'myself', 'of', 'ten', 'up', 'least', 'i', 'keep', 'say', 'see', 'thru', 'only', '‘d', 'any', 'bottom', 'go', 'whither', 'would', 'beforehand', 'front', 'may', 'per', 'she', 'the', 'within', 'will', 'before', 'perhaps', 'did', 'he', 'thence', 'without', 'become', 'whence', 'nine', 'quite', 'throughout', 'by', 'eleven', "n't", 'also', 'less', 's

In [86]:
len(nlp.Defaults.stop_words)

326

### To check if a word is a stop word.

In [87]:
nlp.vocab['myself'].is_stop

True

In [88]:
nlp.vocab['mystery'].is_stop

False

In [89]:
nlp.vocab['ahmed'].is_stop

False

### To add a stop word.

In [90]:
# 'btw' is common shorthand for "by the way".
nlp.Defaults.stop_words.add('btw')

In [91]:
nlp.vocab['btw'].is_stop

True

In [92]:
len(nlp.Defaults.stop_words)

327

### To remove a stop word.

In [93]:
nlp.Defaults.stop_words.remove('beyond')
nlp.vocab['beyond'].is_stop

False

In [94]:
len(nlp.Defaults.stop_words)

326

<font color= green > As shown above, 'beyond' is no longer a stop word. </font>

## 6. Vocabulary and Matching.
-  In the previous sections we saw how a body of text is divided into tokens and how these individual tokens are parsed and tagged with parts of speech, dependencies and lemmas.
- In this section we'll **identify and label specific phrases that match patterns** we define.

In [95]:
import spacy

In [96]:
nlp = spacy.load('en_core_web_sm')

### Rule-Based Matching.
- spaCy offers a rule-matching tool called **Matcher** that lets us build a library of token patterns, then match those patterns against a Doc object to return a list of found matches.
- We can match on any part of the token, including text and annotations, and can add multiple patterns to the same matcher.

In [97]:
from spacy.matcher import Matcher

In [98]:
matcher = Matcher(nlp.vocab)

In [99]:
# Create the patterns, which are lists of dictionaries keyed by token attributes.
# SolarPower
pattern1 = [{'LOWER':'solarpower'}]
# Solar-power
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True},{'LOWER':'power'}] # IS_PUNCT matches a punctuation token in between.
# Solar power
pattern3 = [{'LOWER':'solar'},{'LOWER':'power'}]

- `pattern1` looks for a single token whose lowercase text reads 'solarpower'
- `pattern2` looks for three adjacent tokens, with a middle token that can be any punctuation
- `pattern3` looks for two adjacent tokens that read 'solar' and 'power' in that order

In [100]:
matcher.add('SolarPower', None, pattern1, pattern2, pattern3)

In [101]:
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

In [102]:
found_matches=matcher(doc)

In [103]:
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


`matcher` returns a list of tuples. Each tuple contains an ID for the match, followed by the start and end token indices that map to the span `doc[start:end]`.

In [104]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id] # gives string representation
    span = doc[start:end]                   # gives the matched span
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


#### To Remove the Pattern.

In [105]:
matcher.remove('SolarPower')

In [106]:
# Creating new patterns.
# solarpower, SolarPower 
pattern1= [{'LOWER':'solarpower'}]

# solar power, solar-power
pattern2 = [{'LOWER':'solar'},{'IS_PUNCT':True, 'OP':'*'},{'LOWER':'power'}]

With `'OP':'*'` the punctuation token becomes optional, so `pattern2` covers both two-word forms, with and without the hyphen!

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>
<tr><td>!</td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr><td>?</td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr><td>+</td><td>Require the pattern to match 1 or more times</td></tr>
<tr><td>*</td><td>Allow the pattern to match zero or more times</td></tr>
</table>
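
To see how the `?`, `+` and `*` quantifiers change what a pattern accepts, here is a toy matcher in plain Python. It is a sketch under stated assumptions, not spaCy's `Matcher`: it works on a list of strings, supports only the `LOWER` and `IS_PUNCT` keys, does not implement `!`, and all function names are hypothetical.

```python
import string

def token_matches(token, spec):
    """Check one token string against one pattern dict (LOWER and IS_PUNCT only)."""
    if "LOWER" in spec and token.lower() != spec["LOWER"]:
        return False
    if "IS_PUNCT" in spec:
        is_punct = all(ch in string.punctuation for ch in token)
        if is_punct != spec["IS_PUNCT"]:
            return False
    return True

def match_ends(tokens, i, pattern):
    """Return every end index at which `pattern` can finish when started at token i."""
    if not pattern:
        return {i}
    spec, rest = pattern[0], pattern[1:]
    op = spec.get("OP", "1")  # "1" is this sketch's own label for "exactly once"
    ends = set()
    if op in ("?", "*"):
        ends |= match_ends(tokens, i, rest)          # zero occurrences
    if i < len(tokens) and token_matches(tokens[i], spec):
        ends |= match_ends(tokens, i + 1, rest)      # exactly one occurrence
        if op in ("+", "*"):
            # consume this token and allow further repeats of the same spec
            ends |= match_ends(tokens, i + 1, [dict(spec, OP="*")] + rest)
    return ends

def find_matches(tokens, patterns):
    """Return sorted (start, end) pairs for every non-empty match of any pattern."""
    found = set()
    for pattern in patterns:
        for start in range(len(tokens)):
            for end in match_ends(tokens, start, pattern):
                if end > start:
                    found.add((start, end))
    return sorted(found)

toy_pattern1 = [{'LOWER': 'solarpower'}]
toy_pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP': '*'}, {'LOWER': 'power'}]
tokens = ["The", "Solar", "Power", "industry", "likes", "solar", "-", "power", "and", "solarpower"]
print(find_matches(tokens, [toy_pattern1, toy_pattern2]))
```

As with the real matcher, each result is a pair of token indices, so `tokens[start:end]` recovers the matched span.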

In [107]:
matcher.add('SolarPower', None,pattern1,pattern2)

In [108]:
doc2=nlp(u"Solar----power is solarpower hey!")

In [109]:
found_matches = matcher(doc2)

In [110]:
print(found_matches)

[(8656102463236116519, 2, 3)]


### Phrase Matcher
- PhraseMatcher takes Doc objects as patterns, so we create a Doc object from each phrase in a list and pass them into the matcher.
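
Conceptually, phrase matching looks for exact token sequences inside a longer token sequence. The plain-Python sketch below shows that idea on lists of strings. The function `find_phrases` is hypothetical and says nothing about how spaCy implements `PhraseMatcher` internally.

```python
def find_phrases(doc_tokens, phrases):
    """Return sorted (start, end) pairs where any phrase occurs as an exact token run."""
    matches = []
    for phrase in phrases:
        n = len(phrase)
        for start in range(len(doc_tokens) - n + 1):
            if doc_tokens[start:start + n] == phrase:
                matches.append((start, start + n))
    return sorted(matches)

# Hyphenated terms are written as separate tokens here, mirroring how
# 'supply-side economics' spans four tokens in the matches shown in this notebook.
doc_tokens = "policies known as supply - side economics or voodoo economics".split()
phrases = [["voodoo", "economics"], ["supply", "-", "side", "economics"]]

for start, end in find_phrases(doc_tokens, phrases):
    print(start, end, " ".join(doc_tokens[start:end]))
```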

In [111]:
from spacy.matcher import PhraseMatcher

In [112]:
matcher = PhraseMatcher(nlp.vocab)

In [113]:
# Load the reference file.
with open('reaganomics.txt') as f:
    doc3 = nlp(f.read())

In [114]:
print(doc3)

REAGANOMICS
https://en.wikipedia.org/wiki/Reaganomics

Reaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.

The four pillars of Reagan's economic policy were to reduce the growth of government spending, reduce the federal income tax and capital gains tax, reduce government regulation, and tighten the money supply in order to reduce inflation.[2]

The results of Reaganomics are still debated. Supporters point to the end of stagflation, stronger GDP growth, and an entrepreneur revolution in the decades that followed.[3][4] Critics point to the widening income gap, an atmosphere of greed, and the national debt tripling in eight years which ultimately reversed the pos

In [115]:
# The phrases to look for, stored in a list.
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

In [116]:
# Converting each phrase into Document object.
phrase_patterns = [nlp(text) for text in phrase_list]

In [117]:
phrase_patterns

[voodoo economics,
 supply-side economics,
 trickle-down economics,
 free-market economics]

In [118]:
matcher.add('EconMatcher', None, *phrase_patterns) # * unpacks the list into separate arguments.

In [119]:
found_matches = matcher(doc3)
found_matches

[(3680293220734633682, 41, 45),
 (3680293220734633682, 49, 53),
 (3680293220734633682, 54, 56),
 (3680293220734633682, 61, 65),
 (3680293220734633682, 673, 677),
 (3680293220734633682, 2987, 2991)]

In [121]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id] # gives string representation
    span = doc3[start:end]                   # gives the matched span
    print(match_id, string_id, start, end, span.text)

3680293220734633682 EconMatcher 41 45 supply-side economics
3680293220734633682 EconMatcher 49 53 trickle-down economics
3680293220734633682 EconMatcher 54 56 voodoo economics
3680293220734633682 EconMatcher 61 65 free-market economics
3680293220734633682 EconMatcher 673 677 supply-side economics
3680293220734633682 EconMatcher 2987 2991 trickle-down economics


For additional information visit https://spacy.io/usage/linguistic-features#section-rule-based-matching