## Natural Language Processing

___
## spaCy basics

In [1]:
import spacy

#### Load the language model once; the `nlp` object it returns is reused for every text processed below

In [2]:
nlp = spacy.load('en_core_web_sm')

In [3]:
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')

In [4]:
for token in doc:
    print(token)

Tesla
is
looking
at
buying
U.S.
startup
for
$
6
million


In [5]:
for token in doc:
    print(token.text, token.pos, token.pos_, token.dep_)

Tesla 96 PROPN nsubj
is 87 AUX aux
looking 100 VERB ROOT
at 85 ADP prep
buying 100 VERB pcomp
U.S. 96 PROPN compound
startup 92 NOUN dobj
for 85 ADP prep
$ 99 SYM quantmod
6 93 NUM compound
million 93 NUM pobj


In [6]:
for token in doc:
    print(f'{token.text:{20}} {token.pos:{5}} {token.pos_:{9}} {token.dep_:{9}}')

Tesla                   96 PROPN     nsubj    
is                      87 AUX       aux      
looking                100 VERB      ROOT     
at                      85 ADP       prep     
buying                 100 VERB      pcomp    
U.S.                    96 PROPN     compound 
startup                 92 NOUN      dobj     
for                     85 ADP       prep     
$                       99 SYM       quantmod 
6                       93 NUM       compound 
million                 93 NUM       pobj     


<img src="pipeline1.png" width="600">

In [7]:
nlp.pipeline

[('tagger', <spacy.pipeline.pipes.Tagger at 0x29fdb780ec8>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x29fdb768d68>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x29fdb788348>)]

In [8]:
nlp.pipe_names

['tagger', 'parser', 'ner']

In [9]:
doc2 = nlp(u"Tesla isn't looking into startups anymore")

In [10]:
for token in doc2:
    print(f'{token.text:{20}} {token.pos:{5}} {token.pos_:{9}} {token.dep_:{9}}')

Tesla                   96 PROPN     nsubj    
is                      87 AUX       aux      
n't                     94 PART      neg      
looking                100 VERB      ROOT     
into                    85 ADP       prep     
startups                92 NOUN      pobj     
anymore                 86 ADV       advmod   


In [11]:
doc2[0]

Tesla

In [12]:
doc2[0].pos_

'PROPN'

In [13]:
doc2[0].dep_

'nsubj'

|Attribute|Description|Value for `doc2[0]`|
|:------|:------:|:------|
|`.text`|The original word text|`Tesla`|
|`.lemma_`|The base form of the word|`tesla`|
|`.pos_`|The simple part-of-speech tag|`PROPN`/`proper noun`|
|`.tag_`|The detailed part-of-speech tag|`NNP`/`noun, proper singular`|
|`.shape_`|The word shape – capitalization, punctuation, digits|`Xxxxx`|
|`.is_alpha`|Does the token consist only of alphabetic characters?|`True`|
|`.is_stop`|Is the token part of a stop list, i.e. the most common words of the language?|`False`|
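
To make the `.shape_` row more concrete, here is a minimal pure-Python sketch of the idea behind a word shape. The helper `toy_shape` is hypothetical and is not spaCy's implementation; spaCy's own version differs in details, for example in how it treats long runs of the same character class.

```python
def toy_shape(text):
    """Map each character to X (upper), x (lower) or d (digit); keep the rest."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append("X")
        elif ch.islower():
            out.append("x")
        elif ch.isdigit():
            out.append("d")
        else:
            out.append(ch)  # punctuation and symbols are kept as-is
    return "".join(out)


for word in ["Tesla", "U.S.", "10.30"]:
    print(f"{word:{10}} {toy_shape(word)}")
```

For `Tesla` this prints `Xxxxx`, matching the value in the table.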

___
## Spans
Large `Doc` objects can be hard to work with at times. A **span** is a slice of a `Doc` object in the form `Doc[start:stop]`.

In [14]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [15]:
life_quote = doc3[16:30]

In [16]:
print(life_quote)

"Life is what happens to us while we are making other plans"


In [17]:
type(life_quote)

spacy.tokens.span.Span

In [18]:
type(doc3)

spacy.tokens.doc.Doc

In [19]:
doc4 = nlp(u'This is the first sentence. This is the second. This is third. And this is another')

In [20]:
for sentence in doc4.sents:
    print(sentence)

This is the first sentence.
This is the second.
This is third.
And this is another


In [21]:
doc4[6]

This

In [22]:
doc4[6].is_sent_start

True

In [23]:
doc4[7].is_sent_start

___

## Tokenization
The first step in creating a `Doc` object is to break down the incoming text into component pieces or "tokens".

<img src="tokenization.png" width="600">
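
The splitting step can be approximated in plain Python. The sketch below is a toy, not spaCy's tokenizer: it splits on whitespace, peels punctuation off both ends of each chunk, keeps a few known abbreviations whole, and splits off one trailing contraction. The rule tables `EXCEPTIONS` and `CONTRACTIONS` and the function `toy_tokenize` are hypothetical and far smaller than anything a real tokenizer uses; infix splitting (such as the hyphen in `snail-mail`) is left out.

```python
import string

# Hypothetical, tiny rule tables for illustration only.
EXCEPTIONS = {"L.A.", "U.S."}  # abbreviations that must stay whole
CONTRACTIONS = ("n't", "'re", "'s", "'m", "'ll", "'ve")
PUNCT = set(string.punctuation)


def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        # Peel punctuation off the front of the chunk.
        while chunk and chunk not in EXCEPTIONS and chunk[0] in PUNCT:
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        # Peel punctuation off the back, stopping at a known abbreviation.
        while chunk and chunk not in EXCEPTIONS and chunk[-1] in PUNCT:
            suffixes.insert(0, chunk[-1])
            chunk = chunk[:-1]
        # Split off at most one trailing contraction.
        middle = [chunk] if chunk else []
        for ending in CONTRACTIONS:
            if chunk.endswith(ending) and len(chunk) > len(ending):
                middle = [chunk[: -len(ending)], ending]
                break
        tokens.extend(prefixes + middle + suffixes)
    return tokens


print(toy_tokenize('"We\'re moving to L.A.!"'))
```

The exception check is what keeps the final period attached to `L.A.` while the `!` and the closing quote are still split off.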

In [24]:
mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


In [25]:
doc = nlp(mystring)

In [26]:
for token in doc:
    print(token)

"
We
're
moving
to
L.A.
!
"


In [27]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

In [28]:
for token in doc2:
    print(token)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


In [29]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


In [30]:
len(doc3)

9

In [31]:
doc3.vocab

<spacy.vocab.Vocab at 0x29fda917a48>

In [32]:
len(doc3.vocab)

554

In [33]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

In [34]:
for token in doc8:
    print(token.text, end = ' | ')

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [35]:
for entity in doc8.ents:
    print(entity)
    print(entity.label_)
    print(spacy.explain(entity.label_))
    print('\n')

Apple
ORG
Companies, agencies, institutions, etc.


Hong Kong
GPE
Countries, cities, states


$6 million
MONEY
Monetary values, including unit




In [36]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [37]:
doc10 = nlp(u"Adam and John are very good friends!")

for entity in doc10.ents:
    print(entity)
    print(entity.label_)
    print(spacy.explain(entity.label_))
    print('\n')

Adam
PERSON
People, including fictional


John
PERSON
People, including fictional




___
## Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

In [38]:
from spacy import displacy

In [39]:
doc = nlp('Apple is going to build a U.K. factory for $6 million')

In [40]:
displacy.render(doc, style = 'dep', jupyter = True, options={'distance': 50})

In [41]:
doc = nlp(u'Over the last quarter Apple sold over 20 thousand iPods for a profit of $6 million')

In [42]:
displacy.render(doc, style = 'ent', jupyter = True)

___

## Stemming


Often when searching text for a certain keyword, it helps if the search returns variations of the word. For instance, searching for "boat" might also return "boats" and "boating". Here, "boat" would be the stem for [boat, boater, boating, boats].
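
As a rough illustration of the idea, the sketch below strips a few suffixes to reach a common stem. It is a deliberately naive suffix stripper, not the Porter algorithm; the suffix list `SUFFIXES` and the function `toy_stem` are hypothetical.

```python
SUFFIXES = ("ing", "er", "s")  # hypothetical list, checked in this order


def toy_stem(word):
    for suffix in SUFFIXES:
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for w in ["boat", "boater", "boating", "boats"]:
    print(f"{w:{10}}{toy_stem(w)}")
```

All four words reduce to `boat`, but the rules are crude: `runner` becomes `runn`, which is why real stemmers apply ordered rule sets with extra conditions.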

spaCy does not include a stemmer, so this section uses NLTK (https://www.nltk.org/).

In [43]:
from nltk.stem.porter import PorterStemmer

In [44]:
stemmer = PorterStemmer()

In [45]:
words = ['run', 'runner', 'ran', 'runs', 'easily', 'fairly']

In [46]:
for word in words:
    print(f'{word:{10}}{stemmer.stem(word):{10}}')

run       run       
runner    runner    
ran       ran       
runs      run       
easily    easili    
fairly    fairli    


### Snowball Stemmer


The algorithm used here is more accurately called the "English Stemmer" or "Porter2 Stemmer". It offers a slight improvement over the original Porter stemmer, both in logic and speed.

In [47]:
from nltk.stem.snowball import SnowballStemmer

In [48]:
# The Snowball Stemmer requires that you pass a language parameter
s_stemmer = SnowballStemmer(language='english')

In [49]:
for word in words:
    print(f'{word:{10}}{s_stemmer.stem(word):{10}}')

run       run       
runner    runner    
ran       ran       
runs      run       
easily    easili    
fairly    fair      


___

## Lemmatization
In contrast to stemming, lemmatization looks beyond word reduction, and considers a language's full vocabulary to apply a *morphological analysis* to words. The lemma of 'was' is 'be' and the lemma of 'mice' is 'mouse'. Further, the lemma of 'meeting' might be 'meet' or 'meeting' depending on its use in a sentence.
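
The point that a lemma can depend on a word's role in the sentence can be sketched with a small lookup keyed by word and part of speech. This is a toy under stated assumptions: the table `LEMMA_TABLE` and the function `toy_lemma` are hypothetical, and spaCy's lemmatizer is considerably more involved.

```python
# Hypothetical lookup table keyed by (lowercased word, coarse POS tag).
LEMMA_TABLE = {
    ("was", "AUX"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemma(word, pos):
    # Fall back to the lowercased word when no entry is known.
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


for word, pos in [("was", "AUX"), ("mice", "NOUN"),
                  ("meeting", "VERB"), ("meeting", "NOUN")]:
    print(f"{word:{10}} {pos:{6}} {toy_lemma(word, pos)}")
```

A suffix-stripping stemmer could never map `mice` to `mouse`; a lemmatizer can because it consults vocabulary and part of speech rather than spelling alone.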

In [50]:
doc1 = nlp(u"I am a runner running in a race because I love to run since I ran today")

for token in doc1:
    print(f'{token.text:{10}} {token.pos_:{10}} {token.lemma:<{25}} {token.lemma_:{10}}')

I          PRON       561228191312463089        -PRON-    
am         AUX        10382539506755952630      be        
a          DET        11901859001352538922      a         
runner     NOUN       12640964157389618806      runner    
running    VERB       12767647472892411841      run       
in         ADP        3002984154512732771       in        
a          DET        11901859001352538922      a         
race       NOUN       8048469955494714898       race      
because    SCONJ      16950148841647037698      because   
I          PRON       561228191312463089        -PRON-    
love       VERB       3702023516439754181       love      
to         PART       3791531372978436496       to        
run        VERB       12767647472892411841      run       
since      SCONJ      10066841407251338481      since     
I          PRON       561228191312463089        -PRON-    
ran        VERB       12767647472892411841      run       
today      NOUN       11042482332948150395      today   

### Function to display lemmas
Since the display above is staggered and hard to read, let's write a function that displays the information we want more neatly.

In [51]:
def show_lemmas(doc):
    """Print each token's text, coarse POS tag, lemma hash and lemma string in aligned columns."""
    for token in doc:
        print(f'{token.text:{15}} {token.pos_:{10}} {token.lemma:<{25}} {token.lemma_:{10}}')
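
The format specs in `show_lemmas` rely on Python's default alignment rules: strings are padded on the right, while integers such as the `token.lemma` hash are padded on the left unless `<` is given. A short standalone sketch (the variable names are only for illustration):

```python
text, number = "am", 87

print(f"[{text:{10}}]")     # strings are left-aligned by default
print(f"[{number:{10}}]")   # integers are right-aligned by default
print(f"[{number:<{10}}]")  # '<' forces left alignment
```

This is why the hash column uses `:<{25}` while the text columns need only a width.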

In [52]:
doc2 = nlp(u"I saw eighteen mice today! She too saw :)")

show_lemmas(doc2)

I               PRON       561228191312463089        -PRON-    
saw             VERB       11925638236994514241      see       
eighteen        NUM        9609336664675087640       eighteen  
mice            NOUN       1384165645700560590       mouse     
today           NOUN       11042482332948150395      today     
!               PUNCT      17494803046312582752      !         
She             PRON       561228191312463089        -PRON-    
too             ADV        12286903790479710773      too       
saw             VERB       11925638236994514241      see       
:)              PUNCT      5920004935509210957       :)        


In [53]:
doc4 = nlp(u"That's an enormous automobile")

show_lemmas(doc4)

That            DET        4380130941430378203       that      
's              AUX        10382539506755952630      be        
an              DET        15099054000809333061      an        
enormous        ADJ        17917224542039855524      enormous  
automobile      NOUN       7211811266693931283       automobile


#### <font color=green>Note that lemmatization does *not* reduce words to their most basic synonym - that is, `enormous` doesn't become `big` and `automobile` doesn't become `car`.</font>

___

## Stop Words
Words like "a" and "the" appear so frequently that they don't require tagging as thoroughly as nouns, verbs and modifiers. We call these *stop words*, and they can be filtered from the text to be processed. spaCy ships a built-in set of English stop words; the version used here contains 326.

In [54]:
print(nlp.Defaults.stop_words)

{'alone', 'could', 'elsewhere', 'down', 'cannot', 'doing', 'afterwards', 'take', 'most', 'should', 'as', 'although', 'done', 'however', 'for', 'own', 'nobody', 'mostly', 'a', 'thence', 'herein', 'i', 'at', 'beyond', 'call', 'hence', 'his', 'ours', 'serious', 'due', 'front', 'besides', 'so', 'unless', '’m', 'mine', 'together', 'himself', 'why', 'or', 'between', 'various', 'had', 'anyway', 'except', 'whenever', 'otherwise', 'can', 'him', 'you', 'ourselves', '’re', 'still', 'top', 'myself', 'become', "'s", 'quite', 'to', 'within', 'an', 'the', 'each', 'not', 'into', 'much', 'though', 'move', 'even', 'twenty', 'here', 'with', 'whole', '’s', 'moreover', 'nine', 'used', 'five', 'of', 'always', 'less', 'noone', 'then', 'becomes', 'its', 'such', 'n‘t', '‘ll', 'sometime', 'three', 'nowhere', 'amount', 'ten', 'others', 'towards', 'latterly', 'might', 'since', '‘s', 'your', 'may', 'formerly', 'already', 'made', '‘re', 'never', 'toward', 'namely', 'he', 'without', 'many', '‘ve', 'same', 'using', '

In [55]:
len(nlp.Defaults.stop_words)

326

### To see if a word is a stop word

In [56]:
nlp.vocab['myself'].is_stop

True

In [57]:
nlp.vocab['mystery'].is_stop

False
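
Filtering itself is a one-line membership test. The sketch below uses plain strings and a small hypothetical stop set (`STOP_WORDS`) so it runs without a language model; with a spaCy `Doc`, the analogous filter would check each token's `is_stop` attribute instead.

```python
STOP_WORDS = {"a", "the", "is", "to", "of", "and"}  # hypothetical subset


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so that "The" and "the" are treated alike.
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["The", "lemma", "of", "mice", "is", "mouse"]
print(remove_stop_words(tokens))
```

Only the content-bearing words `lemma`, `mice` and `mouse` survive the filter.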

### To add a stop word

There may be times when you wish to add a stop word to the default set. Perhaps you decide that `'btw'` (common shorthand for "by the way") should be considered a stop word.

In [58]:
# Add the word to the set of stop words. Use lowercase!
nlp.Defaults.stop_words.add('btw')

# Set the stop_word tag on the lexeme
nlp.vocab['btw'].is_stop = True

In [59]:
len(nlp.Defaults.stop_words)

327

In [60]:
nlp.vocab['btw'].is_stop

True

### To remove a stop word


Alternatively, you may decide that `'beyond'` should not be considered a stop word.

In [61]:
# Remove the word from the set of stop words
nlp.Defaults.stop_words.remove('beyond')

# Remove the stop_word tag from the lexeme
nlp.vocab['beyond'].is_stop = False

In [62]:
len(nlp.Defaults.stop_words)

326

In [63]:
nlp.vocab['beyond'].is_stop

False