## Index

- import spacy
- nlp = spacy.load
- text, pos, dep
- Tokenization
- **entities**
- noun chunks
- displacy
- **Stemming**
- Porter
- Snowball
- **Lemmatization**
- **Stop words**
- add and remove stop words


In [1]:

import spacy

In [49]:
nlp = spacy.load('en_core_web_sm')


In [3]:
doc = nlp(u'Today is the first day of learning Spacy, i am sure this is going to be fun. current time is 13:45 PM')

In [4]:
doc

Today is the first day of learning Spacy, i am sure this is going to be fun. current time is 13:45 PM

In [5]:
for token in doc:
    print(token.text,'         ',token.pos_, '    ', token.dep_)

Today           NOUN      nsubj
is           VERB      ccomp
the           DET      det
first           ADJ      amod
day           NOUN      attr
of           ADP      prep
learning           VERB      pcomp
Spacy           PROPN      dobj
,           PUNCT      punct
i           PRON      nsubj
am           VERB      ROOT
sure           ADJ      acomp
this           DET      nsubj
is           VERB      aux
going           VERB      ccomp
to           PART      aux
be           VERB      xcomp
fun           ADJ      acomp
.           PUNCT      punct
current           ADJ      amod
time           NOUN      nsubj
is           VERB      ROOT
13:45           NUM      nummod
PM           NOUN      attr
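
The abbreviations in the table above can be decoded with `spacy.explain`, which looks a label up in spaCy's built-in glossary and does not need a loaded model. A small sketch; the list of labels is picked by hand from the output above:

```python
import spacy

# Labels copied by hand from the POS / dependency output above.
labels = ["PROPN", "ADP", "DET", "nsubj", "amod", "dobj"]

# spacy.explain returns a short description for a known label.
glossary = {label: spacy.explain(label) for label in labels}

for label, meaning in glossary.items():
    print(f"{label:<8}{meaning}")
```

This is handy while reading parser output, since the dependency labels in particular are hard to guess from the abbreviation alone.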


In [6]:
nlp.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x219fe5025c8>),
 ('parser', <spacy.pipeline.DependencyParser at 0x219fe5011c8>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x219fe501768>)]

## Tokenization



![](tokenization.png)

In [20]:
mystring = '"we\'re moving to U.S.A"'

In [22]:
mystring

'"we\'re moving to U.S.A"'

In [23]:
print(mystring)

"we're moving to U.S.A"


In [24]:
doc = nlp(mystring)

In [25]:
for token in doc:
    print(token.text)

"
we
're
moving
to
U.S.A
"


In [29]:
doc2 = nlp(u'we\'re here to help! send-email to us, email support@oursite.com or visit us http://www.oursite.com.!')

In [30]:
doc2

we're here to help! send-email to us, email support@oursite.com or visit us http://www.oursite.com.!

In [31]:
for token in doc2:
    print(token)

we
're
here
to
help
!
send
-
email
to
us
,
email
support@oursite.com
or
visit
us
http://www.oursite.com
.
!


In [34]:
doc3 = nlp(u'a 5km NYC cab ride costs $10.3')

In [35]:
for token in doc3:
    print(token)

a
5
km
NYC
cab
ride
costs
$
10.3


In [42]:
doc4 = nlp(u"let's visit St. louis in the U.S next year.")

In [43]:
for token in doc4:
    print(token)

let
's
visit
St.
louis
in
the
U.S
next
year
.


In [46]:
len(doc3)

9

In [47]:
len(doc3.vocab)

57852
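
The tokenizer shown above is rule-based and non-destructive: the original text can always be rebuilt from the tokens. A minimal sketch of that property, assuming a tokenizer-only pipeline from `spacy.blank("en")` rather than the `en_core_web_sm` model used in the cells above:

```python
import spacy

# Tokenizer-only English pipeline; no model download is assumed.
nlp = spacy.blank("en")
text = "a 5km NYC cab ride costs $10.3"
doc = nlp(text)

tokens = [token.text for token in doc]

# text_with_ws is the token text plus any trailing whitespace,
# so joining the pieces reproduces the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)

print(tokens)
print(rebuilt == text)
print(doc[1:3])  # slicing a Doc returns a Span, not a list
```

Because the whitespace is kept alongside each token, splitting off `$` or `km` never loses information about the source string.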

In [56]:
doc5 = nlp(u'Apple to build a factory in Hong Kong with $6m')

In [57]:
for token in doc5:
    print(token, end=' | ')

Apple | to | build | a | factory | in | Hong | Kong | with | $ | 6 | m | 

## Entities

Named entities are the special items in a statement (organizations, places, monetary values and so on), as opposed to ordinary POS categories such as nouns and pronouns.

In [68]:
for t in doc5.ents:
    print(t)
    print(t.label_)
    print((spacy.explain(t.label_)))
    print('\n')

Apple
ORG
Companies, agencies, institutions, etc.


Hong Kong
GPE
Countries, cities, states


$6m
MONEY
Monetary values, including unit
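
Each entity is a `Span` over the document's tokens, carrying a label and character offsets into the text. The sketch below makes that structure visible. It assumes a tokenizer-only pipeline and sets the spans by hand to mirror the output above; a trained pipeline would predict them instead.

```python
import spacy
from spacy.tokens import Span

# Tokenizer-only pipeline: it has no NER component, so doc.ents starts empty.
nlp = spacy.blank("en")
doc = nlp("Apple to build a factory in Hong Kong")

# Assumption: these spans are hand-labelled for illustration;
# a trained pipeline such as en_core_web_sm would predict them instead.
doc.ents = [
    Span(doc, 0, 1, label="ORG"),  # token 0: Apple
    Span(doc, 6, 8, label="GPE"),  # tokens 6-7: Hong Kong
]

entities = [
    (ent.text, ent.label_, ent.start_char, ent.end_char)
    for ent in doc.ents
]
for text, label, start, end in entities:
    print(f"{text:<10}{label:<5}{start:>3}{end:>3}  {spacy.explain(label)}")
```

The character offsets index straight into `doc.text`, which is useful when entities have to be highlighted or extracted from the raw string.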




In [70]:
for chunks in doc5.noun_chunks:
    print(chunks)

Apple
a factory
Hong Kong
$6m


## displacy

- dep
- ent

In [71]:
from spacy import displacy

In [72]:
doc6 = nlp(u'Apple to build a factory in Hong Kong with $6m')

In [78]:
displacy.render(doc6,style = 'dep',jupyter=True,options={'distance':120})

In [79]:
# 'distance' only applies to the 'dep' visualizer, so it is dropped here
displacy.render(doc6, style='ent', jupyter=True)

In [80]:
displacy.serve(doc6,style='dep')


    Serving on port 5000...
    Using the 'dep' visualizer



127.0.0.1 - - [16/Apr/2020 17:11:54] "GET / HTTP/1.1" 200 9095
127.0.0.1 - - [16/Apr/2020 17:11:54] "GET /favicon.ico HTTP/1.1" 200 9095



    Shutting down server on port 5000.
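
`displacy.serve` blocks until the server is stopped. When only the markup is needed, `displacy.render` can return it as a string, which can then be saved to a file. A sketch under two assumptions: the entity record is written by hand in displaCy's "manual" format instead of coming from a parsed `Doc`, and the output file name is arbitrary.

```python
import tempfile
from pathlib import Path

from spacy import displacy

# Assumption: a hand-written entity record in displaCy's "manual" format,
# with character offsets into the text, instead of a parsed Doc.
example = {
    "text": "Apple to build a factory in Hong Kong",
    "ents": [
        {"start": 0, "end": 5, "label": "ORG"},
        {"start": 28, "end": 37, "label": "GPE"},
    ],
    "title": None,
}

# jupyter=False returns the markup instead of displaying it;
# page=True wraps it in a complete HTML document.
html = displacy.render(example, style="ent", manual=True, jupyter=False, page=True)

# Hypothetical output location, chosen only for this example.
out_path = Path(tempfile.gettempdir()) / "entities.html"
out_path.write_text(html, encoding="utf-8")
print(out_path)
```

The saved file can be opened in any browser without keeping a server running.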



## NLTK: Porter and Snowball stemmers

spaCy does not include a stemmer. Of the two libraries used here, only NLTK provides one.

In [81]:
import nltk

In [82]:
from nltk.stem.porter import PorterStemmer

In [84]:
p_stemmer = PorterStemmer()

In [94]:
words = ['run','runner','running','ran','easily','heavier','fairly']

In [95]:
for word in words:
    print(word, '----->', p_stemmer.stem(word))

run -----> run
runner -----> runner
running -----> run
ran -----> ran
easily -----> easili
heavier -----> heavier
fairly -----> fairli


In [96]:
from nltk.stem.snowball import SnowballStemmer

In [97]:
s_stemmer = SnowballStemmer(language='english')

In [98]:
for word in words:
    print(word, '----->', s_stemmer.stem(word)) 

run -----> run
runner -----> runner
running -----> run
ran -----> ran
easily -----> easili
heavier -----> heavier
fairly -----> fair


In [99]:
words = ['generate','generous','generally','general','generation']

In [100]:
for word in words:
    print(word, '----->', p_stemmer.stem(word))

generate -----> gener
generous -----> gener
generally -----> gener
general -----> gener
generation -----> gener


In [101]:
for word in words:
    print(word, '----->', s_stemmer.stem(word)) 

generate -----> generat
generous -----> generous
generally -----> general
general -----> general
generation -----> generat
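
The two stemmers are easier to compare side by side. A small sketch that stems a handful of the words used above with both algorithms and lists the words on which they disagree; the variable names are made up for this example:

```python
from nltk.stem.porter import PorterStemmer
from nltk.stem.snowball import SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer(language="english")

words = ["run", "running", "fairly", "generous", "generation"]

# Stem every word with both algorithms and keep the results together.
stems = {word: (porter.stem(word), snowball.stem(word)) for word in words}

# Words on which the two stemmers disagree.
disagreements = [word for word, (p, s) in stems.items() if p != s]

print(f"{'word':<12}{'porter':<10}{'snowball':<10}")
for word, (p, s) in stems.items():
    print(f"{word:<12}{p:<10}{s:<10}")
print("differ on:", disagreements)
```

Neither stemmer needs any NLTK corpus download, so this runs with the package alone.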


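### Snowball is better, as the last example shows

Porter reduces all five words to `gener`, while Snowball keeps `generous` and `general` apart from `generat`.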

---

## Lemmatization

**Lemmatization goes beyond stemming: rather than just cutting a word down to its stem, it looks the word up in the full vocabulary to find its base form (lemma).**

The lemma of __'was'__ is __'be'__, and the lemma of __'mice'__ is __'mouse'__.

The lemma also depends on how the word is used, so lemmatization is context-sensitive.

E.g. 'meeting' can lemmatize to 'meet' or stay 'meeting', depending on whether it is used as a verb or as a noun.
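
The idea of a lemma that depends on usage can be sketched with a toy lookup keyed by word and part of speech. This is only an illustration with a hand-written table and a hypothetical `toy_lemma` helper; it is not how spaCy's lemmatizer is implemented, and real lemmatizers rely on far larger tables and rules.

```python
# Toy lookup table, keyed by (lowercased word, part of speech).
# Assumption: these entries are hand-written for illustration only.
LEMMA_TABLE = {
    ("was", "VERB"): "be",
    ("mice", "NOUN"): "mouse",
    ("meeting", "VERB"): "meet",
    ("meeting", "NOUN"): "meeting",
}


def toy_lemma(word: str, pos: str) -> str:
    """Return the table entry for (word, pos), or the lowercased word if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word.lower())


for word, pos in [("was", "VERB"), ("mice", "NOUN"),
                  ("meeting", "VERB"), ("meeting", "NOUN")]:
    print(f"{word:<10}{pos:<6}{toy_lemma(word, pos)}")
```

A stemmer sees only the string, so it cannot make the verb/noun distinction that the second key provides here.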

In [102]:
doc7 = nlp(u'I am a runner, running in marathon, i run frequently as i love running.')

In [105]:
for t in doc7:
    print(t.text , '\t' , t.pos_ , '\t' , t.lemma_)

I 	 PRON 	 -PRON-
am 	 VERB 	 be
a 	 DET 	 a
runner 	 NOUN 	 runner
, 	 PUNCT 	 ,
running 	 VERB 	 run
in 	 ADP 	 in
marathon 	 NOUN 	 marathon
, 	 PUNCT 	 ,
i 	 PRON 	 i
run 	 VERB 	 run
frequently 	 ADV 	 frequently
as 	 ADP 	 as
i 	 PRON 	 i
love 	 VERB 	 love
running 	 VERB 	 run
. 	 PUNCT 	 .


### Using f-strings for aligned output

In [119]:
for t in doc7:
    print(f'{t.text:{10}} \t {t.pos_:<{20}} \t {t.lemma_:<{20}}' )

I          	 PRON                 	 -PRON-              
am         	 VERB                 	 be                  
a          	 DET                  	 a                   
runner     	 NOUN                 	 runner              
,          	 PUNCT                	 ,                   
running    	 VERB                 	 run                 
in         	 ADP                  	 in                  
marathon   	 NOUN                 	 marathon            
,          	 PUNCT                	 ,                   
i          	 PRON                 	 i                   
run        	 VERB                 	 run                 
frequently 	 ADV                  	 frequently          
as         	 ADP                  	 as                  
i          	 PRON                 	 i                   
love       	 VERB                 	 love                
running    	 VERB                 	 run                 
.          	 PUNCT                	 .                   
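
The format specs used above are plain Python and can be tried without spaCy. A short sketch with a few rows copied from the output above: `<` left-aligns, `^` centers, `>` right-aligns, and the nested braces let the column width come from a variable.

```python
rows = [
    ("runner", "NOUN", "runner"),
    ("running", "VERB", "run"),
    (",", "PUNCT", ","),
]

width = 10  # column width; the nested braces below read it at format time

lines = [
    # '<' pads on the right, '^' pads on both sides, '>' pads on the left
    f"{text:<{width}}|{pos:^{width}}|{lemma:>{width}}"
    for text, pos, lemma in rows
]
print("\n".join(lines))
```

Every line comes out the same length, which is what keeps the columns aligned regardless of token length.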


## Stop words

In [124]:
print((nlp.Defaults.stop_words))
print('\n',len(nlp.Defaults.stop_words))

{'say', 'during', 'please', 'forty', 'less', 'part', 'upon', 'whatever', 'if', 'about', 'somewhere', 'always', 'hereupon', 'below', 'former', 'more', 'same', 'toward', 'really', 'another', 'fifty', 'my', 'mostly', 'alone', 'much', 'one', 'own', 'nowhere', 'who', 'behind', 'again', 'for', 'besides', 'make', 'here', 'perhaps', 'there', 'whether', 'yet', 'otherwise', 'yours', 'many', 'namely', 'afterwards', 'side', 'he', 'due', 'therefore', 'a', 'on', 'at', 'them', 'twelve', 'also', 'almost', 'amount', 'does', 'ours', 'via', 'by', 'everywhere', 'into', 'whereafter', 'yourselves', 'move', 'too', 'last', 'latter', 'therein', 'himself', 'is', 'off', 'call', 'after', 'five', 'fifteen', 'so', 'hereby', 'anyway', 'none', 'not', 'over', 'third', 'just', 'show', 'where', 'among', 'in', 'being', 'we', 'whereas', 'whole', 'whoever', 'still', 'whose', 'even', 'along', 'its', 'they', 'above', 'two', 'unless', 'us', 'various', 'without', 'between', 'can', 'might', 'ever', 'up', 'why', 'whereupon', 'be

In [125]:
nlp.vocab['true'].is_stop

False

In [126]:
nlp.vocab['whether'].is_stop

True

In [127]:
nlp.vocab['btw'].is_stop

False

In [132]:
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop = True

In [131]:
nlp.vocab['btw'].is_stop

True

In [134]:
print(len(nlp.Defaults.stop_words))

306


In [135]:
# remove (not add) the word from the default set, then clear its flag
nlp.Defaults.stop_words.remove('latter')
nlp.vocab['latter'].is_stop = False

In [136]:
nlp.vocab['latter'].is_stop

False
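
The usual reason to maintain the stop-word list is to filter tokens with it. A minimal sketch, assuming a tokenizer-only pipeline from `spacy.blank("en")` (stop words come from the language defaults, so no trained model is needed) and a made-up example sentence:

```python
import spacy

nlp = spacy.blank("en")  # tokenizer and language defaults only

# Register 'btw' in both places, as in the cells above.
nlp.Defaults.stop_words.add("btw")
nlp.vocab["btw"].is_stop = True

doc = nlp("btw this is the first day of learning spaCy")

# Keep only the tokens that are not flagged as stop words.
content_words = [token.text for token in doc if not token.is_stop]
print(content_words)
```

Note that `nlp.Defaults.stop_words` is shared at the language level, so the change applies to every English pipeline created in the same session.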