### Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

##### Example use cases
- Classify emails as Spam or Legitimate
- Sentiment analysis of movie or product reviews
- Analyzing trends in written customer feedback
- Understanding text commands

#### Key steps for working with spaCy
- Loading the library
- Building the pipeline object
- Using tokens
- POS (part-of-speech) tagging
- Understanding token attributes
 

#### The nlp() object from spaCy takes raw text and automatically tokenizes, parses and describes it

### Install spaCy from the command line with the command below
pip install -U spacy


In [2]:
import spacy

In [3]:
from spacy.lang.en import English

In [5]:
nlp=English()

In [6]:
nlp

<spacy.lang.en.English at 0x225c8e1e390>

In [7]:
nlp.pipeline  # English() is a blank pipeline: tokenizer only, no tagger, parser or NER

[]

In [8]:
nlp.pipe_names

[]

In [9]:
text="The President’s SOTU Address is coming in the middle of his impeachment trial in the Senate"

In [10]:
doc=nlp(text)

In [11]:
# With the blank pipeline, pos_, tag_ and dep_ print as empty strings
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

The The    Xxx True True
President President    Xxxxx True False
’s ’s    ’x False True
SOTU SOTU    XXXX True False
Address Address    Xxxxx True False
is is    xx True True
coming coming    xxxx True False
in in    xx True True
the the    xxx True True
middle middle    xxxx True False
of of    xx True True
his his    xxx True True
impeachment impeachment    xxxx True False
trial trial    xxxx True False
in in    xx True True
the the    xxx True True
Senate Senate    Xxxxx True False
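The `shape_` column in the output above follows a simple pattern: uppercase letters become `X`, lowercase letters become `x`, digits become `d`, other characters are kept, and a run of the same symbol is cut off after four. The function below is a minimal pure-Python sketch of that rule, written to reproduce the shapes printed in this notebook. `word_shape` is a hypothetical helper name used for illustration here and should not be read as spaCy's actual implementation.

```python
def word_shape(text: str) -> str:
    """Approximate the token.shape_ values shown in the output above."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            shape_char = "X" if char.isupper() else "x"
        elif char.isdigit():
            shape_char = "d"
        else:
            shape_char = char  # punctuation and symbols are kept as-is
        # Count how long the current run of identical shape symbols is
        if shape_char == last:
            run += 1
        else:
            run = 1
            last = shape_char
        # Only the first four symbols of a run are kept
        if run <= 4:
            shape.append(shape_char)
    return "".join(shape)


for word in ["The", "President", "’s", "SOTU", "impeachment", "10"]:
    print(word, word_shape(word))
```

Truncating long runs means that words such as `impeachment` and `trial` share the shape `xxxx`, which is why the shape is useful as a coarse feature rather than as an exact description of the word.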


In [12]:
doc = nlp("TDS on Mutual Funds: I-T Dept clarifies 10% tax at source only on dividends, not on capital gains")

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
            token.shape_, token.is_alpha, token.is_stop)

TDS TDS    XXX True False
on on    xx True True
Mutual Mutual    Xxxxx True False
Funds Funds    Xxxxx True False
: :    : False False
I I    X True True
- -    - False False
T T    X True False
Dept Dept    Xxxx True False
clarifies clarifies    xxxx True False
10 10    dd False False
% %    % False False
tax tax    xxx True False
at at    xx True True
source source    xxxx True False
only only    xxxx True True
on on    xx True True
dividends dividends    xxxx True False
, ,    , False False
not not    xxx True True
on on    xx True True
capital capital    xxxx True False
gains gains    xxxx True False


In [13]:
nlp.pipeline

[]

In [14]:
len(doc)

23
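The count of 23 is higher than the number of whitespace-separated words because punctuation marks such as `:`, `-`, `%` and `,` are tokens of their own. As a rough illustration only, the regex below happens to reproduce the same split for this particular headline. `rough_tokenize` is a hypothetical helper, and spaCy's tokenizer uses language-specific rules and exceptions that this sketch does not capture (for example, it would split `President’s` differently).

```python
import re


def rough_tokenize(text):
    # \w+ grabs runs of letters/digits; [^\w\s] grabs each punctuation mark on its own
    return re.findall(r"\w+|[^\w\s]", text)


headline = ("TDS on Mutual Funds: I-T Dept clarifies 10% tax at source "
            "only on dividends, not on capital gains")
tokens = rough_tokenize(headline)
print(len(tokens))
print(tokens)
```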

In [15]:
len(doc.vocab)

509

In [16]:
doc2=nlp('it is better to give than receive')

In [17]:
doc2[2]

better

In [18]:
doc2[4:10]

give than receive

In [19]:
doc2[3]='worthy' ## Doc objects do not support item assignment, so a token cannot be overwritten

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment

In [20]:
doc3=nlp('Ronaldo now has 50 Juventus goals')

In [21]:
## separate the tokens
for token in doc3:
    print(token.text,end=' | ')

Ronaldo | now | has | 50 | Juventus | goals | 

In [22]:
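## doc3.ents is empty here because the blank pipeline has no entity recognizer, so nothing is printed
for entity in doc3.ents:
    print(str(spacy.explain(entity.label_)))
    print('\n')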

In [23]:
doc4=nlp('production loss is attributed to inefficiency')

In [24]:
for chunk in doc4.noun_chunks:
    print(chunk)

ValueError: [E029] noun_chunks requires the dependency parse, which requires a statistical model to be installed and loaded. For more info, see the documentation:
https://spacy.io/usage/models

In [25]:
from spacy import displacy 

In [26]:
doc4=nlp('Data science is an inter-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data.')

In [30]:
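## options must be a dict; {'distance:110'} is a set literal, which causes the AttributeError below
## corrected call: displacy.render(doc4, jupyter=True, options={'distance': 110})
## (dependency arcs are only drawn when the pipeline includes a parser)
displacy.render(doc4,jupyter=True,options={'distance:110'})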

AttributeError: 'set' object has no attribute 'get'

In [32]:
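## token.lemma (no trailing underscore) is the integer hash of the lemma; use token.lemma_ for the string
## token.pos_ is empty because the blank pipeline has no tagger
for token in doc4:
    print(token.text,'\t',token.pos_,'\t',token.lemma,'\t')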

Data 	  	 7370109823176849439 	
science 	  	 15907409978124634296 	
is 	  	 3411606890003347522 	
an 	  	 15099054000809333061 	
inter 	  	 11115815504542332625 	
- 	  	 9153284864653046197 	
disciplinary 	  	 15771201057713930352 	
field 	  	 14195085135452951260 	
that 	  	 4380130941430378203 	
uses 	  	 15169213402950300534 	
scientific 	  	 12096978277017718205 	
methods 	  	 2524309096534353778 	
, 	  	 2593208677638477497 	
processes 	  	 2329187326147364143 	
, 	  	 2593208677638477497 	
algorithms 	  	 17871777411560780215 	
and 	  	 2283656566040971221 	
systems 	  	 2784036749983215141 	
to 	  	 3791531372978436496 	
extract 	  	 1199617523884050149 	
knowledge 	  	 841909098595075712 	
and 	  	 2283656566040971221 	
insights 	  	 9238170900982487017 	
from 	  	 7831658034963690409 	
structured 	  	 18091975700715099219 	
and 	  	 2283656566040971221 	
unstructured 	  	 8720796447648573112 	
data 	  	 6645506661261177361 	
. 	  	 12646065887601541794 	


#### Stop words


Words that appear so frequently that they need not be treated as distinct words, add no information and do not require thorough tagging are generally referred to as stop words. They can be filtered out of the text before processing.
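As a minimal sketch of that filtering step, the snippet below drops stop words from the tokens of the first example sentence. The `STOP_WORDS` set here is a tiny hand-picked subset chosen for illustration (each entry is one of the tokens printed with `is_stop` equal to `True` earlier in this notebook), and `remove_stop_words` is a hypothetical helper, not part of spaCy.

```python
# Tiny illustrative subset of stop words, not spaCy's full default list
STOP_WORDS = {"the", "is", "in", "of", "his", "’s"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    # Compare in lowercase so that "The" matches "the"
    return [tok for tok in tokens if tok.lower() not in stop_words]


tokens = ["The", "President", "’s", "SOTU", "Address", "is", "coming", "in",
          "the", "middle", "of", "his", "impeachment", "trial", "in", "the",
          "Senate"]
content_words = remove_stop_words(tokens)
print(content_words)
```

With a spaCy `Doc`, the same idea is usually written as a filter on `token.is_stop`.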

In [34]:
print(nlp.Defaults.stop_words) ### retrieve all the default English stop words

{'here', 'these', 'just', 'many', 'rather', 'throughout', 'fifty', 'anyhow', 'since', 'ca', 'nowhere', 'only', 'sometimes', 'much', 'seeming', 'empty', 'first', 'n‘t', '’ve', 'any', 'between', 'wherein', 'therefore', 'alone', 'sixty', 'hereupon', 'off', 'and', 'all', 'enough', 'anything', 'however', 'where', 'he', 'yourselves', 'whereby', 'him', 'say', 'made', 'might', 'would', 'call', 'whereupon', 'than', 'except', 'them', 'doing', 'with', 'could', 'their', 'during', 'twelve', 'everything', 'quite', 'themselves', '‘s', 'last', 'various', 'towards', 'well', 'beyond', "'ll", 'beside', 'nine', "'m", 'using', 'over', 'nothing', 'otherwise', 'nobody', 'amount', 'never', 'back', 'i', 'against', 'moreover', 'everyone', 'third', 'a', 'above', 'becoming', 'who', 'n’t', 'always', 'via', 'three', 'being', 'less', 're', 'out', 'ourselves', 'also', 'an', 'in', 'nor', 'keep', 'beforehand', 'latter', 'thence', 'after', 'name', 'own', 'now', '‘re', 'down', 'bottom', 'wherever', 'noone', 'almost', 'an

In [35]:
len(nlp.Defaults.stop_words)

326

In [37]:
## check if a word is a stop word
print(nlp.vocab['with'].is_stop)
print(nlp.vocab['hello'].is_stop)

True
False


In [38]:
## add a new word to the default stop word set
nlp.Defaults.stop_words.add('btw')
nlp.vocab['btw'].is_stop=True
len(nlp.Defaults.stop_words) ## length has increased from 326 to 327

327

In [39]:
## remove a word from the default stop word set
nlp.Defaults.stop_words.remove('beyond')
nlp.vocab['beyond'].is_stop=False
len(nlp.Defaults.stop_words)

326