# spaCy Basics
# spaCy (https://spacy.io/) is an open-source Python library that parses and "understands" large volumes of text. Separate models are available that cater to specific languages (English, French, German, etc.).
# In this section we'll install and set up spaCy to work with Python, and then introduce some concepts related to Natural Language Processing.

# spaCy is much faster than NLTK, but spaCy does not include pre-created models for tasks such as sentiment analysis
#  which is typically easier than 

# We will perform the following:
# loading a language library
# building a pipeline object
# using tokens
# POS tagging
# understanding token attributes


In [1]:
# loading language library 
import spacy 
nlp = spacy.load('en_core_web_sm')

In [6]:
# Create a Doc object
doc = nlp(u'Tesla is looking at buying U.S. startup for $6 million')


In [7]:
# Print each token separately
for token in doc:
    print(token.text, token.pos_, token.dep_)

Tesla PROPN nsubj
is VERB aux
looking VERB ROOT
at ADP prep
buying VERB pcomp
U.S. PROPN compound
startup NOUN dobj
for ADP prep
$ SYM quantmod
6 NUM compound
million NUM pobj


# This doesn't look very user-friendly, but right away we see some interesting things happen:
# Tesla is recognized to be a Proper Noun, not just a word at the start of a sentence
# U.S. is kept together as one entity (we call this a 'token')

# Pipeline
# When we run nlp, our text enters a processing pipeline that first breaks down the text and then performs a series of operations to tag, parse and describe the data.
# A diagram of the pipeline is available at https://spacy.io/usage/spacy-101#pipelines

In [8]:
nlp.pipeline

[('tagger', <spacy.pipeline.Tagger at 0x7f6d3adb4e10>),
 ('parser', <spacy.pipeline.DependencyParser at 0x7f6d64573888>),
 ('ner', <spacy.pipeline.EntityRecognizer at 0x7f6d3b136c50>)]

In [9]:
nlp.pipe_names

['tagger', 'parser', 'ner']
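
# The idea of a pipeline (tokenize first, then pass the result through an ordered list of named components) can be sketched in plain Python.
# This is only a toy illustration and not how spaCy is implemented; make_doc, add_is_title, add_is_digit and run_pipeline are hypothetical names invented for this sketch.

```python
# Toy pipeline sketch in plain Python; NOT how spaCy is implemented.
# make_doc, add_is_title, add_is_digit and run_pipeline are hypothetical names.

def make_doc(text):
    """Step 1: break the text into tokens (here: a naive whitespace split)."""
    return [{"text": piece} for piece in text.split()]

def add_is_title(doc):
    """Annotate each token with whether it starts with an uppercase letter."""
    for token in doc:
        token["is_title"] = token["text"][:1].isupper()
    return doc

def add_is_digit(doc):
    """Annotate each token with whether it consists only of digits."""
    for token in doc:
        token["is_digit"] = token["text"].isdigit()
    return doc

# Ordered (name, component) pairs, mirroring the shape of nlp.pipeline
toy_pipeline = [("is_title", add_is_title), ("is_digit", add_is_digit)]

def run_pipeline(text):
    doc = make_doc(text)
    for _name, component in toy_pipeline:
        doc = component(doc)  # each component receives and returns the doc
    return doc

toy_doc = run_pipeline("Tesla is looking at buying U.S. startup for $6 million")
toy_pipe_names = [name for name, _ in toy_pipeline]
print(toy_pipe_names)  # ['is_title', 'is_digit']
print(toy_doc[0])      # {'text': 'Tesla', 'is_title': True, 'is_digit': False}
```

# Each component only adds annotations to the tokens it is given, which is why the order of the components matters when a later one depends on an earlier one.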

# Tokenization

# The first step in processing text is to split up all the component parts (words & punctuation) into "tokens".
# These tokens are annotated inside the Doc object to contain descriptive information. 
# We'll go into much more detail on tokenization in an upcoming lecture. For now, let's look at another example:

In [11]:
doc1 = nlp(u"Asmita Chatterjee isn't taken for granted.")

In [12]:
for token in doc1:
    print(token.pos_, token.text, token.dep_)

PROPN Asmita compound
PROPN Chatterjee nsubjpass
VERB is auxpass
ADV n't neg
VERB taken ROOT
ADP for mark
VERB granted advcl
PUNCT . punct


# Notice how isn't has been split into two tokens. spaCy recognizes both the verb is (tagged here as a passive auxiliary, auxpass) and the negation attached to it.
# Notice also that the period at the end of the sentence is assigned its own token.
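
# To get a feel for why isn't becomes two tokens, here is a minimal regex-based sketch that splits off the n't clitic and punctuation.
# It is a toy written for this note (regex_tokenize is a hypothetical name); spaCy's real tokenizer relies on language-specific rules rather than a single pattern.

```python
import re

# Alternatives are tried left to right at each position:
#   n't          the negation clitic itself
#   \w+(?=n't)   a word stem that is directly followed by n't
#   \w+          any other word
#   [^\w\s]      a single punctuation character
TOKEN_RE = re.compile(r"n't|\w+(?=n't)|\w+|[^\w\s]")

def regex_tokenize(text):
    """Return a flat list of token strings (toy approximation only)."""
    return TOKEN_RE.findall(text)

print(regex_tokenize("Asmita Chatterjee isn't taken for granted."))
# ['Asmita', 'Chatterjee', 'is', "n't", 'taken', 'for', 'granted', '.']
```

# A single pattern like this breaks down quickly: it would split U.S. into four pieces, which is one reason a real tokenizer needs exception rules for abbreviations.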

In [13]:
doc1

Asmita Chatterjee isn't taken for granted.

In [14]:
doc1[0]

Asmita

In [15]:
doc1[4]

taken

In [17]:
doc1[2]

is

In [18]:
type(doc1)

spacy.tokens.doc.Doc

# Part-of-Speech Tagging (POS)

# The next step after splitting the text up into tokens is to assign parts of speech. In the above example, Tesla was recognized to be a proper noun. Here some statistical modeling is required.
# For example, words that follow "the" are typically nouns.
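
# The intuition that words following "the" are typically nouns can be made concrete by counting tags in labeled text.
# The sketch below uses a tiny hand-tagged word list that was made up for illustration (it is not model output), and it only counts; a real tagger learns from far more context.

```python
from collections import Counter

# Hand-tagged toy data: an assumption invented for this example.
tagged = [
    ("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"), ("on", "ADP"),
    ("the", "DET"), ("mat", "NOUN"), ("near", "ADP"),
    ("the", "DET"), ("old", "ADJ"), ("door", "NOUN"),
]

# Count which tag appears immediately after each occurrence of "the"
after_the = Counter(
    next_tag
    for (word, _), (_, next_tag) in zip(tagged, tagged[1:])
    if word.lower() == "the"
)

print(after_the.most_common())  # [('NOUN', 2), ('ADJ', 1)]
```

# Even in this tiny sample an adjective intervenes once, which hints at why a statistical model is needed rather than a fixed rule.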

In [19]:
doc1[0].pos_

'PROPN'

# Dependencies

In [22]:
doc1[1].dep_

'nsubjpass'

In [23]:
spacy.explain('nsubjpass')

'nominal subject (passive)'

In [24]:
doc1[7].dep_

'punct'

In [25]:
spacy.explain('punct')

'punctuation'

# Additional Token Attributes

In [26]:
print(doc1[4].text)

taken


In [28]:
print(doc1[4].lemma_)

take


In [29]:
print(doc1[4].pos_)

VERB


In [30]:
print(doc1[4].tag_)

VBN


In [35]:
spacy.explain('VBN')

'verb, past participle'

In [36]:
doc1[4].shape_

'xxxx'
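
# The shape attribute describes a token's orthographic pattern. Note that taken has five letters but its shape is only four characters long.
# The sketch below is a rough approximation of that behaviour, assuming lowercase letters map to x, uppercase to X, digits to d, and runs of the same symbol are capped at four. The function name toy_shape is hypothetical.

```python
def toy_shape(text):
    """Approximate a word-shape string; a sketch, not spaCy's own code."""
    shape = []
    last = ""
    run = 0  # how many times the current symbol has repeated so far
    for char in text:
        if char.isalpha():
            symbol = "X" if char.isupper() else "x"
        elif char.isdigit():
            symbol = "d"
        else:
            symbol = char  # punctuation and symbols are kept as they are
        if symbol == last:
            run += 1
        else:
            run = 0
            last = symbol
        if run < 4:  # keep at most four of the same symbol in a row
            shape.append(symbol)
    return "".join(shape)

print(toy_shape("taken"))  # xxxx
print(toy_shape("U.S."))   # X.X.
print(toy_shape("10.30"))  # dd.dd
```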

# Spans

# Large Doc objects can be hard to work with at times. A span is a slice of a Doc object in the form Doc[start:stop].

In [37]:
doc3 = nlp(u'Although commonly attributed to John Lennon from his song "Beautiful Boy", \
the phrase "Life is what happens to us while we are making other plans" was written by \
cartoonist Allen Saunders and published in Reader\'s Digest in 1957, when Lennon was 17.')

In [40]:
life_quote = doc3[16:30]
print(life_quote)

"Life is what happens to us while we are making other plans"


In [41]:
life_quote_1 = doc3[31:41]

In [42]:
print(life_quote_1)

written by cartoonist Allen Saunders and published in Reader's


# Sentences 

# Certain tokens inside a Doc object may also receive a "start of sentence" tag. While this doesn't immediately build a list of sentences, these tags enable the generation of sentence segments through Doc.sents.
# Later we'll write our own segmentation rules. 

In [43]:
doc4 = nlp(u'This is the first sentence. This is another sentence. This is the last sentence.')

In [44]:
for sent in doc4.sents:
    print(sent)
    

This is the first sentence.
This is another sentence.
This is the last sentence.


In [45]:
doc4[6].is_sent_start

True

In [48]:
doc4[10].is_sent_start
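
# How per-token "start of sentence" flags turn into sentence segments can be sketched without spaCy.
# The token list and flag positions below are written out by hand for the three-sentence example (assuming sentences start at tokens 0, 6 and 11), and sentences_from_flags is a hypothetical helper.

```python
def sentences_from_flags(tokens, is_sent_start):
    """Yield lists of tokens, starting a new list at every flagged token."""
    sentence = []
    for token, starts in zip(tokens, is_sent_start):
        if starts and sentence:
            yield sentence
            sentence = []
        sentence.append(token)
    if sentence:  # emit the final sentence
        yield sentence

toy_tokens = [
    "This", "is", "the", "first", "sentence", ".",
    "This", "is", "another", "sentence", ".",
    "This", "is", "the", "last", "sentence", ".",
]
# One boolean per token; True marks the first token of a sentence
toy_flags = [i in (0, 6, 11) for i in range(len(toy_tokens))]

toy_sents = [" ".join(s) for s in sentences_from_flags(toy_tokens, toy_flags)]
for s in toy_sents:
    print(s)
```

# The sentences are produced lazily from the flags, which matches the description of tags enabling segmentation rather than a stored list of sentences.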

# Tokenization

In [2]:
# Create a string that includes opening and closing quotation marks

mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


In [3]:
# Create a Doc object and explore tokens
doc = nlp(mystring)


In [4]:
for token in doc:
    print(token.text, end='|')

"|We|'re|moving|to|L.A.|!|"|

In [6]:
for token in doc:
    print(token.pos_, token.tag_, token.text)

PUNCT `` "
PRON PRP We
VERB VBP 're
VERB VBG moving
ADP IN to
PROPN NNP L.A.
PUNCT . !
PUNCT '' "
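
# The way quotes, the 're clitic and the final punctuation are peeled away while L.A. stays whole can be sketched as a small rule-based tokenizer.
# This is a simplified toy, not spaCy's actual code: split on whitespace, then repeatedly strip prefixes and suffixes unless the chunk is a known exception. The rule tables and the names tokenize_chunk and toy_tokenize are hypothetical.

```python
# Small sample rule tables, invented for this sketch.
PREFIXES = set('"([')
SUFFIXES = set('")].,!?')
CLITICS = ("'re", "n't", "'s")
EXCEPTIONS = {"L.A.", "U.S."}  # chunks that must never be split

def tokenize_chunk(chunk):
    """Peel prefixes and suffixes off one whitespace-delimited chunk."""
    prefixes, suffixes = [], []
    while chunk and chunk not in EXCEPTIONS:
        clitic = next(
            (c for c in CLITICS if chunk.endswith(c) and len(chunk) > len(c)),
            None,
        )
        if chunk[0] in PREFIXES:
            prefixes.append(chunk[0])
            chunk = chunk[1:]
        elif chunk[-1] in SUFFIXES:
            suffixes.insert(0, chunk[-1])
            chunk = chunk[:-1]
        elif clitic:
            suffixes.insert(0, clitic)
            chunk = chunk[:-len(clitic)]
        else:
            break  # nothing left to peel
    core = [chunk] if chunk else []
    return prefixes + core + suffixes

def toy_tokenize(text):
    return [tok for chunk in text.split() for tok in tokenize_chunk(chunk)]

print(toy_tokenize('"We\'re moving to L.A.!"'))
# ['"', 'We', "'re", 'moving', 'to', 'L.A.', '!', '"']
```

# Checking the exception table before each peeling step is what keeps the final period attached to L.A. even though a period is otherwise treated as a suffix.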


In [7]:
doc_5 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

In [8]:
for t in doc_5:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


In [13]:
doc_6 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc_6:
    print(t, t.text, t.tag_, t.lemma_, spacy.explain(t.tag_))

A A DT a determiner
5 5 CD 5 cardinal number
km km NN km noun, singular or mass
NYC NYC NNP nyc noun, proper singular
cab cab NN cab noun, singular or mass
ride ride NN ride noun, singular or mass
costs costs VBZ cost verb, 3rd person singular present
$ $ $ $ symbol, currency
10.30 10.30 CD 10.30 cardinal number


# Counting Tokens

In [14]:
len(doc_6)

9

# Counting Vocab Entries

# Vocab objects contain a full library of items!

In [15]:
len(doc_6.vocab)

57852

# Tokens can be retrieved by index position and slice

In [16]:
doc5 = nlp(u'It is better to give than to receive.')

# Retrieve the third token:
doc5[2]

better

In [17]:
# Retrieve three tokens from the middle:
doc5[2:5]

better to give
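
# Slicing needs token offsets, and counting them by hand gets tedious for longer texts.
# As a minimal sketch, the hypothetical helper find_span below searches a plain list of token strings for a phrase and returns the start and stop offsets to slice with. The token list is written out by hand for the sentence above.

```python
def find_span(tokens, phrase):
    """Return (start, stop) of the first occurrence of phrase, or None."""
    n = len(phrase)
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == phrase:
            return start, start + n
    return None

toy_tokens = ["It", "is", "better", "to", "give", "than", "to", "receive", "."]

offsets = find_span(toy_tokens, ["better", "to", "give"])
print(offsets)  # (2, 5)

start, stop = offsets
print(toy_tokens[start:stop])  # ['better', 'to', 'give']
```

# As with Python lists, the stop offset is exclusive, so the slice 2:5 covers the tokens at positions 2, 3 and 4.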

# Named Entities 

# Going a step beyond tokens, named entities add another layer of context. 
# The language model recognizes that certain words are organizational names while others are locations,
# and still other combinations relate to money, dates, etc.
# Named entities are accessible through the ents property of a Doc object.

In [18]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')


In [19]:
for token in doc8:
    print(token.text, end=' | ')


Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [20]:
for ent in doc8.ents:
    print(ent.text+' - '+ent.label_+' - '+str(spacy.explain(ent.label_)))

Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit
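
# One way to see entities as a layer on top of tokens is to give every token a tag that says whether it begins (B), continues (I) or lies outside (O) an entity.
# The sketch below builds such tags in plain Python. The token offsets were worked out by counting the tokens printed above, and to_bio is a hypothetical helper, not a spaCy function.

```python
toy_tokens = ["Apple", "to", "build", "a", "Hong", "Kong",
              "factory", "for", "$", "6", "million"]

# (start, end, label) with an exclusive end, counted from the token list above
toy_entities = [(0, 1, "ORG"), (4, 6, "GPE"), (8, 11, "MONEY")]

def to_bio(n_tokens, spans):
    """Turn entity spans into one B-/I-/O tag per token."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags

bio_tags = to_bio(len(toy_tokens), toy_entities)
for token, tag in zip(toy_tokens, bio_tags):
    print(token, tag)
```

# Hong and Kong receive B-GPE and I-GPE respectively, which is how two separate tokens end up reported as the single entity Hong Kong.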


# Noun Chunks

# Similar to Doc.ents, Doc.noun_chunks is another property of the Doc object.
# Noun chunks are "base noun phrases" – flat phrases that have a noun as their head.
# You can think of noun chunks as a noun plus the words describing the noun – for example, 
# in Sheb Wooley's 1958 song, a "one-eyed, one-horned, flying, purple people-eater" would be one long noun chunk.

In [21]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

In [22]:
for chunk in doc9.noun_chunks:
    print(chunk.text)

Autonomous cars
insurance liability
manufacturers


In [23]:
doc10 = nlp(u"Red cars do not carry higher insurance rates.")

for chunk in doc10.noun_chunks:
    print(chunk.text)

Red cars
higher insurance rates
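
# The notion of "a noun plus the words describing it" can be approximated by grouping runs of adjectives and nouns.
# The sketch below is a deliberately crude stand-in: spaCy derives noun chunks from the dependency parse, whereas this toy only looks at part-of-speech tags that were assigned by hand for the two example sentences. The name toy_noun_chunks is hypothetical.

```python
def toy_noun_chunks(tagged):
    """Group maximal ADJ/NOUN/PROPN runs that end in a noun."""
    chunks, current = [], []
    for word, pos in tagged + [("", "END")]:  # sentinel flushes the last run
        if pos in {"ADJ", "NOUN", "PROPN"}:
            current.append((word, pos))
            continue
        # The run has ended: drop trailing adjectives so it ends on a noun
        while current and current[-1][1] == "ADJ":
            current.pop()
        if current:
            chunks.append(" ".join(w for w, _ in current))
        current = []
    return chunks

# Hand-assigned tags (an assumption for illustration, not model output)
sent_a = [("Autonomous", "ADJ"), ("cars", "NOUN"), ("shift", "VERB"),
          ("insurance", "NOUN"), ("liability", "NOUN"), ("toward", "ADP"),
          ("manufacturers", "NOUN"), (".", "PUNCT")]
sent_b = [("Red", "ADJ"), ("cars", "NOUN"), ("do", "AUX"), ("not", "PART"),
          ("carry", "VERB"), ("higher", "ADJ"), ("insurance", "NOUN"),
          ("rates", "NOUN"), (".", "PUNCT")]

print(toy_noun_chunks(sent_a))
print(toy_noun_chunks(sent_b))
```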


# Built-in Visualizers

# Visualizing the dependency parse
# Run the cell below to import displacy and display the dependency graphic

In [26]:
from spacy import displacy

doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')
displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})