In [1]:
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy v3

# Tokenization
The first step in creating a `Doc` object is to break down the incoming text into component pieces or "tokens".

In [2]:
# Create a string that includes opening and closing quotation marks
mystring = '"We\'re moving to L.A.!"'
print(mystring)

"We're moving to L.A.!"


Break the sentence into tokens

In [3]:
doc = nlp(mystring)

for token in doc:
    print(token.text, end = ' | ')

" | We | 're | moving | to | L.A. | ! | " | 

Let's create a more complex sentence. The sentence below contains a hyphen, as well as `.` characters that are not sentence-ending periods. It also contains an email address and a website URL.

In [4]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!


<font color=green>Note that the exclamation points, comma, and the hyphen in 'snail-mail' are assigned their own tokens, yet both the email address and website are preserved.</font>

In [5]:
doc3 = nlp(u'A 5km NYC cab ride costs $10.30')

for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


<font color = green>Here the distance unit and dollar sign are assigned their own tokens, yet the dollar amount is preserved.</font>

## Exceptions
Punctuation that exists as part of a known abbreviation will be kept as part of the token.

In [6]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


Here the abbreviations for "Saint" and "United States" are both preserved.
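
The idea behind these exceptions can be sketched in plain Python. This is only a toy illustration, not spaCy's tokenizer: the function `toy_tokenize` and the tiny `EXCEPTIONS`, `PREFIXES`, and `SUFFIXES` tables are hypothetical names invented for this example, and unlike spaCy the sketch does not split contractions such as `Let's`.

```python
# Hypothetical, hand-picked tables for illustration only.
# spaCy uses much larger, language-specific rules.
EXCEPTIONS = {"St.", "U.S.", "L.A."}
PREFIXES = '"$('
SUFFIXES = '"!?.,)'


def toy_tokenize(text):
    """Split on whitespace, then peel punctuation off each chunk
    unless the chunk is a known abbreviation."""
    tokens = []
    for chunk in text.split():
        trailing = []
        # Peel leading punctuation such as an opening quote or a dollar sign.
        while chunk and chunk not in EXCEPTIONS and chunk[0] in PREFIXES:
            tokens.append(chunk[0])
            chunk = chunk[1:]
        # Peel trailing punctuation, stopping once what remains is an exception.
        while chunk and chunk not in EXCEPTIONS and chunk[-1] in SUFFIXES:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


print(toy_tokenize("Let's visit St. Louis in the U.S. next year."))
print(toy_tokenize('"We\'re moving to L.A.!"'))
```

Because the exception check runs before each character is peeled off, `St.` and `U.S.` keep their periods while the final `year.` is split into `year` and `.`.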

## Counting Tokens

A `Doc` object has a fixed number of tokens, which `len()` returns:

In [7]:
len(doc)

8

## Counting Vocab Entries
A `Vocab` object holds the full library of entries known to the loaded language model, not just the words in one `Doc`:

In [8]:
len(doc.vocab)

57852

The number of vocab entries depends on the language model loaded at the start.

## Tokens can be retrieved by index position and slice
`Doc` objects can be thought of as lists of `Token` objects. As such, individual tokens can be retrieved by index position, and spans of tokens can be retrieved through slicing:

In [9]:
doc5 = nlp(u'It is better to give than to receive.')

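# Retrieve the third token:
print(doc5[2])

# Retrieve three tokens from the middle (positions 2, 3 and 4):
print(doc5[2:5])

# Retrieve the fourth-from-last token:
print(doc5[-4])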

better
better to give
than


## Tokens cannot be reassigned
Although `Doc` objects can be considered lists of tokens, they do *not* support item reassignment:

In [10]:
doc6 = nlp(u'My dinner was horrible.')
doc7 = nlp(u'Your dinner was delicious.')
# Try to change "My dinner was horrible" to "My dinner was delicious"
doc6[3] = doc7[3]

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
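
The same behaviour can be reproduced with a small pure-Python class. This is a rough sketch of the general Python protocol, not of how spaCy implements `Doc`: the class `ReadOnlyTokens` is a hypothetical stand-in that defines `__getitem__` but no `__setitem__`, so reading by index or slice works while assignment raises a `TypeError`.

```python
class ReadOnlyTokens:
    """Hypothetical stand-in for a Doc: readable by index and slice, not assignable."""

    def __init__(self, words):
        self._words = tuple(words)

    def __len__(self):
        return len(self._words)

    def __getitem__(self, index):
        # Handles both integer positions and slices.
        return self._words[index]

    # No __setitem__ is defined, so `obj[i] = value` raises TypeError.


tokens = ReadOnlyTokens(["My", "dinner", "was", "horrible", "."])

try:
    tokens[3] = "delicious"
except TypeError as err:
    print(err)

# Instead of modifying in place, build a new sequence with the change applied.
replaced = ReadOnlyTokens(
    "delicious" if word == "horrible" else word for word in tokens
)
print(replaced[3])
```

With a real `Doc`, a comparable approach would be to build the modified text and pass it through `nlp` again, so that the new document is tokenized and analyzed from scratch.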

___
# Named Entities
Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [11]:
doc8 = nlp(u'Apple to build a Hong Kong factory for $6 million')

for token in doc8:
    print(token.text, end=' | ')

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [12]:
for ent in doc8.ents:
    print(f'{ent.text} - {ent.label_} - {spacy.explain(ent.label_)}')

Apple - ORG - Companies, agencies, institutions, etc.
Hong Kong - GPE - Countries, cities, states
$6 million - MONEY - Monetary values, including unit


<font color=green>Note how two tokens combine to form the entity `Hong Kong`, and three tokens combine to form the monetary entity:  `$6 million`</font>
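
One way to picture this is to treat each entity as a labelled slice of the token list. The sketch below uses plain Python and hand-written data that mirrors the output above. The offsets in `entity_spans` and the helper `span_tokens` are assumptions made for illustration; in spaCy each entity is a span whose `start` and `end` attributes give these token offsets.

```python
# Tokens copied from the output of In [11].
tokens = ["Apple", "to", "build", "a", "Hong", "Kong",
          "factory", "for", "$", "6", "million"]

# Hand-written (start, end, label) triples mirroring the output of In [12].
# `end` is exclusive, as in ordinary Python slicing.
entity_spans = [(0, 1, "ORG"), (4, 6, "GPE"), (8, 11, "MONEY")]


def span_tokens(tokens, spans):
    """Return the list of tokens covered by each labelled span."""
    return [(tokens[start:end], label) for start, end, label in spans]


for words, label in span_tokens(tokens, entity_spans):
    print(label, len(words), words)
```

The `GPE` entity covers two tokens and the `MONEY` entity covers three, which matches the note above.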

In [13]:
len(doc8.ents)

3

---
# Noun Chunks
Similar to `Doc.ents`, `Doc.noun_chunks` is another property of the `Doc` object. *Noun chunks* are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in [Sheb Wooley's 1958 song](https://en.wikipedia.org/wiki/The_Purple_People_Eater), a *"one-eyed, one-horned, flying, purple people-eater"* would be one long noun chunk.

In [14]:
doc9 = nlp(u"Jack was a one-eyed, one-horned, flying, purple people-eater")

for chunk in doc9.noun_chunks:
    print(chunk.text)

Jack
, one-horned, flying, purple people-eater


In [15]:
doc10 = nlp(u"High beta stocks may yield higher returns during a prolonged bull market, but with a higher volatility!")

for chunk in doc10.noun_chunks:
    print(chunk.text)

High beta stocks
higher returns
a prolonged bull market
a higher volatility


___
# Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

In [16]:
from spacy import displacy

In [17]:
doc = nlp(u"The President of USA has approved a budget to invest $1trillion to boost the infrastructure.")

displacy.render(doc, style='dep', jupyter=True, options={'distance': 110})

In [18]:
doc = nlp(u"The President of USA, Donald Trump has approved a budget to invest $1trillion to boost the infrastructure such as roads and bridges. Some money has been allocated to build the US/Mexico wall.")

displacy.render(doc, style='ent', jupyter=True)

In [None]:
doc = nlp(u"The President of USA, Donald Trump has approved a budget to invest $1trillion to boost the infrastructure such as roads and bridges. Some money has been allocated to build the US/Mexico wall.")

# Outside Jupyter, serve the visualization from a local web server instead
displacy.serve(doc, style='ent')