# lecture 5 - Tokenization

The first step in creating a `Doc` object is to break down the incoming text into component pieces, or "tokens". spaCy splits the text on whitespace and then applies the following kinds of rules to each piece:

-  **Prefix**: Character(s) at the beginning ▸ `$ ( “ ¿`
-  **Suffix**: Character(s) at the end ▸ `km ) , . ! ”`
-  **Infix**: Character(s) in between ▸ `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied ▸ `St. U.S.`
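
To make the four rule types concrete, here is a minimal pure-Python sketch of how they can interact. It is an illustration only, not spaCy's implementation: the rule sets are tiny, the names `EXCEPTIONS`, `PREFIXES`, `SUFFIXES`, `INFIXES`, `split_chunk` and `toy_tokenize` are hypothetical, and contractions such as `We're` are not handled.

```python
import re

# Hypothetical, heavily reduced rule sets for illustration only;
# spaCy's real rules are far larger and language-specific.
EXCEPTIONS = {"St.", "U.S.", "L.A."}        # strings that are kept whole
PREFIXES = ("$", "(", '"', "¿")
SUFFIXES = ("km", ")", ",", ".", "!", '"')
INFIXES = re.compile(r"(--|\.\.\.|[-/])")   # capture group keeps the separator


def split_chunk(chunk):
    """Split one whitespace-delimited chunk into toy tokens."""
    prefixes, suffixes = [], []
    # Strip prefixes and suffixes until an exception or a plain word remains.
    while chunk and chunk not in EXCEPTIONS:
        prefix = next((p for p in PREFIXES if chunk.startswith(p)), None)
        if prefix is not None:
            prefixes.append(prefix)
            chunk = chunk[len(prefix):]
            continue
        suffix = next((s for s in SUFFIXES if chunk.endswith(s)), None)
        if suffix is not None:
            suffixes.insert(0, suffix)
            chunk = chunk[:-len(suffix)]
            continue
        break  # nothing left to strip
    if not chunk:
        middle = []
    elif chunk in EXCEPTIONS:
        middle = [chunk]
    else:
        # Split the remainder on infixes, dropping empty pieces.
        middle = [piece for piece in INFIXES.split(chunk) if piece]
    return prefixes + middle + suffixes


def toy_tokenize(text):
    """Split on whitespace, then apply the toy rules to each chunk."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk))
    return tokens


print(toy_tokenize("A 5km NYC cab ride costs $10.30"))
# ['A', '5', 'km', 'NYC', 'cab', 'ride', 'costs', '$', '10.30']
print(toy_tokenize("Send snail-mail to St. Louis!"))
# ['Send', 'snail', '-', 'mail', 'to', 'St.', 'Louis', '!']
```

Checking the exception set before stripping anything is what keeps the period in `St.` attached, while the final `!` is still split off `Louis!`.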

Notice that tokens are pieces of the original text. That is, they are not converted to word stems or lemmas (base forms of words), and nothing so far identifies organizations, places, monetary amounts, and so on. Tokens are the basic building blocks of a `Doc` object: everything that helps us understand the meaning of the text is derived from tokens and their relationships to one another.
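
One way to check that tokens really are untouched pieces of the input is to rebuild the text from them. The sketch below assumes spaCy is installed and uses `spacy.blank("en")`, which provides only the English tokenizer, so no trained model is required; the variable names are illustrative.

```python
import spacy

# A blank English pipeline contains only the tokenizer
# (assumption: spaCy itself is installed; no model download needed).
nlp = spacy.blank("en")

text = '"We\'re moving to L.A.!"'
doc = nlp(text)

for token in doc:
    # token.idx is the character offset of the token in the original text;
    # token.whitespace_ is the trailing whitespace, if any.
    print(token.i, token.idx, repr(token.text), repr(token.whitespace_))

# Joining each token with its trailing whitespace restores the input exactly.
rebuilt = "".join(token.text_with_ws for token in doc)
print(rebuilt == text)
```

Because every token records its offset and trailing whitespace, tokenization in spaCy is non-destructive: the original string can always be recovered from the `Doc`.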

In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

In [4]:
mystring = '"We\'re moving to L.A.!"'

In [7]:
mystring

'"We\'re moving to L.A.!"'

In [9]:
print(mystring)

"We're moving to L.A.!"


In [10]:
doc = nlp(mystring)

for token in doc:
    print(token.text, end=' | ')

" | We | 're | moving | to | L.A. | ! | " | 

## prefixes, suffixes and infixes
spaCy will isolate punctuation that does *not* form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

In [11]:
doc2 = nlp(u"We're here to help! Send snail-mail, email support@oursite.com or visit us at http://www.oursite.com!")

In [12]:
for t in doc2:
    print(t)

We
're
here
to
help
!
Send
snail
-
mail
,
email
support@oursite.com
or
visit
us
at
http://www.oursite.com
!
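
Tokens that survive intact, such as the email address and the URL above, can also be picked out through boolean token attributes. This is a small sketch under the assumption that spaCy is installed; it uses a blank English pipeline (tokenizer only) rather than the loaded model, and the list names are illustrative.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only, no trained model needed
doc = nlp("We're here to help! Send snail-mail, email support@oursite.com "
          "or visit us at http://www.oursite.com!")

# Lexical flags computed from the token text alone
emails = [t.text for t in doc if t.like_email]
urls = [t.text for t in doc if t.like_url]
punct = [t.text for t in doc if t.is_punct]

print(emails)
print(urls)
print(punct)
```

Note that the trailing `!` after the URL is reported as punctuation, not as part of the URL, matching the token list above.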


In [14]:
doc3 = nlp(u"A 5km NYC cab ride costs $10.30")

In [15]:
for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
10.30


In [19]:
doc4 = nlp(u"Let's visit St. Louis in the U.S. next year.")

In [20]:
for t in doc4:
    print(t)

Let
's
visit
St.
Louis
in
the
U.S.
next
year
.


In [21]:
len(doc4)

11

In [22]:
doc4.vocab

<spacy.vocab.Vocab at 0x2190b2738a0>

In [23]:
len(doc4.vocab)

796

In [24]:
doc5 = nlp(u'It is better to give than to receive.')

In [25]:
doc5[0]

It

In [26]:
doc5[2:5]

better to give
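
The two lookups above return different kinds of objects: an integer index gives a single `Token`, while a slice gives a `Span`. A short sketch, assuming spaCy is installed and using a blank English pipeline so that no model is needed:

```python
import spacy
from spacy.tokens import Span, Token

nlp = spacy.blank("en")  # tokenizer only
doc = nlp("It is better to give than to receive.")

first = doc[0]     # a single Token
middle = doc[2:5]  # a Span covering tokens 2, 3 and 4

print(type(first).__name__, first.text)
print(type(middle).__name__, middle.text, middle.start, middle.end)
print(doc[-1].text)  # negative indices count from the end
```

As with Python lists, the slice end is exclusive, so `doc[2:5]` stops before token 5.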

## named entities
Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [27]:
doc8 = nlp(u"Apple to build a Hong Kong factory for $6 million")

In [28]:
for token in doc8:
    print(token.text,end=' | ')

Apple | to | build | a | Hong | Kong | factory | for | $ | 6 | million | 

In [31]:
for entity in doc8.ents:
    print(entity)
    print(entity.label_)
    print(str(spacy.explain(entity.label_)))
    print('\n')

Apple
ORG
Companies, agencies, institutions, etc.


Hong Kong
GPE
Countries, cities, states


$6 million
MONEY
Monetary values, including unit




## noun chunks
Similar to `Doc.ents`, `Doc.noun_chunks` is another property of the `Doc` object. *Noun chunks* are "base noun phrases" – flat phrases that have a noun as their head. You can think of noun chunks as a noun plus the words describing the noun – for example, in [Sheb Wooley's 1958 song](https://en.wikipedia.org/wiki/The_Purple_People_Eater), a *"one-eyed, one-horned, flying, purple people-eater"* would be one long noun chunk.

In [32]:
doc9 = nlp(u"Autonomous cars shift insurance liability toward manufacturers.")

In [34]:
for chunk in doc9.noun_chunks:
    print(chunk)

Autonomous cars
insurance liability
manufacturers


## built-in visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

In [35]:
from spacy import displacy

In [36]:
doc = nlp(u'Apple is going to build a U.K. factory for $6 million.')

In [39]:
displacy.render(doc,style='dep',options={'distance':110})

In [40]:
displacy.render(doc,style='dep',options={'distance':50})

The optional `'distance'` key in `options` sets the spacing between words in pixels. If the distance is made too small, text that appears beneath short arrows may become too compressed to read.

In [41]:
doc = nlp(u'Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.')

In [42]:
displacy.render(doc,style='ent')