## Tokenization

Tokens are the basic building blocks of a `Doc` object. Everything that helps us understand the meaning of the text is derived from tokens and their relationships to one another.

In [1]:
# Import spacy and load the language library
import spacy
import en_core_web_sm

In [2]:
# Create a string that includes opening and closing quotation marks
my_str='"We\'re moving to L.A.!"'
print(my_str)

"We're moving to L.A.!"


In [3]:
nlp=en_core_web_sm.load()
doc=nlp(my_str)
print(doc)

"We're moving to L.A.!"


In [5]:
for token in doc:
    print(token,end=" | ")

" | We | 're | moving | to | L.A. | ! | " | 

<img src="../tokenization.png" width='600'>

-  **Prefix**:	Character(s) at the beginning &#9656; `$ ( “ ¿`
-  **Suffix**:	Character(s) at the end &#9656; `km ) , . ! ”`
-  **Infix**:	Character(s) in between &#9656; `- -- / ...`
-  **Exception**: Special-case rule to split a string into several tokens or prevent a token from being split when punctuation rules are applied &#9656; `St. U.S.`
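The four rule types above can be sketched as a toy tokenizer in plain Python. This is not spaCy's implementation: the rule tables `PREFIXES`, `SUFFIXES`, `INFIXES` and `EXCEPTIONS` and the function `toy_tokenize` are hypothetical names, and the tables are far smaller than the real ones. The sketch only illustrates the order of operations: split on whitespace, then repeatedly check for an exception, peel off a prefix, or peel off a suffix, and finally split what remains on infixes.

```python
import re

# Toy rule tables (assumptions for illustration, not spaCy's real rules).
PREFIXES = re.compile(r'^[$("]')
SUFFIXES = re.compile(r'(km|[),.!"])$')
INFIXES = re.compile(r'--|\.\.\.|[-/]')
EXCEPTIONS = {
    "We're": ["We", "'re"],  # split one string into several tokens
    "L.A.": ["L.A."],        # keep the periods attached
    "St.": ["St."],
    "U.S.": ["U.S."],
}

def toy_tokenize(text):
    tokens = []
    for chunk in text.split():
        suffixes = []  # suffixes are collected and emitted after the core
        while chunk:
            if chunk in EXCEPTIONS:
                tokens.extend(EXCEPTIONS[chunk])
                break
            match = PREFIXES.search(chunk)
            if match:
                tokens.append(match.group())
                chunk = chunk[match.end():]
                continue
            match = SUFFIXES.search(chunk)
            if match:
                suffixes.insert(0, match.group())
                chunk = chunk[:match.start()]
                continue
            # No prefix, suffix, or exception left: split on infixes.
            start = 0
            for match in INFIXES.finditer(chunk):
                if match.start() > start:
                    tokens.append(chunk[start:match.start()])
                tokens.append(match.group())
                start = match.end()
            if start < len(chunk):
                tokens.append(chunk[start:])
            break
        tokens.extend(suffixes)
    return tokens

print(toy_tokenize('"We\'re moving to L.A.!"'))
# ['"', 'We', "'re", 'moving', 'to', 'L.A.', '!', '"']
```

Checking for exceptions on every pass, rather than only once, is what lets `L.A.` survive after the trailing `!` and `"` have been peeled away.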

## Prefixes, Suffixes and Infixes
spaCy will isolate punctuation that does *not* form an integral part of a word. Quotation marks, commas, and punctuation at the end of a sentence will be assigned their own token. However, punctuation that exists as part of an email address, website or numerical value will be kept as part of the token.

In [4]:
doc2=nlp(u"We're here in case you need any help, for any problem mail us on support@oursite.com or visit our website https://www.oursite.com!")

In [6]:
for token in doc2:
    print(token)

We
're
here
in
case
you
need
any
help
,
for
any
problem
mail
us
on
support@oursite.com
or
visit
our
website
https://www.oursite.com
!


In [8]:
doc3=nlp(u"A 5km NYC cab ride costs $13")
for t in doc3:
    print(t)

A
5
km
NYC
cab
ride
costs
$
13


Here the distance unit and dollar sign are assigned their own tokens, yet the dollar amount is preserved.
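Because email addresses and URLs survive as single tokens, they can be picked out with the lexical attributes `Token.like_email` and `Token.like_url`. The following is a minimal sketch, assuming spaCy is installed. It swaps in `spacy.blank("en")` (tokenizer only, no trained model) for the `en_core_web_sm` pipeline used in this notebook, and the shortened sentence is made up for illustration.

```python
import spacy

# A blank English pipeline: the tokenizer rules only, no trained components.
nlp = spacy.blank("en")
doc = nlp("Mail us on support@oursite.com or visit https://www.oursite.com!")

# Collect tokens that look like email addresses or URLs.
emails = [t.text for t in doc if t.like_email]
urls = [t.text for t in doc if t.like_url and not t.like_email]

print(emails)
print(urls)
```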

## Exceptions
Punctuation that exists as part of a known abbreviation will be kept as part of the token.

In [10]:
doc4=nlp(u"We will visit Mt. Everest in Nepal next year again.")
for t in doc4:
    print(t)

We
will
visit
Mt.
Everest
in
Nepal
next
year
again
.


Here the abbreviation `Mt.` (Mount) is preserved as a single token, period included.
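Exceptions can also be added at runtime with `Tokenizer.add_special_case`. The sketch below assumes spaCy is installed and uses `spacy.blank("en")` instead of the notebook's `en_core_web_sm` pipeline, since only the tokenizer is involved. The word `gimme` is an arbitrary example, not something taken from this notebook.

```python
import spacy
from spacy.symbols import ORTH

nlp = spacy.blank("en")

# "Mt." is covered by a built-in English exception, so the period stays attached.
mt = [t.text for t in nlp("Mt. Everest")]
print(mt)

# Without a special case, "gimme" is a single token.
before = [t.text for t in nlp("gimme that")]
print(before)

# Register a special case that splits it into two tokens.
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])
after = [t.text for t in nlp("gimme that")]
print(after)
```

The `ORTH` values of a special case must join back into the original string, so a rule can change where the text is split but not the text itself.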

## Counting Tokens

In [11]:
len(doc4)

11

## Counting Vocab Entries
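A `Vocab` object stores the vocabulary shared by every `Doc` created from the same pipeline. Its length is the number of lexeme entries it currently holds, so the figure below depends on the model and spaCy version.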

In [13]:
len(doc4.vocab)

57852

## Tokens can be retrieved by index position and slice

In [14]:
doc5=nlp(u"Practice makes a man perfect. So practice daily. Never stop!")
doc5[3]

man

In [15]:
doc5[0:5]

Practice makes a man perfect

In [16]:
doc5[-3:]

Never stop!
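Indexing and slicing return different types: a single index yields a `Token`, while a slice yields a `Span`, which is a view onto part of the `Doc`. A minimal sketch, assuming spaCy is installed; it uses `spacy.blank("en")` rather than `en_core_web_sm`, because only tokenization is needed here.

```python
import spacy
from spacy.tokens import Span, Token

nlp = spacy.blank("en")
doc = nlp("Practice makes a man perfect. So practice daily. Never stop!")

token = doc[3]   # a single Token
span = doc[-3:]  # a Span covering the last three tokens

print(type(token).__name__, "->", token.text)
print(type(span).__name__, "->", span.text)

# A Span remembers where it sits inside the parent Doc.
print(span.start, span.end, len(doc))
```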

## Tokens cannot be reassigned
Although `Doc` objects can be considered lists of tokens, they do *not* support item reassignment:

In [19]:
doc6=nlp(u"Hello my name is Tomar, what is your name?")
doc6[4]

Tomar

In [20]:
doc6[4]="Malik"

TypeError: 'spacy.tokens.doc.Doc' object does not support item assignment
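One way around this restriction is to build a new `Doc` from a modified word list instead of editing the existing one. This is a sketch of one possible workaround, not something the original notebook does. It assumes spaCy is installed and uses `spacy.blank("en")` in place of `en_core_web_sm`.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
doc = nlp("Hello my name is Tomar, what is your name?")

# Item assignment fails because Doc does not define __setitem__.
try:
    doc[4] = "Malik"
except TypeError as err:
    print("TypeError:", err)

# Copy the words and the trailing-space flags, then change the copy.
words = [t.text for t in doc]
spaces = [bool(t.whitespace_) for t in doc]
words[4] = "Malik"

# Construct a fresh Doc that shares the same vocabulary.
new_doc = Doc(doc.vocab, words=words, spaces=spaces)
print(new_doc.text)
```

The new `Doc` carries no annotations from the old one, so any tagging or parsing would have to be run again.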

___
# Named Entities
Going a step beyond tokens, *named entities* add another layer of context. The language model recognizes that certain words are organizational names while others are locations, and still other combinations relate to money, dates, etc. Named entities are accessible through the `ents` property of a `Doc` object.

In [21]:
doc7=nlp(u"Apple announced that it will build a warehouse in India for $45 million.")
for entity in doc7.ents:
    print(entity)

Apple
India
$45 million


In [28]:
for token in doc7:
    print(token.text, end=" | ")
    
print('\n--------')
    
for entity in doc7.ents:
    print(entity.text + ":" + entity.label_)
    print(entity.label_ + ' means ' + str(spacy.explain(entity.label_)))
    print("\n")

Apple | announced | that | it | will | build | a | warehouse | in | India | for | $ | 45 | million | . | 
--------
Apple:ORG
ORG means Companies, agencies, institutions, etc.


India:GPE
GPE means Countries, cities, states


$45 million:MONEY
MONEY means Monetary values, including unit




Named Entity Recognition (NER) is an important machine learning tool applied to Natural Language Processing. For more info on **named entities** visit https://spacy.io/usage/linguistic-features#named-entities

---
# Noun Chunks
Similar to `Doc.ents`, `Doc.noun_chunks` is another property of the `Doc` object. *Noun chunks* are "base noun phrases": flat phrases that have a noun as their head. You can think of a noun chunk as a noun plus the words describing it.

In [29]:
doc8=nlp(u"Autonomous cars shifted insurance liabilities towards manufacturers")
for chunk in doc8.noun_chunks:
    print(chunk)

Autonomous cars
insurance liabilities
manufacturers


In [30]:
for chunk in doc7.noun_chunks:
    print(chunk)

Apple
it
a warehouse
India


# Built-in Visualizers

spaCy includes a built-in visualization tool called **displaCy**. displaCy is able to detect whether you're working in a Jupyter notebook, and will return markup that can be rendered in a cell right away. When you export your notebook, the visualizations will be included as HTML.

For more info visit https://spacy.io/usage/visualizers

In [31]:
from spacy import displacy

In [34]:
docs=nlp(u"Apple is going to build a factory in U.K. for $36 million")
displacy.render(docs,style='dep',jupyter=True,options={"distance":100})

In [40]:
doc_m=nlp(u"Over the last quarter Apple sold nearly 20 thousand iPods for a profit of $6 million.")
displacy.render(doc_m,style='ent',jupyter=True)

___
## Creating Visualizations Outside of Jupyter
If you're using another Python IDE or writing a script, you can have spaCy serve the HTML separately:

In [42]:
doc_1=nlp(u"Apple is the first trillion dollar company in the world")
displacy.serve(doc_1,style='dep')


    Serving on port 5000...
    Using the 'dep' visualizer



127.0.0.1 - - [29/Jul/2021 23:19:34] "GET / HTTP/1.1" 200 7597
127.0.0.1 - - [29/Jul/2021 23:19:35] "GET /favicon.ico HTTP/1.1" 200 7597



    Shutting down server on port 5000.



<font color=blue>**After running the cell above, click the link below to view the dependency parse**:</font>

http://127.0.0.1:5000
<br><br>
<font color=red>**To shut down the server and return to Jupyter**, interrupt the kernel either through the **Kernel** menu above, by clicking the black square on the toolbar, or by typing the keyboard shortcut `Esc`, `I`, `I`</font>
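
As an alternative to running a server, `displacy.render` can return the markup as a string, which a script can write to an HTML file. The sketch below assumes spaCy is installed. To avoid depending on a trained model it uses displaCy's manual mode with a hand-written entity annotation, so the `ORG` span is an assumption for illustration, not model output. The output file name is likewise arbitrary.

```python
import tempfile
from pathlib import Path

from spacy import displacy

text = "Apple is the first trillion dollar company in the world"

# Hand-written annotation (an assumption): characters 0-5 are an ORG entity.
example = {
    "text": text,
    "ents": [{"start": 0, "end": 5, "label": "ORG"}],
    "title": None,
}

# jupyter=False forces a plain string; page=True wraps it in a full HTML page.
html = displacy.render(example, style="ent", manual=True, page=True, jupyter=False)

# Write the page to the system temp directory and report where it went.
out_path = Path(tempfile.gettempdir()) / "entities.html"
out_path.write_text(html, encoding="utf-8")
print("Saved visualization to", out_path)
```

With a real pipeline, the same call accepts a `Doc` in place of the dictionary once `manual=True` is dropped.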