In [2]:
import spacy

In [3]:
nlp = spacy.load("en_core_web_sm")

# spaCy Containers

<img src="images/spacy_containers.png" alt="Concept Image" width="350"/>

- A `Doc` contains one or more sentences.
- Each sentence is made up of several tokens.
- Tokens can be words, single letters, numbers, or punctuation marks such as commas and semicolons.
- A `Span` is a slice of a `Doc`: it can be a single token (just one word) or a sequence of several tokens (like 3 words together).
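The container types can be checked directly. The sketch below is a minimal illustration that uses a blank English pipeline (tokenizer only), so it does not depend on the trained model; the names `nlp_blank`, `demo_doc`, `first_token`, and `first_three` are made up for this example.

```python
import spacy
from spacy.tokens import Doc, Span, Token

# A blank English pipeline only tokenizes, so no trained model is needed here.
nlp_blank = spacy.blank("en")
demo_doc = nlp_blank("spaCy splits text into tokens.")

first_token = demo_doc[0]     # indexing a Doc gives a Token
first_three = demo_doc[0:3]   # slicing a Doc gives a Span

print(type(demo_doc).__name__, type(first_three).__name__, type(first_token).__name__)
print(len(first_three), "tokens in the span:", first_three.text)
```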


In [4]:
with open("data/wiki_us.txt", "r") as f:
    text = f.read()

In [5]:
print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

## Text vs Doc objects

In [9]:
#- Pass the text to the loaded pipeline ("en_core_web_sm") to get a Doc object
doc = nlp(text)

In [7]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [8]:
print(len(text))
print(len(doc))

#- The two lengths differ: len(text) counts characters, len(doc) counts tokens

3521
654


Why do the two lengths differ? Iterating over each object shows what it is made of.

In [10]:
for token in text[0:10]:
    print(token)

T
h
e
 
U
n
i
t
e
d


In [11]:
for token in doc[0:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


We can see here that the tokens in `doc` are individual words, punctuation marks, etc., whereas iterating over `text` yields single characters.

**This is different from splitting text in Python on whitespace, as seen below.**

Splitting on whitespace leaves punctuation attached to the neighbouring word, producing items like "(U.S.A." and "USA),".

In [12]:
for token in text.split()[0:10]:
    print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


## Separating sentences using spaCy

In [13]:
for sentence in doc.sents:
    print(sentence)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

As you can see, each sentence is printed on its own line.

## Accessing individual sentences
`doc.sents` **is a generator. Generators are NOT subscriptable, so you cannot index into it directly.**

Converting it to a `list` first makes indexing possible.

In [14]:
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
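When only one sentence is needed, building the full list can be avoided. The sketch below shows one possible alternative with `next` and `itertools.islice`; it assumes spaCy v3 and uses a blank pipeline with the rule-based `sentencizer` instead of the trained model, and all variable names are illustrative.

```python
from itertools import islice

import spacy

sent_nlp = spacy.blank("en")
sent_nlp.add_pipe("sentencizer")  # rule-based sentence splitter (spaCy v3 API)
sent_doc = sent_nlp("This is the first sentence. Here is the second. And a third.")

# Each access to .sents starts a fresh generator.
first_sentence = next(iter(sent_doc.sents))
# islice skips ahead without building a list of every sentence.
second_sentence = next(islice(sent_doc.sents, 1, None))

print(first_sentence)
print(second_sentence)
```

For a handful of lookups on a short document, `list(doc.sents)` is simpler and perfectly adequate.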


## Attributes of tokens
Some main attributes of tokens are:
- .text
- .head
- .left_edge
- .right_edge
- .ent_type_
- .ent_iob_
- .lemma_
- .morph
- .pos_
- .dep_
- .lang_

In [15]:
for token in doc[0:10]:
    print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


In [17]:
token2 = sentence1[2]
print(token2)
#- Third token in sentence 1 (index 2, since indexing starts at 0)

States


The object printed above is a **token object**.

It is NOT text. It has a whole bunch of attributes associated with it.

The following are some of them:

In [21]:
token2.text


'States'

In [22]:
token2.left_edge
#- The leftmost token of this token's syntactic subtree (not simply the previous token)

The

In [23]:
token2.right_edge
#- The rightmost token of this token's syntactic subtree (not simply the next token)

America
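`left_edge` and `right_edge` mark the boundaries of a token's syntactic subtree, which is why "States" reaches from "The" to "America". The sketch below illustrates this on a hand-annotated parse: the `heads` and `deps` values are assumptions written by hand for this example, not output from the model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["The", "United", "States", "of", "America", "is", "large"]
# Hand-written parse (an assumption for illustration): heads are absolute
# token indices, and the root token points at itself.
heads = [2, 2, 5, 2, 3, 5, 5]
deps = ["det", "compound", "nsubj", "prep", "pobj", "ROOT", "acomp"]
parsed = Doc(Vocab(), words=words, heads=heads, deps=deps)

states = parsed[2]
subtree_words = [token.text for token in states.subtree]
print(subtree_words)                          # every token that depends on "States"
print(states.left_edge, states.right_edge)    # first and last token of that subtree

# The two edges can be used to slice the full phrase out of the Doc.
phrase = parsed[states.left_edge.i : states.right_edge.i + 1]
print(phrase.text)
```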

## ent_type_
Shows the named entity type of the token (an empty string if the token is not part of an entity).

In this case, it is a geopolitical entity (GPE).

In [25]:
token2.ent_type_

'GPE'

## ent_iob_
IOB stands for Inside, Outside, or Beginning of an entity.

In this case, the token "States" sits inside the larger entity "The United States of America", so 'I' is displayed.

In [26]:
token2.ent_iob_

'I'
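To see all three tags side by side, the sketch below assigns one entity by hand on a blank pipeline and prints the IOB tag of every token. The entity span is set manually here (an assumption for illustration), whereas in the notebook it comes from the model's NER component.

```python
import spacy
from spacy.tokens import Span

iob_nlp = spacy.blank("en")
iob_doc = iob_nlp("The United States of America is a country")

# Manually mark tokens 0..4 as one GPE entity (hand-assigned, not predicted).
iob_doc.ents = [Span(iob_doc, 0, 5, label="GPE")]

tags = [(token.text, token.ent_iob_, token.ent_type_) for token in iob_doc]
for row in tags:
    print(row)
```

The first token of the entity gets `B`, the remaining entity tokens get `I`, and tokens outside any entity get `O`.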

## lemma_

The lemma is the base (dictionary) form of a word, with inflection removed.

In the example below the original word is "known" and its lemma is "know", the base form of the verb.

In [27]:
sentence1[12]

known

In [29]:
sentence1[12].lemma_

'know'

## morph
Gives the morphological features of the token, such as number, tense, or person. Here "States" is analysed as singular (Number=Sing).

In [31]:
token2.morph

Number=Sing
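The value of `.morph` is an object rather than a plain string, so individual features can be looked up. A minimal sketch, assuming spaCy v3: the `Doc` is built by hand, and the feature strings are supplied manually for illustration rather than predicted by the model.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Morphological features are assigned by hand here (illustrative values).
morph_doc = Doc(
    Vocab(),
    words=["States", "knew"],
    morphs=["Number=Sing", "Tense=Past|VerbForm=Fin"],
)

for token in morph_doc:
    # to_dict() gives all features; get() gives the values of a single feature.
    print(token.text, token.morph.to_dict(), token.morph.get("Number"))
```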

## pos_
Tells us the part of speech of the word.

This is the coarse-grained Universal POS tag; `morph` carries the finer-grained grammatical features.

In this case, "States" is a proper noun (PROPN).

In [32]:
token2.pos_

'PROPN'

## dep_
Syntactic dependency relation between the token and its head.

In this case, it is a nominal subject (nsubj).

In [33]:
token2.dep_

'nsubj'

## lang_
Language of the token's parent document

In [35]:
token2.lang_

'en'

# Part of Speech and Dependency Example(s)

In [38]:
sample_text = "Mike enjoys playing football."

sample_doc = nlp(sample_text)

print(sample_doc)

Mike enjoys playing football.


In [39]:
for token in sample_doc:
    print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj
. PUNCT punct
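The tag abbreviations in this output can be looked up with `spacy.explain`, which returns a short description for labels it knows about. A small sketch (the `labels` list is simply copied from the output above; `explanations` is an illustrative name):

```python
import spacy

labels = ["PROPN", "VERB", "NOUN", "PUNCT", "nsubj", "xcomp", "dobj", "punct"]

# spacy.explain returns a short description, or None for unknown labels.
explanations = {label: spacy.explain(label) for label in labels}
for label, meaning in explanations.items():
    print(f"{label:>6}  {meaning}")
```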


## Parts of Speech and Dependency Visualization

In [40]:
from spacy import displacy
displacy.render(sample_doc, style="dep")
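Outside a notebook, `displacy.render` can return the markup as a string instead of displaying it. The sketch below is one way to do that, assuming spaCy v3; to stay independent of the trained model it rebuilds the parse by hand from the tags printed earlier, so `hand_doc` is an illustrative stand-in for `sample_doc`.

```python
import spacy
from spacy import displacy
from spacy.tokens import Doc

# Hand-built copy of the parse shown above (values taken from that output;
# the head indices are written by hand as an assumption).
vocab = spacy.blank("en").vocab
hand_doc = Doc(
    vocab,
    words=["Mike", "enjoys", "playing", "football", "."],
    pos=["PROPN", "VERB", "VERB", "NOUN", "PUNCT"],
    heads=[1, 1, 1, 2, 1],
    deps=["nsubj", "ROOT", "xcomp", "dobj", "punct"],
)

# jupyter=False makes render() return the SVG markup as a string.
svg_markup = displacy.render(hand_doc, style="dep", jupyter=False)
print(svg_markup[:80])
```

The returned string can then be written to an `.svg` file and opened in a browser.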

# Tokens vs Entities

### Token
A token is the smallest unit of text that has a meaningful role in a sentence. Typically, tokens are words, punctuation marks, numbers, etc.

### Named Entity
A named entity is a specific type of phrase or word that refers to a proper noun or a unique item. Named entities are usually things like names of people, organizations, locations, dates, and other specific entities that have a distinct identity.

### Example:

"Barack Obama was born on August 4, 1961, in Honolulu, Hawaii."

**Tokens** - "Barack" "Obama" "was" "born" "on"...and so on

**Named Entities**- "Barack Obama (PERSON)", "August 4, 1961 (DATE)", "Honolulu (GPE - Geo-Political Entity, e.g., city, country)", "Hawaii (GPE)"
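The difference shows up in code as two separate views of the same `Doc`. In this sketch the entities are assigned by hand on a blank pipeline, mirroring the labels listed above; `labelled_span` is a hypothetical helper written for this example, not part of spaCy.

```python
import spacy

ent_nlp = spacy.blank("en")
ent_doc = ent_nlp("Barack Obama was born on August 4, 1961, in Honolulu, Hawaii.")


def labelled_span(doc, phrase, label):
    """Hypothetical helper: build a labelled Span from the first match of phrase."""
    start = doc.text.index(phrase)
    return doc.char_span(start, start + len(phrase), label=label)


# Hand-assigned entities (not model predictions).
ent_doc.ents = [
    labelled_span(ent_doc, "Barack Obama", "PERSON"),
    labelled_span(ent_doc, "August 4, 1961", "DATE"),
    labelled_span(ent_doc, "Honolulu", "GPE"),
    labelled_span(ent_doc, "Hawaii", "GPE"),
]

token_texts = [token.text for token in ent_doc]                 # every token
entity_pairs = [(ent.text, ent.label_) for ent in ent_doc.ents]  # only the entities

print(token_texts)
print(entity_pairs)
```

An entity is a `Span`, so one entity such as "August 4, 1961" can cover several tokens.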

# Named Entity Recognition (NER)


In [43]:
for ent in doc.ents:
    print(ent.text, ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
Spanish NORP
World War EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vietnam War EVENT
the Soviet Union

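Each line shows an entity's text followed by the label the model assigned to it.

Some labels are wrong, however. For example, "the American Revolutionary War" and "the American Civil War" are labelled ORG although they are events, and "Washington, D.C." is split into two separate GPE entities.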
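Grouping entities by label makes such mistakes easier to spot, because a misplaced item stands out within its group. A minimal sketch: the two entities are assigned by hand on a blank pipeline to mimic labels from the output above, and `by_label` is an illustrative name. On the notebook's `doc`, the same loop should work over `doc.ents`.

```python
from collections import defaultdict

import spacy
from spacy.tokens import Span

review_nlp = spacy.blank("en")
review_doc = review_nlp("the American Civil War and the Korean War")

# Hand-assigned labels that mimic the model output (ORG is the questionable one).
review_doc.ents = [
    Span(review_doc, 0, 4, label="ORG"),
    Span(review_doc, 5, 8, label="EVENT"),
]

by_label = defaultdict(list)
for ent in review_doc.ents:
    by_label[ent.label_].append(ent.text)

for label, texts in sorted(by_label.items()):
    print(label, texts)
```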

### Displaying entire text with highlighted types

In [44]:
displacy.render(doc, style="ent")