# 1. The Basics of spaCy



## 1.1. What is spaCy

spaCy is a natural language processing framework. Natural language processing, or NLP, is a branch of linguistics that seeks to parse human language in a computer system. The field is also commonly referred to as computational linguistics, though it has far-reaching applications beyond academic linguistic research.

## 1.2. How to Install spaCy

* `!pip install spacy`
* `!python -m spacy download en_core_web_sm`
* `import spacy`
* `nlp = spacy.load("en_core_web_sm")`
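
After running the commands above, it can help to confirm what actually got installed. The following is a minimal sketch using only the standard library (Python 3.8 or later). `installed_version` is a helper name made up for this example, and the sketch assumes the model is registered as a regular pip package under the name `en_core_web_sm`:

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# "spacy" and "en_core_web_sm" are the names used in the install commands above.
# Assumption: the model shows up as a pip package under that name.
for name in ("spacy", "en_core_web_sm"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```

If either line reports `not installed`, rerun the corresponding install command before calling `spacy.load`.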


## 1.3. Containers

Containers are spaCy objects that hold a large amount of data about a text. When we analyze a text with spaCy, we work with different container objects. Here is the full list of spaCy containers:

* Doc

* DocBin

* Example

* Language

* Lexeme

* Span

* SpanGroup

* Token

# 2. Getting Started with spaCy and its Linguistic Annotations

## 2.1. Importing spaCy and Loading Data

In [1]:
import spacy

In [2]:
nlp = spacy.load("en_core_web_sm")

In [3]:
with open("/content/wiki_us.txt", "r", encoding="utf-8") as f:
  text = f.read()

In [4]:
text

"The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.\n\nPaleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies 

## 2.2. Creating a Doc Container

In [5]:
doc = nlp(text)

In [6]:
print(doc)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [7]:
print(len(text))
print(len(doc))

3521
654


In [8]:
for token in text[:10]:
  print(token)

T
h
e
 
U
n
i
t
e
d


In [9]:
for token in doc[:10]:
  print(token)

The
United
States
of
America
(
U.S.A.
or
USA
)


You might think that Python’s split method, which splits on whitespace, would give the same result. It does not. Let’s see why.

In [10]:
for token in text.split()[:10]:
  print(token)

The
United
States
of
America
(U.S.A.
or
USA),
commonly
known


Notice that the parentheses and commas stay attached to the neighboring words instead of being split off as separate tokens. To see this more clearly, let’s print the items at indices 5 through 7 from both the doc object and the whitespace-split words.

In [11]:
words = text.split()[:10]

In [12]:
# compare each spaCy token with the whitespace-split word at the same index
for i, token in enumerate(doc[5:8], start=5):
    print(f"SpaCy Token {i}:\n{token}\nWord Split {i}:\n{words[i]}\n\n")

SpaCy Token 5:
(
Word Split 5:
(U.S.A.


SpaCy Token 6:
U.S.A.
Word Split 6:
or


SpaCy Token 7:
or
Word Split 7:
USA),
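
The gap between the two columns above comes from punctuation handling. As a rough illustration only, the sketch below uses a hand-written regular expression to peel punctuation away from words. The pattern is an assumption made for this example and is not how spaCy's tokenizer is implemented; it merely reproduces the same first ten tokens for this one sample string.

```python
import re

# Rough pattern (an assumption for illustration, not spaCy's tokenizer rules):
#   1. dotted abbreviations such as "U.S.A."
#   2. runs of word characters
#   3. any single character that is neither whitespace nor a word character
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")

sample = "The United States of America (U.S.A. or USA), commonly known"

whitespace_words = sample.split()
regex_tokens = TOKEN_PATTERN.findall(sample)

print(whitespace_words[5])  # punctuation stays glued to the word
print(regex_tokens[5:10])   # punctuation becomes separate tokens
```

A pattern like this breaks down quickly on real text (contractions, URLs, hyphenated words), which is a good reason to rely on a dedicated tokenizer instead.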




## 2.3. Sentence Boundary Detection (SBD)



In [13]:
for sent in doc.sents:
  print(sent)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
With a population of more than 331 million people, it is the third most populous country in the world.
The national capital is Washington, D.C., and the most populous city is New York.


Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
The United States emerged from the thirteen British colonies es

In [14]:
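# `sent` still refers to the last sentence from the loop above,
# so this prints the number of tokens in that sentence
print(len(sent))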

35


In [16]:
# doc.sents is a generator, so it cannot be indexed directly
sentence1 = doc.sents[0]
print(sentence1)

TypeError: 'generator' object is not subscriptable

In [17]:
# convert the generator to a list first, then index into it
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
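
The error above is general Python behavior rather than anything specific to spaCy. The sketch below reproduces it with a plain generator function (`sentences` is a hypothetical stand-in for `doc.sents`) and shows two ways to take items without building the whole list:

```python
from itertools import islice


def sentences():
    """Hypothetical stand-in for doc.sents: yields items one at a time."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."


# Indexing a generator raises the same error as in the cell above.
try:
    sentences()[0]
except TypeError as error:
    message = str(error)
    print(f"TypeError: {message}")

# next() pulls only the first item; islice() pulls the first n items.
first = next(sentences())
first_two = list(islice(sentences(), 2))
print(first)
print(first_two)
```

With a real Doc, the same idea would presumably be written as `next(iter(doc.sents))`, which avoids materializing every sentence when only the first one is needed.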


## 2.4. Token Attributes

The token object contains many attributes that are vital to performing NLP in spaCy. We will be working with a few of them, such as:

* .text

* .head

* .left_edge

* .right_edge

* .ent_type_

* .ent_iob_

* .lemma_

* .morph

* .pos_

* .dep_

* .lang_

In [18]:
token2 = sentence1[2]
print(token2)

States


### 2.4.1. Text

In [19]:
token2.text

'States'

### 2.4.2. Head

The head is the token that syntactically governs this one. In this case it is the main verb, “is”, because “States” is the nominal subject of the sentence.

In [20]:
token2.head

is

### 2.4.3. Left Edge

The left edge is the leftmost token of this token's syntactic subtree. When the token heads a sequence of tokens that are collectively meaningful, this tells us where that multi-word phrase begins.

In [21]:
token2.left_edge

The

### 2.4.4. Right Edge

The right edge is the rightmost token of this token's syntactic subtree. It tells us where the multi-word phrase headed by this token ends.

In [22]:
token2.right_edge

America
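
To make head, left edge, and right edge concrete, here is a small sketch that computes them from a list of head indices. The shortened sentence and its parse are hand-built for illustration and do not come from the model; `subtree_indices` is a name invented for this example.

```python
# Hand-built parse for illustration only; a real parse comes from the model.
words = ["The", "United", "States", "of", "America", "is", "a", "country"]
# heads[i] is the index of the token that governs token i.
# The root ("is") points to itself.
heads = [2, 2, 5, 2, 3, 5, 7, 5]


def subtree_indices(index):
    """Return the sorted indices of a token and all of its descendants."""
    found = [index]
    for child, head in enumerate(heads):
        if head == index and child != index:
            found.extend(subtree_indices(child))
    return sorted(found)


states = 2
span = subtree_indices(states)
print(words[heads[states]])  # governing token (the head)
print(words[span[0]])        # leftmost token of the subtree (left edge)
print(words[span[-1]])       # rightmost token of the subtree (right edge)
```

In this toy parse the subtree of “States” covers “The United States of America”, which is why its edges are “The” and “America”.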

### 2.4.5. Entity Type

In [23]:
token2.ent_type

384

Note the absence of the trailing `_` in the attribute name. Without it, the attribute returns an integer ID that corresponds to the entity type; with it, `ent_type_` returns the string equivalent, as shown below.

GPE stands for geopolitical entity, which is the correct label here.

In [24]:
token2.ent_type_

'GPE'

### 2.4.6. Ent IOB

IOB code of named entity tag. “B” means the token begins an entity, “I” means it is inside an entity, “O” means it is outside an entity, and "" means no entity tag is set.

In [25]:
token2.ent_iob_

'I'
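
The IOB codes are enough to reassemble whole entities from individual tokens. The sketch below shows one way to do that in plain Python. The triples are written by hand to mirror the opening tokens of the text, and `decode_iob` is a hypothetical helper, not a spaCy function.

```python
# Hand-written (text, IOB code, entity type) triples mirroring the start of the text.
tagged = [
    ("The", "B", "GPE"),
    ("United", "I", "GPE"),
    ("States", "I", "GPE"),
    ("of", "I", "GPE"),
    ("America", "I", "GPE"),
    ("(", "O", ""),
    ("U.S.A.", "B", "GPE"),
    ("or", "O", ""),
]


def decode_iob(triples):
    """Group B/I-tagged tokens into (entity text, entity type) pairs."""
    entities = []
    current_words, current_type = [], ""
    for text, iob, ent_type in triples:
        if iob == "I" and current_words:
            current_words.append(text)  # continue the open entity
            continue
        if current_words:               # any other code closes the open entity
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], ""
        if iob == "B":                  # start a new entity
            current_words, current_type = [text], ent_type
    if current_words:                   # flush an entity that ends the sequence
        entities.append((" ".join(current_words), current_type))
    return entities


print(decode_iob(tagged))
```

Joining words with a single space is a simplification; it ignores the original whitespace between tokens.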

### 2.4.7. Lemma

Base form of the token, with no inflectional suffixes.

In [26]:
token2.lemma_

'States'

In [27]:
sentence1[12].lemma_

'know'

### 2.4.8. Morph

In [28]:
sentence1[12].morph

Aspect=Perf|Tense=Past|VerbForm=Part
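
The printed morphology is a `|`-separated list of `Feature=Value` pairs. As a string-level illustration, the sketch below splits such a string into a dictionary; `parse_morph` is a helper invented for this example. spaCy's own morphology object provides a `to_dict()` method, which is likely the better choice when a Doc is at hand.

```python
def parse_morph(features):
    """Split a 'Key=Value|Key=Value' feature string into a dictionary."""
    if not features:
        return {}
    return dict(pair.split("=", 1) for pair in features.split("|"))


# The string printed for "known" in the cell above.
known = parse_morph("Aspect=Perf|Tense=Past|VerbForm=Part")
print(known["Tense"])
```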

### 2.4.9. Part of Speech



In [29]:
token2.pos_

'PROPN'

### 2.4.10. Syntactic Dependency

In [30]:
token2.dep_

'nsubj'

### 2.4.11. Language

In [31]:
token2.lang_

'en'

## 2.5. Part of Speech Tagging (POS)

spaCy offers an easy way to parse a text and identify its parts of speech. Below, we will iterate over each token (word or punctuation mark) in the first sentence and print its part of speech and its syntactic dependency label.

In [32]:
for token in sentence1:
  print(token.text, token.pos_, token.dep_)

The DET det
United PROPN compound
States PROPN nsubj
of ADP prep
America PROPN pobj
( PUNCT punct
U.S.A. PROPN appos
or CCONJ cc
USA PROPN conj
) PUNCT punct
, PUNCT punct
commonly ADV advmod
known VERB acl
as ADP prep
the DET det
United PROPN compound
States PROPN pobj
( PUNCT punct
U.S. PROPN appos
or CCONJ cc
US PROPN conj
) PUNCT punct
or CCONJ cc
America PROPN conj
, PUNCT punct
is AUX ROOT
a DET det
country NOUN attr
primarily ADV advmod
located VERB acl
in ADP prep
North PROPN compound
America PROPN pobj
. PUNCT punct
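
Once tokens carry part-of-speech tags, a frequency count is a natural next step. The sketch below tallies the tags with `collections.Counter`, using pairs copied by hand from the first ten rows of the output above so that it runs without a model.

```python
from collections import Counter

# (text, pos) pairs copied from the first ten rows of the output above.
tagged_tokens = [
    ("The", "DET"), ("United", "PROPN"), ("States", "PROPN"), ("of", "ADP"),
    ("America", "PROPN"), ("(", "PUNCT"), ("U.S.A.", "PROPN"), ("or", "CCONJ"),
    ("USA", "PROPN"), (")", "PUNCT"),
]

# Count how often each part-of-speech tag occurs.
pos_counts = Counter(pos for _, pos in tagged_tokens)
for pos, count in pos_counts.most_common():
    print(pos, count)
```

With a Doc in hand, the same tally could presumably be written as `Counter(token.pos_ for token in doc)`.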


In [33]:
from spacy import displacy
displacy.render(sentence1, style="dep")

## 2.6. Named Entity Recognition

In [34]:
for ent in doc.ents:
  print(ent.text, ent.label_)

The United States of America GPE
U.S.A. GPE
USA GPE
the United States GPE
U.S. GPE
US GPE
America GPE
North America LOC
50 CARDINAL
five CARDINAL
326 CARDINAL
Indian NORP
3.8 million square miles QUANTITY
9.8 million square kilometers QUANTITY
fourth ORDINAL
The United States GPE
Canada GPE
Mexico GPE
Bahamas GPE
Cuba GPE
more than 331 million CARDINAL
third ORDINAL
Washington GPE
D.C. GPE
New York GPE
Paleo-Indians NORP
Siberia LOC
North American NORP
at least 12,000 years ago DATE
European NORP
the 16th century DATE
The United States GPE
thirteen CARDINAL
British NORP
the East Coast LOC
Great Britain GPE
the American Revolutionary War ORG
the late 18th century DATE
U.S. GPE
North America LOC
Native Americans NORP
1848 DATE
the United States GPE
United States GPE
the second half of the 19th century DATE
the American Civil War ORG
Spanish NORP
World War EVENT
U.S. GPE
World War II EVENT
the Cold War EVENT
the United States GPE
the Korean War EVENT
the Vietnam War EVENT
the Soviet Union

In [35]:
displacy.render(doc, style="ent")
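
A flat list of entities is often easier to read when grouped by label. The sketch below does this with a `defaultdict`, using a handful of pairs copied by hand from the output of cell 34. The sample is an illustrative subset, not the full result.

```python
from collections import defaultdict

# A few (text, label) pairs copied from the entity output of cell 34.
# "The United States" appears twice there, so it appears twice here as well.
entity_pairs = [
    ("The United States", "GPE"), ("Canada", "GPE"), ("Mexico", "GPE"),
    ("Siberia", "LOC"), ("British", "NORP"), ("The United States", "GPE"),
    ("1848", "DATE"), ("the East Coast", "LOC"),
]

entities_by_label = defaultdict(list)
for text, label in entity_pairs:
    if text not in entities_by_label[label]:  # keep the first occurrence only
        entities_by_label[label].append(text)

for label, texts in sorted(entities_by_label.items()):
    print(label, texts)
```

With a Doc, the loop would iterate over `(ent.text, ent.label_)` for each `ent` in `doc.ents` instead of the hand-copied pairs.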