In [1]:
import spacy

In [2]:
nlp = spacy.load('en_core_web_sm')

## 1. Containers 
Containers are spaCy objects that hold a large amount of data about a text. When we analyze a text with spaCy, we create different container objects to do so. Below is the full list of spaCy containers; we will focus on three of them: Doc, Span, and Token.
1. Doc
2. DocBin
3. Example
4. Language
5. Lexeme
6. Span
7. SpanGroup
8. Token

In [4]:
with open("wiki.txt", 'r') as f:
    text = f.read()

In [5]:
print(text)

The United States of America (USA), also known as the United States (U.S.) or America, is a country primarily located in North America. It is a federal republic of 50 states and a federal capital district, Washington, D.C. The 48 contiguous states border Canada to the north and Mexico to the south, with the semi-exclave of Alaska in the northwest and the archipelago of Hawaii in the Pacific Ocean. The United States also asserts sovereignty over five major island territories and various uninhabited islands in Oceania and the Caribbean.[j] It is a megadiverse country, with the world's third-largest land area[c] and third-largest population, exceeding 340 million.[k]

Paleo-Indians migrated from North Asia to North America over 12,000 years ago, and formed various civilizations. Spanish colonization established Spanish Florida in 1513, the first European colony in what is now the continental United States. British colonization followed with the 1607 settlement of Virginia, the first of th

In [6]:
doc = nlp(text)

In [7]:
print(doc)

The United States of America (USA), also known as the United States (U.S.) or America, is a country primarily located in North America. It is a federal republic of 50 states and a federal capital district, Washington, D.C. The 48 contiguous states border Canada to the north and Mexico to the south, with the semi-exclave of Alaska in the northwest and the archipelago of Hawaii in the Pacific Ocean. The United States also asserts sovereignty over five major island territories and various uninhabited islands in Oceania and the Caribbean.[j] It is a megadiverse country, with the world's third-largest land area[c] and third-largest population, exceeding 340 million.[k]

Paleo-Indians migrated from North Asia to North America over 12,000 years ago, and formed various civilizations. Spanish colonization established Spanish Florida in 1513, the first European colony in what is now the continental United States. British colonization followed with the 1607 settlement of Virginia, the first of th

In [8]:
print(len(text))
print(len(doc))

26667
4626


In [9]:
for char in text[0:10]:
    print(char)

T
h
e
 
U
n
i
t
e
d


In [10]:
for token in doc[:10]:
    print(token)

The
United
States
of
America
(
USA
)
,
also


In [11]:
for token in text.split()[:10]:
    print(token)

The
United
States
of
America
(USA),
also
known
as
the
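
The three cells above show the difference between characters, spaCy tokens, and whitespace-separated words. As a rough illustration of why `text.split()` is not enough, here is a minimal sketch of a rule-based tokenizer in plain Python. The regular expression is a hypothetical simplification made up for this example; spaCy's real tokenizer relies on language-specific prefix, suffix, infix, and exception rules.

```python
import re

sample = "The United States of America (USA), also known as the United States (U.S.) or America"

# Whitespace splitting keeps punctuation glued to the neighbouring word.
whitespace_tokens = sample.split()

# Hypothetical, much-simplified pattern (not spaCy's rules):
#   1. a dotted abbreviation such as "U.S."
#   2. a run of word characters
#   3. any single punctuation character
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")
rough_tokens = TOKEN_PATTERN.findall(sample)

print(whitespace_tokens[:6])
print(rough_tokens[:10])
```

On this sample the sketch separates `(`, `USA`, `)`, and `,` while keeping `U.S.` together, which mirrors the behaviour seen in the `doc[:10]` output. It will not match spaCy on arbitrary text.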


### Sentence Boundary Detection

In [13]:
for sent in doc.sents:
    print(sent)

The United States of America (USA), also known as the United States (U.S.) or America, is a country primarily located in North America.
It is a federal republic of 50 states and a federal capital district, Washington, D.C.
The 48 contiguous states border Canada to the north and Mexico to the south, with the semi-exclave of Alaska in the northwest and the archipelago of Hawaii in the Pacific Ocean.
The United States also asserts sovereignty over five major island territories and various uninhabited islands in Oceania and the Caribbean.[j]
It is a megadiverse country, with the world's third-largest land area[c] and third-largest population, exceeding 340 million.[k]

Paleo-Indians migrated from North Asia to North America over 12,000 years ago, and formed various civilizations.
Spanish colonization established Spanish Florida in 1513, the first European colony in what is now the continental United States.
British colonization followed with the 1607 settlement of Virginia, the first of th

In [14]:
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (USA), also known as the United States (U.S.) or America, is a country primarily located in North America.


### Token Attributes

In [16]:
# Each token carries a set of metadata, exposed as attributes.
for token in doc:
    print(token)

The
United
States
of
America
(
USA
)
,
also
known
as
the
United
States
(
U.S.
)
or
America
,
is
a
country
primarily
located
in
North
America
.
It
is
a
federal
republic
of
50
states
and
a
federal
capital
district
,
Washington
,
D.C.
The
48
contiguous
states
border
Canada
to
the
north
and
Mexico
to
the
south
,
with
the
semi
-
exclave
of
Alaska
in
the
northwest
and
the
archipelago
of
Hawaii
in
the
Pacific
Ocean
.
The
United
States
also
asserts
sovereignty
over
five
major
island
territories
and
various
uninhabited
islands
in
Oceania
and
the
Caribbean.[j
]
It
is
a
megadiverse
country
,
with
the
world
's
third
-
largest
land
area[c
]
and
third
-
largest
population
,
exceeding
340
million.[k
]



Paleo
-
Indians
migrated
from
North
Asia
to
North
America
over
12,000
years
ago
,
and
formed
various
civilizations
.
Spanish
colonization
established
Spanish
Florida
in
1513
,
the
first
European
colony
in
what
is
now
the
continental
United
States
.
British
colonization
followed
with
the
1607
settleme

In [17]:
token2 = sentence1[2]
print(token2)

States


In [18]:
token2.text

'States'

In [19]:
token2.left_edge

The

In [20]:
token2.right_edge

,

In [21]:
token2.ent_type

384

In [22]:
token2.ent_type_

'GPE'

In [23]:
token2.ent_iob_

'I'
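
The `'I'` value means the token is inside an entity. In the IOB scheme, `B` marks the first token of an entity, `I` a token that continues it, and `O` a token outside any entity. The sketch below shows, in plain Python, how such tags can be grouped back into entity spans. The `(text, iob, label)` triples are written by hand to be consistent with this notebook's outputs, and `group_entities` is a hypothetical helper, not part of spaCy.

```python
def group_entities(tagged_tokens):
    """Group (text, iob, label) triples into (entity_text, label) pairs."""
    entities = []
    words, label = [], None
    for text, iob, ent_label in tagged_tokens:
        if iob == "B" or (iob == "I" and not words):
            # A new entity starts here; close the previous one if it is open.
            if words:
                entities.append((" ".join(words), label))
            words, label = [text], ent_label
        elif iob == "I":
            # Continue the entity that is currently open.
            words.append(text)
        else:
            # "O": outside any entity, so close whatever was open.
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
    if words:
        entities.append((" ".join(words), label))
    return entities


# Hand-written tags for the start of the first sentence (an assumption for illustration).
tagged_tokens = [
    ("The", "B", "GPE"), ("United", "I", "GPE"), ("States", "I", "GPE"),
    ("of", "I", "GPE"), ("America", "I", "GPE"),
    ("(", "O", ""), ("USA", "O", ""), (")", "O", ""), (",", "O", ""),
    ("also", "O", ""), ("known", "O", ""), ("as", "O", ""),
    ("the", "B", "GPE"), ("United", "I", "GPE"), ("States", "I", "GPE"),
]

print(group_entities(tagged_tokens))
```

Joining words with a single space is a simplification. In practice `doc.ents` already provides the grouped spans with their original text, so this is only meant to show what the tags encode.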

In [24]:
token2.lemma_

'States'

In [25]:
sentence1[10].lemma_

'know'

In [26]:
token2.morph

Number=Sing

In [27]:
sentence1[10].morph

Aspect=Perf|Tense=Past|VerbForm=Part
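
The morphology output is a string of `Key=Value` pairs separated by `|`. As a small sketch, the pairs can be split into a dictionary in plain Python. `parse_morph` is a hypothetical helper written for this example; spaCy's own `token.morph.to_dict()` should give the same kind of result directly.

```python
def parse_morph(feats):
    """Turn a 'Key=Value|Key=Value' feature string into a dict."""
    if not feats:
        # A token without morphological features has an empty string.
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


print(parse_morph("Aspect=Perf|Tense=Past|VerbForm=Part"))
print(parse_morph("Number=Sing"))
```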

In [28]:
token2.pos_

'PROPN'

In [29]:
token2.dep_

'nsubj'

In [30]:
token2.lang_

'en'

## Part of Speech Tagging

In [32]:
text1 = "Mike enjoys playing football."
doc2 = nlp(text1)
print(doc2)

Mike enjoys playing football.


In [33]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj
. PUNCT punct


In [34]:
from spacy import displacy
displacy.render(doc2, style="dep")

In [35]:
for ent in doc.ents:
    print(ent.text, ent.label_)

The United States of America GPE
the United States GPE
U.S. GPE
America GPE
North America LOC
50 CARDINAL
Washington GPE
D.C. GPE
48 CARDINAL
Canada GPE
Mexico GPE
Alaska GPE
Hawaii GPE
the Pacific Ocean LOC
The United States GPE
five CARDINAL
Oceania GPE
third ORDINAL
third ORDINAL
340 million.[k MONEY
Paleo-Indians NORP
North Asia LOC
North America LOC
12,000 years ago DATE
Spanish NORP
Spanish NORP
Florida GPE
1513 DATE
first ORDINAL
European NORP
United States GPE
British NORP
1607 DATE
Virginia GPE
first ORDINAL
the Thirteen Colonies EVENT
Africans PERSON
the Southern Colonies' ORG
British NORP
Crown PRODUCT
the American Revolution EVENT
the Declaration of Independence WORK_OF_ART
July 4, 1776 DATE
Revolutionary War EVENT
U.S. GPE
the Confederate States of America ORG
American Civil War EVENT
the United States' GPE
1900 DATE
World War I. Following Japan's EVENT
Pearl Harbor GPE
1941 DATE
U.S. GPE
World War II EVENT
U.S. GPE
the Soviet Union GPE
the Cold War EVENT
The Soviet Union'

In [36]:
displacy.render(doc, style="ent")

### Word Vectors and spaCy

Word vectors, or word embeddings, are numerical representations of words in a multidimensional space, stored as matrices. The purpose of a word vector is to give a computer system a way to understand a word: words that are used in similar contexts end up with similar vectors.
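
To make the idea of "similar vectors" concrete, here is a minimal sketch of cosine similarity, a common way to compare two word vectors. The three-dimensional vectors are invented toy values, not taken from any spaCy model (real models use hundreds of dimensions).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand for illustration only.
toy_vectors = {
    "country": [0.9, 0.8, 0.1],
    "nation": [0.85, 0.75, 0.2],
    "banana": [0.1, 0.2, 0.9],
}

print(cosine_similarity(toy_vectors["country"], toy_vectors["nation"]))
print(cosine_similarity(toy_vectors["country"], toy_vectors["banana"]))
```

With these toy values, "country" scores much closer to "nation" than to "banana", which is the kind of relationship a trained model tries to capture.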

In [39]:
nlpp = spacy.load("en_core_web_md")

In [40]:
doc = nlpp(text)
sentence11 = list(doc.sents)[0]
print(sentence11)

The United States of America (USA), also known as the United States (U.S.) or America, is a country primarily located in North America.


In [82]:
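import numpy as np

your_word = "country"

# Look up the vector for your_word and search the vector table for its 10 nearest entries.
# most_similar returns a (keys, best_rows, scores) tuple with one row per query vector.
query = np.asarray([nlpp.vocab.vectors[nlpp.vocab.strings[your_word]]])
ms = nlpp.vocab.vectors.most_similar(query, n=10)
words = [nlpp.vocab.strings[w] for w in ms[0][0]]
scores = ms[2]  # similarity scores for the returned keys
print(words)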

['inner-city', 'anti-poverty', 'SLUMS', 'Socioeconomic', 'Divides', 'INTERSECT', 'drop-out', 'dropout', 'handicaps', 'suburb']
