In [86]:
import spacy
import numpy as np

In [3]:
# Create the NLP object (load the installed model)
nlp = spacy.load("en_core_web_sm")

In [4]:
type(nlp)

spacy.lang.en.English

# Getting Started with spaCy

In [5]:
# Open file
with open("data_spacy/wiki_us.txt", "r") as f:
    text = f.read()

print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [6]:
# Create doc object
doc = nlp(text)
type(doc)

spacy.tokens.doc.Doc

In [7]:
doc

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [14]:
# Compare the length of the raw text and the Doc object
print("Text length: ", len(text))
print("Doc object length: ", len(doc))

Text length:  3525
Doc object length:  652


In [16]:
# Compare the elements of the raw text and the Doc object
print("Text element: ")
for char in text[0:10]:
    print(char)

print("\nDoc object element: ")
for token in doc[0:10]:
    print(token)

Text element: 
T
h
e
 
U
n
i
t
e
d

Doc object element: 
The
United
States
of
America
(
U.S.A.
or
USA
)


In [17]:
# spaCy's rule-based tokenization vs. str.split()
print("Text split")
for token in text.split()[:10]:
    print(token)

print("\nTokenization rules:")
for token in doc[:10]:
    print(token)

Text split
The
United
States
of
America
(U.S.A.
or
USA),
commonly
known

Tokenization rules:
The
United
States
of
America
(
U.S.A.
or
USA
)


In [37]:
# Sentence segmentation on the Doc object
# Note: use the "sents" attribute. Doc.sents returns a generator.
#       Each element of the generator is a Span object,
#       and each Span object contains Token objects.

for idx, sent in enumerate(list(doc.sents)[:10]):
    print(f"{idx + 1}. {sent}")

print()
print("Span object: ", type(sent))
print("Token object: ", type(sent[0]))

1. The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
2. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
3. At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
4. The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
5. With a population of more than 331 million people, it is the third most populous country in the world.
6. The national capital is Washington, D.C., and the most populous city is New York.


7. Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
8. The United States emerged from the thir

NOTE: 
- A Doc object is a sequence of tokens (split according to spaCy's tokenization rules), whereas the raw text string is a sequence of individual characters.
- By default, spaCy tokenizes the input at the word level.
- Doc, Span, and Token objects each carry their own metadata.
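
To make the difference between whitespace splitting and rule-based tokenization concrete, here is a minimal sketch that uses only the standard library. The regular expression and the `toy_tokenize` helper are assumptions made for illustration. They only roughly approximate the behaviour seen above and are not spaCy's actual tokenizer rules.

```python
import re

# Hypothetical pattern (not spaCy's rules): dotted abbreviations such as
# "U.S.A." first, then runs of word characters, then single punctuation marks.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")


def toy_tokenize(text):
    """Split text into word-like tokens and separate punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


sample = "The United States of America (U.S.A. or USA), commonly known"

whitespace_tokens = sample.split()
rule_tokens = toy_tokenize(sample)

print(whitespace_tokens[5:8])  # punctuation stays glued to the words
print(rule_tokens[5:10])       # punctuation becomes separate tokens
```

With `str.split()` the parentheses and comma remain attached to their neighbouring words, while the rule-based version yields them as tokens of their own, which is the same pattern the Doc object shows.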

## Extract Metadata from a Token Object

This example uses a Token object.

In [46]:
# Token object properties

sentence1 = list(doc.sents)[0]
print("Main sentence:\n", sentence1.text)
print(type(sentence1))

token1 = sentence1[12]
print(type(token1))

Main sentence:
 The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
<class 'spacy.tokens.span.Span'>
<class 'spacy.tokens.token.Token'>


In [44]:
# Get the token text as a string
#  Use the "text" attribute.

token1.text

'known'

In [48]:
# Get the syntactic head of this token (the token that governs it).
#  Returns a Token object
token1.head

States

In [50]:
# Get the leftmost token of this token's syntactic descendants.
#  Returns a Token object
token1.left_edge

commonly

In [51]:
# Get the rightmost token of this token's syntactic descendants.
#  Returns a Token object
token1.right_edge

America

In [59]:
# Entity type
print(sentence1[2].ent_type)  # Integer ID of the entity type
print(sentence1[2].ent_type_)  # String name of the entity type

384
GPE


Some explanations:
- PERSON: People, including fictional.
- NORP: Nationalities or religious or political groups. 
- FAC: Buildings, airports, highways, bridges, etc.
- ORG: Companies, agencies, institutions, etc.
- GPE: Countries, cities, states.
- LOC: Non-GPE locations, mountain ranges, bodies of water.
- PRODUCT: Objects, vehicles, foods, etc. (Not services.)
- EVENT: Named hurricanes, battles, wars, sports events, etc.
- WORK_OF_ART: Titles of books, songs, etc.
- LAW: Named documents made into laws.
- LANGUAGE: Any named language.
- DATE: Absolute or relative dates or periods.
- TIME: Times smaller than a day.
- PERCENT: Percentage, including "%".
- MONEY: Monetary values, including unit.
- QUANTITY: Measurements, as of weight or distance.
- ORDINAL: “first”, “second”, etc.
- CARDINAL: Numerals that do not fall under another type.

In [64]:
# IOB code of the named entity tag:
#   "B" means the token begins an entity,
#   "I" means it is inside an entity,
#   "O" means it is outside an entity,
#   and "" means no entity tag is set.

print(token1, token1.ent_iob_)  # IOB code as a string
print(token1, token1.ent_iob)  # IOB code as an integer (3 = B, 2 = O, 1 = I, 0 = not set)
print(sentence1[2], sentence1[2].ent_iob_)
print(sentence1[2], sentence1[2].ent_iob)

known O
known 2
States I
States 1
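
The IOB codes become more useful once consecutive tokens are merged back into whole entities. The sketch below shows one way to do that in plain Python. The `group_entities` helper and the tagged tokens are hypothetical: the tags were written by hand for illustration and are not output from the model.

```python
def group_entities(tagged_tokens):
    """Merge (text, iob, label) triples into (entity_text, label) pairs."""
    entities = []
    words, label = [], None
    for text, iob, ent_type in tagged_tokens:
        starts_entity = iob == "B" or (iob == "I" and not words)
        if starts_entity or iob not in ("B", "I"):
            # Close the entity collected so far, if there is one.
            if words:
                entities.append((" ".join(words), label))
            words, label = [], None
        if iob in ("B", "I"):
            words.append(text)
            label = label or ent_type
    if words:
        entities.append((" ".join(words), label))
    return entities


# Hand-written tags (an assumption, not model output).
tagged = [
    ("The", "B", "GPE"),
    ("United", "I", "GPE"),
    ("States", "I", "GPE"),
    ("shares", "O", ""),
    ("borders", "O", ""),
    ("with", "O", ""),
    ("Canada", "B", "GPE"),
    ("and", "O", ""),
    ("Mexico", "B", "GPE"),
]

entities = group_entities(tagged)
print(entities)
```

A "B" always opens a new entity, an "I" extends the current one, and an "O" closes whatever entity was being collected.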


In [65]:
# Lemma: the base form of the token, with no inflectional suffixes.
print(token1.lemma_)

know


In [69]:
# Morphological analysis
#  Returns a MorphAnalysis object.

print(token1)
print(token1.morph)

known
Aspect=Perf|Tense=Past|VerbForm=Part


NOTE:
- Aspect describes how the action, event, or state expressed by a verb unfolds over time.
- Aspect=Perf ==> Perfect aspect, which indicates that the action is completed.
- Tense=Past ==> Past tense.
- VerbForm=Part ==> "Part" stands for participle. Participles are typically used together with auxiliary verbs to form different tenses or aspects.
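
The printed morphology is a string of `Feature=Value` pairs joined by `|`, so it is easy to break apart for inspection. The sketch below parses that string form with plain Python. The `parse_morph` helper is a hypothetical name introduced here and is not part of spaCy.

```python
def parse_morph(morph_string):
    """Turn 'Feature=Value|Feature=Value' into a dictionary."""
    if not morph_string:
        return {}
    return dict(pair.split("=", 1) for pair in morph_string.split("|"))


features = parse_morph("Aspect=Perf|Tense=Past|VerbForm=Part")
print(features)
print(features["VerbForm"])
```

Having the features in a dictionary makes it simple to filter tokens, for example keeping only those whose `VerbForm` is `Part`.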

In [71]:
# Coarse-grained part-of-speech from the Universal POS tag set
print(token1.pos_)

VERB


In [72]:
# Syntactic dependency relation
print(token1.dep_)

acl


In [73]:
# Language of the parent document's vocabulary
print(token1.lang_)

en


In [74]:
# Try another example
text = "Mike enjoys playing football."
doc2 = nlp(text)
print(doc2)

Mike enjoys playing football.


In [75]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj
. PUNCT punct
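
The `head`, `left_edge`, and `right_edge` attributes used earlier all follow from the dependency tree. The sketch below rebuilds that logic in plain Python for this sentence. The `heads` list is an assumption read off the dependency labels above (the output does not print the heads), and the helper functions are hypothetical names, not spaCy APIs.

```python
words = ["Mike", "enjoys", "playing", "football", "."]
# Assumed head index for each token, inferred from the labels above.
# The root ("enjoys") points at itself.
heads = [1, 1, 1, 2, 1]


def subtree(index):
    """Return the sorted indices of a token and all of its descendants."""
    collected = [index]
    for child, head in enumerate(heads):
        if head == index and child != index:
            collected.extend(subtree(child))
    return sorted(collected)


def left_edge(index):
    """Leftmost word in the token's subtree."""
    return words[subtree(index)[0]]


def right_edge(index):
    """Rightmost word in the token's subtree."""
    return words[subtree(index)[-1]]


print(left_edge(2), right_edge(2))  # subtree of "playing"
print(left_edge(1), right_edge(1))  # subtree of the root "enjoys"
```

Under these assumed heads, the subtree of "playing" spans "playing football", while the subtree of the root covers the whole sentence.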


In [79]:
# Visualize it
from spacy import displacy
# Style as dependency
displacy.render(doc2, style='dep')

In [78]:
# Style based on entities
displacy.render(doc2, style='ent')

In [80]:
# Visualize the entities in the Doc built from the imported text
displacy.render(doc, style='ent')

# Word Vectors and spaCy

> Word vectors (or word embeddings) are numerical representations of words in a multidimensional space, stored as matrices.

Word similarity:
> A word is considered similar to another when it frequently occurs in the same contexts. A similar word is sometimes a synonym and sometimes not.
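
Similarity between two vectors is commonly measured with cosine similarity, which spaCy's documentation describes as the default for its `similarity` methods. The sketch below computes it by hand. The three-dimensional vectors are made-up toy values, not vectors from any model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors (real word vectors have hundreds of dimensions).
toy_vectors = {
    "orange": [0.9, 0.1, 0.0],
    "apple": [0.8, 0.2, 0.0],
    "building": [0.0, 0.1, 0.9],
}

fruit_score = cosine_similarity(toy_vectors["orange"], toy_vectors["apple"])
mixed_score = cosine_similarity(toy_vectors["orange"], toy_vectors["building"])
print(round(fruit_score, 3), round(mixed_score, 3))
```

Vectors that point in nearly the same direction score close to 1, and vectors with little overlap score close to 0.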

In [96]:
nlp = spacy.load("en_core_web_md")
# To find where the model is stored locally:
# nlp._path

In [97]:
with open("data_spacy/wiki_us.txt", "r") as f:
    text = f.read()

In [98]:
doc = nlp(text)
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


In [100]:
# Example 1 (find the top n words most similar to a given word in the model's vocabulary)
your_word = "country"

ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]),
    n=10
)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
print(words)

['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']
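
Conceptually, a most-similar lookup scores every row of the vector table against the query vector and keeps the best n. The sketch below does this by brute force over a tiny, made-up table. It is an illustration of the idea under those assumptions, not how `Vectors.most_similar` is implemented, and unlike the call above it excludes the query word from the results.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query_word, table, n=2):
    """Return the n words whose vectors are closest to the query word's."""
    query = table[query_word]
    scored = [
        (word, cosine_similarity(query, vector))
        for word, vector in table.items()
        if word != query_word  # skip the query itself
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [word for word, _ in scored[:n]]


# Hypothetical two-dimensional vectors, chosen by hand.
toy_table = {
    "country": [0.9, 0.1],
    "nation": [0.8, 0.2],
    "state": [0.7, 0.4],
    "pastille": [0.1, 0.9],
}

nearest = most_similar("country", toy_table, n=2)
print(nearest)
```

The quality of the result depends entirely on the vector table: a lookup can only return keys that exist in it, whatever those keys happen to be.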


In [101]:
# Example 2
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761


In [102]:
# Example 3
doc3 = nlp("The Empire State Building is in New York.")
print(doc1, "<->", doc3, doc1.similarity(doc3))

I like salty fries and hamburgers. <-> The Empire State Building is in New York. 0.1766669125394067


In [103]:
# Example 4
doc4 = nlp("I enjoy oranges.")
doc5 = nlp("I enjoy apples.")
print(doc4, "<->", doc5, doc4.similarity(doc5))

I enjoy oranges. <-> I enjoy apples. 0.9775700747747101


In [104]:
# Example 6
doc6 = nlp("I enjoy burgers.")
print(doc4, "<->", doc6, doc4.similarity(doc6))

I enjoy oranges. <-> I enjoy burgers. 0.9628306076251026
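
A plausible explanation for the high score here is that, per spaCy's documentation, a Doc's vector defaults to the average of its token vectors, so the shared tokens pull the two averages together. The sketch below reproduces that effect with made-up one-hot vectors. The table is an assumption for illustration, and the final period is left out for brevity.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_vector(tokens, table):
    """Element-wise mean of the vectors for the given tokens."""
    vectors = [table[token] for token in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


# Hypothetical vectors in which every word is unrelated to every other word.
toy_table = {
    "I": [1.0, 0.0, 0.0, 0.0],
    "enjoy": [0.0, 1.0, 0.0, 0.0],
    "oranges": [0.0, 0.0, 1.0, 0.0],
    "burgers": [0.0, 0.0, 0.0, 1.0],
}

word_score = cosine_similarity(toy_table["oranges"], toy_table["burgers"])
doc_score = cosine_similarity(
    average_vector(["I", "enjoy", "oranges"], toy_table),
    average_vector(["I", "enjoy", "burgers"], toy_table),
)
print(word_score, round(doc_score, 3))
```

Even though "oranges" and "burgers" are completely dissimilar in this toy table, the two sentences still score about 0.667 because two of their three tokens are identical.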


In [105]:
# Example 7
salty_fries = doc1[2:4]  # Span: "salty fries"
hamburgers = doc1[5]     # Token: "hamburgers"
print(salty_fries, "<->", hamburgers, salty_fries.similarity(hamburgers))

salty fries <-> hamburgers 0.6938489079475403
