In [3]:
import spacy
import numpy as np

In [4]:
# Create the NLP object (load the installed model)
nlp = spacy.load("en_core_web_sm")

In [5]:
type(nlp)

spacy.lang.en.English

# Introduction

### What is Natural Language Processing
- Natural Language Processing is a subfield of artificial intelligence that tries to **process and analyze natural language data.**
  
NOTE: Natural language is a language that developed naturally through use.

The idea: Since we already know the semantic and grammatical rules of a human language, we can build applications that programmatically understand utterances in that language.

### How can Computers Understand Language
- Since computers only understand numbers, we need to convert words into numbers. This process is called **word embedding**.
- Word embedding concept: **map words to vectors of real numbers that distribute the meaning of each word** across the coordinates of its word vector. NOTE:
    - Two words are similar (from the machine's perspective) if their word vectors are close to each other.
    - Two words end up close together in the vector space based on **the contextual similarity** of their usage in a large corpus of text.
      
      NOTE: The key factors are (1) co-occurrence in similar contexts and (2) frequency of co-occurrence.
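
NOTE: "Close to each other" is commonly measured with cosine similarity. The sketch below is a minimal illustration in plain Python. The three-dimensional vectors in `toy_vectors` are made up for this example and do not come from any trained model.

```python
import math

# Hypothetical 3-dimensional embeddings; real models use hundreds of dimensions.
toy_vectors = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])
print(f"cat vs dog: {sim_cat_dog:.3f}")
print(f"cat vs car: {sim_cat_car:.3f}")
```

With these toy values, "cat" and "dog" point in nearly the same direction, so their score is higher than the score for "cat" and "car".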

### Dependency Grammars vs Phrase Structure Grammars
> spaCy primarily uses dependency grammar for syntactic parsing. It builds dependency trees that represent the relationships between words in a sentence, showing how each word depends on another.

**Phrase Structure Grammars**
- Based on how words combine to form constituents in a sentence ==> Focus on the **relations between constituents**
  
    NOTE: A constituent is a group of words that functions as a single unit in a sentence, e.g., a phrase or a clause.

- Concept: The rules decompose a sentence into its constituent parts hierarchically until each part is a single word.


Example:

![Images](data_spacy/images/ex-parse-structure-grammar.png)

**Dependency Grammars**
- Based on the **relations between individual words.**
- Concept: Determine the root, which is the main word of the sentence (usually the main verb). All other words then depend on this root, directly or indirectly.
    NOTE:
      (1) We call the governing word the "HEAD" and the dependent word the "CHILD".
      (2) Each word in a sentence is connected to exactly one HEAD (in spaCy, the ROOT is its own head). The same word may have no, one, or several CHILDREN.

Example:

![Images](data_spacy/images/ex-dep-structure-grammar.png)

Explanation:
- "sat" is ROOT
- "sat" is HEAD of "cat", "on", and "mat"
- "cat" is HEAD of "the"
- "mat" is HEAD of "the"
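
NOTE: The head/child relations listed above can be stored as one head index per token. The sketch below is a plain-Python illustration that does not use spaCy. The `heads` list is written by hand to mirror the explanation and is not parser output.

```python
from collections import defaultdict

# Tokens of "The cat sat on the mat", indexed by position.
tokens = ["The", "cat", "sat", "on", "the", "mat"]

# heads[i] is the index of token i's HEAD, following the explanation above.
# The ROOT ("sat", index 2) points to itself.
heads = [1, 2, 2, 2, 5, 2]

children = defaultdict(list)
root = None
for i, head in enumerate(heads):
    if head == i:
        root = i  # a token that is its own head is the ROOT
    else:
        children[head].append(i)

print("ROOT:", tokens[root])
for head, kids in sorted(children.items()):
    print(tokens[head], "->", [tokens[k] for k in kids])
```

Because every token stores exactly one head, the children lists can always be derived from the heads, which is why a single `heads` list is enough to describe the whole tree.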

### Common Grammar Concepts

**Transitive Verbs and Direct Objects**
- A direct object is a noun (or a noun phrase) that **directly receives the action** of the verb.
- A transitive verb is an **action verb that needs something (or someone) to receive the action**. The "something" that receives the action is the **direct object**.

For example:

    "She wrote a letter" ==> Direct object is "a letter" and Transitive verb is "wrote".

**Prepositional Objects**
- A preposition **connects noun phrases (or noun, pronoun) with other words in a sentence**.
    Example: "in", "above", "at", "to", "of", etc.
- Object of a preposition: a noun (or pronoun, or noun phrase) that follows a preposition.

Example:

    "I wrote a series of articles." ==> Preposition is "of" and Object of preposition is "articles".

NOTE:
In some questions, extracting the object of a preposition can give us the most informative word or phrase for finding the answer.

Example:

    "What can be done about climate change?" ==> The phrase “climate change” is the key phrase.

**Modal Auxiliary Verbs**
- Modal auxiliary verbs are **special verbs used alongside a main verb** to express various moods or modalities. Example: "may", "might", "can", etc.


**Personal Pronoun**
- A personal pronoun refers to a specific person, object, or to multiple people or objects.

Forms according to their grammatical role in a sentence:
- The nominative form (I, you, he, she, it, we, they) is typically used as the nominal subject of a verb.
- The accusative form (me, you, him, her, it, us, them) is typically used as the object of a verb or preposition.
- The reflexive form (myself, yourself/yourselves, himself, herself, itself, ourselves, themselves) typically refers back to the subject specified within the same clause.

# Basic NLP Operations with spaCy

**Tokenization**

> Parsing the text input into tokens.

- By default, spaCy tokenizes text into word-level tokens.

In [6]:
text = "I am flying to Frisco"
doc = nlp(text)

for idx, token in enumerate(doc):
    print(f"{idx + 1}. {token}")

1. I
2. am
3. flying
4. to
5. Frisco


**Lemmatization**
> Process of reducing word forms to their lemma.

NOTE: A lemma is the base form of a token. For example, the lemma of "flying" is "fly".

In [7]:
text = "This product integrates both libraries for downloading and applying patches"
doc = nlp(text)
print("Text", "Lemma")
for token in doc:
    print(token.text, token.lemma_)

Text Lemma
This this
product product
integrates integrate
both both
libraries library
for for
downloading download
and and
applying apply
patches patch


**Case Application of Lemmatization: Meaning Recognition.**

Suppose we have an NLP application that interacts with an online system providing an API for booking tickets for trips.

The application processes a customer's request, extracts the necessary information from it, and then passes that information on to the underlying API.

The two pieces of information to extract:
1. Whether the customer wants an air ticket, a railway ticket, or a bus ticket.
2. The destination.

Ideas:
1. Tokenization
2. Lemmatization
3. Match each token's lemma against a predefined list of keywords that maps it to the kind of ticket the customer needs.

NOTE: Lemmatization lets the programmer keep the predefined keyword list short, because only base forms need to be listed.

![Images](data_spacy/images/ex-app-lemma.png)
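
NOTE: The keyword-matching idea (step 3) can be sketched without spaCy, assuming the lemmas have already been produced. The table `TRANSPORT_KEYWORDS` and the function `detect_ticket_type` are hypothetical names made up for this illustration.

```python
# Hypothetical keyword table: lemma -> ticket type.
TRANSPORT_KEYWORDS = {
    "fly": "air",
    "plane": "air",
    "train": "railway",
    "rail": "railway",
    "bus": "bus",
}

def detect_ticket_type(lemmas):
    """Return the ticket type of the first lemma found in the table, or None."""
    for lemma in lemmas:
        ticket_type = TRANSPORT_KEYWORDS.get(lemma.lower())
        if ticket_type is not None:
            return ticket_type
    return None

# Lemmas as a lemmatizer might return them for "I am flying to Frisco".
lemmas = ["I", "be", "fly", "to", "Frisco"]
print(detect_ticket_type(lemmas))
```

Because the table holds only base forms, "flying", "flew", and "flown" all hit the single entry "fly" once they are lemmatized.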

In [8]:
from spacy.symbols import ORTH, NORM
from spacy.language import Language

# Example simple program

input_text = "I am flying to Frisco"

# Modeling
nlp = spacy.load("en_core_web_sm")

# Add special case rules (for special token)
special_case = [{ORTH: "Frisco", NORM: "San Francisco"}]
nlp.tokenizer.add_special_case("Frisco", special_case)

# Custom pipeline component to adjust lemmas
@Language.component("custom_lemma_setter")
def custom_lemma_setter(doc):
    for token in doc:
        if token.text == "Frisco":
            token.lemma_ = "San Francisco"  # Set custom lemma
    return doc

# Add the component after the default lemmatizer in the pipeline
nlp.add_pipe("custom_lemma_setter", after="lemmatizer")

doc = nlp(input_text)

print("Text", "Lemma")
for token in doc:
    print(token.text, token.lemma_)

Text Lemma
I I
am be
flying fly
to to
Frisco San Francisco


**Part-of-Speech Tagging**
> Telling part-of-speech of a given word in a given sentence (noun, verb, and so on).

In spaCy, there are two types of part-of-speech tags:
1. Coarse-grained parts of speech
> Using the Token.pos (int) and Token.pos_ (str) attributes. 

2. Fine-grained parts of speech
> Using the Token.tag (int) and Token.tag_ (str) attributes.

NOTE: In English, the core parts of speech include noun, pronoun, determiner, adjective, verb, adverb, preposition, conjunction, and interjection.


Detail about pos tagging: https://v2.spacy.io/api/annotation#pos-tagging

In [9]:
# Example

text = "The United States is a country primarily located in North America"
doc = nlp(text)

for idx, token in enumerate(doc):
    print(f"{idx + 1}. {token.text}", token.pos_, token.tag_)

1. The DET DT
2. United PROPN NNP
3. States PROPN NNP
4. is AUX VBZ
5. a DET DT
6. country NOUN NN
7. primarily ADV RB
8. located VERB VBN
9. in ADP IN
10. North PROPN NNP
11. America PROPN NNP


**Case Application POS Tags: Find Relevant Verbs.**

Continuing our online ticket system case. Since a verb can appear in several forms (for example past, present, or future), we need to filter for the forms we expect. For example:

- I flew to LA.
- I have flown to LA.
- I need to fly to LA.
- I am flying to LA.
- I will fly to LA.

Notice that although all of these sentences would include the “fly to LA” combination if reduced to lemmas, only some of them imply the customer’s intent to book a plane ticket to LA. The first two definitely aren’t suitable.

In [10]:
# According to the tag table, we select tokens whose tag_ is 'VBG' or 'VB'.
# The destination is expected to be recognized as PROPN.

input_text = "I have flown to LA. Now I am flying to Frisco."
doc = nlp(input_text)
print("Text", "Lemma")
verb_idx = None  # index of the most recent matching verb, if any
for token in doc:
    if token.tag_ in ("VBG", "VB"):
        print(token.text, '->', token.lemma_)
        verb_idx = token.i
    # Only report proper nouns that come after a matching verb
    if token.pos_ == "PROPN" and verb_idx is not None and verb_idx < token.i:
        print(token.text, '->', token.lemma_)

Text Lemma
flying -> fly
Frisco -> San Francisco


NOTE: We still need to improve this approach by taking the context of the input text into account.

For example, consider “I’m already in the sky, flying to LA.” and “I’m going to fly to LA.” The ticket booking application should interpret only the second sentence as “I need an air ticket to LA.” 

**Syntactic Relations: Dependency Parser**

> Dependency parser helps discover syntactic relations between individual tokens in a sentence and connects syntactically related pairs of words with a single arc.


Concept:
- Head and Child ==> Since a dependency describes the syntactic relation between two words, one word is called the Head (or Parent) and the other the Child (or Dependent).
  
  NOTE:
  - Each word in a sentence has exactly one head. Consequently, a word can be a child of only one head.
  - If a token's head is the token itself, it is labeled ROOT.
  - A dependency label is always assigned to the child (in the graphical representation, the arrow always points from head to child).
 
NOTE:
- A complete sentence typically has a **verb** with the **ROOT label** and a subject with the **nsubj label**. The other elements are optional. 
  

Detail info about Dependency: https://v2.spacy.io/api/annotation#dependency-parsing

In [11]:
# NOTES:
# using Token.head to access head token object.
# using Token.dep_ to access its dependency

# Example
text = "I need a plane ticket"

doc = nlp(text)

print("Token", "Dependency", "Head")
for token in doc:
    print(token, token.dep_, token.head)

Token Dependency Head
I nsubj need
need ROOT need
a det ticket
plane compound ticket
ticket dobj need


In [12]:
# Visualize it
from spacy import displacy
# Style as dependency
displacy.render(doc, style='dep')

**Case Application Syntactic Relations: Find Relevant Verbs.**

Continuing our online ticket system case. Since the input text may combine two or more sentences, we need to improve our filter. For example:

"I have flown to LA. Now I am flying to Frisco"

NOTE: In this case, we need a ticket to San Francisco instead of LA.

In [13]:
# Concept: based on the pattern of the text,
#  the ROOT verb indicates which transportation the customer needs, and the destination is the object of a preposition (pobj).

text = "I have flown to LA. Now I am flying to Frisco"
doc = nlp(text)


# Tokenize into sentence level
sentences = list(doc.sents)
for sent in sentences:
    storage = []
    for word in sent:
        if word.dep_ == 'ROOT' or word.dep_ == 'pobj':
            storage.append(word)
    print(storage)

[flown, LA]
[flying, Frisco]


**Named Entity Recognition**
> A named entity is a real object that you can refer to by a proper name.

In [14]:
# NOTES:
# - Token.ent_type_ returns the entity type of the token as a string
# - Token.ent_type returns the same entity type as an integer ID

text = "I have flown to LA. Now I am flying to Frisco"
doc = nlp(text)

print("Token", "Entity")
for token in doc:
    if token.ent_type != 0:
        print(token.text, token.ent_type_)

Token Entity
LA GPE
Frisco PERSON


# Working with Container Objects and Customizing spaCy

### Container Objects

- The main objects of spaCy:
    1. Container objects: objects that group multiple elements into a single unit. A container can hold a collection of objects (like tokens or sentences) or a set of annotations. 
    2. Pipeline components (such as the part-of-speech tagger and the named entity recognizer). 

In [15]:
# Create Doc object

## Explicit way
from spacy.tokens.doc import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Hi", "there"])
# NOTES:
#  - Vocab object: Storage container that provides vocabulary data, such as lexical types
#      (adjective, verb, noun, and so on).
#  - words argument: list of tokens to add to the Doc object being created.

print("Explicit way: ", doc)

## By default, a spaCy model returns a Doc object when we pass text to the nlp object
doc = nlp("Hi there")
print("Return from Spacy NLP Model: ", doc)

Explicit way:  Hi there 
Return from Spacy NLP Model:  Hi there


In [17]:
# Iterating over token

doc = nlp("I want a green apple")
for token in doc:
    print(f"{token.i}. {token.text}")

0. I
1. want
2. a
3. green
4. apple


In [35]:
# Get children
# NOTE: spaCy uses dependency parsing, so every token knows its syntactic children. A child can sit on either side of its head.
# - Token.lefts ==> generator over the children to the left of the token
# - Token.children ==> generator over all children of the token (left and right)

doc = nlp("I want a green apple")
print(f"The children of {doc[4]}: {list(doc[4].lefts)}")
print(f"The children of {doc[4]}: {list(doc[4].children)}")


from spacy import displacy
# Style as dependency
displacy.render(doc, style='dep')

The children of apple: [a, green]
The children of apple: [a, green]


In [46]:
# Get sentence objects
# Doc.sents ==> returns a generator; each element is a Span object

doc = nlp("A severe storm hit the beach. It started to rain.")
for idx, s in enumerate(doc.sents):
    print(idx + 1, s.text)

print("Type element Doc.sents: ", type(s))

# Example: check whether the first word in the second sentence of the text being processed is a pronoun
for idx, s in enumerate(doc.sents):
    if s[0].pos_ == 'PRON' and idx == 1:
        print("The second sentence begins with a pronoun.")

# Example: find out how many sentences in the text end with a verb
#  (s[-1] is the final punctuation mark, so we check the token before it)
count = 0
for s in doc.sents:
    if len(s) >= 2 and s[-2].pos_ == 'VERB':
        count += 1
print("Number of sentences end with a verb: ", count)

1 A severe storm hit the beach.
2 It started to rain.
Type element Doc.sents:  <class 'spacy.tokens.span.Span'>
The second sentence begins with a pronoun.
Number of sentences end with a verb:  1


In [67]:
# A noun chunk is a noun phrase that consists of a noun and its immediate dependents, such as adjectives, determiners, or other words
#  that modify the noun.

# Doc.noun_chunks ==> yields Span objects.

doc = nlp("The quick brown fox jumped over the lazy dog.")

print("Chunk noun: ")
for s in doc.noun_chunks:
    print(s.text)

# Manual code for noun_chunks
store = []
for token in doc:
    if token.pos_ == "NOUN":
        chunk = ""
        for w in token.children:
            if w.pos_ == "DET" or w.pos_ == "ADJ":
                chunk = chunk + w.text + " "
        chunk = chunk + token.text
        store.append(chunk)

print("\nManual way: ")
for s in store:
    print(s)
# NOTE: 
# - Noun chunks can also include other parts of speech, for example adverbs.
# - In English, the words that modify a noun (determiners and adjectives) are typically leftward syntactic children of the noun,
#    so in this example we do not strictly need the `w.pos_ == "DET" or w.pos_ == "ADJ"` check.
# - The manual code returns strings, but doc.noun_chunks yields Span objects.

Chunk noun: 
The quick brown fox
the lazy dog

Manual way: 
The quick brown fox
the lazy dog


In [63]:
# Modified manual code that follows the previous notes.
store = []
for token in doc:
    if token.pos_ == 'NOUN':
        chunk = ""
        for w in token.lefts:
            chunk = chunk + w.text + " "

        chunk = chunk + token.text
        store.append(chunk)
print(store)

['The quick brown fox', 'the lazy dog']


In [None]:
# Span object: a slice of a Doc object.
#  Some examples of Span objects: (1) a slice of a Doc: doc[start:end], (2) an element of Doc.sents, (3) an element of Doc.noun_chunks.

### Pipeline Object

In [70]:
# Check what pipeline components are available
print(nlp.pipe_names)

['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'custom_lemma_setter', 'ner']


In [72]:
# Disabling pipeline components (from a model that already contains the component)

nlp = spacy.load("en_core_web_sm", disable=['parser']) # Disable the dependency parser
print(nlp.pipe_names)

# Try to get the dependency label of each token (it stays empty because the parser is disabled)
doc = nlp("I want a green apple.")

print("Text", "POS", "Dependency (disable)")
for token in doc:
    print(token.text, token.pos_, token.dep_)

['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'ner']
Text POS Dependency (disable)
I PRON 
want VERB 
a DET 
green ADJ 
apple NOUN 
. PUNCT 
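
NOTE: Conceptually, a pipeline is an ordered list of components, each of which receives the document and returns it with extra annotations. Disabling a component means skipping it in that list. The sketch below mimics this idea in plain Python. The dict-based `doc`, the two toy components, and `run_pipeline` are hypothetical stand-ins and not spaCy's implementation.

```python
def lowercase_component(doc):
    """Add a lowercase form for every token."""
    doc["lower"] = [t.lower() for t in doc["tokens"]]
    return doc

def length_component(doc):
    """Add the character length of every token."""
    doc["lengths"] = [len(t) for t in doc["tokens"]]
    return doc

# Ordered (name, callable) pairs, loosely mirroring nlp.pipe_names.
pipeline = [("lowercaser", lowercase_component), ("length", length_component)]

def run_pipeline(text, pipeline, disable=()):
    """Tokenize the text, then apply every component that is not disabled."""
    doc = {"tokens": text.split()}  # stand-in for the tokenizer
    for name, component in pipeline:
        if name in disable:
            continue  # a disabled component never annotates the doc
        doc = component(doc)
    return doc

print(run_pipeline("I want a green apple", pipeline))
print(run_pipeline("I want a green apple", pipeline, disable=["length"]))
```

In the second call the `lengths` annotation is simply missing, in the same way that `token.dep_` stays empty above when the parser is disabled.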


# Getting Started with spaCy

In [109]:
# Open file
with open("data_spacy/wiki_us.txt", "r") as f:
    text = f.read()

print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [110]:
# Create doc object
doc = nlp(text)
type(doc)

spacy.tokens.doc.Doc

In [111]:
doc

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [112]:
# Compare the length of the text and of the Doc object
print("Text length: ", len(text))
print("Doc object length: ", len(doc))

Text length:  3525
Doc object length:  652


In [113]:
# Compare the elements of the text and of the Doc object
print(f"Text element: ")
for token in text[0:10]:
    print(token)

print("\nDoc object element: ")
for token in doc[0:10]:
    print(token)

Text element: 
T
h
e
 
U
n
i
t
e
d

Doc object element: 
The
United
States
of
America
(
U.S.A.
or
USA
)


In [114]:
# spaCy's rule-based tokenization vs. plain string split
print("Text split")
for token in text.split()[:10]:
    print(token)

print("\nTokenization rules:")
for token in doc[:10]:
    print(token)

Text split
The
United
States
of
America
(U.S.A.
or
USA),
commonly
known

Tokenization rules:
The
United
States
of
America
(
U.S.A.
or
USA
)


In [115]:
# Sentence-level segmentation of the Doc object
# Note: use the "sents" attribute. Doc.sents returns a generator.
#        Each element of the generator is a Span object.
#        A Span object contains Token objects.

for idx, sent in enumerate(list(doc.sents)[:10]):
    print(f"{idx + 1}. {sent}")

print()
print("Span object: ", type(sent))
print("Token object: ", type(sent[0]))

1. The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
2. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
3. At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
4. The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
5. With a population of more than 331 million people, it is the third most populous country in the world.
6. The national capital is Washington, D.C., and the most populous city is New York.


7. Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
8. The United States emerged from the thir

NOTE: 
- A Doc object contains individual tokens (based on tokenization rules), whereas the input text is a sequence of individual characters.
- By default, the Doc object holds word-level tokens of the input.
- Doc, Span, and Token objects each have their own metadata.

## Extract Metadata from a Token Object

In this example we use a Token object.

In [116]:
# Token object properties

sentence1 = list(doc.sents)[0]
print("Main sentence:\n", sentence1.text)
print(type(sentence1))

token1 = sentence1[12]
print(type(token1))

Main sentence:
 The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
<class 'spacy.tokens.span.Span'>
<class 'spacy.tokens.token.Token'>


In [117]:
# Get the text (as a string)
#  Use the "text" attribute.

token1.text

'known'

In [118]:
# Get the token that governs this token, i.e. its head.
#  Returns a Token object
token1.head

States

In [119]:
# Get the leftmost token of this token's syntactic descendants.
#  Return Token object
token1.left_edge

commonly

In [120]:
# Get the rightmost token of this token's syntactic descendants.
#  Return Token object
token1.right_edge

America

In [121]:
# Entity Type
print(sentence1[2].ent_type) # Return integer that corresponds to an entity type.
print(sentence1[2].ent_type_) # Return string name entity type.

384
GPE


Some explanations:
- PERSON: People, Including Fictional.
- NORP: Nationalities or religious or political groups. 
- FAC: Buildings, airports, highways, bridges, etc.
- ORG: Companies, agencies, institutions, etc.
- GPE: Countries, cities, states.
- LOC: Non-GPE locations, mountain ranges, bodies of water.
- PRODUCT: Objects, vehicles, foods, etc. (Not services.)
- EVENT: Named hurricanes, battles, wars, sports events, etc.
- WORK_OF_ART: Titles of books, songs, etc.
- LAW: Named documents made into laws.
- LANGUAGE: Any named language.
- DATE: Absolute or relative dates or periods.
- TIME: Times smaller than a day.
- PERCENT: Percentage, including "%".
- MONEY: Monetary values, including unit.
- QUANTITY: Measurements, as of weight or distance.
- ORDINAL: “first”, “second”, etc.
- CARDINAL: Numerals that do not fall under another type.

In [122]:
# IOB Entity Method --> IOB code of named entity tag.
#   “B” means the token begins an entity, 
#   “I” means it is inside an entity, 
#   “O” means it is outside an entity, 
#   and "" means no entity tag is set.

print(token1, token1.ent_iob_) # IOB code as a string.
print(token1, token1.ent_iob) # IOB code as an integer (3 = B, 1 = I, 2 = O, 0 = no tag set).
print(sentence1[2], sentence1[2].ent_iob_)
print(sentence1[2], sentence1[2].ent_iob)

known O
known 2
States I
States 1


In [123]:
# Lemma --> Get base form of token, with no inflectional suffixes.
print(token1.lemma_)

know


In [124]:
# Morph Analysis
#  Return MorphAnalysis object.

print(token1)
print(token1.morph)

known
Aspect=Perf|Tense=Past|VerbForm=Part


NOTE:
- Aspect describes how the action, event, or state expressed by a verb unfolds over time.
- Aspect=Perf ==> Perfect aspect; indicates that the action is completed.
- Tense=Past ==> Past tense
- VerbForm=Part ==> "Part" stands for participle. Participles are typically used together with auxiliary verbs to form different tenses or aspects.
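
NOTE: The printed morphology is a `Key=Value` list separated by `|`, so it is easy to turn into a dictionary for lookups. The sketch below parses the string form shown above in plain Python. `parse_morph` is a hypothetical helper written for this illustration.

```python
def parse_morph(morph_string):
    """Turn 'Key=Value|Key=Value' into a dict; an empty string gives an empty dict."""
    features = {}
    for pair in morph_string.split("|"):
        if not pair:
            continue  # skip empty pieces, e.g. when the whole string is empty
        key, _, value = pair.partition("=")
        features[key] = value
    return features

features = parse_morph("Aspect=Perf|Tense=Past|VerbForm=Part")
print(features)

# Example check: is this token a past participle?
is_past_participle = (
    features.get("VerbForm") == "Part" and features.get("Tense") == "Past"
)
print("Past participle:", is_past_participle)
```

A check like this could help the ticket example earlier in the notebook, where past forms such as "flown" should not be read as a booking request.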

In [125]:
# Coarse-grained part-of-speech from the Universal POS tag set
print(token1.pos_)

VERB


In [126]:
# Syntactic dependency relation
print(token1.dep_)

acl


In [127]:
# Language of the parent document's vocabulary
print(token1.lang_)

en


In [128]:
# Try another example
text = "Mike enjoys playing football."
doc2 = nlp(text)
print(doc2)

Mike enjoys playing football.


In [129]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj
. PUNCT punct


In [130]:
# Visualize it
from spacy import displacy
# Style as dependency
displacy.render(doc2, style='dep')

In [131]:
# Style based on entities
displacy.render(doc2, style='ent')

In [132]:
# Visualize the entities of the doc built from the imported data
displacy.render(doc, style='ent')

# Word Vectors and spaCy

> Word vectors (or word embeddings) are numerical representations of words in multidimensional space through matrices.

Word similarity:
> Two words are considered similar when they frequently occur in similar contexts. Similar words are sometimes synonyms, but not always.

In [133]:
nlp = spacy.load("en_core_web_md")
# Find location model on local:
# nlp._path

In [134]:
with open("data_spacy/wiki_us.txt", "r") as f:
    text = f.read()

In [135]:
doc = nlp(text)
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


In [136]:
# Example 1 (find the top n words most similar to a given word in the model's vocabulary)
your_word = "country"

ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]),
    n=10
)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
print(words)

['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']


In [137]:
# Example 2
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761


In [138]:
# Example 3
doc3 = nlp("The Empire State Building is in New York.")
print(doc1, "<->", doc3, doc1.similarity(doc3))

I like salty fries and hamburgers. <-> The Empire State Building is in New York. 0.1766669125394067


In [139]:
# Example 4
doc4 = nlp("I enjoy oranges.")
doc5 = nlp("I enjoy apples.")
print(doc4, "<->", doc5, doc4.similarity(doc5))

I enjoy oranges. <-> I enjoy apples. 0.9775700747747101


In [140]:
# Example 6
doc6 = nlp("I enjoy burgers.")
print(doc4, "<->", doc6, doc4.similarity(doc6))

I enjoy oranges. <-> I enjoy burgers. 0.9628306076251026
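
NOTE: One plausible reason why "I enjoy oranges." and "I enjoy burgers." score so high is that, with static word vectors, a document vector defaults to the average of its token vectors, and the two sentences share most of their tokens. The sketch below illustrates this effect in plain Python. The two-dimensional vectors in `toy_vectors` are made up for this example and are unrelated to `en_core_web_md`.

```python
import math

# Hypothetical 2-dimensional word vectors, chosen only for illustration.
toy_vectors = {
    "I": [1.0, 0.0],
    "enjoy": [0.8, 0.6],
    "oranges": [0.0, 1.0],
    "burgers": [0.6, 0.8],
}

def average_vector(words):
    """Element-wise mean of the word vectors."""
    vectors = [toy_vectors[w] for w in words]
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

word_sim = cosine_similarity(toy_vectors["oranges"], toy_vectors["burgers"])
sentence_sim = cosine_similarity(
    average_vector(["I", "enjoy", "oranges"]),
    average_vector(["I", "enjoy", "burgers"]),
)
print(f"oranges vs burgers: {word_sim:.3f}")
print(f"sentence vs sentence: {sentence_sim:.3f}")
```

With these toy values, the shared words "I" and "enjoy" pull the two averages together, so the sentence-level score is higher than the score of the two differing words alone.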


In [141]:
# Example 7
french_fries = doc1[2:4]
burgers = doc1[5]
print(french_fries, "<->", burgers, french_fries.similarity(burgers))

salty fries <-> hamburgers 0.6938489079475403


# Standard Pipeline (Getting Started)

There are two ways to add custom features to a spaCy language model:
1. Rules-based approach
2. Machine learning-based approach

**Pipeline Components:**
- AttributeRuler
- DependencyParser
- EntityLinker
- EntityRecognizer
- EntityRuler
- Lemmatizer
- Morphologizer
- SentenceRecognizer
- Sentencizer
> Sentence tokenization: splits the text into sentences based on rules.
- SpanCategorizer
- Tagger
- TextCategorizer
- Tok2Vec
- Tokenizer
- TrainablePipe
- Transformer

**Matcher:**
- DependencyMatcher
- Matcher
- PhraseMatcher

NOTE: New pipeline components or matchers may be added in later spaCy versions.

In [157]:
# Demonstrate how to add pipes

nlp = spacy.blank("en") # Create nlp blank model
print(type(nlp))
print("Blank Pipeline:\n", nlp.analyze_pipes())

# Add a new pipeline component
nlp.add_pipe("sentencizer")

# Alternative (spaCy v2 style, passing a component instance):
# from spacy.pipeline import Sentencizer
# sentencizer = Sentencizer()
# nlp.add_pipe(sentencizer)  # spaCy v3 expects the string name instead, as above


print("\nAfter Adding Pipeline: \n", nlp.analyze_pipes())

<class 'spacy.lang.en.English'>
Blank Pipeline:
 {'summary': {}, 'problems': {}, 'attrs': {}}

After Adding Pipeline: 
 {'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'], 'requires': [], 'scores': ['sents_f', 'sents_p', 'sents_r'], 'retokenizes': False}}, 'problems': {'sentencizer': []}, 'attrs': {'doc.sents': {'assigns': ['sentencizer'], 'requires': []}, 'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []}}}


In [153]:
# Try the sentencizer pipeline
# NOTE: This example should be split into two sentences.

doc = nlp("This is a sentence. This is another sentence.")

if len(list(doc.sents)) == 2:
    print(True)

for i, sent in enumerate(doc.sents):
    print(f"{i + 1}. {sent}")

True
1. This is a sentence.
2. This is another sentence.
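
NOTE: The idea behind a rule-based sentence splitter can be sketched in a few lines: close the current sentence whenever an end-of-sentence punctuation token appears. The code below is a plain-Python illustration of that rule, not the Sentencizer's implementation. `split_sentences` and its `punct_chars` default are assumptions made for this example.

```python
def split_sentences(tokens, punct_chars=(".", "!", "?")):
    """Group tokens into sentences, closing a sentence after each end mark."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in punct_chars:
            sentences.append(current)
            current = []
    if current:  # trailing tokens without a final end mark
        sentences.append(current)
    return sentences

tokens = ["This", "is", "a", "sentence", ".", "This", "is", "another", "sentence", "."]
for i, sent in enumerate(split_sentences(tokens), start=1):
    print(f"{i}. {' '.join(sent)}")
```

A rule like this needs no trained model, which is why the component can be added to a blank pipeline. The trade-off is that it cannot handle cases where a period does not end a sentence.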


In [None]:
# Using SpaCy's EntityRuler

