In [2]:
import spacy
import numpy as np

In [3]:
# Create the nlp object by loading the installed model
nlp = spacy.load("en_core_web_sm")

In [4]:
type(nlp)

spacy.lang.en.English

# Introduction

### What is Natural Language Processing
- Natural Language Processing is a subfield of artificial intelligence that tries to **process and analyze natural language data.**
  
NOTE: Natural language is a language that developed naturally through use.

The idea: since we already know the semantic and grammatical rules of a human language, we can build applications that programmatically understand utterances in that language.

### How can Computers Understand Language
- Since a computer (or machine) only understands numbers, we need to convert words into numbers. This process is called **word embedding**.
- Word embedding concept: **mapping words to vectors of real numbers, so that the meaning of each word is distributed** across the coordinates of its word vector. NOTE:
    - Two words are similar (from the machine's perspective) if their word vectors are close together.
    - Two words end up close together in the vector space based on **the contextual similarity** of their usage in a large corpus of text.
      
      NOTE: The key factors are (1) co-occurrence in similar contexts and (2) frequency of co-occurrence.
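
As a minimal sketch of the "nearby vectors" idea, the snippet below compares hand-made 3-dimensional vectors with cosine similarity. The vectors and the `cosine_similarity` helper are assumptions made up for illustration; real embeddings are learned from a corpus and usually have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors (not taken from any real model).
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.05, 0.95],
}

sim_close = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_far = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print(f"king vs queen:  {sim_close:.3f}")
print(f"king vs banana: {sim_far:.3f}")
```

A score near 1 means the vectors point in almost the same direction; a score near 0 means they are nearly unrelated.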

# Getting Started with spaCy

In [5]:
# Open file
with open("data_spacy/wiki_us.txt", "r") as f:
    text = f.read()

print(text)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [6]:
# Create doc object
doc = nlp(text)
type(doc)

spacy.tokens.doc.Doc

In [7]:
doc

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j] At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d] The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22] With a population of more than 331 million people, it is the third most populous country in the world. The national capital is Washington, D.C., and the most populous city is New York.

Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century. The United States emerged from the thirteen British colonies est

In [8]:
# Compare the length of the raw text and the Doc object
print("Text length: ", len(text))
print("Doc object length: ", len(doc))

Text length:  3525
Doc object length:  652


In [9]:
# Compare the elements of the text and the Doc object
print("Text element: ")
for token in text[0:10]:
    print(token)

print("\nDoc object element: ")
for token in doc[0:10]:
    print(token)

Text element: 
T
h
e
 
U
n
i
t
e
d

Doc object element: 
The
United
States
of
America
(
U.S.A.
or
USA
)


In [10]:
# Compare str.split() with spaCy's rule-based tokenization
print("Text split")
for token in text.split()[:10]:
    print(token)

print("\nTokenization rules:")
for token in doc[:10]:
    print(token)

Text split
The
United
States
of
America
(U.S.A.
or
USA),
commonly
known

Tokenization rules:
The
United
States
of
America
(
U.S.A.
or
USA
)


In [11]:
# Sentence-level segmentation of the Doc object
# Note: use the "sents" attribute. Doc.sents returns a generator.
#        Each element of the generator is a Span object.
#        A Span object contains Token objects.

for idx, sent in enumerate(list(doc.sents)[:10]):
    print(f"{idx + 1}. {sent}")

print()
print("Span object: ", type(sent))
print("Token object: ", type(sent[0]))

1. The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
2. It consists of 50 states, a federal district, five major unincorporated territories, 326 Indian reservations, and some minor possessions.[j]
3. At 3.8 million square miles (9.8 million square kilometers), it is the world's third- or fourth-largest country by total area.[d]
4. The United States shares significant land borders with Canada to the north and Mexico to the south, as well as limited maritime borders with the Bahamas, Cuba, and Russia.[22]
5. With a population of more than 331 million people, it is the third most populous country in the world.
6. The national capital is Washington, D.C., and the most populous city is New York.


7. Paleo-Indians migrated from Siberia to the North American mainland at least 12,000 years ago, and European colonization began in the 16th century.
8. The United States emerged from the thir

NOTE: 
- Iterating over a Doc object yields individual tokens (based on the tokenization rules), whereas iterating over the input string yields individual characters.
- By default, a Doc object is tokenized at the word level.
- Doc, Span, and Token objects each have their own metadata.
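
To make the difference between whitespace splitting and rule-based tokenization concrete, here is a toy tokenizer built on a single regular expression. The three rules and the `toy_tokenize` name are assumptions for illustration only; spaCy's real tokenizer uses a much richer set of language-specific rules and exceptions.

```python
import re

# Toy rule set (an assumption for illustration, far simpler than spaCy's rules):
#   1. dotted abbreviations such as "U.S.A." stay together,
#   2. runs of word characters form one token,
#   3. every other non-space character is its own token.
TOKEN_PATTERN = re.compile(r"(?:[A-Za-z]\.){2,}|\w+|[^\w\s]")

def toy_tokenize(text):
    """Return the list of tokens found by the toy rules above."""
    return TOKEN_PATTERN.findall(text)

sample = "The United States of America (U.S.A. or USA), commonly known"
print(sample.split())
print(toy_tokenize(sample))
```

Whitespace splitting keeps `(U.S.A.` and `USA),` glued to their punctuation, while the rule-based version separates the brackets and the comma.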

## Extract Metadata from a Token Object

In this example we use a Token object.

In [12]:
# Token object properties

sentence1 = list(doc.sents)[0]
print("Main sentence:\n", sentence1.text)
print(type(sentence1))

token1 = sentence1[12]
print(type(token1))

Main sentence:
 The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.
<class 'spacy.tokens.span.Span'>
<class 'spacy.tokens.token.Token'>


In [13]:
# Get the token text as a string
#  Use the "text" attribute.

token1.text

'known'

In [14]:
# Get the syntactic head (the Token that governs this token).
#  Return Token object
token1.head

States

In [15]:
# Get the leftmost token of this token's syntactic descendants.
#  Return Token object
token1.left_edge

commonly

In [16]:
# Get the rightmost token of this token's syntactic descendants.
#  Return Token object
token1.right_edge

America

In [17]:
# Entity Type
print(sentence1[2].ent_type) # Return integer that corresponds to an entity type.
print(sentence1[2].ent_type_) # Return string name entity type.

384
GPE


Some explanations:
- PERSON: People, Including Fictional.
- NORP: Nationalities or religious or political groups. 
- FAC: Buildings, airports, highways, bridges, etc.
- ORG: Companies, agencies, institutions, etc.
- GPE: Countries, cities, states.
- LOC: Non-GPE locations, mountain ranges, bodies of water.
- PRODUCT: Objects, vehicles, foods, etc. (Not services.)
- EVENT: Named hurricanes, battles, wars, sports events, etc.
- WORK_OF_ART: Titles of books, songs, etc.
- LAW: Named documents made into laws.
- LANGUAGE: Any named language.
- DATE: Absolute or relative dates or periods.
- TIME: Times smaller than a day.
- PERCENT: Percentage, including “%”.
- MONEY: Monetary values, including unit.
- QUANTITY: Measurements, as of weight or distance.
- ORDINAL: “first”, “second”, etc.
- CARDINAL: Numerals that do not fall under another type.

In [18]:
# IOB code of the named entity tag.
#   “B” means the token begins an entity, 
#   “I” means it is inside an entity, 
#   “O” means it is outside an entity, 
#   and "" means no entity tag is set.

print(token1, token1.ent_iob_) # Return the IOB code as a string.
print(token1, token1.ent_iob) # Return the IOB code as an integer (3 = B, 2 = O, 1 = I, 0 = no tag set).
print(sentence1[2], sentence1[2].ent_iob_)
print(sentence1[2], sentence1[2].ent_iob)

known O
known 2
States I
States 1
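
The IOB scheme can be sketched in plain Python by turning entity spans into one tag per token. The tokens, the span, and the `spans_to_iob` helper are hypothetical. Note that this sketch attaches the label to the code (`B-GPE`), a common convention, whereas spaCy keeps the code (`ent_iob_`) and the label (`ent_type_`) in separate attributes.

```python
def spans_to_iob(num_tokens, entity_spans):
    """Return one IOB tag per token.

    entity_spans holds (start, end, label) triples measured in token
    positions, with `end` exclusive.
    """
    tags = ["O"] * num_tokens
    for start, end, label in entity_spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining tokens of the entity
    return tags

# Hypothetical tokens and entity span, chosen for illustration.
tokens = ["The", "United", "States", "is", "a", "country"]
entities = [(0, 3, "GPE")]

iob_tags = spans_to_iob(len(tokens), entities)
for token, tag in zip(tokens, iob_tags):
    print(token, tag)
```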


In [19]:
# Lemma --> Get base form of token, with no inflectional suffixes.
print(token1.lemma_)

know


In [20]:
# Morph Analysis
#  Return MorphAnalysis object.

print(token1)
print(token1.morph)

known
Aspect=Perf|Tense=Past|VerbForm=Part


NOTE:
- Aspect describes how the action, event, or state expressed by a verb unfolds over time.
- Aspect=Perf ==> Perfective aspect, which indicates that the action is completed.
- Tense=Past ==> Past tense.
- VerbForm=Part ==> "Part" stands for participle. Participles are typically used together with auxiliary verbs to form different tenses or aspects.
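
The printed morphology is a `Key=Value` list joined by `|`. As a small sketch, it can be split into a dictionary with plain string operations. The `parse_morph` helper is an assumption for illustration; spaCy's `MorphAnalysis` object also offers its own `to_dict()` method.

```python
def parse_morph(feature_string):
    """Split 'Key=Value|Key=Value' into a dict; an empty string gives {}."""
    if not feature_string:
        return {}
    return dict(pair.split("=", 1) for pair in feature_string.split("|"))

features = parse_morph("Aspect=Perf|Tense=Past|VerbForm=Part")
print(features)
print(features["Tense"])
```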

In [21]:
# Coarse-grained part-of-speech from the Universal POS tag set
print(token1.pos_)

VERB


In [22]:
# Syntactic dependency relation
print(token1.dep_)

acl


In [23]:
# Language of the parent document's vocabulary
print(token1.lang_)

en


In [24]:
# Try another example
text = "Mike enjoys playing football."
doc2 = nlp(text)
print(doc2)

Mike enjoys playing football.


In [25]:
for token in doc2:
    print(token.text, token.pos_, token.dep_)

Mike PROPN nsubj
enjoys VERB ROOT
playing VERB xcomp
football NOUN dobj
. PUNCT punct


In [26]:
# Visualize it
from spacy import displacy
# Style as dependency
displacy.render(doc2, style='dep')

In [27]:
# Style based on entities
displacy.render(doc2, style='ent')

In [28]:
# Visualize the entities in the Doc built from the imported text
displacy.render(doc, style='ent')

# Word Vectors and spaCy

> Word vectors (or word embeddings) are numerical representations of words in a multidimensional space, stored as matrices.

Word similarity:
> Two words count as similar when they frequently occur in the same contexts. A similar word is sometimes a synonym, but not always.

In [29]:
nlp = spacy.load("en_core_web_md")
# Find where the model is stored locally:
# nlp._path

In [30]:
with open("data_spacy/wiki_us.txt", "r") as f:
    text = f.read()

In [31]:
doc = nlp(text)
sentence1 = list(doc.sents)[0]
print(sentence1)

The United States of America (U.S.A. or USA), commonly known as the United States (U.S. or US) or America, is a country primarily located in North America.


In [32]:
# Example 1 (find the n words most similar to a given word in the model's vocabulary)
your_word = "country"

ms = nlp.vocab.vectors.most_similar(
    np.asarray([nlp.vocab.vectors[nlp.vocab.strings[your_word]]]),
    n=10
)
words = [nlp.vocab.strings[w] for w in ms[0][0]]
print(words)

['country—0,467', 'nationâ\x80\x99s', 'countries-', 'continente', 'Carnations', 'pastille', 'бесплатно', 'Argents', 'Tywysogion', 'Teeters']


In [33]:
# Example 2
doc1 = nlp("I like salty fries and hamburgers.")
doc2 = nlp("Fast food tastes very good.")
print(doc1, "<->", doc2, doc1.similarity(doc2))

I like salty fries and hamburgers. <-> Fast food tastes very good. 0.691649353055761


In [34]:
# Example 3
doc3 = nlp("The Empire State Building is in New York.")
print(doc1, "<->", doc3, doc1.similarity(doc3))

I like salty fries and hamburgers. <-> The Empire State Building is in New York. 0.1766669125394067


In [35]:
# Example 4
doc4 = nlp("I enjoy oranges.")
doc5 = nlp("I enjoy apples.")
print(doc4, "<->", doc5, doc4.similarity(doc5))

I enjoy oranges. <-> I enjoy apples. 0.9775700747747101


In [36]:
# Example 6
doc6 = nlp("I enjoy burgers.")
print(doc4, "<->", doc6, doc4.similarity(doc6))

I enjoy oranges. <-> I enjoy burgers. 0.9628306076251026


In [37]:
# Example 7
salty_fries = doc1[2:4]
hamburgers = doc1[5]
print(salty_fries, "<->", hamburgers, salty_fries.similarity(hamburgers))

salty fries <-> hamburgers 0.6938489079475403
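
The high scores for "I enjoy oranges." against "I enjoy burgers." make more sense once we recall that, by default, spaCy represents a Doc by the average of its token vectors and compares those averages with cosine similarity. The sketch below reproduces that idea with hypothetical 2-dimensional vectors; the `toy` table and helper names are assumptions, not values from any model.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-d word vectors; real models use learned, higher-dimensional ones.
toy = {
    "i": [0.5, 0.5], "enjoy": [0.6, 0.4],
    "oranges": [0.9, 0.1], "apples": [0.88, 0.12], "buildings": [0.1, 0.9],
}

def text_vector(words):
    """Average the toy vectors of the given words."""
    return mean_vector([toy[w] for w in words])

fruit_a = text_vector(["i", "enjoy", "oranges"])
fruit_b = text_vector(["i", "enjoy", "apples"])
other = text_vector(["i", "enjoy", "buildings"])

print("oranges vs buildings (words only):", round(cosine(toy["oranges"], toy["buildings"]), 3))
print("sentence with oranges vs apples:  ", round(cosine(fruit_a, fruit_b), 3))
print("sentence with oranges vs buildings:", round(cosine(fruit_a, other), 3))
```

Because the shared words "I enjoy" dominate the average, even sentences about unrelated things score much higher than the unrelated words do on their own.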


# Standard Pipeline (Getting Started)

There are two ways to add custom features to a spaCy language model:
1. Rules-based approach
2. Machine learning-based approach

**Pipeline components:**
- DependencyParser
- EntityLinker
- EntityRecognizer
- EntityRuler
- Lemmatizer
- Morphologizer
- SentenceRecognizer
- Sentencizer
> The Sentencizer performs sentence tokenization: rule-based segmentation of the text into sentences.
- SpanCategorizer
- Tagger
- TextCategorizer
- Tok2Vec
- Tokenizer
- TrainablePipe
- Transformer

**Matcher:**
- DependencyMatcher
- Matcher
- PhraseMatcher

NOTE: New pipeline components or matchers may be added in later spaCy versions.

In [38]:
# Demonstrate how to add pipes

nlp = spacy.blank("en") # Create a blank English pipeline (tokenizer only)
print(type(nlp))
print("Blank Pipeline:\n", nlp.analyze_pipes())

# Add a new pipeline component by its registered name
nlp.add_pipe("sentencizer")

# NOTE: In spaCy v3, add_pipe() expects the registered string name.
# Passing a component object (e.g. Sentencizer()) was the v2 style
# and raises an error in v3.


print("\nAfter Adding Pipeline: \n", nlp.analyze_pipes())

<class 'spacy.lang.en.English'>
Blank Pipeline:
 {'summary': {}, 'problems': {}, 'attrs': {}}

After Adding Pipeline: 
 {'summary': {'sentencizer': {'assigns': ['token.is_sent_start', 'doc.sents'], 'requires': [], 'scores': ['sents_f', 'sents_p', 'sents_r'], 'retokenizes': False}}, 'problems': {'sentencizer': []}, 'attrs': {'doc.sents': {'assigns': ['sentencizer'], 'requires': []}, 'token.is_sent_start': {'assigns': ['sentencizer'], 'requires': []}}}


In [39]:
# Analyze robust pipeline
nlp2 = spacy.load("en_core_web_sm")
print("Robust pipeline: \n", nlp2.analyze_pipes())

Robust pipeline: 
 {'summary': {'tok2vec': {'assigns': ['doc.tensor'], 'requires': [], 'scores': [], 'retokenizes': False}, 'tagger': {'assigns': ['token.tag'], 'requires': [], 'scores': ['tag_acc'], 'retokenizes': False}, 'parser': {'assigns': ['token.dep', 'token.head', 'token.is_sent_start', 'doc.sents'], 'requires': [], 'scores': ['dep_uas', 'dep_las', 'dep_las_per_type', 'sents_p', 'sents_r', 'sents_f'], 'retokenizes': False}, 'attribute_ruler': {'assigns': [], 'requires': [], 'scores': [], 'retokenizes': False}, 'lemmatizer': {'assigns': ['token.lemma'], 'requires': [], 'scores': ['lemma_acc'], 'retokenizes': False}, 'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'], 'requires': [], 'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'], 'retokenizes': False}}, 'problems': {'tok2vec': [], 'tagger': [], 'parser': [], 'attribute_ruler': [], 'lemmatizer': [], 'ner': []}, 'attrs': {'doc.sents': {'assigns': ['parser'], 'requires': []}, 'doc.ents': {'assigns': ['ner

In [40]:
# Try the Sentencizer pipeline
# NOTE: This example should be split into two sentences.

doc = nlp("This is a sentence. This is another sentence.")

if len(list(doc.sents)) == 2:
    print(True)

for i, sent in enumerate(doc.sents):
    print(f"{i + 1}. {sent}")

True
1. This is a sentence.
2. This is another sentence.


# Using spaCy EntityRuler

In [41]:
nlp = spacy.load("en_core_web_sm")

In [42]:
# Using spaCy's EntityRuler

# Suppose we want to add our own rules for named entity recognition

text = "Bali was referenced in Mr. Deeds."
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

Bali PERSON
Deeds PERSON


NOTE:
- Bali should be labeled GPE instead of PERSON. The model fails to recognize it correctly.
- Deeds is labeled PERSON; however, we should extract it as a whole entity: "Mr. Deeds".

In [43]:
# Add an entity ruler; add_pipe returns the EntityRuler component
ruler = nlp.add_pipe("entity_ruler")
print(type(ruler))

<class 'spacy.pipeline.entityruler.EntityRuler'>


In [44]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [45]:
# Try adding pattern
patterns = [
    {'label': 'GPE', 'pattern': 'Bali'}
]

ruler.add_patterns(patterns)

In [46]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'entity_ruler': {'assigns': ['doc.ents', 'token.ent_type', 'token.ent_iob'],
   'requires': [],
   'scores': ['ents_f', 'ent

In [47]:
doc2 = nlp(text)
for ent in doc2.ents:
    print(ent.text, ent.label_)

Bali PERSON
Deeds PERSON


NOTE: 
- By default, the entity ruler does not overwrite entity annotations that an earlier pipeline component has already set.
- Since the 'ner' component runs first, the 'entity_ruler' cannot change the ent.label_ values that 'ner' has already assigned.
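
The ordering effect can be imitated with a toy pipeline in plain Python. This is only a model of the behaviour described above, not spaCy's implementation: the dictionary-based "doc", `toy_ner`, `toy_ruler`, and `run` are all hypothetical names. (In spaCy itself, the `EntityRuler` also has an `overwrite_ents` setting that changes this default.)

```python
def covered(doc, index):
    """True if the token at `index` already belongs to an entity."""
    return any(start <= index < end for start, end, _ in doc["ents"])

def toy_ner(doc):
    """Stand-in for a statistical NER step that labels 'Bali' as PERSON."""
    for i, token in enumerate(doc["tokens"]):
        if token == "Bali" and not covered(doc, i):
            doc["ents"].append((i, i + 1, "PERSON"))
    return doc

def toy_ruler(doc):
    """Stand-in for a rule-based step that labels 'Bali' as GPE."""
    for i, token in enumerate(doc["tokens"]):
        if token == "Bali" and not covered(doc, i):
            doc["ents"].append((i, i + 1, "GPE"))
    return doc

def run(pipeline, tokens):
    """Pass a fresh toy doc through each component in order."""
    doc = {"tokens": list(tokens), "ents": []}
    for component in pipeline:
        doc = component(doc)
    return [(doc["tokens"][start], label) for start, _, label in doc["ents"]]

tokens = ["Bali", "was", "referenced"]
ner_first = run([toy_ner, toy_ruler], tokens)
ruler_first = run([toy_ruler, toy_ner], tokens)
print("NER first:  ", ner_first)
print("Ruler first:", ruler_first)
```

Whichever component runs first claims the token, and the later one leaves it alone.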

In [48]:
# Try placing the entity ruler before ner
nlp2 = spacy.load("en_core_web_sm")

# Add the entity_ruler component
ruler = nlp2.add_pipe('entity_ruler', before='ner')

# Create patterns
patterns = [
    {'label': 'GPE', 'pattern': 'Bali'},
    {'label': 'FILM', 'pattern': 'Mr. Deeds'}
]

ruler.add_patterns(patterns)

doc = nlp2(text)

for ent in doc.ents:
    print(ent.text, ent.label_)

Bali GPE
Mr. Deeds FILM


# How to Use the spaCy Matcher

NOTE: The Matcher is a good approach when the pattern we want to match depends on linguistic features.

In [56]:
# Basic example
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

# Create matcher object
matcher = Matcher(nlp.vocab)
print(type(matcher))

# Create pattern
pattern = [{'LIKE_EMAIL': True}]
# In this example, we search for tokens that look like an email address
matcher.add("EMAIL_ADDRESS", [pattern])
# NOTE: 
#  - Matcher.add(<key>, <patterns>)
#    <key> String label, stored in the nlp vocab
#    <patterns> List of patterns (each pattern is a list of token dicts)

doc = nlp("This is an email address: wmattingly@aol.com")
matches = matcher(doc)
# Returns a list of tuples (<match_id>, <start_token>, <end_token>)
# NOTE:
#  - <match_id> is the hash of the label string; use it to look the label up in nlp.vocab
print("Return matches: ", matches)

# Look up the match label in the nlp vocab
print("Access matches data in nlp vocab: ", nlp.vocab[matches[0][0]].text)

# Slice the Doc with the start and end token positions of the match
start_pos = matches[0][1]
end_pos = matches[0][2]
print("Matches email: ", doc[start_pos:end_pos])

<class 'spacy.matcher.matcher.Matcher'>
Return matches:  [(16571425990740197027, 6, 7)]
Access matches data in nlp vocab:  EMAIL_ADDRESS
Matches email:  wmattingly@aol.com


In [59]:
# Example 2: Suppose we want to extract all proper nouns

with open("data_spacy/wiki_mlk.txt", 'r') as f:
    text = f.read()

print(text)

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr.

King participated in and led marches for blacks' right to vote, desegregation, labor rights, and other basic civil rights.[1] King led the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King helped organize the 1963 March on Washington, where he delivered his famou

In [67]:
for idx, sent in enumerate(doc.sents):
    print(sent, "\n")
    if idx == 10:
        break

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. 

King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. 

He was the son of early civil rights activist and minister Martin Luther King Sr.

King participated in and led marches for blacks' right to vote, desegregation, labor rights, and other basic civil rights.[1] 

King led the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). 

As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. 

King helped organize the 1963 March on Washington, where he delivered

In [61]:
# Load nlp model
nlp = spacy.load("en_core_web_sm")

# Create matcher object
matcher = Matcher(nlp.vocab)
pattern = [{'POS': 'PROPN'}]
matcher.add("PROPER_NOUN", [pattern])

# Create doc object
doc = nlp(text)

matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]: match[2]])

102
(451313080118390996, 0, 1) Martin
(451313080118390996, 1, 2) Luther
(451313080118390996, 2, 3) King
(451313080118390996, 3, 4) Jr.
(451313080118390996, 6, 7) Michael
(451313080118390996, 7, 8) King
(451313080118390996, 8, 9) Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 16, 17) April
(451313080118390996, 24, 25) Baptist


NOTE: The problem is that each proper noun is matched as an individual token, so multi-word names are split apart. We need to fix it!

In [62]:
# Create matcher object
matcher = Matcher(nlp.vocab)
pattern = [{'POS': 'PROPN', 'OP':'+'}] # Match one or more consecutive PROPN tokens
matcher.add("PROPER_NOUN", [pattern])

# Create doc object
doc = nlp(text)

matches = matcher(doc)
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]: match[2]])

175
(451313080118390996, 0, 1) Martin
(451313080118390996, 0, 2) Martin Luther
(451313080118390996, 1, 2) Luther
(451313080118390996, 0, 3) Martin Luther King
(451313080118390996, 1, 3) Luther King
(451313080118390996, 2, 3) King
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 1, 4) Luther King Jr.
(451313080118390996, 2, 4) King Jr.
(451313080118390996, 3, 4) Jr.


NOTE: This does what we want; however, there are overlaps. The matcher returns every possible sub-combination of consecutive proper nouns.

In [68]:
# Create matcher object
matcher = Matcher(nlp.vocab)
pattern = [{'POS': 'PROPN', 'OP':'+'}] # Match one or more consecutive PROPN tokens
matcher.add("PROPER_NOUN", [pattern], greedy='LONGEST') # Keep only the longest match and drop overlapping ones

# Create doc object
doc = nlp(text)

matches = matcher(doc)
# Sort the matches by start position (with greedy='LONGEST' the results are
#  no longer ordered by token index as before)
matches.sort(key=lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]: match[2]])

61
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 16, 17) April
(451313080118390996, 24, 25) Baptist
(451313080118390996, 50, 51) King
(451313080118390996, 70, 72) Mahatma Gandhi
(451313080118390996, 84, 89) Martin Luther King Sr.
(451313080118390996, 90, 91) King
(451313080118390996, 114, 115) King


How does it work?

Since pattern = [{'POS': 'PROPN', 'OP':'+'}], every run of consecutive tokens tagged 'PROPN' is extracted. Any token with a different tag ends the current run, and the next 'PROPN' token starts a new one.

For example

"Martin Luther King Jr. (born Michael King Jr.; ..."

- Martin = 'PROPN'
- Luther = 'PROPN'
- King = 'PROPN'
- Jr. = 'PROPN'
- ( = 'PUNCT' (the run stops here)
- born = 'VERB'
- Michael = 'PROPN' (a new run starts here)
- King = 'PROPN'
- Jr. = 'PROPN'

And so on ...

If we do not set the greedy argument to "LONGEST", we get every combination of consecutive PROPN tokens within each run. If we set the greedy argument to "LONGEST", only the longest combination is kept.
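
The walk-through above can be sketched in plain Python. `all_runs` enumerates every span of consecutive matching tags (what `OP: '+'` yields without `greedy`), and `keep_longest` approximates the LONGEST filter by preferring longer spans and discarding overlaps. Both helpers and the hand-written tag list are assumptions for illustration, not spaCy's actual implementation.

```python
def all_runs(pos_tags, wanted="PROPN"):
    """Every contiguous span made only of `wanted` tags."""
    spans = []
    for start in range(len(pos_tags)):
        end = start
        while end < len(pos_tags) and pos_tags[end] == wanted:
            end += 1
            spans.append((start, end))  # end is exclusive
    return spans

def keep_longest(spans):
    """Prefer longer spans and drop any span overlapping one already kept."""
    kept = []
    for start, end in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(end <= k_start or start >= k_end for k_start, k_end in kept):
            kept.append((start, end))
    return sorted(kept)

# Assumed tags for: Martin Luther King Jr. ( born Michael King Jr.
pos_tags = ["PROPN", "PROPN", "PROPN", "PROPN", "PUNCT", "VERB",
            "PROPN", "PROPN", "PROPN"]

candidates = all_runs(pos_tags)
longest = keep_longest(candidates)
print(len(candidates), "candidate spans")
print(longest)
```

The two surviving spans correspond to "Martin Luther King Jr." and "Michael King Jr." in the matcher output shown earlier.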

In [69]:
# Create matcher object
matcher = Matcher(nlp.vocab)
pattern = [{'POS': 'PROPN', 'OP':'+'}, {"POS": "VERB"}] 
# One or more PROPN tokens followed by a VERB
matcher.add("PROPER_NOUN", [pattern], greedy='LONGEST') # Keep only the longest match and drop overlapping ones

# Create doc object
doc = nlp(text)

matches = matcher(doc)
# Sort the matches by start position (with greedy='LONGEST' the results are
#  no longer ordered by token index as before)
matches.sort(key=lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]: match[2]])

7
(451313080118390996, 50, 52) King advanced
(451313080118390996, 90, 92) King participated
(451313080118390996, 114, 116) King led
(451313080118390996, 168, 170) King helped
(451313080118390996, 248, 253) Director J. Edgar Hoover considered
(451313080118390996, 323, 325) King won
(451313080118390996, 486, 489) United States beginning


In [70]:
# Example
# Suppose we want to extract quoted speech together with the speech verb
#  and the name of the speaker.
import json
with open("data_spacy/alice.json", 'r') as f:
    data = json.load(f)

text = data[0][2][0]
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, `and what is the use of a book,' thought Alice `without pictures or conversation?'


In [71]:
text = text.replace("`", "'")
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


NOTE:

Make sure to standardize the text data. Inconsistencies such as different quotation marks will throw off pattern matching.
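
One way to standardize quotation marks in a single pass is a translation table. The sketch below maps backticks and curly quotes to straight ones; the set of characters covered and the `normalize_quotes` name are assumptions, so extend the table to whatever variants appear in your own data.

```python
# Map common quote variants to a plain apostrophe or double quote.
QUOTE_TABLE = str.maketrans({
    "`": "'",
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})

def normalize_quotes(text):
    """Return `text` with quote variants replaced by straight quotes."""
    return text.translate(QUOTE_TABLE)

raw = "`and what is the use of a book,' thought Alice \u201cwithout pictures?\u201d"
print(normalize_quotes(raw))
```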

In [73]:
speak_lemmas = ['think', 'say']
matcher = Matcher(nlp.vocab)
pattern = [{'ORTH': "'"},
           {'IS_ALPHA': True, "OP": "+"}, # one or more alphabetic tokens
           {'IS_PUNCT': True, "OP": "*"}, # zero or more punctuation tokens (optional)
           {'ORTH': "'"},
           {'POS':"VERB", "LEMMA": {"IN": speak_lemmas}},
           {'POS': 'PROPN', "OP": "+"},
           {'ORTH': "'"},
           {'IS_ALPHA': True, "OP": "+"}, # one or more alphabetic tokens
           {'IS_PUNCT': True, "OP": "*"}, # zero or more punctuation tokens (optional)
           {'ORTH': "'"},
           ]
matcher.add("PROPER_NOUNS", [pattern], greedy='LONGEST')
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])
print(len(matches))
for match in matches[:10]:
    print(match, doc[match[1]: match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


In [77]:
# Does it work in general? Try every text in data[0][2]
# print(data[0][2])
for idx, text in enumerate(data[0][2]):
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    print(f"{idx + 1}. Total matches: ", len(matches))
    matches.sort(key= lambda x: x[1])
    for match in matches[:10]:
        print(match, doc[match[1]:match[2]])

1. Total matches:  1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
2. Total matches:  0
3. Total matches:  0
4. Total matches:  0
5. Total matches:  0
6. Total matches:  0
7. Total matches:  0
8. Total matches:  0
9. Total matches:  0
10. Total matches:  0
11. Total matches:  0
12. Total matches:  0
13. Total matches:  0
14. Total matches:  0
15. Total matches:  0
16. Total matches:  0
17. Total matches:  0


In [78]:
# Try adding multiple patterns
speak_lemmas = ["think", "say"]
text = data[0][2][0].replace( "`", "'")
matcher = Matcher(nlp.vocab)
pattern1 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
pattern2 = [{'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {"POS": "PROPN", "OP": "+"}]
pattern3 = [{"POS": "PROPN", "OP": "+"},{"POS": "VERB", "LEMMA": {"IN": speak_lemmas}}, {'ORTH': "'"}, {'IS_ALPHA': True, "OP": "+"}, {'IS_PUNCT': True, "OP": "*"}, {'ORTH': "'"}]
matcher.add("PROPER_NOUNS", [pattern1, pattern2, pattern3], greedy='LONGEST')
for text in data[0][2]:
    text = text.replace("`", "'")
    doc = nlp(text)
    matches = matcher(doc)
    matches.sort(key = lambda x: x[1])
    print (len(matches))
    for match in matches[:10]:
        print (match, doc[match[1]:match[2]])

1
(3232560085755078826, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'
0
0
0
0
0
1
(3232560085755078826, 0, 6) 'Well!' thought Alice
0
0
0
0
0
0
0
1
(3232560085755078826, 57, 68) 'which certainly was not here before,' said Alice
0
0


# Custom Components in SpaCy

Custom component: a function that modifies the Doc object as it passes through the pipeline.

In [79]:
nlp = spacy.load("en_core_web_sm")
doc = nlp("Britain is a place. Mary is a doctor.")

In [80]:
for ent in doc.ents:
    print(ent.text, ent.label_)

Britain GPE
Mary PERSON


In [81]:
# Suppose that we want to remove all GPE entities from the doc.ents container

from spacy.language import Language

@Language.component("remove_gpe")
def remove_gpe(doc):
    # Keep every entity except those labeled GPE
    doc.ents = [ent for ent in doc.ents if ent.label_ != 'GPE']
    return doc

In [82]:
# Add to nlp pipeline
nlp.add_pipe("remove_gpe")

<function __main__.remove_gpe(doc)>

In [84]:
nlp.analyze_pipes()

{'summary': {'tok2vec': {'assigns': ['doc.tensor'],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'tagger': {'assigns': ['token.tag'],
   'requires': [],
   'scores': ['tag_acc'],
   'retokenizes': False},
  'parser': {'assigns': ['token.dep',
    'token.head',
    'token.is_sent_start',
    'doc.sents'],
   'requires': [],
   'scores': ['dep_uas',
    'dep_las',
    'dep_las_per_type',
    'sents_p',
    'sents_r',
    'sents_f'],
   'retokenizes': False},
  'attribute_ruler': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False},
  'lemmatizer': {'assigns': ['token.lemma'],
   'requires': [],
   'scores': ['lemma_acc'],
   'retokenizes': False},
  'ner': {'assigns': ['doc.ents', 'token.ent_iob', 'token.ent_type'],
   'requires': [],
   'scores': ['ents_f', 'ents_p', 'ents_r', 'ents_per_type'],
   'retokenizes': False},
  'remove_gpe': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'tok2vec': [],
  

In [85]:
doc = nlp("Britain is a place. Mary is a doctor.")
for ent in doc.ents:
    print(ent.text, ent.label_)

Mary PERSON


In [87]:
# Save current model
# nlp.to_disk("data_spacy/new_en_core_web_sm")

NOTE: To use this saved model, the custom component must be defined (registered) in the script that loads it.
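
A minimal sketch of that round trip, assuming a blank English pipeline, a temporary directory instead of `data_spacy/new_en_core_web_sm`, and a made-up no-op component named `noop_sketch`:

```python
import tempfile

import spacy
from spacy.language import Language

# The component must be registered in the script that loads the pipeline.
# Here saving and loading happen in the same script, so one definition is enough.
@Language.component("noop_sketch")
def noop_sketch(doc):
    return doc

nlp_to_save = spacy.blank("en")
nlp_to_save.add_pipe("noop_sketch")

with tempfile.TemporaryDirectory() as tmp_dir:
    nlp_to_save.to_disk(tmp_dir)        # writes config.cfg, vocab, tokenizer, ...
    nlp_reloaded = spacy.load(tmp_dir)  # looks up "noop_sketch" by its registered name

print(nlp_reloaded.pipe_names)
```

Only the component's name is stored in the saved config, which is why a script that loads the pipeline without the `@Language.component` definition is expected to fail.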

# Regex Multi Word Tokens

How to use regex in Named Entity Recognition:
1. Find pattern matches with regex.
2. Create a new Span object for each match.
3. Inject the new Spans into doc.ents.

In [88]:
import re

In [98]:
nlp = spacy.load("en_core_web_sm")

In [113]:
# Analyze pattern

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."
# NOTE: The name "Paul" appears several times in this text. We want to label each occurrence based on its context.

# Extract pattern
pattern = r'Paul [A-Z]\w+'

matches = re.finditer(pattern, text)  # Returns an iterator of re.Match objects
print(type(matches))

for match in matches:
    print(match)
    print("The index location matches in text: ", match.span())

<class 'callable_iterator'>
<re.Match object; span=(0, 11), match='Paul Newman'>
The index location matches in text:  (0, 11)
<re.Match object; span=(39, 53), match='Paul Hollywood'>
The index location matches in text:  (39, 53)


In [115]:
type(doc.char_span(0, 11))

spacy.tokens.span.Span

In [120]:
from spacy.tokens import Span

nlp = spacy.blank("en")

text = "Paul Newman was an American actor, but Paul Hollywood is a British TV Host. The name Paul is quite common."
    
# Define doc object
doc = nlp(text)
original_ents = list(doc.ents)

# Define pattern
pattern = r"Paul [A-Z]\w+"

print("Before: ")
if len(doc.ents) > 0:
    for ent in doc.ents:
        print(ent.text, ent.label_)
else:
    print("None")

# Define storage
mwt_ents = []
# 1. Find pattern matches
for match in re.finditer(pattern, text):
    # Character offsets of the match in the text
    start, end = match.span()
    # 2. Create a new Span from the character offsets
    # char_span(start, end) takes character positions (not token positions)
    # and returns None if they do not align with token boundaries
    span = doc.char_span(start, end)
    if span is not None:
        # Store the token-based start and end positions and the span text
        mwt_ents.append((span.start, span.end, span.text))

print(mwt_ents)

# 3. Inject the new Spans into doc.ents
for ent in mwt_ents:
    start, end, name = ent
    per_ent = Span(doc, start, end, label="PERSON")
    original_ents.append(per_ent)

doc.ents = original_ents

print("After: ")
for ent in doc.ents:
    print(ent.text, ent.label_)

Before: 
None
[(0, 2, 'Paul Newman'), (8, 10, 'Paul Hollywood')]
After: 
Paul Newman PERSON
Paul Hollywood PERSON
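
The `if span is not None` check matters because `doc.char_span` returns `None` when the regex match does not line up with token boundaries. The sketch below illustrates this with a made-up sentence and pattern; it also uses the `label` and `alignment_mode` arguments of `char_span`, which the cell above does not use.

```python
import re

import spacy

nlp_blank = spacy.blank("en")
doc_align = nlp_blank("New Yorkers love New York.")

results = []
for match in re.finditer(r"New York", doc_align.text):
    start, end = match.span()
    # Default (strict) alignment: None unless both offsets sit on token boundaries
    strict = doc_align.char_span(start, end, label="GPE")
    # "expand" widens the span to cover every token the characters touch
    expanded = doc_align.char_span(start, end, label="GPE", alignment_mode="expand")
    results.append((start, end, strict, expanded))
    print((start, end), strict, expanded)
```

The first match ends inside the token "Yorkers", so the strict call gives `None` while the expanded call covers "New Yorkers"; the second match aligns exactly with the tokens "New York". Whether to skip or expand such matches depends on the task.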


In [121]:
# Put all of this into a custom component

from spacy.language import Language

pattern = r"Paul [A-Z]\w+"
@Language.component("paul_ner")
def paul_ner(doc):
    original_ents = list(doc.ents)
    mwt_ents = []
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))
    for ent in mwt_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label='PERSON')
        original_ents.append(per_ent)
    doc.ents = original_ents
    return (doc)

In [122]:
nlp2 = spacy.blank('en')
nlp2.add_pipe("paul_ner")

<function __main__.paul_ner(doc)>

In [123]:
nlp2.analyze_pipes()

{'summary': {'paul_ner': {'assigns': [],
   'requires': [],
   'scores': [],
   'retokenizes': False}},
 'problems': {'paul_ner': []},
 'attrs': {}}
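
The empty `'assigns': []` entry reflects that the component was registered without metadata. As a hedged sketch, `Language.component` accepts an `assigns` argument that declares what the component sets, and the pipeline analysis can then report it. The component name `ents_passthrough_sketch` is made up for illustration.

```python
import spacy
from spacy.language import Language

# Hypothetical component that declares it writes to doc.ents
@Language.component("ents_passthrough_sketch", assigns=["doc.ents"])
def ents_passthrough_sketch(doc):
    doc.ents = list(doc.ents)  # re-assigns the existing entities unchanged
    return doc

nlp_meta = spacy.blank("en")
nlp_meta.add_pipe("ents_passthrough_sketch")

# The declared metadata is stored with the component
meta = nlp_meta.get_pipe_meta("ents_passthrough_sketch")
print(list(meta.assigns))
```

This metadata is descriptive only; it does not change what the component does to the doc.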

In [124]:
doc2 = nlp2(text)
print(doc2.ents)

(Paul Newman, Paul Hollywood)


NOTE: Overlapping Span objects cause an error when they are assigned to doc.ents.

For example, suppose we have two Spans, (1) "Paul Hollywood" and (2) "Hollywood", whose token positions overlap. Injecting both into doc.ents raises an error.
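
A minimal sketch of that failure, assuming a blank English pipeline, a shortened sentence, and an illustrative `CINEMA` label for the second span:

```python
import spacy
from spacy.tokens import Span

nlp_overlap = spacy.blank("en")
doc_overlap = nlp_overlap("Paul Hollywood is a British TV host.")

person = Span(doc_overlap, 0, 2, label="PERSON")  # "Paul Hollywood"
cinema = Span(doc_overlap, 1, 2, label="CINEMA")  # "Hollywood", overlaps token 1

try:
    doc_overlap.ents = [person, cinema]
    overlap_error = None
except ValueError as err:
    # A token may belong to at most one entity span
    overlap_error = err

print(type(overlap_error).__name__)
```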

In [128]:
# Demonstrate the error from this example
from spacy.util import filter_spans

pattern = r"Hollywood"
@Language.component("cinema_ner")
def cinema_ner(doc):
    original_ents = list(doc.ents)
    mwt_ents = []
    for match in re.finditer(pattern, doc.text):
        start, end = match.span()
        span = doc.char_span(start, end)
        if span is not None:
            mwt_ents.append((span.start, span.end, span.text))
    for ent in mwt_ents:
        start, end, name = ent
        per_ent = Span(doc, start, end, label='CINEMA')
        original_ents.append(per_ent)
    # NOTE: Comment out the line below to reproduce the overlap error described above.
    original_ents = filter_spans(original_ents)
    doc.ents = original_ents
    return (doc)

In [129]:
nlp3 = spacy.load('en_core_web_sm')
nlp3.add_pipe("cinema_ner")

<function __main__.cinema_ner(doc)>

In [130]:
doc3 = nlp3(text)
print(doc3.ents)

(Paul Newman, American, Paul Hollywood, British, Paul)


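NOTE: The filter_spans function removes duplicate and overlapping spans. When spans overlap, it keeps the (first) longest one, which is why "Paul Hollywood" survives and the standalone "Hollywood" span is dropped.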
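
The behaviour of `filter_spans` can be checked on its own with a small sketch. It assumes a blank English pipeline, a shortened sentence, and an illustrative `CINEMA` label for the shorter span.

```python
import spacy
from spacy.tokens import Span
from spacy.util import filter_spans

nlp_filter = spacy.blank("en")
doc_filter = nlp_filter("Paul Hollywood is a British TV host.")

person = Span(doc_filter, 0, 2, label="PERSON")  # "Paul Hollywood"
cinema = Span(doc_filter, 1, 2, label="CINEMA")  # "Hollywood", overlaps with person

# The order of the input does not matter: the longest overlapping span is kept
kept = filter_spans([cinema, person])
doc_filter.ents = kept  # safe to assign now that overlaps are removed

print([(span.text, span.label_) for span in doc_filter.ents])
```

Because the longer span wins, a short match such as "Hollywood" is dropped whenever it sits inside an entity that was already found.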