In [1]:
import spacy
from spacy.matcher import Matcher

In [2]:
nlp = spacy.load("en_core_web_md")

In [3]:
matcher = Matcher(nlp.vocab)

In [4]:
pattern = [{"LIKE_EMAIL": True}]
matcher.add("EMAIL_ADDRESS", [pattern])

In [5]:
doc = nlp("This is an email address: abcdefg@gmail.com")

In [6]:
matches = matcher(doc)

In [7]:
# Each match is a tuple: (match ID, start token index, end token index)
print(matches)

[(16571425990740197027, 6, 7)]


In [8]:
print(nlp.vocab[matches[0][0]].text)

EMAIL_ADDRESS
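The match ID is a hash of the pattern name, so it can also be resolved through the vocabulary's string store, and the start and end indices can be used to slice the matched span out of the doc. A minimal sketch, assuming a tokenizer-only `spacy.blank("en")` pipeline is enough here (`LIKE_EMAIL` is a lexical attribute, so no trained model should be needed); the `results` list is only there for illustration:

```python
import spacy
from spacy.matcher import Matcher

# Assumption: a blank English pipeline (tokenizer only) instead of en_core_web_md.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)
matcher.add("EMAIL_ADDRESS", [[{"LIKE_EMAIL": True}]])

doc = nlp("This is an email address: abcdefg@gmail.com")

results = []
for match_id, start, end in matcher(doc):
    label = nlp.vocab.strings[match_id]  # hash -> pattern name
    span = doc[start:end]                # end index is exclusive
    results.append((label, span.text))
    print(label, start, end, span.text)
```

Unpacking each tuple into `match_id, start, end` tends to read more clearly than indexing with `match[0]`, `match[1]`, `match[2]`.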


Now, let's use another text:

In [11]:
with open("../data/wiki_mlk.txt", "r", encoding="utf-8") as f:
    text = f.read()

# wiki_mlk.txt contains the introductory text of the Wikipedia article on
# Martin Luther King Jr. It was created with:
# cd .. 
# cd data
# touch wiki_mlk.txt
# echo "${complete text}" > wiki_mlk.txt

In [12]:
print(text)

Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister, activist, and political philosopher who was one of the most prominent leaders in the civil rights movement from 1955 until his assassination in 1968. A black church leader and a son of early civil rights activist and minister Martin Luther King Sr., King advanced civil rights for people of color in the United States through the use of nonviolent resistance and nonviolent civil disobedience against Jim Crow laws and other forms of legalized discrimination.
King participated in and led marches for the right to vote, desegregation, labor rights, and other civil rights.[1] He oversaw the 1955 Montgomery bus boycott and later became the first president of the Southern Christian Leadership Conference (SCLC). As president of the SCLC, he led the unsuccessful Albany Movement in Albany, Georgia, and helped organize some of the nonviolent 1963 protests in Birmingham, Alabama. King w

Extracting proper nouns:

In [13]:
nlp = spacy.load("en_core_web_md")

In [14]:
matcher = Matcher(nlp.vocab)

In [17]:
pattern = [{"POS": "PROPN"}]
matcher.add("PROPER_NOUN", [pattern])
doc = nlp(text)
matches = matcher(doc)

In [18]:
print(len(matches))

116


In [20]:
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(451313080118390996, 0, 1) Martin
(451313080118390996, 1, 2) Luther
(451313080118390996, 2, 3) King
(451313080118390996, 3, 4) Jr.
(451313080118390996, 6, 7) Michael
(451313080118390996, 7, 8) King
(451313080118390996, 8, 9) Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 22, 23) American


Let's try to capture each run of consecutive proper nouns as a whole, for example "Martin Luther King Jr." as a single match:

In [21]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUN", [pattern])
doc = nlp(text)
matches = matcher(doc)

In [22]:
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(451313080118390996, 0, 1) Martin
(451313080118390996, 0, 2) Martin Luther
(451313080118390996, 1, 2) Luther
(451313080118390996, 0, 3) Martin Luther King
(451313080118390996, 1, 3) Luther King
(451313080118390996, 2, 3) King
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 1, 4) Luther King Jr.
(451313080118390996, 2, 4) King Jr.
(451313080118390996, 3, 4) Jr.
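The ten matches above are exactly the contiguous sub-spans of the four-token run "Martin Luther King Jr.". The following pure-Python sketch reproduces that list to show where the overlap comes from; it only mimics the result and is not how spaCy computes matches internally. The helper name `all_subspans` is made up for this illustration:

```python
def all_subspans(run_start, run_end):
    """Return every contiguous (start, end) slice inside [run_start, run_end)."""
    return [
        (start, end)
        for end in range(run_start + 1, run_end + 1)
        for start in range(run_start, end)
    ]


tokens = ["Martin", "Luther", "King", "Jr."]
spans = all_subspans(0, len(tokens))

for start, end in spans:
    print((start, end), " ".join(tokens[start:end]))
```

A run of n tokens has n * (n + 1) / 2 such sub-spans, which is 10 for n = 4.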


The "OP": "+" operator matches one or more consecutive tokens. By default the matcher returns every possible match, including all the shorter spans inside a longer run, which is what creates the overlap. Let's solve this:

In [23]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)

In [24]:
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(451313080118390996, 66, 71) Martin Luther King Sr.
(451313080118390996, 534, 539) Martin Luther King Jr. Day
(451313080118390996, 589, 594) Martin Luther King Jr. Memorial
(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 143, 147) Southern Christian Leadership Conference
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 241, 244) Civil Rights Act
(451313080118390996, 247, 250) Voting Rights Act
(451313080118390996, 255, 258) Fair Housing Act
(451313080118390996, 317, 320) J. Edgar Hoover
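The effect of `greedy="LONGEST"` can be pictured as a longest-first filter over the overlapping candidates. The sketch below is a plain-Python illustration of that idea, not spaCy's actual implementation; `keep_longest` is a hypothetical helper and the `(start, end)` pairs are sample values:

```python
def keep_longest(spans):
    """Keep the longest spans first and drop any span that overlaps a kept one."""
    # Longest first; ties are broken by the earlier start index.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    taken = set()  # token indices already covered by a kept span
    for start, end in ordered:
        if not any(i in taken for i in range(start, end)):
            kept.append((start, end))
            taken.update(range(start, end))
    return kept


overlapping = [
    (0, 1), (0, 2), (1, 2), (0, 3), (1, 3), (2, 3),
    (0, 4), (1, 4), (2, 4), (3, 4),
    (6, 7), (6, 8), (7, 8),
]
filtered = keep_longest(overlapping)
print(filtered)
```

Because the candidates are processed longest first, the result comes back ordered by length rather than by position, which is consistent with the output above. spaCy also ships `spacy.util.filter_spans`, which applies a similar longest-first rule to `Span` objects.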


Not there yet: the matches no longer overlap, but they come back ordered by length rather than by position. Let's sort them by start index so that they appear in the same order as in the text:

In [26]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(451313080118390996, 0, 4) Martin Luther King Jr.
(451313080118390996, 6, 9) Michael King Jr.
(451313080118390996, 10, 11) January
(451313080118390996, 15, 16) April
(451313080118390996, 22, 24) American Baptist
(451313080118390996, 66, 71) Martin Luther King Sr.
(451313080118390996, 72, 73) King
(451313080118390996, 82, 84) United States
(451313080118390996, 95, 97) Jim Crow
(451313080118390996, 106, 107) King


PROPN followed by a verb:

In [27]:
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(451313080118390996, 106, 108) King participated
(451313080118390996, 317, 321) J. Edgar Hoover considered
(451313080118390996, 362, 364) FBI mailed
(451313080118390996, 389, 391) King won
(451313080118390996, 550, 553) United States beginning
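Since the verb is always the last token of such a match, the span can be split into the proper-noun part and the verb by slicing. A small sketch, assuming a hand-built `Doc` with manually assigned POS tags so that it runs without a trained model (in the notebook these tags come from `en_core_web_md`):

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: blank English vocab; the POS tags below are assigned by hand.
vocab = spacy.blank("en").vocab
doc = Doc(
    vocab,
    words=["J.", "Edgar", "Hoover", "considered"],
    pos=["PROPN", "PROPN", "PROPN", "VERB"],
)

matcher = Matcher(vocab)
pattern = [{"POS": "PROPN", "OP": "+"}, {"POS": "VERB"}]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")

pairs = []
for _, start, end in matcher(doc):
    span = doc[start:end]
    subject = span[:-1].text  # the run of proper nouns
    verb = span[-1].text      # the final token, matched by {"POS": "VERB"}
    pairs.append((subject, verb))
    print(subject, "->", verb)
```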


Switching to another text:

In [28]:
text = "Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'"

In [30]:
print(text)

Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


We are going to try to identify which character is thinking the words inside the quotation marks.

In [35]:
speak_lemmas = ["think", "say"]
matcher = Matcher(nlp.vocab)
pattern = [
    {"ORTH": "'"}, {"IS_ALPHA": True, "OP": "+"},
    {"IS_PUNCT": True, "OP": "*"},
    {"ORTH": "'"}, {"POS": "VERB", "LEMMA": {"IN": speak_lemmas}},
    {"POS": "PROPN", "OP": "+"},
    {"ORTH": "'"}, {"IS_ALPHA": True, "OP": "+"},
    {"IS_PUNCT": True, "OP": "*"},
    {"ORTH": "'"},
]
matcher.add("PROPER_NOUN", [pattern], greedy="LONGEST")
doc = nlp(text)
matches = matcher(doc)
matches.sort(key=lambda x: x[1])
for match in matches[:10]:
    print(match, doc[match[1]:match[2]])

(451313080118390996, 47, 67) 'and what is the use of a book,' thought Alice 'without pictures or conversation?'


Alice is now identified!

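It is important to highlight that this pattern is very specific to the task at hand: identifying who is thinking the quoted words in this one text. It expects a quote, a speech verb, one or more proper nouns, and then a second quote, in exactly that order, so other quoting styles would need their own patterns.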
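To get from the matched span to the speaker's name, one option is to keep only the proper-noun tokens inside the match. The sketch below uses a shortened, hand-annotated version of the sentence so that it runs without a trained model; the tokens, POS tags, lemmas and the pattern name `QUOTE_SPEAKER` are assumptions made for this illustration, not model output:

```python
import spacy
from spacy.matcher import Matcher
from spacy.tokens import Doc

# Assumption: blank English vocab; annotations below are written by hand.
vocab = spacy.blank("en").vocab
words = ["'", "what", "is", "the", "use", ",", "'", "thought", "Alice",
         "'", "without", "pictures", "?", "'"]
pos = ["PUNCT", "PRON", "AUX", "DET", "NOUN", "PUNCT", "PUNCT", "VERB", "PROPN",
       "PUNCT", "ADP", "NOUN", "PUNCT", "PUNCT"]
lemmas = ["'", "what", "be", "the", "use", ",", "'", "think", "Alice",
          "'", "without", "picture", "?", "'"]
doc = Doc(vocab, words=words, pos=pos, lemmas=lemmas)

# The same building blocks as the pattern above, split into reusable parts.
quote = [{"ORTH": "'"}, {"IS_ALPHA": True, "OP": "+"},
         {"IS_PUNCT": True, "OP": "*"}, {"ORTH": "'"}]
said_by = [{"POS": "VERB", "LEMMA": {"IN": ["think", "say"]}},
           {"POS": "PROPN", "OP": "+"}]

matcher = Matcher(vocab)
matcher.add("QUOTE_SPEAKER", [quote + said_by + quote], greedy="LONGEST")

speakers = []
for _, start, end in matcher(doc):
    span = doc[start:end]
    # Keep only the proper nouns inside the matched span.
    speakers.append(" ".join(t.text for t in span if t.pos_ == "PROPN"))
print(speakers)
```

Filtering on `pos_` works here because the quoted parts only allow alphabetic and punctuation tokens and this sample contains no other proper nouns; a quote that itself mentions a name would need a stricter approach.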