In [1]:
# Import spaCy and load the model
import spacy
nlp = spacy.load('en_core_web_sm')

In [2]:
# Import the Matcher class. The matcher is bound to the current Vocab object,
# and named sets of patterns can be added to it or removed from it as needed.
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

# Creating patterns
In literature, the phrase 'solar power' might appear as one word or two, with or without a hyphen. In this section we'll develop a matcher named 'SolarPower' that finds all three:

In [3]:
# pattern1 matches 'solarpower'
pattern1 = [{'LOWER': 'solarpower'}]
# pattern2 matches 'solar power'. Single spaces are not tokens, so no rule is needed for the space.
pattern2 = [{'LOWER': 'solar'}, {'LOWER': 'power'}]
# pattern3 matches 'solar-power' (the middle token can be any punctuation token)
pattern3 = [{'LOWER': 'solar'}, {'IS_PUNCT': True}, {'LOWER': 'power'}]

In [4]:
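# Add the patterns to the matcher under the name 'SolarPower'. The second argument is an
# optional callback; None means no callback (more on callbacks later).
# Note: this is the spaCy v2 call signature. spaCy v3 expects the patterns as a list:
# matcher.add('SolarPower', [pattern1, pattern2, pattern3])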
matcher.add('SolarPower', None, pattern1, pattern2, pattern3)

In [5]:
#Create a doc
doc = nlp(u'The Solar Power industry continues to grow as demand \
for solarpower increases. Solar-power cars are gaining popularity.')

In [8]:
type(doc)

spacy.tokens.doc.Doc

In [9]:
found_matches = matcher(doc)
print(found_matches)

[(8656102463236116519, 1, 3), (8656102463236116519, 10, 11), (8656102463236116519, 13, 16)]


`matcher` returns a list of tuples. Each tuple holds the match ID followed by the start and end token indices, which map to the span `doc[start:end]`.

In [10]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 3 Solar Power
8656102463236116519 SolarPower 10 11 solarpower
8656102463236116519 SolarPower 13 16 Solar-power


The `match_id` is simply the hash value of the string ID 'SolarPower', which is why `nlp.vocab.strings[match_id]` recovers the name.

# Setting pattern options and quantifiers
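You can make a token rule optional or repeatable by adding an `'OP'` key to it. With `'OP':'*'` the rule may match zero or more tokens, so a single pattern covers 'solar power', 'solar-power', and forms with other punctuation between the words, such as 'solar--power'. This lets us streamline our patterns list.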

In [11]:
# Redefine the patterns:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]

In [12]:
# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

In [13]:
# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2)

In [14]:
doc1 = nlp(u'The Solar--Power industry continues to grow as demand.')

In [17]:
found_matches1 = matcher(doc1)
print(found_matches1)

[(8656102463236116519, 1, 4)]


In [16]:
for token in doc1:
    print(token)

The
Solar
--
Power
industry
continues
to
grow
as
demand
.


In [19]:
for match_id, start, end in found_matches1:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc1[start:end]
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 1 4 Solar--Power


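The pattern with `'OP':'*'` matched 'Solar--Power'. Because the punctuation rule may also match zero times, the same pattern covers the two-word form 'solar power' without a hyphen.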

The following quantifiers can be passed to the `'OP'` key:
<table><tr><th>OP</th><th>Description</th></tr>

<tr ><td><span >\!</span></td><td>Negate the pattern, by requiring it to match exactly 0 times</td></tr>
<tr ><td><span >?</span></td><td>Make the pattern optional, by allowing it to match 0 or 1 times</td></tr>
<tr ><td><span >\+</span></td><td>Require the pattern to match 1 or more times</td></tr>
<tr ><td><span >\*</span></td><td>Allow the pattern to match zero or more times</td></tr>
</table>
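
To make the quantifier semantics concrete, here is a minimal pure-Python sketch of how an `'OP'` value changes the number of tokens a rule may consume. It is not spaCy's implementation: the `match_at` helper and the `(test, op)` rule format are hypothetical, `'1'` stands for a rule with no `'OP'` key, the `'!'` operator is left out, and only the longest match from a given start position is returned.

```python
import string

def is_punct(tok):
    # Rough stand-in for IS_PUNCT: every character is ASCII punctuation.
    return all(ch in string.punctuation for ch in tok)

def lower_is(word):
    # Stand-in for {'LOWER': word}.
    return lambda tok: tok.lower() == word

def match_at(tokens, rules, i):
    """Return the end index of the longest match of `rules` starting at
    tokens[i], or None if the rules do not match there."""
    if not rules:
        return i
    (test, op), rest = rules[0], rules[1:]
    ok = i < len(tokens) and test(tokens[i])
    if op == '1':                      # exactly once (no 'OP' key)
        return match_at(tokens, rest, i + 1) if ok else None
    if op == '+':                      # once, then behave like '*'
        return match_at(tokens, [(test, '*')] + rest, i + 1) if ok else None
    if ok:                             # '?' or '*': try consuming a token first
        next_rules = rules if op == '*' else rest
        end = match_at(tokens, next_rules, i + 1)
        if end is not None:
            return end
    return match_at(tokens, rest, i)   # '?' or '*': fall back to zero tokens

tokens = ['The', 'Solar', '--', 'Power', 'industry']
star = [(lower_is('solar'), '1'), (is_punct, '*'), (lower_is('power'), '1')]
plus = [(lower_is('solar'), '1'), (is_punct, '+'), (lower_is('power'), '1')]

print(match_at(tokens, star, 1))              # 4, the same span as (1, 4) above
print(match_at(['solar', 'power'], star, 0))  # 2: '*' allows zero punctuation tokens
print(match_at(['solar', 'power'], plus, 0))  # None: '+' needs at least one
```

The last two calls show the practical difference between `'*'` and `'+'`: only `'*'` (or `'?'`) lets the pattern fall back to the plain two-word form.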


# Be careful with lemmas!
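If we wanted to match on both 'solar power' and 'solar powered', it might be tempting to match on the lemma 'power' and expect it to cover 'powered' as well. This is not always the case! The lemma depends on the part-of-speech tag the model assigns, so when 'powered' is tagged as an adjective its lemma can remain 'powered'. Results can also differ between model versions: in the run shown below both occurrences happened to match, but that behavior should not be relied on.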

In [29]:
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LEMMA': 'power'}]  # match on the lemma instead of the lowercase text

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2)

In [30]:
doc2 = nlp(u'Solar-powered energy runs solar-powered cars.')

In [31]:
found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]


In [32]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc2[start:end]
    print(match_id, string_id, start, end, span.text)

8656102463236116519 SolarPower 0 3 Solar-powered
8656102463236116519 SolarPower 5 8 solar-powered


In [33]:
# A more robust alternative: list the surface forms explicitly instead of relying on lemmas.
pattern1 = [{'LOWER': 'solarpower'}]
pattern2 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'power'}]
pattern3 = [{'LOWER': 'solarpowered'}]
pattern4 = [{'LOWER': 'solar'}, {'IS_PUNCT': True, 'OP':'*'}, {'LOWER': 'powered'}]

# Remove the old patterns to avoid duplication:
matcher.remove('SolarPower')

# Add the new set of patterns to the 'SolarPower' matcher:
matcher.add('SolarPower', None, pattern1, pattern2, pattern3, pattern4)

In [34]:
found_matches = matcher(doc2)
print(found_matches)

[(8656102463236116519, 0, 3), (8656102463236116519, 5, 8)]


## Other token attributes
Besides lemmas, there are a variety of token attributes we can use to determine matching rules:
<table><tr><th>Attribute</th><th>Description</th></tr>

<tr ><td><span >`ORTH`</span></td><td>The exact verbatim text of a token</td></tr>
<tr ><td><span >`LOWER`</span></td><td>The lowercase form of the token text</td></tr>
<tr ><td><span >`LENGTH`</span></td><td>The length of the token text</td></tr>
<tr ><td><span >`IS_ALPHA`, `IS_ASCII`, `IS_DIGIT`</span></td><td>Token text consists of alphabetic characters, ASCII characters, digits</td></tr>
<tr ><td><span >`IS_LOWER`, `IS_UPPER`, `IS_TITLE`</span></td><td>Token text is in lowercase, uppercase, titlecase</td></tr>
<tr ><td><span >`IS_PUNCT`, `IS_SPACE`, `IS_STOP`</span></td><td>Token is punctuation, whitespace, stop word</td></tr>
<tr ><td><span >`LIKE_NUM`, `LIKE_URL`, `LIKE_EMAIL`</span></td><td>Token text resembles a number, URL, email</td></tr>
<tr ><td><span >`POS`, `TAG`, `DEP`, `LEMMA`, `SHAPE`</span></td><td>The token's simple and extended part-of-speech tag, dependency label, lemma, shape</td></tr>
<tr ><td><span >`ENT_TYPE`</span></td><td>The token's entity label</td></tr>

</table>
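
For the purely lexical flags, it can help to think of them as string tests on the token text. The sketch below approximates a few of them with built-in `str` methods. `lexical_flags` is a hypothetical helper, not a spaCy function, and spaCy's own definitions may differ in edge cases. Attributes such as `POS`, `TAG`, `DEP`, `LEMMA`, and `ENT_TYPE` come from the pipeline's predictions and cannot be computed this way.

```python
def lexical_flags(text):
    """Approximate a few lexical token attributes with plain str methods."""
    return {
        'ORTH': text,                                   # exact verbatim text
        'LOWER': text.lower(),                          # lowercase form
        'LENGTH': len(text),
        'IS_ALPHA': text.isalpha(),
        'IS_ASCII': all(ord(ch) < 128 for ch in text),
        'IS_DIGIT': text.isdigit(),
        'IS_LOWER': text.islower(),
        'IS_UPPER': text.isupper(),
        'IS_TITLE': text.istitle(),
    }

for word in ['Solar', 'GDP', '1980s', '997']:
    flags = lexical_flags(word)
    # Show only the boolean flags that are set for this word.
    true_flags = [name for name, value in flags.items() if value is True]
    print(word, true_flags)
```

This also shows why the patterns above use `LOWER` rather than `ORTH`: 'Solar' and 'solar' have different verbatim text but the same lowercase form.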

## PhraseMatcher
In the above section we used token patterns to perform rule-based matching. An alternative, and often more efficient, method is to match on terminology lists. In this case we convert each phrase in the list to a Doc object with `nlp`, and add those Doc objects to a `PhraseMatcher` as patterns.

In [39]:
# Perform standard imports, reset nlp
import spacy
nlp = spacy.load('en_core_web_sm')

In [40]:
# Import the PhraseMatcher library
from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab)

In [56]:
f = open('../Datafiles/reaganomics.txt', encoding='windows-1252')

In [57]:
f

<_io.TextIOWrapper name='../Datafiles/reaganomics.txt' mode='r' encoding='windows-1252'>

In [58]:
doc3 = nlp(f.read())
f.close()  # release the file handle once the text has been read

In [59]:
type(doc3)

spacy.tokens.doc.Doc

In [60]:
for token in doc3[:20]:  # print only the first 20 tokens; the full document runs to thousands
    print(token)

REAGANOMICS


https://en.wikipedia.org/wiki/Reaganomics



Reaganomics
(
a
portmanteau
of
[
Ronald
]
Reagan
and
economics
attributed
to
Paul
Harvey)[1
]

In [61]:
# First, create a list of match phrases:
phrase_list = ['voodoo economics', 'supply-side economics', 'trickle-down economics', 'free-market economics']

# Next, convert each phrase to a Doc object:
phrase_patterns = [nlp(text) for text in phrase_list]

# Pass each Doc object into matcher (note the use of the asterisk!):
matcher.add('VoodooEconomics', None, *phrase_patterns)

# Build a list of matches:
matches = matcher(doc3)

In [62]:
# (match_id, start, end)
matches

[(3473369816841043438, 41, 45),
 (3473369816841043438, 49, 53),
 (3473369816841043438, 54, 56),
 (3473369816841043438, 61, 65),
 (3473369816841043438, 673, 677),
 (3473369816841043438, 2985, 2989)]

In [65]:
doc3[:70]

REAGANOMICS
https://en.wikipedia.org/wiki/Reaganomics

Reaganomics (a portmanteau of [Ronald] Reagan and economics attributed to Paul Harvey)[1] refers to the economic policies promoted by U.S. President Ronald Reagan during the 1980s. These policies are commonly associated with supply-side economics, referred to as trickle-down economics or voodoo economics by political opponents, and free-market economics by political advocates.
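
The match tuples above have different widths: 'voodoo economics' spans two tokens (54 to 56), while the hyphenated phrases span four, because the hyphen is a token of its own. The naive scan below illustrates that result format on a hand-tokenized excerpt. `find_phrases` is a hypothetical helper written for this illustration; `PhraseMatcher` itself is implemented differently and is far more efficient on long term lists.

```python
def find_phrases(tokens, phrases):
    """Yield (start, end) token offsets where any phrase occurs in tokens."""
    for start in range(len(tokens)):
        for phrase in phrases:
            end = start + len(phrase)
            if tokens[start:end] == phrase:
                yield start, end

# Hand-tokenized for illustration; note that each hyphen is its own token.
tokens = ['supply', '-', 'side', 'economics', ',', 'referred', 'to', 'as',
          'trickle', '-', 'down', 'economics', 'or', 'voodoo', 'economics']
phrases = [['voodoo', 'economics'],
           ['supply', '-', 'side', 'economics'],
           ['trickle', '-', 'down', 'economics']]

for start, end in find_phrases(tokens, phrases):
    print(start, end, ' '.join(tokens[start:end]))
```

The excerpt starts at the token 'supply', which is `doc3[41]`, so adding 41 to each offset reproduces the first three match tuples reported by the matcher.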


## Viewing Matches
There are a few ways to fetch the text surrounding a match. The simplest is to grab a slice of tokens from the doc that is wider than the match:

In [66]:
doc3[665:685]  # Note that the fifth match starts at doc3[673]

same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian

In [67]:
doc3[2975:2995]  # The sixth match starts at doc3[2985]

against institutions.[66] His policies became widely known as "trickle-down economics", due to the significant
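
Rather than working out the slice bounds by hand for every match, the widening can be wrapped in a small helper. This is a minimal sketch: `context_window` and its `width` parameter are hypothetical names, and the example runs on a plain list of words. Since it relies only on `len()` and slicing, it should behave the same way on a Doc.

```python
def context_window(tokens, start, end, width=8):
    """Return tokens[start:end] widened by `width` tokens on each side,
    clamped to the bounds of the sequence."""
    left = max(0, start - width)            # avoid a negative slice start
    right = min(len(tokens), end + width)
    return tokens[left:right]

words = ('His policies became widely known as trickle - down economics '
         'due to the cuts').split()

# The match 'trickle - down economics' sits at words[6:10].
print(' '.join(context_window(words, 6, 10, width=3)))
```

The clamp on the left side matters for matches near the beginning of the text: a negative slice start would count from the end of the sequence instead of stopping at the first token.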

Another way is to use the Doc's sentence segmentation, exposed as `doc3.sents`, and iterate through the sentences until we reach the match point:

In [68]:
# Build a list of sentences
sents = list(doc3.sents)

# In the next section we'll see that sentences contain start and end token values:
print(sents[0].start, sents[0].end)

0 35


In [69]:
# Iterate over the sentence list until the sentence end value exceeds a match start value:
for sent in sents:
    if matches[4][1] < sent.end:  # this is the fifth match, that starts at doc3[673]
        print(sent)
        break

At the same time he attracted a following from the supply-side economics movement, which formed in opposition to Keynesian demand-stimulus economics.
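
The loop above scans the sentences from the start for each lookup. When there are many matches, a binary search over the sentence end offsets finds the right sentence directly. The sketch below assumes the sentences are contiguous and in document order. `sentence_index` is a hypothetical helper, and apart from the first value (35, the end of the first sentence printed above) the offsets are made up for illustration.

```python
from bisect import bisect_right

def sentence_index(sent_ends, match_start):
    """Return the index of the sentence that contains token `match_start`.

    `sent_ends` is the sorted list of exclusive end offsets, one per sentence.
    """
    # Count how many sentences end at or before the match start.
    return bisect_right(sent_ends, match_start)

# End offsets for three consecutive sentences (70 and 112 are hypothetical).
sent_ends = [35, 70, 112]

print(sentence_index(sent_ends, 41))  # a match starting at token 41
```

With the objects from the cells above, the end offsets could be collected once as `[sent.end for sent in sents]`, and `sents[sentence_index(ends, start)]` would then give the sentence for each match start. A start offset at or beyond the last end returns `len(sent_ends)`, so the caller should check for that case.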
