___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [2]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [41]:
# Enter your code here:
with open('../TextFiles/owlcreek.txt') as f:
    doc = nlp(f.read())

In [4]:
doc.to_array(200).reshape(1611,3)

array([[0, 0, 0],
       [0, 0, 0],
       [0, 0, 0],
       ...,
       [0, 0, 0],
       [0, 0, 0],
       [0, 0, 0]], dtype=uint64)

In [5]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [6]:
tokens = [t for t in doc]
len(tokens)

4833

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [7]:
sents = [sent for sent in doc.sents]
print(len(sents))

222


**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [8]:
sentence = str(sents[2])

In [9]:
for t in nlp(sentence):
    print(t, t.pos_, t.dep_, t.lemma_)

A DET det a
man NOUN nsubj man
stood VERB ROOT stand
upon ADP prep upon
a DET det a
railroad NOUN compound railroad
bridge NOUN pobj bridge
in ADP prep in
northern ADJ amod northern
Alabama PROPN pobj Alabama
, PUNCT punct ,
looking VERB advcl look
down ADV advmod down

 SPACE  

into ADP prep into
the DET det the
swift ADJ amod swift
water NOUN pobj water
twenty NUM nummod twenty
feet NOUN npadvmod foot
below ADV advmod below
. PUNCT punct .
  SPACE   


** 5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`<br>
CHALLENGE: Have values line up in columns in the print output.**

In [7]:
# NORMAL SOLUTION:


A DET det a
man NOUN nsubj man
stood VERB ROOT stand
upon ADP prep upon
a DET det a
railroad NOUN compound railroad
bridge NOUN pobj bridge
in ADP prep in
northern ADJ amod northern
Alabama PROPN pobj alabama
, PUNCT punct ,
looking VERB advcl look
down PART prt down

 SPACE  

into ADP prep into
the DET det the
swift ADJ amod swift
water NOUN pobj water
twenty NUM nummod twenty
feet NOUN npadvmod foot
below ADV advmod below
. PUNCT punct .
  SPACE   


In [11]:
# CHALLENGE SOLUTION:

for t in nlp(sentence):
    print(f' {t.text:{20}} {t.dep_:{10}} {t.pos_:{10}} {t.lemma_}')

 A                    det        DET        a
 man                  nsubj      NOUN       man
 stood                ROOT       VERB       stand
 upon                 prep       ADP        upon
 a                    det        DET        a
 railroad             compound   NOUN       railroad
 bridge               pobj       NOUN       bridge
 in                   prep       ADP        in
 northern             amod       ADJ        northern
 Alabama              pobj       PROPN      Alabama
 ,                    punct      PUNCT      ,
 looking              advcl      VERB       look
 down                 advmod     ADV        down
 
                               SPACE      

 into                 prep       ADP        into
 the                  det        DET        the
 swift                amod       ADJ        swift
 water                pobj       NOUN       water
 twenty               nummod     NUM        twenty
 feet                 npadvmod   NOUN       foot
 below            

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [9]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [50]:
# Create a pattern and add it to matcher:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
pattern1 = [{'lower':'swimming'},{'IS_SPACE':True},{'lower':'vigorously'}]
matcher.add('Swimming vigorously', None, pattern1)
found_matches = matcher(doc)
for match_id, start, end in found_matches:
    span = doc[start:end]
    print(f'{match_id}, {span.text:{10}}, {start}, {end}')

8654061974911851908, swimming
vigorously, 1274, 1277
8654061974911851908, swimming
vigorously, 3607, 3610


In [40]:
# matcher is just like a regex
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

doc = nlp('A red brown fox was running across the field and was lazy enough to sleep')

# find "across the field"
pattern1 = [{'lower':'across'},{'IS_SPACE':True},{'lower':'the'}, {'lower':'field'}]
# find "a fox"
pattern2 = [{'lower':'a'}, {'lower':'fox'}]

# find "field"
# find "sleep"
matcher.add('AcrossTheFieldMatcher', None, pattern1)
found_matches = matcher(doc)
print(found_matches)
for match_id, start, end in found_matches:
    print(doc[start-4:end+4])


[]


In [60]:

match_id, start, end = found_matches[0]
print(doc[start-10:end+15])

 By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [66]:
match_id, start, end = found_matches[1]
print(doc[start-7:end+5])

over his shoulder; he was now swimming
vigorously with the current.  


**7. Print the text surrounding each found match**

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home


over his shoulder; he was now swimming
vigorously with the current.  


**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [69]:
match_id, start, end = found_matches[1]
span=doc[start:end]
print(span.sent)

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  


In [18]:

match_id, start, end = found_matches[]
span=doc[start:end]
print(span.sent)



By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  


### Great Job!