# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [2]:
# Enter your code here:

with open('../TextFiles/owlcreek.txt') as f:
    doc = nlp(f.read())

In [3]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [4]:
len(doc)

4833

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [5]:
sentences = list(doc.sents)

In [6]:
len(sentences)

211
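
The hint matters because `doc.sents` is a generator, and generators have no length. Here is a minimal sketch of the same behavior in plain Python. The function `fake_sents()` is a hypothetical stand-in for `doc.sents` and is not part of spaCy:

```python
def fake_sents():
    """Hypothetical stand-in for doc.sents: yields items lazily, one at a time."""
    yield "AN OCCURRENCE AT OWL CREEK BRIDGE"
    yield "A man stood upon a railroad bridge in northern Alabama."

gen = fake_sents()

try:
    len(gen)                # generators do not support len()
    has_len = True
except TypeError:
    has_len = False

as_list = list(gen)         # materializing the generator gives a list
count = len(as_list)        # a list can be counted...
second = as_list[1]         # ...and indexed
```

Once the sentences are in a list, both `len()` and indexing work, which is what the next two questions rely on.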

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [7]:
sentences[1]

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`<br>
CHALLENGE: Have values line up in columns in the print output.**

In [8]:
sent_two = sentences[1]

for token in sent_two:
    print(token.text, token.pos_, token.dep_, token.lemma_)

A DET det a
man NOUN nsubj man
stood VERB ROOT stand
upon ADP prep upon
a DET det a
railroad NOUN compound railroad
bridge NOUN pobj bridge
in ADP prep in
northern ADJ amod northern
Alabama PROPN pobj alabama
, PUNCT punct ,
looking VERB advcl look
down PART prt down

 SPACE  

into ADP prep into
the DET det the
swift ADJ amod swift
water NOUN pobj water
twenty NUM nummod twenty
feet NOUN npadvmod foot
below ADV advmod below
. PUNCT punct .
  SPACE   


In [9]:
# CHALLENGE SOLUTION:

sent_two = sentences[1]

for token in sent_two:
    # Pad each attribute to 10 characters so the columns line up
    print(f"{token.text:10} {token.pos_:10} {token.dep_:10} {token.lemma_:10}")

A          DET        det        a         
man        NOUN       nsubj      man       
stood      VERB       ROOT       stand     
upon       ADP        prep       upon      
a          DET        det        a         
railroad   NOUN       compound   railroad  
bridge     NOUN       pobj       bridge    
in         ADP        prep       in        
northern   ADJ        amod       northern  
Alabama    PROPN      pobj       alabama   
,          PUNCT      punct      ,         
looking    VERB       advcl      look      
down       PART       prt        down      

          SPACE                 
         
into       ADP        prep       into      
the        DET        det        the       
swift      ADJ        amod       swift     
water      NOUN       pobj       water     
twenty     NUM        nummod     twenty    
feet       NOUN       npadvmod   foot      
below      ADV        advmod     below     
.          PUNCT      punct      .         
           SPACE                
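
The fixed width of 10 works here because no value in this sentence is longer than ten characters. As a sketch of one alternative, the widths could be derived from the data instead. The rows below are copied from the output above. The names `rows`, `widths`, and `format_row` are illustrative and not part of spaCy:

```python
# A few (text, pos, dep, lemma) rows taken from the output above.
rows = [
    ("A", "DET", "det", "a"),
    ("man", "NOUN", "nsubj", "man"),
    ("stood", "VERB", "ROOT", "stand"),
    ("railroad", "NOUN", "compound", "railroad"),
]

# Width of each column = length of its longest value.
widths = [max(len(value) for value in column) for column in zip(*rows)]

def format_row(row):
    """Left-justify every value to its column width, separated by two spaces."""
    return "  ".join(f"{value:<{width}}" for value, width in zip(row, widths))

lines = [format_row(row) for row in rows]
for line in lines:
    print(line)
```

The nested `{width}` inside the format spec is what lets the padding come from a variable rather than a hard-coded number.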

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [10]:
# Import the Matcher library:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [11]:
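# Create a pattern and add it to matcher:

# Match "swimming", one whitespace token (here a line break), then "vigorously"
pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True}, {'LOWER': 'vigorously'}]

# spaCy v2 signature; in spaCy v3 this becomes matcher.add('Swimming', [pattern])
matcher.add('Swimming', None, pattern)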

In [12]:
# Create a list of matches called "found_matches" and print the list:

found_matches = matcher(doc)
print(found_matches)

[(12881893835109366681, 1274, 1277), (12881893835109366681, 3607, 3610)]


In [13]:
doc[1274:1277]

swimming
vigorously

In [14]:
doc[3607:3610]

swimming
vigorously

**7. Print the text surrounding each found match**

In [15]:
for match_id, start, end in found_matches:
    span = doc[start-10:end+12]   # the match plus 10 tokens before and 12 after
    print(span.text)
    print('\n')

 By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away


saw all this over his shoulder; he was now swimming
vigorously with the current.  His brain was as energetic as his
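
The slice `doc[start-10:end+12]` assumes that at least ten tokens precede the match. Below is a small sketch of a bounds-safe window, using a plain list of strings in place of a `Doc`. The helper name `context_window` is hypothetical, and spaCy's own slicing behavior is not shown here:

```python
def context_window(tokens, start, end, before=10, after=12):
    """Return the tokens around [start, end), clamped to the sequence bounds."""
    lo = max(start - before, 0)          # a negative index would count from the end
    hi = min(end + after, len(tokens))   # slicing tolerates overshoot; explicit for clarity
    return tokens[lo:hi]

tokens = ["He", "was", "now", "swimming", "\n", "vigorously",
          "with", "the", "current", "."]

# The match is tokens[3:6]; only three tokens precede it.
naive = tokens[3 - 10:6 + 12]            # start is -7, which wraps around to index 3
safe = context_window(tokens, 3, 6)      # start is clamped to 0
```

With the naive slice the leading context is silently lost, while the clamped version returns everything that is available.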




**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [16]:
for sent in sentences:
    # The first sentence that ends after the match start contains the match
    if found_matches[0][1] < sent.end:
        print(sent)
        break

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [17]:
for sent in sentences:
    # Same lookup for the second match
    if found_matches[1][1] < sent.end:
        print(sent)
        break

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
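
The two cells above differ only in the match index, so the lookup can be written once and applied to every match. Here is a sketch in plain Python, with sentences represented as `(start, end)` token offsets. The boundary values are made up for illustration and are not taken from the story; the match offsets are the ones printed earlier. The name `containing_sentence` is hypothetical:

```python
from bisect import bisect_right

# Hypothetical sentence boundaries as (start, end) token offsets, end exclusive.
sentence_bounds = [(0, 8), (8, 1260), (1260, 1290),
                   (1290, 3595), (3595, 3620), (3620, 4833)]

# Match start offsets from found_matches above.
match_starts = [1274, 3607]

def containing_sentence(bounds, token_index):
    """Return the (start, end) pair whose range contains token_index."""
    starts = [start for start, _ in bounds]
    # Index of the last sentence that starts at or before the token.
    position = bisect_right(starts, token_index) - 1
    start, end = bounds[position]
    if not (start <= token_index < end):
        raise ValueError(f"token {token_index} is outside every sentence")
    return start, end

results = [containing_sentence(sentence_bounds, s) for s in match_starts]
```

In spaCy itself, `doc[start:end].sent` should return the containing sentence directly when sentence boundaries have been set, which avoids the manual search.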
