___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# NLP Basics Assessment

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [7]:
# Enter your code here:
with open('../TextFiles/owlcreek.txt') as f:
    doc = nlp(f.read())

In [8]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

**2. How many tokens are contained in the file?**

In [9]:
len(doc)

4833

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [12]:
# sent_list = []
# for sent in doc.sents:
#     sent_list.append(sent)

sent_list = [sent for sent in doc.sents]

len(sent_list)

211
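The hint matters because `doc.sents` is a generator, and generators do not support `len()`. Below is a minimal sketch of that behavior using a plain Python generator as a stand-in. The `fake_sents` function is hypothetical and is not part of spaCy.

```python
def fake_sents():
    """Hypothetical stand-in for doc.sents: yields items lazily."""
    yield "First sentence."
    yield "Second sentence."
    yield "Third sentence."

gen = fake_sents()
try:
    len(gen)              # generators have no length
    has_len = True
except TypeError:
    has_len = False

# Materialize the generator into a list before counting.
sentences = list(fake_sents())
count = len(sentences)
print(has_len, count)
```

Building the list once also makes indexing possible, which the later questions rely on.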

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [94]:
print(sent_list[1].text)  # the second sentence (index 1)
print()
print(sent_list[2].text)  # the third sentence, shown for comparison

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

The man's hands were behind
his back, the wrists bound with a cord.  


**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`**<br>
**CHALLENGE: Have values line up in columns in the print output.**

In [34]:
# NORMAL SOLUTION:

for token in sent_list[1]:
    print(token.text, token.pos_, token.dep_, token.lemma_)

A DET det a
man NOUN nsubj man
stood VERB ROOT stand
upon ADP prep upon
a DET det a
railroad NOUN compound railroad
bridge NOUN pobj bridge
in ADP prep in
northern ADJ amod northern
Alabama PROPN pobj alabama
, PUNCT punct ,
looking VERB advcl look
down PART prt down

 SPACE  

into ADP prep into
the DET det the
swift ADJ amod swift
water NOUN pobj water
twenty NUM nummod twenty
feet NOUN npadvmod foot
below ADV advmod below
. PUNCT punct .
  SPACE   


In [32]:
# CHALLENGE SOLUTION:

for token in sent_list[1]:
    print(f'{token.text:{10}} {token.pos_:{10}} {token.dep_:{10}} {token.lemma_:{15}}')

A          DET        det        a              
man        NOUN       nsubj      man            
stood      VERB       ROOT       stand          
upon       ADP        prep       upon           
a          DET        det        a              
railroad   NOUN       compound   railroad       
bridge     NOUN       pobj       bridge         
in         ADP        prep       in             
northern   ADJ        amod       northern       
Alabama    PROPN      pobj       alabama        
,          PUNCT      punct      ,              
looking    VERB       advcl      look           
down       PART       prt        down           

          SPACE                 
              
into       ADP        prep       into           
the        DET        det        the            
swift      ADJ        amod       swift          
water      NOUN       pobj       water          
twenty     NUM        nummod     twenty         
feet       NOUN       npadvmod   foot           
below      ADV        advmod     below          
.          PUNCT      punct      .              
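The column layout comes from the nested `{10}` and `{15}` width fields in the f-string. The sketch below reproduces that on a few hand-copied rows, with no spaCy required, and shows one possible way to keep whitespace tokens from splitting a row: applying `!r` to the text and lemma. The `rows` list is illustrative sample data, and the `!r` variant is a suggestion rather than part of the original solution.

```python
# Sample rows copied from the output above, plus a newline token like the
# one that splits the columns.
rows = [
    ("A", "DET", "det", "a"),
    ("railroad", "NOUN", "compound", "railroad"),
    ("\n", "SPACE", "", "\n"),
]

# Nested braces supply the field width; strings are left-aligned by default.
plain = [f"{text:{10}} {pos:{10}} {dep:{10}} {lemma:{15}}"
         for text, pos, dep, lemma in rows]

# Applying !r to the text and lemma shows whitespace as an escape sequence,
# so a newline token no longer breaks the row across lines.
safe = [f"{text!r:{10}} {pos:{10}} {dep:{10}} {lemma!r:{15}}"
        for text, pos, dep, lemma in rows]

for line in safe:
    print(line)
```

The trade-off is that every value in those two columns is then shown in quotes.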

**Visualize using displacy**

In [36]:
from spacy import displacy

In [47]:
displacy.render(sent_list[5], style='ent', jupyter=True)

In [48]:
options = {'distance': 110, "compact": True, "bg": "#09a3d5",
           "color": "white", "font": "Source Sans Pro"}

In [49]:
displacy.render(sent_list[5], style='dep', jupyter=True, options=options)

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [95]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [96]:
# Create a pattern and add it to matcher:

pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP': '*'}, {'LOWER': 'vigorously'}]

# spaCy v2 call signature; in spaCy v3 this is matcher.add('Swimming', [pattern])
matcher.add('Swimming', None, pattern)

In [97]:
# Create a list of matches called "found_matches" and print the list:

found_matches = matcher(doc)
print(found_matches)

[(12881893835109366681, 1274, 1277), (12881893835109366681, 3607, 3610)]


In [98]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[start:end]                    # get the matched span
    print(match_id, string_id, start, end, span.text, '\n')

12881893835109366681 Swimming 1274 1277 swimming
vigorously 

12881893835109366681 Swimming 3607 3610 swimming
vigorously 
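The whitespace pattern is needed because the source text is hard-wrapped, so a newline sits between the two words in both matches. The same idea can be sketched without spaCy using a regular expression. This is a plain `re` analogue on a made-up sample string, not the `Matcher` API.

```python
import re

# Made-up sample: one occurrence split by a newline, one by a single space.
sample = ("he was now swimming\nvigorously with the current; "
          "swimming vigorously again")

# A literal space between the words misses the line-wrapped occurrence.
naive = re.findall(r"swimming vigorously", sample, flags=re.IGNORECASE)

# Allowing any run of whitespace catches both.
flexible = re.findall(r"swimming\s+vigorously", sample, flags=re.IGNORECASE)

print(len(naive), len(flexible))
```

One difference from the token pattern: `'OP': '*'` allows zero or more whitespace tokens because spaCy does not turn a single space into a token, whereas the regex needs at least one whitespace character, hence `\s+`.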



**7. Print the text surrounding each found match**

In [103]:
def surrounding(doc, start, end):
    # clamp the lower bound so a match near the start cannot produce a negative index
    print(doc[max(start-5, 0): end+5])

In [104]:
surrounding(doc, 1274, 1277)

evade the bullets and, swimming
vigorously, reach the bank,


In [106]:
surrounding(doc, 3607, 3610)

shoulder; he was now swimming
vigorously with the current.  


In [86]:
for match_id, start, end in found_matches:
    string_id = nlp.vocab.strings[match_id]  # get string representation
    span = doc[max(start-7, 0):end+7]        # the match plus up to 7 tokens of context on each side
    print(span.text, '\n')

I could evade the bullets and, swimming
vigorously, reach the bank, take to 

over his shoulder; he was now swimming
vigorously with the current.  His brain 
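When a match is widened by a fixed number of tokens, a match near the start of the text would make the lower bound negative, and Python slicing treats a negative index as counting from the end. Below is a minimal sketch of clamping both bounds, shown on a plain list of words rather than a spaCy `Doc`. The `context_window` helper and its `width` parameter are hypothetical names used only for illustration.

```python
def context_window(tokens, start, end, width=5):
    """Return tokens[start:end] widened by `width` items on each side,
    clamped to the bounds of the sequence."""
    lo = max(start - width, 0)
    hi = min(end + width, len(tokens))
    return tokens[lo:hi]

tokens = "he was now swimming vigorously with the current".split()
start, end = 3, 5   # "swimming vigorously"

unclamped = tokens[start - 5:end + 5]         # start - 5 is -2
clamped = context_window(tokens, start, end)

print(unclamped)
print(clamped)
```

Here the unclamped slice returns only the last two words, while the clamped version returns the whole list.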



**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [87]:
# each sentence span exposes its start and end token indices

print(sent_list[0].start, sent_list[0].end)

0 13


In [91]:
print(doc[1265:1291])
print()
print(doc[3594:3615])

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  


In [92]:
# find the sentence that contains the first match

for sent in sent_list:
    if found_matches[0][1] < sent.end:
        print(sent)
        break

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [93]:
# find the sentence that contains the second match

for sent in sent_list:
    if found_matches[1][1] < sent.end:
        print(sent)
        break

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  
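The two loops above differ only in which match they look up, so the search can be written once and applied to every match. The sketch below works on plain `(start, end)` tuples instead of spaCy spans. The `bounds` list is illustrative: it reuses the ranges `0-13`, `1265-1291`, and `3594-3615` shown above and fills the gaps with coarse placeholder ranges that are not real sentence boundaries. `containing_range` is a hypothetical helper name.

```python
def containing_range(bounds, index):
    """Return the first (start, end) range with start <= index < end."""
    for start, end in bounds:
        if start <= index < end:
            return (start, end)
    return None

# Illustrative ranges; only three of them come from the outputs above.
bounds = [(0, 13), (13, 1265), (1265, 1291), (1291, 3594), (3594, 3615)]

# Start indices of the two matches reported by the matcher.
match_starts = [1274, 3607]

results = [containing_range(bounds, s) for s in match_starts]
print(results)
```

With real data, each tuple would correspond to a sentence's `start` and `end` values, and the returned range could be used to slice the `Doc`.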


### Great Job!