<a href="https://colab.research.google.com/github/drpetros11111/NLP_Portilia/blob/NLP_Spacy_Basics_1/07_NLP_Basics_Assessment_Solution.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

___

<a href='http://www.pieriandata.com'> <img src='../Pierian_Data_Logo.png' /></a>
___

# NLP Basics Assessment - Solutions

For this assessment we'll be using the short story [_An Occurrence at Owl Creek Bridge_](https://en.wikipedia.org/wiki/An_Occurrence_at_Owl_Creek_Bridge) by Ambrose Bierce (1890). <br>The story is in the public domain; the text file was obtained from [Project Gutenberg](https://www.gutenberg.org/ebooks/375.txt.utf-8).

In [1]:
# RUN THIS CELL to perform standard imports:
import spacy
nlp = spacy.load('en_core_web_sm')

**1. Create a Doc object from the file `owlcreek.txt`**<br>
> HINT: Use `with open('../TextFiles/owlcreek.txt') as f:`

In [2]:
# Enter your code here:

with open('/content/owlcreek.txt') as f:
    doc = nlp(f.read())

In [3]:
# Run this cell to verify it worked:

doc[:36]

AN OCCURRENCE AT OWL CREEK BRIDGE

by Ambrose Bierce

I

A man stood upon a railroad bridge in northern Alabama, looking down
into the swift water twenty feet below.  

# Slicing a document
Comment (# Run this cell to verify it worked:):

This is a comment in Python. Everything after the # symbol is ignored by the Python interpreter.

Here the comment tells you that running the cell lets you check that the previous cell created doc as expected.

--------------------------
## doc[:36]:

This is a slice operation on the doc variable.

The : inside the square brackets indicates a slice, which extracts a portion of a sequence.

Specifically, :36 means "slice from the beginning up to (but not including) index 36."

Because doc is a spaCy Doc, its elements are tokens (words, punctuation marks, and whitespace tokens such as line breaks), not characters.

So doc[:36] returns a Span holding the first 36 tokens, which is why the output above shows the title, the author line, and the opening sentence rather than just 36 characters.

The same syntax on a plain string would return its first 36 characters, and on a list its first 36 elements.

In summary, running this line displays the first 36 tokens of doc, confirming that the file was read and processed correctly.
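
To see concretely that slicing a Doc works on tokens, here is a minimal sketch. It assumes spaCy v3 is installed and uses spacy.blank("en"), a tokenizer-only pipeline, so no model download is needed. The sample sentence is only illustrative.

```python
import spacy

# Tokenizer-only English pipeline (assumption: spaCy v3 is installed).
nlp = spacy.blank("en")
doc = nlp("A man stood upon a railroad bridge in northern Alabama.")

first_three = doc[:3]              # slice by token position, not by character
print(type(first_three).__name__)  # Span
print(first_three.text)            # A man stood
print(doc.text[:3])                # 'A m' (slicing the raw string is character-based)
```

The slice of the Doc is a Span of three tokens, while the slice of doc.text is a three-character string.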

**2. How many tokens are contained in the file?**

In [None]:
len(doc)

4833

# Explaining len
The expression len(doc) returns the length of the variable doc.

--------------------------
Here's what happens:

## len() function:
This is a built-in Python function that returns the number of items in an object.

The object can be a string, list, tuple, dictionary, or any other type that defines a length.

For a string, len() returns the number of characters.

For a list, it returns the number of elements.

For a spaCy Doc, it returns the number of tokens.

-------------------------
# Example:

    doc = "Hello, world!"
    print(len(doc))  # Output: 13 (number of characters including punctuation and spaces)

For a list:

    doc = [1, 2, 3, 4, 5]
    print(len(doc))  # Output: 5 (number of elements in the list)

Here doc is a spaCy Doc, so len(doc) returns the number of tokens in the story: 4833, counting punctuation and whitespace tokens as well as words.
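
The difference between counting characters and counting tokens can be sketched as follows. This assumes spaCy v3 and again uses the tokenizer-only spacy.blank("en") pipeline rather than the full model loaded in the notebook.

```python
import spacy

nlp = spacy.blank("en")  # tokenizer only; assumes spaCy v3 is installed
text = "Hello, world!"
doc = nlp(text)

print(len(text))                      # 13 characters
print(len(doc))                       # 4 tokens
print([token.text for token in doc])  # ['Hello', ',', 'world', '!']
```

The comma and the exclamation mark are split off as tokens of their own, so the token count is not the same as a whitespace-separated word count.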

**3. How many sentences are contained in the file?**<br>HINT: You'll want to build a list first!

In [4]:
sents = [sent for sent in doc.sents]
len(sents)

204

# Creating a List of sentences
This code snippet performs two operations:

--------------------------
## List Comprehension (sents = [sent for sent in doc.sents]):

It iterates over doc.sents and builds a list, sents, in which each item is one sentence from doc.

doc is a spaCy Doc, and doc.sents is a generator that yields the sentences one at a time. A generator has no length, which is why the hint says to build a list first.

Each sent is a Span covering one sentence of the document.

## len(sents):

Once the list exists, len(sents) returns the number of sentences in doc.

--------------------------
# Example using spaCy:

    import spacy

    # Load the English model in spaCy
    nlp = spacy.load("en_core_web_sm")

    # Create a doc object by processing some text
    doc = nlp("This is the first sentence. This is the second sentence.")

    # Extract sentences from the doc
    sents = [sent for sent in doc.sents]

    # Print the number of sentences
    print(len(sents))  # Output: 2

In this example, len(sents) returns 2 because the text contains two sentences.

For the story, len(sents) returns 204.
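
The reason for building a list first can be checked directly. The sketch below assumes spaCy v3 and swaps in the rule-based sentencizer on a blank pipeline so that it runs without a downloaded model. The notebook itself relies on the sentence boundaries from en_core_web_sm, which may split real text differently.

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # punctuation-based sentence splitting (spaCy v3 API)
doc = nlp("This is the first sentence. This is the second sentence.")

# doc.sents is a generator, so it cannot be passed to len() directly.
try:
    len(doc.sents)
    has_len = True
except TypeError:
    has_len = False
print(has_len)  # False

sents = list(doc.sents)  # same result as the list comprehension
print(len(sents))        # 2
print(sents[1].text)     # This is the second sentence.
```

list(doc.sents) and [sent for sent in doc.sents] are interchangeable here. Either one gives an object that supports len() and indexing.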

**4. Print the second sentence in the document**<br> HINT: Indexing starts at zero, and the title counts as the first sentence.

In [5]:
print(sents[1].text)

The man's hands were behind
his back, the wrists bound with a cord.  


**5. For each token in the sentence above, print its `text`, `POS` tag, `dep` tag and `lemma`<br>
CHALLENGE: Have values line up in columns in the print output.**

In [6]:
# NORMAL SOLUTION:
for token in sents[1]:
    print(token.text, token.pos_, token.dep_, token.lemma_)

The DET det the
man NOUN poss man
's PART case 's
hands NOUN nsubj hand
were AUX ROOT be
behind ADP prep behind

 SPACE dep 

his PRON poss his
back NOUN pobj back
, PUNCT punct ,
the DET det the
wrists NOUN appos wrist
bound VERB acl bind
with ADP prep with
a DET det a
cord NOUN pobj cord
. PUNCT punct .
  SPACE dep  


# Print different types of tokens
This code snippet iterates over the tokens in the second sentence (sents[1]) and prints four attributes of each token: text, pos_, dep_, and lemma_.

--------------------------------
Let's break down the key concepts:

## for token in sents[1]:

This iterates over each token in the second sentence of the sents list.

sents[1] is a Span, which behaves as an iterable collection of tokens.

The index 1 refers to the second sentence (since Python uses zero-based indexing).

----------------------------------------------------
## Token Attributes:

### token.text:

This is the raw text of the token (the actual word or punctuation in the sentence).

-----------------------------------
### token.pos_:

This is the coarse-grained part-of-speech tag of the token (e.g., NOUN, VERB, ADJ).

The trailing underscore (_) gives the string version of the tag; token.pos without the underscore gives its integer ID.

-------------------------------
### token.dep_:

This is the syntactic dependency label, which describes the token's relationship to its head token in the sentence (e.g., subject, object).

----------------------------
### token.lemma_:

The lemma is the base or dictionary form of the token (e.g., the lemma of "running" is "run").

These annotations are assigned by the spaCy pipeline loaded at the top of the notebook (en_core_web_sm).

------------------------------------------
# Example using spaCy:

    import spacy

    # Load the English model
    nlp = spacy.load("en_core_web_sm")

    # Process some text to create the doc object
    doc = nlp("This is the first sentence. This is the second sentence.")

    # Extract the sentences
    sents = [sent for sent in doc.sents]

    # Iterate through tokens in the second sentence and print their attributes
    for token in sents[1]:
        print(token.text, token.pos_, token.dep_, token.lemma_)

----------------
# Output Example:
For the second sentence, "This is the second sentence.", the output might look like:

    This DET nsubj this
    is AUX ROOT be
    the DET det the
    second ADJ amod second
    sentence NOUN attr sentence
    . PUNCT punct .

## DET (Determiner):

Part-of-speech tag for words like "this" or "the".

## AUX (Auxiliary):

Part-of-speech tag for auxiliary verbs like "is".

## ROOT:

The dependency label of the root of the parse tree, usually the main verb of the sentence.

## nsubj:

The nominal subject of the verb.

nsubj, short for nominal subject, is a syntactic dependency label that describes the grammatical relationship between a subject and its verb.

## What is a nominal subject?
A nominal subject is the noun, pronoun, or noun phrase that performs the action or is the focus of the verb.

It tells us who or what is doing the action of the verb.

In dependency parsing (e.g., in spaCy), the nsubj label is applied to the head token of the subject, and that token is attached to the verb it is the subject of (often the root of the sentence).

## Example:
Consider the sentence:

    "The cat sits on the mat."

Here is the dependency breakdown:

"The cat" is the nominal subject because it is the noun phrase that performs the action (sitting).

"sits" is the verb.

In a dependency parse, the word "cat" would be labeled nsubj to indicate that it is the nominal subject of the verb "sits."

## Visual Breakdown:
cat → nsubj → sits

## Another example:

"She is reading a book."

"She" is the nominal subject because "she" is doing the action (reading).

"is reading" is the verb phrase.

In this case, "She" is tagged nsubj in the dependency parse because it is the subject of the verb "reading."

In short, nsubj labels the subject of a verb, usually a noun or pronoun.

## lemma_:

The base form of the token (e.g., "is" -> "be").

Each token's attributes provide rich linguistic information about the sentence structure.
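
When a tag or label is unfamiliar, spaCy's built-in glossary can be queried with spacy.explain. The sketch below needs no model. The list of labels is just the ones discussed in this notebook.

```python
import spacy

# spacy.explain looks a label up in spaCy's glossary and returns a short
# description string (or None for labels it does not know).
labels = ["DET", "AUX", "nsubj", "acl"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(f"{label:<6} {description}")
```

Inside a loop over tokens, the same call can be applied to token.pos_ or token.dep_ to print a description next to each label.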

In [7]:
# CHALLENGE SOLUTION:
for token in sents[1]:
    print(f'{token.text:{15}} {token.pos_:{5}} {token.dep_:{10}} {token.lemma_:{15}}')

The             DET   det        the            
man             NOUN  poss       man            
's              PART  case       's             
hands           NOUN  nsubj      hand           
were            AUX   ROOT       be             
behind          ADP   prep       behind         

               SPACE dep        
              
his             PRON  poss       his            
back            NOUN  pobj       back           
,               PUNCT punct      ,              
the             DET   det        the            
wrists          NOUN  appos      wrist          
bound           VERB  acl        bind           
with            ADP   prep       with           
a               DET   det        a              
cord            NOUN  pobj       cord           
.               PUNCT punct      .              
                SPACE dep                       
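
The column alignment in the challenge solution comes from the width in each f-string format specifier. The sketch below uses plain Python only. The rows are copied by hand from the output above and stand in for the token attributes.

```python
# Rows copied from the output above: (text, pos, dep, lemma).
rows = [
    ("hands", "NOUN", "nsubj", "hand"),
    ("were", "AUX", "ROOT", "be"),
    ("bound", "VERB", "acl", "bind"),
]

width = 15
for text, pos, dep, lemma in rows:
    # {text:{width}} and {text:15} are equivalent: pad to a minimum width of 15.
    # Strings are left-aligned by default; '<' only makes that explicit.
    line = f"{text:<{width}} {pos:<5} {dep:<10} {lemma:<15}"
    print(line)
```

A width is a minimum, not a maximum: a longer value is printed in full and pushes the rest of its row to the right. Whitespace tokens such as line breaks are printed literally, which is why the rows for SPACE tokens look broken in the output above.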


# Explaining the line: bound           VERB  acl        bind
This line shows the linguistic properties that spaCy assigned to the word "bound" in the sentence.

---------------------------------------
Here’s what each part of the line means:

## 1. bound (the word itself)
This is the actual word (or token) being analyzed, in this case, "bound."

## 2. VERB (part-of-speech tag)
VERB is the part of speech assigned to the word "bound," indicating that it is being used as a verb in this context.

The word "bound" can also function as an adjective (e.g., "She is homeward bound") or a noun (e.g., "a single bound"), but here it is identified as a verb, the past participle of "bind" (as in "He bound the book").

## 3. acl (dependency label: clausal modifier of noun)
acl stands for clausal modifier of noun, also called an adjectival clause. Relative clauses get their own label, relcl.

It is a syntactic dependency label indicating that the verb "bound" heads a clause that describes a noun.

An adjectival clause (or adjective clause) functions like an adjective, giving more information about a noun.

In the sentence above, "bound with a cord" describes "the wrists."

--------------------------------
# Example:

"The book bound in leather is expensive."

In this sentence, "bound in leather" is an adjectival clause modifying the noun "book."

"bound" is the verb in that clause, so a parser would be expected to give it the acl label.

## 4. bind (lemma)
The lemma is the base or dictionary form of a word. The lemma of "bound" is "bind."

Lemmatization maps each inflected form back to that base form (e.g., "running" → "run", "bound" → "bind"), which helps in recognizing the same word regardless of its inflection.

# Putting it all together:
The word "bound" is identified as a verb (VERB).

It heads an adjectival clause (acl), meaning it modifies a noun by adding information about it.

Its lemma is "bind", which is the base form of the word "bound."

## Example Sentence:
In a sentence like:

"The document bound by the lawyer was important."

Here:

"bound" is the verb (part of an adjectival clause modifying "document").

The dependency acl indicates that "bound by the lawyer" describes the noun "document."

The lemma for "bound" is "bind."
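
A dependency label always links a token to its head, so "bound" being acl means its head is the noun it describes. The sketch below illustrates this with a hand-annotated Doc. The heads and labels are written by hand to mirror the output above (an assumption for illustration, not model output), and it assumes the spaCy v3 Doc constructor.

```python
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")  # only the vocab is needed here

# Hand-written annotations for the phrase, mirroring the parse shown above.
words = ["the", "wrists", "bound", "with", "a", "cord"]
heads = [1, 1, 1, 2, 5, 3]  # absolute index of each token's head (root points to itself)
deps = ["det", "ROOT", "acl", "prep", "det", "pobj"]
doc = Doc(nlp.vocab, words=words, heads=heads, deps=deps)

bound = doc[2]
print(bound.dep_, "->", bound.head.text)         # acl -> wrists
print([token.text for token in bound.subtree])   # ['bound', 'with', 'a', 'cord']
```

With the real model, token.head and token.subtree are available in the same way on the tokens of sents[1].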

**6. Write a matcher called 'Swimming' that finds both occurrences of the phrase "swimming vigorously" in the text**<br>
HINT: You should include an `'IS_SPACE': True` pattern between the two words!

In [8]:
# Import the Matcher library:

from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

In [10]:
# Create a pattern and add it to matcher:

pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP':'*'}, {'LOWER': 'vigorously'}]

matcher.add('Swimming', [pattern]) # Removing the None argument and enclosing the pattern in a list

# Creating a pattern with spaCy
This code snippet uses spaCy's Matcher, a tool for finding token patterns in a text.

-------------------------
Let's break down each part to understand what's happening:

## 1. Creating a pattern:
    pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP':'*'}, {'LOWER': 'vigorously'}]

This defines a pattern that is matched against sequences of tokens. Each dictionary in the list describes one token. Here's how this pattern works:

    {'LOWER': 'swimming'}:

This looks for a token whose lowercase form is "swimming".

Because LOWER compares the lowercased token text, the match is case-insensitive: it matches "Swimming", "swimming", "SWIMMING", etc.

    {'IS_SPACE': True, 'OP': '*'}:

This matches any whitespace tokens between "swimming" and "vigorously."

## IS_SPACE
IS_SPACE: True means the token must consist of whitespace. The single space that normally separates two words is not a token; whitespace tokens come from line breaks or extra spaces. In the story, both occurrences of the phrase are split across a line break, which is why the hint asks for this pattern.

## OP: '*'
OP: '*' is an operator that means "zero or more" occurrences of this token pattern.

So the pattern matches whether there are no whitespace tokens, one, or several between the two words.

    {'LOWER': 'vigorously'}:

This looks for a token with the lowercase form "vigorously", again case-insensitive.

---------------------------
## 2. Adding the pattern to the matcher:

    matcher.add('Swimming', [pattern])

### matcher.add():
This adds the pattern to the matcher object in spaCy.

'Swimming' is the label for the pattern.

It gives the pattern a name; the hash of this name is returned as the match ID when a match is found.

The second argument is a list of patterns.

Since there is just one pattern here, it is wrapped in a list ([pattern]).

## Explanation of Change:
## Removing the None argument:

In spaCy v2 the call was matcher.add('Swimming', None, pattern): the second argument was an optional on_match callback (None here), and the patterns followed as separate arguments. In spaCy v3 the patterns are passed as a list in the second argument, and on_match became an optional keyword argument.

Dropping None brings the code in line with the v3 syntax.

## Enclosing the pattern in a list:

The matcher expects the second argument to be a list of patterns, even if you're adding just one pattern.

So, the pattern needs to be enclosed in square brackets.

----------------------
# Summary
The pattern is designed to match the phrase "swimming vigorously" with any number of whitespace tokens (including none) between the two words.

After defining the pattern, it's added to the matcher with the label 'Swimming'.

----------------
## Example Usage:
Here’s how this might work with a spaCy matcher:

    import spacy
    from spacy.matcher import Matcher

    # Load the English model
    nlp = spacy.load('en_core_web_sm')

    # Create the matcher
    matcher = Matcher(nlp.vocab)

    # Define the pattern
    pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP': '*'}, {'LOWER': 'vigorously'}]

    # Add the pattern to the matcher
    matcher.add('Swimming', [pattern])

    # Sample text
    doc = nlp("She was swimming   vigorously in the pool.")

    # Apply the matcher to the doc
    matches = matcher(doc)

    # Print the matches
    for match_id, start, end in matches:
        matched_span = doc[start:end]  # The matched span
        print(matched_span.text)  # Output: "swimming   vigorously"

In this example:

The matcher identifies and extracts the phrase "swimming vigorously" from the text, regardless of the number of spaces between the words. The printed text keeps the original spacing.
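
The story splits the phrase across a line break rather than extra spaces, and a line break is also a whitespace token. The sketch below compares a pattern with and without the IS_SPACE entry. It assumes spaCy v3 and uses a blank pipeline, which is enough because LOWER and IS_SPACE are lexical attributes that need no trained model. The sample text is adapted from the story.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only
doc = nlp("he was now swimming\nvigorously with the current")
print([token.text for token in doc])  # the line break appears as its own token

with_space = Matcher(nlp.vocab)
with_space.add("Swimming", [[{"LOWER": "swimming"},
                             {"IS_SPACE": True, "OP": "*"},
                             {"LOWER": "vigorously"}]])

without_space = Matcher(nlp.vocab)
without_space.add("Swimming", [[{"LOWER": "swimming"},
                                {"LOWER": "vigorously"}]])

print(len(with_space(doc)))     # 1
print(len(without_space(doc)))  # 0
```

Without the optional whitespace entry, the two words are not adjacent tokens, so the shorter pattern finds nothing.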

In [11]:
# Create a list of matches called "found_matches" and print the list:

found_matches = matcher(doc)
print(found_matches)

[(12881893835109366681, 1274, 1277), (12881893835109366681, 3609, 3612)]


# Applying the spaCy matcher to a document (doc)
This code snippet applies the spaCy matcher to the document (doc), stores the results in a list called found_matches, and prints it.

--------------------------------------
Let’s break down how this works:

## 1. found_matches = matcher(doc):
This line applies the matcher to the document doc.

Calling matcher(doc) searches the document for the patterns previously added to the matcher (here, the pattern for "swimming vigorously").

It returns a list of matches. Each match is a tuple containing three elements:

### match_id:
An integer ID for the matched pattern. It is the hash of the string label ('Swimming') used when adding the pattern; nlp.vocab.strings[match_id] converts it back to the string.

### start:
The index of the first token of the matched span in the document.

### end:

The index of the token just after the matched span. Like a Python slice, the end is exclusive, so doc[start:end] is the matched span.

In the output above, both matches are three tokens long (1274 to 1277 and 3609 to 3612): "swimming", a line-break token, and "vigorously".

----------------------------------
## 2. print(found_matches):
This prints the list of found matches. Each match in the list is a tuple as described above.

------------------------
# Example:
Let’s assume that you have already added the pattern for matching the phrase "swimming vigorously" as shown in the previous example.

Here’s what it would look like when using this snippet:

-----------------------------------

    # Import spacy and the Matcher class
    import spacy
    from spacy.matcher import Matcher

    # Load the spaCy English model
    nlp = spacy.load('en_core_web_sm')

    # Create the matcher object
    matcher = Matcher(nlp.vocab)

    # Define a pattern for "swimming vigorously"
    pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP': '*'}, {'LOWER': 'vigorously'}]

    # Add the pattern to the matcher
    matcher.add('Swimming', [pattern])

    # Process some text
    doc = nlp("She was swimming vigorously in the pool. They were swimming   vigorously as well.")

    # Apply the matcher to the doc and store the matches
    found_matches = matcher(doc)

    # Print the list of matches
    print(found_matches)

-----------------------------
# Output
The found_matches might look something like this:

    [(12881893835109366681, 2, 4), (12881893835109366681, 10, 13)]

# Explanation of the Output:
## 12881893835109366681:

This is the ID for the pattern that was matched: the hash of the label 'Swimming' that was provided when the pattern was added to the matcher. It is the same value as in the story's output above because the label is the same.

## 2, 4:
These are the start and end token indices for the first match.

The match covers the tokens at indices 2 and 3 ("swimming" and "vigorously"); the end index 4 is exclusive.

The second match, 10 to 13, is three tokens long because the extra spaces form a whitespace token between the two words.

You can extract the matched text using these indices.

## Extracting the Matched Text:
To get the actual text that was matched, you can use doc[start:end] for each match:

    for match_id, start, end in found_matches:
        matched_span = doc[start:end]
        print(matched_span.text)

## Output for this:

    swimming vigorously
    swimming   vigorously

This prints the exact phrases that matched the pattern.
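
The integer match ID can be turned back into its label through the vocabulary's string store. The sketch below assumes spaCy v3 and a blank pipeline. The sample sentence is illustrative.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # tokenizer only
matcher = Matcher(nlp.vocab)
matcher.add("Swimming", [[{"LOWER": "swimming"},
                          {"IS_SPACE": True, "OP": "*"},
                          {"LOWER": "vigorously"}]])

doc = nlp("She was swimming vigorously in the pool.")
match_id, start, end = matcher(doc)[0]

label = nlp.vocab.strings[match_id]  # hash -> original string label
span = doc[start:end]                # end index is exclusive
print(label, start, end, span.text)  # Swimming 2 4 swimming vigorously
```

Looking the label up this way is useful once a matcher holds several patterns, since every match then carries the ID of the pattern that produced it.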


**7. Print the text surrounding each found match**

In [12]:
print(doc[1265:1290])

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home


In [13]:
print(doc[3600:3615])

all this over his shoulder; he was now swimming
vigorously with the current


**EXTRA CREDIT:<br>Print the *sentence* that contains each found match**

In [14]:
for sent in sents:
    if found_matches[0][1] < sent.end:
        print(sent)
        break

By diving I could evade the bullets and, swimming
vigorously, reach the bank, take to the woods and get away home.  


In [15]:
for sent in sents:
    if found_matches[1][1] < sent.end:
        print(sent)
        break

The hunted man saw all this over his shoulder; he was now swimming
vigorously with the current.  


# Finding Sentence
This snippet iterates through the sentences (sents) to find the sentence that contains the start of the first match in found_matches.

Once that sentence is found, it prints the sentence and breaks out of the loop.

------------------------------
Let's break this down step by step:

## 1. for sent in sents:
This loops through the sentences in the sents list in document order.

sents is the list of sentence Spans built from doc.sents in question 3.

## 2. if found_matches[0][1] < sent.end:

### found_matches[0][1]:
This accesses the start index of the first match in found_matches.

found_matches stores tuples where:

The first element is the match ID.

The second element is the start token index of the match.

The third element is the end token index of the match.

### sent.end:
This is the end token index of the current sentence (sent).

In spaCy, sent.end gives the index of the token that comes after the last token in the sentence.

The condition checks whether the start of the first match (found_matches[0][1]) falls before the end of the current sentence.

On its own, this is true for the containing sentence and for every sentence after it. Because the sentences are visited in order and the loop stops at the first hit, the sentence that gets printed is the one containing the match.

-----------------------------
## 3. print(sent):
Once a sentence is found that contains the match, it is printed.

---------------------
## 4. break:
This exits the loop as soon as the sentence is found, so no later sentences are checked. Without it, every following sentence would be printed too.

--------------------------------
# Example Code with spaCy:

    import spacy
    from spacy.matcher import Matcher

    # Load a small spaCy English model
    nlp = spacy.load('en_core_web_sm')

    # Create the matcher object
    matcher = Matcher(nlp.vocab)

    # Define a pattern
    pattern = [{'LOWER': 'swimming'}, {'IS_SPACE': True, 'OP': '*'}, {'LOWER': 'vigorously'}]

    # Add the pattern to the matcher
    matcher.add('Swimming', [pattern])

    # Process a document
    doc = nlp("She was swimming vigorously in the pool. They were swimming   vigorously as well.")

    # Extract sentences
    sents = [sent for sent in doc.sents]

    # Apply the matcher
    found_matches = matcher(doc)

    # Find and print the sentence containing the first match
    for sent in sents:
        if found_matches[0][1] < sent.end:
            print(sent)
            break

## Output:

    She was swimming vigorously in the pool.

# Explanation:
sents contains the two sentences:

    "She was swimming vigorously in the pool."
    "They were swimming   vigorously as well."

found_matches[0][1] refers to the start index of the first match, which corresponds to the word "swimming" in the first sentence.

The loop checks if the start index of the first match is less than the end of the current sentence.

Since the first match occurs in the first sentence, the sentence is printed, and the loop is stopped using break.

Purpose:
This approach is useful when you want to identify the first sentence in which a specific match (e.g., a pattern like "swimming vigorously") appears within a larger text.
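
The two near-identical cells above can be folded into one loop over all matches. The sketch below is one possible variant, not the course's solution. It checks both sentence bounds so it does not depend on stopping at the first hit, and it assumes spaCy v3 with the rule-based sentencizer on a blank pipeline in place of en_core_web_sm.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")  # punctuation-based sentence boundaries

matcher = Matcher(nlp.vocab)
matcher.add("Swimming", [[{"LOWER": "swimming"},
                          {"IS_SPACE": True, "OP": "*"},
                          {"LOWER": "vigorously"}]])

doc = nlp("She was swimming vigorously in the pool. "
          "They were swimming   vigorously as well.")
sents = list(doc.sents)

containing = []
for match_id, start, end in sorted(matcher(doc), key=lambda m: m[1]):
    for sent in sents:
        # A match belongs to the sentence whose token range includes its start.
        if sent.start <= start < sent.end:
            containing.append(sent.text)
            print(sent.text)
    # Span.sent returns the enclosing sentence directly.
    via_sent = doc[start:end].sent
```

In the notebook, the same loop could run over found_matches and sents as they are already defined. Span.sent is the shorter route when only the enclosing sentence is needed.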

### Great Job!