In [None]:
# cell 1
import spacy, glob, os # Import the spacy library for NLP, glob for finding files, and os for interacting with the operating system.

### Cell 1: Imports
This cell imports the necessary Python libraries for the task.
* **spaCy**: A powerful library for Natural Language Processing (NLP).
* **glob**: Used to find all pathnames matching a specified pattern.
* **os**: Provides a way of using operating system-dependent functionality, like joining file paths.

In [None]:
# cell 2
nlp = spacy.load('en_core_web_sm', disable=['ner','parser']) # Load the small English model from spaCy with the NER and parser components disabled, since only POS tags are needed.

### Cell 2: Initializing the spaCy Model
This block sets up the spaCy language model. We load the small English model (`en_core_web_sm`). Part-of-Speech (POS) tagging does not need the **Named Entity Recognizer (`ner`)** or the **Dependency Parser (`parser`)**, so both are disabled. Processing is faster because spaCy skips analysis that the notebook never uses.

In [None]:
# cell 3
def get_spacy_tags(text): # Define a function that takes a text string as input.
    doc=nlp(text) # Process the input text with the spaCy nlp object to create a doc object.
    for word in doc: # Iterate through each token (word) in the processed doc.
        print(word.text, word.tag_) # Print the word itself and its fine-grained POS tag.

get_spacy_tags("Time flies like an arrow") # Call the function with an example sentence to see the output.

### Cell 3: Function to Get POS Tags
The `get_spacy_tags` function is a simple utility to demonstrate spaCy's POS tagging capability. It takes a sentence, processes it, and then iterates through each word to print the word and its corresponding tag from the Penn Treebank tag set.

In [None]:
# cell 4
def read_docs(inputDir, maxDocs=100): # Define a function to read files from a directory.
    """ Read in movie documents (all ending in .txt) from an input folder
    and process with spacy """
    
    docs=[] # Initialize an empty list to store the processed documents.
    # Use glob to find all files ending in .txt, sort them so the same files are chosen on every run, and keep at most maxDocs of them.
    for filename in sorted(glob.glob(os.path.join(inputDir, '*.txt')))[:maxDocs]:
        with open(filename, encoding='utf-8') as file: # Open each file found (using utf-8 encoding for compatibility).
            docs.append((filename, nlp(file.read()))) # Read the file's content, process it with spaCy, and append a (filename, doc) tuple to our list.
    return docs # Return the list of processed documents.

### Cell 4: Function to Read and Process Documents
The `read_docs` function handles a corpus of text files. It finds the `.txt` files in the input directory, processes the content of each one with spaCy, and stores a `(filename, doc)` tuple for each in a list. The `maxDocs` parameter caps the number of files processed, which is useful for quick tests on large datasets.
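
The reading pattern can be tried without a spaCy model. The sketch below is a variant under stated assumptions, not the notebook's function: `read_texts` and its `process` parameter are hypothetical names, and `str.split` stands in for the `nlp` object. It sorts the matched paths so the same files are chosen on each run and keeps at most `max_docs` of them.

```python
import glob
import os
import tempfile


def read_texts(input_dir, max_docs=100, process=str.split):
    """Read up to max_docs .txt files and return (filename, processed) pairs.

    `process` is a hypothetical stand-in for the spaCy `nlp` object.
    """
    # glob returns matches in arbitrary order, so sort before truncating.
    paths = sorted(glob.glob(os.path.join(input_dir, "*.txt")))[:max_docs]
    results = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            results.append((path, process(handle.read())))
    return results


# Demo on a temporary directory with three .txt files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    samples = [("a.txt", "first summary"), ("b.txt", "second summary"),
               ("c.txt", "third summary"), ("notes.md", "ignored")]
    for name, text in samples:
        with open(os.path.join(tmp, name), "w", encoding="utf-8") as handle:
            handle.write(text)
    demo = read_texts(tmp, max_docs=2)
    demo_names = [os.path.basename(path) for path, _ in demo]
    demo_tokens = [tokens for _, tokens in demo]

print(demo_names)
print(demo_tokens)
```

Passing `process=nlp` would recover the notebook's behavior of storing spaCy `doc` objects.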

In [None]:
# cell 5
# directory with 2000 movie summaries from Wikipedia
inputDir="../data/movie_summaries/" # Define the path to the directory containing the movie summary text files.
docs=read_docs(inputDir, maxDocs=100) # Call the read_docs function to load and process 100 movie summaries.

### Cell 5: Loading the Movie Summary Corpus
This cell defines the location of the dataset and calls `read_docs` to load up to 100 summaries from it. The resulting `docs` list holds the `(filename, doc)` pairs analyzed in the rest of the notebook.

### Cell 6: Penn Treebank POS Tags
Here are the 45 tags used by the Penn Treebank:

|tag|meaning|
|---|---|
|CC|Coordinating conjunction|
|CD|Cardinal number|
|DT|Determiner|
|EX|Existential there|
|FW|Foreign word|
|IN|Preposition or subordinating conjunction|
|JJ|Adjective|
|JJR|Adjective, comparative|
|JJS|Adjective, superlative|
|LS|List item marker|
|MD|Modal|
|NN|Noun, singular or mass|
|NNS|Noun, plural|
|NNP|Proper noun, singular|
|NNPS|Proper noun, plural|
|PDT|Predeterminer|
|POS|Possessive ending|
|PRP|Personal pronoun|
|PRP\$|Possessive pronoun|
|RB|Adverb|
|RBR|Adverb, comparative|
|RBS|Adverb, superlative|
|RP|Particle|
|SYM|Symbol|
|TO|to|
|UH|Interjection|
|VB|Verb, base form|
|VBD|Verb, past tense|
|VBG|Verb, gerund or present participle|
|VBN|Verb, past participle|
|VBP|Verb, non-3rd person singular present|
|VBZ|Verb, 3rd person singular present|
|WDT|Wh-determiner|
|WP|Wh-pronoun|
|WP\$|Possessive wh-pronoun|
|WRB|Wh-adverb|
|.|period|
|,|comma|
|:|colon|
|(|left separator|
|)|right separator|
|$|dollar sign|
|#|pound sign|
|\`\`|open double quotes|
|''|close double quotes|

Explore these tags below by searching the (automatically tagged) movie summary corpus for examples of each one. Note that spaCy's English models label brackets `-LRB-` and `-RRB-` rather than `(` and `)`.
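
Before searching for a particular tag, it can help to see which tags actually occur in the loaded sample. The following is a minimal sketch, assuming `docs` is a list of `(filename, doc)` pairs whose tokens expose a `tag_` attribute, as in this notebook. The `tag_counts` and `fake_doc` names are hypothetical, and the toy data uses `SimpleNamespace` objects in place of real spaCy tokens so the block runs without a model.

```python
from collections import Counter
from types import SimpleNamespace


def tag_counts(docs):
    """Count fine-grained tags over a list of (filename, doc) pairs."""
    counts = Counter()
    for _, doc in docs:
        counts.update(token.tag_ for token in doc)
    return counts


def fake_doc(pairs):
    """Build a stand-in doc: a list of objects with .text and .tag_ (illustration only)."""
    return [SimpleNamespace(text=text, tag_=tag) for text, tag in pairs]


# Hand-written toy data, not spaCy output.
toy_docs = [
    ("one.txt", fake_doc([("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), (".", ".")])),
    ("two.txt", fake_doc([("A", "DT"), ("cat", "NN"), ("slept", "VBD"), (".", ".")])),
]

toy_counts = tag_counts(toy_docs)
for tag, n in toy_counts.most_common():
    print(tag, n)
```

Calling `tag_counts(docs)` on the real corpus should work the same way, since spaCy tokens also provide `tag_`.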

In [None]:
# cell 7
def find_examples(docs, tag, num_examples=10): # Define a function to find examples of a specific tag in the corpus.
    window=5 # Set a 'window' of 5 words to show on each side of the target word.
    count=0 # Initialize a counter for the number of examples found.
    for _, doc in docs: # Iterate through each doc object in our list of documents.
        for idx, token in enumerate(doc[window:-window]): # Skip the first and last `window` tokens so every match has a full context on both sides.
            if token.tag_ == tag: # Check if the token's tag matches the tag we're looking for.
                # idx is relative to the slice, so the matching token is doc[idx+window] in the full doc.
                # Print the context window: `window` words before, the target word highlighted in red, and `window` words after.
                print (' '.join(["%s" % context.text for context in doc[idx:idx+window]]), "\033[91m%s\033[0m" % doc[idx+window].text, ' '.join(["%s" % context.text for context in doc[idx+window+1:idx+window*2+1] ]))
                # for windows users - you may want to use the following print statement
                # to highlight the middle token in each sentence using #s
                # print (' '.join(["%s" % context.text for context in doc[idx-window+5:idx+5 ]]), "#%s#" % doc[idx+5].text, ' '.join(["%s" % context.text for context in doc[idx+6:idx+window+6] ]))
                count+=1 # Increment the counter.
                if count >= num_examples: # Check if we have found the desired number of examples.
                    return # Exit the function if we have enough examples.

### Cell 7: Function to Find Tag Examples
The `find_examples` function is a tool for exploring the corpus. It searches for a specific POS tag and prints up to `num_examples` occurrences, each with the five words on either side (the "context window"). Tokens within five positions of the start or end of a document are skipped so that every match has a full window. This kind of display is known in corpus linguistics as a **concordance**, and it helps you see how words and tags are used in context. The function uses ANSI escape codes (`\033[91m` ... `\033[0m`) to print the target word in red in terminals that support them.
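
The concordance idea can be sketched on plain Python data. This version is an illustration under assumptions, not the notebook's function: `concordance` is a hypothetical name, the input is a hand-written list of `(word, tag)` pairs rather than a spaCy `doc`, and it differs in two ways: it clips the window at the edges instead of skipping tokens near them, and it returns the matches instead of printing them.

```python
def concordance(tagged, tag, window=5, num_examples=10):
    """Return (left, target, right) triples for tokens carrying `tag`.

    `tagged` is a list of (word, tag) pairs.
    """
    hits = []
    for i, (word, word_tag) in enumerate(tagged):
        if word_tag != tag:
            continue
        # Clip the left context at the start of the list; slicing clips the right side.
        left = [w for w, _ in tagged[max(0, i - window):i]]
        right = [w for w, _ in tagged[i + 1:i + 1 + window]]
        hits.append((" ".join(left), word, " ".join(right)))
        if len(hits) >= num_examples:
            break
    return hits


# Hand-tagged toy sentence, not spaCy output.
toy = [("She", "PRP"), ("wants", "VBZ"), ("to", "TO"), ("leave", "VB"),
       ("and", "CC"), ("he", "PRP"), ("will", "MD"), ("stay", "VB"), (".", ".")]

toy_hits = concordance(toy, "VB", window=2)
for left, target, right in toy_hits:
    print(f"{left} [{target}] {right}")
```

Marking the target with brackets avoids ANSI codes, which may be convenient where colored output is not rendered.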

In [None]:
# cell 8
find_examples(docs, "VB", num_examples=10) # Use the function to find 10 examples of the 'VB' (Verb, base form) tag.

### Cell 8: Finding Examples of 'VB'
This cell calls the `find_examples` function to search for and display 10 examples of the tag `VB` (Verb, base form) from the loaded movie summaries. This demonstrates how the function can be used to investigate the usage of different POS tags.

### Cell 9: Questions about Tag Differences
Q1: What's the difference between the following?

* PRP and PRP$
* NN and NNP
* JJ and JJR
* VBZ and VB

### Cell 10: Manual Tagging Exercise
Q2: Use the `find_examples` function to help understand the usage of each part-of-speech tag, then work with a partner to manually tag the following four sentences:

### Cell 11: Sentence 1
1. "Open the pod bay doors, Hal"

### Cell 12: Sentence 2
2. "Frankly, my dear, I don't give a damn"

### Cell 13: Sentence 3
3. "May the Force be with you"

### Cell 14: Sentence 4
4. "One morning I shot an elephant in my pajamas. How he got in my pajamas, I don't know"

### Cell 15: Evaluation Question
Q3: After tagging the sentences above by hand, run them through the spaCy tagger. What is spaCy's accuracy on these sentences?

In [None]:
# cell 16
# This cell is intentionally left blank for the user to write code to answer the questions.

### Cell 16: User Workspace
This empty code cell is provided for you to write and run your own code to answer the questions posed in the previous markdown cells. You can use it to run the `get_spacy_tags` function on the sentences and compare the results with your manual tagging.
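
One way to score the comparison is token-level accuracy: the share of tokens whose predicted tag equals the hand-assigned tag. The sketch below is a minimal version under assumptions: `tagging_accuracy` is a hypothetical helper, and the two tag lists are made up for illustration rather than taken from spaCy. The hand tags need to line up with spaCy's tokens, which split "don't" into "do" and "n't", for example.

```python
def tagging_accuracy(gold, predicted):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same number of tokens")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Made-up tags for a five-token sentence; these are not spaCy output.
gold_tags = ["DT", "NN", "VBZ", "RB", "."]
predicted_tags = ["DT", "NN", "VBZ", "JJ", "."]

score = tagging_accuracy(gold_tags, predicted_tags)
print(f"accuracy = {score:.2f}")
```

With real data, `predicted_tags` could be built as `[token.tag_ for token in nlp(sentence)]`, and the tags from all four sentences concatenated before scoring.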