This notebook explores part-of-speech tagging through its impact on keyphrase extraction, the task of selecting a small number of terms (or phrases) from a document that best represent its content. Here we rank the terms in each document with a tf-idf metric and use POS information to filter those terms.

#### **1. Importing Necessary Libraries**
This cell imports all the libraries required for the notebook.
* `spacy`: A powerful library for Natural Language Processing (NLP).
* `glob`: Used to find all pathnames matching a specified pattern.
* `os`: Provides a way of using operating system-dependent functionality, like path joining.
* `operator`: Used here to easily sort items by a specific index.
* `math`: Provides access to mathematical functions, specifically `log` for the IDF calculation.
* `random`: Used to create a baseline function that selects random keywords.
* `collections.Counter`: A specialized dictionary subclass for counting hashable objects.

In [None]:
# Import the spacy library for Natural Language Processing
import spacy, glob, os, operator, math, random
# From the collections module, import the Counter class for counting frequencies
from collections import Counter

#### **2. Initializing the SpaCy Model**
Here, we load the English language model from SpaCy. For efficiency, we disable the 'ner' (Named Entity Recognition) and 'parser' components, as we only need the tokenizer and the part-of-speech (POS) tagger for this task.

In [None]:
# Load the pre-trained English model from spacy, disabling the 'ner' and 'parser' pipes for faster processing.
# Each component name must be a separate list element.
# Note: 'en' is a spaCy 2.x shortcut; spaCy 3.x expects a full package name such as 'en_core_web_sm'.
nlp = spacy.load('en', disable=['ner', 'parser'])

Here's how you get a word and its POS tag from SpaCy.

#### **3. Demonstrating SpaCy's POS Tagging**
This cell defines and calls a simple function to show how SpaCy processes a text and assigns a part-of-speech (POS) tag to each token. The fine-grained tags carry grammatical information about each word (e.g., `NNP` for a singular proper noun, `VBZ` for a third-person singular present-tense verb).
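
For reference, the fine-grained tags that matter later in this notebook follow the Penn Treebank convention. The lookup below is a small hand-written sketch, not part of SpaCy: the `PTB_GLOSS` dictionary and the `describe_tag` helper are illustrative names, and the table covers only a handful of tags.

```python
# Hand-written glossary of a few Penn Treebank-style tags (illustrative subset only)
PTB_GLOSS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "NNPS": "proper noun, plural",
    "VBZ": "verb, 3rd person singular present",
    "IN": "preposition or subordinating conjunction",
    "DT": "determiner",
}

def describe_tag(tag):
    """Return a short gloss for a tag, or a placeholder if it is not in the subset."""
    return PTB_GLOSS.get(tag, "(not in this subset)")

# Print the gloss for a few of the tags used in the filters below
for tag in ["NNP", "VBZ", "NN"]:
    print(tag, "->", describe_tag(tag))
```

The four noun tags (`NN`, `NNS`, `NNP`, `NNPS`) are the ones the Q2 and Q3 filters test against.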

In [None]:
# Define a function to get and print spacy tags for a given text
def get_spacy_tags(text):
    """ Get spacy tags for an input text """
    # Process the text with the nlp object to create a Doc object
    doc=nlp(text)
    # Iterate through each token (word) in the Doc object
    for word in doc:
        # Print the word's text and its fine-grained POS tag
        print(word.text, word.tag_)

# Call the function with an example sentence
get_spacy_tags("Time flies like an arrow")

#### **4. Reading and Processing Text Documents**
This function, `read_docs`, is designed to read all `.txt` files from a specified directory. For each file, it reads the content and processes it with the SpaCy `nlp` object. It returns a list of tuples, where each tuple contains the filename and the processed SpaCy `Doc` object.
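
The file-matching step can be tried in isolation. The sketch below is an assumption-laden stand-in: it writes three made-up files to a temporary folder, uses the same `glob` pattern, and stores raw strings instead of SpaCy `Doc` objects. It also sorts the matches, since `glob.glob` does not promise any particular order.

```python
import glob
import os
import tempfile

# Build a throwaway folder with two .txt files and one file that should be ignored
with tempfile.TemporaryDirectory() as tmp_dir:
    samples = [("a.txt", "first summary"), ("b.txt", "second summary"), ("notes.md", "ignored")]
    for name, text in samples:
        with open(os.path.join(tmp_dir, name), "w", encoding="utf-8") as handle:
            handle.write(text)

    # Same pattern as read_docs; sorted() makes the order deterministic
    matched = sorted(glob.glob(os.path.join(tmp_dir, "*.txt")))

    # Pair each base filename with its raw text (read_docs keeps the full path and a Doc instead)
    loaded = []
    for path in matched:
        with open(path, encoding="utf-8") as handle:
            loaded.append((os.path.basename(path), handle.read()))

print(loaded)  # [('a.txt', 'first summary'), ('b.txt', 'second summary')]
```

Note that `read_docs` keeps the full path returned by `glob` as the key, which is why later cells look results up with `os.path.join(inputDir, filename)`.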

In [None]:
# Define a function to read all .txt documents from an input directory
def read_docs(inputDir):
    """ Read in movie documents (all ending in .txt) from an input folder"""
    
    # Initialize an empty list to store the documents
    docs=[]
    # Use glob to find all files ending with .txt in the specified directory
    for filename in glob.glob(os.path.join(inputDir, '*.txt')):
        # Open each file for reading (assumes the summaries are UTF-8 encoded)
        with open(filename, encoding="utf-8") as file:
            # Append a tuple of (filename, processed_spacy_doc) to the list
            docs.append((filename, nlp(file.read())))
    # Return the list of documents
    return docs

#### **5. Loading the Movie Summaries Dataset**
This cell specifies the directory containing the dataset of 2000 movie summaries from Wikipedia and then calls the `read_docs` function to load them into memory. The result is stored in the `original_docs` variable.

In [None]:
# Define the directory path containing the movie summary text files
# directory with 2000 movie summaries from Wikipedia
inputDir="../data/movie_summaries/"
# Call the read_docs function to load and process the documents from the directory
original_docs=read_docs(inputDir)

Q1. We covered tf-idf in lecture 9 ("lexical semantics") and in the `7.embeddings/TFIDF.ipynb` notebook. Write a method for extracting the 10 terms with highest tf-idf score for each document in a collection.

#### **6. Baseline Function: Extracting Random Keywords**
This function provides a simple baseline for keyword extraction. Instead of using a sophisticated metric, it just selects 10 unique words at random from each document. This helps to illustrate the expected output format for the subsequent functions: a dictionary mapping each filename to a list of 10 keywords.
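
Because the baseline draws from the global random state, its output changes from run to run. If repeatable output is wanted, one possible variant is sketched below on plain strings; `sample_terms` and its `seed` parameter are hypothetical names, not part of the notebook.

```python
import random

def sample_terms(tokens, k=10, seed=0):
    """Pick up to k distinct terms; a fixed seed keeps the choice repeatable."""
    unique = sorted(set(tokens))   # sort so set ordering cannot affect the result
    rng = random.Random(seed)      # local generator, leaves the global state alone
    return rng.sample(unique, min(k, len(unique)))

# Made-up token list standing in for one document
toy_tokens = "the shark attacks the beach and the chief closes the beach".split()
first = sample_terms(toy_tokens, k=3)
second = sample_terms(toy_tokens, k=3)
print(first == second)  # True: same seed, same selection
```

The `min(k, len(unique))` guard matters for very short documents, where fewer than 10 distinct terms exist.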

In [None]:
# Define a function that returns 10 random words from each document
def random_words(docs):
    """ Function to return 10 random terms from each doc.
    
    Input: a list of (filename, [spacy tokens]) documents
    Returns: a dict mapping "filename" -> [list of up to 10 distinct terms, in random order]
 
    Used just to illustrate expected output of functions below """
    
    # Initialize an empty dictionary to store the keyphrases for each file
    keyphrases={}
    
    # Iterate through each document tuple (filename, doc)
    for filename, doc in docs:
        # Create a list of unique word texts from the document
        tokens=list(set([x.text for x in doc]))
        # Shuffle the list of unique tokens randomly
        random.shuffle(tokens)
  
        # Assign the first 10 shuffled tokens as the keyphrases for the current filename
        keyphrases[filename]=tokens[:10]
    
    # Return the dictionary of keyphrases
    return keyphrases

#### **7. Displaying Random Keywords for Sample Movies**
This cell runs the `random_words` function on the dataset and then prints the randomly selected keywords for three specific movies to show the output.

In [None]:
# Get random keywords for all original documents
terms=random_words(original_docs)
# Iterate through a list of specific movie filenames
for filename in ["Jaws.txt", "Harry_Potter_and_the_Philosophers_Stone.txt", "Back_to_the_Future.txt"]:
    # Print the filename
    print("\n%s\n" % filename)
    # Print the randomly selected keywords for that file, each on a new line
    print('\n'.join(terms[os.path.join(inputDir, filename)]))

#### **8. Answering Q1: Implementing TF-IDF for Keyword Extraction**
This is the main function for Question 1. It calculates the TF-IDF (Term Frequency-Inverse Document Frequency) score for every word in every document and returns the top 10 words with the highest scores for each document.

* **`get_tf(tokens)`**: A nested helper function that calculates **Term Frequency** (how often a word appears in a single document) using `Counter`.
* **`get_idfs(docs)`**: A nested helper function that calculates **Inverse Document Frequency** for every word in the entire collection. IDF measures how important a word is by giving a higher score to words that appear in fewer documents. The formula used is $IDF(t) = \log(\frac{\text{total number of documents}}{\text{number of documents with term t}})$.
* The main body of the function first calculates IDFs for the whole corpus. Then, for each document, it calculates TFs, combines them with the IDFs to get the final TF-IDF score for each term, sorts the terms by this score in descending order, and selects the top 10.
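
The steps above can be traced by hand on a toy corpus. This is a minimal sketch using made-up, pre-tokenized documents and plain strings rather than SpaCy tokens; `toy_docs` and `top_terms` are illustrative names.

```python
import math
from collections import Counter

# Toy corpus of pre-tokenized documents (made up for illustration)
toy_docs = {
    "d1": ["shark", "beach", "shark", "the"],
    "d2": ["wizard", "school", "the"],
    "d3": ["car", "time", "the", "time"],
}

# Document frequency: number of documents containing each term
df = Counter()
for tokens in toy_docs.values():
    df.update(set(tokens))

# IDF(t) = log(total number of documents / number of documents with term t)
n_docs = len(toy_docs)
idf = {term: math.log(n_docs / count) for term, count in df.items()}

def top_terms(tokens, k=2):
    """Rank a document's terms by raw count times IDF, highest first."""
    tf = Counter(tokens)
    scores = {term: tf[term] * idf[term] for term in tf}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(idf["the"])                 # 0.0, because "the" occurs in every document
print(top_terms(toy_docs["d1"]))  # ['shark', 'beach']
```

A term that appears in every document gets an IDF of zero, so it can never outrank a term that is specific to a few documents, no matter how often it occurs.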

In [None]:
# Define the function to rank terms using TF-IDF
def tf_idf_ranking(docs):
    """
    Function to rank terms in document by tf-idf score, and return the top 10 terms
    
    Input: a list of (filename, [spacy tokens]) documents
    Returns: a dict mapping "filename" -> [list of 10 keyphrases, ranked from highest tf-idf score to lowest]
    
    
    """
    
    # Define a helper function to calculate Term Frequency (TF) for a document
    def get_tf(tokens):
        # Initialize a Counter object to store word counts
        counter=Counter()
        # Iterate through each token in the document
        for token in tokens:
            # Increment the count for the token's text
            counter[token.text]+=1
        # Return the counter object
        return counter
    
    # Define a helper function to calculate Inverse Document Frequency (IDF) for the entire corpus
    def get_idfs(docs):
        # Initialize a Counter to store document frequencies for each word
        counts=Counter()
        # Iterate through each document in the corpus
        for _, doc in docs:
            # Collect the unique words in the current document so each word is counted only once
            doc_types={token.text for token in doc}

            # For each unique word in the document, increment its document frequency count
            for word in doc_types:
                counts[word]+=1

        # Initialize a dictionary to store the final IDF scores
        idfs={}
        # Iterate through each term in the document frequency counter
        for term in counts:
            # Calculate the IDF score using the log formula
            idfs[term]=math.log(float(len(docs))/counts[term])

        # Return the dictionary of IDF scores
        return idfs

    # Calculate the IDF scores for the entire collection of documents
    idfs=get_idfs(docs)

    # Initialize a dictionary to store the final keyphrases for each document
    keyphrases={}
    
    # Iterate through each document (filename, doc)
    for filename, doc in docs:
        # Calculate the Term Frequency (TF) for the current document
        tf=get_tf(doc)
        # Initialize a dictionary to store candidate keyphrases and their TF-IDF scores
        candidates={}
        # For each term in the TF dictionary
        for term in tf:
            # Calculate the TF-IDF score and store it
            candidates[term]=tf[term]*idfs[term]

        # Sort the candidate terms by their TF-IDF score in descending order
        sorted_x = sorted(candidates.items(), key=operator.itemgetter(1), reverse=True)
       
        # Store the top 10 terms (the words, not their scores) in the keyphrases dictionary
        keyphrases[filename]=[k for k,v in sorted_x[:10]]
    
    # Return the final dictionary of top 10 keyphrases for each document
    return keyphrases

#### **9. Displaying Top 10 TF-IDF Keywords for Sample Movies**
This cell runs the `tf_idf_ranking` function and prints the extracted keywords for the same three movies, allowing for a comparison with the random baseline. The results should be much more relevant to the movie plots.

In [None]:
# Get the top 10 TF-IDF ranked keywords for all documents
terms=tf_idf_ranking(original_docs)
# Iterate through a list of specific movie filenames
for filename in ["Jaws.txt", "Harry_Potter_and_the_Philosophers_Stone.txt", "Back_to_the_Future.txt"]:
    # Print the filename
    print("\n%s\n" % filename)
    # Print the top 10 keywords for that file, each on a new line
    print('\n'.join(terms[os.path.join(inputDir, filename)]))

Q2.  Write a method for extracting the 10 terms with highest tf-idf score for each document in a collection that *excludes all proper names*.

#### **10. Answering Q2: TF-IDF Keywords Excluding Proper Nouns**
This function addresses Question 2 by first filtering out all proper nouns from the documents and then applying the same `tf_idf_ranking` function from Q1.

* **`remove_proper_nouns(docs)`**: This helper function iterates through every token in every document. It creates new, filtered document lists that exclude any token tagged as a singular proper noun (`NNP`) or a plural proper noun (`NNPS`).
* The main function simply calls this filtering function and passes the result to `tf_idf_ranking`.
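
The filtering step can be illustrated without loading a model. In this sketch, `Tok` is a hypothetical stand-in that exposes only the `text` and `tag_` attributes the filter reads, and the tags are assigned by hand rather than by a tagger.

```python
from collections import namedtuple

# Stand-in for a SpaCy token: only the two attributes the filter reads
Tok = namedtuple("Tok", ["text", "tag_"])

PROPER_NOUN_TAGS = {"NNP", "NNPS"}

def drop_proper_nouns(tokens):
    """Keep every token whose fine-grained tag is not a proper-noun tag."""
    return [tok for tok in tokens if tok.tag_ not in PROPER_NOUN_TAGS]

# Hand-tagged toy sentence (tags assigned by hand, not by a tagger)
sentence = [
    Tok("Brody", "NNP"),
    Tok("closes", "VBZ"),
    Tok("the", "DT"),
    Tok("beach", "NN"),
]
print([tok.text for tok in drop_proper_nouns(sentence)])  # ['closes', 'the', 'beach']
```

Since the filter runs before `tf_idf_ranking`, both the term counts and the document frequencies are computed over the filtered tokens only. A word form that is tagged `NNP` in some places and `NN` in others still survives through its non-proper occurrences.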

In [None]:
# Define a function to get keyphrases while excluding proper nouns
def keyphrase_no_proper_nouns(docs):
    """
    Function to rank terms in document by tf-idf score, and return the top 10 terms.  
    Constraint: None of the top 10 terms should be proper nouns.
    
    Input: a list of (filename, [spacy tokens]) documents
    Returns: a dict mapping "filename" -> [list of 10 keyphrases, ranked from highest tf-idf score to lowest]
    
    """
    
    # Define a helper function to filter out proper nouns from the documents
    def remove_proper_nouns(docs):
        # Initialize a list to hold the new, filtered documents
        new_docs=[]
        # Iterate through each document
        for filename, doc in docs:
            # Initialize a list to hold the tokens for the new document
            new_doc=[]
            # Iterate through each token in the original document
            for token in doc:
                # Check that the token's tag is neither a singular ('NNP') nor a plural ('NNPS') proper noun
                if token.tag_ not in ("NNP", "NNPS"):
                    # If it's not a proper noun, add it to the new document list
                    new_doc.append(token)
            # Add the filtered document (as a tuple of filename and token list) to the new docs list
            new_docs.append((filename, new_doc))
       
        # Return the collection of documents with proper nouns removed
        return new_docs
            
    # Create the new set of documents by removing proper nouns
    new_docs=remove_proper_nouns(docs)
    # Run the original tf_idf_ranking function on the filtered documents
    terms=tf_idf_ranking(new_docs)
    # Return the resulting keywords
    return terms

#### **11. Displaying Keywords (No Proper Nouns) for Sample Movies**
This cell executes the function from Q2. The resulting keywords are still ranked by tf-idf but exclude every token the tagger labeled as a proper noun, so most character names and locations drop out (tagging errors can still let a few through).

In [None]:
# Get the top 10 keywords, excluding proper nouns
terms=keyphrase_no_proper_nouns(original_docs)
# Iterate through a list of specific movie filenames
for filename in ["Jaws.txt", "Harry_Potter_and_the_Philosophers_Stone.txt", "Back_to_the_Future.txt"]:
    # Print the filename
    print("\n%s\n" % filename)
    # Print the top 10 keywords for that file, each on a new line
    print('\n'.join(terms[os.path.join(inputDir, filename)]))

Q3.  Write a method for extracting the 10 terms with highest tf-idf score for each document in a collection that *includes only common nouns*.

#### **12. Answering Q3: TF-IDF Keywords Including Only Common Nouns**
This function addresses Question 3 by filtering the documents to include *only* common nouns before applying the TF-IDF calculation. This is a more restrictive filter than the one in Q2.
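
To see why this filter is more restrictive, both filters can be run side by side on a hand-tagged toy document. The sketch below is illustrative only: `Tok`, `filter_docs`, `is_common_noun`, and `is_not_proper_noun` are hypothetical names, and the tags are assigned by hand.

```python
from collections import namedtuple

Tok = namedtuple("Tok", ["text", "tag_"])  # stand-in for a SpaCy token

def filter_docs(docs, keep):
    """Return (filename, tokens) pairs, keeping only tokens for which keep(token) is true."""
    return [(name, [tok for tok in tokens if keep(tok)]) for name, tokens in docs]

def is_common_noun(tok):
    # Q3-style filter: keep singular and plural common nouns only
    return tok.tag_ in ("NN", "NNS")

def is_not_proper_noun(tok):
    # Q2-style filter: drop singular and plural proper nouns, keep everything else
    return tok.tag_ not in ("NNP", "NNPS")

toy = [("toy.txt", [Tok("Brody", "NNP"), Tok("closes", "VBZ"),
                    Tok("beaches", "NNS"), Tok("town", "NN")])]

only_common = filter_docs(toy, is_common_noun)
no_proper = filter_docs(toy, is_not_proper_noun)
print([t.text for t in only_common[0][1]])  # ['beaches', 'town']
print([t.text for t in no_proper[0][1]])    # ['closes', 'beaches', 'town']
```

Every token that passes the common-noun filter also passes the no-proper-noun filter, but verbs, determiners, and other non-noun tokens pass only the latter.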

* **`keep_common_nouns(docs)`**: This nested helper keeps only tokens that are tagged as a singular common noun (`NN`) or a plural common noun (`NNS`).
* The main function uses this new filtered set of documents to find the most important nouns in each movie summary.

In [None]:
# Define a function to get keyphrases that are only common nouns
def keyphrase_only_common_nouns(docs):
    """
    Function to rank terms in document by tf-idf score, and return the top 10 terms.  
    Constraint: All of the top 10 terms should be common nouns.
    
    Input: a list of (filename, [spacy tokens]) documents
    Returns: a dict mapping "filename" -> [list of 10 keyphrases, ranked from highest tf-idf score to lowest]
    
    """
        
    # Define a helper function to filter documents to include ONLY common nouns
    def keep_common_nouns(docs):
        # Initialize a list to hold the new, filtered documents
        new_docs=[]
        # Iterate through each document
        for filename, doc in docs:
            # Initialize a list to hold the tokens for the new document
            new_doc=[]
            # Iterate through each token in the original document
            for token in doc:
                # Check if the token's tag is a singular ('NN') or plural ('NNS') common noun
                if token.tag_ in ("NN", "NNS"):
                    # If it is a common noun, add it to the new document list
                    new_doc.append(token)
            # Add the filtered document (as a tuple of filename and token list) to the new docs list
            new_docs.append((filename, new_doc))
       
        # Return the collection of documents containing only common nouns
        return new_docs
            
    # Create the new set of documents containing only common nouns
    new_docs=keep_common_nouns(docs)
    # Run the original tf_idf_ranking function on the filtered documents
    terms=tf_idf_ranking(new_docs)
    # Return the resulting keywords
    return terms

#### **13. Displaying Keywords (Only Common Nouns) for Sample Movies**
This final execution cell runs the function from Q3. The output will consist exclusively of the top 10 most important common nouns for the selected movies.

In [None]:
# Get the top 10 keywords, consisting of only common nouns
terms=keyphrase_only_common_nouns(original_docs)
# Iterate through a list of specific movie filenames
for filename in ["Jaws.txt", "Harry_Potter_and_the_Philosophers_Stone.txt", "Back_to_the_Future.txt"]:
    # Print the filename
    print("\n%s\n" % filename)
    # Print the top 10 keywords for that file, each on a new line
    print('\n'.join(terms[os.path.join(inputDir, filename)]))