This notebook explores WordNet synsets and presents a simple method for finding every mention in a text of a given node in the WordNet hierarchy or of any of its hyponyms (e.g., finding all *buildings* in a text).

### 1. Importing Libraries

The first step is to import the necessary Python libraries.
* **nltk**: The Natural Language Toolkit, which provides access to WordNet.
* **re**: The regular expressions library, used for text manipulation.
* **spacy**: A powerful library for Natural Language Processing (NLP), used here for efficient text tokenization and lemmatization.
* **wordnet (as wn)**: The specific corpus from NLTK that we will be working with. We import it with the alias `wn` for convenience.

In [None]:
# Import NLTK (for WordNet access), the regular expressions module, and SpaCy
import re
import nltk
import spacy

# Import the WordNet corpus from NLTK and assign it the alias 'wn'
from nltk.corpus import wordnet as wn

### 2. Checking NLTK Version

This cell displays the installed version of NLTK, which is useful for documenting the environment and for diagnosing compatibility problems.

In [None]:
# Display the current version of the NLTK library
nltk.__version__

### 3. Initializing SpaCy

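Here, we load a pre-trained English model from SpaCy. To make processing faster, we disable the pipeline components we don't need for this task: the part-of-speech **tagger**, the **named entity recognizer (ner)**, and the dependency **parser**. Our main goal is to get lemmatized tokens. Note that this setup is version-dependent. In SpaCy 2.x, lemmas are still assigned when the tagger is disabled. In SpaCy 3.x, the lemmatizer is a separate pipeline component that, for the English models, relies on part-of-speech information from the tagger and `attribute_ruler`, so those components should stay enabled there.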

In [None]:
# Load the small English model from SpaCy, disabling components that are not needed to speed up processing.
# We only need the tokenizer and lemmatizer for this task.
nlp = spacy.load('en_core_web_sm', disable=['tagger','ner','parser'])

# The following lines are another way to remove components from the pipeline after loading.
# They are redundant here because we used the 'disable' argument above, but show an alternative method.
# nlp.remove_pipe('tagger')
# nlp.remove_pipe('ner')
# nlp.remove_pipe('parser');

### 4. Exploring WordNet Synsets

A **synset** is a group of synonyms that represent a single concept. This code retrieves all synsets associated with the word "car" and prints each synset along with its definition. WordNet generally lists the synsets of a word from its most to its least frequent sense.

In [None]:
# Get a list of all synsets for the word 'car'
synsets=wn.synsets('car')

# Iterate through each synset in the list
for synset in synsets:
    # Print the synset's name and its definition
    print(synset, synset.definition())

### 5. Getting Words from a Synset

Each synset contains one or more **lemmas**, which are the specific words or phrases that belong to it. This code accesses the first noun synset for "car" (`car.n.01`) and prints all of its lemmas.

In [None]:
# Access the synset 'car.n.01' and iterate through its lemmas
for lemma in wn.synset("car.n.01").lemmas():
    # Print the name of each lemma
    print(lemma.name())

### 6. Defining Hierarchy Traversal Functions

WordNet organizes synsets into a hierarchy that runs from general concepts down to specific ones. To navigate it easily, we define two simple `lambda` functions:
* **hypo**: Finds the immediate children of a synset (hyponyms, or more specific concepts).
* **hyper**: Finds the immediate parents of a synset (hypernyms, or more general concepts).

In [None]:
# These functions are shortcuts from http://www.nltk.org/howto/wordnet.html to get hyponyms/hypernyms

# Define a lambda function 'hypo' that takes a synset 's' and returns its direct hyponyms.
hypo = lambda s: s.hyponyms()

# Define a lambda function 'hyper' that takes a synset 's' and returns its direct hypernyms.
hyper = lambda s: s.hypernyms()

### 7. Finding All Hyponyms (Descendants)

Using the `closure()` method and our `hypo` function, we can recursively find all synsets that are descendants of a target synset. This code finds every type of "car" in the WordNet hierarchy.

In [None]:
# Use the closure() method with the 'hypo' function to recursively find all hyponyms of 'car.n.01'.
# This effectively gets all synsets that are "under" car.n.01 in the hierarchy.
list(wn.synset("car.n.01").closure(hypo))
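
To make the behavior of `closure()` more concrete, here is a minimal sketch of the same idea over a toy, hand-written hierarchy. The names `TOY_HYPONYMS` and `toy_closure` are hypothetical and the data is not WordNet's; NLTK's own implementation may differ in its details.

```python
from collections import deque

# Toy hierarchy (hypothetical, not WordNet data): parent -> direct children.
TOY_HYPONYMS = {
    "vehicle": ["car", "bicycle"],
    "car": ["sedan", "sports car"],
    "sports car": ["roadster"],
}

def toy_closure(node, children=TOY_HYPONYMS):
    """Yield every descendant of `node` breadth-first, each at most once."""
    seen = set()
    queue = deque(children.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            # A node reachable through two parents is reported only once.
            continue
        seen.add(current)
        yield current
        # Schedule the children of the node we just visited.
        queue.extend(children.get(current, []))

print(list(toy_closure("car")))  # ['sedan', 'sports car', 'roadster']
```

The starting node itself is not part of the result, which is why `get_words_in_hypo` later adds the original synset back in explicitly.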

### 8. Finding All Hypernyms (Ancestors)

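Similarly, we can use `closure()` with our `hyper` function to find all ancestors of a target synset, tracing it up to the root of the hierarchy (like "entity"). Because a synset can have more than one hypernym, the result can contain synsets from several paths to the root rather than a single chain.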

In [None]:
# Use the closure() method with the 'hyper' function to recursively find all hypernyms of 'car.n.01'.
# This traces the path from 'car' up to its most general concepts.
list(wn.synset("car.n.01").closure(hyper))

### 9. Function to Collect All Hyponym Words

This function, `get_words_in_hypo`, automates the process of finding all words associated with a given synset *and* all of its descendants. It gathers every lemma from this entire branch of the WordNet tree, cleans up the formatting (replacing underscores with spaces), and returns them as a single set to avoid duplicates.

In [None]:
def get_words_in_hypo(synset):
    """
    Returns a set of words/phrases that comprise the lemmas of a synset and all its hyponyms.
    """
    # Initialize an empty set to store the words to avoid duplicates.
    words=set()
    
    # Get all hyponyms (descendants) of the input synset.
    hyponym_synsets=list(synset.closure(hypo))
    
    # Add the original synset itself to the list to include its own lemmas.
    hyponym_synsets.append(synset)
    
    # Iterate through the combined list of the original synset and its hyponyms.
    for s in hyponym_synsets:
        # For each synset, iterate through its lemmas.
        for l in s.lemmas():
            # Get the lemma's name (e.g., 'motor_car').
            word=l.name()
            # Replace underscores with spaces for readability (e.g., 'motor car').
            word=re.sub("_", " ", word)
            # Add the cleaned word to the set.
            words.add(word)
    
    # Return the final set of unique words.
    return words

### 10. Testing the Word Collection Function

This cell calls the function we just created on the `car.n.01` synset to demonstrate its output. The result is a comprehensive set of terms related to cars.

In [None]:
# Call the function to get all words for 'car.n.01' and its descendants.
get_words_in_hypo(wn.synset("car.n.01"))

### 11. Function to Find Word Occurrences in Text

The `find_all_words_in_text` function takes a set of target words and a text already processed by SpaCy. It iterates through every token and checks whether the token's **lemma** is in the target set. Matching on the lemma lets us catch inflected forms of a word (e.g., "cars" matches "car"). Because tokens are compared one at a time, multi-word entries such as "motor car" are never matched. The function returns a list of the indices of all matching tokens.

In [None]:
def find_all_words_in_text(words, spacy_tokens):
    """ 
    For a given set of words, find each instance among a list of tokens already
    processed by spacy. Returns a list of token indexes that match.
    (Note this only identifies single words, not multi-word phrases.)
    """
    # Initialize an empty list to store the indices of matched tokens.
    all_matches=[]
    
    # Iterate through the spacy tokens, getting both the index (idx) and the token object.
    for idx, token in enumerate(spacy_tokens):
        # Check if the token's lemma (its base form) is in our target set of words.
        if token.lemma_ in words:
            # If it's a match, add the token's index to our list.
            all_matches.append(idx)
            
    # Return the list of all found indices.
    return all_matches
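
The docstring above notes that only single words are matched. One possible way to extend the idea to multi-word entries is sketched below. The function `find_phrases_in_lemmas` is hypothetical (it is not part of the original notebook), and it works on a plain list of lemma strings so that it can be run without a SpaCy model.

```python
def find_phrases_in_lemmas(phrases, lemmas):
    """
    Return (start, end) token spans whose lemmas spell out one of `phrases`.
    `phrases` is a set of space-separated strings (like the output of
    get_words_in_hypo); `lemmas` is a list of lemma strings, one per token.
    """
    # Group phrases by their length in tokens.
    by_length = {}
    for phrase in phrases:
        parts = tuple(phrase.split())
        if parts:
            by_length.setdefault(len(parts), set()).add(parts)

    spans = []
    i = 0
    while i < len(lemmas):
        # Try the longest candidate phrases first at each position.
        for n in sorted(by_length, reverse=True):
            if tuple(lemmas[i:i + n]) in by_length[n]:
                spans.append((i, i + n))
                i += n  # skip past the matched phrase
                break
        else:
            i += 1  # nothing matched at this position
    return spans

lemmas = ["the", "station", "wagon", "pass", "a", "car"]
print(find_phrases_in_lemmas({"car", "station wagon"}, lemmas))  # [(1, 3), (5, 6)]
```

With the notebook's data, the lemma list could be built as `[token.lemma_ for token in spacy_tokens]`. Whether SpaCy's lemmas line up with WordNet's multi-word entries (for example, around hyphens or plural first words) is an assumption that would need checking.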

### 12. Function to Display Results (Concordance)

A **concordance** shows every occurrence of a word in its context. This `print_concordance` function takes the list of match indices and the SpaCy tokens. For each match, it prints the word, highlighted in red, along with a few words before and after it (a "window") to show how it was used.

In [None]:
def print_concordance(matches, spacy_tokens, window=3):
    """ 
    For a given set of token indexes, prints out a window of words around each match,
    in the style of a concordance.
    """
    # Define ANSI escape codes for terminal colors to highlight the matched word.
    # Despite its name, BLACK holds the reset code, which restores the default text style.
    RED="\x1b[31m"
    BLACK="\x1b[0m"
    
    # Calculate left-side padding for alignment based on the window size.
    spacing=window*10
    
    # Iterate through each match index found previously.
    for match in matches:
        # Calculate the start and end indices for the context window.
        start=match-window
        end=match+window+1
        
        # Ensure the start index is not less than 0.
        if start < 0:
            start=0
        # Ensure the end index does not exceed the total number of tokens.
        if end > len(spacy_tokens):
            end=len(spacy_tokens)
            
        # Create the string of text *before* the matched word.
        pre=' '.join([token.text for token in spacy_tokens[start:match]])
        # Create the string of text *after* the matched word.
        post=' '.join([token.text for token in spacy_tokens[match+1:end]])
        
        # Print the formatted concordance line with the matched word in red.
        print("%s %s%s%s %s" % (pre.rjust(spacing), RED, spacy_tokens[match].text, BLACK, post))

### 13. Function to Read Text from a File

This is a helper function to read a text file. It opens the file, reads its entire content, and uses a regular expression (`re.sub`) to replace any sequence of one or more whitespace characters (spaces, tabs, newlines) with a single space. This cleans and standardizes the text for processing.

In [None]:
def read_text(filename):
    """ 
    Read a text file, replacing all whitespace sequences with a single space.
    """
    # Open the file with UTF-8 encoding to handle a wide range of characters.
    with open(filename, encoding="utf-8") as file:
        # Read the file's content and replace all whitespace sequences with a single space.
        return re.sub(r"\s+", " ", file.read())

### 14. Loading the Text

Here, we use our `read_text` function to load the novel *Pride and Prejudice* from a local file into the `book` variable.

In [None]:
# Call the read_text function to load the specified text file.
# Note: The path "../data/pride_and_prejudice.txt" assumes a specific directory structure.
book=read_text("../data/pride_and_prejudice.txt")

### 15. Processing the Text with SpaCy

Now we pass the entire text of the book to our `nlp` object. SpaCy processes the text, performing tokenization and lemmatization, and stores the result as a sequence of tokens in the `spacy_tokens` variable. This is the most computationally intensive step.

In [None]:
# Process the book's text with the SpaCy nlp object.
# This creates a Doc object containing all the tokens and their linguistic features (like lemmas).
spacy_tokens=nlp(book)

### 16. The Main Search Function

This `wordnet_search` function ties everything together. It takes a target WordNet synset and the SpaCy-processed tokens as input.
1.  It calls `get_words_in_hypo` to build a set of all relevant words.
2.  It passes this set to `find_all_words_in_text` to find all matches in the book.
3.  Finally, it calls `print_concordance` to display the results in a user-friendly way.

In [None]:
def wordnet_search(synset, spacy_tokens):
    """ 
    This function searches through all tokens to find any mention of words
    in the given synset or any of its hyponyms.
    """
    # 1. Get all target words from the synset and its hyponyms.
    targets=get_words_in_hypo(synset)
    
    # 2. Find the indices of all tokens that match the target words.
    matches=find_all_words_in_text(targets, spacy_tokens)
    
    # 3. Print the matches in a concordance view.
    print_concordance(matches, spacy_tokens)

### Q1. Find all color terms in *Pride and Prejudice*

To solve this, we first find the appropriate synset for "color" (`color.n.01`) and then pass it to our `wordnet_search` function.

In [None]:
# Define the synset for color. 'color.n.01' is the primary noun sense.
color_synset = wn.synset('color.n.01')

# Call the search function with the color synset and the processed text.
wordnet_search(color_synset, spacy_tokens)

### Q2. Find all vehicles mentioned in *Pride and Prejudice*

We follow the same process for "vehicle". The synset `vehicle.n.01` represents conveyances that transport people or objects.

In [None]:
# Define the synset for vehicle. 'vehicle.n.01' is "a conveyance that transports people or objects".
vehicle_synset = wn.synset('vehicle.n.01')

# Call the search function with the vehicle synset.
wordnet_search(vehicle_synset, spacy_tokens)

### Q3. Find all verbs of speaking in *Pride and Prejudice*

Here, we look for verbs. The synset `talk.v.01` ("exchange thoughts; talk with") is a good starting point for finding verbs related to speaking.

In [None]:
# Define the synset for speaking verbs. 'talk.v.01' is a good general choice.
speak_synset = wn.synset('talk.v.01')

# Call the search function.
wordnet_search(speak_synset, spacy_tokens)

### Q4. Find all of the people in *Pride and Prejudice*

The synset `person.n.01` ("a human being") is the most appropriate choice for finding general references to people.

In [None]:
# Define the synset for a person. 'person.n.01' is "a human being".
person_synset = wn.synset('person.n.01')

# Call the search function.
wordnet_search(person_synset, spacy_tokens)

### Q5. How can we improve this method?

The current method has a major limitation: **word sense ambiguity**. It flags a word whenever its lemma is in the target set, regardless of what the word means in context. For example, if we searched for `bank.n.01` (the sloping land beside a river), the method would also incorrectly flag every mention of a financial `bank`.

We could improve the method in several ways:

1.  **Part-of-Speech (POS) Filtering**: We could re-enable SpaCy's POS tagger. If we are searching for a noun synset (like `bank.n.01`), we could filter out instances where the word "bank" is used as a verb (e.g., "to bank a shot"). This would reduce errors but not solve ambiguity between two noun senses.

2.  **Word Sense Disambiguation (WSD)**: The most direct improvement would be to use a WSD algorithm. After finding a potential match, a WSD system would analyze the surrounding words (the context) to determine which of the word's possible synsets is the most likely one. We would only count a match if the algorithm assigned it to our target synset (or one of its hyponyms). This is a complex task but provides much higher accuracy. For example, in the phrase "money in the bank," a WSD model would likely choose `bank.n.02` (the financial institution), whereas in "sat on the river bank," it would choose `bank.n.01` (sloping land).
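
The two ideas above can be sketched roughly as follows. This is an illustration under stated assumptions, not a definitive implementation: `WN_TO_SPACY_POS`, `filter_by_pos`, `simplified_lesk`, and `TOY_GLOSSES` are hypothetical names, the glosses are hand-written stand-ins rather than WordNet's exact definitions, and the tokens are simple stand-in objects so the example runs without a SpaCy model. In a real pipeline, `token.pos_` is only filled in when the tagger is enabled.

```python
from types import SimpleNamespace

# Assumed mapping from WordNet POS letters to SpaCy's coarse-grained tags.
WN_TO_SPACY_POS = {
    "n": {"NOUN", "PROPN"},
    "v": {"VERB", "AUX"},
    "a": {"ADJ"},
    "s": {"ADJ"},
    "r": {"ADV"},
}

def filter_by_pos(matches, tokens, wn_pos):
    """Keep only the match indices whose token.pos_ is compatible with wn_pos."""
    allowed = WN_TO_SPACY_POS[wn_pos]
    return [i for i in matches if tokens[i].pos_ in allowed]

def simplified_lesk(context_words, glosses):
    """
    Pick the sense whose gloss shares the most words with the context
    (a simplified, Lesk-style overlap count with no stop-word handling).
    `glosses` maps a sense label to its definition text.
    """
    context = {w.lower() for w in context_words}

    def overlap(sense):
        return len(context & set(glosses[sense].lower().split()))

    # On a tie, max() returns the first sense in the dictionary.
    return max(glosses, key=overlap)

# Hand-written glosses for illustration only.
TOY_GLOSSES = {
    "bank (land)": "sloping land beside a body of water such as a river",
    "bank (finance)": "an institution that accepts deposits of money and makes loans",
}

# Stand-in tokens: "bank" used once as a verb and once as a noun.
tokens = [
    SimpleNamespace(lemma_="bank", pos_="VERB"),
    SimpleNamespace(lemma_="bank", pos_="NOUN"),
]
print(filter_by_pos([0, 1], tokens, "n"))  # [1]
print(simplified_lesk("she put her money in the bank".split(), TOY_GLOSSES))  # bank (finance)
print(simplified_lesk("they sat on the river bank".split(), TOY_GLOSSES))  # bank (land)
```

NLTK also ships a Lesk implementation, `nltk.wsd.lesk`, which works directly with WordNet synsets. Overlap-based methods like this are a weak baseline, so the results would need to be checked against the text.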