### **Introduction**
This notebook explores distributional similarity. The core idea is that words that appear in similar contexts tend to have similar meanings. We will represent words as vectors based on the words that appear around them and then use these vectors to find similar words.

#### **Cell 1: Imports**
This cell imports the necessary Python libraries for our tasks.

In [None]:
# Import defaultdict (a dict that supplies a default value for missing keys)
# and Counter (a dict subclass for counting word frequencies)
from collections import defaultdict, Counter
# Import math for mathematical operations, like square root
import math
# Import operator for easily sorting dictionaries by value
import operator
# Import gzip for handling compressed files (though not used in this version)
import gzip

#### **Cell 2: Global Parameters**
Here, we define two key parameters that will control our analysis.

In [None]:
# 'window' defines the size of the context window around a target word.
# A value of 2 means we'll look at 2 words to the left and 2 words to the right.
window=2
# 'vocabSize' limits the target vocabulary to the 10,000 most frequent words in the text.
# This keeps computation manageable and skips rare words, whose context counts are too sparse to be reliable.
vocabSize=10000
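
To make the `window` parameter concrete, here is a minimal sketch on a made-up sentence (not taken from the dataset). The helper name `context_of` is hypothetical and only used for illustration; it applies the same clipping at the text boundaries that the counting functions later in the notebook use.

```python
# Hypothetical toy sentence; not part of the notebook's data.
tokens = "the quick brown fox jumps over the lazy dog".split()
window = 2  # same value as the notebook's global parameter

def context_of(tokens, i, window):
    """Return the tokens within `window` positions of index i, excluding tokens[i]."""
    start = max(0, i - window)                 # clip at the start of the text
    end = min(len(tokens), i + window + 1)     # clip at the end of the text
    return tokens[start:i] + tokens[i + 1:end]

print(context_of(tokens, 3, window))  # context of "fox": two words on each side
print(context_of(tokens, 0, window))  # first word: only right-hand context exists
```

Words near the beginning or end of the text simply get a shorter context.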

#### **Cell 3: Loading the Data**
This cell loads the text data from a file. The data is a collection of Wikipedia articles.

In [None]:
# Specify the path to the data file.
filename="../data/wiki.10K.txt"

# Open the file, read its entire content, convert all text to lowercase,
# and split it into a list of words on any whitespace (spaces, tabs, newlines).
# The 'with' block closes the file once it has been read.
with open(filename, encoding="utf-8") as f:
    wiki_data=f.read().lower().split()
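
The choice of separator affects which tokens we get. As a small illustration on a made-up string (an assumption for demonstration, not the contents of the data file), compare splitting on a single space with splitting on any whitespace:

```python
# Made-up text with a double space and a line break.
text = "The  actor\nwon an award"

by_space = text.lower().split(" ")   # splits on single space characters only
by_whitespace = text.lower().split() # splits on any run of whitespace

print(by_space)       # contains an empty token and a token spanning the newline
print(by_whitespace)  # five clean word tokens
```

Splitting on `" "` keeps the newline inside a token (`'actor\nwon'`) and produces an empty string for the double space, while `split()` with no argument avoids both.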

#### **Cell 4: Creating the Vocabulary**
This function identifies the most frequent words in our dataset and prepares a data structure to store their contextual information.

In [None]:
# We'll only create word representations for the most frequent K words (where K=vocabSize).

def create_vocab(data):
    # This will store our final word representations.
    word_representations={}
    # Use Counter to count the frequency of each word in the data in a single pass.
    vocab=Counter(data)

    # Get a list of the most common words, limited by vocabSize.
    topK=[k for k,v in vocab.most_common(vocabSize)]
    # For each of the top K words, initialize its representation as a defaultdict.
    # A defaultdict(float) will return 0.0 for any context word not yet seen.
    for k in topK:
        word_representations[k]=defaultdict(float)
    # Return the dictionary that will hold the context vectors for our vocabulary words.
    return word_representations
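
As a quick sanity check of the two building blocks used here, the sketch below runs `Counter.most_common` and `defaultdict(float)` on a made-up token list. The names `toy_data`, `top2`, and `reps` are hypothetical.

```python
from collections import Counter, defaultdict

toy_data = "a b a c a b d".split()   # made-up tokens for illustration

counts = Counter(toy_data)
top2 = [w for w, _ in counts.most_common(2)]   # the two most frequent words
reps = {w: defaultdict(float) for w in top2}   # one empty context vector per word

print(top2)
print(reps["a"]["unseen"])   # a missing context defaults to 0.0 ...
print(len(reps["a"]))        # ... but the lookup itself stores the key
```

Note the last line: reading a missing key from a `defaultdict` inserts it, which is why membership tests (`key in d`) are the safer way to probe a context vector without growing it.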

#### **Cell 5: Counting Unigram Context**
This function populates the `word_representations` dictionary. For each word in our vocabulary, it counts the individual words (unigrams) that appear within its context window.

In [None]:
# A word's representation is its unigram distributional context (the individual words
# that appear in a window before and after its occurrences).

def count_unigram_context(data, word_representations):
    # Iterate through each word and its index in the dataset.
    for i, word in enumerate(data):
        # If the word is not in our top K vocabulary, skip it.
        if word not in word_representations:
            continue
        # Define the start of the context window, ensuring it doesn't go below 0.
        start=max(0, i-window)
        # Define the end of the context window, ensuring it doesn't exceed the data length.
        end=min(len(data), i+window+1)
        # Iterate through the words within the context window.
        for j in range(start, end):
            # Make sure we don't count the target word itself as part of its own context.
            if i != j:
                # Increment the count for the context word (data[j]) in the representation of the target word (word).
                word_representations[word][data[j]]+=1
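
To see what these counts look like, here is a self-contained sketch of the same idea on a made-up sentence. The function `toy_unigram_context` is a hypothetical stand-in that takes the vocabulary and window as arguments instead of relying on globals.

```python
from collections import defaultdict

def toy_unigram_context(tokens, vocab, window=2):
    """Count, for each vocabulary word, the words seen within `window` positions."""
    reps = {w: defaultdict(float) for w in vocab}
    for i, word in enumerate(tokens):
        if word not in reps:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:                       # skip the target word itself
                reps[word][tokens[j]] += 1
    return reps

tokens = "the actor won the award".split()   # made-up example sentence
reps = toy_unigram_context(tokens, vocab={"actor"})
print(dict(reps["actor"]))
```

For "actor" the window covers "the" (left), "won" and "the" (right), so "the" is counted twice and "won" once. Position and direction are discarded.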

#### **Cell 6: Counting Directional Context (Alternative Method)**
This is an alternative way to define "context." Instead of counting individual words, it treats the whole word sequence to the left of the target as one feature and the whole sequence to the right as another, so each occurrence contributes exactly two features.

In [None]:
# This function defines context differently, by preserving the order and direction of surrounding words.

def count_directional_context(data, word_representations):
    # Iterate through each word and its index in the dataset.
    for i, word in enumerate(data):
        # If the word is not in our top K vocabulary, skip it.
        if word not in word_representations:
            continue
        # Define the start and end of the context window, clipped to the bounds of the data.
        start=max(0, i-window)
        end=min(len(data), i+window+1)
        # Create a string for the left context, prefixed with "L:".
        left="L: %s" % ' '.join(data[start:i])
        # Create a string for the right context, prefixed with "R:".
        right="R: %s" % ' '.join(data[i+1:end])
        
        # Increment the count for the left context string.
        word_representations[word][left]+=1
        # Increment the count for the right context string.
        word_representations[word][right]+=1
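
The sketch below applies the same directional idea to a made-up sentence in which "actor" occurs twice. `toy_directional_context` is a hypothetical, self-contained variant written for this illustration.

```python
from collections import defaultdict

def toy_directional_context(tokens, vocab, window=2):
    """Count whole left and right context sequences as single features."""
    reps = {w: defaultdict(float) for w in vocab}
    for i, word in enumerate(tokens):
        if word not in reps:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        reps[word]["L: " + " ".join(tokens[start:i])] += 1
        reps[word]["R: " + " ".join(tokens[i + 1:end])] += 1
    return reps

# Made-up sentence; both occurrences of "actor" are followed by "won".
tokens = "the famous actor won the award and the young actor won an oscar".split()
reps = toy_directional_context(tokens, vocab={"actor"})
for feature, count in reps["actor"].items():
    print(feature, count)
```

Even though both occurrences share the neighbors "the" and "won", the four features are all distinct, each with a count of 1. This shows how sequence features preserve order but become sparse quickly.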

#### **Cell 7: Normalizing Vectors**
To calculate cosine similarity efficiently, we first need to normalize each word's context vector so that its length (L2 norm) is 1.

In [None]:
# Normalize a word representation vector so that its L2 norm is 1.
# We do this so that the cosine similarity calculation reduces to a simple dot product.

def normalize(word_representations):
    # Iterate through each word in our vocabulary.
    for word in word_representations:
        # Initialize a variable to hold the sum of squares.
        total=0
        # Iterate through the context words and their counts for the current word.
        for key in word_representations[word]:
            # Add the square of the count to the total.
            total+=word_representations[word][key]*word_representations[word][key]
            
        # The L2 norm is the square root of the sum of squares.
        total=math.sqrt(total)
        # If the total is zero (word never appeared with context), skip division.
        if total == 0: continue
        # Iterate through the context words again to perform the normalization.
        for key in word_representations[word]:
            # Divide each context count by the L2 norm.
            word_representations[word][key]/=total
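
As a small worked example of L2 normalization, consider a made-up context vector with counts 3 and 4. The variable names are hypothetical.

```python
import math

vec = {"the": 3.0, "won": 4.0}                        # made-up raw counts

norm = math.sqrt(sum(v * v for v in vec.values()))    # sqrt(9 + 16)
unit = {k: v / norm for k, v in vec.items()}          # divide every count by the norm

print(norm)
print(unit)
print(math.sqrt(sum(v * v for v in unit.values())))   # length after normalization
```

The norm is 5, so the normalized weights are 0.6 and 0.8, and the resulting vector has length 1 (up to floating-point rounding).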

#### **Cell 8: Calculating the Dot Product**
This function calculates the dot product between two word vectors. Since the vectors are stored as dictionaries (sparse representation), we only need to consider the context words they have in common.

In [None]:
# This function calculates the dot product between two dictionaries.
def dictionary_dot_product(dict1, dict2):
    # Initialize the dot product score to 0.
    dot=0
    # Iterate through the context words (keys) in the first dictionary.
    for key in dict1:
        # If the same context word exists in the second dictionary...
        if key in dict2:
            # ...multiply their corresponding values and add to the dot product.
            dot+=dict1[key]*dict2[key]
    # Return the final dot product score.
    return dot
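
One possible refinement, sketched here as an assumption rather than as part of the notebook's code, is to iterate over the smaller of the two dictionaries, since only shared keys can contribute. `sparse_dot` is a hypothetical name, and the vectors are made up.

```python
def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a                  # loop over the shorter vector
    return sum(v * b[k] for k, v in a.items() if k in b)

u = {"the": 0.6, "won": 0.8}
v = {"won": 0.5, "award": 0.5, "an": 0.5, "oscar": 0.5}

print(sparse_dot(u, v))   # only "won" is shared: 0.8 * 0.5
print(sparse_dot(v, u))   # same value; the dot product is symmetric
```

The result is the same as with the function above; the swap only reduces the number of lookups when one vector is much shorter than the other.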

#### **Cell 9: Finding Similarity Scores**
This function calculates the cosine similarity between a given `query` word and every word in our vocabulary, including the query itself (which scores 1.0 against its own vector). Since we normalized our vectors, each cosine is simply a dot product.

In [None]:
# This function finds the similarity between a query word and every word in the vocabulary.
def find_sim(word_representations, query):
    # Check if the query word is in our vocabulary.
    if query not in word_representations:
        print("'%s' is not in vocabulary" % query)
        return None
    
    # Initialize a dictionary to store similarity scores.
    scores={}
    # Iterate through every word in our vocabulary.
    for word in word_representations:
        # Calculate the cosine similarity (dot product of normalized vectors).
        cosine=dictionary_dot_product(word_representations[query], word_representations[word])
        # Store the similarity score.
        scores[word]=cosine
    # Return the dictionary of scores.
    return scores
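
To check the claim that the dot product of normalized vectors equals cosine similarity, the sketch below computes both on two made-up count vectors. All helper names (`dot`, `l2_norm`, `cosine`, `unit`) are hypothetical.

```python
import math

def dot(a, b):
    return sum(a[k] * b[k] for k in a if k in b)

def l2_norm(a):
    return math.sqrt(sum(x * x for x in a.values()))

def cosine(a, b):
    """Full cosine formula: dot product divided by the product of the norms."""
    denom = l2_norm(a) * l2_norm(b)
    return dot(a, b) / denom if denom else 0.0

def unit(a):
    """Return a copy of `a` scaled to length 1 (unchanged if it is all zeros)."""
    n = l2_norm(a)
    return {k: v / n for k, v in a.items()} if n else dict(a)

a = {"the": 2.0, "won": 1.0}    # made-up raw counts
b = {"the": 1.0, "lost": 1.0}

full = cosine(a, b)                   # cosine on raw counts
shortcut = dot(unit(a), unit(b))      # dot product after normalizing
self_sim = dot(unit(a), unit(a))      # a vector compared with itself

print(full, shortcut, self_sim)
```

The first two values agree (up to rounding), and the self-similarity is 1, which is why a query word always matches itself perfectly.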

#### **Cell 10: Finding Nearest Neighbors**
This function uses the similarity scores to find and display the `K` words that are most similar to the `query` word.

In [None]:
# Find the K words with the highest cosine similarity to a query.

def find_nearest_neighbors(word_representations, query, K):
    # Get the similarity scores for the query word against every vocabulary word (the query included).
    scores=find_sim(word_representations, query)
    # If scores were successfully calculated...
    if scores is not None:
        # Sort the words by their similarity score in descending order.
        # operator.itemgetter(1) sorts by the dictionary's values.
        sorted_x = sorted(scores.items(), key=operator.itemgetter(1), reverse=True)
        # Iterate through the top K items in the sorted list.
        for idx, (k, v) in enumerate(sorted_x[:K]):
            # Print the rank, word, and similarity score, formatted neatly.
            print("%s\t%s\t%.5f" % (idx,k,v))
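
If only the top `K` entries are needed, a full sort is not strictly required. The following sketch, offered as one alternative and not as the notebook's method, uses `heapq.nlargest` and also filters out the query itself. The scores are invented numbers for illustration only.

```python
import heapq

# Invented similarity scores; not results from the dataset.
scores = {"actor": 1.0, "actress": 0.91, "director": 0.85,
          "comedian": 0.88, "river": 0.12}
query = "actor"

# Keep the 2 highest-scoring words other than the query.
top = heapq.nlargest(
    2,
    ((w, s) for w, s in scores.items() if w != query),
    key=lambda pair: pair[1],
)

for rank, (word, score) in enumerate(top):
    print("%s\t%s\t%.5f" % (rank, word, score))
```

For a vocabulary of 10,000 words the difference in speed is small, so sorting as above is a perfectly reasonable choice; excluding the query is the more visible change.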

#### **Cell 11: Comparing Context Methods**
The following cells use `count_unigram_context`, which treats each word in the context window as a separate feature (a "bag-of-words" approach) and ignores position. In contrast, `count_directional_context` treats the entire left or right sequence as a single feature, preserving word order and direction at the cost of much sparser vectors. We will proceed with the unigram method.

#### **Cell 12: Building the Word Representations**
Now, we execute the functions to build our word representations. This involves three steps:
1.  `create_vocab`: Identify the top 10,000 words.
2.  `count_unigram_context`: Count their surrounding context words.
3.  `normalize`: Normalize the resulting context vectors.

In [None]:
# 1. Create the vocabulary and initialize the representation dictionary.
word_representations=create_vocab(wiki_data)
# 2. Populate the dictionary with unigram context counts.
count_unigram_context(wiki_data, word_representations)
# 3. Normalize the context vectors to have a length of 1.
normalize(word_representations)
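
The three steps can also be traced end to end on a tiny made-up corpus. The sketch below is a compact, self-contained variant under stated assumptions: `build` is a hypothetical helper that folds vocabulary creation, unigram counting, and normalization into one function and takes its parameters as arguments.

```python
import math
from collections import Counter, defaultdict

def build(tokens, vocab_size, window):
    """Vocabulary -> unigram context counts -> L2-normalized vectors."""
    vocab = [w for w, _ in Counter(tokens).most_common(vocab_size)]
    reps = {w: defaultdict(float) for w in vocab}
    for i, word in enumerate(tokens):
        if word not in reps:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                reps[word][tokens[j]] += 1
    for vec in reps.values():
        norm = math.sqrt(sum(x * x for x in vec.values()))
        if norm:                      # leave all-zero vectors untouched
            for key in vec:
                vec[key] /= norm
    return reps

def dot(a, b):
    return sum(a[k] * b[k] for k in a if k in b)

# Made-up corpus in which "cat" and "dog" occur in similar contexts.
tokens = "the cat sat on the mat the dog sat on the rug".split()
reps = build(tokens, vocab_size=10, window=2)

cat_dog = dot(reps["cat"], reps["dog"])
print(round(cat_dog, 3))
```

Here "cat" is surrounded by {the, sat, on} and "dog" by {mat, the, sat, on}, so they share three of their context words and receive a high similarity.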

#### **Cell 13: Finding Similar Words for "actor"**
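Let's test our model by finding the 10 highest-scoring words for "actor". Because the query is also compared with itself, "actor" should appear at rank 0 with a score of 1.0, leaving nine other words.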

In [None]:
# Find and print the 10 nearest neighbors to the word "actor".
find_nearest_neighbors(word_representations, "actor", 10)

#### **Cell 14: Finding Shared Contexts**
This function helps us understand *why* two words are considered similar. It identifies the specific context words that contribute the most to their high cosine similarity score.

In [None]:
# Find the contexts shared between two words that contribute most to their similarity score.

def find_shared_contexts(word_representations, query1, query2, K):
    # Check if the first query word is in our vocabulary.
    if query1 not in word_representations:
        print("'%s' is not in vocabulary" % query1)
        return None
    
    # Check if the second query word is in our vocabulary.
    if query2 not in word_representations:
        print("'%s' is not in vocabulary" % query2)
        return None
    
    # Initialize a dictionary to store the contribution score of each shared context.
    context_scores={}
    # Get the vector for the first word.
    dict1=word_representations[query1]
    # Get the vector for the second word.
    dict2=word_representations[query2]
    
    # Iterate through the context words of the first query.
    for key in dict1:
        # If the context word also appears in the second query's context...
        if key in dict2:
            # The contribution is the product of their normalized weights.
            score=dict1[key]*dict2[key]
            # Store this contribution score.
            context_scores[key]=score

    # Sort the shared contexts by their contribution score in descending order.
    sorted_x = sorted(context_scores.items(), key=operator.itemgetter(1), reverse=True)
    # Iterate through the top K shared contexts.
    for idx, (k, v) in enumerate(sorted_x[:K]):
        # Print the rank, context word, and its contribution score.
        print("%s\t%s\t%.5f" % (idx,k,v))
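
To illustrate how individual contexts add up to the similarity score, here is a minimal sketch on two made-up unit-length vectors. The weights and the helper name `shared_contexts` are assumptions chosen so the arithmetic is easy to follow.

```python
def shared_contexts(a, b, k):
    """Return the k shared keys with the largest product of weights."""
    contrib = {key: a[key] * b[key] for key in a if key in b}
    return sorted(contrib.items(), key=lambda kv: kv[1], reverse=True)[:k]

# Made-up vectors, each with L2 norm 1 (4/9 + 4/9 + 1/9 = 1).
vec1 = {"the": 2 / 3, "in": 2 / 3, "film": 1 / 3}
vec2 = {"the": 2 / 3, "in": 1 / 3, "election": 2 / 3}

top = shared_contexts(vec1, vec2, k=10)
for rank, (context, score) in enumerate(top):
    print("%s\t%s\t%.5f" % (rank, context, score))

total = sum(score for _, score in top)   # equals the cosine when k covers all shared keys
print(total)
```

The per-context contributions sum to the cosine similarity of the two vectors. With raw counts, very frequent context words such as "the" can account for most of that sum, which is worth keeping in mind when reading the output for real words.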

#### **Cell 15: Analyzing Similarity between "actor" and "politician"**
Now, let's find the top 10 shared contexts that make "actor" and "politician" similar in this dataset. This makes the model's reasoning interpretable.

In [None]:
# Find and print the top 10 shared contexts between "actor" and "politician".
find_shared_contexts(word_representations, "actor", "politician", 10)
