<a href="https://colab.research.google.com/github/charlesincharge/Caltech-CS155-2022/blob/main/sets/set5/set5_prob3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Set 5
## 3. Word2Vec \*\*Principles**

## Problem A

The time complexity for a single pair would be O (W * D) because there are W operations (the denominator of softmax) that take D time (the number of derivatives to calculate for each dimension D).

## Problem C:
If D increases, training loss will decrease. However, a very large D may cause the model to overfit.

#### Preparation

In [12]:
# download the helper function
!wget -O P3CHelpers.py https://raw.githubusercontent.com/charlesincharge/Caltech-CS155-2022/main/sets/set5/P3CHelpers.py

--2024-08-05 11:30:22--  https://raw.githubusercontent.com/charlesincharge/Caltech-CS155-2022/main/sets/set5/P3CHelpers.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8002::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4939 (4.8K) [text/plain]
Saving to: ‘P3CHelpers.py’


2024-08-05 11:30:23 (923 KB/s) - ‘P3CHelpers.py’ saved [4939/4939]



In [13]:
# download the dataset
!wget -O dr_seuss.txt https://raw.githubusercontent.com/charlesincharge/Caltech-CS155-2022/main/sets/set5/data/dr_seuss.txt

--2024-08-05 11:30:25--  https://raw.githubusercontent.com/charlesincharge/Caltech-CS155-2022/main/sets/set5/data/dr_seuss.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8001::154, 2606:50c0:8002::154, 2606:50c0:8000::154, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8001::154|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8810 (8.6K) [text/plain]
Saving to: ‘dr_seuss.txt’


2024-08-05 11:30:25 (17.5 MB/s) - ‘dr_seuss.txt’ saved [8810/8810]



In [91]:
import numpy as np
from P3CHelpers import *
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

#### Problem D: 
Fill in the generate_traindata and find_most_similar_pairs functions.

In [129]:
def get_word_repr(word_to_index, word):
    """
    Returns one-hot-encoded feature representation of the specified word given
    a dictionary mapping words to their one-hot-encoded index.

    Arguments:
        word_to_index: Dictionary mapping words to their corresponding index
                       in a one-hot-encoded representation of our corpus.

        word:          Word whose feature representation we wish to compute.

    Returns:
        feature_representation:     Feature representation of the passed-in word.
    """
    unique_words = word_to_index.keys()
    # Return a vector that's zero everywhere besides the index corresponding to <word>
    feature_representation = np.zeros(len(unique_words))
    feature_representation[word_to_index[word]] = 1
    return feature_representation    

def generate_traindata(word_list, word_to_index, window_size=4):
    """
    Generates training data for Skipgram model.

    Arguments:
        word_list:     Sequential list of words (strings).
        word_to_index: Dictionary mapping words to their corresponding index
                       in a one-hot-encoded representation of our corpus.

        window_size:   Size of Skipgram window. Defaults to 2 
                       (use the default value when running your code).

    Returns:
        (trainX, trainY):     A pair of matrices (trainX, trainY) containing training 
                              points (one-hot-encoded vectors) and their corresponding output_word
                              (also one-hot-encoded vectors)

    """
    trainX = []
    trainY = []

    ##############################################################
    # TODO: Implement this function, populating trainX and trainY
    ##############################################################
    for i in range(len(word_list)):
      pX = get_word_repr(word_to_index, word_list[i])
      for j in range(max(0,i-window_size), min(len(word_list), i+window_size+1)):
        pY = get_word_repr(word_to_index, word_list[j])
        if i != j:
          trainX.append(pX)
          trainY.append(pY)
    return np.array(trainX), np.array(trainY)

In [136]:
def find_most_similar_pairs(filename, num_latent_factors):
    """
    Find the most similar pairs from the word embeddings computed from
    a body of text
    
    Arguments:
        filename:           Text file to read and train embeddings from
        num_latent_factors: The number of latent factors / the size of the embedding
    """
    # Load in a list of words from the specified file; remove non-alphanumeric characters
    # and make all chars lowercase.
    sample_text = load_word_list(filename)

    # Create word dictionary
    word_to_index = generate_onehot_dict(sample_text)
    print("Textfile contains %s unique words"%len(word_to_index))
    batch_size = 32
    # Create training data
    trainX, trainY = generate_traindata(sample_text, word_to_index)

    dataset = torch.from_numpy(np.array([trainX, trainY])).type(torch.FloatTensor)
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    ##############################################################
    # TODO: 1) Create and train model in Pytorch.      
    ##############################################################
    # vocab_size = number of unique words in our text file. Will be useful 
    # when adding layers to your neural network
    vocab_size = len(word_to_index)

    # initialize model
    model = nn.Sequential(
        nn.Linear(vocab_size, num_latent_factors),
        nn.Linear(num_latent_factors, vocab_size),
        nn.Softmax()
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=0.011)
    loss_fn = nn.CrossEntropyLoss() 
    # train model
    
    n_epochs = 30 
    for epoch in range(n_epochs):
        for batch_idx, (data, target) in enumerate(dataloader):
            # zero gradients
            optimizer.zero_grad()

            #forward pass
            output = model(data)

            # calculate loss
            loss = loss_fn(output,target)

            # backward pass
            loss.backward()

            # weight update
            optimizer.step()
        # print('Train Epoch: %d  Loss: %.4f' % (epoch + 1,  loss.item()))

    ##############################################################
    # TODO: 2) Extract weights for hidden layer
    ##############################################################
    
    # set weights variable below
    weights = model.get_parameter('1.weight').detach()
    
    # Find and print most similar pairs
    similar_pairs = most_similar_pairs(weights, word_to_index)
    for pair in similar_pairs[:30]:
        print(pair)

### Problem E-H:
Run your model on drseuss.txt and answer questions from E through H.

In [137]:
find_most_similar_pairs('dr_seuss.txt', 10)

Textfile contains 308 unique words
Pair(hills, cat), Similarity: 0.99971235
Pair(cat, hills), Similarity: 0.99971235
Pair(more, too), Similarity: 0.9996188
Pair(too, more), Similarity: 0.9996188
Pair(way, clark), Similarity: 0.9996135
Pair(clark, way), Similarity: 0.9996135
Pair(hot, too), Similarity: 0.9995866
Pair(one, of), Similarity: 0.99954945
Pair(of, one), Similarity: 0.99954945
Pair(they, one), Similarity: 0.9995408
Pair(grow, cat), Similarity: 0.9994992
Pair(fat, too), Similarity: 0.9994374
Pair(thin, too), Similarity: 0.9994347
Pair(wish, on), Similarity: 0.99941385
Pair(on, wish), Similarity: 0.99941385
Pair(fish, say), Similarity: 0.99939847
Pair(say, fish), Similarity: 0.99939847
Pair(dog, thin), Similarity: 0.999389
Pair(kite, milk), Similarity: 0.9993884
Pair(milk, kite), Similarity: 0.9993884
Pair(long, zeds), Similarity: 0.99938154
Pair(zeds, long), Similarity: 0.99938154
Pair(nine, way), Similarity: 0.9993658
Pair(many, hand), Similarity: 0.99933267
Pair(hand, many), 