<a href="https://colab.research.google.com/github/alexmijo/6.806-Homeworks/blob/main/6864_hw1_part_1_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import csv
import itertools as it
import numpy as np
import sklearn.decomposition
np.random.seed(0)
from tqdm import tqdm

import lab_util

# Introduction

In this notebook, you will find code scaffolding for the word representation parts of Homework 1 (matrix factorization and Word2Vec-style language modeling; code for the HMM section of the assignment is released in another notebook). There are certain parts of the scaffolding marked with `# Your code here!` comments where you can fill in code to perform the specified tasks. After implementing the methods in this notebook, you will need to design and perform experiments to evaluate each method and respond to the questions in the Homework 1 handout (available on Canvas). You should be able to complete this assignment without changing any of the scaffolding code, just writing code to fill in the scaffolding and run experiments.

## Dataset

We're going to be working with a dataset of product reviews. The following cell loads the dataset and splits it into training, validation, and test sets.

In [None]:
data = []
n_positive = 0
n_disp = 0
with open("reviews.csv") as reader:
  csvreader = csv.reader(reader)
  next(csvreader)
  for id, review, label in csvreader:
    label = int(label)

    # hacky class balancing
    if label == 1:
      if n_positive == 2000:
        continue
      n_positive += 1
    if len(data) == 4000:
      break

    data.append((review, label))
    
    if n_disp > 5:
      continue
    n_disp += 1
    print("review:", review)
    print("rating:", label, "(good)" if label == 1 else "(bad)")
    print()

print(f"Read {len(data)} total reviews.")
np.random.shuffle(data)
reviews, labels = zip(*data)
train_reviews = reviews[:3000]
train_labels = labels[:3000]
val_reviews = reviews[3000:3500]
val_labels = labels[3000:3500]
test_reviews = reviews[3500:]
test_labels = labels[3500:]

review: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
rating: 1 (good)

review: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
rating: 0 (bad)

review: This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother an

# Part 1: word representations via matrix factorization

First, we'll construct the term-document matrix (look at `/content/hw1/lab_util.py` in the file browser on the left if you want to see how this works).

In [None]:
vectorizer = lab_util.CountVectorizer()
vectorizer.fit(train_reviews)
td_matrix = vectorizer.transform(train_reviews).T
print(f"TD matrix is {td_matrix.shape[0]} x {td_matrix.shape[1]}")

TD matrix is 2006 x 3000


Next, implement the function `learn_reps_lsa`, which computes word representations via latent semantic analysis. The `sklearn.decomposition` or `np.linalg` packages may be useful.

In [None]:
import sklearn.decomposition
def learn_reps_lsa(matrix, rep_size):
    # `matrix` is a `|V| x n` matrix, where `|V|` is the number of words in the
    # vocabulary. This function should return a `|V| x rep_size` matrix with each
    # row corresponding to a word representation.

    svd = sklearn.decomposition.TruncatedSVD(n_components=rep_size)
    return svd.fit_transform(matrix)

#### Sanity check 1
The following cell contains a simple sanity check for your `learn_reps_lsa` implementation: it should print `True` if your `learn_reps_lsa` function is implemented equivalently to one of our solutions. There are at least two reasonable ways to formulate these LSA word representations: you can use the left singular vectors of `matrix` directly, or scale them by the singular values. These correspond to the two possible representations in the sanity check below.

In [None]:
DEBUG_sc1_matrix = np.array([[1,0,0,2,1,3,5],
                             [2,0,0,0,0,4,0],
                             [0,3,4,1,8,6,6],
                             [1,4,5,0,0,0,0]])

DEBUG_reps = learn_reps_lsa(DEBUG_sc1_matrix, 3)
DEBUG_gt1 = np.array([[ -4.92017554,  -2.85465774,   1.18575453],
                      [ -2.14977584,  -1.19987977,   3.37221899],
                      [-12.62664695,   0.10890093,  -1.32131745],
                      [ -2.69216011,   5.66453534,   1.33728063]])
DEBUG_gt2 = np.array([[-0.35188159, -0.44213061,  0.29358929],
                      [-0.15374788, -0.18583789,  0.83495136],
                      [-0.90303377,  0.01686662, -0.32715426],
                      [-0.19253817,  0.87732566,  0.3311067 ]])

print(np.allclose(np.abs(DEBUG_reps), np.abs(DEBUG_gt1)) or np.allclose(np.abs(DEBUG_reps), np.abs(DEBUG_gt2)))

True
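
The two formulations mentioned above can be sketched directly with `np.linalg.svd`. This is a minimal illustration rather than the course solution; the helper name `lsa_reps_svd` and its `scale_by_singular_values` flag are hypothetical and do not appear elsewhere in the notebook.

```python
import numpy as np

def lsa_reps_svd(matrix, rep_size, scale_by_singular_values=True):
    """Hypothetical helper: LSA word vectors from a |V| x n count matrix."""
    # Thin SVD: matrix = U @ diag(S) @ Vt, singular values sorted descending.
    U, S, Vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    U_k, S_k = U[:, :rep_size], S[:rep_size]
    # Either scale each left singular vector by its singular value, or not.
    return U_k * S_k if scale_by_singular_values else U_k

toy_matrix = np.array([[1, 0, 0, 2, 1, 3, 5],
                       [2, 0, 0, 0, 0, 4, 0],
                       [0, 3, 4, 1, 8, 6, 6],
                       [1, 4, 5, 0, 0, 0, 0]])
scaled = lsa_reps_svd(toy_matrix, 3, scale_by_singular_values=True)
unscaled = lsa_reps_svd(toy_matrix, 3, scale_by_singular_values=False)
print(scaled.shape, unscaled.shape)
```

Singular vectors are only defined up to sign, which is why the sanity check compares absolute values.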


Let's look at some representations:

In [None]:
reps = learn_reps_lsa(td_matrix, 500)
words = ["good", "bad", "cookie", "jelly", "dog", "the", "3"]
show_tokens = [vectorizer.tokenizer.word_to_token[word] for word in words]
lab_util.show_similar_words(vectorizer.tokenizer, reps, show_tokens)

good 47
  . 1.056
  a 1.101
  but 1.121
  , 1.152
  the 1.157
bad 201
  . 1.396
  taste 1.416
  but 1.434
  a 1.435
  i 1.449
cookie 504
  nana's 0.791
  cookies 1.035
  oreos 1.294
  bars 1.352
  bites 1.385
jelly 351
  twist 1.143
  cardboard 1.249
  advertised 1.364
  peanuts 1.421
  plastic 1.422
dog 925
  food 1.048
  pet 1.065
  pets 1.067
  switched 1.207
  foods 1.229
the 36
  . 0.331
  <unk> 0.366
  of 0.395
  and 0.403
  to 0.422
3 289
  8 1.215
  . 1.239
  the 1.271
  to 1.272
  <unk> 1.279


We've been operating on the raw count matrix, but in class we discussed several reweighting schemes aimed at making LSA representations more informative. 

Here, implement the TF-IDF transform and see how it affects learned representations.

In [None]:
def transform_tfidf(matrix):
    # `matrix` is a `|V| x |D|` matrix of raw counts, where `|V|` is the 
    # vocabulary size and `|D|` is the number of documents in the corpus. This
    # function should (nondestructively) return a version of `matrix` with the
    # TF-IDF transform applied.

    df = np.count_nonzero(matrix, axis=1)
    num_docs = matrix.shape[1]
    idf = np.log(num_docs / df)
    return matrix * idf[:,None]

#### Sanity check 2
The following cell should print `True` if your `transform_tfidf` function is implemented properly. (*Hint: in our implementation, we use the natural logarithm (base $e$) when computing inverse document frequency.*)

In [None]:
DEBUG_sc2_matrix = np.array([[3,1,0,3,0],
                             [0,2,0,0,1],
                             [7,8,2,0,1],
                             [1,9,8,1,0]])
DEBUG_gt = np.array([[1.53247687, 0.51082562, 0.        , 1.53247687, 0.        ],
                     [0.        , 1.83258146, 0.        , 0.        , 0.91629073],
                     [1.56200486, 1.78514841, 0.4462871 , 0.        , 0.22314355],
                     [0.22314355, 2.00829196, 1.78514841, 0.22314355, 0.        ]])
print(np.allclose(transform_tfidf(DEBUG_sc2_matrix), DEBUG_gt))

True
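
As a worked instance of the weighting the sanity check expects (raw count times the natural log of $|D|$ over document frequency), the sketch below recomputes the first row of `DEBUG_gt` by hand. The variable names are illustrative only.

```python
import numpy as np

# First row of the sanity-check matrix: one word's counts across 5 documents.
counts = np.array([3, 1, 0, 3, 0])
num_docs = counts.size                  # |D| = 5
doc_freq = np.count_nonzero(counts)     # the word appears in 3 documents
idf = np.log(num_docs / doc_freq)       # natural log, as in the hint
tfidf_row = counts * idf
print(np.round(tfidf_row, 4))
```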


How does this change the learned similarity function?


In [None]:
td_matrix_tfidf = transform_tfidf(td_matrix)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
# reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 100)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

good 47
  . 0.980
  but 1.014
  a 1.032
  and 1.086
  is 1.091
bad 201
  . 1.330
  taste 1.339
  but 1.355
  a 1.371
  not 1.381
cookie 504
  nana's 0.810
  cookies 1.159
  bars 1.435
  bites 1.449
  moist 1.452
jelly 351
  twist 1.088
  cardboard 1.230
  advertised 1.361
  plum 1.493
  sold 1.538
dog 925
  food 1.031
  pets 1.096
  pet 1.102
  foods 1.186
  switched 1.255
the 36
  . 0.212
  and 0.270
  <unk> 0.292
  of 0.300
  to 0.322
3 289
  8 1.146
  . 1.174
  to 1.216
  the 1.217
  of 1.229


Now that we have some representations, let's see if we can do something useful with them.

Below, implement a feature function that represents a document as the sum of its learned word embeddings.

The remaining code trains a logistic regression model on a set of *labeled* reviews; we're interested in seeing how much representations learned from *unlabeled* reviews improve classification.

In [None]:
import sklearn.linear_model

def word_featurizer(xs):
    # normalize
    return xs / np.sqrt((xs ** 2).sum(axis=1, keepdims=True))

def lsa_featurizer(xs):
    # This function takes in a matrix in which each row contains the word counts
    # for the given review. It should return a matrix in which each row contains
    # the learned feature representation of each review (e.g. the sum of LSA 
    # word representations).

    feats = xs @ reps_tfidf

    # normalize
    return feats / np.sqrt((feats ** 2).sum(axis=1, keepdims=True))

# We've implemented the remainder of the training and evaluation pipeline,
# so you likely won't need to modify the following four functions.
def combo_featurizer(xs):
    return np.concatenate((word_featurizer(xs), lsa_featurizer(xs)), axis=1)

def train_model(featurizer, xs, ys):
    xs_featurized = featurizer(xs)
    model = sklearn.linear_model.LogisticRegression()
    model.fit(xs_featurized, ys)
    return model

def eval_model(model, featurizer, xs, ys):
    xs_featurized = featurizer(xs)
    pred_ys = model.predict(xs_featurized)
    return np.mean(pred_ys == ys)

def training_experiment(name, featurizer, n_train):
    print(f"{name} features, {n_train} examples")
    train_xs = vectorizer.transform(train_reviews[:n_train])
    train_ys = train_labels[:n_train]
    test_xs = vectorizer.transform(test_reviews)
    test_ys = test_labels
    model = train_model(featurizer, train_xs, train_ys)
    acc = eval_model(model, featurizer, test_xs, test_ys)
    print(acc, '\n')
    return acc

# The following four lines will run a training experiment with all 3k examples
# in the training set for each feature type. `training_experiment` may be useful
# to you when performing experiments to answer questions in Part 1 of the
# Homework 1 handout.
n_train = 3000
training_experiment("word", word_featurizer, n_train)
#training_experiment("lsa", lsa_featurizer, n_train)
#training_experiment("combo", combo_featurizer, n_train)
print()

word features, 3000 examples
0.784 
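
The featurization in `lsa_featurizer` relies on the fact that multiplying a row of word counts by the embedding matrix gives the count-weighted sum of the word embeddings. A toy sketch, with made-up two-dimensional embeddings that are purely illustrative:

```python
import numpy as np

# Hypothetical embeddings for a 3-word vocabulary (one row per word).
toy_reps = np.array([[1.0, 0.0],
                     [0.0, 2.0],
                     [3.0, 3.0]])
# One review: word 0 appears twice, word 2 appears once.
toy_counts = np.array([[2, 0, 1]])

doc_vector = toy_counts @ toy_reps                 # count-weighted sum of rows
manual_sum = 2 * toy_reps[0] + 1 * toy_reps[2]     # the same sum, written out

# L2-normalize each row, as the featurizers above do.
unit_vector = doc_vector / np.sqrt((doc_vector ** 2).sum(axis=1, keepdims=True))
print(doc_vector, unit_vector)
```

Note that a review containing no in-vocabulary words would produce an all-zero row, and the normalization would then divide by zero.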




**Part 1: Lab writeup**

Part 1 of your lab report should discuss any implementation details that were important to filling out the code above, as well as your answers to the questions in Part 1 of the Homework 1 handout. Below, you can set up and perform experiments that answer these questions (include figures, plots, and tables in your write-up as you see fit).

## Experiments for Part 1

In [None]:
# Part 1 (a)
td_lsv = learn_reps_lsa(td_matrix, 10)
print(td_lsv[:3,:])
print('\n')
tt_lsv = learn_reps_lsa(td_matrix @ td_matrix.T, 10)
print(tt_lsv[:3,:]) 
print('\nNow normalized:')

td_lsv = td_lsv / np.sqrt((td_lsv ** 2).sum(axis=0, keepdims=True))
print(td_lsv[:3,:])
print('\n')
tt_lsv = tt_lsv / np.sqrt((tt_lsv ** 2).sum(axis=0, keepdims=True))
print(tt_lsv[:3,:]) # Should be the same if the same left singular vectors are found
print('\n 100:')


# Part 1 (b)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 100)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)
print('\n 50:')
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 50)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)
print('\n 5:')
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 5)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

[[ 6.74359474e+02  1.74027683e+02 -3.14474899e+01  1.07754108e+01
  -1.49104700e+01 -2.94069195e+00 -4.86052070e+00 -5.19135827e+00
   2.19425445e+00 -3.14353834e+00]
 [ 2.26622823e+01 -1.34838991e+01  2.13550356e+00  3.90265580e+00
   2.96163529e-01 -5.20182054e+00 -2.21451355e+01  1.60523913e+01
  -4.60902727e+00  6.19446675e+00]
 [ 3.75139816e+01 -1.11094699e+01  3.41700253e+00  5.88693479e+00
   4.30552121e+00 -1.07335866e+01 -1.37645600e+01  2.42311690e+01
  -1.29253544e+01  6.01384793e+00]]


[[ 6.59999452e+05  4.50429141e+04 -5.54306904e+03  1.66797496e+03
  -1.95251498e+03 -3.70409255e+02 -5.40726037e+02 -5.32731088e+02
   2.06052009e+02 -2.53778049e+02]
 [ 2.21797045e+04 -3.48998552e+03  3.76412122e+02  6.04108207e+02
   3.87996478e+01 -6.55201941e+02 -2.46344567e+03  1.64785311e+03
  -4.26443465e+02  5.07736097e+02]
 [ 3.67151470e+04 -2.87542107e+03  6.02293611e+02  9.11248679e+02
   5.63820814e+02 -1.35192416e+03 -1.53195564e+03  2.48799934e+03
  -1.19984483e+03  5.06195581e
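
The comparison in Part 1 (a) rests on a standard linear algebra fact: a matrix $W$ and the product $W W^\top$ share left singular vectors (up to sign), and the singular values of $W W^\top$ are the squares of those of $W$. The sketch below checks this on the small matrix from sanity check 1 using `np.linalg.svd`; it is an illustration under that assumption, not a replacement for the experiment above.

```python
import numpy as np

W = np.array([[1, 0, 0, 2, 1, 3, 5],
              [2, 0, 0, 0, 0, 4, 0],
              [0, 3, 4, 1, 8, 6, 6],
              [1, 4, 5, 0, 0, 0, 0]], dtype=float)

U_w, S_w, _ = np.linalg.svd(W, full_matrices=False)   # SVD of the term-document matrix
U_g, S_g, _ = np.linalg.svd(W @ W.T)                  # SVD of the term-term matrix

k = 3
# Singular values of W W^T are the squared singular values of W.
values_match = np.allclose(S_g[:k], S_w[:k] ** 2)
# Left singular vectors agree up to a sign flip per column.
vectors_match = np.allclose(np.abs(U_g[:, :k]), np.abs(U_w[:, :k]), atol=1e-6)
print(values_match, vectors_match)
```

This is why the two sets of representations differ in scale before column normalization but should line up afterwards.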

# Part 2: word representations via language modeling

In this section, we'll train a word embedding model with a word2vec-style objective rather than a matrix factorization objective. This requires a little more work; we've provided scaffolding for a PyTorch model implementation below.
If you don't have much PyTorch experience, there are some tutorials [here](https://pytorch.org/tutorials/) which may be useful. You're also welcome to implement these experiments in any other framework of your choosing.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as torch_data

class Word2VecModel(nn.Module):
    # A torch module implementing a word2vec predictor. The `forward` function
    # should take a batch of context word ids as input and predict the word 
    # in the middle of the context as output, as in the CBOW model from lecture.

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.V = nn.Linear(vocab_size + 1, embed_dim, bias=False)
        self.U = nn.Linear(embed_dim, vocab_size + 1, bias=False)
        self.vocab_size = vocab_size

    def forward(self, context):
        # Context is an `n_batch x n_context` matrix of integer word ids
        # this function should return a set of scores for predicting the word 
        # in the middle of the context

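        # Build the one-hot vectors on the same device as the model weights
        # (e.g. the colab gpu) rather than hard-coding 'cuda'.
        device = self.V.weight.device
        outputs = torch.mean(
            F.one_hot(context + 1, num_classes=self.vocab_size + 1).float().to(device),
            -2)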
        outputs = self.V(outputs)
        outputs = self.U(outputs)
        return outputs
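
The forward pass above averages one-hot vectors and then applies the linear layer `V`. That is equivalent to looking up each context word's embedding row and averaging the rows. The NumPy sketch below demonstrates the equivalence on random data; the array names are hypothetical, and the `+ 1` id offset used in the model is left out for simplicity.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 6, 4
# Hypothetical embedding table: one row per word id.
embeddings = rng.normal(size=(vocab_size, embed_dim))

context = np.array([[0, 2, 3, 5],      # n_batch x n_context word ids
                    [1, 1, 4, 0]])

# Route 1: average the one-hot vectors, then apply the linear map.
one_hot = np.eye(vocab_size)[context]               # n_batch x n_context x vocab_size
via_one_hot = one_hot.mean(axis=-2) @ embeddings    # n_batch x embed_dim

# Route 2: look up each context word's row and average.
via_lookup = embeddings[context].mean(axis=1)

print(np.allclose(via_one_hot, via_lookup))
```

The lookup route avoids materializing a `vocab_size`-wide one-hot tensor per context word; in PyTorch, `nn.Embedding` provides this kind of lookup.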

In [None]:
def learn_reps_word2vec(corpus, window_size, rep_size, n_epochs, n_batch):
    # This method takes in a corpus of training sentences. It returns a matrix of
    # word embeddings with the same structure as used in the previous section of 
    # the assignment. (You can extract this matrix from the parameters of the 
    # Word2VecModel.)

    tokenizer = lab_util.Tokenizer()
    tokenizer.fit(corpus)
    tokenized_corpus = tokenizer.tokenize(corpus)

    ngrams = lab_util.get_ngrams(tokenized_corpus, window_size)

    # Prefer the colab gpu; fall back to the cpu if no gpu is available.
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = Word2VecModel(tokenizer.vocab_size, rep_size).to(device)
    opt = optim.Adam(model.parameters(), lr=0.001)

    loader = torch_data.DataLoader(ngrams, batch_size=n_batch, shuffle=True)

    # What loss function should we use for Word2Vec?
    loss_fn = nn.CrossEntropyLoss()

    losses = []  # Potentially useful for debugging (loss should go down!)
    for epoch in tqdm(range(n_epochs)):
        epoch_loss = 0
        for context, label in loader:
            # As described above, `context` is a batch of context word ids, and
            # `label` is a batch of predicted word labels.

            # Here, perform a forward pass to compute predictions for the model.
            preds = model(context)


            # Now finish the backward pass and gradient update.
            # Remember, you need to compute the loss, zero the gradients
            # of the model parameters, perform the backward pass, and
            # update the model parameters.
            loss = loss_fn(preds, label.to(device))
            loss.backward()
            opt.step()
            opt.zero_grad()


            epoch_loss += loss.item()
        losses.append(epoch_loss)

    # Hint: you want to return a `vocab_size x embedding_size` numpy array
    embedding_matrix = model.V.weight.data.T.cpu().numpy()[1:,:]

    return embedding_matrix
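
With its default settings, `nn.CrossEntropyLoss` applies a log-softmax to the raw scores and averages the negative log-probability of the correct word over the batch. A NumPy sketch of that computation, useful for reasoning about what loss values to expect (the function name `cross_entropy` is illustrative):

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy for integer class labels."""
    # Subtract the row max before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick out the log-probability of the correct class in each row.
    return -log_probs[np.arange(len(labels)), labels].mean()

uniform_logits = np.zeros((3, 5))      # a model that scores all 5 words equally
labels = np.array([0, 2, 4])
loss = cross_entropy(uniform_logits, labels)
print(loss, np.log(5))
```

A model that scores every word equally has a per-example loss of `log(num_classes)`. Note that `losses` in the training loop stores a sum over batches, so it should be divided by the number of batches before comparing against this baseline.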

In [None]:
# Use the function you just wrote to learn Word2Vec embeddings:
reps_word2vec = learn_reps_word2vec(train_reviews, 2, 500, 10, 100)

100%|██████████| 10/10 [01:26<00:00,  8.67s/it]


After training the embeddings, we can try to visualize the embedding space to see if it makes sense. First, we can take any word in the space and check its closest neighbors.

In [None]:
lab_util.show_similar_words(vectorizer.tokenizer, reps_word2vec, show_tokens)

good 47
  great 0.923
  decent 1.010
  bad 1.069
  outstanding 1.121
  mild 1.123
bad 201
  good 1.069
  awful 1.097
  bitter 1.151
  overpowering 1.186
  terrible 1.211
cookie 504
  covered 1.296
  nana's 1.319
  berry 1.331
  lover 1.351
  g 1.367
jelly 351
  bears 1.108
  sized 1.238
  coffees 1.267
  pork 1.294
  candies 1.313
dog 925
  cat 0.795
  baby 0.999
  cats 1.172
  pouch 1.197
  son 1.213
the 36
  mrs 1.052
  our 1.263
  their 1.265
  my 1.274
  amazon's 1.291
3 289
  5 0.863
  four 1.043
  10 1.089
  2 1.103
  6 1.110
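
A nearest-neighbor lookup of this kind can be sketched as follows. The notebook does not show which metric `lab_util.show_similar_words` uses, so cosine distance here is an assumption, and `nearest_neighbors` is a hypothetical helper.

```python
import numpy as np

def nearest_neighbors(reps, query_idx, k=5):
    """Return (row index, cosine distance) for the k rows closest to query_idx."""
    # Assumption: cosine distance; lab_util may use a different metric.
    unit = reps / np.linalg.norm(reps, axis=1, keepdims=True)
    dists = 1.0 - unit @ unit[query_idx]
    order = np.argsort(dists)
    order = order[order != query_idx][:k]   # drop the query word itself
    return [(int(i), float(dists[i])) for i in order]

toy_reps = np.array([[1.0, 0.0],
                     [0.9, 0.1],
                     [0.0, 1.0],
                     [-1.0, 0.0]])
neighbors = nearest_neighbors(toy_reps, query_idx=0, k=3)
print(neighbors)
```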


We can also cluster the embedding space. Clustering in 4 or more dimensions is hard to visualize, and even clustering in 2 or 3 can be difficult because there are so many words in the vocabulary. One thing we can try to do is assign cluster labels and qualitatively look for an underlying pattern in the clusters.

In [None]:
from sklearn.cluster import KMeans

indices = KMeans(n_clusters=10).fit_predict(reps_word2vec)
zipped = list(zip(range(vectorizer.tokenizer.vocab_size), indices))
np.random.shuffle(zipped)
zipped = zipped[:100]
zipped = sorted(zipped, key=lambda x: x[1])
for token, cluster_idx in zipped:
    word = vectorizer.tokenizer.token_to_word[token]
    print(f"{word}: {cluster_idx}")

points: 0
week: 0
lollipops: 0
pounds: 0
power: 0
five: 0
higher: 0
p: 0
paper: 0
ingredient: 1
filter: 1
center: 1
local: 1
label: 1
mouth: 1
depending: 1
shape: 1
subscription: 1
replacement: 1
world: 1
issue: 1
toddler: 1
smell: 1
popchips: 2
nutrients: 2
baked: 2
sweeteners: 2
anything: 2
nothing: 2
varieties: 2
chowder: 2
us: 2
goodness: 2
water: 2
swiss: 2
fact: 2
commercial: 3
watermelon: 3
pomegranate: 3
real: 3
tuna: 3
stevia: 3
garlic: 3
cake: 3
crust: 3
homemade: 3
seal: 4
digest: 4
compare: 4
mention: 4
seem: 4
expect: 4
include: 4
order: 4
run: 4
rate: 4
bake: 4
due: 4
for: 5
d: 5
entirely: 5
maybe: 5
such: 5
except: 5
total: 5
sticks: 5
on: 5
lower: 5
particular: 5
treat: 5
nice: 6
pricey: 6
fluffy: 6
excellent: 6
similar: 6
expensive: 6
allowed: 7
craving: 7
picked: 7
noticed: 7
mixing: 7
shocked: 7
throwing: 7
stuck: 7
produced: 7
o: 8
last: 8
often: 8
possible: 8
everything: 8
having: 8
locally: 8
easily: 8
today: 8
quickly: 8
its: 8
leaves: 9
got: 9
contains: 9
were: 
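
When looking for patterns qualitatively, it can help to print each cluster's words together on one line rather than one word per line. A small sketch, using a hypothetical `group_by_cluster` helper and a handful of word/cluster pairs taken from the output above:

```python
from collections import defaultdict

def group_by_cluster(words, cluster_ids):
    """Map each cluster index to the list of words assigned to it."""
    groups = defaultdict(list)
    for word, cluster_idx in zip(words, cluster_ids):
        groups[int(cluster_idx)].append(word)
    return dict(sorted(groups.items()))

sample_words = ["nice", "pricey", "seem", "expect", "excellent"]
sample_clusters = [6, 6, 4, 4, 6]
for cluster_idx, members in group_by_cluster(sample_words, sample_clusters).items():
    print(cluster_idx, ", ".join(members))
```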

Finally, we can use the trained word embeddings to construct vector representations of full reviews. One common approach is to simply average all the word embeddings in the review to create an overall embedding. Implement the `w2v_featurizer` function below to do this.

In [None]:
def w2v_featurizer(xs):
    # This function takes in a matrix in which each row contains the word counts
    # for the given review. It should return a matrix in which each row contains
    # the average Word2Vec embedding of each review (hint: this will be very
    # similar to `lsa_featurizer` from above, just using Word2Vec embeddings 
    # instead of LSA).

    feats = xs @ reps_word2vec

    # normalize
    return feats / np.sqrt((feats ** 2).sum(axis=1, keepdims=True))

training_experiment("word2vec", w2v_featurizer, 3000)
print()

word2vec features, 3000 examples
0.8 
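
The featurizer above sums embeddings rather than averaging them, but because each row is L2-normalized afterwards, the two choices give the same features: the average is the sum divided by a positive scalar, which normalization removes. A quick check on toy data (all names and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
toy_reps = rng.normal(size=(5, 3))         # hypothetical 5-word, 3-dim embeddings
toy_counts = np.array([[2, 0, 1, 0, 3],    # word counts for two toy reviews
                       [0, 1, 1, 1, 0]])

summed = toy_counts @ toy_reps
averaged = summed / toy_counts.sum(axis=1, keepdims=True)

def unit_rows(x):
    # Scale every row to unit L2 norm.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

print(np.allclose(unit_rows(summed), unit_rows(averaged)))
```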




**Part 2: Lab writeup**

Part 2 of your lab report should discuss any implementation details that were important to filling out the code above, as well as your answers to the questions in Part 2 of the Homework 1 handout. Below, you can set up and perform experiments that answer these questions (include figures, plots, and tables in your write-up as you see fit).

## Experiments for Part 2

In [None]:
# The experiments for Part 2 (a) were done in the
# `lab_util.show_similar_words(vectorizer.tokenizer, reps_word2vec, show_tokens)`
# cell above, and the experiment for Part 2 (b) was done in the code cell
# immediately above this one.