<a href="https://colab.research.google.com/github/cicattzo/mit_advanced_nlp/blob/main/HA1_P1_P2_NLP_MIT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit 
rm -rf hw1
git clone https://github.com/mit-6864/hw1.git

Cloning into 'hw1'...


In [2]:
import sys
sys.path.append("/content/hw1")

import csv
import itertools as it
import numpy as np
import sklearn.decomposition
np.random.seed(0)
from tqdm import tqdm

import lab_util

# Introduction

In this notebook, you will find code scaffolding for the word representation parts of Homework 1 (matrix factorization and Word2Vec-style language modeling; code for the HMM section of the assignment is released in another notebook). There are certain parts of the scaffolding marked with `# Your code here!` comments where you can fill in code to perform the specified tasks. After implementing the methods in this notebook, you will need to design and perform experiments to evaluate each method and respond to the questions in the Homework 1 handout (available on Canvas). You should be able to complete this assignment without changing any of the scaffolding code, just writing code to fill in the scaffolding and run experiments.

## Dataset

We're going to be working with a dataset of product reviews. The following cell loads the dataset and splits it into training, validation, and test sets.

In [3]:
data = []
n_positive = 0
n_disp = 0
with open("/content/hw1/reviews.csv") as reader:
  csvreader = csv.reader(reader)
  next(csvreader)
  for id, review, label in csvreader:
    label = int(label)

    # hacky class balancing
    if label == 1:
      if n_positive == 2000:
        continue
      n_positive += 1
    if len(data) == 4000:
      break

    data.append((review, label))
    
    if n_disp > 5:
      continue
    n_disp += 1
    print("review:", review)
    print("rating:", label, "(good)" if label == 1 else "(bad)")
    print()

print(f"Read {len(data)} total reviews.")
np.random.shuffle(data)
reviews, labels = zip(*data)
train_reviews = reviews[:3000]
train_labels = labels[:3000]
val_reviews = reviews[3000:3500]
val_labels = labels[3000:3500]
test_reviews = reviews[3500:]
test_labels = labels[3500:]

review: I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than  most.
rating: 1 (good)

review: Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".
rating: 0 (bad)

review: This is a confection that has been around a few centuries.  It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar.  And it is a tiny mouthful of heaven.  Not too chewy, and very flavorful.  I highly recommend this yummy treat.  If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother an

# Part 1: word representations via matrix factorization

First, we'll construct the term-document matrix (look at `/content/hw1/lab_util.py` in the file browser on the left if you want to see how this works).

In [4]:
vectorizer = lab_util.CountVectorizer()
vectorizer.fit(train_reviews)
td_matrix = vectorizer.transform(train_reviews).T
print(f"TD matrix is {td_matrix.shape[0]} x {td_matrix.shape[1]}")

TD matrix is 2006 x 3000


In [5]:
td_matrix

array([[3., 3., 5., ..., 1., 6., 4.],
       [1., 0., 0., ..., 0., 0., 0.],
       [2., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

First, implement the function `learn_reps_lsa` that computes word representations via latent semantic analysis. The `sklearn.decomposition` or `np.linalg` packages may be useful.

In [7]:
import sklearn.decomposition
def learn_reps_lsa(matrix, rep_size):
    # `matrix` is a `|V| x n` matrix, where `|V|` is the number of words in the
    # vocabulary. This function should return a `|V| x rep_size` matrix with each
    # row corresponding to a word representation.

    svd = sklearn.decomposition.TruncatedSVD(n_components=rep_size, random_state=42)

    matrix_lsa = svd.fit_transform(matrix)

    return matrix_lsa

    raise NotImplementedError

#### Sanity check 1
The following cell contains a simple sanity check for your `learn_reps_lsa` implementation: it should print `True` if your `learn_reps_lsa` function is implemented equivalently to one of our solutions.  There are at least two reasonable ways to formulate these LSA word representations (whether you directly use the left singular vectors of `matrix` or scale them by the singular values), these correspond to the two possible representations in the sanity check below.

In [8]:
DEBUG_sc1_matrix = np.array([[1,0,0,2,1,3,5],
                             [2,0,0,0,0,4,0],
                             [0,3,4,1,8,6,6],
                             [1,4,5,0,0,0,0]])

DEBUG_reps = learn_reps_lsa(DEBUG_sc1_matrix, 3)
DEBUG_gt1 = np.array([[ -4.92017554,  -2.85465774,   1.18575453],
                      [ -2.14977584,  -1.19987977,   3.37221899],
                      [-12.62664695,   0.10890093,  -1.32131745],
                      [ -2.69216011,   5.66453534,   1.33728063]])
DEBUG_gt2 = np.array([[-0.35188159, -0.44213061,  0.29358929],
                      [-0.15374788, -0.18583789,  0.83495136],
                      [-0.90303377,  0.01686662, -0.32715426],
                      [-0.19253817,  0.87732566,  0.3311067 ]])

print(np.allclose(np.abs(DEBUG_reps), np.abs(DEBUG_gt1)) or np.allclose(np.abs(DEBUG_reps), np.abs(DEBUG_gt2)))

True


Let's look at some representations:

In [9]:
reps = learn_reps_lsa(td_matrix, 500)
words = ["good", "bad", "cookie", "jelly", "dog", "the", "4"]
show_tokens = [vectorizer.tokenizer.word_to_token[word] for word in words]
lab_util.show_similar_words(vectorizer.tokenizer, reps, show_tokens)

good 47
  . 1.056
  a 1.101
  but 1.121
  , 1.152
  the 1.157
bad 201
  . 1.396
  taste 1.416
  but 1.434
  a 1.435
  i 1.449
cookie 504
  nana's 0.795
  cookies 1.012
  oreos 1.272
  bars 1.384
  bites 1.404
jelly 351
  twist 1.076
  cardboard 1.231
  peanuts 1.393
  advertised 1.399
  plastic 1.469
dog 925
  food 1.049
  pet 1.067
  pets 1.070
  switched 1.203
  foods 1.227
the 36
  . 0.331
  <unk> 0.366
  of 0.395
  and 0.403
  to 0.422
4 292
  1 1.047
  6 1.121
  70 1.132
  stevia 1.195
  concentrated 1.245


We've been operating on the raw count matrix, but in class we discussed several reweighting schemes aimed at making LSA representations more informative. 

Here, implement the TF-IDF transform and see how it affects learned representations.

In [10]:
# from sklearn.feature_extraction.text import TfidfVectorizer

def transform_tfidf(matrix):
    # `matrix` is a `|V| x |D|` matrix of raw counts, where `|V|` is the 
    # vocabulary size and `|D|` is the number of documents in the corpus. This
    # function should (nondestructively) return a version of `matrix` with the
    # TF-IDF transform appliied.

    # words_in_document = np.sum(matrix, axis = 0)

    TF = matrix

    # TF = np.log(1+matrix)

    nonzero_matrix = matrix!=0
    nonzero_matrix = nonzero_matrix*1

    frequencies = np.sum(nonzero_matrix, axis = 1)

    IDF = np.log(matrix.shape[1]/frequencies)

    TFIDF = TF.transpose(1,0)*IDF

    TFIDF = TFIDF.transpose(1,0)

    return TFIDF

    # Your code here!
    raise NotImplementedError

#### Sanity check 2
The following cell should print `True` if your `transform_tfidf` function is implemented properly. (*Hint: in our implementation, we use the natural logarithm (base $e$) when computing inverse document frequency.*)

In [11]:
DEBUG_sc2_matrix = np.array([[3,1,0,3,0],
                             [0,2,0,0,1],
                             [7,8,2,0,1],
                             [1,9,8,1,0]])
DEBUG_gt = np.array([[1.53247687, 0.51082562, 0.        , 1.53247687, 0.        ],
                     [0.        , 1.83258146, 0.        , 0.        , 0.91629073],
                     [1.56200486, 1.78514841, 0.4462871 , 0.        , 0.22314355],
                     [0.22314355, 2.00829196, 1.78514841, 0.22314355, 0.        ]])
print(np.allclose(transform_tfidf(DEBUG_sc2_matrix), DEBUG_gt))

True


How does this change the learned similarity function?


In [12]:
td_matrix_tfidf = transform_tfidf(td_matrix)
# reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 100)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

good 47
  but 0.449
  . 0.513
  for 0.595
  as 0.596
  not 0.611
bad 201
  like 0.609
  taste 0.630
  but 0.695
  me 0.777
  . 0.779
cookie 504
  cookies 0.388
  nana's 0.467
  oreos 0.745
  bars 0.916
  hope 0.949
jelly 351
  creamer 0.864
  gifts 0.960
  twist 1.003
  advertised 1.084
  cardboard 1.128
dog 925
  pet 0.744
  foods 0.779
  switched 0.796
  pets 0.825
  food 0.859
the 36
  . 0.121
  of 0.154
  to 0.164
  and 0.164
  <unk> 0.212
4 292
  6 0.469
  1 0.507
  70 0.685
  2 0.731
  concentrated 0.818


Now that we have some representations, let's see if we can do something useful with them.

Below, implement a feature function that represents a document as the sum of its
learned word embeddings.

The remaining code trains a logistic regression model on a set of *labeled* reviews; we're interested in seeing how much representations learned from *unlabeled* reviews improve classification.

In [13]:
import sklearn.linear_model

def word_featurizer(xs):
    # normalize
    return xs / np.sqrt((xs ** 2).sum(axis=1, keepdims=True))

def lsa_featurizer(xs):
    # This function takes in a matrix in which each row contains the word counts
    # for the given review. It should return a matrix in which each row contains
    # the learned feature representation of each review (e.g. the sum of LSA 
    # word representations).

    feats = xs @ reps_tfidf

    # normalize
    return feats / np.sqrt((feats ** 2).sum(axis=1, keepdims=True))

# We've implemented the remainder of the training and evaluation pipeline,
# so you likely won't need to modify the following four functions.
def combo_featurizer(xs):
    return np.concatenate((word_featurizer(xs), lsa_featurizer(xs)), axis=1)

def train_model(featurizer, xs, ys):
    xs_featurized = featurizer(xs)
    model = sklearn.linear_model.LogisticRegression()
    model.fit(xs_featurized, ys)
    return model

def eval_model(model, featurizer, xs, ys):
    xs_featurized = featurizer(xs)
    pred_ys = model.predict(xs_featurized)
    return np.mean(pred_ys == ys)

def training_experiment(name, featurizer, n_train):
    print(f"{name} features, {n_train} examples")
    train_xs = vectorizer.transform(train_reviews[:n_train])
    train_ys = train_labels[:n_train]
    test_xs = vectorizer.transform(test_reviews)
    test_ys = test_labels
    model = train_model(featurizer, train_xs, train_ys)
    acc = eval_model(model, featurizer, test_xs, test_ys)
    print(acc, '\n')
    return acc

# The following four lines will run a training experiment with all 3k examples
# in training set for each feature type. `training_experiment` may be useful to
# you when performing experiments to answer questions in Part 1 of the Homework
# 1 handout.
n_train = 3000
training_experiment("word", word_featurizer, n_train)
training_experiment("lsa", lsa_featurizer, n_train)
training_experiment("combo", combo_featurizer, n_train)
print()

word features, 3000 examples
0.784 

lsa features, 3000 examples
0.754 

combo features, 3000 examples
0.792 




**Part 1: Lab writeup**

Part 1 of your lab report should discuss any implementation details that were important to filling out the code above, as well as your answers to the questions in Part 1 of the Homework 1 handout. Below, you can set up and perform experiments that answer these questions (include figures, plots, and tables in your write-up as you see fit).

## Experiments for Part 1

In [14]:
# (a)

W_td = DEBUG_sc2_matrix
W_tt = np.matmul(W_td, W_td.transpose())

u, s, vh = np.linalg.svd(W_td, full_matrices=True)
u

array([[-0.139126  , -0.40244616,  0.88442143, -0.19099681],
       [-0.10685017, -0.00643125, -0.22869359, -0.96759543],
       [-0.64840265, -0.64955514, -0.36266492,  0.16163628],
       [-0.74081105,  0.64503786,  0.18431532,  0.03395598]])

In [15]:
u, s, vh = np.linalg.svd(W_tt, full_matrices=True)
u

array([[-0.139126  ,  0.40244616,  0.88442143, -0.19099681],
       [-0.10685017,  0.00643125, -0.22869359, -0.96759543],
       [-0.64840265,  0.64955514, -0.36266492,  0.16163628],
       [-0.74081105, -0.64503786,  0.18431532,  0.03395598]])

In [16]:
# (b)

reps = learn_reps_lsa(td_matrix, 10)
words = ["the", "dog", "3", "good"]
show_tokens = [vectorizer.tokenizer.word_to_token[word] for word in words]
lab_util.show_similar_words(vectorizer.tokenizer, reps, show_tokens)

the 36
  same 0.052
  date 0.070
  cans 0.095
  idea 0.100
  small 0.100
dog 925
  dogs 0.086
  vet 0.087
  food 0.100
  companies 0.105
  cat 0.115
3 289
  itself 0.063
  kind 0.076
  particular 0.094
  stars 0.096
  per 0.097
good 47
  better 0.045
  like 0.059
  too 0.066
  strong 0.070
  otherwise 0.076


In [17]:
reps = learn_reps_lsa(td_matrix, 100)
words = ["the", "dog", "3", "good"]
show_tokens = [vectorizer.tokenizer.word_to_token[word] for word in words]
lab_util.show_similar_words(vectorizer.tokenizer, reps, show_tokens)

the 36
  . 0.331
  <unk> 0.366
  of 0.395
  and 0.402
  to 0.422
dog 925
  pet 0.539
  pets 0.561
  foods 0.588
  dogs 0.653
  nutritious 0.667
3 289
  1 0.636
  per 0.649
  2 0.682
  8 0.688
  4 0.793
good 47
  pretty 0.837
  everyone 0.983
  . 1.049
  beat 1.066
  tasting 1.077


In [18]:
reps = learn_reps_lsa(td_matrix, 500)
words = ["the", "dog", "3", "good"]
show_tokens = [vectorizer.tokenizer.word_to_token[word] for word in words]
lab_util.show_similar_words(vectorizer.tokenizer, reps, show_tokens)

the 36
  . 0.331
  <unk> 0.366
  of 0.395
  and 0.403
  to 0.422
dog 925
  food 1.049
  pet 1.067
  pets 1.070
  switched 1.203
  foods 1.227
3 289
  8 1.214
  . 1.239
  the 1.271
  to 1.272
  <unk> 1.279
good 47
  . 1.056
  a 1.101
  but 1.121
  , 1.152
  the 1.157


In [19]:
reps = learn_reps_lsa(td_matrix, 1000)
words = ["the", "dog", "3", "good"]
show_tokens = [vectorizer.tokenizer.word_to_token[word] for word in words]
lab_util.show_similar_words(vectorizer.tokenizer, reps, show_tokens)

the 36
  . 0.331
  <unk> 0.366
  of 0.395
  and 0.403
  to 0.422
dog 925
  food 1.054
  pets 1.154
  pet 1.181
  foods 1.267
  dogs 1.313
3 289
  . 1.242
  8 1.249
  the 1.274
  to 1.276
  <unk> 1.282
good 47
  . 1.056
  a 1.102
  but 1.121
  , 1.152
  the 1.157


In [20]:
# (c)

td_matrix_tfidf = transform_tfidf(td_matrix)
# reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 10)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

n_train = 3000
training_experiment("word", word_featurizer, n_train)
training_experiment("lsa", lsa_featurizer, n_train)
training_experiment("combo", combo_featurizer, n_train)
print()

the 36
  . 0.019
  in 0.019
  be 0.024
  that 0.027
  a 0.027
dog 925
  dogs 0.065
  him 0.069
  he 0.099
  lamb 0.110
  baby 0.117
3 289
  considering 0.048
  spent 0.094
  daily 0.105
  based 0.118
  particular 0.118
good 47
  very 0.043
  . 0.051
  but 0.053
  much 0.062
  like 0.064
word features, 3000 examples
0.784 

lsa features, 3000 examples
0.602 

combo features, 3000 examples
0.788 




In [21]:
td_matrix_tfidf = transform_tfidf(td_matrix)
# reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 100)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

n_train = 3000
training_experiment("word", word_featurizer, n_train)
training_experiment("lsa", lsa_featurizer, n_train)
training_experiment("combo", combo_featurizer, n_train)
print()

the 36
  . 0.121
  of 0.154
  to 0.164
  and 0.164
  <unk> 0.212
dog 925
  pet 0.744
  foods 0.779
  switched 0.796
  pets 0.825
  food 0.859
3 289
  8 0.760
  1 0.769
  2 0.776
  4 0.868
  per 0.869
good 47
  but 0.449
  . 0.513
  for 0.595
  as 0.596
  not 0.611
word features, 3000 examples
0.784 

lsa features, 3000 examples
0.754 

combo features, 3000 examples
0.792 




In [22]:
td_matrix_tfidf = transform_tfidf(td_matrix)
# reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

n_train = 3000
training_experiment("word", word_featurizer, n_train)
training_experiment("lsa", lsa_featurizer, n_train)
training_experiment("combo", combo_featurizer, n_train)
print()

the 36
  . 0.214
  and 0.269
  <unk> 0.290
  of 0.304
  to 0.324
dog 925
  food 1.030
  pets 1.097
  pet 1.105
  foods 1.189
  switched 1.250
3 289
  8 1.148
  . 1.173
  to 1.214
  the 1.215
  of 1.231
good 47
  . 0.983
  but 1.011
  a 1.032
  and 1.087
  is 1.095
word features, 3000 examples
0.784 

lsa features, 3000 examples
0.758 

combo features, 3000 examples
0.794 




In [23]:
td_matrix_tfidf = transform_tfidf(td_matrix)
# reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 1000)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

n_train = 3000
training_experiment("word", word_featurizer, n_train)
training_experiment("lsa", lsa_featurizer, n_train)
training_experiment("combo", combo_featurizer, n_train)
print()

the 36
  . 0.264
  <unk> 0.328
  and 0.341
  of 0.362
  to 0.383
dog 925
  food 1.050
  pets 1.162
  pet 1.182
  foods 1.257
  dogs 1.311
3 289
  . 1.217
  8 1.231
  the 1.261
  to 1.263
  <unk> 1.271
good 47
  . 1.028
  a 1.082
  but 1.104
  and 1.138
  the 1.143
word features, 3000 examples
0.784 

lsa features, 3000 examples
0.756 

combo features, 3000 examples
0.794 




In [24]:
td_matrix_tfidf = transform_tfidf(td_matrix)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

n_train = 10
training_experiment("word", word_featurizer, n_train)
training_experiment("lsa", lsa_featurizer, n_train)
training_experiment("combo", combo_featurizer, n_train)
print()

the 36
  . 0.214
  and 0.269
  <unk> 0.290
  of 0.304
  to 0.324
dog 925
  food 1.030
  pets 1.097
  pet 1.105
  foods 1.189
  switched 1.250
3 289
  8 1.148
  . 1.173
  to 1.214
  the 1.215
  of 1.231
good 47
  . 0.983
  but 1.011
  a 1.032
  and 1.087
  is 1.095
word features, 10 examples
0.496 

lsa features, 10 examples
0.474 

combo features, 10 examples
0.496 




In [25]:
td_matrix_tfidf = transform_tfidf(td_matrix)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

n_train = 100
training_experiment("word", word_featurizer, n_train)
training_experiment("lsa", lsa_featurizer, n_train)
training_experiment("combo", combo_featurizer, n_train)
print()

the 36
  . 0.214
  and 0.269
  <unk> 0.290
  of 0.304
  to 0.324
dog 925
  food 1.030
  pets 1.097
  pet 1.105
  foods 1.189
  switched 1.250
3 289
  8 1.148
  . 1.173
  to 1.214
  the 1.215
  of 1.231
good 47
  . 0.983
  but 1.011
  a 1.032
  and 1.087
  is 1.095
word features, 100 examples
0.616 

lsa features, 100 examples
0.614 

combo features, 100 examples
0.626 




In [26]:
td_matrix_tfidf = transform_tfidf(td_matrix)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

n_train = 1000
training_experiment("word", word_featurizer, n_train)
training_experiment("lsa", lsa_featurizer, n_train)
training_experiment("combo", combo_featurizer, n_train)
print()

the 36
  . 0.214
  and 0.269
  <unk> 0.290
  of 0.304
  to 0.324
dog 925
  food 1.030
  pets 1.097
  pet 1.105
  foods 1.189
  switched 1.250
3 289
  8 1.148
  . 1.173
  to 1.214
  the 1.215
  of 1.231
good 47
  . 0.983
  but 1.011
  a 1.032
  and 1.087
  is 1.095
word features, 1000 examples
0.784 

lsa features, 1000 examples
0.734 

combo features, 1000 examples
0.782 




In [27]:
td_matrix_tfidf = transform_tfidf(td_matrix)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

n_train = 3000
training_experiment("word", word_featurizer, n_train)
training_experiment("lsa", lsa_featurizer, n_train)
training_experiment("combo", combo_featurizer, n_train)
print()

the 36
  . 0.214
  and 0.269
  <unk> 0.290
  of 0.304
  to 0.324
dog 925
  food 1.030
  pets 1.097
  pet 1.105
  foods 1.189
  switched 1.250
3 289
  8 1.148
  . 1.173
  to 1.214
  the 1.215
  of 1.231
good 47
  . 0.983
  but 1.011
  a 1.032
  and 1.087
  is 1.095
word features, 3000 examples
0.784 

lsa features, 3000 examples
0.758 

combo features, 3000 examples
0.794 




In [28]:
td_matrix_tfidf = transform_tfidf(td_matrix)
reps_tfidf = learn_reps_lsa(td_matrix_tfidf, 500)
lab_util.show_similar_words(vectorizer.tokenizer, reps_tfidf, show_tokens)

n_train = 10000
training_experiment("word", word_featurizer, n_train)
training_experiment("lsa", lsa_featurizer, n_train)
training_experiment("combo", combo_featurizer, n_train)
print()

the 36
  . 0.214
  and 0.269
  <unk> 0.290
  of 0.304
  to 0.324
dog 925
  food 1.030
  pets 1.097
  pet 1.105
  foods 1.189
  switched 1.250
3 289
  8 1.148
  . 1.173
  to 1.214
  the 1.215
  of 1.231
good 47
  . 0.983
  but 1.011
  a 1.032
  and 1.087
  is 1.095
word features, 10000 examples
0.784 

lsa features, 10000 examples
0.758 

combo features, 10000 examples
0.794 




## Part 2: word representations via language modeling

In this section, we'll train a word embedding model with a word2vec-style objective rather than a matrix factorization objective. This requires a little more work; we've provided scaffolding for a PyTorch model implementation below.
If you don't have much PyTorch experience, there are some tutorials [here](https://pytorch.org/tutorials/) which may be useful. You're also welcome to implement these experiments in any other framework of your choosing.

In [29]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data as torch_data

class Word2VecModel(nn.Module):
    # A torch module implementing a word2vec predictor. The `forward` function
    # should take a batch of context word ids as input and predict the word 
    # in the middle of the context as output, as in the CBOW model from lecture.

    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.embeddings = nn.Embedding(self.vocab_size+1, self.embed_dim)
        self.linear = nn.Linear(self.embed_dim, self.vocab_size+1)

        # Your code here!c
        # raise NotImplementedError

    def forward(self, context):
        # Context is an `n_batch x n_context` matrix of integer word ids
        # this function should return a set of scores for predicting the word 
        # in the middle of the context

        # Your code here!
        embegings = self.embeddings(context)

        embegings=torch.mean(embegings,dim=1)

        linear_out = self.linear(embegings)

        final_out = F.log_softmax(linear_out, dim = 1)

        return final_out

        raise NotImplementedError

In [30]:
def learn_reps_word2vec(corpus, window_size, rep_size, n_epochs, n_batch):
    #This method takes in a corpus of training sentences. It returns a matrix of
    # word embeddings with the same structure as used in the previous section of 
    # the assignment. (You can extract this matrix from the parameters of the 
    # Word2VecModel.)

    tokenizer = lab_util.Tokenizer()
    tokenizer.fit(corpus)
    tokenized_corpus = tokenizer.tokenize(corpus)

    ngrams = lab_util.get_ngrams(tokenized_corpus, window_size)

    device = torch.device('cuda')  # run on colab gpu
    # model = Word2VecModel(tokenizer.vocab_size, rep_size).to(device)
    model = Word2VecModel(tokenizer.vocab_size, rep_size)
    opt = optim.Adam(model.parameters(), lr=0.001)

    loader = torch_data.DataLoader(ngrams, batch_size=n_batch, shuffle=True)

    # What loss function should we use for Word2Vec?
    loss_fn = nn.CrossEntropyLoss()

    losses = []  # Potentially useful for debugging (loss should go down!)

    for epoch in tqdm(range(n_epochs)):
        epoch_loss = 0
        for context, label in loader:

            context = context+1

            label = label +1

            # As described above, `context` is a batch of context word ids, and
            # `label` is a batch of predicted word labels.

            # Here, perform a forward pass to compute predictions for the model.
            # context.device(device)

            # torch.tensor([word_to_ix[target]], dtype=torch.long)

      

            # context = context.to(device)
            # label = label.to(device)

            preds = model(context)  # Your code here!


            # Now finish the backward pass and gradient update.
            # Remember, you need to compute the loss, zero the gradients
            # of the model parameters, perform the backward pass, and
            # update the model parameters.
            loss = loss_fn(preds, label)  # Your code here!

            loss.backward()

            opt.step()

            epoch_loss += loss.item()
        losses.append(epoch_loss)

    # Hint: you want to return a `vocab_size x embedding_size` numpy array
    for name, param in model.named_parameters():
      if name == 'embeddings.weight':
        embedding_matrix = param[1:]
    # embedding_matrix = None  # Your code here!

    return embedding_matrix

In [31]:
# Use the function you just wrote to learn Word2Vec embeddings:
reps_word2vec = learn_reps_word2vec(train_reviews, 2, 500, 10, 100)

100%|██████████| 10/10 [16:39<00:00, 100.00s/it]


After training the embeddings, we can try to visualize the embedding space to see if it makes sense. First, we can take any word in the space and check its closest neighbors.

In [32]:
lab_util.show_similar_words(vectorizer.tokenizer, reps_word2vec.detach().numpy(), show_tokens)

the 36
  this 1.538
  a 1.577
  it 1.616
  my 1.646
  you're 1.652
dog 925
  costco 1.607
  subscribe 1.613
  crunch 1.626
  spice 1.638
  wet 1.643
3 289
  bread 1.637
  couple 1.643
  maybe 1.650
  birthday 1.655
  busy 1.661
good 47
  excellent 1.626
  constantly 1.629
  enjoying 1.663
  decent 1.686
  favorite 1.689


We can also cluster the embedding space. Clustering in 4 or more dimensions is hard to visualize, and even clustering in 2 or 3 can be difficult because there are so many words in the vocabulary. One thing we can try to do is assign cluster labels and qualitiatively look for an underlying pattern in the clusters.

In [33]:
from sklearn.cluster import KMeans

indices = KMeans(n_clusters=10).fit_predict(reps_word2vec.detach().numpy())
zipped = list(zip(range(vectorizer.tokenizer.vocab_size), indices))
np.random.shuffle(zipped)
zipped = zipped[:100]
zipped = sorted(zipped, key=lambda x: x[1])
for token, cluster_idx in zipped:
    word = vectorizer.tokenizer.token_to_word[token]
    print(f"{word}: {cluster_idx}")

amazing: 1
spread: 1
filled: 1
suggest: 1
average: 1
living: 1
learned: 1
description: 1
rate: 1
double: 1
moved: 2
seen: 2
im: 2
fall: 2
teeth: 2
prime: 2
below: 2
brewer: 2
hint: 2
lunches: 2
change: 2
subtle: 2
lays: 2
crackers: 2
expiration: 2
40: 2
plum: 2
un: 2
calcium: 2
mean: 2
fiber: 2
pouch: 2
starbucks: 2
caramels: 4
cubes: 4
general: 4
worked: 4
caffeine: 4
puppy: 4
shipment: 4
zero: 4
holes: 4
lasts: 4
nearly: 7
target: 7
granted: 7
colors: 7
decide: 7
muffin: 8
classic: 8
update: 8
birthday: 8
solid: 8
rica: 8
buying: 9
pieces: 9
than: 9
large: 9
eaten: 9
beer: 9
months: 9
excellent: 9
without: 9
reviews: 9
someone: 9
ok: 9
help: 9
given: 9
bad: 9
cookies: 9
kind: 9
bought: 9
go: 9
disappointed: 9
him: 9
has: 9
potassium: 9
needed: 9
artificial: 9
its: 9
coffee: 9
times: 9
plus: 9
unfortunately: 9
never: 9
treats: 9
something: 9
will: 9
beef: 9
morning: 9
still: 9
fruit: 9
doesn't: 9
truly: 9
we: 9
mill: 9
packing: 9
next: 9
that's: 9


Finally, we can use the trained word embeddings to construct vector representations of full reviews. One common approach is to simply average all the word embeddings in the review to create an overall embedding. Implement the transform function in Word2VecFeaturizer to do this.

In [34]:
def w2v_featurizer(xs):
    # This function takes in a matrix in which each row contains the word counts
    # for the given review. It should return a matrix in which each row contains
    # the average Word2Vec embedding of each review (hint: this will be very
    # similar to `lsa_featurizer` from above, just using Word2Vec embeddings 
    # instead of LSA).

    feats = xs @ reps_word2vec.detach().numpy() # Your code here!

    # normalize
    return feats / np.sqrt((feats ** 2).sum(axis=1, keepdims=True))

training_experiment("word2vec", w2v_featurizer, 3000)
print()

word2vec features, 3000 examples
0.79 




**Part 2: Lab writeup**

Part 2 of your lab report should discuss any implementation details that were important to filling out the code above, as well as your answers to the questions in Part 2 of the Homework 1 handout. Below, you can set up and perform experiments that answer these questions (include figures, plots, and tables in your write-up as you see fit).

## Experiments for Part 2

In [None]:
# Your code here!

