# Week 3: Vectors in Context
This notebook accompanies the week 3 lecture

In [10]:
# setup
import sys
import subprocess
import pkg_resources
from collections import Counter
import re


required = {'spacy', 'scikit-learn', 'numpy', 'pandas', 'torch'}
installed = {pkg.key for pkg in pkg_resources.working_set}
missing = required - installed

if missing:
    python = sys.executable
    subprocess.check_call([python, '-m', 'pip', 'install', *missing], stdout=subprocess.DEVNULL)

import spacy
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC
import pickle

from spacy.lang.en import English
en = English()

import torch
import torch.nn as nn
from torch.utils.data import TensorDataset, DataLoader
# this will set the device on which to train
device = torch.device("cpu")
# if using collab, set your runtime to use GPU and use the line below
#device = torch.device("cuda:0")

def simple_tokenizer(doc, model=en):
    # a simple tokenizer for individual documents (different from above)
    tokenized_docs = []
    parsed = model(doc)
    return([t.lower_ for t in parsed if (t.is_alpha)&(not t.like_url)])

### Exercise: The limits of bag-of-words
Look at the corpus below.  In previous implementations, we've counted up the number of "good" and "bad" instances.  What would be the problem with doing that in this case? How might you address it?

In [11]:
docs = ['The movie was good',
        'The movie was not bad, it was good',
        'The movie was bad']


## Moving beyond unigrams
Our work up to this point has mainly revolved around single-word tokens.  One way to include a bit more context is to move to bigrams, trigrams and maybe beyond (N-grams).  

Let's see how this may help us get better measures of similarity.

In [6]:
for i in range(1,4):
    cv = CountVectorizer(ngram_range=(1, i))
    counts = cv.fit_transform(docs)
    print('Using %s-grams' % i)
    print(cosine_similarity(counts))

Using 1-grams
[[1.         0.79056942 0.75      ]
 [0.79056942 1.         0.79056942]
 [0.75       0.79056942 1.        ]]
Using 2-grams
[[1.         0.7333588  0.71428571]
 [0.7333588  1.         0.64168895]
 [0.71428571 0.64168895 1.        ]]
Using 3-grams
[[1.         0.62554324 0.66666667]
 [0.62554324 1.         0.55603844]
 [0.66666667 0.55603844 1.        ]]


You can see here that with only unigrams, the second review, which has a negation of the word "bad", is marked as just as similar to the "good review" as the "bad review".  But once you get to the bigrams and trigrams, the second review is closer to the good review, which actually makes more sense if you read it.

But let's take a look at what this does to the vocabulary size:

In [12]:
for i in range(1,4):
    cv = CountVectorizer(ngram_range=(1, i))
    counts = cv.fit_transform(docs)
    print('Using %s-grams' % i)
    print(cv.vocabulary_)
    print(len(cv.vocabulary_))

Using 1-grams
{'the': 5, 'movie': 3, 'was': 6, 'good': 1, 'not': 4, 'bad': 0, 'it': 2}
7
Using 2-grams
{'the': 9, 'movie': 5, 'was': 11, 'good': 2, 'the movie': 10, 'movie was': 6, 'was good': 13, 'not': 7, 'bad': 0, 'it': 3, 'was not': 14, 'not bad': 8, 'bad it': 1, 'it was': 4, 'was bad': 12}
15
Using 3-grams
{'the': 15, 'movie': 7, 'was': 18, 'good': 3, 'the movie': 16, 'movie was': 8, 'was good': 20, 'the movie was': 17, 'movie was good': 10, 'not': 12, 'bad': 0, 'it': 4, 'was not': 21, 'not bad': 13, 'bad it': 1, 'it was': 5, 'movie was not': 11, 'was not bad': 22, 'not bad it': 14, 'bad it was': 2, 'it was good': 6, 'was bad': 19, 'movie was bad': 9}
23


And this is just a simple corpus! Imagine if we had a realistic set of reviews, we could imagine many possible combinations of bigrams and trigrams.

This is one of the reasons why it makes sense to move into sequence-based models, where we have some information being shared throughout the full parsing of the document.  This leads us to:

## Recurrent Neural Networks


In [13]:
def doc_to_index(docs, vocab):
    # transform docs into series of indices
    docs_idxs = []
    for d in docs:
        w_idxs = []
        for w in d:
            if w in vocab:
                w_idxs.append(vocab[w])
            else:
                # unknown token = 1
                w_idxs.append(1)
        docs_idxs.append(w_idxs)
    return(docs_idxs)

def pad_sequence(seqs, seq_len=200):
    # function for adding padding to ensure all seq same length
    features = np.zeros((len(seqs), seq_len),dtype=int)
    for i, seq in enumerate(seqs):
        if len(seq) != 0:
            features[i, -len(seq):] = np.array(seq)[:seq_len]
    return features

def onehot_encode(data, vocab, seq_len=200):
    # given dataset, turn each observation into a set of one-hot encoded vector
    onehot_data = np.zeros((len(data), seq_len, len(vocab)),
                          dtype='float32')
    for i, d in enumerate(data):
        for ii, w in enumerate(d):
            onehot_data[i, ii, w] = 1
    return(onehot_data)

In [14]:
# you will need to change this to where ever the file is stored
data_location = '../data/assignment_1_reviews.pkl'
with open(data_location, 'rb') as f:
    all_text = pickle.load(f)
neg, pos = all_text.values()
# join all reviews
all_reviews = np.array(neg+pos)
# create binary indicator for positive review
is_positive = np.array([0]*len(neg)+[1]*len(pos))

I also want to introduce the idea of a "validation set" here.  In training a Neural Net, often you want to leave testing the performance against the test set for last, when you have your final model.  Otherwise, you risk the model being fit to the test set, which would mean you wouldn't have an accurate read on the performance of your model on "new" data.  

There's a lot of guidance on how to split this up, but we're just going to take 30% of our training data and make that our validation data.

### Exercise: Create a training, validation and test dataset and fit a baseline model
1) Create a training, validation, test split.  I recommend 50(train)/20(val)/30(test)

2) Fit a simple SVM model on word counts, either raw or TF-IDF 

3) Calculate accuracy 

In [35]:
# set the seed for numpy
np.random.seed(seed=42)
# shuffle, just for safety
shuffled_idxs = np.random.choice(range(len(all_reviews)), size=len(all_reviews),replace=False)
all_reviews = all_reviews[shuffled_idxs]
is_positive = is_positive[shuffled_idxs]
# sample random 70% for fitting model (training)
# we'll also add a validation set, for checking the progress of the model during training
# 30% will be simulating "new observations" (testing)
pct_train = 0.7
train_bool = np.random.random(len(all_reviews))<=pct_train
reviews_train = all_reviews[train_bool]
reviews_test = all_reviews[~train_bool]
is_positive_train = is_positive[train_bool]
is_positive_test = is_positive[~train_bool]
# making a validation set
pct_val = 0.3
val_idxs = np.random.random(size=len(reviews_train))<=pct_val
is_positive_val = is_positive_train[val_idxs]
is_positive_val.shape
reviews_val = reviews_train[val_idxs]
# reconfigure train so that it doesn't include validation
reviews_train = reviews_train[~val_idxs]
is_positive_train = is_positive_train[~val_idxs]
print(len(reviews_train), len(reviews_val), len(reviews_test))

1233 535 731


In [48]:
# transform all data to work with model
# tokenizing ahead of time for easier match with word idx
parsed_train = [simple_tokenizer(str(d)) for d in reviews_train]
parsed_val = [simple_tokenizer(str(d)) for d in reviews_val]
parsed_test = [simple_tokenizer(str(d)) for d in reviews_test]
# this formulation works if you have previously tokenized
cv = CountVectorizer(tokenizer=lambda doc: doc, lowercase=False, min_df=0.01)
# **important** just fit on trained: prevents information from test in training 
cv.fit(parsed_train)
# get out the vocab
vocab = cv.vocabulary_
print("Size of vocab:", len(vocab))

Size of vocab: 1777


In [49]:
svc = LinearSVC(random_state=92)
svc.fit(cv.transform(parsed_train), is_positive_train)
base_accuracy = accuracy_score(is_positive_test,
                               svc.predict(cv.transform(parsed_test)))
print(base_accuracy)

0.79890560875513


A note here: We'll be using this vocabulary to transform each token to an index (numeric).  However there are two "special" tokens that we'll need to add:

\_PAD: The model expects all inputs to be of the same length.  So we've specified a sequence length.  If a document is longer than that, it gets truncated.  If it's shorted than that, it gets padded.  This token indicates that a particular element of the input document is padding.  This is useful information for the model

\_UNK: Depending on the vocab design, we may have certain tokens that are not included (i.e. do not have an index).  Any of these tokens are labelled as "unknown".

In [50]:
# need to adapt vocab, leave space for padding
vocab = dict([(v, vocab[v]+2) for v in vocab])
vocab['_UNK'] = 1
vocab['_PAD'] = 0
parsed_train = doc_to_index(parsed_train, vocab)
padded_train = pad_sequence(parsed_train)
parsed_val = doc_to_index(parsed_val, vocab)
padded_val = pad_sequence(parsed_val)
parsed_test = doc_to_index(parsed_test, vocab)
padded_test = pad_sequence(parsed_test)
# onehot encoding
# create a "weight matrix" for using Embedding layer in PyTorch
onehot_matrix = np.zeros(shape=(len(vocab), len(vocab)))
np.fill_diagonal(onehot_matrix, 1)

In [51]:
# construct datasets for loading by PyTorch
train_data = TensorDataset(torch.from_numpy(padded_train), torch.from_numpy(is_positive_train))
val_data = TensorDataset(torch.from_numpy(padded_val), torch.from_numpy(is_positive_val))
test_data = TensorDataset(torch.from_numpy(padded_test), torch.from_numpy(is_positive_test))

batch_size = 100

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size,
                         drop_last=True) # this is to keep the size consistent
val_loader = DataLoader(val_data, shuffle=True, batch_size=batch_size,
                       drop_last=True)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size,
                        drop_last=True)

In [12]:
class SentimentNet(nn.Module):
    # sentiment classifier with single LSTM layer + Fully-connected layer, sigmoid activation and dropout
    # adapted from https://blog.floydhub.com/long-short-term-memory-from-zero-to-hero-with-pytorch/
    def __init__(self,
                 weight_matrix=None,
                 vocab_size=1000, 
                 output_size=1,  
                 hidden_dim=512,
                 embedding_dim=400, 
                 n_layers=2, 
                 dropout_prob=0.5):
        super(SentimentNet, self).__init__()
        # size of the output, in this case it's one input to one output
        self.output_size = output_size
        # number of layers (default 2) one LSTM layer, one fully-connected layer
        self.n_layers = n_layers
        # dimensions of our hidden state, what is passed from one time point to the next
        self.hidden_dim = hidden_dim
        # initialize the representation to pass to the LSTM
        self.embedding, embedding_dim = self.init_embedding(
            vocab_size, 
            embedding_dim, 
            weight_matrix)
        # LSTM layer, where the magic happens
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=dropout_prob, batch_first=True)
        # dropout, similar to regularization
        self.dropout = nn.Dropout(dropout_prob)
        # fully connected layer
        self.fc = nn.Linear(hidden_dim, output_size)
        # sigmoid activiation
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x, hidden):
        # forward pass of the network
        batch_size = x.size(0)
        # transform input
        embeds = self.embedding(x)
        # run input embedding + hidden state through model
        lstm_out, hidden = self.lstm(embeds, hidden)
        # reshape
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)
        # dropout certain pct of connections
        out = self.dropout(lstm_out)
        # fully connected layer
        out = self.fc(out)
        # activation function
        out = self.sigmoid(out)
        # reshape
        out = out.view(batch_size, -1)
        out = out[:,-1]
        # return the output and the hidden state
        return out, hidden
    
    def init_embedding(self, vocab_size, embedding_dim, weight_matrix):
        # initializes the embedding
        if weight_matrix is None:
            if vocab_size is None:
                raise ValueError('If no weight matrix, need a vocab size')
            # if embedding is a size, initialize trainable
            return(nn.Embedding(vocab_size, embedding_dim),
                   embedding_dim)
        else:
            # otherwise use matrix as pretrained
            weights = torch.FloatTensor(weight_matrix)
            return(nn.Embedding.from_pretrained(weights),
                  weights.shape[1])
    
    def init_hidden(self, batch_size):
        # initializes the hidden state
        hidden = (torch.zeros(self.n_layers, batch_size, self.hidden_dim).to(device),
                  torch.zeros(self.n_layers, batch_size, self.hidden_dim).to(device))
        return hidden

In [13]:
model_params = {'weight_matrix': onehot_matrix,
               'output_size': 1,
               'hidden_dim': 512,
               'n_layers': 2,
               'embedding_dim': 400,
               'dropout_prob': 0.2}

model = SentimentNet(**model_params)
model.to(device)

SentimentNet(
  (embedding): Embedding(1767, 1767)
  (lstm): LSTM(1767, 512, num_layers=2, batch_first=True, dropout=0.2)
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [14]:
lr=0.005
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
# increasing this will make the training take a while on CPU
# decrease to 5 if it's taking too long
epochs = 1
counter = 0
print_every = 5
clip = 5
valid_loss_min = np.Inf

model.train()
for i in range(epochs):
    h = model.init_hidden(batch_size)
    for inputs, labels in train_loader:
        counter += 1
        h = tuple([e.data for e in h])
        inputs, labels = inputs.to(device), labels.to(device)
        model.zero_grad()
        output, h = model(inputs, h)
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        
        if counter%print_every == 0:
            val_h = model.init_hidden(batch_size)
            val_losses = []
            model.eval()
            for inp, lab in val_loader:
                val_h = tuple([each.data for each in val_h])
                inp, lab = inp.to(device), lab.to(device)
                out, val_h = model(inp, val_h)
                val_loss = criterion(out.squeeze(), lab.float())
                val_losses.append(val_loss.item())
                
            model.train()
            print("Epoch: {}/{}...".format(i+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
            if np.mean(val_losses) <= valid_loss_min:
                torch.save(model.state_dict(), './state_dict.pt')
                print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(valid_loss_min,np.mean(val_losses)))
                valid_loss_min = np.mean(val_losses)

Epoch: 1/1... Step: 5... Loss: 0.691114... Val Loss: 0.693912
Validation loss decreased (inf --> 0.693912).  Saving model ...
Epoch: 1/1... Step: 10... Loss: 0.692276... Val Loss: 0.695976


Let's now compare the LSTM performance to our SVM performance

In [52]:
# pytorch LSTM model
model.load_state_dict(torch.load('./state_dict.pt'))
h = model.init_hidden(batch_size)
num_correct = 0
model.eval()
for inputs, labels in test_loader:
    h = tuple([each.data for each in h])
    inputs, labels = inputs.to(device), labels.to(device)
    output, h = model(inputs, h)
    # takes output, rounds to 0/1
    pred = torch.round(output.squeeze())
    # take the correct labels, check against preds
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.cpu().numpy())
    # sum the number of correct
    num_correct += np.sum(correct)
# calc accuracy
test_acc = num_correct/len(test_loader.dataset)
print('LSTM accuracy:', test_acc)

NameError: name 'model' is not defined

Yikes.  All that work and we have a model that doesn't perform as well as the simple model.

There's several issues to be elaborated in the slides.  But the main one we're going to focus on is word embeddings.  Currently each element in an observation is a one-hot encoded vector for word index.  That's a pretty huge vector, and it's mostly zero.  What if we had a more dense, informative representation of an individual word?

### Word-level representations
Remember from Week 2: Our document-level representations cam also be used to create word-level representations.  For count vectors and tfidf vectors, we can just invert the matrix from document-word to word-document.  For matrix factorization, part of the estimation involves creating a word-component matrix.

In [64]:
cv = CountVectorizer(tokenizer=simple_tokenizer)
tfidf = TfidfVectorizer(tokenizer=simple_tokenizer)
# get vectors
count_vecs = cv.fit_transform(all_reviews)
tfidf_vecs = tfidf.fit_transform(all_reviews)
n_components = 10
nmf = NMF(n_components=n_components)
nmf_vecs = nmf.fit_transform(tfidf_vecs)
lda = LatentDirichletAllocation(n_components=n_components)
lda_vecs = lda.fit_transform(count_vecs)

In [65]:
# create word-level representations
count_words = count_vecs.T
tfidf_words = tfidf_vecs.T
nmf_words = nmf.components_.T
lda_words = lda.components_.T
for rep in [count_words, tfidf_words, nmf_words, lda_words]:
    print(rep.shape)

(27053, 2499)
(27053, 2499)
(27053, 10)
(27053, 10)


Ideally, these word-level representations encode some amount of the word's meaning in them.  So let's test it with a few words.  Intuitively, we know that the words "good" and "bad" should be pretty different from eachother (semantically, at least).  We know that "good" and "great" should be pretty similar.  Let's see how our representations capture that.

In [66]:
seed_words = ['good', 'great', 'bad']
# get index of seed words
seed_idxs = [cv.vocabulary_[w] for w in seed_words]
for rep in [count_words, tfidf_words, nmf_words, lda_words]:
    print(cosine_similarity(rep[seed_idxs]))

[[1.         0.32265411 0.32591993]
 [0.32265411 1.         0.17146286]
 [0.32591993 0.17146286 1.        ]]
[[1.         0.2376901  0.26042242]
 [0.2376901  1.         0.10767276]
 [0.26042242 0.10767276 1.        ]]
[[1.         0.71931288 0.77801908]
 [0.71931288 1.         0.55809816]
 [0.77801908 0.55809816 1.        ]]
[[1.         0.99812414 0.95825282]
 [0.99812414 1.         0.94664778]
 [0.95825282 0.94664778 1.        ]]


None of them seem to do great with the good to bad similarity; they're often more similar than good to great.  Bad to great, however, seems more promising.

But generally: These representations are based on a very small corpus and a very specific context.  What word representations (or embeddings) like Word2vec or GloVe try to do is make more general representations of the word based on its context in a large corpus of non-specific context.  Let's see how SpaCy's GloVe-based representations do on this task.

In [67]:
# only the md and lg models contain GloVe vectors
# you may need to run this on colab
#!python -m spacy download en_core_web_md
import en_core_web_md
nlp = en_core_web_md.load()

In [68]:
# we need to parse it with the model, then we can use the vector attribute
glove_words = [nlp(w).vector for w in seed_words]
# each vector is 300-dimensional dense representation
print(glove_words[0][:10])
print(glove_words[0].shape)

[-0.42625   0.4431   -0.34517  -0.1326   -0.05816   0.052598  0.21575
 -0.36721  -0.04519   2.2444  ]
(300,)


In [69]:
cosine_similarity(glove_words)

array([[1.0000004, 0.8416708, 0.7355092],
       [0.8416708, 1.       , 0.5404425],
       [0.7355092, 0.5404425, 1.       ]], dtype=float32)

Wow! This works really well.  We can see that good is pretty close to great, less close to bad.  Bad is far from good, but farther from great.  This is what we'd intuitively expect.

This isn't to say that GloVe is always preferred.  Depending on the context, other word representations might be more useful.  But let's go with GloVe and run our RNN model with this instead of the sparse representation.

### Exercise: RNN with GloVe vectors
Instead of one-hot vectors for each word, we need the 300-dimensional word vector.  Think of a method for incorporating that into the model.  Think about the weight matrix we're using and how you might reformulate that from one-hot encoding to word vectors.

Then, once you've created that weight matrix, re-run the LSTM and compare the results to the other methods.

In [71]:
# collect vectors in matrix
vocab = cv.vocabulary_
vocab = dict([(v, vocab[v]+2) for v in vocab])
vocab['_UNK'] = 1
vocab['_PAD'] = 0
glove_vecs = np.zeros(shape=(len(vocab), 300))
for k, v in vocab.items():
    glove_vecs[v] = nlp(k).vector

In [72]:
# construct datasets for loading by PyTorch
train_data = TensorDataset(torch.from_numpy(padded_train), torch.from_numpy(is_positive_train))
val_data = TensorDataset(torch.from_numpy(padded_val), torch.from_numpy(is_positive_val))
test_data = TensorDataset(torch.from_numpy(padded_test), torch.from_numpy(is_positive_test))

batch_size = 100

train_loader = DataLoader(train_data, shuffle=True, batch_size=batch_size,
                         drop_last=True) # this is to keep the size consistent
val_loader = DataLoader(val_data, shuffle=True, batch_size=batch_size,
                       drop_last=True)
test_loader = DataLoader(test_data, shuffle=True, batch_size=batch_size,
                        drop_last=True)

In [73]:
model_params = {'weight_matrix': glove_vecs,
               'output_size': 1,
               'hidden_dim': 512,
               'n_layers': 2,
               'embedding_dim': 400,
               'dropout_prob': 0.2}

model = SentimentNet(**model_params)
model.to(device)

SentimentNet(
  (embedding): Embedding(27055, 300)
  (lstm): LSTM(300, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)

In [74]:
lr=0.005
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)
epochs = 2
counter = 0
print_every = 5
clip = 5
valid_loss_min = np.Inf

model.train()
for i in range(epochs):
    h = model.init_hidden(batch_size)
    for inputs, labels in train_loader:
        counter += 1
        h = tuple([e.data for e in h])
        inputs, labels = inputs.to(device), labels.to(device)
        model.zero_grad()
        output, h = model(inputs, h)
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()
        nn.utils.clip_grad_norm_(model.parameters(), clip)
        optimizer.step()
        
        if counter%print_every == 0:
            val_h = model.init_hidden(batch_size)
            val_losses = []
            model.eval()
            for inp, lab in val_loader:
                val_h = tuple([each.data for each in val_h])
                inp, lab = inp.to(device), lab.to(device)
                out, val_h = model(inp, val_h)
                val_loss = criterion(out.squeeze(), lab.float())
                val_losses.append(val_loss.item())
                
            model.train()
            print("Epoch: {}/{}...".format(i+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))
            if np.mean(val_losses) <= valid_loss_min:
                torch.save(model.state_dict(), './state_dict.pt')
                print('Validation loss decreased ({:.6f} --> {:.6f}).  Saving model ...'.format(valid_loss_min,np.mean(val_losses)))
                valid_loss_min = np.mean(val_losses)

Epoch: 1/2... Step: 5... Loss: 0.713173... Val Loss: 0.740239
Validation loss decreased (inf --> 0.740239).  Saving model ...
Epoch: 1/2... Step: 10... Loss: 0.704543... Val Loss: 0.702813
Validation loss decreased (0.740239 --> 0.702813).  Saving model ...
Epoch: 2/2... Step: 15... Loss: 0.697206... Val Loss: 0.699178
Validation loss decreased (0.702813 --> 0.699178).  Saving model ...
Epoch: 2/2... Step: 20... Loss: 0.707497... Val Loss: 0.696238
Validation loss decreased (0.699178 --> 0.696238).  Saving model ...


In [75]:
# pytorch LSTM model
model.load_state_dict(torch.load('./state_dict.pt'))
h = model.init_hidden(batch_size)
num_correct = 0
model.eval()
for inputs, labels in test_loader:
    h = tuple([each.data for each in h])
    inputs, labels = inputs.to(device), labels.to(device)
    output, h = model(inputs, h)
    # takes output, rounds to 0/1
    pred = torch.round(output.squeeze())
    # take the correct labels, check against preds
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.cpu().numpy())
    # sum the number of correct
    num_correct += np.sum(correct)
# calc accuracy
test_acc = num_correct/len(test_loader.dataset)
print('LSTM accuracy:', test_acc)

LSTM accuracy: 0.4911080711354309


Depending on your initialization, you may or may not see an increase in accuracy here.  There are a couple of reasons:

- We are using a small dataset, considering some of the state-of-the-art models are fit on many millions of observations
- We have only trained for a short time here (small number of epochs).  Again, more state-of-the-art models train for hundreds of epochs.

We'll experiment with tuning parameters in Assignment \#3