**Homework 26**

In this assignment you'll explore a sentiment analysis task. We start with the usual imports:


In [64]:
import torch
from torch import nn
from torch.optim import Adam
import numpy as np

device=('cuda' if torch.cuda.is_available()
        else 'mps' if torch.backends.mps.is_available()
        else 'cpu')

device

'mps'

We now load the imdb movie review dataset that is packaged with keras. We'll only load the first 100 words of each review (if they are longer), and restrict to a 10,000 word vocabulary. The target variable y will be an array of 0's and 1's, indicating a positive or negative review.  

In [65]:
import keras
from keras.datasets import imdb

(Xtrain,ytrain),(Xtest,ytest)=imdb.load_data(maxlen=100,num_words=10000,index_from=0)

This dataset also comes with a dictionary for tokenization. The keys for this dictionary are the words in the vocabulary, and the values are the numbers assigned to each word (just like the `tokens` dictionary attribute of your `Tokenizer` class from the last assignment). We import this dictionary as `index`, and look at the first 10 key/value pairs:

In [66]:
index=imdb.get_word_index()
list(index.items())[:10]

[('fawn', 34701),
 ('tsukino', 52006),
 ('nunnery', 52007),
 ('sonja', 16816),
 ('vani', 63951),
 ('woods', 1408),
 ('spiders', 16115),
 ('hanging', 2345),
 ('woody', 2289),
 ('trawling', 52008)]

Note that Xtrain[0] is not the words of the first movie review of the dataset, but rather the indices of those words. To see the review itself, we have to convert from indices back to words. To this end, we'll build a dictionary word_from_index whose keys are indices and values are corresponding words.

In [67]:
words_from_index={}
for word in index:
  words_from_index[index[word]]=word

We can now look at review 1:

In [68]:
review1=''
for n in Xtrain[1]:
  review1+=words_from_index[n]+' '
# len(Xtrain[1])
review1

"the when i rented this movie i had very low expectations but when i saw it i realized that the movie was less a lot less than what i expected the actors were bad the doctor's wife was one of the worst the story was so stupid it could work for a disney movie except for the murders but this one is not a comedy it is a laughable masterpiece of stupidity the title is well chosen except for one thing they could add stupid movie after dead husbands i give it 0 and a half out of 5 "

Right now Xtrain is a 1-dimensional numpy array of python lists. To convert it to a 2-dimensional numpy array, we'll need every every element of Xtrain to be a list of the same length. Write a function that takes a list-of-lists, and adds an appropriate number of 0's to each if their length is less than 100. (This is called *padding*.)

In [69]:
vocab_size = len

In [70]:
# l = len(Xtest)
# temp = np.zeros(l*100)
# # temp.c
# num = temp.shape[0]
# temp = temp.reshape(l, 100)
# temp.shape
temp = np.pad

In [71]:
def pad(X):
  '''X is a list of lists of all different lengths
  function should return a list of lists of length 100 by padding with 0's'''
  x_len = X.shape[0]
  new_lol = np.zeros(x_len*100, dtype=int)
  new_lol = new_lol.reshape(x_len, 100)
  
  # Going through each list and adding to new_lol
  for i in range(x_len):
    l_len = len(X[i])
    new_lol[i, :l_len]  = X[i]

  return new_lol

We'll now use this function to convert Xtrain and Xtest to 2D numpy arrays. (At the same time we convert ytrain and ytest to numpy arrays as well.)

In [72]:
Xtrain=np.array([np.array(l) for l in pad(Xtrain)], dtype=np.int64)
Xtest=np.array([np.array(l) for l in pad(Xtest)], dtype=np.int64)

Finally, we convert all datasets to tensors and move them to the appropriate `device`.

In [73]:
X_train = torch.from_numpy(Xtrain).to(device)
y_train = torch.from_numpy(ytrain).to(device)
X_test  = torch.from_numpy(Xtest).to(device)
y_test  = torch.from_numpy(ytest).to(device)

In this homework we introduce the Embedding layer, which has the same effect as a one-hot encoding followed by a dense layer. We also use a GRU layer in place of a RNN (feel free to try an RNN or an LSTM instead).

Your model will consist of a Sequential Neural network with an Embedding layer and a GRU layer. For the embedding layer, the input_dim should be the size of your vocabulary, and the output_dim should be about 128 (you can play with that number). use 64 for the hidden dimension of the GRU, and don't forget to set `batch_first=True`.

Follow this sequential model with a separate, fully-connected linear layer with an appropriate output size for this task.

In [None]:
vocab_size = 29
model=nn.Sequential(
    nn.Embedding(vocab_size, 128), # Embedding layer
    nn.GRU(128, 64, batch_first=True) # GRU layer
).to(device)

fc=nn.Linear(64, 2).to(device)

Define an optimizer that will simultaneously adjust the parameters of both `model` and `fc`.

In [75]:
opt=torch.optim.Adam(list(model.parameters()) + list(fc.parameters()), lr=0.001)

Define a loss function:

In [76]:
criterion = nn.CrossEntropyLoss()

Write your training loop! Use 100 epochs, with batches of size 32. Report losses as you train.

In [77]:
epochs = 100
batch_size = 32
N = X_train.shape[0]

model.train()
for epoch in range(epochs):
    epoch_loss = 0.0

    indices = torch.randperm(N, device=device)

    for i in range(0, N, batch_size):
        batch_indices = indices[i:i+batch_size]
        batch_X = X_train[batch_indices]
        batch_y = y_train[batch_indices]

        opt.zero_grad()
        last_hidden = model(batch_X)[1].squeeze(0)
        out = fc(last_hidden)
        loss = criterion(out,batch_y)
        loss.backward()
        opt.step()

        epoch_loss += loss.item()*batch_size

    if epoch%5 ==0:
        avg_loss = epoch_loss/(len(y_test) + len(y_train))
        print(f"epoch: {epoch}, avg_loss: {avg_loss}")

epoch: 0, avg_loss: 0.33549837986295694
epoch: 5, avg_loss: 0.31013016454678033
epoch: 10, avg_loss: 0.28074329085596106
epoch: 15, avg_loss: 0.24452823940846377
epoch: 20, avg_loss: 0.19895589226958144
epoch: 25, avg_loss: 0.1860415929697215
epoch: 30, avg_loss: 0.12904205132727842
epoch: 35, avg_loss: 0.08321246745862389
epoch: 40, avg_loss: 0.08105885526334225
epoch: 45, avg_loss: 0.05013198275612654
epoch: 50, avg_loss: 0.0462831724513026
epoch: 55, avg_loss: 0.05372289820695023
epoch: 60, avg_loss: 0.025505408618697895
epoch: 65, avg_loss: 0.04993217943997397
epoch: 70, avg_loss: 0.08572232372009272
epoch: 75, avg_loss: 0.012954247801082737
epoch: 80, avg_loss: 0.0027624656838550646
epoch: 85, avg_loss: 0.0005818421200191865
epoch: 90, avg_loss: 0.0003054038652496846
epoch: 95, avg_loss: 0.00019034788478815322


Evaluate the accuracy of your model on Xtest and ytest. (It should be around 80%.)

In [93]:
model.eval()
lh = model(X_test)[1].squeeze(0)
out = fc(lh)
out.shape
out.sigmoid()
preds = torch.argmax(out, dim=1)
preds
# y_test

tensor([0, 0, 0,  ..., 1, 0, 1], device='mps:0')

In [94]:
model.eval()
accuracy= torch.sum(preds * y_test)/y_test.shape[0]

accuracy

# y_test

tensor(0.3456, device='mps:0')

Take a screen shot of this accuracy and upload to Gradescope!