# KAIST AI605 Assignment 1: Text Classification with RNNs
Authors: Hyeong-Gwon Hong (honggudrnjs@kaist.ac.kr) and Minjoon Seo (minjoon@kaist.ac.kr)

**Due Date:** March 31 (Wed) 11:00pm, 2021

## Assignment Objectives
- Verify theoretically and empirically why gating mechanisms (LSTM, GRU) help in Recurrent Neural Networks (RNNs).
- Design an LSTM-based text classification model from scratch using PyTorch.
- Apply the classification model to a popular classification task, Stanford Sentiment Treebank v2 (SST-2).
- Achieve higher accuracy by applying common machine learning strategies, including Dropout.
- Utilize pretrained word embeddings (e.g. GloVe) to leverage self-supervision over a large text corpus.
- (Bonus) Use the Hugging Face library (`transformers`) to leverage self-supervision via large language models.

## Your Submission
Your submission will be a link to a Colab notebook that has all written answers and is fully executable. You will submit your assignment via KLMS. Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed, but this is not a group assignment, so make sure your answers and code are your own. Mention your collaborators in your assignment with their names and student IDs.

## Grading
The entire assignment is out of 100 points. There are four bonus questions with 10 points each (two bonus questions added on Mar 19). Your final score can be higher than 100 points.


## Environment
You will only use Python 3.7 and PyTorch 1.8, which are already available on Colab:

In [None]:
from platform import python_version
import torch

print("python", python_version())
print("torch", torch.__version__)

python 3.7.10
torch 1.8.1+cu101


## 1. Limitations of Vanilla RNNs
In Lecture 04 and 05, we saw how RNNs suffer from exploding or vanishing gradients. We mathematically showed that, if the recurrent relation is
$$ \textbf{h}_t = \sigma (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}) $$
then
$$ \frac{\partial \textbf{h}_t}{\partial \textbf{h}_{t-1}} = \text{diag}(\sigma' (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}))\textbf{V}$$
so
$$\frac{\partial \textbf{h}_T}{\partial \textbf{h}_1} \propto \textbf{V}^{T-1}$$
which means this term will be very close to zero if the norm of $\bf{V}$ is smaller than 1 and very large if it is larger than 1.
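A scalar sketch of this effect, assuming a one-dimensional hidden state so that $\bf{V}$ reduces to a single number `v` and the activation derivative is treated as 1 (both are simplifications made here for illustration; `recurrent_factor` is a hypothetical helper name):

```python
def recurrent_factor(v, T):
    """Magnitude of v^(T-1), the scalar analogue of V^(T-1)."""
    return abs(v) ** (T - 1)

# Compare a contracting, a neutral, and an expanding recurrent weight.
for v in (0.9, 1.0, 1.1):
    factors = [recurrent_factor(v, T) for T in (10, 50, 100)]
    print(v, factors)
```

With `v = 0.9` the factor shrinks toward zero as `T` grows, and with `v = 1.1` it grows without bound, which mirrors the vanishing and exploding cases.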

**Problem 1.1** *(10 points)* Explain how exploding gradient can be mitigated if we use gradient clipping.

**answer:**
Gradient clipping rescales the gradient whenever its norm exceeds a threshold:
$$ \textbf{g} \leftarrow \text{threshold} \cdot \frac{\textbf{g}}{||\textbf{g}||} \quad \text{if } ||\textbf{g}|| > \text{threshold} $$
(Here, $\textbf{g} = \frac{\partial \boldsymbol{\epsilon}}{\partial \boldsymbol{\theta}}$ is the gradient of the loss with respect to the parameters, and $\frac{\textbf{g}}{||\textbf{g}||}$ has unit norm.)
So the norm of the gradient is capped at the threshold whenever it would exceed it, which mitigates the exploding gradient problem. Clipping keeps the direction of the gradient, so the learning direction is preserved. Because only the step size shrinks, it also has the effect of automatically reducing the effective learning rate on those steps.
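The rescaling rule can be sketched in plain Python. This is a minimal illustration, assuming the gradient is a flat list of floats and the L2 norm is used; `clip_by_norm` is a hypothetical helper, not part of the assignment code. In practice PyTorch provides `torch.nn.utils.clip_grad_norm_` for the same purpose.

```python
import math

def clip_by_norm(grad, threshold):
    """Rescale grad to L2 norm `threshold` if its norm exceeds the threshold."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= threshold:
        return list(grad)            # small gradients pass through unchanged
    scale = threshold / norm
    return [g * scale for g in grad]  # same direction, norm equal to threshold

exploding = [300.0, -400.0]           # L2 norm is 500
clipped = clip_by_norm(exploding, 5.0)
print(clipped, math.hypot(*clipped))
```

The clipped vector points in the same direction as the original but its norm equals the threshold.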

**Problem 1.2** *(10 points)* Explain how vanishing gradient can be mitigated if we use LSTM. See the Lecture 04 and 05 slides for the definition of LSTM.

**answer:**
The LSTM cell state is updated as
$$ \textbf{C}_{t} = \textbf{f}_{t} * \textbf{C}_{t-1} + \textbf{i}_{t} * \tilde{\textbf{C}}_{t} $$
By controlling the forget gate $\textbf{f}_{t}$, the model controls how much of the previous cell state is kept, and therefore how much gradient flows back through the cell state. Furthermore, when the parameters are substituted into $\frac{\partial \textbf{C}_{t}}{\partial \textbf{C}_{t-1}}$, the derivative becomes a sum of terms (one of them is $\textbf{f}_{t}$ itself, the others involve the derivatives of the gates) rather than a single repeated product with $\textbf{V}$, which makes it easier to keep the gradient values balanced during backpropagation.
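As a simplified sketch of this argument, assume that the indirect dependence of the gates on $\textbf{C}_{t-1}$ through $\textbf{h}_{t-1}$ is ignored (an assumption made here for illustration only). Then

$$ \frac{\partial \textbf{C}_{t}}{\partial \textbf{C}_{t-1}} \approx \text{diag}(\textbf{f}_{t}), \qquad \frac{\partial \textbf{C}_{T}}{\partial \textbf{C}_{1}} \approx \prod_{t=2}^{T} \text{diag}(\textbf{f}_{t}) $$

so when the forget gates stay close to 1, the product does not shrink geometrically the way $\textbf{V}^{T-1}$ does in the vanilla RNN.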

## 2. Creating Vocabulary from Training Data
Creating the vocabulary is the first step for every natural language processing model. In this section, you will use Stanford Sentiment Treebank v2, a popular dataset for sentiment classification, to create your vocabulary.

### Obtaining SST-2 via GLUE
The General Language Understanding Evaluation (GLUE) benchmark is a collection of tools for evaluating the performance of models across a diverse set of existing natural language understanding (NLU) tasks. See the GLUE website (https://gluebenchmark.com/) and the GLUE paper (https://openreview.net/pdf?id=rJ4km2R5t7) for more details. GLUE provides an easy way to access the datasets, including SST-2.
You can download the SST-2 dataset by following the steps below:

1. Clone GitHub repository:

In [None]:
!git clone https://github.com/nyu-mll/GLUE-baselines.git

Cloning into 'GLUE-baselines'...
remote: Enumerating objects: 5, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 891 (delta 1), reused 2 (delta 0), pack-reused 886
Receiving objects: 100% (891/891), 1.48 MiB | 15.28 MiB/s, done.
Resolving deltas: 100% (610/610), done.


2. Download SST-2 only:

In [None]:
%cd GLUE-baselines/
!python download_glue_data.py --data_dir glue_data --tasks SST

/content/GLUE-baselines
Downloading and extracting SST...
	Completed!


Your training, dev, and test data can be found at `glue_data/SST-2`. Note that each file is in a tsv format, where the first column is the sentence and the second column is the label (either 0 or 1, where 1 means positive sentiment). 

In [None]:
!head -10 glue_data/SST-2/train.tsv
#%ls glue_data/SST-2

!head -10 glue_data/SST-2/dev.tsv

sentence	label
hide new secretions from the parental units 	0
contains no wit , only labored gags 	0
that loves its characters and communicates something rather beautiful about human nature 	1
remains utterly satisfied to remain the same throughout 	0
on the worst revenge-of-the-nerds clichés the filmmakers could dredge up 	0
that 's far too tragic to merit such superficial treatment 	0
demonstrates that the director of such hollywood blockbusters as patriot games can still turn out a small , personal film with an emotional wallop . 	1
of saucy 	1
a depressed fifteen-year-old 's suicidal poetry 	0
sentence	label
it 's a charming and often affecting journey . 	1
unflinchingly bleak and desperate 	0
allows us to hope that nolan is poised to embark a major career as a commercial yet inventive filmmaker . 	1
the acting , costumes , music , cinematography and sound are all astounding given the production 's austere locales . 	1
it 's slow -- very , very slow . 	0
although laced with humor a
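The two-column layout shown above can be parsed with the standard `csv` module. The sketch below uses a tiny in-memory stand-in for `train.tsv` (the two rows are hypothetical, written in the same layout) so that it runs without the download:

```python
import csv
import io

# Hypothetical rows in the same "sentence<TAB>label" layout as train.tsv.
sample = "sentence\tlabel\na charming journey \t1\nunflinchingly bleak \t0\n"

reader = csv.reader(io.StringIO(sample), delimiter="\t")
header = next(reader)                       # first row holds the column names
rows = [(sentence, int(label)) for sentence, label in reader]

print(header)
print(rows)
```

Note that each sentence keeps a trailing space before the tab, as in the real files, so splitting on `' '` yields an empty string at the end of the token list.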

**Problem 2.1** *(10 points)* Using a space tokenizer, create the vocabulary for the training data and report the vocabulary size here. Make sure that you add an `UNK` token to the vocabulary to account for words (during inference time) that you haven't seen. See below for an example with a short text.

**answer:**
The original vocabulary size is 14816. After adding the `UNK` token, the size becomes 14817.

In [None]:
# Space tokenization
text = "Hello world!"
tokens = text.split(' ')
print(tokens)

['Hello', 'world!']


In [None]:
# Constructing vocabulary with `UNK`
vocab = ['UNK'] + list(set(text.split(' ')))
word2id = {word: id_ for id_, word in enumerate(vocab)}
print(vocab)
print(word2id['Hello'])

['UNK', 'world!', 'Hello']
2


In [None]:
# prob 2.1

import pandas as pd
import csv

whole_words = [] # every token in the training data, duplicates included
sentences = [] # one entry per training sentence
labels = []

# read the training file and close it automatically when done
with open("glue_data/SST-2/train.tsv") as tsv_file:
  read_tsv = csv.reader(tsv_file, delimiter = "\t")
  next(read_tsv) # skip header

  for row in read_tsv:
    sentences.append(row[0])
    labels.append(int(row[1]))

    words = row[0].split(' ')
    words.remove('') # drop the '' produced by the trailing space
    whole_words.extend(words)

whole_vocab = list(set(whole_words)) # unique tokens
whole_vocab.insert(0, 'UNK')
print(len(whole_vocab))

# make dictionary of vocab to id
vocab2id = {word: id_ for id_, word in enumerate(whole_vocab)}

# sort (sentence, label) pairs by token count so each mini-batch holds similar lengths
sorted_pairs = sorted(zip(sentences, labels), key=lambda x: len(x[0].split(" ")))
sorted_sentences = [sentence for sentence, label in sorted_pairs]
sorted_labels = [label for sentence, label in sorted_pairs]

14817


In [None]:
# load dev data (sentences, labels)

dev_sentences = []
dev_labels = []

with open("glue_data/SST-2/dev.tsv") as tsv_file_dev:
  read_dev = csv.reader(tsv_file_dev, delimiter = "\t")
  next(read_dev) # skip header

  for dev_row in read_dev:
    dev_sentences.append(dev_row[0])
    dev_labels.append(int(dev_row[1]))

dev_sentence_length = 0 # max sentence length on dev data
for sentence in dev_sentences:
  words = sentence.split(" ")
  if len(words) > dev_sentence_length:
    dev_sentence_length = len(words)

# sort (sentence, label) pairs by token count, as for the training data
dev_sorted_pairs = sorted(zip(dev_sentences, dev_labels), key=lambda x: len(x[0].split(" ")))
dev_sorted_sentences = [sentence for sentence, label in dev_sorted_pairs]
dev_sorted_labels = [label for sentence, label in dev_sorted_pairs]

**Problem 2.2** *(10 points)* Using all words in the training data will make the vocabulary very big. Reduce its size by only including words that occur at least 2 times. How does the size of the vocabulary change?

**answer:**
The vocabulary size drops from 14817 (Problem 2.1) to 14310, both counts including the `UNK` token.
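The effect of the frequency threshold is easy to see on a toy corpus. The sketch below is an illustration only: the three sentences, `min_count`, and the `encode` helper are hypothetical, and words below the threshold fall back to the `UNK` id 0.

```python
from collections import Counter

corpus = ["a good movie", "a bad movie", "a good plot"]  # hypothetical toy data
counts = Counter(word for line in corpus for word in line.split(" "))

min_count = 2  # keep only words seen at least this many times
vocab = ["UNK"] + sorted(w for w, c in counts.items() if c >= min_count)
word2id = {w: i for i, w in enumerate(vocab)}

def encode(sentence):
    """Map each token to its id, or to 0 (UNK) if it is not in the vocabulary."""
    return [word2id.get(w, 0) for w in sentence.split(" ")]

print(vocab)                   # ['UNK', 'a', 'good', 'movie']
print(encode("a good twist"))  # [1, 2, 0]
```

The rare words `bad` and `plot` are dropped from the vocabulary, and the unseen word `twist` is mapped to `UNK` at encoding time.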

In [None]:
# prob 2.2

from collections import Counter

res = Counter(whole_words)
#print(res)
#print(len(res))

vocab_adj = [] # adjusted vocab: tokens that appear at least twice
kept_token_count = 0 # total occurrences covered by the kept tokens (avoids shadowing the built-in sum)
for key in res:
  if res[key] >= 2:
    vocab_adj.append(key)
    kept_token_count += res[key]
vocab_adj.insert(0, 'UNK')
#print(vocab_adj)
print(len(vocab_adj))

vocabs2id = {vocab: id_ for id_, vocab in enumerate(vocab_adj)} # vocab_adj to dictionary

14310


## 3. Text Classification Baselines

You can now use the vocabulary constructed from the training data to create an embedding matrix. You will use the embedding matrix to map each input sequence of tokens to a list of embedding vectors. One of the simplest baselines is to pass each embedding through one neural network layer, average the outputs, and classify the averaged vector:

In [None]:
from torch import nn

input_ = "hi world!"
input_tokens = input_.split(' ')
input_ids = [word2id[word] if word in word2id else 0 for word in input_tokens]
input_tensor = torch.LongTensor([input_ids]) # the first dimension is minibatch size
print(input_tensor)

tensor([[0, 1]])


In [None]:
# One layer, average pooling and classification
class Baseline(nn.Module):
  def __init__(self, d, vocab):
    super(Baseline, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.layer = nn.Linear(d, d, bias=True)
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(d, 2, bias=True)

  def forward(self, input_tensor):
    #print('input tensor: ', input_tensor.shape)
    emb = self.embedding(input_tensor)
    #print('emb shape: ',emb.shape)
    out = self.relu(self.layer(emb))
    avg = out.mean(1)
    logits = self.class_layer(avg)
    return logits

d = 3 # usually bigger, e.g. 128
baseline = Baseline(d, vocab)
logits = baseline(input_tensor)
softmax = nn.Softmax(1)
print(softmax(logits)) # probability for each class

tensor([[0.3975, 0.6025]], grad_fn=<SoftmaxBackward>)


Now we will compute the loss, which is the negative log probability that the model assigns to the target label (`1`) for the input text. This turns out to be equivalent to the cross entropy (https://en.wikipedia.org/wiki/Cross_entropy) between the predicted probability distribution and a one-hot distribution of the target label. Note that we pass `logits` instead of `softmax(logits)` to the cross entropy, which allows us to avoid numerical instability.

In [None]:
cel = nn.CrossEntropyLoss()
label = torch.LongTensor([1]) # The ground truth label for "hi world!" is positive.
loss = cel(logits, label) # Loss, a.k.a L
print(loss)

tensor(0.5066, grad_fn=<NllLossBackward>)


Once we have the loss defined, only one step remains! We compute the gradients of the loss with respect to the parameters and update them. Fortunately, PyTorch does this for us in a very convenient way. Note that we used only one example to update the model, which is basically Stochastic Gradient Descent (SGD) with a minibatch size of 1. A recommended minibatch size in this exercise is at least 16. It is also recommended that you reuse your training data at least 10 times (i.e. 10 *epochs*).

In [None]:
optimizer = torch.optim.SGD(baseline.parameters(), lr=0.1)
optimizer.zero_grad() # reset gradients left over from any previous step
loss.backward() # compute gradients
optimizer.step() # update parameters

Once you have done this, all weight parameters will have `grad` attributes that contain the gradients of the loss with respect to them.

In [None]:
print(baseline.layer.weight.grad) # dL/dw of weights in the linear layer

tensor([[-2.2929e-05,  8.3421e-05,  8.3412e-05],
        [-1.9203e-02,  6.9867e-02,  6.9859e-02],
        [-1.7035e-01, -1.5450e-01,  5.2494e-02]])
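The minibatch recommendation above implies that sentences of different lengths must be brought to a common length before they can be stacked into one tensor. A minimal sketch, assuming token ids are plain Python lists and that id 0 is reused for padding (`pad_batch` and `minibatches` are hypothetical helper names):

```python
def pad_batch(batch_ids, pad_id=0):
    """Right-pad every id list to the length of the longest one in the batch."""
    max_len = max(len(ids) for ids in batch_ids)
    return [ids + [pad_id] * (max_len - len(ids)) for ids in batch_ids]

def minibatches(examples, batch_size):
    """Yield consecutive slices of at most batch_size examples."""
    for start in range(0, len(examples), batch_size):
        yield examples[start:start + batch_size]

examples = [[5, 2], [7], [1, 3, 4], [9, 9]]  # hypothetical token-id sequences
for batch in minibatches(examples, batch_size=2):
    print(pad_batch(batch))
```

Each padded batch is rectangular, so it can be passed to `torch.LongTensor` in the same way as `input_ids` earlier in this section.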


**Problem 3.1** *(10 points)* Properly train this average-pooling baseline model on SST-2 and report the model's accuracy on the dev data.

**answer:**
All models in this notebook share the same hyperparameters: 5 epochs, batch size 16, hidden state dimension 64, learning rate 0.003, and the Adam optimizer. For the baseline model (Problem 3.1), the RNN model (Problem 3.2), the LSTM model (Problem 4.1), and the LSTM model with dropout (Problem 4.2), the input dimension equals the hidden dimension, which is 64. With GloVe (Problem 5.1) the input dimension is 100, and with BERT (Problem 5.2) it is 768.

Each reported accuracy is measured on the dev data after training the model for 5 epochs (the code logs it as `test_epoch_acc`).

Dev accuracy for the baseline model: 0.7673613429069519
(epoch: 5, batch size: 16, hidden state dimension: 64, learning rate: 0.003. optimizer: Adam)
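The accuracy above averages per-batch accuracy over full batches only, because the number of evaluation steps is computed with integer division. As a hedged alternative (not what this notebook does), accuracy can be computed by counting correct predictions over every example, so that a final partial batch is included. The helper name `accuracy` and the toy predictions are hypothetical:

```python
def accuracy(pred_batches, label_batches):
    """Fraction of correct predictions over all examples in all batches."""
    correct = 0
    total = 0
    for preds, labels in zip(pred_batches, label_batches):
        correct += sum(int(p == y) for p, y in zip(preds, labels))
        total += len(labels)
    return correct / total

# Two batches of unequal size: the second one is a partial batch.
preds = [[1, 0, 1], [1]]
labels = [[1, 1, 1], [0]]
print(accuracy(preds, labels))  # 0.5
```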

**Problem 3.2** *(10 points)* Implement a recurrent neural network (without using PyTorch's RNN module) where the output of the linear layer not only depends on the current input but also the previous output. Report the model's accuracy on the dev data. Is it better or worse than the baseline? Why?

**answer:**
Dev accuracy for the RNN model: 0.7835649847984314
(epoch: 5, batch size: 16, hidden state dimension: 64, learning rate: 0.003. optimizer: Adam)

The dev accuracy of the RNN model is slightly higher than that of the baseline model. The RNN can work better because its hidden state acts like a memory that carries information across time steps, so it can process sentences better. However, since training runs for only 5 epochs and results vary from run to run, the RNN model sometimes reaches similar or slightly lower accuracy than the baseline model. One possible reason is overfitting. Another is the long-term dependency problem.

**Problem 3.3 (bonus)** *(10 points)* Show that the cross entropy computed above is equivalent to the negative log likelihood of the probability distribution.

**answer:**
The negative log likelihood is $- \sum_{j=1}^{M} y_{j} \log \hat{y}_{j}$. Here, $\hat{y}$ is the predicted discrete probability distribution over all $M$ possible classes, and the ground truth vector $y$ can be interpreted as a probability distribution that puts all of its probability mass on the true class. Viewed this way, the expression is exactly the cross entropy $-\sum_{x} g(x) \log f(x)$ with $g = y$ and $f = \hat{y}$. Because $y$ is one-hot, the sum reduces to $-\log \hat{y}_{c}$ for the true class $c$, so minimizing the cross entropy is the same as maximizing the log likelihood. In other words, the cross entropy is the negative log likelihood of the predicted probability distribution.
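The equivalence can be checked numerically. The sketch below is a plain-Python illustration under assumed values: the logits and target index are hypothetical, and `softmax` and `cross_entropy` are helper names introduced here.

```python
import math

def softmax(logits):
    """Softmax with max-subtraction for numerical safety."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(p_true, p_pred):
    """-sum_x g(x) log f(x); terms with g(x) = 0 contribute nothing."""
    return -sum(t * math.log(q) for t, q in zip(p_true, p_pred) if t > 0)

logits = [0.2, 1.1]  # hypothetical model outputs
target = 1           # index of the true class

probs = softmax(logits)
one_hot = [1.0 if j == target else 0.0 for j in range(len(logits))]

nll = -math.log(probs[target])       # negative log likelihood of the target
ce = cross_entropy(one_hot, probs)   # cross entropy against the one-hot label
print(nll, ce)
```

Both quantities print the same value, since the one-hot distribution selects only the target term of the sum.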

**Problem 3.4 (bonus)** *(10 points)* Why is it numerically unstable if you compute log on top of softmax?

**answer:**
If we compute log on top of softmax, the exponentials inside softmax can overflow for large logits, and a very small softmax output can underflow to 0 in floating point, so that the log returns $-\infty$. Specifically,
$$ loss(x, class) = - \log\left(\frac{\exp(x[class])}{\sum_{j}\exp(x[j])}\right) $$
To prevent this, we can use a trick based on
$$ \log\left(\frac{\exp(x_k)}{\sum_{i}\exp(x_{i})}\right) = \log\left(\frac{\exp(x_k-b)\exp(b)}{\sum_{i}\exp(x_{i}-b)\exp(b)}\right) = (x_k-b) - \log\left(\sum_{i}\exp(x_{i}-b)\right) $$
If we set $b = \max_{i} x_{i}$, every exponent is at most 0, so nothing overflows, and at least one term in the sum equals 1, so the log never receives 0. This makes the computation stable against both overflow and underflow.
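The difference between the two computations can be demonstrated in plain Python. This is a sketch with hypothetical helper names (`log_softmax_naive`, `log_softmax_stable`) and deliberately extreme logits. Python's `math.exp` raises `OverflowError` where a float32 tensor would produce `inf`, but the underlying problem is the same.

```python
import math

def log_softmax_naive(logits, k):
    """log(softmax(logits)[k]) computed literally; exp may overflow."""
    exps = [math.exp(z) for z in logits]
    return math.log(exps[k] / sum(exps))

def log_softmax_stable(logits, k):
    """Same quantity using the max-subtraction trick with b = max(logits)."""
    b = max(logits)
    return (logits[k] - b) - math.log(sum(math.exp(z - b) for z in logits))

extreme = [1000.0, 0.0]  # hypothetical logits chosen to trigger overflow
for fn in (log_softmax_naive, log_softmax_stable):
    try:
        print(fn.__name__, fn(extreme, 1))
    except (OverflowError, ValueError) as err:
        print(fn.__name__, "failed:", err)
```

For moderate logits both functions agree; only the stable version returns a finite value for the extreme input.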

In [None]:
# prob 3.1

import numpy as np
from tqdm.notebook import tqdm

# hyperparameter
train_length = len(sentences)
mini_batch_size = 16 # 
train_steps_per_epoch = train_length // mini_batch_size
total_epoch = 5 #
hidden_dimension = 64 # corresponding to d in baseline model above 
learning_rate = 0.003 #
logging_freq = 1000

test_steps_per_epoch = len(dev_sentences) // mini_batch_size

# baseline model
baseline_model = Baseline(hidden_dimension, whole_vocab) #
loss_fn = nn.CrossEntropyLoss()
#optimizer = torch.optim.SGD(baseline_model.parameters(), lr=learning_rate)
optimizer = torch.optim.Adam(baseline_model.parameters(), lr=learning_rate)
# train
for epoch in tqdm(range(total_epoch)):
  train_epoch_loss = 0
  train_epoch_acc = 0
  test_epoch_loss = 0
  test_epoch_acc = 0

  for step in tqdm(range(train_steps_per_epoch)):
    input_idx = []
    input_sentences = sorted_sentences[step * mini_batch_size:(step+1) * mini_batch_size]
    
    label = torch.LongTensor(sorted_labels[step * mini_batch_size:(step+1) * mini_batch_size])

    #compute max_sentence_length of each mini-batch
    max_sentence_length = 0
    for sentence in input_sentences:
      words = sentence.split(" ")
      if len(words) > max_sentence_length:
        max_sentence_length = len(words)

    for sentence in input_sentences: # 16 sentences each in [](idxes)
      words = sentence.split(" ")
      words.remove('')
      idxes = [vocabs2id[word] if word in vocabs2id.keys() else 0 for word in words]
      idxes = np.pad(np.array(idxes), (0, max_sentence_length - len(idxes)), 'constant', constant_values=(0, 0)) # pad with id 0 (shared with UNK) up to the longest sentence in this mini-batch
      input_idx.append(idxes)
      
    input = torch.LongTensor(input_idx)
    #print('input: ',input.shape)
    logits = baseline_model(input)
    pred_label = torch.argmax(logits, axis=1)

    acc = torch.sum(pred_label == label) / mini_batch_size
    loss = loss_fn(logits, label)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    train_epoch_loss += loss / train_steps_per_epoch
    train_epoch_acc += acc / train_steps_per_epoch
  
    # log = f''
    # log += f'step: {step}, '
    # log += f'train loss: {loss}, '
    # log += f'train acc: {acc} '
    # if step % logging_freq == 0:
    #   print(log)

  print(f'train_epoch_loss for epoch {epoch}: {train_epoch_loss}')
  print(f'train_epoch_acc for epoch {epoch}: {train_epoch_acc}')

for step in tqdm(range(test_steps_per_epoch)):
  input_idx = []
  input_sentences = dev_sorted_sentences[step * mini_batch_size:(step+1) * mini_batch_size]
    
  label = torch.LongTensor(dev_sorted_labels[step * mini_batch_size:(step+1) * mini_batch_size])
    
  #compute max_sentence_length of each mini-batch
  dev_sentence_length = 0
  for sentence in input_sentences:
    words = sentence.split(" ")
    if len(words) > dev_sentence_length:
      dev_sentence_length = len(words)
        
  for sentence in input_sentences: # 16 sentences each in [](idxes)
    words = sentence.split(" ")
    words.remove('')
    idxes = [vocabs2id[word] if word in vocabs2id.keys() else 0 for word in words]
    idxes = np.pad(np.array(idxes), (0, dev_sentence_length - len(idxes)), 'constant', constant_values=(0, 0)) # pad with id 0 (shared with UNK) up to the longest sentence in this mini-batch
    input_idx.append(idxes)
      
  input = torch.LongTensor(input_idx)
  #print('input: ',input.shape)
  logits = baseline_model(input)
  pred_label = torch.argmax(logits, axis=1)

  acc = torch.sum(pred_label == label) / mini_batch_size
  loss = loss_fn(logits, label)

  test_epoch_loss += loss / test_steps_per_epoch
  test_epoch_acc += acc / test_steps_per_epoch
  
    # log = f''
    # log += f'step: {step}, '
    # log += f'test loss: {loss}, '
    # log += f'test acc: {acc} '
    # if step % logging_freq == 0:
    #   print(log)

print(f'test_epoch_loss for epoch {epoch}: {test_epoch_loss}')
print(f'test_epoch_acc for epoch {epoch}: {test_epoch_acc}')

train_epoch_loss for epoch 0: 0.37323445081710815
train_epoch_acc for epoch 0: 0.8253195285797119

train_epoch_loss for epoch 1: 0.22102823853492737
train_epoch_acc for epoch 1: 0.9140838384628296

train_epoch_loss for epoch 2: 0.16585394740104675
train_epoch_acc for epoch 2: 0.9380617141723633

train_epoch_loss for epoch 3: 0.13130851089954376
train_epoch_acc for epoch 3: 0.9516608119010925

train_epoch_loss for epoch 4: 0.10805437713861465
train_epoch_acc for epoch 4: 0.961013913154602

test_epoch_loss for epoch 4: 0.8034762740135193
test_epoch_acc for epoch 4: 0.7673613429069519


In [None]:
# prob 3.2
import matplotlib.pyplot as plt
from pdb import set_trace
use_cuda = True

class RNN_model(nn.Module):
  def __init__(self, d, vocab):
    super(RNN_model, self).__init__()
    self.hidden_dimension = d
    self.embedding = nn.Embedding(len(vocab), self.hidden_dimension)
    self.i2h = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=True) # input-to-hidden: (d, d) weight plus bias
    self.h2h = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) # hidden-to-hidden: (d, d) weight, no bias
    self.tanh = nn.Tanh()
    self.class_layer = nn.Linear(self.hidden_dimension, 2, bias=True)

  def forward(self, input_tensor):
    batch_size, max_length = input_tensor.shape # (batch, longest sentence in the mini-batch)

    # initial hidden state h_0 = 0, created on the same device as the input
    total_hidden_state = [torch.zeros((batch_size, self.hidden_dimension), device=input_tensor.device)]

    emb = self.embedding(input_tensor) # (batch, max_length, d)

    for time_step in range(max_length):
      # h_t = tanh(U x_t + V h_(t-1) + b)
      cur_hidden_state = self.tanh(self.i2h(emb[:, time_step, :]) + self.h2h(total_hidden_state[-1]))
      total_hidden_state.append(cur_hidden_state)

    logits = self.class_layer(total_hidden_state[-1]) # (batch, 2), read from the last time step

    return logits

# hyperparameter
train_length_rnn = len(sentences)
mini_batch_size_rnn = 16 # 
train_steps_per_epoch_rnn = train_length_rnn // mini_batch_size_rnn
total_epoch_rnn = 5 #
hidden_dim_rnn = 64 # corresponding to d in baseline model above 
learning_rate_rnn = 0.003 #
logging_freq = 1000
test_steps_per_epoch_rnn = len(dev_sentences) // mini_batch_size_rnn

# plot
train_loss_hist = []
train_acc_hist = []
test_loss_hist = []
test_acc_hist = []

# rnn model
rnn_model = RNN_model(hidden_dim_rnn, whole_vocab) ##
loss_fn_rnn = nn.CrossEntropyLoss()
#optimizer_rnn = torch.optim.SGD(rnn_model.parameters(), lr=learning_rate_rnn)
optimizer_rnn = torch.optim.Adam(rnn_model.parameters(), lr=learning_rate_rnn)
# use gpu
if use_cuda and torch.cuda.is_available():
    rnn_model.cuda()

# train and test
for epoch in tqdm(range(total_epoch_rnn)):
  rnn_model.train()
  train_epoch_loss_rnn = 0
  train_epoch_acc_rnn = 0
  test_epoch_loss_rnn = 0
  test_epoch_acc_rnn = 0
  for step in tqdm(range(train_steps_per_epoch_rnn)):
    input_idx_rnn = []
    input_sentences_rnn = sorted_sentences[step * mini_batch_size_rnn:(step+1) * mini_batch_size_rnn]
    
    label_rnn = torch.LongTensor(sorted_labels[step * mini_batch_size_rnn:(step+1) * mini_batch_size_rnn]).cuda()
    #label_rnn = torch.LongTensor(labels[step * mini_batch_size_rnn:(step+1) * mini_batch_size_rnn])
    
    #compute max_sentence_length of each mini-batch
    max_sentence_length = 0
    for sentence in input_sentences_rnn:
      words = sentence.split(" ")
      if len(words) > max_sentence_length:
        max_sentence_length = len(words)

    #max_sentence_length = 54 ###
    for sentence in input_sentences_rnn: # 16 sentences each in [](idxes)
      words_rnn = sentence.split(" ")
      #print('words_rnn: ',words_rnn)
      words_rnn.remove('')
      idxes_rnn = [vocabs2id[word] if word in vocabs2id.keys() else 0 for word in words_rnn]
      #print('idxes_rnn: ',idxes_rnn)
      idxes_rnn = np.pad(np.array(idxes_rnn), (0, max_sentence_length - len(idxes_rnn)), 'constant', constant_values=(0, 0)) # pad with id 0 (shared with UNK) up to the longest sentence in this mini-batch
      #print('idxes_rnn: ',idxes_rnn)
      input_idx_rnn.append(idxes_rnn)
      
    input_rnn = torch.LongTensor(input_idx_rnn).cuda()
    #input_rnn = torch.LongTensor(input_idx_rnn)
    logits_rnn = rnn_model(input_rnn) ##
    pred_label_rnn = torch.argmax(logits_rnn, axis=1).cuda()
    #pred_label_rnn = torch.argmax(logits_rnn, axis=1)
    acc_rnn = torch.sum(pred_label_rnn == label_rnn) / mini_batch_size_rnn
    loss_rnn = loss_fn_rnn(logits_rnn, label_rnn)

    optimizer_rnn.zero_grad()
    loss_rnn.backward()
    optimizer_rnn.step()

    train_epoch_loss_rnn += loss_rnn / train_steps_per_epoch_rnn
    train_epoch_acc_rnn += acc_rnn / train_steps_per_epoch_rnn
  
  print(f'train_epoch_loss for epoch {epoch}: {train_epoch_loss_rnn}')
  print(f'train_epoch_acc for epoch {epoch}: {train_epoch_acc_rnn}')

  rnn_model.eval()

for step in tqdm(range(test_steps_per_epoch_rnn)):
  input_idx_rnn = []
  input_sentences_rnn = dev_sorted_sentences[step * mini_batch_size_rnn:(step+1) * mini_batch_size_rnn]
    
  label_rnn = torch.LongTensor(dev_sorted_labels[step * mini_batch_size_rnn:(step+1) * mini_batch_size_rnn]).cuda()
  #label_rnn = torch.LongTensor(dev_labels[step * mini_batch_size_rnn:(step+1) * mini_batch_size_rnn])
    
  #compute max_sentence_length of each mini-batch
  dev_sentence_length = 0
  for sentence in input_sentences_rnn:
    words = sentence.split(" ")
    if len(words) > dev_sentence_length:
      dev_sentence_length = len(words)
        
  for sentence in input_sentences_rnn: # 16 sentences each in [](idxes)
    words_rnn = sentence.split(" ")
    #print('words_rnn: ',words_rnn)
    words_rnn.remove('')
    idxes_rnn = [vocabs2id[word] if word in vocabs2id.keys() else 0 for word in words_rnn]
    #print('idxes_rnn: ',idxes_rnn)
    idxes_rnn = np.pad(np.array(idxes_rnn), (0, dev_sentence_length - len(idxes_rnn)), 'constant', constant_values=(0, 0)) # pad with id 0 (shared with UNK) up to the longest sentence in this mini-batch
    #print('idxes_rnn: ',idxes_rnn)
    input_idx_rnn.append(idxes_rnn)
      
  input_rnn = torch.LongTensor(input_idx_rnn).cuda()
  #input_rnn = torch.LongTensor(input_idx_rnn)
  logits_rnn = rnn_model(input_rnn) ##
  pred_label_rnn = torch.argmax(logits_rnn, axis=1)

  acc_rnn = torch.sum(pred_label_rnn == label_rnn) / mini_batch_size_rnn
  loss_rnn = loss_fn_rnn(logits_rnn, label_rnn)

  test_epoch_loss_rnn += loss_rnn / test_steps_per_epoch_rnn
  test_epoch_acc_rnn += acc_rnn / test_steps_per_epoch_rnn
  
print(f'test_epoch_loss for epoch {epoch}: {test_epoch_loss_rnn}')
print(f'test_epoch_acc for epoch {epoch}: {test_epoch_acc_rnn}')

train_epoch_loss for epoch 0: 0.4208744764328003
train_epoch_acc for epoch 0: 0.8060453534126282

train_epoch_loss for epoch 1: 0.2593460977077484
train_epoch_acc for epoch 1: 0.9022835493087769

train_epoch_loss for epoch 2: 0.21701033413410187
train_epoch_acc for epoch 2: 0.9205762147903442

train_epoch_loss for epoch 3: 0.1964765340089798
train_epoch_acc for epoch 3: 0.9285784363746643

train_epoch_loss for epoch 4: 0.18109558522701263
train_epoch_acc for epoch 4: 0.9350659251213074

test_epoch_loss for epoch 4: 0.5211775302886963
test_epoch_acc for epoch 4: 0.7835649847984314


## 4. Text Classification with LSTM and Dropout

Now it is time to improve your baselines! Replace your RNN module with an LSTM module. See Lecture slides 04 and 05 for the formal definition of LSTMs. 

You will also use Dropout, which during training randomly sets each dimension to zero with probability `p` and scales it by `1/(1-p)` if it is not zeroed. Put it either at the input or the output of the LSTM to prevent it from overfitting.

In [None]:
a = torch.FloatTensor([0.1, 0.3, 0.5, 0.7, 0.9])
print(a)
dropout = nn.Dropout(0.5) # p=0.5
print(dropout(a))

tensor([0.1000, 0.3000, 0.5000, 0.7000, 0.9000])
tensor([0.0000, 0.0000, 1.0000, 1.4000, 1.8000])
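The scaling rule described above can be reproduced by hand. This is a minimal plain-Python sketch of training-time dropout, assuming a list of floats as input; `inverted_dropout` is a hypothetical helper and uses its own random generator, so its zero pattern will not match the `nn.Dropout` output shown above.

```python
import random

def inverted_dropout(values, p, rng):
    """Zero each value with probability p; scale survivors by 1 / (1 - p)."""
    scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * scale for v in values]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
a = [0.1, 0.3, 0.5, 0.7, 0.9]
print(inverted_dropout(a, 0.5, rng))

# Averaged over many draws, the output matches the input in expectation.
samples = inverted_dropout([1.0] * 20000, 0.5, rng)
print(sum(samples) / len(samples))
```

Scaling the surviving values keeps the expected activation unchanged, which is why no rescaling is needed when dropout is switched off at evaluation time.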


**Problem 4.1** *(20 points)* Implement and use LSTM (without using PyTorch's LSTM module) instead of vanilla RNN to improve your model. Report the accuracy on the dev data.

**answer:**
Dev accuracy for the LSTM model: 0.8136575222015381
(epoch: 5, batch size: 16, hidden state dimension: 64, learning rate: 0.003. optimizer: Adam)

**Problem 4.2** *(10 points)* Use Dropout on LSTM (either at input or output). Report the accuracy on the dev data and briefly describe how it differs from 4.1.

**answer:** 
I used dropout on the input and the hidden state with p = 0.1.
Dev accuracy for the LSTM model with dropout: 0.8206019997596741
(epoch: 5, batch size: 16, hidden state dimension: 64, learning rate: 0.003. optimizer: Adam)
The dev accuracy with dropout is slightly higher than that of the LSTM without dropout. This is because dropout acts as a regularizer and can mitigate the overfitting problem.

**Problem 4.3 (bonus)** *(10 points)* Consider implementing bidirectional LSTM and two layers of LSTM to further improve your model. Report your accuracy on dev data.

In [None]:
# prob 4.1

from tqdm.notebook import tqdm
use_cuda = True
import numpy as np

class LSTM_model(nn.Module):
  def __init__(self, d, vocab):
    super(LSTM_model, self).__init__()
    self.hidden_dimension = d
    self.embedding = nn.Embedding(len(vocab), self.hidden_dimension)
    self.sigmoid = nn.Sigmoid()
    self.tanh = nn.Tanh()
    # input-to-gate projections (these carry the bias of each gate)
    self.x2i_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=True) # input gate i_t
    self.x2g_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=True) # candidate cell state g_t
    self.x2f_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=True) # forget gate f_t
    self.x2o_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=True) # output gate o_t
    # hidden-to-gate projections (no bias)
    self.h2i_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) # i_t
    self.h2g_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) # g_t
    self.h2f_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) # f_t
    self.h2o_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) # o_t
    self.class_layer = nn.Linear(self.hidden_dimension, 2, bias=True)

  def forward(self, input_tensor):
    batch_size, max_length = input_tensor.shape # (batch, longest sentence in the mini-batch)
    device = input_tensor.device # keep all states on the same device as the input
    total_hidden_state = [torch.zeros((batch_size, self.hidden_dimension), device=device)] # h_0
    total_c_ts = [torch.zeros((batch_size, self.hidden_dimension), device=device)] # c_0
    emb = self.embedding(input_tensor) # (batch, max_length, d)
    #emb = dropout(emb) ### dropout emb
    for time_step in range(max_length): # max length is max sentence length in mini-batch
      x_t = emb[:, time_step, :]
      h_t1 = total_hidden_state[-1] # h_(t-1)
      c_t1 = total_c_ts[-1] # c_(t-1)
      f_t = self.sigmoid(self.x2f_t(x_t) + self.h2f_t(h_t1))
      i_t = self.sigmoid(self.x2i_t(x_t) + self.h2i_t(h_t1))
      g_t = self.tanh(self.x2g_t(x_t) + self.h2g_t(h_t1))
      o_t = self.sigmoid(self.x2o_t(x_t) + self.h2o_t(h_t1))
      c_t = torch.mul(f_t, c_t1) + torch.mul(i_t, g_t) # keep part of the old cell state, add new content
      h_t = torch.mul(o_t, self.tanh(c_t))
      total_c_ts.append(c_t)
      total_hidden_state.append(h_t)

    logits = self.class_layer(total_hidden_state[-1]) # (batch, 2), read from the last time step

    return logits

# hyperparameter
train_length_lstm = len(sentences)
mini_batch_size_lstm = 16 # 
train_steps_per_epoch_lstm = train_length_lstm // mini_batch_size_lstm
total_epoch_lstm = 5 #
hidden_dim_lstm = 64 # corresponding to d in baseline mode above 
learning_rate_lstm = 0.003 #
logging_freq = 1000
test_steps_per_epoch_lstm = len(dev_sentences) // mini_batch_size_lstm

# lstm model
lstm_model = LSTM_model(hidden_dim_lstm, whole_vocab).cuda() ##
loss_fn_lstm = nn.CrossEntropyLoss()
#optimizer_lstm = torch.optim.SGD(lstm_model.parameters(), lr=learning_rate_lstm)
optimizer_lstm = torch.optim.Adam(lstm_model.parameters(), lr=learning_rate_lstm)
# train
for epoch in tqdm(range(total_epoch_lstm)):
  train_epoch_loss_lstm = 0
  train_epoch_acc_lstm = 0
  test_epoch_loss_lstm = 0
  test_epoch_acc_lstm = 0
  for step in tqdm(range(train_steps_per_epoch_lstm)):
    input_idx_lstm = []
    input_sentences_lstm = sorted_sentences[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]
    
    label_lstm = torch.LongTensor(sorted_labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]).cuda()
    
    # compute max length of sentence in mini-batch
    max_sentence_length = 0
    for sentence in input_sentences_lstm:
      words = sentence.split(" ")
      if len(words) > max_sentence_length:
        max_sentence_length = len(words)

    for sentence in input_sentences_lstm: # 16 sentences each in [](idxes)
      words_lstm = [word for word in sentence.split(" ") if word] # drop empty strings left by extra spaces
      idxes_lstm = [vocabs2id.get(word, 0) for word in words_lstm] # unknown words map to index 0
      idxes_lstm = np.pad(np.array(idxes_lstm), (0, max_sentence_length - len(idxes_lstm)), 'constant', constant_values=(0, 0)) # zero-pad the remaining 54-len
      input_idx_lstm.append(idxes_lstm)
      
    input_lstm = torch.LongTensor(input_idx_lstm).cuda()
    logits_lstm = lstm_model(input_lstm) ##
    pred_label_lstm = torch.argmax(logits_lstm, axis=1)

    acc_lstm = torch.sum(pred_label_lstm == label_lstm) / mini_batch_size_lstm
    loss_lstm = loss_fn_lstm(logits_lstm, label_lstm)

    optimizer_lstm.zero_grad()
    loss_lstm.backward()
    optimizer_lstm.step()

    # # assert False
    # log = f''
    # log += f'step: {step}, '
    # log += f'loss: {loss_lstm}, '
    # log += f'acc: {acc_lstm} '
    # if step % logging_freq == 0:
    #   print(log)

    train_epoch_loss_lstm += loss_lstm / train_steps_per_epoch_lstm
    train_epoch_acc_lstm += acc_lstm / train_steps_per_epoch_lstm
  
  print(f'train_epoch_loss for epoch {epoch}: {train_epoch_loss_lstm}')
  print(f'train_epoch_acc for epoch {epoch}: {train_epoch_acc_lstm}')

for step in tqdm(range(test_steps_per_epoch_lstm)):
  input_idx_lstm = []
  input_sentences_lstm = dev_sorted_sentences[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]
    
  label_lstm = torch.LongTensor(dev_sorted_labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]).cuda()
    
  #compute max_sentence_length of each mini-batch
  dev_sentence_length = 0
  for sentence in input_sentences_lstm:
    words = sentence.split(" ")
    if len(words) > dev_sentence_length:
      dev_sentence_length = len(words)
        
  for sentence in input_sentences_lstm: # 16 sentences each in [](idxes)
    words_lstm = [word for word in sentence.split(" ") if word] # drop empty strings left by extra spaces
    idxes_lstm = [vocabs2id.get(word, 0) for word in words_lstm] # unknown words map to index 0
    idxes_lstm = np.pad(np.array(idxes_lstm), (0, dev_sentence_length - len(idxes_lstm)), 'constant', constant_values=(0, 0)) # zero-pad the remaining 54-len
    input_idx_lstm.append(idxes_lstm)
      
  input_lstm = torch.LongTensor(input_idx_lstm).cuda()
  logits_lstm = lstm_model(input_lstm) ##
  pred_label_lstm = torch.argmax(logits_lstm, axis=1)

  acc_lstm = torch.sum(pred_label_lstm == label_lstm) / mini_batch_size_lstm
  loss_lstm = loss_fn_lstm(logits_lstm, label_lstm)

    # log = f''
    # log += f'step: {step}, '
    # log += f'loss: {loss_lstm}, '
    # log += f'acc: {acc_lstm} '
    # if step % logging_freq == 0:
    #   print(log)

  test_epoch_loss_lstm += loss_lstm / test_steps_per_epoch_lstm
  test_epoch_acc_lstm += acc_lstm / test_steps_per_epoch_lstm
  
print(f'test epoch_loss for epoch {epoch}: {test_epoch_loss_lstm}')
print(f'test epoch_acc for epoch {epoch}: {test_epoch_acc_lstm}')

train_epoch_loss for epoch 0: 0.35017967224121094
train_epoch_acc for epoch 0: 0.8383101224899292

train_epoch_loss for epoch 1: 0.17960205674171448
train_epoch_acc for epoch 1: 0.929404616355896

train_epoch_loss for epoch 2: 0.12366022169589996
train_epoch_acc for epoch 2: 0.9538118839263916

train_epoch_loss for epoch 3: 0.09369064122438431
train_epoch_acc for epoch 3: 0.9651542901992798

train_epoch_loss for epoch 4: 0.0754176676273346
train_epoch_acc for epoch 4: 0.9719091653823853


test epoch_loss for epoch 4: 0.4386916756629944
test epoch_acc for epoch 4: 0.8136575222015381
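
The evaluation loop above runs `len(dev_sentences) // mini_batch_size_lstm` steps and divides each batch's correct count by the batch size, so a final partial batch is never scored (54 steps of 16 cover at most 864 examples). A minimal sketch of one way to count every example, including a shorter last batch, is shown below. `batched` and `exact_accuracy` are hypothetical helper names, and the predictions are plain Python lists standing in for the model's argmax outputs.

```python
import math

def batched(items, batch_size):
    """Yield consecutive slices of `items`, including a shorter final slice."""
    num_batches = math.ceil(len(items) / batch_size)
    for b in range(num_batches):
        yield items[b * batch_size:(b + 1) * batch_size]

def exact_accuracy(predictions, labels, batch_size):
    """Accuracy over every example: total correct / total seen (non-empty input)."""
    correct = 0
    seen = 0
    for pred_batch, label_batch in zip(batched(predictions, batch_size),
                                       batched(labels, batch_size)):
        correct += sum(int(p == y) for p, y in zip(pred_batch, label_batch))
        seen += len(label_batch)
    return correct / seen

# Toy data (made up): 10 examples with batch size 4 gives batches of 4, 4 and 2.
preds = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
golds = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(exact_accuracy(preds, golds, batch_size=4))
```

Dividing the total correct count by the number of examples seen keeps the metric correct regardless of whether the dataset size is a multiple of the batch size.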


In [None]:
# prob 4.2

from tqdm.notebook import tqdm
use_cuda = True
import numpy as np

dropout = nn.Dropout(0.1)

class LSTM_model_dropout(nn.Module):
  def __init__(self, d, vocab):
    super(LSTM_model_dropout, self).__init__()
    self.hidden_dimension = d
    self.embedding = nn.Embedding(len(vocab), self.hidden_dimension) 
    self.i2h = nn.Linear(2 * self.hidden_dimension, self.hidden_dimension, bias=True) # (2d, d) matrix (weight, bias) ###
    ###
    self.sigmoid = nn.Sigmoid()
    self.tanh = nn.Tanh()
    self.x2i_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=True) #i_t
    self.x2g_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=True) #g_t
    self.x2f_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=True) #f_t
    self.x2o_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=True) #o_t
    self.h2i_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #i_t
    self.h2g_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #g_t
    self.h2f_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #f_t
    self.h2o_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #o_t
    ###
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(self.hidden_dimension, 2, bias=True)

  def forward(self, input_tensor):
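    batch_size, max_length = input_tensor.shape # (batch, max sentence length in the mini-batch)
    device = input_tensor.device # create the initial states on the same device as the input
    total_hidden_state = [torch.zeros((batch_size, self.hidden_dimension), device=device)] # h_0: (batch, d)
    total_c_ts = [torch.zeros((batch_size, self.hidden_dimension), device=device)] # c_0: (batch, d)
    emb = self.embedding(input_tensor) # (batch, max_length, d)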
    emb = dropout(emb) ### dropout emb
    for time_step in range(max_length): # max length is max sentence length in mini-batch
      ###
      f_t = self.sigmoid(self.x2f_t(emb[:, time_step, :]) + self.h2f_t(total_hidden_state[-1]))
      i_t = self.sigmoid(self.x2i_t(emb[:, time_step, :]) + self.h2i_t(total_hidden_state[-1]))
      g_t = self.tanh(self.x2g_t(emb[:, time_step, :]) + self.h2g_t(total_hidden_state[-1]))
      c_t1 = total_c_ts[-1].cuda() #c_(t-1)
      c_t = torch.mul(f_t,c_t1) + torch.mul(i_t,g_t)
      o_t = self.sigmoid(self.x2o_t(emb[:, time_step, :]) + self.h2o_t(total_hidden_state[-1]))
      h_t = torch.mul(o_t, self.tanh(c_t))
      ###
      total_c_ts.append(c_t)
      total_hidden_state.append(h_t)###
    total_hidden_state.append(dropout(h_t))
    logits = self.class_layer(total_hidden_state[-1]) # (16, 2)

    return logits

# hyperparameter
train_length_lstm = len(sentences)
mini_batch_size_lstm = 16 # 
train_steps_per_epoch_lstm = train_length_lstm // mini_batch_size_lstm
total_epoch_lstm = 5 #
hidden_dim_lstm = 64 # corresponding to d in baseline mode above 
learning_rate_lstm = 0.003 #
logging_freq = 1000
test_steps_per_epoch_lstm = len(dev_sentences) // mini_batch_size_lstm

# lstm model
lstm_model = LSTM_model_dropout(hidden_dim_lstm, whole_vocab).cuda() ##
loss_fn_lstm = nn.CrossEntropyLoss()
#optimizer_lstm = torch.optim.SGD(lstm_model.parameters(), lr=learning_rate_lstm)
optimizer_lstm = torch.optim.Adam(lstm_model.parameters(), lr=learning_rate_lstm)
# train
for epoch in tqdm(range(total_epoch_lstm)):
  train_epoch_loss_lstm = 0
  train_epoch_acc_lstm = 0
  test_epoch_loss_lstm = 0
  test_epoch_acc_lstm = 0
  for step in tqdm(range(train_steps_per_epoch_lstm)):
    input_idx_lstm = []
    input_sentences_lstm = sorted_sentences[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]
    
    label_lstm = torch.LongTensor(sorted_labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]).cuda()
    
    # compute max length of sentence in mini-batch
    max_sentence_length = 0
    for sentence in input_sentences_lstm:
      words = sentence.split(" ")
      if len(words) > max_sentence_length:
        max_sentence_length = len(words)

    for sentence in input_sentences_lstm: # 16 sentences each in [](idxes)
      words_lstm = [word for word in sentence.split(" ") if word] # drop empty strings left by extra spaces
      idxes_lstm = [vocabs2id.get(word, 0) for word in words_lstm] # unknown words map to index 0
      idxes_lstm = np.pad(np.array(idxes_lstm), (0, max_sentence_length - len(idxes_lstm)), 'constant', constant_values=(0, 0)) # zero-pad the remaining 54-len
      input_idx_lstm.append(idxes_lstm)
      
    input_lstm = torch.LongTensor(input_idx_lstm).cuda()
    logits_lstm = lstm_model(input_lstm) ##
    pred_label_lstm = torch.argmax(logits_lstm, axis=1)

    acc_lstm = torch.sum(pred_label_lstm == label_lstm) / mini_batch_size_lstm
    loss_lstm = loss_fn_lstm(logits_lstm, label_lstm)

    optimizer_lstm.zero_grad()
    loss_lstm.backward()
    optimizer_lstm.step()

    # # assert False
    # log = f''
    # log += f'step: {step}, '
    # log += f'loss: {loss_lstm}, '
    # log += f'acc: {acc_lstm} '
    # if step % logging_freq == 0:
    #   print(log)

    train_epoch_loss_lstm += loss_lstm / train_steps_per_epoch_lstm
    train_epoch_acc_lstm += acc_lstm / train_steps_per_epoch_lstm
  
  print(f'train_epoch_loss for epoch {epoch}: {train_epoch_loss_lstm}')
  print(f'train_epoch_acc for epoch {epoch}: {train_epoch_acc_lstm}')

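# evaluate on the dev set
# `dropout` is a module-level nn.Dropout rather than a submodule of lstm_model,
# so it is switched to eval mode explicitly; otherwise it stays active here
dropout.eval()
for step in tqdm(range(test_steps_per_epoch_lstm)):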
  input_idx_lstm = []
  input_sentences_lstm = dev_sorted_sentences[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]
    
  label_lstm = torch.LongTensor(dev_sorted_labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]).cuda()
    
  #compute max_sentence_length of each mini-batch
  dev_sentence_length = 0
  for sentence in input_sentences_lstm:
    words = sentence.split(" ")
    if len(words) > dev_sentence_length:
      dev_sentence_length = len(words)
        
  for sentence in input_sentences_lstm: # 16 sentences each in [](idxes)
    words_lstm = [word for word in sentence.split(" ") if word] # drop empty strings left by extra spaces
    idxes_lstm = [vocabs2id.get(word, 0) for word in words_lstm] # unknown words map to index 0
    idxes_lstm = np.pad(np.array(idxes_lstm), (0, dev_sentence_length - len(idxes_lstm)), 'constant', constant_values=(0, 0)) # zero-pad the remaining 54-len
    input_idx_lstm.append(idxes_lstm)
      
  input_lstm = torch.LongTensor(input_idx_lstm).cuda()
  logits_lstm = lstm_model(input_lstm) ##
  pred_label_lstm = torch.argmax(logits_lstm, axis=1)

  acc_lstm = torch.sum(pred_label_lstm == label_lstm) / mini_batch_size_lstm
  loss_lstm = loss_fn_lstm(logits_lstm, label_lstm)

    # log = f''
    # log += f'step: {step}, '
    # log += f'loss: {loss_lstm}, '
    # log += f'acc: {acc_lstm} '
    # if step % logging_freq == 0:
    #   print(log)

  test_epoch_loss_lstm += loss_lstm / test_steps_per_epoch_lstm
  test_epoch_acc_lstm += acc_lstm / test_steps_per_epoch_lstm
  
print(f'test epoch_loss for epoch {epoch}: {test_epoch_loss_lstm}')
print(f'test epoch_acc for epoch {epoch}: {test_epoch_acc_lstm}')
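
The model for Problem 4.2 applies dropout to the embeddings and to the final hidden state. Dropout is meant to be active only during training and to act as the identity at evaluation time. The sketch below illustrates the inverted-dropout rule that `nn.Dropout` follows (zero each element with probability `p`, scale the survivors by `1 / (1 - p)`) in plain Python. `inverted_dropout` is a hypothetical helper written for this illustration, not part of PyTorch.

```python
import random

def inverted_dropout(values, p, training, rng):
    """Zero each value with probability p and rescale the rest by 1 / (1 - p).

    Assumes 0 <= p < 1. With training=False the input is returned unchanged.
    """
    if not training or p == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]

rng = random.Random(0)
hidden = [1.0] * 10000  # stand-in for a hidden-state vector

train_out = inverted_dropout(hidden, p=0.1, training=True, rng=rng)
eval_out = inverted_dropout(hidden, p=0.1, training=False, rng=rng)

dropped = sum(1 for v in train_out if v == 0.0)
print("fraction dropped in training:", dropped / len(hidden))
print("mean in training:", sum(train_out) / len(train_out))
print("eval output unchanged:", eval_out == hidden)
```

Because the surviving activations are rescaled during training, the expected activation is the same in both modes, which is why no extra scaling is needed at evaluation time.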

## 5. Pretrained Word Vectors
The last step is to use pretrained vocabulary and word vectors. The prebuilt vocabulary will replace the vocabulary you built with SST-2 training data, and the word vectors will replace the embedding vectors. You will observe the power of leveraging self-supervised pretrained models.

**Problem 5.1** *(10 points)* Go to https://nlp.stanford.edu/projects/glove/ and download `glove.6B.zip`. Use these pretrained word vectors to further improve your model from 4.2. Report the model's accuracy on the dev data.

**answer:** 
Dev accuracy of the LSTM model with dropout using GloVe (`glove.6B.100d.txt`, 100-dimensional vectors): 0.826388955116272
(epochs: 5, batch size: 16, hidden state dimension: 64, learning rate: 0.003, optimizer: Adam)
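
As a small illustration of how pretrained vectors stand in for a learned embedding layer, the sketch below parses text in the GloVe file layout (a token followed by its vector components on each line) and looks up a tokenized sentence, giving out-of-vocabulary words an all-zero vector. The helper names `load_glove` and `embed_sentence` are hypothetical, and the three-dimensional vectors are made-up stand-ins for the real 100-dimensional file.

```python
import io

def load_glove(handle):
    """Parse GloVe text format: one token per line, then its vector components."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def embed_sentence(words, vectors, dim):
    """Look up each word; out-of-vocabulary words get an all-zero vector."""
    zero = [0.0] * dim
    return [vectors.get(word, zero) for word in words]

# Tiny in-memory stand-in for glove.6B.100d.txt (3 dimensions, made-up values).
fake_glove = io.StringIO(
    "good 0.1 0.2 0.3\n"
    "movie -0.4 0.5 0.0\n"
)
vectors = load_glove(fake_glove)
embedded = embed_sentence(["good", "movie", "zzz"], vectors, dim=3)
print(embedded)
```

Mapping unknown words to zeros is only one possible choice; a shared trainable unknown-word vector is a common alternative.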

**Problem 5.2 (bonus)** *(10 points)* You can go one step further by using word vectors obtained from pretrained language models. Can you import the word embeddings from `bert-base-uncased` model (via Hugging Face's `transformers`: https://huggingface.co/transformers/pretrained_models.html) into your model and improve it further? Report the accuracy on the dev data here. If the score is now higher, explain where the improvement is coming from.

**answer:**
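Dev accuracy of the LSTM model with dropout using BERT: 0.8784721493721008
The accuracy is higher than with GloVe (0.826) because BERT is a pretrained language model that produces contextual embeddings: the vector for each token depends on the surrounding words, whereas a GloVe vector is the same in every sentence. Note that the code below feeds the LSTM the `last_hidden_state` of the full, frozen `bert-base-uncased` model rather than only its input word-embedding table, so the gain comes from the whole pretrained encoder.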
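
To make the difference between static and contextual embeddings concrete, here is a toy sketch in plain Python. It is not how BERT works: `toy_contextual_embed` is a hypothetical function that simply averages each token's vector with the sentence mean, as a crude stand-in for context mixing, and the two-dimensional vectors are made up.

```python
def static_embed(tokens, table):
    """Static lookup: a token always maps to the same vector."""
    return [table[t] for t in tokens]

def toy_contextual_embed(tokens, table):
    """Toy context mixing: average each token's vector with the sentence mean.

    This only imitates the idea of context dependence; it is not BERT.
    """
    static = static_embed(tokens, table)
    dim = len(static[0])
    mean = [sum(vec[i] for vec in static) / len(static) for i in range(dim)]
    return [[(vec[i] + mean[i]) / 2 for i in range(dim)] for vec in static]

table = {
    "river": [1.0, 0.0],
    "bank": [0.5, 0.5],
    "loan": [0.0, 1.0],
}
s1, s2 = ["river", "bank"], ["bank", "loan"]

static_same = static_embed(s1, table)[1] == static_embed(s2, table)[0]
contextual_same = toy_contextual_embed(s1, table)[1] == toy_contextual_embed(s2, table)[0]
print("static vectors for 'bank' identical:", static_same)
print("contextual vectors for 'bank' identical:", contextual_same)
```

The static lookup cannot distinguish the two uses of "bank", while any context-dependent encoder can, which is the property the answer above attributes to BERT.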

In [None]:
# prob 5.1 (part 1: download glove 6B and unzip)
!wget http://nlp.stanford.edu/data/glove.6B.zip
!unzip glove.6B.zip

--2021-03-31 11:27:40--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2021-03-31 11:27:41--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2021-03-31 11:27:41--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’


2021-0

In [None]:
# prob 5.1 (part 2: making word set, embedding layer)
import numpy as np

# read the GloVe file once: collect the word list and the word -> vector dictionary
word_set = []            # 400000 words
embedding_dict = dict()  # word -> 100-dimensional float32 vector

with open('glove.6B.100d.txt', encoding="utf8") as glove_100d:
    for line in glove_100d:
        word_line = line.split()
        word = word_line[0]
        word_set.append(word)
        embedding_dict[word] = np.asarray(word_line[1:], dtype='float32')

In [None]:
def do_embedding(input_words): # input_words: list of tokenized sentences -> output: (batch_size, max_len, 100) tensor
  batch_size = len(input_words)
  max_len = max((len(sentence) for sentence in input_words), default=0)

  # padding positions and out-of-vocabulary words keep an all-zero vector
  embedding_matrix = torch.zeros((batch_size, max_len, 100))

  for i, sentence in enumerate(input_words):
    for j, word in enumerate(sentence):
      if word in embedding_dict:
        embedding_matrix[i, j] = torch.from_numpy(embedding_dict[word])

  return embedding_matrix

test = [['aa','ccc','dd'],['bb']]
a = do_embedding(test)
print(a.shape)

torch.Size([2, 3, 100])


In [None]:
# prob 5.1 (part 3: applying lstm)
from tqdm.notebook import tqdm
use_cuda = True
import numpy as np

dropout = nn.Dropout(0.1)
class LSTM_model_glove(nn.Module):
  def __init__(self, d, vocab, input_dimension):
    super(LSTM_model_glove, self).__init__()
    self.hidden_dimension = d
    self.i2h = nn.Linear(2 * self.hidden_dimension, self.hidden_dimension, bias=True) # (2d, d) matrix (weight, bias) ###
    self.input_dim = input_dimension
    ###
    self.sigmoid = nn.Sigmoid()
    self.tanh = nn.Tanh()
    self.x2i_t = nn.Linear(self.input_dim, self.hidden_dimension, bias=True) #i_t
    self.x2g_t = nn.Linear(self.input_dim, self.hidden_dimension, bias=True) #g_t
    self.x2f_t = nn.Linear(self.input_dim, self.hidden_dimension, bias=True) #f_t
    self.x2o_t = nn.Linear(self.input_dim, self.hidden_dimension, bias=True) #o_t
    self.h2i_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #i_t
    self.h2g_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #g_t
    self.h2f_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #f_t
    self.h2o_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #o_t
    ###
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(self.hidden_dimension, 2, bias=True)

  def forward(self, input_tensor):
    #input_tensor should be embedded tensor
    batch_size, max_length, _ = input_tensor.shape # (batch, max_length, input_dim)
    total_hidden_state = [torch.zeros((batch_size, self.hidden_dimension)).cuda()] # h_0: (batch, d)
    total_c_ts = [torch.zeros((batch_size, self.hidden_dimension)).cuda()] # c_0: (batch, d)
    ## do embedding ##
    input_tensor = dropout(input_tensor)
    emb = input_tensor.cuda()
    ##################
  
    for time_step in range(max_length): # max length is max sentence length in mini-batch
      ###
      f_t = self.sigmoid(self.x2f_t(emb[:, time_step, :]) + self.h2f_t(total_hidden_state[-1]))
      i_t = self.sigmoid(self.x2i_t(emb[:, time_step, :]) + self.h2i_t(total_hidden_state[-1]))
      g_t = self.tanh(self.x2g_t(emb[:, time_step, :]) + self.h2g_t(total_hidden_state[-1]))
      c_t1 = total_c_ts[-1].cuda() #c_(t-1)
      c_t = torch.mul(f_t,c_t1) + torch.mul(i_t,g_t)
      o_t = self.sigmoid(self.x2o_t(emb[:, time_step, :]) + self.h2o_t(total_hidden_state[-1]))
      h_t = torch.mul(o_t, self.tanh(c_t))
      ###
      total_c_ts.append(c_t)
      total_hidden_state.append(h_t)###

    total_hidden_state.append(dropout(h_t))
    logits = self.class_layer(total_hidden_state[-1]) # (16, 2)

    return logits

# hyperparameter
train_length_lstm = len(sentences)
mini_batch_size_lstm = 16 # 
train_steps_per_epoch_lstm = train_length_lstm // mini_batch_size_lstm
total_epoch_lstm = 5 #
hidden_dim_lstm = 64 # corresponding to d in baseline mode above 
learning_rate_lstm = 0.003 #
logging_freq = 1000
test_steps_per_epoch_lstm = len(dev_sentences) // mini_batch_size_lstm
input_dim_lstm = 100

# lstm model
lstm_model = LSTM_model_glove(hidden_dim_lstm, whole_vocab, input_dim_lstm).cuda() ##
#lstm_model = LSTM_model_glove(hidden_dim_lstm, whole_vocab) ##
loss_fn_lstm = nn.CrossEntropyLoss()
#optimizer_lstm = torch.optim.SGD(lstm_model.parameters(), lr=learning_rate_lstm)
optimizer_lstm = torch.optim.Adam(lstm_model.parameters(), lr=learning_rate_lstm)
# train
for epoch in tqdm(range(total_epoch_lstm)):
  train_epoch_loss_lstm = 0
  train_epoch_acc_lstm = 0
  test_epoch_loss_lstm = 0
  test_epoch_acc_lstm = 0
  for step in tqdm(range(train_steps_per_epoch_lstm)):
    input_idx_lstm = []
    input_sentences_lstm = sorted_sentences[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]
    
    label_lstm = torch.LongTensor(sorted_labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]).cuda()
    #label_lstm = torch.LongTensor(labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm])

    # compute max length of sentence in mini-batch
    max_sentence_length = 0
    for sentence in input_sentences_lstm:
      words = sentence.split(" ")
      if len(words) > max_sentence_length:
        max_sentence_length = len(words)

    words_2d_list = [] ##
    for sentence in input_sentences_lstm: # 16 sentences each in [](idxes)
      words_lstm = [word for word in sentence.split(" ") if word] # drop empty strings left by extra spaces
      words_2d_list.append(words_lstm)
      
    input_lstm = do_embedding(words_2d_list)##
   
    logits_lstm = lstm_model(input_lstm) ## input_lstm should be tensor 16*sentence_len*100
    pred_label_lstm = torch.argmax(logits_lstm, axis=1)

    acc_lstm = torch.sum(pred_label_lstm == label_lstm) / mini_batch_size_lstm
    loss_lstm = loss_fn_lstm(logits_lstm, label_lstm)

    optimizer_lstm.zero_grad()
    loss_lstm.backward()
    optimizer_lstm.step()

    # assert False
    # log = f''
    # log += f'step: {step}, '
    # log += f'loss: {loss_lstm}, '
    # log += f'acc: {acc_lstm} '
    # if step % logging_freq == 0:
    #   print(log)

    train_epoch_loss_lstm += loss_lstm / train_steps_per_epoch_lstm
    train_epoch_acc_lstm += acc_lstm / train_steps_per_epoch_lstm
  
  print(f'train_epoch_loss for epoch {epoch}: {train_epoch_loss_lstm}')
  print(f'train_epoch_acc for epoch {epoch}: {train_epoch_acc_lstm}')

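# evaluate on the dev set
# `dropout` is a module-level nn.Dropout rather than a submodule of lstm_model,
# so it is switched to eval mode explicitly; otherwise it stays active here
dropout.eval()
for step in tqdm(range(test_steps_per_epoch_lstm)):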
  #input_idx_lstm = []
  input_sentences_lstm = dev_sorted_sentences[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]
    
  label_lstm = torch.LongTensor(dev_sorted_labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]).cuda()
  #label_lstm = torch.LongTensor(dev_labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm])

  dev_2d_list = []
  for sentence in input_sentences_lstm: # 16 sentences each in [](idxes)
    words_lstm = [word for word in sentence.split(" ") if word] # drop empty strings left by extra spaces
    dev_2d_list.append(words_lstm)
    
  input_lstm = do_embedding(dev_2d_list)
  logits_lstm = lstm_model(input_lstm) ##
  pred_label_lstm = torch.argmax(logits_lstm, axis=1)

  acc_lstm = torch.sum(pred_label_lstm == label_lstm) / mini_batch_size_lstm
  loss_lstm = loss_fn_lstm(logits_lstm, label_lstm)

    # print(lstm_model.i2h.weight.grad) # dL/dw of weights in the linear layer

    # assert False
    # log = f''
    # log += f'step: {step}, '
    # log += f'loss: {loss_lstm}, '
    # log += f'acc: {acc_lstm} '
    # if step % logging_freq == 0:
    #   print(log)

  test_epoch_loss_lstm += loss_lstm / test_steps_per_epoch_lstm
  test_epoch_acc_lstm += acc_lstm / test_steps_per_epoch_lstm
  
print(f'test epoch_loss for epoch {epoch}: {test_epoch_loss_lstm}')
print(f'test epoch_acc for epoch {epoch}: {test_epoch_acc_lstm}')

train_epoch_loss for epoch 0: 0.3507845997810364
train_epoch_acc for epoch 0: 0.8468554615974426

train_epoch_loss for epoch 1: 0.2754836678504944
train_epoch_acc for epoch 1: 0.8847793340682983

train_epoch_loss for epoch 2: 0.23944410681724548
train_epoch_acc for epoch 2: 0.9018086791038513

train_epoch_loss for epoch 3: 0.21952013671398163
train_epoch_acc for epoch 3: 0.9102560877799988

train_epoch_loss for epoch 4: 0.20432128012180328
train_epoch_acc for epoch 4: 0.9176945090293884


test epoch_loss for epoch 4: 0.4774285852909088
test epoch_acc for epoch 4: 0.826388955116272


In [None]:
# prob 5.2

!pip install transformers
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased').cuda()

for param in model.parameters():
    param.requires_grad = False

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/ed/d5/f4157a376b8a79489a76ce6cfe147f4f3be1e029b7144fa7b8432e8acb26/transformers-4.4.2-py3-none-any.whl (2.0MB)
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/71/23/2ddc317b2121117bf34dd00f5b0de194158f2a44ee2bf5e47c7166878a97/tokenizers-0.10.1-cp37-cp37m-manylinux2010_x86_64.whl (3.2MB)
Building wheels for collected packages: sacremoses
  Building wheel for sacremoses (setup.py) ... done
  Created wheel for sacremoses: filename=sacremoses-0.0.43-cp37-none-any.whl size=893262 sha256=8fe6993051

In [None]:
# prob 5.2

from torch import nn

from tqdm.notebook import tqdm
use_cuda = True
dropout = nn.Dropout(0.1)

class LSTM_model_bert(nn.Module):
  def __init__(self, d, vocab, input_dim):
    super(LSTM_model_bert, self).__init__()
    self.hidden_dimension = d
    self.input_dim = input_dim
    #self.embedding = nn.Embedding(len(vocab), self.hidden_dimension) 
    ###
    self.sigmoid = nn.Sigmoid()
    self.tanh = nn.Tanh()
    self.x2i_t = nn.Linear(self.input_dim, self.hidden_dimension, bias=True) #i_t
    self.x2g_t = nn.Linear(self.input_dim, self.hidden_dimension, bias=True) #g_t
    self.x2f_t = nn.Linear(self.input_dim, self.hidden_dimension, bias=True) #f_t
    self.x2o_t = nn.Linear(self.input_dim, self.hidden_dimension, bias=True) #o_t
    self.h2i_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #i_t
    self.h2g_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #g_t
    self.h2f_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #f_t
    self.h2o_t = nn.Linear(self.hidden_dimension, self.hidden_dimension, bias=False) #o_t
    ###
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(self.hidden_dimension, 2, bias=True)

  def forward(self, input_tensor):
    #input_tensor should be embedded tensor

    batch_size, max_length, _ = input_tensor.shape # (batch, seq_len, input_dim), input_dim is 768 for bert-base-uncased
    device = input_tensor.device # create the initial states on the same device as the input
    total_hidden_state = [torch.zeros((batch_size, self.hidden_dimension), device=device)] # h_0: (batch, d)
    total_c_ts = [torch.zeros((batch_size, self.hidden_dimension), device=device)] # c_0: (batch, d)
    ## do embedding ##
    input_tensor = dropout(input_tensor)
    emb = input_tensor.cuda()
    ##################
    for time_step in range(max_length): # max length is max sentence length in mini-batch
      ###
      f_t = self.sigmoid(self.x2f_t(emb[:, time_step, :]) + self.h2f_t(total_hidden_state[-1]))
      i_t = self.sigmoid(self.x2i_t(emb[:, time_step, :]) + self.h2i_t(total_hidden_state[-1]))
      g_t = self.tanh(self.x2g_t(emb[:, time_step, :]) + self.h2g_t(total_hidden_state[-1]))
      c_t1 = total_c_ts[-1].cuda() #c_(t-1)
      c_t = torch.mul(f_t,c_t1) + torch.mul(i_t,g_t)
      o_t = self.sigmoid(self.x2o_t(emb[:, time_step, :]) + self.h2o_t(total_hidden_state[-1]))
      h_t = torch.mul(o_t, self.tanh(c_t))
      ###
      total_c_ts.append(c_t)
      total_hidden_state.append(h_t)###
    total_hidden_state.append(dropout(h_t))
    logits = self.class_layer(total_hidden_state[-1]) # (16, 2)

    return logits


# hyperparameter
train_length_lstm = len(sentences)
mini_batch_size_lstm = 16 # 
train_steps_per_epoch_lstm = train_length_lstm // mini_batch_size_lstm
total_epoch_lstm = 5 #
hidden_dim_lstm = 64
input_dim_lstm = 768
learning_rate_lstm = 0.003 #
logging_freq = 1000
test_steps_per_epoch_lstm = len(dev_sentences) // mini_batch_size_lstm

# lstm model
lstm_model = LSTM_model_bert(hidden_dim_lstm, whole_vocab, input_dim_lstm).cuda() ##
#lstm_model = LSTM_model_glove(hidden_dim_lstm, whole_vocab) ##
loss_fn_lstm = nn.CrossEntropyLoss()
#optimizer_lstm = torch.optim.SGD(lstm_model.parameters(), lr=learning_rate_lstm)
optimizer_lstm = torch.optim.Adam(lstm_model.parameters(), lr=learning_rate_lstm)


# ------ tokenize once so that the LSTM inputs can be built from BERT embeddings ------
# padding=True pads every sentence to the longest one in the list it is given
batch_encoded = tokenizer.batch_encode_plus(sorted_sentences, max_length=128, padding=True, truncation=True)
whole_ids = batch_encoded['input_ids'] # type: 2d list, len: 67349 (train sentences)
whole_masks = batch_encoded['attention_mask']

batch_encoded_dev = tokenizer.batch_encode_plus(dev_sorted_sentences, max_length=128, padding=True, truncation=True)
whole_ids_dev = batch_encoded_dev['input_ids'] # type: 2d list, one entry per dev sentence
whole_masks_dev = batch_encoded_dev['attention_mask']
#-------------------------------------------------------------------------

# train
for epoch in tqdm(range(total_epoch_lstm)):
  train_epoch_loss_lstm = 0
  train_epoch_acc_lstm = 0
  test_epoch_loss_lstm = 0
  test_epoch_acc_lstm = 0
  for step in tqdm(range(train_steps_per_epoch_lstm)):
    
    label_lstm = torch.LongTensor(sorted_labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]).cuda()
    #label_lstm = torch.LongTensor(labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm])
    
    batch_slice = slice(step * mini_batch_size_lstm, (step+1) * mini_batch_size_lstm)
    input_ids = torch.LongTensor(whole_ids[batch_slice]).cuda()
    attention_mask = torch.FloatTensor(whole_masks[batch_slice]).cuda()
    outputs = model(input_ids=input_ids, attention_mask=attention_mask) # model == frozen BERT, outputs hold the embeddings
    last_hidden_states = outputs.last_hidden_state
    logits_lstm = lstm_model(last_hidden_states) ## last_hidden_state : embedded vector,  should be tensor 16*max_sentence_len*768
    pred_label_lstm = torch.argmax(logits_lstm, axis=1)

    acc_lstm = torch.sum(pred_label_lstm == label_lstm) / mini_batch_size_lstm
    loss_lstm = loss_fn_lstm(logits_lstm, label_lstm)

    optimizer_lstm.zero_grad()
    loss_lstm.backward()
    optimizer_lstm.step()

    # log = f''
    # log += f'step: {step}, '
    # log += f'loss: {loss_lstm}, '
    # log += f'acc: {acc_lstm} '
    # if step % logging_freq == 0:
    #   print(log)

    train_epoch_loss_lstm += loss_lstm / train_steps_per_epoch_lstm
    train_epoch_acc_lstm += acc_lstm / train_steps_per_epoch_lstm
  
  print(f'train_epoch_loss for epoch {epoch}: {train_epoch_loss_lstm}')
  print(f'train_epoch_acc for epoch {epoch}: {train_epoch_acc_lstm}')

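# evaluate on the dev set
# `dropout` is a module-level nn.Dropout rather than a submodule of lstm_model,
# so it is switched to eval mode explicitly; otherwise it stays active here
dropout.eval()
for step in tqdm(range(test_steps_per_epoch_lstm)):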
  #input_idx_lstm = []
  input_sentences_lstm = dev_sorted_sentences[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]
    
  label_lstm = torch.LongTensor(dev_sorted_labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]).cuda()
  #label_lstm = torch.LongTensor(dev_labels[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm])

  outputs_dev  = model(torch.LongTensor(whole_ids_dev[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]).cuda(), torch.FloatTensor(whole_masks_dev[step * mini_batch_size_lstm:(step+1) * mini_batch_size_lstm]).cuda()) # model == bert model , output == embeddings
  last_hidden_states_dev = outputs_dev.last_hidden_state
  logits_lstm = lstm_model(last_hidden_states_dev) ## last_hidden_state : embedded vector,  should be tensor 16*max_sentence_len*768
      
  pred_label_lstm = torch.argmax(logits_lstm, axis=1)

  acc_lstm = torch.sum(pred_label_lstm == label_lstm) / mini_batch_size_lstm
  loss_lstm = loss_fn_lstm(logits_lstm, label_lstm)

    # log = f''
    # log += f'step: {step}, '
    # log += f'loss: {loss_lstm}, '
    # log += f'acc: {acc_lstm} '
    # if step % logging_freq == 0:
    #   print(log)

  test_epoch_loss_lstm += loss_lstm / test_steps_per_epoch_lstm
  test_epoch_acc_lstm += acc_lstm / test_steps_per_epoch_lstm
  
print(f'test epoch_loss for epoch {epoch}: {test_epoch_loss_lstm}')
print(f'test epoch_acc for epoch {epoch}: {test_epoch_acc_lstm}')

train_epoch_loss for epoch 0: 0.3218434154987335
train_epoch_acc for epoch 0: 0.8640793561935425

train_epoch_loss for epoch 1: 0.27563774585723877
train_epoch_acc for epoch 1: 0.8873021006584167

train_epoch_loss for epoch 2: 0.2448321431875229
train_epoch_acc for epoch 2: 0.9011687636375427

train_epoch_loss for epoch 3: 0.22225138545036316
train_epoch_acc for epoch 3: 0.9109666347503662

train_epoch_loss for epoch 4: 0.20298631489276886
train_epoch_acc for epoch 4: 0.920052707195282

test epoch_loss for epoch 4: 0.3482404947280884
test epoch_acc for epoch 4: 0.8784721493721008
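
The loops above average per-batch accuracy over a fixed number of steps. If the step count is computed with floor division, the final partial batch is silently skipped. The sketch below is a minimal illustration of batching that covers every example and weights each batch by its size. The helper names `batch_slices` and `weighted_mean` are hypothetical, and the sizes used (872 dev sentences, batch size 16) are assumptions for illustration rather than values read from this notebook's variables.

```python
def batch_slices(num_examples, batch_size, drop_last=False):
    """Yield slice objects that walk through num_examples in order."""
    for start in range(0, num_examples, batch_size):
        stop = min(start + batch_size, num_examples)
        if drop_last and stop - start < batch_size:
            break  # skip the final partial batch, as floor-division step counts do
        yield slice(start, stop)


def weighted_mean(batch_values, batch_sizes):
    """Average per-batch metrics, weighting each batch by its number of examples."""
    total = sum(batch_sizes)
    return sum(v * n for v, n in zip(batch_values, batch_sizes)) / total


# Assumed sizes for illustration: 872 dev sentences, batch size 16.
all_batches = list(batch_slices(872, 16))
full_batches = list(batch_slices(872, 16, drop_last=True))

print(len(all_batches))   # 55 batches when the partial batch is kept
print(len(full_batches))  # 54 batches when it is dropped
print(all_batches[-1])    # slice(864, 872, None): the 8 leftover examples
```

With this scheme, a per-batch accuracy would be computed as the mean over the examples actually in the batch, and `weighted_mean` would combine the batches so that the short final batch does not count as much as a full one.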
