<a href="https://colab.research.google.com/github/pgmikhael/mit_deeplearning_bootcamp/blob/demonstrations/Tutorial3_advanced_pytorch_tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to RNNs and CNNs in PyTorch
In this tutorial, we'll take you through developing recurrent neural networks (RNNs) and convolutional neural networks (CNNs) in PyTorch to classify beer reviews.

Let's get started!

# Preliminaries

The next few sections will set up the necessary components of the tutorial, including:


1.   Installing PyTorch
2.   Importing dependencies
3.   Downloading and processing data
4.   Defining training and evaluation procedures



## Download PyTorch

In [None]:
# http://pytorch.org/
from os import path


accelerator = 'cu100' if path.exists('/opt/bin/nvidia-smi') else 'cpu'
if accelerator == 'cu100':
    !pip install torch torchvision torchaudio
else:
    !pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

!pip install transformers

import torch
print(f'Torch version = {torch.__version__}')
print(f'Cuda available = {torch.cuda.is_available()}')

## Imports

In [None]:
import argparse
from collections import Counter
import pickle
import re

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from tqdm import tqdm

## Download and Process Data

In [None]:
!apt-get install wget
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_train.p
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_dev.p
!wget https://raw.githubusercontent.com/yala/MLCodeLab/master/lab1/data/beer/overall_test.p

train_path = "overall_train.p"
dev_path   = "overall_dev.p"
test_path  = "overall_test.p"

with open(train_path, 'rb') as f:
    train_set = pickle.load(f)
with open(dev_path, 'rb') as f:
    dev_set = pickle.load(f)
with open(test_path, 'rb') as f:
    test_set = pickle.load(f)

def preprocess_data(data):
    for indx, sample in enumerate(data):
        text, label = sample['text'], sample['y']
        text = re.sub(r'\W+', ' ', text).lower().strip()  # raw string avoids an invalid-escape warning
        data[indx] = text, label
    return data

train_set = preprocess_data(train_set)
dev_set = preprocess_data(dev_set)
test_set =  preprocess_data(test_set)
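
To make the effect of this preprocessing concrete, the sketch below applies the same regular expression to a single string. The sample review text and the helper name `clean_text` are made up for illustration and are not part of the dataset or the tutorial code.

```python
import re

def clean_text(text):
    # Replace each run of non-word characters (punctuation, whitespace) with one
    # space, then lowercase and trim leading/trailing spaces.
    return re.sub(r'\W+', ' ', text).lower().strip()

sample = "Pours a hazy, golden-orange color... Smells GREAT!"  # hypothetical review text
cleaned = clean_text(sample)
print(cleaned)  # pours a hazy golden orange color smells great
```

Note that apostrophes also count as non-word characters, so a contraction such as "don't" becomes "don t".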

### Question:
1. Why would we want to do this kind of preprocessing?

2. In what kind of datasets would this kind of preprocessing be a bad idea?

3. Before we go to working on the model, what questions should we ask about the data?


In [None]:
print(f'Num Train = {len(train_set):,}')
print(f'Num Dev   = {len(dev_set):,}')
print(f'Num Test  = {len(test_set):,}')
print()

trainText = [t[0] for t in train_set]
trainY = [t[1] for t in train_set]

devText = [t[0] for t in dev_set]
devY = [t[1] for t in dev_set]

testText = [t[0] for t in test_set]
testY = [t[1] for t in test_set]

allText = trainText + devText + testText

print('Train class balance')
y_count = Counter(trainY)
for y in sorted(y_count.keys()):
    print(f'{y} = {100. * y_count[y] / len(trainY):.2f}%')

## Dataset Class

### Review Question:
1. What is the benefit of using a PyTorch Dataset object?

In [None]:
class BeerReviewDataset(Dataset):
    def __init__(self, X, Y):
      self.X, self.Y = X, Y
      assert len(X) == len(Y)

    def __len__(self):
       return len(self.X)

    def __getitem__(self, i):
      return np.array(self.X[i]), self.Y[i]

## Model and Training Settings

### Review Questions:
1. What does each of these variables mean?
2. What might we expect if we use a batch size that is too small? Too large?
3. What might we expect if we use a learning rate that is too small? Too large?
4. What might we expect if we use a weight decay that is too small? Too large?


In [None]:
batch_size = 16
epochs = 20
lr = 1e-3
weight_decay = 1e-3
embedding_size = 100
hidden_size = 100
output_size = 3
dropout = 0.4
use_cuda = True
max_len = 256

## Utility Functions

In [None]:
def param_count(model):
    return sum(param.numel() for param in model.parameters() if param.requires_grad)

## Training Procedure

### Review Questions:
1. What are the steps in defining a training loop?
2. What would happen if we removed `optimizer.zero_grad()`?
3. Why do we detach the gradient from loss?

In [None]:
def train_epoch(model, train_loader, optimizer, epoch):
    model.train()  # Set the nn.Module to train mode. 
    total_loss = 0
    correct = 0
    num_samples = len(train_loader.dataset)
    for batch_idx, (data, target) in tqdm(enumerate(train_loader), total=len(train_loader)):  # 1) get batch
        # Move to cuda
        if next(model.parameters()).is_cuda:
            data, target = data.cuda(), target.cuda()
      
        # Reset gradient data to 0
        optimizer.zero_grad()
        
        # Get prediction for batch
        output = model(data)
        
        # 2) Compute loss
        loss = F.cross_entropy(output, target)
        
        # 3) Do backprop
        loss.backward()
        
        # 4) Update model
        optimizer.step()
        
        # Do book-keeping to track accuracy and avg loss
        pred = output.max(1, keepdim=True)[1]  # get the index of the largest logit
        correct += pred.eq(target.view_as(pred)).sum().item()
        # cross_entropy returns the batch mean, so weight it by the batch size;
        # detach so we don't keep the computation graph
        total_loss += loss.detach() * target.size(0)

    print(f'Train Epoch: {epoch} '
          f'Loss: {total_loss / num_samples:.4f}, '
          f'Accuracy: {correct}/{num_samples} ({100. * correct / num_samples:.0f}%)')

## Evaluation Procedure

### Review Question:
1. What should we change from our training procedure to make the evaluation procedure?

In [None]:
def eval_epoch(model, test_loader, name):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for data, target in test_loader:
            # Move to cuda
            if next(model.parameters()).is_cuda:
                data, target = data.cuda(), target.cuda()

            output = model(data)

            test_loss += F.cross_entropy(output, target, reduction='sum').item()  # sum up per-sample losses
            pred = output.max(1, keepdim=True)[1]  # get the index of the largest logit
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print(f'\n{name} set: '
          f'Average loss: {test_loss:.4f}, '
          f'Accuracy: {correct}/{len(test_loader.dataset)} ({100. * correct / len(test_loader.dataset):.0f}%)\n')

# Introduction to Word Embeddings

## Limitation of Bag-of-Words

The bag-of-words featurization used in the previous tutorials and exercises (shown below) indicates the presence or absence of individual words or n-grams, but it doesn't encode sequential information.

<img src="https://cdn-images-1.medium.com/max/1200/1*eUedufAl7_sI_QWSEIstZg.png">

For example, if we are using a bag-of-words with unigrams, then the sentence "the beer is bad, not good" and the sentence "the beer is good, not bad" have the same featurization because they have the same words, even though they have very different meanings.
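
A minimal sketch of this limitation, using only the standard library: the helper `bag_of_words` below is a hypothetical stand-in for the unigram featurization, not the code from the previous tutorials.

```python
from collections import Counter
import re

def bag_of_words(sentence):
    # Unigram bag-of-words: count each word, ignoring order and punctuation.
    return Counter(re.findall(r'\w+', sentence.lower()))

bow_a = bag_of_words("the beer is bad, not good")
bow_b = bag_of_words("the beer is good, not bad")

print(bow_a)
print(bow_a == bow_b)  # True: the two sentences get the same features
```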

## Sequential Featurization

A better approach would be to construct a featurization that preserves the order of the sequence.

For instance, instead of representing a sentence as a single vector, we could represent a sentence as a sequence of vectors, one for each word.

One option is to represent each word as a *one-hot* encoding (i.e. a vector with a 1 in the position for that word and 0s everywhere else), like below.

<img src="https://www.tensorflow.org/images/feature_columns/categorical_column_with_vocabulary.jpg">
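
As a rough illustration, the sketch below turns a sentence into a sequence of one-hot vectors in plain Python. The helper `one_hot` and the choice to sort the vocabulary (so that indices are reproducible) are assumptions made for this example.

```python
def one_hot(index, size):
    # Vector of zeros with a single 1 at position `index`.
    vec = [0] * size
    vec[index] = 1
    return vec

sentence = "the beer is really really good"
vocab = sorted(set(sentence.split()))  # sorted so the indices are reproducible
word_to_index = {word: i for i, word in enumerate(vocab)}

# One vector per word, kept in sentence order
one_hot_sequence = [one_hot(word_to_index[w], len(vocab)) for w in sentence.split()]
for word, vec in zip(sentence.split(), one_hot_sequence):
    print(f'{word:>6} -> {vec}')
```

The sentence has six words but only five distinct ones, so the result is six vectors of length five, and the two occurrences of "really" map to the same vector.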

## Word Embeddings

Although the one-hot featurization described above can work, it is still limited because it provides no information about how words are related to each other.

For instance, the words "play", "playing", and "igloo" would have one-hot features `[1,0,0]`, `[0,1,0]`, and `[0,0,1]`, which are all equally different, even though "play" and "playing" should intuitively have very similar representations.
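
One way to check the "equally different" claim is to compute the distance between every pair of one-hot vectors. The sketch below uses squared Euclidean distance, which is just one possible choice of measure.

```python
from itertools import combinations

one_hot_features = {
    'play':    [1, 0, 0],
    'playing': [0, 1, 0],
    'igloo':   [0, 0, 1],
}

def squared_distance(u, v):
    # Sum of squared element-wise differences
    return sum((a - b) ** 2 for a, b in zip(u, v))

distances = {
    (w1, w2): squared_distance(one_hot_features[w1], one_hot_features[w2])
    for w1, w2 in combinations(one_hot_features, 2)
}
print(distances)  # every pair has distance 2
```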

A commonly used alternative to one-hot vectors is *word embeddings*. The idea is to encode a word using a vector of real numbers, i.e. a word embedding, that is *learned* during training.

The advantage of using word embeddings is that similar words can end up learning similar embeddings (as seen below), which allows the model to better encode the meaning of sentences.

<img src="https://shanelynnwebsite-mid9n9g1q9y8tt.netdna-ssl.com/wp-content/uploads/2018/01/word-vector-space-similar-words.png">

## Word Embeddings in PyTorch

PyTorch makes it easy to use word embeddings with the `nn.Embedding` class. An `nn.Embedding` maps from indices (each representing a word) to word embeddings.

To use a PyTorch `nn.Embedding`, we first need to determine the vocabulary of all words that will be used.

In [None]:
sentence = 'the beer is really really good'
vocab = {word for word in sentence.split()}
print(f'Vocab = {vocab}')

Next, we assign an index to each word in the vocabulary.

In [None]:
word_to_index = {word: index for index, word in enumerate(vocab)}
print(f'Word to index = {word_to_index}')

Now we can create an `nn.Embedding`.

In [None]:
vocab_size = len(vocab)
embed_size = 5  # the dimensionality of the word embeddings
embed = nn.Embedding(vocab_size, embed_size)

We can then use the embedding to embed each word in a sentence.

In [None]:
indices = [word_to_index[word] for word in sentence.split()]
indices = torch.LongTensor(indices)  # nn.Embedding only works with PyTorch LongTensors
embedding = embed(indices)

print(f'Sentence = {sentence}')
print(f'Indices = {indices}')
print(f'Embedding = \n{embedding}')

## Padding

All sequences in a batch must have the same length so that they can be stacked into a single PyTorch tensor. However, sentences naturally have different lengths since they have different numbers of words. To address this issue, we add *padding*.

The idea of padding is to add 0s to shorter sequences so that all sequences are the same length.

PyTorch makes padding easy by providing a `padding_idx` parameter to `nn.Embedding`. Any time the `nn.Embedding` encounters the `padding_idx` in a sequence, it replaces it with a vector of all 0s, and that vector is not updated during training. That way, the padding contributes nothing when the embeddings are summed.

In [None]:
# Sentences
sentence_1 = 'the beer is really really good'
sentence_2 = 'the beer sucks'

# Get vocab
vocab = {word for word in sentence_1.split()} | {word for word in sentence_2.split()}
print(f'Vocab = {vocab}')

# Get word to index mapping (need +1 b/c padding_idx = 0)
word_to_index = {word: index + 1 for index, word in enumerate(vocab)}
print(f'Word to index = {word_to_index}')

# Map words to indices
indices_1 = [word_to_index[word] for word in sentence_1.split()]
indices_2 = [word_to_index[word] for word in sentence_2.split()]

In [None]:
# Add padding
padding_idx = 0

indices_1 = indices_1 + [padding_idx] * max(0, len(indices_2) - len(indices_1))
indices_2 = indices_2 + [padding_idx] * max(0, len(indices_1) - len(indices_2))

indices_1 = torch.LongTensor(indices_1)
indices_2 = torch.LongTensor(indices_2)

print(f'Indices 1 = {indices_1}')
print(f'Indices 2 = {indices_2}')

In [None]:
# Create embedding with padding
vocab_size = len(vocab) + 1  # +1 because of padding
embed_size = 5  # the dimensionality of the word embeddings
embed = nn.Embedding(vocab_size, embed_size, padding_idx=padding_idx)

# Embed
embedding_1 = embed(indices_1)
embedding_2 = embed(indices_2)

print(f'Sentence 1 = {sentence_1}')
print(f'Indices 1 = {indices_1}')
print(f'Embedding 1 = \n{embedding_1}')
print()
print(f'Sentence 2 = {sentence_2}')
print(f'Indices 2 = {indices_2}')
print(f'Embedding 2 = \n{embedding_2}')

# Word Embeddings for Beer Reviews

Now we'll perform the same steps to prepare the beer review dataset for use with word embeddings. The `nn.Embedding` itself will be defined in the model as it is learned along with the other parameters.

## Define Vocab and Word-to-Index Mapping

In [None]:
# Define vocab
vocab = {word for text in allText for word in text.split()}

# Create word to index mapping
padding_idx = 0
word_to_index = {word: index + 1 for index, word in enumerate(vocab)}
vocab_size = len(word_to_index) + 1

print(f'Vocab size = {vocab_size:,}')

## Map Words to Indices

In [None]:
trainX = [[word_to_index[word] for word in text.split()] for text in trainText]
devX =   [[word_to_index[word] for word in text.split()] for text in devText]
testX =  [[word_to_index[word] for word in text.split()] for text in testText]

print(f'Indices of first train sentence = {trainX[0]}')
print(f'Last five indices = {trainX[0][-5:]}')

## Add Padding

Note: Since some beer reviews are extremely long, we've hard-coded a maximum sequence length `max_len` in the Model and Training Settings section above. Longer reviews are truncated to `max_len` tokens and shorter ones are padded up to it.

In [None]:
trainX = [seq[:max_len] + [padding_idx] * (max_len - len(seq)) for seq in trainX]
devX =   [seq[:max_len] + [padding_idx] * (max_len - len(seq)) for seq in devX]
testX =  [seq[:max_len] + [padding_idx] * (max_len - len(seq)) for seq in testX]

print(f'Indices of first train sentence = {trainX[0]}')
print(f'Last five indices = {trainX[0][-5:]}')
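
The list comprehensions above truncate and pad in one expression; they rely on the fact that multiplying a list by a negative number gives an empty list. The sketch below spells the same logic out on two short, made-up index sequences. The helper name `pad_or_truncate` is hypothetical.

```python
def pad_or_truncate(seq, length, pad_value=0):
    # Keep at most `length` items, then right-pad with `pad_value` up to `length`.
    seq = seq[:length]
    return seq + [pad_value] * (length - len(seq))

short_seq = [4, 8, 15]
long_seq = [4, 8, 15, 16, 23, 42]

print(pad_or_truncate(short_seq, 5))  # [4, 8, 15, 0, 0]
print(pad_or_truncate(long_seq, 5))   # [4, 8, 15, 16, 23]
```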

## Build Dataset/DataLoader

In [None]:
# Build Dataset
train = BeerReviewDataset(trainX, trainY)
dev = BeerReviewDataset(devX, devY)
test = BeerReviewDataset(testX, testY)

# Build DataLoader
train_loader = DataLoader(train, batch_size=batch_size, shuffle=True)
dev_loader = DataLoader(dev, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test, batch_size=batch_size, shuffle=True)

### Review Question
1. Why is it important to shuffle the data?
2. Is it important to shuffle the test data?

# Multi-Layer Perceptron (MLP)

Before experimenting with RNNs and CNNs, we'll review the basic feed-forward neural network (aka the multi-layer perceptron, or MLP), seen below.

<img src="https://freecontent.manning.com/wp-content/uploads/Teofili_WDDLCtS_02.png">

In the previous tutorials and labs, we built an MLP that operated on bag-of-words features. Below, we adapt the MLP to instead use word embeddings.

However, since an MLP is not a sequence model like an RNN or CNN, it can't handle the extra sequential dimension we just introduced. To handle this, we simply remove that dimension by summing the embeddings, thereby providing the MLP with a single vector for the whole sentence.
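
One consequence of summing is that word order is lost again, just as with bag-of-words. The sketch below shows this with hand-picked 2-dimensional vectors; the values in `toy_embeddings` are hypothetical and chosen only for illustration, not learned embeddings.

```python
# Hypothetical 2-dimensional embeddings, chosen by hand for illustration only
toy_embeddings = {
    'the':  [0.5, 0.0],
    'beer': [1.0, 0.5],
    'is':   [0.0, 0.5],
    'good': [2.0, 1.0],
    'not':  [-1.0, 1.5],
    'bad':  [-2.0, -1.5],
}

def sum_embeddings(sentence):
    # Add the word vectors element-wise, collapsing the sequence dimension.
    vectors = [toy_embeddings[word] for word in sentence.split()]
    return [sum(dims) for dims in zip(*vectors)]

sum_a = sum_embeddings('the beer is bad not good')
sum_b = sum_embeddings('the beer is good not bad')

print(sum_a, sum_b)    # [0.5, 2.0] [0.5, 2.0]
print(sum_a == sum_b)  # True: the summed representation ignores word order
```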

## Define MLP

### Review Question:

1. Why is this more powerful than a single linear layer?
2. How can we make an MLP more powerful?
3. What are the limitations of an MLP?

In [None]:
class MLP(nn.Module):
    def __init__(self, vocab_size, padding_idx, embedding_size, hidden_size, output_size, dropout):
        super(MLP, self).__init__()
        
        # Embedding layer
        self.embed = nn.Embedding(vocab_size, embedding_size, padding_idx=padding_idx)
        
        # Fully connected layers
        self.fc1 = nn.Linear(embedding_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, hidden_size)
        self.fc3 = nn.Linear(hidden_size, output_size)
        
        # Dropout (regularization)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):  # batch_size x seq_length
        # Embed
        embedded = self.embed(x)  # batch_size x seq_length x embedding_size
        
        # Sum embeddings
        embedded = embedded.sum(dim=1)  # batch_size x embedding_size
        
        # MLP
        hidden = F.relu(self.fc1(embedded))  # batch_size x hidden_size
        hidden = self.dropout(F.relu(self.fc2(hidden)))  # batch_size x hidden_size
        logit = self.fc3(hidden)  # batch_size x output_size
        
        return logit

## Build MLP

In [None]:
model = MLP(vocab_size, padding_idx, embedding_size, hidden_size, output_size, dropout)

print(model)
print(f'Number of parameters = {param_count(model):,}')

# Move to cuda
if use_cuda and torch.cuda.is_available():
    model = model.cuda()

optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay) 

## Train MLP

In [None]:
for epoch in range(1, epochs + 1):
    train_epoch(model, train_loader, optimizer, epoch)
    eval_epoch(model,  dev_loader, "Dev")
    print("---")

## Test MLP

In [None]:
eval_epoch(model,  test_loader, "Test")

# Recurrent Neural Network (RNN)

Unlike MLPs, recurrent neural networks (RNNs) can handle sequential data. RNNs work by processing a sequence one token at a time while maintaining a sense of state that is updated by each token in the sequence. This allows the RNN to incorporate information from all tokens in the sequence *in the order* in which they appear. This sequential information is what makes RNNs (and CNNs) more powerful than the MLP.
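
The idea of a state that is updated token by token can be sketched with a single scalar recurrence. This is a plain tanh update with fixed, made-up weights (`w_x`, `w_h`) and scalar inputs, not an LSTM and not the model trained in this tutorial.

```python
import math

def simple_rnn(inputs, w_x=1.0, w_h=0.5):
    # Scalar recurrence: h_t = tanh(w_x * x_t + w_h * h_{t-1}).
    # The weights are hypothetical constants; a real RNN learns them.
    h = 0.0
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
    return h

# Hypothetical scalar "embeddings" for three tokens, in two different orders
state_original = simple_rnn([1.0, -1.0, 0.5])
state_reversed = simple_rnn([0.5, -1.0, 1.0])

print(state_original, state_reversed)
```

The two final states differ even though both sequences contain the same values, which is the order sensitivity that summing embeddings lacks.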

In the code below, we use one particular kind of RNN called a Long Short-Term Memory (LSTM) encoder.

<img src="https://i.ytimg.com/vi/kMLl-TKaEnc/maxresdefault.jpg">

## Define RNN

In [None]:
class RNN(nn.Module):
    def __init__(self, vocab_size, padding_idx, embedding_size, hidden_size, output_size, dropout):
        super(RNN, self).__init__()
        
        # Embedding layer
        self.embed = nn.Embedding(vocab_size, embedding_size, padding_idx=padding_idx)
        
        # LSTM (RNN)
        self.rnn = nn.LSTM(
            input_size=embedding_size,
            hidden_size=hidden_size,
            batch_first=True
        )
        
        # Fully connected layer
        self.output = nn.Linear(hidden_size, output_size)
        
        # Dropout (regularization)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):  # batch_size x seq_length
        # Embed
        embedded = self.embed(x)  # batch_size x seq_length x embedding_size
      
        # Run RNN
        o, _ = self.rnn(embedded)  # batch_size x seq_length x hidden_size
        
        # Dropout
        o = self.dropout(o)  # batch_size x seq_length x hidden_size
        
        # Max pooling across sequence
        o, _ = torch.max(o, dim=1)    # batch_size x hidden_size
        
        # Output layer
        logit = self.output(o)  # batch_size x output_size
        
        return logit

### Review Question:
1. In this implementation, we take the max across all output states. What are some other options?
2. How should the speed of an RNN compare to the MLP? Will it have more parameters?
3. Do we need to change anything in our train or eval loop?

## Build RNN

In [None]:
model = RNN(vocab_size, padding_idx, embedding_size, hidden_size, output_size, dropout)

print(model)
print(f'Number of parameters = {param_count(model):,}')

# Move to cuda
if use_cuda and torch.cuda.is_available():
    model = model.cuda()
    
optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay) 

## Train RNN

In [None]:
for epoch in range(1, epochs + 1):
    train_epoch(model, train_loader, optimizer, epoch)
    eval_epoch(model,  dev_loader, "Dev")
    print("---")

## Test RNN

In [None]:
eval_epoch(model,  test_loader, "Test")

# Convolutional Neural Network (CNN)

Like RNNs, convolutional neural networks (CNNs) can also handle sequential data. However, rather than processing input tokens one-by-one, a CNN performs operations on several neighboring tokens simultaneously, similar to the n-gram approach (but with access to the full sequence).

For text, we use 1-dimensional convolutions, as illustrated below. The same convolutional weights `w1`, `w2`, and `w3` are used repeatedly throughout the sequence, allowing the model to filter for certain words or phrases in the sequence.

<img src="https://qph.fs.quoracdn.net/main-qimg-523434af0d21bb0b59454aa9563cc90b.webp">
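
A minimal sketch of the weight sharing described above, for a single input channel and a single filter in plain Python. The input values and the kernel are made up; like `nn.Conv1d`, the function slides the kernel without flipping it (cross-correlation).

```python
def conv1d(sequence, weights):
    # Slide the same weights over every window of len(weights) neighbouring values.
    k = len(weights)
    return [
        sum(w * x for w, x in zip(weights, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    ]

signal = [1, 2, 3, 4, 5]  # hypothetical 1-channel input of length 5
kernel = [1, 0, -1]       # plays the role of w1, w2, w3

print(conv1d(signal, kernel))  # [-2, -2, -2]
```

With no padding, a kernel of size 3 over 5 positions yields 5 - 3 + 1 = 3 outputs.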

CNNs can also be applied to images (2D convolutions) or videos (3D convolutions).

<img src="https://s3.amazonaws.com/cdn.ayasdi.com/wp-content/uploads/2018/06/21100605/Fig2GCNN1.png">

## Define CNN

In [None]:
class CNN(nn.Module):
    def __init__(self, vocab_size, padding_idx, embedding_size, hidden_size, output_size, dropout):
        super(CNN, self).__init__()
        
        # Embedding layer
        self.embed = nn.Embedding(vocab_size, embedding_size, padding_idx=padding_idx)
        
        # Convolutional layers
        self.conv1 = nn.Conv1d(in_channels=embedding_size, out_channels=hidden_size, kernel_size=3, padding=0)
        self.conv2 = nn.Conv1d(in_channels=embedding_size, out_channels=hidden_size, kernel_size=5, padding=1)
        self.conv3 = nn.Conv1d(in_channels=embedding_size, out_channels=hidden_size, kernel_size=7, padding=2)
        
        # Fully connected layer
        self.output = nn.Linear(hidden_size, output_size)
        
        # Dropout (regularization)
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, x):  # batch_size x seq_length
        # Embed
        embedded = self.embed(x)  # batch_size x seq_length x embedding_size
      
        # Permute dimensions
        embedded = embedded.permute(0, 2, 1)  # batch_size x embedding_size x seq_length
        
        # Convolutional layers
        hidden_1 = self.dropout(F.relu(self.conv1(embedded)))  # batch_size x hidden_size x new_seq_length
        hidden_2 = self.dropout(F.relu(self.conv2(embedded)))  # batch_size x hidden_size x new_seq_length
        hidden_3 = self.dropout(F.relu(self.conv3(embedded)))  # batch_size x hidden_size x new_seq_length
        
        # Sum
        hidden = hidden_1 + hidden_2 + hidden_3    # batch_size x hidden_size x new_seq_length
        
        # Max pooling across sequence
        hidden, _ = hidden.max(dim=-1)    # batch_size x hidden_size
        
        # Output
        logit = self.output(hidden)  # batch_size x output_size
        
        return logit
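
The three convolution outputs can only be summed if they have the same `new_seq_length`. The sketch below checks this for the kernel sizes and paddings used in the `CNN` class, using the standard output-length formula for a 1D convolution with dilation 1. The helper name and the example sequence length of 256 are assumptions for illustration.

```python
def conv_output_length(seq_length, kernel_size, padding=0, stride=1):
    # Output length of a 1D convolution with dilation 1.
    return (seq_length + 2 * padding - kernel_size) // stride + 1

seq_length = 256  # example input length
for kernel_size, padding in [(3, 0), (5, 1), (7, 2)]:
    print(kernel_size, padding, conv_output_length(seq_length, kernel_size, padding))
```

All three combinations give 254, i.e. `seq_length - 2`, because each extra two positions of kernel width are offset by one extra position of padding on each side.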

## Build CNN

In [None]:
model = CNN(vocab_size, padding_idx, embedding_size, hidden_size, output_size, dropout)

print(model)
print(f'Number of parameters = {param_count(model):,}')

if use_cuda and torch.cuda.is_available():
    model = model.cuda()

optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay) 

## Train CNN

In [None]:
for epoch in range(1, epochs + 1):
    train_epoch(model, train_loader, optimizer, epoch)
    eval_epoch(model,  dev_loader, "Dev")
    print("---")

## Test CNN

In [None]:
eval_epoch(model,  test_loader, "Test")

# Transformer

<img src="https://jalammar.github.io/images/t/transformer_resideual_layer_norm_3.png">

## Build from Pretrained Model: BERT (distilled)

In [None]:

model_checkpoint = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, num_labels = output_size)

In [None]:
print(model)
print(f'Number of parameters = {param_count(model):,}')

if use_cuda and torch.cuda.is_available():
    model = model.cuda()

optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay) 

## Modify Dataset to return sentences

In [None]:
class TransformerBeerReviewDataset(Dataset):
    def __init__(self, dataset):
      self.X, self.Y = [l[0] for l in dataset], [l[1] for l in dataset]
      assert len(self.X) == len(self.Y)

    def __len__(self):
       return len(self.X)

    def __getitem__(self, i):
      return self.X[i], self.Y[i]


In [None]:
# Build Dataset

# The dataset only stores raw text and labels; tokenization happens in the training/eval loops
train = TransformerBeerReviewDataset(train_set)
dev = TransformerBeerReviewDataset(dev_set)
test = TransformerBeerReviewDataset(test_set)

# Build DataLoader
train_loader = DataLoader(train, batch_size=batch_size, shuffle=True)
dev_loader = DataLoader(dev, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test, batch_size=batch_size, shuffle=True)

## Modify Training/Eval Procedures

In [None]:
def train_epoch(model, train_loader, optimizer, epoch):
    model.train()  # Set the nn.Module to train mode. 
    total_loss = 0
    correct = 0
    num_samples = len(train_loader.dataset)
    for batch_idx, (data, target) in tqdm(enumerate(train_loader), total=len(train_loader)):  # 1) get batch
        # tokenize 
        data_inputs = tokenizer(data, return_tensors='pt', max_length=max_len, padding='max_length', truncation=True)
        
        # Move to cuda
        if next(model.parameters()).is_cuda:
            target =  target.cuda()
            for k,v in data_inputs.items(): # CHANGED: data is a dict
                data_inputs[k] = v.cuda()
      
        # Reset gradient data to 0
        optimizer.zero_grad()
        
        # Get prediction for batch
        output = model(**data_inputs) # CHANGED: data is a dict
        output = output.logits
        
        # 2) Compute loss
        loss = F.cross_entropy(output, target)
        
        # 3) Do backprop
        loss.backward()
        
        # 4) Update model
        optimizer.step()
        
        # Do book-keeping to track accuracy and avg loss
        pred = output.max(1, keepdim=True)[1]  # get the index of the largest logit
        correct += pred.eq(target.view_as(pred)).sum().item()
        # cross_entropy returns the batch mean, so weight it by the batch size;
        # detach so we don't keep the computation graph
        total_loss += loss.detach() * target.size(0)

    print(f'Train Epoch: {epoch} '
          f'Loss: {total_loss / num_samples:.4f}, '
          f'Accuracy: {correct}/{num_samples} ({100. * correct / num_samples:.0f}%)')

In [None]:
def eval_epoch(model, test_loader, name):
    model.eval()
    test_loss = 0
    correct = 0
    with torch.no_grad():  # gradients are not needed for evaluation
        for data, target in test_loader:
            # tokenize
            data_inputs = tokenizer(data, return_tensors='pt', max_length=max_len, padding='max_length', truncation=True)
            # Move to cuda
            if next(model.parameters()).is_cuda:
                target = target.cuda()
                for k, v in data_inputs.items():  # CHANGED: data is a dict
                    data_inputs[k] = v.cuda()

            output = model(**data_inputs)  # CHANGED: data is a dict
            output = output.logits

            test_loss += F.cross_entropy(output, target, reduction='sum').item()  # sum up per-sample losses
            pred = output.max(1, keepdim=True)[1]  # get the index of the largest logit
            correct += pred.eq(target.view_as(pred)).sum().item()

    test_loss /= len(test_loader.dataset)
    print(f'\n{name} set: '
          f'Average loss: {test_loss:.4f}, '
          f'Accuracy: {correct}/{len(test_loader.dataset)} ({100. * correct / len(test_loader.dataset):.0f}%)\n')

In [None]:
for epoch in range(1, epochs + 1):
    train_epoch(model, train_loader, optimizer, epoch)
    eval_epoch(model,  dev_loader, "Dev")
    print("---")

In [None]:
eval_epoch(model,  test_loader, "Test")

### Discussion Questions:
1. What are the advantages and disadvantages of an RNN vs. a CNN?
2. How do you get a CNN to take in a wider context?
3. Why isn't the accuracy higher?