In [1]:
!mkdir ./data
! wget https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/sentiment-rnn/data/labels.txt
! wget https://raw.githubusercontent.com/udacity/deep-learning-v2-pytorch/master/sentiment-rnn/data/reviews.txt
! mv labels.txt ./data 
! mv reviews.txt ./data


# Sentiment Analysis with an RNN

In this notebook, you'll implement an RNN that performs sentiment analysis.

>Using an RNN rather than a strictly feedforward network is more accurate since we can include information about the *sequence* of words.

Here, we'll use a dataset of movie reviews, accompanied by sentiment labels: positive or negative.

<img src="https://github.com/udacity/deep-learning-v2-pytorch/raw/3a95d118f9df5a86826e1791c5c100817f0fd924/sentiment-rnn/assets/reviews_ex.png" width=200px>

### Network Architecture

The architecture for this network is shown below.

<img src="https://github.com/udacity/deep-learning-v2-pytorch/raw/3a95d118f9df5a86826e1791c5c100817f0fd924/sentiment-rnn/assets/network_diagram.png" width=300px>

> **First, we'll pass in words to an embedding layer.** We need an embedding layer because we have tens of thousands of words, so we'll need a more efficient representation for our input data than one-hot encoded vectors. You should have seen this before from the Word2Vec lesson. You can actually train an embedding with the Skip-gram Word2Vec model and use those embeddings as input here. However, it's good enough to just have an embedding layer and let the network learn a different embedding table on its own. *In this case, the embedding layer is for dimensionality reduction, rather than for learning semantic representations.*
>
> **After input words are passed to an embedding layer, the new embeddings will be passed to LSTM cells.** The LSTM cells will add *recurrent* connections to the network and give us the ability to include information about the *sequence* of words in the movie review data.
>
> **Finally, the LSTM outputs will go to a sigmoid output layer.** We're using a sigmoid function because positive and negative are encoded as 1 and 0, respectively, and a sigmoid will output predicted sentiment values between 0 and 1.

We don't care about the sigmoid outputs except for the **very last one**; we can ignore the rest. We'll calculate the loss by comparing the output at the last time step and the training label (pos or neg).
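To make the "last output only" idea concrete, here is a minimal sketch in plain Python. The nested list `sigmoid_outputs` is made-up data standing in for per-time-step sigmoid outputs of shape (batch, sequence length); it is not produced by the model in this notebook.

```python
# Hypothetical per-time-step sigmoid outputs for a batch of 2 reviews,
# each with 4 time steps (values are made up for illustration).
sigmoid_outputs = [
    [0.52, 0.61, 0.70, 0.91],  # review 0
    [0.48, 0.40, 0.22, 0.08],  # review 1
]

# Keep only the output at the last time step of each review;
# this is the value that gets compared with the review's label.
last_outputs = [steps[-1] for steps in sigmoid_outputs]

print(last_outputs)
```

Only `last_outputs` would enter the loss; the earlier time steps are discarded.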

---

### Load in and visualize the data

In [2]:
import numpy as np

# read data from text files
with open('data/reviews.txt', 'r') as f:
    reviews = f.read()
with open('data/labels.txt', 'r') as f:
    labels = f.read()

In [3]:
print(reviews[:2000])
print()
print(labels[:20])

bromwell high is a cartoon comedy . it ran at the same time as some other programs about school life  such as  teachers  . my   years in the teaching profession lead me to believe that bromwell high  s satire is much closer to reality than is  teachers  . the scramble to survive financially  the insightful students who can see right through their pathetic teachers  pomp  the pettiness of the whole situation  all remind me of the schools i knew and their students . when i saw the episode in which a student repeatedly tried to burn down the school  i immediately recalled . . . . . . . . . at . . . . . . . . . . high . a classic line inspector i  m here to sack one of your teachers . student welcome to bromwell high . i expect that many adults of my age think that bromwell high is far fetched . what a pity that it isn  t   
story of a man who has unnatural feelings for a pig . starts out with a opening scene that is a terrific example of absurd comedy . a formal orchestra audience is turn

## Data pre-processing

The first step when building a neural network model is getting your data into the proper form to feed into the network. Since we're using embedding layers, we'll need to encode each word with an integer. We'll also want to clean it up a bit.

You can see an example of the reviews data above. Here are the processing steps we'll want to take:
>* Get rid of periods and extraneous punctuation.
>* Split the text into individual reviews. The reviews are delimited with newline characters (`\n`), so we can use `\n` as the delimiter.
>* Combine all the reviews back together into one big string, which we can then split into words.

First, let's remove all punctuation. Then we'll get all the text without the newlines and split it into individual words.
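Before running these steps on the full dataset, it can help to trace them on a tiny input. The sketch below uses a made-up two-review string (`toy_reviews` is a stand-in, not the real `reviews.txt`) and applies the same lowercase, strip-punctuation, split and rejoin sequence.

```python
from string import punctuation

# Toy stand-in for reviews.txt: two reviews separated by a newline.
toy_reviews = "Great movie, loved it!\nNot good... at all.\n"

# Lowercase, then drop every punctuation character.
toy_text = ''.join(c for c in toy_reviews.lower() if c not in punctuation)

# One entry per review (the trailing newline yields a final empty string).
toy_split = toy_text.split('\n')

# Rejoin into one string and split on whitespace to get the word list.
toy_words = ' '.join(toy_split).split()

print(toy_split)
print(toy_words)
```

Note the empty string at the end of `toy_split`: a file that ends with a newline produces one zero-length "review", which has to be dealt with later.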

In [4]:
from string import punctuation

print(punctuation)

# get rid of punctuation
reviews = reviews.lower() # lowercase, standardize
all_text = ''.join([c for c in reviews if c not in punctuation])

!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~


In [5]:
# split by new lines and spaces
reviews_split = all_text.split('\n') # for ML, to map input to target sentiment
all_text = ' '.join(reviews_split) # to generate words below

# create a list of words (needed to build dictionary mapping words to ints)
words = all_text.split()

In [6]:
words[:30]

['bromwell',
 'high',
 'is',
 'a',
 'cartoon',
 'comedy',
 'it',
 'ran',
 'at',
 'the',
 'same',
 'time',
 'as',
 'some',
 'other',
 'programs',
 'about',
 'school',
 'life',
 'such',
 'as',
 'teachers',
 'my',
 'years',
 'in',
 'the',
 'teaching',
 'profession',
 'lead',
 'me']

### Encoding the words

The embedding lookup requires that we pass in integers to our network. The easiest way to do this is to create dictionaries that map the words in the vocabulary to integers. Then we can convert each of our reviews into integers so they can be passed into the network.

> **Exercise:** Now you're going to encode the words with integers. Build a dictionary that maps words to integers. Later we're going to pad our input vectors with zeros, so make sure the integers **start at 1, not 0**. Also convert the reviews to integers and store the reviews in a new list called `reviews_ints`.
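
As a warm-up for the exercise, here is a small sketch of one possible encoding on a made-up word list. The names `toy_words` and `toy_vocab_to_int` are hypothetical; the approach (most frequent word gets the smallest integer, starting at 1) is one reasonable choice, not the only valid one.

```python
from collections import Counter

toy_words = "the movie was great the acting was great the end".split()

counts = Counter(toy_words)
# Most frequent word first; ties keep first-seen order because sorted() is stable.
vocab = sorted(counts, key=counts.get, reverse=True)
# Start at 1 so that 0 stays free for padding.
toy_vocab_to_int = {word: idx for idx, word in enumerate(vocab, start=1)}

# Encode a review by looking up each word.
toy_review = "the acting was great"
toy_review_ints = [toy_vocab_to_int[w] for w in toy_review.split()]

print(toy_vocab_to_int)
print(toy_review_ints)
```

Ordering by frequency is not required for the embedding lookup to work, but it makes the integers easy to read: small numbers are common words.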

In [7]:
# feel free to use this import
from collections import Counter

In [8]:
# Build a dictionary that maps words to integers
counts = Counter(words)
vocab = sorted(counts, key=counts.get, reverse=True)
vocab_to_int = {word:idx for idx, word in enumerate(vocab, start=1)}

In [9]:
## use the dict to tokenize each review in reviews_split
## store the tokenized reviews in reviews_ints
reviews_ints = []
for review in reviews_split:
    review_int = [vocab_to_int[word] for word in review.split()]
    reviews_ints.append(review_int)

#### Test your code

As a test that you've implemented the dictionary correctly, print out the number of unique words in your vocabulary and the contents of the first tokenized review.

In [10]:
# stats about vocabulary
print('Unique words: ', len((vocab_to_int))) # should be ~74000+
print()

# print tokens in first review
print('Tokenized review: \n', reviews_ints[:1])

Unique words:  74072

Tokenized review: 
 [[21025, 308, 6, 3, 1050, 207, 8, 2138, 32, 1, 171, 57, 15, 49, 81, 5785, 44, 382, 110, 140, 15, 5194, 60, 154, 9, 1, 4975, 5852, 475, 71, 5, 260, 12, 21025, 308, 13, 1978, 6, 74, 2395, 5, 613, 73, 6, 5194, 1, 24103, 5, 1983, 10166, 1, 5786, 1499, 36, 51, 66, 204, 145, 67, 1199, 5194, 19869, 1, 37442, 4, 1, 221, 883, 31, 2988, 71, 4, 1, 5787, 10, 686, 2, 67, 1499, 54, 10, 216, 1, 383, 9, 62, 3, 1406, 3686, 783, 5, 3483, 180, 1, 382, 10, 1212, 13583, 32, 308, 3, 349, 341, 2913, 10, 143, 127, 5, 7690, 30, 4, 129, 5194, 1406, 2326, 5, 21025, 308, 10, 528, 12, 109, 1448, 4, 60, 543, 102, 12, 21025, 308, 6, 227, 4146, 48, 3, 2211, 12, 8, 215, 23]]


### Encoding the labels

Our labels are "positive" or "negative". To use these labels in our network, we need to convert them to 0 and 1.

> **Exercise**: Convert labels from `positive` and `negative` to 1 and 0, respectively, and place those in a new list, `encoded_labels`.
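
One detail worth checking on a toy input is what happens at the end of the file. The sketch below assumes, as the label counts in this notebook suggest, that the label text ends with a newline; `toy_labels` is made-up data.

```python
toy_labels = "positive\nnegative\npositive\n"  # note the trailing newline

# Splitting on '\n' leaves an empty string after the final newline.
pieces = toy_labels.split('\n')

# 1 = positive, 0 = anything else (including the trailing empty string).
toy_encoded = [int(label == 'positive') for label in pieces]

print(pieces)
print(toy_encoded)
```

The extra trailing `0` is not a real label. It lines up with the empty review produced by the same trailing newline in the reviews text, so both can be dropped together.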

In [11]:
# 1=positive, 0=negative label conversion
encoded_labels = np.array([int(label=='positive') for label in labels.split('\n')])

In [12]:
encoded_labels

array([1, 0, 1, ..., 1, 0, 0])

### Removing outliers

As an additional pre-processing step, we want to make sure that our reviews are in good shape for standard processing. That is, our network will expect a standard input text size, and so, we'll want to shape our reviews into a specific length. We'll approach this task in two main steps:

1. Getting rid of extremely long or short reviews; the outliers
2. Padding/truncating the remaining data so that we have reviews of the same length

<img src="https://github.com/udacity/deep-learning-v2-pytorch/raw/3a95d118f9df5a86826e1791c5c100817f0fd924/sentiment-rnn/assets/outliers_padding_ex.png" width=200px>

Before we pad our review text, we should check for reviews of extremely long or short lengths; outliers that may mess with our training.


In [13]:
# outlier review stats
review_lens = Counter([len(x) for x in reviews_ints])
print("Zero-length reviews: {}".format(review_lens[0]))
print("Maximum review length: {}".format(max(review_lens)))

Zero-length reviews: 1
Maximum review length: 2514


We seem to have one review with zero length. And the maximum review length is way too many steps for our RNN. We'll have to remove any super short reviews and truncate super long reviews. This removes outliers and should allow our model to train more efficiently.

>**Exercise:** First, remove *any* reviews with zero length from the `reviews_ints` list and their corresponding label in `encoded_labels`.
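
The important part of this step is that reviews and labels are filtered together, so they stay aligned. A minimal sketch on made-up data (the `toy_*` and `kept_*` names are hypothetical):

```python
toy_reviews_ints = [[4, 2, 9], [], [7, 1]]
toy_labels = [1, 0, 0]

# Keep only (review, label) pairs whose review is non-empty.
kept = [(review, label)
        for review, label in zip(toy_reviews_ints, toy_labels)
        if len(review) > 0]

kept_reviews = [review for review, _ in kept]
kept_labels = [label for _, label in kept]

print(kept_reviews)
print(kept_labels)
```

Filtering only one of the two lists would silently shift every later label by one position.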

In [14]:
print("Number of reviews before removing outliers: ", len(reviews_ints))

## remove any reviews/labels with zero length from the reviews_ints list.

# collect the indices of all reviews with non-zero length
non_zero_idx = [ii for ii, review in enumerate(reviews_ints) if len(review) != 0]

# remove 0-length reviews and their labels
reviews_ints = [reviews_ints[ii] for ii in non_zero_idx]
encoded_labels = np.array([encoded_labels[ii] for ii in non_zero_idx])

print('Number of reviews after removing outliers: ', len(reviews_ints))

Number of reviews before removing outliers:  25001
Number of reviews after removing outliers:  25000


---
## Padding sequences

To deal with both short and very long reviews, we'll pad or truncate all our reviews to a specific length. For reviews shorter than some `seq_length`, we'll pad with 0s. For reviews longer than `seq_length`, we can truncate them to the first `seq_length` words. A good `seq_length`, in this case, is 200.

>**Exercise:** Define a function that returns an array `features` that contains the padded data, of a standard size, that we'll pass to the network.
>
>* The data should come from `reviews_ints`, since we want to feed integers to the network.
>* Each row should be `seq_length` elements long.
>* For reviews shorter than `seq_length` words, **left pad** with 0s. That is, if the review is `['best', 'movie', 'ever']`, `[117, 18, 128]` as integers, the row will look like `[0, 0, 0, ..., 117, 18, 128]`.
>* For reviews longer than `seq_length`, use only the first `seq_length` words as the feature vector.

As a small example, if the `seq_length=10` and an input review is:
```
[117, 18, 128]
```

The resultant, padded sequence should be:
```
[0,0,0,0,0,0,0,117,18,128]
```

**Your final `features` array should be a 2D array, with as many rows as there are reviews, and as many columns as the specified `seq_length`.**
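The padding rule can be sketched for a single review in a few lines of plain Python. `pad_left` is a hypothetical helper for illustration only; the exercise below asks for a function that handles the whole list of reviews and returns a 2D array.

```python
def pad_left(review_ints, seq_length):
    """Left-pad with 0s or truncate so the result has exactly seq_length tokens."""
    trimmed = review_ints[:seq_length]           # keep at most the first seq_length tokens
    return [0] * (seq_length - len(trimmed)) + trimmed

short = pad_left([117, 18, 128], seq_length=10)      # shorter than seq_length
long_ = pad_left(list(range(1, 13)), seq_length=10)  # 12 tokens, gets truncated

print(short)
print(long_)
```

Both results have exactly 10 elements: the short review gains 7 leading zeros, and the long one loses its last 2 tokens.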


In [15]:
def pad_features(reviews_ints, seq_length):
    ''' Return features of reviews_ints, where each review is padded with 0's
        or truncated to the input seq_length.
    '''
    ## implement function

    features = []

    for review_int in reviews_ints:
        if len(review_int) < seq_length:
            standard_review = [0]*(seq_length - len(review_int)) + review_int
        elif len(review_int) > seq_length:
            standard_review = review_int[:seq_length]
        else:
            standard_review = review_int
        features.append(standard_review)

    return np.array(features)

In [16]:
# Test your implementation!

seq_length = 200

features = pad_features(reviews_ints, seq_length=seq_length)

## test statements - do not change - ##
assert len(features) == len(reviews_ints), "Your features should have as many rows as reviews."
assert len(features[0]) == seq_length, "Each feature row should contain seq_length values."

# print first 10 values of the first 30 reviews
print(features[:30,:10])

[[    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [22382    42 46418    15   706 17139  3389    47    77    35]
 [ 4505   505    15     3  3342   162  8312  1652     6  4819]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [   54    10    14   116    60   798   552    71   364     5]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    0     0     0     0     0     0     0     0     0     0]
 [    1   330   578    34     3   162   748  2731     9   325]
 [    9    11 10171  5305  1946   689   444    22   280   673]
 [    0     0     0     0     0     0     0     0     0

## Training, Validation, Test

With our data in nice shape, we'll split into training, validation and test sets.

> **Exercise:** Create the training, validation, and test sets.
>
>* You'll need to create sets for the features and the labels, `train_x` and `train_y`, for example.
>* Define a split fraction, `split_frac` as the fraction of data to **keep** in the training set. Usually this is set to 0.8 or 0.9.
>* Whatever data is left will be split in half to create the validation and *testing* data.
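
The arithmetic behind the split can be checked without any data. The sketch below slices a stand-in list by index; it does not shuffle, so it only illustrates the sizes, and the variable names are assumptions for this example.

```python
split_frac = 0.8
n_samples = 25000  # number of reviews left after removing the empty one

# Index where the training set ends.
split_idx = int(n_samples * split_frac)

# The remaining data is split in half between validation and test.
remaining = n_samples - split_idx
val_end = split_idx + remaining // 2

data = list(range(n_samples))      # stand-in for the rows of `features`
train = data[:split_idx]
val = data[split_idx:val_end]
test = data[val_end:]

print(len(train), len(val), len(test))
```

With `split_frac = 0.8` this gives 20,000 training rows and 2,500 rows each for validation and test.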


In [17]:
split_frac = 0.8

from sklearn.model_selection import train_test_split

## split data into training, validation, and test data (features and labels, 
## x and y)
train_x, test_x, train_y, test_y = train_test_split(features, encoded_labels,
                                                    test_size=1-split_frac)

val_x, test_x, val_y, test_y = train_test_split(test_x, test_y, 
                                                test_size=0.5)

## print out the shapes of your resultant feature data
print("train", train_x.shape, train_y.shape)
print("val", val_x.shape, val_y.shape)
print("test", test_x.shape, test_y.shape)

train (20000, 200) (20000,)
val (2500, 200) (2500,)
test (2500, 200) (2500,)


---
## DataLoaders and Batching

After creating training, test, and validation data, we can create DataLoaders for this data by following two steps:

1. Create a known format for accessing our data, using `TensorDataset` which takes in an input set of data and a target set of data with the same first dimension, and creates a dataset.
2. Create DataLoaders and batch our training, validation, and test Tensor datasets.

```python
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
train_loader = DataLoader(train_data, batch_size=batch_size)
```

This is an alternative to creating a generator function for batching our data into full batches.
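
For comparison, a generator function of the kind mentioned above might look like the following sketch. `get_batches` is a hypothetical helper written for this example; it yields only full batches and drops any leftover items.

```python
def get_batches(x, y, batch_size):
    """Yield (inputs, targets) in full batches, dropping any leftover items."""
    n_batches = len(x) // batch_size
    for i in range(n_batches):
        start = i * batch_size
        yield x[start:start + batch_size], y[start:start + batch_size]

# 7 toy samples with batch_size=3 -> 2 full batches, 1 sample dropped.
toy_x = [[i, i + 1] for i in range(7)]
toy_y = [i % 2 for i in range(7)]
batches = list(get_batches(toy_x, toy_y, batch_size=3))

print(len(batches))
print(batches[0])
```

Full batches matter in this notebook because the hidden state is created for a fixed `batch_size`. The dataset sizes used here happen to divide evenly by 50, so the DataLoaders also produce only full batches.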


In [18]:
import torch
from torch.utils.data import TensorDataset, DataLoader

In [19]:
# create Tensor datasets
train_data = TensorDataset(torch.from_numpy(train_x), torch.from_numpy(train_y))
valid_data = TensorDataset(torch.from_numpy(val_x), torch.from_numpy(val_y))
test_data = TensorDataset(torch.from_numpy(test_x), torch.from_numpy(test_y))

# dataloaders
batch_size = 50

# make sure to shuffle your data
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
valid_loader = DataLoader(valid_data, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

In [20]:
# obtain one batch of training data
dataiter = iter(train_loader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

Sample input size:  torch.Size([50, 200])
Sample input: 
 tensor([[   10,  1777,     1,  ...,     8,     3,   348],
        [    0,     0,     0,  ...,   120,    66,     8],
        [   10,    65,   783,  ...,     8,    13,    24],
        ...,
        [  148,   775,  3255,  ...,  2551,    14,   337],
        [    0,     0,     0,  ...,     8,     7,     7],
        [   10,    43,   293,  ..., 10621,    13,   104]])

Sample label size:  torch.Size([50])
Sample label: 
 tensor([1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1,
        1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0,
        0, 0])


---
# Sentiment Network with PyTorch

Below is where you'll define the network.

<img src='https://github.com/udacity/deep-learning-v2-pytorch/raw/3a95d118f9df5a86826e1791c5c100817f0fd924/sentiment-rnn/assets/network_diagram.png' width=300px>

The layers are as follows:
1. An [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) that converts our word tokens (integers) into embeddings of a specific size.
2. An [LSTM layer](https://pytorch.org/docs/stable/nn.html#lstm) defined by a hidden_state size and number of layers.
3. A fully-connected output layer that maps the LSTM layer outputs to a desired output_size
4. A sigmoid activation layer which turns all outputs into a value 0-1; return **only the last sigmoid output** as the output of this network.

### The Embedding Layer

We need to add an [embedding layer](https://pytorch.org/docs/stable/nn.html#embedding) because there are 74000+ words in our vocabulary. It is massively inefficient to one-hot encode that many classes. So, instead of one-hot encoding, we can have an embedding layer and use that layer as a lookup table. You could train an embedding layer using Word2Vec and then load it here. But it's fine to just make a new layer, using it only for dimensionality reduction and let the network learn the weights.
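To see why a lookup table can replace one-hot encoding, consider the sketch below. It uses a tiny made-up table in plain Python (4 words, embedding size 3) and shows that multiplying a one-hot vector by the table gives the same result as simply reading one row.

```python
# Toy embedding table: 4 "words" (rows), embedding size 3 (columns).
# Values are made up; index 0 is the padding token in this notebook.
embedding_table = [
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
]

def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector by a matrix."""
    return [sum(v * row[j] for v, row in zip(vec, mat))
            for j in range(len(mat[0]))]

token = 2
via_one_hot = vec_mat(one_hot(token, len(embedding_table)), embedding_table)
via_lookup = embedding_table[token]

print(via_one_hot)
print(via_lookup)
```

The matrix product touches every row of the table, while the lookup reads exactly one. With 74,000+ rows, that difference is what makes the embedding layer the efficient choice.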

### The LSTM Layer(s)

We'll create an [LSTM](https://pytorch.org/docs/stable/nn.html#lstm) to use in our recurrent network, which takes in an input_size, a hidden_dim, a number of layers, a dropout probability (for dropout between multiple layers), and a batch_first parameter.

Most of the time, your network will perform better with more layers, typically 2 or 3. Adding more layers allows the network to learn really complex relationships.

> **Exercise**: Complete the `__init__`, `forward`, and `init_hidden` functions for the SentimentRNN model class.

Note: `init_hidden` should initialize the hidden and cell states of an LSTM layer to all zeros and move those states to the GPU, if available.

In [21]:
# First, check if GPU is available
train_on_gpu = torch.cuda.is_available()

if(train_on_gpu):
    print('Training on GPU.')
else:
    print('No GPU available, training on CPU.')

Training on GPU.


In [28]:
import torch.nn as nn

class SentimentRNN(nn.Module):
    """
    The RNN model that will be used to perform Sentiment Analysis.
    """
    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, 
                 n_layers, drop_prob=0.5):
        """
        Initialize the model by setting up the layers.
        """
        super(SentimentRNN, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim

        # define all layers

        # Embedding Layer takes in the size of our vocabulary (vocab_size) 
        # and produces an embedding of size embedding_dim. It results in a
        # lookup table with # rows = # word integers and # cols = # embed_dim
        self.embedding = nn.Embedding(num_embeddings=vocab_size, 
                                  embedding_dim=embedding_dim)
        
        # LSTM layer takes input from the embedding layer (with size 
        # embedding_dim). It produces an output and hidden state (with size
        # hidden_dim). We set batch_first=True because we are using DataLoaders
        # to batch our data.
        self.lstm = nn.LSTM(input_size=embedding_dim, 
                            hidden_size=self.hidden_dim, 
                            num_layers=self.n_layers, dropout=drop_prob,
                            batch_first=True)
        
        # Dropout layer, that receives as input the LSTM outputs.
        self.dropout = nn.Dropout(p=0.3)

        # Fully-connected layer maps LSTM layer outputs (after dropout) to 
        # `output_size`
        self.fc = nn.Linear(in_features=self.hidden_dim, 
                            out_features=self.output_size)
        
        # Sigmoid layer
        self.sigmoid = nn.Sigmoid()
        
    def forward(self, x, hidden):
        """
        Perform a forward pass of our model on some input and hidden state.
        """
        # Get batch_size of data for shaping data
        batch_size = x.size(0)

        # Pass x input through embedding layer
        embeds = self.embedding(x)

        # Pass embedding to LSTM layer
        lstm_out, hidden = self.lstm(embeds, hidden)

        # Stack up LSTM outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)

        # LSTM outputs passed to a dropout layer
        out = self.dropout(lstm_out)

        # Pass outputs through fully-connected layer
        out = self.fc(out) 

        # Pass outputs to sigmoid
        sig_out = self.sigmoid(out)

        # Reshape to be batch_size first
        sig_out = sig_out.view(batch_size, -1)
        sig_out = sig_out[:, -1] # keep only the output at the last time step of each sequence

        # return last sigmoid output and hidden state
        return sig_out, hidden
    
    def init_hidden(self, batch_size):
        ''' Initializes hidden state '''
        # Create two new tensors with sizes n_layers x batch_size x hidden_dim,
        # initialized to zero, for hidden state and cell state of LSTM

        weight = next(self.parameters()).data

        if(train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())

        return hidden

## Instantiate the network

Here, we'll instantiate the network. Let's define the hyperparameters.
* `vocab_size`: Size of our vocabulary or the range of values for our input, word tokens.
* `output_size`: Size of our desired output; the number of class scores we want to output (pos/neg).
* `embedding_dim`: Number of columns in the embedding lookup table; size of our embeddings.
* `hidden_dim`: Number of units in the hidden layers of our LSTM cells. Usually larger is better performance wise. Common values are 128, 256, 512, etc.
* `n_layers`: Number of LSTM layers in the network. Typically between 1-3.
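
To get a feel for how these hyperparameters drive model size, the sketch below counts trainable parameters by hand. The values are one possible choice (a vocabulary of 74,072 words plus the padding token, `embedding_dim=512`, `hidden_dim=256`, `n_layers=2`), and the LSTM formula assumes PyTorch's `nn.LSTM` layout of four gates with separate input-hidden and hidden-hidden bias vectors.

```python
vocab_size = 74072 + 1   # unique words + 1 for the 0 padding token
embedding_dim = 512
hidden_dim = 256
n_layers = 2
output_size = 1

# Embedding table: one row of size embedding_dim per token.
embedding_params = vocab_size * embedding_dim

def lstm_layer_params(input_size, hidden_size):
    # 4 gates, each with input weights, recurrent weights and two bias vectors.
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

# The first layer reads embeddings; deeper layers read the previous layer's hidden state.
lstm_params = lstm_layer_params(embedding_dim, hidden_dim) + sum(
    lstm_layer_params(hidden_dim, hidden_dim) for _ in range(n_layers - 1))

# Fully-connected layer: weights plus bias.
fc_params = hidden_dim * output_size + output_size

total_params = embedding_params + lstm_params + fc_params
print(embedding_params, lstm_params, fc_params, total_params)
```

With these values the embedding table accounts for the vast majority of the parameters, so `embedding_dim` is the setting that changes model size the most.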

>**Exercise**: Define the model hyperparameters.

In [29]:
# Instantiate the model w/ hyperparameters
vocab_size = len(vocab_to_int) + 1
output_size = 1 # pos/neg
embedding_dim = 512
hidden_dim = 256
n_layers = 2

net = SentimentRNN(vocab_size, output_size, embedding_dim, hidden_dim, n_layers)

print(net)

SentimentRNN(
  (embedding): Embedding(74073, 512)
  (lstm): LSTM(512, 256, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.3, inplace=False)
  (fc): Linear(in_features=256, out_features=1, bias=True)
  (sigmoid): Sigmoid()
)


---
## Training

Below is the typical training code.

>We'll also be using a new kind of cross entropy loss, which is designed to work with a single sigmoid output. [BCELoss](https://pytorch.org/docs/stable/nn.html#bceloss) or **Binary Cross Entropy Loss** applies cross entropy loss to a single value between 0 and 1.
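
As a rough illustration of what binary cross entropy computes, here is a plain-Python sketch of the mean loss over a batch. `bce_loss` is a hypothetical helper, not PyTorch's implementation, and the probabilities are made up.

```python
import math

def bce_loss(probs, targets):
    """Mean binary cross entropy for predicted probabilities strictly between 0 and 1."""
    losses = [-(t * math.log(p) + (1 - t) * math.log(1 - p))
              for p, t in zip(probs, targets)]
    return sum(losses) / len(losses)

# Confident and correct predictions give a small loss ...
confident_right = bce_loss([0.9, 0.1], [1, 0])
# ... while confident and wrong predictions are penalized heavily.
confident_wrong = bce_loss([0.1, 0.9], [1, 0])

print(round(confident_right, 4), round(confident_wrong, 4))
```

The loss grows quickly as a prediction moves toward the wrong end of the 0-1 range, which is what pushes the sigmoid output toward the correct label.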

We also have some data and training hyperparameters:
* `lr`: Learning rate for our optimizer
* `epochs`: Number of times to iterate through the training dataset.
* `clip`: The maximum gradient norm to clip at (to prevent exploding gradients)
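
The idea behind clipping by norm can be sketched on a flat list of gradient values. `clip_by_norm` below is a simplified, hypothetical helper: PyTorch's `clip_grad_norm_` works in place across all parameter tensors, but the rescaling idea is the same.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)            # already small enough, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads]

clipped = clip_by_norm([30.0, 40.0], max_norm=5)    # norm 50, rescaled to norm 5
untouched = clip_by_norm([0.3, 0.4], max_norm=5)    # norm 0.5, left as is

print(clipped)
print(untouched)
```

Clipping by norm keeps the direction of the gradient and only shrinks its length, unlike clipping each value independently.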

In [30]:
# loss and optimization functions
lr = 0.001
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(net.parameters(), lr=lr)

In [31]:
# training params
epochs = 4 # Change to 3-4
counter = 0
print_every = 100
clip = 5 # gradient clipping

# move model to GPU, if available
if(train_on_gpu):
    net.cuda()

net.train()
# train for some number of epochs
for e in range(epochs):
    # initialize hidden state
    h = net.init_hidden(batch_size)

    # batch loop
    for inputs, labels in train_loader:
        counter += 1

        if (train_on_gpu):
            inputs, labels = inputs.cuda(), labels.cuda()

        # Creating new variables for the hidden state, otherwise
        # we'd backprop through the entire training history
        h = tuple([each.data for each in h])

        # zero accumulated gradients
        net.zero_grad()

        # get the output from the model
        output, h = net(inputs, h)

        # calculate the loss and perform backprop
        loss = criterion(output.squeeze(), labels.float())
        loss.backward()

        # `clip_grad_norm_` helps prevent the exploding gradient problem in RNNs/LSTMs
        nn.utils.clip_grad_norm_(net.parameters(), clip)
        optimizer.step()

        # loss stats
        if counter % print_every == 0:
            # Get validation loss
            val_h = net.init_hidden(batch_size)
            val_losses = []
            net.eval()
            for inputs, labels in valid_loader:

                # Creating new variables for the hidden state, otherwise
                # we'd backprop through the entire training history
                val_h = tuple([each.data for each in val_h])

                if(train_on_gpu):
                    inputs, labels = inputs.cuda(), labels.cuda()

                output, val_h = net(inputs, val_h)
                val_loss = criterion(output.squeeze(), labels.float())

                val_losses.append(val_loss.item())

            net.train()
            print("Epoch: {}/{}...".format(e+1, epochs),
                  "Step: {}...".format(counter),
                  "Loss: {:.6f}...".format(loss.item()),
                  "Val Loss: {:.6f}".format(np.mean(val_losses)))

Epoch: 1/4... Step: 100... Loss: 0.612672... Val Loss: 0.646718
Epoch: 1/4... Step: 200... Loss: 0.564205... Val Loss: 0.583867
Epoch: 1/4... Step: 300... Loss: 0.585234... Val Loss: 0.592831
Epoch: 1/4... Step: 400... Loss: 0.740851... Val Loss: 0.534791
Epoch: 2/4... Step: 500... Loss: 0.432671... Val Loss: 0.493893
Epoch: 2/4... Step: 600... Loss: 0.519769... Val Loss: 0.496737
Epoch: 2/4... Step: 700... Loss: 0.445338... Val Loss: 0.443765
Epoch: 2/4... Step: 800... Loss: 0.417270... Val Loss: 0.433025
Epoch: 3/4... Step: 900... Loss: 0.434819... Val Loss: 0.443449
Epoch: 3/4... Step: 1000... Loss: 0.261876... Val Loss: 0.412373
Epoch: 3/4... Step: 1100... Loss: 0.361628... Val Loss: 0.473495
Epoch: 3/4... Step: 1200... Loss: 0.288354... Val Loss: 0.397245
Epoch: 4/4... Step: 1300... Loss: 0.174693... Val Loss: 0.403743
Epoch: 4/4... Step: 1400... Loss: 0.360434... Val Loss: 0.434229
Epoch: 4/4... Step: 1500... Loss: 0.328719... Val Loss: 0.428251
Epoch: 4/4... Step: 1600... Loss: 

---
## Testing

There are a few ways to test your network.
* **Test data performance:** First, we'll see how our trained model performs on all of our defined test_data, above. We'll calculate the average loss and accuracy
* **Inference on user-generated data:** Second, we'll see if we can input just one example review at a time (without a label) and see what the trained model predicts. Looking at new user input like this and predicting an output label is called **inference**.
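
The accuracy calculation itself is simple enough to sketch on made-up numbers. The example below thresholds probabilities at 0.5 in plain Python, which is an assumption made for this illustration; the test cell in this notebook rounds the outputs with `torch.round` instead.

```python
probs = [0.91, 0.08, 0.55, 0.30]   # made-up model outputs
labels = [1, 0, 0, 0]              # made-up true labels

# Turn probabilities into predicted classes (1 = positive, 0 = negative).
preds = [1 if p > 0.5 else 0 for p in probs]

# Count matches and divide by the number of examples.
num_correct = sum(int(pred == label) for pred, label in zip(preds, labels))
accuracy = num_correct / len(labels)

print(preds, num_correct, accuracy)
```

Here the third example is predicted positive but labeled negative, so 3 of 4 predictions are correct.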

In [32]:
# Get test data loss and accuracy

test_losses = [] # track loss
num_correct = 0

# initialize hidden state
h = net.init_hidden(batch_size)

net.eval()
# iterate over test data
for inputs, labels in test_loader:

    # Creating new variables for the hidden state, otherwise we would backprop
    # through the entire training history
    h = tuple([each.data for each in h])

    if (train_on_gpu):
        inputs, labels = inputs.cuda(), labels.cuda()

    # get predicted outputs
    output, h = net(inputs, h)

    # calculate loss
    test_loss = criterion(output.squeeze(), labels.float())
    test_losses.append(test_loss.item())

    # convert output probabilities to predicted class (0/1)
    pred = torch.round(output.squeeze()) # rounds to the nearest integer

    # compare predictions to the true label
    correct_tensor = pred.eq(labels.float().view_as(pred))
    correct = np.squeeze(correct_tensor.numpy()) if not train_on_gpu else\
                np.squeeze(correct_tensor.cpu().numpy())
    num_correct += np.sum(correct)

# -- stats!-- ##
# avg test loss
print("Test loss: {:.3f}".format(np.mean(test_losses)))

# accuracy over all test data
test_acc = num_correct/len(test_loader.dataset)
print("Test accuracy: {:.3f}".format(test_acc))

Test loss: 0.429
Test accuracy: 0.836


### Inference on a test review

You can change this `test_review` to any text that you want. Read it and think: is it pos or neg? Then see if your model predicts it correctly!

> **Exercise**: Write a `predict` function that takes in a trained net, a plain `text_review`, and a sequence length and prints out a custom statement for a positive or negative review!
>* You can use any functions that you've already defined or define any helper functions you want to complete `predict`, but it should just take in a trained net, a text review, and a sequence length.

In [146]:
def predict(net, test_review, sequence_length=200):
    ''' Prints out whether a given review is predicted to be
        positive or negative in sentiment, using a pretrained model.

        params:
        net - A trained net
        test_review - a review made of normal text and punctuation
        sequence_length - the padded length of a review
    '''

    processed_text = test_review.lower()
    processed_text = ''.join([c for c in processed_text if c not in \
                              punctuation])

    # If token is not in our vocabulary, we label it as 0
    text_ints = [vocab_to_int.get(word, 0) for word in \
                 processed_text.lower().split()]

    # Pad review to sequence_length
    padded_review = pad_features([text_ints], sequence_length)

    # Turn review into Tensor, with additional dimension for batch_size (1)
    padded_review_tensor = torch.from_numpy(padded_review)

    batch_size = 1
    h = net.init_hidden(batch_size)
    net.eval()
    h = tuple([each.data for each in h])
    if(train_on_gpu):
        padded_review_tensor = padded_review_tensor.cuda()
    output, h = net(padded_review_tensor, h)
    pred = torch.round(output.squeeze())

    # print custom response based on whether test_review is pos/neg
    if pred == 1:
        print('positive')
    else:
        print('negative')

In [147]:
# negative test review
test_review_neg = 'The worst movie I have seen; acting was terrible and I want my money back. This movie had bad acting and the dialogue was slow.'
# positive test review
test_review_pos = 'This movie had the best acting and the dialogue was so good. I loved it.'

In [148]:
# call function
# try negative and positive reviews
seq_length=200
predict(net, test_review_pos, seq_length)

positive
