<a href="https://colab.research.google.com/github/davidmeadway/customerchurn/blob/main/Introduction_PyTorch_RNNLM_updated.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Tutorial - Introduction to PyTorch and Word-level RNN Language Models
This tutorial is adapted from the [_Introduction to PyTorch_](https://pytorch.org/tutorials/beginner/introyt.html) tutorial in the PyTorch documentation.

In this tutorial, you will learn the basics of PyTorch and how to build and train a neural network with it, using a word-level RNN language model as the running example.

## Part 1: Introduction to PyTorch

PyTorch is an open source machine learning framework based on the Torch library, used for applications such as computer vision and natural language processing, primarily developed by Facebook's AI Research lab.

In [None]:
# torchtext provides downloadable datasets for PyTorch.
# Install the following packages.
!pip3 install torch torchtext numpy
!pip install torchdata
!pip install portalocker

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Let's start by importing PyTorch.

In [None]:
import torch
print("Using torch", torch.__version__)

Using torch 2.0.0+cu118


Check whether a GPU is available:


In [None]:
torch.cuda.is_available()

True

In [None]:
torch.cuda.device_count()

1

In [None]:
torch.cuda.current_device()

0

### Tensors
Tensors are a specialized data structure, very similar to arrays and matrices. In PyTorch, we use tensors to encode the inputs and outputs of a model, as well as the model’s parameters.

Tensors are similar to NumPy’s ndarrays, except that tensors can run on GPUs or other hardware accelerators. In fact, tensors and NumPy arrays can often share the same underlying memory, eliminating the need to copy data (shown at the end of this subsection). Tensors are also optimized for automatic differentiation (we’ll see more about that later in the autograd section). If you’re familiar with ndarrays, you’ll be right at home with the Tensor API. If not, follow along!

In [None]:
import torch
import numpy as np

Tensors can be created directly from data or from NumPy arrays.

In [None]:
# Directly from data
data = [[1, 2],[3, 4]]
x_data = torch.tensor(data)

# From a NumPy array
np_array = np.array(data)
x_np = torch.from_numpy(np_array)

Tensor attributes describe their shape, datatype, and the device on which they are stored.

In [None]:
print(f"Shape of tensor: {x_data.shape}")
print(f"Datatype of tensor: {x_data.dtype}")
print(f"Device tensor is stored on: {x_data.device}")

Shape of tensor: torch.Size([2, 2])
Datatype of tensor: torch.int64
Device tensor is stored on: cpu


`shape` is a tuple of tensor dimensions. We can create a new tensor by specifying the dimensionality and initialize it with random or constant values.
- `torch.zeros`: Creates a tensor filled with zeros
- `torch.ones`: Creates a tensor filled with ones
- `torch.rand`: Creates a tensor with random values uniformly sampled between 0 and 1

In [None]:
shape = (2,3,)
rand_tensor = torch.rand(shape)
ones_tensor = torch.ones(shape)
zeros_tensor = torch.zeros(shape)

print(f"Random Tensor: \n {rand_tensor} \n")
print(f"Ones Tensor: \n {ones_tensor} \n")
print(f"Zeros Tensor: \n {zeros_tensor}")

Random Tensor: 
 tensor([[0.0627, 0.0962, 0.1936],
        [0.8497, 0.4206, 0.5859]]) 

Ones Tensor: 
 tensor([[1., 1., 1.],
        [1., 1., 1.]]) 

Zeros Tensor: 
 tensor([[0., 0., 0.],
        [0., 0., 0.]])


We can also create a tensor from another tensor. The new tensor retains the properties (shape, datatype) of the argument tensor, unless explicitly overridden.
* `torch.zeros_like`: Creates a tensor filled with zeros
* `torch.ones_like`: Creates a tensor filled with ones
* `torch.rand_like`: Creates a tensor with random values uniformly sampled between 0 and 1

In [None]:
x_ones = torch.ones_like(x_data) # retains the properties of x_data
print(f"Ones Tensor: \n {x_ones} \n")

x_rand = torch.rand_like(x_data, dtype=torch.float) # overrides the datatype of x_data
print(f"Random Tensor: \n {x_rand} \n")

Ones Tensor: 
 tensor([[1, 1],
        [1, 1]]) 

Random Tensor: 
 tensor([[0.2302, 0.2607],
        [0.1067, 0.2830]]) 



#### Operations on Tensors
Over 100 tensor operations, including arithmetic, linear algebra, matrix manipulation (transposing, indexing, slicing), sampling and more are comprehensively described [here](https://pytorch.org/docs/stable/torch.html).

By default, tensors are created on the CPU. We need to explicitly move tensors to the GPU using the `.to` method (after checking for GPU availability). Keep in mind that copying large tensors across devices can be expensive in terms of time and memory!

In [None]:
tensor = torch.rand(3,4)
# Move the tensor to the GPU if one is available
if torch.cuda.is_available():
    tensor = tensor.to("cuda")
tensor

tensor([[0.0799, 0.0575, 0.3404, 0.9873],
        [0.3090, 0.3081, 0.7725, 0.7711],
        [0.3139, 0.1672, 0.3240, 0.9477]], device='cuda:0')

We can perform standard NumPy-like indexing and slicing:

In [None]:
print(f"First row: {tensor[0]}")
print(f"First column: {tensor[:, 0]}")
print(f"Last column: {tensor[..., -1]}")
tensor[:,1] = 0
print(tensor)

First row: tensor([0.0799, 0.0575, 0.3404, 0.9873], device='cuda:0')
First column: tensor([0.0799, 0.3090, 0.3139], device='cuda:0')
Last column: tensor([0.9873, 0.7711, 0.9477], device='cuda:0')
tensor([[0.0799, 0.0000, 0.3404, 0.9873],
        [0.3090, 0.0000, 0.7725, 0.7711],
        [0.3139, 0.0000, 0.3240, 0.9477]], device='cuda:0')


We can use `torch.cat` to concatenate a sequence of tensors along a given dimension. See also `torch.stack`, another tensor-joining op that is subtly different from `torch.cat`.

In [None]:
t1 = torch.cat([tensor, tensor, tensor], dim=1)
print(t1)

tensor([[0.0799, 0.0000, 0.3404, 0.9873, 0.0799, 0.0000, 0.3404, 0.9873, 0.0799,
         0.0000, 0.3404, 0.9873],
        [0.3090, 0.0000, 0.7725, 0.7711, 0.3090, 0.0000, 0.7725, 0.7711, 0.3090,
         0.0000, 0.7725, 0.7711],
        [0.3139, 0.0000, 0.3240, 0.9477, 0.3139, 0.0000, 0.3240, 0.9477, 0.3139,
         0.0000, 0.3240, 0.9477]], device='cuda:0')
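The difference between `torch.cat` and `torch.stack` is easiest to see in the resulting shapes. The sketch below uses two small CPU tensors (`a` and `b` are illustrative names, not variables from the cells above): `cat` joins along an existing dimension, while `stack` inserts a new one.

```python
import torch

a = torch.zeros(3, 4)
b = torch.ones(3, 4)

# cat joins along an existing dimension, so that dimension grows.
cat_rows = torch.cat([a, b], dim=0)   # shape (6, 4)
cat_cols = torch.cat([a, b], dim=1)   # shape (3, 8)

# stack creates a new dimension of size len(inputs); all inputs must share one shape.
stacked = torch.stack([a, b], dim=0)  # shape (2, 3, 4)

print(cat_rows.shape, cat_cols.shape, stacked.shape)
```

With `stack`, the original tensors remain addressable as `stacked[0]` and `stacked[1]`, which is convenient when building a batch from individual samples.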


PyTorch tensors perform arithmetic operations intuitively. Tensors of similar shapes may be added, multiplied, etc. Operations with scalars are distributed over the tensor:

In [None]:
# This computes the matrix multiplication between two tensors. y1, y2, y3 will have the same value
y1 = tensor @ tensor.T
y2 = tensor.matmul(tensor.T)

y3 = torch.rand_like(y1)  # preallocate with the result's shape (3 x 3) so `out=` needs no resize
torch.matmul(tensor, tensor.T, out=y3)


tensor([[1.0970, 1.0490, 1.0710],
        [1.0490, 1.2869, 1.0780],
        [1.0710, 1.0780, 1.1016]], device='cuda:0')

In [None]:
# This computes the element-wise product. z1, z2, z3 will have the same value
z1 = tensor * tensor
z2 = tensor.mul(tensor)

z3 = torch.rand_like(tensor)
torch.mul(tensor, tensor, out=z3)

tensor([[0.0064, 0.0000, 0.1159, 0.9748],
        [0.0955, 0.0000, 0.5968, 0.5945],
        [0.0985, 0.0000, 0.1050, 0.8981]], device='cuda:0')

If you have a one-element tensor, for example by aggregating all values of a tensor into one value, you can convert it to a Python numerical value using `item()`:

In [None]:
agg = tensor.sum()
agg_item = agg.item()
print(agg_item, type(agg_item))

4.845869541168213 <class 'float'>


**In-place operations** Operations that store the result into the operand are called in-place. They are denoted by a `_` suffix. For example: `x.copy_(y)`, `x.t_()`, will change `x`.

In [None]:
print(f"{tensor} \n")
tensor.add_(5)
print(tensor)

tensor([[0.0799, 0.0000, 0.3404, 0.9873],
        [0.3090, 0.0000, 0.7725, 0.7711],
        [0.3139, 0.0000, 0.3240, 0.9477]], device='cuda:0') 

tensor([[5.0799, 5.0000, 5.3404, 5.9873],
        [5.3090, 5.0000, 5.7725, 5.7711],
        [5.3139, 5.0000, 5.3240, 5.9477]], device='cuda:0')


Tensors on the CPU and NumPy arrays can share their underlying memory locations, and changing one will change the other.

In [None]:
tensor = tensor.cpu()
print(f"t: {tensor}")
n = tensor.numpy()
print(f"n: {n}")

t: tensor([[5.0799, 5.0000, 5.3404, 5.9873],
        [5.3090, 5.0000, 5.7725, 5.7711],
        [5.3139, 5.0000, 5.3240, 5.9477]])
n: [[5.0799174 5.        5.3404365 5.9873013]
 [5.309037  5.        5.772537  5.771062 ]
 [5.3138747 5.        5.324045  5.9476585]]


In [None]:
# A change in the tensor reflects in the NumPy array.
tensor.add_(1)
print(f"t: {tensor}")
print(f"n: {n}")

t: tensor([[6.0799, 6.0000, 6.3404, 6.9873],
        [6.3090, 6.0000, 6.7725, 6.7711],
        [6.3139, 6.0000, 6.3240, 6.9477]])
n: [[6.0799174 6.        6.3404365 6.9873013]
 [6.309037  6.        6.772537  6.771062 ]
 [6.3138747 6.        6.324045  6.9476585]]


### Automatic Differentiation with torch.autograd
When training neural networks, the most frequently used algorithm is **back propagation**. In this algorithm, parameters (model weights) are adjusted according to the gradient of the loss function with respect to the given parameter.

To compute those gradients, PyTorch has a built-in differentiation engine called `torch.autograd`. It supports automatic computation of gradients for any computational graph.

Consider the simplest one-layer neural network, with input `x`, parameters `w` and `b`, and some loss function. It can be defined in PyTorch in the following manner:

In [None]:
import torch

x = torch.ones(5)  # input tensor
y = torch.zeros(3)  # expected output
w = torch.randn(5, 3, requires_grad=True)
b = torch.randn(3, requires_grad=True)
z = torch.matmul(x, w)+b
loss = torch.nn.functional.binary_cross_entropy_with_logits(z, y)


In this network, `w` and `b` are parameters that we need to optimize. Thus, we need to be able to compute the gradients of the loss function with respect to those variables. To do that, we set the `requires_grad` property of those tensors.

A function that we apply to tensors to construct a computational graph is in fact an object of class `Function`. This object knows how to compute the function in the forward direction, and also how to compute its derivative during the backward propagation step. A reference to the backward propagation function is stored in the `grad_fn` property of a tensor. You can find more information about `Function` in the [documentation](https://pytorch.org/docs/stable/autograd.html#function).

In [None]:
print(f"Gradient function for z = {z.grad_fn}")
print(f"Gradient function for loss = {loss.grad_fn}")

Gradient function for z = <AddBackward0 object at 0x7f3981ca5630>
Gradient function for loss = <BinaryCrossEntropyWithLogitsBackward0 object at 0x7f3981ca5240>


#### Computing Gradients

To optimize the parameters of the neural network, we need to
compute the derivatives of our loss function with respect to them. That is, we need $\frac{\partial loss}{\partial w}$ and
$\frac{\partial loss}{\partial b}$ for some fixed values of
``x`` and ``y``. To compute those derivatives, we call
``loss.backward()``, and then retrieve the values from ``w.grad`` and
``b.grad``:



In [None]:
loss.backward()
print(w.grad)
print(b.grad)

tensor([[0.0907, 0.1157, 0.3149],
        [0.0907, 0.1157, 0.3149],
        [0.0907, 0.1157, 0.3149],
        [0.0907, 0.1157, 0.3149],
        [0.0907, 0.1157, 0.3149]])
tensor([0.0907, 0.1157, 0.3149])
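Every row of `w.grad` above equals `b.grad`. This follows from the chain rule: with `z = x·w + b`, we get `∂loss/∂w[i][j] = x[i] · ∂loss/∂z[j]` and `∂loss/∂b[j] = ∂loss/∂z[j]`, and here every `x[i]` is 1. The pure-Python sketch below re-derives the gradient by hand and checks one entry with a finite difference. It assumes the default mean reduction of the loss and uses fixed, arbitrary parameter values (the notebook samples them randomly, so the numbers differ).

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, w, b):
    # z_j = sum_i x_i * w[i][j] + b_j  (same as torch.matmul(x, w) + b)
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j] for j in range(len(b))]

def bce_with_logits(z, y):
    # Mean over outputs of softplus(z_j) - y_j * z_j.
    return sum(math.log1p(math.exp(zj)) - yj * zj for zj, yj in zip(z, y)) / len(z)

x = [1.0] * 5
y = [0.0] * 3
# Arbitrary fixed parameters, chosen only for illustration.
w = [[0.1 * (i - j) for j in range(3)] for i in range(5)]
b = [0.5, -0.25, 0.0]

z = forward(x, w, b)
grad_z = [(sigmoid(zj) - yj) / len(z) for zj, yj in zip(z, y)]  # d loss / d z_j
grad_b = grad_z
grad_w = [[x[i] * grad_z[j] for j in range(3)] for i in range(5)]

# Central finite difference on a single weight as a sanity check.
eps = 1e-6
w[2][1] += eps
loss_plus = bce_with_logits(forward(x, w, b), y)
w[2][1] -= 2 * eps
loss_minus = bce_with_logits(forward(x, w, b), y)
w[2][1] += eps
numeric = (loss_plus - loss_minus) / (2 * eps)

print(grad_w[2][1], numeric)
```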


## Part 2: Word-level Recurrent Language Model

#### Recurrent Neural Network (RNN)
A recurrent neural network (RNN) is a neural network with a recurrent hidden layer $\pmb{h}$ that operates over an input sequence $\pmb{x} = (x_1,...,x_{|\pmb{x}|})$, one symbol at a time. The recurrent connection means that the hidden state at one time step becomes an input to the next time step. Because information is passed between time steps through these connections, an RNN can model long-distance dependencies.

![rnn.png](https://drive.google.com/uc?export=view&id=1zkDrJ1rx4saigEKqj43Xrgt-3z29HESl)


The image above shows an RNN and its unfolding in time. At time step $t$, the hidden state $h_t$ and output $y_t$ are updated as follows:
\begin{align}
h_t &= \textrm{RNN}(h_{t-1}, x_t) \\ 
y_t &= O(h_t) = h_t 
\end{align}

where $\textrm{RNN}$ and $O$ denote the functions that compute the hidden state and the output vector, respectively. In the simplest case, the output function is the identity, as in the equation above. The simplest RNN formulation is the Elman network, which uses $\tanh$ as its activation function:
\begin{align}
h_t &= 
\begin{cases}
	\tanh(h_{t-1}W^h + x_tW^x + b) & \quad    (t \geq 1) \\ 
	0 & \quad \text{ otherwise}
\end{cases}
\end{align}
where $W^h, W^x$ are weight matrices and $b$ is a bias term.
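To make the Elman recurrence concrete, here is a minimal pure-Python sketch of a single step and a short unrolled sequence. The sizes, weights, and the helper name `elman_step` are hypothetical and chosen only for illustration; vectors are treated as rows, matching the $h_{t-1}W^h + x_tW^x + b$ form above.

```python
import math

def elman_step(h_prev, x_t, W_h, W_x, b):
    """Compute h_t = tanh(h_prev W^h + x_t W^x + b) with row vectors."""
    h_t = []
    for j in range(len(b)):
        total = b[j]
        total += sum(h_prev[k] * W_h[k][j] for k in range(len(h_prev)))
        total += sum(x_t[k] * W_x[k][j] for k in range(len(x_t)))
        h_t.append(math.tanh(total))
    return h_t

# Hypothetical weights: hidden size 2, input size 3.
W_h = [[0.5, -0.3], [0.2, 0.1]]                  # hidden_size x hidden_size
W_x = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]      # input_size x hidden_size
b = [0.0, 0.1]
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # three one-hot inputs

h = [0.0, 0.0]  # h_0 = 0, as in the "otherwise" case
states = []
for x_t in xs:
    h = elman_step(h, x_t, W_h, W_x, b)  # the new state feeds the next step
    states.append(h)

print(states)
```

The same weights are reused at every time step; only the hidden state changes, which is what lets information from earlier inputs influence later ones.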

#### Recurrent Neural Language Model
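As in an n-gram language model, the probability of a whole sentence is calculated by applying the chain rule to the probability of each word conditioned on the previous words. In the equation below, $\textrm{RNN}(h_{i-1}, w_i)$ denotes the probability the network assigns to $w_i$ given the hidden state $h_{i-1}$, which summarizes the preceding words.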
\begin{align}
P(\mathbf{w}) &= \prod_{i=1}^{n} P(w_{i}|\pmb{w}_{\le i-1}) = \prod_{i=1}^{n} \textrm{RNN}(h_{i-1}, w_i) 
\end{align}
![rnnlm.png](https://drive.google.com/uc?export=view&id=1f-b4rmSj8FGkgEnC-plAvWiQ3IbHVSQi)

The training objective is to maximize the log likelihood of the correct words.
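As a small worked example of the chain rule and the log-likelihood objective, the sketch below uses made-up conditional probabilities for a four-word sentence (the numbers are hypothetical, not model outputs). Multiplying them gives the sentence probability; summing their logs gives the log likelihood that training maximizes.

```python
import math

# Hypothetical conditional probabilities P(w_i | w_1..w_{i-1}) for a 4-word sentence.
cond_probs = [0.2, 0.1, 0.5, 0.25]

# Chain rule: the sentence probability is the product of the conditionals.
sentence_prob = math.prod(cond_probs)

# Working in log space avoids underflow for long sentences.
log_likelihood = sum(math.log(p) for p in cond_probs)

# Maximizing the log likelihood is the same as minimizing the average
# negative log likelihood per word.
nll_per_word = -log_likelihood / len(cond_probs)

print(sentence_prob, log_likelihood, nll_per_word)
```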

### Datasets & DataLoaders
Code for processing data samples can get messy and hard to maintain; we ideally want our dataset code to be decoupled from our model training code for better readability and modularity. PyTorch provides two data primitives: `torch.utils.data.DataLoader` and `torch.utils.data.Dataset` that allow you to use pre-loaded datasets as well as your own data. `Dataset` stores the samples and their corresponding labels, and `DataLoader` wraps an iterable around the `Dataset` to enable easy access to the samples.

#### Loading a dataset
We will load the WikiText2 dataset with the following parameters:
- `root`: Directory where the datasets are saved. Default: `.data`
- `split`: Split or splits to be returned. Can be a string or a tuple of strings. Default: (`'train'`, `'valid'`, `'test'`)

In [None]:
from torchtext.datasets import WikiText2
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

In [None]:
import torchdata
import portalocker

In [None]:
train_iter, valid_iter, test_iter = WikiText2(
    root="data", 
    split=('train', 'valid', 'test'))

num = 0
for _ in train_iter:
  num +=1

print(f"Train size: {num}")

Train size: 36718


In [None]:
import itertools

In [None]:

top5 = itertools.islice(train_iter, 5)
for item in top5:
  print(item)

 

 = Valkyria Chronicles III = 

 

 Senjō no Valkyria 3 : <unk> Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . <unk> the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " <unk> Raven " . 

 The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more <unk> for series newcomers . 

#### Dataset preprocess
We will tokenize the text and construct the vocabulary from the training dataset.

In [None]:
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torch.utils.data import dataset

In [None]:
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab_from_iterator(map(tokenizer, train_iter), specials=['<unk>'])
vocab.set_default_index(vocab['<unk>'])

In [None]:
print(f"Vocabulary size: {len(vocab)}")
print(f"The first 10 words in the vocabulary: {vocab.get_itos()[:10]}")

Vocabulary size: 28782
The first 10 words in the vocabulary: ['<unk>', 'the', ',', '.', 'of', 'and', 'in', 'to', 'a', '=']


In [None]:
vocab.lookup_tokens([1,2,3,4,5,6])

['the', ',', '.', 'of', 'and', 'in']

In [None]:
# The vocabulary object converts a list of tokens into integer indices.
vocab(['here', 'is', 'an', 'example'])

[1291, 23, 30, 617]

In [None]:
# Out-of-vocabulary tokens will be converted to `<unk>` (the default index)
vocab(['here', 'is', 'an', 'unknown', 'token', 'asdfghj'])

[1291, 23, 30, 1831, 28289, 0]
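The idea behind the vocabulary can be sketched without torchtext. The snippet below is a rough pure-Python approximation, not the library's implementation: special tokens come first, the remaining tokens are ordered by descending frequency, and unknown tokens fall back to a default index. The helper names `build_vocab` and `encode` are hypothetical, and ties are broken alphabetically here, which may differ from torchtext.

```python
from collections import Counter

def build_vocab(token_lists, specials=("<unk>",)):
    """Map tokens to indices: specials first, then tokens by descending frequency."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Ties are broken alphabetically; a real library may order ties differently.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    itos = list(specials) + [tok for tok in ordered if tok not in specials]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return stoi, itos

def encode(tokens, stoi, default_index=0):
    """Convert tokens to indices, mapping unseen tokens to default_index."""
    return [stoi.get(tok, default_index) for tok in tokens]

corpus = [["the", "cat", "sat"], ["the", "dog", "sat", "down"]]
stoi, itos = build_vocab(corpus)
ids = encode(["the", "bird", "sat"], stoi)  # "bird" is out of vocabulary

print(itos)
print(ids)
```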

In [None]:
def data_process(raw_text_iter: dataset.IterableDataset) -> torch.Tensor:
    """Converts raw text into a flat Tensor."""
    data = [torch.tensor(vocab(tokenizer(item)), dtype=torch.long) for item in raw_text_iter]
    # Drop empty lines before concatenating everything into one token stream.
    return torch.cat(tuple(filter(lambda t: t.numel() > 0, data)))


# Reload the splits and convert each one to a flat tensor of token ids.
train_iter, valid_iter, test_iter = WikiText2(
    root="data", 
    split=('train', 'valid', 'test'))
train_stream = data_process(train_iter)
valid_stream = data_process(valid_iter)
test_stream = data_process(test_iter)


In [None]:
train_stream

tensor([   9, 3849, 3869,  ..., 2442, 4810,    3])

In [None]:
max_seq_length = 10

def create_data_tuples(source: torch.Tensor):
    """Slice the token stream into overlapping [input, target] windows; the target is the input shifted by one token."""
    return [[source[i:i + max_seq_length], source[i + 1:i + max_seq_length + 1]]
            for i in range(len(source) - max_seq_length)]
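The windowing is easier to follow on a tiny example. The sketch below applies the same slicing to a plain Python list with a window of 3 (the stream values and the name `make_windows` are stand-ins for illustration). Each target is its source shifted by one position, so at every position the model is asked to predict the next token.

```python
def make_windows(stream, seq_length):
    """Return (source, target) pairs where target is source shifted by one token."""
    return [
        (stream[i:i + seq_length], stream[i + 1:i + seq_length + 1])
        for i in range(len(stream) - seq_length)
    ]

stream = [10, 11, 12, 13, 14, 15]  # stand-in token ids
pairs = make_windows(stream, seq_length=3)
for src, tgt in pairs:
    print(src, "->", tgt)
```

Because consecutive windows overlap, the number of examples is `len(stream) - seq_length`, which is close to the number of tokens in the stream.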


#### Create a Custom Dataset
A custom `Dataset` class must implement three methods: `__init__`, `__len__`, and `__getitem__`.

In [None]:
from torch.utils.data import Dataset
class LMDataset(Dataset):
    def __init__(self, source):
        self.source = source
        self.data = create_data_tuples(source)

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]


In [None]:
train_dataset = LMDataset(train_stream)  
valid_dataset = LMDataset(valid_stream)
test_dataset = LMDataset(test_stream)


In [None]:
train_dataset[1]

[tensor([ 3849,  3869,   881,     9, 20000,    83,  3849,    88,     0,  3869]),
 tensor([ 3869,   881,     9, 20000,    83,  3849,    88,     0,  3869,    21])]

#### Preparing your data for training with DataLoaders
The `Dataset` retrieves our dataset’s features and labels one sample at a time. While training a model, we typically want to pass samples in “minibatches”, reshuffle the data at every epoch to reduce model overfitting, and use Python’s `multiprocessing` to speed up data retrieval.

`DataLoader` is an iterable that abstracts this complexity for us in an easy API.
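To give a feel for what is being abstracted, here is a rough pure-Python sketch of shuffling and minibatching. It is not how `DataLoader` is implemented (the real class also collates samples into tensors and can use worker processes); the function name `iterate_minibatches` and its parameters are hypothetical.

```python
import random

def iterate_minibatches(samples, batch_size, shuffle=True, seed=None):
    """Yield lists of at most batch_size samples, optionally in shuffled order."""
    indices = list(range(len(samples)))
    if shuffle:
        random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

samples = list(range(10))
batches = list(iterate_minibatches(samples, batch_size=4, seed=0))
print(batches)
```

The last batch is smaller when the dataset size is not a multiple of the batch size, which is why the training loop later reads the batch size from each batch instead of assuming a constant.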

In [None]:
from torch.utils.data import DataLoader

batch_size = 64
eval_batch_size = 32

train_dataloader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
valid_dataloader = DataLoader(valid_dataset, batch_size=eval_batch_size, shuffle=True)
test_dataloader = DataLoader(test_dataset, batch_size=eval_batch_size, shuffle=True)

#### Iterate through the DataLoader
We have loaded the dataset into the `DataLoader` and can iterate through it as needed. Each iteration below returns a batch of source and target sequences (`batch_size=64` sequences each). Because we specified `shuffle=True`, the data is reshuffled after we iterate over all batches (for finer-grained control over the data loading order, take a look at [Samplers](https://pytorch.org/docs/stable/data.html#data-loading-order-and-sampler)).


In [None]:
src, tgt = next(iter(train_dataloader))
print(f"Source batch shape: {src.size()}")
print(f"Target batch shape: {tgt.size()}")

Source batch shape: torch.Size([64, 10])
Target batch shape: torch.Size([64, 10])


In [None]:
print(src[0])
print(tgt[0])

tensor([ 197,  138,   82,    3, 4293,   21,    0,    2,    0,   20])
tensor([ 138,   82,    3, 4293,   21,    0,    2,    0,   20,    8])


### Build the Neural Network

#### Get Device for Training
We want to be able to train our model on a hardware accelerator like the GPU, if it is available. Let’s check whether `torch.cuda` is available; otherwise we continue to use the CPU.

In [None]:
device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using {device} device")

Using cuda device


#### Define the Class
We define our neural network by subclassing `nn.Module`, and initialize the neural network layers in `__init__`. Every `nn.Module` subclass implements the operations on input data in the `forward` method.

In [None]:
import torch.nn as nn
class WordLevelRecurrentLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_size, hidden_size):
        super(WordLevelRecurrentLanguageModel, self).__init__()
        self.hidden_size = hidden_size
        self.embedding_size = embedding_size
        
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.rnn = nn.RNN(embedding_size, hidden_size, batch_first=True)
        
        self.fc = nn.Linear(hidden_size, vocab_size)
        
    def forward(self, x, hidden_states):
        embedded = self.embedding(x)
        out, hidden_states = self.rnn(embedded, hidden_states)
        
        # Reshaping the outputs to batch_size * seq_length x hidden_size
        out = out.contiguous().view(-1, self.hidden_size)
        out = self.fc(out)
        return out, hidden_states
    
    def init_hidden_states(self, batch_size):
        hidden_states = torch.zeros(1, batch_size, self.hidden_size)
        return hidden_states


We create an instance of `WordLevelRecurrentLanguageModel`, and move it to the device, and print its structure.

In [None]:
vocab_size = len(vocab)
embedding_size = 64
hidden_size = 64
model = WordLevelRecurrentLanguageModel(vocab_size, embedding_size, hidden_size).to(device)

Many layers inside a neural network are parameterized, i.e. have associated weights and biases that are optimized during training. Subclassing `nn.Module` automatically tracks all fields defined inside your model object, and makes all parameters accessible using your model’s `parameters()` or `named_parameters()` methods.

In this example, we iterate over each parameter, and print its size and a preview of its values.

In [None]:
print(f"Model structure: {model}\n\n")

for name, param in model.named_parameters():
    print(f"Layer: {name} | Size: {param.size()} | Values : {param[:2]} \n")

Model structure: WordLevelRecurrentLanguageModel(
  (embedding): Embedding(28782, 64)
  (rnn): RNN(64, 64, batch_first=True)
  (fc): Linear(in_features=64, out_features=28782, bias=True)
)


Layer: embedding.weight | Size: torch.Size([28782, 64]) | Values : tensor([[-1.3262, -1.4255,  0.8001, -1.0301, -0.6017,  0.8407, -1.2399,  1.2506,
          0.1481, -0.9826,  0.6768, -0.3042,  1.1040,  0.4234,  1.2463,  0.2012,
         -0.7883, -0.2560,  0.9730, -0.6398, -0.2981, -2.0335, -0.3573, -1.6084,
         -0.3986,  0.7788, -0.7012,  0.7937, -0.9393, -0.4333,  0.4475,  1.4960,
         -0.3972,  1.5357,  1.8472, -0.3796, -2.4107, -0.7223, -1.0914, -1.7269,
          1.0639, -0.7753,  0.4376, -2.1993, -0.0662, -0.2571,  1.1669,  0.4841,
          0.7840,  0.3355,  0.3492,  0.9092, -0.1509, -0.7557,  0.2134,  0.5374,
          0.9334, -0.7528, -0.5809, -0.4361,  0.4235, -0.7776,  0.5672, -1.4143],
        [ 0.2377, -0.3400, -1.3284,  0.4286,  0.4900,  0.5092, -0.8615, -1.5680,
         -0.
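The sizes printed above can be turned into a parameter count with simple arithmetic. The sketch below assumes a single-layer `nn.RNN` with biases enabled (the PyTorch default), which holds an input-to-hidden matrix, a hidden-to-hidden matrix, and two bias vectors.

```python
vocab_size, embedding_size, hidden_size = 28782, 64, 64

# Embedding: one vector of embedding_size per vocabulary entry.
embedding_params = vocab_size * embedding_size

# RNN (one layer): input-to-hidden and hidden-to-hidden weights, plus two bias vectors.
rnn_params = hidden_size * embedding_size + hidden_size * hidden_size + 2 * hidden_size

# Linear output layer: weight matrix plus bias.
fc_params = hidden_size * vocab_size + vocab_size

total = embedding_params + rnn_params + fc_params
print(f"{total:,}")  # 3,721,198
```

Almost all parameters sit in the embedding and output layers, because both scale with the vocabulary size; the recurrent layer itself is tiny by comparison.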

To use the model, we pass it the input data. This executes the model’s `forward`, along with some background operations. Do not call `model.forward()` directly!

Calling the model on the input returns a tensor of shape `(batch_size * seq_length, vocab_size)` containing the raw predicted scores (logits) for every token in the vocabulary at each position. We get the prediction probabilities by passing it through an instance of the `nn.Softmax` module.

In [None]:
print(train_dataset[0][0])
print(train_dataset[0][0].unsqueeze(0).to(device))
print(train_dataset[0][0].unsqueeze(1).to(device))

tensor([    9,  3849,  3869,   881,     9, 20000,    83,  3849,    88,     0])
tensor([[    9,  3849,  3869,   881,     9, 20000,    83,  3849,    88,     0]],
       device='cuda:0')
tensor([[    9],
        [ 3849],
        [ 3869],
        [  881],
        [    9],
        [20000],
        [   83],
        [ 3849],
        [   88],
        [    0]], device='cuda:0')


In [None]:
x = train_dataset[0][0].unsqueeze(0).to(device)
hidden_states = model.init_hidden_states(1).to(device)
logits, states = model(x, hidden_states)
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)
print(f"Predicted tokens: {y_pred}")

Predicted tokens: tensor([ 4838,  6234,  5189, 21226,  4838, 13182, 25473,  1427, 11939,  4479],
       device='cuda:0')
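The shapes involved are worth checking once on a toy model. The sketch below rebuilds the same three layers with small, hypothetical sizes on the CPU and shows that the flattened logits have one row per input position, that softmax over `dim=1` normalizes each row, and that the row for the last position is the one that predicts the token following the input.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size, embedding_size, hidden_size = 50, 8, 16   # hypothetical small sizes
batch_size, seq_length = 2, 5

embedding = nn.Embedding(vocab_size, embedding_size)
rnn = nn.RNN(embedding_size, hidden_size, batch_first=True)
fc = nn.Linear(hidden_size, vocab_size)

x = torch.randint(0, vocab_size, (batch_size, seq_length))
h0 = torch.zeros(1, batch_size, hidden_size)

out, h_n = rnn(embedding(x), h0)             # out: (batch, seq, hidden)
logits = fc(out.reshape(-1, hidden_size))    # (batch * seq, vocab)
probs = torch.softmax(logits, dim=1)         # each row sums to 1

# Restore the (batch, seq, vocab) layout to pick out individual positions.
per_position = probs.view(batch_size, seq_length, vocab_size)
next_token = per_position[:, -1, :].argmax(dim=1)   # prediction after the last input token

print(logits.shape, per_position.shape, next_token.shape)
```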


### Optimizing Model Parameters
Now that we have a model and data it’s time to train, validate and test our model by optimizing its parameters on our data. Training a model is an iterative process; in each iteration (called an `epoch`) the model makes a guess about the output, calculates the error in its guess (`loss`), collects the derivatives of the error with respect to its parameters, and optimizes these parameters using gradient descent. 

Hyperparameters are adjustable parameters that let you control the model optimization process. Different hyperparameter values can impact model training and convergence rates.

We define the following hyperparameters for training:

- Number of Epochs - the number of times to iterate over the dataset
- Batch Size - the number of data samples propagated through the network before the parameters are updated
- Learning Rate - how much to update the model's parameters at each batch/epoch. Smaller values yield slow learning speed, while large values may result in unpredictable behavior during training.



In [None]:
learning_rate = 1e-3
batch_size = 64
epochs = 5

#### Optimization Loop

Once we set our hyperparameters, we can then train and optimize our model with an optimization loop. Each iteration of the optimization loop is called an epoch.

Each epoch consists of two main parts:

- The Train Loop - iterate over the training dataset and try to converge to optimal parameters.
- The Validation/Test Loop - iterate over the validation (or test) dataset to check whether model performance is improving.

Let’s briefly familiarize ourselves with some of the concepts used in the training loop.


#### Loss Function
When presented with some training data, our untrained network is unlikely to give the correct answer. The loss function measures how dissimilar the obtained result is from the target value, and it is this function that we want to minimize during training. To calculate the loss, we make a prediction using the inputs of a given data sample and compare it against the true label.

Common loss functions include `nn.MSELoss` (Mean Square Error) for regression tasks, and `nn.NLLLoss` (Negative Log Likelihood) for classification. `nn.CrossEntropyLoss` combines `nn.LogSoftmax` and `nn.NLLLoss`.

We pass our model’s output logits to `nn.CrossEntropyLoss`, which will normalize the logits and compute the prediction error.

In [None]:
# Initialize the loss function
loss_fn = nn.CrossEntropyLoss()
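For a single prediction, cross-entropy is the negative log-softmax of the correct class. The pure-Python sketch below computes it for one row of logits (the function name `cross_entropy` and the example logits are illustrative). It also shows a useful sanity check: when all logits are equal, the loss is `log(vocab_size)`, which for a vocabulary of 28782 tokens is about 10.27, so an untrained model should start near that value.

```python
import math

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed in a numerically stable way."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

# A confident, correct prediction gives a small loss; a wrong one gives a large loss.
loss_correct = cross_entropy([2.0, 0.5, -1.0], target=0)
loss_wrong = cross_entropy([2.0, 0.5, -1.0], target=2)

# With all logits equal, the loss is log(vocab_size) regardless of the target.
vocab_size = 28782
uniform_loss = cross_entropy([0.0] * vocab_size, target=123)

print(loss_correct, loss_wrong, uniform_loss)
```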

#### Optimizer
Optimization is the process of adjusting model parameters to reduce model error in each training step. **Optimization algorithms** define how this process is performed. All optimization logic is encapsulated in the `optimizer` object. Here, we use the Stochastic Gradient Descent (SGD) optimizer; PyTorch also provides many [different optimizers](https://pytorch.org/docs/stable/optim.html), such as Adam and RMSProp, which work better for different kinds of models and data.

We initialize the optimizer by registering the model’s parameters that need to be trained, and passing in the learning rate hyperparameter.

In [None]:
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)


Inside the training loop, optimization happens in three steps:

- Call `optimizer.zero_grad()` to reset the gradients of model parameters. Gradients by default add up; to prevent double-counting, we explicitly zero them at each iteration.
- Backpropagate the prediction loss with a call to `loss.backward()`. PyTorch deposits the gradients of the loss w.r.t. each parameter.
- Once we have our gradients, we call `optimizer.step()` to adjust the parameters by the gradients collected in the backward pass.
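The three steps above can be seen on a single scalar parameter. This is a minimal sketch with a toy objective `2 * w` (chosen only because its gradient is the constant 2); it shows gradients accumulating across two `backward()` calls, being reset by `zero_grad()`, and then being applied by `step()`.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

(2 * w).backward()
first = w.grad.item()          # d(2w)/dw = 2

(2 * w).backward()             # no zero_grad in between: the new gradient is added
accumulated = w.grad.item()    # 2 + 2 = 4

optimizer.zero_grad()          # reset before computing the gradient we want to apply
(2 * w).backward()
optimizer.step()               # w <- w - lr * grad = 1.0 - 0.1 * 2

print(first, accumulated, w.item())
```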



#### Full Implementation
We define `train_loop` that loops over our optimization code, and `test_loop` that evaluates the model’s performance against our test data.

In [None]:
history_loss = []
history_dev_loss = []
model.to(device)
def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss        
        X = X.to(device)
        y = y.to(device)
        batch_size = X.size()[0]
        hidden_states = model.init_hidden_states(batch_size).to(device)
        logits, states = model(X, hidden_states)
        loss = loss_fn(logits, y.view(-1))

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if batch % 1000 == 0:
            loss, current = loss.item(), batch * len(X)
            history_loss.append(loss)
            print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss = 0

    with torch.no_grad():
        for X, y in dataloader:
            X = X.to(device)
            y = y.to(device)
            batch_size = X.size()[0]
            hidden_states = model.init_hidden_states(batch_size).to(device)
            logits, states = model(X, hidden_states)
            test_loss += loss_fn(logits, y.view(-1)).item()

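    # loss_fn returns the mean loss of each batch, so average over batches rather than samples.
    test_loss /= num_batches
    history_dev_loss.append(test_loss)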
    print(f"Avg loss: {test_loss:>8f} \n")
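`nn.CrossEntropyLoss` uses mean reduction by default, so each batch loss is already a per-token average. An overall average is therefore the sum of batch losses divided by the number of batches (exact when all batches have the same size, a close approximation otherwise). Exponentiating that average gives the perplexity, the usual metric for language models. The batch losses below are hypothetical values for illustration.

```python
import math

# Hypothetical per-batch mean losses, as returned by a loss with mean reduction.
batch_losses = [7.9, 7.8, 7.7, 7.6]

# Each entry is already an average, so divide by the number of batches.
avg_loss = sum(batch_losses) / len(batch_losses)

# Perplexity is the exponential of the average negative log likelihood per token.
perplexity = math.exp(avg_loss)

print(avg_loss, perplexity)
```

A perplexity equal to the vocabulary size corresponds to guessing uniformly at random, so lower values indicate that the model has learned something about word order.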

We initialize the loss function and optimizer, and pass them to `train_loop` and `test_loop`. Feel free to increase the number of epochs to track the model’s improving performance.

In [None]:
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

epochs = 2
model.to(device)
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(train_dataloader, model, loss_fn, optimizer)
    test_loop(valid_dataloader, model, loss_fn)
print("Done!")

Epoch 1
-------------------------------
loss: 10.324711  [    0/2049980]
loss: 10.281262  [64000/2049980]
loss: 10.242194  [128000/2049980]
loss: 10.199057  [192000/2049980]
loss: 10.150320  [256000/2049980]
loss: 10.040832  [320000/2049980]
loss: 9.870606  [384000/2049980]
loss: 9.652540  [448000/2049980]
loss: 9.349940  [512000/2049980]
loss: 8.881638  [576000/2049980]
loss: 8.869509  [640000/2049980]
loss: 8.513973  [704000/2049980]
loss: 8.474136  [768000/2049980]
loss: 8.309977  [832000/2049980]
loss: 8.397352  [896000/2049980]
loss: 8.072169  [960000/2049980]
loss: 7.993558  [1024000/2049980]
loss: 8.115429  [1088000/2049980]
loss: 8.236747  [1152000/2049980]
loss: 7.872289  [1216000/2049980]
loss: 7.984025  [1280000/2049980]
loss: 8.031525  [1344000/2049980]
loss: 7.908545  [1408000/2049980]
loss: 7.839429  [1472000/2049980]
loss: 7.729629  [1536000/2049980]
loss: 7.865251  [1600000/2049980]
loss: 7.802162  [1664000/2049980]
loss: 7.822999  [1728000/2049980]
loss: 7.771064  [179

### Save and Load the Model
In this section we will look at how to persist model state with saving, loading and running model predictions.

In [None]:
import torch

#### Saving and Loading Model Weights

PyTorch models store the learned parameters in an internal state dictionary, called `state_dict`. These can be persisted via the `torch.save` method:

In [None]:
torch.save(model.state_dict(), 'model.pt')


To load model weights, you need to create an instance of the same model first, and then load the parameters using the `load_state_dict()` method.

In [None]:
model.load_state_dict(torch.load('model.pt'))
model.eval()

WordLevelRecurrentLanguageModel(
  (embedding): Embedding(28782, 64)
  (rnn): RNN(64, 64, batch_first=True)
  (fc): Linear(in_features=64, out_features=28782, bias=True)
)
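A quick way to convince yourself that the round trip preserves the weights is to save to an in-memory buffer and compare. The sketch below uses a small `nn.Linear` as a stand-in for a trained model and `io.BytesIO` instead of a file on disk; both are choices made for this illustration only.

```python
import io
import torch
import torch.nn as nn

torch.manual_seed(0)
trained = nn.Linear(4, 2)          # stands in for a trained model

buffer = io.BytesIO()              # in-memory file, so nothing is written to disk
torch.save(trained.state_dict(), buffer)

buffer.seek(0)
restored = nn.Linear(4, 2)         # must have the same architecture as the saved model
restored.load_state_dict(torch.load(buffer))
restored.eval()                    # put layers such as dropout into inference mode

same = all(
    torch.equal(p, q)
    for p, q in zip(trained.state_dict().values(), restored.state_dict().values())
)
print(list(restored.state_dict().keys()), same)
```

When weights saved on a GPU need to be loaded on a CPU-only machine, `torch.load` accepts a `map_location` argument (for example `map_location="cpu"`).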

In [None]:
from google.colab import drive
drive.mount('/content/drive')
file_path = "/content/drive/MyDrive/Monash/Subjects/FIT5217/Tutorials/Week7"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
torch.save(model.state_dict(), file_path + '/model.pt')

In [None]:
model.load_state_dict(torch.load(file_path + '/model.pt'))
model.eval()


### Exercises
1. Use your trained RNN language model to predict the next word of the following sentence: _The quick brown fox jumps over_
2. So far, we have limited the maximum sequence length to 10 (`max_seq_length`). Investigate the effect of different maximum sequence lengths on the RNN language model.

In [None]:
model.eval()

WordLevelRecurrentLanguageModel(
  (embedding): Embedding(28782, 64)
  (rnn): RNN(64, 64, batch_first=True)
  (fc): Linear(in_features=64, out_features=28782, bias=True)
)

In [None]:
token_ids = vocab(['the', 'quick', 'brown', 'fox', 'jumps', 'over'])  # avoid shadowing the built-in `input`
t_input = torch.tensor(token_ids, dtype=torch.long).to(device)
t_input = t_input.unsqueeze(0)
print(t_input)
print(vocab.lookup_tokens(token_ids))

tensor([[   1, 3275,  563, 1230, 6845,   65]], device='cuda:0')
['the', 'quick', 'brown', 'fox', 'jumps', 'over']


In [None]:
hidden_states = model.init_hidden_states(1).to(device)
hidden_states

tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]],
       device='cuda:0')

In [None]:
logits, states = model(t_input, hidden_states)

In [None]:
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)
print(f"Predicted tokens: {y_pred}")


Predicted tokens: tensor([0, 2, 1, 1, 2, 1], device='cuda:0')


In [None]:
print(vocab.lookup_tokens(y_pred.tolist()))

['<unk>', ',', 'the', 'the', ',', 'the']


In [None]:
x = train_dataset[0][1].unsqueeze(0).to(device)
hidden_states = model.init_hidden_states(1).to(device)
logits, states = model(x, hidden_states)
pred_probab = nn.Softmax(dim=1)(logits)
y_pred = pred_probab.argmax(1)
print(f"Predicted tokens: {y_pred}")

Predicted tokens: tensor([1, 2, 2, 9, 2, 3, 1, 2, 2, 2], device='cuda:0')


In [None]:
print(vocab.lookup_tokens(y_pred.tolist()))

['the', ',', ',', '=', ',', '.', 'the', ',', ',', ',']
