# Coding & Optimizing Neural Networks with PyTorch
## Binary Classification

PyTorch is an optimized tensor library for building and training deep neural network models. This notebook is adapted from my [prior work](https://github.com/ahowe42/QuantExplNotebooks/blob/master/src/my_neuralnetwork_pytorch.ipynb).

Note that this is only my view of what a PyTorch tutorial should be, but there are many others. [The PyTorch Tutorials page](https://pytorch.org/tutorials/) is a good place to look.

Of course, perusing [the docs](https://pytorch.org/docs/stable/index.html) is always a good idea.

For deeper knowledge of how simple neural networks work, [this notebook](https://github.com/ahowe42/QuantExplNotebooks/blob/master/src/my_neuralnetwork.ipynb) may be a useful source. I created this notebook as part of my own learning process. In it, I coded the basic functionality of activated linear layers, forward propagation, loss computation, gradient calculation, and regularized backward propagation.

- <a href=#DP>Data Preparation</a>
- <a href=#DD>Dataset and DataLoader</a>
- <a href=#NNA>Neural Network Architecture</a>
- <a href=#NNO>Neural Network Optimization</a>
- <a href=#LVP>Logging \& Visualizing Progress</a>
- <a href=#T>Training</a>
- <a href=#S>PyTorch Sequential</a>
- <a href=#PND>Parameterized Network Design</a>
- <a href=#C>Checkpointing</a>
- <a href=#E>Extensions</a>
- <a href=#Bot>Go To Bottom</a>

<a id=top></a>

In [None]:
import numpy as np
import pandas as pd
import ipdb
from itertools import chain
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torch.utils.tensorboard import SummaryWriter

from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score

import chart_studio.plotly as ply
import chart_studio.tools as plytool
import plotly.figure_factory as ff
import plotly.graph_objs as go
import plotly.offline as plyoff

plyoff.init_notebook_mode(connected=True)
x1 = [1,4,7]; y1 = [7,5,7]
x2 = [1,2,3,4,5,6,7]; y2 = [3,2,1,1,1,2,3]
plyoff.iplot(go.Figure(data=[go.Scatter({'x':x1, 'y':y1, 'mode':'markers'}), go.Scatter({'x':x2, 'y':y2, 'mode':'lines'})],
                       layout=go.Layout(autosize=False, width=400, showlegend=False,
                                        xaxis={'showgrid':False, 'showticklabels':False},
                                        title="Initialization Makes Me Smile<br>(and it's fun to show off a little...)")))

## Data Preparation
First, we generate some training / testing data with two features and a single binary response.

<a id=DP></a>
<a href=#top>Go To Top</a>

In [None]:
''' create simulated data '''
# setup
np.random.seed(42) # call the seeding function; assigning to it would overwrite it without seeding
n = 100
p = 2
trainPerc = 0.7

# define features & response
x = np.random.rand(n, p)
print(x.shape)
y = x[:,0]*5 + x[:,1]*3 + np.random.rand(n)
y = np.atleast_2d(y > np.median(y)).T
print(y.shape)

# partition randomly into the training & testing sets
ss = ShuffleSplit(n_splits=1, train_size=trainPerc)
(trn, tst) = next(ss.split(x))
trnX = x[trn,:]; trnY = y[trn,:]
tstX = x[tst,:]; tstY = y[tst,:]

# plot the data
trnYFlat = np.squeeze(trnY); tstYFlat = np.squeeze(tstY) # squeeze y for indexing into x
trc = [go.Scatter({'x':trnX[trnYFlat,0], 'y':trnX[trnYFlat, 1], 'name':'Train 1 (> median)', 'mode':'markers',
                   'marker':{'color':'red', 'symbol':'circle-open'}}, legendgroup='train'), 
        go.Scatter({'x':trnX[~trnYFlat,0], 'y':trnX[~trnYFlat, 1], 'name':'Train 0 (<= median)', 'mode':'markers',
                   'marker':{'color':'blue', 'symbol':'square-open'}}, legendgroup='train'),
        go.Scatter({'x':tstX[tstYFlat,0], 'y':tstX[tstYFlat, 1], 'name':'Test 1 (> median)', 'mode':'markers',
                   'marker':{'color':'red', 'symbol':'circle-dot'}}, legendgroup='test'), 
        go.Scatter({'x':tstX[~tstYFlat,0], 'y':tstX[~tstYFlat, 1], 'name':'Test 0 (<= median)', 'mode':'markers',
                   'marker':{'color':'blue', 'symbol':'square-dot'}}, legendgroup='test')]
plyoff.iplot(go.Figure(data=trc, layout=go.Layout(title = 'Data')))

## Dataset and DataLoader
Data is accessed in a PyTorch neural network via a `DataLoader` object, which is an iterable over a `Dataset` object. Data needs to be defined as `Tensor` objects.

### Tensor
The `Tensor` object, representing a multi-dimensional matrix containing elements of a single data type, is the primary datatype in PyTorch. Common types include:
- `IntTensor` - integers
- `FloatTensor` - floats
- `BoolTensor` - booleans

All `Tensor` types can be seen [here](https://pytorch.org/docs/stable/tensors.html). Note that the response variable defined below holds integer (0 / 1) values, but is stored as a generic `Tensor`, which is a 32-bit float tensor (the same as `FloatTensor`). I defined it this way because the loss function throws an exception if the response is an `IntTensor` or `BoolTensor`; `BCELoss` expects the target to be a float tensor, like the predictions.
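
As a small illustration of the dtype point above, here is a sketch with made-up values (not this notebook's data). It assumes only that `torch.Tensor(...)` builds a 32-bit float tensor, which is what lets the response be passed to `nn.BCELoss` as a target.

```python
import numpy as np
import torch
import torch.nn as nn

# hypothetical boolean response, shaped (n, 1) like y in this notebook
y_bool = np.array([[True], [False], [True]])

# torch.Tensor(...) yields float32, the same dtype as torch.FloatTensor(...)
y_float = torch.Tensor(y_bool.astype(int))
print(y_float.dtype)  # torch.float32

# an explicit cast gives the same result
y_cast = torch.from_numpy(y_bool).float()

# made-up predicted probabilities, just to exercise the loss with a float target
y_hat = torch.tensor([[0.9], [0.2], [0.6]])
loss_value = nn.BCELoss()(y_hat, y_float)
print(loss_value.item())
```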

### Dataset
For use cases where the data has features and responses, it's simplest if this has separate `x` and `y` (or perhaps `features` and `response`) `Tensor` objects. A [map-style Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.Dataset) is shown in this tutorial. These must overload the `__getitem__` and `__len__` methods. The former must return data for a given index key, and the latter is self-explanatory. PyTorch also has an [iterable-style Dataset](https://pytorch.org/docs/stable/data.html#torch.utils.data.IterableDataset).

### DataLoader
The `DataLoader` object does just what it says - loads data. There are [many options](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) for setting up a data loader, but the most commonly useful are probably:
- `dataset` - This is required, obviously.
- `batch_size` - This defaults to 1 if not specified. If minibatching is desired (it helps prevent the optimizer from getting stuck in a local optimum, and uses less memory), set this to the batch size; otherwise, set it to the size of the dataset.
- `shuffle` - This randomly permutes the order of the data with each epoch; the default value is False.

In this tutorial, note that the training data is loaded in shuffled batches of size 5, but the testing data is all loaded as a single batch, and not shuffled.
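
To make the `Dataset` / `DataLoader` relationship concrete, here is a minimal sketch using a toy dataset. The `ToyData` class and its contents are hypothetical, for illustration only. It shows what `__len__` and `__getitem__` return, and how `batch_size` splits 12 observations.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyData(Dataset):  # hypothetical map-style dataset, for illustration only
    def __init__(self, n):
        self.x = torch.arange(n * 2, dtype=torch.float32).reshape(n, 2)
        self.y = (torch.arange(n) % 2).float().reshape(n, 1)
    def __getitem__(self, index):
        # return the features & response for one index key
        return self.x[index], self.y[index]
    def __len__(self):
        return self.x.shape[0]

toy = ToyData(12)
print(len(toy))  # 12
x0, y0 = toy[0]  # one observation: 2 features, 1 response

# shuffled minibatches of 5: the last batch holds the 2 leftover observations
batch_sizes = [xb.shape[0] for xb, yb in DataLoader(toy, batch_size=5, shuffle=True)]
print(batch_sizes)  # [5, 5, 2]

# a single batch holding everything, as done for the testing data
full_batches = [xb.shape[0] for xb, yb in DataLoader(toy, batch_size=len(toy))]
print(full_batches)  # [12]
```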

<a id=DD></a>
<a href=#top>Go To Top</a>

In [None]:
''' setup for data access '''
# now make the datasets & dataloaders
batchSize = 5

# Create the data class
class Data(Dataset):
    def __init__(self, x, y):
        self.x = torch.FloatTensor(x)
        self.y = torch.Tensor(y.astype(int))
        self.len, self.p = self.x.shape
    def __getitem__(self, index):      
        return self.x[index], self.y[index]
    def __len__(self):
        return self.len

# training data is accessed in batches, testing data is not
trainData = Data(trnX, trnY)
trainLoad = DataLoader(dataset=trainData, batch_size=batchSize, shuffle=True)
testData = Data(tstX, tstY)
testLoad = DataLoader(dataset=testData, batch_size=len(testData))

## Neural Network Architecture
In PyTorch, the architecture of a simple NN is defined by a class that extends the `nn.Module` object. At a minimum, the `__init__` and `forward` methods must be overloaded. There is far more to say about this than I can cover here, so [RTFM](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). A PyTorch NN is essentially a bunch of stuff around a list of PyTorch modules. These modules should be defined in the `__init__` method as an [nn.ModuleList](https://pytorch.org/docs/stable/generated/torch.nn.ModuleList.html) object. This is an iterable container of PyTorch modules, which could be layers.

There are several types of layers which can go into an NN, depending on the type of model desired, listed [here](https://pytorch.org/docs/stable/nn.html). This tutorial demonstrates a simple network with just activated `Linear` layers - $y_i = f\left(a\times x_i + b\right)$. The `Linear` layer must be defined with an input and output size. Note that each `Linear` layer has the `bias` and `weight` attributes. The biases are typically initialized to 0. There are several choices for initialization of weights in the `nn` module. In this tutorial, we have the options:
- uniform - Uniform(0, 1)
- xavier uniform - Glorot Initialization
- kaiming uniform - He Initialization
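
As a rough sketch of how these three options differ, the snippet below initializes the same layer each way and checks the range of the resulting weights. The layer size is arbitrary. The bounds in the comments come from the standard formulas for these initializers with their default gains, which I'm assuming here rather than taking from this notebook.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)
fan_in, fan_out = 10, 20
lin = nn.Linear(fan_in, fan_out)  # weight has shape (fan_out, fan_in)

# uniform: samples from Uniform(0, 1) by default
nn.init.uniform_(lin.weight)
uni_min, uni_max = lin.weight.min().item(), lin.weight.max().item()

# xavier (Glorot) uniform: Uniform(-b, b) with b = sqrt(6 / (fan_in + fan_out))
nn.init.xavier_uniform_(lin.weight)
xav_bound = math.sqrt(6.0 / (fan_in + fan_out))
xav_max = lin.weight.abs().max().item()

# kaiming (He) uniform for ReLU: Uniform(-b, b) with b = sqrt(6 / fan_in)
nn.init.kaiming_uniform_(lin.weight, nonlinearity='relu')
kai_bound = math.sqrt(6.0 / fan_in)
kai_max = lin.weight.abs().max().item()

print(uni_min, uni_max)
print(xav_max, xav_bound)
print(kai_max, kai_bound)
```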

After all the 4 hardcoded layers have been defined and initialized, they're added to the list of modules. The `__init__` method also defines the `Dropout` layer and the activation function for each layer (sans the input layer). In PyTorch, choices for the activation functions are defined in [nn.functional](https://pytorch.org/docs/stable/nn.functional.html). The [ReLU](https://pytorch.org/docs/stable/generated/torch.nn.functional.relu.html) (rectified linear unit) or [leaky ReLU](https://pytorch.org/docs/stable/generated/torch.nn.functional.leaky_relu.html) are the most common activation functions for hidden layers, defined as 
\begin{equation}
f(x_i) = \begin{cases}
0 & x_i < 0\\
x_i & x_i \geq 0
\end{cases}
\end{equation}
and
\begin{equation}
f(x_i, s) = \begin{cases}
s\times x_i & x_i < 0\\
x_i & x_i \geq 0
\end{cases}
\end{equation}
The leaky ReLU multiplies negative input values by a small slope $s$ (the `negative_slope` argument, 0.01 by default), rather than setting them to 0.
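
A quick sketch of both activations on a few hand-picked values may help. The slope of 0.1 is chosen only to make the effect visible; the network code in this notebook calls `leaky_relu` with its default slope.

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 3.0])

# ReLU zeroes out negative inputs
relu_out = F.relu(x)

# leaky ReLU scales negative inputs by the slope instead: -2 * 0.1 = -0.2
leaky_out = F.leaky_relu(x, negative_slope=0.1)

print(relu_out)
print(leaky_out)
```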

For classification problems, the [sigmoid](https://pytorch.org/docs/stable/generated/torch.nn.Sigmoid.html) activation function $f(x_i) = \left(1+e^{-x_i}\right)^{-1}$, which scales the input values to the [0, 1] range, is most commonly used. In the case of binary classification, the value for $f(x_i)$ is then interpreted as the probability that $y_i$ is the target (True) class.

### Forward Propagation
Forward propagation needs to be implemented in a class method named `forward`, which takes the input data values as an argument. This is the process of pushing input data through the network layers to compute a prediction. In this tutorial, forward propagation essentially computes this:
\begin{equation}
\hat{y}_i = Sigmoid\left(A_4\times ReLU\left(A_3\times ReLU\left(A_2\times ReLU\left(A_1\times x_i+b_1\right) + b_2\right) + b_3\right) + b_4\right)
\end{equation}
Forward propagation is where any kind of special layers would be applied - such as dropout or batch normalization. Dropout is implemented as a [Dropout layer](https://pytorch.org/docs/stable/generated/torch.nn.Dropout) parameterized by a probability $p$. During training, the `Dropout` layer randomly sets elements of its input to 0 with probability $p$ each time it's called, and rescales the remaining elements by $1/(1-p)$. This helps with regularization and helps prevent the network getting stuck in a local optimum. I believe the network could be designed with an individual `Dropout` layer before each `Linear` layer, but a `Dropout` layer has no parameters to update in backward propagation, so this would add unnecessary complexity. Thus, the network below has a single `Dropout` layer defined in the `__init__` method that is repeatedly called in the `forward` method.
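
Here is a small sketch of what a `Dropout` layer does to its input in training versus evaluation mode. The input of all ones is made up so the zeroing and rescaling are easy to see.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(0.2)
x = torch.ones(1000)

# training mode: roughly 20% of the elements are zeroed ...
drop.train()
train_out = drop(x)
frac_zero = (train_out == 0).float().mean().item()
# ... and the survivors are rescaled by 1 / (1 - 0.2) = 1.25
kept_values = train_out[train_out != 0].unique()

# evaluation mode: dropout does nothing
drop.eval()
eval_out = drop(x)

print(frac_zero, kept_values)
```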

When training the network, the `forward` method is not explicitly called. Forward propagation is executed by simply calling the network object, passing the input features as arguments, as in `myNN(x)`.

### Backward Propagation
Backward propagation is the process by which errors as computed by the loss function are propagated backwards through the network, using the gradients (first derivative) of the layers' parameters. Backward propagation is performed by the loss and optimization functions together.

<a id=NNA></a>
<a href=#top>Go To Top</a>

In [None]:
''' define the model class for a neural net with hidden layers & dropout '''
class myNN(nn.Module):
    def __init__(self):
        super(myNN, self).__init__()
        # define the activations
        self.activations = ['relu', 'relu', 'relu', 'sigmoid']
        # define the linear layers
        self.linears = nn.ModuleList()
        lin1 = nn.Linear(p, 10)
        lin2 = nn.Linear(10, 20)
        lin3 = nn.Linear(20, 10)
        lin4 = nn.Linear(10, 1)
        # initialize the biases
        nn.init.zeros_(lin1.bias)
        nn.init.zeros_(lin2.bias)
        nn.init.zeros_(lin3.bias)
        nn.init.zeros_(lin4.bias)
        # initialize the weights
        self.weightInit = 'kai'
        if self.weightInit == 'uni':
            nn.init.uniform_(lin1.weight)
            nn.init.uniform_(lin2.weight)
            nn.init.uniform_(lin3.weight)
            nn.init.uniform_(lin4.weight)
        elif self.weightInit == 'xav':
            nn.init.xavier_uniform_(lin1.weight)
            nn.init.xavier_uniform_(lin2.weight)
            nn.init.xavier_uniform_(lin3.weight)
            nn.init.xavier_uniform_(lin4.weight)
        elif self.weightInit == 'kai':
            nn.init.kaiming_uniform_(lin1.weight, nonlinearity='relu')
            nn.init.kaiming_uniform_(lin2.weight, nonlinearity='relu')
            nn.init.kaiming_uniform_(lin3.weight, nonlinearity='relu')
            nn.init.kaiming_uniform_(lin4.weight, nonlinearity='relu')
        # add the layers to the model
        self.linears.append(lin1)
        self.linears.append(lin2)
        self.linears.append(lin3)
        self.linears.append(lin4)
        # define last stuff
        self.len = 4
        self.drop = nn.Dropout(0.2)

    def forward(self, x):
        for i, (L, A) in enumerate(zip(self.linears, self.activations)):
            # dropout if not the output layer
            if i < self.len - 1:
                x = self.drop(L(x))
            else:
                x = L(x)
            # compute the activation
            if A == 'relu':
                x = nn.functional.relu(x)
            elif A == 'sigmoid':
                x = torch.sigmoid(x)
            elif A == 'leakyrelu':
                x = nn.functional.leaky_relu(x)
            elif A == 'tanh':
                x = torch.tanh(x)
        return x
    
    def __str__(self):
        mn = super(myNN, self).__str__()
        return '%s\nActivations: %s'%(mn, self.activations)

## Neural Network Optimization
In addition to the network architecture, a NN model definition includes an optimizer, its parameters, and a loss function.

### Optimization Method
The [optim module](https://pytorch.org/docs/stable/optim.html) includes several common optimization algorithms, including:
- Adaptive methods (Adam, Adagrad, Adamax)
- RMSprop
- [Stochastic Gradient Descent (SGD)](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) - this is the optimizer used in this tutorial

Each optimizer method needs to have the network's parameters to be optimized, the learning rate for backward propagation, and any method-specific parameters. There is no need to manually specify the parameters to be optimized, as the `nn.Module` object has an inheritable `parameters` method which returns a python generator to iterate over the network parameters. The learning rate typically reduces the amount by which prediction errors are propagated back through the network, which helps the NN to converge. A higher learning rate is often good earlier in training, while a lower learning rate can be better towards the end. PyTorch includes in the optim module several schedulers for decaying the learning rate as `lr_scheduler` objects. In this tutorial, a [multiplicative step scheduler](https://pytorch.org/docs/stable/generated/torch.optim.lr_scheduler.StepLR.html) is used, which reduces the learning rate from the initial value (0.1) by a specified proportion (50%), after each multiple of a specified number of epochs (50).
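
The step schedule described above can be traced with a small sketch. The single dummy parameter is only there so the optimizer has something to hold; no actual training happens here.

```python
import torch

# dummy parameter, an assumption for illustration only
param = torch.nn.Parameter(torch.zeros(1))
opt = torch.optim.SGD([param], lr=0.1)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.5)

lrs = []
for epoch in range(150):
    opt.step()    # a real loop would compute a loss and call backward() first
    sched.step()  # called once per epoch, after the optimizer
    lrs.append(sched.get_last_lr()[0])

# the rate is halved after every 50 epochs: 0.1 -> 0.05 -> 0.025 -> 0.0125
print(lrs[0], lrs[49], lrs[99], lrs[149])
```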

### Loss Function
PyTorch includes several loss functions, listed in the [nn module](https://pytorch.org/docs/stable/nn.html). As this tutorial applies a simple neural network for a binary classification problem, the [binary cross entropy loss](https://pytorch.org/docs/stable/generated/torch.nn.BCELoss.html) is used. For each observation, the loss function as implemented here computes
\begin{equation}
l\left(y_i, \hat{y}_i\right) = -\left[y_i \times \log\hat{y}_i + \left(1-y_i\right)\times \log\left(1-\hat{y}_i\right)\right]
\end{equation}
Optional arguments to the `BCELoss` function include observation weights, and either a sum or mean reduction.
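
As a sanity check, the per-observation binary cross entropy (the negative log-likelihood of a Bernoulli response) can be written out by hand and compared with `BCELoss`. The responses and predicted probabilities below are made up.

```python
import torch
import torch.nn as nn

y = torch.tensor([[1.0], [0.0], [1.0], [0.0]])
y_hat = torch.tensor([[0.9], [0.2], [0.6], [0.4]])

# per-observation loss, written out by hand
per_obs = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat))
manual_mean = per_obs.mean()
manual_sum = per_obs.sum()

# the default reduction is the mean; the sum is also available
builtin_mean = nn.BCELoss()(y_hat, y)
builtin_sum = nn.BCELoss(reduction='sum')(y_hat, y)

print(manual_mean.item(), builtin_mean.item())
```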

### Backward Propagation
Backward propagation is the process by which errors as computed by the loss function are propagated backwards through the network, using the gradients (first derivative) of the layers' parameters. The loss value returned by any loss function is a tensor with a `backward` method; despite its name, it does not actually update anything in the network, it simply computes the gradients and saves them in the parameters' `grad` attributes. Because the network architecture is built up by PyTorch layers and activation functions, it already knows how to do this. The [autograd](https://pytorch.org/docs/stable/autograd.html) module is responsible for this automatic differentiation. The parameter update that completes backward propagation of the errors is actually performed by the optimizer.
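
The division of labor between `backward` and the optimizer can be seen with a one-parameter toy problem. The loss here is made up so the gradient is easy to verify by hand.

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

# toy loss: (3w - 1)^2, so d(loss)/dw = 2 * (3w - 1) * 3 = 30 at w = 2
loss_value = (3.0 * w - 1.0) ** 2
print(w.grad)  # None

# backward() only computes the gradient and stores it in w.grad
loss_value.backward()
grad_after_backward = w.grad.item()
w_after_backward = w.item()  # w itself is unchanged

# the optimizer performs the update: w <- w - lr * grad = 2 - 0.1 * 30
opt.step()
w_after_step = w.item()

print(grad_after_backward, w_after_backward, w_after_step)
```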

<a id=NNO></a>
<a href=#top>Go To Top</a>

In [None]:
''' define the model & operating parameters '''
# set the learning rate
learningRate = 0.1
stepSize = 50
stepMult = 0.5

# setup the network
torch.manual_seed(42)
classificationNN = myNN()
print(classificationNN)

# setup the optimizer
optimizer = torch.optim.SGD(params=classificationNN.parameters(), lr=learningRate)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=stepSize, gamma=stepMult)
                                
# setup the loss function
loss = nn.BCELoss()

## Logging & Visualizing Progress
In addition to the logging & visualizing results and progress that we can do ourselves with just python, PyTorch has the ability to log results to [TensorBoard](https://pytorch.org/docs/stable/tensorboard.html), which comes from TensorFlow. There are a lot of options, but the simplest is to create a `SummaryWriter` object, potentially specifying the directory, a comment, and file suffixes. If a directory is not specified, all output goes to *./runs/*. Scalar values can be added to TensorBoard with the `add_scalar` method. The name of what's being added and the epoch must be specified, in addition to the value. In addition to scalar values, there are several other options for adding data to TensorBoard. The `add_graph` method may be used to add an interactive visual of the NN's execution graph; this can be done by passing the network object as the method's `model` argument.

The data being logged can be viewed by executing *tensorboard --logdir=logging directory* on the command line, where the logging directory is what was passed to the `SummaryWriter` object in the `log_dir` argument.

<a id=LVP></a>
<a href=#top>Go To Top</a>

In [None]:
# define the tensorboard writer
writr = SummaryWriter(log_dir='./pytorch_tutorial/hardcode', comment='Hardcoded NN')

## Training the NN by Iterating over Epochs.
Each epoch has a training and evaluation phase, though the latter is not strictly needed.

### Training
The training phase is started by executing the network's `train(True)` method, inherited from the base class. This tells PyTorch that operations like dropout or batch normalization should occur, which wouldn't occur during evaluation. Then, for each minibatch:
1. training data is propagated forward through the network
2. the loss for the training data is computed
3. errors are propagated back through the network, updating the parameters

Before executing the `backward` method of the loss function, the optimizer's `zero_grad` method should be executed. This sets the parameter gradients to 0, so they don't accumulate across batches. Then, the optimizer's `step` method should be executed. This actually propagates the errors back through the network. Finally, if a decaying learning rate is used, the scheduler's `step` method should be called after iterating over all minibatches.
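
To see why `zero_grad` matters, here is a sketch with a single made-up parameter showing gradients accumulating across two `backward` calls, and then being cleared.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

# d(2w)/dw = 2, so the first backward() stores a gradient of 2
(2.0 * w).backward()
first = w.grad.item()

# without zero_grad(), the second gradient is added to the first: 2 + 2 = 4
(2.0 * w).backward()
accumulated = w.grad.item()

# zero_grad() clears it; depending on the PyTorch version, grad becomes None or zeros
opt.zero_grad()
cleared = w.grad is None or w.grad.item() == 0.0

# after clearing, backward() stores just the fresh gradient again
(2.0 * w).backward()
fresh = w.grad.item()

print(first, accumulated, cleared, fresh)
```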

### Evaluation
The evaluation phase is where the model's performance - loss and / or error - can be computed on the test dataset, or on the entire training dataset if it wasn't saved during the training phase. Evaluation code should all be run within the context of a `with torch.no_grad()` block, which prevents any gradient computations. The network's `train()` method should be called with `False` as the argument, which is equivalent to executing its `eval` method. This turns off the `Dropout` layer.
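
A minimal sketch of the evaluation setup, using a small throwaway network and random inputs (both are assumptions for illustration, not the network trained in this notebook):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(2, 4), nn.Dropout(0.5), nn.ReLU(),
                    nn.Linear(4, 1), nn.Sigmoid())
x = torch.rand(8, 2)

# train(False) is the same as eval(): dropout is switched off
net.train(False)
is_training = net.training

with torch.no_grad():
    out1 = net(x)
    out2 = net(x)

# no graph is recorded inside no_grad(), so the outputs don't track gradients
tracks_grad = out1.requires_grad
# with dropout off, repeated forward passes give identical predictions
repeatable = torch.equal(out1, out2)

print(is_training, tracks_grad, repeatable)
```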

Note that the predicted values from executing the network's `forward` step are probabilities (thanks to the sigmoid activation function) stored in a PyTorch tensor. To compute anything with these results - such as accuracy - it's usually best to convert these to a numpy array or something else. This can be done by executing `.detach().numpy()` on the tensor; `detach` returns a tensor that shares the same data but is not part of the calculation graph (hence, it's *detached*), and `numpy` then converts it. We want to compute the binary classification accuracy, but $\hat{y}_i$ is a probability, so we need to convert the probabilities to binary flags. We do this using a simple 50% threshold $P_{thresh}$:
\begin{equation}
\hat{y}_i = \begin{cases}
1 & \hat{y}_i > P_{thresh}\\
0 & \hat{y}_i \leq P_{thresh}
\end{cases}.
\end{equation}
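
The thresholding step can be sketched with a handful of made-up probabilities. Setting `requires_grad=True` here just mimics a tensor that is attached to a calculation graph, which is why `detach` is needed before converting to numpy.

```python
import numpy as np
import torch

y_true = torch.tensor([[1.0], [0.0], [1.0], [0.0]])
# made-up predicted probabilities, standing in for network output
y_prob = torch.tensor([[0.8], [0.3], [0.4], [0.5]], requires_grad=True)

p_thresh = 0.5
# detach from the graph, convert to numpy, then apply the threshold
y_flag = y_prob.detach().numpy() > p_thresh

# 3 of the 4 flags match the true classes (0.4 is a miss; 0.5 is not > 0.5)
accuracy = float(np.mean(y_flag == y_true.numpy().astype(bool)))
print(accuracy)  # 0.75
```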

As an aside, this is exactly what happens with logistic regression, where $x_i^j$ is the value of feature $j$ for observation $i$:
\begin{align}
\text{log_odds}_i(y_i=1)=&\log\frac{P\left(y_i=1\right)}{1-P\left(y_i=1\right)} = b_0+\sum_{j=1}^pb_jx_i^j\\
P\left(y_i=1\right) =& \left(1+e^{-\text{log_odds}_i}\right)^{-1}\\
\hat{y}_i=&\begin{cases}1 & P\left(y_i=1\right)>P_{thresh}\\
0&P\left(y_i=1\right)\leq P_{thresh}
\end{cases}
\end{align}
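
To make the connection concrete, a network with a single `Linear` layer followed by a sigmoid has the same functional form as logistic regression. The sketch below uses random, untrained weights and made-up inputs, and compares the network output with the probability computed by hand from the log odds.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logit_model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
x = torch.rand(5, 2)

lin = logit_model[0]
with torch.no_grad():
    # log odds: b_0 + sum_j b_j * x_j, using the layer's bias & weights
    log_odds = x @ lin.weight.T + lin.bias
    # probability from the log odds, via the logistic function
    p_manual = 1.0 / (1.0 + torch.exp(-log_odds))
    # the same thing, computed by the network
    p_model = logit_model(x)
    # binary flags from a 50% threshold
    y_hat = (p_model > 0.5).int()

print(torch.allclose(p_manual, p_model, atol=1e-6))
```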


<a id=T></a>
<a href=#top>Go To Top</a>

In [None]:
def trainNN(data, network, loss, optimizer, scheduler, epochs, writer, randState=None, talkFreq=0.2):
    '''
    Train a neural network specified by a network, optimizer, and loss, fitting to data from
    a training dataloader, and evaluating on a testing dataloader. NB This procedure updates
    the input network *in place*.
    :param data: list holding the training set data loader and testing set dataloader. The first
        may load in batches, while the second may not
    :param network: PyTorch NN architecture as defined to extend the nn.Module class, or using
        the Sequential constructor
    :param loss: PyTorch loss function
    :param optimizer: PyTorch optimizer
    :param scheduler: PyTorch learning rate decay scheduler
    :param epochs: integer number of epochs
    :param writer: PyTorch TensorBoard SummaryWriter object
    :param randState: optional seed for PyTorch pseudo random number generator
    :param talkFreq: optional (default=0.2) frequency with which progress should be printed
    :return trnLoss: loss on training set per epoch
    :return trnAcc: accuracy on training set per epoch
    :return tstLoss: loss on testing set per epoch
    :return tstAcc: accuracy on testing set per epoch
    '''
    
    # set the prng seed, maybe
    if randState is not None: # a seed of 0 is valid, so test against None explicitly
        torch.manual_seed(randState)
        writer.add_scalar('Params/random_state', randState, 0)
        
    # save the initial learning rate
    writer.add_scalar('Params/initial_learning_rate', optimizer.param_groups[0]['initial_lr'], 0)
    
    # add the model graph
    try:
        writer.add_graph(network, data[0].dataset.x)
    except TypeError as err:
        print("Network may be from nn.Sequential, so can't be added to TensorBoard!")
        
    # get the data loaders
    trn, tst = data
    
    # containers for training / testing loss & accuracy
    trnLoss = [np.inf]*epochs
    trnAcc = [0]*epochs
    tstLoss = [np.inf]*epochs
    tstAcc = [0]*epochs

    # iterate over epochs
    for epoch in range(epochs):
        # train with minibatch gradient descent
        network.train(True) # train(True) tells pytorch that ops like dropout / batch normalization should occur, which wouldn't occur during evaluation
        # iterate over batches
        for indx, (x, y) in enumerate(trn):
            # forward step
            yhat = network(x) # implements forward propagation as implemented by the forward() method
            # compute loss (not storing for now, will do after minibatching)
            l = loss(yhat, y)
            # backward step
            optimizer.zero_grad() # set gradients to zero before backprop, so there's no accumulation among batches
            l.backward() # backward propagation computes the gradients of the parameters
            optimizer.step()
            
        # update the learning rate
        scheduler.step()
        writer.add_scalar('Params/learning_rate', scheduler.get_last_lr()[0], epoch)
        
        # evaluate performance
        with torch.no_grad():
            network.train(False) # using no_grad() and train() false turns off any gradient updating or special functionality
            # evaluate loss & accuracy on training set
            yhat = network(trn.dataset.x)
            trnLoss[epoch] = loss(yhat, trn.dataset.y)
            trnAcc[epoch] = accuracy_score(trn.dataset.y.numpy(), yhat.detach().numpy()>0.5)           
            # evaluate loss & accuracy on testing set
            yhat = network(tst.dataset.x)
            tstLoss[epoch] = loss(yhat, tst.dataset.y)
            tstAcc[epoch] = accuracy_score(tst.dataset.y.numpy(), yhat.detach().numpy()>0.5)
            # tensorboard
            writer.add_scalar('Train/Loss', trnLoss[epoch], epoch)
            writer.add_scalar('Test/Loss', tstLoss[epoch], epoch)
            writer.add_scalar('Train/Accuracy', trnAcc[epoch], epoch)
            writer.add_scalar('Test/Accuracy', tstAcc[epoch], epoch)
            # maybe talk
            if epoch % (epochs*talkFreq) == 0:
                print('Epoch %d Training (Testing) Loss = %0.2f (%0.2f), & Accuracy = %0.2f (%0.2f)'%
                      (epoch, trnLoss[epoch], tstLoss[epoch], trnAcc[epoch], tstAcc[epoch]))

    print('==========\nTraining Initial Loss (Accuracy) = %0.2f (%0.2f), Final Loss (Accuracy) = %0.2f (%0.2f)'%\
          (trnLoss[0], trnAcc[0], trnLoss[-1], trnAcc[-1]))
    print('Testing Initial Loss (Accuracy) = %0.2f (%0.2f), Final Loss (Accuracy) = %0.2f (%0.2f)'%\
          (tstLoss[0], tstAcc[0], tstLoss[-1], tstAcc[-1]))
    
    # return results
    return trnLoss, trnAcc, tstLoss, tstAcc

In [None]:
# train the neural network
randState = 42
epochs = 500
talkFreq = 0.2

trnLoss, trnAcc, tstLoss, tstAcc = trainNN([trainLoad, testLoad], classificationNN, loss, optimizer, scheduler, epochs, writr, randState, talkFreq)

# visualize loss & accuracy progressions
x = list(range(epochs))
# float() handles the 0-dim loss tensors as well as numpy or python float accuracies
lossesAccs = np.asarray([(float(rl), float(tl), float(ra), float(ta)) for (rl, tl, ra, ta)
                         in zip(trnLoss, tstLoss, trnAcc, tstAcc)])

trc = [go.Scatter({'x':x, 'y':lossesAccs[:,0], 'name':'Train Loss', 'mode':'lines',
                   'line':{'color':'red', 'dash':'dash'}}),
       go.Scatter({'x':x, 'y':lossesAccs[:,1], 'name':'Test Loss', 'mode':'lines',
                   'line':{'color':'red'}}),
       go.Scatter({'x':x, 'y':lossesAccs[:,2], 'name':'Train Acc.', 'mode':'lines',
                   'line':{'color':'green', 'dash':'dash'}}),
       go.Scatter({'x':x, 'y':lossesAccs[:,3], 'name':'Test Acc', 'mode':'lines',
                   'line':{'color':'green'}})]
lout = go.Layout(title='Modeling Results; Testing Accuracy = %0.2f%%'%(100*tstAcc[-1]), height=400, legend={'orientation':'h', 'xanchor':'center', 'yanchor':'top', 'y':1.20, 'x':0.5})
plyoff.iplot(go.Figure(data=trc, layout=lout))

## PyTorch Sequential
So far in this tutorial, we've seen an NN with the layers hardcoded line-by-line, then added to the module list. PyTorch provides the [Sequential container](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html) to clean this up, and it provides a few more benefits. The previously-seen NN coded the network architecture as essentially a module list + forward propagation method. In contrast, `Sequential` chains together a series of modules - like a scikit-learn pipeline - such that no separate `forward` method is needed. Data passed into the returned NN is automatically passed along the member modules in sequence. Hence the name.

Note that `Sequential` returns a PyTorch `nn.modules.container.Sequential` object. Although this is itself a subclass of `nn.Module`, when I ran this the `add_graph` TensorBoard method would **throw a type exception** for it, which `trainNN` catches and reports. I have not seen a way around this.

<a id=S></a>
<a href=#top>Go To Top</a>

In [None]:
''' define a model, using Sequential '''
# set the learning rate
learningRate = 0.1
stepSize = 50
stepMult = 0.5

# define the network
torch.manual_seed(42)
classificationNN = nn.Sequential(
    nn.Linear(p, 10), nn.Dropout(0.2), nn.ReLU(),
    nn.Linear(10, 20), nn.Dropout(0.2), nn.ReLU(),
    nn.Linear(20, 10), nn.Dropout(0.2), nn.ReLU(),
    nn.Linear(10, 1), nn.Sigmoid()
)
print(classificationNN)

# setup the optimizer
optimizer = torch.optim.SGD(params=classificationNN.parameters(), lr=learningRate)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=stepSize, gamma=stepMult)
                                
# setup the loss function
loss = nn.BCELoss()

# define the tensorboard writer
writr = SummaryWriter(log_dir='./pytorch_tutorial/sequential/', comment='Sequential')

In [None]:
# train the neural network
randState = 42
epochs = 500
talkFreq = 0.2

trnLoss, trnAcc, tstLoss, tstAcc = trainNN([trainLoad, testLoad], classificationNN, loss, optimizer, scheduler, epochs, writr, randState, talkFreq)

# visualize loss & accuracy progressions
x = list(range(epochs))
# float() handles the 0-dim loss tensors as well as numpy or python float accuracies
lossesAccs = np.asarray([(float(rl), float(tl), float(ra), float(ta)) for (rl, tl, ra, ta)
                         in zip(trnLoss, tstLoss, trnAcc, tstAcc)])

trc = [go.Scatter({'x':x, 'y':lossesAccs[:,0], 'name':'Train Loss', 'mode':'lines',
                   'line':{'color':'red', 'dash':'dash'}}),
       go.Scatter({'x':x, 'y':lossesAccs[:,1], 'name':'Test Loss', 'mode':'lines',
                   'line':{'color':'red'}}),
       go.Scatter({'x':x, 'y':lossesAccs[:,2], 'name':'Train Acc.', 'mode':'lines',
                   'line':{'color':'green', 'dash':'dash'}}),
       go.Scatter({'x':x, 'y':lossesAccs[:,3], 'name':'Test Acc', 'mode':'lines',
                   'line':{'color':'green'}})]
lout = go.Layout(title='Modeling Results; Testing Accuracy = %0.2f%%'%(100*tstAcc[-1]), height=400, legend={'orientation':'h', 'xanchor':'center', 'yanchor':'top', 'y':1.20, 'x':0.5})
plyoff.iplot(go.Figure(data=trc, layout=lout))

## Parameterized Network Design
Both networks designed thus far are hardcoded, which is not ideal. Finally, let's create the same network without hardcoding the:
- number of layers
- layer sizes
- weights initialization method
- activation functions
- dropout probability

<a id=PND></a>
<a href=#top>Go To Top</a>

In [None]:
''' define the model class for a neural net with variable hidden layers & dropout '''
class myNN(nn.Module):
    def __init__(self, layerNodes, wInit, pDropout, activations):
        super(myNN, self).__init__()
        self.activations = activations[1:]
        self.len = len(layerNodes)-1
        self.drop = nn.Dropout(pDropout)
        self.linears = nn.ModuleList()
        for I, O in zip(layerNodes, layerNodes[1:]):
            # create the layer
            lin = nn.Linear(I, O)
            # initialize it
            nn.init.zeros_(lin.bias)
            if wInit == 'uni':
                nn.init.uniform_(lin.weight)
            elif wInit == 'xav':
                nn.init.xavier_uniform_(lin.weight)
            elif wInit == 'kai':
                nn.init.kaiming_uniform_(lin.weight, nonlinearity='relu')
            # and now add it
            self.linears.append(lin)

    def forward(self, x):
        for i, (L, A) in enumerate(zip(self.linears, self.activations)):
            # dropout if not the output layer
            if i < self.len - 1:
                x = self.drop(L(x))
            else:
                x = L(x)
            # compute the activation
            if A == 'relu':
                x = nn.functional.relu(x)
            elif A == 'sigmoid':
                x = torch.sigmoid(x)
            elif A == 'leakyrelu':
                x = nn.functional.leaky_relu(x)
            elif A == 'tanh':
                x = torch.tanh(x)
        return x
    
    def __str__(self):
        mn = super(myNN, self).__str__()
        return '%s\nActivations: %s'%(mn, self.activations)
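
The `forward` method above passes every non-output layer through `nn.Dropout`. As a small aside, the sketch below shows how that layer behaves in training versus evaluation mode. The probability of 0.2 and the `demo`-prefixed names are only illustrative assumptions, not part of the network above.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
demoDrop = nn.Dropout(p=0.2)    # illustrative dropout probability
demoX = torch.ones(1, 1000)

demoDrop.train()                # training mode: elements are zeroed at random
yTrain = demoDrop(demoX)
demoDrop.eval()                 # evaluation mode: dropout is the identity
yEval = demoDrop(demoX)

# surviving elements are scaled by 1/(1-p), so the expected activation is unchanged
print('fraction zeroed in train mode: %0.3f' % (yTrain == 0).float().mean().item())
print('surviving value in train mode: %0.2f' % yTrain.max().item())
print('eval output equals input:', torch.equal(yEval, demoX))
```

Because of this, a network containing dropout should be switched with `.train()` before optimizing and `.eval()` before computing test metrics.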

We can then set things up like this. The parameterized construction makes it easy to try different network architectures by just editing the arguments to the network constructor.

In [None]:
''' define the model & operating parameters '''
# define the modeling parameters
layers = (p, 10, 20, 10, 1) # input layer size, hidden layer sizes, output layer size
activations = (None, 'relu', 'relu', 'relu', 'sigmoid') # first element should be None; it's for the input layer
pDropout = 0.2
weightInit = 'kai'

# set the learning rate
learningRate = 0.1
stepSize = 50
stepMult = 0.5

# setup the network
torch.manual_seed(42)
classificationNN = myNN(layers, weightInit, pDropout, activations)
print(classificationNN)

# setup the optimizer
optimizer = torch.optim.SGD(params=classificationNN.parameters(), lr=learningRate)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=stepSize, gamma=stepMult)
                                
# setup the loss function
loss = nn.BCELoss()

# define the tensorboard writer; note that comment has no effect when log_dir is specified
writr = SummaryWriter(log_dir='./pytorch_tutorial/parameterized', comment='Parameterized')
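
To see what the `StepLR` settings above imply, the following sketch traces the learning rate over 500 epochs with the same values (initial rate 0.1, step size 50, multiplier 0.5). The single placeholder parameter `w` and the `demo`-prefixed names are assumptions used only so the snippet runs on its own without touching the optimizer defined above.

```python
import torch

# one placeholder parameter stands in for a real network's parameters
w = torch.nn.Parameter(torch.zeros(1))
demoOpt = torch.optim.SGD(params=[w], lr=0.1)
demoSched = torch.optim.lr_scheduler.StepLR(demoOpt, step_size=50, gamma=0.5)

lrs = []
for epoch in range(500):
    lrs.append(demoOpt.param_groups[0]['lr'])   # rate in effect during this epoch
    demoOpt.zero_grad()
    (w ** 2).sum().backward()                   # placeholder for a real loss
    demoOpt.step()
    demoSched.step()                            # advance the schedule once per epoch

print(lrs[0], lrs[49], lrs[50], lrs[-1])
```

The rate is halved every 50 epochs, so by the last epoch it has been halved nine times, ending at roughly 0.0002.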

In [None]:
# train the neural network
randState = 42
epochs = 500
talkFreq = 0.2

trnLoss, trnAcc, tstLoss, tstAcc = trainNN([trainLoad, testLoad], classificationNN, loss, optimizer, scheduler, epochs, writr, randState, talkFreq)

# visualize loss & accuracy progressions
x = list(range(epochs))
lossesAccs = np.asarray([(rl.item(), tl.item(), ra.item(), ta.item()) for (rl, tl, ra, ta)
                         in zip(trnLoss, tstLoss, trnAcc, tstAcc)])

trc = [go.Scatter({'x':x, 'y':lossesAccs[:,0], 'name':'Train Loss', 'mode':'lines',
                   'line':{'color':'red', 'dash':'dash'}}),
       go.Scatter({'x':x, 'y':lossesAccs[:,1], 'name':'Test Loss', 'mode':'lines',
                   'line':{'color':'red'}}),
       go.Scatter({'x':x, 'y':lossesAccs[:,2], 'name':'Train Acc.', 'mode':'lines',
                   'line':{'color':'green', 'dash':'dash'}}),
       go.Scatter({'x':x, 'y':lossesAccs[:,3], 'name':'Test Acc', 'mode':'lines',
                   'line':{'color':'green'}})]
lout = go.Layout(title='Modeling Results; Testing Accuracy = %0.2f%%'%(100*tstAcc[-1]), height=400, legend={'orientation':'h', 'xanchor':'center', 'yanchor':'top', 'y':1.20, 'x':0.5})
plyoff.iplot(go.Figure(data=trc, layout=lout))

## Checkpointing
[This](https://pytorch.org/tutorials/recipes/recipes/saving_and_loading_a_general_checkpoint.html) page shows how to checkpoint training progress on a neural network model. This can be useful for backing up in-process model training, especially when training is computationally intensive and takes a substantial amount of time. It can also be useful for training a model, then continuing training later on with more data, or when more epochs are desired.

In short, this involves two simple steps:
1. serialize the `state_dict` from both the NN and the optimizer - can use `torch.save` on a dictionary of content
2. later on, reinstantiate the NN and optimizer, then use their `load_state_dict` methods to reload the serialized states

<a id=C></a>
<a href=#top>Go To Top</a>
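
The two steps listed above can be sketched as follows. This is a minimal illustration rather than a complete recipe: the small `makeNet` network, the temporary file location, the stored epoch number, and the dictionary keys are all assumptions chosen for the example. In this notebook, the network and optimizer would be `classificationNN` and `optimizer`.

```python
import os
import tempfile
import torch
import torch.nn as nn

def makeNet():
    ''' hypothetical stand-in network used only for this example '''
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1), nn.Sigmoid())

torch.manual_seed(42)
ckptNet = makeNet()
ckptOpt = torch.optim.SGD(ckptNet.parameters(), lr=0.1)

# step 1: serialize both state_dicts (plus any bookkeeping) in one dictionary
ckptPath = os.path.join(tempfile.mkdtemp(), 'checkpoint.pt')   # hypothetical location
torch.save({'epoch': 100,
            'model_state_dict': ckptNet.state_dict(),
            'optimizer_state_dict': ckptOpt.state_dict()}, ckptPath)

# step 2: reinstantiate the same architecture & optimizer, then reload the states
newNet = makeNet()
newOpt = torch.optim.SGD(newNet.parameters(), lr=0.1)
ckpt = torch.load(ckptPath)
newNet.load_state_dict(ckpt['model_state_dict'])
newOpt.load_state_dict(ckpt['optimizer_state_dict'])
startEpoch = ckpt['epoch'] + 1   # epoch at which training would resume

newNet.train()   # or newNet.eval() if the model is only used for inference
```

A learning rate scheduler also exposes `state_dict` and `load_state_dict`, so its state can be stored in the same dictionary if training is to resume with the same schedule.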

## Extensions

This tutorial demonstrated how to build a simple neural network model for binary classification. Modifying it for multinomial classification or for a regression problem would be straightforward. There are many other options and functionalities in PyTorch that have not been explored here, but you should now be able to understand how and where to implement them as needed. More complex deep learning networks, such as [Recurrent Neural Networks (RNNs)](https://en.wikipedia.org/wiki/Recurrent_neural_network) and [Convolutional Neural Networks (CNNs)](https://en.wikipedia.org/wiki/Convolutional_neural_network), would be built similarly, but with different layers and operations.
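
As a rough illustration of the multinomial and regression modifications mentioned here, the change is mostly in the size of the output layer, the output activation, and the loss function. The sketch below uses single linear layers and made-up sizes as stand-ins for a full network. All names in it are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
nBatch, nFeats, nClasses = 8, 5, 3    # hypothetical sizes
xDemo = torch.randn(nBatch, nFeats)

# multinomial classification: one raw score (logit) per class, integer class labels;
# nn.CrossEntropyLoss applies log-softmax itself, so no output activation is used
multiNet = nn.Linear(nFeats, nClasses)
yClass = torch.randint(0, nClasses, (nBatch,))
lossMulti = nn.CrossEntropyLoss()(multiNet(xDemo), yClass)

# regression: a single unbounded output and a squared-error loss
regNet = nn.Linear(nFeats, 1)
yReal = torch.randn(nBatch, 1)
lossReg = nn.MSELoss()(regNet(xDemo), yReal)

print(lossMulti.item(), lossReg.item())
```

In both cases the final layer is left without a sigmoid, unlike the binary classifier paired with `nn.BCELoss` in this notebook.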

See below for a sample Convolutional Recurrent Neural Network, **but note that the cell instantiating it will not execute**, because it depends on data and parameters that are not defined in this notebook. This is actual code from the Iron Ore Price Prediction Project.

<a id=E></a>
<a href=#top>Go To Top</a>

In [None]:
class ConvRNN(nn.Module):
    """
    Convolutional Recurrent Neural Network
    https://arxiv.org/pdf/1907.04155.pdf
    https://arxiv.org/pdf/1506.04214.pdf
    """
    layers = {}
    def __init__(self, input_dim, n_time_steps, output_dim, kernel_size1=7, kernel_size2=5, kernel_size3=3,
                 n_channels1=32, n_channels2=32, n_channels3=32, n_units1=32, n_units2=32, n_units3=32):
        """ initializer - create the layers """
        super().__init__()
        # average pooling layers
        self.avg_pool1 = nn.AvgPool1d(2, 2)
        self.avg_pool2 = nn.AvgPool1d(4, 4)
        # convolutional -> convolutional -> recurrent -> padding layers
        self.conv11 = nn.Conv1d(input_dim, n_channels1, kernel_size=kernel_size1)
        self.conv12 = nn.Conv1d(n_channels1, n_channels1, kernel_size=kernel_size1)
        self.gru1 = nn.GRU(n_channels1, n_units1, batch_first=True)
        self.zp11 = nn.ConstantPad1d(((kernel_size1 - 1), 0), 0)
        self.zp12 = nn.ConstantPad1d(((kernel_size1 - 1), 0), 0)
        # convolutional -> convolutional -> recurrent -> padding layers
        self.conv21 = nn.Conv1d(input_dim, n_channels2, kernel_size=kernel_size2)
        self.conv22 = nn.Conv1d(n_channels2, n_channels2, kernel_size=kernel_size2)
        self.gru2 = nn.GRU(n_channels2, n_units2, batch_first=True)
        self.zp21 = nn.ConstantPad1d(((kernel_size2 - 1), 0), 0)
        self.zp22 = nn.ConstantPad1d(((kernel_size2 - 1), 0), 0)
        # convolutional -> convolutional -> recurrent -> padding layers
        self.conv31 = nn.Conv1d(input_dim, n_channels3, kernel_size=kernel_size3)
        self.conv32 = nn.Conv1d(n_channels3, n_channels3, kernel_size=kernel_size3)
        self.gru3 = nn.GRU(n_channels3, n_units3, batch_first=True)
        self.zp31 = nn.ConstantPad1d(((kernel_size3 - 1), 0), 0)
        self.zp32 = nn.ConstantPad1d(((kernel_size3 - 1), 0), 0)
        # linear output layers
        self.linear1 = nn.Linear(n_units1 + n_units2 + n_units3, output_dim)
        self.linear2 = nn.Linear(input_dim * n_time_steps, output_dim)

    def forward(self, x):
        ''' forward propagation of input data through the network '''
        x = x.permute(0, 2, 1)
        # line1
        y1 = self.zp11(x)
        y1 = torch.relu(self.conv11(y1))
        y1 = self.zp12(y1)
        y1 = torch.relu(self.conv12(y1))
        y1 = y1.permute(0, 2, 1)
        out, h1 = self.gru1(y1)
        # line2
        y2 = self.avg_pool1(x)
        y2 = self.zp21(y2)
        y2 = torch.relu(self.conv21(y2))
        y2 = self.zp22(y2)
        y2 = torch.relu(self.conv22(y2))
        y2 = y2.permute(0, 2, 1)
        out, h2 = self.gru2(y2)
        # line3
        y3 = self.avg_pool2(x)
        y3 = self.zp31(y3)
        y3 = torch.relu(self.conv31(y3))
        y3 = self.zp32(y3)
        y3 = torch.relu(self.conv32(y3))
        y3 = y3.permute(0, 2, 1)
        out, h3 = self.gru3(y3)
        h = torch.cat([h1[-1], h2[-1], h3[-1]], dim=1)
        out1 = self.linear1(h)
        out2 = self.linear2(x.contiguous().view(x.shape[0], -1))
        out = out1 + out2

        return out
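
To make the tensor shapes in one branch of the class above easier to follow, here is a small sketch of a single pad -> convolution -> GRU pass on random data. The batch, time step, and feature sizes are made up, and the variable names are hypothetical. The kernel size, channel count, and unit count match the class defaults for the first branch.

```python
import torch
import torch.nn as nn

torch.manual_seed(42)
nBatch, nSteps, nFeats = 4, 24, 6        # hypothetical data sizes
kernel, nChannels, nUnits = 7, 32, 32    # class defaults for the first branch

pad = nn.ConstantPad1d((kernel - 1, 0), 0)    # zeros on the left (past) side only
conv = nn.Conv1d(nFeats, nChannels, kernel_size=kernel)
gru = nn.GRU(nChannels, nUnits, batch_first=True)

xSeq = torch.randn(nBatch, nSteps, nFeats).permute(0, 2, 1)   # (batch, features, time)
ySeq = torch.relu(conv(pad(xSeq)))      # padding restores the kernel-1 steps the conv removes
out, h = gru(ySeq.permute(0, 2, 1))     # GRU expects (batch, time, channels)

print(ySeq.shape)     # (batch, channels, time)
print(out.shape)      # (batch, time, units)
print(h[-1].shape)    # (batch, units): the final hidden state of the last GRU layer
```

The left-only padding keeps the sequence length unchanged while ensuring each output step depends only on current and earlier inputs. The final hidden state `h[-1]` is what the class concatenates across its three branches.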

In [None]:
''' define the model & operating parameters '''
network = ConvRNN(input_dim=feats.shape[2], n_time_steps=feats.shape[1], output_dim=respDays,
                  n_channels1=channels[0], n_channels2=channels[1], n_channels3=channels[2],
                  n_units1=units[0], n_units2=units[1], n_units3=units[2])
opt = torch.optim.Adam(network.parameters(), lr=modelParams['learningRate'])
epoch_scheduler = torch.optim.lr_scheduler.StepLR(opt, modelParams['learnRateDecayStep'], gamma=modelParams['learnRateMDecay'])
loss = torch.nn.MSELoss()

## The End, Happy Torching!
<font size='1'>(wait, what?)</font>

<a id=Bot></a>
<a href=#top>Go To Top</a>