Neural networks (NNs) have gained massive popularity due to their recent successes in many applications. NNs can achieve human-comparable accuracy in tasks like image recognition, but they are not a new concept. Their recent notoriety stems from advances in computing power and the availability of _large enough amounts_ of useful training data, both of which have improved their applicability in industrial and research settings. In this notebook we'll take a bird's eye view of what exactly a NN is and how it works.

### <center> Fig. 1: Our example deep neural network</center>

![alt](figs/NN.png)

Fig. 1 illustrates a typical NN architecture used in classification. The input vector is the far left column, where each feature of the sample (in all of our homeworks, a row of a data matrix) is treated as an input. $W_1$ is the weight matrix for the first layer of the NN, $G_{1}$; each node in that layer uses one column $w$ of $W_1$. $G_{1}$ is a function that operates on $w^{T}x$. In the case of a sigmoid like we saw in logistic regression:

$G_{1} = \frac{1}{1 + e^{-w^{T}x}}$
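
As a quick numerical check of the formula above, the sketch below evaluates a single sigmoid node in plain Python. The weight and feature values are made up for illustration; they are not taken from Fig. 1.

```python
import math

def sigmoid_node(w, x):
    """Output of one sigmoid node: 1 / (1 + exp(-w^T x))."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x))  # inner product w^T x
    return 1.0 / (1.0 + math.exp(-z))

# hypothetical weights and features, chosen only for illustration
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]

print(sigmoid_node(w, x))           # here w^T x = -0.5, so the output is below 0.5
print(sigmoid_node([0.0] * 3, x))   # w^T x = 0 gives exactly 0.5
```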

Each node of the first layer corresponds to a single column vector of the weight matrix, and each layer can have a higher or lower number of dimensions $m$ than the one before it. The outputs of the nodes in one layer are used as inputs for the next layer, creating a long chain of composed activation functions, or layers, $G_{k}$. As $k$ grows large, a neural network is said to be "deep". At the end of the day, a neural network is just a long composition of special classes of functions:

$\hat{y} = G_{k}(G_{k-1}(\ldots G_{2}(G_{1}(x, W_{1}), W_{2})\ldots ), W_{k})$
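
To make the composition concrete, here is a minimal sketch of a two-layer forward pass in plain Python. It assumes sigmoid activations for both layers and small made-up weight matrices; following the text, each column of a weight matrix holds the weights of one node.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(x, W):
    """One layer G(x, W): apply a sigmoid node to x for every column of W."""
    n_nodes = len(W[0])
    return [sigmoid(sum(W[i][j] * x[i] for i in range(len(x))))
            for j in range(n_nodes)]

def forward(x, weights):
    """Compose the layers: G_k(... G_2(G_1(x, W_1), W_2) ..., W_k)."""
    for W in weights:
        x = layer(x, W)
    return x

# hypothetical weights: W1 maps 3 features to 2 nodes, W2 maps 2 nodes to 1 output
W1 = [[0.1, -0.2],
      [0.4, 0.3],
      [-0.5, 0.2]]
W2 = [[1.0],
      [-1.0]]

y_hat = forward([1.0, 2.0, 3.0], [W1, W2])
print(y_hat)  # a list with a single value between 0 and 1
```

Adding another weight matrix to the list passed to `forward` makes the network one layer deeper, as long as its number of rows matches the number of columns of the matrix before it.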

Part of what makes training a NN feasible is that we choose differentiable functions for $G$. We can apply gradient descent, like we did in early homeworks. Suppose we train a NN with L2-loss, then:

$\mathcal{L}(\hat{y},y) = \| y - \hat{y} \|_{2}^{2}$

$\mathcal{L}(\hat{y},y) = \| y - G_{k}(G_{k-1}(\ldots G_{2}(G_{1}(x, W_{1}), W_{2})\ldots ), W_{k}) \|_{2}^{2}$

The derivative of the loss $\mathcal{L}$ with respect to the weights $W_{1}, \ldots, W_{k}$ of each of the NN layers can be found by applying the chain rule, which boils down to an exercise in bookkeeping the indices. The _backpropagation_ algorithm computes these derivatives efficiently so that gradient descent updates can be applied to a multi-layer NN. What makes a NN so powerful is that it can, in theory, approximate essentially _any_ continuous function. In practice, this can A) require an infeasible amount of _labeled_ data and B) easily lead to overfitting if not carefully tuned.
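
As a small illustration of the chain rule at work, the sketch below uses the simplest possible case: a single sigmoid node with the squared loss $(y - \hat{y})^{2}$. It compares the hand-derived gradient with a finite-difference estimate and then takes one gradient descent step. The sample, starting weights, and learning rate are made up for illustration.

```python
import math

def predict(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, x, y):
    return (y - predict(w, x)) ** 2

def grad(w, x, y):
    """Chain rule: dL/dw_i = -2 (y - y_hat) * y_hat (1 - y_hat) * x_i."""
    y_hat = predict(w, x)
    return [-2.0 * (y - y_hat) * y_hat * (1.0 - y_hat) * xi for xi in x]

# made-up sample and starting weights
x, y = [1.0, 2.0], 1.0
w = [0.1, -0.3]

# compare the analytic gradient with a central finite difference
eps = 1e-6
numeric = []
for i in range(len(w)):
    w_plus = list(w)
    w_plus[i] += eps
    w_minus = list(w)
    w_minus[i] -= eps
    numeric.append((loss(w_plus, x, y) - loss(w_minus, x, y)) / (2 * eps))
print(grad(w, x, y), numeric)  # the two lists should agree closely

# one gradient descent step with a hypothetical learning rate of 0.5
lr = 0.5
w_new = [wi - lr * gi for wi, gi in zip(w, grad(w, x, y))]
print(loss(w, x, y), loss(w_new, x, y))  # the loss after the step is lower
```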

### Implementation

At a high level, NNs elegantly combine two of the concepts we've learned so far: regression and the kernel method. In the real world, things we want to learn from data are often very _high dimensional_, meaning that the output we care about is a function of many, many features. As we saw in previous homeworks, fitting a line to separate data, or to follow the trend of data, quickly becomes difficult as we add dimensions and non-linear bases.

A NN attempts to do this programmatically, where each layer transforms the data: either by expanding the dimensions like we saw with the SVM kernel trick (making $m$ bigger with each layer), or by finding lines of separation (by minimizing the loss with respect to the weights $w$ using gradient descent). In this assignment, we'll walk through using a package called PyTorch, the Python extension of Torch. There are tons of [accessible tutorials](https://pytorch.org/tutorials/) on doing some very cool ML tasks with PyTorch. The tutorials come in the [form of notebooks](https://pytorch.org/tutorials/_downloads/17a7c7cb80916fcdf921097825a0f562/cifar10_tutorial.ipynb) like this one.

Another option would be Google's recently released TensorFlow v2.0. While the newest version attempts to improve on some of the vagaries of v1.0, TensorFlow is more difficult to learn, interpret, and prototype with. Its biggest advantages are in late-stage algorithm deployment within an enterprise software stack; while we aren't doing that here, if your goal is to be an ML software engineer, TensorFlow is worth learning.

PyTorch's advantage over TensorFlow is that it is an object-oriented implementation within an object-oriented language, Python. Optimizers, loss functions, and activation functions are easy enough to use that all of the ML tasks performed in our previous assignments have straightforward PyTorch implementations.

This homework assignment consists of running cells and making small changes to this walkthrough. Most of the challenge will likely come from getting PyTorch installed, if problems do arise.

### Problem 1 (Load and view the data)

We're going to jump right into training a classifier using the CIFAR10 dataset. These questions follow along the lines of the tutorial found [here](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py). In this environment, each image is a sample, and the NN's job is to classify the image with a label of what's in the image. Below are the 10 classes the images fall into:

![alt](figs/cifar10.png)

Each sample is a single (very small) color image. The pixel Red-Green-Blue color values are passed into the NN as input, and they are eventually flattened into a long vector before the linear layers. The output is one of the ten categories. After installing PyTorch, run each of the cells below to train and view the performance of the NN. A much larger and more daunting image classification task, labeling images in the ImageNet dataset, can be found [here](https://github.com/pytorch/examples/tree/master/imagenet).

In [None]:
#import the required libraries
import torch       #base library
import torchvision        #library for image classification/computer vision tasks
import torchvision.transforms as transforms       #utilities for transforming data matrices
import torch.nn as nn            #this library contains network layers and loss functions
import torch.nn.functional as F  #this contains functional forms of activations (e.g. relu) and other operations
import torch.optim as optim      #this library contains optimizers like gradient descent

import matplotlib.pyplot as plt
import numpy as np

In [None]:
transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

#this process is going to create a new local folder to store the data downloaded from the web in
#here we load the train data, already saved as a common baseline task for new image recognition algorithms
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=1)

#when training a NN you should 1) shuffle the order of your training data, and 2) create mini-batches of the data
#and perform gradient descent on each batch one at a time. The "num_workers" argument lets you
#parallelize the task of reading in the data across worker processes; I've set this to 1
#but setting it to 2 or 4 can speed things up

#here we load the test data. These loaders are fancy wrappers for making data processing easier, but ultimately
#these are just matrices with a bunch of rows, each corresponding to a single sample
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=1)

#these are the labels for classifying the data samples
classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')  

In [None]:
#here we can inspect the images and their corresponding labels.
#each image is stored as a 3 x height x width tensor, where the 3 refers to the Red-Green-Blue color channels;
#imshow transposes it to height x width x 3, which is the layout matplotlib expects

def imshow(img):
    img = img / 2 + 0.5     # unnormalize
    npimg = img.numpy()
    plt.imshow(np.transpose(npimg, (1, 2, 0)))
    plt.show()


# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)

# show images
imshow(torchvision.utils.make_grid(images))
# print labels
print(' '.join('%5s' % classes[labels[j]] for j in range(4)))
print("Look at these absurdly small images")
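
The `img / 2 + 0.5` line in `imshow` undoes the `Normalize` transform applied when the data was loaded. With a mean of 0.5 and a standard deviation of 0.5, `Normalize` computes `(x - 0.5) / 0.5` for each channel, which maps pixel values from [0, 1] to [-1, 1]. A small sketch of that arithmetic in plain Python (the helper names are made up for illustration):

```python
def normalize(pixel, mean=0.5, std=0.5):
    """Per-channel arithmetic of transforms.Normalize: (x - mean) / std."""
    return (pixel - mean) / std

def unnormalize(value):
    """Inverse used in imshow: value / 2 + 0.5."""
    return value / 2 + 0.5

for pixel in (0.0, 0.25, 0.5, 1.0):
    z = normalize(pixel)
    print(pixel, z, unnormalize(z))  # the last column recovers the original pixel value
```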

### Problem 2 (Define the NN architecture)

In the cells below we define the NN architecture. The important thing to note here is that in the init(self) routine, each line after the call to super is a __layer__ of the network. I've commented out the line self.fc2 = nn.Linear(120, 84). Each linear layer has the form "nn.Linear(input_dimension, output_dimension)".
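
The first linear layer in the cell below takes `16 * 5 * 5` inputs. Here is a sketch of where that number comes from, assuming the 32 x 32 CIFAR10 images, convolutions with stride 1 and no padding, and 2 x 2 max pooling with stride 2, as in the `Net` class defined below. The helper `conv_out` is a made-up name for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                   # CIFAR10 images are 32 x 32 pixels
size = conv_out(size, kernel=5)             # conv1 = nn.Conv2d(3, 6, 5)   -> 28
size = conv_out(size, kernel=2, stride=2)   # pool  = nn.MaxPool2d(2, 2)   -> 14
size = conv_out(size, kernel=5)             # conv2 = nn.Conv2d(6, 16, 5)  -> 10
size = conv_out(size, kernel=2, stride=2)   # pool again                   -> 5

channels = 16                               # output channels of conv2
flat_features = channels * size * size
print(flat_features)  # 400, the input dimension of fc1
```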

The input dimension of the current layer must match the output dimension of the previous layer. Notice that the output dimension of self.fc1 matches the input dimension of self.fc3. Later on you will adjust the network and test the performance with an additional linear layer. Run the following cells:

In [None]:
#below we're going to design a simple 

#here we define the class Net that lets us move around and adjust the network in memory like any other data structure
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)   #each of these self.function lines is a layer of the network
        self.pool = nn.MaxPool2d(2, 2)        #each of these layers performs a specific task in transforming the image
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        #self.fc2 = nn.Linear(120, 84)          #we're going to focus on these linear layers
        self.fc3 = nn.Linear(120, 10)           #the format is nn.Linear(input_dimension, output_dimension)

    #this defines the function composition G_k(G_k-1(... G_1(x,w_1), w_2)... )))
    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        #x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x


net = Net()

In [None]:
#This loss function is applied to the outputs of the Net's forward operation, comparing the prediction
#y_hat to the true output label y
loss_func = nn.CrossEntropyLoss()

#this encodes gradient descent: notice two important parameters
#1) the learning rate determines how big of a step the gradient descent algorithm takes
#2) momentum <1 blends each step with a decaying sum of previous gradients, which smooths the updates
#   and speeds up progress along directions where the gradient is consistent
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)
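
For intuition about the `momentum` argument, here is a minimal plain-Python sketch of the update rule that, as I understand the documented behavior, `optim.SGD` applies with its default settings (no dampening, no Nesterov, no weight decay): a velocity term accumulates past gradients, and the parameter moves along the velocity. The one-dimensional quadratic loss and the learning rate of 0.1 are made up for illustration.

```python
def sgd_momentum_step(param, grad, velocity, lr=0.001, momentum=0.9):
    """One update: v <- momentum * v + grad, then param <- param - lr * v."""
    velocity = momentum * velocity + grad
    param = param - lr * velocity
    return param, velocity

# toy loss L(p) = p^2 with gradient 2p (hypothetical, for illustration only)
p, v = 1.0, 0.0
for step in range(5):
    p, v = sgd_momentum_step(p, 2.0 * p, v, lr=0.1)
    print(step, round(p, 4), round(v, 4))
```

Setting `momentum=0.0` in this sketch recovers plain gradient descent, since the velocity is then just the current gradient.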

### Problem 3 (Train and test the network)

This step can take a while depending on how powerful your computer is, but shouldn't take up too much memory. If your computer is unable to handle the training stage, run the cell and keep the error output as your answer here. You can speed up the process by increasing the num_workers argument in the trainloader above to half the number of CPU cores you have available. If you're interested (not required), the CUDA implementation details can be found at the bottom of the tutorial [here](https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html#sphx-glr-beginner-blitz-cifar10-tutorial-py). Convolutional NNs like the one we use here can be sped up significantly by using GPUs instead of CPUs.

In [None]:
for epoch in range(2):  # loop over the dataset multiple times

    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):  #this trains the algorithm for each batched training set
        # get the inputs
        inputs, labels = data

        # zero the parameter gradients // PyTorch accumulates gradients across backward() calls,
        # so this keeps each update based on the current batch only
        optimizer.zero_grad()

        # forward + backward + optimize
        y_hat = net(inputs)
        loss = loss_func(y_hat, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 2000 == 1999:    # print every 2000 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 2000))
            running_loss = 0.0

print('Finished Training')
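
To help interpret the printed loss values: `nn.CrossEntropyLoss` applies a softmax to the raw network outputs and takes the negative log of the probability assigned to the true class, averaged over the batch. The plain-Python sketch below computes this for a single sample with made-up scores. A network that assigns equal scores to all 10 classes has a loss of ln(10), about 2.303, so values below that suggest the network is doing better than an uninformed guess.

```python
import math

def cross_entropy(scores, label):
    """Softmax over raw scores, then negative log-probability of the true label."""
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]

uniform = [0.0] * 10                 # equal scores for all 10 classes
print(cross_entropy(uniform, 3))     # ln(10)

confident = [0.0] * 10
confident[3] = 5.0                   # hypothetical scores favoring the true class
print(cross_entropy(confident, 3))   # smaller than ln(10)
```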

In [None]:
#Here we compute the general accuracy for each of the 10 image classes

class_correct = list(0. for i in range(10))
class_total = list(0. for i in range(10))
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs, 1)
        c = (predicted == labels).squeeze()
        for i in range(4):
            label = labels[i]
            class_correct[label] += c[i].item()
            class_total[label] += 1


for i in range(10):
    print('Accuracy of %5s : %2d %%' % (
        classes[i], 100 * class_correct[i] / class_total[i]))

### Problem 4 (Improve the Network)

Copy the code and Net class above and uncomment the following lines in the Net class:

self.fc2 = nn.Linear(120, 84)
x = F.relu(self.fc2(x))

Adjust the input dimension of self.fc3 accordingly, so that it matches the output dimension of self.fc2. Train a new NN:

net2 = Net()

with the additional layer and compare the test general accuracy. Which classes performed better or worse? In your own words, comment on how you might programmatically search NN architectures to find the best test performance.

In [None]:
#copy the net class, the optimizer and loss function declaration here

In [None]:
#copy the train and test routines here

In [None]:
#insert remaining code/comments here