# Introduction to Deep Learning
Classical machine learning relies on using statistics to determine relationships between features and labels, and can be very effective for creating predictive models. However, a massive growth in the availability of data coupled with advances in the computing technology required to process it has led to the emergence of new machine learning techniques that mimic the way the brain processes information in a structure called an artificial neural network.

The mathematics that underpins artificial neural networks can be complex if you haven't studied it before, and involves a combination of linear algebra and differential calculus. You don't really need to know the in-depth workings of this math in order to build machine learning models using modern ML frameworks, but a basic conceptual understanding can be helpful. The rest of this section consists of an overview of how neural networks work - if you already know this (or don't really care), skip to **Creating a Neural Network with PyTorch** below; otherwise, read on!

## Neural Networks
Your brain works by connecting networks of neurons, each of which receives electrochemical stimuli from multiple inputs, which cause the neuron to fire under certain conditions. When a neuron fires, it creates an electrochemical charge that is passed as an input to one or more other neurons, creating a complex *feed-forward* network made up of layers of neurons that pass the signal on.

<br/>
<div align="center" style='font-size:24px;'>&#8694;&#9711;&rarr;</div>

An artificial neural network uses the same principles, but the inputs are numeric values with associated *weights* that reflect their relative importance. The neuron calculates the weighted sum of its inputs (each input value multiplied by its weight), and passes the result to an *activation function* that determines the numeric output that the artificial neuron produces.

### Training a Neural Network
As the human brain learns from experience, the inputs to the neurons are strengthened or weakened depending on their importance to the decisions that the brain needs to make in response to stimuli. Similarly, you train an artificial neural network using a supervised learning technique in which a *loss function* is used to evaluate how accurately the model outputs match known true values for the inputs that were passed to it, and the weights are then adjusted to make the outputs more accurate.

### A Simplified Conceptual Example
If you're encountering deep neural networks for the first time, the concepts can seem very complex. Let's simplify things so we can visualize the basic principles more clearly using a very basic example. We'll use a single neuron that has a single input with an associated weight. Our goal is to determine the right value for the weight in order to get the expected output from the neuron. In this case, we know that an input value of **2.1** should produce an output value of **1**.

<br/>
<div align="center" style='font-size:24px;'><sup>2.1</sup><sub>w</sub>&#8649;&#9711;&rarr;1</div>

The code in the following cell defines the neuron as a function that multiplies the input (*X*) by the weight (*w*), and applies the result to a sigmoid activation function so that the output is squashed to a value between 0 and 1. We calculate the error (or *loss*) by simply subtracting the output generated for *X* (2.1) when using a weight of *w* from the expected true value (*Y*, which we know is 1), and squaring the result.

The code then calculates the derivative of the loss function with respect to the weight - don't worry too much about the actual math involved; the key point is that this tells us in which direction (up or down) to adjust the weight in order to move the function output closer to the true value. If the derivative is negative, the loss is falling as the weight increases, so we'll increase the weight to bring the loss down further. If the derivative is positive, the loss is rising as the weight increases, so we'll decrease the weight. We use a constant *learning rate* (*LR*) to specify by how much the weight is adjusted.

To train our single-neuron network, we initialize the weight with a random value and then repeatedly try generating an output from the function using our static input value (*X*) and the weight (*w*), calculating the loss, determining which way we need to adjust the weight to reduce the loss, and then trying again with a revised weight value. We finish after five iterations (*epochs*).

> **Note:** The following code is a simplification of a neural network training process - in reality the network would include *bias* values and the loss function and optimizer would be more complex. This example is intended to simply illustrate the principles by which a neural network is trained by iteratively adjusting the weights applied to each neuron.

Run the following cell to see the results:

In [None]:
import numpy as np
import random
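# scipy.misc.derivative is deprecated and has been removed from recent SciPy releases,
# so we define an equivalent central-difference estimate of the derivative here
def derivative(func, x0, dx=1e-6):
    return (func(x0 + dx) - func(x0 - dx)) / (2 * dx)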

# Known values for input (feature) and output (label)
X = 2.1
Y = 1

# By how much should we adjust the weight with each iteration
LR = 1

# Neuron function
from scipy.special import expit as sigmoid

def neuron(x, w):
    return sigmoid(x * w)

# Function to calculate loss (the squared difference between the true value and the output)
def lossfunc(w):
    return (Y - neuron(X, w))**2

# Initialize weight with a random value between 0 and 1
w = random.uniform(0,1)

# Call the function over 5 iterations (epochs), updating the weight and recording the loss each time
e = 1
weights = []
losses = []
while e < 6:
    print('Epoch:', e)
    e += 1
    weights.append(w)
    print('\tWeight:%.20f' % w)

    # Pass the value and weight forward through the neuron
    y = neuron(X, w)
    print('\tTrue value:%.20f' % Y)
    print('\tOutput value:%.20f' % y)

    # Calculate loss
    loss = lossfunc(w)
    losses.append(loss)
    print('\tLoss: %.20f' % loss)

    # Which way should we adjust w to reduce loss?
    dw = derivative(lossfunc, w)
    print('\tDerivative:%.20f' % dw)

    if dw > 0:
        # Slope is positive - decrease w
        w = w - LR
    elif dw < 0:
        # Slope is negative - increase w
        w = w + LR

# Plot the function and the weights and losses in our epochs
from matplotlib import pyplot as plt
%matplotlib inline

# Create an array of weight values
wRange = np.linspace(-1, 7)

# Use the function to get the corresponding loss values
lRange = [lossfunc(i) for i in wRange]

# Plot the function line
plt.xlabel('Weight')
plt.ylabel('Loss')
plt.grid()
plt.plot(wRange,lRange, color='grey', ls="--")

# Plot the weights and losses we recorded
plt.scatter(weights,losses, c='red')
e = 0
while e < len(weights):
    plt.annotate('E' + str(e+1),(weights[e], losses[e]))
    e += 1

plt.show()

Review the output from the code, and note that the loss should reduce after each epoch. The plotted line chart shows the loss function in grey and the weight/loss point for each epoch in red. For each epoch, the derivative of the loss function with respect to the weight tells us in which direction the slope (or *gradient*) of the function is headed, enabling us to determine how to adjust the weight to reduce the loss.
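The cell above moves the weight by a fixed step in the direction indicated by the sign of the derivative. Gradient descent more commonly scales each step by the derivative itself. The following is a minimal sketch of that variant for the same single neuron, assuming a squared-error loss and using the chain rule to compute the derivative analytically. The names `sigmoid` and `loss_and_grad` and the fixed starting weight of 0.5 are illustrative choices, not part of the example above.

```python
import math

X, Y = 2.1, 1.0   # same input and expected output as the example above
LR = 1.0          # learning rate (illustrative value)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(w):
    """Squared-error loss of the neuron and its derivative with respect to w."""
    y = sigmoid(X * w)
    loss = (Y - y) ** 2
    # Chain rule: dL/dw = dL/dy * dy/dz * dz/dw, where z = X * w
    grad = -2.0 * (Y - y) * y * (1.0 - y) * X
    return loss, grad

w = 0.5           # fixed starting weight so the run is repeatable
history = []
for epoch in range(1, 6):
    loss, grad = loss_and_grad(w)
    history.append(loss)
    print('Epoch %d: w=%.4f loss=%.6f dL/dw=%.6f' % (epoch, w, loss, grad))
    w = w - LR * grad   # step against the gradient, scaled by its size
```

Because the step shrinks as the slope flattens, the adjustments become smaller as the output approaches the true value.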

## A More Detailed Look at Neural Networks

Now that you've seen the basic concepts, it's time to consider artificial neural networks in more detail.


### Weights and Bias

Our previous example was based on a single neuron that had a single input. In reality, there are usually multiple inputs (each with its own weight), and there is also a *bias* value that is used to ensure the neuron only generates a significant output when appropriate. For example, suppose we represent a neuron that has two numeric inputs (let's call them *x<sub>1</sub>* and *x<sub>2</sub>*) each with an associated weight (*w<sub>1</sub>* and *w<sub>2</sub>*) and a bias input (*b*). Our artificial neuron will process these inputs within an activation function (let's call it *f*) like this:

$$ f((x_{1} w_{1}) + (x_{2} w_{2}) + b) $$

Now, let's assign the following values to our input variables:

* *x<sub>1</sub>* = 3
* *w<sub>1</sub>* = 0.2
* *x<sub>2</sub>* = 1
* *w<sub>2</sub>* = -0.5
* *b* = 5

Our neuron will therefore calculate:

$$ f((3 \times 0.2) + (1 \times -0.5) + 5) $$

Which simplifies to:

$$ f(0.6 - 0.5 + 5) $$

Or:

$$ f(5.1) $$
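The same arithmetic can be checked in a couple of lines of NumPy. This is only a sketch of the weighted-sum step, before any activation function is applied; the variable names are illustrative.

```python
import numpy as np

x = np.array([3.0, 1.0])     # inputs x1, x2
w = np.array([0.2, -0.5])    # weights w1, w2
b = 5.0                      # bias

z = np.dot(x, w) + b         # (3 * 0.2) + (1 * -0.5) + 5
print(round(float(z), 4))    # 5.1
```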

### Activation Functions
We've calculated the weighted sum of our inputs plus the bias, so now we just need to apply our activation function to this. Generally an activation function is used to *squash* the output value to a value within a specific range. We want the function to be *smooth* so that it is differentiable, so it's common to use a *sigmoid* function that compresses the value along an *s-line* to a value between 0 and 1, or a *hyperbolic tangent (tanh)* function that produces a result between -1 and 1. Increasingly, *Rectified Linear Unit (ReLU)* functions that set all negative results to 0 are used as activation functions in deep neural networks.

In this example, let's use a **sigmoid** activation function:

$$ S(5.1) \approx 0.994$$

So the output of the activation function is (approximately) **0.994**.
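To compare the three activation functions mentioned above on the same input, here is a small sketch using only the Python standard library. The helper names `sigmoid` and `relu` are defined here for illustration.

```python
import math

def sigmoid(z):
    # Squashes any real value into the range (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    # Passes positive values through unchanged and sets negative values to 0
    return max(0.0, z)

z = 5.1
print('sigmoid(5.1): %.3f' % sigmoid(z))     # 0.994
print('tanh(5.1):    %.6f' % math.tanh(z))
print('relu(5.1):    %.3f' % relu(z))        # 5.100
print('relu(-2.0):   %.3f' % relu(-2.0))     # 0.000
```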

### Fully Connected Neural Network Layers
Now that you understand how a single neuron works, let's see how these are combined into a neural network. The network consists of multiple neurons organized in layers, like this:

<div align="center" style="font-size:18px;">
<table style="border-width:0px; background-color:white;">
    <tr><td style="border-width:0px; background-color:white;"></td><td style="border-width:0px; background-color:white;font-size:18px;">&#8649;&#9711;&#10536;&#9711;&#11112;</td><td style="border-width:0px; background-color:white;"></td></tr>
    <tr><td style="border-width:0px; background-color:white;"></td><td style="border-width:0px; background-color:white;font-size:18px;">&#8649;&#9711;&#10536;&#9711;&#10536;</td><td style="border-width:0px; background-color:white;font-size:18px;">&#8694;&#9711;&#8594;</td></tr>
    <tr><td style="border-width:0px; background-color:white;font-size:18px;">&rarr;&#9711;&#10536;</td><td style="border-width:0px; background-color:white;font-size:18px;">&#8649;&#9711;&#10536;&#9711;&#10536;</td><td style="border-width:0px; background-color:white;font-size:18px;">&#8694;&#9711;&#8594;</td></tr>
    <tr><td style="border-width:0px; background-color:white;"></td><td style="border-width:0px; background-color:white;font-size:18px;">&#8649;&#9711;&#10536;&#9711;&#10536;</td><td style="border-width:0px; background-color:white;font-size:18px;">&#8694;&#9711;&#8594;</td></tr>
    <tr><td style="border-width:0px; background-color:white;"></td><td style="border-width:0px; background-color:white;font-size:18px;">&#8649;&#9711;&#10536;&#9711;&#11111;</td><td style="border-width:0px; background-color:white;"></td></tr>
    <tr><td>Input Layer</td><td>Hidden Layers</td><td>Output Layer</td></tr>
</table>
</div>

The *input layer* of the network consists of neurons that accept the initial input values of the data observation from which you want the network to generate a prediction - in other words, the initial set of *features*. The output of the input layer is passed to all of the neurons in the next layer, which can consist of as many neurons as you decide to include - in this case, five. The outputs from this layer are passed on to every neuron in the next layer, and so on, until the flow of values ends in an *output layer* that contains one output for each possible class you are trying to predict (in this case, there are three possible classes that can be predicted by the network). The values in the output layer are generated by an activation function such that each value is between 0 and 1 and represents the probability of the observation belonging to the corresponding class - therefore the neuron in the output layer with the largest probability value represents the predicted class.

The layers in between the input and output layers are called *hidden layers*, because you have no visibility of the values being passed between these layers; and the network can consist of as many hidden layers as you decide to include. This kind of neural network, where the outputs from every neuron in each layer are passed as inputs to every neuron in the next layer, is known as a *fully-connected* neural network - or sometimes as a *multi-layer perceptron*.
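The flow of values through fully-connected layers can be sketched with NumPy matrix multiplication. The layer sizes below (four inputs, one hidden layer of five neurons, three outputs), the random untrained weights, and the input values are all assumptions made for illustration; a ReLU activation is assumed for the hidden layer and softmax for the output layer.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the run is repeatable

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max())       # subtract the max for numerical stability
    return e / e.sum()

# Randomly initialized weights and zero biases (untrained, for illustration only)
W1, b1 = rng.normal(size=(5, 4)), np.zeros(5)   # input -> hidden
W2, b2 = rng.normal(size=(3, 5)), np.zeros(3)   # hidden -> output

x = np.array([5.1, 3.5, 1.4, 0.2])   # one observation with four feature values
hidden = relu(W1 @ x + b1)           # every hidden neuron receives every input
probs = softmax(W2 @ hidden + b2)    # one probability per class

print('Class probabilities:', probs)
print('Predicted class:', int(np.argmax(probs)))
```

Each row of a weight matrix holds the weights for one neuron, which is why a single matrix product computes a whole layer at once.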

### Training a Neural Network
Training a deep neural network with multiple layers, inputs, weights, and biases is conceptually the same as the example we saw previously where we trained a single neuron by determining a weight that minimizes loss when an input value with a known output value is processed.

We start by passing the feature values for a set of observations with known labels into the input layer of the network, and initially we use randomly assigned weights and biases to feed the data forward and eventually generate the output layer values. To make this manageable, we generally break the input data into *batches* (often called *mini-batches*). For each batch, we can then use a loss function to calculate a numeric value representing the average loss by comparing the predicted values against the known true values.

After we've processed a batch and determined its loss, we can use differential calculus to find the derivative of the loss with respect to the weights and biases that were used. Put more simply, we can determine what impact adjusting each weight and bias upward or downward will have on the loss. This process is called *backpropagation* (because it works backwards from the overall loss, using the *chain-rule* of calculus to calculate the derivatives for each layer with respect to the weights and biases). With these derivatives, we can determine in which direction to adjust the weights and biases to reduce the loss, using a technique called *gradient descent*.

When the impact of each weight and bias on the loss has been determined, the training process adjusts the weights and bias values so that the loss should decrease, and we repeat the process using the revised values. This cycle of feeding the data forward in batches, calculating the loss, and backpropagating to adjust the weights and biases is repeated (with each repetition of the cycle referred to as an *epoch*) to incrementally improve the accuracy of the model by reducing the overall loss.
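The cycle described above can be sketched in NumPy for a very small network. This is not how a framework implements training; it is a hand-written illustration under several assumptions: synthetic data, one hidden layer of four neurons with a tanh activation, a single sigmoid output, binary cross-entropy loss, and hyperparameter values chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (hypothetical): 200 observations, 2 features, binary label
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float).reshape(-1, 1)

# Randomly initialized weights and zero biases: 2 inputs -> 4 hidden -> 1 output
W1, b1 = rng.normal(scale=0.5, size=(2, 4)), np.zeros(4)
W2, b2 = rng.normal(scale=0.5, size=(4, 1)), np.zeros(1)

lr, batch_size, epochs = 0.5, 20, 20
epoch_losses = []

for epoch in range(epochs):
    batch_losses = []
    for start in range(0, len(X), batch_size):
        xb, yb = X[start:start + batch_size], y[start:start + batch_size]
        n = len(xb)

        # Feed the batch forward through the network
        h = np.tanh(xb @ W1 + b1)
        p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
        p = np.clip(p, 1e-12, 1 - 1e-12)
        loss = -np.mean(yb * np.log(p) + (1 - yb) * np.log(1 - p))
        batch_losses.append(float(loss))

        # Backpropagate: apply the chain rule layer by layer, output first
        dz2 = (p - yb) / n
        dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
        dz1 = (dz2 @ W2.T) * (1 - h ** 2)
        dW1, db1 = xb.T @ dz1, dz1.sum(axis=0)

        # Gradient descent: move each parameter against its gradient
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2

    epoch_losses.append(sum(batch_losses) / len(batch_losses))

print('First epoch loss: %.4f, last epoch loss: %.4f' % (epoch_losses[0], epoch_losses[-1]))
```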

### Validation and Overfitting
One of the biggest challenges in machine learning is the problem of *overfitting*. This happens when the model learns the relationships between the features and labels in the training data by minimizing the loss during training, but doesn't generalize well to new data on which it wasn't trained. To detect overfitting, it's common to withhold some of the training data and use it to validate the model after each epoch. You can then calculate the loss from both the training data and the validation data and compare the two. If the training loss is reducing but the validation loss is stable (or worse, increasing), then the model is overfitting.
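That comparison can be expressed as a simple check over the recorded loss histories. The function below is a hypothetical helper, not part of any library, and the `patience` window of three epochs and the loss values are made up for illustration.

```python
def is_overfitting(training_loss, validation_loss, patience=3):
    """Return True if, over the last `patience` epochs, training loss fell
    while validation loss stayed flat or rose."""
    if len(training_loss) <= patience or len(validation_loss) <= patience:
        return False
    train_falling = training_loss[-1] < training_loss[-1 - patience]
    val_not_falling = validation_loss[-1] >= validation_loss[-1 - patience]
    return train_falling and val_not_falling

# Made-up loss histories for illustration
healthy = is_overfitting([0.9, 0.7, 0.5, 0.4, 0.3], [1.0, 0.8, 0.6, 0.5, 0.4])
overfit = is_overfitting([0.9, 0.7, 0.5, 0.4, 0.3], [1.0, 0.8, 0.8, 0.85, 0.9])
print(healthy, overfit)   # False True
```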

## Creating a Neural Network with PyTorch
PyTorch is a framework for creating machine learning models, including deep neural networks (DNNs). In this example, we'll use PyTorch to create a simple neural network that classifies iris flowers into species based on measurements of their petals and sepals.

> The iris classification model is a very common machine learning example, and the iris dataset is often the basis for "hello world" sample code for a wide range of machine learning frameworks. In reality, you can solve this problem easily using classical machine learning techniques without the need for a deep learning model; but it's a useful, easy to understand dataset with which to demonstrate the principles of neural networks in this notebook.

### Exploring the Iris Dataset
Before we start using PyTorch to create a model, let's examine the iris dataset. Since this is a commonly used sample dataset, it is built into the *scikit-learn* machine learning library, so we'll get it from there. As with any supervised learning problem, we'll then split the dataset into a set of records with which to train the model, and a smaller set with which to validate the trained model.

In [None]:
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

   
# Split data 60%-40% into training set and test set
x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.40, random_state=0)

print ('Training Set: %d, Test Set: %d \n' % (len(x_train), len(x_test)))
print("Sample of features and labels:")
print('(features: ',iris.feature_names, ')')

# Take a look at the first 25 training features and corresponding labels
for n in range(25):
    print(x_train[n], y_train[n], '(' + iris.target_names[y_train[n]] + ')')

The *features* are the measurements for each iris observation, and the *label* is a numeric value that indicates the species of iris that the observation represents (0 for setosa, 1 for versicolor, or 2 for virginica).

### Importing the PyTorch Libraries
Since we plan to use PyTorch to create our iris classifier, we'll need to install and import the PyTorch libraries we intend to use. The specific installation of PyTorch depends on your operating system and whether your computer has graphics processing units (GPUs) that can be used for high-performance processing via *CUDA*. You can find detailed instructions at https://pytorch.org/get-started/locally/.

In [None]:
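# This wheel is the CPU-only build of PyTorch 1.0.1 for Python 3.6 on 64-bit Linux.
# On any other platform or Python version, use the install command from the link above instead.
!pip install https://download.pytorch.org/whl/cpu/torch-1.0.1-cp36-cp36m-linux_x86_64.whl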
    
import torch
import torch.nn as nn
import torch.utils.data as utils
import torch.utils.data as td
from torch.autograd import Variable

print("Libraries imported - ready to use PyTorch", torch.__version__)

### Prepare the Data for PyTorch
PyTorch makes use of *data loaders* to load training and validation data in batches. We've already loaded the data into NumPy arrays, but we need to wrap those in PyTorch datasets (in which the data is converted to PyTorch *tensor* objects) and create loaders to read batches from those datasets.

In [None]:
# Create a dataset and loader for the training data and labels
train_x = Variable(torch.Tensor(x_train).float())
train_y = Variable(torch.Tensor(y_train).long())
train_ds = utils.TensorDataset(train_x,train_y)
train_loader = td.DataLoader(train_ds, batch_size=10,
    shuffle=False, num_workers=1)

# Create a dataset and loader for the test data and labels
test_x = Variable(torch.Tensor(x_test).float())
test_y = Variable(torch.Tensor(y_test).long())
test_ds = utils.TensorDataset(test_x,test_y)
test_loader = td.DataLoader(test_ds, batch_size=10,
    shuffle=False, num_workers=1)
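Conceptually, a loader with `shuffle=False` hands out consecutive slices of the dataset. The sketch below mimics that behavior in plain NumPy to show what a batch looks like; it is not PyTorch's implementation, and `iterate_batches` and the placeholder arrays are hypothetical names used only here.

```python
import numpy as np

def iterate_batches(features, labels, batch_size):
    """Yield successive (features, labels) slices of at most batch_size rows."""
    for start in range(0, len(features), batch_size):
        yield features[start:start + batch_size], labels[start:start + batch_size]

# Placeholder data shaped like a 90-row training set with four features
features = np.zeros((90, 4))
labels = np.zeros(90, dtype=int)

batches = list(iterate_batches(features, labels, batch_size=10))
print(len(batches), batches[0][0].shape, batches[0][1].shape)   # 9 (10, 4) (10,)
```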

### Define a Neural Network
Now we're ready to define our neural network. In this case, we'll create a network that consists of 3 fully-connected layers:
* An input layer that receives four input values (the iris features) and applies a *ReLU* activation function.
* A hidden layer that receives ten inputs and applies a *ReLU* activation function.
* An output layer that uses a *SoftMax* activation function to generate three outputs (which represent the probabilities for the three iris species)

In [None]:
# Number of hidden layer nodes
hl = 10

# Define the neural network
class IrisNet(nn.Module):
    def __init__(self):
        super(IrisNet, self).__init__()
        self.fc1 = nn.Linear(4, hl)
        self.fc2 = nn.Linear(hl, hl)
        self.fc3 = nn.Linear(hl, 3)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = torch.softmax(self.fc3(x),dim=1)
        return x

# Create a model instance from the network
model = IrisNet()
print(model)

### Train the Model
To train the model, we need to repeatedly feed the training values forward through the network, use a loss function to calculate the loss, use an optimizer to backpropagate the weight and bias value adjustments, and validate the model using the test data we withheld.

To do this, we'll create a function to train and optimize the model, and a function to test the model. Then we'll call these functions iteratively over 100 epochs, logging the loss and accuracy statistics for each epoch.

In [None]:
def train(model, data_loader, optimizer):
    # Set the model to training mode
    model.train()
    train_loss = 0
    
    for batch, tensor in enumerate(data_loader):
        data, target = tensor
        #feedforward
        optimizer.zero_grad()
        out = model(data)
        loss = loss_criteria(out, target)
        train_loss += loss.item()

        # backpropagate
        loss.backward()
        optimizer.step()

    # Return the average loss per batch (each loss.item() is already a per-batch mean)
    avg_loss = train_loss / len(data_loader)
    return avg_loss


def test(model, data_loader):
    # Switch the model to evaluation mode (gradient tracking is disabled by torch.no_grad below)
    model.eval()
    test_loss = 0
    correct = 0

    with torch.no_grad():
        for batch, tensor in enumerate(data_loader):
            data, target = tensor
            # Get the predictions
            out = model(data)

            # Calculate the loss
            test_loss += loss_criteria(out, target).item()

            # Calculate the accuracy
            _, predicted = torch.max(out.data, 1)
            correct += torch.sum(target==predicted).item()

    # Return the average validation loss per batch and the prediction accuracy for the epoch
    avg_accuracy = correct / len(data_loader.dataset)
    avg_loss = test_loss / len(data_loader)
    return avg_loss, avg_accuracy
       


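# Specify the loss criteria (CrossEntropyLoss for multi-class classification)
# Note: CrossEntropyLoss applies log-softmax to its input internally and is normally given raw layer outputs.
# The model here already applies softmax, so training still works but the reported loss stays well above zero.
loss_criteria = nn.CrossEntropyLoss()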

# Specify the optimizer (we'll use a Stochastic Gradient Descent optimizer)
learning_rate = 0.01
learning_momentum = 0.9
optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate, momentum=learning_momentum)

# We'll track metrics for each epoch in these arrays
epoch_nums = []
training_loss = []
validation_loss = []

# Train over 100 epochs
epochs = 100
for epoch in range(1, epochs + 1):
    
    # Feed the training data into the model to optimize the weights
    train_loss = train(model, train_loader, optimizer)
    
    # Feed the test data into the model to check its performance
    test_loss, accuracy = test(model, test_loader)
    
    # Log the metrics for this epoch
    epoch_nums.append(epoch)
    training_loss.append(train_loss)
    validation_loss.append(test_loss)
    
    # Print stats for every 10th epoch so we can see training progress
    if (epoch) % 10 == 0:
        print('Epoch {:d}: Training loss= {:.4f}, Validation loss= {:.4f}, Accuracy={:.4%}'.format(epoch, train_loss, test_loss, accuracy))
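The optimizer above is configured with a learning rate and a momentum value. To show what momentum contributes, here is a minimal sketch of the update rule on a one-parameter problem, assuming the common formulation in which a running velocity accumulates past gradients (velocity = momentum × velocity + gradient, then weight = weight − learning rate × velocity). The function `sgd_momentum` and the quadratic loss are hypothetical, chosen only for illustration.

```python
def sgd_momentum(grad_fn, w, lr=0.01, momentum=0.9, steps=200):
    """Minimize a one-parameter function given its gradient, using SGD with momentum."""
    velocity = 0.0
    for _ in range(steps):
        # Blend the previous direction of travel with the current gradient
        velocity = momentum * velocity + grad_fn(w)
        w = w - lr * velocity
    return w

# Hypothetical loss f(w) = (w - 3)^2, whose gradient is 2 * (w - 3)
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
print('Final weight: %.4f' % w_final)
```

With the same learning rate and no momentum, the weight would approach the minimum at 3 more slowly, because each step would depend on the current gradient alone.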


### Review Training and Validation Loss
After training is complete, we can examine the loss metrics we recorded while training and validating the model. We're really looking for two things:
* The loss should reduce with each epoch, showing that the model is learning the right weights and biases to predict the correct labels.
* The training loss and validation loss should follow a similar trend, showing that the model is not overfitting to the training data.

Let's plot the loss metrics and see:

In [None]:
%matplotlib inline
from matplotlib import pyplot as plt

plt.plot(epoch_nums, training_loss)
plt.plot(epoch_nums, validation_loss)
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend(['training', 'validation'], loc='upper right')
plt.show()

### View the Learned Weights and Biases
The trained model consists of the final weights and biases that were determined by the optimizer during training. Based on our network model we should expect the following values for each layer:
* Layer 1: There are four input values going to ten output nodes, so there should be 10 x 4 weights and 10 bias values.
* Layer 2: There are ten input values going to ten output nodes, so there should be 10 x 10 weights and 10 bias values.
* Layer 3: There are ten input values going to three output nodes, so there should be 3 x 10 weights and 3 bias values.

In [None]:
for param_tensor in model.state_dict():
    print(param_tensor, "\n", model.state_dict()[param_tensor].numpy())
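The expected sizes listed above can be tallied with a few lines of plain Python. This sketch assumes only the layer dimensions stated in the list; it does not inspect the trained model.

```python
layer_sizes = [(4, 10), (10, 10), (10, 3)]   # (inputs, outputs) for each layer

total = 0
for i, (n_in, n_out) in enumerate(layer_sizes, start=1):
    weights, biases = n_out * n_in, n_out
    total += weights + biases
    print('Layer %d: %d x %d weights + %d biases = %d parameters'
          % (i, n_out, n_in, biases, weights + biases))

print('Total trainable parameters:', total)   # 193
```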

### Evaluate Model Performance
So, is the model any good? The raw accuracy reported from the validation data would seem to indicate that it predicts pretty well; but it's typically useful to dig a little deeper and compare the predictions for each possible class. A common way to visualize the performance of a classification model is to create a *confusion matrix* that shows a crosstab of correct and incorrect predictions for each class.

In [None]:
# PyTorch doesn't have a built-in confusion matrix metric, so we'll use scikit-learn
from sklearn.metrics import confusion_matrix

# Set the model to evaluate mode
model.eval()

# Get predictions for the test data
x = Variable(torch.Tensor(x_test).float())
_, predicted = torch.max(model(x).data, 1)

# Plot the confusion matrix
cm = confusion_matrix(y_test, predicted.numpy())
plt.imshow(cm, interpolation="nearest", cmap=plt.cm.Blues)
plt.colorbar()
tick_marks = np.arange(len(iris.target_names))
plt.xticks(tick_marks, iris.target_names, rotation=45)
plt.yticks(tick_marks, iris.target_names)
plt.xlabel("Predicted Species")
plt.ylabel("True Species")
plt.show()

The confusion matrix should show a strong diagonal line indicating that there are more correct than incorrect predictions for each class.
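A confusion matrix can also be summarized numerically. The sketch below derives overall accuracy and per-class recall and precision from a matrix laid out the way scikit-learn returns it (rows are true classes, columns are predicted classes). The counts in `example_cm` are made up for illustration and are not results from the model above.

```python
import numpy as np

# Made-up confusion matrix: rows are true classes, columns are predicted classes
example_cm = np.array([[16, 0, 0],
                       [0, 20, 3],
                       [0, 1, 20]])

correct = np.diag(example_cm)                  # correct predictions per class
accuracy = correct.sum() / example_cm.sum()
recall = correct / example_cm.sum(axis=1)      # share of each true class that was found
precision = correct / example_cm.sum(axis=0)   # share of each predicted class that was right

print('Accuracy: %.4f' % accuracy)
print('Recall per class:   ', np.round(recall, 3))
print('Precision per class:', np.round(precision, 3))
```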

### Using the Model with New Data
Now that we have a model we believe is reasonably accurate, we can use it to predict the species of new iris observations:

In [None]:
x_new = [[6.6,3.2,5.8,2.4]]
print ('New sample: {}'.format(x_new))

model.eval()

# Get a prediction for the new data sample
x = Variable(torch.Tensor(x_new).float())
_, predicted = torch.max(model(x).data, 1)

print('Prediction:',iris.target_names[predicted.item()])

## Learn More
This notebook was designed to help you understand the basic concepts and principles involved in deep neural networks, using a simple PyTorch example. To learn more about PyTorch, take a look at the <a href="https://pytorch.org/tutorials/" target="_blank">tutorials on the PyTorch web site</a>.