## Supervised Learning Part II - PyTorch

The library we've been using in our examples so far, **scikit-learn**, offers many "canned" Machine Learning models through an interface where different algorithms can be manipulated as simple objects. It is a great way to get started with Machine Learning and even to solve real-life problems using simple to moderately complex models with moderately large amounts of data.

However, to tackle more data-intensive tasks you need more powerful tools and **scikit-learn's** canned models won't be enough.

This is where **PyTorch** comes in. It has many performance advantages over **scikit-learn**, including but not limited to:

1. It supports GPU acceleration whereas sklearn only supports CPU.

2. It explicitly requires you to define and control, as separate objects, the class of functions you'll use to approximate the Target (we'll call these Models from now on), the loss functions and the optimization solvers. Some may argue this is an inconvenience, but it gives you more flexibility to pick the right combination of tools for each type of problem.

3. Its interface for defining Models is more flexible and expressive than sklearn's. This enables you to easily create very complex models. 

4. It has user-friendly frameworks for parallel and distributed model training.

5. It offers out-of-the-box class templates that make it easy to deal with massive datasets.

**PyTorch** can be seen as a general-purpose Machine Learning, or even Scientific Computing, framework. However, here we focus on the most common usage of the library: building Neural Networks.

# Neural Networks Primer:

If you look beyond the cool name, you will see that Neural Networks are just another "family" of mathematical functions $f(X;\theta)$. But this family of functions is so useful that it sometimes seems almost magical. To add to their allure, Neural Networks can be difficult to write down mathematically, depending on how complex their "architecture" is. It is easier to visualize how they work by drawing them as a graph:

![](./images/nnet.png)

Where each neuron represents a linear combination of its inputs with a non-linear function applied to it:

![](./images/neuron.png)

The **weights** $W$ of each neuron in each layer are the parameters $\theta$ of this class of function.
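
To make the picture of a single neuron concrete, here is a minimal sketch in PyTorch. The input values, the weights, the bias term `b` and the helper name `neuron` are all made up for illustration (the bias is an assumption; the figure may or may not include one), and the sigmoid used here is one of the activation functions described just below.

```python
import torch

def neuron(x, w, b, activation):
    """One neuron: a linear combination of the inputs, then a non-linearity."""
    return activation(torch.dot(w, x) + b)

x = torch.tensor([1.0, 2.0, 3.0])     # inputs (illustrative values)
w = torch.tensor([0.5, -1.0, 0.25])   # weights of this neuron (illustrative values)
b = torch.tensor(0.1)                 # bias term (an assumption, see above)

# 0.5*1 - 1.0*2 + 0.25*3 + 0.1 = -0.65, which is then squashed by the sigmoid
out = neuron(x, w, b, torch.sigmoid)
print(out)
```

A full layer is just many such neurons sharing the same inputs, which is why it can be written as a single matrix multiplication.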

Common choices of non-linear **activation** functions for neurons are the *Sigmoid (or Logistic)* function:

$f(x) = \frac{1}{1+\exp(-x)}$

And the Rectified Linear Unit (a.k.a. ReLU):

$f(x) = \max{(0,x)}$
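
As a quick sanity check, the two formulas above can be written out by hand and compared with PyTorch's built-in `torch.sigmoid` and `torch.relu`. The input values are arbitrary and chosen only for illustration.

```python
import torch

x = torch.tensor([-2.0, 0.0, 2.0])  # arbitrary sample inputs

# Sigmoid (Logistic) written directly from its formula
sigmoid_manual = 1 / (1 + torch.exp(-x))

# ReLU written directly from its formula: max(0, x), element by element
relu_manual = torch.clamp(x, min=0)

# Both should agree with the built-in versions
print(torch.allclose(sigmoid_manual, torch.sigmoid(x)))
print(torch.equal(relu_manual, torch.relu(x)))
```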

The type of neural network represented above is known as a **Feed Forward Neural Network** and it is the most "basic" member of this family of functions. More complex Neural Networks include ones where neurons in one layer do not necessarily connect to all neurons in the next layer; where other mathematical operations, like *convolutions*, take place inside a neuron; and where outputs of one layer can become inputs of preceding layers.

As we've seen, Machine Learning algorithms generally involve solving the optimization problem

$\hat{F(X)} = \underset{\theta}{\operatorname{argmin}} L(Y,f(X;\theta))$

This generally requires finding critical points of a function, that is, points where the derivative of the function with respect to its parameters $\theta$ is zero.

Whether you attempt to do it analytically or numerically... How do you compute the derivatives of $L$ with respect to $\theta$ when there's a crazy Neural Network nested in it?

# Backpropagation

For reasons we will not delve into in this workshop, an efficient way of solving the optimization problem above is by combining two numerical algorithms: **Backpropagation** and variants of **Gradient Descent**.

Here is a summary of the method with vanilla Gradient Descent:

1. Randomly initialize all Weights $W$ of the neural network.

2. Forward Propagation: Run your inputs X through the neural network and get an output $\hat{Y}$.

3. Take the output of the Neural Network and compute the Loss $L(\hat{Y}, Y)$.

4. Backpropagation: Run the computed loss value backwards through all layers of the network, but this time each layer will represent the derivatives of the loss function with respect to the **Weights** in that layer. At the end of this process, you will obtain an estimate of $\nabla_W L$, the gradient of $L$ with respect to all **Weights** of the neural network.
    
5. Gradient Descent: Update the **Weights** $W$ using the gradient computed in step 4: $W = W - \alpha\nabla_W L$, where $\alpha$ is a (usually small) constant called the **learning rate**.

6. Repeat steps 2 through 5 until a stopping criterion is reached. Common criteria include:

    a. The Loss reaches zero, or a value smaller than a pre-defined threshold.
    
    b. Steps 2 through 5 have been repeated a (usually large) pre-defined number of times.
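
The method above can be sketched by hand on a tiny problem. This is only an illustration under assumed settings: the toy data (points on the line $y = 2x + 1$), the one-weight linear model, the learning rate of 0.1 and the fixed 200 iterations are all made up here. PyTorch's autograd does the backpropagation when `loss.backward()` is called.

```python
import torch

torch.manual_seed(0)

# Toy data, made up for illustration: points on the line y = 2x + 1
X = torch.linspace(-1, 1, 20).unsqueeze(1)
Y = 2 * X + 1

# Step 1: randomly initialize the weights (done once, before the loop)
W = torch.randn(1, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

alpha = 0.1   # learning rate (assumed value)
losses = []

for step in range(200):                    # stopping criterion: fixed number of repetitions
    Y_hat = X @ W + b                      # Step 2: forward propagation
    loss = torch.mean((Y_hat - Y) ** 2)    # Step 3: compute the loss
    loss.backward()                        # Step 4: backpropagation fills W.grad and b.grad
    with torch.no_grad():                  # Step 5: W = W - alpha * gradient
        W -= alpha * W.grad
        b -= alpha * b.grad
    W.grad.zero_()                         # clear gradients so they don't accumulate
    b.grad.zero_()
    losses.append(loss.item())

print("First loss:", losses[0], "Last loss:", losses[-1])
print("Learned W and b:", W.item(), b.item())
```

In practice you rarely write the update by hand; an optimizer object such as `torch.optim.SGD` performs step 5 and the gradient reset for you.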
    
Variations of this method include, but are not limited to:

* Running all examples X at once through the network, resulting in large matrix multiplications being performed; 

* Running smaller, *randomly selected batches* of examples X through the network instead; 

* Using an adaptive learning rate;

* Using other types of weight update (also called "a step");

* Randomly selecting neurons to be *dropped out* from the computations at each iteration.
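
Several of these variations can be combined in a few lines of PyTorch. The sketch below is illustrative only: the random data, the layer sizes, the batch size of 16 and the dropout probability are assumptions, and `torch.optim.Adam` is used as one example of an optimizer with adaptive step sizes, not as the only option.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Made-up data: 64 examples with 4 features each, 3 possible classes
X = torch.randn(64, 4)
Y = torch.randint(0, 3, (64,))

# Randomly selected batches of 16 examples instead of the whole dataset at once
loader = DataLoader(TensorDataset(X, Y), batch_size=16, shuffle=True)

model = torch.nn.Sequential(
    torch.nn.Linear(4, 8),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),   # randomly drops neurons while training
    torch.nn.Linear(8, 3),
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)  # adaptive weight updates
loss_function = torch.nn.CrossEntropyLoss()

model.train()   # training mode: dropout is active
n_batches = 0
for X_batch, Y_batch in loader:   # one pass over the data = one epoch
    optimizer.zero_grad()
    loss = loss_function(model(X_batch), Y_batch)
    loss.backward()
    optimizer.step()
    n_batches += 1

print("Batches processed in one epoch:", n_batches)
```

Call `model.eval()` before making predictions so that dropout is switched off.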

# Loss Function

You know the Loss function is a measure of the error we incur when using a function $f(x;\theta)$ to approximate a target $F^*$, but we haven't seen what it looks like. The choice of an appropriate Loss function will depend first on the type of task at hand, then on statistical properties of the data and, to a smaller degree, on the choice of algorithm. Below are some common choices of Loss function.

### Task: Regression

**Mean Squared Error (MSE) Loss:**

$L(Y,\hat{Y}) = \frac{1}{N}\sum{(Y-\hat{Y})^2}$

**Mean Absolute Error (MAE) Loss:**

$L(Y,\hat{Y}) =\frac{1}{N}\sum{|Y-\hat{Y}|}$
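
Both regression losses are short enough to compute by hand. The sketch below, using made-up target and prediction values, compares the formulas above with PyTorch's `torch.nn.MSELoss` and `torch.nn.L1Loss` (PyTorch's name for the MAE).

```python
import torch

# Made-up targets and predictions, for illustration only
Y = torch.tensor([3.0, -0.5, 2.0, 7.0])
Y_hat = torch.tensor([2.5, 0.0, 2.0, 8.0])

# The formulas written out directly
mse_manual = torch.mean((Y - Y_hat) ** 2)
mae_manual = torch.mean(torch.abs(Y - Y_hat))

# The built-in loss objects; by default they also average over the N examples
mse_torch = torch.nn.MSELoss()(Y_hat, Y)
mae_torch = torch.nn.L1Loss()(Y_hat, Y)

print("MSE:", mse_manual.item(), mse_torch.item())
print("MAE:", mae_manual.item(), mae_torch.item())
```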

### Task: Classification

**Cross Entropy Loss:**

$L(Y,\hat{Y}) = -\sum{Y_{class} \cdot \log(\hat{Y}_{class})}$

This Loss is commonly used in Classification problems where there are more than two classes of outputs.
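
One detail worth seeing in code: PyTorch's `torch.nn.CrossEntropyLoss` expects the *raw* outputs of the model (often called logits) and integer class labels, and applies the softmax internally. The sketch below, with made-up numbers, reproduces it by hand from the formula above, averaging over the examples as PyTorch does by default.

```python
import torch

# Made-up raw model outputs for 2 examples and 3 classes
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
Y = torch.tensor([0, 2])   # true class of each example

# Turn raw outputs into probabilities, and labels into one-hot vectors
probs = torch.softmax(logits, dim=1)
Y_onehot = torch.nn.functional.one_hot(Y, num_classes=3).float()

# The formula: -sum over classes of Y_class * log(Y_hat_class), averaged over examples
ce_manual = -(Y_onehot * torch.log(probs)).sum(dim=1).mean()

# The built-in version takes the logits and integer labels directly
ce_torch = torch.nn.CrossEntropyLoss()(logits, Y)

print(ce_manual.item(), ce_torch.item())
```

This is why a model trained with this loss can return the output of its last linear layer without applying a softmax itself.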

**The Binary Cross Entropy Loss:**

$L(Y,\hat{Y}) = -(Y \cdot \log(\hat{Y}) + (1-Y) \cdot \log(1-\hat{Y}))$

You may recognize this as the general Cross Entropy above with only two output classes, or the negative log-likelihood of a Bernoulli Distribution.
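
The binary version can be checked the same way. In this sketch the labels and predicted probabilities are made up, and the result is compared with `torch.nn.BCELoss`, which expects probabilities between 0 and 1 and averages over the examples by default.

```python
import torch

# Made-up binary labels and predicted probabilities of the positive class
Y = torch.tensor([1.0, 0.0, 1.0])
Y_hat = torch.tensor([0.9, 0.2, 0.6])

# The formula written out directly, averaged over the examples
bce_manual = -(Y * torch.log(Y_hat) + (1 - Y) * torch.log(1 - Y_hat)).mean()

# The built-in version
bce_torch = torch.nn.BCELoss()(Y_hat, Y)

print(bce_manual.item(), bce_torch.item())
```

If your model outputs raw values rather than probabilities, `torch.nn.BCEWithLogitsLoss` applies the sigmoid for you before computing the same loss.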

____

## Neural Networks With PyTorch Example 1 - Iris Dataset Revisited
    
To see all this in action, let's go back to the Iris Dataset and train a logistic regression model with **PyTorch**, this time representing it as a neural network:

In [None]:
from pandas import read_csv
from sklearn.model_selection import train_test_split 
from sklearn.metrics import accuracy_score
import numpy as np
import torch
from torch.autograd import Variable



# LET'S CREATE OUR TRAINING AND TEST SETS

header = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'species']

iris_dataset = read_csv('./data/iris.csv',names = header) 

# ENCODE SPECIES AS CATEGORY NUMBERS 
iris_dataset.loc[iris_dataset.species=='Iris-setosa', 'species'] = 0
iris_dataset.loc[iris_dataset.species=='Iris-versicolor', 'species'] = 1
iris_dataset.loc[iris_dataset.species=='Iris-virginica', 'species'] = 2

X = iris_dataset.values[:,0:4].astype('float32')
Y = iris_dataset.values[:,4].astype('int32')

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.20)

# CONVERT DATASETS TO PYTORCH TENSORS
# (SINCE PYTORCH 0.4, TENSORS NO LONGER NEED TO BE WRAPPED IN Variable)
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
Y_train = torch.tensor(Y_train, dtype=torch.long)
Y_test = torch.tensor(Y_test, dtype=torch.long)

In [None]:
# DEFINE OUR LOGISTIC REGRESSION MODEL AS A NEURAL NETWORK, INITIALIZE AN OPTIMIZER AND PICK A LOSS FUNCTION


# THERE IS A WAY TO CALL MODELS AS FUNCTIONS LIKE WE DID WITH SKLEARN, BUT CLASSES ARE PREFERABLE
class LogisticRegression(torch.nn.Module):

    def __init__(self):
        super(LogisticRegression, self).__init__()
        self.fc1 = torch.nn.Linear(4, 3)
    def forward(self, X):
        X = self.fc1(X)

        return X  

model = LogisticRegression()

learning_rate = 0.01

optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)  
loss_function = torch.nn.CrossEntropyLoss()  

print(model)


In [None]:
# NOW WE TRAIN THE MODEL

epochs = 1000

for i in range(epochs):
    
    optimizer.zero_grad()
    
    # FORWARD-PROPAGATION     
    Y_hat = model(X_train)

    loss = loss_function(Y_hat, Y_train)
    print("Loss at step", i, "is:", loss.item())
    
    # BACKPROPAGATION
    loss.backward()

    # UPDATE WEIGHTS
    optimizer.step()
    
# NOTICE HOW THE ERROR DECREASES WITH EACH STEP:

In [None]:
# HOW DID IT DO ON THE TEST SET?

Y_hat_test = model(X_test)
Y_predicted = torch.max(Y_hat_test, 1).indices

print("This model got", accuracy_score(Y_test, Y_predicted)*100, "% right")


In [None]:
# NOW WHAT HAPPENS IF WE USE A MORE COMPLEX NEURAL NET INSTEAD OF LOGISTIC REGRESSION?

class NeuralNet(torch.nn.Module):

    def __init__(self):
        super(NeuralNet, self).__init__()
        self.fc1 = torch.nn.Linear(4, 100)
        self.fc2 = torch.nn.Linear(100, 100) # NOTICE THE SIZE OF THE OUTPUT MATCHES THE SIZE OF THE INPUT OF THE NEXT LAYER
        self.fc3 = torch.nn.Linear(100, 3)
        
    def forward(self, X):
        X = torch.sigmoid(self.fc1(X)) # NOTICE THE SIGMOID ACTIVATION APPLIED TO EACH HIDDEN LAYER
        X = torch.sigmoid(self.fc2(X))
        X = self.fc3(X)
        
        return X

nnet_model = NeuralNet()

nnet_optimizer = torch.optim.SGD(nnet_model.parameters(), lr=learning_rate, momentum = 0.9) # ADDING A MOMENTUM TERM TO THE WEIGHT UPDATES  
nnet_loss_function = torch.nn.CrossEntropyLoss()  

print(nnet_model)


In [None]:
for i in range(epochs):
    
    nnet_optimizer.zero_grad()
    
    # FORWARD-PROPAGATION     
    Y_hat = nnet_model(X_train)

    loss = nnet_loss_function(Y_hat, Y_train)
    
    # BACKPROPAGATION
    loss.backward()
    print("Loss at step", i, "is:", loss.item())
    # UPDATE WEIGHTS
    nnet_optimizer.step()
    
Y_hat_test = nnet_model(X_test)
Y_predicted = torch.max(Y_hat_test, 1).indices

print("This model got", accuracy_score(Y_test, Y_predicted)*100, "% right")

## Exercise 3 - Wine Classification with PyTorch

Now you try. Let's go back to the Wine dataset and train a model with PyTorch.

In [None]:
headers = ['wine_type','alcohol', 'malic_acid','ash','alcalinity_of_ash','magnesium',
           'total_phenols','flavanoids','nonflavanoid_phenols','proanthocyanins','color_intensity','hue','OD280_OD315','proline']

wine_dataset = read_csv(####)

# ENCODE WINE TYPES AS CATEGORY NUMBERS
wine_dataset.loc[wine_dataset.wine_type=='wine_1', 'wine_type'] = ###
wine_dataset.loc[wine_dataset.wine_type=='wine_2', 'wine_type'] = ###
wine_dataset.loc[wine_dataset.wine_type=='wine_3', 'wine_type'] = ###

X = wine_dataset.values[###].astype('float32')
Y = wine_dataset.values[###].astype('int32')

X_train, X_test, Y_train, Y_test = ####

# CONVERT DATASETS TO PYTORCH TENSORS
# (SINCE PYTORCH 0.4, TENSORS NO LONGER NEED TO BE WRAPPED IN Variable)
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
Y_train = torch.tensor(Y_train, dtype=torch.long)
Y_test = torch.tensor(Y_test, dtype=torch.long)