# Deep Learning

Deep learning is probably the most exciting and potential-filled aspect of machine learning. It has been around for quite a while, but only recently has it truly taken off, because until recently computers could not handle the computational complexity it requires. I will likely do a separate series of these notebooks focusing just on deep learning, because there is a lot to be said and it really is the future of machine learning. The idea of deep learning, at a high level, is to mimic the human brain in computers. The human brain is one of the most intricate and effective learning tools in the known universe. Through millions of years of evolution, natural selection has created an incredible learning machine. Being able to mimic the human brain in a computer would revolutionize a computer's ability to learn. At a very high level, a deep learning algorithm consists of an input layer, hidden layers, and an output layer. The input layer is where we input our data, the hidden layers are where all of the learning occurs, and the output layer is where our result appears. This is modeled in the picture below:

![deep_learning](static/deep_learning.jpg)

This notebook will be a crash-course overview of deep learning focusing on two very important kinds of deep learning algorithms: the artificial neural network and the convolutional neural network.

# The Artificial Neural Network

The artificial neural network is the most common kind of deep learning model. Before we can talk about the neural network, we need to talk about the *neuron*. The neuron is the building block of the human brain and will also be the building block of our neural network. A picture of a single neuron is shown below.
![neuron](static/neuron.png)

Neurons in and of themselves are pretty much useless. It is when neurons are connected that they become so powerful. The dendrites of a neuron connect to the axons of other neurons, allowing neurons to form powerful networks through which signals can be sent with ease. When modeling this in our computer, we will simplify the physiology of the neuron. As shown in the diagram of the neural network, neurons will simply have inputs that come from other neurons and an output that connects to other neurons (except in the case of the first and last layers of neurons). Neurons will send the signal on (in neuroscience, this is referred to as a neuron "firing") based on an *activation function*. We will discuss activation functions in much more detail, but basically they are the functions that determine whether or not a neuron will fire.

Now that we have a basic idea of the neuron, we have to talk about how the neurons are connected together. We will call the lines that connect neurons *synapses*. Each synapse has a weight associated with it that determines how strong or important the signal being passed is. The weights of the synapses are where the learning actually occurs. As the network learns, the weights are adjusted such that important signals are given higher weights and less important signals get lower weights. Basically, in each neuron, we take the sum of all incoming signals multiplied by their weights, and that is the value the neuron works with. That value is passed into the activation function, which decides if the signal is strong enough to cause the neuron to fire. Put more formally, if we have *n* different neurons sending a signal to our neuron, where $x_i$ denotes the output of the *i*-th neuron and $w_i$ denotes the weight of the synapse that output travels along before arriving at our neuron, then the value passed into our activation function is $\sum_{i=1}^n x_iw_i$.
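To make the weighted sum concrete, here is a minimal sketch using NumPy. The three input values and weights are made-up numbers chosen purely for illustration.

```python
import numpy as np

# Hypothetical outputs of three upstream neurons (x_i) and the
# weights (w_i) on the synapses that carry them to our neuron.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.8, 0.1, -0.4])

# The weighted sum of incoming signals, i.e. sum_i x_i * w_i.
# This is the value that gets handed to the activation function.
z = np.dot(x, w)
print(z)
```

The dot product is just a compact way of writing the sum of the element-wise products.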

Now that we have the value being passed into our neuron, we have to decide what to do with it. As mentioned before, this is where the *activation function* comes into play. Four different activation functions are shown below: 
![threshold](static/threshold.jpg)
![sigmoid](static/sigmoid.jpg)
![rectifier](static/rectifier.jpg)
![tanh](static/tanh.jpg)

Of these, the rectifier function is one of the most important, but which activation function you choose depends on your data and on what the output should represent. For example, if the variable we are predicting is binary, then we would choose either the threshold function or a modified sigmoid function for the output. A very common combination is the rectifier function for the hidden layers and the sigmoid function for the output layer, so that our final output is a probability.
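The four activation functions can be sketched in a few lines of NumPy. This is only an illustration; in particular, having the threshold function fire at exactly 0 is an assumed convention, since the text does not say where the cutoff lies.

```python
import numpy as np

def threshold(z):
    """Step function: 1 if the input is non-negative, else 0 (assumed cutoff at 0)."""
    return np.where(z >= 0, 1.0, 0.0)

def sigmoid(z):
    """Squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def rectifier(z):
    """Passes positive values through unchanged and clips negatives to 0."""
    return np.maximum(0.0, z)

def tanh(z):
    """Squashes any real number into the interval (-1, 1)."""
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
for f in (threshold, sigmoid, rectifier, tanh):
    print(f.__name__, f(z))
```

Because the sigmoid output always lies strictly between 0 and 1, it can be read as a probability, which is why it is a popular choice for the output layer.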

So we already mentioned that the neural network learns by modifying the weights of the synapses, but we now need to discuss how the weights are modified. We start the same way that we would with many of the machine learning models we have seen so far. Let's assume that we have some independent variables and we want to use them to predict some dependent variable. We first take the independent variables and feed them through our network. Note that we start with all weights set to some starting value (let's say 1 for now; in practice the weights are initialized randomly, as described in the recap further down). Our network will then output a value $\hat{y}$, which we compare to the actual value $y$. We use a *cost function* to quantify the difference. The cost function for a single observation might be $C = \frac{1}{2}(\hat{y}-y)^2$. This is just one cost function; we could use others. Our goal is to minimize the cost function, so we take the value output by the cost function and feed it back into our network. Using this information, the weights are updated. We can then run our experiment again on the same data. The value of our cost function should now be lower, but it is likely not as low as it can be. We get a new value for the cost function and use it to tweak our weights again. We keep doing this until the weights converge and our cost function is minimized. We just discussed how to do this with a single observation, but extending it to multiple observations is easy. Run the network on all observations and compute a total cost as $C = \sum_{i} \frac{1}{2}(\hat{y}_i-y_i)^2$. Re-run on all observations until the cost function converges.
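As a small worked example of the total cost, the sketch below evaluates the squared-error cost on three observations. The predictions and actual values are invented for illustration.

```python
import numpy as np

def cost(y_hat, y):
    """Total cost: the sum over observations of 0.5 * (y_hat - y)^2."""
    return np.sum(0.5 * (y_hat - y) ** 2)

# Hypothetical actual values and network predictions.
y = np.array([1.0, 0.0, 1.0])
y_hat = np.array([0.8, 0.3, 0.6])

# 0.5 * (0.04 + 0.09 + 0.16) = 0.145
print(cost(y_hat, y))
```

A perfect set of predictions would give a cost of exactly 0, and the cost grows as predictions drift away from the actual values.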

Let us now talk about how, mathematically, the weights are actually adjusted. This is done through a process called *backpropagation*, where the value of the cost function is fed back through the network in order to adjust the weights. A brute-force way to find the weights would be to try a ton of different values and see what works best. But in large networks the number of synapses grows quickly, and the number of weight combinations to try grows exponentially with it. Trying to brute force a huge number of variables simultaneously becomes computationally gargantuan. In fact, for even a decently large network, the world's fastest computer would take longer to try all the combinations than the universe has existed, so we need a smarter way. The method we will use to speed things up is called *gradient descent*. Basically, we choose some starting values for our weights. We feed these into the cost function and get a value, but we also look at the slope of our cost function (the derivative) at that point. If the slope is negative, increasing the weights would give a smaller value for the cost function, so we increase our weights. A positive slope means decreasing the weights would give a smaller value for the cost function, so we decrease our weights. We continue until the slope is 0, meaning we have reached a local minimum of our cost function. The actual process is a bit more complex, but this is in general what we are trying to do. The following visual should help shore up this concept.

![gd](static/gradient_descent.jpg)
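The idea of following the slope downhill can be sketched with a single weight. The cost function $(w - 3)^2$, the starting value, and the learning rate below are all assumptions chosen so the example is easy to follow; a real network has many weights and a far more complicated cost.

```python
def cost(w):
    """A toy convex cost function with its minimum at w = 3."""
    return (w - 3.0) ** 2

def slope(w):
    """Derivative of the toy cost function with respect to w."""
    return 2.0 * (w - 3.0)

w = 0.0              # starting value for the weight
learning_rate = 0.1  # how big a step to take along the slope

for step in range(100):
    # Negative slope -> w increases; positive slope -> w decreases.
    w -= learning_rate * slope(w)

print(w, cost(w))
```

Subtracting the slope (scaled by the learning rate) automatically moves the weight in the right direction, and the steps shrink as the slope approaches 0.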

Now the main problem with simple gradient descent is that it is only guaranteed to find the global minimum when the cost function is convex. Convex here means that it has one global minimum and all roads point to it, so to speak. This often does not happen. More complex cost functions, or even simple cost functions in higher dimensions, have many ups and downs. So our algorithm may find a minimum and stop, but that minimum is a local minimum, and we could have done better had we found the global minimum. One remedy is *stochastic gradient descent*, which copes better with non-convex cost functions. The main difference in stochastic gradient descent is when the weights are updated. In regular gradient descent, we run all of our observations, get an overall value for the cost function, and then update the weights before running all rows again. This is also known as batch gradient descent. Stochastic gradient descent handles one row at a time. It runs the first observation through the network, updates the weights based on that single observation's cost, and then moves on to the second row and does the same. Because stochastic gradient descent updates after every row, the fluctuations are much higher, which makes it more likely to escape a local minimum, although it is still not guaranteed to find the global one. Each update is also much cheaper, since it does not require a pass over all of the data, so it often makes progress faster. The big con is that stochastic gradient descent is stochastic (meaning random), so if you ran the same stochastic gradient descent twice on the same data, you might not get the same weights. There is a mix of the two called mini-batch gradient descent that updates the weights after a few observations at a time.
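The three variants differ only in how many observations are used per weight update. The sketch below shows this for a deliberately tiny model with a single weight, $\hat{y} = wx$; the data, learning rate, and the `run_epoch` helper are hypothetical and exist only for this illustration.

```python
import numpy as np

def run_epoch(w, X, y, batch_size, learning_rate=0.01, rng=None):
    """One pass over the data for the one-weight model y_hat = w * x.

    Returns the updated weight and the number of weight updates made.
    """
    # Visit rows in order, or in a random order if a generator is given.
    order = np.arange(len(X)) if rng is None else rng.permutation(len(X))
    updates = 0
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        y_hat = w * X[idx]
        # Gradient of the batch-averaged cost 0.5 * (y_hat - y)^2 w.r.t. w.
        grad = np.mean((y_hat - y[idx]) * X[idx])
        w -= learning_rate * grad
        updates += 1
    return w, updates

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100)
y = 2.0 * X  # the "true" weight is 2

for name, batch_size in [("batch", 100), ("mini-batch", 10), ("stochastic", 1)]:
    w, updates = run_epoch(0.0, X, y, batch_size, rng=rng)
    print(name, "updates per epoch:", updates, "weight after one epoch:", round(w, 3))
```

With 100 rows, batch gradient descent makes 1 update per epoch, mini-batch with a batch size of 10 makes 10, and stochastic gradient descent makes 100.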

The last concept to revisit before we jump into the application is *backpropagation*. Backpropagation has some heavy mathematics behind it, but at a high level, it means that we know what part of the error each weight is responsible for, so we can update all of the weights at the same time. With all of this said, let us quickly recap the algorithm for training with stochastic gradient descent before we jump in.
1. Randomly initialize weights with small numbers that are close to but different from 0.
2. Input the first observation of your data set into the input layer; each feature is one input node.
3. Forward propagate the data through the network. At each layer, neurons are activated based on the weights. Propagate until we get a predicted value $\hat{y}$.
4. Compare the predicted result to the actual result $y$. Use this to get an error determined by the cost function.
5. Backpropagate this information through the network and update all weights accordingly.
6. Repeat steps 2 through 5, updating the weights after each observation for stochastic gradient descent or after a batch of observations for batch gradient descent.
7. Once all of the data has passed through the network, that is one epoch. Do more epochs until the weights converge.
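The steps above can be sketched from scratch in NumPy for a very small network. Everything specific here is an assumption made for illustration: the toy data (a logical OR), the single hidden layer of three sigmoid neurons, the bias terms, the learning rate, and the number of epochs. It is not the network we build later with Keras.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def total_cost(X, y, W1, b1, W2, b2):
    """Sum of 0.5 * (y_hat - y)^2 over all observations."""
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    return np.sum(0.5 * (y_hat - y) ** 2)

rng = np.random.default_rng(0)

# Toy data (assumption): 4 observations, 2 features, logical OR as the target.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

# Step 1: small random weights close to, but different from, 0.
W1 = rng.uniform(-0.5, 0.5, size=(2, 3))  # input -> hidden (3 neurons)
b1 = np.zeros(3)                          # hidden biases (assumption)
W2 = rng.uniform(-0.5, 0.5, size=3)       # hidden -> output
b2 = 0.0                                  # output bias (assumption)
learning_rate = 0.5

cost_before = total_cost(X, y, W1, b1, W2, b2)

for epoch in range(2000):                 # Step 7: many epochs
    for x_i, y_i in zip(X, y):            # Steps 2 and 6: one observation at a time
        # Step 3: forward propagation.
        h = sigmoid(x_i @ W1 + b1)
        y_hat = sigmoid(h @ W2 + b2)
        # Steps 4 and 5: error from C = 0.5 * (y_hat - y)^2, sent backwards
        # through the network with the chain rule.
        delta_out = (y_hat - y_i) * y_hat * (1.0 - y_hat)
        delta_hidden = delta_out * W2 * h * (1.0 - h)
        # Update every weight using its share of the error.
        W2 -= learning_rate * delta_out * h
        b2 -= learning_rate * delta_out
        W1 -= learning_rate * np.outer(x_i, delta_hidden)
        b1 -= learning_rate * delta_hidden

cost_after = total_cost(X, y, W1, b1, W2, b2)
print(cost_before, cost_after)
```

Libraries such as Keras perform the same forward pass, backward pass, and weight update, but compute the derivatives automatically and far more efficiently.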

Now that we have this description of the algorithm, we are going to work through a business case with it. The dataset for this example is a sample from a fictitious bank. This bank has noticed that an unusually high number of people are leaving. The bank wants to figure out which customers are at highest risk of leaving, which is when they turned to you. Six months ago, the bank selected 10000 random customers from their huge customer base. At that time, they collected all of the information they had on those customers. Just recently, the bank went back and recorded whether or not those customers had left the bank in the past six months. They now want you to build an artificial neural network on this data to predict which customers are at the highest risk of leaving.

Before we can build our model, we need to install some libraries to help us out. The first library is Theano, which is a numerical computation library that can run computations on both CPUs and GPUs. Another similar library is TensorFlow. These libraries are both great for building neural networks from scratch, but we would like to build networks faster. Enter the Keras library, which is built on the Theano and TensorFlow libraries and allows us to build neural networks quickly and easily.

Before we can make our model, we, as usual, need to read in and preprocess our data. Looking through our data, we are going to include all features that we think might actually have an effect on our output. These include things like gender, country, current balance, number of accounts, etc. These variables are the fourth through thirteenth columns of our dataset (zero-based indices 3 through 12). Note that since we have categorical data, we will have to encode it using dummy variables.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset
dataset = pd.read_csv('data/Churn_Modelling.csv')
X = dataset.iloc[:, 3:13].values
y = dataset.iloc[:, 13].values

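# Encoding categorical data
# (OneHotEncoder's categorical_features argument was removed in newer
# scikit-learn versions, so ColumnTransformer selects the column instead)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
# Label-encode the two-category feature in column 2 as 0/1
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
# One-hot encode column 1; the dummy columns are placed first in the output
ct = ColumnTransformer([('onehot', OneHotEncoder(), [1])], remainder = 'passthrough', sparse_threshold = 0)
X = ct.fit_transform(X).astype(float)
# Drop the first dummy column to avoid the dummy variable trap
X = X[:, 1:]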

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
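To see what dummy-variable encoding does on its own, here is a small sketch using pandas rather than scikit-learn. The column name and its values are invented for illustration and are not taken from the bank dataset.

```python
import pandas as pd

# Hypothetical toy column; the values are made up for illustration.
toy = pd.DataFrame({"Country": ["France", "Spain", "Germany", "France"]})

# One 0/1 column per category. The first category (alphabetically) is
# dropped because it is implied when all remaining columns are 0.
dummies = pd.get_dummies(toy["Country"], drop_first=True, dtype=int)
print(dummies)
```

Dropping one dummy column plays the same role as the `X = X[:, 1:]` line in the preprocessing code: with three categories, two columns carry all of the information.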

Now we can build our model. Using Keras, we will build a neural network with two hidden layers of six nodes each. More hidden layers and more nodes let the network capture more complex patterns, but be careful, because the network can become very large and resource-consuming quite quickly, so we will keep it small. We will use the rectifier function for our hidden layers and the sigmoid function for our output. This makes our output a probability, so we can manually choose what our cutoff will be (we will arbitrarily choose 0.5). Remember that when training the network, we have to choose the number of epochs (we will choose 100 so that it runs in reasonable time). We will also use mini-batch gradient descent with a batch size of 10, so that we reduce our chances of missing the global minimum while still running reasonably fast. These are all parameters that can be easily tweaked. When compiling, one of our inputs is "adam", which is simply a variant of stochastic gradient descent, and the "binary_crossentropy" input specifies our loss function.
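A quick back-of-the-envelope calculation shows just how small this network is and how many weight updates training involves. The sketch assumes 11 input features (the `input_dim` used in the model code), one bias per neuron (the Keras default for `Dense` layers), and the 80/20 train/test split from the preprocessing step.

```python
# Layer sizes: 11 inputs, two hidden layers of 6 nodes, 1 output node.
layer_sizes = [11, 6, 6, 1]

# Each layer has one weight per (input, neuron) pair plus one bias per neuron.
n_params = sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# 10000 customers with a 20% test split leave 8000 training rows.
n_train = int(10000 * 0.8)
batch_size = 10
epochs = 100
updates_per_epoch = n_train // batch_size

print("trainable parameters:", n_params)                   # 72 + 42 + 7 = 121
print("weight updates in total:", updates_per_epoch * epochs)  # 800 * 100 = 80000
```

Even with only 121 parameters, the network performs 80000 weight updates over the full run, which helps explain why training takes a noticeable amount of time.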

In [3]:
import keras
from keras.models import Sequential
from keras.layers import Dense

# Initialising the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer
# (units, kernel_initializer and epochs are the Keras 2+ names for the
# older output_dim, init and nb_epoch arguments)
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu', input_dim = 11))

# Adding the second hidden layer
classifier.add(Dense(units = 6, kernel_initializer = 'uniform', activation = 'relu'))

# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])

# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 10, epochs = 100)

# Part 3 - Making the predictions and evaluating the model

# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)



Epoch 1/100
Epoch 2/100
...
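When reading the confusion matrix, it helps to turn the four counts into a few summary numbers. The matrix below is purely hypothetical and is NOT the output of the model above; it only illustrates the layout scikit-learn uses, where rows are the actual classes and columns are the predicted classes.

```python
import numpy as np

# Hypothetical counts for illustration only; not the model's real results.
# Layout: [[true negatives, false positives],
#          [false negatives, true positives]]
cm = np.array([[90, 10],
               [15, 35]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()  # share of all predictions that were right
recall = tp / (tp + fn)          # share of actual leavers that were caught
precision = tp / (tp + fp)       # share of predicted leavers that really left

print(accuracy, recall, precision)
```

For a churn problem, recall is often the number to watch, since a missed leaver is usually more costly than a false alarm.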

Even with this very small neural network, the program took a couple of minutes to run. Overall, our model was very good at identifying true positives but had some trouble with the negatives. But for this specific