<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 2*

# Sprint Challenge - Neural Network Foundations

Table of Problems

1. [Defining Neural Networks](#Q1)
2. [Perceptron on XOR Gates](#Q2)
3. [Multilayer Perceptron](#Q3)
4. [Keras MMP](#Q4)

<a id="Q1"></a>
## 1. Define the following terms:

- **Neuron:** 
A neuron, or node, is the indivisible object that forms a neural network, akin to atoms within molecules (assuming atoms cannot be further divisible into sub-atomic particles). A neuron contains information pertaining to one specific feature in our dataset. For example, if we are dealing with pictures, a picture can be composed of 28x28 pixels (greyscale); were we to encode these pictures within our neural networks model, we would have to initialize our model with 784 neurons, since each neuron would hold information (for example: intensity, brightness or luminosity) necessary for our machines to understand what a picture means. 


- **Input Layer:**
Input layer corresponds to the amalgamation of individual neurons that form our initial neural network layer.


- **Hidden Layer:**
The hidden layer is the layer, or layers, that lie between our input layer and our output layer. The word 'deep' from 'Deep Learning' stems from the idea that neural networks can contain multiple hidden layers. Neural networks are sometimes called black box models since it can become very difficult to interpret neurons individually, let alone neurons within neurons within multiple hidden layers; we know hidden layers can help us achieve higher accuracy by allowing us to further tweak our weights to activate certain neurons in the direction of our choosing. 


- **Output Layer:**
Output layer is our final layer, depending on our machine learning problem at hand, the output layer is adjusted accordingly to reflect our final output; for example, regarding a regression problem our output would be a nx1 matrix.


- **Activation:**
Activation is what enables us to give meaning to our neurons; it not sufficient to solely pass on signals from our neurons to other neurons from subsequent layers just by calculating the sum of the product between our weights and out neuron inputs. For example, using the sigmoid function applied in logistic regression problems, we are able to squishify our results (X would be in the range of -1 and 1) and consequently pass on values in a range that can be interpretable, thus 'activating' our neurons. 


- **Backpropagation:**
As the name implies, backpropagation consists of adjusting our errors in reverse; the idea here is that while we may not know exactly what each individual neuron within our many hidden layers may be doing, we do know that by identifying which neurons are more sensible to activation, and knowing which direction we'd like our neurons to activate towards, we are able to adjust our weights from output layer -> hidden layer  | hidden layer -> input layer, by using the derivative of the activation function (rate of change) and multiplying it by our error and applying it to our weights to see how far off our weight terms were. 


## 2. Perceptron on XOR Gates <a id="Q3=2"></a>

Create a perceptron class that can model the behavior of an AND gate. You can use the following table as your training data:

|x1	|x2|x3|	y|
|---|---|---|---|
1|	1|	1|	1|
1|	0|	1|	0|
0|	1|	1|	0|
0|	0|	1|	0|

In [28]:
import numpy as np 
import pandas as pd 

In [3]:
X = np.array([
    [1, 1, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 0, 1]
])

y = np.array([
    [1], 
    [0], 
    [0], 
    [0]
])

In [4]:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    sx = sigmoid(x)
    return sx * (1-sx)

In [5]:
weights = 2 * np.random.random((3,1)) - 1
weights

array([[-0.7917965 ],
       [ 0.11374332],
       [-0.98208266]])

In [26]:
#iterating multiple times enables us adjust our errors, our weights, and our final activation output 

for i in range(1000):
    weighted_sum = np.dot(X, weights)
    activated_output = sigmoid(weighted_sum)
    error = y - activated_output
    adjustments = error * sigmoid_derivative(activated_output)
    weights += np.dot(X.T, adjustments)
    
print('weights after training')
print(weights)

print('output after training')
print(activated_output)
    

weights after training
[[ 18.14159145]
 [ 18.14159145]
 [-27.50073592]]
output after training
[[9.99846621e-01]
 [8.61667372e-05]
 [8.61667372e-05]
 [1.13916794e-12]]


## 3. Multilayer Perceptron <a id="Q3"></a>

Implement a Neural Network Multilayer Perceptron class that uses backpropagation to update the network's weights.
Your network must have one hidden layer.
You do not have to update weights via gradient descent. You can use something like the derivative of the sigmoid function to update weights.
Train your model on the Heart Disease dataset from UCI:



In [84]:
df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/heart.csv')

In [85]:
df.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [86]:
X = df.drop('target', axis=1)
y = df['target']

In [90]:
X = X.values
y = y.values

In [101]:
X

array([[63.,  1.,  3., ...,  0.,  0.,  1.],
       [37.,  1.,  2., ...,  0.,  0.,  2.],
       [41.,  0.,  1., ...,  2.,  0.,  2.],
       ...,
       [68.,  1.,  0., ...,  1.,  2.,  3.],
       [57.,  1.,  0., ...,  1.,  1.,  3.],
       [57.,  0.,  1., ...,  1.,  1.,  2.]])

In [91]:
X.shape, y.shape

((303, 13), (303,))

In [102]:
y.resize(303, 1)

In [184]:
#let's scale our data by transforming mean=0 and variance=1

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X = scaler.fit_transform(X)

In [129]:
#neural network class borrowed from DS3 class notes
#neural network class that implements from beginning (forward feed) to end (backward prop)

class NeuralNetwork: 
    def __init__(self):
        # Set up Architecture 
        self.inputs = 13
        self.hiddenNodes = 2
        self.outputNodes = 1
        
        #Initial weights
        self.weights1 = np.random.randn(self.inputs, self.hiddenNodes) #13x1
        self.weights2 = np.random.rand(self.hiddenNodes, self.outputNodes) #1x1
    
    def sigmoid(self, s):
        return 1 / (1+np.exp(-s))
    
    def sigmoidPrime(self, s):
        return s * (1 - s)
    
    def feed_forward(self, X):
        """
        Calculate the NN inference using feed forward.
        """
        
        #Weighted sume of inputs and hidden layer
        self.hidden_sum = np.dot(X, self.weights1)
        
        #Acivations of weighted sum
        self.activated_hidden = self.sigmoid(self.hidden_sum)
        
        # Weight sum between hidden and output
        self.output_sum = np.dot(self.activated_hidden, self.weights2)
        
        #Final activation of output
        self.activated_output = self.sigmoid(self.output_sum)
        
        return self.activated_output
    
    def backward(self, X, y, o):
        """
        Backward propagate through the network
        """
        #error in output
        self.o_error = y - o 
        
        # apply derivative of sigmoid to error
        self.o_delta = self.o_error * self.sigmoidPrime(o) 
        
        # z2 error: how much our hidden layer weights were off
        self.z2_error = self.o_delta.dot(self.weights2.T)
        self.z2_delta = self.z2_error*self.sigmoidPrime(self.activated_hidden)
        
        #Adjust first set (input => hidden) of weights
        self.weights1 += X.T.dot(self.z2_delta)
        
        #adjust second set (hidden => output) of weights
        self.weights2 += self.activated_hidden.T.dot(self.o_delta)
        
    def train(self, X, y):
        o = self.feed_forward(X)
        self.backward(X, y, o)

In [152]:
nn = NeuralNetwork()

#intervals increase exponentially to make it easier to visualize decreases in loss 
#Our loss function below is simply our mean squared error

for i in range(10000):
    if (i+1 in [1, 5, 20, 50, 100, 500, 1000]) or ((i+1) % 2000 ==0):
        print('---' * 3 + f'EPOCH {i+1}' + '---'*3)
        print("Loss: \n", str(np.mean(np.square(y - nn.feed_forward(X)))))
    nn.train(X,y)

---------EPOCH 1---------
Loss: 
 0.2955039833554743
---------EPOCH 5---------
Loss: 
 0.1445848880286784
---------EPOCH 20---------
Loss: 
 0.09179568155170025
---------EPOCH 50---------
Loss: 
 0.08341488251011715
---------EPOCH 100---------
Loss: 
 0.08128590694678048
---------EPOCH 500---------
Loss: 
 0.0739358134425191
---------EPOCH 1000---------
Loss: 
 0.0731647939835647
---------EPOCH 2000---------
Loss: 
 0.06868466060760119
---------EPOCH 4000---------
Loss: 
 0.06832010212977495
---------EPOCH 6000---------
Loss: 
 0.0681943249704681
---------EPOCH 8000---------
Loss: 
 0.06812024466014227
---------EPOCH 10000---------
Loss: 
 0.06806938010662296


## 4. Keras MMP <a id="Q4"></a>

Implement a Multilayer Perceptron architecture of your choosing using the Keras library. Train your model and report its baseline accuracy. Then hyperparameter tune at least two parameters and report your model's accuracy.
Use the Heart Disease Dataset (binary classification)
Use an appropriate loss function for a binary classification task
Use an appropriate activation function on the final layer of your network.
Train your model using verbose output for ease of grading.
Use GridSearchCV to hyperparameter tune your model. (for at least two hyperparameters)
When hyperparameter tuning, show you work by adding code cells for each new experiment.
Report the accuracy for each combination of hyperparameters as you test them so that we can easily see which resulted in the highest accuracy.
You must hyperparameter tune at least 5 parameters in order to get a 3 on this section.

In [180]:
from sklearn.model_selection import GridSearchCV
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import StratifiedKFold


#k-fold in order to improve accuracy
kfold = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
cvscores = []
for train, test in kfold.split(X, y):
    
  # create model
  def create_model():
        model = Sequential()
        model.add(Dense(13, input_dim=13, activation='relu'))
        model.add(Dense(13, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))

  # compile model
        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
        return model 
    
  # in order to hyperparameter tune we need to pass in KerasClassifier
  model_class = KerasClassifier(build_fn=create_model, verbose=0)
    

  # fit the model
  model.fit(X[train], y[train], epochs=150, batch_size=10, verbose=0)
    
  # evaluate the model
  scores = model.evaluate(X[test], y[test], verbose=0)
  print(f'{model.metrics_names[1]}: {(scores[1]*100):.2f}%') 
  cvscores.append(scores[1]*100)
print(f'{np.mean(cvscores):.2f}% +/- {np.std(cvscores):.2f}%')

acc: 98.02%
acc: 99.01%
acc: 99.01%
98.68% +/- 0.47%


In [185]:
#hyperparam tuning with gridsearch 

param_grid = {'batch_size': [20, 50, 100],
              'epochs': [1, 10, 100]}


grid = GridSearchCV(estimator=model_class, param_grid=param_grid, n_jobs=1, verbose=0)
grid_result = grid.fit(X, y)


print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Means: {mean}, Stdev: {stdev} with: {param}")



Best: 0.6831683119138082 using {'batch_size': 50, 'epochs': 10}
Means: 0.6237623790899912, Stdev: 0.21555981664777982 with: {'batch_size': 20, 'epochs': 1}
Means: 0.4257425864537557, Stdev: 0.3055970272720392 with: {'batch_size': 20, 'epochs': 10}
Means: 0.6633663376172384, Stdev: 0.10321118913257638 with: {'batch_size': 20, 'epochs': 100}
Means: 0.49834983547528583, Stdev: 0.22010999215576987 with: {'batch_size': 50, 'epochs': 1}
Means: 0.6831683119138082, Stdev: 0.11461223986680608 with: {'batch_size': 50, 'epochs': 10}
Means: 0.6369637052218119, Stdev: 0.09369682329296686 with: {'batch_size': 50, 'epochs': 100}
Means: 0.5742574234803518, Stdev: 0.16228766996652716 with: {'batch_size': 100, 'epochs': 1}
Means: 0.5577557782332102, Stdev: 0.1995540139622511 with: {'batch_size': 100, 'epochs': 10}
Means: 0.6435643633206686, Stdev: 0.1244535055230195 with: {'batch_size': 100, 'epochs': 100}
