<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 2*

# Sprint Challenge - Neural Network Foundations

Table of Problems

1. [Defining Neural Networks](#Q1)
2. [Chocolate Gummy Bears](#Q2)
    - Perceptron
    - Multilayer Perceptron
4. [Keras MMP](#Q3)

<a id="Q1"></a>
## 1. Define the following terms:

- **Neuron:** a neuron takes a set of inputs, weights that pass through an activation function. the activation function is what allows for non-linear behavior. the ouput is then sent as an input to neurons in the next layer, repeating the process (weighted sum of inputs + transformation with activation function; the weighted sum contains a bias term). 
- **Input Layer:** deals with the inputs only. it passes the data directly to the first hidden layer, after being multiplied by the weights between the first and second layer and passing through the activation function.  
- **Hidden Layer:** any layer between the input and output layer. 
- **Output Layer:** transforms the hidden layer activations into whatever scale we wanted to have. 
- **Activation:** this is where non-linearity happens, or 'squashing'. they squish values into a smaller range. it is the non-linear transformation over the input signal, it decides whether a neuron is activate or not. it lights up when a high number (close to 1)
- **Backpropagation:** the backprop algo computes the gradient of the cost function. (the cost function minimizes the average of the loss computed over the entire training data set)

it tells us how quickly the cost changes when we change the weights and biases. (partial derviatives dC/dw and dC/db of cost function C). we can learn that a weight will learn slowly if the input neuron is low-activation. we can also use the equations to design activation functions with desired learning properties. for ex., if we choose a non-sigmoid activation function where theta is always positive/nve/never gets close to 0, this prevents the slow-down of learning that happens to sigmoid neurons when they saturate. 


## 2. Chocolate Gummy Bears <a id="Q2"></a>

Right now, you're probably thinking, "yuck, who the hell would eat that?". Great question. Your candy company wants to know too. And you thought I was kidding about the [Chocolate Gummy Bears](https://nuts.com/chocolatessweets/gummies/gummy-bears/milk-gummy-bears.html?utm_source=google&utm_medium=cpc&adpos=1o1&gclid=Cj0KCQjwrfvsBRD7ARIsAKuDvMOZrysDku3jGuWaDqf9TrV3x5JLXt1eqnVhN0KM6fMcbA1nod3h8AwaAvWwEALw_wcB). 

Let's assume that a candy company has gone out and collected information on the types of Halloween candy kids ate. Our candy company wants to predict the eating behavior of witches, warlocks, and ghosts -- aka costumed kids. They shared a sample dataset with us. Each row represents a piece of candy that a costumed child was presented with during "trick" or "treat". We know if the candy was `chocolate` (or not chocolate) or `gummy` (or not gummy). Your goal is to predict if the costumed kid `ate` the piece of candy. 

If both chocolate and gummy equal one, you've got a chocolate gummy bear on your hands!?!?!
![Chocolate Gummy Bear](https://ed910ae2d60f0d25bcb8-80550f96b5feb12604f4f720bfefb46d.ssl.cf1.rackcdn.com/3fb630c04435b7b5-2leZuM7_-zoom.jpg)

In [0]:
import pandas as pd
candy = pd.read_csv('chocolate_gummy_bears.csv')

In [217]:
candy.head()

Unnamed: 0,chocolate,gummy,ate
0,0,1,1
1,1,0,1
2,0,1,1
3,0,0,0
4,1,1,0


### Perceptron

To make predictions on the `candy` dataframe. Build and train a Perceptron using numpy. Your target column is `ate` and your features: `chocolate` and `gummy`. Do not do any feature engineering. :P

Once you've trained your model, report your accuracy. Explain why you could not achieve a higher accuracy with a *simple perceptron*. It's possible to achieve ~95% accuracy on this dataset.

In [0]:
import numpy as np

In [0]:
# Start your candy perceptron here

inputs = candy[['chocolate', 'gummy']].values
correct_outputs = candy['ate'].values.reshape(-1, 1)

In [220]:
inputs.shape, correct_outputs.shape

((10000, 2), (10000, 1))

In [221]:
correct_outputs

array([[1],
       [1],
       [1],
       ...,
       [1],
       [1],
       [1]])

In [0]:
# activation function + derivative for updating weights
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    sx = sigmoid(x)
    
    return sx * (1-sx)

In [223]:
# intialize random weights for the two inputs
# we need a matrix of 
weights = 2 * np.random.random((2,1)) - 1
weights

array([[-0.13871912],
       [-0.60270641]])

In [224]:
# now do this 10000 times for fine tuning
for iteration in range(10000):
    
    # weighted sum of inputs/weights
    weighted_sum = np.dot(inputs, weights)
    
    # activate!
    activated_output = sigmoid(weighted_sum)

    # calc error
    error = correct_outputs - activated_output

    adjustments = error * sigmoid_derivative(activated_output)

    # update the weights:
    weights += np.dot(inputs.T, adjustments)

print("Weights after training")
print(weights)

print("Output after training")
print(activated_output)

  


Weights after training
[[-255.02285518]
 [-414.99432639]]
Output after training
[[1.]
 [1.]
 [1.]
 ...
 [1.]
 [1.]
 [1.]]


Explain why you could not achieve a higher accuracy with a simple perceptron:


### Multilayer Perceptron <a id="Q3"></a>

Using the sample candy dataset, implement a Neural Network Multilayer Perceptron class that uses backpropagation to update the network's weights. Your Multilayer Perceptron should be implemented in Numpy. 
Your network must have one hidden layer.

Once you've trained your model, report your accuracy. Explain why your MLP's performance is considerably better than your simple perceptron's on the candy dataset. 

In [0]:
X = inputs
y = correct_outputs

In [168]:
y

array([1, 1, 1, ..., 1, 1, 1])

In [171]:
y = correct_outputs.reshape(-1,1)
y

array([[1],
       [1],
       [1],
       ...,
       [1],
       [1],
       [1]])

In [0]:
# I want activations that correspond to negative weights to be lower
# and activations that correspond to positive weights to be higher

class NeuralNetwork:
    def __init__(self):
        # Set up Architecture of Neural Network
        self.inputs = 2
        self.hiddenNodes = 4
        self.outputNodes = 1

        # Initial Weights
        # 2x3 Matrix Array for the First Layer
        self.weights1 = np.random.rand(self.inputs, self.hiddenNodes)
       
        # 3x1 Matrix Array for Hidden to Output
        self.weights2 = np.random.rand(self.hiddenNodes, self.outputNodes)
        
    def sigmoid(self, s):
        return 1 / (1 + np.exp(-s))
    
    def sigmoidPrime(self, s):
        return s * (1 - s)
    
    def feed_forward(self, X):
        """
        Calculate the NN inference using feed forward.
        aka "predict"
        """
        
        # Weighted sum of inputs => hidden layer
        self.hidden_sum = np.dot(X, self.weights1)
        
        # Activations of weighted sum
        self.activated_hidden = self.sigmoid(self.hidden_sum)
        
        # Weight sum between hidden and output
        self.output_sum = np.dot(self.activated_hidden, self.weights2)
        
        # Final activation of output
        self.activated_output = self.sigmoid(self.output_sum)
        
        return self.activated_output
        
    def backward(self, X,y,o):
        """
        Backward propagate through the network
        """
        
        # Error in Output
        self.o_error = y - o
        
        # Apply Derivative of Sigmoid to error
        # How far off are we in relation to the Sigmoid f(x) of the output
        # ^- aka hidden => output
        # o_delta: overall gradient
        self.o_delta = self.o_error * self.sigmoidPrime(o)
        
        # z2 error
        self.z2_error = self.o_delta.dot(self.weights2.T)
        # How much of that "far off" can explained by the input => hidden
        self.z2_delta = self.z2_error * self.sigmoidPrime(self.activated_hidden)
        
        # Adjustment to first set of weights (input => hidden)
        self.weights1 += X.T.dot(self.z2_delta)
        # Adjustment to second set of weights (hidden => output)
        self.weights2 += self.activated_hidden.T.dot(self.o_delta)
        

    def train(self, X, y):
        o = self.feed_forward(X)
        self.backward(X,y,o)

In [173]:
# Train my 'net
nn = NeuralNetwork()

# Number of Epochs / Iterations
for i in range(10000):
    if (i+1 in [1,2,3,4,5]) or ((i+1) % 1000 ==0):
        print('+' + '---' * 3 + f'EPOCH {i+1}' + '---'*3 + '+')
        print('Input: \n', X)
        print('Actual Output: \n', y)
        print('Predicted Output: \n', str(nn.feed_forward(X)))
        print("Loss: \n", str(np.mean(np.square(y - nn.feed_forward(X)))))
    nn.train(X,y)

+---------EPOCH 1---------+
Input: 
 [[0 1]
 [1 0]
 [0 1]
 ...
 [0 1]
 [0 1]
 [1 0]]
Actual Output: 
 [[1]
 [1]
 [1]
 ...
 [1]
 [1]
 [1]]
Predicted Output: 
 [[0.80279814]
 [0.79251455]
 [0.80279814]
 ...
 [0.80279814]
 [0.80279814]
 [0.79251455]]
Loss: 
 0.3341885619648168
+---------EPOCH 2---------+
Input: 
 [[0 1]
 [1 0]
 [0 1]
 ...
 [0 1]
 [0 1]
 [1 0]]
Actual Output: 
 [[1]
 [1]
 [1]
 ...
 [1]
 [1]
 [1]]
Predicted Output: 
 [[0.43075117]
 [0.37377738]
 [0.43075117]
 ...
 [0.43075117]
 [0.43075117]
 [0.37377738]]
Loss: 
 0.24985462985072399
+---------EPOCH 3---------+
Input: 
 [[0 1]
 [1 0]
 [0 1]
 ...
 [0 1]
 [0 1]
 [1 0]]
Actual Output: 
 [[1]
 [1]
 [1]
 ...
 [1]
 [1]
 [1]]
Predicted Output: 
 [[0.49999228]
 [0.49997793]
 [0.49999228]
 ...
 [0.49999228]
 [0.49999228]
 [0.49997793]]
Loss: 
 0.20115663958007654
+---------EPOCH 4---------+
Input: 
 [[0 1]
 [1 0]
 [0 1]
 ...
 [0 1]
 [0 1]
 [1 0]]
Actual Output: 
 [[1]
 [1]
 [1]
 ...
 [1]
 [1]
 [1]]
Predicted Output: 
 [[0.49999235]
 

P.S. Don't try candy gummy bears. They're disgusting. 

## 3. Keras MMP <a id="Q3"></a>

- Implement a Multilayer Perceptron architecture of your choosing using the Keras library. 
- Train your model and report its baseline accuracy. 
- Then hyperparameter tune at least two parameters and report your model's accuracy.

Use the Heart Disease Dataset (binary classification)
- Use an appropriate loss function for a binary classification task
- Use an appropriate activation function on the final layer of your network.
- Train your model using verbose output for ease of grading.
- Use GridSearchCV or RandomSearchCV to hyperparameter tune your model. (for at least two hyperparameters)
When hyperparameter tuning, show you work by adding code cells for each new experiment.
- Report the accuracy for each combination of hyperparameters as you test them so that we can easily see which resulted in the highest accuracy.
- You must hyperparameter tune at least 3 parameters in order to get a 3 on this section.

In [184]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/heart.csv')
df = df.sample(frac=1)
print(df.shape)
df.head()

(303, 14)


Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
138,57,1,0,110,201,0,1,126,1,1.5,1,0,1,1
204,62,0,0,160,164,0,0,145,0,6.2,0,3,3,0
24,40,1,3,140,199,0,1,178,1,1.4,2,0,3,1
91,57,1,0,132,207,0,1,168,1,0.0,2,0,3,1


In [0]:
X = df.drop(columns=['target'])
y = df['target']

In [186]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y)

X_train.shape, y_train.shape, X_test.shape, y_test.shape

((227, 13), (227,), (76, 13), (76,))

In [0]:
# normalize
from sklearn.preprocessing import MinMaxScaler, Normalizer

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.fit_transform(X_test)

### Baseline accuracy

In [188]:
# Important Hyperparameters
inputs = X_train.shape[1]
epochs = 50
batch_size = 10

# create model without tuning

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(15, activation='relu', input_shape=(inputs,)))
model.add(Dense(10, activation='relu'))

# add sigmoid for output layer,  because classification (see notes)
model.add(Dense(1, activation='sigmoid'))
          
  
# since dependent variable is binary, use the log loss function 
# if there are > 2 categroies in output, then use categorical_crossentropy

model.compile(optimizer='adam', 
              loss='binary_crossentropy', 
              metrics=['acc'])

model.fit(X_train, y_train, 
          validation_data=(X_test,y_test), 
          epochs=epochs, 
          batch_size=batch_size,
          verbose=True
         )

Train on 227 samples, validate on 76 samples
Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<tensorflow.python.keras.callbacks.History at 0x7f412b791cc0>

In [189]:
results = model.evaluate(X_test, y_test)
results



[0.4901960774471885, 0.7894737]

## Hyperparameter tuning

In [0]:
# 4. hyperparameter tuning

from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dropout

def create_model(activation='relu', 
                 dropout_rate=0.1, 
                 init_mode='uniform',
                 optimizer='adam'):
    
    # create model
    model = Sequential()
    model.add(Dense(15, kernel_initializer=init_mode, activation=activation, input_shape=(inputs,)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, kernel_initializer=init_mode, activation=activation))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['acc'])
    return model
  
# create model
model = KerasClassifier(build_fn=create_model, verbose=0)

In [0]:
# define the grid search parameters
param_grid = {'batch_size': [1000],
              'epochs': [20, 40],
              'optimizer': ['adam', 'sgd'] ,
#              'learning_rate': [0.001, 0.01] ,
#              'momentum': [0, 0.2] ,
              'activation': ['softmax', 'linear'],
              'init_mode': ['uniform', 'normal'],
              'dropout_rate': [0.0, 0.1, 0.2],
#              'neurons': [1, 5, 10]
              }

In [195]:
from sklearn.model_selection import GridSearchCV

# Create Grid Search
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=1)
grid_result = grid.fit(X_train, y_train, epochs=5)

# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Means: {mean}, Stdev: {stdev} with: {param}")



KeyboardInterrupt: ignored