<img align="left" src="https://lever-client-logos.s3.amazonaws.com/864372b1-534c-480e-acd5-9711f850815c-1524247202159.png" width=200>
<br></br>
<br></br>

## *Data Science Unit 4 Sprint 2*

# Sprint Challenge - Neural Network Foundations

Table of Problems

1. [Defining Neural Networks](#Q1)
2. [Chocolate Gummy Bears](#Q2)
    - Perceptron
    - Multilayer Perceptron
3. [Keras MLP](#Q3)

<a id="Q1"></a>
## 1. Define the following terms:

- **Neuron:**
A Neuron is a node in a neural network. It takes the sum of the weighted outputs of the previous layer and outputs a transformed value.
- **Input Layer:**
The input layer provides the points of entry to the network. There is one node for each input, or feature, of the data set.
- **Hidden Layer:**
A hidden layer is any layer between the input layer and the output layer in a neural network. It can have an arbitrary number of nodes.
- **Output Layer:**
The output layer provides the results of the network. For classification models, the output will have one node for each class. A regression model will have one output node for the numerical prediction.
- **Activation:**
After a node receives the weighted sum of its inputs, it passes that sum through an activation (or transfer) function, such as a sigmoid, before sending the result to the next layer.
- **Backpropagation:**
Backpropagation computes the gradient of the error with respect to each weight by applying the chain rule backward from the output layer. The sign of each gradient gives the direction in which to adjust that weight, and the size of the gradient together with the error determines how far to move it. This is where the learning takes place, and it happens after every batch.
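
To make the definitions above concrete, here is a minimal NumPy sketch of a single sigmoid neuron and one gradient step on its weights. The input values, starting weights, target, and learning rate are made up for illustration, and the squared-error loss is an assumption rather than something the definitions prescribe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values chosen only for illustration
x = np.array([1.0, 0.0])    # input layer: one value per feature
w = np.array([0.5, -0.5])   # one weight per incoming connection
b = 0.0                     # bias term
target = 1.0
learning_rate = 0.1

# Neuron: weighted sum of inputs, then activation
z = np.dot(w, x) + b        # 0.5
a = sigmoid(z)

# Gradient of the assumed loss 0.5 * (target - a)**2 with respect to w,
# obtained with the chain rule: dE/da * da/dz * dz/dw
error = target - a
grad_w = -error * a * (1 - a) * x

# Move the weights against the gradient
w_new = w - learning_rate * grad_w
a_new = sigmoid(np.dot(w_new, x) + b)
print(a, a_new)
```

After the step, the activation moves toward the target, and the weight attached to the zero-valued input is left unchanged because it had no influence on the output.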


## 2. Chocolate Gummy Bears <a id="Q2"></a>

Right now, you're probably thinking, "yuck, who the hell would eat that?". Great question. Your candy company wants to know too. And you thought I was kidding about the [Chocolate Gummy Bears](https://nuts.com/chocolatessweets/gummies/gummy-bears/milk-gummy-bears.html?utm_source=google&utm_medium=cpc&adpos=1o1&gclid=Cj0KCQjwrfvsBRD7ARIsAKuDvMOZrysDku3jGuWaDqf9TrV3x5JLXt1eqnVhN0KM6fMcbA1nod3h8AwaAvWwEALw_wcB). 

Let's assume that a candy company has gone out and collected information on the types of Halloween candy kids ate. Our candy company wants to predict the eating behavior of witches, warlocks, and ghosts -- aka costumed kids. They shared a sample dataset with us. Each row represents a piece of candy that a costumed child was presented with during "trick" or "treat". We know if the candy was `chocolate` (or not chocolate) or `gummy` (or not gummy). Your goal is to predict if the costumed kid `ate` the piece of candy. 

If both chocolate and gummy equal one, you've got a chocolate gummy bear on your hands!?!?!
![Chocolate Gummy Bear](https://ed910ae2d60f0d25bcb8-80550f96b5feb12604f4f720bfefb46d.ssl.cf1.rackcdn.com/3fb630c04435b7b5-2leZuM7_-zoom.jpg)

In [54]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
candy = pd.read_csv('chocolate_gummy_bears.csv')

In [55]:
print(candy.shape)
candy.head()

(10000, 3)


|   | chocolate | gummy | ate |
|---|-----------|-------|-----|
| 0 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 |
| 2 | 0 | 1 | 1 |
| 3 | 0 | 0 | 0 |
| 4 | 1 | 1 | 0 |


### Perceptron

To make predictions on the `candy` dataframe, build and train a Perceptron using numpy. Your target column is `ate` and your features are `chocolate` and `gummy`. Do not do any feature engineering. :P

Once you've trained your model, report your accuracy. Explain why you could not achieve a higher accuracy with a *simple perceptron*. It's possible to achieve ~95% accuracy on this dataset.

In [56]:
# Start your candy perceptron here

X = candy[['chocolate', 'gummy']].values
y = candy[['ate']].values

X.shape, y.shape

((10000, 2), (10000, 1))

In [68]:
class Perceptron:
 

    def __init__(self, niter=10):
        self.niter = niter
        np.random.seed(42)
        self.weights = 2 * np.random.random((2, 1)) - 1
  

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    
    def sigmoid_derivative(self, x):
        sx = self.sigmoid(x)
        return sx * (1-sx)

    
    def train(self, X, y):
        for i in range(self.niter):
            # Weighted sum of inputs / weights
            weighted_sum = np.dot(X, self.weights)
            # Activate
            activated_output = self.sigmoid(weighted_sum)
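            # Calculate error
            error = y - activated_output
            # sigmoid_derivative() applies the sigmoid to its argument itself, so it
            # expects weighted_sum; passing activated_output applies the sigmoid
            # twice and evaluates the slope at a different point
            adjustments = error * self.sigmoid_derivative(activated_output)
            # Update the weights (full-batch step with an implicit learning rate of 1)
            self.weights += np.dot(X.T, adjustments)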
            
    
    def predict(self, X):
        # Weighted sum of inputs / weights
        weighted_sum = np.dot(X, self.weights)
        predictions = np.round(self.sigmoid(weighted_sum))
        return predictions

In [69]:
yuck = Perceptron()
yuck.train(X, y)
yuck_score = accuracy_score(y, yuck.predict(X))
print(f'Accuracy for a single layer perceptron: {yuck_score}')

Accuracy for a single layer perceptron: 0.7229


## Accuracy Limitations With a Simple Perceptron
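Imagine a unit square with `chocolate` on one axis and `gummy` on the other. At (0,0) and (1,1) the output, or height, should be 0 (95 percent of the time). Conversely, at the other two diagonal points, (1,0) and (0,1), the height should be 1. This is the XOR pattern. A simple perceptron can only fit a single plane, and it is impossible to construct one plane that is both above the positive points and below the zero points. The two classes are not linearly separable, which is a requirement for this type of model to give useful results. At best the model gets three of the four corners right, which is consistent with the 0.7229 reported above.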
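
The argument above can be checked by brute force. The sketch below tries a grid of linear threshold classifiers on the four idealized XOR corners (not on the candy data itself) and reports the best accuracy found. The grid range and resolution and the helper name `best_linear_accuracy` are arbitrary choices for illustration.

```python
import itertools
import numpy as np

# The four corners of the unit square with idealized XOR labels
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

def best_linear_accuracy(X, y, grid):
    """Best accuracy of any classifier of the form w1*x1 + w2*x2 + b > 0."""
    best = 0.0
    for w1, w2, b in itertools.product(grid, repeat=3):
        preds = (X @ np.array([w1, w2]) + b > 0).astype(int)
        best = max(best, float(np.mean(preds == y)))
    return best

grid = np.linspace(-2, 2, 21)
print(best_linear_accuracy(X_xor, y_xor, grid))  # 0.75
```

No setting of the two weights and the bias classifies all four corners correctly. Swapping in the labels for AND, `[0, 0, 0, 1]`, which is linearly separable, lets the same search reach 1.0.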

### Multilayer Perceptron

Using the sample candy dataset, implement a Neural Network Multilayer Perceptron class that uses backpropagation to update the network's weights. Your Multilayer Perceptron should be implemented in Numpy. 
Your network must have one hidden layer.

Once you've trained your model, report your accuracy. Explain why your MLP's performance is considerably better than your simple perceptron's on the candy dataset. 
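
As a reference for the update rule such a class needs, here is a sketch of backpropagation for one hidden layer. It assumes sigmoid activations, a squared-error loss, no bias terms, and a learning rate $\eta$; the symbols $H$, $O$, $\delta$, and $\eta$ are notation chosen here, not taken from the assignment.

$$
\begin{aligned}
H &= \sigma(X W_1), \qquad O = \sigma(H W_2), \qquad E = \tfrac{1}{2}\lVert Y - O\rVert^2 \\
\delta_O &= (Y - O) \odot O \odot (1 - O) \\
\delta_H &= (\delta_O W_2^\top) \odot H \odot (1 - H) \\
W_2 &\leftarrow W_2 + \eta\, H^\top \delta_O, \qquad W_1 \leftarrow W_1 + \eta\, X^\top \delta_H
\end{aligned}
$$

Here $\odot$ is elementwise multiplication. Because $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$, the derivative terms can be written using the activations $O$ and $H$ alone.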

In [70]:
class NeuralNetwork:
    def __init__(self):
        # Set up Architecture of Neural Network
        self.inputs = 2
        self.hiddenNodes = 8
        self.outputNodes = 1

        # Initial Weights
        np.random.seed(42)
        # 2x8 weight matrix for input => hidden (inputs x hiddenNodes)
        self.weights1 = np.random.rand(self.inputs, self.hiddenNodes)
       
        # 8x1 weight matrix for hidden => output (hiddenNodes x outputNodes)
        self.weights2 = np.random.rand(self.hiddenNodes, self.outputNodes)
        
    def sigmoid(self, s):
        return 1 / (1+np.exp(-s))
    
    def sigmoidPrime(self, s):
        return s * (1 - s)
    
    def feed_forward(self, X):
        """
        Calculate the NN inference using feed forward.
        aka "predict"
        """
        
        # Weighted sum of inputs => hidden layer
        self.hidden_sum = np.dot(X, self.weights1)
        
        # Activations of weighted sum
        self.activated_hidden = self.sigmoid(self.hidden_sum)
        
        # Weight sum between hidden and output
        self.output_sum = np.dot(self.activated_hidden, self.weights2)
        
        # Final activation of output
        self.activated_output = self.sigmoid(self.output_sum)
        
        return self.activated_output
        
    def backward(self, X, y, o):
        """
        Backward propagate through the network
        """
        
        # Error in Output
        self.o_error = y - o
        
        # Apply Derivative of Sigmoid to error
        # How far off are we in relation to the Sigmoid f(x) of the output
        # ^- aka hidden => output
        self.o_delta = self.o_error * self.sigmoidPrime(o)
        
        # z2 error
        self.z2_error = self.o_delta.dot(self.weights2.T)
        # How much of that "far off" can be explained by the input => hidden weights
        self.z2_delta = self.z2_error * self.sigmoidPrime(self.activated_hidden)
        
        # Adjustment to first set of weights (input => hidden)
        self.weights1 += X.T.dot(self.z2_delta)
        # Adjustment to second set of weights (hidden => output)
        self.weights2 += self.activated_hidden.T.dot(self.o_delta)
        

    def train(self, X, y):
        o = self.feed_forward(X)
        self.backward(X, y, o)
        
    def predict(self, X):
        # Round the sigmoid output to the nearest class label (0 or 1)
        return np.round(self.feed_forward(X))

In [71]:
# train a neural net
yum = NeuralNetwork()
for i in range(10000):
    yum.train(X, y)



In [72]:
predictions = yum.predict(X)
yum_score = accuracy_score(y, yuck.predict(X))
print(f'Accuracy for a single layer perceptron: {yum_score}')

Accuracy for a single layer perceptron: 0.7229




## MLP performance
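Theoretically, this model should be able to handle non-linear relationships and give a better accuracy score than the previous model, because the hidden layer lets the network combine several linear boundaries into a non-linear one. Theoretically. The 0.7229 printed above is identical to the perceptron's score because the scoring cell evaluates `yuck` rather than `yum`, so it does not measure the MLP at all.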
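
To illustrate why a hidden layer is enough for this pattern, here is a small hand-wired network that computes XOR on the four corners. It is a sketch under assumptions that differ from the class above: the weights are picked by hand rather than learned, it uses step activations instead of sigmoids, and it includes bias terms.

```python
import numpy as np

def step(z):
    # Threshold activation: 1 where z > 0, else 0
    return (z > 0).astype(int)

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# Hidden layer (hand-picked): unit 0 fires for OR, unit 1 fires for AND
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([-0.5, -1.5])

# Output unit: fires when OR is on and AND is off
W2 = np.array([[1.0], [-1.0]])
b2 = np.array([-0.5])

hidden = step(X_xor @ W1 + b1)
output = step(hidden @ W2 + b2)
print(output.ravel())  # [0 1 1 0]
```

Each hidden unit draws one straight line, and the output unit combines the two lines into a region that no single line could carve out.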

P.S. Don't try candy gummy bears. They're disgusting. 

## 3. Keras MLP <a id="Q3"></a>

Implement a Multilayer Perceptron architecture of your choosing using the Keras library. Train your model and report its baseline accuracy. Then hyperparameter tune at least two parameters and report your model's accuracy.

- Use the Heart Disease Dataset (binary classification).
- Use an appropriate loss function for a binary classification task.
- Use an appropriate activation function on the final layer of your network.
- Train your model using verbose output for ease of grading.
- Use GridSearchCV or RandomizedSearchCV to hyperparameter tune your model (for at least two hyperparameters).
- When hyperparameter tuning, show your work by adding code cells for each new experiment.
- Report the accuracy for each combination of hyperparameters as you test them so that we can easily see which resulted in the highest accuracy.
- You must hyperparameter tune at least 3 parameters in order to get a 3 on this section.
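
As a rough picture of what a grid search enumerates, the sketch below expands a parameter grid into every combination using only the standard library. It is not scikit-learn's implementation; `expand_grid` is a hypothetical helper and the grid values are examples.

```python
import itertools

def expand_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    keys = sorted(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        yield dict(zip(keys, values))

# Example grid, chosen only for illustration
param_grid = {'batch_size': [3, 101], 'epochs': [20, 50, 100]}
combos = list(expand_grid(param_grid))

n_folds = 3  # with 3-fold cross-validation, each combination is fit 3 times
print(len(combos), len(combos) * n_folds)  # 6 18
```

The number of model fits grows multiplicatively with each added hyperparameter, which is one reason to tune a few parameters at a time when each fit is slow.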

In [1]:
# import statements
from category_encoders import OneHotEncoder
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier

# fix random seed for reproducibility (lol)
seed = 42
np.random.seed(seed)

In [2]:
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('https://raw.githubusercontent.com/ryanleeallred/datasets/master/heart.csv')
df = df.sample(frac=1)
print(df.shape)
df.head()

(303, 14)


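|     | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|-----|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 179 | 57 | 1 | 0 | 150 | 276 | 0 | 0 | 112 | 1 | 0.6 | 1 | 1 | 1 | 0 |
| 228 | 59 | 1 | 3 | 170 | 288 | 0 | 0 | 159 | 0 | 0.2 | 1 | 0 | 3 | 0 |
| 111 | 57 | 1 | 2 | 150 | 126 | 1 | 1 | 173 | 0 | 0.2 | 2 | 1 | 3 | 1 |
| 246 | 56 | 0 | 0 | 134 | 409 | 0 | 0 | 150 | 1 | 1.9 | 1 | 2 | 3 | 0 |
| 60 | 71 | 0 | 2 | 110 | 265 | 1 | 0 | 130 | 0 | 0.0 | 2 | 1 | 2 | 1 |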


In [3]:
import wandb
from wandb.keras import WandbCallback
wandb.init(project="heart-disease", entity='r-m-mauney')

W&B Run: https://app.wandb.ai/r-m-mauney/heart-disease/runs/p5l9wt88

In [4]:
X = df.drop(columns = 'target').to_numpy()
y = df['target'].to_numpy().reshape(-1, 1)
X.shape, y.shape

((303, 13), (303, 1))

In [16]:
# Majority-class baseline: share of positive labels (class 1 is the majority here)
sum(y)[0] / len(y)

0.5445544554455446
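
The cell above computes the share of positive labels, which equals the majority-class accuracy only because class 1 is the more common label in this dataset. A small helper that works whichever class is larger could look like the sketch below; `majority_baseline` and the demo labels are hypothetical.

```python
import numpy as np

def majority_baseline(y):
    """Accuracy of always predicting the most common class in y."""
    y = np.asarray(y).ravel()
    _, counts = np.unique(y, return_counts=True)
    return counts.max() / y.size

# Hypothetical labels, for illustration only
y_demo = np.array([1, 1, 1, 0, 0])
print(majority_baseline(y_demo))  # 0.6
```

Any tuned model should be compared against this number, since a classifier that ignores the features entirely already reaches it.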

In [52]:
# Function to create model, required for KerasClassifier
num_inputs = X.shape[1]

def create_model(optimizer='adam', hidden_1_nodes=13):
    # create model
    model = Sequential()
    model.add(Dense(hidden_1_nodes, input_dim=num_inputs, activation='relu'))
    model.add(Dense(7, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# create model
model = KerasClassifier(build_fn=create_model, verbose=0)

In [6]:
# baseline Keras model

wandb.config.epochs = 20
wandb.config.batch_size = 101

model.fit(X, y, 
          validation_split=0.33, 
          epochs=wandb.config.epochs, 
          batch_size=wandb.config.batch_size,
          verbose=True,
          callbacks=[WandbCallback()]
         )

Train on 203 samples, validate on 100 samples
Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<tensorflow.python.keras.callbacks.History at 0x7f72ac5a0518>

In [7]:
%%time
# define the grid search parameters
# start with batch size
wandb.config.epochs = 20

param_grid = {'batch_size': [3, 101, 303]}

# Create Grid Search
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=1)
grid_result = grid.fit(X, y,
                       epochs=wandb.config.epochs,
                       verbose=False,
                       callbacks=[WandbCallback()]
                      )

# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Mean: {mean}, Stdev: {stdev} with: {param}") 

Best: 0.712871273358663 using {'batch_size': 3}
Mean: 0.712871273358663, Stdev: 0.08440074111595575 with: {'batch_size': 3}
Mean: 0.561056117216746, Stdev: 0.03987802763223187 with: {'batch_size': 101}
Mean: 0.5577557881673177, Stdev: 0.12688042313253176 with: {'batch_size': 303}
CPU times: user 31.2 s, sys: 3.58 s, total: 34.8 s
Wall time: 28.7 s


In [11]:
%%time
# define the grid search parameters
# second, try the number of epochs
# wandb.config.batch_size = 3

param_grid = {'batch_size': [3],
              'epochs': [20, 50, 100]}

# Create Grid Search
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=1)
grid_result = grid.fit(X, y,
                       verbose=False,
                       callbacks=[WandbCallback()]
                      )

# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Mean: {mean}, Stdev: {stdev} with: {param}") 

Best: 0.6237623691558838 using {'batch_size': 3, 'epochs': 100}
Mean: 0.6204620401064554, Stdev: 0.03645332582644608 with: {'batch_size': 3, 'epochs': 20}
Mean: 0.5973597566286722, Stdev: 0.09779229135042866 with: {'batch_size': 3, 'epochs': 50}
Mean: 0.6237623691558838, Stdev: 0.03523787151819811 with: {'batch_size': 3, 'epochs': 100}
CPU times: user 12.9 s, sys: 1.1 s, total: 14 s
Wall time: 22.7 s


In [12]:
%%time
# define the grid search parameters
# third, optimizers
# wandb.config.epochs = 100
# wandb.config.batch_size = 101

param_grid = {'batch_size': [3],
              'epochs': [100],
              'optimizer': ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']}

# Create Grid Search
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=1)
grid_result = grid.fit(X, y,
                       verbose=False,
                       callbacks=[WandbCallback()]
                      )

# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Mean: {mean}, Stdev: {stdev} with: {param}")

Best: 0.7656765580177307 using {'batch_size': 3, 'epochs': 100, 'optimizer': 'Adamax'}
Mean: 0.5445544719696045, Stdev: 0.03233648861753595 with: {'batch_size': 3, 'epochs': 100, 'optimizer': 'SGD'}
Mean: 0.7557755708694458, Stdev: 0.07335350674429997 with: {'batch_size': 3, 'epochs': 100, 'optimizer': 'RMSprop'}
Mean: 0.5643564263979594, Stdev: 0.1050936160841007 with: {'batch_size': 3, 'epochs': 100, 'optimizer': 'Adagrad'}
Mean: 0.5412541329860687, Stdev: 0.03645333841793224 with: {'batch_size': 3, 'epochs': 100, 'optimizer': 'Adadelta'}
Mean: 0.6798679828643799, Stdev: 0.07335350138082229 with: {'batch_size': 3, 'epochs': 100, 'optimizer': 'Adam'}
Mean: 0.7656765580177307, Stdev: 0.05197366159998095 with: {'batch_size': 3, 'epochs': 100, 'optimizer': 'Adamax'}
Mean: 0.7293729384740194, Stdev: 0.07335350495647412 with: {'batch_size': 3, 'epochs': 100, 'optimizer': 'Nadam'}
CPU times: user 11min 4s, sys: 1min 19s, total: 12min 24s
Wall time: 8min 24s


In [53]:
%%time
# define the grid search parameters
# fourth, number of nodes in first hidden layer

param_grid = {'batch_size': [3],
              'epochs': [100],
              'optimizer': ['Adamax'],
              'hidden_1_nodes' : [7, 13, 26]}

# Create Grid Search
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=3, n_jobs=1)
grid_result = grid.fit(X, y,
                       verbose=False,
                       callbacks=[WandbCallback()]
                      )

# Report Results
print(f"Best: {grid_result.best_score_} using {grid_result.best_params_}")
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Mean: {mean}, Stdev: {stdev} with: {param}")

Best: 0.9458000021457672 using {'batch_size': 3, 'epochs': 100, 'hidden_1_nodes': 7, 'optimizer': 'Adamax'}
Mean: 0.9458000021457672, Stdev: 0.005441949973255031 with: {'batch_size': 3, 'epochs': 100, 'hidden_1_nodes': 7, 'optimizer': 'Adamax'}
Mean: 0.9458000021457672, Stdev: 0.005441949973255031 with: {'batch_size': 3, 'epochs': 100, 'hidden_1_nodes': 13, 'optimizer': 'Adamax'}
Mean: 0.9458000021457672, Stdev: 0.005441949973255031 with: {'batch_size': 3, 'epochs': 100, 'hidden_1_nodes': 26, 'optimizer': 'Adamax'}
CPU times: user 2h 27min 44s, sys: 17min 44s, total: 2h 45min 28s
Wall time: 1h 42min 18s


## Best hyperparameters

batch size: 3

epochs: 100

optimizer: Adamax

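nodes for first hidden layer: 7 (7, 13, and 26 all tied at the same mean score, so the grid search reported the first one)

Note that 13 nodes with these same settings scored 0.7657 in the optimizer search, so the jump to 0.9458 in the last search probably reflects a change in the data held in `X` and `y` between runs rather than an effect of the node count.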