This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There is a mix of written questions and coding exercises; both are equally important for preparing you for the sprint challenge and for speaking about these topics comfortably in interviews and on the job.

If you get stuck or are unsure of something, remember the 20-minute rule. If that doesn't help, research a solution with Google and Stack Overflow. Only once you have exhausted these methods should you turn to your Team Lead; they won't be there during your SC or an interview. That being said, don't hesitate to ask for help if you truly are stuck.

Have fun studying!

# Neural Networks by Hand

## Definitions

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*

**Input Layer:** is what receives input from our dataset. Sometimes it is called the visible layer because it's the only part that is exposed to our data and that our data interacts with directly. Typically node maps are drawn with one input node for each of the different inputs/features/columns of our dataset that will be passed to the network.

**Hidden Layer:** A layer that sits between the input layer and the output layer. It is called hidden because it cannot be accessed except through the input layer: hidden layers are inside the network and perform their functions, but we don't interact with them directly. The simplest possible network has a single neuron in the hidden layer that just outputs the value.

**Output Layer:** The final layer is called the Output Layer. Its purpose is to output a vector of values in a format that is suitable for the type of problem that we're trying to address. Typically the output value is modified by an "activation function" to transform it into a format that makes sense for our context.

**Neuron:** Artificial neurons, or "nodes", are similar to biological neurons in that they receive inputs and pass on their signal to the next layer of nodes if a certain threshold is reached. The goal with ANNs is not to create a realistic model of the brain but to craft robust algorithms and data structures that can model the complex relationships found in data.

**Weight:** A weight represents the strength of the connection between units. If the weight from node 1 to node 2 has greater magnitude, it means that neuron 1 has greater influence over neuron 2. Each input value is multiplied by its weight, so a weight near zero brings down the importance of that input, while a weight of large magnitude amplifies it.

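**Bias:** A bias is a constant added to a neuron's weighted sum of inputs before the activation function is applied. It shifts the point at which the neuron activates, which helps the model fit the given data even when all of the inputs are zero.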

**Activation Function:** In Neural Networks, each node has an activation function. The activation function decides whether a cell "fires" or not. Sometimes it is said that the cell is "activated" or not. In Artificial Neural Networks activation functions decide how much signal to pass onto the next layer. This is why they are sometimes referred to as transfer functions because they determine how much signal is transferred to the next layer.

**Node Map:** A visual diagram of the architecture or "topology" of our neural network. Typically node maps are drawn with one input node for each of the different inputs/features/columns of our dataset that will be passed to the network.

**Perceptron:** A perceptron is just a single node or neuron of a neural network with nothing else.

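**Epoch:** One complete pass of the entire training dataset through the network. It is not the same as an iteration: an iteration is a single weight update on one batch, so one epoch consists of as many iterations as there are batches.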

**Feed Forward Neural Network:** A neural network in which the data flows in one direction, from the input layer through any hidden layers to the output layer (forward propagation), with no loops. During training, the error flows in the opposite direction (backwards propagation).

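**Back Propagation:** Backpropagation is short for "Backwards Propagation of errors" and refers to a specific (rather calculus intensive) algorithm for how weights in a neural network are updated in reverse order, from the output layer back toward the input layer. The update happens after each forward pass over a batch of data, which is once per training epoch when the whole dataset is used as a single batch.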
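
To make the feed forward and back propagation definitions concrete, here is a minimal NumPy sketch of a network with one hidden layer. The XOR toy data, the layer sizes, the learning rate, and the choice of mean squared error are all illustrative assumptions and are not taken from this guide.

```python
import numpy as np

def sigmoid(z):
    """Activation function: squashes values into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy XOR data (an assumption for illustration only)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 4))   # weights: input layer -> hidden layer
b1 = np.zeros((1, 4))          # hidden layer biases
W2 = rng.normal(size=(4, 1))   # weights: hidden layer -> output layer
b2 = np.zeros((1, 1))          # output layer bias
lr = 0.5                       # learning rate (step size)

losses = []
for epoch in range(2000):
    # Forward propagation: data flows input -> hidden -> output
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    losses.append(np.mean((y_hat - y) ** 2))

    # Backward propagation: error flows output -> hidden (chain rule)
    d_out = 2 * (y_hat - y) / len(X) * y_hat * (1 - y_hat)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0, keepdims=True)
    d_hidden = (d_out @ W2.T) * h * (1 - h)
    dW1 = X.T @ d_hidden
    db1 = d_hidden.sum(axis=0, keepdims=True)

    # Gradient descent update, applied once per full-batch pass
    W2 -= lr * dW2
    b2 -= lr * db2
    W1 -= lr * dW1
    b1 -= lr * db1

print("first loss:", losses[0], "last loss:", losses[-1])
```

Because the whole dataset is used as a single batch here, each loop iteration is both one iteration and one epoch.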

## Questions of Understanding


1. Name 2 activation functions and when they might be used
```
#https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
 - Neural networks applied to a binary classification problem might use a sigmoid function as the output activation in order to squash values into the range 0 to 1 so that they represent a probability.
 - Neural networks applied to multiclass classification problems might have multiple output nodes in the output layer, one for each class that we're trying to predict. This output layer would probably employ what's called a "softmax function", which turns the outputs into probabilities that sum to 1.
 ```
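
As a rough illustration of the two activation functions named in the answer above, the sketch below implements sigmoid and softmax with NumPy. The example scores are made up.

```python
import numpy as np

def sigmoid(z):
    """Squash any real value into (0, 1), usable as a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = z - np.max(z)   # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

binary_score = 0.0                       # hypothetical raw output of one node
class_scores = np.array([2.0, 1.0, 0.1]) # hypothetical raw outputs, one per class

p_binary = sigmoid(binary_score)
p_classes = softmax(class_scores)

print("sigmoid:", p_binary)
print("softmax:", p_classes, "sum:", p_classes.sum())
```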


2. What types of machine learning problems are neural networks best suited for?
```
Neural networks are best for situations where the data is “high-dimensional.” For example, a medium-size image file may have 1024 x 768 pixels. Each pixel contains 3 values for the intensity of red, green, and blue at that point in the image. All told, this is 1024 x 768 x 3 = 2,359,296 values. Each one of these values is a separate dimension and a separate input to a neuron at the start of the network.
```


3. In a linear regression problem, we can attempt to account for nonlinear features with polynomial features. What problem do we encounter as our feature size increases? How does a neural network avoid/address this issue?
```
In order to fit really curvy nonlinear patterns in complex, high-dimensional feature spaces, a linear or logistic regression model would need a number of polynomial terms that grows combinatorially with the number of features. The interactions between layers of neurons in a neural network account for that combinatorial explosion within the structure of the algorithm, learning the combinations as needed instead of us having to provide them beforehand.
```
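
The combinatorial explosion mentioned above can be counted directly. The sketch below uses the standard result that the number of polynomial terms of total degree at most `d` in `n` features, including the constant term, is C(n + d, d). The function name is made up for this example.

```python
from math import comb

def n_polynomial_terms(n_features, degree):
    """Count monomials of total degree <= degree, including the constant term."""
    return comb(n_features + degree, degree)

# 6 features matches the Titanic data in this guide; 100 is a modest
# "high-dimensional" case chosen for illustration.
for n in (2, 6, 100):
    counts = [n_polynomial_terms(n, d) for d in (1, 2, 3)]
    print(f"{n} features -> terms at degree 1, 2, 3: {counts}")
```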


4. What are some of the tradeoffs of using a neural network versus a traditional machine learning algorithm like linear regression or a decision tree?
```
STRENGTHS (linear regression): Linear regression is straightforward to understand and explain, and can be regularized to avoid overfitting. In addition, linear models can be updated easily with new data using stochastic gradient descent.
WEAKNESSES (linear regression): Linear regression performs poorly when there are non-linear relationships. Linear models are not naturally flexible enough to capture more complex patterns, and adding the right interaction terms or polynomials can be tricky and time-consuming.
STRENGTHS (neural networks): Deep neural networks perform very well on image, audio, and text data, and they can be easily updated with new data by continuing to train on new batches with backpropagation. Their architectures (i.e. number and structure of layers) can be adapted to many types of problems, and their hidden layers reduce the need for feature engineering.
WEAKNESSES (neural networks): Deep learning algorithms are usually not suitable as general-purpose algorithms because they require a very large amount of data. In fact, they are usually outperformed by tree ensembles for classical machine learning problems. In addition, they are computationally intensive to train, and they require much more expertise to tune (i.e. set the architecture and hyperparameters).
```


5. What determines the size of the input layer?
```
The number of features determines the size of the input layer: there is one input node for each of the different inputs/features/columns of our dataset that will be passed to the network.
```

## Perceptrons

Use the starter code below to build a perceptron, with just numpy, to predict whether a passenger survived or not. You may reduce the number of features for X to fit code you have already worked on throughout the week, but it is recommended that you modify the code instead.

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('https://raw.githubusercontent.com/bundickm/Study-Guides/master/data/titanic.csv')
print('Shape:', df.shape, '\n')
df.head()

Shape: (887, 7) 



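|   | Survived | Pclass | Sex | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|----------|--------|-----|-----|-------------------------|-------------------------|------|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1 |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.05 |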


In [2]:
X = np.array(df.drop(columns='Survived'))
y = df['Survived']

In [3]:
df['Survived'].value_counts(normalize=True)

0    0.614431
1    0.385569
Name: Survived, dtype: float64

In [20]:
df.isnull().sum()

Survived                   0
Pclass                     0
Sex                        0
Age                        0
Siblings/Spouses Aboard    0
Parents/Children Aboard    0
Fare                       0
dtype: int64
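
For reference, a perceptron built with just NumPy could look roughly like the sketch below. It uses the classic perceptron learning rule with a step activation and is trained on a toy AND-gate dataset so that it is self-contained. The class name, the toy data, and the settings are illustrative assumptions; for the Titanic data you would pass `X` and `y` instead, ideally after scaling the features.

```python
import numpy as np

class Perceptron:
    """A single neuron: weighted sum plus bias, followed by a step function."""

    def __init__(self, n_features, lr=1.0):
        self.w = np.zeros(n_features)  # one weight per input feature
        self.b = 0.0                   # bias
        self.lr = lr                   # learning rate

    def predict(self, X):
        # "Fire" (output 1) only when the weighted sum exceeds the threshold 0
        return (X @ self.w + self.b > 0).astype(int)

    def fit(self, X, y, epochs=20):
        for _ in range(epochs):
            for xi, target in zip(X, y):
                error = target - self.predict(xi)
                # Perceptron rule: nudge weights toward the correct answer
                self.w += self.lr * error * xi
                self.b += self.lr * error
        return self

# Toy, linearly separable data (logical AND), used only for illustration
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y_toy = np.array([0, 0, 0, 1])

clf = Perceptron(n_features=2).fit(X_toy, y_toy)
print(clf.predict(X_toy))
```

The perceptron rule only converges when the classes are linearly separable, which is one of the limits the short-answer question later in this section asks about.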

Create a multilayer perceptron with back propagation, with just numpy, and apply it to the same data set.

In [5]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping

In [7]:
#Multilayer Perceptron
model = Sequential()
model.add(Dense(10, input_dim=6, activation = "relu"))
model.add(Dense(1, activation = "sigmoid")) # binary output

In [11]:
model.compile(optimizer = "adam", loss = "binary_crossentropy",  metrics = ["accuracy"] )

In [12]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 10)                70        
_________________________________________________________________
dense_1 (Dense)              (None, 10)                110       
_________________________________________________________________
dense_2 (Dense)              (None, 1)                 11        
Total params: 191
Trainable params: 191
Non-trainable params: 0
_________________________________________________________________


In [14]:
from sklearn.model_selection import train_test_split

In [15]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y, 
    test_size = 0.20, 
    random_state = 42
)

In [16]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((709, 6), (178, 6), (709,), (178,))

In [19]:
model.fit(X_train, y_train, 
          validation_data=(X_test, y_test),
          epochs = 99, 
          batch_size = 10)

Epoch 1/99
...
Epoch 84/99
...

<tensorflow.python.keras.callbacks.History at 0x7ff987542940>

In [21]:
#Single Layer Perceptron with Keras

model_single = Sequential()
model_single.add(Dense(1, input_dim=6, activation='sigmoid'))

In [22]:
model_single.compile(optimizer = "adam", loss = "binary_crossentropy",  metrics = ["accuracy"] )

In [24]:
model_single.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 1)                 7         
Total params: 7
Trainable params: 7
Non-trainable params: 0
_________________________________________________________________


In [25]:
model_single.fit(X_train, y_train, 
          validation_data=(X_test, y_test),
          epochs = 99, 
          batch_size = 10)

Epoch 1/99
...
Epoch 84/99
...

<tensorflow.python.keras.callbacks.History at 0x7ff982dbf198>

*In a short paragraph, answer the following:*

Why does the multilayer perceptron perform better than the simple perceptron? What limits the simple perceptron? What aspects of the multilayer perceptron allow it to overcome those limitations?

```
Your Answer Here
```

# Keras

## Definitions

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*

**Early Stopping:** `Early stopping halts training when the model stops improving on a monitored metric, such as validation loss. This saves time and money and helps prevent overfitting.`

**Weight Decay:** `Prevents overfitting by regularizing the parameters: a penalty on the size of the weights is added to the loss, which pushes the weights toward smaller values.`

**Dropout:** `The dropout rate is the percentage of neurons that you want to be randomly deactivated during training. This means that their contribution to the activation of downstream neurons is temporarily removed on the forward pass and any weight updates are not applied to the neuron on the backward pass. Dropout can be applied to either the visible (input) layer or the hidden layers.`
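
A rough NumPy sketch of dropout as defined above follows. It uses the common "inverted dropout" variant, which rescales the surviving activations so that their expected value is unchanged. The function name, the 20% rate, and the all-ones activations are assumptions made for illustration.

```python
import numpy as np

def dropout(activations, rate, rng, training=True):
    """Inverted dropout: zero a random fraction `rate` of units, rescale the rest."""
    if not training or rate == 0.0:
        return activations             # dropout is only active during training
    keep_prob = 1.0 - rate
    mask = rng.random(activations.shape) < keep_prob  # True = neuron kept
    return activations * mask / keep_prob

rng = np.random.default_rng(42)
hidden = np.ones((1000, 10))           # hypothetical hidden-layer activations

dropped = dropout(hidden, rate=0.2, rng=rng)
print("fraction of activations zeroed:", np.mean(dropped == 0))
print("mean activation after dropout:", dropped.mean())
```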

<br/>
The following are hyperparameters:

**Activation Functions:** `Activation functions are mathematical equations that determine the output of each neuron in a neural network. The type of activation function depends on the outputs you are looking for, such as sigmoid for binary classification and softmax for multiclass classification.`

**Optimizer** `Optimizers are algorithms or methods used to change the attributes of the neural network, such as weights and learning rate, to reduce the loss. Optimizers solve an optimization problem by minimizing the loss function.`

**Number of Layers** `This affects the depth of the neural network`

**Number of Neurons** `This is the number of neurons in a layer. This affects the width of the network`

**Batch Size** `The number of examples used in an iteration`

**Dropout Regularization** `This randomly drops a set percentage of the neurons in a single layer of the neural network during training`

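**Learning Rate** `A scalar that sets the step size of each weight update when training a model via gradient descent`

**Number of Epochs** `The number of complete passes over the training data. One epoch is made up of many iterations, one per batch`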
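
Batch size, iterations, and epochs are easy to mix up, so here is a small worked example. It takes the 709 training rows, the batch size of 10, and the 99 epochs from the `fit` calls earlier in this guide, and assumes the final, smaller batch of each epoch is still used.

```python
import math

n_train = 709      # training rows after the 80/20 split used earlier
batch_size = 10    # examples per iteration (one weight update)
epochs = 99        # complete passes over the training data

# One iteration per batch; the last batch of an epoch may be smaller
iterations_per_epoch = math.ceil(n_train / batch_size)
total_iterations = iterations_per_epoch * epochs

print("iterations per epoch:", iterations_per_epoch)
print("total weight updates:", total_iterations)
```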

## Questions of Understanding


1. Why is it recommended to normalize your input data?
```
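Normalization is the process of converting an actual range of values into a standard range of values, typically -1 to +1 or 0 to 1. It is also known as scaling. It is recommended because features on very different scales (for example Age versus Fare) let the largest feature dominate the weighted sums, which makes gradient descent slow or unstable. Putting the inputs on a common scale helps the network train faster and more reliably.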
```
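
One common way to do this is standardization, sketched below with NumPy. The array holds the Age and Fare values from the five preview rows shown earlier and stands in for the full feature matrix. In practice the mean and standard deviation would be computed on the training split only and then reused for the test split.

```python
import numpy as np

# Age and Fare from the first five rows of the dataset preview
X_raw = np.array([
    [22.0, 7.25],
    [38.0, 71.2833],
    [26.0, 7.925],
    [35.0, 53.1],
    [35.0, 8.05],
])

mean = X_raw.mean(axis=0)          # per-column mean
std = X_raw.std(axis=0)            # per-column standard deviation
X_scaled = (X_raw - mean) / std    # each column now has mean 0 and std 1

print("column means after scaling:", X_scaled.mean(axis=0))
print("column stds after scaling: ", X_scaled.std(axis=0))
```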


2. How do you go about deciding on your neural network's architecture?
```
The most common approach seems to be to start with a rough guess based on prior experience about networks used on similar problems. 
- Create a network with hidden layers similar size order to the input, and all the same size, on the grounds that there is no particular reason to vary the size (unless you are creating an autoencoder perhaps).
- Start simple and build up complexity to see what improves a simple network.
- Try varying depths of network if you expect the output to be explained well by the input data, but with a complex relationship (as opposed to just inherently noisy).
- Try adding some dropout, it's the closest thing neural networks have to magic fairy dust that makes everything better (caveat: adding dropout may improve generalisation, but may also increase required layer sizes and training times).

You can use grid / randomized search to help choose the best parameters for your model 

```


3. Why is regularization important with neural networks?
```
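Neural networks have many parameters and can easily memorize the training data. Regularization keeps you from overfitting your model to the training data, so that it generalizes better to data it has not seen.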
```
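
As one concrete form of regularization, the sketch below shows an L2 penalty (weight decay) inside a single gradient descent step. The function names, the weights, and the values of `lr` and `lam` are hypothetical. The data gradient is set to zero so that the shrinking effect of the penalty is visible on its own.

```python
import numpy as np

def l2_penalty(weights, lam):
    """Extra loss term that grows with the size of the weights."""
    return lam * np.sum(weights ** 2)

def sgd_step_with_weight_decay(weights, grad, lr, lam):
    """One update: data gradient plus the gradient of the L2 penalty (2 * lam * w)."""
    return weights - lr * (grad + 2 * lam * weights)

w = np.array([3.0, -2.0, 0.5])   # hypothetical weights
zero_grad = np.zeros_like(w)     # pretend the data gradient is zero

w_next = sgd_step_with_weight_decay(w, zero_grad, lr=0.1, lam=0.5)
print("penalty:", l2_penalty(w, lam=0.5))
print("weights before:", w)
print("weights after: ", w_next)  # every weight is pulled toward zero
```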


4. What does `validation_data` do?
```
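Validation data is where you check the model you created to see whether its parameters work for predicting on data it was not trained on. It is the sample of data used to provide an unbiased evaluation of a model fit on the training dataset while tuning model hyperparameters. In Keras, the `validation_data` argument of `fit` takes this sample; the loss and metrics are evaluated on it at the end of each epoch, and it is not used to update the weights.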
```


5. Why is hyperparameter tuning so important with neural networks?
```
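Neural networks have many hyperparameters (learning rate, batch size, number of layers and neurons, dropout, and so on), and performance can be very sensitive to them. Hyperparameter tuning lets you search for the combination that gives the best possible predictions.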
```

## Modeling

Using the same dataset as above, use Keras to build a model and find its accuracy.

In [52]:
from tensorflow.keras.wrappers.scikit_learn import KerasClassifier
from tensorflow.keras.optimizers import Adam
from sklearn.model_selection import RandomizedSearchCV

In [53]:
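def create_model(lr=0.001):
  # Build the optimizer with the learning rate being searched over
  adam = Adam(learning_rate=lr)
  model = Sequential()
  model.add(Dense(32, input_dim = 6, activation = "relu"))
  model.add(Dense(1, activation ="sigmoid"))

  #Compile model
  # Pass the Adam instance (not the string "adam") so that lr takes effect
  model.compile(optimizer=adam, loss = "binary_crossentropy", metrics=["accuracy"])

  return model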

In [54]:
hyper_model = KerasClassifier(build_fn=create_model, verbose=0)

Build upon the model you created in the cell above by adding hyperparameter tuning.

In [55]:
param_grid = {
    'lr': [.001, .01, .1],
    'batch_size' : [10, 20, 40],
    'epochs' : [25]
}

In [56]:
rand = RandomizedSearchCV(estimator=hyper_model, param_distributions=param_grid, n_jobs=-1) 
#param_distributions for RandomSearch
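
It can help to see how many combinations this search space contains. The standard-library sketch below enumerates them, using the same `param_grid` as above. With `n_iter` left at its scikit-learn default of 10 and only nine combinations available, the randomized search should end up trying every combination, which matches the nine result lines reported below.

```python
from itertools import product

param_grid = {
    'lr': [.001, .01, .1],
    'batch_size': [10, 20, 40],
    'epochs': [25],
}

# Cartesian product of all the candidate values, one dict per combination
keys = list(param_grid)
combinations = [dict(zip(keys, values)) for values in product(*param_grid.values())]

print("number of combinations:", len(combinations))
for combo in combinations:
    print(combo)
```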

In [57]:
rand_result = rand.fit(X_train, y_train)



In [58]:
# Report Results
print(f"Best: {rand_result.best_score_} using {rand_result.best_params_}")
search = rand_result.best_params_
means = rand_result.cv_results_['mean_test_score']
stds = rand_result.cv_results_['std_test_score']
params = rand_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print(f"Means: {mean}, Stdev: {stdev} with: {param}") 

Best: 0.7983018755912781 using {'lr': 0.001, 'epochs': 25, 'batch_size': 10}
Means: 0.7983018755912781, Stdev: 0.04334258986925524 with: {'lr': 0.001, 'epochs': 25, 'batch_size': 10}
Means: 0.7729097962379455, Stdev: 0.03898360840933016 with: {'lr': 0.01, 'epochs': 25, 'batch_size': 10}
Means: 0.7926480889320373, Stdev: 0.03052511033474625 with: {'lr': 0.1, 'epochs': 25, 'batch_size': 10}
Means: 0.7503645896911622, Stdev: 0.037340060711998975 with: {'lr': 0.001, 'epochs': 25, 'batch_size': 20}
Means: 0.7559484481811524, Stdev: 0.030157838676970496 with: {'lr': 0.01, 'epochs': 25, 'batch_size': 20}
Means: 0.7757067203521728, Stdev: 0.04458786182081137 with: {'lr': 0.1, 'epochs': 25, 'batch_size': 20}
Means: 0.7291579246520996, Stdev: 0.023444612913861986 with: {'lr': 0.001, 'epochs': 25, 'batch_size': 40}
Means: 0.7220956921577454, Stdev: 0.031201034065029644 with: {'lr': 0.01, 'epochs': 25, 'batch_size': 40}
Means: 0.7164818644523621, Stdev: 0.017233620244993975 with: {'lr': 0.1, 'epoc

In [59]:
from sklearn.metrics import accuracy_score

In [60]:
final_results = rand_result.best_estimator_.predict(X_test)



Find the accuracy of the tuned model.

In [61]:
accuracy_score(y_test, final_results)

0.7359550561797753
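
To interpret an accuracy score it helps to compare it with a majority-class baseline. In the full dataset about 61.4% of passengers did not survive, so always predicting "did not survive" would already score roughly 0.61 (the exact share in the test split may differ). The sketch below shows the calculation by hand with NumPy on made-up probabilities and labels.

```python
import numpy as np

# Hypothetical sigmoid outputs and true labels, for illustration only
probs = np.array([0.91, 0.12, 0.55, 0.40, 0.08])
y_true = np.array([1, 0, 1, 1, 0])

# Threshold the probabilities at 0.5 to get class predictions
y_pred = (probs > 0.5).astype("int32")

# Accuracy is the fraction of predictions that match the true labels
accuracy = np.mean(y_pred == y_true)

# Baseline: always predict whichever class is more common
majority_baseline = max(np.mean(y_true == 0), np.mean(y_true == 1))

print("predictions:", y_pred)
print("accuracy:", accuracy)
print("majority-class baseline:", majority_baseline)
```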

In a short paragraph, explain how the hyperparameters impacted the accuracy of your model.

```
Your Answer Here
```