### Neural networks account for interactions well (better than linear regressions)

- input layer - predictive features
- output layer - outcome
- hidden layers - nothing we observe directly - black box
- each node represents an aggregation of data from input layer
- the more nodes, the more interactions we capture

- Forward propagation - an algorithm that moves from the inputs on the left, through the hidden layers, to the output
- at each node: multiply the inputs by their weights and add them up - a dot product

- When we fit the model, training changes the weights
- Forward propagation is generally done for one data point at a time

In [1]:
import numpy as np

In [9]:
#Forward Propagation Code

input_data = np.array([2,3])

weights = {'node_0':np.array([1,1]),
           'node_1':np.array([-1, 1]),
           'output':np.array([2,-1])}

node_0_value = (input_data * weights['node_0']).sum()
node_1_value = (input_data * weights['node_1']).sum()


hidden_layer_values  = np.array([node_0_value, node_1_value])
print(hidden_layer_values)

output = (hidden_layer_values * weights['output']).sum()
print(output)

[5 1]
9
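
The same forward pass can also be written as two dot products by stacking the hidden-node weights into one matrix. This is a minimal sketch using the weights from the cell above; the names `hidden_weights` and `output_weights` are introduced here only for illustration.

```python
import numpy as np

input_data = np.array([2, 3])

# Each row holds the weights feeding one hidden node (node_0, then node_1)
hidden_weights = np.array([[1, 1],
                           [-1, 1]])
output_weights = np.array([2, -1])

# One matrix-vector product computes every hidden node at once
hidden_layer_values = hidden_weights @ input_data
output = output_weights @ hidden_layer_values

print(hidden_layer_values)  # [5 1]
print(output)               # 9
```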


- Activation function (in hidden layers) - allows the model to capture non-linearities
- Necessary for relations that are not linear (not straight lines)
- applied to node inputs to produce node output

- tanh - used to be most popular
- today, the industry standard is "relu", the rectified linear activation function, also known as the rectified linear unit (ReLU)

 ReLU(x) = {0 if x < 0,
            x if x >= 0}

- Identity function - a node's output will be the same as its input 
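
A small sketch of the two activation functions described above, written with NumPy. The helper names `relu` and `identity` and the sample inputs are chosen here for illustration.

```python
import numpy as np

def relu(x):
    """Rectified linear activation: 0 for negative inputs, x otherwise."""
    return np.maximum(0, x)

def identity(x):
    """Identity activation: the output is the same as the input."""
    return x

# Sample node inputs: one negative, one zero, one positive
node_inputs = np.array([-3.0, 0.0, 2.5])

relu_outputs = relu(node_inputs)
identity_outputs = identity(node_inputs)

print(relu_outputs)
print(identity_outputs)
```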

In [13]:
#Forward Propagation Code WITH Activation Function included

input_data = np.array([2,3])

weights = {'node_0':np.array([1,1]),
           'node_1':np.array([-1, 1]),
           'output':np.array([2,-1])}

node_0_input = (input_data * weights['node_0']).sum()
node_1_input = (input_data * weights['node_1']).sum()

node_0_output = np.tanh(node_0_input)
node_1_output = np.tanh(node_1_input)


hidden_layer_outputs  = np.array([node_0_output, node_1_output])
print(hidden_layer_outputs)

output = (hidden_layer_outputs * weights['output']).sum()
print(output)

[0.9999092  0.76159416]
1.2382242525694254


- Neural networks partially replace the need for feature engineering

- Deep Learning = Representation Learning
- Modeler doesn't need to specify interactions
- When you train the model, the neural network gets weights that find the relevant patterns
    to make better predictions
    
- Error = Predicted - Actual (the target)

 ### Loss Function - function that aggregates errors in prediction from many data points into a single number

- For Example: Loss function for Linear Regression: Mean Squared Error (MSE)
- You take the error for each prediction, square it, and then take the mean of all of them 
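
As a quick illustration of the recipe above (error, square, mean), here is MSE computed with NumPy. The prediction and target values are made up for the example.

```python
import numpy as np

# Hypothetical predictions and targets, for illustration only
predicted = np.array([3.0, 5.0, 2.0])
actual = np.array([2.0, 5.0, 4.0])

errors = predicted - actual        # error for each prediction
squared_errors = errors ** 2       # square each error
mse = squared_errors.mean()        # take the mean of all of them

print(mse)
```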
        
        
 ### To find the weights that give the lowest value for the loss function, we use:
- Gradient Descent:
    - start at a random point
    - until the slope is flat (equal to zero), find the slope and take a step downhill
    - rather than stepping by the raw slope, each step is scaled by a learning rate
    - learning rate: each weight is updated by subtracting learning rate * slope
        - frequently around 0.01 - ensures we take small steps so we move reliably toward the weights that minimize the loss
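
The loop described above can be sketched as repeated slope-and-step updates. This is a minimal sketch on a single data point (the same numbers as the slope-update cell in this section); the 20-step limit is an arbitrary choice for the example.

```python
import numpy as np

weights = np.array([1.0, 2.0])
input_data = np.array([3.0, 4.0])
target = 6.0
learning_rate = 0.01

errors = []
for step in range(20):
    # Forward pass and error for the current weights
    error = (weights * input_data).sum() - target
    errors.append(error)

    # Slope of the squared error with respect to each weight
    slope = 2 * input_data * error

    # Step downhill: subtract learning rate * slope
    weights = weights - learning_rate * slope

final_error = (weights * input_data).sum() - target
print(errors[:3])
print(final_error)
```

With these numbers the error shrinks on every step, so the weights settle where the slope is close to flat.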

In [9]:
#Code to calculate the slopes and then update weights

weights = np.array([1,2])
input_data = np.array([3,4])
target = 6
learning_rate = .01

preds = (weights * input_data).sum()
error = preds - target

print(error)

gradient = 2 * input_data * error
gradient

#The gradient is like a derivative, but evaluated at one specific point (the current weights), not over the whole function
#"error" here is predicted - actual
#For squared error the slope for each weight is 2 * input * (predicted - actual) - the 2 comes from the derivative of f(x) = x^2 being f'(x) = 2x

5


array([30, 40])

In [30]:
weights_updated = weights - learning_rate * gradient

preds_updated = (weights_updated * input_data).sum()

error_updated = preds_updated - target

In [31]:
print(f"""\tWeights_Updated:{weights_updated},
        Preds_Updated:\t{preds_updated},
        Error_Updated:\t{error_updated}
       """)

	Weights_Updated:[0.7 1.6],
        Preds_Updated:	8.5,
        Error_Updated:	2.5
       


### Backpropagation - takes error from output layer and propagates backwards towards input layer
-    allows gradient descent to update all weights in the neural network (by getting gradients for all of the weights)
-    comes from chain rule in calculus

- We always do forward propagation before backpropagation
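
A hedged sketch of backpropagation on the small tanh network used earlier in these notes. The target value of 1.0 and the helper names are assumptions made for the example; the finite-difference check at the end is only there to confirm the chain-rule gradients.

```python
import numpy as np

def forward(hidden_weights, output_weights, x):
    """Forward propagation: returns hidden outputs and the prediction."""
    hidden_out = np.tanh(hidden_weights @ x)
    return hidden_out, output_weights @ hidden_out

def loss(hidden_weights, output_weights, x, target):
    """Squared error for a single data point."""
    _, pred = forward(hidden_weights, output_weights, x)
    return (pred - target) ** 2

x = np.array([2.0, 3.0])
hidden_weights = np.array([[1.0, 1.0], [-1.0, 1.0]])
output_weights = np.array([2.0, -1.0])
target = 1.0  # hypothetical target, not from the original notes

# Forward propagation always comes first
hidden_out, pred = forward(hidden_weights, output_weights, x)

# Backpropagation: chain rule from the output back toward the input
d_pred = 2 * (pred - target)                        # dLoss/dPrediction
grad_output_weights = d_pred * hidden_out           # slopes for output weights
d_hidden_out = d_pred * output_weights              # error sent back to hidden nodes
d_hidden_in = d_hidden_out * (1 - hidden_out ** 2)  # tanh'(z) = 1 - tanh(z)^2
grad_hidden_weights = np.outer(d_hidden_in, x)      # slopes for hidden weights

# Finite-difference check of the hidden-weight slopes
eps = 1e-6
base = loss(hidden_weights, output_weights, x, target)
numeric = np.zeros_like(hidden_weights)
for i in range(2):
    for j in range(2):
        bumped = hidden_weights.copy()
        bumped[i, j] += eps
        numeric[i, j] = (loss(bumped, output_weights, x, target) - base) / eps

print(grad_hidden_weights)
print(np.allclose(grad_hidden_weights, numeric, atol=1e-4))
```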

### Known as Stochastic Gradient Descent
- It is common to calculate the slopes on only a subset of the data - called a "batch"
- Uses a different random batch to calculate each update ("stochastic" comes from the Greek for "to aim/guess"; the term goes back to a book by Jacob Bernoulli, a Swiss mathematician)
    - Once all the data has been used, we start again from the beginning
    - Each pass through all the training data is called an epoch
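
The batch and epoch loop can be sketched in plain NumPy. This is an illustration on synthetic data with a linear model, not the Keras implementation; the data, batch size, learning rate, and epoch count are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: targets are an exact linear function of the inputs
X = rng.normal(size=(100, 2))
true_weights = np.array([1.5, -2.0])
y = X @ true_weights

weights = np.zeros(2)
learning_rate = 0.05
batch_size = 10
n_epochs = 50

for epoch in range(n_epochs):
    # One epoch = one pass through all the data, in a fresh random order
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = order[start:start + batch_size]
        error = X[batch] @ weights - y[batch]
        # Slope of the MSE computed on this batch only
        slope = 2 * X[batch].T @ error / len(batch)
        weights = weights - learning_rate * slope

print(weights)
```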

## KERAS Model Building Steps
1. Specify Architecture - how many layers, how many nodes, activation function
2. Compile - specifies loss function, and details on optimization
3. Fit the model - Cycle of backpropagation and optimization of weights
4. Make predictions

In [17]:
#0. Getting all the necessary packages
import numpy as np
import pandas as pd
from keras.layers import Dense
from keras.models import Sequential

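#Show every column and row when displaying DataFrames
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

#Loading data (at this point the frame still includes the target column wage_per_hour)
predictors = pd.read_csv('wages.csv')

#Number of columns in the input -> number of nodes in the input layer
n_cols = predictors.shape[1]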

In [4]:
predictors.head()

Unnamed: 0,wage_per_hour,union,education_yrs,experience_yrs,age,female,marr,south,manufacturing,construction
0,5.1,0,8,21,35,1,1,0,1,0
1,4.95,0,9,42,57,1,1,0,1,0
2,6.67,0,12,1,19,0,0,0,1,0
3,4.0,0,12,4,22,0,0,0,0,0
4,7.5,0,12,17,35,0,1,0,0,0


In [5]:
#1. Model Specification - Sequential is the easier way to build a model
#Sequential requires that each layer has weights/connections only to the one layer coming directly after it
model = Sequential()

#We add layers - the standard layer is called Dense - all nodes in the previous layer connect to all nodes in the current layer

#For each layer, specify the number of nodes and the activation function
#input_shape=(n_cols,) means each row has n_cols columns; nothing after the comma means any number of rows/data points is allowed
model.add(Dense(100, activation='relu', input_shape=(n_cols,)))
model.add(Dense(100, activation='relu'))

#Output layer has one node - the predicted outcome
model.add(Dense(1))

#It is common to use 100s or 1000s of nodes in a layer



In [7]:
#2. Compiling - two key arguments: optimizer & loss function
#Adam is a good default choice - it adapts the learning rate as gradient descent proceeds

model.compile(optimizer='adam', loss='mean_squared_error')

In [12]:
#3. Fit the model - apply backpropagation and gradient descent to update weights
#Scaling the data so each feature is, on average, a similar size can improve optimization
#Common approach: subtract the feature's mean from each value, then divide by its standard deviation

#Won't actually work unless the target is in the form of a NumPy array
#model.fit(predictors, target)
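
The scaling approach mentioned above can be sketched with NumPy. The sample values are the `education_yrs`, `experience_yrs`, and `age` columns from the five rows shown by `predictors.head()`; the variable names are introduced here for illustration.

```python
import numpy as np

# education_yrs, experience_yrs, age for the five rows shown by predictors.head()
features = np.array([[8, 21, 35],
                     [9, 42, 57],
                     [12, 1, 19],
                     [12, 4, 22],
                     [12, 17, 35]], dtype=float)

# Subtract each column's mean, then divide by its standard deviation
scaled = (features - features.mean(axis=0)) / features.std(axis=0)

print(scaled.mean(axis=0))  # each column is now centered near 0
print(scaled.std(axis=0))   # each column now has standard deviation 1
```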

In [None]:
#When doing a CLASSIFICATION prediction rather than a regression:
- Set the loss function to "categorical_crossentropy" - the most common choice - similar to log loss but not the same
- Lower score is better
- adding argument metrics= ['accuracy'] helps provide easy-to-understand diagnostics
- Output layer needs to have a separate node for each possible outcome and uses "softmax" activation
    - softmax ensures that the predictions sum to 1 so they can be interpreted as probabilities 
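
A minimal NumPy sketch of softmax and a categorical cross-entropy loss for one data point. The raw output-node scores and the helper names are assumptions made for the example; this is not the Keras implementation.

```python
import numpy as np

def softmax(scores):
    """Turn raw output-node values into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log of the probability given to the true class."""
    return -np.sum(one_hot_target * np.log(probabilities))

scores = np.array([2.0, 1.0, 0.1])            # hypothetical raw outputs
one_hot_target = np.array([1.0, 0.0, 0.0])    # true class is the first one

probabilities = softmax(scores)
loss = categorical_crossentropy(one_hot_target, probabilities)

print(probabilities, probabilities.sum())
print(loss)  # lower is better
```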

In [31]:
predictors.shape[1]

9

In [33]:
#CODING EXAMPLE

#One-hot encoding converts the outcome values into individual columns, one per category

from keras.utils import to_categorical

data = pd.read_csv('wages.csv')


#.values returns the underlying NumPy array (as_matrix() was removed in pandas 1.0)
predictors = data.drop(['wage_per_hour'], axis=1).values
n_cols = predictors.shape[1]
target = to_categorical(data['wage_per_hour'])


model = Sequential()

model.add(Dense(100, activation='relu', input_shape=(n_cols,)))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))
model.add(Dense(100, activation='relu'))

#Number of output nodes is the number of classification categories
model.add(Dense(45, activation='softmax'))

model.compile(optimizer='adam', loss='categorical_crossentropy',
              metrics=['accuracy'])


model.fit(predictors, target)



Epoch 1/1


<keras.callbacks.callbacks.History at 0x2207531ffd0>

In [21]:
predictors

array([[ 0,  8, 21, ...,  0,  1,  0],
       [ 0,  9, 42, ...,  0,  1,  0],
       [ 0, 12,  1, ...,  0,  1,  0],
       ...,
       [ 1, 17, 25, ...,  0,  0,  0],
       [ 1, 12, 13, ...,  1,  0,  0],
       [ 0, 16, 33, ...,  0,  1,  0]], dtype=int64)

In [20]:
target

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
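
What a one-hot target looks like can be sketched in plain NumPy. This uses made-up integer class labels and is an illustration of the idea, not the Keras `to_categorical` implementation.

```python
import numpy as np

# Hypothetical integer class labels, for illustration only
labels = np.array([0, 2, 1, 2])
n_classes = labels.max() + 1

# Row i of the identity matrix is the one-hot vector for class i
one_hot = np.eye(n_classes)[labels]

print(one_hot)
```

Each row has a single 1 in the column of its class, which is why the printed target above is almost entirely zeros.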

In [None]:
#Be able to SAVE, RELOAD, and PREDICT with your model

#SAVING and RELOADING the model
from keras.models import load_model

#HDF5 (.h5) is the file format the model is saved in

model.save('first_model.h5')

my_model = load_model('first_model.h5')

#data_to_predict_with is a placeholder for new rows with the same columns as the training predictors
predictions = my_model.predict(data_to_predict_with)

#Second column of the predictions: the probability of the True class (the shot being made)
probability_true = predictions[:,1]

In [None]:
#OPTIMIZATION IS HARD
#SGD - Stochastic Gradient Descent - is the safest optimizer
from keras.optimizers import SGD

#The learning rate can be too low or too high
#We can specify the learning rate ourselves
lr = 0.01  #example value
my_optimizer = SGD(learning_rate=lr)  #older Keras releases spell this argument lr

#Dying Neuron Problem
#A node takes a value of less than zero for all rows of your data
#Once a node starts always getting negative inputs, it may continue to
#    only get negative inputs - thus it contributes nothing to the
#    model and is "dead"

#But the solution is not simply to move away from "relu", which makes
#    any negative value (and its slope) zero

#For many years, people used the tanh function instead
#But in deep networks, multiplying many small slopes drives the product
#    towards zero, known as the vanishing gradients problem
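
A small numeric sketch of both problems. The input value of 2.0 and the depth of 10 layers are arbitrary choices for the illustration.

```python
import numpy as np

def tanh_slope(z):
    """Derivative of tanh: 1 - tanh(z)^2, always at most 1."""
    return 1 - np.tanh(z) ** 2

def relu_slope(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

z = 2.0
n_layers = 10

# Vanishing gradients: multiplying many small tanh slopes heads toward zero
tanh_product = tanh_slope(z) ** n_layers

# ReLU keeps a slope of 1 for positive inputs, so the product does not shrink
relu_product = relu_slope(z) ** n_layers

# Dying neuron: a node that only sees negative inputs has slope 0 and stops learning
dead_slope = relu_slope(-1.0)

print(tanh_product, relu_product, dead_slope)
```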



In [None]:
#MODEL VALIDATION - for deep learning, people rarely do k-fold validation - it takes too long
model.fit(predictors, target, validation_split=.3) #fraction of the data held out for validation

#We should stop training when the validation score is no longer improving
#We can use Early Stopping to do this
#We create the monitor before we fit our model

from keras.callbacks import EarlyStopping

early_stopping_monitor = EarlyStopping(patience=2) #how many epochs the model can go without improving before stopping

#A patience of 2 or 3 is typical: after that many epochs without improvement, further training is unlikely to help

#The fit call earlier in this notebook ran a single epoch by default ("Epoch 1/1"), so we raise the limit; early stopping ends training sooner if needed
model.fit(predictors, target, validation_split=.3, epochs=20,
          verbose=False, #prints out fewer updates
          callbacks=[early_stopping_monitor]) #callbacks takes a list - other callbacks can be added for more advanced uses
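
The patience rule can be sketched in plain Python. This is a rough illustration of the idea, not the Keras source; the per-epoch validation losses are hypothetical.

```python
# Hypothetical validation losses, one per epoch (lower is better)
val_losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.59, 0.58]
patience = 2

best_loss = float("inf")
epochs_without_improvement = 0
stopped_epoch = None

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        # New best score: remember it and reset the counter
        best_loss = val_loss
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            # No improvement for `patience` epochs in a row: stop training
            stopped_epoch = epoch
            break

print(stopped_epoch, best_loss)
```

Training stops at the second epoch in a row without improvement, so the later, better scores in the list are never reached. That is the trade-off of a small patience value.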

In [None]:
#Validation score is ultimate measure of model's predictive quality


#MODEL CAPACITY / network capacity - closely related to overfitting and underfitting


#Capacity is the ability to capture predictive patterns in data - more capacity means further to the right on the bias-variance graph
#Adding neurons or layers moves the model further to the right (higher complexity)

In [None]:
WORKFLOW:
    1. Start with a small network
    2. Get the validation score
    3. Keep increasing capacity until the validation score is no longer improving
    4.v
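
The first three workflow steps can be sketched as a loop over network sizes. The validation losses below are hypothetical stand-ins for scores you would get by actually training each network.

```python
# Hypothetical validation losses (lower is better), keyed by nodes per layer
hypothetical_val_loss = {10: 0.52, 50: 0.41, 100: 0.38, 200: 0.39, 400: 0.42}

best_nodes, best_loss = None, float("inf")

# Start small and keep increasing capacity
for n_nodes, val_loss in hypothetical_val_loss.items():
    if val_loss < best_loss:
        best_nodes, best_loss = n_nodes, val_loss
    else:
        # Validation score stopped improving: stop adding capacity
        break

print(best_nodes, best_loss)
```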