# Tuning and Optimizing Neural Networks - Lab

## Introduction

Now that we've discussed some regularization, initialization and optimization techniques, it's time to synthesize those concepts into a cohesive modelling pipeline.  

With this pipeline, you will not only fit an initial model but will also tune hyperparameters for several regularization techniques. You will select your final model by comparing cross-validation metrics across these models. This more closely simulates a problem you might face in practice, along with the modelling decisions you are apt to encounter along the way.  

Recall that our end objective is to achieve a balance between overfitting and underfitting. We've discussed the bias-variance tradeoff and the role of regularization in reducing overfitting on training data and improving generalization to new cases. Common frameworks for such a procedure include the train/validate/test methodology when data is plentiful, and K-folds cross-validation for smaller, more limited datasets. In this lab, you'll perform the latter, as the dataset in question is fairly limited. 

## Objectives

You will be able to:

* Implement a K-folds cross validation modelling pipeline
* Apply normalization as a preprocessing technique
* Apply regularization techniques to improve your model's generalization
* Choose an appropriate optimization strategy 

## Loading the Data

In [18]:
#Your code here; load and preview the dataset
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split, KFold
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.base import TransformerMixin, BaseEstimator

from keras import models
from keras import layers

In [5]:
df = pd.read_csv('loan_final.csv')
df.head()

Unnamed: 0,loan_amnt,funded_amnt_inv,term,int_rate,installment,grade,emp_length,home_ownership,annual_inc,verification_status,loan_status,purpose,addr_state,total_acc,total_pymnt,application_type
0,5000.0,4975.0,36 months,10.65%,162.87,B,10+ years,RENT,24000.0,Verified,Fully Paid,credit_card,AZ,9.0,5863.155187,Individual
1,2500.0,2500.0,60 months,15.27%,59.83,C,< 1 year,RENT,30000.0,Source Verified,Charged Off,car,GA,4.0,1014.53,Individual
2,2400.0,2400.0,36 months,15.96%,84.33,C,10+ years,RENT,12252.0,Not Verified,Fully Paid,small_business,IL,10.0,3005.666844,Individual
3,10000.0,10000.0,36 months,13.49%,339.31,C,10+ years,RENT,49200.0,Source Verified,Fully Paid,other,CA,37.0,12231.89,Individual
4,3000.0,3000.0,60 months,12.69%,67.79,B,1 year,RENT,80000.0,Source Verified,Fully Paid,other,OR,38.0,4066.908161,Individual


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 42538 entries, 0 to 42537
Data columns (total 16 columns):
loan_amnt              42535 non-null float64
funded_amnt_inv        42535 non-null float64
term                   42535 non-null object
int_rate               42535 non-null object
installment            42535 non-null float64
grade                  42535 non-null object
emp_length             41423 non-null object
home_ownership         42535 non-null object
annual_inc             42531 non-null float64
verification_status    42535 non-null object
loan_status            42535 non-null object
purpose                42535 non-null object
addr_state             42535 non-null object
total_acc              42506 non-null float64
total_pymnt            42535 non-null float64
application_type       42535 non-null object
dtypes: float64(6), object(10)
memory usage: 5.2+ MB


In [7]:
# Drop NaN since very few
df.dropna(inplace=True)

In [8]:
df['int_rate'] = df.int_rate.str[:-1].astype('float64')

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 41394 entries, 0 to 42516
Data columns (total 16 columns):
loan_amnt              41394 non-null float64
funded_amnt_inv        41394 non-null float64
term                   41394 non-null object
int_rate               41394 non-null float64
installment            41394 non-null float64
grade                  41394 non-null object
emp_length             41394 non-null object
home_ownership         41394 non-null object
annual_inc             41394 non-null float64
verification_status    41394 non-null object
loan_status            41394 non-null object
purpose                41394 non-null object
addr_state             41394 non-null object
total_acc              41394 non-null float64
total_pymnt            41394 non-null float64
application_type       41394 non-null object
dtypes: float64(7), object(9)
memory usage: 5.4+ MB


## Defining the Problem

Set up the problem by defining X and Y. 

For this problem use the following variables for X:
* loan_amnt
* home_ownership
* funded_amnt_inv
* verification_status
* emp_length
* installment
* annual_inc

Be sure to use dummy variables for categorical variables and to normalize numerical quantities. Be sure to also remove any rows with null data.  

For Y, we are looking to build a model to predict the total payment received for a loan.
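As a rough illustration of the instructions above, the sketch below selects the seven listed predictors, one-hot encodes the categorical ones with `pd.get_dummies`, and standardizes the numeric ones. The helper name `prepare_features` is hypothetical, and the four toy rows are copied from the data preview purely for demonstration; on the real data you would pass in the full DataFrame.

```python
import pandas as pd

FEATURES = ['loan_amnt', 'home_ownership', 'funded_amnt_inv',
            'verification_status', 'emp_length', 'installment', 'annual_inc']
TARGET = 'total_pymnt'


def prepare_features(df):
    """Return (X, y) with dummy-coded categoricals and z-scored numericals."""
    data = df[FEATURES + [TARGET]].dropna()  # remove rows with null data
    X = data[FEATURES]
    num_cols = list(X.select_dtypes(include='number').columns)
    cat_cols = [c for c in FEATURES if c not in num_cols]

    # z-score each numeric column (population std, as StandardScaler does)
    X_num = (X[num_cols] - X[num_cols].mean()) / X[num_cols].std(ddof=0)
    # one indicator column per category level
    X_cat = pd.get_dummies(X[cat_cols]).astype(float)
    return pd.concat([X_num, X_cat], axis=1), data[TARGET]


# Toy rows for illustration only
toy = pd.DataFrame({
    'loan_amnt': [5000.0, 2500.0, 2400.0, 10000.0],
    'funded_amnt_inv': [4975.0, 2500.0, 2400.0, 10000.0],
    'installment': [162.87, 59.83, 84.33, 339.31],
    'annual_inc': [24000.0, 30000.0, 12252.0, 49200.0],
    'home_ownership': ['RENT', 'RENT', 'RENT', 'RENT'],
    'verification_status': ['Verified', 'Source Verified',
                            'Not Verified', 'Source Verified'],
    'emp_length': ['10+ years', '< 1 year', '10+ years', '10+ years'],
    'total_pymnt': [5863.155187, 1014.53, 3005.666844, 12231.89],
})

X_toy, y_toy = prepare_features(toy)
print(X_toy.shape)  # 4 numeric columns + 6 dummy columns
```

Note that this version computes the scaling statistics on whatever frame it receives; in a full pipeline those statistics should come from the training portion only.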

In [28]:
X, y = df.drop('total_pymnt', axis=1), df.total_pymnt

In [33]:
class DataFrameSelector(BaseEstimator, TransformerMixin):
    """ A DataFrame transformer that provides column selection. """
    
    def __init__(self, columns=[]):
        """ Get selected columns. """
        self.columns = columns
        
    def transform(self, X):
        """ Returns df with selected columns. """
        return X[self.columns].copy()
    
    def fit(self, X, y=None):
        """ Do nothing operation. """
        return self


# -- Get Pipelines --

# Get categorical and numerical column names
cat_cols = list(X.select_dtypes(include=object).columns)
num_cols = [c for c in X.columns if c not in cat_cols + ['total_pymnt']]

# Numerical pipeline: select columns, impute missing values, standardize
num_pipeline = make_pipeline(
    DataFrameSelector(num_cols),
    SimpleImputer(strategy='median'),
    StandardScaler()
)

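# Categorical pipeline: select columns, impute missing values, one-hot encode
# Note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`
cat_pipeline = make_pipeline(
    DataFrameSelector(cat_cols),
    SimpleImputer(strategy='most_frequent'),
    OneHotEncoder(handle_unknown='ignore', sparse=False)
)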

# Union pipelines
full_preproc = FeatureUnion(transformer_list=[
    ("cat_pipeline", cat_pipeline),
    ("num_pipeline", num_pipeline)
])

y_scale = StandardScaler()

## Generating a Hold Out Test Set

While we will be using K-fold cross validation to select an optimal model, we still want a final hold out test set that is completely independent of any modelling decisions. As such, pull out a sample of 10% of the total available data. For consistency of results, use random seed 123. 
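One way to picture the hold-out step, and why scaling statistics should come from the training portion only, is the NumPy sketch below. It is a stand-in for `train_test_split` and `StandardScaler`, not a replacement for them: the helper names `holdout_split` and `standardize` and the synthetic data are hypothetical, and the rows this version selects with seed 123 will differ from the rows scikit-learn selects.

```python
import numpy as np


def holdout_split(X, y, test_frac=0.1, seed=123):
    """Shuffle row indices once and reserve `test_frac` of them for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(test_frac * len(X)))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


def standardize(train, test):
    """Scale both arrays with the mean and std of the training array only."""
    mean, std = train.mean(axis=0), train.std(axis=0)
    return (train - mean) / std, (test - mean) / std


# Synthetic data, purely for illustration
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

X_tr, X_te, y_tr, y_te = holdout_split(X_demo, y_demo)
X_tr_s, X_te_s = standardize(X_tr, X_te)
print(X_tr.shape, X_te.shape)  # (180, 3) (20, 3)
```

Because the test rows are transformed with the training mean and standard deviation, their scaled mean will generally not be exactly zero, which is expected.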

In [36]:
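# Your code here; generate a hold out test set for final model evaluation. Use random seed 123.
# Note: this solution holds out 20% (test_size=0.2), which the shapes below reflect;
# set test_size=0.1 to match the 10% called for in the instructions above.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)

# Fit the preprocessing on the training data only, then apply it to the test data
X_train = full_preproc.fit_transform(X_train)
X_test = full_preproc.transform(X_test)
y_train = y_scale.fit_transform(y_train.values.reshape(-1, 1))
y_test = y_scale.transform(y_test.values.reshape(-1, 1))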

In [37]:
X_train.shape, X_test.shape

((33115, 103), (8279, 103))

In [38]:
y_train.shape, y_test.shape

((33115, 1), (8279, 1))

## Defining a K-fold Cross Validation Methodology

Now that you have a complete holdout test set, write a function that takes in the remaining data and performs k-folds cross validation given a model object. Be sure your function returns performance metrics for both the training and validation sets.
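To make the mechanics concrete, here is a minimal sketch of k-fold cross validation that reports both training and validation error. It assumes contiguous, unshuffled folds and uses ordinary least squares as a stand-in model so that it runs quickly; the names `kfold_indices`, `cross_validate`, `fit_ols` and `predict_ols` are hypothetical and are not part of the lab's Keras code.

```python
import numpy as np


def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    fold_sizes = np.full(k, n_samples // k)
    fold_sizes[: n_samples % k] += 1  # spread the remainder over early folds
    indices = np.arange(n_samples)
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = np.concatenate([indices[:start], indices[start + size:]])
        yield train_idx, val_idx
        start += size


def cross_validate(X, y, fit, predict, k=10):
    """Return mean training and validation MSE over k folds."""
    train_mse, val_mse = [], []
    for train_idx, val_idx in kfold_indices(len(X), k):
        params = fit(X[train_idx], y[train_idx])  # fresh fit for every fold
        train_err = predict(params, X[train_idx]) - y[train_idx]
        val_err = predict(params, X[val_idx]) - y[val_idx]
        train_mse.append(np.mean(train_err ** 2))
        val_mse.append(np.mean(val_err ** 2))
    return {'mean_train_mse': float(np.mean(train_mse)),
            'mean_val_mse': float(np.mean(val_mse))}


# Stand-in model: ordinary least squares (not the lab's Keras network)
def fit_ols(X, y):
    X1 = np.column_stack([X, np.ones(len(X))])
    return np.linalg.lstsq(X1, y, rcond=None)[0]


def predict_ols(w, X):
    return np.column_stack([X, np.ones(len(X))]) @ w


# Synthetic data, purely for illustration
rng = np.random.default_rng(123)
X_demo = rng.normal(size=(100, 4))
y_demo = X_demo @ np.array([2.0, -1.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)
print(cross_validate(X_demo, y_demo, fit_ols, predict_ols, k=5))
```

A large gap between the two numbers is the usual sign of overfitting, which is why tracking the training metric alongside the validation metric is useful.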

In [110]:
#Your code here; define a function to evaluate a model object using K folds cross validation.
from pprint import pprint
import time
import datetime
from keras import regularizers

# Helper that builds a fresh, compiled Keras model from a set of parameters
def build_model(layer_sizes=[10, 10], dropout=False, l1_reg=False, l2_reg=False):
    """ Builds and compiles a new model.
        layer_sizes: number of units in each hidden layer, as a list
        dropout: adds dropout with rate 0.2 after each hidden layer if True
        l1_reg: applies L1 regularization (0.005) to each hidden layer if True
        l2_reg: applies L2 regularization (0.005) to each hidden layer if True;
                ignored when l1_reg is also True.
    """
    # Get regularization
    if l1_reg:
        reg = regularizers.l1(0.005)
    elif l2_reg:
        reg = regularizers.l2(0.005)
    else:
        reg = None
        
    # Init model
    model = models.Sequential()
    for size in layer_sizes:
        model.add(
            layers.Dense(
                size,
                activation='relu',
                kernel_regularizer=reg
            )
        )
        
        if dropout:
            model.add(layers.Dropout(0.2))

    model.add(layers.Dense(1))
    model.add(layers.Activation('linear'))
    model.compile(optimizer='adam',
                  loss='mean_squared_error',
                  metrics=['mean_absolute_error'])
    return model


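def k_folds(features_train, labels_train, model_obj, k=10, n_epochs=100):
    """ Performs K-Fold cross validation for a Keras model.
        model_obj: dict of keyword arguments passed to build_model
        Returns mean validation loss (MSE) and mean validation MAE.
        Prints results at each fold and time elapsed."""
    # Record start time to report elapsed time
    now = datetime.datetime.now()
    # Get fold distribution; folds are contiguous (no shuffling), since
    # train_test_split already shuffled the rows. random_state only applies
    # with shuffle=True, and recent scikit-learn versions raise an error otherwise.
    kf = KFold(n_splits=k)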
    results = []
    for i, (train_ind, val_ind) in enumerate(kf.split(features_train)):
        X_tr, X_val = features_train[train_ind], features_train[val_ind]
        y_tr, y_val = labels_train[train_ind], labels_train[val_ind]
        
        # Need to rebuild model for each fold
        # otherwise we train over validation data
        model = build_model(**model_obj)
        
        # Fit model to each fold
        model.fit(X_tr, y_tr,
                  batch_size=512,
                  epochs=n_epochs,
                  verbose=0)
        
        # Get model metrics
        met = model.evaluate(X_val, y_val)
        print(f"Fold {i} metrics: ")
        print(f"val_mse: {met[0]}")
        print(f"val_mae: {met[1]}")
        print()
        results.append(met)
    
    # Final metrics 
    results = np.array(results).mean(axis=0)
    final_results = {'mean_val_mse': results[0], 'mean_val_mae': results[1]}
    pprint(final_results)
    later = datetime.datetime.now()
    elapsed = later - now
    print('Time Elapsed:', elapsed)
    
    return final_results

## Building a Baseline Model

Here, it is also important to define your evaluation metric that you will look to optimize while tuning the model.   

In general, model training to optimize this metric may consist of using a validation and test set if data is plentiful, or k-folds cross-validation if data is limited. We set up a k-folds cross-validation for this task since the dataset is not overly large.  

Build an initial sequential model with 2 hidden relu layers. The first should have 7 hidden units, and the second 10 hidden units. Finally, add a third layer with a linear activation function to output our predictions for the total loan payment. 

## Evaluating the Baseline Model with K-Folds Cross Validation

Use your k-folds function to evaluate the baseline model.  

Note: This code block is likely to take 10-20 minutes to run, depending on the specs of your computer.
Because run times this long limit how many experiments you can afford, it is worth timing these operations for future reference.

Here's a simple recipe to achieve this:
```
import datetime

now = datetime.datetime.now()
# ... the code you want to time goes here ...
later = datetime.datetime.now()
elapsed = later - now
print('Time Elapsed:', elapsed)
```
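If you find yourself timing many cells, the same idea can be wrapped in a small context manager. This is only a sketch using the standard library; the name `timer` and the placeholder workload are hypothetical.

```python
import time
from contextlib import contextmanager


@contextmanager
def timer(label):
    """Print the wall-clock time spent inside the `with` block."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"{label}: {elapsed:.2f} s")


with timer("Time Elapsed"):
    total = sum(i * i for i in range(100_000))  # placeholder workload
```

The `finally` clause means the elapsed time is still printed if the timed code raises an exception.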

In [106]:
#Your code here; use your k-folds function to evaluate the baseline model.
model_params = {'layer_sizes': [7, 10], 'dropout': False}

results = k_folds(X_train, y_train, model_obj=model_params, k=10)


Fold 0 metrics: 
val_mse: 0.05517629258658575
val_mae: 0.11476484347800701

Fold 1 metrics: 
val_mse: 0.050690279486206705
val_mae: 0.11186759137876943

Fold 2 metrics: 
val_mse: 0.06649178702492213
val_mae: 0.1264717563410888

Fold 3 metrics: 
val_mse: 0.054590014668838414
val_mae: 0.11863151808147845

Fold 4 metrics: 
val_mse: 0.05606142530028803
val_mae: 0.1169992237923226

Fold 5 metrics: 
val_mse: 0.06715179695425726
val_mae: 0.12277547184100003

Fold 6 metrics: 
val_mse: 0.057602255204335255
val_mae: 0.1174020196638306

Fold 7 metrics: 
val_mse: 0.052781097219631916
val_mae: 0.1175577723342679

Fold 8 metrics: 
val_mse: 0.06491646152614143
val_mae: 0.12260946288493349

Fold 9 metrics: 
val_mse: 0.05672273798208606
val_mae: 0.11247858632388558

{'mean_val_mae': 0.1181558246119584, 'mean_val_mse': 0.0582184147953293}
Time Elapsed: 0:03:50.816316


In [107]:
results

{'mean_val_mse': 0.0582184147953293, 'mean_val_mae': 0.1181558246119584}

## Intentionally Overfitting a Model

Now that you've developed a baseline model, it's time to intentionally overfit a model. To overfit a model, you can:
* Add layers
* Make the layers bigger
* Increase the number of training epochs

Again, be careful here. Think about the limitations of your resources, both in terms of your computer's specs and how much time and patience you have to let the process run. Also keep in mind that you will then be regularizing these overfit models, meaning another round of experiments and more time and resources.  

For example, here are some timing notes on potential experiments run on a MacBook Pro 3.1 GHz Intel Core i5 with 16 GB of RAM:

* Using our 10 fold cross validation methodology, a 5-layer neural network with 10 units per hidden layer and 100 epochs took approximately 15 minutes to train and validate  

* Using our 10 fold cross validation methodology, a 5-layer neural network with 25 units per hidden layer and 100 epochs took approximately 25 minutes to train and validate  

* Using our 10 fold cross validation methodology, a 5-layer neural network with 10 units per hidden layer and 250 epochs took approximately 45 minutes to train and validate
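To get a feel for how quickly capacity (and training cost) grows as layers are added or widened, it can help to count trainable parameters. The sketch below assumes fully connected layers with biases and the 103 input features produced by the preprocessing step; `count_dense_params` is a hypothetical helper, not a Keras function.

```python
def count_dense_params(n_inputs, layer_sizes, n_outputs=1):
    """Count weights and biases in a fully connected network."""
    total, fan_in = 0, n_inputs
    for size in list(layer_sizes) + [n_outputs]:
        total += fan_in * size + size  # weight matrix plus bias vector
        fan_in = size
    return total


# 103 input features, as in the preprocessed training matrix above
print(count_dense_params(103, [7, 10]))     # 819
print(count_dense_params(103, [100] * 5))   # 50901
```

Five hidden layers of 100 units have roughly sixty times as many parameters as the baseline, which gives the network far more room to memorize the training folds.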


In [108]:
#Your code here; try some methods to overfit your network
model_params_overfit = {'layer_sizes': [100 for _ in range(5)], 'dropout': False}

results_overfit = k_folds(X_train, y_train, model_obj=model_params_overfit, k=10, n_epochs=200)

Fold 0 metrics: 
val_mse: 0.08887592388164019
val_mae: 0.14008039739972727

Fold 1 metrics: 
val_mse: 0.07828280331503942
val_mae: 0.13450996563342457

Fold 2 metrics: 
val_mse: 0.09927672992239972
val_mae: 0.15458821249756835

Fold 3 metrics: 
val_mse: 0.08250277746774724
val_mae: 0.1453164774870527

Fold 4 metrics: 
val_mse: 0.08348161679499103
val_mae: 0.13810881587186297

Fold 5 metrics: 
val_mse: 0.09670177294654146
val_mae: 0.1446036417875467

Fold 6 metrics: 
val_mse: 0.08827322696278514
val_mae: 0.14298278104139217

Fold 7 metrics: 
val_mse: 0.08065023424671482
val_mae: 0.1380346022283012

Fold 8 metrics: 
val_mse: 0.09164724436655249
val_mae: 0.1409262224955344

Fold 9 metrics: 
val_mse: 0.08366403699470265
val_mae: 0.1393290998572239

{'mean_val_mae': 0.1418480216299634, 'mean_val_mse': 0.08733563668991143}
Time Elapsed: 0:10:26.599046


In [109]:
#Your code here; try some methods to overfit your network
results_overfit

{'mean_val_mse': 0.08733563668991143, 'mean_val_mae': 0.1418480216299634}

## Regularizing the Model to Achieve Balance  

Now that you have a powerful model (albeit an overfit one), you can improve its generalization by using some of the regularization techniques we discussed. Some options to try include:  
* Adding dropout
* Adding L1/L2 regularization
* Altering the layer architecture (add or remove layers similar to above)  

This process will be constrained by time and resources. Be sure to test at least 2 different methodologies, such as dropout and L2 regularization. If you have the time, feel free to continue experimenting.
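As a reminder of what these two techniques do numerically, here is a small NumPy sketch. It assumes the usual definitions: an L2 penalty equal to a coefficient times the sum of squared weights, and inverted dropout that rescales the surviving activations during training. The function names are hypothetical and the code is illustrative rather than a copy of Keras internals.

```python
import numpy as np


def l2_penalty(weights, lam=0.005):
    """Penalty added to the loss: lam times the sum of squared weights."""
    return lam * np.sum(weights ** 2)


def dropout(activations, rate=0.2, rng=None):
    """Inverted dropout: zero a fraction `rate` of units, rescale the rest."""
    rng = np.random.default_rng() if rng is None else rng
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)


w = np.array([[1.0, -2.0], [0.5, 0.0]])
print(l2_penalty(w))  # 0.005 * (1 + 4 + 0.25), about 0.02625

a = np.ones((1000, 10))
dropped = dropout(a, rate=0.2, rng=np.random.default_rng(123))
print(dropped.mean())  # close to 1.0, since survivors are scaled by 1 / 0.8
```

The rescaling keeps the expected activation unchanged, so no adjustment is needed at prediction time, when dropout is switched off.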

Notes: 

In [115]:
#Your code here; try some regularization or other methods to tune your network

# Try l2 regularization
model_l2_params = {'layer_sizes': [7, 10], 'dropout': False, 'l2_reg': True}

results_l2 = k_folds(X_train, y_train, model_obj=model_l2_params, k=10, n_epochs=150)

Fold 0 metrics: 
val_mse: 0.0648385339241097
val_mae: 0.11910622844085601

Fold 1 metrics: 
val_mse: 0.06161795448565829
val_mae: 0.11402940156235211

Fold 2 metrics: 
val_mse: 0.06917049861760531
val_mae: 0.12534974223893622

Fold 3 metrics: 
val_mse: 0.06172752921168067
val_mae: 0.11530708874337339

Fold 4 metrics: 
val_mse: 0.06414675786371392
val_mae: 0.12140622950982356

Fold 5 metrics: 
val_mse: 0.07572858414253514
val_mae: 0.12477077525283448

Fold 6 metrics: 
val_mse: 0.06891669612462266
val_mae: 0.13593886771507402

Fold 7 metrics: 
val_mse: 0.06099378958240948
val_mae: 0.11430322663980609

Fold 8 metrics: 
val_mse: 0.06971779203071078
val_mae: 0.11946102223865074

Fold 9 metrics: 
val_mse: 0.06665716605439301
val_mae: 0.12646858647074163

{'mean_val_mae': 0.12161411688124482, 'mean_val_mse': 0.0663515302037439}
Time Elapsed: 0:08:17.900585


In [116]:
print(results)
print(results_l2)

{'mean_val_mse': 0.0582184147953293, 'mean_val_mae': 0.1181558246119584}
{'mean_val_mse': 0.0663515302037439, 'mean_val_mae': 0.12161411688124482}


In [117]:
#Your code here; try some regularization or other methods to tune your network

# Try dropout regularization
model_dropout_params = {'layer_sizes': [7, 10], 'dropout': True}

results_dropout = k_folds(X_train, y_train, model_obj=model_dropout_params, k=10, n_epochs=150)

Fold 0 metrics: 
val_mse: 0.10493473757220351
val_mae: 0.2049495402165657

Fold 1 metrics: 
val_mse: 0.0847124815851018
val_mae: 0.1855895208635768

Fold 2 metrics: 
val_mse: 0.11314475806295007
val_mae: 0.21714572500491489

Fold 3 metrics: 
val_mse: 0.10466197334625871
val_mae: 0.19946164187889745

Fold 4 metrics: 
val_mse: 0.10760460421442986
val_mae: 0.208709303451621

Fold 5 metrics: 
val_mse: 0.11870223717841091
val_mae: 0.25734003384608006

Fold 6 metrics: 
val_mse: 0.10403848126445138
val_mae: 0.24018074780459564

Fold 7 metrics: 
val_mse: 0.08481004664956245
val_mae: 0.20271101304138942

Fold 8 metrics: 
val_mse: 0.13403486123584757
val_mae: 0.28223548844407165

Fold 9 metrics: 
val_mse: 0.11013715198121198
val_mae: 0.2513318935826047

{'mean_val_mae': 0.2249654908134317, 'mean_val_mse': 0.10667813330904283}
Time Elapsed: 0:08:56.145518
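With several experiments finished, it helps to line the results up side by side. The sketch below copies the mean validation metrics printed by the runs above into one dictionary and sorts them; the labels and the helper `rank_experiments` are hypothetical. Keep in mind that these runs used different epoch counts (100, 200 and 150), so the comparison is not strictly like for like.

```python
# Mean validation metrics copied from the cross-validation runs above
experiments = {
    'baseline [7, 10]': {'mean_val_mse': 0.0582184147953293,
                         'mean_val_mae': 0.1181558246119584},
    'overfit 5 x 100': {'mean_val_mse': 0.08733563668991143,
                        'mean_val_mae': 0.1418480216299634},
    'L2 [7, 10]': {'mean_val_mse': 0.0663515302037439,
                   'mean_val_mae': 0.12161411688124482},
    'dropout [7, 10]': {'mean_val_mse': 0.10667813330904283,
                        'mean_val_mae': 0.2249654908134317},
}


def rank_experiments(results, metric='mean_val_mae'):
    """Return experiment names ordered from best (lowest) to worst."""
    return sorted(results, key=lambda name: results[name][metric])


for name in rank_experiments(experiments):
    print(f"{name:>18}: MAE={experiments[name]['mean_val_mae']:.4f}")
```

On these numbers the small baseline network has the lowest validation error, and dropout applied to such a small network has the highest.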


In [None]:
#Your code here; try some regularization or other methods to tune your network

In [None]:
#Your code here; try some regularization or other methods to tune your network

## Final Evaluation

Now that you have selected a network architecture, tested various regularization procedures and tuned hyperparameters via a validation methodology, it is time to evaluate your finalized model once and for all. Fit the model on all of the training and validation data, using the architecture and hyperparameters that were most effective in your experiments above. Afterwards, measure the overall performance on the hold-out test data, which has been left untouched (and hasn't leaked any data into the modelling process)!
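Because the target was standardized before training, the metrics reported so far are in units of standard deviations rather than dollars. When reporting the final test performance, one option is to convert back: the mean absolute error scales linearly with the target's standard deviation (and the mean squared error with its square). The sketch below demonstrates this with made-up predictions; with a fitted `StandardScaler`, the standard deviation should be available as its `scale_` attribute.

```python
import numpy as np

# Hypothetical target values in dollars, for illustration only
y_dollars = np.array([5863.16, 1014.53, 3005.67, 12231.89, 4066.91])

# Mean and standard deviation, as a scaler fitted on the target would store
y_mean, y_std = y_dollars.mean(), y_dollars.std()
y_scaled = (y_dollars - y_mean) / y_std

# Pretend predictions, expressed in scaled units
pred_scaled = y_scaled + np.array([0.1, -0.2, 0.0, 0.3, -0.1])

mae_scaled = np.mean(np.abs(pred_scaled - y_scaled))
mae_dollars = np.mean(np.abs((pred_scaled * y_std + y_mean) - y_dollars))

# The two agree up to the factor y_std
print(mae_scaled, mae_dollars, mae_scaled * y_std)
```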

In [None]:
#Your code here; final model training on entire training set followed by evaluation on hold-out data

## Additional Resources

https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/

https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/

https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/

https://stackoverflow.com/questions/37232782/nan-loss-when-training-regression-network

https://www.springboard.com/blog/free-public-data-sets-data-science-project/

## Summary

In this lab, we worked with data from *The Lending Club* through a complete data science pipeline built around neural networks. We began by reserving a hold-out test set that was never touched during the modeling phase. From there, we implemented a k-fold cross-validation methodology to assess an initial baseline model and various regularization methods. Next, we'll begin to investigate other neural network architectures such as CNNs.