# Normalization and Tuning Neural Networks - Lab

## Introduction

For this lab on initialization and optimization, let's look at a slightly different type of neural network. This time, we will not perform a classification task as we've done before (Santa vs. not Santa, bank complaint types); instead, we'll look at a regression problem.

We can just as well use deep learning networks for regression as for a classification problem. Do note that getting regression to work with neural networks can be hard because the output is unbounded ($\hat y$ can technically range from $-\infty$ to $+\infty$), and the models are especially prone to exploding gradients. This issue makes a regression exercise the perfect learning case!

## Objectives
You will be able to:
* Build a neural network using Keras
* Normalize your data to assist algorithm convergence
* Implement and observe the impact of various initialization techniques

In [1]:
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras import initializers
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn import preprocessing
from keras import optimizers
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


## Loading the data

The data we'll be working with relates to Facebook posts published during 2014 on the Facebook page of a renowned cosmetics brand. It includes 7 features known prior to post publication, and 12 features for evaluating the post impact. What we want to do is build a predictor for the number of "likes" for a post, taking into account the 7 features known prior to posting.

First, let's import the data set and delete any rows with missing data. Afterwards, briefly preview the data.

In [8]:
#Your code here; load the dataset and drop rows with missing values. Then preview the data.

df = pd.read_csv('dataset_Facebook.csv', delimiter=';')
df = df.dropna()  # drop rows with missing values
df.head()

Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,comment,like,share,Total Interactions
0,139441,Photo,2,12,4,3,0.0,2752,5091,178,109,159,3078,1640,119,4,79.0,17.0,100
1,139441,Status,2,12,3,10,0.0,10460,19057,1457,1361,1674,11710,6112,1108,5,130.0,29.0,164
2,139441,Photo,3,12,3,3,0.0,2413,4373,177,113,154,2812,1503,132,0,66.0,14.0,80
3,139441,Photo,2,12,2,10,1.0,50128,87991,2211,790,1119,61027,32048,1386,58,1572.0,147.0,1777
4,139441,Photo,2,12,2,3,0.0,7244,13594,671,410,580,6228,3200,396,19,325.0,49.0,393


## Initialization

## Normalize the Input Data

Let's look at our input data. We'll use the first 7 columns as our predictors. We'll do the following two things:
- Normalize the continuous variables --> you can do this using `np.mean()` and `np.std()`
- Make dummy variables of the categorical variables (you can do this by using `pd.get_dummies`)

We only count "Category" and "Type" as categorical variables. Note that you could argue that "Post Month", "Post Weekday" and "Post Hour" should also be considered categories, but we'll just treat them as continuous for now.

You'll then use these to define X and Y. 

To summarize, X will be:
* Page total likes
* Post Month
* Post Weekday
* Post Hour
* Paid

along with dummy variables for:
* Type
* Category


Be sure to normalize your features by subtracting the mean and dividing by the standard deviation.  

Finally, y will simply be the "like" column.
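
The two preprocessing steps described above can be sketched on a small made-up frame before applying them to the real data. This is only an illustration: the `toy` frame and the `standardize` helper are hypothetical names, and the values are invented. It uses the population standard deviation (`ddof=0`), which is what `np.std` computes by default.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the column names mirror the lab's dataset.
toy = pd.DataFrame({
    "Post Hour": [3, 10, 3, 10, 3],
    "Type": ["Photo", "Status", "Photo", "Photo", "Link"],
})

def standardize(col):
    """Return (col - mean) / std using the population std (ddof=0)."""
    values = col.to_numpy(dtype=float)
    z = (values - values.mean()) / values.std()
    return pd.Series(z, index=col.index, name=col.name)

hour_z = standardize(toy["Post Hour"])        # continuous column
type_dummies = pd.get_dummies(toy["Type"])    # one indicator column per category

# Combine the standardized and dummy columns into one feature matrix.
X_toy = pd.concat([hour_z, type_dummies], axis=1)
print(X_toy)
```

After this step the continuous column has mean 0 and standard deviation 1, and the categorical column has been replaced by one indicator column per category.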

In [17]:
#load
df = pd.read_csv('dataset_Facebook.csv',delimiter=';', header=0)
df = df.dropna()

df.head()

Unnamed: 0,Page total likes,Type,Category,Post Month,Post Weekday,Post Hour,Paid,Lifetime Post Total Reach,Lifetime Post Total Impressions,Lifetime Engaged Users,Lifetime Post Consumers,Lifetime Post Consumptions,Lifetime Post Impressions by people who have liked your Page,Lifetime Post reach by people who like your Page,Lifetime People who have liked your Page and engaged with your post,comment,like,share,Total Interactions
0,139441,Photo,2,12,4,3,0.0,2752,5091,178,109,159,3078,1640,119,4,79.0,17.0,100
1,139441,Status,2,12,3,10,0.0,10460,19057,1457,1361,1674,11710,6112,1108,5,130.0,29.0,164
2,139441,Photo,3,12,3,3,0.0,2413,4373,177,113,154,2812,1503,132,0,66.0,14.0,80
3,139441,Photo,2,12,2,10,1.0,50128,87991,2211,790,1119,61027,32048,1386,58,1572.0,147.0,1777
4,139441,Photo,2,12,2,3,0.0,7244,13594,671,410,580,6228,3200,396,19,325.0,49.0,393


In [18]:
np.shape(df)

(495, 19)

In [23]:
#Your code here; define X and y.
X0 = df["Page total likes"]
X1 = df["Type"]
X2 = df["Category"]
X3 = df["Post Month"]
X4 = df["Post Weekday"]
X5 = df["Post Hour"]
X6 = df["Paid"]

## standardize the continuous columns X0, X3, X4, X5 (same result as StandardScaler)
## and one-hot encode the categorical columns X1, X2
X0= (X0-np.mean(X0))/(np.std(X0))
dummy_X1= pd.get_dummies(X1)
dummy_X2= pd.get_dummies(X2)
X3= (X3-np.mean(X3))/(np.std(X3))
X4= (X4-np.mean(X4))/(np.std(X4))
X5= (X5-np.mean(X5))/(np.std(X5))

X = pd.concat([X0, dummy_X1, dummy_X2, X3, X4, X5, X6], axis=1)

y = df["like"]

Our data set is fairly small. Let's just split the data into a training set and a validation set! The next three code blocks are all provided for you; give them a quick review, but there is no need to make edits!

In [24]:
#Code provided; defining training and validation sets
np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(396, 12)
(396,)
(99, 12)
(99,)


In [27]:
#Code provided; building an initial model
np.random.seed(123)
model = Sequential()
model.add(Dense(8, input_dim=12, activation='relu'))
model.add(Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, y_train, 
                 batch_size=32, 
                 epochs=100, 
                 validation_data = (X_test, y_test), verbose=0)

In [28]:
#Code provided; previewing the loss through successive epochs
hist.history['loss'][:10]

[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]

Did you see what happened? All the values for training and validation loss are "nan". There could be several reasons for that, but as we already mentioned, there is likely a vanishing or exploding gradient problem. Recall that we normalized our inputs. But how about the outputs? Let's have a look.

In [33]:
y_train.head()

208     54.0
290     23.0
286     15.0
0       79.0
401    329.0
Name: like, dtype: float64

Yes, indeed. We didn't normalize them and we should, as they take pretty high values. Let's rerun the model but make sure that the output is normalized as well!
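
To get a feel for why large targets are a problem, the sketch below compares the size of the MSE gradient with respect to the predictions for raw versus standardized targets. It uses the five `y_train` values printed above and assumes, purely for illustration, that an untrained network starts out predicting values close to zero; `mse_grad_wrt_pred` is a hypothetical helper, not part of Keras.

```python
import numpy as np

# First five training targets shown above (raw "like" counts).
y_raw = np.array([54.0, 23.0, 15.0, 79.0, 329.0])

# Standardize with this small sample's own mean and std (illustration only).
y_std = (y_raw - y_raw.mean()) / y_raw.std()

def mse_grad_wrt_pred(pred, target):
    """Gradient of mean((pred - target)**2) with respect to pred."""
    return 2.0 * (pred - target) / len(target)

# Assumption: initial predictions are all zero.
pred0 = np.zeros_like(y_raw)

grad_raw = mse_grad_wrt_pred(pred0, y_raw)
grad_std = mse_grad_wrt_pred(pred0, y_std)

print("max |grad| with raw targets:         ", np.abs(grad_raw).max())
print("max |grad| with standardized targets:", np.abs(grad_std).max())
```

The gradient that flows back into the network scales with the size of the targets, so raw like counts in the hundreds or thousands produce much larger weight updates than standardized targets do.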

## Normalizing the output

Normalize Y as you did X by subtracting the mean and dividing by the standard deviation. Then, resplit the data into training and validation sets as we demonstrated above, and retrain a new model using your normalized X and Y data.
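
Once the target is normalized, predictions and errors live on the normalized scale. A minimal sketch of how they could be mapped back to raw like counts is shown below; it assumes you keep the mean and standard deviation used for normalization. The helper `to_original_units` and the fake predictions are hypothetical, and the five target values are the ones from the data preview.

```python
import numpy as np

# Raw targets from the first five rows of the data preview.
y_raw = np.array([79.0, 130.0, 66.0, 1572.0, 325.0])

# Keep the statistics so the transformation can be undone later.
y_mean, y_std = y_raw.mean(), y_raw.std()
y_norm = (y_raw - y_mean) / y_std

def to_original_units(pred_norm, mean, std):
    """Map predictions made on the normalized scale back to raw counts."""
    return pred_norm * std + mean

# Pretend these are model predictions on the normalized scale.
pred_norm = y_norm + 0.1
pred_raw = to_original_units(pred_norm, y_mean, y_std)

mse_norm = np.mean((pred_norm - y_norm) ** 2)
mse_raw = np.mean((pred_raw - y_raw) ** 2)

# The two errors differ by a factor of std**2.
print(mse_norm, mse_raw, mse_norm * y_std ** 2)
```

An MSE computed on normalized targets is therefore unitless; multiplying it by the squared standard deviation of the raw targets expresses it in squared likes.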

In [34]:
y

0        79.0
1       130.0
2        66.0
3      1572.0
4       325.0
5       152.0
6       249.0
7       325.0
8       161.0
9       113.0
10      233.0
11       88.0
12       90.0
13      137.0
14      577.0
15       86.0
16       40.0
17      678.0
18       54.0
19       34.0
20       66.0
21        0.0
22       16.0
23       72.0
24       99.0
25       88.0
26      412.0
27      100.0
28      523.0
29      143.0
        ...  
469     193.0
470     114.0
471     160.0
472      46.0
473     136.0
474      73.0
475      65.0
476     579.0
477     101.0
478      74.0
479      84.0
480     360.0
481       5.0
482     187.0
483      69.0
484      82.0
485      12.0
486      56.0
487      44.0
488     277.0
489      74.0
490      79.0
491     105.0
492     128.0
493     185.0
494     125.0
495      53.0
496      53.0
497      93.0
498      91.0
Name: like, Length: 495, dtype: float64

In [35]:
#Your code here: redefine Y after normalizing the data.
y = (y-np.mean(y))/np.std(y)
y

0     -0.309011
1     -0.151644
2     -0.349124
3      4.297815
4      0.450051
5     -0.083760
6      0.215544
7      0.450051
8     -0.055990
9     -0.204100
10     0.166174
11    -0.281240
12    -0.275069
13    -0.130045
14     1.227627
15    -0.287411
16    -0.429350
17     1.539274
18    -0.386151
19    -0.447863
20    -0.349124
21    -0.552774
22    -0.503404
23    -0.330610
24    -0.247298
25    -0.281240
26     0.718500
27    -0.244213
28     1.061003
29    -0.111531
         ...   
469    0.042750
470   -0.201014
471   -0.059076
472   -0.410836
473   -0.133130
474   -0.327524
475   -0.352209
476    1.233798
477   -0.241127
478   -0.324439
479   -0.293582
480    0.558048
481   -0.537346
482    0.024236
483   -0.339867
484   -0.299754
485   -0.515747
486   -0.379980
487   -0.417007
488    0.301942
489   -0.324439
490   -0.309011
491   -0.228784
492   -0.157815
493    0.018065
494   -0.167072
495   -0.389237
496   -0.389237
497   -0.265812
498   -0.271983
Name: like, Length: 495, dtype: float64

In [36]:
#Your code here; create training and validation sets as before. Use random seed 123.

np.random.seed(123)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(396, 12)
(396,)
(99, 12)
(99,)


In [37]:
#Your code here; rebuild a simple model using a relu layer followed by a linear layer. (See our code snippet above!)

model = Sequential()

model.add(Dense(8,input_dim=12, activation='relu'))
model.add(Dense(1, activation='linear'))  # input size is inferred from the previous layer

model.compile(optimizer='sgd', loss='mse', metrics=['mse'])

hist = model.fit(X_train, y_train,
                batch_size=32,
                epochs=100,
                validation_data=(X_test,y_test),
                verbose=0)



Finally, let's recheck our loss function. Not only should it be populated with numerical data as opposed to `nan` values, but we should also expect to see the loss decreasing over successive epochs, demonstrating optimization!

In [46]:
hist.history['loss'][80:99]

[0.9328720798396101,
 0.9325820868364488,
 0.9338428841696845,
 0.9327733293928281,
 0.9363994890391224,
 0.9350359585098545,
 0.9332518535430985,
 0.9353073716464669,
 0.9340975188245677,
 0.9365883913606105,
 0.9310656677592885,
 0.9350639726176406,
 0.9327647505384503,
 0.9332090212841226,
 0.9314849135851619,
 0.9293352982612572,
 0.9306509801835725,
 0.9319081851328263,
 0.9315889613194899]

Great! We have a converged model. With that, let's investigate how well the model performed with our good old friend, mean squared error.

In [49]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_test).reshape(-1)  

MSE_train = np.mean((pred_train-y_train)**2)
MSE_val = np.mean((pred_val-y_test)**2)

print("MSE_train:", MSE_train)
print("MSE_val:", MSE_val)

MSE_train: 0.9249923866630264
MSE_val: 1.0110184465290766
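
A useful reference point when reading these numbers: on a standardized target, a model that always predicts the mean (which is 0 after standardization) has an MSE of 1 over the data used to compute the statistics. The sketch below checks this on synthetic targets. The exponential distribution is just an assumption to imitate skewed like counts, and on a train or validation subset the baseline will differ slightly from 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed targets, loosely imitating like counts.
y_raw = rng.exponential(scale=150.0, size=495)
y_norm = (y_raw - y_raw.mean()) / y_raw.std()

# Baseline: always predict the mean, which is 0 after standardization.
baseline_pred = np.zeros_like(y_norm)
baseline_mse = np.mean((baseline_pred - y_norm) ** 2)

print("baseline MSE:", baseline_mse)
```

Seen this way, an MSE a little below 1 means the network improves only modestly on the constant-mean predictor.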


## Using Weight Initializers

##  He Initialization

Let's try and use a weight initializer. In the lecture, we've seen the He initializer, which initializes the weights with mean 0 and variance 2/n, with $n$ the number of features feeding into a layer.
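
To make the "mean 0, variance 2/n" rule concrete, here is a small NumPy sketch that draws weights that way for a layer with 12 inputs and 8 units. It is an illustration of the rule rather than a reproduction of Keras internals: it samples from a plain normal distribution, whereas Keras' `he_normal` is documented as using a truncated normal with the same target standard deviation.

```python
import numpy as np

rng = np.random.default_rng(123)

n_in, n_out = 12, 8  # layer sizes used in this lab

# He initialization: zero-mean weights with variance 2 / n_in.
he_std = np.sqrt(2.0 / n_in)
W = rng.normal(loc=0.0, scale=he_std, size=(n_in, n_out))

print("target variance:   ", 2.0 / n_in)
print("empirical variance:", W.var())

# With only 96 weights the estimate is noisy; a larger draw gets closer.
W_big = rng.normal(0.0, he_std, size=(n_in, 100_000))
print("large-sample variance:", W_big.var())
```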

In [50]:
np.random.seed(123)
model = Sequential()
model.add(Dense(8, input_dim=12, kernel_initializer= "he_normal",
                       #initializes the weights with mean 0 and variance 2/n, 
                       #with n the number of features feeding into a layer
                activation='relu'))
model.add(Dense(1, activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, y_train, batch_size=32, 
                 epochs=100, validation_data = (X_test, y_test),verbose=0)

In [51]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_test).reshape(-1)

MSE_train = np.mean((pred_train-y_train)**2)
MSE_val = np.mean((pred_val-y_test)**2)

In [54]:
print("MSE_train AFTER he_normal:", MSE_train)
print("MSE_val AFTER he_normal:", MSE_val)

MSE_train AFTER he_normal: 0.9266192873789892
MSE_val AFTER he_normal: 0.9474136876258673


The initializer does not really help us to decrease the MSE. We know that initializers can be particularly helpful in deeper networks, and our network isn't very deep. What if we use the LeCun initializer (`lecun_normal`) with a `tanh` activation?

## LeCun Initialization
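
LeCun initialization draws zero-mean weights with variance 1/n, where n is the number of inputs to the layer. The NumPy sketch below shows the intended effect under simplifying assumptions: the inputs are synthetic standard-normal features, and weights come from a plain (not truncated) normal distribution. With those assumptions the pre-activations keep a variance of roughly 1, which keeps `tanh` away from its flat, saturated regions.

```python
import numpy as np

rng = np.random.default_rng(123)

n_in, n_out, n_samples = 12, 8, 50_000

# Synthetic standardized inputs: zero mean, unit variance per feature.
X = rng.normal(size=(n_samples, n_in))

# LeCun initialization: zero-mean weights with variance 1 / n_in.
W = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

Z = X @ W        # pre-activations of the hidden layer
A = np.tanh(Z)   # hidden activations

print("pre-activation variance:", Z.var())
print("tanh output variance:   ", A.var())
```

With only 12 x 8 weights the pre-activation variance is a noisy estimate, so expect a value in the neighborhood of 1 rather than exactly 1.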

In [55]:
np.random.seed(123)
model = Sequential()
model.add(Dense(8, 
                input_dim=12, 
                kernel_initializer= "lecun_normal", 
                activation='tanh'))
model.add(Dense(1, 
                activation = 'linear'))

model.compile(optimizer= "sgd" ,loss='mse',metrics=['mse'])

hist = model.fit(X_train, y_train, 
                 batch_size=32, 
                 epochs=100, 
                 validation_data = (X_test, y_test), 
                 verbose=0)

In [56]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_test).reshape(-1)

MSE_train = np.mean((pred_train-y_train)**2)
MSE_val = np.mean((pred_val-y_test)**2)

In [57]:
print("MSE_train AFTER lecun_normal:", MSE_train)
print("MSE_val AFTER lecun_normal:", MSE_val)

MSE_train AFTER lecun_normal: 0.9274743163472585
MSE_val AFTER lecun_normal: 0.9462999719122035


Not much of a difference, but a useful note to consider when tuning your network. Next, let's investigate the impact of various optimization algorithms.

## RMSprop

In [58]:
np.random.seed(123)
model = Sequential()
model.add(Dense(8, 
                input_dim=12, 
                activation='relu'))
model.add(Dense(1, 
                activation = 'linear'))

model.compile(optimizer= "rmsprop" ,loss='mse',metrics=['mse'])

hist = model.fit(X_train, y_train, 
                 batch_size=32, 
                 epochs=100, 
                 validation_data = (X_test, y_test), verbose = 0)

In [59]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_test).reshape(-1)

MSE_train = np.mean((pred_train-y_train)**2)
MSE_val = np.mean((pred_val-y_test)**2)

In [60]:
print("MSE_train WITH rmsprop:", MSE_train)
print("MSE_val WITH rmsprop:", MSE_val)

MSE_train WITH rmsprop: 0.9144463422620787
MSE_val WITH rmsprop: 0.9437088236524245
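
For intuition about what `"rmsprop"` does differently from plain SGD, here is a minimal NumPy sketch of the update rule applied to a toy one-parameter problem. The function name `rmsprop_step` and the hyperparameter values are illustrative assumptions, not the exact Keras implementation or its defaults.

```python
import numpy as np

def rmsprop_step(w, grad, s, lr=0.01, rho=0.9, eps=1e-7):
    """One RMSprop update; returns the new weights and running average."""
    s = rho * s + (1.0 - rho) * grad ** 2     # running mean of squared gradients
    w = w - lr * grad / (np.sqrt(s) + eps)    # step scaled per parameter
    return w, s

# Toy objective f(w) = w**2, whose gradient is 2 * w.
w, s = np.array([1.0]), np.zeros(1)
for _ in range(1000):
    w, s = rmsprop_step(w, 2.0 * w, s)

print("w after 1000 steps:", w)
```

Because each step is divided by the recent gradient magnitude, the step size stays close to `lr` regardless of how large the raw gradient is, which is one reason the method is less sensitive to the scale of the loss.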


## Adam

In [22]:
np.random.seed(123)
model = Sequential()
model.add(Dense(8, 
                input_dim=12, 
                activation='relu'))
model.add(Dense(1, 
                activation = 'linear'))

model.compile(optimizer= "adam" ,loss='mse',metrics=['mse'])

hist = model.fit(X_train, y_train, 
                 batch_size=32, 
                 epochs=100, 
                 validation_data = (X_test, y_test), verbose = 0)

In [23]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_test).reshape(-1)

MSE_train = np.mean((pred_train-y_train)**2)
MSE_val = np.mean((pred_val-y_test)**2)

In [24]:
print(MSE_train)
print(MSE_val)

0.9113685285012638
0.9444777470972421
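
Adam combines a running mean of the gradients (momentum) with the RMSprop-style running mean of squared gradients, and corrects both for their zero initialization. The sketch below applies that rule to a toy one-parameter problem; `adam_step` and the learning rate of 0.01 are assumptions chosen for illustration, not the Keras defaults.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-7):
    """One Adam update at (1-based) step t; returns new w, m and v."""
    m = beta1 * m + (1.0 - beta1) * grad         # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * grad ** 2    # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)               # bias correction for the zero start
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy objective f(w) = w**2, whose gradient is 2 * w.
w, m, v = np.array([1.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)

print("w after 2000 steps:", w)
```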


## Learning Rate Decay with Momentum


In [25]:
np.random.seed(123)
sgd = optimizers.SGD(lr=0.03, decay=0.0001, momentum=0.9)
model = Sequential()
model.add(Dense(8, input_dim=12, activation='relu'))
model.add(Dense(1, activation = 'linear'))

model.compile(optimizer= sgd ,loss='mse',metrics=['mse'])
hist = model.fit(X_train, y_train, batch_size=32, 
                 epochs=100, validation_data = (X_test, y_test), verbose = 0)

In [26]:
pred_train = model.predict(X_train).reshape(-1)
pred_val = model.predict(X_test).reshape(-1)

MSE_train = np.mean((pred_train-y_train)**2)
MSE_val = np.mean((pred_val-y_test)**2)

In [27]:
print(MSE_train)
print(MSE_val)

0.8188327426055082
0.9218409795298302
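
The sketch below spells out what the `decay` and `momentum` arguments are doing, under the assumption that the optimizer uses time-based decay applied once per weight update (per batch), as the legacy Keras `SGD` did. The helpers `decayed_lr` and `momentum_step` are hypothetical names, and the toy objective stands in for the real loss.

```python
import math

def decayed_lr(lr0, decay, iteration):
    """Time-based decay: the step size shrinks as 1 / (1 + decay * iteration)."""
    return lr0 / (1.0 + decay * iteration)

def momentum_step(w, grad, velocity, lr, momentum=0.9):
    """Classical momentum: accumulate a velocity, then move along it."""
    velocity = momentum * velocity - lr * grad
    return w + velocity, velocity

lr0, decay = 0.03, 0.0001                  # values used in the cell above
steps_per_epoch = math.ceil(396 / 32)      # 396 training rows, batch size 32
total_steps = steps_per_epoch * 100        # 100 epochs

# Toy objective f(w) = w**2, whose gradient is 2 * w.
w, velocity = 1.0, 0.0
for i in range(total_steps):
    lr = decayed_lr(lr0, decay, i)
    w, velocity = momentum_step(w, 2.0 * w, velocity, lr)

print("updates:", total_steps)
print("final learning rate:", decayed_lr(lr0, decay, total_steps))
print("w:", w)
```

Under these assumptions the learning rate only drops from 0.03 to about 0.027 over the whole run, so most of the difference from plain SGD would come from the larger initial learning rate and the momentum term rather than from the decay.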


## Additional Resources
* https://github.com/susanli2016/Machine-Learning-with-Python/blob/master/Consumer_complaints.ipynb  

* https://catalog.data.gov/dataset/consumer-complaint-database  

* https://machinelearningmastery.com/dropout-regularization-deep-learning-models-keras/  

* https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/  

* https://machinelearningmastery.com/regression-tutorial-keras-deep-learning-library-python/  

* https://stackoverflow.com/questions/37232782/nan-loss-when-training-regression-network  


## Summary  

In this lab, we began to practice some of the concepts regarding normalization and optimization for neural networks. In the final lab for this section, you'll practice these concepts on your own to tune a model that predicts individuals' loan payments.