## Peer-graded Assignment: Build a Regression Model in Keras ##

Imports and data

In [21]:
import keras
import pandas as pd
import numpy as np

In [22]:
concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


A. Build a baseline model (5 marks) 

Use the Keras library to build a neural network with the following:

- One hidden layer of 10 nodes, and a ReLU activation function

- Use the Adam optimizer and mean squared error as the loss function.

Get the predictors and target first

In [36]:
concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column

In [37]:
n_cols = predictors.shape[1] # number of cols

Let's import the rest of the packages from the Keras library that we will need to build our regression model.

In [23]:
from keras.models import Sequential
from keras.layers import Dense

Define regression model

In [43]:
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

Call the function to build the model

In [44]:
model = regression_model()

1. Randomly split the data into training and test sets by holding out 30% of the data for testing. You can use the train_test_split helper function from Scikit-learn.
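To make the idea of a random 70/30 hold-out split concrete, here is a minimal sketch in plain Python. It is not scikit-learn's implementation; the function name `split_indices` and its parameters are hypothetical and chosen only for illustration.

```python
import random

def split_indices(n_rows, test_frac=0.30, seed=None):
    """Return (train_idx, test_idx) for a random hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # shuffle the row indices in place
    n_test = round(n_rows * test_frac)     # number of held-out rows
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10, test_frac=0.30, seed=0)
print(len(train_idx), len(test_idx))  # 7 3
```

In the notebook itself, passing `random_state` to `train_test_split` plays the role of `seed` here and makes a split reproducible.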

In [27]:
from sklearn.model_selection import train_test_split

In [41]:
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.30)

2. Train the model on the training data using 50 epochs.

In [46]:
# fit the model
model.fit(X_train, y_train, validation_split=0, epochs=50, verbose=0)

<tensorflow.python.keras.callbacks.History at 0x1fb1f5a2700>

3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.
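For reference, the mean squared error is the average of the squared differences between actual and predicted values. The sketch below computes it by hand on made-up numbers; the helper name `mse` is an assumption for illustration and is not the Scikit-learn function.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# squared errors are 1, 0 and 4, so the result is 5 / 3
print(mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0]))
```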

In [70]:
## X_train.head()
## X_test.head()
## y_train.head()
## y_test.head()

In [66]:
from sklearn.metrics import mean_squared_error

x_pred = model.predict(X_test)

mean_squared_error(y_test, x_pred)

139.09553170886392

4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.
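One way to keep the 50 repetitions independent is to build a new, untrained model inside the loop. The sketch below shows that pattern with NumPy only. `MeanPredictor` is a toy stand-in for a Keras model, and `repeated_mse`, its parameters, and the synthetic data are all assumptions made for illustration, not part of the assignment.

```python
import numpy as np

class MeanPredictor:
    """Toy stand-in for a Keras model: always predicts the training mean."""
    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)

def repeated_mse(build_model, X, y, n_runs=50, test_frac=0.30, seed=0):
    """Split, fit a freshly built model, and score it, n_runs times."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_frac * len(X)))
    scores = []
    for _ in range(n_runs):
        order = rng.permutation(len(X))           # new random split each run
        test_idx, train_idx = order[:n_test], order[n_test:]
        model = build_model()                     # new, untrained model each run
        model.fit(X[train_idx], y[train_idx])
        residual = y[test_idx] - model.predict(X[test_idx])
        scores.append(float(np.mean(residual ** 2)))
    return scores

# synthetic data, for illustration only
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(100, 8))
y = X @ np.arange(1.0, 9.0) + data_rng.normal(size=100)

scores = repeated_mse(MeanPredictor, X, y)
print(len(scores))  # 50
```

With Keras, the same pattern would amount to calling `regression_model()` at the top of each iteration before `fit`.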

In [71]:
fifty_mse = []

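for i in range(50):
    # draw a new random 70/30 split on every iteration
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.30)
    # note: `model` is the same instance each time, so training accumulates across iterations
    model.fit(X_train, y_train, validation_split=0, epochs=50, verbose=0)
    x_pred = model.predict(X_test)
    fifty_mse.append(mean_squared_error(y_test, x_pred))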
    
print(fifty_mse)

[98.54393479103541, 99.28212104685292, 84.13543805908697, 86.0342522272036, 67.07884996078634, 80.47540476065879, 62.584418132583956, 79.52083498801949, 72.95199676214821, 75.49131725568829, 66.59020240789265, 83.02431580133869, 61.43580327369817, 62.82301582150894, 57.593586396468936, 53.129895890358746, 42.524675448019636, 51.96443009983853, 49.04529558192437, 47.59906794590274, 47.617842475710084, 45.787750718539606, 57.0354562108847, 49.15184501396654, 51.22971356094604, 55.40440626723468, 44.65928863674531, 48.34417374200552, 67.56466839764089, 57.30838872486563, 47.956885850165015, 52.48178905966427, 41.06187879121081, 52.884426434787855, 50.31663536571426, 45.0719038101276, 51.57654220624121, 43.39384639192956, 44.20137500207329, 46.84296551250365, 49.19121987433305, 45.485967554433664, 46.1785192503977, 48.5490951925033, 61.36727727179312, 51.456677036933144, 48.54657901217355, 47.42418910767352, 49.14502476707817, 46.055556741201485]


5. Report the mean and the standard deviation of the mean squared errors.
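When reporting a standard deviation it helps to know which divisor is used. `np.std` divides by n by default (population standard deviation), while the pandas `.std()` method divides by n - 1. The sketch below shows both with the standard library; the numbers are illustrative and are not results from this notebook.

```python
import statistics

values = [4.0, 6.0, 8.0, 10.0]   # illustrative numbers only

mean = statistics.fmean(values)
pop_std = statistics.pstdev(values)     # divides by n, like np.std(values)
sample_std = statistics.stdev(values)   # divides by n - 1, like np.std(values, ddof=1)

print(mean, pop_std, sample_std)
```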

In [77]:
## mean
print("Mean:", np.mean(fifty_mse))

## standard deviation
print("Standard Deviation:", np.std(fifty_mse))

Mean: 57.50241489264985
Standard Deviation: 14.533233185306546


B. Normalize the data (5 marks) 

Repeat Part A but use a normalized version of the data. Recall that one way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.

How does the mean of the mean squared errors compare to that from Step A?
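The normalization described above can be sketched with NumPy as follows. This is a variation on the notebook's approach, not a copy of it: it computes the mean and standard deviation from the training rows only and applies them to both sets, whereas the notebook uses statistics from the full dataset. The function name `standardize` and the tiny arrays are assumptions for illustration.

```python
import numpy as np

def standardize(train, test):
    """Scale both arrays using the column means and stds of `train` only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)   # ddof=1 matches the pandas .std() default
    return (train - mean) / std, (test - mean) / std

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

train_z, test_z = standardize(train, test)
print(train_z)  # each column becomes -1, 0, 1
print(test_z)   # [[0. 2.]]
```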

In [78]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069


In [83]:
X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.30)
# note: `model` is not rebuilt here, so it starts from the weights learned in Part A
model.fit(X_train, y_train, validation_split=0, epochs=50, verbose=0)
x_pred = model.predict(X_test)
mean_squared_error(y_test, x_pred)

106.43124645890202

In [84]:
fifty_mse = []

for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.30)
    model.fit(X_train, y_train, validation_split=0, epochs=50, verbose=0)
    x_pred = model.predict(X_test)
    fifty_mse.append(mean_squared_error(y_test, x_pred))
    
print(fifty_mse)

[89.76567072898595, 77.4651424436894, 63.069228288084936, 62.10230725293927, 52.28357322511104, 47.79334583544495, 46.210262401290684, 53.46448026670225, 46.40763334976926, 43.884427446496865, 41.178446686961344, 43.0702083328461, 34.78472996918014, 39.630995399722465, 40.29226788727089, 42.63892284978034, 36.34914078157923, 43.99545243321103, 43.533545338467874, 44.63847649525315, 43.74818481382151, 38.58066691533703, 40.89996908967593, 36.76600112408608, 37.560356837342745, 38.998884713664694, 36.66029808717771, 35.06393897397217, 35.67772418455087, 41.37597425213602, 36.297115189057436, 32.776169022120634, 36.90381870771222, 45.137962714874455, 41.678061117663376, 36.600125649357715, 38.86325614960784, 39.508056035421916, 36.96249061625123, 32.75309507540712, 37.548640765125754, 37.66742750177578, 33.13048139332557, 32.3784627374672, 34.420241286883325, 36.06379725822773, 36.161575012369255, 34.648798708620724, 34.79331117358042, 33.75223084283617]


In [85]:
## mean
print("Mean:", np.mean(fifty_mse))

## standard deviation
print("Standard Deviation:", np.std(fifty_mse))

Mean: 42.11870746724476
Standard Deviation: 10.784635224559489


C. Increase the number of epochs (5 marks)

Repeat Part B but use 100 epochs this time for training.

How does the mean of the mean squared errors compare to that from Step B?

In [86]:
X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.30)
model.fit(X_train, y_train, validation_split=0, epochs=100, verbose=0)
x_pred = model.predict(X_test)
mean_squared_error(y_test, x_pred)

33.4997916278698

In [87]:
fifty_mse = []

for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.30)
    model.fit(X_train, y_train, validation_split=0, epochs=100, verbose=0)
    x_pred = model.predict(X_test)
    fifty_mse.append(mean_squared_error(y_test, x_pred))
    
print(fifty_mse)

[35.00448899814452, 29.253706818632942, 32.36759182881069, 31.388166324792792, 35.5819562841711, 32.41880388671421, 29.736201273368003, 32.93087752228227, 33.12121626322708, 31.185392952954786, 32.96961226801874, 33.31857936597821, 34.17045653434473, 33.546448898600296, 31.819812120484293, 35.021707736314525, 31.930677138798163, 35.02345368998435, 35.01641739465003, 28.479733942030222, 33.567378406324885, 34.90362196301837, 33.02001898284231, 36.09665619239121, 35.904716624031906, 32.20977672000339, 35.67449477040471, 29.295600020521107, 26.909516827298393, 32.769648356283525, 36.19010946084313, 32.77216289256745, 26.970674313204285, 26.418229867903985, 29.25747099427159, 30.248577658063102, 29.90463563712424, 31.579012108697793, 30.46957039299217, 27.63845444394074, 32.54676320163584, 29.0850160794965, 32.22982097333864, 30.268085146850765, 26.635926732344938, 31.100765534183093, 25.94232600869056, 27.557270305578065, 27.57464777147602, 28.284663738769737]


In [88]:
## mean
print("Mean:", np.mean(fifty_mse))

## standard deviation
print("Standard Deviation:", np.std(fifty_mse))

Mean: 31.54621826734789
Standard Deviation: 2.866988683898354


D. Increase the number of hidden layers (5 marks)

Repeat Part B but use a neural network with the following instead:

- Three hidden layers, each with 10 nodes and a ReLU activation function.

How does the mean of the mean squared errors compare to that from Step B?
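To get a feel for how much larger the three-hidden-layer network is, the sketch below counts the weights and biases of a stack of fully connected layers. The helper `dense_param_count` is hypothetical, and the input size of 8 assumes the eight predictor columns shown in the tables above.

```python
def dense_param_count(n_inputs, hidden_sizes, n_outputs=1):
    """Count weights and biases in a stack of fully connected layers."""
    sizes = [n_inputs, *hidden_sizes, n_outputs]
    # each layer has fan_in * fan_out weights plus fan_out biases
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(sizes, sizes[1:]))

print(dense_param_count(8, [10]))          # 101
print(dense_param_count(8, [10, 10, 10]))  # 321
```

In Keras, `model.summary()` reports the parameter count of a built model, which is a quick way to confirm which architecture is actually being trained.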

In [89]:
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))   
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

In [90]:
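X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.30)
# note: `model` is not reassigned after redefining regression_model(), so this still fits the earlier network
# note: 100 epochs are used here, whereas Part B used 50
model.fit(X_train, y_train, validation_split=0, epochs=100, verbose=0)
x_pred = model.predict(X_test)
mean_squared_error(y_test, x_pred)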

36.161719264753465

In [91]:
fifty_mse = []

for i in range(50):
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.30)
    model.fit(X_train, y_train, validation_split=0, epochs=100, verbose=0)
    x_pred = model.predict(X_test)
    fifty_mse.append(mean_squared_error(y_test, x_pred))
    
print(fifty_mse)

[37.08831573857858, 28.73772539231455, 33.308801643998734, 27.00904858528797, 34.10762138571191, 29.662771680591014, 28.014709617599838, 30.00249343747726, 35.75436760150557, 29.19142636036736, 28.632387401685065, 29.40426163015924, 31.05008478402742, 29.47675158796809, 28.184399125561022, 25.45088698624377, 34.65387973894857, 26.667774178583123, 32.106842156735844, 29.026595971502523, 25.56613114333355, 24.600746495725975, 31.70066923671377, 26.472094857791742, 28.069707661544186, 34.87041136320401, 31.665288080103878, 28.75292535236546, 31.68048549867654, 28.373725025804564, 28.649643561159735, 31.398613700228417, 29.88439644515941, 28.459197289438396, 27.373886943987284, 23.20314221474834, 27.78422169107095, 28.581302577254483, 31.178875172955767, 33.51137319059712, 32.19406890668828, 26.10367949497153, 25.53956823441345, 28.89648212094557, 27.91826988280471, 33.477438494258834, 29.23581494927325, 34.4469311160305, 25.75100626030299, 33.150611665122405]


In [92]:
## mean
print("Mean:", np.mean(fifty_mse))

## standard deviation
print("Standard Deviation:", np.std(fifty_mse))

Mean: 29.720437072630446
Standard Deviation: 3.0836002262683695


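From Part A to Part D the mean of the mean squared errors decreases progressively (57.50, 42.12, 31.55, 29.72). The standard deviation falls from Part A to Part C (14.53, 10.78, 2.87) and is slightly higher in Part D (3.08). These results are consistent with the expectation that normalizing the data, training for more epochs, and adding hidden layers reduce the test error. Because the same `model` object is refit in every run rather than rebuilt, part of the decrease also reflects training that accumulates across runs.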