# Build a Regression Model in Keras

### Step-By-Step Assignment Instructions

1. Assignment Topic:

In this project, you will build a regression model using the Keras library to model the same concrete compressive strength data that we used in lab 3.

2. Concrete Data:

For your convenience, the data can be found here again: https://cocl.us/concrete_data. To recap, the predictors in the data of concrete strength include: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate.

3. Assignment Instructions:

## A. Build a baseline model (5 marks)

Use the Keras library to build a neural network with the following:

- One hidden layer of 10 nodes, and a ReLU activation function

- Use the adam optimizer and the mean squared error as the loss function.

1. Randomly split the data into training and test sets by holding out 30% of the data for testing. You can use the train_test_split helper function from Scikit-learn.

2. Train the model on the training data using 50 epochs.

3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.

4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

5. Report the mean and the standard deviation of the mean squared errors.

Submit your Jupyter Notebook with your code and comments.
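
Steps 1 to 5 can be sketched as a small reusable loop. This is only an illustration under stated assumptions: `repeated_holdout_mse` and `LeastSquaresModel` are hypothetical names, the data is synthetic, and a NumPy least-squares model stands in for the Keras network so that the example runs without a deep learning backend. The point it shows is the protocol itself: a new split and a new, untrained model for every repetition.

```python
import numpy as np


class LeastSquaresModel:
    """Hypothetical stand-in exposing Keras-style fit/predict methods."""

    def fit(self, X, y, **kwargs):
        # Extra keyword arguments (e.g. epochs, verbose) are accepted and ignored.
        X1 = np.column_stack([X, np.ones(len(X))])  # append a bias column
        self.coef_, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return self

    def predict(self, X):
        X1 = np.column_stack([X, np.ones(len(X))])
        return X1 @ self.coef_


def repeated_holdout_mse(build_model, X, y, n_repeats=50, test_size=0.3,
                         seed=0, **fit_kwargs):
    """Return one held-out MSE per repetition, using a fresh model each time."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_size * len(X)))
    errors = []
    for _ in range(n_repeats):
        order = rng.permutation(len(X))              # step 1: random split
        test_idx, train_idx = order[:n_test], order[n_test:]
        model = build_model()                        # new, untrained model
        model.fit(X[train_idx], y[train_idx], **fit_kwargs)    # step 2
        predictions = np.ravel(model.predict(X[test_idx]))     # step 3
        errors.append(float(np.mean((y[test_idx] - predictions) ** 2)))
    return errors                                    # step 4: list of MSEs


# Synthetic data, for illustration only (not the concrete data set).
demo_rng = np.random.default_rng(42)
X_demo = demo_rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + demo_rng.normal(scale=0.1, size=200)

mse_list = repeated_holdout_mse(LeastSquaresModel, X_demo, y_demo)
print(len(mse_list), np.mean(mse_list), np.std(mse_list))      # step 5
```

Any factory that returns an object with Keras-style `fit` and `predict` methods should work in place of `LeastSquaresModel`. For a Keras model one would presumably pass NumPy arrays and forward the training options, for example `epochs=50, verbose=0`, through the keyword arguments.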



Import pandas, NumPy, and Keras, together with the Keras classes and the Scikit-learn helper used below:

In [1]:
import pandas as pd
import numpy as np
import keras
#import sklearn
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


Download the data file:

In [2]:
concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data.head(10)

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3
5,266.0,114.0,0.0,228.0,0.0,932.0,670.0,90,47.03
6,380.0,95.0,0.0,228.0,0.0,932.0,594.0,365,43.7
7,380.0,95.0,0.0,228.0,0.0,932.0,594.0,28,36.45
8,266.0,114.0,0.0,228.0,0.0,932.0,670.0,28,45.85
9,475.0,0.0,0.0,228.0,0.0,932.0,594.0,28,39.29


Check how many data points the data frame contains, review summary statistics, and look for missing values:

In [3]:
concrete_data.shape

(1030, 9)

In [4]:
concrete_data.describe()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [5]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

Split the data into the target (Strength) and the predictors (all other columns):

In [6]:
concrete_data_columns = concrete_data.columns

In [7]:
target = concrete_data['Strength']
target.head(10)

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
5    47.03
6    43.70
7    36.45
8    45.85
9    39.29
Name: Strength, dtype: float64

In [8]:
predictors = concrete_data.iloc[:, :-1]
predictors.head(10)

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360
5,266.0,114.0,0.0,228.0,0.0,932.0,670.0,90
6,380.0,95.0,0.0,228.0,0.0,932.0,594.0,365
7,380.0,95.0,0.0,228.0,0.0,932.0,594.0,28
8,266.0,114.0,0.0,228.0,0.0,932.0,670.0,28
9,475.0,0.0,0.0,228.0,0.0,932.0,594.0,28


Create a function that defines our regression model for us so that we can conveniently call it to create our model:
 - One hidden layer of 10 nodes, and a ReLU activation function
 - Use the adam optimizer and the mean squared error as the loss function.

In [9]:
#number of inputs = number of predictor columns
n_cols = predictors.shape[1]
def regression_model():
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
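
To make the architecture concrete, here is a minimal NumPy sketch of the forward pass that such a network computes: one hidden layer of 10 ReLU units followed by a single linear output. The weights are random placeholders, not trained values, and `forward` is a hypothetical helper that is not part of Keras.

```python
import numpy as np


def forward(X, W1, b1, W2, b2):
    """Forward pass: Dense(10, relu) followed by Dense(1) with no activation."""
    hidden = np.maximum(0.0, X @ W1 + b1)   # ReLU hidden layer
    return hidden @ W2 + b2                 # linear output for regression


rng = np.random.default_rng(0)
n_inputs = 8                                # eight predictor columns, as above
W1, b1 = rng.normal(size=(n_inputs, 10)), np.zeros(10)
W2, b2 = rng.normal(size=(10, 1)), np.zeros(1)

X_batch = rng.normal(size=(5, n_inputs))    # five made-up input rows
output = forward(X_batch, W1, b1, W2, b2)
print(output.shape)                         # (5, 1): one prediction per row
```

The output layer has no activation because the target is an unbounded real value.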
      

Build the model:

In [10]:
model = regression_model()

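Train and evaluate the model in a single call to the fit method, passing the held-out 30% of the data as validation data and training for 50 epochs per cycle. The last val_loss value of each cycle is the mean squared error on the test split. Note that the model is created once, before the loop, so its weights carry over from one cycle to the next.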

In [11]:
list_of_mean_squared_error = []
for cycle in range(50):
    #Randomly split the data into a training set (70%) and a test set (30%):  
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3)
    #Train and test the model at the same time
    res = model.fit(X_train, y_train, epochs=50, verbose=0, validation_data=(X_test, y_test))
    #Find mean_squared_error as last value in history.
    mean_squared_error = res.history['val_loss'][-1]
    #Add value of mean_squared_error for every cycle in list.
    list_of_mean_squared_error.append(mean_squared_error)
    print('Cycle #{}: mean_squared_error {}'.format(cycle+1, mean_squared_error))

Cycle #1: mean_squared_error 311.99102363462976
Cycle #2: mean_squared_error 121.93921500888071
Cycle #3: mean_squared_error 77.06328409929492
Cycle #4: mean_squared_error 82.26685639415358
Cycle #5: mean_squared_error 66.72646599680088
Cycle #6: mean_squared_error 71.76137834073656
Cycle #7: mean_squared_error 63.70086477335217
Cycle #8: mean_squared_error 51.03346031460561
Cycle #9: mean_squared_error 67.62978517353342
Cycle #10: mean_squared_error 62.68017279368774
Cycle #11: mean_squared_error 56.541851574549014
Cycle #12: mean_squared_error 61.658052882716106
Cycle #13: mean_squared_error 48.749507632456165
Cycle #14: mean_squared_error 54.34474295705653
Cycle #15: mean_squared_error 50.12402348688119
Cycle #16: mean_squared_error 42.35458760431283
Cycle #17: mean_squared_error 52.10623561525808
Cycle #18: mean_squared_error 50.08295312282723
Cycle #19: mean_squared_error 46.84086712587227
Cycle #20: mean_squared_error 52.57823255224135
Cycle #21: mean_squared_error 44.38899003186

Find the mean and the standard deviation of the mean squared errors:

In [12]:
print('The mean of the mean squared errors: {}'.format(np.mean(list_of_mean_squared_error)))
print('The standard deviation of the mean squared errors: {}'.format(np.std(list_of_mean_squared_error)))

The mean of the mean squared errors: 56.70472316766634
The standard deviation of the mean squared errors: 39.03014606243445
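
One detail worth knowing when reporting the spread: `np.std` divides by N by default (population standard deviation), while passing `ddof=1` divides by N - 1 (sample standard deviation). The short example below uses made-up error values, chosen only so that the arithmetic is easy to follow.

```python
import numpy as np

# Hypothetical per-repetition errors, not taken from the runs above.
errors = [50.0, 60.0, 70.0]

mean_error = np.mean(errors)
population_std = np.std(errors)          # ddof=0: divide by N
sample_std = np.std(errors, ddof=1)      # ddof=1: divide by N - 1

print(mean_error, population_std, sample_std)
```

With 50 repetitions the two estimates differ by only about 1%, so either is reasonable as long as the choice is stated.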


## B. Normalize the data (5 marks)

Repeat Part A but use a normalized version of the data. Recall that one way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.

How does the mean of the mean squared errors compare to that from Step A?
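
As a small worked example of this normalization, the sketch below standardizes a toy array column by column and then checks one value by hand using the Cement mean and standard deviation from the describe() output above. The toy array is made up. `ddof=1` is used because pandas computes the sample standard deviation by default, whereas NumPy's default is `ddof=0`.

```python
import numpy as np

# Toy predictors: three rows, two columns (made-up values).
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

col_mean = X.mean(axis=0)
col_std = X.std(axis=0, ddof=1)      # sample standard deviation, as in pandas
X_norm = (X - col_mean) / col_std    # broadcasting applies this per column

print(X_norm)

# Hand check for the first Cement value (540.0) with the reported statistics.
cement_z = (540.0 - 281.167864) / 104.506364
print(round(cement_z, 4))
```

After this transformation every column has mean 0 and sample standard deviation 1, so all predictors are on a comparable scale.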

Normalize the data by subtracting the mean and dividing by the standard deviation:

In [14]:
predictors_norm = (predictors - predictors.mean())/predictors.std()
predictors_norm.head(10)

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069
5,-0.145138,0.464818,-0.846733,2.174405,-1.038638,-0.526262,-1.291914,0.701883
6,0.945704,0.244603,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
7,0.945704,0.244603,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,-0.279597
8,-0.145138,0.464818,-0.846733,2.174405,-1.038638,-0.526262,-1.291914,-0.279597
9,1.85474,-0.856472,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,-0.279597


Build the model:

In [16]:
n_cols = predictors_norm.shape[1]
def regression_model2():
    model2 = Sequential()
    model2.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model2.add(Dense(1))
    
    model2.compile(optimizer='adam', loss='mean_squared_error')
    return model2

model2 = regression_model2()

Train and test the model at the same time using the fit-method. We will leave out 30% of the data for validation and we will train the model for 50 epochs. **And use predictors_norm instead of predictors.**

In [17]:
list_of_mean_squared_error = []
for cycle in range(50):
    #Randomly split the data into a training set (70%) and a test set (30%):  
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)
    #Train and test the model at the same time
    res = model2.fit(X_train, y_train, epochs=50, verbose=0, validation_data=(X_test, y_test))
    #Find mean_squared_error as last value in history.
    mean_squared_error = res.history['val_loss'][-1]
    #Add value of mean_squared_error for every cycle in list.
    list_of_mean_squared_error.append(mean_squared_error)
    print('Cycle #{}: mean_squared_error {}'.format(cycle+1, mean_squared_error))

Cycle #1: mean_squared_error 309.70854922791517
Cycle #2: mean_squared_error 171.65027344342573
Cycle #3: mean_squared_error 135.08761426314567
Cycle #4: mean_squared_error 110.18458448490279
Cycle #5: mean_squared_error 104.77703420398305
Cycle #6: mean_squared_error 82.0043884326725
Cycle #7: mean_squared_error 69.09444679334325
Cycle #8: mean_squared_error 71.53713822596282
Cycle #9: mean_squared_error 67.52837274298312
Cycle #10: mean_squared_error 76.34752514292893
Cycle #11: mean_squared_error 61.52317869316027
Cycle #12: mean_squared_error 55.39994879910861
Cycle #13: mean_squared_error 60.70220243583605
Cycle #14: mean_squared_error 56.56715467138198
Cycle #15: mean_squared_error 48.343141339743404
Cycle #16: mean_squared_error 53.0590337858231
Cycle #17: mean_squared_error 50.38213857014974
Cycle #18: mean_squared_error 51.014310077556125
Cycle #19: mean_squared_error 40.60067456439861
Cycle #20: mean_squared_error 49.64970301038625
Cycle #21: mean_squared_error 47.76922508659

Find the mean and the standard deviation of the mean squared errors:

In [18]:
print('The mean of the mean squared errors: {}'.format(np.mean(list_of_mean_squared_error)))
print('The standard deviation of the mean squared errors: {}'.format(np.std(list_of_mean_squared_error)))

The mean of the mean squared errors: 61.6217672475179
The standard deviation of the mean squared errors: 43.29149305783805


**The mean and the standard deviation of the mean squared errors are lower in case A (56.70 and 39.03) than in case B (61.62 and 43.29), but the difference is small compared with the spread. In my opinion, comparing two weak neural networks with only one hidden layer is not very informative. Data normalization does not help much here, and the error is large in both cases.**

## C. Increase the number of epochs (5 marks)

Repeat Part B but use 100 epochs this time for training.

How does the mean of the mean squared errors compare to that from Step B?

Train and test the model at the same time using the fit-method. We will leave out 30% of the data (data after normalization) for validation and we will train the model **for 100 epochs instead of 50 epochs**.

Build the model:

In [21]:
def regression_model3():
    model3 = Sequential()
    model3.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model3.add(Dense(1))
    
    model3.compile(optimizer='adam', loss='mean_squared_error')
    return model3

model3 = regression_model3() 

In [22]:
list_of_mean_squared_error = []
for cycle in range(50):
    #Randomly split the data into a training set (70%) and a test set (30%):  
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)
    #Train and test the model at the same time
    res = model3.fit(X_train, y_train, epochs=100, verbose=0, validation_data=(X_test, y_test))
    #Find mean_squared_error as last value in history.
    mean_squared_error = res.history['val_loss'][-1]
    #Add value of mean_squared_error for every cycle in list.
    list_of_mean_squared_error.append(mean_squared_error)
    print('Cycle #{}: mean_squared_error {}'.format(cycle+1, mean_squared_error))

Cycle #1: mean_squared_error 171.76414825538217
Cycle #2: mean_squared_error 126.20202735481139
Cycle #3: mean_squared_error 99.93021548373028
Cycle #4: mean_squared_error 91.56857205980418
Cycle #5: mean_squared_error 118.94827882834623
Cycle #6: mean_squared_error 98.52066923962443
Cycle #7: mean_squared_error 106.3992348581456
Cycle #8: mean_squared_error 105.5089927352362
Cycle #9: mean_squared_error 90.22434390169903
Cycle #10: mean_squared_error 99.9333952375986
Cycle #11: mean_squared_error 103.66797721887484
Cycle #12: mean_squared_error 103.32750378457473
Cycle #13: mean_squared_error 102.73060336313587
Cycle #14: mean_squared_error 101.56693193750473
Cycle #15: mean_squared_error 100.97896104337327
Cycle #16: mean_squared_error 95.15525205544283
Cycle #17: mean_squared_error 90.31244521156484
Cycle #18: mean_squared_error 105.96080550406744
Cycle #19: mean_squared_error 102.99081880143545
Cycle #20: mean_squared_error 94.17879705830299
Cycle #21: mean_squared_error 103.446171

Find the mean and the standard deviation of the mean squared errors:

In [23]:
print('The mean of the mean squared errors: {}'.format(np.mean(list_of_mean_squared_error)))
print('The standard deviation of the mean squared errors: {}'.format(np.std(list_of_mean_squared_error)))

The mean of the mean squared errors: 102.2322185415126
The standard deviation of the mean squared errors: 13.119447707016194


**The mean of the mean squared errors is higher in case C (102.23) than in case B (61.62), while the standard deviation is lower (13.12 versus 43.29). The error is large in both cases. In my opinion, comparing two weak neural networks with only one hidden layer is not very informative. Increasing the number of epochs does not help here.**

## D. Increase the number of hidden layers (5 marks)

Repeat part B but use a neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.

How does the mean of the mean squared errors compare to that from Step B?

Create a new model with **three hidden layers, each of 10 nodes and ReLU activation function.**

In [24]:
def regression_model4():
    model4 = Sequential()
    model4.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model4.add(Dense(10, activation='relu'))
    model4.add(Dense(10, activation='relu'))
    model4.add(Dense(1))
    
    model4.compile(optimizer='adam', loss='mean_squared_error')
    return model4
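
For a sense of how much capacity the extra layers add, the number of trainable parameters in a stack of Dense layers can be counted by hand: each layer contributes (inputs × units) weights plus one bias per unit. The helper below, `dense_param_count`, is a hypothetical illustration and not a Keras function; it assumes 8 input columns, as in the predictors used here.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count the weights and biases of consecutive fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        total += fan_in * units + units   # weight matrix plus bias vector
        fan_in = units                    # this layer feeds the next one
    return total


one_hidden = dense_param_count(8, [10, 1])             # baseline architecture
three_hidden = dense_param_count(8, [10, 10, 10, 1])   # architecture of part D

print(one_hidden)     # 101
print(three_hidden)   # 321
```

The three-hidden-layer network therefore has roughly three times as many trainable parameters as the baseline.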

Build a new model with 3 hidden layers:

In [25]:
model4 = regression_model4()

Train and test the model at the same time using the fit-method. We will leave out 30% of the data (data after normalization) for validation and we will train the model for 50 epochs and use **three hidden layers, each of 10 nodes and ReLU activation function**.

In [26]:
list_of_mean_squared_error = []
for cycle in range(50):
    #Randomly split the data into a training set (70%) and a test set (30%):  
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)
    #Train and test the model at the same time
    res = model4.fit(X_train, y_train, epochs=50, verbose=0, validation_data=(X_test, y_test))
    #Find mean_squared_error as last value in history.
    mean_squared_error = res.history['val_loss'][-1]
    #Add value of mean_squared_error for every cycle in list.
    list_of_mean_squared_error.append(mean_squared_error)
    print('Cycle #{}: mean_squared_error {}'.format(cycle+1, mean_squared_error))

Cycle #1: mean_squared_error 152.65070083766307
Cycle #2: mean_squared_error 88.09756015419575
Cycle #3: mean_squared_error 82.24099716630954
Cycle #4: mean_squared_error 45.99169541639803
Cycle #5: mean_squared_error 37.56291302430977
Cycle #6: mean_squared_error 35.6977121482775
Cycle #7: mean_squared_error 31.46599593054515
Cycle #8: mean_squared_error 35.7753634036166
Cycle #9: mean_squared_error 27.42970234522156
Cycle #10: mean_squared_error 27.538042630192532
Cycle #11: mean_squared_error 29.450146005377416
Cycle #12: mean_squared_error 24.402957928605066
Cycle #13: mean_squared_error 31.169564947727043
Cycle #14: mean_squared_error 22.983247402030674
Cycle #15: mean_squared_error 24.292332294303623
Cycle #16: mean_squared_error 27.170813390737983
Cycle #17: mean_squared_error 27.323440946807366
Cycle #18: mean_squared_error 25.138819703778015
Cycle #19: mean_squared_error 23.346316235736737
Cycle #20: mean_squared_error 28.742703351388087
Cycle #21: mean_squared_error 30.557835

Find the mean and the standard deviation of the mean squared errors:

In [27]:
print('The mean of the mean squared errors: {}'.format(np.mean(list_of_mean_squared_error)))
print('The standard deviation of the mean squared errors: {}'.format(np.std(list_of_mean_squared_error)))

The mean of the mean squared errors: 31.011653809717178
The standard deviation of the mean squared errors: 21.439497664953674


**The mean of the mean squared errors in case D (31.01) is lower than in cases A, B, and C. The standard deviation (21.44) is lower than in cases A and B but higher than in case C (13.12). It is the only case where the error is not very large. This suggests that, in this experiment, adding hidden layers matters more than normalization or extra epochs. It also supports the earlier point that comparing weak neural networks with one hidden layer is not very informative, since their results can be unpredictable.**