# Build a Regression Model in Keras

### Step-By-Step Assignment Instructions

1. Assignment Topic:

In this project, you will build a regression model using the Keras library to model the same concrete compressive strength data that we used in Lab 3.

2. Concrete Data:

For your convenience, the data can be found here again: https://cocl.us/concrete_data. To recap, the predictors in the concrete strength data include: Cement, Blast Furnace Slag, Fly Ash, Water, Superplasticizer, Coarse Aggregate, Fine Aggregate, and Age.

3. Assignment Instructions:

## A. Build a baseline model (5 marks)

Use the Keras library to build a neural network with the following:

- One hidden layer of 10 nodes, and a ReLU activation function

- Use the adam optimizer and the mean squared error as the loss function.

1. Randomly split the data into training and test sets by holding out 30% of the data for testing. You can use the train_test_split helper function from Scikit-learn.

2. Train the model on the training data using 50 epochs.

3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.

4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

5. Report the mean and the standard deviation of the mean squared errors.

Submit your Jupyter Notebook with your code and comments.



Import the pandas, NumPy and Keras libraries, plus the Keras and Scikit-learn helpers used below:

In [1]:
import pandas as pd
import numpy as np
import keras
#import sklearn
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split

Load the data file:

In [2]:
concrete_data = pd.read_csv('concrete_data.csv')
concrete_data.head(10)

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.3 |
| 5 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 90 | 47.03 |
| 6 | 380.0 | 95.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 43.7 |
| 7 | 380.0 | 95.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 | 36.45 |
| 8 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 28 | 45.85 |
| 9 | 475.0 | 0.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 | 39.29 |


Check how many data points are in the data frame, whether any values are missing, and the general summary statistics:

In [3]:
concrete_data.shape

(1030, 9)

In [4]:
concrete_data.describe()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | 281.167864 | 73.895825 | 54.18835 | 181.567282 | 6.20466 | 972.918932 | 773.580485 | 45.662136 | 35.817961 |
| std | 104.506364 | 86.279342 | 63.997004 | 21.354219 | 5.973841 | 77.753954 | 80.17598 | 63.169912 | 16.705742 |
| min | 102.0 | 0.0 | 0.0 | 121.8 | 0.0 | 801.0 | 594.0 | 1.0 | 2.33 |
| 25% | 192.375 | 0.0 | 0.0 | 164.9 | 0.0 | 932.0 | 730.95 | 7.0 | 23.71 |
| 50% | 272.9 | 22.0 | 0.0 | 185.0 | 6.4 | 968.0 | 779.5 | 28.0 | 34.445 |
| 75% | 350.0 | 142.95 | 118.3 | 192.0 | 10.2 | 1029.4 | 824.0 | 56.0 | 46.135 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.6 |


In [5]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

Split the data into the target (Strength) and the predictors (all other columns):

In [6]:
concrete_data_columns = concrete_data.columns

In [7]:
target = concrete_data['Strength']
target.head(10)

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
5    47.03
6    43.70
7    36.45
8    45.85
9    39.29
Name: Strength, dtype: float64

In [8]:
predictors = concrete_data.iloc[:, :-1]
predictors.head(10)

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 |
| 5 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 90 |
| 6 | 380.0 | 95.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 |
| 7 | 380.0 | 95.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 |
| 8 | 266.0 | 114.0 | 0.0 | 228.0 | 0.0 | 932.0 | 670.0 | 28 |
| 9 | 475.0 | 0.0 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 28 |


Create a function that defines our regression model for us so that we can conveniently call it to create our model:
 - One hidden layer of 10 nodes, and a ReLU activation function
 - Use the adam optimizer and the mean squared error as the loss function.

In [9]:
# Number of inputs = number of predictor columns
n_cols = predictors.shape[1]
def regression_model():
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
      

Build the model:

In [10]:
model = regression_model()

Train and test the model at the same time using the fit-method. We will leave out 30% of the data for validation and we will train the model for 50 epochs.

In [11]:
list_of_mean_squared_error = []
for cycle in range(50):
    # Randomly split the data into a training set (70%) and a test set (30%).
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3)
    # Train on the training set; Keras scores the test set after every epoch.
    # Note: `model` was built once before the loop, so its weights carry over between cycles.
    res = model.fit(X_train, y_train, epochs=50, verbose=0, validation_data=(X_test, y_test))
    # The last val_loss entry is the test-set MSE after the final epoch.
    mean_squared_error = res.history['val_loss'][-1]
    # Record this cycle's MSE.
    list_of_mean_squared_error.append(mean_squared_error)
    print('Cycle #{}: mean_squared_error {}'.format(cycle+1, mean_squared_error))

Cycle #1: mean_squared_error 506.6769714355469
Cycle #2: mean_squared_error 258.6771240234375
Cycle #3: mean_squared_error 138.43197631835938
Cycle #4: mean_squared_error 115.69904327392578
Cycle #5: mean_squared_error 103.48408508300781
Cycle #6: mean_squared_error 95.52459716796875
Cycle #7: mean_squared_error 86.34464263916016
Cycle #8: mean_squared_error 71.60794830322266
Cycle #9: mean_squared_error 72.25829315185547
Cycle #10: mean_squared_error 78.47827911376953
Cycle #11: mean_squared_error 62.79623794555664
Cycle #12: mean_squared_error 59.80311965942383
Cycle #13: mean_squared_error 52.38227081298828
Cycle #14: mean_squared_error 58.0130729675293
Cycle #15: mean_squared_error 42.09662628173828
Cycle #16: mean_squared_error 52.68122482299805
Cycle #17: mean_squared_error 42.88134002685547
Cycle #18: mean_squared_error 44.31157684326172
Cycle #19: mean_squared_error 45.46287536621094
Cycle #20: mean_squared_error 41.55788803100586
Cycle #21: mean_squared_error 43.06652450561523
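
The steadily falling errors in this output are consistent with the loop reusing one `model` object, so each cycle continues training from the previous cycle's weights. Below is a minimal sketch of a variant that builds a fresh model for every repetition and computes the test MSE from predictions. It is an illustration under assumptions, not the notebook's method: `repeated_test_mse` and `LeastSquaresModel` are hypothetical names, the split is hand-rolled with NumPy to keep the sketch dependency-free (`train_test_split` would serve equally well), and the stand-in model only exists so the example runs without Keras.

```python
import math
import numpy as np


def repeated_test_mse(build_model, X, y, n_repeats=50, test_size=0.3,
                      seed=None, **fit_kwargs):
    """Return one test-set MSE per repetition, using a fresh model each time."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n_test = math.ceil(test_size * len(X))
    scores = []
    for _ in range(n_repeats):
        # New random split for this repetition.
        order = rng.permutation(len(X))
        test_idx, train_idx = order[:n_test], order[n_test:]
        # Fresh, untrained model so nothing carries over from earlier runs.
        model = build_model()
        model.fit(X[train_idx], y[train_idx], **fit_kwargs)
        # Keras returns predictions with shape (n, 1); ravel flattens them.
        y_pred = np.ravel(model.predict(X[test_idx]))
        scores.append(float(np.mean((y[test_idx] - y_pred) ** 2)))
    return scores


class LeastSquaresModel:
    """Hypothetical stand-in exposing the same fit/predict calls as a Keras model."""

    def fit(self, X, y, **kwargs):
        A = np.column_stack([X, np.ones(len(X))])
        self.coef_, *_ = np.linalg.lstsq(A, y, rcond=None)
        return self

    def predict(self, X):
        A = np.column_stack([X, np.ones(len(X))])
        return A @ self.coef_


# Synthetic, noise-free linear data (illustrative only, not the concrete data).
demo_rng = np.random.default_rng(0)
X_demo = demo_rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + 4.0

scores = repeated_test_mse(LeastSquaresModel, X_demo, y_demo, n_repeats=5, seed=1)
print(np.mean(scores), np.std(scores))
```

With the notebook's objects, the same function could presumably be called as `repeated_test_mse(regression_model, predictors, target, epochs=50, verbose=0)`, since `regression_model` returns a newly compiled network on each call.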

Find the mean and the standard deviation of the mean squared errors:

In [12]:
print('The mean of the mean squared errors: {}'.format(np.mean(list_of_mean_squared_error)))
print('The standard deviation of the mean squared errors: {}'.format(np.std(list_of_mean_squared_error)))

The mean of the mean squared errors: 65.32852958679199
The standard deviation of the mean squared errors: 72.63241379761469
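
One detail worth keeping in mind when reporting the spread: `np.std` computes the population standard deviation (`ddof=0`) by default, while pandas' `std()` uses the sample version (`ddof=1`). The short sketch below shows the difference on the first five per-cycle values printed above, rounded to two decimals; it is only an illustration of the two conventions, not a re-computation of the reported result.

```python
import numpy as np

# First five per-cycle MSE values from the output above, rounded to two decimals.
mse_values = np.array([506.68, 258.68, 138.43, 115.70, 103.48])

mean_mse = mse_values.mean()
population_std = mse_values.std()       # ddof=0, the np.std default
sample_std = mse_values.std(ddof=1)     # ddof=1, the pandas default

print(f"mean={mean_mse:.2f}  std(ddof=0)={population_std:.2f}  std(ddof=1)={sample_std:.2f}")
```

For 50 values the two versions differ by a factor of about 1.01, so either is reasonable as long as the same convention is used for every part of the assignment.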


## B. Normalize the data (5 marks)

Repeat Part A but use a normalized version of the data. Recall that one way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.

How does the mean of the mean squared errors compare to that from Step A?

Normalize the data by subtracting the mean and dividing by the standard deviation:

In [13]:
predictors_norm = (predictors - predictors.mean())/predictors.std()
predictors_norm.head(10)

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 0.862735 | -1.217079 | -0.279597 |
| 1 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 1.055651 | -1.217079 | -0.279597 |
| 2 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 3.55134 |
| 3 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 5.055221 |
| 4 | -0.790075 | 0.678079 | -0.846733 | 0.488555 | -1.038638 | 0.070492 | 0.647569 | 4.976069 |
| 5 | -0.145138 | 0.464818 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -1.291914 | 0.701883 |
| 6 | 0.945704 | 0.244603 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 5.055221 |
| 7 | 0.945704 | 0.244603 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | -0.279597 |
| 8 | -0.145138 | 0.464818 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -1.291914 | -0.279597 |
| 9 | 1.85474 | -0.856472 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | -0.279597 |
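
As a sanity check on the normalization, the sketch below reproduces the first Cement value by hand from the `describe()` statistics shown earlier (mean 281.167864, standard deviation 104.506364), then applies the same column-wise formula to a small toy array. The toy array is illustrative only, and `ddof=1` is passed to NumPy to mirror the pandas `std()` default that the notebook cell relies on.

```python
import numpy as np

# Worked check for row 0, Cement: (value - mean) / std from describe().
cement_z = (540.0 - 281.167864) / 104.506364
print(round(cement_z, 4))

# Same formula applied column-wise to a toy array (values are illustrative only).
X = np.array([[540.0, 162.0],
              [332.5, 228.0],
              [198.6, 192.0],
              [266.0, 228.0]])
col_mean = X.mean(axis=0)
col_std = X.std(axis=0, ddof=1)   # ddof=1 matches pandas' DataFrame.std()
X_norm = (X - col_mean) / col_std

# After normalization every column has mean ~0 and sample std ~1.
print(X_norm.mean(axis=0))
print(X_norm.std(axis=0, ddof=1))
```

The hand-computed Cement value agrees with the 2.476712 shown in the first row of the normalized table.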


Build the model:

In [14]:
n_cols = predictors_norm.shape[1]
def regression_model2():
    model2 = Sequential()
    model2.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model2.add(Dense(1))
    
    model2.compile(optimizer='adam', loss='mean_squared_error')
    return model2

model2 = regression_model2()

Train and test the model at the same time using the fit-method. We will leave out 30% of the data for validation and we will train the model for 50 epochs. **And use predictors_norm instead of predictors.**

In [15]:
list_of_mean_squared_error = []
for cycle in range(50):
    #Randomly split the data into a training set (70%) and a test set (30%):  
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)
    #Train and test the model at the same time
    res = model2.fit(X_train, y_train, epochs=50, verbose=0, validation_data=(X_test, y_test))
    #Find mean_squared_error as last value in history.
    mean_squared_error = res.history['val_loss'][-1]
    #Add value of mean_squared_error for every cycle in list.
    list_of_mean_squared_error.append(mean_squared_error)
    print('Cycle #{}: mean_squared_error {}'.format(cycle+1, mean_squared_error))

Cycle #1: mean_squared_error 326.0201416015625
Cycle #2: mean_squared_error 177.8172149658203
Cycle #3: mean_squared_error 106.13230895996094
Cycle #4: mean_squared_error 84.66824340820312
Cycle #5: mean_squared_error 73.25857543945312
Cycle #6: mean_squared_error 50.70921325683594
Cycle #7: mean_squared_error 53.93100357055664
Cycle #8: mean_squared_error 47.17451858520508
Cycle #9: mean_squared_error 46.02519226074219
Cycle #10: mean_squared_error 42.0709342956543
Cycle #11: mean_squared_error 42.78349304199219
Cycle #12: mean_squared_error 43.542388916015625
Cycle #13: mean_squared_error 35.81052780151367
Cycle #14: mean_squared_error 38.57561492919922
Cycle #15: mean_squared_error 39.287567138671875
Cycle #16: mean_squared_error 34.45658874511719
Cycle #17: mean_squared_error 39.172306060791016
Cycle #18: mean_squared_error 36.54716110229492
Cycle #19: mean_squared_error 33.448360443115234
Cycle #20: mean_squared_error 30.416309356689453
Cycle #21: mean_squared_error 38.82193374633

Find the mean and the standard deviation of the mean squared errors:

In [16]:
print('The mean of the mean squared errors: {}'.format(np.mean(list_of_mean_squared_error)))
print('The standard deviation of the mean squared errors: {}'.format(np.std(list_of_mean_squared_error)))

The mean of the mean squared errors: 48.35616844177246
The standard deviation of the mean squared errors: 46.15641806446406


**The mean and the standard deviation of the mean squared errors are lower in case B (48.4 and 46.2) than in case A (65.3 and 72.6). However, the difference between the two means is small compared with the spread of either set of runs. In my opinion it is not a good idea to compare the results of two weak neural networks with only one hidden layer. Data normalization does not help much here, and the error is large in both cases.**

## C. Increase the number of epochs (5 marks)

Repeat Part B but use 100 epochs this time for training.

How does the mean of the mean squared errors compare to that from Step B?

Train and test the model at the same time using the fit-method. We will leave out 30% of the data (data after normalization) for validation and we will train the model **for 100 epochs instead of 50 epochs**.

Build the model:

In [17]:
def regression_model3():
    model3 = Sequential()
    model3.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model3.add(Dense(1))
    
    model3.compile(optimizer='adam', loss='mean_squared_error')
    return model3

model3 = regression_model3() 

In [18]:
list_of_mean_squared_error = []
for cycle in range(50):
    #Randomly split the data into a training set (70%) and a test set (30%):  
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)
    #Train and test the model at the same time
    res = model3.fit(X_train, y_train, epochs=100, verbose=0, validation_data=(X_test, y_test))
    #Find mean_squared_error as last value in history.
    mean_squared_error = res.history['val_loss'][-1]
    #Add value of mean_squared_error for every cycle in list.
    list_of_mean_squared_error.append(mean_squared_error)
    print('Cycle #{}: mean_squared_error {}'.format(cycle+1, mean_squared_error))

Cycle #1: mean_squared_error 147.33978271484375
Cycle #2: mean_squared_error 101.4339370727539
Cycle #3: mean_squared_error 83.9281234741211
Cycle #4: mean_squared_error 74.05772399902344
Cycle #5: mean_squared_error 76.51408386230469
Cycle #6: mean_squared_error 59.28330612182617
Cycle #7: mean_squared_error 73.01850891113281
Cycle #8: mean_squared_error 74.66583251953125
Cycle #9: mean_squared_error 66.52885437011719
Cycle #10: mean_squared_error 54.58378982543945
Cycle #11: mean_squared_error 44.29137420654297
Cycle #12: mean_squared_error 47.55619430541992
Cycle #13: mean_squared_error 44.333824157714844
Cycle #14: mean_squared_error 43.20658874511719
Cycle #15: mean_squared_error 44.73106384277344
Cycle #16: mean_squared_error 38.6217155456543
Cycle #17: mean_squared_error 36.004486083984375
Cycle #18: mean_squared_error 43.84286880493164
Cycle #19: mean_squared_error 41.39344024658203
Cycle #20: mean_squared_error 41.35404586791992
Cycle #21: mean_squared_error 45.10295486450195


Find the mean and the standard deviation of the mean squared errors:

In [19]:
print('The mean of the mean squared errors: {}'.format(np.mean(list_of_mean_squared_error)))
print('The standard deviation of the mean squared errors: {}'.format(np.std(list_of_mean_squared_error)))

The mean of the mean squared errors: 50.318716049194336
The standard deviation of the mean squared errors: 19.274973746221793


**The mean of the mean squared errors in case C (50.3) is slightly higher than in case B (48.4), while the standard deviation is lower (19.3 versus 46.2). In both cases the error is large. In my opinion it is not a good idea to compare the results of two weak neural networks with only one hidden layer. Doubling the number of epochs does not help here.**

## D. Increase the number of hidden layers (5 marks)

Repeat part B but use a neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.

How does the mean of the mean squared errors compare to that from Step B?

Create a new model with **three hidden layers, each of 10 nodes and ReLU activation function.**

In [20]:
def regression_model4():
    model4 = Sequential()
    model4.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model4.add(Dense(10, activation='relu'))
    model4.add(Dense(10, activation='relu'))
    model4.add(Dense(1))
    
    model4.compile(optimizer='adam', loss='mean_squared_error')
    return model4
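
To put the size of this network in perspective, the sketch below counts the trainable parameters (weights plus biases) of a stack of fully connected layers and compares the one-hidden-layer model from Parts A to C with the three-hidden-layer model defined here. `dense_param_count` is a hypothetical helper written for this illustration; the input width of 8 follows from the eight predictor columns in this notebook.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        # One weight per (input, unit) pair plus one bias per unit.
        total += fan_in * units + units
        fan_in = units
    return total


n_cols = 8  # predictor columns in this notebook, including Age

baseline = dense_param_count(n_cols, [10, 1])          # one hidden layer
deeper = dense_param_count(n_cols, [10, 10, 10, 1])    # three hidden layers
print(baseline, deeper)  # → 101 321
```

The deeper model has roughly three times as many parameters, which is one plausible reason it can fit this data better within the same 50 epochs.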

Build a new model with 3 hidden layers:

In [21]:
model4 = regression_model4()

Train and test the model at the same time using the fit-method. We will leave out 30% of the data (data after normalization) for validation and we will train the model for 50 epochs and use **three hidden layers, each of 10 nodes and ReLU activation function**.

In [22]:
list_of_mean_squared_error = []
for cycle in range(50):
    #Randomly split the data into a training set (70%) and a test set (30%):  
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)
    #Train and test the model at the same time
    res = model4.fit(X_train, y_train, epochs=50, verbose=0, validation_data=(X_test, y_test))
    #Find mean_squared_error as last value in history.
    mean_squared_error = res.history['val_loss'][-1]
    #Add value of mean_squared_error for every cycle in list.
    list_of_mean_squared_error.append(mean_squared_error)
    print('Cycle #{}: mean_squared_error {}'.format(cycle+1, mean_squared_error))

Cycle #1: mean_squared_error 90.8314208984375
Cycle #2: mean_squared_error 51.01783752441406
Cycle #3: mean_squared_error 43.81345748901367
Cycle #4: mean_squared_error 41.918235778808594
Cycle #5: mean_squared_error 32.592342376708984
Cycle #6: mean_squared_error 33.457801818847656
Cycle #7: mean_squared_error 37.806636810302734
Cycle #8: mean_squared_error 29.911041259765625
Cycle #9: mean_squared_error 32.30567169189453
Cycle #10: mean_squared_error 32.94924545288086
Cycle #11: mean_squared_error 29.010751724243164
Cycle #12: mean_squared_error 27.430288314819336
Cycle #13: mean_squared_error 31.180578231811523
Cycle #14: mean_squared_error 31.150707244873047
Cycle #15: mean_squared_error 26.73761749267578
Cycle #16: mean_squared_error 31.86789321899414
Cycle #17: mean_squared_error 24.76600456237793
Cycle #18: mean_squared_error 25.435007095336914
Cycle #19: mean_squared_error 33.24559020996094
Cycle #20: mean_squared_error 27.346702575683594
Cycle #21: mean_squared_error 23.071638

Find the mean and the standard deviation of the mean squared errors:

In [23]:
print('The mean of the mean squared errors: {}'.format(np.mean(list_of_mean_squared_error)))
print('The standard deviation of the mean squared errors: {}'.format(np.std(list_of_mean_squared_error)))

The mean of the mean squared errors: 27.453756828308105
The standard deviation of the mean squared errors: 11.332802245938483


**The mean and the standard deviation of the mean squared errors in case D (27.5 and 11.3) are lower than in cases A, B and C, and this is the only case where the error is not very large. This suggests that, for this data, adding hidden layers matters more than normalization or extra epochs. It also supports the earlier point that comparing weak networks with only one hidden layer is a bad idea, because their results can be unpredictable.**