## Build a Regression Model in Keras (B)

In [1]:
import pandas as pd
import numpy as np

The dataset records the compressive strength of concrete samples together with the volumes of the ingredients used to make them. Ingredients include:

1. Cement

2. Blast Furnace Slag

3. Fly Ash

4. Water

5. Superplasticizer

6. Coarse Aggregate

7. Fine Aggregate

Let's read the dataset into a pandas dataframe.

In [2]:
df=pd.read_csv("concrete_data.csv")
df.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


Let's check how many data points we have.

In [3]:
df.shape

(1030, 9)

So, there are approximately 1000 samples to train our model on. Because the dataset is this small, we have to be careful not to overfit the training data.

Let's look at the summary statistics and check the dataset for any missing values.

In [7]:
df.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [8]:
df.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

The data is ready to be used to build our model.

### Split data into predictors and target

The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns.

In [9]:
concrete_data_columns = df.columns
predictors = df[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = df['Strength']
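
An equivalent way to separate predictors from the target is `DataFrame.drop`. The sketch below uses a tiny made-up frame (the values and the reduced column set are illustrative assumptions, not the concrete dataset) to show that the mask-based selection used above and `drop` pick the same columns.

```python
import pandas as pd

# Tiny illustrative frame; values are made up, not read from concrete_data.csv.
toy = pd.DataFrame({
    "Cement": [540.0, 332.5, 198.6],
    "Water": [162.0, 228.0, 192.0],
    "Age": [28, 270, 360],
    "Strength": [79.99, 40.27, 44.30],
})

# Approach used above: boolean mask over the column index.
cols = toy.columns
predictors_mask = toy[cols[cols != "Strength"]]

# Alternative: drop the target column by name.
predictors_drop = toy.drop(columns="Strength")
target_toy = toy["Strength"]

print(list(predictors_drop.columns))
print(predictors_mask.equals(predictors_drop))
```

Either form works; `drop` tends to read more directly when only one column is excluded.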

Let's do a quick check of the predictors dataframe and the target series.

In [10]:
predictors.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [11]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

Recall that one way to normalize the data is to subtract the mean from each predictor and divide by its standard deviation.

In [25]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069
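
To make the normalization step concrete, here is a minimal NumPy sketch on a made-up feature matrix (the numbers are illustrative assumptions). One detail worth knowing: pandas' `DataFrame.std()` uses the sample standard deviation (`ddof=1`) by default, while `np.std` defaults to `ddof=0`, so `ddof=1` is passed explicitly to mirror the pandas behavior.

```python
import numpy as np

# Made-up feature matrix: 4 samples, 2 features on very different scales.
X = np.array([
    [540.0, 28.0],
    [332.5, 270.0],
    [198.6, 360.0],
    [266.0, 90.0],
])

# Column-wise statistics; ddof=1 matches the pandas default for std().
mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)

# Standardize: subtract the mean, divide by the standard deviation.
X_norm = (X - mean) / std

# After standardization each column has mean ~0 and sample std ~1.
print(X_norm.mean(axis=0))
print(X_norm.std(axis=0, ddof=1))
```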


In [27]:
n_cols=predictors_norm.shape[1]
n_cols

8

### Import Keras

In [28]:
import keras

In [29]:
from keras.models import Sequential
from keras.layers import Dense

In [30]:
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

The regression_model function creates a model that has one hidden layer with 10 neurons and a ReLU activation function. It uses the adam optimizer and the mean squared error as the loss function.
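
As a rough illustration of what this architecture computes, the NumPy sketch below mirrors the two layers by hand and counts the trainable parameters. The random weights are stand-ins chosen for the example, not the values Keras would initialize or learn.

```python
import numpy as np

n_cols = 8      # number of predictors, as computed above
hidden = 10     # neurons in the hidden layer

rng = np.random.default_rng(0)

# Random weights stand in for a model's parameters (illustrative only).
W1 = rng.normal(size=(n_cols, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(size=(hidden, 1))
b2 = np.zeros(1)

def forward(X):
    """Hidden layer with ReLU, followed by a linear output unit."""
    h = np.maximum(0.0, X @ W1 + b1)   # ReLU hidden layer
    return h @ W2 + b2                 # no activation: raw regression output

n_params = W1.size + b1.size + W2.size + b2.size
print(n_params)  # 8*10 + 10 + 10*1 + 1 = 101

X_demo = rng.normal(size=(5, n_cols))
print(forward(X_demo).shape)  # (5, 1)
```

The output layer has no activation because the target is an unbounded real value, which is the usual choice for regression.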

In [31]:
from sklearn.model_selection import train_test_split

Randomly split the data into training and test sets, holding out 30% of the data for testing.

In [32]:
X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=42)
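
Conceptually, a random hold-out split shuffles the row indices once and cuts them in two. The sketch below shows that idea with NumPy on placeholder arrays of the same size as the dataset; `simple_split` is a hypothetical helper written for this example and does not reproduce scikit-learn's exact shuffling, so the selected rows will differ from those of `train_test_split`.

```python
import numpy as np

def simple_split(X, y, test_size=0.3, seed=42):
    """Shuffle row indices once, then cut them into test and train parts."""
    n = len(X)
    n_test = int(round(test_size * n))
    idx = np.random.default_rng(seed).permutation(n)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Placeholder arrays with the same number of rows and predictors as the dataset.
X = np.arange(1030 * 8, dtype=float).reshape(1030, 8)
y = np.arange(1030, dtype=float)

X_tr, X_te, y_tr, y_te = simple_split(X, y)
print(X_tr.shape, X_te.shape)  # (721, 8) (309, 8)
```

Fixing the seed makes the split reproducible, which is the role `random_state` plays in the cell above.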

### Train the Model

Let's create our model.

In [33]:
# build the model
model = regression_model()

Train the model on the training data using 50 epochs.

In [34]:
epochs = 50
model.fit(X_train, y_train, epochs=epochs, verbose=1)

Epoch 1/50
...
Epoch 50/50


<keras.callbacks.History at 0x24f983cde50>

Evaluate the model on the test data.

In [35]:
loss_val = model.evaluate(X_test, y_test)
y_pred = model.predict(X_test)
loss_val



403.8485107421875

In [36]:
from sklearn.metrics import mean_squared_error

Compute the mean squared error between the predicted and the actual concrete strength.

In [37]:
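mean_square_error = mean_squared_error(y_test, y_pred)
# mean_square_error is a single number here, so its mean is the value itself
# and its standard deviation is 0; both become informative only over many runs.
mean = np.mean(mean_square_error)
standard_deviation = np.std(mean_square_error)
print(mean, standard_deviation)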

403.84854314123425 0.0
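
For reference, the mean squared error is simply the average of the squared residuals. The small worked example below uses made-up values (an assumption for illustration) and also shows why the standard deviation printed above is 0.0: it is taken over a single number.

```python
import numpy as np

# Made-up true and predicted strengths for four samples.
y_true = np.array([30.0, 45.0, 50.0, 20.0])
y_hat = np.array([28.0, 47.0, 44.0, 20.0])

# MSE is the average of the squared residuals.
mse = np.mean((y_true - y_hat) ** 2)
print(mse)  # (4 + 4 + 36 + 0) / 4 = 11.0

# The standard deviation of a single number is always 0.
print(np.std(mse))  # 0.0

# A Keras model ending in Dense(1) returns predictions of shape (n, 1);
# flattening them avoids accidental (n, n) broadcasting against a 1-D target.
y_hat_col = y_hat.reshape(-1, 1)
mse_flat = np.mean((y_true - y_hat_col.ravel()) ** 2)
print(mse_flat)  # 11.0
```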


Create a list of 50 mean squared errors and report the mean and the standard deviation of the mean squared errors.
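
The general pattern for this kind of repeated evaluation is: draw a new random split, fit a fresh model, record the test MSE, and summarize at the end. The sketch below illustrates it with synthetic data and an ordinary least-squares fit standing in for the neural network; the data, `fit_fresh_model`, and `predict` are all assumptions made for the example, not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (illustrative assumption).
n, d = 200, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=n)

def fit_fresh_model(X_train, y_train):
    """Stand-in for building and fitting a new model: least squares with intercept."""
    A = np.column_stack([X_train, np.ones(len(X_train))])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return coef

def predict(coef, X_new):
    """Apply the fitted coefficients (last entry is the intercept)."""
    return np.column_stack([X_new, np.ones(len(X_new))]) @ coef

n_repeats = 50
mses = []
for seed in range(n_repeats):
    # New 70/30 split for every repetition.
    idx = np.random.default_rng(seed).permutation(n)
    n_test = int(round(0.3 * n))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # A new model is fitted from scratch in every iteration, so nothing
    # learned on one split carries over to the next.
    coef = fit_fresh_model(X[train_idx], y[train_idx])
    residuals = y[test_idx] - predict(coef, X[test_idx])
    mses.append(np.mean(residuals ** 2))

mses = np.array(mses)
print(len(mses), mses.mean(), mses.std())
```

With a Keras model, the analogous step would be to call `regression_model()` inside the loop before `fit`, so that each of the 50 errors comes from an independently trained network.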

In [38]:
total_mean_squared_errors = 50
epochs = 50
mean_squared_errors = []
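for i in range(0, total_mean_squared_errors):
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=i)
    # Note: `model` is the instance trained above and is not re-created here, so its
    # weights carry over between iterations; call regression_model() inside the loop
    # for independent runs.
    model.fit(X_train, y_train, epochs=epochs, verbose=0)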
    MSE = model.evaluate(X_test, y_test, verbose=0)
    print("MSE "+str(i+1)+": "+str(MSE))
    y_pred = model.predict(X_test)
    mean_square_error = mean_squared_error(y_test, y_pred)
    mean_squared_errors.append(mean_square_error)

mean_squared_errors = np.array(mean_squared_errors)
mean = np.mean(mean_squared_errors)
standard_deviation = np.std(mean_squared_errors)

print('\n')
print("Below is the mean and standard deviation of " +str(total_mean_squared_errors) + " mean squared errors with normalized data. Total number of epochs for each training is: " +str(epochs) + "\n")
print("Mean: "+str(mean))
print("Standard Deviation: "+str(standard_deviation))

MSE 1: 152.1336669921875
MSE 2: 152.1161346435547
MSE 3: 91.62457275390625
MSE 4: 74.1259765625
MSE 5: 57.72782516479492
MSE 6: 56.269657135009766
MSE 7: 52.201316833496094
MSE 8: 39.488975524902344
MSE 9: 42.229557037353516
MSE 10: 44.02722930908203
MSE 11: 38.85687255859375
MSE 12: 37.88656997680664
MSE 13: 44.983978271484375
MSE 14: 43.00558853149414
MSE 15: 39.62437057495117
MSE 16: 35.399009704589844
MSE 17: 38.094329833984375
MSE 18: 35.95624923706055
MSE 19: 35.77860641479492
MSE 20: 35.030555725097656
MSE 21: 33.90412139892578
MSE 22: 32.13026428222656
MSE 23: 32.66349792480469
MSE 24: 36.26028060913086
MSE 25: 34.59671401977539
MSE 26: 37.02382278442383
MSE 27: 33.36676025390625
MSE 28: 30.232345581054688
MSE 29: 35.198028564453125
MSE 30: 34.619911193847656
MSE 31: 33.97964096069336
MSE 32: 30.225296020507812
MSE 33: 32.730709075927734
MSE 34: 33.82404327392578
MSE 35: 34.36322784423828
MSE 36: 38.07405090332031
MSE 37: 32.073822021484375
MSE 38: 35.447330474853516
MSE 39: 32