## Build a Regression Model in Keras (A)

In [1]:
import pandas as pd
import numpy as np

The dataset is about the compressive strength of different samples of concrete based on the volumes of the different ingredients that were used to make them. Ingredients include:

1. Cement

2. Blast Furnace Slag

3. Fly Ash

4. Water

5. Superplasticizer

6. Coarse Aggregate

7. Fine Aggregate

Let's read the dataset into a pandas dataframe.

In [2]:
df=pd.read_csv("concrete_data.csv")
df.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


Let's check how many data points we have.

In [3]:
df.shape

(1030, 9)

So, there are approximately 1000 samples to train our model on. 
Because of the few samples, we have to be careful not to overfit the training data.

Let's check the dataset for any missing values.

In [7]:
df.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [8]:
df.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

The data is ready to be used to build our model.

### Split data into predictors and target

The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns.

In [9]:
concrete_data_columns = df.columns
predictors = df[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = df['Strength']

Let's do a check of the predictors and the target dataframes.

In [10]:
predictors.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [11]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

In [13]:
n_cols=predictors_norm.shape[1]
n_cols

8

### Import Keras

In [14]:
import keras

In [15]:
from keras.models import Sequential
from keras.layers import Dense

In [16]:
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

The regression_model function creates a model that has one hidden layer with 10 neurons and a ReLU activation function. It uses the adam optimizer and the mean squared error as the loss function.

In [17]:
from sklearn.model_selection import train_test_split

Randomly splitting the data into a training and test sets by holding 30% of the data for testing

In [18]:
X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=42)

### Train the Model

Let's create our model.

In [19]:
# build the model
model = regression_model()

Train the model on the training data using 50 epochs.

In [20]:
epochs = 50
model.fit(X_train, y_train, epochs=epochs, verbose=1)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.callbacks.History at 0x24f970be190>

Evaluate the model on the test data 

In [21]:
loss_val = model.evaluate(X_test, y_test)
y_pred = model.predict(X_test)
loss_val



242.2939910888672

In [22]:
from sklearn.metrics import mean_squared_error

Compute the mean_squared_error between the predict concrete strength and the actual concrete strength

In [23]:
mean_square_error = mean_squared_error(y_test, y_pred)
mean = np.mean(mean_square_error)
standard_deviation = np.std(mean_square_error)
print(mean, standard_deviation)

242.2940002023023 0.0


Create a list of 50 mean squared errors  and report mean and the standard deviation of the mean squared errors

In [24]:
total_mean_squared_errors = 50
epochs = 50
mean_squared_errors = []
for i in range(0, total_mean_squared_errors):
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=i)
    model.fit(X_train, y_train, epochs=epochs, verbose=0)
    MSE = model.evaluate(X_test, y_test, verbose=0)
    print("MSE "+str(i+1)+": "+str(MSE))
    y_pred = model.predict(X_test)
    mean_square_error = mean_squared_error(y_test, y_pred)
    mean_squared_errors.append(mean_square_error)

mean_squared_errors = np.array(mean_squared_errors)
mean = np.mean(mean_squared_errors)
standard_deviation = np.std(mean_squared_errors)

print('\n')
print("Below is the mean and standard deviation of " +str(total_mean_squared_errors) + " mean squared errors without normalized data. Total number of epochs for each training is: " +str(epochs) + "\n")
print("Mean: "+str(mean))
print("Standard Deviation: "+str(standard_deviation))

MSE 1: 141.3468017578125
MSE 2: 142.78341674804688
MSE 3: 101.7087631225586
MSE 4: 96.55912780761719
MSE 5: 87.07420349121094
MSE 6: 84.63990783691406
MSE 7: 87.24091339111328
MSE 8: 69.85945892333984
MSE 9: 71.36014556884766
MSE 10: 61.299137115478516
MSE 11: 57.55650329589844
MSE 12: 53.530364990234375
MSE 13: 59.90700912475586
MSE 14: 53.951053619384766
MSE 15: 47.50260543823242
MSE 16: 39.693603515625
MSE 17: 43.71796798706055
MSE 18: 41.58518600463867
MSE 19: 41.19272994995117
MSE 20: 42.897003173828125
MSE 21: 40.62035369873047
MSE 22: 40.5488395690918
MSE 23: 38.46042251586914
MSE 24: 40.988548278808594
MSE 25: 43.79781723022461
MSE 26: 43.08354187011719
MSE 27: 40.883846282958984
MSE 28: 40.52485275268555
MSE 29: 46.55608367919922
MSE 30: 43.806697845458984
MSE 31: 41.86607360839844
MSE 32: 39.52911376953125
MSE 33: 38.726959228515625
MSE 34: 44.18952941894531
MSE 35: 41.89140319824219
MSE 36: 47.175167083740234
MSE 37: 41.864654541015625
MSE 38: 45.62808609008789
MSE 39: 42.71