# Assignment - D: Build a Regression Model in Keras with Normalized Setting

## Download and Clean Dataset

Let's start by importing the pandas and NumPy libraries.

In [72]:
import pandas as pd
import numpy as np

We will be using the same dataset provided in the course.

The dataset is about the compressive strength of different samples of concrete based on the volumes of the different ingredients that were used to make them. Ingredients include:

1. Cement
2. Blast Furnace Slag
3. Fly Ash
4. Water
5. Superplasticizer
6. Coarse Aggregate
7. Fine Aggregate

Let's read the dataset into a *pandas* dataframe.

In [73]:
concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


So the first concrete sample has 540 cubic meters of cement, 0 cubic meters of blast furnace slag, 0 cubic meters of fly ash, 162 cubic meters of water, 2.5 cubic meters of superplasticizer, 1040 cubic meters of coarse aggregate, and 676 cubic meters of fine aggregate. At 28 days old, this mix has a compressive strength of 79.99 MPa.

Let's check how many data points we have.

In [74]:
concrete_data.shape

(1030, 9)

So, there are approximately 1000 samples to train our model on. With so few samples, we have to be careful not to overfit the training data.

Let's look at the summary statistics and check the dataset for any missing values.

In [75]:
concrete_data.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [76]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

The data looks very clean and is ready to be used to build our model.

### Split data into predictors and target

The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns.

In [77]:
concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column

Let's do a quick sanity check of the predictors and the target dataframes.

In [78]:
predictors.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [79]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

### Normalize Train Data

The last step is to normalize the predictors by subtracting each column's mean and dividing by its standard deviation.

In [80]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069
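
As a minimal sketch of the same z-score step on a toy array, the block below uses NumPy only. The values and the helper name `zscore` are made up for illustration. One detail worth noting: pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, while `np.std` defaults to `ddof=0`, so `ddof=1` is passed explicitly to mirror the pandas expression above.

```python
import numpy as np

def zscore(x, ddof=1):
    """Column-wise z-score: subtract the mean, divide by the standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=ddof)

# toy data: 4 samples, 2 features (illustrative values only)
toy = np.array([[540.0, 162.0],
                [332.5, 228.0],
                [198.6, 192.0],
                [266.0, 228.0]])

toy_norm = zscore(toy)
print(toy_norm.mean(axis=0))         # approximately 0 for every column
print(toy_norm.std(axis=0, ddof=1))  # approximately 1 for every column
```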


Let's save the number of predictors to n_cols since we will need this number when building our network.

In [81]:
n_cols = predictors_norm.shape[1] # number of predictors

## Build a Baseline Model

### Import Keras

The extremely powerful yet easy-to-use **Keras** library runs on top of a low-level library such as **TensorFlow** or **PyTorch**. This assignment uses **TensorFlow** as the backend.

In [82]:
import keras

Importing Keras confirms that it is running on the TensorFlow backend.

Let's import the rest of the packages from the Keras library that we will need to build our regression model.

In [83]:
from keras.models import Sequential
from keras.layers import Dense

### Build a Neural Network

Let's define a function that defines our regression model for us so that we can conveniently call it to create our model.

In [84]:
# define regression model
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

The above function creates a model using the following settings:
- 3 hidden layers of *10 nodes* each, followed by a single output node
- *ReLU* activation function in the hidden layers
- *adam* optimizer
- *mean squared error* for the loss function
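
To make the size of this network concrete, the number of trainable parameters can be worked out by hand: each `Dense` layer has `inputs * units` weights plus `units` biases. The sketch below is plain Python; the helper name `dense_params` is hypothetical, and `n_cols = 8` is assumed from the eight predictor columns shown earlier.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

n_cols = 8                     # eight predictor columns in this dataset
layer_units = [10, 10, 10, 1]  # three hidden layers and one output node

total = 0
n_inputs = n_cols
for units in layer_units:
    total += dense_params(n_inputs, units)
    n_inputs = units           # this layer's output feeds the next layer

print(total)  # 321
```

If Keras is available, calling `model.summary()` on the model returned by `regression_model()` should report the same total.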

### Generate the Test Dataset

Let's import *train_test_split* from *scikit-learn* in order to randomly split the data into training and test sets.

In [85]:
from sklearn.model_selection import train_test_split

Randomly split the data into training and test sets, holding out 30% of the data for testing.

In [86]:
X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=42)
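
The notebook normalizes the full dataset before splitting, so the test rows contribute to the mean and standard deviation. A common alternative, sketched here with NumPy on synthetic data, computes the statistics on the training rows only and reuses them for the test rows. The helper name `split_then_normalize` and the random data are assumptions for illustration, not part of the assignment.

```python
import numpy as np

def split_then_normalize(X, y, test_size=0.3, seed=42):
    """Shuffle, split, then scale both parts with training-set statistics."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]

    # statistics come from the training rows only
    mean = X[train_idx].mean(axis=0)
    std = X[train_idx].std(axis=0, ddof=1)

    X_train = (X[train_idx] - mean) / std
    X_test = (X[test_idx] - mean) / std  # same statistics, no test-set leakage
    return X_train, X_test, y[train_idx], y[test_idx]

# synthetic stand-in for the predictors and target (not the concrete data)
rng = np.random.default_rng(0)
X = rng.normal(loc=100.0, scale=20.0, size=(1030, 8))
y = rng.normal(loc=35.0, scale=16.0, size=1030)

X_train, X_test, y_train, y_test = split_then_normalize(X, y)
print(X_train.shape, X_test.shape)  # (721, 8) (309, 8)
```

With this approach the training columns have mean 0 and standard deviation 1 exactly, while the test columns are only approximately standardized, which is the expected behavior.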

### Train and Evaluate the Neural Network Model

Let's call the function now to create our model.

In [87]:
# build the model
model = regression_model()

Next, we will train the model for 50 epochs with the verbose setting set to 2, which prints one line per epoch.

In [88]:
# fit the model - 50 epochs, verbose=2 (one line per epoch)
model.fit(X_train, y_train, epochs=50, verbose=2)

Epoch 1/50
 - 2s - loss: 1609.0714
Epoch 2/50
 - 0s - loss: 1595.4728
Epoch 3/50
 - 0s - loss: 1582.0063
Epoch 4/50
 - 0s - loss: 1568.2681
Epoch 5/50
 - 1s - loss: 1554.3337
Epoch 6/50
 - 0s - loss: 1539.4098
Epoch 7/50
 - 1s - loss: 1523.8552
Epoch 8/50
 - 1s - loss: 1507.2200
Epoch 9/50
 - 1s - loss: 1489.7083
Epoch 10/50
 - 0s - loss: 1470.6858
Epoch 11/50
 - 0s - loss: 1450.2729
Epoch 12/50
 - 1s - loss: 1428.5630
Epoch 13/50
 - 1s - loss: 1405.0577
Epoch 14/50
 - 0s - loss: 1380.0358
Epoch 15/50
 - 0s - loss: 1353.1665
Epoch 16/50
 - 1s - loss: 1325.0359
Epoch 17/50
 - 0s - loss: 1295.0350
Epoch 18/50
 - 1s - loss: 1264.2892
Epoch 19/50
 - 0s - loss: 1231.6087
Epoch 20/50
 - 1s - loss: 1198.3305
Epoch 21/50
 - 0s - loss: 1163.5578
Epoch 22/50
 - 1s - loss: 1127.8993
Epoch 23/50
 - 0s - loss: 1091.6310
Epoch 24/50
 - 1s - loss: 1054.2765
Epoch 25/50
 - 1s - loss: 1017.3353
Epoch 26/50
 - 1s - loss: 979.9666
Epoch 27/50
 - 0s - loss: 942.3540
Epoch 28/50
 - 1s - loss: 905.0294
Epoc

<keras.callbacks.History at 0x7f0ea46de0f0>

Next we need to evaluate the model on the test data.

In [89]:
# evaluate the model using the test dataset
loss_val = model.evaluate(X_test, y_test)

# generate predictions for the test set
y_pred = model.predict(X_test)

# output the evaluate value
print(loss_val)

# output the prediction
# print(y_pred)

295.52732044207625


Now we need to compute the mean squared error between the predicted concrete strength and the actual concrete strength.

Let's import the *mean_squared_error* function from *scikit-learn*.

In [90]:
from sklearn.metrics import mean_squared_error

Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength.

In [91]:
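# a single MSE value for this one train/test split
mean_square_error = mean_squared_error(y_test, y_pred)
# mean and standard deviation of one value: the std is trivially 0.0
mean = np.mean(mean_square_error)
standard_deviation = np.std(mean_square_error)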

print(mean, standard_deviation)

295.52731551404196 0.0
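
For reference, the mean squared error can also be computed directly with NumPy. The sketch below uses made-up numbers and shows one pitfall worth knowing: `model.predict` returns an array of shape `(n, 1)`, so subtracting it from a 1-D NumPy target without flattening broadcasts to an `(n, n)` matrix and silently gives a different number. The names `y_true` and `y_hat` are illustrative.

```python
import numpy as np

y_true = np.array([79.99, 61.89, 40.27])    # 1-D targets, shape (3,)
y_hat = np.array([[70.0], [60.0], [45.0]])  # predictions, shape (3, 1)

# correct: flatten the predictions so both arrays have shape (3,)
mse = np.mean((y_true - y_hat.ravel()) ** 2)

# pitfall: (3,) minus (3, 1) broadcasts to a (3, 3) matrix of pairwise differences
wrong = np.mean((y_true - y_hat) ** 2)

print(mse, wrong)
print(np.sqrt(mse))  # RMSE, in the same units as the target (MPa)
```

Taking the square root of the MSE gives an error in MPa, which is often easier to interpret than the squared value.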


Create a list of 50 mean squared errors and report the mean and the standard deviation of the mean squared errors.

In [None]:
total_mean_squared_errors = 50
epochs = 50

mean_squared_errors = []

for i in range(0, total_mean_squared_errors):
    # split the train and test datasets
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=i)
    
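    # fit the model
    # NOTE: the same model instance is reused, so its weights carry over between
    # iterations; call regression_model() here to train from scratch each time
    model.fit(X_train, y_train, epochs=epochs, verbose=0)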
    
    # evaluate the model
    mse = model.evaluate(X_test, y_test, verbose=0)
    print("mean_square_error at " + str(i+1) + " is " + str(mse))
    
    # get the prediction
    y_pred = model.predict(X_test)
    
    # get the mean square error
    mean_square_error = mean_squared_error(y_test, y_pred)
    
    # append result to the mean_squared_errors list
    mean_squared_errors.append(mean_square_error)

# convert the list to a NumPy array
mean_squared_errors = np.array(mean_squared_errors)

# mean of the mean squared errors
mean = np.mean(mean_squared_errors)

# standard deviation of the mean squared errors
standard_deviation = np.std(mean_squared_errors)

print('\n')
print('\n')
print("The output below is the mean and standard deviation of " + str(total_mean_squared_errors) + " mean squared errors with normalized data.")
print("- Number of epochs for each training: " + str(epochs))
print("- Mean: " + str(mean))
print("- Standard Deviation: " + str(standard_deviation))

mean_square_error at 1 is 147.62524270857037
mean_square_error at 2 is 147.82499038054334
mean_square_error at 3 is 100.32301755244679
mean_square_error at 4 is 93.62093687829076
mean_square_error at 5 is 82.85421777620284
mean_square_error at 6 is 79.0075955313772
mean_square_error at 7 is 80.42990465380227
mean_square_error at 8 is 59.17743701379276
mean_square_error at 9 is 57.967825016157526
mean_square_error at 10 is 47.81789901881542
mean_square_error at 11 is 44.803516141033484
mean_square_error at 12 is 39.34256292546837
mean_square_error at 13 is 46.310660269653916
mean_square_error at 14 is 45.718730728988895
mean_square_error at 15 is 39.152544830223505
mean_square_error at 16 is 31.981984320581923
mean_square_error at 17 is 39.50908976311051
mean_square_error at 18 is 37.050870370710555
mean_square_error at 19 is 34.808084333598806
mean_square_error at 20 is 37.2925007026944
mean_square_error at 21 is 33.57168663049593
mean_square_error at 22 is 33.603471323895995
mean_squa