<h1 align=center><font size = 5>Week 5 Assignment</font></h1>

<a id="item31"></a>

Let's start by importing the <em>pandas</em> and NumPy libraries, along with the scikit-learn function we need.

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

Let's read the data from a local CSV file into a <em>pandas</em> dataframe.

In [2]:
concrete_data = pd.read_csv('concrete_data.csv')
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


#### Let's check how many data points we have.

In [3]:
concrete_data.shape

(1030, 9)

So there are 1030 samples in total, each with 9 columns. With so few samples, we have to be careful not to overfit the training data.

Let's look at the summary statistics and check the dataset for any missing values.

In [4]:
concrete_data.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [5]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

There are no missing values, so the data is clean and ready to be used to build our model.

#### Split data into predictors and target

The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns.

In [6]:
concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column

<a id="item2"></a>

Let's do a quick sanity check of the predictors dataframe and the target series.

In [7]:
predictors.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [8]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

Let's save the number of predictors to *n_cols* since we will need this number when building our network.

In [9]:
n_cols = predictors.shape[1] # number of predictors

Split the data into training and test sets (70% / 30%) using scikit-learn's built-in train_test_split function.

In [10]:
x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3)

In [11]:
print(x_train.shape) #check new datasets
print(x_test.shape)

(721, 8)
(309, 8)
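
The two shapes follow from the 70/30 split of the 1030 rows. Here is a rough sketch of the arithmetic, assuming (as I understand scikit-learn's behaviour) that a fractional `test_size` is rounded up to a whole number of samples and the training set receives the rest. The variable names are only illustrative.

```python
import math

n_samples = 1030      # rows in concrete_data
test_size = 0.3       # fraction passed to train_test_split

# Assumption: the test set gets ceil(test_size * n_samples) rows,
# and the training set gets whatever is left.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 721 309
```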


Let's go ahead and import the Keras library

In [12]:
import tensorflow.keras # I use 'tensorflow.keras' instead of plain 'keras' because of conflicts
                        # between the GPU-based versions of TensorFlow and Keras on my PC.

Let's import the rest of the packages from the Keras library that we will need to build our regression model.

In [13]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

<a id='item33'></a>

## Build a Neural Network

Let's define a function that builds our simple regression model with one hidden layer, so that we can conveniently call it whenever we need a fresh model.

In [14]:
# define regression model
def regression_model_simple():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mean_squared_error'])
    return model

The above function creates a model with one hidden layer of 10 units.

## Train and Test the Networks

Let's call the function now to create our model.

In [15]:
# build the model
model = regression_model_simple()

A note on the number of layers in Keras, taken from a StackOverflow question:

*I'm a bit confused about the number of layers that are used in Keras models.
The documentation is rather opaque on the matter.
According to Jason Brownlee the first layer technically consists of two layers, the input layer, specified by input_dim and a hidden layer.*

To check the actual number of layers, we can use the summary() method.

In [16]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense (Dense)                (None, 10)                90        
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 11        
Total params: 101
Trainable params: 101
Non-trainable params: 0
_________________________________________________________________


So the summary lists two Dense layers: one hidden layer with 10 neurons and one output layer. The input layer is implied by input_shape and has no parameters of its own.
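
The Param # column can be reproduced by hand: a Dense layer has one weight per input-output pair plus one bias per output unit. A small sketch of that count follows; the helper name `dense_params` is made up for illustration.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer with biases."""
    return n_inputs * n_units + n_units

n_cols = 8  # number of predictors in this notebook

hidden = dense_params(n_cols, 10)   # 8 * 10 weights + 10 biases
output = dense_params(10, 1)        # 10 * 1 weights + 1 bias

print(hidden, output, hidden + output)  # 90 11 101
```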

In [17]:
# fit the model
model.fit(x_train, y_train, epochs=50, verbose=0) #no visible output

<tensorflow.python.keras.callbacks.History at 0x1f32e19e088>

In [18]:
scores = model.evaluate(x_test, y_test, verbose=1)
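
Since the model was compiled with mean squared error as both the loss and the metric, `scores` should hold that value twice: first as the loss, then as the metric. For reference, here is a plain-Python sketch of the quantity being reported. The function name `mse` and the toy numbers are illustrative only.

```python
def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

# Toy values, not taken from the concrete dataset.
y_true = [30.0, 40.0, 50.0]
y_pred = [28.0, 43.0, 50.0]

print(mse(y_true, y_pred))  # (4 + 9 + 0) / 3
```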



#### First NN

Now we create a list called MSE_NN (NN for non-normalized) and append to it the test mean squared error of each of 50 networks, every one trained from scratch on a fresh random split.

In [19]:
MSE_NN = []
for i in range(50):
    x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3)
    model = regression_model_simple()
    model.fit(x_train, y_train, epochs=50, verbose=0)
    scores = model.evaluate(x_test, y_test, verbose=0) # verbose=0 here and below keeps the training log from flooding the notebook
    MSE_NN.append(scores[1])
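
The same split-train-evaluate loop is repeated for every experiment in this notebook, so it could be wrapped in a small helper. The sketch below shows one possible shape for it and is not the code used here. `repeat_runs` and `fake_run` are hypothetical names, and `fake_run` only stands in for one split/fit/evaluate cycle so that the example runs without Keras.

```python
import random

def repeat_runs(run_once, n_runs=50):
    """Call run_once() n_runs times and collect the returned MSE values."""
    return [run_once() for _ in range(n_runs)]

def fake_run():
    # Stand-in for: split the data, build a fresh model, fit, evaluate,
    # and return the test MSE (scores[1] in the notebook).
    return random.uniform(100.0, 200.0)

random.seed(0)  # only to make the toy example repeatable
mse_values = repeat_runs(fake_run, n_runs=5)
print(len(mse_values))  # 5
```

In the notebook, `run_once` would be a function that calls train_test_split, builds a model with regression_model_simple(), fits it and returns the evaluated MSE.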

Convert the list to a NumPy array so we can use NumPy's built-in mean and std methods.

In [20]:
MSE_NN_np = np.array(MSE_NN, dtype=np.float32)
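
One detail worth keeping in mind: NumPy's `std` divides by n by default (population standard deviation), whereas pandas' `std` divides by n - 1. The sketch below shows the difference on made-up values using only the standard library. The numbers are hypothetical and not results from this notebook.

```python
import statistics

# Hypothetical MSE values from three runs (not real results).
mse_runs = [100.0, 200.0, 300.0]

mean_mse = statistics.mean(mse_runs)
std_population = statistics.pstdev(mse_runs)  # divides by n, like np.std's default
std_sample = statistics.stdev(mse_runs)       # divides by n - 1

print(mean_mse, std_population, std_sample)
```

With 50 runs the two versions differ by only about 1%, so the choice does not change the conclusions here.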

Print the metrics.

In [21]:
print("For non-normalized data")
print("Mean of MSE: {}".format(MSE_NN_np.mean()))
print("Standard deviation of MSE: {}".format(MSE_NN_np.std()))

For non-normalized data
Mean of MSE: 432.495361328125
Standard deviation of MSE: 477.3530578613281


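Pretty high errors! The standard deviation is even larger than the mean, so the results vary widely from one split to the next.
Now let's normalize the data by subtracting the mean and dividing by the standard deviation.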

In [22]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069
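
The first normalized Cement value can be checked by hand from the summary statistics printed earlier (mean 281.167864, standard deviation 104.506364). Below is a short sketch, with `z_score` as an illustrative helper name. Note that pandas' `std()` uses the sample standard deviation (n - 1 in the denominator) by default, which `statistics.stdev` mirrors.

```python
import statistics

def z_score(value, mean, std):
    """Standardize a single value: subtract the mean, divide by the std."""
    return (value - mean) / std

# Cement in row 0, with the mean and std reported by describe().
cement_z = z_score(540.0, 281.167864, 104.506364)
print(round(cement_z, 4))  # 2.4767

# The same idea applied to a whole (toy) column.
column = [1.0, 2.0, 3.0, 4.0, 5.0]
mu = statistics.mean(column)
sigma = statistics.stdev(column)  # sample std, n - 1 in the denominator
normalized = [z_score(v, mu, sigma) for v in column]
```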


#### Second NN

Repeat the same evaluation on the normalized dataset.

In [24]:
MSE_Norm = []
for i in range(50):
    x_train, x_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)
    model = regression_model_simple()
    model.fit(x_train, y_train, epochs=50, verbose=0)
    scores = model.evaluate(x_test, y_test, verbose=0)
    MSE_Norm.append(scores[1])

And the same metrics output.

In [25]:
MSE_Norm_np = np.array(MSE_Norm, dtype=np.float32)

In [26]:
print("For normalized data")
print("Mean of MSE: {}".format(MSE_Norm_np.mean()))
print("Standard deviation of MSE: {}".format(MSE_Norm_np.std()))

For normalized data
Mean of MSE: 349.9527282714844
Standard deviation of MSE: 102.25519561767578


In this case we see a moderate decrease in the mean MSE (from about 432 to 350) and a much larger decrease in its standard deviation (from about 477 to 102) compared to the first NN.

#### Third NN

Now we train the same network for 100 epochs instead of 50.

In [27]:
MSE_Norm_100 = []
for i in range(50):
    x_train, x_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)
    model = regression_model_simple()
    model.fit(x_train, y_train, epochs=100, verbose=0)
    scores = model.evaluate(x_test, y_test, verbose=0)
    MSE_Norm_100.append(scores[1])

In [28]:
MSE_Norm_100_np = np.array(MSE_Norm_100, dtype=np.float32)

In [29]:
print("For normalized data, 100 epochs")
print("Mean of MSE: {}".format(MSE_Norm_100_np.mean()))
print("Standard deviation of MSE: {}".format(MSE_Norm_100_np.std()))

For normalized data, 100 epochs
Mean of MSE: 162.89479064941406
Standard deviation of MSE: 14.343791961669922


The errors decrease further compared to the second NN: the mean MSE drops from about 350 to 163 and the standard deviation from about 102 to 14. We are moving in the right direction.

#### Fourth NN

And the last one: a neural network with 3 hidden layers, trained for 50 epochs.

In [30]:
# define regression model with 3 hidden layers
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error', metrics=['mean_squared_error'])
    return model

In [31]:
MSE_Norm_three_layers = []
for i in range(50):
    x_train, x_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3)
    model = regression_model()
    model.fit(x_train, y_train, epochs=50, verbose=0)
    scores = model.evaluate(x_test, y_test, verbose=0)
    MSE_Norm_three_layers.append(scores[1])

In [32]:
MSE_Norm_three_layers_np = np.array(MSE_Norm_three_layers, dtype=np.float32)

In [33]:
print("For normalized data, 50 epochs, three hidden layers")
print("Mean of MSE: {}".format(MSE_Norm_three_layers_np.mean()))
print("Standard deviation of MSE: {}".format(MSE_Norm_three_layers_np.std()))

For normalized data, 50 epochs, three hidden layers
Mean of MSE: 129.1235809326172
Standard deviation of MSE: 14.69874382019043


Here we see a further MSE improvement (from about 163 to 129) while the STD stays about the same compared to the third NN, even though this network was trained for only 50 epochs.


Finally, a summary table:

|  | Non-normalized, 50ep | Normalized, 50ep | Normalized, 100ep | Normalized, 50ep, 3 layers |
|---|:---:|:---:|:---:|:---:|
|Mean of MSE|432.5|350.0|162.9|129.1|
|STD of MSE|477.4|102.3|14.3|14.7|

That's all for today =). Thank you for your attention.