<a id="item31"></a>

## Download and Clean Dataset

Let's start by importing the <em>pandas</em> and NumPy libraries.

In [24]:
import pandas as pd
import numpy as np


<strong>The dataset is about the compressive strength of different samples of concrete based on the amounts of the different ingredients that were used to make them. Ingredients include:</strong>

<strong>1. Cement</strong>

<strong>2. Blast Furnace Slag</strong>

<strong>3. Fly Ash</strong>

<strong>4. Water</strong>

<strong>5. Superplasticizer</strong>

<strong>6. Coarse Aggregate</strong>

<strong>7. Fine Aggregate</strong>

Let's download the data and read it into a <em>pandas</em> dataframe.

In [25]:
concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


So the first concrete sample contains 540 kg of cement, 0 kg of blast furnace slag, 0 kg of fly ash, 162 kg of water, 2.5 kg of superplasticizer, 1040 kg of coarse aggregate, and 676 kg of fine aggregate per cubic meter of mix. This mix, at 28 days old, has a compressive strength of 79.99 MPa.

#### Let's check how many data points we have.

In [26]:
concrete_data.shape

(1030, 9)

So there are 1,030 samples, each with 9 columns, to train our model on. With so few samples, we have to be careful not to overfit the training data.

Let's check the dataset for any missing values.

In [27]:
concrete_data.describe()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [28]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

The data looks very clean and is ready to be used to build our model.

#### Split data into predictors and target

The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns.

In [29]:
concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column

<a id="item2"></a>

Let's do a quick sanity check of the predictors and the target dataframes.

In [30]:
predictors.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [31]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

### Normalization of data

Finally, the last step is to normalize the data by subtracting the mean and dividing by the standard deviation.

In [32]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069
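As a quick check of the formula, the first normalized values can be reproduced by hand from the mean and standard deviation reported by `describe()` earlier. The snippet below is a minimal sketch that uses only those printed figures; the helper name `z_score` is made up for illustration.

```python
def z_score(value, mean, std):
    """Standardize one value: subtract the mean, divide by the standard deviation."""
    return (value - mean) / std

# Figures copied from the describe() output (Cement and Age columns).
cement_z = z_score(540.0, mean=281.167864, std=104.506364)
age_z = z_score(28, mean=45.662136, std=63.169912)

print(round(cement_z, 4))
print(round(age_z, 4))
```

Both values agree with the first row of `predictors_norm` (2.476712 and -0.279597) to the printed precision. Note that pandas' `std()` computes the sample standard deviation (`ddof=1`) by default, whereas NumPy's `np.std` defaults to `ddof=0`.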


Let's save the number of predictors to *n_cols* since we will need this number when building our network.

In [33]:
n_cols = predictors_norm.shape[1] # number of predictors
print(n_cols)

8


### Split the data into training and testing sets

Let's import `train_test_split` from `sklearn` and hold out 30% of the data for testing.

In [34]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.30, random_state=42)
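To make the split concrete, here is a rough NumPy sketch of what a shuffled 70/30 hold-out split does. It is not scikit-learn's implementation, and the helper `holdout_split` and the synthetic arrays are assumptions for illustration only.

```python
import numpy as np

def holdout_split(X, y, test_size=0.30, seed=42):
    """Shuffle row indices and hold out a fraction of them for testing."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(test_size * len(X)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Synthetic stand-in with the same shape as the concrete data: 1030 rows, 8 predictors.
X = np.arange(1030 * 8, dtype=float).reshape(1030, 8)
y = np.arange(1030, dtype=float)

X_tr, X_te, y_tr, y_te = holdout_split(X, y)
print(X_tr.shape, X_te.shape)
```

With 1030 rows, this leaves 721 rows for training and 309 for testing, and no row appears in both sets.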

<a id="item1"></a>

<a id='item32'></a>

## Import Keras

Recall from the videos that Keras normally runs on top of a low-level library such as TensorFlow. This means that to use the Keras library, you have to install TensorFlow first, and when you import Keras, it explicitly displays which backend it is using. In CC Labs, Keras was installed with TensorFlow as the backend, so importing Keras should clearly print that.

#### Let's go ahead and import the Keras library

In [35]:
import keras

As you can see, Keras is using the TensorFlow backend.

Let's import the rest of the packages from the Keras library that we will need to build our regression model.

In [36]:
from keras.models import Sequential
from keras.layers import Dense

<a id='item33'></a>

## Build a Neural Network

Let's define a function that defines our regression model for us so that we can conveniently call it to create our model.

In [37]:
# define regression model
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

The above function creates a model that has one hidden layer of 10 hidden units.
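To see what this architecture amounts to, the sketch below counts its trainable parameters and runs one forward pass in plain NumPy. The weights are random placeholders rather than trained Keras weights, and `forward` is a hypothetical helper, so only the shapes and the arithmetic are meaningful.

```python
import numpy as np

n_cols, n_hidden = 8, 10

# Each Dense layer has a weight matrix plus one bias per unit.
hidden_params = n_cols * n_hidden + n_hidden   # 8*10 + 10 = 90
output_params = n_hidden * 1 + 1               # 10*1 + 1  = 11
total_params = hidden_params + output_params
print(total_params)  # 101

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(n_cols, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(size=(n_hidden, 1)), np.zeros(1)

def forward(X):
    """Hidden layer with ReLU, then a linear output unit (no activation)."""
    hidden = np.maximum(0.0, X @ W1 + b1)
    return hidden @ W2 + b2

batch = rng.normal(size=(5, n_cols))   # five made-up normalized samples
print(forward(batch).shape)            # (5, 1)
```

The output layer has no activation, which is the usual choice for regression because the predicted strength should not be squashed into a fixed range.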

<a id="item4"></a>

<a id='item34'></a>

## Train and Test the Network

Let's call the function now to create our model.

In [38]:
# build the model
model = regression_model()

Next, we will train and validate the model at the same time using the *fit* method. We will leave out 30% of the data for validation and train the model for 50 epochs.

In [39]:
# fit the model
model.fit(predictors_norm, target, validation_split=0.3, epochs=50, verbose=2)

Train on 721 samples, validate on 309 samples
Epoch 1/50
 - 0s - loss: 1700.5557 - val_loss: 1215.6169
Epoch 2/50
 - 0s - loss: 1680.8144 - val_loss: 1202.7256
Epoch 3/50
 - 0s - loss: 1661.0924 - val_loss: 1189.5514
Epoch 4/50
 - 0s - loss: 1641.3187 - val_loss: 1176.2489
Epoch 5/50
 - 0s - loss: 1621.3524 - val_loss: 1162.1527
Epoch 6/50
 - 0s - loss: 1600.6521 - val_loss: 1148.1178
Epoch 7/50
 - 0s - loss: 1579.4603 - val_loss: 1133.0805
Epoch 8/50
 - 0s - loss: 1557.8403 - val_loss: 1117.3261
Epoch 9/50
 - 0s - loss: 1535.2629 - val_loss: 1101.1223
Epoch 10/50
 - 0s - loss: 1511.6676 - val_loss: 1084.4106
Epoch 11/50
 - 0s - loss: 1487.5150 - val_loss: 1066.7524
Epoch 12/50
 - 0s - loss: 1462.0483 - val_loss: 1048.6314
Epoch 13/50
 - 0s - loss: 1435.5078 - val_loss: 1029.4064
Epoch 14/50
 - 0s - loss: 1407.6669 - val_loss: 1009.9414
Epoch 15/50
 - 0s - loss: 1378.3228 - val_loss: 989.7160
Epoch 16/50
 - 0s - loss: 1348.0538 - val_loss: 968.1723
Epoch 17/50
 - 0s - loss: 1316.5779 -

<keras.callbacks.callbacks.History at 0x19ef9633e08>

In [40]:
model.evaluate(X_test,y_test, verbose=1)



328.37767695454716
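One caveat when reading this number: Keras' `validation_split` holds out the last fraction of the rows it is given, before any shuffling, so the `fit` call above updates the weights using the first 721 rows of `predictors_norm`, while `X_test` is a shuffled 30% sample of all rows. The sketch below uses a NumPy permutation as a stand-in for `train_test_split` (an assumption; the actual rows selected with `random_state=42` will differ) to show that the two sets of rows generally overlap, so this evaluation is not on fully unseen data.

```python
import numpy as np

n_samples = 1030
n_val = 309                                  # rows held out by validation_split=0.3
fit_rows = set(range(n_samples - n_val))     # rows 0..720 are used for weight updates

# Stand-in for a shuffled 30% test split (not sklearn's actual selection).
rng = np.random.default_rng(42)
test_rows = set(rng.permutation(n_samples)[:n_val].tolist())

overlap = fit_rows & test_rows
print(len(fit_rows), len(test_rows), len(overlap) > 0)
```

Fitting on `X_train` and `y_train` instead of the full `predictors_norm` would keep the test rows out of training.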

Now we need to compute the mean squared error between the predicted concrete strength and the actual concrete strength.

Let's import the mean_squared_error function from Scikit-learn.

In [41]:
from sklearn.metrics import mean_squared_error

In [42]:
y_pred = model.predict(X_test)

In [43]:
y_pred

array([[49.38291  ],
       [38.93037  ],
       [57.19381  ],
       [38.453533 ],
       [ 7.839702 ],
       [12.464439 ],
       [ 2.9849153],
       [39.76614  ],
       [26.426874 ],
       [27.859156 ],
       [19.319538 ],
       [13.66063  ],
       [72.288925 ],
       [41.824665 ],
       [17.174267 ],
       [38.389065 ],
       [ 3.7290978],
       [24.686314 ],
       [13.00355  ],
       [ 9.277476 ],
       [27.36717  ],
       [20.395334 ],
       [21.83707  ],
       [24.403585 ],
       [19.443357 ],
       [29.765253 ],
       [ 4.553833 ],
       [28.594213 ],
       [29.866602 ],
       [14.838119 ],
       [26.20355  ],
       [21.78827  ],
       [40.47356  ],
       [36.303864 ],
       [ 8.174399 ],
       [38.243214 ],
       [24.961575 ],
       [12.932574 ],
       [13.093652 ],
       [18.121466 ],
       [ 9.968422 ],
       [ 8.381921 ],
       [18.232012 ],
       [34.080093 ],
       [ 5.132879 ],
       [43.364895 ],
       [33.491272 ],
       [48.29

In [44]:
mse = mean_squared_error(y_test, y_pred)
print('mse: ', mse)

mse:  328.37768135478797
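The same quantity can be computed directly with NumPy, which makes the definition explicit. One detail to watch, judging from the prediction output above: `model.predict` returns a column of shape `(n, 1)`, while the targets are one-dimensional, so subtracting them without flattening would broadcast to an `(n, n)` matrix. The helper `mse` and the numbers below are made up for illustration.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean of squared differences, after flattening both inputs to 1-D."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

y_true = np.array([79.99, 61.89, 40.27])          # shape (3,)
y_pred = np.array([[75.0], [60.0], [45.0]])       # shape (3, 1), like predict()

print(mse(y_true, y_pred))
print((y_true - y_pred).shape)   # (3, 3): the broadcasting pitfall
```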


Let's create a list of 50 mean squared errors and report the mean and the standard deviation of those errors.

In [45]:
n = 50
epochs = 50
mean_squared_errors = []
for i in range(0, n):
    # random_state is fixed, so every iteration uses the same split
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.30, random_state=42)
    # the existing model keeps training for another 50 epochs on each pass
    model.fit(X_train, y_train, epochs=epochs, verbose=0)
    mse = model.evaluate(X_test, y_test, verbose=0)
    print("mse " +str(i+1)+": ", mse)
    # recompute the test MSE with scikit-learn and store it
    y_pred = model.predict(X_test)
    mean_square_error = mean_squared_error(y_test, y_pred)
    mean_squared_errors.append(mean_square_error)

mean_squared_errors = np.array(mean_squared_errors)
mean = np.mean(mean_squared_errors)
standard_deviation = np.std(mean_squared_errors)

print("Mean : ", mean)
print("Standard Deviation : ", standard_deviation)

mse 1:  164.46953728438194
mse 2:  121.24517219815054
mse 3:  96.54527806155504
mse 4:  82.33806375546749
mse 5:  73.97369036628204
mse 6:  66.10248117539489
mse 7:  60.7643805939017
mse 8:  57.58464881131564
mse 9:  55.15423875950687
mse 10:  53.375614141569166
mse 11:  52.19638630024438
mse 12:  50.899725756598905
mse 13:  49.95902846006128
mse 14:  49.1480870663541
mse 15:  48.666696653396954
mse 16:  47.844852879595216
mse 17:  47.44947804447902
mse 18:  46.962125846097386
mse 19:  46.30014160304393
mse 20:  46.02184430218052
mse 21:  45.52631607796382
mse 22:  44.998556637069555
mse 23:  45.012979921013795
mse 24:  44.50098265490486
mse 25:  44.369232529575385
mse 26:  43.768618784290304
mse 27:  44.10201164665346
mse 28:  43.392383976661655
mse 29:  43.131207771671626
mse 30:  43.24155322942148
mse 31:  42.96665382385254
mse 32:  42.68570678365269
mse 33:  42.83995664775564
mse 34:  42.41968880193519
mse 35:  42.522685560207925
mse 36:  42.21399144293035
mse 37:  41.9981929384003

In [46]:
print('Below are the mean and standard deviation of 50 mean squared errors of normalized data for 50 epochs')
print("Mean : ", mean)
print("Standard Deviation : ", standard_deviation)

Below are the mean and standard deviation of 50 mean squared errors of normalized data for 50 epochs
Mean :  51.93660247771346
Standard Deviation :  21.89570648727471


In [47]:
print('mean squared errors: ', mean_squared_errors)

mean squared errors:  [164.46953785 121.24517283  96.54528113  82.33806476  73.97369389
  66.10248412  60.76438228  57.58465078  55.15423876  53.37561495
  52.19638731  50.89972478  49.9590295   49.14808741  48.66669771
  47.84485323  47.44947834  46.96212586  46.30014216  46.02184582
  45.52631687  44.99855815  45.01297975  44.5009832   44.36923244
  43.7686196   44.10201291  43.39238546  43.131209    43.24155476
  42.96665412  42.68570726  42.83995743  42.41968967  42.52268659
  42.21399283  41.9981938   41.97906606  42.06003327  41.79355226
  41.64241226  41.80375425  41.38662662  41.71370548  41.40836063
  41.45778797  41.327379    41.39297389  41.13089546  41.04134939]
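The errors above fall steadily, which is consistent with the loop continuing to train the same `model` object on the same split (`random_state=42`), so the 50 values are not independent runs. If independent repetitions are wanted, one option is to draw a new split and build a fresh model on every iteration. The sketch below shows that pattern with a least-squares linear model on synthetic data as a stand-in for the Keras network; `repeat_experiment`, `fit_linear`, `predict_linear`, and the data are all assumptions for illustration.

```python
import numpy as np

def fit_linear(X, y):
    """Stand-in 'model': ordinary least squares with an intercept term."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    """Apply the fitted coefficients, including the intercept."""
    return np.column_stack([X, np.ones(len(X))]) @ coef

def repeat_experiment(X, y, n_runs=50, test_size=0.30):
    """Each run: new shuffled split, freshly fitted model, test-set MSE."""
    errors = []
    n_test = int(round(test_size * len(X)))
    for run in range(n_runs):
        rng = np.random.default_rng(run)                # a different split per run
        idx = rng.permutation(len(X))
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        coef = fit_linear(X[train_idx], y[train_idx])   # fresh fit, no carry-over
        residual = y[test_idx] - predict_linear(coef, X[test_idx])
        errors.append(float(np.mean(residual ** 2)))
    return np.array(errors)

# Synthetic data: 1030 samples, 8 predictors, noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(1030, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=0.5, size=1030)

errors = repeat_experiment(X, y)
print(len(errors), errors.mean(), errors.std())
```

With Keras, the same pattern would call `regression_model()` inside the loop and pass a different `random_state` to `train_test_split` on each iteration.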


The mean of the mean squared errors with the normalized `predictors_norm` data in part B is about 51.9, while it was 54 in part A, where the unnormalized `predictors` data was used for training.