<h1 align=center><font size = 5>Regression Models with Keras</font></h1>


<h3>Objective for this Notebook</h3>
<h5> 1. How to use the Keras library to build a regression model.</h5>
<h5> 2. Download and Clean dataset </h5>
<h5> 3. Build a Neural Network </h5>
<h5> 4. Train and Test the Network. </h5>     



## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>
    
1. <a href="#item31">Download and Clean Dataset</a>  
2. <a href="#item32">Import Keras</a>  
3. <a href="#item33">Build a Neural Network</a>  
4. <a href="#item34">Train and Test the Network</a>  

</font>
</div>


We will be playing around with the same dataset that we used in the videos.

<strong>The dataset is about the compressive strength of different samples of concrete based on the volumes of the different ingredients that were used to make them. Ingredients include:</strong>

<strong>1. Cement</strong>

<strong>2. Blast Furnace Slag</strong>

<strong>3. Fly Ash</strong>

<strong>4. Water</strong>

<strong>5. Superplasticizer</strong>

<strong>6. Coarse Aggregate</strong>

<strong>7. Fine Aggregate</strong>


In [1]:
import pandas as pd
import numpy as np

In [2]:
concrete_data = pd.read_csv('https://cocl.us/concrete_data')
concrete_data.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.3 |


In [3]:
concrete_data.shape

(1030, 9)

In [4]:
concrete_data.describe()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
|---|---|---|---|---|---|---|---|---|---|
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | 281.167864 | 73.895825 | 54.18835 | 181.567282 | 6.20466 | 972.918932 | 773.580485 | 45.662136 | 35.817961 |
| std | 104.506364 | 86.279342 | 63.997004 | 21.354219 | 5.973841 | 77.753954 | 80.17598 | 63.169912 | 16.705742 |
| min | 102.0 | 0.0 | 0.0 | 121.8 | 0.0 | 801.0 | 594.0 | 1.0 | 2.33 |
| 25% | 192.375 | 0.0 | 0.0 | 164.9 | 0.0 | 932.0 | 730.95 | 7.0 | 23.71 |
| 50% | 272.9 | 22.0 | 0.0 | 185.0 | 6.4 | 968.0 | 779.5 | 28.0 | 34.445 |
| 75% | 350.0 | 142.95 | 118.3 | 192.0 | 10.2 | 1029.4 | 824.0 | 56.0 | 46.135 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.6 |


In [5]:
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

#### Split data into predictors and target

The target variable in this problem is the concrete sample strength. Therefore, our predictors will be all the other columns.


In [6]:
concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column

In [7]:
predictors.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 |


In [8]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

The last step is to normalize the data by subtracting each column's mean and dividing by its standard deviation.


In [9]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 0.862735 | -1.217079 | -0.279597 |
| 1 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 1.055651 | -1.217079 | -0.279597 |
| 2 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 3.55134 |
| 3 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 5.055221 |
| 4 | -0.790075 | 0.678079 | -0.846733 | 0.488555 | -1.038638 | 0.070492 | 0.647569 | 4.976069 |
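
As a quick sanity check on the standardization formula, the sketch below applies it to a small made-up array with NumPy. The array values are hypothetical and do not come from the concrete dataset. One detail worth noting: pandas' `std()` uses the sample standard deviation (`ddof=1`) by default, whereas NumPy's `std()` defaults to `ddof=0`, so `ddof=1` is passed explicitly here to mirror the pandas call.

```python
import numpy as np

# Hypothetical toy data: 4 samples, 2 features (not from the concrete dataset)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

# Standardize each column: subtract the column mean, divide by the column std.
# ddof=1 matches the default of pandas DataFrame.std().
X_norm = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# After standardization each column has mean ~0 and sample std ~1
col_means = X_norm.mean(axis=0)
col_stds = X_norm.std(axis=0, ddof=1)
print(col_means, col_stds)
```

Both columns end up with identical standardized values even though their original scales differ by a factor of ten, which is the point of the transformation.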


Let's save the number of predictors to *n_cols* since we will need this number when building our network.


In [10]:
n_cols = predictors_norm.shape[1] # number of predictors

## Import Keras


In [12]:
import keras
from keras.models import Sequential
from keras.layers import Dense

## Build a Neural Network

Use the Keras library to build a neural network with the following:

- One hidden layer of 10 nodes with a ReLU activation function

- The Adam optimizer, with mean squared error as the loss function

In [13]:
# define regression model
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
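
To get a feel for the size of this network, the following sketch counts the trainable weights and biases of a stack of fully connected layers. The helper name `dense_param_count` is hypothetical, and the input width of 8 is taken from the eight predictor columns shown earlier.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases in a stack of fully connected layers.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per unit
    return total

# 8 predictors -> 10 hidden units -> 1 output
n_params = dense_param_count([8, 10, 1])
print(n_params)  # 101
```

The hidden layer contributes 8 × 10 + 10 = 90 parameters and the output layer 10 × 1 + 1 = 11, so the model is small relative to the 1030 samples available.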

1. Randomly split the data into training and test sets, holding out 30% of the data for testing. You can use the `train_test_split` helper function from Scikit-learn.

2. Train the model on the training data using 50 epochs.

3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.

4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

5. Report the mean and the standard deviation of the mean squared errors.

In [17]:
# Allows us to split our data into training and testing data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# List to store mean squared errors
mse_list = []

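# Repeat 50 times, using a different random split on each repeat
for i in range(50):
    # Split the data (the seed changes per repeat so the split is not identical every time)
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3, random_state=i)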
    
    # Build and train the model
    model = regression_model()
    model.fit(X_train, y_train, epochs=50, verbose=0)
    
    # Predict on test data
    y_pred = model.predict(X_test)
    
    # Compute mean squared error
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

# Calculate mean and standard deviation of MSEs
mse_mean = np.mean(mse_list)
mse_std = np.std(mse_list)

print(f'Mean of MSEs: {mse_mean}')
print(f'Standard Deviation of MSEs: {mse_std}')

Mean of MSEs: 257.2031039238549
Standard Deviation of MSEs: 269.2203735167619
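
The split, train, evaluate, repeat procedure can also be wrapped in a reusable helper so that later parts only change the data or the model. The sketch below is one possible shape for such a helper, written with NumPy only. The names `repeated_holdout_mse` and `least_squares` are hypothetical, the data is synthetic, and ordinary least squares is used as a stand-in model so the example runs without Keras.

```python
import numpy as np

def repeated_holdout_mse(fit_predict, X, y, n_repeats=50, test_size=0.3):
    """Return one test MSE per random train/test split.

    fit_predict(X_train, y_train, X_test) must return predictions for X_test.
    """
    n_test = int(round(test_size * len(X)))
    mse_list = []
    for seed in range(n_repeats):
        rng = np.random.default_rng(seed)  # a different split on every repeat
        idx = rng.permutation(len(X))
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        y_pred = fit_predict(X[train_idx], y[train_idx], X[test_idx])
        mse_list.append(np.mean((y[test_idx] - np.ravel(y_pred)) ** 2))
    return np.array(mse_list)

def least_squares(X_train, y_train, X_test):
    """Stand-in model: ordinary least squares with an intercept."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])
    A_test = np.column_stack([X_test, np.ones(len(X_test))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef

# Synthetic data, used only to exercise the helper
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

mses = repeated_holdout_mse(least_squares, X_demo, y_demo, n_repeats=10)
print(mses.mean(), mses.std())
```

With the Keras model, `fit_predict` could be a small function that calls `regression_model()`, fits it on the training arrays, and returns `model.predict(X_test)`; the `np.ravel` call flattens the `(n, 1)` predictions that Keras returns.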


**B. Normalize the data (5 marks)**

Repeat Part A but use a normalized version of the data. Recall that one way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.

*How does the mean of the mean squared errors compare to that from Step A?*

In [18]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

| | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
|---|---|---|---|---|---|---|---|---|
| 0 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 0.862735 | -1.217079 | -0.279597 |
| 1 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 1.055651 | -1.217079 | -0.279597 |
| 2 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 3.55134 |
| 3 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 5.055221 |
| 4 | -0.790075 | 0.678079 | -0.846733 | 0.488555 | -1.038638 | 0.070492 | 0.647569 | 4.976069 |
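
The normalization here uses the mean and standard deviation of all rows, so the rows that later land in the test set also contribute to those statistics. A common alternative, shown as an assumption and not as what this notebook does, is to compute the statistics on the training rows only and reuse them for the test rows. The arrays below are synthetic and the function name `standardize_with_train_stats` is hypothetical.

```python
import numpy as np

def standardize_with_train_stats(X_train, X_test):
    """Scale both splits using the mean and std of the training split only."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)  # sample std, like pandas' default
    return (X_train - mean) / std, (X_test - mean) / std

# Synthetic stand-in for a 70/30 split (not the concrete data)
rng = np.random.default_rng(42)
X = rng.normal(loc=100.0, scale=15.0, size=(100, 3))
X_train, X_test = X[:70], X[70:]

X_train_norm, X_test_norm = standardize_with_train_stats(X_train, X_test)

# Training columns are standardized up to floating-point error;
# test columns are only approximately standardized.
print(X_train_norm.mean(axis=0), X_test_norm.mean(axis=0))
```

Whether this matters in practice depends on the dataset; with around a thousand rows the two approaches will often give similar statistics.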


In [19]:
# Allows us to split our data into training and testing data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# List to store mean squared errors
mse_list = []

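# Repeat 50 times, using a different random split on each repeat
for i in range(50):
    # Split the data (the seed changes per repeat so the split is not identical every time)
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=i)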
    
    # Build and train the model
    model = regression_model()
    model.fit(X_train, y_train, epochs=50, verbose=0)
    
    # Predict on test data
    y_pred = model.predict(X_test)
    
    # Compute mean squared error
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

# Calculate mean and standard deviation of MSEs
mse_mean = np.mean(mse_list)
mse_std = np.std(mse_list)

print(f'Mean of MSEs: {mse_mean}')
print(f'Standard Deviation of MSEs: {mse_std}')

Mean of MSEs: 368.30119844419875
Standard Deviation of MSEs: 91.5797703192515


**C. Increase the number of epochs (5 marks)**

Repeat Part B but use 100 epochs this time for training.

How does the mean of the mean squared errors compare to that from Step B?

In [21]:
# Allows us to split our data into training and testing data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# List to store mean squared errors
mse_list = []

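# Repeat 50 times, using a different random split on each repeat
for i in range(50):
    # Split the data (the seed changes per repeat so the split is not identical every time)
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=i)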
    
    # Build and train the model
    model = regression_model()
    model.fit(X_train, y_train, epochs=100, verbose=0)
    
    # Predict on test data
    y_pred = model.predict(X_test)
    
    # Compute mean squared error
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

# Calculate mean and standard deviation of MSEs
mse_mean = np.mean(mse_list)
mse_std = np.std(mse_list)

print(f'Mean of MSEs: {mse_mean}')
print(f'Standard Deviation of MSEs: {mse_std}')

Mean of MSEs: 166.68028451859308
Standard Deviation of MSEs: 20.54142281419429


**D. Increase the number of hidden layers (5 marks)**

Repeat Part B but use a neural network with the following instead:

- Three hidden layers, each with 10 nodes and a ReLU activation function.

How does the mean of the mean squared errors compare to that from Step B?

In [22]:
# define regression model
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

In [23]:
# Allows us to split our data into training and testing data
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
# List to store mean squared errors
mse_list = []

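# Repeat 50 times, using a different random split on each repeat
for i in range(50):
    # Split the data (the seed changes per repeat so the split is not identical every time)
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=i)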
    
    # Build and train the model
    model = regression_model()
    model.fit(X_train, y_train, epochs=50, verbose=0)
    
    # Predict on test data
    y_pred = model.predict(X_test)
    
    # Compute mean squared error
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

# Calculate mean and standard deviation of MSEs
mse_mean = np.mean(mse_list)
mse_std = np.std(mse_list)

print(f'Mean of MSEs: {mse_mean}')
print(f'Standard Deviation of MSEs: {mse_std}')

Mean of MSEs: 131.62623038594035
Standard Deviation of MSEs: 11.047130002438214
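
The four reported means can be put side by side to answer the comparison questions. The sketch below only reuses the numbers printed in this notebook, rounded to two decimals; the dictionary labels and the helper `pct_change` are names chosen here for readability.

```python
# Mean test MSE reported in this notebook for each part (rounded to two decimals)
reported_mean_mse = {
    "A: raw data, 1 hidden layer, 50 epochs": 257.20,
    "B: normalized, 1 hidden layer, 50 epochs": 368.30,
    "C: normalized, 1 hidden layer, 100 epochs": 166.68,
    "D: normalized, 3 hidden layers, 50 epochs": 131.63,
}

def pct_change(new, old):
    """Relative change of new with respect to old, in percent."""
    return 100.0 * (new - old) / old

values = list(reported_mean_mse.values())
b_vs_a = pct_change(values[1], values[0])
c_vs_b = pct_change(values[2], values[1])
d_vs_b = pct_change(values[3], values[1])

print(f"B vs A: {b_vs_a:+.1f}%")
print(f"C vs B: {c_vs_b:+.1f}%")
print(f"D vs B: {d_vs_b:+.1f}%")
```

In this run, normalization alone gave a higher mean MSE than Part A, while doubling the epochs or adding two hidden layers each lowered the mean MSE relative to Part B. Because training is stochastic and Part A's standard deviation is larger than its mean, a rerun may shift these figures.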
