# 1.0 Assignment Instructions
- Use Markdown to clearly label your code for each part.

- Comment your code properly so that the peer grading your work can understand it easily.

- Include your comments and a discussion of the differences in the mean of the mean squared errors among the different parts.



## 1.A Build a baseline model

- NOTE:
- The model definition keeps n_cols = predictors.shape[1] in steps B, C and D instead of switching to "n_cols = predictors_norm.shape[1]".
  - Normalization does not add or remove columns, so both expressions give the same value:
    - n_cols = predictors.shape[1] = predictors_norm.shape[1]


### 1.A.1 Downloading the data, reading and checking it.
- Includes:
  - Downloading the data
  - Reading first 5 rows with .head() method
  - Checking the number of data points with the .shape attribute
  - Using the .describe() method to get summary statistics for each column
  - Using .isnull().sum() to count missing values per column; rows with missing values would have to be removed before training, otherwise they cause problems

In [2]:
import pandas as pd

# download the data and save
filepath = "https://cocl.us/concrete_data"
concrete_data = pd.read_csv(filepath)

# read the first 5 rows of the data
concrete_data.head()

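|  | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 | 79.99 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 | 61.89 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 | 40.27 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 | 41.05 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 | 44.3 |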


In [3]:
# check the number of data points

concrete_data.shape

# about 1000 samples: a small dataset for deep learning,
# so we need to be careful not to overfit

(1030, 9)

In [4]:
concrete_data.describe()

|  | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age | Strength |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 | 1030.0 |
| mean | 281.167864 | 73.895825 | 54.18835 | 181.567282 | 6.20466 | 972.918932 | 773.580485 | 45.662136 | 35.817961 |
| std | 104.506364 | 86.279342 | 63.997004 | 21.354219 | 5.973841 | 77.753954 | 80.17598 | 63.169912 | 16.705742 |
| min | 102.0 | 0.0 | 0.0 | 121.8 | 0.0 | 801.0 | 594.0 | 1.0 | 2.33 |
| 25% | 192.375 | 0.0 | 0.0 | 164.9 | 0.0 | 932.0 | 730.95 | 7.0 | 23.71 |
| 50% | 272.9 | 22.0 | 0.0 | 185.0 | 6.4 | 968.0 | 779.5 | 28.0 | 34.445 |
| 75% | 350.0 | 142.95 | 118.3 | 192.0 | 10.2 | 1029.4 | 824.0 | 56.0 | 46.135 |
| max | 540.0 | 359.4 | 200.1 | 247.0 | 32.2 | 1145.0 | 992.6 | 365.0 | 82.6 |


In [5]:
# checking for NA vals
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

### 1.A.2 Modifying the data to suit it to our needs
- Splitting the data into "predictors" and "target".
- Then doing sanity checks, to make sure it worked well.
- Assigning the number of predictor columns to the n_cols variable.

In [6]:
# splitting the data into predictors & target as required

concrete_data_columns = concrete_data.columns

predictors = concrete_data[
    concrete_data_columns[concrete_data_columns != "Strength"]
]  # all columns except Strength

target = concrete_data["Strength"]  # Strength column

In [7]:
# some sanity checks
predictors.head()

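|  | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1040.0 | 676.0 | 28 |
| 1 | 540.0 | 0.0 | 0.0 | 162.0 | 2.5 | 1055.0 | 676.0 | 28 |
| 2 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 270 |
| 3 | 332.5 | 142.5 | 0.0 | 228.0 | 0.0 | 932.0 | 594.0 | 365 |
| 4 | 198.6 | 132.4 | 0.0 | 192.0 | 0.0 | 978.4 | 825.5 | 360 |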


In [8]:
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

In [None]:
# Assigning n_cols int variable since it will be useful in the Keras code.

n_cols = predictors.shape[1]  # number of predictors



### 1.A.3 Building a neural network using Keras

- Importing TensorFlow and the required Keras classes
- Defining the regression model
- Building the regression model


In [13]:
# importing Keras packages

import tensorflow as tf
from tensorflow import keras

from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Input

In [16]:
# define regression model


def regression_model():
    # create model
    model = Sequential()
    model.add(Input(shape=(n_cols,)))  # input layer
    model.add(Dense(10, activation="relu"))  # hidden layer
    model.add(Dense(1))  # output layer

    # compile model
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model

In [17]:
# build the model

model = regression_model()

### 1.A.4 Randomly splitting, training, evaluating and reporting

- Randomly splitting the data into training and test sets, holding out 30% of the data for testing
  - Using the train_test_split helper function from Scikit-learn for this purpose
- Training the regression model for 50 epochs
- Evaluating the regression model on the test data
- Computing the mean squared error between the predicted and the actual concrete strength
  - using the mean_squared_error function from Scikit-learn
- Repeating the split, train and evaluate steps 50 times
  - i.e. creating a list of 50 mean squared errors
- Reporting the mean & std of the mean squared errors
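
The protocol in the list above can be sketched without Keras or Scikit-learn. The following is a minimal illustration only, not the notebook's actual code: the toy target values, the `split_indices` helper and the "predict the training mean" stand-in model are all assumptions made for the example.

```python
import random
import statistics


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def split_indices(n_samples, test_fraction, seed):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)


# Hypothetical toy targets standing in for the Strength column.
toy_rng = random.Random(0)
y = [toy_rng.uniform(2.0, 80.0) for _ in range(100)]

mse_list = []
for seed in range(50):
    # A different seed per repetition gives a different random split.
    train_idx, test_idx = split_indices(len(y), test_fraction=0.3, seed=seed)

    # Stand-in "model": always predict the mean of the training targets.
    prediction = statistics.fmean([y[i] for i in train_idx])

    y_test = [y[i] for i in test_idx]
    mse_list.append(mse(y_test, [prediction] * len(y_test)))

print(f"Mean of MSEs: {statistics.fmean(mse_list):.2f}")
print(f"Std of MSEs:  {statistics.pstdev(mse_list):.2f}")
```

Using the loop counter as the seed keeps every repetition reproducible while still producing 50 different splits, which mirrors the random_state=i argument used in this notebook.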

In [21]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# placeholder empty array for mean squared errors (mse)
mse_list = []

# Repeat the process 50 times
for i in range(50):
    # Step A.1: Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        predictors, target, test_size=0.3, random_state=i
    )

    # Step A.2: Build and train the model
    model = regression_model()
    model.fit(
        X_train, y_train, epochs=50, verbose=1, batch_size=10
    )  # Train with 50 epochs

    # Step A.3: Evaluate the model and compute the mean squared error
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

Epoch 1/50 ... Epoch 50/50 (repeated for each of the 50 runs; output truncated)

In [24]:
# A.5 Report the mean and the standard deviation of the mean squared errors.

import numpy as np

# Calculating the mean and standard deviation of the 50 mean squared errors (mse)
mean_mse = np.mean(mse_list)
std_mse = np.std(mse_list)

# Outputting the results
print(f"Mean of MSEs: {mean_mse}")
print(f"Standard Deviation of MSEs: {std_mse}")

Mean of MSEs: 122.60948849852208
Standard Deviation of MSEs: 37.72493204603877


#### 1.A.4 Results

- Mean of MSEs: 122.60948849852208
- Standard Deviation of MSEs: 37.72493204603877
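
One detail worth knowing when reading the standard deviation: np.std divides by n by default (ddof=0, the population formula), while the sample formula divides by n - 1. The sketch below shows the difference with the standard library; the five MSE values are made up for illustration and are not results from this notebook.

```python
import statistics

# Hypothetical MSE values from five runs, for illustration only.
mse_values = [110.0, 95.0, 150.0, 130.0, 120.0]

# Divides by n, matching the default behaviour of np.std (ddof=0).
population_std = statistics.pstdev(mse_values)

# Divides by n - 1, matching np.std(..., ddof=1).
sample_std = statistics.stdev(mse_values)

print(f"population std: {population_std:.3f}")
print(f"sample std:     {sample_std:.3f}")
```

With 50 runs the two values differ by only about 1%, so the choice does not change the conclusions here, but it should be stated when results are compared across tools.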


## 1.B Normalize and repeat steps in 1.A


In [28]:
# normalize the data

predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

|  | Cement | Blast Furnace Slag | Fly Ash | Water | Superplasticizer | Coarse Aggregate | Fine Aggregate | Age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 0.862735 | -1.217079 | -0.279597 |
| 1 | 2.476712 | -0.856472 | -0.846733 | -0.916319 | -0.620147 | 1.055651 | -1.217079 | -0.279597 |
| 2 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 3.55134 |
| 3 | 0.491187 | 0.79514 | -0.846733 | 2.174405 | -1.038638 | -0.526262 | -2.239829 | 5.055221 |
| 4 | -0.790075 | 0.678079 | -0.846733 | 0.488555 | -1.038638 | 0.070492 | 0.647569 | 4.976069 |
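
The normalization above computes the mean and standard deviation on the full dataset before it is split. A common variant, shown here only as a sketch and not as what this notebook does, fits the statistics on the training rows and reuses them for the test rows, so no information from the test set reaches the model. The helper names and the sample values are assumptions for illustration; the sample standard deviation (n - 1) is used to match the default of pandas .std().

```python
import statistics


def fit_standardizer(column):
    """Return the mean and sample standard deviation (n - 1) of one column."""
    return statistics.fmean(column), statistics.stdev(column)


def standardize(column, mean, std):
    """Apply z-score scaling with previously computed statistics."""
    return [(value - mean) / std for value in column]


# Hypothetical values for one predictor, already split into train and test.
cement_train = [540.0, 332.5, 198.6, 266.0, 380.0]
cement_test = [540.0, 475.0]

mean, std = fit_standardizer(cement_train)
train_norm = standardize(cement_train, mean, std)
test_norm = standardize(cement_test, mean, std)  # reuses the training statistics

print([round(v, 3) for v in train_norm])
print([round(v, 3) for v in test_norm])
```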


In [25]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# placeholder empty array for mean squared errors (mse)
mse_list = []

# Repeat the process 50 times
for i in range(50):
    # Step A.1: Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        predictors_norm, target, test_size=0.3, random_state=i
    )

    # Step A.2: Build and train the model
    model = regression_model()
    model.fit(
        X_train, y_train, epochs=50, verbose=1, batch_size=32
    )  # Train with 50 epochs

    # Step A.3: Evaluate the model and compute the mean squared error
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

Epoch 1/50 ... Epoch 50/50 (repeated for each of the 50 runs; output truncated)

In [26]:
# Step A.5 Report the mean and the standard deviation of the mean squared errors.

import numpy as np

# Calculating the mean and standard deviation of the 50 mean squared errors (mse)
mean_mse = np.mean(mse_list)
std_mse = np.std(mse_list)

# Outputting the results
print(f"Mean of MSEs: {mean_mse}")
print(f"Standard Deviation of MSEs: {std_mse}")


Mean of MSEs: 402.47106682268503
Standard Deviation of MSEs: 105.92906463304185


### 1.B.1 Results

- Mean of MSEs: 402.47106682268503
- Standard Deviation of MSEs: 105.92906463304185

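- Change from 1.A:
  - Mean MSE : 122.609 to 402.471 (+228.3%)
  - std MSE : 37.725 to 105.929 (+180.8%)
- Note: the batch size also changed between the two runs (10 in 1.A, 32 in 1.B), so the increase cannot be attributed to normalization alone.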
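
The batch_size passed to model.fit differs between the parts of this notebook (10, 32, 64 and 64 for A, B, C and D), which changes how many weight updates each model receives. The arithmetic below is a rough sketch, assuming 309 of the 1030 rows are held out for testing and that each mini-batch, including the final partial one, triggers one update.

```python
import math

n_total = 1030                        # rows in the dataset (from concrete_data.shape)
n_test = math.ceil(n_total * 3 / 10)  # 30% held out for testing (assumed rounding)
n_train = n_total - n_test

# (epochs, batch_size) as passed to model.fit in each part of this notebook
settings = {"A": (50, 10), "B": (50, 32), "C": (100, 64), "D": (50, 64)}

updates = {}
for part, (epochs, batch_size) in settings.items():
    # The last, smaller batch still counts as one step.
    steps_per_epoch = math.ceil(n_train / batch_size)
    updates[part] = steps_per_epoch * epochs
    print(f"Part {part}: {steps_per_epoch} steps/epoch x {epochs} epochs = {updates[part]} updates")
```

Under these assumptions part A performs roughly three times as many updates as part B, which is one plausible reason the normalized model in B has not converged as far after 50 epochs.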


In [35]:
# percentage change from the part A results to the part B results

m_a = 122.609
m_b = 402.471
s_a = 37.725
s_b = 105.929

chg_mab = (m_b - m_a) / m_a * 100
chg_sab = (s_b - s_a) / s_a * 100

print(chg_mab, chg_sab)

228.2556745426519 180.79257786613653
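
Since the same percentage calculation is repeated for each part, it could be wrapped in a small helper. This is a hypothetical refactoring (the `pct_change` name is not from the original notebook), using the rounded values reported above.

```python
def pct_change(old, new):
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


# Rounded results reported above for parts A and B.
mean_a, mean_b = 122.609, 402.471
std_a, std_b = 37.725, 105.929

print(f"Mean MSE change: {pct_change(mean_a, mean_b):+.1f}%")  # +228.3%
print(f"Std MSE change:  {pct_change(std_a, std_b):+.1f}%")    # +180.8%
```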


## 1.C Repeat 1.B, use 100 epochs this time for training

In [29]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# placeholder empty array for mean squared errors (mse)
mse_list = []

# Repeat the process 50 times
for i in range(50):
    # Step A.1: Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        predictors_norm, target, test_size=0.3, random_state=i
    )

    # Step A.2: Build and train the model
    model = regression_model()
    model.fit(
        X_train, y_train, epochs=100, verbose=1, batch_size=64
    )  # Train with 100 epochs

    # Step A.3: Evaluate the model and compute the mean squared error
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

Epoch 1/100 ... Epoch 100/100 (repeated for each of the 50 runs; output truncated)

In [30]:
# Step A.5 Report the mean and the standard deviation of the mean squared errors.

import numpy as np

# Calculating the mean and standard deviation of the 50 mean squared errors (mse)
mean_mse = np.mean(mse_list)
std_mse = np.std(mse_list)

# Outputting the results
print(f"Mean of MSEs: {mean_mse}")
print(f"Standard Deviation of MSEs: {std_mse}")

Mean of MSEs: 316.1958819518998
Standard Deviation of MSEs: 105.15059486202671


### 1.C.1 Results
- Mean of MSEs: 316.1958819518998
- Standard Deviation of MSEs: 105.15059486202671

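- Change from 1.B:
  - Mean MSE : 402.471 to 316.196 (-21.4%)
  - std MSE : 105.929 to 105.151 (-0.7%)
- Note: the batch size changed again (32 in 1.B, 64 in 1.C), so this is not a pure comparison of 50 vs. 100 epochs.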

In [36]:
m_b = 402.471
m_c = 316.196
s_b = 105.929
s_c = 105.151

chg_mbc = (m_c - m_b) / m_b * 100
chg_sbc = (s_c - s_b) / s_b * 100

print(chg_mbc, chg_sbc)

-21.436327089405193 -0.7344542098953127


## 1.D Increase the number of hidden layers

Repeat part B but use a neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.

How does the mean of the mean squared errors compare to that from Step B?

In [31]:
# define regression model


def regression_model():
    # create model
    model = Sequential()
    model.add(Input(shape=(n_cols,)))  # input layer
    model.add(Dense(10, activation="relu"))  # hidden layer
    model.add(Dense(10, activation="relu"))  # hidden layer
    model.add(Dense(10, activation="relu"))  # hidden layer
    model.add(Dense(1))  # output layer

    # compile model
    model.compile(optimizer="adam", loss="mean_squared_error")
    return model
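
To get a feel for how much larger this network is than the baseline, the trainable parameters of a stack of Dense layers can be counted by hand: each layer has inputs x units weights plus one bias per unit. The helper below is an illustrative sketch (its name is an assumption), using n_cols = 8 for the eight predictor columns.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Count weights and biases in a stack of fully connected layers."""
    total = 0
    previous = n_inputs
    for units in layer_sizes:
        total += previous * units + units  # weight matrix plus one bias per unit
        previous = units
    return total


n_cols = 8  # number of predictor columns in this dataset

baseline = dense_param_count(n_cols, [10, 1])        # one hidden layer (parts A-C)
deeper = dense_param_count(n_cols, [10, 10, 10, 1])  # three hidden layers (part D)

print(f"baseline: {baseline} parameters")  # 101
print(f"deeper:   {deeper} parameters")    # 321
```

The deeper model has roughly three times as many parameters, which is still small relative to about 700 training rows.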

In [32]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# placeholder empty array for mean squared errors (mse)
mse_list = []

# Repeat the process 50 times
for i in range(50):
    # Step A.1: Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(
        predictors_norm, target, test_size=0.3, random_state=i
    )

    # Step A.2: Build and train the model
    model = regression_model()
    model.fit(
        X_train, y_train, epochs=50, verbose=1, batch_size=64
    )  # Train with 50 epochs

    # Step A.3: Evaluate the model and compute the mean squared error
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)

Epoch 1/50 ... Epoch 50/50 (repeated for each of the 50 runs; output truncated)

In [33]:
# Step A.5 Report the mean and the standard deviation of the mean squared errors.

import numpy as np

# Calculating the mean and standard deviation of the 50 mean squared errors (mse)
mean_mse = np.mean(mse_list)
std_mse = np.std(mse_list)

# Outputting the results
print(f"Mean of MSEs: {mean_mse}")
print(f"Standard Deviation of MSEs: {std_mse}")

Mean of MSEs: 160.69643106776036
Standard Deviation of MSEs: 12.517537029251839


### 1.D.1 Results
- Mean of MSEs: 160.69643106776036
- Standard Deviation of MSEs: 12.517537029251839

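- Change from 1.B:
  - Mean MSE : 402.471 to 160.696 (-60.1%)
  - std MSE : 105.929 to 12.517 (-88.2%)
- Compared with Step B, the mean of the mean squared errors is about 60% lower with three hidden layers, and the results vary far less from run to run. Note that the batch size also differs (32 in 1.B, 64 in 1.D).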

In [37]:
m_b = 402.471
m_d = 160.696
s_b = 105.929
s_d = 12.517

chg_mbd = (m_d - m_b) / m_b * 100
chg_sbd = (s_d - s_b) / s_b * 100

print(chg_mbd, chg_sbd)

-60.07265119722912 -88.18359467190288
