# **ASSIGNMENT: BUILD A REGRESSION MODEL WITH KERAS**

## **Assignment Topic:** Build a regression model using the Keras library to model the data about concrete compressive strength.


### **Download and Clean the Dataset**

In [2]:
import pandas as pd
import numpy as np

# read data
concrete_data = pd.read_csv('https://cocl.us/concrete_data')
concrete_data.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28,79.99
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28,61.89
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270,40.27
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365,41.05
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360,44.3


In [3]:
# number of datapoints
concrete_data.shape

(1030, 9)

In [4]:
# check for missing values
concrete_data.describe()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [5]:
#check for null values
concrete_data.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

In [6]:
# split dataset into predictors and target
predictors = concrete_data.drop('Strength', axis=1)
predictors.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,540.0,0.0,0.0,162.0,2.5,1040.0,676.0,28
1,540.0,0.0,0.0,162.0,2.5,1055.0,676.0,28
2,332.5,142.5,0.0,228.0,0.0,932.0,594.0,270
3,332.5,142.5,0.0,228.0,0.0,932.0,594.0,365
4,198.6,132.4,0.0,192.0,0.0,978.4,825.5,360


In [7]:
# target
target = concrete_data['Strength']
target.head()

0    79.99
1    61.89
2    40.27
3    41.05
4    44.30
Name: Strength, dtype: float64

## **A :  Build a baseline model**

Use the Keras library to build a neural network with the following:

- One hidden layer of 10 nodes, and a ReLU activation function

- Use the adam optimizer and the mean squared error  as the loss function.

1. Randomly split the data into a training and test sets by holding 30% of the data for testing. 

2. Train the model on the training data using 50 epochs.

3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.

4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

5. Report the mean and the standard deviation of the mean squared errors.

In [8]:
# Import required libraries

import keras
from keras.models import Sequential
from keras.layers import Dense


from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

2024-01-31 21:56:24.490279: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-01-31 21:56:24.520083: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-01-31 21:56:24.520105: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-01-31 21:56:24.520783: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-01-31 21:56:24.524856: I external/local_tsl/tsl/cuda/cudart_stub.cc:31] Could not find cuda drivers on your machine, GPU will not be used.
2024-01-31 21:56:24.525193: I tensorflow/core/platform/cpu_feature_guard.cc:1

### Build a Neural network with one hidden layer of 10 nodes and ReLU activation function.

In [9]:
# Build Neural network with one hidden layer of 10 nodes
def regression_model():
    model = Sequential()

    # number of columns 
    n_cols = predictors.shape[1]

    # Hidden layer
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))

    # Output layer
    model.add(Dense(1))

    #compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model


### 1.Randomly split the data into a training and test sets by holding 30% of the data for testing.

In [10]:
# Set random seed
np.random.seed(42)

# Split data to training and test sets
X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3, random_state=2)

print(f"Shape of X_train : {X_train.shape}, Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}, Shape of y_test: {y_test.shape}")

Shape of X_train : (721, 8), Shape of X_test: (309, 8)
Shape of y_train: (721,), Shape of y_test: (309,)


### 2. Train the model on the training data using 50 epochs.

In [11]:
model = regression_model()

#train the model using 50 epochs
model.fit(X_train, y_train, validation_split=0.3, epochs=50, verbose=2)


Epoch 1/50


2024-01-31 21:56:25.857904: I external/local_xla/xla/stream_executor/cuda/cuda_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2024-01-31 21:56:25.858298: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2256] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...


16/16 - 0s - loss: 130324.5781 - val_loss: 74441.3516 - 372ms/epoch - 23ms/step
Epoch 2/50
16/16 - 0s - loss: 55848.6562 - val_loss: 30462.9531 - 30ms/epoch - 2ms/step
Epoch 3/50
16/16 - 0s - loss: 28871.4844 - val_loss: 21282.9473 - 30ms/epoch - 2ms/step
Epoch 4/50
16/16 - 0s - loss: 23406.1406 - val_loss: 19511.2676 - 29ms/epoch - 2ms/step
Epoch 5/50
16/16 - 0s - loss: 21128.8555 - val_loss: 17405.5645 - 31ms/epoch - 2ms/step
Epoch 6/50
16/16 - 0s - loss: 18805.1465 - val_loss: 15425.1602 - 27ms/epoch - 2ms/step
Epoch 7/50
16/16 - 0s - loss: 16726.1328 - val_loss: 13739.0850 - 28ms/epoch - 2ms/step
Epoch 8/50
16/16 - 0s - loss: 14817.4102 - val_loss: 12171.5781 - 28ms/epoch - 2ms/step
Epoch 9/50
16/16 - 0s - loss: 13049.6299 - val_loss: 10771.6846 - 27ms/epoch - 2ms/step
Epoch 10/50
16/16 - 0s - loss: 11423.3760 - val_loss: 9459.9668 - 28ms/epoch - 2ms/step
Epoch 11/50
16/16 - 0s - loss: 10017.9590 - val_loss: 8193.4961 - 28ms/epoch - 2ms/step
Epoch 12/50
16/16 - 0s - loss: 8697.7744

<keras.src.callbacks.History at 0x748a779475d0>

### 3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength.

In [12]:
# prediction
predictions = model.predict(X_test)

#calculate mean squared error
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y_test, predictions)
print(f"Mean Squared Error: {mse}")

Mean Squared Error: 600.5005684046172


### 4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

In [13]:
# initialize mean squared error list
mse_list = []

# repeat 50 times and calculate mean squared error
for i in range(50):
    # randomly split the data to training and testing data
    X_train, X_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3, random_state=2)
    
    #initialise model
    model = regression_model()
    
    # fit the model
    model.fit(X_train, y_train, validation_split=0.3, epochs=50, verbose=0)
    
    # prediction
    predictions = model.predict(X_test)

    #calculate mean squared error
    mse = mean_squared_error(y_test, predictions)
    
    mse_list.append(mse)

    
    



### 5. Report the mean and the standard deviation of the mean squared errors.

In [14]:
#calculate mean and standard deviation of Mean Squared Errors
mse_mean_a = np.mean(mse_list)
mse_std_a = np.std(mse_list)

print(f"Mean of Mean Squared Errors: {mse_mean_a}")
print(f"Standard Deviation of Mean Squared Errors: {mse_std_a}")

Mean of Mean Squared Errors: 615.465263044772
Standard Deviation of Mean Squared Errors: 766.0662690523781


## **B. Normalize the data**
Repeat Part A but use a normalized version of the data

In [15]:
# Normalise data
predictors_norm = (predictors - predictors.mean())/predictors.std()
predictors_norm.head()

Unnamed: 0,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069


In [16]:
# initialize mean squared error list
mse_list_b = []

# repeat 50 times and calculate mean squared error
for i in range(50):
    # randomly split the Normalised data to training and testing data
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=2)
    
    #initialise model
    model = regression_model()
    
    # fit the model
    model.fit(X_train, y_train, validation_split=0.3, epochs=50, verbose=0)
    
    # prediction
    predictions = model.predict(X_test)

    #calculate mean squared error
    from sklearn.metrics import mean_squared_error
    mse = mean_squared_error(y_test, predictions)
    
    mse_list_b.append(mse)




### **How does the mean of the mean squared errors compare to that from Step A?**

In [17]:
# Mean and Standard Deviation of Mean Squared Errors
mse_mean_b = np.mean(mse_list_b)
mse_std_b = np.std(mse_list_b)

print(f"Mean of MSE in step A : {mse_mean_a}")
print(f"Mean of MSE in step B (normalised data): {mse_mean_b}")
print(f"Standard Deviation of Mean Squared Errors: {mse_std_b}")



Mean of MSE in step A : 615.465263044772
Mean of MSE in step B (normalised data): 680.4988446823548
Standard Deviation of Mean Squared Errors: 141.243309299532


### **Observation**
Normally, using normalised data results in more accurate model with lower mean squared error. But here, using Z-score normalization has increased the Mean of mean squared error when compared to the model in Step A in which normalization of data is not done.

## **C. Increate the number of epochs**

Repeat Part B but use 100 epochs this time for training.

In [18]:
#train model with 100 epochs
mse_list_c = []

# repeat 50 times and calculate mean squared error
for i in range(50):
    # randomly split the Normalised data to training and testing data
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=2)
    
    #initialise model
    model = regression_model()
    
    # fit the model with epoch = 100
    model.fit(X_train, y_train, validation_split=0.3, epochs=100, verbose=0)
    
    # prediction
    predictions = model.predict(X_test)

    #calculate mean squared error
    from sklearn.metrics import mean_squared_error
    mse = mean_squared_error(y_test, predictions)
    
    mse_list_c.append(mse)




### **How does the mean of the mean squared errors compare to that from Step B?**

In [19]:
# Mean and Standard Deviation of Mean Squared Errors
mse_mean_c = np.mean(mse_list_c)
mse_std_c = np.mean(mse_list_c)

print(f"Mean of MSE in step B : {mse_mean_b}")
print(f"Mean of MSE in step C (epochs = 100) : {mse_mean_c}")

print(f"Standard Deviation of Mean Squared Errors in step C: {mse_std_c}")

Mean of MSE in step B : 680.4988446823548
Mean of MSE in step C (epochs = 100) : 227.66556219801848
Standard Deviation of Mean Squared Errors in step C: 227.66556219801848


### **Observation**

Mean of mean squared error in STEP C (with epoch = 100) is nearly 3 times lower than Mean of Mean squared error in STEP B( epoch = 50). This indicates that a increase in number of iterations or epochs can significantly reduce the error and hence improve the accuracy of the model.

## **D. Increase the number of hidden layers**

Repeat part B but use a neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.

In [20]:
# Initiallise model with three hidden layers of 10 noded
model_2 = Sequential()

n_cols = predictors_norm.shape[1]

# add dohidden layers
model_2.add(Dense(10, activation='relu', input_shape=(n_cols,)))
model_2.add(Dense(10, activation='relu', input_shape=(n_cols,)))
model_2.add(Dense(10, activation='relu', input_shape=(n_cols,)))

# Output layer
model_2.add(Dense(1))

#compile model
model_2.compile(optimizer='adam', loss='mean_squared_error')



In [21]:
#train model with 50 epochs
mse_list_d = []

# repeat 50 times and calculate mean squared error
for i in range(50):
    # randomly split the Normalised data to training and testing data
    X_train, X_test, y_train, y_test = train_test_split(predictors_norm, target, test_size=0.3, random_state=2)
    
    
    # fit the model with epoch
    model_2.fit(X_train, y_train, validation_split=0.3, epochs=50, verbose=0)
    
    # prediction
    predictions = model_2.predict(X_test)

    #calculate mean squared error
    from sklearn.metrics import mean_squared_error
    mse = mean_squared_error(y_test, predictions)
    
    mse_list_d.append(mse)



### **How does the mean of the mean squared errors compare to that from Step B?**

In [22]:
# Find Mean of Mean Squared Errors
mse_mean_d = np.mean(mse_list_d)
mse_std_d = np.std(mse_list_d)

print(f"Mean of MSE in step B : {mse_mean_b}")
print(f"Mean of MSE in step D : {mse_mean_d}")

print(f"Standard Deviation of MSE in step D: {mse_std_d}")

Mean of MSE in step B : 680.4988446823548
Mean of MSE in step D : 40.492097731946316
Standard Deviation of MSE in step D: 14.147629808691628


### **Observation**
**Mean of mean sqaured errors of Step B with one hidden layer is 17 times that of step D with 3 hidden layers**. 
This indicates that increase in number of hidden layers of neurons has a very high impact in accuracy of the model. This is clearly visible form the above observation.