# Regression

### General instruction:

#### 1. Assignment Topic:
In this project, you will build a regression model using the Keras library to model the same concrete compressive strength data that we used in lab 3.

#### 2. Concrete Data:

For your convenience, the data can be found here again: https://cocl.us/concrete_data

To recap, the predictors in the concrete strength data include:
1. Cement
2. Blast Furnace Slag
3. Fly Ash
4. Water
5. Superplasticizer
6. Coarse Aggregate
7. Fine Aggregate

#### 3. Assignment Instructions:

Please check the My Submission tab for detailed assignment instructions.

#### 4. How to submit:

You will need to submit your code for each part in a Jupyter Notebook. Since each part builds on the previous one, you can submit the same notebook four times for grading. Please make sure that you:

1. use Markdown to clearly label your code for each part,
2. properly comment your code so that your peer who is grading your work is able to understand your code easily,
3. include your comments and discussion of the difference in the mean of the mean squared errors among the different parts.

## Installing dependencies and importing libraries

In [27]:
import os
import pandas as pd
import numpy as np
import keras
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense
from sklearn.metrics import mean_squared_error

In [2]:
current_dir = os.getcwd()
file_path = os.path.join(current_dir, 'concrete_data.csv')
df = pd.read_csv(file_path)

In [3]:
df.tail()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
1025,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28,44.28
1026,322.2,0.0,115.6,196.0,10.4,817.9,813.4,28,31.18
1027,148.5,139.4,108.6,192.7,6.1,892.4,780.0,28,23.7
1028,159.1,186.7,0.0,175.6,11.3,989.6,788.9,28,32.77
1029,260.9,100.5,78.3,200.6,8.6,864.5,761.5,28,32.4


In [4]:
df.describe()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age,Strength
count,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0,1030.0
mean,281.167864,73.895825,54.18835,181.567282,6.20466,972.918932,773.580485,45.662136,35.817961
std,104.506364,86.279342,63.997004,21.354219,5.973841,77.753954,80.17598,63.169912,16.705742
min,102.0,0.0,0.0,121.8,0.0,801.0,594.0,1.0,2.33
25%,192.375,0.0,0.0,164.9,0.0,932.0,730.95,7.0,23.71
50%,272.9,22.0,0.0,185.0,6.4,968.0,779.5,28.0,34.445
75%,350.0,142.95,118.3,192.0,10.2,1029.4,824.0,56.0,46.135
max,540.0,359.4,200.1,247.0,32.2,1145.0,992.6,365.0,82.6


In [6]:
df.shape

(1030, 9)

In [5]:
df.isnull().sum()

Cement                0
Blast Furnace Slag    0
Fly Ash               0
Water                 0
Superplasticizer      0
Coarse Aggregate      0
Fine Aggregate        0
Age                   0
Strength              0
dtype: int64

## A. Build a baseline model (5 marks) 

Use the Keras library to build a neural network with the following:
- One hidden layer of 10 nodes with a ReLU activation function.
- The adam optimizer and the mean squared error as the loss function.

1. Randomly split the data into training and test sets by holding out 30% of the data for testing. You can use the train_test_split helper function from Scikit-learn.
2. Train the model on the training data using 50 epochs.
3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.
4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.
5. Report the mean and the standard deviation of the mean squared errors.

Submit your Jupyter Notebook with your code and comments.

### 1. Split the data into training and test sets

#### 1.1. Split data into predictors and target

In [8]:
concrete_data = df

In [9]:
concrete_data_columns = concrete_data.columns

predictors = concrete_data[concrete_data_columns[concrete_data_columns != 'Strength']] # all columns except Strength
target = concrete_data['Strength'] # Strength column

In [11]:
predictors.tail()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
1025,276.4,116.0,90.3,179.6,8.9,870.1,768.3,28
1026,322.2,0.0,115.6,196.0,10.4,817.9,813.4,28
1027,148.5,139.4,108.6,192.7,6.1,892.4,780.0,28
1028,159.1,186.7,0.0,175.6,11.3,989.6,788.9,28
1029,260.9,100.5,78.3,200.6,8.6,864.5,761.5,28


In [12]:
target.tail()

1025    44.28
1026    31.18
1027    23.70
1028    32.77
1029    32.40
Name: Strength, dtype: float64

In [37]:
n_cols = predictors.shape[1] # number of predictors

#### 1.2. Split to train-test subset

In [35]:
X = predictors
y = target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

### 2. Train the model

#### 2.1. Build a neural network

In [23]:
# define regression model
def regression_model():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model
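
As a quick sanity check on the architecture above, the number of trainable parameters in a stack of Dense layers can be counted by hand: each layer has (inputs x units) weights plus one bias per unit. The sketch below assumes 8 predictors, which is the value `n_cols` takes for this dataset; the helper name `dense_param_count` is hypothetical and not part of Keras.

```python
def dense_param_count(n_inputs, layer_units):
    """Count weights and biases for a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus bias vector
        fan_in = units                   # this layer's output feeds the next layer
    return total


# Baseline network: 8 inputs -> 10 hidden units -> 1 output
baseline_params = dense_param_count(8, [10, 1])
print(baseline_params)
```

The same helper applies to deeper networks, for example `dense_param_count(8, [10, 10, 10, 1])` for three hidden layers. Calling `model.summary()` reports the count Keras actually uses.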

#### 2.2. Train and test the model

In [24]:
# build the model
model = regression_model()

In [26]:
# fit the model
model.fit(X_train, y_train, validation_split=0.3, epochs=50, verbose=2)

Epoch 1/50
16/16 - 0s - loss: 659.9481 - val_loss: 572.4831 - 59ms/epoch - 4ms/step
Epoch 2/50
16/16 - 0s - loss: 623.8214 - val_loss: 534.2709 - 32ms/epoch - 2ms/step
Epoch 3/50
16/16 - 0s - loss: 579.4589 - val_loss: 496.6895 - 30ms/epoch - 2ms/step
Epoch 4/50
16/16 - 0s - loss: 537.5168 - val_loss: 463.8261 - 32ms/epoch - 2ms/step
Epoch 5/50
16/16 - 0s - loss: 500.9638 - val_loss: 434.0738 - 31ms/epoch - 2ms/step
Epoch 6/50
16/16 - 0s - loss: 468.7195 - val_loss: 406.4196 - 31ms/epoch - 2ms/step
Epoch 7/50
16/16 - 0s - loss: 437.8703 - val_loss: 381.9273 - 29ms/epoch - 2ms/step
Epoch 8/50
16/16 - 0s - loss: 410.5940 - val_loss: 359.4396 - 29ms/epoch - 2ms/step
Epoch 9/50
16/16 - 0s - loss: 385.5906 - val_loss: 338.6426 - 30ms/epoch - 2ms/step
Epoch 10/50
16/16 - 0s - loss: 362.3828 - val_loss: 318.9467 - 31ms/epoch - 2ms/step
Epoch 11/50
16/16 - 0s - loss: 340.8242 - val_loss: 300.4897 - 33ms/epoch - 2ms/step
Epoch 12/50
16/16 - 0s - loss: 320.3775 - val_loss: 283.6197 - 35ms/epoch 

<keras.src.callbacks.History at 0x1ec1d66db80>
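
The `16/16` in the log above is the number of batches per epoch. A rough sketch of where that number may come from, assuming Keras's default batch size of 32, that `train_test_split` rounds the test size up, and that `validation_split` holds out the trailing fraction of the training rows while rounding the kept part down. The function name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size=0.3, validation_split=0.3, batch_size=32):
    """Estimate the rows in each subset and the batches per epoch (assumed rounding)."""
    n_test = math.ceil(n_samples * test_size)            # rows held out for testing
    n_train_full = n_samples - n_test                    # rows passed to model.fit
    n_fit = int(n_train_full * (1 - validation_split))   # rows used for weight updates
    n_val = n_train_full - n_fit                         # rows used only for val_loss
    steps = math.ceil(n_fit / batch_size)                # batches per epoch
    return n_test, n_fit, n_val, steps


n_test, n_fit, n_val, steps = split_sizes(1030)
print(n_test, n_fit, n_val, steps)
```

Under these assumptions roughly half of the 1030 rows drive the weight updates. The assignment does not ask for a validation split, so dropping `validation_split` would leave more rows for training.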

### 3. Evaluate the model

In [28]:
# Predict the concrete strength for the test data
y_pred = model.predict(X_test)



In [30]:
# Compute the Mean Squared Error between actual and predicted values
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error on Test Data: {mse:.4f}")

Mean Squared Error on Test Data: 17.6684
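
For reference, the mean squared error is the average of the squared differences between actual and predicted values. Below is a minimal NumPy sketch with made-up numbers. It also shows a shape pitfall that appears when the metric is computed by hand on NumPy arrays, because `model.predict` returns a column of shape `(n, 1)` while the targets are typically one-dimensional.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([30.0, 45.0, 20.0])        # shape (3,)
y_hat = np.array([[28.0], [47.0], [26.0]])   # shape (3, 1), like model.predict output

# Flatten the predictions so both arrays have shape (3,)
mse_manual = np.mean((y_true - y_hat.ravel()) ** 2)

# Without flattening, broadcasting builds a 3x3 matrix of all pairwise differences,
# and averaging it would not give the MSE
pairwise = (y_true - y_hat) ** 2

print(mse_manual, pairwise.shape)
```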


### 4. Repeat steps 1-3 50 times

In [31]:
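mse_list = []

# Note: this loop only re-runs prediction with the model trained above on the
# same test split, so every MSE is identical. It does not repeat steps 1-3;
# cell In [36] below re-splits the data and retrains a new model on each run.
for i in range(50):
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)
print("List of Mean Squared Errors from 50 evaluations:")
print(mse_list)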

List of Mean Squared Errors from 50 evaluations:
[17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.668392262888368, 17.6683922

### 5. Mean and SD of the mean squared error

In [33]:
mean_mse = np.mean(mse_list)
sd_mse = np.std(mse_list)
print(f"Mean of MSE on Test Data over 50 runs: {mean_mse:.4f}")
print(f"Standard Deviation of MSE on Test Data over 50 runs: {sd_mse:.4f}")

Mean of MSE on Test Data over 50 runs: 17.6684
Standard Deviation of MSE on Test Data over 50 runs: 0.0000
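
The standard deviation of exactly zero is expected here: calling `predict` repeatedly on one trained model and one fixed test set is deterministic, so the loop records the same number 50 times. The sketch below illustrates the difference between re-evaluating and re-splitting. It uses a least-squares line as a stand-in for the network and synthetic data; both are assumptions made so that the example runs without Keras.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x, plus noise
x = rng.uniform(0, 10, size=200)
y = 3.0 * x + rng.normal(0, 2.0, size=200)


def holdout_mse(coeffs, idx):
    """MSE of a fitted line on the rows selected by idx."""
    pred = np.polyval(coeffs, x[idx])
    return float(np.mean((y[idx] - pred) ** 2))


# (a) Fit once, then evaluate the same model on the same test rows 50 times
perm = rng.permutation(200)
train_idx, test_idx = perm[:140], perm[140:]
coeffs = np.polyfit(x[train_idx], y[train_idx], deg=1)
repeated_eval = [holdout_mse(coeffs, test_idx) for _ in range(50)]

# (b) Draw a new split and refit on every repetition
resplit_eval = []
for _ in range(50):
    perm = rng.permutation(200)
    train_idx, test_idx = perm[:140], perm[140:]
    coeffs = np.polyfit(x[train_idx], y[train_idx], deg=1)
    resplit_eval.append(holdout_mse(coeffs, test_idx))

# Number of distinct MSE values in each list
print(len(set(repeated_eval)), len(set(resplit_eval)))
```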


In [36]:
# List to store MSE values for each run
mse_list = []

# Repeat the process 50 times
for _ in range(50):
    # 1. Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None)  # random_state=None for random splitting

    # 2. Build a fresh, untrained model for this run
    model = regression_model()

    # 3. Train the model on the training data using 50 epochs
    model.fit(X_train, y_train, validation_split=0.3, epochs=50, verbose=0)  # verbose=0 to suppress output

    # 4. Evaluate the model on the test data and compute the MSE
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)  # Store the MSE for this run

# After completing all runs, calculate mean and standard deviation of the MSE
mean_mse = np.mean(mse_list)
std_mse = np.std(mse_list)

# Output the results
print(f"Mean MSE over 50 runs: {mean_mse:.4f}")
print(f"Standard Deviation of MSE over 50 runs: {std_mse:.4f}")

Mean MSE over 50 runs: 652.4933
Standard Deviation of MSE over 50 runs: 837.8043
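
Since the same repeat-and-summarize loop recurs in the later parts, one option is to wrap it in a helper that takes the per-run work as a function. This is a minimal sketch: `run_trials` and `one_run` are hypothetical names, and the dummy scorer stands in for the split, train, and evaluate steps.

```python
import numpy as np


def run_trials(one_run, n_trials=50):
    """Call one_run() n_trials times and summarize the returned scores."""
    scores = np.array([one_run() for _ in range(n_trials)], dtype=float)
    return scores.mean(), scores.std(), scores


# Dummy stand-in: a real one_run would split the data, train a new model,
# and return the test MSE for that run
rng = np.random.default_rng(1)
mean_score, std_score, scores = run_trials(lambda: rng.normal(10.0, 1.0), n_trials=50)
print(f"mean={mean_score:.4f}, std={std_score:.4f}")
```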


## B. Normalize the data 

In [38]:
predictors_norm = (predictors - predictors.mean()) / predictors.std()
predictors_norm.head()

,Cement,Blast Furnace Slag,Fly Ash,Water,Superplasticizer,Coarse Aggregate,Fine Aggregate,Age
0,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,0.862735,-1.217079,-0.279597
1,2.476712,-0.856472,-0.846733,-0.916319,-0.620147,1.055651,-1.217079,-0.279597
2,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,3.55134
3,0.491187,0.79514,-0.846733,2.174405,-1.038638,-0.526262,-2.239829,5.055221
4,-0.790075,0.678079,-0.846733,0.488555,-1.038638,0.070492,0.647569,4.976069


In [39]:
target_norm = (target - target.mean()) / target.std()
target_norm.head()

0    2.644123
1    1.560663
2    0.266498
3    0.313188
4    0.507732
Name: Strength, dtype: float64

In [40]:
X = predictors_norm
y = target_norm
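
The statistics above are computed on the full dataset before splitting, so the test rows influence the scaling. A common alternative, shown here as a sketch rather than as what this notebook does, is to compute the mean and standard deviation on the training rows only and reuse them for the test rows. The function name and the synthetic arrays are assumptions; `ddof=1` matches the sample standard deviation that pandas uses by default.

```python
import numpy as np


def standardize_with_train_stats(train, test):
    """Scale both arrays using the column mean and std of the training rows only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0, ddof=1)
    return (train - mean) / std, (test - mean) / std


# Synthetic two-column data standing in for the real predictors
rng = np.random.default_rng(0)
train = rng.normal(loc=[280.0, 180.0], scale=[100.0, 20.0], size=(70, 2))
test = rng.normal(loc=[280.0, 180.0], scale=[100.0, 20.0], size=(30, 2))

train_z, test_z = standardize_with_train_stats(train, test)
print(train_z.mean(axis=0), train_z.std(axis=0, ddof=1))
```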

In [41]:
# List to store MSE values for each run
mse_list = []

# Repeat the process 50 times
for _ in range(50):
    # 1. Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None)  # random_state=None for random splitting

    # 2. Build a fresh, untrained model for this run
    model = regression_model()

    # 3. Train the model on the training data using 50 epochs
    model.fit(X_train, y_train, validation_split=0.3, epochs=50, verbose=0)  # verbose=0 to suppress output

    # 4. Evaluate the model on the test data and compute the MSE
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)  # Store the MSE for this run

# After completing all runs, calculate mean and standard deviation of the MSE
mean_mse = np.mean(mse_list)
std_mse = np.std(mse_list)

# Output the results
print(f"Mean MSE over 50 runs: {mean_mse:.4f}")
print(f"Standard Deviation of MSE over 50 runs: {std_mse:.4f}")

Mean MSE over 50 runs: 0.3141
Standard Deviation of MSE over 50 runs: 0.0675
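
Because the target was standardized as well, the MSE values from this part onward are in squared standard-deviation units and are not directly comparable with the values from part A. Multiplying by the target variance converts them back to the original squared units. Below is a small sketch using the standard deviation of `Strength` reported by `df.describe()` above (16.705742); the helper name is hypothetical.

```python
TARGET_STD = 16.705742  # std of Strength, taken from the df.describe() output


def mse_to_original_units(mse_standardized, target_std=TARGET_STD):
    """Rescale an MSE computed on a standardized target to the original units."""
    return mse_standardized * target_std ** 2


# Mean MSE reported above for the normalized data
print(f"{mse_to_original_units(0.3141):.2f}")
```

The same rescaling applies to the MSE values reported in parts C and D.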


## C. Increase the number of epochs

In [42]:
# List to store MSE values for each run
mse_list = []

# Repeat the process 50 times
for _ in range(50):
    # 1. Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None)  # random_state=None for random splitting

    # 2. Build a fresh, untrained model for this run
    model = regression_model()

    # 3. Train the model on the training data using 100 epochs
    model.fit(X_train, y_train, validation_split=0.3, epochs=100, verbose=0)  # verbose=0 to suppress output

    # 4. Evaluate the model on the test data and compute the MSE
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)  # Store the MSE for this run

# After completing all runs, calculate mean and standard deviation of the MSE
mean_mse = np.mean(mse_list)
std_mse = np.std(mse_list)

# Output the results
print(f"Mean MSE over 50 runs: {mean_mse:.4f}")
print(f"Standard Deviation of MSE over 50 runs: {std_mse:.4f}")

Mean MSE over 50 runs: 0.2127
Standard Deviation of MSE over 50 runs: 0.0322


## D. Increase the number of hidden layers

In [44]:
# define regression model
def regression_model_3hidden():
    # create model
    model = Sequential()
    model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(10, activation='relu'))
    model.add(Dense(1))
    
    # compile model
    model.compile(optimizer='adam', loss='mean_squared_error')
    return model

In [45]:
# List to store MSE values for each run
mse_list = []

# Repeat the process 50 times
for _ in range(50):
    # 1. Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=None)  # random_state=None for random splitting

    # 2. Build a fresh, untrained model with three hidden layers for this run
    model = regression_model_3hidden()

    # 3. Train the model on the training data using 100 epochs
    model.fit(X_train, y_train, validation_split=0.3, epochs=100, verbose=0)  # verbose=0 to suppress output

    # 4. Evaluate the model on the test data and compute the MSE
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    mse_list.append(mse)  # Store the MSE for this run

# After completing all runs, calculate mean and standard deviation of the MSE
mean_mse = np.mean(mse_list)
std_mse = np.std(mse_list)

# Output the results
print(f"Mean MSE over 50 runs: {mean_mse:.4f}")
print(f"Standard Deviation of MSE over 50 runs: {std_mse:.4f}")

Mean MSE over 50 runs: 0.1810
Standard Deviation of MSE over 50 runs: 0.0238
