# Peer-graded Assignment: Build a Regression Model in Keras
The lab is split into 2 main parts:
1. Data preparation: reading the dataset, inspecting its values and normalizing the predictors
2. Parts A, B, C and D, following the assessment tasks

Requirements:
0. Python 3.11.1
1. Jupyter core
2. Jupyter lab

Workflow:
1. Download / read the data
2. Split the data into the train and test sets
3. Define a function to generate a Keras model for the dataset
4. Fit and train the model
5. Evaluate the model on the test set using the mean squared error

In [None]:
# Install the libraries
# %pip install pandas
# %pip install numpy
# %pip install matplotlib
# %pip install scikit-learn
# %pip install tensorflow
# %pip install keras


# Import the libraries

In [None]:
# Math and data
import numpy as np
import pandas as pd

# Keras
import keras
from keras.models import Sequential
from keras.layers import Dense

# Sklearn
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Download the data

In [None]:
# concrete_data = pd.read_csv('https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DL0101EN/labs/data/concrete_data.csv')
concrete_data = pd.read_csv("concrete_data.csv")
concrete_data.head()

In [None]:
concrete_data.shape

In [None]:
concrete_data.info()

In [None]:
concrete_data.describe()

In [None]:
concrete_data.isnull().sum()

In [None]:
# Split the data into predictors and target
predictors = concrete_data.drop(columns="Strength")
target = concrete_data["Strength"]

# Normalize the predictors: subtract the column mean, divide by the column standard deviation
normalized_predictors = (predictors - predictors.mean()) / predictors.std()

assert isinstance(predictors, pd.DataFrame)
assert isinstance(target, pd.Series)


### Define the model generation functions
1. `generate_model_ABC` builds the model for Parts A, B and C: one hidden layer of 10 nodes with ReLU activation
2. `generate_model_D` builds the model for Part D: three hidden layers of 10 nodes each with ReLU activation
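
As a rough size check on the two architectures listed above, the trainable parameters can be counted by hand: a fully connected layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below uses a hypothetical helper, `count_dense_params`, and assumes 8 predictor columns; substitute the actual column count of the dataset.

```python
def count_dense_params(layer_sizes):
    """Count weights and biases of a stack of fully connected layers.

    layer_sizes lists the width of every layer, input first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out  # weights + biases of one layer
    return total


n_cols = 8  # assumption: number of predictor columns

params_abc = count_dense_params([n_cols, 10, 1])        # one hidden layer
params_d = count_dense_params([n_cols, 10, 10, 10, 1])  # three hidden layers
print(params_abc, params_d)
```

Under this assumption the Part D model has roughly three times as many parameters as the Part A-C model, which is still very small by neural network standards.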

In [None]:
# Define a neural network model generation function
def generate_model_ABC(n_cols: int) -> keras.Model:
    """
    Generate model for the assessments parts A, B, C
    """
    model = Sequential()
    model.add(
        Dense(10, activation="relu", input_shape=(n_cols,))
    )
    model.add(
        Dense(1)
    )

    model.compile(optimizer="adam", loss="mean_squared_error",)

    return model

def generate_model_D(n_cols: int) -> keras.Model:
    """
    Generate model for the assessment part D
    """
    model = Sequential()
    model.add(
        Dense(10, activation="relu", input_shape=(n_cols,))
    )
    model.add(
        Dense(10, activation="relu"),
    )
    model.add(
        Dense(10, activation="relu"),
    )
    model.add(
        Dense(1)
    )

    model.compile(optimizer="adam", loss="mean_squared_error",)

    return model
    

In [None]:
def get_model_predictions(
        x_train: np.ndarray,
        x_test: np.ndarray,
        y_train: np.ndarray,
        model: keras.Model,
        epochs: int,
        verbose: int = 0,
        ) -> np.ndarray:
    """
    Train the model and predict on the test predictors.

    Args:
        x_train: training predictors
        x_test: test predictors
        y_train: training target values
        model: the compiled model to train
        epochs: number of training epochs
        verbose: Keras verbosity level
    Return:
        np.ndarray: the model predictions for x_test
    """
    # Train the model
    model.fit(x_train, y_train, epochs=epochs, verbose=verbose)
    # Yield the predictions
    predictions = model.predict(x_test)
    return predictions

In [None]:
# Iterate for N times
from typing import Tuple

def generate_model_on_data(
        predictors: pd.DataFrame,
        target: pd.Series,
        epochs_number: int,
        model_generation_function,
        N: int = 50,
        verbose: int = 0,
    ) -> np.ndarray:
    """
    Train and evaluate a fresh model N times, each time on a new random 70/30 split.

    Args:
        predictors (pd.DataFrame): input data (X_0, X_1, ..., X_n)
        target (pd.Series): values to predict (Y)
        epochs_number (int): number of training epochs per repetition
        model_generation_function (Callable): the function to generate a model. Either generate_model_ABC or generate_model_D
        N (int) = 50: number of training repetitions
        verbose (int) = 0: the training verbosity
    Return:
        np.ndarray: array of test-set MSE values, one per repetition
    """
    mse_list = []

    for idx in range(N):
        print(f"Training iteration #{idx + 1}/{N}")
        x_train, x_test, y_train, y_test = train_test_split(predictors, target, test_size=0.3, random_state=idx)
        n_cols = x_train.shape[1]
        model = model_generation_function(n_cols)
        predictions = get_model_predictions(
            x_train, x_test, y_train,
            model=model,
            epochs=epochs_number,
            verbose=verbose,
        )
        mse = mean_squared_error(y_test, predictions)
        mse_list.append(mse)
        
    mse_array = np.array(mse_list)
    assert mse_array.shape[0] == N
    return mse_array

In [None]:
# Define the MSE report function
def print_mse_report(mse_array: np.ndarray):
    # Report the mean and the standard deviation of the mean squared errors.
    print(f"The MSE mean value is {mse_array.mean()}")
    print(f"The MSE standard deviation is {mse_array.std()}")
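
One detail about the report: `mse_array.std()` uses NumPy's default `ddof=0`, which is the population standard deviation (divide by N), not the sample standard deviation (divide by N - 1). With 50 repetitions the two are close, but it helps to know which one is reported. A minimal standard-library sketch on made-up MSE values (the numbers are purely illustrative):

```python
import statistics

mse_values = [100.0, 120.0, 90.0, 110.0]  # hypothetical MSEs, not real results

mean_mse = statistics.fmean(mse_values)
population_std = statistics.pstdev(mse_values)  # divides by N, like ndarray.std()
sample_std = statistics.stdev(mse_values)       # divides by N - 1, like ndarray.std(ddof=1)

print(f"mean={mean_mse:.2f} population_std={population_std:.2f} sample_std={sample_std:.2f}")
```

The sample estimate is always slightly larger than the population one, and the gap shrinks as the number of repetitions grows.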

# A. Build a baseline model (5 marks) 

Use the Keras library to build a neural network with the following:

- One hidden layer of 10 nodes, and a ReLU activation function

- Use the adam optimizer and the mean squared error as the loss function.

1. Randomly split the data into training and test sets by holding 30% of the data for testing. You can use the train_test_split helper function from Scikit-learn.

2. Train the model on the training data using 50 epochs.

3. Evaluate the model on the test data and compute the mean squared error between the predicted concrete strength and the actual concrete strength. You can use the mean_squared_error function from Scikit-learn.

4. Repeat steps 1 - 3, 50 times, i.e., create a list of 50 mean squared errors.

5. Report the mean and the standard deviation of the mean squared errors.
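
Steps 1 and 4 rely on the split being random yet reproducible: each repetition gets its own seed, so the 50 splits differ from each other but can be regenerated exactly. The sketch below illustrates the idea with the standard library only. It is not how `train_test_split` is implemented, and `holdout_split` is a hypothetical helper.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=0):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # same seed -> same shuffle
    n_test = round(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(10))  # stand-in for 10 dataset rows
train_a, test_a = holdout_split(rows, seed=1)
train_b, test_b = holdout_split(rows, seed=1)  # identical to the first split
train_c, test_c = holdout_split(rows, seed=2)  # a different seed, reshuffled
print(len(train_a), len(test_a))
```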

In [None]:
# Part A MSEs
part_A_mse_array = generate_model_on_data(
    predictors=predictors,
    target=target,
    epochs_number=50,
    model_generation_function=generate_model_ABC,
    N=50,
    verbose=0,
)
print(part_A_mse_array)
print_mse_report(part_A_mse_array)

### Conclusion on Part A
The MSE mean value is 406.5074  
The MSE standard deviation is 393.3563  
The mean squared error (MSE) is very high, and its standard deviation is almost as large as the mean, so the baseline model is too inaccurate and too unstable across splits to be useful in practice

# B. Normalize the data (5 marks) 

Repeat Part A but use a normalized version of the data. Recall that one way to normalize the data is by subtracting the mean from the individual predictors and dividing by the standard deviation.

How does the mean of the mean squared errors compare to that from Step A?
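
The normalization described above is the z-score: for each predictor column, subtract the column mean and divide by the column standard deviation. The sketch below writes this out by hand for a single column using the standard library. The column values are hypothetical, and `zscore` is an illustrative helper rather than part of the notebook. It uses the sample standard deviation, which is also the default of the pandas `.std()` method.

```python
import statistics


def zscore(values):
    """Standardize a column: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)  # sample std (N - 1), the pandas .std() default
    return [(v - mean) / std for v in values]


column = [100.0, 200.0, 300.0, 400.0]  # hypothetical values of one predictor
column_normalized = zscore(column)
print(column_normalized)
```

After the transformation the column has mean 0 and standard deviation 1, so predictors measured on very different scales contribute comparably to the first layer.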

In [None]:
# Part B MSEs
part_B_mse_array = generate_model_on_data(
    predictors=normalized_predictors,   # Use the normalized predictors
    target=target,
    epochs_number=50,
    model_generation_function=generate_model_ABC,
    N=50,
    verbose=0,
)
print_mse_report(part_B_mse_array)

print(f"The mean MSE difference between parts B and A is {part_B_mse_array.mean() - part_A_mse_array.mean()}")
print(f"The mean MSE % decrease from part A to part B is {(1 - part_B_mse_array.mean() / part_A_mse_array.mean()) * 100:.2f}%")

### Conclusion on Part B
The MSE mean value has dropped from 406.5074 to 398.9995, a decrease of under 2%, so the mean is practically unchanged  
The MSE standard deviation has dropped from 393.3563 to 126.9637, about 68% lower, so the results are much more consistent across splits
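
The relative changes quoted here can be checked with a few lines of arithmetic. The sketch below applies the same formula as the report cell, `(1 - new / old) * 100`, to the Part A and Part B figures stated above; `percent_decrease` is a hypothetical helper name.

```python
def percent_decrease(old: float, new: float) -> float:
    """Relative drop from old to new, in percent (negative if the value grew)."""
    return (1 - new / old) * 100


# Values as reported in the Part A and Part B conclusions
mean_drop = percent_decrease(406.5074, 398.9995)
std_drop = percent_decrease(393.3563, 126.9637)
print(f"mean MSE: {mean_drop:.2f}% lower, MSE std: {std_drop:.2f}% lower")
```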


# C. Increase the number of epochs (5 marks)

Repeat Part B but use 100 epochs this time for training.

How does the mean of the mean squared errors compare to that from Step B?

In [None]:
# Part C MSEs
part_C_mse_array = generate_model_on_data(
    predictors=normalized_predictors,
    target=target,
    epochs_number=100,  # Increase the number of epochs from 50 to 100
    model_generation_function=generate_model_ABC,
    N=50,
    verbose=0,
)
print_mse_report(part_C_mse_array)

print(f"The mean MSE difference between parts C and B is {part_C_mse_array.mean() - part_B_mse_array.mean()}")
print(f"The mean MSE % decrease from part B to part C is {(1 - part_C_mse_array.mean() / part_B_mse_array.mean()) * 100:.2f}%")

### Conclusion on Part C
The MSE mean value has dropped from 398.99 to 167.29, a significant 58.07% decrease  
The MSE standard deviation has dropped from 126.96 to just 16.90, which is only 13.3% of the Part B deviation. This is a very large improvement in consistency across splits

# D. Increase the number of hidden layers (5 marks)

Repeat part B but use a neural network with the following instead:

- Three hidden layers, each of 10 nodes and ReLU activation function.

How does the mean of the mean squared errors compare to that from Step B?

In [None]:
# Part D MSEs
part_D_mse_array = generate_model_on_data(
    predictors=normalized_predictors,
    target=target,
    epochs_number=100,
    model_generation_function=generate_model_D, # Use the 3 hidden layers model instead of 1
    N=50,
    verbose=0,
)
print_mse_report(part_D_mse_array)

print(f"The mean MSE difference between parts D and C is {part_D_mse_array.mean() - part_C_mse_array.mean()}")
print(f"The mean MSE % decrease from part C to part D is {(1 - part_D_mse_array.mean() / part_C_mse_array.mean()) * 100:.2f}%")

### Conclusion on Part D
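Note that this run uses 100 epochs and is compared with Part C, whereas the task statement asks to repeat Part B (50 epochs)  
The MSE mean value has dropped from 167.29 to just 91.06, a 45.5% decrease and the second largest improvement after Part C  
The MSE standard deviation has increased from 16.9 to 24.91, a 47.3% increase, so the results vary more between splits. Given the much lower mean MSE, this seems an acceptable trade-off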
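
To compare the four runs side by side, the figures reported in the conclusions can be collected in one place. The sketch below also computes the ratio of standard deviation to mean as a rough measure of relative spread. This ratio is an addition for illustration and is not required by the assignment.

```python
# Mean and standard deviation of the MSE, as reported in the four conclusions
reported = {
    "A": (406.5074, 393.3563),
    "B": (398.9995, 126.9637),
    "C": (167.29, 16.90),
    "D": (91.06, 24.91),
}

# Standard deviation relative to the mean, per part
relative_spread = {part: std / mean for part, (mean, std) in reported.items()}

for part, (mean, std) in reported.items():
    print(f"Part {part}: mean={mean:8.2f}  std={std:8.2f}  std/mean={relative_spread[part]:.3f}")
```

Read this way, Part D has the lowest mean MSE, while Part C has the smallest spread relative to its mean.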
