# Neural networks
Lecturer: dr Andrzej Tomski, WSB-NLU  

> Author: Andrzej Kocielski, 2023-2024  
> email: <akocielski@student.wsb-nlu.edu.pl>, <and.koc001@gmail.com>
___

## Exercise 2

Build an ANN to estimate car prices, using the provided dataset, "CarPricesData.pkl".

Assumptions:
- The provided data is to be split into training and test sets.
- Sequential model from Keras is to be used.


### Importing libraries and set up

In [None]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import warnings # ignore warnings
warnings.filterwarnings("ignore")
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

# Imports
import numpy as np
import pandas as pd

from sklearn.preprocessing import StandardScaler # used for standardisation of data
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression 

from keras.models import Sequential # model of ANN
from keras.layers import Dense # layers of ANN

# model metrics
from sklearn.metrics import mean_absolute_percentage_error, mean_squared_error, r2_score

import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# To remove the scientific notation from numpy arrays
np.set_printoptions(suppress=True)

### Load and inspect data

In [None]:
# Load dataset from file
CarPricesData = pd.read_pickle('CarPricesData.pkl')
# Data inspection
print(f"Shape of dataframe: {CarPricesData.shape}\n")
print(CarPricesData.info(), "\n")
print(CarPricesData.head(2))
print(CarPricesData.describe())

In [None]:
# import seaborn as sns
# sns.boxplot(CarPricesData["KM"])

### Prepare data

The data appears to contain some inaccurate records, e.g. possible outliers in the KM column.

All records with KM = 1 will be removed (even though in some of these cases Age = 1, so the reading could be legitimate).

In [None]:
# CarPricesData[CarPricesData["KM"] < 1000]
cleaned_data = CarPricesData[CarPricesData["KM"] != 1]
print(f"Removed rows: {CarPricesData.shape[0] - cleaned_data.shape[0]}")

As the goal of this NN is to predict Price based on the other parameters, this information must be removed from the dataset destined for training the NN.

In [None]:
# Separate Target Variable and Predictor Variables
TargetVariable = ['Price']
Predictors = ['Age', 'KM', 'Weight', 'HP', 'MetColor', 'CC', 'Doors']

X = cleaned_data[Predictors].values
y = cleaned_data[TargetVariable].values

Standardization of the data, using `StandardScaler` from the `sklearn.preprocessing` module.

In [None]:
# Set StandardScaler to both datasets
PredictorScaler = StandardScaler()
TargetVarScaler = StandardScaler()

# Storing the fit object for later reference
PredictorScalerFit = PredictorScaler.fit(X)
TargetVarScalerFit = TargetVarScaler.fit(y)
 
# Generating the standardized values of X and y
X = PredictorScalerFit.transform(X)
y = TargetVarScalerFit.transform(y)
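
As a minimal sketch of what the scaler does, the cell below standardizes a single column by hand and then reverses the operation. The numbers are hypothetical and are not taken from `CarPricesData.pkl`; the sketch assumes the population standard deviation (`ddof=0`), which is what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical feature column (e.g. car age in months); illustration only
x = np.array([[23.0], [24.0], [26.0], [30.0], [32.0]])

mean = x.mean(axis=0)
std = x.std(axis=0)       # population standard deviation (ddof=0)

z = (x - mean) / std      # what transform() computes
x_back = z * std + mean   # what inverse_transform() computes

print(z.ravel())
print(np.allclose(x_back, x))
```

After the transformation the column has mean 0 and standard deviation 1, and the reverse step recovers the original values, which is what makes it possible to report predictions in the original price units.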

Split the data into training and testing sets, using the `train_test_split` function from scikit-learn.

In [None]:
# Split data into 70% training and 30% testing 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=2023)
 
# Quick sanity check with the shapes of Training and testing datasets
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

### Build ANN

Hyperparameters as per exercise notes:  
- units=5: The layer has five neurons. Every neuron receives all input values, e.g. the "Age" values are passed to all five neurons, as are the values of every other column.
- input_dim=7: The first layer expects seven predictors in the input. The second dense layer does not need this value, because the sequential model infers it from the output of the previous layer.
- kernel_initializer='normal': Sets the algorithm that chooses the initial value of each weight before training starts. Possible values include "normal" and "glorot_uniform".
- activation='relu': Specifies the activation function applied to each neuron's output. Possible values include "relu", "tanh" and "sigmoid".


Further parameters for tuning the model:
- batch_size=20: Specifies how many rows are fed to the network at one time. After each batch the loss (MSE) is computed and the network adjusts its weights based on the errors.
- epochs=10: The number of passes over the full training data. To put it simply, the ANN looks at the full training data 10 times, adjusting its weights after every batch.
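
To make the interplay of the two parameters concrete, the sketch below counts the weight updates for a batch size of 20 and 10 epochs, the values passed to `model.fit` in this notebook. The number of training rows (`n_train`) is a hypothetical figure chosen for illustration.

```python
import math

n_train = 1000    # hypothetical number of training rows
batch_size = 20   # rows processed before each weight update
epochs = 10       # full passes over the training data

# The last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch = math.ceil(n_train / batch_size)
total_updates = steps_per_epoch * epochs

print(steps_per_epoch, total_updates)  # 50 500
```

A smaller batch size therefore means more, but noisier, weight updates per epoch.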

In [None]:
# create ANN model
model = Sequential()

# Building the layers of the net
# Defining the first hidden layer; input_dim=7 also declares the input shape, so no separate input layer is added
model.add(Dense(units=5, input_dim=7, kernel_initializer='normal', activation='relu'))

# Defining the second layer of the model
# after the first layer input_dim is not needed, as Keras infers it automatically
model.add(Dense(units=5, kernel_initializer='normal', activation='relu'))

# The output neuron is a single fully connected node 
# Since we will be predicting a single number
model.add(Dense(1))

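# Compiling the model
# 'accuracy' is a classification metric and is not meaningful for a regression target, so MAE is tracked instead
model.compile(optimizer='adam', loss='mse', metrics=['mae'])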

# model info
model.summary()
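
The parameter counts reported by `model.summary()` can be checked by hand. The sketch below assumes the 7-5-5-1 architecture defined above, where each dense layer has one weight per input-neuron pair plus one bias per neuron; `dense_params` is a helper introduced here for illustration only.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a dense layer: weights plus one bias per unit."""
    return n_inputs * n_units + n_units

# 7 predictors -> 5 neurons -> 5 neurons -> 1 output
layer_sizes = [7, 5, 5, 1]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

print(per_layer, sum(per_layer))  # [40, 30, 6] 76
```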

### Training the model

In [None]:
# Fitting the ANN to the Training set
history = model.fit(
    X_train, y_train,
    epochs=10,
    batch_size=20,
    validation_data=(X_test, y_test),
    verbose=1,
)

### Visualising the training progress

In practice, when training a model using frameworks like Keras or similar machine learning libraries, there are typically two datasets: the training set (_train_) and the validation set (_val_ or _validation_). Metric values, such as accuracy (_accuracy_) or the loss function (_loss_), are monitored on both the training and validation sets.  

In the code, `val_` is a prefix used to refer to the metric values on the validation set (here, the _test_ set) during the training of the model. In this context, if `key1` is, for example, 'accuracy', then `'val_' + key1` denotes the accuracy on the validation set.  

Therefore, `history.history['val_' + key1]` refers to the accuracy values on the validation set, and `history.history[key1]` refers to the accuracy values on the training set. For a regression model such as this one, the loss is the informative curve, which is why only the loss is plotted below.
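
The key naming can be illustrated without training anything. In the sketch below, `history_dict` is a hypothetical stand-in for `history.history` with made-up loss values for three epochs.

```python
# Hypothetical stand-in for history.history after three epochs (values are made up)
history_dict = {
    "loss": [0.90, 0.50, 0.30],
    "val_loss": [1.00, 0.60, 0.35],
}

key = "loss"
train_values = history_dict[key]          # metric on the training set
val_values = history_dict["val_" + key]   # same metric on the validation set

for epoch, (train, val) in enumerate(zip(train_values, val_values), start=1):
    print(f"epoch {epoch}: {key}={train:.2f}, val_{key}={val:.2f}")

# Epoch (1-based) with the lowest validation loss
best_epoch = val_values.index(min(val_values)) + 1
print(best_epoch)
```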

In [None]:
def draw_curves(history, key1='accuracy', ylim1=(0.0, 1.0), key2='loss', ylim2=(0.0, 0.5)):
    
    plt.figure(figsize=(12,4))
    
    # plt.subplot(1, 2, 1)
    # plt.plot(history.history[key1], "r--")
    # plt.plot(history.history['val_' + key1], "g--")
    # plt.ylabel(key1)
    # plt.xlabel('Epoch')
    # plt.ylim(ylim1)
    # plt.legend(['train', 'test'], loc='best')
    
    # plt.subplot(1, 2, 2)
    plt.plot(history.history[key2], "r--")
    plt.plot(history.history['val_' + key2], "g--")
    plt.ylabel(key2)
    plt.xlabel('Epoch')
    plt.ylim(ylim2)    
    plt.legend(['train', 'test'], loc='best')
    
    plt.show()

In [None]:
draw_curves(history, key1='accuracy', ylim1=(-0.1, 1.2), key2='loss', ylim2=(0.0, 1.2))

## Find the best parameters for the ANN
### Training the model

The same two training parameters are now varied in a small grid search. Every combination is trained from scratch and scored on the test set:
- batch_size: how many rows are fed to the network before its weights are adjusted; the values tried are 5 and 10.
- epochs: how many times the ANN looks at the full training data; the values tried are 5 and 10.

In [None]:
def FindBestParams(X_train, y_train, X_test, y_test):
    """ 
    Function finds the best parameters for training the ANN.
    """    
    # Defining the list of hyper parameters to try
    batch_size_list = [5, 10]#, 15, 20]
    epoch_list = [5, 10]#, 30, 60]
    
    SearchResultsData = pd.DataFrame(columns=['TrialNumber', 'Parameters', 'MAPE', 'Accuracy'])
    
    # initializing the trials
    TrialNumber=0
    for batch_size_trial in batch_size_list:
        for epochs_trial in epoch_list:
            print()
            TrialNumber+=1
            
            # create ANN model
            model = Sequential()
            # Defining the first layer of the model
            model.add(Dense(units=5, input_dim=7, kernel_initializer='normal', activation='relu'))
            # Defining the Second layer of the model
            model.add(Dense(units=5, kernel_initializer='normal', activation='relu'))
            # The output neuron 
            model.add(Dense(1))

            # Compiling the model ('accuracy' is not meaningful for regression, so MAE is tracked)
            model.compile(optimizer='adam', loss='mse', metrics=['mae'])
            
            # Fitting the ANN to the Training set
            history = model.fit(
                X_train, y_train,
                epochs=epochs_trial,
                batch_size=batch_size_trial,
                validation_data=(X_test, y_test),
                verbose=0,
                )

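            # MAPE is computed on the original price scale: standardized targets are
            # centred on zero, so dividing by them gives meaningless percentages.
            # TargetVarScalerFit is the scaler fitted earlier in the notebook.
            y_test_orig = TargetVarScalerFit.inverse_transform(y_test)
            y_pred_orig = TargetVarScalerFit.inverse_transform(model.predict(X_test))
            MAPE = np.mean(100 * np.abs((y_test_orig - y_pred_orig) / y_test_orig))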
           
            # printing the results of the current iteration
            # print(f"Trial {TrialNumber}, MAPE: {MAPE}")

            trial_result = pd.DataFrame(
                data=[[TrialNumber, str(batch_size_trial)+'-'+str(epochs_trial), MAPE, 100-MAPE]],
                              columns=['TrialNumber', 'Parameters', 'MAPE', 'Accuracy'])
            print(trial_result)

            SearchResultsData = pd.concat([SearchResultsData, trial_result], ignore_index=True)
            # SearchResultsData = SearchResultsData.append(trial_result) # removed in pandas 2.0
    
    return SearchResultsData

In [None]:
# Calling the function
ResultsData = FindBestParams(X_train, y_train, X_test, y_test)

In [None]:
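# Best accuracy (cast to float in case the column was stored with object dtype)
best_idx = ResultsData['Accuracy'].astype(float).idxmax()
max_accuracy = ResultsData.loc[best_idx]
print(f"Best accuracy found: {max_accuracy['Accuracy']:.3f}, with corresponding batch size and number of epochs: {max_accuracy['Parameters']}.")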

___
## Linear regression model

For comparison, a linear regression model is used.

For this model, the same dataset and subsets are used (X_train, X_test, y_train, y_test).

In [None]:
# Quick sanity check with the shapes of Training and testing datasets
print(f"Train inputs {X_train.shape}, train outputs {y_train.shape}")
print(f"Test inputs {X_test.shape}, test outputs {y_test.shape}")

In [None]:
# create an instance of a LinearRegression() model named lin_reg_model.
lin_reg_model = LinearRegression()

#Train/fit lin_reg_model on the training data.
lin_reg_model.fit(X_train, y_train)

In [None]:
# calculate predictions
predictions = lin_reg_model.predict(X_test)

In [None]:
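# accuracy of the linear regression model, computed on the original price scale
# (standardized targets are centred on zero, which makes a percentage error meaningless)
y_test_orig = TargetVarScalerFit.inverse_transform(y_test)
predictions_orig = TargetVarScalerFit.inverse_transform(predictions)
MAPE_lr = np.mean(100 * np.abs((y_test_orig - predictions_orig) / y_test_orig))

print(f"Accuracy: {100 - MAPE_lr:.3f}")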

## Findings

Both models were compared using the Mean Absolute Percentage Error (MAPE) metric.  
The formula for MAPE is:

$$ MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{A_i - F_i}{A_i} \right| \times 100\% $$

Where:  
$ n $ is the number of observations; 
$ A_i $ is the actual value; 
$ F_i $ is the forecasted (predicted) value.

Model accuracy is expressed as: $ 100\% - MAPE $
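
A worked example of the formula, using three hypothetical prices that are not taken from the dataset. The sketch assumes values on the original price scale: standardized targets are centred on zero, which makes the ratio $ (A_i - F_i) / A_i $ unstable.

```python
# Hypothetical actual and predicted prices (illustration only)
actual = [10000.0, 12500.0, 8000.0]
predicted = [9000.0, 13000.0, 8400.0]

# Absolute percentage error per observation: 10%, 4% and 5%
ape = [abs((a - f) / a) * 100 for a, f in zip(actual, predicted)]

mape = sum(ape) / len(ape)
accuracy = 100 - mape

print(f"MAPE: {mape:.3f}%, accuracy: {accuracy:.3f}%")  # MAPE: 6.333%, accuracy: 93.667%
```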

The accuracies obtained for both models are:

In [None]:
print(f"ANN accuracy:\t {max_accuracy['Accuracy']:.3f}%")
print(f"LR accuracy:\t {100 - MAPE_lr:.3f}%")

References:
- <https://www.analyticsvidhya.com/blog/2021/07/car-price-prediction-machine-learning-vs-deep-learning/>
- <https://www.kaggle.com/code/karan842/car-price-prediction-neural-network>
- <https://github.com/andkoc001/Machine-Learning-and-Statistics-Project>

___
Andrzej Kocielski, 2023