<a href="https://colab.research.google.com/github/anhtpn/AI-Course/blob/main/Regression_DNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Regression with a Deep Neural Network
In this lab, we focus on the architecture and implementation of a deep neural network (DNN). We build a model, explore several configurations of it, and use it to predict car prices from the data set.

## Data Preprocessing
In this section, we download the data set, convert its samples into numeric feature vectors, and prepare three subsets: training, validation, and test sets.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Make NumPy printouts easier to read.
np.set_printoptions(precision=3, suppress=True)

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Normalization, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import plot_model

print(tf.__version__)

First import the dataset using *pandas*:

In [None]:
dataset = pd.read_csv('CarPrice_Assignment.csv')

In [None]:
dataset.head()

In [None]:
dataset.info()

In [None]:
# Count fully duplicated rows; .all() would only be True if every row were a duplicate.
dataset.duplicated().sum()

In [None]:
dataset = dataset.drop('car_ID',axis=1)

In [None]:
dataset.shape

In [None]:
def car_name(x):
    # Keep only the first word of the name (the manufacturer).
    carname = x.split(' ')[0]
    return carname

In [None]:
dataset['CarName'] = dataset['CarName'].apply(car_name)
dataset['CarName'].unique()
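
To make the effect of `car_name` concrete, here is a small sketch on a few made-up strings. The sample names are illustrative assumptions, not rows quoted from the data set, and the sketch assumes each `CarName` value has the form "manufacturer model".

```python
def car_name(x):
    # Keep only the text before the first space.
    return x.split(' ')[0]

# Hypothetical sample values, for illustration only.
samples = ["audi 100 ls", "bmw x3", "volvo"]
brands = [car_name(s) for s in samples]
print(brands)  # ['audi', 'bmw', 'volvo']
```

A name without a space is returned unchanged. Differently spelled or differently cased variants of the same manufacturer would stay distinct, which is why inspecting `unique()` afterwards is worthwhile.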

Label-encode the values in the categorical columns:

In [None]:
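# Collect the columns stored as strings (dtype 'object').
qualitative = [f for f in dataset.columns if dataset.dtypes[f] == 'object']
le = LabelEncoder()
for col in qualitative:
    # The encoder is refit per column, so `le` only remembers the last column's classes.
    dataset[col] = le.fit_transform(dataset[col])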
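
The idea behind label encoding can be sketched in plain Python: each distinct string is replaced by its index in the sorted list of distinct values. `label_encode` below is a hypothetical helper written for illustration, not scikit-learn's implementation, and the column values are made up.

```python
def label_encode(values):
    # Sort the distinct values so the mapping is deterministic.
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[v] for v in values], classes

# Hypothetical column values, for illustration only.
fuel = ["gas", "diesel", "gas", "gas", "diesel"]
codes, classes = label_encode(fuel)
print(classes)  # ['diesel', 'gas']
print(codes)    # [1, 0, 1, 1, 0]
```

Note that integer codes imply an ordering that the categories may not have. One-hot encoding is a common alternative when that ordering could mislead the model.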

In [None]:
y = dataset['price']
X = dataset.drop(['price'],axis=1)

## Split the data
After preprocessing, we divide the samples into three subsets:

- Training set: used to calculate gradients and update weights.
- Validation set: used to monitor network performance during training and to trigger early termination.
- Test set: used to evaluate network performance on unseen samples (samples the network has never seen).

We first hold out 20% of the samples as the test set. To keep things simple, the validation set is then taken as 20% of the training set when fitting the model. Note that `train_test_split` draws samples randomly rather than as a consecutive block.
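
The resulting proportions can be checked with a small sketch. `three_way_split` is a hypothetical helper that shuffles indices and slices them. It only illustrates the idea and is not how scikit-learn or Keras compute their splits internally. The sample count of 100 is an assumption.

```python
import random

def three_way_split(n_samples, test_frac=0.2, val_frac=0.2, seed=1):
    # Shuffle the indices so the subsets are random, not consecutive blocks.
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Hold out the test set first.
    n_test = round(n_samples * test_frac)
    test_idx = indices[:n_test]
    remaining = indices[n_test:]
    # Take the validation set from what is left.
    n_val = round(len(remaining) * val_frac)
    val_idx = remaining[:n_val]
    train_idx = remaining[n_val:]
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
print(len(train_idx), len(val_idx), len(test_idx))  # 64 16 20
```

Taking 20% for testing and then 20% of the remainder for validation leaves 64% of the samples for training.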

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [None]:
print(X_train.shape, y_train.shape)

## The Normalization layer
The `Normalization` layer is a clean and simple way to add feature normalization to your model.

The first step is to create the layer:

In [None]:
normalizer = Normalization(axis=-1)

Then, fit the state of the preprocessing layer to the data by calling Normalization.adapt:

In [None]:
normalizer.adapt(np.array(X_train))
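
Conceptually, adapting the layer stores a mean and a variance per feature, and applying it standardizes each feature with those statistics. The sketch below reproduces that idea in plain Python. `fit_stats` and `normalize` are hypothetical helpers, and the rows are made-up values.

```python
import math

def fit_stats(rows):
    # Per-feature mean and (population) variance over the training rows.
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    variances = [sum((v - m) ** 2 for v in col) / len(col)
                 for col, m in zip(columns, means)]
    return means, variances

def normalize(rows, means, variances):
    # Standardize each feature: (x - mean) / sqrt(variance).
    return [[(v - m) / math.sqrt(var)
             for v, m, var in zip(row, means, variances)]
            for row in rows]

# Hypothetical training rows with two features on very different scales.
train_rows = [[1.0, 1000.0], [2.0, 3000.0], [3.0, 5000.0]]
means, variances = fit_stats(train_rows)
scaled = normalize(train_rows, means, variances)
print(means)      # [2.0, 3000.0]
print(scaled[1])  # [0.0, 0.0]
```

The statistics come from the training rows only, so validation and test samples are scaled with the same values. This sketch does not guard against a feature with zero variance, which a production implementation would need to handle.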

## Regression using a DNN

In [None]:
...
def build_and_compile_model():
  model = Sequential()
  # The adapted Normalization layer comes first, so it defines the input.
  model.add(normalizer)
  model.add(Dense(64, activation='relu'))
  model.add(Dropout(0.5))
  model.add(Dense(64, activation='relu'))
  model.add(Dropout(0.5))
  # Single linear output unit for the predicted price.
  model.add(Dense(1))
  model.compile(loss='mean_squared_error',
                optimizer=Adam(learning_rate=0.1))
  return model

In [None]:
dnn_model = build_and_compile_model()

In [None]:
dnn_model.summary()
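
The trainable parameter counts reported by the summary can be checked by hand: a `Dense` layer has one weight per input-unit pair plus one bias per unit. The sketch below assumes 24 input features purely for illustration. In the notebook the actual value is `X_train.shape[1]`.

```python
def dense_params(n_inputs, n_units):
    # One weight per (input, unit) pair plus one bias per unit.
    return n_inputs * n_units + n_units

n_features = 24            # assumption for illustration
layer_sizes = [64, 64, 1]  # units of the three Dense layers in the model

counts = []
n_in = n_features
for units in layer_sizes:
    counts.append(dense_params(n_in, units))
    n_in = units

print(counts)       # [1600, 4160, 65]
print(sum(counts))  # 5825
```

`Dropout` layers add no parameters. The statistics stored by the `Normalization` layer are not trained, so the summary lists them separately from the trainable weights.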

## Callbacks: Early Stopping, Model Checkpoint
How many epochs should we train the network to obtain "good" performance: 10, 20, 30, or more? \
--> Apply early stopping: track the performance on the validation set during training, and stop when it no longer improves.

Moreover, we should save the model when it reaches its best validation performance during training. Otherwise, we keep only the latest weights (after the last epoch), which may not be the best ones.

In [None]:
early_stopping_cb = EarlyStopping(
    monitor='val_loss', # track the validation loss
    patience=3, # stop if val_loss has not improved for 3 consecutive epochs
    verbose=1)

model_checkpoint_cb = ModelCheckpoint(
    'model_checkpoint',
    monitor='val_loss', # track val_loss
    verbose=1,
    save_best_only=True,    # overwrite saved model, keep only the best one
    )
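
The patience rule can be illustrated with a short simulation. `simulate_early_stopping` is a hypothetical helper that mimics the logic on a made-up list of validation losses. It is a sketch of the idea, not the Keras implementation, and epochs are counted from 0.

```python
def simulate_early_stopping(val_losses, patience=3):
    best_loss = float('inf')
    best_epoch = None
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember it (a checkpoint would be saved here).
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch  # training stops at this epoch
    return None, best_epoch  # never stopped early

# Hypothetical validation losses, one per epoch.
val_losses = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 6.9]
stopped_epoch, best_epoch = simulate_early_stopping(val_losses)
print(stopped_epoch, best_epoch)  # 5 2
```

Training stops at epoch 5, three epochs after the best value at epoch 2, so the lower loss at epoch 6 is never reached. A larger `patience` tolerates such plateaus at the cost of longer training.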

In [None]:
history = dnn_model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    verbose=1, epochs=100, batch_size = 2,
    callbacks=[early_stopping_cb, model_checkpoint_cb])

In [None]:
def plot_loss(history):
  plt.plot(history.history['loss'], label='loss')
  plt.plot(history.history['val_loss'], label='val_loss')
  plt.xlabel('Epoch')
  plt.ylabel('MSE [price^2]')
  plt.legend()
  plt.grid(True)

In [None]:
plot_loss(history)

## Evaluation phase
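Now, evaluate the trained network on the test set. Note that `dnn_model` holds the weights from the last training epoch, which may differ from the best checkpoint saved during training.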

In [None]:
dnn_model.evaluate(X_test, y_test, verbose=0)
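
Because the model is compiled with `mean_squared_error`, the value returned by `evaluate` is in squared price units. Taking its square root gives the root mean squared error (RMSE), which is easier to interpret. The sketch below works through the arithmetic on made-up prices. `mean_squared_error` here is a hypothetical plain-Python helper, not the Keras loss object.

```python
import math

def mean_squared_error(y_true, y_pred):
    # Average of the squared differences between targets and predictions.
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical prices and predictions, for illustration only.
y_true = [10000.0, 15000.0, 20000.0]
y_pred = [11000.0, 14000.0, 22000.0]

mse = mean_squared_error(y_true, y_pred)
rmse = math.sqrt(mse)
print(mse)             # 2000000.0
print(round(rmse, 2))  # 1414.21
```

An RMSE of about 1414 reads as a typical error of roughly that many price units, which is easier to judge than the raw MSE.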

## Make predictions
You can now make predictions on the test set with `dnn_model` using Keras `Model.predict` and compare them against the true prices:

In [None]:
test_predictions = dnn_model.predict(X_test).flatten()

a = plt.axes(aspect='equal')
plt.scatter(y_test, test_predictions)
plt.xlabel('True Values [price]')
plt.ylabel('Predictions [price]')
