<div class="alert alert-block alert-info">

**TODO:**
* make sure this notebook can be worked through in 1:00
</div>

# Deep Regression

Reminder: We are doing supervised learning. Because our labels/targets are real values, this is a regression task.

Data and goal: In this notebook we read the zip code data produced by **02_vector_preparations** and create one deep learning model for
predicting the average zip code income from population and spatial features. We will try to tune the hyperparameters and finally assess the model's performance metrics on a previously unseen test dataset.

Contents of this notebook:

0. Prepare environment
1. Reading the data
2. Check for GPUs
3. Model definition
4. Task
5. Performance
6. Comparison to shallow models


## 0. Prepare environment

In [None]:
import os
import pandas as pd
from math import sqrt
# plotting loss
import matplotlib.pyplot as plt
# error metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# deep learning tools
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import RMSprop


## 1. Reading the data
### 1.1 Define input and output file paths 

In [None]:
username = os.environ.get('USER')
base_directory= f'/scratch/project_2002044/{username}/2022/GeoML'

data_directory = os.path.join(base_directory,'data')

#inputs
preprocessed_data_directory = os.path.join(data_directory,'preprocessed_regression')
train_dataset_name = os.path.join(preprocessed_data_directory,'scaled_train_zip_code_data.csv')
test_dataset_name = os.path.join(preprocessed_data_directory,'scaled_test_zip_code_data.csv')
val_dataset_name = os.path.join(preprocessed_data_directory,'scaled_val_zip_code_data.csv')
train_label_name = os.path.join(preprocessed_data_directory,'train_income_labels.pkl')
test_label_name = os.path.join(preprocessed_data_directory,'test_income_labels.pkl')
val_label_name = os.path.join(preprocessed_data_directory,'val_income_labels.pkl')

# outputs
results_directory = os.path.join(data_directory,'regression_results')
metrics_filename = os.path.join(results_directory,'shallow_metrics.csv')

In [None]:
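# for reproducible results when randomness is involved, we can set a random seed
random_seed = 42
# apply the seed to TensorFlow's global random generator (e.g. used for weight initialization)
tensorflow.random.set_seed(random_seed)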

In [None]:
# read train, validation and test datasets
x_train = pd.read_csv(train_dataset_name)
x_val = pd.read_csv(val_dataset_name)
y_train = pd.read_pickle(train_label_name)
y_val = pd.read_pickle(val_label_name)
x_test = pd.read_csv(test_dataset_name)
y_test = pd.read_pickle(test_label_name)
num_of_x_columns = x_train.shape[1]

## 2. Check for GPUs

In this part of the course we do not yet need GPUs: the dataset and models used are small enough that training is fast on a CPU as well.

In [None]:
def checkGPUavailability():
    device = tensorflow.config.list_physical_devices('GPU')
    if device:
        print("We have a GPU available!")
    else:
        print("Sadly no GPU available. :( You have to settle for a CPU. Good luck!")

checkGPUavailability()

## 3. Model definition

We will now build a model (a linear stack of layers) from scratch using [keras Sequential](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential).
We start with a model with an input layer, two hidden layers and an output layer.
The input shape is defined by the number of features available.
The number of layers and the number of perceptrons/neurons/nodes per layer can be chosen freely. We will adjust them later.
For the hidden layers we choose the [ReLU - rectified linear unit](https://www.tensorflow.org/api_docs/python/tf/keras/activations/relu) activation function.
The last layer is the output layer. Since we are doing regression, it has a single node, so the model outputs one value per set of features: the average income of the zip code.
To [compile](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#compile) the model, we set the optimizer to the default [RMSprop](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/RMSprop) with its default learning rate of 0.001. We use the mean squared error (MSE) as the loss. In addition, we ask the model to report the mean absolute error (MAE) and the MSE during training and evaluation.
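
To get a feel for the size of such a network, the number of trainable parameters can be worked out by hand: each dense layer has one weight per input-output pair plus one bias per node. The sketch below assumes a hypothetical 10 input features (the real number depends on the data and is stored in `num_of_x_columns`); `dense_params` is an illustrative helper, not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_inputs * n_units + n_units

# ASSUMPTION: 10 input features, for illustration only
n_features = 10
layer_units = [64, 64, 1]  # two hidden layers and the output layer

total = 0
n_inputs = n_features
for units in layer_units:
    params = dense_params(n_inputs, units)
    print(f"Dense({units}): {params} parameters")
    total += params
    n_inputs = units  # the output of this layer feeds the next one

print(f"Total trainable parameters: {total}")
```

The per-layer numbers should match what the model summary reports once the assumed feature count is replaced with the real one.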

Then we [fit](https://www.tensorflow.org/api_docs/python/tf/keras/Sequential#fit) the model to the training data. The number of epochs is set to 100; one epoch is one pass over the entire training data. We shuffle the training data before each epoch (`shuffle=True`) and set the batch size to 64, which is the number of samples passed through the model before each weight update. We also pass the validation dataset via `validation_data`. We store the return value of the fit in the variable `history` to visualize the training progress.
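
The relation between epochs, batch size and the number of weight updates can be checked with a little arithmetic. This is a minimal sketch assuming a hypothetical training set of 2000 samples (the real size is the number of rows in `x_train`); `updates_per_epoch` is an illustrative helper name.

```python
import math

def updates_per_epoch(n_samples, batch_size):
    """Number of batches (weight updates) needed to see every sample once."""
    return math.ceil(n_samples / batch_size)

# ASSUMPTION: 2000 training samples, for illustration only
n_samples = 2000
batch_size = 64
epochs = 100

steps = updates_per_epoch(n_samples, batch_size)
print(f"{steps} updates per epoch, {steps * epochs} updates in total")
```

A smaller batch size means more, but noisier, updates per epoch; a larger one means fewer updates that are each averaged over more samples.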

In [None]:
# Initialize a Sequential model
model = Sequential()

# Add first layer with 64 perceptrons. Activation function is relu
model.add(Dense(64, activation='relu', input_shape=(num_of_x_columns,)))

# Add another layer with 64 perceptrons
model.add(Dense(64, activation='relu'))

# The last layer has to have only 1 perceptron as it is the output layer
model.add(Dense(1))

# Setting optimizer and loss functions. Learning rate set to 0.001
model.compile(optimizer=RMSprop(learning_rate=.001), loss='mse', metrics=['mae','mse'])
model.summary()

# Train the network for 100 epochs with a batch size of 64; keep the returned history for the loss plot
history = model.fit(x_train, y_train, epochs=100, shuffle=True, batch_size=64, verbose=1, validation_data=(x_val, y_val))

In [None]:
# define function to plot the loss for visualization of above training progress
def plot_loss(history):
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    #plt.ylim([0, 10])
    plt.xlabel('Epoch')
    plt.ylabel('Loss (MSE)')
    plt.legend()
    plt.grid(True)

plot_loss(history)

## 4. Task


Can you improve on the performance metrics?
You can add layers to the model or change the number of epochs, the number of neurons per layer or the batch size. What do you observe?
Also compare to your results from the shallow regression exercise.
Report the best results (on the test set below) and note the parameters used, so that others can reproduce the results.
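
One possible way to keep track of the parameters is to collect them in a dictionary and print them as text. The sketch below uses the settings of the starting model from section 3; the key names are illustrative assumptions, so adapt them to whatever you actually change.

```python
import json

# ASSUMPTION: key names are placeholders; values are those of the starting model
run_settings = {
    "hidden_layers": [64, 64],
    "activation": "relu",
    "optimizer": "RMSprop",
    "learning_rate": 0.001,
    "epochs": 100,
    "batch_size": 64,
    "random_seed": 42,
}

# serialize to text so it can be pasted into a report or written to a file
settings_text = json.dumps(run_settings, indent=2)
print(settings_text)
```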

## 5. Performance

After tuning the model based on the validation data, we can use the test set to report the performance metrics.
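
As a reminder of what the three metrics measure, they can be computed by hand on a few numbers. This is a minimal sketch with made-up targets and predictions (not course data); all `toy_` names are illustrative.

```python
from math import sqrt

# ASSUMPTION: made-up targets and predictions, for illustration only
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.0, 5.0, 8.0, 11.0]

n = len(toy_true)
errors = [t - p for t, p in zip(toy_true, toy_pred)]

# mean absolute error: average size of the errors
toy_mae = sum(abs(e) for e in errors) / n

# root mean squared error: penalizes large errors more strongly
toy_rmse = sqrt(sum(e ** 2 for e in errors) / n)

# coefficient of determination: 1 minus residual variation over total variation
mean_true = sum(toy_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in toy_true)
toy_r2 = 1 - ss_res / ss_tot

print(f"MAE: {toy_mae:.3f}, RMSE: {toy_rmse:.3f}, R2: {toy_r2:.3f}")
```

RMSE and MAE are in the units of the target, so lower is better; for the coefficient of determination, values closer to 1 are better.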

In [None]:
#Evaluating the performance of the model using previously unseen test dataset
prediction = model.predict(x_test)
r2 = r2_score(y_test, prediction)
rmse = sqrt(mean_squared_error(y_test, prediction))
mae = mean_absolute_error(y_test, prediction)

print("\nMODEL ACCURACY METRICS WITH TEST DATASET: \n" +
        "\t Root mean squared error: "+ str(rmse) + "\n" +
        "\t Mean absolute error: " + str(mae) + "\n" +
        "\t Coefficient of determination: " + str(r2) + "\n")


## 6. Comparison to shallow models and baseline 

Now we can print the performance metrics of our models from the shallow regression exercises again and see how the deep model performs in comparison.

In [None]:

shallow_metrics = pd.read_csv(metrics_filename)

print(shallow_metrics.sort_values(by=['RMSE'], ascending=False))