<div class="alert alert-block alert-info">

**TODO:**

* have it without outputs on Github
* more description to model building
* reminder on parameters 
* explain numbers in model generation
* add links where helpful
* make sure to make it possible to go through this in 1:00
</div>

# Deep Regression

Reminder: this is supervised learning, since every sample comes with a label (target). Because the targets are real values, the task is a regression.

Data and goal: In this notebook we read the zip code data produced by **02_vector_preparations** and create a deep learning model for predicting the average income of a zip code area from population and spatial features. We will try to tune the hyperparameters and finally assess the model's performance metrics on a previously unseen test dataset.

Contents of this notebook:

0. Prepare environment
1. Reading the data
2. Check for GPUs
3. Model definition
4. Performance
5. Comparison to shallow models and baseline
6. Task




## 0. Prepare environment

In [1]:
import os
import pandas as pd
from math import sqrt
# plotting the loss
import matplotlib.pyplot as plt
# error metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
# deep learning tools
import tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.optimizers import RMSprop



## 1. Reading the data
### 1.1 Define input and output file paths 

In [3]:
username = os.environ.get('USER')
base_directory = f'/scratch/project_2002044/{username}/2022/GeoML'

data_directory = os.path.join(base_directory, 'data')

# inputs
preprocessed_data_directory = os.path.join(data_directory,'preprocessed_regression')
train_dataset_name = os.path.join(preprocessed_data_directory,'scaled_train_zip_code_data.csv')
test_dataset_name = os.path.join(preprocessed_data_directory,'scaled_test_zip_code_data.csv')
val_dataset_name = os.path.join(preprocessed_data_directory,'scaled_val_zip_code_data.csv')
train_label_name = os.path.join(preprocessed_data_directory,'train_income_labels.pkl')
test_label_name = os.path.join(preprocessed_data_directory,'test_income_labels.pkl')
val_label_name = os.path.join(preprocessed_data_directory,'val_income_labels.pkl')

# metrics of the earlier shallow regression exercises (read again in section 5)
results_directory = os.path.join(data_directory, 'regression_results')
metrics_filename = os.path.join(results_directory, 'shallow_metrics.csv')

In [None]:
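# for reproducible results when randomness is involved, we can set a random seed
random_seed = 42
# apply the seed to TensorFlow (e.g. weight initialization); results can still vary slightly between machines
tensorflow.random.set_seed(random_seed)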

In [None]:
# read train, validation and test datasets
x_train = pd.read_csv(train_dataset_name)
x_val = pd.read_csv(val_dataset_name)
y_train = pd.read_pickle(train_label_name)
y_val = pd.read_pickle(val_label_name)
x_test = pd.read_csv(test_dataset_name)
y_test = pd.read_pickle(test_label_name)
# number of feature columns, i.e. the input size of the first layer
num_of_x_columns = x_train.shape[1]

## 2. Check for GPUs

In this part of the course we do not yet need GPUs: the dataset and the models are small enough that training is fast on a CPU as well.

In [4]:
def checkGPUavailability():
    device = tensorflow.config.list_physical_devices('GPU')
    if device:
        print("We have a GPU available!")
    else:
        print("Sadly no GPU available. :( You have to settle for a CPU. Good luck!")

checkGPUavailability()

Sadly no GPU available. :( You have to settle for a CPU. Good luck!




## 3. Model definition

In [7]:
# Initialize a Sequential model (a plain stack of layers)
model = Sequential()

# First hidden layer: 64 units with ReLU activation; input_shape is the number of features
model.add(Dense(64, activation='relu', input_shape=(num_of_x_columns,)))

# Second hidden layer: another 64 units with ReLU activation
model.add(Dense(64, activation='relu'))

# Output layer: a single unit without activation (linear), since we predict one real value
model.add(Dense(1))

# Set the optimizer (RMSprop, learning rate 0.001) and the loss function (mean squared error);
# MAE and MSE are additionally reported as metrics during training
model.compile(optimizer=RMSprop(learning_rate=0.001), loss='mse', metrics=['mae', 'mse'])
# summary() prints the table itself and returns None, so it is not wrapped in print()
model.summary()

# Train the network for 1000 epochs with a batch size of 64; the returned History object
# stores the per-epoch losses for the loss plot
history = model.fit(x_train, y_train, epochs=1000, shuffle=True, batch_size=64, verbose=2, validation_data=(x_val, y_val))

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense (Dense)               (None, 64)                7552      
                                                                 
 dense_1 (Dense)             (None, 64)                4160      
                                                                 
 dense_2 (Dense)             (None, 1)                 65        
                                                                 
Total params: 11,777
Trainable params: 11,777
Non-trainable params: 0
_________________________________________________________________
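The parameter counts in the summary follow directly from the layer sizes: a Dense layer with n inputs and m units has n * m weights plus m biases. The sketch below reproduces the numbers. The input width of 117 features is not stated in the notebook; it is inferred from 7552 = (117 + 1) * 64, so treat it as an assumption. `dense_param_count` is a hypothetical helper, not a Keras function.

```python
def dense_param_count(n_inputs: int, n_units: int) -> int:
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

# Assumption: 117 input features, inferred from the 7552 parameters of the first layer
n_features = 117
layer_units = [64, 64, 1]

counts = []
n_inputs = n_features
for units in layer_units:
    counts.append(dense_param_count(n_inputs, units))
    n_inputs = units  # the output of this layer feeds the next one

print(counts)       # [7552, 4160, 65]
print(sum(counts))  # 11777
```

All 11,777 parameters are trainable because the model contains only Dense layers.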


Epoch 1/1000
28/28 - 1s - loss: 477737824.0000 - mae: 21596.5879 - mse: 477737824.0000 - 873ms/epoch - 31ms/step
Epoch 2/1000
28/28 - 0s - loss: 476102144.0000 - mae: 21558.9121 - mse: 476102144.0000 - 42ms/epoch - 1ms/step
Epoch 3/1000
28/28 - 0s - loss: 472733600.0000 - mae: 21480.7090 - mse: 472733600.0000 - 42ms/epoch - 1ms/step
Epoch 4/1000
28/28 - 0s - loss: 467339136.0000 - mae: 21354.8359 - mse: 467339136.0000 - 41ms/epoch - 1ms/step
Epoch 5/1000
28/28 - 0s - loss: 459707520.0000 - mae: 21174.0547 - mse: 459707520.0000 - 41ms/epoch - 1ms/step
Epoch 6/1000
28/28 - 0s - loss: 449505760.0000 - mae: 20927.1992 - mse: 449505760.0000 - 41ms/epoch - 1ms/step
Epoch 7/1000
28/28 - 0s - loss: 436757664.0000 - mae: 20612.2598 - mse: 436757664.0000 - 41ms/epoch - 1ms/step
Epoch 8/1000
28/28 - 0s - loss: 421312384.0000 - mae: 20216.5430 - mse: 421312384.0000 - 41ms/epoch - 1ms/step
Epoch 9/1000
28/28 - 0s - loss: 403247680.0000 - mae: 19737.6484 - mse: 403247680.0000 - 41ms/epoch - 1ms
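The `28/28` at the start of each log line is the number of batches per epoch: with a batch size of 64, one epoch needs ceil(n_samples / 64) batches to see every training sample once. As a rough sketch, the log therefore bounds the size of the training set. `steps_per_epoch` is a hypothetical helper, and the sample counts are derived from the log, not read from the data.

```python
import math

def steps_per_epoch(n_samples: int, batch_size: int) -> int:
    """Number of batches needed to see every sample once (the last batch may be smaller)."""
    return math.ceil(n_samples / batch_size)

batch_size = 64
observed_steps = 28  # from the training log

# Smallest and largest training-set sizes consistent with 28 batches of up to 64 rows
min_samples = (observed_steps - 1) * batch_size + 1
max_samples = observed_steps * batch_size

print(min_samples, max_samples)  # 1729 1792
```

The reported loss is the mean squared error in squared target units, which is why the values are so large; the `mae` column is easier to read because it is in the units of the target itself.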

<keras.callbacks.History at 0x7f3487b70850>

In [None]:
def plot_loss(history):
    # training and validation loss per epoch, as recorded by model.fit
    plt.plot(history.history['loss'], label='loss')
    plt.plot(history.history['val_loss'], label='val_loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss (MSE)')
    plt.legend()
    plt.grid(True)

plot_loss(history)
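When reading the loss plot, the interesting point is where the validation loss stops improving while the training loss keeps falling. A minimal sketch of locating that epoch in a dictionary shaped like `history.history`; the loss values below are made up for illustration, and `best_epoch` is a hypothetical helper.

```python
def best_epoch(history_dict: dict) -> int:
    """Return the 1-based epoch with the lowest validation loss."""
    val_loss = history_dict['val_loss']
    return min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Made-up losses: training loss keeps falling, validation loss turns upward after epoch 4
example_history = {
    'loss':     [9.0, 6.0, 4.0, 3.0, 2.5, 2.2],
    'val_loss': [9.5, 6.5, 4.8, 4.1, 4.3, 4.9],
}

print(best_epoch(example_history))  # 4
```

With the trained model this would be called as `best_epoch(history.history)`. Keras also offers the `EarlyStopping` callback, which stops training automatically once the monitored validation loss no longer improves.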

## 4. Performance

After tuning the model based on the validation data, we can use the test set to report the performance metrics.

In [8]:
# Evaluate the performance of the model on the previously unseen test dataset
prediction = model.predict(x_test)
r2 = r2_score(y_test, prediction)
rmse = sqrt(mean_squared_error(y_test, prediction))
mae = mean_absolute_error(y_test, prediction)

print("\nMODEL ACCURACY METRICS WITH TEST DATASET: \n"
      f"\t Root mean squared error: {rmse}\n"
      f"\t Mean absolute error: {mae}\n"
      f"\t Coefficient of determination: {r2}\n")



MODEL ACCURACY METRICS WITH TEST DATASET: 
	 Root mean squared error: 1448.3619776275707
	 Mean absolute error: 1023.495604636925
	 Coefficient of determination: 0.7952711105897102
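For reference, the three metrics can also be written out by hand. The sketch below uses plain Python on a tiny made-up example and should agree with the scikit-learn functions used above; the helper names and the example values are illustrative only.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: large errors are penalized more strongly."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error: average size of the error, in units of the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up incomes and predictions, each prediction off by exactly 1000
y_true = [20000, 24000, 20000, 24000]
y_pred = [21000, 23000, 21000, 23000]

print(rmse(y_true, y_pred))  # 1000.0
print(mae(y_true, y_pred))   # 1000.0
print(r2(y_true, y_pred))    # 0.75
```

An R2 of 0 corresponds to always predicting the mean of the targets, and 1 to a perfect prediction.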



## 5. Comparison to shallow models and baseline 

Now we can print the performance metrics of the models from the shallow regression exercises again and see how the deep model performs in comparison.

In [9]:

shallow_metrics = pd.read_csv(metrics_filename)

print(shallow_metrics.sort_values(by=['RMSE'], ascending=False))

   Unnamed: 0                        model         RMSE          MAE        R2
0           0      Naive median prediction  3210.315973  2521.001704 -0.005820
6           6           AdaBoost Regressor  1790.366977  1393.361386  0.687170
1           1            Linear Regression  1600.219500  1178.917810  0.750090
2           2  Gradient Boosting Regressor  1519.972072  1151.982767  0.774526
3           3      Random Forest Regressor  1452.603457  1074.873140  0.794070
5           5            Bagging Regressor  1438.135355  1079.788359  0.798152
4           4        Extra Trees Regressor  1332.022588   991.608972  0.826840
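To place the deep model in this ranking, its test metrics from section 4 can be listed next to the shallow ones. This is a plain-Python sketch with the values copied (and rounded) from the outputs above; in the notebook itself one would more likely add a row to `shallow_metrics`.

```python
# (model, RMSE, R2), rounded from the outputs above
results = [
    ('Naive median prediction', 3210.32, -0.0058),
    ('AdaBoost Regressor', 1790.37, 0.6872),
    ('Linear Regression', 1600.22, 0.7501),
    ('Gradient Boosting Regressor', 1519.97, 0.7745),
    ('Random Forest Regressor', 1452.60, 0.7941),
    ('Bagging Regressor', 1438.14, 0.7982),
    ('Extra Trees Regressor', 1332.02, 0.8268),
    ('Deep regression (this notebook)', 1448.36, 0.7953),
]

# Rank from lowest (best) to highest RMSE
ranking = sorted(results, key=lambda row: row[1])
for rank, (name, model_rmse, model_r2) in enumerate(ranking, start=1):
    print(f'{rank}. {name:<32} RMSE={model_rmse:8.2f}  R2={model_r2:.4f}')
```

With these numbers the deep model ranks third by RMSE, between the Bagging and the Random Forest regressors.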


## 6. Task

Study the TensorFlow documentation and experiment with different hyperparameter values. Can you improve on the performance metrics?
Report the best results (on the test set!) and note down the parameters used, so that others can reproduce the results.
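
One way to keep the experiments reproducible is to enumerate the settings up front and store each one next to its result. A minimal sketch using only the standard library; the parameter names and value ranges are illustrative assumptions, not recommendations, and the actual training step is left as a comment.

```python
import itertools
import json

# Illustrative search space (assumed values, adjust freely)
search_space = {
    'units': [32, 64, 128],
    'learning_rate': [0.01, 0.001],
    'batch_size': [32, 64],
    'epochs': [200],
}

def iter_configs(space: dict):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))

configs = list(iter_configs(search_space))
print(len(configs))  # 12

for config in configs[:2]:
    # Here one would build, fit and evaluate a model with these settings,
    # then store the metrics together with the config.
    print(json.dumps(config))
```

Compare the configurations on the validation set and evaluate only the chosen one on the test set, so that the test metrics remain an honest estimate.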