<div class="alert alert-block alert-info">

**TODO:**
* check all texts
* fix comments
* have it without outputs on Github
* compare to results from shallow models
* add task to build deeper models, add shape considerations! Links to relevant pages.
* more description to model building
* reminder on parameters 
* train/test/val, use val for tuning
* make some tuning a task
* Add note on CV
* explain numbers in model generation
* talk about engineered features
* make sure to make it possible to go through this in 1:00
</div>

# Deep Regression

Data - prepared vector data

Goal - use deep learning to predict the median income from zip-code-level population and spatial variables, assess the model accuracy with a test dataset, predict the income for all zip codes and write the result to a GeoPackage

Content of this notebook:

0. Prepare environment
1. Set paths
2. Check for GPUs
3. Reading and preparing data
4. Model definition
5. Prediction and inference
6. Comparison to shallow models and baseline




## 0. Prepare environment

In [None]:
# Standard library
import os
import time
from math import sqrt

# Third-party libraries
import geopandas as gpd
import pandas as pd
import tensorflow
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import RMSprop

In [None]:
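# Fix the random seed so that weight initialisation and shuffling are repeatable
random_seed = 42
tensorflow.random.set_seed(random_seed)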

## 1. Set paths

In [None]:
# All paths are built relative to the user's own scratch directory
username = os.environ.get('USER')
base_directory = f'/scratch/project_2002044/{username}/2022/GeoML'

data_directory = os.path.join(base_directory,'data')

preprocessed_data_directory = os.path.join(data_directory,'preprocessed_regression')
train_dataset_name = os.path.join(preprocessed_data_directory,'train_zip_code_data.csv')
test_dataset_name = os.path.join(preprocessed_data_directory,'test_zip_code_data.csv')
val_dataset_name = os.path.join(preprocessed_data_directory,'val_zip_code_data.csv')
train_label_name = os.path.join(preprocessed_data_directory,'train_income_labels.pkl')
test_label_name = os.path.join(preprocessed_data_directory,'test_income_labels.pkl')
val_label_name = os.path.join(preprocessed_data_directory,'val_income_labels.pkl')


metrics_filename = os.path.join(base_directory,'shallow_regression','shallow_metrics.csv')
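
Before training, it can help to confirm that every input file exists, so that a wrong path fails early and not in the middle of the notebook. The following is a minimal sketch; the helper name `missing_files` is hypothetical, and a temporary directory stands in for the real data directory.

```python
import os
import tempfile


def missing_files(paths):
    """Return the entries of `paths` that do not exist as regular files."""
    return [path for path in paths if not os.path.isfile(path)]


# Demonstration: a temporary directory stands in for the preprocessed data directory
with tempfile.TemporaryDirectory() as demo_directory:
    existing = os.path.join(demo_directory, 'train_zip_code_data.csv')
    absent = os.path.join(demo_directory, 'val_zip_code_data.csv')
    with open(existing, 'w') as handle:
        handle.write('placeholder\n')

    demo_missing = missing_files([existing, absent])
    print([os.path.basename(path) for path in demo_missing])
```

In the notebook, the same helper could be called with the list of dataset and label paths defined above.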

In [None]:
!pwd


## 2. Check for GPUs

In [None]:
def checkGPUavailability():
    # list_physical_devices returns an empty list when TensorFlow sees no GPU
    devices = tensorflow.config.list_physical_devices('GPU')
    if devices:
        print("We have a GPU available!")
    else:
        print("Sadly no GPU available. :( You have to settle for a CPU. Good luck!")

checkGPUavailability()

## 3. Reading and preparing data

In [None]:
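# Read the train, test and validation features and labels.
# The feature tables are plain CSV files, so they are read with pandas;
# gpd.read_file would add a geometry column and may not parse the values as numbers.
x_train = pd.read_csv(train_dataset_name)
x_test = pd.read_csv(test_dataset_name)
x_val = pd.read_csv(val_dataset_name)
y_train = pd.read_pickle(train_label_name)
y_test = pd.read_pickle(test_label_name)
y_val = pd.read_pickle(val_label_name)

# Number of feature columns, used as the input size of the first layer
num_of_x_columns = x_train.shape[1]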
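
Since the network expects one numeric value per feature column, a quick consistency check of the loaded splits can catch problems before training. The sketch below is an illustration only: `check_split` is a hypothetical helper, and the small tables and their column names are made up for the demonstration.

```python
import pandas as pd


def check_split(features, labels, reference_columns):
    """Return a list of problems found in one split (an empty list means it looks fine)."""
    problems = []
    if len(features) != len(labels):
        problems.append(f'{len(features)} feature rows but {len(labels)} labels')
    if list(features.columns) != list(reference_columns):
        problems.append('columns differ from the training columns')
    non_numeric = features.select_dtypes(exclude='number').columns.tolist()
    if non_numeric:
        problems.append(f'non-numeric columns: {non_numeric}')
    return problems


# Synthetic tables; the column names are invented for this example
demo_train = pd.DataFrame({'population': [100, 250, 80], 'area_km2': [1.5, 3.0, 0.7]})
demo_train_labels = pd.Series([21000.0, 25500.0, 19800.0])

# This split has one label too many and a column that was read as text
demo_test = pd.DataFrame({'population': [120, 300], 'area_km2': ['2.0', '4.1']})
demo_test_labels = pd.Series([22000.0, 26000.0, 24000.0])

print(check_split(demo_train, demo_train_labels, demo_train.columns))
print(check_split(demo_test, demo_test_labels, demo_train.columns))
```

In the notebook, the same check could be applied to `x_train`, `x_test` and `x_val` with `x_train.columns` as the reference.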


## 4. Model definition

In [None]:
# Initialize a Sequential model
model = Sequential()

# First hidden layer with 64 units and ReLU activation.
# input_shape is the number of feature columns per sample
model.add(Dense(64, activation='relu', input_shape=(num_of_x_columns,)))

# Second hidden layer with 64 units
model.add(Dense(64, activation='relu'))

# The output layer has a single unit and no activation,
# because the model predicts one continuous value per sample
model.add(Dense(1))

# Set the optimizer (RMSprop with learning rate 0.001) and the loss function (mean squared error).
# MAE and MSE are also reported as metrics during training
model.compile(optimizer=RMSprop(learning_rate=.001), loss='mse', metrics=['mae','mse'])
# summary() prints the table itself and returns None, so it is not wrapped in print()
model.summary()

# Train the network for 1000 epochs with a batch size of 64
model.fit(x_train, y_train, epochs=1000, shuffle=True, batch_size=64, verbose=2)
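
The parameter counts reported by the model summary can be worked out by hand: a Dense layer with `n` inputs and `u` units has `n * u` weights plus `u` biases. The sketch below reproduces that arithmetic for the 64-64-1 stack above; the function name and the example input size of 10 features are assumptions made for illustration.

```python
def dense_parameter_counts(num_inputs, layer_units):
    """Return the number of trainable parameters of each Dense layer in a stack."""
    counts = []
    inputs = num_inputs
    for units in layer_units:
        # One weight per input-unit pair, plus one bias per unit
        counts.append((inputs + 1) * units)
        # The outputs of this layer are the inputs of the next one
        inputs = units
    return counts


# Example: 10 feature columns (an assumed number) feeding the 64-64-1 network
demo_counts = dense_parameter_counts(10, [64, 64, 1])
print(demo_counts, sum(demo_counts))  # [704, 4160, 65] 4929
```

Replacing 10 with `num_of_x_columns` should give the per-layer numbers shown in the summary table.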



## 5. Prediction and inference

In [None]:
# Evaluate the performance of the model using the test data
prediction = model.predict(x_test)
r2 = r2_score(y_test, prediction)
rmse = sqrt(mean_squared_error(y_test, prediction))
mae = mean_absolute_error(y_test, prediction)

print("\nMODEL ACCURACY METRICS WITH TEST DATASET: \n" +
        "\t Root mean squared error: "+ str(rmse) + "\n" +
        "\t Mean absolute error: " + str(mae) + "\n" +
        "\t Coefficient of determination: " + str(r2) + "\n")

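To make the three reported metrics concrete, the sketch below computes them by hand with NumPy on a tiny made-up example. It also illustrates a shape consideration: a model with a `Dense(1)` output returns predictions of shape `(n_samples, 1)`, so subtracting them from a one-dimensional label array broadcasts to a matrix unless the predictions are flattened first. The function name `regression_metrics` and the numbers are assumptions for illustration only.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return (RMSE, MAE, R2) computed from their definitions."""
    # Flatten both inputs so that (n,) labels and (n, 1) predictions line up
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    residuals = y_true - y_pred
    rmse = float(np.sqrt(np.mean(residuals ** 2)))
    mae = float(np.mean(np.abs(residuals)))
    ss_res = float(np.sum(residuals ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2


# Made-up labels, and predictions shaped like the output of a Dense(1) layer
demo_true = np.array([3.0, 5.0, 7.0, 9.0])
demo_pred = np.array([[2.0], [5.0], [8.0], [10.0]])

# Pitfall: (4,) minus (4, 1) broadcasts to a 4 x 4 matrix, not 4 residuals
print((demo_true - demo_pred).shape)

demo_rmse, demo_mae, demo_r2 = regression_metrics(demo_true, demo_pred)
print(demo_rmse, demo_mae, demo_r2)
```

The scikit-learn functions used above accept the `(n_samples, 1)` predictions directly; the flattening matters mainly when residuals are computed manually.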

## 6. Comparison to shallow models and baseline 

In [None]:

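# Read the metrics that were saved for the shallow models
shallow_metrics = pd.read_csv(metrics_filename)

# Sorted by RMSE in descending order, so the model with the largest error comes first
print(shallow_metrics.sort_values(by=['RMSE'], ascending=False))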
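
To compare the deep model directly with the shallow ones, its metrics can be appended to the same table. This is a sketch under assumptions: only the `RMSE` column is known from the code above, so the other column names (`Model`, `MAE`, `R2`), the helper `add_model_row` and all numbers below are made up for illustration.

```python
import pandas as pd


def add_model_row(metrics_table, name, rmse, mae, r2):
    """Append one model's metrics and return the table sorted by RMSE, best first."""
    new_row = pd.DataFrame([{'Model': name, 'RMSE': rmse, 'MAE': mae, 'R2': r2}])
    combined = pd.concat([metrics_table, new_row], ignore_index=True)
    return combined.sort_values(by='RMSE', ascending=True).reset_index(drop=True)


# Invented placeholder values, not results from the course data
demo_shallow = pd.DataFrame({
    'Model': ['Baseline', 'Random forest'],
    'RMSE': [4.0, 2.0],
    'MAE': [3.0, 1.5],
    'R2': [0.0, 0.8],
})

demo_combined = add_model_row(demo_shallow, 'Deep regression', 3.0, 2.5, 0.5)
print(demo_combined)
```

In the notebook, the call would use `shallow_metrics` together with the `rmse`, `mae` and `r2` values computed in section 5, after checking the actual column names of the CSV file.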