# GreenValueNet

This notebook contains the code needed to execute the GreenValueNet hedonic pricing neural network. 

### Set up and data loading

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from data_load_funcs import get_params, load_data_catalogue
from processing_funcs import process_data, normalise_values
from model_funcs import *

params = get_params()
data_catalogue = load_data_catalogue()

If you do not have a file called `dataset.csv` in the `data/interim_files` folder, the following cell will generate this file and then produce summary statistics. The data processing happens locally and involves large datasets with spatial components, so it can take several hours - please be patient! If you already have the file, it will be read in and summary statistics will be generated.

In [None]:
dataset = process_data(data_catalogue, params)

# show summary stats
summary_stats = dataset.describe().transpose()[['mean', 'std', 'max', 'min']]
summary_stats.columns = ['Mean', 'Std Dev', 'Maximum', 'Minimum']
print(summary_stats)

Now we normalise any non-encoded variables to speed up learning, then convert the dataset into an array of inputs and an associated output array.

In [None]:
norm_cols = [col for col in dataset.columns if col not in params['non_norm_cols']]
for col in norm_cols:
    dataset[col] = normalise_values(dataset[col])

# create the input array x and the associated output array y
x, y = create_x_y_arr(dataset, params)
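
The internals of `normalise_values` are not shown in this notebook. As a rough illustration of what column-wise normalisation could look like, the sketch below applies z-score standardisation. The function name `zscore_normalise`, the choice of z-scores rather than min-max scaling, and the example values are all assumptions.

```python
import numpy as np

def zscore_normalise(values):
    """Hypothetical stand-in for normalise_values: rescale to mean 0, std 1."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    # guard against constant columns, which would otherwise divide by zero
    if std == 0:
        return np.zeros_like(values)
    return (values - values.mean()) / std

# made-up column values, purely for illustration
floor_area = np.array([50.0, 75.0, 100.0, 125.0, 150.0])
scaled = zscore_normalise(floor_area)
```

Putting features on a comparable scale tends to help gradient-based optimisers converge, which is the motivation given above for this step.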

The dataset is then split into train, dev and test sets using scikit-learn.

In [None]:
x_train, x_dev, x_test, y_train, y_dev, y_test = split_to_test_dev_train(
    x,
    y,
    params['dev_size'],
    params['test_size'],
    prop=False
)
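
How `split_to_test_dev_train` works internally is not shown here. One common way to get three sets with scikit-learn is to call `train_test_split` twice, as in the sketch below. The helper name `three_way_split`, the `seed` argument, and the reading of `prop=False` as "sizes are row counts rather than proportions" are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def three_way_split(x, y, dev_size, test_size, seed=0):
    """Hypothetical sketch: split into train/dev/test by absolute row counts."""
    # first carve off the test set (an integer size means a row count)
    x_rest, x_test, y_rest, y_test = train_test_split(
        x, y, test_size=test_size, random_state=seed
    )
    # then carve the dev set out of what remains
    x_train, x_dev, y_train, y_dev = train_test_split(
        x_rest, y_rest, test_size=dev_size, random_state=seed
    )
    return x_train, x_dev, x_test, y_train, y_dev, y_test

# made-up data: 100 rows, 2 features
x_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)
parts = three_way_split(x_demo, y_demo, dev_size=10, test_size=10)
```

The return order mirrors the unpacking used in the cell above.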

## Benchmarking

To evaluate the performance of the neural network, we run random forest and XGBoost regressions as baseline models. We then build two alternative models: a deep neural network and a Bayesian model. We optimise on the mean squared error (MSE) and also report it as our measure of performance.
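
For reference, the MSE is the average of the squared differences between observed and predicted values. The short sketch below computes it by hand with NumPy; the helper name `mse` and the example numbers are made up for illustration, and the notebook itself uses scikit-learn's `mean_squared_error`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# residuals are 1, 0 and -2, so the MSE is (1 + 0 + 4) / 3
example_mse = mse([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
```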

### Random Forest


In [None]:
# run baseline random forest regression using scikit-learn
rfr_model = random_forest_reg(
    x_train,
    y_train,
    tuning=False
)

# now run with grid search to tune parameters
rfr_tuned = random_forest_reg(
    x_train,
    y_train,
    tuning=True,
    tuning_params=params['tuning_dict']['grid']
)

# generate predictions and measure according to mean squared error
rfr_pred, rfr_mse = generate_pred_metric(rfr_model, mean_squared_error, x_dev, y_dev)
rfr_t_pred, rfr_t_mse = generate_pred_metric(rfr_tuned, mean_squared_error, x_dev, y_dev)
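
The body of `random_forest_reg` lives in `model_funcs` and is not reproduced here. A minimal sketch of how a random forest with optional grid search might be written with scikit-learn follows. The function name `fit_random_forest`, the grid values, the forest size, and the 3-fold cross-validation are assumptions, not the notebook's actual settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

def fit_random_forest(x, y, tuning=False, tuning_params=None, seed=0):
    """Hypothetical sketch of a random forest fit with optional grid search."""
    model = RandomForestRegressor(n_estimators=20, random_state=seed)
    if not tuning:
        return model.fit(x, y)
    search = GridSearchCV(
        model,
        param_grid=tuning_params,
        scoring="neg_mean_squared_error",  # higher is better, so MSE is negated
        cv=3,
    )
    search.fit(x, y)
    # best_estimator_ is refit on all of x, y with the winning parameters
    return search.best_estimator_

# made-up regression data, purely for illustration
rng = np.random.default_rng(0)
x_demo = rng.normal(size=(60, 3))
y_demo = 2.0 * x_demo[:, 0] + rng.normal(scale=0.1, size=60)
grid = {"max_depth": [2, 4], "min_samples_leaf": [1, 5]}
tuned = fit_random_forest(x_demo, y_demo, tuning=True, tuning_params=grid)
```

Scoring the search on negated MSE keeps the tuning objective consistent with the metric reported on the dev set.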

### XGBoost

In [None]:
xgb_model = boosted_grad_reg(x_train, y_train)
xgb_pred, xgb_mse = generate_pred_metric(xgb_model, mean_squared_error, x_dev, y_dev)
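
Neither `boosted_grad_reg` nor `generate_pred_metric` is defined in this notebook. The sketch below shows the general fit, predict and score pattern, using scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost so that the example needs no extra dependency. The helper name `predict_and_score`, the data, and the model settings are assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

def predict_and_score(model, metric, x, y):
    """Hypothetical analogue of generate_pred_metric: predictions plus a score."""
    pred = model.predict(x)
    return pred, metric(y, pred)

# made-up regression data, purely for illustration
rng = np.random.default_rng(1)
x_demo = rng.normal(size=(120, 4))
y_demo = x_demo[:, 0] - 0.5 * x_demo[:, 1] + rng.normal(scale=0.1, size=120)

# scikit-learn gradient boosting, used here as a stand-in for XGBoost
gbr = GradientBoostingRegressor(n_estimators=50, random_state=0)
gbr.fit(x_demo[:100], y_demo[:100])
gbr_pred, gbr_mse = predict_and_score(
    gbr, mean_squared_error, x_demo[100:], y_demo[100:]
)
```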

## Neural networks

We now build some neural networks. The number of epochs, hidden layers, and nodes per hidden layer are initially set using rules of thumb and then optimised using hyperparameter tuning.

In [None]:
# set epochs to be 3 times number of features
epochs = int(x_train.shape[1]) * 3

# set n_hidden_units to be mean of input and output layer sizes
n_hidden_units = round((x_train.shape[1] + 1) / 2)

### Single Layer Neural Network

A single hidden layer with ReLU activation is used with a linear output layer.

In [None]:
single_nn = neural_net(
    x_train,
    y_train,
    n_hidden_units = n_hidden_units,
    epochs = epochs,
    validation_data = (x_dev, y_dev)
)
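
To make the architecture concrete, the sketch below writes out the forward pass of a network with one ReLU hidden layer and a linear output in plain NumPy. It illustrates the computation only and is not the `neural_net` implementation; the function name `single_layer_forward` and all weights and inputs are made up.

```python
import numpy as np

def single_layer_forward(x, w1, b1, w2, b2):
    """Forward pass: ReLU hidden layer followed by a linear output unit."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU activation
    return hidden @ w2 + b2                # linear output, no activation

# tiny made-up network: 2 features, 2 hidden units, 1 output
x_demo = np.array([[1.0, 2.0]])
w1 = np.array([[1.0, -1.0], [0.5, 1.0]])
b1 = np.array([0.0, -3.0])
w2 = np.array([[2.0], [1.0]])
b2 = np.array([0.5])
y_hat = single_layer_forward(x_demo, w1, b1, w2, b2)
```

The linear output leaves the prediction unbounded, which suits a regression target such as a price.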

In [None]:
single_nn.history.history

### Deep Neural Network

The full model is specified as a deep neural network whose hidden layers use ReLU activation functions, with a linear activation in the output layer. The number of layers was initially kept small due to computational constraints.

In [None]:
deep_nn = neural_net(
    x_train,
    y_train,
    n_layers = 5,
    n_hidden_units = n_hidden_units,
    epochs = epochs,
    validation_data = (x_dev, y_dev)
)
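
One way to see why depth matters for compute is to count trainable parameters. The sketch below does this for a fully connected network, assuming every hidden layer has the same width; whether `neural_net` builds its layers this way is not shown in the notebook, and the helper name `dense_param_count` and the feature count of 41 are made up.

```python
def dense_param_count(n_features, n_hidden_units, n_layers, n_outputs=1):
    """Count weights and biases in a fully connected network.

    Assumes n_layers hidden layers of equal width.
    """
    sizes = [n_features] + [n_hidden_units] * n_layers + [n_outputs]
    # each layer has (inputs * outputs) weights plus one bias per output
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

# made-up feature count, with the hidden width rule of thumb used above
n_features = 41
n_hidden = round((n_features + 1) / 2)
shallow = dense_param_count(n_features, n_hidden, n_layers=1)
deep = dense_param_count(n_features, n_hidden, n_layers=5)
```

With these illustrative numbers the five-layer network has roughly three times as many parameters as the single-layer one.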

In [None]:
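from pathlib import Path

# save the model as a TensorFlow object in the outputs/models folder
cwd = Path.cwd()  # assumed: cwd is not defined elsewhere in this notebook
model_dir = cwd / "outputs" / "models"
deep_nn.export(model_dir / "deep_nn.tf") # check this file ending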