# GreenValueNet

This notebook contains the code needed to execute the GreenValueNet hedonic pricing neural network. 

### Set up and data loading

In [None]:
import tensorflow as tf
from pathlib import Path
from data_load_funcs import get_params, load_data_catalogue
from processing_funcs import process_data, normalise_values
from model_funcs import create_x_y_arr, split_to_test_dev_train, extract_target_var, random_forest_reg

params = get_params()
data_catalogue = load_data_catalogue()

If you do not have a file called `dataset.csv` in the `data/interim_files` folder, the following cell will generate it and then produce summary statistics. The data processing happens locally and involves large datasets with spatial components, so it can take several hours. Please be patient! If you already have the file, it will be read in and the summary statistics generated.

In [None]:
dataset = process_data(data_catalogue, params)

# show summary stats
summary_stats = dataset.describe().transpose()[['mean', 'std', 'max', 'min']]
summary_stats.columns = ['Mean', 'Std Dev', 'Maximum', 'Minimum']
print(summary_stats)

Next, we normalise all non-encoded variables to speed up learning, then convert the dataset into an array of inputs and an associated output array.

In [None]:
norm_cols = [col for col in dataset.columns if col not in params['non_norm_cols']]
for col in norm_cols:
    dataset[col] = normalise_values(dataset[col])

x, y = create_x_y_arr(dataset, params)
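
The internals of `normalise_values` live in `processing_funcs` and are not shown in this notebook. Purely as an illustration of the idea, a z-score version might look like the sketch below. The function name `zscore_normalise` and the toy `floor_area` column are hypothetical, and the real helper may use a different scheme such as min-max scaling.

```python
import pandas as pd

def zscore_normalise(values: pd.Series) -> pd.Series:
    """Hypothetical stand-in for normalise_values: centre on 0, scale to unit std."""
    std = values.std(ddof=0)
    if std == 0:
        # constant column: return zeros rather than dividing by zero
        return values - values.mean()
    return (values - values.mean()) / std

# toy data, not taken from the real dataset
toy = pd.DataFrame({"floor_area": [50.0, 75.0, 100.0, 125.0]})
toy["floor_area"] = zscore_normalise(toy["floor_area"])
```

After this step the column has mean 0 and unit standard deviation, which keeps features on comparable scales for gradient-based training.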

In [None]:
# things to check:
# max house price, floor area, current energy efficiency, no. rooms, road dist
# min floor area, habitable rooms, heated rooms, construction age band

# do we want to remove any outliers? what should we be doing with blank data? is it treated as 0?
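
One way to start answering the questions above is to count blanks and flag extreme values before deciding what to do with them. The sketch below uses a small made-up frame; the column names `price` and `floor_area` are hypothetical, and the 1.5 × IQR rule is a common convention rather than a choice made in this notebook. How `process_data` itself treats blanks is not shown here, so that still needs checking in the source.

```python
import numpy as np
import pandas as pd

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    "price": [150_000, 200_000, 250_000, 9_000_000, np.nan],
    "floor_area": [60.0, 80.0, np.nan, 95.0, 70.0],
})

# blanks show up as NaN in pandas, not as 0, so count them explicitly
missing_counts = toy.isna().sum()

def iqr_outlier_mask(col: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values more than k * IQR outside the interquartile range."""
    q1, q3 = col.quantile(0.25), col.quantile(0.75)
    iqr = q3 - q1
    return (col < q1 - k * iqr) | (col > q3 + k * iqr)

price_outliers = iqr_outlier_mask(toy["price"])
print(missing_counts)
print(toy[price_outliers])
```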

The dataset is then split into train, dev and test sets, and identifying columns are dropped before the data are used in the models. We then isolate the target variable `ln_price` from the input variables.

In [None]:
# make any modifications to handle numpy array

train, dev, test = split_to_test_dev_train(
    dataset,
    params['dev_size'],
    params['test_size'],
    prop=False
)

x_train, y_train = extract_target_var(train, params['target_var'], params['cols_out'])
x_dev, y_dev = extract_target_var(dev, params['target_var'], params['cols_out'])
x_test, y_test = extract_target_var(test, params['target_var'], params['cols_out'])
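
`split_to_test_dev_train` and `extract_target_var` are defined in `model_funcs` and are not shown here. As a rough sketch of the general idea, assuming the dev and test sizes are absolute row counts, the two steps could look like this. The helper names `split_rows` and `split_target` and the toy columns are hypothetical.

```python
import numpy as np
import pandas as pd

def split_rows(df, dev_size, test_size, seed=0):
    """Hypothetical split: shuffle once, then carve off dev and test by row count."""
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    dev = shuffled.iloc[:dev_size]
    test = shuffled.iloc[dev_size:dev_size + test_size]
    train = shuffled.iloc[dev_size + test_size:]
    return train, dev, test

def split_target(df, target_var, cols_out):
    """Hypothetical: drop identifier columns, then separate inputs from the target."""
    features = df.drop(columns=list(cols_out) + [target_var])
    return features, df[target_var]

# toy data, not taken from the real dataset
toy = pd.DataFrame({
    "id": range(10),
    "floor_area": np.linspace(40, 130, 10),
    "ln_price": np.linspace(11.5, 13.0, 10),
})
toy_train, toy_dev, toy_test = split_rows(toy, dev_size=2, test_size=2)
x_toy, y_toy = split_target(toy_train, "ln_price", ["id"])
```

Shuffling before slicing keeps the three sets disjoint while avoiding any ordering effects in the source file.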


## Benchmarking

To evaluate the performance of my neural network, I will run a single-layer NN and a random forest as baseline models. I will then build two alternative models: a deep neural network and a Bayesian model. The models are optimised on the mean squared error (MSE), and both the MSE and the root mean squared error (RMSE) are reported. Minimising the MSE necessarily minimises the RMSE, since the square root is monotonic, while avoiding the extra square-root computation.
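
The relationship between the two metrics can be checked numerically. The sketch below uses made-up predictions for two hypothetical models and shows that ranking them by MSE gives the same ordering as ranking them by RMSE.

```python
import numpy as np

# made-up log-price targets and predictions from two hypothetical models
y_true = np.array([12.0, 12.5, 13.0, 11.8])
pred_a = np.array([12.1, 12.4, 12.7, 11.9])
pred_b = np.array([12.5, 12.0, 13.6, 11.0])

def mse(y, y_hat):
    """Mean squared error."""
    return float(np.mean((y - y_hat) ** 2))

def rmse(y, y_hat):
    """Root mean squared error: the square root of the MSE."""
    return float(np.sqrt(mse(y, y_hat)))

for name, pred in [("a", pred_a), ("b", pred_b)]:
    print(name, mse(y_true, pred), rmse(y_true, pred))
```

RMSE has the advantage of being in the same units as the target (here, log price), which is why it is still worth reporting.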

### Random Forest


In [None]:
# run baseline random forest regression using scikit-learn
rfr_model, rfr_predict, rfr_metrics = random_forest_reg(
    x_train,
    x_dev,
    y_train,
    y_dev,
    tuning=False
)

# now run with grid search to tune parameters
rfr_tuned, rfr_tuned_pred, rfr_tuned_metrics = random_forest_reg(
    x_train,
    x_dev,
    y_train,
    y_dev,
    tuning=True,
    tuning_params=params['tuning_dict']['grid']
)
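
`random_forest_reg` wraps scikit-learn, but its internals and the contents of `params['tuning_dict']['grid']` are not shown in this notebook. A minimal sketch of what a baseline fit followed by a grid search might look like is given below, using synthetic data and a hypothetical parameter grid.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# synthetic data standing in for the real train/dev split
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
X_tr, X_dv, y_tr, y_dv = X[:150], X[150:], y[:150], y[150:]

# baseline: fixed hyperparameters, evaluated on the dev set
baseline = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)
baseline_mse = mean_squared_error(y_dv, baseline.predict(X_dv))

# tuned: cross-validated grid search over a hypothetical grid
grid = {"n_estimators": [25, 50], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=grid,
    scoring="neg_mean_squared_error",
    cv=3,
)
search.fit(X_tr, y_tr)
tuned_mse = mean_squared_error(y_dv, search.best_estimator_.predict(X_dv))

print(search.best_params_, baseline_mse, tuned_mse)
```

Scoring on negative MSE matches the optimisation target used for the other models, so the dev-set numbers stay comparable.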

### Single Layer Neural Network

In [None]:
# set up single layer NN
import numpy as np

# `dataset_tf` is not defined anywhere above; adapting on the training inputs is an assumption
normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(np.asarray(x_train, dtype='float32'))

def get_basic_model():
  model = tf.keras.Sequential([
    normalizer,
    tf.keras.layers.Dense(1, activation='linear')
  ])

  model.compile(optimizer='adam',
                # MSE takes no `from_logits` argument; that option belongs to cross-entropy losses
                loss=tf.keras.losses.MeanSquaredError(),
                metrics=['mean_squared_error'])
  return model

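single_nn = get_basic_model()
# `numeric_features`, `target` and `BATCH_SIZE` are not defined above; the training split
# and a batch size of 32 (the Keras default) are assumed here
single_nn.fit(x_train, y_train, epochs=15, batch_size=32)

# a single Dense unit with a linear activation is equivalent to a linear regression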
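
Because a single `Dense(1)` unit with a linear activation has the same functional form as ordinary least squares, the closed-form solution gives a useful sanity check on the network's error. The sketch below computes it with NumPy on synthetic data; the coefficients and noise level are made up for illustration.

```python
import numpy as np

# synthetic data with known coefficients, standing in for the real inputs
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
true_w, true_b = np.array([0.5, -1.0, 2.0]), 12.0
y = X @ true_w + true_b + rng.normal(scale=0.05, size=500)

# append a column of ones so the intercept is estimated alongside the weights
X_design = np.column_stack([X, np.ones(len(X))])
coef, *_ = np.linalg.lstsq(X_design, y, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]

# training MSE of the least-squares fit
ols_mse = float(np.mean((X_design @ coef - y) ** 2))
print(w_hat, b_hat, ols_mse)
```

Given enough epochs, a single-layer network trained on the same data should approach this MSE; a large gap would suggest a problem with the training setup.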

## Full model

### Deep Neural Network

### Bayesian model

Here I use the PyMC library to build a Bayesian model.