# GreenValueNet

This notebook contains the code needed to execute the GreenValueNet hedonic pricing neural network. 

### Set up and data loading

In [1]:
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from data_load_funcs import get_params, load_data_catalogue
from processing_funcs import process_data, normalise_values
from model_funcs import create_x_y_arr, split_to_test_dev_train, random_forest_reg, baseline_nn

params = get_params()
data_catalogue = load_data_catalogue()






If you do not have a file called `dataset.csv` in the `data/interim_files` folder, the following cell will generate it and then produce summary statistics. The data processing runs locally and involves large datasets with spatial components, so it can take several hours - please be patient! If you already have the file, it will be read in and summary statistics will be generated.

In [4]:
dataset = process_data(data_catalogue, params)

# show summary stats
summary_stats = dataset.describe().transpose()[['mean', 'std', 'max', 'min']]
summary_stats.columns = ['Mean', 'Std Dev', 'Maximum', 'Minimum']
print(summary_stats)

Processed dataset already exists.
Loading exisitng dataset...
                                  Mean     Std Dev        Maximum   Minimum
ln_price                     12.315576    0.671025      18.134158  0.116855
propertytype                  1.330222    1.086272       3.000000  0.000000
oldnew                        0.180940    0.384968       1.000000  0.000000
duration                      0.344295    0.475170       2.000000  0.000000
current_energy_efficiency    62.324093   13.226019     347.000000  0.000000
potential_energy_efficiency   0.811409    0.144810      70.000000  0.000000
total_floor_area             91.697405  139.792843  530331.552000  0.000000
extension_count               0.488791    0.732426       4.000000  0.000000
number_habitable_rooms        4.545110    1.703158     137.000000  0.000000
number_heated_rooms           4.456910    1.712974     100.000000  0.000000
construction_age_band         4.998326    3.503951      12.000000 -1.000000
coastline_dist            
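
The load-or-generate behaviour described above can be sketched as a simple file cache. This is only an illustration of the pattern, not the actual `process_data` implementation: the helper name `load_or_build` and its `build_fn` argument are hypothetical.

```python
from pathlib import Path

import pandas as pd


def load_or_build(path, build_fn):
    """Return the cached CSV at `path` if it exists, else build and cache it.

    Hypothetical helper: `build_fn` stands in for the slow processing step.
    """
    path = Path(path)
    if path.exists():
        # Fast path: reuse the previously processed dataset
        return pd.read_csv(path)
    # Slow path: run the expensive processing once, then cache the result
    df = build_fn()
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    return df
```

With this pattern the expensive step runs only on the first call; later calls read the CSV directly.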

Next we normalise all non-encoded variables, which helps the learning algorithm converge faster, and convert the dataset into an array of inputs and an associated output array.

In [None]:
norm_cols = [col for col in dataset.columns if col not in params['non_norm_cols']]
for col in norm_cols:
    dataset[col] = normalise_values(dataset[col])

x, y = create_x_y_arr(dataset, params)
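
For illustration, one common way to normalise a column is z-score standardisation. The sketch below assumes that approach; the actual `normalise_values` in `processing_funcs` may do something different (min-max scaling, for example). The function name `zscore_normalise` and the sample values are made up.

```python
import numpy as np


def zscore_normalise(values):
    """Standardise a 1-D array-like to zero mean and unit variance.

    Assumed behaviour for illustration; not taken from processing_funcs.
    """
    arr = np.asarray(values, dtype=float)
    std = arr.std()
    if std == 0:
        # A constant column has no spread to rescale; map it to zeros
        return np.zeros_like(arr)
    return (arr - arr.mean()) / std


floor_area = [50.0, 75.0, 100.0, 125.0]  # made-up values
scaled = zscore_normalise(floor_area)
```

Putting every continuous input on a similar scale keeps any single feature, such as floor area, from dominating the gradient updates.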

In [5]:
# things to check:
# max house price, floor area, current energy efficiency, no. rooms, road dist
# min floor area, habitable rooms, heated rooms, construction age band

# do we want to remove any outliers? what should we be doing with blank data? is it treated as 0?
dataset.columns

Index(['transactionid', 'ln_price', 'postcode', 'propertytype', 'oldnew',
       'duration', 'current_energy_efficiency', 'potential_energy_efficiency',
       'total_floor_area', 'extension_count', 'number_habitable_rooms',
       'number_heated_rooms', 'construction_age_band', 'coastline_dist',
       'prim_school_dist', 'sec_school_dist', 'roads_dist', 'nat_park_dist',
       'nat_trust_dist', 'ttwa_dist', 'dom_builds_share', 'garden_share',
       'non_dom_builds_share', 'path_share', 'greenspace_share',
       'water_share'],
      dtype='object')

The dataset is then split into train, dev and test sets using scikit-learn.

In [None]:
x_train, x_dev, x_test, y_train, y_dev, y_test = split_to_test_dev_train(
    x,
    y,
    params['dev_size'],
    params['test_size'],
    prop=False
)
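
A three-way split can be built from two calls to scikit-learn's `train_test_split`. The sketch below is one plausible way to do it, assuming `dev_size` and `test_size` are fractions of the full dataset; it is not the body of `split_to_test_dev_train`, and it does not model the `prop` argument. The name `three_way_split` and the demo arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def three_way_split(x, y, dev_size, test_size, seed=0):
    """Split x and y into train, dev and test sets.

    dev_size and test_size are assumed to be fractions of the full dataset.
    """
    n = len(x)
    # Convert fractions to absolute counts so both splits refer to the full data
    n_test = int(round(test_size * n))
    n_dev = int(round(dev_size * n))
    # First carve off the test set
    x_rest, x_test, y_rest, y_test = train_test_split(
        x, y, test_size=n_test, random_state=seed
    )
    # Then carve the dev set out of what remains
    x_train, x_dev, y_train, y_dev = train_test_split(
        x_rest, y_rest, test_size=n_dev, random_state=seed
    )
    return x_train, x_dev, x_test, y_train, y_dev, y_test


x_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)
parts = three_way_split(x_demo, y_demo, dev_size=0.1, test_size=0.1)
```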

## Benchmarking

To evaluate the performance of the neural network, we run random forest and XGBoost regressions as baseline models. We then build two alternative models: a deep neural network and a Bayesian model. We optimise on the mean squared error (MSE) and report both the MSE and the root mean squared error (RMSE). Because the square root is monotonically increasing, minimising the MSE necessarily minimises the RMSE, but with less computation.
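
As a small check of the MSE and RMSE relationship, the sketch below ranks three sets of predictions by both metrics. The target values and candidate predictions are made up for illustration.

```python
import numpy as np

y_true = np.array([12.1, 12.4, 11.9, 12.8])  # made-up log prices

# Three hypothetical sets of predictions
candidates = {
    "a": np.array([12.0, 12.5, 12.0, 12.6]),
    "b": np.array([12.3, 12.0, 11.5, 13.2]),
    "c": np.array([12.1, 12.4, 12.0, 12.7]),
}

mse = {name: float(np.mean((y_true - pred) ** 2)) for name, pred in candidates.items()}
rmse = {name: value ** 0.5 for name, value in mse.items()}

# The square root preserves ordering, so both metrics pick the same candidate
best_by_mse = min(mse, key=mse.get)
best_by_rmse = min(rmse, key=rmse.get)
```

Both metrics give the same ranking, so taking the square root once at reporting time is enough.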

### Random Forest


In [None]:
# run baseline random forest regression using scikit-learn
rfr_model = random_forest_reg(
    x_train,
    x_dev,
    y_train,
    y_dev,
    tuning=False
)

# now run with grid search to tune parameters
rfr_tuned = random_forest_reg(
    x_train,
    x_dev,
    y_train,
    y_dev,
    tuning=True,
    tuning_params=params['tuning_dict']['grid']
)
)

# generate predictions and measure according to mean squared error
rfr_pred = rfr_model.predict(x_dev)
mse_rfr = mean_squared_error(y_dev, rfr_pred)
rmse_rfr = mse_rfr ** 0.5

rfr_t_pred = rfr_tuned.predict(x_dev)
mse_t_rfr = mean_squared_error(y_dev, rfr_t_pred)
rmse_t_rfr = mse_t_rfr ** 0.5
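
For readers without access to `model_funcs`, the tuned run could look roughly like the following, using scikit-learn's `RandomForestRegressor` and `GridSearchCV` directly. This is a sketch under assumptions: the toy data, the parameter grid and the use of 3-fold cross-validation are illustrative and are not taken from `random_forest_reg` or `params['tuning_dict']`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV

# Toy regression data standing in for the real train and dev sets
rng = np.random.default_rng(0)
x_toy = rng.normal(size=(200, 3))
y_toy = 2.0 * x_toy[:, 0] + rng.normal(scale=0.1, size=200)
x_tr, x_dv = x_toy[:150], x_toy[150:]
y_tr, y_dv = y_toy[:150], y_toy[150:]

# Illustrative grid only; the notebook's real grid lives in params
grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid=grid,
    scoring="neg_mean_squared_error",  # scikit-learn maximises, hence the negation
    cv=3,
)
search.fit(x_tr, y_tr)

# Score the refitted best model on the held-out dev set
tuned = search.best_estimator_
mse_toy = mean_squared_error(y_dv, tuned.predict(x_dv))
rmse_toy = mse_toy ** 0.5
```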


### XGBoost

## Neural networks

### Single Layer Neural Network

In [None]:
fitted_single_nn = baseline_nn()



### Deep Neural Network

The full model is specified as a deep neural network whose hidden layers use ReLU activation functions, with a linear activation in the output layer. The number of layers was initially kept small because of computational constraints.
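
To make the architecture concrete, here is a minimal NumPy sketch of the forward pass for such a network: ReLU on each hidden layer and no activation on the output. It is not the notebook's model; the layer widths, the He-style initialisation and the helper names `init_layers` and `forward` are assumptions for illustration, and training is omitted.

```python
import numpy as np


def relu(z):
    """Element-wise ReLU."""
    return np.maximum(z, 0.0)


def init_layers(sizes, seed=0):
    """He-style random weights and zero biases for consecutive layer sizes."""
    rng = np.random.default_rng(seed)
    return [
        (rng.normal(scale=np.sqrt(2.0 / n_in), size=(n_in, n_out)), np.zeros(n_out))
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


def forward(x, layers):
    """Apply ReLU on every hidden layer and a linear output layer."""
    a = x
    for w, b in layers[:-1]:
        a = relu(a @ w + b)
    w_out, b_out = layers[-1]
    # No activation here: the regression output is unbounded
    return a @ w_out + b_out


layers = init_layers([10, 16, 8, 1])  # illustrative widths: 10 inputs, 1 output
x_batch = np.random.default_rng(1).normal(size=(4, 10))
y_hat = forward(x_batch, layers)
```

The linear output layer suits this task because the target, `ln_price`, is a continuous value rather than a class label.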

In [None]:
deep_nn = None

In [None]:
# save model as tensor object and stick in a folder called outputs