# GreenValueNet

This notebook contains the code needed to execute the GreenValueNet hedonic pricing neural network. 

### Set up and data loading

In [6]:
from pathlib import Path
from data_load_funcs import get_params, load_data_catalogue
from processing_funcs import process_data, normalise_values

cwd = Path.cwd()
params = get_params()
data_catalogue = load_data_catalogue()

If you do not have a file called `dataset.csv` in the `data/interim_files` folder, the following cell will create it and then generate summary statistics. The data processing happens locally and involves large datasets with spatial components, so it can take several hours. Please be patient! If you already have the file, it will be read in and the summary statistics generated.

In [None]:
dataset = process_data(data_catalogue, params)

# show summary stats
summary_stats = dataset.describe().transpose()[['mean', 'std', 'max', 'min']]
summary_stats.columns = ['Mean', 'Std Dev', 'Maximum', 'Minimum']
print(summary_stats)

Next, we normalise all non-encoded variables so that the learning algorithm trains faster.

In [None]:
norm_cols = [col for col in dataset.columns if col not in params['non_norm_cols']]
for col in norm_cols:
    dataset[col] = normalise_values(dataset[col])
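
The implementation of `normalise_values` lives in `processing_funcs` and is not shown in this notebook. As a rough illustration only, one common choice is z-score standardisation. The helper below (`zscore_normalise`, a hypothetical name) and the toy columns are assumptions about the general approach, not the project's actual function or data.

```python
import pandas as pd


def zscore_normalise(values: pd.Series) -> pd.Series:
    """Hypothetical stand-in for normalise_values: centre on 0 with unit variance."""
    std = values.std(ddof=0)
    # Guard against constant columns, which would otherwise divide by zero
    if std == 0:
        return values - values.mean()
    return (values - values.mean()) / std


# Toy data, invented purely for illustration
example = pd.DataFrame({"floor_area": [50.0, 75.0, 100.0], "is_flat": [1, 0, 1]})

# Only the continuous column is rescaled; the encoded column is left untouched
example["floor_area"] = zscore_normalise(example["floor_area"])
print(example)
```

Encoded (0/1) columns are skipped here for the same reason they are listed in `params['non_norm_cols']`: rescaling them would not add information.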

In [None]:
# things to check:
# max house price, floor area, current energy efficiency, no. rooms, road dist
# min floor area, habitable rooms, heated rooms, construction age band

# do we want to remove any outliers? what should we be doing with blank data? is it treated as 0?
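
The questions above can be explored with a few pandas checks. This is a minimal sketch on invented toy data: the column names, the values, and the 1.5 × IQR outlier rule are all assumptions, and how blanks actually arrive in `dataset` depends on `process_data`, which is not shown here.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration; column names are assumptions
toy = pd.DataFrame({
    "price": [250_000, 310_000, 280_000, 295_000, np.nan, 4_000_000],
    "floor_area": [70.0, np.nan, 95.0, 88.0, 60.0, 82.0],
})

# Count blanks per column: pandas stores them as NaN, not as 0
missing_counts = toy.isna().sum()

# mean() (and describe()) skip NaN by default rather than counting it as 0
mean_skipping_nan = toy["floor_area"].mean()
mean_if_zero_filled = toy["floor_area"].fillna(0).mean()

# Flag values above the upper 1.5 * IQR fence as candidate outliers
q1, q3 = toy["price"].quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = toy[toy["price"] > upper_fence]

print(missing_counts)
print(mean_skipping_nan, mean_if_zero_filled)
print(outliers)
```

Note that summary statistics silently ignoring NaN does not mean the model will: any NaN left in the data is passed through unchanged when it is converted to a tensor, so blanks need an explicit decision (drop or impute) before training.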

In [None]:
# split into train, dev and test datasets
import tensorflow as tf
observations = len(dataset)

# drop any non numeric cols here
dataset_tf = tf.convert_to_tensor(dataset)

# put train dev test split into params

# maybe convert into a matrix with clear labelling for computational processing
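
The later cells refer to `x_train`, `y_train`, `x_dev` and `y_dev`, which are not yet defined at this point. One possible way to produce them is sketched below with scikit-learn's `train_test_split`. The toy data, the `price` target column, and the 80/10/10 ratio are assumptions for illustration; the real ratio is intended to come from `params`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for `dataset`; column names are assumptions
rng = np.random.default_rng(0)
toy = pd.DataFrame(
    rng.normal(size=(100, 4)),
    columns=["price", "floor_area", "rooms", "road_dist"],
)

features = toy.drop(columns="price")
target = toy["price"]

# First hold out 20% of the rows, then halve that hold-out into dev and test
x_train, x_hold, y_train, y_hold = train_test_split(
    features, target, test_size=0.2, random_state=0
)
x_dev, x_test, y_dev, y_test = train_test_split(
    x_hold, y_hold, test_size=0.5, random_state=0
)

print(len(x_train), len(x_dev), len(x_test))
```

Fixing `random_state` keeps the split reproducible between runs, which matters when comparing the baseline and alternative models on the same dev set.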

## Benchmarking

To evaluate the performance of my neural network, I will run a single layer NN and a random forest as baseline models. I will then build two alternative models: a deep neural network and a Bayesian model. We optimise on the mean squared error (MSE) and report both the MSE and the root mean squared error (RMSE). Because the square root is monotonically increasing, minimising the MSE necessarily minimises the RMSE as well, with less computation.
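
The relationship between MSE and RMSE can be checked numerically. The sketch below uses invented predictions from two hypothetical models (`model_a`, `model_b`) and shows that ranking by MSE and ranking by RMSE select the same model.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between two equal-length arrays."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


# Invented values, for illustration only
y_true = np.array([200.0, 250.0, 300.0])
candidates = {
    "model_a": np.array([210.0, 240.0, 310.0]),
    "model_b": np.array([230.0, 250.0, 260.0]),
}

mse_scores = {name: mse(y_true, pred) for name, pred in candidates.items()}
# RMSE is just the square root of MSE, so it preserves the ordering
rmse_scores = {name: score ** 0.5 for name, score in mse_scores.items()}

best_by_mse = min(mse_scores, key=mse_scores.get)
best_by_rmse = min(rmse_scores, key=rmse_scores.get)
print(best_by_mse, best_by_rmse)
```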

### Random Forest


In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error


rfr = RandomForestRegressor().fit(x_train, y_train)
rfr_pred = rfr.predict(x_dev)
mse = mean_squared_error(y_dev, rfr_pred)
rmse = mse**.5

# Random forest regressor parameter tuning
grid = {
    'n_estimators': [200, 300, 400, 500],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [3,4,5,6,7],
    'random_state': [18]

}
CV_rfr = GridSearchCV(
    estimator=RandomForestRegressor(),
    param_grid=grid, cv=5
)
CV_rfr.fit(x_train, y_train)
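
Once the grid search has been fitted, the best parameter combination can be read from the search object and scored on the dev set. The sketch below is self-contained and runs on synthetic data with a deliberately small grid so that it finishes quickly. The data, the grid values, and the `neg_mean_squared_error` scoring choice are assumptions, not the notebook's actual configuration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic regression data standing in for the housing dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=18)
x_train, x_dev, y_train, y_dev = train_test_split(
    X, y, test_size=0.2, random_state=18
)

# Small illustrative grid; the real search above covers more values
small_grid = {"n_estimators": [20, 40], "max_depth": [3, 5]}
search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=18),
    param_grid=small_grid,
    scoring="neg_mean_squared_error",  # select on MSE, matching the text
    cv=3,
)
search.fit(x_train, y_train)

# best_estimator_ is refit on the full training set with the best parameters
best_params = search.best_params_
dev_pred = search.best_estimator_.predict(x_dev)
dev_mse = mean_squared_error(y_dev, dev_pred)
dev_rmse = dev_mse ** 0.5
print(best_params, dev_mse, dev_rmse)
```

Passing `random_state` to the estimator directly, rather than as a single-value grid entry, has the same effect and keeps the grid limited to parameters that are actually being tuned.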




### Single Layer Neural Network

In [None]:
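# set up single layer NN
# no metric import is needed here: the metric is passed to compile() by name,
# and importing keras' mean_squared_error would shadow the sklearn function used above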

normalizer = tf.keras.layers.Normalization(axis=-1)
normalizer.adapt(dataset_tf)

def get_basic_model():
  model = tf.keras.Sequential([
    normalizer,
    # one hidden layer, matching the "single layer" baseline described above
    tf.keras.layers.Dense(10, activation='relu'),
    tf.keras.layers.Dense(1)
  ])

  model.compile(optimizer='adam',
                loss=tf.keras.losses.MeanSquaredError(),
                metrics=['mean_squared_error'])
  return model

single_nn = get_basic_model()
single_nn.fit(numeric_features, target, epochs=15, batch_size = BATCH_SIZE)