# Shallow regression for vector data

Reminder: This is a supervised learning problem. Because the labels (targets) are real values, it is a regression task.

Data and goal: In this notebook we read the zip code data produced by **02_vector_preparations** and create different machine learning models for predicting the median income per zip code area from population and spatial features. We will adjust parameters to improve the performance on a validation dataset and finally assess the models' error metrics on a test dataset.

Notebook contents:

0. Environment preparation
1. Reading the data
2. Function definitions
3. Lasso regression
4. Random Forest Regressor
5. Task 
6. Model comparison



## 0. Environment preparation

In [None]:
import os
import time
import pandas as pd
from math import sqrt
# machine learning models
from sklearn.linear_model import Lasso
from sklearn.ensemble import RandomForestRegressor
# error metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error,r2_score


## 1. Reading the data
### 1.1 Define input and output file paths 

In [None]:
username = os.environ.get('USER')
base_directory= f'/scratch/project_2002044/{username}/2022/GeoML'

# inputs, all datasets created in 02_vector_preparation.ipynb
data_directory = os.path.join(base_directory,'data')
preprocessed_data_directory = os.path.join(data_directory,'preprocessed_regression')
scaled_train_dataset_name = os.path.join(preprocessed_data_directory,'scaled_train_zip_code_data.csv')
scaled_test_dataset_name = os.path.join(preprocessed_data_directory,'scaled_test_zip_code_data.csv')
scaled_val_dataset_name = os.path.join(preprocessed_data_directory,'scaled_val_zip_code_data.csv')
train_label_name = os.path.join(preprocessed_data_directory,'train_income_labels.pkl')
test_label_name = os.path.join(preprocessed_data_directory,'test_income_labels.pkl')
val_label_name = os.path.join(preprocessed_data_directory,'val_income_labels.pkl')

# read also unscaled datasets
train_dataset_name = os.path.join(preprocessed_data_directory,'train_zip_code_data.csv')
test_dataset_name = os.path.join(preprocessed_data_directory,'test_zip_code_data.csv')
val_dataset_name = os.path.join(preprocessed_data_directory,'val_zip_code_data.csv')

# outputs
results_directory = os.path.join(data_directory,'regression_results')

def create_dir(directory_name):
    if not os.path.exists(directory_name):
        os.makedirs(directory_name)
create_dir(results_directory)

metrics_filename = os.path.join(results_directory,'shallow_metrics.csv')

In [None]:
# for reproducible results when randomness is involved, we can set a random seed
random_seed = 42

### 1.2 Reading the data

In [None]:
# read train, validation and test datasets
scaled_x_train = pd.read_csv(scaled_train_dataset_name)
scaled_x_val = pd.read_csv(scaled_val_dataset_name)
x_train = pd.read_csv(train_dataset_name)
x_val = pd.read_csv(val_dataset_name)
y_train = pd.read_pickle(train_label_name)
y_val = pd.read_pickle(val_label_name)
scaled_x_test = pd.read_csv(scaled_test_dataset_name)
x_test = pd.read_csv(test_dataset_name)
y_test = pd.read_pickle(test_label_name)

## 2. Function definitions

For the regression exercises we will generate the following error metrics:
Root Mean Square Error (RMSE), Mean Absolute Error (MAE), and coefficient of determination (R^2).
* RMSE and MAE: expressed in the unit of the income value (euros); the smaller the better
* RMSE: "punishes" larger errors more than smaller ones
* MAE: more intuitive interpretation, "all errors are equal"
* R^2: measures "goodness of fit", has no unit; 1 would be a perfect fit
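
To make the three metrics concrete, the following sketch computes them by hand for four made-up income values (the numbers are illustrative only and are not taken from the zip code data). It is meant to mirror what `mean_squared_error`, `mean_absolute_error` and `r2_score` compute with their default settings.

```python
from math import sqrt

# made-up incomes in euros, for illustration only
y_true = [20000, 25000, 30000, 35000]
y_pred = [21000, 24000, 30000, 39000]

# prediction errors per sample: [1000, -1000, 0, 4000]
errors = [p - t for p, t in zip(y_pred, y_true)]
n = len(errors)

# MAE: average absolute error, every error counts the same
mae = sum(abs(e) for e in errors) / n

# RMSE: errors are squared before averaging, so large errors weigh more
rmse = sqrt(sum(e ** 2 for e in errors) / n)

# R^2: 1 minus the ratio of residual variation to total variation around the mean
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(f"MAE: {mae:.0f}, RMSE: {rmse:.0f}, R2: {r2:.3f}")
```

The single 4000 euro error pulls the RMSE (about 2121) well above the MAE (1500), which is the "punishing" effect described above.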



In [None]:
# calculating error metrics for regression predictions

def calculate_error_metrics(test_labels,label_predictions, model_name):

    # Assessing the performance of the model with root mean squared error, mean absolute error and coefficient of determination r2
    rmse = sqrt(mean_squared_error(test_labels, label_predictions))
    mae = mean_absolute_error(test_labels, label_predictions)
    r2 = r2_score(test_labels, label_predictions)

    # store them in a dictionary
    metrics_dict = dict(zip(['model','RMSE','MAE','R2'],[model_name,rmse,mae,r2]))

    return metrics_dict

def print_error_metrics(metrics_dict, model_name, dataset_name):
    print(f"\nError metrics for {model_name} on the {dataset_name} dataset: \n" +
            f"\t Root mean squared error (RMSE): {round(metrics_dict['RMSE'])} \n" +
            f"\t Mean absolute error (MAE): {round(metrics_dict['MAE'])} \n" +
            f"\t Coefficient of determination (R2): {round(metrics_dict['R2'],4)} \n")

def store_error_metrics(test_labels, label_predictions, model_name, metrics_collection = None):

    metrics_dict = calculate_error_metrics(test_labels, label_predictions, model_name)

    # turn the metrics into a one-row dataframe
    new_row = pd.DataFrame([metrics_dict])

    # in case these are the first results we want to store, the new row becomes the collection
    if metrics_collection is None:
        metrics_collection = new_row
    else:
        metrics_collection = pd.concat([metrics_collection, new_row], ignore_index=True)

    #print_error_metrics(metrics_dict,model_name)

    return metrics_collection

# to time the model training we create a function for model training
def train_model(x_train, y_train, model):
    start_time = time.time()  
    print(model)
    model.fit(x_train,y_train)
    print('Model training took: ', round((time.time() - start_time), 2), ' seconds')
    return model

## 3. Lasso regression 

Lasso regression is one of the simpler approaches and suits problems where only a few features are important. It is also cheaper to compute than the models that follow. Here we use it with default parameters, but you can try other values by following the documentation:

* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html
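
Why lasso suits data with few important features can be sketched with a simplified case. Assuming standardized and mutually uncorrelated features, the lasso coefficient of each feature is its least-squares coefficient shrunk towards zero by `alpha` and set to exactly zero if it is smaller than `alpha` (soft thresholding). The coefficient values and feature names below are hypothetical, and scikit-learn itself fits the general case with coordinate descent rather than this shortcut.

```python
def soft_threshold(ols_coefficient, alpha):
    """Shrink a least-squares coefficient towards zero by alpha, clipping at zero."""
    if ols_coefficient > alpha:
        return ols_coefficient - alpha
    if ols_coefficient < -alpha:
        return ols_coefficient + alpha
    return 0.0

# hypothetical least-squares coefficients: two strong features and two weak ones
ols_coefficients = {'feature_a': 5.0, 'feature_b': -3.0, 'feature_c': 0.4, 'feature_d': -0.2}

# alpha=1.0 is the default of sklearn.linear_model.Lasso, 0.1 is a weaker penalty
shrunk = {}
for alpha in (0.1, 1.0):
    shrunk[alpha] = {name: soft_threshold(value, alpha)
                     for name, value in ols_coefficients.items()}
    dropped = [name for name, value in shrunk[alpha].items() if value == 0.0]
    print(f"alpha={alpha}: features with zero coefficient: {dropped}")
```

With the larger penalty the two weak features drop out of the model entirely, while the strong ones are kept with smaller coefficients.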

In [None]:
lasso = Lasso()
lasso_name = "Lasso Regressor"
lasso.fit(scaled_x_train, y_train)
lasso_predictions_val = lasso.predict(scaled_x_val)


# then we can get some performance metrics
print_error_metrics(calculate_error_metrics(y_val,lasso_predictions_val, lasso_name), lasso_name, 'validation')

# and store the results on the test dataset for later model comparison
lasso_predictions_test = lasso.predict(scaled_x_test)
print_error_metrics(calculate_error_metrics(y_test,lasso_predictions_test, lasso_name), lasso_name, 'test')
# these are the first stored results, so metrics_collection does not exist yet and the function creates it
metrics_collection = store_error_metrics(y_test, lasso_predictions_test, lasso_name)



## 4. Random Forest Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#forest


In [None]:
random_forest = RandomForestRegressor(n_estimators=100, max_depth=None, max_features=None, random_state=random_seed, n_jobs=1, verbose=1)
random_forest_name = "Random Forest Regressor"

random_forest = train_model(x_train, y_train,random_forest)
random_forest_predictions = random_forest.predict(x_val)

# then we can get some performance metrics
print_error_metrics(calculate_error_metrics(y_val,random_forest_predictions, random_forest_name), random_forest_name, 'validation')

# and store the results on the test dataset for later model comparison, after we are done optimizing the parameters
random_forest_predictions = random_forest.predict(x_test)
print_error_metrics(calculate_error_metrics(y_test,random_forest_predictions, random_forest_name), random_forest_name, 'test')
metrics_collection = store_error_metrics(y_test,random_forest_predictions, random_forest_name, metrics_collection)

In [None]:
#let's take a look at feature importances
# create dataframe of feature importance from random forest model
adf = pd.DataFrame(zip(x_train.columns, random_forest.feature_importances_), columns= ['name','importance'])
# sort the dataframe to have highest ranking features first
adf.sort_values('importance', ascending=False)

## 5. Task

Study the [scikit-learn documentation](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning) to find other regression models and/or experiment with different hyperparameter values. Can you improve on the performance metrics or make the training faster?
Report the best results (on the test set, see below) and note the model and parameters used, so that others can reproduce the results.
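
The workflow behind this task (fit on the training set, pick the hyperparameter on the validation set, report once on the test set) can be sketched as follows. This is a minimal illustration under stated assumptions: the one-feature ridge model `fit_ridge_1d`, the `*_toy` datasets and the candidate penalties are all hypothetical stand-ins, and in the notebook you would use a scikit-learn regressor and the real datasets instead.

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def fit_ridge_1d(x, y, penalty):
    """Closed-form slope of a one-feature ridge regression without intercept (toy model)."""
    return sum(a * b for a, b in zip(x, y)) / (sum(a * a for a in x) + penalty)

# made-up data, for illustration only
x_train_toy, y_train_toy = [1.0, 2.0, 3.0, 4.0], [2.5, 4.4, 6.8, 8.6]
x_val_toy, y_val_toy = [1.5, 2.5, 3.5], [3.2, 5.3, 7.3]
x_test_toy, y_test_toy = [2.0, 3.0], [4.2, 6.3]

# 1) fit one model per candidate value and score it on the validation set
val_scores = {}
for penalty in (0.0, 1.0, 10.0):
    slope = fit_ridge_1d(x_train_toy, y_train_toy, penalty)
    val_scores[penalty] = rmse(y_val_toy, [slope * x for x in x_val_toy])

# 2) keep the candidate with the lowest validation RMSE
best_penalty = min(val_scores, key=val_scores.get)

# 3) only now touch the test set, once, with the chosen value
best_slope = fit_ridge_1d(x_train_toy, y_train_toy, best_penalty)
test_rmse = rmse(y_test_toy, [best_slope * x for x in x_test_toy])

print(f"best penalty: {best_penalty}, test RMSE: {test_rmse:.3f}")
```

Choosing on the validation set and reporting on the test set keeps the reported error an honest estimate, since the test data never influenced the choice of parameters.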

You can also run each of the cells below twice to try some other ensemble regressors: the first run loads the code from the file into the cell, and the second run executes it.

In [None]:
%load ada_boost.py

In [None]:
%load bagging.py

In [None]:
%load extra_trees.py

In [None]:
%load grad_boost.py

## 6. Model comparison

Let's compare our models' performance on the test dataset. Make sure to store your results in metrics_collection (as done for the random forest regressor) before running the cell below.

In [None]:
print(metrics_collection.sort_values(by=['RMSE'], ascending=False))

# store comparison table 
metrics_collection.to_csv(metrics_filename)