<div class="alert alert-block alert-info">

**TODO:**
* check all texts
* fix comments
* try scaled/unscaled data for trees
* have it without outputs on Github
* remove all but one decision tree and one non-decision tree models and make it task to find others and adjust their hyperparameters; provide "solution" script with % load solutions.py (note that this cell has to be run twice then)
* Add info on used data and goal of exercise
* make sure to make it possible to go through in 30 min
    
</div>

# Shallow regression for vector data

Reminder: we are doing supervised learning, and our labels (targets) are real values, so this is a regression task.


In this notebook we read the zip code data produced by **02_vector_preparations** and create different machine learning models for
predicting the average zip code income from population and spatial features.

We will assess the models' error metrics on a test dataset, and also predict the income for all zip codes and write the result to a geopackage for closer inspection.

Notebook contents:
0. Environment preparation
1. Reading the data
2. Function definitions
3. Baseline naive approach
4. Baseline linear regression
5. Gradient Boosting
6. Random Forest
7. Extra Trees
8. Bagging
9. AdaBoost
10. Comparing the models
11. Task


## 0. Environment preparation

In [1]:
import time
import geopandas as gpd
import pandas as pd
from math import sqrt
import os
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, BaggingRegressor, ExtraTreesRegressor, AdaBoostRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import seaborn as sns 

## 1. Reading the data
### 1.1 Define input and output file paths 

In [6]:
username = os.environ.get('USER')
base_directory= f'/scratch/project_2002044/{username}/2022/GeoML'

data_directory = os.path.join(base_directory,'data')

preprocessed_data_directory = os.path.join(data_directory,'preprocessed_regression')
train_dataset_name = os.path.join(preprocessed_data_directory,'train_zip_code_data.csv')
test_dataset_name = os.path.join(preprocessed_data_directory,'test_zip_code_data.csv')
val_dataset_name = os.path.join(preprocessed_data_directory,'val_zip_code_data.csv')
train_label_name = os.path.join(preprocessed_data_directory,'train_income_labels.pkl')
test_label_name = os.path.join(preprocessed_data_directory,'test_income_labels.pkl')
val_label_name = os.path.join(preprocessed_data_directory,'val_income_labels.pkl')



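# Path to the zip code geopackage file that was prepared by vectorDataPreparations.py
# (assumed to be stored next to the preprocessed CSV files; adjust if yours is elsewhere)
input_geopackage_path = os.path.join(preprocessed_data_directory,"zip_code_data_after_preparation.gpkg")

# Output file. You can change the name to identify different regression models
output_geopackage_path = os.path.join(preprocessed_data_directory,"median_income_per_zipcode_shallow_model.gpkg")

# Table of error metrics for all models (the results directory is an assumption; it is created if missing)
results_directory = os.path.join(base_directory,'results')
os.makedirs(results_directory, exist_ok=True)
metrics_filename = os.path.join(results_directory,'shallow_metrics.csv')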

### 1.2 Reading the data

In [7]:
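# read train and test datasets
# the features are stored as CSV, so we read them with pandas to get numeric columns
# (gpd.read_file would return every CSV column as a string, which the models cannot use)
x_train = pd.read_csv(train_dataset_name)
x_test = pd.read_csv(test_dataset_name)
y_train = pd.read_pickle(train_label_name)
y_test = pd.read_pickle(test_label_name)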

## 2. Function definitions



In [43]:
# calculating error metrics for regression predictions

def calculate_error_metrics(test_labels,label_predictions, model_name):

    # Assessing the performance of the model with root mean squared error, mean absolute error and coefficient of determination r2
    rmse = sqrt(mean_squared_error(test_labels, label_predictions))
    mae = mean_absolute_error(test_labels, label_predictions)
    r2 = r2_score(test_labels, label_predictions)

    # store them in a dictionary
    metrics_dict = dict(zip(['model','RMSE','MAE','R2'],[model_name,rmse,mae,r2]))

    return metrics_dict


def print_and_store_error_metrics(test_labels, label_predictions, model_name, metrics_collection = None):

    metrics_dict = calculate_error_metrics(test_labels, label_predictions, model_name)

    new_row = pd.DataFrame([metrics_dict])

    # DataFrame.append was removed in pandas 2.0, so we concatenate instead;
    # in case these are the first results, the new row becomes the collection
    if metrics_collection is None:
        metrics_collection = new_row
    else:
        metrics_collection = pd.concat([metrics_collection, new_row], ignore_index=True)

    print(f"\nError metrics for {model_name} on the test dataset: \n" +
          f"\t Root mean squared error (RMSE): {round(metrics_dict['RMSE'])} \n" +
          f"\t Mean absolute error (MAE): {round(metrics_dict['MAE'])} \n" +
          f"\t Coefficient of determination (R2): {round(metrics_dict['R2'],4)} \n")

    return metrics_collection

# to time the model training we create a function for model training
def train_model(x_train, y_train, model):
    start_time = time.time()  
    print(model)
    model.fit(x_train,y_train)
    print('Model training took: ', round((time.time() - start_time), 2), ' seconds')
    return model
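
To make the three metrics concrete, here is a small worked example. The numbers are made up for illustration and are not related to the zip code data; the arithmetic is spelled out in the comments.

```python
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# toy values, made up for illustration only
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]

# the errors are 1, 0 and -2, so the squared errors are 1, 0 and 4
rmse = sqrt(mean_squared_error(y_true, y_pred))  # sqrt(5 / 3), about 1.29
mae = mean_absolute_error(y_true, y_pred)        # (1 + 0 + 2) / 3 = 1.0
# R2 = 1 - SS_res / SS_tot = 1 - 5 / 8 = 0.375 (the mean of y_true is 5)
r2 = r2_score(y_true, y_pred)

print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}, R2: {r2:.3f}")
```

RMSE punishes the single large error more than MAE does, which is why it is the larger of the two here.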

## 3. Baseline naive approach

To judge how well machine learning models perform on our dataset, we first create some baseline results.
One way to get a baseline is to take the median of the labels in the training dataset and use it as the prediction for every zip code. This is very naive and not a realistic model, but any useful model has to beat it.



In [44]:
# the median of the training labels provides the predicted label
naive_prediction_value = y_train.median()
naive_name = "Naive median prediction"
print(naive_prediction_value)
# the naive prediction value still needs to be repeated to fit with the features
naive_predictions = pd.DataFrame([naive_prediction_value]* y_test.shape[0])
# then we can get some error metrics
metrics_collection = print_and_store_error_metrics(y_test,naive_predictions, naive_name)

21200.0

Error metrics for Naive median prediction on the test dataset: 
	 Root mean squared error (RMSE): 3254 
	 Mean absolute error (MAE): 2591 
	 Coefficient of determination (R2): -0.0339 
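
A slightly negative R2 is expected for a constant prediction: R2 is exactly zero when predicting the mean of the test labels, and any other constant does worse. As a sketch of an equivalent approach, scikit-learn's `DummyRegressor` produces the same kind of baseline. The toy labels below are made up for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# toy labels, made up for illustration only
toy_y_train = np.array([18000.0, 20000.0, 21000.0, 23000.0, 40000.0])
toy_y_test = np.array([19000.0, 22000.0, 25000.0])

# DummyRegressor ignores the features, so zeros are enough here
toy_x_train = np.zeros((len(toy_y_train), 1))
toy_x_test = np.zeros((len(toy_y_test), 1))

dummy = DummyRegressor(strategy="median")
dummy.fit(toy_x_train, toy_y_train)
dummy_predictions = dummy.predict(toy_x_test)

print(dummy_predictions)                              # the training median, repeated
dummy_r2 = r2_score(toy_y_test, dummy_predictions)
print(dummy_r2)                                       # negative: worse than the test mean
```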





## 4. Baseline linear regression

Another baseline approach for regression is linear regression. Compared to any of the following models it is still easy to compute.

In [49]:

linear = LinearRegression()
linear_name = "Linear Regression"
linear.fit(x_train, y_train)
linear_predictions = linear.predict(x_test)

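# get feature importance via their coefficients
feature_names = x_train.columns
# ravel() gives a flat array whether y_train is a Series or a one-column DataFrame
model_coefficients = linear.coef_.ravel()

coefficients_df = pd.DataFrame(data = model_coefficients, index = feature_names, columns = ['Coefficient value'])
print(coefficients_df)
# largest absolute values show the most important features,
# but only if the features are on comparable scales (e.g. standardized)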

metrics_collection = print_and_store_error_metrics(y_test,linear_predictions, linear_name, metrics_collection)
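
Reading importance from coefficients only works when the features share a scale. The sketch below uses two synthetic features (the names are hypothetical, not columns of the zip code data) to show how the ranking can flip once the features are standardized with `StandardScaler`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_samples = 200
# two synthetic features on very different scales (hypothetical names)
small_scale = rng.normal(0, 1, n_samples)
large_scale = rng.normal(0, 1000, n_samples)
features = np.column_stack([small_scale, large_scale])
# large_scale contributes roughly ten times more variation to the target
target = 1.0 * small_scale + 0.01 * large_scale + rng.normal(0, 0.1, n_samples)

raw_coefficients = LinearRegression().fit(features, target).coef_
scaled_features = StandardScaler().fit_transform(features)
scaled_coefficients = LinearRegression().fit(scaled_features, target).coef_

print("raw:   ", raw_coefficients)     # small_scale looks more important
print("scaled:", scaled_coefficients)  # large_scale turns out to matter more
```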




## 5. Gradient Boosting Regressor


* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#regression

In [48]:
grad_boost = GradientBoostingRegressor(n_estimators=30, learning_rate=0.1,verbose=1)
grad_boost_name = "Gradient Boosting Regressor"
grad_boost = train_model(x_train, y_train,grad_boost)
grad_boost_predictions = grad_boost.predict(x_test)
metrics_collection = print_and_store_error_metrics(y_test,grad_boost_predictions, grad_boost_name, metrics_collection)


GradientBoostingRegressor(n_estimators=30, verbose=1)
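
Gradient boosting adds trees one after another, each one correcting the errors of the ensemble so far. A minimal sketch of how the test error develops with the number of trees, using `staged_predict` on synthetic data (not the zip code data):

```python
from math import sqrt
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic regression data, not the zip code data
features, target = make_friedman1(n_samples=300, random_state=0)
f_train, f_test, t_train, t_test = train_test_split(features, target, random_state=0)

booster = GradientBoostingRegressor(n_estimators=30, learning_rate=0.1, random_state=0)
booster.fit(f_train, t_train)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees
staged_rmse = [sqrt(mean_squared_error(t_test, stage_predictions))
               for stage_predictions in booster.staged_predict(f_test)]

for n_trees in (1, 10, 20, 30):
    print(f"{n_trees:2d} trees: test RMSE {staged_rmse[n_trees - 1]:.2f}")
```

If the error is still falling at the last stage, more trees or a higher learning rate may help; if it has flattened out, fewer trees would train faster at similar quality.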



## 6. Random Forest Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#forest

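Note: like other tree-based models, a random forest cannot extrapolate. Its predictions are averages of training labels, so they always stay within the range of target values seen during training. Random forests are often introduced as classifiers, but they work for regression as well.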
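
The extrapolation limit is easy to see on toy data. In this sketch (made-up data following y = 2x) a random forest and a linear regression are both asked to predict far outside the training range:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# toy data following y = 2x for x = 0..9, so the largest training label is 18
toy_x = np.arange(10, dtype=float).reshape(-1, 1)
toy_y = 2.0 * toy_x.ravel()
far_away = np.array([[100.0]])

forest = RandomForestRegressor(n_estimators=30, random_state=0).fit(toy_x, toy_y)
line = LinearRegression().fit(toy_x, toy_y)

forest_prediction = forest.predict(far_away)[0]
line_prediction = line.predict(far_away)[0]

print(forest_prediction)  # stays at or below 18, the largest training label
print(line_prediction)    # about 200, following the linear trend
```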

In [None]:
random_forest = RandomForestRegressor(n_estimators=30,verbose=1)
random_forest_name = "Random Forest Regressor"

random_forest = train_model(x_train, y_train,random_forest)
random_forest_predictions = random_forest.predict(x_test)
metrics_collection = print_and_store_error_metrics(y_test,random_forest_predictions, random_forest_name, metrics_collection)

In [None]:

#let's take a look at feature importances
# the fitted model has no .columns attribute, so the feature names come from x_train
feature_importances = pd.DataFrame(random_forest.feature_importances_, index=x_train.columns, columns=['importance']).sort_values('importance', ascending=False)
feature_importances.head()
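
The `feature_importances_` of a forest are impurity-based and computed on the training data. As a complementary check, permutation importance measures how much the score on held-out data drops when one feature is shuffled. A minimal sketch on synthetic data with hypothetical feature names:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# synthetic features with hypothetical names; only "informative" drives the target
toy_features = pd.DataFrame(rng.normal(size=(300, 3)),
                            columns=["informative", "noise_a", "noise_b"])
toy_target = 5.0 * toy_features["informative"] + rng.normal(0, 0.5, 300)
f_train, f_test, t_train, t_test = train_test_split(toy_features, toy_target, random_state=0)

forest = RandomForestRegressor(n_estimators=30, random_state=0).fit(f_train, t_train)
result = permutation_importance(forest, f_test, t_test, n_repeats=5, random_state=0)

permutation_importances = pd.Series(result.importances_mean,
                                    index=toy_features.columns).sort_values(ascending=False)
print(permutation_importances)
```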

## 7. Extra Trees Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html

In [None]:
extra_trees = ExtraTreesRegressor(n_estimators=30,verbose=1)
extra_trees_name = "Extra Trees Regressor"

extra_trees = train_model(x_train, y_train,extra_trees)
extra_trees_predictions = extra_trees.predict(x_test)
metrics_collection = print_and_store_error_metrics(y_test,extra_trees_predictions, extra_trees_name, metrics_collection)

## 8. Bagging Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#bagging

In [None]:
bagging = BaggingRegressor(n_estimators=30,verbose=1)
bagging_name = "Bagging Regressor"

bagging = train_model(x_train, y_train,bagging)
bagging_predictions = bagging.predict(x_test)
metrics_collection = print_and_store_error_metrics(y_test,bagging_predictions, bagging_name, metrics_collection)

## 9. AdaBoost Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#adaboost

In [None]:
ada_boost = AdaBoostRegressor(n_estimators=30)
ada_boost_name = "AdaBoost Regressor"

ada_boost = train_model(x_train, y_train,ada_boost)
ada_boost_predictions = ada_boost.predict(x_test)
metrics_collection = print_and_store_error_metrics(y_test,ada_boost_predictions, ada_boost_name, metrics_collection)

## 10. Model comparison

In [None]:
# sort so that the model with the lowest RMSE (the best one) comes first
print(metrics_collection.sort_values(by=['RMSE'], ascending=True))

# store comparison table 
metrics_collection.to_csv(metrics_filename)

## 11. Task

Study the scikit-learn documentation of one of the models used above and experiment with different hyperparameter values. Can you lower the error or make the training faster?
Report the best results and note down the parameters used, so that others can reproduce them.
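
One possible way to organise such experiments is a cross-validated grid search. The sketch below is only a starting point, not a solution: it runs on synthetic data as a stand-in for `x_train` and `y_train`, and the parameter grid is a small, arbitrary example.

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic stand-in for x_train / y_train
features, target = make_friedman1(n_samples=200, random_state=0)

# a deliberately small, illustrative grid
parameter_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),  # fixed seed so results can be reproduced
    parameter_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximises scores, hence the sign
    cv=3,
)
search.fit(features, target)

print("best parameters:", search.best_params_)
print("cross-validated RMSE:", -search.best_score_)
```

Fixing `random_state` and printing `best_params_` covers the reproducibility part of the task; the search itself only uses the training data, so the test dataset stays untouched for the final comparison.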