<div class="alert alert-block alert-info">

**TODO:**
* check all texts
* fix comments
* testing here with validation; final tests with test set
* try scaled/unscaled data for trees
* have it without outputs on Github
* remove all but one decision tree and one non-decision tree models and make it task to find others and adjust their hyperparameters; provide "solution" script with % load solutions.py (note that this cell has to be run twice then)
* Add info on used data and goal of exercise
* make sure to make it possible to go through in 30 min
    
</div>

# Shallow regression for vector data

Reminder: this is supervised learning. We have labels (targets), and because they are real-valued, the task is regression.


In this notebook we read the zip code data produced by **02_vector_preparations** and create different machine learning models for
predicting the average zip code income from population and spatial features.

We will assess the models' error metrics on a test dataset, and also predict the income for all zip codes and write the result to a GeoPackage for closer inspection.

Notebook contents:
0. Environment preparation
1. Reading the data
2. Function definitions
3. Baseline naive approach
4. Baseline linear regression
5. Gradient Boosting
6. Random Forest
7. Extra Trees
8. Bagging
9. AdaBoost
10. Comparing the models
11. Task


## 0. Environment preparation

In [41]:
import time
import geopandas as gpd
import pandas as pd
from math import sqrt
import os
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, BaggingRegressor, ExtraTreesRegressor, AdaBoostRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

import seaborn as sns 

## 1. Reading the data
### 1.1 Define input and output file paths 

In [42]:
username = os.environ.get('USER')
base_directory = f'/scratch/project_2002044/{username}/2022/GeoML'

data_directory = os.path.join(base_directory,'data')

preprocessed_data_directory = os.path.join(data_directory,'preprocessed_regression')
train_dataset_name = os.path.join(preprocessed_data_directory,'train_zip_code_data.csv')
test_dataset_name = os.path.join(preprocessed_data_directory,'test_zip_code_data.csv')
val_dataset_name = os.path.join(preprocessed_data_directory,'val_zip_code_data.csv')
train_label_name = os.path.join(preprocessed_data_directory,'train_income_labels.pkl')
test_label_name = os.path.join(preprocessed_data_directory,'test_income_labels.pkl')
val_label_name = os.path.join(preprocessed_data_directory,'val_income_labels.pkl')


metrics_filename = os.path.join(base_directory,'shallow_regression','shallow_metrics.csv')

### 1.2 Reading the data

In [43]:
# read train and test datasets
x_train = pd.read_csv(train_dataset_name)
x_test = pd.read_csv(test_dataset_name)
y_train = pd.read_pickle(train_label_name)
y_test = pd.read_pickle(test_label_name)

print(x_train)

       euref_x   euref_y  pinta_ala  he_vakiy  he_naiset  he_miehet   he_kika  \
0     0.586106  0.670425   0.037669  0.013195   0.012128   0.014689  0.388889   
1     0.967504  0.253283   0.009177  0.001619   0.001946   0.001461  0.703704   
2     0.651690  0.380416   0.014208  0.036806   0.032492   0.042144  0.277778   
3     0.642178  1.000000   0.405371  0.023048   0.020559   0.026225  0.444444   
4     0.460438  0.040723   0.000849  0.175024   0.161359   0.191417  0.166667   
...        ...       ...        ...       ...        ...        ...       ...   
1757  0.695743  0.119535   0.012365  0.008093   0.007523   0.008998  0.462963   
1758  0.485773  0.356373   0.052087  0.045251   0.039367   0.052449  0.370370   
1759  0.521515  0.334432   0.032468  0.008551   0.007783   0.009690  0.462963   
1760  0.246739  0.029551   0.005577  0.009677   0.009145   0.010536  0.407407   
1761  0.704285  0.195902   0.005151  0.003730   0.003502   0.004230  0.425926   

        he_0_2    he_3_6   

## 2. Function definitions



In [44]:
# calculating error metrics for regression predictions

def calculate_error_metrics(test_labels,label_predictions, model_name):

    # Assessing the performance of the model with root mean squared error, mean absolute error and coefficient of determination r2
    rmse = sqrt(mean_squared_error(test_labels, label_predictions))
    mae = mean_absolute_error(test_labels, label_predictions)
    r2 = r2_score(test_labels, label_predictions)

    # store them in a dictionary
    metrics_dict = dict(zip(['model','RMSE','MAE','R2'],[model_name,rmse,mae,r2]))

    return metrics_dict


def print_and_store_error_metrics(test_labels, label_predictions, model_name, metrics_collection = None):

    metrics_dict = calculate_error_metrics(test_labels, label_predictions, model_name)

    # DataFrame.append was removed in pandas 2.0, so we build a one-row dataframe and concatenate it;
    # in case these are the first results, that row starts the collection
    metrics_row = pd.DataFrame([metrics_dict])
    if metrics_collection is None:
        metrics_collection = metrics_row
    else:
        metrics_collection = pd.concat([metrics_collection, metrics_row], ignore_index=True)


    print(f"\nError metrics for {model_name} on the test dataset: \n" +
          f"\t Root mean squared error (RMSE): {round(metrics_dict['RMSE'])} \n" +
          f"\t Mean absolute error (MAE): {round(metrics_dict['MAE'])} \n" +
          f"\t Coefficient of determination (R2): {round(metrics_dict['R2'],4)} \n")

    return metrics_collection

# to time the model training we create a function for model training
def train_model(x_train, y_train, model):
    start_time = time.time()  
    print(model)
    model.fit(x_train,y_train)
    print('Model training took: ', round((time.time() - start_time), 2), ' seconds')
    return model
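
To get a feeling for the three metrics, here is a small worked example. The label and prediction values are made up for illustration and are not taken from the zip code dataset. Three predictions are exact and one is off by 4000, which shows how RMSE penalizes a single large error more strongly than MAE does.

```python
from math import sqrt

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# hypothetical toy values, not taken from the zip code dataset
y_true = [20000, 22000, 24000, 26000]
y_pred = [20000, 22000, 24000, 30000]

# errors are 0, 0, 0 and -4000, so the mean squared error is 16_000_000 / 4
rmse = sqrt(mean_squared_error(y_true, y_pred))
# mean absolute error is 4000 / 4
mae = mean_absolute_error(y_true, y_pred)
# R2 = 1 - SS_res / SS_tot = 1 - 16_000_000 / 20_000_000
r2 = r2_score(y_true, y_pred)

print(rmse, mae, r2)
```

RMSE is never smaller than MAE; the two are equal only when all errors have the same magnitude.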

## 3. Baseline naive approach

To determine how well machine learning models perform on our dataset, we first create some baseline results.
One way to get a baseline is to take the median of the y labels in the training dataset and use it as the prediction for every sample. This is very naive and not a realistic model, but it gives a reference point that any useful model should beat.
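
As a side note, scikit-learn ships the same idea as `DummyRegressor(strategy="median")`. The sketch below uses made-up toy labels (an assumption for illustration only) and shows why such a baseline typically ends up with an R2 at or slightly below zero: a constant prediction can only reach R2 = 0 if it happens to equal the mean of the labels it is evaluated on.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# hypothetical toy labels; the dummy model ignores the feature values
x_toy_train = np.zeros((5, 1))
y_toy_train = np.array([18000, 20000, 21000, 23000, 40000])
x_toy_test = np.zeros((3, 1))
y_toy_test = np.array([19000, 22000, 25000])

# always predicts the median of the training labels (21000 here)
median_baseline = DummyRegressor(strategy="median")
median_baseline.fit(x_toy_train, y_toy_train)
toy_predictions = median_baseline.predict(x_toy_test)

# the test mean is 22000, so the constant 21000 scores below zero
toy_r2 = r2_score(y_toy_test, toy_predictions)
print(toy_predictions, toy_r2)
```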



In [45]:
# the median of the training labels serves as the predicted label
naive_prediction_value = y_train.median()
naive_name = "Naive median prediction"
print(naive_prediction_value)
# repeat the naive prediction value once per test sample
naive_predictions = pd.DataFrame([naive_prediction_value] * y_test.shape[0])
# then we can compute the error metrics
metrics_collection = print_and_store_error_metrics(y_test, naive_predictions, naive_name)

21200.0

Error metrics for Naive median prediction on the test dataset: 
	 Root mean squared error (RMSE): 3254 
	 Mean absolute error (MAE): 2591 
	 Coefficient of determination (R2): -0.0339 





## 4. Baseline linear regression

Another baseline for regression is linear regression. It is cheap to compute compared with any of the models that follow.

In [46]:

linear = LinearRegression()
linear_name = "Linear Regression"
linear.fit(x_train, y_train)
linear_predictions = linear.predict(x_test)

# get feature importance via their coefficients
feature_names = x_train.columns
model_coefficients = linear.coef_

coefficients_df = pd.DataFrame(data = model_coefficients, index = feature_names, columns = ['Coefficient value'])
print(coefficients_df)
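# if the features are on a comparable scale, the largest absolute values point to the most important features
# (strongly correlated features can make individual coefficients unreliable)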

metrics_collection = print_and_store_error_metrics(y_test,linear_predictions, linear_name, metrics_collection)



                 Coefficient value
euref_x                -501.452407
euref_y               -1819.252663
pinta_ala              -241.241276
he_vakiy               1424.904914
he_naiset              2708.192161
...                            ...
Pohjois-Savo             20.410027
Päijät-Häme            -118.848740
Satakunta              -379.612541
Uusimaa                  64.089219
Varsinais-Suomi        -313.383641

[119 rows x 1 columns]

Error metrics for Linear Regression on the test dataset: 
	 Root mean squared error (RMSE): 1244 
	 Mean absolute error (MAE): 835 
	 Coefficient of determination (R2): 0.8489 





## 5. Gradient Boosting Regressor


* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#regression

In [47]:
grad_boost = GradientBoostingRegressor(n_estimators=30, learning_rate=0.1, verbose=1)
grad_boost_name = "Gradient Boosting Regressor"
grad_boost = train_model(x_train, y_train, grad_boost)
grad_boost_predictions = grad_boost.predict(x_test)
metrics_collection = print_and_store_error_metrics(y_test, grad_boost_predictions, grad_boost_name, metrics_collection)


GradientBoostingRegressor(n_estimators=30, verbose=1)
      Iter       Train Loss   Remaining Time 
         1     9755241.2072            0.79s
         2     8482946.6580            0.72s
         3     7438811.9523            0.68s
         4     6537466.0364            0.65s
         5     5788622.3219            0.62s
         6     5171330.3776            0.59s
         7     4614422.5874            0.57s
         8     4153911.8039            0.54s
         9     3763448.9246            0.52s
        10     3430378.2746            0.50s
        20     1809753.5164            0.24s
        30     1312947.0438            0.00s
Model training took:  0.73  seconds

Error metrics for Gradient Boosting Regressor on the test dataset: 
	 Root mean squared error (RMSE): 1370 
	 Mean absolute error (MAE): 963 
	 Coefficient of determination (R2): 0.8168 
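
The training log above shows the training loss falling with every iteration. Whether the error on held-out data keeps falling too can be checked with `staged_predict`, which yields the predictions after each boosting stage. The sketch below is a minimal illustration on synthetic data from `make_regression` (an assumption; it does not use the zip code data), and the variable names are ours.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic data for illustration only
x_demo, y_demo = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
x_fit, x_held_out, y_fit, y_held_out = train_test_split(x_demo, y_demo, random_state=0)

booster = GradientBoostingRegressor(n_estimators=30, learning_rate=0.1, random_state=0)
booster.fit(x_fit, y_fit)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees
held_out_rmse = [
    np.sqrt(mean_squared_error(y_held_out, stage_predictions))
    for stage_predictions in booster.staged_predict(x_held_out)
]

# number of trees with the lowest held-out RMSE
best_stage = int(np.argmin(held_out_rmse)) + 1
print(best_stage, held_out_rmse[best_stage - 1])
```

If the best stage is the last one, the model may still benefit from more estimators or a higher learning rate.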





## 6. Random Forest Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#forest

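Note: a random forest averages the predictions of its decision trees, so it cannot extrapolate beyond the range of target values in the training dataset. Random forests are also widely used for classification.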
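
The extrapolation limit can be seen in a small experiment. The sketch below uses synthetic data (an assumption for illustration, not the zip code data): both models are trained on `y = 2 * x` for `x` between 0 and 10 and then asked for a prediction at `x = 20`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# synthetic data for illustration: y = 2 * x on x in [0, 10]
x_line = np.linspace(0, 10, 50).reshape(-1, 1)
y_line = 2 * x_line.ravel()

forest = RandomForestRegressor(n_estimators=30, random_state=0).fit(x_line, y_line)
line = LinearRegression().fit(x_line, y_line)

# ask both models about a point far outside the training range
x_outside = np.array([[20.0]])
forest_prediction = forest.predict(x_outside)[0]
line_prediction = line.predict(x_outside)[0]

# the forest stays at or below the largest training target (20),
# while the linear model continues the trend towards 40
print(forest_prediction, y_line.max())
print(line_prediction)
```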

In [48]:
random_forest = RandomForestRegressor(n_estimators=30,verbose=1)
random_forest_name = "Random Forest Regressor"

random_forest = train_model(x_train, y_train,random_forest)
random_forest_predictions = random_forest.predict(x_test)
metrics_collection = print_and_store_error_metrics(y_test,random_forest_predictions, random_forest_name, metrics_collection)

RandomForestRegressor(n_estimators=30, verbose=1)


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Model training took:  1.67  seconds

Error metrics for Random Forest Regressor on the test dataset: 
	 Root mean squared error (RMSE): 1254 
	 Mean absolute error (MAE): 888 
	 Coefficient of determination (R2): 0.8464 



[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    1.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.0s finished


In [50]:

#let's take a look at feature importances
feature_importances = pd.DataFrame(random_forest.feature_importances_,  columns=['importance']).sort_values('importance', ascending=False)
feature_importances.head()

    importance
52    0.582547
6     0.097827
53    0.083306
56    0.026287
33    0.022756
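
The table above identifies features only by their column position. One way to make it easier to read is to use the feature names as the index. The sketch below illustrates this on synthetic data with hypothetical feature names (`signal`, `noise_a`, `noise_b`); in this notebook the same pattern would presumably use `x_train.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# hypothetical features: only "signal" actually drives the target
x_named = pd.DataFrame(rng.random((200, 3)), columns=["signal", "noise_a", "noise_b"])
y_named = 10 * x_named["signal"] + rng.normal(0, 0.1, 200)

named_forest = RandomForestRegressor(n_estimators=30, random_state=0).fit(x_named, y_named)

# pair each importance with its column name, most important first
named_importances = pd.Series(
    named_forest.feature_importances_, index=x_named.columns
).sort_values(ascending=False)
print(named_importances)
```

The importances of a fitted forest sum to 1, so each value can be read as a share of the total.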


## 7. Extra Trees Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html

In [None]:
extra_trees = ExtraTreesRegressor(n_estimators=30,verbose=1)
extra_trees_name = "Extra Trees Regressor"

extra_trees = train_model(x_train, y_train,extra_trees)
extra_trees_predictions = extra_trees.predict(x_test)
metrics_collection = print_and_store_error_metrics(y_test,extra_trees_predictions, extra_trees_name, metrics_collection)

## 8. Bagging Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#bagging

In [None]:
bagging = BaggingRegressor(n_estimators=30,verbose=1)
bagging_name = "Bagging Regressor"

bagging = train_model(x_train, y_train, bagging)
bagging_predictions = bagging.predict(x_test)
metrics_collection = print_and_store_error_metrics(y_test,bagging_predictions, bagging_name, metrics_collection)

## 9. AdaBoost Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#adaboost

In [None]:
ada_boost = AdaBoostRegressor(n_estimators=30)
ada_boost_name = "AdaBoost Regressor"

ada_boost = train_model(x_train, y_train,ada_boost)
ada_boost_predictions = ada_boost.predict(x_test)
metrics_collection = print_and_store_error_metrics(y_test,ada_boost_predictions, ada_boost_name, metrics_collection)

## 10. Model comparison

In [None]:
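# sorted by RMSE in descending order, so the best model is printed last
print(metrics_collection.sort_values(by=['RMSE'], ascending=False))

# store comparison table (create the output directory first if it does not exist)
os.makedirs(os.path.dirname(metrics_filename), exist_ok=True)
metrics_collection.to_csv(metrics_filename)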

## 11. Task

Study the scikit-learn documentation for one of the models used above and experiment with different hyperparameter values. Can you improve the accuracy or make the training faster?
Report your best results and note the parameters used, so that others can reproduce them.
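
One possible way to organize such an experiment is a small grid search with cross-validation. The sketch below is only a starting point under stated assumptions: it runs on synthetic data from `make_regression`, and the grid values are arbitrary examples rather than recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# synthetic data for illustration only
x_search, y_search = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# arbitrary example values; replace with the hyperparameters you want to study
parameter_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    parameter_grid,
    scoring="neg_root_mean_squared_error",  # scikit-learn maximizes scores, hence the negated RMSE
    cv=3,
)
search.fit(x_search, y_search)

# best parameter combination and its cross-validated RMSE
print(search.best_params_)
print(-search.best_score_)
```

Fixing `random_state` and printing `best_params_` records everything others need to reproduce the result.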