<div class="alert alert-block alert-info">

**TODO:**
* check all texts
* fix comments
* testing here with validation; final tests with test set
* try scaled/unscaled data for trees
* have it without outputs on Github
* remove all but one decision tree and one non-decision tree models and make it task to find others and adjust their hyperparameters; provide "solution" script with % load solutions.py (note that this cell has to be run twice then)
* Add info on used data and goal of exercise
* make sure to make it possible to go through in 30 min
    
</div>

# Shallow regression for vector data

Reminder: We are doing supervised learning. Our labels (targets) are real values, so this is a regression task.


In this notebook we read the zip code data produced by **02_vector_preparations** and create different machine learning models for
predicting the average zip code income from population and spatial features.

We will assess the models' error metrics with a test dataset, and also predict the income for all zip codes and write the result to a GeoPackage for closer inspection.

Notebook contents:
0. Environment preparation
1. Reading the data
2. Function definitions
3. Baseline naive approach
4. Baseline linear regression
5. Gradient Boosting
6. Random Forest
7. Extra Trees
8. Bagging
9. AdaBoost
10. Comparing the models
11. Task


## 0. Environment preparation

In [2]:
import time
import geopandas as gpd
import pandas as pd
from math import sqrt
import os
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, BaggingRegressor,ExtraTreesRegressor, AdaBoostRegressor

from sklearn.metrics import mean_squared_error, mean_absolute_error,r2_score

import seaborn as sns 

## 1. Reading the data
### 1.1 Define input and output file paths 

In [3]:
username = os.environ.get('USER')
base_directory= f'/scratch/project_2002044/{username}/2022/GeoML'

data_directory = os.path.join(base_directory,'data')

preprocessed_data_directory = os.path.join(data_directory,'preprocessed_regression')
scaled_train_dataset_name = os.path.join(preprocessed_data_directory,'scaled_train_zip_code_data.csv')
scaled_test_dataset_name = os.path.join(preprocessed_data_directory,'scaled_test_zip_code_data.csv')
scaled_val_dataset_name = os.path.join(preprocessed_data_directory,'scaled_val_zip_code_data.csv')
train_label_name = os.path.join(preprocessed_data_directory,'train_income_labels.pkl')
test_label_name = os.path.join(preprocessed_data_directory,'test_income_labels.pkl')
val_label_name = os.path.join(preprocessed_data_directory,'val_income_labels.pkl')

results_directory = os.path.join(data_directory,'regression_results')

def create_dir(directory_name):
    if not os.path.exists(directory_name):
        os.makedirs(directory_name)
create_dir(results_directory)

metrics_filename = os.path.join(results_directory,'shallow_metrics.csv')

#train_dataset_name = os.path.join(preprocessed_data_directory,'train_zip_code_data.csv')
#test_dataset_name = os.path.join(preprocessed_data_directory,'test_zip_code_data.csv')
#val_dataset_name = os.path.join(preprocessed_data_directory,'val_zip_code_data.csv')

### 1.2 Reading the data

In [4]:
# read train and validation datasets
x_train = pd.read_csv(scaled_train_dataset_name)
x_val = pd.read_csv(scaled_val_dataset_name)
y_train = pd.read_pickle(train_label_name)
y_val = pd.read_pickle(val_label_name)


## 2. Function definitions



In [4]:
# calculating error metrics for regression predictions

def calculate_error_metrics(test_labels,label_predictions, model_name):

    # Assessing the performance of the model with root mean squared error (RMSE), mean absolute error (MAE) and coefficient of determination (R2)
    rmse = sqrt(mean_squared_error(test_labels, label_predictions))
    mae = mean_absolute_error(test_labels, label_predictions)
    r2 = r2_score(test_labels, label_predictions)

    # store them in a dictionary
    metrics_dict = dict(zip(['model','RMSE','MAE','R2'],[model_name,rmse,mae,r2]))

    return metrics_dict


def print_and_store_error_metrics(test_labels, label_predictions, model_name, metrics_collection = None):

    metrics_dict = calculate_error_metrics(test_labels, label_predictions, model_name)

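    # one-row dataframe holding the metrics of this model
    metrics_row = pd.DataFrame([metrics_dict])

    # in case these are the first results we want to store, the new row starts the collection;
    # otherwise we add it to the existing dataframe
    # (DataFrame.append was removed in pandas 2.0, so we use pd.concat instead)
    if metrics_collection is None:
        metrics_collection = metrics_row
    else:
        metrics_collection = pd.concat([metrics_collection, metrics_row], ignore_index=True)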

    print(f"\nError metrics for {model_name} on the test dataset: \n" +
          f"\t Root mean squared error (RMSE): {round(metrics_dict['RMSE'])} \n" +
          f"\t Mean absolute error (MAE): {round(metrics_dict['MAE'])} \n" +
          f"\t Coefficient of determination (R2): {round(metrics_dict['R2'],4)} \n")

    return metrics_collection

# to time the model training we create a function for model training
def train_model(x_train, y_train, model):
    start_time = time.time()  
    print(model)
    model.fit(x_train,y_train)
    print('Model training took: ', round((time.time() - start_time), 2), ' seconds')
    return model
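
As a quick sanity check of the three metrics computed in `calculate_error_metrics`, here is a small hand-checkable sketch. The values in `toy_labels` and `toy_predictions` are made up for illustration and have nothing to do with the zip code data.

```python
from math import sqrt
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# toy values chosen only for illustration (assumption, not from the dataset)
toy_labels = [3, 5, 7]
toy_predictions = [2, 5, 9]

# the errors are -1, 0 and 2, so MSE = (1 + 0 + 4) / 3 and MAE = (1 + 0 + 2) / 3
toy_rmse = sqrt(mean_squared_error(toy_labels, toy_predictions))
toy_mae = mean_absolute_error(toy_labels, toy_predictions)

# R2 = 1 - SS_res / SS_tot = 1 - 5 / 8, with SS_tot measured around the label mean 5
toy_r2 = r2_score(toy_labels, toy_predictions)

print(round(toy_rmse, 3), toy_mae, toy_r2)
```

RMSE and MAE are in the unit of the labels (smaller is better), while R2 is unitless and equals 1 for a perfect prediction.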

## 3. Baseline naive approach

To determine how well machine learning models perform on our dataset, we first create some baseline results.
One way to get a baseline is to take the median of the labels in the training dataset and use it as the prediction for every zip code. This is very naive and not a realistic model, but any useful model should beat it.



In [5]:
# the median of the training labels provides the predicted label
naive_prediction_value = y_train.median()
naive_name = "Naive median prediction"
print(naive_prediction_value)
# the naive prediction value still needs to be repeated to fit with the features
naive_predictions = pd.DataFrame([naive_prediction_value]* y_val.shape[0])
# then we can get some accuracy measures
metrics_collection = print_and_store_error_metrics(y_val,naive_predictions, naive_name)

21200.0

Error metrics for Naive median prediction on the test dataset: 
	 Root mean squared error (RMSE): 3210 
	 Mean absolute error (MAE): 2521 
	 Coefficient of determination (R2): -0.0058 
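
The same kind of baseline can probably also be obtained with scikit-learn's `DummyRegressor`, which ignores the features and always predicts a constant. The sketch below uses randomly generated stand-in data (all `toy_*` names and values are assumptions for illustration) and shows why the R2 of a constant prediction cannot be above zero: R2 compares the model against predicting the mean of the evaluated labels, and no other constant can have a smaller squared error than that mean.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# random stand-in data (assumption): the features are irrelevant for a constant prediction
rng = np.random.default_rng(0)
toy_x_train = rng.normal(size=(200, 3))
toy_y_train = rng.normal(loc=21000, scale=3000, size=200)
toy_x_val = rng.normal(size=(50, 3))
toy_y_val = rng.normal(loc=21000, scale=3000, size=50)

# always predicts the median of the training labels
dummy = DummyRegressor(strategy="median")
dummy.fit(toy_x_train, toy_y_train)
dummy_predictions = dummy.predict(toy_x_val)

# a constant prediction can never score above 0
dummy_r2 = r2_score(toy_y_val, dummy_predictions)
print(dummy_r2)
```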





## 4. Baseline linear regression

Another baseline approach for regression is linear regression. Compared to the ensemble models that follow, it is cheap to compute and easy to interpret.

In [6]:

linear = LinearRegression()
linear_name = "Linear Regression"
linear.fit(x_train, y_train)
linear_predictions = linear.predict(x_val)

# get feature importance via their coefficients
feature_names = x_train.columns
model_coefficients = linear.coef_

coefficients_df = pd.DataFrame(data = model_coefficients, index = feature_names, columns = ['Coefficient value'])
print(coefficients_df)
# the features are scaled, so the largest absolute coefficient values indicate the most important features

metrics_collection = print_and_store_error_metrics(y_val,linear_predictions, linear_name, metrics_collection)



                 Coefficient value
euref_x                  -6.081669
euref_y                -707.226825
pinta_ala                 1.662934
he_vakiy                 31.197307
he_naiset               909.073870
...                            ...
Pohjois-Savo            -18.166157
Päijät-Häme            -174.413313
Satakunta              -866.225375
Uusimaa                1574.797429
Varsinais-Suomi        -231.631031

[117 rows x 1 columns]

Error metrics for Linear Regression on the test dataset: 
	 Root mean squared error (RMSE): 1600 
	 Mean absolute error (MAE): 1179 
	 Coefficient of determination (R2): 0.7501 
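
The printed coefficient table is long, so it can help to rank it by absolute value. A minimal sketch on synthetic data follows; the column names `a`, `b`, `c` and the true coefficients are made up for illustration. It assumes, as in this notebook, that the features are on comparable scales.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# synthetic features on the same scale and a noise-free linear target (assumption)
rng = np.random.default_rng(0)
toy_features = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])
toy_target = 5.0 * toy_features["a"] - 20.0 * toy_features["b"] + 0.5 * toy_features["c"]

toy_linear = LinearRegression().fit(toy_features, toy_target)
toy_coefficients = pd.DataFrame(data=toy_linear.coef_, index=toy_features.columns, columns=["Coefficient value"])

# order the rows by absolute coefficient value, largest first
ranking = toy_coefficients["Coefficient value"].abs().sort_values(ascending=False).index
ranked = toy_coefficients.reindex(ranking)
print(ranked)
```

The sign of a coefficient shows the direction of the effect, so sorting by absolute value keeps strongly negative features near the top.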





## 5. Gradient Boosting Regressor


* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#regression

In [7]:
grad_boost = GradientBoostingRegressor(n_estimators=30, learning_rate=0.1,verbose=1)
grad_boost_name = "Gradient Boosting Regressor"
grad_boost = train_model(x_train, y_train,grad_boost)
grad_boost_predictions = grad_boost.predict(x_val)
metrics_collection = print_and_store_error_metrics(y_val,grad_boost_predictions, grad_boost_name, metrics_collection)


GradientBoostingRegressor(n_estimators=30, verbose=1)
      Iter       Train Loss   Remaining Time 
         1    10017583.6313            0.76s
         2     8949232.4562            0.73s
         3     8078455.1699            0.70s
         4     7347146.6008            0.68s
         5     6701948.9174            0.65s
         6     6155046.2365            0.63s
         7     5686597.4554            0.61s
         8     5307098.3956            0.59s
         9     4959041.0561            0.56s
        10     4642294.8507            0.53s
        20     2810980.1416            0.26s
        30     2055765.2068            0.00s
Model training took:  0.77  seconds

Error metrics for Gradient Boosting Regressor on the test dataset: 
	 Root mean squared error (RMSE): 1522 
	 Mean absolute error (MAE): 1153 
	 Coefficient of determination (R2): 0.7738 
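
The verbose output above only reports the training loss. To see how the validation error develops as trees are added, `staged_predict` can be used on a fitted model. The sketch below runs on synthetic data from `make_regression`; all `toy_*` names and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic stand-in data (assumption), split into training and validation parts
toy_x, toy_y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
toy_x_train, toy_x_val, toy_y_train, toy_y_val = train_test_split(toy_x, toy_y, test_size=0.3, random_state=0)

toy_boost = GradientBoostingRegressor(n_estimators=30, learning_rate=0.1, random_state=0)
toy_boost.fit(toy_x_train, toy_y_train)

# staged_predict yields the prediction after each boosting iteration
staged_rmse = [
    np.sqrt(mean_squared_error(toy_y_val, stage_predictions))
    for stage_predictions in toy_boost.staged_predict(toy_x_val)
]

# number of trees with the lowest validation RMSE
best_stage = int(np.argmin(staged_rmse)) + 1
print(best_stage, round(staged_rmse[-1], 2))
```

If the lowest validation error is reached at the last stage, more estimators may still help. If it is reached earlier, the model may have started to overfit.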





## 6. Random Forest Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#forest

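Note: a random forest cannot extrapolate beyond the range of label values seen in the training dataset, because each tree predicts an average of training labels. Random forests are often used for classification, but they work for regression as well.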

In [8]:
random_forest = RandomForestRegressor(n_estimators=30,verbose=1)
random_forest_name = "Random Forest Regressor"

random_forest = train_model(x_train, y_train,random_forest)
random_forest_predictions = random_forest.predict(x_val)
metrics_collection = print_and_store_error_metrics(y_val,random_forest_predictions, random_forest_name, metrics_collection)

RandomForestRegressor(n_estimators=30, verbose=1)


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Model training took:  1.79  seconds

Error metrics for Random Forest Regressor on the test dataset: 
	 Root mean squared error (RMSE): 1449 
	 Mean absolute error (MAE): 1082 
	 Coefficient of determination (R2): 0.795 



[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    1.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.0s finished
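
The extrapolation limit of random forests can be illustrated with a small sketch. The one-dimensional data below is made up for illustration: the model is trained on inputs between 0 and 10 and then asked for a prediction far outside that range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# made-up training data (assumption): y = 2 * x for x between 0 and 10
toy_x_train = np.arange(0, 10, 0.1).reshape(-1, 1)
toy_y_train = 2.0 * toy_x_train.ravel()

toy_forest = RandomForestRegressor(n_estimators=30, random_state=0)
toy_forest.fit(toy_x_train, toy_y_train)

# the linear trend would give 200 at x = 100, but the forest can only
# return averages of training labels, so it stays at or below their maximum
far_prediction = toy_forest.predict(np.array([[100.0]]))[0]
print(far_prediction, toy_y_train.max())
```

The same limit applies to the other tree-based ensembles in this notebook that average training labels in their leaves.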


In [9]:
#let's take a look at feature importances
# create dataframe of feature importance from random forest model
adf = pd.DataFrame(zip(x_train.columns, random_forest.feature_importances_), columns= ['name','importance'])
# sort the dataframe to have highest ranking features first
adf.sort_values('importance', ascending=False)

              name  importance
6          he_kika    0.468741
1          euref_y    0.125612
35         te_takk    0.064266
54       tr_hy_tul    0.033022
61       ra_as_kpa    0.031266
..             ...         ...
90       tp_x_tunt    0.000062
88       tp_t_koti    0.000062
113    Päijät-Häme    0.000057
99   Etelä-Karjala    0.000036
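
The `feature_importances_` attribute used above is impurity-based and computed from the training data. One commonly used cross-check is permutation importance on held-out data, available as `sklearn.inspection.permutation_importance`. The sketch below uses synthetic data in which only the first feature carries signal; all `toy_*` names and values are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# synthetic data (assumption): only feature 0 influences the target
rng = np.random.default_rng(0)
toy_x = rng.normal(size=(300, 4))
toy_y = 10.0 * toy_x[:, 0] + rng.normal(scale=0.1, size=300)

toy_x_train, toy_x_val = toy_x[:200], toy_x[200:]
toy_y_train, toy_y_val = toy_y[:200], toy_y[200:]

toy_forest = RandomForestRegressor(n_estimators=30, random_state=0)
toy_forest.fit(toy_x_train, toy_y_train)

# shuffle one feature at a time on the held-out data and measure the drop in score
result = permutation_importance(toy_forest, toy_x_val, toy_y_val, n_repeats=5, random_state=0)
top_feature = int(np.argmax(result.importances_mean))
print(top_feature, result.importances_mean)
```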


## 7. Extra Trees Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html

In [10]:
extra_trees = ExtraTreesRegressor(n_estimators=30,verbose=1)
extra_trees_name = "Extra Trees Regressor"

extra_trees = train_model(x_train, y_train,extra_trees)
extra_trees_predictions = extra_trees.predict(x_val)
metrics_collection = print_and_store_error_metrics(y_val,extra_trees_predictions, extra_trees_name, metrics_collection)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


ExtraTreesRegressor(n_estimators=30, verbose=1)
Model training took:  0.85  seconds

Error metrics for Extra Trees Regressor on the test dataset: 
	 Root mean squared error (RMSE): 1282 
	 Mean absolute error (MAE): 951 
	 Coefficient of determination (R2): 0.8395 



[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done  30 out of  30 | elapsed:    0.0s finished


## 8. Bagging Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#bagging

In [11]:
bagging = BaggingRegressor(n_estimators=30,verbose=1)
bagging_name = "Bagging Regressor"

bagging = train_model(x_train, y_train,bagging)
bagging_predictions = bagging.predict(x_val)
metrics_collection = print_and_store_error_metrics(y_val,bagging_predictions, bagging_name, metrics_collection)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


BaggingRegressor(n_estimators=30, verbose=1)
Model training took:  2.05  seconds

Error metrics for Bagging Regressor on the test dataset: 
	 Root mean squared error (RMSE): 1413 
	 Mean absolute error (MAE): 1061 
	 Coefficient of determination (R2): 0.805 



[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    2.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s finished


## 9. AdaBoost Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#adaboost

In [12]:
ada_boost = AdaBoostRegressor(n_estimators=30)
ada_boost_name = "AdaBoost Regressor"

ada_boost = train_model(x_train, y_train,ada_boost)
ada_boost_predictions = ada_boost.predict(x_val)
metrics_collection = print_and_store_error_metrics(y_val,ada_boost_predictions, ada_boost_name, metrics_collection)

AdaBoostRegressor(n_estimators=30)
Model training took:  0.63  seconds

Error metrics for AdaBoost Regressor on the test dataset: 
	 Root mean squared error (RMSE): 1745 
	 Mean absolute error (MAE): 1363 
	 Coefficient of determination (R2): 0.7028 





## 10. Model comparison

In [13]:
print(metrics_collection.sort_values(by=['RMSE'], ascending=False))

# store comparison table 
metrics_collection.to_csv(metrics_filename)

                         model         RMSE          MAE        R2
0      Naive median prediction  3210.315973  2521.001704 -0.005820
6           AdaBoost Regressor  1745.176252  1363.223006  0.702763
1            Linear Regression  1600.219500  1178.917810  0.750090
2  Gradient Boosting Regressor  1522.333054  1152.671273  0.773825
3      Random Forest Regressor  1449.204208  1081.964225  0.795033
5            Bagging Regressor  1413.473238  1061.154174  0.805016
4        Extra Trees Regressor  1282.324966   950.631289  0.839520
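
To put the table into perspective, the RMSE of each model can be expressed relative to the naive baseline. The sketch below re-enters the RMSE values printed above by hand; the column name `RMSE reduction (%)` is made up for illustration.

```python
import pandas as pd

# RMSE values copied from the comparison table above
comparison = pd.DataFrame({
    "model": [
        "Naive median prediction",
        "Linear Regression",
        "Gradient Boosting Regressor",
        "Random Forest Regressor",
        "Extra Trees Regressor",
        "Bagging Regressor",
        "AdaBoost Regressor",
    ],
    "RMSE": [
        3210.315973,
        1600.219500,
        1522.333054,
        1449.204208,
        1282.324966,
        1413.473238,
        1745.176252,
    ],
})

# RMSE of the naive baseline as reference
naive_rmse = comparison.loc[comparison["model"] == "Naive median prediction", "RMSE"].iloc[0]

# how much each model reduces the RMSE compared to the naive baseline, in percent
comparison["RMSE reduction (%)"] = (100 * (1 - comparison["RMSE"] / naive_rmse)).round(1)

ranked_comparison = comparison.sort_values(by="RMSE")
best_model = ranked_comparison.iloc[0]["model"]
print(ranked_comparison)
```

In this run the Extra Trees Regressor reduces the RMSE by about 60 % compared to the naive median prediction. Results of the tree ensembles will vary slightly between runs unless `random_state` is fixed.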


## 11. Task

Study the scikit-learn documentation of one of the models used above and experiment with different hyperparameter values. Can you improve the accuracy or make the training faster?
Report the best results (on the test set!) and note the parameters used, so that others can reproduce the results.
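
One possible way to organize such an experiment is a small loop over candidate values that records the validation RMSE and the training time for each combination. The sketch below is only a starting point: it runs on synthetic data from `make_regression`, and the candidate values for `n_estimators` and `max_depth` are arbitrary assumptions. Setting `random_state` makes the runs reproducible.

```python
import time
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic stand-in data (assumption); in the notebook use x_train, y_train, x_val, y_val
toy_x, toy_y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
toy_x_train, toy_x_val, toy_y_train, toy_y_val = train_test_split(toy_x, toy_y, test_size=0.3, random_state=0)

results = []
for n_estimators in [10, 30, 100]:
    for max_depth in [None, 5]:
        model = ExtraTreesRegressor(n_estimators=n_estimators, max_depth=max_depth, random_state=0)

        # time only the training step
        start_time = time.time()
        model.fit(toy_x_train, toy_y_train)
        training_seconds = time.time() - start_time

        rmse = np.sqrt(mean_squared_error(toy_y_val, model.predict(toy_x_val)))
        results.append({
            "n_estimators": n_estimators,
            "max_depth": max_depth,
            "RMSE": rmse,
            "seconds": training_seconds,
        })

# pick the combination with the lowest validation RMSE
best = min(results, key=lambda result: result["RMSE"])
print(best)
```

Choose the hyperparameters on the validation set and evaluate the chosen model on the test set only once at the end, so that the reported test result is not biased by the search.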