<div class="alert alert-block alert-info">

**TODO:**

* have it without outputs on Github
* Keep or remove linear regression?
* task update: remove all but RF, make task to also find other models, provide "solution" script with % load solutions.py (note that this cell has to be run twice then)
* make sure to make it possible to go through in 30 min
    
</div>

# Shallow regression for vector data

Reminder: This is supervised learning. Our labels (targets) are real values, so the task is regression.

Data and goal: In this notebook we read the zip code data produced by **02_vector_preparations** and create different machine learning models for
predicting the average zip code income from population and spatial features. We will adjust parameters to improve the performance on a validation dataset and finally assess the models' error metrics with a test dataset.

Notebook contents:

0. Environment preparation
1. Reading the data
2. Function definitions
3. Baseline naive approach
4. Baseline linear regression
5. Gradient Boosting
6. Random Forest
7. Extra Trees
8. Bagging
9. AdaBoost
10. Comparing the models
11. Task


## 0. Environment preparation

In [None]:
import os
import time
import pandas as pd
from math import sqrt
# machine learning models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, BaggingRegressor, ExtraTreesRegressor, AdaBoostRegressor
# error metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score


## 1. Reading the data
### 1.1 Define input and output file paths 

In [None]:
username = os.environ.get('USER')
base_directory= f'/scratch/project_2002044/{username}/2022/GeoML'

# inputs, all datasets created in 02_vector_preparation.ipynb
data_directory = os.path.join(base_directory,'data')
preprocessed_data_directory = os.path.join(data_directory,'preprocessed_regression')
scaled_train_dataset_name = os.path.join(preprocessed_data_directory,'scaled_train_zip_code_data.csv')
scaled_test_dataset_name = os.path.join(preprocessed_data_directory,'scaled_test_zip_code_data.csv')
scaled_val_dataset_name = os.path.join(preprocessed_data_directory,'scaled_val_zip_code_data.csv')
train_label_name = os.path.join(preprocessed_data_directory,'train_income_labels.pkl')
test_label_name = os.path.join(preprocessed_data_directory,'test_income_labels.pkl')
val_label_name = os.path.join(preprocessed_data_directory,'val_income_labels.pkl')

# read also unscaled datasets
train_dataset_name = os.path.join(preprocessed_data_directory,'train_zip_code_data.csv')
test_dataset_name = os.path.join(preprocessed_data_directory,'test_zip_code_data.csv')
val_dataset_name = os.path.join(preprocessed_data_directory,'val_zip_code_data.csv')

# outputs
results_directory = os.path.join(data_directory,'regression_results')

def create_dir(directory_name):
    if not os.path.exists(directory_name):
        os.makedirs(directory_name)
create_dir(results_directory)

metrics_filename = os.path.join(results_directory,'shallow_metrics.csv')

In [None]:
# for reproducible results when randomness is involved, we can set a random seed
random_seed = 42

### 1.2 Reading the data

In [None]:
# read train, validation and test datasets
scaled_x_train = pd.read_csv(scaled_train_dataset_name)
scaled_x_val = pd.read_csv(scaled_val_dataset_name)
x_train = pd.read_csv(train_dataset_name)
x_val = pd.read_csv(val_dataset_name)
y_train = pd.read_pickle(train_label_name)
y_val = pd.read_pickle(val_label_name)
scaled_x_test = pd.read_csv(scaled_test_dataset_name)
x_test = pd.read_csv(test_dataset_name)
y_test = pd.read_pickle(test_label_name)

## 2. Function definitions
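
As a quick orientation for the three error metrics used in the functions below, here is a minimal sketch that computes RMSE, MAE and R2 by hand and compares them with scikit-learn. The arrays `toy_y_true` and `toy_y_pred` are made-up values for illustration, not taken from the zip code data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# made-up target values and predictions, for illustration only
toy_y_true = np.array([20000.0, 25000.0, 30000.0, 35000.0])
toy_y_pred = np.array([22000.0, 24000.0, 29000.0, 39000.0])

errors = toy_y_true - toy_y_pred

# RMSE: square root of the mean squared error, large errors weigh more
rmse_manual = np.sqrt(np.mean(errors ** 2))
# MAE: mean of the absolute errors, in the same unit as the target
mae_manual = np.mean(np.abs(errors))
# R2: 1 - (residual sum of squares / total sum of squares around the mean)
r2_manual = 1 - np.sum(errors ** 2) / np.sum((toy_y_true - toy_y_true.mean()) ** 2)

# the manual values should agree with the scikit-learn functions
print(rmse_manual, np.sqrt(mean_squared_error(toy_y_true, toy_y_pred)))
print(mae_manual, mean_absolute_error(toy_y_true, toy_y_pred))
print(r2_manual, r2_score(toy_y_true, toy_y_pred))
```

RMSE and MAE are in the unit of the target (here: income), while R2 is unitless and equals 1 for a perfect prediction.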



In [None]:
# calculating error metrics for regression predictions

def calculate_error_metrics(test_labels,label_predictions, model_name):

    # Assessing the performance of the model with root mean squared error, mean absolute error and coefficient of determination r2
    rmse = sqrt(mean_squared_error(test_labels, label_predictions))
    mae = mean_absolute_error(test_labels, label_predictions)
    r2 = r2_score(test_labels, label_predictions)

    # store them in a dictionary
    metrics_dict = dict(zip(['model','RMSE','MAE','R2'],[model_name,rmse,mae,r2]))

    return metrics_dict

def print_error_metrics(metrics_dict, model_name, dataset_name):
    print(f"\nError metrics for {model_name} on the {dataset_name} dataset: \n" +
            f"\t Root mean squared error (RMSE): {round(metrics_dict['RMSE'])} \n" +
            f"\t Mean absolute error (MAE): {round(metrics_dict['MAE'])} \n" +
            f"\t Coefficient of determination (R2): {round(metrics_dict['R2'],4)} \n")

def store_error_metrics(test_labels, label_predictions, model_name, metrics_collection=None):

    metrics_dict = calculate_error_metrics(test_labels, label_predictions, model_name)

    # DataFrame.append was removed in pandas 2.0, so we build a one-row dataframe and concatenate it
    new_row = pd.DataFrame([metrics_dict])

    # in case these are the first results we want to store, the new row is the whole collection
    if metrics_collection is None:
        return new_row

    return pd.concat([metrics_collection, new_row], ignore_index=True)

# to time the model training we create a function for model training
def train_model(x_train, y_train, model):
    start_time = time.time()  
    print(model)
    model.fit(x_train,y_train)
    print('Model training took: ', round((time.time() - start_time), 2), ' seconds')
    return model

## 3. Baseline naive approach

To judge how well machine learning models perform on our dataset, we first create some baseline results.
One way to get a baseline is to take the median of the y labels in the training dataset and use it as the prediction for every sample. This is very naive and not a realistic model, but any useful model should beat it.
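
For comparison, scikit-learn offers `DummyRegressor`, which can produce the same kind of constant median prediction. The sketch below uses made-up income values (the `toy_` arrays are assumptions for illustration) and is not a replacement for the cell that follows.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

# made-up incomes; the dummy model ignores the feature values
toy_x_train = np.zeros((5, 1))
toy_y_train = np.array([18000.0, 21000.0, 24000.0, 30000.0, 90000.0])
toy_x_val = np.zeros((3, 1))
toy_y_val = np.array([20000.0, 26000.0, 40000.0])

dummy = DummyRegressor(strategy="median")
dummy.fit(toy_x_train, toy_y_train)
toy_predictions = dummy.predict(toy_x_val)

# every prediction equals the median of the training labels
print(toy_predictions)
# a constant prediction can never reach an R2 above 0 on the evaluated data
print(r2_score(toy_y_val, toy_predictions))
```

An R2 at or below 0 is therefore what we expect from this baseline; the RMSE and MAE it produces are the numbers the later models need to improve on.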



In [None]:
# the median of the training labels provides the predicted label
naive_prediction_value = y_train.median()
naive_name = "Naive median prediction"
print(naive_prediction_value)
# the naive prediction value still needs to be repeated to fit with the features
naive_predictions_val = pd.DataFrame([naive_prediction_value]* y_val.shape[0])
# then we can get some performance metrics
print_error_metrics(calculate_error_metrics(y_val,naive_predictions_val, naive_name), naive_name, 'validation')


# and store the results on the test dataset for later model comparison
naive_predictions_test = pd.DataFrame([naive_prediction_value]* y_test.shape[0])
print_error_metrics(calculate_error_metrics(y_test,naive_predictions_test, naive_name), naive_name, 'test')
metrics_collection = store_error_metrics(y_test,naive_predictions_test, naive_name)

## 4. Baseline Linear regression 

Another baseline approach for regression is linear regression. It is cheap to compute compared with any of the following models. Here we train it on the scaled features.
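
A linear regression predicts the target as a weighted sum of the features plus an intercept. As a minimal sketch on synthetic data (the `toy_` names, the weights 3 and -2 and the intercept 5 are assumptions chosen for illustration), we can check that the fitted model recovers the weights used to generate the target:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
toy_features = rng.normal(size=(200, 2))
# target built from known weights (3 and -2) and intercept 5, plus a little noise
toy_noise = rng.normal(scale=0.1, size=200)
toy_target = 5 + 3 * toy_features[:, 0] - 2 * toy_features[:, 1] + toy_noise

toy_linear = LinearRegression()
toy_linear.fit(toy_features, toy_target)

# the fitted intercept and coefficients should be close to 5 and [3, -2]
print(toy_linear.intercept_, toy_linear.coef_)
```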

In [None]:
linear = LinearRegression()
linear_name = "Linear Regression"
linear.fit(scaled_x_train, y_train)
linear_predictions_val = linear.predict(scaled_x_val)


# then we can get some performance metrics
print_error_metrics(calculate_error_metrics(y_val,linear_predictions_val, linear_name), linear_name, 'validation')

# and store the results on the test dataset for later model comparison
linear_predictions_test = linear.predict(scaled_x_test)
print_error_metrics(calculate_error_metrics(y_test,linear_predictions_test, linear_name), linear_name, 'test')
metrics_collection = store_error_metrics(y_test,linear_predictions_test, linear_name, metrics_collection)



## 5. Gradient Boosting Regressor


* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#regression
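
Gradient boosting adds trees one after another, so `n_estimators` is a natural parameter to tune on the validation data. The sketch below uses synthetic data from `make_regression` (the `toy_` names and all parameter values are assumptions for illustration) and `staged_predict` to look at the validation RMSE after each added tree without retraining.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic regression data, for illustration only
toy_x, toy_y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
toy_x_train, toy_x_val, toy_y_train, toy_y_val = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=42
)

toy_gb = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, random_state=42)
toy_gb.fit(toy_x_train, toy_y_train)

# staged_predict yields the predictions after 1, 2, ..., n_estimators trees
val_rmse_per_stage = [
    np.sqrt(mean_squared_error(toy_y_val, stage_predictions))
    for stage_predictions in toy_gb.staged_predict(toy_x_val)
]

# number of trees with the lowest validation RMSE
best_n_estimators = int(np.argmin(val_rmse_per_stage)) + 1
print(best_n_estimators, round(min(val_rmse_per_stage), 2))
```

The same idea should carry over to `x_train` and `x_val` from this notebook, though the best number of trees will depend on the data and the learning rate.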

In [None]:
grad_boost = GradientBoostingRegressor(n_estimators=30, learning_rate=0.1, verbose=1, random_state=random_seed)
grad_boost_name = "Gradient Boosting Regressor"
grad_boost = train_model(x_train, y_train,grad_boost)
grad_boost_predictions = grad_boost.predict(x_val)

# then we can get some performance metrics
print_error_metrics(calculate_error_metrics(y_val,grad_boost_predictions, grad_boost_name), grad_boost_name, 'validation')

# and store the results on the test dataset for later model comparison, after we are done optimizing the parameters
#grad_boost_predictions = grad_boost.predict(x_test)
#print_error_metrics(calculate_error_metrics(y_test,grad_boost_predictions, grad_boost_name), grad_boost_name, 'test')
#metrics_collection = store_error_metrics(y_test,grad_boost_predictions, grad_boost_name, metrics_collection)

## 6. Random Forest Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#forest
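
Each tree in a random forest is trained on a bootstrap sample of the training data, so the samples a tree did not see can serve as a built-in check. As a minimal sketch on synthetic data (the `toy_` names and parameter values are assumptions for illustration), setting `oob_score=True` makes scikit-learn compute an R2 score from these out-of-bag samples:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# synthetic regression data, for illustration only
toy_x, toy_y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)

toy_forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
toy_forest.fit(toy_x, toy_y)

# R2 estimated from the samples left out of each tree's bootstrap sample
print(round(toy_forest.oob_score_, 4))
```

This does not replace the validation dataset used in this notebook, but it can give a quick estimate while experimenting.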


In [None]:
random_forest = RandomForestRegressor(n_estimators=100, verbose=1, random_state=random_seed)
random_forest_name = "Random Forest Regressor"

random_forest = train_model(x_train, y_train,random_forest)
random_forest_predictions = random_forest.predict(x_val)

# then we can get some performance metrics
print_error_metrics(calculate_error_metrics(y_val,random_forest_predictions, random_forest_name), random_forest_name, 'validation')

# and store the results on the test dataset for later model comparison, after we are done optimizing the parameters
#random_forest_predictions = random_forest.predict(x_test)
#print_error_metrics(calculate_error_metrics(y_test,random_forest_predictions, random_forest_name), random_forest_name, 'test')
#metrics_collection = store_error_metrics(y_test,random_forest_predictions, random_forest_name, metrics_collection)

In [None]:
# let's take a look at the feature importances
# create dataframe of feature importance from random forest model
adf = pd.DataFrame(zip(x_train.columns, random_forest.feature_importances_), columns= ['name','importance'])
# sort the dataframe to have highest ranking features first
adf.sort_values('importance', ascending=False)

## 7. Extra Trees Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html

In [None]:
extra_trees = ExtraTreesRegressor(n_estimators=30,verbose=1, random_state=random_seed)
extra_trees_name = "Extra Trees Regressor"

extra_trees = train_model(x_train, y_train,extra_trees)
extra_trees_predictions = extra_trees.predict(x_val)

# then we can get some performance metrics
print_error_metrics(calculate_error_metrics(y_val,extra_trees_predictions, extra_trees_name), extra_trees_name, 'validation')

# and store the results on the test dataset for later model comparison, after we are done optimizing the parameters
#extra_trees_predictions = extra_trees.predict(x_test)
#print_error_metrics(calculate_error_metrics(y_test,extra_trees_predictions, extra_trees_name), extra_trees_name,'test')
#metrics_collection = store_error_metrics(y_test,extra_trees_predictions, extra_trees_name, metrics_collection)

## 8. Bagging Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#bagging

In [None]:
bagging = BaggingRegressor(n_estimators=30,verbose=1,random_state=random_seed )
bagging_name = "Bagging Regressor"

bagging = train_model(x_train, y_train,bagging)
bagging_predictions = bagging.predict(x_val)

# then we can get some performance metrics
print_error_metrics(calculate_error_metrics(y_val,bagging_predictions, bagging_name), bagging_name, 'validation')

# and store the results on the test dataset for later model comparison, after we are done optimizing the parameters
#bagging_predictions = bagging.predict(x_test)
#print_error_metrics(calculate_error_metrics(y_test,bagging_predictions, bagging_name), bagging_name, 'test')
#metrics_collection = store_error_metrics(y_test,bagging_predictions, bagging_name, metrics_collection)

## 9. AdaBoost Regressor

* https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html
* https://scikit-learn.org/stable/modules/ensemble.html#adaboost

In [None]:
ada_boost = AdaBoostRegressor(n_estimators=30, learning_rate=1.0, loss='linear', random_state=random_seed)
ada_boost_name = "AdaBoost Regressor"

ada_boost = train_model(x_train, y_train,ada_boost)
ada_boost_predictions = ada_boost.predict(x_val)

# then we can get some performance metrics
print_error_metrics(calculate_error_metrics(y_val,ada_boost_predictions, ada_boost_name), ada_boost_name, 'validation')

# and store the results on the test dataset for later model comparison, after we are done optimizing the parameters
#ada_boost_predictions = ada_boost.predict(x_test)
#print_error_metrics(calculate_error_metrics(y_test,ada_boost_predictions, ada_boost_name), ada_boost_name, 'test')
#metrics_collection = store_error_metrics(y_test,ada_boost_predictions, ada_boost_name, metrics_collection)

## 10. Model comparison

Let's compare our models' performance on the test dataset. Make sure to uncomment the lines that store the results in the cells above.

In [None]:
print(metrics_collection.sort_values(by=['RMSE'], ascending=False))

# store comparison table 
metrics_collection.to_csv(metrics_filename)

## 11. Task

Study the scikit-learn documentation of one of the models used above and experiment with different hyperparameter values. Can you reduce the error or make the training faster?
Report the best results (on the test set!) and note the parameters used, so that others can reproduce the results.

Alternative: (removing most models from exercise notebook)
Study the scikit-learn documentation for regression, find other models and experiment with different hyperparameter values. Can you reduce the error or make the training faster?
Report the best results (on the test set!) and note the model and parameters used, so that others can reproduce the results.
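
One possible way to organise such an experiment is a small loop over parameter combinations that compares them on the validation data. The following is only a sketch on synthetic data: the `toy_` names, the grid values and the choice of `RandomForestRegressor` are assumptions for illustration, not a recommended setup.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import ParameterGrid, train_test_split

# synthetic regression data, for illustration only
toy_x, toy_y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=42)
toy_x_train, toy_x_val, toy_y_train, toy_y_val = train_test_split(
    toy_x, toy_y, test_size=0.3, random_state=42
)

# hypothetical small grid of hyperparameter values to try
parameter_grid = ParameterGrid({"n_estimators": [10, 50], "max_depth": [3, None]})

search_results = []
for parameters in parameter_grid:
    candidate = RandomForestRegressor(random_state=42, **parameters)
    candidate.fit(toy_x_train, toy_y_train)
    val_rmse = np.sqrt(mean_squared_error(toy_y_val, candidate.predict(toy_x_val)))
    search_results.append((val_rmse, parameters))

# pick the combination with the lowest validation RMSE
best_rmse, best_parameters = min(search_results, key=lambda result: result[0])
print(best_parameters, round(best_rmse, 2))
```

Choosing the parameters on the validation data and evaluating on the test data only once at the end keeps the reported test metrics an honest estimate.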