# Shallow regression for vector data

This script reads zip code data produced by **vectorDataPreparations** and creates a machine learning model for
predicting the average zip code income from population and spatial variables.

It assesses the model's accuracy with a test dataset, then predicts the income for all zip codes and writes the result to a geopackage
for closer inspection.

# 1. Read the data

In [None]:
import time
import geopandas as gpd
import pandas as pd
from math import sqrt
import os
import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, BaggingRegressor, ExtraTreesRegressor, AdaBoostRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

### 1.1 Input and output file paths 

In [None]:
paavo_data = "../data/paavo"

### Relative path to the zip code geopackage file that was prepared by vectorDataPreparations.py
input_geopackage_path = os.path.join(paavo_data,"zip_code_data_after_preparation.gpkg")

### Output file. You can change the name to identify different regression models
output_geopackage_path = os.path.join(paavo_data,"median_income_per_zipcode_shallow_model.gpkg")

### 1.2 Read the input data to a Geopandas dataframe

In [None]:
original_gdf = gpd.read_file(input_geopackage_path)
original_gdf.head()

# 2. Train the model 

You can try different regressor models by uncommenting the corresponding lines. You can also try different modeling parameters. 

Which one is the best model? Can you figure out how to improve it even more?
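
One way to approach the question above is to fit every candidate on the same split and rank them by a single metric. The following is a minimal sketch of that idea, not the notebook's own method: it uses synthetic data from `make_regression` as a stand-in for the zip code table, and the names `x_demo`, `y_demo`, `candidates` and `scores` are made up for this illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    AdaBoostRegressor,
    BaggingRegressor,
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this sketch, not the Paavo zip code data)
x_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=42)
x_demo_train, x_demo_test, y_demo_train, y_demo_test = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

# The same five regressors the notebook offers, with a fixed seed for repeatability
candidates = {
    "GradientBoosting": GradientBoostingRegressor(n_estimators=30, learning_rate=0.1, random_state=42),
    "RandomForest": RandomForestRegressor(n_estimators=30, random_state=42),
    "Bagging": BaggingRegressor(n_estimators=30, random_state=42),
    "ExtraTrees": ExtraTreesRegressor(n_estimators=30, random_state=42),
    "AdaBoost": AdaBoostRegressor(n_estimators=30, random_state=42),
}

# Fit each candidate on the training part and score it on the held-out part
scores = {}
for name, candidate in candidates.items():
    candidate.fit(x_demo_train, y_demo_train)
    scores[name] = r2_score(y_demo_test, candidate.predict(x_demo_test))

# Print the candidates from best to worst r2
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: r2 = {score:.4f}")
```

The ranking on synthetic data says nothing about the ranking on the zip code data; to compare models there, the loop would be run with the notebook's own train and test datasets instead.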

In [None]:
### Split the gdf into x (the predictor attributes) and y (the attribute to be predicted)
y = original_gdf['hr_mtu'] # Average income
### remove geometry and textual fields
x = original_gdf.drop(['geometry','postinumer','nimi','hr_mtu'],axis=1)

### Split both datasets into train (60%) and test (40%) datasets, as set by test_size=.4
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.4, random_state=42)

### Choose the model to be used
model = GradientBoostingRegressor(n_estimators=30, learning_rate=0.1,verbose=1)
#model = RandomForestRegressor(n_estimators=30,verbose=1)
#model = BaggingRegressor(n_estimators=30,verbose=1)
#model = ExtraTreesRegressor(n_estimators=30,verbose=1)
#model = AdaBoostRegressor(n_estimators=30)

print(model)

### Train the model with x and y of the train dataset
model.fit(x_train, y_train)

### Predict the average income for the test dataset
prediction = model.predict(x_test)

### Assess the accuracy of the model with root mean squared error, mean absolute error and coefficient of determination r2
rmse = sqrt(mean_squared_error(y_test, prediction))
mae = mean_absolute_error(y_test, prediction)
r2 = r2_score(y_test, prediction)

print(f"\nMODEL ACCURACY METRICS WITH TEST DATASET: \n" +
      f"\t Root mean squared error: {round(rmse)} \n" +
      f"\t Mean absolute error: {round(mae)} \n" +
      f"\t Coefficient of determination: {round(r2,4)} \n")
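
To make the three accuracy metrics above less of a black box, the sketch below computes them by hand in plain Python, following the standard definitions that `mean_squared_error`, `mean_absolute_error` and `r2_score` implement. The function name `regression_metrics` and the four income values are invented for illustration only.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Return (rmse, mae, r2) for two equally long sequences of numbers."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]

    # Root mean squared error: penalises large errors more than small ones
    rmse = sqrt(sum(e * e for e in errors) / n)

    # Mean absolute error: the typical size of an error, in the same unit as the data
    mae = sum(abs(e) for e in errors) / n

    # Coefficient of determination: 1 minus the ratio of residual to total variation
    mean_actual = sum(actual) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot

    return rmse, mae, r2


# Made-up incomes for four zip codes (illustrative values only)
actual = [20000, 22000, 25000, 30000]
predicted = [21000, 21000, 26000, 28000]

rmse, mae, r2 = regression_metrics(actual, predicted)
print(f"RMSE: {rmse:.1f}, MAE: {mae:.1f}, r2: {r2:.4f}")
```

RMSE and MAE are expressed in the same unit as the income, so lower is better, while an r2 closer to 1 means the model explains more of the variation between zip codes.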


# 3. Predict average income for all zip codes

Here we take the model that was most recently trained in the previous cell and apply it to the whole dataset.

In [None]:
### Drop the unused columns from original_gdf, as was done before model training.
x = original_gdf.drop(['geometry','postinumer','nimi','hr_mtu'],axis=1)

### Predict the income (hr_mtu) for all zip codes with the already trained model
prediction = model.predict(x)

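### Join the predictions to the original geodataframe and pick only interesting columns for results
original_gdf['predicted_hr_mtu'] = prediction.round(0)
original_gdf['difference'] = original_gdf['predicted_hr_mtu'] - original_gdf['hr_mtu']
### Take an explicit copy so that columns can be added later without a SettingWithCopyWarning
resulting_gdf = original_gdf[['postinumer','nimi','hr_mtu','predicted_hr_mtu','difference','geometry']].copy()
### Write the results to the output geopackage for closer inspection
resulting_gdf.to_file(output_geopackage_path, driver="GPKG")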

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
ax.set_title("Predicted average income by zip code", fontsize=25)
ax.set_axis_off()
resulting_gdf.plot(column='predicted_hr_mtu', ax=ax, legend=True, cmap="magma")

# 4. EXERCISE: Calculate the difference between real and predicted incomes

Calculate the difference between the real and predicted income for each zip code and plot it on a map.

* **original_gdf** is the original dataframe
* **resulting_gdf** holds the real and predicted incomes

In [None]:
#### This is what students might do here
resulting_gdf['diff'] = resulting_gdf['predicted_hr_mtu'] - resulting_gdf['hr_mtu']
fig, ax = plt.subplots(figsize=(20, 10))
ax.set_title("Difference in average income by zip code", fontsize=25)
ax.set_axis_off()
resulting_gdf.plot(column='diff', ax=ax, legend=True, cmap="BrBG")
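
A diverging colormap such as "BrBG" is easiest to read when its midpoint sits at zero, which is not guaranteed if over- and underpredictions have different magnitudes. The sketch below shows one possible way to get symmetric colour limits; `symmetric_limit` and `demo_diffs` are hypothetical names for this example, and the helper assumes the values contain no missing data.

```python
def symmetric_limit(values):
    """Return the largest absolute value, to be used as vmax with -vmax as vmin."""
    return max(abs(v) for v in values)


# Made-up differences between predicted and real income (illustrative values only)
demo_diffs = [-1500.0, 300.0, 900.0, -200.0]
limit = symmetric_limit(demo_diffs)
print(limit)

# In the notebook this could be applied to the 'diff' column from the cell above:
# limit = symmetric_limit(resulting_gdf['diff'])
# resulting_gdf.plot(column='diff', ax=ax, legend=True, cmap="BrBG", vmin=-limit, vmax=limit)
```

With `vmin` and `vmax` set this way, the neutral colour in the middle of the colormap marks zip codes where the prediction matches the real income.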