# 1. Vector data preparations

This script prepares the **Paavo zip code dataset** from Statistics Finland
for machine learning purposes.

It reads the original shapefile, scales all the numerical values, joins some auxiliary
data and one-hot encodes one text field. The result is saved as a GeoPackage.

The variable descriptions of this dataset are available in Finnish and English:
* https://www.stat.fi/static/media/uploads/tup/paavo/paavo_lyhyt_kuvaus_2020_fi.pdf
* https://www.stat.fi/static/media/uploads/tup/paavo/paavo_lyhyt_kuvaus_2020_en.pdf

In [None]:
import time
import geopandas as gpd
import pandas as pd
import os
from shapely.geometry import Point, MultiPolygon, Polygon
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from joblib import dump, load
import zipfile
from urllib.request import urlretrieve
import matplotlib.pyplot as plt

### 1.1 Create directories if they do not already exist

In [None]:
directories = ['../data']

for directory in directories:
    # exist_ok=True makes this a no-op when the directory already exists
    os.makedirs(directory, exist_ok=True)

### 1.2 Download the Paavo data from Allas with urllib and unzip it to the data folder

In [None]:
urlretrieve('https://a3s.fi/gis-courses/gis_ml/paavo.zip', '../data/paavo.zip')

with zipfile.ZipFile('../data/paavo.zip', 'r') as zip_file:
    zip_file.extractall('../data')

### 1.3 Define file paths

In [None]:
zip_code_shapefile = '../data/paavo/pno_tilasto_2020.shp'
finnish_regions_shapefile = '../data/paavo/SuomenMaakuntajako_2020_10k.shp'
output_file_path = '../data/paavo/zip_code_data_after_preparation.gpkg'
scaler_path = '../data/paavo/zip_code_scaler.bin'

# 2. Reading and cleaning the data

Read the zip code dataset into a geopandas dataframe **original_gdf** and drop unnecessary rows and columns.

In [None]:
### Read the data from a shapefile to a geopandas dataframe
original_gdf = gpd.read_file(zip_code_shapefile, encoding='utf-8')
print(f"Original dataframe size: {len(original_gdf.index)} zip codes with {len(original_gdf.columns)} columns")

### Drop all rows that have missing values or where the income variable hr_mtu is -1 (= not known) or 0
original_gdf = original_gdf.dropna()
original_gdf = original_gdf[original_gdf["hr_mtu"]>0].reset_index(drop=True)

print(f"Dataframe size after dropping some rows: {len(original_gdf.index)} zip codes with {len(original_gdf.columns)} columns")

### Remove some columns that are strings (namn, kunta = name of the municipality in Finnish and Swedish)
### or which might make the modeling too easy ('hr_ktu','hr_tuy','hr_pi_tul','hr_ke_tul','hr_hy_tul','hr_ovy')
columns_to_be_removed_completely = ['namn','kunta','hr_ktu','hr_tuy','hr_pi_tul','hr_ke_tul','hr_hy_tul','hr_ovy']
original_gdf = original_gdf.drop(columns_to_be_removed_completely,axis=1)

print(f"Dataframe size after dropping some columns: {len(original_gdf.index)} zip codes with {len(original_gdf.columns)} columns")


In [None]:
original_gdf

### 2.1 Plot the geodataframe
If plotting maps with matplotlib is not familiar, here are some parameters you can play with:
* **figsize** - different height, width
* **column** - try other zip code values
* **cmap** - this is the color map, here are the possible options: https://matplotlib.org/3.3.1/tutorials/colors/colormaps.html

In [None]:
fig, ax = plt.subplots(figsize=(20, 10))
ax.set_title("Median income (hr_mtu) by zip code", fontsize=25)
ax.set_axis_off()
original_gdf.plot(column='hr_mtu', ax=ax, legend=True, cmap="magma")

# 3. Scale the numerical columns
Many machine learning algorithms benefit from feature scaling, which means normalizing or standardizing each variable to a common scale, e.g. to values in **[0, 1]** or to values centered on **0**.

We do this for all numerical columns. Text (string) columns need a different kind of treatment.

When to normalize or standardize?
### Normalizing
* Values are rescaled linearly so that they range from **0** to **1**. If the data contain huge outliers, most of the variation may be squashed into a narrow range of values. Standardizing might be a better idea then.
* In scikit-learn the transformer is called **MinMaxScaler()**
* Some machine learning methods prefer values from **0** to **1**
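
The rescaling described in this list can be sketched in plain Python. This is only an illustration of the arithmetic, not the scikit-learn implementation; the function name `min_max_scale` and the income figures are made up for the example.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the smallest becomes 0 and the largest 1."""
    lowest, highest = min(values), max(values)
    value_range = highest - lowest
    if value_range == 0:
        # A constant column has no spread; map everything to 0
        return [0.0 for _ in values]
    return [(value - lowest) / value_range for value in values]

# Hypothetical incomes: four ordinary areas and one extreme outlier
incomes = [20000, 22000, 25000, 30000, 200000]
scaled_incomes = min_max_scale(incomes)
print([round(value, 3) for value in scaled_incomes])
```

With these made-up numbers the four ordinary values all end up below 0.06 while the outlier sits at 1, which is the squashing effect mentioned in the first bullet.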

### Standardizing
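* The mean value is subtracted and the result is divided by the standard deviation, producing values with a mean of **0** and a standard deviation of **1**. The values are not confined to a fixed range, so outliers can fall well outside **[-1, 1]**.
* In scikit-learn the transformer is called **StandardScaler()**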
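
As a rough sketch of the same idea in plain Python, the snippet below subtracts the mean and divides by the population standard deviation. The helper name `standardize` and the income figures are hypothetical and only serve to illustrate the arithmetic.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    average = mean(values)
    deviation = pstdev(values)
    if deviation == 0:
        # A constant column has no spread; map everything to 0
        return [0.0 for _ in values]
    return [(value - average) / deviation for value in values]

# Hypothetical incomes: four ordinary areas and one extreme outlier
incomes = [20000, 22000, 25000, 30000, 200000]
standardized_incomes = standardize(incomes)
print([round(value, 2) for value in standardized_incomes])
print(round(mean(standardized_incomes), 6), round(pstdev(standardized_incomes), 6))
```

The result has a mean of 0 and a standard deviation of 1, but the values are not bounded: with these numbers the outlier ends up at roughly 2.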


In [None]:
### Get list of all column headings
all_columns = list(original_gdf.columns)

### List the column names that we don't want to be scaled
col_names_no_scaling = ['postinumer','nimi','hr_mtu','geometry']

### List of column names we want to scale. (all columns minus those we don't want)
col_names_to_scaling = [column for column in all_columns if column not in col_names_no_scaling]

### Subset the data for only those to-be scaled
gdf = original_gdf[col_names_to_scaling]

### Apply a scikit-learn StandardScaler() or MinMaxScaler() to all the columns left in the dataframe
### You can also test both
#scaler = StandardScaler()
scaler = MinMaxScaler()
scaled_values_array = scaler.fit_transform(gdf)

### Save the fitted scaler for later use. If new zip codes were added in Finland, we could scale them with the same scaler.
dump(scaler, scaler_path, compress=True)

### The scaled columns come back as a numpy ndarray, switch back to a pandas dataframe
### and keep the original index so that the join below lines up row by row
gdf = pd.DataFrame(scaled_values_array, columns=col_names_to_scaling, index=original_gdf.index)

### Join the non-scaled columns back with the scaled columns by index
scaled_gdf = original_gdf[col_names_no_scaling].join(gdf)
scaled_gdf.head()

# 4. Encode categorical (text) columns 

As an example of categorical values, we add region names to the zip codes. The region for each zip code area is retrieved with a spatial join against a regions dataset (SuomenMaakuntajako_2020_10k.shp).

Machine learning algorithms do not understand text, so text columns need a different kind of pre-processing. In this exercise we use the most popular method for categorical data, **one-hot encoding** (also known as dummy variables).

We use the pandas **get_dummies()** function for one-hot encoding. scikit-learn also has a **OneHotEncoder()** transformer for this.

* More information on one-hot encoding: https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding
* It might not always be the best option. See other options: https://towardsdatascience.com/stop-one-hot-encoding-your-categorical-variables-bbb0fba89809
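
To make the idea concrete, here is a minimal plain-Python sketch of what one-hot encoding produces: one 0/1 flag per distinct category for each row. The function `one_hot_encode` and the example region list are hypothetical; the notebook itself uses pandas **get_dummies()** for this step.

```python
def one_hot_encode(labels):
    """Return one dict per label with a 0/1 flag for every distinct category."""
    categories = sorted(set(labels))
    return [
        {category: int(label == category) for category in categories}
        for label in labels
    ]

# Hypothetical region names for four zip code areas
regions = ["Uusimaa", "Lappi", "Uusimaa", "Pirkanmaa"]
encoded_rows = one_hot_encode(regions)
for region, row in zip(regions, encoded_rows):
    print(region, row)
```

Each row has exactly one flag set to 1, so the number of new columns equals the number of distinct categories. That is worth keeping in mind when a text field has many unique values.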

### 4.1 Spatially join the region information to the dataset 
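
The join in this step assigns each zip code to the region whose polygon contains the zip code's centroid. The underlying test is a point-in-polygon check. The sketch below shows one classic way to do that check, ray casting, in plain Python. It is an illustration only: geopandas relies on its geometry engine rather than this code, and the square regions, their names and the helper functions are made up.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a horizontal ray from (x, y) crosses."""
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            crossing_x = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < crossing_x:
                inside = not inside
    return inside

# Hypothetical square regions, given as lists of (x, y) corner points
regions = {
    "West": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "East": [(10, 0), (20, 0), (20, 10), (10, 10)],
}

def region_of(centroid):
    """Return the name of the first region containing the centroid, or None."""
    for name, polygon in regions.items():
        if point_in_polygon(centroid[0], centroid[1], polygon):
            return name
    return None

print(region_of((4, 5)))
print(region_of((15, 2)))
```

Using a single centroid point per zip code means that each zip code matches at most one region in non-overlapping regions data, whereas a polygon that straddles a border could intersect several regions.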

In [None]:
### Read the regions shapefile and choose only the name of the region and its geometry
finnish_regions_gdf = gpd.read_file(finnish_regions_shapefile)
finnish_regions_gdf = finnish_regions_gdf[['NAMEFIN','geometry']]

### A function we use to return centroid point geometry from a zip code polygon
def returnPointGeometryFromXY(polygon_geometry):
    ## Calculate x and y of the centroid
    centroid_x,centroid_y = polygon_geometry.centroid.x,polygon_geometry.centroid.y
    ## Create a shapely Point geometry of the x and y coords
    point_geometry = Point(centroid_x,centroid_y)
    return point_geometry

### Stash the polygon geometry to another column as we are going to overwrite the 'geometry' with centroid geometry
scaled_gdf['polygon_geometry'] = scaled_gdf['geometry']

### We will be joining the region name to zip codes according to the zip code centroid. 
### This calls the function above and returns centroid to every row
scaled_gdf["geometry"] = scaled_gdf['geometry'].apply(returnPointGeometryFromXY)

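### Spatially join the region name to the zip codes using the centroid of zip codes and region polygons
### (geopandas 0.10 and newer use the 'predicate' keyword; the older 'op' keyword has been removed in recent versions)
scaled_gdf = gpd.sjoin(scaled_gdf,finnish_regions_gdf,how='inner',predicate='intersects')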
scaled_gdf.tail()

### 4.2 One-hot encode the region name

In [None]:
### Switch the polygon geometry back to the 'geometry' field and drop useless columns
scaled_gdf['geometry'] = scaled_gdf['polygon_geometry']
scaled_gdf.drop(['index_right','polygon_geometry'],axis=1, inplace=True)

### Encode the region name with the One-hot encoding (= in pandas, dummy encoding)
encoded_gdf = pd.get_dummies(scaled_gdf['NAMEFIN'])

### Join scaled gdf and encoded gdf together
scaled_and_encoded_gdf = scaled_gdf.join(encoded_gdf).drop('NAMEFIN',axis=1)

### The resulting dataframe has Polygon and MultiPolygon geometries.
### This upcasts the polygons to MultiPolygon so that all rows have the same geometry type
scaled_and_encoded_gdf["geometry"] = [
    MultiPolygon([feature]) if isinstance(feature, Polygon) else feature
    for feature in scaled_and_encoded_gdf["geometry"]
]
print(f"Dataframe size after adding region name: {len(scaled_and_encoded_gdf.index)} zip codes with {len(scaled_and_encoded_gdf.columns)} columns")

### Print the tail of the dataframe
scaled_and_encoded_gdf.tail()

# 5. Write the pre-processed zip code data to file as a GeoPackage

In [None]:
### Write the prepared zip code dataset to a GeoPackage
scaled_and_encoded_gdf.to_file(output_file_path, driver="GPKG")