## Vector data for exercises

In this course, we will use the vector dataset **Paavo**, which represents postal code area statistics collected by [Statistics Finland](https://www.stat.fi). A metadata description is available on the [Statistics Finland webpage](https://www.stat.fi/static/media/uploads/tup/paavo/paavo_kuvaus_en.pdf); see page 5 onwards for field name descriptions.

The dataset includes variables about each postcode area, describing:

1. Population Structure (24 variables) HE
2. Educational Structure (7 variables) KO
3. Inhabitants' Disposable Monetary Income (7 variables) HR
4. Size and Stage in Life of Households (15 variables) TE
5. Households' Disposable Monetary Income (7 variables) TR
6. Buildings and Dwellings (8 variables) RA
7. Workplace Structure (26 variables) TP
8. Main Type of Activity (9 variables) PT

The overall goal of the exercises is to predict the median income for each zip code based on other variables/features of the dataset. 
This exercise is meant to show the different steps to prepare a vector dataset for machine learning. To make this task worth an exercise, all variables/features of type HR (that tell about the income) are removed from the dataset.

## Vector data preparations

Content of this notebook:

0. Environment preparation
1. Data retrieval
2. Data exploration
3. Data cleaning
4. Feature engineering
5. Feature encoding
6. Train/Test split
7. Feature scaling
8. Store the results

In this notebook we will prepare the Paavo dataset for machine learning: we download all the necessary datasets, clean up some features, join auxiliary data, encode text fields, split the dataset into train, validation and test sets, and scale the features.

The goal of this exercise is to get the dataset ready for subsequent machine learning tasks.


## 0. Environment preparation

Load all the needed Python packages. 


In [None]:

# operating system level operations
import os
# for file operations
import shutil
# filesystem exploration
import glob
# unpacking compressed files
import zipfile
# timing operations
import time
# data handling (and plotting)
import pandas as pd
# visualisation
import matplotlib.pyplot as plt
import seaborn as sns
# geospatial data handling 
import geopandas as gpd
from shapely.geometry import Point, MultiPolygon, Polygon
# Machine learning data preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder 
# download data from URL
from urllib.request import urlretrieve
# for saving the scaler, uncomment following:
# from joblib import dump

In [None]:
# for reproducible results when randomness is involved, we can set a random seed
random_seed= 63

## 1.  Data retrieval



### 1.1 Creating directories
Let's create a data directory in the base of our GeoML directory, where we store the original data.

In [None]:
# Jupyter for courses cloned ("copied") the course material for us into '/scratch/project_2002044/training_xxx/2022/GeoML', we need to define that path
username = os.environ.get('USER')
base_directory= f'/scratch/project_2002044/{username}/2022/GeoML'

def create_dir(directory_name):
    if not os.path.exists(directory_name):
        os.makedirs(directory_name)

data_directory = os.path.join(base_directory,'data')
paavo_directory=os.path.join(data_directory,'paavo')
maakunta_directory=os.path.join(data_directory,'maakunta')
preprocessed_data_directory = os.path.join(data_directory,'preprocessed_regression')

# make sure all needed directories exist before we start
create_dir(data_directory)
create_dir(paavo_directory)
create_dir(maakunta_directory)
create_dir(preprocessed_data_directory)

### 1.2 Getting data
Let's get the original datasets that we need for this exercise from Puhti's `data` (read-only) directory.

In [None]:
puhti_data_directory = '/appl/data/geo'

def copy_files(source,destination):
    for file in glob.glob(source):
        print(file)
        shutil.copy(file, destination)
               
copy_files(os.path.join(puhti_data_directory, 'tilastokeskus/paavo/2022/pno_tilasto_2022.*'), paavo_directory)
copy_files(os.path.join(puhti_data_directory, 'mml/hallintorajat_10k/2021_2022/SuomenMaakuntajako_2021_10k.*'), maakunta_directory)

#when working on your own, it is often preferred to not copy, but read the data directly from source
#from Puhti
#original_gdf = gpd.read_file('/appl/data/geo/tilastokeskus/paavo/2022/pno_tilasto_2022.shp', encoding='utf-8')    
#from Paituli
#original_gdf = gpd.read_file('/vsicurl/https://www.nic.funet.fi/index/geodata/tilastokeskus/paavo/2022/pno_tilasto_2022.shp', encoding='utf-8')


### 1.3 Storing file locations

In order to use the data later, we store their path and filename in variables.

In [None]:
#inputfiles
paavo_shapefile = os.path.join(paavo_directory,'pno_tilasto_2022.shp')
finnish_regions_shapefile = os.path.join(maakunta_directory, 'SuomenMaakuntajako_2021_10k.shp')

#outputfiles
scaled_train_dataset_name = os.path.join(preprocessed_data_directory,'scaled_train_zip_code_data.csv')
scaled_test_dataset_name = os.path.join(preprocessed_data_directory,'scaled_test_zip_code_data.csv')
scaled_val_dataset_name = os.path.join(preprocessed_data_directory,'scaled_val_zip_code_data.csv')
train_label_name = os.path.join(preprocessed_data_directory,'train_income_labels.pkl')
test_label_name = os.path.join(preprocessed_data_directory,'test_income_labels.pkl')
val_label_name = os.path.join(preprocessed_data_directory,'val_income_labels.pkl')
train_dataset_name = os.path.join(preprocessed_data_directory,'train_zip_code_data.csv')
test_dataset_name = os.path.join(preprocessed_data_directory,'test_zip_code_data.csv')
val_dataset_name = os.path.join(preprocessed_data_directory,'val_zip_code_data.csv')

# optional to store the scaler
#scaler_path = '../original_data/paavo/zip_code_scaler.bin'

# optional to store train, validation and test as geopackages for visualization
train_dataset_geo = os.path.join(preprocessed_data_directory,'train_zip_code_data.gpkg')
val_dataset_geo = os.path.join(preprocessed_data_directory,'val_zip_code_data.gpkg')
test_dataset_geo = os.path.join(preprocessed_data_directory,'test_zip_code_data.gpkg')

## 2. Data exploration

Always get to know your data before even thinking about machine learning. This section shows a few ways to get to know the Paavo dataset a bit better; the possibilities are endless. For some models, you should also check that the assumptions the model makes about the data distribution hold.
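
As one possible way to check a distribution assumption, the following minimal sketch computes the skewness of a feature before and after a log transform. The values are made up and are not taken from Paavo; whether such a transform is appropriate depends on the model you plan to use.

```python
import numpy as np
import pandas as pd

# Toy, made-up income-like values with a long right tail (not Paavo data)
income = pd.Series([15000, 17000, 19000, 22000, 26000, 32000, 45000, 90000])

# Skewness near 0 suggests a roughly symmetric distribution,
# clearly positive values indicate a long right tail
skew_raw = income.skew()

# A log transform is one common way to reduce right skew
skew_log = np.log(income).skew()

print(f"skewness raw: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A histogram, as shown later in this section, is a useful visual complement to such a single number.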

### 2.1 Read the data into dataframe

Read the zip code dataset into a geopandas dataframe `original_gdf`:

In [None]:
# defining the encoding makes sure that characters are represented as intended, important especially with languages that have "special characters" 
original_gdf = gpd.read_file(paavo_shapefile, encoding='utf-8')
original_gdf


### 2.2 Exploring the dataframe


In [None]:
# dataframe columns and rows
print(f"Original dataframe size: {len(original_gdf.index)} rows (= zip codes) with {len(original_gdf.columns)} columns (=variables/features)")
# column names
print(list(original_gdf.columns))

In [None]:
# column data types
print(list(pd.unique(original_gdf.dtypes)))

In [None]:
# check for missing values: isna() gives True/False per cell, the first sum() counts the True values per column, the second sum() gives the total for the whole dataframe
print(original_gdf.isna().sum().sum())
# isnull() is an alias of isna(), so this prints the same number
print(original_gdf.isnull().sum().sum())

In [None]:
# check for implausible values
# e.g. income equal to or below 0
print(len(original_gdf[original_gdf["hr_mtu"]==0]))
print(len(original_gdf[original_gdf["hr_mtu"]<0]))

In [None]:
# get value range of each column
for x in original_gdf.columns:
    if original_gdf[x].dtype in ['int64','float64']:
        # avoid the names min/max, which would shadow Python's built-in functions
        min_value = original_gdf[x].min()
        max_value = original_gdf[x].max()
        print(f'Value range of {x} : {min_value} to {max_value}')

It seems that `-1` is used as the no-data value. We can keep this as is and remember it, or replace all `-1` with `np.nan`, which can later be interpolated or removed.
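
Replacing the no-data value could look roughly like the sketch below. It uses a small made-up table instead of `original_gdf`, and it assumes that `-1` never occurs as a real measurement in the numeric columns, which you should verify for your own data first.

```python
import numpy as np
import pandas as pd

# Toy table with made-up values; -1 marks "no data"
toy_df = pd.DataFrame({
    "postinumer": ["00100", "00120", "00130"],
    "hr_mtu": [25000, -1, 31000],
    "te_takk": [1.8, 2.1, -1.0],
})

# Only touch numeric columns so that text columns such as the zip code stay as they are
numeric_columns = toy_df.select_dtypes(include="number").columns
toy_df[numeric_columns] = toy_df[numeric_columns].replace(-1, np.nan)

# Count the missing values per column
print(toy_df.isna().sum())
```

After this step, `dropna()` or an interpolation method can treat these cells like any other missing value.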

### 2.3 Visualization

Another way of exploring data is to visualize different features of your dataset in different ways, to reveal phenomena that might not be visible when looking at numbers only.

#### 2.3.1 Distribution plot

In this exercise, we are interested in the median income (`hr_mtu`). So let's check the distribution of that target feature by plotting a histogram with the seaborn `histplot` function.


In [None]:
# reminder on how to get help within notebook
help(sns.histplot)

In [None]:
# using seaborn histogram plot
sns.histplot(original_gdf['hr_mtu'])
# other option:
#original_gdf['hr_mtu'].hist()
# another option to identify outliers would be to use boxplot 
#sns.boxplot(original_gdf['hr_mtu'])

We can see that some zip codes have an income of 0, which in this case probably means "no data available".

#### 2.3.2 Map
As we are working with spatial data, we can also plot a map of the target feature to explore its spatial distribution.

If plotting maps with matplotlib is not familiar to you, here are some things you can play with:
* **figsize** - different height, width
* **column** - try other features
* **cmap** - this is the color map, here are the possible options https://matplotlib.org/3.3.1/tutorials/colors/colormaps.html

The following plots are only for quick visualization. To include them in publications, more details would need to be taken care of (such as axes and their labels, a north arrow, a colorblind- and print-friendly color palette, ...).

In [None]:
# check coordinate reference system
print(original_gdf.crs)

fig, ax = plt.subplots(figsize=(20, 10))
# set title for the full plot
ax.set_title("Median income by zip code", fontsize=25)
# turn off all axes
ax.set_axis_off()
# plot the median income
plot = original_gdf.plot(column='hr_mtu', ax=ax, legend=True, cmap="magma")
# set colorbar label
cax = fig.get_axes()[1]
cax.set_ylabel('Income in €');

#### 2.3.3 Regression plots

We can also explore how different features relate to the target by drawing regression plots, i.e. scatter plots with a "best fitting" regression line.

In [None]:
# choose some variables
variables = ['euref_x', 'euref_y', 'he_kika','he_miehet']
fig,ax = plt.subplots(2,2)
# to fit all titles and axes labels
fig.tight_layout(h_pad=5,w_pad=5)

# ravel axes to loop through and fill subplots one by one
for var,axes in zip(variables, ax.ravel()):
    # regplot draws a best-fitting regression line by default, which can be turned off via `fit_reg=False`
    sns.regplot(x=var, y='hr_mtu', data=original_gdf,  marker='.', scatter_kws = {'s': 10},ax = axes).set(title=f'Regression plot of \n {var} and median income');

## 3. Data cleaning 

We can check for empty rows and columns as well as single empty cells, and either remove them from the dataset or, if domain knowledge allows, fill them with sensible values. Note that this might have a significant impact on the results. So fill with care and, if unsure, rather remove.


In [None]:
# Drop all rows that have missing values or where median income is -1 (=not known) or 0
selected_gdf = original_gdf.dropna()
selected_gdf = selected_gdf[selected_gdf["hr_mtu"]>0].reset_index(drop=True)

print(f"Dataframe size after dropping no data (for income) rows: {len(selected_gdf.index)} zip codes with {len(selected_gdf.columns)} columns")

# Remove some columns that are strings (nimi, namn, kunta = name of the municipality in Finnish and Swedish)
# or which might make the modeling too easy as they are directly related to inhabitants' income ('hr_ktu','hr_tuy','hr_pi_tul','hr_ke_tul','hr_hy_tul','hr_ovy') or household income ('tr_ktu', 'tr_mtu'); 'hr_mtu' is kept, as it is the label we want to predict
columns_to_be_removed_completely = ['nimi','namn','kunta','hr_ktu','hr_tuy','hr_pi_tul','hr_ke_tul','hr_hy_tul','hr_ovy', 'tr_ktu', 'tr_mtu']
selected_gdf = selected_gdf.drop(columns_to_be_removed_completely,axis=1)

print(f"Dataframe size after dropping columns with string values and columns that make modeling too easy : {len(selected_gdf.index)} zip codes with {len(selected_gdf.columns)} columns")

## 4. Feature engineering

This section does not include any computations, as this goes out of scope of the course and is mainly geospatial processing that is not specific to machine learning.

Sometimes, the features as they come from the dataset can be further refined to represent the data in a way that is easier to use for modelling. One example would be to calculate the ratio or some other statistical measure of two or multiple features. This step requires domain knowledge to find sensible features. In this step you can also again think about, what additional datasets could be used to add information for the task.
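
A ratio feature of the kind described above could be sketched as follows. The table is made up; `he_miehet` appears in this notebook, while `he_vakiy` (total number of inhabitants) and the new column name `share_men` are assumptions used for illustration only.

```python
import pandas as pd

# Toy table with made-up values; he_vakiy is an assumed column name for total inhabitants
toy_df = pd.DataFrame({
    "he_vakiy": [1200, 800, 0],
    "he_miehet": [600, 360, 0],
})

# Share of men per zip code; where() turns zero inhabitants into NaN to avoid dividing by zero
inhabitants = toy_df["he_vakiy"].where(toy_df["he_vakiy"] > 0)
toy_df["share_men"] = toy_df["he_miehet"] / inhabitants
print(toy_df)
```

A share like this is comparable between small and large zip code areas, which absolute counts are not.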

In the spatial domain, incorporating the neighborhood of each zip-code into features or describing the shape of polygons with values could be ways of feature engineering.

For example, we expect (=domain knowledge) that people with higher income can afford to live in cities and near lakes or some national park. So we could also engineer some features that represent these factors with additional datasets:

* distance to closest city center, which could be derived from [naturalearth populated places dataset](https://www.naturalearthdata.com/downloads/110m-cultural-vectors/110m-populated-places/)
* number of lakes in the area (e.g. 5km radius), see e.g. [SYKE water areas](https://ckan.ymparisto.fi/dataset/%7BAD287567-30F9-4529-9B47-2D6573FCAA9E%7D)
* distance to closest national park, see e.g. [SYKE state areas of natural protection](https://ckan.ymparisto.fi/dataset/%7BC8FC4A42-A2C3-40C4-92CD-2299C688514E%7D)
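
As a rough illustration of the first idea, the sketch below computes the distance from each zip code centroid to the closest city centre with plain NumPy. All coordinates are made up, the column name `dist_city_km` is hypothetical, and the approach assumes that both point sets are in the same projected coordinate system with metres as unit.

```python
import numpy as np
import pandas as pd

# Toy zip code centroids in a projected CRS (metres); values are made up
zip_codes = pd.DataFrame({
    "postinumer": ["00001", "00002", "00003"],
    "euref_x": [385000.0, 390000.0, 500000.0],
    "euref_y": [6672000.0, 6680000.0, 7000000.0],
})

# Hypothetical city centre coordinates in the same CRS
city_centres = np.array([
    [385000.0, 6672000.0],
    [500000.0, 6990000.0],
])

# Pairwise Euclidean distances: one row per zip code, one column per city
centroids = zip_codes[["euref_x", "euref_y"]].to_numpy()
differences = centroids[:, np.newaxis, :] - city_centres[np.newaxis, :, :]
distances = np.sqrt((differences ** 2).sum(axis=2))

# Keep the distance to the closest city centre, in kilometres
zip_codes["dist_city_km"] = distances.min(axis=1) / 1000
print(zip_codes)
```

With real geometries, the `distance` method of a geopandas `GeoSeries` would be one alternative to the manual computation.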

In the temporal domain, if we are working with time series but not with specific time series models, we could create features that represent change over time, such as the ratio of the same variable at two time points.

Be creative! 

> Note: Make sure that you can create the same features also for future datasets that you might want to apply your model to.


## 5. Feature encoding

* Most Machine Learning algorithms cannot handle categorical features per se, they have to be converted to numerical values
* Categorical features can be binary (True/False, 1/0), ordinal (low,medium,high) or nominal (monkey, donkey, tiger, penguin)

To practice, we can add region names to the post codes. One of the most-used encoding techniques is **one-hot encoding**. This means that instead of one column with different names, we create one new column per unique value of the original column and fill them with 1/0.
* Same information content, but numerical cells and no hierarchy (as we would get when simply assigning a numerical value to each string)
* The resulting columns are also called "dummy variables"

We use the pandas **get_dummies()** function for one-hot encoding. Scikit-learn also provides a **OneHotEncoder()** transformer for this.

* More information on one-hot encoding https://www.kaggle.com/dansbecker/using-categorical-data-with-one-hot-encoding
* It might not always be the best option. See other options https://towardsdatascience.com/stop-one-hot-encoding-your-categorical-variables-bbb0fba89809

### 5.1 Spatially join the region information to the dataset 

First we need to bring the two dataframes together. We want to know which region each zip code area is in, so we "spatially join" the two dataframes. As a zip code area might overlap several regions, we assign each zip code to the region that contains the centroid of its polygon.

In [None]:
# Read the regions shapefile and select only the name of the region and its geometry
finnish_regions_gdf = gpd.read_file(finnish_regions_shapefile)
finnish_regions_gdf = finnish_regions_gdf[['NAMEFIN','geometry']]

# A function we use to return centroid point geometry from a zip code polygon
def returnPointGeometryFromXY(polygon_geometry):
    ## Calculate x and y of the centroid
    centroid_x,centroid_y = polygon_geometry.centroid.x,polygon_geometry.centroid.y
    ## Create a shapely Point geometry of the x and y coords
    point_geometry = Point(centroid_x,centroid_y)
    return point_geometry

# Stash the polygon geometry to another column as we are going to overwrite the 'geometry' with centroid geometry
selected_gdf['polygon_geometry'] = selected_gdf['geometry']

# We will be joining the region name to zip codes according to the zip code centroid. 
# This calls the function above and returns centroid to every row
selected_gdf["geometry"] = selected_gdf['geometry'].apply(returnPointGeometryFromXY)

# Spatially join the region name to the zip codes using the centroid of zip codes and region polygons
selected_and_joined_gdf = gpd.sjoin(selected_gdf,finnish_regions_gdf,how='inner',predicate='intersects')
# look at the end of the dataframe to see if it worked (the beginning of the dataframe has too many zip codes in same area)
selected_and_joined_gdf.tail()


From here onwards, we do not need the geometry of the zip code areas anymore and can remove them from the dataframe.
If you want to visualize the resulting datasets with geometries later, you can join them back via the zipcode.

In [None]:
selected_and_joined_gdf.drop(['index_right','polygon_geometry', 'geometry'],axis=1, inplace=True)

### 5.2 One-hot encode the region name

Let's practice now the one-hot encoding on the spatially joined dataframe.

In [None]:
# One-hot encode the region name with pandas get_dummies (= dummy encoding)
# scikit-learn's OneHotEncoder() would be an alternative, it is not used here
encoded_gdf = pd.get_dummies(selected_and_joined_gdf['NAMEFIN'])

col_names_no_scaling = list(encoded_gdf.columns)

# Join original gdf and encoded gdf together, drop the original finnish name column
new_encoded_gdf = selected_and_joined_gdf.join(encoded_gdf).drop('NAMEFIN',axis=1)

print("Dataframe size after adding region name: " + str(len(new_encoded_gdf.index))+ " zip codes with " + str(len(new_encoded_gdf.columns)) + " columns")

# Print the tail of the dataframe
new_encoded_gdf.tail()

## 6. Train/Test split

To determine later how well our models perform on previously unseen data, we need to split the dataset into so-called `train`, `validation` and `test` datasets. We use the `train` dataset during model training, so our regressor gets to know that dataset really well. Then we use the `validation` dataset to fine-tune the parameters of our models, i.e. we use knowledge gained from applying the trained model to unseen data to adapt the parameters. That means that this dataset is no longer unknown to the model. So we need a third dataset (`test`) to finally test how well the model performs on previously unseen data.

![](../images/supervised_workflow.png)



In [None]:
# Split the gdf to x (the predictor attributes) and y (the attribute to be predicted)
y = new_encoded_gdf['hr_mtu'] # Median income

# Remove label
x = new_encoded_gdf.drop(['hr_mtu'],axis=1)

# Split both datasets to train (60%) and test (40%) datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=.4, random_state=random_seed)

# Split the test dataset in half, to get 20% validation and 20% test dataset
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=.5, random_state=random_seed)

# \n tells the print command to start a new line
print(f'Shape of train dataset: {x_train.shape} \n Shape of test dataset: {x_test.shape} \n Shape of validation dataset: {x_val.shape}')
        
x_train = x_train.reset_index(drop=True)
x_test = x_test.reset_index(drop=True)
x_val = x_val.reset_index(drop=True)

## 7. Feature Scaling

Feature scaling is one of the most important data preparation steps. It avoids biasing algorithms that compute distances between data points (e.g. KNN, SVM and other non-tree-based models) towards features with numerically larger values. Feature scaling also helps the algorithm to train and converge faster.
The most popular scaling techniques are normalization and standardization. Both scale each value based on statistics of all values of the same feature (minimum and maximum, or mean and standard deviation). This means that the scaler has to be fitted after the train/test split, on the training set only, so that no information about unseen data leaks into training. The fitted scaler is then applied to the validation and test sets.

### 7.1 Normalization or min-max scaling

* X_new = (X - X_min)/(X_max - X_min)
* Used when features are of different scales, e.g. average size of household (te_takk) and number of inhabitants of a certain age class (he_x_y)
* Scales the values into range [0,1] or [-1,1]
* Data should not have any large outliers (data exploration!), as these squash the rest of the data into a narrow range. If there are outliers, standardization is the better option
* Scikit-learn: [MinMaxScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)
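
The formula above and the effect of an outlier can be illustrated with a few made-up numbers. This is only a sketch, the values are not taken from Paavo.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up feature values; the last one is a large outlier
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# Manual min-max scaling following the formula above
manual = (values - values.min()) / (values.max() - values.min())

# The same with scikit-learn (default feature_range is (0, 1))
scaled = MinMaxScaler().fit_transform(values)

print(scaled.ravel())
```

The outlier is mapped to 1, while the four other values all end up very close to 0, which is the squashing effect mentioned in the list.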

### 7.2 Standardization or Z-score normalization

* X_new = (X - mean)/std
* Used when "zero mean and unit standard deviation" needs to be ensured, we are standardizing to achieve equal variance of features
* Not bound to specific range
* Less affected by outliers: as the output range is not fixed, an outlier does not squash the remaining values into a narrow range
* "1 implies that the value for that case is one standard deviation above the mean"
* Scikit-learn: [StandardScaler()](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler)
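
A minimal sketch of standardization on made-up numbers, fitting the scaler on the training part only and reusing it for the test part. The variable names `x_train_toy` and `x_test_toy` are chosen here to avoid clashing with the notebook's own variables.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for one feature, already split into train and test
x_train_toy = np.array([[10.0], [20.0], [30.0], [40.0]])
x_test_toy = np.array([[25.0], [50.0]])

# Fit on the training data only: mean and standard deviation come from train
toy_scaler = StandardScaler()
x_train_scaled = toy_scaler.fit_transform(x_train_toy)

# Reuse the training mean and standard deviation for the test data
x_test_scaled = toy_scaler.transform(x_test_toy)

print(toy_scaler.mean_, toy_scaler.scale_)
print(x_test_scaled.ravel())
```

The scaled training data has zero mean and unit standard deviation. The scaled test data generally does not, because it is expressed relative to the training statistics.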


In [None]:
# Get list of all column headings
all_columns = list(x_train.columns)

col_names_no_scaling.extend(['postinumer'])
print(col_names_no_scaling)

# List of column names we want to scale. (all columns minus those we don't want)
col_names_to_scale = [column for column in all_columns if column not in col_names_no_scaling]

# Subset the data for only those to-be scaled
x_train_to_scale = x_train[col_names_to_scale]
# we do not need to scale the label, but we also need to scale the test and validation data
x_test_to_scale = x_test[col_names_to_scale]
x_val_to_scale = x_val[col_names_to_scale]


# Apply a Scikit StandardScaler() or MinMaxScaler() for all the columns left in dataframe
# You can also test both, rename variable `train/test/val_dataset_name` after running the remaining cells with one scaler, to not overwrite results
scaler = StandardScaler()
#scaler = MinMaxScaler()

# We fit the scaler to the training dataset and transform the training dataset
scaled_x_train_array = scaler.fit_transform(x_train_to_scale)

# You can save the fitted scaler for later use (only after fitting, before that it holds no parameters). If there suddenly would be more zip codes in Finland, we should use the same scaler.
# dump(scaler, scaler_path, compress=True)

# we also need to scale x_test and x_val with the same scaler, note that we only transform, not fit, the test and validation data
scaled_x_test_array = scaler.transform(x_test_to_scale)
scaled_x_val_array = scaler.transform(x_val_to_scale)

# Result is a numpy ndarray, which we pack back into a pandas dataframe
# Join the non-scaled columns back with the scaled columns by index and drop all rows that have nodata values after scaling
def to_pandas_and_rejoin(scaled_array, col_names_to_scale, unscaled_data):
    scaled_x = pd.DataFrame(scaled_array)
    scaled_x.columns = col_names_to_scale
    full_scaled_x = scaled_x.join(unscaled_data).dropna()
    return full_scaled_x
    

full_scaled_x_train = to_pandas_and_rejoin(scaled_x_train_array, col_names_to_scale, x_train[col_names_no_scaling])
full_scaled_x_test = to_pandas_and_rejoin(scaled_x_test_array, col_names_to_scale, x_test[col_names_no_scaling])
full_scaled_x_val = to_pandas_and_rejoin(scaled_x_val_array, col_names_to_scale, x_val[col_names_no_scaling])


## 8. Store the results

If you want to visualize the resulting datasets on a map, you can join the geometry back to the zip code and store as geopackage.

In [None]:
original_gdf[['postinumer','geometry']].merge(x_train,on='postinumer', how='right').to_file(train_dataset_geo, driver='GPKG')  
original_gdf[['postinumer','geometry']].merge(x_val,on='postinumer', how='right').to_file(val_dataset_geo, driver='GPKG')  
original_gdf[['postinumer','geometry']].merge(x_test,on='postinumer', how='right').to_file(test_dataset_geo, driver='GPKG') 

We will need the results of this notebook in two further notebooks, so we store the prepared train, validation and test datasets without geometries and zip codes as csv.
We also store the labels for the train, validation and test datasets as pickle.

In [None]:
# Write the prepared train, test and validation zip code datasets to csv, drop the zip code column ('postinumer') for that
full_scaled_x_train.drop(['postinumer'], axis=1).to_csv(scaled_train_dataset_name, index=False)
full_scaled_x_test.drop(['postinumer'], axis=1).to_csv(scaled_test_dataset_name, index=False)
full_scaled_x_val.drop(['postinumer'], axis=1).to_csv(scaled_val_dataset_name, index=False)

# You can also store the unscaled train, test and validation datasets, which can be used with tree-based models
x_train.drop(['postinumer'], axis=1).to_csv(train_dataset_name, index=False)
x_test.drop(['postinumer'], axis=1).to_csv(test_dataset_name, index=False)
x_val.drop(['postinumer'], axis=1).to_csv(val_dataset_name, index=False)

# Write the labels to pickle, as we do not need to read it outside of these notebooks, otherwise json or csv would be more compatible options
y_train.to_pickle(train_label_name)
y_test.to_pickle(test_label_name)
y_val.to_pickle(val_label_name)
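
In a later notebook, the stored files can be read back roughly as in the sketch below. It uses a temporary directory and small made-up stand-ins for the features and labels, so it runs without the real data. Only the file names follow the ones defined in section 1.3.

```python
import os
import tempfile
import pandas as pd

# Toy stand-ins for the prepared features and labels (values are made up)
features = pd.DataFrame({"he_kika": [0.5, -1.2], "Uusimaa": [1, 0]})
labels = pd.Series([24000, 19500], name="hr_mtu")

with tempfile.TemporaryDirectory() as tmp_dir:
    features_path = os.path.join(tmp_dir, "scaled_train_zip_code_data.csv")
    labels_path = os.path.join(tmp_dir, "train_income_labels.pkl")

    # Store the data in the same formats as in this notebook
    features.to_csv(features_path, index=False)
    labels.to_pickle(labels_path)

    # Read the data back, as a later notebook would do
    loaded_features = pd.read_csv(features_path)
    loaded_labels = pd.read_pickle(labels_path)

# Features and labels should have the same number of rows
print(loaded_features.shape, loaded_labels.shape)
```

Checking that features and labels have the same number of rows after loading is a cheap safeguard against mismatched files.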