# TPM034A Machine Learning for socio-technical systems 
## `Assignment 03: Visualising liveability in Rotterdam`

**Delft University of Technology**<br>
**Q2 2023**<br>
**Instructor:** Sander van Cranenburgh <br>
**TAs:**  Francisco Garrido-Valenzuela & Lucas Spierenburg <br>


## `Instructions`

**Assignments aim to:**<br>
* Examine your understanding of the key concepts and techniques.
* Examine your applied ML skills.

**Assignments:**<br>
* Are graded and must be submitted (see the submission instructions below).

### `Workspace set-up`

**Option 1: Local environment**<br>
Uncomment the following cell if you are running this notebook on your local environment. This will install all dependencies on your Python version.

In [None]:
#!pip install -r requirements.txt

**Option 2: Google Colab**<br>
Uncomment the following code lines if you are running this notebook on Colab.

In [None]:
#!git clone https://github.com/TPM034A/Q2_2023
#!pip install -r Q2_2023/requirements_colab.txt
#!mv "/content/Q2_2023/Assignments/assignment_03/data" /content/data
#!mv "/content/Q2_2023/Assignments/assignment_03/assets" /content/assets

# `Application: liveability in Rotterdam` <br>

### **Introduction**

**Liveability** is a concept used to describe the quality of life in a certain area of a city. The liveability of an area is determined by a combination of factors, such as the presence of amenities, the quality of the environment, and the safety of the area. Liveability is therefore a complex concept that is difficult to measure or quantify. In the Netherlands, the instrument developed to analyse the liveability of an area is called the *Leefbaarometer*. The Leefbaarometer, owned by the Ministry of the Interior and Kingdom Relations, provides an estimate of liveability on a grid of 100 m x 100 m cells covering the entire Netherlands. It is used to signal areas with low liveability and to monitor how liveability evolves over time. See this [link](https://www.leefbaarometer.nl/home.php) for more information.

Currently, the Leefbaarometer does not consider the visual appearance of urban spaces. Intuitively, however, a relationship between the two exists (as we explored in Lab Session 03). In this assignment, you will use Machine Learning (ML) to predict liveability scores (taken from the Leefbaarometer) based on street-level images. Just as we predicted neighbourhood attractiveness in Lab Session 03, your task in this assignment is to explore ML models that predict liveability and to investigate its relationship with the visual appearance of urban spaces (image embeddings).

#### **Data**
For this assignment you have access to several datasets. All of them will be available in the data folder after you execute the cell below these instructions. The data folder contains four sub-folders: `image_tabular`, `geo`, `liveability`, and `images`. The following list describes the datasets.

1. `data/image_tabular/image_metadata.csv`: A csv file with the image metadata (e.g., year, month or location) of Rotterdam images.
2. `data/image_tabular/image_embeddings.csv`: A tabular csv file with image embeddings from Rotterdam.
3. `data/geo/grid.gpkg`: A geo dataset of spatial units for the Netherlands called grid cells.
4. `data/liveability/liveability_scores.csv`: A tabular csv file with liveability scores for each grid cell in the Netherlands.
5. `data/images`: A folder with image files from Rotterdam (read below for more details).

As indicated, run the code in the cell below to prepare the datasets. The cell will download the datasets for this assignment and place them in the data folder automatically. The download may take up to two minutes.

### **Notes**
- The Leefbaarometer is computed for different years and with different models (called versions). The following list explains the columns in the liveability dataset:
    - `grid_id`: Geographical grid id of the liveability score
    - `versie`: Version of the Leefbaarometer model used to determine liveability scores
    - `jaar`: Year of the liveability score
    - `lbm`: liveability score
    - (`afw`, `fys`, `onv`, `soc`, `vrz`, `won`): Other indicators. If you want to explore them visit this [link](https://www.leefbaarometer.nl/home.php).
- The liveability data comes at grid level (100 m x 100 m squares), whereas in Lab Session 03 you worked with hexagon-shaped data. Therefore, in this assignment you have to associate the image data with the grid.

In [None]:
## IMPORTANT: You have to be on the TUDelft network (eduroam) or under eduVPN to run this script
from assets import data_downloader as dld
dld.download_data()

### **Tasks and grading**

Your assignment is divided into 4 subtasks: (1) Data preparation, (2) Data exploration, (3) Model training, and (4) Reflection. In total, 10 points can be earned in this assignment. The weight per subtask is indicated below. 

1. **Data preparation** [2.0 pnt]<br>
    1. Loading datasets. Load the data from geo, image_tabular, and liveability sub-data folders and visualize them with `df.head()`.<br>
    1. Preparing liveability dataset. First, filter the liveability data to keep only the `version 3.0` (`versie` column) and `2020` (`jaar` column). Then, add the grid geometry to the resulting dataset based on `grid_id`. To do so, merge it with the grid dataset.<br>
    1. Preparing the image dataset. First, merge the image metadata with the image embeddings using `img_id`. Then, convert the resulting DataFrame into a GeoDataFrame (check the function provided in Lab03). Finally, do a left spatial join (`gpd.sjoin(how = 'left')`) with the grid dataset to add the corresponding grid_id and grid geometry to each image row.<br>
    1. Preparing the combined liveability-image dataset. First, merge the liveability data (from 1.2) with the image data (from 1.3) based on grid_id. Then, create two datasets based on two different approaches:<br>
        - Grid-level liveability scores: `group by` at grid level and compute the mean of the embeddings columns for each group. Use grid_id as the unit of analysis (this means that in the final dataset each row corresponds to one grid cell). Keep only grid cells with image information (embedding columns) and liveability score. <br>
        - Image-level liveability scores: Associate each image with one liveability score based on the grid cell where the image is located. Use image id as the unit of analysis (in the final dataset each row corresponds to one image). Keep only: `img_id`, `img_path`, `in_folder`, `grid_id`, embedding columns, image-point geometry and `lbm`. **[0.5 pnt]**<br>

1. **Data exploration** [3.0 pnt]<br>
    1. Explore the liveability scores. Plot a histogram of the scores at grid level. Also plot this distribution on a map.<br>
    1. Explore how liveability relates to the images using the two datasets created in 1.4.
        1. Visualise the images from the image-level dataset (from 1.4.2) in groups based on their liveability scores. As in Lab Session 03, use the number of percentiles (`n_percentiles`) and the number of images per percentile (`images_per_row`) to explore different groups of images.<br>
        1. Repeat the previous step using the grid-level dataset (from 1.4.1).<br>
    1. Can you visually see to what extent the images contain information about liveability? Which dataset (from 1.4.1 and 1.4.2) is more promising? Comment.<br>

1. **Model training** [3 pnt]<br>
    1. Use linear regression as a benchmark to decide which dataset (from 1.4.1 and 1.4.2) to use for predicting liveability based on the image embeddings.<br>
    1. Do you think the results from linear regression are consistent with the results from the data exploration? Comment.<br>
    1. Train different machine learning models to predict liveability based on the embedding features. Report the two best models you found.<br>
    1. Train an ensemble model to see if the performance improves when combining your models.<br>

1. **Reflection** [2 pnt]<br>
    1. How well do images predict liveability? Can they be used to predict liveability? Comment.<br>
    1. What are the drawbacks of using this source of data? How can this approach be improved to better predict liveability? Comment.<br>

### `Learning objective`
This assignment provides less structure (i.e. concrete descriptions of tasks we expect you to do) than the previous ones. This is deliberate. By this time, you have more experience. The learning objective is that you are able to reasonably independently apply ML in the context of a socio-technical environment. 

### **Submission**
- The deadline for this assignment is **10 December 2023 23:59** 
- Use **Python 3.10 or above**
- You have to submit your work as a zip file containing the **fully executed** ipynb

In [None]:
# Basic libraries
import pandas as pd
import geopandas as gpd
from PIL import Image

# ML tools
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error, make_scorer
# Models
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import VotingRegressor
from sklearn.neural_network import MLPRegressor

# Visualization libraries
from matplotlib import pyplot as plt
from mpl_toolkits.axes_grid1 import ImageGrid
from branca.element import Figure

# Other libraries
from pathlib import Path
from shapely.geometry import Point

# Show all columns
pd.set_option('display.max_columns', None)

### **1. Data preparation**
#### 1.1 Loading datasets. Load the data from geo, image_tabular, and liveability sub-data folders and visualize them with `df.head()`

#### 1.2. Preparing liveability dataset. First, filter the liveability data to keep only the `version 3.0` (`versie` column) and `2020` (`jaar` column). Then, add the grid geometry to the resulting dataset based on `grid_id`. To do so, merge it with the grid dataset.
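
As a rough illustration of the filter-then-merge pattern described in 1.2, the sketch below uses small made-up tables. How `versie` is encoded in the real file (string or number) is an assumption here, and the `geometry` strings are placeholders for real polygons.

```python
import pandas as pd

# Toy stand-ins for the liveability scores and the grid table (all values are made up)
liveability = pd.DataFrame({
    "grid_id": ["A", "A", "B", "C"],
    "versie": ["3.0", "2.0", "3.0", "3.0"],  # assumed to be stored as strings
    "jaar": [2020, 2020, 2020, 2018],
    "lbm": [4.1, 3.9, 4.5, 3.7],
})
grid = pd.DataFrame({
    "grid_id": ["A", "B", "C"],
    "geometry": ["square_A", "square_B", "square_C"],  # placeholder for real polygons
})

# Keep only one model version and one year
mask = (liveability["versie"] == "3.0") & (liveability["jaar"] == 2020)
liveability_2020 = liveability.loc[mask]

# Attach the grid geometry; an inner merge drops scores without a matching cell
liveability_geo = liveability_2020.merge(grid, on="grid_id", how="inner")
print(liveability_geo)
```

Checking `liveability["versie"].unique()` on the real data before filtering is a cheap way to confirm the encoding.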

#### 1.3. Preparing the image dataset. First, merge the image metadata with the image embeddings using `img_id`. Then, convert the resulting DataFrame into a GeoDataFrame (check the function provided in Lab03). Finally, do a left spatial join (`gpd.sjoin(how = 'left')`) with the grid dataset to add the corresponding grid_id and grid geometry to each image row. 
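
The spatial join in 1.3 could look roughly like the sketch below, which is not the Lab03 helper. The coordinate column names `x` and `y`, the toy cells, and the choice of EPSG:28992 are all assumptions made for illustration. The real image coordinates may be in a different CRS and would then need reprojecting first.

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import box

# Toy image table; the coordinate column names `x` and `y` are assumptions
images = pd.DataFrame({
    "img_id": [1, 2, 3],
    "x": [50.0, 150.0, 500.0],
    "y": [50.0, 50.0, 500.0],
})

# Two adjacent 100 m x 100 m toy cells in a projected CRS (EPSG:28992 assumed)
grid = gpd.GeoDataFrame(
    {"grid_id": ["A", "B"]},
    geometry=[box(0, 0, 100, 100), box(100, 0, 200, 100)],
    crs="EPSG:28992",
)

# Turn the plain DataFrame into point geometries in the same CRS as the grid
images_gdf = gpd.GeoDataFrame(
    images,
    geometry=gpd.points_from_xy(images["x"], images["y"]),
    crs="EPSG:28992",
)

# Left join: every image is kept; images outside all cells get a missing grid_id
joined = gpd.sjoin(images_gdf, grid, how="left", predicate="within")
print(joined[["img_id", "grid_id"]])
```

The joined table keeps the image points as its geometry, so bringing the cell polygon along would take a further merge with the grid on `grid_id`.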

#### 1.4. Preparing the combined liveability-image dataset. First, merge the liveability data (from 1.2) with the image data (from 1.3) based on grid_id. Then, create two datasets based on two different approaches:

##### 1.4.1. Grid-level liveability scores: `group by` at grid level and compute the mean of the embeddings columns for each group. Use grid_id as the unit of analysis (this means that in the final dataset each row corresponds to one grid cell). Keep only grid cells with image information (embedding columns) and liveability score.

##### 1.4.2. Image-level liveability scores: Associate each image with one liveability score based on the grid cell where the image is located. Use images as the unit of analysis (in the final dataset each row corresponds to one image). Keep only: `img_id`, `img_path`, `in_folder`, `grid_id`, embedding columns, image-point geometry and `lbm`.
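
The difference between the two units of analysis in 1.4 can be sketched on a toy merged table. The `emb_` prefix for embedding columns is an assumption and all values are made up.

```python
import pandas as pd

# Toy merged table: one row per image, with its grid cell, embeddings, and score
merged = pd.DataFrame({
    "img_id": [1, 2, 3, 4],
    "grid_id": ["A", "A", "B", None],   # image 4 fell outside every grid cell
    "emb_0": [0.2, 0.4, 0.9, 0.5],
    "emb_1": [1.0, 2.0, 3.0, 4.0],
    "lbm": [4.1, 4.1, 3.6, None],
})
emb_cols = [c for c in merged.columns if c.startswith("emb_")]  # assumed naming

# Image level: one row per image, dropping images without a liveability score
image_level = merged.dropna(subset=["grid_id", "lbm"]).reset_index(drop=True)

# Grid level: average the embeddings of all images within each cell
grid_level = (
    image_level.groupby("grid_id")[emb_cols + ["lbm"]]
    .mean()
    .reset_index()
)
print(grid_level)
```

Because every image in a cell shares the same `lbm`, averaging that column per cell simply returns the cell's score.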

### **2. Data exploration** 
#### 2.1. Explore the liveability scores. Plot a histogram of the scores at grid level. Also plot this distribution on a map.
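
A histogram for 2.1 might be set up as in the sketch below. The scores are synthetic draws standing in for the grid-level `lbm` column, and the number of bins is an arbitrary choice.

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic scores standing in for the grid-level `lbm` column (values are made up)
rng = np.random.default_rng(0)
lbm = rng.normal(loc=4.0, scale=0.3, size=500)

# Histogram of the scores; `counts` and `bin_edges` can be inspected afterwards
fig, ax = plt.subplots(figsize=(6, 4))
counts, bin_edges, _ = ax.hist(lbm, bins=20, edgecolor="black")
ax.set_xlabel("Liveability score (lbm)")
ax.set_ylabel("Number of grid cells")
ax.set_title("Distribution of grid-level liveability")
```

For the map, one option is the `plot(column="lbm", legend=True)` method of a GeoDataFrame that holds the grid polygons.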

#### 2.2. Explore how liveability relates to images using the two datasets created in 1.4.
##### 2.2.1. Visualise the images from the image-level dataset (from 1.4.2) in groups based on their liveability scores. As in Lab Session 03, use the number of percentiles (`n_percentiles`) and the number of images per percentile (`images_per_row`) to explore different groups of images. Remember to use only images available in the folder.
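
One possible way to pick the images for each score group is sketched below with `pd.qcut`. This is not necessarily how Lab Session 03 did it. The table is synthetic, and treating `in_folder` as a boolean flag is an assumption.

```python
import numpy as np
import pandas as pd

n_percentiles = 4     # number of score groups
images_per_row = 3    # images sampled per group

# Synthetic image-level table (values are made up)
rng = np.random.default_rng(1)
images = pd.DataFrame({
    "img_id": np.arange(40),
    "lbm": rng.normal(4.0, 0.3, size=40),
    "in_folder": np.arange(40) % 5 != 0,   # assumed boolean flag; 32 of 40 available
})

# Use only images that are actually available on disk
available = images[images["in_folder"]].copy()

# Bin images into equally sized score groups (0 = lowest scores)
available["group"] = pd.qcut(available["lbm"], q=n_percentiles, labels=False)

# Draw the same number of images from every group
sample = available.groupby("group").sample(n=images_per_row, random_state=0)
print(sample.sort_values("group")[["img_id", "lbm", "group"]])
```

Each row of the resulting image grid would then show the `images_per_row` images drawn from one group, ordered from the lowest to the highest scores.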


##### 2.2.2. Repeat the previous step using the grid-level dataset (from 1.4.1). To select the images, pick them randomly from each grid cell. Remember to use only images available in the folder.

#### 2.3. Can you visually see to what extent the images contain information about liveability? Which dataset (from 1.4.1 and 1.4.2) is more promising? Comment.

### **3. Model training**
#### 3.1. Use linear regression as a benchmark to decide which dataset (from 1.4.1 and 1.4.2) to use for predicting liveability based on the image embeddings.

In [None]:
def eval_regression_perf(model, X_train, X_test, Y_train, Y_test):
    
    # Make prediction with the trained model
    Y_pred_train = model.predict(X_train)
    Y_pred_test = model.predict(X_test)

    # Create a function that computes the MSE, MAE, and R2
    def perfs(Y,Y_pred):
        mse = mean_squared_error(Y,Y_pred)
        mae = mean_absolute_error(Y,Y_pred)
        R2 = r2_score(Y,Y_pred)
        return mse,mae,R2

    # Apply the perfs function to the train and test data sets
    mse_train, mae_train, r2_train = perfs(Y_train,Y_pred_train)
    mse_test,  mae_test , r2_test  = perfs(Y_test,Y_pred_test)
        
    # Print results
    print('Performance')
    print(f'Mean Squared  Error Train | Test: \t{mse_train:>7.4f}\t|  {mse_test:>7.4f}')
    print(f'Mean Absolute Error Train | Test: \t{mae_train:>7.4f}\t|  {mae_test:>7.4f}')
    print(f'R2                  Train | Test: \t{ r2_train:>7.4f}\t|  {r2_test:>7.4f}\n')
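
A minimal sketch of how a linear regression benchmark might be fitted before evaluating it. The features and target are synthetic stand-ins for the embeddings and `lbm`, and the 80/20 split is an arbitrary choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for embeddings (X) and liveability scores (Y)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 10))
Y = X @ rng.normal(size=10) + rng.normal(scale=0.5, size=300)

# Hold out part of the data to measure generalisation
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Fit on the training split only, then score on the held-out split
model = LinearRegression().fit(X_train, Y_train)
r2_test = r2_score(Y_test, model.predict(X_test))
print(f"Test R2: {r2_test:.3f}")
```

In the notebook, the fitted model and the four splits can be passed to `eval_regression_perf(model, X_train, X_test, Y_train, Y_test)` to print train and test metrics side by side. Repeating this for both datasets with the same split settings keeps the comparison fair.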

#### 3.2. Do you think the results from linear regression are consistent with the results from the data exploration? Could you think of an explanation?

#### 3.3. Train different machine learning models to predict liveability based on the embedding features. Report the two best models you found.
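
As one hedged example of tuning a candidate model, the sketch below wraps `MinMaxScaler` and `MLPRegressor` in a pipeline and searches a tiny grid with `GridSearchCV`. The data are synthetic, and the layer size and `alpha` values are arbitrary illustrations, not recommended settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic features and a mildly non-linear target (values are made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# Scaling inside the pipeline means it is re-fitted on each training fold only
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("mlp", MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
])

# Small illustrative grid over the L2 penalty; the values are arbitrary
search = GridSearchCV(pipe, {"mlp__alpha": [1e-4, 1e-2]}, cv=3, scoring="r2")
search.fit(X_train, Y_train)
print(search.best_params_, f"CV R2: {search.best_score_:.3f}")
```

The refitted `search.best_estimator_` can then be scored on the held-out test split, which played no part in the search.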

##### 3.3.1 Model 1

##### 3.3.2 Model 2

#### 3.4. Train an ensemble model to see if the performance improves when combining your models.
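
A minimal sketch of an averaging ensemble with `VotingRegressor`. The two base models and the synthetic data are placeholders for illustration and would be replaced by your own best models and features.

```python
import numpy as np
from sklearn.ensemble import VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Synthetic features and target (values are made up)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
Y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=0
)

# The ensemble prediction is the unweighted average of the base predictions
ensemble = VotingRegressor([
    ("lr", LinearRegression()),
    ("mlp", MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)),
])
ensemble.fit(X_train, Y_train)
pred = ensemble.predict(X_test)
print(f"Ensemble test R2: {r2_score(Y_test, pred):.3f}")
```

Comparing this score with the test scores of the individual models on the same split shows whether combining them helps.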

### **4. Reflection**
#### 4.1. How well do the images predict liveability? Can they be used to predict liveability? Comment.

#### 4.2. What are the drawbacks of using images to predict liveability? How can this approach be improved to better predict liveability? Comment.