# TPM034A Machine Learning for socio-technical systems 
## `Assignment 01: Discover, explore and visualise data`

**Delft University of Technology**<br>
**Q2 2023**<br>
**Instructor:** Sander van Cranenburgh <br>
**TAs:**  Francisco Garrido Valenzuela & Lucas Spierenburg <br>

### `Instructions`

**Assignments aim to:**<br>
* Examine your understanding of the key concepts and techniques.
* Examine your applied ML skills.

**Assignments:**<br>
* Are graded and must be submitted (see the submission instructions below).

### `Workspace set-up`

**Option 1: Local environment**<br>
Uncomment the line in the following cell if you are running this notebook in a local environment. It installs all dependencies into your current Python environment.

In [None]:
#!pip install -r requirements.txt

**Option 2: Google Colab**<br>
Uncomment the code lines in the following cell if you are running this notebook on Google Colab.

In [None]:
#!git clone https://github.com/TPM034A/Q2_2023
#!pip install -r Q2_2023/requirements_colab.txt
#!mv "/content/Q2_2023/Assignments/assignment_01/data" /content/data

## `Application: Liveability and affordable housing in Amsterdam` <br>

### **Introduction**
There is a widespread sense that affordable housing for middle-income households is under pressure. Especially for new entrants to the housing market (i.e. those who do not yet own a house), affordable houses to buy in pleasant neighbourhoods are in short supply. Entrants to the housing market are typically people in their 20s and 30s.<br>

The municipality of Amsterdam would like to tackle this issue (see https://openresearch.amsterdam/en/page/77950/housing-crisis for articles on the subject). However, at present, the municipality lacks insight into the future evolution of real-estate prices and liveability in neighbourhoods. <br>

*You are asked to assist the municipality of Amsterdam in predicting **where** and **how** real-estate prices and liveability will change in the coming years.*<br>

### **Data**

You have access to four data sets:
1. Real-estate prices in Amsterdam, at a 100x100m square grid level
1. Liveability scores in the Netherlands, at a 100x100m square grid level
1. Population statistics in Amsterdam, at a 100x100m square grid level
1. Geographical boundaries of the 100x100m squares in Amsterdam

### **Notes**
- In the liveability scores dataset, the column *versie* shows the different versions of the liveability score. We only use the 3rd version. Thus, you may filter this column to keep *Leefbaarometer 3.0* only.
- You may assume that the population statistics and geospatial data have not substantially changed between 2014 and 2020. Thus, you may assume both apply to 2014 and 2020.
- For population statistics (3rd dataset), [this document](./data/demog_data/metadata.csv) provides a brief explanation of the features.

### **Tasks and grading**

Your assignment is divided into 4 subtasks: (1) data preparation, (2) assessing the relative change in real-estate prices and liveability, (3) training a regression model to predict these changes, and (4) predicting the changes between 2020 and 2026. In total, 10 points can be earned in this assignment. The weight per subtask is shown below.

1.  **Data preparation: construct data from multiple data sources.** [2 pnt]
    1. Load the four datasets and show a preview of each dataset's structure (some DataFrame rows).
    1. Prepare two tabular (i.e. non-GIS data) dataframes, one for 2014 and one for 2020, containing the following information for the city of Amsterdam:
        - The liveability data for the year of interest, using the 3rd version of the Leefbaarometer
        - Demographic data, housing stock data, accessibility to amenities
        - Real-estate prices
         
            **Hints:**
            1. *Make sure to filter the data and remove NULL (NaN values) if required*<br>
            1. *Each row in the data contains the information of one square*
    1. Add the geospatial (i.e. GIS) data of the squares to your dataframes.
1.  **Assess the relative change in real-estate prices and liveability in Amsterdam.** [3 pnt]
    1. Explore how the relative change in liveability associates with the relative change in real-estate prices (i.e use the **percentage of change**). Show your results using a scatter plot.
    1. Visualize the spatial distribution of the relative change in real-estate price and liveability, using two maps of Amsterdam.
    1. What are the spatial trends for the evolution of the real-estate price and the liveability index? Do you observe some relationship between the two variables?
1.  **Train a regression model to predict the change in real-estate price and liveability in Amsterdam between 2014 and 2020, using data for 2014** [2 pnt]
    1. Train a regression model to predict the change in real-estate price and liveability in Amsterdam between 2014 and 2020, using data on real-estate price, liveability, demographics, housing stock, and accessibility to amenities in 2014.
    1. Interpretation of the regression results:
        1.  Interpret the relationship between the price in 2014 and the delta in price
        1. Interpret the relationship between the price in 2014 and the delta in liveability
        1. Compare the model performance of the two regression models. Which metric ($\Delta$ price or $\Delta$ liveability) is easier to predict given the available data in 2014?
1.  **Predict the changes in real-estate price and liveability between 2020 and 2026.** [3 pnt]
    1. Apply your trained regression model on the data of 2020 to predict the relative changes in 2026.
    1. Visualize the spatial distribution of the relative change in real-estate price and liveability between 2020 and 2026. 
    1. What is the difference between the predicted evolution between 2020 and 2026 and the observed evolution between 2014 and 2020?
    1. Qualitatively reflect on machine learning and generalisation. What assumption are you making when predicting future evolution from past evolution? Give an example of a risk associated with this approach and elaborate.

### **Submission**
- The deadline for this assignment is **21 November 2023 23:59**
- Use **Python 3.10 or above**
- You have to submit your work as a zip file containing the fully executed ipynb, via Brightspace

In [None]:
import os
from os import getcwd
from pathlib import Path

import numpy as np
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import seaborn as sns
from branca.element import Figure

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

pd.set_option('display.max_columns', None)

### 1. Data preparation: construct data from multiple data sources [2 pnt]
#### 1.1 Load the four datasets and show a preview of each dataset's structure (some DataFrame rows).

#### 1.2 Prepare two tabular dataframes (for 2014 and 2020) that contain the following information for Amsterdam:
- the liveability data for the year of interest, using the 3rd version of the Leefbaarometer
- population data 
- Real-estate prices

**Hints:**
1. *Make sure to filter the data and remove NULL (NaN values) if required*<br>
1. *Each row in the data contains the information of one square*
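
The filtering and joining steps can be sketched on toy data as follows. This is only an illustration under assumptions: the column names `grid_id`, `jaar`, `score` and `price_2014` are hypothetical and will differ in the actual files; only *versie* and its value *Leefbaarometer 3.0* come from the notes above.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real tables. `grid_id`, `jaar`, `score` and
# `price_2014` are hypothetical column names chosen for this sketch.
liveability = pd.DataFrame({
    "grid_id": ["A", "A", "B", "B", "C"],
    "versie": ["Leefbaarometer 3.0", "Leefbaarometer 2.0",
               "Leefbaarometer 3.0", "Leefbaarometer 3.0",
               "Leefbaarometer 3.0"],
    "jaar": [2014, 2014, 2014, 2020, 2014],
    "score": [4.1, 3.9, np.nan, 4.3, 4.0],
})
prices = pd.DataFrame({
    "grid_id": ["A", "B", "C"],
    "price_2014": [250_000, 310_000, np.nan],
})

# Keep only the 3rd version of the Leefbaarometer and the year of interest.
is_v3 = liveability["versie"] == "Leefbaarometer 3.0"
is_2014 = liveability["jaar"] == 2014
live_2014 = liveability[is_v3 & is_2014]

# Inner join on the square identifier, then drop rows with missing values,
# so that each remaining row describes exactly one square.
df_2014 = (
    live_2014.merge(prices, on="grid_id", how="inner")
    .dropna(subset=["score", "price_2014"])
    .reset_index(drop=True)
)
print(df_2014)
```

An inner join is used here so that only squares present in both tables survive; whether that is appropriate for the real data depends on how many squares would be lost.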

#### 1.3 Add the geographic component of the square to your data

### 2. Assess the relative change in real-estate price and liveability in Amsterdam [3 pnt]

#### 2.1 Explore how the relative change in liveability associates with the relative change in real-estate prices (i.e. use **percentages**), using a scatter plot.
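
As a minimal sketch of the percentage change and the scatter plot, assuming a dataframe with one row per square and hypothetical columns `price_2014`, `price_2020`, `live_2014` and `live_2020` (the values below are made up):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values; the column names are assumptions for this sketch.
df = pd.DataFrame({
    "price_2014": [200.0, 400.0, 250.0],
    "price_2020": [300.0, 440.0, 250.0],
    "live_2014": [4.0, 4.0, 5.0],
    "live_2020": [4.2, 3.8, 5.0],
})


def pct_change(old, new):
    """Relative change from `old` to `new`, expressed in percent."""
    return (new - old) / old * 100


df["d_price_pct"] = pct_change(df["price_2014"], df["price_2020"])
df["d_live_pct"] = pct_change(df["live_2014"], df["live_2020"])

# One point per square: liveability change on x, price change on y.
fig, ax = plt.subplots()
ax.scatter(df["d_live_pct"], df["d_price_pct"], alpha=0.6)
ax.set_xlabel("Change in liveability 2014-2020 (%)")
ax.set_ylabel("Change in real-estate price 2014-2020 (%)")
```

Note that a percentage change is undefined when the 2014 value is zero, so such rows would need to be handled before plotting.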

#### 2.2 Investigate the spatial distribution of the change in real-estate price and liveability, using two maps of Amsterdam.
- Hint: for the maps, we suggest setting the boundaries of the colorscale to the 5th and the 95th percentiles of the change in price and liveability. You can use the map with the boundaries of Amsterdam as a background (`Amsterdam_boundary.gpkg`), which is located at [path](data/spatial_data/Amsterdam_boundary.gpkg).
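
The percentile-based colorscale bounds from the hint can be computed as in the sketch below. The array `change` holds made-up values standing in for the relative change per square; `vmin` and `vmax` are names chosen here because plotting functions such as `GeoDataFrame.plot` accept arguments with those names.

```python
import numpy as np

# Made-up relative changes (in %), including two extreme values.
change = np.array([-40.0, -5.0, 0.0, 3.0, 8.0, 12.0, 20.0, 35.0, 60.0, 300.0])

# The 5th and 95th percentiles serve as the colorscale limits, so that
# a handful of outliers does not wash out the rest of the map.
vmin, vmax = np.percentile(change, [5, 95])
print(vmin, vmax)
```

Values outside `[vmin, vmax]` are then drawn in the end colours of the colormap rather than stretching the scale.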

#### 2.3 What are the spatial trends for the evolution of the real-estate price and the liveability index? Do you observe some relationship between the two variables?

- Real-estate price:
   - High increase in Amsterdam Zuid and in Amsterdam Nieuw-West
   - Low increase in old center
- Liveability:
   - Higher increase in Bos en Lommer (center West), and Noorderpark (North)
- There seems to be a weak spatial correlation between the two variables.

### 3. Predict changes in real-estate prices and liveability [2 pnt]

The municipality has more leverage to regulate real-estate prices if it can anticipate where and how prices may rise. You are thus asked to determine where the municipality should deploy measures to prevent liveable neighbourhoods from becoming unaffordable.

#### 3.1 Train a regression model to predict the change in real-estate price and liveability in Amsterdam between 2014 and 2020, using data for 2014

- Hint: Remember to handle missing data (value=-99997 in this data set).
- Print the $R^2$ for the train and the test set
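
The two hints can be illustrated on synthetic data as follows. This is a sketch under assumptions, not the expected solution: the feature names `price_2014` and `live_2014`, the synthetic target, and the 80/20 split are all chosen for illustration; only the sentinel value -99997 comes from the hint above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 200

# Synthetic features; the column names are assumptions for this sketch.
X = pd.DataFrame({
    "price_2014": rng.uniform(150, 600, n),
    "live_2014": rng.uniform(3.5, 4.5, n),
})
# Synthetic target: cheaper squares increase more, plus noise.
y = 60 - 0.05 * X["price_2014"] + rng.normal(0, 2, n)

# Mimic the data set: the first 10 rows carry the missing-value sentinel.
X.loc[:9, "live_2014"] = -99997

# Turn the sentinel into NaN and keep only complete rows.
X = X.replace(-99997, np.nan)
complete = X.notna().all(axis=1)
X_clean, y_clean = X[complete], y[complete]

X_train, X_test, y_train, y_test = train_test_split(
    X_clean, y_clean, test_size=0.2, random_state=0
)
model = LinearRegression().fit(X_train, y_train)

# `score` returns the coefficient of determination R^2.
print(f"R2 train: {model.score(X_train, y_train):.3f}")
print(f"R2 test:  {model.score(X_test, y_test):.3f}")
```

If the sentinel were left in place, the regression would treat -99997 as a genuine measurement and the coefficients would be badly distorted, which is why it is replaced before fitting.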

#### 3.2 Interpretation
3.2.1 Interpret the relationship between the price in 2014 and the delta in price<br> 
3.2.2 Interpret the relationship between the price in 2014 and the delta in liveability<br>
3.2.3 Compare the model performance of the two regression models. Which metric ($\Delta$ price or $\Delta$ liveability) is easier to predict given the available data in 2014? 


The real-estate prices in cheaper neighborhoods increased more than in already expensive neighborhoods (in percentage terms).

### 4. Predict the changes in real-estate price and liveability between 2020 and 2026, using the data of 2020.
#### 4.1 Apply the (trained) regression model on the data for 2020 to predict the target variables ($\Delta$ price or $\Delta$ liveability) in 2026.
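
Applying a model trained on 2014 data to 2020 data requires the 2020 features to have the same names and order as those seen during training. The sketch below shows this on made-up numbers; the column names `price` and `liveability` and all values are assumptions for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up 2014 features and the observed 2014-2020 change in price (%).
features_2014 = pd.DataFrame({
    "price": [200.0, 300.0, 400.0, 500.0],
    "liveability": [4.0, 4.2, 4.1, 4.4],
})
delta_price_2014_2020 = pd.Series([40.0, 30.0, 20.0, 10.0])

model = LinearRegression().fit(features_2014, delta_price_2014_2020)

# Made-up 2020 features, deliberately given in a different column order.
features_2020 = pd.DataFrame({
    "liveability": [4.3, 4.1],
    "price": [350.0, 600.0],
})

# Reorder the columns to match the training data before predicting.
features_2020 = features_2020[features_2014.columns]
delta_price_2020_2026 = model.predict(features_2020)
print(delta_price_2020_2026)
```

Using year-neutral column names (here `price` rather than `price_2014`) is one way to let the same fitted model accept both years.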

#### 4.2 Visualize the spatial distribution of the change in real-estate price and liveability between 2020 and 2026. 
Hint: Use the same colorscales as in the previous maps.

#### 4.3 What is the difference between the predicted evolution between 2020 and 2026 and the observed evolution between 2014 and 2020?

- Real-estate price:
    - Overall, predicted relative increase is lower compared to the increase between 2014 and 2020.
    - Increase in the West and in the North.
- Liveability:
    - Increase in the West and in the North.

#### 4.4 Qualitative reflection on machine learning and generalisation
The assignment involved using a regression model trained on data from 2014 to 2020 to predict the changes in real estate prices and liveability scores between 2020 and 2026. What assumption(s) underlies this approach? Explain a situation in which this assumption could fail.