# TPM034A Machine Learning for socio-technical systems 
## `Assignment 02: Artificial Neural Networks`

**Delft University of Technology**<br>
**Q2 2022**<br>
**Instructor:** Sander van Cranenburgh <br>
**TAs:**  Francisco Garrido Valenzuela & Lucas Spierenburg <br>

### `Instructions`

**Assignments aim to:**<br>
* Examine your understanding of the key concepts and techniques.
* Examine your applied ML skills.

**Assignments:**<br>
* Are graded and must be submitted (see the submission instruction below). 

### `Workspace set-up`
**Option 1: Google Colab**<br>
Uncomment the code lines in the following cell if you are running this notebook on Colab.

In [None]:
#!git clone https://github.com/TPM34A/Q2_2022
#!pip install -r Q2_2022/requirements_colab.txt
#!mv "/content/Q2_2022/Assignments/assignment_02/data" /content/data

**Option 2: Local environment**<br>
Uncomment the following cell if you are running this notebook on your local environment. This will install all dependencies on your Python version.

In [None]:
#!pip install -r requirements.txt

## `Application: Predicting the effects of a car ban in the city centre of Leeds` <br>

### **Introduction**
The city of Leeds, in the United Kingdom, is considering implementing a ban on private cars in the city centre. Car-free city centres are increasingly popular in Western European countries. As cars produce various externalities, including traffic accidents, air pollution, and noise pollution, a car ban has the potential to make the city centre more attractive and a better place to live and do business.

Your assignment is to inform the decision-makers in Leeds about the effects of a car ban. Specifically, the city of Leeds does not yet know the extent to which a car ban would shift the mode shares of trips going to the city centre. This information is vital to assess the viability and effectiveness of the car ban policy under consideration.

To inform the decision-makers in Leeds, in this assignment you will:
1. Create a model that predicts the mode choices, given a set of travel characteristics. Specifically, you will train a neural network based on observed travel patterns. 
2. Use your trained model to predict the effect of the car ban policy on mode shares for trips going to the city centre.<br>

### **Data**

You have access to three data sets:
1. Travel patterns and mode choice data. These data are obtained from a so-called revealed-preference survey; see a description of the data [here](https://link.springer.com/article/10.1007/s11116-018-9858-7).
1. Zones of Leeds (GIS)
1. Mode shares per zone in Leeds, derived from the two other datasets.
<br>

`IMPORTANT`<br>
These data are exclusively made available by its owners for **educational purposes**.<br> 
You are **NOT** allowed to **share or further distribute** these data with anyone other than those involved in TPM034A.

### **Notes**
- The description of each column of revealed-preference dataset is [here](data/model_average_RP_description.pdf)
- In the revealed-preference dataset, the following columns are considered *numerical travel features*: 'avail_car', 'avail_taxi', 'avail_bus', 'avail_rail', 'avail_cycling', 'avail_walking', 'total_car_cost', 'taxi_cost', 'bus_cost_total_per_leg', 'rail_cost_total_per_leg', 'car_distance_km', 'bus_distance_km', 'rail_distance_km', 'taxi_distance_km', 'cycling_distance_km', 'walking_distance_km', 'car_travel_time_min', 'bus_travel_time_min', 'rail_travel_time_min', 'taxi_travel_time_min', 'cycling_travel_time_min', 'walking_travel_time_min', 'bus_IVT_time_min', 'bus_access_egress_time_min', 'rail_IVT_time_min', 'rail_access_egress_time_min', 'bus_transfers', 'rail_transfers'.
- Each row in the zone dataset (2nd dataset) corresponds to an individual zone in Leeds and contains 4 columns. The description of each column is shown in the following table:


| Column   | Description                                                                                                                                                                                                  |
|----------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| LSOA11CD | Zone Code                                                                                                                                                                                                    |
| LSOA11NM | Zone Name                                                                                                                                                                                                    |
| Region   | Region Code, corresponds to a larger region formed by a set of zones. Values = {'C': Center region, 'R': Ring center region, 'NW': North-West region, 'NE': North-East region, 'SW': South-West region, 'SE': South-East region} |
| geometry | Polygonal geometry of each zone                                                                                                                                                                              |


### **Tasks and grading**

Your assignment is divided into 4 subtasks: (1) Data preparation, (2) Data exploration, (3) Model training, and (4) Assessment of the impact of the car ban policy on mode shares. In total, 10 points can be earned in this assignment. The weight per subtask is shown below. 

1.  **Data preparation: Load datasets and make a first inspection** [1 pnt]
    1. Load the two datasets using Pandas and GeoPandas.
    1. Check the structure of both datasets e.g. using `df.head()` or `df.describe()`.
    1. Handle the NaN values, i.e. keep only trips where the **destination** is known.
    1. Create a map that shows the six regions of Leeds (C, R, NW, NE, SW, SE) in separate colors.
1. **Data exploration: discover and visualise mobility patterns.** [3 pnt] 
    1. For each zone, count the number of times that zone is a destination (hint: use the pandas *groupby* method). Create a visualisation showing the statistical distribution of these counts, using a histogram. What can you say about this distribution? 
    1. Create a visualisation showing the spatial distribution of these counts. To do so, merge this count dataframe with the geographic delineation of zones.
    1. Create a figure with 2 subplots showing the mode share of 'Car' (left) and of 'Bus' (right) in every destination zone in regions R and C, and interpret the results.<br> For your convenience, we have preprocessed the data for you. That is, we have added mode shares per destination zone. (Use the same color scale for the two maps)
1. **Model training: Train a MultiLayerPerceptron (MLP) neural network to predict the choices** [3 pnt]
    1. Use the *numerical travel features* (see notes above) and the following two categorical features: purpose and destination regions. Remember: (1) to scale all variables appropriately before training your MLP, and (2) to encode categorical variables.  (hint: use the pandas **get_dummies** method to encode the categorical variables).
    1. Tune the hyperparameters of your MLP. That is, do a gridsearch over the following hyperparameter space:
        - Architecture: {1 HL w/30 nodes, 2 HL w/ 5 nodes}
        - Alpha parameter: {0.1, 0.001}
        - Learning rate: {0.01, 0.001}
    1. Fit an MLP model using the optimal hyperparameters found, and report and interpret the following output metrics:
        - accuracy
        - cross-entropy
        - confusion matrix.
1. **Assess the impact of a car ban policy on mode shares** [3 pnt]
    1. Benchmark scenario: Create a new dataframe containing only trips with a destination in region C. Predict the mode shares for these trips, using your trained model. Use the *predict_proba* function from scikit-learn. Why should you NOT use the *predict* function in this case?
    1. Car-ban scenario: In the dataset created in 4.1, set *avail_car* to zero. Use your trained model to predict the modes. (**Remember to scale the data with the scaler created for training the model**).
    1. Compare your results. That is, analyse how mode shares have changed as a result of the car ban policy.  Create a visualisation representing the shift in mode shares. By which mode have car trips most often been substituted?
    1. Reflect on your analysis. Do you think your analysis is meaningful? Why/why not? What is the main limitation of your analysis?


### **Submission**
- The deadline for this assignment is **Wed, 30 November 2022** 
- Use **Python 3.7 or above**
- Submit your work in Brightspace as a zip file containing the **fully executed** ipynb notebook

In [None]:
# Import required Python packages and modules
import os
import pandas as pd
import geopandas as gpd
import sklearn as sk
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pickle
from pathlib import Path

# Import selected functions and classes from Python packages
from os import getcwd
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import ConfusionMatrixDisplay, log_loss, matthews_corrcoef, make_scorer, classification_report

# Setting
pd.set_option('display.max_columns', None)

### 1. Data preparation: Load datasets and make a first inspection [1 pnt]
#### 1.1. Load the two dataset using Pandas and GeoPandas

In [None]:
# Set the data folder path
data_folder = Path('data')

In [None]:
# Load the revealed-preference dataset as pandas DataFrame
rp_df = pd.read_csv(data_folder/'RP_mode_choice_data.csv')

# Load the Leeds zones dataset as geopandas GeoDataFrame
leeds_zones_gdf = gpd.read_file(data_folder/'Leeds_zones.gpkg')

#### 1.2. Check the structure of both datasets e.g. using `df.head()` or `df.describe()`.

In [None]:
rp_df

In [None]:
rp_df.describe()

In [None]:
leeds_zones_gdf

In [None]:
leeds_zones_gdf.describe()

#### 1.3. Handle the NaN values. I.e. only keep trips where the **destination** is known.

In [None]:
rp_df[rp_df.columns[rp_df.isnull().any()]].isnull().sum()

In [None]:
leeds_zones_gdf.isna().sum()

In [None]:
# Drop all NaN values in the destination column in the revealed-preference dataset as they cannot be used for destination predictions
rp_df = rp_df.dropna(subset=['Destination_lsoa_code'])

In [None]:
rp_df[rp_df.columns[rp_df.isnull().any()]].isnull().sum()

#### 1.4. Create a map that shows the six regions of Leeds (C, R, NW, NE, SW, SE) in separate colors.


In [None]:
# Draw the map based on region
fig, ax = plt.subplots(figsize=(10,10))

leeds_zones_gdf.plot(ax=ax, column = 'Region', legend = True, cmap='viridis')
ax.set_axis_off()
ax.set_title("Leeds zones grouped as regions")
plt.show()

### 2. Data exploration: discover and visualise mobility patterns. [3 pnt]
#### 2.1 For each zone, count the number of times that zone is a destination (hint: use the pandas *groupby* method). Create a visualisation showing the statistical distribution of these counts, using a histogram. What can you say about this distribution?

In [None]:
zone_destination_count = rp_df['Destination_lsoa_code'].value_counts()
zone_destination_count.head()

In [None]:
# Create histogram and empirical CDF for zone destination count
fig, axes = plt.subplots(1, 2, figsize=(15, 5), sharex=True)
sns.histplot(ax = axes[0],x = zone_destination_count)
ecdf_data = sns.ecdfplot(ax = axes[1],x = zone_destination_count)
axes[0].set_xlabel("Number of trips with the zone as destination")
axes[1].set_xlabel("Number of trips with the zone as destination")
axes[1].grid(True,linewidth = 0.5)
axes[1].minorticks_on()
axes[1].grid(which='minor', linestyle=':', linewidth='0.5', color='black')

The number of times a zone is a destination is heavily right-skewed: a very small number of popular zones attracts most trips, while most other zones are rarely visited as destinations. This probably reflects the difference between zones where people go for work, study, shopping or leisure and zones that are mainly residential.
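
The skew described above can also be quantified. The following is a minimal sketch on made-up destination codes (the `toy_destinations` series and the 25% threshold are assumptions for illustration, not the Leeds data): it computes the share of trips ending in the most visited zones and the sample skewness of the counts.

```python
import numpy as np
import pandas as pd

# Hypothetical destination codes, for illustration only (not the Leeds data)
toy_destinations = pd.Series(
    ["Z1"] * 50 + ["Z2"] * 30 + ["Z3"] * 5 + ["Z4"] * 5
    + ["Z5"] * 4 + ["Z6"] * 3 + ["Z7"] * 2 + ["Z8"] * 1
)

# Same counting step as in the notebook
toy_counts = toy_destinations.value_counts()

# Share of all trips that end in the 25% most visited zones (threshold is an assumption)
n_top = max(1, int(np.ceil(0.25 * len(toy_counts))))
top_share = toy_counts.sort_values(ascending=False).iloc[:n_top].sum() / toy_counts.sum()

# Sample skewness: a positive value indicates a long right tail
skewness = toy_counts.skew()

print(f"Top {n_top} zones receive {top_share:.0%} of trips; skewness = {skewness:.2f}")
```

Applied to `zone_destination_count`, the same two numbers would back up the visual impression from the histogram and the empirical CDF.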

#### 2.2 Create a visualisation showing the *spatial* distribution of these counts. To do so, merge this count dataframe with the geographic delineation of zones.

In [None]:
# Create a mergeable DataFrame from the count series
# (naming the index and the counts explicitly gives the same column names across pandas versions)
zone_destination_count_df = zone_destination_count.rename_axis('LSOA11CD').reset_index(name='zone_destination_count')
zone_destination_count_df

In [None]:
# Merge the zone destination count DataFrame with the GeoDataFrame
leeds_zones_gdf = leeds_zones_gdf.merge(zone_destination_count_df, on="LSOA11CD")

In [None]:
# Plot the map
fig, ax = plt.subplots(figsize=(10,10))

leeds_zones_gdf.plot(ax=ax, column = 'zone_destination_count', legend = True, cmap='viridis')
ax.set_axis_off()
ax.set_title("Map of Leeds zones with destination count")
plt.show()

#### 2.3. Create a figure with 2 subplots showing the mode share of 'Car' (left) and of 'Bus' (right) in every destination zone in regions R and C, and interpret the results.<br> For your convenience, we have preprocessed the data for you. That is, we have added mode shares per destination zone. (Use the same color scale for the two maps)

In [None]:
# Load the Leeds mode share per zone dataset as geopandas GeoDataFrame
mode_shares_per_zone_gdf = gpd.read_file(data_folder/'mode_shares_per_zones.gpkg')
mode_shares_per_zone_gdf

In [None]:
# Create maps of the car and bus mode shares per destination zone in regions C and R
fig, axes = plt.subplots(1, 2, figsize=(20, 10), sharex=True, sharey=True)
fig.set_tight_layout(True)

# Select the zones in regions C and R
zones_CR = mode_shares_per_zone_gdf[mode_shares_per_zone_gdf.Region.isin(['C', 'R'])]

# Use one colour scale for both maps so that the shares are directly comparable
vmin = zones_CR[['car', 'bus']].min().min()
vmax = zones_CR[['car', 'bus']].max().max()

zones_CR.plot(ax=axes[0], column = 'car', legend=False, cmap='viridis', vmin=vmin, vmax=vmax)
zones_CR.plot(ax=axes[1], column = 'bus', legend=True, cmap='viridis', vmin=vmin, vmax=vmax)

axes[0].set_title("Car share for zones in regions C and R in Leeds")
axes[0].axis('off')
axes[1].set_title("Bus share for zones in regions C and R in Leeds")
axes[1].axis('off')

plt.show()

In [None]:
# Draw the map based on region
fig, ax = plt.subplots(figsize=(10,10))

leeds_zones_gdf[(leeds_zones_gdf.Region == 'C') | (leeds_zones_gdf.Region == 'R')].plot(ax=ax, column = 'Region', legend = True, cmap='viridis')
ax.set_axis_off()
ax.set_title("Leeds zones in regions C and R")
plt.show()

In general, the car has the dominant mode share throughout these regions. Zones in the north-eastern part of regions C and R are visited by car relatively more often than the other zones, and the bus share is correspondingly low in this area. Only a few zones have a high bus share, whereas most zones show a fairly low bus share. In addition, the zones in the centre region do not appear to differ much from the zones in the ring region, although a difference might have been expected given common policies for access to (old) city centres.

### 3. Model training: Train a MultiLayerPerceptron (MLP) neural network to predict the choices [3 pnt]

#### 3.1. Use the *numerical travel features* (see notes above) and the following two categorical features: purpose and destination regions. Remember: (1) to scale all variables appropriately before training your MLP, and (2) to encode categorical variables.  (hint: use the pandas **get_dummies** method to encode the categorical variables)

In [None]:
zone_region_df = leeds_zones_gdf[['LSOA11CD', 'Region']]
zone_region_df = zone_region_df.rename(columns={"LSOA11CD": "Destination_lsoa_code"})

rp_df = rp_df.merge(zone_region_df, on='Destination_lsoa_code')

In [None]:
rp_df = pd.get_dummies(rp_df, columns=['purpose', 'Region'])
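
To make the encoding step concrete, here is a minimal sketch on a made-up three-row table (the `toy_trips` frame and its values are assumptions for illustration). It shows how `pd.get_dummies` replaces each categorical column by one indicator column per category, named `<column>_<category>`, which is the naming pattern the feature list relies on.

```python
import pandas as pd

# Hypothetical trips, for illustration only
toy_trips = pd.DataFrame({
    "purpose": ["Grocery", "Home", "Grocery"],
    "Region": ["C", "R", "C"],
    "car_distance_km": [2.5, 7.0, 1.2],
})

# Encode the two categorical columns; numerical columns are left untouched
toy_encoded = pd.get_dummies(toy_trips, columns=["purpose", "Region"])

print(toy_encoded.columns.tolist())
```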

In [None]:
# Create the list of features that we want to use in the model
features = ['avail_car', 'avail_taxi', 'avail_bus', 'avail_rail', 'avail_cycling', 'avail_walking', 'total_car_cost', 'taxi_cost', 'bus_cost_total_per_leg',
            'rail_cost_total_per_leg', 'car_distance_km', 'bus_distance_km', 'rail_distance_km', 'taxi_distance_km', 'cycling_distance_km', 'walking_distance_km',
            'car_travel_time_min', 'bus_travel_time_min', 'rail_travel_time_min', 'taxi_travel_time_min', 'cycling_travel_time_min', 'walking_travel_time_min',
            'bus_IVT_time_min', 'bus_access_egress_time_min', 'rail_IVT_time_min', 'rail_access_egress_time_min', 'bus_transfers', 'rail_transfers',
            'purpose_Cinema or other night out', 'purpose_Clothes shopping', 'purpose_College/University', 'purpose_Dropoff Daycare', 'purpose_Dropoff K12',
            'purpose_Dropoff Other', 'purpose_Dropoff Scheduled Activity', 'purpose_Dropoff Work', 'purpose_Errand Other', 'purpose_Errands with Appointment',
            'purpose_Errands without Appointment', 'purpose_Exercise', 'purpose_Family Activity', 'purpose_Gas', 'purpose_Grocery', 'purpose_Home', 'purpose_K-12 School',
            'purpose_Leisure Other', 'purpose_Medical', 'purpose_Museum/cultural', 'purpose_OtherPurpose', 'purpose_Primary Workplace', 'purpose_Restaurant',
            'purpose_Shopping - Major', 'purpose_Social', 'purpose_Sports activity', 'purpose_Vacation/Travel', 'purpose_Vocational education', 'purpose_Work Other',
            'purpose_Work Related', 'purpose_Work Travel', 'purpose_Work Volunteer', 'Region_C', 'Region_NE', 'Region_NW', 'Region_R', 'Region_SE', 'Region_SW']

X = rp_df.loc[:,features]

# Initiate scaler object & fit to data
scaler = StandardScaler()
scaler.fit(X)

# Create new dataframe X_scaled containing the scaled features
X_scaled = scaler.transform(X)

# Create the target
Y = rp_df['choice']
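
The cell above fits the scaler on all rows. A possible alternative, shown here only as a sketch on synthetic data and not as the approach used in this notebook, is to wrap the scaler and the MLP in a scikit-learn `Pipeline`, so that the scaler statistics are estimated from the training rows only. All names (`X_toy`, `y_toy`, `pipe`) and settings are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 10).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# The pipeline fits the scaler on the training rows and reuses it at prediction time
pipe = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(5,), max_iter=500, random_state=0),
)
pipe.fit(X_tr, y_tr)

# The fitted scaler is available as a named step of the pipeline
fitted_scaler = pipe.named_steps["standardscaler"]
print(fitted_scaler.mean_)
```

With this design, scenario data such as the car-ban dataframe can be passed unscaled to `pipe.predict_proba`, because the pipeline applies the stored scaler itself.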

#### 3.2 Tune the hyperparameters of your MLP. That is, do a gridsearch over the following hyperparameter space:
        - Architecture: {1 HL w/30 nodes, 2 HL w/ 5 nodes}
        - Alpha parameter: {0.1, 0.001}
        - Learning rate: {0.01, 0.001}

In [None]:
# Create MLP object (plain vanilla MLP)
mlp_gs = MLPClassifier(activation = 'tanh', solver='adam', batch_size=250, max_iter=2000)

# Define the hyperparameter search space
hyperparameter_space = {
    'hidden_layer_sizes': [(30,),(5,5)],
    'alpha': [0.1, 0.001],
    'learning_rate_init': [0.01,0.001]}

# Use scikit-learn's built-in negative log-loss scorer (higher is better, so the search minimises the cross-entropy)
logloss = 'neg_log_loss'

# Create the grid search object, using the MLP classifier
folds = 5 # Number of cross validation splits
mlp_gridsearch = GridSearchCV(mlp_gs, hyperparameter_space, n_jobs=-1, cv=folds, scoring = logloss)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, Y, random_state = 12345, test_size = 0.2)

In [None]:
# Execute the training/gridsearch
mlp_gridsearch.fit(X_train, Y_train)
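
After fitting, the search object holds the score of every hyperparameter combination. The sketch below, run on synthetic data with a reduced search space (all names and settings are assumptions for illustration), shows how `cv_results_` and `best_params_` could be inspected.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic data, for illustration only
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(150, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

# Reduced search space to keep the example fast
toy_space = {"hidden_layer_sizes": [(30,), (5, 5)], "alpha": [0.1, 0.001]}
toy_search = GridSearchCV(
    MLPClassifier(max_iter=300, random_state=1),
    toy_space,
    cv=3,
    scoring="neg_log_loss",
)
toy_search.fit(X_toy, y_toy)

# One row per hyperparameter combination, best combination first
results = (
    pd.DataFrame(toy_search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)
print("Best:", toy_search.best_params_)
```

The same two attributes exist on `mlp_gridsearch`, so the table can be produced for the full search space as well.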

In [None]:
# Save your model (the with-statement closes the file after writing)
filename = 'my_tuned_model.sav'
with open(data_folder/filename, 'wb') as f:
    pickle.dump(mlp_gridsearch, f)
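
A saved model can be loaded again with `pickle.load`, which avoids repeating the grid search. The following is a minimal round-trip sketch; the dictionary `toy_model` stands in for the fitted search object and the temporary directory stands in for the data folder, both assumptions for illustration.

```python
import pickle
import tempfile
from pathlib import Path

# Stand-in for the fitted grid search object, for illustration only
toy_model = {"hidden_layer_sizes": (5, 5), "alpha": 0.001}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "my_tuned_model.sav"

    # Save the object
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Load it back
    with open(model_path, "rb") as f:
        reloaded = pickle.load(f)

print(reloaded)
```

Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.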

#### 3.3 Fit an MLP model using the optimal hyperparameters found, and report and interpret the following output metrics:
        - Accuracy
        - Cross-entropy
        - Confusion matrix

In [None]:
# Create a new mlp object using the optimised hyperparameters, just using the train/test split
layers = mlp_gridsearch.best_params_['hidden_layer_sizes']
lr = mlp_gridsearch.best_params_['learning_rate_init']
alpha = mlp_gridsearch.best_params_['alpha']
mlp_gs = MLPClassifier(hidden_layer_sizes = layers, solver='adam', learning_rate_init = lr, alpha=alpha, batch_size=250, activation = 'tanh', max_iter = 2000)

# Train the model
mlp_gs.fit(X_train,Y_train)

In [None]:
# Let's create a function that returns the accuracy and the cross entropy, for the train and test data sets
def calculate_acc_ce(mlp,X_train,Y_train,X_test, Y_test):

    def calculate_acc(mlp,X,Y):
        accuracy = mlp.score(X,Y)
        return accuracy

    def calculate_ce(mlp,X,Y):
        # Compute cross entropy
        # Use the model object to predict probabilities per class
        prob = mlp.predict_proba(X)

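        # Build a 0/1 array with one column per class, ordered like the columns of predict_proba (mlp.classes_)
        Y_dummy = (np.asarray(Y)[:, None] == mlp.classes_).astype(float)

        # Multiply the probabilities with Y_dummy, and sum along the row axis to obtain the predicted probability of the chosen class
        prob_chosen = np.sum(prob*Y_dummy,axis=1)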

        # Take the logarithm
        log_prob_chosen = np.log(prob_chosen)

        # Compute the cross entropy
        cross_entropy = -np.sum(log_prob_chosen)/len(Y)
        return cross_entropy

    # Compute the accuracy
    acc_train = calculate_acc(mlp,X_train,Y_train)
    acc_test  = calculate_acc(mlp,X_test,Y_test)

    # Apply cross entropy function
    ce_train = calculate_ce(mlp,X_train,Y_train)
    ce_test = calculate_ce(mlp,X_test,Y_test)
    return acc_train, acc_test, ce_train, ce_test
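
The hand-written cross-entropy can be cross-checked against scikit-learn's `log_loss`. The sketch below uses made-up probabilities and labels (assumptions for illustration) and follows the same steps: select the probability of the chosen class, take the logarithm, and average with a minus sign.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical class labels, observed choices and predicted probabilities
classes = np.array([1, 2, 3])
y_true = np.array([1, 3, 2, 1])
prob = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
    [0.4, 0.4, 0.2],
])

# 0/1 array marking the chosen class in each row
one_hot = (y_true[:, None] == classes)

# Probability assigned to the chosen class, then the mean negative log
prob_chosen = np.sum(prob * one_hot, axis=1)
manual_ce = -np.mean(np.log(prob_chosen))

# Reference value from scikit-learn
sklearn_ce = log_loss(y_true, prob, labels=classes)

print(manual_ce, sklearn_ce)
```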

In [None]:
# Let's also evaluate performance of the hypertuned model using our evaluation function
accuracy_train_gs, accuracy_test_gs, cross_entropy_train_gs, cross_entropy_test_gs = calculate_acc_ce(mlp_gs,X_train,Y_train,X_test, Y_test)

# Report results
print('\t\t Train set\t Test set')
print(f'Accuracy\t {accuracy_train_gs:0.3f}\t\t {accuracy_test_gs:0.3f}')
print(f'Cross entropy\t {cross_entropy_train_gs:0.3f}\t\t {cross_entropy_test_gs:0.3f}')

In [None]:
Y_pred_gs = mlp_gs.predict(X_test)

# Show the confusion matrix
fig, axes = plt.subplots(1, 2, figsize = (20,10))
fig.set_tight_layout(True)

ylabels = ['Car', 'Bus', 'Rail', 'Taxi', 'Cycling', 'Walking']
cm1 = ConfusionMatrixDisplay.from_predictions(ax=axes[0], y_true=Y_test,y_pred=Y_pred_gs, display_labels = ylabels, normalize=None)
cm2 = ConfusionMatrixDisplay.from_predictions(ax=axes[1], y_true=Y_test,y_pred=Y_pred_gs, display_labels = ylabels, normalize='true')

# Add titles
axes[0].set_title(f'MLP with {mlp_gs.hidden_layer_sizes} nodes (counts)')
axes[1].set_title(f'MLP with {mlp_gs.hidden_layer_sizes} nodes (normalised per true class)')

With an accuracy of 0.854 on the test set, the model performs well overall. The confusion matrix supports this: most observations lie on the diagonal, i.e. they are predicted correctly. The largest errors occur where the model predicts bus while the true mode was taxi or cycling. I assume this happens because travellers often switch between these three modes when facing similar trips, so the travel characteristics of these trips look alike.

### 4. Assess the impact of a car ban policy on mode shares [3 pnt]

#### 4.1. Benchmark scenario: create a new dataframe containing only trips with a destination in region C. Predict the mode shares for these trips, using your trained model. Use the *predict_proba* function from scikit-learn (why should you NOT use the *predict* function in this case?)

In [None]:
# Select the trips with a destination in region C (copy to avoid modifying X later on)
X_regionC = X[X['Region_C']==1].copy()
X_regionC

In [None]:
X_scaled_regionC = scaler.transform(X_regionC)

In [None]:
Y_pred_gs_regionC = mlp_gs.predict_proba(X_scaled_regionC)
Y_pred_gs_regionC

You should use *predict_proba* rather than *predict* here. *predict* returns only the single most likely mode per trip, so aggregating its output assigns every trip entirely to its most probable mode and overstates the dominant modes. *predict_proba* returns the probability of each mode per trip and therefore keeps the (un)certainty of each prediction; averaging these probabilities over all trips gives the predicted mode shares.
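
The difference can be illustrated with a small numerical sketch. The probabilities below are made up (an assumption for illustration, with three modes instead of six): every trip has car as its most likely mode, yet the car is far from certain in any of them.

```python
import numpy as np

# Hypothetical predicted probabilities per trip (columns: car, bus, walk)
modes = np.array(["car", "bus", "walk"])
prob = np.array([
    [0.60, 0.30, 0.10],
    [0.50, 0.40, 0.10],
    [0.55, 0.25, 0.20],
    [0.40, 0.35, 0.25],
])

# Mode shares as the average predicted probability per mode
shares_from_proba = prob.mean(axis=0)

# Mode shares when each trip is assigned to its most likely mode only
hard_choice = prob.argmax(axis=1)
shares_from_predict = np.bincount(hard_choice, minlength=len(modes)) / len(prob)

print(dict(zip(modes, shares_from_proba)))
print(dict(zip(modes, shares_from_predict)))
```

In this toy case the hard assignment gives the car a share of 100%, while the averaged probabilities give it roughly half of the trips.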

#### 4.2. Car-ban scenario: in the dataset created in 4.1 set *avail_car* to zero. Use your trained model to predict the modes. (**Remember to scale the data with the scaler created for training the model**).


In [None]:
# Work on a copy so that the benchmark data in X_regionC stay unchanged
X_regionC_availcar0 = X_regionC.copy()
X_regionC_availcar0['avail_car'] = 0
X_regionC_availcar0

In [None]:
X_scaled_regionC_availcar0 = scaler.transform(X_regionC_availcar0)

In [None]:
Y_pred_gs_regionC_availcar0 = mlp_gs.predict_proba(X_scaled_regionC_availcar0)
Y_pred_gs_regionC_availcar0
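
Setting *avail_car* to zero does not by construction force the predicted car probability to zero, because the network only learned the role of availability from the data. A possible sanity check, sketched here on made-up probabilities, is to look at the residual car share; one optional post-processing step (an assumption, not part of the assignment) is to zero the car column and renormalise the rows. The column index `car_col` is also an assumption.

```python
import numpy as np

# Hypothetical predict_proba output under the car-ban scenario (columns: car, bus, rail)
prob_ban = np.array([
    [0.10, 0.60, 0.30],
    [0.05, 0.70, 0.25],
    [0.00, 0.50, 0.50],
])
car_col = 0  # assumed column index of the car class

# Average probability still assigned to the car
residual_car_share = prob_ban[:, car_col].mean()

# Optional post-processing: remove the car and rescale the other modes to sum to one
prob_adj = prob_ban.copy()
prob_adj[:, car_col] = 0.0
prob_adj /= prob_adj.sum(axis=1, keepdims=True)

print(residual_car_share)
print(prob_adj)
```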

#### 4.3. Compare your results. That is, analyse how mode shares have changed as a result of the car-ban policy. Create a visualisation representing the shift in mode shares. By which mode have car trips most often been substituted?

In [None]:
average_mode_Y_pred_gs_regionC = np.mean(Y_pred_gs_regionC, axis=0)
average_mode_Y_pred_gs_regionC_availcar0 = np.mean(Y_pred_gs_regionC_availcar0, axis=0)

In [None]:
average_mode_change = abs(average_mode_Y_pred_gs_regionC_availcar0 - average_mode_Y_pred_gs_regionC)
average_mode_change_without_car = np.delete(average_mode_change, 0)
average_mode_change_without_car

In [None]:
labels = ['Bus', 'Rail', 'Taxi', 'Cycling', 'Walking']

plt.pie(average_mode_change_without_car, labels=labels, autopct='%1.1f%%')
plt.title("Distribution of the shift in mode shares over the non-car modes")
plt.show()

The pie chart shows that the bus takes up the largest part of the shift in mode shares: under the car ban, car trips are most often substituted by bus trips.
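
The substitution can also be read from the signed change per mode, which keeps the direction of each shift visible. The sketch below uses made-up mode shares (assumptions for illustration, not the results of this notebook) and identifies the non-car mode with the largest gain.

```python
import numpy as np
import pandas as pd

mode_labels = ["Car", "Bus", "Rail", "Taxi", "Cycling", "Walking"]

# Hypothetical mode shares, for illustration only
share_benchmark = np.array([0.50, 0.20, 0.10, 0.05, 0.05, 0.10])
share_car_ban = np.array([0.05, 0.40, 0.20, 0.10, 0.08, 0.17])

# Signed change per mode: negative for modes that lose share, positive for modes that gain
shift = pd.Series(share_car_ban - share_benchmark, index=mode_labels)

# Non-car mode with the largest gain
largest_gain = shift.drop("Car").idxmax()

print(shift)
print("Largest gain:", largest_gain)
```

Because both scenarios sum to one, the signed changes sum to zero, which is a quick consistency check on the computation.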

#### 4.4. Reflect on your analysis: <br> 
`A` Do you think your analysis and results are meaningful? Why/why not? <br> 
`B` What are the main limitations of your analysis?

Yes, I think the analysis and results are meaningful: enough balanced, observed (revealed-preference) data were available for the model to learn the "normal" situation, so it can give an indication of the consequences of a policy change.

The biggest limitation is that a full car ban in region C would probably lead people to reconsider their transport options altogether. Such behaviour is not present in the dataset, so the neural network model cannot capture it.