# Probabilistic classification in practice

In this practical, we use a dataset of historic trips from the Netherlands to investigate passenger mode choice behaviour. The full data set is from the paper *A comparative study of machine learning classifiers for modeling travel mode choice* (Hagenauer and Helbich, 2017) and is available under CC BY 3.0 [here](https://www.sciencedirect.com/science/article/pii/S0957417417300738).

Human choices exhibit significant variability. Even when faced with identical circumstances, two individuals, or even the same person, may make entirely distinct decisions. Hence, human choices serve as a prime illustration for employing probabilistic classification techniques.

Now, we can proceed by importing the necessary libraries and loading the data. In this case, we will utilize logistic regression to model the data.

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss

## Importing the data

Next we import and inspect the data.

In [None]:
df = pd.read_csv('data/travel_mode.csv')

df.head(10)

Everything looks relatively straightforward with our dataset. Let's investigate the distribution of the target variable, `mode_main`.

### Class imbalance

Let's check the class imbalance by generating the value counts for the `mode_main` column.

_Hint, you can try normalising the `value_counts` method from Pandas._

In [None]:
# Your code here...
df.mode_main.value_counts(normalize=True)


It is apparent that our dataset consists of four classes, characterized by a significant data imbalance. Evidently, our sample population exhibits a strong aversion towards utilizing public transportation.

## Data preprocessing

Let's encode our target to an ordered number - 0 for car, 1 for walk, 2 for bike, 3 for public transport. 

_Hint: you can use `.map` to replace all values in a column at the same time_

In [None]:
# Your code here...
mode_map = {'car': 0, 'walk': 1, 'bike': 2, 'pt': 3}
df.mode_main = df.mode_main.map(mode_map)


We can also see there are a few more non-numerical categorical columns (all of the numerical columns hold continuous values). We can investigate these further.

Print the normalised value counts for the columns `'male', 'ethnicity', 'education', 'license', 'weekend'`

In [None]:
# Your code here...
for col in ['male', 'ethnicity', 'education', 'license', 'weekend']:
    print(f'{col}:\n{df[col].value_counts(normalize=True)}\n\n')


After examining the data, it becomes evident that each categorical column consists of three or fewer classes, all of which are adequately represented. Therefore, it is highly likely that all classes will be present in any training fold we choose. Consequently, we can safely encode the categorical data in advance without concerns about data leakage.

It is important to remember that there are multiple approaches for encoding categorical data, depending on whether it is nominal or ordinal. For each categorical column, it is necessary to determine the appropriate method to be used and proceed with encoding the respective columns accordingly.
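As a minimal sketch of the two approaches, assuming a small made-up DataFrame (the `colour` column and all values below are illustrative and not taken from the travel dataset), an ordinal column can be mapped to ordered integers while a nominal column is one-hot encoded with one category dropped:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy_enc = pd.DataFrame({
    'education': ['lower', 'higher', 'middle', 'lower'],   # ordinal
    'colour': ['red', 'green', 'blue', 'green'],            # nominal
})

# Ordinal: the categories have a natural order, so map them to ordered integers
education_order = {'lower': 0, 'middle': 1, 'higher': 2}
toy_enc['education'] = toy_enc['education'].map(education_order)

# Nominal: no natural order, so create one indicator column per category
# and drop the first one to keep n-1 columns
toy_encoded = pd.get_dummies(toy_enc, columns=['colour'], drop_first=True)

print(toy_encoded)
```

Mapping a nominal column to integers would impose an order that does not exist in the data, which is why the two cases are treated differently.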


Encode male, ethnicity, education, income, license and weekend.

Remember to use `.map` if you are doing ordered encoding, and the `OneHotEncoder` from scikit-learn for one-hot encoding.

If you run out of time, don't worry - we've provided you with a processed version of the data later in the notebook.

In [None]:
# Your code here...
bool_map = {'no': 0, 'yes': 1}
education_map = {'lower': 0, 'middle': 1, 'higher': 2}
income_map = {'less20': 0, '20to40': 1, 'more40': 2}

df.male = df.male.map(bool_map)
df.license = df.license.map(bool_map)
df.weekend = df.weekend.map(bool_map)
df.education = df.education.map(education_map)
df.income = df.income.map(income_map)

df = pd.get_dummies(df)

# drop the native column so we keep n-1 categories
df.drop('ethnicity_native', axis=1, inplace=True)


In [None]:
df.head(10)

Now that all of the data is numerically encoded, we are ready to start training models. For the time being, however, we will exclude the `household_id` column from the dataset. We will explain why, and reintroduce the column, later in our process.

In [None]:
df.drop('household_id', axis=1, inplace=True)

## Train-test split

The following code reloads and encodes the data as required for the modelling exercises, so it is ready to use for model training.

In [None]:
df = pd.read_csv('data/travel_mode.csv')
bool_map = {'no': 0, 'yes': 1}
education_map = {'lower': 0, 'middle': 1, 'higher': 2}
income_map = {'less20': 0, '20to40': 1, 'more40': 2}
mode_map = {'car': 0, 'walk': 1, 'bike': 2, 'pt': 3}

df.mode_main = df.mode_main.map(mode_map)
df.male = df.male.map(bool_map)
df.license = df.license.map(bool_map)
df.weekend = df.weekend.map(bool_map)
df.education = df.education.map(education_map)
df.income = df.income.map(income_map)

df = pd.get_dummies(df)

# drop the native column so we keep n-1 categories
df.drop('ethnicity_native', axis=1, inplace=True)

#drop household_id
df.drop('household_id', axis=1, inplace=True)

We can first separate the data into our input `X`, and our target `y` - the mode people travelled by.

In [None]:
y = df.mode_main
X = df.drop('mode_main', axis=1)

And then split it into test and train data.

In [None]:
X_train_unscaled, X_test_unscaled, y_train, y_test = train_test_split(X, y)

Finally we scale our data. Use the `StandardScaler` to scale the training and testing data, calling it `X_train` and `X_test`. Remember to fit it on the training data and then use the fitted scaler to transform the test data.

In [None]:
# Your code here...
scl = StandardScaler()

X_train = scl.fit_transform(X_train_unscaled)
X_test = scl.transform(X_test_unscaled)


# A multinomial probabilistic classifier

We now have all the necessary components to initiate the modeling process. Let's proceed by loading a logistic regression model with default hyperparameters and fitting it to the dataset.

In [None]:
# Your code here...
clf = LogisticRegression()
clf.fit(X_train, y_train)


## Multinomial logistic regression - parameters

We can have a look at the model's coefficients and intercepts to see how the multinomial model works:

In [None]:
print(f'Model coefficients:\n{clf.coef_}\n\nShape of coefficient matrix:{clf.coef_.shape}\n\n')
print(f'Model intercepts: {clf.intercept_}')

We can see we have a different set of model coefficients and a corresponding intercept for each class. 

We have four classes, so we have four sets of coefficients/intercepts. 

Each row of the coefficient matrix corresponds to one class, in the order of our encoding: the first row holds the coefficients for driving, the second for walking, the third for cycling, and the last for public transport.

We have 17 input features, so each set of input coefficients contains 17 different values. 
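To make the role of these parameters concrete, here is a minimal sketch on synthetic data (the `*_demo` names and the random data are assumptions for illustration, not part of the practical). It assumes a recent scikit-learn version, where the default multiclass fit is multinomial, so the class probabilities should be a softmax over one linear score per class:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data, for illustration only: 300 samples, 5 features, 4 classes
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 5))
y_demo = rng.integers(0, 4, size=300)

demo_clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# One row of coefficients per class, one column per feature
print(demo_clf.coef_.shape)       # (4, 5)
print(demo_clf.intercept_.shape)  # (4,)

# One linear score per class, then a softmax across the classes
scores = X_demo @ demo_clf.coef_.T + demo_clf.intercept_
scores -= scores.max(axis=1, keepdims=True)   # for numerical stability
manual_probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Compare with the probabilities reported by the fitted model
print(np.allclose(manual_probs, demo_clf.predict_proba(X_demo)))
```

The class with the largest score is the one `predict` returns, while the softmax output is what `predict_proba` exposes.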

## Assessing our model

We have fitted a classifier, so let's evaluate it using accuracy:

In [None]:
# calculate the accuracy score for the logistic regression model
y_pred = clf.predict(X_test)
print(f'Accuracy: {accuracy_score(y_test, y_pred):.3f}')

The obtained result appears promising, as our model is able to accurately predict over 76% of the trips made by individuals. Now, let's examine the distribution of predicted trips for each mode.

In [None]:
inv_mode_map =  {v: k for k, v in mode_map.items()}
pd.Series(y_pred).map(inv_mode_map).value_counts()

We can compare these predictions with the actual outcomes recorded in the dataset.

In [None]:
pd.Series(y_test).map(inv_mode_map).value_counts()

The analysis of the predicted trip outcomes reveals a significant over-representation of car trips and a noticeable under-representation of the other modes. In particular, our model predicts less than one-fourth of the actual number of public transport trips. If these flawed predictions were provided to a client, they could lead to detrimental decision-making, such as under-investment in public transport based on demand figures that are more than four times too low.

To gain further insights into the issue, let's examine the confusion matrix and investigate what is happening.

In [None]:
# compute the confusion matrix once and plot it as a heatmap
cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(12,8))
sns.heatmap(cm, annot=True, fmt="d")
plt.show()

Upon closer inspection, it becomes evident that our model exhibits poor recall for the least common modes of transportation. While we achieve a relatively high overall accuracy due to confident predictions of the most common mode, we take significant risks when attempting to predict the least common mode, resulting in inaccurate aggregate predictions.
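A toy illustration of why always picking the most likely mode distorts the aggregate picture, assuming made-up probabilities in which every trip is 60% likely to be by car and 40% likely to be by public transport:

```python
import numpy as np

# Hypothetical probabilities for 1,000 trips over two modes (car, pt)
probs = np.tile([0.6, 0.4], (1000, 1))

# Discrete prediction: always pick the most likely mode
hard_pred = probs.argmax(axis=1)
hard_counts = np.bincount(hard_pred, minlength=2)

# Expected number of trips per mode implied by the probabilities
expected_counts = probs.sum(axis=0)

print(hard_counts)       # every trip is assigned to car
print(expected_counts)   # roughly 600 car trips and 400 pt trips
```

Each individual prediction is the most sensible single guess, yet the totals suggest that nobody uses public transport at all.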

### Implications

The issue highlighted in this problem has far-reaching consequences and can lead to serious practical problems. It has been observed at various levels, including high-profile instances such as the one mentioned [here](https://www.theguardian.com/inequality/2017/aug/08/rise-of-the-racist-robots-how-ai-is-learning-all-our-worst-impulses).

>"In May last year, a stunning report claimed that a computer program used by a US court for risk assessment was biased against black prisoners. The program, Correctional Offender Management Profiling for Alternative Sanctions (Compas), was much more prone to mistakenly label black defendants as likely to reoffend – wrongly flagging them at almost twice the rate as white people (45% to 24%), according to the investigative journalism organisation ProPublica."

### Proposed Solutions

While metrics derived from the confusion matrix, such as precision and recall, provide insights into underpredictions for certain modes, they do not offer a means to rectify the issue in discrete predictions.
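For reference, a short sketch of how per-class recall and precision can be read off a confusion matrix, using hypothetical labels (the `*_demo` arrays are made up for illustration):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels, for illustration only (0 = common class, 1 = rare class)
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_demo = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes
recall = np.diag(cm_demo) / cm_demo.sum(axis=1)      # correct / actual
precision = np.diag(cm_demo) / cm_demo.sum(axis=0)   # correct / predicted

print(cm_demo)
print('Recall per class:   ', recall)
print('Precision per class:', precision)
```

Here the rare class has a recall of only 0.25 even though 7 of the 10 predictions are correct, which is the same pattern we see for the less common modes.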

Two potential solutions can be considered. The first option involves artificially oversampling or undersampling (or a combination of both) the data until all classes are equally represented. For instance, we could randomly sample public transport trips with replacement, repeating them multiple times until their count matches that of car trips. The same process can be applied to cycling and walking trips. However, this approach distorts the data and amplifies any noise present in those particular samples, thereby decreasing the signal-to-noise ratio. Conversely, if we were to undersample the most common class, we would remove valid information (our signal), once again decreasing the signal-to-noise ratio.
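A minimal sketch of random oversampling with pandas, assuming a small hypothetical DataFrame (the `toy_trips` data and its column names are made up for illustration):

```python
import pandas as pd

# Hypothetical imbalanced data, for illustration only
toy_trips = pd.DataFrame({
    'distance': [1.0, 2.5, 3.0, 4.2, 5.1, 0.8, 7.3, 9.9],
    'mode': ['car', 'car', 'car', 'car', 'car', 'car', 'pt', 'pt'],
})

# Size of the largest class
target = toy_trips['mode'].value_counts().max()

# Resample each smaller class with replacement until it matches the largest
parts = []
for mode, group in toy_trips.groupby('mode'):
    if len(group) < target:
        group = group.sample(n=target, replace=True, random_state=0)
    parts.append(group)
balanced = pd.concat(parts, ignore_index=True)

print(balanced['mode'].value_counts())
# The two original pt rows are simply repeated
print(balanced.loc[balanced['mode'] == 'pt', 'distance'].value_counts())
```

The classes are now balanced, but no new information has been added: the public transport rows are copies of the same two observations, so any noise in them is copied too.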

### Embracing Probabilistic Classification

Instead of resorting to data manipulation techniques, let's explore the potential of using our model as a probabilistic classifier. By simply utilizing the `predict_proba` function instead of `predict`, we can leverage the probabilities generated by the model.

In [None]:
y_probs = clf.predict_proba(X_test)

Now let's have a look at the output:

In [None]:
y_probs[:,0]
plt.figure(figsize=(15,2))
sns.heatmap(np.transpose(y_probs))
plt.show()

You can see we have a probability distribution for each trip, which gives us the probability of each mode being selected. We can measure the fit using the log loss, which is the negative mean log-likelihood of the observed choices under the model.

In [None]:
log_loss(y_test, y_probs)

The log loss metric penalizes models that exhibit excessive certainty in incorrect predictions. In the case of a deterministic model, which can only provide probabilities of either 1 or 0, and given that its accuracy score is not 1, we can conclude that it would yield an *infinite* log loss score. 

This highlights why log loss is not typically used to evaluate deterministic models. It appears unfair to compare the two models using this metric since it was specifically designed for probabilistic models. The choice between the two models should ultimately depend on the problem being addressed.
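The sketch below illustrates the point with a hand-written log loss on hypothetical probabilities (the `mean_log_loss` helper and the arrays are assumptions for illustration, not part of scikit-learn):

```python
import numpy as np

def mean_log_loss(y_true, probs):
    """Negative mean log-probability assigned to the true class."""
    true_class_probs = probs[np.arange(len(y_true)), y_true]
    with np.errstate(divide='ignore'):   # allow log(0) = -inf without a warning
        return -np.mean(np.log(true_class_probs))

# Hypothetical example: three trips, two classes, true classes 0, 0 and 1
y_true_demo = np.array([0, 0, 1])

# A probabilistic model that hedges its bets
soft = np.array([[0.8, 0.2], [0.6, 0.4], [0.7, 0.3]])

# A deterministic model with the same hard predictions (the third is wrong)
hard = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])

print(mean_log_loss(y_true_demo, soft))   # finite
print(mean_log_loss(y_true_demo, hard))   # inf
```

In practice scikit-learn's `log_loss` clips probabilities slightly away from 0 and 1, so it typically reports a very large finite value instead of infinity.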

### Utilizing the Probabilistic Model for Simulations

By employing probability distributions, we can move beyond discrete predictions and simulate our predictions. Instead of relying on a single outcome, the following code allows us to draw values from the provided probability distribution:

In [None]:
#create data frame
cumprobs = pd.DataFrame()

#add cumulative  probabilities
cumprobs[0] = y_probs[:,0]
cumprobs[1] = y_probs[:,0] + y_probs[:,1]
cumprobs[2] = y_probs[:,0] + y_probs[:,1] + y_probs[:,2]

#generate random numbers 
a = np.random.rand(len(y_probs))

#predict transport for each individual 
y_sim = (np.zeros(len(y_probs))
         + (a>cumprobs[0])
         + (a>cumprobs[1])
         + (a>cumprobs[2])).astype(int)

y_sim

Now let's compare the simulated and actual mode shares.

In [None]:
pd.Series(y_sim).map(inv_mode_map).value_counts()

In [None]:
pd.Series(y_test).map(inv_mode_map).value_counts()

As you can see, there is a much closer match between the two. We can actually work out the exact expected number of trips by each mode, and hence the predicted mode shares, simply by summing the probabilities over all trips.

In [None]:
y_shares = y_probs.sum(axis=0)
{k: y_shares[v] for k, v in mode_map.items()}

When we run simulations like this, we don't generally expect any single simulation to predict each individual trip as accurately as a classifier would, but it's still useful to visualise the confusion matrix.

In [None]:
plt.figure(figsize=(12,8))
sns.heatmap(confusion_matrix(y_test,y_sim), annot=True, fmt="d")
plt.tight_layout()

Whilst this isn't as good as the confusion matrix we generated for the deterministic classifier (i.e. the values on the diagonal aren't as large), simulation lets us generate lots of realistic "data" that we can then use for more complex calculations. Because we can repeat these simulations again and again and then look at the statistics across them, we can build detailed models of how things might behave if the world were different.
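As a rough sketch of what repeating the simulation looks like, the code below draws 1,000 simulations from hypothetical class probabilities (generated at random here, so `demo_probs` is an assumption rather than the output of our model) and summarises the number of trips per mode across them:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical class probabilities for 500 trips over 4 modes
demo_probs = rng.dirichlet(alpha=[4, 1, 2, 0.5], size=500)

def simulate_once(probs, rng):
    """Draw one mode per trip by inverting the cumulative distribution."""
    cumulative = probs.cumsum(axis=1)
    u = rng.random((len(probs), 1))
    # Count how many cumulative thresholds each random number exceeds;
    # the last threshold is skipped because it equals 1 up to rounding error
    return (u > cumulative[:, :-1]).sum(axis=1)

# Repeat the simulation many times and record the trips per mode each time
n_sims = 1000
counts = np.array([
    np.bincount(simulate_once(demo_probs, rng), minlength=4)
    for _ in range(n_sims)
])

print('Expected trips per mode:', demo_probs.sum(axis=0).round(1))
print('Mean simulated trips:   ', counts.mean(axis=0).round(1))
print('Std across simulations: ', counts.std(axis=0).round(1))
```

The mean across simulations should sit close to the summed probabilities, while the standard deviation gives a sense of how much any single simulation can vary.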

To take some real-world examples, financial derivatives are priced based on simulated stock data, and disease models use simulated people and households.

### Deciding on the Methodology

The choice of methodology depends on various factors, including the deployment of the tool and the potential impact of individual classifications. In situations where the tool's classifications carry high-cost implications, such as classifying medical conditions, it is beneficial to have human expertise involved, such as a doctor, who can make the final decision. In such cases, having a probability distribution as an output can complement the decision-making process. This approach may prompt further tests, data collection, or enable the formulation of a diagnosis and treatment plan.

On the other hand, for low-cost decisions like determining which advertisement to play next on a video streaming service, a deterministic classifier might be more suitable.

However, the probabilistic classifier, being a more sophisticated tool, still offers deployment possibilities. For instance, if the classifier assesses the likelihood of a web user responding to different advertisements, a sequence of ads can be selected based on their descending likelihood. Additionally, tools can be incorporated to update probabilities based on user responses to the ads encountered so far or make use of assumptions about the real world to enhance predictions. For instance, if a user displays an interest in sandwiches, advertising just before lunchtime may be the most opportune moment.
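As a small illustration of ranking by likelihood, assuming made-up response probabilities for a single user (the ad names and values are hypothetical):

```python
import numpy as np

# Hypothetical response probabilities for one user; names are made up
ads = np.array(['sandwiches', 'trainers', 'holidays', 'insurance'])
response_probs = np.array([0.30, 0.05, 0.45, 0.20])

# Order the ads from most to least likely to get a response
order = np.argsort(response_probs)[::-1]
ad_sequence = ads[order]

print(ad_sequence.tolist())   # ['holidays', 'sandwiches', 'insurance', 'trainers']
```

A deterministic classifier would only tell us the single top ad, whereas the probabilities give the whole ordering.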

Probabilistic models also lend themselves to simulations, which can be highly valuable. Although they generally do not surpass well-designed deterministic classifiers in terms of accuracy or confusion matrix, these simulations can deliver remarkable outcomes. However, exploring the full potential of simulations will be covered later in the course.