## Submission instructions

All code that you write should be in this notebook. Please include your names and student numbers. You have to submit this notebook, with your code and answers filled in. Make sure to add enough documentation.

For questions, make use of the "Lab" session (see schedule).
Questions can also be posted to the MS teams channel called "Lab".

**Note:** You are free to make use of Python libraries (e.g., numpy, sklearn, etc.) except any *fairness* libraries.

#### Name and student numbers
Jep Antonisse 3312070 Elias Hendriks 5930413 


## Dataset

In this assignment we are going to use the **COMPAS** dataset.

If you haven't done so already, take a look at this article: https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
For background on the dataset, see https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm.

**Reading in the COMPAS dataset**

The dataset can be downloaded here: https://github.com/propublica/compas-analysis/blob/master/compas-scores-two-years.csv

For this assignment, we focus on the protected attribute *race*.

The label (the variable we want to be able to predict) represents recidivism, which is defined as a new arrest within 2 years.

In [16]:
!wget -c https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv

--2025-05-12 11:38:58--  https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 416 Range Not Satisfiable

    The file is already fully retrieved; nothing to do.



In [27]:
import pandas as pd
compas_data = pd.read_csv('compas-scores-two-years.csv')

We apply several data preprocessing steps, including only retaining Caucasians and African Americans.

In [28]:
compas_data = compas_data[(compas_data.days_b_screening_arrest <= 30)
            & (compas_data.days_b_screening_arrest >= -30)
            & (compas_data.is_recid != -1)
            & (compas_data.c_charge_degree != 'O')
            & (compas_data.score_text != 'N/A')
            & ((compas_data.race == 'Caucasian') | (compas_data.race == 'African-American'))]

Take a look at the data:

In [29]:
print(compas_data.head())

    id              name      first    last compas_screening_date     sex  \
1    3       kevon dixon      kevon   dixon            2013-01-27    Male   
2    4          ed philo         ed   philo            2013-04-14    Male   
6    8     edward riddle     edward  riddle            2014-02-19    Male   
8   10  elizabeth thieme  elizabeth  thieme            2014-03-16  Female   
10  14    benjamin franc   benjamin   franc            2013-11-26    Male   

           dob  age       age_cat              race  ...  v_decile_score  \
1   1982-01-22   34       25 - 45  African-American  ...               1   
2   1991-05-14   24  Less than 25  African-American  ...               3   
6   1974-07-23   41       25 - 45         Caucasian  ...               2   
8   1976-06-03   39       25 - 45         Caucasian  ...               1   
10  1988-06-01   27       25 - 45         Caucasian  ...               4   

    v_score_text  v_screening_date  in_custody  out_custody  priors_count.1  \
1

Now take a look at the distribution of the protected attribute `race` and the distribution of our outcome variable `two_year_recid`.

**Note:** in the context of fair machine learning, the favorable label here is no recidivism, i.e., ```two_year_recid = 0```. So think about what you will code as the positive class in your machine learning experiments, and make sure your interpretation of the results is consistent with this.
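
As a small illustration of this point (a sketch, not part of the assignment template): scikit-learn's binary metrics treat label `1` as the positive class by default, so treating the favorable outcome `two_year_recid = 0` as positive requires passing `pos_label=0`. The toy label arrays below are invented for illustration only.

```python
from sklearn.metrics import precision_score, recall_score

# Toy labels, invented for illustration: 0 = no recidivism (favorable), 1 = recidivism
y_true = [0, 0, 1, 1, 0]
y_pred = [0, 1, 1, 0, 0]

# Default behaviour: label 1 is the positive class
recall_recid = recall_score(y_true, y_pred)

# Favorable label 0 as the positive class
recall_favorable = recall_score(y_true, y_pred, pos_label=0)
precision_favorable = precision_score(y_true, y_pred, pos_label=0)

print(recall_recid, recall_favorable, precision_favorable)
```

The same predictions give different recall values depending on which label counts as positive, so the choice has to be stated alongside any reported metric.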

In [30]:
print('Number of instances per race category:')
print(compas_data[['race', 'two_year_recid']].value_counts())

Number of instances per race category:
race              two_year_recid
African-American  1                 1661
                  0                 1514
Caucasian         0                 1281
                  1                  822
Name: count, dtype: int64


## Data analysis

### **1. Exploration**

First we perform an exploratory analysis of the data.

**Question:** What is the size of the data? (i.e. how many data instances does it contain?)


In [31]:
# Your code
print('Number of data instances:')
print(compas_data.shape[0])
print('Number of features:')
print(compas_data.shape[1])

Number of data instances:
5278
Number of features:
53


**Question:** In the dataset, the protected attribute is `race`, which has two categories: Caucasian and African-American. How many data instances belong to each category?

In [32]:
# Your code
print('Number of instances per race category:')
print(compas_data[['race']].value_counts())

Number of instances per race category:
race            
African-American    3175
Caucasian           2103
Name: count, dtype: int64


**Question:** What are the base rates (the probability of a favorable outcome for the two protected attribute classes)?

In [33]:
# Your code
def base_rate(attribute):
    grouped_by_a = compas_data.groupby(attribute)['two_year_recid']
    total_per_a = grouped_by_a.count()

    no_recid = grouped_by_a.apply(lambda x: (x == 0).sum())
    percentage_no_recid = (no_recid/total_per_a) * 100

    percentages = []
    print("Percentage of individuals with no recidivism (two_year_recid == 0):")
    for a in percentage_no_recid.index:
        print(f"{a}: {percentage_no_recid[a]:.2f}%")
        percentages.append(percentage_no_recid[a])
    return percentages

base_rate('race')
print("----")
base_rate('sex')
print()


Percentage of individuals with no recidivism (two_year_recid == 0):
African-American: 47.69%
Caucasian: 60.91%
----
Percentage of individuals with no recidivism (two_year_recid == 0):
Female: 63.82%
Male: 50.32%



**Question:** What are the base rates for the combination of both race and sex categories?

In [34]:
# Your code
base_rate(['race', 'sex'])
print()

Percentage of individuals with no recidivism (two_year_recid == 0):
('African-American', 'Female'): 63.02%
('African-American', 'Male'): 44.48%
('Caucasian', 'Female'): 64.73%
('Caucasian', 'Male'): 59.78%



**Question**

Write down a short interpretation of the statistics you calculated. What do you see?
> Answer:
- Most data instances (around 60%) are African-American.
- The recidivism rate is much lower for Caucasian individuals than for African-American individuals (39.1% versus 52.3%).
- The recidivism rate is lower for women than for men. This holds within both race groups separately as well.
- Only for African-American males is the recidivism rate above 50%.

### **2. Performance measures**

You will have to measure the performance and fairness of different classifiers in question 5. The performance will be calculated with the precision, recall, F1 and accuracy.
Additionally, you will have to calculate the statistical/demographic parity, the true positive rate (recall) and false positive rate per race group.

Make sure that you are able to calculate these metrics in the cell below.
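
For reference, a compact statement of these measures, taking the favorable label (`two_year_recid = 0`) as the positive class. This is a sketch of the standard definitions; the group symbols $a$ and $b$ are introduced here only for illustration.

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$

$$\text{accuracy} = \frac{TP + TN}{TP + FP + TN + FN}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

$$\text{statistical parity difference} = P(\hat{Y} = 0 \mid A = a) - P(\hat{Y} = 0 \mid A = b)$$

Here $\hat{Y}$ is the predicted label and $A$ the protected attribute. The group-wise TPR and FPR follow from the same formulas with the counts restricted to one group.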

In [42]:
# Your code for the performance measures
def performance(model_predictions, true_labels_data):
    # predictions have different indices, so we set them anew
    model_predictions = model_predictions.reset_index(drop=True)
    true_labels_data = true_labels_data.reset_index(drop=True)
    
    # Since no recidivism is the positive label, we can compute the confusion matrix:
    tp = ((model_predictions['pred'] == 0) & (true_labels_data['two_year_recid'] == 0)).sum()
    fp = ((model_predictions['pred'] == 0) & (true_labels_data['two_year_recid'] == 1)).sum()

    tn = ((model_predictions['pred'] == 1) & (true_labels_data['two_year_recid'] == 1)).sum()
    fn = ((model_predictions['pred'] == 1) & (true_labels_data['two_year_recid'] == 0)).sum()

    # precision: true positives / all predicted positives
    precision = tp / (tp + fp)

    # recall (true positive rate): true positives / all instances with true label 0
    recall = tp / (tp + fn)

    # false positive rate: false positives / all instances with true label 1
    fpr = fp / (fp + tn)

    # accuracy: all correct predictions / total instances
    accuracy = (tp + tn) / (tp + fp + tn + fn)


    f1 = 2 * (precision * recall) / (precision + recall)

    print(f"Accuracy: {100 * accuracy:.2f}%")
    print(f"Precision: {100 * precision:.2f}%")
    print(f"Recall: {100 * recall:.2f}%")
    print(f"False Positive Rate: {100 * fpr:.2f}%")
    print(f"F1-score: {f1:.2f}")
    return [accuracy, precision, recall, f1]

def group_performance(model_predictions, data, true_labels_data, attribute):
    # work on a copy so that the caller's dataframe (e.g. x_test) is not modified
    data = data.copy()
    data['pred'] = model_predictions['pred'].values  # add the predictions to the data points
    data['two_year_recid'] = true_labels_data['two_year_recid'].values  # add the true labels to the data points

    grouped_by_a = data.groupby(attribute)['pred']  # group the predictions by the protected attribute
    total_per_a = grouped_by_a.count()

    no_recid = grouped_by_a.apply(lambda x: (x == 0).sum())
    percentage_no_recid = (no_recid/total_per_a) * 100

    print("Percentage of individuals in a subgroup with no recidivism predicted:")
    percentages = []
    races = ['African-American', 'Caucasian']  # LabelEncoder codes 0 and 1
    for a in percentage_no_recid.index:
        print(f"{races[a]}: {percentage_no_recid[a]:.2f}%")
        percentages.append(percentage_no_recid[a])
    # ratio of the favorable-prediction rates (African-American / Caucasian)
    print(f"statistical parity: {percentages[0]/percentages[1]:.2f}")

    black = data[data['race'] == 0]
    white = data[data['race'] == 1]

    print("\nDivided on sensitive attribute 'race':")
    print("\nAfrican-American:")
    performance(black[['pred']], black[['two_year_recid']])
    print("\nCaucasian:")
    performance(white[['pred']], white[['two_year_recid']])

    

### **3. Prepare the data**
For the classifiers in question 5, the input of the model can only contain numerical values. It is therefore important to convert the strings in the columns (features) of interest of the `compas_data` to floats or integers.

The columns of interest are features that you think will be informative or interesting in predicting the outcome variable.
Use the cell below to explore which of the Compas variables you need to convert to be able to use them for the classifiers.

Generate a new dataframe with your selected features in the right encoding (also make sure to include `two_year_recid`). You can implement this yourself, or use the `LabelEncoder` from `sklearn`.

**Note:** you do not need to convert all columns/features, only the ones you are interested in. However, do **not** include the feature `is_recid`.
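
When using `LabelEncoder`, it helps to know which integer each category receives, because later group comparisons depend on it. A minimal sketch with a toy list (the input values are invented for illustration): the encoder stores the categories in sorted order in `classes_`, and each category's code is its position in that array.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(['Caucasian', 'African-American', 'Caucasian'])

# classes_ holds the original strings in sorted order; the integer code is the index
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(list(codes))
```

With this encoding `'African-American'` maps to 0 and `'Caucasian'` to 1. Keeping the fitted encoder around (rather than discarding it) also allows `le.inverse_transform` to recover the original strings.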

In [43]:
# Your code to prepare the data
from sklearn.preprocessing import LabelEncoder
print(compas_data.columns)


selected_features_categorical = ['c_charge_degree', 'r_charge_degree', 'vr_charge_degree']
selected_features_numerical = ['priors_count']


prepared_compas_data = pd.DataFrame()


# cast the numerical features to integers
for feature in selected_features_numerical:
    prepared_compas_data[feature] = compas_data[feature].astype(int)

# categorical features and the protected attributes are label-encoded as integers
# (LabelEncoder sorts the classes, so 'African-American' -> 0 and 'Caucasian' -> 1)
for feature in selected_features_categorical + ['race', 'sex']:
    prepared_compas_data[feature] = LabelEncoder().fit_transform(compas_data[feature])

prepared_compas_data['two_year_recid'] = compas_data['two_year_recid'].astype(int)

prepared_compas_data.head()

Index(['id', 'name', 'first', 'last', 'compas_screening_date', 'sex', 'dob',
       'age', 'age_cat', 'race', 'juv_fel_count', 'decile_score',
       'juv_misd_count', 'juv_other_count', 'priors_count',
       'days_b_screening_arrest', 'c_jail_in', 'c_jail_out', 'c_case_number',
       'c_offense_date', 'c_arrest_date', 'c_days_from_compas',
       'c_charge_degree', 'c_charge_desc', 'is_recid', 'r_case_number',
       'r_charge_degree', 'r_days_from_arrest', 'r_offense_date',
       'r_charge_desc', 'r_jail_in', 'r_jail_out', 'violent_recid',
       'is_violent_recid', 'vr_case_number', 'vr_charge_degree',
       'vr_offense_date', 'vr_charge_desc', 'type_of_assessment',
       'decile_score.1', 'score_text', 'screening_date',
       'v_type_of_assessment', 'v_decile_score', 'v_score_text',
       'v_screening_date', 'in_custody', 'out_custody', 'priors_count.1',
       'start', 'end', 'event', 'two_year_recid'],
      dtype='object')


Unnamed: 0,priors_count,c_charge_degree,r_charge_degree,vr_charge_degree,race,sex,two_year_recid
1,0,0,3,2,0,1,1
2,4,0,6,8,0,1,1
6,14,0,2,8,1,1,1
8,0,1,9,8,1,0,0
10,0,0,9,8,1,1,0


**Question**

Give a short motivation (one or two sentences) per feature why you think this is informative or interesting to take into account.
> Answer:
* priors count: this is a measure of the number of offenses committed in the past. This can provide important insight into whether the defendant has learned from previous mistakes. We purposely exclude felonies committed as a juvenile, since

* current charge degree: this shows the extent of the current crime, on which we should base our verdict. Naturally, more serious crimes can be indicators of a more profound role in the criminal circuit and therefore a higher chance of recidivism.

* r/vr charge degree: this gives us some indication of whether the defendant has already committed crimes on a higher level in the past. Similarly to the current charge degree, this can provide insight into the type of crimes committed.

We purposely excluded:

* Personal information, such as age, since we deemed it irrelevant for the court. The law would and should treat all adults similarly, despite e.g. an age difference.
* decile score: this is a recidivism risk score provided by an algorithm. Although this could be seen as highly relevant, since predicting recidivism is also our objective, we will not use it since we want to stay clear of any other artificially computed scores. That way, we make sure that any bias in our classification is not due to hidden biases in the decile score algorithm.

### **4. Train and test split**

Divide the dataset into a train (80%) and test split (20%), either by implementing it yourself, or by using an existing library.

**Note:** Usually when carrying out machine learning experiments,
we also need a dev set for developing and selecting our models (incl. tuning of hyper-parameters).
However, in this assignment, the goal is not to optimize
the performance of models so we'll only use a train and test split.
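
A minimal sketch of such a split with scikit-learn's `train_test_split`, on a toy frame invented for illustration. The `random_state` and `stratify` arguments are optional choices assumed here, not requirements of the assignment: the first makes the split reproducible across runs, the second keeps the label proportions equal in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame, invented for illustration: 100 rows, half with label 0 and half with label 1
toy = pd.DataFrame({'feature': range(100), 'label': [0, 1] * 50})

train, test = train_test_split(
    toy,
    test_size=0.2,          # 20% test, 80% train
    random_state=0,         # fixed seed so the split is reproducible
    stratify=toy['label'],  # keep the label proportions equal in both parts
)
print(len(train), len(test), test['label'].mean())
```

Without a fixed seed, every run of the notebook produces a different split, so reported numbers can drift between the text and the cell outputs.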




In [44]:
# Your code to split the data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    prepared_compas_data[selected_features_numerical + selected_features_categorical + ['race', 'sex']],
    prepared_compas_data['two_year_recid'],
    test_size=0.2,  # 20% test, 80% train
)

x_train.head()

Unnamed: 0,priors_count,c_charge_degree,r_charge_degree,vr_charge_degree,race,sex
1956,3,1,6,5,1,1
2846,5,1,6,8,1,1
5333,1,0,3,8,0,1
1546,2,1,9,8,1,1
4766,0,1,9,8,1,0


### **5. Classifiers**

Now, train and test different classifiers and report the following statistics:

* Overall performance:

  * Precision
  * Recall
  * F1
  * Accuracy

* Fairness performance:

  * The statistical parity difference for the protected attribute `race`(i.e. the difference in the probability of receiving a favorable label between the two protected attribute groups);
  * The true positive rates of the two protected attribute groups
  * The false positive rates of the two protected attribute groups.

For training the classifier we recommend using scikit-learn (https://scikit-learn.org/stable/).
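
The fairness statistics listed above can be sketched with plain NumPy as follows. This is an illustrative helper, not the notebook's own implementation: the function name `fairness_report`, its parameters and the toy arrays are all hypothetical, and the favorable label 0 is assumed to be the positive class.

```python
import numpy as np

def fairness_report(y_true, y_pred, group, favorable=0):
    """Per-group selection rate, TPR and FPR, with `favorable` as the positive class.

    Assumes every group contains both true labels; otherwise a rate is undefined (nan).
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    report = {}
    for g in np.unique(group):
        in_g = group == g
        pos_true = y_true[in_g] == favorable   # truly favorable instances in this group
        pos_pred = y_pred[in_g] == favorable   # instances predicted favorable in this group
        report[g.item()] = {
            'selection_rate': float(pos_pred.mean()),
            'TPR': float(pos_pred[pos_true].mean()),
            'FPR': float(pos_pred[~pos_true].mean()),
        }
    return report

# Toy data, invented for illustration: two groups of four instances each
y_true = [0, 0, 1, 1, 0, 0, 1, 1]
y_pred = [0, 1, 0, 1, 0, 0, 0, 1]
group = [0, 0, 0, 0, 1, 1, 1, 1]

report = fairness_report(y_true, y_pred, group)
# Statistical parity difference: difference in the rate of favorable predictions
spd = report[0]['selection_rate'] - report[1]['selection_rate']
print(report)
print(f"statistical parity difference: {spd:.2f}")
```

Reporting the difference rather than the ratio matches the wording of the task; a value of 0 means both groups receive the favorable label equally often.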

#### **5.1 Regular classification**
Train a logistic regression classifier with the race feature and all other features that you are interested in.

In [45]:
# Your code for classifier 1
from sklearn.linear_model import LogisticRegression

model_with_race = LogisticRegression()
model_with_race.fit(x_train[selected_features_numerical + selected_features_categorical + ['race']], y_train)

predictions = pd.DataFrame(model_with_race.predict(x_test[selected_features_numerical + selected_features_categorical + ['race']]), columns=['pred'])
true_labels = pd.DataFrame(y_test, columns=['two_year_recid'])

perf = performance(predictions, true_labels)
group_performance(predictions, x_test, true_labels, 'race')


Accuracy: 96.21%
Precision: 98.51%
Recall: 94.29%
False Positive Rate: 1.61%
F1-score: 0.96
Percentage of individuals in a subgroup with no recidivism predicted:
African-American: 45.62%
Caucasian: 57.95%
statistical parity: 0.79

Divided on sensitive attribute 'race':

African-American:
Accuracy: 95.94%
Precision: 98.58%
Recall: 92.95%
False Positive Rate: 1.26%
F1-score: 0.96

Caucasian:
Accuracy: 96.59%
Precision: 98.43%
Recall: 95.80%
False Positive Rate: 2.25%
F1-score: 0.97


#### **5.2 Without the protected attribute**
Train a logistic regression classifier without the race feature, but with all other features you used in 5.1.


In [46]:
# Your code for classifier 2
from sklearn.linear_model import LogisticRegression

model_without_race = LogisticRegression()
model_without_race.fit(x_train[selected_features_categorical + selected_features_numerical], y_train)

wr_predictions = pd.DataFrame(model_without_race.predict(x_test[selected_features_categorical + selected_features_numerical]), columns=['pred'])
wr_true_labels = pd.DataFrame(y_test, columns=['two_year_recid'])

wr_perf = performance(wr_predictions, wr_true_labels)
group_performance(wr_predictions, x_test, wr_true_labels, 'race')

Accuracy: 96.21%
Precision: 98.51%
Recall: 94.29%
False Positive Rate: 1.61%
F1-score: 0.96
Percentage of individuals in a subgroup with no recidivism predicted:
African-American: 45.62%
Caucasian: 57.95%
statistical parity: 0.79

Divided on sensitive attribute 'race':

African-American:
Accuracy: 95.94%
Precision: 98.58%
Recall: 92.95%
False Positive Rate: 1.26%
F1-score: 0.96

Caucasian:
Accuracy: 96.59%
Precision: 98.43%
Recall: 95.80%
False Positive Rate: 2.25%
F1-score: 0.97


**Question**

Write down a short interpretation of the results you calculated. What do you see?
> Answer:

We see that we are able to train a fairly good classifier, with high accuracy, precision and recall. Therefore, the percentage of individuals predicted a zero is quite close to the base rate for both race values. However, this also means that the chance of being assigned a zero is not equal for both races: we see a statistical parity of 0.4495/0.6102 = 0.74 (with the train-test split at the time of writing; the output shown above, from a different split, gives 0.79). Considering a common threshold of 0.8, this means the classifier does not satisfy the fairness criterion of equal decision measures.

We can see this inequality when we divide the data on the sensitive attribute race. Since a high percentage of the African-Americans has a label of 1, our classifier has somewhat more difficulty assigning them the label 0 than it has for the Caucasian population. This is most noticeable in the recall: there is a clear difference between the two groups in the ability to assign all non-recidivism individuals a 0, with African-Americans more often being wrongly assigned a 1.

We would expect that removing the sensitive attribute column from the data would result in a decrease in this inequality of decision measures. After all, if the model is unable to see which group an individual belongs to, it cannot base decisions on race and is forced to base them on crimes alone. Individuals with similar crimes yet different races should then be treated identically, reducing the bias and bringing the statistical parity ratio closer to 1.

However, against our expectations, both models return identical predictions on the test data set. This means that the performance of the model with the sensitive attribute is the same as that of the model that excludes race. Although surprising, we believe it might highlight the fact that fairness is not achieved through unawareness: leaving out the race attribute is not a solution if its value is strongly correlated with the other features we use. If this is the case, these proxies can allow the model to still infer the race group without being able to see that specific column. Considering the model's performance, we think that the crime data might reflect the race data, e.g. because African-American individuals in this dataset are on average charged with more serious crimes, making the column about the race itself redundant for optimal classification.


#### **5.3 Pre-processing: Reweighing**
Train and test a classifier with weights (see lecture slide for the weight calculation)

In [None]:
# Your code for classifier 3
def base_rate_reweigh(attribute):
    # Group sizes probability:
    group_size = compas_data[[attribute]].value_counts()
    group_prob = group_size/len(compas_data)

    # Label probability
    label_size = compas_data[['two_year_recid']].value_counts()
    label_prob = label_size/len(compas_data)

    print(f"If {attribute} and label are independent, we would expect to observe the probabilities:")
    for a in group_prob.index:
        for l in label_prob.index:
            print(f"{a[0]}, {l[0]}: {group_prob[a] * label_prob[l]:.4f}")

    # observed joint probabilities of (attribute, label)
    observed_probability = compas_data[[attribute, 'two_year_recid']].value_counts() / len(compas_data)

    print("\nTherefore, we need to reweigh into:")
    for a in group_prob.index:
        for l in label_prob.index:
            print(f"{a[0]}, {l[0]}: {((group_prob[a] * label_prob[l]) / observed_probability[a[0], l[0]]):.4f}")
    return


base_rate_reweigh('race')

print()

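# the weights below are copied from the output of base_rate_reweigh('race');
# race 0 = African-American, race 1 = Caucasian (LabelEncoder codes)
weights = []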

for index, point in x_train.iterrows():
    race = point['race']
    recid = y_train.loc[index]
    
    if race == 0 and recid == 0:
        weights.append(1.1105)
    elif race == 0 and recid == 1:
        weights.append(0.8993)
    elif race == 1 and recid == 0:
        weights.append(0.8694)
    elif race == 1 and recid == 1:
        weights.append(1.2036)

weighted_model_without_race = LogisticRegression()
weighted_model_without_race.fit(x_train[selected_features_categorical + selected_features_numerical], y_train,
sample_weight=weights)

rw_predictions = pd.DataFrame(weighted_model_without_race.predict(x_test[selected_features_categorical + selected_features_numerical]), columns=['pred'])
rw_true_labels = pd.DataFrame(y_test, columns=['two_year_recid'])


rw_perf = performance(rw_predictions, rw_true_labels)
group_performance(rw_predictions, x_test, rw_true_labels, 'race')


If race and label are independent, we would expect to observe the probabilities:
African-American, 0: 0.3186
African-American, 1: 0.2830
Caucasian, 0: 0.2110
Caucasian, 1: 0.1874

Therefore, we need to reweigh into:
African-American, 0: 1.1105
African-American, 1: 0.8993
Caucasian, 0: 0.8694
Caucasian, 1: 1.2036

Accuracy: 97.63%
Precision: 99.25%
Recall: 96.17%
F1-score: 0.98
Percentage of individuals in a subgroup with no recidivism predicted:
African-American: 45.66%
Caucasian: 57.14%
statistical parity: 0.80

Divided on sensitive attribute 'race':

African-American:
Accuracy: 97.24%
Precision: 99.60%
Recall: 95.74%
F1-score: 0.98

Caucasian:
Accuracy: 97.91%
Precision: 98.94%
Recall: 96.56%
F1-score: 0.98


**Question**

 Report the 4 weights that are used for reweighing and a short **interpretation/discussion** of the weights and the classifier results.
> Answer:
The weights for combinations of race and outcome are:

African-American, 0: 1.1105,

African-American, 1: 0.8993,

Caucasian, 0: 0.8694,

Caucasian, 1: 1.2036,

This shows that African-American in combination with 0 and Caucasian in combination with 1 have smaller probabilities than expected under independence between attribute and outcome. Therefore, we increase the weight of the underrepresented combinations and decrease the weight of the overrepresented ones in the classifier, such that the joint distribution resembles the expected distribution more closely.

We would expect that applying these sample weights would result in more equal classification between the race groups. This is because, with this more balanced distribution, the large difference in base rates (percentage of non-recidivism) should be leveled out for both groups. However, quite surprisingly, the model with these weights was unable to improve noticeably. We believe this could be because the difference is not that large, resulting in weights quite close to 1. Therefore, it might be the case that this reweighing does not have an effect profound enough to change the decisions of the classifier.
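
The four weights can also be computed per row directly from the data instead of being typed in by hand. A minimal sketch, assuming the weight is the expected joint probability under independence divided by the observed joint probability; the helper name `reweighing_weights` is hypothetical, and the toy frame is rebuilt from the group counts printed earlier in this notebook.

```python
import pandas as pd

def reweighing_weights(df, attribute, label):
    """Weight per row: P(A=a) * P(Y=y) / P(A=a, Y=y), estimated from `df`."""
    n = len(df)
    p_group = df[attribute].value_counts() / n
    p_label = df[label].value_counts() / n
    # expected joint probability if attribute and label were independent
    expected = df[attribute].map(p_group) * df[label].map(p_label)
    # observed joint probability of each row's (attribute, label) combination
    observed = df.groupby([attribute, label])[label].transform('size') / n
    return expected / observed

# Rebuild a frame with the (race, two_year_recid) counts reported above
counts = {
    ('African-American', 0): 1514,
    ('African-American', 1): 1661,
    ('Caucasian', 0): 1281,
    ('Caucasian', 1): 822,
}
rows = [(race, y) for (race, y), n in counts.items() for _ in range(n)]
df = pd.DataFrame(rows, columns=['race', 'two_year_recid'])

weights = reweighing_weights(df, 'race', 'two_year_recid')
table = weights.groupby([df['race'], df['two_year_recid']]).first().round(4)
print(table)
```

The resulting series is aligned with the rows of `df`, so it can be passed as `sample_weight` to `LogisticRegression.fit`. Whether to estimate the probabilities on the full data or only on the training split is a design choice; the sketch works for either.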

#### **5.4 Post-processing: Equalized odds**
Use the predictions by the first classifier for this post processing part (see lecture slides for more information about post processing for equalized odds).

We have the following parameters ($A$ indicates group membership, $\hat{Y}$ the original prediction, $\tilde{Y}$ the prediction of the derived predictor).

* `p_00` = $P(\tilde{Y} = 1 \mid \hat{Y} = 0, A = 0)$
* `p_01` = $P(\tilde{Y} = 1 \mid \hat{Y} = 0, A = 1)$
* `p_10` = $P(\tilde{Y} = 1 \mid \hat{Y} = 1, A = 0)$
* `p_11` = $P(\tilde{Y} = 1 \mid \hat{Y} = 1, A = 1)$


Normally, the best parameters `p_00, p_01, p_10, p_11` are found with a linear program that minimizes loss between predictions of a derived predictor and the actual labels. In this assignment we will not ask you to do this. Instead, we would like you to follow the next steps to find parameters, post-process the data and check the performance of this classifier with post-processing:

1. Generate 5000 different samples of these 4 parameters randomly;
2. Write a function (or more) that applies these 4 parameters to postprocess the predictions.
3. For each generated set of 4 parameters:
  - Change the predicted labels with the function(s) from step 2;
  - Evaluate these 'new' predictions, by calculating group-wise TPR and FPR, as well as overall performance based on F1 and/or accuracy.
4. Choose the best set of parameters. Take into account the equalized odds fairness measure, as well a performance measure like accuracy or F1.
5. Check the overall performance (precision, recall, accuracy, F1, etc.) of the new predictions after post-processing.
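
To see how the four parameters act on the group-wise rates, a short derivation may help. It assumes, as in the parameter definitions above, that the derived prediction depends only on the original prediction and the group; $\hat{Y}$ denotes the original prediction, $\tilde{Y}$ the derived one, and $p_{ja}$ stands for the parameter `p_ja`. For group $a$ and true label $y$:

$$P(\tilde{Y} = 1 \mid Y = y, A = a) = p_{1a}\, P(\hat{Y} = 1 \mid Y = y, A = a) + p_{0a}\, P(\hat{Y} = 0 \mid Y = y, A = a)$$

Taking label 1 as the positive class (as the parameter definitions do), $y = 1$ gives the true positive rate and $y = 0$ the false positive rate of the derived predictor:

$$\text{TPR}_a(\tilde{Y}) = p_{1a}\, \text{TPR}_a(\hat{Y}) + p_{0a}\, \bigl(1 - \text{TPR}_a(\hat{Y})\bigr)$$

$$\text{FPR}_a(\tilde{Y}) = p_{1a}\, \text{FPR}_a(\hat{Y}) + p_{0a}\, \bigl(1 - \text{FPR}_a(\hat{Y})\bigr)$$

Equalized odds asks that both quantities be (approximately) equal for $a = 0$ and $a = 1$. These are expected values; the rates measured on a single random draw of new labels will fluctuate around them.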

In [47]:
# Your code for step 1
import random

random_parameters = []
for _ in range(5000):
  p_00 = random.uniform(0,1)
  p_01 = random.uniform(0,1)
  p_10 = random.uniform(0,1)
  p_11 = random.uniform(0,1)
  random_parameters.append({(0, 0): p_00,
                            (0, 1): p_01,
                            (1, 0): p_10,
                            (1, 1): p_11})

# Example, first set of random parameters
print(random_parameters[0])

{(0, 0): 0.854737764342836, (0, 1): 0.6601852897361286, (1, 0): 0.6693255084887851, (1, 1): 0.9580250740412501}


In [48]:
import pandas as pd
import numpy as np
import random
from sklearn.metrics import confusion_matrix

# Your code for step 2
# Create a dataframe with the necessary information
# Note: race_num codes Caucasian as 0 and African-American as 1, the reverse of the
# LabelEncoder codes in prepared_compas_data (African-American = 0, Caucasian = 1)
compas_data = compas_data[compas_data['race'].isin(['African-American', 'Caucasian'])].copy()
compas_data['race_num'] = compas_data['race'].map({'Caucasian': 0, 'African-American': 1})

pred_labels_df = predictions.values.ravel()
true_labels_df = true_labels.values.ravel()
compas_subset = compas_data.loc[x_test.index].copy()

df_post_data = pd.DataFrame({'race_num': compas_subset['race_num'].values,  # I can't get this one to the right size
                             'pred_labels': pred_labels_df,
                             'true_labels': true_labels_df})

# the number of cases falling in each condition
subset_sizes = {
    (0, 0): len(df_post_data.query('pred_labels == 0 & race_num == 0')),
    (0, 1): len(df_post_data.query('pred_labels == 0 & race_num == 1')),
    (1, 0): len(df_post_data.query('pred_labels == 1 & race_num == 0')),
    (1, 1): len(df_post_data.query('pred_labels == 1 & race_num == 1'))
}

def generate_labels(subset_sizes, p_dict):
    new_predictions = {}

    for (prediction, group), p in p_dict.items():

      # The number of instances for which we need to generate labels
      num_instances = subset_sizes[(prediction, group)]

      # Write your code here.
      new_labels = np.random.binomial(n=1, p=p, size=num_instances)
      # save the new predictions
      new_predictions[(prediction, group)] = new_labels

    return new_predictions
  
def post_processing(df, new_predictions):
    df = df.copy()
    df['postprocessed_preds'] = np.nan

    for (pred_val, group_val), labels in new_predictions.items():
        mask = (df['pred_labels'] == pred_val) & (df['race_num'] == group_val)
        df.loc[mask, 'postprocessed_preds'] = labels

    df['postprocessed_preds'] = df['postprocessed_preds'].astype(int)
    return df

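# Compute TPR/FPR per group.
# Note: with labels=[0, 1], confusion_matrix(...).ravel() treats label 1 (recidivism)
# as the positive class, unlike `performance` above, where label 0 is positive.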
def compute_group_metrics(df, pred_col='postprocessed_preds'):
    group_stats = {}
    for group in [0, 1]:
        sub = df[df['race_num'] == group]
        y_true = sub['true_labels']
        y_pred = sub[pred_col]
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        tpr = tp / (tp + fn) if (tp + fn) > 0 else 0
        fpr = fp / (fp + tn) if (fp + tn) > 0 else 0
        group_stats[group] = {'TPR': tpr, 'FPR': fpr}
    return group_stats




In [49]:
# Step 3
best_result = {
    'score': -np.inf,
    'params': None,
    'df': None,
    'group_metrics': None
}

for p_set in random_parameters:
    new_preds = generate_labels(subset_sizes, p_set)
    df_eval = post_processing(df_post_data, new_preds)
    group_metrics = compute_group_metrics(df_eval)

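    # Calculate the TPR gap between the groups and the overall F1.
    # Note: only the TPR gap is penalized here; the FPR gap, which equalized odds
    # also requires to be small, is not part of the score.
    tpr_gap = abs(group_metrics[0]['TPR'] - group_metrics[1]['TPR'])

    y_true = df_eval['true_labels'].values
    y_pred = df_eval['postprocessed_preds'].values
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

    # guard against division by zero for degenerate predictions
    precision = tp / (tp + fp) if (tp + fp) else 0
    recall = tp / (tp + fn) if (tp + fn) else 0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) else 0

    score = f1 - tpr_gap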

    if score > best_result['score']:
        best_result.update({
            'score': score,
            'params': p_set,
            'df': df_eval,
            'group_metrics': group_metrics,
            'confusion_matrix': (tn, fp, fn, tp)
        })

In [50]:
# Your code for step 4 and 5
# Step 4
print("Best parameters:")
for (yhat, group), prob in best_result['params'].items():
    print(f"  P(Ỹ=1 | Ŷ={yhat}, A={group}) = {prob:.3f}")

print("\nGroup-wise TPR/FPR:")
for g, stats in best_result['group_metrics'].items():
    print(f"  Group {g}: TPR = {stats['TPR']:.3f}, FPR = {stats['FPR']:.3f}")

# Step 5
tn, fp, fn, tp = best_result['confusion_matrix']

precision = tp / (tp + fp) if (tp + fp) else 0
recall = tp / (tp + fn) if (tp + fn) else 0
accuracy = (tp + tn) / (tp + fp + tn + fn) if (tp + fp + tn + fn) else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) else 0

print("\nOverall performance:")
print(f"Accuracy: {100 * accuracy:.2f}%")
print(f"Precision: {100 * precision:.2f}%")
print(f"Recall: {100 * recall:.2f}%")
print(f"F1-score: {f1:.2f}")

Best parameters:
  P(Ỹ=1 | Ŷ=0, A=0) = 0.053
  P(Ỹ=1 | Ŷ=0, A=1) = 0.059
  P(Ỹ=1 | Ŷ=1, A=0) = 0.986
  P(Ỹ=1 | Ŷ=1, A=1) = 0.967

Group-wise TPR/FPR:
  Group 0: TPR = 0.966, FPR = 0.084
  Group 1: TPR = 0.965, FPR = 0.134

Overall performance:
Accuracy: 92.52%
Precision: 88.54%
Recall: 96.57%
F1-score: 0.92


**Question**

Describe how you selected the best set of parameters. Furthermore, how do you interpret the best set of parameters that you found? And what do you think of the results of the new classifier?
>Answer

#### **Overall discussion**
For all 4 classifiers that you trained, describe:
- Does this classifier satisfy statistical parity?
- Does the classifier satisfy the equal opportunity criterion?

Finally, how do the different classifiers compare against each other?

>Answer

* The first three classifiers have identical results:
- Checking whether they satisfy statistical parity is straightforward, since we already calculated it as a group performance measure. Using the common threshold of 0.8, our classifiers fall just below that bar. We therefore conclude that the race groups are not equally likely to be assigned the label 0 by our classifiers.
- To see whether the equal opportunity criterion holds for these models, we compare the true positive rate (recall) between the groups. For all three classifiers, we see a difference in recall in favor of the Caucasian group. This means the models show slight inequalities in how they treat people with the same true outcome. However, we should keep in mind that these differences are quite small, around one percent.

* The fourth classifier has different performance measures:
- For the equal opportunity criterion, we see that the TPR between both groups is (almost) identical. This shows that the model does satisfy the equal opportunity criterion.
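
The 0.8 threshold mentioned above can be checked with a few lines of code. The sketch below assumes that statistical parity is measured as the ratio between the lowest and the highest group rate of receiving the label 0 (no recidivism predicted). The helper name `parity_ratio` and the toy data are hypothetical and do not reproduce the notebook's actual numbers.

```python
import numpy as np

def parity_ratio(y_pred, group, favorable=0):
    """Return the per-group rate of the favorable label and the min/max ratio."""
    y_pred = np.asarray(y_pred)
    group = np.asarray(group)
    rates = {
        g: float((y_pred[group == g] == favorable).mean())
        for g in np.unique(group).tolist()
    }
    lowest, highest = min(rates.values()), max(rates.values())
    # If no group ever receives the favorable label, treat the rates as equal.
    ratio = lowest / highest if highest > 0 else 1.0
    return rates, ratio

# Toy data (assumption): 10 people per group; 6 of group 0 and 4 of group 1
# receive the label 0.
y_pred = np.array([0] * 6 + [1] * 4 + [0] * 4 + [1] * 6)
group = np.array([0] * 10 + [1] * 10)

rates, ratio = parity_ratio(y_pred, group)
print(rates)
print(f"Ratio: {ratio:.3f}, passes 0.8 threshold: {ratio >= 0.8}")
```

A ratio of 1 means that both groups receive the label 0 equally often; the further the ratio drops below 0.8, the stronger the violation of statistical parity under this rule of thumb.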


### **6. Intersectional fairness**
In the questions above `race` was the only protected attribute. However, multiple protected attributes sometimes interact, leading to different fairness outcomes for different combinations of these protected attributes.

Now explore the intersectional fairness for protected attributes `race` and `sex` for the first two classifiers from question 5. Make a combination of the `race` and `sex` column, resulting in four new subgroups (e.g., female Caucasian), and report the maximum difference between the subgroups for statistical parity, TPR and FPR.
For example, suppose we have four groups with TPRs 0.1, 0.2, 0.3, 0.8, then the maximum difference is 0.7.
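
The maximum difference described above is simply the largest subgroup value minus the smallest. A minimal sketch, using the TPRs from the example and a small made-up DataFrame, could look like this. The helper `max_gap` and the toy rows are hypothetical; the column names `race_num`, `sex` and `pred_labels` are assumed to match the ones used in the code cells of this section.

```python
import pandas as pd

def max_gap(values):
    """Largest difference between any two subgroup values."""
    values = list(values)
    return max(values) - min(values)

# The example from the text: four subgroups with these TPRs.
example_tprs = {"g1": 0.1, "g2": 0.2, "g3": 0.3, "g4": 0.8}
tpr_gap = max_gap(example_tprs.values())
print(f"TPR gap: {tpr_gap:.1f}")

# Toy predictions (assumption) to show the race x sex combination.
toy = pd.DataFrame({
    "race_num": [0, 0, 0, 0, 1, 1, 1, 1],
    "sex": ["Male", "Male", "Female", "Female", "Male", "Male", "Female", "Female"],
    "pred_labels": [0, 1, 0, 0, 1, 1, 0, 1],
})
toy["subgroup"] = toy["race_num"].astype(str) + "_" + toy["sex"]

# Share of label-0 predictions per subgroup, then the largest difference.
rate_pred0 = toy["pred_labels"].eq(0).groupby(toy["subgroup"]).mean()
parity_gap = max_gap(rate_pred0)
print(rate_pred0)
print(f"Statistical parity gap: {parity_gap:.1f}")
```

The same `max_gap` helper can be applied to per-subgroup TPR and FPR values once they are computed from each subgroup's confusion matrix.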

Your code to evaluate intersectional fairness for Classifier 1:




In [54]:
# Your code for intersectional fairness
compas_subset = compas_data.loc[x_test.index].copy()
df_post_data['sex'] = compas_subset['sex'].values

df_post_data['intersectional'] = df_post_data['race_num'].astype(str) + "_" + df_post_data['sex'].astype(str)

parity_intersect = {}
TPR_intersect = {}
FPR_intersect = {}

for combo in df_post_data['intersectional'].unique():
    sectional = df_post_data[df_post_data['intersectional'] == combo]
    parity_intersect[combo] = (sectional['pred_labels'] == 0).mean()
    
    y_true = sectional['true_labels']
    y_pred = sectional['pred_labels']
    
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    TPR_intersect[combo] = tp / (tp + fn) if (tp + fn) > 0 else 0
    FPR_intersect[combo] = fp / (fp + tn) if (fp + tn) > 0 else 0
    
print("Intersectional statistical parity with no recidivism predicted")
for group, value in parity_intersect.items():
    print(f"{group}: {value:.3f}")
print("\nIntersectional TPR")
for group, value in TPR_intersect.items():
    print(f"{group}: {value:.3f}")
print("\nIntersectional FPR")
for group, value in FPR_intersect.items():
    print(f"{group}: {value:.3f}")
    
print()
print("Max disparity:")
print(f"Statistical Parity gap: {max(parity_intersect.values()) - min(parity_intersect.values())}")
print(f"TPR gap: {max(TPR_intersect.values()) - min(TPR_intersect.values())}")
print(f"FPR gap: {max(FPR_intersect.values()) - min(FPR_intersect.values())}")

Intersectional statistical parity with no recidivism predicted
1_Male: 0.416
0_Male: 0.553
1_Female: 0.654
0_Female: 0.667

Intersectional TPR
1_Male: 0.989
0_Male: 0.986
1_Female: 0.971
0_Female: 0.941

Intersectional FPR
1_Male: 0.079
0_Male: 0.046
1_Female: 0.043
0_Female: 0.029

Max disparity:
Statistical Parity gap: 0.25065104166666663
TPR gap: 0.04826014913007459
FPR gap: 0.049535603715170275


Your code to evaluate intersectional fairness for Classifier 2:


In [55]:
# Your code for intersectional fairness Classifier 2

pred_labels_2 = wr_predictions.values.ravel()
true_labels_2 = wr_true_labels.values.ravel()
compas_subset_2 = compas_data.loc[x_test.index].copy()

df_post_data_2 = pd.DataFrame({'race_num': compas_subset_2['race_num'].values,
                             'pred_labels': pred_labels_2,
                             'true_labels': true_labels_2})
df_post_data_2['sex'] = compas_subset_2['sex'].values

df_post_data_2['intersectional'] = df_post_data_2['race_num'].astype(str) + "_" + df_post_data_2['sex'].astype(str)

parity_intersect_2 = {}
TPR_intersect_2 = {}
FPR_intersect_2 = {}

for combo in df_post_data_2['intersectional'].unique():
    sectional_2 = df_post_data_2[df_post_data_2['intersectional'] == combo]
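    # Use this classifier's subgroup frame (sectional_2). Reading `sectional`,
    # left over from the Classifier 1 loop, gives every subgroup the same value.
    parity_intersect_2[combo] = (sectional_2['pred_labels'] == 0).mean()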
    
    y_true_2 = sectional_2['true_labels']
    y_pred_2 = sectional_2['pred_labels']
    
    tn, fp, fn, tp = confusion_matrix(y_true_2, y_pred_2, labels=[0, 1]).ravel()
    TPR_intersect_2[combo] = tp / (tp + fn) if (tp + fn) > 0 else 0
    FPR_intersect_2[combo] = fp / (fp + tn) if (fp + tn) > 0 else 0
    
print("Intersectional statistical parity with no recidivism predicted")
for group, value in parity_intersect_2.items():
    print(f"{group}: {value:.3f}")
for group, value in TPR_intersect_2.items():
    print(f"{group}: {value:.3f}")
for group, value in FPR_intersect_2.items():
    print(f"{group}: {value:.3f}")
    
print()
print("Max disparity:")
print(f"Statistical Parity gap: {max(parity_intersect_2.values()) - min(parity_intersect_2.values())}")
print(f"TPR gap: {max(TPR_intersect_2.values()) - min(TPR_intersect_2.values())}")
print(f"FPR gap: {max(FPR_intersect_2.values()) - min(FPR_intersect_2.values())}")

Intersectional statistical parity with no recidivism predicted
1_Male: 0.667
0_Male: 0.667
1_Female: 0.667
0_Female: 0.667
1_Male: 0.989
0_Male: 0.986
1_Female: 0.971
0_Female: 0.941
1_Male: 0.079
0_Male: 0.046
1_Female: 0.043
0_Female: 0.029

Max disparity:
Statistical Parity gap: 0.0
TPR gap: 0.04826014913007459
FPR gap: 0.049535603715170275


**Question**

Write down a short interpretation of the results you calculated. What do you see?
> Answer:

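- The TPR and FPR gaps are small (both about 0.05 for each classifier), showing that the TPR and FPR values of all four subgroups lie close together. The classifiers therefore perform well on the equal opportunity criterion, since people with the same true outcome are assigned similar labels regardless of their subgroup.
- The statistical parity gap of Classifier 1 is considerably larger (0.25): the share of no-recidivism predictions ranges from 0.416 for subgroup 1_Male to 0.667 for subgroup 0_Female.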

## Discussion
Provide a short ethical discussion (1 or 2 paragraphs) reflecting on these two aspects:

1) The use of a ML system to try to predict recidivism;

2) The public release of a dataset like this.

> Answer

The use of automated decision-making systems, especially in high-consequence fields such as the justice system, is an emerging trend accompanied by legitimate concerns about fairness. As our experiment shows, historical data can be biased, which limits the classifier's ability to make correct decisions for underrepresented subgroups. Furthermore, we saw that attempts to remove these biases, by excluding sensitive information or including sample weights, do not always yield the profound increase in equality one might hope for. Although one might argue that accuracy can be valued above fairness, taking a more WYSIWYG approach, we should never overlook the effect the model's decisions have on human lives. Discriminatory models can maintain and even amplify existing historical biases by steering human action on the basis of stereotypical beliefs.

The release of datasets like COMPAS to the public is a good step towards more transparency in these classification tasks.