<a href="https://colab.research.google.com/github/gitan5hi/Bias-Detection-in-COMPAS-dataset/blob/main/Bias_Detection_in_COMPAS_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Loading dataset

In [3]:
!pip install kaggle



In [4]:
##Uploading Kaggle API token
from google.colab import files
_ = files.upload()   #assign the result so the notebook does not echo the token contents

Saving kaggle.json to kaggle.json

In [5]:
##Set up the Kaggle API credentials for use in the Colab environment
!mkdir -p ~/.kaggle   #creates a hidden directory for storing Kaggle credentials
!mv kaggle.json ~/.kaggle/   #moves the uploaded API token into that directory
!chmod 600 ~/.kaggle/kaggle.json   #only the owner can read or write the file, protecting sensitive information
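As an alternative to storing `kaggle.json` on disk, the Kaggle CLI can also read credentials from the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables. The sketch below is one possible way to set them from Python; the helper name `set_kaggle_credentials` is hypothetical and not part of this notebook or of the Kaggle package.

```python
import os

def set_kaggle_credentials(username, key):
    """Expose Kaggle credentials to the CLI through environment variables.

    Hypothetical helper: it only writes the two variables that the Kaggle
    CLI is documented to read; nothing is written to disk.
    """
    os.environ["KAGGLE_USERNAME"] = username
    os.environ["KAGGLE_KEY"] = key

# Possible usage in a notebook (values are typed in, never stored in a cell):
#   from getpass import getpass
#   set_kaggle_credentials(input("Kaggle username: "), getpass("Kaggle key: "))
```

Because the key is entered through `getpass`, it should not appear in the saved cell output.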

In [6]:
##Downloading the dataset
!kaggle datasets download -d danofer/compass

Dataset URL: https://www.kaggle.com/datasets/danofer/compass
License(s): DbCL-1.0
Downloading compass.zip to /content
  0% 0.00/2.72M [00:00<?, ?B/s]
100% 2.72M/2.72M [00:00<00:00, 559MB/s]


In [7]:
!unzip compass.zip -d compas_dataset

Archive:  compass.zip
  inflating: compas_dataset/compas-scores-raw.csv  
  inflating: compas_dataset/cox-violent-parsed.csv  
  inflating: compas_dataset/cox-violent-parsed_filt.csv  
  inflating: compas_dataset/propublicaCompassRecividism_data_fairml.csv/._propublica_data_for_fairml.csv  
  inflating: compas_dataset/propublicaCompassRecividism_data_fairml.csv/propublica_data_for_fairml.csv  


In [8]:
import pandas as pd
df=pd.read_csv("compas_dataset/propublicaCompassRecividism_data_fairml.csv/propublica_data_for_fairml.csv")
print(df.head())

   Two_yr_Recidivism  Number_of_Priors  score_factor  Age_Above_FourtyFive  \
0                  0                 0             0                     1   
1                  1                 0             0                     0   
2                  1                 4             0                     0   
3                  0                 0             0                     0   
4                  1                14             1                     0   

   Age_Below_TwentyFive  African_American  Asian  Hispanic  Native_American  \
0                     0                 0      0         0                0   
1                     0                 1      0         0                0   
2                     1                 1      0         0                0   
3                     0                 0      0         0                0   
4                     0                 0      0         0                0   

   Other  Female  Misdemeanor  
0      1       0        

**Attributes Explained**

* *Two_yr_Recidivism*: The target variable; indicates if a defendant was rearrested (recidivated) within two years (1=yes, 0=no).

* *Number_of_Priors*: Count of prior offenses before the most recent arrest for each defendant.

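* *score_factor*: A binarized form of the COMPAS risk score. COMPAS assigns defendants a risk score predicting their likelihood of recidivism (reoffending) within a certain period (e.g., two years) based on various behavioral, demographic, and criminal history factors; in this file the score appears only as a 0/1 flag (commonly described as 1 = medium or high risk, 0 = low risk).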

* *Age_Above_FourtyFive, Age_Below_TwentyFive*: Binary features showing if age is above 45 or below 25, respectively, encoding age group buckets for fairness analysis.

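* *African_American, Asian, Hispanic, Native_American, Other*: One-hot encoded columns for racial groups. These are mutually exclusive, so each row has at most one '1'. A row with all five columns at 0 belongs to the group that has no column of its own (the reference group, which is likely Caucasian in this data).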

* *Female*: Binary feature for gender.

* *Misdemeanor*: Binary indicator for the type of charge (misdemeanor vs. felony).
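The one-hot claim about the race columns can be checked directly. The sketch below runs on a small hypothetical frame that only mimics the column layout (the rows are made up, not taken from the dataset); replacing `toy` with `df` would apply the same check to the real data.

```python
import pandas as pd

# Hypothetical toy rows that mimic the race indicator columns; not real data.
toy = pd.DataFrame({
    "African_American": [0, 1, 0, 0],
    "Asian":            [0, 0, 0, 0],
    "Hispanic":         [0, 0, 1, 0],
    "Native_American":  [0, 0, 0, 0],
    "Other":            [1, 0, 0, 0],
})

race_cols = ["African_American", "Asian", "Hispanic", "Native_American", "Other"]
flags_per_row = toy[race_cols].sum(axis=1)

# Mutually exclusive indicators never sum to more than 1 in a row.
is_exclusive = bool((flags_per_row <= 1).all())

# Rows with no flag set belong to the group that has no column of its own.
n_reference = int((flags_per_row == 0).sum())

print("mutually exclusive:", is_exclusive)
print("rows in the reference group:", n_reference)
```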




# Project Outcome

This analysis checks whether a model that predicts if someone will commit another crime is unfair to certain groups, such as people of different races, ages, or genders.


The dataset records details such as how many prior offenses a person has, their age group, their race, and whether their current charge is a misdemeanor or a felony.


If a model is biased, certain people could receive harsher treatment from the justice system, not because of facts, but because of unfair predictions made by a computer.

Bias detection helps ensure everyone gets equal treatment, and that the model's decisions are just and reliable for people of all backgrounds.


# Data preprocessing

In [12]:
##Look for correlation of Two_yr_Recidivism with other attributes
import pandas as pd
print(df.info())
correlation_matrix=df.corr()
print(correlation_matrix["Two_yr_Recidivism"].sort_values(ascending=False))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6172 entries, 0 to 6171
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   Two_yr_Recidivism     6172 non-null   int64
 1   Number_of_Priors      6172 non-null   int64
 2   score_factor          6172 non-null   int64
 3   Age_Above_FourtyFive  6172 non-null   int64
 4   Age_Below_TwentyFive  6172 non-null   int64
 5   African_American      6172 non-null   int64
 6   Asian                 6172 non-null   int64
 7   Hispanic              6172 non-null   int64
 8   Native_American       6172 non-null   int64
 9   Other                 6172 non-null   int64
 10  Female                6172 non-null   int64
 11  Misdemeanor           6172 non-null   int64
dtypes: int64(12)
memory usage: 578.8 KB
None
Two_yr_Recidivism       1.000000
score_factor            0.314832
Number_of_Priors        0.290607
African_American        0.140609
Age_Below_TwentyFive    0.111

In [22]:
##Split the data into training and testing sets
from sklearn.model_selection import train_test_split

#Race and gender columns are left out of the features; they are used later for group-wise bias evaluation
feature_cols = ['Number_of_Priors', 'score_factor', 'Age_Above_FourtyFive', 'Age_Below_TwentyFive', 'Misdemeanor']

X = df[feature_cols]
y = df['Two_yr_Recidivism']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Model Development and Training

Two *predictive models* for the binary outcome (Logistic Regression and Random Forest) are trained on features from the dataset.

In [25]:
##Train logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
logreg=LogisticRegression(max_iter=100)
logreg.fit(X_train, y_train)
#Predict and evaluate accuracy
logreg_preds=logreg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, logreg_preds))

Logistic Regression Accuracy: 0.668825910931174


Linear Regression is not used here because it predicts unbounded continuous values, whereas the outcome is binary (0 or 1).
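To illustrate the difference, the sketch below compares a raw linear output with the same value passed through the logistic (sigmoid) function, which is what logistic regression uses to turn a linear score into a probability. The coefficients `w` and `b` and the prior counts are hypothetical values chosen for illustration, not fitted parameters from this notebook.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weight and intercept for a single feature (not fitted values).
w, b = 0.15, -0.4
priors = np.array([0, 5, 14, 30])

linear_output = w * priors + b        # unbounded: can fall below 0 or exceed 1
probability = sigmoid(linear_output)  # always strictly between 0 and 1

print("linear output:", linear_output)
print("probability:  ", probability.round(3))
```

A raw linear output cannot be read as a probability, while the sigmoid output can be thresholded (for example at 0.5) to give a 0/1 prediction.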

In [24]:
##Train Random Forest model
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
rf=RandomForestClassifier(n_estimators=10, random_state=42)
rf.fit(X_train,y_train)
rf_preds=rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, rf_preds))

Random Forest Accuracy: 0.665587044534413
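Both accuracies are easier to interpret next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent class; the function name `majority_class_accuracy` and the toy labels are assumptions made for illustration.

```python
import numpy as np

def majority_class_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label in y_true."""
    y_true = np.asarray(y_true)
    _, counts = np.unique(y_true, return_counts=True)
    return counts.max() / y_true.size

# Hypothetical labels: three of the five are 0, so the baseline is 3/5.
toy_labels = [0, 0, 0, 1, 1]
baseline = majority_class_accuracy(toy_labels)
print("Majority-class baseline:", baseline)
```

Calling `majority_class_accuracy(y_test)` would give the corresponding baseline for the test split used above.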


# Bias Detection and Measurement

In [26]:
import numpy as np
from sklearn.metrics import confusion_matrix

# For bias metrics, put sensitive attributes aside for group-wise evaluation
race = df.loc[y_test.index, 'African_American']  # Example sensitive attribute
gender = df.loc[y_test.index, 'Female']

def disparity_impact(y_true, y_pred, sensitive_attr):
    # Disparate impact: ratio of favorable outcomes for protected vs unprotected group
    # A favorable outcome is a prediction of 0 (no recidivism); y_true is not used here
    favorable_protected = np.mean(y_pred[sensitive_attr == 1] == 0)
    favorable_unprotected = np.mean(y_pred[sensitive_attr == 0] == 0)
    return favorable_protected / favorable_unprotected

def statistical_parity_difference(y_pred, sensitive_attr):
    p_privileged = np.mean(y_pred[sensitive_attr == 0] == 0)
    p_protected = np.mean(y_pred[sensitive_attr == 1] == 0)
    return p_protected - p_privileged

def false_positive_rate_difference(y_true, y_pred, sensitive_attr):
    # labels=[0, 1] keeps each matrix 2x2 even if a group contains only one class
    cm_protected = confusion_matrix(y_true[sensitive_attr == 1], y_pred[sensitive_attr == 1], labels=[0, 1])
    cm_unprotected = confusion_matrix(y_true[sensitive_attr == 0], y_pred[sensitive_attr == 0], labels=[0, 1])
    # FPR = FP / (FP + TN), read from row 0 (true label 0) of each matrix
    fpr_protected = cm_protected[0,1] / (cm_protected[0,1] + cm_protected[0,0])
    fpr_unprotected = cm_unprotected[0,1] / (cm_unprotected[0,1] + cm_unprotected[0,0])
    return fpr_protected - fpr_unprotected

# Calculate bias metrics for Logistic Regression
di_race = disparity_impact(y_test, logreg_preds, race)
spd_race = statistical_parity_difference(logreg_preds, race)
fprd_race = false_positive_rate_difference(y_test, logreg_preds, race)

print(f"Disparate Impact (Race): {di_race:.3f}")
print(f"Statistical Parity Difference (Race): {spd_race:.3f}")
print(f"False Positive Rate Difference (Race): {fprd_race:.3f}")


Disparate Impact (Race): 0.638
Statistical Parity Difference (Race): -0.268
False Positive Rate Difference (Race): 0.221


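With a disparate impact of 0.638 (below the commonly used 0.8 "four-fifths" threshold), a statistical parity difference of -0.268, and a false positive rate that is 0.221 higher for the protected group, the logistic regression model appears less favorable and more punitive towards the protected group (African-American defendants) when predicting recidivism, even though race is not among its input features.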
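To make the three metrics easier to read, here is a small worked example on hypothetical data (eight made-up defendants, four per group). It follows the same definitions as the functions above, with a prediction of 0 treated as the favorable outcome; the helper names `favorable_rate` and `false_positive_rate` are introduced only for this sketch.

```python
import numpy as np

# Hypothetical labels and predictions (1 = reoffends / predicted to reoffend).
y_true = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y_pred = np.array([1, 0, 1, 1, 0, 0, 0, 1])
group = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = protected group

def favorable_rate(y_pred, mask):
    """Share of the masked group that receives the favorable prediction (0)."""
    return np.mean(y_pred[mask] == 0)

def false_positive_rate(y_true, y_pred, mask):
    """Share of true negatives in the masked group that are predicted as 1."""
    negatives = mask & (y_true == 0)
    return np.mean(y_pred[negatives] == 1)

prot, unprot = group == 1, group == 0

di = favorable_rate(y_pred, prot) / favorable_rate(y_pred, unprot)
spd = favorable_rate(y_pred, prot) - favorable_rate(y_pred, unprot)
fprd = (false_positive_rate(y_true, y_pred, prot)
        - false_positive_rate(y_true, y_pred, unprot))

# Four-fifths rule of thumb: a ratio below 0.8 is usually flagged for review.
fails_four_fifths = di < 0.8

print(f"Disparate Impact: {di:.3f}")
print(f"Statistical Parity Difference: {spd:.3f}")
print(f"False Positive Rate Difference: {fprd:.3f}")
print("Below four-fifths threshold:", fails_four_fifths)
```

In this toy case the protected group receives the favorable prediction 25% of the time against 75% for the other group, so the ratio is 1/3 and the difference is -0.5. A value of 1 for disparate impact and 0 for the two differences would indicate parity on these measures.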