<div class="alert alert-block alert-danger" style="color:#ffffff;background-color:#181a1b">
<h2 style="color:#ffffff;background-color:#181a1b">Racial bias in machine learning systems</h2> <a class="anchor" id="partB"></a> 
    
![](https://static.propublica.org/projects/algorithmic-bias/assets/img/generated/opener-b-crop-2400*1350-00796e.jpg)

<br />
<b> Dataset Description </b><br />
    
The main dataset is `compas.csv`. <br />
ProPublica's analysis is publicly available [here](https://github.com/propublica/compas-analysis).

The dataset was made publicly available by **Northpointe**, an American tech company that works with law enforcement agencies across several US states to predict future crimes from the past records of defendants.

It has been suspected that Northpointe's software, `COMPAS`, is biased against African-American defendants, who end up with `high-risk` tags despite minor criminal records, whereas `Caucasians` regularly receive low scores despite more significant criminal charges.
    
After pressure from several news agencies and a public investigation by ProPublica, the company released this dataset with a subset of the factors usually considered when assigning a score to a defendant.
<br /><br />
The dataset also contains a column `two_year_recid` with a binary response, i.e., `1` if the released person committed another crime within two years and `0` otherwise.
    
To learn more about this dataset and the public investigation, you are strongly encouraged to read ProPublica's article on [Machine Bias](https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing).

In [None]:
# Import libraries
%matplotlib inline
import math
import numpy as np
import pandas as pd
import seaborn as sns
from collections import Counter
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.metrics import confusion_matrix
from sklearn import tree
import pydot
sns.set()

import warnings

Splitting the data into 80% training and 20% validation sets, stratified by race.

Stratified means that the two sets should have roughly the same distribution of races as the original data.
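
As a minimal sketch of what stratification does, the example below splits a hypothetical toy DataFrame (the `group` column stands in for `race`; the data is random and not taken from `compas.csv`) and compares the group shares in the full data, the training set, and the validation set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: an imbalanced group column standing in for `race`
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "group": rng.choice(["a", "b", "c"], size=1000, p=[0.6, 0.3, 0.1]),
    "feature": rng.normal(size=1000),
})

# 80/20 split, stratified on the group column
toy_train, toy_val = train_test_split(
    toy, test_size=0.2, stratify=toy["group"], random_state=11
)

# Compare group shares in the full data and in each split
shares = pd.DataFrame({
    "full": toy["group"].value_counts(normalize=True),
    "train": toy_train["group"].value_counts(normalize=True),
    "val": toy_val["group"].value_counts(normalize=True),
})
print(shares.round(3))
```

The three columns should agree closely, because `stratify` allocates each group to the two sets in proportion to its size.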


In [None]:
# Loading data set 
df = pd.read_csv('data/compas.csv')
df.head()

In [None]:
# Dividing the data set between X(Predictor) and y(Response) variable

X = df.drop(columns=['score_text','two_year_recid', 'c_charge_desc'])
y = df[['score_text','two_year_recid']]

In [None]:
# Encoding the categorical data

X_encoding = {'race': {'Caucasian': 0, 'African-American': 1, "Other": 2, "Hispanic": 3, "Native American": 4, "Asian" : 5},
              'c_charge_degree' : {"M": 0, "F": 1}}
X_decoding = {'race': {0: 'Caucasian', 1: 'African-American', 2: "Other", 3: "Hispanic", 4: "Native American", 5 : "Asian"},
              'c_charge_degree' : {0: "M", 1: "F"}}
y_encoding = {'score_text': {"Low": 0, "Medium": 1, "High": 2}}
y_decoding = {'score_text':{0: "Low", 1: "Medium", 2: "High"}}


X = X.replace(X_encoding)
y = y.replace(y_encoding)
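
The encoding and decoding dictionaries above are inverses of each other. As a hedged alternative sketch on a hypothetical toy column, the same round trip can be written with `Series.map`, deriving the decoding dictionary from the encoding one instead of typing both by hand. Note one behavioural difference: `replace` leaves unmapped values unchanged, while `map` turns them into `NaN`, so the sketch checks for missing values explicitly.

```python
import pandas as pd

# Hypothetical toy column, not read from compas.csv
toy = pd.DataFrame({"c_charge_degree": ["M", "F", "F", "M"]})

encoding = {"M": 0, "F": 1}
# Build the inverse mapping from the encoding itself
decoding = {code: label for label, code in encoding.items()}

encoded = toy["c_charge_degree"].map(encoding)
decoded = encoded.map(decoding)

# `map` yields NaN for categories missing from the dictionary
assert encoded.notna().all()

print(encoded.tolist())
print(decoded.tolist())
```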


In [None]:
# Splitting the data into training and validation sets

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, stratify=X['race'], random_state=11)

In [None]:
# Creating a subset of the dataframe containing only the relevant columns

aa_cc = X[['race', 'age', 'priors_count', 'sex', 'length_of_stay_thresh']]

# Dataframes containing data for the relevant races only
aa = aa_cc[aa_cc['race'] == 1]
cc = aa_cc[aa_cc['race'] == 0]

# Plotting the desired graphs

fig, axis = plt.subplots(2, 2, figsize=(18,12))

axis[0,0].hist(aa['age'], alpha= 0.5)
axis[0,0].hist(cc['age'], alpha= 0.5)
axis[0,0].set_title("Age distribution for African-Americans and Caucasians")
axis[0,0].set_xlabel("Age")
axis[0,0].set_ylabel("Number of people")
axis[0,0].legend(['African-Americans', 'Caucasians'])

axis[0,1].hist(aa['priors_count'], alpha= 0.5)
axis[0,1].hist(cc['priors_count'], alpha= 0.5)
axis[0,1].set_title("Prior Counts distribution for African-Americans and Caucasians")
axis[0,1].set_xlabel("Priors Counts")
axis[0,1].set_ylabel("Number of people")
axis[0,1].legend(['African-Americans', 'Caucasians'])

axis[1,0].hist(aa['sex'].map({0:'Female', 1:'Male'}), alpha= 0.5)
axis[1,0].hist(cc['sex'].map({0:'Female', 1:'Male'}), alpha= 0.5)
axis[1,0].set_title("Sex distribution for African-Americans and Caucasians")
axis[1,0].set_xlabel("Sex")
axis[1,0].set_ylabel("Number of people")
axis[1,0].legend(['African-Americans', 'Caucasians'])

axis[1,1].hist(aa['length_of_stay_thresh'], alpha= 0.5)
axis[1,1].hist(cc['length_of_stay_thresh'], alpha= 0.5)
axis[1,1].set_title("Stay distribution for African-Americans and Caucasians")
axis[1,1].set_xlabel("Length of Stay")
axis[1,1].set_ylabel("Number of people")
axis[1,1].legend(['African-Americans', 'Caucasians'])


plt.suptitle('Distribution of African-American and Caucasian population over different parameters',fontsize=20)

plt.show()    


<b>
Building a logistic regression model to predict recidivism (two_year_recid) on these data, with race included as a predictor.
</b>

In [None]:
# Initializing the logistic regression object

lreg = LogisticRegression(random_state=66, max_iter=20000)

# Redefining y for this question
y_new_train = y_train['two_year_recid']
y_new_val   = y_val['two_year_recid']

# Fitting the model

lreg.fit(X_train, y_new_train)

# Making prediction 

y_train_pred = lreg.predict(X_train)
y_val_pred   = lreg.predict(X_val)


In [None]:
# Calculating the accuracy of the model

accuracy = accuracy_score(y_new_val, y_val_pred)
print(f"The overall accuracy of the model is : {accuracy * 100:.4f} %")


In [None]:
# To report further findings we will separate the data by race

# African-American data
y_val_AA = y_new_val[X_val['race'] == 1]
y_val_pred_AA = y_val_pred[X_val['race'] == 1]

# Caucasian data

y_val_CC = y_new_val[X_val['race'] == 0]
y_val_pred_CC = y_val_pred[X_val['race'] == 0]


In [None]:
# Calculating True Negative False Negative True Positive and False Positive

# For African-Americans
AA_confusion_matrix = confusion_matrix(y_val_AA, y_val_pred_AA)

TN_AA = AA_confusion_matrix[0][0]
FN_AA = AA_confusion_matrix[1][0]
TP_AA = AA_confusion_matrix[1][1]
FP_AA = AA_confusion_matrix[0][1]

# True Positive Rate
TPR_AA = TP_AA/(TP_AA+FN_AA)
# True Negative Rate
TNR_AA = TN_AA/(TN_AA+FP_AA) 
# False Positive Rate
FPR_AA = FP_AA/(FP_AA+TN_AA)
# False Negative Rate
FNR_AA = FN_AA/(TP_AA+FN_AA)


# For Caucasians
CC_confusion_matrix = confusion_matrix(y_val_CC, y_val_pred_CC)

TN_CC = CC_confusion_matrix[0][0]
FN_CC = CC_confusion_matrix[1][0]
TP_CC = CC_confusion_matrix[1][1]
FP_CC = CC_confusion_matrix[0][1]

# True Positive Rate
TPR_CC = TP_CC/(TP_CC+FN_CC)
# True Negative Rate
TNR_CC = TN_CC/(TN_CC+FP_CC) 
# False Positive Rate
FPR_CC = FP_CC/(FP_CC+TN_CC)
# False Negative Rate
FNR_CC = FN_CC/(TP_CC+FN_CC)


# Ratio of False Positive Rates and False Negative Rates between African-Americans and Caucasians

FPR_ratio = FPR_AA/FPR_CC 
FNR_ratio = FNR_AA/FNR_CC

# Printing Results
print(f"The False Positive Rate for African-Americans is : {FPR_AA*100:.4f} %")
print(f"The False Positive Rate for Caucasians is : {FPR_CC*100:.4f} %")
print(f"The False Negative Rate for African-Americans is : {FNR_AA*100:.4f} %")
print(f"The False Negative Rate for Caucasians is : {FNR_CC*100:.4f} %")
print(f"The ratio of False Positive rates between African-Americans and Caucasians is {FPR_ratio:.4f}")
print(f"The ratio of False Negative rates between African-Americans and Caucasians is {FNR_ratio:.4f}")
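
The per-group rates above follow the same recipe for each race: filter the rows, build a confusion matrix, and read off $FPR = FP/(FP+TN)$ and $FNR = FN/(FN+TP)$. A minimal sketch of that recipe as a reusable function is shown below. The helper name `group_error_rates` and the toy arrays are hypothetical and only illustrate the arithmetic. Passing `labels=[0, 1]` keeps the matrix 2x2 even if a group happens to contain only one class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def group_error_rates(y_true, y_pred, mask):
    """Return (FPR, FNR) computed on the rows where mask is True."""
    mask = np.asarray(mask)
    y_true = np.asarray(y_true)[mask]
    y_pred = np.asarray(y_pred)[mask]
    # Rows are true labels, columns are predictions: [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return fp / (fp + tn), fn / (fn + tp)

# Hypothetical toy data: first six rows are group 1, last six are group 0
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1])
y_pred = np.array([1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0])
group = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

fpr_1, fnr_1 = group_error_rates(y_true, y_pred, group == 1)
fpr_0, fnr_0 = group_error_rates(y_true, y_pred, group == 0)

print(f"Group 1: FPR = {fpr_1:.2f}, FNR = {fnr_1:.2f}")
print(f"Group 0: FPR = {fpr_0:.2f}, FNR = {fnr_0:.2f}")
print(f"FPR ratio = {fpr_1 / fpr_0:.2f}, FNR ratio = {fnr_1 / fnr_0:.2f}")
```

In the toy data, group 1 has two false positives out of four true negatives-or-positives in class 0 (FPR 0.50) against one out of four for group 0 (FPR 0.25), so the FPR ratio is 2.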

<b>
Refitting the logistic regression model, this time without race as a predictor.
</b>

In [None]:
# Initializing the logistic regression object

lreg = LogisticRegression(random_state=66, max_iter=20000)

# Redefining X for this question
# y_new_val remains same

X_new_train = X_train.drop(columns='race')
X_new_val   = X_val.drop(columns='race')


# Fitting the model

lreg.fit(X_new_train, y_new_train)

# Making prediction 

y_train_pred = lreg.predict(X_new_train)
y_val_pred   = lreg.predict(X_new_val)

# Calculating the accuracy of the model

accuracy = accuracy_score(y_new_val, y_val_pred)
print(f"The overall accuracy of the model is : {accuracy * 100:.4f} %")


In [None]:
# To report further findings we will separate the data by race

# African-American data
X_val_AA = X_new_val[X_val['race'] == 1]
y_val_AA = y_new_val[X_val['race'] == 1]
y_val_pred_AA = y_val_pred[X_val['race'] == 1]

# Caucasian data
X_val_CC = X_new_val[X_val['race'] == 0]
y_val_CC = y_new_val[X_val['race'] == 0]
y_val_pred_CC = y_val_pred[X_val['race'] == 0]


In [None]:
# Calculating True Negative False Negative True Positive and False Positive

# For African-Americans
AA_confusion_matrix = confusion_matrix(y_val_AA, y_val_pred_AA)

TN_AA = AA_confusion_matrix[0][0]
FN_AA = AA_confusion_matrix[1][0]
TP_AA = AA_confusion_matrix[1][1]
FP_AA = AA_confusion_matrix[0][1]

# True Positive Rate
TPR_AA = TP_AA/(TP_AA+FN_AA)
# True Negative Rate
TNR_AA = TN_AA/(TN_AA+FP_AA) 
# False Positive Rate
FPR_AA = FP_AA/(FP_AA+TN_AA)
# False Negative Rate
FNR_AA = FN_AA/(TP_AA+FN_AA)


# For Caucasians
CC_confusion_matrix = confusion_matrix(y_val_CC, y_val_pred_CC)

TN_CC = CC_confusion_matrix[0][0]
FN_CC = CC_confusion_matrix[1][0]
TP_CC = CC_confusion_matrix[1][1]
FP_CC = CC_confusion_matrix[0][1]

# True Positive Rate
TPR_CC = TP_CC/(TP_CC+FN_CC)
# True Negative Rate
TNR_CC = TN_CC/(TN_CC+FP_CC) 
# False Positive Rate
FPR_CC = FP_CC/(FP_CC+TN_CC)
# False Negative Rate
FNR_CC = FN_CC/(TP_CC+FN_CC)


# Ratio of False Positive Rates and False Negative Rates between African-Americans and Caucasians

FPR_ratio = FPR_AA/FPR_CC 
FNR_ratio = FNR_AA/FNR_CC

# Printing Results
print(f"The False Positive Rate for African-Americans is : {FPR_AA*100:.4f} %")
print(f"The False Positive Rate for Caucasians is : {FPR_CC*100:.4f} %")
print(f"The False Negative Rate for African-Americans is : {FNR_AA*100:.4f} %")
print(f"The False Negative Rate for Caucasians is : {FNR_CC*100:.4f} %")
print(f"The ratio of False Positive rates between African-Americans and Caucasians is {FPR_ratio:.4f}")
print(f"The ratio of False Negative rates between African-Americans and Caucasians is {FNR_ratio:.4f}")

<b>
Now using the logistic regression model from above (fitted without race) to plot the Receiver Operating Characteristic curve for two races, African-Americans and Caucasians.
</b>

In [None]:
# Plotting ROC Curve

# For African-American
y_pred_proba_AA = lreg.predict_proba(X_val_AA)[:, 1]
FPR_AA, TPR_AA, threshold_AA = roc_curve(y_val_AA,  y_pred_proba_AA)
auc_AA = roc_auc_score(y_val_AA, y_pred_proba_AA)



# For Caucasian
y_pred_proba_CC = lreg.predict_proba(X_val_CC)[:, 1]
FPR_CC, TPR_CC, threshold_CC = roc_curve(y_val_CC,  y_pred_proba_CC)
auc_CC = roc_auc_score(y_val_CC, y_pred_proba_CC)

plt.figure(figsize=(7, 6))
plt.plot(FPR_AA, TPR_AA, label=f"African-American ROC curve (area = {auc_AA:.2f})")
plt.plot(FPR_CC, TPR_CC, label=f"Caucasian ROC curve (area = {auc_CC:.2f})")
plt.plot([0, 1], [0, 1], color="green", linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for African-Americans and Caucasians")
plt.legend()
plt.show()
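
Each point on an ROC curve is the (FPR, TPR) pair obtained by predicting the positive class whenever the predicted probability is at or above one threshold. The sketch below, on hypothetical toy scores rather than the COMPAS data, recomputes those pairs by hand and compares them with what `roc_curve` returns.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical toy labels and predicted probabilities (not from compas.csv)
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, thresholds = roc_curve(y_true, scores)

def rates_at(threshold):
    """Return (FPR, TPR) when predicting 1 for scores >= threshold."""
    pred = scores >= threshold
    fp = np.sum(pred & (y_true == 0))
    tp = np.sum(pred & (y_true == 1))
    return fp / np.sum(y_true == 0), tp / np.sum(y_true == 1)

# Skip the first threshold, which sklearn adds so the curve starts at (0, 0)
manual = np.array([rates_at(t) for t in thresholds[1:]])

# Columns: threshold, manual FPR, manual TPR
print(np.column_stack([thresholds[1:], manual]))
```

Lowering the threshold can only add positive predictions, so both rates are non-decreasing along the curve.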

**Building a Decision Tree model and finding the best depth**

In [None]:
# Creating a list of depth values up to a max depth
max_depth = 10
depth = list(np.arange(1,max_depth + 1))


In [None]:
# Initialize empty lists to store error
validation_error, training_error = [], []

for x in depth:
    # Initialize Classifier
    clf = tree.DecisionTreeClassifier(max_depth=x, random_state=x)

    # Cross-Validating
    mse_score = cross_validate(clf, X_train, y_new_train,cv=5, scoring="neg_mean_squared_error")
    validation_error.append(-1*(np.mean(mse_score['test_score'])))
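
The loop above scores a classifier with `neg_mean_squared_error`. For labels coded as 0 and 1, each squared error is either 0 or 1, so the mean squared error equals the misclassification rate, $1 - \text{accuracy}$. A small check on hypothetical toy labels:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Hypothetical toy labels and predictions, two mismatches out of eight
y_true = np.array([0, 1, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1, 1, 0, 1])

mse = mean_squared_error(y_true, y_pred)
error_rate = 1 - accuracy_score(y_true, y_pred)

print(f"MSE = {mse:.3f}, misclassification rate = {error_rate:.3f}")
```

The depth with the lowest mean validation error is therefore also the depth with the highest mean validation accuracy.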
    

In [None]:
plt.figure(figsize=(12, 6))
plt.scatter(depth, validation_error)
plt.xlabel("Max Depth of the Decision Tree")
plt.ylabel("Mean Validation Error")
plt.title("Mean Error Across Max-Depth Levels of Decision Tree")
plt.show()

In [None]:
min_val_mse = min(validation_error)
best_depth_index = validation_error.index(min_val_mse)
best_depth = depth[best_depth_index]
print(f"The best depth value came out to be : {best_depth}")
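
The same depth search can also be expressed with `GridSearchCV`, which runs the cross-validation loop and keeps track of the best parameter. This is a hedged alternative sketch, not the approach used above: it runs on synthetic data from `make_classification` and scores by accuracy rather than by error.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption for illustration only)
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=0
)

# Try max_depth = 1..10 with 5-fold cross-validation
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

print(f"Best depth: {search.best_params_['max_depth']}")
print(f"Mean CV accuracy at best depth: {search.best_score_:.4f}")
```

By default `GridSearchCV` refits the best model on all the data passed to `fit`, so `search.best_estimator_` can be used for prediction directly.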

In [None]:
# Evaluating model performance at best depth
clf = tree.DecisionTreeClassifier(max_depth=best_depth)
clf.fit(X_train, y_new_train)
y_val_pred = clf.predict(X_val)

accuracy = accuracy_score(y_new_val, y_val_pred)

print(f"The accuracy of the model is : {accuracy*100:.4f} %")

In [None]:
# Using sklearn.tree and pydot to convert our Decision Tree to a png file
tree.export_graphviz(clf,out_file="tree.dot",filled = True)
(graph,) = pydot.graph_from_dot_file('tree.dot')
graph.write_png('Decision_Tree.png')