# **COMPAS Database:**
Many actors in the U.S. justice system (such as judges, probation officers, and parole officers) use algorithms to predict whether a convicted criminal is likely to re-offend. One of the most popular algorithms used nationwide is **COMPAS** (Correctional Offender Management Profiling for Alternative Sanctions), which has been increasingly used in pretrial and sentencing decisions.

Depending on the scores generated by this software, the judge can decide whether to detain the defendant prior to trial, and can take the scores into account when sentencing.
<br>
Because of its major role in sentencing, we would like to assess the underlying accuracy of the algorithm and test whether it is biased against certain groups.

# **How does COMPAS work?**
When defendants are booked in jail, they respond to a COMPAS questionnaire. Their answers are fed into the COMPAS software that generates several scores including predictions of “Risk of Recidivism” and “Risk of Violent Recidivism.”

# **Importing:**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display
from sklearn.model_selection import train_test_split
from google.colab import files
import io
import plotly.graph_objs as go  
import plotly.tools as tls 
import plotly.offline as py 
import plotly.express as px
from sklearn.linear_model import LogisticRegression
pd.options.mode.chained_assignment = None  # default='warn'
from scipy.sparse import coo_matrix  # the scipy.sparse.coo submodule path is deprecated
from patsy import dmatrices
from sklearn import (linear_model, metrics, neural_network, pipeline, preprocessing, model_selection)

In [None]:
dataURL = 'https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years.csv'
data = pd.read_csv(dataURL)


# **Data description:**
ProPublica obtained pretrial defendants' COMPAS scores from the Broward County Sheriff’s Office in Florida for 2013 – 2014.
Each pretrial defendant received at least three COMPAS scores, each ranging from 1 to 10, with ten being the highest risk:
1. **decile_score**- Risk of recidivism
2. **v_decile_score**- Risk of violence
3. Risk of Failure to Appear
<br>

We are also provided with two category-based evaluations, labeled **“High”** (8 – 10), **“Medium”** (5 – 7) and **“Low”** (1 – 4):
1. **score_text**-  Risk of recidivism category
2. **v_score_text**- Risk of violence category
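
As a minimal sketch of the decile-to-category mapping described above, the snippet below bins a few made-up decile scores with `pd.cut`. The `deciles` values are hypothetical and only illustrate the thresholds; the dataset already ships the categories in `score_text` and `v_score_text`.

```python
import pandas as pd

# Hypothetical decile scores, used only to illustrate the binning.
deciles = pd.Series([1, 4, 5, 7, 8, 10])

# Bin edges are right-inclusive: (0, 4] -> Low, (4, 7] -> Medium, (7, 10] -> High.
categories = pd.cut(deciles, bins=[0, 4, 7, 10], labels=["Low", "Medium", "High"])
print(categories.tolist())
```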


**days_b_screening_arrest**- number of days between the arrest and the COMPAS assessment

**c_charge_degree**- the degree of the charge

**priors_count**- number of prior offences

**is_recid**- recorded outcome flag: whether the defendant reoffended (1 = yes, 0 = no; rows with -1 are filtered out below)

**two_year_recid**- whether the defendant actually reoffended within a two-year period

**is_violent_recid**- recorded outcome flag: whether the defendant committed a violent reoffence (1 = yes, 0 = no)

**juv_misd_count**- number of juvenile misdemeanor crimes

**juv_fel_count**- number of juvenile felony crimes

**juv_other_count**- number of juvenile crimes with a degree other than misdemeanor or felony



# **Data preprocessing:**
We filtered the underlying data from Broward County to include only those rows representing people who had either recidivated within two years, or had spent at least two years outside of a correctional facility.

1. To match COMPAS scores with accompanying cases, we considered cases with arrest dates or charge dates **within 30 days of a COMPAS assessment** being conducted
2. We did not count traffic tickets and some municipal ordinance violations as recidivism because there is no jail time.
3. To determine whether a person had been charged with a new crime after the crime for which they were COMPAS screened, we did not count arrests for failing to appear at court hearings, or later charges for crimes that occurred before the COMPAS screening.

In [None]:
df = (data
      .loc[(data['days_b_screening_arrest'] <= 30) & (data['days_b_screening_arrest'] >= -30), :]
      .loc[data['is_recid'] != -1, :]
      .loc[data['c_charge_degree'] != 'O', :])
df.reset_index(inplace = True)
df=df[['age', 'c_charge_degree', 'race', 'age_cat', 'score_text', 'sex', 'priors_count', 'days_b_screening_arrest', 'decile_score', 'is_recid', 'two_year_recid', 'c_jail_in', 'c_jail_out','is_violent_recid','v_decile_score', 'v_score_text','juv_misd_count', 'juv_other_count','juv_fel_count']]
df.head()

Unnamed: 0,age,c_charge_degree,race,age_cat,score_text,sex,priors_count,days_b_screening_arrest,decile_score,is_recid,two_year_recid,c_jail_in,c_jail_out,is_violent_recid,v_decile_score,v_score_text,juv_misd_count,juv_other_count,juv_fel_count
0,69,F,Other,Greater than 45,Low,Male,0,-1.0,1,0,0,2013-08-13 06:03:42,2013-08-14 05:41:20,0,1,Low,0,0,0
1,34,F,African-American,25 - 45,Low,Male,0,-1.0,3,1,1,2013-01-26 03:45:27,2013-02-05 05:36:53,1,1,Low,0,0,0
2,24,F,African-American,Less than 25,Low,Male,4,-1.0,4,1,1,2013-04-13 04:58:34,2013-04-14 07:02:04,0,3,Low,0,1,0
3,44,M,Other,25 - 45,Low,Male,0,0.0,1,0,0,2013-11-30 04:50:18,2013-12-01 12:28:56,0,1,Low,0,0,0
4,41,F,Caucasian,25 - 45,Medium,Male,14,-1.0,6,1,1,2014-02-18 05:08:24,2014-02-24 12:18:30,0,2,Low,0,0,0


We assume that the dates of incarceration and release themselves matter less for the model's prediction than the actual time spent in jail, since the length of the incarceration may be indicative of the risk.

In [None]:
# Time spent in jail, in whole days, from the booking and release timestamps
jail_in = pd.to_datetime(df['c_jail_in'])
jail_out = pd.to_datetime(df['c_jail_out'])
df['days_in_jail'] = (jail_out - jail_in).dt.days

# The raw timestamps are no longer needed once the duration is computed
df = df.drop(["c_jail_in", "c_jail_out"], axis=1)

# **Visualization**:

In [None]:
races = [race for race in df['race'].unique()]

fig = go.Figure()
fig.add_trace(go.Bar(
    x=races,
    y=[df[df['race']==race].shape[0] for race in races],
    marker_color='gray'
))

fig.update_layout(barmode='group',yaxis_title="number of people", title="Race Distribution",title_x=0.5,width=700, height=500)
fig.show()

We can see that Caucasian and African-American defendants make up 85.5% of the data, so we will focus on these 2 groups, where the sample sizes are large enough to give more reliable results.

In [None]:
val_nv,count_nv=np.unique(df['score_text'].values, return_counts=True)
val_v,count_v=np.unique(df['v_score_text'].values, return_counts=True)

fig = go.Figure()
fig.add_trace(go.Bar(
            x =list(val_nv),
            y = list(count_nv),
            marker_color='black',
            name='non violent'))
fig.add_trace(go.Bar(
            x =list(val_v),
            y = list(count_v),
            marker_color='gray',
            name='violent'))

fig.update_layout(barmode='group',yaxis_title="number of people", title="Non-Violent and Violent Recidivism Score Distribution",title_x=0.5,width=700, height=500)
fig.show()


We can see that most of the scores fall in the 'LOW' category. Furthermore, according to Northpointe’s practitioners guide: *“scores in the medium and high range garner more interest from supervision agencies than low scores, as a low score would suggest there is little risk of general recidivism,”*

so we will consider scores that are higher than 'LOW' to indicate a risk of recidivism.

# **Bias in the data:**




Create dummy variables for the prediction.

In [None]:
cat = ['score_text','age_cat','sex','race','c_charge_degree','v_score_text']
df.loc[:,cat] = df.loc[:,cat].astype('category')

dfDum = pd.get_dummies(data = df, columns=cat)

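# Normalize column names: trim, lowercase, and replace spaces and hyphens with underscores
new_column_names = [col.strip().lower().replace(" ", "_").replace("-", "_") for col in dfDum.columns]
dfDum.columns = new_column_names

# Merge Medium into High so that the *_high columns flag any score above Low
dfDum['score_text_high'] = dfDum['score_text_medium'] + dfDum['score_text_high']
dfDum['v_score_text_high'] = dfDum['v_score_text_medium'] + dfDum['v_score_text_high']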


## **Risk of Recidivism**

In [None]:
text_by_race = df.groupby(['race', 'score_text'], sort=True).size().reset_index()
text_by_race = text_by_race.rename(columns={0:'count'})
text_by_race[['count']] = text_by_race[['count']].apply(pd.to_numeric)
gb = df.groupby(['race']).size().reset_index()
gb_d = gb.set_index('race').to_dict().get(0)
text_by_race['count_percentage'] = text_by_race.apply(lambda x: ((int(x['count'])/int(gb_d.get(x['race'])))), axis=1)
races = ['African-American', 'Caucasian']
levels=['High','Medium','Low']
fig = go.Figure()
fig.add_trace(go.Bar(
    x=levels,
    y=[text_by_race[(text_by_race['race']=='African-American') & (text_by_race['score_text']==level)]['count_percentage'].values[0] for level in levels],
    name='African-American',
    marker_color='black'
))
fig.add_trace(go.Bar(
    x=levels,
    y=[text_by_race[(text_by_race['race']=='Caucasian') & (text_by_race['score_text']==level)]['count_percentage'].values[0] for level in levels],
    name='Caucasian',
    marker_color='gray'
))

fig.update_layout(barmode='group',yaxis_title="percentage of people", title="Predicted Recidivism Score Distribution Across Races",title_x=0.5,width=700, height=500)
fig.show()

We can see that 26.6% of African-Americans received a “HIGH” score whereas only 11% of Caucasians received a similar score, meaning that the rate of receiving a “HIGH” score for African-Americans is **about 2.4** times that of Caucasians.
<br> <br>
In addition, we can see that 67% of Caucasians received a "LOW" score whereas only 42.3% of African-Americans received a similar score, meaning that the rate of receiving a "LOW" score for Caucasians is **about 1.6** times that of African-Americans.
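
The ratios quoted above are simple quotients of the group shares. A quick check, using only the percentages stated in the text (the variable names are illustrative):

```python
# Shares quoted in the text above (percent of each group).
high_aa, high_c = 26.6, 11.0   # "HIGH" score: African-American, Caucasian
low_aa, low_c = 42.3, 67.0     # "LOW" score: African-American, Caucasian

high_ratio = high_aa / high_c  # African-American vs. Caucasian rate of "HIGH"
low_ratio = low_c / low_aa     # Caucasian vs. African-American rate of "LOW"
print(round(high_ratio, 2), round(low_ratio, 2))
```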



To test the differences in the score distribution between the races, we fit a logistic regression model that considers race, criminal history, future recidivism, charge degree, gender and age.

In [None]:
X=dfDum[['sex_female', 'age_cat_greater_than_45', 'age_cat_less_than_25','race_african_american', 'race_asian' ,'race_hispanic' ,'race_native_american' ,'race_other' ,'priors_count' ,'c_charge_degree_m' ,'two_year_recid']]
y = dfDum["score_text_high"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model=LogisticRegression()
model.fit(X, y)

# Baseline probability of a Medium/High score when every predictor is 0
# (the reference group), obtained from the intercept
intercept = model.intercept_[0]
control = np.exp(intercept) / (1 + np.exp(intercept))

# Map each feature name to its fitted coefficient
dct = dict(zip(model.feature_names_in_, model.coef_[0]))

# Convert each odds ratio into a relative change in probability versus the baseline
p_higher_than_white = {}
for name in X.columns:
    odds_ratio = np.exp(dct[name])
    p_higher_than_white[name] = odds_ratio / (1 - control + control * odds_ratio) - 1
    print(name + ':{:.2f}% '.format(p_higher_than_white[name] * 100))




sex_female:19.24% 
age_cat_greater_than_45:-69.98% 
age_cat_less_than_25:148.51% 
race_african_american:45.31% 
race_asian:-15.96% 
race_hispanic:-30.08% 
race_native_american:94.46% 
race_other:-50.42% 
priors_count:23.96% 
c_charge_degree_m:-22.97% 
two_year_recid:68.46% 


**We can see that black defendants were 45.31% more likely to get a higher score than whites.**
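
The percentages above come from turning each logistic-regression coefficient into a relative change in probability with respect to the baseline group. A minimal sketch of that conversion follows; `relative_change` is a hypothetical helper name, and the coefficient and intercept passed to it are made-up values, not the ones fitted in this notebook. With baseline probability $p_0$ and odds ratio $OR = e^{\beta}$, the ratio of probabilities is $OR / (1 - p_0 + p_0 \cdot OR)$.

```python
import numpy as np

def relative_change(coef, intercept):
    """Relative change in predicted probability versus the baseline group."""
    baseline = 1 / (1 + np.exp(-intercept))   # P(higher score) when all predictors are 0
    odds_ratio = np.exp(coef)
    risk_ratio = odds_ratio / (1 - baseline + baseline * odds_ratio)
    return risk_ratio - 1

# Hypothetical coefficient and intercept, for illustration only.
print(f"{relative_change(0.5, -1.5):.2%}")
```

The result equals the predicted probability with the predictor switched on divided by the baseline probability, minus one, which is why it can be read as "x% more likely".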

## **Risk of Violent Recidivism:**

Let's look at the defendants who either had not been arrested for a new offense or had recidivated violently within two years.

In [None]:
dataURL2 = 'https://raw.githubusercontent.com/propublica/compas-analysis/master/compas-scores-two-years-violent.csv'
data2 = pd.read_csv(dataURL2)
df_violent = (data2
      # Build the masks from data2 itself; masks taken from `data` describe a different table
      .loc[(data2['days_b_screening_arrest'] <= 30) & (data2['days_b_screening_arrest'] >= -30), :]
      .loc[data2['is_recid'] != -1, :]
      .loc[data2['c_charge_degree'] != 'O', :]
      .loc[data2['v_score_text'] != 'N/A', :])
df_violent.reset_index(inplace = True)
df_violent=df_violent[['age', 'c_charge_degree', 'race', 'age_cat', 'score_text', 'sex', 'priors_count', 'days_b_screening_arrest', 'decile_score', 'is_recid', 'two_year_recid', 'c_jail_in', 'c_jail_out','is_violent_recid','v_decile_score', 'v_score_text','juv_misd_count', 'juv_other_count','juv_fel_count']]
df_violent.head()


Unnamed: 0,age,c_charge_degree,race,age_cat,score_text,sex,priors_count,days_b_screening_arrest,decile_score,is_recid,two_year_recid,c_jail_in,c_jail_out,is_violent_recid,v_decile_score,v_score_text,juv_misd_count,juv_other_count,juv_fel_count
0,69,F,Other,Greater than 45,Low,Male,0,-1.0,1,0,0,2013-08-13 06:03:42,2013-08-14 05:41:20,0,1,Low,0,0,0
1,34,F,African-American,25 - 45,Low,Male,0,-1.0,3,1,1,2013-01-26 03:45:27,2013-02-05 05:36:53,1,1,Low,0,0,0
2,23,F,African-American,Less than 25,High,Male,1,,8,0,0,,,0,6,Medium,1,0,0
3,43,F,Other,25 - 45,Low,Male,3,-1.0,4,0,0,2013-08-29 08:55:23,2013-08-30 08:42:13,0,3,Low,0,0,0
4,39,M,Caucasian,25 - 45,Low,Female,0,-1.0,1,0,0,2014-03-15 05:35:34,2014-03-18 04:28:46,0,1,Low,0,0,0


In [None]:
cat = ['score_text','age_cat','sex','race','c_charge_degree','v_score_text']
df_violent.loc[:,cat] = df_violent.loc[:,cat].astype('category')

dfDum = pd.get_dummies(data = df_violent, columns=cat)

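# Normalize column names: trim, lowercase, and replace spaces and hyphens with underscores
new_column_names = [col.strip().lower().replace(" ", "_").replace("-", "_") for col in dfDum.columns]
dfDum.columns = new_column_names

# Merge Medium into High so that the *_high columns flag any score above Low
dfDum['score_text_high'] = dfDum['score_text_medium'] + dfDum['score_text_high']
dfDum['v_score_text_high'] = dfDum['v_score_text_medium'] + dfDum['v_score_text_high']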

In [None]:
text_by_race = df_violent.groupby(['race', 'v_score_text'], sort=True).size().reset_index()
text_by_race = text_by_race.rename(columns={0:'count'})
text_by_race[['count']] = text_by_race[['count']].apply(pd.to_numeric)
gb = df_violent.groupby(['race']).size().reset_index()
gb_d = gb.set_index('race').to_dict().get(0)
text_by_race['count_percentage'] = text_by_race.apply(lambda x: ((int(x['count'])/int(gb_d.get(x['race'])))), axis=1)
races = ['African-American', 'Caucasian']
levels=['High','Medium','Low']
fig = go.Figure()
fig.add_trace(go.Bar(
    x=levels,
    y=[text_by_race[(text_by_race['race']=='African-American') & (text_by_race['v_score_text']==level)]['count_percentage'].values[0] for level in levels],
    name='African-American',
    marker_color='black'
))
fig.add_trace(go.Bar(
    x=levels,
    y=[text_by_race[(text_by_race['race']=='Caucasian') & (text_by_race['v_score_text']==level)]['count_percentage'].values[0] for level in levels],
    name='Caucasian',
    marker_color='gray'
))

fig.update_layout(barmode='group',yaxis_title="percentage of people", title="Predicted Violence Score Distribution Across Races",title_x=0.5,width=700, height=500)
fig.show()

We can see that 12.02% of African-Americans received a “HIGH” violence score whereas only 3.73% of Caucasians received a similar score, meaning that the rate of receiving a “HIGH” violence score for African-Americans is **about 3** times that of Caucasians.
<br> <br>
In addition, we can see that 82.4% of Caucasians received a "LOW" violence score whereas only 60.46% of African-Americans received a similar score, meaning that the rate of receiving a "LOW" violence score for Caucasians is **about 1.36** times that of African-Americans.


In [None]:
X=dfDum[['sex_female', 'age_cat_greater_than_45', 'age_cat_less_than_25','race_african_american', 'race_asian' ,'race_hispanic' ,'race_native_american' ,'race_other' ,'priors_count' ,'c_charge_degree_m' ,'two_year_recid']]
y = dfDum["v_score_text_high"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model=LogisticRegression()
model.fit(X, y)

# Baseline probability of a Medium/High violence score when every predictor is 0
intercept = model.intercept_[0]
control = np.exp(intercept) / (1 + np.exp(intercept))

# Map each feature name to its fitted coefficient
dct = dict(zip(model.feature_names_in_, model.coef_[0]))

# Convert each odds ratio into a relative change in probability versus the baseline
p_higher_than_white = {}
for name in X.columns:
    odds_ratio = np.exp(dct[name])
    p_higher_than_white[name] = odds_ratio / (1 - control + control * odds_ratio) - 1
    print(name + ':{:.2f}% '.format(p_higher_than_white[name] * 100))



sex_female:-43.24% 
age_cat_greater_than_45:-80.46% 
age_cat_less_than_25:569.77% 
race_african_american:96.39% 
race_asian:-53.65% 
race_hispanic:7.63% 
race_native_american:40.32% 
race_other:-14.75% 
priors_count:13.85% 
c_charge_degree_m:-13.95% 
two_year_recid:97.14% 


**Black defendants were 96.39% more likely than white defendants to receive a higher score, correcting for criminal history and future violent recidivism.**

# **Bias in the algorithm:**

We are interested in how well the COMPAS scores predict recidivism and how their predictive ability depends on race.

To help us evaluate the model's performance, we will use the confusion matrix, which cross-tabulates predicted outcomes against actual outcomes.
<img src='https://www.auditingalgorithms.net/media/Table4-3_draft3.png' />
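
As a small illustration of the rates read off a confusion matrix, the sketch below counts the four cells for a handful of made-up labels and derives the rates used later in this notebook. The arrays `y_true` and `y_pred` are hypothetical; here 1 means "recidivated" (or "predicted high risk") and 0 means the opposite.

```python
import numpy as np

# Hypothetical labels, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # actual outcome
y_pred = np.array([0, 0, 0, 1, 0, 1, 1, 1])   # predicted risk (1 = high)

# Count the four cells of the confusion matrix.
tn = np.sum((y_true == 0) & (y_pred == 0))
fp = np.sum((y_true == 0) & (y_pred == 1))
fn = np.sum((y_true == 1) & (y_pred == 0))
tp = np.sum((y_true == 1) & (y_pred == 1))

fpr = fp / (fp + tn)   # P(HighRisk | NoRecid)
fnr = fn / (fn + tp)   # P(LowRisk | Recid)
ppv = tp / (tp + fp)   # P(Recid | HighRisk)
npv = tn / (tn + fn)   # P(NoRecid | LowRisk)
print(fpr, fnr, ppv, npv)
```

The first two rates condition on the actual outcome, the last two on the prediction; the two result tables below follow the same split.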






## **Training the model:**

In [None]:
groups = ["overall", "African-American", "Caucasian"]

# Row labels follow the flattened confusion matrix order: TN, FP, FN, TP
ind = ["Portion_of_NoRecid_and_LowRisk", "Portion_of_NoRecid_and_HighRisk",
    "Portion_of_Recid_and_LowRisk", "Portion_of_Recid_and_HighRisk"]

fmla = "two_year_recid ~ C(decile_score)"
y,X = dmatrices(fmla, df)

X_train, X_test, y_train, y_test, df_train, df_test = model_selection.train_test_split(X,y.reshape(-1),df, test_size=0.2, random_state=42)

decile_mod = linear_model.LogisticRegression(solver="lbfgs").fit(X_train,y_train)

## **Computing the confusion matrix rates:**

In [None]:
def cm_tables(pred, y, df):
    output = pd.DataFrame(index=ind, columns=groups)
    # for each race compute the cm values
    for group in groups:
        if group in ["African-American", "Caucasian"]:
            subset=(df.race==group)
        else:
            subset=np.full(y.shape, True)

        y_sub = y[subset]
        pred_sub = pred[subset]

        cm = metrics.confusion_matrix(y_sub, pred_sub)

        # Express each cell (TN, FP, FN, TP) as a fraction of all cases in the group
        total = cm.sum()
        vals = np.array(cm/total)
        output.loc[:, group] = vals.reshape(-1)
    
    # Compute conditional probabilities by normalizing the matrix along one axis
    def bayes_probs(col, axis):
      d=int(np.sqrt(len(col)))
      pcm = np.array(col).reshape(d,d)
      pcm = pcm/pcm.sum(axis=axis, keepdims=True)
      return(pcm.reshape(-1))
    
    # Condition on the actual outcome (rows): TNR, FPR, FNR, TPR
    given_outcome = output.copy()
    given_outcome.index = ["P(LowRisk|NoRecid)","P(HighRisk|NoRecid)","P(LowRisk|Recid)","P(HighRisk|Recid)"]
    given_outcome=given_outcome.apply(lambda c: bayes_probs(c,1))

    # Condition on the prediction (columns): NPV, FDR, FOR, PPV
    given_pred = output.copy()
    given_pred.index = ["P(NoRecid|LowRisk)","P(NoRecid|HighRisk)","P(Recid|LowRisk)","P(Recid|HighRisk)"]
    given_pred=given_pred.apply(lambda c: bayes_probs(c,0))
    return(given_outcome, given_pred)
given_outcome, given_pred =cm_tables(decile_mod.predict(X_test),y_test, df_test)

## **Results:**

In [None]:
given_outcome

Unnamed: 0,overall,African-American,Caucasian
P(LowRisk|NoRecid),0.777452,0.6875,0.840741
P(HighRisk|NoRecid),0.222548,0.3125,0.159259
P(LowRisk|Recid),0.496377,0.36129,0.663043
P(HighRisk|Recid),0.503623,0.63871,0.336957


**FNR - P(LowRisk|Recid)** is higher for Caucasians than for African-Americans (0.663043 vs 0.36129), meaning we are more likely to misclassify white defendants who recidivated as "LOW" risk than their black counterparts.
<br> <br>
**FPR - P(HighRisk|NoRecid)** is higher for African-Americans than for Caucasians (0.31250 vs 0.159259), meaning we are more likely to misclassify black defendants who did not recidivate as "HIGH" risk than their white counterparts.


In [None]:
given_pred

Unnamed: 0,overall,African-American,Caucasian
P(NoRecid|LowRisk),0.659627,0.65109,0.65043
P(NoRecid|HighRisk),0.353488,0.324232,0.409524
P(Recid|LowRisk),0.340373,0.34891,0.34957
P(Recid|HighRisk),0.646512,0.675768,0.590476


Northpointe, the company that produces COMPAS, argued that COMPAS is not biased because the probabilities of outcomes conditional on predictions (like P(NoRecid|LowRisk)) are approximately equal across races.

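As we can see, the distribution of outcomes conditional on predictions does not vary much with race: P(Recid|LowRisk) is almost identical for the two groups (0.34891 vs 0.34957).
Moreover, if anything, it discriminates in favor of African-Americans: Caucasians labeled "HIGH" risk recidivated less often than African-Americans with the same label (0.590476 vs 0.675768).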
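
The two views above (unequal error rates, yet roughly equal outcomes given predictions) are not contradictory. From the confusion matrix definitions, FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR), where p is the group's recidivism base rate. The sketch below plugs in hypothetical numbers (not taken from this dataset) to show that two groups with identical PPV and FNR but different base rates must have different FPRs; `fpr_from_ppv` is an illustrative helper name.

```python
def fpr_from_ppv(base_rate, ppv, fnr):
    """FPR implied by a base rate, PPV and FNR (follows from the confusion matrix definitions)."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Hypothetical groups sharing the same PPV and FNR but differing in base rate.
fpr_a = fpr_from_ppv(base_rate=0.5, ppv=0.6, fnr=0.4)
fpr_b = fpr_from_ppv(base_rate=0.4, ppv=0.6, fnr=0.4)
print(round(fpr_a, 3), round(fpr_b, 3))
```

Under these assumed values the group with the higher base rate ends up with the higher false positive rate, even though a "HIGH" label means the same thing in both groups.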

# **Task:**
Prepare a presentation to your team explaining the findings in this notebook intuitively. It should take no more than 5 minutes.

You should create 1-3 slides (we recommend using diagrams) AND use the notebook (or include the output/visualization in the slides).

In your presentation, please consider the following aspects:
1.	Describe the COMPAS algorithm: what does it do, who uses it, and what is it used for?
2.	Describe the data: what was predicted, and which attributes are sensitive?
3.	What is the bias in the data? Use the correct terms learned in class.
4.	What is the bias in the model? Use the correct terms learned in class.
5.	From the bias found in the data and aggregated fairness measures we saw in class, which measure should be employed in this data? Explain why.

**Remember who your audience is!** Think about the prior knowledge of your fellow team members and be mindful of the terms and jargon you use.

Submit your presentation via Gradescope [[link]](https://go.responsibly.ai/gradescope). This assignment is mandatory, but not graded.

