# DivExplorer analysis - Example with COMPAS dataset 

In [1]:
%load_ext autoreload
%autoreload 2 

This notebook shows an example of how to use DivExplorer. First, let's import the package.
We import it from its regular installation location; if you checked out the source repository,
you can also import it directly from source.

In [2]:
# The next two lines import from the source distribution rather than from
# the installed package. Comment them out to use the installed package.
import sys
sys.path.append("..")

# Import all divexplorer components. 
from divexplorer import DivergenceExplorer


# Import dataset

In this running example we analyze the [COMPAS](https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm) dataset.

We import a pre-processed and discretized version. You can refer to this [notebook](www.google.com) for its pre-processing.

The COMPAS dataset contains demographic information and the criminal history of defendants.
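For each criminal defendant, the following attributes are available (in the discretized file loaded below, the columns are named *age*, *charge*, *race*, *sex*, *stay*, and *#prior*):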
- *age_cat*: age category of the defendant. 
- *c_charge_degree*: severity of the charge. 'M'='Misdemeanor','F'='Felony' 
- *race*: 'Afr-Am', 'Asian', 'Cauc', 'Hispanic', 'Native American', 'Other'
- *sex*: 'Female', 'Male'
- *length_of_stay*: the number of days spent in jail, derived from the dates on which the person most recently entered jail and was released.
- *priors_count*: the number of prior crimes committed.


For each defendant, the dataset also records:
- the *class*, whether the defendant recidivated (was arrested again) in the next two years. It is 1 if the defendant recidivated, 0 otherwise.
- the *predicted* recidivism, derived from the COMPAS recidivism risk score, which assesses the defendant’s likelihood of committing another offense within two years. It is 1 if the risk is high, 0 otherwise.

In [3]:
def get_true_positives(y_trues, y_preds, positive_value=1):
    """Returns true positives. True if it is a true positive, false otherwise."""

    return ((y_trues == y_preds) & (y_trues == positive_value))

def get_true_negatives(y_trues, y_preds, negative_value=0):
    """Returns true negatives. True if it is a true negative, false otherwise."""

    return ((y_trues == y_preds) & (y_trues == negative_value))

def get_false_positives(y_trues, y_preds, negative_value=0):
    """Returns false positives. True if it is a false positive, false otherwise."""

    return ((y_trues != y_preds) & (y_trues == negative_value))

def get_false_negatives(y_trues, y_preds, positive_value=1):
    """Returns false negatives. True if it is a false negative, false otherwise."""

    return ((y_trues != y_preds) & (y_trues == positive_value))

def get_false_positive_rate_outcome(y_trues, y_preds, negative_value=0):
    """Returns boolean outcome for the false positive rate. 1 if it is a false positive, 0 if it is a true negative, np.nan otherwhise.
    """
    import numpy as np

    fp = np.array(get_false_positives(y_trues, y_preds, negative_value=negative_value)).astype(int)
    tn = np.array(get_true_negatives(y_trues, y_preds, negative_value=negative_value)).astype(int)

    fp_outcome = np.full(fp.shape, np.nan)
    fp_outcome[fp == 1] = 1
    fp_outcome[tn == 1] = 0


    return fp_outcome
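
To make the encoding concrete, here is a minimal sketch on a toy example. The helper `fpr_outcome` is a hypothetical, compact stand-in for `get_false_positive_rate_outcome`, and the toy arrays are made up for illustration. Only actual negatives contribute (as 1 or 0); actual positives become NaN, so the NaN-ignoring mean of the outcome is the false positive rate.

```python
import numpy as np

def fpr_outcome(y_true, y_pred, negative_value=0):
    """Hypothetical compact equivalent of get_false_positive_rate_outcome."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    outcome = np.full(y_true.shape, np.nan)
    is_negative = y_true == negative_value
    outcome[is_negative & (y_pred != y_true)] = 1  # false positive
    outcome[is_negative & (y_pred == y_true)] = 0  # true negative
    return outcome

# Toy labels (illustrative only): three actual negatives, two actual positives.
y_true = np.array([0, 0, 1, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0])

toy_outcome = fpr_outcome(y_true, y_pred)
# One false positive out of three actual negatives.
toy_fpr = np.nanmean(toy_outcome)
print(toy_outcome, toy_fpr)
```

The two actual positives are left as NaN, so they do not affect the rate.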

In [4]:
import os
inputDir=os.path.join(".", "notebooks", "datasets")

import pandas as pd
df= pd.read_csv(os.path.join(inputDir, "compas_discretized.csv"))
df.head()

Unnamed: 0,age,charge,race,sex,#prior,stay,class,predicted
0,>45,F,Other,Male,0,<week,0,0
1,25-45,F,Afr-Am,Male,0,1w-3M,1,0
2,<25,F,Afr-Am,Male,>3,<week,1,0
3,25-45,M,Other,Male,0,<week,0,0
4,25-45,F,Cauc,Male,>3,<week,1,0


In [5]:
y_trues = df["class"].values
y_preds = df["predicted"].values

In [6]:
df['fpr_outcome'] =  get_false_positive_rate_outcome(y_trues, y_preds, negative_value=0)

In [7]:
df = df.drop(columns=['class', 'predicted'])  

# Compute pattern divergence

With DivExplorer, we can compute the performance metrics and the divergence scores of *all* itemsets whose frequency in the dataset is above a minimum frequency threshold (min_sup).

To compute the divergence scores, we have to specify the following information.

DivergenceExplorer:
- **df**: discretized dataset (DataFrame), including the outcome column(s)

get_pattern_divergence method:
- min_support: the minimum frequency threshold
- boolean_outcomes: names of the columns holding boolean outcomes (list of str). In this example we specify fpr_outcome, the false positive rate outcome computed above
- show_coincise: whether to return a concise output. With True, as in this example, the output reports the average, the divergence and the t-value of each outcome


As output we obtain a DataFrame. Each row contains the information of a frequent pattern (with a support greater than or equal to min_sup) and its divergence scores.
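
To illustrate what "itemsets with support above min_sup" means, the following is a brute-force sketch on a made-up toy table. It is not how DivExplorer explores patterns; it only enumerates every attribute=value combination and keeps those whose relative frequency reaches the threshold. The function name `frequent_itemsets` and the toy data are assumptions for illustration.

```python
from itertools import combinations

import pandas as pd

# Toy discretized data (illustrative only).
toy = pd.DataFrame({
    "sex":    ["Male", "Male", "Female", "Male", "Female", "Male"],
    "charge": ["F",    "M",    "F",      "F",    "M",      "F"],
})

def frequent_itemsets(df, min_sup):
    """Brute-force: keep attribute=value combinations with support >= min_sup."""
    results = {}
    columns = list(df.columns)
    for size in range(1, len(columns) + 1):
        for cols in combinations(columns, size):
            # Relative frequency of each observed combination of values.
            supports = df.groupby(list(cols)).size() / len(df)
            for values, support in supports.items():
                if support >= min_sup:
                    if not isinstance(values, tuple):
                        values = (values,)
                    itemset = frozenset(f"{c}={v}" for c, v in zip(cols, values))
                    results[itemset] = float(support)
    return results

found = frequent_itemsets(toy, min_sup=0.5)
for itemset, support in found.items():
    print(sorted(itemset), round(support, 3))
```

Brute-force enumeration grows exponentially with the number of attributes, so it is only suitable for a toy example like this one.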

In [8]:
from divexplorer.divexplorer import DivergenceExplorer

min_sup = 0.1
fp_diver = DivergenceExplorer(df)
FP_fm = fp_diver.get_pattern_divergence(min_support=min_sup, boolean_outcomes=['fpr_outcome'], show_coincise=True)
FP_fm.sort_values(by="support", ascending=False).head(10)

Unnamed: 0,support,itemsets,fpr_outcome,fpr_outcome_div,fpr_outcome_t,length,support_count
0,1.0,(),0.088314,0.0,0.0,0,6172.0
1,0.809624,(sex=Male),0.09381,0.005496,0.738697,1,4997.0
2,0.772683,(stay=<week),0.072043,-0.016271,2.339352,1,4769.0
3,0.643227,(charge=F),0.107754,0.01944,2.301183,1,3970.0
4,0.614226,"(stay=<week, sex=Male)",0.076379,-0.011935,1.555571,2,3791.0
5,0.572262,(age=25-45),0.088842,0.000528,0.087728,1,3532.0
6,0.532404,"(charge=F, sex=Male)",0.111041,0.022727,2.471729,2,3286.0
7,0.51442,(race=Afr-Am),0.139366,0.051052,5.046409,1,3175.0
8,0.471484,"(charge=F, stay=<week)",0.088179,-0.000135,0.01678,2,2910.0
9,0.460629,"(sex=Male, age=25-45)",0.096377,0.008063,0.914899,2,2843.0
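
The fpr_outcome_div column above is consistent with a simple difference: the average outcome on the subgroup minus the average outcome on the whole dataset (for sex=Male, 0.09381 - 0.088314 = 0.005496). The sketch below reproduces that arithmetic by hand on made-up toy data. It assumes that rows with a NaN outcome are ignored when averaging and that support counts all rows; the toy values are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy data (illustrative only). NaN marks rows that do not contribute to the
# false positive rate, i.e. actual positives.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Female", "Male", "Female", "Male"],
    "fpr_outcome": [1, 0, 0, np.nan, 0, 1],
})

overall_fpr = toy["fpr_outcome"].mean()   # pandas skips NaN by default
male = toy[toy["sex"] == "Male"]
male_fpr = male["fpr_outcome"].mean()
male_div = male_fpr - overall_fpr         # divergence of the itemset (sex=Male)
male_support = len(male) / len(toy)       # assumed: fraction of all rows

# Same arithmetic on the values reported in the table above.
table_div = 0.09381 - 0.088314

print(round(male_fpr, 4), round(male_div, 4), round(male_support, 4))
print(round(table_div, 6))
```

A positive divergence means the subgroup has a higher false positive rate than the dataset as a whole; a negative one means it is lower.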


## Quantitative outcomes

In [10]:
import os
import numpy as np

inputDir=os.path.join(".", "notebooks", "datasets")

import pandas as pd
df= pd.read_csv(os.path.join(inputDir, "compas_discretized.csv"))

df['score'] = np.random.uniform(size=len(df))  # random score in [0, 1), for illustration only
df.drop(columns=['class', 'predicted'], inplace=True)

In [12]:
from divexplorer.divexplorer import DivergenceExplorer

min_sup = 0.1
fp_diver = DivergenceExplorer(df)
FP_fm = fp_diver.get_pattern_divergence(min_support=min_sup, quantitative_outcomes=['score'], show_coincise=True)
FP_fm.sort_values(by="support", ascending=False).head(10)

Unnamed: 0,support,itemsets,score,score_SQUARED,score_div,score_t,length,support_count
0,1.0,(),0.504831,2072.852284,0.0,0.0,0,6172.0
1,0.809624,(sex=Male),0.506865,1689.869128,0.002034,0.375196,1,4997.0
2,0.772683,(stay=<week),0.506061,1609.471162,0.00123,0.223937,1,4769.0
3,0.643227,(charge=F),0.505467,1335.801726,0.000636,0.109898,1,3970.0
4,0.614226,"(stay=<week, sex=Male)",0.508774,1293.040345,0.003943,0.6683,2,3791.0
5,0.572262,(age=25-45),0.508694,1200.343185,0.003864,0.643244,1,3532.0
6,0.532404,"(charge=F, sex=Male)",0.507766,1114.883878,0.002935,0.476703,2,3286.0
7,0.51442,(race=Afr-Am),0.506295,1071.939928,0.001465,0.235392,1,3175.0
8,0.471484,"(charge=F, stay=<week)",0.505202,980.870925,0.000371,0.057822,2,2910.0
9,0.460629,"(sex=Male, age=25-45)",0.512601,978.447593,0.007771,1.202592,2,2843.0
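
Because the score here is random, every subgroup mean should sit close to the overall mean, which is why the score_div values above are all near zero. The sketch below shows this on synthetic data. The seed, the sample size, and the helper `mean_divergence_and_t` are assumptions for illustration. The t-statistic is a Welch-style statistic comparing the subgroup with the whole dataset, taken in absolute value. This is an assumed form, not confirmed to be DivExplorer's exact computation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed (assumption) for reproducibility
n = 10_000
toy = pd.DataFrame({
    "sex": rng.choice(["Male", "Female"], size=n),
    "score": rng.uniform(size=n),  # random quantitative outcome
})

def mean_divergence_and_t(df, mask, column):
    """Subgroup mean minus overall mean, plus a Welch-style |t| (assumed form)."""
    sub = df.loc[mask, column]
    full = df[column]
    div = sub.mean() - full.mean()
    std_err = np.sqrt(sub.var() / len(sub) + full.var() / len(full))
    return div, abs(div) / std_err

div, t = mean_divergence_and_t(toy, toy["sex"] == "Male", "score")
print(div, t)
```

With a random outcome, both the divergence and the t-statistic are expected to be small; a large t-value on real data would suggest that the subgroup's average differs from the overall average by more than chance alone.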
