# Predicting Parole Violators

In [32]:
import pandas as pd
import numpy as np

import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn import metrics

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

## Problem 1.1 - Loading the Dataset
Load the dataset parole.csv into a data frame called parole.

How many parolees are contained in the dataset?
- 675

In [33]:
parole = pd.read_csv('../data/parole.csv')
parole.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 675 entries, 0 to 674
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   male               675 non-null    int64  
 1   race               675 non-null    int64  
 2   age                675 non-null    float64
 3   state              675 non-null    int64  
 4   time.served        675 non-null    float64
 5   max.sentence       675 non-null    int64  
 6   multiple.offenses  675 non-null    int64  
 7   crime              675 non-null    int64  
 8   violator           675 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 47.6 KB


## Problem 1.2 - Loading the Dataset
How many of the parolees in the dataset violated the terms of their parole?

In [34]:
parole['violator'].sum()

78

## Problem 2.1 - Preparing the Dataset
You should be familiar with unordered factors. Which variables in this dataset are unordered factors with at least three levels? *Select all that apply*.
- state
- crime

## Problem 2.2 - Preparing the Dataset
In the last subproblem, we identified variables that are unordered factors with at least 3 levels, so we need to convert them to factors for our prediction problem. Convert these variables to factors (categorical). Keep in mind that we are not changing the values, just the way Python understands them (the values are still numbers).

How does the output of summary() (in R) change for a factor variable as compared to a numerical variable?
- The output becomes similar to that of the table() (in R) function applied to that variable.
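
For the pandas side of this question, a minimal sketch is shown here. It assumes a small made-up frame named `toy` (not the real parole data) and uses `astype("category")` as the pandas analogue of R's `as.factor()`. After conversion, `describe()` reports level-based statistics, and `value_counts()` plays roughly the role of R's `table()`.

```python
import pandas as pd

# Toy stand-in for the parole data (values are made up for illustration).
toy = pd.DataFrame({
    "state": [1, 2, 2, 3, 4, 4, 4],
    "crime": [1, 1, 2, 3, 4, 2, 1],
})

# Numeric summary: count, mean, std, quartiles (not meaningful for category codes).
numeric_summary = toy["state"].describe()

# Convert to categorical; the stored values are unchanged.
toy["state"] = toy["state"].astype("category")
toy["crime"] = toy["crime"].astype("category")

# Categorical summary: count, unique, top, freq.
categorical_summary = toy["state"].describe()

# Per-level counts, the closest analogue of R's table().
state_counts = toy["state"].value_counts().sort_index()

print(numeric_summary)
print(categorical_summary)
print(state_counts)
```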

## Problem 3.1 - Splitting into a Training and Testing Set
Allocate 70% to the training set and 30% to the testing set.

In [35]:
# One-hot encode the unordered factors; drop_first=True drops the baseline level.
# dtype=int keeps the design matrix numeric (newer pandas versions default to bool).
state_dummies = pd.get_dummies(parole['state'], prefix='state', drop_first=True, dtype=int)
crime_dummies = pd.get_dummies(parole['crime'], prefix='crime', drop_first=True, dtype=int)
# Replace the original columns with their dummy columns (rows align on the shared index).
parole = pd.concat([parole.drop(columns=['state', 'crime']), state_dummies, crime_dummies], axis=1)

features1 = list(parole.columns)
features1.remove('violator')
X1 = parole[features1]
y1 = parole['violator']

X_train1, X_test1, y_train1, y_test1 = train_test_split(
    X1, y1, train_size=0.7, random_state=144
)

## Problem 4.1 - Building a Logistic Regression Model
Train a logistic regression model on the training set. Your dependent variable is "violator", and you should use all of the other variables as independent variables.

What variables are significant in this model? Significant variables should have at least one star, or should have a probability less than 0.05 (the column Pr(>|z|) in the R summary output, shown as P>|z| by statsmodels). *Select all that apply*.
- race
- state4
- multiple.offenses

In [36]:
model1 = sm.Logit(y_train1, sm.add_constant(X_train1)).fit()
print(model1.summary())

Optimization terminated successfully.
         Current function value: 0.264284
         Iterations 8
                           Logit Regression Results                           
Dep. Variable:               violator   No. Observations:                  472
Model:                          Logit   Df Residuals:                      459
Method:                           MLE   Df Model:                           12
Date:                Sat, 21 Aug 2021   Pseudo R-squ.:                  0.2658
Time:                        10:32:30   Log-Likelihood:                -124.74
converged:                       True   LL-Null:                       -169.89
Covariance Type:            nonrobust   LLR p-value:                 4.312e-14
                        coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                -3.6939      1.225     -3.016      0.003      -6.094      -1.293
male    

## Problem 4.2 - Building a Logistic Regression Model
What can we say based on the coefficient of the multiple.offenses variable?

The following two properties might be useful to you when answering this question:
1. If we have a coefficient c for a variable, then that means the log odds (or Logit) are increased by c for a unit increase in the variable.
2. If we have a coefficient c for a variable, then that means the odds are multiplied by $e^{c}$ for a unit increase in the variable.

- Our model predicts that a parolee who committed multiple offenses has $e^{c}$ times higher odds of being a violator than a parolee who did not commit multiple offenses but is otherwise identical.
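
The two properties can be checked numerically with a short sketch. The baseline log odds and the coefficient below are hypothetical values chosen only for illustration, not the fitted model's. Adding `c` to the log odds multiplies the odds by $e^{c}$, whatever the baseline is.

```python
import numpy as np

# Hypothetical values for illustration only (not taken from the fitted model).
base_log_odds = -2.0   # log odds for a parolee with multiple.offenses = 0
c = 1.5                # coefficient on multiple.offenses

# Property 1: a unit increase in the variable adds c to the log odds.
log_odds_with = base_log_odds + c

# Property 2: the odds are therefore multiplied by e^c.
odds_without = np.exp(base_log_odds)
odds_with = np.exp(log_odds_with)
odds_ratio = odds_with / odds_without

print("Odds ratio:", odds_ratio)
print("exp(c):    ", np.exp(c))
```

The ratio does not depend on `base_log_odds`, which is why the answer holds for any two parolees who are "otherwise identical".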

## Problem 4.3 - Building a Logistic Regression Model
Consider a parolee who is male, of white race, aged 50 years at prison release, from the state of Maryland, served 3 months, had a maximum sentence of 12 months, did not commit multiple offenses, and committed a larceny. Answer the following questions based on the model's predictions for this individual. *HINT: You should use the coefficients of your model, the Logistic Response Function, and the Odds equation to solve this problem*.

According to the model, what are the odds this individual is a violator?
- 0.16796494198768974

According to the model, what is the probability this individual is a violator?
- 0.14380991753214806

In [37]:
# Coefficients entered by hand (rounded to four decimals); they are not read from model1.
c = (
    -4.4218      # const
    + 0.4134*1   # male
    + 0.8785*1   # race
    + 0.0100*50  # age
    - 0.1511*3   # time.served
    + 0.0903*12  # max.sentence
    + 1.7627*0   # multiple.offenses
    + 0.2156*1   # crime2
)
odds = np.exp(c)
prob = 1 / (1 + np.exp(-c))
print("Odds:", odds)
print("Probability:", prob)

Odds: 0.16796494198768974
Probability: 0.14380991753214806
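
As a consistency check on the two printed numbers, the Odds equation can be applied directly. This is a small sketch that starts from the odds value printed above and uses only the identity odds = p / (1 - p); the variable names are illustrative.

```python
# Odds value printed by the cell above.
odds = 0.16796494198768974

# Odds equation: odds = p / (1 - p), so p = odds / (1 + odds).
prob_from_odds = odds / (1 + odds)

# Inverse direction: recover the odds from the probability.
odds_from_prob = prob_from_odds / (1 - prob_from_odds)

print("Probability from odds:", prob_from_odds)
print("Odds from probability:", odds_from_prob)
```

The result agrees with the probability obtained from the Logistic Response Function, since 1 / (1 + e^(-c)) and e^c / (1 + e^c) are the same quantity.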


## Problem 5.1 - Evaluating the Model on the Testing Set
Obtain the model's predicted probabilities for parolees in the testing set.

What is the maximum predicted probability of a violation?

In [38]:
pred_1 = model1.predict(sm.add_constant(X_test1))
pred_1.max()

0.7182378450775322

## Problem 5.2 - Evaluating the Model on the Testing Set
In the following questions, evaluate the model's predictions on the test set using a threshold of 0.5.

What is the model's sensitivity?
- 0.21739130434782608

What is the model's specificity?
- 0.9722222222222222

What is the model's accuracy?
- 0.8866995073891626

In [39]:
pred_df = y_test1.to_frame()
pred_df['Predicted'] = (pred_1 >= 0.5).astype(int)

cfm = pred_df.value_counts().sort_index()
sensitivity = cfm.loc[1,1] / (cfm.loc[1,1] + cfm.loc[1,0])
specificity = cfm.loc[0,0] / (cfm.loc[0,0] + cfm.loc[0,1])
accuracy = (cfm.loc[0,0] + cfm.loc[1,1]) / cfm.sum()

print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("Accuracy:", accuracy)

Sensitivity: 0.21739130434782608
Specificity: 0.9722222222222222
Accuracy: 0.8866995073891626
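
The same three metrics can also be read off `sklearn.metrics.confusion_matrix`. The sketch below uses a small made-up set of labels and predicted probabilities (not the parole test set) so that each cell of the matrix can be verified by hand.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
probs = np.array([0.1, 0.2, 0.6, 0.3, 0.05, 0.4, 0.7, 0.3, 0.45, 0.2])

# Apply the 0.5 threshold to get class predictions.
y_pred = (probs >= 0.5).astype(int)

# With labels=[0, 1], rows are actual classes and columns are predicted classes,
# so ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)   # true positive rate
specificity = tn / (tn + fp)   # true negative rate
accuracy = (tp + tn) / (tp + tn + fp + fn)

print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("Accuracy:", accuracy)
```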


## Problem 5.3 - Evaluating the Model on the Testing Set
What is the accuracy of a simple model that predicts that every parolee is a non-violator?

In [40]:
1 - y_test1.mean()

0.8866995073891626

## Problem 5.4 - Evaluating the Model on the Testing Set
Consider a parole board using the model to predict whether parolees will be violators or not. The job of a parole board is to make sure that a prisoner is ready to be released into free society, and therefore parole boards tend to be particularly concerned about releasing prisoners who will violate their parole. Which of the following most likely describes their preferences and best course of action?
- The board assigns more cost to a false negative than a false positive, and should therefore use a logistic regression cutoff less than 0.5.
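
To see why a lower cutoff suits a board that weighs false negatives heavily, the sketch below compares two cutoffs on made-up labels and probabilities (not the parole test set). The helper `sensitivity_specificity` and the 0.25 cutoff are illustrative choices, not part of the original analysis.

```python
import numpy as np

def sensitivity_specificity(y_true, probs, threshold):
    """Return (sensitivity, specificity) for predictions probs >= threshold."""
    y_pred = (probs >= threshold).astype(int)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return tp / (tp + fn), tn / (tn + fp)

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
probs = np.array([0.1, 0.2, 0.6, 0.3, 0.05, 0.4, 0.7, 0.3, 0.45, 0.2])

sens_high, spec_high = sensitivity_specificity(y_true, probs, 0.5)
sens_low, spec_low = sensitivity_specificity(y_true, probs, 0.25)

print("Cutoff 0.50:", sens_high, spec_high)
print("Cutoff 0.25:", sens_low, spec_low)
```

On this toy data, lowering the cutoff catches every violator (fewer false negatives) at the price of flagging more non-violators (more false positives), which is the trade-off the answer describes.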

## Problem 5.5 - Evaluating the Model on the Testing Set
Which of the following is the most accurate assessment of the value of the logistic regression model with a cutoff 0.5 to a parole board, based on the model's accuracy as compared to the simple baseline model?
- The model is likely of value to the board, and using a different logistic regression cutoff is likely to improve the model's value.

## Problem 5.6 - Evaluating the Model on the Testing Set
What is the AUC value for the model?

In [41]:
fpr, tpr, ths = metrics.roc_curve(y_test1, pred_1)
auc = metrics.auc(fpr, tpr)
print(auc)

0.8632850241545893


## Problem 5.7 - Evaluating the Model on the Testing Set
Describe the meaning of AUC in this context.
- The probability the model can correctly differentiate between a randomly selected parole violator and a randomly selected parole non-violator.
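
This interpretation can be verified directly: compare every (violator, non-violator) pair and count how often the violator receives the higher predicted probability. The sketch below does this on made-up data (not the parole test set) and compares the result with `sklearn.metrics.roc_auc_score`; ties, if any, are counted as half.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predicted probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
probs = np.array([0.1, 0.2, 0.6, 0.35, 0.05, 0.4, 0.7, 0.3, 0.45, 0.2])

pos_scores = probs[y_true == 1]   # scores of actual violators
neg_scores = probs[y_true == 0]   # scores of actual non-violators

# Compare every positive score with every negative score.
diff = pos_scores[:, None] - neg_scores[None, :]
wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
pairwise_auc = wins / diff.size

print("Pairwise AUC:", pairwise_auc)
print("sklearn AUC: ", metrics.roc_auc_score(y_true, probs))
```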

## Problem 6.1 - Identifying Bias in Observational Data
Our goal has been to predict the outcome of a parole decision, and we used a publicly available dataset of parole releases for predictions. In this final problem, we'll evaluate a potential source of bias associated with our analysis. It is always important to evaluate a dataset for possible sources of bias.

The dataset contains all individuals released from parole in 2004, either due to completing their parole term or violating the terms of their parole. However, it does not contain parolees who neither violated their parole nor completed their term in 2004, causing non-violators to be underrepresented. This is called "selection bias" or "selecting on the dependent variable," because only a subset of all relevant parolees were included in our analysis, based on our dependent variable in this analysis (parole violation). How could we improve our dataset to best address selection bias?
- We should use a dataset tracking a group of parolees from the start of their parole until either they violated parole or they completed their term. 