In this problem, we will build and validate a model that predicts whether an inmate will violate the terms of his or her parole. Such a model could be useful to a parole board deciding whether to approve or deny an application for parole.

In [1]:
parole = read.csv("./data/parole.csv")
str(parole)

'data.frame':	675 obs. of  9 variables:
 $ male             : int  1 0 1 1 1 1 1 0 0 1 ...
 $ race             : int  1 1 2 1 2 2 1 1 1 2 ...
 $ age              : num  33.2 39.7 29.5 22.4 21.6 46.7 31 24.6 32.6 29.1 ...
 $ state            : int  1 1 1 1 1 1 1 1 1 1 ...
 $ time.served      : num  5.5 5.4 5.6 5.7 5.4 6 6 4.8 4.5 4.7 ...
 $ max.sentence     : int  18 12 12 18 12 18 18 12 13 12 ...
 $ multiple.offenses: int  0 0 0 0 0 0 0 0 0 0 ...
 $ crime            : int  4 3 3 1 1 4 3 1 3 2 ...
 $ violator         : int  0 0 0 0 0 0 0 0 0 0 ...


In [2]:
table(parole$violator)


  0   1 
597  78 
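
The counts above imply that violators are a small minority of the data, which matters later when judging accuracy. A minimal sketch of that arithmetic, with the two counts copied by hand from the output (the names `counts` and `violation_rate` are illustrative, not part of the original notebook):

```r
# Counts copied from the table(parole$violator) output above
counts <- c(non_violator = 597, violator = 78)

# Share of parolees in the data who violated parole
violation_rate <- counts[["violator"]] / sum(counts)
round(violation_rate, 4)
```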

The state and crime variables are categories coded as integers. Using the as.factor() function, convert these two variables to factors. Keep in mind that we are not changing the values, just the way R understands them (the level labels are still the original numbers).

In [11]:
parole$state = as.factor(parole$state)
parole$crime = as.factor(parole$crime)

In [12]:
summary(parole$state)
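
To see what the conversion does and does not change, here is a small sketch on a made-up integer vector (the vector `codes` is hypothetical and not taken from the parole data). It suggests that the original numbers survive as level labels, while the factor itself stores level positions:

```r
# Hypothetical integer-coded column, for illustration only
codes <- c(2L, 4L, 4L, 7L)
f <- as.factor(codes)

levels(f)                        # level labels are the original values, as strings
as.integer(f)                    # internal level positions, not the original codes
as.integer(as.character(f))      # recovers the original codes
table(f)                         # counts per level, similar to summary() on a factor
```

Because of this, a model formula treats a factor as a set of separate categories rather than as a number with a slope.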

In [13]:
set.seed(144)

In [14]:
library(caTools)
split = sample.split(parole$violator, SplitRatio = 0.7)
train = subset(parole, split == TRUE)
test = subset(parole, split == FALSE)
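
`sample.split` keeps the proportion of each outcome class roughly the same in both pieces. The sketch below illustrates that idea in base R only. It is not the caTools implementation, the helper `stratified_split` is hypothetical, and the seed is arbitrary, so it will not reproduce the exact rows selected above:

```r
set.seed(1)  # arbitrary seed for this illustration only

# Outcome vector with the same class counts as parole$violator
y <- c(rep(0, 597), rep(1, 78))

# Hypothetical helper: mark about `ratio` of each class as training rows
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    in_train[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  in_train
}

in_train <- stratified_split(y, 0.7)
mean(y[in_train])    # violator share among training rows
mean(y[!in_train])   # violator share among held-out rows
```

Both shares should come out close to the overall violation rate, which is the point of stratifying on the outcome.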

In [15]:
model1 = glm(violator ~ ., data=train, family=binomial)
summary(model1)


Call:
glm(formula = violator ~ ., family = binomial, data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7041  -0.4236  -0.2719  -0.1690   2.8375  

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)    
(Intercept)       -4.2411574  1.2938852  -3.278  0.00105 ** 
male               0.3869904  0.4379613   0.884  0.37690    
race               0.8867192  0.3950660   2.244  0.02480 *  
age               -0.0001756  0.0160852  -0.011  0.99129    
state2             0.4433007  0.4816619   0.920  0.35739    
state3             0.8349797  0.5562704   1.501  0.13335    
state4            -3.3967878  0.6115860  -5.554 2.79e-08 ***
time.served       -0.1238867  0.1204230  -1.029  0.30359    
max.sentence       0.0802954  0.0553747   1.450  0.14705    
multiple.offenses  1.6119919  0.3853050   4.184 2.87e-05 ***
crime2             0.6837143  0.5003550   1.366  0.17180    
crime3            -0.2781054  0.4328356  -0.643  0.52054    
crime4    

In [23]:
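# Odds of violation = exp(log-odds) for male=1, race=1, age=50, state=1 (baseline), time.served=3, max.sentence=12, multiple.offenses=0, crime=2
exp(-4.2411574+0.3869904*1-0.0001756*50+0.8867192*1-0.1238867*3+0.0802954*12+0.6837143*1)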

In [25]:
# Convert the odds from the previous cell to a probability; equivalent to odds/(1+odds)
1/(1+(1/0.18256868876081))
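
The two hand calculations above can also be written with named vectors, which makes it easier to check which coefficient is paired with which value. This is a sketch only: the coefficients are copied from the `summary(model1)` output, the predictor values are the ones implied by the hand calculation, and the names `coefs`, `x`, `log_odds`, `odds`, and `prob` are illustrative.

```r
# Coefficients copied from the summary(model1) output above
coefs <- c(intercept = -4.2411574, male = 0.3869904, race = 0.8867192,
           age = -0.0001756, time.served = -0.1238867,
           max.sentence = 0.0802954, crime2 = 0.6837143)

# Predictor values used in the hand calculation (the intercept is multiplied by 1)
x <- c(intercept = 1, male = 1, race = 1, age = 50, time.served = 3,
       max.sentence = 12, crime2 = 1)

log_odds <- sum(coefs * x[names(coefs)])  # linear predictor
odds <- exp(log_odds)                     # odds of violation
prob <- plogis(log_odds)                  # logistic function, same as odds / (1 + odds)
c(odds = odds, probability = prob)
```

Terms for the baseline state and for `multiple.offenses = 0` contribute nothing to the sum, which is why they are left out of both versions.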

In [26]:
predTest = predict(model1, type="response", newdata=test)
summary(predTest)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002334 0.023780 0.057910 0.146600 0.147500 0.907300 

In [27]:
table(test$violator, predTest>=0.5)

   
    FALSE TRUE
  0   167   12
  1    11   12

In [28]:
# Sensitivity: correctly predicted violators / all actual violators
12/23

In [29]:
# Specificity: correctly predicted non-violators / all actual non-violators
167/179

In [30]:
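# Accuracy of always predicting "non-violator": 179 actual non-violators out of 202; the model's accuracy, (167+12)/202, happens to be the same
179/(179+23)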
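
The three ratios above can be read directly off the confusion matrix. The sketch below rebuilds the matrix from the printed counts and computes each metric by name; the object `cm` and the metric names are illustrative and not part of the original notebook.

```r
# Counts copied from the table() output above (rows: actual, columns: predicted)
cm <- matrix(c(167, 11, 12, 12), nrow = 2,
             dimnames = list(actual = c("0", "1"),
                             predicted = c("FALSE", "TRUE")))

sensitivity <- cm["1", "TRUE"] / sum(cm["1", ])    # true positive rate
specificity <- cm["0", "FALSE"] / sum(cm["0", ])   # true negative rate
accuracy    <- sum(diag(cm)) / sum(cm)             # share of correct predictions
baseline    <- max(rowSums(cm)) / sum(cm)          # always predict the majority class

round(c(sensitivity = sensitivity, specificity = specificity,
        accuracy = accuracy, baseline = baseline), 3)
```

At the 0.5 threshold the model's accuracy equals the baseline on this test set, so accuracy alone does not show whether the model is useful.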

In [31]:
library("ROCR")

Loading required package: gplots

Attaching package: 'gplots'

The following object is masked from 'package:stats':

    lowess



In [32]:
ROCRpredTest = prediction(predTest, test$violator)
auc = as.numeric(performance(ROCRpredTest, "auc")@y.values)
auc
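
The AUC can be read as the probability that a randomly chosen violator receives a higher predicted probability than a randomly chosen non-violator. The sketch below computes that quantity from ranks in base R on made-up scores. The helper `auc_by_rank` is hypothetical, it is not how ROCR computes the value internally, and the toy data are not taken from the parole set.

```r
# Hypothetical helper: rank-based AUC (Mann-Whitney formulation)
auc_by_rank <- function(scores, labels) {
  r <- rank(scores)                 # average ranks are used for tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy scores and labels, for illustration only
scores <- c(0.9, 0.8, 0.35, 0.6, 0.2, 0.1)
labels <- c(1,   1,   1,    0,   0,   0)
toy_auc <- auc_by_rank(scores, labels)
toy_auc
```

In the toy data, 8 of the 9 positive-negative pairs are ordered correctly, so the result is 8/9. Unlike accuracy, this measure does not depend on the 0.5 threshold.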