In [None]:
%matplotlib inline
from google.colab import drive
drive.mount('/content/gdrive')
%cd '/content/gdrive/My Drive/Econ 484/datasets'

Mounted at /content/gdrive
/content/gdrive/My Drive/Econ 484/datasets


In [None]:
import numpy as np
np.random.seed(seed=484)
import warnings
warnings.filterwarnings('ignore')

# Midterm Exam

The exam is open book, open note, and open Google. However, you may not receive help from another person: all work, including coding, must be yours alone. Remember to turn in both the written portion and this coding portion. Submit the coding portion as a shared link to your Colab notebook. To complete it, save a copy of this notebook in your own Google Drive, supply the Python code in the empty cells below, and execute the notebook. To get full credit, the completed notebook must run top to bottom and produce the results asked for in the prompts below.

This portion of the exam will take you through the steps of the supervised machine learning process.

## 1. Figure out your question

The question you want to answer is: How does being detained prior to trial affect a defendant's ultimate probability of conviction? We can use machine learning to help answer this question by building a model that predicts whether a defendant will be detained prior to trial on the basis of his or her characteristics, and another that predicts conviction on the basis of the same characteristics.

## 2. Obtain a labeled dataset

Import the python library that is good for manipulating datasets:

In [None]:
import pandas as pd

Accompanying the exam materials are a spreadsheet of defendants from the Philadelphia criminal justice system, 'defendants.csv' and another file, 'defendants_defs.txt' that explains each variable in the spreadsheet. Read in the data in the spreadsheet 'defendants.csv', print out the first few rows of data with the variable names, and print out the number of observations and variables in the dataset:

In [None]:
defendants = pd.read_csv("defendants.csv") #read in data
print(defendants.head(5)) #print first 5 rows
col = len(defendants.columns)
print(f'The number of columns is {col}') #display number of columns
row = len(defendants)
print(f'The number of rows is {row}') #display number of rows

   priorCases  prior_felChar  prior_guilt  fel  mis  ...  t3  t4  t5  t6  baildate
0           4              3            1    1    1  ...   0   0   0   0     17058
1           0              0            0    1    1  ...   0   0   0   0     17059
2          10              1            3    1    1  ...   0   0   0   0     17059
3           0              0            0    0    1  ...   0   0   0   0     17059
4          24              4            3    1    1  ...   0   0   0   0     17060

[5 rows x 38 columns]
The number of columns is 38
The number of rows is 3312


Define a label (outcome) vector, $y_1$, to be an indicator for pre-trial detention, another outcome vector, $y_2$ to be an indicator for conviction, and define a feature (regressor) matrix, $X$, to contain indicators for drug possession, robbery, DUI first offense, drug selling, aggravated assault, black, age, male, white, an indicator for any prior cases, number of prior cases, indicator for a prior charge within the last 5 years, number of prior felony charges, number of prior guilty cases, indicator for at least 3 prior cases, indicator for felony charge, indicator for misdemeanor charge, all "other charge indicator" variables, bail date, and all variables for "time/day the arraignment occurred". Print out the number of X variables.

In [None]:
y1 = defendants['jail3']
y2 = defendants['guilt']

columns = [x for x in defendants.columns if x not in ['jail3','guilt']]
X= defendants.loc[:,columns]

number = len(columns)
print(f'number of X variables is {number}')
X.head(3)


number of X variables is 36


Unnamed: 0,priorCases,prior_felChar,prior_guilt,fel,mis,sum,F1,F2,F3,F,M1,M2,M3,M,robbery,aggAss,possess,drugSell,DUI1st,white,black,age,male,onePrior,threePriors,priorWI5,day,day2,day3,t1,t2,t3,t4,t5,t6,baildate
0,4,3,1,1,1,0,0,0,0,1,0,0,0,1,0,0,0,1,0,0,1,26.901369,1,1,1,1,2,4,8,1,0,0,0,0,0,17058
1,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,1,0,0,0,0,0,37.758904,1,0,0,0,3,9,27,1,0,0,0,0,0,17059
2,10,1,3,1,1,0,1,1,0,0,0,1,0,0,1,0,0,0,0,0,1,42.035618,1,1,1,1,3,9,27,1,0,0,0,0,0,17059
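
The label/feature split above can be sketched on a toy table. The DataFrame `toy` and its values are made up for illustration; only the column names `jail3`, `guilt`, `age`, and `priorCases` come from the dataset. Dropping the label columns by name is one way to make sure no outcome leaks into the feature matrix.

```python
import pandas as pd

# Toy stand-in for defendants.csv; the values are invented for illustration.
toy = pd.DataFrame({
    "jail3": [1, 0, 1],
    "guilt": [0, 0, 1],
    "age": [26.9, 37.8, 42.0],
    "priorCases": [4, 0, 10],
})

label_cols = ["jail3", "guilt"]        # outcomes, not features
X_toy = toy.drop(columns=label_cols)   # every remaining column is a feature
y1_toy = toy["jail3"]                  # pre-trial detention indicator
y2_toy = toy["guilt"]                  # conviction indicator

print(X_toy.shape)                     # (rows, number of features)
print(list(X_toy.columns))
```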


"Pre-process" your features, $X$, by standardizing them to have zero mean and unit variance. Hint: you may import a useful package to do this.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() #create scaler object
scaler.fit(X) #feed the scaler object the x
x_scaled = scaler.transform(X)
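
As a sanity check on what standardizing does, here is a minimal sketch on synthetic data (the array `X_demo` and its parameters are assumptions, not the exam data). It confirms that `StandardScaler` matches subtracting each column's mean and dividing by its standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(484)
# Two synthetic features with different means and spreads
X_demo = rng.normal(loc=[30.0, 2.0], scale=[10.0, 3.0], size=(200, 2))

scaler_demo = StandardScaler().fit(X_demo)
X_demo_scaled = scaler_demo.transform(X_demo)

# Same result by hand: subtract the column mean, divide by the column std.
# StandardScaler uses the population standard deviation (ddof=0), as does np.std.
X_by_hand = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

print(X_demo_scaled.mean(axis=0).round(6))  # column means after scaling
print(X_demo_scaled.std(axis=0).round(6))   # column stds after scaling
```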



## 3. Divide into training and test sets

Import the python library that is good for randomly splitting datasets into training and test sets:

In [None]:
from sklearn.model_selection import train_test_split


Now make training and test feature matrices and training and test label vectors:

In [None]:
X_train, X_test, y1_train, y1_test, y2_train, y2_test = train_test_split(x_scaled,y1,y2,random_state=42)
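
Passing the feature matrix and both label vectors to a single `train_test_split` call keeps their rows aligned. A small sketch with made-up arrays (each label encodes its own row id, so alignment can be checked directly):

```python
import numpy as np
from sklearn.model_selection import train_test_split

n = 20
X_demo = np.arange(n).reshape(-1, 1)   # the single feature is the row id
y1_demo = np.arange(n) * 10            # first label encodes the row id
y2_demo = np.arange(n) * 100           # second label encodes the row id

X_tr, X_te, y1_tr, y1_te, y2_tr, y2_te = train_test_split(
    X_demo, y1_demo, y2_demo, random_state=42
)

# With the default test_size of 0.25, 5 of the 20 rows go to the test set
print(X_tr.shape, X_te.shape)
# Every label still belongs to the row it started with
print((y1_tr == X_tr[:, 0] * 10).all(), (y2_te == X_te[:, 0] * 100).all())
```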


## 4. Pick an appropriate method

Choose a method appropriate for classification and import its library:

In [None]:
from sklearn.neighbors import KNeighborsClassifier

## 5 and 6. Choose regularization parameters via cross-validation on the training set and fit model on the whole training set using the cross-validated parameters

The outcome you should use in this part is $y_2$, the indicator for conviction.

Search over a grid of values of the regularization parameters for the parameters that perform the best on the left-out folds:

In [None]:
from sklearn.model_selection import GridSearchCV


knn = KNeighborsClassifier()
# dictionary of tuning-parameter values to search over
# n_neighbors is the regularization parameter; leaf_size only affects the speed and
# memory of the neighbor search, not the predictions, so including it lengthens the run
parameter_grid = {'n_neighbors': np.arange(1, 15),'leaf_size': list(range(10,40))}
# use grid search with 5-fold cross-validation over every combination in the grid
knn_gscv = GridSearchCV(knn, parameter_grid, cv=5)

knn_gscv.fit(X_train, y2_train)
print(knn_gscv.best_params_)
print(knn_gscv.best_score_)

# This takes about 5 minutes to run. Larger grids and other classification methods took much longer.

{'leaf_size': 10, 'n_neighbors': 9}
0.6288359187382359
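
A faster variant of this search, sketched on synthetic data from `make_classification` (an assumption standing in for the exam data), tunes only `n_neighbors`. The second half illustrates why `leaf_size` can be left out of the grid: with an exact tree search it changes how the tree is built, not which neighbors are found.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Search over the number of neighbors only
search = GridSearchCV(
    KNeighborsClassifier(),
    {"n_neighbors": np.arange(1, 15)},
    cv=5,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))

# Two models that differ only in leaf_size give the same predictions
small_leaf = KNeighborsClassifier(n_neighbors=9, algorithm="kd_tree", leaf_size=10)
large_leaf = KNeighborsClassifier(n_neighbors=9, algorithm="kd_tree", leaf_size=39)
small_leaf.fit(X_demo, y_demo)
large_leaf.fit(X_demo, y_demo)
same = np.array_equal(small_leaf.predict(X_demo), large_leaf.predict(X_demo))
print(same)
```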


## 7. Evaluate model by applying it to test set

Compute and print out the "score" of the model applied to the test set:

In [None]:
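# In comments is what Logan showed to be the correct way
# (here best_model would be knn_gscv.best_estimator_, and y_train/y_test would be y2_train/y2_test):
#best_model.fit(X_train,y_train)
#best_model.score(X_test,y_test)
##

knn_test = KNeighborsClassifier(n_neighbors=9, leaf_size=10)
knn_test.fit(X_test,y2_test) # mistake: this refits on the test set, so the score below is in-sample rather than held-out
print(knn_test.score(X_test,y2_test))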

0.7077294685990339
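
The score above comes from a model refit on the test set, so it is an in-sample number. A minimal sketch of the held-out evaluation on synthetic data (the names `X_demo`, `search`, and `held_out_score` are illustrative assumptions): fit on the training split only, then score on the test split without refitting.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)

search = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": range(1, 15)}, cv=5)
# With the default refit=True, the best model is refit on all of X_tr after the search
search.fit(X_tr, y_tr)

# Held-out accuracy: scores best_estimator_ on the test split, no refitting
held_out_score = search.score(X_te, y_te)

# For contrast only: fitting on the test split and scoring on it is in-sample
in_sample_score = (
    KNeighborsClassifier(**search.best_params_).fit(X_te, y_te).score(X_te, y_te)
)
print(held_out_score, in_sample_score)
```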


## 8. Repeat 4-7 for $y_1$
using a method appropriate for regression-style prediction to predict the probability of being detained prior to trial

Import the method's library, do cross validation to find tuning parameters, fit the model on the training data using the cross-validated tuning parameters, and compute (and report) the model's score on the test set:

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:

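logit = LogisticRegression()
# Note: with the default solver (lbfgs in recent scikit-learn versions), only the 'l2' and
# 'none' penalties can be fit; the 'l1' and 'elasticnet' candidates fail and get no score.
# Those failures raise warnings, which are silenced at the top of the notebook.
param_grid = {'penalty' : ['l1', 'l2','elasticnet','none'],'C' : [1000,120,110,100,80,60,40,20,10,5, 1.0, 0.1, 0.01]}
grid_search = GridSearchCV(logit,param_grid,cv=5,return_train_score=True)
best_model = grid_search.fit(X_train,y1_train) # returns the fitted GridSearchCV object
print(best_model.best_params_)
print(best_model.best_score_)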



{'C': 1000, 'penalty': 'l2'}
0.7371105666255598
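
If both L1 and L2 penalties should really compete in the search, the solver has to support both. A hedged sketch on synthetic data, assuming the `liblinear` solver (which handles `'l1'` and `'l2'`) and a smaller, made-up grid of `C` values:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=2)

# liblinear supports both penalties, so every grid point can actually be fit
logit_demo = LogisticRegression(solver="liblinear")
grid_demo = {"penalty": ["l1", "l2"], "C": [0.01, 0.1, 1.0, 10, 100]}

search = GridSearchCV(logit_demo, grid_demo, cv=5)
search.fit(X_demo, y_demo)

scores = search.cv_results_["mean_test_score"]
print(search.best_params_)
print(np.isnan(scores).any())  # True would mean some candidates failed to fit
```

Checking `cv_results_` for missing scores is a quick way to spot grid points that were silently skipped.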


In [None]:
best_logit = LogisticRegression(penalty='l2',C=1000)
best_logit.fit(X_test,y1_test) # mistake: should not refit on the test set; score best_model (already fit on the training set) directly, or get predictions from it
print(best_logit.score(X_test,y1_test))

0.7620772946859904


## 9. Apply the prediction models to new observations for which we have no labels

The spreadsheet 'newdefendants.csv' contains information on 30 new defendants.

Read in the new observations' information, apply the models to predict each defendant's probability of being detained prior to trial and probability of conviction, and print out the predictions. Hint: don't forget to apply the same pre-processing steps to the new observations as you did to your training and test observations. This means standardizing the new observations using the means and variances of your labeled dataset, not the means and variances of the new observations.

In [None]:
new_defendants = pd.read_csv("newdefendants.csv")

In [None]:
print(new_defendants.head(3))
y1_new = new_defendants['jail3']
y2_new = new_defendants['guilt']

# reuse the feature list from the labeled data so the columns match what the scaler was fit on
X_new= new_defendants.loc[:,columns]

x_scaled_new = scaler.transform(X_new) # transform only: this reuses the means and variances learned from the labeled data

   priorCases  prior_felChar  prior_guilt  fel  mis  ...  t3  t4  t5  t6  baildate
0           2              0            1    0    0  ...   0   0   0   0     17169
1           8              4            4    0    1  ...   0   0   0   0     17247
2           3              2            2    0    1  ...   0   0   0   0     17348

[3 rows x 38 columns]
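
The hint about reusing the labeled data's means and variances can be illustrated with a small sketch. The arrays `X_labeled` and `X_new_demo` are synthetic assumptions; the point is the difference between calling `transform` with the already fitted scaler and fitting a fresh scaler on the new rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_labeled = rng.normal(loc=50.0, scale=5.0, size=(500, 3))   # labeled data
X_new_demo = rng.normal(loc=60.0, scale=5.0, size=(30, 3))   # new rows, shifted upward

scaler_demo = StandardScaler().fit(X_labeled)

# Correct: reuse the fitted scaler, i.e. the labeled data's means and scales
X_new_scaled = scaler_demo.transform(X_new_demo)
by_hand = (X_new_demo - scaler_demo.mean_) / scaler_demo.scale_

# Incorrect for this purpose: refitting hides the shift in the new rows
refit_wrong = StandardScaler().fit_transform(X_new_demo)

print(X_new_scaled.mean(axis=0).round(2))  # stays well above zero
print(refit_wrong.mean(axis=0).round(2))   # forced to zero
```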


In [None]:
knn_test.fit(x_scaled_new,y2_new) # mistake: do not fit on the new observations
#print(knn_test.score(x_scaled_new,y2_new))
##or .predict(x_scaled_new)

best_logit.fit(x_scaled_new,y1_new) # mistake: do not fit on the new observations
#print(best_logit.score(x_scaled_new,y1_new))
##or .predict(x_scaled_new)

logit_prob = best_logit.predict_proba(x_scaled_new)[:,1]
logit_predictions = best_logit.predict(x_scaled_new)
kn_predict = knn_test.predict(x_scaled_new)
print(f'the logit predicted probabilities of pre-trial detention are: {logit_prob}')
print(f'the logit predictions are: {logit_predictions}')
print(f'the knn predictions for conviction are: {kn_predict}')

# The predictions for conviction are class labels (y-hats), not probabilities, because they come from .predict.
# knn_test.predict_proba(x_scaled_new)[:,1] would return probabilities instead (the share of neighbors who were convicted).

the logit predicted probabilities of pre-trial detention are: [1.00000000e+00 9.99995273e-01 2.46625470e-04 9.99505709e-01
 1.00000000e+00 1.14686752e-03 9.97911664e-01 1.68823074e-04
 9.99999919e-01 9.99723103e-01 9.99999992e-01 9.99445140e-01
 9.99999880e-01 2.37959733e-04 5.35077541e-11 1.22240346e-04
 1.98292881e-04 5.21212521e-10 3.90587732e-04 9.98106469e-01
 3.41517635e-04 2.46629458e-03 2.01441937e-03 9.94761105e-01
 2.79623062e-03 9.99787602e-01 6.73581786e-07 6.75092281e-05
 2.82369792e-03 9.97736052e-01]
the logit predictions are: [1 1 0 1 1 0 1 0 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1]
the knn predictions for conviction are: [0 1 0 1 1 0 0 0 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
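
For comparison, here is a minimal sketch of the intended workflow on synthetic data: fit on labeled rows only, then call `predict_proba` on the unlabeled rows without refitting. The data, the split into 300 labeled and 30 unlabeled rows, and the choices `C=1.0` and `n_neighbors=9` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = make_classification(n_samples=330, n_features=8, random_state=3)
X_lab, y_lab = X_all[:300], y_all[:300]   # labeled rows used for fitting
X_unl = X_all[300:]                       # 30 "new" rows, labels treated as unknown

# Fit once, on labeled data only
logit_demo = LogisticRegression(C=1.0).fit(X_lab, y_lab)
knn_demo = KNeighborsClassifier(n_neighbors=9).fit(X_lab, y_lab)

# Predicted probability of class 1 for each new row, with no refitting
p_logit = logit_demo.predict_proba(X_unl)[:, 1]
p_knn = knn_demo.predict_proba(X_unl)[:, 1]  # share of the 9 neighbors with label 1

print(p_logit.round(3))
print(p_knn.round(3))
```

With uniform weights, the KNN probabilities are always multiples of 1/9, so they are coarser than the logit probabilities.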
