# Felony Modeling

Code for Charlottesville is collaborating with the LAJC to examine Virginia criminal court data and advocate for expungement of certain criminal records. As part of this project, they are interested in whether and how racial bias may play a role in case outcomes. One question of interest is what factors are predictive of whether someone is charged with a felony or a misdemeanor for a particular crime.

To investigate this question, we implement logistic regression and random forest classification models using felony/misdemeanor as the target variable. We use race along with other predictor variables to see how much impact race has on charge type. We are particularly interested in marijuana charges, since the recent legalization of marijuana in Virginia has made prior marijuana charges a particular priority for expungement.

### Summary

Using these techniques, we found that race made little difference in the rates at which people were charged with felonies rather than misdemeanors for marijuana possession with intent to distribute. However, race was more predictive than gender of whether a charge was a misdemeanor or a felony.

### Read in and Pre-process Data

We begin by looking at the district criminal data from 2019. This data contains just under 2 million records, of which about 650,000 are classified as either a misdemeanor or felony.

In [1]:
from pyspark.sql import SparkSession

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql import functions as F

import json
import pandas as pd
import matplotlib.pyplot as plt

spark = SparkSession.builder.getOrCreate()

In [2]:
district = spark.read\
                .format("csv")\
                .option("header", "true")\
                .load("district_criminal_2019_anon_*.csv")

In [3]:
#most fields are strings, but wanted to create schema so that a few columns are numeric
fields = []
for f in json.loads(district.schema.json())["fields"]:
    if f["name"] in ['SentenceTime', 'ProbationTime', 'Fine', 'Costs', 'FineCostsDue', 'FineCostsPaid']:
        fields.append(StructField(f["name"], DoubleType(), True))
    else:
        fields.append(StructField.fromJson(f))

schema = StructType(fields)

In [4]:
district = spark.read.schema(schema)\
                .format("csv")\
                .option("header", "true")\
                .load("district_criminal_2019_anon_*.csv")

In [5]:
#cleaning/standardizing race names

# raw strings keep the regex escapes (\( and \)) intact without invalid-escape warnings
district = district.withColumn('RaceClean', F.regexp_replace('Race', r'\(Non-Hispanic\)', ''))
district = district.withColumn('RaceClean', F.regexp_replace('RaceClean', r' Caucasian', ''))
district = district.withColumn('RaceClean', F.regexp_replace('RaceClean', r'Asian Or', 'Asian or'))
district = district.withColumn('RaceClean', F.regexp_replace('RaceClean', r' \(Includes Not Applicable, Unknown\)', ''))
district = district.withColumn('RaceClean', F.regexp_replace('RaceClean', r'\(Includes Not Applicable, Unknown\)', ''))
district = district.withColumn('RaceClean', F.regexp_replace('RaceClean', r'Other', 'Unknown'))

In [6]:
data = district.filter(district.CaseType.isin(['Misdemeanor', 'Felony']))

In [7]:
data = data.withColumn('Felony', F.when(data.CaseType == 'Felony', 1).otherwise(0))

### Logistic Regression Model

To test whether racial bias influences the outcomes, we create a logistic regression model to predict case type (misdemeanor/felony) using code section (the code identifying the crime charged), gender, and race. Plea type (not guilty/guilty/nolo contendere/etc.) is not included in the feature vector assembled below.

We chose to use a lasso model because the lasso penalty encourages model coefficients to be equal to zero when they are not contributing significantly to the model. Thus if the coefficients for race are non-zero, we have some evidence that race is useful in predicting whether a crime is a misdemeanor or a felony.
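
As a rough illustration of why an L1 penalty can produce coefficients that are exactly zero, the sketch below applies the soft-thresholding operator, which is the solution of a one-dimensional lasso problem. The function name, the `lam` parameter, and the example values are hypothetical and are not taken from the court data; `lam` plays the role of the penalty strength (`regParam` in Spark ML).

```python
def soft_threshold(z: float, lam: float) -> float:
    """Return the b that minimizes 0.5 * (b - z)**2 + lam * abs(b)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    # Any unpenalized estimate within [-lam, lam] is set exactly to zero.
    return 0.0


# Hypothetical unpenalized coefficient estimates (illustration only).
unpenalized = [-1.5, -0.2, 0.05, 0.9]

for lam in (0.0, 0.1, 0.5):
    shrunk = [round(soft_threshold(z, lam), 2) for z in unpenalized]
    print(f"lam={lam}: {shrunk}")
```

With `lam = 0.0` nothing is shrunk, so zeros only appear once the penalty strength is positive.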

In [8]:
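from pyspark.ml import Pipeline
# import only the feature transformers used below instead of a wildcard import
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression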

In [9]:
#all predictor variables are categorical and need to be one-hot encoded before modeling

gendInd = StringIndexer(inputCol="Gender", outputCol="GendInd", handleInvalid = "skip")
gend = OneHotEncoder(inputCol="GendInd", outputCol="GenderOH")

raceInd = StringIndexer(inputCol="RaceClean", outputCol="RaceInd", handleInvalid = "skip")
race = OneHotEncoder(inputCol="RaceInd", outputCol="RaceOH")

chargeInd = StringIndexer(inputCol="CodeSection", outputCol="ChargeInd", handleInvalid = "skip")
charge = OneHotEncoder(inputCol="ChargeInd", outputCol="ChargeCodeOH")

#gather encoded predictors into features vector
va = VectorAssembler(inputCols=["RaceOH", "ChargeCodeOH", "GenderOH"], outputCol="features", 
                     handleInvalid = "skip")

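# elasticNetParam = 1 selects a pure L1 (lasso) penalty; 0 selects a pure L2 (ridge) penalty
# note: regParam (the penalty strength) defaults to 0.0, so no penalty is applied unless it is set
logm = LogisticRegression(labelCol = 'Felony', elasticNetParam = 1)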

In [10]:
pipeline = Pipeline(stages=[gendInd, gend, raceInd, race, chargeInd, charge, va, logm])

Normally, we would split the data into training and testing sets, but since we're primarily interested in model interpretation rather than prediction, we go ahead and train on the full dataset here.

In [11]:
model = pipeline.fit(data)
pred = model.transform(data)

In [12]:
trainingSummary = model.stages[-1].summary

# Obtain the receiver-operating characteristic as a dataframe and areaUnderROC.
# trainingSummary.roc.show()
print("areaUnderROC: " + str(trainingSummary.areaUnderROC))

areaUnderROC: 0.9829157611246887


In [13]:
trainingSummary.precisionByLabel

[0.9421483067127724, 0.9150108918080402]

In [14]:
trainingSummary.recallByLabel
# https://spark.apache.org/docs/2.4.5/api/python/pyspark.ml.html?highlight=coefficients#pyspark.ml.classification.LogisticRegressionModel.coefficients

[0.9569337979094077, 0.8875241438922622]

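The AUC of about 0.98 tells us that this model is quite successful in distinguishing felony from misdemeanor charges using code section, race, and gender. Because the model is evaluated on the same data it was trained on, this figure likely overstates how well it would perform on new data.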
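
For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen felony record receives a higher predicted score than a randomly chosen misdemeanor record. The sketch below computes it by brute force over all such pairs; the labels and scores are made up for illustration and the `auc` helper is our own, not part of Spark.

```python
from itertools import product


def auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = felony) and predicted felony probabilities.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = auc(toy_labels, toy_scores)
print(toy_auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```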

In [15]:
#figure out which coefficients map to which characteristics

# https://stackoverflow.com/questions/39022052/relating-column-names-to-model-parameters-in-pyspark-ml

# numeric_metadata = pred.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('numeric')
binary_metadata = pred.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('binary')

# merge_list = numeric_metadata + binary_metadata 
binary_metadata[:7]

[{'idx': 0, 'name': 'RaceOH_White'},
 {'idx': 1, 'name': 'RaceOH_Black'},
 {'idx': 2, 'name': 'RaceOH_Hispanic'},
 {'idx': 3, 'name': 'RaceOH_Unknown'},
 {'idx': 4, 'name': 'RaceOH_Asian or Pacific Islander'},
 {'idx': 5, 'name': 'RaceOH_American Indian'},
 {'idx': 6, 'name': 'ChargeCodeOH_A.46.2-862'}]
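
The ordering of these columns follows from two defaults: `StringIndexer` assigns indices by descending label frequency, and `OneHotEncoder` drops the last category (`dropLast=True`), which then acts as the reference level. The plain-Python sketch below mimics that behavior using the per-race counts reported in the felony-rate table later in this notebook; it is an illustration of the default behavior, not the Spark implementation.

```python
# Counts per cleaned race category, copied from the felony-rate table in this notebook.
# The last name is truncated in that output and is kept here exactly as shown.
race_counts = {
    "Unknown": 18919,
    "American Indian": 771,
    "White": 342174,
    "Black": 262707,
    "Hispanic": 20567,
    "Asian or Pacific Islander": 7218,
    "American Indian o...": 74,
}

# StringIndexer default: the most frequent label gets index 0.
labels = sorted(race_counts, key=race_counts.get, reverse=True)

# OneHotEncoder default (dropLast=True): the last index gets no column of its own.
encoded_labels = labels[:-1]


def one_hot(label):
    """Return the indicator vector for a label; the dropped label maps to all zeros."""
    return [1 if label == name else 0 for name in encoded_labels]


print(encoded_labels)
print(one_hot("Black"))
print(one_hot(labels[-1]))
```

The least frequent category has no coefficient of its own, so each race coefficient is measured relative to that category.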

In [16]:
print(model.stages[-1].coefficients[0])
print(model.stages[-1].coefficients[1])
print(model.stages[-1].coefficients[2])
print(model.stages[-1].coefficients[3])
print(model.stages[-1].coefficients[4])
print(model.stages[-1].coefficients[5])

-1.472667565470601
-1.5054607141732788
-3.1922771686826588
-1.909415425831092
-1.4324164228370782
-1.8795350034162341


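None of the race coefficients are equal to 0, which is some evidence that race plays at least some role in whether someone is charged with a misdemeanor or felony. This evidence is weak, however: `regParam` is left at its default of 0.0 in the `LogisticRegression` call above, so no lasso penalty is actually applied and non-zero coefficients are expected for every feature.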
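
To put the size of these coefficients in context, the difference between two one-hot coefficients is a log odds ratio between the two groups, holding charge and gender fixed. The sketch below compares the Black and White coefficients printed above; the variable names are our own.

```python
import math

# Coefficients printed above (index 0 = RaceOH_White, index 1 = RaceOH_Black).
coef_white = -1.472667565470601
coef_black = -1.5054607141732788

# Difference in log-odds of a felony charge, Black relative to White.
log_odds_ratio = coef_black - coef_white
odds_ratio = math.exp(log_odds_ratio)

print(round(log_odds_ratio, 4))
print(round(odds_ratio, 3))
```

An odds ratio close to 1 means the model assigns nearly the same felony odds to otherwise identical Black and white defendants.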

In [17]:
pred.filter(pred.RaceClean == 'White').filter(pred.Gender == 'Male')\
        .filter(pred.CodeSection == '18.2-248.1')\
        .select('Race', 'Gender', 'probability', 'prediction').take(1)

[Row(Race='White Caucasian(Non-Hispanic)', Gender='Male', probability=DenseVector([0.1572, 0.8428]), prediction=1.0)]

In [18]:
pred.filter(pred.RaceClean == 'Black').filter(pred.Gender == 'Male')\
        .filter(pred.CodeSection == '18.2-248.1')\
        .select('Race', 'Gender', 'probability', 'prediction').take(1)

[Row(Race='Black(Non-Hispanic)', Gender='Male', probability=DenseVector([0.1616, 0.8384]), prediction=1.0)]

This model predicts roughly the same chance (about 84%) of being charged with a felony under code section 18.2-248.1 for Black and white men. Other races have a small sample size, so we did not consider them here.

## Model with Just Marijuana Charges

Next, we create a similar model using just the data for marijuana possession with intent to distribute (code section 18.2-248.1). This charge can be either a misdemeanor or a felony, while marijuana possession (18.2-250.1) is always a misdemeanor. This subset of the data contains 4445 records.

In [19]:
data_mj = data.filter(data.CodeSection.isin(['18.2-248.1'])) 
# this can be felony or misdemeanor (marijuana poss w/ intent to distribute)
# 18.2-250.1 is always a misdemeanor (possession)

In [20]:
#remove charge code as a predictor since just one charge included now
#one-hot encoding and logistic modeling steps can stay the same

va = VectorAssembler(inputCols=["RaceOH", "GenderOH"], outputCol="features", 
                     handleInvalid = "skip")

pipeline_mj = Pipeline(stages=[gendInd, gend, raceInd, race, va, logm])

In [21]:
model_mj = pipeline_mj.fit(data_mj)

In [22]:
pred_mj = model_mj.transform(data_mj)

In [23]:
model_mj.stages[-1].coefficients

DenseVector([-0.2824, -0.0777, 0.6497, 1.455, 0.0645])

As in the previous model, none of the coefficients for race are equal to zero, although with so few variables in the model, this isn't as meaningful as it would be if race were a significant predictor in the presence of many other predictors.

In [24]:
pred_mj.filter(pred_mj.RaceClean == 'White').filter(pred_mj.Gender == 'Male')\
        .select('Race', 'Gender', 'probability', 'prediction').take(1)

[Row(Race='White Caucasian(Non-Hispanic)', Gender='Male', probability=DenseVector([0.1502, 0.8498]), prediction=1.0)]

In [25]:
pred_mj.filter(pred_mj.RaceClean == 'Black').filter(pred_mj.Gender == 'Male')\
        .select('Race', 'Gender', 'probability', 'prediction').take(1)

[Row(Race='Black(Non-Hispanic)', Gender='Male', probability=DenseVector([0.1783, 0.8217]), prediction=1.0)]

In [26]:
pred_mj.filter(pred_mj.RaceClean == 'Hispanic').filter(pred_mj.Gender == 'Male')\
        .select('Race', 'Gender', 'probability', 'prediction').take(1)

[Row(Race='Hispanic', Gender='Male', probability=DenseVector([0.1406, 0.8594]), prediction=1.0)]

In [27]:
pred_mj.filter(pred_mj.RaceClean == 'White').filter(pred_mj.Gender == 'Female')\
        .select('Race', 'Gender', 'probability', 'prediction').take(1)

[Row(Race='White Caucasian(Non-Hispanic)', Gender='Female', probability=DenseVector([0.1587, 0.8413]), prediction=1.0)]

In [28]:
pred_mj.filter(pred_mj.RaceClean == 'Black').filter(pred_mj.Gender == 'Female')\
        .select('Race', 'Gender', 'probability', 'prediction').take(1)

[Row(Race='Black(Non-Hispanic)', Gender='Female', probability=DenseVector([0.1879, 0.8121]), prediction=1.0)]

In [29]:
pred_mj.filter(pred_mj.RaceClean == 'Hispanic').filter(pred_mj.Gender == 'Female')\
        .select('Race', 'Gender', 'probability', 'prediction').take(1)

[Row(Race='Hispanic', Gender='Female', probability=DenseVector([0.1486, 0.8514]), prediction=1.0)]

Since this model only uses race and gender as predictors, the prediction will be the same for every white male and every Black male (and every other combination of race and gender).

In this very basic model, a white man is predicted to have an 85% chance of being charged with a felony for possession of marijuana with intent to distribute, while a Black man is predicted to have an 82% chance.

Below, we look at the percentages in the data to sanity-check our model.

In [30]:
# F.count(column) skips nulls, so the row with a null RaceClean reports a count of 0
fel = data.groupBy('RaceClean').agg(F.count(data.RaceClean).alias('count'), 
                                    F.sum(data.Felony).alias('n_felonies'))

fel.withColumn('percent', fel['n_felonies']/fel['count']).show()

+--------------------+------+----------+--------------------+
|           RaceClean| count|n_felonies|             percent|
+--------------------+------+----------+--------------------+
|                null|     0|        40|                null|
|             Unknown| 18919|       975| 0.05153549341931392|
|     American Indian|   771|        34| 0.04409857328145266|
|               White|342174|     55906| 0.16338471070274188|
|               Black|262707|     47861|  0.1821839539867609|
|            Hispanic| 20567|       112|0.005445616764720...|
|Asian or Pacific ...|  7218|       970|  0.1343862565807703|
|American Indian o...|    74|         2| 0.02702702702702703|
+--------------------+------+----------+--------------------+



In [31]:
fel = data.filter(data.CodeSection == '18.2-248.1').groupBy('RaceClean')\
            .agg(F.count(data.RaceClean).alias('count'), 
                 F.sum(data.Felony).alias('n_felonies'))

fel.withColumn('percent', fel['n_felonies']/fel['count']).show()

+--------------------+-----+----------+------------------+
|           RaceClean|count|n_felonies|           percent|
+--------------------+-----+----------+------------------+
|                null|    0|         3|              null|
|             Unknown|   37|        34| 0.918918918918919|
|               White| 1775|      1420|               0.8|
|               Black| 2566|      1984|0.7731878409976617|
|            Hispanic|    8|         7|             0.875|
|Asian or Pacific ...|   56|        48|0.8571428571428571|
+--------------------+-----+----------+------------------+



For those charged with marijuana possession with intent to distribute, 80% of white defendants and 77% of Black defendants were charged with a felony; both rates are slightly lower than those predicted by the model (for both genders).
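
The comparison can be reproduced directly from the numbers already shown. The sketch below recomputes the observed felony rates from the table counts and sets them beside the model's predicted probabilities for men; the dictionary and variable names are our own.

```python
# (n_felonies, count) for code section 18.2-248.1, copied from the table above.
observed = {"White": (1420, 1775), "Black": (1984, 2566)}

# Predicted felony probabilities for men from the marijuana-only logistic model above.
predicted_male = {"White": 0.8498, "Black": 0.8217}

rates = {}
for race, (n_felonies, count) in observed.items():
    rates[race] = n_felonies / count
    gap = predicted_male[race] - rates[race]
    print(f"{race}: observed {rates[race]:.3f}, predicted {predicted_male[race]:.3f}, gap {gap:.3f}")
```

In both groups the prediction sits roughly five percentage points above the observed rate, and the ordering (white above Black) is the same in the data and in the model.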

Other races have very small sample sizes for this charge (between 8 and 56 records each), so we cannot draw reliable conclusions about them.

## Random Forest Feature Importance

Next, we implement a random forest model to quantify the feature importance further.

In [32]:
from pyspark.ml.classification import RandomForestClassifier

In [33]:
rf = RandomForestClassifier(labelCol="Felony", featuresCol="features", numTrees=100)

In [34]:
va = VectorAssembler(inputCols=["RaceOH", "ChargeCodeOH", "GenderOH"], outputCol="features", 
                     handleInvalid = "skip")

pipeline_rf = Pipeline(stages=[gendInd, gend, raceInd, race, chargeInd, charge, va, rf])

In [35]:
model_rf = pipeline_rf.fit(data)
pred_rf = model_rf.transform(data)

In [36]:
model_rf.stages[-1].featureImportances

SparseVector(4385, {1: 0.0001, 2: 0.0058, 3: 0.012, 6: 0.0055, 7: 0.0523, 8: 0.0301, 9: 0.0248, 10: 0.0123, 11: 0.0145, 12: 0.0202, 13: 0.0091, 14: 0.005, 16: 0.0065, 17: 0.0052, 18: 0.0245, 20: 0.0011, 21: 0.005, 22: 0.0885, 23: 0.0215, 24: 0.0001, 25: 0.0131, 26: 0.0065, 27: 0.0085, 28: 0.0162, 29: 0.0022, 30: 0.0433, 31: 0.0077, 32: 0.0059, 33: 0.0015, 34: 0.0005, 35: 0.0014, 36: 0.0002, 37: 0.0322, 38: 0.0097, 39: 0.0071, 41: 0.0115, 42: 0.0475, 43: 0.0037, 44: 0.0018, 45: 0.0029, 47: 0.0135, 49: 0.0025, 50: 0.0151, 51: 0.0217, 52: 0.0001, 53: 0.0101, 54: 0.0006, 55: 0.0137, 56: 0.0008, 57: 0.0201, 58: 0.0008, 59: 0.0013, 61: 0.0361, 62: 0.0053, 63: 0.0015, 64: 0.0005, 65: 0.0005, 66: 0.0013, 67: 0.0169, 69: 0.0049, 70: 0.0061, 71: 0.0005, 72: 0.0002, 76: 0.022, 77: 0.0012, 78: 0.0171, 79: 0.0113, 81: 0.0002, 83: 0.0111, 84: 0.0005, 85: 0.0002, 87: 0.0001, 88: 0.0036, 91: 0.0001, 95: 0.0185, 97: 0.0003, 98: 0.0, 100: 0.0014, 101: 0.0127, 102: 0.0149, 104: 0.0065, 108: 0.004, 109: 0

In [37]:
meta_rf = pred_rf.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('binary')

# merge_list = numeric_metadata + binary_metadata 
feature_names = [field['name'] for field in meta_rf]

In [38]:
list(zip(model_rf.stages[-1].featureImportances.toArray(), feature_names))[:15]

[(0.0, 'RaceOH_White'),
 (0.00012706110633525647, 'RaceOH_Black'),
 (0.0058008513585579066, 'RaceOH_Hispanic'),
 (0.011998627521152817, 'RaceOH_Unknown'),
 (0.0, 'RaceOH_Asian or Pacific Islander'),
 (0.0, 'RaceOH_American Indian'),
 (0.005466628882552965, 'ChargeCodeOH_A.46.2-862'),
 (0.05230187267937904, 'ChargeCodeOH_B.46.2-301'),
 (0.030146516747104328, 'ChargeCodeOH_46.2-300'),
 (0.024756935829529585, 'ChargeCodeOH_18.2-250.1'),
 (0.012335922179153332, 'ChargeCodeOH_C.46.2-862'),
 (0.014490057647865429, 'ChargeCodeOH_18.2-250'),
 (0.02017981376068468, 'ChargeCodeOH_A.18.2-266'),
 (0.009054333832899723, 'ChargeCodeOH_18.2-388'),
 (0.004973723378849848, 'ChargeCodeOH_18.2-57')]

In [39]:
list(zip(model_rf.stages[-1].featureImportances.toArray(), feature_names))[-1]

(0.002180120217458102, 'GenderOH_Male')

The only races with a non-zero feature importance for predicting whether a crime will be a felony or not are Black, Hispanic, and Unknown. Several charges have much higher feature importance, and gender also has non-zero feature importance.

This matches expectation, since some charges will obviously have much higher rates of felonies than other charges, and some charges will always be a misdemeanor while others will always be a felony.

Next, we again look at just marijuana charges for feature importance.

In [40]:
va = VectorAssembler(inputCols=["RaceOH", "GenderOH"], outputCol="features", 
                     handleInvalid = "skip")

pipeline_mj = Pipeline(stages=[gendInd, gend, raceInd, race, va, rf])

In [41]:
model_rf = pipeline_mj.fit(data_mj)
pred_rf = model_rf.transform(data_mj)

In [42]:
#get feature names

meta_rf = pred_rf.select("features").schema[0].metadata.get('ml_attr').get('attrs').get('binary')

# merge_list = numeric_metadata + binary_metadata 
feature_names = [field['name'] for field in meta_rf]

In [43]:
#combine names with importances
list(zip(feature_names, model_rf.stages[-1].featureImportances))

[('RaceOH_Black', 0.2576646203782787),
 ('RaceOH_White', 0.25975752149859244),
 ('RaceOH_Asian or Pacific Islander', 0.17215317915534445),
 ('RaceOH_Unknown', 0.17605941299451264),
 ('GenderOH_Male', 0.13436526597327178)]

From this, we see that the race indicators together carry more feature importance than gender when predicting whether marijuana possession with intent to distribute is charged as a felony or a misdemeanor.
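
Because race is spread across four one-hot columns while gender has only one, it can help to look at both the total and the per-column average importance for each original variable. The sketch below groups the importances printed above by the prefix of the feature name; the grouping logic is our own and assumes the `<Variable>OH_<level>` naming shown in the output.

```python
from collections import defaultdict

# (feature name, importance) pairs copied from the output above.
importances = [
    ("RaceOH_Black", 0.2576646203782787),
    ("RaceOH_White", 0.25975752149859244),
    ("RaceOH_Asian or Pacific Islander", 0.17215317915534445),
    ("RaceOH_Unknown", 0.17605941299451264),
    ("GenderOH_Male", 0.13436526597327178),
]

grouped = defaultdict(float)   # total importance per original variable
n_columns = defaultdict(int)   # number of one-hot columns per original variable

for name, value in importances:
    variable = name.split("_", 1)[0]  # e.g. "RaceOH" or "GenderOH"
    grouped[variable] += value
    n_columns[variable] += 1

for variable, total in grouped.items():
    print(variable, round(total, 4), round(total / n_columns[variable], 4))
```

Race accounts for about 0.87 of the total importance and gender for about 0.13; even on a per-column basis, the average race column (about 0.22) is above the single gender column.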