### Load in libraries the data

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model, ensemble, naive_bayes, model_selection, neighbors, \
metrics, feature_extraction, preprocessing
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn
import xgboost
%matplotlib inline

In [148]:
raw_crimes = pd.read_csv('~/chicago-crimes/data/Crimes_-_2001_to_present.csv')
domestic_crimes_pre2018 = raw_crimes[(raw_crimes.Domestic == 1) & (raw_crimes.Year < 2018)].copy()
domestic_crimes_2018 = raw_crimes[(raw_crimes.Domestic == 1) & (raw_crimes.Year == 2018)].copy()

### Clean Data
In addition to elimination of some extreme outliers and suspicious data, and combining of some variables, the best result of this project requires that I eliminate certain types of information that cannot be *used*. For instance, though some domestic incidents are labeled in this dataset as "Interference with Public Officer," a 911 call for a domestic incident would likely not indicate such a crime - it's more likely that this was what the perpetrator was charged for despite the initial incident being something different. So, this type of information being included in the model would not represent a real, usable solution. While some of this type of data is cleaned up front, some was eliminated at the end, to create a final model with only the number of variables allowed within the given parameters.

In [149]:
# To make it a little easier to read, I'm converting the FBI codes to strings.
fbi_to_string = {'08B' : 'Simple Battery',
  '08A' : 'Simple Assault',
  '26' : 'Other',
  '14' : 'Vandalism', 
  '04B': 'Aggravated Battery',
  '06' : 'Larceny',
  '04A' : 'Aggravated Assault',
  '20' : 'Offenses Against Family',
  '02' : 'Criminal Sexual Assault',
  '03' : 'Robbery',
  '05' : 'Burglary',
  '11' : 'Fraud',
  '07' : 'Criminal Sexual Abuse',
  '01A' : 'Homicide',
  '09' : 'Arson',
  '10' : 'Forgery & Counterfeiting',
  '24' : 'Disorderly Conduct',
  '15' : 'Weapons Violation',
  '18' : 'Drug Abuse',
  '13' : 'Stolen Property',
  '16' : 'Prostitution',
  '22' : 'Liquor License',
  '12' : 'Embezzlement',
  '19' : 'Gambling',
  '01B' : 'Involuntary Manslaughter'
}
domestic_crimes_pre2018['FBI Code'].replace(fbi_to_string, inplace = True)
domestic_crimes_2018['FBI Code'].replace(fbi_to_string, inplace = True)

In [150]:
# Filter out codes that would likely not be identified on the phone as domestic-related
# or the crime listed is likely un-related to what would have been shared on a 911 call
# Also filter out codes that are extremely rare or suspicious (don't seem domestic-related)
useless_fbi_codes = [
  'Embezzlement', 'Gambling', 'Disorderly Conduct', 'Liquor License', 'Fraud', 
  'Forgery & Counterfeiting', 'Weapons Violation', 'Involuntary Manslaughter', 
  'Prostitution', 'Stolen Property'
]
useless_primary_types = [
  'DECEPTIVE PRACTICE', 'PUBLIC PEACE VIOLATION', 'NARCOTICS', 'WEAPONS VIOLATION', 
  'INTERFERENCE WITH PUBLIC OFFICER', 'OBSCENITY', 'LIQUOR LAW VIOLATION',
  'PROSTITUTION', 'NON-CRIMINAL (SUBJECT SPECIFIED)', 'RITUALISM', 'HUMAN TRAFFICKING', 
  'GAMBLING', 'DOMESTIC VIOLENCE', 'PUBLIC INDECENCY', 'NON-CRIMINAL'
]
domestic_crimes_pre2018_clean = domestic_crimes_pre2018[
    ~(domestic_crimes_pre2018['FBI Code'].isin(useless_fbi_codes)) & 
    ~(domestic_crimes_pre2018['Primary Type'].isin(useless_primary_types))
].copy()
domestic_crimes_2018_clean = domestic_crimes_2018[
    ~(domestic_crimes_2018['FBI Code'].isin(useless_fbi_codes)) & 
    ~(domestic_crimes_2018['Primary Type'].isin(useless_primary_types))
].copy()

In [151]:
# Some location descriptions are so rare, they should be categorized as "other" for practical
# purposes(it would never make sense to put in the dialog box a location type that)
# occurs less than once per month. Some categories could be combined for later iterations of
# this model
loc_occurrences = domestic_crimes_pre2018_clean['Location Description'].value_counts()
locs_to_other = loc_occurrences[loc_occurrences < 17*12].index.tolist() + [np.NaN]
domestic_crimes_pre2018_clean['Location Description'].replace(locs_to_other, 
                                                              'OTHER', 
                                                              inplace = True)
domestic_crimes_2018_clean['Location Description'].replace(locs_to_other, 
                                                              'OTHER', 
                                                              inplace = True)
# Replace missing Community Areas and Wards with a value for missing: 0
domestic_crimes_pre2018_clean['Community Area'].fillna(0, inplace = True)
domestic_crimes_pre2018_clean['Ward'].fillna(0, inplace = True)
domestic_crimes_pre2018_clean['District'].fillna(0, inplace = True)
domestic_crimes_2018_clean['Community Area'].fillna(0, inplace = True)
domestic_crimes_2018_clean['Ward'].fillna(0, inplace = True)
domestic_crimes_2018_clean['District'].fillna(0, inplace = True)

# Get rid of those missing X and Y coordinates
domestic_crimes_pre2018_clean.dropna(subset = ['X Coordinate', 'Y Coordinate'], inplace= True)
domestic_crimes_2018_clean.dropna(subset = ['X Coordinate', 'Y Coordinate'], inplace= True)

### Extract Features
I will extract some features from the Date, as well as from the Descriptions provided. Because the descriptions themselves are too numerous to be used as their own variables, I will extract individual words and create binary columns indicating the presence or absence of these words. Some of these words will be usable to extract attributes to be included in the dialog box (e.g. "knife", "handgun", "harassment") and others will not be (e.g. "person", "electronic"). After an initial model is created with these in, I will sort through the significant words to determine which should be kept in the final model.

NOTE: In the final version, I got rid of the Descriptions part because it requires too much manual work for the time alotted. With more time, I would have gone throught he predictive Description words and tried to pull out patterns among the highly predictive words.

In [161]:
# Create a column with Date as a datetime object
domestic_crimes_pre2018_clean['datetime'] = domestic_crimes_pre2018_clean.Date.apply(
    lambda x: datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p'))
domestic_crimes_2018_clean['datetime'] = domestic_crimes_2018_clean.Date.apply(
    lambda x: datetime.strptime(x, '%m/%d/%Y %I:%M:%S %p'))

# Extract date information
domestic_crimes_pre2018_clean['dayofweek'] = domestic_crimes_pre2018_clean.datetime.apply(
    lambda x: x.dayofweek)
domestic_crimes_pre2018_clean['dayofyear'] = domestic_crimes_pre2018_clean.datetime.apply(
    lambda x: x.dayofyear)
domestic_crimes_pre2018_clean['weekofyear'] = domestic_crimes_pre2018_clean.datetime.apply(
    lambda x: x.weekofyear)
domestic_crimes_pre2018_clean['hour'] = domestic_crimes_pre2018_clean.datetime.apply(
    lambda x: x.hour)
domestic_crimes_pre2018_clean['month'] = domestic_crimes_pre2018_clean.datetime.apply(
    lambda x: x.month)

domestic_crimes_2018_clean['dayofweek'] = domestic_crimes_2018_clean.datetime.apply(
    lambda x: x.dayofweek)
domestic_crimes_2018_clean['dayofyear'] = domestic_crimes_2018_clean.datetime.apply(
    lambda x: x.dayofyear)
domestic_crimes_2018_clean['weekofyear'] = domestic_crimes_2018_clean.datetime.apply(
    lambda x: x.weekofyear)
domestic_crimes_2018_clean['hour'] = domestic_crimes_2018_clean.datetime.apply(
    lambda x: x.hour)
domestic_crimes_2018_clean['month'] = domestic_crimes_2018_clean.datetime.apply(
    lambda x: x.month)

In [7]:
# # Pull common words out of descriptions, and create a binary variable representing the presence
# # or absense of those words.

# # create and fit a CountVectorizer for the words in the Description column
# descriptionVectorizer = feature_extraction.text.CountVectorizer(max_features=100,stop_words="english")
# descriptionVectorizer.fit(domestic_crimes_pre2018_clean.Description)

# # Create a dataframe with indicators for each row of wither a particular word is in the Description
# words_as_cols = [x + '_in_description' for x in descriptionVectorizer.get_feature_names()]
# description_word_df_pre2018  = pd.DataFrame(
#     descriptionVectorizer.transform(domestic_crimes_pre2018_clean.Description).todense(),
#     columns = words_as_cols,
#     index = domestic_crimes_pre2018_clean.index
# )
# description_word_df_2018  = pd.DataFrame(
#     descriptionVectorizer.transform(domestic_crimes_2018_clean.Description).todense(),
#     columns = words_as_cols,
#     index = domestic_crimes_2018_clean.index
# )
# # add these columsn into the dataframe
# domestic_crimes_pre2018_clean = domestic_crimes_pre2018_clean.join(description_word_df_pre2018)
# domestic_crimes_2018_clean = domestic_crimes_2018_clean.join(description_word_df_pre2018)

### Data Preparation for Modeling
I start be eliminating variables that can't or shouldn't be used for modeling:

- ID and Case Number are unique identifiers
- I've already extracted the information I need from Date and "datetime"
- "Block" is too granular, using other location-related variables instead
- I'm using Primary Type and Description in place of IUCR
- I've already pulled words out of "Description"
- The Domestic flag isn't useful since I filtered to only include domestics
- "Updated On" won't be useful since that's not information that could be included in the 911 call
- I'm using X and Y coordinate instead of Latitude, Longitude, or "Location"

Any features relating to automatically collected information (i.e. those relating to date, time, and location) can be manipulated into any format desired, but any variable that would have to be input by the 911 operator must be put into binary (representing the checkbox) or categorical with <= 5 features (representing the pull-down menu). I will binarize these variables.

For the categorical variables that are location-based (and can therefore be categorized in whichever way I'd like), I will do a K-fold likelihood target encoding.

In [162]:
# Eliminate variables that I won't use for any models:
train = domestic_crimes_pre2018_clean.drop([
    'ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Description', 'Domestic',
    'Updated On', 'Latitude', 'Longitude', 'Location', 'datetime'
], axis = 1).sample(frac = .1).copy()
test = domestic_crimes_2018_clean.drop([
    'ID', 'Case Number', 'Date', 'Block', 'IUCR', 'Description', 'Domestic',
    'Updated On', 'Latitude', 'Longitude', 'Location', 'datetime'
], axis = 1).copy()

In [163]:
# Create the dummy variables (b)
for col in ['Primary Type', 'Location Description', 'FBI Code']:
    # Instantiate a label binarizer
    this_lb = preprocessing.LabelBinarizer()
    # Fit the binarizer on the training set
    this_lb.fit(train[col])
    # Apply the binarizer to both sets
    this_train_binarized = this_lb.transform(train[col])
    this_test_binarized = this_lb.transform(test[col])
    # Put results into an easier-to-read dataframe
    new_cols = [col + '_' + cl for cl in this_lb.classes_]
    this_train_df = pd.DataFrame(this_train_binarized, 
                                 columns = new_cols,
                                 index = train.index)
    this_test_df = pd.DataFrame(this_test_binarized, 
                                columns = new_cols,
                                index = test.index)
    # Combine the results into the train and test sest
    train = train.join(this_train_df).drop(col, axis = 1)
    test = test.join(this_test_df).drop(col, axis = 1)

In [164]:
class CustomEncodeAndCV:
    def __init__(self, data, random_state = None):
        self.data = data
        self.random_state = random_state
        self.target_encodings = {}
        
    def split(self, K = 3):
        """Splits the data into random parst for cross-validation"""
        # Randomize the data
        rand_data = self.data.sample(frac = 1, random_state = self.random_state)
        # Get the indices for each of the K parts
        n_per_sample = float(self.data.shape[0]) / K
        self.split_indices = [rand_data.index[int(n_per_sample*i):int(n_per_sample*(i+1))] for i in range(K)]
        
    def kfoldTargetEncode(self, column, K = 10):
        """Creates a list of the k-fold target encodings for each split"""
        # instantiate a list and iterate through each of the splits from above
        means = []
        for indices in self.split_indices:
            # Randomize the data subset to do a K-fold split
            this_data = self.data.loc[indices]
            this_rand_data = this_data.sample(frac = 1, random_state = self.random_state)
            this_n_per_sample = float(this_rand_data.shape[0]) / K
            this_split_indices = [this_rand_data.index[int(this_n_per_sample*i):int(this_n_per_sample*(i+1))] for i in range(K)]
            
            # Iterate through the sub-splits and find the categorical mean for each fold 
            # for the given column
            this_mean_df = pd.DataFrame(index = set(this_data[column]))
            for i, sub_split in enumerate(this_split_indices):
                sub_data = this_data.loc[sub_split]
                this_pct_arrested_by_cat = sub_data.groupby(column).Arrest.agg(
                    [('pct_arrested' + str(i), lambda x: float(sum(x))/len(x))])
                this_mean_df = this_mean_df.join(this_pct_arrested_by_cat)
            # find the mean of these means. If one has an NaN, fill with overall mean
            this_means = this_mean_df.mean(axis = 1)
            this_means.fillna(this_means.mean(), inplace = True)
            means.append(this_means)
        # store in target_encodings attribute
        self.target_encodings[column] = means
    
    def binarize(self, column):
        # create a binarizer for this column
        lb = preprocessing.LabelBinarizer()
        lb.fit(self.data[column])
        # Create a dataframe representing the binarizer, and put into the dataframe.
        lb_df = pd.DataFrame(lb.transform(self.data[column]),
                            columns = [column + '_' + cl for cl in lb.classes_],
                            index = self.data.index)
        self.data = self.data.join(lb_df)
        # drop the binarized column
        self.data.drop(column, axis = 1, inplace = True)
        
    def crossValidate(self, estimator, y_column):
        """Iterate through each fold. Build model on the other K-1 folds and predict
        on that fold. If target_encodings is defined for a column, replace it with the
        target encoded version."""
        X = self.data.drop(y_column, axis = 1).copy()
        y = self.data[y_column]
        predictions = []
        predict_probas = []
        
        ys = []
        for i, indices in enumerate(self.split_indices):
            # Set the test-train splits
            X_test = X.loc[indices]
            X_train = X[~X.index.isin(indices)]
            y_train = y[~X.index.isin(indices)]
            
            # Set the value of target encoded variables to the mean of the K-1 training folds
            train_folds = [n for n in range(len(self.split_indices)) if n != i]
            for column, target_encs in self.target_encodings.iteritems():
                comb_target_encodings = pd.DataFrame(target_encs[train_folds[0]])
                for fold in train_folds[1:]:
                    comb_target_encodings = comb_target_encodings.join(
                        pd.DataFrame(target_encs[fold]), lsuffix = 'a', rsuffix = 'b')
                train_target_encodings = comb_target_encodings.mean(axis = 1)
                train_target_encodings.fillna(train_target_encodings.mean(), inplace = True)
                
                # Replace each value in the target encoded columns with the encoding
                X_test[column].replace(train_target_encodings, inplace = True)
            estimator.fit(X_train, y_train)
            predictions += estimator.predict(X_test).tolist()
            predict_probas += estimator.predict_proba(X_test)[:,1].tolist()
            ys += y.loc[indices].tolist()
        
        self.y_true = ys
        self.predictions = predictions
        self.predict_proba = predict_probas
        self.estimator = estimator

In [191]:
initial_model = CustomEncodeAndCV(train)
initial_model.split()
initial_model.kfoldTargetEncode('Beat')
initial_model.kfoldTargetEncode('Community Area')
initial_model.kfoldTargetEncode('Ward')
initial_model.kfoldTargetEncode('District')

### Selecting most important variables
To determine which things should be included in the Dialog Box, I looked at which binary variables are the most predictive, on Logistic Regression (Lasso) and XGBoost (generally good tree ensemble).

In [168]:
initial_model.crossValidate(linear_model.LogisticRegression(penalty = 'l1'))
metrics.roc_auc_score(y_score = initial_model.predict_proba, 
                      y_true = initial_model.y_true)

0.6065825453606764

In [172]:
pd.DataFrame({'coef' : initial_model.estimator.coef_.flatten().tolist(),
              'odds_ratio' : [np.e**c for c in initial_model.estimator.coef_.flatten().tolist()],
             'col' : initial_model.data.drop('Arrest',axis = 1).columns}).sort_values('coef')

Unnamed: 0,coef,col,odds_ratio
42,-0.996646,Location Description_DAY CARE CENTER,0.369115
38,-0.783414,Location Description_COMMERCIAL / BUSINESS OFFICE,0.456844
47,-0.637768,Location Description_GOVERNMENT BUILDING/PROPERTY,0.528470
66,-0.575805,Location Description_SMALL RETAIL STORE,0.562252
21,-0.573394,Primary Type_KIDNAPPING,0.563609
28,-0.527565,Primary Type_THEFT,0.590040
62,-0.510918,"Location Description_SCHOOL, PRIVATE, BUILDING",0.599944
52,-0.441638,Location Description_NURSING HOME/RETIREMENT HOME,0.642982
79,-0.434732,FBI Code_Larceny,0.647438
56,-0.344642,Location Description_POLICE FACILITY/VEH PARKI...,0.708474


In [192]:
initial_model.crossValidate(xgboost.XGBClassifier(), 'Arrest')
metrics.roc_auc_score(y_score = initial_model.predict_proba, 
                      y_true = initial_model.y_true)

  if diff:
  if diff:
  if diff:


0.6196643164246418

Predictive Locations, which could be included in a dropdown:

- Location Description_DAY CARE CENTER
- Location Description_COMMERCIAL / BUSINESS OFFICE
- Location Description_GOVERNMENT BUILDING/PROPERTY
- Location Description_SMALL RETAIL STORE
- Location Description_SCHOOL, PRIVATE, BUILDING
- Location Description_NURSING HOME/RETIREMENT HOME
- Location Description_CTA BUS STOP
- Location Description_PARK PROPERTY
- Location Description_HOTEL/MOTEL
- OTHER

Things that would be predictive for checkboxes:
- Primary Type_THEFT
- Primary Type_CRIMINAL TRESPASS	
- FBI Code_Aggravated Assault
- FBI Code_Aggravated Battery
- FBI Code_Offenses Against Family

Rarer circumstances that could be grouped as a dropdown:
- Primary Type_HOMICIDE
- Primary Type_ARSON
- Primary Type_KIDNAPPING
- OTHER

In [182]:
final_train = train[['Arrest', 'Beat', 'District', 'Ward', 'Community Area', 'X Coordinate',
                     'Y Coordinate', 'Year', 'dayofweek', 'dayofyear', 'weekofyear', 
                     'hour', 'month', 'Location Description_DAY CARE CENTER', 
                     'Location Description_COMMERCIAL / BUSINESS OFFICE', 
                     'Location Description_GOVERNMENT BUILDING/PROPERTY',
                     'Location Description_SMALL RETAIL STORE',
                     'Location Description_SCHOOL, PRIVATE, BUILDING',
                     'Location Description_NURSING HOME/RETIREMENT HOME',
                     'Location Description_CTA BUS STOP',
                     'Location Description_PARK PROPERTY',
                     'Location Description_HOTEL/MOTEL',
                     'Primary Type_THEFT',
                     'Primary Type_CRIMINAL TRESPASS',
                     'FBI Code_Aggravated Assault',
                     'FBI Code_Aggravated Battery',
                     'FBI Code_Offenses Against Family',
                     'Primary Type_HOMICIDE',
                     'Primary Type_ARSON',
                     'Primary Type_KIDNAPPING']]
final_test = test[['Arrest', 'Beat', 'District', 'Ward', 'Community Area', 'X Coordinate',
                     'Y Coordinate', 'Year', 'dayofweek', 'dayofyear', 'weekofyear', 
                     'hour', 'month', 'Location Description_DAY CARE CENTER', 
                     'Location Description_COMMERCIAL / BUSINESS OFFICE', 
                     'Location Description_GOVERNMENT BUILDING/PROPERTY',
                     'Location Description_SMALL RETAIL STORE',
                     'Location Description_SCHOOL, PRIVATE, BUILDING',
                     'Location Description_NURSING HOME/RETIREMENT HOME',
                     'Location Description_CTA BUS STOP',
                     'Location Description_PARK PROPERTY',
                     'Location Description_HOTEL/MOTEL',
                     'Primary Type_THEFT',
                     'Primary Type_CRIMINAL TRESPASS',
                     'FBI Code_Aggravated Assault',
                     'FBI Code_Aggravated Battery',
                     'FBI Code_Offenses Against Family',
                     'Primary Type_HOMICIDE',
                     'Primary Type_ARSON',
                     'Primary Type_KIDNAPPING']]

In [201]:
final_model = CustomEncodeAndCV(final_train)
final_model.split()
final_model.kfoldTargetEncode('Beat')
final_model.kfoldTargetEncode('Community Area')
final_model.kfoldTargetEncode('Ward')
final_model.kfoldTargetEncode('District')

In [202]:
final_model.crossValidate(xgboost.XGBClassifier(), 'Arrest')

  if diff:
  if diff:
  if diff:


In [203]:
metrics.roc_auc_score(y_score = final_model.predict_proba, y_true = final_model.y_true)

0.5854200392499024

### Now test the model on the 2018 data

In [206]:
# kfold target encoder
def kfoldTargetEncode(data, column, K = 10, random_state = None):
    # Randomize the data subset to do a K-fold split
    this_data = data.loc[indices]
    this_rand_data = this_data.sample(frac = 1, random_state = random_state)
    this_n_per_sample = float(this_rand_data.shape[0]) / K
    this_split_indices = [this_rand_data.index[int(this_n_per_sample*i):int(this_n_per_sample*(i+1))] for i in range(K)]

    # Iterate through the sub-splits and find the categorical mean for each fold 
    # for the given column
    this_mean_df = pd.DataFrame(index = set(this_data[column]))
    for i, sub_split in enumerate(this_split_indices):
        sub_data = this_data.loc[sub_split]
        this_pct_arrested_by_cat = sub_data.groupby(column).Arrest.agg(
            [('pct_arrested' + str(i), lambda x: float(sum(x))/len(x))])
        this_mean_df = this_mean_df.join(this_pct_arrested_by_cat)
    # find the mean of these means. If one has an NaN, fill with overall mean
    this_means = this_mean_df.mean(axis = 1)
    this_means.fillna(this_means.mean(), inplace = True)
    return(this_means)

#Recode to the target encoder
for col in ['District', 'Ward', 'Community Area', 'Beat']:
    recoding = kfoldTargetEncode(final_train, col)
    final_train[col].replace(recoding)
    final_test[col].replace(recoding)

Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
https://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike


XGBoost was the best out-of-the-box classifier, so for time I will use that as my final model

In [246]:
# Pull the predicted probabilities
xgb = xgboost.XGBClassifier()
xgb.fit(X = final_train.drop('Arrest', axis = 1),
       y = final_train.Arrest)
pred_probas = xgb.predict_proba(final_test.drop('Arrest', axis = 1))[:,1]

In [247]:
# overall score
metrics.roc_auc_score(final_test.Arrest, pred_probas)

0.5919930383742291

In [248]:
max(final_test.dayofyear)

242

To see what this model can do for us, let's assume that, on average, CPD can only check 20 calls per day. So, in 242 days of 2018, that's 4,840 total that they can check. Let's see how many that require arrest would be captured if only selecting the top 4,840

In [251]:
# Select the threshold for the top 4840 most likely to end in arrest, check out confusion matrix
threshold = sorted(pred_probas, reverse = True)[4840]
category_pred = [1 if p >= threshold else 0 for p in pred_probas]
metrics.confusion_matrix(y_pred=category_pred, y_true=final_test.Arrest)

array([[20549,  3681],
       [ 3614,  1167]])

In [266]:
print('Random 20 per day, avg percent ending in arrest: {}%'.format(
    round(100*float(sum(final_test.Arrest))/len(final_test.Arrest), 2)))
print('Model-predicted 20 per day, avg percent ending in arrest: {}%'.format(
    round(100*float(1167)/(1167 + 3861), 2)))

Random 20 per day, avg percent ending in arrest: 16.48%
Model-predicted 20 per day, avg percent ending in arrest: 23.21%


**Final result: The model improves the chances of reviewing an arrest record that will end in an arrest from 16.5% to 23%, meaning it's about 40% more likely that a record selected for review by this model will actually be one the department prefers to send a Community Support Officer out to. There is certainly plenty of room for improvement on this model, but it is a good start**