# NYPD Allegations
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    * Predict the outcome of an allegation (might need to feature engineer your output column).
    * Predict the complainant or officer ethnicity.
    * Predict the amount of time between the month received vs month closed (difference of the two columns).
    * Predict the rank of the officer.

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.

# Summary of Findings


### **Introduction**
Our project uses the NYPD allegations dataset, and we pose the following prediction question:
> 
__Are we able to predict the age of the Complainant?__
> 
This is a regression problem because we are predicting a continuous, real-valued quantity: a person's age.
Our target variable is the age of the complainant ("complainant_age_incident"), and our evaluation metrics are R^2 and the Root Mean Squared Error (RMSE).
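
As a rough illustration of the two metrics named above, the sketch below computes R^2 and RMSE by hand with NumPy. The helper names `r_squared` and `rmse` and the toy ages are assumptions made for this example; they are not part of the project code.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return float(1.0 - ss_res / ss_tot)

def rmse(y_true, y_pred):
    """Root Mean Squared Error, in the same units as the target (years)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy ages, purely illustrative (not taken from the dataset).
ages = np.array([20.0, 30.0, 40.0, 50.0])
mean_only = np.full_like(ages, ages.mean())  # always predict the mean age

print(r_squared(ages, mean_only))  # a mean-only predictor scores R^2 = 0
print(rmse(ages, mean_only))
```

An R^2 of 0 corresponds to always predicting the mean age, which is a useful reference point when reading the small R^2 values reported below.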



### **Baseline Model**
For our baseline model, we chose three features from the dataset to predict the complainant's age: the officer's age (Quantitative), the officer's gender (Nominal), and the complainant's gender (Nominal).
> 
Our baseline model's R^2 was ~1.64% (0.0164), and the RMSE was ~11.96 years. An R^2 this close to 0 means the model explains very little of the variance in complainant age; the Final Model gives us a point of comparison for these values.


### **Final Model**
For our Final Model, we added two engineered features, both derived from the officer age column: a standardized version (z-scores, via a Standardizer) and a binarized version that roughly separates older officers from younger ones. Applying two different transformers to the same column lets us see how much the officer age contributes to our predictions.
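
To make the two engineered features concrete, here is a minimal sketch of what a standardizer and a binarizer produce for a handful of ages. The ages are hypothetical, and the threshold of 32 is the one used in the Code section.

```python
import numpy as np
from sklearn.preprocessing import Binarizer, StandardScaler

# Hypothetical officer ages, chosen only to illustrate the two transforms.
ages = np.array([[24.0], [32.0], [33.0], [50.0]])

# z-scores: subtract the column mean and divide by the column standard deviation.
z_scores = StandardScaler().fit_transform(ages)

# Binarizer maps values strictly greater than the threshold to 1, everything else to 0.
is_older = Binarizer(threshold=32).fit_transform(ages)

print(z_scores.ravel())
print(is_older.ravel())  # [0. 0. 1. 1.]
```

One thing to keep in mind: tree-based models split on thresholds, so a monotone rescaling such as standardization generally does not change which splits are available to the tree.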
> 
We considered two models, LinearRegression and DecisionTreeRegressor, and ultimately chose DecisionTreeRegressor. Decision trees are not restricted to classification problems: a regression tree fits the patterns in the data by splitting on feature values and predicting a number at each leaf. The model also exposes several hyperparameters, which gave us the flexibility to test many combinations with GridSearchCV. After running a grid search over four DecisionTreeRegressor parameters, we found that the best combination was: {'max_depth':13, 'min_samples_leaf':2, 'min_samples_split':2, 'max_leaf_nodes':40}.

### **Fairness Evaluation**
For our Fairness Evaluation, we compared the model's performance on old officers versus young ones using a permutation test. With a threshold of 30 years old, we split the test set into young officers (30 or younger) and old officers (older than 30). Because this is a regression question rather than a classification one, we used the difference in R^2 between the two groups as our parity measure; it plays the role of an accuracy parity measure by checking whether the model performs equally well on both groups. The Null Hypothesis was that the model is fair, meaning the R^2 values of the two groups are similar. The Alternative Hypothesis was that the model is unfair, meaning the R^2 values are not close to each other. After running the permutation test with 100 repetitions at a significance level of 0.05, we obtained a p-value of about 0.44; since 0.44 > 0.05, we failed to reject the Null Hypothesis. This leads us to believe that the model is fair with respect to these two officer age groups, in the sense that we found no significant difference in R^2.

# Code

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

Here, we clean up basic missing values by replacing the string "Unknown" with np.nan:

In [None]:
data_fp = os.path.join('sample_data', 'allegations_202007271729.csv')
data = pd.read_csv(data_fp)
# Replacing all "Unknown" values with nulls.
cleaned = data.replace('Unknown', np.nan)
cleaned.head()

Unnamed: 0,unique_mos_id,first_name,last_name,command_now,shield_no,complaint_id,month_received,year_received,month_closed,year_closed,command_at_incident,rank_abbrev_incident,rank_abbrev_now,rank_now,rank_incident,mos_ethnicity,mos_gender,mos_age_incident,complainant_ethnicity,complainant_gender,complainant_age_incident,fado_type,allegation,precinct,contact_reason,outcome_description,board_disposition
0,10004,Jonathan,Ruiz,078 PCT,8409,42835,7,2019,5,2020,078 PCT,POM,POM,Police Officer,Police Officer,Hispanic,M,32,Black,Female,38.0,Abuse of Authority,Failure to provide RTKA card,78.0,Report-domestic dispute,No arrest made or summons issued,Substantiated (Command Lvl Instructions)
1,10007,John,Sears,078 PCT,5952,24601,11,2011,8,2012,PBBS,POM,POM,Police Officer,Police Officer,White,M,24,Black,Male,26.0,Discourtesy,Action,67.0,Moving violation,Moving violation summons issued,Substantiated (Charges)
2,10007,John,Sears,078 PCT,5952,24601,11,2011,8,2012,PBBS,POM,POM,Police Officer,Police Officer,White,M,24,Black,Male,26.0,Offensive Language,Race,67.0,Moving violation,Moving violation summons issued,Substantiated (Charges)
3,10007,John,Sears,078 PCT,5952,26146,7,2012,9,2013,PBBS,POM,POM,Police Officer,Police Officer,White,M,25,Black,Male,45.0,Abuse of Authority,Question,67.0,PD suspected C/V of violation/crime - street,No arrest made or summons issued,Substantiated (Charges)
4,10009,Noemi,Sierra,078 PCT,24058,40253,8,2018,2,2019,078 PCT,POF,POF,Police Officer,Police Officer,Hispanic,F,39,,,16.0,Force,Physical force,67.0,Report-dispute,Arrest - other violation/crime,Substantiated (Command Discipline A)


### Baseline Model

Before we get to the baseline, we will clean up some of the strings in the complainant gender column in order to clearly distinguish between the Male and Female genders.

In [None]:
def m_or_w(x):
    # Collapse the complainant gender labels into 'Male' / 'Female';
    # any other value (including missing) becomes NaN.
    if x in ('Male', 'Transman (FTM)'):
        return 'Male'
    elif x in ('Female', 'Transwoman (MTF)'):
        return 'Female'
    else:
        return np.nan

In [None]:
cleaned['complainant_gender'] = cleaned['complainant_gender'].apply(m_or_w)
cleaned = cleaned.dropna() # Drops 33358 - 27176 = 6182 rows
cleaned.head()

Unnamed: 0,unique_mos_id,first_name,last_name,command_now,shield_no,complaint_id,month_received,year_received,month_closed,year_closed,command_at_incident,rank_abbrev_incident,rank_abbrev_now,rank_now,rank_incident,mos_ethnicity,mos_gender,mos_age_incident,complainant_ethnicity,complainant_gender,complainant_age_incident,fado_type,allegation,precinct,contact_reason,outcome_description,board_disposition
0,10004,Jonathan,Ruiz,078 PCT,8409,42835,7,2019,5,2020,078 PCT,POM,POM,Police Officer,Police Officer,Hispanic,M,32,Black,Female,38.0,Abuse of Authority,Failure to provide RTKA card,78.0,Report-domestic dispute,No arrest made or summons issued,Substantiated (Command Lvl Instructions)
1,10007,John,Sears,078 PCT,5952,24601,11,2011,8,2012,PBBS,POM,POM,Police Officer,Police Officer,White,M,24,Black,Male,26.0,Discourtesy,Action,67.0,Moving violation,Moving violation summons issued,Substantiated (Charges)
2,10007,John,Sears,078 PCT,5952,24601,11,2011,8,2012,PBBS,POM,POM,Police Officer,Police Officer,White,M,24,Black,Male,26.0,Offensive Language,Race,67.0,Moving violation,Moving violation summons issued,Substantiated (Charges)
3,10007,John,Sears,078 PCT,5952,26146,7,2012,9,2013,PBBS,POM,POM,Police Officer,Police Officer,White,M,25,Black,Male,45.0,Abuse of Authority,Question,67.0,PD suspected C/V of violation/crime - street,No arrest made or summons issued,Substantiated (Charges)
5,10012,Paula,Smith,078 PCT,4021,37256,5,2017,10,2017,078 PCT,SGT,SGT,Sergeant,Sergeant,Black,F,50,White,Male,31.0,Abuse of Authority,Refusal to process civilian complaint,78.0,C/V telephoned PCT,No arrest made or summons issued,Substantiated (Command Lvl Instructions)


Now, we will take our cleaned dataset and select only the features that we believe have influence on the age of the complainant. The features are the gender and age of the officer, and the gender of the complainant, to start our Baseline Model off. We will then split them into train and test sets, with train set size of 75%, and test set size of 25%.

In [None]:
feats = ['mos_gender', 'mos_age_incident', 'complainant_gender']

X = cleaned[feats]
y = cleaned['complainant_age_incident']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=80)
X_train.head()

Unnamed: 0,mos_gender,mos_age_incident,complainant_gender
4282,M,44,Male
10674,M,31,Male
8817,M,31,Female
20629,M,23,Male
16370,M,40,Male


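For our Baseline Model, we one-hot encode the officer and complainant gender columns and intend to leave the quantitative officer age column untransformed. Note that ColumnTransformer drops any column not listed in its transformers by default (remainder='drop'), so as written below the officer age column is not passed on to the regressor; remainder='passthrough' would keep it. We then fit the entire pipeline, ending in LinearRegression, on our train set and report the R^2 on the test set.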

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression
from sklearn.compose import ColumnTransformer

feats = ['mos_gender', 'complainant_gender']
feats_transformer = OneHotEncoder()
# Columns not listed here (mos_age_incident) are dropped, since remainder defaults to 'drop'.
preproc = ColumnTransformer(transformers=[('onehot', feats_transformer, feats)])

pl = Pipeline([('one-hot', preproc), ('lin-reg', LinearRegression())])
pl.fit(X_train, y_train)

# R^2
r2 = pl.score(X_test, y_test)
print("R^2: " + str(r2))

R^2: 0.01640671497490509


We see above that we get a low R^2 value of 1.64%. Let's see the Root Mean Squared Error for our predictions on the test set:

In [None]:
pred_test = pl.predict(X_test)
rmse_test = np.sqrt(np.mean((pred_test - y_test)**2))
print("RMSE: " + str(rmse_test))

RMSE: 11.955282233568317


We see that we get an RMSE of approximately 11.96.
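
The baseline ColumnTransformer above lists only the two gender columns, and by default any column that is not listed is dropped. The sketch below, on made-up rows that reuse the notebook's column names, contrasts the default with `remainder='passthrough'`; it is an illustration of the option, not a re-run of the baseline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy rows with the same column names as the baseline features (values are made up).
toy = pd.DataFrame({
    'mos_gender': ['M', 'F', 'M', 'F'],
    'mos_age_incident': [44, 31, 23, 40],
    'complainant_gender': ['Male', 'Female', 'Female', 'Male'],
})
cat_cols = ['mos_gender', 'complainant_gender']

# Default remainder='drop': only the one-hot columns survive.
dropped = ColumnTransformer([('onehot', OneHotEncoder(), cat_cols)])

# remainder='passthrough': unlisted columns are appended after the encoded ones.
kept = ColumnTransformer([('onehot', OneHotEncoder(), cat_cols)],
                         remainder='passthrough')

print(dropped.fit_transform(toy).shape)  # (4, 4): two categories per gender column
print(kept.fit_transform(toy).shape)     # (4, 5): plus the untouched age column
```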

### Final Model

For our Final Model, we keep the one-hot encoder from the baseline model and add two engineered features.
> 
Specifically, we standardize the officer age column and also binarize it (age > 32 is represented as 1, and age <= 32 is represented as 0), which gives us two engineered features. For our regression model, we use the DecisionTreeRegressor, which fits the patterns in the data with a decision tree.

In [None]:
import sklearn.preprocessing as pp
from sklearn.tree import DecisionTreeRegressor

In [None]:
std_feat = ['mos_age_incident']
std_transformer = Pipeline(steps=[('scaler', pp.StandardScaler())])

bin_feat = ['mos_age_incident']
bin_transformer = Pipeline(steps=[('binarizer', pp.Binarizer(threshold=32))])

others_feat = ['mos_gender', 'complainant_gender']
others_transformer = Pipeline(steps=[('ohe', OneHotEncoder())])

preproc = ColumnTransformer(transformers=[('std', std_transformer, std_feat), ('bin', bin_transformer, bin_feat), ('others', others_transformer, others_feat)])

pl2 = Pipeline(steps=[('preprocessor', preproc), ('regressor', DecisionTreeRegressor())])

In [None]:
pl2.fit(X_train, y_train)

rtwo = pl2.score(X_test, y_test)
print("R^2: " + str(rtwo))

pred_test = pl2.predict(X_test)
rmse_test = np.sqrt(np.mean((pred_test - y_test)**2))
print("RMSE: " + str(rmse_test))

R^2: 0.01363617540516182
RMSE: 11.972107933405933


As shown, our new pipeline gives us an R^2 of 1.36%, slightly below the baseline's 1.64%, as well as an RMSE of 11.972.
>
In an effort to improve these results, we will perform a GridSearch on the DecisionTreeRegressor parameters max_depth, min_samples_split, min_samples_leaf, and max_leaf_nodes, in order to determine which combination of parameter values yields the best cross-validated score.

In [None]:
from sklearn.model_selection import GridSearchCV

parameters = {
    'max_depth': [2,3,4,5,7,10,13,15,18,None],
    'min_samples_split': [2,3,5,7,10,15,20],
    'min_samples_leaf': [1,2,3,5,7,10,15,20],
    'max_leaf_nodes': np.arange(5,51,5)
}

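# Apply the already-fitted preprocessor to the training features and name the resulting columns.
arraye = pl2.named_steps['preprocessor'].transform(X_train)
firststep = pd.DataFrame(arraye, columns=['std_scaler_mosage', 'is_older32', 'is_mosFemale', 'is_mosMale', 'is_complainantFemale', 'is_complainantMale'])

# 5-fold cross-validated grid search over the tree hyperparameters.
clf = GridSearchCV(DecisionTreeRegressor(), parameters, cv = 5)
clf.fit(firststep, y_train)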

clf.best_params_

{'max_depth': 13,
 'max_leaf_nodes': 40,
 'min_samples_leaf': 2,
 'min_samples_split': 2}

Here, we see that our GridSearch tells us that the best parameters are:
* 'max_depth': 13
* 'min_samples_leaf': 2
* 'min_samples_split': 2
* 'max_leaf_nodes': 40


Let's see if these parameters improve our results.


In [None]:
pl2 = Pipeline(steps=[('preprocessor', preproc), ('regressor', DecisionTreeRegressor(max_depth=13, min_samples_leaf=2, min_samples_split=2, max_leaf_nodes=40))])
pl2.fit(X_train, y_train)

rtwo = pl2.score(X_test, y_test)
print("R^2: " + str(rtwo))

pred_test = pl2.predict(X_test)
rmse_test = np.sqrt(np.mean((pred_test - y_test)**2))
print("RMSE: " + str(rmse_test))

R^2: 0.017769661807690684
RMSE: 11.946996256562356


Here, we can see that the tuned parameters did, in fact, improve our results: R^2 rose by about 0.4 percentage points (from 1.36% to 1.78%), and the RMSE decreased by about 0.025 (from 11.972 to 11.947).
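
The grid search above was run on features that had already been transformed. An alternative, sketched below on synthetic data, is to hand the whole pipeline to GridSearchCV so that the preprocessing is re-fitted inside each cross-validation fold. The `_demo` names, the random data, and the reduced grid are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data; column names mirror the notebook, values are random.
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    'mos_gender': rng.choice(['M', 'F'], size=n),
    'mos_age_incident': rng.integers(21, 60, size=n),
    'complainant_gender': rng.choice(['Male', 'Female'], size=n),
})
y_demo = rng.normal(35, 12, size=n)

preproc_demo = ColumnTransformer([
    ('std', StandardScaler(), ['mos_age_incident']),
    ('ohe', OneHotEncoder(handle_unknown='ignore'),
     ['mos_gender', 'complainant_gender']),
])
pipe_demo = Pipeline([
    ('preprocessor', preproc_demo),
    ('regressor', DecisionTreeRegressor(random_state=0)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
small_grid = {
    'regressor__max_depth': [2, 5, None],
    'regressor__min_samples_leaf': [1, 5],
}
search = GridSearchCV(pipe_demo, small_grid, cv=5)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

With this layout, `search.best_estimator_` is a fitted pipeline that can be scored on the raw test features directly.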

### Fairness Evaluation

Here we create the results DataFrame that we will use for the permutation test. This is done by taking the X_test features and adding columns with the predictions and the tags (the true complainant ages).

In [None]:
preds = pl2.predict(X_test)
# Copy so that adding columns does not modify X_test itself.
results = X_test.copy()
results['prediction'] = preds
results['tag'] = y_test

In [None]:
# Identifying old officers as those that are older than 30 and young ones as those who are 30 years old or younger.
results['is_young'] = (results.mos_age_incident <= 30).replace({True:'young', False:'old'})

In the permutation test, we use the difference in R^2 between the two groups as our fairness measure, because the usual classification parity metrics do not apply to a regression problem. The two groups we compare are the old officers and the young ones. Our null hypothesis is that the model is fair, i.e. the two R^2 values are similar. The alternative hypothesis is that the model is not fair.

In [None]:
def r2_gap(df):
    # R^2 on the old-officer subset minus R^2 on the young-officer subset.
    old = df[df['is_young'] == 'old']
    young = df[df['is_young'] == 'young']
    return pl2.score(old, old['tag']) - pl2.score(young, young['tag'])

# Observed difference between the two R^2 values.
obs = r2_gap(results)

metrs = []
for _ in range(100):
    # Shuffle the group labels. np.random.permutation returns a plain array, so the
    # shuffled labels are assigned by position instead of being re-aligned on the index.
    changed = results.assign(is_young=np.random.permutation(results['is_young']))
    metrs.append(r2_gap(changed))

With a p-value of around 0.44, we fail to reject the null hypothesis, since 0.44 is greater than the significance level we set at 0.05. This leads us to believe that, under our permutation test, our model is fair when comparing the subsets of old officers and young ones.

In [None]:
# Calculating the p-value: the fraction of simulated differences at or below the observed one.
pval = np.mean(np.array(metrs) <= obs)
pval

0.44
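
The permutation procedure can also be written as a small function that works on plain arrays, which makes it easy to try on synthetic data. The sketch below is one such version; the function names, the random data, and the two-sided p-value (comparing absolute differences, unlike the one-sided comparison above) are assumptions made for this example.

```python
import numpy as np
from sklearn.metrics import r2_score

def r2_gap(y_true, y_pred, is_young):
    """R^2 on the 'old' group minus R^2 on the 'young' group."""
    return (r2_score(y_true[~is_young], y_pred[~is_young])
            - r2_score(y_true[is_young], y_pred[is_young]))

def permutation_pvalue(y_true, y_pred, is_young, n_repetitions=100, seed=0):
    rng = np.random.default_rng(seed)
    observed = r2_gap(y_true, y_pred, is_young)
    # Shuffle the group labels while keeping each (truth, prediction) pair intact.
    simulated = np.array([
        r2_gap(y_true, y_pred, rng.permutation(is_young))
        for _ in range(n_repetitions)
    ])
    # Two-sided: fraction of shuffles at least as extreme as the observed gap.
    p_value = float(np.mean(np.abs(simulated) >= abs(observed)))
    return observed, p_value

# Synthetic stand-in for true ages, predictions, and group labels.
rng = np.random.default_rng(1)
y_true = rng.normal(35, 12, size=400)
y_pred = y_true.mean() + rng.normal(0, 2, size=400)
is_young = rng.random(400) < 0.5

observed, p_value = permutation_pvalue(y_true, y_pred, is_young)
print(observed, p_value)
```

Fixing the seed makes the p-value reproducible from run to run, which helps keep the reported number and the notebook output in agreement.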