# Summary of Findings

### Introduction ###
Prediction Problem: 
Given the NYPD DataFrame, classify whether or not the number of
months elapsed for a complaint was greater than or equal to or less than the median number
of months elapsed for every complaint.

Since it is attempting to label each complaint in the DataFrame as either True or False, it is a
classification problem.

Within the DataFrame, there is a column named "is_median". Within this column, it labels as
True or False whether or not the specific complaint took greater than or equal to or less than
the median number of months elapsed for every complaint. For this classification problem,
the model is attempting to correctly classify each complaint based on a set of features, and
its correctness is based on this column, or target variable. The model used for this was a
Decision Tree Classifier.

For the evaluation metric, the model uses R-Squared to gauge the accuracy of our prediction
because this is a classification problem.


### Baseline Model ###

For the baseline model, it uses 16 columns: 1. Numbers of Quantitative Columns: 3 2.
Number of Ordinal Columns: 2 3. Number of Nominal Columns: 11

Evaluation Metric:
1. Training Data: 0.7842353505476057
2. Testing Data: 0.7630695443645084

In this case, it seems like the baseline model is decent, with a 75% accuracy on the test
data.

However, the model has a slightly higher R-Squared value for the training data compared to
the R-Squared value of the testing data. Therefore, it seems that the baseline model is
overfitting the training data. The max depth of the Decision tree was modified to have a
balance between the accuracy of the testing prediction and the amount of overfit. It was
found that a max depth of 10 provided the best results.

Consequently, the baseline model is not a good model because it overfits the training data.


### Final Model ###

Engineered Features:
1. The first engineered feature is a very simple feature that multiplies the month with the
year. Since the model is attempting to classify whether or not a specific complaint took
greater than or equal to or less than the median number of months elapsed for every
complaint, these dates along with some combination of these dates will provide a lot of
information for this model such that it is able to make more accurate classifications.

2. Year Elapsed. This was found by finding the difference between year closed and year
received

3. [('complainant_age_incident', 'precinct'), ('year_received', 'mos_age_incident'),
('year_closed', 'mos_age_incident'), ('month_received', 'mos_age_incident'),
('month_closed', 'precinct'), ('month_closed', 'mos_age_incident'), ('mos_age_incident',
'complainant_age_incident'), ('month_closed', 'complainant_age_incident'),
('year_closed', 'precinct'), ('mos_age_incident', 'precinct'), ('year_received', 'precinct'),
('month_received', 'complainant_age_incident'), ('month_received', 'precinct'),
('year_received', 'complainant_age_incident'), ('year_closed',
'complainant_age_incident')]. For each of these possible combinations, their linear
regression score is added to a dictionary in order to check how well each pair predicts
the target variable: "is_median". From here, these values are sorted by their score from
best to worst. The best scores that do not include both the "month_..." and "year_...”
were included in the model by multiplying the pair of columns together such that it
engineers a new feature.

We also attempted to use Linear Regression by predicting the number of months that a
particular complaint took, and using this value, we checked whether or not it took greater
than or equal to or less than the median number of months elapsed for every complaint to
classify that complaint as True or False. However, this model provided very horrible results
with the R^2 being in the range of .1 to .2. Therefore, we decided to drop this idea
completely and stick with either a Decision Tree Classifier or k-Means Clustering. Because
this is a classification problem that classifies complaints as either True or False, it was
between a Decision Tree Classifier and k-Means Clustering. We decided against using
k-Means Clustering as the model because k-Means Clustering is often not as good for a
higher dimensional set of features. Since the model used quite a large number of features,
we decided against going for a k-Means Clustering Model, and we decided to stick with a
Decision Tree Classifier. For this Decision Tree Classifier, the parameter that is taken into consideration is the max depth of the tree. In this case, the max depth was set to 10, which
was the same as the max depth for the baseline model. This was intentional because we
were attempting to check whether or not the engineered features would lead to a higher
R-Squared value under this same constant, which would lead to a more fair evaluation of
whether or not the engineered features provide for a higher correct classification rate and
lower amount of overfitting.


### Fairness Evaluation ###
Question: 
Are the R^2 scores similar between male and female complainants?
Null: The distribution of R^2 scores between male and female complainants are the same.
Alternative: The distribution of R^2 scores between male and female complainants are not
the same.

Significance Threshold: .05

Test Statistic: Difference in Accuracy

For this evaluation we chose to use Accuracy over alternatives such as Recall or Precision.
Since our prediction deals with a more general classification (true or false) the fairness metric
of accuracy would be a better fit than alternatives such as precision. Due to it being binary,
precision would not be as good of a metric due to it evaluating how precise, or how close
predictions are to each other. Accuracy on the other hand will evaluate whether or not the
correct prediction was made. After conducting the fairness evaluation, it was found that the
p-value was 0.25. Since this is greater than the significance threshold of .05, we would reject
the null hypothesis in favor of the alternative. This suggests that the accuracy for male and
female complainants’ predictions are not the same. This could be because there are more
male entries than female entries, so the data fit more in favor of the majority

In [1121]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
import datetime as dt
from dateutil.relativedelta import relativedelta
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import Binarizer
from sklearn import metrics 
from itertools import combinations
import datetime
from dateutil.relativedelta import relativedelta
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

import warnings
warnings.filterwarnings('ignore')

## Read CSV File as DataFrame

In [1122]:
# Read CSV File #
csv_file = os.path.join('data', 'allegations.csv')

# Read CSV File as DataFrame #
nypd_df = pd.read_csv(csv_file)

display(
    nypd_df.head()
)

Unnamed: 0,unique_mos_id,first_name,last_name,command_now,shield_no,complaint_id,month_received,year_received,month_closed,year_closed,...,mos_age_incident,complainant_ethnicity,complainant_gender,complainant_age_incident,fado_type,allegation,precinct,contact_reason,outcome_description,board_disposition
0,10004,Jonathan,Ruiz,078 PCT,8409,42835,7,2019,5,2020,...,32,Black,Female,38.0,Abuse of Authority,Failure to provide RTKA card,78.0,Report-domestic dispute,No arrest made or summons issued,Substantiated (Command Lvl Instructions)
1,10007,John,Sears,078 PCT,5952,24601,11,2011,8,2012,...,24,Black,Male,26.0,Discourtesy,Action,67.0,Moving violation,Moving violation summons issued,Substantiated (Charges)
2,10007,John,Sears,078 PCT,5952,24601,11,2011,8,2012,...,24,Black,Male,26.0,Offensive Language,Race,67.0,Moving violation,Moving violation summons issued,Substantiated (Charges)
3,10007,John,Sears,078 PCT,5952,26146,7,2012,9,2013,...,25,Black,Male,45.0,Abuse of Authority,Question,67.0,PD suspected C/V of violation/crime - street,No arrest made or summons issued,Substantiated (Charges)
4,10009,Noemi,Sierra,078 PCT,24058,40253,8,2018,2,2019,...,39,,,16.0,Force,Physical force,67.0,Report-dispute,Arrest - other violation/crime,Substantiated (Command Discipline A)


## EDA: Columns

In [1123]:
# Get Columns #
col_arr = nypd_df.columns

display(
    col_arr
)

Index(['unique_mos_id', 'first_name', 'last_name', 'command_now', 'shield_no',
       'complaint_id', 'month_received', 'year_received', 'month_closed',
       'year_closed', 'command_at_incident', 'rank_abbrev_incident',
       'rank_abbrev_now', 'rank_now', 'rank_incident', 'mos_ethnicity',
       'mos_gender', 'mos_age_incident', 'complainant_ethnicity',
       'complainant_gender', 'complainant_age_incident', 'fado_type',
       'allegation', 'precinct', 'contact_reason', 'outcome_description',
       'board_disposition'],
      dtype='object')

We dropped redundant columns (rank_abbrev_incident, rank_abbrev_now) since there is another column in the dataframe that has the same information, except in a different format.
Columns that contain information that would not be useful towards out prediction, like names or identification numbers, were also dropped in order to reduce noise.

In [1124]:
# Drop: Redundant Columns #
drop_col = [
    'rank_abbrev_incident',
    'rank_abbrev_now'
]

nypd_df = nypd_df.drop(drop_col, axis = 1)

# Drop Unique ID Columns #
drop_col = [
    'unique_mos_id',
    'first_name',
    'last_name',
    'shield_no',
    'complaint_id',
]

nypd_df = nypd_df.drop(drop_col, axis = 1)

nypd_df.columns

Index(['command_now', 'month_received', 'year_received', 'month_closed',
       'year_closed', 'command_at_incident', 'rank_now', 'rank_incident',
       'mos_ethnicity', 'mos_gender', 'mos_age_incident',
       'complainant_ethnicity', 'complainant_gender',
       'complainant_age_incident', 'fado_type', 'allegation', 'precinct',
       'contact_reason', 'outcome_description', 'board_disposition'],
      dtype='object')

## Calculate: Month Elapsed

This function uses the datetime package to calculate the different in time, in terms of month. This information will be what we are trying to predict from the dataframe. 

In [1125]:
def helper_func(lst):
    
    date_frmt = '%Y-%m'
    
    start = datetime.datetime.strptime(lst[0], date_frmt)
    end = datetime.datetime.strptime(lst[1], date_frmt)
    calc_diff = relativedelta(end, start)
    
    num_mth = (calc_diff.years * 12) + calc_diff.months
    
    return num_mth

This function uses the helper above on the nypd_df to get the month elasped between the received and closed date for each case. This it put into a new column called 'month_elapsed'.

In [1126]:
def calc_time_elapsed(mth_r, yr_r, mth_c, yr_c):
    
    start_date = (
        nypd_df[yr_r].astype(str) + '-' + nypd_df[mth_r].astype(str)
    )
    
    end_date = (
        nypd_df[yr_c].astype(str) + '-' + nypd_df[mth_c].astype(str)
    )
    
    lambda_splt = lambda val: val.split()
    comb_date = (start_date + ' ' + end_date).apply(lambda_splt)
    time_elapsed = comb_date.apply(helper_func)
        
    return time_elapsed

The columns 'month_elapsed' and 'is_median' are added to the dataframe. 'is_median' is a binary column which is True for when the month elapsed is greater than the median for all months elapsed and False otherwise. 

In [1127]:
nypd_df['month_elapsed'] = calc_time_elapsed('month_received', 'year_received', 'month_closed', 'year_closed')
elapsed_median = nypd_df['month_elapsed'].median()
nypd_df['is_median'] = nypd_df['month_elapsed'] >= elapsed_median


print('Median:', elapsed_median)
nypd_df[['month_elapsed', 'is_median']]


Median: 10.0


Unnamed: 0,month_elapsed,is_median
0,10,True
1,9,False
2,9,False
3,14,True
4,6,False
...,...,...
33353,6,False
33354,6,False
33355,6,False
33356,6,False


## Baseline Model

This baseline model imputes the missing values with 0's and nulls for numerical and categorical variables respectively. It also ordinal encodes the rank columns and one hot encodes the categorical columns. Since this is a classification problem, a decision tree was used with max depth as 10, which was found to reduce the models overfit. 

In [1128]:
X = nypd_df.drop(['is_median', 'month_elapsed'], axis=1)
y = nypd_df['is_median']

# Gets a list of categorial columns and a list of numerical columns
types = X.dtypes
catcols = types.loc[types == np.object].index
numcols = types.loc[types != np.object].index

num_cols = [
    'month_received',
    'year_received',
    'complainant_age_incident',
]

# These two columns are ordinally encoded
ord_col = [
    'rank_incident',
    'rank_now'
]
ords = Pipeline(
    steps = [
        ('ordinal_encoding', OrdinalEncoder())
    ]
)

# Every other categorical column that is not ordinal is considered nominal. These are imputed and one hot encoded.
nom_col = [col for col in catcols if (col not in ord_col)]
nom = Pipeline([
    ('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
])


ct = ColumnTransformer([
    ('nomcols', nom, catcols),
    ('numcols', SimpleImputer(strategy='constant', fill_value=0), num_cols),
    ('ordcols', ords, ord_col)
])

pl = Pipeline([('feats', ct), ('reg', DecisionTreeClassifier(max_depth=10))])

In [1129]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
pl.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('feats',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('nomcols',
                                                  Pipeline(memory=None,
                                                           steps=[('imp',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value='NULL',
                                                                                 missing_values=nan,
                                                                                 strategy='constant',
                                                                

R^2 for the training model:

In [1130]:
pl.score(X_train, y_train)


0.7859940842593333

R^2 for the test model:

In [1131]:
pl.score(X_test, y_test)

0.752757793764988

### Feature Engineering

The column 'month_x_year' is the produce of month received and year received. Since those two are strong predictors for the is_median columns, engineering a feature out of them may bring up the accuracy of the model. Another column called 'year_elapsed' was created by finding the difference between the year closed and year received for each case. 

In [1132]:
nypd_df['month_x_year'] = nypd_df.month_received * nypd_df.year_received
nypd_df['year_elapsed'] = nypd_df['year_closed'] - nypd_df['year_received']

List of numerical columns in the dataframe.

In [1133]:
numcols

Index(['month_received', 'year_received', 'month_closed', 'year_closed',
       'mos_age_incident', 'complainant_age_incident', 'precinct'],
      dtype='object')

Converts the column we want to predict to a column of 1's and 0's to be used for manual iterative method to find the best feature pairs. 

In [1134]:
nypd_num = nypd_df[numcols]
nypd_num['is_median'] = nypd_df['is_median'].astype(int)
nypd_num = nypd_num.dropna().reset_index(drop=True)
nypd_num.head()

Unnamed: 0,month_received,year_received,month_closed,year_closed,mos_age_incident,complainant_age_incident,precinct,is_median
0,7,2019,5,2020,32,38.0,78.0,1
1,11,2011,8,2012,24,26.0,67.0,0
2,11,2011,8,2012,24,26.0,67.0,0
3,7,2012,9,2013,25,45.0,67.0,1
4,8,2018,2,2019,39,16.0,67.0,0


This code initializes a new model and finds every possible combinations for the columns within the numerical subset dataframe.

In [1135]:
X_comb = nypd_num.drop('is_median', axis = 1)
y_train = nypd_num['is_median']

col_list = X_comb.columns
combos = list(combinations(col_list, 2))

For each of the possible combinations, its linear regression score is added to a dictionary to see how well each pair predicts 'is_median'.

In [1136]:
# finds the score for each pair of features using linear regression

dic = {}
for combo in combos:
   X_train_int = X_comb
   X_train_int['int'] = X_train_int[combo[0]] * X_train_int[combo[1]]
   lr = LinearRegression()
   lr.fit(X_train_int, y_train)
   scored = lr.score(X_train_int, y_train)
   dic[scored] = combo

These R^2 scores are then sorted in decreasing order to see which pairs best predict is_median.

In [1137]:
best = sorted(dic.keys(), reverse = True)[:20]
counter = 0
best_combos = []
for combo in best:
   best_combos.append(dic[combo])
   print(dic[combo], 'R2:', best[counter])
   counter +=1

('month_received', 'month_closed') R2: 0.6660351700572122
('year_received', 'month_closed') R2: 0.6278377634264865
('year_received', 'year_closed') R2: 0.6278205331302502
('complainant_age_incident', 'precinct') R2: 0.6277753999928355
('year_received', 'mos_age_incident') R2: 0.6277514818168362
('year_closed', 'mos_age_incident') R2: 0.6277316600071761
('month_received', 'mos_age_incident') R2: 0.6277096871993945
('month_received', 'year_closed') R2: 0.6276980495448005
('month_closed', 'precinct') R2: 0.6276819875105961
('month_closed', 'mos_age_incident') R2: 0.6276705546879111
('mos_age_incident', 'complainant_age_incident') R2: 0.6276639425757718
('month_closed', 'complainant_age_incident') R2: 0.6276533843430219
('year_closed', 'precinct') R2: 0.6276530733860418
('mos_age_incident', 'precinct') R2: 0.6276495737557267
('year_received', 'precinct') R2: 0.6276438167693917
('month_received', 'complainant_age_incident') R2: 0.6276437928202467
('month_received', 'year_received') R2: 0.62

The columns that do not have both year and month attributes are then filtered to be used in the final model.

In [1138]:
filtered_combos = []
reqs = ['month_received', 'month_closed', 'year_elapsed', 'year_closed', 'month_year', 'year_received']
for i in best_combos:
    if i[0] not in reqs or i[1] not in reqs:
        filtered_combos.append(i)
filtered_combos

[('complainant_age_incident', 'precinct'),
 ('year_received', 'mos_age_incident'),
 ('year_closed', 'mos_age_incident'),
 ('month_received', 'mos_age_incident'),
 ('month_closed', 'precinct'),
 ('month_closed', 'mos_age_incident'),
 ('mos_age_incident', 'complainant_age_incident'),
 ('month_closed', 'complainant_age_incident'),
 ('year_closed', 'precinct'),
 ('mos_age_incident', 'precinct'),
 ('year_received', 'precinct'),
 ('month_received', 'complainant_age_incident'),
 ('month_received', 'precinct'),
 ('year_received', 'complainant_age_incident'),
 ('year_closed', 'complainant_age_incident')]

Number of columns before and after this feature engineering. A total of 15 new columns were added. 

In [1139]:
prev_shape = nypd_df.shape
for i in filtered_combos:
    new_col = nypd_df[i[0]] * nypd_df[i[1]]
    nypd_df[i] = new_col

print('Before:', prev_shape)
print('After:', nypd_df.shape)
print('Num Columns Added:', nypd_df.shape[1] - prev_shape[1])

Before: (33358, 24)
After: (33358, 39)
Num Columns Added: 15


### Final Model

In addition to what was done in the baseline model, the final model included the new engineer columns, binarized columns of if at least 2 years elapsed or not, numerical columns imputed with the mean instead of 0, and PCA done on the allegations columns to reduce noise since that column consisted of many unique values.

In [1140]:
X = nypd_df.drop(['is_median', 'month_elapsed'], axis=1)
y = nypd_df['is_median']

types = X.dtypes
catcols = types.loc[types == np.object].index
numcols = types.loc[types != np.object].index

num_cols = [
    'month_received',
    'year_received',
    'complainant_age_incident',
    'month_x_year'
] + filtered_combos    #includes the new feature pair columns that were engineered

bin_col = [
    'year_elapsed'
]
# binarizes the year elapsed column to True if at least 2 years have passed and False otherwise
bin_transf = Pipeline(
    steps = [
        ('binarizer', Binarizer(threshold = 2))
    ])

# ordinally encodes these two columns
ord_col = [
    'rank_incident',
    'rank_now'
]
ords = Pipeline(
    steps = [
        ('ordinal_encoding', OrdinalEncoder(handle_unknown='ignore'))
    ]
)

dont_include = ['rank_now', 'command_now', 'allegation']
# everything besides what was already engineered is considered as nominal
nom_col = [col for col in catcols if (col not in ord_col)]
nom_col = [col for col in catcols if (col not in dont_include)]
nom = Pipeline([
    ('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
])

alleg = Pipeline([
    ('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
    ('pca', PCA(svd_solver='auto', n_components=50))
])

ct = ColumnTransformer([
    ('catcols', nom, catcols),
    ('numcols', SimpleImputer(strategy='mean'), num_cols),
    ('ordcols', ords, ord_col),
    ('alleg', alleg, ['allegation']),
    ('bina', bin_transf, bin_col)
])

pl = Pipeline([('feats', ct), ('reg', DecisionTreeClassifier(max_depth=10))])

In [1141]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
pl.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('feats',
                 ColumnTransformer(n_jobs=None, remainder='drop',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('catcols',
                                                  Pipeline(memory=None,
                                                           steps=[('imp',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value='NULL',
                                                                                 missing_values=nan,
                                                                                 strategy='constant',
                                                                

Training score for final model:

In [1142]:
pl.score(X_train, y_train)

0.8440322967463426

Test score for final model:

In [1143]:
pl.score(X_test, y_test)

0.8105515587529976

### Fairness Evaluation

Question: Are the R^2 scores similar between male and female complainants?

Null: The distribution of R^2 scores between male and female complainants are the same. 

Alternative: The distribution of R^2 scores between male and female complainants are not the same.

Significance Threshold: .05

Test Statistic: Difference in Accuracy

This code was taken from project 3 to clean the complainant gender column. It converts the genders to only male and female for the purposes of our analysis. 

In [1144]:
# Clean Column: 'complainant_gender' #

# 'Transman (FTM)' = 'Male' #
nypd_df['complainant_gender'] = (nypd_df
    ['complainant_gender']
    .replace('Transman (FTM)', 'Male')
)

# 'Transwoman (MTF)' = 'Female' #
nypd_df['complainant_gender'] = (nypd_df
    ['complainant_gender']
    .replace('Transwoman (MTF)', 'Female')
)

# 'Not described' = NaN #
nypd_df['complainant_gender'] = (nypd_df
    ['complainant_gender']
    .replace('Not described', np.NaN)
)

# Drop 'Gender non-conforming' #
nypd_subset = nypd_df[
    (nypd_df['complainant_gender'] == 'Male') 
        |
    (nypd_df['complainant_gender'] == 'Female')
        |
    (nypd_df['complainant_gender'].isnull())
]

This is a helper function to create subsets of a given dataframe, one with males and one with females.

In [1145]:
def male_female_split(df):
    col = 'complainant_gender'
    male = df[df[col] == 'Male']
    female = df[df[col] == 'Female']
    return male, female
    

Helper method to calculate the precision of a given dataset. It creates a pipeline similar to the final model's then calculates the precision of that pipeline. It is a slightly modified version of the code to construct the final pipeline above. In addition to creating the pipeline, it also calculates the precision. 

In [1156]:
def accuracy_calc(df):
    X = nypd_df.drop(['is_median', 'month_elapsed'], axis=1)
    y = nypd_df['is_median']

    types = X.dtypes
    catcols = types.loc[types == np.object].index
    numcols = types.loc[types != np.object].index

    num_cols = [
        'month_received',
        'year_received',
        'complainant_age_incident',
        'month_x_year'
    ] + filtered_combos    #includes the new feature pair columns that were engineered

    bin_col = [
        'year_elapsed'
    ]
    bin_transf = Pipeline(
        steps = [
            ('binarizer', Binarizer(threshold = 2))
        ])

    ord_col = [
        'rank_incident',
        'rank_now'
    ]
    ords = Pipeline(
        steps = [
            ('ordinal_encoding', OrdinalEncoder())
        ]
    )

    dont_include = ['rank_now', 'command_now', 'allegation']
    cat_col = [col for col in catcols if (col not in ord_col)]
    cat_col = [col for col in catcols if (col not in dont_include)]
    cats = Pipeline([
        ('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
        ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
    ])

    alleg = Pipeline([
        ('imp', SimpleImputer(strategy='constant', fill_value='NULL')),
        ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False)),
        ('pca', PCA(svd_solver='auto', n_components=50))
    ])

    ct = ColumnTransformer([
        ('catcols', cats, catcols),
        ('numcols', SimpleImputer(strategy='mean'), num_cols),
        ('ordcols', ords, ord_col),
        ('alleg', alleg, ['allegation']),
        ('bina', bin_transf, bin_col)
    ])

    pl = Pipeline([('feats', ct), ('reg', DecisionTreeClassifier(max_depth=10))])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
    pl.fit(X_train, y_train)
    preds = pl.predict(X_test)
    
    metrics.accuracy_score(y_test, preds)
    tn, fp, fn, tp = metrics.confusion_matrix(y_test, preds).ravel()
    precision = (tp + tn) / (tp + fn + tn + fp)
    return precision

Helper method to calculate the observed precision. 

In [1157]:
def obs_acc(df):
    male, female = male_female_split(nypd_df)

    m_prec = accuracy_calc(male)
    f_prec = accuracy_calc(female)

    difference = np.abs(m_prec - f_prec)
    return difference

In [1158]:
obs = obs_acc(nypd_df)

This code conducts a permutation test with the difference of precision as the test statistic. It calculates the R^2 for each male and female rows then find the difference between those two and adds it to a list. At the ends, the p-value is calculated. 

In [1159]:
n_repetitions = 100

differences = []
for _ in range(n_repetitions):
    
    # shuffle the weights
    shuffled_gender = (
        nypd_df['complainant_gender']
        .sample(replace=False, frac=1)
        .reset_index(drop=True) 
    )
    
    # put them in a table
    shuffled = (
        nypd_df
        .assign(**{'complainant_gender': shuffled_gender})
    )
    
    # split the dataframe into male and female 
    male, female = male_female_split(nypd_df)

    # calculate the precision for male and female subset datasets
    m_prec = accuracy_calc(male)
    f_prec = accuracy_calc(female)

    difference = np.abs(m_prec - f_prec)
    
    # add it to the list of results
    differences.append(difference)

p_val = np.count_nonzero(differences >= obs) / n_repetitions
print('p-value:', p_val)

p-value: 0.8
