# NYPD Allegations
* **See the main project notebook for instructions to be sure you satisfy the rubric!**
* See Project 03 for information on the dataset.
* A few example prediction questions to pursue are listed below. However, don't limit yourself to them!
    * Predict the outcome of an allegation (might need to feature engineer your output column).
    * Predict the complainant or officer ethnicity.
    * Predict the amount of time between the month received vs month closed (difference of the two columns).
    * Predict the rank of the officer.

Be careful to justify what information you would know at the "time of prediction" and train your model using only those features.

# Summary of Findings

### Introduction
In this project I build models to predict the ethnicity of complainants in the NYPD allegations dataset. Since ethnicity is a nominal categorical column, this is a classification problem. It will not be an easy task, because the features in this dataset do not cleanly separate the ethnicities. Achieving good accuracy (our evaluation metric) is especially difficult because over 15% of complainant ethnicities are missing. Since we cannot simply drop such a large chunk of data, we include missing data as a category ("Unknown") for the classifier to predict.

Unfortunately, this can cause problems because we don't know the true distribution of the missing complainant ethnicities. A feature pattern that is common for a certain ethnicity may also be common among allegations whose ethnicity is missing but is, in truth, that same ethnicity. The classifier might then predict "Unknown" where an actual ethnicity would have been the better prediction. Regardless, we will do our best to create a reliable classifier of complainant ethnicities. I use accuracy as the evaluation metric because it gives a basic understanding of how often the classifier predicts correctly. Given these obstacles and limitations, the goal for the final model is a test-set accuracy of 70%.

It is worth noting that for this project I used the cleaned version of the dataset from Project 03. The cleaning went as follows. I started by looking through each column to see whether missingness was encoded in different ways. For example, in the shield_no column, missing values were usually recorded as 0 rather than NaN, since this is an integer column. Similarly, the precinct column contained a large number of 0s and 1000s, which cannot be right since no precincts have these numbers. Had I not replaced these with NaN, the later analysis of NYC's precincts could have looked quite different. Some of the complainant ages were below 0, which also doesn't make sense, and there were "Unknown" entries in the complainant_ethnicity column. In all of these cases, the values were converted to NaN. To retain the type of each integer column, the columns were converted to type Int8 or Int16.

Some complainant ages were between 1 and 10. This is implausible, since the age should be that of the person filing the complaint. It is hard to imagine a child this young filing a formal complaint; more likely a parent filed it and entered the child's age, or the age was simply mistyped. For this reason, I was conservative and converted ages 8 and below to NaN.

Next, I added a few columns to the dataframe for later EDA and analysis. One was the substantiated column, a boolean series that tells whether each allegation was found to be true. It was useful in the hypothesis testing section, which tested whether the demographics of officers found guilty differ significantly from the demographics of NYPD police. It also helped in coming up with the question posed, since being accused of something is different from being guilty; the column was crucial for differentiating the ethnicities of accused officers from those of guilty officers. I also added a column combining the month and year columns for all-in-one access. This column did not end up being useful, since the month of an allegation wasn't considered in the analysis.

EDA focused mainly on the ethnicities and precincts in the data. Immediately after noticing how skewed the demographics were for both complainant ethnicity and NYPD member ethnicity, I knew this would be the focus of the project. For NYPD demographics, the EDA section set up a later hypothesis test by pulling external NYPD demographics from a .gov site. Together with the demographics of recent allegations, this made the hypothesis test ready to run later.

The majority of the EDA section looked into the precinct data. Right away, the 75th precinct stood out with an extreme number of allegations compared to the others. Even after taking population and substantiation rate into account, before becoming suspicious of corruption, the precinct still appeared extreme. From here, this precinct was compared to other, even more dangerous precincts using a metric that multiplies the number of allegations per population by the substantiation rate. This metric was used because it normalizes by person and takes into account how many of the officers were found guilty of the claims against them. With these metrics, we're all set up for another hypothesis test (using a test that we haven't used in this class).
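
The cleaning rules described above can be sketched roughly as follows. This is a minimal illustration on a made-up three-row frame, not the Project 03 code itself: the column names come from the dataset, while `toy`, `cleaned`, and `clean_sentinels` are hypothetical names, and `Int16` is assumed for every integer column for simplicity.

```python
import pandas as pd

def clean_sentinels(df):
    """Convert placeholder values to missing values, keeping integer dtypes."""
    out = df.copy()
    # A shield number of 0 stands in for "missing".
    out["shield_no"] = out["shield_no"].mask(out["shield_no"] == 0)
    # Precincts numbered 0 and 1000 do not exist.
    out["precinct"] = out["precinct"].mask(out["precinct"].isin([0, 1000]))
    # Ages of 8 and below (including negative ages) are treated as invalid.
    out["complainant_age_incident"] = out["complainant_age_incident"].mask(
        out["complainant_age_incident"] <= 8)
    # "Unknown" ethnicity is recorded as missing.
    out["complainant_ethnicity"] = out["complainant_ethnicity"].mask(
        out["complainant_ethnicity"] == "Unknown")
    # Nullable integer dtypes keep the columns integer-typed despite missing values.
    for col in ["shield_no", "precinct", "complainant_age_incident"]:
        out[col] = out[col].astype("Int16")
    return out

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "shield_no": [0, 8409, 5952],
    "precinct": [1000, 78, 0],
    "complainant_age_incident": [-1, 38, 5],
    "complainant_ethnicity": ["Unknown", "Black", "White"],
})
cleaned = clean_sentinels(toy)
print(cleaned)
print(cleaned.dtypes)
```
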
### Baseline Model
For my baseline model, I use 5 features: officer ethnicity, precinct of incident, allegation, last name of complainant, and number of months before the case was resolved by the review board. Officer ethnicity, precinct, allegation, and last name are all nominal categorical features and are therefore one-hot encoded. This leaves Months Before Resolved, a discrete numerical feature, which is standardized. After 20 runs of splitting the dataset into training and testing data, training the model, and computing the test accuracy, the mean accuracy was 0.667 (66.7%). This is not great, since the classifier is only correct about 2 out of 3 times, but considering that we have accounted for overfitting (as described in the baseline model code), it is acceptable for a baseline. When we remember that many of the incorrect predictions could be due to the "Unknown" ethnicity category, and that it is quite difficult to predict the correct ethnicity out of 7 possible options, it doesn't seem so bad. Ideally, feature engineering will improve the final model by about 3 percentage points to reach our goal of an average 70% accuracy on testing data.
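
The repeated-split evaluation mentioned above could look roughly like the sketch below. It runs on synthetic data from `make_classification` rather than the allegations data, and `mean_test_accuracy` is a hypothetical helper name; in the notebook, the fitted pipeline and the real features would take the place of the bare decision tree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def mean_test_accuracy(model, X, y, n_runs=20, test_size=0.2):
    """Average test accuracy over several random train/test splits."""
    scores = []
    for seed in range(n_runs):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model.fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return np.mean(scores), np.std(scores)

# Synthetic stand-in data, not the allegations dataset.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           n_classes=3, random_state=0)
mean_acc, std_acc = mean_test_accuracy(
    DecisionTreeClassifier(min_samples_leaf=10, random_state=0), X, y)
print(f"mean test accuracy: {mean_acc:.3f} (std {std_acc:.3f})")
```

Reporting the standard deviation next to the mean gives a sense of how much a single split's accuracy can move around.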

### Final Model
In my final model, I first engineered a pipeline step that deals with uncommon responses in the dataset. Many responses in the "allegation" and "last_name" columns appear only once or twice. The worry is that when these uncommon responses are one-hot encoded and the model is trained on them, the model will memorize them when making predictions and overfit as a result. Furthermore, the whole idea behind including these features is that a more common response pertains more to a certain ethnicity. A threshold of 3 was used: across the entire dataset, only last names and allegation responses that occur 3 or more times are kept. Responses below this threshold are converted to the string "IGNORE", so they are all one-hot encoded into a single column.

The second form of feature engineering in my final model was extracting the board's decision for each allegation. The board's decision (Substantiated, Unsubstantiated, or Exonerated) could be an important predictor because there may be trends in how frequently each ethnicity's claims led to the officer being punished. Since every response in the board_disposition column begins with the decision as its first word, we can simply split() the string and take the first word. Once this function was written, the result was one-hot encoded and used in the final model. I also added complainant_gender as a feature, since gender could play a large part in the trends we see within ethnicity groups in the dataset.

I stuck with the DecisionTreeClassifier for my final model because it ended up providing the best results in terms of accuracy. I did not include the testing of other model types in this notebook, since grid searching with them would take very long when running the notebook from scratch. I tried a KNeighborsClassifier with grid-searched n_neighbors of 3, 4, 5, 10, and 20, and a RandomForestClassifier with the same grid search parameters as my DecisionTreeClassifier. The DecisionTreeClassifier produced very similar accuracies to the KNeighborsClassifier, so I went with the decision tree since it is faster and easier to understand when changing its parameters. The RandomForestClassifier took extremely long to run and also produced very similar accuracies.

Unfortunately, each grid search returned the combination of DecisionTreeClassifier parameters that yields the most overfitting: max_depth = None, min_samples_leaf = 2, and min_samples_split = 2. This combination places almost no limit on how far the tree can grow, so the classifier comes close to memorizing the training set, which explains the very different training and testing accuracies. With these "best" parameters, the training accuracy was around 89%, while the testing accuracy was around 73%. This indicates clear overfitting. To reduce it, I fine-tuned the classifier's parameters to bring the training and testing accuracies reasonably close together. I decided on max_depth=75, min_samples_leaf=10, and min_samples_split=5, which produced accuracies of 72% (train) and 68% (test). Although the testing accuracy dropped slightly, it is more important to have accounted for overfitting, since the classifier could be used on future NYPD allegation data. Even though we barely missed our 70% testing accuracy goal, we've produced a transparent model that, given the limitations, does a decent job of predicting complainant ethnicities.

### Fairness Evaluation
The model's accuracy on female complainants was evaluated to determine whether the model predicts worse for this gender than for the others. The motivation for testing female complainants comes from their underrepresentation in the dataset compared to males. I ran a permutation test on the model's accuracy for female complainants to see whether it differs significantly from the accuracy for other genders. I chose accuracy because this is not a binary classification problem; there are multiple ethnicities that the classifier can predict. The permutation test showed that the accuracy for female complainants is significantly lower than would be expected if accuracy did not depend on gender. Therefore, this model is not fair towards all genders, and it should be revised further before being used for any reason.
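
Before running a formal test, per-group accuracy can be inspected directly. The sketch below uses made-up predictions purely for illustration; `results` and `by_gender` are hypothetical names, and in the notebook the `actual` and `predicted` columns would come from the held-out test set and the fitted model.

```python
import pandas as pd

# Made-up rows for illustration only.
results = pd.DataFrame({
    "complainant_gender": ["Male", "Male", "Female", "Female", "Male", "Female"],
    "actual":    ["Black", "White", "Black", "Hispanic", "Black", "Unknown"],
    "predicted": ["Black", "Black", "Black", "Black", "Black", "Black"],
})
results["correct"] = results["actual"] == results["predicted"]

# Accuracy (mean of the correct flags) and sample size for each gender.
by_gender = results.groupby("complainant_gender")["correct"].agg(["mean", "size"])
print(by_gender)
```

Showing the group size alongside the accuracy matters here, because very small groups produce noisy accuracy estimates.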

# Code

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # Higher resolution figures

In [2]:
allegations = pd.read_csv(os.path.join("data", "cleaned_allegations.csv"), index_col = 0)
print(allegations.columns)
allegations.head()

Index(['unique_mos_id', 'first_name', 'last_name', 'command_now', 'shield_no',
       'complaint_id', 'month_received', 'year_received', 'received mo/yr',
       'month_closed', 'year_closed', 'closed mo/yr', 'command_at_incident',
       'rank_abbrev_incident', 'rank_abbrev_now', 'rank_now', 'rank_incident',
       'mos_ethnicity', 'mos_gender', 'mos_age_incident',
       'complainant_ethnicity', 'complainant_gender',
       'complainant_age_incident', 'fado_type', 'allegation', 'precinct',
       'contact_reason', 'outcome_description', 'board_disposition',
       'Substantiated', 'Months Before Resolved'],
      dtype='object')


Unnamed: 0,unique_mos_id,first_name,last_name,command_now,shield_no,complaint_id,month_received,year_received,received mo/yr,month_closed,...,complainant_gender,complainant_age_incident,fado_type,allegation,precinct,contact_reason,outcome_description,board_disposition,Substantiated,Months Before Resolved
0,10004,Jonathan,Ruiz,078 PCT,8409.0,42835,7,2019,"(7, 2019)",5,...,Female,38.0,Abuse of Authority,Failure to provide RTKA card,78.0,Report-domestic dispute,No arrest made or summons issued,Substantiated (Command Lvl Instructions),True,10
1,10007,John,Sears,078 PCT,5952.0,24601,11,2011,"(11, 2011)",8,...,Male,26.0,Discourtesy,Action,67.0,Moving violation,Moving violation summons issued,Substantiated (Charges),True,9
2,10007,John,Sears,078 PCT,5952.0,24601,11,2011,"(11, 2011)",8,...,Male,26.0,Offensive Language,Race,67.0,Moving violation,Moving violation summons issued,Substantiated (Charges),True,9
3,10007,John,Sears,078 PCT,5952.0,26146,7,2012,"(7, 2012)",9,...,Male,45.0,Abuse of Authority,Question,67.0,PD suspected C/V of violation/crime - street,No arrest made or summons issued,Substantiated (Charges),True,14
4,10009,Noemi,Sierra,078 PCT,24058.0,40253,8,2018,"(8, 2018)",2,...,,16.0,Force,Physical force,67.0,Report-dispute,Arrest - other violation/crime,Substantiated (Command Discipline A),True,6


### Baseline Model

Considering prediction of complainant_ethnicity with features: mos_ethnicity, precinct, allegation, last name, months before resolved

Let's first take a look at the missingness in our columns to see if we might need to impute, since our classifier will not work on missing data.

In [3]:
print(allegations.isna().sum())
allegations.shape[0]

unique_mos_id                  0
first_name                     0
last_name                      0
command_now                    0
shield_no                   5392
complaint_id                   0
month_received                 0
year_received                  0
received mo/yr                 0
month_closed                   0
year_closed                    0
closed mo/yr                   0
command_at_incident         1544
rank_abbrev_incident           0
rank_abbrev_now                0
rank_now                       0
rank_incident                  0
mos_ethnicity                  0
mos_gender                     0
mos_age_incident               0
complainant_ethnicity       5505
complainant_gender          4195
complainant_age_incident    4829
fado_type                      0
allegation                     1
precinct                      48
contact_reason               199
outcome_description           56
board_disposition              0
Substantiated                  0
Months Bef

33358

The first thing we notice is that the response we want to predict with our classifier, "complainant_ethnicity", has an enormous number of missing values. In most cases, we would attempt to impute such a column based on its type of missingness. However, as explained in my Project 03 investigation, it would be unethical to impute this one. Incorrectly imputing a complainant's ethnicity could misrepresent allegations and requires many assumptions, when in reality each allegation has many factors and potential confounders. Unfortunately, dropping such a large number of observations might bias the complainant ethnicity predictions. I theorize that the majority of the allegations with missing complainant ethnicities come from minorities who, when filing a complaint, don't want their ethnicity taken into account (NMAR). This would be completely rational given the absurd amount of unpunished atrocities committed by police against minorities in the US.

Rather than imputing or dropping these missing ethnicities, we will instead fill them with an "Unknown" category. In other words, missing ethnicity becomes one of the classes the model can predict. This makes sense because a combination of features, rather than *just* the value of the ethnicity itself, could contribute to the complainant leaving their ethnicity unfilled. Even under my theory that one's ethnicity is the primary motivating factor in leaving the response blank, other factors within the context of the allegation may have strengthened this motivation and made the complainant more likely to leave out their ethnicity. I'm hopeful that if the right features and parameters are passed into the chosen classifier, it will successfully predict these observations as "Unknown". If my theory is true, we might instead see the classifier predict many of the observations with missing ethnicities as minority ethnicities, since they would share patterns in feature responses.

In [4]:
allegations["complainant_ethnicity"].value_counts()

Black              17114
Hispanic            6424
White               2783
Other Race           677
Asian                532
Refused              259
American Indian       64
Name: complainant_ethnicity, dtype: int64

We also see within the complainant_ethnicity responses that 259 complainants responded with "Refused". Since these ethnicities are also unknown, we will label them "Unknown" for classification, just as we do for the missing (NaN) ethnicities.

In [5]:
allegations["complainant_ethnicity"] = allegations[
    "complainant_ethnicity"].fillna("Unknown")
allegations["complainant_ethnicity"] = allegations[
    "complainant_ethnicity"].str.replace("Refused", "Unknown")

Since there is only one missing "allegation" response and 48 missing "precinct" responses in our dataset of over 33,000 responses, let's elect to drop these allegations from the dataset.

In [6]:
allegations = allegations.dropna(subset = ["allegation", "precinct"])
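
Because accuracy is the evaluation metric, it helps to know what a trivial classifier would score. The sketch below copies the class counts printed earlier (taken before the rows with a missing allegation or precinct were dropped) and treats the 5,505 missing values plus the 259 "Refused" responses as "Unknown"; `counts` and `majority_share` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Class counts copied from the value_counts() and isna() outputs above.
counts = pd.Series({
    "Black": 17114, "Hispanic": 6424, "White": 2783, "Other Race": 677,
    "Asian": 532, "American Indian": 64,
    "Unknown": 5505 + 259,   # missing values plus "Refused"
})

# Accuracy of a classifier that always predicts the most common class.
majority_share = counts.max() / counts.sum()
print(counts.idxmax(), round(majority_share, 3))
```

Always predicting "Black" would be correct for roughly 51% of allegations, so that is the floor any model's accuracy should be compared against.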

Now we can start building our model. We will start by including the ethnicity of the accused officer, the precinct the incident took place in, what the allegation was, the last name of the complainant, and the number of months it took to resolve the complaint. We consider the ethnicity of the accused officer since, in many cases, officer misconduct is racially motivated. With this in mind, the complainant may be more likely to be a minority if the officer's ethnicity was White (this is just one example). Precinct should also be considered as a feature for two reasons. Firstly, different precincts have different demographics, which means (at least slightly) different proportions of ethnicities filing complaints. Secondly, as seen in my Project 03, some precincts have much more corruption than others, since they are all under different leadership. The allegation column might be useful because certain allegations might be more common among certain ethnicities. For example, Hate Speech might be more common among minority accusers. I feel that last name will be a good predictor of complainant ethnicity, since each ethnicity has its own common last names. Lastly, the number of months an accusation took to be resolved might be a useful feature. My reasoning is that when an accuser's race is a factor in a case, it could be an additional element of the case that requires further investigation.
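
One way to check whether a candidate feature carries any signal is a row-normalized cross-tabulation against the target. The sketch below uses made-up rows for illustration only; with the real data, `toy` would be replaced by the `allegations` frame.

```python
import pandas as pd

# Made-up rows for illustration only.
toy = pd.DataFrame({
    "mos_ethnicity": ["White", "White", "Black", "Hispanic", "White", "Black"],
    "complainant_ethnicity": ["Black", "Hispanic", "Black", "Hispanic",
                              "Black", "White"],
})

# For each officer ethnicity, the share of complaints from each
# complainant ethnicity (every row sums to 1).
shares = pd.crosstab(toy["mos_ethnicity"], toy["complainant_ethnicity"],
                     normalize="index")
print(shares.round(2))
```

If the rows of such a table look nearly identical, the feature is unlikely to help the classifier much on its own.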

To deal with the mos_ethnicity, precinct, allegation, and last_name columns, we can one-hot encode them, since these are categorical columns. We can then standardize the Months Before Resolved column. Once the features are transformed in these ways, we'll use a Decision Tree Classifier. Before fitting the final model, we will use GridSearchCV to find the best parameters (specifically max_depth, min_samples_leaf, and min_samples_split). Within GridSearchCV, we will set cv = 10 since our data spans many decades, and many associations can change over such long periods of time. Once we determine the best parameters, we will fit our model using them. Lastly, we will compute our training and testing accuracies after splitting our data.

In [7]:
#OHE pipeline for mos_ethnicity, precinct, allegation, last_name
ohe = Pipeline([("other_OHE", OneHotEncoder(
    handle_unknown='ignore'))])

#Standardize pipeline for Months Before Resolved
stdscalar = Pipeline([("stdscalar", StandardScaler())])

#column transform to OHE and classify
transformer = ColumnTransformer([
    ("ohe", ohe, ["mos_ethnicity", "precinct", "allegation", "last_name"]),
    ("standardize", stdscalar, ["Months Before Resolved"])
])

#GridSearch parameters
parameters = {
    'max_depth': [2,5,10,None],
    'min_samples_split':[2,3,5,7],
    'min_samples_leaf':[2,3,5,7]
}

gridsearch = Pipeline([
    ("transform", transformer),
    ("gridsearch", GridSearchCV(DecisionTreeClassifier(),
                                parameters, cv = 10))
])

In [8]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    allegations.drop(["complainant_ethnicity"], axis = 1),
    allegations["complainant_ethnicity"], test_size=0.2)

In [9]:
#GridSearch to find best parameters
gridsearch.fit(Xtrain, Ytrain).named_steps["gridsearch"].best_params_

{'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 3}

Seeing these best parameters, we can expect some overfitting, since this combination allows almost the maximum number of branches and splits in our tree.
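
The grid search above places GridSearchCV as the last pipeline step, so the transformer is fit once on all of the training data before cross-validation. One alternative arrangement, sketched here on a made-up frame, wraps the whole pipeline in GridSearchCV so that preprocessing is refit inside each fold. The `classify__` prefixes follow scikit-learn's step-name convention for nested parameters; `toy`, `pipe`, and `search` are hypothetical names, and the smaller grid and `cv=5` are chosen only to keep the example quick.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up frame standing in for the allegations data.
toy = pd.DataFrame({
    "mos_ethnicity": ["White", "Black", "Hispanic", "White"] * 10,
    "Months Before Resolved": list(range(40)),
    "complainant_ethnicity": ["Black", "Black", "Hispanic", "White"] * 10,
})
X = toy.drop(columns="complainant_ethnicity")
y = toy["complainant_ethnicity"]

pipe = Pipeline([
    ("transform", ColumnTransformer([
        ("ohe", OneHotEncoder(handle_unknown="ignore"), ["mos_ethnicity"]),
        ("standardize", StandardScaler(), ["Months Before Resolved"]),
    ])),
    ("classify", DecisionTreeClassifier(random_state=0)),
])

# The step-name prefix routes each parameter to the classifier step.
param_grid = {
    "classify__max_depth": [2, 5, None],
    "classify__min_samples_leaf": [2, 5],
}
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X, y)
print(search.best_params_)
```

After fitting, `search` can be used like any fitted model (for example `search.score(Xtest, Ytest)`), since GridSearchCV refits the best pipeline on the full training data by default.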

In [10]:
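#min_samples_split=2 behaves the same as the grid-searched value of 3 here,
#since min_samples_leaf=2 already requires at least 4 samples to split a node
model1 = Pipeline([
    ("transform", transformer),
    ("classify", DecisionTreeClassifier(
        max_depth=None, min_samples_leaf=2, min_samples_split=2))
])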

In [11]:
model1.fit(Xtrain, Ytrain)
print(model1.score(Xtrain, Ytrain))
print(model1.score(Xtest, Ytest))

0.8868165271888018
0.7499249474632242


Since using these "best parameters" leads to overfitting (a large difference between our training and testing accuracies), let's see if we can bring the two accuracies closer together by fine-tuning the DecisionTreeClassifier's parameters.

In [12]:
#try new parameters to reduce overfitting
model1 = Pipeline([
    ("transform", transformer),
    ("classify", DecisionTreeClassifier(
        max_depth=75, min_samples_leaf=10, min_samples_split=5))
])

model1.fit(Xtrain, Ytrain)
print("train set accuracy:", model1.score(Xtrain, Ytrain))
print("test set accuracy:", model1.score(Xtest, Ytest))

train set accuracy: 0.6975269261080047
test set accuracy: 0.6693185229660763


Even though our training accuracy has dropped significantly, we are no longer overfitting nearly as much, since our training and testing accuracies are now fairly close together. Unfortunately, our test accuracy dropped a little too, but this is a necessary tradeoff considering how different our two accuracies were previously.
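
The tradeoff described above can be made visible by sweeping one regularizing parameter and printing the train/test gap at each value. This is a sketch on synthetic data from `make_classification`, not the allegations dataset, so the exact numbers will differ from the notebook's; `gaps` is a hypothetical name.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with some label noise (flip_y).
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=3, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

gaps = {}
for leaf in [1, 2, 5, 10, 25, 50]:
    tree = DecisionTreeClassifier(min_samples_leaf=leaf, random_state=0)
    tree.fit(X_tr, y_tr)
    train_acc, test_acc = tree.score(X_tr, y_tr), tree.score(X_te, y_te)
    gaps[leaf] = train_acc - test_acc
    print(f"min_samples_leaf={leaf:>2}  train={train_acc:.3f}  "
          f"test={test_acc:.3f}  gap={gaps[leaf]:.3f}")
```

Larger leaves shrink the gap mostly by lowering training accuracy, which mirrors what happened when moving from min_samples_leaf=2 to min_samples_leaf=10 above.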

### Final Model

The first thing that seems appropriate to feature-engineer for this model is a function that accounts for responses we don't commonly see in a column. For example, in the "allegation" column, we see some common responses like "Assault", but also a few very specific responses like "Questioned immigration status". Keeping these very specific responses may lead to overfitting, which is why I want to incorporate this function. Responses that occur fewer than 3 times will be converted to the string "IGNORE". Since the column passed into this function is one-hot encoded afterwards, the responses under the threshold of 3 will be encoded into a single column designated for uncommon responses. Since there are also many unique last names, that column will be passed into the function as well before being one-hot encoded.
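
A function applied through FunctionTransformer is stateless, so it recounts values on whichever frame it receives, training or testing. A possible variant, sketched below as an assumption rather than the notebook's approach, learns the set of frequent values from the training column once and reuses it on new data. `frequent_values` and `collapse_rare` are hypothetical helper names, and the example values are made up.

```python
import pandas as pd

def frequent_values(train_col, threshold=3):
    """Return the set of values seen at least `threshold` times in training."""
    counts = train_col.value_counts()
    return set(counts[counts >= threshold].index)

def collapse_rare(col, keep, placeholder="IGNORE"):
    """Replace every value outside `keep` with the placeholder string."""
    return col.where(col.isin(keep), placeholder)

# Made-up allegation values for illustration only.
train = pd.Series(["Physical force"] * 4 + ["Word"] * 3
                  + ["Questioned immigration status"])
test = pd.Series(["Word", "Physical force", "Gun Pointed"])

keep = frequent_values(train)
print(collapse_rare(train, keep).value_counts().to_dict())
print(collapse_rare(test, keep).tolist())
```

Here "Gun Pointed" never appears in the training column, so it falls into the same "IGNORE" bucket as the rare training value.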

In [13]:
allegations["board_disposition"].value_counts()

Unsubstantiated                             15427
Exonerated                                   9601
Substantiated (Charges)                      3790
Substantiated (Formalized Training)          1033
Substantiated (Command Discipline A)          962
Substantiated (Command Discipline)            851
Substantiated (Command Discipline B)          784
Substantiated (Command Lvl Instructions)      452
Substantiated (Instructions)                  247
Substantiated (No Recommendations)            161
Substantiated (MOS Unidentified)                1
Name: board_disposition, dtype: int64

The second form of feature engineering I will include in my final model is extracting the board's disposition from each case. Since the first word of each board decision tells us the outcome, we can extract this word and include it as a feature. To do this, we will write a function that splits each observation and takes the first word of the resulting list. This might serve as a useful predictor of complainant ethnicity, since the review board might have racial biases. It is also possible that certain ethnicities have higher substantiation rates than others. All features from the previous model will be included in this final model. GridSearchCV will again be used to search for the best parameters for our Decision Tree Classifier.
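
As a small illustration of the extraction step, the sketch below applies the split-and-take-first-word idea to a few disposition strings copied from the value counts above; `dispositions` and `first_word` are hypothetical names.

```python
import pandas as pd

dispositions = pd.Series([
    "Substantiated (Charges)",
    "Substantiated (Command Discipline A)",
    "Unsubstantiated",
    "Exonerated",
])

# Split on whitespace and keep the first token of each string.
first_word = dispositions.str.split().str[0]
print(first_word.tolist())
```

All of the "Substantiated (...)" variants collapse into a single "Substantiated" category, leaving three possible outcomes.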

In [14]:
#replace responses that occur fewer than `threshold` times with "IGNORE"
def remove_unique(df, threshold = 3):
    col_dict = {}
    for col in df:
        counts = df[col].value_counts()
        #values that occur at least `threshold` times in this column
        common = counts[counts >= threshold].index
        #keep common values; everything else (including nan) becomes "IGNORE"
        col_dict[col] = df[col].where(df[col].isin(common), "IGNORE")

    return pd.DataFrame(col_dict)

#remove unique responses and OHE features
ohe_exclude = Pipeline([("exclude", FunctionTransformer(remove_unique)),
                        ("OHE", OneHotEncoder(handle_unknown='ignore'))
])

#extract first word in board_disposition and OHE
def extract_disposition(df):
    #e.g. "Substantiated (Charges)" -> "Substantiated"
    return df["board_disposition"].str.split().str[0].to_frame()

disposition_ohe = Pipeline([
    ("disposition_summarized", FunctionTransformer(extract_disposition)),
    ("disposition_OHE", OneHotEncoder(handle_unknown='ignore'))])

#include feature-engineering in final model
transformer2 = ColumnTransformer([
    ("disposition", disposition_ohe, ["board_disposition"]),
    ("ohe_excl", ohe_exclude, ["allegation", "last_name"]),
    ("ohe", ohe, ["mos_ethnicity", "precinct", "complainant_gender"]),
    ("standardize", stdscalar, ["Months Before Resolved"])
])

parameters = {
    'max_depth': [2,5,10,None], 
    'min_samples_split':[2,3,5,7],
    'min_samples_leaf':[2,3,5,7]
}

gridsearch = Pipeline([
    ("transform", transformer2),
    ("gridsearch", GridSearchCV(
        DecisionTreeClassifier(), parameters, cv = 10))
])

In [15]:
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    allegations.drop(["complainant_ethnicity"], axis = 1),
    allegations["complainant_ethnicity"], test_size=0.2)

In [16]:
#Once again, Gridsearch to find best set of parameters
gridsearch.fit(Xtrain, Ytrain).named_steps["gridsearch"].best_params_

{'max_depth': None, 'min_samples_leaf': 2, 'min_samples_split': 2}

In [21]:
model2 = Pipeline([
    ("transform", transformer2),
    ("classify", DecisionTreeClassifier(
        max_depth=None, min_samples_leaf=2, min_samples_split=2))
])

model2.fit(Xtrain, Ytrain)
print("train set accuracy:", model2.score(Xtrain, Ytrain))
print("test set accuracy:", model2.score(Xtest, Ytest))

train set accuracy: 0.9078695537959245
test set accuracy: 0.7377664365055538


Since the grid search has once again chosen parameters that overfit, producing very different training and testing accuracies, we will again tune the parameters by hand to values that perform well without overfitting.

In [18]:
model2 = Pipeline([
    ("transform", transformer2),
    ("classify", DecisionTreeClassifier(
        max_depth=75, min_samples_leaf=10, min_samples_split=5))
])

model2.fit(Xtrain, Ytrain)
print("train set accuracy:", model2.score(Xtrain, Ytrain))
print("test set accuracy:", model2.score(Xtest, Ytest))

train set accuracy: 0.722670469471235
test set accuracy: 0.6844791353947763


### Fairness Evaluation

Since complainant gender is now a feature in our model, I would like to investigate whether predictions are worse for female complainants than for complainants of other genders. To do this, I use accuracy as the test statistic and check whether the accuracy on female complainants is significantly lower. Females are underrepresented in the data compared to males, as we can see below. If the accuracy is significantly lower (one-sided) for females, this model should not be viewed as fair and usable. Further modifications, including adding or removing features or switching to a different classifier, would need to be considered before using this model for purposes other than this project. The null hypothesis is that my model is fair: the accuracy for female complainants is approximately equal to the accuracy for other genders. The alternate hypothesis is that my model is not fair: the accuracy for female complainants is lower than the accuracy for other genders. A significance level of 0.05 will be used for this test.

In [19]:
allegations["complainant_gender"].value_counts()

Male                     24039
Female                    5016
Not described               57
Transwoman (MTF)            20
Transman (FTM)               5
Gender non-conforming        2
Name: complainant_gender, dtype: int64

Null Hypothesis: 
My model is fair; the accuracy for female complainants is approximately equal to the accuracy for the other genders

Alternate Hypothesis: 
My model is not fair; the accuracy for female complainants is lower than the accuracy for the other genders

In [20]:
#Conduct permutation test
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    allegations.drop(["complainant_ethnicity"], axis = 1),
    allegations["complainant_ethnicity"], test_size=0.2)
model2.fit(Xtrain, Ytrain)

#Put together test data to subset by Female
concat = pd.concat([Xtest, Ytest], axis = 1)
females = concat[concat["complainant_gender"] == "Female"]

#Find accuracy of model on Female subset
femaleobs = model2.score(females.drop(["complainant_ethnicity"], axis = 1),
    females["complainant_ethnicity"])

#Permute complainant_gender column and subset to compute shuffled gender accuracy
scores = []
n_trials = 1000
for trial in range(n_trials):
    permuted = concat.assign(shuffled = np.random.permutation(
        concat["complainant_gender"]))
    shuffled_female = permuted[permuted["shuffled"] == "Female"]
    scores.append(model2.score(
        shuffled_female.drop(["complainant_ethnicity"], axis = 1),
        shuffled_female["complainant_ethnicity"]))
    
#How often do we see an accuracy this low? (one-sided p-value)
np.count_nonzero(np.array(scores) <= femaleobs) / n_trials

0.0

With a p-value of approximately 0, we reject the null hypothesis. The accuracy for female complainants is significantly lower than the accuracy on randomly chosen subsets of the same size, so the final model is not fair with respect to gender. This model should be revised further before it is used for any purpose other than this project.
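
Since accuracy on a subset is just the mean of per-row correctness flags, the same permutation test can be run without calling the model inside the loop. The sketch below assumes a boolean `correct` array (prediction equals label) and a boolean `is_group` array (row belongs to the group of interest); both are generated synthetically here, and `permutation_pvalue` is a hypothetical helper name.

```python
import numpy as np

def permutation_pvalue(correct, is_group, n_trials=1000, seed=0):
    """One-sided p-value: the share of shuffled groups whose accuracy
    is at most the observed group's accuracy."""
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    is_group = np.asarray(is_group, dtype=bool)
    observed = correct[is_group].mean()
    # Shuffling the membership flags keeps the group size fixed.
    simulated = np.array([
        correct[rng.permutation(is_group)].mean() for _ in range(n_trials)
    ])
    return observed, np.mean(simulated <= observed)

# Synthetic example: the group is about 20% of rows and is predicted
# less accurately (55%) than everyone else (70%).
rng = np.random.default_rng(1)
is_group = rng.random(2000) < 0.2
correct = np.where(is_group, rng.random(2000) < 0.55, rng.random(2000) < 0.70)
observed, p_value = permutation_pvalue(correct, is_group)
print(f"observed group accuracy: {observed:.3f}, p-value: {p_value:.3f}")
```

With the notebook's objects, `correct` could be computed once as `model2.predict(Xtest) == Ytest` and `is_group` as `Xtest["complainant_gender"] == "Female"`, which avoids re-running predictions in every trial.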