<a href="https://colab.research.google.com/github/adamlutzz/DS-Unit-2-Applied-Modeling/blob/master/Fatal_Police_Encounters.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fatal Police Shootings

**Race:**

*  W: White, non-Hispanic
*  B: Black, non-Hispanic
*  A: Asian
*  L: Latino
*  O: Other
*  U: Unknown

**Gender:**

*  M: Male
*  F: Female
*  U: Unknown

**Possible features to engineer:**

*  Minority? (non-white)

### Installations

In [0]:
import os, sys
in_colab = 'google.colab' in sys.modules

# If you're in Colab...
if in_colab:

    # Install required python packages
    !pip install -r 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Kaggle-Challenge/master/requirements.txt'

### Pre-processing

In [0]:
import category_encoders as ce
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.pipeline import make_pipeline

# raw CSV link from the VICE News GitHub repo
link = 'https://raw.githubusercontent.com/vicenews/shot-by-cops/master/subject_data.csv'

# read in the data
df = pd.read_csv(link)

# check df shape
print(df.shape)

# first look at df
df.head()

In [0]:
# target category is 'Fatal'
target = 'Fatal'

In [0]:
# let's see how balanced the target classes are
print(df[target].value_counts())
df[target].value_counts(normalize=True)

The labels need cleaning: it looks like stray spaces are splitting the same class into separate values. After cleaning there should be 3 classes:<br/>
N - non-fatal<br/>
F - Fatal<br/>
U - Unknown<br/>

In [0]:
# remove leading and trailing spaces from the target labels
df[target] = df[target].str.strip()

# repeat step to check work
print(df[target].value_counts())
df[target].value_counts(normalize=True)

In [0]:
# visualize
df[target].value_counts().plot(kind='bar')
plt.title('Police Shootings')
plt.show()

### Calculate Baseline

In [0]:
from sklearn.model_selection import train_test_split

# do a train test split
train, test = train_test_split(df, test_size=.2, random_state=11)

train.shape, test.shape

In [0]:
# split again for validation
train, val = train_test_split(train, test_size=.2, random_state=11)

train.shape, val.shape

In [0]:
# divide into X features matrix and y target vector
X_train = train.drop(columns=target)
X_val = val.drop(columns=target)
X_test = test.drop(columns=target)

# y target vector
y_train = train[target]
y_val = val[target]
y_test = test[target]

# verify shape
X_train.shape, y_train.shape

In [0]:
from sklearn.impute import SimpleImputer

# create quick baseline model
base_RF = make_pipeline(
    ce.OrdinalEncoder(),
    SimpleImputer(strategy='mean'),
    RandomForestClassifier(n_jobs=-1, random_state=11)
)
# Fit on train, score on val
base_RF.fit(X_train, y_train)
y_pred = base_RF.predict(X_val)
print('Validation Accuracy', accuracy_score(y_val, y_pred))
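
To put the random forest score in context, it helps to compare it against a majority-class baseline, i.e. the accuracy from always predicting the most frequent label. Here is a minimal sketch of that idea. The toy labels below are made up and stand in for `y_train` and `y_val`; the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy labels standing in for y_train and y_val
y_train_toy = pd.Series(['N', 'F', 'N', 'N', 'F', 'U'])
y_val_toy = pd.Series(['N', 'F', 'N', 'F'])

# The most frequent class in the training labels
majority_class = y_train_toy.mode()[0]

# Accuracy if that class were predicted for every validation row
majority_accuracy = (y_val_toy == majority_class).mean()

print('Majority class:', majority_class)
print('Majority-class validation accuracy:', majority_accuracy)
```

A model that cannot beat this number on the validation set has not learned anything useful from the features.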

### Pandas Profiling Report

In [0]:
import pandas_profiling

df.profile_report()

### Feature Exploration

In [0]:
# This is explained above, but I wanted a quick reference
df.dtypes

**To-Do's**


*   Change date to datetime format
*   Split features into ordinal and nominal
*   Figure out how best to impute NAs
*   Identify high-cardinality features and investigate binning
*   Clean up number of shots (should be int); try using bins
*   Drop 'NumberofSubjects' - it is a constant
*   Bin SubjectAge by decade (10-19, etc.)
*   IsMale: new feature counting how many male officers were involved
*   IsFemale: new feature counting how many female officers were involved
*   IsUnknown: new feature counting officers of unknown gender
*   Bin the top 10 natures of stops?
*   Make a feature for each officer race and count how many
*   Group observations that only have a year and no day
*   Make a dictionary of all classes in low-cardinality features
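
Two of the to-dos above (age bins and officer gender counts) could be sketched roughly as follows. This is only an illustration on made-up rows. It assumes multi-officer values are joined with `;`, which may not match the real data, and the `AgeBin` column name is hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; the real column values may be formatted differently
toy = pd.DataFrame({
    'SubjectAge': ['23', '45', 'U', '17'],
    'OfficerGender': ['M', 'M;F', 'U', 'M;M;F'],
})

# Coerce ages to numbers (non-numeric entries such as 'U' become NaN),
# then bin into decades: [10, 20) -> '10-19', [20, 30) -> '20-29', ...
age = pd.to_numeric(toy['SubjectAge'], errors='coerce')
edges = list(range(0, 101, 10))
labels = [f'{lo}-{lo + 9}' for lo in edges[:-1]]
toy['AgeBin'] = pd.cut(age, bins=edges, labels=labels, right=False)

# Count how many officers of each gender code appear in each row
for code, name in [('M', 'IsMale'), ('F', 'IsFemale'), ('U', 'IsUnknown')]:
    toy[name] = toy['OfficerGender'].str.count(code)

print(toy)
```

Using `right=False` makes each bin include its lower edge, so an age of 20 lands in `20-29` rather than `10-19`.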




In [0]:
# the profile report showed 5 unique values (including NA), so let's check them
df['SubjectArmed'].value_counts()

# looks like extra spaces will need to be removed

In [0]:
# figure out number of shots (binning)
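
One possible approach to the shot counts is to coerce them to numbers first and then bin. This is a rough sketch only: the raw values below are invented, the real formatting of the shots column is an assumption, and the bin edges are illustrative rather than tuned.

```python
import pandas as pd

# Hypothetical raw values; the real column's formatting is an assumption
shots_raw = pd.Series(['1', '12', 'U', ' 4 ', '>=3'])

# Strip whitespace, then coerce to numbers; anything non-numeric becomes NaN
shots = pd.to_numeric(shots_raw.str.strip(), errors='coerce')

# Nullable integer dtype keeps counts as ints while allowing missing values
shots_int = shots.astype('Int64')

# Bin into coarse ranges (right edge inclusive); edges are illustrative
shot_bins = pd.cut(shots, bins=[0, 1, 5, 10, float('inf')],
                   labels=['1', '2-5', '6-10', '11+'])

print(pd.DataFrame({'raw': shots_raw, 'int': shots_int, 'bin': shot_bins}))
```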

In [0]:
# convert date to datetime
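
Since some observations apparently have only a year, one option is to flag those rows before parsing. The sketch below uses made-up date strings; the actual date format in the data is an assumption, and `%m/%d/%Y` is just a guess for illustration.

```python
import pandas as pd

# Hypothetical date strings; the real column's format is an assumption
dates_raw = pd.Series(['1/5/2012', '2014', '12/31/2015'])

# Flag rows that carry only a four-digit year
year_only = dates_raw.str.fullmatch(r'\d{4}')

# Parse full dates with an explicit format; year-only rows become NaT
dates = pd.to_datetime(dates_raw, format='%m/%d/%Y', errors='coerce')

# The year is still recoverable for every row from the trailing four digits
year = dates_raw.str.extract(r'(\d{4})$')[0].astype(int)

print(pd.DataFrame({'raw': dates_raw, 'date': dates,
                    'year': year, 'year_only': year_only}))
```

Keeping a separate `year` column means the year-only rows can still be grouped with the rest, even though their full date is missing.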

In [0]:
# declare list of columns you want to edit
wrangle_col = ['SubjectAge', 'OfficerGender',
               'OfficerRace', 'SubjectGender',
               'SubjectRace']

# map each bad character to its replacement; order matters:
# 'H' -> 'L' runs first, then the words it mangles are repaired
bad_chars = {'/': '', ':': ';', ' ': '', 'H': 'L',
             'OtLer': 'Other', 'WLITE': 'WHITE'}

# clean up whitespace and inconsistent characters
def wrangle(df):

  df = df.copy()

  # strip out leading and trailing spaces (string columns only,
  # since .str raises an error on numeric columns)
  for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].str.strip()

  # replace non-uniform characters
  for col in wrangle_col:
    for char, replacement in bad_chars.items():
      df[col] = df[col].str.replace(char, replacement, regex=False)

  return df

In [0]:
# def encoding function

In [0]:
# def pre-processing function

In [0]:
# re-run model

### Feature Engineering

In [0]:
# make minority feature
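
A minority flag could be derived from the race codes listed at the top of the notebook. This is a minimal sketch on toy values. It assumes unknown (`U`) and missing races should stay missing rather than be counted either way, which is a design choice and not something the data dictates.

```python
import pandas as pd

# Toy values using the race codes listed at the top of the notebook
race = pd.Series(['W', 'B', 'L', 'A', 'O', 'U', None])

# Rows where the race is actually known
known = race.isin(['W', 'B', 'A', 'L', 'O'])

# 1.0 for any known non-white code, 0.0 for 'W';
# unknown ('U') and missing values stay NaN rather than being guessed
minority = (race != 'W').astype(float).where(known)

print(minority)
```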