# NYPD ARREST DATA

# The NYPD just arrested a person. You are told the crime committed, the person's gender and age, and the location of the crime. Can you predict the person's race with accuracy and precision?

We are going to train supervised machine learning classifiers to predict race from the other arrest features.

First, we will load the data, then clean and explore it, getting it ready for the classification task along the way.

# Loading Data

In [None]:
#Import the necessary libraries
import numpy as np
import matplotlib.pyplot as plt
font = {'family': 'serif','color':  'darkred','weight': 'normal','size': 16}
import seaborn as sns
import pandas as pd
%matplotlib inline
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
from sklearn.decomposition import PCA

In [None]:
#Load in the data
data = pd.read_csv('NYPD_Arrest_Data__Year_to_Date_.csv')
#Look at five rows
data.head(5)

In [None]:
#Look at the columns and rows
data.info()

# The Data Dictionary

In [None]:
#The Data Dictionary for the Columns obtained from the NYPD Website
data_dictionary = """
ARREST_KEY: Randomly generated persistent ID for each arrest

ARREST_DATE: Exact date of arrest for the reported event

PD_CD: Three digit internal classification code (more granular than
Key Code)

PD_DESC: Description of internal classification corresponding with PD
code (more granular than Offense Description)

KY_CD: Three digit internal classification code (more general
category than PD code)

OFNS_DESC: Description of internal classification corresponding with KY
code (more general category than PD description)

LAW_CODE: Law code charges corresponding to the NYS Penal Law,
VTL and other various local laws

LAW_CAT_CD: Level of offense: felony, misdemeanor, violation

ARREST_BORO: Borough of arrest. B(Bronx), S(Staten Island), K(Brooklyn),
M(Manhattan), Q(Queens)

ARREST_PRECINCT: Precinct where the arrest occurred

JURISDICTION_CODE: Jurisdiction responsible for arrest. Jurisdiction codes
0(Patrol), 1(Transit) and 2(Housing) represent NYPD whilst
codes 3 and more represent non NYPD jurisdictions

AGE_GROUP: Perpetrator’s age within a category

PERP_SEX: Perpetrator’s sex description

PERP_RACE: Perpetrator’s race description

X_COORD_CD: Midblock X-coordinate for New York State Plane
Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104)

Y_COORD_CD: Midblock Y-coordinate for New York State Plane
Coordinate System, Long Island Zone, NAD 83, units feet (FIPS 3104)

Latitude: Latitude coordinate for Global Coordinate System, WGS
1984, decimal degrees (EPSG 4326)

Longitude: Longitude coordinate for Global Coordinate System, WGS
1984, decimal degrees (EPSG 4326)

"""

# Preprocess Data

In [None]:
#change borough columns to make it clear
borough= {'K': 'Brooklyn', 'M': 'Manhattan','B':'Bronx','Q':"Queens", 'S':'Staten Island'}
#change the crime labels to make it clear
crimes = {'M': 'Misdemeanor', 'F':'Felony', 'V':'Violation' ,'I':'Infraction'}
#change the gender labels
gender = {'M': 'Male', 'F':'Female'}
#Replace borough label
data['ARREST_BORO'] = data['ARREST_BORO'].replace(borough)
#Replace Crime label
data['LAW_CAT_CD'] = data['LAW_CAT_CD'].replace(crimes)
#Replace gender label
data['PERP_SEX'] =  data['PERP_SEX'].replace(gender)
# Change the column names
data.rename(columns={'PD_DESC':'OFFENSE_DESC_1', 'OFNS_DESC':'OFFENSE_DESC_2' ,'LAW_CAT_CD':'LEVEL_OF_OFFENSE','PD_CD':'INTERNAL_CODE_1','KY_CD':'INTERNAL_CODE_2'}, inplace=True)

In [None]:
#Look at the unique values for each column with less than 20 unique values
for column in data.columns:
    if len(data[column].unique()) <20:
        print(column,':',data[column].unique())

The categorical values above are found in the data. LEVEL_OF_OFFENSE (LAW_CAT_CD in the data dictionary above, renamed during preprocessing) is the type-of-crime indicator that lists felony, misdemeanor, violation, etc. The others are borough (ARREST_BORO), age group (AGE_GROUP), gender (PERP_SEX), and race (PERP_RACE).

Next, let me look at the columns that have more than 20 unique values.
Because the dataframe is 214617 rows long (as seen in the output from data.info()), I am going to keep any column with fewer than 1000 unique values. Even though 1000 is a big number, it is much smaller than ~200,000: roughly half a percent of the row count.
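
The cardinality rule described above can be sketched as a small helper. This is only an illustration: `low_cardinality_columns` is a hypothetical name, and the toy frame uses made-up values in place of the real arrest data.

```python
import pandas as pd

def low_cardinality_columns(df, max_unique=1000):
    """Return {column: unique count} for columns with fewer than max_unique distinct values."""
    # dropna=False counts NaN as a value, matching len(df[col].unique())
    counts = df.nunique(dropna=False)
    return counts[counts < max_unique].to_dict()

# Toy frame standing in for the arrest data (values are made up)
toy = pd.DataFrame({
    'ARREST_KEY': range(6),
    'ARREST_BORO': ['K', 'M', 'K', 'Q', 'B', 'K'],
    'PERP_SEX': ['M', 'F', 'M', 'M', 'F', 'M'],
})

# With a threshold of 5, the unique-per-row key is excluded
kept = low_cardinality_columns(toy, max_unique=5)
print(kept)

# The 1000-value threshold as a share of the 214617 rows
threshold_share = 1000 / 214617
print(f'{threshold_share:.2%}')
```

On the real dataframe, the same helper would be called as `low_cardinality_columns(data)` with the default threshold of 1000.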

In [None]:
#Count the unique values in every column
#We don't want too many unique values per column because high cardinality makes overfitting easier
for column in data.columns:
    print(column,'unique values:',len(data[column].unique()))

In [None]:
data.INTERNAL_CODE_2.unique()

In [None]:
#Based on the above unique counts, I will select the following columns to drop 
#because they are not useful features to train a classifier.
drop_following = """
Columns to drop:

ARREST_KEY: this is a unique identifier used internally by the NYPD. It serves no purpose here.
ARREST_DATE: We will drop this; the plan is to bin it into months
X_COORD_CD: Not using coordinates in logistic regression
Y_COORD_CD: Not using coordinates in logistic regression
Latitude: Not using coordinates
Longitude: Not using coordinates

"""

In [None]:
#Drop the columns that won't provide informational value
data.drop(['ARREST_KEY', 'X_COORD_CD','Y_COORD_CD', 'Latitude','Longitude','ARREST_DATE'], axis=1, inplace=True)

In [None]:
#How many empty values are there in each column?
data.isnull().sum()

In [None]:
#Because there are not that many nulls, won't worry about them, just going to drop them
data.dropna(inplace=True)

In [None]:
# data['ARREST_DATE'] = pd.to_datetime(data['ARREST_DATE'], format = "%m/%d/%Y")

In [None]:
#Create a categorical column for month
#Use a function to convert the date into a month category. 
# Then use one-hot encoding to convert the months into twelve variables 
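
The month step outlined in the comments above is not implemented in this notebook, and ARREST_DATE has already been dropped by this point. A minimal sketch of what it could look like, assuming the `%m/%d/%Y` date format from the commented-out `pd.to_datetime` call and using made-up dates on a toy frame (`ARREST_MONTH` and the `MONTH` prefix are hypothetical names):

```python
import pandas as pd

# Toy dates in the MM/DD/YYYY format assumed for ARREST_DATE (values are made up)
toy = pd.DataFrame({'ARREST_DATE': ['01/15/2019', '03/02/2019', '03/28/2019']})

# Parse the strings, then keep only the month name as a categorical feature
parsed = pd.to_datetime(toy['ARREST_DATE'], format='%m/%d/%Y')
toy['ARREST_MONTH'] = parsed.dt.month_name()

# One-hot encode the month; only months present in the data get a column
month_dummies = pd.get_dummies(toy['ARREST_MONTH'], prefix='MONTH')
print(month_dummies.columns.tolist())
```

To use this on the real data, the conversion would have to run before ARREST_DATE is dropped. A full year of arrests would produce twelve month columns.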

Now that the data is ready, we will proceed with exploring the columns.

# Borough

In [None]:
#Number of arrests by Borough
arrest_by_boro = data.groupby('ARREST_BORO').ARREST_BORO.count()
arrest_by_boro = pd.DataFrame(arrest_by_boro.sort_values(ascending = False))
#Plot number of arrests by borough
plt.style.use('seaborn-whitegrid')
fig = plt.figure(num=None, figsize=(15, 7))
ax = plt.axes()
ax = data.ARREST_BORO.value_counts(sort=True).plot(kind='bar')
ax.set_title('Numbers Arrested by Borough',fontsize=18)
ax.set_xlabel("Borough",fontsize=18)
ax.set_ylabel("Number of Arrested",fontsize=18)

In [None]:
#Looking at the types of arrests by Borough
fig = plt.figure(num=None, figsize=(20, 7))
ax = plt.axes()
ax = data.groupby('LEVEL_OF_OFFENSE')['ARREST_BORO'].value_counts(sort=True).plot(kind='bar',)
ax.set_title('Numbers Arrested in Borough for Type of Crime',fontsize=18,fontdict=font)
ax.set_xlabel("Type of Crime and Borough",fontsize=18,fontdict=font)
ax.set_ylabel("Number of Arrested",fontsize=18,fontdict=font)
ax.tick_params(labelsize=15)

Brooklyn has the most felonies.

In [None]:
#Arrests by Boro Precinct
font = {'family': 'serif','color':  'darkred','weight': 'normal','size': 16}
fig = plt.figure(num=None, figsize=(20, 7))
ax = plt.axes()
ax = data.groupby('ARREST_BORO').ARREST_PRECINCT.value_counts(sort=True).plot(kind='bar')
ax.set_title('Numbers Arrested in Borough and Precinct',fontsize=18,fontdict=font)
ax.set_xlabel("Borough and Police Precinct",fontsize=18,fontdict=font)
ax.set_ylabel("Number of Arrested",fontdict=font)
ax.tick_params(labelsize=12)

# Age

We see below that a majority of those arrested are between 25 and 44.

In [None]:
#Age Group of Arrested
counts = data.AGE_GROUP.value_counts()
percent = data.AGE_GROUP.value_counts(normalize = True).mul(100).round(2).astype(str) + '%'
pd.DataFrame({'Counts': counts, 'Percent': percent})

In [None]:
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(8, 6))
ax = plt.axes()
ax = data.AGE_GROUP.value_counts(sort=True).plot(kind='bar')
ax.set_title('Numbers Arrested in Age Group',fontsize=18,fontdict=font)
ax.set_xlabel("Age Group",fontsize=18,fontdict=font)
ax.set_ylabel("Number of Arrested",fontsize=18,fontdict=font)
ax.tick_params(labelsize=12)

# Race

Race is the target of this project. The main aim of our classification task is to see whether we can predict race from the other arrest features. Doing this will tell us whether the recorded characteristics of an arrest carry information about the arrested person's race.

In [None]:
#Race of Arrested
counts = data.PERP_RACE.value_counts()
percent = data.PERP_RACE.value_counts(normalize = True).mul(100).round(2).astype(str) + '%'
pd.DataFrame({'Counts': counts, 'Percent': percent})

In [None]:
#Plot race of arrested
plt.style.use('seaborn-whitegrid')
fig = plt.figure(num=None, figsize=(15, 7))
ax = plt.axes()
ax = data.PERP_RACE.value_counts(sort =True).plot(kind='bar')
ax.set_title('Numbers Arrested by Race',fontsize=18)
ax.set_xlabel("Race",fontsize=18)
ax.set_ylabel("Number of Arrested",fontsize=18)

Black persons are just under 50% of those arrested. If we add Black Hispanic, then Black people make up a majority of arrests.
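
Because one class accounts for nearly half of the rows, a useful reference point for the classifiers later on is the accuracy of always predicting the most frequent class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels whose proportions only loosely mirror the imbalance described above. The label strings and counts are illustrative, not taken from the data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Made-up labels with one class at 50% (illustrative only)
y_toy = np.array(['BLACK'] * 5 + ['BLACK HISPANIC'] * 3 + ['OTHER'] * 2)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_accuracy = accuracy_score(y_toy, baseline.predict(X_toy))
print(baseline_accuracy)
```

A model whose validation accuracy is close to this baseline has learned little beyond the class frequencies.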

# Race and Age

In [None]:
#Races of the Arrested by Age Group
counts = data.groupby('AGE_GROUP').PERP_RACE.value_counts(sort=True)
# pd.DataFrame({'Counts': counts})

Most of those arrested are between 25 and 44.

In [None]:
plt.style.use('seaborn-whitegrid')
fig = plt.figure(num=None, figsize=(17, 7))
ax = plt.axes()
ax = data.groupby('AGE_GROUP').PERP_RACE.value_counts(sort=True).plot(kind='bar')
ax.set_title('Numbers Arrested in Age Group and Race',fontsize=18,fontdict=font)
ax.set_xlabel("Age Group and Race",fontsize=18,fontdict=font)
ax.set_ylabel("Number of Arrested",fontsize=18,fontdict=font)
ax.tick_params(labelsize=15)

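Judging from the shape of the graph above, every race shows the same pattern of arrests by age group. This describes arrest counts only, and raw counts are dominated by the size of each group, so it does not by itself show that behavior is the same across races.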
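
One way to check the "same pattern" observation without group sizes getting in the way is to compare shares rather than counts. The sketch below uses `pd.crosstab` with `normalize='columns'` on a toy frame. The race labels `A` and `B` and the age brackets other than 25-44 are placeholders, not values taken from the data.

```python
import pandas as pd

# Toy data: two placeholder race labels with identical age profiles
toy = pd.DataFrame({
    'AGE_GROUP': ['18-24', '25-44', '25-44', '45-64',
                  '18-24', '25-44', '25-44', '45-64'],
    'PERP_RACE': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
})

# Share of each race's arrests that falls in each age group (each column sums to 1)
age_share_by_race = pd.crosstab(toy['AGE_GROUP'], toy['PERP_RACE'], normalize='columns')
print(age_share_by_race)
```

If the columns of this table look alike on the real data, the age distribution of arrests is similar across races regardless of how many arrests each group has.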

# Gender

Below we see that 81.64% of those arrested are men.

In [None]:
#Male vs Female
counts = data.PERP_SEX.value_counts()
percent = data.PERP_SEX.value_counts(normalize = True).mul(100).round(2).astype(str) + '%'
pd.DataFrame({'Counts': counts, 'Percent':percent})

# Gender and Race

What is the relationship between gender and race?

In [None]:
#Arrests by Gender and Race
font = {'family': 'serif','color':  'darkred','weight': 'normal','size': 16}
fig = plt.figure(num=None, figsize=(20, 7))
ax = plt.axes()
ax = data.groupby('PERP_SEX').PERP_RACE.value_counts(sort=True).plot(kind='bar')
ax.set_title('Number of Arrested, Gender and Race',fontsize=18,fontdict=font)
ax.set_xlabel("Gender and Race",fontsize=18,fontdict=font)
ax.set_ylabel("Number of Arrested",fontdict=font)
ax.tick_params(labelsize=12)

Black males make up the largest portion of those arrested. For every race, more males than females are arrested. These are arrest counts, so on their own they do not show how often each group commits crime.

# Classification Models

In [None]:
#Look at the data columns again
data.columns

In [None]:
#Look at the data types again
data.info()

# Classification data preprocessing

Cast every column to object: the numerical values are codes without numerical meaning, and the one-hot encoding used next treats object columns as categories.


In [None]:
#Turn everything into an object because the numerical values don't have numerical meaning and won't be used (one-hot-encoder will be used next)
data = data.astype('object')
data.info()

The features are entirely categorical, so we one-hot encode them.
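
The reason for the cast to object can be seen on a toy frame. By default `pd.get_dummies` only expands object, string, and category columns, so an integer code column would otherwise pass through as a single numeric feature. The precinct numbers below are made up for illustration.

```python
import pandas as pd

# Toy frame with one integer code column and one string column (values are made up)
toy = pd.DataFrame({
    'ARREST_PRECINCT': [75, 14, 75],
    'PERP_SEX': ['Male', 'Female', 'Male'],
})

# Without the cast, the integer column is passed through as a single numeric column
as_numeric = pd.get_dummies(toy)

# After the cast, every distinct precinct becomes its own indicator column
as_category = pd.get_dummies(toy.astype('object'))

print(as_numeric.columns.tolist())
print(as_category.columns.tolist())
```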

In [None]:
#Create one hot encoded data
target_removed = data.loc[:, data.columns != 'PERP_RACE']
X = pd.get_dummies(target_removed)
target = data.loc[:,'PERP_RACE']

In [None]:
#Look at the one hot encoded table
X.head()

In [None]:
#Look at the table dimensions
X.info()

In [None]:
#Visually inspect the new one-hot encoded columns
X.columns

# Train Test Validation Split for Classification

In [None]:
#Train Test Split the data
X_train_all, X_test, y_train_all, y_test = train_test_split( X, target, test_size=0.2, random_state=42)
#Keep testing set for final testing, so make validation set here
X_train, X_val, y_train, y_val = train_test_split( X_train_all, y_train_all, test_size=0.2, random_state=42)
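
Applying `test_size=0.2` twice gives an overall split of roughly 64% train, 16% validation, and 20% test. A quick check on toy arrays (the `_toy` names are placeholders so that the real variables are not overwritten):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 1000 toy rows with a binary label
X_toy = np.arange(1000).reshape(-1, 1)
y_toy = np.arange(1000) % 2

# First split: hold out 20% as the test set
X_rest, X_test_toy, y_rest, y_test_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

# Second split: hold out 20% of the remainder as the validation set
X_train_toy, X_val_toy, y_train_toy, y_val_toy = train_test_split(
    X_rest, y_rest, test_size=0.2, random_state=42)

sizes = (len(X_train_toy), len(X_val_toy), len(X_test_toy))
print(sizes)
```

With imbalanced classes, passing `stratify=` to `train_test_split` is one option for keeping the class proportions similar in each part. The notebook does not do this.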


# Logistic Regression

In [None]:
#Fit logistic Regression
lr = LogisticRegression(solver='lbfgs', multi_class='multinomial')
lr.fit(X_train, y_train)
#Predict on training set
lr_y_pred = lr.predict(X_train)
print('Logistic Training accuracy: ', accuracy_score(y_train, lr_y_pred))
#Print training classification report
print(classification_report(y_train, lr_y_pred))

In [None]:
#Predict validation accuracy
lr_y_pred_val = lr.predict(X_val)
print('Logistic Validation accuracy: ', accuracy_score(y_val, lr_y_pred_val))

#Print validation classification report
print(classification_report(y_val, lr_y_pred_val))

# KNeighbors Classifier


In [None]:
# #Create knn object and fit data
# knn = KNeighborsClassifier(n_neighbors=5)
# knn.fit(X_train, y_train)

In [None]:
# #predict the training accuracy
# knn_y_pred = knn.predict(X_train)
# print('knn training accuracy: ', accuracy_score(y_train, knn_y_pred))

In [None]:
# #Print training classification report
# print(classification_report(y_train, knn_y_pred))

In [None]:
# #Predict validation accuracy
# knn_y_pred_val = knn.predict(X_val)
# print('KNN Validation accuracy: ', accuracy_score(y_val, knn_y_pred_val))

In [None]:
# #Print validation classification report
# print(classification_report(y_val, knn_y_pred_val))

# Support Vector Machine

In [None]:
#Create support vector object and fit data
svc = LinearSVC(C=10)
svc.fit(X_train, y_train)

In [None]:
#Predict training accuracy 
svc_y_pred = svc.predict(X_train)
print('svc training accuracy: ', accuracy_score(y_train, svc_y_pred))

In [None]:
#Print training classification report
print(classification_report(y_train, svc_y_pred))

In [None]:
#Predict validation accuracy
svc_y_pred_val = svc.predict(X_val)
print('SVC Validation accuracy: ', accuracy_score(y_val, svc_y_pred_val))

In [None]:
#Print validation classification report
print(classification_report(y_val, svc_y_pred_val))

# Decision Tree Classifier

In [None]:
#Create Decision Tree classifier and fit data
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train,y_train)

In [None]:
#Predict training accuracy 
decision_tree_y_pred = decision_tree.predict(X_train)
print('Decision Tree training accuracy: ', accuracy_score(y_train, decision_tree_y_pred))

In [None]:
#Print training classification report
print(classification_report(y_train, decision_tree_y_pred))

In [None]:
#Predict validation accuracy
decision_tree_y_pred_val = decision_tree.predict(X_val)
print('Decision Tree validation accuracy: ', accuracy_score(y_val, decision_tree_y_pred_val))

In [None]:
#Print validation classification report
print(classification_report(y_val, decision_tree_y_pred_val))

# Random Forest Classifier

In [None]:
#Create Random Forest classifier and fit data
rforest = RandomForestClassifier()
rforest.fit(X_train,y_train)

In [None]:
#Predict training accuracy 
rforest_y_pred = rforest.predict(X_train)
print('Random Forest training accuracy: ', accuracy_score(y_train, rforest_y_pred))

In [None]:
#Print training classification report
print(classification_report(y_train, rforest_y_pred))

In [None]:
#Predict validation accuracy
rforest_y_pred_val = rforest.predict(X_val)
print('Random Forest Validation accuracy: ', accuracy_score(y_val, rforest_y_pred_val))

In [None]:
#Print validation classification report
print(classification_report(y_val, rforest_y_pred_val))

# Principal Component Analysis (PCA)

In [None]:
#Use PCA to look at the explained variance by component in order to reduce the dimension of the arrest features
pca = PCA(n_components=150)
pca.fit(X_train)

In [None]:
#calculate variance ratios
variance = pca.explained_variance_ratio_ 
var=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=3)*100)

In [None]:
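#Plot the cumulative explained variance against the number of components
#Set the style before plotting; plt.style.context only takes effect inside a with block
plt.style.use('seaborn-whitegrid')
plt.plot(var)
plt.ylabel('% Variance Explained')
plt.xlabel('# of Components')
plt.title('PCA Analysis')
# plt.ylim(30,100.5)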

Using a 150-component PCA, we see that the explained variance starts to plateau after 20 components and does not improve much after 100 components (at which point about 90% of the variance is explained). Since our classifiers are not performing very well, removing components will likely make them perform worse. But even if they perform slightly worse, it is worth reducing the number of dimensions from the current 1892 columns down to about 100.
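
Instead of reading the cutoff off the plot, the smallest number of components that reaches a target share of variance can be computed from the cumulative explained variance ratio. The sketch below runs on random toy data, not the arrest features. The 90% target follows the figure quoted above, and the `_toy` and `n_for_90` names are placeholders.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: 200 samples, 10 features with decreasing variance
X_toy = rng.normal(size=(200, 10)) * np.linspace(5.0, 0.5, 10)

# Fit a full PCA and accumulate the explained variance ratios
pca_toy = PCA().fit(X_toy)
cumulative = np.cumsum(pca_toy.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 90%
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_for_90, cumulative[n_for_90 - 1])
```

scikit-learn's `PCA` also accepts a fraction between 0 and 1 for `n_components` with the full SVD solver, which selects the number of components in a similar way.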

In [None]:
#Reduce the features to 100 principal components, then refit logistic regression
pca = PCA(n_components=100)
pca.fit(X_train)
pca_X_train = pca.transform(X_train)
pca_X_val = pca.transform(X_val)

#Fit logistic Regression
lr = LogisticRegression(solver='lbfgs', multi_class='multinomial')
lr.fit(pca_X_train, y_train)
#Predict on training set
lr_y_pred = lr.predict(pca_X_train)
print('Logistic Training accuracy: ', accuracy_score(y_train, lr_y_pred))

#Print training classification report
print(classification_report(y_train, lr_y_pred))

#Predict validation accuracy
lr_y_pred_val = lr.predict(pca_X_val)
print('Logistic Validation accuracy: ', accuracy_score(y_val, lr_y_pred_val))

#Print validation classification report
print(classification_report(y_val, lr_y_pred_val))

After PCA, accuracy did not increase much, but recall improved slightly.
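
Accuracy and recall can move independently when classes are imbalanced, which is one reason to read both. The toy example below uses made-up labels and predictions (not model output from this notebook) to show two sets of predictions with the same accuracy but different macro-averaged recall.

```python
from sklearn.metrics import accuracy_score, recall_score

# Made-up ground truth: 8 samples of class A, 2 of class B
y_true = ['A'] * 8 + ['B'] * 2

# Predictions 1: always predict the majority class
pred_majority = ['A'] * 10
# Predictions 2: two A samples are missed, but both B samples are found
pred_balanced = ['A'] * 6 + ['B'] * 2 + ['B'] * 2

acc_majority = accuracy_score(y_true, pred_majority)
acc_balanced = accuracy_score(y_true, pred_balanced)

# Macro recall averages the per-class recall, so small classes count equally
rec_majority = recall_score(y_true, pred_majority, average='macro')
rec_balanced = recall_score(y_true, pred_balanced, average='macro')

print(acc_majority, rec_majority)
print(acc_balanced, rec_balanced)
```

`classification_report` shows the same per-class recall values row by row, along with their macro and weighted averages.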