#00 Instructions
The dataset we're working with contains records of individuals killed in police car chases from 2017 to 2022. Look through the original dataset and make sure you're clear on what the columns represent.

Fatal Police Pursuit [Database](https://github.com/sfchronicle/police_pursuits)

## Load in any libraries needed for performing regressions and predictive modeling.

In [None]:
# Basic working with data libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

# Library including linear/logistic regression
import statsmodels.api as sm

Read in the [data](http://github.com/sfchronicle/police_pursuits/blob/master/data/sfc_pursuit_fatalities.csv) as 'fatalities'.


In [None]:
fatalities = pd.read_csv("sfc_pursuit_fatalities.csv")
fatalities.head()

Unnamed: 0,unique_id,data_source,year,date,number_killed,age,gender,race,race_source,county,...,long,name,initial_reason,person_role,main_agency,news_urls,city,zip,centroid_geo,in_fars_pursuit
0,2918,nhtsa_sfchronicle,2021,10/28/21,1,93.0,male,white,"photo,nhtsa",boulder,...,-105.074233,joe william gold,suspected nonviolent,bystander,longmont police services,https://www.9news.com/article/news/crime/longm...,longmont,80504.0,0,1
1,323,nhtsa_sfchronicle,2017,8/17/17,1,92.0,male,white,"photo,nhtsa",muskegon,...,-86.218686,duane levi quigg,suspected nonviolent,bystander,norton shores police department,http://fox17online.com/2017/08/17/police-innoc...,norton shores,49444.0,0,1
2,1488,nhtsa_sfchronicle,2019,9/15/19,3,91.0,male,white,"photo,nhtsa",cole,...,-92.240667,bernard g steffel,suspected nonviolent,bystander,jefferson city police department,https://wreg.com/2019/09/16/memphis-man-two-ot...,jefferson city,65109.0,0,1
3,2315,nhtsa_sfchronicle,2020,12/25/20,2,89.0,male,"white,latino","photo,nhtsa",wyandotte,...,-94.648189,mario madruga,suspected nonviolent,bystander,kansas city police department,https://fox4kc.com/news/suspects-in-stolen-tru...,kansas city,66102.0,0,1
4,1720,nhtsa_sfchronicle,2020,2/14/20,1,89.0,female,white,"photo,nhtsa",oakland,...,-83.161506,mary lackamp,traffic stop,bystander,ferndale police department,https://wwjnewsradio.radio.com/articles/news/e...,detroit,48221.0,0,1


In [None]:
fatalities.columns

Index(['unique_id', 'data_source', 'year', 'date', 'number_killed', 'age',
       'gender', 'race', 'race_source', 'county', 'state', 'lat', 'long',
       'name', 'initial_reason', 'person_role', 'main_agency', 'news_urls',
       'city', 'zip', 'centroid_geo', 'in_fars_pursuit'],
      dtype='object')

#1. Regression
(a) Build a model predicting whether a fatal police incident was recorded as such. Remove all unnecessary variables.



In [None]:
# County has too many levels to be useful: with so few entries per location,
# we can't predict much from location alone, so we remove it for now.
# The same applies to the other location columns (state, main agency, etc.).
fatalities["county"].value_counts()

# The date column is removed as well. It likely won't be very predictive, and
# dates read from a CSV are plain text with no built-in ordering, so each date
# would be treated as its own category and we'd again have far too little data
# per category.

Unnamed: 0_level_0,count
county,Unnamed: 1_level_1
harris,79
los angeles,62
cook,57
dallas,45
jefferson,44
...,...
norton,1
platte,1
faulkner,1
mecklenberg,1
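
One way to back up the decision to drop the location columns is to measure how many categories have only a handful of rows. The sketch below uses a small made-up Series in place of fatalities["county"], and the cutoff of 5 rows is an arbitrary assumption rather than anything taken from the dataset.

```python
import pandas as pd

# Toy stand-in for fatalities["county"]; the values are made up for illustration.
county = pd.Series(
    ["harris"] * 6 + ["cook"] * 5 + ["norton", "platte", "faulkner", "boulder"]
)

counts = county.value_counts()

# Hypothetical threshold: treat a category with fewer than 5 rows as too sparse.
min_rows = 5
sparse = counts[counts < min_rows]

share_sparse_categories = len(sparse) / len(counts)
share_sparse_rows = sparse.sum() / counts.sum()

print(f"{share_sparse_categories:.0%} of categories are sparse")
print(f"{share_sparse_rows:.0%} of rows fall in a sparse category")
```

Running the same two ratios on the real column would show whether most counties sit below whatever cutoff seems reasonable.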


In [None]:
fatalities = fatalities[['year','number_killed', 'age',
       'gender', 'race',
       'initial_reason', 'person_role','centroid_geo', 'in_fars_pursuit']]
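# Drop rows with missing values; .copy() gives an independent frame so later
# column assignments don't act on a slice of the original data.
fatalities = fatalities.dropna().copy()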

In [None]:
fatalities["in_fars_pursuit"] = (fatalities["in_fars_pursuit"] != 2)

In [None]:
X = fatalities[['year','number_killed', 'age',
       'gender', 'race',
       'initial_reason', 'person_role','centroid_geo']]
X = pd.get_dummies(X).astype("float32")

y = fatalities["in_fars_pursuit"]

We need to drop a few more columns for this to work, because of how get_dummies encoded people with multiple racial identities and the less common gender categories. This is not ideal, but it is necessary for the model to fit. It also means the model is likely poor at capturing intersectional identities, and that should be taken into account when interpreting it.
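
To see where the combined columns come from, here is a sketch on made-up values. pd.get_dummies treats a string such as "white,latino" as its own category, whereas Series.str.get_dummies(sep=",") splits on the comma and sets one indicator per component. The toy race Series is an assumption for illustration, and the split encoding is only one possible alternative to dropping columns, not what this notebook does.

```python
import pandas as pd

# Toy stand-in for fatalities["race"]; values are made up for illustration.
race = pd.Series(
    ["white", "black", "white,latino", "black,latino", "latino"], name="race"
)

# Default one-hot encoding: every distinct string becomes its own column.
one_hot = pd.get_dummies(race, prefix="race")
print(sorted(one_hot.columns))

# Splitting on the comma gives one indicator per component identity instead.
multi_hot = race.str.get_dummies(sep=",").add_prefix("race_")
print(sorted(multi_hot.columns))
print(multi_hot)
```

With the split encoding, a person listed as "white,latino" contributes to both the white and the latino indicators instead of sitting in a separate, very small category.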

In [None]:
X = X.drop(columns=[
    "race_black,latino", "race_other,latino", "race_white,latino", "race_unknown",
    "gender_nonbinary", "gender_unknown",
])

In [None]:
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:        in_fars_pursuit   R-squared:                       0.927
Model:                            OLS   Adj. R-squared:                  0.926
Method:                 Least Squares   F-statistic:                     1084.
Date:                Tue, 19 Nov 2024   Prob (F-statistic):               0.00
Time:                        14:22:17   Log-Likelihood:                 1866.7
No. Observations:                1983   AIC:                            -3685.
Df Residuals:                    1959   BIC:                            -3551.
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
                                             coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------------

(b) Choose one slope in the model to explain, and interpret what that means.


(c) Evaluate your model.

Does this model seem like a good fit?

#2. Decision Tree
(a) With the same variables, fit this model as a decision tree. Give it a reasonable depth.
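
A minimal sketch of fitting a depth-limited tree, assuming scikit-learn is available (it is not among the libraries imported above). The toy frame and its column names are hypothetical stand-ins for the dummy-coded X and y, and max_depth=3 is an arbitrary choice to keep the tree readable.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up data with hypothetical column names, standing in for the dummy-coded X and y.
X_toy = pd.DataFrame({
    "age":                   [19, 23, 31, 45, 52, 60, 27, 38],
    "person_role_bystander": [0, 0, 1, 1, 0, 1, 0, 1],
})
y_toy = pd.Series([1, 1, 0, 0, 1, 0, 1, 0])

# max_depth caps how many splits deep the tree may grow; 3 is an arbitrary choice here.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_toy, y_toy)

# Print the learned rules as text so the splits can be read directly.
print(export_text(tree, feature_names=list(X_toy.columns)))
```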



(b) What is the accuracy of this model?
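
Accuracy is usually measured on rows the tree did not see during fitting, and it helps to compare it with the share of the majority class. The sketch below assumes scikit-learn and uses synthetic data. The rule generating y_toy is made up, so the printed numbers say nothing about the fatalities dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for X and y; the rule generating y_toy is made up.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Accuracy = share of correct predictions; compare it with always guessing the majority class.
test_accuracy = accuracy_score(y_test, tree.predict(X_test))
majority_baseline = max(y_test.mean(), 1 - y_test.mean())
print(f"test accuracy: {test_accuracy:.3f}, majority baseline: {majority_baseline:.3f}")
```

If one class dominates the target, a high accuracy can mean little more than predicting that class every time, which is why the baseline is printed alongside it.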


(c) You've been tasked with using this dataset to look at why some deaths are recorded as occurring as part of a police chase, while others were not recorded in the same way. Which model would you prefer to use for explaining your findings?



#3. Linear Regression

(a) Given the same task as above, is there a reasonable linear regression one can fit from this dataset?
