# Analysing Police Data
This notebook is my attempt to analyze the police data provided by the Stanford Open Policing Project. The interaction between police and civilians is documented for a number of states, but only the data from the state of Rhode Island is considered in this analysis. The data can be found via this [link](https://openpolicing.stanford.edu/data/)  
More information about the dataset can be found in this [README](https://github.com/stanford-policylab/opp/blob/master/data_readme.md) file. 

## Imports and Loading data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os

In [None]:
data_loc = os.path.join("data.csv") # change depending on the data's location in the local machine

In [None]:
df_org = pd.read_csv(data_loc)
df = df_org.copy()

In [None]:
print(df.head())
print(df.shape) # a large dataset: roughly 500k rows and 31 columns
df.info() # info is a method, so it must be called to print the summary

## Data Cleaning
Data is inherently dirty. Cleaning the data is a fundamental and crucial step in the data analysis process.

### NaN values
NaN values are the convention for representing unavailable data, whether it is missing, corrupted, or simply does not fit in the specific column. Such values should be either dropped or imputed before further data manipulation, so that meaningful conclusions can be drawn.
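
The two options mentioned above, dropping and imputing, can be sketched on a small example. The frame below is hypothetical toy data, not taken from the police dataset, and the choice of median and mode as fill values is only one assumption among many possible imputation strategies.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the police dataset.
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "zone": ["X4", "K3", None, "X4"],
})

# Option 1: drop every row that has a NaN in the chosen columns.
dropped = toy.dropna(subset=["age", "zone"])

# Option 2: impute, here with the median for numbers and the mode for labels.
imputed = toy.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["zone"] = imputed["zone"].fillna(imputed["zone"].mode().iloc[0])

print(len(dropped))                # rows that survive dropping
print(imputed.isna().sum().sum())  # NaN values left after imputing
```

Dropping keeps only fully observed rows, while imputing keeps every row at the cost of introducing estimated values.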

In [None]:
#df.info()
# we can see that certain columns have a large number of NaN values
nan_ratios = df.isna().sum() / len(df)
# columns where more than 90% of the values are NaN are dropped
high_nan_cols = nan_ratios[nan_ratios > 0.9].index

df.drop(high_nan_cols, axis=1, inplace=True)

In [None]:
print(df.shape) # 10 columns were dropped
# let's now consider the remaining NaN values
print(df.isna().sum()) 
# the vehicle make and model do not seem of much importance,
# so let's drop both, along with the raw row number
df.drop(['raw_row_number', 'vehicle_make', 'vehicle_model'], axis=1, inplace=True)

# one of the interesting points in this dataset is the relation between sex and race,
# so all rows with NaN values in these columns are dropped
df.dropna(subset=['subject_race', 'subject_sex'], inplace=True)

In [None]:
print(df.isna().sum()) # only one column is left with NaN values. We are ready to proceed with the data analysis.
print(df.shape)

### Fixing DataTypes 
Types matter because they define the operations that can be performed on the values, as well as the efficiency of working with them. Leaving as few object columns as possible is a good start.
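
As a rough illustration of the efficiency point, the sketch below compares the memory footprint of the same labels stored as `object` and as `category`. The series is hypothetical toy data; the exact byte counts depend on the pandas version and platform, so only the relative difference is meaningful.

```python
import pandas as pd

# Hypothetical column with few distinct labels repeated many times.
labels = pd.Series(["white", "black", "hispanic", "other"] * 25_000)

as_object = labels.astype(object)
as_category = labels.astype("category")

# deep=True also counts the memory held by the Python string objects.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

A categorical column stores each distinct label once plus a small integer code per row, which is why it tends to pay off when a column has few unique values.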


In [None]:
print(df.dtypes) # assigning the right data types to columns could optimize the entire process.
# let's check columns 5 by 5
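
The five-columns-at-a-time inspection can also be wrapped in a small helper so the loop is not repeated in every cell. This is only a sketch: `summarize_columns` is a hypothetical name, and the toy frame stands in for the police data.

```python
import pandas as pd

def summarize_columns(frame, start, stop):
    """Print value counts and dtype for the columns in positions [start, stop)."""
    stop = min(stop, len(frame.columns))  # avoid running past the last column
    names = list(frame.columns[start:stop])
    for name in names:
        column = frame[name]
        print(column.value_counts())
        print(f"{name}'s type: {column.dtype}")
        print()
    return names

# Hypothetical toy frame standing in for the police data.
toy = pd.DataFrame({"zone": ["X4", "K3", "X4"], "sex": ["male", "female", "male"]})
inspected = summarize_columns(toy, 0, 5)
```

Returning the inspected names makes it easy to confirm which columns a given call actually covered.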

In [69]:
for i in range(0, 5):
    new_series = df.iloc[:, i]
    print(new_series.value_counts()) # we can see that zone, race and sex can be converted to the category datatype
    print(new_series.name + "'s type: " + str(new_series.dtype))
# first, let's rename some columns
df.rename(columns={"subject_race": "race", "subject_sex":"sex"}, inplace=True)

# convert type to category
df['zone'] = df['zone'].astype('category')
df['race'] = df['race'].astype('category')
df['sex'] = df['sex'].astype('category')

2006-05-21    294
2015-09-05    292
2005-11-05    283
2012-01-08    278
2012-01-07    275
             ... 
2005-01-12      1
2005-05-11      1
2005-06-06      1
2005-08-08      1
2005-06-18      1
Name: date, Length: 3803, dtype: int64
date's type: object
10:00:00    1677
11:00:00    1660
10:30:00    1552
09:00:00    1548
09:30:00    1400
            ... 
05:09:00       7
05:19:00       7
04:52:00       7
05:27:00       7
05:22:00       5
Name: time, Length: 1440, dtype: int64
time's type: object
X4    125670
K3    108868
K2     97281
X3     89431
K1     46110
X1     13224
Name: zone, dtype: int64
zone's type: object
white                     344716
black                      68577
hispanic                   53123
asian/pacific islander     12824
other                       1344
Name: race, dtype: int64
race's type: object
male      349446
female    131138
Name: sex, dtype: int64
sex's type: object


In [70]:
for i in range(5, 10):    
    new_series = df.iloc[:, i]
    print(new_series.value_counts())  
    print(new_series.name + "'s type: " + str(new_series.dtype))  
    print()
# arrest_made, citation_issued and warning_issued should be converted to boolean
df.rename(columns={"arrest_made": "arrest", "citation_issued": "citation", "warning_issued":"warning"}, inplace=True)
df['arrest'] = df['arrest'].astype(bool)
df['warning'] = df['warning'].astype(bool)
df['citation'] = df['citation'].astype(bool)

500    114274
300     87077
900     71600
200     70925
600     28568
        ...  
20          1
MA          1
501         1
1.0         1
006         1
Name: department_id, Length: 75, dtype: int64
department_id's type: object

vehicular    480584
Name: type, dtype: int64
type's type: object

False    463981
True      16603
Name: arrest, dtype: int64
arrest's type: bool

True     428378
False     52206
Name: citation, dtype: int64
citation's type: bool

False    451744
True      28840
Name: warning, dtype: int64
warning's type: bool
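
One caveat with `astype(bool)` is that it relies on truthiness, so a NaN in an object column silently becomes `True`. The three flag columns here each add up to the full 480584 rows, so they appear to contain no NaN values and the conversion is safe. The sketch below, on hypothetical toy data, shows the pitfall and pandas' nullable `"boolean"` dtype as one possible alternative when gaps do exist.

```python
import numpy as np
import pandas as pd

# Hypothetical flag column with a missing entry (not from the police dataset).
flags = pd.Series([True, False, np.nan], dtype=object)

naive = flags.astype(bool)          # NaN is truthy, so it silently becomes True
nullable = flags.astype("boolean")  # the nullable dtype keeps the gap as <NA>

print(naive.tolist())
print(nullable.isna().tolist())
```

Checking `isna().sum()` on a column before converting it is a cheap way to decide which of the two dtypes is appropriate.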



In [73]:
# what is even more interesting here is the type column 
print(df['type'].value_counts())
print(df['type'].isna().sum())
# the type is the same across the dataset which eliminates the need for that column
df.drop('type',axis=1, inplace=True)

vehicular    480584
Name: type, dtype: int64
0


In [None]:
for i in range(10, 15):
    new_series = df.iloc[:, i]
    print(new_series.value_counts())  
    print(new_series.name + "'s type: " + str(new_series.dtype))  
    print() 

# the documentation explains that the outcome column can take 4 main values: 'arrest', 'citation', 'warning' and 'summons'
# we can consider the 'arrest', 'citation' and 'warning' columns as the result of one-hot encoding the outcome column
# let's run an integrity check first before proceeding any further


In [61]:
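outcomes = ['arrest', 'citation', 'warning']
def outcome_valid(row):
    """Flag whether the outcome column agrees with the three boolean outcome columns."""
    out = row['outcome']
    if out not in outcomes:
        # no recognised outcome (e.g. NaN): valid only if none of the flags is set
        row['outcome_valid'] = row[outcomes].sum() == 0
        return row
    
    # recognised outcome: its own flag must be set and must be the only one set
    row['outcome_valid'] = ((row[out] == True) and (row[outcomes].sum() == 1)) 
    return row

# note: a row-wise apply is slow on a dataset of this size
df = df.apply(outcome_valid, axis=1)
non_valid_rows = df[df['outcome_valid'] == False]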
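
Because a row-wise `apply` calls a Python function once per row, the same check can be expressed with column-wise operations instead. The sketch below is one possible vectorized version of the rule, run on hypothetical toy rows rather than the real data; it assumes the three flag columns are already boolean.

```python
import numpy as np
import pandas as pd

outcomes = ["arrest", "citation", "warning"]

# Hypothetical toy rows: the first three are consistent, the last one is not.
toy = pd.DataFrame({
    "outcome":  ["arrest", "citation", np.nan, "warning"],
    "arrest":   [True,  False, False, False],
    "citation": [False, True,  False, True],
    "warning":  [False, False, False, False],
})

# Number of outcome flags set on each row.
flag_count = toy[outcomes].sum(axis=1)

# True where the flag named by the outcome column is itself set.
matches_flag = pd.Series(False, index=toy.index)
for name in outcomes:
    matches_flag |= (toy["outcome"] == name) & toy[name]

# Known outcomes need exactly their own flag; anything else needs no flag at all.
known = toy["outcome"].isin(outcomes)
toy["outcome_valid"] = np.where(known, matches_flag & (flag_count == 1), flag_count == 0)

print(toy["outcome_valid"].tolist())
```

The last toy row is flagged as invalid because its outcome says `warning` while only the `citation` flag is set.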

In [62]:
# under the assumption that only one outcome can take place, the NaN values in the outcome column should correspond to summons
# any non-valid rows would have to be dealt with separately
print(non_valid_rows.empty)

# the NaN values in the outcome column represent summons
df['summons'] = df['outcome'].isna()
# fill only the outcome column, so that NaN values in any other column are left untouched
df['outcome'] = df['outcome'].fillna('summons')

# we can drop either reason_for_stop or raw_BasisForStop as they are equivalent; the latter is dropped as its values are less expressive
df.drop('raw_BasisForStop', axis=1, inplace=True) 
df['reason_for_stop'] = df['reason_for_stop'].astype('category')

True


In [66]:
for i in range(14, min(19, len(df.columns))):
    new_series = df.iloc[:, i]
    print(new_series.value_counts())  
    print(new_series.name + "'s type: " + str(new_series.dtype))  
    print() 
# these columns are dropped: they are raw versions of other columns, and outcome_valid was only a temporary check
df = df.iloc[:, :13]    

W    344716
B     68577
H     44046
I     12824
L      9077
O       814
N       530
Name: raw_OperatorRace, dtype: int64
raw_OperatorRace's type: object

M    349446
F    131138
Name: raw_OperatorSex, dtype: int64
raw_OperatorSex's type: object

M    428378
W     28840
D     14630
N      3431
A      3332
P      1973
Name: raw_ResultOfStop, dtype: int64
raw_ResultOfStop's type: object

True    480584
Name: outcome_valid, dtype: int64
outcome_valid's type: bool



In [74]:
print(df.columns)
print(df.dtypes)

Index(['date', 'time', 'zone', 'race', 'sex', 'department_id', 'arrest',
       'search_conducted'],
      dtype='object')
date                  object
time                  object
zone                category
race                category
sex                 category
department_id         object
arrest                  bool
citation                bool
summons                 bool
frisk_performed         bool
search_conducted        bool
dtype: object
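
The listing above still shows `date` and `time` as object columns. One possible next step, sketched here on hypothetical toy values in the same `YYYY-MM-DD` and `HH:MM:SS` shapes as the value counts shown earlier, is to combine them into a single datetime column. The column name `stop_datetime` is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy values in the same string formats as the date and time columns.
toy = pd.DataFrame({
    "date": ["2006-05-21", "2015-09-05"],
    "time": ["10:00:00", "09:30:00"],
})

# Join the two strings, then parse them with an explicit format.
combined = toy["date"] + " " + toy["time"]
toy["stop_datetime"] = pd.to_datetime(combined, format="%Y-%m-%d %H:%M:%S")

print(toy["stop_datetime"].dt.year.tolist())
print(toy["stop_datetime"].dt.hour.tolist())
```

With a real datetime column, components such as year or hour are available through the `.dt` accessor, which makes grouping stops by time period straightforward.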
