# TITLE: PREDICTION OF ARREST BASED ON TERRY TRAFFIC STOP

![Terry Stop Project Image](b15734dc-3688-4f31-ba91-c1a113f867db.webp)

<div style="font-family: 'Times New Roman'; font-size: 12pt;">
<h2>Introduction</h2>
<p>
A Terry Stop, also known as an "Investigative Detention" or "Stop and Frisk," allows law enforcement officers to temporarily detain a person for investigation when they reasonably suspect involvement in criminal activity. This stop is used when there is insufficient evidence for an arrest (i.e., no probable cause) but enough suspicion to justify a brief investigation.
</p>
<p>
The primary purpose is to either confirm or dispel the officer's suspicions. If during the stop, probable cause for an arrest arises, the suspect is arrested. If no probable cause is found, the suspect is released.
</p>
</div>



<div style="font-family: 'Times New Roman'; font-size: 12pt;">
<h2>Problem Statement</h2>
<p>
Terry stops can disrupt the normal lives of individuals, especially when no arrests are made. During these stops, police officers may spend valuable time that could otherwise be used to address crime in other areas. Additionally, Terry stops initiated through 911 calls can lead to a waste of resources, particularly when they result in no arrests. These stops have also been plagued by concerns of racial profiling, with many individuals believing that the stops are disproportionately based on their race or gender. This has eroded the public’s perception of law enforcement and diminished the confidence they have in the police.

Although Terry stops are beneficial in preventing potential crimes, a more balanced approach is needed. To aid in this, a model has been developed to predict whether a Terry stop will result in an arrest. Given that Terry stops can be initiated through various channels such as 911 calls, messages, alarms, and police interactions, this model can help schedule calls appropriately and allocate resources more efficiently.
</p>
</div>

<div style="font-family: 'Times New Roman'; font-size: 12pt;">
<h2>Objective </h2>
<p>
To create a model capable of accurately predicting whether a Terry stop will result in an arrest, aiming for an F1 score above 75% and an accuracy above 85%.
</p>
</div>

<div style="font-family: 'Times New Roman'; font-size: 12pt;">
<h2>Data Understanding</h2>
<p>
The data for this analysis was obtained from the Seattle Police Department. It contains records of police-reported stops under Terry v. Ohio, 392 U.S. 1 (1968). Each row represents a unique stop.

Each record contains the perceived demographics of the subject, as reported by the officer making the stop, and the officer demographics, as reported to the Seattle Police Department.
The original dataset contained 62020 entries. The Officer Squad column had missing values; dropping those rows left 61459 entries with 23 columns. Because of data sensitivity and ethical considerations, the gender and race columns were not used in the analysis.

The data can be obtained from: http://tiny.cc/l4hzzz

Column descriptions: http://tiny.cc/c5hzzz
</p>
</div>

## Data Preparation 

In [1]:
#importing the relevant libraries and  loading the data.
import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")
Data=pd.read_csv("Terry_Stops_20241202.csv")
#checking the first rows of the data frame
Data.head()

Unnamed: 0,Subject Age Group,Subject ID,GO / SC Num,Terry Stop ID,Stop Resolution,Weapon Type,Officer ID,Officer YOB,Officer Gender,Officer Race,...,Reported Time,Initial Call Type,Final Call Type,Call Type,Officer Squad,Arrest Flag,Frisk Flag,Precinct,Sector,Beat
0,46 - 55,-1,20160000003294,178940,Field Contact,,6902,1969,M,White,...,09:02:00.0000000,-,-,-,WEST PCT 1ST W - KQ/DM RELIEF,N,-,Southwest,W,W2
1,26 - 35,-1,20160000003741,187898,Field Contact,,7430,1984,F,White,...,19:53:00.0000000,-,-,-,NORTH PCT 2ND WATCH - NORTH BEATS,N,N,East,C,C1
2,26 - 35,-1,20150000004367,76437,Field Contact,,5453,1962,M,White,...,17:58:00.0000000,-,-,-,WEST PCT 2ND W - DAVID BEATS,N,N,-,-,-
3,46 - 55,-1,20160000000533,126509,Field Contact,,7597,1982,M,White,...,16:03:00.0000000,-,-,-,NORTH PCT 2ND WATCH - NORTH BEATS,N,N,-,-,-
4,26 - 35,7726996644,20220000001484,30755907245,Arrest,-,6122,1966,M,White,...,19:12:35.0000000,SHOPLIFT - THEFT,--THEFT - SHOPLIFT,"TELEPHONE OTHER, NOT 911",SOUTHWEST PCT 1ST W - WILLIAM - PLATOON 2,Y,Y,Southwest,F,F1


In [2]:
#checking for general information about the dataframe
Data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 62020 entries, 0 to 62019
Data columns (total 23 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Subject Age Group         62020 non-null  object
 1   Subject ID                62020 non-null  int64 
 2   GO / SC Num               62020 non-null  int64 
 3   Terry Stop ID             62020 non-null  int64 
 4   Stop Resolution           62020 non-null  object
 5   Weapon Type               62020 non-null  object
 6   Officer ID                62020 non-null  object
 7   Officer YOB               62020 non-null  int64 
 8   Officer Gender            62020 non-null  object
 9   Officer Race              62020 non-null  object
 10  Subject Perceived Race    62020 non-null  object
 11  Subject Perceived Gender  62020 non-null  object
 12  Reported Date             62020 non-null  object
 13  Reported Time             62020 non-null  object
 14  Initial Call Type     

In [3]:
#creating a copy of the original data
df=Data.copy()

The data frame contains 62020 entries and 23 columns. The columns are of integer and object dtype. The date and time columns are stored as object. Missing values appear in the Officer Squad column.

In [4]:
#checking for missing data and their percentage 

class MissingData:
    def __init__(self, data):
        self.data = data

    def check_missing(self):
        missing_data = self.data.isna().sum()
        percent_missing = (missing_data / len(self.data)) * 100
        
        # Creating a DataFrame to return missing data and percentage
        results = pd.DataFrame({
            'Missing Values': missing_data,
            'Percent Missing': percent_missing
        })
        
        return results
missing_data=MissingData(df)
missing_data=missing_data.check_missing()
missing_data

Unnamed: 0,Missing Values,Percent Missing
Subject Age Group,0,0.0
Subject ID,0,0.0
GO / SC Num,0,0.0
Terry Stop ID,0,0.0
Stop Resolution,0,0.0
Weapon Type,0,0.0
Officer ID,0,0.0
Officer YOB,0,0.0
Officer Gender,0,0.0
Officer Race,0,0.0


Only the officer squad column has missing values at 0.9045%.
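
The same summary can be produced without a helper class. The sketch below uses a small made-up frame (not the Terry stops data) and relies on `isna().mean()` returning the fraction of missing entries per column. It also illustrates that a `"-"` placeholder is not counted as missing by `isna()`, which is why such placeholders are handled separately in the cleaning steps.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; the values are invented.
toy = pd.DataFrame({
    "Officer Squad": ["A", np.nan, "B", "C"],
    "Precinct": ["West", "North", "-", "East"],
})

# Count and percentage of missing entries per column.
summary = pd.DataFrame({
    "Missing Values": toy.isna().sum(),
    "Percent Missing": toy.isna().mean() * 100,
})
print(summary)
```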

### Data cleaning


In [5]:
#checking for duplicates
df.duplicated().sum()

0

There are no duplicates in the dataset

In [6]:
#dropping the rows with missing values
df=df.dropna()

In [7]:
df.shape

(61459, 23)

The resulting dataframe contains 61459 rows and 23 columns.
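
Since only the Officer Squad column has missing values, dropping rows with any missing value and dropping rows missing that one column remove the same rows here. As a hedged sketch on invented data, naming the column with `subset` makes the intent explicit and behaves differently once other columns also contain gaps:

```python
import numpy as np
import pandas as pd

# Toy frame; values are invented for illustration.
toy = pd.DataFrame({
    "Officer Squad": ["A", np.nan, "B"],
    "Beat": ["K3", "M3", np.nan],
})

# Drop only the rows where "Officer Squad" is missing.
by_subset = toy.dropna(subset=["Officer Squad"])

# Drop rows that have a missing value in any column.
any_missing = toy.dropna()

print(len(by_subset), len(any_missing))
```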

In [8]:
#checking for unique values for the columns 
class UniqueValues:
    def __init__(self, data, column):
        self.data = data  
        self.column = column  

    def checking_unique(self):
        unique = self.data[self.column].value_counts()
        return unique        

In [9]:
subject_age_group=UniqueValues(df,"Subject Age Group")
print(subject_age_group.checking_unique())
#replacing the "-" placeholder with "Age Unknown"
df["Subject Age Group"]=df["Subject Age Group"].replace("-","Age Unknown")
df["Subject Age Group"].value_counts(normalize=True)#checking the share of each age group

26 - 35         20544
36 - 45         13793
18 - 25         11585
46 - 55          7797
56 and Above     3240
1 - 17           2286
-                2214
Name: Subject Age Group, dtype: int64


26 - 35         0.334272
36 - 45         0.224426
18 - 25         0.188500
46 - 55         0.126865
56 and Above    0.052718
1 - 17          0.037196
Age Unknown     0.036024
Name: Subject Age Group, dtype: float64

The subject age group is categorical data consisting of different age groups. The 26 - 35 group accounts for the largest share of stops (about 33%), followed by ages 36 - 45 (about 22%). Unknown ages and ages 1 - 17 together account for about 7% of the stops.

In [10]:
#checking for subject ID 
sub_id=UniqueValues(df,"Subject ID")
print(sub_id.checking_unique())

-1              34747
 7753260438        29
 7774286580        22
 21375848115       21
 7726918259        21
                ...  
 38095058973        1
 7730377407         1
 54607472558        1
 7735110248         1
 60787853579        1
Name: Subject ID, Length: 17425, dtype: int64


This column contains the subjects' identification numbers. The value -1 appears in 34747 rows. The column is not relevant for modelling, as it only identifies a person.

In [11]:
GO_SC_Num=UniqueValues(df,"GO / SC Num")
print(GO_SC_Num.checking_unique())

20160000378750    16
20150000190790    16
20240000319277    15
20210000267148    14
20180000134604    14
                  ..
20160000300181     1
20160000397706     1
20170000003922     1
20200000293330     1
20230000186324     1
Name: GO / SC Num, Length: 49280, dtype: int64


This column is relevant as it helps in tracking the validity or outcomes of complaints linked to specific Terry stops.

In [12]:
#checking for unique values in the Terry Stop ID column
Terry_Stop_ID=UniqueValues(df,"Terry Stop ID")
print(Terry_Stop_ID.checking_unique())

19324329995    3
32633045284    3
19268585233    3
13080077761    3
55477887782    3
              ..
238381         1
124682         1
194981         1
478096         1
51297593765    1
Name: Terry Stop ID, Length: 61354, dtype: int64


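The Terry Stop ID identifies a Terry stop. It is almost unique: there are 61354 distinct IDs across 61459 rows, and a few IDs appear up to three times.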
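
A quick way to check whether an identifier column is truly unique is sketched below on made-up IDs. `is_unique`, `nunique` and `duplicated(keep=False)` are standard pandas calls; the toy values are assumptions for illustration only.

```python
import pandas as pd

# Invented IDs: 102 appears twice and 103 three times.
toy = pd.DataFrame({"Terry Stop ID": [101, 102, 102, 103, 103, 103]})
ids = toy["Terry Stop ID"]

print(ids.is_unique)            # False
print(ids.nunique(), len(ids))  # 3 6

# keep=False marks every row whose ID occurs more than once.
repeated = toy[ids.duplicated(keep=False)]
print(repeated)
```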

In [13]:
Stop_Resolution=UniqueValues(df,"Stop Resolution")
print(Stop_Resolution.checking_unique())

Field Contact               29979
Offense Report              15643
Arrest                      14900
Referred for Prosecution      719
Citation / Infraction         218
Name: Stop Resolution, dtype: int64


Stop resolution refers to the outcome of a police stop, that is, how the interaction between an officer and a detained individual ends. This is a relevant column.

In [14]:
#checking the share of each stop resolution
#before grouping the values into arrest and no arrest
df["Stop Resolution"].value_counts(normalize=True)

Field Contact               0.487789
Offense Report              0.254527
Arrest                      0.242438
Referred for Prosecution    0.011699
Citation / Infraction       0.003547
Name: Stop Resolution, dtype: float64

About 24% of stops were resolved with an arrest; field contacts, offense reports, referrals for prosecution and citations make up the remaining 76%. The column is still categorical at this point.
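
If the resolution is to be recoded into arrest versus no arrest, one possible approach is a boolean comparison cast to integer. This is a sketch on invented rows, and the column name `resolution_is_arrest` is hypothetical rather than one used in the notebook.

```python
import pandas as pd

# Toy rows using resolution labels that appear in the data.
toy = pd.DataFrame({
    "Stop Resolution": [
        "Field Contact", "Arrest", "Offense Report",
        "Arrest", "Citation / Infraction",
    ]
})

# 1 where the resolution is "Arrest", 0 for every other value.
toy["resolution_is_arrest"] = (toy["Stop Resolution"] == "Arrest").astype(int)
print(toy)
```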

In [15]:
# weapon type column
#checking for unique values in the column
Weapon_Type=UniqueValues(df,"Weapon Type")
print(Weapon_Type.checking_unique())

None                                    32208
-                                       25334
Lethal Cutting Instrument                1464
Knife/Cutting/Stabbing Instrument        1374
Handgun                                   399
Blunt Object/Striking Implement           181
Firearm                                   113
Firearm Other                              99
Other Firearm                              79
Mace/Pepper Spray                          58
Club, Blackjack, Brass Knuckles            49
Taser/Stun Gun                             20
None/Not Applicable                        20
Firearm (unk type)                         15
Fire/Incendiary Device                     13
Rifle                                      11
Club                                        9
Shotgun                                     6
Personal Weapons (hands, feet, etc.)        2
Automatic Handgun                           2
Brass Knuckles                              1
Blackjack                         

In [16]:
#replacing "-" with "weapon unknown"
df["Weapon Type"]=df["Weapon Type"].replace("-","weapon unknown")
#merging "None/Not Applicable" into "None"
df["Weapon Type"]=df["Weapon Type"].replace({"None/Not Applicable":"None"})
# most rows fall under "None" or "weapon unknown", so all actual weapons are placed in one category called weapon_found
# Create a list of weapon-related values
weapon_types = [
    "Lethal Cutting Instrument", "Knife/Cutting/Stabbing Instrument", "Handgun",
    "Blunt Object/Striking Implement", "Firearm", "Firearm Other", "Other Firearm",
    "Mace/Pepper Spray", "Firearm (unk type)", "Club, Blackjack, Brass Knuckles",
    "Taser/Stun Gun", "Shotgun", "Fire/Incendiary Device", "Club",
    "Personal Weapons (hands, feet, etc.)", "Automatic Handgun",
    "Brass Knuckles", "Blackjack", "Poison","Rifle"
]

# Replace matching values with "weapon_found"
df["Weapon Type"] = df["Weapon Type"].apply(lambda x: "weapon_found" if x in weapon_types else x)


In [17]:
df["Weapon Type"].value_counts()

None              32228
weapon unknown    25334
weapon_found       3897
Name: Weapon Type, dtype: int64

The resulting Weapon Type column contains three values: None (no weapon found), weapon unknown, and weapon_found.
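
An alternative to listing every weapon label is to treat anything that is neither "None" nor unknown as a weapon. The sketch below shows that variant on invented values using `Series.mask`. Note the difference from the explicit list: a label not seen before would also become `weapon_found`, which may or may not be desirable.

```python
import pandas as pd

# Invented sample of raw labels.
weapons = pd.Series(["None", "-", "Handgun", "None/Not Applicable", "Rifle"])

# Normalise the placeholder and the duplicate "no weapon" label first.
grouped = weapons.replace({"-": "weapon unknown", "None/Not Applicable": "None"})

# Everything that is not "None" or "weapon unknown" is treated as a weapon.
is_weapon = ~grouped.isin(["None", "weapon unknown"])
grouped = grouped.mask(is_weapon, "weapon_found")
print(grouped.tolist())
```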

In [18]:
#converting to datetime format
df["Reported Date"] = pd.to_datetime(df["Reported Date"], format="%Y/%m/%d")
# creating columns for the year, month and day extracted from "Reported Date"
df["reported_Year"] = df["Reported Date"].dt.year
df["reported_Month"] = df["Reported Date"].dt.month
df["reported_Day"] = df["Reported Date"].dt.day
#creating another column for day of week
df["day_of_week"]=df["Reported Date"].dt.weekday

In [19]:
df["Reported Time"].value_counts(normalize=True)

03:09:00.0000000    0.000846
02:56:00.0000000    0.000846
19:18:00.0000000    0.000830
17:00:00.0000000    0.000830
03:13:00.0000000    0.000814
                      ...   
00:45:11.0000000    0.000016
23:01:39.0000000    0.000016
14:37:15.0000000    0.000016
14:14:33.0000000    0.000016
05:10:04.0000000    0.000016
Name: Reported Time, Length: 24282, dtype: float64

In [20]:
#Reported Time is of object type; we convert it to datetime format and create a new column containing only the hour
df["Reported Time"]=pd.to_datetime(df["Reported Time"], format="%H:%M:%S.%f")
df["Time"]=df["Reported Time"].dt.hour
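
As a small check of what the derived columns contain, the sketch below parses two invented date and time strings in the same formats used above. In pandas, `dt.weekday` numbers the days from Monday = 0 to Sunday = 6, and `dt.day_name()` is shown only as a readable cross-check; it is not a column created in this notebook.

```python
import pandas as pd

# Invented rows in the same string formats as the dataset.
toy = pd.DataFrame({
    "Reported Date": ["2024/12/02", "2024/12/07"],
    "Reported Time": ["09:02:00.0000000", "19:12:35.0000000"],
})

dates = pd.to_datetime(toy["Reported Date"], format="%Y/%m/%d")
times = pd.to_datetime(toy["Reported Time"], format="%H:%M:%S.%f")

toy["day_of_week"] = dates.dt.weekday   # Monday = 0 ... Sunday = 6
toy["day_name"] = dates.dt.day_name()   # readable cross-check
toy["Time"] = times.dt.hour             # hour of day, 0-23
print(toy[["day_of_week", "day_name", "Time"]])
```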

In [21]:
df["Initial Call Type"].value_counts(normalize=True)

-                                                    0.216388
SUSPICIOUS STOP - OFFICER INITIATED ONVIEW           0.079256
SUSPICIOUS PERSON, VEHICLE, OR INCIDENT              0.069445
DISTURBANCE                                          0.050131
ASLT - CRITICAL (NO SHOOTINGS)                       0.044339
                                                       ...   
HARBOR - WATER EMERGENCIES                           0.000016
INJURED -  PERSON/INDUSTRIAL ACCIDENT                0.000016
PUBLIC DISPLAY OF PORNOGRAPHY                        0.000016
REQUEST TO WATCH                                     0.000016
ALARM - PUBLIC TRANSPORTATION (CITY/STATE/COUNTY)    0.000016
Name: Initial Call Type, Length: 182, dtype: float64

In [22]:
df["Final Call Type"].value_counts()

-                                                     13299
--SUSPICIOUS CIRCUM. - SUSPICIOUS PERSON               5716
--PROWLER - TRESPASS                                   4385
--DISTURBANCE - OTHER                                  3665
--ASSAULTS, OTHER                                      3063
                                                      ...  
PEDESTRIAN VIOLATIONS                                     1
-DOWN TIME - OUT OF SERVICE                               1
ASLT - DV CRITICAL                                        1
URINATING, DEFECATING IN PUBLIC                           1
--HELP THE OFFICER -ASSIST THE OFFICER (NON EMERG)        1
Name: Final Call Type, Length: 201, dtype: int64

In [23]:
df["Call Type"].value_counts()

911                              29022
ONVIEW                           14439
-                                13299
TELEPHONE OTHER, NOT 911          4133
ALARM CALL (NOT POLICE ALARM)      534
TEXT MESSAGE                        30
HISTORY CALL (RETRO)                 1
SCHEDULED EVENT (RECURRING)          1
Name: Call Type, dtype: int64

In [24]:
# replacing "-" with unknown
df["Call Type"]=df["Call Type"].replace("-","Call Type UNKNOWN")

In [25]:
df["Call Type"].unique()

array(['Call Type UNKNOWN', 'TELEPHONE OTHER, NOT 911', '911', 'ONVIEW',
       'ALARM CALL (NOT POLICE ALARM)', 'TEXT MESSAGE',
       'HISTORY CALL (RETRO)', 'SCHEDULED EVENT (RECURRING)'],
      dtype=object)

In [26]:
#combining text message, history call and scheduled event into a single category, "OTHER CALLS"
other_calls=['TEXT MESSAGE','HISTORY CALL (RETRO)', 'SCHEDULED EVENT (RECURRING)']
# Mapping the column
df["Call Type"] = df["Call Type"].apply(lambda x: "OTHER CALLS" if x in other_calls else x)
df["Call Type"].value_counts()

911                              29022
ONVIEW                           14439
Call Type UNKNOWN                13299
TELEPHONE OTHER, NOT 911          4133
ALARM CALL (NOT POLICE ALARM)      534
OTHER CALLS                         32
Name: Call Type, dtype: int64

In [27]:
df["Officer Squad"].value_counts(normalize=True)

TRAINING - FIELD TRAINING SQUAD                   0.107145
WEST PCT 1ST W - DAVID - PLATOON 1                0.027758
WEST PCT 3RD W - KING - PLATOON 1                 0.022470
WEST PCT 1ST W - KING - PLATOON 1                 0.021266
SOUTHWEST PCT 2ND W - FRANK - PLATOON 2           0.020811
                                                    ...   
SOUTH PCT 2ND W - ROBERT                          0.000016
NORTH PCT 3RD W - LINCOLN                         0.000016
TRAINING - ADVANCED UNIT ADMINISTRATION           0.000016
COMM - INTERNET AND TELEPHONE REPORTING (ITRU)    0.000016
SOUTHWEST PCT OPS - BURG/THEFT                    0.000016
Name: Officer Squad, Length: 273, dtype: float64

The field training squad accounts for the largest share of Terry stops, at about 11%. The column has 273 distinct squads, many of which appear only once.
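
Columns such as Call Type and Officer Squad contain categories that occur only a handful of times. One possible way to handle them, different from the explicit list used for Call Type above, is to lump every category below a frequency threshold into a single label. The threshold `min_share` and the label `"OTHER"` below are assumptions chosen for illustration.

```python
import pandas as pd

# Invented sample: one frequent, one moderate and one rare category.
calls = pd.Series(["911"] * 6 + ["ONVIEW"] * 3 + ["TEXT MESSAGE"])

# Share of each row's category within the column.
share = calls.map(calls.value_counts(normalize=True))

min_share = 0.2  # hypothetical cut-off
lumped = calls.where(share >= min_share, "OTHER")
print(lumped.value_counts())
```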

In [28]:
# arrest flag
df["Arrest Flag"].value_counts(normalize=True)
# assigning 0 to no arrest made and 1 for arrest
df["Arrest Flag"]=df["Arrest Flag"].replace({"N":0,"Y":1})
df["Arrest Flag"].value_counts(normalize=True)

0    0.890008
1    0.109992
Name: Arrest Flag, dtype: float64

Only about 11% of Terry stops are flagged as arrests; about 89% are not. The target is therefore imbalanced.
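
With roughly 11% arrests, accuracy on its own can look strong even for a model that never predicts an arrest, which is why the F1 score is tracked alongside it. The sketch below computes both metrics from confusion-matrix counts for 1000 hypothetical stops; the counts are invented to match the observed arrest rate and are not model results.

```python
def accuracy_and_f1(tp, fp, fn, tn):
    """Accuracy and F1 for the positive class from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1

# 1000 hypothetical stops, 110 arrests, model always predicts "no arrest".
acc, f1 = accuracy_and_f1(tp=0, fp=0, fn=110, tn=890)
print(acc, f1)  # 0.89 0.0
```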

In [29]:
df["Frisk Flag"].value_counts()

N    46246
Y    14742
-      471
Name: Frisk Flag, dtype: int64

In [30]:
# replacing "-" with "UNKNOWN" and assigning the values 0,1,2 
df["Frisk Flag"] = df["Frisk Flag"].replace("-","Frisk Flag UNKNOWN")
df["Frisk Flag"] = df["Frisk Flag"].replace({"N": 0, "Y": 1, "Frisk Flag UNKNOWN": 2}).astype(int)
df["Frisk Flag"].value_counts().head()

0    46246
1    14742
2      471
Name: Frisk Flag, dtype: int64

In [31]:
#checking for precinct
df["Precinct"].value_counts(normalize=True)

West         0.279292
North        0.211230
-            0.171317
East         0.133992
South        0.121577
Southwest    0.077401
Unknown      0.003222
OOJ          0.001611
FK ERROR     0.000358
Name: Precinct, dtype: float64

In [32]:
#replacing "-" with unknown
df["Precinct"]=df["Precinct"].replace({"-":"precinct Unknown","Unknown":"precinct Unknown"})

The West precinct accounts for about 28% of all stops, followed by North (21%), unknown precinct (17%), East (13%) and South (12%). FK ERROR is the least frequent value.

In [33]:
df["Sector"].value_counts(normalize=True)
#replacing - with UNKNOWN
df["Sector"]=df["Sector"].replace("-","Sector UNKNOWN")
df["Sector"].value_counts()

Sector UNKNOWN    10678
K                  5837
M                  5259
E                  4364
N                  3646
D                  3485
F                  2857
R                  2846
B                  2785
Q                  2580
L                  2534
O                  2371
U                  2254
S                  2254
G                  2096
W                  1897
C                  1775
J                  1761
99                  129
OOJ                  51
Name: Sector, dtype: int64

Stops with an unknown sector form the largest single group (10678, about 17%), followed at a distance by sector K (5837); the remaining sectors follow each other closely. Sector OOJ has the fewest stops (51).
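
The same `"-"` placeholder replacement recurs for several columns. A small helper could apply it in one pass, as sketched below. The function name `label_unknown` and the `"<column> UNKNOWN"` naming pattern are assumptions for illustration; the notebook uses slightly different labels per column.

```python
import pandas as pd

def label_unknown(frame, columns, placeholder="-"):
    """Return a copy where `placeholder` becomes '<column> UNKNOWN' in each column."""
    out = frame.copy()
    for col in columns:
        out[col] = out[col].replace(placeholder, f"{col} UNKNOWN")
    return out

# Invented rows for illustration.
toy = pd.DataFrame({"Sector": ["K", "-"], "Beat": ["-", "K3"]})
cleaned = label_unknown(toy, ["Sector", "Beat"])
print(cleaned)
```

Working on a copy leaves the input frame unchanged, so the helper can be re-run safely.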


In [34]:
df["Beat"].value_counts(normalize=True)


-      0.173644
K3     0.055435
M3     0.041947
N3     0.029646
E2     0.029467
E1     0.023837
D1     0.022340
N2     0.022259
M2     0.022194
R2     0.021836
K2     0.021657
M1     0.021478
D2     0.021120
Q3     0.020013
F2     0.019281
K1     0.017882
E3     0.017670
B2     0.017312
U2     0.016954
B1     0.016873
O1     0.015604
L2     0.014286
S2     0.014270
F3     0.013668
L1     0.013603
F1     0.013537
L3     0.013342
R1     0.013261
D3     0.013245
W2     0.013147
G2     0.012626
U1     0.012382
O3     0.012317
Q2     0.012268
S3     0.012171
C1     0.011748
G3     0.011618
R3     0.011211
B3     0.011194
J3     0.010885
O2     0.010658
J1     0.010495
S1     0.010234
W1     0.010186
C3     0.010039
G1     0.009844
Q1     0.009698
W3     0.007550
N1     0.007387
U3     0.007338
J2     0.007273
C2     0.007110
99     0.002115
OOJ    0.000814
S      0.000033
Name: Beat, dtype: float64

In [35]:
df["Beat"]=df["Beat"].replace("-","Beat UNKNOWN")
df["Beat"].value_counts(normalize=True)

Beat UNKNOWN    0.173644
K3              0.055435
M3              0.041947
N3              0.029646
E2              0.029467
E1              0.023837
D1              0.022340
N2              0.022259
M2              0.022194
R2              0.021836
K2              0.021657
M1              0.021478
D2              0.021120
Q3              0.020013
F2              0.019281
K1              0.017882
E3              0.017670
B2              0.017312
U2              0.016954
B1              0.016873
O1              0.015604
L2              0.014286
S2              0.014270
F3              0.013668
L1              0.013603
F1              0.013537
L3              0.013342
R1              0.013261
D3              0.013245
W2              0.013147
G2              0.012626
U1              0.012382
O3              0.012317
Q2              0.012268
S3              0.012171
C1              0.011748
G3              0.011618
R3              0.011211
B3              0.011194
J3              0.010885


In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61459 entries, 0 to 62019
Data columns (total 28 columns):
 #   Column                    Non-Null Count  Dtype              
---  ------                    --------------  -----              
 0   Subject Age Group         61459 non-null  object             
 1   Subject ID                61459 non-null  int64              
 2   GO / SC Num               61459 non-null  int64              
 3   Terry Stop ID             61459 non-null  int64              
 4   Stop Resolution           61459 non-null  object             
 5   Weapon Type               61459 non-null  object             
 6   Officer ID                61459 non-null  object             
 7   Officer YOB               61459 non-null  int64              
 8   Officer Gender            61459 non-null  object             
 9   Officer Race              61459 non-null  object             
 10  Subject Perceived Race    61459 non-null  object             
 11  Subject Perceiv

In [37]:
#dropping identifiers, race and gender columns, the raw date and time columns, and reported_Day
df=df.drop(columns=["Subject ID","Subject Perceived Race","Subject Perceived Gender","Terry Stop ID","Officer ID","Officer Gender","Officer Race","Reported Time","Reported Date","reported_Day"])

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61459 entries, 0 to 62019
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   Subject Age Group  61459 non-null  object
 1   GO / SC Num        61459 non-null  int64 
 2   Stop Resolution    61459 non-null  object
 3   Weapon Type        61459 non-null  object
 4   Officer YOB        61459 non-null  int64 
 5   Initial Call Type  61459 non-null  object
 6   Final Call Type    61459 non-null  object
 7   Call Type          61459 non-null  object
 8   Officer Squad      61459 non-null  object
 9   Arrest Flag        61459 non-null  int64 
 10  Frisk Flag         61459 non-null  int64 
 11  Precinct           61459 non-null  object
 12  Sector             61459 non-null  object
 13  Beat               61459 non-null  object
 14  reported_Year      61459 non-null  int64 
 15  reported_Month     61459 non-null  int64 
 16  day_of_week        61459 non-null  int64

The final dataframe contains 8 columns with integer values and 10 of type object, for a total of 18 columns. There are no missing values or duplicates in the final dataset. This data is going to be used for analysis.
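
Column counts by type can be checked programmatically rather than read off the `info()` output. A minimal sketch on an invented frame, using `select_dtypes`:

```python
import pandas as pd

# Toy frame with two numeric columns and one text column.
toy = pd.DataFrame({
    "Arrest Flag": [0, 1],
    "Frisk Flag": [0, 2],
    "Precinct": ["West", "East"],
})

n_numeric = toy.select_dtypes(include="number").shape[1]
n_other = toy.select_dtypes(exclude="number").shape[1]
print(n_numeric, n_other, toy.shape[1])
```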

In [40]:
df.to_csv("EDA_data.csv")
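
By default `to_csv` also writes the row index, so reading the file back yields an extra `Unnamed: 0` column unless the index is skipped on write or restored on read. The sketch below demonstrates this with an in-memory buffer instead of a file; the toy frame is invented.

```python
import io
import pandas as pd

# Toy frame with a gap in the index, as happens after dropping rows.
toy = pd.DataFrame({"Arrest Flag": [0, 1]}, index=[0, 2])

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)   # ['Unnamed: 0', 'Arrest Flag']
print(cols_no_index)  # ['Arrest Flag']
```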