#  Overview

The Terry Stops problem aims to predict the outcome of police stops based on reasonable suspicion using a classification model. The model considers factors such as the presence of weapons, the time of day, and possibly the gender and race of both the officer and the subject. However, using race and gender data raises ethical concerns, so the model must be built and used in a way that avoids bias and discrimination. The goal of this model is to improve the efficiency and fairness of law enforcement actions, but agencies must also monitor and address any potential biases.

# 1. Business Understanding


## 1.1. Problem
Terry Stops present an opportunity to improve the efficiency and fairness of law enforcement actions. A predictive model that estimates the likelihood of an arrest being made during a Terry Stop can help law enforcement agencies make informed decisions and potentially reduce the number of false arrests and incidents of police misconduct. However, it is important to approach this problem with caution and transparency, given the ethical concerns raised by the use of gender and race data. The goal is to provide a tool that helps improve policing while avoiding bias and discrimination.

## 1.2 Aim

The aim of this project is to build a classifier that can predict the outcome of a Terry Stop (whether an arrest was made or not) based on reasonable suspicion. This will be done by considering various factors such as the presence of weapons, time of day of the call, and other relevant information. The model will be designed to address the binary classification problem, with the goal of improving the efficiency and fairness of law enforcement actions.

## 1.3. Objectives
* To create a predictive model for Terry Stops that accurately predicts the outcome of the stop (arrest made or not)
* To take into consideration key factors such as the presence of weapons and the time of the call in the model
* To ensure that the model is ethically sound and avoids any biases or discrimination related to gender and race.
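
The third objective can be checked numerically once a model produces predictions. The sketch below is only an illustration: the group labels, the `actual` and `predicted` columns, and their values are all made up, and comparing per-group prediction rates is one possible check among many, not a method prescribed by the dataset or by this project.

```python
import pandas as pd

# Hypothetical audit table: group labels and predictions are invented for illustration.
audit = pd.DataFrame({
    "Subject Perceived Race": ["A", "A", "A", "B", "B", "B"],
    "actual":    [1, 0, 0, 1, 0, 0],   # 1 = arrest actually made
    "predicted": [1, 1, 0, 1, 0, 0],   # 1 = model predicts an arrest
})

# Share of stops predicted as arrests, per group.
predicted_rate = audit.groupby("Subject Perceived Race")["predicted"].mean()

# False positive rate per group: among stops with no arrest,
# the share the model still predicted as arrests.
no_arrest = audit[audit["actual"] == 0]
false_positive_rate = no_arrest.groupby("Subject Perceived Race")["predicted"].mean()

# A large gap between groups is a signal to investigate, not proof of bias.
rate_gap = predicted_rate.max() - predicted_rate.min()

print(predicted_rate)
print(false_positive_rate)
print(f"Gap in predicted arrest rate: {rate_gap:.2f}")
```

If the gap is large, the features and training data deserve a closer look before the model is used for anything.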

# 2. Data Understanding

## 2.1 Data Understanding
This dataset was provided by the City of Seattle and is managed by the Seattle Police Department. It was created on April 13, 2017 and last updated on February 6, 2023. The dataset contains **54873** rows and **23** columns, each row representing a unique Terry Stop record as reported by the officer conducting the stop. The columns in the dataset include information about the subject of the stop, such as the perceived age group, perceived race, and perceived gender.

The dataset also includes information about the officer, such as the officer's gender, race, and year of birth. Additionally, the dataset includes information about the resolution of the stop, any weapons found, the date and time the stop was reported, and information about the underlying Computer Aided Dispatch (CAD) event. The data is updated daily and is licensed under the public domain.

# 3. Requirements

* Data Preparation -> Loading Libraries -> Loading data -> Descriptive Exploration -> Data Cleaning -> Exploratory Descriptive Analysis (EDA) -> Pre-processing Data

* Modelling -> Train test split -> Logistic Regression -> K-Nearest Neighbors -> Decision Tree -> Random Forest
    
* Evaluation -> Classification Metrics -> Best Performing Model

* Conclusion -> Best Model
    
* Recommendation -> Most important features

# 4. Data Preparation

* Update the Stop Resolution column to either be arrested (1) or not arrested (0)
* Change the date column to datetime so we can work with it, and add the month as a new column
* Group weapons into firearms vs. non-firearms vs. no weapon
* Change Officer year of birth to give the officer age
* Drop columns that we are not going to need

* Converting categorical data to numeric format through label encoder
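
Several of the steps above can be sketched on a tiny hand-made frame. Everything specific here is an assumption for illustration: the sample values, the new column names (`Reported Month`, `Weapon Group`, `Officer Age`, `Officer Gender Code`), the firearm keyword list, and the choice to measure officer age at the report date. Integer codes come from pandas categories as a dependency-free stand-in for a label encoder.

```python
import pandas as pd

# Hypothetical rows; column names follow the data dictionary, values are made up.
stops = pd.DataFrame({
    "Reported Date": ["2019-03-14", "2021-11-02", "2017-07-30"],
    "Weapon Type": ["None", "Handgun", "Knife/Cutting/Stabbing Instrument"],
    "Officer YOB": [1985, 1990, 1972],
    "Officer Gender": ["M", "F", "M"],
})

# Parse the date column and add the month as a new column.
stops["Reported Date"] = pd.to_datetime(stops["Reported Date"])
stops["Reported Month"] = stops["Reported Date"].dt.month

# Keywords that mark a firearm (assumed list, adjust to the real categories).
FIREARM_KEYWORDS = ("firearm", "handgun", "rifle", "shotgun")

def weapon_group(value):
    """Map a raw weapon description to one of three groups."""
    text = str(value).strip().lower()
    if text in ("none", "-", "nan", ""):
        return "No weapon"
    if any(keyword in text for keyword in FIREARM_KEYWORDS):
        return "Firearm"
    return "Non-firearm"

stops["Weapon Group"] = stops["Weapon Type"].map(weapon_group)

# Officer age at the time of the report (assumed definition of "officer age").
stops["Officer Age"] = stops["Reported Date"].dt.year - stops["Officer YOB"]

# Integer codes for a categorical column; categories are sorted alphabetically.
stops["Officer Gender Code"] = stops["Officer Gender"].astype("category").cat.codes

print(stops)
```

Keeping the grouping logic in a small function makes it easy to revise once the real weapon categories have been inspected.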

i) Loading Libraries -> 
ii) Loading data -> 
iii) Descriptive Exploration -> 
iv) Data Cleaning -> 
v) Exploratory Descriptive Analysis (EDA) -> 
vi) Pre-processing Data

### 4.1. Loading Libraries

In [None]:
# import relevant libraries
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

### 4.2. Loading Data

In [None]:
# read the csv file to pandas data frame
Tery_stops_df = pd.read_csv("data/Terry_Stops.csv")

# preview the first 3 rows
Tery_stops_df.head(3)

#### The 23 columns, with a concise explanation of the information contained in each:

**Subject Age Group:** Subject Age Group (10 year increments) as reported by the officer.

**Subject ID:** Key, generated daily, identifying unique subjects in the dataset using a character to character match of first name and last name. "Null" values indicate an "anonymous" or "unidentified" subject. Subjects of a Terry Stop are not required to present identification.

**GO/SC Num:** General Offense or Street Check number, relating the Terry Stop to the parent report. This field may have a one to many relationship in the data.

**Terry Stop ID:** Key identifying unique Terry Stop reports.

**Stop Resolution:** Resolution of the stop as reported by the officer.

**Weapon Type:** Type of weapon, if any, identified during a search or frisk of the subject. Indicates "None" if no weapon was found.

**Officer ID:** Key identifying unique officers in the dataset.

**Officer YOB:** Year of birth, as reported by the officer.

**Officer Gender:** Gender of the officer, as reported by the officer.

**Officer Race:** Race of the officer, as reported by the officer.

**Subject Perceived Race:** Perceived race of the subject, as reported by the officer.

**Subject Perceived Gender:** Perceived gender of the subject, as reported by the officer.

**Reported Date:** Date the report was filed in the Records Management System (RMS). Not necessarily the date the stop occurred but generally within 1 day.

**Reported Time:** Time the stop was reported in the Records Management System (RMS). Not the time the stop occurred but generally within 10 hours.

**Initial Call Type:** Initial classification of the call as assigned by 911.

**Final Call Type:** Final classification of the call as assigned by the primary officer closing the event.

**Call Type:** How the call was received by the communication center.

**Officer Squad:** Functional squad assignment (not budget) of the officer as reported by the Data Analytics Platform (DAP).

**Arrest Flag:** Indicator of whether a "physical arrest" was made, of the subject, during the Terry Stop. Does not necessarily reflect a report of an arrest in the Records Management System (RMS).

**Frisk Flag:** Indicator of whether a "frisk" was conducted, by the officer, of the subject, during the Terry Stop.

**Precinct:** Precinct of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.

**Sector:** Sector of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.

**Beat:** Beat of the address associated with the underlying Computer Aided Dispatch (CAD) event. Not necessarily where the Terry Stop occurred.




### 4.3.  Descriptive Exploration
This step summarizes the characteristics of the dataset.

In [None]:
# data shape
print(f"This dataset has {Tery_stops_df.shape[0]} rows and {Tery_stops_df.shape[1]} columns")

#### 4.3.1.  What are the datatype of columns ?

In [None]:
# check number of categorical and numerical columns
def columns_dtypes(df):
    num = len(df.select_dtypes(include=np.number).columns)
    cat = len(df.select_dtypes(include='object').columns)
    print(f"Numerical columns: {num}")
    print(f"Categorical columns: {cat}")
    
# call the function  
columns_dtypes(Tery_stops_df)

#### 4.3.2. Data set description

In [None]:
# describe the dataset
Tery_stops_df.describe().T

>Most of the figures are too large to give a clear description, but this is before any scaling.

#### 4.3.3 Target column distribution


In [None]:
def plot_value_counts(df, col_name):
    # Count how often each unique value occurs in the given column
    value_counts = df[col_name].value_counts()

    # Plot the bar chart
    plt.figure(figsize=(12,8))
    plt.bar(value_counts.index, value_counts.values)

    # Label the x and y axis
    plt.title("Bar plot for the count of " + col_name, fontsize=20)
    plt.xlabel(col_name, fontsize=16)
    plt.ylabel("Count", fontsize=16)

    # Add grid
    plt.grid(True, linestyle='--')
    
    # Show the plot
    plt.show()

# call the function    
plot_value_counts(Tery_stops_df,"Stop Resolution")

>Field Contact dominates the stop resolutions as reported by the officers.

#### 4.3.4 Categorical Columns

In [None]:
# function to describe categorical columns
def count_plot(df, x_col, y_col):
    # Plot the count plot
    plt.figure(figsize=(12,8))
    sns.countplot(x=x_col, hue=y_col, data=df)

    # Label the x and y axis
    plt.title("Count plot of " + x_col + " and " + y_col, fontsize=20)
    plt.xlabel(x_col, fontsize=16)
    plt.ylabel("Count", fontsize=16)

    # Show the plot
    plt.show()

# call the function 
count_plot(Tery_stops_df, "Subject Age Group", "Stop Resolution" )

In [None]:
count_plot(Tery_stops_df, "Stop Resolution","Officer Race")

### 4.4 Data Cleaning

Data cleaning identifies and corrects or removes inaccuracies, inconsistencies, and irrelevant data in a dataset. We start by handling missing values, removing duplicates, correcting data formats, and transforming variables to make the data ready for modelling and predictions.
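
The cleaning steps described above can be sketched on a small made-up frame. The sample values, the assumption that a bare `"-"` marks a missing entry, and the `"Unknown"` fill value are illustrative choices, not facts about the Seattle data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw rows; values are invented, column names follow the data dictionary.
raw = pd.DataFrame({
    "Subject Age Group": ["26 - 35", "-", "26 - 35", "36 - 45"],
    "Officer Squad": ["NORTH PCT", None, "NORTH PCT", "WEST PCT"],
    "Officer YOB": ["1985", "1990", "1985", "1972"],
})

# Treat a bare "-" as a missing value (assumption about the placeholder used).
cleaned = raw.replace("-", np.nan)

# Correct the data format: year of birth stored as text becomes an integer.
cleaned["Officer YOB"] = cleaned["Officer YOB"].astype(int)

# Fill missing categories that can reasonably be labelled "Unknown".
cleaned["Officer Squad"] = cleaned["Officer Squad"].fillna("Unknown")

# Remove exact duplicate rows, then rows still missing a key field.
cleaned = cleaned.drop_duplicates()
cleaned = cleaned.dropna(subset=["Subject Age Group"]).reset_index(drop=True)

print(cleaned)
```

Whether to fill or drop a missing value depends on the column; the split shown here is only one reasonable option.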

In [None]:
# check to see type of data
Tery_stops_df.info()

#### 4.4.1. Missing and Duplicate Values
A function to check for duplicate rows and missing values:

In [None]:
def check_duplicates_missing(dataframe):
    # percentage and count of missing values per column
    percent_missing = (dataframe.isnull().mean() * 100).round(2)
    count_missing = dataframe.isnull().sum()
    # percentage of fully duplicated rows (one number for the whole frame)
    percent_duplicates = dataframe.duplicated().mean() * 100
    # create result dataframe; the duplicate percentage is repeated on every row
    result = pd.DataFrame({'Missing Values %': percent_missing, 
                           'Missing Values Count': count_missing, 
                           'Duplicate Values %': percent_duplicates})
    # find column with most missing values
    if count_missing.max() != 0:
        column_most_missing = count_missing.idxmax()
        print(f"{column_most_missing} is the column with the most missing values.")
        print()
    else:
        print("No column with missing values")
    # duplicates are counted per row, so report the overall share
    if percent_duplicates != 0:
        print(f"Duplicate rows: {percent_duplicates:.2f}%")
    else:
        print("No duplicates")
    return result

check_duplicates_missing(Tery_stops_df)


In [None]:
null_data = Tery_stops_df[Tery_stops_df.isnull().any(axis=1)]
#null_data

#### Update the Stop Resolution column to binary

In [None]:
# numerical columns
num_cols = Tery_stops_df.select_dtypes(include= np.number)

# categorical columns
cat_cols = Tery_stops_df.select_dtypes(include= 'object')
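
The binary update named in the heading above could look like the sketch below. It assumes that an arrest is recorded with the exact label `"Arrest"` in `Stop Resolution`; the sample rows and the new column name `Arrested` are hypothetical, so check the real labels with `value_counts()` first.

```python
import pandas as pd

# Hypothetical sample; the resolution labels are assumed, not taken from the data.
sample = pd.DataFrame({
    "Stop Resolution": ["Field Contact", "Arrest", "Offense Report", "Arrest"],
})

# Label that counts as an arrest (assumption).
ARREST_LABEL = "Arrest"

# 1 if the stop ended in an arrest, 0 for every other resolution.
sample["Arrested"] = (sample["Stop Resolution"] == ARREST_LABEL).astype(int)

print(sample["Arrested"].value_counts())
```

Writing the result to a new column keeps the original resolution available for later checks.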

# 5.0. Modelling

> * Is this a classification task?
> * What models will we try?
> * How do we deal with overfitting?
> * Do we need to use regularization or not?
> * What sort of validation strategy will we be using to check that our model works well on unseen data?
> * What loss functions will we use?
> * What threshold of performance do we consider as successful?
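
One possible way to approach these questions is sketched below with scikit-learn on synthetic data, which stands in for the encoded Terry Stops features. The hyperparameters, the F1 scoring choice, and the 75/25 split are assumptions for illustration, not tuned or recommended values. The sketch uses a stratified hold-out set plus 5-fold cross-validation for validation, limits overfitting with a depth cap on the tree models and L2 regularization (`C`) in logistic regression, and relies on the log loss that logistic regression minimizes.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the prepared features and target.
X, y = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

# Hold out a stratified test set that is only used for the final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Candidate models; all settings are illustrative.
models = {
    "Logistic Regression": LogisticRegression(C=1.0, max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(
        n_estimators=100, max_depth=5, random_state=42
    ),
}

# Compare the models with 5-fold cross-validation on the training data only.
cv_f1 = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    cv_f1[name] = scores.mean()
    print(f"{name}: mean CV F1 = {cv_f1[name]:.3f}")

# Refit the best cross-validated model and score it once on the held-out test set.
best_name = max(cv_f1, key=cv_f1.get)
test_accuracy = models[best_name].fit(X_train, y_train).score(X_test, y_test)
print(f"Best model: {best_name}, test accuracy = {test_accuracy:.3f}")
```

F1 is used for the comparison because accuracy alone can look good on an imbalanced target; the success threshold itself remains a project decision.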

In [None]:
# How are our features distributed?
def dist_check(my_data):
    # plot a histogram with a KDE overlay for every numeric column
    num_cols = my_data.select_dtypes(include=["float64", "int64"]).columns
    ncols = len(num_cols)
    plt.figure(figsize=(20, int(22/4 * (ncols/4 + 1))))
    for i, col in enumerate(num_cols):
        plt.subplot(ncols//4 + 1, 4, i + 1)
        sns.histplot(my_data[col], kde=True)
        plt.title(col)
    plt.tight_layout()
    plt.show()

dist_check(Tery_stops_df)

In [None]:
columns_dtypes(Tery_stops_df)

In [None]:
def visualize_outliers_zscore(df, threshold=3):
    plt.rcParams["axes.grid"] = True
    numerical_cols = df.select_dtypes(include=["float64", "int64"]).columns
    for col in numerical_cols:
        # flag values more than `threshold` standard deviations from the mean
        mean = df[col].mean()
        std = df[col].std()
        z_scores = (df[col] - mean) / std
        outliers = df[np.abs(z_scores) > threshold]

        fig, ax = plt.subplots(figsize=(10, 6))
        # horizontal boxen plot of the column values
        sns.boxenplot(x=df[col], ax=ax)
        # overlay the flagged outliers on the value axis (the box sits at y=0)
        ax.scatter(x=outliers[col], y=np.zeros(len(outliers)), color='red', s=50, zorder=3)
        ax.set_title(f"Outliers in {col.title()}")
        ax.set_xlabel(col)
        plt.show()

visualize_outliers_zscore(Tery_stops_df)

In [None]:
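def corre_plot(df, col=None):
    plt.rcParams["axes.grid"] = True
    # correlate numeric columns only; text columns cannot be correlated
    corr = df.select_dtypes(include=np.number).corr()
    if col:
        # keep only strong correlations (|r| > 0.5) with the chosen column
        corr = corr[col].drop(col)
        corr = corr[abs(corr) > 0.5].sort_values(ascending=False).to_frame()
        mask = None
    else:
        # hide the upper triangle of the symmetric correlation matrix
        mask = np.triu(np.ones_like(corr, dtype=bool))
    plt.figure(figsize=(12, 10))
    sns.heatmap(corr, mask=mask, annot=True, cmap='coolwarm',linewidths=0.5)
    plt.title("Correlation", fontsize=18)
    # show plot
    plt.show()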