# Introduction
This notebook is the result of a master's project in supervised learning.  Given a free choice of topic, I initially leaned towards climate-change-oriented work.  However, I stumbled upon the stop and frisk questionnaire dataset while browsing the NYC Open Data site.  After reviewing some of the earlier analyses from 2016, I found this line of inquiry far more compelling from a social impact point of view.  

An initial pass over the data with pivot tables left me shocked by the disproportionate number of stops of black children and of black individuals at large.  

To anti-racist readers, "racially biased policing" is almost a redundant phrase, so I'm not uncovering anything new here.  

But are these findings statistically significant?  Is this dataset truly representative of stop and frisk? Is there anything here in the data that might improve the lives of our NYC neighbors?  I may be asking these questions (and assuming their answers) as an average NYC resident, but I hope to answer them rigorously as a data scientist and as a machine learning engineer.  

As such, I decided to focus on a 2024 exploration of the NYPD-generated dataset, both to find latent truths hidden in the data and to attempt to train a predictive model that helps hold the NYPD accountable and, more importantly, provides tooling to aid those who are unjustly targeted.

# EDA (Exploratory Data Analysis)
EDA is the process of exploring a dataset, finding and cleaning outliers, and ultimately preparing the data for ML model training.  Compared with many public/open datasets, this one is fairly clean and consistent, possibly because of legal oversight.  

Thankfully, that lets me focus more on analyzing the data than on fixing and cleaning it.
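As an illustration of the outlier step mentioned above, here is a minimal sketch of an interquartile-range (IQR) check.  The helper name `iqr_outliers`, the multiplier `k=1.5`, and the sample ages are all assumptions made for the example; none of them come from the dataset or from the cleaning code in this notebook.

```python
import pandas as pd

def iqr_outliers(series, k=1.5):
    """Boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up ages for illustration; 120 stands in for a likely data-entry error.
ages = pd.Series([27, 22, 17, 13, 14, 31, 25, 19, 120])
mask = iqr_outliers(ages)
print(ages[mask].tolist())
```

A rule like this only flags candidates; whether a flagged value is an error or a genuine extreme still needs a manual look.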

In [2]:
# Import all the things...
import numpy as np 
import pandas as pd
from matplotlib import pyplot as pyplot
import seaborn as sea
import torch
import math


In [3]:
raw_df = pd.read_csv('./data/sqf-2024.csv')
raw_df.head()


Unnamed: 0,STOP_ID,STOP_FRISK_DATE,STOP_FRISK_TIME,YEAR2,MONTH2,DAY2,STOP_WAS_INITIATED,ISSUING_OFFICER_RANK,ISSUING_OFFICER_COMMAND_CODE,SUPERVISING_OFFICER_RANK,...,SUSPECT_OTHER_DESCRIPTION,STOP_LOCATION_PRECINCT,STOP_LOCATION_SECTOR_CODE,STOP_LOCATION_APARTMENT,STOP_LOCATION_FULL_ADDRESS,STOP_LOCATION_STREET_NAME,STOP_LOCATION_X,STOP_LOCATION_Y,STOP_LOCATION_PATROL_BORO_NAME,STOP_LOCATION_BORO_NAME
0,279772561,2024-01-01,01:58:00,2024,January,Monday,Based on Self Initiated,PO,46,SGT,...,(null),46,A,(null),1775 CLAY AVE,CLAY AVE,1010576,247603,PBBX,BRONX
1,279772564,2024-01-01,00:48:00,2024,January,Monday,Based on Self Initiated,PO,120,SGT,...,BLACK HOODIE SWEATSHIRT,67,D,(null),4515 FARRAGUT RD,FARRAGUT RD,1002798,171482,PBBS,BROOKLYN
2,279772565,2024-01-01,01:10:00,2024,January,Monday,Based on Radio Run,PO,871,SGT,...,SCAR ON LIP,68,D,(null),&&,,977764,170616,PBBS,BROOKLYN
3,279772566,2024-01-01,01:10:00,2024,January,Monday,Based on Radio Run,PO,871,SGT,...,RED JACKET/ RED HAT,68,D,(null),&&,,977764,170616,PBBS,BROOKLYN
4,279772567,2024-01-01,01:10:00,2024,January,Monday,Based on Radio Run,PO,871,SGT,...,BLACK JACKET,68,D,(null),&&,,977764,170616,PBBS,BROOKLYN


## And maybe I spoke too soon
The dataset turns out to be pretty messy under the hood when it comes to specific datatypes.  That is relatively fine for human consumption, but it will throw our model for a loop during training.  As such, I'll first trim down the number of columns based on sparsity or irrelevance, and then group the remaining columns by datatype so each group can be standardized uniformly.
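One way to see the datatype problem is to measure how often placeholder strings show up in non-numeric columns.  The sketch below is only illustrative: the helper `placeholder_report`, the `NULL_STRINGS` list, and the toy frame are assumptions for the example, with the toy values loosely modelled on the first rows shown above.

```python
import pandas as pd

# Placeholder strings treated as missing (an assumed list; extend as needed).
NULL_STRINGS = ['(null)', '#N/A', 'NA', '?', '', '&&']

def placeholder_report(df, null_strings=NULL_STRINGS):
    """Share of placeholder strings in each non-numeric column, largest first."""
    shares = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # already numeric, so it cannot contain placeholder strings
        stripped = df[col].astype(str).str.strip()
        shares[col] = stripped.isin(null_strings).mean()
    return pd.Series(shares, dtype=float).sort_values(ascending=False)

# Toy frame standing in for raw_df (values are made up for illustration).
toy = pd.DataFrame({
    'SUSPECT_WEIGHT': ['160', '(null)', '140', '120'],
    'STOP_LOCATION_FULL_ADDRESS': ['1775 CLAY AVE', '&&', '&&', '&&'],
    'YEAR2': [2024, 2024, 2024, 2024],
})
report = placeholder_report(toy)
print(report)
```

Columns that are meant to be numeric but appear in this report are the ones that need coercion before training.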

In [68]:
# A number of columns should be integer-based but also contain null strings.  We'll clean those up if they make the final cut...
# broken_columns = [63,72,77,78]
# for v in broken_columns:
#     print(raw_df.columns[v])


# I'll come back to these columns later; they will be transformed/one-hot encoded for training
# I'm omitting any background or suspect-action columns, as those values should show up within the reason for the stop field, e.g.
# I will skip SUSPECTS_ACTIONS_CONCEALED_POSSESSION_WEAPON_FLAG because the corresponding rows already have a criminal possession of a weapon reason

categorical_columns_to_be_encoded = ['STOP_WAS_INITIATED']

column_categories = {
    'numeric': ['STOP_ID', 'YEAR2', 'OBSERVED_DURATION_MINUTES', 'STOP_DURATION_MINUTES', 'SUSPECT_REPORTED_AGE', 'SUSPECT_WEIGHT', 'STOP_LOCATION_X',
                'STOP_LOCATION_Y'],
    'string': ['MONTH2', 'STOP_WAS_INITIATED', 'ISSUING_OFFICER_RANK', 'SUPERVISING_OFFICER_RANK', 'JURISDICTION_DESCRIPTION', 'SUSPECT_ARREST_OFFENSE',
               'SUMMONS_OFFENSE_DESCRIPTION', 'SUSPECTED_CRIME_DESCRIPTION', 'DEMEANOR_OF_PERSON_STOPPED', 'SUSPECT_SEX',
               'SUSPECT_RACE_DESCRIPTION', 'SUSPECT_BODY_BUILD_TYPE', 'SUSPECT_OTHER_DESCRIPTION', 'STOP_LOCATION_BORO_NAME'],
    'datetime': ['STOP_FRISK_DATE', 'STOP_FRISK_TIME'],
    'boolean': ['OFFICER_EXPLAINED_STOP_FLAG', 'OTHER_PERSON_STOPPED_FLAG', 'SUSPECT_ARRESTED_FLAG',
               'SUMMONS_ISSUED_FLAG', 'OFFICER_IN_UNIFORM_FLAG', 'FRISKED_FLAG',
               'SEARCHED_FLAG', 'ASK_FOR_CONSENT_FLG', 'CONSENT_GIVEN_FLG', 'OTHER_CONTRABAND_FLAG', 'FIREARM_FLAG', 'KNIFE_CUTTER_FLAG',
               'OTHER_WEAPON_FLAG', 'WEAPON_FOUND_FLAG', 'PHYSICAL_FORCE_CEW_FLAG', 'PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG',
               'PHYSICAL_FORCE_OC_SPRAY_USED_FLAG', 'PHYSICAL_FORCE_OTHER_FLAG', 'PHYSICAL_FORCE_RESTRAINT_USED_FLAG', 'PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG',
               'PHYSICAL_FORCE_WEAPON_IMPACT_FLAG', 'SEARCH_BASIS_ADMISSION_FLAG', 'SEARCH_BASIS_CONSENT_FLAG', 'SEARCH_BASIS_HARD_OBJECT_FLAG',
               'SEARCH_BASIS_INCIDENTAL_TO_ARREST_FLAG', 'SEARCH_BASIS_OTHER_FLAG', 'SEARCH_BASIS_OUTLINE_FLAG'],
    'mixed': ['SUSPECT_HEIGHT']
}
columns_to_keep = [column for category in column_categories.values() for column in category] 

null_strings = ['(null)','#N/A', 'NA', '?', '', ' ', '&&', 'nan']

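def clean_numeric(df, cols):
    for col in cols:
        # Null strings become NaN, then anything else non-numeric is coerced to NaN as well
        numeric = pd.to_numeric(df[col].replace(null_strings, np.nan), errors='coerce')
        # Round first: casting to Int64 raises if a fractional value is present
        df[col] = numeric.round().astype('Int64')  # Nullable integer
    return df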

def clean_string(df, cols):
    for col in cols:
        df[col] = (
            df[col]
            .astype(str)
            .str.strip()  # Strip first so padded placeholders such as ' (null) ' are still caught
            .replace(null_strings, np.nan)
        )
    return df

def clean_boolean(df, cols):
    pd.set_option('future.no_silent_downcasting', True) # Future proofing
    for col in cols:
        # Flag fields should only hold two distinct values (Y plus either N or a null string, never both).
        # Print any column that breaks that expectation so it can be reviewed.
        if len(df[col].value_counts()) != 2:
            print(col, df[col].value_counts()) 
        df[col] = (
            df[col]
            .astype(str)
            .str.strip()  # Remove leading/trailing whitespace
            .replace(null_strings, np.nan)  # Keep nulls as missing (<NA>) rather than assuming False
            .replace({'Y': True, 'N': False, '1': True, '0': False}).infer_objects(copy=False)
            .astype('boolean')  # Pandas' nullable boolean
        )
        # print(col, df[col].value_counts())
    return df

def clean_datetime(df, cols):
    # Formats match the raw values shown above, e.g. '2024-01-01' and '01:58:00'
    date_formats = {
        'STOP_FRISK_DATE': '%Y-%m-%d',  # YYYY-MM-DD
        'STOP_FRISK_TIME': '%H:%M:%S',  # HH:MM:SS (parsed with the default date 1900-01-01)
    }
    for col in cols:
        df[col] = pd.to_datetime(df[col], format=date_formats.get(col), errors='coerce')
    return df

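def clean_height(df, cols, null_strings=null_strings):
    def to_cm(parts):
        # parts[0] is feet; parts[1] is inches and may be missing (e.g. for '6')
        if pd.isna(parts[0]):
            return np.nan
        inches = int(parts[1]) if pd.notna(parts[1]) else 0
        return (int(parts[0]) * 30.48) + (inches * 2.54)

    for col in cols:
        df[col] = (
            df[col]
            .astype(str)
            .str.strip()
            .replace(null_strings, np.nan)
            .str.extract(r'^(\d+)\.?(\d+)?$')  # Extract feet and inches
            .apply(to_cm, axis=1)
            .round()
            .astype('Int64')
        )
    return df    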

def clean_data(df, categories, columns_to_keep=columns_to_keep, null_strings=null_strings):
    new_df = df[columns_to_keep].copy()
    new_df = clean_datetime(new_df, categories['datetime'])
    new_df = clean_string(new_df, categories['string'])
    new_df = clean_numeric(new_df, categories['numeric'])
    new_df = clean_boolean(new_df, categories['boolean'])
    new_df = clean_height(new_df, categories['mixed'])
    return new_df
cleaned_df = clean_data(raw_df, column_categories)
cleaned_df.head()




ASK_FOR_CONSENT_FLG ASK_FOR_CONSENT_FLG
N         21924
Y          2831
(null)      631
Name: count, dtype: int64
CONSENT_GIVEN_FLG CONSENT_GIVEN_FLG
N         13697
(null)     8577
Y          3112
Name: count, dtype: int64


Unnamed: 0,STOP_ID,YEAR2,OBSERVED_DURATION_MINUTES,STOP_DURATION_MINUTES,SUSPECT_REPORTED_AGE,SUSPECT_WEIGHT,STOP_LOCATION_X,STOP_LOCATION_Y,MONTH2,STOP_WAS_INITIATED,...,PHYSICAL_FORCE_RESTRAINT_USED_FLAG,PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG,PHYSICAL_FORCE_WEAPON_IMPACT_FLAG,SEARCH_BASIS_ADMISSION_FLAG,SEARCH_BASIS_CONSENT_FLAG,SEARCH_BASIS_HARD_OBJECT_FLAG,SEARCH_BASIS_INCIDENTAL_TO_ARREST_FLAG,SEARCH_BASIS_OTHER_FLAG,SEARCH_BASIS_OUTLINE_FLAG,SUSPECT_HEIGHT
0,279772561,2024,1,1,27,160,1010576,247603,January,Based on Self Initiated,...,,True,,,,True,True,,,180
1,279772564,2024,1,3,22,200,1002798,171482,January,Based on Self Initiated,...,,True,,,,True,,,,185
2,279772565,2024,0,20,17,140,977764,170616,January,Based on Radio Run,...,,True,,,,,,,,183
3,279772566,2024,0,20,13,120,977764,170616,January,Based on Radio Run,...,,True,,,,,,,,152
4,279772567,2024,0,20,14,120,977764,170616,January,Based on Radio Run,...,,True,,,,,,,,152
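The `SUSPECT_HEIGHT` values in the output above are in centimetres.  As a worked example of that conversion, the sketch below reads a value such as `5.11` as 5 feet 11 inches, following the "feet and inches" comment in `clean_height`.  The standalone helper `feet_inches_to_cm` is hypothetical, and how a value like `5.1` should be read (one inch, or ten with a lost trailing zero) is an assumption the data alone cannot settle.

```python
def feet_inches_to_cm(value):
    """Convert a 'feet.inches' string such as '5.11' to whole centimetres."""
    feet, _, inches = str(value).strip().partition('.')
    if not feet.isdigit() or (inches and not inches.isdigit()):
        return None  # not a recognisable height, e.g. '(null)'
    total_cm = int(feet) * 30.48 + (int(inches) if inches else 0) * 2.54
    return round(total_cm)

for raw in ['5.11', '6.1', '6', '(null)']:
    print(raw, '->', feet_inches_to_cm(raw))
```

Under that reading, `5.11` works out to 152.4 + 27.94 = 180.34, which rounds to 180 cm.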


## Ongoing Assumptions
Cleaning the data has been more laborious than expected, so I want to call out my subjective modifications explicitly.  In particular, I'm omitting a number of columns up front, either because of sparsity (i.e. not enough data) or because they are highly correlated with or duplicate other columns (WEAPON_FOUND_FLAG overlaps with the various arrest reason flags like SUSPECTS_ACTIONS_CONCEALED_POSSESSION_WEAPON_FLAG), which would throw off future training.
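A quick way to check whether two flags really are near-duplicates is to cross-tabulate them.  This is a minimal sketch on made-up data: the helper `flag_overlap` is hypothetical, and the toy values say nothing about the real overlap between these two columns.

```python
import pandas as pd

def flag_overlap(df, flag_a, flag_b):
    """Cross-tabulate two boolean flags and report the share of rows where they agree."""
    table = pd.crosstab(df[flag_a], df[flag_b])
    agreement = (df[flag_a] == df[flag_b]).mean()
    return table, agreement

# Toy data (made up); the real check would run on the cleaned dataframe.
toy = pd.DataFrame({
    'WEAPON_FOUND_FLAG': [True, True, False, False, False, True],
    'SUSPECTS_ACTIONS_CONCEALED_POSSESSION_WEAPON_FLAG': [True, True, False, False, True, True],
})
table, agreement = flag_overlap(
    toy, 'WEAPON_FOUND_FLAG', 'SUSPECTS_ACTIONS_CONCEALED_POSSESSION_WEAPON_FLAG'
)
print(table)
print(f"agreement: {agreement:.2f}")
```

An agreement rate close to 1.0 would support dropping one of the two columns; a lower rate suggests they carry different information.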

Similarly, a few of the boolean/flag fields actually contain three distinct values rather than the expected two, because null strings appear alongside Y and N.  Initially I was going to treat every null as false, but I felt that would alter the data too much for too little benefit, given what these fields represent.  For transparency, the flags are ASK_FOR_CONSENT_FLG and CONSENT_GIVEN_FLG, with null values representing ~2.49% and ~33.79% of rows respectively.

In [70]:
cleaned_df['DEMEANOR_OF_PERSON_STOPPED'].value_counts()
# Null counts taken from the value_counts output above: 631 and 8577
round(631 / len(cleaned_df) * 100, 2), round(8577 / len(cleaned_df) * 100, 2)

(2.49, 33.79)
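The hard-coded counts in the cell above can also be derived directly from a column, which avoids copying the wrong number by hand.  This is a sketch under assumptions: the helper `value_shares` is hypothetical, and the series is rebuilt from the ASK_FOR_CONSENT_FLG counts printed earlier rather than read from the real dataframe.

```python
import pandas as pd

def value_shares(series, null_strings=('(null)',)):
    """Percentage of each distinct value, with placeholder strings grouped as 'missing'."""
    labelled = series.astype(str).str.strip().replace(list(null_strings), 'missing')
    return (labelled.value_counts(normalize=True) * 100).round(2)

# Rebuild a column from the counts printed for ASK_FOR_CONSENT_FLG.
ask = pd.Series(['N'] * 21924 + ['Y'] * 2831 + ['(null)'] * 631)
shares = value_shares(ask)
print(shares)
```

Running the same helper on the raw CONSENT_GIVEN_FLG column would give the null share for that flag without any manual arithmetic.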

In [None]:
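# Treat placeholder strings such as '(null)' as missing; isnull() alone only sees real NaN values
normalized_df = raw_df.replace(null_strings, np.nan)
sparsity_report = pd.DataFrame({
    'column': normalized_df.columns,
    'null_rate': normalized_df.isnull().mean(),
    'unique_values': normalized_df.nunique()
}).sort_values('null_rate', ascending=False)

print(sparsity_report.head(20))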

# References
1. [NYCLU Stop and Frisk Dataset](https://www.nyclu.org/data/stop-and-frisk-data)
    1. [NYCLU Stop and Frisk Analysis regarding children](https://www.nyclu.org/data/closer-look-stop-and-frisk-nyc)