# Introduction
This notebook is the result of a master's project in supervised learning.  Given the choice of topic, I initially veered directly towards climate change oriented work.  However, I stumbled upon the stop and frisk questionnaire dataset while perusing the NYC Open Data site.  After comparing some of the previous analyses from 2016, I found this line of inquiry much more compelling from a social impact point of view.

On an initial perusal of the data via pivot tables, I was shocked by how disproportionately black children, and black individuals at large, are stopped.

Racially biased policing is a phrase that's almost redundant for anti-racist individuals, so I'm not uncovering anything new here.

But are these findings statistically significant?  Is this dataset truly representative of stop and frisk? Is there anything here in the data that might improve the lives of our NYC neighbors?  I may be asking these questions (and assuming their answers) as an average NYC resident, but I hope to answer them rigorously as a data scientist and as a machine learning engineer.  

As such, I decided to focus on a 2024 exploration of the NYPD-generated dataset to find latent truths hidden in the data and to attempt to train a predictive model that helps hold the NYPD accountable and, more importantly, provides tooling to aid those who are unjustly targeted.

# Unsupervised Learning Update
I split the previous notebook into a number of sub-notebooks, so pardon the dust while I refactor all of the context and code.

New UL features in this update:
- Expanding the dataset from just 2024 to 2021-24
- Adding geocoded neighborhoods to the dataset
- Moving much of the boilerplate code into various utility modules to streamline the notebooks
- Extracting new UL features via K-Means clustering and PCA dimensionality reduction
- SL model updates with UL findings

# What's in this notebook?
This is a pretty code-heavy notebook that contains all of the data cleaning and geocoding, and ultimately saves the data for later processing.

Most of the writeup is in subsequent notebooks.

# Data Processing for good and ill

In [2]:
# Import all the things...
import pandas as pd
from pathlib import Path
from utils.data_processing import (
    clean_data,
    null_strings,
    is_valid_child,
    column_categories,
    geocode_df,
    validate_neighborhood_data,
    prepare_additional_features,
    standardize_dates
)


In [3]:
# null_strings = ['(null)','#N/A', 'NA', '?', '', ' ', '&&', 'nan']

path = Path('./data/raw')
csv_files = path.glob('*.csv')

raw_df = pd.concat((pd.read_csv(f,dtype={
        'STOP_FRISK_DATE': 'object',  # Keep as string
        'STOP_FRISK_TIME': 'object'   # Keep as string
    },
    na_values=null_strings,  # Standardize nulls
    low_memory=False) for f in csv_files), ignore_index=True)
raw_df.head()
print(f"There are {len(raw_df)} records before cleaning")

There are 66406 records before cleaning


In [4]:
# Standardize the dates...
raw_df['STOP_FRISK_DATE'] = raw_df['STOP_FRISK_DATE'].apply(standardize_dates)
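
The `standardize_dates` utility lives in `utils.data_processing` and is not shown in this notebook. As a rough illustration of the idea only, here is a minimal sketch that assumes the raw files mix a few common date layouts. The function name `standardize_date_sketch` and the `CANDIDATE_FORMATS` list are hypothetical and are not the utility's actual implementation.

```python
from datetime import datetime

# Hypothetical list of layouts; the real raw files may use others.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y")

def standardize_date_sketch(value):
    """Return an ISO 'YYYY-MM-DD' string, or None if the value cannot be parsed."""
    # Treat None and float NaN (NaN != NaN) as missing.
    if value is None or (isinstance(value, float) and value != value):
        return None
    text = str(value).strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue  # try the next layout
    return None

examples = ["2021-01-01", "12/31/2022", "1/5/21", "not a date", None]
parsed = [standardize_date_sketch(v) for v in examples]
print(parsed)
```

Trying explicit formats in a fixed order keeps the month/day interpretation predictable, and returning None for anything unparseable keeps bad rows visible as missing values.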

## Ongoing Assumptions
Cleaning the data has been more laborious than expected, and I want to explicitly call out my subjective modifications.  In particular, I'm omitting a number of columns up front due to sparsity (i.e. not enough data) and highly correlated or duplicative values (WEAPON_FOUND_FLAG overlaps with the various arrest reason flags like SUSPECTS_ACTIONS_CONCEALED_POSSESSION_WEAPON_FLAG) that would throw off future training.

Similarly, a few of the boolean/flag fields actually contain three types of values rather than the expected two, because of null strings.  Initially I was going to assume a false value for all nulls here, but I felt that would alter the data too much for too little benefit given what these fields represent.  For transparency, the flags are ASK_FOR_CONSENT_FLG and CONSENT_GIVEN_FLG, with null values representing ~2.49% and ~12.26% respectively.

Conversely, there are a number of flag fields which only contain Y values, so I'll be assuming that those null/missing values are indeed false.

For clarity, all of the underlying code here is in the clean_data utility function.  Caveat emptor.
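
To make the two flag policies above concrete, here is a minimal sketch. It is not the code in `clean_data`: the helper name `normalize_flags` and the toy data are assumptions. Flags that only ever contain `Y` are read as True/False, while the tri-state consent flags keep their missing values in pandas' nullable `boolean` dtype.

```python
import pandas as pd

def normalize_flags(df, y_only_flags, tri_state_flags):
    """Map Y-only flags to booleans and keep tri-state flags nullable."""
    out = df.copy()
    for col in y_only_flags:
        # Only 'Y' ever appears, so a missing value is read as False.
        out[col] = out[col].eq("Y")
    for col in tri_state_flags:
        # 'Y'/'N' become True/False; missing stays <NA> instead of being guessed.
        out[col] = out[col].map({"Y": True, "N": False}).astype("boolean")
    return out

# Toy rows for illustration only.
toy = pd.DataFrame({
    "OTHER_WEAPON_FLAG": ["Y", None, None],
    "ASK_FOR_CONSENT_FLG": ["Y", "N", None],
})
flags = normalize_flags(toy, ["OTHER_WEAPON_FLAG"], ["ASK_FOR_CONSENT_FLG"])
print(flags)
```

Keeping the consent flags nullable leaves the decision of how to treat "unknown" to later modelling steps.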

## Validating the age with height
A portion of this data is still erroneous after all the basic cleaning above.  Here we validate the ages entered by officers against CDC-based correlations of age and height and filter out the outliers.
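
The actual check is `is_valid_child` in the utility module, which is not reproduced here. The sketch below only illustrates the shape such a row-wise filter might take. The function name `is_plausible_child`, the `HEIGHT_INCHES` column, and the height bounds are all hypothetical placeholders; in particular the bounds are illustrative numbers and not CDC figures.

```python
# Placeholder upper bounds (inches) by age band; illustrative only, NOT CDC figures.
MAX_PLAUSIBLE_HEIGHT_IN = {5: 50, 10: 63, 15: 76}

def is_plausible_child(row, age_col="SUSPECT_REPORTED_AGE", height_col="HEIGHT_INCHES"):
    """Return False only when a reported child is taller than the bound for their age band."""
    age, height = row.get(age_col), row.get(height_col)
    # Missing values are represented as None in this sketch.
    if age is None or height is None or age >= 18:
        return True  # adults and incomplete rows are not judged here
    # Use the smallest age band that covers the reported age.
    for band_age in sorted(MAX_PLAUSIBLE_HEIGHT_IN):
        if age <= band_age:
            return height <= MAX_PLAUSIBLE_HEIGHT_IN[band_age]
    return True  # ages 16-17 fall outside the placeholder table

rows = [
    {"SUSPECT_REPORTED_AGE": 4, "HEIGHT_INCHES": 72},   # implausible under the placeholder bounds
    {"SUSPECT_REPORTED_AGE": 12, "HEIGHT_INCHES": 60},
    {"SUSPECT_REPORTED_AGE": 30, "HEIGHT_INCHES": 72},
]
print([is_plausible_child(r) for r in rows])
```

Only records that claim a child's age with an adult-sized height are rejected, which is consistent with how little data the validation step removes.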

In [5]:
cleaned_df = clean_data(raw_df, column_categories)
print(f'Approximately {(len(cleaned_df) / len(raw_df) * 100):.2f}% of the original data remains after cleaning')

# Keep only rows where the reported age is plausible for the reported height
plausible_df = cleaned_df[cleaned_df.apply(is_valid_child, axis=1)]
print(f'{(len(plausible_df) / len(cleaned_df)*100):.2f}% of data remains after age-height validation of children')

Approximately 83.02% of the original data remains after cleaning
99.96% of data remains after age-height validation of children


In [6]:
geocoded_df = geocode_df(plausible_df)

NTA CRS: EPSG:4326
Stops GDF CRS: EPSG:4326
NTA CRS: EPSG:4326
Sample of transformed coordinates:
         lat        lon
0  40.832167 -73.893445
1  40.672714 -73.753920
3  40.680767 -73.906721
4  40.812541 -73.955302
5  40.833561 -73.895933
Points outside expected NYC bounds: 20
Unmatched points: 26 out of 55073 (0.05%)
Unmatched points remaining after cleaning: 0 out of 55047 (0.00%)
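
The "outside expected NYC bounds" count in the log above comes from inside `geocode_df`, which is not shown. A bounding-box check of that kind could be sketched as follows. The `NYC_BOUNDS` values are a rough box around the five boroughs chosen for illustration, and the helper name `outside_nyc` is hypothetical. The first two sample points are taken from the log output, and the third is a made-up outlier.

```python
import pandas as pd

# Rough bounding box around the five boroughs; an assumption, not the utility's values.
NYC_BOUNDS = {"lat": (40.47, 40.93), "lon": (-74.27, -73.68)}

def outside_nyc(df, bounds=NYC_BOUNDS):
    """Boolean mask of rows whose lat/lon fall outside the bounding box."""
    lat_ok = df["lat"].between(*bounds["lat"])
    lon_ok = df["lon"].between(*bounds["lon"])
    return ~(lat_ok & lon_ok)

points = pd.DataFrame({
    "lat": [40.832167, 40.672714, 0.0],
    "lon": [-73.893445, -73.753920, 0.0],
})
mask = outside_nyc(points)
print(f"Points outside expected NYC bounds: {int(mask.sum())}")
```

A box check like this is cheap and catches gross coordinate errors, but it cannot replace the polygon match against neighborhood boundaries.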


## Geocode the vibe coding
The geocoded data still has a couple of outliers that I'll drop for now given how few they are.

In [7]:
validate_neighborhood_data(geocoded_df)

All neighborhood-borough mappings validated successfully!
Unique missing NTAs: {'Ferry Point Park-St. Raymond Cemetery', "St. Michael's Cemetery", 'Mount Olivet & All Faiths Cemeteries', 'Rockaway Community Park', 'Holy Cross Cemetery'}
Found 23 borough name mismatches:
      STOP_ID                                ntaname STOP_LOCATION_BORO_NAME  \
590       591                Howard Beach-Lindenwood                BROOKLYN   
1993     1994                             Ozone Park                BROOKLYN   
1994     1995                             Ozone Park                BROOKLYN   
4126     4127                        Bushwick (East)                  QUEENS   
4919     4920  Flatbush (West)-Ditmas Park-Parkville                  QUEENS   

      boroname  
590     Queens  
1993    Queens  
1994    Queens  
4126  Brooklyn  
4919  Brooklyn  
After cleaning: Found 0 borough name mismatches:
Empty DataFrame
Columns: [STOP_ID, ntaname, STOP_LOCATION_BORO_NAME, boroname]
Index: []
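
The borough mismatch check reported above can be pictured as a case-insensitive comparison between the officer-reported borough and the borough of the geocoded neighborhood. This is a minimal sketch on toy rows modeled on the output, not the body of `validate_neighborhood_data`. The helper name `borough_mismatches` is an assumption.

```python
import pandas as pd

def borough_mismatches(df, reported_col="STOP_LOCATION_BORO_NAME", geocoded_col="boroname"):
    """Return rows whose officer-reported borough differs from the geocoded one."""
    reported = df[reported_col].str.strip().str.upper()
    geocoded = df[geocoded_col].str.strip().str.upper()
    return df[reported != geocoded]

# Toy rows for illustration only.
toy = pd.DataFrame({
    "STOP_ID": [591, 4127, 1],
    "ntaname": ["Howard Beach-Lindenwood", "Bushwick (East)", "Crotona Park East"],
    "STOP_LOCATION_BORO_NAME": ["BROOKLYN", "QUEENS", "BRONX"],
    "boroname": ["Queens", "Brooklyn", "Bronx"],
})
mismatched = borough_mismatches(toy)
cleaned = toy.drop(mismatched.index)
print(len(mismatched), len(cleaned))
```

Dropping by index keeps the original frame untouched, so the mismatched rows stay available for inspection.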


## Spar-city, here we come 
We can see from the vast differences in null rates between the datasets that these flag features are incredibly sparse overall, so normalizing them enables usage of a much larger number of these fields.  That said, we still have some outlier features that I'll consider dropping as we move forward, specifically weight.

Until then, let's just explore the data as is.


In [None]:
# todo: refactor to utility
def sparsity_report(df, top_n=20):
    """Print the columns with the highest null rates and their unique-value counts."""
    # Both Series are indexed by column name, so they align into one frame.
    report = pd.DataFrame({
        'null_rate': df.isnull().mean(),
        'unique_values': df.nunique()
    }).sort_values('null_rate', ascending=False)

    print(report.head(top_n))

sparsity_report(raw_df)
sparsity_report(geocoded_df)

                                                    null_rate  unique_values
PHYSICAL_FORCE_OC_SPRAY_USED_FLAG                    0.999880              1
PHYSICAL_FORCE_WEAPON_IMPACT_FLAG                    0.999639              1
ID_CARD_IDENTIFIES_OFFICER_FLAG                      0.999006              1
PHYSICAL_FORCE_CEW_FLAG                              0.994564              1
SUSPECTS_ACTIONS_IDENTIFY_CRIME_PATTERN_FLAG         0.994157              1
SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG              0.992561              1
SUSPECTS_ACTIONS_LOOKOUT_FLAG                        0.988164              1
VERBAL_IDENTIFIES_OFFICER_FLAG                       0.987019              1
SHIELD_IDENTIFIES_OFFICER_FLAG                       0.985483              1
OTHER_WEAPON_FLAG                                    0.983300              1
SEARCH_BASIS_ADMISSION_FLAG                          0.980559              1
PHYSICAL_FORCE_OTHER_FLAG                            0.976463              1

## Aggregation at Spar-city
We can see that there is still a great deal of sparse data even after the cleaning processes above.  To preserve the data while also providing more meaningful results, I will aggregate a number of columns/features using the prepare_additional_features utility, which creates the following features.

Outcome of Stop: A categorical column with the values "Arrested", "Summoned", and "No Charges Filed", aggregated from their respective columns; "No Charges Filed" is the default when neither of the others applies.

Officers Used Physical Force: A Boolean feature aggregated from the many physical force features (drawing firearms, using restraint, etc.) 
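
The aggregation described above could look roughly like the following. This is a minimal sketch, not the code in `prepare_additional_features`. The function name `add_outcome_and_force`, the column names `SUSPECT_ARRESTED_FLAG` and `SUMMONS_ISSUED_FLAG`, and the rule that an arrest takes precedence over a summons are all assumptions. The labels follow the printed value counts for OUTCOME_OF_STOP.

```python
import numpy as np
import pandas as pd

def add_outcome_and_force(df, arrest_col, summons_col, force_cols):
    """Collapse sparse Y/blank flags into OUTCOME_OF_STOP and OFFICER_USED_FORCE."""
    out = df.copy()
    arrested = out[arrest_col].eq("Y")
    summoned = out[summons_col].eq("Y")
    # np.select takes the first matching condition, so arrest wins over summons (assumed rule).
    out["OUTCOME_OF_STOP"] = np.select(
        [arrested, summoned], ["Arrested", "Summoned"], default="No Charges Filed"
    )
    # True when any of the individual force flags is set.
    out["OFFICER_USED_FORCE"] = out[force_cols].eq("Y").any(axis=1)
    return out

# Toy rows for illustration only.
toy = pd.DataFrame({
    "SUSPECT_ARRESTED_FLAG": ["Y", None, None],
    "SUMMONS_ISSUED_FLAG": [None, "Y", None],
    "PHYSICAL_FORCE_CEW_FLAG": [None, None, None],
    "PHYSICAL_FORCE_OTHER_FLAG": ["Y", None, None],
})
result = add_outcome_and_force(
    toy,
    arrest_col="SUSPECT_ARRESTED_FLAG",
    summons_col="SUMMONS_ISSUED_FLAG",
    force_cols=["PHYSICAL_FORCE_CEW_FLAG", "PHYSICAL_FORCE_OTHER_FLAG"],
)
print(result[["OUTCOME_OF_STOP", "OFFICER_USED_FORCE"]])
```

Collapsing many near-empty flag columns into one dense feature keeps the signal while avoiding dozens of columns that are almost entirely null.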

In [8]:
final_df = prepare_additional_features(geocoded_df)
final_df

OUTCOME_OF_STOP 
No Charges Filed    34939
Arrested            18501
Summoned             1607
Name: count, dtype: int64
OFFICER_USED_FORCE
False                 41203
True                  13844
Name: count, dtype: int64
FORCE_TYPE          
No Force                41203
Handcuffs                8891
Firearm Drawn            2184
Restraint Used           1241
Other Physical Force     1199
Taser                     303
Weapon Impact              21
Pepper Spray                5
Name: count, dtype: int64


| | STOP_FRISK_DATE | YEAR2 | STOP_ID | OBSERVED_DURATION_MINUTES | STOP_DURATION_MINUTES | SUSPECT_REPORTED_AGE | SUSPECT_WEIGHT | STOP_LOCATION_X | STOP_LOCATION_Y | MONTH2 | ... | boroname | ntatype | nta2020 | borocode | countyfips | ntaabbrev | cdta2020 | OUTCOME_OF_STOP | OFFICER_USED_FORCE | FORCE_TYPE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2021-01-01 00:01:00 | 2021 | 1 | 0 | 2 | 40 | 200 | 1013737 | 242476 | January | ... | Bronx | 0 | BX0303 | 2 | 005 | CrtnaPkEst | BX03 | Arrested | True | Handcuffs |
| 1 | 2021-01-01 00:01:00 | 2021 | 2 | 1 | 10 | 19 | 160 | 1052511 | 184460 | January | ... | Queens | 0 | QN1305 | 4 | 081 | Lrltn | QN13 | Arrested | False | No Force |
| 2 | 2021-01-01 00:01:00 | 2021 | 4 | 1 | 5 | 28 | 180 | 1010122 | 187312 | January | ... | Brooklyn | 0 | BK1601 | 3 | 047 | OcnHl | BK16 | No Charges Filed | False | No Force |
| 3 | 2021-01-01 00:01:00 | 2021 | 5 | 1 | 30 | 19 | 185 | 996623 | 235311 | January | ... | Manhattan | 0 | MN0901 | 1 | 061 | MrngsdHts | MN09 | No Charges Filed | True | Handcuffs |
| 4 | 2021-01-01 00:01:00 | 2021 | 6 | 0 | 6 | 32 | 160 | 1013048 | 242983 | January | ... | Bronx | 0 | BX0303 | 2 | 005 | CrtnaPkEst | BX03 | Arrested | False | No Force |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55042 | 2022-01-31 00:12:00 | 2022 | 15098 | 1 | 7 | 45 | 250 | 1021687 | 236876 | December | ... | Bronx | 0 | BX0902 | 2 | 005 | Sdvw_ClsPt | BX09 | No Charges Filed | False | No Force |
| 55043 | 2022-01-31 00:12:00 | 2022 | 15099 | 0 | 1 | 36 | | 1009784 | 239429 | December | ... | Bronx | 0 | BX0301 | 2 | 005 | Mrrsnia | BX03 | No Charges Filed | False | No Force |
| 55044 | 2022-01-31 00:12:00 | 2022 | 15100 | 1 | 2 | 32 | 160 | 1019464 | 212685 | December | ... | Queens | 0 | QN0301 | 4 | 081 | JcksnHts | QN03 | No Charges Filed | False | No Force |
| 55045 | 2022-01-31 00:12:00 | 2022 | 15101 | 1 | 1 | 18 | 150 | 1008017 | 244068 | December | ... | Bronx | 0 | BX0403 | 2 | 005 | MtEdn | BX04 | No Charges Filed | False | No Force |


In [9]:
first_type = type(final_df['STOP_FRISK_DATE'].iloc[0])
mixed_rows = final_df[final_df['STOP_FRISK_DATE'].apply(lambda x: type(x)) != first_type]
print(f"# of mixed rows: {len(mixed_rows)}")
mixed_rows.head()

# of mixed rows: 0


Unnamed: 0,STOP_FRISK_DATE,YEAR2,STOP_ID,OBSERVED_DURATION_MINUTES,STOP_DURATION_MINUTES,SUSPECT_REPORTED_AGE,SUSPECT_WEIGHT,STOP_LOCATION_X,STOP_LOCATION_Y,MONTH2,...,boroname,ntatype,nta2020,borocode,countyfips,ntaabbrev,cdta2020,OUTCOME_OF_STOP,OFFICER_USED_FORCE,FORCE_TYPE


In [10]:
final_df = final_df.dropna(subset=['STOP_FRISK_DATE'])
final_df.to_csv('./data/processed/stop-and-frisk.csv', index=False)
final_df.info()

<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 55047 entries, 0 to 55046
Data columns (total 71 columns):
 #   Column                                  Non-Null Count  Dtype         
---  ------                                  --------------  -----         
 0   STOP_FRISK_DATE                         55047 non-null  datetime64[ns]
 1   YEAR2                                   55047 non-null  Int64         
 2   STOP_ID                                 55047 non-null  Int64         
 3   OBSERVED_DURATION_MINUTES               55047 non-null  Int64         
 4   STOP_DURATION_MINUTES                   55047 non-null  Int64         
 5   SUSPECT_REPORTED_AGE                    55047 non-null  Int64         
 6   SUSPECT_WEIGHT                          54113 non-null  Int64         
 7   STOP_LOCATION_X                         55047 non-null  Int64         
 8   STOP_LOCATION_Y                         55047 non-null  Int64         
 9   MONTH2                                  55

In [11]:
print(f'From the initial {len(raw_df)} records, {len(final_df)} remain.  Approximately {(len(final_df) / len(raw_df)):.2f} of the original data remains after processing')

From the initial 66406 records, 55047 remain.  Approximately 0.83 of the original data remains after processing
