### Bronze to Silver Medallion Data Layer for Stop, Question and Frisk Data (2017-2022)

We apply the following transformations and cleaning steps to produce the Silver Medallion layer:
- Borough mapping for consistency (e.g., PBMN and PBMS both map to Manhattan; Staten Is and Staten Island both map to Staten Island)
- Race mapping for consistency (e.g., ASIAN/PAC.ISL and ASIAN / PACIFIC ISLANDER both map to ASIAN/PACIFIC ISLANDER)
- Null value mapping (null values appear in several formats, e.g., (null), NaN, NULL, #N/A)
- Converting column values to a consistent format (e.g., 2017.00 to 2017)
- Dropping columns that are irrelevant to the analysis or that have a very high percentage of missing values
- Date preprocessing (Stop Frisk Date appears in two formats, e.g., 2/5/2020 and 2021-07-29)

In [1]:
import sys
sys.path.append("..")

In [2]:
from snowflake.snowpark import Session
import snowflake.snowpark.functions as F
from snowflake.snowpark.functions import when, col, sum, concat, lit
from snowflake.snowpark.functions import expr, regexp_extract, to_date
from datetime import date
from helpers import SnowflakeHelper
import json
import os
import matplotlib.pyplot as plt
import pandas as pd

In [3]:
borough_mapping = {
    "PBBX": "BRONX", 
    "PBSI": "STATEN ISLAND", 
    "PBMN": "MANHATTAN", 
    "PBMS": "MANHATTAN",
    "PBBN": "BROOKLYN", 
    "PBBS": "BROOKLYN", 
    "PBQS": "QUEENS", 
    "PBQN": "QUEENS",
    "STATEN IS": "STATEN ISLAND"
}

null_value_mapping = {
    "(null)" : None,
    "NaN" : None,
    "(" : None,
    "NULL": None,
    "(nu": None, 
    "#N/A": None
}

race_mapping = {
    "ASIAN / PACIFIC ISLANDER": "ASIAN/PACIFIC ISLANDER",
    "ASIAN/PAC.ISL": "ASIAN/PACIFIC ISLANDER",
    "AMER IND": "AMERICAN INDIAN/ALASKAN NATIVE",
    "AMERICAN INDIAN/ALASKAN N": "AMERICAN INDIAN/ALASKAN NATIVE",
    "MIDDLE EASTERN/SOUTHWEST": "MIDDLE EASTERN/SOUTHWEST ASIAN"
}

gender_mapping = {
    "M": "MALE",
    "F": "FEMALE"
}
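
The dictionaries above act as lookup tables: a raw value that appears as a key is rewritten to its canonical form, and any other value is left untouched. A minimal pure-Python sketch of that intended behavior follows. The `canonicalize` helper, the `sample_borough_mapping` subset, and the sample values are illustrative assumptions and are not part of the notebook or of Snowpark.

```python
def canonicalize(value, mapping):
    """Return the canonical form of `value`, or `value` itself if no mapping applies."""
    return mapping.get(value, value)

# Subset of the borough mapping defined above, repeated so the sketch is self-contained
sample_borough_mapping = {
    "PBMN": "MANHATTAN",
    "PBMS": "MANHATTAN",
    "STATEN IS": "STATEN ISLAND",
}

raw_values = ["PBMN", "PBMS", "STATEN IS", "BRONX"]
normalized = [canonicalize(v, sample_borough_mapping) for v in raw_values]
print(normalized)
```

Values that are already canonical, such as `BRONX` here, pass through unchanged, which mirrors the `.otherwise(col(...))` fallback used in the Snowpark cells.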

In [4]:
snowflake_helper = SnowflakeHelper()
snowflake_config = './../helpers/snowflake_config.json'
session = snowflake_helper.create_snowpark_session(snowflake_config)

[INFO] No schema passed, using default schema SAFEGUARDING_NYC_SCHEMA_BRONZE for the session
[SUCCESS] Config file loaded successfully!
[SUCCESS] Snowspark Session created successfully on schema SAFEGUARDING_NYC_SCHEMA_BRONZE!


In [5]:
sqf_data = session.table('SQF')

In [6]:
sqf_data.show()

In [7]:
sqf_data.count()

69689

In [8]:
total_rows = sqf_data.count()

# Calculate the count of missing values for each column
missing_counts = sqf_data.select([sum(col(c).isNull().cast("int")).alias(c) for c in sqf_data.columns])

# Calculate the percentage of missing values for each column
missing_percentages = missing_counts.select([(col(c) / total_rows * 100).alias(c) for c in missing_counts.columns])

# Display the result
missing_percentages.show()
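
The missing-value check in this cell counts NULLs per column and divides by the total row count. The same arithmetic can be sketched in plain Python on a small list of records. The `missing_percentages_for` function and the sample rows are hypothetical and only illustrate the calculation, not the Snowpark execution.

```python
def missing_percentages_for(rows, columns):
    """Percentage of rows in which each column is None."""
    total = len(rows)
    if total == 0:
        return {c: 0.0 for c in columns}
    return {
        c: 100.0 * sum(1 for row in rows if row.get(c) is None) / total
        for c in columns
    }

# Illustrative records only; these are not taken from the SQF table
sample_rows = [
    {"SUSPECT_SEX": "MALE", "SUSPECT_WEIGHT": None},
    {"SUSPECT_SEX": None, "SUSPECT_WEIGHT": None},
    {"SUSPECT_SEX": "FEMALE", "SUSPECT_WEIGHT": "150"},
    {"SUSPECT_SEX": "MALE", "SUSPECT_WEIGHT": "180"},
]

result = missing_percentages_for(sample_rows, ["SUSPECT_SEX", "SUSPECT_WEIGHT"])
print(result)
```

Note that this check only sees true NULLs. Placeholder strings such as `(null)` are counted as present until they are converted to NULL.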

In [9]:
columns_to_drop= ['_AIRBYTE_RAW_ID', '_AIRBYTE_EXTRACTED_AT', '_AIRBYTE_META', 'STOP_LOCATION_ZIP_CODE', 'ID_CARD_IDENTIFIES_OFFICER_FLAG', 'OFFICER_NOT_EXPLAINED_STOP_DESCRIPTION',\
                    'SUSPECTS_ACTIONS_CASING_FLAG', 'SUMMONS_ISSUED_FLAG', 'VERBAL_IDENTIFIES_OFFICER_FLAG', 'SEARCH_BASIS_ADMISSION_FLAG', 'SEARCH_BASIS_OTHER_FLAG', 'SEARCH_BASIS_CONSENT_FLAG',\
                    'BACKROUND_CIRCUMSTANCES_SUSPECT_KNOWN_TO_CARRY_WEAPON_FLAG', 'RECORD_STATUS_CODE', 'PHYSICAL_FORCE_RESTRAINT_USED_FLAG', 'PHYSICAL_FORCE_HANDCUFF_SUSPECT_FLAG',\
                    'OTHER_PERSON_STOPPED_FLAG', 'SUSPECTS_ACTIONS_PROXIMITY_TO_SCENE_FLAG', 'DEMEANOR_CODE', 'SUPERVISING_OFFICER_COMMAND_CODE', 'STOP_ID_ANONY', 'ISSUING_OFFICER_COMMAND_CODE',\
                    'LOCATION_IN_OUT_CODE', 'JURISDICTION_DESCRIPTION', 'PHYSICAL_FORCE_VERBAL_INSTRUCTION_FLAG']
sqf_data = sqf_data.drop(*columns_to_drop)
sqf_data.show()

In [10]:
sqf_data.group_by('STOP_LOCATION_PATROL_BORO_NAME')\
        .count()\
        .show()

----------------------------------------------
|"STOP_LOCATION_PATROL_BORO_NAME"  |"COUNT"  |
----------------------------------------------
|PBQS                              |5897     |
|PBMN                              |10729    |
|PBQN                              |5958     |
|(nul                              |417      |
|5 AV                              |1        |
|0233                              |1        |
|0220                              |1        |
|0216                              |1        |
|PBBS                              |10520    |
|PBMS                              |6795     |
----------------------------------------------



In [11]:
sqf_data.group_by('STOP_LOCATION_BORO_NAME')\
        .count()\
        .show()

---------------------------------------
|"STOP_LOCATION_BORO_NAME"  |"COUNT"  |
---------------------------------------
|BROOKLYN                   |21648    |
|BRONX                      |15705    |
|PBMS                       |1        |
|0210334                    |1        |
|PBBS                       |1        |
|0190241                    |1        |
|PBMN                       |3        |
|(null)                     |410      |
|0155070                    |1        |
|MANHATTAN                  |17524    |
---------------------------------------



In [12]:
def build_mapping_expr(column_name, mapping):
    """Build a CASE WHEN chain that rewrites each key in `mapping` to its canonical value."""
    (first_key, first_value), *remaining = mapping.items()
    mapping_expr = when(col(column_name) == first_key, first_value)
    for key, value in remaining:
        mapping_expr = mapping_expr.when(col(column_name) == key, value)
    # Values without a mapping entry (e.g. already-canonical names) pass through unchanged
    return mapping_expr.otherwise(col(column_name))

# Apply the same borough mapping to both borough columns
preprocessed_sqf_df = sqf_data
for boro_column in ["STOP_LOCATION_PATROL_BORO_NAME", "STOP_LOCATION_BORO_NAME"]:
    preprocessed_sqf_df = preprocessed_sqf_df.withColumn(boro_column, build_mapping_expr(boro_column, borough_mapping))

In [13]:
preprocessed_sqf_df.show()

In [14]:
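# Keep only rows whose borough columns hold a canonical borough name (drops placeholders and stray codes)
valid_boroughs = sorted(set(borough_mapping.values()))
preprocessed_sqf_df = preprocessed_sqf_df.filter(col('STOP_LOCATION_PATROL_BORO_NAME').isin(valid_boroughs))
preprocessed_sqf_df = preprocessed_sqf_df.filter(col('STOP_LOCATION_BORO_NAME').isin(valid_boroughs))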
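
The filter in this cell works as an allow-list: a row survives only if its borough value is one of the canonical names. Placeholder strings, stray codes, and NULLs all fail the membership test and are dropped. A rough pure-Python equivalent is sketched below. The `keep_valid` helper is hypothetical, and the sample rows reuse a few of the stray values visible in the group-by output above.

```python
VALID_BOROUGHS = {"BRONX", "BROOKLYN", "MANHATTAN", "QUEENS", "STATEN ISLAND"}

def keep_valid(rows, column, valid_values):
    """Keep rows whose `column` value is in `valid_values`; None never matches."""
    return [row for row in rows if row.get(column) in valid_values]

sample_rows = [
    {"STOP_LOCATION_BORO_NAME": "BRONX"},
    {"STOP_LOCATION_BORO_NAME": "(null)"},   # placeholder string seen in the raw data
    {"STOP_LOCATION_BORO_NAME": "0210334"},  # stray code seen in the raw data
    {"STOP_LOCATION_BORO_NAME": None},
    {"STOP_LOCATION_BORO_NAME": "QUEENS"},
]

kept = keep_valid(sample_rows, "STOP_LOCATION_BORO_NAME", VALID_BOROUGHS)
print([row["STOP_LOCATION_BORO_NAME"] for row in kept])
```

Because this step removes rows rather than nulling values, it is worth comparing the row count before and after, as the notebook does with `count()`.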

In [15]:
preprocessed_sqf_df.show()

In [16]:
preprocessed_sqf_df.count()

69257

In [17]:
preprocessed_sqf_df.group_by('STOP_LOCATION_BORO_NAME')\
        .count()\
        .show()

---------------------------------------
|"STOP_LOCATION_BORO_NAME"  |"COUNT"  |
---------------------------------------
|QUEENS                     |11855    |
|STATEN ISLAND              |2525     |
|BRONX                      |15705    |
|BROOKLYN                   |21648    |
|MANHATTAN                  |17524    |
---------------------------------------



In [18]:
preprocessed_sqf_df.group_by('STOP_LOCATION_PATROL_BORO_NAME')\
        .count()\
        .show()

----------------------------------------------
|"STOP_LOCATION_PATROL_BORO_NAME"  |"COUNT"  |
----------------------------------------------
|QUEENS                            |11855    |
|STATEN ISLAND                     |2525     |
|MANHATTAN                         |17524    |
|BRONX                             |15705    |
|BROOKLYN                          |21648    |
----------------------------------------------



In [19]:
preprocessed_sqf_df.group_by('YEAR2')\
        .count()\
        .sort('YEAR2', ascending=True)\
        .show()

---------------------
|"YEAR2"  |"COUNT"  |
---------------------
|2017.00  |11197    |
|2018     |11008    |
|2019     |13459    |
|2020     |9544     |
|2021     |8947     |
|2022     |15102    |
---------------------



In [20]:
mapping_expr_year = when(col('YEAR2') == '2017.00', '2017')
preprocessed_sqf_df = preprocessed_sqf_df.withColumn("YEAR2", mapping_expr_year.otherwise(col("YEAR2")))
preprocessed_sqf_df.group_by('YEAR2')\
        .count()\
        .sort('YEAR2', ascending=True)\
        .show()

---------------------
|"YEAR2"  |"COUNT"  |
---------------------
|2017     |11197    |
|2018     |11008    |
|2019     |13459    |
|2020     |9544     |
|2021     |8947     |
|2022     |15102    |
---------------------
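
The cell above fixes the single observed variant, `2017.00`. If other years could arrive with a trailing decimal part, a more general rule would be to parse the value as a number and re-render it as an integer string. The sketch below shows that idea in plain Python. The `normalize_year` helper is an assumption for illustration and is not what the notebook runs.

```python
def normalize_year(raw):
    """Render a year such as '2017.00' or '2018' as a plain integer string."""
    if raw is None:
        return None
    return str(int(float(raw)))

years = ["2017.00", "2018", "2022", None]
normalized_years = [normalize_year(y) for y in years]
print(normalized_years)
```

This sketch raises `ValueError` on non-numeric input, which may be preferable to silently keeping a malformed year.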



In [21]:
preprocessed_sqf_df.count()

69257

In [22]:
total_rows = preprocessed_sqf_df.count()

# Calculate the count of missing values for each column
missing_counts = preprocessed_sqf_df.select([sum(col(c).isNull().cast("int")).alias(c) for c in preprocessed_sqf_df.columns])

# Calculate the percentage of missing values for each column
missing_percentages = missing_counts.select([(col(c) / total_rows * 100).alias(c) for c in missing_counts.columns])

# Display the result
missing_percentages.show()

In [23]:
preprocessed_sqf_df.columns

['PHYSICAL_FORCE_OC_SPRAY_USED_FLAG',
 'SEARCH_BASIS_HARD_OBJECT_FLAG',
 'STOP_ID',
 'SUSPECT_WEIGHT',
 'STOP_LOCATION_SECTOR_CODE',
 'OTHER_WEAPON_FLAG',
 'PHYSICAL_FORCE_WEAPON_IMPACT_FLAG',
 'STOP_LOCATION_STREET_NAME',
 'STOP_DURATION_MINUTES',
 'STOP_FRISK_DATE',
 'PHYSICAL_FORCE_OTHER_FLAG',
 'SUSPECT_SEX',
 'ASK_FOR_CONSENT_FLG',
 'SUMMONS_OFFENSE_DESCRIPTION',
 'STOP_LOCATION_X',
 'SUSPECTS_ACTIONS_CONCEALED_POSSESSION_WEAPON_FLAG',
 'SUSPECTS_ACTIONS_OTHER_FLAG',
 'PHYSICAL_FORCE_CEW_FLAG',
 'STOP_LOCATION_Y',
 'DAY2',
 'FIREARM_FLAG',
 'WEAPON_FOUND_FLAG',
 'MONTH2',
 'SEARCH_BASIS_INCIDENTAL_TO_ARREST_FLAG',
 'SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG',
 'BACKROUND_CIRCUMSTANCES_VIOLENT_CRIME_FLAG',
 'FRISKED_FLAG',
 'SUSPECT_HAIR_COLOR',
 'SUSPECT_ARREST_OFFENSE',
 'OFFICER_IN_UNIFORM_FLAG',
 'STOP_LOCATION_APARTMENT',
 'JURISDICTION_CODE',
 'SEARCHED_FLAG',
 'SUSPECTED_CRIME_DESCRIPTION',
 'SUSPECT_HEIGHT',
 'SUSPECTS_ACTIONS_DECRIPTION_FLAG',
 'KNIFE_CUTTER_FLAG',
 'SUPERVI

In [24]:
columns_to_drop= ['PHYSICAL_FORCE_OC_SPRAY_USED_FLAG', 'SEARCH_BASIS_HARD_OBJECT_FLAG', 'OTHER_WEAPON_FLAG', 'PHYSICAL_FORCE_WEAPON_IMPACT_FLAG',\
                'PHYSICAL_FORCE_OTHER_FLAG', 'SUMMONS_OFFENSE_DESCRIPTION', 'SUSPECTS_ACTIONS_CONCEALED_POSSESSION_WEAPON_FLAG', 'SUSPECTS_ACTIONS_OTHER_FLAG',\
                'PHYSICAL_FORCE_CEW_FLAG', 'FIREARM_FLAG', 'SEARCH_BASIS_INCIDENTAL_TO_ARREST_FLAG', 'SUSPECTS_ACTIONS_DRUG_TRANSACTIONS_FLAG', \
                'BACKROUND_CIRCUMSTANCES_VIOLENT_CRIME_FLAG', 'STOP_LOCATION_APARTMENT', 'KNIFE_CUTTER_FLAG',\
                'PHYSICAL_FORCE_DRAW_POINT_FIREARM_FLAG', 'SHIELD_IDENTIFIES_OFFICER_FLAG',\
                'SUSPECTS_ACTIONS_IDENTIFY_CRIME_PATTERN_FLAG', 'SUSPECTS_ACTIONS_LOOKOUT_FLAG']
preprocessed_sqf_df = preprocessed_sqf_df.drop(*columns_to_drop)
preprocessed_sqf_df.show()

In [25]:
columns_to_convert_null_values = ['SUSPECT_WEIGHT', 'STOP_LOCATION_SECTOR_CODE', 'STOP_LOCATION_STREET_NAME', 'STOP_DURATION_MINUTES',\
 'SUSPECT_SEX', 'ASK_FOR_CONSENT_FLG', 'STOP_LOCATION_X', 'STOP_LOCATION_Y', 'DAY2', 'WEAPON_FOUND_FLAG', 'MONTH2', 'FRISKED_FLAG',\
 'SUSPECT_HAIR_COLOR', 'SUSPECT_ARREST_OFFENSE', 'OFFICER_IN_UNIFORM_FLAG', 'JURISDICTION_CODE', 'SEARCHED_FLAG', 'SUSPECTED_CRIME_DESCRIPTION',\
 'SUSPECT_HEIGHT', 'SUSPECTS_ACTIONS_DECRIPTION_FLAG', 'SUPERVISING_OFFICER_RANK', 'STOP_LOCATION_FULL_ADDRESS', 'OTHER_CONTRABAND_FLAG',\
 'SUSPECT_BODY_BUILD_TYPE', 'DEMEANOR_OF_PERSON_STOPPED', 'SUSPECT_ARRESTED_FLAG', 'SUSPECT_RACE_DESCRIPTION', 'SUSPECT_REPORTED_AGE',\
 'SUSPECT_EYE_COLOR', 'OBSERVED_DURATION_MINUTES', 'CONSENT_GIVEN_FLG', 'STOP_WAS_INITIATED', 'SEARCH_BASIS_OUTLINE_FLAG',\
 'ISSUING_OFFICER_RANK', 'OFFICER_EXPLAINED_STOP_FLAG', 'STOP_LOCATION_PRECINCT', 'SUSPECT_OTHER_DESCRIPTION', 'STOP_LOCATION_PATROL_BORO_NAME',\
 'STOP_LOCATION_BORO_NAME', 'YEAR2']

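# Replace every placeholder token (e.g. "(null)", "NaN", "#N/A") with a real NULL.
# One isin() check per column avoids stacking a separate CASE expression per token.
null_tokens = list(null_value_mapping.keys())
for column in columns_to_convert_null_values:
    preprocessed_sqf_df = preprocessed_sqf_df.withColumn(column, when(col(column).isin(null_tokens), lit(None)).otherwise(col(column)))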

preprocessed_sqf_df.show()
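
The null-value conversion above treats a fixed set of placeholder strings as missing. The rule can be sketched in plain Python as follows. The token set is copied from `null_value_mapping`, while the `to_null` helper and the sample values are illustrative assumptions.

```python
# Placeholder tokens taken from null_value_mapping; truncated forms such as "(nu" are included
NULL_TOKENS = {"(null)", "NaN", "(", "NULL", "(nu", "#N/A"}

def to_null(value):
    """Return None for any placeholder token, otherwise return the value unchanged."""
    return None if value in NULL_TOKENS else value

raw_values = ["(null)", "MALE", "#N/A", "(nu", "BLACK", None]
cleaned = [to_null(v) for v in raw_values]
print(cleaned)
```

The match is exact and case-sensitive, so a variant such as `null` in lowercase would not be converted unless it is added to the token set.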

In [26]:
total_rows = preprocessed_sqf_df.count()

# Calculate the count of missing values for each column
missing_counts = preprocessed_sqf_df.select([sum(col(c).isNull().cast("int")).alias(c) for c in preprocessed_sqf_df.columns])

# Calculate the percentage of missing values for each column
missing_percentages = missing_counts.select([(col(c) / total_rows * 100).alias(c) for c in missing_counts.columns])

# Display the result
missing_percentages.show()

In [27]:
distinct_values_cols_check = ['STOP_LOCATION_SECTOR_CODE', 'SUSPECT_SEX', 'ASK_FOR_CONSENT_FLG', 'WEAPON_FOUND_FLAG', 'MONTH2', 'FRISKED_FLAG',\
 'SUSPECT_HAIR_COLOR', 'SUSPECT_ARREST_OFFENSE', 'OFFICER_IN_UNIFORM_FLAG', 'JURISDICTION_CODE', 'SEARCHED_FLAG', 'SUSPECTS_ACTIONS_DECRIPTION_FLAG',\
 'SUPERVISING_OFFICER_RANK', 'OTHER_CONTRABAND_FLAG', 'SUSPECT_BODY_BUILD_TYPE', 'DEMEANOR_OF_PERSON_STOPPED', 'SUSPECT_ARRESTED_FLAG',\
 'SUSPECT_EYE_COLOR', 'CONSENT_GIVEN_FLG', 'STOP_WAS_INITIATED', 'ISSUING_OFFICER_RANK', 'OFFICER_EXPLAINED_STOP_FLAG',\
 'STOP_LOCATION_PATROL_BORO_NAME', 'STOP_LOCATION_BORO_NAME', 'YEAR2', 'SUSPECT_RACE_DESCRIPTION']

In [29]:
for column in distinct_values_cols_check:
    distinct_count = preprocessed_sqf_df.select(column).distinct().count()
    print(f"Number of distinct values in {column}: {distinct_count}")

Number of distinct values in STOP_LOCATION_SECTOR_CODE: 18
Number of distinct values in SUSPECT_SEX: 3
Number of distinct values in ASK_FOR_CONSENT_FLG: 3
Number of distinct values in WEAPON_FOUND_FLAG: 3
Number of distinct values in MONTH2: 12
Number of distinct values in FRISKED_FLAG: 2
Number of distinct values in SUSPECT_HAIR_COLOR: 17
Number of distinct values in SUSPECT_ARREST_OFFENSE: 32
Number of distinct values in OFFICER_IN_UNIFORM_FLAG: 2
Number of distinct values in JURISDICTION_CODE: 5
Number of distinct values in SEARCHED_FLAG: 2
Number of distinct values in SUSPECTS_ACTIONS_DECRIPTION_FLAG: 2
Number of distinct values in SUPERVISING_OFFICER_RANK: 16
Number of distinct values in OTHER_CONTRABAND_FLAG: 2
Number of distinct values in SUSPECT_BODY_BUILD_TYPE: 10
Number of distinct values in DEMEANOR_OF_PERSON_STOPPED: 6325
Number of distinct values in SUSPECT_ARRESTED_FLAG: 2
Number of distinct values in SUSPECT_EYE_COLOR: 13
Number of distinct values in CONSENT_GIVEN_FLG: 3

In [30]:
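# Use `column` as the loop variable: naming it `col` would shadow the imported
# snowflake.snowpark.functions.col and break later cells that call col(...)
for column in distinct_values_cols_check:
    distinct_values = preprocessed_sqf_df.select(column).distinct()
    print(f"Distinct values in {column}:")
    distinct_values.show()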

Distinct values in STOP_LOCATION_SECTOR_CODE:
-------------------------------
|"STOP_LOCATION_SECTOR_CODE"  |
-------------------------------
|D                            |
|C                            |
|A                            |
|J                            |
|1                            |
|M                            |
|I                            |
|K                            |
|B                            |
|NULL                         |
-------------------------------

Distinct values in SUSPECT_SEX:
-----------------
|"SUSPECT_SEX"  |
-----------------
|MALE           |
|FEMALE         |
|NULL           |
-----------------

Distinct values in ASK_FOR_CONSENT_FLG:
-------------------------
|"ASK_FOR_CONSENT_FLG"  |
-------------------------
|N                      |
|NULL                   |
|Y                      |
-------------------------

Distinct values in WEAPON_FOUND_FLAG:
-----------------------
|"WEAPON_FOUND_FLAG"  |
-----------------------
|N           

In [28]:
preprocessed_sqf_df.group_by('SUSPECT_RACE_DESCRIPTION')\
        .count()\
        .show()

--------------------------------------------
|"SUSPECT_RACE_DESCRIPTION"      |"COUNT"  |
--------------------------------------------
|WHITE HISPANIC                  |14424    |
|AMERICAN INDIAN/ALASKAN NATIVE  |49       |
|ASIAN / PACIFIC ISLANDER        |1253     |
|MIDDLE EASTERN/SOUTHWEST        |132      |
|BLACK                           |40189    |
|ASIAN/PAC.ISL                   |203      |
|MIDDLE EASTERN/SOUTHWEST ASIAN  |219      |
|BLACK HISPANIC                  |6067     |
|AMER IND                        |9        |
|NULL                            |764      |
--------------------------------------------



In [29]:
# Build one CASE WHEN branch per entry in race_mapping; unmapped values pass through unchanged
race_items = iter(race_mapping.items())
first_key, first_value = next(race_items)
mapping_expr = when(col("SUSPECT_RACE_DESCRIPTION") == first_key, first_value)
for key, value in race_items:
    mapping_expr = mapping_expr.when(col("SUSPECT_RACE_DESCRIPTION") == key, value)

preprocessed_sqf_df = preprocessed_sqf_df.withColumn("SUSPECT_RACE_DESCRIPTION", mapping_expr.otherwise(col("SUSPECT_RACE_DESCRIPTION")))

In [30]:
preprocessed_sqf_df.show()

In [31]:
preprocessed_sqf_df.where(col("YEAR2") == '2018').show()

In [32]:
preprocessed_sqf_df.group_by('SUSPECT_RACE_DESCRIPTION')\
        .count()\
        .show()

--------------------------------------------
|"SUSPECT_RACE_DESCRIPTION"      |"COUNT"  |
--------------------------------------------
|WHITE                           |5923     |
|NULL                            |764      |
|BLACK                           |40189    |
|ASIAN/PACIFIC ISLANDER          |1456     |
|MIDDLE EASTERN/SOUTHWEST ASIAN  |351      |
|WHITE HISPANIC                  |14424    |
|AMERICAN INDIAN/ALASKAN NATIVE  |83       |
|BLACK HISPANIC                  |6067     |
--------------------------------------------
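
A quick way to sanity-check a label mapping is to confirm that the merged counts equal the sum of the counts of the variants that were collapsed. The sketch below does this in plain Python for two of the race categories, using the counts displayed in the group-by outputs before and after the mapping. The `merge_counts` helper is hypothetical, and only the rows visible in the truncated `show()` output are included.

```python
from collections import Counter

race_mapping = {
    "ASIAN / PACIFIC ISLANDER": "ASIAN/PACIFIC ISLANDER",
    "ASIAN/PAC.ISL": "ASIAN/PACIFIC ISLANDER",
    "MIDDLE EASTERN/SOUTHWEST": "MIDDLE EASTERN/SOUTHWEST ASIAN",
}

def merge_counts(counts, mapping):
    """Re-key a label -> count dict through `mapping`, summing counts that collide."""
    merged = Counter()
    for label, count in counts.items():
        merged[mapping.get(label, label)] += count
    return merged

# Counts as displayed before the mapping was applied
counts_before = {
    "ASIAN / PACIFIC ISLANDER": 1253,
    "ASIAN/PAC.ISL": 203,
    "MIDDLE EASTERN/SOUTHWEST": 132,
    "MIDDLE EASTERN/SOUTHWEST ASIAN": 219,
}

counts_after = merge_counts(counts_before, race_mapping)
print(dict(counts_after))
```

The merged totals, 1456 and 351, agree with the counts shown for `ASIAN/PACIFIC ISLANDER` and `MIDDLE EASTERN/SOUTHWEST ASIAN` after the mapping.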



In [33]:
preprocessed_sqf_df.group_by('SUSPECT_SEX')\
        .count()\
        .show()

---------------------------
|"SUSPECT_SEX"  |"COUNT"  |
---------------------------
|MALE           |62892    |
|FEMALE         |5941     |
|NULL           |424      |
---------------------------



In [34]:
preprocessed_sqf_df.select('STOP_FRISK_DATE').distinct().show()

---------------------
|"STOP_FRISK_DATE"  |
---------------------
|3/10/2019          |
|3/19/2019          |
|3/30/2019          |
|4/7/2019           |
|4/11/2019          |
|11/2/2019          |
|12/3/2019          |
|6/5/2019           |
|9/5/2019           |
|4/2/2019           |
---------------------



In [35]:
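# TO_DATE without an explicit format relies on Snowflake's automatic date format detection,
# which appears to handle both source formats here (e.g. 2/5/2020 and 2021-07-29)
sql_expr_for_date_conversion = "TO_DATE(STOP_FRISK_DATE)"

preprocessed_sqf_df = preprocessed_sqf_df.withColumn("STOP_FRISK_DATE", expr(sql_expr_for_date_conversion))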
preprocessed_sqf_df.show()


In [35]:
preprocessed_sqf_df.select('STOP_FRISK_DATE').distinct().limit(5).show()

---------------------
|"STOP_FRISK_DATE"  |
---------------------
|2020-06-19         |
|2020-06-16         |
|2020-06-08         |
|2021-07-30         |
|2021-08-02         |
---------------------
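
The conversion above leaves format detection to Snowflake. When the set of source formats is known, an alternative is to try each format explicitly and return NULL when none matches. The plain-Python sketch below illustrates that approach for the two formats seen in this dataset, month-first `M/D/YYYY` and ISO `YYYY-MM-DD`. The `parse_stop_date` helper is an assumption for illustration and is not the notebook's method.

```python
from datetime import date, datetime

# The two representations observed in STOP_FRISK_DATE
DATE_FORMATS = ("%m/%d/%Y", "%Y-%m-%d")

def parse_stop_date(raw):
    """Parse a date string in either known format; return None if neither matches."""
    if raw is None:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None

parsed = [parse_stop_date(s) for s in ["2/5/2020", "2021-07-29", "not a date", None]]
print(parsed)
```

Listing the formats explicitly makes the month-first assumption visible, which matters for ambiguous values such as `4/7/2019`.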



In [39]:
table_name = 'SQF_DATA'
schema_name = 'SAFEGUARDING_NYC_SCHEMA_SILVER'
snowflake_helper.save_data_in_snowflake(session, schema_name, table_name, preprocessed_sqf_df, mode="overwrite")

[SUCCESS] Data saved successfully in SAFEGUARDING_NYC_SCHEMA_SILVER.SQF_DATA table in Snowflake!
