### Bronze to Silver Medallion Data Layer for Shooting Incidents Data (2006-2022)

We apply the following transformations and cleaning steps to produce the Silver Medallion Layer:
- Null value mapping: placeholder strings that stand in for missing data (ex: (null), Nan, nan, NONE, U, UNKNOWN) are converted to real nulls
- Dropping columns that are irrelevant to the analysis
- Date preprocessing: OCCUR_DATE is converted from MM/DD/YYYY to the standard date (YYYY-MM-DD) format
- Handling incorrect data: invalid age-group values (224, 940, 1020, 1022) are set to null
- Race mapping: ASIAN / PACIFIC ISLANDER values are normalized to ASIAN/PACIFIC ISLANDER
- Gender mapping: M and F are expanded to MALE and FEMALE

## Importing Libraries 

In [160]:
import sys
sys.path.append("..")

In [161]:
from snowflake.snowpark import Session, dataframe
import snowflake.snowpark.functions as F
from snowflake.snowpark.functions import when, col, to_date, count, date_format
from helpers import SnowflakeHelper
import json
import os

In [162]:
null_value_mapping = {
    "(null)": None,
    "Nan": None,
    "NONE": None,
    "nan": None,
    "U": None,
    "UNKNOWN": None
}

age_mapping = {
    "224": None, 
    "1020": None, 
    "940": None, 
    "1022": None
}

race_mapping = {
    "ASIAN / PACIFIC ISLANDER" : "ASIAN/PACIFIC ISLANDER"
}

gender_mapping = {
    "M": "MALE",
    "F": "FEMALE"
}

## Extracting Data from Snowflake

In [163]:
snowflake_helper = SnowflakeHelper()
snowflake_config = './../helpers/snowflake_config.json'
session = snowflake_helper.create_snowpark_session(snowflake_config)

[INFO] No schema passed, using default schema SAFEGUARDING_NYC_SCHEMA_BRONZE for the session
[SUCCESS] Config file loaded successfully!
[SUCCESS] Snowspark Session created successfully on schema SAFEGUARDING_NYC_SCHEMA_BRONZE!


In [164]:
shooting_data = session.table("SHOOTING_INCIDENTS")

In [165]:
shooting_data.show()

----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"_AIRBYTE_RAW_ID"                     |"_AIRBYTE_EXTRACTED_AT"    |"_AIRBYTE_META"  |"VIC_RACE"      |"OCCUR_TIME"  |"X_COORD_CD"  |"INCIDENT_KEY"  |"VIC_AGE_GROUP"  |"LOC_OF_OCCUR_DESC"  |"LATITUDE"   |"PERP_RACE"     |"Y_COORD_CD"  |"STATISTICAL_MURDER_FLAG"  |"LONGITUDE"   |"VIC_SEX"  |"BORO"     |"LON_LAT"                                      |"PERP_SEX"  |"LOCATION_DESC"            |"OCCUR_DATE"  |"PRECINCT"  |"LOC_CLASSFCTN_DESC"  |"PERP_AGE_GROUP"  |"JURISDICTION_CODE"  |
------------------------------

## Preprocessing Data using Snowpark

## Dropping Irrelevant Columns

In [166]:
# Drop Airbyte metadata columns and fields not needed for the analysis
columns_to_drop = [
    "_AIRBYTE_RAW_ID", "_AIRBYTE_EXTRACTED_AT", "_AIRBYTE_META",
    "X_COORD_CD", "Y_COORD_CD", "LON_LAT",
    "JURISDICTION_CODE", "LOC_OF_OCCUR_DESC", "LOC_CLASSFCTN_DESC",
]
shooting_data = shooting_data.drop(*columns_to_drop)
shooting_data.show()

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"VIC_RACE"      |"OCCUR_TIME"  |"INCIDENT_KEY"  |"VIC_AGE_GROUP"  |"LATITUDE"   |"PERP_RACE"     |"STATISTICAL_MURDER_FLAG"  |"LONGITUDE"   |"VIC_SEX"  |"BORO"     |"PERP_SEX"  |"LOCATION_DESC"            |"OCCUR_DATE"  |"PRECINCT"  |"PERP_AGE_GROUP"  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|BLACK           |19:13:00      |28832628        |<18              |40.70370632  |BLACK           |FALSE                      |-73.94325706  |F          |BROOKLYN   |M           |MULTI DWELL - PUBLIC HOUS  |05/04/2007    |90          |

In [167]:
shooting_data.printSchema()

root
 |-- "VIC_RACE": StringType(16777216) (nullable = True)
 |-- "OCCUR_TIME": StringType(16777216) (nullable = True)
 |-- "INCIDENT_KEY": StringType(16777216) (nullable = True)
 |-- "VIC_AGE_GROUP": StringType(16777216) (nullable = True)
 |-- "LATITUDE": StringType(16777216) (nullable = True)
 |-- "PERP_RACE": StringType(16777216) (nullable = True)
 |-- "STATISTICAL_MURDER_FLAG": StringType(16777216) (nullable = True)
 |-- "LONGITUDE": StringType(16777216) (nullable = True)
 |-- "VIC_SEX": StringType(16777216) (nullable = True)
 |-- "BORO": StringType(16777216) (nullable = True)
 |-- "PERP_SEX": StringType(16777216) (nullable = True)
 |-- "LOCATION_DESC": StringType(16777216) (nullable = True)
 |-- "OCCUR_DATE": StringType(16777216) (nullable = True)
 |-- "PRECINCT": StringType(16777216) (nullable = True)
 |-- "PERP_AGE_GROUP": StringType(16777216) (nullable = True)


## Formatting Date

In [168]:
# Parse the MM/DD/YYYY strings in OCCUR_DATE into DATE values; null inputs stay null
shooting_data = shooting_data.withColumn("OCCUR_DATE", when(col("OCCUR_DATE").isNotNull(), to_date(col("OCCUR_DATE"), 'MM/DD/YYYY')).otherwise(None))
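
The conversion above can be checked locally without a Snowflake session. The following is a minimal sketch in plain Python, assuming the raw values follow the MM/DD/YYYY pattern seen in the sample row; the helper name `to_iso_date` is hypothetical and is not part of the notebook or of Snowpark.

```python
from datetime import datetime
from typing import Optional


def to_iso_date(raw: Optional[str]) -> Optional[str]:
    """Convert an MM/DD/YYYY string to YYYY-MM-DD, passing None through."""
    if raw is None:
        return None
    # strptime raises ValueError on malformed input, which surfaces bad rows early
    return datetime.strptime(raw, "%m/%d/%Y").date().isoformat()


print(to_iso_date("05/04/2007"))  # 2007-05-04
print(to_iso_date(None))          # None
```

This mirrors the null handling of the Snowpark expression: a missing date stays missing rather than being coerced to a default.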

In [169]:
shooting_data.show()

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"VIC_RACE"      |"OCCUR_TIME"  |"INCIDENT_KEY"  |"VIC_AGE_GROUP"  |"LATITUDE"   |"PERP_RACE"     |"STATISTICAL_MURDER_FLAG"  |"LONGITUDE"   |"VIC_SEX"  |"BORO"     |"PERP_SEX"  |"LOCATION_DESC"            |"PRECINCT"  |"PERP_AGE_GROUP"  |"OCCUR_DATE"  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|BLACK           |19:13:00      |28832628        |<18              |40.70370632  |BLACK           |FALSE                      |-73.94325706  |F          |BROOKLYN   |M           |MULTI DWELL - PUBLIC HOUS  |90          |<18            

## Mapping

In [None]:
# Distinct values before mapping (used to spot placeholder and malformed entries)
check_columns = ["VIC_RACE", "VIC_AGE_GROUP", "PERP_RACE", "STATISTICAL_MURDER_FLAG", "VIC_SEX", "BORO", "PERP_SEX", "LOCATION_DESC", "PERP_AGE_GROUP"]

for column in check_columns:
    distinct_values = shooting_data.select(column).distinct()
    print(f"Distinct values in {column}:")
    distinct_values.show()


In [170]:
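# Null value mapping: replace placeholder strings with real NULLs in every
# column except OCCUR_DATE, which has already been parsed into a date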
no_date_column = ["VIC_RACE", "OCCUR_TIME", "INCIDENT_KEY", "VIC_AGE_GROUP", "LATITUDE", "PERP_RACE", "STATISTICAL_MURDER_FLAG", "LONGITUDE", "VIC_SEX", "BORO", "PERP_SEX", "LOCATION_DESC", "PRECINCT", "PERP_AGE_GROUP"]

for column in no_date_column:
    for key, value in null_value_mapping.items():
        shooting_data = shooting_data.withColumn(column, when(col(column) == key, value).otherwise(col(column)))
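
The nested loop above adds one `withColumn` call per column and placeholder pair. The same rule can be viewed as a single membership test per value. The sketch below illustrates that idea on a plain Python dict; the sample row and the helper name `nullify_placeholders` are hypothetical, and this is not the notebook's Snowpark code.

```python
# Placeholder strings taken from null_value_mapping in the notebook
NULL_TOKENS = {"(null)", "Nan", "NONE", "nan", "U", "UNKNOWN"}


def nullify_placeholders(row, columns):
    """Return a copy of row with placeholder strings in `columns` set to None."""
    return {
        name: None if name in columns and value in NULL_TOKENS else value
        for name, value in row.items()
    }


# Illustrative row, not real data from the table
row = {"VIC_SEX": "U", "PERP_RACE": "(null)", "BORO": "BROOKLYN", "OCCUR_DATE": "2007-05-04"}
cleaned = nullify_placeholders(row, columns={"VIC_SEX", "PERP_RACE", "BORO"})
print(cleaned)
```

Note that the match is exact and case-sensitive, which is why both `Nan` and `nan` appear in the mapping. In Snowpark, a comparable single-pass check could likely be written with `Column.isin`, though that variant has not been run against this table.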

In [171]:
shooting_data.show()

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"OCCUR_DATE"  |"VIC_RACE"      |"OCCUR_TIME"  |"INCIDENT_KEY"  |"VIC_AGE_GROUP"  |"LATITUDE"   |"PERP_RACE"     |"STATISTICAL_MURDER_FLAG"  |"LONGITUDE"   |"VIC_SEX"  |"BORO"     |"PERP_SEX"  |"LOCATION_DESC"            |"PRECINCT"  |"PERP_AGE_GROUP"  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2007-05-04    |BLACK           |19:13:00      |28832628        |<18              |40.70370632  |BLACK           |FALSE                      |-73.94325706  |F          |BROOKLYN   |M           |MULTI DWELL - PUBLIC HOUS  |90          |

In [172]:
# Age Mapping

age_columns = ["PERP_AGE_GROUP", "VIC_AGE_GROUP"]

for column in age_columns:
    for key, value in age_mapping.items():
        shooting_data = shooting_data.withColumn(column, when(col(column) == key, value).otherwise(col(column)))
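
The age mapping above is a block-list: it only nulls the four bad values that were observed. An allow-list is an alternative that also catches bad values not yet seen. The sketch below assumes the five brackets shown in the distinct-value output of this notebook are the complete set of valid groups; the helper name `clean_age_group` is hypothetical.

```python
# Brackets taken from the distinct VIC_AGE_GROUP values after cleaning
# (assumed to be the full set of valid age groups)
VALID_AGE_GROUPS = {"<18", "18-24", "25-44", "45-64", "65+"}


def clean_age_group(value):
    """Return the value if it is a recognised age bracket, otherwise None."""
    return value if value in VALID_AGE_GROUPS else None


samples = ["25-44", "1022", "940", "<18", None, "224"]
cleaned = [clean_age_group(v) for v in samples]
print(cleaned)  # ['25-44', None, None, '<18', None, None]
```

The trade-off is that an allow-list silently nulls any new legitimate bracket, so it should be reviewed whenever the source dataset changes its categories.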

In [173]:
# Race Mapping

race_columns = ["VIC_RACE", "PERP_RACE"]

for column in race_columns:
    for key, value in race_mapping.items():
        shooting_data = shooting_data.withColumn(column, when(col(column) == key, value).otherwise(col(column)))

In [174]:
# Gender Mapping

gender_columns = ["PERP_SEX", "VIC_SEX"]

for column in gender_columns:
    for key, value in gender_mapping.items():
        shooting_data = shooting_data.withColumn(column, when(col(column) == key, value).otherwise(col(column)))
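
The race and gender mappings share one pattern: replace a value when it matches a key, otherwise leave it unchanged. This is what `when(col(column) == key, value).otherwise(col(column))` expresses. A minimal plain-Python sketch of that pattern follows; the helper name `remap` is hypothetical.

```python
race_mapping = {"ASIAN / PACIFIC ISLANDER": "ASIAN/PACIFIC ISLANDER"}
gender_mapping = {"M": "MALE", "F": "FEMALE"}


def remap(value, mapping):
    """Return the mapped value if `value` is a key, else return it unchanged."""
    return mapping.get(value, value)


print(remap("M", gender_mapping))                       # MALE
print(remap("ASIAN / PACIFIC ISLANDER", race_mapping))  # ASIAN/PACIFIC ISLANDER
print(remap("BLACK", race_mapping))                     # BLACK
print(remap(None, gender_mapping))                      # None
```

Values that were already nulled by the earlier null mapping step pass through untouched, so the order of the mapping cells does not reintroduce placeholders.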

In [175]:
# Distinct Values of Preprocessed Data

for column in check_columns:
    distinct_values = shooting_data.select(column).distinct()
    print(f"Distinct values in {column}:")
    distinct_values.show() 

Distinct values in VIC_RACE:
----------------------------------
|"VIC_RACE"                      |
----------------------------------
|BLACK                           |
|WHITE                           |
|ASIAN/PACIFIC ISLANDER          |
|NULL                            |
|WHITE HISPANIC                  |
|BLACK HISPANIC                  |
|AMERICAN INDIAN/ALASKAN NATIVE  |
----------------------------------

Distinct values in VIC_AGE_GROUP:
-------------------
|"VIC_AGE_GROUP"  |
-------------------
|25-44            |
|<18              |
|45-64            |
|18-24            |
|NULL             |
|65+              |
-------------------

Distinct values in PERP_RACE:
----------------------------------
|"PERP_RACE"                     |
----------------------------------
|NULL                            |
|BLACK                           |
|ASIAN/PACIFIC ISLANDER          |
|WHITE                           |
|BLACK HISPANIC                  |
|WHITE HISPANIC                  |
|AMER

In [176]:
# Count of Distinct Values

for column in check_columns:
    distinct_count = shooting_data.select(column).distinct().count()
    print(f"Number of distinct values in {column}: {distinct_count}")

Number of distinct values in VIC_RACE: 7
Number of distinct values in VIC_AGE_GROUP: 6
Number of distinct values in PERP_RACE: 7
Number of distinct values in STATISTICAL_MURDER_FLAG: 2
Number of distinct values in VIC_SEX: 3
Number of distinct values in BORO: 5
Number of distinct values in PERP_SEX: 3
Number of distinct values in LOCATION_DESC: 39
Number of distinct values in PERP_AGE_GROUP: 6
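
These counts include NULL as one of the distinct values, because `select(column).distinct().count()` counts the rows of the distinct result rather than running SQL `COUNT(DISTINCT ...)`, which ignores nulls. For example, VIC_AGE_GROUP shows five brackets plus NULL, giving 6. The sketch below reproduces that arithmetic in plain Python with an illustrative list of values, not real rows from the table.

```python
# Illustrative column values; None stands in for SQL NULL
vic_age_group = ["25-44", "<18", None, "45-64", "18-24", "25-44", "65+", None]

distinct_with_null = set(vic_age_group)
distinct_without_null = distinct_with_null - {None}

print(len(distinct_with_null))     # 6, matches the count reported above
print(len(distinct_without_null))  # 5, the number of actual age brackets
```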


In [177]:
shooting_data.show()

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|"OCCUR_DATE"  |"OCCUR_TIME"  |"INCIDENT_KEY"  |"LATITUDE"   |"STATISTICAL_MURDER_FLAG"  |"LONGITUDE"   |"BORO"     |"LOCATION_DESC"            |"PRECINCT"  |"PERP_AGE_GROUP"  |"VIC_AGE_GROUP"  |"VIC_RACE"      |"PERP_RACE"     |"PERP_SEX"  |"VIC_SEX"  |
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|2007-05-04    |19:13:00      |28832628        |40.70370632  |FALSE                      |-73.94325706  |BROOKLYN   |MULTI DWELL - PUBLIC HOUS  |90          |<18               |<18              |BLACK           |BLACK           |MALE  

## Uploading Processed Data to Snowflake (Silver Layer)

In [178]:
# Uploading data to Silver Schema

table_name = "SAFEGUARDING_NYC_SCHEMA_SILVER.shooting_incidents"
shooting_data.write.saveAsTable(table_name, mode="overwrite")

In [179]:
# Writing the same processed data to the Gold schema
table_name = "SAFEGUARDING_NYC_SCHEMA_GOLD.shooting_incidents"
shooting_data.write.saveAsTable(table_name, mode="overwrite")