### Note on what preprocessing should be done
Refer to the Inference and TO DO sections in the `01_initial_exploratory_analysis.ipynb` file.
1) Date format conversion
2) Age Column Cleaning
3) Removing unwanted Columns
4) Check for cleaning on `state`, `city_or_county` and `address`
5) Major cleaning required for the fields `gun_stolen`, `gun_type`, `participant_age`, `participant_age_group`, `participant_gender`, `participant_name`, `participant_status` and `participant_type`
6) Clean text data in `incident_characteristics` and `notes`
7) Change data types too

And generate visualizations after cleaning too!

Leave encoding out!
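
As a rough illustration of the cleaning in item 5, here is a minimal sketch in plain Python. It assumes the participant fields are packed as `index::value` pairs separated by `||` (for example `0::25||1::31`). That layout is an assumption about the raw CSV and is not shown in this notebook, and `parse_indexed_field` is a hypothetical helper name.

```python
def parse_indexed_field(raw, pair_sep="||", kv_sep="::"):
    """Parse 'index::value' pairs into {index: value}.

    The 'index::value||index::value' layout is an assumption about the raw data.
    """
    if raw is None or raw.strip() == "":
        return {}
    parsed = {}
    for pair in raw.split(pair_sep):
        index, sep, value = pair.partition(kv_sep)
        # Skip malformed pairs instead of raising, since the raw fields are messy
        if not sep or not index.strip().isdigit():
            continue
        parsed[int(index)] = value.strip()
    return parsed


ages = parse_indexed_field("0::25||1::31")
print(ages)
```

Returning a dictionary keyed by participant index would make it possible to line up age, gender, status and type for the same participant, if the fields do share that index.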

In [46]:
import warnings
warnings.filterwarnings("ignore")

In [50]:
# Importing packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [53]:
%matplotlib inline
plt.style.use('bmh')

In [56]:
# Project-local packages: add the parent directory to sys.path so `scripts` can be imported
import sys
import os

sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))
from scripts.visualizations import Visualization

In [151]:
# pyspark packages
from pyspark.sql import SparkSession
# Note: this `sum` is pyspark.sql.functions.sum and shadows Python's built-in sum in this notebook
from pyspark.sql.functions import col, sum, desc, explode, split, year, month, dayofweek
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, FloatType, BooleanType, DateType, DoubleType

### Setting Spark Session and Loading Data

In [83]:
spark = SparkSession.builder \
    .appName("MIS548 Project PreProcessing") \
    .config("spark.sql.debug.maxToStringFields", "1000") \
    .getOrCreate()

spark

In [85]:
# creating data schema
ip_data_schema = StructType([
    StructField("incident_id", IntegerType(), True),
    StructField("date", DateType(), True),
    StructField("state", StringType(), True),
    StructField("city_or_county", StringType(), True),
    StructField("address", StringType(), True),
    StructField("n_killed", IntegerType(), True),
    StructField("n_injured", IntegerType(), True),
    StructField("incident_url", StringType(), True),
    StructField("source_url", StringType(), True),
    StructField("incident_url_fields_missing", BooleanType(), True),
    StructField("congressional_district", IntegerType(), True),
    StructField("gun_stolen", StringType(), True),
    StructField("gun_type", StringType(), True),
    StructField("incident_characteristics", StringType(), True),
    StructField("latitude", DoubleType(), True),
    StructField("location_description", StringType(), True),
    StructField("longitude", DoubleType(), True),
    StructField("n_guns_involved", IntegerType(), True),
    StructField("notes", StringType(), True),
    StructField("participant_age", StringType(), True),
    StructField("participant_age_group", StringType(), True),
    StructField("participant_gender", StringType(), True),
    StructField("participant_name", StringType(), True),
    StructField("participant_relationship", StringType(), True),
    StructField("participant_status", StringType(), True),
    StructField("participant_type", StringType(), True),
    StructField("sources", StringType(), True),
    StructField("state_house_district", IntegerType(), True),
    StructField("state_senate_district", IntegerType(), True)
])

In [88]:
# An explicit schema is passed to .csv() below, so inferSchema is not needed
ip_data = spark.read.option("header", "True") \
                .option("quote", '"') \
                .option("escape", '"') \
                .option("sep", ",") \
                .option("ignoreLeadingWhiteSpace", "True") \
                .option("ignoreTrailingWhiteSpace", "True") \
                .option("multiLine", "True") \
                .option("mode", "PERMISSIVE") \
                .csv("../data/gun-violence-data_01-2013_03-2018.csv", schema = ip_data_schema)

In [91]:
print(f"Number of records in the data : {ip_data.count()}")
print(f"Number of columns: {len(ip_data.columns)}")

Number of records in the data : 239677
Number of columns: 29

In [115]:
ip_data.printSchema()

root
 |-- incident_id: integer (nullable = true)
 |-- date: date (nullable = true)
 |-- state: string (nullable = true)
 |-- city_or_county: string (nullable = true)
 |-- address: string (nullable = true)
 |-- n_killed: integer (nullable = true)
 |-- n_injured: integer (nullable = true)
 |-- incident_url: string (nullable = true)
 |-- source_url: string (nullable = true)
 |-- incident_url_fields_missing: boolean (nullable = true)
 |-- congressional_district: integer (nullable = true)
 |-- gun_stolen: string (nullable = true)
 |-- gun_type: string (nullable = true)
 |-- incident_characteristics: string (nullable = true)
 |-- latitude: double (nullable = true)
 |-- location_description: string (nullable = true)
 |-- longitude: double (nullable = true)
 |-- n_guns_involved: integer (nullable = true)
 |-- notes: string (nullable = true)
 |-- participant_age: string (nullable = true)
 |-- participant_age_group: string (nullable = true)
 |-- participant_gender: string (nullable = true)
 |-- 

### Preprocessing

#### Duplicate Check

In [106]:
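# checking for duplicates
def check_duplicates_except(df, column_to_exclude=""):
    """Return groups of rows that are identical on every column except `column_to_exclude`."""
    # Use `c` rather than `col` so the loop variable is not confused with pyspark's `col`
    columns_to_check = [c for c in df.columns if c != column_to_exclude]

    # Any group with count > 1 is a set of duplicated rows
    df_duplicates = df.groupBy(columns_to_check).count().filter("count > 1")

    return df_duplicates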

In [101]:
ip_data_dup_chk = check_duplicates_except(ip_data)

print(f"Number of Duplicate Rows: {ip_data_dup_chk.count()}")

Number of Duplicate Rows: 0

In [104]:
# Drop duplicates if there are any
# ip_data = ip_data.dropDuplicates()

# print("DataFrame after dropping duplicates:")
# ip_data.count()

#### Null Values Check

In [118]:
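def get_null_counts(df):
    """Return one row per column with its null count and null percentage, highest first."""
    total_rows = df.count()

    # One wide row of per-column null counts (`sum` and `col` are the pyspark functions)
    null_counts = df.select([sum(col(c).isNull().cast('int')).alias(c) for c in df.columns])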

    narrow_null_counts = null_counts.selectExpr(
                                    f"'{null_counts.columns[0]}' as column_name",
                                    f"{null_counts.columns[0]} as null_count",
                                    f"({null_counts.columns[0]} / {total_rows} * 100) as null_percentage")

    for c in null_counts.columns[1:]:
        next_col = null_counts.selectExpr(f"'{c}' as column_name", 
                                          f"{c} as null_count",
                                          f"({c} / {total_rows} * 100) as null_percentage")
        narrow_null_counts = narrow_null_counts.union(next_col)
    
    narrow_null_counts = narrow_null_counts.orderBy(desc("null_count"))
    
    return narrow_null_counts

In [121]:
narrow_null_counts = get_null_counts(ip_data)
narrow_null_counts.show(n=29, truncate=False)

+---------------------------+----------+-------------------+
|column_name                |null_count|null_percentage    |
+---------------------------+----------+-------------------+
|participant_relationship   |223903    |93.4186425898188   |
|location_description       |197588    |82.43928286819344  |
|participant_name           |122253    |51.00739745574252  |
|gun_stolen                 |99498     |41.51337007722894  |
|gun_type                   |99451     |41.493760352474375 |
|n_guns_involved            |99451     |41.493760352474375 |
|participant_age            |92298     |38.509327136104005 |
|notes                      |81017     |33.80257596682201  |
|participant_age_group      |42119     |17.573233977394576 |
|state_house_district       |38772     |16.17677123795775  |
|participant_gender         |36362     |15.171251309053435 |
|state_senate_district      |32335     |13.49107340295481  |
|participant_status         |27626     |11.526345873821851 |
|participant_type       

Some columns add little to our analysis, so we drop them to speed up processing.

We might also drop `participant_age_group`, `state_house_district`, `state_senate_district` and `participant_name` later.
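
One way to make part of the drop decision repeatable is to flag columns whose null share exceeds a cutoff. The sketch below is plain Python and uses a few percentages copied (rounded) from the null-count table above. The 80% cutoff and the name `columns_above_null_threshold` are illustrative assumptions, not the rule used in this notebook.

```python
# Null percentages copied (rounded) from the null-count output above
null_pct = {
    "participant_relationship": 93.42,
    "location_description": 82.44,
    "participant_name": 51.01,
    "gun_stolen": 41.51,
    "participant_age": 38.51,
}


def columns_above_null_threshold(null_pct, threshold):
    """Return column names whose null percentage is strictly above `threshold`, worst first."""
    flagged = [name for name, pct in null_pct.items() if pct > threshold]
    return sorted(flagged, key=lambda name: null_pct[name], reverse=True)


# 80.0 is an assumed cutoff chosen only for illustration
print(columns_above_null_threshold(null_pct, 80.0))
```

A cutoff like this only covers missingness. Columns that are well populated but irrelevant to the analysis still have to be listed by hand.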

In [134]:
trivial_columns = ["participant_relationship", "location_description", "sources", "source_url", 
                   "incident_url", "incident_url_fields_missing"]

In [136]:
ip_data = ip_data.drop(*trivial_columns)

For missing data, our plan is to impute. However, some tree-based models such as Decision Trees, Random Forest and XGBoost can handle missing values, depending on the implementation.

My plan is to use different versions of the data and compare model performance, so I will do the imputation after all the other preprocessing is done. Let's see how it goes.
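
The idea of keeping several versions of the data can be sketched without Spark. The example below assumes median imputation for a numeric column. The choice of median, the toy values and the helper name `impute_median` are all illustrative assumptions, not decisions made in this notebook.

```python
from statistics import median


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # no observed values to compute a median from
    fill = median(observed)
    return [fill if v is None else v for v in values]


# Toy values for illustration only, not taken from the dataset
n_guns_raw = [1, None, 2, None, 1]

variants = {
    "raw": n_guns_raw,                             # for models that accept missing values
    "median_imputed": impute_median(n_guns_raw),   # for models that do not
}
print(variants)
```

Each variant could then be fed to the same model so that the effect of imputation is compared on equal terms.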

#### New Date Features

In [157]:
ip_data = ip_data.withColumn("year", year("date")) \
                .withColumn("month", month("date")) \
                .withColumn("day_of_week", dayofweek("date"))

In [159]:
ip_data.select("date", "year", "month", "day_of_week").show(5)

+----------+----+-----+-----------+
|      date|year|month|day_of_week|
+----------+----+-----+-----------+
|2013-01-01|2013|    1|          3|
|2013-01-01|2013|    1|          3|
|2013-01-01|2013|    1|          3|
|2013-01-05|2013|    1|          7|
|2013-01-07|2013|    1|          2|
+----------+----+-----+-----------+
only showing top 5 rows
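
Spark's `dayofweek` numbers days from 1 (Sunday) to 7 (Saturday), so the 3 shown for 2013-01-01 is a Tuesday. Below is a small standard-library cross-check of the values above; the helper name `spark_style_dayofweek` is hypothetical.

```python
from datetime import date

DAY_NAMES = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"]


def spark_style_dayofweek(d):
    """Map a date to 1 (Sunday) .. 7 (Saturday), the convention Spark's dayofweek uses."""
    # isoweekday() is 1 (Monday) .. 7 (Sunday); rotate so that Sunday becomes 1
    return d.isoweekday() % 7 + 1


for d in (date(2013, 1, 1), date(2013, 1, 5), date(2013, 1, 7)):
    n = spark_style_dayofweek(d)
    print(d.isoformat(), n, DAY_NAMES[n - 1])
```

Keeping this convention in mind matters when labelling plots, since Python's own `weekday()` starts at 0 for Monday.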

