# CHI crime: An Analysis of Homicides and Non-Fatal Shootings in Chicago

Much has been written about the frequency of shootings and homicides in Chicago. Most analyses and attempted interventions have focused on reducing the overall prevalence of guns and the raw number of shootings, with only modest success. The question I will address instead is: what can we learn by analyzing the nature of the shootings that have occurred over the past few decades? Can we build a good predictive model that will tell us **who** is likely to be killed in a shooting, and **where**? If so, we could imagine interventions that specifically address the lethality of shootings, as opposed to their prevalence.

My project aims to use a Decision Tree Classifier to create a model that predicts whether or not someone victimized in a shooting will be killed.
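As a rough sketch of the modeling step I have in mind, the cell below fits a `DecisionTreeClassifier` on purely synthetic stand-in data. Nothing here comes from the real dataset: the feature `AGE_BIN`, the random labels, and the hyperparameters (`max_depth=3`, a 75/25 split) are all assumptions chosen only to show the shape of the workflow.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (NOT from the Chicago dataset).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "HOUR": rng.integers(0, 24, size=n),    # hour of day, 0-23
    "AGE_BIN": rng.integers(0, 8, size=n),  # hypothetical integer-coded age bin
})
y = rng.integers(0, 2, size=n)              # random labels: 1 = fatal, 0 = non-fatal

# Hold out a quarter of the rows for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# A shallow tree keeps the model easy to inspect.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

pred = clf.predict(X_test)
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy on synthetic data: {accuracy:.2f}")
```

Because the labels are random, the accuracy printed here says nothing about the real problem; the point is only the fit/predict/score pattern.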

To train and test my model, I will use a dataset containing fatal and non-fatal shooting victimizations in the City of Chicago from 1991 to the present day (March 2025).

Data source:
City of Chicago. (2025). *Violence Reduction - Victims of Homicides and Non-Fatal Shootings* (Updated on February 24, 2025) [Data set]. https://data.cityofchicago.org/Public-Safety/Violence-Reduction-Victims-of-Homicides-and-Non-Fa/gumc-mgzr/about_data 

In [8]:
import numpy as np
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

In [9]:
df = pd.read_csv("Violence_Reduction_-_Victims_of_Homicides_and_Non-Fatal_Shootings.csv")



First, I'm going to take a high-level look at the data:

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61140 entries, 0 to 61139
Data columns (total 38 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   CASE_NUMBER                   61140 non-null  object 
 1   DATE                          61140 non-null  object 
 2   BLOCK                         61140 non-null  object 
 3   VICTIMIZATION_PRIMARY         61140 non-null  object 
 4   INCIDENT_PRIMARY              61140 non-null  object 
 5   GUNSHOT_INJURY_I              61140 non-null  object 
 6   UNIQUE_ID                     61140 non-null  object 
 7   ZIP_CODE                      61136 non-null  float64
 8   WARD                          61136 non-null  float64
 9   COMMUNITY_AREA                61136 non-null  object 
 10  STREET_OUTREACH_ORGANIZATION  44630 non-null  object 
 11  AREA                          61136 non-null  float64
 12  DISTRICT                      61136 non-null  float64
 13  B

In [33]:
df.head()

Unnamed: 0,CASE_NUMBER,DATE,BLOCK,VICTIMIZATION_PRIMARY,INCIDENT_PRIMARY,GUNSHOT_INJURY_I,UNIQUE_ID,ZIP_CODE,WARD,COMMUNITY_AREA,...,MONTH,DAY_OF_WEEK,HOUR,LOCATION_DESCRIPTION,STATE_HOUSE_DISTRICT,STATE_SENATE_DISTRICT,UPDATED,LATITUDE,LONGITUDE,LOCATION
0,JF167335,03/08/2022 03:27:00 PM,6000 N KENMORE AVE,HOMICIDE,HOMICIDE,NO,HOM-JF167335-#1,60660.0,48.0,EDGEWATER,...,3,3,15,APARTMENT,14.0,7.0,02/24/2023 05:55:43 AM,41.99057,-87.657,POINT (-87.657 41.9905705)
1,JG148375,02/11/2023 02:30:00 AM,8400 S WABASH AVE,HOMICIDE,HOMICIDE,YES,HOM-JG148375-#1,60619.0,6.0,CHATHAM,...,2,7,2,ALLEY,34.0,17.0,02/12/2023 05:09:11 AM,41.7399,-87.62286,POINT (-87.62286 41.7399005)
2,JD438266,11/21/2020 10:15:00 PM,7900 S BRANDON AVE,HOMICIDE,HOMICIDE,YES,HOM-JD438266-#1,60617.0,7.0,SOUTH CHICAGO,...,11,7,22,STREET,25.0,13.0,01/30/2025 05:49:40 AM,41.750567,-87.547249,POINT (-87.547249058699 41.750566904142)
3,JH317789,06/23/2024 08:11:00 AM,12300 S HALSTED ST,HOMICIDE,HOMICIDE,YES,HOM-JH317789-#1,60628.0,9.0,WEST PULLMAN,...,6,1,8,STREET,28.0,14.0,01/30/2025 05:40:06 AM,41.670653,-87.641779,POINT (-87.641779058699 41.670653095858)
4,JH317789,06/23/2024 08:11:00 AM,12300 S HALSTED ST,HOMICIDE,HOMICIDE,YES,HOM-JH317789-#2,60628.0,9.0,WEST PULLMAN,...,6,1,8,STREET,28.0,14.0,01/30/2025 05:40:14 AM,41.670653,-87.641779,POINT (-87.641779058699 41.670653095858)


The dataset has 61,140 rows and 38 columns, in a file that's 17.7 MB.

Several aspects of the data suggest cleaning and prepping work ahead:

1. Most significantly, there's no clear target column. Fatal shootings are indicated by a value of "HOMICIDE" in the VICTIMIZATION_PRIMARY column alongside a GUNSHOT_INJURY_I value of "YES." Non-fatal shootings are indicated by any other value in VICTIMIZATION_PRIMARY. I will want to look at just the shooting incidents and give a binary answer to "was this a homicide?" in a new column. Then I can delete all the redundant and extraneous crime classification columns.

2. The fact that so many columns are of type "object" is going to be problematic. Some of these should be ordinal categoricals (such as "age"), while others, such as "sex," are non-ordinal. I need to change the dtypes to make the data more useful for analysis.

3. I got a warning when importing the data that columns 25-27 have mixed data types. Since those columns contain the victims' names, I think I can safely delete them. Furthermore, there is a lot of redundant location information in columns that I can probably ignore.
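The three cleaning steps above could look roughly like the sketch below. It runs on a handful of invented rows, not the real file. The column names match the dataset, but the values (including the "BATTERY" and "ROBBERY" labels and the age-bin strings), the `FATAL` column name, and the `prepare` helper are my own assumptions for illustration.

```python
import pandas as pd

# Toy rows that mimic a few columns of the real file.
# The values are invented and are NOT taken from the dataset.
toy = pd.DataFrame({
    "VICTIMIZATION_PRIMARY": ["HOMICIDE", "HOMICIDE", "BATTERY", "ROBBERY"],
    "GUNSHOT_INJURY_I": ["YES", "NO", "YES", "YES"],
    "AGE": ["20-29", "30-39", "0-19", "20-29"],   # assumed bin labels
    "SEX": ["M", "F", "M", "M"],
    "HOMICIDE_VICTIM_FIRST_NAME": ["A", "B", None, None],
})

# Assumed ordering of the age bins, youngest to oldest.
AGE_ORDER = ["0-19", "20-29", "30-39"]


def prepare(df, age_order):
    """Hypothetical helper: filter to shootings, add a target, fix dtypes."""
    # Step 1: keep only shooting victimizations and build a binary target.
    out = df[df["GUNSHOT_INJURY_I"] == "YES"].copy()
    out["FATAL"] = (out["VICTIMIZATION_PRIMARY"] == "HOMICIDE").astype(int)

    # Step 2: ordered categorical for AGE, unordered categorical for SEX.
    out["AGE"] = pd.Categorical(out["AGE"], categories=age_order, ordered=True)
    out["SEX"] = out["SEX"].astype("category")

    # Step 3: drop the victim-name columns and the now-redundant
    # classification columns used to derive the target.
    name_cols = [c for c in out.columns if c.startswith("HOMICIDE_VICTIM_")]
    drop_cols = name_cols + ["VICTIMIZATION_PRIMARY", "GUNSHOT_INJURY_I"]
    return out.drop(columns=drop_cols).reset_index(drop=True)


clean = prepare(toy, AGE_ORDER)
print(clean)
print(clean.dtypes)
```

Making `AGE` an ordered categorical means comparisons like `clean["AGE"] < "30-39"` work, which plain strings would not support reliably. On the real data, the age-bin list would need to be read off `df["AGE"].unique()` rather than assumed.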

In [34]:
df.isnull().sum()

CASE_NUMBER                         0
DATE                                0
BLOCK                               0
VICTIMIZATION_PRIMARY               0
INCIDENT_PRIMARY                    0
GUNSHOT_INJURY_I                    0
UNIQUE_ID                           0
ZIP_CODE                            4
WARD                                4
COMMUNITY_AREA                      4
STREET_OUTREACH_ORGANIZATION    16510
AREA                                4
DISTRICT                            4
BEAT                                4
AGE                                 0
SEX                                 0
RACE                                0
VICTIMIZATION_FBI_CD              332
INCIDENT_FBI_CD                     5
VICTIMIZATION_FBI_DESCR           335
INCIDENT_FBI_DESCR                  5
VICTIMIZATION_IUCR_CD               0
INCIDENT_IUCR_CD                    0
VICTIMIZATION_IUCR_SECONDARY     8431
INCIDENT_IUCR_SECONDARY          8114
HOMICIDE_VICTIM_FIRST_NAME      39579
HOMICIDE_VIC

Checking for null entries confirms my original 

In [35]:
# Define Figure Size
# plt.figure(figsize=(12, 6))

# # 1. Age Distribution
# sns.histplot(df['AGE'], bins=20, kde=False)
# plt.title("Age Distribution")
# plt.show()

In [36]:
# plt.figure(figsize=(12, 8))
# sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm", fmt=".2f")  # numeric_only avoids errors on object columns
# plt.title("Correlation Heatmap")
# plt.show()