# Analysis of Crime Incidents in Buffalo: Trends, Patterns, and Predictive Insights

# Problem Description:

Crime prevention is a critical issue for city management, law enforcement, and public safety. This project seeks to analyze historical crime data from Buffalo to uncover trends and patterns, which could be instrumental in understanding crime hotspots, types of incidents, and time-based crime occurrences.

# The objective is to answer the following key questions:

1. What are the most frequent types of crimes in Buffalo, and how are they distributed across different neighborhoods?
2. Are there specific days or hours when certain types of crime are more likely to occur?
3. Can crime trends over the years reveal increasing or decreasing patterns in specific types of incidents?

# Background and Significance:

The dataset contains details of crime incidents in Buffalo, with information including the type of crime, location, and time of occurrence. Understanding crime distribution and trends helps law enforcement agencies allocate resources more effectively, improve public safety measures, and possibly anticipate future incidents through predictive analysis.

# Potential Impact:

This analysis will provide valuable insights into crime patterns in Buffalo, allowing law enforcement to:

- Target resources toward high-crime areas.
- Identify periods of increased crime risk.
- Formulate proactive measures for crime reduction.

Additionally, the project could inform local governance on community-based interventions and strategies to promote safer neighborhoods.

In [199]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [200]:
crime_data = pd.read_csv('Crime_Incidents_20241019.csv')

In [201]:
crime_data['Incident Datetime'] = pd.to_datetime(crime_data['Incident Datetime'], errors='coerce')

Here we drop the case identifier and the redundant census geography columns, which are not used in the analysis.

In [202]:
columns_to_drop = ['Case Number', '2010 Census Block Group', '2010 Census Block', 'TRACTCE20', 'GEOID20_tract', 'GEOID20_blockgroup', 'GEOID20_block']
crime_data = crime_data.drop(columns=columns_to_drop)

Here we lowercase the values in the `'Incident Type Primary'` and `'Parent Incident Type'` columns of the `crime_data` DataFrame, so that the same category is not split across different capitalizations during analysis or comparison.

In [203]:
crime_data['Incident Type Primary'] = crime_data['Incident Type Primary'].str.lower()
crime_data['Parent Incident Type'] = crime_data['Parent Incident Type'].str.lower()

In [204]:
crime_data['Year'] = crime_data['Incident Datetime'].dt.year
crime_data['Month'] = crime_data['Incident Datetime'].dt.month
crime_data['Hour'] = crime_data['Incident Datetime'].dt.hour

In [205]:
crime_data = crime_data.drop(columns=['Incident ID'])

This code removes columns from the `crime_data` DataFrame that contain only `NaN` values (`dropna` with `axis=1` and `how='all'`). The updated DataFrame is then displayed.

In [206]:
crime_data = crime_data.dropna(axis=1, how='all')
crime_data

Unnamed: 0,Incident Datetime,Incident Type Primary,Incident Description,Parent Incident Type,Hour of Day,Day of Week,Address,City,State,Location,...,Council District,Council District 2011,Census Tract,Census Block Group,Census Block,2010 Census Tract,Police District,Year,Month,Hour
0,2016-06-14 01:20:00,assault,ASSAULT,assault,1,Tuesday,E AMHERST ST & E AMHERST ST,Buffalo,NY,POINT (-78.889 42.938),...,NORTH,NORTH,55,2,2003,55,District D,2016,6,1
1,2016-12-13 05:00:00,larceny/theft,LARCENY/THEFT,theft,5,Tuesday,1000 Block E LOVEJOY ST,Buffalo,NY,POINT (-78.809 42.889),...,LOVEJOY,LOVEJOY,23,4,4001,23,District C,2016,12,5
2,2020-07-19 03:09:00,assault,Buffalo Police are investigating this report o...,assault,3,Sunday,GRIDER ST & KENSINGTON WB,Buffalo,NY,,...,,,,,,,,2020,7,3
3,2014-11-17 08:08:00,larceny/theft,LARCENY/THEFT,theft,8,Monday,2100 Block ELMWOOD AV,Buffalo,NY,POINT (-78.879 42.954),...,NORTH,NORTH,56,2,2007,56,District D,2014,11,8
4,2015-04-20 10:22:00,larceny/theft,LARCENY/THEFT,theft,10,Monday,2100 Block ELMWOOD AV,Buffalo,NY,POINT (-78.879 42.954),...,NORTH,NORTH,56,2,2007,56,District D,2015,4,10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
315183,2024-10-08 19:00:00,uuv,Buffalo Police are investigating this report o...,theft of vehicle,7,Wednesday,1700 Block JEFFERSON AV,Buffalo,NY,POINT (-78.854 42.922),...,MASTEN,MASTEN,33.01,1,1001,33.01,District E,2024,10,19
315184,2024-10-03 14:30:00,uuv,Buffalo Police are investigating this report o...,theft of vehicle,12,Thursday,800 Block FILLMORE AV,Buffalo,NY,POINT (-78.84 42.897),...,FILLMORE,FILLMORE,166,1,1000,166,District C,2024,10,14
315185,2024-10-17 22:40:31,burglary,Buffalo Police are investigating this report o...,breaking & entering,22,Thursday,1900 Block FILLMORE AV,Buffalo,NY,POINT (-78.84 42.928),...,MASTEN,MASTEN,40.02,1,1014,40.01,District E,2024,10,22
315186,2024-10-10 00:37:50,uuv,Buffalo Police are investigating this report o...,theft of vehicle,0,Thursday,0 Block BEST ST,Buffalo,NY,,...,,,,,,,,2024,10,0
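
To make the `how='all'` behaviour concrete, here is a minimal sketch on a toy DataFrame (the column names `a`, `b`, `c` are made up for illustration and are not part of the crime dataset). It contrasts `how='all'` with the default `how='any'`.

```python
import numpy as np
import pandas as pd

# Toy frame: column "b" is entirely missing, column "c" is only partly missing.
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [np.nan, np.nan, np.nan],
    "c": [np.nan, 5.0, 6.0],
})

# how='all' drops a column only when every value in it is NaN.
all_nan_dropped = toy.dropna(axis=1, how="all")

# how='any' (the default) would also drop the partly filled column "c".
any_nan_dropped = toy.dropna(axis=1, how="any")

print(list(all_nan_dropped.columns))
print(list(any_nan_dropped.columns))
```

Using `how='all'` keeps partly filled columns, so nothing is lost that could still be imputed or filtered later.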


# 1. Fill missing values in the `zip_code` column

In [207]:
crime_data['zip_code'] = crime_data['zip_code'].fillna(0)

# 2. Correct inconsistent categorical values in 'Day of Week'

First, convert all values to lowercase.

In [208]:
crime_data['Day of Week'] = crime_data['Day of Week'].str.lower()

# 3. Split the 'Location' column into 'Latitude' and 'Longitude'

The location is in the format 'POINT (longitude latitude)', so the first captured number is the longitude and the second is the latitude.

In [209]:
crime_data[['Longitude', 'Latitude']] = crime_data['Location'].str.extract(r'POINT \((-?\d+\.\d+) (-?\d+\.\d+)\)')
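
As a small illustration of the extraction step, the sketch below applies the same regular expression to a few toy values (the sample strings follow the layout seen in the dataset; the variable names are illustrative). It also shows that `str.extract` returns strings, so a numeric conversion is needed before any range comparison.

```python
import pandas as pd

# Toy values in the same 'POINT (longitude latitude)' layout as the 'Location' column.
locations = pd.Series(["POINT (-78.889 42.938)", "POINT (-78.84 42.897)", None])

# Two capture groups -> two columns; rows that do not match become NaN.
coords = locations.str.extract(r"POINT \((-?\d+\.\d+) (-?\d+\.\d+)\)")
coords.columns = ["Longitude", "Latitude"]

# str.extract returns strings, so convert before doing numeric comparisons.
coords = coords.apply(pd.to_numeric, errors="coerce")
print(coords)
```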

# 4. Remove duplicate rows

In [210]:
crime_data = crime_data.drop_duplicates()

In [211]:
crime_data = crime_data.copy()  # Create a deep copy to avoid the warning
crime_data.rename(columns={'Incident Type Primary': 'Primary Incident Type'}, inplace=True)


In [212]:
# Display the first few rows after all cleaning steps
crime_data.head()

Unnamed: 0,Incident Datetime,Primary Incident Type,Incident Description,Parent Incident Type,Hour of Day,Day of Week,Address,City,State,Location,...,Council District,Council District 2011,Census Tract,Census Block Group,Census Block,2010 Census Tract,Police District,Year,Month,Hour
0,2016-06-14 01:20:00,assault,ASSAULT,assault,1,tuesday,E AMHERST ST & E AMHERST ST,Buffalo,NY,POINT (-78.889 42.938),...,NORTH,NORTH,55.0,2.0,2003.0,55.0,District D,2016,6,1
1,2016-12-13 05:00:00,larceny/theft,LARCENY/THEFT,theft,5,tuesday,1000 Block E LOVEJOY ST,Buffalo,NY,POINT (-78.809 42.889),...,LOVEJOY,LOVEJOY,23.0,4.0,4001.0,23.0,District C,2016,12,5
2,2020-07-19 03:09:00,assault,Buffalo Police are investigating this report o...,assault,3,sunday,GRIDER ST & KENSINGTON WB,Buffalo,NY,,...,,,,,,,,2020,7,3
3,2014-11-17 08:08:00,larceny/theft,LARCENY/THEFT,theft,8,monday,2100 Block ELMWOOD AV,Buffalo,NY,POINT (-78.879 42.954),...,NORTH,NORTH,56.0,2.0,2007.0,56.0,District D,2014,11,8
4,2015-04-20 10:22:00,larceny/theft,LARCENY/THEFT,theft,10,monday,2100 Block ELMWOOD AV,Buffalo,NY,POINT (-78.879 42.954),...,NORTH,NORTH,56.0,2.0,2007.0,56.0,District D,2015,4,10


In [213]:
crime_data.loc[:, 'Day of Week'] = crime_data['Day of Week'].str.lower()
crime_data['Incident Description'] = crime_data['Incident Description'].str.lower()
crime_data.loc[:, ['Longitude', 'Latitude']] = crime_data['Location'].str.extract(r'POINT \((-?\d+\.\d+) (-?\d+\.\d+)\)')

In [214]:
# 4. Remove duplicate rows
crime_data = crime_data.drop_duplicates()

In [215]:
##crime_data = crime_data.dropna()
crime_data['zip_code'] = crime_data['zip_code'].fillna(0)

The process cleans and standardizes the `'Incident Description'` column by:

1. Replacing repetitive phrases like "buffalo police are investigating" with "investigation pending."
2. Standardizing shorthand terms such as "aggr assault" to "aggravated assault."
3. Removing extra whitespace.

It then prints the unique values in the column after cleaning.


In [216]:
crime_data['Incident Description'] = crime_data['Incident Description'].replace(
    to_replace=r'.*buffalo police are investigating.*', value='investigation pending', regex=True)

crime_data['Incident Description'] = crime_data['Incident Description'].replace({
    'aggr assault': 'aggravated assault',
    'agg assault on p/officer': 'aggravated assault on officer'
})

crime_data['Incident Description'] = crime_data['Incident Description'].str.strip()

for description in crime_data['Incident Description'].unique():
    print(description)

assault
larceny/theft
investigation pending
burglary
sexual abuse
rape
uuv
robbery
aggravated assault
crim negligent homicide
theft of services
murder
aggravated assault on officer
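
The two kinds of replacement used in this step behave differently, which the following sketch tries to show on toy descriptions (the sample strings are illustrative). With `regex=True`, a pattern wrapped in `.*` replaces the whole value; a dict passed to `Series.replace` only matches whole values exactly, so stray whitespace can prevent a match unless it is stripped first.

```python
import pandas as pd

# Toy descriptions, already lowercased; note the trailing space in the last one.
desc = pd.Series([
    "buffalo police are investigating this report of a crime",
    "aggr assault",
    "aggr assault ",
])

# regex=True: any value containing the phrase is replaced as a whole,
# because the pattern is wrapped in '.*' on both sides.
desc = desc.replace(to_replace=r".*buffalo police are investigating.*",
                    value="investigation pending", regex=True)

# A dict passed to Series.replace matches whole values exactly,
# so strip whitespace first or 'aggr assault ' would be missed.
desc = desc.str.strip().replace({"aggr assault": "aggravated assault"})

print(desc.tolist())
```

In the notebook the `str.strip()` call comes after the dict replacement; that works for this dataset because the shorthand values carry no extra whitespace, but stripping first is the more defensive order.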


In [217]:
crime_data = crime_data.drop(columns=[
    'Address', 'City', 'State', 'Latitude', 'Longitude', 'Created At',
    'Council District 2011', 'Year', 'Month', 'Hour', 'zip_code', 'Council District',
])

crime_data = crime_data.dropna(subset=['Location'])
crime_data.columns

Index(['Incident Datetime', 'Primary Incident Type', 'Incident Description',
       'Parent Incident Type', 'Hour of Day', 'Day of Week', 'Location',
       'neighborhood', 'Census Tract', 'Census Block Group', 'Census Block',
       '2010 Census Tract ', 'Police District'],
      dtype='object')

We convert the `'Incident Datetime'` column to datetime format and then to Unix timestamps (in seconds). Afterward, we drop the original datetime column.


In [218]:
crime_data['Incident Datetime'] = pd.to_datetime(crime_data['Incident Datetime'], errors='coerce')
crime_data['Incident Datetime (Unix)'] = crime_data['Incident Datetime'].astype('int64') // 10**9  # nanoseconds since epoch -> seconds

crime_data[['Incident Datetime', 'Incident Datetime (Unix)']].head()
crime_data = crime_data.drop(columns=['Incident Datetime'])
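
The integer cast followed by `// 10**9` assumes the column is stored at nanosecond resolution and contains no `NaT` values. As a hedged alternative, the sketch below subtracts the epoch and floor-divides by one second. The toy input, including the deliberately invalid second value, is an assumption for illustration.

```python
import pandas as pd

# Toy timestamps; the second value is unparseable and becomes NaT with errors='coerce'.
stamps = pd.to_datetime(pd.Series(["2016-06-14 01:20:00", "not a date"]), errors="coerce")

# Subtract the epoch and floor-divide by one second; this does not depend on
# the column's internal resolution (nanoseconds, microseconds, ...).
unix_seconds = (stamps - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)

print(unix_seconds)
```

The first value matches the `1465867200` shown for the first row of the dataset, and the unparseable entry stays missing instead of turning into an arbitrary integer.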

We convert the `'Census Tract'` and `'2010 Census Tract '` columns to numeric format, turning any non-numeric values into `NaN`. Then, we create a new column `'census avg'`, which represents the average of these two columns.


In [219]:
crime_data['Census Tract'] = pd.to_numeric(crime_data['Census Tract'], errors='coerce')
crime_data['2010 Census Tract '] = pd.to_numeric(crime_data['2010 Census Tract '], errors='coerce')

crime_data['census avg'] = crime_data[['Census Tract', '2010 Census Tract ']].mean(axis=1)

crime_data = crime_data.drop(columns=['Census Tract','Primary Incident Type'])
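
A detail worth seeing in isolation is how the row-wise mean treats missing values. The sketch below uses toy tract values (the column names `tract_2020` and `tract_2010` are hypothetical stand-ins) to show that `mean(axis=1)` skips `NaN` by default.

```python
import numpy as np
import pandas as pd

# Toy tract values; the column names here are illustrative only.
tracts = pd.DataFrame({
    "tract_2020": ["40.02", "55", "n/a"],
    "tract_2010": ["40.01", np.nan, "23"],
})

# Non-numeric entries become NaN instead of raising.
tracts = tracts.apply(pd.to_numeric, errors="coerce")

# mean(axis=1) skips NaN by default, so a row with one valid value keeps that value.
tracts["avg"] = tracts.mean(axis=1)
print(tracts)
```

A row only ends up with a missing average when both tract values are missing.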

We create a new column, `'case_solved'`, where `0` marks cases whose description is "investigation pending" (treated as unsolved) and `1` marks all other cases (treated as solved). Afterward, we drop the `'Incident Description'`, `'neighborhood'`, `'Day of Week'`, `'Hour of Day'`, and `'2010 Census Tract '` columns.


In [220]:
crime_data['case_solved'] = crime_data['Incident Description'].apply(
    lambda x: 0 if x == 'investigation pending' else 1)

crime_data = crime_data.drop(columns=['Incident Description','neighborhood','Day of Week','Hour of Day','2010 Census Tract '])

crime_data.columns

Index(['Parent Incident Type', 'Location', 'Census Block Group',
       'Census Block', 'Police District', 'Incident Datetime (Unix)',
       'census avg', 'case_solved'],
      dtype='object')
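
The same 0/1 flag can also be built without a Python-level `apply`. The sketch below, on toy descriptions, is one possible vectorized form and checks that it agrees with the lambda-based version.

```python
import pandas as pd

desc = pd.Series(["investigation pending", "assault", "larceny/theft"])

# Element-wise comparison gives booleans; casting to int yields the 0/1 flag.
case_solved = desc.ne("investigation pending").astype(int)

# Same rule written with apply, for comparison.
case_solved_apply = desc.apply(lambda x: 0 if x == "investigation pending" else 1)

print(case_solved.tolist())
```

On a table of roughly 300,000 rows the vectorized comparison is usually faster, although either form gives the same result.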

We modify the `'Police District'` column by removing the prefix "District " to simplify the entries. This is done without using regular expressions to ensure the exact text "District " is replaced.


In [221]:
crime_data['Police District'] = crime_data['Police District'].str.replace('District ', '', regex=False)
crime_data

Unnamed: 0,Parent Incident Type,Location,Census Block Group,Census Block,Police District,Incident Datetime (Unix),census avg,case_solved
0,assault,POINT (-78.889 42.938),2,2003,D,1465867200,55.000,1
1,theft,POINT (-78.809 42.889),4,4001,C,1481605200,23.000,1
3,theft,POINT (-78.879 42.954),2,2007,D,1416211680,56.000,1
4,theft,POINT (-78.879 42.954),2,2007,D,1429525320,56.000,1
5,breaking & entering,POINT (-78.848 42.913),3,3009,C,1428634800,33.020,1
...,...,...,...,...,...,...,...,...
315182,theft of vehicle,POINT (-78.89 42.919),2,2002,D,1729238760,61.000,0
315183,theft of vehicle,POINT (-78.854 42.922),1,1001,E,1728414000,33.010,0
315184,theft of vehicle,POINT (-78.84 42.897),1,1000,C,1727965800,166.000,0
315185,breaking & entering,POINT (-78.84 42.928),1,1014,E,1729204831,40.015,0
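
A short sketch of the literal replacement on toy district labels (the sample values are illustrative) shows that missing entries pass through unchanged:

```python
import pandas as pd

districts = pd.Series(["District D", "District C", None])

# regex=False treats 'District ' as plain text, so no character is interpreted as a pattern.
short = districts.str.replace("District ", "", regex=False)
print(short.tolist())
```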


We first drop any rows with missing values in the `crime_data` DataFrame. Next, we remove rows where any value is labeled as 'UNKNOWN' by replacing it with `NaN` and then dropping those rows.


In [222]:
crime_data = crime_data.dropna()

# Remove rows where any value is 'UNKNOWN'
crime_data = crime_data.replace('UNKNOWN', pd.NA).dropna()

crime_data.to_csv('preprocessed.csv')
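
The placeholder-to-missing step can be checked on a toy DataFrame. The sketch below (sample values are illustrative) shows that both the 'UNKNOWN' row and the row with a genuinely missing value are removed.

```python
import pandas as pd

toy = pd.DataFrame({
    "Police District": ["D", "UNKNOWN", "C", None],
    "Census Block": [2003, 4001, 2007, 1001],
})

# Turn the 'UNKNOWN' placeholder into a missing value, then drop every row with any missing value.
cleaned = toy.replace("UNKNOWN", pd.NA).dropna()
print(cleaned)
```

The surviving rows keep their original index labels, which is why the index in the saved CSV is no longer contiguous.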

# One-hot encoding of Parent Incident Type

In [223]:
crime_data = pd.get_dummies(crime_data, columns=['Parent Incident Type'])
crime_data

crime_data.to_csv('preprocessed.csv')
crime_data.shape

(307578, 16)
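
For readers unfamiliar with `pd.get_dummies`, here is a minimal sketch on a toy DataFrame (the rows are made up) showing how one categorical column expands into one indicator column per category:

```python
import pandas as pd

toy = pd.DataFrame({
    "Parent Incident Type": ["assault", "theft", "assault"],
    "case_solved": [1, 1, 0],
})

# Each category becomes its own indicator column, named '<column>_<category>'.
encoded = pd.get_dummies(toy, columns=["Parent Incident Type"])
print(list(encoded.columns))
```

This is why the column count grows after encoding: the single `'Parent Incident Type'` column is replaced by one column per distinct incident type.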

# Remove Rows that are not in Buffalo based on Latitude and Longitude

In [224]:
crime_data[['Longitude', 'Latitude']] = crime_data['Location'].str.extract(r'POINT \((-?\d+\.\d+) (-?\d+\.\d+)\)')

crime_data['Latitude'] = pd.to_numeric(crime_data['Latitude'], errors='coerce')
crime_data['Longitude'] = pd.to_numeric(crime_data['Longitude'], errors='coerce')

lat_min, lat_max = 42.8296, 42.9985
lon_min, lon_max = -78.8998, -78.7002

buffalo_filtered_data = crime_data[
    (crime_data['Latitude'] >= lat_min) &
    (crime_data['Latitude'] <= lat_max) &
    (crime_data['Longitude'] >= lon_min) &
    (crime_data['Longitude'] <= lon_max)
]
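
The bounding-box filter can also be written with `Series.between`, which some readers find easier to scan. The sketch below reuses the bounds from the cell above on three toy points (the second is deliberately far outside Buffalo).

```python
import pandas as pd

# Bounding box copied from the notebook cell above.
lat_min, lat_max = 42.8296, 42.9985
lon_min, lon_max = -78.8998, -78.7002

points = pd.DataFrame({
    "Latitude": [42.938, 40.7128, 42.889],
    "Longitude": [-78.889, -74.0060, -78.809],
})

# Series.between is inclusive on both ends by default, matching the >= / <= comparisons.
in_box = (points["Latitude"].between(lat_min, lat_max)
          & points["Longitude"].between(lon_min, lon_max))
inside = points[in_box]
print(inside)
```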

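We save the filtered, preprocessed DataFrame (`buffalo_filtered_data`) to a CSV file named `'preprocessed.csv'`, so that the rows outside Buffalo removed in the previous step are excluded from the saved file.


In [225]:
buffalo_filtered_data.to_csv('preprocessed.csv')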