

# Data Analysis of Burglaries in Chicago (2015-2019)
#### By Annivas Exarchos

<img src="img/chicago.png" alt="chicago" style="width: 900px;" align="left"/>

source: https://hotelemc2.com/why-chicago-is-the-best-city-in-the-world/

## Introduction
**The overall objective of this project will be to analyze burglary data for Chicago, IL from 2015 to 2019.**

**Throughout this tutorial, we will attempt to find when and where burglaries are most likely to take place, while also complementing our analysis with interesting burglary trends and statistics.**

## Table of Contents
- [Required Tools](#Required-Tools)
- [Installations & Imports](#Installations-&-Imports)
- [Data Collection](#1.-Data-Collection)
- [Data Processing](#2.-Data-Processing)
- [Exploratory Data Analysis & Visualization](#3.-Exploratory-Data-Analysis-&-Visualization)
- [Insight & Observations](#4.-Insight-&-Observations)

## Required Tools

This project is written in [Python 3.9.1](https://docs.python.org/3/).

You will need the following Python libraries:
- [pandas](https://pandas.pydata.org/docs/)
- [numpy](https://numpy.org/doc/)
- [matplotlib.pyplot](https://matplotlib.org/api/pyplot_api.html)
- [folium](https://python-visualization.github.io/folium/)
- [sodapy](https://github.com/xmunoz/sodapy)

## Installations & Imports

In [6]:
!pip install folium sodapy # folium creates the maps, sodapy queries the data portal



In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
import folium
from sodapy import Socrata
from folium.plugins import HeatMap

## 1. Data Collection
This is the first stage of the data lifecycle. Here, we will collect all the data needed for our project.

The main [dataset](https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-Present/ijzp-q8t2) that we will be using contains all reported crimes in the city of Chicago since 2001 and can be found in the official [Chicago Data Portal](https://data.cityofchicago.org/).

The data is published as a large CSV file, which we will access using the [sodapy](https://github.com/xmunoz/sodapy) client through the [Socrata Open Data API](https://dev.socrata.com/).

From this file we will only extract crime data for the years 2015-2019.

In [8]:
# These can be found in the data portal
domain = 'data.cityofchicago.org'
dataset_id = 'ijzp-q8t2'

# Generate token by creating an account for the data portal
token = 'Lkysyak9elTtcNXRVmfsj9YLX'

client = Socrata(domain, token)

# Get data for 2015-2019
results = client.get(dataset_id, where="date >= '2015-01-01' and date < '2020-01-01'", limit=2000000)

# Store into pandas dataframe
crime_table = pd.DataFrame.from_dict(results)

# Display first 5 rows of dataframe
crime_table.head()

Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,domestic,...,ward,community_area,fbi_code,x_coordinate,y_coordinate,year,updated_on,latitude,longitude,location
0,10225520,HY412735,2015-01-01T00:00:00.000,075XX S BLACKSTONE AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,...,5,43,11,1187511.0,1855334.0,2015,2018-02-10T15:50:01.000,41.758131167,-87.588352326,"{'latitude': '41.758131167', 'longitude': '-87..."
1,11028448,JA360336,2015-01-01T00:00:00.000,051XX W HURON ST,281,CRIM SEXUAL ASSAULT,NON-AGGRAVATED,APARTMENT,True,True,...,37,25,2,,,2015,2019-09-02T15:57:18.000,,,
2,10225760,HY412902,2015-01-01T00:00:00.000,050XX N MARINE DR,810,THEFT,OVER $500,APARTMENT,False,False,...,48,3,6,1169650.0,1934124.0,2015,2018-02-10T15:50:01.000,41.974742888,-87.651517395,"{'latitude': '41.974742888', 'longitude': '-87..."
3,11242929,JB168310,2015-01-01T00:00:00.000,049XX S COTTAGE GROVE AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,APARTMENT,False,False,...,4,39,11,,,2015,2018-03-01T15:54:55.000,,,
4,10229179,HY416572,2015-01-01T00:00:00.000,039XX S LAKE PARK AVE,1752,OFFENSE INVOLVING CHILDREN,AGG CRIM SEX ABUSE FAM MEMBER,RESIDENCE,False,False,...,4,36,20,1183388.0,1878984.0,2015,2018-02-10T15:50:01.000,41.823125769,-87.602725951,"{'latitude': '41.823125769', 'longitude': '-87..."
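
The query above downloads every crime type for the five-year window. As a rough sketch, the same kind of `where` string could also carry a crime type, so that a separate request transfers burglary rows only. The helper name `build_where` is hypothetical, and it assumes the API fields are called `date` and `primary_type`, as the column names in the output above suggest. Escaping quotes by doubling them is the usual SQL convention and is assumed to apply here as well.

```python
def build_where(start_year, end_year, primary_type=None):
    """Build a filter string for an inclusive range of years.

    Hypothetical helper: the field names `date` and `primary_type`
    are taken from the columns shown in the notebook output.
    """
    clause = f"date >= '{start_year}-01-01' and date < '{end_year + 1}-01-01'"
    if primary_type is not None:
        # Escape single quotes by doubling them (SQL convention, assumed here)
        safe = primary_type.replace("'", "''")
        clause += f" and primary_type = '{safe}'"
    return clause


where_all = build_where(2015, 2019)
where_burglary = build_where(2015, 2019, "BURGLARY")
print(where_all)
print(where_burglary)
```

Either string would be passed as the `where` argument of `client.get`. Note that the later crime-type comparison needs all crime types, so the burglary-only filter would only suit a separate, smaller download.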


Additionally, we will be using some [data](https://crime-data-explorer.fr.cloud.gov/explorer/national/united-states/crime) downloaded from the FBI's [Crime Data Explorer](https://crime-data-explorer.fr.cloud.gov/) in CSV format.
These burglary-specific datasets include statistics about victims' and offenders' age, sex, and race, as well as the relationship between victims and offenders and other crimes that burglary offenders have been charged with.

To import this data into dataframes, we will be using pandas' [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) method.

In [16]:
# Burglary offenders by age
offender_age = pd.read_csv("https://annivas.github.io/files/offender-age-2015-2019.csv")
offender_age

HTTPError: HTTP Error 404: Not Found

In [17]:
# Burglary offenders by sex
offender_sex = pd.read_csv("https://annivas.github.io/files/offender-sex-2015-2019.csv")
offender_sex

HTTPError: HTTP Error 404: Not Found

In [None]:
# Burglary offenders by race
offender_race = pd.read_csv("https://annivas.github.io/files/offender-race-2015-2019.csv")
offender_race

In [None]:
# Burglary victims by age
victim_age = pd.read_csv("https://annivas.github.io/files/victim-age-2015-2019.csv")
victim_age

In [None]:
# Burglary victims by sex
victim_sex = pd.read_csv("https://annivas.github.io/files/victim-sex-2015-2019.csv")
victim_sex

In [None]:
# Burglary victims by race
victim_race = pd.read_csv("https://annivas.github.io/files/victim-race-2015-2019.csv")
victim_race

In [None]:
# Relationship between burglary offenders and victims
victim_offender_relationship = pd.read_csv("https://annivas.github.io/files/victim-offender-relationship-2015-2019.csv")
victim_offender_relationship

In [None]:
# Other offenses linked to burglary offenders
linked_offenses = pd.read_csv("https://annivas.github.io/files/linked-offenses-2015-2019.csv")
linked_offenses
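
Remote files can move or disappear, and a failed download stops the notebook at that cell. A minimal sketch of a more tolerant loader is shown below. The function name `load_key_value_csv` is hypothetical, the "Key" and "Value" column names are the ones used in the processing step of this tutorial, and the demo rows are invented placeholders rather than FBI figures.

```python
import io

import pandas as pd


def load_key_value_csv(source):
    """Read a CSV with 'Key' and 'Value' columns; return None if it cannot be opened.

    `source` may be a URL, a local path, or a file-like object.
    """
    try:
        table = pd.read_csv(source)
    except OSError as err:
        # urllib's HTTPError and FileNotFoundError are both OSError subclasses
        print(f"Could not load {source!r}: {err}")
        return None
    missing = {"Key", "Value"} - set(table.columns)
    if missing:
        raise ValueError(f"Missing expected columns: {sorted(missing)}")
    return table


# Placeholder rows, invented purely to exercise the helper
demo_csv = io.StringIO("Key,Value\nA,10\nB,25\n")
demo = load_key_value_csv(demo_csv)
missing_table = load_key_value_csv("this-file-does-not-exist.csv")
print(demo)
print(missing_table)
```

Returning `None` lets later cells check whether a table is available before plotting it.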

## 2. Data Processing
Now that we have collected all the necessary data, it's time to process it and organize it in a way that will serve our needs for the remainder of the project.

First, let's extract all burglaries from the crime table into a new table. We will only choose the columns we need, as the initial crime table is filled with unnecessary information.

In [None]:
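# Get burglaries from crime table (only the columns we need).
# .copy() gives an independent table, so the columns added below do not trigger a SettingWithCopyWarning.
columns = ['id', 'primary_type', 'description', 'arrest', 'location', 'latitude', 'longitude', 'date', 'year']
burglary_table = crime_table.loc[crime_table['primary_type'] == 'BURGLARY', columns].copy()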
# Display first 5 rows
burglary_table.head()

Now that we have only the data we need, let's add some new columns derived from the "date" column.

month column:

In [None]:
# Create month column. Months are represented as ints from 1 (January) to 12 (December).
# We could represent months as strings, but integers facilitate plotting.
burglary_table['month'] = pd.DatetimeIndex(burglary_table['date']).month
# Display first 5 rows
burglary_table.head()

day column:

In [None]:
# Create day column. Days are represented as ints from 0 (Monday) to 6 (Sunday).
# We could represent days as strings, but integers facilitate plotting.
burglary_table['day'] = pd.DatetimeIndex(burglary_table['date']).weekday
# Display first 5 rows
burglary_table.head()

time column:

In [None]:
# Create time column. Time is expressed in hours and hours are represented as ints from 0 (12 am) to 23 (11 pm)
burglary_table['time'] = pd.DatetimeIndex(burglary_table['date']).hour
# Display first 5 rows
burglary_table.head()
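
The three derived columns can also be produced from a single parse of the timestamps. The sketch below uses `pd.to_datetime` and the `.dt` accessor on two sample timestamps written in the same format as the "date" column shown earlier. The table name `demo` is only for illustration.

```python
import pandas as pd

# Two timestamps in the same ISO format as the "date" column
demo = pd.DataFrame({"date": ["2015-01-01T00:00:00.000", "2016-08-15T09:30:00.000"]})

# Parse once, then reuse the parsed column for every derived field
parsed = pd.to_datetime(demo["date"])
demo = demo.assign(
    month=parsed.dt.month,    # 1 (January) to 12 (December)
    day=parsed.dt.weekday,    # 0 (Monday) to 6 (Sunday)
    time=parsed.dt.hour,      # 0 (12 am) to 23 (11 pm)
)
print(demo)
```

Parsing once avoids converting the same strings three times, which matters on a table with many rows.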

For the complementary data we imported, the only processing that needs to be done is setting the "Key" column as the index of each table and sorting the tables by "Value" to facilitate plotting.

In [None]:
offender_age = offender_age.set_index('Key').sort_values(by="Value", ascending=False)
offender_sex = offender_sex.set_index('Key').sort_values(by="Value", ascending=False)
offender_race = offender_race.set_index('Key').sort_values(by="Value", ascending=False)
victim_age = victim_age.set_index('Key').sort_values(by="Value", ascending=False)
victim_sex = victim_sex.set_index('Key').sort_values(by="Value", ascending=False)
victim_race = victim_race.set_index('Key').sort_values(by="Value", ascending=False)
victim_offender_relationship = victim_offender_relationship.set_index('Key').sort_values(by="Value", ascending=False)
linked_offenses = linked_offenses.set_index('Key').sort_values(by="Value", ascending=False)
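
Since the same two steps are applied to every complementary table, they could also be written once and applied in a loop. The following is a small sketch under that assumption. The function name `prepare` and the dictionary layout are hypothetical, and the values are invented placeholders.

```python
import pandas as pd


def prepare(table):
    """Index a Key/Value table by 'Key' and sort it by 'Value', largest first."""
    return table.set_index("Key").sort_values(by="Value", ascending=False)


# Placeholder tables with invented values, standing in for the FBI CSV files
tables = {
    "offender_sex": pd.DataFrame({"Key": ["A", "B", "C"], "Value": [5, 20, 10]}),
    "victim_sex": pd.DataFrame({"Key": ["X", "Y"], "Value": [7, 3]}),
}
prepared = {name: prepare(table) for name, table in tables.items()}
print(prepared["offender_sex"])
```

Keeping the tables in a dictionary means a new dataset only needs one extra entry instead of another copied line.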

## 3. Exploratory Data Analysis & Visualization
Now that our data is clean and organized, it's time to analyze it through the use of visualizations. This is usually the most interesting part of the data lifecycle, as we will attempt to plot our data and observe possible trends.

First, we will use the original crime table to count the occurrences of each type of crime from 2015 to 2019.

In [None]:
# Calculate the number of occurrences of each crime type in crime_table
crime_type_occ = crime_table['primary_type'].value_counts()
crime_type_occ

From the above data, theft looks to be the most common crime in Chicago, while burglary is 8th.
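
Rather than reading the rank off the printed list, it can be computed from the counts. This is a minimal sketch on invented labels that stand in for the `primary_type` column. The numbers are not taken from the Chicago data.

```python
import pandas as pd

# Invented crime labels standing in for the "primary_type" column
primary_type = pd.Series(["THEFT"] * 5 + ["BATTERY"] * 3 + ["BURGLARY"] * 2 + ["ARSON"])

counts = primary_type.value_counts()          # sorted from most to least frequent
rank = counts.index.get_loc("BURGLARY") + 1   # 1-based position in that ordering
share = counts["BURGLARY"] / counts.sum()     # fraction of all incidents
print(rank, share)
```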

Now, let's plot the 12 most common types of crime in a pie chart to get a better idea.

In [None]:
crime_type_occ[0:12].plot(kind='pie', figsize=(10, 10), title="Types of Crime in Chicago (2015-2019)", autopct='%1.1f%%')
plt.ylabel("")
plt.show()

By using the burglary_table, we can plot the number of burglaries by year and hopefully observe a trend.

In [None]:
burglary_table['year'].value_counts().sort_index().plot(kind='bar', rot=0, title="Burglaries by Year", figsize=(10, 8))
plt.ylabel("Number of Burglaries")
plt.xlabel("Year")
plt.show()

From the above bar plot, we can tell that 2016 had the most burglaries of the five years. More importantly, the number of burglaries has decreased every year since 2016.
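
A trend like this can also be checked numerically. The sketch below computes the year-over-year percentage change and tests whether the counts fall continuously after the peak year. The yearly counts are invented for illustration only.

```python
import pandas as pd

# Invented yearly counts, used only to show the calculation
yearly = pd.Series({2015: 100, 2016: 120, 2017: 90, 2018: 81}).sort_index()

# Percent change versus the previous year (the first year has no predecessor)
change = yearly.pct_change() * 100

# True if the counts never rise from the peak year onward
decreasing_since_peak = bool(yearly.loc[yearly.idxmax():].is_monotonic_decreasing)
print(change.round(1))
print(decreasing_since_peak)
```

With the real data, `yearly` would be the result of `burglary_table['year'].value_counts().sort_index()`.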

Now let's try to visualize burglaries by month. By counting the occurrences of each month in our burglary table and dividing by the number of years, we can get the average number of burglaries that occurred in each month over the five years.

In [None]:
# Total number of burglaries for 5 years, grouped by month
burglaries_by_month = burglary_table['month'].value_counts().sort_index()
# Divide the total for each month by 5 (years) to get the average number of burglaries in that month per year
burglaries_by_month = burglaries_by_month / 5
burglaries_by_month

In [None]:
figure(num=None, figsize=(14, 8))
x = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
y = burglaries_by_month
plt.bar(x,y)
plt.title("Burglaries by Month")
plt.ylabel("Number of Burglaries")
plt.xlabel("Month")
plt.show()

After plotting the average number of burglaries per month, we can start observing some trends. Most burglaries occur in August (a little under 1,200), followed by July, which could be because many homes are left unoccupied during summer vacations. February has the lowest average (about two thirds of August's), which suggests that households are safest during that month, although part of the gap simply reflects February having fewer days.
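
Months differ in length, so a monthly total partly reflects the number of days in the month. One possible refinement, sketched below, is to divide each month's total by the number of calendar days that month contributes to the study period. The function name `burglaries_per_day_by_month` is hypothetical, and the dates are invented.

```python
import pandas as pd


def burglaries_per_day_by_month(dates, start, end):
    """Average incidents per calendar day, grouped by month (1 to 12)."""
    dates = pd.to_datetime(pd.Series(dates))
    incidents = dates.dt.month.value_counts().sort_index()
    # Count how many calendar days each month contributes between start and end
    calendar_days = pd.Series(pd.date_range(start, end, freq="D").month)
    days_in_month = calendar_days.value_counts().sort_index()
    # Months without incidents become 0 instead of NaN
    return (incidents / days_in_month).fillna(0.0)


# Invented incident dates: three in January and two in February
demo_dates = ["2015-01-03", "2015-01-10", "2015-01-20", "2015-02-01", "2015-02-14"]
rates = burglaries_per_day_by_month(demo_dates, "2015-01-01", "2015-02-28")
print(rates)
```

On a per-day basis the gap between a short and a long month shrinks, which makes the seasonal comparison fairer.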

We can also plot the number of burglaries by day of the week.

In [None]:
# Total number of burglaries for 5 years, grouped by day of week
burglaries_by_day = burglary_table['day'].value_counts().sort_index()
# Divide value for each day by 5 to get number of burglaries for each year by day
burglaries_by_day = burglaries_by_day.apply(lambda x: x/5)
# Divide value for each day by 52.1429 (number of weeks in a year) to get normalized number of burglaries by day of week
burglaries_by_day = burglaries_by_day.apply(lambda x: x/52.1429)
burglaries_by_day

In [None]:
figure(num=None, figsize=(14, 8))
x = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
y = burglaries_by_day
plt.bar(x,y)
plt.title("Burglaries by Day of the Week")
plt.ylabel("Number of Burglaries")
plt.xlabel("Day")
plt.show()

There seem to be about 35 burglaries per weekday in Chicago, while the number is lower on weekends. This could be attributed to the fact that most people are at work on weekdays, and empty houses make better targets for burglars. Weekends seem to be less suitable days for burglaries, as most families stay at home.
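
The divisor 5 × 52.1429 is an approximation of how often each weekday occurs in five years. As a sketch of an exact alternative, the occurrences of each weekday between 2015-01-01 and 2019-12-31 can be counted directly. The function name `average_per_weekday` is hypothetical, and the demo totals are invented.

```python
import pandas as pd

# Every calendar day in the study period
days = pd.date_range("2015-01-01", "2019-12-31", freq="D")

# How many Mondays (0), Tuesdays (1), ..., Sundays (6) the period contains
weekday_occurrences = pd.Series(days.weekday, dtype="int64").value_counts().sort_index()


def average_per_weekday(weekday_counts):
    """Divide total incidents per weekday by the number of such weekdays."""
    return weekday_counts / weekday_occurrences


# Invented totals, the same for every weekday
demo_counts = pd.Series(522, index=range(7))
print(weekday_occurrences)
print(average_per_weekday(demo_counts))
```

The difference from the approximation is small here, but the same approach stays correct for date ranges that do not cover whole years.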

Now let's dive a step deeper, and plot the number of burglaries by time of the day.

In [None]:
# Total number of burglaries for 5 years, grouped by hour of the day
burglaries_by_time = burglary_table['time'].value_counts().sort_index()
# Divide value for each hour by 5 to get the number of burglaries per year at that hour
burglaries_by_time = burglaries_by_time / 5
# Each hour of the day occurs once per day, so divide by 365 (days in a year)
# to get the average number of burglaries per day at that hour
burglaries_by_time = burglaries_by_time / 365
burglaries_by_time

In [None]:
burglaries_by_time.plot(kind='bar', rot=0, title="Burglaries by Time of the Day", figsize=(10, 8))
plt.ylabel("Number of Burglaries")
plt.xlabel("Time (in hours)")
plt.show()

It might be expected that most burglaries occur at nighttime. However, according to the above bar plot, most burglaries in Chicago occur around 8am, 9am, and 12pm. In fact, almost one burglary occurs at these times every day. Burglaries are least likely to occur from 1am to 6am. A possible explanation is the same as above: the number of burglaries rises at the times when most people leave home for work, which suggests that burglars prefer empty homes.

These observations seem interesting. Let's also visualize them in a line plot.

In [None]:
burglaries_by_time.plot(rot=0, title="Burglaries by Time of the Day", figsize=(10, 8))
plt.ylabel("Number of Burglaries")
plt.xlabel("Time (in hours)")
plt.show()

This line plot confirms the observations from our bar plot and shows the big difference in burglary occurrence between different times of the day.
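
One detail worth guarding against: `value_counts` only returns values that actually occur, so an hour (or month) without any incident would be missing from the result, and a hard-coded list of axis labels would no longer line up with the bars. A minimal sketch of a fix with `reindex`, on invented hours:

```python
import pandas as pd

# Invented hours of day for a handful of incidents; most hours never occur
hours = pd.Series([8, 8, 9, 12, 12, 12, 23])

counts = hours.value_counts().sort_index()           # only the hours that appear
full_day = counts.reindex(range(24), fill_value=0)   # one entry per hour, 0 to 23
print(len(counts), len(full_day))
```

The same pattern with `range(1, 13)` or `range(7)` would cover months and weekdays.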

Now that we have determined when burglaries are most likely to occur, let's observe **where** they are most likely to occur.

We will do this by creating an interactive heat map indicating the areas of Chicago with the highest concentration of burglaries.

The burglary table contains a very large number of data points, which would make our heat map cluttered and unreadable.
To improve readability, we will be using a random sample of size 10,000.

In [None]:
# Take random sample of 10,000 rows
sample_table = burglary_table.sample(n=10000)
# Display first 5 rows
sample_table.head()
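
A plain call to `sample` draws different rows on every run, so the map changes each time the notebook is executed. The sketch below fixes the seed with `random_state` and caps the sample size at the table length. The helper name `sample_rows` and the seed value are arbitrary choices for illustration.

```python
import pandas as pd


def sample_rows(table, n=10000, seed=0):
    """Return up to `n` random rows; the fixed seed makes the draw repeatable."""
    return table.sample(n=min(n, len(table)), random_state=seed)


demo = pd.DataFrame({"id": range(50)})
first = sample_rows(demo, n=10)
second = sample_rows(demo, n=10)
small = sample_rows(demo.head(3), n=10)   # fewer rows than requested
print(first["id"].tolist())
```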

To map our sample, we will be using the folium package.

In [None]:
# Create map
map_osm = folium.Map(location=[41.88, -87.63], zoom_start=11)
# Drop rows where the coordinates are missing
heat_table = sample_table.dropna(subset=['latitude', 'longitude'])
# Get heat data from sample as [latitude, longitude] pairs of floats
heat_data = heat_table[['latitude', 'longitude']].astype(float).values.tolist()
# Create heat map
HeatMap(heat_data, radius=20).add_to(map_osm)
    
map_osm
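
An alternative to sampling, sketched here under stated assumptions, is to group nearby points into small grid cells and give each cell a weight. folium's `HeatMap` accepts rows of the form `[lat, lng, weight]`, so the full table could be summarized instead of subsampled. The function name `weighted_heat_data`, the rounding precision, and the demo coordinates are all illustrative choices.

```python
import pandas as pd


def weighted_heat_data(table, decimals=3):
    """Group points into rounded latitude/longitude cells and count them.

    Returns rows of [latitude, longitude, weight].
    """
    coords = (
        table[["latitude", "longitude"]]
        .dropna()
        .astype(float)      # coordinates may arrive as strings
        .round(decimals)
    )
    cells = coords.groupby(["latitude", "longitude"]).size().reset_index(name="weight")
    return cells.values.tolist()


# Invented coordinates near the map center used above
demo = pd.DataFrame({
    "latitude": ["41.8801", "41.8803", "41.9000", None],
    "longitude": ["-87.6301", "-87.6299", "-87.7000", None],
})
heat_rows = weighted_heat_data(demo)
print(heat_rows)
```

The resulting list could be passed to `HeatMap` in place of `heat_data`. Whether the weighted map reads better than the sampled one would need to be checked visually.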

Now let's make our map a bit more descriptive, by adding some more data.

We will be creating circles, indicating the location of each burglary. By clicking on the circles, one will be able to see the incident description. Additionally, green circles will indicate burglaries where the offender has been arrested, while black circles will mean that the offender has not been arrested.

In [None]:
# Add circles
for index, row in heat_table.iterrows():
    # Green if the incident led to an arrest, black otherwise
    color = 'green' if row['arrest'] == True else 'black'

    folium.Circle(
        radius=20,
        location=[row['latitude'], row['longitude']],
        popup=row['description'],
        color=color,
        fill=True,
    ).add_to(map_osm)
    
map_osm

A considerably large proportion of the circles on the map above are black, which means that most burglaries do not lead to an arrest. Let's visualize this in a pie chart.

In [None]:
burglary_table['arrest'].value_counts().plot(kind='pie', figsize=(10, 10), title="Burglars Arrested", autopct='%1.1f%%')
plt.ylabel("")
plt.show()

From the plot above, we can see that a surprisingly low percentage of burglaries lead to an arrest: 94.8% of them have no arrest on record.
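
The percentage shown in the pie chart can also be computed directly. This short sketch assumes the `arrest` column holds booleans, as the table output earlier suggests, and uses invented flags rather than the real data.

```python
import pandas as pd

# Invented arrest flags, one per incident
arrest = pd.Series([False, False, True, False, False, False, False, True, False, False])

shares = arrest.value_counts(normalize=True) * 100   # percentage of incidents per flag
arrest_rate = arrest.mean() * 100                    # mean of booleans = share of True
print(shares.round(1))
print(arrest_rate)
```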

Now let's plot some other interesting statistics, using our complementary datasets.

We will use pie charts to visualize the age, sex, and race distributions of burglary offenders and victims.

In [None]:
offender_age[:7].plot(kind='pie', y='Value', figsize=(10,10), autopct='%1.1f%%', title="Burglars by Age")
plt.ylabel("")
plt.show()

In [None]:
offender_sex.plot(kind='pie', y='Value', figsize=(10,10), autopct='%1.1f%%', title="Burglars by Sex")
plt.ylabel("")
plt.show()

In [None]:
offender_race.plot(kind='pie', y='Value', figsize=(10,10), autopct='%1.1f%%', title="Burglars by Race")
plt.ylabel("")
plt.show()

In [None]:
victim_age[:10].plot(kind='pie', y='Value', figsize=(10,10), autopct='%1.1f%%', title="Victims by Age")
plt.ylabel("")
plt.show()

In [None]:
victim_sex.plot(kind='pie', y='Value', figsize=(10,10), autopct='%1.1f%%', title="Victims by Sex")
plt.ylabel("")
plt.show()

In [None]:
victim_race[:5].plot(kind='pie', y='Value', figsize=(10,10), autopct='%1.1f%%', title="Victims by Race")
plt.ylabel("")
plt.show()

In [None]:
victim_offender_relationship[:12].plot(kind='pie', y='Value', figsize=(10,10), autopct='%1.1f%%', title="Relationship between Victim and Offender", subplots=True)
plt.ylabel("")
plt.show()

In [None]:
linked_offenses[:15].plot(kind='pie', y='Value', figsize=(10,10), autopct='%1.1f%%', title="Offenders linked to other offenses", subplots=True)
plt.ylabel("")
plt.show()
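
Slicing a table before plotting, as in `offender_age[:7]`, makes the pie percentages add up to 100% over the kept rows only, so each slice looks larger than its share of the full table. One hedged option is to fold the remaining rows into an "Other" slice. The helper name `top_n_with_other` is hypothetical, and the counts are invented.

```python
import pandas as pd


def top_n_with_other(values, n):
    """Keep the `n` largest entries and sum the rest into an 'Other' entry."""
    ordered = values.sort_values(ascending=False)
    top = ordered.iloc[:n]
    rest = ordered.iloc[n:].sum()
    if rest > 0:
        top = pd.concat([top, pd.Series({"Other": rest})])
    return top


# Invented counts, used only to show the effect on the percentages
demo = pd.Series({"A": 50, "B": 30, "C": 10, "D": 6, "E": 4})
grouped = top_n_with_other(demo, 2)
print(grouped / grouped.sum() * 100)
```

With the real tables, the helper would be applied to the "Value" column before calling `plot(kind='pie')`.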

## 4. Insight & Observations

For the last stage of the data lifecycle, we will be utilizing the analysis we conducted to derive some insights and observations about burglaries in Chicago.

The number of burglaries has decreased every year since 2016, which suggests that Chicago is becoming safer, at least with respect to burglary.

Most burglaries occur in the summer months, on weekdays, between 8am and 12pm. Our analyses of burglaries by month, day, and time agree with each other and all point to the same hypothesis: burglars prefer vacant homes, where the chance of confrontation is lower.

The highest concentration of burglaries seems to be in the center of the city. Other than that, there does not look to be any other obvious trend. An educated assumption would be that wealthier, less-secure households have a higher chance of being burglarized.

According to the FBI data, the majority of burglars are white males between the ages of 20 and 29. In Chicago, around 95% of burglaries do not lead to an arrest.

A considerable number of burglary victims seem to know the burglar in some way. Only 19% of burglary victims reported the burglar as a complete stranger.

Among burglars who were also linked with another offense, 50% had been involved in Destruction/Damage/Vandalism of Property.