#### Events Initial Analysis

The purpose of this notebook is to carry out a very rough, baseline analysis of the impact of events on nearby crimes. 

This analysis is limited in several important ways. See technical log for longer discussion, but to sum up some of the major ones:

- It uses single-day events, _without_ incorporating multi-day events from the dataset. This effectively intermingles our control (no-event) and treatment (event) groups, making it harder for us to detect a true effect.

- It uses a hard 1km square buffer around venues to identify possibly related crimes. Using this rather than a 'smart' buffer based on behavior may further intermingle impacted and unimpacted crimes, making it harder to detect a true effect. A buffer that is too large dilutes any true effect with unrelated crimes; conversely, a buffer that is too small leaves fewer crimes per observation and increases the variance of our estimate.

- It relies on linear models. This isn't the worst thing ever but could _possibly_ be improved upon with a more nonlinear ML-y approach.

- It assumes that spillovers are (on average) linearly additive: being within the bandwidth of two events has on average 2x the effect of being within the bandwidth of one event.

- It relies on a selection-on-observables approach that only incorporates date, venue, and time as controls. This probably accounts for some significant part of any selection effect (day of the week, etc.) but not all. Other factors that are correlated with whether events take place and with crime will induce bias.
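
To make the hard-buffer limitation concrete, here is a minimal sketch of a square-buffer membership test. The notebook does not show how the buffer was actually computed, so the function name, the flat-earth conversion, and the reading of "1km" as the half-side of the square (which would match the "2 km square" mentioned later) are all assumptions.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate length of one degree of latitude

def in_square_buffer(crime_lat, crime_lon, venue_lat, venue_lon, half_side_km=1.0):
    """Return True if the crime lies inside an axis-aligned square centred on the venue."""
    # Convert degree offsets to approximate kilometres (flat-earth approximation,
    # reasonable at city scale).
    dlat_km = (crime_lat - venue_lat) * KM_PER_DEG_LAT
    dlon_km = (crime_lon - venue_lon) * KM_PER_DEG_LAT * math.cos(math.radians(venue_lat))
    return abs(dlat_km) <= half_side_km and abs(dlon_km) <= half_side_km

# Aragon Ballroom coordinates, taken from the events table
venue = (41.969436, -87.658038)
near = in_square_buffer(41.974436, -87.658038, *venue)  # roughly 0.56 km north
far = in_square_buffer(41.989436, -87.658038, *venue)   # roughly 2.2 km north
print(near, far)
```

Every crime inside the square is treated identically and every crime outside is ignored, which is exactly why the choice of `half_side_km` matters.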

In [1]:
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
from itertools import product
from sklearn import linear_model
import statsmodels.api as sm
%matplotlib inline
plt.rcParams['figure.figsize'] = [12, 12]  # add more space to figures

In [2]:
events = pd.DataFrame(gpd.read_file('geo/single_day_events.shp').drop(columns=['geometry']))
# convert event dates to datetime
events['date'] = pd.to_datetime(events['Date(s)'])
events

Unnamed: 0,Name of Ev,Category,Date(s),Location,Estimated,Number Peo,Address,Venue Capa,Venue Type,Latitude,Longitude,date
0,Alison Wonderland,Music,2018-08-24 00:00:00,Aragon Ballroom,,4800,"1106 W Lawrence Ave, Chicago, IL 60640",4800.0,Event Space,41.969436,-87.658038,2018-08-24
1,"Virtual Self ""Utopia System""",Music,2018-09-07 00:00:00,Aragon Ballroom,,4800,"1106 W Lawrence Ave, Chicago, IL 60640",4800.0,Event Space,41.969436,-87.658038,2018-09-07
2,5 Seconds of Summer / The Aces,Music,2018-09-08 00:00:00,Aragon Ballroom,,4800,"1106 W Lawrence Ave, Chicago, IL 60640",4800.0,Event Space,41.969436,-87.658038,2018-09-08
3,SIGRID W/ HOUSES,Music,2019-08-01 00:00:00,Thalia Hall,Medium,800,"1807 S Allport St, Chicago, IL 60608",800.0,Music,41.857681,-87.657392,2019-08-01
4,JUDAH & THE LION W/ THE BAND CAMINO,Music,2019-08-02 00:00:00,Thalia Hall,Medium,800,"1807 S Allport St, Chicago, IL 60608",800.0,Music,41.857681,-87.657392,2019-08-02
...,...,...,...,...,...,...,...,...,...,...,...,...
920,Chicago Cubs vs Los Angeles Dodgers,Sports,2019-03-20 00:00:00,Wrigley Field,Large,41160,"1060 W Addison St, Chicago, IL 60613",41160.0,Sports,41.947588,-87.656125,2019-03-20
921,Chicago Cubs vs San Francisco Giants,Sports,2019-03-21 00:00:00,Wrigley Field,Large,41160,"1060 W Addison St, Chicago, IL 60613",41160.0,Sports,41.947588,-87.656125,2019-03-21
922,Chicago Cubs vs Los Angeles Angels,Sports,2019-06-03 00:00:00,Wrigley Field,Large,41160,"1060 W Addison St, Chicago, IL 60613",41160.0,Sports,41.947588,-87.656125,2019-06-03
923,14TH ANNUAL RACE TO WRIGLEY 5K CHARITY RUN PRE...,Sports,2019-04-27 00:00:00,Wrigley Field,Large,41160,"1060 W Addison St, Chicago, IL 60613",41160.0,Sports,41.947588,-87.656125,2019-04-27


In [3]:
data = pd.read_csv('data/crime_event_treated.csv')
data['date'] = pd.to_datetime(data.date)
data

Unnamed: 0.1,Unnamed: 0,id,case_number,date,block,iucr,primary_type,description,location_description,arrest,...,near_any_venue,event_day_crimes,event_evening_crimes,event_3to6_crimes,event_4to7_crimes,event_5to8_crimes,event_6to9_crimes,event_7to10_crimes,event_8to11_crimes,event_9to12_crimes
0,0,11393720,JB367241,2018-07-27 00:00:00,047XX N BROADWAY,1305,CRIMINAL DAMAGE,CRIMINAL DEFACEMENT,BANK,False,...,1,0,0,0,0,0,0,0,0,0
1,1,11394961,JB368503,2018-07-27 00:00:00,001XX W HUBBARD ST,0870,THEFT,POCKET-PICKING,TAVERN/LIQUOR STORE,False,...,1,0,0,0,0,0,0,0,0,0
2,2,11480254,JB480430,2018-07-27 00:00:00,012XX S MICHIGAN AVE,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,...,1,0,0,0,0,0,0,0,0,0
3,3,11395054,JB368869,2018-07-27 00:01:00,040XX N BROADWAY,0620,BURGLARY,UNLAWFUL ENTRY,APARTMENT,False,...,1,0,0,0,0,0,0,0,0,0
4,4,11395057,JB368870,2018-07-27 00:01:00,040XX N BROADWAY,1150,DECEPTIVE PRACTICE,CREDIT CARD FRAUD,APARTMENT,False,...,1,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103490,103490,11790084,JC387494,2019-08-10 23:45:00,001XX S HALSTED ST,0870,THEFT,POCKET-PICKING,BAR OR TAVERN,False,...,1,0,0,0,0,0,0,0,0,0
103491,103491,11790431,JC387911,2019-08-10 23:59:00,040XX S FEDERAL ST,1320,CRIMINAL DAMAGE,TO VEHICLE,STREET,False,...,1,0,0,0,0,0,0,0,0,0
103492,103492,11790428,JC387941,2019-08-11 00:00:00,031XX N SHEFFIELD AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,RESIDENCE,False,...,1,0,0,0,0,0,0,0,0,0
103493,103493,11791865,JC389527,2019-08-11 00:00:00,009XX W BELMONT AVE,0890,THEFT,FROM BUILDING,BAR OR TAVERN,False,...,1,0,0,0,0,0,0,0,0,0


For the purpose of the following analysis, we will not use any crimes that did not occur within the bandwidth of at least one venue. These crimes could be useful for training a predictive model to estimate the counterfactual of crimes at a venue, but otherwise should not be included in the analysis. 

In [4]:
relevant_crimes = data[data.near_any_venue==1]

In [5]:
len(relevant_crimes)

48197

Simple difference in means: $\bar{Y}(D_i>0)-\bar{Y}(D_i=0)$. By "mean", we mean the average number of crimes per time period under consideration. 

First we'll do this with days. There are 381 days and 26 venues in our dataset. So we calculate the average number of crimes near venues on days that they have events, minus the average number of crimes near venues on days that they don't have events.

In [6]:
venuedays_with_events = len(events)
venuedays_without_events = 381*26 - venuedays_with_events

avg_crime_per_day_events = len(relevant_crimes[relevant_crimes.event_day_crimes>0])/venuedays_with_events
avg_crime_per_day_noevents = len(relevant_crimes[relevant_crimes.event_day_crimes==0])/venuedays_without_events

print(f"Average crime near venues on days with events: {avg_crime_per_day_events}")
print(f"Average crime near venues on days without events: {avg_crime_per_day_noevents}")
print(f"Naive estimated effect of events on crime: {avg_crime_per_day_events-avg_crime_per_day_noevents}")

Average crime near venues on days with events: 10.518918918918919
Average crime near venues on days without events: 4.283153323683331
Naive estimated effect of events on crime: 6.235765595235588


So there are many more crimes near our venues on days where they have events than on days where they don't. This is a fact, but doesn't necessarily imply causation. As a causal estimate this is likely to be biased upwards for many reasons, primarily due to selection (see technical log).
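
The same difference in means can be computed from a venue-day table, which makes the two groups and their denominators explicit. This is a minimal sketch on made-up numbers; the `panel` frame and its columns are hypothetical and are not the notebook's data.

```python
import pandas as pd

# Hypothetical venue-day panel: one row per venue per day
panel = pd.DataFrame({
    "venue":  ["A", "A", "A", "B", "B", "B"],
    "event":  [1, 0, 0, 1, 1, 0],
    "crimes": [12, 5, 7, 9, 11, 4],
})

# Mean crimes per venue-day, separately for event and non-event venue-days
means = panel.groupby("event")["crimes"].mean()
naive_effect = means.loc[1] - means.loc[0]

print(means)
print(f"Naive difference in means: {naive_effect:.2f}")
```

Working from one row per venue-day also avoids having to count venue-days separately from the crimes.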

In [35]:
# repeat but for evening crimes

# note that the number of venue-days and venue-evenings with/without events is the same
avg_crime_per_evening_events = len(relevant_crimes[relevant_crimes.event_evening_crimes>0])/venuedays_with_events
avg_crime_per_evening_noevents = len(relevant_crimes[relevant_crimes.event_evening_crimes==0])/venuedays_without_events

print(f"Average crime near venues on evenings with events: {avg_crime_per_evening_events}")
print(f"Average crime near venues on evenings without events: {avg_crime_per_evening_noevents}")
print(f"Naive estimated effect of events on evening crime: {avg_crime_per_evening_events-avg_crime_per_evening_noevents}")

Average crime near venues on evenings with events: 3.089585666293393
Average crime near venues on evenings without events: 5.041384666592699
Naive estimated effect of events on evening crime: -1.9517990002993062


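When we only consider crimes that occurred in the evening, though, the estimated effect reverses! Taken at face value, there are _fewer_ evening crimes near our venues on days where there are events than on days where there aren't. This comparison should be read with caution, however: the filter `event_evening_crimes==0` appears to match every crime not flagged as an event-evening crime, including daytime crimes, which would explain why the "evenings without events" average (about 5.0) is higher than the all-day average on days without events (about 4.3).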

Now we'll move on to regression- and prediction-based approaches. Our outcome is the number of crimes per venue and date/time, so we will need to modify our dataset: each observation should be per venue, per date/time. 

Our first version of this will be for the day and evening time bandwidths. Our dataset will contain 26*381 = 9,906 observations: one per venue per day. Each observation's attributes will be venue (also encoded as dummy variables), date, number of day crimes, number of evening crimes, and a binary event indicator.
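
As a rough illustration of the target structure, the sketch below builds a small venue-day skeleton with venue dummies. The dates and venue names are placeholders, and `MultiIndex.from_product` is used here as one possible alternative to `itertools.product`, not as the notebook's method.

```python
import pandas as pd

# Hypothetical inputs: three days and two venues
dates = pd.date_range("2019-08-09", "2019-08-11", freq="D")
venues = ["Venue A", "Venue B"]

# One row per (date, venue) pair
index = pd.MultiIndex.from_product([dates, venues], names=["date", "Location"])
skeleton = index.to_frame(index=False)

# Venue dummies, later usable as venue fixed effects
dummies = pd.get_dummies(skeleton["Location"], dtype=int)
panel = pd.concat([skeleton, dummies], axis=1)
print(panel)
```

Crime counts and the event indicator would then be merged onto this skeleton by `date` and `Location`.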

In [8]:
relevant_crimes.date.max()

Timestamp('2019-08-11 00:00:00')

In [9]:
date_list = [dt.date.fromisoformat('2019-08-11') - dt.timedelta(days=i) for i in range(381)]
locations = list(set(events.Location))

In [10]:
# source: https://stackoverflow.com/questions/25634489/get-all-combinations-of-elements-from-two-lists
relevant_crimes_daily = pd.DataFrame(list(product(date_list, locations)), columns=['date','Location'])
relevant_crimes_daily['date'] = pd.to_datetime(relevant_crimes_daily['date'])
relevant_crimes_daily

Unnamed: 0,date,Location
0,2019-08-11,Aragon Ballroom
1,2019-08-11,Guaranteed Rate Field
2,2019-08-11,Grant Park
3,2019-08-11,Concord Music Hall
4,2019-08-11,United Center
...,...,...
9901,2018-07-27,Credit Union 1 Arena
9902,2018-07-27,Soldier Field
9903,2018-07-27,Riviera Theatre
9904,2018-07-27,Huntington Bank Pavilion


In [11]:
relevant_crimes_daily = pd.merge(relevant_crimes_daily, events[["date","Location","Category"]], on=['date','Location'],how='left')
relevant_crimes_daily

Unnamed: 0,date,Location,Category
0,2019-08-11,Aragon Ballroom,
1,2019-08-11,Guaranteed Rate Field,
2,2019-08-11,Grant Park,
3,2019-08-11,Concord Music Hall,
4,2019-08-11,United Center,
...,...,...,...
9932,2018-07-27,Credit Union 1 Arena,
9933,2018-07-27,Soldier Field,
9934,2018-07-27,Riviera Theatre,
9935,2018-07-27,Huntington Bank Pavilion,


Note that our dataframe increased in size. This indicates that there are actually events that occur at the same venue on the same day. We'll remove and ignore those. (Also, this means that our estimates above were actually slightly off!)
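
Row growth after a left merge is a sign of duplicate keys in the right-hand table. The sketch below shows one way to find such duplicates and guard against them, using made-up data; `events_demo`, `panel`, and the `n_events` column are hypothetical, and keeping an event count is an option rather than what this notebook does.

```python
import pandas as pd

# Hypothetical events table with two events at the same venue on the same day
events_demo = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-01", "2019-08-01", "2019-08-02"]),
    "Location": ["Venue A", "Venue A", "Venue B"],
    "Category": ["Music", "Sports", "Music"],
})

key = ["date", "Location"]

# Rows that share a (date, Location) key with at least one other row
dupes = events_demo[events_demo.duplicated(subset=key, keep=False)]
print(dupes)

# Collapse to one row per venue-day before merging, keeping a count of events
events_per_day = events_demo.groupby(key).size().reset_index(name="n_events")

panel = pd.DataFrame({
    "date": pd.to_datetime(["2019-08-01", "2019-08-02"]),
    "Location": ["Venue A", "Venue B"],
})

# validate raises a MergeError if the keys are not unique on both sides
merged = panel.merge(events_per_day, on=key, how="left", validate="one_to_one")
print(merged)
```

With the `validate` argument, an unexpected duplicate raises an error at merge time instead of silently adding rows.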

In future analysis it will be valuable to look at a breakdown by event type, as there could be heterogeneous treatment effects by type. But we will not do that here.

In [12]:
relevant_crimes_daily["event"] = np.where(
    relevant_crimes_daily.Category.isna(), 0, 1)
relevant_crimes_daily = relevant_crimes_daily.drop(columns=['Category']).drop_duplicates()
relevant_crimes_daily

Unnamed: 0,date,Location,event
0,2019-08-11,Aragon Ballroom,0
1,2019-08-11,Guaranteed Rate Field,0
2,2019-08-11,Grant Park,0
3,2019-08-11,Concord Music Hall,0
4,2019-08-11,United Center,0
...,...,...,...
9932,2018-07-27,Credit Union 1 Arena,0
9933,2018-07-27,Soldier Field,0
9934,2018-07-27,Riviera Theatre,0
9935,2018-07-27,Huntington Bank Pavilion,0


In [13]:
relevant_crimes_daily.event.value_counts()

0    9013
1     893
Name: event, dtype: int64

In [14]:
venuedays_with_events = len(relevant_crimes_daily[relevant_crimes_daily.event==1])
venuedays_without_events = len(relevant_crimes_daily[relevant_crimes_daily.event==0])

avg_crime_per_day_events = len(relevant_crimes[relevant_crimes.event_day_crimes>0])/venuedays_with_events
avg_crime_per_day_noevents = len(relevant_crimes[relevant_crimes.event_day_crimes==0])/venuedays_without_events

print("Days:")
print(f"Average crime near venues on days with events: {avg_crime_per_day_events}")
print(f"Average crime near venues on days without events: {avg_crime_per_day_noevents}")
print(f"Naive estimated effect of events on crime: {avg_crime_per_day_events-avg_crime_per_day_noevents}")

# note that the number of venue-days and venue-evenings with/without events is the same
avg_crime_per_evening_events = len(relevant_crimes[relevant_crimes.event_evening_crimes>0])/venuedays_with_events
avg_crime_per_evening_noevents = len(relevant_crimes[relevant_crimes.event_evening_crimes==0])/venuedays_without_events

print("\nEvenings:")
print(f"Average crime near venues on days with events: {avg_crime_per_evening_events}")
print(f"Average crime near venues on days without events: {avg_crime_per_evening_noevents}")
print(f"Naive estimated effect of events on crime: {avg_crime_per_evening_events-avg_crime_per_evening_noevents}")

Days:
Average crime near venues on days with events: 10.89585666293393
Average crime near venues on days without events: 4.267946299789194
Naive estimated effect of events on crime: 6.6279103631447365

Evenings:
Average crime near venues on days with events: 3.089585666293393
Average crime near venues on days without events: 5.041384666592699
Naive estimated effect of events on crime: -1.9517990002993062


In [15]:
for location in locations:
    relevant_crimes_daily[location] = np.where(
        relevant_crimes_daily.Location==location,
        1,0)
relevant_crimes_daily

Unnamed: 0,date,Location,event,Aragon Ballroom,Guaranteed Rate Field,Grant Park,Concord Music Hall,United Center,Petrillo Music Shell,Main Hall in UIC Dorin Forum,...,Union Station,The Chicago Theatre,Field Museum,Thalia Hall,Cinespace Chicago Film Studios,Credit Union 1 Arena,Soldier Field,Riviera Theatre,Huntington Bank Pavilion,Wrigley Field
0,2019-08-11,Aragon Ballroom,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2019-08-11,Guaranteed Rate Field,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,2019-08-11,Grant Park,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,2019-08-11,Concord Music Hall,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2019-08-11,United Center,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9932,2018-07-27,Credit Union 1 Arena,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
9933,2018-07-27,Soldier Field,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
9934,2018-07-27,Riviera Theatre,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
9935,2018-07-27,Huntington Bank Pavilion,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0


In [16]:
# assign returns a new DataFrame, which avoids the SettingWithCopyWarning
relevant_crimes = relevant_crimes.assign(day=relevant_crimes.date.dt.date)
# select the venue columns before summing so only those columns are aggregated
crimes_by_date = relevant_crimes.groupby(by=['day'])[locations].sum().reset_index()
crimes_by_date["day"] = pd.to_datetime(crimes_by_date["day"])
crimes_by_date.rename(columns={'day':'date'}, inplace=True)
crimes_by_date

Unnamed: 0,date,Aragon Ballroom,Guaranteed Rate Field,Grant Park,Concord Music Hall,United Center,Petrillo Music Shell,Main Hall in UIC Dorin Forum,Arie Crown Theater,Chicago Symphony Orchestra,...,Union Station,The Chicago Theatre,Field Museum,Thalia Hall,Cinespace Chicago Film Studios,Credit Union 1 Arena,Soldier Field,Riviera Theatre,Huntington Bank Pavilion,Wrigley Field
0,2018-07-27,8,4,13,6,7,20,2,10,28,...,13,43,8,10,6,5,10,8,1,14
1,2018-07-28,11,11,16,9,7,23,4,3,26,...,10,34,8,11,8,9,13,11,2,12
2,2018-07-29,3,5,5,11,17,22,6,8,24,...,14,45,3,6,6,6,6,3,0,21
3,2018-07-30,6,3,14,4,5,35,8,8,39,...,19,43,10,6,8,4,16,6,0,13
4,2018-07-31,11,4,12,7,6,19,4,4,31,...,32,41,9,3,5,10,12,12,0,8
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
376,2019-08-07,9,7,11,5,6,23,4,4,26,...,14,34,8,5,7,6,11,10,1,7
377,2019-08-08,17,6,14,7,3,25,7,6,33,...,23,42,8,5,5,5,13,16,0,11
378,2019-08-09,7,4,9,7,11,21,9,6,23,...,11,34,4,5,6,13,6,8,0,9
379,2019-08-10,10,6,7,6,7,15,5,7,18,...,13,41,5,5,8,11,9,10,0,16


In [17]:
# assign proper crimes from crimes_by_date to relevant_crimes_daily
relevant_crimes_daily = pd.merge(relevant_crimes_daily, crimes_by_date, on=['date'], how='left')
relevant_crimes_daily

Unnamed: 0,date,Location,event,Aragon Ballroom_x,Guaranteed Rate Field_x,Grant Park_x,Concord Music Hall_x,United Center_x,Petrillo Music Shell_x,Main Hall in UIC Dorin Forum_x,...,Union Station_y,The Chicago Theatre_y,Field Museum_y,Thalia Hall_y,Cinespace Chicago Film Studios_y,Credit Union 1 Arena_y,Soldier Field_y,Riviera Theatre_y,Huntington Bank Pavilion_y,Wrigley Field_y
0,2019-08-11,Aragon Ballroom,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
1,2019-08-11,Guaranteed Rate Field,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
2,2019-08-11,Grant Park,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,3
3,2019-08-11,Concord Music Hall,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,3
4,2019-08-11,United Center,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9901,2018-07-27,Credit Union 1 Arena,0,0,0,0,0,0,0,0,...,13,43,8,10,6,5,10,8,1,14
9902,2018-07-27,Soldier Field,0,0,0,0,0,0,0,0,...,13,43,8,10,6,5,10,8,1,14
9903,2018-07-27,Riviera Theatre,0,0,0,0,0,0,0,0,...,13,43,8,10,6,5,10,8,1,14
9904,2018-07-27,Huntington Bank Pavilion,0,0,0,0,0,0,0,0,...,13,43,8,10,6,5,10,8,1,14


In [18]:
relevant_crimes_daily["event_day_crimes"] = 0
for location in locations:
    relevant_crimes_daily["event_day_crimes"] = np.where(
        relevant_crimes_daily.Location==location,
        relevant_crimes_daily[location+'_y'],
        relevant_crimes_daily["event_day_crimes"])
relevant_crimes_daily

Unnamed: 0,date,Location,event,Aragon Ballroom_x,Guaranteed Rate Field_x,Grant Park_x,Concord Music Hall_x,United Center_x,Petrillo Music Shell_x,Main Hall in UIC Dorin Forum_x,...,The Chicago Theatre_y,Field Museum_y,Thalia Hall_y,Cinespace Chicago Film Studios_y,Credit Union 1 Arena_y,Soldier Field_y,Riviera Theatre_y,Huntington Bank Pavilion_y,Wrigley Field_y,event_day_crimes
0,2019-08-11,Aragon Ballroom,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3,0
1,2019-08-11,Guaranteed Rate Field,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,3,0
2,2019-08-11,Grant Park,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,3,0
3,2019-08-11,Concord Music Hall,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,3,0
4,2019-08-11,United Center,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9901,2018-07-27,Credit Union 1 Arena,0,0,0,0,0,0,0,0,...,43,8,10,6,5,10,8,1,14,5
9902,2018-07-27,Soldier Field,0,0,0,0,0,0,0,0,...,43,8,10,6,5,10,8,1,14,10
9903,2018-07-27,Riviera Theatre,0,0,0,0,0,0,0,0,...,43,8,10,6,5,10,8,1,14,8
9904,2018-07-27,Huntington Bank Pavilion,0,0,0,0,0,0,0,0,...,43,8,10,6,5,10,8,1,14,1


In [19]:
relevant_crimes_daily = relevant_crimes_daily[['date', 'Location', 'event','event_day_crimes', 'Thalia Hall_x',
       'Huntington Bank Pavilion_x', 'Union Station_x',
       'Cinespace Chicago Film Studios_x', 'Lincoln Park Zoo_x',
       'Chicago Symphony Orchestra_x', 'Petrillo Music Shell_x',
       'Millennium Park_x', 'Gallagher Way_x', 'Wrigley Field_x',
       'Concord Music Hall_x', 'Sheraton Grand Chicago_x',
       'The Chicago Theatre_x', 'Aragon Ballroom_x', 'Auditorium Theatre_x',
       'Museum of Science and Industry_x', 'Field Museum_x', 'Grant Park_x',
       'Main Hall in UIC Dorin Forum_x', 'Arie Crown Theater_x',
       'Soldier Field_x', 'Civic Opera House_x', 'United Center_x',
       'Guaranteed Rate Field_x', 'Credit Union 1 Arena_x',
       'Riviera Theatre_x']]

In [20]:
for column in relevant_crimes_daily.columns:
    if column.endswith('_x'):
        relevant_crimes_daily = relevant_crimes_daily.rename(columns={column:column[:-2]})
relevant_crimes_daily

Unnamed: 0,date,Location,event,event_day_crimes,Thalia Hall,Huntington Bank Pavilion,Union Station,Cinespace Chicago Film Studios,Lincoln Park Zoo,Chicago Symphony Orchestra,...,Field Museum,Grant Park,Main Hall in UIC Dorin Forum,Arie Crown Theater,Soldier Field,Civic Opera House,United Center,Guaranteed Rate Field,Credit Union 1 Arena,Riviera Theatre
0,2019-08-11,Aragon Ballroom,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2019-08-11,Guaranteed Rate Field,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,2019-08-11,Grant Park,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
3,2019-08-11,Concord Music Hall,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2019-08-11,United Center,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9901,2018-07-27,Credit Union 1 Arena,0,5,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
9902,2018-07-27,Soldier Field,0,10,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
9903,2018-07-27,Riviera Theatre,0,8,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
9904,2018-07-27,Huntington Bank Pavilion,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [21]:
# now make this a function and add other crime types

def add_crimes(relevant_crimes_daily, relevant_crimes, crime_type):
    """Add a per-venue, per-day count of crimes flagged by `crime_type` as a new column."""
    # count flagged crimes near each venue on each day
    flagged = relevant_crimes[relevant_crimes[crime_type] > 0]
    crimes_by_date = flagged.groupby(by=['day'])[locations].sum().reset_index()
    crimes_by_date["day"] = pd.to_datetime(crimes_by_date["day"])
    crimes_by_date.rename(columns={'day': 'date'}, inplace=True)
    # after the merge, venue dummies carry the suffix _x and crime counts the suffix _y
    relevant_crimes_daily = pd.merge(relevant_crimes_daily, crimes_by_date, on=['date'], how='left')
    relevant_crimes_daily = relevant_crimes_daily.fillna(0)
    # for each row, pick the count belonging to that row's own venue
    relevant_crimes_daily[crime_type] = 0
    for location in locations:
        relevant_crimes_daily[crime_type] = np.where(
            relevant_crimes_daily.Location == location,
            relevant_crimes_daily[location + '_y'],
            relevant_crimes_daily[crime_type])
    # drop the merged count columns and restore the original dummy names
    relevant_crimes_daily = relevant_crimes_daily.drop(
        columns=[location + '_y' for location in locations])
    relevant_crimes_daily = relevant_crimes_daily.rename(
        columns=lambda c: c[:-2] if c.endswith('_x') else c)

    return relevant_crimes_daily

event_types = [
    "event_day_crimes","event_evening_crimes","event_3to6_crimes","event_4to7_crimes","event_5to8_crimes",
    "event_6to9_crimes","event_7to10_crimes","event_8to11_crimes","event_9to12_crimes"
    ]
for crime_type in event_types[1:]:
    relevant_crimes_daily = add_crimes(relevant_crimes_daily,relevant_crimes,crime_type)
relevant_crimes_daily

Unnamed: 0,date,Location,event,event_day_crimes,Thalia Hall,Huntington Bank Pavilion,Union Station,Cinespace Chicago Film Studios,Lincoln Park Zoo,Chicago Symphony Orchestra,...,Credit Union 1 Arena,Riviera Theatre,event_evening_crimes,event_3to6_crimes,event_4to7_crimes,event_5to8_crimes,event_6to9_crimes,event_7to10_crimes,event_8to11_crimes,event_9to12_crimes
0,2019-08-11,Aragon Ballroom,0,0,0,0,0,0,0,0,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2019-08-11,Guaranteed Rate Field,0,0,0,0,0,0,0,0,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2019-08-11,Grant Park,0,0,0,0,0,0,0,0,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,2019-08-11,Concord Music Hall,0,0,0,0,0,0,0,0,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,2019-08-11,United Center,0,0,0,0,0,0,0,0,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9901,2018-07-27,Credit Union 1 Arena,0,5,0,0,0,0,0,0,...,1,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9902,2018-07-27,Soldier Field,0,10,0,0,0,0,0,0,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9903,2018-07-27,Riviera Theatre,0,8,0,0,0,0,0,0,...,0,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9904,2018-07-27,Huntington Bank Pavilion,0,1,0,1,0,0,0,0,...,0,0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Now we'll first try a simple linear regression of daily crimes on whether there was an event nearby. This could recover an unbiased estimate of the effect of events on crimes if venues had been completely randomly assigned to have events - which they obviously did not.

In [27]:
X = relevant_crimes_daily.event
Y = relevant_crimes_daily.event_day_crimes

# sklearn version
#day_regression = linear_model.LinearRegression()
#day_regression.fit(X,Y)

print('Same day - simple regression:')
sm.OLS(Y, X).fit().summary()

Same day - simple regression:


0,1,2,3
Dep. Variable:,event_day_crimes,R-squared (uncentered):,0.042
Model:,OLS,Adj. R-squared (uncentered):,0.041
Method:,Least Squares,F-statistic:,429.9
Date:,"Mon, 13 Jun 2022",Prob (F-statistic):,1.64e-93
Time:,21:08:06,Log-Likelihood:,-42573.0
No. Observations:,9906,AIC:,85150.0
Df Residuals:,9905,BIC:,85150.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
event,12.3449,0.595,20.733,0.000,11.178,13.512

0,1,2,3
Omnibus:,3404.642,Durbin-Watson:,0.974
Prob(Omnibus):,0.0,Jarque-Bera (JB):,15920.253
Skew:,1.611,Prob(JB):,0.0
Kurtosis:,8.309,Cond. No.,1.0


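Taken at face value, this estimates that having an event causes, on average, 12 more crimes that day within a 2km square around the venue. Note, however, that the model above does not include a constant (hence the "uncentered" R-squared), so the coefficient is really the average number of crimes on event venue-days rather than the difference from non-event venue-days. Either way, the estimate is certainly biased because events were not randomized; any factors that make both crimes and events more likely would induce upward bias (for example, day of the week, or venue type).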
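
statsmodels `OLS` does not add an intercept automatically, so a regression on the event dummy alone is a regression through the origin. The sketch below uses made-up numbers to show what each version estimates. It uses `numpy.linalg.lstsq` instead of statsmodels to stay dependency-light; in statsmodels the intercept column would be added with `sm.add_constant(X)`.

```python
import numpy as np

# Hypothetical data: event indicator and crime counts per venue-day
event = np.array([1, 1, 1, 0, 0, 0, 0], dtype=float)
crimes = np.array([12, 9, 11, 5, 7, 4, 6], dtype=float)

# Regression through the origin: crimes ~ event
X_no_const = event.reshape(-1, 1)
beta_no_const = np.linalg.lstsq(X_no_const, crimes, rcond=None)[0]

# Regression with an intercept: crimes ~ 1 + event
X_const = np.column_stack([np.ones_like(event), event])
beta_const = np.linalg.lstsq(X_const, crimes, rcond=None)[0]

mean_event = crimes[event == 1].mean()
mean_no_event = crimes[event == 0].mean()

print("No constant, event coefficient:  ", beta_no_const[0])
print("With constant, event coefficient:", beta_const[1])
print("Mean on event days:              ", mean_event)
print("Difference in means:             ", mean_event - mean_no_event)
```

Without a constant the coefficient reproduces the event-day mean; with a constant it reproduces the difference in means.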

To partially account for this, we will now repeat the regression with venue fixed effects and then with venue and date fixed effects.
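
Venue fixed effects mean the event coefficient is identified only from variation within each venue. The sketch below illustrates this on made-up data: regressing on a full set of venue dummies gives the same event coefficient as demeaning both variables by venue first. The `df` frame is hypothetical, and `numpy` is used in place of statsmodels to keep the example small.

```python
import numpy as np
import pandas as pd

# Hypothetical venue-day panel: venue A has more crime and more events than venue B
df = pd.DataFrame({
    "venue":  ["A"] * 4 + ["B"] * 4,
    "event":  [1, 0, 0, 1, 0, 0, 1, 0],
    "crimes": [30, 27, 28, 31, 5, 4, 7, 6],
})
y = df["crimes"].to_numpy(dtype=float)
d = df["event"].to_numpy(dtype=float)

# (1) Dummy-variable version: event plus a full set of venue dummies, no constant
dummies = pd.get_dummies(df["venue"], dtype=float).to_numpy()
X = np.column_stack([d, dummies])
beta_dummies = np.linalg.lstsq(X, y, rcond=None)[0][0]

# (2) Within version: demean event and crimes by venue, then regress
event_dm = df["event"] - df.groupby("venue")["event"].transform("mean")
crimes_dm = df["crimes"] - df.groupby("venue")["crimes"].transform("mean")
beta_within = (event_dm * crimes_dm).sum() / (event_dm ** 2).sum()

# Pooled slope that ignores venues, for comparison
pooled_slope = np.polyfit(d, y, 1)[0]

print("Venue dummies:", beta_dummies)
print("Within-venue: ", beta_within)
print("Pooled:       ", pooled_slope)
```

In this toy data the pooled slope is much larger than the within-venue estimate, because the busier venue also hosts more events.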

In [28]:
X = relevant_crimes_daily.drop(columns=['Location','date']+event_types)
Y = relevant_crimes_daily.event_day_crimes

print('Same-day crime - venue fixed effects:')
sm.OLS(Y, X).fit().summary()

Same-day crime - venue fixed effects:


0,1,2,3
Dep. Variable:,event_day_crimes,R-squared:,0.759
Model:,OLS,Adj. R-squared:,0.759
Method:,Least Squares,F-statistic:,1200.0
Date:,"Mon, 13 Jun 2022",Prob (F-statistic):,0.0
Time:,21:08:12,Log-Likelihood:,-32022.0
No. Observations:,9906,AIC:,64100.0
Df Residuals:,9879,BIC:,64290.0
Df Model:,26,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
event,0.1228,0.227,0.540,0.589,-0.323,0.568
Thalia Hall,4.8968,0.319,15.370,0.000,4.272,5.521
Huntington Bank Pavilion,0.3607,0.315,1.146,0.252,-0.257,0.978
Union Station,16.1913,0.315,51.463,0.000,15.575,16.808
Cinespace Chicago Film Studios,6.3330,0.315,20.129,0.000,5.716,6.950
Lincoln Park Zoo,3.8767,0.315,12.316,0.000,3.260,4.494
Chicago Symphony Orchestra,27.7237,0.316,87.708,0.000,27.104,28.343
Petrillo Music Shell,23.3553,0.315,74.230,0.000,22.739,23.972
Millennium Park,31.8953,0.316,100.810,0.000,31.275,32.516

0,1,2,3
Omnibus:,9107.572,Durbin-Watson:,1.597
Prob(Omnibus):,0.0,Jarque-Bera (JB):,961237.631
Skew:,4.044,Prob(JB):,0.0
Kurtosis:,50.576,Cond. No.,1.89


In [26]:
time_dummies = relevant_crimes_daily.copy(deep=True).drop(columns='Location')
time_dummies["date"]=time_dummies.date.astype(str)
time_dummies = pd.get_dummies(time_dummies)
time_dummies

Unnamed: 0,event,event_day_crimes,Thalia Hall,Huntington Bank Pavilion,Union Station,Cinespace Chicago Film Studios,Lincoln Park Zoo,Chicago Symphony Orchestra,Petrillo Music Shell,Millennium Park,...,date_2019-08-02,date_2019-08-03,date_2019-08-04,date_2019-08-05,date_2019-08-06,date_2019-08-07,date_2019-08-08,date_2019-08-09,date_2019-08-10,date_2019-08-11
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9901,0,5,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9902,0,10,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9903,0,8,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
9904,0,1,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [29]:
X = time_dummies.drop(columns=event_types)
Y = time_dummies.event_day_crimes

print('Same-day crime - venue and time fixed effects:')
sm.OLS(Y, X).fit().summary()

Same-day crime - venue and time fixed effects:


0,1,2,3
Dep. Variable:,event_day_crimes,R-squared:,0.825
Model:,OLS,Adj. R-squared:,0.817
Method:,Least Squares,F-statistic:,110.1
Date:,"Mon, 13 Jun 2022",Prob (F-statistic):,0.0
Time:,21:08:18,Log-Likelihood:,-30452.0
No. Observations:,9906,AIC:,61720.0
Df Residuals:,9499,BIC:,64650.0
Df Model:,406,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
event,-0.0080,0.205,-0.039,0.969,-0.409,0.393
Thalia Hall,4.0831,0.277,14.757,0.000,3.541,4.625
Huntington Bank Pavilion,-0.4746,0.273,-1.736,0.083,-1.010,0.061
Union Station,15.3491,0.273,56.188,0.000,14.814,15.885
Cinespace Chicago Film Studios,5.4908,0.273,20.100,0.000,4.955,6.026
Lincoln Park Zoo,3.0397,0.273,11.123,0.000,2.504,3.575
Chicago Symphony Orchestra,26.8987,0.274,98.018,0.000,26.361,27.437
Petrillo Music Shell,22.5145,0.273,82.418,0.000,21.979,23.050
Millennium Park,31.0720,0.275,113.114,0.000,30.534,31.610

0,1,2,3
Omnibus:,6432.517,Durbin-Watson:,2.204
Prob(Omnibus):,0.0,Jarque-Bera (JB):,340502.762
Skew:,2.463,Prob(JB):,0.0
Kurtosis:,31.297,Cond. No.,1.18e+16


So the estimated effect disappears with fixed effects.

We shouldn't pay too much attention to these, though: a full day is almost certainly too wide a time window, and could hide any real effect.

We'll try the same thing with crimes from 6pm - midnight on the day of an event.

In [31]:
X = relevant_crimes_daily.event
Y = relevant_crimes_daily.event_evening_crimes

print('Evening - simple regression:')
sm.OLS(Y, X).fit().summary()

Evening - simple regression:


0,1,2,3
Dep. Variable:,event_evening_crimes,R-squared (uncentered):,0.19
Model:,OLS,Adj. R-squared (uncentered):,0.19
Method:,Least Squares,F-statistic:,2319.0
Date:,"Mon, 13 Jun 2022",Prob (F-statistic):,0.0
Time:,21:08:44,Log-Likelihood:,-21658.0
No. Observations:,9906,AIC:,43320.0
Df Residuals:,9905,BIC:,43320.0
Df Model:,1,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
event,3.4714,0.072,48.156,0.000,3.330,3.613

0,1,2,3
Omnibus:,6206.755,Durbin-Watson:,1.417
Prob(Omnibus):,0.0,Jarque-Bera (JB):,70073.071
Skew:,2.899,Prob(JB):,0.0
Kurtosis:,14.669,Cond. No.,1.0


In [33]:
X = relevant_crimes_daily.drop(columns=['Location','date']+event_types)
Y = relevant_crimes_daily.event_evening_crimes

print('Evening - venue fixed effects:')
sm.OLS(Y, X).fit().summary()

Evening - venue fixed effects:


0,1,2,3
Dep. Variable:,event_evening_crimes,R-squared:,0.312
Model:,OLS,Adj. R-squared:,0.31
Method:,Least Squares,F-statistic:,172.2
Date:,"Mon, 13 Jun 2022",Prob (F-statistic):,0.0
Time:,21:11:09,Log-Likelihood:,-19978.0
No. Observations:,9906,AIC:,40010.0
Df Residuals:,9879,BIC:,40200.0
Df Model:,26,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
event,2.9689,0.067,44.050,0.000,2.837,3.101
Thalia Hall,-0.2819,0.094,-2.984,0.003,-0.467,-0.097
Huntington Bank Pavilion,-0.0901,0.093,-0.966,0.334,-0.273,0.093
Union Station,0.8846,0.093,9.484,0.000,0.702,1.067
Cinespace Chicago Film Studios,-0.0052,0.093,-0.055,0.956,-0.188,0.178
Lincoln Park Zoo,-0.0722,0.093,-0.774,0.439,-0.255,0.111
Chicago Symphony Orchestra,2.0383,0.094,21.751,0.000,1.855,2.222
Petrillo Music Shell,2.0896,0.093,22.402,0.000,1.907,2.272
Millennium Park,2.3746,0.094,25.316,0.000,2.191,2.559

0,1,2,3
Omnibus:,5243.456,Durbin-Watson:,1.451
Prob(Omnibus):,0.0,Jarque-Bera (JB):,54944.595
Skew:,2.329,Prob(JB):,0.0
Kurtosis:,13.556,Cond. No.,1.89


In [34]:
X = time_dummies.drop(columns=event_types)
Y = time_dummies.event_evening_crimes

print('Evening crimes - venue and time fixed effects:')
sm.OLS(Y, X).fit().summary()

Evening crimes - venue and time fixed effects:


0,1,2,3
Dep. Variable:,event_evening_crimes,R-squared:,0.487
Model:,OLS,Adj. R-squared:,0.465
Method:,Least Squares,F-statistic:,22.17
Date:,"Mon, 13 Jun 2022",Prob (F-statistic):,0.0
Time:,21:11:33,Log-Likelihood:,-18528.0
No. Observations:,9906,AIC:,37870.0
Df Residuals:,9499,BIC:,40800.0
Df Model:,406,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
event,2.5952,0.061,42.251,0.000,2.475,2.716
Thalia Hall,-0.2459,0.083,-2.961,0.003,-0.409,-0.083
Huntington Bank Pavilion,-0.1160,0.082,-1.414,0.158,-0.277,0.045
Union Station,0.8392,0.082,10.237,0.000,0.678,1.000
Cinespace Chicago Film Studios,-0.0506,0.082,-0.617,0.537,-0.211,0.110
Lincoln Park Zoo,-0.1029,0.082,-1.255,0.210,-0.264,0.058
Chicago Symphony Orchestra,2.0419,0.082,24.795,0.000,1.880,2.203
Petrillo Music Shell,2.0481,0.082,24.985,0.000,1.887,2.209
Millennium Park,2.3832,0.082,28.911,0.000,2.222,2.545

0,1,2,3
Omnibus:,3445.985,Durbin-Watson:,1.952
Prob(Omnibus):,0.0,Jarque-Bera (JB):,21061.348
Skew:,1.537,Prob(JB):,0.0
Kurtosis:,9.448,Cond. No.,1.18e+16


With or without venue and time fixed effects, these models estimate that having an event on the same day increases evening crimes within a 2 km square by 2 to 3 per event (remember that we are treating multiple nearby events' effects as linearly additive). These estimates are both statistically and practically significant (remember that on average there are only 3 to 5 crimes logged per evening within a 2 km square of all the venues). Also, the likely-larger-than-ideal time bandwidth would probably attenuate the estimate relative to the real effect. On the other hand, these estimates are still very susceptible to bias from unobserved factors.

Finally, we'll estimate time and venue fixed effects regressions of crimes within each 3-hour time window on same-day events. Note that this is roughly equivalent to using a linear model to predict the crimes near each venue at each date and comparing that with the actual observed number of crimes (see technical log for more discussion).
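
The specification behind these regressions can be written out explicitly. The notation is introduced here for illustration and does not appear elsewhere in the notebook: $Y^{w}_{vt}$ is the number of crimes near venue $v$ on date $t$ in time window $w$, $D_{vt}$ indicates a same-day event at that venue, and $\alpha_v$ and $\gamma_t$ are venue and date fixed effects.

$$Y^{w}_{vt} = \beta^{w} D_{vt} + \alpha_v + \gamma_t + \varepsilon^{w}_{vt}$$

One coefficient $\beta^{w}$ is estimated per window, and it has a causal interpretation only if $D_{vt}$ is unrelated to $\varepsilon^{w}_{vt}$ once venue and date are accounted for.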

In [38]:
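results = {}
# the design matrix is the same for every time window, so build it once
X = time_dummies.drop(columns=event_types)
for event_type in event_types[2:]:
    Y = time_dummies[event_type]
    model = sm.OLS(Y, X).fit()
    # look up the event coefficient by name rather than by position
    results[event_type] = [model.params['event'], model.pvalues['event']]
print('Estimated effect of event on crimes by time period (and p values)')
results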

Estimated effect of event on crimes by time period (and p values)


{'event_3to6_crimes': [2.1834760374115114, 7.469817183225417e-200],
 'event_4to7_crimes': [2.1498151484956436, 3.28960746954763e-224],
 'event_5to8_crimes': [2.0540709527736567, 1.902204822424958e-252],
 'event_6to9_crimes': [1.9001599035840568, 2.028412114398096e-295],
 'event_7to10_crimes': [1.7346327473656589, 0.0],
 'event_8to11_crimes': [1.5267930299804386, 0.0],
 'event_9to12_crimes': [1.0598827077599116, 1.708379691437031e-289]}

These results are a bit strange. The estimates are of similar magnitude (events cause 1-2 additional crimes in a 2 km square), declining later in the evening, and are very statistically significant for every window from 3pm to midnight.

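Personally, I think these results are indicative of unaccounted-for selection: omitted factors that make both an event and crime at various times of day more likely. Another possibility worth checking is mechanical: `add_crimes` counts only crimes whose window flag is positive, so the outcome on venue-days without events may be close to zero by construction, which would inflate the `event` coefficient.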