# ANOVA

Analysis of Variance

---

ANOVA is a statistical test for comparing the means of multiple groups. It determines whether at least one group mean differs significantly from the others by comparing the variance between the groups with the variance within them.
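
To make the between-group versus within-group comparison concrete, here is a minimal sketch that computes the F-statistic by hand and checks it against `scipy.stats.f_oneway`. The three groups are made-up numbers chosen only for illustration; they are not taken from the incident data.

```python
import numpy as np
from scipy import stats

# Three small, made-up groups (illustrative values only)
groups = [
    np.array([4.0, 5.0, 6.0]),
    np.array([6.0, 7.0, 8.0]),
    np.array([9.0, 10.0, 11.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)       # number of groups
n = all_values.size   # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)

# Within-group sum of squares: spread of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

# p-value: upper tail of the F distribution with (k - 1, n - k) degrees of freedom
p_manual = stats.f.sf(f_manual, k - 1, n - k)

# The library routine should agree with the manual calculation
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"manual F = {f_manual:.4f}, scipy F = {f_scipy:.4f}")
```

A large F means the group means are spread out more than the scatter inside each group would explain by chance.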

In [10]:
# Import statements
import scipy.stats as stats
import pandas as pd

## Data Preparation

In [11]:
# Load the dataset
tps_incidents = pd.read_csv('../data/tps_incident_data_2010.csv')
display(tps_incidents)

Unnamed: 0,Dispatch_Time,ID,Incident_Type,Priority_Number,Units_Arrived_At_Scene,FSA
0,2010-01-01 00:01:36,3061660,Medical,4,2,M5R
1,2010-01-01 00:04:23,3061663,Medical,1,1,M5V
2,2010-01-01 00:05:52,3061664,Medical,5,1,M5N
3,2010-01-01 00:09:53,3061667,Medical,1,1,M4Y
4,2010-01-01 00:10:36,3061668,Medical,1,2,M6K
...,...,...,...,...,...,...
204800,2010-12-31 23:54:16,3364482,Medical,4,1,M5B
204801,2010-12-31 23:54:35,3364481,Medical,4,1,M1H
204802,2010-12-31 23:54:40,3364480,Medical,1,1,M4X
204803,2010-12-31 23:57:24,3364485,Medical,1,1,M6K


In [12]:
# Convert Dispatch_Time to a datetime format
tps_incidents['Dispatch_Time_Datetime'] = pd.to_datetime(tps_incidents['Dispatch_Time'])

# Set the parsed Dispatch_Time_Datetime column as the index
tps_incidents.set_index('Dispatch_Time_Datetime', inplace=True)

# Resample the data to a daily interval and count the number of incidents per day
daily_counts = tps_incidents.resample('D').size()

In [13]:
# Drop any dates in 2011 from daily_counts
daily_counts = daily_counts[daily_counts.index < "2011-01-01"]

# Convert to a dataframe
daily_counts = daily_counts.to_frame(name="Count")
display(daily_counts)

Dispatch_Time_Datetime,Count
2010-01-01,609
2010-01-02,478
2010-01-03,526
2010-01-04,497
2010-01-05,545
...,...
2010-12-27,616
2010-12-28,620
2010-12-29,605
2010-12-30,631


## ANOVA by Month

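The one-way ANOVA below tests the null hypothesis that the mean daily incident count is the same in every month. The p-value of the F-test is below 0.05 (it prints as 0.0000, meaning it is smaller than 0.00005), so we reject the null hypothesis: mean daily incident counts differ significantly between months.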

In [14]:
# Add a new column for the month
daily_counts['Month'] = daily_counts.index.month

# Counts for each month
monthly_counts = [daily_counts[daily_counts['Month'] == m]['Count'] for m in range(1, 13)]

# Perform ANOVA
f_stat, p_value = stats.f_oneway(*monthly_counts)

# Display the results formatted to 4 decimal places
print(f"ANOVA F-statistic: {f_stat:.4f}, \np-value: {p_value:.4f}")

ANOVA F-statistic: 44.2897, 
p-value: 0.0000
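
A p-value printed as `0.0000` is not exactly zero; it is only smaller than what four decimal places can show. The sketch below contrasts fixed-point and scientific formatting. The value of `p_value` here is a hypothetical stand-in, not the result of the test above.

```python
# Hypothetical, very small p-value used only to illustrate formatting
p_value = 3.2e-12

fixed = f"{p_value:.4f}"   # fixed-point: rounds to four decimal places
sci = f"{p_value:.3e}"     # scientific notation: keeps the order of magnitude

print(f"fixed-point: {fixed}")   # fixed-point: 0.0000
print(f"scientific:  {sci}")     # scientific:  3.200e-12
```

Using a format such as `{p_value:.3e}` in the `print` call reports the magnitude of the p-value instead of a row of zeros.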


## ANOVA by Day of the Week

The p-value of the F-test (0.0055) is again below 0.05, so mean daily incident counts also differ significantly between days of the week. The evidence is weaker than for months, however: the F-statistic is 3.1110 here versus 44.2897 for the monthly test.

In [15]:
# Create a column for the day of the week (Monday=0, Sunday=6)
daily_counts["DayOfWeek"] = daily_counts.index.dayofweek  

# Group by day of the week
weekly_counts = [daily_counts[daily_counts["DayOfWeek"] == d]["Count"] for d in range(7)]

# Perform ANOVA
f_stat, p_value = stats.f_oneway(*weekly_counts)

# Output results
print(f"ANOVA F-statistic: {f_stat:.4f}, \np-value: {p_value:.4f}")

ANOVA F-statistic: 3.1110, 
p-value: 0.0055
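
F-statistics from tests with different numbers of groups are not directly comparable, so an effect-size measure can help when weighing the monthly result against the day-of-week result. One common choice is eta squared, the share of total variance explained by the grouping. The helper `eta_squared` below is a hypothetical function written for this sketch, and the example groups are made-up numbers.

```python
import numpy as np

def eta_squared(groups):
    """Return SS_between / SS_total for a list of 1-D samples."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Variation of the group means around the grand mean
    ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
    # Total variation of all observations around the grand mean
    ss_total = ((all_values - grand_mean) ** 2).sum()
    return ss_between / ss_total

# Made-up groups, for illustration only
example_groups = [[4, 5, 6], [6, 7, 8], [9, 10, 11]]
eta2 = eta_squared(example_groups)
print(f"eta squared: {eta2:.4f}")
```

Under these assumptions, calling `eta_squared(monthly_counts)` and `eta_squared(weekly_counts)` would give two values on the same 0 to 1 scale that can be compared directly.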


## ANOVA by Hours of the Day

The p-value of the F-test is below 0.05 (it prints as 0.0000 after rounding to four decimal places), so mean incident counts differ significantly between hours of the day. At 579.7488, this is by far the largest F-statistic of the three tests.

In [21]:
# Extract the hour and the calendar date from the datetime index
tps_incidents['Hour'] = tps_incidents.index.hour
tps_incidents["Date"] = tps_incidents.index.date

# Group by date and hour, then count the incidents in each (date, hour) pair
hourly_counts = tps_incidents.groupby(['Date', 'Hour']).size().reset_index(name='Count')

# Prepare ANOVA input: for each hour of the day (0-23), an array of per-day incident counts
hourly_incidents = [hourly_counts[hourly_counts['Hour'] == h]['Count'].values for h in range(24)]

# Perform ANOVA
f_stat, p_value = stats.f_oneway(*hourly_incidents)

# Output results
print(f"ANOVA F-statistic: {f_stat:.4f}, \np-value: {p_value:.4f}")

ANOVA F-statistic: 579.7488, 
p-value: 0.0000
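
One caveat about this grouping: `groupby(['Date', 'Hour']).size()` only returns rows for (date, hour) pairs that contain at least one incident, so any hour with zero incidents is left out of the ANOVA input instead of counting as 0. With more than 200,000 incidents in the year this may rarely or never happen, but the sketch below shows one way to fill such gaps. The toy timestamps are made up for illustration, and the frame covers only hours 0 and 1 to keep the output small.

```python
import pandas as pd

# Made-up timestamps: nothing is dispatched during hour 1 on 2010-01-02
times = pd.to_datetime([
    "2010-01-01 00:10", "2010-01-01 00:40", "2010-01-01 01:05",
    "2010-01-02 00:15",
])
toy = pd.DataFrame(index=times)
toy["Hour"] = toy.index.hour
toy["Date"] = toy.index.date

# Observed counts only: the empty (2010-01-02, hour 1) pair is missing
observed = toy.groupby(["Date", "Hour"]).size()

# Pivot to a Date x Hour table, filling missing pairs with 0, and make sure
# every hour of interest has a column (use range(24) for a full day)
wide = observed.unstack(fill_value=0).reindex(columns=range(2), fill_value=0)

# One array of per-day counts for each hour, zeros included
hourly_groups = [wide[h].values for h in range(2)]
print(wide)
```

This approach still omits any date with no incidents at all, which would need a separate reindex over the full date range.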
