# Team-5---Project-1
***
For our project, we investigated traffic accidents and the circumstances surrounding them.

Our data was obtained from __[Kaggle](https://www.kaggle.com/sobhanmoosavi/us-accidents)__

## Team Members:
***
Chris Hauck, Colleen Banzhof, Ken Njema, and Eric Lewiston

# Dependencies and Setup
***

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as st
import seaborn as sns

# config is a Python file holding the variables shared between Clean_Data.ipynb and main.ipynb
from config import destination_zip, col_types

# Global Plot Style Settings
***

In [None]:
# plot options and style
sns.set_style('whitegrid')
# stacked bar chart colors
colors = ("#74C476", "#4DA75D", "#278A45", "#006D2C")

def set_chart_style(ax, title=None, ylabel=None, xlabel=None):
    """Apply the shared style (tick rotation, grid, colors, size, labels) to ax and return it."""
    # style the axes that were passed in rather than whichever axes happen to be current
    ax.tick_params(axis='x', labelrotation=45)
    ax.grid(color='#95a5a6', linestyle='--', linewidth=1, axis='y', alpha=0.7)
    ax.set_facecolor('ivory')
    # resize the figure that owns ax; plt.figure() would open a new, empty figure instead
    ax.figure.set_size_inches(18, 16)
    if title is not None:
        ax.set_title(title, fontsize=14, fontweight="bold")
    if ylabel is not None:
        ax.set_ylabel(ylabel, fontweight="bold")
    if xlabel is not None:
        ax.set_xlabel(xlabel, fontweight="bold")
    return ax

# Read CSV
***
* ID: 
    * This is a unique identifier of the accident record.
* Severity:
    * Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).
* Start_Time:
    * Shows start time of the accident in local time zone.

In [None]:
# read the CSV straight from the zip; pandas infers the compression and reads the single CSV inside the archive
accidents_df = pd.read_csv(destination_zip, 
                           usecols=col_types.keys(), 
                           dtype=col_types, 
                           parse_dates=['Start_Time'],
                           infer_datetime_format=True)

accidents_df.head()
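
For readers without access to `config.py`, the sketch below shows one shape the shared variables could take and how `pd.read_csv` reads a CSV stored inside a zip archive. The file names, the dtype choices, and the two sample rows are assumptions made for illustration, not the project's actual configuration or data. In this sketch `Start_Time` is left out of the dtype map and handled by `parse_dates` alone.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical stand-in for the col_types dict imported from config.py
col_types = {"ID": str, "Severity": "int8"}

# Made-up rows; the extra City column shows that usecols drops unused columns
sample_csv = (
    "ID,Severity,Start_Time,City\n"
    "A-1,2,2019-01-15 08:30:00,Dayton\n"
    "A-2,3,2019-07-04 17:45:00,Columbus\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-in for destination_zip from config.py
    destination_zip = os.path.join(tmp_dir, "accidents.zip")
    with zipfile.ZipFile(destination_zip, "w") as archive:
        archive.writestr("accidents.csv", sample_csv)

    # The .zip extension lets pandas infer the compression; the archive
    # must contain exactly one file for this to work
    sample_df = pd.read_csv(
        destination_zip,
        usecols=list(col_types) + ["Start_Time"],
        dtype=col_types,
        parse_dates=["Start_Time"],
    )

print(sample_df.dtypes)
```

Restricting `usecols` and fixing `dtype` up front keeps memory use down on a file with millions of rows, which is presumably why the notebook shares `col_types` between its two stages.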

# Define general shared variables
***

In [None]:
days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
severity_values = [1, 2, 3, 4]
month_seasons = {
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'fall', 10: 'fall', 11: 'fall'
}
season_order = ['fall', 'winter', 'spring', 'summer']

# Accidents and Severity vs. Season
***

In [None]:
# Ken's code starts here

# Grouping data into seasons
# To group dates into seasons, we map each month to a season using the
# meteorological season classification for the northern hemisphere:
# Spring runs from March 1 to May 31
# Summer runs from June 1 to August 31
# Fall (autumn) runs from September 1 to November 30
# Winter runs from December 1 to the end of February

# Start_Time was already parsed as datetime by read_csv, so no conversion is needed
accidents_df['seasons'] = accidents_df['Start_Time'].dt.month.map(month_seasons)
accidents_df.tail(50000)
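
The month-to-season mapping can be checked on a handful of boundary dates. The sketch below repeats the `month_seasons` dictionary so that it runs on its own; the sample dates are made up for illustration.

```python
import pandas as pd

# Same mapping as the shared month_seasons variable
month_seasons = {
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'fall', 10: 'fall', 11: 'fall'
}

# Hypothetical dates sitting on the season boundaries
sample_times = pd.Series(pd.to_datetime([
    "2019-12-25", "2020-02-29", "2020-03-01", "2020-08-31", "2020-09-01"
]))

# Extract the month number, then look each one up in the dictionary
sample_seasons = sample_times.dt.month.map(month_seasons)
print(sample_seasons.tolist())
# ['winter', 'winter', 'spring', 'summer', 'fall']
```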

In [None]:
# Sort  Accidents into seasons 

# group by seasons, get count of ID (really just the accident count per season)
season_group = accidents_df.groupby(['seasons'])['ID'].count().reset_index()

# order values based on column season, and season_order
season_group = season_group.set_index('seasons').loc[season_order]

# plot bar chart
season_accidents = season_group.plot.bar(legend=False)
season_accidents = set_chart_style(season_accidents, title="Number of Accidents by Season", ylabel="Number of Accidents", xlabel="Seasons")

In [None]:
# group by seasons and severity and count the accidents in each group
# (groupby sorts by season, then severity); reset_index turns season/severity
# from indices into columns and names the count column 'Accidents'
seasons = (accidents_df.groupby(['seasons', 'Severity'])
           .size()
           .reset_index(name='Accidents'))

seasons

In [None]:
# create pivot table from above dataframe
season_pivot = seasons.pivot(index='seasons', columns='Severity', values='Accidents')

# order rows by season_order ('seasons' is already the pivot table's index)
season_pivot = season_pivot.loc[season_order]

season_pivot

In [None]:
# Stacked Bar Chart
season_pivot_stacked = season_pivot.plot(kind= 'bar',stacked=True, color=colors)

plt.legend(loc='best')
season_pivot_stacked = set_chart_style(season_pivot_stacked, title="Severity Of Accidents By Season", ylabel="Number of Accidents", xlabel="Seasons")

# Ken's code ends here

# Accidents and Severity vs. Day of the Week
***

In [None]:
# Chris's code starts here

In [None]:
#Create DataFrame to store number of accidents per day of week
week_days_df = pd.DataFrame(accidents_df['Start_Time'].dt.dayofweek.value_counts())
# Sort days of the week from 0-6
week_days_df.sort_index(inplace=True)
# Replace day of week number with value from days_of_week list
week_days_df.set_index(pd.Index(days_of_week), inplace=True)
week_days_df

In [None]:
#Create bargraph from DataFrame
plotdata = week_days_df.plot.bar(legend=False)
plotdata = set_chart_style(plotdata, title="Number of Accidents by Day of Week", ylabel="Number of Accidents", xlabel="Days")

In [None]:
#Make a DataFrame to track on severity of crash by day of week
# Set up dataframe for pivot table with all severity/day combinations
days = days_of_week * len(severity_values)
severity = np.repeat(severity_values,len(days_of_week))
severity_df = pd.DataFrame(zip(severity, days), columns=['Severity', 'Days'])

# Get count of accidents for each severity/day combinations 
vals = []
for index, row in severity_df.iterrows():
    vals.append(accidents_df[(accidents_df['Severity'] == row['Severity']) & (accidents_df['Start_Time'].dt.dayofweek == days_of_week.index(row['Days']))]['ID'].count())
severity_df.insert(2, 'Value', vals)
severity_df
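
The same severity-by-day table can be built without a row-by-row loop. The sketch below uses `pd.crosstab` on a made-up five-row sample; it is an alternative shown for comparison, not the code that produced the charts in this notebook.

```python
import pandas as pd

days_of_week = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
severity_values = [1, 2, 3, 4]

# Hypothetical accident records
sample_df = pd.DataFrame({
    "Severity": [2, 2, 3, 4, 2],
    "Start_Time": pd.to_datetime([
        "2020-01-06 08:00",  # Monday
        "2020-01-06 17:30",  # Monday
        "2020-01-07 09:15",  # Tuesday
        "2020-01-11 23:00",  # Saturday
        "2020-01-12 02:00",  # Sunday
    ]),
})

# Count rows for every (day name, severity) pair in one pass
day_names = sample_df["Start_Time"].dt.day_name()
severity_by_day = pd.crosstab(day_names, sample_df["Severity"])

# Put days in calendar order and fill in combinations that never occurred
severity_by_day = severity_by_day.reindex(
    index=days_of_week, columns=severity_values, fill_value=0
)
print(severity_by_day)
```

Because the result is already indexed by day with one column per severity, it can be passed to `.plot.bar(stacked=True)` without a separate pivot step.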

In [None]:
#Define the summary table
pivot_df = severity_df.pivot(index='Days', columns='Severity', values='Value')
# Sort by days_of_week
pivot_df = pivot_df.loc[days_of_week]
pivot_df

In [None]:
# Create stacked bar chart
pivot_df_plot = pivot_df.plot.bar(stacked=True, color = colors)
pivot_df_plot = set_chart_style(pivot_df_plot, title="Severity of Crashes by Day of Week", ylabel="Number of Crashes", xlabel="Days")

In [None]:
# Chris's code ends here

# Accidents and Severity vs. Time of Day
***

In [None]:
# Colleen's code starts here

In [None]:
# TIME QUESTION 1: Does the time of day (early morning, rush hour, late night) affect NUMBER of accidents?

In [None]:
# Add a new column to the original DataFrame to parse out hours only from Start_Time column
accidents_df['Hour'] = accidents_df['Start_Time'].dt.hour
accidents_df.head()

In [None]:
#Create a dataframe to store number of accidents per hour of the day
accidents_per_hour = pd.DataFrame(accidents_df['Hour'].value_counts())
accidents_per_hour.sort_index(inplace=True)
accidents_per_hour

In [None]:
# TIME QUESTION 1 - VISUALIZATION: 
# Create a graph to show total # of accidents in 1 hour increments over 24 hrs timeframe
#     What time of day do most accidents occur?
#     How does this help identify time periods such as AM/PM rush hours?

plotdata1 = accidents_per_hour.plot.bar(legend=False)
plotdata1 = set_chart_style(plotdata1, title="Total Accidents per Hour", ylabel="Number of Accidents", xlabel="Hour of the Day (military time)")

In [None]:
# TIME QUESTION 2: Does the time of day (early morning, rush hour, late night) affect SEVERITY of accidents?

In [None]:
# Group Data Into Time Periods
# In order to group accidents into time periods, we need to define time periods based on the above analysis.
# Early Morning: 0,1,2,3,4,5
# AM Rush Hour: 6,7,8,9
# Mid Day: 10,11,12,13,14
# PM Rush Hour: 15,16,17,18
# Late Evening: 19,20,21,22,23

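# Establish bin edges for the time periods
bins = [0, 6, 10, 15, 19, 24]

# Create the names for the time period bins
time_period = ["Early Morning", "AM Rush Hour", "Mid Day", "PM Rush Hour", "Late Evening"]

# Categorize the hours of the day using the time period bins and pd.cut()
# right=False makes each bin include its left edge and exclude its right edge
# ([0, 6), [6, 10), ...), so hour 0 is kept and hour 6 counts as AM Rush Hour
accidents_df["Time Period"] = pd.cut(accidents_df["Hour"], bins, labels=time_period, right=False)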

#Display Time Periods Table
accidents_df
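
How `pd.cut` treats the bin edges decides which period the boundary hours land in. The sketch below compares the default right-closed bins with left-closed bins on a few sample hours; the `labels` list repeats the time period names so the block runs on its own.

```python
import pandas as pd

bins = [0, 6, 10, 15, 19, 24]
labels = ["Early Morning", "AM Rush Hour", "Mid Day", "PM Rush Hour", "Late Evening"]

# First and last hour of each intended time period
hours = pd.Series([0, 5, 6, 9, 10, 14, 15, 18, 19, 23])

# Default: bins are (0, 6], (6, 10], ... so hour 0 falls outside every bin
default_cut = pd.cut(hours, bins, labels=labels)

# right=False: bins are [0, 6), [6, 10), ... which matches the hour lists above
left_closed_cut = pd.cut(hours, bins, labels=labels, right=False)

comparison = pd.DataFrame({
    "Hour": hours,
    "default": default_cut,
    "right=False": left_closed_cut,
})
print(comparison)
```

With the default setting, hour 0 becomes `NaN` and hours 6, 10, 15, and 19 are each assigned to the earlier period; with `right=False` every hour from 0 to 23 lands in the period listed in the comments.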

In [None]:
# Sense check: count of accidents at each severity level
severity_counts = pd.DataFrame(accidents_df['Severity'].value_counts())
severity_counts

In [None]:
# TIME QUESTION 2 - VISUALIZATION:
# Create a bar graph that shows severity ranking of accidents during peak timeframes 
#   (in other words, during peak accident times (ie. rush hours) when are the most severe accidents likely to occur?)

# Group by Time Period and Severity and count the accidents in each group
# (groupby sorts by time period, then severity); reset_index pushes the
# indices into columns and names the count column 'Accidents'
time_period_counts = (accidents_df.groupby(['Time Period', 'Severity'], observed=False)
                      .size()
                      .reset_index(name='Accidents'))

# Turn into a pivot table
time_period_pivot = time_period_counts.pivot(index='Time Period', columns='Severity', values='Accidents')

# Create a graph using the time period pivot table
time_period_stacked = time_period_pivot.plot.bar(stacked=True, color=colors)
time_period_stacked = set_chart_style(time_period_stacked, title="Accident Severity by Time Period", ylabel="Number of Accidents by Severity", xlabel="Time Period")

In [None]:
# Colleen's code ends here

In [None]:
# Eric's code starts here

# Hypothesis Testing
***
__Our Hypothesis:__
> The number of accidents differs by season.

__Null Hypothesis:__
> Accidents are distributed equally across seasons.

***
We start by getting the counts of accidents in each season

In [None]:
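# Note: DataFrame.count() returns the number of non-null values in every column,
# so each dict value below is a Series with one entry per column rather than
# a single accident total for the season.
fall = {'Fall': accidents_df[accidents_df['seasons'] == 'fall'].count()}
winter = {'Winter': accidents_df[accidents_df['seasons'] == 'winter'].count()}
spring = {'Spring': accidents_df[accidents_df['seasons'] == 'spring'].count()}
summer = {'Summer': accidents_df[accidents_df['seasons'] == 'summer'].count()}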
all_seasons = [fall, winter, spring, summer]
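
A single total per season can also be obtained with `value_counts`, and a chi-square goodness-of-fit test is one common way to compare such totals against an equal split. The sketch below is an illustration under stated assumptions: the season labels are made up, and this is not the test whose results are reported in this notebook.

```python
import pandas as pd
import scipy.stats as st

season_order = ['fall', 'winter', 'spring', 'summer']

# Hypothetical season labels; the real ones live in accidents_df['seasons']
sample_seasons = pd.Series(
    ['fall'] * 30 + ['winter'] * 28 + ['spring'] * 22 + ['summer'] * 20
)

# One accident total per season, in a fixed order
season_counts = sample_seasons.value_counts().reindex(season_order)

# With no expected frequencies given, chisquare assumes every category
# is equally likely, which matches an "equal across seasons" null hypothesis
chi2_stat, chi2_p = st.chisquare(season_counts.to_numpy())

print(season_counts)
print(f"chi-square statistic: {chi2_stat:.2f}, p-value: {chi2_p:.3f}")
```

A small p-value here would suggest the counts are unlikely under an equal split; with the made-up sample above the test is only meant to show the mechanics.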

## ANOVA Test
***
The ANOVA test is used as a starting point to see whether there are any statistically significant <br>differences between the seasons. If the $p$-value is greater than $0.05$, we can stop looking at seasonal differences.

In [None]:
four_season_anova = st.f_oneway(fall['Fall'], winter['Winter'], spring['Spring'], summer['Summer'])[1]
print(f"ANOVA p-value for all four seasons: {four_season_anova}")

## Further Testing
***
The $p$-value for comparing all four seasons is $3.62 \times 10^{-34}$. <br>This is far less than $0.05$, indicating a need for further testing.
***

In [None]:
from itertools import combinations

p_values = []

# Loop through each unordered pair of seasons so no pair is tested twice
for season, otherseason in combinations(all_seasons, 2):
    # each entry is a single-key dict, so take its only (name, data) pair
    season_name, season_data = next(iter(season.items()))
    otherseason_name, otherseason_data = next(iter(otherseason.items()))
    # run the ANOVA test on each season pair and keep the p-value
    p_values.append(st.f_oneway(season_data, otherseason_data)[1])

    print(f"""ANOVA test between {season_name} and {otherseason_name}
        P Value: {p_values[-1]}""")
print(f"\nThe largest p-value between seasons is {max(p_values)}")

## Conclusion
***
The largest $p$-value from any combination of seasons is $3.65 \times 10^{-9}$, which is still far smaller than $0.05$. <br> Based on these results, we can confidently reject the null hypothesis that accidents are distributed equally across seasons.