![Add a relevant banner image here](path_to_image)

# **Flatiron Data Science Capstone Project 1: US Traffic Accidents**


## Overview

text here


# Business Understanding

The US Department of Transportation (DOT) is concerned about the number of traffic accidents across the US and wants to develop strategies to reduce accidents and improve road safety.

For the Department of Transportation (DOT), understanding and reducing traffic accidents is a critical mission that directly impacts public safety, economic costs, and quality of life across the United States.

This analysis matters from a stakeholder and business perspective:

- Economic Impact: Traffic accidents cost billions annually in medical expenses, property damage, and lost productivity, making even small reductions highly valuable.

- Public Safety: Traffic accidents are a leading cause of injury and death, so reducing them directly fulfills DOT's core mandate to protect citizens.

- Infrastructure Prioritization: Data analysis enables strategic allocation of limited infrastructure improvement budgets to the highest-risk areas.

- Policy Development: Accident data informs new safety regulations and provides metrics to evaluate the effectiveness of existing programs.

- Stakeholder Accountability: Comprehensive analysis demonstrates evidence-based decision-making to Congress, local governments, and the public.

- Cross-Agency Collaboration: Shared data insights can align accident reduction efforts across DOT, law enforcement, and emergency services.

- Technology Integration: Understanding accident patterns guides how emerging vehicle technologies should be regulated to maximize safety benefits.

This project supports DOT's mission by translating complex accident data into actionable insights. By identifying key patterns and risk factors, it empowers smarter infrastructure investment, better regulation, and ultimately safer roads for all.

Project Objectives:
1. Identify accident hotspots: This analysis will determine when and where accidents most frequently occur. Patterns such as time of day, day of week, season, and geographic location will be examined to determine if there are critical hotspots and time periods that may warrant intervention.
2. Analyze environmental risk factors: This analysis will determine how weather conditions correlate with accident rates. Factors such as visibility, precipitation, temperature, and other environmental variables will be examined to assess their impact on driver behavior and road conditions. The goal is to determine if certain weather conditions should trigger early warning notifications to drivers.
3. Identify infrastructure considerations: This analysis will determine how specific road features are associated with accident severity. This will include road design, signage, lighting, and other infrastructural elements that could contribute to or mitigate accident risk.

By successfully identifying accident hotspots, environmental risk factors, and infrastructure considerations, DOT can execute initiatives to address these issues and fulfill its mission to the public.

# Data Understanding

Text here

## Data Understanding - Overall

In [39]:
# from google.colab import drive
# drive.mount('/content/drive')

In [40]:
# Load relevant imports here
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels as stats
import scipy
import holidays

us_holidays = holidays.UnitedStates()  # Create a US holidays object to check for holidays

# column definitions in Onedrive


In [None]:
# load data
# df = pd.read_csv('/content/drive/MyDrive/US_Accidents_March23.csv')

# PC path to the CSV file
# df = pd.read_csv(r"C:\Users\jtlee\OneDrive\Documents\Flatiron Schoolwork\DS_11 Capstone\Project 1 US Traffic Accidents\US_Accidents_March23.csv")

# add laptop path to CSV file
df = pd.read_csv(r'C:\Users\eeleoj62\OneDrive\Documents\Flatiron Schoolwork\DS_11 Capstone\Project 1 US Traffic Accidents\US_Accidents_March23.csv')

In [None]:
print(f"\nShape of US accidents dataset: {df.shape}\n")     # (rows, columns)

df.info()  # DataFrame info

In [None]:
print("Preview of the dataframe:\n")
print(df.head())

In [None]:
print(df.isnull().sum())

In [None]:
# convert severity to categories (for EDA and readability)
severity_map = {
    1: 'Low',
    2: 'Moderate', 
    3: 'High',
    4: 'Severe'
}

df['Severity_Level'] = df['Severity'].map(severity_map)

## Data Understanding - Objective 1 Accident hotspot analysis

This objective examines patterns in time of day, day of week, season, and geographic location.

Temporal features:
- Time of day: `Start_Time` is used to derive the hour of the accident, grouped into periods such as morning, afternoon, and evening, to assess whether accidents cluster in certain windows.
- Day of week: accidents may increase on certain days, such as Fridays, due to end-of-week fatigue or increased traffic.
- Season: the month of the year is mapped to a season to explore weather-related or seasonal patterns.
- Holidays: using a US federal holiday calendar, each accident is tagged by whether it occurred on a holiday, when traffic and risk behaviors may increase.

Geographic features:
Geographic location will be determined using state, county and zip code. This will help identify regions with higher accident concentrations, supporting hotspot detection at multiple geographic levels.

Together, these temporal and spatial insights will help the DOT pinpoint high-risk times and locations, enabling targeted interventions such as increased patrols, public service campaigns, or infrastructure investment.
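
As a rough illustration of how temporal and spatial features combine into a hotspot ranking, the sketch below counts accidents per state and time period on a small made-up table. The `toy` frame, its values, and the `Time_period` labels are assumptions for illustration only, not results from the dataset.

```python
import pandas as pd

# Toy stand-in for the accident records; values are illustrative only.
toy = pd.DataFrame({
    "State": ["CA", "CA", "CA", "TX", "TX", "FL"],
    "Time_period": ["Morning", "Morning", "Evening", "Morning", "Evening", "Evening"],
})

# Count accidents for every (state, period) pair and list the largest first.
hotspot_counts = (
    toy.groupby(["State", "Time_period"])
       .size()
       .sort_values(ascending=False)
       .rename("accidents")
       .reset_index()
)

print(hotspot_counts.head(3))
```

The same pattern extends to county or zip code by changing the grouping columns.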

In [None]:
# Create new dataframe for objective 1 accident hotspot analysis

df_hotspots = df[['ID', 'Severity_Level', 'Start_Time', 'City', 'County', 'State', 'Zipcode']].copy()

print(f"Shape of hotspots df: {df_hotspots.shape}")
df_hotspots.info()  # info() prints its own summary and returns None, so it is not wrapped in print()
print(df_hotspots.isnull().sum())


Only 253 City values and 1,915 Zipcode values are missing out of roughly 7.7 million records, a negligible share. These rows will be dropped, since removing them should not materially affect the analysis.
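
To put those counts in proportion, the short calculation below expresses them as a share of the dataset. It uses the rounded "7.7 million" figure from the text as an assumption; the exact row count comes from `df.shape`.

```python
# Approximate row count quoted in the text; the exact figure comes from df.shape.
total_rows = 7_700_000
missing = {"City": 253, "Zipcode": 1_915}

# Share of rows with a missing value, in percent.
shares = {column: n / total_rows * 100 for column, n in missing.items()}

for column, share in shares.items():
    print(f"{column}: {missing[column]:,} missing ({share:.4f}% of rows)")
```

Both shares are well under a tenth of a percent (roughly 0.003% and 0.025%).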

In [None]:
# check for unique states and their counts
print(f"\nUnique states: {sorted(df['State'].unique())}")
print(f"\nNumber of states included in the dataset: {df['State'].nunique()}")


The dataset contains 49 unique `State` values. On closer inspection, these are the 48 contiguous states plus DC; Alaska and Hawaii are not present in the data.
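
One way to confirm which jurisdictions are absent is a set difference against the full list of postal codes. In this sketch, `ALL_CODES`, `missing_states`, and the stand-in `present` set are hypothetical helpers; in the notebook, `present` would be `set(df['State'].unique())`.

```python
# All 50 state postal codes plus DC.
ALL_CODES = {
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
    "DC",
}

def missing_states(present_codes):
    """Return the postal codes that do not appear in the data, sorted."""
    return sorted(ALL_CODES - set(present_codes))

# Stand-in for set(df['State'].unique()), built to mirror the finding above.
present = ALL_CODES - {"AK", "HI"}

print(len(present), missing_states(present))
```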

## Data Understanding - Objective 2 Environmental Risk Factors

This objective examines environmental factors such as temperature, wind chill, humidity, pressure, visibility, wind speed, precipitation, and weather condition.

In [None]:
# Create new dataframe for objective 2 environmental risk factor analysis

df_weather = df[['ID', 'Severity_Level', 'Temperature(F)', 'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)', 'Precipitation(in)', 'Weather_Condition']].copy()

print(f"Shape of weather df: {df_weather.shape}")
df_weather.info()  # info() prints its own summary and returns None, so it is not wrapped in print()
print(df_weather.isnull().sum())

There are missing values, most notably in `Wind_Chill(F)` and `Precipitation(in)`, and to a lesser extent in `Wind_Speed(mph)`. Planned handling:

- Temperature -> impute with median temperature
- Wind chill -> impute with temperature
- Humidity -> impute with median humidity
- Pressure -> impute with median pressure
- Visibility -> impute with median visibility
- Wind speed -> conditional median imputation (median within each weather condition): `df['Wind_Speed(mph)'] = df.groupby('Weather_Condition')['Wind_Speed(mph)'].transform(lambda x: x.fillna(x.median()))`
- Precipitation -> impute 0 (no rain)
- Weather condition -> impute 'Unknown'
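
The conditional median step can be illustrated on a small made-up table. This is a minimal sketch, not the notebook's final implementation: the `toy` values are invented, and the fallback to the overall median for conditions with no recorded wind speed is an added assumption.

```python
import numpy as np
import pandas as pd

# Toy data: "Fog" has no observed wind speed at all.
toy = pd.DataFrame({
    "Weather_Condition": ["Rain", "Rain", "Rain", "Clear", "Clear", "Fog"],
    "Wind_Speed(mph)":   [10.0, 14.0, np.nan, 4.0, np.nan, np.nan],
})

# Median wind speed within each weather condition, aligned to the original rows.
group_median = toy.groupby("Weather_Condition")["Wind_Speed(mph)"].transform("median")

# Fill from the group median first, then fall back to the overall median
# for groups that have no observed wind speed (here: "Fog").
overall_median = toy["Wind_Speed(mph)"].median()
toy["Wind_Speed_filled"] = (
    toy["Wind_Speed(mph)"].fillna(group_median).fillna(overall_median)
)

print(toy)
```

Observed values are left untouched; only the missing entries take the median of their own weather condition.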

In [None]:
# visualize weather distributions

weather_cols = ['Temperature(F)', 'Wind_Chill(F)', 'Humidity(%)', 'Pressure(in)', 'Visibility(mi)', 'Wind_Speed(mph)', 'Precipitation(in)']

# Set up subplots: 4 rows, 2 columns (adjust layout as needed)
fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(15, 20))
axes = axes.flatten()

# Plot each weather feature in a subplot
for i, col in enumerate(weather_cols):
    sns.histplot(df_weather[col].dropna(), bins=20, ax=axes[i])
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

# Remove empty subplot if cols < subplots
for j in range(len(weather_cols), len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()


Check for outliers in temperature, pressure, visibility, wind speed, and precipitation.
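
A common way to flag outliers is with interquartile range (IQR) fences. The sketch below shows the idea on made-up numbers; `iqr_bounds` is a hypothetical helper, and the choice to keep missing values so they can be imputed later is an assumption of this example.

```python
import numpy as np
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return the (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy readings with one extreme value and one missing value.
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0, np.nan])

lower, upper = iqr_bounds(toy)

# Keep values inside the fences; also keep NaN rows, because a plain
# comparison such as `toy <= upper` is False for NaN and would drop them.
keep = toy.between(lower, upper) | toy.isna()
filtered = toy[keep]

print(lower, upper)
print(filtered)
```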

In [None]:
# visualize weather conditions

plt.figure(figsize=(14, 6))
top_weather = df['Weather_Condition'].value_counts().head(20)

sns.barplot(x=top_weather.index, y=top_weather.values)
plt.xticks(rotation=45)
plt.title('Top 20 Weather Conditions')
plt.ylabel('Count')
plt.xlabel('Weather Condition')
plt.grid(True)
plt.show()


## Data Understanding - Objective 3 Infrastructure Considerations


This objective examines infrastructure data for bumps, crossings, yield signs, junctions, dead-ends/no-exit roads, railway crossings, roundabouts, stations (train, gas, etc.), stop signs, traffic calming features, traffic lights, and turning loops.

In [None]:
# Create new dataframe for objective 3 infrastructure analysis

df_infra = df[['ID', 'Severity_Level', 'Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']].copy()

print(f"Shape of infrastructure df: {df_infra.shape}")
df_infra.info()  # info() prints its own summary and returns None, so it is not wrapped in print()
print(df_infra.isnull().sum())

In [None]:
# visualize presence of infrastructure

infra_cols = ['Bump', 'Crossing', 'Give_Way', 'Junction', 'No_Exit', 'Railway', 'Roundabout', 'Station', 'Stop', 'Traffic_Calming', 'Traffic_Signal', 'Turning_Loop']

infra_counts = df_infra[infra_cols].sum().sort_values()

plt.figure(figsize=(10, 6))
infra_counts.plot(kind='barh')
plt.title('Number of Accidents with Infrastructure Feature Present')
plt.xlabel('Count')
plt.ylabel('Infrastructure Feature')
plt.tight_layout()
plt.show()

# Data Preparation




## Data Prep - Objective 1 Accident hotspot analysis

In [None]:
# drop rows with missing City and Zipcode values
df_hotspots.dropna(subset=['City', 'Zipcode'], inplace=True)
df_hotspots.shape

In [None]:
# take Start Time column and convert to datetime format
df_hotspots['Start_Time'] = pd.to_datetime(df_hotspots['Start_Time'], format='mixed', errors='coerce')

# create new column for time of day (hour of the day)
df_hotspots['Time_of_day'] = df_hotspots['Start_Time'].dt.hour

# create new column for day of week
# df_hotspots['day_of_week'] = df_hotspots['Start_Time'].dt.dayofweek

# convert day of week to actual day name
df_hotspots['Day_of_week'] = df_hotspots['Start_Time'].dt.day_name()

# Create a new column for whether the accident occurred on a holiday
df_hotspots['Is_Holiday'] = df_hotspots['Start_Time'].apply(
    lambda x: x.date() in us_holidays if pd.notnull(x) else False
)

print(df_hotspots.head())
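
Row-by-row holiday lookups can be slow on millions of records. As an alternative sketch, the example below tags holidays in a vectorized way. It swaps in pandas' built-in `USFederalHolidayCalendar` rather than the `holidays` package used above, so the set of dates may differ (the pandas calendar lists observed federal holidays only). The sample timestamps are invented.

```python
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Toy timestamps: Independence Day 2016, the day after, and a missing value.
start_times = pd.to_datetime(pd.Series([
    "2016-07-04 08:15:00",
    "2016-07-05 17:40:00",
    None,
]))

# Build the holiday dates once for the period of interest.
holiday_dates = USFederalHolidayCalendar().holidays(
    start="2016-01-01", end="2016-12-31"
)

# Strip the time of day, then test membership for the whole column at once.
# Missing timestamps (NaT) come out as False.
is_holiday = start_times.dt.normalize().isin(holiday_dates)

print(is_holiday.tolist())
```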

In [None]:
# define a function to get the season based on the month
def get_season(month):
    if month in [12, 1, 2]:
        return 'Winter'
    elif month in [3, 4, 5]:
        return 'Spring'
    elif month in [6, 7, 8]:
        return 'Summer'
    elif month in [9, 10, 11]:
        return 'Fall'
    
df_hotspots['Season'] = df_hotspots['Start_Time'].dt.month.apply(get_season)

print(df_hotspots.head())

In [None]:
# Check for consistency of spelling in the top 50 city and county values
print(df_hotspots['City'].value_counts().head(50))
print(df_hotspots['County'].value_counts().head(50))


In [None]:
# extract just the first 5 digits of the Zipcode column
# This is to ensure that the Zipcode column is in the correct format
df_hotspots['Zipcode'] = df_hotspots['Zipcode'].astype(str).str.extract(r'^(\d{5})', expand=False)  # expand=False returns a Series
print(df_hotspots['Zipcode'].value_counts().head(50))
print(df_hotspots['Zipcode'].unique())

In [None]:
print(df_hotspots.head())

In [None]:
# group accidents by time of day, group further into periods of the day
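def convert_hour_to_period(hour):
    # a Start_Time that failed to parse leaves the hour missing; leave those rows unlabeled
    if pd.isna(hour):
        return None
    if 0 <= hour < 6:
        return 'Late Night'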
    elif 6 <= hour < 12:
        return 'Morning'
    elif 12 <= hour < 17:
        return 'Afternoon'
    elif 17 <= hour < 21:
        return 'Evening'
    else:
        return 'Night'

df_hotspots['Time_period'] = df_hotspots['Time_of_day'].apply(convert_hour_to_period)
print(df_hotspots.head())
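
The same binning can also be expressed without a Python-level function by using `pd.cut`. This is a sketch of an alternative, not a replacement for the cell above; the sample `hours` values are invented, and the bin edges mirror the hour ranges used in `convert_hour_to_period`.

```python
import numpy as np
import pandas as pd

# Toy hours covering each boundary, plus one missing value.
hours = pd.Series([0, 5, 6, 11, 12, 16, 17, 20, 21, 23, np.nan])

# right=False makes each bin include its left edge and exclude its right edge,
# e.g. [0, 6) is "Late Night" and [21, 24) is "Night".
periods = pd.cut(
    hours,
    bins=[0, 6, 12, 17, 21, 24],
    labels=["Late Night", "Morning", "Afternoon", "Evening", "Night"],
    right=False,
)

print(periods.tolist())
```

A missing hour stays missing in the result instead of falling into a catch-all category.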

In [None]:
# examining value counts for the time periods
df_hotspots['Time_period'].value_counts()

In [None]:
# examining value counts for days of the week
df_hotspots['Day_of_week'].value_counts()

In [None]:
# examining value counts for holidays
df_hotspots['Is_Holiday'].value_counts()

In [None]:
print(df_hotspots.head())

## Data Prep - Objective 2 Environmental risk 

In [None]:
# determine outliers in temperature
print("\nTemperature statistics:")

print(df_weather['Temperature(F)'].describe())

Q1 = df_weather['Temperature(F)'].quantile(0.25)
Q3 = df_weather['Temperature(F)'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR

# filter out physically implausible readings, keeping missing values so they can be imputed later
# (anything above 130 F or below -60 F is highly unlikely in the continental US)
temp = df_weather['Temperature(F)']
df_weather = df_weather[temp.between(-60, 130) | temp.isna()]

print(df_weather['Temperature(F)'].describe())

In [None]:
# determine outliers in pressure
print("\nPressure statistics:")
print(df_weather['Pressure(in)'].describe())
Q1 = df_weather['Pressure(in)'].quantile(0.25)
Q3 = df_weather['Pressure(in)'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR
# filter out high outliers, keeping missing values so they can be imputed later
pressure = df_weather['Pressure(in)']
df_weather = df_weather[(pressure <= upper_bound) | pressure.isna()]
print(df_weather['Pressure(in)'].describe())

In [None]:
# determine outliers in visibility
print("\nVisibility(mi) statistics:")

print(df_weather['Visibility(mi)'].describe())

Q1 = df_weather['Visibility(mi)'].quantile(0.25)
Q3 = df_weather['Visibility(mi)'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR

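# filter out high outliers, keeping missing values so they can be imputed later
visibility = df_weather['Visibility(mi)']
df_weather = df_weather[(visibility <= upper_bound) | visibility.isna()]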

print(df_weather['Visibility(mi)'].describe())

In [None]:
# determine outliers in wind speed
print("\nWind_Speed(mph) statistics:")

print(df_weather['Wind_Speed(mph)'].describe())

Q1 = df_weather['Wind_Speed(mph)'].quantile(0.25)
Q3 = df_weather['Wind_Speed(mph)'].quantile(0.75)
IQR = Q3 - Q1
upper_bound = Q3 + 1.5 * IQR

# drop readings above 100 mph (highly unlikely in the US), keeping missing values so they can be imputed later
wind = df_weather['Wind_Speed(mph)']
df_weather = df_weather[(wind <= 100) | wind.isna()]

print(df_weather['Wind_Speed(mph)'].describe())

In [None]:
# visualize weather again to ensure outliers have been removed
# Set up subplots: 4 rows, 2 columns (adjust layout as needed)
fig, axes = plt.subplots(nrows=4, ncols=2, figsize=(15, 20))
axes = axes.flatten()

# Plot each weather feature in a subplot
for i, col in enumerate(weather_cols):
    sns.histplot(df_weather[col].dropna(), bins=20, ax=axes[i])
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel(col)
    axes[i].set_ylabel('Frequency')

# Remove empty subplot if cols < subplots
for j in range(len(weather_cols), len(axes)):
    fig.delaxes(axes[j])

plt.tight_layout()
plt.show()

In [None]:
# impute missing values in temperature column with median temperature
# (assignment is used instead of inplace=True on a column, which newer pandas versions do not apply reliably)
df_weather['Temperature(F)'] = df_weather['Temperature(F)'].fillna(df_weather['Temperature(F)'].median())

# impute missing values in wind chill column with the (already imputed) temperature
df_weather['Wind_Chill(F)'] = df_weather['Wind_Chill(F)'].fillna(df_weather['Temperature(F)'])

# impute missing values in humidity column with median humidity
df_weather['Humidity(%)'] = df_weather['Humidity(%)'].fillna(df_weather['Humidity(%)'].median())

# impute missing values in pressure with median pressure
df_weather['Pressure(in)'] = df_weather['Pressure(in)'].fillna(df_weather['Pressure(in)'].median())

# impute missing values in visibility with median visibility
df_weather['Visibility(mi)'] = df_weather['Visibility(mi)'].fillna(df_weather['Visibility(mi)'].median())

# impute missing values in weather condition column with 'Unknown'
# (done before the wind speed step so rows without a condition form their own group)
df_weather['Weather_Condition'] = df_weather['Weather_Condition'].fillna('Unknown')

# impute missing values in wind speed with the median wind speed for the same weather condition,
# falling back to the overall median when a condition has no recorded wind speed
wind = df_weather['Wind_Speed(mph)']
df_weather['Wind_Speed(mph)'] = wind.fillna(
    df_weather.groupby('Weather_Condition')['Wind_Speed(mph)'].transform('median')
).fillna(wind.median())

# impute missing values in precipitation column with 0 (no rain)
df_weather['Precipitation(in)'] = df_weather['Precipitation(in)'].fillna(0)

In [None]:
print(df_weather.head())

## Data Prep - Objective 3 Infrastructure considerations

Every infrastructure column is boolean and fully populated, so no imputation or row drops are required. Outliers are not a concern because each value is simply True or False.
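
As a small illustration of why integer-coded flags are convenient, the sketch below converts boolean columns to 0/1 on an invented table. The column sum then gives the number of accidents near each feature, and the column mean gives the share. The `toy` values are assumptions for illustration only.

```python
import pandas as pd

# Toy boolean flags for four accidents.
toy = pd.DataFrame({
    "Junction": [True, False, True, False],
    "Traffic_Signal": [False, False, True, False],
})

# True becomes 1 and False becomes 0.
as_int = toy.astype(int)

counts = as_int.sum()   # number of accidents with the feature present
shares = as_int.mean()  # fraction of accidents with the feature present

print(counts)
print(shares)
```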

In [None]:
# convert booleans to integers
df_infra[infra_cols] = df_infra[infra_cols].astype(int)



In [None]:
print(df_infra.head())

# Analysis

Text here

# Evaluation

## Business Insight/Recommendation 1

## Business Insight/Recommendation 2

## Business Insight/Recommendation 3

### Tableau Dashboard link

## Conclusion and Next Steps
Text here