## Data Analysis and Science: US Accidents (2016-2023)
 
#### This notebook presents an exploratory analysis of traffic accidents in the USA from 2016 to 2023.

## Objectives

- Explore the trends and patterns in traffic accidents during the specified period.
- Identify the factors contributing to the occurrence and severity of accidents.
- Build predictive models to forecast accident severity based on various characteristics.


#### Acknowledgements

* Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, and Rajiv Ramnath. <a href="https://arxiv.org/abs/1906.05409">"A Countrywide Traffic Accident Dataset."</a> 2019.

* Moosavi, Sobhan, Mohammad Hossein Samavatian, Srinivasan Parthasarathy, Radu Teodorescu, and Rajiv Ramnath. <a href="https://arxiv.org/abs/1909.09638">"Accident Risk Prediction based on Heterogeneous Sparse Data: New Dataset and Insights."</a> In Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, 2019.

In [10]:
!pip install us xgboost



In [11]:
import warnings

warnings.filterwarnings("ignore")

import folium
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import us
from folium.plugins import HeatMap, MarkerCluster
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

In [13]:
# Reading the dataset

df = pd.read_csv('US_Accidents_March23_sampled_500k.csv')

df.head()

| | ID | Source | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | ... | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-2047758 | Source2 | 2 | 2019-06-12 10:10:56 | 2019-06-12 10:55:58 | 30.641211 | -91.153481 | NaN | NaN | 0.0 | ... | False | False | False | False | True | False | Day | Day | Day | Day |
| 1 | A-4694324 | Source1 | 2 | 2022-12-03 23:37:14.000000000 | 2022-12-04 01:56:53.000000000 | 38.990562 | -77.39907 | 38.990037 | -77.398282 | 0.056 | ... | False | False | False | False | False | False | Night | Night | Night | Night |
| 2 | A-5006183 | Source1 | 2 | 2022-08-20 13:13:00.000000000 | 2022-08-20 15:22:45.000000000 | 34.661189 | -120.492822 | 34.661189 | -120.492442 | 0.022 | ... | False | False | False | False | True | False | Day | Day | Day | Day |
| 3 | A-4237356 | Source1 | 2 | 2022-02-21 17:43:04 | 2022-02-21 19:43:23 | 43.680592 | -92.993317 | 43.680574 | -92.972223 | 1.054 | ... | False | False | False | False | False | False | Day | Day | Day | Day |
| 4 | A-6690583 | Source1 | 2 | 2020-12-04 01:46:00 | 2020-12-04 04:13:09 | 35.395484 | -118.985176 | 35.395476 | -118.985995 | 0.046 | ... | False | False | False | False | False | False | Night | Night | Night | Night |
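
The preview shows `Start_Time` and `End_Time` in two textual formats: some rows carry a `.000000000` suffix and others do not, and both columns are stored as `object` rather than as datetimes. The sketch below is one possible way to normalize them before any time-based analysis, not necessarily the approach taken later in this notebook. It assumes the fractional part can be discarded (it is all zeros in the rows shown), and the `Year` and `Hour` columns are hypothetical helpers introduced only for illustration.

```python
import pandas as pd

# Two sample values copied from the Start_Time column in the preview:
# one without and one with the nanosecond suffix.
sample = pd.DataFrame({
    "Start_Time": ["2019-06-12 10:10:56", "2022-12-03 23:37:14.000000000"],
})

# Keep only the first 19 characters ("YYYY-MM-DD HH:MM:SS") so that a single
# explicit format matches every row, then parse into datetime64 values.
sample["Start_Time"] = pd.to_datetime(
    sample["Start_Time"].str.slice(0, 19),
    format="%Y-%m-%d %H:%M:%S",
)

# Hypothetical helper columns for trend analysis (not part of the dataset).
sample["Year"] = sample["Start_Time"].dt.year
sample["Hour"] = sample["Start_Time"].dt.hour

print(sample)
```

Truncating the string keeps the parsing step independent of the pandas version; if sub-second precision mattered, the suffix would need to be parsed rather than cut.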


## Exploratory Data Analysis and Handling of Null Values

In [14]:
df.dtypes

ID                        object
Source                    object
Severity                   int64
Start_Time                object
End_Time                  object
Start_Lat                float64
Start_Lng                float64
End_Lat                  float64
End_Lng                  float64
Distance(mi)             float64
Description               object
Street                    object
City                      object
County                    object
State                     object
Zipcode                   object
Country                   object
Timezone                  object
Airport_Code              object
Weather_Timestamp         object
Temperature(F)           float64
Wind_Chill(F)            float64
Humidity(%)              float64
Pressure(in)             float64
Visibility(mi)           float64
Wind_Direction            object
Wind_Speed(mph)          float64
Precipitation(in)        float64
Weather_Condition         object
Amenity                     bool
Bump      

In [15]:
df.isnull().sum()

ID                            0
Source                        0
Severity                      0
Start_Time                    0
End_Time                      0
Start_Lat                     0
Start_Lng                     0
End_Lat                  220377
End_Lng                  220377
Distance(mi)                  0
Description                   1
Street                      691
City                         19
County                        0
State                         0
Zipcode                     116
Country                       0
Timezone                    507
Airport_Code               1446
Weather_Timestamp          7674
Temperature(F)            10466
Wind_Chill(F)            129017
Humidity(%)               11130
Pressure(in)               8928
Visibility(mi)            11291
Wind_Direction            11197
Wind_Speed(mph)           36987
Precipitation(in)        142616
Weather_Condition         11101
Amenity                       0
Bump                          0
Crossing
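
Raw null counts are easier to compare once they are expressed as a share of all rows. The sketch below shows one way to do that on a small stand-in frame; `toy` and its values are made up for illustration, and with the real data the same expression would be applied to `df`.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values); replace `toy` with `df`
# to run this against the accident data.
toy = pd.DataFrame({
    "Severity": [2, 2, 3, 4],
    "End_Lat": [float("nan"), 38.99, float("nan"), 43.68],
    "City": ["Zachary", None, "Lompoc", "Austin"],
})

# isnull() yields a boolean frame; its column-wise mean is the fraction of
# missing entries, which is scaled to a percentage and sorted.
missing_pct = (toy.isnull().mean() * 100).sort_values(ascending=False)

print(missing_pct)
```

Sorting in descending order puts the columns with the most gaps first, which helps when deciding which ones to drop and which ones to impute.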

In [16]:
# Drop columns that are not needed for the analysis

drop = [
    'End_Lat',
    'End_Lng',
    'Distance(mi)',
    'Street',
    'County',
    'Zipcode',
    'Timezone',
    'Airport_Code',
    'Weather_Timestamp',
    'Visibility(mi)',
    'Wind_Direction',
    'Wind_Speed(mph)',
    'Precipitation(in)',
    'Weather_Condition',
    'Sunrise_Sunset',
    'Civil_Twilight',
    'Nautical_Twilight',
    'Astronomical_Twilight',
    'Temperature(F)',
    'Wind_Chill(F)',
    'Humidity(%)',
    'Pressure(in)'
]

df = df.drop(columns=drop)

df.head()

| | ID | Source | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | Description | City | State | ... | Give_Way | Junction | No_Exit | Railway | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-2047758 | Source2 | 2 | 2019-06-12 10:10:56 | 2019-06-12 10:55:58 | 30.641211 | -91.153481 | Accident on LA-19 Baker-Zachary Hwy at Lower Z... | Zachary | LA | ... | False | False | False | False | False | False | False | False | True | False |
| 1 | A-4694324 | Source1 | 2 | 2022-12-03 23:37:14.000000000 | 2022-12-04 01:56:53.000000000 | 38.990562 | -77.39907 | Incident on FOREST RIDGE DR near PEPPERIDGE PL... | Sterling | VA | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | A-5006183 | Source1 | 2 | 2022-08-20 13:13:00.000000000 | 2022-08-20 15:22:45.000000000 | 34.661189 | -120.492822 | Accident on W Central Ave from Floradale Ave t... | Lompoc | CA | ... | False | False | False | False | False | False | False | False | True | False |
| 3 | A-4237356 | Source1 | 2 | 2022-02-21 17:43:04 | 2022-02-21 19:43:23 | 43.680592 | -92.993317 | Incident on I-90 EB near REST AREA Drive with ... | Austin | MN | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | A-6690583 | Source1 | 2 | 2020-12-04 01:46:00 | 2020-12-04 04:13:09 | 35.395484 | -118.985176 | RP ADV THEY LOCATED SUSP VEH OF 20002 - 726 CR... | Bakersfield | CA | ... | False | False | False | False | False | False | False | False | False | False |


In [17]:
# Drop the few remaining rows that still contain missing values
df = df.dropna()

In [18]:
df.isnull().sum()

ID                 0
Source             0
Severity           0
Start_Time         0
End_Time           0
Start_Lat          0
Start_Lng          0
Description        0
City               0
State              0
Country            0
Amenity            0
Bump               0
Crossing           0
Give_Way           0
Junction           0
No_Exit            0
Railway            0
Roundabout         0
Station            0
Stop               0
Traffic_Calming    0
Traffic_Signal     0
Turning_Loop       0
dtype: int64

In [19]:
df.head(2)

| | ID | Source | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | Description | City | State | ... | Give_Way | Junction | No_Exit | Railway | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-2047758 | Source2 | 2 | 2019-06-12 10:10:56 | 2019-06-12 10:55:58 | 30.641211 | -91.153481 | Accident on LA-19 Baker-Zachary Hwy at Lower Z... | Zachary | LA | ... | False | False | False | False | False | False | False | False | True | False |
| 1 | A-4694324 | Source1 | 2 | 2022-12-03 23:37:14.000000000 | 2022-12-04 01:56:53.000000000 | 38.990562 | -77.39907 | Incident on FOREST RIDGE DR near PEPPERIDGE PL... | Sterling | VA | ... | False | False | False | False | False | False | False | False | False | False |


In [20]:
# View the number of accidents per severity level

df['Severity'].value_counts()

Severity
2    398126
3     84518
4     13062
1      4274
Name: count, dtype: int64
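
The counts above are heavily skewed toward severity 2, which matters for any classifier trained on this column. As a quick illustration, the sketch below converts the printed counts into shares; the numbers are copied from the output above, and on the real frame `df['Severity'].value_counts(normalize=True)` should give the same proportions directly.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
severity_counts = pd.Series(
    {2: 398126, 3: 84518, 4: 13062, 1: 4274}, name="count"
)

# Share of each severity level among all remaining rows.
severity_share = severity_counts / severity_counts.sum()

print(severity_share.round(4))
```

Severity 2 accounts for roughly 80% of the rows and severity 1 for under 1%, so plain accuracy would be a weak yardstick for a severity model; per-class metrics such as those in `classification_report` are likely more informative.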