In [1]:
import pandas as pd
import numpy as np
import re

In [2]:
data = pd.read_csv("data/US_Accidents_March23.csv")
data.shape

(7728394, 46)

In this notebook, we load and process The US Traffic Accident dataset for analysis and modeling. We clean the data and also extract some potentially useful information.

## 1. Dataset Overview

The US Traffic Accident dataset from [Kaggle](https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents) provides comprehensive information on traffic accidents across the United States from 2016 to 2023. It contains 7,728,394 rows and 24 columns, each representing a different attribute of the accidents. Here's a brief overview of some of the key columns:

* `ID`: Unique identifier for each accident.
* `Source`: Source of the accident report (e.g., 911 call, news).
* `Severity`: Accident severity rating (on a scale from 1 to 4). 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).
* `Start_Time`: Start time of the accident.
* `End_Time`: Time when the impact of accident on traffic flow was dismissed.
* `Start_Lat/Start_Lng`: Latitude and longitude where the accident started.
* `End_Lat/End_Lng`: Latitude and longitude where the accident ended (many missing values).
* `Distance(mi)`: The length of the road extent affected by the accident.
* `Description`: Brief description of the accident.
* `Street`, `City`, `County`, `State`, `Zipcode`: Location details of the accident.
* `Country`: Country where the accident occurred (all should be the USA).
* `Timezone`: Timezone of the accident location.
* `Airport_Code`: Nearest airport to the accident location.
* `Weather_Timestamp`: Time when weather data was recorded.
* `Temperature(F)`, `Wind_Chill(F)`, `Humidity(%)`, `Pressure(in)`, `Visibility(mi)`, `Wind_Direction`, `Wind_Speed(mph)`, `Precipitation(in)`, `Weather_Condition`: Various weather-related attributes.
* `Amenity`, `Bump`, `Crossing`, `Give_Way`, `Junction`, `No_Exit`, `Railway`, `Roundabout`, `Station`, `Stop`, `Traffic_Calming`, `Traffic_Signal`, `Turning_Loop`: Boolean indicators for the presence of specific road features.
* `Sunrise_Sunset`, `Civil_Twilight`, `Nautical_Twilight`, `Astronomical_Twilight`: Time of day indicators related to the position of the sun.

## 2. Data Processing

### 2.2. Handling Missing Values

In [3]:
init_nrows = data.shape[0]

In [4]:
missing_values = data.isnull().sum()
missing_values = missing_values[missing_values > 0]

Percentage of missing values per column

In [5]:
#missing_values

In [6]:
100 * missing_values / data.shape[0]

End_Lat                  44.029355
End_Lng                  44.029355
Description               0.000065
Street                    0.140637
City                      0.003274
Zipcode                   0.024779
Timezone                  0.101030
Airport_Code              0.292881
Weather_Timestamp         1.555666
Temperature(F)            2.120143
Wind_Chill(F)            25.865904
Humidity(%)               2.253301
Pressure(in)              1.820288
Visibility(mi)            2.291524
Wind_Direction            2.267043
Wind_Speed(mph)           7.391355
Precipitation(in)        28.512858
Weather_Condition         2.244438
Sunrise_Sunset            0.300787
Civil_Twilight            0.300787
Nautical_Twilight         0.300787
Astronomical_Twilight     0.300787
dtype: float64

As we want to predict accident severity, some columns may be redundant or not useful for our predictive modeling.
* `ID`: Unique identifier, not useful for prediction.
* `Source`: Source of the accident report; may not impact severity directly.
* `End_Lat/End_Lng`: High percentage of missing values and likely less useful.
* `Description`: Free-text description; challenging to use directly without NLP techniques.
* `Country`: All entries should be the USA, providing no variance.
* `Airport_Code`: Specific to the nearest airport, which may not directly impact severity.
* `Civil_Twilight`, `Nautical_Twilight`, `Astronomical_Twilight`: They could be less useful or redundant given the information provided by `Sunrise_Sunset`.
* `Weather_Timestamp`: May be redundant but useful for verification and imputation of weather information.
* `Wind_Chill(F)`: Given that `Wind_Chill(F)` is derived from `Temperature(F)` and `Wind_Speed(mph)`, it may be redundant.
* `Street`: Steet names may not be useful on their own but we could use them to determine the type of road (street, highway, etc.)
* `Timezone`: Given we have the local time and day/night status (`Sunrise_Sunset`), the time zone may not be relevant.
* `Wind_Direction`: Compared to other weather information, `Wind_Direction` nay not have a significant importance.

In [7]:
df_clean = data.drop(columns=[
    "Source", "End_Lat", "End_Lng", "Description", "Country", "Airport_Code", "Civil_Twilight", "Wind_Direction",
    "Nautical_Twilight", "Astronomical_Twilight", "Wind_Chill(F)", "Timezone", "Weather_Timestamp", "Zipcode"]).copy()

In [8]:
del(data)

After removing redundant or not so useful columns, we still have a few with missing values. Most columns are missing a very small percentage of values, except for `Precipitation(in)` which is missing 28.5\%.

#### Fixing DateTime datatypes and Splitting Date and Time

In [9]:
#df_clean["Weather_Timestamp"] = pd.to_datetime(df_clean["Weather_Timestamp"])
df_clean["Start_Time"] = pd.to_datetime(df_clean["Start_Time"])
df_clean["End_Time"] = pd.to_datetime(df_clean["End_Time"])

In [10]:
df_clean["Date"] = pd.to_datetime(df_clean["Start_Time"].dt.date)

#### Remove Duplicates

In [11]:
df_clean.drop(columns=["ID"]).duplicated().sum()

140899

In [12]:
df_clean_ids = df_clean["ID"].copy()
df_clean = df_clean.drop(columns=["ID"]).drop_duplicates()
df_clean["ID"] = df_clean_ids
df_clean.shape

(7587495, 33)

In [13]:
del(df_clean_ids)

#### Dropping rows with missing values

We have 336307 unique street names and 10869 (0.14\%) of data points have missing values. Given that it is a very small percentage, we could remove those rows.

In [14]:
df_clean = df_clean[~df_clean["Street"].isnull()]

Some city names are missing (253 entries). This is an incredibly small amount (~0.003\%). We could also drop these rows.

In [15]:
df_clean = df_clean[~df_clean["City"].isnull()]

Similar to `Street` and `City`, a small percentage of data points (0.024\%) are missing `Zipcode`. `Zipcode` provide more granular information than `City` and `County` so a correlation analysis would be needed to determine if we keep the column or not. However, given the small number of missing values, we could drop these rows.

`Temperature(F)` is a fairly useful information about weather. We could use location (`City`, `County`, etc.) and time information (`Weather_Timestamp`) to fill the missing values. However, since very few data points are missing `Temperature(F)`, we could remove those entries. A similar thing goes for other weather data.

In [16]:
df_clean = df_clean[~df_clean["Temperature(F)"].isnull()]

In [17]:
df_clean = df_clean[~df_clean["Visibility(mi)"].isnull()]

In [18]:
df_clean = df_clean[~df_clean["Pressure(in)"].isnull()]

In [19]:
df_clean = df_clean[~df_clean["Humidity(%)"].isnull()]

In [20]:
df_clean = df_clean[~df_clean["Weather_Condition"].isnull()]

~28\% of the data is missing values for `Precipitation(in)`. This is a significant amount so we could use imputation to fill the missing values. For instance, missing precipiation could mean there were no rain. To validate this, we could use `Weather_Condition`.

Let's extract `Weather_Category` and `Weather_Intensity` information.

In [21]:
def extract_weather_intensity(condition):
    intensity_terms = "(light|heavy|patches|small|widspread|partial)"
    match = re.search(intensity_terms, condition.lower())
    if match:
        return match.group(1)
    return "regular"

In [22]:
def extract_weather_category(condition):
    regular_weather = "(cloud|overcast|rain|drizzle|thunderstorm|thunder|t-storm|tornado|snow|haze|fog|mist|smoke|sand|dust|hail|squalls|ice pellets|sleet|wintry mix)"
    special_weather = "(fair|clear|volcanic ash|fair windy|showers in the vicinity|n/a precipitation|fair)"
    match = re.search(special_weather, condition.lower())
    if match:
        return match.group(1)
    match = re.search(regular_weather, condition.lower())
    if match:
        return match.group(1)
    return "other"

In [23]:
weather_categories = {
    "cloud": "cloudy",
    "overcast": "cloudy",
    "fair": "clear condition",
    "clear": "clear condition",
    "rain": "precipitation",
    "drizzle": "precipitation",
    "showers in the vicinity": "precipitation",
    "t-storm": "thunderstorm",
    "thunder": "thunderstorm",
    "thunderstorm": "thunderstorm",
    "wintry mix": "precipitation",
    "n/a precipitation": "precipitation",
    "mist": "precipitation",
    "sleet": "precipitation",
    "ice pellets": "precipitation",
    "hail": "precipitation",
    "snow": "snowstorm",
    "fog": "visibility issue",
    "haze": "visibility issue",
    "smoke": "visibility issue",
    "dust": "visibility issue",
    "sand": "visibility issue",
    "volcanic ash": "visibility issue",
    "tornado": "extreme condition",
    "squalls": "extreme condition",
}

In [24]:
%%time
df_clean["Weather_Category"] = df_clean["Weather_Condition"].apply(extract_weather_category)
df_clean["Weather_Category"] = df_clean["Weather_Category"].replace(weather_categories)

CPU times: total: 18.3 s
Wall time: 19 s


In [25]:
df_clean["Weather_Category"].value_counts()

clear condition      3303959
cloudy               3087383
precipitation         534549
visibility issue      191910
snowstorm             155636
thunderstorm           73094
extreme condition         96
Name: Weather_Category, dtype: int64

In [26]:
%%time
df_clean["Weather_Intensity"] = df_clean["Weather_Condition"].apply(extract_weather_intensity)
df_clean["Weather_Intensity"] = df_clean["Weather_Intensity"].replace({"patches": "light", "patches": "light", "small": "light", "partial": "light"})

CPU times: total: 6.34 s
Wall time: 6.6 s


In [27]:
df_clean["Weather_Intensity"].value_counts()

regular    6760770
light       534012
heavy        51845
Name: Weather_Intensity, dtype: int64

We set data points with missing `Precipitation(in)` where the `Weather_Category` isn't a precipitation to 0.

In [28]:
df_clean["Precipitation(in)"].describe()

count    5.316802e+06
mean     8.067124e-03
std      9.213091e-02
min      0.000000e+00
25%      0.000000e+00
50%      0.000000e+00
75%      0.000000e+00
max      3.647000e+01
Name: Precipitation(in), dtype: float64

In [29]:
df_clean.loc[(df_clean["Precipitation(in)"].isnull()) & (df_clean["Weather_Category"].str not in {"precipitation", "snowstorm", "thunderstorm"}), "Precipitation(in)"] = 0

About 7\% of the data is missing `Wind_Speed(mph)` and 2\% is missing `Wind_Direction`.

In [30]:
%%time
for category in df_clean["Weather_Category"].unique():
    median = df_clean.loc[df_clean["Weather_Category"] == category, "Wind_Speed(mph)"].median()
    df_clean.loc[df_clean["Weather_Category"] == category, "Wind_Speed(mph)"] = df_clean.loc[df_clean["Weather_Category"] == category, "Wind_Speed(mph)"].fillna(median)

CPU times: total: 6.34 s
Wall time: 6.67 s


In [31]:
df_clean["Wind_Speed(mph)"].isnull().sum()

0

Some rows in our dataset are missing day/time information in `Sunrise_Sunset`. We could use manually fill the missing values using the `Time` data. However, given that very few rows are missing this information (~0.3%), we could just drop them.

In [32]:
df_clean = df_clean[~df_clean["Sunrise_Sunset"].isnull()]

In [33]:
missing_values = df_clean.isnull().sum()
missing_values[missing_values > 0].shape[0]

0

In [34]:
print(f"{100 * (init_nrows - df_clean.shape[0]) / init_nrows:.2f}% of the original dataset was dropped")

5.19% of the original dataset was dropped


### 2.2. Fixing Inconsistencies 

#### Numerical Variables
The values for some of the numerical variables are unrealistic, namely `Temperature(F)`, `Distance(mi)`, `Pressure(in)`, `Visibility(mi)`, and `Wind_Speed(mph)`.
For instance, the highest recorded temperature on Earth is around 134°F so a max of 203 is unrealistic. Similarly, the highest wind speeds recorded in hurricanes and tornadoes are well below 1087 mph.

In [35]:
df_clean.describe()

Unnamed: 0,Severity,Start_Lat,Start_Lng,Distance(mi),Temperature(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Speed(mph),Precipitation(in)
count,7327531.0,7327531.0,7327531.0,7327531.0,7327531.0,7327531.0,7327531.0,7327531.0,7327531.0,7327531.0
mean,2.213746,36.17201,-94.75451,0.5502983,61.78422,64.80563,29.5452,9.094921,7.65451,0.005833346
std,0.4867966,5.09129,17.35555,1.757743,18.98521,22.8062,0.9961187,2.676606,5.264719,0.07851588
min,1.0,24.5548,-124.6238,0.0,-45.0,1.0,0.0,0.0,0.0,0.0
25%,2.0,33.36993,-117.2294,0.0,49.0,48.0,29.37,10.0,4.6,0.0
50%,2.0,35.78004,-87.84453,0.024,64.0,67.0,29.86,10.0,7.0,0.0
75%,2.0,40.08551,-80.39143,0.447,76.0,84.0,30.03,10.0,10.0,0.0
max,4.0,49.0022,-67.11317,441.75,203.0,100.0,58.63,140.0,1087.0,36.47


We can discard some inomalies and outliers for the numerical variables based on the following real-world observations:
* Temperatures as low as -60 °F can be observed in extreme cold regions, while temperatures up to 130 °F can be seen in extreme hot regions.
* Normal atmospheric pressure at sea level ranges from about 27 to 31 inches of mercury.
* Visibility can drop to 0 in dense fog, and clear conditions might extend visibility to about 20 miles.
* Typical wind speeds can vary from calm (0 mph) to extreme conditions like hurricanes and tornadoes, which can reach up to 150 mph.

In [36]:
print(f'{100 * (df_clean["Temperature(F)"] > 130).sum() / df_clean.shape[0]:.5f}% of the data reports a temperature of over 130F')
df_clean = df_clean[df_clean["Temperature(F)"] <= 130]

0.00053% of the data reports a temperature of over 130F


In [37]:
print(f'{100 * ((df_clean["Pressure(in)"] < 25) | (df_clean["Pressure(in)"] > 31)).sum() / df_clean.shape[0]:.2f}% of the data reports an atmospheric pressure outside of normal (<27 or >31)')
df_clean = df_clean[(df_clean["Pressure(in)"] >= 25) & (df_clean["Pressure(in)"] <= 31)]

1.26% of the data reports an atmospheric pressure outside of normal (<27 or >31)


In [38]:
print(f'{100 * (df_clean["Visibility(mi)"] > 20).sum() / df_clean.shape[0]:.5f}% of accident occurred with a visibility of 20 miles or more')
df_clean = df_clean[df_clean["Visibility(mi)"] <= 20]

0.10450% of accident occurred with a visibility of 20 miles or more


In [39]:
print(f'{100 * (df_clean["Wind_Speed(mph)"] > 150).sum() / df_clean.shape[0]:.5f}% of the data reports a wind speeds of over 150mph')
df_clean = df_clean[df_clean["Wind_Speed(mph)"] <= 150]

0.00066% of the data reports a wind speeds of over 150mph


In [40]:
print(f"{100 * (init_nrows - df_clean.shape[0]) / init_nrows:.2f}% of the original dataset was dropped")

6.48% of the original dataset was dropped


#### Boolean Variables

No aparent incosistencies as all were automatically detected as `bool`.

#### Categorical Variables

In [41]:
df_clean["City"].unique()

array(['Dayton', 'Reynoldsburg', 'Williamsburg', ..., 'Ness City',
       'Clarksdale', 'American Fork-Pleasant Grove'], dtype=object)

In [42]:
df_clean["County"].unique()

array(['Montgomery', 'Franklin', 'Clermont', ..., 'Woods', 'Mellette',
       'Ness'], dtype=object)

Only data from the State of Alaska is missing.

In [43]:
df_clean["State"].unique()

array(['OH', 'WV', 'CA', 'FL', 'GA', 'SC', 'NE', 'IA', 'IL', 'MO', 'WI',
       'IN', 'MI', 'NJ', 'NY', 'CT', 'MA', 'RI', 'NH', 'PA', 'KY', 'MD',
       'VA', 'DC', 'DE', 'TX', 'WA', 'OR', 'AL', 'NC', 'AZ', 'TN', 'LA',
       'MN', 'OK', 'NV', 'UT', 'KS', 'NM', 'AR', 'CO', 'MS', 'ME', 'VT',
       'ID', 'ND', 'MT', 'SD', 'WY'], dtype=object)

### 2.3. Extract More Data and Rename Columns

We can extract the road type from the `Street` name.

In [44]:
def road_type(street_name):
    # Check if the street name contains common highway prefixes or suffixes
    highway_prefixes = ['I-', 'US-', 'SR-', 'HWY', 'INTERSTATE', 'US HIGHWAY', 'STATE ROUTE']
    highway_suffixes = ['INTERSTATE', 'HIGHWAY', 'EXPRESSWAY', 'TURNPIKE', 'PARKWAY', 'ROUTE']
    
    for prefix in highway_prefixes:
        if street_name.upper().startswith(prefix):
            return 'Highway'
    
    for suffix in highway_suffixes:
        if street_name.upper().endswith(suffix):
            return 'Highway'
    
    return 'Local Road'

In [45]:
df_clean["Road_Type"] = df_clean["Street"].apply(road_type)

In [46]:
df_clean["Road_Type"].value_counts()

Local Road    5215438
Highway       2012187
Name: Road_Type, dtype: int64

We can also extract time related data from the `Start_Time` and `End_Time`

In [47]:
df_clean["Duration(min)"] = (df_clean["End_Time"] - df_clean["Start_Time"]).dt.total_seconds() / 60

In [48]:
df_clean["Duration(min)"].describe()

count    7.227625e+06
mean     3.913585e+02
std      1.250362e+04
min      1.216667e+00
25%      3.036667e+01
50%      7.430000e+01
75%      1.243167e+02
max      2.812939e+06
Name: Duration(min), dtype: float64

3948 (0.054\%) of reported accident in the dataset had an impact on traffic flow for more than 24h. Those are either anomalies in the dataset or extremely rare cases. We could therefore remove these rows.

In [49]:
df_clean = df_clean[df_clean["Duration(min)"] <= 24*60*60]

In [50]:
df_clean["Hour"] = df_clean["Start_Time"].dt.hour

In [51]:
df_clean["Day"] = df_clean["Start_Time"].dt.day

In [52]:
df_clean["Day_of_Week"] = df_clean["Start_Time"].dt.dayofweek

In [53]:
df_clean["Month"] = df_clean["Start_Time"].dt.month

In [54]:
df_clean["Year"] = df_clean["Start_Time"].dt.year

Convert columns to boolean

In [55]:
df_clean = df_clean.rename(columns={"Sunrise_Sunset": "Is_Night"})
df_clean["Is_Night"] = df_clean["Is_Night"] == "Night"

In [56]:
df_clean = df_clean.rename(columns={"Road_Type": "Is_Highway"})
df_clean["Is_Highway"] = df_clean["Is_Highway"] == "Highway"

Final columns

In [57]:
df_clean.isnull().sum()

Severity             0
Start_Time           0
End_Time             0
Start_Lat            0
Start_Lng            0
Distance(mi)         0
Street               0
City                 0
County               0
State                0
Temperature(F)       0
Humidity(%)          0
Pressure(in)         0
Visibility(mi)       0
Wind_Speed(mph)      0
Precipitation(in)    0
Weather_Condition    0
Amenity              0
Bump                 0
Crossing             0
Give_Way             0
Junction             0
No_Exit              0
Railway              0
Roundabout           0
Station              0
Stop                 0
Traffic_Calming      0
Traffic_Signal       0
Turning_Loop         0
Is_Night             0
Date                 0
ID                   0
Weather_Category     0
Weather_Intensity    0
Is_Highway           0
Duration(min)        0
Hour                 0
Day                  0
Day_of_Week          0
Month                0
Year                 0
dtype: int64

In [58]:
print(f"{100 * (init_nrows - df_clean.shape[0]) / init_nrows:.2f}% of the original dataset was dropped")

6.53% of the original dataset was dropped


### 2.4. Save Clean Data

In [59]:
df_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7223677 entries, 0 to 7728393
Data columns (total 42 columns):
 #   Column             Dtype         
---  ------             -----         
 0   Severity           int64         
 1   Start_Time         datetime64[ns]
 2   End_Time           datetime64[ns]
 3   Start_Lat          float64       
 4   Start_Lng          float64       
 5   Distance(mi)       float64       
 6   Street             object        
 7   City               object        
 8   County             object        
 9   State              object        
 10  Temperature(F)     float64       
 11  Humidity(%)        float64       
 12  Pressure(in)       float64       
 13  Visibility(mi)     float64       
 14  Wind_Speed(mph)    float64       
 15  Precipitation(in)  float64       
 16  Weather_Condition  object        
 17  Amenity            bool          
 18  Bump               bool          
 19  Crossing           bool          
 20  Give_Way           bool 

In [60]:
df_clean.to_csv("data/US_Accidents_March23_Clean.csv", index=False)