# Dataset #

- Location: https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents
- Subject: US Accidents between 2016 and 2023

# Problem to Solve #

1. Problem 1: Given the weather forecast, create a model that predicts how many accidents will happen.
    - Reason: predicting how many accidents will happen allows the local emergency services to plan ahead.

2. Problem 2: Given the weather forecast, create a model that predicts the total impact for that day.
    - Note: impact = duration of the problems caused by the accidents
    - Reason: impact means that people were stuck in traffic, losing productive time. Reducing the number of accidents can increase the overall productivity of the population by reducing wasted time.

In [8]:
import pandas as pd
import numpy as np

In [2]:
# load the dataset
accidents = pd.read_csv("../raw/US_Accidents_March23.csv")

In [3]:
print (accidents.shape)
accidents.head()

(7728394, 46)


Unnamed: 0,ID,Source,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,End_Lat,End_Lng,Distance(mi),...,Roundabout,Station,Stop,Traffic_Calming,Traffic_Signal,Turning_Loop,Sunrise_Sunset,Civil_Twilight,Nautical_Twilight,Astronomical_Twilight
0,A-1,Source2,3,2016-02-08 05:46:00,2016-02-08 11:00:00,39.865147,-84.058723,,,0.01,...,False,False,False,False,False,False,Night,Night,Night,Night
1,A-2,Source2,2,2016-02-08 06:07:59,2016-02-08 06:37:59,39.928059,-82.831184,,,0.01,...,False,False,False,False,False,False,Night,Night,Night,Day
2,A-3,Source2,2,2016-02-08 06:49:27,2016-02-08 07:19:27,39.063148,-84.032608,,,0.01,...,False,False,False,False,True,False,Night,Night,Day,Day
3,A-4,Source2,3,2016-02-08 07:23:34,2016-02-08 07:53:34,39.747753,-84.205582,,,0.01,...,False,False,False,False,False,False,Night,Day,Day,Day
4,A-5,Source2,2,2016-02-08 07:39:07,2016-02-08 08:09:07,39.627781,-84.188354,,,0.01,...,False,False,False,False,True,False,Day,Day,Day,Day


# What Information is Not Related to the Problem? #

1. Location Coordinates (Lat / Lng)
- Start_Lat
- Start_Lng
- End_Lat
- End_Lng

2. General Information about the accident
- Source: Source of raw accident data.
- ID: This is a unique identifier of the accident record.
- Description: Shows a human-provided description of the accident.
- Country
- Timezone
- Airport_Code

3. Points of Interest
- Amenity: A POI annotation which indicates presence of amenity in a nearby location.
- Bump: A POI annotation which indicates presence of speed bump or hump in a nearby location.
- Crossing: A POI annotation which indicates presence of crossing in a nearby location.
- Give_Way: A POI annotation which indicates presence of give_way in a nearby location.
- Junction: A POI annotation which indicates presence of junction in a nearby location.
- No_Exit: A POI annotation which indicates presence of no_exit in a nearby location.
- Railway: A POI annotation which indicates presence of railway in a nearby location.
- Roundabout: A POI annotation which indicates presence of roundabout in a nearby location.
- Station: A POI annotation which indicates presence of station in a nearby location.
- Stop: A POI annotation which indicates presence of stop in a nearby location.
- Traffic_Calming: A POI annotation which indicates presence of traffic_calming in a nearby location.
- Traffic_Signal: A POI annotation which indicates presence of traffic_signal in a nearby location.
- Turning_Loop: A POI annotation which indicates presence of turning_loop in a nearby location.

4. Sunset - Twilight variations - Sunrise
- Sunrise_Sunset: Shows the period of day (i.e. day or night) based on sunrise/sunset.
- Civil_Twilight: Shows the period of day (i.e. day or night) based on civil twilight.
- Nautical_Twilight: Shows the period of day (i.e. day or night) based on nautical twilight.
- Astronomical_Twilight: Shows the period of day (i.e. day or night) based on astronomical twilight.

    - Civil Twilight: This phase begins when the Sun is less than 6 degrees below the horizon. In the morning, it starts before sunrise, and in the evening, it begins at sunset. Civil twilight is the brightest form of twilight, providing enough natural sunlight for outdoor activities without artificial light. It’s often used for aviation, hunting, and street lighting regulations.
    - Nautical Twilight: Occurring when the Sun is between 6 and 12 degrees below the horizon, nautical twilight is less bright than civil twilight. During this phase, artificial light is generally needed for outdoor activities. Nautical twilight has historical significance for sailors who used stars to navigate the seas. It’s also relevant for military planning.
    - Astronomical Twilight: This phase happens when the Sun is between 12 and 18 degrees below the horizon. It’s the darkest form of twilight. Astronomers use it to define dawn and dusk. During astronomical twilight, celestial objects become visible, and the horizon remains clear. It’s essential for stargazing and astronomical observations.

5. Miscellaneous
- Zipcode
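
The twilight definitions listed under item 4 can be sketched as a small classifier on the Sun's elevation angle. This is only an illustration: the function name `twilight_phase` and the sample angles are hypothetical, the dataset itself stores just a Day/Night label per column, and the handling of the exact boundaries (0, 6, 12 and 18 degrees) is an assumption that ignores atmospheric refraction.

```python
def twilight_phase(sun_elevation_deg: float) -> str:
    """Classify the sky from the Sun's elevation in degrees (negative = below the horizon)."""
    if sun_elevation_deg >= 0:
        return "day"
    depression = -sun_elevation_deg  # degrees below the horizon
    if depression < 6:
        return "civil twilight"
    if depression < 12:
        return "nautical twilight"
    if depression < 18:
        return "astronomical twilight"
    return "night"


# Hypothetical sample angles, one per phase
phases = [twilight_phase(angle) for angle in (10.0, -3.0, -9.0, -15.0, -25.0)]
print(phases)
```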


# Useful Information for the Problem #

1. Accident Location
- Street
- City
- County
- State

2. Impact of the accident
- New Column: Duration (End_Time - Start_Time)
    - Start_Time: Shows start time of the accident in local time zone.
    - End_Time: Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed.
- Severity: Shows the severity of the accident, a number between 1 and 4, where 1 indicates the least impact on traffic (i.e., short delay as a result of the accident) and 4 indicates a significant impact on traffic (i.e., long delay).
- Distance(mi): The length of the road extent affected by the accident in miles.

3. Accident Date
- Start_Time: Shows start time of the accident in local time zone.
- End_Time: Shows end time of the accident in local time zone. End time here refers to when the impact of accident on traffic flow was dismissed.

4. Weather Information

- Weather_Timestamp: Shows the time-stamp of weather observation record (in local time).
- Temperature(F)
- Wind_Chill(F)
- Humidity(%)
- Pressure(in)
- Visibility(mi): Shows visibility (in miles).
- Wind_Direction
- Wind_Speed(mph)
- Precipitation(in)
- Weather_Condition: Shows the weather condition (rain, snow, thunderstorm, fog, etc.)
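
The proposed Duration column (End_Time - Start_Time) and the per-day total impact from Problem 2 could be computed roughly as follows. This is a minimal sketch on a toy frame with illustrative timestamps; the column names `Duration_min` and `Accident_Date` and the choice of minutes as the unit are assumptions. On the real data the timestamps may need `format="mixed"` when parsing, as the notebook does for `Start_Time`.

```python
import pandas as pd

# Toy rows standing in for the real data (timestamps are illustrative only)
toy = pd.DataFrame({
    "Start_Time": ["2016-02-08 05:46:00", "2016-02-08 06:07:59", "2016-02-09 07:23:34"],
    "End_Time":   ["2016-02-08 11:00:00", "2016-02-08 06:37:59", "2016-02-09 07:53:34"],
})

start = pd.to_datetime(toy["Start_Time"])
end = pd.to_datetime(toy["End_Time"])

# Duration of each accident's traffic impact, in minutes
toy["Duration_min"] = (end - start).dt.total_seconds() / 60

# Total impact per calendar day
toy["Accident_Date"] = start.dt.date
daily_impact = toy.groupby("Accident_Date")["Duration_min"].sum()
print(daily_impact)
```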

In [9]:
# create a new DataFrame with accidents from Los Angeles
# (.copy() gives an independent frame, so the in-place edits below do not touch `accidents`)
accidents_temp = accidents.loc[accidents.City == 'Los Angeles'].copy()
accidents_temp.shape



(156491, 46)

In [10]:
#############################################################################
# drop the columns that will not be used
#############################################################################

# 'End_Lat','End_Lng' (Start_Lat and Start_Lng are not dropped here)
# 'Wind_Chill(F)','Wind_Direction'
# 'Source','ID','Description','Country','Timezone','Airport_Code'
# 'Amenity','Bump','Crossing','Give_Way','Junction','No_Exit','Railway','Roundabout','Station','Stop','Traffic_Calming','Traffic_Signal','Turning_Loop'
# 'Sunrise_Sunset','Civil_Twilight','Nautical_Twilight','Astronomical_Twilight'
# 'Zipcode'


accidents_temp.drop(columns=['End_Lat','End_Lng','Wind_Chill(F)','Wind_Direction'], inplace=True)
accidents_temp.drop(columns=['Source','ID','Description','Country','Timezone','Airport_Code'], inplace=True)
accidents_temp.drop(columns=['Amenity','Bump','Crossing','Give_Way','Junction','No_Exit','Railway'], inplace=True)
accidents_temp.drop(columns=['Roundabout','Station','Stop','Traffic_Calming','Traffic_Signal','Turning_Loop'], inplace=True)
accidents_temp.drop(columns=['Sunrise_Sunset','Civil_Twilight','Nautical_Twilight','Astronomical_Twilight'], inplace=True)                            
accidents_temp.drop(columns=['Zipcode'], inplace=True) 

accidents_temp.shape

accidents_temp.head()

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,Distance(mi),Street,City,County,State,Weather_Timestamp,Temperature(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Speed(mph),Precipitation(in),Weather_Condition
42866,2,2016-06-21 10:46:30,2016-06-21 11:27:00,34.078926,-118.28904,0.0,US-101 N,Los Angeles,Los Angeles,CA,2016-06-21 10:47:00,82.9,47.0,29.95,10.0,4.6,,Clear
42867,3,2016-06-21 10:49:21,2016-06-21 11:34:21,34.091179,-118.239471,0.0,Golden State Fwy S,Los Angeles,Los Angeles,CA,2016-06-21 10:47:00,82.9,47.0,29.95,10.0,4.6,,Clear
42881,3,2016-06-21 10:51:45,2016-06-21 11:36:45,34.037239,-118.309074,0.0,I-10 W,Los Angeles,Los Angeles,CA,2016-06-21 10:47:00,82.9,47.0,29.95,10.0,4.6,,Clear
42883,3,2016-06-21 10:56:24,2016-06-21 11:34:00,34.027458,-118.27449,0.0,Harbor Fwy N,Los Angeles,Los Angeles,CA,2016-06-21 10:47:00,82.9,47.0,29.95,10.0,4.6,,Clear
42898,3,2016-06-21 11:30:46,2016-06-21 12:00:46,33.947544,-118.279434,0.0,Harbor Fwy N,Los Angeles,Los Angeles,CA,2016-06-21 11:53:00,80.1,52.0,29.96,10.0,9.2,,Clear
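
Column names are case-sensitive, so a misspelled name such as `Give_away` (the column is `Give_Way`) would make `drop` raise a `KeyError`. One way to guard against this, sketched on a toy frame with hypothetical variable names, is to check the requested names against the frame before dropping:

```python
import pandas as pd

# Toy frame with a handful of the real column names
toy = pd.DataFrame({
    "ID": ["A-1"],
    "Source": ["Source2"],
    "Severity": [3],
    "Give_Way": [False],
    "Zipcode": ["45424"],
})

# "Give_away" is a deliberate misspelling used to show the check
requested = ["Source", "ID", "Give_away", "Zipcode"]

unknown = [name for name in requested if name not in toy.columns]
present = [name for name in requested if name in toy.columns]

print("Not found in the DataFrame:", unknown)
trimmed = toy.drop(columns=present)
print(list(trimmed.columns))
```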


In [51]:
#################################################################
# Counting Missing Data Points in the Data Frame
# (the output below is from a later run, In [51], after the imputation cells further down)
#################################################################

missing_values_count = accidents_temp.isnull().sum()

missing_values_count[0:20]

Severity              0
Start_Time            0
End_Time              0
Start_Lat             0
Start_Lng             0
Distance(mi)          0
Street               69
City                  0
County                0
State                 0
Weather_Timestamp     0
Temperature(F)        0
Humidity(%)           0
Pressure(in)          0
Visibility(mi)        0
Wind_Speed(mph)       0
Precipitation(in)     0
Weather_Condition     0
Accident_Date         0
dtype: int64
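
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch, using a toy frame with made-up values and a hypothetical `summary` table:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "Temperature(F)": [82.9, np.nan, 80.1, 79.0],
    "Precipitation(in)": [np.nan, np.nan, 0.0, np.nan],
    "Street": ["US-101 N", None, "I-10 W", "Harbor Fwy N"],
})

# isnull().mean() is the fraction of missing rows per column
summary = pd.DataFrame({
    "missing": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
}).sort_values("missing", ascending=False)

print(summary)
```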

In [12]:
###################################
# Data Normalization
###################################

# Accident Date (no hour)
accidents_temp['Accident_Date'] = pd.to_datetime(accidents_temp.Start_Time , format = "mixed")
accidents_temp['Accident_Date'] = accidents_temp['Accident_Date'].dt.date


In [13]:
# Temperature(F)
# Replace the null values by the average temperature

print ("Temperature Null:", accidents_temp['Temperature(F)'].map(lambda p: pd.isnull(p)).sum() )
print ("Temperature Not Null:", accidents_temp['Temperature(F)'].notnull().sum())
print ("Temperature Mean:",accidents_temp['Temperature(F)'].mean())

# Temperature Null: 1105
# Temperature Not Null: 155386
# Mean: 65.65538465498824

temperature_mean = accidents_temp['Temperature(F)'].mean()
accidents_temp['Temperature(F)'] = np.where (
                                            pd.isnull(accidents_temp['Temperature(F)']) 
                                            , temperature_mean
                                            , accidents_temp['Temperature(F)'] )


Temperature Null: 1105
Temperature Not Null: 155386
Temperature Mean: 65.65538465498824


In [14]:
# Weather_Timestamp
# when Weather_Timestamp is null, replace it with Start_Time

accidents_temp['Weather_Timestamp'] = np.where (
                                            pd.isnull(accidents_temp['Weather_Timestamp']) 
                                            , accidents_temp['Start_Time']
                                            , accidents_temp['Weather_Timestamp'] )

In [15]:
# Humidity(%)
# when humidity is null, add humidity mean

print ("Humidity Null:", accidents_temp['Humidity(%)'].map(lambda p: pd.isnull(p)).sum() )
print ("Humidity Not Null:", accidents_temp['Humidity(%)'].notnull().sum())
print ("Humidity Mean:",accidents_temp['Humidity(%)'].mean())

humidity_mean = accidents_temp['Humidity(%)'].mean()
accidents_temp['Humidity(%)'] = np.where ( pd.isnull(accidents_temp['Humidity(%)']) 
                                            , humidity_mean
                                            , accidents_temp['Humidity(%)'] )


Humidity Null: 1180
Humidity Not Null: 155311
Humidity Mean: 60.53426994868361


In [16]:
# Pressure(in)
# when Pressure(in) is null, replace by the Pressure(in) mean

print ("Pressure Null:", accidents_temp['Pressure(in)'].map(lambda p: pd.isnull(p)).sum() )
print ("Pressure Not Null:", accidents_temp['Pressure(in)'].notnull().sum())
print ("Pressure Mean:",accidents_temp['Pressure(in)'].mean())

pressure_mean = accidents_temp['Pressure(in)'].mean()

accidents_temp['Pressure(in)'] = np.where ( pd.isnull(accidents_temp['Pressure(in)']) 
                                            , pressure_mean
                                            , accidents_temp['Pressure(in)'] )



Pressure Null: 965
Pressure Not Null: 155526
Pressure Mean: 29.852869938145385


In [17]:
# Visibility(mi)
# when visibility is null, replace by the mean
# 
  
print ("Visibility Null:", accidents_temp['Visibility(mi)'].map(lambda p: pd.isnull(p)).sum() )
print ("Visibility Not Null:", accidents_temp['Visibility(mi)'].notnull().sum())
print ("Visibility Mean:",accidents_temp['Visibility(mi)'].mean())

# Visibility Null: 684
# Visibility Not Null: 155807
# Visibility Mean: 9.099849878375169

visibility_mean = accidents_temp['Visibility(mi)'].mean()

accidents_temp['Visibility(mi)'] = np.where ( pd.isnull(accidents_temp['Visibility(mi)']) 
                                            , visibility_mean
                                            , accidents_temp['Visibility(mi)'] )



Visibility Null: 684
Visibility Not Null: 155807
Visibility Mean: 9.099849878375169


In [18]:
# Wind_Speed(mph)
# when wind_speed is null, replace by the mean

print ("Wind_Speed Null:", accidents_temp['Wind_Speed(mph)'].map(lambda p: pd.isnull(p)).sum() )
print ("Wind_Speed Not Null:", accidents_temp['Wind_Speed(mph)'].notnull().sum())
print ("Wind_Speed Mean:",accidents_temp['Wind_Speed(mph)'].mean())

# Wind_Speed Null: 23989
# Wind_Speed Not Null: 132502
# Wind_Speed Mean: 3.9225702253550887

wind_speed_mean = accidents_temp['Wind_Speed(mph)'].mean()

accidents_temp['Wind_Speed(mph)'] = np.where ( pd.isnull(accidents_temp['Wind_Speed(mph)']) 
                                            , wind_speed_mean
                                            , accidents_temp['Wind_Speed(mph)'] )



Wind_Speed Null: 23989
Wind_Speed Not Null: 132502
Wind_Speed Mean: 3.9225702253550887


In [19]:
# Precipitation(in)
# when precipitation is null, replace by the mean

print ("Precipitation Null:", accidents_temp['Precipitation(in)'].map(lambda p: pd.isnull(p)).sum() )
print ("Precipitation Not Null:", accidents_temp['Precipitation(in)'].notnull().sum())
print ("Precipitation Mean:",accidents_temp['Precipitation(in)'].mean())

# Precipitation Null: 47471
# Precipitation Not Null: 109020
# Precipitation Mean: 0.0034345074298293894

precipitation_mean = accidents_temp['Precipitation(in)'].mean()

accidents_temp['Precipitation(in)'] = np.where ( pd.isnull(accidents_temp['Precipitation(in)']) 
                                            , precipitation_mean
                                            , accidents_temp['Precipitation(in)'] )



Precipitation Null: 47471
Precipitation Not Null: 109020
Precipitation Mean: 0.0034345074298293894
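
The mean-imputation cells above all follow the same pattern, so they could be collapsed into one loop with `fillna`. A minimal sketch on a toy frame; the list name `numeric_weather_columns` is hypothetical and the values are made up:

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "Temperature(F)": [60.0, np.nan, 70.0],
    "Humidity(%)": [40.0, 50.0, np.nan],
})

numeric_weather_columns = ["Temperature(F)", "Humidity(%)"]

for column in numeric_weather_columns:
    column_mean = toy[column].mean()  # mean() skips NaN by default
    toy[column] = toy[column].fillna(column_mean)

print(toy)
```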


In [20]:
# How many Weather Conditions are in the full dataset (all cities, not only Los Angeles)?
#
# Use set() to eliminate duplicate values in column 'Weather_Condition'
unique_values_set = set(accidents['Weather_Condition'])

# Print the unique values
print (pd.DataFrame(unique_values_set))

                           0
0                        NaN
1                Dust Whirls
2               Blowing Dust
3       Light Snow and Sleet
4                 Small Hail
..                       ...
140        N/A Precipitation
141                     Hail
142        Sleet and Thunder
143  Thunder in the Vicinity
144              Heavy Smoke

[145 rows x 1 columns]


In [21]:
# Weather_Condition
# Replace missing values with the value from the next row (backward fill)

print ("Weather Condition Null:", accidents_temp['Weather_Condition'].isnull().sum())
print ("Weather Condition NOT Null:", accidents_temp['Weather_Condition'].notnull().sum())

# Weather Condition Null: 650
# Weather Condition NOT Null: 155841

accidents_temp['Weather_Condition'] = np.where ( pd.isnull(accidents_temp['Weather_Condition']) 
                                            , accidents_temp['Weather_Condition'].bfill(axis='rows')
                                            , accidents_temp['Weather_Condition'] )


Weather Condition Null: 650
Weather Condition NOT Null: 155841
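
One edge case of a backward fill is worth checking: a missing value in the last row has no row below it, so it stays missing. A small sketch on a toy Series; chaining a forward fill afterwards is a suggestion, not something the notebook does:

```python
import numpy as np
import pandas as pd

# Toy Series: the last value is missing and has no row below it
conditions = pd.Series(["Clear", np.nan, "Rain", np.nan], name="Weather_Condition")

back_filled = conditions.bfill()      # the trailing NaN is still there
filled = conditions.bfill().ffill()   # forward fill covers what is left

print(back_filled.isnull().sum(), filled.isnull().sum())
```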


In [22]:
accidents_temp.head()

Unnamed: 0,Severity,Start_Time,End_Time,Start_Lat,Start_Lng,Distance(mi),Street,City,County,State,Weather_Timestamp,Temperature(F),Humidity(%),Pressure(in),Visibility(mi),Wind_Speed(mph),Precipitation(in),Weather_Condition,Accident_Date
42866,2,2016-06-21 10:46:30,2016-06-21 11:27:00,34.078926,-118.28904,0.0,US-101 N,Los Angeles,Los Angeles,CA,2016-06-21 10:47:00,82.9,47.0,29.95,10.0,4.6,0.003435,Clear,2016-06-21
42867,3,2016-06-21 10:49:21,2016-06-21 11:34:21,34.091179,-118.239471,0.0,Golden State Fwy S,Los Angeles,Los Angeles,CA,2016-06-21 10:47:00,82.9,47.0,29.95,10.0,4.6,0.003435,Clear,2016-06-21
42881,3,2016-06-21 10:51:45,2016-06-21 11:36:45,34.037239,-118.309074,0.0,I-10 W,Los Angeles,Los Angeles,CA,2016-06-21 10:47:00,82.9,47.0,29.95,10.0,4.6,0.003435,Clear,2016-06-21
42883,3,2016-06-21 10:56:24,2016-06-21 11:34:00,34.027458,-118.27449,0.0,Harbor Fwy N,Los Angeles,Los Angeles,CA,2016-06-21 10:47:00,82.9,47.0,29.95,10.0,4.6,0.003435,Clear,2016-06-21
42898,3,2016-06-21 11:30:46,2016-06-21 12:00:46,33.947544,-118.279434,0.0,Harbor Fwy N,Los Angeles,Los Angeles,CA,2016-06-21 11:53:00,80.1,52.0,29.96,10.0,9.2,0.003435,Clear,2016-06-21


In [158]:
accidents_los_angeles = pd.DataFrame()

accidents_los_angeles.head()

In [159]:
##########################################################
# Build Data Frame with data aggregated by day
##########################################################

accidents_los_angeles['Accident_Date'] = accidents_temp['Accident_Date'].unique()

In [160]:
# Weather Condition
# get the most common weather condition of the day and use it

weather_counts = accidents_temp.groupby(['Accident_Date', 'Weather_Condition']).size().reset_index(name='counts')
weather = weather_counts.loc[weather_counts.groupby('Accident_Date')['counts'].idxmax()]

# Merge the weather condition into the DataFrame
accidents_los_angeles = pd.merge(weather , accidents_los_angeles , on='Accident_Date')

accidents_los_angeles = accidents_los_angeles.drop('counts' , axis=1)
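
To make the "most common condition per day" step easier to follow, here is the same groupby and `idxmax` pattern on a toy frame with made-up rows. The name `daily_weather` is hypothetical.

```python
import pandas as pd

# Toy frame with made-up rows
toy = pd.DataFrame({
    "Accident_Date": ["2016-03-22", "2016-03-22", "2016-03-22", "2016-03-23", "2016-03-23"],
    "Weather_Condition": ["Clear", "Clear", "Rain", "Fog", "Fog"],
})

# Count how often each condition occurs on each day
weather_counts = (
    toy.groupby(["Accident_Date", "Weather_Condition"])
       .size()
       .reset_index(name="counts")
)

# For each day, keep the row whose count is the largest
top_rows = weather_counts.loc[weather_counts.groupby("Accident_Date")["counts"].idxmax()]
daily_weather = top_rows.drop(columns="counts").reset_index(drop=True)

print(daily_weather)
```

When two conditions tie on a day, `idxmax` returns the first of them, which here is the alphabetically first condition because `groupby` sorts its keys by default.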

In [161]:
# Temperature 
# get the average of the day and use it

mean_temperature = accidents_temp.groupby('Accident_Date')['Temperature(F)'].mean().reset_index()

# merge the temperature into the DataFrame
accidents_los_angeles = pd.merge(mean_temperature , accidents_los_angeles , on='Accident_Date')



In [162]:
# Severity 
# get the average of the day and use it

mean_severity = accidents_temp.groupby('Accident_Date')['Severity'].mean().reset_index()

# merge the severity into the DataFrame
accidents_los_angeles = pd.merge(mean_severity , accidents_los_angeles , on='Accident_Date')


In [163]:
# Humidity(%) 
# get the average of the day and use it

mean_humidity = accidents_temp.groupby('Accident_Date')['Humidity(%)'].mean().reset_index()

# merge the humidity into the DataFrame
accidents_los_angeles = pd.merge(accidents_los_angeles ,mean_humidity , on='Accident_Date')

In [164]:
# Pressure(in)
# get the average of the day and use it

mean_pressure = accidents_temp.groupby('Accident_Date')['Pressure(in)'].mean().reset_index()

# merge the pressure into the DataFrame
accidents_los_angeles = pd.merge(accidents_los_angeles , mean_pressure , on='Accident_Date')

In [165]:
# Visibility(mi)
# get the average of the day and use it

mean_visibility = accidents_temp.groupby('Accident_Date')['Visibility(mi)'].mean().reset_index()

# merge the visibility into the DataFrame
accidents_los_angeles = pd.merge(accidents_los_angeles , mean_visibility , on='Accident_Date')

In [166]:
# Wind_Speed(mph)
# get the average of the day and use it

mean_wind_speed = accidents_temp.groupby('Accident_Date')['Wind_Speed(mph)'].mean().reset_index()

# merge the wind speed into the DataFrame
accidents_los_angeles = pd.merge(accidents_los_angeles , mean_wind_speed , on='Accident_Date')

In [167]:
# Precipitation(in)
# get the average of the day and use it

mean_precipitation = accidents_temp.groupby('Accident_Date')['Precipitation(in)'].mean().reset_index()

# merge the precipitation into the DataFrame
accidents_los_angeles = pd.merge(accidents_los_angeles , mean_precipitation , on='Accident_Date')

In [179]:
# Number of accidents per day
#

accidents_per_day = accidents_temp.groupby('Accident_Date').size().reset_index(name = "total_accidents")

# merge the accidents into the DataFrame
accidents_los_angeles = pd.merge(accidents_los_angeles , accidents_per_day , on='Accident_Date')
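
The per-column groupby and merge steps above could also be written as a single `groupby(...).agg(...)` with named aggregation. This is a sketch of an alternative on a toy frame with made-up values, not the approach the notebook takes; column names containing parentheses have to be passed through a dict because they are not valid Python keyword names.

```python
import pandas as pd

# Toy frame with made-up values
toy = pd.DataFrame({
    "Accident_Date": ["2016-03-22", "2016-03-22", "2016-03-23"],
    "Severity": [2, 3, 2],
    "Temperature(F)": [60.0, 66.0, 70.0],
})

daily = (
    toy.groupby("Accident_Date")
       .agg(
           Severity=("Severity", "mean"),           # average severity of the day
           total_accidents=("Severity", "size"),    # number of rows (accidents) per day
           **{"Temperature(F)": ("Temperature(F)", "mean")},
       )
       .reset_index()
)

print(daily)
```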


In [180]:
print (accidents_los_angeles.head())
print (accidents_los_angeles.shape)

  Accident_Date  Severity  Temperature(F) Weather_Condition  Humidity(%)  \
0    2016-03-22  2.400000       63.040000             Clear    26.000000   
1    2016-03-23  2.529412       68.491176             Clear    29.794118   
2    2016-03-24  2.485294       68.372059             Clear    33.544118   
3    2016-03-25  2.533333       65.831111             Clear    56.088889   
4    2016-03-26  2.736842       64.394737             Clear    70.631579   

   Pressure(in)  Visibility(mi)  Wind_Speed(mph)  Precipitation(in)  \
0     30.016000       10.000000         9.680000           0.003435   
1     30.114559       10.000000         5.885718           0.003435   
2     30.008235        9.870588         5.265365           0.003435   
3     29.824444       10.000000         4.864368           0.003435   
4     29.853158        9.842105         6.060947           0.003435   

   total_accidents  
0                5  
1               68  
2               68  
3               45  
4          

In [181]:
# Export DataFrame to CSV

accidents_los_angeles.to_csv("./Los_Angeles_Accidents_2016_2023.csv", index=False)
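
As a possible next step toward Problem 1, a daily table like the exported one can be fed to a simple baseline model. The sketch below assumes ordinary least squares via `numpy.linalg.lstsq`, which is just one option and not a method the notebook commits to. The rows are synthetic: they are generated from a known linear rule (20 + 0.4 * temperature + 100 * precipitation) so that the fit can be checked, and they do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Synthetic daily rows built from a known linear rule (not real data)
daily = pd.DataFrame({
    "Temperature(F)": [60.0, 65.0, 70.0, 75.0, 80.0],
    "Precipitation(in)": [0.0, 0.2, 0.0, 0.1, 0.0],
    "total_accidents": [44.0, 66.0, 48.0, 60.0, 52.0],
})

features = ["Temperature(F)", "Precipitation(in)"]

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones(len(daily)), daily[features].to_numpy()])
y = daily["total_accidents"].to_numpy(dtype=float)

# Least-squares fit: coefficients = [intercept, per-degree effect, per-inch effect]
coefficients, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coefficients

print(coefficients)
```

Since accident counts are non-negative integers, a count model may fit real data better than a linear baseline; that choice is left open here.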
