# US Accidents (3.5 Million Records)

# Import Dataset

In [1]:
import findspark 
findspark.init()

from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

us_accidents = spark.read \
 .option("inferSchema", "true") \
 .option("MultiLine", "true") \
 .option("header", "true") \
 .csv("US_Accidents_June20.csv")
us_accidents.take(1)

[Row(ID='A-1', Source='MapQuest', TMC=201.0, Severity=3, Start_Time='2016-02-08 05:46:00', End_Time='2016-02-08 11:00:00', Start_Lat=39.865147, Start_Lng=-84.058723, End_Lat=None, End_Lng=None, Distance(mi)=0.01, Description='Right lane blocked due to accident on I-70 Eastbound at Exit 41 OH-235 State Route 4.', Number=None, Street='I-70 E', Side='R', City='Dayton', County='Montgomery', State='OH', Zipcode='45424', Country='US', Timezone='US/Eastern', Airport_Code='KFFO', Weather_Timestamp='2016-02-08 05:58:00', Temperature(F)=36.9, Wind_Chill(F)=None, Humidity(%)=91.0, Pressure(in)=29.68, Visibility(mi)=10.0, Wind_Direction='Calm', Wind_Speed(mph)=None, Precipitation(in)=0.02, Weather_Condition='Light Rain', Amenity=False, Bump=False, Crossing=False, Give_Way=False, Junction=False, No_Exit=False, Railway=False, Roundabout=False, Station=False, Stop=False, Traffic_Calming=False, Traffic_Signal=False, Turning_Loop=False, Sunrise_Sunset='Night', Civil_Twilight='Night', Nautical_Twilight=

# Exploratory Data Analysis

In [2]:
#Count number of accidents
us_accidents.count()

3513617

This confirms that the dataset contains roughly 3.5 million accident records from the US.

In [3]:
#Preliminary View of Dataset
us_accidents.show(1)
#Or: us_accidents.first()

+---+--------+-----+--------+-------------------+-------------------+---------+----------+-------+-------+------------+--------------------+------+------+----+------+----------+-----+-------+-------+----------+------------+-------------------+--------------+-------------+-----------+------------+--------------+--------------+---------------+-----------------+-----------------+-------+-----+--------+--------+--------+-------+-------+----------+-------+-----+---------------+--------------+------------+--------------+--------------+-----------------+---------------------+
| ID|  Source|  TMC|Severity|         Start_Time|           End_Time|Start_Lat| Start_Lng|End_Lat|End_Lng|Distance(mi)|         Description|Number|Street|Side|  City|    County|State|Zipcode|Country|  Timezone|Airport_Code|  Weather_Timestamp|Temperature(F)|Wind_Chill(F)|Humidity(%)|Pressure(in)|Visibility(mi)|Wind_Direction|Wind_Speed(mph)|Precipitation(in)|Weather_Condition|Amenity| Bump|Crossing|Give_Way|Junction|No_E

## What can we learn about the first record (A-1) from this dataset?

Per the DataFrame, the first accident occurred on February 8, 2016, was rated severity 3 on the 1-to-4 scale, and blocked the right lane of I-70 Eastbound in Dayton, Ohio.

In [4]:
#List the columns of the dataframe
us_accidents.columns

['ID',
 'Source',
 'TMC',
 'Severity',
 'Start_Time',
 'End_Time',
 'Start_Lat',
 'Start_Lng',
 'End_Lat',
 'End_Lng',
 'Distance(mi)',
 'Description',
 'Number',
 'Street',
 'Side',
 'City',
 'County',
 'State',
 'Zipcode',
 'Country',
 'Timezone',
 'Airport_Code',
 'Weather_Timestamp',
 'Temperature(F)',
 'Wind_Chill(F)',
 'Humidity(%)',
 'Pressure(in)',
 'Visibility(mi)',
 'Wind_Direction',
 'Wind_Speed(mph)',
 'Precipitation(in)',
 'Weather_Condition',
 'Amenity',
 'Bump',
 'Crossing',
 'Give_Way',
 'Junction',
 'No_Exit',
 'Railway',
 'Roundabout',
 'Station',
 'Stop',
 'Traffic_Calming',
 'Traffic_Signal',
 'Turning_Loop',
 'Sunrise_Sunset',
 'Civil_Twilight',
 'Nautical_Twilight',
 'Astronomical_Twilight']

## Display schema and size of the DataFrame 

In [5]:
from IPython.display import display, Markdown

us_accidents.printSchema()
display(Markdown("This DataFrame has **%d rows**." % us_accidents.count()))

root
 |-- ID: string (nullable = true)
 |-- Source: string (nullable = true)
 |-- TMC: double (nullable = true)
 |-- Severity: integer (nullable = true)
 |-- Start_Time: string (nullable = true)
 |-- End_Time: string (nullable = true)
 |-- Start_Lat: double (nullable = true)
 |-- Start_Lng: double (nullable = true)
 |-- End_Lat: double (nullable = true)
 |-- End_Lng: double (nullable = true)
 |-- Distance(mi): double (nullable = true)
 |-- Description: string (nullable = true)
 |-- Number: double (nullable = true)
 |-- Street: string (nullable = true)
 |-- Side: string (nullable = true)
 |-- City: string (nullable = true)
 |-- County: string (nullable = true)
 |-- State: string (nullable = true)
 |-- Zipcode: string (nullable = true)
 |-- Country: string (nullable = true)
 |-- Timezone: string (nullable = true)
 |-- Airport_Code: string (nullable = true)
 |-- Weather_Timestamp: string (nullable = true)
 |-- Temperature(F): double (nullable = true)
 |-- Wind_Chill(F): double (nullable =

This DataFrame has **3513617 rows**.

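Per our learning materials, the typical data types in Spark are Booleans, Numbers, Strings, Dates and Timestamps, and Arrays, with null handling covered alongside them. Most columns in this dataset are strings, doubles, or booleans. Note that the timestamp columns (Start_Time, End_Time, Weather_Timestamp) were inferred as strings.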
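To check which types dominate, the type names can be tallied. This is a minimal sketch, assuming the input is a list of (column, type) pairs in the shape that `DataFrame.dtypes` returns. The helper name `count_types` and the four-column sample are illustrative only and are not part of the notebook.

```python
from collections import Counter


def count_types(dtypes):
    """Tally how many columns share each type name."""
    return Counter(type_name for _, type_name in dtypes)


# Illustrative sample; column names and types follow the schema printed above.
sample = [
    ("ID", "string"),
    ("TMC", "double"),
    ("Severity", "int"),
    ("Start_Time", "string"),
]

print(count_types(sample).most_common())
```

In the notebook, `count_types(us_accidents.dtypes)` would apply the same tally to all 49 columns.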

In [6]:
from IPython.display import display, Markdown
from pyspark.sql.functions import when, count, col, countDistinct, desc, first, lit


print ("Summary of a few numerical columns:")
us_accidents.select('Severity','Distance(mi)','Temperature(F)','Wind_Chill(F)','Humidity(%)','Pressure(in)','Visibility(mi)').summary().show()

print("Checking for nulls on numerical columns:")
us_accidents.select([count(when(col(c).isNull(), c)).alias(c) for c in ['Severity','Distance(mi)','Temperature(F)','Wind_Chill(F)','Humidity(%)','Pressure(in)','Visibility(mi)']]).show()

print("Checking amount of distinct values on numerical columns:")
us_accidents.select([countDistinct(c).alias(c) for c in ['Severity','Distance(mi)','Temperature(F)','Wind_Chill(F)','Humidity(%)','Pressure(in)','Visibility(mi)']]).show()

print ("Number of accidents per severity level:")
severity = us_accidents.groupBy('Severity').agg(count(lit(1)).alias("Total"))
severity.show()

print ("Summary of columns 'Start_Time' and 'End_Time':")
us_accidents.select('Start_Time','End_Time').summary().show()

print("Checking for nulls on columns 'Start_Time' and 'End_Time':")
us_accidents.select([count(when(col(c).isNull(), c)).alias(c) for c in ['Start_Time','End_Time']]).show()

print("Checking amount of distinct values in columns 'Start_Time' and 'End_Time':")
us_accidents.select([countDistinct(c).alias(c) for c in ['Start_Time','End_Time']]).show()

Summary of a few numerical columns:
+-------+------------------+------------------+------------------+------------------+------------------+------------------+-----------------+
|summary|          Severity|      Distance(mi)|    Temperature(F)|     Wind_Chill(F)|       Humidity(%)|      Pressure(in)|   Visibility(mi)|
+-------+------------------+------------------+------------------+------------------+------------------+------------------+-----------------+
|  count|           3513617|           3513617|           3447885|           1645368|           3443930|           3457735|          3437761|
|   mean|2.3399286262560772|0.2816166867384315| 61.93511900773892| 53.55729532847424| 65.11427003452451|29.744628810482546| 9.12264429086247|
| stddev|0.5521934519055788|1.5501343247484876|18.621056594747945|23.773336781939438|22.755581256697198|0.8319758234849445|2.885879326590114|
|    min|                 1|               0.0|             -89.0|             -89.0|               1.0|        

In [7]:
from pyspark.sql.functions import countDistinct
print("Checking amount of distinct values in ID, Source, Severity, and State:")
us_accidents.select([countDistinct(c).alias(c) for c in ['ID','Source','Severity','State']]).show()

Checking amount of distinct values in ID, Source, Severity, and State:
+-------+------+--------+-----+
|     ID|Source|Severity|State|
+-------+------+--------+-----+
|3513617|     3|       4|   49|
+-------+------+--------+-----+



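Given the previous output, we can confirm that the 3.5 million US accidents came from three sources, have four severity levels, and occurred across 49 distinct State values. Because DC appears in the State column, that corresponds to 48 states plus the District of Columbia.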

# Goal of Analysis

The goal of my analysis is to answer ten business questions about the US accidents recorded in this dataset between February 2016 and June 2020. Read below for the questions and their respective answers!

# Deep Dive Analysis 

## 1) In which states do most accidents occur? 

In [8]:
# Import only what is used below; importing min and max here would shadow Python's built-ins
from pyspark.sql.functions import avg, countDistinct, col

print ("States with the most accidents:")

location = us_accidents.groupBy('State').agg(countDistinct('ID').alias("Accidents"))

location.sort(col("Accidents").desc()).show(10)

States with the most accidents:
+-----+---------+
|State|Accidents|
+-----+---------+
|   CA|   816825|
|   TX|   329284|
|   FL|   258002|
|   SC|   173277|
|   NC|   165958|
|   NY|   160817|
|   PA|   106787|
|   IL|    99692|
|   VA|    96075|
|   MI|    95983|
+-----+---------+
only showing top 10 rows



It seems that accidents occur the most in California (CA), Texas (TX), and Florida (FL).

## 2) Which of the aforementioned states has the most severe accidents?

In [9]:
print ("States with the most severe accidents:")

location = us_accidents.groupBy('State').pivot('Severity').agg(countDistinct('ID'))

location.show(49)

States with the most severe accidents:
+-----+----+------+------+----+
|State|   1|     2|     3|   4|
+-----+----+------+------+----+
|   AZ|6705| 55089| 13178|3612|
|   SC| 116|137371| 34620|1170|
|   LA|1262| 47099| 11925|1229|
|   MN|  41| 53538| 27817| 467|
|   NJ|  93| 39160| 16040|3766|
|   DC|  43|  2991|  1099| 687|
|   OR|1263| 77747|  8073|3039|
|   VA|1739| 51639| 37187|5510|
|   RI|  71|  5567|  6000| 115|
|   KY|  40| 11920|  9745| 848|
|   WY|null|   133|   178| 197|
|   NH|   4|  6352|  1444| 184|
|   MI|  57| 57060| 33542|5324|
|   NV|   3|  6986|  3202| 452|
|   WI|  33| 10077|  7302|2708|
|   ID|null|  1594|   207| 243|
|   CA|5801|576742|225820|8462|
|   CT|  22| 12002| 11632|2245|
|   NE|  38| 20009|  3637| 286|
|   MT|null|   264|   141| 107|
|   NC|1806|139050| 22047|3055|
|   VT|   1|   486|   146|  69|
|   MD| 305| 26051| 21359|5878|
|   DE|  10|  4288|   629| 812|
|   MO|  70| 13868| 18014|1691|
|   IL| 265| 63401| 32652|3374|
|   ME|   1|  2065|    75| 102|
|

Between CA, TX, and FL, it seems as if FL has the most severe accidents.

## 3) What are the top descriptions of the accidents? 

In [10]:
print ("Descriptions of accidents:")

description = us_accidents.groupBy('State','Description').agg(countDistinct('ID').alias("Accidents"))

description.sort(col("Accidents").desc()).show(10)

Descriptions of accidents:
+-----+--------------------+---------+
|State|         Description|Accidents|
+-----+--------------------+---------+
|   CA|At I-405/San Dieg...|     1782|
|   CA| At I-15 - Accident.|     1665|
|   CA|  At I-5 - Accident.|     1495|
|   CA|At I-605 - Accident.|     1492|
|   CA|At Grand Ave - Ac...|     1037|
|   CA|At US-101 - Accid...|      795|
|   CA|At I-710/Long Bea...|      754|
|   CA|At Central Ave - ...|      725|
|   CA|At I-10/San Berna...|      723|
|   CA|At CA-60/Pomona F...|      710|
+-----+--------------------+---------+
only showing top 10 rows



The most frequent descriptions all come from California and point to accidents on interstates and freeways such as I-405, I-15, I-5, and I-605, which may align with the intuition we have about the safety of interstates. Maybe public officials in California should take a closer look at these dangerous stretches!
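Because the grouping above uses the exact description text, the same road can be split across many differently worded descriptions. One way to group by road instead is to pull the interstate designation out of each description. This is a minimal sketch in plain Python; the pattern `I-<digits>` and the helper name `interstates` are assumptions for illustration, and the sample strings are copied from the outputs shown earlier.

```python
import re
from collections import Counter

# Matches interstate designations such as "I-5" or "I-405".
INTERSTATE = re.compile(r"\bI-\d+\b")


def interstates(description):
    """Return every interstate designation mentioned in a description."""
    return INTERSTATE.findall(description)


# Descriptions copied from the outputs shown earlier in this notebook.
samples = [
    "At I-15 - Accident.",
    "At I-5 - Accident.",
    "At I-605 - Accident.",
    "Right lane blocked due to accident on I-70 Eastbound at Exit 41 OH-235 State Route 4.",
]

# Count how often each interstate is mentioned across the samples.
road_counts = Counter(road for text in samples for road in interstates(text))
print(road_counts)
```

In Spark, a similar pattern could probably be applied to the Description column with `regexp_extract` before grouping, though that variant is not tested here.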

## 4) Under what conditions do most of these accidents occur?

In [11]:
print ("Conditions of accidents:")

conditions = us_accidents.groupBy('Weather_Condition').agg(countDistinct('ID').alias("Accidents"))

conditions.sort(col("Accidents").desc()).show()

Conditions of accidents:
+--------------------+---------+
|   Weather_Condition|Accidents|
+--------------------+---------+
|               Clear|   808202|
|                Fair|   547721|
|       Mostly Cloudy|   488094|
|            Overcast|   382485|
|       Partly Cloudy|   344815|
|              Cloudy|   212878|
|    Scattered Clouds|   204660|
|          Light Rain|   176942|
|                null|    76138|
|          Light Snow|    50435|
|                Rain|    42016|
|                Haze|    38699|
|                 Fog|    31066|
|          Heavy Rain|    15351|
|       Light Drizzle|    12427|
|        Fair / Windy|     7954|
|                Snow|     5798|
|Light Thunderstor...|     4928|
|        Thunderstorm|     4440|
|Mostly Cloudy / W...|     4438|
+--------------------+---------+
only showing top 20 rows



Contrary to intuition, most accidents occur while the weather is clear, fair, or mostly cloudy. This may partly reflect how common such weather is, and it most likely also has to do with the fact that climates differ by state. For example, it may tend to rain more in Florida than in California.

## 5) What is the landscape of accidents in the area in which I live?

In [12]:
print ("An example of an accident in my Zipcode:")
# Zipcode is a string column, so compare against a string literal
my_zipcode = us_accidents.filter(col("Zipcode") == "22903")

my_zipcode.show(1)

print ("Number of accidents in my Zipcode:")
my_location = my_zipcode.groupBy('Zipcode').agg(countDistinct('ID').alias("Accidents"))

my_location.sort(col("Accidents").desc()).show()

print ("Severity of accidents in my Zipcode:")
my_location_severity = my_zipcode.groupBy('Zipcode').pivot('Severity').agg(countDistinct('ID'))

my_location_severity.show()

An example of an accident in my Zipcode:
+--------+--------+-----+--------+-------------------+-------------------+---------+----------+-------+-------+------------+--------------------+------+------+----+---------------+---------+-----+-------+-------+----------+------------+-------------------+--------------+-------------+-----------+------------+--------------+--------------+---------------+-----------------+-----------------+-------+-----+--------+--------+--------+-------+-------+----------+-------+-----+---------------+--------------+------------+--------------+--------------+-----------------+---------------------+
|      ID|  Source|  TMC|Severity|         Start_Time|           End_Time|Start_Lat| Start_Lng|End_Lat|End_Lng|Distance(mi)|         Description|Number|Street|Side|           City|   County|State|Zipcode|Country|  Timezone|Airport_Code|  Weather_Timestamp|Temperature(F)|Wind_Chill(F)|Humidity(%)|Pressure(in)|Visibility(mi)|Wind_Direction|Wind_Speed(mph)|Precipitation(

I have lived in Charlottesville, VA since August of 2015. Although I am not familiar with the first accident listed above, I am very familiar with the roads/interstates mentioned in the description. Fortunately, there haven't been severe accidents in the area nor have I been involved in them!

## 6) What is the average wind speed for each of the severity levels?

In [13]:
print ("Average wind speed for each of the severity levels:")

severity_wind = us_accidents.groupBy('Severity').agg(avg('Wind_Speed(mph)'))

severity_wind.show()

Average wind speed for each of the severity levels:
+--------+--------------------+
|Severity|avg(Wind_Speed(mph))|
+--------+--------------------+
|       1|   8.398047942805075|
|       3|    8.54860331734758|
|       4|   8.391610811941124|
|       2|   8.072976681701455|
+--------+--------------------+



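The averages differ by less than 0.5 mph across severity levels (from about 8.07 mph at level 2 to 8.55 mph at level 3) and do not rise steadily with severity, so any positive relationship between wind speed and severity is marginal at best.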

## 7) What is the average precipitation for each of the severity levels? 

In [14]:
print ("Average precipitation for each of the severity levels:")

severity_precipitation = us_accidents.groupBy('Severity').agg(avg('Precipitation(in)'))

severity_precipitation.show()

Average precipitation for each of the severity levels:
+--------+----------------------+
|Severity|avg(Precipitation(in))|
+--------+----------------------+
|       1|  0.005452252665367...|
|       3|  0.023859750898532792|
|       4|  0.012359928931492537|
|       2|  0.013614858837609049|
+--------+----------------------+



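Average precipitation rises from severity 1 (about 0.005 in) to severity 3 (about 0.024 in) but drops back to about 0.012 in at severity 4, so rain and severity appear to be positively related only marginally.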

## 8) What is the average temperature for each of the severity levels? 

In [15]:
print ("Average temperature for each of the severity levels:")

severity_temp = us_accidents.groupBy('Severity').agg(avg('Temperature(F)'))

severity_temp.show()

Average temperature for each of the severity levels:
+--------+-------------------+
|Severity|avg(Temperature(F))|
+--------+-------------------+
|       1|  70.74177685233386|
|       3|   61.8595724249142|
|       4|  59.02189844306066|
|       2| 61.994931063611716|
+--------+-------------------+



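From the above analysis, temperature and severity appear to be negatively related: the average temperature falls from about 70.7 °F at severity 1 to about 59.0 °F at severity 4. In other words, accidents that happen in colder conditions tend to be more severe.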
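The averages reported for questions 6 to 8 can be compared with a quick monotonicity check: if the group means do not move in one direction as severity goes from 1 to 4, a simple positive or negative relationship is hard to claim. This is a rough sketch in plain Python, with the averages rounded from the outputs above. The helper name `is_monotonic` is illustrative, and a trend in group means is only a proxy for correlation on the raw rows.

```python
def is_monotonic(values):
    """Return True if the values never decrease or never increase."""
    pairs = list(zip(values, values[1:]))
    rising = all(a <= b for a, b in pairs)
    falling = all(a >= b for a, b in pairs)
    return rising or falling


# Averages for severity 1, 2, 3, 4, rounded from the outputs of questions 6 to 8.
averages = {
    "Wind_Speed(mph)": [8.40, 8.07, 8.55, 8.39],
    "Precipitation(in)": [0.0055, 0.0136, 0.0239, 0.0124],
    "Temperature(F)": [70.74, 61.99, 61.86, 59.02],
}

for name, values in averages.items():
    print(name, is_monotonic(values))
```

For a correlation on the underlying rows rather than on group means, Spark's `DataFrame.stat.corr` may be worth trying on the relevant column pairs.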

## 9) How severe are accidents at crossings?

In [16]:
print ("Accidents at Crossings:")

crossing = us_accidents.groupBy('Crossing').pivot('Severity').agg(countDistinct('ID'))

crossing.show()

Accidents at Crossings:
+--------+-----+-------+------+------+
|Crossing|    1|      2|     3|     4|
+--------+-----+-------+------+------+
|    true| 8924| 240742| 19526|  5334|
|   false|20250|2132468|979387|106986|
+--------+-----+-------+------+------+



Accidents at crossings turn out to be mostly moderate: about 88% are severity 2, and only about 9% are severity 3 or 4. By comparison, roughly 34% of accidents away from crossings are severity 3 or 4.
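Raw counts are hard to compare because far more accidents happen away from crossings, so it helps to convert each row of the pivot table into shares. This is a minimal sketch in plain Python using the counts printed above; the helper name `severity_shares` is illustrative.

```python
def severity_shares(counts):
    """Convert severity counts for levels 1 to 4 into fractions of the row total."""
    total = sum(counts)
    return [count / total for count in counts]


# Counts per severity level (1, 2, 3, 4) from the pivot table above.
crossing = {
    True: [8924, 240742, 19526, 5334],
    False: [20250, 2132468, 979387, 106986],
}

for at_crossing, counts in crossing.items():
    shares = severity_shares(counts)
    high = shares[2] + shares[3]  # severity 3 and 4 combined
    print(at_crossing, f"severity 2: {shares[1]:.1%}", f"severity 3-4: {high:.1%}")
```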

## 10) How severe are accidents at railways? 

In [17]:
print ("Accidents at Railways:")

railway = us_accidents.groupBy('Railway').pivot('Severity').agg(countDistinct('ID'))

railway.show()

Accidents at Railways:
+-------+-----+-------+------+------+
|Railway|    1|      2|     3|     4|
+-------+-----+-------+------+------+
|   true|  538|  24589|  5235|   813|
|  false|28636|2348621|993678|111507|
+-------+-----+-------+------+------+



Most accidents at railways are severity 2 (about 79%), and about 19% are severity 3 or 4. That share is higher than at crossings (about 9%), although far fewer accidents happen near railways overall (31,175 versus 274,526).
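A single-number comparison between railways and crossings is the count-weighted mean severity. This is a rough sketch in plain Python using the counts from the two pivot tables above. It treats the ordinal severity levels as numbers, which is an assumption, and the helper name `mean_severity` is illustrative.

```python
def mean_severity(counts):
    """Weighted mean of severity levels 1 to 4 given their counts."""
    total = sum(counts)
    weighted = sum(level * count for level, count in enumerate(counts, start=1))
    return weighted / total


# Counts per severity level (1, 2, 3, 4) where the flag is true,
# copied from the two pivot tables above.
at_crossing = [8924, 240742, 19526, 5334]
at_railway = [538, 24589, 5235, 813]

print(f"crossing: {mean_severity(at_crossing):.2f}")
print(f"railway:  {mean_severity(at_railway):.2f}")
```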