
# Data

The United States Coast Guard's Boating Safety Division compiled the 2018 US boat accidents dataset. The dataset contains statistics derived from accident reports filed by the owners and operators of recreational vessels in the US. It consists of 3823 observations and 49 variables.

Some variables of particular interest are:

* State: The state in which the accident occurred
* NumberDeaths: Number of people who died in the accident
* NumberInjured: Number of people injured in the accident
* TotalDamage: Total monetary damage resulting from the accident
* AccidentCause1, AccidentCause2, AccidentCause3: Up to three recorded causes of the accident

The full dataset can be found [here](http://ymn.web.illinois.edu/data/Accidents.csv).

In [0]:
import pandas as pd
import numpy as np
import plotly.express as px
# read the data from the URL
df = pd.read_csv('http://ymn.web.illinois.edu/data/Accidents.csv', encoding='unicode_escape')

# print the first five observations from the dataset
df.head()

Unnamed: 0,BARDID,Year,RedactedNarrative,NumberDeaths,NumberInjured,MeetsInjuryThreshold,MeetsDamageThreshold,NumberVesselsInvolved,NumberVesselsLost,TotalDamage,Date,Time,State,Location,NameOfBodyOfWater,NearestCityTown,County,Clear,Cloudy,Fog,Rain,Snow,Hazy,WaterConditions,StrongCurrent,HazardousWaters,CongestedWaters,Wind,AirTemperature,WaterTemperature,DayNight,Visibility,DayofWeek,AccidentCause1,AccidentCause2,AccidentCause3,OtherAccidentCause,MachineryFailure,OtherMachineryFailure,EquipmentFailure,OtherEquipmentFailure,AccidentEvent1,AccidentEvent2,AccidentEvent3,OtherAccidentEvent,StateCaseID,Latitude,Longitude,CoordinatesConfidence
0,AK-2018-0001,2018,US Coast Guard rescued an adult paddler outsid...,0.0,1.0,-1,0,1.0,0.0,0.0,3/4/2018 0:00:00,12/30/1899 14:00:00,AK,Valdez,Port of Valdez,Valdez,Valdez-Cordova,-1.0,0.0,0.0,0.0,0.0,0.0,Choppy,N,Y,N,Strong,39.0,39.0,-1.0,Fair,Sunday,Hazardous waters,Operator inattention,,,,,,,Capsizing,Person ejected from vessel,,,,,,
1,AK-2018-0002,2018,"According to witnesses, the two victims depart...",2.0,0.0,0,0,1.0,0.0,0.0,4/1/2018 0:00:00,12/30/1899 13:00:00,AK,Pogibshi Point,Peril Strait,Sitka,Sitka City and Borough,0.0,-1.0,0.0,0.0,0.0,0.0,Very rough,N,Y,N,Storm,,38.0,-1.0,Fair,Sunday,Hazardous waters,Weather,,,,,,,Capsizing,Person ejected from vessel,,,AK18021535,57.51,-135.54,
2,AK-2018-0003,2018,"On April 2, 2018 at approximately 0810 hours, ...",1.0,0.0,0,0,1.0,1.0,0.0,3/31/2018 0:00:00,12/30/1899 11:00:00,AK,outside of harbor,Passage Canal,Whittier,Valdez-Cordova,-1.0,0.0,0.0,0.0,0.0,0.0,Rough,,Y,,Storm,39.0,,-1.0,Good,Saturday,Hazardous waters,Weather,,,,,,,Flooding/swamping,Sinking,Person ejected from vessel,,AK18021719,60.8,-148.63,
3,AK-2018-0004,2018,On 5/25/18 at approximately 0149 hours Trooper...,1.0,0.0,0,0,1.0,0.0,0.0,5/25/2018 0:00:00,12/30/1899 2:00:00,AK,Palmer,Finger Lake,Palmer,Matanuska Susitna,-1.0,0.0,0.0,0.0,0.0,0.0,Calm,,,,,45.0,40.0,0.0,Fair,Friday,Alcohol use,,,,,,,,Capsizing,Person ejected from vessel,,,AK18034991,61.6,-149.28,NC
4,AK-2018-0005,2018,On 06/09/2018 troopers and park rangers respon...,1.0,2.0,-1,0,2.0,0.0,0.0,6/9/2018 0:00:00,12/30/1899 15:04:00,AK,Adjacent to Big Lake,Flat Lake,Big Lake,Matanuska-Susitna Borough,-1.0,0.0,0.0,0.0,0.0,0.0,Calm,N,N,N,,,,-1.0,Good,Saturday,Alcohol use,Excessive speed,Operator inattention,,,,,,Collision with recreational vessel,Skier mishap,Person struck by propeller,,AK18039110,61.53,-149.89,NC


# Where do US boat accidents occur the most?

In [0]:
from collections import Counter

# convert counter object to dataframe
df_accident_count = pd.DataFrame.from_dict(Counter(df['State']), orient='index').reset_index()

# rename columns
df_accident_count = df_accident_count.rename(columns={'index':'State', 0:'NumberOfAccidents'})

# California has no accidents in the dataset, so add it manually with a count of 0
df_accident_count.loc[-1] = ['CA', 0]

# derive each state's share of all accidents (in percent), stored as a string for the hover text
df_accident_count['percentage'] = [str(i) for i in round(df_accident_count['NumberOfAccidents'] / df_accident_count['NumberOfAccidents'].sum() * 100, 4)]

# display first five observations in dataframe
df_accident_count.head()

Unnamed: 0,State,NumberOfAccidents,percentage
0,AK,22,0.5755
1,AL,66,1.7264
2,AR,60,1.5694
3,AT,10,0.2616
4,AZ,129,3.3743
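
The same per-state tally can also be produced with pandas alone. The sketch below is a minimal illustration on a small made-up frame; the `toy` rows and the `all_states` list are assumptions for demonstration, not values from the dataset. It uses `value_counts` and `reindex` so that a state with no rows (CA in this toy example) shows up with a count of 0 without a manual insert.

```python
import pandas as pd

# toy stand-in for df['State']; these rows are made up for illustration
toy = pd.DataFrame({'State': ['FL', 'FL', 'TX', 'AK', 'FL', 'TX']})

# hypothetical list of states to report, including one absent from the data
all_states = ['AK', 'CA', 'FL', 'TX']

# count rows per state, then add missing states with a count of 0
counts = toy['State'].value_counts().reindex(all_states, fill_value=0)

# build the summary frame with the same column names used above
toy_count = pd.DataFrame({'State': counts.index, 'NumberOfAccidents': counts.values})

# share of all accidents, in percent, rounded to 4 decimal places
toy_count['percentage'] = (
    toy_count['NumberOfAccidents'] / toy_count['NumberOfAccidents'].sum() * 100
).round(4)

print(toy_count)
```

Keeping the percentage numeric until the hover text is built makes it easier to sort or sum the column later.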


In [0]:
import plotly.graph_objects as go

# create figure
fig = go.Figure(data = go.Choropleth(
    locations=df_accident_count['State'], 
    z = df_accident_count['NumberOfAccidents'], # Data to be color-coded
    locationmode = 'USA-states',
    colorscale = 'Blues',
    colorbar_title = "Number of Recreational Boating Accidents",
    text =  df_accident_count['State'] + ' accounts for ' + df_accident_count['percentage']  + '% of US recreational Boating Accidents'
))

# add figure title
fig.update_layout(
    title_text = 'Number of Boat Accidents in 2018 by State',
    title_x = .5,
    geo_scope = 'usa'
)

# display visual
fig.show()

Color vision deficiency is a condition in which a person has difficulty distinguishing certain colors under normal lighting. The most common form is red-green color blindness; people with it can typically still distinguish shades of blue. Therefore, I used a blue sequential color scheme for the choropleth map.

According to the choropleth map, states bordering bodies of water tend to have more recreational boating accidents, as indicated by the darker shades of blue. California is an outlier, however: it is the only state with zero recreational boating accidents recorded in the dataset for 2018. Of all 50 states, Florida has the highest number of accidents, and the gap between it and the state with the second highest count is large. 607 boating accidents occurred in Florida in 2018, accounting for 15.8776% of all US recreational boating accidents that year. The next highest is Texas with 204 accidents, or 5.3361% of the total. Florida has lax boating regulations, and it is easy to operate a boat there with no practice. This may be why Florida's numbers are so high compared with the rest of the country.
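
The percentages quoted above can be checked by hand. The short sketch below assumes the denominator is the 3823 observations mentioned in the Data section, with each observation counted as one accident; the variable names are illustrative only.

```python
# counts quoted in the text above
total_accidents = 3823  # assumed denominator: number of observations in the dataset
florida = 607
texas = 204

# share of all accidents, in percent, rounded to 4 decimal places
florida_share = round(florida / total_accidents * 100, 4)
texas_share = round(texas / total_accidents * 100, 4)

# how many times more accidents Florida has than Texas
ratio = florida / texas

print(florida_share, texas_share, round(ratio, 2))
```

Under that assumption, Florida's count is roughly three times that of Texas.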


# How do water temperature and air temperature correlate with the number of people injured or killed in recreational boating accidents in 2018?

In [0]:
from plotly.subplots import make_subplots

# create subplot with titles
fig2 = make_subplots(rows=1, cols=2,
                     specs=[[{'is_3d': True}, {'is_3d': True}]],
                     subplot_titles=("Number of Deaths vs Water Temperature vs Air Temperature", "Number Injured vs Water Temperature vs Air Temperature"),
                     shared_xaxes=True
                     )

# add traces
fig2.add_trace(
    go.Scatter3d(x = df['WaterTemperature'], y = df['AirTemperature'], z = df['NumberDeaths'], mode = 'markers', 
                 marker=dict(
                     size=8,
                     color= df['NumberDeaths'],
                     colorscale='Viridis',
                     )
                 ),
    row = 1, 
    col = 1
)

fig2.add_trace(
    go.Scatter3d(x = df['WaterTemperature'], y = df['AirTemperature'], z = df['NumberInjured'], mode = 'markers',
                 marker=dict(
                     size=8,
                     color= df['NumberInjured'],
                     colorscale='Viridis',
                     )
                 ),
    row = 1, 
    col = 2
)

# set axis labels (x is water temperature and y is air temperature in both traces)
fig2.update_layout(scene = dict(
                    xaxis_title = 'Water Temperature',
                    yaxis_title = 'Air Temperature',
                    zaxis_title = 'Number of Deaths'),
                   scene2 = dict(
                    xaxis_title = 'Water Temperature',
                    yaxis_title = 'Air Temperature',
                    zaxis_title = 'Number of Injured')
)

# display visual
fig2.show()

I want to see whether air temperature and water temperature correlate with the number of deaths and the number of injuries in recreational boating accidents. In the 3D scatterplots above, the points lie fairly flat along the z-axis. Across the full range of air and water temperatures, the range of deaths and injuries per accident stays about the same. Therefore, it is reasonable to conclude that neither air temperature nor water temperature has a strong correlation with the number of deaths or the number of people injured.
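
A visual impression like this can be backed up with a correlation coefficient. The sketch below is a minimal illustration using pandas' `DataFrame.corr` (Pearson by default) on made-up values; the `toy` frame only borrows the column names and is not the dataset, so the numbers it prints say nothing about the real data.

```python
import numpy as np
import pandas as pd

# made-up values for illustration only; not taken from the dataset
toy = pd.DataFrame({
    'WaterTemperature': [40.0, 50.0, 60.0, 70.0, 80.0, np.nan],
    'AirTemperature':   [45.0, 55.0, 65.0, 75.0, 85.0, 90.0],
    'NumberDeaths':     [0.0, 1.0, 0.0, 1.0, 0.0, 2.0],
})

# Pearson correlation matrix; pairs with a missing value are skipped
corr = toy[['WaterTemperature', 'AirTemperature', 'NumberDeaths']].corr()

# a value near 0 means no linear relationship; near +1 or -1 means a strong one
water_vs_deaths = corr.loc['WaterTemperature', 'NumberDeaths']

print(corr.round(3))
```

Running the same call on `df[['WaterTemperature', 'AirTemperature', 'NumberDeaths', 'NumberInjured']]` would give the corresponding coefficients for the real data.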

In [0]:
# create subplots
fig4 = make_subplots(rows = 2, cols = 2,
                     subplot_titles=("Water Temperature vs Number Injured",
                                     "Water Temperature vs Number of Deaths",
                                     "Air Temperature vs Number Injured",
                                     "Air Temperature vs Number of Deaths"))

# create 2d scatterplots
fig4.add_trace(
    go.Scatter(x = df['WaterTemperature'], y = df['NumberInjured'], mode = 'markers'),
    row = 1, col = 1
)

fig4.add_trace(
    go.Scatter(x = df['WaterTemperature'], y = df['NumberDeaths'], mode = 'markers'),
    row = 1, col = 2
)

fig4.add_trace(
    go.Scatter(x = df['AirTemperature'], y = df['NumberInjured'], mode = 'markers'),
    row = 2, col = 1
)

fig4.add_trace(
    go.Scatter(x = df['AirTemperature'], y = df['NumberDeaths'], mode = 'markers'),
    row = 2, col = 2
)

# Update xaxis properties
fig4.update_xaxes(title_text = "Water Temperature", row=1, col=1)
fig4.update_xaxes(title_text = "Water Temperature", row=1, col=2)
fig4.update_xaxes(title_text = "Air Temperature", row=2, col=1)
fig4.update_xaxes(title_text = "Air Temperature", row=2, col=2)

# Update yaxis properties
fig4.update_yaxes(title_text = "Number Injured", row=1, col=1)
fig4.update_yaxes(title_text = "Number of Deaths", row=1, col=2)
fig4.update_yaxes(title_text = "Number Injured", row=2, col=1)
fig4.update_yaxes(title_text = "Number of Deaths", row=2, col=2)

# add title
fig4.update_layout(title_text = "Side by Side Comparisons", title_x = .5)

# display visual
fig4.show()


The 2D scatterplots let the reader see the same trends as the 3D scatterplots more clearly. In all four scatterplots, the points form a flat trend. There is no positive or negative correlation between temperature and the number of injuries or deaths: as the temperature changes, the number injured and the number of deaths neither increase nor decrease. Therefore, it is reasonable to conclude that air and water temperature do not have a significant relationship with the number of injuries or deaths in recreational boating accidents.
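
One way to make the "flat trend" concrete is to fit a straight line and look at its slope. The sketch below is a minimal illustration using `numpy.polyfit` on made-up arrays; the values and the names `temperature` and `injured` are assumptions for demonstration, not data from the notebook.

```python
import numpy as np

# made-up values for illustration only; not taken from the dataset
temperature = np.array([40.0, 50.0, 60.0, 70.0, 80.0, np.nan])
injured = np.array([1.0, 2.0, 1.0, 2.0, 1.0, 3.0])

# keep only the pairs where both values are present
mask = ~np.isnan(temperature) & ~np.isnan(injured)

# least-squares fit of: injured ≈ slope * temperature + intercept
slope, intercept = np.polyfit(temperature[mask], injured[mask], deg=1)

print(slope, intercept)
```

A slope close to zero means the fitted line is flat, which matches the reading of the scatterplots above.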


# What are the top causes of US boat accidents in 2018?

In [0]:
# convert counter object to dataframe
df_accident_cause = pd.DataFrame.from_dict(Counter(df['AccidentCause1']), orient='index').reset_index()
df_accident_cause2 = pd.DataFrame.from_dict(Counter(df['AccidentCause2']), orient='index').reset_index()
df_accident_cause3 = pd.DataFrame.from_dict(Counter(df['AccidentCause3']), orient='index').reset_index()

# sum the counts from the three cause columns and convert them to percentages
# (the counts are added by row position, so this assumes all three frames list the causes in the same order)
df_accident_cause[0] = (df_accident_cause[0] + df_accident_cause2[0] + df_accident_cause3[0])
df_accident_cause['Percentage'] = [str(i) for i in round(df_accident_cause[0] / df_accident_cause[0].sum() * 100, 4)]

# rename columns
df_accident_cause = df_accident_cause.rename(columns={'index':'AccidentCause', 0:'Frequency'})

# drop rows whose summed frequency is NaN
df_accident_cause = df_accident_cause[~np.isnan(df_accident_cause['Frequency'])]

# display first five observations in dataframe
df_accident_cause.head()

Unnamed: 0,AccidentCause,Frequency,Percentage
0,Hazardous waters,3597.0,31.5692
1,Alcohol use,424.0,3.7213
2,Restricted vision,2016.0,17.6935
3,Machinery failure,497.0,4.3619
4,Ignition of fuel or vapor,231.0,2.0274
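
Adding the three count columns row by row relies on each frame listing the causes in the same order. An alternative that matches counts by cause label can be sketched as follows. This is a minimal illustration on made-up rows, not the notebook's method; the `toy` frame and the resulting `toy_cause` frame are hypothetical.

```python
import pandas as pd

# made-up rows for illustration only; not taken from the dataset
toy = pd.DataFrame({
    'AccidentCause1': ['Alcohol use', 'Weather', 'Alcohol use'],
    'AccidentCause2': ['Weather', None, 'Excessive speed'],
    'AccidentCause3': [None, None, 'Weather'],
})
cause_cols = ['AccidentCause1', 'AccidentCause2', 'AccidentCause3']

# concatenate the three columns into one Series and count each label;
# value_counts skips missing values by default
counts = pd.concat([toy[c] for c in cause_cols], ignore_index=True).value_counts()

# summary frame with the same column names used above
toy_cause = pd.DataFrame({'AccidentCause': counts.index, 'Frequency': counts.values})

# share of all recorded causes, in percent
toy_cause['Percentage'] = (toy_cause['Frequency'] / toy_cause['Frequency'].sum() * 100).round(4)

print(toy_cause)
```

Because the counts are keyed by label, a cause that appears only in the second or third column is still tallied under its own name.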


In [0]:
fig3  =  go.Figure(data = go.Bar(
    x = df_accident_cause['AccidentCause'],
    y = df_accident_cause['Frequency'],
    text = df_accident_cause['Frequency'],
    textposition = 'outside',
    hovertext = 'Accounts for ' + df_accident_cause['Percentage']  + '% of US recreational Boating Accidents',
    marker_color='#9ecae1', marker_line_color='#08306b'
    )
)

# add figure title and axis settings (bars show frequencies; percentages appear on hover)
fig3.update_layout(
    title_text = 'Frequency of Recreational Boat Accident Causes in 2018',
    title_x = .5,
    xaxis = {'title': 'Accident Cause', 'categoryorder': 'total descending'},
    yaxis = {'title': 'Frequency', 'range': [0, 4000]},
)

fig3.show()

Above is a bar chart of the causes of recreational boating accidents in 2018. The x-axis shows the cause categories, and the y-axis shows how often each cause was recorded. Hovering over a bar shows the percentage of 2018 US recreational boating accidents that cause accounts for; the percentages sum to 100%. From the bar chart, we can see that hazardous waters is the most common cause, accounting for ~31.6% of recreational boating accidents. The second most common cause is restricted vision, accounting for ~17.7% of the accidents. Together, these two causes account for ~49.3% of accidents, so about half of the time the cause is outside the operator's control. Accidents that are the operator's fault are less common: causes such as improper lookout, excessive speed, alcohol use, and operator inattention account for only ~27% of the accidents. Most of the time, accidents are caused by either natural conditions or equipment and machinery failure. Based on this visualization, boat operators should be advised not to sail in poor conditions and to check their equipment before sailing.