# Data Visualisation Project 
## By Alexandre COGORDAN & Victor BOCQUIN

### Our motivation

Working on the dataset of car accidents in India provides an opportunity to explore complex dynamics related to road safety in a specific context.

The diversity of variables, such as road conditions, driver characteristics, vehicle details, and accident causes, allows for a deeper understanding of contributing factors to accidents. 

Analyzing these data can not only highlight key challenges in road safety but also provide crucial insights to guide targeted preventive initiatives. By understanding collision patterns, profiles of at-risk drivers, and predominant environmental conditions, we could, if this were a real project, contribute to improving road safety policies and perhaps to reducing accidents and creating a safer road environment in India.

### Introduction - Understanding our dataset

In [1463]:
import pandas as pd
import dash
from dash import dcc, html
from dash.dependencies import Input, Output
import plotly.express as px

pd.set_option('display.max_columns', None)

In [1464]:
df = pd.read_csv('road.csv')

In [1465]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12316 entries, 0 to 12315
Data columns (total 32 columns):
 #   Column                       Non-Null Count  Dtype 
---  ------                       --------------  ----- 
 0   Time                         12316 non-null  object
 1   Day_of_week                  12316 non-null  object
 2   Age_band_of_driver           12316 non-null  object
 3   Sex_of_driver                12316 non-null  object
 4   Educational_level            11575 non-null  object
 5   Vehicle_driver_relation      11737 non-null  object
 6   Driving_experience           11487 non-null  object
 7   Type_of_vehicle              11366 non-null  object
 8   Owner_of_vehicle             11834 non-null  object
 9   Service_year_of_vehicle      8388 non-null   object
 10  Defect_of_vehicle            7889 non-null   object
 11  Area_accident_occured        12077 non-null  object
 12  Lanes_or_Medians             11931 non-null  object
 13  Road_allignment              12

In [1466]:
df.head()

Unnamed: 0,Time,Day_of_week,Age_band_of_driver,Sex_of_driver,Educational_level,Vehicle_driver_relation,Driving_experience,Type_of_vehicle,Owner_of_vehicle,Service_year_of_vehicle,Defect_of_vehicle,Area_accident_occured,Lanes_or_Medians,Road_allignment,Types_of_Junction,Road_surface_type,Road_surface_conditions,Light_conditions,Weather_conditions,Type_of_collision,Number_of_vehicles_involved,Number_of_casualties,Vehicle_movement,Casualty_class,Sex_of_casualty,Age_band_of_casualty,Casualty_severity,Work_of_casuality,Fitness_of_casuality,Pedestrian_movement,Cause_of_accident,Accident_severity
0,17:02:00,Monday,18-30,Male,Above high school,Employee,1-2yr,Automobile,Owner,Above 10yr,No defect,Residential areas,,Tangent road with flat terrain,No junction,Asphalt roads,Dry,Daylight,Normal,Collision with roadside-parked vehicles,2,2,Going straight,na,na,na,na,,,Not a Pedestrian,Moving Backward,Slight Injury
1,17:02:00,Monday,31-50,Male,Junior high school,Employee,Above 10yr,Public (> 45 seats),Owner,5-10yrs,No defect,Office areas,Undivided Two way,Tangent road with flat terrain,No junction,Asphalt roads,Dry,Daylight,Normal,Vehicle with vehicle collision,2,2,Going straight,na,na,na,na,,,Not a Pedestrian,Overtaking,Slight Injury
2,17:02:00,Monday,18-30,Male,Junior high school,Employee,1-2yr,Lorry (41?100Q),Owner,,No defect,Recreational areas,other,,No junction,Asphalt roads,Dry,Daylight,Normal,Collision with roadside objects,2,2,Going straight,Driver or rider,Male,31-50,3,Driver,,Not a Pedestrian,Changing lane to the left,Serious Injury
3,1:06:00,Sunday,18-30,Male,Junior high school,Employee,5-10yr,Public (> 45 seats),Governmental,,No defect,Office areas,other,Tangent road with mild grade and flat terrain,Y Shape,Earth roads,Dry,Darkness - lights lit,Normal,Vehicle with vehicle collision,2,2,Going straight,Pedestrian,Female,18-30,3,Driver,Normal,Not a Pedestrian,Changing lane to the right,Slight Injury
4,1:06:00,Sunday,18-30,Male,Junior high school,Employee,2-5yr,,Owner,5-10yrs,No defect,Industrial areas,other,Tangent road with flat terrain,Y Shape,Asphalt roads,Dry,Darkness - lights lit,Normal,Vehicle with vehicle collision,2,2,Going straight,na,na,na,na,,,Not a Pedestrian,Overtaking,Slight Injury


We display the percentage of null values per column, then drop every row that contains at least one null value.

In [1467]:
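# Print the percentage of null values per column (a notebook only displays the last expression of a cell)
print(df.isna().sum() / len(df) * 100)
df = df.dropna()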
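
As a minimal sketch of this step, the toy frame below (hypothetical values, not taken from `road.csv`) shows how the percentage of missing values per column is computed and how aggressive `dropna()` is: it removes every row that has a missing value in any column. Judging from the grouped counts displayed under Graph 4, which sum to 2,846, this appears to leave about 23% of the original 12,316 rows, so all later graphs describe that subset.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing value
toy = pd.DataFrame({
    "Day_of_week": ["Monday", "Friday", "Friday", "Sunday"],
    "Educational_level": ["High school", None, "Junior high school", None],
    "Defect_of_vehicle": ["No defect", "No defect", None, "No defect"],
})

# Percentage of missing values per column
missing_pct = toy.isna().sum() / len(toy) * 100
print(missing_pct)

# dropna() keeps only the rows with no missing value in any column
complete_rows = toy.dropna()
print(len(toy), "->", len(complete_rows))
```

If losing that many rows is a concern, `dropna(subset=[...])` restricted to the columns a given graph actually uses would be a gentler alternative.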

## Graph 1 - Distribution of accidents by day of the week

##### Each bar represents the count of accidents on a specific day, with colors distinguishing days. The checkbox filter allows users to explore this distribution based on the gender of drivers, offering insights into how accident patterns vary across different days for the selected genders. We can observe that, overall, the number of accidents is highest on Fridays. When filtering for men, it is also highest on Fridays. However, when filtering for women, there were more accidents on Mondays.

In [1468]:
app = dash.Dash(__name__)


app.layout = html.Div([
    dcc.Graph(id='accidents-by-day'),
    dcc.Checklist(
        id='gender-filter',
        options=[
            {'label': 'Male', 'value': 'Male'},
            {'label': 'Female', 'value': 'Female'},

        ],
        value=['Male', 'Female'],
        labelStyle={'display': 'block'}
    )
])

@app.callback(
    Output('accidents-by-day', 'figure'),
    [Input('gender-filter', 'value')]
)
def update_graph(selected_genders):
    filtered_df = df[df['Sex_of_driver'].isin(selected_genders)]
    fig = px.histogram(filtered_df, x='Day_of_week', color='Day_of_week',
                       category_orders={"Day_of_week": ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]},
                       title='Accidents by Day of the Week',
                       labels={'Day_of_week': 'Day of the Week'})
    return fig


if __name__ == '__main__':
    app.run_server(debug=True)
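
The filtering done inside the callback can be checked without starting the Dash server. The sketch below reproduces the same `isin` filter on hypothetical toy data and counts accidents per day; `accidents_per_day` and `DAY_ORDER` are illustrative names, not part of the app above.

```python
import pandas as pd

DAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def accidents_per_day(data, selected_genders):
    """Count accidents per weekday for the selected driver genders (hypothetical helper)."""
    filtered = data[data["Sex_of_driver"].isin(selected_genders)]
    # reindex puts the days in calendar order and fills absent days with 0
    return filtered["Day_of_week"].value_counts().reindex(DAY_ORDER, fill_value=0)

# Hypothetical toy data
toy = pd.DataFrame({
    "Day_of_week": ["Friday", "Friday", "Monday", "Monday", "Friday"],
    "Sex_of_driver": ["Male", "Male", "Female", "Female", "Female"],
})

male_counts = accidents_per_day(toy, ["Male"])
female_counts = accidents_per_day(toy, ["Female"])
print(male_counts.idxmax(), female_counts.idxmax())
```

Calling the same helper on the real `df` with `['Male']` and `['Female']` would give the numbers behind the comparison described above.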

## Graph 2 - Distribution of accidents by educational level
##### We thought it would be a good idea to check whether a higher educational level meant fewer accidents. Judging by the results, this seems to be true, but it could also be explained by the age, and therefore the driving experience, of the drivers - a factor we'll explore later in the other graphs.

In [1469]:
fig2 = px.histogram(df, x='Educational_level', title='Distribution of accidents by educational level')
fig2.show()
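
A histogram of raw counts cannot separate the effect of education from that of driving experience. One possible way to look at both together is a cross-tabulation, sketched here on hypothetical toy data (the values are illustrative and are not results from the dataset).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Educational_level": ["Junior high school", "Junior high school", "Junior high school",
                          "Above high school", "Above high school", "Junior high school"],
    "Driving_experience": ["1-2yr", "5-10yr", "5-10yr", "Above 10yr", "5-10yr", "1-2yr"],
})

# Rows: educational level; columns: driving experience;
# values: share of that educational level's accidents (each row sums to 1)
experience_share = pd.crosstab(
    toy["Educational_level"], toy["Driving_experience"], normalize="index"
)
print(experience_share.round(2))
```

Note that accident counts alone still say nothing about accident rates, since the number of drivers in each educational level is not part of the data.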

## Graph 3 - Violin plot of Age Distribution by Day of the Week

##### A violin plot illustrating the distribution of driver age bands across different days of the week, providing insights into the age demographics associated with accidents on each day. Each day looks very similar, but we can notice that, for example, there are more accidents involving 18-50 year olds on Friday than on Monday.

In [1470]:
fig3 = px.violin(df, x='Day_of_week', y='Age_band_of_driver', title='Age Distribution by Day of the Week')
fig3.show()

## Graph 4 - 3D scatter plot of Age band, Gender, and Number of accidents

##### This 3D scatter plot depicts the relationship between driver age bands, gender, and the number of accidents (nb_Accident). The color of the points represents the number of accidents. We can see that male drivers aged 18 to 50 account for the largest share of accidents in this dataset.

In [1471]:
df2 = df.copy()
df2['nb_Accident'] = 1
df2 = df2.groupby(['Sex_of_driver', 'Age_band_of_driver']).count().reset_index()
df2

Unnamed: 0,Sex_of_driver,Age_band_of_driver,Time,Day_of_week,Educational_level,Vehicle_driver_relation,Driving_experience,Type_of_vehicle,Owner_of_vehicle,Service_year_of_vehicle,Defect_of_vehicle,Area_accident_occured,Lanes_or_Medians,Road_allignment,Types_of_Junction,Road_surface_type,Road_surface_conditions,Light_conditions,Weather_conditions,Type_of_collision,Number_of_vehicles_involved,Number_of_casualties,Vehicle_movement,Casualty_class,Sex_of_casualty,Age_band_of_casualty,Casualty_severity,Work_of_casuality,Fitness_of_casuality,Pedestrian_movement,Cause_of_accident,Accident_severity,nb_Accident
0,Female,18-30,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9
1,Female,31-50,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3
2,Female,Over 51,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4,4
3,Female,Under 18,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2
4,Female,Unknown,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140,140
5,Male,18-30,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987,987
6,Male,31-50,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906,906
7,Male,Over 51,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352,352
8,Male,Under 18,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194,194
9,Male,Unknown,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249,249


In [1472]:
fig4 = px.scatter_3d(df2, x='Age_band_of_driver', y='Sex_of_driver', z='nb_Accident',
                    title='Age band, Gender, and Number of accidents', color='nb_Accident')

fig4.show()
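
As the table above shows, `groupby(...).count()` repeats the same number in every column. A possible alternative, sketched here on hypothetical toy data, is `size()`, which returns a single count per group. The two agree only when the counted columns have no missing values, which should hold here because rows with nulls were dropped earlier.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Sex_of_driver": ["Male", "Male", "Female", "Male"],
    "Age_band_of_driver": ["18-30", "18-30", "31-50", "31-50"],
    "Time": ["17:02:00", "1:06:00", "1:06:00", "9:30:00"],
})

# One row per (gender, age band) with a single count column
accident_counts = (
    toy.groupby(["Sex_of_driver", "Age_band_of_driver"])
       .size()
       .reset_index(name="nb_Accident")
)
print(accident_counts)
```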

## Graph 5: Line chart of the number of accidents over time

##### The following line chart illustrates how the number of accidents varies with the time of day. This can be useful to spot time-dependent patterns and to see when most accidents happen during the day.

In [1473]:
df3 = df.copy()
df3['nb_Accident'] = 1


df3 = df3.groupby(['Time']).count().reset_index()

# Smooth the counts with a 10-point rolling mean to make the visualisation more readable
df3['nb_Accident'] = df3['nb_Accident'].rolling(10).mean()

fig5 = px.line(df3, x='Time', y='nb_Accident', title='Number of Accidents Over Time')

fig5.show()
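
One caveat: `Time` is stored as text (for example `17:02:00` and `1:06:00`), so grouping on it orders the rows lexicographically rather than chronologically, and the rolling mean then averages neighbours that are not adjacent in time. The sketch below illustrates the difference on hypothetical values and shows one possible fix, assuming every value has the form `H:MM:SS` or `HH:MM:SS`.

```python
import pandas as pd

# Hypothetical time strings in the same style as the Time column
times = pd.Series(["9:30:00", "17:02:00", "1:06:00", "10:15:00", "17:45:00"])

# Sorting the raw strings is lexicographic, so "10:15:00" comes before "1:06:00"
print(sorted(times))

# Parsing with an explicit format gives values that sort chronologically
# (assumption: every value looks like H:MM:SS or HH:MM:SS)
hours = pd.to_datetime(times, format="%H:%M:%S").dt.hour

# Accidents per hour of the day, in chronological order
accidents_per_hour = hours.value_counts().sort_index()
print(accidents_per_hour)
```

Grouping by hour also reduces noise on its own, so the rolling window could probably be shortened or dropped.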

## Graph 6: Distribution of accidents by light conditions
##### What we wished to analyse with this graph was the possible effect of light on accidents. We expected that a lot of accidents would happen in darkness because of the reduced visibility. However, most accidents happened in daylight, which still makes sense because traffic is heavier during that period. We were nevertheless very surprised that most of the accidents that happened in darker conditions occurred on roads where the lights were lit.

In [1474]:
fig6 = px.pie(df, names='Light_conditions', color_discrete_sequence=px.colors.sequential.RdBu)
fig6.show()

## Graph 7: Vehicle Type and Driving Experience
##### We analysed the types of vehicles that were most involved in accidents, then checked the driving experience of the drivers of these different types of vehicles. This allows us to find the most 'dangerous' means of transport and their most 'dangerous' types of drivers (in terms of driving experience, which is based on the date they obtained their driving licence, if they have one at all). We can observe that the largest group of accidents involves automobile drivers with between 5 and 10 years of driving experience.

In [1475]:
fig7 = px.treemap(df, path=['Type_of_vehicle', 'Driving_experience'], title='Vehicle Type and Driving Experience')
fig7.show()

## Graph 8: Weather and Road Conditions
##### We wanted to check the number of accidents based on the weather, the type of road and its condition. From our results, we've come to understand that the majority of accidents happen on dry asphalt roads in normal weather. This is largely explained by the fact that most roads are made of asphalt and that conditions in India are mostly dry. The results might have been very different for a northern country like Sweden, for example.

In [1476]:
fig8 = px.sunburst(df, path=['Weather_conditions', 'Road_surface_type', 'Road_surface_conditions'], title='Weather and Road Conditions')
fig8.show()

## Graph 9: Junction and Collision Types
##### We've tried to find a possible link between the junction type and the collision type. This would make sense, as accidents are likely to happen the same way in the same settings. We also added buttons on the left, one per collision type, to sort the number of accidents so that we can see which junction type is most associated with it. From what we've observed, Y-shape junctions seem to be the most dangerous (which is probably explained by drivers being unsure who has priority) and, unsurprisingly, most accidents are vehicle-to-vehicle collisions.

In [1477]:
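fig9 = px.bar(df, x='Types_of_Junction', color='Type_of_collision', title='Junction and Collision Types')
# Accident counts per (collision type, junction type), so that each button sorts the junctions by its own collision type
junction_counts = df.groupby(['Type_of_collision', 'Types_of_Junction']).size()
fig9.update_layout(updatemenus=[dict(type='buttons', 
                                    showactive=True, 
                                    buttons=[dict(label=collision, method='relayout',
                                                  args=[{'xaxis.categoryorder': 'array', 'xaxis.categoryarray': junction_counts.loc[collision].sort_values(ascending=False).index.tolist()}])
                                             for collision in df['Type_of_collision'].unique()])])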
fig9.show()
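
The link between junction type and collision type can also be read from a plain table. The sketch below, on hypothetical toy data, cross-tabulates the two columns and finds the junction type with the most accidents for each collision type; the values are illustrative and are not results from the dataset.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Types_of_Junction": ["Y Shape", "Y Shape", "No junction", "Crossing", "Y Shape", "No junction"],
    "Type_of_collision": ["Vehicle with vehicle collision", "Vehicle with vehicle collision",
                          "Collision with roadside objects", "Vehicle with vehicle collision",
                          "Collision with roadside objects", "Collision with roadside objects"],
})

# Rows: junction type; columns: collision type; values: number of accidents
junction_by_collision = pd.crosstab(toy["Types_of_Junction"], toy["Type_of_collision"])

# For each collision type, the junction type with the most accidents
top_junction = junction_by_collision.idxmax()
print(top_junction)

# Junction order for one collision type, most frequent first
order = (
    junction_by_collision["Vehicle with vehicle collision"]
    .sort_values(ascending=False)
    .index.tolist()
)
print(order)
```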

## Graph 10: Accidents by areas
##### For our last analysis, we decided to look at the number of accidents by area. We thought this would let us form hypotheses about the possible causes of those accidents. For example, we expected more accidents in areas close to schools or workplaces on weekdays. However, although part of our hypotheses held true, the results are hardly reliable: some accidents in school areas, for instance, happened at midnight (not because of school traffic, but only because they happened to occur near a school). We concluded that the location of an accident is hardly a reliable indicator of its cause, except when accidents occur often near a given type of area (because of the type of junctions expected in these areas).

In [1478]:
df['Time'] = pd.to_datetime(df['Time']).dt.hour

day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

df['Day_of_week'] = pd.Categorical(df['Day_of_week'], categories=day_order, ordered=True)

df = df.sort_values('Day_of_week')
df.dropna(inplace=True)


Could not infer format, so each element will be parsed individually, falling back to `dateutil`. To ensure parsing is consistent and as-expected, please specify a format.



In [1484]:
df10 = df[['Area_accident_occured', 'Time']].value_counts().reset_index()
df10.columns = ['Area_accident_occured', 'Time', 'Count']
df10 = pd.merge(df10, df, on=['Area_accident_occured', 'Time'], how='left')
df10.sort_values('Area_accident_occured', inplace=True)

fig = px.scatter(df10, x="Time", y='Count', animation_frame="Day_of_week", animation_group="Area_accident_occured",
                 size="Count", color="Area_accident_occured", hover_name="Area_accident_occured",
                 labels={"Time": "Time of Day", "Count": "Accident Count", "Area_accident_occured": "Area of Accident"})

fig.show()
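
In the cell above, `Count` is computed per area and hour over all days and then merged back onto the rows, so the value shown in a given animation frame is not specific to that day. If a per-day count is what is wanted (an assumption about the intent), it could be sketched as follows on hypothetical toy data.

```python
import pandas as pd

# Hypothetical toy data; Time is the hour of day, as produced by .dt.hour
toy = pd.DataFrame({
    "Day_of_week": ["Monday", "Monday", "Monday", "Friday", "Friday"],
    "Area_accident_occured": ["School areas", "School areas", "Office areas",
                              "School areas", "Office areas"],
    "Time": [8, 8, 17, 0, 17],
})

# One row per (day, area, hour) with the number of accidents in that exact combination
per_day_counts = (
    toy.groupby(["Day_of_week", "Area_accident_occured", "Time"])
       .size()
       .reset_index(name="Count")
)
print(per_day_counts)
```

A frame shaped like this can be passed to `px.scatter` with the same `animation_frame="Day_of_week"` argument. When `Day_of_week` is a pandas `Categorical`, as in the notebook, passing `observed=True` to `groupby` avoids rows for combinations that never occur.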




