In [27]:
import pandas as pd
import altair as alt
from altair import datum
alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [29]:
collisions = pd.read_csv("../data/preprocessed-colisions.csv")

In [4]:
collisions['CRASH_DATETIME'] = pd.to_datetime(collisions['CRASH_DATE'] + ' ' + collisions['CRASH_TIME'], format='%m/%d/%Y %H:%M')
collisions = collisions.drop(columns=['CRASH_TIME'])
collisions['DAY_WEEK'] = collisions['CRASH_DATETIME'].dt.day_name()
collisions['TYPE_DAY'] = collisions['DAY_WEEK'].apply(lambda day: 'Weekend' if day in ['Saturday', 'Sunday'] else 'Weekday')

In [5]:
collisions.to_csv("../data/preprocessed-collisions-final.csv", index=False)

In [6]:
print(f'After the preprocessing, the dataset has {len(collisions)} rows and {len(collisions.columns)} columns')

After the preprocessing, the dataset has 115740 rows and 20 columns


### At what time of the day are accidents more common?

To examine the temporal patterns of accidents throughout the day, we will employ a line chart. The x-axis will represent hours, with the y-axis indicating the corresponding number of accidents. Opting for a line chart enables a clear depiction of how accident frequencies evolve over time. We will differentiate the data by year, using distinct colors for 2018 and 2020, providing a comparative analysis.

In [7]:
c31 = alt.Chart(collisions).mark_line(strokeWidth=2, point=True).encode(
    alt.X('hours(CRASH_DATETIME):O').title('Time of Day'),
    alt.Y('count():Q').title('Number of Collisions'),
    color= alt.Color('year(CRASH_DATETIME):O', scale = alt.Scale(domain=[2018, 2020], range=['steelblue', '#ff7f0e']))
)

# c31

The first version shows the total number of collisions per hour. This encoding makes it hard to grasp how many accidents occur in a given hour on a typical day. To address this, we refine the visualization by encoding the average number of accidents per day for each hour, accompanied by an error bar indicating the standard deviation so that we can assess the variance of the data.

In [8]:
c32 = alt.Chart(collisions).mark_line(strokeWidth=2, point=True).encode(
    x = alt.X('hours:Q').title('Time of day'),
    y = alt.Y('avg:Q').title('Average number of collisions'),
    color = alt.Color('year:O', scale = alt.Scale(domain=[2018, 2020], range=['steelblue', '#ff7f0e']))
).transform_calculate(
  year = 'year(datum.CRASH_DATETIME)',
  hours = 'hours(datum.CRASH_DATETIME)'
).transform_aggregate(
   count='count()',
   groupby=['year', 'hours', 'CRASH_DATE']
).transform_aggregate(
    avg = 'mean(count)',
    groupby=['year', 'hours']
)

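# extent='stdev' draws one standard deviation around the mean; the default extent is the standard error
c33 = alt.Chart(collisions).mark_errorbar(extent='stdev', ticks=True).encode(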
    x=alt.X('hours:Q'),
    y=alt.Y('count:Q',axis=alt.Axis(title=None)).scale(zero=False),
    color = alt.Color('year:O', scale = alt.Scale(domain=[2018, 2020], range=['steelblue', '#ff7f0e']))
).transform_calculate(
  year = 'year(datum.CRASH_DATETIME)',
  hours = 'hours(datum.CRASH_DATETIME)'
).transform_aggregate(
   count='count()',
   groupby=['year', 'hours', 'CRASH_DATE']
)

# (c32 + c33).properties(width=600, height=400)
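
The two-stage aggregation used for `c32` (count per day and hour, then average across days) can be cross-checked outside Altair. The following is a minimal pandas sketch on a made-up toy frame; the timestamps and the helper column `date` are illustrative assumptions, not part of the original dataset.

```python
import pandas as pd

# Toy stand-in for `collisions`; the timestamps are invented for illustration.
toy = pd.DataFrame({
    "CRASH_DATETIME": pd.to_datetime([
        "2018-06-01 08:15", "2018-06-01 08:40", "2018-06-01 16:05",
        "2018-06-02 08:30", "2018-06-02 16:10", "2018-06-02 16:20",
        "2018-06-02 16:55",
    ])
})
toy["year"] = toy["CRASH_DATETIME"].dt.year
toy["hours"] = toy["CRASH_DATETIME"].dt.hour
toy["date"] = toy["CRASH_DATETIME"].dt.date

# Stage 1: number of collisions per (year, hour, day).
per_day = toy.groupby(["year", "hours", "date"]).size().reset_index(name="count")

# Stage 2: mean and standard deviation across days for each (year, hour).
hourly_stats = (
    per_day.groupby(["year", "hours"])["count"]
    .agg(avg="mean", std="std")
    .reset_index()
)
print(hourly_stats)
```

Note that, in this sketch as in the chart, a day with no collisions in a given hour produces no row at all, so the average is taken only over days with at least one collision in that hour.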

Upon analyzing the hourly collisions, a clear trend emerges: higher collision rates during the day and lower rates during the night. This pattern aligns with the increased presence of cars on the road during daylight hours and decreased activity during nighttime. Furthermore, we can distinguish different patterns between the morning, afternoon, and evening periods. Mornings exhibit fewer collisions, likely because most people are at work, whereas afternoons register more incidents, potentially linked to leisure activities and transporting children to extracurricular activities. Evenings witness a decline in collisions as people conclude their activities and return home.

We can further enhance our chart by introducing an additional variable to glean more insights. One factor of high importance is the number of people killed. It is crucial not only to identify peak collision times throughout the day but also to understand the human cost associated with these incidents. This variable will be encoded through the line thickness, with thicker lines indicating a higher number of deaths.

In [9]:
c34 = alt.Chart(collisions).mark_trail().encode(
    x = alt.X('hours:Q').title('Time of day'),
    y = alt.Y('avg_collisions:Q').title('Average number of collisions'),
    color = alt.Color('year:O', scale = alt.Scale(domain=[2018, 2020], range=['steelblue', '#ff7f0e'])).title('Year'),
    size = alt.Size('avg_killed:Q').title('Average killed')
).transform_calculate(
  year = 'year(datum.CRASH_DATETIME)',
  hours = 'hours(datum.CRASH_DATETIME)'
).transform_aggregate(
   count_collisions='count()',
   count_killed='sum(TOTAL_KILLED)',
   groupby=['year', 'hours', 'CRASH_DATE']
).transform_aggregate(
    avg_collisions='mean(count_collisions)',
    avg_killed='mean(count_killed)',
    groupby=['year', 'hours']
)

# (c34 + c33).properties(width=600, height=400).properties(title='Average collisions and killings over time')

We have now reached the final version of the graph. This visualization facilitates the identification of peak accident times. While the highest collision frequency occurs around 16:00, more severe outcomes, particularly deaths, are notable at 20:00 and 04:00 in 2018, and between 19:00 and 00:00 as well as at 04:00 in 2020. The late-night deaths coincide with the times when people are returning home after socializing, often under the influence of alcohol, which makes the accidents more dangerous.

*At what time of the day are accidents more common?*

In chart C3 you can see a line chart with the average number of accidents per hour for each year. We use a different color for each year and the line thickness to encode the number of people killed. Accidents are most common during the afternoon, peaking at 16:00, while deaths are most common during the evening and late night.

### Is there a correlation between weather conditions and accidents?

Before starting to create visualizations it is necessary to choose the attributes of the 'weather.csv' dataset. Furthermore, since we have one row per day in the weather dataset, we need to group the number of collisions per day in order to merge the two datasets appropriately. 

In [10]:
weather_original = pd.read_csv("../data/weather.csv")
# Take an explicit copy so the datetime conversion below does not trigger SettingWithCopyWarning
weather = weather_original[['datetime', 'temp', 'precip', 'windspeed', 'humidity', 'cloudcover', 'conditions', 'visibility']].copy()
weather['datetime'] = pd.to_datetime(weather['datetime'])
weather.head()


Unnamed: 0,datetime,temp,precip,windspeed,humidity,cloudcover,conditions,visibility
0,2018-06-01,21.6,0.282,12.6,86.8,65.9,"Rain, Partially cloudy",11.3
1,2018-06-02,25.1,0.346,22.3,74.0,35.4,"Rain, Partially cloudy",15.8
2,2018-06-03,17.0,2.929,24.1,75.0,92.7,"Rain, Overcast",15.6
3,2018-06-04,16.8,3.91978,16.7,76.6,71.6,"Rain, Partially cloudy",15.4
4,2018-06-05,19.8,0.0,25.9,60.7,35.7,Partially cloudy,16.0


In [32]:
coll_weather = pd.DataFrame({'datetime': collisions["CRASH_DATE"]})
coll_weather['datetime'] = pd.to_datetime(coll_weather['datetime'])
coll_weather = coll_weather.groupby(['datetime']).size().reset_index(name='collisions')
coll_weather = pd.merge(coll_weather, weather, on='datetime')
coll_weather['year'] = coll_weather['datetime'].dt.year
coll_weather.head()

Unnamed: 0,datetime,collisions,temp,precip,windspeed,humidity,cloudcover,conditions,visibility,year
0,2018-06-01,751,21.6,0.282,12.6,86.8,65.9,"Rain, Partially cloudy",11.3,2018
1,2018-06-02,622,25.1,0.346,22.3,74.0,35.4,"Rain, Partially cloudy",15.8,2018
2,2018-06-03,525,17.0,2.929,24.1,75.0,92.7,"Rain, Overcast",15.6,2018
3,2018-06-04,698,16.8,3.91978,16.7,76.6,71.6,"Rain, Partially cloudy",15.4,2018
4,2018-06-05,688,19.8,0.0,25.9,60.7,35.7,Partially cloudy,16.0,2018
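
Because the merge above is an inner join, any day that has collisions but no weather record is dropped silently. A minimal sketch of how this could be checked, assuming toy frames built from the first rows shown above (with the third weather row removed on purpose to simulate a gap):

```python
import pandas as pd

# Daily collision counts copied from the first rows of the merged table.
daily = pd.DataFrame({
    "datetime": pd.to_datetime(["2018-06-01", "2018-06-02", "2018-06-03"]),
    "collisions": [751, 622, 525],
})

# Hypothetical weather table with 2018-06-03 deliberately missing.
weather_toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2018-06-01", "2018-06-02"]),
    "temp": [21.6, 25.1],
})

# A left join with indicator=True flags rows that found no weather match.
checked = pd.merge(daily, weather_toy, on="datetime", how="left", indicator=True)
missing_days = checked.loc[checked["_merge"] == "left_only", "datetime"]
print(missing_days.dt.strftime("%Y-%m-%d").tolist())
```

An empty list on the real data would confirm that every collision day has a matching weather row.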


We also create two more datasets, one for each year, because some of the visualizations use them separately.

In [33]:
# Relabel days recorded as plain 'Overcast' as 'Rain, Overcast' so both fall into a single category
coll_weather['conditions'] = coll_weather['conditions'].apply(lambda x: 'Rain, Overcast' if x=='Overcast' else x)

# Divide coll_weather into two parts: 2018 and 2020
coll_weather_2018 = coll_weather[coll_weather['year']==2018]
coll_weather_2020 = coll_weather[coll_weather['year']==2020]

Now it is time to create graphs to see whether there is any correlation between weather conditions and accidents. We will first try a parallel coordinates chart with the variables that we consider most likely to be related to the number of collisions. We will also differentiate the data by year, using distinct colors for 2018 and 2020, to allow a comparative analysis.

In [34]:
custom_sort_order = [ 'collisions', 'visibility', 'windspeed', 'temp', 'humidity', 'cloudcover']

alt.Chart(coll_weather, width=500).transform_window(
    index='count()'
).transform_fold(
    ['temp', 'precip', 'windspeed', 'humidity', 'cloudcover', 'visibility', 'collisions']
).mark_line().encode(
    x=alt.X('key:N', sort=custom_sort_order),
    y='value:Q',
    color='year:N',
    detail='index:N',
    opacity=alt.value(0.5)
)

The first problem we see is that each variable has a different range of values, so we will normalize the data to be able to compare them.

In [35]:
alt.Chart(coll_weather).transform_window(
    index='count()'
).transform_fold(
    ['temp', 'windspeed', 'collisions', 'humidity', 'cloudcover', 'visibility']
).transform_joinaggregate(
     min='min(value)',
     max='max(value)',
     groupby=['key']
).transform_calculate(
    minmax_value=(datum.value-datum.min)/(datum.max-datum.min),
    mid=(datum.min+datum.max)/2
).mark_line().encode(
    x=alt.X('key:N', sort=custom_sort_order),
    y='minmax_value:Q',
    color='year:N',
    detail='index:N',
    opacity=alt.value(0.5)
).properties(width=500)
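
The min-max normalization computed inside the chart with `transform_joinaggregate` and `transform_calculate` maps every variable to the range [0, 1]. A minimal pandas sketch of the same formula, using the first five rows shown earlier as sample values (the helper name `min_max` is an assumption introduced here):

```python
import pandas as pd

# Sample values taken from the first rows of the merged table.
toy = pd.DataFrame({
    "temp": [21.6, 25.1, 17.0, 16.8, 19.8],
    "collisions": [751, 622, 525, 698, 688],
})


def min_max(frame: pd.DataFrame) -> pd.DataFrame:
    """Rescale every column to [0, 1] using its own minimum and maximum."""
    return (frame - frame.min()) / (frame.max() - frame.min())


scaled = min_max(toy)
print(scaled.round(3))
```

A column whose values are all identical would divide by zero and yield NaN, so such columns need separate handling.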

Despite normalizing the data, the visualization is not clear enough to show whether there is any correlation between the variables and the number of collisions (we tried different orderings of the variables, but no pattern emerged). A possible reason is that the large difference in the number of collisions between 2018 and 2020 hides any trend. Our next graph will therefore be a juxtaposition of two parallel coordinates charts, one for each year.

In [36]:
# Chart for coll_weather_2018
chart_2018 = alt.Chart(coll_weather_2018).transform_window(
    index='count()'
).transform_fold(
    ['temp', 'windspeed', 'collisions', 'humidity', 'cloudcover', 'visibility']
).transform_joinaggregate(
     min='min(value)',
     max='max(value)',
     groupby=['key']
).transform_calculate(
    minmax_value=(datum.value-datum.min)/(datum.max-datum.min),
    mid=(datum.min+datum.max)/2
).mark_line().encode(
    x=alt.X('key:N', sort=custom_sort_order),
    y='minmax_value:Q',
    color=alt.value('steelblue'),
    detail='index:N',
    opacity=alt.value(0.5)
).properties(width=500, title='2018')

# Chart for coll_weather_2020
chart_2020 = alt.Chart(coll_weather_2020).transform_window(
    index='count()'
).transform_fold(
    ['temp', 'windspeed', 'collisions', 'humidity', 'cloudcover', 'visibility']
).transform_joinaggregate(
     min='min(value)',
     max='max(value)',
     groupby=['key']
).transform_calculate(
    minmax_value=(datum.value-datum.min)/(datum.max-datum.min),
    mid=(datum.min+datum.max)/2
).mark_line().encode(
    x=alt.X('key:N', sort=custom_sort_order),
    y='minmax_value:Q',
    color=alt.value('#ff7f0e'),
    detail='index:N',
    opacity=alt.value(0.5)
).properties(width=500, title='2020')

combined_chart = alt.hconcat(chart_2018, chart_2020)
combined_chart

Here we may see some correlation between low visibility and a high number of collisions. However, since there are very few days with low visibility compared to those with high visibility, we cannot conclude anything.
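
A visual impression like this one can be complemented with a numeric check. The sketch below is not part of the original analysis: it computes the Spearman rank correlation between each weather variable and the daily collision count with `DataFrame.corr`, using only the five sample rows shown earlier, which are far too few to support any conclusion.

```python
import pandas as pd

# First five rows of the merged table; a real check would use the full frame.
sample = pd.DataFrame({
    "collisions": [751, 622, 525, 698, 688],
    "temp": [21.6, 25.1, 17.0, 16.8, 19.8],
    "windspeed": [12.6, 22.3, 24.1, 16.7, 25.9],
    "humidity": [86.8, 74.0, 75.0, 76.6, 60.7],
    "cloudcover": [65.9, 35.4, 92.7, 71.6, 35.7],
    "visibility": [11.3, 15.8, 15.6, 15.4, 16.0],
})

# Rank correlation of every weather variable with the collision count.
corr_with_collisions = (
    sample.corr(method="spearman")["collisions"]
    .drop("collisions")
    .sort_values()
)
print(corr_with_collisions)
```

On the full data it would make sense to run this separately for `coll_weather_2018` and `coll_weather_2020`, since the difference between the two years could otherwise dominate the result.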

So we will approach the problem with a different type of visualization. We will use small multiples of heatmaps to see whether there is any correlation between pairs of weather variables and the number of accidents.

In [37]:
alt.Chart(coll_weather).mark_rect().encode(
    alt.X(alt.repeat("column"), type='ordinal', bin=True),
    alt.Y(alt.repeat("row"), type='ordinal', bin=True),
    color='average(collisions):Q'
).properties(
    width=150,
    height=150
).repeat(
    row=['temp', 'windspeed', 'humidity', 'cloudcover', 'visibility'],
    column=['temp', 'windspeed', 'humidity', 'cloudcover', 'visibility']
).interactive()

With this visualization we still do not see a significant correlation between weather conditions and accidents. We also find two more problems. First, many of the heatmaps contain too many white cells, because we have no data for those combinations of values. Second, each heatmap only lets us compare two variables with the number of accidents at a time, so many heatmaps are needed to compare all the variables.

Since heatmaps do not give us significant results, we will change the type of chart again. We will try a scatter plot to see whether any correlation between the variables and the number of collisions becomes visible. Apart from the two axes, we will use the size, color and shape of the points to encode more variables. Since color and shape are better suited to categorical variables, we will encode the weather conditions and the year in those, respectively. Furthermore, we will encode the number of collisions on the y-axis because it is the main variable we want to compare with the others. For the x-axis and the size of the points, we will try multiple combinations of variables to see whether any correlation appears.

In [38]:
alt.Chart(coll_weather).mark_point(opacity = 0.5, filled = True).encode(
    alt.X('temp:Q', title='Average Daily Temperature (C)', scale=alt.Scale(domain=[15, 31])),
    #alt.X('windspeed:Q', title='Average Daily Windspeed (km/h)', scale=alt.Scale(domain=[8, 45])),
    #alt.X('humidity:Q', title='Average Daily Humidity (%)', scale=alt.Scale(domain=[40, 95])),
    alt.Size('visibility:Q', title='Average Daily Visibility (km)', scale=alt.Scale(domain=[11, 16])),
    #alt.Size('precip:Q', title='Average Daily Precipitation (mm)', scale=alt.Scale(domain=[0, 50])),
    alt.Color('conditions', title='Weather Conditions'),
    alt.Y('collisions', title='Number of Collisions', scale=alt.Scale(domain=[150, 900])),
    alt.Shape('year:N', title='Year')
).properties(
    width=600,
    height=400
)

None of the combinations gives us enough evidence to conclude that there is any correlation between the variables and the number of collisions. However, since the number of collisions is much lower in 2020 than in 2018, we will juxtapose the scatter plots of both years to see whether we gain any insights.

In [39]:
# Chart for 2018
chart_2018 = alt.Chart(coll_weather_2018).mark_point(opacity=0.5, filled=True).encode(
    alt.X('temp:Q', title='Average Daily Temperature (C)', scale=alt.Scale(domain=[15, 31])),
    alt.Size('visibility:Q', title='Average Daily Visibility (km)', scale=alt.Scale(domain=[11, 16])),
    alt.Color('conditions', title='Weather Conditions', scale=alt.Scale(scheme='set2')),
    alt.Y('collisions', title='Number of Collisions', scale=alt.Scale(domain=[350, 900])),
).properties(
    title='Collisions and Weather Conditions in 2018',
    width=600,
    height=400
)

# Chart for 2020
chart_2020 = alt.Chart(coll_weather_2020).mark_point(opacity=0.5, filled=True).encode(
    alt.X('temp:Q', title='Average Daily Temperature (C)', scale=alt.Scale(domain=[15, 31])),
    alt.Size('visibility:Q', title='Average Daily Visibility (km)', scale=alt.Scale(domain=[11, 16])),
    alt.Color('conditions', title='Weather Conditions', scale=alt.Scale(scheme='set2')),
    alt.Y('collisions', title='Number of Collisions', scale=alt.Scale(domain=[150, 500])),
).properties(
    title='Collisions and Weather Conditions in 2020',
    width=600,
    height=400
)

# Display the charts side by side
chart_2018 | chart_2020

Although we do not gain enough insights, there might be a correlation between the number of collisions and the weather conditions. However, this scatter plot is not the best option for comparing these two, so we will try a bar chart instead.

In [40]:
alt.Chart(coll_weather).mark_bar().encode(
    y=alt.Y('conditions:N', sort='-x', title='Weather Conditions'),
    x=alt.X('sum(collisions):Q', axis=alt.Axis(title='Number of Collisions'))
)

With the total number of collisions by weather condition, we see a pattern that we did not see in the previous graphs. However, since we are summing the collisions for each weather condition and the number of days differs between conditions, we cannot conclude anything. To improve this visualization, we will compute the average number of collisions per day for each weather condition.

In [41]:
alt.Chart(coll_weather).mark_bar().encode(
    y=alt.Y('conditions:N', sort='-x', title='Weather Conditions'),
    x=alt.X('average_collisions_condition:Q', axis=alt.Axis(title='Average Number of Collisions per Day')),
).transform_aggregate(
    total_days_condition='count()',
    total_collisions_condition='sum(collisions)',
    groupby=['conditions']
).transform_calculate(
    average_collisions_condition='datum.total_collisions_condition / datum.total_days_condition'
)
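
The per-condition average computed by the chart (total collisions divided by the number of days with that condition) can be reproduced in pandas as a sanity check. A minimal sketch using the five sample rows shown earlier; the column names in `per_condition` are chosen here for illustration:

```python
import pandas as pd

# Sample rows copied from the merged table shown earlier.
sample = pd.DataFrame({
    "conditions": [
        "Rain, Partially cloudy", "Rain, Partially cloudy", "Rain, Overcast",
        "Rain, Partially cloudy", "Partially cloudy",
    ],
    "collisions": [751, 622, 525, 698, 688],
})

# Days, total collisions and average collisions per day for each condition.
per_condition = (
    sample.groupby("conditions")["collisions"]
    .agg(total_days="count", total_collisions="sum", average_collisions="mean")
    .sort_values("average_collisions", ascending=False)
)
print(per_condition)
```

Keeping `total_days` next to the average makes it easy to spot conditions whose average rests on very few days.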


Now we are starting to get interesting results. The average number of collisions is higher with worse weather conditions and lower with better ones. However, we still cannot conclude anything, because bar charts of averages hide the variance and the distribution of the data. To see the distribution, we will use a violin plot.

In [42]:
alt.Chart(coll_weather, width=100).transform_density(
    'collisions',
    as_=['collisions', 'density'],
    extent=[0, 1200],
    groupby=['conditions']
).mark_area(orient='horizontal').encode(
    alt.X('density:Q')
        .stack('center')
        .impute(None)
        .title(None)
        .axis(labels=False, values=[0], grid=False, ticks=True),
    alt.Y('collisions:Q'),
    alt.Color('conditions:N', scale=alt.Scale(scheme='set2')),
    alt.Column('conditions:N')
        .spacing(0)
        .header(titleOrient='bottom', labelOrient='bottom', labelPadding=0)
).configure_view(
    stroke=None
)

With this visualization we still see that the worse the weather conditions, the more collisions there are. Despite that, we still cannot see the median and quartiles of the data, so we will overlay a box plot on the violin plot.

In [43]:
violin_right = (
    alt.Chart(coll_weather, width=100)
    .transform_density(
        "collisions",
        as_=["collisions", "density"],
        extent=[0, 1200],
        groupby=["conditions"]
    )
    .mark_area(orient="horizontal")
    .encode(
        alt.X("density:Q")
            .impute(None)
            .title(None)
            .axis(labels=False, grid=False, ticks=True),
        alt.Y("collisions:Q"),
        alt.Color("conditions:N", scale=alt.Scale(scheme='set2'))
    )
)

violin_left = (
    violin_right
    .copy()
    .transform_calculate(density="-datum.density")
)

boxplot = (
    alt.Chart(coll_weather, width=100)
    .mark_boxplot(outliers=False, size=10, extent=20)
    .encode(y="collisions:Q", color=alt.value("black"))
)

chart = (
    alt.layer(violin_left, violin_right, boxplot)
    .facet(alt.Column("conditions:N"))
    .configure_view(stroke=None)
)
chart

The graph is getting better. However, in the violin for each weather condition we see two bulks of data, which correspond to the two years. We have seen in previous graphs that there is a significant difference between the number of collisions in 2018 and 2020, so we will separate the data by year to see each distribution separately.

In [44]:
boxplot = alt.Chart().mark_boxplot(color='black').encode(
    alt.Y('collisions:Q')
).properties(width=100)

violin = alt.Chart().transform_density(
    'collisions',
    as_=['collisions', 'density'],
    extent=[0, 1000],
    groupby=['conditions']
).mark_area(orient='horizontal').encode(
    y='collisions:Q',
    color=alt.Color('conditions:N', legend=None, scale=alt.Scale(scheme='set2')),
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        scale=alt.Scale(nice=False, zero=False),
        axis=alt.Axis(labels=False, values=[0], grid=False, ticks=True),
    ),
).properties(
    width=100,
    height=400
)

def facet(data, title):
    """Layer the violin and box plot for one year and facet them by weather condition."""
    return (
        alt.layer(violin, boxplot, data=data)
        .facet(column='conditions:N')
        .resolve_scale(x=alt.ResolveMode("independent"))
        .properties(title=alt.TitleParams(text=title, anchor="middle", align="center"))
    )

alt.hconcat(facet(coll_weather_2018, "Summer 2018"), facet(coll_weather_2020, "Summer 2020")).configure_facet(
    spacing=0,
).configure_header(
    titleOrient='bottom',
    labelOrient='bottom'
).configure_view(
    stroke=None
).properties(
    title='Collisions and Weather Conditions in 2018 and 2020',
)

We have now reached the final version of the graph. This visualization facilitates the identification of the distribution of the number of collisions by weather condition. In 2018, the median number of collisions is higher in rainy conditions, while there is no significant difference among the other weather conditions. In 2020 we do not see enough evidence to conclude that any type of weather condition is more likely to cause an accident. This probably happens because there was a lockdown in 2020 and people were not driving as much as in 2018, so there were fewer cars on the road and the weather conditions had less effect than in 2018. Furthermore, the distribution of the data is wider in 2020 than in 2018, which means that there is more variance in the number of collisions.
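
The medians and quartiles read off the box plots can also be tabulated directly. A minimal sketch with invented numbers; the condition labels `Clear` and `Rain` and all counts are placeholders, not values from the dataset:

```python
import pandas as pd

# Made-up daily counts: six days per year, three per (hypothetical) condition.
toy = pd.DataFrame({
    "year": [2018] * 6 + [2020] * 6,
    "conditions": (["Clear"] * 3 + ["Rain"] * 3) * 2,
    "collisions": [600, 650, 700, 700, 750, 800,
                   250, 300, 350, 260, 300, 340],
})

# Quartiles of the daily collision count per year and condition.
quartiles = (
    toy.groupby(["year", "conditions"])["collisions"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
quartiles.columns = ["q1", "median", "q3"]
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

Applied to `coll_weather`, a table like this would put numbers on the differences between conditions within each year. The quartiles may differ slightly from those drawn in the chart, depending on how each tool interpolates quantiles.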

We ended up choosing this visualization, even though it omits the quantitative weather variables such as temperature or precipitation, because the weather conditions summarize all these variables in a single categorical one.

*Is there a correlation between weather conditions and accidents?*

In chart C5 you see a violin plot with the distribution of the number of accidents per day with each of the weather conditions and year. We use a different color for each year to compare them. By looking at the median of the boxplot, you can see that in 2018 this median (and the overall distribution of collisions) is higher with rain conditions but with the other weather conditions there is not a significant difference between the number of collisions. In 2020 we do not see enough evidence to conclude that a type of weather condition is more likely to cause an accident.

In [45]:
# bonus chart

gaussian_jitter = alt.Chart(coll_weather, title='Normally distributed jitter').mark_circle(size=20).encode(
    y="conditions:N",
    x="collisions:Q",
    yOffset="jitter:Q",
    color=alt.Color('conditions:N').legend(None),
    shape=alt.Shape('year:N')
).transform_calculate(
    # Generate Gaussian jitter with a Box-Muller transform
    jitter="sqrt(-2*log(random()))*cos(2*PI*random())"
).properties(
    width=300, height=200
)

uniform_jitter = gaussian_jitter.transform_calculate(
    # Generate uniform jitter
    jitter='random()'
).encode(
    alt.Y('conditions:N').axis(None)
).properties(
    title='Uniformly distributed jitter',
    width=300, height=200,
)

(gaussian_jitter | uniform_jitter).resolve_scale(yOffset='independent')
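
The Gaussian jitter expression above relies on the Box-Muller transform, which turns two independent uniform samples into one standard normal sample. A minimal NumPy sketch of the same formula, assuming a fixed seed and a sample size chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 100_000

# Two independent uniform samples; u1 is shifted to (0, 1] to avoid log(0).
u1 = 1.0 - rng.random(n)
u2 = rng.random(n)

# Box-Muller: sqrt(-2 ln u1) * cos(2 pi u2) follows a standard normal distribution.
z = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

print(round(float(z.mean()), 2), round(float(z.std()), 2))
```

The sample mean and standard deviation should come out close to 0 and 1, which is why the Gaussian jitter clusters points around the center of each band while the uniform jitter spreads them evenly.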