Information Visualization Project - Part 2
===
*Universitat Politècnica de Catalunya*

Gabriela Malec
---

---

The second project is a continuation of my work with data about collisions in New York City. This time I have prepared a dataset covering the summer months, i.e., June, July, August, and September in 2018. All interactive visualizations were made using Altair and presented using Streamlit. <br>
The aim of this work was to answer the following questions:

* Which weather condition and type of vehicle were present in the majority of accidents each month? And in the combination of all the months?
* In which area and at what hour did the majority of accidents each month happen? And in the combination of all the months?
* Which area presented the majority of taxi accidents during rainy days in June on Mondays at noon (12 p.m.)?
* Which day had more accidents during clear days in July in Manhattan?

Small Preprocessing
===

The initial phase of the project involved appropriate data preparation. I loaded two datasets, one containing collision data and the other weather data. Subsequently, I merged them based on date columns into a single dataset to ensure the consistency of my analysis. 

In [None]:
import altair as alt
import pandas as pd
import geopandas as gpd
import math

In [None]:
crash = pd.read_csv('data-collisions-vehicles-proj2.csv')
alt.data_transformers.disable_max_rows()

In [None]:
crash.head()

In [None]:
crash['CRASH DATE'] = pd.to_datetime(crash['CRASH DATE'], format='%Y-%m-%dT%H:%M:%SZ')
crash['Month'] = crash['CRASH DATE'].dt.strftime('%B')
crash['Weekday'] = crash['CRASH DATE'].dt.day_name()

In [None]:
weather = pd.read_csv('weather2018.csv')
selected_columns = ['datetime', 'icon']
# Copy the slice so the later column assignment does not trigger SettingWithCopyWarning
weather_subset = weather[selected_columns].copy()
weather_subset.head()

In [None]:
weather_subset['datetime'] = pd.to_datetime(weather_subset['datetime'], format='%Y-%m-%d')
crash_w = pd.merge(crash, weather_subset, left_on='CRASH DATE', right_on='datetime', how='left')
crash_w = crash_w.drop(columns='datetime')
crash_w.rename(columns={'icon': 'weather'}, inplace=True)
# crash_w.info()
crash_w.head()
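
A left merge on dates silently leaves the weather column empty whenever the two date columns do not line up exactly. The following is a minimal sketch of how the match could be checked, using small made-up tables (`crash_demo`, `weather_demo`) rather than the project files; the weather labels are assumed values.

```python
import pandas as pd

# Toy stand-ins for the collision and weather tables (hypothetical values).
crash_demo = pd.DataFrame({
    'CRASH DATE': pd.to_datetime(['2018-06-01', '2018-06-01', '2018-06-02', '2018-06-03']),
    'BOROUGH': ['Manhattan', 'Queens', 'Bronx', 'Brooklyn'],
})
weather_demo = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-06-01', '2018-06-02']),
    'icon': ['rain', 'clear-day'],
})

# indicator=True adds a '_merge' column recording where each row was matched.
merged_demo = pd.merge(
    crash_demo, weather_demo,
    left_on='CRASH DATE', right_on='datetime',
    how='left', indicator=True,
)

# Rows flagged 'left_only' are collisions with no weather record for that date.
unmatched = int((merged_demo['_merge'] == 'left_only').sum())
print(f'{unmatched} of {len(merged_demo)} collisions have no weather match')
```

If the unmatched count is unexpectedly high on real data, one likely cause is a time-of-day component in one of the two date columns.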

Design and implementation
====
I started the design process by preparing a plan and considering possible solutions to answer each of the questions. Then, I created interactive charts to provide answers to each of them in a visually engaging way.

---
* Question 1 <br>
Which weather condition and type of vehicle were present in the majority of accidents each month? And in the combination of all the months? <br>

To answer this question, I created an interactive rectangular heatmap chart to visualize the relationship between weather conditions, vehicle types, and the count of accidents. I implemented a menu with radio buttons to enable dynamic updates of the chart. It allows displaying the count of accidents based on weather conditions and vehicle types depending on the selected month. A darker color signifies a higher incidence of accidents under specific conditions.

In [None]:
options = ['June', 'July', 'August', 'September']
labels = [option + ' ' for option in options]

input_dropdown = alt.binding_radio(
    # Add the empty selection which shows all when clicked
    options=options + [None],
    labels=labels + ['All'],
    name='Month: '
)
selection = alt.selection_point(
    fields=['Month'],
    bind=input_dropdown,
)


alt.Chart(crash_w).mark_rect().encode(
    x='weather:O',
    y='VEHICLE TYPE CODE 1:O',
    color=alt.Color('count()', scale=alt.Scale(scheme='tealblues', type='log')),
    tooltip=['weather:O', 'VEHICLE TYPE CODE 1:O', 'count()']  
).properties(
    width=400,
    height=400
).add_params(
    selection
).transform_filter(
    selection
)

In [None]:
crash_w['weather'].value_counts()

In [None]:
crash_w['VEHICLE TYPE CODE 1'].value_counts()

In the analyzed dataset, we can see that the highest number of accidents took place during rainy weather, and taxis were most often involved in collisions.

---
* Question 2 <br>
In which area and at what hour did the majority of accidents each month happen? And in the combination of all the months? <br>

To prepare this visualization, I removed rows with a missing 'BOROUGH' value. I then added an 'Hour' column derived from 'CRASH TIME', so that all accidents occurring within the same clock hour are grouped together.
I chose a line plot so that users can explore how accidents are distributed over the day in each borough. A point selection bound to the 'Month' field through radio buttons filters the chart, so the relationship between the hour of the day and the number of accidents can be inspected for a single month or for all months together.

In [None]:
input_dropdown = alt.binding_radio(
    # Add the empty selection which shows all when clicked
    options=options + [None],
    labels=labels + ['All'],
    name='Month: '
)
selection = alt.selection_point(
    fields=['Month'],
    bind=input_dropdown,
)

crash_bor = crash_w.dropna(subset=['BOROUGH']).copy()
crash_bor['Hour'] = pd.to_datetime(crash_bor['CRASH TIME'], format='%H:%M', errors='coerce').dt.hour

chart = alt.Chart(crash_bor).mark_line(point=True).encode(
    x=alt.X('Hour:O', title='Hour'),
    y=alt.Y('count()', title='Number of Accidents'),
    color=alt.Color('BOROUGH:N', title='Area' , scale=alt.Scale(scheme='tableau10')),
    tooltip=['Hour', 'BOROUGH', 'count()']
).properties(
    width=800,
    height=400,
    title='Line Plot of Accidents by Hour and Borough'
).add_params(
    selection
).transform_filter(
    selection
)

chart
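
The hour extraction used for this chart can be illustrated on its own. This is a small sketch on made-up time strings (not the project data), showing how `errors='coerce'` handles malformed values and how the per-hour counts plotted above could be reproduced in pandas.

```python
import pandas as pd

# Hypothetical 'CRASH TIME' values; the last one is deliberately malformed.
times = pd.Series(['09:05', '16:40', '23:59', 'unknown'])

# errors='coerce' turns unparseable strings into NaT instead of raising.
hours = pd.to_datetime(times, format='%H:%M', errors='coerce').dt.hour

# Count collisions per hour; missing hours from bad rows are dropped by default.
per_hour = hours.value_counts().sort_index()
print(per_hour)
```

Rows whose time could not be parsed end up with a missing hour, so they do not contribute to any point on the line plot.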

---
* Question 3 <br>
Which area presented the majority of taxi accidents during rainy days in June on Mondays at noon (12 p.m.)? <br>

I prepared the data using the groupby function to organize it by the columns 'BOROUGH', 'Month', and 'weather'. I then applied the size function to count the crashes for each unique combination of these factors. Finally, I mapped each borough name to a numerical borough code, which is later used to join the counts to the map data.

In [None]:
# Group by 'BOROUGH', 'Month' and 'weather' to get total crashes for each combination
borough_monthly_crash = crash_bor.groupby(['BOROUGH', 'Month', 'weather']).size().reset_index(name='total_crashes')

borough_mapping = {
    'Staten Island': 5,
    'Queens': 4,
    'Brooklyn': 3,
    'Manhattan': 1,
    'Bronx': 2
}

borough_monthly_crash['boro_code'] = borough_monthly_crash['BOROUGH'].map(borough_mapping)
borough_monthly_crash.head()
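
As a small illustration of the two steps above, the sketch below runs the same groupby/size pattern and the name-to-code mapping on a made-up table. The borough spellings and weather labels are assumptions; `Series.map` returns NaN for any label missing from the dictionary, so checking for NaN afterwards is a cheap way to catch spelling or case mismatches.

```python
import pandas as pd

# Hypothetical collision rows, one row per crash.
demo = pd.DataFrame({
    'BOROUGH': ['Manhattan', 'Manhattan', 'Manhattan', 'Queens'],
    'Month': ['June', 'June', 'July', 'June'],
    'weather': ['rain', 'rain', 'clear-day', 'rain'],
})

# size() counts rows per group; reset_index turns the result into a flat table.
counts = (
    demo.groupby(['BOROUGH', 'Month', 'weather'])
    .size()
    .reset_index(name='total_crashes')
)

# Map names to codes; labels absent from the dictionary become NaN.
demo_mapping = {'Manhattan': 1, 'Queens': 4}
counts['boro_code'] = counts['BOROUGH'].map(demo_mapping)
unmapped = int(counts['boro_code'].isna().sum())
print(counts)
```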

My first attempt to answer this question was a choropleth map. I built a DataFrame named all_combinations that pairs every borough code with each of the four months, so that combinations without recorded collisions still appear (with a count of zero). The visualization shows a map of New York City divided into its five boroughs. Toggling the radio buttons switches between months, and the color gradient makes it easy to identify the boroughs with the highest and lowest collision counts.

In [None]:
gdf = gpd.read_file('test.geojson')
boroughs = gdf['boro_code'].unique()
months = ['June', 'July', 'August', 'September']
all_combinations = pd.DataFrame([(boro, month) for boro in boroughs for month in months], columns=['boro_code', 'Month'])

merged_combinations = all_combinations.merge(borough_monthly_crash, on=['boro_code', 'Month'], how='left')
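# Assign the result back instead of using inplace=True on a column selection
merged_combinations['total_crashes'] = merged_combinations['total_crashes'].fillna(0)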
merged_gdf = gdf.merge(merged_combinations, on='boro_code', how='left')

chart = alt.Chart(merged_gdf).mark_geoshape().encode(
    color=alt.Color('total_crashes:Q', title='Number of Collisions', scale=alt.Scale(scheme='yellowgreenblue')),
    tooltip=['boro_name:N', 'total_crashes:Q']
).project(
    type='albersUsa'
).properties(
    width=400,
    height=300,
    title="Number of collisions depending on the borough "
).add_params(
    selection
).transform_filter(
    selection
)

chart
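
The idea of completing the borough and month grid before joining can be sketched in isolation. The example below uses made-up codes and counts, and it assumes the counts have already been reduced to one row per borough and month; `grid`, `observed`, and `full` are illustrative names.

```python
from itertools import product

import pandas as pd

# Every (borough code, month) pair that should appear on the map.
boro_codes = [1, 2]
months_demo = ['June', 'July']
grid = pd.DataFrame(list(product(boro_codes, months_demo)), columns=['boro_code', 'Month'])

# Hypothetical observed counts; borough 2 has no collisions in July.
observed = pd.DataFrame({
    'boro_code': [1, 1, 2],
    'Month': ['June', 'July', 'June'],
    'total_crashes': [10, 7, 4],
})

# The left merge keeps every grid row; missing counts become NaN, then 0.
full = grid.merge(observed, on=['boro_code', 'Month'], how='left')
full['total_crashes'] = full['total_crashes'].fillna(0).astype(int)
print(full)
```

Without the zero fill, a borough with no collisions in the selected month would be drawn without a color value.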

In the next step, I decided to refine the map with further data, adding a legend in the form of a heatmap. The presentation combines both the geospatial and weather summary legend charts side by side, allowing users to comprehensively analyze collision data in relation to boroughs, months, and weather conditions.

In [None]:
selection2 = alt.selection_point(fields=['Month', 'weather'])
merged_gdf = gdf.merge(borough_monthly_crash, on='boro_code', how='left')
chart = alt.Chart(merged_gdf).mark_geoshape().encode(
    color=alt.Color('total_crashes:Q', title='Number of Collisions', scale=alt.Scale(scheme='yellowgreenblue')),
    tooltip=['boro_name:N', 'total_crashes:Q']
).project(
    type='albersUsa'
).properties(
    width=400,
    height=300,
    title="Number of collisions depending on the borough "
).add_params(
    selection2
).transform_filter(
    selection2
)

legend = alt.Chart(crash_bor).mark_rect().encode(
    alt.Y('Month:N').axis(orient='left'),
    x='weather:O',
    color=alt.Color('count()', scale=alt.Scale(scheme='tealblues', type='log')),
    tooltip=['Month', 'weather', 'count()']
).properties(
    width=300,
    height=300,
    title='Weather Summary during each Month'
).add_params(
    selection2
)

chart | legend

In [None]:
merged_gdf.head()

Afterwards, I focused on a more detailed analysis, incorporating the hours during which the accidents happened in my visualizations.

In [None]:
# Group by 'BOROUGH', 'Month', 'weather' and 'Hour' to get total crashes for each combination
# This version includes hours, so it takes longer to run
borough_monthly_crash_h = crash_bor.groupby(['BOROUGH', 'Month', 'weather', 'Hour']).size().reset_index(name='total_crashes')

borough_mapping = {
    'Staten Island': 5,
    'Queens': 4,
    'Brooklyn': 3,
    'Manhattan': 1,
    'Bronx': 2
}

borough_monthly_crash_h['boro_code'] = borough_monthly_crash_h['BOROUGH'].map(borough_mapping)
borough_monthly_crash_h

I applied the same approach as before. However, it was not effective, most likely because of the added complexity of the dataset.

To enhance the visual representation of the data, I decided to try an alternative approach to answer the question. I created a bar chart showing the count of accidents by borough and weekday, along with a legend in the form of a rectangular heatmap displaying the count of observations based on weather and month. By selecting a weather condition during a specific month, the graph is updated to show the number of accidents each day of the week for all analyzed boroughs.

In [None]:
selection2 = alt.selection_point(fields=['Month', 'weather'])

chart = alt.Chart(crash_bor).mark_bar().encode(
    x=alt.X('BOROUGH:N', title='Borough'),
    color='BOROUGH:N',
    y=alt.Y('count()', title='Number of Observations'),
    column='Weekday:N'
).properties(
    width=200,
    height=500
).add_params(
    selection2
).transform_filter(
    selection2
)

legend = alt.Chart(crash_bor).mark_rect().encode(
    alt.Y('Month:N').axis(orient='left'),
    x='weather:O',
    color=alt.Color('count()', scale=alt.Scale(scheme='oranges', type='log')),
    tooltip=['Month', 'weather', 'count()']
).properties(
    width=300,
    height=300,
    title='Weather Summary during each Month'
).add_params(
    selection2
)

chart | legend
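
Because the question combines five conditions (vehicle type, weather, month, weekday, and hour), the result read from the charts can also be cross-checked with a direct filter. The following is a minimal sketch on made-up rows; the labels 'Taxi' and 'rain' are assumed spellings and would need to match the values actually present in the dataset.

```python
import pandas as pd

# Hypothetical rows with the columns used in this report.
demo = pd.DataFrame({
    'BOROUGH': ['Manhattan', 'Manhattan', 'Queens', 'Manhattan', 'Bronx'],
    'VEHICLE TYPE CODE 1': ['Taxi', 'Taxi', 'Taxi', 'Ambulance', 'Taxi'],
    'weather': ['rain', 'rain', 'rain', 'rain', 'clear-day'],
    'Month': ['June'] * 5,
    'Weekday': ['Monday'] * 5,
    'Hour': [12, 12, 12, 12, 12],
})

# Combine all five conditions into one boolean mask.
mask = (
    (demo['VEHICLE TYPE CODE 1'] == 'Taxi')
    & (demo['weather'] == 'rain')
    & (demo['Month'] == 'June')
    & (demo['Weekday'] == 'Monday')
    & (demo['Hour'] == 12)
)

# Count the matching crashes per borough, largest first.
by_borough = demo.loc[mask, 'BOROUGH'].value_counts()
print(by_borough)
```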

---
* Question 4 <br>
Which day had more accidents during clear days in July in Manhattan? <br>

To answer this question, I prepared a visual representation of accidents by borough, month, and weather condition. The line plot shows the count of accidents on each day, with a separately colored line for each borough. The chart responds to the selection made on the heatmap, allowing users to focus on a specific weather condition and month. This made it possible to compare daily accident counts under the selected conditions.

In [None]:
selection2 = alt.selection_point(fields=['Month', 'weather'])

crash_bor = crash_w.dropna(subset=['BOROUGH']).copy()
crash_bor['DayMonth'] = crash_bor['CRASH DATE'].dt.strftime('%d-%m')

chart = alt.Chart(crash_bor).mark_line(point=True).encode(
    x=alt.X('DayMonth:O', title='Day'),
    y=alt.Y('count()', title='Number of Accidents'),
    color=alt.Color('BOROUGH:N', title='Area' , scale=alt.Scale(scheme='tableau10')),
    tooltip=['BOROUGH:N', 'count()', 'DayMonth', 'weather']
).properties(
    width=1200,
    height=400,
    title='Line Plot of Accidents by Day and Borough'
).add_params(
    selection2
).transform_filter(
    selection2
)

legend = alt.Chart(crash_bor).mark_rect().encode(
    alt.Y('Month:N').axis(orient='left'),
    x='weather:O',
    color=alt.Color('count()', scale=alt.Scale(scheme='bluegreen', type='log')),
    tooltip=['Month', 'weather', 'count()']
).properties(
    width=300,
    height=300,
    title='Weather Summary during each Month'
).add_params(
    selection2
)

chart | legend

To confirm the validity of the data presented in the chart above, I also tabulated the number of days on which each weather condition was recorded. The 'Count' column gives the number of unique crash dates for each combination of month and weather condition.

In [None]:
crash_bor.groupby(['Month', 'weather'])['CRASH DATE'].nunique().reset_index(name='Count')
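
The day with the most accidents under the conditions of this question can also be found without reading it off the chart. This is a sketch on made-up rows, assuming 'clear-day' is the label used for clear weather; it filters to Manhattan, July, and clear weather, then counts crashes per date.

```python
import pandas as pd

# Hypothetical crashes; only the first three rows satisfy all conditions.
demo = pd.DataFrame({
    'CRASH DATE': pd.to_datetime(
        ['2018-07-09', '2018-07-19', '2018-07-19', '2018-07-19', '2018-07-20']
    ),
    'BOROUGH': ['Manhattan', 'Manhattan', 'Manhattan', 'Queens', 'Manhattan'],
    'weather': ['clear-day', 'clear-day', 'clear-day', 'clear-day', 'rain'],
})

subset = demo[
    (demo['BOROUGH'] == 'Manhattan')
    & (demo['weather'] == 'clear-day')
    & (demo['CRASH DATE'].dt.month == 7)
]

# One count per calendar day, then the day with the largest count.
per_day = subset.groupby('CRASH DATE').size()
busiest = per_day.idxmax()
print(busiest.date(), int(per_day.max()))
```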

---
* Additional visualizations

Intending to take a closer look at the analyzed dataset, I opted to create supplementary charts. <br>

The chart below combines an interactive scatter plot with a corresponding bar chart to explore the frequency of collisions by hour. Users can select a range of hours on the scatter plot, and the bar chart updates to show the number of collisions per borough and month within the selected interval.

In [None]:
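# Restrict the brush to the x channel so it filters on 'Hour' only
brush = alt.selection_interval(encodings=['x'])

scatter_plot = alt.Chart(borough_monthly_crash_h).mark_point().encode(
    x='Hour:Q',
    # The table is pre-aggregated, so sum the counts rather than count rows
    y='sum(total_crashes):Q',
).add_params(
    brush
)

bar_chart = alt.Chart(borough_monthly_crash_h).mark_bar().encode(
    x='sum(total_crashes):Q',
    y='BOROUGH:N',
    color=alt.Color('Month:N', scale=alt.Scale(scheme='tableau10')),
    tooltip=['sum(total_crashes):Q']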
).transform_filter(
    brush
).properties(
    height=150,
    width=500,
    title='Frequency of collisions'
)
bar_chart & scatter_plot 

Within the examined dataset, I decided to also focus on additional information regarding the number of individuals injured in accidents. My exploration centered on identifying patterns based on the day of the week. I created a bar chart to visualize the total number of injured persons in collisions across days, differentiating by borough and allowing interactive selection of vehicle types.

In [None]:
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

borough_dropdown = alt.binding_select(options=crash_w['VEHICLE TYPE CODE 1'].unique().tolist(), name='Select vehicle type')
borough_selection = alt.selection_point(fields=['VEHICLE TYPE CODE 1'], bind=borough_dropdown)

filtered_crash_w = crash_w.dropna(subset=['BOROUGH', 'Weekday', 'NUMBER OF PERSONS INJURED'])

bar_chart = alt.Chart(filtered_crash_w).mark_bar().encode(
    x=alt.X('Weekday:O', sort=weekday_order),
    y='sum(NUMBER OF PERSONS INJURED):Q',
    color=alt.Color('BOROUGH:N', scale=alt.Scale(scheme='tableau10')),
    tooltip=['sum(NUMBER OF PERSONS INJURED):Q']
).add_params(
    borough_selection
).transform_filter(
    borough_selection
).properties(
    width=600,
    height=400,
    title='Total injured persons'
)

bar_chart
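
The weekend-versus-weekday comparison that this chart supports can also be computed numerically. The sketch below uses made-up injury counts and divides each total by the number of days in its group, since a raw total over five weekdays is not directly comparable with a total over two weekend days; `day_type` is an illustrative helper column.

```python
import pandas as pd

# Hypothetical crashes with injury counts.
demo = pd.DataFrame({
    'Weekday': ['Monday', 'Saturday', 'Saturday', 'Sunday', 'Friday'],
    'NUMBER OF PERSONS INJURED': [1, 2, 0, 3, 1],
})

# Label each crash as weekend or weekday.
is_weekend = demo['Weekday'].isin(['Saturday', 'Sunday'])
demo['day_type'] = is_weekend.map({True: 'weekend', False: 'weekday'})

# Total injuries per group, then the average per day of the week in that group.
totals = demo.groupby('day_type')['NUMBER OF PERSONS INJURED'].sum()
per_day = pd.Series({
    'weekday': totals['weekday'] / 5,
    'weekend': totals['weekend'] / 2,
})
print(per_day)
```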

Subsequently, I opted for a similar approach to analyze the data related to factors causing accidents. Given the substantial variety of contributing factors, I used a dropdown menu for ease of navigation and selection. The visualization presents insights into the relationships between contributing factors, boroughs, and weather conditions through an interactive bar chart.

In [None]:
category_dropdown = alt.binding_select(
    # Drop missing values so the dropdown does not contain an empty entry
    options=crash_w['CONTRIBUTING FACTOR VEHICLE 1'].dropna().unique().tolist(),
    name='Select contributing factor'
)
category_param = alt.param(
    value='Driver Inattention/distraction',
    bind=category_dropdown
)
filtered_crash_w = crash_w.dropna(subset=['CONTRIBUTING FACTOR VEHICLE 1', 'BOROUGH', 'weather'])
category_chart = alt.Chart(filtered_crash_w).transform_filter(
    alt.FieldEqualPredicate(field='CONTRIBUTING FACTOR VEHICLE 1', equal=category_param)
).mark_bar().encode(
    x='BOROUGH:N',
    y='count():Q',
    color=alt.Color('weather:N', scale=alt.Scale(scheme='tableau10')),
    tooltip=['count():Q']
).add_params(
    category_param
).properties(
    width=600,
    height=400,
    title='Number of Collisions - CONTRIBUTING FACTOR'
)

category_chart

Answers to questions through the final visualization
==
Based on the final visualizations, I answer the original questions of the task below.


* Question 1 <br>
Which weather condition and type of vehicle were present in the majority of accidents each month? And in the combination of all the months?

As depicted on the heatmap, rainy weather conditions were predominant for accidents across each month. Conversely, occurrences were least frequent during cloudy weather, with no reported events in July and September. Similarly, analyzing the vehicle types involved in accidents, taxis were implicated in the majority of incidents, while fire trucks were less commonly involved.

* Question 2 <br>
In which area and at what hour did the majority of accidents each month happen? And in the combination of all the months?

The chart clearly illustrates that the majority of accidents occur in Manhattan, and this pattern persists every month. Generally, the most prevalent time for accidents across all boroughs is in the afternoon, typically around 4 p.m.

* Question 3 <br>
Which area presented the majority of taxi accidents during rainy days in June on Mondays at noon (12 p.m.)?

Based on the charts, for taxi accidents during rainy days in June on Mondays at noon, Manhattan clearly experienced the highest number of incidents, accounting for a significant majority. In comparison, Brooklyn, Queens, and the Bronx had similar accident counts, while Staten Island recorded no accidents under these specific conditions.

* Question 4 <br>
Which day had more accidents during clear days in July in Manhattan?

Based on the line plot, analyzing the data for clear days in July in Manhattan, it is notable that there is no discernible strong pattern in the number of accidents. It is evident that July 19 stood out as the day with the highest number, totaling 20 incidents. Conversely, July 9 recorded the lowest number of accidents among clear days, with 8 occurrences.


---
Additional questions
==


* Question 5 <br>
How many times did each weather condition occur each month?

On the heatmap, we can see that rainy weather was the most frequent condition in each of the analyzed months. June displayed a relatively balanced distribution of different weather conditions. In July, the number of clear days notably increased compared to June, and interestingly, no cloudy days were observed. August stood out as the month with the highest frequency of rainy days. September followed a pattern similar to July, although with a decrease in clear days.

* Question 6 <br>
Was the number of injured people during accidents higher on weekends compared to weekdays for the different vehicles involved?

Specifically, for taxis, the highest number of injuries occurred on Saturdays, while Sundays show similar values to weekdays. However, this pattern is not consistently observed for ambulances and fire trucks, where the number of injured people during accidents does not exhibit a clear distinction between weekends and weekdays.

* Question 7 <br>
Is there a specific time of day when Manhattan experiences fewer accidents in comparison to other boroughs?

Analyzing the chart, it is noteworthy that Manhattan consistently records the highest number of accidents throughout the day. The distinction in comparison to other boroughs becomes less pronounced in the early morning hours. In fact, around 7 am, the number of accidents in Manhattan and the Bronx appears to be similar.

* Question 8 <br>
Do drivers pay more attention on the road during bad weather in order to reduce the number of accidents?

When analyzing the contributing factor of 'Driver Inattention/distraction' and considering the corresponding bar chart, it appears that drivers do not significantly increase their attention on the road during bad weather. Across all boroughs, rainy weather consistently emerges as the most frequent weather condition for accidents caused by driver inattention.