# 🚗 Collisions in NY City 🗽
**Authors**: Carlos Arbonés & Benet Ramió

![Alt text](../data/photo_cars_ny.jpg)


## Data extraction

We acquired the dataset from the [Motor Vehicle Collisions](https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95) source. Prior to obtaining it, we specifically filtered and downloaded the records corresponding to the periods of June to September in both 2018 and 2020.

The initial dataset contains 115740 rows and 30 columns. 
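The filtering was done on the portal before download. As a rough sketch, the same selection could be reproduced in pandas on a full export. The miniature `raw` frame and the `CRASH DATE` column name used here are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical miniature export; the real file has one row per collision.
raw = pd.DataFrame({
    'CRASH DATE': ['05/31/2018', '06/01/2018', '07/15/2019', '09/30/2020', '10/01/2020']
})

dates = pd.to_datetime(raw['CRASH DATE'], format='%m/%d/%Y')

# Keep only June to September (inclusive) of 2018 and 2020.
mask = dates.dt.year.isin([2018, 2020]) & dates.dt.month.between(6, 9)
subset = raw[mask]
print(subset)
```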

## Preprocessing

During preprocessing, we reviewed every column, focusing on data types, the presence of null values, and potential clustering patterns.

### Crash Date

For the 'Crash Date' column, we have maintained the data type as text, and it's important to note that **no null values** were found. The data within this column also appears to be in a **consistent format**. To enhance data organization and facilitate further analysis, we have **sorted** rows by 'Crash Date'.

### Crash Time

In the 'Crash Time' column, **no null values** are present. A disproportionate number of accidents are recorded at certain times, such as 00:00 and 13:00. More generally, round times like 15:20 or 6:45 are more frequent than times with arbitrary minute values such as 19 or 23. This suggests that officers commonly **approximate the minute value** rather than record it exactly, which we have taken into account during preprocessing.
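One way to quantify this pattern is to measure the share of records whose minute is a multiple of five. If minutes were spread evenly, that share would be about 20% (12 of the 60 possible values). The sample values below are hypothetical, chosen only to illustrate the check.

```python
import pandas as pd

# Hypothetical sample of 'CRASH TIME' values in H:MM format.
times = pd.Series(['0:00', '13:00', '15:20', '6:45', '8:19', '17:30', '9:23', '12:00'])

# Extract the minute component as an integer.
minutes = times.str.split(':').str[1].astype(int)

# Share of records whose minute is a multiple of five.
round_share = (minutes % 5 == 0).mean()
print(f'{round_share:.0%} of records fall on a multiple of five minutes')
```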


### Borough

The 'Borough' column contains a limited set of distinct values, namely, **BRONX, BROOKLYN, MANHATTAN, QUEENS, STATEN ISLAND**, and blank. Notably, there are a significant number of records where the 'Borough' value is **blank**, with a total of **40,671** occurrences. We do not eliminate these rows, as they can be used for counting and other applications. We also explored the option of imputing values using the GeoPandas library; however, due to the high cost and the prevalence of missing values, it would take a considerable amount of time. Therefore, we have taken into account the existence of blank entries for later consideration in our analysis.

### Zip Code

There are 208 distinct zip codes in the dataset. Although this is too many categories to encode individually, each zip code can be aggregated to its borough.
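A lookup from zip code to borough can be derived from the rows where both fields are present. The following is a minimal sketch on a hypothetical four-row frame; the `ZIP CODE` and `BOROUGH` column names are assumed to match the raw export.

```python
import pandas as pd

# Hypothetical rows; the last one has neither borough nor zip code.
df = pd.DataFrame({
    'BOROUGH': ['QUEENS', 'QUEENS', 'BRONX', None],
    'ZIP CODE': ['11372', '11372', '10451', None],
})

# One entry per zip code, taken from rows where both fields are filled in.
zip_to_borough = (
    df.dropna(subset=['BOROUGH', 'ZIP CODE'])
      .drop_duplicates('ZIP CODE')
      .set_index('ZIP CODE')['BOROUGH']
)
print(zip_to_borough.to_dict())
```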

### Latitude, Longitude and Location

In our data preprocessing phase, we took several steps to enhance the quality and consistency of the 'Latitude' and 'Longitude' columns.

We first converted the data type of 'Latitude' and 'Longitude' to number. We also made the decision to remove the 'Location' column since the necessary information is now adequately represented in the 'Latitude' and 'Longitude' columns. It's important to highlight that there are a total of 7,667 blank values in these columns.

During the data inspection, we identified atypical and impossible values, such as a longitude of -201. These outliers were associated with the location "QUEENSBORO BRIDGE UPPER BROADWAY", and after verifying the correct longitude, we adjusted these values to -73.954224. Similar outlier corrections were made for the "NASSAU EXPRESSWAY" location, where the previous value of -32 was corrected to -73.7813672, and for the "WEST SHORE EXPRESSWAY", where the previous value of -74.7 was corrected to -74.1864671. Finally, rows that had both 'Latitude' and 'Longitude' values equal to 0 were modified to contain blank text values, promoting uniformity and clarity in the dataset.

These preprocessing actions have been taken to ensure that the 'Latitude' and 'Longitude' data are accurate, free from anomalies, and suitable for further analysis.
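The coordinate clean-up can be sketched in pandas as follows. The sample values and the longitude bounds used to flag outliers are assumptions for illustration. In the work described above, the flagged rows were corrected by hand after looking up the street location.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values stored as text, as in the original export.
df = pd.DataFrame({
    'LATITUDE': ['40.7608', '0', '40.6', None],
    'LONGITUDE': ['-201.0', '0', '-73.9', None],
})

# Convert text to numbers; blank or unparseable entries become NaN.
df[['LATITUDE', 'LONGITUDE']] = df[['LATITUDE', 'LONGITUDE']].apply(pd.to_numeric, errors='coerce')

# (0, 0) is a placeholder rather than a real location: treat it as missing.
zero = (df['LATITUDE'] == 0) & (df['LONGITUDE'] == 0)
df.loc[zero, ['LATITUDE', 'LONGITUDE']] = np.nan

# Flag longitudes far outside New York City for manual review.
# The bounds are rough assumptions, not values from the original work.
suspicious = df['LONGITUDE'].notna() & ~df['LONGITUDE'].between(-74.3, -73.6)
print(df[suspicious])
```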

### ON STREET NAME, CROSS STREET NAME and OFF STREET NAME

As part of our data preprocessing, we have opted to remove the 'On Street Name,' 'Cross Street Name,' and 'Off Street Name' columns from the dataset. This decision is influenced by several key factors. Firstly, these columns contained a significant number of blank values, with 'On Street Name' having approximately 28,000 blanks, 'Cross Street Name' over 57,000 blanks, and 'Off Street Name' about 80,000 blanks. The prevalence of missing data in these columns significantly affected the overall data quality.

Furthermore, the information contained within these columns was found to be redundant, and the extensive variety of street names made it challenging to derive meaningful insights or create compelling visualizations based on these columns. To fulfill the need for location-related information, we have chosen to rely on the 'Latitude' and 'Longitude' columns, which provide more structured and comprehensive geographic data. This not only reduces redundancy but also enhances data clarity and usability for our visualization project.

### INJURED AND KILLED

We have changed the datatype of all the injured and killed columns to number as well as changed the column names to shorter and more informative ones.
The old column names and their corresponding new names are as follows:

- NUMBER OF PERSONS INJURED -> TOTAL_INJURED
- NUMBER OF PERSONS KILLED -> TOTAL_KILLED
- NUMBER OF PEDESTRIANS INJURED -> PEDASTRIANS_INJURED
- NUMBER OF PEDESTRIANS KILLED -> PEDASTRIANS_KILLED
- NUMBER OF CYCLISTS INJURED -> CYCLISTS_INJURED
- NUMBER OF CYCLISTS KILLED -> CYCLISTS_KILLED
- NUMBER OF MOTORISTS INJURED -> MOTORISTS_INJURED
- NUMBER OF MOTORISTS KILLED -> MOTORISTS_KILLED
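A minimal sketch of the renaming and type conversion, shown for two of the eight columns on hypothetical data:

```python
import pandas as pd

# Hypothetical raw values stored as text, with one blank entry.
df = pd.DataFrame({
    'NUMBER OF PERSONS INJURED': ['1', '0', None],
    'NUMBER OF PERSONS KILLED': ['0', '0', '1'],
})

rename_map = {
    'NUMBER OF PERSONS INJURED': 'TOTAL_INJURED',
    'NUMBER OF PERSONS KILLED': 'TOTAL_KILLED',
}
df = df.rename(columns=rename_map)

# Convert each renamed column to a numeric type; blanks become NaN.
for col in rename_map.values():
    df[col] = pd.to_numeric(df[col], errors='coerce')

print(df.dtypes)
```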




### COLLISION_ID

We have removed this column since it is not needed for our analysis.

### VEHICLE TYPE CODE 1 & 2

For both 'VEHICLE TYPE CODE 1' and 'VEHICLE TYPE CODE 2,' a comprehensive data standardization process was undertaken. Initially, we applied clustering techniques using the Nearest Neighbor Method and Key Collision Method to harmonize and unify vehicle type names. These automated methods aided in resolving many inconsistencies caused by human errors.
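To illustrate the idea behind key-collision clustering, the sketch below groups names that share the same "fingerprint" (lowercased, punctuation removed, unique tokens sorted). It is a simplified version of that idea on hypothetical names, not necessarily the exact procedure or tool used for this dataset.

```python
import string
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Lowercase, drop punctuation, then sort the unique tokens."""
    cleaned = value.strip().lower().translate(str.maketrans('', '', string.punctuation))
    return ' '.join(sorted(set(cleaned.split())))


# Hypothetical raw vehicle type names with inconsistent spelling.
names = ['Box Truck', 'box truck', 'TRUCK, BOX', 'Sedan', 'sedan ']

# Names with the same fingerprint land in the same cluster.
clusters = defaultdict(list)
for name in names:
    clusters[fingerprint(name)].append(name)

for key, members in clusters.items():
    print(key, '<-', members)
```

Each cluster can then be reviewed and merged under a single canonical name.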

Recognizing that the automated clustering did not catch all errors, we conducted a meticulous manual review of vehicle type names. In cases where specific vehicles were inaccurately described, we updated their names. For instance, we found entries like 'GLP050VXEV,' which, upon internet research, was identified as a forklift model, so we modified it accordingly. Similar manual refinements were applied to various other vehicle types to improve accuracy and consistency.

Additionally, we adopted a generalization approach to simplify overly specific vehicle types. For example, entities like 'FedEx,' 'UPS,' 'mail' and others were categorized under 'delivery' to reduce the number of distinct vehicle classes.

To further enhance data consistency, vehicle types with fewer than ten collisions were grouped under the 'Others' category to streamline the dataset and reduce the number of unique names.
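Grouping rare categories can be sketched as below, using the threshold of ten collisions mentioned above on a hypothetical series of vehicle types.

```python
import pandas as pd

# Hypothetical vehicle types: two frequent ones and two rare ones.
vehicles = pd.Series(['Sedan'] * 12 + ['SUV'] * 10 + ['Forklift'] * 2 + ['Tractor'])

# Vehicle types that appear in fewer than ten collisions.
counts = vehicles.value_counts()
rare = counts[counts < 10].index

# Keep frequent names as they are and relabel the rare ones.
grouped = vehicles.where(~vehicles.isin(rare), 'Others')
print(grouped.value_counts())
```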

Notably, before these transformations 'VEHICLE TYPE CODE 1' had 361 distinct names and 'VEHICLE TYPE CODE 2' had 373; afterwards there are 42 and 44, respectively. These preprocessing steps aimed to standardize and simplify the vehicle type data, making it more manageable and coherent for our analysis and visualization efforts.

### VEHICLE TYPE CODE 3, 4 & 5

Considering that the majority of accidents typically involve two vehicles, the 'VEHICLE TYPE CODE 3,' 'VEHICLE TYPE CODE 4,' and 'VEHICLE TYPE CODE 5' columns contained a significant number of blank values. Specifically, 'VEHICLE TYPE CODE 3' had 107,095 blanks, 'VEHICLE TYPE CODE 4' had 113,658 blanks, and 'VEHICLE TYPE CODE 5' had 115,154 blanks. These observations indicate that the vast majority of records in these columns did not contain meaningful data.

To streamline the dataset and focus on more relevant and informative columns, we have made the decision to remove these three columns. The absence of data in these columns, coupled with their limited relevance for our visualizations, makes their removal a practical and efficient choice in our data preprocessing.
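Relative to the 115,740 rows of the dataset, the blank counts above correspond to roughly 92.5%, 98.2% and 99.5% of the records. A check of this kind can be sketched as follows. The sample frame and the 50% cut-off are assumptions for illustration, not the rule applied in the original work.

```python
import pandas as pd

# Hypothetical frame: one well-populated column and one mostly blank column.
df = pd.DataFrame({
    'VEHICLE TYPE CODE 1': ['Sedan', 'SUV', 'Taxi', 'Bus', 'Van'],
    'VEHICLE TYPE CODE 3': [None, None, None, None, 'Sedan'],
})

# Fraction of blank values per column.
blank_share = df.isna().mean()

# Drop the columns where more than half of the values are blank (assumed cut-off).
mostly_blank = blank_share[blank_share > 0.5].index
trimmed = df.drop(columns=mostly_blank)
print(blank_share)
```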

### CONTRIBUTING_FACTOR_VEHICLE1, 2, 3, 4, & 5

Due to the high number of null values in the CONTRIBUTING_FACTOR3, 4, 5 columns, we have decided to eliminate them as they will not be useful for our analysis. We have retained columns 1 and 2 in case we want to analyze the primary causes of accidents in New York at some point.

## Data Derivation

Due to the types of charts we want to create, we found it beneficial to add more attributes to the dataset, particularly those related to the date. In this case, we decided to combine the 'CRASH_DATE' and 'CRASH_TIME' columns into a single column containing the data in a format suitable for Altair. We created a new column named 'CRASH_DATETIME' in the format '%m/%d/%Y %H:%M'. We also added another column, 'DAY_WEEK,' that indicates the day of the week when the accident occurred (e.g., Monday, ...). Additionally, we introduced a column named 'TYPE_DAY' containing values 'Weekend' and 'Weekday' based on whether the accidents happened on the weekend or during the week, respectively.

In [66]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [67]:
collisions = pd.read_csv("../data/preprocessed-collisions.csv")

In [68]:
collisions['CRASH_DATETIME'] = pd.to_datetime(collisions['CRASH_DATE'] + ' ' + collisions['CRASH_TIME'], format='%m/%d/%Y %H:%M')
collisions = collisions.drop(columns=['CRASH_TIME'])
collisions['DAY_WEEK'] = collisions['CRASH_DATETIME'].dt.day_name()
collisions['TYPE_DAY'] = collisions['DAY_WEEK'].apply(lambda day: 'Weekend' if day in ['Saturday', 'Sunday'] else 'Weekday')

In [None]:
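# Unexecuted draft: select only the columns we need. The column list is incomplete
# (it ends with an empty name, which would raise a KeyError), so it is left commented out.
# collisions = collisions[['CRASH_DATETIME', 'DAY_WEEK', 'TYPE_DAY', 'BOROUGH', 'ZIP_CODE', 'LATITUDE', 'LONGITUDE', '']]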

In [69]:
collisions.to_csv("../data/preprocessed-collisions-final.csv", index=False)

In [70]:
print(f'After the preprocessing, the dataset has {len(collisions)} rows and {len(collisions.columns)} columns')

After the preprocessing, the dataset has 115740 rows and 20 columns


## Design and implementation

In [71]:
import numpy as np
import altair as alt
import geopandas as gpd

alt.data_transformers.disable_max_rows()

DataTransformerRegistry.enable('default')

In [72]:
collisions = pd.read_csv('../data/preprocessed-collisions-final.csv', dtype={'ZIP_CODE': str})

### Are accidents more frequent during weekdays or weekends? Is there any difference between before COVID-19 and after?

It is crucial to differentiate between weekdays and weekends, as well as the periods preceding and following the onset of COVID (2018 and 2020). Our initial strategy revolves around structuring the data with the day of the week on the x-axis and consolidating the accident counts for each day. This approach enables us to discern notable variations between weekdays and weekends.

For the comparative analysis of accident numbers before and after COVID, we suggest employing a paired bar chart. This graphical representation will utilize different colors for each period, facilitating a straightforward and visually impactful comparison between the two.

In [73]:
paired_bar_chart = alt.Chart(collisions).mark_bar().encode(
  x = alt.X('year:O', axis=alt.Axis(title=None, labels=False, ticks=False)),  # year is already encoded by color
  y = alt.Y('count:Q', title = 'Number of collisions', axis=alt.Axis(offset=6)),
  color= alt.Color('year:O', scale = alt.Scale(scheme='tableau10')),
  column = alt.Column('DAY_WEEK:N', title='Day of the Week', 
                      sort=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday'], 
                      header=alt.Header(titleOrient='bottom', labelOrient='bottom', labelPadding=4))
).transform_calculate(
  year = 'year(datum.CRASH_DATETIME)',
).transform_aggregate(
  count='count()',
  groupby=['year', 'DAY_WEEK']
)

# paired_bar_chart

The previous chart distinguishes the two years within each day of the week, enabling a comparison before and after COVID. However, it is of limited use for judging whether more accidents occur on weekdays or on weekends. Our next step is therefore to add a slope chart that encodes this information directly.


We also considered normalizing each count by the number of times the corresponding day of the week occurs in the period. However, we believe that this would complicate the implementation. Instead, we assume that the days of the week are roughly equally represented in the four months from June to September. This holds closely: the period spans 122 days, so each day of the week occurs either 17 or 18 times.
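The assumption of equally represented weekdays can be checked directly with a calendar count. This short sketch counts how often each day of the week occurs between June 1 and September 30 of each year.

```python
import pandas as pd

weekday_counts = {}
for year in (2018, 2020):
    # All calendar days in the four-month window of the given year.
    days = pd.date_range(f'{year}-06-01', f'{year}-09-30', freq='D')
    # Number of Mondays, Tuesdays, ... in that window.
    weekday_counts[year] = days.day_name().value_counts()
    print(year, weekday_counts[year].to_dict())
```

Since the counts differ by at most one, normalizing by them would change the per-day totals only slightly.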

In [74]:
slope_chart = alt.Chart(collisions).mark_line(point=True).encode(
  x=alt.X('TYPE_DAY:O', title = 'Type of day'),
  y=alt.Y('avg_collisions:Q', title = 'Average number of collisions'),
  color=alt.Color('year:O', scale = alt.Scale(scheme='tableau10'), legend=alt.Legend(title='Year')),
).transform_calculate(
  year='year(datum.CRASH_DATETIME)'
).transform_aggregate(
  count='count()',
  groupby=['year', 'DAY_WEEK', 'TYPE_DAY']
).transform_aggregate(
  avg_collisions = 'mean(count)',
  groupby=['year', 'TYPE_DAY']
)

# (paired_bar_chart | slope_chart).properties(
#      title='Number of collisions by day of the week and year'
# ).configure_title(
#   anchor='middle', offset=25, fontSize=18, fontStyle='normal', fontWeight='normal'
# ).configure_view(
#   stroke='transparent'
# ).resolve_scale(
#   y='shared'
# )

This view, composed of juxtaposed charts, enables us to encode all the information we want. We chose a bar chart because bar length on a common axis makes differences easy to perceive. Within each category (day of the week), values can be compared by year. Comparing across days of the week is harder, but this matters less because inter-category comparison is not our primary focus.

Our focus is on comparing weekdays and weekends and, as mentioned previously, the slope chart captures this comparison well. We chose it because it is designed to encode exactly two values per series, which matches our data: it shows both the size and the direction of the change between weekdays and weekends for each year. Its simplicity enables swift and effective comparisons.

### Is there any type of vehicle more prone to participate in accidents?

To address this question, we will utilize the data from the 'VEHICLE_TYPE_CODE1' and 'VEHICLE_TYPE_CODE2' columns, which contain the names of the vehicles involved in accidents. In this instance, we will combine values from both columns and generate an auxiliary dataset called 'vehicle_type' that will include the vehicle name and the number of accidents it has been involved in, denoted as 'n_accidents'.

In [75]:
vehicle_type = pd.DataFrame({'vehicle_type': list(collisions['VEHICLE_TYPE_CODE1'].values) + list(collisions['VEHICLE_TYPE_CODE2'].values)})
vehicle_type = vehicle_type.groupby('vehicle_type').size().reset_index(name='n_accidents')
print(f'There are {len(vehicle_type)} different types of vehicles in the dataset')
vehicle_type.head()                                                     

There are 48 different types of vehicles in the dataset


| | vehicle_type | n_accidents |
|---|---|---|
| 0 | 3-Door | 50 |
| 1 | Ambulance | 732 |
| 2 | Armored Truck | 55 |
| 3 | Beverage Truck | 55 |
| 4 | Bike | 5543 |


Given the wide variety of vehicles in the dataset, encoding all of them in a single chart is impractical. We therefore select the 10 vehicle types with the highest accident counts. This number could be different, but beyond 10 the counts for each vehicle type become relatively small and hard to visualize. The remaining vehicle types are consolidated into a single category labeled "Others", which streamlines the visualization and keeps it focused and interpretable.

In [76]:
# Names of the 10 vehicle types with the most collisions
most_collisioned = list(vehicle_type.sort_values(by='n_accidents', ascending=False).head(10)['vehicle_type'])
# Relabel every other vehicle type as 'Others'
vehicle_type['vehicle_type'] = vehicle_type['vehicle_type'].apply(lambda x: x if x in most_collisioned else 'Others')
# Re-aggregate so that 'Others' becomes a single row
vehicle_type = vehicle_type.groupby('vehicle_type', as_index=False)['n_accidents'].sum()
vehicle_type = vehicle_type.sort_values(by='n_accidents', ascending=False)
print(f'The 10 most collisioned vehicles are: {most_collisioned}')

The 10 most collisioned vehicles are: ['Sedan', 'SUV', 'Taxi', 'Pickup', 'Bike', 'Box truck', 'Bus', 'Truck', 'Motorcycle', 'Van']


This initial analysis aims to identify the vehicles most frequently involved in accidents. Our first choice is a bar chart with the total number of accidents on the x-axis and the vehicle names on the y-axis, which keeps the names legible and facilitates comparison. We add a red vertical line at the average number of accidents, making it easy to see which vehicles fall above or below it. Finally, we sort the bars in descending order.

In [77]:
bar_chart = alt.Chart(vehicle_type).mark_bar().encode(
    x=alt.X('n_accidents:Q', title='Number of collisions', scale=alt.Scale(domain=(0, 1e5 + 1))),
    y=alt.Y('vehicle_type:N', 
            sort=list(vehicle_type.loc[vehicle_type['vehicle_type'] != 'Others', 'vehicle_type']) + ['Others'], 
            title='Vehicle Type'),
    color=alt.condition(
            alt.datum.vehicle_type == 'Others', 
            alt.value('grey'),
            alt.value('steelblue')
        )
)
    
mean_line = alt.Chart(vehicle_type).mark_rule(color='red', strokeWidth=1.5).encode(
        x = alt.X('mean(n_accidents):Q')
)

# (bar_chart + mean_line).properties(
#         title='Number of collisions by vehicle type'
# ).configure_title(
#         anchor='middle', fontSize=16, fontStyle='normal', fontWeight='normal', offset=20
# ).properties(
#         width=500, 
#         height=300
# )


The chart works well, but we have decided to label the bars with their values to facilitate comparison. In its current form it is evident which values are larger or smaller than others, but the exact values cannot be read precisely. Therefore, even though adding the accident counts may seem redundant, we believe it helps perform the tasks more effectively.







In [78]:
n_accidents_text = alt.Chart(vehicle_type).mark_text(align='left', dx=2, color='black', size=10).encode(
        x=alt.X('n_accidents:Q'),
        y=alt.Y('vehicle_type:N', 
            sort=list(vehicle_type.loc[vehicle_type['vehicle_type'] != 'Others', 'vehicle_type']) + ['Others']),
        text='n_accidents:Q'
)

# (bar_chart + mean_line + n_accidents_text).properties(
#         title='Number of collisions by vehicle type'
# ).configure_title(
#         anchor='middle', fontSize=16, fontStyle='normal', fontWeight='normal', offset=20
# ).properties(
#         width=500, 
#         height=300
# )

The issue with this chart is that it displays the number of accidents for each vehicle type during the given time period. However, it doesn't provide information about whether a particular vehicle is more prone to accidents than another, as we lack data on the total number of each type of vehicle. 

### At what time of the day are accidents more common?

To examine the temporal patterns of accidents throughout the day, we will employ a line chart. The x-axis will represent hours, with the y-axis indicating the corresponding number of accidents. Opting for a line chart enables a clear depiction of how accident frequencies evolve over time. We will differentiate the data by year, using distinct colors for 2018 and 2020, providing a comparative analysis.

In [79]:
line_count = alt.Chart(collisions).mark_line(strokeWidth=2, point=True).encode(
    alt.X('hours(CRASH_DATETIME):O', title='Time of Day'),
    alt.Y('count():Q', title='Number of Collisions'),
    color= alt.Color('year(CRASH_DATETIME):O', scale = alt.Scale(scheme='tableau10'))
)

# line_count

This type of visualization works, although it can be improved. The y-axis shows the total number of collisions per hour, summed over all days in the period, which makes it hard to judge how many accidents occur in a given hour on a typical day. To address this, we will encode the average number of accidents per day for each hour, accompanied by an error band (here the 5th to 95th percentile of the daily counts) so that we can assess the variability of the data.

In [80]:
collisions2 = collisions.copy()
collisions2['CRASH_DATETIME'] = pd.to_datetime(collisions2['CRASH_DATETIME'])
# Extract year and hours from CRASH_DATETIME
collisions2['year'] = collisions2['CRASH_DATETIME'].dt.year
collisions2['hours'] = collisions2['CRASH_DATETIME'].dt.hour

collisions2 = collisions2.groupby(['year', 'hours', 'CRASH_DATE']).size().reset_index(name='count')

# Calculate the average and a 5th-95th percentile band of the daily counts
# (a percentile interval, not a statistical confidence interval)
average_ci = collisions2.groupby(['year', 'hours']).agg(
    avg=('count', 'mean'),
    ci_lower=('count', lambda x: np.percentile(x, 5)),
    ci_upper=('count', lambda x: np.percentile(x, 95))
).reset_index()

# Line chart for the average, to be layered with the percentile band
line_chart = alt.Chart(average_ci).mark_line().encode(
    x=alt.X('hours:Q', title='Time of day'),
    y=alt.Y('avg:Q', title='Average number of collisions'),
    color=alt.Color('year:N', scale=alt.Scale(scheme='tableau10'))
)

error_bars = alt.Chart(average_ci).mark_errorband().encode(
    x=alt.X('hours:Q'),
    y=alt.Y('ci_lower:Q', title='Number of collisions'),
    y2=alt.Y2('ci_upper:Q'),
    color=alt.Color('year:N', scale=alt.Scale(scheme='tableau10'))
)

# (line_chart + error_bars).properties(width=600, height=400)

In [81]:
line_average = alt.Chart(collisions).mark_line(strokeWidth=2, point=True).encode(
    x = alt.X('hours:Q', title='Time of day'),
    y = alt.Y('avg:Q', title='Average number of collisions'),
    color = alt.Color('year:O', scale = alt.Scale(scheme='tableau10'))
).transform_calculate(
  year = 'year(datum.CRASH_DATETIME)',
  hours = 'hours(datum.CRASH_DATETIME)'
).transform_aggregate(
   count='count()',
   groupby=['year', 'hours', 'CRASH_DATE']
).transform_aggregate(
    avg = 'mean(count)',
    groupby=['year', 'hours']
)

error_bar = alt.Chart(collisions).mark_errorbar(ticks=True).encode(
    x=alt.X('hours:Q'),
    y=alt.Y('count:Q',axis=alt.Axis(title=None), scale=alt.Scale(zero=False)),
    color = alt.Color('year:O', scale = alt.Scale(scheme='tableau10'))
).transform_calculate(
  year = 'year(datum.CRASH_DATETIME)',
  hours = 'hours(datum.CRASH_DATETIME)'
).transform_aggregate(
   count='count()',
   groupby=['year', 'hours', 'CRASH_DATE']
)

# (line_average + error_bar).properties(width=600, height=400)

Upon analyzing the hourly collisions, a clear trend emerges: higher collision rates during the day and lower rates during the night. This pattern aligns with the increased presence of cars on the road during daylight hours and decreased activity during nighttime. Furthermore, we can distinguish different patterns between the morning, afternoon, and evening periods. Mornings exhibit fewer collisions, likely attributable to work-related activities, whereas afternoons register more incidents, potentially linked to leisure activities and transporting children to extracurricular activities. Evenings see a decline in collisions as people conclude their activities and return home.

We can further enhance our chart by introducing an additional variable to glean more insights. One factor of high importance is the total number of deaths caused by the accidents. It is crucial not only to identify peak collision times throughout the day but also to comprehend the human cost associated with these incidents. This variable will be encoded through the line thickness, with thicker lines indicating a higher number of deaths.

In [82]:
avg_deaths_line = alt.Chart(collisions).mark_trail().encode(
    x = alt.X('hours:Q', title='Time of day'),
    y = alt.Y('avg_collisions:Q', title='Average number of collisions'),
    color = alt.Color('year:O', scale = alt.Scale(scheme='tableau10'), title='Year'),
    size = alt.Size('avg_killed:Q', title='Average deaths')
).transform_calculate(
  year = 'year(datum.CRASH_DATETIME)',
  hours = 'hours(datum.CRASH_DATETIME)'
).transform_aggregate(
  count_collisions='count()',
  count_killed='sum(TOTAL_KILLED)',
  groupby=['year', 'hours', 'CRASH_DATE']
).transform_aggregate(
  avg_collisions='mean(count_collisions)',
  avg_killed='mean(count_killed)',
  groupby=['year', 'hours']
)

# (avg_deaths_line + error_bars).properties(
#     title='Average collisions and deaths over time',
#     width=600,
#     height=400
#     ).configure_title(
#       anchor='middle', offset=25, fontSize=18, fontStyle='normal', fontWeight='normal'
#     )

This is the final version of the graph. It facilitates the identification of peak accident times. While collisions are most frequent around 16:00, deaths are notable at 20:00 and 04:00 in 2018, and between 19:00 and 00:00 as well as at 04:00 in 2020. The late-night deaths coincide with the times when people return home after socializing, often under the influence of alcohol, which may make accidents more dangerous.

This visualization provides a clear depiction of the temporal patterns of accidents throughout the day, which makes it easy to identify the times at which accidents are most common. However, the error bars and the line thickness make it somewhat cluttered and difficult to read. Despite that, we believe the information they convey is important enough to keep them.

*At what time of the day are accidents more common?*

Chart C3 is a line chart of the average number of accidents per hour for each year. Color distinguishes the years and line thickness encodes the number of people killed. Accidents are most common during the afternoon, peaking at 16:00, while deaths are more common during the evening and late night.

### Are there any areas with a larger number of accidents?

We create a smaller dataset named 'geo_collisions' containing only the columns used for analysis. Additionally, we drop all rows with NaN values to ensure data integrity.

In [83]:
geo_collisions = collisions[['LATITUDE', 'LONGITUDE', 'ZIP_CODE', 'BOROUGH']]
geo_collisions = geo_collisions.dropna()
geo_collisions['ZIP_CODE'] = geo_collisions['ZIP_CODE'].apply(lambda x: x.split('.')[0])

In [84]:
ny_city_map = alt.topo_feature('../data/ny_city_map.geojson', '')
ny_city = alt.Chart(ny_city_map).mark_geoshape(
    fill='lightgray',  
    stroke='white',    
    strokeWidth=1.5    
)                            

The initial idea is to create a point map where each accident is marked as a point on the New York map. This way, we aim to visualize which areas have more points and, consequently, more accidents. We also think it helpful to show the number of accidents in each borough as a reference for scale. We will reduce the opacity of the points to better distinguish those that are close together. We use a categorical color palette since the boroughs have no specific order and must be clearly distinguishable from one another. Furthermore, we do not repeat the borough names on the axis of the bar chart, as they are already encoded with color.

In [85]:
point_map = alt.Chart(geo_collisions).mark_circle(size=1, opacity=0.7).encode(
    latitude='LATITUDE:Q',
    longitude='LONGITUDE:Q',
    color = alt.Color('BOROUGH:N', 
                      scale=alt.Scale(scheme='tableau10'))
)

bar_chart_map = alt.Chart(geo_collisions).mark_bar().encode(
    alt.X('BOROUGH:N', title='Borough', axis=alt.Axis(title=None, labels=False, ticks=False)),
    alt.Y('count():Q', title='Number of Accidents'),
    color = alt.Color('BOROUGH:N', 
                      scale=alt.Scale(scheme='tableau10'),)
)
    
# ((ny_city + point_map).properties(
#     width=380,
#     height=380
# ) | bar_chart_map).properties(
#     title = alt.TitleParams(text='Number of collisions by borough', 
#                             fontSize=18, 
#                             fontStyle='normal',
#                             fontWeight='normal',
#                             subtitle='', 
#                             offset=35)
# ).configure_title(
#     anchor='middle'
# )

We have observed that the point map does not efficiently differentiate areas with more accidents. Due to the high volume of points, they overlap, creating regions of solid color that are challenging to interpret. Therefore, the next step is to try a choropleth map, which divides the city by postal code and encodes the number of accidents in each area with color. One issue is that larger map regions will generally have more accidents, making the comparison unfair. To address this, we normalize the number of accidents in each area by the area it occupies: instead of encoding the raw number of accidents, we encode the number of accidents per square kilometer. We will use a sequential single-hue palette for clarity.

In [86]:
gdf = gpd.read_file('../data/ny_city_map.geojson')
gdf = gdf[['postalCode', 'Shape_Area', 'geometry']]
area_gdf = gdf.to_crs(epsg=6933) # the length unit is now 'meter'
gdf['Shape_Area'] = area_gdf.area.values / 1e6 # set the area units to sq Km.
# gdf.head()

We merge with the collisions dataset to obtain the count of accidents for each postal code.

In [87]:
gdf_collisions = gdf.merge(geo_collisions.groupby('ZIP_CODE').size().reset_index(name='n_accidents').rename(columns={'ZIP_CODE': 'postalCode'}), 
                           on='postalCode', 
                           how='left')
# gdf_collisions.head()

In [88]:
gdf_collisions['n_accidents_per_km2'] = gdf_collisions['n_accidents']/gdf_collisions['Shape_Area']
gdf_collisions['log_n_accidents_per_km2'] = gdf_collisions['n_accidents_per_km2'].apply(lambda x: np.log(x) if x > 0 else 0)

In [89]:
gdf_collisions.head()

| | postalCode | Shape_Area | geometry | n_accidents | n_accidents_per_km2 | log_n_accidents_per_km2 |
|---|---|---|---|---|---|---|
| 0 | 11372 | 1.875393 | POLYGON ((-73.86942 40.74916, -73.89507 40.746... | 657.0 | 350.326589 | 5.858866 |
| 1 | 11004 | 2.102007 | POLYGON ((-73.71068 40.75004, -73.70869 40.748... | 163.0 | 77.544949 | 4.350858 |
| 2 | 11040 | 0.581786 | POLYGON ((-73.70098 40.73890, -73.70309 40.744... | 25.0 | 42.971107 | 3.760528 |
| 3 | 11426 | 4.588542 | POLYGON ((-73.72270 40.75373, -73.72251 40.753... | 151.0 | 32.908059 | 3.493718 |
| 4 | 11365 | 6.450179 | POLYGON ((-73.81089 40.72717, -73.81116 40.728... | 222.0 | 34.417651 | 3.53857 |


In [90]:
choropleth_map = alt.Chart(gdf_collisions).mark_geoshape().encode(
    alt.Color('n_accidents_per_km2:Q',
              title='Number of accidents per km²', 
              scale=alt.Scale(scheme='lighttealblue', domain=(0, 600)))
).properties(
    width=500,
    height=400,
)

borough_names = alt.Chart(geo_collisions).mark_text(fontWeight='bold', fontSize=11, color='black').encode(
    latitude='mean_lat:Q',
    longitude='mean_long:Q',
    text='BOROUGH:N',
).transform_aggregate(
    mean_lat='mean(LATITUDE)',
    mean_long='mean(LONGITUDE)',
    groupby=['BOROUGH']
)

# (choropleth_map + borough_names).properties(
#      title='Number of Collisions by Postal Code'
# ).configure_title(
#         anchor='middle', fontSize=16, fontStyle='normal', fontWeight='normal', offset=20
# ).configure_legend(
#     strokeColor='gray',
#     fillColor='#EEEEEE',
#     padding=10,
#     cornerRadius=10,
#     orient='top-left',
#     gradientLength=165
# )

The choropleth map provides a clear visualization of areas (by postal code) with higher accident rates per square kilometer. We positioned the legend in the top-left corner to optimize visualization space. Additionally, we capped the color domain at 600 accidents per km² to highlight differences between areas: every value above that threshold is assigned the same color.

While the choropleth map allows for comparisons between postal code areas, we recognize the importance of effective comparisons between boroughs. To address this, we've introduced a bar chart to encode information on the number of accidents per square kilometer for each borough. This enables us to assess whether certain boroughs have significantly higher accident rates than others.

In [91]:
area = {'BRONX': 109.3, 'BROOKLYN': 179.7, 'MANHATTAN': 58.8, 'QUEENS': 281.5, 'STATEN ISLAND': 148.9} # sq km of each borough

accidents_per_borough = geo_collisions.groupby('BOROUGH').size().reset_index(name='n_accidents')
accidents_per_borough['n_accidents_per_sq_km'] = accidents_per_borough.apply(
    lambda x:  x['n_accidents'] / area[x['BOROUGH']],
    axis=1
)
accidents_per_borough.head()

Unnamed: 0,BOROUGH,n_accidents,n_accidents_per_sq_km
0,BRONX,12426,113.6871
1,BROOKLYN,24365,135.58709
2,MANHATTAN,13159,223.792517
3,QUEENS,19950,70.870337
4,STATEN ISLAND,2786,18.710544
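
As a cross-check of the table above, the same rate can be computed without a row-wise `apply`. This is only a sketch of an alternative: it reuses the counts and areas shown in the cell above and relies on `Series.map` to look up each borough's area.

```python
import pandas as pd

# Counts and areas copied from the cell above.
area = {'BRONX': 109.3, 'BROOKLYN': 179.7, 'MANHATTAN': 58.8,
        'QUEENS': 281.5, 'STATEN ISLAND': 148.9}  # km²
counts = pd.DataFrame({
    'BOROUGH': ['BRONX', 'BROOKLYN', 'MANHATTAN', 'QUEENS', 'STATEN ISLAND'],
    'n_accidents': [12426, 24365, 13159, 19950, 2786],
})

# Series.map looks up each borough's area; the division is then vectorized.
counts['n_accidents_per_sq_km'] = counts['n_accidents'] / counts['BOROUGH'].map(area)

# Rank boroughs from the highest to the lowest accident density.
ranked = counts.sort_values('n_accidents_per_sq_km', ascending=False)
print(ranked[['BOROUGH', 'n_accidents_per_sq_km']])
```

A borough name that is not a key of `area` would map to `NaN` instead of raising an error, so blank boroughs would need to be filtered out beforehand.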


In [92]:
bar_chart_map2 = alt.Chart(accidents_per_borough).mark_bar().encode(
    alt.X('BOROUGH:N', title='Borough', sort=alt.EncodingSortField(field="n_accidents_per_sq_km", op="sum", order='descending')),
    alt.Y('n_accidents_per_sq_km:Q', title='Number of Accidents per km²'),
)

# alt.hconcat(
#     choropleth_map + borough_names,
#     bar_chart_map2
# ).properties(
#     title='Number of Collisions by Postal Code and Borough'
# ).configure_title(
#     anchor='middle', fontSize=16, fontStyle='normal', fontWeight='normal', offset=20
# ).configure_legend(
#     strokeColor='gray',
#     fillColor='#EEEEEE',
#     padding=10,
#     cornerRadius=10,
#     orient='top-left',
#     gradientLength=165
# ).configure_axis(
#     labelFontSize=10,
#     titleFontSize=12, 
#     labelFontStyle='normal',
#     titleFontStyle='normal', 
# )

### Is there a correlation between weather conditions and accidents?

Before starting to create visualizations it is necessary to choose the attributes of the 'weather.csv' dataset. Furthermore, since we have one row per day in the weather dataset, we need to group the number of collisions per day in order to merge the two datasets appropriately. 

In [93]:
weather_original = pd.read_csv("../data/weather.csv")
weather = weather_original[['datetime', 'temp', 'precip', 'windspeed', 
                            'humidity', 'cloudcover', 'conditions', 'visibility']].copy()  # copy so the assignment below does not act on a view
weather['datetime'] = pd.to_datetime(weather['datetime'])
weather.head()

Unnamed: 0,datetime,temp,precip,windspeed,humidity,cloudcover,conditions,visibility
0,2018-06-01,21.6,0.282,12.6,86.8,65.9,"Rain, Partially cloudy",11.3
1,2018-06-02,25.1,0.346,22.3,74.0,35.4,"Rain, Partially cloudy",15.8
2,2018-06-03,17.0,2.929,24.1,75.0,92.7,"Rain, Overcast",15.6
3,2018-06-04,16.8,3.91978,16.7,76.6,71.6,"Rain, Partially cloudy",15.4
4,2018-06-05,19.8,0.0,25.9,60.7,35.7,Partially cloudy,16.0


In [94]:
coll_weather = pd.DataFrame({'datetime': collisions["CRASH_DATE"]})
coll_weather['datetime'] = pd.to_datetime(coll_weather['datetime'])
coll_weather = coll_weather.groupby(['datetime']).size().reset_index(name='collisions')
coll_weather = pd.merge(coll_weather, weather, on='datetime')
coll_weather['year'] = coll_weather['datetime'].dt.year
coll_weather.head()

Unnamed: 0,datetime,collisions,temp,precip,windspeed,humidity,cloudcover,conditions,visibility,year
0,2018-06-01,751,21.6,0.282,12.6,86.8,65.9,"Rain, Partially cloudy",11.3,2018
1,2018-06-02,622,25.1,0.346,22.3,74.0,35.4,"Rain, Partially cloudy",15.8,2018
2,2018-06-03,525,17.0,2.929,24.1,75.0,92.7,"Rain, Overcast",15.6,2018
3,2018-06-04,698,16.8,3.91978,16.7,76.6,71.6,"Rain, Partially cloudy",15.4,2018
4,2018-06-05,688,19.8,0.0,25.9,60.7,35.7,Partially cloudy,16.0,2018


We also create two more datasets, one for each year, because we will use them separately in some of the visualizations.

In [95]:
coll_weather['conditions'] = coll_weather['conditions'].apply(lambda x: 'Rain, Overcast' if x=='Overcast' else x)

coll_weather_2018 = coll_weather[coll_weather['year']==2018]
coll_weather_2020 = coll_weather[coll_weather['year']==2020]

Now it is time to create graphs to see if there is any correlation between weather conditions and accidents. We will first try a parallel coordinates chart, using the variables that we consider most likely to be related to the number of collisions. We will also differentiate the data by year, using distinct colors for 2018 and 2020, to allow a comparative analysis.

In [96]:
custom_sort_order = [ 'collisions', 'visibility', 'windspeed', 'temp', 'humidity', 'cloudcover']

# alt.Chart(coll_weather, width=500).transform_window(
#     index='count()'
# ).transform_fold(
#     ['temp', 'precip', 'windspeed', 'humidity', 'cloudcover', 'visibility', 'collisions']
# ).mark_line().encode(
#     x=alt.X('key:N', sort=custom_sort_order),
#     y='value:Q',
#     color='year:N',
#     detail='index:N',
#     opacity=alt.value(0.5)
# )

The first problem we see is that each variable has a different range of values so we will normalize the data to be able to compare them.

In [97]:
# alt.Chart(coll_weather).transform_window(
#     index='count()'
# ).transform_fold(
#     ['temp', 'windspeed', 'collisions', 'humidity', 'cloudcover', 'visibility']
# ).transform_joinaggregate(
#      min='min(value)',
#      max='max(value)',
#      groupby=['key']
# ).transform_calculate(
#     minmax_value=(alt.datum.value-alt.datum.min)/(alt.datum.max-alt.datum.min),
#     mid=(alt.datum.min+alt.datum.max)/2
# ).mark_line().encode(
#     x=alt.X('key:N', sort=custom_sort_order),
#     y='minmax_value:Q',
#     color='year:N',
#     detail='index:N',
#     opacity=alt.value(0.5)
# ).properties(width=500)
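
The min-max scaling applied inside the chart above, `(value - min) / (max - min)` per variable, can also be reproduced directly in pandas. The sketch below uses the first four rows of the merged table shown earlier and only three of its columns, purely as an illustration.

```python
import pandas as pd

# First four daily rows of coll_weather, restricted to three columns.
df = pd.DataFrame({
    "collisions": [751, 622, 525, 698],
    "temp": [21.6, 25.1, 17.0, 16.8],
    "humidity": [86.8, 74.0, 75.0, 76.6],
})

# Min-max scaling per column: (x - min) / (max - min) maps each variable to [0, 1].
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized.round(3))
```

After scaling, every variable spans exactly 0 to 1, so the axes of a parallel coordinates chart become comparable. A constant column would give a zero denominator and therefore `NaN` values.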

Despite normalizing the data, the visualization is not clear enough to show whether there is any correlation between the variables and the number of collisions (we have tried different orderings of the variables, but nothing comes up). Perhaps the large difference between the number of collisions in 2018 and 2020 hides the trends. Our next graph will be a juxtaposition of two parallel coordinates charts, one for each year.

In [98]:
# Chart for coll_weather_2018
chart_2018 = alt.Chart(coll_weather_2018).transform_window(
    index='count()'
).transform_fold(
    ['temp', 'windspeed', 'collisions', 'humidity', 'cloudcover', 'visibility']
).transform_joinaggregate(
     min='min(value)',
     max='max(value)',
     groupby=['key']
).transform_calculate(
    minmax_value=(alt.datum.value-alt.datum.min)/(alt.datum.max-alt.datum.min),
    mid=(alt.datum.min+alt.datum.max)/2
).mark_line().encode(
    x=alt.X('key:N', sort=custom_sort_order),
    y='minmax_value:Q',
    color=alt.value('steelblue'),
    detail='index:N',
    opacity=alt.value(0.5)
).properties(width=500, title='2018')

# Chart for coll_weather_2020
chart_2020 = alt.Chart(coll_weather_2020).transform_window(
    index='count()'
).transform_fold(
    ['temp', 'windspeed', 'collisions', 'humidity', 'cloudcover', 'visibility']
).transform_joinaggregate(
     min='min(value)',
     max='max(value)',
     groupby=['key']
).transform_calculate(
    minmax_value=(alt.datum.value-alt.datum.min)/(alt.datum.max-alt.datum.min),
    mid=(alt.datum.min+alt.datum.max)/2
).mark_line().encode(
    x=alt.X('key:N', sort=custom_sort_order),
    y='minmax_value:Q',
    color=alt.value('#ff7f0e'),
    detail='index:N',
    opacity=alt.value(0.5)
).properties(width=500, title='2020')

# combined_chart = alt.hconcat(chart_2018, chart_2020)
# combined_chart

Here we may see some correlation between low visibility and a high number of collisions. However, since there are very few days with low visibility compared to days with high visibility, we cannot conclude anything.
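
A numeric complement to the visual inspection is to compute a correlation coefficient per year. The following is a minimal sketch, not part of the original analysis: the data frame is a hypothetical stand-in for `coll_weather` with made-up values, and Pearson correlation is just one possible choice of measure.

```python
import pandas as pd

# Hypothetical stand-in for coll_weather: made-up values, same column names.
df = pd.DataFrame({
    "year":       [2018, 2018, 2018, 2018, 2020, 2020, 2020, 2020],
    "collisions": [751, 622, 525, 698, 310, 280, 350, 295],
    "visibility": [11.3, 15.8, 15.6, 15.4, 16.0, 15.9, 12.1, 16.0],
    "temp":       [21.6, 25.1, 17.0, 16.8, 24.0, 26.5, 22.3, 27.1],
})
cols = ["collisions", "visibility", "temp"]

# Pearson correlation of each weather variable with the daily collision count,
# computed separately per year so the 2018/2020 level shift does not dominate.
corr_by_year = pd.DataFrame({
    year: group[cols].corr()["collisions"].drop("collisions")
    for year, group in df.groupby("year")
}).T
print(corr_by_year.round(2))
```

With only a handful of low-visibility days, such coefficients would be as fragile as the visual impression, so they should be read with the same caution.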

So, we will approach the problem with a different type of visualization. We will use small multiples of heatmaps to see whether there is any correlation between pairs of weather variables and the number of accidents.

In [99]:
# alt.Chart(coll_weather).mark_rect().encode(
#     alt.X(alt.repeat("column"), type='ordinal', bin=True),
#     alt.Y(alt.repeat("row"), type='ordinal', bin=True),
#     color='average(collisions):Q'
# ).properties(
#     width=150,
#     height=150
# ).repeat(
#     row=['temp', 'windspeed', 'humidity', 'cloudcover', 'visibility'],
#     column=['temp', 'windspeed', 'humidity', 'cloudcover', 'visibility']
# ).interactive()

With this visualization we still do not see a significant correlation between weather conditions and accidents. We also find two more problems. First, many of the heatmaps have too many white cells, because we have no data for that combination of variables. Second, each heatmap only lets us compare two variables with the number of accidents at a time, so we have to plot many heatmaps to cover all the variables.
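
The number of blank cells in a binned heatmap can be estimated before plotting. The sketch below bins two variables with `pd.cut` and counts the days in each cell with `pd.crosstab`; the six observations and the choice of three bins are assumptions made for the example.

```python
import pandas as pd

# Hypothetical daily observations with the same column names as coll_weather.
df = pd.DataFrame({
    "temp":     [16.8, 17.0, 19.8, 21.6, 25.1, 26.0],
    "humidity": [76.6, 75.0, 60.7, 86.8, 74.0, 55.2],
})

# Bin both variables into three equal-width intervals, as a binned heatmap would.
temp_bin = pd.cut(df["temp"], bins=3)
hum_bin = pd.cut(df["humidity"], bins=3)

# Count the days in each (temp, humidity) cell; a zero is a blank heatmap cell.
grid = pd.crosstab(temp_bin, hum_bin)
empty_cells = int((grid == 0).sum().sum())
print(grid)
print(f"{empty_cells} of {grid.size} cells have no observations")
```

A high share of empty cells suggests that the bins are too fine for the amount of data available.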

We do not think we can obtain significant results with heatmaps, so we will change the type of chart again. We will try a scatter plot to see whether there is any correlation between the variables and the number of collisions. Apart from the two axes, we will use the size, color, and shape of the points to encode more variables. Since color and shape are better suited to categorical variables, we will encode the weather conditions and the year in those, respectively. Furthermore, we will encode the number of collisions on the y-axis because it is the main variable we want to compare with the others. On the x-axis and in the size of the points, however, we will try multiple combinations of variables to see whether any correlation appears.

In [100]:
# alt.Chart(coll_weather).mark_point(opacity = 0.5, filled = True).encode(
#     alt.X('temp:Q', title='Average Daily Temperature (C)', scale=alt.Scale(domain=[15, 31])),
#     #alt.X('windspeed:Q', title='Average Daily Windspeed (km/h)', scale=alt.Scale(domain=[8, 45])),
#     #alt.X('humidity:Q', title='Average Daily Humidity (%)', scale=alt.Scale(domain=[40, 95])),
#     alt.Size('visibility:Q', title='Average Daily Visibility (km)', scale=alt.Scale(domain=[11, 16])),
#     #alt.Size('precip:Q', title='Average Daily Precipitation (mm)', scale=alt.Scale(domain=[0, 50])),
#     alt.Color('conditions', title='Weather Conditions'),
#     alt.Y('collisions', title='Number of Collisions', scale=alt.Scale(domain=[150, 900])),
#     alt.Shape('year:N', title='Year')
# ).properties(
#     width=600,
#     height=400
# )

None of the combinations gives us enough evidence to conclude that there is any correlation between the variables and the number of collisions. However, since the number of collisions is much higher in 2018 than in 2020, we will juxtapose the scatter plots of both years to see if we get any insights.

In [101]:
# Chart for 2018
chart_2018 = alt.Chart(coll_weather_2018).mark_point(opacity=0.5, filled=True).encode(
    alt.X('temp:Q', title='Average Daily Temperature (C)', scale=alt.Scale(domain=[15, 31])),
    alt.Size('visibility:Q', title='Average Daily Visibility (km)', scale=alt.Scale(domain=[11, 16])),
    alt.Color('conditions', title='Weather Conditions', scale=alt.Scale(scheme='set2')),
    alt.Y('collisions', title='Number of Collisions', scale=alt.Scale(domain=[350, 900])),
).properties(
    title='Collisions and Weather Conditions in 2018',
    width=600,
    height=400
)

# Chart for 2020
chart_2020 = alt.Chart(coll_weather_2020).mark_point(opacity=0.5, filled=True).encode(
    alt.X('temp:Q', title='Average Daily Temperature (C)', scale=alt.Scale(domain=[15, 31])),
    alt.Size('visibility:Q', title='Average Daily Visibility (km)', scale=alt.Scale(domain=[11, 16])),
    alt.Color('conditions', title='Weather Conditions', scale=alt.Scale(scheme='set2')),
    alt.Y('collisions', title='Number of Collisions', scale=alt.Scale(domain=[150, 500])),
).properties(
    title='Collisions and Weather Conditions in 2020',
    width=600,
    height=400
)

# Display the charts side by side
# chart_2018 | chart_2020

Despite not getting enough insights, we see that there might be a correlation between the number of collisions and the weather conditions. However, this scatter plot is not the best option for comparing these two variables. We will try a bar chart instead.

In [102]:
# alt.Chart(coll_weather).mark_bar().encode(
#     y=alt.Y('conditions:N', sort='-x', title='Weather Conditions'),
#     x=alt.X('collisions:Q', axis=alt.Axis(title='Number of Collisions'))
# )

With the total number of collisions by weather condition we see a pattern that we did not see in the previous graphs. However, since we are adding up the collisions for each weather condition, and the number of days is not the same for every condition, we cannot conclude anything. To improve this visualization we will compute the average number of collisions per day for each weather condition.

In [103]:
# alt.Chart(coll_weather).mark_bar().encode(
#     y=alt.Y('conditions:N', sort='-x', title='Weather Conditions'),
#     x=alt.X('average_collisions_condition:Q', axis=alt.Axis(title='Average Number of Collisions per Day')),
# ).transform_aggregate(
#     total_days_condition='count()',
#     total_collisions_condiditon='sum(collisions)',
#     groupby=['conditions']
# ).transform_calculate(
#     average_collisions_condition='datum.total_collisions_condiditon / datum.total_days_condition'
# )
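
The aggregation done inside the chart (days per condition, total collisions per condition, and their ratio) can be sketched in pandas as well. The rows below are made up; only the column names follow `coll_weather`, and the summary column names are chosen for this example.

```python
import pandas as pd

# Hypothetical daily rows; only the column names come from coll_weather.
df = pd.DataFrame({
    "conditions": ["Rain, Overcast", "Rain, Overcast", "Clear", "Clear", "Clear",
                   "Partially cloudy"],
    "collisions": [700, 650, 600, 580, 620, 610],
})

# Same quantities as transform_aggregate + transform_calculate in the chart:
# number of days, total collisions, and their ratio per condition.
summary = df.groupby("conditions")["collisions"].agg(
    total_days_condition="count",
    total_collisions_condition="sum",
)
summary["average_collisions_condition"] = (
    summary["total_collisions_condition"] / summary["total_days_condition"]
)
print(summary.sort_values("average_collisions_condition", ascending=False))
```

Keeping the number of days next to the average makes it easier to spot conditions whose average rests on very few observations.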


Now we are starting to get interesting results. We see that the average number of collisions is higher with worse weather conditions and lower with better weather conditions. However, we still cannot conclude anything, because bar charts are not the best option for comparing averages: they show neither the variance nor the distribution of the data. To improve this visualization and see the distribution of the data we will use a violin plot.

In [104]:
# alt.Chart(coll_weather, width=100).transform_density(
#     'collisions',
#     as_=['collisions', 'density'],
#     extent=[0, 1200],
#     groupby=['conditions']
# ).mark_area(orient='horizontal').encode(
#     alt.X('density:Q')
#         .stack('center')
#         .impute(None)
#         .title(None)
#         .axis(labels=False, values=[0], grid=False, ticks=True),
#     alt.Y('collisions:Q'),
#     alt.Color('conditions:N', scale=alt.Scale(scheme='set2')),
#     alt.Column('conditions:N')
#         .spacing(0)
#         .header(titleOrient='bottom', labelOrient='bottom', labelPadding=0)
# ).configure_view(
#     stroke=None
# )

With this visualization we still see that the worse the weather conditions, the more collisions there are. Despite that, we still cannot see the median and quartiles of the data, so we will superimpose a boxplot on the violin plot.

In [105]:
violin_right = (
    alt.Chart(coll_weather, width=100)
    .transform_density(
        "collisions",
        as_=["collisions", "density"],
        extent=[0, 1200],
        groupby=["conditions"]
    )
    .mark_area(orient="horizontal")
    .encode(
        alt.X("density:Q")
            .impute(None)
            .title(None)
            .axis(labels=False, grid=False, ticks=True),
        alt.Y("collisions:Q"),
        alt.Color("conditions:N", scale=alt.Scale(scheme='set2'))
    )
)

violin_left = (
    violin_right
    .copy()
    .transform_calculate(density="-datum.density")
)

boxplot = (
    alt.Chart(coll_weather, width=100)
    .mark_boxplot(outliers=False, size=10, extent=20)
    .encode(y="collisions:Q", color=alt.value("black"))
)

chart = (
    alt.layer(violin_left, violin_right,boxplot)
    .facet(alt.Column("conditions:N"))
    .configure_view(stroke=None)
)
# chart

The graph is getting better. However, each weather-condition violin shows two bulks of data, which correspond to the two years. We have seen in previous graphs that there is a significant difference between the number of collisions in 2018 and 2020, so we will separate the data by year to see each distribution separately.
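
Before plotting, the effect of splitting by year can be checked numerically with the quartiles that a boxplot would draw. This is a sketch on made-up rows; the column names follow `coll_weather`, and the condition labels are simplified.

```python
import pandas as pd

# Hypothetical daily rows; column names follow coll_weather.
df = pd.DataFrame({
    "year":       [2018] * 4 + [2020] * 4,
    "conditions": ["Clear", "Clear", "Rain", "Rain"] * 2,
    "collisions": [600, 640, 700, 720, 300, 320, 310, 330],
})

# Quartiles per (year, condition): the numbers a boxplot would draw.
quartiles = (
    df.groupby(["year", "conditions"])["collisions"]
      .quantile([0.25, 0.5, 0.75])
      .unstack()
)
print(quartiles)
```

Grouping by year as well as by condition keeps the two levels of collisions apart, which is what removes the two bulks seen in the combined violins.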

In [106]:
width = 80

boxplot = alt.Chart().mark_boxplot(color='black').encode(
    alt.Y('collisions:Q')
).properties(width=width)

violin = alt.Chart().transform_density(
    'collisions',
    as_=['collisions', 'density'],
    extent=[0, 1000],
    groupby=['conditions']
).mark_area(orient='horizontal').encode(
    y='collisions:Q',
    color=alt.Color('conditions:N', legend=None, scale=alt.Scale(scheme='set2')),
    x=alt.X(
        'density:Q',
        stack='center',
        impute=None,
        title=None,
        scale=alt.Scale(nice=False, zero=False),
        axis=alt.Axis(labels=False, values=[0], grid=False, ticks=True),
    ),
).properties(
    width=width,
    height=300
)

def facet(data, title):
    """Layer the violin and boxplot on `data`, facet by weather condition, and add a title."""
    return (
        alt.layer(violin, boxplot, data=data)
        .facet(column='conditions:N')
        .resolve_scale(x=alt.ResolveMode("independent"))
        .properties(title=alt.TitleParams(text=title, anchor="middle", align="center"))
    )

# alt.hconcat(facet(coll_weather_2018, "Summer 2018"),facet(coll_weather_2020, "Summer 2020")).configure_facet(
#     spacing=0,
# ).configure_header(
#     titleOrient='bottom',
#     labelOrient='bottom'
# ).configure_view(
#     stroke=None
# ).properties(
#     title='Collisions and Weather Conditions in 2018 and 2020',
# ).configure_title(
#       anchor='middle', offset=25, fontSize=18, fontStyle='normal', fontWeight='normal'
# )

We have finally reached the final version of the graph. This visualization makes it easy to identify the distribution of the number of collisions by weather condition. We see that, in 2018, the median number of collisions is higher with rain, but among the other weather conditions there is no significant difference in the number of collisions. In 2020 we do not see enough evidence to conclude that any type of weather condition is more likely to cause an accident. This probably happens because in 2020 there was a lockdown and people were not driving as much as in 2018, so there were fewer cars on the road and the weather conditions did not have as much effect as in 2018. Furthermore, we see that the distribution of the data is wider in 2020 than in 2018, which means that there is more variance in the number of collisions.

We have ended up choosing this visualization despite not showing the quantitative weather variables, such as temperature or precipitation, because the weather conditions summarize all these variables in a single categorical one.

*Is there a correlation between weather conditions and accidents?*

In chart C5 you see a violin plot with the distribution of the number of accidents per day for each weather condition and year. We use a different color for each year to compare them. By looking at the median of the boxplot, you can see that in 2018 this median (and the overall distribution of collisions) is higher with rain, but among the other weather conditions there is no significant difference in the number of collisions. In 2020 we do not see enough evidence to conclude that any type of weather condition is more likely to cause an accident.

In [107]:
# bonus chart

gaussian_jitter = alt.Chart(coll_weather, title='Normally distributed jitter').mark_circle(size=20).encode(
    y="conditions:N",
    x="collisions:Q",
    yOffset="jitter:Q",
    color=alt.Color('conditions:N').legend(None),
    shape=alt.Shape('year:N')
).transform_calculate(
    # Generate Gaussian jitter with a Box-Muller transform
    jitter="sqrt(-2*log(random()))*cos(2*PI*random())"
).properties(
    width=300, height=200
)

uniform_jitter = gaussian_jitter.transform_calculate(
    # Generate uniform jitter
    jitter='random()'
).encode(
    alt.Y('conditions:N').axis(None)
).properties(
    title='Uniformly distributed jitter',
    width=300, height=200,
)

# (gaussian_jitter | uniform_jitter).resolve_scale(yOffset='independent')
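
The jitter expression in the cell above is a Box-Muller transform: two uniform random numbers are combined into one draw from a standard normal distribution. A minimal sketch of the same formula in plain Python is shown below; the seed and the sample size are arbitrary choices, and `gaussian_jitter_draw` is a name introduced only for this example.

```python
import math
import random
import statistics


def gaussian_jitter_draw(rng: random.Random) -> float:
    """One standard-normal draw via the Box-Muller transform."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so log(u1) is always defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


rng = random.Random(0)  # fixed seed, chosen arbitrarily for repeatability
samples = [gaussian_jitter_draw(rng) for _ in range(20_000)]

# For a standard normal, the sample mean should be near 0 and the spread near 1.
sample_mean = statistics.fmean(samples)
sample_std = statistics.pstdev(samples)
print(round(sample_mean, 2), round(sample_std, 2))
```

Because most normal draws fall close to zero, Gaussian jitter concentrates points around each category's center line, whereas uniform jitter spreads them evenly across the band.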

### What is the annual fatality count in accidents in New York, and how does that total break down by user type, including pedestrians, cyclists, and motorists?

In consideration of the available data, we have found it beneficial to analyze the information on the number of fatalities in traffic accidents in New York over the years at our disposal. In this way, we aim to determine the annual count of fatal accidents and the respective distribution of these fatalities among pedestrians, cyclists, and motorists.

In [108]:
deadly_accidents = collisions[['CRASH_DATE', 'TOTAL_KILLED', 'PEDESTRIANS_KILLED', 'CYCLIST_KILLED', 'MOTORIST_KILLED']]

First, to ensure consistency, we keep only the rows in which the total number of deaths equals the sum of the deaths across the different user types.

In [109]:
deadly_accidents = deadly_accidents[deadly_accidents['TOTAL_KILLED'] == deadly_accidents['PEDESTRIANS_KILLED'] + \
                                                                        deadly_accidents['CYCLIST_KILLED'] + \
                                                                        deadly_accidents['MOTORIST_KILLED']]

print(f'Rows eliminated: {len(collisions) - len(deadly_accidents)}')

Rows eliminated: 5
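
Before discarding the inconsistent rows, it can be useful to look at them. The sketch below builds the same kind of boolean mask on a few made-up rows; only the column names match the collisions dataset.

```python
import pandas as pd

# Hypothetical rows; the column names match the collisions dataset.
df = pd.DataFrame({
    "TOTAL_KILLED":       [0, 1, 2, 1],
    "PEDESTRIANS_KILLED": [0, 1, 1, 0],
    "CYCLIST_KILLED":     [0, 0, 0, 0],
    "MOTORIST_KILLED":    [0, 0, 0, 1],
})

parts = ["PEDESTRIANS_KILLED", "CYCLIST_KILLED", "MOTORIST_KILLED"]
is_consistent = df["TOTAL_KILLED"] == df[parts].sum(axis=1)

inconsistent = df[~is_consistent]  # rows worth inspecting before dropping them
consistent = df[is_consistent]
print(f"Rows eliminated: {len(inconsistent)}")
print(inconsistent)
```

A row with a missing `TOTAL_KILLED` also fails the comparison, so it would be removed together with the rows whose totals genuinely disagree.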


In [110]:
deadly_accidents = deadly_accidents.drop(columns=['TOTAL_KILLED'])
deadly_accidents['CRASH_DATE'] = pd.to_datetime(deadly_accidents['CRASH_DATE'])
deadly_accidents['year'] = deadly_accidents['CRASH_DATE'].dt.year
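# select the columns to add up before calling sum(); its first positional argument is numeric_only, not a column list
deadly_accidents = deadly_accidents.groupby('year')[['PEDESTRIANS_KILLED', 'CYCLIST_KILLED', 'MOTORIST_KILLED']].sum().reset_index()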
deadly_accidents.head()

Unnamed: 0,year,PEDESTRIANS_KILLED,CYCLIST_KILLED,MOTORIST_KILLED
0,2018,40,5,43
1,2020,38,15,61


In [111]:
deadly_accidents_melted = deadly_accidents.melt('year', var_name='type', value_name='killed')
deadly_accidents_melted['type'] = deadly_accidents_melted['type'].apply(lambda x: x.split('_')[0].lower())
deadly_accidents_melted = deadly_accidents_melted.sort_values(by=['year', 'killed'], ascending=False).reset_index(drop=True)
deadly_accidents_melted

Unnamed: 0,year,type,killed
0,2020,motorist,61
1,2020,pedestrians,38
2,2020,cyclist,15
3,2018,motorist,43
4,2018,pedestrians,40
5,2018,cyclist,5


We consider that a bar chart can encode the information we need optimally. We envision a bar chart where the number of deaths is on the y-axis, and the year is on the x-axis. We find it appropriate to have the year on the x-axis because this follows the standard practice in data visualization and makes it easier for viewers to interpret the chart. We will use a categorical palette to differentiate effectively between the various categories. 

In [112]:
mortal_collisions = alt.Chart(deadly_accidents_melted).mark_bar().encode(
    x=alt.X('year:O', title='Year'),
    y=alt.Y('sum(killed):Q', title='Number of deaths'),
    color=alt.Color('type:N', scale=alt.Scale(scheme='set2'), legend=alt.Legend(title='Type of user')),
    order=alt.Order('type', sort='ascending'),
)

# mortal_collisions

We believe that comparing the total number of deaths between years is feasible, but comparing the number of deaths by user type is less accurate, even within the same year. Therefore, we find it beneficial to add labels indicating the number of deaths per user type. This addition will facilitate efficient comparisons between the different user categories.

In [113]:
# adjust the position of the text labels manually
deadly_accidents_melted['position'] = [61+15, 61+38+15, 15, 43+5, 43+40+5, 5]

number_of_deaths = alt.Chart(deadly_accidents_melted).mark_text(color='white', dy=7).encode(
    x=alt.X('year:O', title='Year'),
    y=alt.Y('position:Q', title='Number of deaths'),
    text=alt.Text('killed:Q', format='.0f')
)

# (mortal_collisions + number_of_deaths).properties(
#      title='Number of deaths by type of user and year'
# ).configure_title(
#   anchor='middle', offset=25, fontSize=16, fontStyle='normal', fontWeight='normal'
# )
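
The label positions typed by hand above are the cumulative heights of the stacked segments, so they can also be derived from the data. The following sketch assumes the stacking order used in the bar chart (ascending `type` within each year) and reproduces the manual list `[76, 114, 15, 48, 88, 5]`.

```python
import pandas as pd

# Values copied from deadly_accidents_melted above.
df = pd.DataFrame({
    "year":   [2020, 2020, 2020, 2018, 2018, 2018],
    "type":   ["motorist", "pedestrians", "cyclist"] * 2,
    "killed": [61, 38, 15, 43, 40, 5],
})

# Bars are stacked in ascending order of 'type', so sort the same way and
# accumulate within each year; each label then sits at the top of its segment.
stack_order = df.sort_values(["year", "type"])
df["position"] = stack_order.groupby("year")["killed"].cumsum()
print(df["position"].tolist())
```

The assignment aligns on the row index, so the positions land on the right rows even though the cumulative sum is computed in a different order. Deriving them this way avoids editing the list by hand if the data change.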

## Answers to Questions

### Are accidents more frequent during weekdays or weekends? Is there any difference between before COVID-19 and after?

In the top-left chart, orange bars represent data from the year 2018, while blue bars represent data from 2020 (before and after COVID, respectively). Analyzing the length of the bars reveals a consistent trend: on all days of the week, the total number of accidents occurring on each day is considerably higher (more than double in all cases) before COVID compared to after.

Furthermore, in the right slope chart, where colors correspond to those in the paired bar chart, a decreasing trend is evident in the number of accidents on weekdays versus weekends. Weekends consistently exhibit a lower average number of accidents both before and after COVID. Notably, this difference intensifies before COVID, indicating a more pronounced contrast. However, in 2020, the difference is not as significant.

From this graph, we can infer that the number of traffic accidents decreased post-COVID, and that accidents on weekends followed a similar trend, although the drop from weekdays to weekends is smaller in 2020. One possible explanation is the reduced use of both public and private transportation on weekends due to decreased overall activity (work, school, etc.).

### Is there any type of vehicle more prone to participate in accidents?

In the middle-right chart, we can observe the number of accidents involving different types of vehicles in both 2018 and 2020 (from June to September). In this case, we cannot directly answer the question about which vehicles are more prone to accidents, as it would require knowledge of the number of vehicles of each type in New York (or circulating in New York). Quantifying this is a challenging task since, for example, although we see that Sedans have the highest number of accidents, there are likely also many more Sedan-type cars on the roads in New York. The same may be true for SUVs. By examining the length of the bars or directly looking at the numbers at the end of each bar, we can note that sedans and SUVs are the vehicle types most frequently involved in accidents, overshadowing other vehicles such as taxis, pickups, bicycles, etc.

Despite the difficulty in quantifying the total number of each vehicle type, what we can conclude is that Sedans and SUVs mentioned earlier contribute to a significant percentage of accidents in New York. By observing that the bars are well to the right of the red bar (mean accidents), it's evident that these values are well above the average accidents per vehicle. It's also noteworthy that taxis, although likely fewer in number, still have a notable frequency of accidents.

### At what time of the day are accidents more common?

In the middle-left, you can see a line chart with the average number of accidents per hour over the summers of 2018 and 2020. Each year has its own color, and the line thickness encodes the number of people killed. Regarding the question of what time of day accidents are more common, a clear trend emerges: higher collision rates during the day and lower rates during the night. This pattern aligns with the increased presence of cars on the road during daylight hours and decreased activity during nighttime. Further, we can distinguish different patterns between morning, afternoon, and evening periods. Mornings exhibit fewer collisions, likely attributed to work-related activities, whereas afternoons register more incidents, potentially linked to leisure activities and transporting children to extracurricular activities. Evenings witness a decline in collisions as people conclude their activities and return home.

Furthermore, by looking at the line thickness we can see that the deadliest hours are 20:00 and 04:00 in 2018, and between 19:00 and 00:00, as well as 04:00, in 2020. The deaths in the late night coincide with the times when people are returning home after socializing, often under the influence of alcohol, which makes the accidents more dangerous. However, this is difficult to see because one of the drawbacks of this visualization is that line thickness is not easy to compare. Another criticism is that the error bars make the patterns difficult to read. Despite that, it is important to keep them in the visualization because they show the variance of the data.

### Are there any areas with a larger number of accidents?

In the map in the top-right chart, we can observe areas with a higher concentration of accidents per square kilometer. This is evident by examining the intensity of the blue color, where darker shades of blue indicate a higher number of accidents per km². Knowing this, we can see that areas with the highest accident ratios are located in the southern part of Manhattan, featuring zones with a significant number of accidents. It's also noteworthy that there are dark blue areas in the center of Brooklyn, indicating a high number of accidents. The same pattern occurs in some southwestern parts of the Bronx and a specific postal code area in Queens, situated in the northwest. In Staten Island, the color intensity is very low, indicating a low number of accidents per km² throughout the borough.

In the bar chart to the right of the map, we can interpret, based on the length of the bars, that Manhattan has by far the highest number of accidents per km², which aligns with our observation on the map. Following Manhattan, there are relatively high accident rates in Brooklyn, the Bronx, Queens, and, lastly, as observed earlier, Staten Island, with significantly fewer accidents compared to the others.

### Is there a correlation between weather conditions and accidents?

In the bottom, we can see two violin plots juxtaposed, one for each year (2018 and 2020). The x-axis represents the weather conditions, and the y-axis represents the number of accidents per day. The width of the violin shows the distribution of the data, and the boxplot inside the violin shows the median and quartiles of the data.

The first thing we can observe is that the distribution of the data is wider in 2020 than in 2018, which means that there is more variance in the number of collisions. Furthermore, we see that, in 2018, the median number of collisions is higher with rain, but among the other weather conditions there is no significant difference in the number of collisions. In 2020 we do not see enough evidence to conclude that any type of weather condition is more likely to cause an accident. This probably happens because in 2020 there was a lockdown and people were not driving as much as in 2018, so there were fewer cars on the road and the weather conditions did not have as much effect as in 2018. In general, we conclude that despite having some evidence, it is not enough to claim that there is a correlation between weather conditions and accidents. If there were, we would be able to see it in the distribution of the data, which would be more concentrated in some weather conditions than in others.

The drawback of this visualization is that we have not encoded any quantitative weather variable: there are many of them, and no plot is good enough to show them all together easily. For that reason we ended up using the weather conditions as a categorical variable that summarizes all the quantitative weather variables.

### What is the annual fatality count in accidents in New York, and how does that total break down by user type, including pedestrians, cyclists, and motorists?

In the bottom-left chart, we can observe the number of fatalities in accidents for each year (summer only). Looking at the length of the first bar, we can see that in summer 2018 there were 88 fatalities, and in summer 2020 there were 114, an increase of 26. We notice that fatal accidents constitute a very small percentage of the total accidents, indicating that accidents typically do not result in fatalities. The increase is surprising given the top-left chart: there were many more accidents in 2018 than in 2020, yet those in 2020 were more lethal.

Examining the numbers within each color of the bar chart allows us to compare the number of fatalities each year based on the type of user. We observe that the most significant difference is in the number of motorist deaths, which has increased by 18.