# Aaron Ma

### Environmental and road conditions

Environmental and road conditions play a key part in the probability of an accident occurring. Knowing the weather conditions, road type, visibility, and whether the setting is urban or rural can inform road maintenance and infrastructure planning, allowing policymakers to make better-informed decisions that lower the rate of road accidents.

#### Key research questions
- Are certain weather conditions more likely to result in multi-vehicle accidents?
- How does visibility level impact pedestrian or cyclist involvement differently in different levels of traffic volume? 
- How do seasonal changes impact the frequency of accidents?

## EDA

### Imports

In [1]:
import sys
import os

sys.path.append(os.path.abspath("../../code"))

import altair as alt
import pandas as pd

from toolz.curried import pipe

def json_dir(data, data_dir='altairdata'):
    os.makedirs(data_dir, exist_ok=True)
    return pipe(data, alt.to_json(filename=data_dir + '/{prefix}-{hash}.{extension}') )

# Register and enable the new transformer
alt.data_transformers.register('json_dir', json_dir)
alt.data_transformers.enable('json_dir')

# Handle large data sets (default shows only 5000)
# See here: https://altair-viz.github.io/user_guide/data_transformers.html
alt.data_transformers.disable_max_rows()

alt.renderers.enable('jupyterlab')

RendererRegistry.enable('jupyterlab')

In [2]:
from cleaning_workflows import prepare_dataset

### Loading in the data

In [3]:
data = pd.read_csv('../../data/raw/Airbnb_Open_Data.csv', parse_dates=['last review'])
data.head()


Unnamed: 0,id,NAME,host id,host_identity_verified,host name,neighbourhood group,neighbourhood,lat,long,country,...,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365,house_rules,license
0,1001254,Clean & quiet apt home by the park,80014485718,unconfirmed,Madaline,Brooklyn,Kensington,40.64749,-73.97237,United States,...,$193,10.0,9.0,2021-10-19,0.21,4.0,6.0,286.0,Clean up and treat the home the way you'd like...,
1,1002102,Skylit Midtown Castle,52335172823,verified,Jenna,Manhattan,Midtown,40.75362,-73.98377,United States,...,$28,30.0,45.0,2022-05-21,0.38,4.0,2.0,228.0,Pet friendly but please confirm with me if the...,
2,1002403,THE VILLAGE OF HARLEM....NEW YORK !,78829239556,,Elise,Manhattan,Harlem,40.80902,-73.9419,United States,...,$124,3.0,0.0,NaT,,5.0,1.0,352.0,"I encourage you to use my kitchen, cooking and...",
3,1002755,,85098326012,unconfirmed,Garry,Brooklyn,Clinton Hill,40.68514,-73.95976,United States,...,$74,30.0,270.0,2019-07-05,4.64,4.0,1.0,322.0,,
4,1003689,Entire Apt: Spacious Studio/Loft by central park,92037596077,verified,Lyndon,Manhattan,East Harlem,40.79851,-73.94399,United States,...,$41,10.0,9.0,2018-11-19,0.1,3.0,1.0,289.0,"Please no smoking in the house, porch or on th...",


In [4]:
data = prepare_dataset(data)


In [5]:
print(f'Dataset shape: \n{data.shape}')
print(f'Dataset columns: \n{data.columns}')
data.info()

Dataset shape: 
(102599, 26)
Dataset columns: 
Index(['id', 'NAME', 'host id', 'host_identity_verified', 'host name',
       'neighbourhood group', 'neighbourhood', 'lat', 'long', 'country',
       'country code', 'instant_bookable', 'cancellation_policy', 'room type',
       'Construction year', 'price', 'service fee', 'minimum nights',
       'number of reviews', 'last review', 'reviews per month',
       'review rate number', 'calculated host listings count',
       'availability 365', 'house_rules', 'license'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102599 entries, 0 to 102598
Data columns (total 26 columns):
 #   Column                          Non-Null Count   Dtype         
---  ------                          --------------   -----         
 0   id                              102599 non-null  int64         
 1   NAME                            102599 non-null  object        
 2   host id                         102599 non-null  int64         
 

In [6]:
data.describe().drop(columns=['id', 'host id'])

Unnamed: 0,lat,long,Construction year,price,service fee,minimum nights,number of reviews,last review,reviews per month,review rate number,calculated host listings count,availability 365
count,102599.0,102599.0,102599.0,102599.0,102599.0,102599.0,102599.0,102599,102599.0,102599.0,102599.0,102599.0
mean,40.72484,-73.943956,2008.287751,623.785778,124.691586,8.103412,27.434722,2031-10-08 07:57:06.805329920,1.0066,3.268687,7.90882,140.512627
min,-1.0,-74.24984,-1.0,-1.0,-1.0,-1223.0,0.0,2012-07-11 00:00:00,-1.0,0.0,-1.0,-10.0
25%,40.68873,-73.98258,2007.0,337.0,67.0,1.0,1.0,2019-01-02 00:00:00,0.09,2.0,1.0,2.0
50%,40.72229,-73.95444,2012.0,623.0,124.0,3.0,7.0,2019-06-23 00:00:00,0.48,3.0,1.0,95.0
75%,40.76276,-73.93235,2017.0,912.0,182.0,5.0,30.0,2022-01-02 00:00:00,1.71,4.0,2.0,268.0
max,40.91697,-1.0,2022.0,1200.0,240.0,5645.0,1024.0,2099-01-01 00:00:00,90.0,5.0,332.0,3677.0
std,0.372667,0.646043,92.041761,332.690937,66.554848,30.497129,49.478373,,1.820937,1.295823,32.172501,135.46357


In [7]:
data.describe(include=['object']).drop(columns=['NAME', 'host name'])

Unnamed: 0,host_identity_verified,neighbourhood group,neighbourhood,country,country code,instant_bookable,cancellation_policy,room type,house_rules,license
count,102599,102599,102599,102599,102599,102599,102599,102599,102599,102599
unique,3,6,225,2,2,3,4,4,1977,2
top,unconfirmed,Manhattan,Bedford-Stuyvesant,United States,US,False,moderate,Entire home/apt,Unknown,Unknown
freq,51200,43793,7937,102067,102468,51474,34343,53701,52131,102597


In [8]:
data.isna().sum()

id                                0
NAME                              0
host id                           0
host_identity_verified            0
host name                         0
neighbourhood group               0
neighbourhood                     0
lat                               0
long                              0
country                           0
country code                      0
instant_bookable                  0
cancellation_policy               0
room type                         0
Construction year                 0
price                             0
service fee                       0
minimum nights                    0
number of reviews                 0
last review                       0
reviews per month                 0
review rate number                0
calculated host listings count    0
availability 365                  0
house_rules                       0
license                           0
dtype: int64

In [10]:
stacked_bar = alt.Chart(data).mark_bar().encode(
    x="count():Q",
    y=alt.Y("neighbourhood group:N", sort="-x"),
    color="neighbourhood group:N",
    tooltip=["count():Q"],
).properties(title="AirBnB neighbourhood count", height=200, width=400)

stacked_bar

[Figure: horizontal bar chart "AirBnB neighbourhood count", showing the number of listings per neighbourhood group, sorted by count.]


Based on the graph above, there is little to no difference in either the inter-road-type or the intra-road-type comparisons. The number of vehicles involved stays roughly the same, at around 2000 per weather condition.

In [46]:
# heatmap = alt.Chart(data).mark_rect().encode(
#         x=alt.X("price:Q"),
#         y=alt.Y("price:Q"), color = 'price:Q').properties(title='Pedestrian Road Accidents Involvement by Visibility Level and Traffic Volume')

# data['price'] = data['price'].str.split("$", expand=True)[1].str.replace(",", "")

In the faceted chart above, there appear to be only marginal differences in the number of accidents across causes and seasons. Looking within each season (on an axis domain of 1000), the top cause of accidents varies by season: for example, Speeding in Fall versus Drunk Driving in Spring.

In [47]:
# pedestrian_heatmap = alt.Chart(accidents).mark_rect().encode(
#         x=alt.X("Visibility Level:Q", title="Visibility Level"),
#         y=alt.Y("Traffic Volume:Q", title="Traffic Volume"),
#         color='sum(Pedestrians Involved):Q').properties(title='Pedestrian Road Accidents Involvement by Visibility Level and Traffic Volume')

# cyclist_heatmap = alt.Chart(accidents).mark_rect().encode(
#         x=alt.X("Visibility Level:Q", title="Visibility Level"),
#         y=alt.Y("Traffic Volume:Q", title="Traffic Volume"),
#         color='sum(Cyclists Involved):Q').properties(title='Cyclist Road Accidents Involvement by Visibility Level and Traffic Volume')

# pedestrian_heatmap | cyclist_heatmap

Based on the heatmaps above, the pedestrian heatmap shows a clear center of highest pedestrian involvement between 250 and 300 visibility and 600 to 9000 traffic volume. There are also outliers: from a visibility level of 450 onwards, regardless of traffic volume, the number of pedestrians involved in an accident is 2.
For cyclist involvement, however, at low to medium visibility levels (0-250) the number of cyclists involved is at its maximum of 2, regardless of traffic volume. There is also an outlier at higher visibility levels, from 400 to 450, where the number of cyclists involved is likewise at its maximum.

Overall, the pedestrian heatmap shows a much clearer pattern among the three variables, while the cyclist heatmap is more diffuse, as seen in its large areas of dark blue.

## Task Analysis

### **1. Are certain weather conditions more likely to result in multi-vehicle accidents?**
- **Retrieve Value**: Extract `Weather Conditions`, `Number of Vehicles Involved`, and `Road Type`
- **Group**: Groupby `Weather Conditions` and `Road Type`
- **Aggregate**: Calculate the average of `Number of Vehicles Involved` per group
- **Analyze**: Analyze relationships between groups
- **Visualize**: Visualize different groups
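
The retrieve, group, and aggregate steps above can be sketched in pandas as follows. This is a minimal illustration only: the `accidents` frame and its values are made up, the column names are taken from the task list, and the "share of accidents with two or more vehicles" measure is an added assumption about how "multi-vehicle" might be quantified.

```python
import pandas as pd

# Toy records; column names follow the task list, values are invented for illustration.
accidents = pd.DataFrame({
    "Weather Conditions": ["Rainy", "Rainy", "Clear", "Clear", "Snowy", "Snowy"],
    "Road Type": ["Highway", "Street", "Highway", "Street", "Highway", "Highway"],
    "Number of Vehicles Involved": [3, 1, 2, 2, 4, 2],
})

# Group by weather and road type, then average the number of vehicles per group.
avg_vehicles = (
    accidents
    .groupby(["Weather Conditions", "Road Type"], as_index=False)["Number of Vehicles Involved"]
    .mean()
    .rename(columns={"Number of Vehicles Involved": "avg_vehicles"})
)

# Assumed extra measure: share of accidents with two or more vehicles per weather condition.
multi_share = (
    accidents
    .assign(multi=accidents["Number of Vehicles Involved"] >= 2)
    .groupby("Weather Conditions")["multi"]
    .mean()
)

print(avg_vehicles)
print(multi_share)
```

The long-form `avg_vehicles` frame has one row per (weather, road type) pair, which is the shape a grouped or faceted bar chart would expect.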

---

### **2. How does visibility level impact pedestrian or cyclist involvement differently in different levels of traffic volume?**
- **Retrieve Value**: Extract `Visibility Level`, `Pedestrians Involved`, `Cyclists Involved`, and `Traffic Volume`
- **Group**: Separate by pedestrian or cyclist involvement with `Visibility Level` and `Traffic Volume`
- **Aggregate**: Calculate the average number of pedestrians/cyclists involved at each level of traffic volume and visibility level
- **Analyze**: Analyze relationships between groups
- **Visualize**: Visualize different groups and juxtapose pedestrian and cyclist representations
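
One way to carry out the grouping and aggregation above is to bin the two continuous variables first, so that each group covers a range rather than a single value. The sketch below assumes made-up data and arbitrary bin edges (`[0, 250, 500]` for visibility, `[0, 1000, 2000]` for traffic volume); neither comes from the original analysis.

```python
import pandas as pd

# Toy records; column names follow the task list, values are invented for illustration.
accidents = pd.DataFrame({
    "Visibility Level": [50, 120, 260, 280, 310, 470],
    "Traffic Volume": [500, 800, 700, 750, 1500, 300],
    "Pedestrians Involved": [0, 1, 2, 2, 1, 2],
    "Cyclists Involved": [2, 2, 1, 0, 1, 0],
})

# Bin both continuous variables (bin edges and labels are arbitrary assumptions).
accidents["visibility_bin"] = pd.cut(
    accidents["Visibility Level"], bins=[0, 250, 500], labels=["low", "high"]
)
accidents["traffic_bin"] = pd.cut(
    accidents["Traffic Volume"], bins=[0, 1000, 2000], labels=["low", "high"]
)

# Average pedestrian and cyclist involvement per (visibility, traffic) cell.
# observed=True keeps only the bin combinations that actually occur in the data.
involvement = (
    accidents
    .groupby(["visibility_bin", "traffic_bin"], observed=True)[
        ["Pedestrians Involved", "Cyclists Involved"]
    ]
    .mean()
    .reset_index()
)

print(involvement)
```

Keeping the pedestrian and cyclist averages side by side in one frame makes it straightforward to juxtapose the two in a pair of heatmaps.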

---

### **3. How do seasonal changes impact the frequency of accidents?**
- **Retrieve Value**: Extract `Month` and `Accident Cause`
- **Create**: Create new data from `Month`, separating into `Season` by 3 month groups
- **Group**: Group by `Season`
- **Aggregate**: Calculate the average number of accidents per group (season)
- **Analyze**: Analyze relationships between groups
- **Visualize**: Visualize different groups and facet seasonal representation
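
The `Month` to `Season` derivation and the per-season counts above could look like the sketch below. It assumes that `Month` holds full English month names and that seasons follow the meteorological convention (December to February is Winter, and so on); both are assumptions, and the data is made up.

```python
import pandas as pd

# Toy records; column names follow the task list, values are invented for illustration.
accidents = pd.DataFrame({
    "Month": ["January", "March", "April", "July", "July", "October", "December"],
    "Accident Cause": [
        "Speeding", "Drunk Driving", "Drunk Driving",
        "Speeding", "Weather", "Speeding", "Weather",
    ],
})

# Assumed mapping: meteorological seasons in three-month groups.
month_to_season = {
    "December": "Winter", "January": "Winter", "February": "Winter",
    "March": "Spring", "April": "Spring", "May": "Spring",
    "June": "Summer", "July": "Summer", "August": "Summer",
    "September": "Fall", "October": "Fall", "November": "Fall",
}
accidents["Season"] = accidents["Month"].map(month_to_season)

# Number of accidents per season.
accidents_per_season = accidents.groupby("Season").size()

# Number of accidents per season and cause, in long form for a faceted chart.
cause_counts = (
    accidents
    .groupby(["Season", "Accident Cause"])
    .size()
    .rename("n_accidents")
    .reset_index()
)

print(accidents_per_season)
print(cause_counts)
```

Any month name missing from the mapping would become `NaN` in `Season`, so checking `accidents["Season"].isna().sum()` after the `map` call is a cheap sanity test.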

---

In [8]:
new_data = pd.read_csv('../../data/raw/listings.csv', parse_dates=['first_review', 'last_review'])

new_data.head()

Unnamed: 0,id,listing_url,scrape_id,last_scraped,source,name,description,neighborhood_overview,picture_url,host_id,...,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,39572,https://www.airbnb.com/rooms/39572,20250103033441,2025-01-03,city scrape,1 br in a 2 br apt (Midtown West),,,https://a0.muscache.com/pictures/fd1bffd9-ccf8...,169927,...,5.0,4.98,4.86,,f,2,1,1,0,0.25
1,39593,https://www.airbnb.com/rooms/39593,20250103033441,2025-01-03,city scrape,A lovely room w/ a Manhattan view,"A private, furnished large room to rent Jan/F...","Nate Silver called this super safe, clean, qui...",https://a0.muscache.com/pictures/0b9110f7-3b24...,110506,...,4.96,4.79,4.93,,f,1,0,1,0,0.2
2,39704,https://www.airbnb.com/rooms/39704,20250103033441,2025-01-03,previous scrape,"Private, Large & Sunny 1BR w/W&D",It's a No Brainer:<br />•Terrific Space For Le...,The Neighborhood<br />• Rich History <br />• B...,https://a0.muscache.com/pictures/0bc4e8a4-c047...,170510,...,4.92,4.38,4.72,,f,2,2,0,0,1.93
3,42300,https://www.airbnb.com/rooms/42300,20250103033441,2025-01-03,city scrape,Beautiful Lower East Side Loft,Architect-owned loft is a corner unit in a bea...,"The apartment is in the border of Soho, LES an...",https://a0.muscache.com/pictures/0e285e13-ee14...,184755,...,4.87,4.57,4.62,,f,1,1,0,0,0.4
4,42729,https://www.airbnb.com/rooms/42729,20250103033441,2025-01-03,city scrape,@HouseOnHenrySt - Private 2nd bedroom w/shared...,,"Lovely old Brooklyn neighborhood, with brick/b...",https://a0.muscache.com/pictures/925fe213-f5e1...,11481,...,4.73,4.58,4.64,,f,4,1,3,0,1.26


In [9]:
new_data = new_data.drop(columns=['listing_url', 'calendar_updated','scrape_id', 'last_scraped', 'source', 'picture_url', 'host_url', 'host_about', 'host_thumbnail_url', 'host_picture_url', 'host_has_profile_pic', 'neighbourhood', 'bathrooms_text', 'calendar_last_scraped', 'license'])

In [11]:
print(new_data.shape)
print(new_data.columns)


(37784, 60)
Index(['id', 'name', 'description', 'neighborhood_overview', 'host_id',
       'host_name', 'host_since', 'host_location', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_identity_verified', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'latitude', 'longitude',
       'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms',
       'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights',
       'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'number_of_reviews', 'number_of_reviews_ltm',
       'number_of_reviews_l30d', 'first_review', 'l

In [12]:
new_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37784 entries, 0 to 37783
Data columns (total 60 columns):
 #   Column                                        Non-Null Count  Dtype         
---  ------                                        --------------  -----         
 0   id                                            37784 non-null  int64         
 1   name                                          37782 non-null  object        
 2   description                                   36811 non-null  object        
 3   neighborhood_overview                         20607 non-null  object        
 4   host_id                                       37784 non-null  int64         
 5   host_name                                     37778 non-null  object        
 6   host_since                                    37778 non-null  object        
 7   host_location                                 29742 non-null  object        
 8   host_response_time                            22199 non-null  obje

In [14]:
numeric_columns = new_data.select_dtypes(include=['int64', 'float64']).columns
print(numeric_columns)

Index(['id', 'host_id', 'host_listings_count', 'host_total_listings_count',
       'latitude', 'longitude', 'accommodates', 'bathrooms', 'bedrooms',
       'beds', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'availability_30', 'availability_60',
       'availability_90', 'availability_365', 'number_of_reviews',
       'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_accuracy',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per_month'],
      dtype='object')


In [22]:
object_columns = new_data.select_dtypes(include=['object']).columns
print(object_columns)

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_since', 'host_location', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_neighbourhood', 'host_verifications', 'host_identity_verified',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed',
       'property_type', 'room_type', 'amenities', 'has_availability',
       'instant_bookable'],
      dtype='object')


In [24]:
# Strip the trailing '%' and coerce non-numeric entries such as 'N/A' to NaN, then scale to [0, 1]
new_data['host_response_rate'] = pd.to_numeric(new_data['host_response_rate'].str.rstrip('%'), errors='coerce') / 100
new_data['host_acceptance_rate'] = pd.to_numeric(new_data['host_acceptance_rate'].str.rstrip('%'), errors='coerce') / 100
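
The percent-string conversion in the cell above can be checked on a small toy Series. This sketch is an illustration rather than the notebook's exact code: the sample values are made up, and it relies on `errors="coerce"` so that any entry that is not a number (here `"N/A"`) becomes `NaN` instead of raising.

```python
import pandas as pd

# Made-up sample of rate strings, including a placeholder and a missing value.
rates = pd.Series(["100%", "85%", "N/A", None, "0%"])

# Drop the trailing percent sign, parse to numbers, and rescale to the 0-1 range.
as_fraction = pd.to_numeric(rates.str.rstrip("%"), errors="coerce") / 100

print(as_fraction)
```

Both the `"N/A"` placeholder and the missing value end up as `NaN`, so `describe()` counts only the parsed rates.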

In [25]:
new_data.describe().drop(columns=['id', 'latitude', 'longitude', 'host_id'])

Unnamed: 0,host_response_rate,host_acceptance_rate,host_listings_count,host_total_listings_count,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,...,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
count,22199.0,22912.0,37778.0,37778.0,37784.0,22985.0,31975.0,22850.0,22969.0,37784.0,...,25890.0,25876.0,25885.0,25873.0,25874.0,37784.0,37784.0,37784.0,37784.0,25892.0
mean,0.919165,0.769278,263.096326,351.484568,2.754896,1.196693,1.380172,1.63488,195.224128,28.882172,...,4.65604,4.833322,4.825059,4.741406,4.637652,71.636354,45.435555,23.985232,0.005187,0.866954
min,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,8.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.01
25%,0.98,0.63,1.0,1.0,2.0,1.0,1.0,1.0,82.0,30.0,...,4.53,4.82,4.82,4.65,4.52,1.0,0.0,0.0,0.0,0.09
50%,1.0,0.88,2.0,3.0,2.0,1.0,1.0,1.0,132.0,30.0,...,4.81,4.95,4.96,4.85,4.76,2.0,1.0,1.0,0.0,0.29
75%,1.0,1.0,10.0,15.0,4.0,1.0,2.0,2.0,223.0,30.0,...,5.0,5.0,5.0,5.0,4.94,9.0,2.0,2.0,0.0,1.0
max,1.0,1.0,5079.0,9048.0,16.0,15.5,16.0,42.0,20000.0,1250.0,...,5.0,5.0,5.0,5.0,5.0,1154.0,1154.0,739.0,4.0,116.3
std,0.220384,0.29038,1000.435105,1196.902978,1.9039,0.553493,0.933898,1.201717,353.251037,29.90515,...,0.504042,0.37792,0.410308,0.395553,0.495027,224.585038,200.899492,109.136674,0.086848,1.885964


In [28]:
alt.Chart(new_data).mark_bar().encode(x='average(reviews_per_month)', y='neighbourhood_group_cleansed')

[Figure: bar chart of average `reviews_per_month` by `neighbourhood_group_cleansed`.]


In [29]:
new_data['price'] = new_data['price'].astype(str).str.strip().replace({r'\$': '', ',': ''}, regex=True)
new_data['price'] = pd.to_numeric(new_data['price'], errors='coerce') 

alt.Chart(new_data).mark_bar().encode(x='average(price)', y='neighbourhood_group_cleansed')

[Figure: bar chart of average `price` by `neighbourhood_group_cleansed`.]
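
The price-cleaning step in the last cell (strip whitespace, remove `$` and thousands separators, then parse) can be tried on a toy Series. This is a minimal sketch with made-up price strings; it uses a single character-class regex in place of the dictionary of patterns, which is a stylistic choice and not the notebook's code.

```python
import pandas as pd

# Made-up price strings in the "$1,234.00" style, plus one missing value.
raw_prices = pd.Series(["$1,250.00", "$82.00", None, " $45.00 "])

# Trim whitespace, then drop dollar signs and thousands separators.
cleaned = raw_prices.str.strip().str.replace(r"[$,]", "", regex=True)

# Parse to floats; anything unparseable (including missing values) becomes NaN.
prices = pd.to_numeric(cleaned, errors="coerce")

print(prices)
```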
