# Report

**Authors**: Carlos Arbones & Benet Ramió

## Preprocessing

In [74]:
import pandas as pd
import numpy as np

### Collisions dataset

We consider the preprocessing carried out in the previous practice to be valid for this project as well. The only changes needed are to keep only the rows that involve ambulances, taxis, and fire trucks, as stated in the project description, to keep only the 2018 records, and to drop the columns that are not needed for the study.

In [75]:
collisions = pd.read_csv('code/data/preprocessed-collisions.csv') 
collisions['CRASH_DATETIME'] = pd.to_datetime(collisions['CRASH_DATE'] + ' ' + collisions['CRASH_TIME'], format='%m/%d/%Y %H:%M')
collisions = collisions.drop(columns=['CRASH_DATE', 'CRASH_TIME', 'CONTRIBUTING_FACTOR_VEHICLE2', 'VEHICLE_TYPE_CODE2'])
collisions = collisions[collisions['CRASH_DATETIME'].dt.year == 2018] # Only 2018 data
collisions = collisions[collisions['VEHICLE_TYPE_CODE1'].isin(['Taxi', 'Ambulance', 'Fire truck'])].reset_index(drop=True) 
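
The filtering steps above can be checked on a small example. The sketch below is a minimal illustration that assumes the same column names and date format as the cell above; the rows in `toy` are made up and are not taken from the dataset.

```python
import pandas as pd

# Toy rows for illustration only; the values are invented.
toy = pd.DataFrame({
    'CRASH_DATE': ['06/01/2018', '12/31/2017', '09/15/2018'],
    'CRASH_TIME': ['14:30', '23:59', '00:05'],
    'VEHICLE_TYPE_CODE1': ['Taxi', 'Taxi', 'Sedan'],
})

# Combine the date and time strings and parse them with an explicit format.
toy['CRASH_DATETIME'] = pd.to_datetime(
    toy['CRASH_DATE'] + ' ' + toy['CRASH_TIME'], format='%m/%d/%Y %H:%M'
)

is_2018 = toy['CRASH_DATETIME'].dt.year == 2018
is_target = toy['VEHICLE_TYPE_CODE1'].isin(['Taxi', 'Ambulance', 'Fire truck'])

# Only the first row is both from 2018 and one of the three vehicle types.
filtered = toy[is_2018 & is_target].reset_index(drop=True)
print(filtered)
```

Combining both masks in a single selection gives the same rows as applying the two filters one after the other.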

In [76]:
collisions.shape

(4093, 15)

In [77]:
collisions.isna().sum()

BOROUGH                         1385
ZIP_CODE                        1385
LATITUDE                         294
LONGITUDE                        294
TOTAL_INJURED                      1
TOTAL_KILLED                       1
PEDESTRIANS_INJURED                0
PEDESTRIANS_KILLED                 0
CYCLIST_INJURED                    0
CYCLIST_KILLED                     0
MOTORIST_INJURED                   0
MOTORIST_KILLED                    0
CONTRIBUTING_FACTOR_VEHICLE1       0
VEHICLE_TYPE_CODE1                 0
CRASH_DATETIME                     0
dtype: int64

Since 'BOROUGH' and 'ZIP_CODE' each have 1385 missing values while the coordinates have only 294, we impute 'BOROUGH' and 'ZIP_CODE' by reverse geocoding the rows where LATITUDE and LONGITUDE are available.

In [85]:
# !pip install geopy

In [79]:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="my_geocoder") # initialize the geolocator once and reuse it

def get_location_info(latitude, longitude):
    """Return (borough, zip_code) for a coordinate pair; either may be None."""
    location = geolocator.reverse((latitude, longitude), language="en") # reverse geocoding
    if location is None: # no match for these coordinates
        return None, None
    address = location.raw.get('address', {})
    borough = address.get('suburb') or address.get('city') # prefer suburb, fall back to city
    zip_code = address.get('postcode') # None when the postcode is absent

    return (borough.upper() if borough else None), zip_code

In [80]:
no_borough_or_zip = (collisions['BOROUGH'].isna() | collisions['ZIP_CODE'].isna()) # filter for rows with missing borough or zip code
has_lat_long = (collisions['LATITUDE'].notna() & collisions['LONGITUDE'].notna()) # filter for rows with latitude and longitude

missing_locations = collisions[no_borough_or_zip & has_lat_long] 

In [81]:
# import tqdm
# for row in tqdm.tqdm(missing_locations.itertuples()):
#     borough, zip_code = get_location_info(row.LATITUDE, row.LONGITUDE)
#     collisions.loc[row.Index, 'BOROUGH'] = borough
#     collisions.loc[row.Index, 'ZIP_CODE'] = zip_code

1247it [10:32,  1.97it/s]
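
Sending one request per row is slow (the run above took about ten minutes). As a minimal sketch, assuming that some collisions share the same or nearly the same coordinates, the lookups could be cached on rounded coordinates so that each distinct location is requested only once. `make_cached_lookup` and `fake_lookup` are hypothetical names introduced for illustration; `fake_lookup` stands in for `get_location_info` so that the example runs without network access, and `ndigits=4` is an arbitrary choice.

```python
def make_cached_lookup(lookup, ndigits=4):
    """Wrap a (latitude, longitude) -> (borough, zip_code) function with a cache.

    Coordinates are rounded to `ndigits` decimals before the lookup, so nearby
    points share a single request.
    """
    cache = {}

    def cached(latitude, longitude):
        key = (round(latitude, ndigits), round(longitude, ndigits))
        if key not in cache:
            cache[key] = lookup(*key)  # only reached for unseen locations
        return cache[key]

    return cached


calls = []

def fake_lookup(latitude, longitude):
    # Hypothetical stand-in for get_location_info; records every call.
    calls.append((latitude, longitude))
    return 'MANHATTAN', '10001'


cached_lookup = make_cached_lookup(fake_lookup)
first = cached_lookup(40.750001, -73.990001)
second = cached_lookup(40.750002, -73.990002)  # same key after rounding
print(first, second, len(calls))
```

In the loop above, `cached_lookup` would wrap `get_location_info` instead of `fake_lookup`. geopy also provides a `RateLimiter` helper in `geopy.extra.rate_limiter` for spacing out requests to the geocoding service.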


In [82]:
collisions['BOROUGH'].value_counts()

BOROUGH
MANHATTAN        2332
BROOKLYN          597
QUEENS            332
BRONX             306
QUEENS COUNTY     216
THE BRONX         159
STATEN ISLAND       9
KINGS COUNTY        4
Name: count, dtype: int64

We corrected some inconsistencies by renaming the borough 'THE BRONX' to 'BRONX' and 'QUEENS COUNTY' to 'QUEENS'. Additionally, we removed the 4 rows that had 'KINGS COUNTY' as the borough, since we took it to refer to a location in California and therefore assume that the data for those rows is incorrect.

In [92]:
collisions.loc[collisions['BOROUGH'] == 'THE BRONX', 'BOROUGH'] = 'BRONX' # standardize borough names
collisions.loc[collisions['BOROUGH'] == 'QUEENS COUNTY', 'BOROUGH'] = 'QUEENS' # standardize borough names
collisions = collisions[collisions['BOROUGH'] != 'KINGS COUNTY'].reset_index(drop=True) # remove rows with borough 'KINGS COUNTY'
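
The same standardization can also be written with a single mapping, which may be easier to extend if more inconsistent labels appear. This is a minimal sketch on a toy column; `borough_fixes` is a hypothetical name and the values are illustrative.

```python
import pandas as pd

# Toy column with the kinds of labels seen in the value counts above.
boroughs = pd.Series(
    ['MANHATTAN', 'THE BRONX', 'QUEENS COUNTY', 'KINGS COUNTY', 'BRONX'],
    name='BOROUGH',
)

# Hypothetical mapping: keys are the labels to rewrite, values the standard names.
borough_fixes = {'THE BRONX': 'BRONX', 'QUEENS COUNTY': 'QUEENS'}
standardized = boroughs.replace(borough_fixes)

# Drop the label that the report treats as incorrect.
cleaned = standardized[standardized != 'KINGS COUNTY'].reset_index(drop=True)
print(cleaned.tolist())
```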

In [93]:
collisions['BOROUGH'].value_counts()

BOROUGH
MANHATTAN        2332
BROOKLYN          597
QUEENS            548
BRONX             465
STATEN ISLAND       9
Name: count, dtype: int64

In [95]:
# fraction of rows with missing values, per column
collisions.isna().sum() / collisions.shape[0]

BOROUGH                         0.033749
ZIP_CODE                        0.033994
LATITUDE                        0.071900
LONGITUDE                       0.071900
TOTAL_INJURED                   0.000245
TOTAL_KILLED                    0.000245
PEDESTRIANS_INJURED             0.000000
PEDESTRIANS_KILLED              0.000000
CYCLIST_INJURED                 0.000000
CYCLIST_KILLED                  0.000000
MOTORIST_INJURED                0.000000
MOTORIST_KILLED                 0.000000
CONTRIBUTING_FACTOR_VEHICLE1    0.000000
VEHICLE_TYPE_CODE1              0.000000
CRASH_DATETIME                  0.000000
dtype: float64

The highest share of missing values is in LATITUDE and LONGITUDE, where about 7% of the rows are null. To keep the data consistent, we decided to delete every row that contains a null value.

In [96]:
collisions = collisions.dropna().reset_index(drop=True) # drop rows with missing values
collisions.shape

(3793, 15)
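
To illustrate the two steps above on a small scale, the sketch below computes the per-column share of missing values and then drops the incomplete rows. The frame `toy` is invented for illustration. `isna().mean()` is used here as a shorter equivalent of dividing `isna().sum()` by the number of rows.

```python
import numpy as np
import pandas as pd

# Toy frame for illustration only; the values are invented.
toy = pd.DataFrame({
    'BOROUGH': ['MANHATTAN', None, 'QUEENS', 'BRONX'],
    'LATITUDE': [40.75, 40.70, np.nan, 40.85],
})

# Share of missing values per column (same result as isna().sum() / shape[0]).
missing_share = toy.isna().mean()

# Keep only the rows with no missing value in any column.
complete = toy.dropna().reset_index(drop=True)
rows_dropped = len(toy) - len(complete)
print(missing_share)
print(rows_dropped)
```

Because the missing values fall in different rows, dropping removes more rows than the share of any single column suggests: each column is 25% missing here, but half of the rows are removed.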

In [98]:
# collisions.to_csv('code/data/preprocessed-collisions-2.csv', index=False) 

### Weather dataset

We read the weather dataset, which contains only the data we are interested in, i.e., from June to September 2018. Therefore, there is no need to filter by year. Additionally, for our visualizations and to answer the questions, we will only use the 'icon' column, which contains the type of weather condition for each day. No missing values are observed in this column.

In [104]:
weather = pd.read_csv('code/data/weather2018.csv')
weather = weather[['datetime', 'icon']] # keep only datetime and icon columns
weather['datetime'] = pd.to_datetime(weather['datetime']) # convert datetime column to datetime type
weather.head()

| | datetime | icon |
|---|---|---|
| 0 | 2018-06-01 | rain |
| 1 | 2018-06-02 | rain |
| 2 | 2018-06-03 | rain |
| 3 | 2018-06-04 | rain |
| 4 | 2018-06-05 | partly-cloudy-day |


In [100]:
weather['icon'].value_counts()

icon
rain                 75
clear-day            23
partly-cloudy-day    22
cloudy                2
Name: count, dtype: int64

In [105]:
weather.isna().sum()

datetime    0
icon        0
dtype: int64
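
Besides missing values, it can be worth confirming that the weather data has exactly one row per day in the expected period. The sketch below shows one possible check on a toy frame; `toy_weather` and the date bounds are illustrative and do not come from the real file.

```python
import pandas as pd

# Toy stand-in for the weather frame; the real one is read from the CSV above.
toy_weather = pd.DataFrame({
    'datetime': pd.to_datetime(['2018-06-01', '2018-06-02', '2018-06-04']),
    'icon': ['rain', 'rain', 'partly-cloudy-day'],
})

# Every calendar day the frame is expected to cover (bounds are illustrative).
expected_days = pd.date_range('2018-06-01', '2018-06-04', freq='D')

# Days in the expected range that have no row, and whether any day repeats.
missing_days = expected_days.difference(pd.DatetimeIndex(toy_weather['datetime']))
has_duplicates = toy_weather['datetime'].duplicated().any()
print(missing_days, has_duplicates)
```

For the real data, the period from June 1 to September 30, 2018 spans 122 days, which matches the total of the icon counts shown above (75 + 23 + 22 + 2).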