# <center>Analyzing NYC Traffic Collision Data</center>
## <center>A 15-688 Project by:</center><center><br/> Ahmet Emre Unal (ahmetemu)<br/><br/>Marco Peyrot (mpeyrotc)</center>

New York City is a wonderful city with terrible traffic. The drivers are impatient and aggressive. This results in many traffic collisions every day, some of which, quite unfortunately, lead to injuries and death. 

For our project, we wanted to understand NYC's most collision-prone areas and try to predict collisions based on many factors, such as location, vehicle type, whether it's a weekday, etc.

The [NYPD Motor Vehicle Collisions](https://data.cityofnewyork.us/Public-Safety/NYPD-Motor-Vehicle-Collisions/h9gi-nx95/data) dataset we found was surprisingly feature-rich and large. Its size also meant that many collision records had missing data.

In [None]:
from bs4 import BeautifulSoup
import collections
import numpy as np
import pandas as pd
import re
import requests 
import scipy

raw_data_file_name = 'NYPD_Motor_Vehicle_Collisions.csv'
bicycle_lanes_page_url = 'http://www.nyc.gov/html/dot/html/bicyclists/lane-list.shtml'

We started by removing rows with empty or '`UNKNOWN`' values, which discarded about a third of the original dataset. We then dropped columns that were either unusable for prediction (like the contributing factor of the crash, such as the driver being distracted, which cannot be known before the crash happens) or too detailed (like the vehicle subtypes).
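
As a minimal sketch of the row filter described above, the snippet below applies the same idea to a toy table. The table contents and the names `toy`, `required`, and `cleaned` are made up for illustration; the real cleanup uses the full list of required columns.

```python
import pandas as pd

# Toy stand-in for the collisions table; all values are made up.
toy = pd.DataFrame({
    'BOROUGH': ['BROOKLYN', '', 'QUEENS', 'UNKNOWN'],
    'ZIP CODE': ['11201', '10001', '', '10451'],
})

# A row is unusable if any required column is empty or 'UNKNOWN'.
required = ['BOROUGH', 'ZIP CODE']
is_missing = toy[required].isin(['', 'UNKNOWN']).any(axis=1)

# Keep the complete rows and renumber them from zero.
cleaned = toy[~is_missing].reset_index(drop=True)
```

Only the first toy row survives here, since each of the other three has an empty or '`UNKNOWN`' value in a required column.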

In [None]:
def load_data(file_name):
    collision = pd.read_csv(file_name, 
                            na_filter=False, 
                            parse_dates={'DATE_COMBINED' : ['DATE', 'TIME']}, 
                            infer_datetime_format=True)
    
    # Remove rows that don't have the necessary data
    columns_to_check_for_empty = ['LOCATION', 'LATITUDE', 'LONGITUDE', 'BOROUGH',
                                  'ZIP CODE', 'ON STREET NAME', 'CROSS STREET NAME', 
                                  'VEHICLE TYPE CODE 1']
    for column in columns_to_check_for_empty:
        collision = collision[collision[column] != '']
        collision = collision[collision[column] != 'UNKNOWN']

    # Drop unneeded columns
    columns_to_drop = ['CONTRIBUTING FACTOR VEHICLE 1', 'CONTRIBUTING FACTOR VEHICLE 2', 
                       'CONTRIBUTING FACTOR VEHICLE 3', 'CONTRIBUTING FACTOR VEHICLE 4', 
                       'CONTRIBUTING FACTOR VEHICLE 5', 'LOCATION', 'UNIQUE KEY',
                       'OFF STREET NAME', 'VEHICLE TYPE CODE 2', 'VEHICLE TYPE CODE 3',
                       'VEHICLE TYPE CODE 4', 'VEHICLE TYPE CODE 5']
    for column in columns_to_drop:
        collision = collision.drop(column, axis=1)
        
    # Set column types
    collision['ZIP CODE'] = collision['ZIP CODE'].astype(int)
    collision['LATITUDE'] = collision['LATITUDE'].astype(float)
    collision['LONGITUDE'] = collision['LONGITUDE'].astype(float)
    
    # Rename date column to just 'DATE'
    collision = collision.rename(columns={'DATE_COMBINED':'DATE'})
    
    # Eliminate duplicates
    collision = collision.drop_duplicates()
    
    # Reset index
    collision = collision.reset_index(drop=True)
    
    return collision

We proceeded to create two temporal features:

 1. Whether it occurred on a weekday or weekend
 2. What 'time period' in the day it occurred

The 'time period' requires some explanation: the morning and evening rush hours tend to be more collision-prone than day time (around noon) and night time (after the evening rush and before the morning rush). We therefore created four bins with the following time ranges:

 1. Night Time: 00:00-06:59 & 20:00-23:59
 2. Morning Rush: 07:00-10:59
 3. Day Time: 11:00-15:59
 4. Evening Rush: 16:00-19:59

Together with the weekday/weekend feature, this allowed us to represent the date in a way that is meaningful to a machine learning algorithm. We also assumed that, as long as it's a weekday, the exact day (whether it's a Monday or a Thursday) is not significant.
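
To make the encoding concrete, here is a small sketch that maps two made-up timestamps to a time period and a weekday flag. The helper `time_period` and the sample timestamps are hypothetical and only illustrate the binning; the day-time bin is taken to be 11:00-15:59, the gap between the two rush periods.

```python
import pandas as pd

def time_period(hour):
    """Map an hour of day (0-23) to one of the four time-period bins."""
    if 7 <= hour < 11:
        return 'MORNING_RUSH'
    if 11 <= hour < 16:
        return 'DAY_TIME'
    if 16 <= hour < 20:
        return 'EVENING_RUSH'
    return 'NIGHT_TIME'

# Made-up timestamps: a Monday morning and a Saturday late evening.
stamps = pd.Series(pd.to_datetime(['2016-11-07 08:30', '2016-11-12 22:15']))

features = pd.DataFrame({
    'PERIOD': stamps.dt.hour.map(time_period),
    # dayofweek is 0 for Monday through 6 for Sunday.
    'WEEKDAY': (stamps.dt.dayofweek <= 4).astype(int),
})
```

The first timestamp lands in the morning rush on a weekday, the second in the night-time bin on a weekend.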

In [None]:
def create_temporal_features(collision):
    collision = _create_time_features(collision)
    collision = _create_date_features(collision)
    
    # We're done with date, we can drop it
    collision = collision.drop('DATE', axis=1)
    
    return collision
    

def _create_time_features(collision):
    # Create one-hot time of day representation
    ## Date part is unimportant
    morning_rush_begin = pd.Timestamp('2000-01-01 07:00:00').time()
    morning_rush_end = pd.Timestamp('2000-01-01 11:00:00').time()
    evening_rush_begin = pd.Timestamp('2000-01-01 16:00:00').time()
    evening_rush_end = pd.Timestamp('2000-01-01 20:00:00').time()
    collision_time = collision['DATE'].dt.time
    
    ## Night Time 00:00-06:59 & 20:00-00:00
    night_time = (collision_time >= evening_rush_end) | (collision_time < morning_rush_begin)
    night_time_onehot = pd.get_dummies(night_time).loc[:, True].astype(int)
    collision = collision.assign(NIGHT_TIME = night_time_onehot.values)
    
    ## Morning Rush 07:00-10:59
    morning_rush = (collision_time >= morning_rush_begin) & (collision_time < morning_rush_end)
    morning_rush_onehot = pd.get_dummies(morning_rush).loc[:, True].astype(int)
    collision = collision.assign(MORNING_RUSH = morning_rush_onehot.values)
    
    ## Day Time 11:00-15:59
    day_time = (collision_time >= morning_rush_end) & (collision_time < evening_rush_begin)
    day_time_onehot = pd.get_dummies(day_time).loc[:, True].astype(int)
    collision = collision.assign(DAY_TIME = day_time_onehot.values)
    
    ## Evening Rush 16:00-19:59
    evening_rush = (collision_time >= evening_rush_begin) & (collision_time < evening_rush_end)
    evening_rush_onehot = pd.get_dummies(evening_rush).loc[:, True].astype(int)
    collision = collision.assign(EVENING_RUSH = evening_rush_onehot.values)
    
    return collision

def _create_date_features(collision):
    # Create one-hot weekday/weekend representation
    collision_day = collision['DATE'].dt.dayofweek
    
    ## Weekday 0, 1, 2, 3, 4
    ## Weekend 5, 6
    is_weekday = (collision_day <= 4)
    is_weekday_onehot = pd.get_dummies(is_weekday).astype(int)
    
    ## Weekday
    weekday_onehot = is_weekday_onehot.loc[:, True]
    collision = collision.assign(WEEKDAY = weekday_onehot.values)

    ## Weekend
    weekend_onehot = is_weekday_onehot.loc[:, False]
    collision = collision.assign(WEEKEND = weekend_onehot.values)
    
    return collision

We further proceeded to create a one-hot encoding of the vehicle types that were available in the dataset. NYPD has put almost half of the vehicles in a general group called '`PASSENGER VEHICLE`', which seems to represent the common sedan type of vehicle. Other types, like SUVs, motorcycles, small and large commercial vehicles, have their own types. 

We removed collisions involving vehicle types that appear only rarely in the dataset (for example, '`AMBULANCE`'). This cut a further 5% or so of the dataset.
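
The sketch below shows the general idea on a made-up column of vehicle types. Note one deliberate difference: it drops rare types using a count threshold (`min_count`, a hypothetical parameter), whereas our code uses a fixed list of type names.

```python
import pandas as pd

# Made-up vehicle types; the real column is 'VEHICLE TYPE CODE 1'.
types = pd.Series(['PASSENGER VEHICLE', 'TAXI', 'PASSENGER VEHICLE',
                   'AMBULANCE', 'TAXI', 'PASSENGER VEHICLE'])

# Drop rows whose type occurs fewer than min_count times.
min_count = 2  # hypothetical threshold, chosen for this toy data
counts = types.value_counts()
common = types[types.map(counts) >= min_count].reset_index(drop=True)

# One-hot encode what remains: one 0/1 column per vehicle type.
onehot = pd.get_dummies(common).astype(int)
```

Each remaining row has exactly one column set to 1, and the single '`AMBULANCE`' row is gone.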

In [None]:
def create_vehicle_features(collision):
    # Create one-hot vehicle type representation
    vehicle_types_onehot = pd.get_dummies(collision.loc[:, 'VEHICLE TYPE CODE 1']).astype(int)
    
    # Merge Motorcycle & Scooter columns
    motorcycle = vehicle_types_onehot.loc[:, 'MOTORCYCLE'] + vehicle_types_onehot.loc[:, 'SCOOTER']
    vehicle_types_onehot = vehicle_types_onehot.drop('MOTORCYCLE', axis=1)
    vehicle_types_onehot = vehicle_types_onehot.drop('SCOOTER', axis=1)
    vehicle_types_onehot = vehicle_types_onehot.assign(MOTORCYCLE = motorcycle.values)
    
    # Concatenate one-hot with collisions
    collision = pd.concat([collision, vehicle_types_onehot], axis=1)
    
    # Drop unneeded collisions
    vehicles_to_drop = ['OTHER', 'AMBULANCE', 'PEDICAB', 'FIRE TRUCK', 'LIVERY VEHICLE']
    # Each row has a 1 in at most one of these columns, so the row-wise sum
    # is nonzero exactly for the collisions we want to drop
    collisions_to_drop = vehicle_types_onehot.loc[:, vehicles_to_drop].sum(axis=1)
    collisions_to_keep = (collisions_to_drop == 0)
    collision = collision[collisions_to_keep]
    collision = collision.reset_index(drop=True)  # Reset index due to dropped rows
    
    # Drop unneeded vehicle columns
    for column in vehicles_to_drop:
        collision = collision.drop(column, axis=1)
    
    # Drop vehicle type column
    collision = collision.drop('VEHICLE TYPE CODE 1', axis=1)
        
    return collision

We then load the data:

In [None]:
# Load the collision data
collision = load_data(raw_data_file_name)
collision = create_temporal_features(collision)
collision = create_vehicle_features(collision)
    
print(collision.head())
print(collision.dtypes)
print(len(collision))

We proceed with parsing NYC's [bike lanes list web page](http://www.nyc.gov/html/dot/html/bicyclists/lane-list.shtml) to add another feature to our collision data: whether the street on which the collision occurred has a bicycle lane. We do that by parsing the bicycle lanes page and matching the listed streets to the street names of the collisions.
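
To illustrate the parsing step, the sketch below runs the regular expression from our parsing code on two made-up lane descriptions. The sample strings and the helper `parse_lane` are hypothetical; the actual wording on the page may differ.

```python
import re

# Same pattern as in the parsing code: street, start, and end are
# captured in groups 1, 4, and 7.
LANE_PATTERN = re.compile(r'^([a-zA-Z\s0-9\-\.,\(\)]+) (from)*(between)* '
                          r'([a-zA-Z\s0-9\-\.,\(\)]+) (to)*(and)* '
                          r'([a-zA-Z\s0-9\-\.,\(\)]+)$')

def parse_lane(text):
    """Return (street, begins, ends) in upper case, or None if no match."""
    m = LANE_PATTERN.search(text)
    if m is None:
        return None
    return (m.group(1).upper(), m.group(4).upper(), m.group(7).upper())

# Made-up descriptions: one with a 'from ... to ...' range, one without.
lane = parse_lane('1st Avenue from 72nd Street to 125th Street')
no_lane = parse_lane('Central Park Loop')
```

The second description has no 'from ... to ...' or 'between ... and ...' range, so it is discarded, just like the park entries in the real page.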

In [None]:
def get_bicycle_lanes_from_page(page_html):
    """
    Parse the table contained in the bicycle lanes webpage.

    Args:
        page_html (string): String of HTML corresponding to the data source webpage.

    Returns:
        a dictionary that contains mappings from a category to the list containing the data.
        These categories are: street, begins, ends, and borough.
    """
    soup = BeautifulSoup(page_html, 'html.parser')
    bicycle_lanes_list = {
        'street': [],
        'begins': [],
        'ends': [],
        'borough': []
    }

    table = soup.findChildren('tbody')[0]
    rows = table.findChildren(['tr'])
    
    for row in rows:
        cells = row.findChildren('td')
        content = cells[0].text
        m = re.search(r'^([a-zA-Z\s0-9\-\.,\(\)]+) (from)*(between)* '
                      r'([a-zA-Z\s0-9\-\.,\(\)]+) (to)*(and)* '
                      r'([a-zA-Z\s0-9\-\.,\(\)]+)$', content)
        
        # Content that does not follow this syntax is discarded because
        # it refers to landscapes or parks.
        if m is not None:
            bicycle_lanes_list['street'].append(m.group(1).upper())
            bicycle_lanes_list['begins'].append(m.group(4).upper())
            bicycle_lanes_list['ends'].append(m.group(7).upper())
            bicycle_lanes_list['borough'].append(cells[2].text.upper())

    return bicycle_lanes_list
    
def extract_bicycle_lanes(url):
    """
    Retrieve all of the bicycle lane information for the city of New York.

    Parameters:
        url (string): page URL corresponding to the listing of bicycle lane information.

    Returns:
        bicycle_lanes (Pandas DataFrame): one row per bicycle lane, with the columns
        'ON STREET NAME', 'BEGINS', 'ENDS' and 'BOROUGH'.
    """
    bicycle_lanes_page = requests.get(url).text
    bicycle_lanes_list = get_bicycle_lanes_from_page(bicycle_lanes_page)
            
    bicycle_lanes = pd.DataFrame(bicycle_lanes_list)
    bicycle_lanes.loc[bicycle_lanes.borough == 'THE BRONX', 'borough'] = 'BRONX'
    bicycle_lanes = bicycle_lanes.rename(columns={'borough': 'BOROUGH', 'begins': 'BEGINS', 'ends': 'ENDS', 'street': 'ON STREET NAME'})
    
    return bicycle_lanes

bicycle_lanes = extract_bicycle_lanes(bicycle_lanes_page_url)

In [None]:
def merge_bicycle_lanes(df1, df2, columns):
    """
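    Merge both dataframes on the specified columns and flag the rows that matched.

    Parameters:
        df1 (DataFrame): pandas dataframe containing the collision data.
        df2 (DataFrame): pandas dataframe containing the bicycle lane data
            (must have 'BEGINS' and 'ENDS' columns).
        columns (list of str): names of the columns to merge on.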

    Returns:
        result (Pandas DataFrame): dataframe containing data from both dataframes.
    """
    # Populate collisions with bicycle lane data by matching on 'columns'
    result = pd.merge(df1, df2, how='left', on=columns)
    # Rows without a matching lane have a missing 'BEGINS' value after the left merge
    result = result.assign(HAS_BIKE_LANE = result['BEGINS'].notnull().astype(int))
    
    # Drop unneeded columns
    result = result.drop('BEGINS', axis=1)
    result = result.drop('ENDS', axis=1)
    
    return result.drop_duplicates()

merged = merge_bicycle_lanes(collision, bicycle_lanes, ['ON STREET NAME', 'BOROUGH'])

print(merged.head())
print(merged.dtypes)
print(len(merged))
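
The matching step can be sketched on toy data as follows. The two small tables are made up; only the column names follow the ones used above.

```python
import pandas as pd

# Made-up collisions and a single made-up bicycle lane.
collisions = pd.DataFrame({
    'ON STREET NAME': ['1 AVENUE', 'BROADWAY', '1 AVENUE'],
    'BOROUGH': ['MANHATTAN', 'MANHATTAN', 'BROOKLYN'],
})
lanes = pd.DataFrame({
    'ON STREET NAME': ['1 AVENUE'],
    'BOROUGH': ['MANHATTAN'],
    'BEGINS': ['72 STREET'],
})

# A left merge keeps every collision; collisions without a matching lane
# get a missing value in 'BEGINS'.
merged_toy = pd.merge(collisions, lanes, how='left',
                      on=['ON STREET NAME', 'BOROUGH'])
merged_toy['HAS_BIKE_LANE'] = merged_toy['BEGINS'].notnull().astype(int)
```

Only the first collision matches on both street and borough. A street that is listed with several lane segments would produce one merged row per segment, which is presumably why the merge function ends with `drop_duplicates()`.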