# NYC Motor Vehicle Collisions - Crashes

The purpose of this notebook is to clean and prepare the collision data, so that it eventually can be merged with CitiBike data

## 1. Imports and data loading

In [6]:
import numpy as np
import pandas as pd
import requests
import os
from dotenv import load_dotenv
from functools import lru_cache

load_dotenv()

df = pd.read_csv("../data/Motor_Vehicle_Collisions_-_Crashes_20251117.csv",
                 low_memory=False)

## 2. Overview

In [7]:
# Check column data types
print(df.info())
# Check for missing values in each column
print(df.isnull().mean())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2220334 entries, 0 to 2220333
Data columns (total 29 columns):
 #   Column                         Dtype  
---  ------                         -----  
 0   CRASH DATE                     object 
 1   CRASH TIME                     object 
 2   BOROUGH                        object 
 3   ZIP CODE                       object 
 4   LATITUDE                       float64
 5   LONGITUDE                      float64
 6   LOCATION                       object 
 7   ON STREET NAME                 object 
 8   CROSS STREET NAME              object 
 9   OFF STREET NAME                object 
 10  NUMBER OF PERSONS INJURED      float64
 11  NUMBER OF PERSONS KILLED       float64
 12  NUMBER OF PEDESTRIANS INJURED  int64  
 13  NUMBER OF PEDESTRIANS KILLED   int64  
 14  NUMBER OF CYCLIST INJURED      int64  
 15  NUMBER OF CYCLIST KILLED       int64  
 16  NUMBER OF MOTORIST INJURED     int64  
 17  NUMBER OF MOTORIST KILLED      int64  
 18  CO

In [8]:
# Check first few rows
df.head()

Unnamed: 0,CRASH DATE,CRASH TIME,BOROUGH,ZIP CODE,LATITUDE,LONGITUDE,LOCATION,ON STREET NAME,CROSS STREET NAME,OFF STREET NAME,...,CONTRIBUTING FACTOR VEHICLE 2,CONTRIBUTING FACTOR VEHICLE 3,CONTRIBUTING FACTOR VEHICLE 4,CONTRIBUTING FACTOR VEHICLE 5,COLLISION_ID,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
0,09/11/2021,2:39,,,,,,WHITESTONE EXPRESSWAY,20 AVENUE,,...,Unspecified,,,,4455765,Sedan,Sedan,,,
1,03/26/2022,11:45,,,,,,QUEENSBORO BRIDGE UPPER,,,...,,,,,4513547,Sedan,,,,
2,11/01/2023,1:29,BROOKLYN,11230.0,40.62179,-73.970024,"(40.62179, -73.970024)",OCEAN PARKWAY,AVENUE K,,...,Unspecified,Unspecified,,,4675373,Moped,Sedan,Sedan,,
3,06/29/2022,6:55,,,,,,THROGS NECK BRIDGE,,,...,Unspecified,,,,4541903,Sedan,Pick-up Truck,,,
4,09/21/2022,13:21,,,,,,BROOKLYN BRIDGE,,,...,Unspecified,,,,4566131,Station Wagon/Sport Utility Vehicle,,,,


In [9]:
# Combine date and time columns into a single datetime column
df["CRASH DATETIME"] = pd.to_datetime(df["CRASH DATE"] + " " + df["CRASH TIME"])

# Inline with our citibike, we focus on data from 2023-01-01 to 2025-10-31
df = df[(df["CRASH DATETIME"] >= "2023-01-01") & (df["CRASH DATETIME"] <= "2025-10-31")]

## Subsetting the data to accidents involving cyclists

We have direct information on (i) whether a cyclist was killed, and (ii) whether a cyclist was injured. However, we also care about accidents involving cyclist in which they were neither killed nor injured. For this, we need to investigate VEHICLE TYPE CODE 1 - 5.

In [10]:
# List of vehicle type columns
veh_cols = [f"VEHICLE TYPE CODE {i}" for i in range(1, 6)]

# Start with a mask of all False
cyclist_mask = pd.Series(False, index=df.index)

for col in veh_cols:
    s = df[col].astype("string").str.lower()

    is_bike = s.str.contains("bik", na=False) & \
              ~s.str.contains("motor|dirt", na=False)

    is_cycle = s.str.contains("cyc", na=False) & \
               ~s.str.contains("motor|quad", na=False)

    cyclist_mask |= (is_bike | is_cycle)

# Final indicator column: 1 if any of the 5 vehicle codes matches, else 0
df["CYCLIST_INVOLVED"] = cyclist_mask.astype(int)


In [11]:
df.loc[df["CYCLIST_INVOLVED"] == 1, veh_cols].head()


Unnamed: 0,VEHICLE TYPE CODE 1,VEHICLE TYPE CODE 2,VEHICLE TYPE CODE 3,VEHICLE TYPE CODE 4,VEHICLE TYPE CODE 5
1925,Bike,E-Bike,,,
2592,Bike,Sedan,,,
2661,Bike,,,,
2689,Bike,Pick-up Truck,,,
2708,Bike,,,,


In [12]:
# Subset dataframe to only cyclist-involved collisions
df_sub = df[(df["CYCLIST_INVOLVED"] == 1) | (df["NUMBER OF CYCLIST INJURED"] > 0) |
   (df["NUMBER OF CYCLIST KILLED"] > 0)]

df_sub.isna().sum()

CRASH DATE                           0
CRASH TIME                           0
BOROUGH                           3742
ZIP CODE                          3743
LATITUDE                           997
LONGITUDE                          997
LOCATION                           997
ON STREET NAME                    5152
CROSS STREET NAME                 7382
OFF STREET NAME                  16047
NUMBER OF PERSONS INJURED            0
NUMBER OF PERSONS KILLED             0
NUMBER OF PEDESTRIANS INJURED        0
NUMBER OF PEDESTRIANS KILLED         0
NUMBER OF CYCLIST INJURED            0
NUMBER OF CYCLIST KILLED             0
NUMBER OF MOTORIST INJURED           0
NUMBER OF MOTORIST KILLED            0
CONTRIBUTING FACTOR VEHICLE 1       38
CONTRIBUTING FACTOR VEHICLE 2     2569
CONTRIBUTING FACTOR VEHICLE 3    20753
CONTRIBUTING FACTOR VEHICLE 4    21141
CONTRIBUTING FACTOR VEHICLE 5    21177
COLLISION_ID                         0
VEHICLE TYPE CODE 1                  0
VEHICLE TYPE CODE 2      

## Dealing with remaining missing locations
For our analysis, it is necessary that we know the location of the cyclist-involved accident.

In [13]:
# Check the number of instances ZIP CODE is not missing, but LATITUDE or LONGITUDE is missing
df_sub[((df_sub["LATITUDE"].isna()) | (df_sub["LONGITUDE"].isna())) & (df_sub["ZIP CODE"].notna())].shape

(204, 31)

In [14]:
# Check the number of instances BOROUGH is not missing, but LATITUDE or LONGITUDE and ZIP CODE are missing
df_sub[((df_sub["LATITUDE"].isna()) | (df_sub["LONGITUDE"].isna())) & (df_sub["ZIP CODE"].isna()) & (df_sub["BOROUGH"].notna())].shape

(0, 31)

As ZIP CODE is preferred to BOROUGH, the approach we use is as follows:
1. Fill missing LATITUDE and LONGITUDE using NYC GEOCLIENT (v2)
2. Use ZIP CODE medians
3. Drop remaining, as we cannot reasonably assign location

In [15]:
GEOCLIENT_KEY = os.getenv("GEOCLIENT_KEY")
URL = "https://api.nyc.gov/geoclient/v2/search"

@lru_cache(maxsize=50_000)
def geocode(query: str):

    query_norm = query.strip().lower()

    headers = {
        "Ocp-Apim-Subscription-Key": GEOCLIENT_KEY
    }
    params = {"input": query_norm}

    r = requests.get(URL, headers=headers, params=params, timeout=8)
    if r.status_code != 200:
        return None, None

    data = r.json()
    results = data.get("results") or []
    if not results:
        return None, None

    resp = results[0].get("response", {})
    lat = resp.get("latitude")
    lon = resp.get("longitude")

    if lat is None or lon is None:
        return None, None
    return float(lat), float(lon)

def build_query(row):
    on = str(row.get("ON STREET NAME", "") or "").strip()
    cross = str(row.get("CROSS STREET NAME") or "").strip()
    bor = str(row.get("BOROUGH") or "").strip()
    
    if on and cross and bor:
        return f"{on} & {cross}, {bor}, NY"
    if on and bor:
        return f"{on}, {bor}, NY"

    return None


In [16]:
mask = df_sub["LATITUDE"].isna() | df_sub["LONGITUDE"].isna()
c = 0
for idx, row in df_sub[mask].iterrows():
    c += 1
    if c % 100 == 0:
        print(f"Geocoding record {c} of {len(df_sub[mask])}")
    q = build_query(row)
    lat, lon = geocode(q)
    df_sub.at[idx, "LATITUDE"] = lat
    df_sub.at[idx, "LONGITUDE"] = lon

Geocoding record 100 of 997
Geocoding record 200 of 997
Geocoding record 300 of 997
Geocoding record 400 of 997
Geocoding record 500 of 997
Geocoding record 600 of 997
Geocoding record 700 of 997
Geocoding record 800 of 997
Geocoding record 900 of 997


In [17]:
# For the remaining missing lat/lon, we first use ZIP CODE medians
zip_medians = df_sub.groupby("ZIP CODE")[["LATITUDE", "LONGITUDE"]].median()

mask = df_sub["LATITUDE"].isna() | df_sub["LONGITUDE"].isna()
for idx, row in df_sub[mask].iterrows():
    zip_code = row["ZIP CODE"]
    if pd.notna(zip_code) and zip_code in zip_medians.index:
        df_sub.at[idx, "LATITUDE"] = zip_medians.at[zip_code, "LATITUDE"]
        df_sub.at[idx, "LONGITUDE"] = zip_medians.at[zip_code, "LONGITUDE"]

In [18]:
df_sub.isnull().sum()

CRASH DATE                           0
CRASH TIME                           0
BOROUGH                           3742
ZIP CODE                          3743
LATITUDE                           784
LONGITUDE                          784
LOCATION                           997
ON STREET NAME                    5152
CROSS STREET NAME                 7382
OFF STREET NAME                  16047
NUMBER OF PERSONS INJURED            0
NUMBER OF PERSONS KILLED             0
NUMBER OF PEDESTRIANS INJURED        0
NUMBER OF PEDESTRIANS KILLED         0
NUMBER OF CYCLIST INJURED            0
NUMBER OF CYCLIST KILLED             0
NUMBER OF MOTORIST INJURED           0
NUMBER OF MOTORIST KILLED            0
CONTRIBUTING FACTOR VEHICLE 1       38
CONTRIBUTING FACTOR VEHICLE 2     2569
CONTRIBUTING FACTOR VEHICLE 3    20753
CONTRIBUTING FACTOR VEHICLE 4    21141
CONTRIBUTING FACTOR VEHICLE 5    21177
COLLISION_ID                         0
VEHICLE TYPE CODE 1                  0
VEHICLE TYPE CODE 2      

In [19]:
# Remove missing lat/lon records
df_sub = df_sub[df_sub["LATITUDE"].notna() & df_sub["LONGITUDE"].notna()]
df_sub.isnull().sum()

CRASH DATE                           0
CRASH TIME                           0
BOROUGH                           2958
ZIP CODE                          2959
LATITUDE                             0
LONGITUDE                            0
LOCATION                           213
ON STREET NAME                    4921
CROSS STREET NAME                 6823
OFF STREET NAME                  15494
NUMBER OF PERSONS INJURED            0
NUMBER OF PERSONS KILLED             0
NUMBER OF PEDESTRIANS INJURED        0
NUMBER OF PEDESTRIANS KILLED         0
NUMBER OF CYCLIST INJURED            0
NUMBER OF CYCLIST KILLED             0
NUMBER OF MOTORIST INJURED           0
NUMBER OF MOTORIST KILLED            0
CONTRIBUTING FACTOR VEHICLE 1       35
CONTRIBUTING FACTOR VEHICLE 2     2325
CONTRIBUTING FACTOR VEHICLE 3    19978
CONTRIBUTING FACTOR VEHICLE 4    20359
CONTRIBUTING FACTOR VEHICLE 5    20393
COLLISION_ID                         0
VEHICLE TYPE CODE 1                  0
VEHICLE TYPE CODE 2      

## Subsetting for columns we need

In [21]:
cols = ["CRASH DATETIME", "LATITUDE", "LONGITUDE", "CYCLIST_INVOLVED",
        "NUMBER OF CYCLIST INJURED", "NUMBER OF CYCLIST KILLED"]
df_sub = df_sub[cols]
df_sub.isna().sum()

CRASH DATETIME               0
LATITUDE                     0
LONGITUDE                    0
CYCLIST_INVOLVED             0
NUMBER OF CYCLIST INJURED    0
NUMBER OF CYCLIST KILLED     0
dtype: int64

In [22]:
df_sub.to_csv("../data/processed/cleaned_collision_data.csv", index=False)