## Bike Index Seattle - Data Prep

### Data cleaning for Seattle streets data

##### Objective: Recreate the study by Allen-Munley et al. (2004) for Seattle using WSDOT crash data.


#### Part 2.

I will take Seattle street data from [SDOT](https://data-seattlecitygis.opendata.arcgis.com/datasets/seattle-streets), which contains the following [attributes](https://www.seattle.gov/Documents/Departments/SDOT/GIS/Seattle_Streets_OD.pdf).

The WSDOT crash coordinates do not fall exactly on the street network, so I will first find the nearest street segment for each crash in the geodata and then take the crash's projection onto that segment as its new coordinate. With each crash tied to a segment, the two tables can be merged to attach street attributes to the crashes.

[Reference for snapping to nearest street](https://medium.com/@brendan_ward/how-to-leverage-geopandas-for-faster-snapping-of-points-to-lines-6113c94e59aa)

Finally, I will keep only the street attributes that match those used in the study.

In [1]:
import requests
import json
import pandas as pd
import numpy as np
import geopandas as gpd
import shapely
from shapely.geometry import Point, LineString
import os
import pathlib
import matplotlib.pyplot as plt

In [2]:
%matplotlib inline

plt.rcParams["figure.figsize"] = (20,20)

In [3]:
#url = 'https://data.seattle.gov/resource/38vd-gytv.json'
#Didn't use

#### Step 1: Load data

- Load street data from SDOT geoJSON file
- Load crash data from previous step
- Convert street geoJSON and crash data to geoDataFrame

In [4]:
# Load street data from SDOT

gisurl = 'https://opendata.arcgis.com/datasets/383027d103f042499693da22d72d10e3_0.geojson'

r = requests.get(gisurl)

streets = r.json()

In [28]:
# Check crs- CRS84

streets['crs']

{'type': 'name', 'properties': {'name': 'urn:ogc:def:crs:OGC:1.3:CRS84'}}

In [24]:
# features of json

streets['features'][0]

{'type': 'Feature',
 'properties': {'OBJECTID': 1,
  'ARTCLASS': 2,
  'COMPKEY': 1006,
  'UNITID': '00010',
  'UNITID2': '0120',
  'UNITIDSORT': '000100120',
  'UNITDESC': '1ST AVE BETWEEN SENECA ST AND UNIVERSITY ST',
  'STNAME_ORD': '1ST AVE',
  'XSTRLO': 'SENECA ST',
  'XSTRHI': 'UNIVERSITY ST',
  'ARTDESCRIPT': 'Minor Arterial',
  'OWNER': ' ',
  'STATUS': 'INSVC',
  'BLOCKNBR': 1200,
  'SPEEDLIMIT': 25,
  'SEGDIR': 'NW',
  'ONEWAY': 'N',
  'ONEWAYDIR': ' ',
  'FLOW': ' ',
  'SEGLENGTH': 306.0,
  'SURFACEWIDTH': 48,
  'SURFACETYPE_1': 'PCC',
  'SURFACETYPE_2': 'AC/PCC',
  'INTRLO': '1ST AVE AND SENECA ST',
  'DIRLO': 'NW',
  'INTKEYLO': 29611,
  'INTRHI': '1ST AVE AND UNIVERSITY ST',
  'DIRHI': 'SE',
  'NATIONHWYSYS': 'N',
  'STREETTYPE': 'Downtown Neighborhood',
  'PVMTCONDINDX1': 87,
  'PVMTCONDINDX2': 62,
  'TRANCLASS': 1,
  'TRANDESCRIPT': 'PRINCIPAL TRANSIT ROUTE',
  'SLOPE_PCT': 4,
  'PVMTCATEGORY': 'ART',
  'PARKBOULEVARD': 'N',
  'SHAPE_Length': 305.966027667416},
 'geometr

In [5]:
# Load crash data from previous step

crashes = pd.read_csv('data/bike_crash.csv')

In [6]:
# Convert street geoJSON and crash data .csv to geoDataFrame


# Convert lat/long to geometry
geom = [Point(xy) for xy in zip(crashes.LONGITUDE, crashes.LATITUDE)]

gdf_crashes = gpd.GeoDataFrame(crashes.drop(['LATITUDE','LONGITUDE'], axis = 1),
                               geometry = geom)

gdf = gpd.GeoDataFrame.from_features(streets)

#### Step 2: Find possible street segment for every crash point coordinate

- Given offset parameter, find the acceptable bounds for each crash coordinate
- Find street segments within the bounds for each crash coordinate
- Find closest street segment for each given crash coordinate
- 'Snap' crash coordinate to the street segment vector by projection
- Discard any street/crash combination whose distance is greater than a tolerance parameter
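The bounding-box step can be illustrated without any GIS library. The sketch below is a brute-force stand-in for the spatial index query, not the indexed lookup itself: it pads a crash point's bounding box by the offset and keeps every street whose bounding box overlaps it. The coordinates and the helper names `expand` and `boxes_intersect` are made up for illustration.

```python
# Bounding boxes are (minx, miny, maxx, maxy) tuples, as in GeoDataFrame.bounds.

def expand(bounds, offset):
    """Grow a bounding box by `offset` on every side."""
    minx, miny, maxx, maxy = bounds
    return (minx - offset, miny - offset, maxx + offset, maxy + offset)


def boxes_intersect(a, b):
    """True if two bounding boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


offset = 1e-4  # same value as the notebook's offset, in degrees

# Hypothetical bounding boxes of three street segments
line_bounds = [
    (-122.3200, 47.6050, -122.3190, 47.6051),  # east-west street near the crash
    (-122.3194, 47.6040, -122.3193, 47.6060),  # north-south street near the crash
    (-122.3500, 47.6700, -122.3490, 47.6710),  # street far from the crash
]

# A point's bounding box is degenerate: min and max coincide
crash = (-122.31935, 47.60515)
crash_box = expand((crash[0], crash[1], crash[0], crash[1]), offset)

# Ordinal positions of the streets whose boxes overlap the padded crash box
candidates = [i for i, lb in enumerate(line_bounds) if boxes_intersect(crash_box, lb)]
print(candidates)  # [0, 1]
```

A bounding-box hit is only a coarse filter: a diagonal street can have a box that overlaps the crash box while the street itself is farther away than the offset, which is why the exact distance is still checked against a tolerance afterwards.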

In [7]:
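# Build the spatial index for the street GeoDataFrame (created on first access)
gdf.sindex

# Select offset parameter, in degrees: 1e-4 degrees is approx 10m at Seattle's latitude
offset = 10**-4

# Create a bounding box around each crash coordinate, padded by the offset
bbox = gdf_crashes.bounds + [-offset, -offset, offset, offset]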
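Because the data is in CRS84 (longitude/latitude), the offset is expressed in degrees. As a rough check on the "approx 10m" figure, the arithmetic below converts 1e-4 degrees to metres using a spherical Earth. The radius and the representative latitude of 47.6 degrees are assumptions made for this estimate only.

```python
import math

EARTH_RADIUS_M = 6_371_000   # mean Earth radius, spherical approximation (assumption)
SEATTLE_LAT_DEG = 47.6       # representative latitude for Seattle (assumption)

offset_deg = 1e-4

# Length of one degree of latitude on a sphere
meters_per_degree = math.pi * EARTH_RADIUS_M / 180

# A degree of longitude shrinks with the cosine of the latitude
north_south_m = offset_deg * meters_per_degree
east_west_m = north_south_m * math.cos(math.radians(SEATTLE_LAT_DEG))

print(f"north-south: {north_south_m:.1f} m, east-west: {east_west_m:.1f} m")
```

Under these assumptions the offset comes out at about 11 m north-south and about 7.5 m east-west, so the padded box is somewhat narrower than it is tall. Reprojecting both layers to a metric CRS would make the offset uniform, at the cost of an extra step.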

In [8]:
# Find street segments within bounds for each crash coordinate

hits = bbox.apply(lambda row: list(gdf.sindex.intersection(row)), axis=1)

# Temp dataframe containing each point index and ordinal position of line
temp = pd.DataFrame({
    "pt_idx": np.repeat(hits.index, hits.apply(len)),
    "line_i": np.concatenate(hits.values)
})

temp.head()

| | pt_idx | line_i |
| :----------- | :----------- | :----------- |
| 0 | 0 | 5040.0 |
| 1 | 0 | 3934.0 |
| 2 | 0 | 1104.0 |
| 3 | 1 | 5053.0 |
| 4 | 2 | 22.0 |
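The `np.repeat` / `np.concatenate` pair turns a Series of hit lists (one list per crash) into a long table with one row per crash/street pair. A minimal sketch with made-up hit lists shows the reshaping. The `.to_numpy()` calls are my own addition to keep the inputs as plain arrays.

```python
import numpy as np
import pandas as pd

# Hypothetical hits: crash 0 has two candidate streets, crash 1 has none, crash 2 has one
hits = pd.Series([[5, 3], [], [7]])

temp = pd.DataFrame({
    # Repeat each crash index once per candidate street
    "pt_idx": np.repeat(hits.index.to_numpy(), hits.apply(len).to_numpy()),
    # Flatten all hit lists into a single column
    "line_i": np.concatenate(hits.values),
})

print(temp)
print(temp["line_i"].dtype)  # float64
```

Crash 1 has no candidates and simply disappears from the table. Its empty list also becomes an empty float array, which upcasts the concatenated column to float. This may be why `line_i` is displayed as `5040.0` rather than `5040` in the output above.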


In [9]:
# Merge temp df to street gdf

temp2 = temp.merge(
    gdf.reset_index(drop=False).rename(columns={'index': 'line_i'}),
    on='line_i')

temp2.head()

Unnamed: 0,pt_idx,line_i,geometry,OBJECTID,ARTCLASS,COMPKEY,UNITID,UNITID2,UNITIDSORT,UNITDESC,...,NATIONHWYSYS,STREETTYPE,PVMTCONDINDX1,PVMTCONDINDX2,TRANCLASS,TRANDESCRIPT,SLOPE_PCT,PVMTCATEGORY,PARKBOULEVARD,SHAPE_Length
0,0,5040.0,"LINESTRING (-122.31941 47.60520, -122.31942 47...",5041,0.0,1192,145,40,1450040,10TH AVE BETWEEN E TERRACE ST AND E JEFFERSON ST,...,N,Urban Village Neighborhood Access,20.0,0.0,0,NOT DESIGNATED,1.0,NON-ART,N,367.278725
1,0,3934.0,"LINESTRING (-122.31942 47.60621, -122.31811 47...",3935,3.0,14218,11190,100,111900100,E JEFFERSON ST BETWEEN 10TH AVE AND 11TH AVE,...,N,Urban Village Neighborhood,54.0,0.0,2,MAJOR TRANSIT ROUTE,9.0,ART,N,321.988627
2,0,1104.0,"LINESTRING (-122.32075 47.60621, -122.31942 47...",1105,3.0,14217,11190,90,111900090,E JEFFERSON ST BETWEEN BROADWAY AND 10TH AVE,...,N,Urban Village Neighborhood,54.0,0.0,2,MAJOR TRANSIT ROUTE,8.0,ART,N,329.041908
3,400,1104.0,"LINESTRING (-122.32075 47.60621, -122.31942 47...",1105,3.0,14217,11190,90,111900090,E JEFFERSON ST BETWEEN BROADWAY AND 10TH AVE,...,N,Urban Village Neighborhood,54.0,0.0,2,MAJOR TRANSIT ROUTE,8.0,ART,N,329.041908
4,1,5053.0,"LINESTRING (-122.32023 47.62546, -122.32023 47...",5054,2.0,1290,150,70,1500070,10TH AVE E BETWEEN E ROY W ST AND E ALOHA ST,...,N,Urban Village Neighborhood,14.0,0.0,2,MAJOR TRANSIT ROUTE,0.0,ART,N,498.702266


In [10]:
# Rename crashes gdf geometry to point, index to pt_idx to merge with temp

crashes_temp = gdf_crashes.reset_index(drop=False).rename(columns={'geometry':'point', 'index':'pt_idx'})

crashes_temp.columns

Index(['pt_idx', 'REPORT NUMBER', 'PRIMARY TRAFFICWAY',
       'INTERSECTING TRAFFICWAY', 'DATETIME', 'is_dry', 'is_light', 'is_clear',
       'is_hit_run', 'is_workzone', 'is_child', 'impaired', 'speeding',
       'driver_16_25', 'driver_65_plus', 'severity', 'point'],
      dtype='object')

In [11]:
# Merge the two temp df

temp3 = temp2.merge(crashes_temp, on="pt_idx")

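# Convert back to a GeoDataFrame, so we can do spatial ops
# Note: gdf_crashes was built without a CRS, so crs is None here and
# all coordinates stay as plain longitude/latitude degrees

temp4 = gpd.GeoDataFrame(temp3, geometry="geometry", crs=gdf_crashes.crs)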

In [12]:
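# Calculate snap_dist: distance (in degrees) from each crash point to its candidate street segment

temp4["snap_dist"] = temp4.geometry.distance(gpd.GeoSeries(temp4.point))

# The bounding-box query is coarse, so some candidate pairs are farther apart than the offset
max(temp4.snap_dist)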

0.015461724901658874

In [13]:
# Discard any street/point combinations that are farther apart than the tolerance

tolerance = offset # keep at same distance as original offset, approx 10m

temp5 = temp4.loc[temp4.snap_dist <= tolerance]

# Sort by ascending snap distance, so that the closest goes to the top
temp5 = temp5.sort_values(by=["snap_dist"])

temp5.head()

Unnamed: 0,pt_idx,line_i,geometry,OBJECTID,ARTCLASS,COMPKEY,UNITID,UNITID2,UNITIDSORT,UNITDESC,...,is_hit_run,is_workzone,is_child,impaired,speeding,driver_16_25,driver_65_plus,severity,point,snap_dist
93,542,17864.0,"LINESTRING (-122.31685 47.61526, -122.31685 47...",17865,2.0,1547,320,160,3200160,12TH AVE BETWEEN E PINE ST AND E OLIVE ST,...,0,0,0,0,0,0,0,3,POINT (-122.31685 47.61527),9.447105e-09
89,36,17864.0,"LINESTRING (-122.31685 47.61526, -122.31685 47...",17865,2.0,1547,320,160,3200160,12TH AVE BETWEEN E PINE ST AND E OLIVE ST,...,1,0,0,0,0,0,0,1,POINT (-122.31685 47.61527),9.942701e-09
4106,926,6651.0,"LINESTRING (-122.35435 47.67602, -122.35435 47...",6652,2.0,12160,9115,650,91150650,PHINNEY AVE N BETWEEN N 65TH ST AND N 67TH S ST,...,0,0,0,0,0,1,0,1,POINT (-122.35435 47.67604),1.787883e-08
1554,215,18980.0,"LINESTRING (-122.28992 47.57158, -122.28995 47...",18981,0.0,5271,2610,360,26100360,34TH AVE S BETWEEN S SPOKANE ST AND S CHARLEST...,...,0,0,0,0,0,0,0,3,POINT (-122.28995 47.57006),2.173671e-08
2984,619,13501.0,"LINESTRING (-122.32667 47.63204, -122.32678 47...",13502,0.0,10336,6740,110,67400110,FAIRVIEW AVE E BETWEEN FAIRVIEW AVE N AND E GA...,...,0,0,0,0,0,0,0,2,POINT (-122.32673 47.63208),3.235042e-08


#### Step 3: Find closest street segment and corresponding coordinate on line

- For each crash point, find the closest street segment
- Find the projection on the street segment from the point
- Create final new GeoDataFrame of the crash points and street segments
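The projection step can be written out by hand for a single straight segment. This is only a sketch of the underlying geometry: the notebook relies on the `project` and `interpolate` methods, which also handle street lines with several vertices, while the hypothetical helper `snap_to_segment` below covers one segment between two endpoints.

```python
import math


def snap_to_segment(p, a, b):
    """Return the point on segment a-b closest to p and its distance along the segment."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0:
        # Degenerate segment: both endpoints coincide
        return a, 0.0
    # Fraction of the way from a to b at which p projects onto the line
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    # Clamp so the snapped point stays on the segment, not on its extension
    t = max(0.0, min(1.0, t))
    snapped = (ax + t * dx, ay + t * dy)
    return snapped, t * math.sqrt(seg_len_sq)


# A point beside the segment snaps to the foot of the perpendicular
print(snap_to_segment((2.5, 2.0), (0.0, 0.0), (10.0, 0.0)))   # ((2.5, 0.0), 2.5)

# A point past the end of the segment snaps to the nearest endpoint
print(snap_to_segment((12.0, 1.0), (0.0, 0.0), (10.0, 0.0)))  # ((10.0, 0.0), 10.0)
```

The distance along the segment plays the role of the position returned by `project`, and the snapped point corresponds to what `interpolate` returns for that position.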

In [14]:
# Find the closest street segment for each crash point 

closest = temp5.groupby("pt_idx").first()

# construct a GeoDataFrame of the closest lines
closest = gpd.GeoDataFrame(closest, geometry="geometry")
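Picking the closest segment relies on the earlier sort: once rows are ordered by `snap_dist`, the first row in each `pt_idx` group is the nearest candidate. A small sketch with made-up distances shows this, along with an `idxmin` variant that does not depend on the sort order. The toy table `pairs` is hypothetical.

```python
import pandas as pd

# Hypothetical candidate pairs: crash 0 has two candidates, crash 1 has three
pairs = pd.DataFrame({
    "pt_idx":    [0, 0, 1, 1, 1],
    "line_i":    [10, 11, 20, 21, 22],
    "snap_dist": [3e-5, 1e-5, 8e-5, 2e-5, 5e-5],
})

# Sort so the nearest candidate comes first, then take the first row per crash
closest = pairs.sort_values("snap_dist").groupby("pt_idx").first()
print(closest["line_i"].tolist())  # [11, 21]

# Alternative that selects the minimum directly, without relying on sort order
nearest_rows = pairs.groupby("pt_idx")["snap_dist"].idxmin()
alt = pairs.loc[nearest_rows].set_index("pt_idx")
print(alt["line_i"].tolist())  # [11, 21]
```

One caveat: `first()` returns the first non-null value of each column separately, so if the nearest row has a missing attribute, that cell is filled from a later row in the group. The `idxmin` variant always keeps whole rows.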

In [18]:
# Randomly check 10 rows of the closest street against the csv file
# (randint's upper bound is exclusive, so len(closest) covers every row)

check_ix = np.random.randint(len(closest), size=10)

# UNITDESC is from SDOT data
# PRIMARY TRAFFICWAY and INTERSECTING TRAFFICWAY are from the WSDOT crash .csv file

# Just an eyeball check: the descriptions do not match exactly for primary & intersecting trafficway

closest.iloc[check_ix][['UNITDESC','PRIMARY TRAFFICWAY','INTERSECTING TRAFFICWAY']]

| pt_idx | UNITDESC | PRIMARY TRAFFICWAY | INTERSECTING TRAFFICWAY |
| :----------- | :----------- | :----------- | :----------- |
| 1026 | S BAILEY ST BETWEEN CARLETON AVE S AND FLORA A... | S BAILEY ST | |
| 1283 | NE 45TH ST BETWEEN 5TH AVE NE AND 7TH AVE NE | 005LX16938 | |
| 871 | 15TH AVE NE BETWEEN NE PACIFIC ST AND NE 40TH ST | NE PACIFIC ST | 15TH AVE NE |
| 678 | JAMES ST BETWEEN BOREN AVE AND MINOR AVE | JAMES ST | BOREN AVE |
| 742 | MELROSE AV ON RP BETWEEN MELROSE AVE AND OLIVE... | MELROSE AVE | E OLIVE WAY |
| 1269 | FAIRVIEW AVE N BETWEEN YALE AVE N AND FAIRVIEW... | YALE AVE N | FAIRVIEW AVE N |
| 662 | HARBOR AVE SW BETWEEN CALIFORNIA WAY SW AND FA... | HARBOR AVE SW | |
| 867 | NE 95TH ST BETWEEN RAVENNA AVE NE AND 23RD AVE NE | NE 95TH ST | 23RD AVE NE |
| 384 | PINE ST BETWEEN TERRY AVE AND BOREN AVE | BOREN AVE | PINE ST |
| 158 | 2ND AVE BETWEEN PIKE ST AND PINE ST | 2ND AVE | |


In [19]:
# Position of nearest point from start of the line
pos = closest.geometry.project(gpd.GeoSeries(closest.point))

# Get new point location geometry
new_pts = closest.geometry.interpolate(pos)

In [94]:
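# Create a new GeoDataFrame from the columns of the closest line and the new point geometries

crashes_merged = gpd.GeoDataFrame(closest.drop(columns = ['geometry']),geometry=new_pts)

# Also keep the street geometries
# Caution: temp5 holds one row per crash/segment pair, so line_i is not unique there.
# This merge repeats a crash once per matching temp5 row and drops the pt_idx index.

crashes_merged = crashes_merged.merge(temp5.rename(columns = {'geometry':'line_geo'})[['line_i', 'line_geo']])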
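One thing to watch in the merge above: `temp5` has one row per crash/segment pair, so a given `line_i` can appear several times on the right-hand side, and each crash is then repeated once per match. The `head()` output in Step 5 shows three identical rows for `line_i` 22, which is consistent with this. The sketch below reproduces the effect on made-up data and shows one possible remedy, deduplicating the lookup table first. The frame names and values are hypothetical.

```python
import pandas as pd

# Hypothetical data: crashes A and B both snapped to street 22, crash C to street 7
crashes = pd.DataFrame({"report": ["A", "B", "C"], "line_i": [22, 22, 7]})

# Candidate-pair table: street 22 appears twice because two crashes are near it
pairs = pd.DataFrame({"line_i": [22, 22, 7],
                      "line_geo": ["geo22", "geo22", "geo7"]})

# Merging on a non-unique key multiplies rows: A and B each match twice
naive = crashes.merge(pairs, on="line_i")
print(len(naive))  # 5

# Reduce the lookup to one row per street before merging
lookup = pairs.drop_duplicates(subset="line_i")
deduped = crashes.merge(lookup, on="line_i", validate="many_to_one")
print(len(deduped))  # 3
```

The `validate="many_to_one"` argument makes pandas raise an error if the right-hand key is not unique, which guards against the duplication returning unnoticed.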

#### Step 4: Choose attributes, convert to binary variables

Revisiting the variables used in this study
![Explanatory Variables](variables.png)  

**Attributes from Street data**

| Attribute | Variable Name |  Comment |
| :----------- | :----------- | :--|
|Speed | `SPEEDLIMIT`| Continuous variable |
|Width | `SURFACEWIDTH` | Continuous variable |
|One_Way | `ONEWAY` | Y/N, convert to binary |
|Grade | `SLOPE_PCT` | Continuous, convert to binary |
|Pave | `PVMTCONDINDX1` |  Primary pavement condition, out of 100 |
|Hwy | `ARTCLASS` | 4: State Highway, 5: Interstate Freeway |
|Bus | `TRANCLASS` | All non-zero classes are transit routes |
|Truck | `STREETTYPE` | Industrial Access as proxy for Truck Route |

**Attributes from Crash data**  

Included in study

| Attribute | Variable Name |
| :----------- | :----------- |
|Weather | `is_clear` |
|Daylight | `is_light` | 
|Child | `is_child` | 

Not included in study

| Attribute | Variable Name |
| :----------- | :----------- |
|Pavement Surface | `is_dry` |
|Hit & Run | `is_hit_run` | 
|Workzone | `is_workzone` |
|Involves Impaired Persons | `impaired` |
|Driver Speeding | `speeding`|
|Driver 16-25 Years | `driver_16_25`|
|Driver over 65 Years | `driver_65_plus`|

**Attributes in study not yet gathered**

| Attribute | Description |
| :----------- | :----------- |
|Volume | Motor vehicle volume per lane |
|Income | Household income (of crash location) | 
|Density | Population density |
|Road_Div | Are opposing directions physically separated? |
|Curve | Does the road have a perceptible curve? |
|Parking | Is parking permitted? |
|Signal | Did the crash occur at a signalized intersection? |
|Resident | Was the location zoned residential? |

**Convert data to binary variables**

- `ONEWAY`
- `SLOPE_PCT` 
    - Will take slopes >5% as perceptible slope
- `PVMTCONDINDX1` (Pavement Condition) 
    - Study used whether pavement was resurfaced within 10 years. Here, will take whether condition >50 (out of 100) as proxy for 'good' pavement
- `ARTCLASS` (Hwy)
    - `4: State Highway` and `5: Interstate Freeway` as 1 (is highway), else 0
- `TRANCLASS` (Bus)
    - All non-zero classes are classified as transit routes
- `STREETTYPE` (Truck)
    - Will use `Industrial Access` routes as proxy for truck route

In [96]:
crashes_merged['one_way'] = [1 if vals == 'Y' else 0 for vals in crashes_merged.ONEWAY]
crashes_merged['is_steep'] = [1 if vals > 5 else 0 for vals in crashes_merged.SLOPE_PCT]
crashes_merged['is_paved'] = [1 if vals > 50 else 0 for vals in crashes_merged.PVMTCONDINDX1]
crashes_merged['is_hwy'] = [1 if vals in [4, 5] else 0 for vals in crashes_merged.ARTCLASS]
crashes_merged['is_bus'] = [1 if vals != 0 else 0 for vals in crashes_merged.TRANCLASS]
crashes_merged['is_truck'] = [1 if vals == 'Industrial Access' else 0 for vals in crashes_merged.STREETTYPE]
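The same conversions can also be written as vectorized pandas comparisons instead of list comprehensions. This is an equivalent sketch on a small made-up table, not a change to the rules: the thresholds and category values are the ones listed above, and the toy frame `streets_demo` is hypothetical.

```python
import pandas as pd

# Hypothetical street attributes for three segments
streets_demo = pd.DataFrame({
    "ONEWAY": ["Y", "N", " "],
    "SLOPE_PCT": [4, 9, 5],
    "PVMTCONDINDX1": [87, 20, 50],
    "ARTCLASS": [2, 4, 5],
    "TRANCLASS": [1, 0, 2],
    "STREETTYPE": ["Downtown Neighborhood", "Industrial Access",
                   "Urban Village Neighborhood"],
})

# Each comparison yields a boolean Series; astype(int) turns it into 0/1
streets_demo["one_way"] = (streets_demo["ONEWAY"] == "Y").astype(int)
streets_demo["is_steep"] = (streets_demo["SLOPE_PCT"] > 5).astype(int)
streets_demo["is_paved"] = (streets_demo["PVMTCONDINDX1"] > 50).astype(int)
streets_demo["is_hwy"] = streets_demo["ARTCLASS"].isin([4, 5]).astype(int)
streets_demo["is_bus"] = (streets_demo["TRANCLASS"] != 0).astype(int)
streets_demo["is_truck"] = (streets_demo["STREETTYPE"] == "Industrial Access").astype(int)

print(streets_demo[["one_way", "is_steep", "is_paved", "is_hwy", "is_bus", "is_truck"]])
```

Both comparisons are strict, so a slope of exactly 5 and a pavement index of exactly 50 both map to 0, matching the `> 5` and `> 50` rules in the notebook.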

#### Step 5: Drop unused columns, write to .csv file

In [97]:
crashes_merged.head()

Unnamed: 0,line_i,OBJECTID,ARTCLASS,COMPKEY,UNITID,UNITID2,UNITIDSORT,UNITDESC,STNAME_ORD,XSTRLO,...,point,snap_dist,geometry,line_geo,one_way,is_steep,is_paved,is_hwy,is_bus,is_truck
0,5040.0,5041,0.0,1192,145,40,1450040,10TH AVE BETWEEN E TERRACE ST AND E JEFFERSON ST,10TH AVE,E TERRACE ST,...,POINT (-122.31942 47.60621),7.421311e-07,POINT (-122.31942 47.60621),"LINESTRING (-122.31941 47.60520, -122.31942 47...",0,0,0,0,0,0
1,5053.0,5054,2.0,1290,150,70,1500070,10TH AVE E BETWEEN E ROY W ST AND E ALOHA ST,10TH AVE E,E ROY W ST,...,POINT (-122.32023 47.62656),1.199911e-05,POINT (-122.32022 47.62656),"LINESTRING (-122.32023 47.62546, -122.32023 47...",0,0,0,0,1,0
2,22.0,23,2.0,1299,150,220,1500220,10TH AVE E BETWEEN E BOSTON ST AND E LYNN ST,10TH AVE E,E BOSTON ST,...,POINT (-122.32007 47.63867),4.512041e-06,POINT (-122.32007 47.63867),"LINESTRING (-122.32007 47.63840, -122.32006 47...",0,1,1,0,1,0
3,22.0,23,2.0,1299,150,220,1500220,10TH AVE E BETWEEN E BOSTON ST AND E LYNN ST,10TH AVE E,E BOSTON ST,...,POINT (-122.32007 47.63867),4.512041e-06,POINT (-122.32007 47.63867),"LINESTRING (-122.32007 47.63840, -122.32006 47...",0,1,1,0,1,0
4,22.0,23,2.0,1299,150,220,1500220,10TH AVE E BETWEEN E BOSTON ST AND E LYNN ST,10TH AVE E,E BOSTON ST,...,POINT (-122.32007 47.63867),4.512041e-06,POINT (-122.32007 47.63867),"LINESTRING (-122.32007 47.63840, -122.32006 47...",0,1,1,0,1,0


In [84]:
crashes_merged = crashes_merged.reset_index(drop=True)

In [98]:
desc_cols = ['line_i',
             'REPORT NUMBER',
             'DATETIME',
             'geometry',
             'line_geo']

var_cols = ['SPEEDLIMIT', 
            'SURFACEWIDTH',
            'one_way', 
            'is_steep', 
            'is_paved', 
            'is_hwy', 
            'is_bus', 
            'is_truck', 
            'is_light', 
            'is_clear', 
            'is_hit_run', 
            'is_workzone', 
            'is_child', 
            'impaired', 
            'speeding', 
            'driver_16_25', 
            'driver_65_plus']


keep_cols = desc_cols + var_cols

keep_cols.append('severity')

In [99]:
df_cleaned = crashes_merged[keep_cols]

df_cleaned.head()

Unnamed: 0,line_i,REPORT NUMBER,DATETIME,geometry,line_geo,SPEEDLIMIT,SURFACEWIDTH,one_way,is_steep,is_paved,...,is_light,is_clear,is_hit_run,is_workzone,is_child,impaired,speeding,driver_16_25,driver_65_plus,severity
0,5040.0,3773772,2019-04-19 15:52:00,POINT (-122.31942 47.60621),"LINESTRING (-122.31941 47.60520, -122.31942 47...",20.0,30.0,0,0,0,...,1,0,0,0,0,0,0,0,0,3
1,5053.0,3773784,2017-06-27 06:40:00,POINT (-122.32022 47.62656),"LINESTRING (-122.32023 47.62546, -122.32023 47...",25.0,52.0,0,0,0,...,1,0,0,0,0,0,0,0,0,3
2,22.0,E779051,2018-03-10 23:00:00,POINT (-122.32007 47.63867),"LINESTRING (-122.32007 47.63840, -122.32006 47...",25.0,42.0,0,1,1,...,0,0,1,0,0,0,0,0,0,3
3,22.0,E779051,2018-03-10 23:00:00,POINT (-122.32007 47.63867),"LINESTRING (-122.32007 47.63840, -122.32006 47...",25.0,42.0,0,1,1,...,0,0,1,0,0,0,0,0,0,3
4,22.0,E779051,2018-03-10 23:00:00,POINT (-122.32007 47.63867),"LINESTRING (-122.32007 47.63840, -122.32006 47...",25.0,42.0,0,1,1,...,0,0,1,0,0,0,0,0,0,3


In [102]:
df_cleaned.to_csv('data/crash_streets.csv', index=False)