## Bike Index Seattle - Data Prep

### Data cleaning for crash data

##### Objective: Recreate the study by Allen-Munley et al. (2004) for Seattle using WSDOT crash data.


#### Part 1.

The crash data .csv file from WSDOT contains all crash reports for the entire state of Washington. I will clean up the .csv file to keep only collisions involving bicycles, and will focus only within the Seattle city limits.   

I will also keep just the explanatory variables used in the study, and will further dummy code the categorical ones as binary variables.  

Collision severity will be mapped to a 1-4 severity index similar to what was used in Allen-Munley et al.'s study.  

In [79]:
!git push origin main

To https://github.com/emi90/bike-index.git
   a92dbff..0d75b31  main -> main


In [1]:
import numpy as np
import pandas as pd
import os
import folium
from folium import plugins

#### Step 1 - Load data

- Read .csv file
- Filter to just relevant subset (bike collisions, Seattle only)

In [2]:
crash_data = pd.read_csv("../../UW/DATA511/Final Project/20201103Yamauchi_All_roads_Statewide_SRFF.csv")

  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,


In [3]:
crash_data.head()

Unnamed: 0,JURISDICTION,COUNTY,CITY,REPORT NUMBER,INDEXED PRIMARY TRAFFICWAY,PRIMARY TRAFFICWAY,BLOCK NUMBER,MILEPOST,A/B,INTERSECTING TRAFFICWAY,...,TZ Heavy Vehicle Crash Indicator,TZ Heavy Vehicle Crash Count,TZ Vehicle Train Crash Indicator,TZ Catostrophic Event Indicator,TZ Fatal Crash Indicator,TZ Fatality Count,TZ Suspected Serious Injury Crash Indicator,TZ Suspected Serious Injury Count,TZ Pedestrian Involved Indicator,TZ Pedacyclist Involved Indicator
0,City Street,Adams,Othello,E713622,ALLEY E OF S 12TH AV,ALLEY E OF S 12TH AVE,400.0,,,,...,0,0,0,0,0,0,0,0,0,0
1,City Street,Adams,Othello,E999637,ALLEYWAY NORTH OF MA,ALLEYWAY NORTH OF MAIN,900.0,,,,...,0,0,0,0,0,0,0,0,0,0
2,City Street,Adams,Othello,E962138,ASH ST,ASH ST,1200.0,,,,...,0,0,0,0,0,0,0,0,0,0
3,City Street,Adams,Othello,EA21607,CAPSTONE AVE,CAPSTONE AVE,1000.0,,,,...,0,0,0,0,0,0,0,0,0,0
4,City Street,Adams,Othello,E916903,CAPSTONE AVE,CAPSTONE AVE,0.0,,,GEMSTONE ST,...,0,0,0,0,0,0,0,0,0,0


In [4]:
crash_data.describe()

Unnamed: 0,MILEPOST,DIST FROM REF POINT,YEAR,TOTAL CRASHES,FATAL CRASHES,SERIOUS INJURY CRASHES,EVIDENT INJURY CRASHES,POSSIBLE INJURY CRASHES,PDO - NO INJURY CRASHES,TOTAL FATALITIES,...,TZ Heavy Vehicle Crash Indicator,TZ Heavy Vehicle Crash Count,TZ Vehicle Train Crash Indicator,TZ Catostrophic Event Indicator,TZ Fatal Crash Indicator,TZ Fatality Count,TZ Suspected Serious Injury Crash Indicator,TZ Suspected Serious Injury Count,TZ Pedestrian Involved Indicator,TZ Pedacyclist Involved Indicator
count,256574.0,95914.0,448711.0,448711.0,448711.0,448711.0,448711.0,448711.0,448711.0,448711.0,...,448711.0,448711.0,448711.0,448711.0,448711.0,448711.0,448711.0,448711.0,448711.0,448711.0
mean,48.790287,151.375624,2018.021339,1.0,0.004593,0.017392,0.069925,0.210811,0.697295,0.004896,...,0.056306,0.059731,0.000368,0.000432,0.004569,0.004872,0.017417,0.020271,0.019436,0.010795
std,79.105791,151.958862,1.186822,0.0,0.067617,0.130727,0.25502,0.407885,0.459429,0.074766,...,0.230511,0.252169,0.019173,0.020789,0.067437,0.074603,0.130818,0.158605,0.138051,0.103339
min,-0.57,0.0,2016.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.25,52.0,2017.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,8.8,107.0,2018.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,52.73,203.0,2019.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,765.0,4558.0,2020.0,1.0,1.0,1.0,1.0,1.0,1.0,4.0,...,1.0,9.0,1.0,1.0,1.0,4.0,1.0,6.0,1.0,1.0


In [5]:
# Filter to collisions occuring in Seattle

df = crash_data.loc[crash_data['CITY'] == 'Seattle']

df.describe()

Unnamed: 0,MILEPOST,DIST FROM REF POINT,YEAR,TOTAL CRASHES,FATAL CRASHES,SERIOUS INJURY CRASHES,EVIDENT INJURY CRASHES,POSSIBLE INJURY CRASHES,PDO - NO INJURY CRASHES,TOTAL FATALITIES,...,TZ Heavy Vehicle Crash Indicator,TZ Heavy Vehicle Crash Count,TZ Vehicle Train Crash Indicator,TZ Catostrophic Event Indicator,TZ Fatal Crash Indicator,TZ Fatality Count,TZ Suspected Serious Injury Crash Indicator,TZ Suspected Serious Injury Count,TZ Pedestrian Involved Indicator,TZ Pedacyclist Involved Indicator
count,16286.0,14005.0,45793.0,45793.0,45793.0,45793.0,45793.0,45793.0,45793.0,45793.0,...,45793.0,45793.0,45793.0,45793.0,45793.0,45793.0,45793.0,45793.0,45793.0,45793.0
mean,86.896353,116.860425,2017.94949,1.0,0.002162,0.016334,0.080864,0.244732,0.655908,0.002206,...,0.067303,0.071321,0.001223,0.000131,0.002162,0.002206,0.016334,0.017514,0.042408,0.029131
std,79.161309,106.045908,1.159778,0.0,0.046446,0.126759,0.272629,0.429933,0.475076,0.047834,...,0.250549,0.273888,0.034949,0.011446,0.046446,0.047834,0.126759,0.139252,0.201521,0.168176
min,0.0,0.0,2016.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,1.22,50.0,2017.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,39.945,98.0,2018.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,166.08,150.0,2019.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,174.57,4000.0,2020.0,1.0,1.0,1.0,1.0,1.0,1.0,2.0,...,1.0,5.0,1.0,1.0,1.0,2.0,1.0,3.0,1.0,1.0


In [69]:
# Filter to collisions involving bicycles

df_bike = df.loc[df['TOTAL BICYCLISTS INVOLVED'] > 0].reset_index(drop=True)

df_bike.head()

Unnamed: 0,JURISDICTION,COUNTY,CITY,REPORT NUMBER,INDEXED PRIMARY TRAFFICWAY,PRIMARY TRAFFICWAY,BLOCK NUMBER,MILEPOST,A/B,INTERSECTING TRAFFICWAY,...,TZ Heavy Vehicle Crash Indicator,TZ Heavy Vehicle Crash Count,TZ Vehicle Train Crash Indicator,TZ Catostrophic Event Indicator,TZ Fatal Crash Indicator,TZ Fatality Count,TZ Suspected Serious Injury Crash Indicator,TZ Suspected Serious Injury Count,TZ Pedestrian Involved Indicator,TZ Pedacyclist Involved Indicator
0,City Street,King,Seattle,3773772,10TH AVE,10TH AVE,0.0,,,E JEFFERSON ST,...,0,0,0,0,0,0,0,0,0,1
1,City Street,King,Seattle,3773784,10TH AVE E,10TH AVE E,700.0,,,,...,0,0,0,0,0,0,0,0,0,1
2,City Street,King,Seattle,E779051,10TH AVE E,10TH AVE E,2100.0,,,,...,0,0,0,0,0,0,0,0,0,1
3,City Street,King,Seattle,3773767,10TH AVE E,10TH AVE E,1900.0,,,,...,0,0,0,0,0,0,0,0,0,1
4,City Street,King,Seattle,EA10570,10TH AVE E,10TH AVE E,600.0,,,,...,0,0,0,0,0,0,0,0,0,1


#### Step 2: Check map visualization

- Convert state plane coordinates to Lat/Lon coordinates
- Visualize on map coordinates

In [7]:
from pyproj import Transformer

In [70]:
# Convert State Plane coordinates to lat/long coordinates

x,y = df_bike['WA STATE PLANE SOUTH - X'], df_bike['WA STATE PLANE SOUTH - Y']
transformer = Transformer.from_crs('epsg:2286','epsg:4326')
lat,long = transformer.transform(x, y)

df_bike['LONGITUDE'] = long
df_bike['LATITUDE'] = lat

In [9]:
m = folium.Map(location = [47.6062, -122.3321], zoom_start = 13)
accidents = plugins.MarkerCluster().add_to(m)

for i in range(len(df_bike)):
    lat = df_bike['LATITUDE']
    long = df_bike['LONGITUDE']
    folium.CircleMarker(
        location = [lat[i], long[i]],
        color = 'red',
        fill = True,
    ).add_to(accidents)

m

#### Step 3: Choose variables and convert data

The Allen-Munley et al. study uses the following variables for their study.  

![Explanatory Variables](variables.png)  

The variables in the study coming from the crash data and corresponding attribute name are as follows:

- Weather: `WEATHER`
- Daylight: `LIGHTING CONDITIONS`
- Child: `UNIT 2 PEDESTRIAN AGE`, `UNIT 3 PEDESTRIAN AGE`, `UNIT 1 BICYCLIST AGE`, `UNIT 2 BICYCLIST AGE`, `UNIT 3 BICYCLIST AGE`


Additional variables not present in the study but in the WSDOT data are as follows:

- `HIT & RUN` (Binary yes/no)
- `ROAD SURFACE CONDITIONS` (Wet, dry, etc.)
- `WORKZONE`
- `History/Suspense Ind`
- [Target Zero](https://targetzero.com/) indicators

**Converting data**

- Convert date/time columns to single datetime variable
- Convert categorical data into binary dummy variables:
    - Road conditions: 1 if `Dry` else 0
    - Lighting conditions: 1 if `Daylight` else 0
    - Weather: 1 if `Clear` or `Clear or Partly Cloudy` else 0
    - Hit & Run: 1 if `Yes` else 0
    - Workzone: 1 if `Within Workzone` else 0
    - Child: 1 if any of the age variables are >16 else 0
- TZ indicators are already binary- rename columns only

In [71]:
# Conert date/time columns to single datetime variable

df_bike["DATETIME"] = pd.to_datetime(df_bike["DATE"] + " " + df_bike["24 HR TIME"])

In [72]:
def get_binary_indicator(col_name, yes_vals):
    """
    Helper function to get 1/0 binary indicator
    For given variable column
    """
    
    if isinstance(yes_vals, list):
        bin_vals = [1 if vals in yes_vals else 0 for vals in df_bike[col_name]]
    else:
        bin_vals = [1 if vals==yes_vals else 0 for vals in df_bike[col_name]]
    
    return bin_vals


def get_binary_dist(bin_vals):
    
    return sum(bin_vals)/len(bin_vals)

In [73]:
# Road Conditions

df_bike['ROAD SURFACE CONDITIONS'].unique()

array(['Wet', 'Dry', 'Unknown', 'Ice', 'Standing Water', 'Snow/Slush'],
      dtype=object)

In [74]:
df_bike['is_dry'] = get_binary_indicator('ROAD SURFACE CONDITIONS', 'Dry')
print('%is_dry: ',get_binary_dist(df_bike['is_dry']))

%is_dry:  0.823088455772114


In [75]:
# Lighting Conditions

df_bike['LIGHTING CONDITIONS'].unique()

array(['Daylight', 'Dark-Street Lights On', 'Dusk', 'Dawn', 'Other',
       'Dark-No Street Lights', 'Dark - Unknown Lightin', 'Unknown',
       'Dark-Street Lights Off'], dtype=object)

In [76]:
df_bike['is_light'] = get_binary_indicator('LIGHTING CONDITIONS', 'Daylight')

print('%is_light: ', get_binary_dist(df_bike['is_light']))

%is_light:  0.7758620689655172


In [77]:
# Weather

df_bike['WEATHER'].unique()

array(['Raining', 'Clear or Partly Cloudy', 'Unknown', 'Overcast',
       'Clear', 'Other', 'Blowing Sand or Dirt or Snow',
       'Fog or Smog or Smoke', nan, 'Snowing'], dtype=object)

In [78]:
df_bike['is_clear'] = get_binary_indicator('WEATHER', ['Clear or Partly Cloudty', 'Clear'])

print('%is_clear: ', get_binary_dist(df_bike['is_clear']))

%is_clear:  0.07421289355322339


In [79]:
# Hit & Run

df_bike['HIT & RUN'].unique()

array(['No', 'Yes'], dtype=object)

In [80]:
df_bike['is_hit_run'] = get_binary_indicator('HIT & RUN', 'Yes')

print('%is_hit_run: ', get_binary_dist(df_bike['is_hit_run']))

%is_hit_run:  0.15367316341829085


In [81]:
# Workzone

df_bike['WORKZONE'].unique()

array([nan, 'Within WorkZone'], dtype=object)

In [82]:
df_bike['is_workzone'] = get_binary_indicator('WORKZONE', 'Within WorkZone')

print('%is_workzone: ', get_binary_dist(df_bike['is_workzone']))

%is_workzone:  0.004497751124437781


In [83]:
# Age data - check availability 

age_cols = [col for col in df_bike.columns if ' AGE' in col]
df_bike[age_cols].describe()

Unnamed: 0,VEH 1 MV DRIVER AGE,VEH 2 MV DRIVER AGE,VEH 3 MV DRIVER AGE,UNIT 2 PEDESTRIAN AGE,UNIT 3 PEDESTRIAN AGE,UNIT 1 BICYCLIST AGE,UNIT 2 BICYCLIST AGE,UNIT 3 BICYCLIST AGE
count,936.0,169.0,8.0,27.0,1.0,193.0,1042.0,6.0
mean,43.035256,42.183432,45.375,46.444444,43.0,37.305699,37.677543,46.666667
std,15.532331,15.179739,13.958484,19.05525,,13.682212,13.532671,14.988885
min,16.0,16.0,24.0,20.0,43.0,9.0,5.0,26.0
25%,30.0,31.0,36.5,26.5,43.0,27.0,28.0,36.25
50%,41.0,40.0,47.0,45.0,43.0,34.0,34.5,49.0
75%,55.0,54.0,53.75,59.5,43.0,47.0,47.0,58.75
max,90.0,77.0,66.0,87.0,43.0,73.0,78.0,62.0


In [84]:
for col in age_cols:
    cNan = df_bike[col].isna().sum()
    print(col, ' %Nan: ', cNan/len(df_bike))

VEH 1 MV DRIVER AGE  %Nan:  0.2983508245877061
VEH 2 MV DRIVER AGE  %Nan:  0.8733133433283359
VEH 3 MV DRIVER AGE  %Nan:  0.9940029985007496
UNIT 2 PEDESTRIAN AGE  %Nan:  0.97976011994003
UNIT 3 PEDESTRIAN AGE  %Nan:  0.9992503748125937
UNIT 1 BICYCLIST AGE  %Nan:  0.8553223388305847
UNIT 2 BICYCLIST AGE  %Nan:  0.21889055472263869
UNIT 3 BICYCLIST AGE  %Nan:  0.9955022488755623


In [85]:
veh1 = [col for col in df_bike if 'VEH 1' in col]

df_bike.loc[df_bike['UNIT 1 BICYCLIST AGE'].isna()==False][veh1].head()

Unnamed: 0,"SR ONLY, VEH 1 MILEPOST DIRECTION","SR ONLY, VEH 1 MOVEMENT",VEH 1 TYPE,VEH 1 MAKE,VEH 1 MODEL,VEH 1 STYLE,VEH 1 ACTION,VEH 1 COMPASS DIRECTION FROM,VEH 1 COMPASS DIRECTION TO,VEH 1 USAGE,...,VEH 1 MOTORCYCLE PASSENGER INJURY TYPE,VEH 1 MV DRIVER RESTRAINT,VEH 1 MV DRIVER EJECTION,VEH 1 MV DRIVER MISC ACTION 1,VEH 1 MV DRIVER MISC ACTION 2,VEH 1 MV DRIVER MISC ACTION 3,VEH 1 MV DRIVER SEQUENCE 1,VEH 1 MV DRIVER SEQUENCE 2,VEH 1 MV DRIVER SEQUENCE 3,VEH 1 MV DRIVER SEQUENCE 4
17,,,,,,,,,,,...,,,,,,,,,,
21,,,,,,,,,,,...,,,,,,,,,,
26,,,,,,,,,,,...,,,,,,,,,,
53,,,,,,,,,,,...,,,,,,,,,,
57,,,,,,,,,,,...,,,,,,,,,,


Roughly 80% of the collision data for bikes include bicyclist age- `UNIT 1 BICYCLIST` attributes are for collisions where the primary vehicle is a bicycle, not motor vehicle, hence the large number of NaNs. 

In [86]:
bike_age_col = [col for col in df_bike if 'BICYCLIST AGE' in col]
is_child = df_bike[bike_age_col].min(axis=1) <= 16
df_bike['is_child'] = is_child*1 

In [87]:
# History/Suspense Indicator

df_bike['History/Suspense Ind'].unique()

array(['No'], dtype=object)

No unique indicators for `History/Suspense Ind`

In [88]:
# TZ Indicators

df_bike = df_bike.rename(columns = {'TZ Impaired Involved Person Indicator':'impaired',
                                   'TZ Speeding Driver Indicator':'speeding',
                                   'TZ MV Driver 16 To 25 Years Involved Person Indicator':'driver_16_25',
                                   'TZ MV Driver 65 Plus Years Involved Person Indicator':'driver_65_plus'})

print('%impaired: ', get_binary_dist(df_bike['impaired']))
print('%speeding: ', get_binary_dist(df_bike['speeding']))
print('%16-25 driver: ', get_binary_dist(df_bike['driver_16_25']))
print('%65+ driver: ', get_binary_dist(df_bike['driver_65_plus']))

%impaired:  0.01649175412293853
%speeding:  0.0014992503748125937
%16-25 driver:  0.11469265367316342
%65+ driver:  0.08320839580209895


#### Step 4: Map respnse variable to severity index

The Allen-Munley et al. study uses a 1-3 index (there were no fatalities in their sample).  

![Severity Index Distribution](severity_index.png)  

The injuries will be mapped using the following index: 
- `PDO - NO INJURY CRASHES` : 1
- `POSSIBLE INJURIES` : 2
- `EVIDENT INJURIES` : 3
- `SERIOUS INJURIES` : 3
- `FATALITIES` : 4

In [89]:
severity_cols = [
    'PDO - NO INJURY CRASHES',
    'TOTAL FATALITIES',
    'TOTAL SERIOUS INJURIES',
    'TOTAL EVIDENT INJURIES',
    'TOTAL POSSIBLE INJURIES'
]


severity_dict = {
    'PDO - NO INJURY CRASHES' : 1,
    'TOTAL POSSIBLE INJURIES' : 2,
    'TOTAL EVIDENT INJURIES' : 3,
    'TOTAL SERIOUS INJURIES' : 3,
    'TOTAL FATALITIES' : 4
}

In [90]:
severity_df = df_bike[severity_cols].copy()

sev_sers = pd.Series(severity_df.columns[np.where(np.array(severity_df)!=0)[1]])

severity_df.isna().sum()

PDO - NO INJURY CRASHES    0
TOTAL FATALITIES           0
TOTAL SERIOUS INJURIES     0
TOTAL EVIDENT INJURIES     0
TOTAL POSSIBLE INJURIES    0
dtype: int64

In [91]:
sev_cat = sev_sers.map(severity_dict)

df_bike['severity'] = sev_cat

#### Step 5: Drop unused columns, write to .csv file


- Drop unused columns from original dataframe
- Write to .csv file to be merged with street data

In [92]:
df_bike.columns

Index(['JURISDICTION', 'COUNTY', 'CITY', 'REPORT NUMBER',
       'INDEXED PRIMARY TRAFFICWAY', 'PRIMARY TRAFFICWAY', 'BLOCK NUMBER',
       'MILEPOST', 'A/B ', 'INTERSECTING TRAFFICWAY',
       ...
       'LONGITUDE', 'LATITUDE', 'DATETIME', 'is_dry', 'is_light', 'is_clear',
       'is_hit_run', 'is_workzone', 'is_child', 'severity'],
      dtype='object', length=265)

In [95]:
# Descriptive columns- primary/intersecting trafficway will be used later to compare to street data

desc_cols = ['REPORT NUMBER',
             'PRIMARY TRAFFICWAY',
             'INTERSECTING TRAFFICWAY',
             'DATETIME',
             'LONGITUDE',
             'LATITUDE']

# Variable columns - binary indicators from previous steps

var_cols = ['is_dry',
            'is_light',
            'is_clear',
            'is_hit_run',
            'is_workzone',
            'is_child',
            'impaired',
            'speeding',
            'driver_16_25',
            'driver_65_plus']


keep_cols = desc_cols + var_cols

# Append response variable

keep_cols.append('severity')

In [98]:
df_bike_clean = df_bike[keep_cols]

df_bike_clean.head()

Unnamed: 0,REPORT NUMBER,PRIMARY TRAFFICWAY,INTERSECTING TRAFFICWAY,DATETIME,LONGITUDE,LATITUDE,is_dry,is_light,is_clear,is_hit_run,is_workzone,is_child,impaired,speeding,driver_16_25,driver_65_plus,severity
0,3773772,10TH AVE,E JEFFERSON ST,2019-04-19 15:52:00,-122.319415,47.606207,0,1,0,0,0,0,0,0,0,0,3
1,3773784,10TH AVE E,,2017-06-27 06:40:00,-122.320233,47.626563,1,1,0,0,0,0,0,0,0,0,3
2,E779051,10TH AVE E,,2018-03-10 23:00:00,-122.320074,47.638667,0,0,0,1,0,0,0,0,0,0,3
3,3773767,10TH AVE E,,2017-07-02 15:13:00,-122.320092,47.636485,1,1,0,0,0,0,0,0,0,0,3
4,EA10570,10TH AVE E,,2020-01-31 13:45:00,-122.319879,47.624504,0,1,0,0,0,0,0,0,0,0,3


In [54]:
keep_cols = [
    'REPORT NUMBER','BLOCK NUMBER','PRIMARY TRAFFICWAY','INTERSECTING TRAFFICWAY','DATE','24 HR TIME',
    'TOTAL FATALITIES','TOTAL SERIOUS INJURIES','TOTAL EVIDENT INJURIES','TOTAL POSSIBLE INJURIES','PDO - NO INJURY CRASHES',
    'TOTAL VEHICLES','TOTAL PEDESTRIANS INVOLVED','TOTAL BICYCLISTS INVOLVED',
    'UNIT 2 PEDESTRIAN AGE', 'UNIT 3 PEDESTRIAN AGE', 'UNIT 1 BICYCLIST AGE', 'UNIT 2 BICYCLIST AGE', 'UNIT 3 BICYCLIST AGE',
    'WEATHER','ROAD SURFACE CONDITIONS','LIGHTING CONDITIONS','HIT & RUN',
    'WA STATE PLANE SOUTH - X','WA STATE PLANE SOUTH - Y'
]

In [55]:
df_bike = df_bike[df_bike.columns.intersection(keep_cols)].reset_index(drop = True)

In [47]:
df_bike.columns

Index(['REPORT NUMBER', 'PRIMARY TRAFFICWAY', 'BLOCK NUMBER',
       'INTERSECTING TRAFFICWAY', 'DATE', '24 HR TIME',
       'PDO - NO INJURY CRASHES', 'TOTAL FATALITIES', 'TOTAL SERIOUS INJURIES',
       'TOTAL EVIDENT INJURIES', 'TOTAL POSSIBLE INJURIES', 'TOTAL VEHICLES',
       'TOTAL PEDESTRIANS INVOLVED', 'TOTAL BICYCLISTS INVOLVED', 'WEATHER',
       'ROAD SURFACE CONDITIONS', 'LIGHTING CONDITIONS', 'HIT & RUN',
       'WA STATE PLANE SOUTH - X', 'WA STATE PLANE SOUTH - Y', 'LONGITUDE',
       'LATITUDE', 'DATETIME', 'is_dry', 'is_light', 'is_clear', 'is_hit_run',
       'severity'],
      dtype='object')

In [48]:
keep_cols_2 = [
    'REPORT NUMBER','BLOCK NUMBER','PRIMARY TRAFFICWAY','INTERSECTING TRAFFICWAY','DATETIME','LATITUDE','LONGITUDE',
    'severity','is_dry','is_light','is_clear','is_hit_run',
    'TOTAL VEHICLES','TOTAL PEDESTRIANS INVOLVED','TOTAL BICYCLISTS INVOLVED'
]

In [49]:
df_bike = df_bike[keep_cols_2].copy()

In [50]:
df_bike.head()

Unnamed: 0,REPORT NUMBER,BLOCK NUMBER,PRIMARY TRAFFICWAY,INTERSECTING TRAFFICWAY,DATETIME,LATITUDE,LONGITUDE,severity,is_dry,is_light,is_clear,is_hit_run,TOTAL VEHICLES,TOTAL PEDESTRIANS INVOLVED,TOTAL BICYCLISTS INVOLVED
0,3773772,0.0,10TH AVE,E JEFFERSON ST,2019-04-19 15:52:00,47.606207,-122.319415,3,0,1,0,0,1,0,1
1,3773784,700.0,10TH AVE E,,2017-06-27 06:40:00,47.626563,-122.320233,3,1,1,1,0,1,0,1
2,E779051,2100.0,10TH AVE E,,2018-03-10 23:00:00,47.638667,-122.320074,3,0,0,0,1,1,0,1
3,3773767,1900.0,10TH AVE E,,2017-07-02 15:13:00,47.636485,-122.320092,3,1,1,1,0,1,0,1
4,EA10570,600.0,10TH AVE E,,2020-01-31 13:45:00,47.624504,-122.319879,3,0,1,0,0,1,0,1


#### Step 5: Write to .csv

Write current cleaned dataframe to .csv file.  
I will next gather roadway data to merge to this dataframe to replicate the road-specific variables used in the study.

In [51]:
df_bike.to_csv('data/bike_crash.csv', index=False)