# Discovery notebook for cycling crash data

In [11]:
import pandas as pd

DATA1 = "../data/cycling_safety_louisville.csv"
# DATA1 points to crash data from 2010 to 2017
# This data came from: https://zenodo.org/records/5603036
# Source: https://zenodo.org/records/5603036/files/louisville.zip
# Shape: 1273 rows x 54 columns

DATA2 = "../data/Louisville_Metro_KY_-_Traffic_Fatalities_and_Suspected_Serious_Injuries.csv"
# DATA2 points to crash data from 2016 to 2023.
# This data came from the Louisville Open Data portal
# Source: https://data.louisvilleky.gov/datasets/LOJIC::louisville-metro-ky-traffic-fatalities-and-suspected-serious-injuries-1/explore
# Shape: 4901 rows x 44 columns


In [12]:
df1 = pd.read_csv(DATA1)
df2 = pd.read_csv(DATA2)

# Building the data dictionaries


## cycling_safety data
| column name | type | description | value notes | cleaning notes | merge notes | 
|-------------|------|-------------|-------------|----------------|-------------|
|  Unnamed: 0 | number | index number for row | | ignore | |
| MASTER FILE NUMBER | number | case number for authorities? | | possibly ignore | can use to compare overlap |
| INVESTIGATING AGENCY | string | agency responding to crash | usually LMPD, but others too | get valuecounts | |
| LOCAL CODE | number | case number for local authority? | | possibly ignore | useful for comparison? |
| COLLISION STATUS CODE | string | code for case status | usually AC | get table for code meaning; it exists | |
| COUNTY NAME | number | county crash occurred in | should all be the same: 56: Jefferson County | check if there are weird values | |
| ROADWAY NUMBER | alphanumeric | code for state/county/etc roads like KY303 | a lot of null values |  | do these match the numbers in the other dataset? |
| BLOCK/HOUSE # | | | | | |
| ROADWAY NAME | string | name of primary road where the crash occurred | | | |
| ROADWAY SUFFIX | string | RD, AVE, LN, WAY, stuff like that | | | |
| ROADWAY DIRECTION CODE | string | S, N, E, W, stuff like that | | | |
| GPS LATITUDE DECIMAL | float/decimal | latitude coordinate | same as "Latitude"? | redundant? | |
| GPS LONGITUDE DECIMAL | float/decimal | longitude coordinate | same as "Longitude"? | redundant? | |
| MILEPOINT DERIVED | | | | | |
| COLLISION DATE | datestring | date of collision | 2010-2017 | | same as CollisionDate in DATA2? |
| COLLISION TIME | | | | | |
| INTERSECTION ROADWAY # | | | | | |
| INTERSECTION ROADWAY NAME | | | | | |
| INTERSECTION ROADWAY SFX | | | | | |
|  BETWEEN STREET ROADWAY # 1 | | | | | |
| BETWEEN STREET ROADWAY NAME 1 | | | | | |
| BETWEEN STREET ROADWAY SFX 1 | | | | | |
| BETWEEN STREET ROADWAY # 2 | | | | | |
| BETWEEN STREET ROADWAY NAME 2 | | | | | |
| BETWEEN STREET ROADWAY SFX 2 | | | | | |
| UNITS INVOLVED | | | | | |
| MOTOR VEHICLES INVOLVED | | | | | |
| KILLED | bool? | did the crash kill someone? | | | |
| INJURED | bool? | was someone injured in the crash? | | | |

Note:
Columns with names like "A CODE" and "A" below all follow the same pattern:

| column name | type | description | value notes | cleaning notes | merge notes |
|-------------|------|-------------|-------------|----------------|-------------|
| A CODE | number | condition code | small int | 1:1 map to A | ... |
| A | string | human readable condition | short string | 1:1 map to A CODE | |

Back to the data dictionary:



| # | column name | type | description | value notes | cleaning notes | merge notes |
|---|-------------|------|-------------|-------------|----------------|-------------|
| 30 | WEATHER CODE | number | numeric code for WEATHER condition | maps 1:1 with WEATHER | redundant? | same weather codes in other data? |
| 31 | WEATHER | string | human readable for WEATHER CODE | maps 1:1 with WEATHER CODE | | |
| 32 | ROADWAY CONDITION CODE | number | numeric code for ROADWAY CONDITION | | | |
| 33 | ROADWAY CONDITION | string | human readable for ROADWAY CONDITION CODE | | | |
| 34 | HIT & RUN INDICATOR | bool? | was the crash a hit and run?  | | | |
| 35 | ROADWAY TYPE CODE | number | numeric code for ROADWAY TYPE | | | |
| 36 | ROADWAY TYPE | string | human readable for ROADWAY TYPE CODE | | | |
| 37 | DIRECTIONAL ANALYSIS CODE | number | numeric code for DIRECTIONAL ANALYSIS | | | |
| 38 | DIRECTIONAL ANALYSIS | string | human readable for DIRECTIONAL ANALYSIS CODE  | | | |
| 39 | MANNER OF COLLISION CODE | string | numeric code for MANNER OF COLLISION  | | | |
| 40 | MANNER OF COLLISION | | | | | |
| 41 | ROADWAY CHARACTER CODE | | | | | |
| 42 | ROADWAY CHARACTER | | | | | |
| 43 | LIGHT CONDITION CODE | | | | | |
| 44 | LIGHT CONDITION | | | | | |
| 45 | RAMP FROM ROADWAY ID | | | | | |
| 46 | RAMP TO ROADWAY ID | | | | | |
| 47 | SECONDARY COLLISION INDICATOR | bool? | was this collision a result of another collision | probably boolean; check that | is this always 0/False? | |
| 48 | hour | int | hour of collision | 24 hour clock | isn't this a repeat of info in COLLISION TIME? | |
| 49 | minute | int | minute of collision | normal minutes | isn't this a repeat of info in COLLISION TIME? | |
| 50 | Date | datestring | another date field? | do these values match with 'COLLISION DATE'? | figure out what the difference is between this field and COLLISION DATE, if any. If they're redundant, ignore one of them| figure out which of these dates is relevant; compare the overlap here with the other set to confirm |
| 51 | Latitude | float | latitude coordinate of crash site | within boundary range for Jefferson County | redundant? | |
| 52 | Longitude | float | longitude coordinate of crash site | within boundary range for Jefferson County | redundant? | |
| 53 | geometry | POINT | point location in longitude, latitude form | repeated data elsewhere | possibly ignore; can reconstruct from other fields | are any points repeated in the other dataset? |
| 54 | index_right | number | another index value | always 0 | ignore | |


## Exploring DATA1

In [13]:
# List all the columns
df1.columns

Index(['Unnamed: 0', 'MASTER FILE NUMBER', 'INVESTIGATING AGENCY',
       'LOCAL CODE', 'COLLISION STATUS CODE', 'COUNTY NAME', 'ROADWAY NUMBER',
       'BLOCK/HOUSE #', 'ROADWAY NAME', 'ROADWAY SUFFIX',
       'ROADWAY DIRECTION CODE', 'GPS LATITUDE DECIMAL',
       'GPS LONGITUDE DECIMAL', 'MILEPOINT DERIVED', 'COLLISION DATE',
       'COLLISION TIME', 'INTERSECTION ROADWAY #', 'INTERSECTION ROADWAY NAME',
       'INTERSECTION ROADWAY SFX', 'BETWEEN STREET ROADWAY # 1',
       'BETWEEN STREET ROADWAY NAME 1', 'BETWEEN STREET ROADWAY SFX 1',
       'BETWEEN STREET ROADWAY # 2', 'BETWEEN STREET ROADWAY NAME 2',
       'BETWEEN STREET ROADWAY SFX 2', 'UNITS INVOLVED',
       'MOTOR VEHICLES INVOLVED', 'KILLED', 'INJURED', 'WEATHER CODE',
       'WEATHER', 'ROADWAY CONDITION CODE', 'ROADWAY CONDITION',
       'HIT & RUN INDICATOR', 'ROADWAY TYPE CODE', 'ROADWAY TYPE',
       'DIRECTIONAL ANALYSIS CODE', 'DIRECTIONAL ANALYSIS',
       'MANNER OF COLLISION CODE', 'MANNER OF COLLISION',
   

In [14]:
# Some of the column names can be organized into groups of related data
# Date / Time columns
['COLLISION DATE', 'COLLISION TIME', 'hour', 'minute', 'Date']
# Geolocation columns
['GPS LATITUDE DECIMAL', 'GPS LONGITUDE DECIMAL', 'Latitude', 'Longitude', 'geometry']
# Address columns
['COUNTY NAME', 'ROADWAY NUMBER', 'BLOCK/HOUSE #', 'ROADWAY NAME', 'ROADWAY SUFFIX',
    'ROADWAY DIRECTION CODE', 'MILEPOINT DERIVED','INTERSECTION ROADWAY #', 'INTERSECTION ROADWAY NAME',
       'INTERSECTION ROADWAY SFX', 'BETWEEN STREET ROADWAY # 1',
       'BETWEEN STREET ROADWAY NAME 1', 'BETWEEN STREET ROADWAY SFX 1',
       'BETWEEN STREET ROADWAY # 2', 'BETWEEN STREET ROADWAY NAME 2',
       'BETWEEN STREET ROADWAY SFX 2', 'RAMP FROM ROADWAY ID', 'RAMP TO ROADWAY ID']
# Code columns
['LOCAL CODE', 'COLLISION STATUS CODE', 'WEATHER CODE', 
 'ROADWAY CONDITION CODE', 'ROADWAY TYPE CODE', 'DIRECTIONAL ANALYSIS CODE',
 'MANNER OF COLLISION CODE', 'ROADWAY CHARACTER CODE', 'LIGHT CONDITION CODE']
# Some codes are paired with a human-readable string
['WEATHER', 'ROADWAY CONDITION', 'ROADWAY TYPE', 'DIRECTIONAL ANALYSIS', 'MANNER OF COLLISION',
     'ROADWAY CHARACTER', 'LIGHT CONDITION']
# Other codes are standalone:
['LOCAL CODE', 'COLLISION STATUS CODE']

# Other info
['Unnamed: 0', 'MASTER FILE NUMBER', 'INVESTIGATING AGENCY', 'LOCAL CODE', 'COLLISION STATUS CODE',
  'COUNTY NAME','UNITS INVOLVED', 'MOTOR VEHICLES INVOLVED', 'KILLED', 'INJURED', 'HIT & RUN INDICATOR',
  'SECONDARY COLLISION INDICATOR', 'index_right']


['Unnamed: 0',
 'MASTER FILE NUMBER',
 'INVESTIGATING AGENCY',
 'LOCAL CODE',
 'COLLISION STATUS CODE',
 'COUNTY NAME',
 'UNITS INVOLVED',
 'MOTOR VEHICLES INVOLVED',
 'KILLED',
 'INJURED',
 'HIT & RUN INDICATOR',
 'SECONDARY COLLISION INDICATOR',
 'index_right']

### Date and time columns
`['COLLISION DATE', 'COLLISION TIME', 'hour', 'minute', 'Date']`

In [30]:
# COLLISION DATE is stored as a string, so its min/max are lexicographic, not chronological
df1[['Date', 'COLLISION DATE']].agg(("min", "max"))

Unnamed: 0,Date,COLLISION DATE
min,2010-01-13 10:00:00,1/10/2011
max,2017-12-22 21:51:00,9/9/2016
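
The `COLLISION DATE` range above (`1/10/2011` to `9/9/2016`) looks wrong because the column holds strings, which are compared character by character. A minimal sketch of parsing the column before aggregating, assuming every value follows the month/day/year pattern seen in this notebook. `sample` is a hypothetical stand-in for `df1`, built from date strings shown in the outputs here.

```python
import pandas as pd

# Hypothetical stand-in for df1, using date strings that appear in this notebook.
sample = pd.DataFrame(
    {"COLLISION DATE": ["2/20/2010", "1/13/2010", "1/10/2011", "9/9/2016", "12/21/2017"]}
)

# String comparison is lexicographic, so "9/9/2016" sorts after "12/21/2017".
string_range = sample["COLLISION DATE"].agg(["min", "max"])

# Parse with an explicit format (assumed: month/day/year) before aggregating.
parsed = pd.to_datetime(sample["COLLISION DATE"], format="%m/%d/%Y")
date_range = parsed.agg(["min", "max"])

print(string_range.tolist())
print(date_range.tolist())
```

If the parsed range agrees with the range of `Date`, that is one piece of evidence that the two columns carry the same information.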


In [16]:
df1[['COLLISION TIME', 'hour', 'minute']].agg(("min", "max"))

Unnamed: 0,COLLISION TIME,hour,minute
min,0,0,0
max,100009,23,59


In [17]:
df1[['Date', 'COLLISION DATE', 'COLLISION TIME']]
# These seem to match up. It shouldn't be too hard to write code to verify this.

Unnamed: 0,Date,COLLISION DATE,COLLISION TIME
0,2010-02-20 16:20:00,2/20/2010,1620
1,2010-01-13 13:40:00,1/13/2010,1340
2,2010-01-13 10:00:00,1/13/2010,100008
3,2010-01-15 15:50:00,1/15/2010,1550
4,2010-02-02 06:11:00,2/2/2010,611
...,...,...,...
1268,2017-12-05 07:07:00,12/5/2017,707
1269,2017-12-14 17:09:00,12/14/2017,1709
1270,2017-12-19 10:00:00,12/19/2017,100002
1271,2017-12-21 19:56:00,12/21/2017,1956
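
One way to verify that `Date` and `COLLISION DATE` agree is to reduce both to a calendar day and list the rows that differ. This is only a sketch: `sample` is a hypothetical stand-in for `df1` built from rows displayed above, and both format strings are assumptions based on those rows.

```python
import pandas as pd

# Hypothetical stand-in for df1, built from rows displayed in this notebook.
sample = pd.DataFrame(
    {
        "Date": ["2010-02-20 16:20:00", "2010-01-13 10:00:00", "2017-12-19 10:00:00"],
        "COLLISION DATE": ["2/20/2010", "1/13/2010", "12/19/2017"],
    }
)

# Keep only the calendar day from the full timestamp.
day_from_date = pd.to_datetime(sample["Date"], format="%Y-%m-%d %H:%M:%S").dt.normalize()
# Assumed format for COLLISION DATE: month/day/year.
day_from_collision = pd.to_datetime(sample["COLLISION DATE"], format="%m/%d/%Y")

# Rows where the two columns disagree; an empty frame suggests they are redundant.
date_mismatches = sample[day_from_date != day_from_collision]
print(len(date_mismatches))
```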


In [18]:
df1[['COLLISION TIME', 'hour', 'minute']]
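# Most rows match (COLLISION TIME == hour * 100 + minute), but six-digit values such as 100008
# pair with hour 10, minute 0. Write a script to check this data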

Unnamed: 0,COLLISION TIME,hour,minute
0,1620,16,20
1,1340,13,40
2,100008,10,0
3,1550,15,50
4,611,6,11
...,...,...,...
1268,707,7,7
1269,1709,17,9
1270,100002,10,0
1271,1956,19,56
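
A possible version of that check, assuming `COLLISION TIME` is meant to be an `HHMM` integer. `sample` is a hypothetical stand-in for `df1` that reuses the rows displayed above.

```python
import pandas as pd

# Hypothetical stand-in for df1, built from the rows displayed above.
sample = pd.DataFrame(
    {
        "COLLISION TIME": [1620, 1340, 100008, 1550, 611, 707, 1709, 100002, 1956],
        "hour": [16, 13, 10, 15, 6, 7, 17, 10, 19],
        "minute": [20, 40, 0, 50, 11, 7, 9, 0, 56],
    }
)

# Rebuild the HHMM integer from the separate hour and minute columns.
expected_hhmm = sample["hour"] * 100 + sample["minute"]

# Rows where COLLISION TIME does not equal the rebuilt value.
time_mismatches = sample[sample["COLLISION TIME"] != expected_hhmm]
print(time_mismatches)
```

On these rows only the six-digit values (`100008`, `100002`) fail the check, so they are the ones that need a closer look.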


In [19]:
times = df1[['COLLISION TIME', 'hour', 'minute']]
times[(times['hour'] == 10)].sort_values(by='COLLISION TIME')


Unnamed: 0,COLLISION TIME,hour,minute
861,1000,10,0
26,1000,10,0
1052,1000,10,0
104,1000,10,0
1063,1001,10,1
...,...,...,...
1201,100009,10,0
6,100009,10,0
712,100009,10,0
542,100009,10,0
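
The six-digit values cannot be valid `HHMM` times, so it may help to isolate them and see which `hour`/`minute` pairs they were given. A short sketch on made-up rows modelled on the output above; the `2359` cutoff assumes `COLLISION TIME` is otherwise an `HHMM` integer.

```python
import pandas as pd

# Hypothetical rows modelled on the output above.
times = pd.DataFrame(
    {
        "COLLISION TIME": [1000, 1001, 100009, 100008, 100002, 1956],
        "hour": [10, 10, 10, 10, 10, 19],
        "minute": [0, 1, 0, 0, 0, 56],
    }
)

# A valid HHMM value cannot exceed 2359.
out_of_range = times[times["COLLISION TIME"] > 2359]

# Count which (hour, minute) pairs the out-of-range values were assigned.
assigned = out_of_range.groupby(["hour", "minute"]).size()
print(assigned)
```

If every out-of-range value in `df1` lands on `(10, 0)`, the `hour` and `minute` columns for those rows are probably a parsing artifact rather than a real time.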


### Geolocation columns

`['GPS LATITUDE DECIMAL', 'GPS LONGITUDE DECIMAL', 'Latitude', 'Longitude', 'geometry']`


In [20]:

df1[['GPS LONGITUDE DECIMAL', 'Longitude', 'GPS LATITUDE DECIMAL', 'Latitude', 'geometry']]
# geometry prints more digits than the float columns; this may be display rounding rather than extra precision

Unnamed: 0,GPS LONGITUDE DECIMAL,Longitude,GPS LATITUDE DECIMAL,Latitude,geometry
0,-85.707933,-85.707933,38.231850,38.231850,POINT (-85.707933333 38.23185)
1,-85.696572,-85.696572,38.273995,38.273995,POINT (-85.6965716 38.2739947)
2,-85.703576,-85.703576,38.258551,38.258551,POINT (-85.70357610000001 38.2585512)
3,-85.697265,-85.697265,38.250012,38.250012,POINT (-85.6972652 38.2500121)
4,-85.793380,-85.793380,38.195890,38.195890,POINT (-85.7933803 38.1958905)
...,...,...,...,...,...
1268,-85.733644,-85.733644,38.153815,38.153815,POINT (-85.73364410000001 38.1538153)
1269,-85.688008,-85.688008,38.163618,38.163618,POINT (-85.6880079 38.1636178)
1270,-85.671480,-85.671480,38.160030,38.160030,POINT (-85.6714798 38.1600301)
1271,-85.626309,-85.626309,38.198257,38.198257,POINT (-85.62630919999999 38.1982569)
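
To test whether `geometry` really differs from `Longitude`/`Latitude`, the coordinates can be pulled out of the `POINT (...)` text and compared within a small tolerance. This sketch assumes `geometry` was read from the CSV as a plain string. `sample` is a hypothetical stand-in for `df1` that copies the rounded values displayed above, and the `1e-6` tolerance is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1, copying the (rounded) values displayed above.
sample = pd.DataFrame(
    {
        "Longitude": [-85.707933, -85.696572, -85.703576],
        "Latitude": [38.231850, 38.273995, 38.258551],
        "geometry": [
            "POINT (-85.707933333 38.23185)",
            "POINT (-85.6965716 38.2739947)",
            "POINT (-85.70357610000001 38.2585512)",
        ],
    }
)

# Pull the two numbers out of the WKT text: POINT (<longitude> <latitude>).
pattern = r"POINT \((?P<lon>-?\d+(?:\.\d+)?) (?P<lat>-?\d+(?:\.\d+)?)\)"
coords = sample["geometry"].str.extract(pattern).astype(float)

# Compare with an absolute tolerance only (rtol=0 disables the relative part).
tolerance = 1e-6
lon_matches = np.isclose(coords["lon"], sample["Longitude"], rtol=0, atol=tolerance)
lat_matches = np.isclose(coords["lat"], sample["Latitude"], rtol=0, atol=tolerance)
print(lon_matches.all(), lat_matches.all())
```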


### "... CODE" columns

`['LOCAL CODE', 'COLLISION STATUS CODE', 'WEATHER CODE', 
 'ROADWAY CONDITION CODE', 'ROADWAY TYPE CODE', 'DIRECTIONAL ANALYSIS CODE',
 'MANNER OF COLLISION CODE', 'ROADWAY CHARACTER CODE', 'LIGHT CONDITION CODE']`

`['WEATHER', 'ROADWAY CONDITION', 'ROADWAY TYPE', 'DIRECTIONAL ANALYSIS', 'MANNER OF COLLISION',
'ROADWAY CHARACTER', 'LIGHT CONDITION']`

In [21]:
human_readable = ['WEATHER', 'ROADWAY CONDITION', 'ROADWAY TYPE', 'DIRECTIONAL ANALYSIS', 'MANNER OF COLLISION',
     'ROADWAY CHARACTER', 'LIGHT CONDITION']
codes = {H:H+" CODE" for H in human_readable}
codes

{'WEATHER': 'WEATHER CODE',
 'ROADWAY CONDITION': 'ROADWAY CONDITION CODE',
 'ROADWAY TYPE': 'ROADWAY TYPE CODE',
 'DIRECTIONAL ANALYSIS': 'DIRECTIONAL ANALYSIS CODE',
 'MANNER OF COLLISION': 'MANNER OF COLLISION CODE',
 'ROADWAY CHARACTER': 'ROADWAY CHARACTER CODE',
 'LIGHT CONDITION': 'LIGHT CONDITION CODE'}

In [28]:
# Test each X CODE : X pair. They should map 1:1, so the number of unique human-readable values
# should equal the number of unique numeric codes (a necessary check, though not a sufficient one)
all((len(df1[key].unique()) == len(df1[value].unique())) for key, value in codes.items())

True
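
Equal unique counts do not rule out a code that maps to two labels. A stricter check looks at the distinct (code, label) pairs directly. The helper `is_one_to_one` below is a hypothetical name, and the sample values are made up to show a case where the counts match but the mapping is not 1:1.

```python
import pandas as pd


def is_one_to_one(df, left, right):
    """Return True when each `left` value pairs with exactly one `right` value, and vice versa."""
    pairs = df[[left, right]].drop_duplicates()
    # After de-duplicating pairs, neither side may repeat in a 1:1 mapping.
    return not pairs[left].duplicated().any() and not pairs[right].duplicated().any()


# Made-up values: two codes and two labels, but code 1 maps to both labels.
ambiguous = pd.DataFrame({"WEATHER CODE": [1, 1, 2], "WEATHER": ["CLEAR", "RAINING", "RAINING"]})
clean = pd.DataFrame({"WEATHER CODE": [1, 1, 2], "WEATHER": ["CLEAR", "CLEAR", "RAINING"]})

counts_equal = ambiguous["WEATHER CODE"].nunique() == ambiguous["WEATHER"].nunique()
print(counts_equal, is_one_to_one(ambiguous, "WEATHER CODE", "WEATHER"))
print(is_one_to_one(clean, "WEATHER CODE", "WEATHER"))
```

With the `codes` dictionary from this notebook, the same helper could be applied as `all(is_one_to_one(df1, code, label) for label, code in codes.items())`.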

In [23]:
df1['ROADWAY TYPE CODE'].value_counts(dropna=False)
# Has some null values
# NAN roadway type code -> NONE OF THE ABOVE in ROADWAY TYPE

ROADWAY TYPE CODE
5.0     665
7.0     270
2.0     255
1.0      48
99.0     19
4.0      11
NaN       4
3.0       1
Name: count, dtype: int64
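
The note that a missing `ROADWAY TYPE CODE` corresponds to `NONE OF THE ABOVE` can be confirmed by counting the labels on only the rows where the code is null. A small sketch on hypothetical rows: `LABEL A` and `LABEL B` are placeholders, not real category names.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; "LABEL A" and "LABEL B" are placeholders, not real categories.
sample = pd.DataFrame(
    {
        "ROADWAY TYPE CODE": [5.0, np.nan, 7.0, np.nan],
        "ROADWAY TYPE": ["LABEL A", "NONE OF THE ABOVE", "LABEL B", "NONE OF THE ABOVE"],
    }
)

# Select the rows with no code and count the labels they carry.
missing_code = sample["ROADWAY TYPE CODE"].isna()
labels_for_missing = sample.loc[missing_code, "ROADWAY TYPE"].value_counts(dropna=False)
print(labels_for_missing)
```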

### Address information

`['COUNTY NAME', 'ROADWAY NUMBER', 'BLOCK/HOUSE #', 'ROADWAY NAME', 'ROADWAY SUFFIX',
    'ROADWAY DIRECTION CODE', 'MILEPOINT DERIVED','INTERSECTION ROADWAY #', 'INTERSECTION ROADWAY NAME',
       'INTERSECTION ROADWAY SFX', 'BETWEEN STREET ROADWAY # 1',
       'BETWEEN STREET ROADWAY NAME 1', 'BETWEEN STREET ROADWAY SFX 1',
       'BETWEEN STREET ROADWAY # 2', 'BETWEEN STREET ROADWAY NAME 2',
       'BETWEEN STREET ROADWAY SFX 2', 'RAMP FROM ROADWAY ID', 'RAMP TO ROADWAY ID']`

### Other columns
`['Unnamed: 0', 'MASTER FILE NUMBER', 'INVESTIGATING AGENCY', 'LOCAL CODE', 'COLLISION STATUS CODE',
  'COUNTY NAME','UNITS INVOLVED', 'MOTOR VEHICLES INVOLVED', 'KILLED', 'INJURED', 'HIT & RUN INDICATOR',
  'SECONDARY COLLISION INDICATOR', 'index_right']`

In [34]:
# Unnamed: 0 column
unnamed_column = df1['Unnamed: 0']
len(unnamed_column) == len(unnamed_column.unique())
# This looks to be an index column.
# I will probably just ignore it. 

True

In [54]:
# index_right column
df1.index_right
df1.index_right.unique()
# This column is all zeros. I will ignore it entirely. 

array([0])

In [38]:
# MASTER FILE NUMBER column
MFN = df1['MASTER FILE NUMBER']
len(MFN.unique()), len(MFN)
# This looks like a unique identifier for each crash report. 
# I will keep this column


(1273, 1273)
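
Since `MASTER FILE NUMBER` is unique, it could serve as a key for measuring overlap with the second dataset. The sketch below assumes, without having verified it, that `IncidentID` in DATA2 uses the same numbering. The ID values are made up, and `left`/`right` are hypothetical stand-ins for `df1` and `df2`.

```python
import pandas as pd

# Made-up IDs; whether the two columns share a numbering scheme is an unverified assumption.
left = pd.DataFrame({"MASTER FILE NUMBER": [101, 102, 103]})
right = pd.DataFrame({"IncidentID": [102, 103, 104]})

# An outer merge with indicator=True labels each key as left_only, right_only, or both.
overlap = left.merge(
    right,
    how="outer",
    left_on="MASTER FILE NUMBER",
    right_on="IncidentID",
    indicator=True,
)
overlap_counts = overlap["_merge"].value_counts()
print(overlap_counts)
```

A `both` count of zero on the real data would suggest the two identifiers are not comparable and that the overlap has to be found another way, for example by date and location.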

In [41]:
# INVESTIGATING AGENCY
IA = df1['INVESTIGATING AGENCY']
IA.unique()
IA.value_counts()

INVESTIGATING AGENCY
LOUISVILLE METRO POLICE DEPT      1160
SHIVELY POLICE DEPARTMENT           40
ST. MATTHEWS POLICE DEPARTMENT      29
JEFFERSONTOWN POLICE DEPT           24
UNIV. OF LOUISVILLE POLICE           7
INDIAN HILLS POLICE DEPARTMENT       5
PROSPECT POLICE DEPARTMENT           2
GRAYMOOR-DEVONDALE POLICE DEPT       2
ANCHORAGE POLICE DEPARTMENT          1
WEST BUECHEL POLICE DEPT.            1
NORTHFIELD POLICE DEPARTMENT         1
AUDUBON PARK POLICE DEPARTMENT       1
Name: count, dtype: int64

In [48]:
# LOCAL CODE column
LC = df1['LOCAL CODE']



In [None]:
# COLLISION STATUS CODE column
CSC = df1['COLLISION STATUS CODE']


In [None]:
# COUNTY NAME column
CN = df1['COUNTY NAME']


In [None]:
# UNITS INVOLVED column
UI = df1["UNITS INVOLVED"]


In [None]:
# MOTOR VEHICLES INVOLVED column
MVI = df1['MOTOR VEHICLES INVOLVED']


In [None]:
# KILLED/INJURED/HIT & RUN INDICATOR/SECONDARY COLLISION INDICATOR columns
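
The remaining columns could be profiled in one pass before looking at each individually. A minimal sketch using a hypothetical helper, `summarize_columns`, on made-up rows. Only the county code `56` comes from the data dictionary above.

```python
import pandas as pd


def summarize_columns(df, columns):
    """Return dtype, distinct-value count (nulls included), and null count per column."""
    subset = df[columns]
    return pd.DataFrame(
        {
            "dtype": subset.dtypes.astype(str),
            "n_unique": subset.nunique(dropna=False),
            "n_null": subset.isna().sum(),
        }
    )


# Made-up rows standing in for df1.
sample = pd.DataFrame(
    {
        "COUNTY NAME": [56, 56, 56],
        "UNITS INVOLVED": [2, 2, 3],
        "KILLED": [0, 0, 1],
    }
)
summary = summarize_columns(sample, ["COUNTY NAME", "UNITS INVOLVED", "KILLED"])
print(summary)
```

A column with `n_unique == 1`, as expected for `COUNTY NAME`, carries no information and is a candidate to drop. Columns guessed to be boolean should show at most two distinct values.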


## Exploring LOJIC data


## LOJIC data
| column name | type | description | value notes | cleaning notes | 
|-------------|------|-------------|-------------|----------------|
| X |
| Y |
| IncidentID |
| AgencyName |
| RdwyNumber |
| Street |
| StreetDir |
| RoadwayName |
| StreetSfx |
| OWNER |
| ROAD_CLASSIFICATION |
| COUNCIL_DISTRICT |
| IntersectionRdwy |
| IntersectionRdwyName |
| BetweenStRdwy1 |
| BetweenStRdwyName1 |
| BetweenStRdwy2 |
| BetweenStRdwyName2 |
| Latitude |
| Longitude |
| Milepoint |
| DAY_OF_WEEK |
| CollisionDate |
| CollisionTime |
| HOUR_OF_DAY |
| UnitsInvolved |
| MotorVehiclesInvolved |
| MODE |
| NAME |
| AGE |
| GENDER |
| SEVERITY |
| LINK |
| Weather |
| RdwyConditionCode |
| HitandRun |
| DirAnalysisCode |
| MannerofCollision |
| RdwyCharacter |
| LightCondition |
| RampFromRdwyId |
| RampToRdwyId |
| IsSecondaryCollision |
| ObjectId |


In [24]:
df2['CollisionDate']


0       2016/10/11 03:08:00+00
1       2016/10/12 13:02:00+00
2       2016/10/12 13:02:00+00
3       2016/10/12 19:31:00+00
4       2016/10/12 23:51:00+00
                 ...          
4896    2022/09/12 22:55:00+00
4897    2022/09/16 05:47:00+00
4898    2022/09/16 22:47:00+00
4899    2022/09/17 00:14:00+00
4900    2022/09/17 02:10:00+00
Name: CollisionDate, Length: 4901, dtype: object
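
`CollisionDate` is read as `object` (strings) with a short `+00` offset. A sketch of parsing it into timezone-aware timestamps follows. It pads the offset to `+0000` so it fits the `%z` directive, and it assumes the format seen in the rows above holds for the whole column. Whether the stored times are truly UTC or local times labelled `+00` is not confirmed here.

```python
import pandas as pd

# Hypothetical stand-in for df2["CollisionDate"], using strings shown above.
sample = pd.Series(
    ["2016/10/11 03:08:00+00", "2016/10/12 13:02:00+00", "2022/09/17 02:10:00+00"],
    name="CollisionDate",
)

# Pad the short "+00" offset to "+0000" so it matches the %z directive.
padded = sample.str.replace(r"\+00$", "+0000", regex=True)

# Parse with an explicit format and normalize to UTC.
collision_ts = pd.to_datetime(padded, format="%Y/%m/%d %H:%M:%S%z", utc=True)
print(collision_ts.min(), collision_ts.max())
```

Once parsed, `collision_ts.min()` and `collision_ts.max()` give the actual date range of DATA2, which can be compared with the 2016 to 2023 range noted at the top of the notebook.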