# Philadelphia Auto Crash - Exploratory Data Analysis

My goal here is to explore the data set and find interesting things I might want to analyze later.  This is scratchwork and won't be as interesting as later analysis workbooks.

In [198]:
import pandas as pd
import numpy as np

In [199]:
df_crash = pd.read_csv('df_crash.csv',low_memory=False) # data
df_crash_dict = pd.read_csv('crash_data_dict.csv') # data dictionary

In [200]:
df_crash_dict.head()

Unnamed: 0,Column Name,Short Title / Description,Type,Length,Constraints
0,ARRIVAL_TM,Time police arrived at the scene,TEXT,4.0,HHMM
1,AUTOMOBILE_COUNT,Total amount of Automobiles Involved,NUMBER,2.0,
2,BELTED_DEATH_COUNT,Total Deaths of belted occupants,NUMBER,2.0,
3,BELTED_SUSP_SERIOUS_INJ_COUNT,Total Suspected Serious Injuries of belted occ...,NUMBER,2.0,
4,BICYCLE_COUNT,Total amount of Bicycles involved,NUMBER,2.0,


In [201]:
#rename columns to make lowercase and remove spaces
df_crash_dict.columns = ['column_name','description','type','length','constraints']
df_crash_dict.head()

Unnamed: 0,column_name,description,type,length,constraints
0,ARRIVAL_TM,Time police arrived at the scene,TEXT,4.0,HHMM
1,AUTOMOBILE_COUNT,Total amount of Automobiles Involved,NUMBER,2.0,
2,BELTED_DEATH_COUNT,Total Deaths of belted occupants,NUMBER,2.0,
3,BELTED_SUSP_SERIOUS_INJ_COUNT,Total Suspected Serious Injuries of belted occ...,NUMBER,2.0,
4,BICYCLE_COUNT,Total amount of Bicycles involved,NUMBER,2.0,


#### Check sample values in df_crash

In [202]:
# check a sample row (iloc[1] is the second row; iloc[0] would be the first)
row1_dict = df_crash.iloc[1,:]
row1_dict[:25]

CRN                               2020007793
ARRIVAL_TM                               NaN
AUTOMOBILE_COUNT                           1
BELTED_DEATH_COUNT                         0
BELTED_SUSP_SERIOUS_INJ_COUNT              0
BICYCLE_COUNT                              0
BICYCLE_DEATH_COUNT                        0
BICYCLE_SUSP_SERIOUS_INJ_COUNT             0
BUS_COUNT                                  0
CHLDPAS_DEATH_COUNT                        0
CHLDPAS_SUSP_SERIOUS_INJ_COUNT             0
COMM_VEH_COUNT                             0
CONS_ZONE_SPD_LIM                        NaN
CRASH_MONTH                                2
CRASH_YEAR                              2019
DAY_OF_WEEK                                6
DEC_LAT                              40.0798
DEC_LONG                            -75.0267
DISPATCH_TM                              NaN
DRIVER_COUNT_16YR                          0
DRIVER_COUNT_17YR                          0
DRIVER_COUNT_18YR                          0
DRIVER_COU

#### I don't know what the column names mean so I will pull a summary of each column into the data dictionary

In [203]:
# compare columns in data dict and data to see how they differ
# no  extra columns in the data dictionary
set(df_crash_dict.column_name) - set(df_crash.columns)

set()

In [204]:
# the only difference is the filename field that I added to the data
set(df_crash.columns) - set(df_crash_dict.column_name)

{'filename'}

In [205]:
# add a sample value to the data dictionary so it is easier to understand
#df_crash_dict['sample_value2'] = df_crash_dict.column_name.apply(lambda x: row1_dict[x])
#df_crash_dict.head(50)

In [206]:
def get_column_summary(col):
    """Summarize the values contained in df_crash[col].

    Text columns: return the top 3 values joined with a comma.
    Numeric columns: return the mean, min and max.
    """
    s = df_crash[col]

    # object data type means text
    if s.dtype == 'O':
        top3 = s.fillna('null').value_counts(dropna=False).head(3).index
        return ', '.join(top3)

    # otherwise treat the column as numeric
    mean_ = round(s.mean(), 2)
    min_ = round(s.min(), 2)
    max_ = round(s.max(), 2)
    return f'mean={mean_}, min={min_}, max={max_}'

In [207]:
df_crash_dict['summary'] = df_crash_dict.column_name.apply(get_column_summary)

In [208]:
df_crash_dict

Unnamed: 0,column_name,description,type,length,constraints,summary
0,ARRIVAL_TM,Time police arrived at the scene,TEXT,4.0,HHMM,"mean=1284.41, min=0.0, max=9999.0"
1,AUTOMOBILE_COUNT,Total amount of Automobiles Involved,NUMBER,2.0,,"mean=1.05, min=0, max=10"
2,BELTED_DEATH_COUNT,Total Deaths of belted occupants,NUMBER,2.0,,"mean=0.0, min=0, max=2"
3,BELTED_SUSP_SERIOUS_INJ_COUNT,Total Suspected Serious Injuries of belted occ...,NUMBER,2.0,,"mean=0.01, min=0, max=3"
4,BICYCLE_COUNT,Total amount of Bicycles involved,NUMBER,2.0,,"mean=0.02, min=0, max=2"
...,...,...,...,...,...,...
94,WZ_LN_CLOSURE,Did Work zone have a lane closure?,TEXT,1.0,"1=Y, 0 = N","null, Y, N"
95,WZ_MOVING,Was there moving work in the zone?,TEXT,1.0,"1=Y, 0 = N","null, N, Y"
96,WZ_OTHER,Was this a special type of work zone?,TEXT,1.0,"1=Y, 0 = N","null, N, Y"
97,WZ_SHLDER_MDN,Was a median/shoulder in the zone?,TEXT,1.0,"1=Y, 0 = N","null, Y, N"


In [209]:
# get the first row of the dataframe and add it to the data dictionary as a sample crash event
sample1 = df_crash[:1].to_dict('records')[0]
df_crash_dict['sample1'] = df_crash_dict.column_name.map(sample1)

# get last row and add it as another sample
sample2 = df_crash[-1:].to_dict('records')[0]
df_crash_dict['sample2'] = df_crash_dict.column_name.map(sample2)

## Fields 0-9

In [210]:
df_crash_dict[:10]

Unnamed: 0,column_name,description,type,length,constraints,summary,sample1,sample2
0,ARRIVAL_TM,Time police arrived at the scene,TEXT,4.0,HHMM,"mean=1284.41, min=0.0, max=9999.0",,1745.0
1,AUTOMOBILE_COUNT,Total amount of Automobiles Involved,NUMBER,2.0,,"mean=1.05, min=0, max=10",2.0,0.0
2,BELTED_DEATH_COUNT,Total Deaths of belted occupants,NUMBER,2.0,,"mean=0.0, min=0, max=2",0.0,0.0
3,BELTED_SUSP_SERIOUS_INJ_COUNT,Total Suspected Serious Injuries of belted occ...,NUMBER,2.0,,"mean=0.01, min=0, max=3",0.0,0.0
4,BICYCLE_COUNT,Total amount of Bicycles involved,NUMBER,2.0,,"mean=0.02, min=0, max=2",0.0,0.0
5,BICYCLE_DEATH_COUNT,Total amount of Bicyclist Fatalities,NUMBER,2.0,,"mean=0.0, min=0, max=1",0.0,0.0
6,BICYCLE_SUSP_SERIOUS_INJ_COUNT,\nTotal amount of Bicyclist Suspected Serious ...,NUMBER,2.0,,"mean=0.0, min=0, max=1",0.0,0.0
7,BUS_COUNT,Total amount of Buses involved,NUMBER,2.0,,"mean=0.02, min=0, max=2",0.0,0.0
8,CHLDPAS_DEATH_COUNT,killed in the crash\nTotal child passengers un...,NUMBER,2.0,,"mean=0.0, min=0, max=1",0.0,0.0
9,CHLDPAS_SUSP_SERIOUS_INJ_COUNT,with suspected serious injuries\nTotal child p...,NUMBER,2.0,,"mean=0.0, min=0, max=2",0.0,0.0


In [211]:
# ARRIVAL_TM - no police arrival time for first event.  2nd event, police arrived at 5:45PM
# AUTOMOBILE_COUNT - 2 cars involved in first event.  0 cars involved in the second.  
        # Does that mean that it was a motorcycle?
        # The max value is 10.  Would be interesting to look at the 10-car accident.

# no belted deaths or belted serious injuries, no bicycles involved, no buses, no children

In [252]:
# it looks like arrival times are often rounded to the nearest 5 minutes
# we can't draw any conclusions about crash time or arrival time since we are just seeing
        # the "most popular minute" that was reported in the data
        # 4:30-5:30 will likely wind up being the most frequent crash time
        # but looking at the individual minute is not the best way to analyze.

df_crash.ARRIVAL_TM.value_counts(normalize=True,dropna=False)

NaN       0.025043
1630.0    0.003412
1700.0    0.002884
1640.0    0.002844
1730.0    0.002803
            ...   
533.0     0.000142
433.0     0.000122
544.0     0.000122
554.0     0.000102
546.0     0.000102
Name: ARRIVAL_TM, Length: 1442, dtype: float64
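Since counting individual minutes is noisy, one option is to bucket the HHMM values by hour before counting. The sketch below is only an illustration on toy values: `hhmm_to_hour` is a hypothetical helper, and treating out-of-range values such as 9999 as missing is an assumption, not something the data dictionary confirms for this column.

```python
import numpy as np
import pandas as pd

def hhmm_to_hour(times):
    """Convert HHMM values (e.g. 1745.0) to the hour of day (17.0).

    Values outside 0000-2359 (such as 9999) are returned as NaN.
    Treating them as "unknown" is an assumption.
    """
    t = pd.to_numeric(times, errors='coerce')
    hour = t // 100
    minute = t % 100
    valid = t.between(0, 2359) & (minute < 60)
    return hour.where(valid)

# toy values for illustration only, not taken from df_crash
toy_times = pd.Series([1630.0, 1645.0, 1700.0, 533.0, 9999.0, np.nan])
hours = hhmm_to_hour(toy_times)
hour_counts = hours.value_counts().sort_index()
print(hour_counts)
```

On the real data the same helper could be applied to `df_crash.ARRIVAL_TM` (or `DISPATCH_TM`) and the counts compared hour by hour.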

In [251]:
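# note on AUTOMOBILE_COUNT: almost half of crashes involve only 1 car but 26% involve no cars
# crashes with no cars probably involve trucks, commercial vehicles, SUVs or motorcycles.
# I will create a "total vehicle" column later
# BELTED_DEATH_COUNT: fewer than 0.1% of crashes involve the death of a belted occupant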
df_crash.BELTED_DEATH_COUNT.value_counts(normalize=True,dropna=False)

0    0.999228
1    0.000711
2    0.000061
Name: BELTED_DEATH_COUNT, dtype: float64
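A "total vehicle" column could be sketched by summing the per-type vehicle counts. This is a rough sketch on toy data: the list of columns to add up and the name `total_vehicles` are assumptions, and it presumes the vehicle-type columns do not overlap (COMM_VEH_COUNT is left out for that reason). The dictionary also lists a VEHICLE_COUNT field, which would be worth comparing against.

```python
import pandas as pd

# Assumption: these vehicle-type columns do not overlap with each other.
VEHICLE_TYPE_COLS = [
    'AUTOMOBILE_COUNT', 'BUS_COUNT', 'HEAVY_TRUCK_COUNT', 'MOTORCYCLE_COUNT',
    'SMALL_TRUCK_COUNT', 'SUV_COUNT', 'VAN_COUNT',
]

def add_total_vehicles(df, cols=VEHICLE_TYPE_COLS):
    """Return a copy of df with a hypothetical total_vehicles column."""
    out = df.copy()
    out['total_vehicles'] = out[cols].sum(axis=1)
    return out

# toy rows for illustration: one crash with 2 automobiles, one with 2 SUVs
toy = pd.DataFrame(0, index=range(2), columns=VEHICLE_TYPE_COLS)
toy.loc[0, 'AUTOMOBILE_COUNT'] = 2
toy.loc[1, 'SUV_COUNT'] = 2
toy_totals = add_total_vehicles(toy)
print(toy_totals['total_vehicles'].tolist())
```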

In [250]:
# This is a sneak preview to a later field.  I want to compare unbelted death count and belted
df_crash.UNB_DEATH_COUNT.value_counts(normalize=True, dropna=False)

0    0.998538
1    0.001320
2    0.000102
3    0.000020
4    0.000020
Name: UNB_DEATH_COUNT, dtype: float64

In [253]:
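# Crashes with at least one unbelted death are about 1.9x as common as crashes with
        # at least one belted death (about 90% more). This compares crashes, not individual deaths.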
(1-0.998538) / (1-0.999228)

1.893782383419654
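The ratio above is typed in by hand from the two value_counts outputs. A small sketch of computing it directly from the columns follows, using toy data; `share_with_any` is a hypothetical helper name.

```python
import pandas as pd

def share_with_any(counts):
    """Share of crashes where the count column is at least 1."""
    return (counts > 0).mean()

# toy data for illustration only, not taken from df_crash
toy = pd.DataFrame({
    'BELTED_DEATH_COUNT': [0, 0, 0, 1, 0, 0, 0, 0],
    'UNB_DEATH_COUNT':    [0, 1, 0, 0, 2, 0, 0, 0],
})
ratio = share_with_any(toy.UNB_DEATH_COUNT) / share_with_any(toy.BELTED_DEATH_COUNT)
print(ratio)
```

Because the helper counts crashes with at least one death, a crash with 2 unbelted deaths still counts once, which matches how the hand calculation above was done.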

In [254]:
# 2.3% of crashes involve 1 bicycle
df_crash.BICYCLE_COUNT.value_counts(normalize=True,dropna=False)

0    0.977069
1    0.022911
2    0.000020
Name: BICYCLE_COUNT, dtype: float64

In [255]:
# only 1 accident involved 2 bicycles.  I wonder if a car hit 2 cyclists at the same time.
df_crash.BICYCLE_COUNT.value_counts(dropna=False)

0    48106
1     1128
2        1
Name: BICYCLE_COUNT, dtype: int64

In [256]:
# 26 accidents involved the death of 1 cyclist
df_crash.BICYCLE_DEATH_COUNT.value_counts(dropna=False)

0    49209
1       26
Name: BICYCLE_DEATH_COUNT, dtype: int64

In [257]:
# 73 accidents involved a bicyclist with a suspected serious injury
# I was in a bike accident in 2016.  I will have to look back in the data and see if I can find it.
df_crash.BICYCLE_SUSP_SERIOUS_INJ_COUNT.value_counts(dropna=False)

0    49162
1       73
Name: BICYCLE_SUSP_SERIOUS_INJ_COUNT, dtype: int64

In [259]:
# 1.7% of accidents involved a bus
df_crash.BUS_COUNT.value_counts(normalize=True,dropna=False)

0    0.982716
1    0.017041
2    0.000244
Name: BUS_COUNT, dtype: float64

In [260]:
# 12 accidents involved 2 buses
df_crash.BUS_COUNT.value_counts(dropna=False)

0    48384
1      839
2       12
Name: BUS_COUNT, dtype: int64

In [261]:
# 3 accidents involved the death of a child.  RIP
df_crash.CHLDPAS_DEATH_COUNT.value_counts(dropna=False)

0    49232
1        3
Name: CHLDPAS_DEATH_COUNT, dtype: int64

In [262]:
# 20 accidents involved a child with a suspected serious injury.
df_crash.CHLDPAS_SUSP_SERIOUS_INJ_COUNT.value_counts(dropna=False)

0    49215
1       16
2        4
Name: CHLDPAS_SUSP_SERIOUS_INJ_COUNT, dtype: int64

## Fields 10-19

In [212]:
df_crash_dict[10:20]

Unnamed: 0,column_name,description,type,length,constraints,summary,sample1,sample2
10,COLLISION_TYPE,Collision category that defines the crash,TEXT,1.0,See Column Code,"Angle, Rear-end, Sideswipe (same dir.)",Angle,Angle
11,COMM_VEH_COUNT,Total Commercial vehicles involved,NUMBER,2.0,,"mean=0.07, min=0, max=5",0,0
12,CONS_ZONE_SPD_LIM,Speed limit for the Construction Zone,TEXT,2.0,,"mean=48.2, min=5.0, max=99.0",,
13,COUNTY,County Code Number where crash occurred,TEXT,2.0,See Column Code,PHILADELPHIA,PHILADELPHIA,PHILADELPHIA
14,CRASH_MONTH,Month when the crash occurred,TEXT,2.0,,"mean=6.57, min=1, max=12",3,9
15,CRASH_YEAR,Year when the crash occurred,TEXT,4.0,,"mean=2020.87, min=2019, max=2023",2019,2023
16,CRN,Crash Record Number,NUMBER,2.0,identifies a unique crash case\nDatabase key f...,"mean=2020970397.8, min=2019000228, max=2024023477",2020008819,2023085796
17,DAY_OF_WEEK,Day of the Week code when crash occurred,TEXT,1.0,See Column Code,"mean=4.07, min=1, max=7",1,2
18,DEC_LAT,Decimal format of the Latitude,NUMBER,2.4,Latitude expressed in decimal\ndegrees 99.9999,"mean=39.99, min=39.0, max=40.13",40.0214,39.9767
19,DEC_LONG,Decimal format of the Longitude,NUMBER,2.4,Longitude expressed in decimal\ndegrees 99.9999,"mean=-75.15, min=-75.28, max=-74.0",-75.0794,-75.1649


In [213]:
# both collisions were at an angle where the front of one car hits the side of the other
# no commercial vehicles
# no construction zone
# in Philadelphia
# crash 1 was in March 2019 and crash 2 was in September 2023
# crash 1 was on Sunday and crash 2 was on Monday
# crash 1 occurred at the intersection of Frankford Ave and Dyre St (my Mom grew up in that neighborhood)
# crash 2 occurred at the intersection of Jefferson St and N 18th St. (near Temple University)


In [264]:
# About 1/3 of crashes involve a car hitting another car in the side.
# 20% involve a rear-end collision
# 13% of crashes involve pedestrians
df_crash.COLLISION_TYPE.value_counts(normalize=True,dropna=False)

Angle                        0.322839
Rear-end                     0.202762
Sideswipe (same dir.)        0.133949
Hit fixed object             0.132771
Hit pedestrian               0.132710
Head-on                      0.036356
Sideswipe (Opposite dir.)    0.021387
Non-collision                0.009384
Backing                      0.004915
Unknown                      0.001483
Other                        0.001442
Name: COLLISION_TYPE, dtype: float64

In [265]:
# 6.4% of crashes involve a commercial vehicle
df_crash.COMM_VEH_COUNT.value_counts(normalize=True,dropna=False)

0    0.936041
1    0.061359
2    0.002498
3    0.000081
5    0.000020
Name: COMM_VEH_COUNT, dtype: float64

In [266]:
# most crashes do not have a construction zone speed limit value so they probably didn't occur in a construction zone
# I am not sure what "99" means but I am sure it's not a speed limit.
df_crash.CONS_ZONE_SPD_LIM.value_counts(normalize=True,dropna=False)

NaN     0.987550
45.0    0.005504
35.0    0.001686
55.0    0.001341
99.0    0.001198
25.0    0.000873
50.0    0.000589
15.0    0.000325
90.0    0.000305
40.0    0.000244
30.0    0.000162
20.0    0.000102
5.0     0.000081
10.0    0.000041
Name: CONS_ZONE_SPD_LIM, dtype: float64
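If 99 turns out to be a placeholder rather than a real speed limit, it could be masked before computing statistics so it does not inflate the mean. This is a sketch on toy values: that 99 is a sentinel is an assumption, and `mask_sentinels` is a hypothetical helper.

```python
import numpy as np
import pandas as pd

def mask_sentinels(series, sentinels=(99,)):
    """Replace assumed sentinel codes with NaN, leaving other values as-is."""
    return series.where(~series.isin(sentinels))

# toy values for illustration only, not taken from df_crash
toy_limits = pd.Series([45.0, 35.0, 99.0, np.nan, 55.0])
cleaned = mask_sentinels(toy_limits)
print(cleaned.mean())
```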

In [267]:
# October has the most crashes followed by May
# I wonder if this has to do with people going away on vacations and going to the beach in the summer
df_crash.CRASH_MONTH.value_counts(normalize=True,dropna=False)

10    0.090444
5     0.089306
8     0.086768
9     0.086057
3     0.083944
7     0.083498
12    0.082929
6     0.081263
1     0.081081
11    0.081040
2     0.077201
4     0.076470
Name: CRASH_MONTH, dtype: float64

In [270]:
# 2019 has the highest crash volume of all 5 years.  2020 is in the middle of the group.
# I would have expected 2020 to have the lowest volume since so many people quarantined during the pandemic.
df_crash.CRASH_YEAR.value_counts(dropna=False)

2019    11159
2021    10552
2020    10171
2022     8783
2023     8570
Name: CRASH_YEAR, dtype: int64

In [273]:
# 1=Sunday
# Saturday has the most crashes with Friday coming in 2nd place and Sunday coming in 3rd.
df_crash.DAY_OF_WEEK.value_counts(normalize=True,dropna=False)

7    0.158627
6    0.154321
1    0.150889
5    0.135209
2    0.134701
4    0.134132
3    0.132121
Name: DAY_OF_WEEK, dtype: float64
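To make the output easier to read, the day codes could be mapped to names before counting. The sketch uses the 1=Sunday convention noted in the cell above and assumes the remaining codes follow in calendar order through 7=Saturday; the values are toy data.

```python
import pandas as pd

# Assumption: codes run 1=Sunday through 7=Saturday
DAY_NAMES = {
    1: 'Sunday', 2: 'Monday', 3: 'Tuesday', 4: 'Wednesday',
    5: 'Thursday', 6: 'Friday', 7: 'Saturday',
}

# toy values for illustration only, not taken from df_crash
toy_days = pd.Series([7, 7, 6, 1, 7, 6])
day_share = toy_days.map(DAY_NAMES).value_counts(normalize=True)
print(day_share)
```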

## Fields 20-29

In [274]:
df_crash_dict[20:30]

Unnamed: 0,column_name,description,type,length,constraints,summary,sample1,sample2
20,DISPATCH_TM,Time police were dispatched to the scene,TEXT,4.0,HHMM (Hour Minute),"mean=1280.69, min=0.0, max=9999.0",,1739.0
21,DISTRICT,\nDistrict Number where crash occurred (Based ...,TEXT,2.0,See Column Code,"District 6 (Bucks, Chester, Delaware, Montgome...","District 6 (Bucks, Chester, Delaware, Montgome...","District 6 (Bucks, Chester, Delaware, Montgome..."
22,DRIVER_COUNT_16YR,Total amount of 16-year-old drivers,NUMBER,2.0,,"mean=0.0, min=0, max=2",0,0
23,DRIVER_COUNT_17YR,Total amount of 17-year-old drivers,NUMBER,2.0,,"mean=0.01, min=0, max=2",0,0
24,DRIVER_COUNT_18YR,Total amount of 18-year-old drivers,NUMBER,2.0,,"mean=0.02, min=0, max=2",0,0
25,DRIVER_COUNT_19YR,Total amount of 19-year old drivers,NUMBER,2.0,,"mean=0.03, min=0, max=2",0,0
26,DRIVER_COUNT_20YR,Total amount of 20-year-old drivers,NUMBER,2.0,,"mean=0.04, min=0, max=2",0,0
27,DRIVER_COUNT_50_64YR,Total amount of 50 to 64-year-old drivers,NUMBER,2.0,,"mean=0.26, min=0, max=4",0,0
28,DRIVER_COUNT_65_74YR,Total amount of 65 to 74-year-old drivers,NUMBER,2.0,,"mean=0.07, min=0, max=3",0,0
29,DRIVER_COUNT_75PLUS,Total amount of drivers ages 75 and up,NUMBER,2.0,,"mean=0.03, min=0, max=2",0,0


In [277]:
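# the first event does not have a dispatch time.  The 2nd event has a dispatch time of 5:39PM
# All events are in District 6
# The driver counts by age bracket are all zero for both samples, probably because the brackets
        # only cover ages 16-20 and 50+, so drivers aged 21-49 are not counted in any of these columns.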

In [278]:
# The most popular dispatch time is 4pm followed by 4:30
df_crash.DISPATCH_TM.value_counts(normalize=True,dropna=False)

NaN       0.028069
1600.0    0.003047
1630.0    0.002925
1700.0    0.002539
1900.0    0.002458
            ...   
544.0     0.000183
442.0     0.000183
331.0     0.000162
553.0     0.000162
459.0     0.000041
Name: DISPATCH_TM, Length: 1442, dtype: float64

In [279]:
# Always District 6
df_crash.DISTRICT.value_counts(normalize=True,dropna=False)

District 6 (Bucks, Chester, Delaware, Montgomery, Philadelphia Counties)    1.0
Name: DISTRICT, dtype: float64

In [280]:
# 0.32% of accidents involve a 16 year old driver
df_crash.DRIVER_COUNT_16YR.value_counts(normalize=True,dropna=False)

0    0.996811
1    0.003148
2    0.000041
Name: DRIVER_COUNT_16YR, dtype: float64

In [282]:
# 1.1% of accidents involve a 17 year old driver
df_crash.DRIVER_COUNT_17YR.value_counts(normalize=True,dropna=False)

0    0.988545
1    0.011354
2    0.000102
Name: DRIVER_COUNT_17YR, dtype: float64

In [283]:
# 2.2% of accidents involve an 18 year old driver
df_crash.DRIVER_COUNT_18YR.value_counts(normalize=True,dropna=False)

0    0.978003
1    0.021692
2    0.000305
Name: DRIVER_COUNT_18YR, dtype: float64

In [284]:
# 3.1% of accidents involve a 19 year old driver
df_crash.DRIVER_COUNT_19YR.value_counts(normalize=True,dropna=False)

0    0.969067
1    0.030629
2    0.000305
Name: DRIVER_COUNT_19YR, dtype: float64

In [286]:
# 3.5% of accidents involve a 20 year old driver
df_crash.DRIVER_COUNT_20YR.value_counts(normalize=True,dropna=False)

0    0.964862
1    0.034731
2    0.000406
Name: DRIVER_COUNT_20YR, dtype: float64

In [288]:
# 24% of accidents involve a 50-64 year old driver
df_crash.DRIVER_COUNT_50_64YR.value_counts(normalize=True,dropna=False)

0    0.763725
1    0.214766
2    0.020351
3    0.001117
4    0.000041
Name: DRIVER_COUNT_50_64YR, dtype: float64

In [289]:
# 6.8% of accidents involve a 65-74 year old driver
df_crash.DRIVER_COUNT_65_74YR.value_counts(normalize=True,dropna=False)

0    0.932284
1    0.066112
2    0.001564
3    0.000041
Name: DRIVER_COUNT_65_74YR, dtype: float64

In [290]:
# 2.7% of accidents involve a 75+ year old driver
df_crash.DRIVER_COUNT_75PLUS.value_counts(normalize=True,dropna=False)

0    0.973190
1    0.026343
2    0.000467
Name: DRIVER_COUNT_75PLUS, dtype: float64
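The eight cells above all compute the same thing for a different age bracket. A compact sketch that does it in one pass is shown below on toy data; `share_with_any_driver` is a hypothetical helper that selects columns by the `DRIVER_COUNT_` prefix.

```python
import pandas as pd

def share_with_any_driver(df, prefix='DRIVER_COUNT_'):
    """Share of crashes with at least one driver in each age-bracket column."""
    cols = [c for c in df.columns if c.startswith(prefix)]
    return (df[cols] > 0).mean().sort_values(ascending=False)

# toy data for illustration only, not taken from df_crash
toy = pd.DataFrame({
    'CRN': [1, 2, 3, 4],
    'DRIVER_COUNT_16YR': [0, 0, 0, 1],
    'DRIVER_COUNT_50_64YR': [1, 0, 2, 0],
})
shares = share_with_any_driver(toy)
print(shares)
```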

## Fields 30-39

In [214]:
df_crash_dict[30:40]

Unnamed: 0,column_name,description,type,length,constraints,summary,sample1,sample2
30,EST_HRS_CLOSED,Estimated hours roadway was closed,TEXT,1.0,HHMM,"mean=2.64, min=1.0, max=9.0",,
31,FATAL_COUNT,Total amount of fatalities involved,NUMBER,2.0,,"mean=0.01, min=0, max=4",0,0
32,HEAVY_TRUCK_COUNT,Total amount of Heavy Trucks involved,NUMBER,2.0,,"mean=0.05, min=0, max=3",0,0
33,HORSE_BUGGY_COUNT,involved in the Crash\nTotal Number of Horse a...,NUMBER,2.0,,"mean=0.0, min=0.0, max=1.0",0.0,0.0
34,HOUR_OF_DAY,The hour of Day when the crash occurred,TEXT,2.0,00 to 23,"mean=12.74, min=0.0, max=99.0",8.0,17.0
35,ILLUMINATION,Code that defines lighting at crash scene,TEXT,1.0,See Column Code,"Daylight, Dark - streetlights, Dusk",Daylight,Daylight
36,INJURY_COUNT,Total count of all injuries sustained,NUMBER,2.0,,"mean=0.9, min=0, max=25",1,1
37,INTERSECT_TYPE,Code that defines the Intersection Type,TEXT,2.0,See Column Code,"mean=0.76, min=0, max=13",1,1
38,INTERSECTION_RELATED,Was this midblock crash related to a nearby in...,TEXT,1.0,"1=Y, 0 = N","null, N, Y",,
39,LANE_CLOSED,Was there a lane closure? (Y/N),TEXT,1.0,"1=Y, 0 = N","mean=0.46, min=0, max=9",0,0


## <span style="color:red">🚨 THERE WERE NO HORSE AND BUGGIES INVOLVED IN EITHER ACCIDENT 🚨</span>

In [228]:
# the road was not closed
# no fatalities, no heavy trucks
# the first accident occurred at 8am and the second occurred at 5pm
# both accidents occurred during daylight which makes sense given the time of year and the time of day
# 1 injury in each accident
# intersection type is "4 way" for both
# there was no lane closure

In [276]:
# 7.1% of crashes have an estimate of how long the road was closed
df_crash.EST_HRS_CLOSED.value_counts(normalize=True,dropna=False)

NaN    0.928669
2.0    0.024576
1.0    0.023114
3.0    0.014908
9.0    0.007210
4.0    0.001401
5.0    0.000081
6.0    0.000041
Name: EST_HRS_CLOSED, dtype: float64

## Fields 40-49

In [229]:
df_crash_dict[40:50]

Unnamed: 0,column_name,description,type,length,constraints,summary,sample1,sample2
40,LATITUDE,GPS Latitude determined by PennDOT,TEXT,12.0,DD MM:SS.ddd,"null, 40 00:16.847, 39 55:01.859",40 01:17.161,39 58:36.055
41,LN_CLOSE_DIR,Direction of traffic in closed lane (s),TEXT,1.0,See Column Code,"null, South, North",,
42,LOCATION_TYPE,Code that defines the crash location,TEXT,2.0,See Column Code,"Not applicable, Driveway or Parking Lot, Bridge",Underpass,Not applicable
43,LONGITUDE,GPS Longitude determined by PennDOT (in negati...,TEXT,12.0,DD MM:SS.ddd,"null, 75 09:07.652, 75 14:46.691",75 04:45.952,75 09:53.625
44,MAX_SEVERITY_LEVEL,Injury severity level of the crash,TEXT,1.0,See Column Code,"Suspected Minor injury, Not injured, Injury/ U...",Suspected Minor injury,Suspected Minor injury
45,MCYCLE_DEATH_COUNT,Total amount of Motorcyclist fatalities,NUMBER,2.0,,"mean=0.0, min=0, max=1",0,0
46,MCYCLE_SUSP_SERIOUS_INJ_COUNT,Total amount of Motorcyclist Suspected Serious...,NUMBER,2.0,,"mean=0.01, min=0, max=3",0,0
47,MOTORCYCLE_COUNT,Total amount of Motorcycles Involved,NUMBER,2.0,,"mean=0.03, min=0, max=3",0,0
48,MUNICIPALITY,Municipality Code,TEXT,5.0,See Municipality Code,"mean=67301.0, min=67301, max=67301",67301,67301
49,NONMOTR_COUNT,Total number of Non-motorists involved in the ...,NUMBER,2.0,,"mean=0.14, min=0, max=6",0,0


In [None]:
# no lane closure
# 1st accident was under an underpass
# minor injuries in both accidents
# no motorcycles were involved in either crash.
        # I still don't know what happened in crash 2 if no cars or motorcycles were involved
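The LATITUDE and LONGITUDE fields use the "DD MM:SS.ddd" text format, while DEC_LAT and DEC_LONG hold decimal degrees. As a sanity check, the text form can be converted by hand. The sketch below assumes the format is always degrees, a space, then minutes:seconds; `dms_to_decimal` is a hypothetical helper, and the `negative` flag reflects the dictionary's note that longitude is in negative degrees.

```python
import pandas as pd

def dms_to_decimal(text, negative=False):
    """Convert 'DD MM:SS.ddd' text to decimal degrees (NaN if missing)."""
    if pd.isna(text):
        return float('nan')
    degrees, rest = text.split()
    minutes, seconds = rest.split(':')
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    return -value if negative else value

# values from the sample1 column of the data dictionary above
lat = dms_to_decimal('40 01:17.161')
lon = dms_to_decimal('75 04:45.952', negative=True)
print(round(lat, 4), round(lon, 4))
```

The rounded results can be compared with the DEC_LAT and DEC_LONG values shown for the same sample row.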


## Fields 50-59

In [230]:
df_crash_dict[50:60]

Unnamed: 0,column_name,description,type,length,constraints,summary,sample1,sample2
50,NONMOTR_DEATH_COUNT,Total number of Non-motorists killed in the crash,NUMBER,2.0,,"mean=0.01, min=0, max=3",0,0
51,NONMOTR_SUSP_SERIOUS_INJ_COUNT,Total number of Non-motorists with suspected s...,NUMBER,2.0,,"mean=0.01, min=0, max=3",0,0
52,NTFY_HIWY_MAINT,PENNDOT highway maintenance notified?,TEXT,1.0,"1=Y, 0 = N","N, Y",N,N
53,PED_COUNT,Total Pedestrians involved,NUMBER,2.0,,"mean=0.12, min=0, max=6",0,0
54,PED_DEATH_COUNT,Total Pedestrian fatalities,NUMBER,2.0,,"mean=0.0, min=0, max=3",0,0
55,PED_SUSP_SERIOUS_INJ_COUNT,Total Pedestrians with an Injury Severity of “...,NUMBER,2.0,,"mean=0.01, min=0, max=3",0,0
56,PERSON_COUNT,Total People involved,NUMBER,2.0,,"mean=2.37, min=0, max=28",3,2
57,POLICE_AGCY,Code of the Reporting Police Agency,TEXT,5.0,See Police Agency Code,"67301, 68K01, 00000",00000,67301
58,POSSIBLE_INJ_COUNT,Total number of People with an injury severity...,NUMBER,2.0,,"mean=0.16, min=0, max=15",0,0
59,RDWY_SURF_TYPE_CD,Code for the Roadway surface type –only for fa...,TEXT,2.0,See Column Code,"null, Blacktop, Concrete",,


In [235]:
# no non-motorists or pedestrians were involved
# highway maintenance was not notified
# 3 people were involved in the first crash and 2 in the second
# Surface type is null

In [234]:
# these are different police departments in the city
# 00000 is not mentioned in the data dictionary.
# I don't believe this will be an important data point in my analysis so I will not decode these values
df_crash.POLICE_AGCY.value_counts()

67301    36066
68K01     9255
00000     1382
67505      748
67501      527
67508      495
67504      342
67507      234
46104       63
68K02       24
46102       19
68Z99       19
68T07       17
68M03       10
68K03        9
67509        7
46108        6
39301        3
37301        2
62301        2
67506        1
51601        1
23109        1
46101        1
68M05        1
Name: POLICE_AGCY, dtype: int64

## Fields 60-69

In [236]:
df_crash_dict[60:70]

Unnamed: 0,column_name,description,type,length,constraints,summary,sample1,sample2
60,RELATION_TO_ROAD,Code for the crash’s relativity to the road,TEXT,2.0,See Column Code,"On roadway, In parking lane, Roadside (off tra...",On roadway,On roadway
61,ROAD_CONDITION,Roadway Surface Condition Code,TEXT,1.0,See Column Code,"Dry, Wet, Unknown",Dry,Wet
62,ROADWAY_CLEARED,Time the roadway was opened to traffic,TEXT,4.0,0000-2359 or 9999,"mean=1163.68, min=0.0, max=9999.0",,
63,SCH_BUS_IND,Did the crash involve a School Bus? (Y/N),TEXT,1.0,Y = Yes N = No,"N, null, Y",N,N
64,SCH_ZONE_IND,Did the crash occur in a School Zone? (Y/N),TEXT,1.0,Y = Yes N = No,"N, null, Y",N,N
65,SECONDARY_CRASH,Was this crash caused at least in part to a pr...,TEXT,1.0,Y = Yes N = No,"N, null, Y",N,N
66,SMALL_TRUCK_COUNT,Total amount of Small Trucks involved,NUMBER,2.0,,"mean=0.1, min=0, max=4",0,0
67,SPEC_JURIS_CD,\nCode that defines any special jurisdiction –...,TEXT,1.0,See Column Code,"null, No Special Jurisdiction, Other",,
68,SUSP_MINOR_INJ_COUNT,Total number of People with an injury severity...,NUMBER,2.0,,"mean=0.45, min=0, max=10",1,1
69,SUSP_SERIOUS_INJ_COUNT,Total number of People with an injury,NUMBER,2.0,,"mean=0.05, min=0, max=11",0,0


In [237]:
# both crashes occurred on the roadway
# the road surface was dry in the first crash and wet in the second
# no school bus or school zone involved
# no small trucks involved
# 1 suspected minor injury in each crash and no serious injuries

## Fields 70-79

In [238]:
df_crash_dict[70:80]

Unnamed: 0,column_name,description,type,length,constraints,summary,sample1,sample2
70,SUV_COUNT,Total count of sport utility vehicles involved...,NUMBER,2.0,,"mean=0.61, min=0, max=6",0,2
71,TCD_FUNC_CD,Code for Traffic Control Device state,TEXT,2.0,See Column Code,"No Controls, Device Functioning properly, Unknown",No Controls,Device Functioning properly
72,TCD_TYPE,Code that defines the Traffic Control Device,TEXT,1.0,See Column Code,"Not applicable, Traffic signal, Stop sign",Not applicable,Flashing traffic signal
73,TFC_DETOUR_IND,Was Traffic Detoured? (Y/N),TEXT,1.0,"1=Y, 0 = N","null, N, Y",,
74,TIME_OF_DAY,The Time of Day When the Crash Occurred,TEXT,4.0,0000 through 2359,"mean=1301.79, min=0.0, max=9999.0",820.0,1715.0
75,TOT_INJ_COUNT,injuries.\ninvolved in this crash. Does not in...,NUMBER,2.0,,"mean=0.9, min=0, max=25",1,1
76,TOTAL_UNITS,Total count of all Vehicles and Pedestrians,NUMBER,2.0,,"mean=2.18, min=1, max=18",2,2
77,UNB_DEATH_COUNT,No. of people killed not wearing a seatbelt,NUMBER,2.0,,"mean=0.0, min=0, max=4",0,0
78,UNB_SUSP_SERIOUS_INJ_COUNT,Serious Injuries\nTotal # of unbelted sustaini...,NUMBER,2.0,,"mean=0.0, min=0, max=8",0,0
79,UNBELTED_OCC_COUNT,Total count of all unbelted occupants,NUMBER,2.0,,"mean=0.24, min=0, max=12",0,1


In [None]:
# 2 SUVs were involved in the second crash.
        # That solves the mystery of what types of vehicle were involved in the 2nd crash
# I am not sure what TCD_FUNC_CD means.  I will need to look into this more
# the first event had no traffic control device (stop sign, traffic signal, yield sign, etc).
        # the second had a flashing traffic signal
# the first accident occurred at 8:20AM and the second occurred at 5:15PM
# one occupant was unbelted in the second crash

In [239]:
df_crash.TCD_TYPE.value_counts()

Not applicable                  26414
Traffic signal                  15797
Stop sign                        5642
Other Type TCD                    529
Flashing traffic signal           393
Unknown                           282
Yield sign                        121
Police officer or flagman          38
Active RR crossing controls        13
Passive RR crossing controls        6
Name: TCD_TYPE, dtype: int64

## Fields 80-89

In [240]:
df_crash_dict[80:90]

Unnamed: 0,column_name,description,type,length,constraints,summary,sample1,sample2
80,UNK_INJ_DEG_COUNT,No. of injuries with unknown severity,NUMBER,2.0,,"mean=0.24, min=0, max=23",0,0
81,UNK_INJ_PER_COUNT,No. of people that are unknown if injured,NUMBER,2.0,,"mean=0.18, min=0, max=8",0,0
82,URBAN_RURAL,Code to classify crash as Urban or Rural,TEXT,1.0,"1= Rural, 2=Urbanized,\n3=Urban","mean=2.0, min=1, max=2",2,2
83,VAN_COUNT,Total amount of vans involved,NUMBER,2.0,,"mean=0.1, min=0, max=4",0,0
84,VEHICLE_COUNT,Total number of all motor vehicles involved in...,NUMBER,2.0,,"mean=2.03, min=0, max=18",2,2
85,WEATHER1,Code for the first weather condition at time o...,TEXT,2.0,See Column Code,"Clear, Rain, Cloudy",Clear,Rain
86,WEATHER2,Code for the second weather condition at time ...,TEXT,2.0,See Column Code,"null, Clear, Rain",,Rain
87,WORK_ZONE_IND,Did the crash occur in a work zone,TEXT,1.0,"1=Y, 0 = N","N, Y",N,N
88,WORK_ZONE_LOC,The Work Zone Location Code,TEXT,1.0,See Column Code,"null, Activity area, Advance warning area",,
89,WORK_ZONE_TYPE,Code to define the type of Work Zone,TEXT,1.0,See Column Code,"null, Construction, Maintenance",,


In [None]:
# both crashes occurred in urbanized areas
# crash 1 occurred in clear weather and crash 2 occurred in the rain
# no work zone

## Fields 90-98

In [241]:
df_crash_dict[90:100]

Unnamed: 0,column_name,description,type,length,constraints,summary,sample1,sample2
90,WORKERS_PRES,Were construction personnel present?,TEXT,1.0,"1=Y, 0 = N","null, N, Y",,
91,WZ_CLOSE_DETOUR,Was traffic rerouted due to work zone?,TEXT,1.0,"1=Y, 0 = N","null, N, Y",,
92,WZ_FLAGGER,Did Work zone have a flagman?,TEXT,1.0,"1=Y, 0 = N","null, N, Y",,
93,WZ_LAW_OFFCR_IND,Did Work zone have a patrolman?,TEXT,1.0,"1=Y, 0 = N","null, N, Y",,
94,WZ_LN_CLOSURE,Did Work zone have a lane closure?,TEXT,1.0,"1=Y, 0 = N","null, Y, N",,
95,WZ_MOVING,Was there moving work in the zone?,TEXT,1.0,"1=Y, 0 = N","null, N, Y",,
96,WZ_OTHER,Was this a special type of work zone?,TEXT,1.0,"1=Y, 0 = N","null, N, Y",,
97,WZ_SHLDER_MDN,Was a median/shoulder in the zone?,TEXT,1.0,"1=Y, 0 = N","null, Y, N",,
98,WZ_WORKERS_INJ_KILLED,Were any Work Zone workers injured or killed a...,TEXT,1.0,"1=Y, 0 = N","null, N, Y",,


In [None]:
# no work zones involved

In [217]:
len(df_crash_dict.description)

99

In [227]:
'Was this midblock crash related to a nearby intersection?'

'Was this midblock crash related to a nearby intersection?'

In [226]:
# display each description in full (the dataframe view truncates long text)
for description in df_crash_dict.description:
    display(description)

'Time police arrived at the scene'

'Total amount of Automobiles Involved'

'Total Deaths of belted occupants'

'Total Suspected Serious Injuries of belted occupants'

'Total amount of Bicycles involved'

'Total amount of Bicyclist Fatalities'

'\nTotal amount of Bicyclist Suspected Serious Injuries'

'Total amount of Buses involved'

'killed in the crash\nTotal child passengers under the age of 8'

'with suspected serious injuries\nTotal child passengers under the age of 8'

'Collision category that defines the crash'

'Total Commercial vehicles involved'

'Speed limit for the Construction Zone'

'County Code Number where crash occurred'

'Month when the crash occurred'

'Year when the crash occurred'

'Crash Record Number'

'Day of the Week code when crash occurred'

'Decimal format of the Latitude'

'Decimal format of the Longitude'

'Time police were dispatched to the scene'

'\nDistrict Number where crash occurred (Based on County)'

'Total amount of 16-year-old drivers'

'Total amount of 17-year-old drivers'

'Total amount of 18-year-old drivers'

'Total amount of 19-year old drivers'

'Total amount of 20-year-old drivers'

'Total amount of 50 to 64-year-old drivers'

'Total amount of 65 to 74-year-old drivers'

'Total amount of drivers ages 75 and up'

'Estimated hours roadway was closed'

'Total amount of fatalities involved'

'Total amount of Heavy Trucks involved'

'involved in the Crash\nTotal Number of Horse and Buggy Units'

'The hour of Day when the crash occurred'

'Code that defines lighting at crash scene'

'Total count of all injuries sustained'

'Code that defines the Intersection Type'

'Was this midblock crash related to a nearby intersection?'

'Was there a lane closure? (Y/N)'

'GPS Latitude determined by PennDOT'

'Direction of traffic in closed lane (s)'

'Code that defines the crash location'

'GPS Longitude determined by PennDOT (in negative degrees)'

'Injury severity level of the crash'

'Total amount of Motorcyclist fatalities'

'Total amount of Motorcyclist Suspected Serious Injuries'

'Total amount of Motorcycles Involved'

'Municipality Code'

'Total number of Non-motorists involved in the crash'

'Total number of Non-motorists killed in the crash'

'Total number of Non-motorists with suspected serious injures in the crash'

'PENNDOT highway maintenance notified?'

'Total Pedestrians involved'

'Total Pedestrian fatalities'

'Total Pedestrians with an Injury Severity of “Suspected Serious Injury”'

'Total People involved'

'Code of the Reporting Police Agency'

'Total number of People with an injury severity of “Possible Injury”'

'Code for the Roadway surface type –only for fatal crashes'

'Code for the crash’s relativity to the road'

'Roadway Surface Condition Code'

'Time the roadway was opened to traffic'

'Did the crash involve a School Bus? (Y/N)'

'Did the crash occur in a School Zone? (Y/N)'

'Was this crash caused at least in part to a prior crash?'

'Total amount of Small Trucks involved'

'\nCode that defines any special jurisdiction – only for fatal crashes'

'Total number of People with an injury severity of Suspected Minor Injury'

'Total number of People with an injury'

'Total count of sport utility vehicles involved severity of Suspected Serious Injury'

'Code for Traffic Control Device state'

'Code that defines the Traffic Control Device'

'Was Traffic Detoured? (Y/N)'

'The Time of Day When the Crash Occurred'

'injuries.\ninvolved in this crash. Does not include fatal\nCount of total injuries sustained by persons'

'Total count of all Vehicles and Pedestrians'

'No. of people killed not wearing a seatbelt'

'Serious Injuries\nTotal # of unbelted sustaining Suspected'

'Total count of all unbelted occupants'

'No. of injuries with unknown severity'

'No. of people that are unknown if injured'

'Code to classify crash as Urban or Rural'

'Total amount of vans involved'

'Total number of all motor vehicles involved in the crash'

'Code for the first weather condition at time of crash'

'Code for the second weather condition at time of crash'

'Did the crash occur in a work zone'

'The Work Zone Location Code'

'Code to define the type of Work Zone'

'Were construction personnel present?'

'Was traffic rerouted due to work zone?'

'Did Work zone have a flagman?'

'Did Work zone have a patrolman?'

'Did Work zone have a lane closure?'

'Was there moving work in the zone?'

'Was this a special type of work zone?'

'Was a median/shoulder in the zone?'

'Were any Work Zone workers injured or killed as a result of this crash?'
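Several descriptions above contain `\n` with the pieces apparently out of order, for example 'killed in the crash\nTotal child passengers under the age of 8'. In the cases shown, reading the newline-separated fragments in reverse order gives a sensible sentence. The sketch below applies that guess; `fix_description` is a hypothetical helper, and the reverse-order reading is an assumption based only on the fragments listed here.

```python
def fix_description(text):
    """Rejoin newline-separated fragments in reverse order.

    Assumption: multi-line descriptions were extracted bottom line first.
    Single-line descriptions come back unchanged apart from stripped whitespace.
    """
    parts = [p.strip() for p in text.split('\n') if p.strip()]
    return ' '.join(reversed(parts))

# descriptions copied from the output above
child = fix_description('killed in the crash\nTotal child passengers under the age of 8')
bike = fix_description('\nTotal amount of Bicyclist Suspected Serious Injuries')
print(child)
print(bike)
```

A description that is garbled in some other way (for example the SUV_COUNT and SUSP_SERIOUS_INJ_COUNT rows, which look like they swapped text) would not be repaired by this and would still need a manual fix.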