# Interpret and Transform

In [1]:
import pandas as pd

In [2]:
df_collision_original = pd.read_csv("./data/collisions1.csv", date_format="%d/%m/%Y", parse_dates=["date"])
df_collision_original.head().T

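| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| police_force | 1 | 1 | 1 | 1 | 1 |
| accident_severity | 3 | 3 | 3 | 3 | 3 |
| number_of_vehicles | 1 | 3 | 2 | 2 | 2 |
| number_of_casualties | 1 | 2 | 1 | 1 | 1 |
| date | 2023-01-01 | 2023-01-01 | 2023-01-01 | 2023-01-01 | 2023-01-01 |
| day_of_week | 1 | 1 | 1 | 1 | 1 |
| time | 01:24 | 02:25 | 03:50 | 02:13 | 01:42 |
| first_road_class | 5 | 6 | 3 | 3 | 3 |
| road_type | 2 | 6 | 1 | 6 | 6 |
| speed_limit | 20 | 30 | 30 | 30 | 30 |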


## 1. Make categorical variables readable

The original dataset encodes its categorical variables as numbers. For ease of use, they should be converted into readable strings.

The [Road Safety Open Dataset Guide](https://data.dft.gov.uk/road-accidents-safety-data/dft-road-casualty-statistics-road-safety-open-dataset-data-guide-2023.xlsx) provides the key for these values and can be used as a mapper.

In [3]:
df_interp = pd.read_excel("./data/dft-road-casualty-statistics-road-safety-open-dataset-data-guide-2023.xlsx")

df_collision_key = df_interp[df_interp["table"] == "accident"]

categorical_columns = ["police_force","accident_severity",
                       "day_of_week", "first_road_class","road_type","junction_detail",
                       "pedestrian_crossing_human_control","pedestrian_crossing_physical_facilities",
                       "light_conditions", "weather_conditions", "road_surface_conditions",
                       "special_conditions_at_site", "carriageway_hazards","urban_or_rural_area",
                       "trunk_road_flag"]

categorical_columns_mapping = {}

# Build one {code: label} dictionary per categorical column from the guide
for col in categorical_columns:
    filtered_key = df_collision_key[df_collision_key["field name"] == col]
    categorical_columns_mapping[col] = pd.Series(filtered_key["label"].values, index=filtered_key["code/format"]).to_dict()

In [4]:
for col in categorical_columns:
    col_mapper = categorical_columns_mapping[col]
    df_collision_original[col] = df_collision_original[col].map(col_mapper)
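
`Series.map` with a dictionary returns `NaN` for any value that has no matching key, so a mismatch between the codes in the data and the keys in the guide (for example integer `1` versus string `"1"`) would silently blank out values. A minimal sketch of a check, using an invented two-row key and toy data rather than the real guide:

```python
import pandas as pd

# Toy key table using the same column names as the guide (values are invented)
toy_key = pd.DataFrame({
    "field name": ["road_type", "road_type"],
    "code/format": [1, 6],
    "label": ["Roundabout", "Single carriageway"],
})
toy_data = pd.DataFrame({"road_type": [1, 6, 9]})  # 9 has no entry in the key

mapper = dict(zip(toy_key["code/format"], toy_key["label"]))
mapped = toy_data["road_type"].map(mapper)

# Codes that failed to map show up as NaN after the mapping
unmapped_codes = toy_data.loc[mapped.isna(), "road_type"].unique().tolist()
print(unmapped_codes)  # [9]
```

Running the same check on each column in `categorical_columns` before overwriting it would show whether any codes fall outside the guide.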

In [5]:
df_collision_original.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| police_force | Metropolitan Police | Metropolitan Police | Metropolitan Police | Metropolitan Police | Metropolitan Police |
| accident_severity | Slight | Slight | Slight | Slight | Slight |
| number_of_vehicles | 1 | 3 | 2 | 2 | 2 |
| number_of_casualties | 1 | 2 | 1 | 1 | 1 |
| date | 2023-01-01 | 2023-01-01 | 2023-01-01 | 2023-01-01 | 2023-01-01 |
| day_of_week | Sunday | Sunday | Sunday | Sunday | Sunday |
| time | 01:24 | 02:25 | 03:50 | 02:13 | 01:42 |
| first_road_class | C | Unclassified | A | A | A |
| road_type | One way street | Single carriageway | Roundabout | Single carriageway | Single carriageway |
| speed_limit | 20 | 30 | 30 | 30 | 30 |


## 2. Filter for data collected by police officers who attended the scene

`did_police_officer_attend_scene_of_accident` tells us whether a police officer attended the scene of the accident.

The majority of the data (~70%) was collected by police officers who attended the scene of the accident.

About 11% was collected over the counter at a police station, and around 20% came from forms completed by members of the public ("self reporting" forms). The advice on completing the form acknowledges that data from forms completed by members of the public may contain missing fields.

It is also reasonable to assume that police officers are more experienced at collecting this data, which makes their records more reliable.

To reduce the number of missing fields and make the data more reliable, I have filtered the collision data to keep only collisions that a police officer attended.

In [6]:
df_collision_original["did_police_officer_attend_scene_of_accident"].value_counts(normalize=True)

did_police_officer_attend_scene_of_accident
1    0.685607
3    0.203658
2    0.110735
Name: proportion, dtype: float64
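
The reasoning above could be checked by comparing how often a field is unknown in each attendance group. The sketch below uses invented toy rows, not the real data, and assumes that the labels `Unknown` and `Data missing or out of range` are the ones that mark a missing value:

```python
import pandas as pd

# Toy rows (invented): attendance code alongside one mapped categorical field
toy = pd.DataFrame({
    "did_police_officer_attend_scene_of_accident": [1, 1, 1, 3, 3, 2],
    "weather_conditions": [
        "Fine no high winds", "Raining no high winds", "Unknown",
        "Unknown", "Data missing or out of range", "Fine no high winds",
    ],
})

# Assumed set of labels that indicate a missing value
unknown_labels = ["Unknown", "Data missing or out of range"]
is_unknown = toy["weather_conditions"].isin(unknown_labels)

# Share of unknown values within each attendance group
share_unknown = is_unknown.groupby(
    toy["did_police_officer_attend_scene_of_accident"]
).mean()
print(share_unknown)
```

A clearly higher share in groups 2 and 3 on the real data would support filtering on this field.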

In [7]:
df_collision = df_collision_original[df_collision_original["did_police_officer_attend_scene_of_accident"] == 1].copy()
df_collision["did_police_officer_attend_scene_of_accident"].value_counts()

did_police_officer_attend_scene_of_accident
1    71480
Name: count, dtype: int64

In [8]:
del df_collision["did_police_officer_attend_scene_of_accident"]

Check how many rows remain after filtering:

In [9]:
len(df_collision)

71480

## 3. Create the Target Variable

`accident_severity` has three possible values: `Slight`, `Serious` and `Fatal`.

- Fatal: This covers casualties who sustained injuries which caused death less than 30 days after the accident.

- Serious: "Injuries classed as serious include: fractures, concussion, internal injuries, crushings, burns (excluding friction burns), severe cuts, severe general shock requiring medical treatment and injuries causing death 30 or more days after the accident." [Reported road casualty statistics: Background Quality Report](https://www.gov.uk/government/publications/reported-road-casualty-statistics-background-quality-report/reported-road-casualty-statistics-background-quality-report)

- Slight: "An injured casualty that is not classified as seriously injured, having an injury of a minor character such as a sprain (including neck whiplash injury), bruise or cut which are not judged to be severe, or slight shock requiring roadside attention. This definition includes injuries not requiring medical treatment." [Reported road casualty statistics: Background Quality Report](https://www.gov.uk/government/publications/reported-road-casualty-statistics-background-quality-report/reported-road-casualty-statistics-background-quality-report)

**Assumption**: We assume that the severity categorisation is accurate. 

We will label every collision in this dataset whose severity is "Serious" or "Fatal" as "severe" and store the result in a binary `is_severe` column.

In [10]:
df_collision["accident_severity"].value_counts()

accident_severity
Slight     50091
Serious    19883
Fatal       1506
Name: count, dtype: int64

In [11]:
df_collision["is_severe"] = df_collision["accident_severity"].isin(["Serious", "Fatal"]).astype(int)
del df_collision["accident_severity"]

In [12]:
df_collision["is_severe"].value_counts(normalize=True)

is_severe
0    0.700769
1    0.299231
Name: proportion, dtype: float64

## 4. Handle date and time

A human reading the `date` column can see the year, month and day, and may spot patterns across years, months and days. For an algorithm to pick up these temporal differences, we often need to be more explicit and extract these variables.

Our data has already been broken down a little bit: we have separate columns for time and day of the week. 

We will include the following date and time variables and drop the date column:

- `time` (rounded to nearest hour)
- `day_of_week`
- `month` (extracted from date)
- `day_of_year` (extracted from date)

### Rounding time to the nearest hour

`time` records the time of the accident.

There are 1440 unique values in our dataframe, i.e. every minute of the day occurs at least once. A considerable number appear to have been rounded to the nearest hour, but not all.

We will round all of the times to the nearest hour so that we can better see patterns.

After rounding, the `time` column has 24 unique values, i.e. every hour block is present, so our category is complete.


In [13]:
len(df_collision["time"].unique())

1440

In [14]:
df_collision["time"].value_counts().head(5)

time
17:00    586
16:00    499
15:00    477
18:00    468
15:30    446
Name: count, dtype: int64

In [15]:
df_collision["time"].tail()

104252    01:30
104253    08:43
104255    17:00
104256    21:40
104257    16:17
Name: time, dtype: object

In [16]:
df_collision["time"] = pd.to_datetime(df_collision["time"], format="%H:%M").dt.round('60min').dt.strftime("%H:%M")

In [17]:
len(df_collision["time"].unique())

24
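
One side effect of rounding is worth noting: times late in the 23:00 hour round up to midnight of the next day, and formatting as `HH:MM` then places them in the `00:00` bin. A small sketch on invented toy times:

```python
import pandas as pd

toy_times = pd.Series(["01:24", "03:50", "23:45"])

# Parsing with only hours and minutes attaches a default date (1900-01-01)
parsed = pd.to_datetime(toy_times, format="%H:%M")
rounded = parsed.dt.round("60min")

# 23:45 rounds up to midnight of the following day; HH:MM formatting drops the date
hour_labels = rounded.dt.strftime("%H:%M").tolist()
print(hour_labels)  # ['01:00', '04:00', '00:00']
```

If keeping each collision in its own clock hour matters, `dt.floor("60min")` is an alternative that never crosses into the next hour.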

### Extracting month

We have the `date` field. It might be interesting to see if the month has a role in predicting severity.

In [18]:
df_collision["month"] = pd.to_datetime(df_collision["date"], format="mixed").dt.month_name()
df_collision["month"] = df_collision["month"].astype("category") 
category_order = ['January','February','March','April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December']
df_collision["month"] = df_collision["month"].cat.set_categories(category_order, ordered=True)
df_collision.head()

| | police_force | number_of_vehicles | number_of_casualties | date | day_of_week | time | first_road_class | road_type | speed_limit | junction_detail | ... | pedestrian_crossing_physical_facilities | light_conditions | weather_conditions | road_surface_conditions | special_conditions_at_site | carriageway_hazards | urban_or_rural_area | trunk_road_flag | is_severe | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Metropolitan Police | 1 | 1 | 2023-01-01 | Sunday | 01:00 | C | One way street | 20 | Other junction | ... | Pedestrian phase at traffic signal junction | Darkness - lights lit | Other | Wet or damp | | | Urban | Non-trunk | 0 | January |
| 1 | Metropolitan Police | 3 | 2 | 2023-01-01 | Sunday | 02:00 | Unclassified | Single carriageway | 30 | T or staggered junction | ... | Zebra | Darkness - lights lit | Fine no high winds | Dry | | | Urban | Non-trunk | 0 | January |
| 2 | Metropolitan Police | 2 | 1 | 2023-01-01 | Sunday | 04:00 | A | Roundabout | 30 | Roundabout | ... | No physical crossing facilities within 50 metres | Darkness - lights lit | Fine no high winds | Dry | | | Urban | Non-trunk | 0 | January |
| 3 | Metropolitan Police | 2 | 1 | 2023-01-01 | Sunday | 02:00 | A | Single carriageway | 30 | T or staggered junction | ... | No physical crossing facilities within 50 metres | Darkness - lights lit | Unknown | Dry | | | Urban | Non-trunk | 0 | January |
| 4 | Metropolitan Police | 2 | 1 | 2023-01-01 | Sunday | 02:00 | A | Single carriageway | 30 | Private drive or entrance | ... | No physical crossing facilities within 50 metres | Darkness - lights lit | Fine no high winds | Dry | | | Urban | Non-trunk | 0 | January |
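
Setting an explicit category order matters because a plain categorical sorts its categories alphabetically. A short sketch on invented toy values showing the difference:

```python
import pandas as pd

month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]

toy_months = pd.Series(["March", "January", "December", "February"])

as_plain = toy_months.astype("category")
as_ordered = as_plain.cat.set_categories(month_order, ordered=True)

plain_sorted = as_plain.sort_values().tolist()      # alphabetical category order
ordered_sorted = as_ordered.sort_values().tolist()  # calendar order

print(plain_sorted)    # ['December', 'February', 'January', 'March']
print(ordered_sorted)  # ['January', 'February', 'March', 'December']
```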


### Extract day of year from date and drop date column

To allow the algorithm to capture relative differences between dates, we will convert each date to its day of the year using the pandas `dayofyear` attribute. This produces an integer from 1 to 365, where 1 is the 1st day of the year, i.e. 1st January 2023.

This new feature should implicitly capture week-based patterns, so there is no strong need to add a `week_of_year` category.


In [19]:
df_collision["date"] = pd.to_datetime(df_collision["date"])
df_collision["day_of_year"] = df_collision["date"].dt.dayofyear
del df_collision["date"]
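
As a quick illustration of what `dayofyear` returns, here is a sketch on three invented 2023 dates:

```python
import pandas as pd

toy_dates = pd.to_datetime(pd.Series(["2023-01-01", "2023-03-01", "2023-12-31"]))

# dayofyear is 1-based: 1st January is day 1, and 2023 has 365 days
day_numbers = toy_dates.dt.dayofyear.tolist()
print(day_numbers)  # [1, 60, 365]

# Within one year, differences between day numbers equal the gap in days
gap_in_days = (toy_dates.iloc[2] - toy_dates.iloc[0]).days
print(gap_in_days)  # 364
```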

## 5. Create consistency

Our original categorical data consists of strings that are often capitalized. To maintain a degree of consistency, we strip surrounding whitespace, lowercase the strings, replace all spaces and hyphens with underscores, and remove plus signs.

In [20]:
categorical_cols = list(df_collision.dtypes[(df_collision.dtypes == "object") | (df_collision.dtypes == "category")].index)

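for col in categorical_cols:
    # Trim surrounding whitespace, then lowercase and replace spaces with underscores
    df_collision[col] = df_collision[col].str.strip()
    df_collision[col] = df_collision[col].str.lower().str.replace(' ', '_', regex=False)
    # Replace hyphens with underscores and drop literal plus signs
    # (regex=False so that '+' is not treated as a regex quantifier)
    df_collision[col] = df_collision[col].str.replace('-', '_', regex=False)
    df_collision[col] = df_collision[col].str.replace('+', '', regex=False)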

df_collision.tail().T

| | 104252 | 104253 | 104255 | 104256 | 104257 |
|---|---|---|---|---|---|
| police_force | police_scotland | police_scotland | police_scotland | police_scotland | police_scotland |
| number_of_vehicles | 2 | 1 | 2 | 1 | 2 |
| number_of_casualties | 2 | 1 | 1 | 1 | 1 |
| day_of_week | sunday | thursday | wednesday | tuesday | friday |
| time | 02:00 | 09:00 | 17:00 | 22:00 | 16:00 |
| first_road_class | motorway | unclassified | unclassified | unclassified | a |
| road_type | dual_carriageway | single_carriageway | single_carriageway | single_carriageway | single_carriageway |
| speed_limit | 70 | 30 | 30 | 20 | 50 |
| junction_detail | not_at_junction_or_within_20_metres | not_at_junction_or_within_20_metres | not_at_junction_or_within_20_metres | other_junction | roundabout |
| pedestrian_crossing_human_control | none_within_50_metres | data_missing_or_out_of_range | data_missing_or_out_of_range | data_missing_or_out_of_range | none_within_50_metres |
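
Because spaces and hyphens are replaced separately, a label such as `Darkness - lights lit` ends up with three consecutive underscores. The sketch below reproduces the same steps on a few labels taken from the output above, then shows an optional extra step that collapses repeated underscores. That extra step is an assumption for illustration and is not part of the notebook:

```python
import pandas as pd

toy_labels = pd.Series(["Darkness - lights lit", " Non-trunk ", "Wet or damp"])

# Same sequence of steps as the cleaning loop, applied to a toy Series
cleaned = (
    toy_labels.str.strip()
    .str.lower()
    .str.replace(" ", "_", regex=False)
    .str.replace("-", "_", regex=False)
    .str.replace("+", "", regex=False)
)
print(cleaned.tolist())  # ['darkness___lights_lit', 'non_trunk', 'wet_or_damp']

# Optional extra step (an assumption, not in the notebook): collapse repeated underscores
collapsed = cleaned.str.replace(r"_+", "_", regex=True)
print(collapsed.tolist())  # ['darkness_lights_lit', 'non_trunk', 'wet_or_damp']
```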


## 6. Save data

In [21]:
df_collision.to_csv("./data/collisions2.csv", index=False)