# EDA: Feature Importance

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score

## Recap on columns

So far we have handled missing values in our dataset and extracted all the core information we wanted. 

The columns we now have are all features that we want to use. 

For clarity, let's recap on their meaning:
  
- `number_of_vehicles`: the number of vehicles involved in the collision

**Target variable**

- `is_severe`: collisions which are severe and involve casualties that likely require medical treatment/medical emergency services

**Location**

- `is_urban`: captures whether the collision occurred in an urban, built-up area or in a rural area
  
- `police_force`: the police force in whose area the collision occurred. Each police force handles a specific area of Great Britain, so this variable acts as a kind of location category.

**Date and Time**

- `day_of_week`: the name of the day of the week the collision occurred, e.g. sunday
  
- `time`: the time the collision occurred. Note that the actual time has been rounded up to the nearest half hour.

- `month`: the month of the collision (string value)

- `day_of_year`: the day of the year the collision occurred: 1 is 1st January, 32 is 1st February, etc.

**Static Road details**

- `first_road_class`: the road category of the main road on which the collision occurred. It indicates how major a road is, e.g. A roads are major roads, B roads less so, and unclassified roads are local roads intended for local traffic.
  
- `road_type`: this is another form of road categorisation including values like "one_way_street" and "roundabout"
  
- `speed_limit`: this is the legal speed limit of the road. Speed limits in the UK are in miles per hour.

- `is_near_junction`: a junction is where two or more roads meet. The attribute identifies if an accident occurs near a junction.

- `is_near_pedestrian_crossing`: this field captures whether there was a pedestrian crossing nearby (within 50 metres of the accident). This could be a crossing facilitated by some kind of official, e.g. a "lollipop person" (school crossing patrol) or another "authorised person" (a police officer or a traffic warden in uniform). It could also be a crossing that is not human-facilitated, e.g. a zebra crossing, a footbridge, a space in the middle of the road (central refuge) or a crossing at a traffic light (toucan, pelican, puffin). Note that crossings at traffic lights must have a specific indicator light (a "green man") and allow time for pedestrians to cross. Not all traffic lights have a pedestrian crossing, so this field does not tell us whether traffic lights were nearby.

- `is_trunk`: this field identifies whether a road is managed by Highways England (a trunk road). Highways England is a government company in charge of operating and maintaining mostly major roads, e.g. motorways and major A roads.

**Dynamic road details**

- `light_conditions`: a category that details how much light there was at the time of the accident
  
- `weather_conditions`: a category describing the weather at the time of the collision
  
- `road_surface_conditions`: a category detailing the condition of the road, capturing whether the road was flooded or dry, for example
  
- `has_special_conditions_at_site`: this boolean category indicates whether there was anything particularly unusual or defective at the site, e.g. traffic signals not working, roadworks, oil or mud on the road
  
- `is_carriageway_hazard`: this category indicates whether there was an unexpected object in the road, e.g. something that had fallen off a lorry, a dog, another human
  

In [3]:
df_collision = pd.read_csv("./data/collisions4.csv")
df_collision.head().transpose()

Unnamed: 0,0,1,2,3,4
police_force,metropolitan_police,metropolitan_police,metropolitan_police,metropolitan_police,metropolitan_police
number_of_vehicles,1,3,2,2,1
day_of_week,sunday,sunday,sunday,sunday,sunday
time,01:00,02:00,04:00,02:00,02:00
first_road_class,c,unclassified,a,a,a
road_type,one_way_street,single_carriageway,roundabout,single_carriageway,single_carriageway
speed_limit,20,30,30,30,30
light_conditions,darkness___lights_lit,darkness___lights_lit,darkness___lights_lit,darkness___lights_lit,darkness___lights_lit
weather_conditions,other_adverse_weather_condition,fine_no_high_winds,fine_no_high_winds,fine_no_high_winds,fine_no_high_winds
road_surface_conditions,wet_or_damp,dry,dry,dry,dry


In [4]:
df_collision.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 69043 entries, 0 to 69042
Data columns (total 19 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   police_force                    69043 non-null  object
 1   number_of_vehicles              69043 non-null  int64 
 2   day_of_week                     69043 non-null  object
 3   time                            69043 non-null  object
 4   first_road_class                69043 non-null  object
 5   road_type                       69043 non-null  object
 6   speed_limit                     69043 non-null  int64 
 7   light_conditions                69043 non-null  object
 8   weather_conditions              69043 non-null  object
 9   road_surface_conditions         69043 non-null  object
 10  is_severe                       69043 non-null  int64 
 11  month                           69043 non-null  object
 12  day_of_year                     69043 non-null

In [5]:
df_collision.columns

Index(['police_force', 'number_of_vehicles', 'day_of_week', 'time',
       'first_road_class', 'road_type', 'speed_limit', 'light_conditions',
       'weather_conditions', 'road_surface_conditions', 'is_severe', 'month',
       'day_of_year', 'is_trunk', 'is_near_pedestrian_crossing', 'is_urban',
       'has_special_conditions_at_site', 'is_carriageway_hazard',
       'is_near_junction'],
      dtype='object')

## Perform some EDA on full training set

Following on from the approaches taken in the lectures, I will split the data at this point and perform some further analysis on the full training set. I am using the `random_state` parameter for reproducibility, so that when I repeat this split in other notebooks to train a model, I will be using the same rows for training and testing.

In [6]:
df_full_train, df_test = train_test_split(df_collision, test_size=0.2, random_state=11)
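
As a quick check of the reproducibility point, the sketch below splits a small made-up frame twice with the same `random_state` and confirms that both calls select the same rows. The toy DataFrame and the variable names are illustrative assumptions, not part of the collision data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for df_collision (hypothetical values, for illustration only)
df_toy = pd.DataFrame({"feature": range(100), "is_severe": [0, 1] * 50})

# Two independent calls with the same seed and test size
train_a, test_a = train_test_split(df_toy, test_size=0.2, random_state=11)
train_b, test_b = train_test_split(df_toy, test_size=0.2, random_state=11)

# The row indices (and their order) should match between the two calls
same_train_rows = train_a.index.equals(train_b.index)
same_test_rows = test_a.index.equals(test_b.index)
print(same_train_rows, same_test_rows)
```

The same holds across notebooks only if the input rows, their order and `test_size` are also unchanged.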

## Target variable

We can consider the proportion of collisions that were severe, i.e. the `severity_rate`.

The majority of the collisions in our dataset are not severe, which means we have an imbalanced dataset. This may cause issues, and I will look at ways in which regularization could help us later on. It also means that accuracy on its own is a poor measure of performance: a model that always predicts "not severe" would be right about 70% of the time without identifying a single severe collision.

In [7]:
df_full_train["is_severe"].value_counts()

is_severe
0    38748
1    16486
Name: count, dtype: int64
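
To make the accuracy point concrete, here is a minimal sketch of the "always predict not severe" baseline, using the two class counts printed above. The variable names are my own.

```python
# Class counts copied from the value_counts() output above
n_not_severe = 38748
n_severe = 16486

# A "model" that always predicts 0 is correct on every non-severe row
baseline_accuracy = n_not_severe / (n_not_severe + n_severe)

# It never flags a severe collision, so recall on the severe class is zero
baseline_recall = 0 / n_severe

print(round(baseline_accuracy, 3), baseline_recall)
```

Any model we train needs to be judged against this baseline, ideally with a metric that looks at the severe class directly.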

### Global severity rate 

Almost 30% of collisions in the training set were categorised as severe. We can use the `global_severity_rate`, calculated across the full training set, as a baseline for judging the importance of our categorical features.

In [8]:
df_full_train["is_severe"].value_counts(normalize=True)

is_severe
0    0.701524
1    0.298476
Name: proportion, dtype: float64

In [9]:
global_severity_rate = round(df_full_train["is_severe"].mean(),3)
global_severity_rate

np.float64(0.298)

## Feature Importance: Categorical columns

The majority of the features are categorical. 

### Difference and risk ratio

To understand how important each one might be we can compare the severity rate for one category with the global severity rate.

There are two ways of quantifying this:
- **Difference**: group_severity_rate - global_severity_rate

This is an absolute measure of the difference between the group severity rate and the global severity rate.

If this is positive, this category had more severe collisions than average; if negative, it had fewer severe collisions than average.
  
- **Risk ratio**: group_severity_rate / global_severity_rate

This is a relative measure: the ratio of the group severity rate to the global severity rate.

If this is greater than one, the severity rate for this category is higher than the global one, so this category has proportionally more severe collisions. If it is less than one, the opposite holds.
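
As a worked example of the two measures, the sketch below plugs in the rounded global rate and the severity rate for rural collisions (`is_urban` = 0) reported later in this notebook. It uses the same sign convention as `get_difference_risk_ratio`, i.e. group minus global.

```python
global_severity_rate = 0.298    # rounded global rate from the training split
group_severity_rate = 0.327868  # severity rate for is_urban == 0

# Absolute measure: how many points above or below the global rate
diff = group_severity_rate - global_severity_rate

# Relative measure: group rate as a multiple of the global rate
risk = group_severity_rate / global_severity_rate

print(round(diff, 4), round(risk, 2))
```

A difference of about +0.03 and a risk ratio of about 1.10 both say the same thing: rural collisions are severe roughly 10% more often than the average collision.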

**What risk ratio or difference values imply the category as a whole has good predictive power?**

The function `get_difference_risk_ratio` calculates the risk ratio and difference for every category value in the category column.

If some values of a category produce a risk ratio well above one (a large positive difference) and others a risk ratio well below one (a large negative difference), it could be a good indication that the category has good predictive power.

**Conclusion**

Risk ratio and difference quantify the relative importance of individual category values. When a category has a lot of values, using these measures to evaluate whether the category as a whole is important is tricky.

Given the complexity, I will only use this to look at binary variables briefly.

### Mutual Information score

For all other categorical variables, I will calculate the mutual information score.

The Mutual Information of two random variables is a measure of the mutual dependence between the two variables. The mutual information score quantifies the amount of information obtained about one random variable by observing the other random variable: how much do we learn about `is_severe` by observing the variable `police_force`? The higher the mutual information score is, the more we learn and the more important the variable is as a predictor of the target.
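
To show what the score measures, the sketch below computes mutual information by hand from a contingency table and compares it with `mutual_info_score`. The eight-row toy feature and target are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy data (hypothetical): a two-value feature and a binary target
feature = pd.Series(["a", "a", "a", "b", "b", "b", "b", "b"])
target = pd.Series([1, 1, 0, 0, 0, 0, 0, 1])

# Joint distribution p(x, y) from the normalised contingency table
joint = pd.crosstab(feature, target, normalize=True).to_numpy()
p_x = joint.sum(axis=1, keepdims=True)  # marginal distribution of the feature
p_y = joint.sum(axis=0, keepdims=True)  # marginal distribution of the target
independent = p_x @ p_y                 # what p(x, y) would be under independence

# MI = sum over cells of p(x, y) * ln(p(x, y) / (p(x) * p(y))), skipping empty cells
nonzero = joint > 0
mi_manual = float(np.sum(joint[nonzero] * np.log(joint[nonzero] / independent[nonzero])))

mi_sklearn = mutual_info_score(feature, target)
print(mi_manual, mi_sklearn)
```

The two values should agree. The score is zero only when the joint distribution equals the product of the marginals, i.e. when the feature tells us nothing about the target.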

### Evaluating difference and risk ratio of binary columns

We have several binary columns in our dataset:

- `is_trunk`
- `is_near_pedestrian_crossing`
- `is_urban`
- `has_special_conditions_at_site`
- `is_carriageway_hazard`
- `is_near_junction`

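Of all the columns, `is_urban` and `is_near_junction` seem to be the most promising predictors: each splits the data into two large groups whose severity rates sit roughly 2 to 3 percentage points either side of the global rate. `is_carriageway_hazard` shows a similar difference when a hazard is present, but that group contains only 1,688 collisions. It is difficult to compare the columns properly this way. For comparison, mutual information score is a better metric.

Note that the cells below are run on the full `df_collision` rather than `df_full_train`, while `global_severity_rate` comes from the training split.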

In [10]:
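def get_difference_risk_ratio(df: pd.DataFrame, category_col: str, global_severity_rate: float) -> pd.DataFrame:
    """Severity rate, row count, difference and risk ratio for each value of category_col."""
    df_group = df.groupby(category_col)["is_severe"].agg(["mean", "count"])
    # Absolute and relative comparison of each group's rate with the global rate
    df_group["diff"] = df_group["mean"] - global_severity_rate
    df_group["risk"] = df_group["mean"] / global_severity_rate
    # Label rows as "<column>_<value>" so tables for different columns are easy to tell apart
    df_group.index = [f"{category_col}_{cat_val}" for cat_val in df_group.index]
    return df_group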

In [11]:
get_difference_risk_ratio(df_collision, "is_trunk", global_severity_rate)

Unnamed: 0,mean,count,diff,risk
is_trunk_0,0.301017,63694,0.003017,1.010125
is_trunk_1,0.263601,5349,-0.034399,0.884566


In [12]:
get_difference_risk_ratio(df_collision, "is_near_pedestrian_crossing", global_severity_rate)

Unnamed: 0,mean,count,diff,risk
is_near_pedestrian_crossing_0,0.301619,54728,0.003619,1.012144
is_near_pedestrian_crossing_1,0.284736,14315,-0.013264,0.955491


In [13]:
get_difference_risk_ratio(df_collision, "is_urban", global_severity_rate)

Unnamed: 0,mean,count,diff,risk
is_urban_0,0.327868,26959,0.029868,1.100229
is_urban_1,0.279061,42084,-0.018939,0.936446


In [14]:
get_difference_risk_ratio(df_collision, "has_special_conditions_at_site", global_severity_rate)

Unnamed: 0,mean,count,diff,risk
has_special_conditions_at_site_0,0.298512,67257,0.000512,1.001717
has_special_conditions_at_site_1,0.283315,1786,-0.014685,0.95072


In [15]:
get_difference_risk_ratio(df_collision, "is_carriageway_hazard", global_severity_rate)

Unnamed: 0,mean,count,diff,risk
is_carriageway_hazard_0,0.297276,67355,-0.000724,0.997569
is_carriageway_hazard_1,0.331754,1688,0.033754,1.113267


In [16]:
get_difference_risk_ratio(df_collision, "is_near_junction", global_severity_rate)

Unnamed: 0,mean,count,diff,risk
is_near_junction_0,0.32699,30530,0.02899,1.097281
is_near_junction_1,0.275232,38513,-0.022768,0.923596


### Using mutual information score to evaluate feature importance

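The mutual information score suggests that `police_force` is the strongest predictor of severity, with a score roughly eight times that of the next feature. Because `police_force` has many distinct values, and mutual information estimated from a sample tends to be biased upwards for high-cardinality features, part of that gap may be an artefact.

The mutual information scores aren't particularly high. It's possible that we don't have appropriate data to get a good prediction of the severity of a collision.

Some categorical features look almost completely independent of severity, e.g. `has_special_conditions_at_site` and `is_near_pedestrian_crossing`.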

In [17]:
categorical = ['police_force', 'time',
       'first_road_class', 'road_type', 'light_conditions',
       'weather_conditions', 'road_surface_conditions', 'month',
       'day_of_week', 'is_trunk', 'is_near_pedestrian_crossing', 'is_urban',
       'has_special_conditions_at_site', 'is_carriageway_hazard',
       'is_near_junction'
       ]

def mutual_info_severity_score(series):
    """Mutual information (in nats) between one categorical column and is_severe."""
    return mutual_info_score(series, df_full_train["is_severe"])

df_full_train[categorical].apply(mutual_info_severity_score).sort_values(ascending=False)

police_force                      0.011708
is_near_junction                  0.001405
time                              0.001378
is_urban                          0.001329
road_type                         0.001104
light_conditions                  0.000689
first_road_class                  0.000494
day_of_week                       0.000442
month                             0.000361
road_surface_conditions           0.000332
is_trunk                          0.000283
weather_conditions                0.000192
is_carriageway_hazard             0.000113
is_near_pedestrian_crossing       0.000075
has_special_conditions_at_site    0.000004
dtype: float64
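
To put these scores on a scale, it helps to compare them with the entropy of the target, which is the most information any feature could provide about `is_severe`. The sketch below is a rough calculation, assuming the rounded global severity rate of 0.298 and the `police_force` score printed above; `mutual_info_score` reports values in nats (natural logarithm), so the entropy is computed in nats too.

```python
import numpy as np

p_severe = 0.298  # rounded global severity rate from the training split

# Entropy of the binary target in nats: the upper bound on any MI score with is_severe
target_entropy = -(p_severe * np.log(p_severe) + (1 - p_severe) * np.log(1 - p_severe))

# Share of the target's uncertainty accounted for by the top-scoring feature
police_force_mi = 0.011708
share_explained = police_force_mi / target_entropy

print(round(target_entropy, 3), round(share_explained, 3))
```

On these numbers the target entropy is about 0.61 nats, so even `police_force` accounts for only around 2% of the uncertainty in severity.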

## Feature Importance: Numerical columns

We can calculate the correlation between the numerical features we have and the target variable to understand how strongly each numerical feature is linearly related to the target variable.

We find that the correlation for both numerical features is between -0.075 and +0.004, indicating a very weak linear relationship and suggesting that neither of these columns is important in predicting the severity of a collision.

In [18]:
numerical = ["number_of_vehicles","day_of_year"]

df_full_train[numerical].corrwith(df_full_train["is_severe"])

number_of_vehicles   -0.074777
day_of_year           0.004244
dtype: float64
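
One caveat worth keeping in mind is that `corrwith` computes Pearson correlation, which only picks up linear relationships. The toy example below (made-up data, not taken from the collision set) has a target that depends entirely on the feature, yet the correlation is zero because the relationship is symmetric rather than linear.

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

# Toy data (hypothetical): the target is 1 only at the extremes of x
x = pd.Series([-2, -1, 0, 1, 2] * 20)
y = (x.abs() == 2).astype(int)

pearson = x.corr(y)           # linear association only
mi = mutual_info_score(x, y)  # treats each distinct x value as a category

print(pearson, mi)
```

Here the correlation is effectively zero while the mutual information is clearly positive, so a low correlation for a feature such as `day_of_year` rules out a linear trend but not a seasonal one.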

## Save final version of data

In [20]:
df_collision.to_csv("../2_model_training/data/collisions_final.csv", index=False)
df_collision.to_csv("../3_scripts/data/collisions_final.csv", index=False)