**Table of contents**<a id='toc0_'></a>    
- [Feature Selection](#toc1_1_)    
    - [Load the data downloaded by Network Rail and filtered by the London Overground operator](#toc1_1_1_)    
    - [Train - Test split](#toc1_1_2_)    
    - [Initial feature overview](#toc1_1_3_)    
    - [In-depth feature analysis](#toc1_1_4_)    
      - [Departure and Arrival Station](#toc1_1_4_1_)    
      - [Timetable information](#toc1_1_4_2_)    
      - [Other timing information](#toc1_1_4_3_)    
      - [Trailing](#toc1_1_4_4_)    
      - [Reactionary delays](#toc1_1_4_5_)    
      - [Delays caused by incident](#toc1_1_4_6_)    
      - [Affected train information](#toc1_1_4_7_)    
      - [Other fields](#toc1_1_4_8_)    


## <a id='toc1_1_'></a>[Feature Selection](#toc0_)

In [107]:
import pandas as pd
from sklearn.model_selection import train_test_split

### <a id='toc1_1_1_'></a>[Load the data downloaded by Network Rail and filtered by the London Overground operator](#toc0_)

In [108]:
df = pd.read_pickle('./data/processed/df.pkl')

In [27]:
df.head()

Unnamed: 0,FINANCIAL_YEAR_AND_PERIOD,ORIGIN_DEPARTURE_DATE,TRUST_TRAIN_ID_AFFECTED,PLANNED_ORIG_LOC_CODE_AFF,PLANNED_ORIG_WTT_DATETIME_AFF,PLANNED_ORIG_GBTT_DATETIME_AFF,PLANNED_DEST_LOC_CODE_AFFECTED,PLANNED_DEST_WTT_DATETIME_AFF,PLANNED_DEST_GBTT_DATETIME_AFF,TRAIN_SERVICE_CODE_AFFECTED,...,INCIDENT_RESPONSIBLE_TRAIN,RESP_TRAIN,REACT_TRAIN,PERFORMANCE_EVENT_CODE,START_STANOX,END_STANOX,EVENT_DATETIME,PFPI_MINUTES,UNNAMED: 40,UNNAMED: 41
0,2020/21_P01,01-Apr-20,879D681801,87651.0,2020-01-04 22:09:00,2020-01-04 22:09:00,52051.0,2020-01-04 22:59:00,2020-01-04 23:04:00,22215003.0,...,2K67,872K671801,872K671801,M,87643.0,87643.0,01/04/2020 22:19,5.0,,
1,2020/21_P01,01-Apr-20,879H411R01,87219.0,2020-01-04 15:30:00,2020-01-04 15:30:00,52051.0,2020-01-04 16:21:00,2020-01-04 16:26:00,22218000.0,...,9H41,879H411R01,,M,87609.0,87568.0,01/04/2020 16:03,5.0,,
2,2020/21_P01,01-Apr-20,512D191C01,51815.0,2020-01-04 08:16:00,2020-01-04 08:16:00,52741.0,2020-01-04 08:55:00,2020-01-04 09:00:00,21234001.0,...,2D19,512D191C01,,M,51815.0,51815.0,01/04/2020 08:22,6.0,,
3,2020/21_P01,01-Apr-20,879H671801,87219.0,2020-01-04 22:00:00,2020-01-04 22:00:00,52051.0,2020-01-04 22:51:00,2020-01-04 22:56:00,22206000.0,...,2J66,872J661701,872J661701,M,87208.0,87609.0,01/04/2020 22:19,4.0,,
4,2020/21_P01,01-Apr-20,872L151A01,87219.0,2020-01-04 07:26:00,2020-01-04 07:26:00,52226.0,2020-01-04 08:31:00,2020-01-04 08:31:00,22204000.0,...,2M15,872M151A01,,M,72241.0,72241.0,01/04/2020 07:39,3.0,,


### <a id='toc1_1_2_'></a>[Train - Test split](#toc0_)

In [109]:
X = df.drop(columns='PFPI_MINUTES')
y = df['PFPI_MINUTES']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)


### <a id='toc1_1_3_'></a>[Initial feature overview](#toc0_)

The majority of the features are categorical, many with a large number of unique values. One-hot encoding all of them is not feasible, as the feature matrix `X_preproc` would become very large and sparse.
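To make the size argument concrete, here is a minimal sketch on a toy frame (the values are made up and stand in for `X`). It shows that the width of a one-hot encoded matrix equals the sum of the per-column cardinalities.

```python
import pandas as pd

# Toy frame standing in for X; the values are illustrative only.
toy = pd.DataFrame({
    "REACT_TRAIN": ["872L21MY10", "512D57MB12", "879D11MC01", "872L21MY10"],
    "ENGLISH_DAY_TYPE": ["WD", "SU", "WD", "SA"],
})

# One-hot encoding creates one column per distinct value of each feature,
# so the encoded width is the sum of the per-column cardinalities.
expected_width = int(toy.nunique().sum())
encoded = pd.get_dummies(toy)

print(expected_width, encoded.shape[1])
```

With the cardinalities printed in the output that follows, the seven largest features alone would contribute more than 300,000 columns.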

In [111]:
X.dtypes.value_counts()
X.columns

Index(['FINANCIAL_YEAR_AND_PERIOD', 'ORIGIN_DEPARTURE_DATE',
       'TRUST_TRAIN_ID_AFFECTED', 'PLANNED_ORIG_LOC_CODE_AFF',
       'PLANNED_ORIG_WTT_DATETIME_AFF', 'PLANNED_ORIG_GBTT_DATETIME_AFF',
       'PLANNED_DEST_LOC_CODE_AFFECTED', 'PLANNED_DEST_WTT_DATETIME_AFF',
       'PLANNED_DEST_GBTT_DATETIME_AFF', 'TRAIN_SERVICE_CODE_AFFECTED',
       'SERVICE_GROUP_CODE_AFFECTED', 'OPERATOR_AFFECTED', 'ENGLISH_DAY_TYPE',
       'APP_TIMETABLE_FLAG_AFF', 'TRAIN_SCHEDULE_TYPE_AFFECTED',
       'TRACTION_TYPE_AFFECTED', 'TRAILING_LOAD_AFFECTED',
       'TIMING_LOAD_AFFECTED', 'UNIT_CLASS_AFFECTED', 'INCIDENT_NUMBER',
       'INCIDENT_CREATE_DATE', 'INCIDENT_START_DATETIME',
       'INCIDENT_END_DATETIME', 'SECTION_CODE',
       'NETWORK_RAIL_LOCATION_MANAGER', 'RESPONSIBLE_MANAGER',
       'INCIDENT_REASON', 'ATTRIBUTION_STATUS', 'INCIDENT_EQUIPMENT',
       'INCIDENT_DESCRIPTION', 'REACTIONARY_REASON_CODE',
       'INCIDENT_RESPONSIBLE_TRAIN', 'RESP_TRAIN', 'REACT_TRAIN',
       'PERFORMAN

In [48]:
feat_categorical_nunique = X.select_dtypes(include='object').nunique().sort_values(ascending = False)
print(feat_categorical_nunique)

EVENT_DATETIME                    127060
TRUST_TRAIN_ID_AFFECTED            75170
REACT_TRAIN                        49539
INCIDENT_NUMBER                    38239
INCIDENT_DESCRIPTION               32758
RESP_TRAIN                         24830
INCIDENT_RESPONSIBLE_TRAIN         11489
INCIDENT_EQUIPMENT                  6679
SECTION_CODE                        1633
RESPONSIBLE_MANAGER                  828
ORIGIN_DEPARTURE_DATE                734
END_STANOX                           240
START_STANOX                         235
INCIDENT_REASON                      211
PLANNED_ORIG_LOC_CODE_AFF            174
PLANNED_DEST_LOC_CODE_AFFECTED       163
NETWORK_RAIL_LOCATION_MANAGER         70
FINANCIAL_YEAR_AND_PERIOD             67
TRAIN_SERVICE_CODE_AFFECTED           28
REACTIONARY_REASON_CODE               19
UNIT_CLASS_AFFECTED                   15
SERVICE_GROUP_CODE_AFFECTED            6
ENGLISH_DAY_TYPE                       5
TRAIN_SCHEDULE_TYPE_AFFECTED           4
ATTRIBUTION_STAT

In [61]:
X.isna().sum().sort_values(ascending=False)/len(X)

TRAILING_LOAD_AFFECTED            0.621005
TIMING_LOAD_AFFECTED              0.621005
RESP_TRAIN                        0.317426
REACT_TRAIN                       0.310087
REACTIONARY_REASON_CODE           0.302483
INCIDENT_RESPONSIBLE_TRAIN        0.294313
INCIDENT_EQUIPMENT                0.290213
PLANNED_ORIG_GBTT_DATETIME_AFF    0.142573
PLANNED_DEST_GBTT_DATETIME_AFF    0.142573
UNIT_CLASS_AFFECTED               0.011258
TRACTION_TYPE_AFFECTED            0.011095
INCIDENT_REASON                   0.000162
INCIDENT_NUMBER                   0.000052
INCIDENT_START_DATETIME           0.000052
NETWORK_RAIL_LOCATION_MANAGER     0.000052
RESPONSIBLE_MANAGER               0.000052
INCIDENT_DESCRIPTION              0.000052
ATTRIBUTION_STATUS                0.000052
SECTION_CODE                      0.000052
INCIDENT_END_DATETIME             0.000052
PERFORMANCE_EVENT_CODE            0.000000
START_STANOX                      0.000000
END_STANOX                        0.000000
FINANCIAL_Y

### <a id='toc1_1_4_'></a>[In-depth feature analysis](#toc0_)

#### <a id='toc1_1_4_1_'></a>[Departure and Arrival Station](#toc0_)

The data reports both the planned departure/arrival locations and the effective departure/arrival locations. These might differ, for example if the train journey is cut short after a delay. Only the planned locations ('PLANNED_ORIG_LOC_CODE_AFF', 'PLANNED_DEST_LOC_CODE_AFFECTED') will be retained. Since they have too many unique values to encode directly, they will have to be converted into geographical coordinates.
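A minimal sketch of the coordinate conversion, assuming a lookup table from location code to latitude/longitude is available. The table `stanox_coords`, its column names, the helper `add_coordinates` and the coordinates themselves are all hypothetical placeholders, not real station positions.

```python
import pandas as pd

# Hypothetical lookup table; codes are stored as floats to match the
# float-typed codes in the data. Coordinates are placeholders.
stanox_coords = pd.DataFrame({
    "STANOX": [72419.0, 51531.0],
    "LAT": [51.50, 51.60],
    "LON": [-0.10, -0.20],
}).set_index("STANOX")

toy = pd.DataFrame({
    "PLANNED_ORIG_LOC_CODE_AFF": [72419.0, 51531.0, 99999.0],
    "PLANNED_DEST_LOC_CODE_AFFECTED": [51531.0, 72419.0, 72419.0],
})

def add_coordinates(frame, code_col, prefix, lookup):
    """Append <prefix>_LAT and <prefix>_LON columns looked up from a code column."""
    out = frame.copy()
    # Coerce to numeric so that codes stored as strings or ints also match.
    codes = pd.to_numeric(out[code_col], errors="coerce")
    out[f"{prefix}_LAT"] = codes.map(lookup["LAT"])
    out[f"{prefix}_LON"] = codes.map(lookup["LON"])
    return out

toy = add_coordinates(toy, "PLANNED_ORIG_LOC_CODE_AFF", "ORIG", stanox_coords)
toy = add_coordinates(toy, "PLANNED_DEST_LOC_CODE_AFFECTED", "DEST", stanox_coords)
print(toy)
```

Codes missing from the lookup (99999.0 in the toy data) end up with NaN coordinates, which makes unmatched locations easy to spot before modelling.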

In [103]:
station_cols = ['PLANNED_ORIG_LOC_CODE_AFF', 'PLANNED_DEST_LOC_CODE_AFFECTED', 'START_STANOX', 'END_STANOX']
print("Value Counts")
display(X[station_cols].value_counts(dropna=False))
print("Missing Values")
display(X[station_cols].isna().sum()/len(X))
print("Unique Values")
display(X[station_cols].nunique())

Value Counts


PLANNED_ORIG_LOC_CODE_AFF  PLANNED_DEST_LOC_CODE_AFFECTED  START_STANOX  END_STANOX
72419.0                    51531.0                         51558.0       51541.0       4771
87219.0                    52226.0                         72275.0       72419.0       2522
87132.0                    52226.0                         72275.0       72419.0       2335
72410.0                    72000.0                         72277.0       72017.0       2290
52226.0                    87219.0                         72419.0       72275.0       2271
                                                                                       ... 
72271.0                    51529.0                         51531.0       51529.0          1
                                                           72271.0       72419.0          1
52226                      87219                           52226         72421            1
72271.0                    51531.0                         72271.0       72419.0        

Missing Values


PLANNED_ORIG_LOC_CODE_AFF         0.0
PLANNED_DEST_LOC_CODE_AFFECTED    0.0
START_STANOX                      0.0
END_STANOX                        0.0
dtype: float64

Unique Values


PLANNED_ORIG_LOC_CODE_AFF         174
PLANNED_DEST_LOC_CODE_AFFECTED    163
START_STANOX                      235
END_STANOX                        240
dtype: int64

#### <a id='toc1_1_4_2_'></a>[Timetable information](#toc0_)

* 'PLANNED_ORIG_WTT_DATETIME_AFF', 'PLANNED_DEST_WTT_DATETIME_AFF': The working timetable (WTT) is the rail industry’s version of the public national timetable. It shows all movements on the rail network including freight trains, empty trains and those coming in and out of depots.
* 'PLANNED_DEST_GBTT_DATETIME_AFF', 'PLANNED_ORIG_GBTT_DATETIME_AFF': GBTT times are those that appear in the published timetable and against which delays are calculated.
* 'APP_TIMETABLE_FLAG_AFF': Applicable timetable flag. If N, the train is not in the official performance records because it is a short-term replacement of a train plan, normally a reinstatement of part of a cancelled service.

Since we are interested in passenger trains, we will use the GBTT timetable. Missing values most likely correspond to freight trains or to movements that do not carry passengers.
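As a rough sketch of how the GBTT columns could be used, the snippet below keeps only rows with a published timetable entry and derives two features from them. The derived column names `PLANNED_DURATION_MIN` and `DEPARTURE_HOUR` are hypothetical, and the toy timestamps are taken from the value counts shown further down.

```python
import pandas as pd

toy = pd.DataFrame({
    "PLANNED_ORIG_GBTT_DATETIME_AFF": pd.to_datetime(
        ["2018-11-12 08:23:00", None, "2019-10-12 19:14:00"]),
    "PLANNED_DEST_GBTT_DATETIME_AFF": pd.to_datetime(
        ["2018-11-12 09:03:00", None, "2019-10-12 20:21:00"]),
})

gbtt_cols = ["PLANNED_ORIG_GBTT_DATETIME_AFF", "PLANNED_DEST_GBTT_DATETIME_AFF"]

# Keep only rows that have a published (GBTT) timetable entry.
passenger = toy.dropna(subset=gbtt_cols).copy()

# Hypothetical derived features: planned journey length and departure hour.
planned = passenger[gbtt_cols[1]] - passenger[gbtt_cols[0]]
passenger["PLANNED_DURATION_MIN"] = planned.dt.total_seconds() / 60
passenger["DEPARTURE_HOUR"] = passenger[gbtt_cols[0]].dt.hour
print(passenger[["PLANNED_DURATION_MIN", "DEPARTURE_HOUR"]])
```

If rows are dropped from `X` in this way, the same row selection would have to be applied to `y` so that features and target stay aligned.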

In [113]:
timetable_cols = ['PLANNED_ORIG_WTT_DATETIME_AFF', 'PLANNED_DEST_WTT_DATETIME_AFF',
                  'PLANNED_DEST_GBTT_DATETIME_AFF', 'PLANNED_ORIG_GBTT_DATETIME_AFF',
                  'APP_TIMETABLE_FLAG_AFF']
print("Value Counts")
display(X[timetable_cols].value_counts(dropna=False))
print("Missing Values")
display(X[timetable_cols].isna().sum()/len(X))
print("Unique Values")
display(X[timetable_cols].nunique())

Value Counts


PLANNED_ORIG_WTT_DATETIME_AFF  PLANNED_DEST_WTT_DATETIME_AFF  PLANNED_DEST_GBTT_DATETIME_AFF  PLANNED_ORIG_GBTT_DATETIME_AFF  APP_TIMETABLE_FLAG_AFF
2018-11-12 08:23:00            2018-11-12 08:59:00            2018-11-12 09:03:00             2018-11-12 08:23:00             Y                         12
2019-10-12 19:14:00            2019-10-12 20:16:00            2019-10-12 20:21:00             2019-10-12 19:14:00             Y                         12
2019-10-12 19:09:00            2019-10-12 20:11:00            2019-10-12 20:14:00             2019-10-12 19:09:00             Y                         11
2020-12-14 11:47:00            2020-12-14 12:39:00            2020-12-14 12:39:00             2020-12-14 11:47:00             Y                         11
2019-01-09 16:32:00            2019-01-09 17:25:00            2019-01-09 17:28:00             2019-01-09 16:32:00             Y                          9
                                                                            

Missing Values


PLANNED_ORIG_WTT_DATETIME_AFF     0.000000
PLANNED_DEST_WTT_DATETIME_AFF     0.000000
PLANNED_DEST_GBTT_DATETIME_AFF    0.142573
PLANNED_ORIG_GBTT_DATETIME_AFF    0.142573
APP_TIMETABLE_FLAG_AFF            0.000000
dtype: float64

Unique Values


PLANNED_ORIG_WTT_DATETIME_AFF     110452
PLANNED_DEST_WTT_DATETIME_AFF     110846
PLANNED_DEST_GBTT_DATETIME_AFF     95310
PLANNED_ORIG_GBTT_DATETIME_AFF     95279
APP_TIMETABLE_FLAG_AFF                 2
dtype: int64

#### <a id='toc1_1_4_3_'></a>[Other timing information](#toc0_)

* 'FINANCIAL_YEAR_AND_PERIOD': The “railway” period that the delay occurred in.
* 'ORIGIN_DEPARTURE_DATE': The date of the train within the system
* 'ENGLISH_DAY_TYPE': Weekday, Saturday, Sunday, bank holiday, Christmas.
* 'INCIDENT_CREATE_DATE': The date the incident was entered into the system
* 'INCIDENT_START_DATETIME': The date the system has the incident live (this is not the length of the incident on the ground).
* 'INCIDENT_END_DATETIME': The date the system has the incident live (this is not the length of the incident on the ground).
* 'EVENT_DATETIME': The time the train encountered the delay.

'FINANCIAL_YEAR_AND_PERIOD' refers to a fiscal period, so it can be dropped. 'ORIGIN_DEPARTURE_DATE' overlaps with the timetable departure date, so it can be dropped too. 'INCIDENT_START_DATETIME' and 'INCIDENT_END_DATETIME' do not reflect the length of the incident on the ground and are again administrative, so they can be dropped as well.
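The drop described above can be sketched as follows. The snippet also parses 'EVENT_DATETIME', which appears in at least two text layouts in the outputs of this section ("16-SEP-2018 22:12" and "07/02/2020 07:43"). Parsing each layout with an explicit day-first format is an assumption about how the column should be read, based on rows where 07/02/2020 accompanies a 07-Feb-20 departure.

```python
import pandas as pd

toy = pd.DataFrame({
    "FINANCIAL_YEAR_AND_PERIOD": ["2018/19_P07", "2019/20_P12"],
    "ORIGIN_DEPARTURE_DATE": ["16/09/2018 00:00", "07-Feb-20"],
    "INCIDENT_START_DATETIME": ["2018-09-16 21:50:00", "2020-07-02 06:45:00"],
    "INCIDENT_END_DATETIME": ["2018-09-17 06:00:00", "2020-07-02 15:00:00"],
    "EVENT_DATETIME": ["16-SEP-2018 22:12", "07/02/2020 07:43"],
})

# Administrative or redundant timing columns.
admin_cols = ["FINANCIAL_YEAR_AND_PERIOD", "ORIGIN_DEPARTURE_DATE",
              "INCIDENT_START_DATETIME", "INCIDENT_END_DATETIME"]
reduced = toy.drop(columns=admin_cols)

# Try each layout with an explicit day-first format; values that do not
# match a format become NaT and are filled from the other attempt.
fmt_a = pd.to_datetime(reduced["EVENT_DATETIME"], format="%d-%b-%Y %H:%M", errors="coerce")
fmt_b = pd.to_datetime(reduced["EVENT_DATETIME"], format="%d/%m/%Y %H:%M", errors="coerce")
reduced["EVENT_DATETIME"] = fmt_a.fillna(fmt_b)
print(reduced)
```

Any value matching neither layout stays as NaT, so the number of unparsed rows can be checked with `reduced["EVENT_DATETIME"].isna().sum()`.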

In [146]:
timing_info = ['FINANCIAL_YEAR_AND_PERIOD','ORIGIN_DEPARTURE_DATE','ENGLISH_DAY_TYPE','INCIDENT_CREATE_DATE','INCIDENT_START_DATETIME',
               'INCIDENT_END_DATETIME','EVENT_DATETIME']
print("Value Counts")
display(X[timing_info].value_counts(dropna=False))
print("Missing Values")
display(X[timing_info].isna().sum()/len(X))
print("Unique Values")
display(X[timing_info].nunique())

Value Counts


FINANCIAL_YEAR_AND_PERIOD  ORIGIN_DEPARTURE_DATE  ENGLISH_DAY_TYPE  INCIDENT_CREATE_DATE  INCIDENT_START_DATETIME  INCIDENT_END_DATETIME  EVENT_DATETIME   
2018/19_P07                16/09/2018 00:00       SU                2018-09-16            2018-09-16 21:50:00      2018-09-17 06:00:00    16-SEP-2018 22:12    5
2019/20_P12                07-Feb-20              WD                2020-02-07            2020-07-02 06:45:00      2020-07-02 15:00:00    07/02/2020 07:43     4
2019/20_P10                08-Dec-19              SU                2019-12-08            2019-08-12 16:00:00      2019-09-12 01:07:00    08/12/2019 15:39     4
2019/20_P06                18-Aug-19              SU                2019-08-18            2019-08-18 10:27:00      2019-08-18 12:27:00    18/08/2019 12:32     4
                           22-Aug-19              WD                2019-08-23            2019-08-22 19:00:00      2019-08-22 23:00:00    22/08/2019 20:23     4
                                       

Missing Values


FINANCIAL_YEAR_AND_PERIOD    0.000000
ORIGIN_DEPARTURE_DATE        0.000000
ENGLISH_DAY_TYPE             0.000000
INCIDENT_CREATE_DATE         0.000000
INCIDENT_START_DATETIME      0.000052
INCIDENT_END_DATETIME        0.000052
EVENT_DATETIME               0.000000
dtype: float64

Unique Values


FINANCIAL_YEAR_AND_PERIOD        67
ORIGIN_DEPARTURE_DATE           734
ENGLISH_DAY_TYPE                  5
INCIDENT_CREATE_DATE           1148
INCIDENT_START_DATETIME       35149
INCIDENT_END_DATETIME         35738
EVENT_DATETIME               127060
dtype: int64

#### <a id='toc1_1_4_4_'></a>[Trailing](#toc0_)

TRAILING_LOAD_AFFECTED and TIMING_LOAD_AFFECTED are features pertaining to freight trains only, not passenger ones.
London Overground trains are passenger trains, which explains why these fields are essentially empty.
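The value counts below show blank strings next to the NaN entries, so `isna()` alone understates how empty these fields are. A minimal sketch of a check that counts both, using a hypothetical helper `empty_share` on toy data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "TRAILING_LOAD_AFFECTED": [np.nan, " ", "", np.nan],
    "TIMING_LOAD_AFFECTED": [np.nan, " ", "S", np.nan],
})

def empty_share(series):
    """Fraction of entries that are NaN or contain only whitespace."""
    as_text = series.fillna("").astype(str).str.strip()
    return float(as_text.eq("").mean())

shares = {col: empty_share(toy[col]) for col in toy.columns}
print(shares)
```

Columns whose share is close to 1 carry almost no information and are natural candidates for dropping.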

In [85]:
display(X['TRAILING_LOAD_AFFECTED'].value_counts(dropna=False))
display(X['TIMING_LOAD_AFFECTED'].value_counts(dropna=False))

TRAILING_LOAD_AFFECTED
NaN    95708
       58410
Name: count, dtype: int64

TIMING_LOAD_AFFECTED
NaN    95708
       58278
S        132
Name: count, dtype: int64

#### <a id='toc1_1_4_5_'></a>[Reactionary delays](#toc0_)

Reactionary delays are delays that are caused by the late arrival of another train, crew, or passengers. 

* REACT_TRAIN is the ID of the train that caused the reactionary delay. Missing values may simply indicate that the incident is primary rather than reactionary.
* REACTIONARY_REASON_CODE holds a code identifying the main category of delay (for example, "RT" is a reactionary delay caused by loading luggage). If there is no code, the delay is primary (i.e. the delay is at the site of the incident). There are too many reactionary codes to one-hot encode. One solution might be to map them to macro-categories of reactionary codes.

A strategy could be to drop REACT_TRAIN, which has a high percentage of NAs and a lot of unique values, and to transform REACTIONARY_REASON_CODE into a binary variable (1 if the delay is reactionary in nature, 0 otherwise).
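A minimal sketch of this strategy on toy data. The column name `IS_REACTIONARY` is hypothetical, and treating whitespace-only strings as "no code" is an assumption suggested by the blank entries seen in REACT_TRAIN.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "REACT_TRAIN": ["872L21MY10", np.nan, "", "512D57MB12"],
    "REACTIONARY_REASON_CODE": ["RT", np.nan, " ", "RT"],
})

# Assumption: NaN and whitespace-only strings both mean "no reactionary code".
code = toy["REACTIONARY_REASON_CODE"].fillna("").astype(str).str.strip()

# 1 if a reactionary code is present, 0 if the delay is primary.
toy["IS_REACTIONARY"] = code.ne("").astype(int)

# Drop the high-cardinality identifier and the original code column.
toy = toy.drop(columns=["REACT_TRAIN", "REACTIONARY_REASON_CODE"])
print(toy)
```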

In [97]:
print("Value Counts")
display(X['REACT_TRAIN'].value_counts(dropna=False))
X['REACTIONARY_REASON_CODE'].value_counts(dropna=False)
print("Missing Values")
display(X[['REACT_TRAIN','REACTIONARY_REASON_CODE']].isna().sum()/len(X))
print("Unique Values")
display(X[['REACT_TRAIN','REACTIONARY_REASON_CODE']].nunique())

Value Counts


REACT_TRAIN
NaN           47790
              26379
872L21MY10       19
512D57MB12       18
879D11MC01       17
              ...  
725Z241K04        1
872L31M804        1
872N39M804        1
872F22MG04        1
875M572P14        1
Name: count, Length: 49540, dtype: int64

Missing Values


REACT_TRAIN                0.310087
REACTIONARY_REASON_CODE    0.302483
dtype: float64

Unique Values


REACT_TRAIN                49539
REACTIONARY_REASON_CODE       19
dtype: int64

#### <a id='toc1_1_4_6_'></a>[Delays caused by incident](#toc0_)

* INCIDENT_RESPONSIBLE_TRAIN is the ID of the train that initially caused the incident (if any).
* RESP_TRAIN is a longer ID related to INCIDENT_RESPONSIBLE_TRAIN (in the sample rows shown earlier, INCIDENT_RESPONSIBLE_TRAIN matches characters 3 to 6 of RESP_TRAIN).
* INCIDENT_EQUIPMENT is internal free-form information.
* INCIDENT_REASON is the Delay Attribution Guide cause code for the incident.
* INCIDENT_DESCRIPTION is a description of the incident.
* INCIDENT_NUMBER is a numeric identifier of the incident.


A strategy might be to drop INCIDENT_RESPONSIBLE_TRAIN and RESP_TRAIN, which are identification codes with many unique and missing values. INCIDENT_NUMBER and INCIDENT_EQUIPMENT should be dropped as well because they have too many unique values to encode. INCIDENT_DESCRIPTION (a description field with 32,758 unique values) can also be dropped. INCIDENT_REASON has quite a lot of unique values too (211), but they might be mapped to macro-categories using the attribution glossary at https://sacuksprodnrdigital0001.blob.core.windows.net/historic-delay-attribution/Reference%20Files/Transpareny%20Page%20Attribution%20Glossary.xlsx.
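The mapping of INCIDENT_REASON to macro-categories could look roughly like this. The `glossary` frame, its column names and the category labels are hypothetical placeholders; the real labels would have to come from the glossary file, whose layout is not assumed here. The code "ZZ" is made up to show what happens to an unmapped value.

```python
import pandas as pd

# Hypothetical extract of a code-to-category glossary; labels are placeholders.
glossary = pd.DataFrame({
    "INCIDENT_REASON": ["JD", "JS", "IS"],
    "MACRO_CATEGORY": ["CATEGORY_A", "CATEGORY_A", "CATEGORY_B"],
})

toy = pd.DataFrame({"INCIDENT_REASON": ["JD", "IS", "ZZ", None]})

# Build a code -> macro-category Series and map it onto the feature.
reason_to_macro = glossary.set_index("INCIDENT_REASON")["MACRO_CATEGORY"]
toy["INCIDENT_MACRO"] = toy["INCIDENT_REASON"].map(reason_to_macro).fillna("UNKNOWN")
print(toy)
```

Codes that are missing or absent from the glossary fall into a single "UNKNOWN" bucket, which keeps the number of categories small enough to one-hot encode.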

In [99]:
incident_cols = ['INCIDENT_RESPONSIBLE_TRAIN', 'RESP_TRAIN', 'INCIDENT_EQUIPMENT', 'INCIDENT_REASON', 'INCIDENT_NUMBER']
print("Value Counts")
display(X[incident_cols].value_counts(dropna=False))
print("Missing Values")
display(X[incident_cols].isna().sum()/len(X))
print("Unique Values")
display(X[incident_cols].nunique())

Value Counts


INCIDENT_RESPONSIBLE_TRAIN  RESP_TRAIN  INCIDENT_EQUIPMENT  INCIDENT_REASON  INCIDENT_NUMBER
NaN                         NaN         71.21               JD               618312.0           2596
                                        070A.21             JD               618289.0           1775
                                        153860              JS               21250.0            1633
                                        181230              IS               258363.0           1393
NaN                         NaN         95.2                IS               961349.0           1170
                                                                                                ... 
4E24                        514E24CH26                      OC               840634.0              1
                            514E24CH18                      AC               44128.0               1
                            514E24CH16                      AJ               59765.0               

Missing Values


INCIDENT_RESPONSIBLE_TRAIN    0.294313
RESP_TRAIN                    0.317426
INCIDENT_EQUIPMENT            0.290213
INCIDENT_REASON               0.000162
INCIDENT_NUMBER               0.000052
dtype: float64

Unique Values


INCIDENT_RESPONSIBLE_TRAIN    11489
RESP_TRAIN                    24830
INCIDENT_EQUIPMENT             6679
INCIDENT_REASON                 211
INCIDENT_NUMBER               38239
dtype: int64

#### <a id='toc1_1_4_7_'></a>[Affected train information](#toc0_)

* 'TRUST_TRAIN_ID_AFFECTED': Train identifier. Train IDs are unique within a railway period but not within a year. The same train in the timetable should have the same 8-digit train ID; the last two digits are the day of the month.
* 'TRAIN_SERVICE_CODE_AFFECTED': Train service code, which identifies a type of train service, at the point where the delay occurred.
* 'SERVICE_GROUP_CODE_AFFECTED': Business code assigned to train operating companies.
* 'OPERATOR_AFFECTED': The operator code of the train affected.
* 'TRAIN_SCHEDULE_TYPE_AFFECTED': Type of schedule.
* 'TRACTION_TYPE_AFFECTED': Type of traction power of the train.
* 'UNIT_CLASS_AFFECTED': Class of the unit operating the train. In the output below it takes values such as 378.0 alongside the EMU traction type, so it appears to identify the rolling stock class rather than a freight unit train.
* 'SECTION_CODE': Appears to be the section between stations covered by the train.

The operator code is always the same (EK - London Overground), therefore OPERATOR_AFFECTED can be dropped.
The section code can be dropped as well, as it has too many unique values and its information overlaps with the departure and arrival stations. TRUST_TRAIN_ID_AFFECTED is an identifier with too many unique values, so it will be dropped too.
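A short sketch of these drops on toy data: constant columns are detected automatically, while the identifier-like columns are listed by hand. The toy values are taken from the value counts shown below.

```python
import pandas as pd

toy = pd.DataFrame({
    "OPERATOR_AFFECTED": ["EK", "EK", "EK"],
    "TRUST_TRAIN_ID_AFFECTED": ["872N27MY10", "872L23MY10", "872N45MO10"],
    "SECTION_CODE": ["72251:72275", "72251:72275", "52226:52045"],
    "TRAIN_SERVICE_CODE_AFFECTED": [22214000.0, 22204000.0, 22214000.0],
})

# Columns with a single distinct value carry no information for a model.
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) <= 1]

# Identifier-like or redundant columns named in the text above.
manual_drops = {"SECTION_CODE", "TRUST_TRAIN_ID_AFFECTED"}

to_drop = sorted(set(constant_cols) | manual_drops)
reduced = toy.drop(columns=to_drop)
print(to_drop)
print(list(reduced.columns))
```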



In [126]:
affected_cols = ['TRUST_TRAIN_ID_AFFECTED', 'TRAIN_SERVICE_CODE_AFFECTED', 'SERVICE_GROUP_CODE_AFFECTED',
                 'OPERATOR_AFFECTED', 'TRAIN_SCHEDULE_TYPE_AFFECTED', 'TRACTION_TYPE_AFFECTED',
                 'UNIT_CLASS_AFFECTED', 'SECTION_CODE']
print("Value Counts")
display(X[affected_cols].value_counts(dropna=False))
print("Missing Values")
display(X[affected_cols].isna().sum()/len(X))
print("Unique Values")
display(X[affected_cols].nunique())

Value Counts


TRUST_TRAIN_ID_AFFECTED  TRAIN_SERVICE_CODE_AFFECTED  SERVICE_GROUP_CODE_AFFECTED  OPERATOR_AFFECTED  TRAIN_SCHEDULE_TYPE_AFFECTED  TRACTION_TYPE_AFFECTED  UNIT_CLASS_AFFECTED  SECTION_CODE
872N27MY10               22214000.0                   EK01                         EK                 LTP                           EMU                     378.0                72251:72275     10
872L23MY10               22204000.0                   EK01                         EK                 LTP                           EMU                     378.0                72251:72275     10
872N45MO10               22214000.0                   EK01                         EK                 LTP                           EMU                     378.0                52226:52045      8
872N49MP10               22214000.0                   EK01                         EK                 LTP                           EMU                     378.0                52226:52045      8
873D70M917               2

Missing Values


TRUST_TRAIN_ID_AFFECTED         0.000000
TRAIN_SERVICE_CODE_AFFECTED     0.000000
SERVICE_GROUP_CODE_AFFECTED     0.000000
OPERATOR_AFFECTED               0.000000
TRAIN_SCHEDULE_TYPE_AFFECTED    0.000000
TRACTION_TYPE_AFFECTED          0.011095
UNIT_CLASS_AFFECTED             0.011258
SECTION_CODE                    0.000052
dtype: float64

Unique Values


TRUST_TRAIN_ID_AFFECTED         75170
TRAIN_SERVICE_CODE_AFFECTED        28
SERVICE_GROUP_CODE_AFFECTED         6
OPERATOR_AFFECTED                   1
TRAIN_SCHEDULE_TYPE_AFFECTED        4
TRACTION_TYPE_AFFECTED              3
UNIT_CLASS_AFFECTED                15
SECTION_CODE                     1633
dtype: int64

#### <a id='toc1_1_4_8_'></a>[Other fields](#toc0_)

* 'NETWORK_RAIL_LOCATION_MANAGER': Affected train manager code
* 'RESPONSIBLE_MANAGER': Responsible organisation of the train causing the delay
* 'ATTRIBUTION_STATUS': All delays go through an acceptance process; only once they are agreed is the linkage to the incident official. Disputed delays normally mean further investigation is ongoing about either the cause of the incident or the delay.
* 'PERFORMANCE_EVENT_CODE': A and M denote delays, where A means the delay was reported automatically and M means it was reported manually.

We keep all the rows for which the attribution status is confirmed.
We can try to reduce the unique values of RESPONSIBLE_MANAGER and NETWORK_RAIL_LOCATION_MANAGER by merging some categories.
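Both steps can be sketched on toy data as follows. Two assumptions are made here: "confirmed" is taken to mean the value "Attribution Agreed" seen in the output below, and categories are merged by lumping infrequent values into an "OTHER" bucket, which is only one possible merging rule. The helper `lump_rare` and the threshold are hypothetical.

```python
import pandas as pd

toy_X = pd.DataFrame({
    "ATTRIBUTION_STATUS": ["Attribution Agreed", "Attribution Disputed",
                           "Attribution Agreed", "Attribution Agreed"],
    "RESPONSIBLE_MANAGER": ["OQHU", "IQHP", "OQHU", "APEC"],
})
toy_y = pd.Series([5.0, 4.0, 3.0, 6.0], name="PFPI_MINUTES")

# Keep only agreed attributions, applying the same mask to features and target.
agreed = toy_X["ATTRIBUTION_STATUS"].eq("Attribution Agreed")
X_agreed, y_agreed = toy_X[agreed].copy(), toy_y[agreed]

def lump_rare(series, min_count, other_label="OTHER"):
    """Replace values seen fewer than min_count times (and NaN) with other_label."""
    counts = series.value_counts()
    keep = counts[counts >= min_count].index
    return series.where(series.isin(keep), other_label)

X_agreed["RESPONSIBLE_MANAGER"] = lump_rare(X_agreed["RESPONSIBLE_MANAGER"], min_count=2)
print(X_agreed)
```

In a modelling pipeline the frequency counts would normally be computed on the training split only and then reused on the test split, so that the test data does not influence which categories are kept.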

In [153]:
other = ['NETWORK_RAIL_LOCATION_MANAGER','RESPONSIBLE_MANAGER','ATTRIBUTION_STATUS','PERFORMANCE_EVENT_CODE']
print("Value Counts")
display(X[other].value_counts(dropna=False))
print("Missing Values")
display(X[other].isna().sum()/len(X))
print("Unique Values")
display(X[other].nunique())

Value Counts


NETWORK_RAIL_LOCATION_MANAGER  RESPONSIBLE_MANAGER  ATTRIBUTION_STATUS    PERFORMANCE_EVENT_CODE
OQHN                           OQHU                 Attribution Agreed    M                         8501
OXCA                           TXCA                 Attribution Agreed    M                         5149
OQHQ                           IQHP                 Attribution Agreed    M                         4830
OQHP                           IQHP                 Attribution Agreed    A                         4748
                                                    Attribution Disputed  A                         4579
                                                                                                    ... 
OQHQ                           APEC                 Attribution Agreed    M                            1
OQFO                           VHFK                 Attribution Agreed    M                            1
OQNE                           MEKA                 Attribution

Missing Values


NETWORK_RAIL_LOCATION_MANAGER    0.000052
RESPONSIBLE_MANAGER              0.000052
ATTRIBUTION_STATUS               0.000052
PERFORMANCE_EVENT_CODE           0.000000
dtype: float64

Unique Values


NETWORK_RAIL_LOCATION_MANAGER     70
RESPONSIBLE_MANAGER              828
ATTRIBUTION_STATUS                 3
PERFORMANCE_EVENT_CODE             2
dtype: int64