In [2]:
# import libs
import pandas as pd

# load data
raw_dataset = pd.read_csv("../data/1_raw_dataset.csv", low_memory=False)

# Dataset engineering

To view the final variables used in the dataset, [click here](#final-variables)

## Preface

The processing described here takes place after the construction of our rudimentary dataset.

This dataset is the merger of the following sources:

* TTC Delays and Routes 2023 - [Kaggle 2023] (2025-10-15)
* Hourly climate data (station 6158359) - [Government of Canada] (2025-10-15)  

<!-- Links to cleanup the md code -->
[Kaggle 2023]: https://www.kaggle.com/datasets/karmansinghbains/ttc-delays-and-routes-2023
[Government of Canada]: https://climate-change.canada.ca/climate-data/#/hourly-climate-data

To each entry in the TTC delay source, we associated the hourly weather data. 
The association is based on the local time (UTC−4) of the delay, rounded to the nearest hour (floored when the minutes are <= 30, ceiled when they are > 30).
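
The rounding rule above can be sketched as follows. This is only an illustration under stated assumptions, not the code that built the raw dataset: the function name `nearest_weather_hour`, the `timestamp` column and the sample values are hypothetical. The rule is written out explicitly instead of calling `dt.round`, because pandas, as far as we know, resolves exact half-hours to the nearest even hour, which would not match "floored <= 30min".

```python
import pandas as pd


def nearest_weather_hour(timestamps: pd.Series) -> pd.Series:
    """Map each delay timestamp to the hour of its matching weather record.

    Minutes 0-30 are floored to the current hour; anything past the
    half-hour is ceiled to the next hour.
    """
    floored = timestamps.dt.floor("h")
    past_half = (timestamps - floored) > pd.Timedelta(minutes=30)
    return floored + pd.to_timedelta(past_half.astype(int), unit="h")


# Hypothetical delay timestamps (local time), for illustration only.
delays = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            ["2023-01-01 02:30", "2023-01-01 02:34", "2023-01-01 23:43"]
        )
    }
)
delays["weather_hour"] = nearest_weather_hour(delays["timestamp"])
print(delays)
```

Working on full timestamps rather than on the hour alone lets a delay reported late in the evening roll over to the first weather record of the next day.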

## Objective

Our final objective is to predict the delay (to the minute) on a bus line at a given time, based on the local weather conditions.

Our goal here is to remove all the variables that are not of use to reach this objective.

## Initial variables description

The table just below details all the different variables present in our dataset.

| Name | Type | Origin | Description |
|------|------|--------|-------------|
| Date | Date | [Kaggle 2023] | The date when the delay incident occurred (format: 'd-MMM-yy', UTC−4) |
| Route | Nominal categorical | [Kaggle 2023] | The bus line number |
| Time | Time (HH:MM) | [Kaggle 2023] | The time when the delay incident was reported in the 24h format (UTC-4) |
| Day | Nominal categorical | [Kaggle 2023] | The day of the week when the delay incident occurred |
| Location | Nominal categorical | [Kaggle 2023] | The location or station where the delay incident occurred | 
| Incident | Nominal categorical | [Kaggle 2023] | The type of incident causing the delay |
| **Min Delay** | **Discrete numerical** | [Kaggle 2023] | **The duration of the delay in minutes** |
| Min Gap | Discrete numerical | [Kaggle 2023] | The time gap in minutes between buses |
| Direction | Nominal categorical | [Kaggle 2023] | The direction of travel |
| Vehicle | Nominal categorical | [Kaggle 2023] | The identifier of the bus involved in the delay incident |
| x | Continuous numerical | [Government of Canada] | Longitude of the weather station |
| y | Continuous numerical | [Government of Canada] | Latitude of the weather station |
| STATION_NAME | Text | [Government of Canada] | Name of the weather station |
| CLIMATE_IDENTIFIER | Text | [Government of Canada] | Identifier of the weather station |
| ID | Text | [Government of Canada] | Identifier of the weather measure |
| LOCAL_DATE | Date | [Government of Canada] | Date of the weather measure (format: 'YYYY-MM-DD HH:MM:SS': UTC-4) |
| PROVINCE_CODE | Nominal categorical | [Government of Canada] | Code of the province in which the measure was made |
| LOCAL_YEAR | Discrete numerical | [Government of Canada] | Local year during which the measure was made (UTC-4) |
| LOCAL_MONTH | Nominal categorical | [Government of Canada] | Local month during which the measure was made (note: is encoded from 1-12 as floats, UTC-4) |
| LOCAL_DAY | Nominal categorical | [Government of Canada] | Local day of the month during which the measure was made (note: is encoded from 1-31 as floats, UTC-4) |
| LOCAL_HOUR | Nominal categorical | [Government of Canada] | Local hour of the day during which the measure was made (note: encoded from 0-23 as floats & the minutes are always at 00, UTC-4) |
| UTC_DATE | Date | [Government of Canada] | Date of the weather measure (format: 'YYYY-MM-DDTHH:MM:SS': UTC+0) |
| UTC_YEAR | Discrete numerical | [Government of Canada] | UTC year during which the measure was made (UTC+0) |
| UTC_MONTH | Nominal categorical | [Government of Canada] | UTC month during which the measure was made (note: is encoded from 1-12 as floats, UTC+0) |
| UTC_DAY | Nominal categorical | [Government of Canada] | UTC day of the month during which the measure was made (note: is encoded from 1-31 as floats, UTC+0) |
| UTC_HOUR | Nominal categorical | [Government of Canada] | UTC hour of the day during which the measure was made (note: encoded from 0-23 as floats & the minutes are always at 00, UTC+0) |
| TEMP | Continuous numerical | [Government of Canada] | Temperature in degrees Celsius |
| TEMP_FLAG | Nominal categorical | [Government of Canada] | Flag for the measure of the temperature ([FLAG TABLE])  | 
| DEW_POINT_TEMP | Continuous numerical | [Government of Canada] | Dew point temperature in degrees Celsius |
| DEW_POINT_TEMP_FLAG | Nominal categorical | [Government of Canada] | Flag for the measure of the dew point temperature ([FLAG TABLE]) |
| HUMIDEX | Continuous numerical | [Government of Canada] | Index indicating how hot or humid the weather feels to the average person |
| HUMIDEX_FLAG | Nominal categorical | [Government of Canada] | Flag for the measure of the humidex ([FLAG TABLE]) |
| PRECIP_AMOUNT | Continuous numerical | [Government of Canada] | Measurement of precipitation expressed as a vertical depth of water in millimeters (mm) |
| PRECIP_AMOUNT_FLAG | Nominal categorical | [Government of Canada] | Flag for the measure of the precipitation ([FLAG TABLE]) |
| RELATIVE_HUMIDITY | Continuous numerical | [Government of Canada] | Relative humidity in percent (%) |
| RELATIVE_HUMIDITY_FLAG | Nominal categorical | [Government of Canada] | Flag for the measure of the relative humidity ([FLAG TABLE]) |
| STATION_PRESSURE | Continuous numerical | [Government of Canada] | The atmospheric pressure in kilopascals (kPa) at the station elevation |
| STATION_PRESSURE_FLAG | Nominal categorical | [Government of Canada] | Flag for the measure of the atmospheric pressure ([FLAG TABLE]) |
| VISIBILITY | Continuous numerical | [Government of Canada] | The distance in kilometers (km) at which objects of suitable size can be seen and identified |
| VISIBILITY_FLAG | Nominal categorical | [Government of Canada] | Flag for the measure of the visibility ([FLAG TABLE]) |
| WEATHER_ENG_DESC | Nominal categorical | [Government of Canada] | Observation of atmospheric phenomena, including the occurrence of weather and obstructions to vision, in English (note: no data means "Clear") |
| WEATHER_FRE_DESC | Nominal categorical | [Government of Canada] | Observation of atmospheric phenomena, including the occurrence of weather and obstructions to vision, in French |
| WINDCHILL | Continuous numerical | [Government of Canada] | Index indicating how cold the weather feels to the average person |
| WINDCHILL_FLAG | Nominal categorical | [Government of Canada] | Flag for the measure of the windchill ([FLAG TABLE]) |
| WIND_DIRECTION | Continuous numerical | [Government of Canada] | The geographic direction from which the wind blows expressed in tens of degrees |
| WIND_DIRECTION_FLAG | Nominal categorical | [Government of Canada] | Flag for the measure of the wind direction ([FLAG TABLE]) |
| WIND_SPEED | Continuous numerical | [Government of Canada] | Measure of the wind speed in kilometers per hour (km/h). Usually observed at 10 metres above the ground |
| WIND_SPEED_FLAG | Nominal categorical | [Government of Canada] | Flag for the measure of the wind speed ([FLAG TABLE]) |
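
Two notes from the table lend themselves to a short sketch: calendar fields are whole numbers stored as floats, and a missing `WEATHER_ENG_DESC` means "Clear". The snippet below shows one possible way to handle both. It is not a step performed in this notebook, and the `sample` frame with its values is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows mimicking the encodings noted in the table.
sample = pd.DataFrame(
    {
        "LOCAL_MONTH": [1.0, 1.0, 12.0],
        "LOCAL_DAY": [1.0, 2.0, 31.0],
        "WEATHER_ENG_DESC": [np.nan, "Rain", np.nan],
    }
)

# According to the table note, a missing description means "Clear".
sample["WEATHER_ENG_DESC"] = sample["WEATHER_ENG_DESC"].fillna("Clear")

# The nullable "Int64" dtype stores whole numbers while still
# tolerating missing values, unlike the plain "int64" dtype.
sample = sample.astype({"LOCAL_MONTH": "Int64", "LOCAL_DAY": "Int64"})

print(sample)
print(sample.dtypes)
```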


<!-- Links to cleanup the md code -->
[FLAG TABLE]: https://climate.weather.gc.ca/doc/Technical_Documentation.pdf#page=24&zoom=100,92,517

## Removing variables

At first glance, we can see that some explanatory variables cannot be used for the analysis (not usable in our final application, metadata, duplicates, etc.). 

The table below describes what variables are removed from the dataset.

| Name | Reason | Action | Explanation | 
|------|--------|--------|-------------|
| Date | Duplicate | Remove | The date is already split into all its basic components in our dataset |
| Location | Usecase | Temporary removal | During the prediction, we will not have access to the incident location. However, we could use this data later on to compute the delay per station |
| Min Gap | Duplicate | Remove | This variable is another representation of our target variable |
| Direction | Usecase | Remove | During the prediction, we will not have access to the direction of the buses |
| Vehicle | Usecase/Unusable | Remove | During the prediction, we will not have access to the current buses circulating. Furthermore, the buses may have changed today, rendering this variable unusable |
| x | Metadata | Remove | The longitude of the weather station is irrelevant to our predictions |
| y | Metadata | Remove | The latitude of the weather station is irrelevant to our predictions |
| STATION_NAME | Metadata | Remove | The name of the weather station is irrelevant to our predictions |
| CLIMATE_IDENTIFIER | Metadata | Remove | The ID of the weather station is irrelevant to our predictions |
| ID | Metadata | Remove | The ID of the weather measurement is irrelevant to our predictions |
| LOCAL_DATE | Duplicate | Remove | The date is already split into all its basic components in our dataset |
| PROVINCE_CODE | Metadata | Remove | The province code is irrelevant to our predictions |
| LOCAL_YEAR | Unusable | Remove | The year of the incident will not repeat itself. It is irrelevant for our predictions |
| LOCAL_HOUR | Duplicate | Remove | The "Time" variable describes the hour and minute of the incident more precisely (note: this variable actually represents the hour of the weather measurement, not the time of the incident) |
| UTC_* | Duplicate | Remove | All the UTC_* variables are duplicates of the local time variables, and our incident data is based on the local hour (UTC-4) |
| WEATHER_FRE_DESC | Duplicate | Remove | Duplicate of the English description (WEATHER_ENG_DESC) |
| *_FLAG | Metadata | Remove | The conditions surrounding the weather measurements are not relevant to our predictions. However, these variables are useful to evaluate the quality of the measurements |
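
Since the `*_FLAG` columns are useful for judging measurement quality, one might want to look at them once before they are dropped. The sketch below counts the share of flagged rows per measurement. The `raw_sample` frame and the flag codes in it are placeholders, not values taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical extract; the flag codes used here are placeholders.
raw_sample = pd.DataFrame(
    {
        "TEMP": [3.7, 3.5, np.nan],
        "TEMP_FLAG": [np.nan, np.nan, "M"],
        "WIND_SPEED": [17.0, 17.0, 21.0],
        "WIND_SPEED_FLAG": [np.nan, "E", np.nan],
    }
)

# Every column whose name ends with the "_FLAG" suffix.
flag_cols = [col for col in raw_sample.columns if col.endswith("_FLAG")]

# Share of rows carrying a flag, per measurement.
flagged_share = raw_sample[flag_cols].notna().mean()
print(flagged_share)
```

A high share for a given measurement would suggest checking the flag meanings in the [FLAG TABLE] before relying on that variable.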

We now remove these variables.

In [3]:
var_to_remove = [
    "Date",
    "Location",
    "Min Gap",
    "Direction",
    "Vehicle",
    "x",
    "y",
    "STATION_NAME",
    "CLIMATE_IDENTIFIER",
    "ID",
    "LOCAL_DATE",
    "PROVINCE_CODE",
    "LOCAL_YEAR",
    "LOCAL_HOUR",
    "UTC_DATE", "UTC_YEAR", "UTC_MONTH", "UTC_DAY",
    "WEATHER_FRE_DESC",
    "TEMP_FLAG", "DEW_POINT_TEMP_FLAG", "HUMIDEX_FLAG", "PRECIP_AMOUNT_FLAG", "RELATIVE_HUMIDITY_FLAG", "STATION_PRESSURE_FLAG", 
    "VISIBILITY_FLAG", "WINDCHILL_FLAG", "WIND_DIRECTION_FLAG", "WIND_SPEED_FLAG"
]

dataset = raw_dataset.drop(columns=var_to_remove)

dataset.head()

Unnamed: 0,Route,Time,Day,Incident,Min Delay,LOCAL_MONTH,LOCAL_DAY,TEMP,DEW_POINT_TEMP,HUMIDEX,PRECIP_AMOUNT,RELATIVE_HUMIDITY,STATION_PRESSURE,VISIBILITY,WEATHER_ENG_DESC,WINDCHILL,WIND_DIRECTION,WIND_SPEED
0,91,02:30,Sunday,Diversion,81,1.0,1.0,3.7,1.7,,0.0,87.0,100.27,16.1,,,28.0,17.0
1,69,02:34,Sunday,Security,22,1.0,1.0,3.5,1.5,,0.0,87.0,100.37,16.1,,,24.0,17.0
2,35,03:06,Sunday,Cleaning - Unsanitary,30,1.0,1.0,3.5,1.5,,0.0,87.0,100.37,16.1,,,24.0,17.0
3,900,03:14,Sunday,Security,17,1.0,1.0,3.5,1.5,,0.0,87.0,100.37,16.1,,,24.0,17.0
4,85,03:43,Sunday,Security,1,1.0,1.0,4.4,1.1,,0.0,79.0,100.37,16.1,,,29.0,21.0
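
The `UTC_*` and `*_FLAG` rows of the removal table are wildcards, so the matching columns could also be selected by prefix and suffix instead of being listed by hand. The sketch below illustrates that alternative on a hypothetical header-only frame (`raw_sample`); it is not the approach used in the cell above.

```python
import pandas as pd

# Hypothetical header-only frame with a few of the raw column names.
raw_sample = pd.DataFrame(
    columns=[
        "Route", "Min Delay", "UTC_DATE", "UTC_HOUR",
        "TEMP", "TEMP_FLAG", "WIND_SPEED_FLAG",
    ]
)

# Columns matching the wildcard rows of the removal table.
pattern_cols = [
    col for col in raw_sample.columns
    if col.startswith("UTC_") or col.endswith("_FLAG")
]

trimmed = raw_sample.drop(columns=pattern_cols)
print(pattern_cols)
print(list(trimmed.columns))
```

Selecting by pattern means that no matching column is overlooked, and it avoids the `KeyError` that `drop` raises when a hand-written name is absent from the frame.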


## Renaming variables

Since we merged two datasets, the variable names are not uniform. We rename the TTC variables to match the upper-case style of the weather variables and make them more readable.

| Old name | New name |
|----------|----------|
| Route | ROUTE |
| Time | LOCAL_TIME |
| Day | WEEK_DAY |
| Incident | INCIDENT |
| Min Delay | DELAY |

In [4]:
dataset = dataset.rename(columns={
    "Route": "ROUTE",
    "Time": "LOCAL_TIME",
    "Day": "WEEK_DAY",
    "Incident": "INCIDENT",
    "Min Delay": "DELAY"
})

dataset.head()

Unnamed: 0,ROUTE,LOCAL_TIME,WEEK_DAY,INCIDENT,DELAY,LOCAL_MONTH,LOCAL_DAY,TEMP,DEW_POINT_TEMP,HUMIDEX,PRECIP_AMOUNT,RELATIVE_HUMIDITY,STATION_PRESSURE,VISIBILITY,WEATHER_ENG_DESC,WINDCHILL,WIND_DIRECTION,WIND_SPEED
0,91,02:30,Sunday,Diversion,81,1.0,1.0,3.7,1.7,,0.0,87.0,100.27,16.1,,,28.0,17.0
1,69,02:34,Sunday,Security,22,1.0,1.0,3.5,1.5,,0.0,87.0,100.37,16.1,,,24.0,17.0
2,35,03:06,Sunday,Cleaning - Unsanitary,30,1.0,1.0,3.5,1.5,,0.0,87.0,100.37,16.1,,,24.0,17.0
3,900,03:14,Sunday,Security,17,1.0,1.0,3.5,1.5,,0.0,87.0,100.37,16.1,,,24.0,17.0
4,85,03:43,Sunday,Security,1,1.0,1.0,4.4,1.1,,0.0,79.0,100.37,16.1,,,29.0,21.0
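
By default, `DataFrame.rename` silently ignores mapping keys that match no column, so a typo in an old name would go unnoticed. As a hedged sketch, passing `errors="raise"` makes such a mismatch fail loudly. The `sample` frame and the misspelled key below are hypothetical.

```python
import pandas as pd

# Hypothetical frame with some of the original TTC column names.
sample = pd.DataFrame({"Route": [91], "Time": ["02:30"], "Min Delay": [81]})

mapping = {"Route": "ROUTE", "Time": "LOCAL_TIME", "Min Delay": "DELAY"}

# errors="raise" makes pandas fail if a key is absent from the columns.
renamed = sample.rename(columns=mapping, errors="raise")
print(list(renamed.columns))

# A misspelled old name is rejected instead of being skipped.
try:
    sample.rename(columns={"Rout": "ROUTE"}, errors="raise")
except KeyError as exc:
    print(f"Rename rejected: {exc}")
```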


## Final variables

<a id="final-variables"></a>

Here are the variables currently present in our dataset.

**WARNING**: this dataset is not final. During the EDA, we will continue to modify it before feeding it to a predictive model.

| Name | Type | Origin | Description |
|------|------|--------|-------------|
| ROUTE | Nominal categorical | [Kaggle 2023] | The bus line number |
| LOCAL_TIME | Time (HH:MM) | [Kaggle 2023] | The time when the delay incident was reported in the 24h format (UTC-4) |
| WEEK_DAY | Nominal categorical | [Kaggle 2023] | The day of the week when the delay incident occurred |
| INCIDENT | Nominal categorical | [Kaggle 2023] | The type of incident causing the delay |
| **DELAY** | **Discrete numerical** | [Kaggle 2023] | **The duration of the delay in minutes** |
| LOCAL_MONTH | Nominal categorical | [Government of Canada] | Local month during which the measure was made (note: is encoded from 1-12 as floats, UTC-4) |
| LOCAL_DAY | Nominal categorical | [Government of Canada] | Local day of the month during which the measure was made (note: is encoded from 1-31 as floats, UTC-4) |
| TEMP | Continuous numerical | [Government of Canada] | Temperature in degrees Celsius |
| DEW_POINT_TEMP | Continuous numerical | [Government of Canada] | Dew point temperature in degrees Celsius (temperature at which the water vapor contained in the air condenses on contact with a cold surface) |
| HUMIDEX | Continuous numerical | [Government of Canada] | Index indicating how hot or humid the weather feels to the average person |
| PRECIP_AMOUNT | Continuous numerical | [Government of Canada] | Measurement of precipitation expressed as a vertical depth of water in millimeters (mm) |
| RELATIVE_HUMIDITY | Continuous numerical | [Government of Canada] | Relative humidity in percent (%) |
| STATION_PRESSURE | Continuous numerical | [Government of Canada] | The atmospheric pressure in kilopascals (kPa) at the station elevation |
| VISIBILITY | Continuous numerical | [Government of Canada] | The distance in kilometers (km) at which objects of suitable size can be seen and identified |
| WEATHER_ENG_DESC | Nominal categorical | [Government of Canada] | Observation of atmospheric phenomena, including the occurrence of weather and obstructions to vision, in English |
| WINDCHILL | Continuous numerical | [Government of Canada] | Index indicating how cold the weather feels to the average person |
| WIND_DIRECTION | Continuous numerical | [Government of Canada] | The geographic direction from which the wind blows expressed in tens of degrees |
| WIND_SPEED | Continuous numerical | [Government of Canada] | Measure of the wind speed in kilometers per hour (km/h). Usually observed at 10 metres above the ground |

<!-- Links to cleanup the md code -->
[Kaggle 2023]: https://www.kaggle.com/datasets/karmansinghbains/ttc-delays-and-routes-2023
[Government of Canada]: https://climate-change.canada.ca/climate-data/#/hourly-climate-data

We now save this new dataset under the name `"2_dataset.csv"`.

In [5]:
dataset.to_csv("../data/2_dataset.csv", index=False)
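
A quick round-trip check can confirm that the saved file reloads with the expected columns. The sketch below uses a temporary directory and a hypothetical two-row `sample` frame rather than the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical two-row frame standing in for the real dataset.
sample = pd.DataFrame(
    {"ROUTE": [91, 69], "LOCAL_TIME": ["02:30", "02:34"], "DELAY": [81, 22]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "2_dataset.csv"
    sample.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Same columns in the same order, and the same number of rows.
print(list(reloaded.columns) == list(sample.columns))
print(reloaded.shape == sample.shape)
```

Passing `index=False` keeps pandas from writing the row index as an extra column, which would otherwise come back as `Unnamed: 0` when the file is read again.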