<div style="text-align: right"> Part 1: PROJECT PRELIMINARIES </div>

# **TUBE TWIN - THE LONDON UNDERGROUND FORECASTING SYSTEM**

<img src="https://www.ft.com/__origami/service/image/v2/images/raw/https%3A%2F%2Fd1e00ek4ebabms.cloudfront.net%2Fproduction%2F2f90cf07-9f8c-4437-ad9b-818989cb2fd2.jpg?dpr=2&fit=scale-down&quality=medium&source=next&width=700" alt="image info" />

## **PROJECT SUMMARY**

The London Underground (also known as “the Tube”) is a network of train stations that connects the city. Through an open data API, Transport for London (TfL) publishes the number of people entering and exiting every station in fifteen-minute increments. Understanding the behaviour of this transport network presents many of the same challenges we face in understanding water and waste distribution networks. <br>

### OBJECTIVES
The project should cover the following: <br>
- forecasting passenger counts on train lines and at stations throughout the year
- the effects of special events (e.g. football games, holidays, concerts) on the network
- the relationship between passenger counts at different stations
- patterns in Tube usage before and after the COVID-19 pandemic, focusing on aggregation, the popularity of certain lines, and the relationship between counts at different stations
- the effects of new lines on existing lines, for example the Elizabeth line
- long-term trends in passenger loads across the different lines and stations
- identifying the optimal location where a line could be added
- simulating the network to uncover plausible risks to it

### GOALS
Expected outputs of the project are: <br>
- A graph representation of the network
- A passenger count forecast at each station (entry and exit), in 15-minute increments, based on time of day and counts at other stations
- A study of the effect of significant events (holidays, football games, etc.) on traffic
- A dashboard showing key insights to railway technicians and commuters
- A simulation of the network

## **PROJECT DATA FROM TFL API**

Data that will be used in this project is accessed from the TfL API, at https://api-portal.tfl.gov.uk/ <br>

The following data types were identified as potentially being necessary:<br>
- Station data
- Passenger density
- Air quality

Resources for extracting the data through the API:<br>
- the `requests` library in Python
- the `tflunifiedapi` library in Python


### SCRIPT TO EXTRACT DATA FROM TFL API

In [1]:
!pip install tflunifiedapi
!pip install requests

Collecting tflunifiedapi
  Downloading tflunifiedapi-0.2.1.tar.gz (9.3 kB)
Collecting msrest
  Downloading msrest-0.7.1-py3-none-any.whl (85 kB)
Collecting azure-core>=1.24.0
  Downloading azure_core-1.25.1-py3-none-any.whl (178 kB)
Collecting isodate>=0.6.0
  Downloading isodate-0.6.1-py2.py3-none-any.whl (41 kB)
Collecting requests-oauthlib>=0.5.0
  Downloading requests_oauthlib-1.3.1-py2.py3-none-any.whl (23 kB)
Collecting typing-extensions>=4.0.1
  Using cached typing_extensions-4.3.0-py3-none-any.whl (25 kB)
Collecting oauthlib>=3.0.0
  Downloading oauthlib-3.2.1-py3-none-any.whl (151 kB)
Building wheels for collected packages: tflunifiedapi
  Building wheel for tflunifiedapi (setup.py): started
  Building wheel for tflunifiedapi (setup.py): finished with status 'done'
  Created wheel for tflunifiedapi: filename=tflunifiedapi-0.2.1-py3-none-any.whl size=16729 sha256=2082a7c0f68b6eaa85dd06d109bdf91552e908af767c94d454a95fe7f4faefd2
  Stored in directory: c:\users\gti tech\appdata\lo

In [2]:
from tfl.client import Client
from tfl.api_token import ApiToken
import requests
from requests.auth import HTTPBasicAuth
from urllib.parse import urlencode
import logging
import csv

#### *RETRIEVING DATA FROM TfL API FOR LONDON UNDERGROUND*

In [3]:
# sample statement
sample = "https://api.tfl.gov.uk/crowding/{Naptan}/Live?app_key={ApplicationKey 15}"
# example link
url = "https://api.tfl.gov.uk/crowding/940GZZLUBST"


In [4]:
app_id = 'Twin Tube'
app_key = '3d6ecccce7bc4221b76171eb5e1564d0'
api_token = {"app_id": app_id, "app_key": app_key}
headers = {"Accept": "application/json"}
# the token is sent as query-string parameters
r = requests.get(url, params=api_token, headers=headers)
crowding = r.json()
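
The request above can be wrapped in a small helper so that the key does not have to be typed into the notebook. The sketch below is an illustration only: the function names `build_crowding_url` and `fetch_crowding` and the environment variable `TFL_APP_KEY` are assumptions, not part of the TfL API or of the `tflunifiedapi` package. The URL pattern is taken from the sample statement above.

```python
import os

import requests

BASE_URL = "https://api.tfl.gov.uk/crowding"


def build_crowding_url(naptan, live=False):
    """Build the crowding URL for a station NaPTAN id (hypothetical helper)."""
    url = f"{BASE_URL}/{naptan}"
    return f"{url}/Live" if live else url


def fetch_crowding(naptan, live=False, timeout=30):
    """Request crowding data for one station and return the parsed JSON.

    The key is read from the TFL_APP_KEY environment variable
    (an assumed name); if it is not set, no key is sent.
    """
    params = {}
    app_key = os.environ.get("TFL_APP_KEY")
    if app_key:
        params["app_key"] = app_key
    response = requests.get(
        build_crowding_url(naptan, live),
        params=params,
        headers={"Accept": "application/json"},
        timeout=timeout,
    )
    response.raise_for_status()  # raise on HTTP error codes instead of parsing an error body
    return response.json()
```

A call such as `fetch_crowding("940GZZLUBST")` would then request the same example link used above.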


## **DATA WRANGLING**

Data wrangling is the process of removing errors and combining complex data sets to make them more accessible and easier to analyze.

Historical data is available for the following years, for a 'typical autumn day' at 15-minute intervals over a 24-hour period:
- 2017
- 2018
- 2019
- 2020
- 2021

The data for each year is segmented by day type:
- MTT or MTF (Monday to Thursday, or Monday to Friday; MTF appears only in 2017)
- FRI
- SAT
- SUN

The data was prepared for use in the notebook in Excel as follows:
1. The row containing the 15-minute time periods was kept as the header of the dataframe columns, and all rows above it were deleted
2. No other changes were made
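
The same preparation could be done reproducibly in pandas instead of by hand. The sketch below assumes the raw export is CSV text in which the header row is the first line containing the label `0500-0515`. The helper names `find_header_row` and `read_counts` and the sample text are made up for illustration.

```python
import io

import pandas as pd


def find_header_row(lines, first_interval="0500-0515"):
    """Return the index of the first line that contains the first interval label."""
    for i, line in enumerate(lines):
        if first_interval in line:
            return i
    raise ValueError(f"no line contains {first_interval!r}")


def read_counts(text):
    """Drop every line above the header row, then parse the rest as CSV."""
    lines = text.splitlines()
    start = find_header_row(lines)
    return pd.read_csv(io.StringIO("\n".join(lines[start:])))


# toy input with two preamble lines above the real header
raw = """Counts for a typical autumn day
Some explanatory note
Mode,NLC,Station,0500-0515,0515-0530
LU,500,Acton Town,22,27
"""
df = read_counts(raw)
print(df)
```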

### DATA OVERVIEW

In [5]:
import pandas as pd
import numpy as np

In [6]:
def getListOfFiles(dirName):
    '''
        For the given path, get the List of all files in the directory tree
        dirName: directory of folder containing files
        output: a list of file paths for each of the files 
    '''
    # create a list of file and sub directories 
    # names in the given directory 
    listOfFile = os.listdir(dirName)
    allFiles = list()
    # Iterate over all the entries
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        # If entry is a directory then get the list of files in this directory 
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)
                
    return allFiles

In [8]:
import os
# saving file paths to a list variable
dirList = getListOfFiles("Resources/Data/STATION_COUNTS/")

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'Resources/Data/STATION_COUNTS/'

In [44]:
dirList

['Resources/Data/STATION_COUNTS/2017.csv',
 'Resources/Data/STATION_COUNTS/2017_summary.csv',
 'Resources/Data/STATION_COUNTS/2017_yearlabels.csv',
 'Resources/Data/STATION_COUNTS/2018.csv',
 'Resources/Data/STATION_COUNTS/2018_summary.csv',
 'Resources/Data/STATION_COUNTS/2019.csv',
 'Resources/Data/STATION_COUNTS/2019_summary.csv',
 'Resources/Data/STATION_COUNTS/2020.csv',
 'Resources/Data/STATION_COUNTS/2020_summary.csv',
 'Resources/Data/STATION_COUNTS/2021.csv',
 'Resources/Data/STATION_COUNTS/2021_summary.csv',
 'Resources/Data/STATION_COUNTS/sample.csv']
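
For comparison, the recursive listing can also be written with the standard-library `pathlib` module. This is a sketch of an equivalent, not the function used in this notebook; the name `list_files` is hypothetical. It sorts the result explicitly and returns paths with the operating system's own separator.

```python
from pathlib import Path


def list_files(root):
    """Return the sorted paths (as strings) of every file under root, recursively."""
    return sorted(str(p) for p in Path(root).rglob("*") if p.is_file())
```

Calling `list_files("Resources/Data/STATION_COUNTS/")` would play the same role as `getListOfFiles` above.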

In [45]:
# reading the yearly passenger count data into a dataframe variable

df_17 = pd.read_csv("Resources/Data/STATION_COUNTS/2017.csv")
df_18 = pd.read_csv("Resources/Data/STATION_COUNTS/2018.csv")
df_19 = pd.read_csv("Resources/Data/STATION_COUNTS/2019.csv")
df_20 = pd.read_csv("Resources/Data/STATION_COUNTS/2020.csv")
df_21 = pd.read_csv("Resources/Data/STATION_COUNTS/2021.csv")

In [46]:
df_17.head(1)

Unnamed: 0,Mode,NLC,ASC,Station,Coverage,year,day,dir,0500-0515,0515-0530,...,0230-0245,0245-0300,0300-0315,0315-0330,0330-0345,0345-0400,0400-0415,0415-0430,0430-0445,0445-0500
0,LU,500,ACTu,Acton Town,Station entry / exit,2017,MTF,IN,22,27,...,0,0,0,0,0,0,0,0,0,0


In [47]:
df_18.head(1)

Unnamed: 0,Mode,NLC,ASC,Station,Coverage,year,day,dir,0500-0515,0515-0530,...,0230-0245,0245-0300,0300-0315,0315-0330,0330-0345,0345-0400,0400-0415,0415-0430,0430-0445,0445-0500
0,LU,500,ACT,Acton Town,Station entry / exit,2018,MTT,IN,26,32,...,0,0,0,0,0,0,0,0,3,18


In [48]:
df_19.head(1)

Unnamed: 0,Mode,NLC,ASC,Station,Coverage,year,day,dir,0500-0515,0515-0530,...,0230-0245,0245-0300,0300-0315,0315-0330,0330-0345,0345-0400,0400-0415,0415-0430,0430-0445,0445-0500
0,LU,500,ACTu,Acton Town,Station entry / exit,2019,MTT,IN,21,36,...,0,0,0,0,0,0,0,0,0,0


In [49]:
df_20.head(1)

Unnamed: 0,Mode,NLC,ASC,Station,Coverage,year,day,dir,0500-0515,0515-0530,...,0230-0245,0245-0300,0300-0315,0315-0330,0330-0345,0345-0400,0400-0415,0415-0430,0430-0445,0445-0500
0,LU,500,ACTu,Acton Town,Station entry / exit,2020,MTT,IN,9,13,...,0,0,0,0,0,0,0,0,1,15


In [50]:
df_21.head(1)

Unnamed: 0,Mode,NLC,ASC,Station,Coverage,year,day,dir,0500-0515,0515-0530,...,0230-0245,0245-0300,0300-0315,0315-0330,0330-0345,0345-0400,0400-0415,0415-0430,0430-0445,0445-0500
0,LU,500,ACTu,Acton Town,Station entry / exit,2021,MTT,IN,12,30,...,0,0,0,0,0,0,0,0,2,17


### FEATURE DEFINITIONS

1. Mode:<br>
>
modeID|mode|modecode|modedescription
---|---|---|---|
10|LU|u|London Underground
30|LO|o|London Overground
40|EZL|r|Elizabeth Line
50|DLR|d|Docklands Light Railway
60|NR|r|National Rail
70|TRM|t|London Trams



As we are focusing on the London Underground, we will filter the passenger count dataset to keep only the rows for London Underground (`LU`). <br>
The lines under London Underground are: <br>
>
Line Group | Line Name Long | ASC with suffix | Definition
---|---|---|---
LU	|Bakerloo	|BAK	| 
LU	|Central	|CEN	| 
LU	|District	|DIS	| 
LU	|Hammersmith & City	|H&C	| 
LU	|Jubilee	|JUB	| 
LU	|Metropolitan	|MET	| 
LU	|Northern	|NOR	| 
LU	|Piccadilly	|PIC	| 
LU	|Victoria	|VIC	| 
LU	|Waterloo & City	|WAT	| 


2. Direction (dir):
>
Direction_Code |	Direction_Name|	Direction_Comment
---|---|---
NB|	Northbound|	n/a
SB|	Southbound|	n/a
EB|	Eastbound|	n/a
WB|	Westbound|	n/a
IB|	Inbound|	Applies to DLR. Inbound = towards City/Stratford
OB|	Outbound|	Applies to DLR. Outbound = from City/Stratford
UP|	Up|	Applies to National Rail. Up = towards London
DN|	Down|	Applies to National Rail. Down = from London
IR|	Inner Rail|	Applies to Circle line
OR|	Outer Rail|	Applies to Circle line


3. DayTypes (day):
>
Daytype|	Daytype description|	Comment
---|---|---
MTF|	Monday to Friday|	Only in NBT 2017
MTT|	Monday to Thursday|	From NBT 2018 onwards
FRI|	Friday|	From NBT 2018 onwards
SAT|	Saturday|	n/a
SUN|	Sunday|	n/a


In [51]:
# filtering data for only the London Underground

df_17LU = df_17[df_17['Mode']=='LU']
df_18LU = df_18[df_18['Mode']=='LU']
df_19LU = df_19[df_19['Mode']=='LU']
df_20LU = df_20[df_20['Mode']=='LU']
df_21LU = df_21[df_21['Mode']=='LU']
print("No. of rows, columns of 2017.csv is ",df_17LU.shape)
print("No. of rows, columns of 2018.csv is ",df_18LU.shape)
print("No. of rows, columns of 2019.csv is ",df_19LU.shape)
print("No. of rows, columns of 2020.csv is ",df_20LU.shape)
print("No. of rows, columns of 2021.csv is ",df_21LU.shape)

No. of rows, columns of 2017.csv is  (1608, 104)
No. of rows, columns of 2018.csv is  (2144, 104)
No. of rows, columns of 2019.csv is  (2144, 104)
No. of rows, columns of 2020.csv is  (2144, 104)
No. of rows, columns of 2021.csv is  (2160, 104)


### DATA STRUCTURING

The datasets are in wide format, where each time interval is a column, so each dataframe has 104 columns.<br>
We need to convert them to long format by melting the time-interval columns into rows. As a result:
- the number of columns will fall to 10 (the 8 identifier columns plus `intervals` and `counts`)
- the number of rows will be multiplied by the number of melted columns, in this case 96.<br>

To convert to long format, we will apply `pd.melt()` to the wide-format dataframes.
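
To make the change of shape concrete, here is a toy example of `pd.melt()` on a frame with two identifier columns and three interval columns. The frame is illustrative only and is not the project data.

```python
import pandas as pd

# toy wide-format frame: 2 identifier columns + 3 interval columns
wide = pd.DataFrame({
    "Station": ["Acton Town", "Aldgate"],
    "dir": ["IN", "IN"],
    "0500-0515": [22, 11],
    "0515-0530": [27, 15],
    "0530-0545": [30, 18],
})

# every column not listed in id_vars is melted into (intervals, counts) pairs
long_df = pd.melt(
    wide,
    id_vars=["Station", "dir"],
    var_name="intervals",
    value_name="counts",
)

print(wide.shape)     # 2 rows, 5 columns
print(long_df.shape)  # 2 * 3 = 6 rows, 2 + 2 = 4 columns
```

The same arithmetic applies to the real data: 8 identifier columns plus 2 new columns gives 10, and each original row becomes 96 rows.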

In [57]:
def wide_to_long(df):
    '''
        Function that will convert wide-format dataframe to long-format
        Input - dataframe (df)
        Output - dataframe (df_sample_long)
    '''
    # saving a list of columns containing the passenger counts
    # to the list variable time_list
    time_list = list(df.columns)[8:]
    # saving a list of identifier variables to the list variable id_vars
    id_vars = list(df.columns)[:8]
    df_sample_long = pd.melt(df, id_vars = id_vars,  value_vars = time_list, var_name = 'intervals', value_name = 'counts')

    return df_sample_long

In [122]:
# applying wide_to_long() function to the dataframes
df_17LULong = wide_to_long(df_17LU)
df_18LULong = wide_to_long(df_18LU)
df_19LULong = wide_to_long(df_19LU)
df_20LULong = wide_to_long(df_20LU)
df_21LULong = wide_to_long(df_21LU)

print("No. of rows, columns of 2017.csv is ",df_17LULong.shape)
print("No. of rows, columns of 2018.csv is ",df_18LULong.shape)
print("No. of rows, columns of 2019.csv is ",df_19LULong.shape)
print("No. of rows, columns of 2020.csv is ",df_20LULong.shape)
print("No. of rows, columns of 2021.csv is ",df_21LULong.shape)

No. of rows, columns of 2017.csv is  (154368, 10)
No. of rows, columns of 2018.csv is  (205824, 10)
No. of rows, columns of 2019.csv is  (205824, 10)
No. of rows, columns of 2020.csv is  (205824, 10)
No. of rows, columns of 2021.csv is  (207360, 10)


In [84]:
df_17LULong.head(3)

Unnamed: 0,Mode,NLC,ASC,Station,Coverage,year,day,dir,intervals,counts
0,LU,500,ACTu,Acton Town,Station entry / exit,2017,MTF,IN,0500-0515,22
1,LU,502,ALDu,Aldgate,Station entry / exit,2017,MTF,IN,0500-0515,11
2,LU,503,ALEu,Aldgate East,Station entry / exit,2017,MTF,IN,0500-0515,6


In [85]:
df_21LULong.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 207360 entries, 0 to 207359
Data columns (total 10 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   Mode       207360 non-null  object
 1   NLC        207360 non-null  int64 
 2   ASC        207360 non-null  object
 3   Station    207360 non-null  object
 4   Coverage   206592 non-null  object
 5   year       207360 non-null  int64 
 6    day       207360 non-null  object
 7    dir       207360 non-null  object
 8   intervals  207360 non-null  object
 9   counts     207360 non-null  int64 
dtypes: int64(3), object(7)
memory usage: 15.8+ MB


The long-format dataframes now have a total of 10 columns each.

These columns require a change of dtype:<br>
- NLC - to object
- year - to datetime
- day - to datetime
- intervals - to datetime

For now, we will convert only the dtype of the NLC column; the other three are dealt with in the following section.

In [123]:
# changing the dtype of NLC column to object
df_17LULong['NLC'] = df_17LULong['NLC'].astype("object")
df_18LULong['NLC'] = df_18LULong['NLC'].astype("object")
df_19LULong['NLC'] = df_19LULong['NLC'].astype("object")
df_20LULong['NLC'] = df_20LULong['NLC'].astype("object")
df_21LULong['NLC'] = df_21LULong['NLC'].astype("object")

In [124]:
# checking to see that the dtype of column NLC has changed to 'object'

df_17LULong.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154368 entries, 0 to 154367
Data columns (total 10 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   Mode       154368 non-null  object
 1   NLC        154368 non-null  object
 2   ASC        154368 non-null  object
 3   Station    154368 non-null  object
 4   Coverage   154368 non-null  object
 5   year       154368 non-null  int64 
 6    day       154368 non-null  object
 7    dir       154368 non-null  object
 8   intervals  154368 non-null  object
 9   counts     154368 non-null  int64 
dtypes: int64(2), object(8)
memory usage: 11.8+ MB


### DATA ENRICHING


For timeseries analysis, we require a column that has datetime data. We currently have the year, day and time in separate columns, and in formats that are not readily usable in a timeseries analysis.<br>

To convert the `year`, `day` and `intervals` columns to a usable form, the following will be done:
1. Convert `intervals` into a time-compatible format by keeping the end time of each interval, i.e. `0445-0500` --> `0500` --> `05:00`
2. Assign a placeholder date to each day type, in the format `YYYY-mm-dd`:
> - MTT - `year`-01-01
> - MTF - `year`-01-02 (2017 only)
> - FRI - `year`-01-02
> - SAT - `year`-01-03
> - SUN - `year`-01-04
3. Combine the time from step 1 with the date from step 2
4. Create a datetime value from the output of step 3, e.g. `2017-01-03 05:00`
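
The four steps can also be sketched with vectorised pandas string operations. This is an illustrative alternative, not the notebook's implementation: the helper name `add_date_time` and the toy frame are made up, and `MTF` is given the same placeholder date as `FRI`, mirroring the `day_to_datetime` function defined in this notebook. The column name `" day"` keeps the leading space that the dataset's header carries.

```python
import pandas as pd

# placeholder month-day per day type (MTF shares FRI's date, as in day_to_datetime)
DAY_TO_DATE = {
    "MTT": "01-01",
    "MTF": "01-02",
    "FRI": "01-02",
    "SAT": "01-03",
    "SUN": "01-04",
}


def add_date_time(df, day_col=" day"):
    """Return a copy of df with 'time' and 'date_time' columns added."""
    out = df.copy()
    end = out["intervals"].str[-4:]                 # step 1: "0500-0515" -> "0515"
    out["time"] = end.str[:2] + ":" + end.str[2:]   # "0515" -> "05:15"
    date = out["year"].astype(str) + "-" + out[day_col].map(DAY_TO_DATE)  # step 2
    stamp = date + " " + out["time"]                # step 3
    out["date_time"] = pd.to_datetime(stamp, format="%Y-%m-%d %H:%M")     # step 4
    return out


toy = pd.DataFrame({
    "year": [2017, 2018],
    " day": ["MTF", "SAT"],
    "intervals": ["0500-0515", "2345-0000"],
})
result = add_date_time(toy)
print(result[["time", "date_time"]])
```

Note that an interval ending after midnight receives the same calendar date as the morning intervals of that day type, both in this sketch and in the steps as described.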

In [91]:
# using regex to extract the required time from the time intervals
import re

def extract_year(df, column):
    '''
        Despite its name, this function extracts the end time of the
        interval periods from each row, and outputs it as a list of
        strings formatted as 'hr:min'
    '''
    # initialising the list variable that will store the extracted time
    tm = []
    for index, i in df[column].items():
        # raw string so that \d is not treated as an invalid escape sequence
        z = re.match(r"(\d\d\d\d)-(\d\d\d\d)", i)
        if z:
            t = z.group(2)
            t = t[:2]+':'+t[2:]
            tm.append(t)
    return tm

In [125]:
# applying the extract_year() function to the dataframes
df_17LULong['time']=extract_year(df_17LULong,'intervals')
df_18LULong['time']=extract_year(df_18LULong,'intervals')
df_19LULong['time']=extract_year(df_19LULong,'intervals')
df_20LULong['time']=extract_year(df_20LULong,'intervals')
df_21LULong['time']=extract_year(df_21LULong,'intervals')



In [105]:
# function to combine values in year, day and time columns to a date-time compatible format

def day_to_datetime(df):
    '''
        Function will change:
            MTT --> `year`-01-01 00:00
            MTF --> `year`-01-02 00:00
            FRI --> `year`-01-02 00:00
            SAT --> `year`-01-03 00:00
            SUN --> `year`-01-04 00:00
            where   year is value in the column 'year'
                    time is value in the column 'time'
        Already, time is in 24hr system, and therefore 
        no time system conversion is necessary.
        Output will be a string in the form 20xx-01-0x xx:xx
    '''
    yr = []
    day = []
    time = []
    yr_day_time =[]
    for value1 in df['year']:
        yr.append(value1)
    for value2 in df[' day']:
        if value2 == 'MTT':
            day.append('-01-01')
        elif value2 == 'MTF':
            day.append('-01-02')
        elif value2 == 'FRI':
            day.append('-01-02')
        elif value2 == 'SAT':
            day.append('-01-03')
        elif value2 == 'SUN':
            day.append('-01-04')
    for value3 in df['time']:
        time.append(value3)

    for i in range(len(yr)):
        yr_day_time.append(str(yr[i])+day[i]+' '+time[i])
    return yr_day_time

In [126]:
# applying function day_to_datetime to the dataframes,
# creating a new column 'date_time'
df_17LULong['date_time'] = day_to_datetime(df_17LULong)
df_18LULong['date_time'] = day_to_datetime(df_18LULong)
df_19LULong['date_time'] = day_to_datetime(df_19LULong)
df_20LULong['date_time'] = day_to_datetime(df_20LULong)
df_21LULong['date_time'] = day_to_datetime(df_21LULong)

# converting values in 'date_time' column into datetime dtype
df_17LULong['date_time'] = pd.to_datetime(df_17LULong['date_time'])
df_18LULong['date_time'] = pd.to_datetime(df_18LULong['date_time'])
df_19LULong['date_time'] = pd.to_datetime(df_19LULong['date_time'])
df_20LULong['date_time'] = pd.to_datetime(df_20LULong['date_time'])
df_21LULong['date_time'] = pd.to_datetime(df_21LULong['date_time'])

In [114]:
df_17LULong.shape

(154368, 12)

In [115]:
df_17LULong.head(3)

Unnamed: 0,Mode,NLC,ASC,Station,Coverage,year,day,dir,intervals,counts,time,date_time
0,LU,500,ACTu,Acton Town,Station entry / exit,2017,MTF,IN,0500-0515,22,05:15,2017-01-02 05:15:00
1,LU,502,ALDu,Aldgate,Station entry / exit,2017,MTF,IN,0500-0515,11,05:15,2017-01-02 05:15:00
2,LU,503,ALEu,Aldgate East,Station entry / exit,2017,MTF,IN,0500-0515,6,05:15,2017-01-02 05:15:00


In [116]:
df_17LULong.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 154368 entries, 0 to 154367
Data columns (total 12 columns):
 #   Column     Non-Null Count   Dtype         
---  ------     --------------   -----         
 0   Mode       154368 non-null  object        
 1   NLC        154368 non-null  int64         
 2   ASC        154368 non-null  object        
 3   Station    154368 non-null  object        
 4   Coverage   154368 non-null  object        
 5   year       154368 non-null  int64         
 6    day       154368 non-null  object        
 7    dir       154368 non-null  object        
 8   intervals  154368 non-null  object        
 9   counts     154368 non-null  int64         
 10  time       154368 non-null  object        
 11  date_time  154368 non-null  datetime64[ns]
dtypes: datetime64[ns](1), int64(3), object(8)
memory usage: 14.1+ MB


### CONSOLIDATION OF DATAFRAMES INTO 1 DATAFRAME

In [127]:
# appending the rows of all five dataframes into one dataframe

frames = [df_17LULong, df_18LULong, df_19LULong, df_20LULong, df_21LULong]
df_pass = pd.concat(frames)

In [128]:
df_pass.shape

(979200, 12)
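
One detail worth knowing: by default `pd.concat` keeps each frame's original index, so the combined frame contains repeated index labels. The toy frames below are illustrative and show the difference that `ignore_index=True` makes.

```python
import pandas as pd

# two toy yearly frames, each with its own 0-based index
a = pd.DataFrame({"year": [2017, 2017], "counts": [22, 11]})
b = pd.DataFrame({"year": [2018, 2018], "counts": [26, 9]})

kept = pd.concat([a, b])                      # index labels repeat
fresh = pd.concat([a, b], ignore_index=True)  # index is renumbered from 0

print(list(kept.index))
print(list(fresh.index))
```

Either form gives the same number of rows; only the index labels differ.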

### **STORING PREPARED DATA**

The prepared dataframe will be exported to a Parquet file, ready for subsequent processes. <br>
Parquet stores data in a columnar format. I am using .parquet over .csv because it has the following advantages:
- Well suited to storing large datasets
- Higher read throughput, since only the columns that are needed have to be read
- Parquet is a binary format that stores each column's data type and supports encoding and compression

In [131]:
df_pass.to_parquet('working/data/2017-2021_data.parquet') 
df_pass.to_csv('working/data/2017-2021_data.csv') 

![Parquet vs CSV](Resources/References/parquet_vs_csv.PNG)

As can be seen in this image, the CSV file for the output is 100 MB in size, while the Parquet file for the same output is 5 MB. This is a 95% saving in storage space.<br>

A smaller file also makes the dataset faster to transfer and to load.
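
The size comparison can be reproduced in code rather than read off a screenshot. The sketch below is a minimal illustration: the helper names `percent_saving` and `compare_sizes` are made up, and writing Parquet assumes that an engine such as `pyarrow` or `fastparquet` is installed.

```python
import os

import pandas as pd


def percent_saving(csv_bytes, parquet_bytes):
    """Storage saved by the Parquet file relative to the CSV file, in percent."""
    return 100 * (csv_bytes - parquet_bytes) / csv_bytes


def compare_sizes(df, stem):
    """Write df as '<stem>.csv' and '<stem>.parquet' and return both sizes in bytes."""
    csv_path = f"{stem}.csv"
    parquet_path = f"{stem}.parquet"
    df.to_csv(csv_path)
    df.to_parquet(parquet_path)  # needs a Parquet engine to be installed
    return os.path.getsize(csv_path), os.path.getsize(parquet_path)


# the figures quoted above: 100 MB of CSV against 5 MB of Parquet
print(percent_saving(100, 5))
```

For the project data, `compare_sizes(df_pass, "working/data/2017-2021_data")` would write the same two files as the cell above and return their sizes.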