# Traffic trends for London's public transport network

![TFL MAP](https://tfl.gov.uk/cdn/static/cms/images/london-rail-and-tube-services-map.gif)

## Introduction

### 1.1 Background

London's complex public transportation system serves millions of passengers daily and includes a wide range of bus routes and underground stations. However, accessibility remains a critical concern, especially for individuals with mobility impairments. To evaluate the inclusiveness of this network, we propose using an agent-based model to assess the accessibility of public transport in London, with a specific focus on **step-free access**.

As an initial step, we merge the bus stop location dataset with the tube station location dataset, appending two columns that list, for each station, the names of all bus stops and tube stations within a 0.5-mile radius. Following this spatial augmentation, we incorporate footfall data to weight each station by passenger volume. We then use Monte Carlo simulations to model popular travel routes across the network.
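The weighting and sampling step can be sketched as follows. This is a minimal illustration, not the project's final model: it assumes a station's weight is its share of total entry plus exit taps, and that origins and destinations are drawn independently in proportion to that weight. The function names `station_weights` and `sample_od_pairs` and the toy rows are hypothetical; only the column names come from the footfall files.

```python
import random
from collections import Counter

def station_weights(footfall_rows):
    """Sum entry and exit taps per station and normalise to probabilities."""
    totals = Counter()
    for row in footfall_rows:
        totals[row["Station"]] += row["EntryTapCount"] + row["ExitTapCount"]
    grand_total = sum(totals.values())
    return {station: count / grand_total for station, count in totals.items()}

def sample_od_pairs(weights, n_trips, seed=0):
    """Draw origin-destination pairs with probability proportional to footfall.

    Assumes at least two stations have a positive weight.
    """
    rng = random.Random(seed)
    stations = list(weights)
    probs = [weights[s] for s in stations]
    pairs = Counter()
    drawn = 0
    while drawn < n_trips:
        origin, destination = rng.choices(stations, weights=probs, k=2)
        if origin != destination:  # a trip needs two distinct stations
            pairs[(origin, destination)] += 1
            drawn += 1
    return pairs

# Toy rows using the footfall column names; the values are made up.
rows = [
    {"Station": "A", "EntryTapCount": 900, "ExitTapCount": 800},
    {"Station": "B", "EntryTapCount": 200, "ExitTapCount": 100},
    {"Station": "C", "EntryTapCount": 50, "ExitTapCount": 50},
]
weights = station_weights(rows)
od_counts = sample_od_pairs(weights, n_trips=1000)
print(od_counts.most_common(3))
```

Independent draws ignore distance and time of day, so a gravity-style or time-sliced model may be a better fit once real demand patterns are examined.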

Under the constraint of considering **only step-free stations**, we preliminarily evaluate whether a given origin-destination (OD) pair has a feasible route. Routes are labeled as "1" if a step-free path exists, and the number of interchanges is recorded; otherwise, they are labeled as "0." For all accessible OD pairs (label = 1), we apply shortest path algorithms to compute minimal distance or travel time, allowing us to identify routes that, while accessible, are inefficient compared to the average.
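The labelling rule above can be illustrated on a toy network. The sketch below assumes the network is stored as an undirected graph whose edges carry a travel time in minutes and a line name, and that step-free routing simply means ignoring every station outside the step-free set. The helper names, the station letters and the edge values are all hypothetical.

```python
import heapq

def build_graph(edges):
    """Build an undirected graph: graph[a][b] = (minutes, line)."""
    graph = {}
    for a, b, minutes, line in edges:
        graph.setdefault(a, {})[b] = (minutes, line)
        graph.setdefault(b, {})[a] = (minutes, line)
    return graph

def shortest_path(graph, origin, destination, allowed):
    """Dijkstra restricted to stations in `allowed`; returns (cost, path) or None."""
    if origin not in allowed or destination not in allowed:
        return None
    queue = [(0, origin, [origin])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == destination:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, (minutes, _line) in graph.get(node, {}).items():
            if neighbour in allowed and neighbour not in visited:
                heapq.heappush(queue, (cost + minutes, neighbour, path + [neighbour]))
    return None

def count_interchanges(graph, path):
    """Count how many times the line changes along a path."""
    lines = [graph[a][b][1] for a, b in zip(path, path[1:])]
    return sum(1 for prev, nxt in zip(lines, lines[1:]) if prev != nxt)

def label_od_pair(graph, step_free, origin, destination):
    """Return label 1 with cost and interchanges if a step-free route exists, else label 0."""
    result = shortest_path(graph, origin, destination, step_free)
    if result is None:
        return {"label": 0, "cost": None, "interchanges": None}
    cost, path = result
    return {"label": 1, "cost": cost, "interchanges": count_interchanges(graph, path)}

# Toy network: station B is not step-free.
edges = [
    ("A", "B", 1, "red"), ("B", "C", 1, "red"),
    ("A", "D", 2, "blue"), ("D", "C", 2, "green"),
    ("B", "E", 1, "red"),
]
graph = build_graph(edges)
step_free = {"A", "C", "D", "E"}

accessible = label_od_pair(graph, step_free, "A", "C")
blocked = label_od_pair(graph, step_free, "A", "E")
print(accessible, blocked)
```

Comparing the step-free cost with the cost of `shortest_path(graph, origin, destination, set(graph))` gives one possible measure of how inefficient an accessible route is.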

Our final analysis targets two categories of concern: (1) **inaccessible OD pairs** (label = 0), and (2) **inefficient but accessible routes** (label = 1 with high cost). We trace these routes to identify critical interruption points and evaluate their frequency and crowd density. These stations are then prioritized for potential infrastructure upgrades, such as new bus stop installations or transfer improvements.
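One possible way to rank interruption points is to tally, for every problematic OD pair, the non-step-free stations on its unconstrained shortest path, weighted by the number of simulated trips. The sketch below assumes each record already holds that path and a trip count; the field names `unconstrained_path` and `trips` and the function `rank_blocking_stations` are hypothetical.

```python
from collections import Counter

def rank_blocking_stations(od_records, step_free):
    """Score each non-step-free station by the demand of the routes it interrupts."""
    scores = Counter()
    for record in od_records:
        for station in record["unconstrained_path"]:
            if station not in step_free:
                scores[station] += record["trips"]
    return scores.most_common()

# Toy records; paths and trip counts are made up.
od_records = [
    {"unconstrained_path": ["A", "B", "C"], "trips": 120},
    {"unconstrained_path": ["A", "B", "E"], "trips": 30},
    {"unconstrained_path": ["D", "F", "C"], "trips": 50},
]
step_free = {"A", "C", "D", "E"}

ranking = rank_blocking_stations(od_records, step_free)
print(ranking)
```

Stations at the top of such a ranking would be the first candidates for closer inspection, with crowd density considered as a second criterion.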

### 1.2 Research Questions

- How accessible is London’s public transport system for mobility-impaired passengers, considering only step-free routes?
- Can we identify high-demand routes that are either inaccessible or significantly inefficient for step-free travel?
- What are the key transfer stations or segments that contribute to poor accessibility, and how can infrastructure investment be prioritized accordingly?

### 1.3 Data Sources

**1.3.1 Footfall Data**  
- Source:  
  https://app.powerbi.com/view?r=eyJrIjoiMjZjMmQwYTktZjYxNS00MTIwLTg0ZjAtNWIwNGE0ODMzZGJhIiwidCI6IjFmYmQ2NWJmLTVkZWYtNGVlYS1hNjkyLWEwODljMjU1MzQ2YiIsImMiOjh9
  or https://crowding.data.tfl.gov.uk/Network%20Demand/StationFootfall_2023_2024%20.csv  
- Description: Historical station-level passenger flow data from the London Underground (TfL)

**1.3.2 Station Location Data**  
- Bus Stop Locations: https://data.london.gov.uk/dataset/tfl-bus-stop-locations-and-routes 
- Tube Station Locations: https://foi.tfl.gov.uk/FOI-2209-2122/Station%20locations.csv  

**1.3.3 Accessibility Data**  
- Step-free Access Information: https://content.tfl.gov.uk/step-free-tube-guide-map.pdf  
- Note: Additional data sources will be explored to supplement this guide and ensure comprehensive coverage of accessibility infrastructure.


In [13]:
# import necessary libraries
import numpy as np
import pandas as pd  

## Tube Data Preprocessing and Extraction

In [23]:
import os
import pandas as pd

file_paths = [
    "data/StationFootfall_2019.csv",
    "data/StationFootfall_2020.csv",
    "data/StationFootfall_2021.csv",
    "data/StationFootfall_2022.csv",
    "data/StationFootfall_2023_2024 .csv"
]

tfl_list = []

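for file_path in file_paths:
    df = pd.read_csv(file_path, header=0)
    print(df.columns)
    # The 2019-2022 files spell the column 'DayOFWeek'; harmonise it so that
    # concatenating the yearly files yields a single 'DayOfWeek' column.
    df = df.rename(columns={"DayOFWeek": "DayOfWeek"})
    tfl_list.append(df)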

Index(['TravelDate', 'DayOFWeek', 'Station', 'EntryTapCount', 'ExitTapCount'], dtype='object')
Index(['TravelDate', 'DayOFWeek', 'Station', 'EntryTapCount', 'ExitTapCount'], dtype='object')
Index(['TravelDate', 'DayOFWeek', 'Station', 'EntryTapCount', 'ExitTapCount'], dtype='object')
Index(['TravelDate', 'DayOFWeek', 'Station', 'EntryTapCount', 'ExitTapCount'], dtype='object')
Index(['TravelDate', 'DayOfWeek', 'Station', 'EntryTapCount', 'ExitTapCount'], dtype='object')


## Bus Data Acquisition via TfL API

In [None]:
import requests
import pandas as pd

all_stops = []
page = 1

while True:
    url = f'https://api.tfl.gov.uk/StopPoint/Mode/bus?page={page}'
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
    data = response.json()

    if 'stopPoints' not in data or not data['stopPoints']:
        break

    for stop in data['stopPoints']:
        all_stops.append({
            'id': stop.get('id'),
            'naptanId': stop.get('naptanId'),
            'commonName': stop.get('commonName'),
            'lat': stop.get('lat'),
            'lon': stop.get('lon'),
            'modes': ', '.join(stop.get('modes', [])),
            'lines': ', '.join([l['name'] for l in stop.get('lines', [])])
        })

    print(f'Page {page} fetched — {len(all_stops)} bus stops in total.')
    page += 1

df = pd.DataFrame(all_stops)
display(df.head())
df.to_csv('latest_bus_stops.csv', index = False)

### Bus Data Preprocessing

In [None]:
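# 'lines' was built with ', '.join(...), so a stop without lines holds an empty string rather than NaN.
missing_count = df['lines'].fillna('').str.strip().eq('').sum()
print(f'Total missing lines: {missing_count}')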

In [None]:
# Flag stops served by buses only (1) versus stops shared with other modes (0).
df['is_pure_bus'] = (df['modes'].str.strip() == 'bus').astype(int)

In [None]:
# Normalise the 'lines' field: trim whitespace, drop duplicates and sort the line names.
def clean_lines(text):
    if pd.isna(text):
        return ''
    lines = [l.strip() for l in text.split(',') if l.strip()]
    return ', '.join(sorted(set(lines)))

# Remove leading dots and surrounding whitespace from stop names.
df['commonName'] = df['commonName'].str.lstrip('.').str.strip()
df['lines'] = df['lines'].apply(clean_lines)

In [None]:
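# Merge rows that describe the same stop (same name, coordinates and bus-only flag),
# combining their line lists into one sorted, de-duplicated string.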
def merge_lines(series):
    all_lines = []
    for line_list in series:
        all_lines.extend([l.strip() for l in line_list.split(',') if l.strip()])
    return ', '.join(sorted(set(all_lines)))

grouped_df = df.groupby(['commonName', 'lat', 'lon','is_pure_bus']).agg({
    'id': 'first',
    'lines': merge_lines
    }).reset_index()

display(grouped_df.head())
# Save into the data folder, where the later merge step reads this file from.
grouped_df.to_csv('data/processed_bus_stops.csv', index=False)

### 2.2 Web Scraping
The following code block retrieves station locations for the Tube, DLR, Overground and Elizabeth line from the TfL API. It requires a .env file in the same directory as this script containing APP_ID and APP_KEY, the credentials for the TfL API. See https://api-portal.tfl.gov.uk/ for more information.
Station locations are given as latitude and longitude in the WGS84 coordinate system. The locations and other station information are saved to stations_location.csv in the data folder for further use.

In [10]:
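# Credentials are read from the .env file in the next cell; do not hard-code the key in the notebook.
# Expected .env entries (placeholder values):
#   APP_ID=<your TfL app id>
#   APP_KEY=<your TfL app key>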

In [11]:
import os
import requests
import pandas as pd
import csv
from dotenv import load_dotenv

load_dotenv()
APP_ID = os.getenv("APP_ID")
APP_KEY = os.getenv("APP_KEY")

def get_line_ids() -> dict[str, tuple[str, str]]:
    """
    Gets the routing information for all lines and keeps only the tube, dlr, overground and elizabeth-line modes.
    Returns a dictionary with line id as key and (line name, mode name) as value.
    """
    url_line = "https://api.tfl.gov.uk/Line/Route"
    params = {
        "app_id": APP_ID,
        "app_key": APP_KEY,
    }
    response = requests.get(url_line, params=params)
    response.raise_for_status()  
    data = response.json()

    mode_interested = ["tube", "dlr", "overground", "elizabeth-line"]
    line_ids = {}
    for line in data:
        if line.get("modeName") in mode_interested:
            line_ids[line["id"]] = (line["name"], line["modeName"])
    return line_ids

def get_station_stop_points(line_id: str) -> list[dict]:
    """
    Get all stations on a line by line id.
    :param line_id: line id
    :return: list of stations
    """
    url_stop_points = f"https://api.tfl.gov.uk/Line/{line_id}/StopPoints"
    params = {
        "app_id": APP_ID,
        "app_key": APP_KEY,
    }
    response = requests.get(url_stop_points, params=params)
    response.raise_for_status()
    return response.json()

def get_station_stop_points_df() -> pd.DataFrame:
    """
    Get the information of all the stations under the line of interest and construct a DataFrame.
    """
    lines = get_line_ids()
    stop_points_all = []
    
    for line_id, (line_name, mode_name) in lines.items():
        try:
            stop_points = get_station_stop_points(line_id)
            for sp in stop_points:
                stop_points_all.append({
                    "line_id": line_id,
                    "line_name": line_name,
                    "mode_name": mode_name,
                    "station_id": sp.get("id", ""),
                    "station_name": sp.get("commonName", ""),
                    "lat": sp.get("lat", ""),
                    "lon": sp.get("lon", ""),
                    "station_modes": sp.get("modes", []),
                })
        except requests.RequestException as e:
            print(f"Failed to get station information for line {line_id}: {e}")
    
    return pd.DataFrame(stop_points_all)

def save_station_locations(file_path: str = "data/stations_location.csv") -> None:
    """
    Get station location information and save it to a CSV file
    """
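    # os.makedirs('') raises an error, so only create the folder when the path has one.
    directory = os.path.dirname(file_path)
    if directory:
        os.makedirs(directory, exist_ok=True)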
    
    df = get_station_stop_points_df()
    df.to_csv(file_path, index=False, encoding="utf-8")
    print(f"Metro station location information has been saved to {file_path}")

if __name__ == "__main__":
    try:
        save_station_locations()
    except Exception as e:
        print("Error during programme execution:", e)


Metro station location information has been saved to data/stations_location.csv


The following code block matches the station names in the TfL footfall data with the station names in the location data, using an exact match on trimmed, lower-cased names. Matched rows are saved in tfl_stations_location.csv for further use.

In [26]:
tfl_df = pd.concat(tfl_list, ignore_index=True)

# Pre-process station names for matching: trim surrounding whitespace and convert to lower case
tfl_df["Station_clean"] = tfl_df["Station"].astype(str).str.strip().str.lower()

location_file = "data/stations_location.csv"
location_df = pd.read_csv(location_file)

# The station names in the location data are also preprocessed
location_df["station_name_clean"] = location_df["station_name"].astype(str).str.strip().str.lower()

# Merge based on preprocessed columns
merged_df = pd.merge(
    tfl_df,
    location_df,
    left_on="Station_clean",
    right_on="station_name_clean",
    how="inner"
)

output_file = "tfl_stations_location.csv"
merged_df.to_csv(output_file, index=False, encoding="utf-8")
print(f"The matched data has been saved to the {output_file}")

The matched data has been saved to the tfl_stations_location.csv
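An exact match on lower-cased names only links rows whose names are identical in both sources. TfL StopPoint names often end in a suffix such as "Underground Station", so if the footfall files use shorter names, many stations would silently drop out of the inner merge. The sketch below shows one way to normalise names and list what remains unmatched; the suffix list, the `normalise_station_name` helper and the sample names are assumptions for illustration and should be checked against the real data.

```python
import re

# Assumed suffix pattern: an optional mode word followed by "station" at the end of the name.
_SUFFIX_PATTERN = re.compile(
    r"\s+(?:(?:underground|dlr|rail|overground)\s+)?station$", flags=re.IGNORECASE
)

def normalise_station_name(name):
    """Lower-case, collapse whitespace, drop a trailing '... Station' suffix, unify '&'."""
    cleaned = re.sub(r"\s+", " ", str(name)).strip().lower()
    cleaned = _SUFFIX_PATTERN.sub("", cleaned)
    return cleaned.replace("&", "and")

# Illustrative names only; they are not taken from the datasets.
footfall_names = ["Oxford Circus", "Bank and Monument", "  Baker Street "]
api_names = [
    "Oxford Circus Underground Station",
    "Bank Underground Station",
    "Baker Street Underground Station",
]

footfall_keys = {normalise_station_name(n) for n in footfall_names}
api_keys = {normalise_station_name(n) for n in api_names}

# Names still unmatched after normalisation need a manual lookup table.
unmatched = sorted(footfall_keys - api_keys)
print(unmatched)
```

In the notebook the helper could be applied with `tfl_df["Station"].map(normalise_station_name)`. Because stations_location.csv holds one row per line and station, calling `drop_duplicates` on the normalised name before merging would also avoid repeating each footfall row once per line.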


In [3]:
import pandas as pd

# Load Dataset
bus_stops_df = pd.read_csv('data/processed_bus_stops.csv')
stations_location_df = pd.read_csv('data/stations_location.csv')

# Data preprocessing: trim and lower-case the bus stop and station names
bus_stops_df["commonName_clean"] = bus_stops_df["commonName"].str.strip().str.lower()
stations_location_df["station_name_clean"] = stations_location_df["station_name"].str.strip().str.lower()

# Merge on the cleaned names (only stops whose name exactly equals a station name are kept)
merged_df = pd.merge(
    bus_stops_df,
    stations_location_df,
    left_on="commonName_clean",
    right_on="station_name_clean",
    how="inner"
)

# Save the merged data to a CSV file
merged_df.to_csv('data/merged_stations.csv', index=False, encoding='utf-8')

print("Data have been merged and saved to merged_stations.csv")


Data have been merged and saved to merged_stations.csv
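The merge above links a bus stop to a station only when the two share the same name, whereas the introduction describes collecting all stops within a 0.5-mile radius. A distance-based match could be sketched as below, using the haversine formula on the WGS84 latitude and longitude columns. The function names, the Earth radius constant and the toy coordinates are assumptions for illustration; the dictionary keys mirror the `commonName`, `lat` and `lon` columns.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def stops_within_radius(station, stops, radius_miles=0.5):
    """Return the names of all stops within `radius_miles` of the station."""
    return [
        stop["commonName"]
        for stop in stops
        if haversine_miles(station["lat"], station["lon"], stop["lat"], stop["lon"]) <= radius_miles
    ]

# Toy coordinates near central London; the names are made up.
station = {"station_name": "Example Station", "lat": 51.500, "lon": -0.100}
stops = [
    {"commonName": "Near Stop", "lat": 51.503, "lon": -0.100},
    {"commonName": "Far Stop", "lat": 51.520, "lon": -0.100},
]

nearby = stops_within_radius(station, stops)
print(nearby)
```

This brute-force loop compares every station with every stop, which is fine for a small sample; for the full bus stop dataset a spatial index would likely be needed.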
