# Traffic trends for London's public transport network

![TFL MAP](https://tfl.gov.uk/cdn/static/cms/images/london-rail-and-tube-services-map.gif)

## Introduction
### 1.1 Background
London is a global hub for large-scale events, hosting concerts, sports tournaments, cultural festivals, and more. These events significantly impact the city's transportation system, particularly the London Underground (TfL network).  

Transport for London (TfL) typically schedules services based on historical data and pre-planned timetables. However, this approach may not respond quickly to fluctuations in passenger demand caused by events.  

Our study aims to analyze whether events can serve as a reliable predictor of station busyness (passenger flow) and to assess how well TfL’s current scheduling aligns with demand surges.  
### 1.2 Research question  
What key factors (e.g., event characteristics, external conditions) are most relevant in predicting station busyness during events?  
H1: The type, scale, and ticketing status of an event have a significant impact on station busyness, but the extent of this impact varies with external factors such as weather, time, and competing events.

Can we develop a tool that links event type, size, and scale with station busyness to improve network management?  
H2: A predictive model that links event characteristics with station busyness will provide data-driven insights for network management, thereby improving transport planning efficiency.

Can integrating event data into a predictive model improve TfL’s ability to optimize scheduling and resource allocation?  
H3: Integrating event data (such as event type, scale, and timing) into the predictive model will enhance TfL’s ability to schedule services accurately during high-demand periods, reduce over-scheduling and resource waste, and optimize train frequencies and station operations.  
### 1.3 Data sources  
1.3.1 TfL data:  
Data source: https://tfl.gov.uk/corporate/publications-and-reports/network-demand-data  

Data description: This dataset provides historical station-level ridership data for the London Underground and other transport networks.

1.3.2 Event data collection:  
Primary source: Ticketmaster API (provides structured event data)


In [13]:
# import necessary libraries
import numpy as np
import pandas as pd  

## Data preprocessing and Data Extraction

In [23]:
import os
import pandas as pd

file_paths = [
    "data/StationFootfall_2019.csv",
    "data/StationFootfall_2020.csv",
    "data/StationFootfall_2021.csv",
    "data/StationFootfall_2022.csv",
    "data/StationFootfall_2023_2024 .csv"
]

tfl_list = []

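for file_path in file_paths:
    df = pd.read_csv(file_path, header=0)
    print(df.columns)
    # The 2019 to 2022 files spell the column "DayOFWeek"; align it with the
    # 2023_2024 file so that concatenation does not create two separate columns
    df = df.rename(columns={"DayOFWeek": "DayOfWeek"})
    tfl_list.append(df)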

Index(['TravelDate', 'DayOFWeek', 'Station', 'EntryTapCount', 'ExitTapCount'], dtype='object')
Index(['TravelDate', 'DayOFWeek', 'Station', 'EntryTapCount', 'ExitTapCount'], dtype='object')
Index(['TravelDate', 'DayOFWeek', 'Station', 'EntryTapCount', 'ExitTapCount'], dtype='object')
Index(['TravelDate', 'DayOFWeek', 'Station', 'EntryTapCount', 'ExitTapCount'], dtype='object')
Index(['TravelDate', 'DayOfWeek', 'Station', 'EntryTapCount', 'ExitTapCount'], dtype='object')
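
The printed headers differ in case between files (`DayOFWeek` in the first four, `DayOfWeek` in the last). The sketch below uses small made-up frames to show one way to harmonise such headers before concatenating; `CANONICAL_COLUMNS` and `normalise_columns` are hypothetical names introduced for illustration, and case-insensitive matching is an assumption about what is acceptable for these files.

```python
import pandas as pd

# Canonical spellings taken from the printed headers; matching ignores case (an assumption).
CANONICAL_COLUMNS = ["TravelDate", "DayOfWeek", "Station", "EntryTapCount", "ExitTapCount"]


def normalise_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Rename columns to their canonical spelling, ignoring case and surrounding spaces."""
    lookup = {name.lower(): name for name in CANONICAL_COLUMNS}
    renamed = {col: lookup.get(col.strip().lower(), col) for col in df.columns}
    return df.rename(columns=renamed)


# Toy frames standing in for two yearly files; all values are invented.
frame_a = pd.DataFrame({"TravelDate": [1], "DayOFWeek": ["Tuesday"], "Station": ["Bank"],
                        "EntryTapCount": [10], "ExitTapCount": [12]})
frame_b = pd.DataFrame({"TravelDate": [2], "DayOfWeek": ["Sunday"], "Station": ["Bank"],
                        "EntryTapCount": [7], "ExitTapCount": [9]})

# Without normalisation, concat keeps both spellings and fills the gaps with NaN.
naive = pd.concat([frame_a, frame_b], ignore_index=True)
print(naive.columns.tolist())

# With normalisation, the frames share one schema.
combined = pd.concat([normalise_columns(frame_a), normalise_columns(frame_b)], ignore_index=True)
print(combined.columns.tolist())
```

Running the check on the real files before `pd.concat` would make any further header mismatch visible early.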


The Ticketmaster API cannot return past events, so we use the setlist.fm API (https://www.setlist.fm/) to retrieve historical event information. The resulting data file contains the date, artist, venue, and city of each event.

In [23]:
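import os
import requests
import pandas as pd
import time

# Read the setlist.fm API key from the environment instead of hardcoding it.
# The variable name SETLISTFM_API_KEY is our own choice, not required by the API.
API_KEY = os.getenv("SETLISTFM_API_KEY")
if not API_KEY:
    raise RuntimeError("Set the SETLISTFM_API_KEY environment variable before running this cell")
HEADERS = {"x-api-key": API_KEY, "Accept": "application/json"}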

# API URL
BASE_URL = "https://api.setlist.fm/rest/1.0/search/setlists"

# city
CITY = "London"

# years
start_year = 2019
end_year = 2024

# all data of events
all_events = []

for year in range(start_year, end_year + 1):
    print(f"Get {year} years of performance data")

    page = 1  # setlist.fm returns results in pages, starting at page 1
    while True:
        # Query parameters for this city, year, and page
        params = {"cityName": CITY, "year": year, "p": page}

        # Send the API request; the timeout avoids hanging on a stalled connection
        response = requests.get(BASE_URL, headers=HEADERS, params=params, timeout=30)

        if response.status_code == 200:
            data = response.json()

            # Check whether this page contains any setlists
            if "setlist" in data and len(data["setlist"]) > 0:
                for event in data["setlist"]:
                    event_date = event.get("eventDate", "Unknown")
                    venue = event.get("venue", {}).get("name", "Unknown")
                    artist = event.get("artist", {}).get("name", "Unknown")
                    city = event.get("venue", {}).get("city", {}).get("name", "Unknown")

                    all_events.append([year, event_date, artist, venue, city])

                print(f"read {page}page of {year}year")
                page += 1  # continue to next page
            else:
                print(f"Crawling of {year} data is complete, total {page-1}.")
                break  # leave the page loop

        else:
            print(f"API request failed: {response.status_code}")
            break

        time.sleep(1)  # pause between requests to stay within the API rate limit

# Save to CSV file
if all_events:
    df = pd.DataFrame(all_events, columns=["Year", "Event Date", "Artist", "Venue", "City"])
    df.to_csv("setlistfm_api_events_2019_2024.csv", index=False, encoding="utf-8-sig")

    print("Data has been saved to the setlistfm_api_events_2019_2024.csv")
else:
    print("No data to save")

Get 2019 years of performance data
read 1page of 2019year
read 2page of 2019year
read 3page of 2019year
read 4page of 2019year
read 5page of 2019year
read 6page of 2019year
read 7page of 2019year
read 8page of 2019year
read 9page of 2019year
read 10page of 2019year
read 11page of 2019year
read 12page of 2019year
read 13page of 2019year
read 14page of 2019year
read 15page of 2019year
read 16page of 2019year
read 17page of 2019year
read 18page of 2019year
read 19page of 2019year
read 20page of 2019year
read 21page of 2019year
read 22page of 2019year
read 23page of 2019year
read 24page of 2019year
read 25page of 2019year
read 26page of 2019year
read 27page of 2019year
read 28page of 2019year
read 29page of 2019year
read 30page of 2019year
read 31page of 2019year
read 32page of 2019year
read 33page of 2019year
read 34page of 2019year
read 35page of 2019year
read 36page of 2019year
read 37page of 2019year
read 38page of 2019year
read 39page of 2019year
read 40page of 2019year
read 41page of
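
Once the events are saved, the date strings need to be parsed before they can be joined to the footfall data. The sketch below assumes that the `eventDate` field follows the day-first `dd-MM-yyyy` format used by the setlist.fm API; the rows are invented and only mirror the column layout of the saved CSV.

```python
import pandas as pd

# Toy rows in the same layout as the saved CSV; all values are made up for illustration.
events = pd.DataFrame(
    {
        "Year": [2019, 2019, 2019],
        "Event Date": ["31-12-2019", "05-07-2019", "Unknown"],
        "Artist": ["Artist A", "Artist B", "Artist C"],
        "Venue": ["Venue X", "Venue Y", "Venue X"],
        "City": ["London", "London", "London"],
    }
)

# Parse day-first date strings; values that do not fit (such as "Unknown") become NaT.
events["event_date"] = pd.to_datetime(events["Event Date"], format="%d-%m-%Y", errors="coerce")

# Drop rows without a usable date, then count events per venue and day.
events_per_day = (
    events.dropna(subset=["event_date"])
    .groupby(["event_date", "Venue"])
    .size()
    .reset_index(name="n_events")
)
print(events_per_day)
```

Using `errors="coerce"` keeps the parsing step from failing on placeholder values, at the cost of silently dropping them later, so it is worth counting the `NaT` rows on the real data.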

In [22]:
import os

file_path = "setlistfm_api_events_2019_2024.csv"  

if os.path.exists(file_path):
    os.remove(file_path)
    print(f"Delete {file_path}")
else:
    print(f"Not find {file_path}")

Delete setlistfm_api_events_2019_2024.csv


### 2.2 Web scraping  
The following code block retrieves the locations of stations on the Tube, DLR, Overground, and Elizabeth line through the TfL API. It requires a .env file containing APP_ID and APP_KEY in the same directory as this script; APP_KEY is the API key for the TfL API. See https://api-portal.tfl.gov.uk/ for more information.
Station locations are given as latitude and longitude in the WGS84 coordinate system. The locations and other station information are saved to a CSV file named stations_location.csv in the data folder for further use.
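
A missing or empty credential otherwise only shows up later as a failed request. The sketch below shows one way to fail early; `require_env` and `tfl_auth_params` are hypothetical helpers introduced here, and the dummy values exist only so that the example runs on its own.

```python
import os


def require_env(name: str) -> str:
    """Return an environment variable's value, or raise a clear error if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; add it to the .env file")
    return value


def tfl_auth_params() -> dict[str, str]:
    """Build the query parameters that carry the TfL credentials."""
    return {"app_id": require_env("APP_ID"), "app_key": require_env("APP_KEY")}


# Dummy values so the sketch runs standalone; real values would come from the .env file.
os.environ.setdefault("APP_ID", "example-app-id")
os.environ.setdefault("APP_KEY", "example-app-key")
print(sorted(tfl_auth_params()))
```

In the notebook, `load_dotenv()` would be called before these helpers so that values from the .env file are present in the environment.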

In [10]:
# Example values only: real credentials belong in the .env file, not in the notebook
APP_ID = "crowding"
APP_KEY = "<your-tfl-app-key>"

In [11]:
import os
import requests
import pandas as pd
import csv
from dotenv import load_dotenv

load_dotenv()
APP_ID = os.getenv("APP_ID")
APP_KEY = os.getenv("APP_KEY")

def get_line_ids() -> dict[str, tuple[str, str]]:
    """
    Gets the routing information for all lines and filters out the tube, dlr, overground and elizabeth-line modes.
    Returns a dictionary with line id as key and (line name, mode name) as value.
    """
    url_line = "https://api.tfl.gov.uk/Line/Route"
    params = {
        "app_id": APP_ID,
        "app_key": APP_KEY,
    }
    response = requests.get(url_line, params=params)
    response.raise_for_status()  
    data = response.json()

    mode_interested = ["tube", "dlr", "overground", "elizabeth-line"]
    line_ids = {}
    for line in data:
        if line.get("modeName") in mode_interested:
            line_ids[line["id"]] = (line["name"], line["modeName"])
    return line_ids

def get_station_stop_points(line_id: str) -> list[dict]:
    """
    Get all stations on a line by line id.
    :param line_id: line id
    :return: list of stations
    """
    url_stop_points = f"https://api.tfl.gov.uk/Line/{line_id}/StopPoints"
    params = {
        "app_id": APP_ID,
        "app_key": APP_KEY,
    }
    response = requests.get(url_stop_points, params=params)
    response.raise_for_status()
    return response.json()

def get_station_stop_points_df() -> pd.DataFrame:
    """
    Get the information of all the stations under the line of interest and construct a DataFrame.
    """
    lines = get_line_ids()
    stop_points_all = []
    
    for line_id, (line_name, mode_name) in lines.items():
        try:
            stop_points = get_station_stop_points(line_id)
            for sp in stop_points:
                stop_points_all.append({
                    "line_id": line_id,
                    "line_name": line_name,
                    "mode_name": mode_name,
                    "station_id": sp.get("id", ""),
                    "station_name": sp.get("commonName", ""),
                    "lat": sp.get("lat", ""),
                    "lon": sp.get("lon", ""),
                    "station_modes": sp.get("modes", []),
                })
        except requests.RequestException as e:
            print(f"Failed to get station information for line {line_id}: {e}")
    
    return pd.DataFrame(stop_points_all)

def save_station_locations(file_path: str = "data/stations_location.csv") -> None:
    """
    Get station location information and save it to a CSV file
    """
    os.makedirs(os.path.dirname(file_path), exist_ok=True)
    
    df = get_station_stop_points_df()
    df.to_csv(file_path, index=False, encoding="utf-8")
    print(f"Metro station location information has been saved to {file_path}")

if __name__ == "__main__":
    try:
        save_station_locations()
    except Exception as e:
        print("Error during programme execution:", e)


Metro station location information has been saved to data/stations_location.csv
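
Because the saved coordinates are WGS84 latitude and longitude, a great-circle distance is one simple way to relate a station to an event venue. The sketch below is an assumption about how the locations might be used later, not a step taken in this notebook; `haversine_km` is a hypothetical helper and the coordinates in the usage lines are illustrative.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; treats the Earth as a sphere


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Illustrative coordinates only, not taken from stations_location.csv.
station = (51.50, -0.10)
venue = (51.51, -0.12)
print(round(haversine_km(*station, *venue), 2))
```

Over the distances involved within London, the spherical approximation is usually adequate; a projected coordinate system would be an alternative if higher precision is needed.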


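The following code block matches the station names in the TfL footfall data with the station names in the location data. Names are compared after trimming whitespace and converting to lower case, and the inner join keeps only stations found in both datasets. Matched data are saved in tfl_stations_location.csv for further use.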

In [26]:
tfl_df = pd.concat(tfl_list, ignore_index=True)

# Pre-process station names for matching: strip leading/trailing whitespace and convert to lower case
tfl_df["Station_clean"] = tfl_df["Station"].astype(str).str.strip().str.lower()

location_file = "data/stations_location.csv"
location_df = pd.read_csv(location_file)

# The station names in the location data are also preprocessed
location_df["station_name_clean"] = location_df["station_name"].astype(str).str.strip().str.lower()

# Merge based on preprocessed columns
merged_df = pd.merge(
    tfl_df,
    location_df,
    left_on="Station_clean",
    right_on="station_name_clean",
    how="inner"
)

output_file = "tfl_stations_location.csv"
merged_df.to_csv(output_file, index=False, encoding="utf-8")
print(f"The matched data has been saved to the {output_file}")

The matched data has been saved to the tfl_stations_location.csv
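
An exact match on lower-cased names can miss stations when the two sources name them differently, and stations served by several lines appear more than once in the location data. The sketch below uses invented rows to show one way to check for both; the suffix list in `SUFFIX_PATTERN` is an assumption about how stop-point names are written and should be confirmed against the real file, and `clean_station_name` is a hypothetical helper.

```python
import re

import pandas as pd

# Suffixes assumed to appear in stop-point names; extend after inspecting the real data.
SUFFIX_PATTERN = re.compile(r"\s+(underground station|dlr station|rail station|station)$")


def clean_station_name(name: str) -> str:
    """Lower-case, trim, and drop a trailing station-type suffix."""
    cleaned = str(name).strip().lower()
    return SUFFIX_PATTERN.sub("", cleaned)


# Toy data: names, counts, and coordinates are made up for illustration.
footfall = pd.DataFrame(
    {"Station": ["Bank", "Oxford Circus", "Example Halt"], "EntryTapCount": [100, 200, 5]}
)
locations = pd.DataFrame(
    {
        "line_id": ["central", "northern", "central"],
        "station_name": [
            "Bank Underground Station",
            "Bank Underground Station",
            "Oxford Circus Underground Station",
        ],
        "lat": [51.51, 51.51, 51.52],
        "lon": [-0.09, -0.09, -0.14],
    }
)

footfall["key"] = footfall["Station"].map(clean_station_name)
locations["key"] = locations["station_name"].map(clean_station_name)

# Keep one location row per station so multi-line stations do not duplicate footfall rows.
unique_locations = locations.drop_duplicates(subset="key")[["key", "lat", "lon"]]

# A left join with indicator=True shows which footfall stations found no location.
matched = footfall.merge(unique_locations, on="key", how="left", indicator=True)
unmatched = matched.loc[matched["_merge"] == "left_only", "Station"].tolist()
print(unmatched)
```

Listing the unmatched names first makes it easier to decide whether further cleaning rules or a manual lookup table are needed before switching back to an inner join.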
