# Traffic trends for London's public transport network

## Introduction
### 1.1 Background
London is a global hub for large-scale events, hosting concerts, sports tournaments, cultural festivals, and more. These events significantly impact the city's transportation system, particularly the London Underground (TfL network).  

Transport for London (TfL) typically schedules services based on historical data and pre-planned timetables. However, this approach may not respond quickly to fluctuations in passenger demand caused by events.  

Our study analyzes whether events can serve as a reliable predictor of station busyness (passenger flow) and assesses how well TfL’s current scheduling aligns with demand surges.

### 1.2 Research questions  
What key factors (e.g., event characteristics, external conditions) are most relevant in predicting station busyness during events?  
H1: The type, scale, and ticketing status of an event have a significant impact on station busyness, but the extent of this impact varies with external factors such as weather, time, and competing events.

Can we develop a tool that links event type, size, and scale with station busyness to improve network management?  
H2: A predictive model that links event characteristics with station busyness will provide data-driven insights for network management, thereby enhancing transport planning efficiency.

Can integrating event data into a predictive model improve TfL’s ability to optimize scheduling and resource allocation?  
H3: Integrating event data (such as event type, scale, and timing) into the predictive model will enhance TfL’s ability to schedule services accurately during high-demand periods, reduce over-scheduling and resource waste, and optimize train frequencies and station operations.  
### 1.3 Data sources  
#### 1.3.1 TfL data


In [None]:
# import necessary libraries
import numpy as np
import pandas as pd

## Data preprocessing and data extraction

In [None]:
file_paths = [
    "data/NBT23MON_outputs.xlsx",
    "data/NBT23TWT_outputs.xlsx",
    "data/NBT23FRI_outputs.xlsx",
    "data/NBT23SAT_outputs.xlsx",
    "data/NBT23SUN_outputs.xlsx"
]

sheet_names = [
    "Link_Loads",
    "Link_Frequencies",
    "Station_Flows",
    "Station_Entries",
    "Station_Exits"
]

data_frames = {}

for file_path in file_paths:
    data_frames[file_path] = {}
    for sheet in sheet_names:
        df = pd.read_excel(file_path, sheet_name=sheet, skiprows=2)
        data_frames[file_path][sheet] = df

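# Round every numeric value in every DataFrame to the nearest integer.
# NaN is left untouched because round(float("nan")) raises ValueError.
# DataFrame.map requires pandas >= 2.1; older versions use applymap.
for file_path in data_frames:
    for sheet in data_frames[file_path]:
        data_frames[file_path][sheet] = data_frames[file_path][sheet].map(
            lambda x: round(x) if isinstance(x, (float, int)) and not pd.isna(x) else x)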
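
The rounding step relies on Python's built-in `round`, which rounds halves to the nearest even integer and raises an error for NaN or infinite values. The sketch below shows one way to make the per-cell rule explicit. The name `round_if_number` is hypothetical and not part of the original notebook.

```python
import math


def round_if_number(x):
    """Round real numbers to the nearest integer; return anything else unchanged.

    Hypothetical helper for illustration, not part of the original notebook.
    """
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(x, bool) or not isinstance(x, (int, float)):
        return x
    # round() raises for NaN and infinity, so pass those through unchanged.
    if isinstance(x, float) and not math.isfinite(x):
        return x
    return round(x)


print(round_if_number(2.5))     # 2: halves round to the nearest even integer
print(round_if_number(3.5))     # 4
print(round_if_number("Bank"))  # Bank
```

A helper like this could be passed to `DataFrame.map` in place of the lambda, which keeps the handling of missing values visible and testable.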

### 2.2 Web scraping  
The following code block retrieves the locations of Tube, DLR, Overground, and Elizabeth line stations from the TfL API.
It requires a .env file containing APP_KEY, the API key for the TfL API, in the same directory as this script.
See https://api-portal.tfl.gov.uk/ for more information.  
Each station's location is given as latitude and longitude in the WGS84 coordinate system.
The locations and other station information are saved to stations_location.csv in the data folder for further use.
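
Since `os.getenv` returns `None` when a variable is missing, it can help to fail early with a clear message before any request is sent. The following is a minimal sketch of such a check. The helper name `require_app_key` and its error message are assumptions for illustration, not part of the original script.

```python
import os


def require_app_key(env=None, name="APP_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or blank.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(
            f"{name} is not set; add a line like {name}=<your key> to the .env file"
        )
    return key


# Example with an explicit mapping instead of the real environment.
print(require_app_key({"APP_KEY": "example-key"}))  # example-key
```

In the script itself, a call such as `require_app_key()` would go after `dotenv.load_dotenv()`, so that the variables from the .env file are already loaded.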

In [None]:
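# Imports needed by this script
import os

import dotenv
import pandas as pd
import requests

# Load variables from the .env file into the environment
dotenv.load_dotenv()
APP_KEY = os.getenv("APP_KEY")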

def get_line_ids() -> dict[str, tuple[str, str]]:
    """
    Get the line ids for tube, dlr, overground and elizabeth-line
    :return: a dictionary with line id as key and line name and mode name as value
    """
    url_line = "https://api.tfl.gov.uk/Line/Route"
    params = {
        "app_key": APP_KEY,
    }
    response = requests.get(url_line, params=params, timeout=30)
    # Stop early on HTTP errors instead of parsing an error payload
    response.raise_for_status()
    data = response.json()

    # Keep only the rail modes covered by this study
    modes_of_interest = ["tube", "dlr", "overground", "elizabeth-line"]
    line_ids = dict()
    for line in data:
        if line["modeName"] in modes_of_interest:
            line_ids[line["id"]] = (line["name"], line["modeName"])

    return line_ids

def get_station_stop_points(line_id_: str) -> list[dict]:
    """
    Get the stop points for a given line
    :param line_id_: the line id
    :return: a list of stop points
    """
    url_stop_points = f"https://api.tfl.gov.uk/Line/{line_id_}/StopPoints"
    params = {
        "app_key": APP_KEY,
    }
    response = requests.get(url_stop_points, params=params, timeout=30)
    # Stop early on HTTP errors instead of parsing an error payload
    response.raise_for_status()
    return response.json()

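def get_station_stop_points_df() -> pd.DataFrame:
    """
    Build a table with one row per line and stop point
    :return: a DataFrame with line, station, and location columns
    """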
    lines = get_line_ids()
    stop_points_all = list()

    for line_id, (line_name, mode_name) in lines.items():
        stop_points = get_station_stop_points(line_id)
        for stop_point in stop_points:
            stop_points_all.append({
                "line_id": line_id,
                "line_name": line_name,
                "mode_name": mode_name,
                "station_id": stop_point["id"],
                "station_name": stop_point["commonName"],
                "lat": stop_point["lat"],
                "lon": stop_point["lon"],
                "station_modes": stop_point["modes"]
            })

    return pd.DataFrame(stop_points_all)

if __name__ == "__main__":
    # Query the API only if the stations file has not been saved yet
    if not os.path.exists("data/stations_location.csv"):
        # Use a separate name so the function itself is not overwritten
        stations_df = get_station_stop_points_df()
        stations_df.to_csv("data/stations_location.csv", index=False)
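
Because `get_station_stop_points_df` appends one row per line and stop point, a station served by several lines appears several times in stations_location.csv. The sketch below shows one possible way to collapse such rows into a single record per station, using only the standard library. The sample rows are made-up placeholders in the same column layout, and `unique_stations` is a hypothetical helper, not part of the original notebook.

```python
import csv
import io

# Made-up rows in the column layout of stations_location.csv
# (ids, names and coordinates are placeholders, not real TfL data).
SAMPLE_CSV = """line_id,line_name,mode_name,station_id,station_name,lat,lon,station_modes
line-a,Line A,tube,STN1,Station One,51.50,-0.10,"['tube']"
line-b,Line B,tube,STN1,Station One,51.50,-0.10,"['tube']"
line-b,Line B,tube,STN2,Station Two,51.52,-0.12,"['tube']"
"""


def unique_stations(rows):
    """Collapse one-row-per-line records into one record per station_id."""
    stations = {}
    for row in rows:
        # Create the station record on first sight, then reuse it.
        station = stations.setdefault(row["station_id"], {
            "station_name": row["station_name"],
            "lat": float(row["lat"]),
            "lon": float(row["lon"]),
            "line_ids": [],
        })
        station["line_ids"].append(row["line_id"])
    return stations


stations = unique_stations(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for station_id, info in stations.items():
    print(station_id, info["station_name"], info["line_ids"])
# STN1 Station One ['line-a', 'line-b']
# STN2 Station Two ['line-b']
```

To apply the same idea to the real file, the `io.StringIO(SAMPLE_CSV)` argument would be replaced by an open handle on data/stations_location.csv.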