This notebook uses the Selenium Firefox driver to scrape daily flight schedule data from the websites of London Heathrow Airport and London Gatwick Airport. Because the toggles on these sites are complex, the scraper relies on an interactive driver that lets the user interact with the website while the programme is scraping it.


## 1. Selenium Setup
We will use Firefox as the driver.

In [1]:
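# selenium
from selenium import webdriver
# project helpers used below (e.g. go_to_top, scrape_heathrow_pages)
from src.main import *

import datetime
import os
import time  # imported explicitly because time.sleep is called below

import pandas as pd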

In [2]:
# initiate the web driver
driver = webdriver.Firefox()


In [3]:
# current UTC timestamp, used later to name the output files
current_date = datetime.datetime.now(tz=datetime.timezone.utc)

## 2. London Heathrow Airport

We will rely on a few helper functions. In general, the scraping process is as follows. For each departure / arrival data set:
* Open the page in the interactive driver and load to the top of the flight schedule page
* Scrape the schedule from the schedule page, including the URL of each flight card
* Go into each flight card to get the details, including the actual time
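
The three steps above can be sketched as a small pipeline. This is only an illustration: `scrape_schedule` and the three callables are hypothetical stand-ins (not the helpers from `src.main`), and the flight record is made-up data so the sketch runs without a browser.

```python
from typing import Callable, Dict, List

Flight = Dict[str, str]


def scrape_schedule(
    load_page: Callable[[], None],
    read_schedule: Callable[[], List[Flight]],
    read_flight_card: Callable[[str], str],
) -> List[Flight]:
    """Run the three steps in order and return one record per flight."""
    load_page()                # step 1: open the page at the top of the day
    flights = read_schedule()  # step 2: schedule rows, each with a card URL
    for flight in flights:     # step 3: visit each card for the actual time
        flight["time_act"] = read_flight_card(flight["url"])
    return flights


# Stand-in callables with made-up data, used here instead of a live browser.
visited: List[str] = []
demo = scrape_schedule(
    load_page=lambda: visited.append("schedule"),
    read_schedule=lambda: [
        {"code": "XX001", "time_sch": "06:00", "url": "card/XX001"}
    ],
    read_flight_card=lambda url: visited.append(url) or "06:12",
)
print(demo)
```

Passing the steps in as callables keeps the ordering logic separate from the page-specific Selenium code, which is the part most likely to change when the website changes.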

### 2.1 Departures

First, we will load the page and get to the top of the daily flight schedule table. For the purpose of the project, we rely on historical data where the actual time of arrival/departure is known. Therefore, you may want to interact with the web driver at this stage to load the data from the previous day.
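
If the page is set to the previous day, it can be useful to have a matching date label at hand. The sketch below is an optional helper, not part of the notebook: `previous_day_label` is a hypothetical name, and it assumes the same `%d%b%Y` format and UTC clock that the notebook uses for its file names.

```python
import datetime
from typing import Optional


def previous_day_label(now: Optional[datetime.datetime] = None) -> str:
    """Return the day before `now`, formatted like the notebook's file names."""
    if now is None:
        now = datetime.datetime.now(tz=datetime.timezone.utc)
    previous = now - datetime.timedelta(days=1)
    return previous.strftime("%d%b%Y")


# Fixed timestamp so the result does not depend on when this is run.
example = datetime.datetime(2025, 2, 23, 9, 30, tzinfo=datetime.timezone.utc)
print(previous_day_label(example))
```

Note that the month abbreviation produced by `%b` follows the active locale.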

In [4]:
# initate_driver("https://www.heathrow.com/departures")
driver.get("https://www.heathrow.com/departures")
time.sleep(5) 
# confirm the page is properly loaded to the wanted date
input("Enter when the desired page is loaded (accepted cookies)")

''

In [5]:
# get to top of the day
go_to_top(driver)

Loaded to the top of the list


We will now start scraping the data

In [6]:
departures = scrape_heathrow_pages(driver=driver,departure=True)

Reached end of the page


#### 2.1.2 Scrape individual pages (Skipped)
In addition to the schedule and the status, we are also interested in the actual departure time. This requires scraping the individual page of each flight. At times the website may be unresponsive, in which case the scraper has to be halted.
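
One way to cope with an occasionally unresponsive site is to retry a failed call a few times with a growing pause before giving up. The sketch below is a generic illustration, not something the notebook does: `call_with_retry` and its parameters are hypothetical names, and the flaky function only simulates a page that fails twice.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    attempts: int = 3,
    delay: float = 1.0,
    backoff: float = 2.0,
) -> T:
    """Call `func`, retrying on any Exception with an increasing pause."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)
            delay *= backoff
    raise ValueError("attempts must be at least 1")


# Simulated flaky call: fails twice, then succeeds.
calls = {"n": 0}


def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("page not ready")
    return "ok"


result = call_with_retry(flaky, attempts=3, delay=0.0)
print(result, calls["n"])
```

In the loop below, the call to `scrape_flight_page` could be wrapped in a lambda and passed to such a helper. Because only `Exception` is caught, a keyboard interrupt still stops the run.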

In [7]:
# initialise a 'time_act' column to fill
departures['time_act'] = pd.NA

In [None]:
# iterate through departed flights that still lack an actual time
counter = 1
error_list = []  # URLs of flight pages that could not be scraped
for key, val in departures[(departures['time_act'].isna()) & (departures["status"] == "DEPARTED")].iterrows():
    driver.get(val['url'])
    time.sleep(0.5)
    try:
        time_act, iata = scrape_flight_page(dep=True)
        departures.loc[key, ['time_act', 'iata']] = time_act, iata
        print(f'{counter}: flight {key} scheduled at {val["time_sch"]} departed at {time_act}')
    except Exception:  # not a bare except, so an interrupt can still halt the loop
        print(f"Error occurred when calling scrape_flight_page for {val['status']} flight {key}")
        error_list.append(val['url'])
    counter += 1

In [None]:
# inspect the empty data
departures[(departures["iata"].notna()) & (departures["time_act"].isna())]

In [None]:
# validate the data collection
departures.isnull().sum()

### 2.2 Arrivals

In [6]:
driver.get("https://www.heathrow.com/arrivals")
time.sleep(5) 
# confirm the page is properly loaded to the wanted date
input("Enter when the desired page is loaded (accepted cookies)")
# get to top of the day
go_to_top(driver)

Loaded to the top of the list


In [8]:
arrivals = scrape_heathrow_pages(driver=driver,departure=False)

Reached end of the page


#### 2.2.2 Scrape individual pages (Skipped)

In [None]:
# initialise the row counter and a 'time_act' column to fill
counter = 1
arrivals["time_act"] = pd.NA

In [None]:
# fill in the actual time and iata
sleep_time = .1
# skip cancelled and diverted flights, which have no actual arrival time here
for key, val in arrivals[arrivals['time_act'].isnull() &
                         ((arrivals["status"] != "CANCELLED") & (arrivals['status'] != "FLIGHT DIVERTED"))
                         ].iterrows():
    driver.get(val['url'])
    time.sleep(sleep_time)
    try:
        time_act, iata = scrape_flight_page(dep=False)
        arrivals.loc[key, ['time_act', 'iata']] = time_act, iata
        print(f'{counter}: flight {key} scheduled at {val["time_sch"]} landed at {time_act}')
    except Exception:  # not a bare except, so an interrupt can still halt the loop
        print(f"Error occurred when calling scrape_flight_page for {val['status']} flight {key}")

    counter += 1

In [None]:
# check for missing values
arrivals[((arrivals['time_act'].isnull()) | (arrivals["time_act"] == ""))
        & ((arrivals['status'] != "CANCELLED") & (arrivals['status'] != "FLIGHT DIVERTED"))
         ]

In [None]:
arrivals.head()

In [None]:
arrivals.isnull().sum()

### 2.3 Concatenate the Arrival and Departure Data

In [10]:
# add orig/dest column (a scalar is broadcast to every row)
departures['orig'] = "London"
arrivals['dest'] = 'London'
df_lhr = pd.concat([departures, arrivals])
# inspect
df_lhr.head()

code,time_sch,dest,status,url,orig
TP1363,06:00,Lisbon,DEPARTED,https://www.heathrow.com/departures/terminal-2...,London
OS458,06:00,Vienna,DEPARTED,https://www.heathrow.com/departures/terminal-2...,London
LX345,06:00,Zurich,DEPARTED,https://www.heathrow.com/departures/terminal-2...,London
BA472,06:05,Barcelona,DEPARTED,https://www.heathrow.com/departures/terminal-5...,London
BA456,06:10,Madrid,DEPARTED,https://www.heathrow.com/departures/terminal-5...,London


In [11]:
df_lhr.isnull().sum()

time_sch    0
dest        0
status      0
url         0
orig        0
dtype: int64

In [12]:
# define file name
file_name = f"{current_date.strftime('%d%b%Y')}_LHR.csv"
# Get the parent directory (preceding folder) of the current directory
parent_directory = os.path.dirname(os.getcwd())
filepath = os.path.join(parent_directory, "data", file_name)

# save to csv
print(f"Saving df to {filepath}")
df_lhr.to_csv(filepath)

Saving df to /Users/Tra_FIT/Desktop/Python/GitHub/LHR_ops_data/data/22Feb2025_LHR.csv
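
The save step above assumes that a `data` folder already exists next to the working directory. As a hedged alternative, the same path can be built with `pathlib`, creating the folder if it is missing. `build_output_path` is a hypothetical helper, not part of `src.main`, and the demo writes into a temporary directory rather than the project folder.

```python
import datetime
import tempfile
from pathlib import Path


def build_output_path(base_dir: Path, date: datetime.date, airport: str) -> Path:
    """Return <base_dir>/data/<ddMonYYYY>_<airport>.csv, creating data/ if needed."""
    data_dir = base_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir / f"{date.strftime('%d%b%Y')}_{airport}.csv"


# Demo in a throwaway directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = build_output_path(Path(tmp), datetime.date(2025, 2, 22), "LHR")
    demo_name = demo_path.name
    demo_dir_exists = demo_path.parent.is_dir()

print(demo_name, demo_dir_exists)
```

In the notebook, `base_dir` would correspond to `Path.cwd().parent`, and the returned path can be passed straight to `DataFrame.to_csv`.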


## 3. London Gatwick Airport

In [13]:
driver.get('https://www.gatwickairport.com/flights?desination=A')

### 3.1 Arrivals

In [14]:
lgw_return_to_start(driver=driver)

In [None]:
lgw_arrv = lgw_return_data(driver=driver,departure=False)

In [None]:
lgw_arrv

### 3.2 Departures

In [None]:
lgw_return_to_start(driver=driver)

In [None]:
lgw_dept = lgw_return_data(driver=driver,departure=True)

### 3.3 Parse the Result as pd DataFrame

In [None]:
# departures leave from London, arrivals land in London
df_dept = pd.DataFrame(lgw_dept)
df_dept['orig'] = 'London'
df_arrv = pd.DataFrame(lgw_arrv)
df_arrv['dest'] = 'London'


In [None]:
df_lgw = pd.concat([df_dept,df_arrv]).drop(columns = 'dummy')
df_lgw.head()

In [None]:
current_date = datetime.datetime.now(tz= datetime.timezone.utc)
current_date.strftime('%d%b%Y')

In [None]:
current_date = datetime.datetime.now(tz= datetime.timezone.utc)

# define file name
file_name = f"{current_date.strftime('%d%b%Y')}_LGW.csv"
# Get the parent directory (preceding folder) of the current directory
parent_directory = os.path.dirname(os.getcwd())
filepath = os.path.join(parent_directory, "data", file_name)

# save to csv
print(f"Saving df to {filepath}")
df_lgw.to_csv(filepath)

In [15]:
driver.quit()