This notebook uses Selenium WebDriver to scrape daily flight schedule data from London Heathrow Airport's website. Because the toggles on the website are complex, the scraper relies on an interactive driver that lets the user interact with the website while the programme is scraping it.


## 1. Selenium Set Up
We will use Firefox as the driver.

In [1]:
# selenium
from selenium import webdriver 
from selenium.webdriver.common.by import By 
from selenium.webdriver.common.keys import Keys

# beautiful soup
import requests
from bs4 import BeautifulSoup

import pandas as pd
import json, os, time  # os is needed to build the output path when saving

In [2]:
# initiate the web driver
driver = webdriver.Firefox()

# def initate_driver(url,firefox = True):
#     global driver
#     driver = webdriver.Firefox()
#     driver.get(url)

## 2. London Heathrow Airport

We will first write some helper functions. In general, the scraping process is as follows. For each departure / arrival data set:
* Load the page in the interactive driver and go to the top of the flight schedule page
* Scrape the schedule from the schedule page, including the URL of each flight card
* Open each flight card to get the details, including the actual time

### Define helper function

In [3]:
earlier_flight_button = '//*[@id="flight-list-app"]/div/div[2]/div[2]//button[1]'

def scrape_heathrow_page():
    """
    Parse the scheduled time, flight code, city, flight-card URL and status
    of every flight currently listed, and append them to five global lists.
    """
    global times, codes, citys, urls, status
    # create any of the result lists that does not exist yet
    for name in ('times', 'codes', 'citys', 'urls', 'status'):
        globals().setdefault(name, [])

    # loop over all list flight schedule item
    for result in driver.find_elements(By.XPATH,'//*[@class="airline-listing-table"]/a[contains(@class,"airline-listing-line-item")]'):
        ftime = result.find_element(By.XPATH,"./div").text
        code = result.find_element(By.XPATH,"./div[2]/div[1]/div[1]").text
        city = result.find_element(By.XPATH,"./div[2]/div[1]/div[2]").text
        url  = result.get_attribute("href")
        status_i = result.find_element(By.XPATH,"./div[3]/p").text
        times.append(ftime)
        codes.append(code)
        citys.append(city)
        urls.append(url)
        status.append(status_i)

        print(f"Flight {code} departing for {city} at {ftime}: {url}")


def scrape_flight_page(dep):
    """Scrape an individual flight page; return the actual time and the IATA label."""
    # identify which block to scrape
    div_id = 0 if dep else 1
    # point to the flight detail cards
    res  = driver.find_elements(By.XPATH, "//div[contains(@class,'show-flight-details')]")
    card = res[div_id] # the card holding the actual time, chosen by departure or arrival
    iata_card = res[1 if dep else 0] # the card at the other end of the route
    try:
        time_act = card.find_element(By.XPATH, ".//div[contains(@aria-label,'actual time')]").text
    except Exception:
        print("An error occurred when parsing the actual time.")
        time_act = None
    try:
        iata = iata_card.find_element(By.XPATH, "./p").text
    except Exception:
        print("An error occurred when parsing the iata.")
        iata = None

    return time_act, iata

def go_to_top(): 
    """
    to be called when web driver is at the daily schedule page, to scroll to the top of the page
    """           
    while True:
        try:
            driver.find_element(By.XPATH,earlier_flight_button).send_keys(Keys.RETURN)
            time.sleep(0.5) 
        except:
            print("Loaded to the top of the list")
            break
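
The "press the button until it disappears" pattern used in `go_to_top` can be factored into a small helper with an upper bound on the number of presses, so that a page that never runs out of buttons cannot loop forever. This is only a sketch under stated assumptions: `press_until_gone`, `press` and `max_presses` are hypothetical names that do not appear in the notebook, and `press` stands for any callable that raises once the button can no longer be found (for example a wrapper around `driver.find_element(...).send_keys(Keys.RETURN)`).

```python
import time


def press_until_gone(press, pause=0.5, max_presses=200):
    """Call press() until it raises; return the number of successful presses.

    press       -- hypothetical callable that presses the button and raises
                   an exception once the button can no longer be found
    pause       -- seconds to wait after each successful press
    max_presses -- safety cap so that the loop always terminates
    """
    presses = 0
    while presses < max_presses:
        try:
            press()
        except Exception:
            # the button is gone (or the page failed): stop pressing
            break
        presses += 1
        time.sleep(pause)
    return presses


# stand-in for the real button: it "disappears" after three presses
remaining = [3]


def fake_press():
    if remaining[0] == 0:
        raise LookupError("button not found")
    remaining[0] -= 1


n_presses = press_until_gone(fake_press, pause=0)
print(n_presses)
```

Catching `Exception` rather than using a bare `except` means that a keyboard interrupt still stops the loop immediately instead of being reported as "button gone".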
    

### 2.1 Departures

First, we will load the page and get to the top of the daily flight schedule table. For the purpose of the project, we rely on historical data where the actual time of arrival/departure is known. Therefore, you may want to interact with the web driver at this stage to load the data from the previous day.

In [4]:
# initate_driver("https://www.heathrow.com/departures")
driver.get("https://www.heathrow.com/departures")
time.sleep(5) 
# confirm that the page has loaded properly for the desired date
input("Press Enter when the desired page is loaded")
# get to top of the day
go_to_top()

Loaded to the top of the list


We will now start scraping the data

In [5]:
times = []
codes = []
citys = []
urls = []
status = []

# scrape the first page
scrape_heathrow_page()

# loop through the whole schedule for the date
later_flight_button = '//*[@id="flight-list-app"]/div/div[2]/div[2]/div/div[3]/button'
while True:
    try:
        # load later flights
        driver.find_element(By.XPATH,later_flight_button).send_keys(Keys.RETURN)
        # add the data to the lists
        scrape_heathrow_page()

    except Exception:
        print("Reached the end of the list")
        break

Flight TP1363 departing for Lisbon at 06:00: https://www.heathrow.com/departures/terminal-2/flight-details/TP1363/02-05-2024
Flight OS458 departing for Vienna at 06:00: https://www.heathrow.com/departures/terminal-2/flight-details/OS458/02-05-2024
Flight LX345 departing for Zurich at 06:00: https://www.heathrow.com/departures/terminal-2/flight-details/LX345/02-05-2024
Flight BA472 departing for Barcelona at 06:10: https://www.heathrow.com/departures/terminal-5/flight-details/BA472/02-05-2024
Flight BA854 departing for Prague at 06:15: https://www.heathrow.com/departures/terminal-3/flight-details/BA854/02-05-2024
Flight AF1381 departing for Paris at 06:15: https://www.heathrow.com/departures/terminal-4/flight-details/AF1381/02-05-2024
Flight BA1430 departing for Edinburgh at 06:15: https://www.heathrow.com/departures/terminal-5/flight-details/BA1430/02-05-2024
Flight IB3181 departing for Madrid at 06:15: https://www.heathrow.com/departures/terminal-5/flight-details/IB3181/02-05-2024
Fli

In [6]:
# return the dataframe
departures = pd.DataFrame({"time_sch":times,'code':codes,'dest':citys, "status":status,'url':urls})
departures = departures.set_index('code')
departures.head()

code,time_sch,dest,status,url
TP1363,06:00,Lisbon,DEPARTED,https://www.heathrow.com/departures/terminal-2...
OS458,06:00,Vienna,DEPARTED,https://www.heathrow.com/departures/terminal-2...
LX345,06:00,Zurich,DEPARTED,https://www.heathrow.com/departures/terminal-2...
BA472,06:10,Barcelona,DEPARTED,https://www.heathrow.com/departures/terminal-5...
BA854,06:15,Prague,DEPARTED,https://www.heathrow.com/departures/terminal-3...
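
The flight-card URLs collected above appear to follow the pattern `/<direction>/<terminal>/flight-details/<code>/<DD-MM-YYYY>`. Assuming that pattern holds, the terminal and the date can be recovered from the URL without visiting the page. The sketch below is illustrative only: `parse_flight_url` is a hypothetical helper that is not part of the notebook, and it returns `None` for any URL that does not match the assumed layout.

```python
from datetime import date, datetime
from urllib.parse import urlparse


def parse_flight_url(url):
    """Split a flight-card URL into its parts, or return None if it does not fit.

    Assumed layout (inferred from the printed URLs, not documented by the site):
    /<direction>/<terminal>/flight-details/<code>/<DD-MM-YYYY>
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) != 5 or parts[2] != "flight-details":
        return None
    direction, terminal, _, code, day = parts
    try:
        flight_date = datetime.strptime(day, "%d-%m-%Y").date()
    except ValueError:
        # the last path segment is not a DD-MM-YYYY date
        return None
    return {
        "direction": direction,
        "terminal": terminal,
        "code": code,
        "date": flight_date,
    }


example = parse_flight_url(
    "https://www.heathrow.com/departures/terminal-2/flight-details/TP1363/02-05-2024"
)
print(example)
```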


#### 2.1.2 Scrape individual pages
In addition to the schedule and the status, we are also interested in the actual departure time. This requires scraping the detail page of each flight. At times the website may be unresponsive, and the scraper then has to be halted.
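
One way to cope with an occasionally unresponsive site is to retry a failed page a few times with an increasing pause before giving up. The helper below is a minimal sketch, not what this notebook does: `with_retries`, `fetch`, `attempts` and `base_delay` are hypothetical names, and `fetch` stands for any zero-argument callable, such as a function that loads one flight page and calls `scrape_flight_page`. The values returned by `flaky_fetch` are made up for the demonstration.

```python
import time


def with_retries(fetch, attempts=3, base_delay=1.0):
    """Call fetch() up to `attempts` times, doubling the pause after each failure.

    Returns the first successful result and re-raises the last exception
    if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise
            # wait base_delay, then 2 * base_delay, then 4 * base_delay, ...
            time.sleep(base_delay * 2 ** attempt)


# stand-in for a page that fails twice before responding
calls = []


def flaky_fetch():
    calls.append(1)
    if len(calls) < 3:
        raise RuntimeError("page did not respond")
    return "ok"


result = with_retries(flaky_fetch, attempts=3, base_delay=0)
print(result, len(calls))
```

Because only `Exception` is caught, pressing Ctrl-C still interrupts the scraper instead of being retried.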

In [None]:
# initialise a 'time_act' column to fill
departures['time_act'] = pd.NA

In [32]:
driver = webdriver.Firefox()
driver.get("https://www.heathrow.com/departures")

In [38]:
# iterate through the departed flights whose actual time is still missing
counter = 1
error_list = []
for key, val in departures[(departures['time_act'].isna()) & (departures["status"] == "DEPARTED")].iterrows():
    driver.get(val['url'])
    time.sleep(0.25)
    try:
        time_act, iata = scrape_flight_page(dep=True)
        departures.loc[key,['time_act','iata']] = time_act, iata
        print(f'{counter}: flight {key} scheduled at {val["time_sch"]} departed at {time_act}')
    except Exception:
        print(f"Error occurred when calling scrape_flight_page for {val['status']} flight {key}")
        error_list.append(val['url'])
    counter +=1 

Error occurred when calling scrape_flight_page for DEPARTED flight BA342
Error occurred when calling scrape_flight_page for DEPARTED flight BA752
Error occurred when calling scrape_flight_page for DEPARTED flight BA790
Error occurred when calling scrape_flight_page for DEPARTED flight BA668
Error occurred when calling scrape_flight_page for DEPARTED flight BA962
Error occurred when calling scrape_flight_page for DEPARTED flight EI915
Error occurred when calling scrape_flight_page for DEPARTED flight LX355


KeyboardInterrupt: 

In [28]:
# inspect the empty data
departures[(departures["iata"].notna()) & (departures["time_act"].isna())]

code,time_sch,dest,status,url,time_act,iata
BA854,06:30,Prague,CANCELLED,https://www.heathrow.com/departures/terminal-3...,,Arrive Prague (PRG)
BA304,06:40,Paris,CANCELLED,https://www.heathrow.com/departures/terminal-5...,,Arrive Paris (CDG)
BA428,06:50,Amsterdam,CANCELLED,https://www.heathrow.com/departures/terminal-5...,,Arrive Amsterdam (AMS)
BA1382,07:00,Manchester,CANCELLED,https://www.heathrow.com/departures/terminal-5...,,Arrive Manchester (MAN)
BA762,07:30,Oslo,CANCELLED,https://www.heathrow.com/departures/terminal-3...,,Arrive Oslo (OSL)
BA1340,07:50,Jersey,CANCELLED,https://www.heathrow.com/departures/terminal-5...,,Arrive Saint Helier (JER)
KL1002,08:40,Amsterdam,CANCELLED,https://www.heathrow.com/departures/terminal-4...,,Arrive Amsterdam (AMS)
BA982,08:45,Berlin,CANCELLED,https://www.heathrow.com/departures/terminal-5...,,Arrive Berlin (BER)
EW461,08:50,Cologne,CANCELLED,https://www.heathrow.com/departures/terminal-2...,,Arrive Cologne/Bonn (CGN)
AF1681,09:00,Paris,CANCELLED,https://www.heathrow.com/departures/terminal-4...,,Arrive Paris (CDG)


In [36]:
# validate the data collection
departures.isnull().sum()

time_sch      0
dest          0
status        0
url           0
time_act    214
iata        201
dtype: int64

### 2.2 Arrivals

In [37]:
driver.get("https://www.heathrow.com/arrivals")
time.sleep(5) 

# confirm that the page has loaded properly
input("Press Enter when the page is loaded")

# get to top of the day
earlier_flight_button = '//*[@id="flight-list-app"]/div/div[2]/div[2]//button[1]'
while True:
    try:
        driver.find_element(By.XPATH,earlier_flight_button).send_keys(Keys.RETURN)
        time.sleep(1) 
    except:
        print("Loaded to the top of the list")
        break

Loaded to the top of the list


In [None]:
times = []
codes = []
citys = []
urls = []
status = []

# scrape the first page
scrape_heathrow_page()

# loop through the whole schedule for the date
later_flight_button = '//*[@id="flight-list-app"]/div/div[2]/div[2]/div/div[3]/button'
while True:
    try:
        # load later flights
        driver.find_element(By.XPATH,later_flight_button).send_keys(Keys.RETURN)
        # add the data to the lists
        scrape_heathrow_page()
    except Exception:
        print("Reached the end of the list")
        break

In [None]:
arrivals = pd.DataFrame({"time_sch":times,'code':codes,'orig':citys, "status":status,'url':urls})
arrivals = arrivals.set_index('code')
arrivals.head()

code,time_sch,orig,status,url
VS450,05:00,Johannesburg,LANDED,https://www.heathrow.com/arrivals/terminal-3/f...
QF209,05:05,Melbourne,LANDED,https://www.heathrow.com/arrivals/terminal-3/f...
BA074,05:25,Lagos,LANDED,https://www.heathrow.com/arrivals/terminal-5/f...
BA016,05:25,Sydney,LANDED,https://www.heathrow.com/arrivals/terminal-5/f...
BA056,05:30,Johannesburg,EXPECTED,https://www.heathrow.com/arrivals/terminal-5/f...


#### 2.2.2 Scrape individual pages

In [None]:
# initialise the counter and a 'time_act' column to fill
counter = 1
arrivals["time_act"] = pd.NA

In [None]:
# fill in the actual time and iata
sleep_time = .1
# skip cancelled and diverted flights
for key, val in arrivals[arrivals['time_act'].isnull() &
                         ((arrivals["status"] != "CANCELLED")& (arrivals['status'] != "FLIGHT DIVERTED"))
                         ].iterrows():
    driver.get(val['url'])
    time.sleep(sleep_time)
    try:
        time_act,iata = scrape_flight_page(dep = False)
        arrivals.loc[key,['time_act','iata']] = time_act, iata
        print(f'{counter}: flight {key} scheduled at {val["time_sch"]} landed at {time_act}')
    except Exception:
        print(f"Error occurred when calling scrape_flight_page for {val['status']} flight {key}")

    counter +=1 

In [None]:
# check for missing values
arrivals[((arrivals['time_act'].isnull()) | (arrivals["time_act"] == ""))
        & ((arrivals['status'] != "CANCELLED") & (arrivals['status'] != "FLIGHT DIVERTED"))
         ]

In [None]:
arrivals.head()

code,time_sch,orig,status,url,time_act,iata,dest
VS450,05:00,Johannesburg,LANDED,https://www.heathrow.com/arrivals/terminal-3/f...,04:33,Depart Johannesburg (JNB),London
QF209,05:05,Melbourne,LANDED,https://www.heathrow.com/arrivals/terminal-3/f...,05:32,Depart Melbourne (MEL),London
BA074,05:25,Lagos,LANDED,https://www.heathrow.com/arrivals/terminal-5/f...,05:11,Depart Lagos (LOS),London
BA016,05:25,Sydney,LANDED,https://www.heathrow.com/arrivals/terminal-5/f...,05:36,Depart Sydney (SYD),London
BA056,05:30,Johannesburg,EXPECTED,https://www.heathrow.com/arrivals/terminal-5/f...,21:14,Depart Johannesburg (JNB),London


In [None]:
arrivals.isnull().sum()

time_sch    0
orig        0
status      0
url         0
time_act    6
iata        0
dest        0
dtype: int64

### 2.3 Concatenate the Arrival and Departure Data

In [41]:
# add the missing orig/dest column
departures['orig'] = "London"
arrivals['dest'] = "London"
df = pd.concat([departures, arrivals])
# inspect
df.head()

In [43]:
df.isnull().sum()

time_sch      0
dest          0
status        0
url           0
time_act    214
iata        201
orig          0
dtype: int64

In [57]:
date = "02MAY2024"
# define the file name
file_name = f"{date}_LHR.csv" 
# get the parent directory of the current working directory
parent_directory = os.path.dirname(os.getcwd())
filepath = os.path.join(parent_directory ,"data",file_name )

# save to csv
print(f"Saving df to {filepath}")
df.to_csv(filepath)

Saving df to /Users/Tra_FIT/Desktop/Python/GitHub/LHR_ops_data/data/02MAY2024_LHR.csv
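
The file name above is typed by hand. As a sketch, it could instead be derived from a `datetime.date`, with the `data` folder created if it does not exist yet. `lhr_csv_path` and `MONTHS` are hypothetical names, and the temporary directory only stands in for the real parent directory so that the example can run anywhere.

```python
import tempfile
from datetime import date
from pathlib import Path

# fixed month abbreviations, so the result does not depend on the locale
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")


def lhr_csv_path(day, base_dir):
    """Return <base_dir>/data/<DDMMMYYYY>_LHR.csv, creating the data folder."""
    file_name = f"{day.day:02d}{MONTHS[day.month - 1]}{day.year}_LHR.csv"
    data_dir = Path(base_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)
    return data_dir / file_name


# demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    csv_path = lhr_csv_path(date(2024, 5, 2), tmp)
    csv_name = csv_path.name
    data_dir_exists = csv_path.parent.is_dir()

print(csv_name, data_dir_exists)
```

The returned path can be passed directly to `df.to_csv`.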
