This notebook uses Selenium chrome driver to scrape daily flight schedule data from London Heathrow Airport's website. Due to the complexity of the toggles in the website, this web scraper relies on the interative driver that enable user to interact with the website while the programme is scraping the site.


### 1. Selenium Set Up
We will use Firefox as the driver

In [18]:
# selenium
from selenium import webdriver 
from src.main import *

import pandas as pd
import os
import datetime 

In [2]:
# initiate the web driver
driver = webdriver.Firefox()


In [3]:
current_date = datetime.datetime.now(tz= datetime.timezone.utc)

## 2. London Heathrow Airport

We will first write some helper function. In general, the scraping process needs to be done as follow. For each departure / arrival data set:
* Get the page on interactive driver and load to the top of flight schedule page
* Scrap the schedule from the scheule page, inclduing the url of the flight card
* Go into the flight card to get details, including actual time

### 2.1 Departures

First, we will load the page and get to the top of the daily flight schedule table. For the purpose of the project, we rely on historical data where the actual time of arrival/departure is known. Therefore, you may want to interact with the web driver at this stage to load the data from the previous day.

In [4]:
# initate_driver("https://www.heathrow.com/departures")
driver.get("https://www.heathrow.com/departures")
time.sleep(5) 
# confirm the page is loadded to the date wanted properly
input("Enter when the desired page is loaded (accepted cookies)")
# get to top of the day
go_to_top(driver)

Loaded to the top of the list


We will now start scraping the data

In [5]:
departures = scrape_heathrow_pages(driver=driver,departure=True)

Reached end of the page


In [7]:
departures

Unnamed: 0_level_0,time_sch,dest,status,url
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
TP1363,06:00,Lisbon,DEPARTED,https://www.heathrow.com/departures/terminal-2...
OS458,06:00,Vienna,DEPARTED,https://www.heathrow.com/departures/terminal-2...
LX345,06:00,Zurich,DEPARTED,https://www.heathrow.com/departures/terminal-2...
BA472,06:05,Barcelona,DEPARTED,https://www.heathrow.com/departures/terminal-5...
BA868,06:20,Budapest,DEPARTED,https://www.heathrow.com/departures/terminal-3...
...,...,...,...,...
BA083,22:30,Abuja,ON TIME,https://www.heathrow.com/departures/terminal-5...
MS780,22:30,Cairo,ON TIME,https://www.heathrow.com/departures/terminal-2...
VS449,22:30,Johannesburg,ON TIME,https://www.heathrow.com/departures/terminal-3...
AM008,22:30,Mexico City,ON TIME,https://www.heathrow.com/departures/terminal-3...


#### 2.1.2 Scrape individual page (Skipped)
In addition to the schedule and the status, we are also interested in the actual departure time. This requires scraping all the pages for each of the flight. At times, the website may be unresponsive, requiring a halt to the scraper.

In [None]:
# initialise a 'time_act' column to fill
departures['time_act'] = pd.NA

In [32]:
driver = webdriver.Firefox()
driver.get("https://www.heathrow.com/departure")

In [38]:
# iterate through rows
counter = 1
error_list = []
# set up headless driver
for key, val in departures[(departures['time_act'].isna()) & (departures["status"] == "DEPARTED")].iterrows():
    driver.get(val['url'])
    time.sleep(0.25)
    try:
        time_act, iata = scrape_flight_page(dep=True)
        departures.loc[key,['time_act','iata']] = time_act, iata
        print(f'{counter}: flight {key} scheduled at {val["time_sch"]} departed at {time_act}')
    except:
        print(f"Error occured when calling scrape_flight_page for {val['status']} flight {key}")
        error_list.append(val['url'])
    counter +=1 

Error occured when calling scrape_flight_page for DEPARTED flight BA342
Error occured when calling scrape_flight_page for DEPARTED flight BA752
Error occured when calling scrape_flight_page for DEPARTED flight BA790
Error occured when calling scrape_flight_page for DEPARTED flight BA668
Error occured when calling scrape_flight_page for DEPARTED flight BA962
Error occured when calling scrape_flight_page for DEPARTED flight EI915
Error occured when calling scrape_flight_page for DEPARTED flight LX355


KeyboardInterrupt: 

In [28]:
# inspect the empty data
departures[(departures["iata"].notna()) & (departures["time_act"].isna())]

Unnamed: 0_level_0,time_sch,dest,status,url,time_act,iata
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
BA854,06:30,Prague,CANCELLED,https://www.heathrow.com/departures/terminal-3...,,Arrive Prague (PRG)
BA304,06:40,Paris,CANCELLED,https://www.heathrow.com/departures/terminal-5...,,Arrive Paris (CDG)
BA428,06:50,Amsterdam,CANCELLED,https://www.heathrow.com/departures/terminal-5...,,Arrive Amsterdam (AMS)
BA1382,07:00,Manchester,CANCELLED,https://www.heathrow.com/departures/terminal-5...,,Arrive Manchester (MAN)
BA762,07:30,Oslo,CANCELLED,https://www.heathrow.com/departures/terminal-3...,,Arrive Oslo (OSL)
BA1340,07:50,Jersey,CANCELLED,https://www.heathrow.com/departures/terminal-5...,,Arrive Saint Helier (JER)
KL1002,08:40,Amsterdam,CANCELLED,https://www.heathrow.com/departures/terminal-4...,,Arrive Amsterdam (AMS)
BA982,08:45,Berlin,CANCELLED,https://www.heathrow.com/departures/terminal-5...,,Arrive Berlin (BER)
EW461,08:50,Cologne,CANCELLED,https://www.heathrow.com/departures/terminal-2...,,Arrive Cologne/Bonn (CGN)
AF1681,09:00,Paris,CANCELLED,https://www.heathrow.com/departures/terminal-4...,,Arrive Paris (CDG)


In [36]:
# validate the data collection
departures.isnull().sum()

time_sch      0
dest          0
status        0
url           0
time_act    214
iata        201
dtype: int64

### 2.2 Arrivals

In [8]:
driver.get("https://www.heathrow.com/arrivals")
time.sleep(5) 
# confirm the page is loadded to the date wanted properly
input("Enter when the desired page is loaded (accepted cookies)")
# get to top of the day
go_to_top(driver)

Loaded to the top of the list


In [9]:
arrivals = scrape_heathrow_pages(driver=driver,depaeture=False)

Reached end of the page


In [13]:
arrivals

Unnamed: 0_level_0,time_sch,orig,status,url
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
BA056,04:45,Johannesburg,LANDED,https://www.heathrow.com/arrivals/terminal-5/f...
BA074,04:50,Lagos,LANDED,https://www.heathrow.com/arrivals/terminal-5/f...
BA058,04:55,Cape Town,LANDED,https://www.heathrow.com/arrivals/terminal-5/f...
QF009,05:05,Perth,LANDED,https://www.heathrow.com/arrivals/terminal-3/f...
BA262,05:15,Riyadh,LANDED,https://www.heathrow.com/arrivals/terminal-5/f...
...,...,...,...,...
TP1366,22:40,Lisbon,ON TIME,https://www.heathrow.com/arrivals/terminal-2/f...
BA349,22:40,Nice,ON TIME,https://www.heathrow.com/arrivals/terminal-5/f...
BA371,22:40,Toulouse,ON TIME,https://www.heathrow.com/arrivals/terminal-3/f...
AY1341,22:40,Helsinki,CANCELLED,https://www.heathrow.com/arrivals/terminal-3/f...


#### 2.2.2 Scrape indivual page (skipped)

In [None]:
# iterate through rows
counter = 1
arrivals["time_act"] = pd.NA

In [None]:
#  fill in the actual time and iata
slee_time = .1
# set up headless driver
for key, val in arrivals[arrivals['time_act'].isnull() &
                         ((arrivals["status"] != "CANCELLED")& (arrivals['status'] != "FLIGHT DIVERTED"))
                         ].iterrows():
    driver.get(val['url'])
    time.sleep(sleep_time)
    try:
        time_act,iata = scrape_flight_page(dep = False)
        arrivals.loc[key,['time_act','iata']] = time_act, iata
        print(f'{counter}: flight {key} scheduled at {val["time_sch"]} landed at {time_act}')
    except:
        print(f"Error occured when calling scrape_flight_page for {val['status']} flight {key}")

    counter +=1 

In [None]:
# check for mising value
arrivals[((arrivals['time_act'].isnull()) | (arrivals["time_act"] == ""))
        & ((arrivals['status'] != "CANCELLED") & (arrivals['status'] != "FLIGHT DIVERTED"))
         ]

In [None]:
arrivals.head()

Unnamed: 0_level_0,time_sch,orig,status,url,time_act,iata,dest
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
VS450,05:00,Johannesburg,LANDED,https://www.heathrow.com/arrivals/terminal-3/f...,04:33,Depart Johannesburg (JNB),London
QF209,05:05,Melbourne,LANDED,https://www.heathrow.com/arrivals/terminal-3/f...,05:32,Depart Melbourne (MEL),London
BA074,05:25,Lagos,LANDED,https://www.heathrow.com/arrivals/terminal-5/f...,05:11,Depart Lagos (LOS),London
BA016,05:25,Sydney,LANDED,https://www.heathrow.com/arrivals/terminal-5/f...,05:36,Depart Sydney (SYD),London
BA056,05:30,Johannesburg,EXPECTED,https://www.heathrow.com/arrivals/terminal-5/f...,21:14,Depart Johannesburg (JNB),London


In [None]:
arrivals.isnull().sum()

time_sch    0
orig        0
status      0
url         0
time_act    6
iata        0
dest        0
dtype: int64

### 2.3 Concatenate the Arrival and Departure Data

In [14]:
# add orig/dest column
departures['orig'] = ["London" for _ in range(len(departures))]
arrivals['dest'] = ['London' for _ in range(len(arrivals))]
df = pd.concat([departures, arrivals])
# inspect
df.head()

Unnamed: 0_level_0,time_sch,dest,status,url,orig
code,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
TP1363,06:00,Lisbon,DEPARTED,https://www.heathrow.com/departures/terminal-2...,London
OS458,06:00,Vienna,DEPARTED,https://www.heathrow.com/departures/terminal-2...,London
LX345,06:00,Zurich,DEPARTED,https://www.heathrow.com/departures/terminal-2...,London
BA472,06:05,Barcelona,DEPARTED,https://www.heathrow.com/departures/terminal-5...,London
BA868,06:20,Budapest,DEPARTED,https://www.heathrow.com/departures/terminal-3...,London


In [15]:
df.isnull().sum()

time_sch    0
dest        0
status      0
url         0
orig        0
dtype: int64

In [17]:
file_name = f"{current_date.strftime('%d%b%Y')}_LHR.csv" 
# define file name
# Get the parent directory (preceding folder) of the current directory
parent_directory = os.path.dirname(os.getcwd())
filepath = os.path.join(parent_directory ,"data",file_name )

# save to csv
print(f"Saving df to {filepath}")
df.to_csv(filepath)

Saving df to /Users/Tra_FIT/Desktop/Python/GitHub/LHR_ops_data/data/16Feb2025_LHR.csv


## 3. London Gatwich Airport

In [4]:
driver.get('https://www.gatwickairport.com/flights?desination=A')

### Arrival

In [22]:
lgw_return_to_start(driver=driver)

In [20]:
lgw_arrv = lgw_return_data(driver=driver,departure=False)

reached end of the page


In [29]:
lgw_arrv

[{'time': '09:45',
  'orig': 'Lisbon',
  'dummy': '',
  'code': 'EZY8514',
  'status': 'Bags delivered 17:45'},
 {'time': '14:40',
  'orig': 'Malta',
  'dummy': '',
  'code': 'BA2615',
  'status': 'Bags delivered 20:02'},
 {'time': '16:15',
  'orig': 'Grenoble',
  'dummy': '',
  'code': 'LS2106',
  'status': 'Bags delivered 19:15'},
 {'time': '16:35',
  'orig': 'Jersey',
  'dummy': '',
  'code': 'EZY876',
  'status': 'Bags delivered 17:36'},
 {'time': '16:40',
  'orig': 'Paris',
  'dummy': '',
  'code': 'EZY8404',
  'status': 'Bags delivered 18:01'},
 {'time': '16:55',
  'orig': 'Isle of Man',
  'dummy': '',
  'code': 'EZY6396',
  'status': 'Bags delivered 17:45'},
 {'time': '17:10',
  'orig': '-',
  'dummy': '',
  'code': 'SV3203',
  'status': 'Landed 20:12'},
 {'time': '17:10',
  'orig': 'Zurich',
  'dummy': '',
  'code': 'EZY8470',
  'status': 'Bags delivered 17:33'},
 {'time': '17:15',
  'orig': 'Grenoble',
  'dummy': '',
  'code': 'ZT224',
  'status': 'Bags delivered 17:40'},
 {'t

### Departure

In [30]:
lgw_return_to_start(driver=driver)

In [24]:
lgw_dept = lgw_return_data(driver=driver,departure=True)

reached end of the page


### 3.3 Parse the Result as pd DataFrame

In [25]:
df_dept = pd.DataFrame(lgw_dept)
df_dept['orig'] = 'London'
df_arrv = pd.DataFrame(lgw_arrv)
df_dept['dest'] = 'London'


In [26]:
df_lgw = pd.concat([df_dept,df_arrv]).drop(columns = 'dummy')
df_lgw.head()

Unnamed: 0,time,dest,code,status,orig
0,15:45,London,AT803,Departed 17:38,London
1,16:25,London,EJU8671,Departed 17:44,London
2,16:25,London,EZY8517,Departed 17:48,London
3,16:35,London,W64558,Departed 17:47,London
4,16:40,London,EZY8733,Departed 17:34,London


In [27]:
current_date = datetime.datetime.now(tz= datetime.timezone.utc)
current_date.strftime('%d%b%Y')

'16Feb2025'

In [28]:
current_date = datetime.datetime.now(tz= datetime.timezone.utc)

file_name = f"{current_date.strftime('%d%b%Y')}_LGW.csv" 
# define file name
# Get the parent directory (preceding folder) of the current directory
parent_directory = os.path.dirname(os.getcwd())
filepath = os.path.join(parent_directory ,"data",file_name )

# save to csv
print(f"Saving df to {file_name}")
df_lgw.to_csv(filepath)

Saving df to 16Feb2025_LGW.csv


In [31]:
driver.quit()