# Process blog post:

Like so many others out there, I've found that this quarantine has instilled a certain level of wanderlust in me. Before I commit to any location, though, I want to know that I'm not getting scalped on the airfare. To put my (and possibly your) mind at ease, I've looked at historical airfare prices across X popular routes to get a view of the price fluctuations you can expect if you're travelling.

## Table of Contents:
0. Exec summary
1. Data collection
    - Schema and collection method
    - Sources
        - Airports & Routes
        - Flights & Prices
2. Data preparation and feature engineering
    - TBC
3. Model generation
4. Model validation
5. Output and visualizations
6. Extensions

## Data Collection
### Airports and Routes:
Scraping a complete list of airports with additional characteristics, e.g. longitude and latitude for catchment analysis.

Source: https://www.world-airport-codes.com/alphabetical/country-name/a.html?page=30

Additional things that might be interesting to do:
- Make the airport scraping update regularly and keep a historical view of closures/openings and other variables (e.g. a new runway)
- Add on airport volumes from other data sources
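
As a rough illustration of why latitude and longitude matter for catchment analysis, here is a minimal sketch that finds airports within a given radius of a point using the haversine formula. The function names, the sample coordinates (approximate, not taken from the scraped data) and the 60 km radius are all assumptions made for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def airports_within(airports, lat, lon, radius_km):
    """Names of airports whose coordinates fall within radius_km of (lat, lon)."""
    return [name for name, a_lat, a_lon in airports
            if haversine_km(lat, lon, a_lat, a_lon) <= radius_km]


# Illustrative, approximate coordinates only (hypothetical sample)
sample = [("Heathrow", 51.47, -0.45),
          ("Gatwick", 51.15, -0.19),
          ("Manchester", 53.35, -2.28)]
nearby = airports_within(sample, 51.5, -0.12, 60)
print(nearby)
```

The same distance function could later be applied to the scraped Latitude and Longitude fields once they are parsed into floats.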

#### Process:
---------------------
To query the population of relevant routes from Skyscanner, I first needed to collect a comprehensive (or as close to comprehensive as possible) list of airports.
1. Find all airport URLs
    - For each letter in alphabet:
        - For each page:
            - For each linked airport
                - Scrape airport hyperlink and associated data
                
2. Collect information on separate airport URL pages
    - For each airport URL:
        - Collect relevant airport specific info
        - Add to data store
        
3. Check data is consistent between the first page and the specific URLs

4. Summary statistics

----------------------
All possible routes then become the 2-combinations of the n airports in the collected list, i.e. n(n-1)/2 unordered pairs.

N.B. It would be interesting to look at the number of routes each airport is directly connected to, and what the average is by country, airport size, etc.
N.B. It would be interesting to see the number of actual routes as a % of all possible routes. I imagine <1%, given how quickly the number of pairs grows with n...
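
To make the pair counting concrete, here is a small sketch using only the standard library. It treats a route as an unordered pair of airports (direction ignored), which is an assumption on my part; the helper names and the observed routes passed in at the end are hypothetical.

```python
from itertools import combinations
from math import comb


def all_possible_routes(airport_codes):
    """Every unordered pair of distinct airports (direction ignored)."""
    return list(combinations(sorted(set(airport_codes)), 2))


def route_coverage(observed_routes, airport_codes):
    """Share of possible unordered pairs that appear among the observed routes."""
    possible = comb(len(set(airport_codes)), 2)
    observed = {tuple(sorted(pair)) for pair in observed_routes}
    return len(observed) / possible if possible else 0.0


codes = ["DEL", "KBL", "MHD", "AEP"]  # IATA codes that appear in the routes table
pairs = all_possible_routes(codes)
print(len(pairs), comb(len(codes), 2))  # both equal n * (n - 1) / 2

# Hypothetical observed routes; the reversed duplicate counts only once
coverage = route_coverage([("KBL", "DEL"), ("DEL", "KBL"), ("MHD", "KBL")], codes)
print(coverage)
```

With several thousand airports the list of pairs runs into the millions, so in practice it is probably better to iterate over `combinations` lazily than to materialise the full list.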

#### Airports List

In [202]:
# Find the URLs of each airport on world-airport-codes.com
from string import ascii_lowercase
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# download chromedriver 
# allow use with - https://stackoverflow.com/questions/60362018/macos-catalinav-10-15-3-error-chromedriver-cannot-be-opened-because-the-de
driver = webdriver.Chrome()
driver.implicitly_wait(15)

# set up pages to scrape
BASE_URL = "https://www.world-airport-codes.com/alphabetical/country-name/{letter}.html?page={page_n}"
starting_urls = [[BASE_URL.format(letter=letter, page_n=n) for n in range(1, 500)] for letter in ascii_lowercase]

all_tables = []
for group_of_urls in starting_urls:
    for url in group_of_urls:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html")
        tables = soup.find_all("table")
        if len(tables) == 0:
            break
        all_tables.append(tables)

In [215]:
# collect raw tables in dataframes for later processing
import pandas as pd

raw_dataframes = []
raw_urls = []

for table_group in all_tables:
    for table in table_group:
        text_data = [[cell.text for cell in row.find_all(["th","td"])]
                                for row in table.find_all("tr")]
        raw_dataframes.append(pd.DataFrame(text_data))

        url_data = [[row.text, row.get("href")] for row in table.find_all("a")]
        raw_urls.append(pd.DataFrame(url_data))

In [216]:
# appending all tables into one
airports = pd.concat(raw_dataframes, axis=0, ignore_index=True)
airport_urls = pd.concat(raw_urls, axis=0, ignore_index=True)

# naming fields
airports.columns = airports.iloc[0,:]
airport_urls.columns = ["Airport", "URL"]

# drop extra header rows
airports.drop(airports[airports["Airport"] == "Airport"].index, inplace=True)

# removing whitespace and column names from raw text
for column in airports.columns:
    airports[column] = airports[column].map(lambda text: text.replace(f"{column}:", "").strip())

# removing "Closed" tag from airport name
airports.loc[airports["Type"] == "Closed", "Airport"] = airports.loc[airports["Type"] == "Closed", "Airport"].str[:-7]

# joining in URLS
airports = pd.merge(airports, airport_urls, how="left", on="Airport")

In [233]:
# write to disk
airports.to_csv("airports.csv", index=False, encoding='utf-8-sig')

#### Scraping additional airport characteristics from unique webpages

In [None]:
# kill this below if mass scrape works

In [361]:
driver.get("https://www.world-airport-codes.com/afghanistan/andkhoi-74127.html")
soup = BeautifulSoup(driver.page_source, "html")
tt = soup.find_all("div", class_= "airport-basic-data")
tt

[<div class="airport-basic-data large-6 medium-6 columns">
 <div class="small-12 columns background-grey"><strong class="acronym-key" title="International Air Transport Association">IATA Code</strong><span></span></div><div class="small-12 columns"><strong class="acronym-key" title="International Civil Aviation Organisation">ICAO Code</strong><span>OAAK</span></div><div class="small-12 columns background-grey"><strong class="acronym-key" title="Federal Aviation Authority">FAA Code</strong><span></span></div><div class="small-12 columns"><strong class="" title="">Latitude</strong><span>36.9433222</span></div><div class="small-12 columns background-grey"><strong class="" title="">Longitude</strong><span>65.2069667</span></div><div class="small-12 columns"><strong class="" title="">Time Zone</strong><span>Asia/Samarkand (GMT +5:00)</span></div> <input id="idAirport" name="id" type="hidden" value="74127"/>
 <input id="ajaxurl" name="ajaxurl" type="hidden" value="https://www.world-airport-c

In [365]:
basic_info = soup.find_all("div", class_= "airport-basic-data")
basics = [[cell.text for cell in row.find_all(["strong", "span"])] 
                             for row in basic_info[0].find_all("div")]
basics

[['IATA Code', ''],
 ['ICAO Code', 'OAAK'],
 ['FAA Code', ''],
 ['Latitude', '36.9433222'],
 ['Longitude', '65.2069667'],
 ['Time Zone', 'Asia/Samarkand (GMT +5:00)']]
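
The label/value pairs above are all strings, so before they can be used for anything numeric they need a little cleaning. A minimal sketch, assuming empty strings should become `None` and that only Latitude and Longitude need converting to floats (`parse_basics` is a hypothetical helper, not part of the scraper below):

```python
def parse_basics(pairs):
    """Turn [label, value] pairs into a dict, mapping empty strings to None
    and converting Latitude/Longitude to floats."""
    record = {}
    for row in pairs:
        if len(row) != 2:
            continue  # skip rows that are not a simple label/value pair
        label, value = row[0].strip(), row[1].strip()
        record[label] = value or None
    for key in ("Latitude", "Longitude"):
        if record.get(key) is not None:
            record[key] = float(record[key])
    return record


# The pairs scraped for the example airport page above
basics = [['IATA Code', ''], ['ICAO Code', 'OAAK'], ['FAA Code', ''],
          ['Latitude', '36.9433222'], ['Longitude', '65.2069667'],
          ['Time Zone', 'Asia/Samarkand (GMT +5:00)']]
record = parse_basics(basics)
print(record["ICAO Code"], record["Latitude"], record["Longitude"])
```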

In [374]:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from collections import defaultdict

airport_full_urls_base = "https://www.world-airport-codes.com{}"
supplements = defaultdict(list)
combined = {}

# initialize selenium driver
driver = webdriver.Chrome()
wait = WebDriverWait(driver, 60)

# for each line in airports scrape supplementary data
for airport, url, country in airports[["Airport","URL","Country"]].values:
    airport_full_url = airport_full_urls_base.format(url)
    
    # go to new URL and download html after sufficient wait
    driver.get(airport_full_url)
    element = wait.until(EC.presence_of_element_located((By.ID, "metar-observations-title")))
    soup = BeautifulSoup(driver.page_source, "html")
    
    # instantiate and task scraper
    scraper = AirportScraper(soup, airport, country)
    raw_supplement = scraper.scrape()
  
    # add data to storage dict
    for k, v in raw_supplement.items():
        supplements[k].append(v)
        
driver.close()

KeyboardInterrupt: 

In [334]:
# consolidate separate dfs
for k, v in supplements.items():
    combined[k] = pd.concat(v, ignore_index=True)
    
    # rename columns
    if k != "Basics":
        combined[k].columns = list(combined[k].iloc[0,:-2].values)+["Source", "Country"]
    
    # filter out junk rows
    first_column = combined[k].columns[0]
    combined[k].drop(combined[k][combined[k][first_column] == first_column].index, inplace=True)

In [352]:
# write supplements to disk
for supp_type, data in combined.items():
    data.to_csv(f"{supp_type}.csv", index=False, encoding='utf-8-sig')

In [377]:
class AirportScraper: 
    def __init__(self, soup, airport, country):
        self.raw_supplements = {}
        self.soup = soup
        self.airport = airport
        self.country = country
        self.scraped = []
        
    def scrape(self):
        try:
            self._scrape_supplements()
        except Exception as e:
            print("ERROR:", e.__class__, "occurred.", self.airport, self.country)
            # chain the original exception so the real cause stays in the traceback
            raise RuntimeError("Failed to scrape supplementary tables") from e
        
        try:
            self._scrape_basic_info()
        except Exception as e:
            print("ERROR:", e.__class__, "occurred.", self.airport, self.country)
            raise RuntimeError("Failed to scrape basic airport info") from e
        
        if len(self.raw_supplements):
            self._label()
        
        return self.raw_supplements
    
    def _label(self):
        # adds in unique key as airport & country for later joins
        for supp_type, df in self.raw_supplements.items():
            df["Source"] = self.airport
            df["Country"] = self.country
    
    def _classify_supplement(self, series):
        """
        Classifies the supplementary table scraped from airport webpage

        Parameters
        ----------
        series : pandas.Series
            Column headers (first row) of the table to classify

        Returns
        -------
        classification : str or None
            Name of the matching table type, or None if no known layout matches

        """
        mapping = {
            "Frequency": ["Type", "Description", "Frequency (MHz)"],
            "Runway": ['Runway', 'Length (feet)', 'Width (feet)', 'Surface Type'],
            "Destinations": ['Destination', 'IATA', 'Airlines Flying Route'],
            "Helipads": ["Helipad", "Helipad Dimensions", "Helipad Surface Type"]
        }
        
        for k, v in mapping.items():
            if len(series) == len(v) and all(series == v):
                return k

        # no known layout matched: log the headers and airport for inspection
        print(series, self.airport)
        return None

    def _scrape_supplements(self):
        tbl = self.soup.find_all("table")

        for table in tbl:
            text_data = [[cell.text for cell in row.find_all(["th","td"])]
                                    for row in table.find_all("tr")]

            raw = pd.DataFrame(text_data)
            supp_type = self._classify_supplement(raw.iloc[0])
            if supp_type is None:
                # unrecognised table layout, already logged by the classifier
                continue

            self.raw_supplements[supp_type] = raw
            self.scraped.append(supp_type)

    def _scrape_basic_info(self):    
        basic_info = self.soup.find_all("div", class_= "airport-basic-data")
        
        if len(basic_info) == 0:
            print(basic_info)
            print(self.soup.find_all("div"))

        basics = [[cell.text for cell in row.find_all(["strong", "span"])] 
                             for row in basic_info[0].find_all("div")]

        basics_df = pd.DataFrame(basics, columns=["Metric", "Value"])

        self.raw_supplements["Basics"] = basics_df
        self.scraped.extend(list(basics_df["Metric"].values))

## Skyscanner flights

The purpose of this section is to explore the data structure of the Skyscanner API and create an ETL pipeline that will periodically (daily) add historical prices to an SQLite DB.

Taking the airports and routes gathered in the previous section, I now build up an updating view of the prices for those routes by airline. The value of the time series is in being able to look at how the time-till-flight (TTF) affects the price of the ticket.
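
For clarity on what TTF means here, a minimal sketch: I'm assuming it is measured in whole days between the moment a quote was captured and the departure date. The function name and the query timestamp are hypothetical.

```python
from datetime import date, datetime


def time_till_flight(quote_time, departure_date):
    """Whole days between when a quote was captured and the departure date."""
    return (departure_date - quote_time.date()).days


quote_time = datetime(2020, 9, 1, 8, 30)  # hypothetical query timestamp
departure = date(2020, 10, 1)             # example outbound date
ttf_days = time_till_flight(quote_time, departure)
print(ttf_days)
```

Storing the quote time alongside every price is what makes this calculation possible later, which is why it appears in the quote-level schema.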

In [None]:
# join into main table
# basics_t = combined["Basics"].pivot(index="Source", columns="Metric", values="Value")
# pd.merge(airports, basics_t, how="left", left_on="Airport", right_on="Source")
#runways & helipads?

# join source IATA code to Destinations for downstream use


In [375]:
routes = pd.read_csv("Destinations.csv")

In [376]:
routes

| Unnamed: 0 | Destination | IATA | Airlines Flying Route | Source |
|---|---|---|---|---|
| 0 | Indira Gandhi International | DEL | Safi Airlines | Herat |
| 1 | Benazir Bhutto International | | Safi Airlines | Herat |
| 2 | Kabul International | KBL | Safi Airlines, Ariana Afghan Airlines, Hankook... | Herat |
| 3 | Mashhad International | MHD | Iran Aseman Airlines | Herat |
| 4 | Indira Gandhi International | DEL | Safi Airlines, Air India Limited, Ariana Afgha... | Kabul International |
| ... | ... | ... | ... | ... |
| 467 | Brigadier Antonio Parodi | EQS | TRIP Linhas A | San Carlos De Bariloche |
| 468 | Jorge Newbery Airpark | AEP | Aerolineas Argentinas, LAN Airlines | San Carlos De Bariloche |
| 469 | Ingeniero Ambrosio Taravella | COR | Aerolineas Argentinas | San Carlos De Bariloche |
| 470 | Ministro Pistarini International | EZE | Aerolineas Argentinas | San Carlos De Bariloche |


Now that the routes have been structured for mass API querying, I iterate through the combinations to find the day's flight prices.

N.B. Automate this to happen every day at X time.
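
Before any API calls can be made, the routes table has to be turned into query parameters. A rough sketch of that step, under several assumptions: rows are plain dicts with missing IATA codes stored as empty strings, the source airport's code comes from a separate name-to-IATA lookup (hypothetical here, including the `HEA` value), and places are identified with the `-sky` suffix. `build_search_params` is a made-up helper name.

```python
from datetime import date


def build_search_params(routes, name_to_iata, outbound, inbound):
    """Build (origin, dest, date_outbound, date_inbound) tuples for routes
    where both ends have an IATA code; duplicates and gaps are skipped."""
    seen = set()
    params = []
    for route in routes:
        origin = name_to_iata.get(route["Source"])
        dest = route["IATA"]
        if not origin or not dest or (origin, dest) in seen:
            continue
        seen.add((origin, dest))
        params.append((f"{origin}-sky", f"{dest}-sky",
                       outbound.isoformat(), inbound.isoformat()))
    return params


routes = [
    {"Destination": "Indira Gandhi International", "IATA": "DEL", "Source": "Herat"},
    {"Destination": "Benazir Bhutto International", "IATA": "", "Source": "Herat"},
    {"Destination": "Indira Gandhi International", "IATA": "DEL", "Source": "Kabul International"},
]
name_to_iata = {"Herat": "HEA", "Kabul International": "KBL"}  # assumed lookup
search_params = build_search_params(routes, name_to_iata,
                                    date(2020, 10, 1), date(2020, 11, 1))
print(search_params)
```

Note that if the rows come straight from a pandas DataFrame, missing codes will be NaN rather than empty strings, so they would need filling or filtering first.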

In [40]:
def create_url(origin, dest, date_outbound, date_inbound, country, currency, locale):
    return f"https://skyscanner-skyscanner-flight-search-v1.p.rapidapi.com/apiservices/browsedates/v1.0/{country}/{currency}/{locale}/{origin}/{dest}/{date_outbound}"


def make_call(origin, dest, date_outbound, date_inbound, country = "UK", currency = "GBP", locale = "en-UK"):
    url = create_url(origin, dest, date_outbound, date_inbound, country, currency, locale)
    
    querystring = {"inboundpartialdate":date_inbound}

    headers = {
        'x-rapidapi-host': "skyscanner-skyscanner-flight-search-v1.p.rapidapi.com",
        'x-rapidapi-key': ""
        }

    response = requests.request("GET", url, headers=headers, params=querystring)
    response_json = json.loads(response.text)
    print(json.dumps(response_json, indent=2))
    return response_json

def flatten_json(quotes):
    # lift the nested OutboundLeg fields up into each quote
    for quote in quotes['Quotes']:
        quote.update(quote.pop('OutboundLeg'))
        
    # json to DataFrames
    # for each then need to enforce data types as well as rename any relevant columns 
    carriers = pd.DataFrame.from_dict(quotes['Carriers'])
    
    places = pd.DataFrame.from_dict(quotes['Places'])
            
    # separate name so the raw response dict is not overwritten
    quotes_df = pd.DataFrame.from_dict(quotes['Quotes'])
    
    # construct routes from unique quotes start/end destinations
    # placeholder: deduplicate on the origin/destination columns once confirmed
    routes = quotes_df.copy()

    return carriers, places, quotes_df, routes


In [11]:
# create SQLite DB and make schema

#!/usr/bin/python
# API Call key and meta data

import requests
import json
import numpy as np
import pandas as pd
import sqlite3


# # quotes = make_call("LHR-sky", "LAX- sky", "2020-10-01", "2020-11-01")
# df = pd.DataFrame.from_dict(quotes)
# head(df)


conn = sqlite3.connect('test.db')

field = ["Carrier_ID", "Carrier_Name", "Country"]

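# check whether the table already exists (result unused; CREATE TABLE IF NOT EXISTS below covers both cases)
conn.execute("SELECT name FROM sqlite_master WHERE type='table' AND name=?;", ("CARRIER",))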

conn.execute('''CREATE TABLE IF NOT EXISTS CARRIER
                (CARRIER_ID INT PRIMARY KEY     NOT NULL,
                 CARRIER_NAME           TEXT    NOT NULL,
                 COUNTRY                TEXT    NOT NULL);''')

# OR IGNORE keeps re-runs of this cell from failing on the primary key
conn.execute("INSERT OR IGNORE INTO CARRIER (CARRIER_ID,CARRIER_NAME,COUNTRY) \
      VALUES (2, 'Virgin Airways', 'UK')")

conn.commit()

cursor = conn.execute("SELECT carrier_id, carrier_name, country from CARRIER")
for row in cursor:
    for i, val in enumerate(row):
        print(f"{field[i]} = ", val)

conn.close()
# back up function that recreates the schema on deletion

# take json format and put into relational database

# append to SQLite database

Carrier_ID =  1
Carrier_Name =  British Airways
Country =  UK
Carrier_ID =  2
Carrier_Name =  Virgin Airways
Country =  UK


In [16]:
conn = sqlite3.connect('test.db')

# print out all tables in database
cursor = conn.execute("SELECT name FROM sqlite_master WHERE type='table';")
for row in cursor:
    print(row)

('CARRIER',)


### Data schema

    1. Carrier level
        - Carrier ID
        - Carrier Name
        - Carrier Parent?
        - Carrier Country
        - Other flags ...
        
    2. Route level (includes layovers)
        - Route ID
        - Start airport
        - End airport
        
    3. Airport level
        - Airport ID
        - Airport Name
        - Airport Country
        - Geo tag? lat and long?
            - possibly important if want to look at catchment analysis later on
           
    4. Quote level (route & carrier & time level)
        - Departure time
        - Landing time / duration??
        - Carrier ID
        - Route id
        - Price
        - Direct or not flag
        - Layover time?
        - Layover cities (how do I store this?)
        - Currency - GBP
        - Quote time - when was this queried
        - Sourcing
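
To check that the four levels above hang together, here is one possible way to express them as SQLite tables. This is a sketch only: the column names, types and constraints are my assumptions, the open questions (layovers, carrier parent, sourcing) are left out, and it runs against an in-memory database rather than `test.db`.

```python
import sqlite3

# Hypothetical DDL for the carrier, airport, route and quote levels
SCHEMA = """
CREATE TABLE IF NOT EXISTS CARRIER (
    CARRIER_ID   INTEGER PRIMARY KEY,
    CARRIER_NAME TEXT NOT NULL,
    COUNTRY      TEXT
);
CREATE TABLE IF NOT EXISTS AIRPORT (
    AIRPORT_ID   INTEGER PRIMARY KEY,
    AIRPORT_NAME TEXT NOT NULL,
    COUNTRY      TEXT,
    LATITUDE     REAL,
    LONGITUDE    REAL
);
CREATE TABLE IF NOT EXISTS ROUTE (
    ROUTE_ID         INTEGER PRIMARY KEY,
    START_AIRPORT_ID INTEGER NOT NULL REFERENCES AIRPORT(AIRPORT_ID),
    END_AIRPORT_ID   INTEGER NOT NULL REFERENCES AIRPORT(AIRPORT_ID),
    UNIQUE (START_AIRPORT_ID, END_AIRPORT_ID)
);
CREATE TABLE IF NOT EXISTS QUOTE (
    QUOTE_ID       INTEGER PRIMARY KEY,
    ROUTE_ID       INTEGER NOT NULL REFERENCES ROUTE(ROUTE_ID),
    CARRIER_ID     INTEGER NOT NULL REFERENCES CARRIER(CARRIER_ID),
    DEPARTURE_TIME TEXT NOT NULL,
    PRICE          REAL NOT NULL,
    CURRENCY       TEXT NOT NULL DEFAULT 'GBP',
    IS_DIRECT      INTEGER NOT NULL,
    QUOTE_TIME     TEXT NOT NULL
);
"""


def create_schema(conn):
    """Create all tables; safe to call repeatedly thanks to IF NOT EXISTS."""
    conn.executescript(SCHEMA)
    conn.commit()


conn = sqlite3.connect(":memory:")
create_schema(conn)
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;")]
print(tables)
conn.close()
```

Because every statement uses `IF NOT EXISTS`, the same function could double as the "back up function that recreates the schema on deletion" mentioned earlier.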
    

In [None]:
def main(): 
    # to get new quotes for each combination of date, origin, dest
    for origin, dest, date_outbound, date_inbound in search_params:
        quotes = make_call(origin, dest, date_outbound, date_inbound)
        
        quotes = prepare(quotes)
        connection = make_connection()
        update_database(connection, quotes)


        
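def prepare(quotes):
    # placeholder: clean and reshape the raw API response
    pass
        
def make_connection():
    # placeholder: open (or create) the SQLite database
    pass
        
def update_database(connection, json):
    '''
    Update SQLite tables with new json
    '''
    if not json:
        # add to data validation (date check?)
        print("No new data, exiting update")
        return
        
    updates = [update_carriers, update_routes, update_airports, update_quotes]
    for update in updates:
        print(update(connection, json))
        
        
def update_carriers(connection, json):
    # connect to table
    # insert
    # try catch?
    pass


# placeholders so update_database can run end to end; still to be written
def update_routes(connection, json):
    pass

def update_airports(connection, json):
    pass

def update_quotes(connection, json):
    pass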
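
One way the carrier update could eventually look is sketched below. It assumes each carrier in the payload is a dict with `CarrierId`, `Name` and `Country` keys (the key names and the "Unknown" default are my assumptions, not confirmed against the API) and uses `INSERT OR IGNORE` so that carriers already stored are skipped. The function is named `update_carriers_sketch` to keep it apart from the stub.

```python
import sqlite3


def update_carriers_sketch(connection, payload):
    """Insert carriers from payload['Carriers'], skipping IDs already stored.
    Returns the number of newly inserted rows."""
    rows = [(c["CarrierId"], c["Name"], c.get("Country", "Unknown"))
            for c in payload.get("Carriers", [])]
    count_sql = "SELECT COUNT(*) FROM CARRIER"
    before = connection.execute(count_sql).fetchone()[0]
    with connection:  # commits on success, rolls back on error
        connection.executemany(
            "INSERT OR IGNORE INTO CARRIER (CARRIER_ID, CARRIER_NAME, COUNTRY) "
            "VALUES (?, ?, ?)", rows)
    return connection.execute(count_sql).fetchone()[0] - before


conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE IF NOT EXISTS CARRIER
                (CARRIER_ID INT PRIMARY KEY NOT NULL,
                 CARRIER_NAME TEXT NOT NULL,
                 COUNTRY TEXT NOT NULL);""")
first = update_carriers_sketch(conn, {"Carriers": [
    {"CarrierId": 1, "Name": "British Airways", "Country": "UK"},
    {"CarrierId": 2, "Name": "Virgin Airways", "Country": "UK"}]})
second = update_carriers_sketch(conn, {"Carriers": [
    {"CarrierId": 2, "Name": "Virgin Airways", "Country": "UK"}]})
print(first, second)
conn.close()
```

Using parameter placeholders rather than string formatting also sidesteps quoting problems with carrier names that contain apostrophes.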