# Process blog post:

Like so many others out there, I've found that this quarantine has instilled a certain level of wanderlust in me. Before I commit to any destination, I want to know that I'm not getting scalped on the airfare. To put my (and possibly your) mind at ease, I've looked at historical airfare prices across X popular routes to get a view of the price fluctuations you can expect when travelling.

## Table of Contents:
0. Exec summary
1. Data collection
    - Schema and collection method
    - Sources
        - Airports & Routes
        - Flights & Prices
2. Data preparation and feature engineering
    - TBC
3. Model generation
4. Model validation
5. Output and visualizations
6. Extensions

## To do:
Updates for streamlining:
- move API key to external gitignored file
- set up gitignore
- consolidate imports at beginning
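
For the first item, a minimal sketch of keeping the key out of the notebook. The file name `secrets.json`, the `api_key` field and the `SKYSCANNER_API_KEY` environment variable are all placeholder names I've made up for illustration; the file would need to be listed in `.gitignore`.

```python
import json
import os
from pathlib import Path


def load_api_key(path="secrets.json", env_var="SKYSCANNER_API_KEY"):
    """Return the API key from the environment, falling back to a local JSON file.

    Both `path` and `env_var` are hypothetical names, not part of any real API.
    """
    key = os.environ.get(env_var)
    if key:
        return key

    secrets_file = Path(path)
    if not secrets_file.exists():
        raise FileNotFoundError(f"Set {env_var} or create {path}")

    # expected file content: {"api_key": "..."}
    with secrets_file.open() as f:
        return json.load(f)["api_key"]
```

The notebook would then call `load_api_key()` once near the imports instead of holding the key in a cell.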

## Data Collection
### Airports and Routes:
Scraping a complete list of airports with additional characteristics, including longitude and latitude for catchment analysis.

Source: https://www.world-airport-codes.com/alphabetical/country-name/a.html?page=30

Additional things that might be interesting to do:
- Run the airport scrape on a regular schedule and keep a historical view of closures, openings and other changes (e.g. a new runway)
- Add on airport volumes from other data sources

#### Process:
---------------------
To query the population of relevant routes from Skyscanner, I first needed to collect a comprehensive (or as close to comprehensive as possible) list of airports.
1. Find all airport URLs
    - For each letter in alphabet:
        - For each page:
            - For each linked airport
                - Scrape airport hyperlink and associated data
                
2. Collect information from the separate airport URL pages
    - For each airport URL:
        - Collect relevant airport specific info
        - Add to data store
        
3. Check data is consistent between the listing pages and the airport-specific URLs

4. Summary statistics

----------------------
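
Step 3 could be approached along these lines. This is only a sketch under my own assumptions: it works on plain lists of records (for example the output of `DataFrame.to_dict("records")`), and it assumes `Airport` plus `Country` identifies an airport uniquely, which may not hold for every row. `find_mismatches` is a hypothetical helper name.

```python
def find_mismatches(listing, detail, key=("Airport", "Country"), field="City"):
    """Compare `field` between two lists of records that share `key`.

    Returns a dict mapping each problematic key to a
    (listing_value, detail_value) pair. A side that lacks the key
    is reported as None.
    """
    def index(records):
        # later duplicates of the same key overwrite earlier ones
        return {tuple(r[k] for k in key): r.get(field) for r in records}

    left, right = index(listing), index(detail)
    problems = {}
    for k in sorted(left.keys() | right.keys()):
        if left.get(k) != right.get(k):
            problems[k] = (left.get(k), right.get(k))
    return problems


# toy records, not scraped data
listing = [
    {"Airport": "A", "Country": "X", "City": "p"},
    {"Airport": "B", "Country": "X", "City": "q"},
]
detail = [
    {"Airport": "A", "Country": "X", "City": "p"},
    {"Airport": "B", "Country": "X", "City": "z"},
]
print(find_mismatches(listing, detail))  # {('B', 'X'): ('q', 'z')}
```
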
The set of all possible routes then becomes the 2-combinations of the n airports in the collected list, i.e. n(n-1)/2 unordered pairs.

N.B. It would be interesting to look at the number of routes each airport is directly connected to, and how the average varies by country, airport size, etc.
N.B. It would also be interesting to see the number of actual routes as a % of all possible routes. I'd imagine <1%, given that the number of pairs grows quadratically with the number of airports.
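
To make the size of the candidate route set concrete, here is a small sketch using the standard library. The four IATA codes and the figure of 10,000 airports are illustrative placeholders, not values from the scrape.

```python
from itertools import combinations
from math import comb

airports = ["LHR", "JFK", "SIN", "SYD"]  # illustrative IATA codes, not scraped data

# every unordered pair of distinct airports is one candidate route
candidate_routes = list(combinations(airports, 2))
assert len(candidate_routes) == comb(len(airports), 2)  # n * (n - 1) / 2

print(candidate_routes[:3])  # [('LHR', 'JFK'), ('LHR', 'SIN'), ('LHR', 'SYD')]
print(comb(10_000, 2))       # 49995000 pairs for a hypothetical 10,000 airports
```

`combinations` treats A to B and B to A as the same route. If direction matters for pricing, `itertools.permutations(airports, 2)` would give the ordered pairs instead.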

#### Airports List

In [103]:
# Find the URLs of each airport on world-airport-codes.com
from string import ascii_lowercase
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException
from collections import defaultdict
import pandas as pd
from datetime import datetime
import sys

In [None]:
# download chromedriver 
# allow use with - https://stackoverflow.com/questions/60362018/macos-catalinav-10-15-3-error-chromedriver-cannot-be-opened-because-the-de
driver = webdriver.Chrome()
driver.implicitly_wait(15)

# set up pages to scrape
BASE_URL = "https://www.world-airport-codes.com/alphabetical/country-name/{letter}.html?page={page_n}"
starting_urls = [[BASE_URL.format(letter=letter, page_n=n) for n in range(1, 500)] for letter in ascii_lowercase]

all_tables = []
for group_of_urls in starting_urls:
    for url in group_of_urls:
        driver.get(url)
        soup = BeautifulSoup(driver.page_source, "html")
        tables = soup.find_all("table")
        if len(tables) == 0:
            break
        all_tables.append(tables)
        

In [4]:
# collect raw tables in dataframes for later processing
raw_dataframes = []

for table_group in all_tables:
    for table in table_group:
        # cell text for every row, followed by the href of any links in that row
        text_data = [
            [cell.text for cell in row.find_all(["th", "td"])]
            + [cell.get("href") for cell in row.find_all("a")]
            for row in table.find_all("tr")
        ]
        raw_dataframes.append(pd.DataFrame(text_data))
        
        

In [5]:
# appending all tables into one
airports = pd.concat(raw_dataframes, axis=0, ignore_index=True)

# naming fields
airports.columns = list(airports.iloc[0,:-1].values) + ["URL"]

# drop extra header rows
airports.drop(airports[airports["Airport"] == "Airport"].index, inplace=True)

# removing whitespace and column names from raw text
for column in airports.columns:
    airports[column] = airports[column].map(lambda text: text.replace(f"{column}:", "").strip())

# removing "Closed" tag from airport name
airports.loc[airports["Type"] == "Closed", "Airport"] = airports.loc[airports["Type"] == "Closed", "Airport"].str[:-7]



In [6]:
# write to disk
airports.to_csv("airports.csv", index=False, encoding='utf-8-sig')

#### Scraping additional airport characteristics from unique webpages

In [101]:
# base URL
airport_full_urls_base = "https://www.world-airport-codes.com"

# filter out dummy urls and get inputs
airports = pd.read_csv("airports.csv")
airports["URL"] = airport_full_urls_base + airports["URL"]
to_search =  airports.loc[airports["URL"].notna(), ["Airport","URL","Country", "City"]]
urls = list(to_search["URL"].values)
tags = to_search[["Airport","Country", "City"]].values

In [None]:
# OOP version of individual page scraping
# (run the cell defining ScrapeManager and AirportScraper first)
start = 0

manager = ScrapeManager(urls, tags, start)
supplements = manager.scrape()
manager.close()

In [130]:
class ScrapeManager:
    def __init__(self, urls, tags, start_i=0):
        self._start_new_driver()
        self.final_i = len(urls)
        self.start_i = start_i
        self.urls = urls
        self.tags = tags
        self.supplements = defaultdict(list)
        
    def scrape(self):
        i = self.start_i
        while i < self.final_i:
            try:                
                # get page source of new webpage
                self.driver.get(self.urls[i])
                # block until the METAR section has loaded, i.e. the page has rendered
                self.wait.until(EC.presence_of_element_located((By.ID, "metar-observations-title")))
                soup = BeautifulSoup(self.driver.page_source, "html")

                # instantiate and task scraper
                scraper = AirportScraper(soup, *self.tags[i])
                raw_supplement = scraper.scrape()
                
                # add data to storage dict
                for k, v in raw_supplement.items():
                    self.supplements[k].append(v)
                    
            except TimeoutException:
                # restart driver and try again
                print(f"Timed out on {i}")
                self._start_new_driver()
                i = i - 1
                
            except WebDriverException:
                # restart driver and try again
                print(f"Webdriver failed on {i}")
                self._start_new_driver()
                i = i - 1
                
            except KeyboardInterrupt:
                print(f"Latest i = {i-1}: {self.urls[i-1]}")
                print(f"Restart with i = {i}: {self.urls[i]}\n\n")
                self._backup_progress() # save to csv
                sys.exit()
                
            i += 1

        return self._backup_progress()
    
    def _backup_progress(self):
        # consolidate separate dfs then save down
        combined = {}

        for k, v in self.supplements.items():
            combined[k] = pd.concat(v, ignore_index=True)

            # rename columns
            if k != "Basics":
                combined[k].columns = list(combined[k].iloc[0,:-3].values)+["Airport", "Country", "City"]

            # filter out junk rows
            first_column = combined[k].columns[0]
            combined[k].drop(combined[k][combined[k][first_column] == first_column].index, inplace=True)
            
        # write supplements to disk
        timecode = datetime.now().strftime("%Y%m%d_%H%M%S")
        for supp_type, data in combined.items():
            data.to_csv(f"discrete/data/{supp_type.lower()}_{timecode}.csv", index=False, encoding='utf-8-sig')
            
        return combined
            
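    def _start_new_driver(self):
        # quit any previous browser so restarts don't leak Chrome processes
        old_driver = getattr(self, "driver", None)
        if old_driver is not None:
            try:
                old_driver.quit()
            except WebDriverException:
                pass  # the old session may already be dead
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, 60)
        
    def close(self):
        # quit() ends the whole session; close() only closes the current window
        self.driver.quit()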


class AirportScraper: 
    def __init__(self, soup, airport, country, city):
        self.raw_supplements = {}
        self.soup = soup
        self.airport = airport # can refactor this into a dict
        self.country = country
        self.city = city
        self.scraped = []
        
    def scrape(self):
        try:
            self._scrape_supplements()
        except Exception as e:
            print("ERROR:", e.__class__, "occurred.", self.airport, self.country)
            raise RuntimeError("Failed to scrape supplementary tables") from e
        
        try:
            self._scrape_basic_info()
        except Exception as e:
            print("ERROR:", e.__class__, "occurred.", self.airport, self.country)
            raise RuntimeError("Failed to scrape basic airport info") from e
        
        if len(self.raw_supplements):
            self._label()
        
        return self.raw_supplements
    
    def _label(self):
        # adds in airport, country and city as a key for later joins
        for supp_type, df in self.raw_supplements.items():
            df["Airport"] = self.airport
            df["Country"] = self.country
            df["City"] = self.city
    
    def _classify_supplement(self, series):
        """
        Classifies a supplementary table scraped from an airport webpage

        Parameters
        ----------
        series : pandas.Series
            Column headers of the table to classify

        Returns
        -------
        classification : str or None
            Key of the matching entry in `mapping`, or None if no entry matches

        """
        mapping = {
            "Frequency": ["Type", "Description", "Frequency (MHz)"],
            "Runway": ['Runway', 'Length (feet)', 'Width (feet)', 'Surface Type'],
            "Destinations": ['Destination', 'IATA', 'Airlines Flying Route'],
            "Helipads": ["Helipad", "Helipad Dimensions", "Helipad Surface Type"]
        }
        
        for k, v in mapping.items():
            if len(series) == len(v) and all(series == v):
                return k

        # no known table layout matched: log the headers for debugging
        print(series, self.airport)
        return None

    def _scrape_supplements(self):
        tbl = self.soup.find_all("table")

        for table in tbl:
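            text_data = [[cell.text for cell in row.find_all(["th", "td"])]
                         for row in table.find_all("tr")]

            raw = pd.DataFrame(text_data)
            supp_type = self._classify_supplement(raw.iloc[0])

            # skip tables that could not be classified, otherwise a None key
            # ends up in raw_supplements and breaks the later save to csv
            if supp_type is None:
                continue

            self.raw_supplements[supp_type] = raw
            self.scraped.append(supp_type)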

    def _scrape_basic_info(self):    
        basic_info = self.soup.find_all("div", class_= "airport-basic-data")
        
        if len(basic_info) == 0:
            print(basic_info)
            print(self.soup.find_all("div"))

        basics = [[cell.text for cell in row.find_all(["strong", "span"])] 
                             for row in basic_info[0].find_all("div")]

        basics_df = pd.DataFrame(basics, columns=["Metric", "Value"])

        self.raw_supplements["Basics"] = basics_df
        self.scraped.extend(list(basics_df["Metric"].values))
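
One thing to watch in `ScrapeManager.scrape`: on a timeout or webdriver failure it retries the same index with no upper bound, so a permanently broken page would loop forever. Below is a minimal sketch of a bounded retry. It is not the code used above; `scrape_with_retries`, `fetch`, `max_retries` and `on_failure` are hypothetical names, and `fetch` stands in for the "load page and parse" step.

```python
def scrape_with_retries(urls, fetch, max_retries=3, on_failure=None):
    """Call fetch(url) for each URL, retrying a failing URL at most max_retries times."""
    results, skipped = [], []
    for url in urls:
        for attempt in range(1, max_retries + 1):
            try:
                results.append(fetch(url))
                break  # success: move on to the next URL
            except Exception as exc:  # in the scraper: TimeoutException, WebDriverException
                print(f"Attempt {attempt} failed on {url}: {exc!r}")
                if on_failure is not None:
                    on_failure()  # e.g. restart the driver
        else:
            # the inner loop never hit `break`, so every attempt failed
            skipped.append(url)
    return results, skipped


# usage with a stand-in fetch function that always fails for one URL
def fake_fetch(url):
    if url == "bad":
        raise RuntimeError("page never loads")
    return url.upper()


results, skipped = scrape_with_retries(["a", "bad", "b"], fake_fetch, max_retries=2)
print(results, skipped)  # ['A', 'B'] ['bad']
```

Returning the skipped URLs separately makes it possible to re-run just those later, rather than restarting from a saved index.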
        
        