# International Airfare Pricing and FX Arbitrage

### Description
An analysis of international airfare prices from the U.S. to various destinations in Europe. Data is collected by web scraping with BeautifulSoup. The data and analysis will be used to build a predictive logistic model that assists with airfare purchasing decisions.

### Acknowledgements
Courtesy of Norwegian Airlines

***
### Setup: Load Packages

In [2]:
import requests
from bs4 import BeautifulSoup

import urllib.parse
from urllib.parse import urlparse

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.pyplot import figure
from collections import OrderedDict

import time
import datetime
import pytz
import re

***

### Setting Airports and Currency Options
Before extracting data, let's define the date range for flights, along with the airports and currencies we are interested in.

##### AIRPORTS
**US**: Los Angeles (LAX) | Oakland (OAK) | San Francisco (SFO) | New York-JFK (JFK) | New Jersey-Newark (EWR)  
**EU**: Copenhagen (CPH) | Stockholm, Sweden (ARN) | Paris (CDG) | London-Gatwick (LGW) | Amsterdam (AMS)

In [51]:
# Create lists of origin and destination airports
orig_port = ['LAX','OAK','SFO','JFK','EWR']
dest_port = ['CPH','ARN','CDG','LGW','AMS']

##### CURRENCIES
US Dollar (USD), Euro (EUR), British Pound (GBP), Swedish Krona (SEK)

In [32]:
# Currency list
curr = ['USD','EUR','GBP','SEK']
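
The cells below pick one origin, destination, and currency by index. To compare the same route across currencies, the three lists could first be expanded into every combination. A minimal sketch, assuming the lists defined above; `combos` is a name introduced here for illustration, and the notebook itself does not loop over it:

```python
from itertools import product

orig_port = ['LAX', 'OAK', 'SFO', 'JFK', 'EWR']
dest_port = ['CPH', 'ARN', 'CDG', 'LGW', 'AMS']
curr = ['USD', 'EUR', 'GBP', 'SEK']

# Every (origin, destination, currency) combination, in list order
combos = list(product(orig_port, dest_port, curr))

print(len(combos))  # 5 origins * 5 destinations * 4 currencies = 100
print(combos[0])    # ('LAX', 'CPH', 'USD')
```

With a 10-second pause per request, a full sweep over all combinations and dates would take a long time, which may be why a single selection is used here.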

***

### Select Airports, Currency and Dates

In [33]:
# Set input indices
orig_select = 0
dest_select = 1
curr_select = 0

# Selected inputs are:
print('fly from:', orig_port[orig_select])
print('fly   to:', dest_port[dest_select])
print('currency:', curr[curr_select])

fly from: LAX
fly   to: ARN
currency: USD


##### DATES & TIMES

In [183]:
# Specify time right now
now = datetime.datetime.now(pytz.timezone('US/Pacific'))

# Create list of dates beginning from now.
# A full run covers six months (180 days); this run is limited to 10 days.
#dates = pd.date_range(now, periods=180).tolist()
dates = pd.date_range(now, periods=10).tolist()
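
Each entry in `dates` supplies the day and year-month fields of the availability request made by the scraper. One possible way to assemble that URL is `urllib.parse.urlencode`, which keeps the parameters readable. This is only a sketch: `build_url` and `BASE_URL` are hypothetical names, the parameter values are copied from the URL string used in the scraper, and a plain `datetime.date` stands in for the pandas timestamps (anything with a `strftime` method should work).

```python
import datetime
from urllib.parse import urlencode

# Endpoint taken from the scraper's URL string
BASE_URL = 'https://www.norwegian.com/us/ipc/availability/avaday'

def build_url(origin, destination, date, currency):
    """Hypothetical helper: build the availability URL for one departure date."""
    params = {
        'D_City': origin,
        'A_City': destination,
        'TripType': 1,
        'D_Day': date.strftime('%d'),      # two-digit day
        'D_Month': date.strftime('%Y%m'),  # year and month, e.g. 201905
        'AgreementCodeFK': -1,
        'CurrencyCode': currency,
        'rnd': 9038,           # copied verbatim from the scraper
        'processid': 33487,    # copied verbatim from the scraper
        'mode': 'ab',
    }
    return BASE_URL + '?' + urlencode(params)

url = build_url('LAX', 'ARN', datetime.date(2019, 5, 23), 'USD')
print(url)
```

Keeping the parameters in a dictionary makes it easier to see which fields change per request (city codes, date, currency) and which stay fixed.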

***

### Extract Data: Run Webscraper

In [184]:
# Create 'master' dataframe object
tix_data = pd.DataFrame(columns = [])

# RUN SCRAPER
for x in range(0,len(dates),1):
    
    #--------------------------
    # SCRAPING SITE FOR DATA
    
    print(x)
    print('Sleeping...')
    time.sleep(10)
    
    print('Retreiving...')
    # Request URL
    url = str('https://www.norwegian.com/us/ipc/availability/avaday?D_City=%s&A_City=%s&TripType=1&D_Day=%s&D_Month=%s%s&AgreementCodeFK=-1&CurrencyCode=%s&rnd=9038&processid=33487&mode=ab' 
          % (orig_port[orig_select],dest_port[dest_select],
             dates[x].strftime('%d'),dates[x].strftime("%Y"),dates[x].strftime("%m"),
             curr[curr_select]))
    
    # Request data
    tix_raw = requests.get(url).text
    
    # Turn into soup
    tix_soup = BeautifulSoup(tix_raw,'html.parser')

    #--------------------------
    # EXTRACT FIELDS...

    # Duration
    duration_all = [i.text for i in tix_soup.find_all('td', class_="duration")]
    # NOTE: The site structure changes if there are fewer than 3 flights in a day...
    remove = ['Direct','stop']
    duration_filtered = [i for i in duration_all if not any(word in i.split(' ') for word in remove)]
    # Filter out flight duration
    duration = [i.split(': ', 1)[1] for i in duration_filtered]
    
    # NOTE: TO ADDRESS SITE STRUCTURE CHANGES!
    # Stops Alternative
    stops_alt = [i for i in duration_all if any(word in i.split(' ') for word in remove)]
    
    # Calculate hrs and min (named `mins` to avoid shadowing the built-in min())
    hrs = [i.split(' ', 1)[0] for i in duration]
    mins = [i.split(' ', 1)[1] for i in duration]

    hrs = [float(i.split('h', 1)[0]) for i in hrs]
    mins = [float(i.split('m', 1)[0]) for i in mins]
    
    #******[ADD TO DATAFRAME]******
    duration_total_min = np.add([h*60 for h in hrs], mins)
    
    #--------------------------
    # Total number of flights this day
    total_num_flights = len(duration)
    
    #--------------------------
    # Departure Info
    depart_info = [i.text for i in tix_soup.find_all('td', class_="depdest")]
    
    #******[ADD TO DATAFRAME]******
    if len(stops_alt) > 0:
        depart_time = depart_info[0::2]
    else:
        depart_time = depart_info[0::3]
        #dest_port_name = depart_info[1::3]
    
    #--------------------------
    # Arrival Info
    arrive_info = [i.text for i in tix_soup.find_all('td', class_="arrdest")]
   
    #******[ADD TO DATAFRAME]******
    arrive_time = arrive_info[0::2]
    #orig_port_name = arrive_info[1::2]
    
    #--------------------------    
    # Number of stops for each flight
    # If there is a webpage structure change, use alternative stops source
    if len(stops_alt) > 0:
        stops = stops_alt[:]
    else:    
        # Otherwise, use the usual   
        stops = depart_info[2::3]
        
    stops = [i.split(' ', 1)[0] for i in stops]
    stops = [0 if i=='Direct' else i for i in stops]
    
    #******[ADD TO DATAFRAME]******
    stops = [int(i) for i in stops]

    #--------------------------    
    # Stop details
    stops_info = [i.text for i in tix_soup.find_all('li', class_="tooltipclick TooltipBoxTransit")] 

    stops_time_temp = [i.split(')', 1)[0] for i in stops_info]
    stops_time_temp = [i.split('(', 1)[1] for i in stops_time_temp]
    
    stops_loc_temp = [i.split('in ', 1)[1] for i in stops_info]
    
    #******[ADD TO DATAFRAME]******
    stops_time = []
    stops_loc = []
    
    for i in range(0, len(stops),1):
        if stops[i] == 0:
            stops_time.append(np.nan)
            stops_loc.append(np.nan)
        else:
            stops_time.append(stops_time_temp[0])
            stops_time_temp.pop(0)
            
            stops_loc.append(stops_loc_temp[0])
            stops_loc_temp.pop(0)

    
    #--------------------------    
    # Prices
    prices_all = [i.text for i in tix_soup.find_all('td', class_=re.compile('.*fare.*'))]
    
    # NOTE: TO ADDRESS STRUCTURE CHANGE
    # Identify if premium flights are available
    prem_avail = int('Premium' in prices_all)
    
    # Remove non-price elements
    remove = ['','\xa0','LowFare','LowFare+','Premium','Only']
    prices_all_filtered = [i for i in prices_all if not any(word in i.split(' ') for word in remove)]
    
    # Sanity check
    if len(prices_all_filtered)%total_num_flights == 0:
        print('All prices successfully extracted')
    else:
        print('Error: incorrect number of prices')
        break
    
    # Clean up data
    prices_all_filtered = pd.Series(prices_all_filtered).replace('-', np.nan)
    prices_all_filtered = pd.Series(prices_all_filtered).replace('Sold out', 0)
    prices_all_filtered = pd.Series(prices_all_filtered).replace(',', '', regex=True).astype(float)
    
    if prem_avail > 0:
        #******[ADD TO DATAFRAME]******
        prices_lowfare = prices_all_filtered[0::5]
        prices_lowfareplus = prices_all_filtered[1::5]
        prices_flex = prices_all_filtered[2::5]
        prices_prem = prices_all_filtered[3::5]
        prices_premflex = prices_all_filtered[4::5]
    else:
        prices_lowfare = prices_all_filtered[0::3]
        prices_lowfareplus = prices_all_filtered[1::3]
        prices_flex = prices_all_filtered[2::3]
        prices_prem = [np.nan] * total_num_flights
        prices_premflex = [np.nan] * total_num_flights
    
    #--------------------------
    # CREATE OTHER DATA FRAME VECTORS
    
    # Flight dates
    depart_date_comb = [dates[x].strftime('%Y-%m-%d')] * total_num_flights
    #depart_date_yr = [dates[x].strftime('%Y')] * total_num_flights
    #depart_date_mo = [dates[x].strftime('%m')] * total_num_flights
    #depart_date_day = [dates[x].strftime('%d')] * total_num_flights
    
    # Flight airport codes
    orig_port_code = [orig_port[orig_select]] * total_num_flights
    dest_port_code = [dest_port[dest_select]] * total_num_flights
    
    # Data Extracted Timestamp
    data_extract_time = [now.strftime('%y-%m-%d %H:%M:%S')] * total_num_flights

    #--------------------------
    # COMBINE ALL RELEVANT VECTORS INTO DATA FRAME

    # Create tempdata
    tix_tempdata = pd.concat([pd.Series(data_extract_time, name = 'data_extract_time'),
                              pd.Series(orig_port_code, name = 'orig_port_code'), 
                              pd.Series(dest_port_code, name = 'dest_port_code'),
                              #pd.Series(orig_port_name, name = 'orig_port_name'), 
                              #pd.Series(dest_port_name, name = 'dest_port_name'), 
                              pd.Series(depart_date_comb, name = 'depart_date'),
                              #pd.Series(depart_date_yr, name = 'depart_date_yr'), 
                              #pd.Series(depart_date_mo, name = 'depart_date_mo'), 
                              #pd.Series(depart_date_day, name = 'depart_date_day'),
                              pd.Series(depart_time, name = 'depart_time'), 
                              pd.Series(arrive_time, name = 'arrive_time'), 
                              pd.Series(duration_total_min, name = 'duration_total_min'), 
                              pd.Series(stops, name = 'stops'), 
                              pd.Series(stops_loc, name = 'stops_loc'), 
                              pd.Series(stops_time, name = 'stops_time'),
                              pd.Series(prices_lowfare, name = 'prices_lowfare').reset_index(drop=True),
                              pd.Series(prices_lowfareplus, name = 'prices_lowfareplus').reset_index(drop=True),
                              pd.Series(prices_flex, name = 'prices_flex').reset_index(drop=True),
                              pd.Series(prices_prem, name = 'prices_prem').reset_index(drop=True),
                              pd.Series(prices_premflex, name = 'prices_premflex').reset_index(drop=True)
                             ], 
                             axis = 1 
                            )
    
    
    #--------------------------    
    # Flight ids   
    # Find all available flight numbers and clean up
    id_all = list(OrderedDict.fromkeys(
        [i for i in str(tix_soup.find_all('input', class_="radio-ajax")).split("|") if i.startswith('D')])
                          )
    
    # Separate first and second leg ids (if applicable)
    id_leg1_temp = [value[:6].upper() for value in id_all]
    id_leg1_temp = list(pd.Series(id_leg1_temp).replace('', np.nan))

    id_leg2_temp = [value[12:len(value)-6].upper() for value in id_all]
    id_leg2_temp = list(pd.Series(id_leg2_temp).replace('', np.nan))


    # Create final leg id vectors taking into account any sold-out flights (no flight ids available)
    #******[ADD TO DATAFRAME]******
    id_leg1 = []
    id_leg2 = []

    for i in range(0,total_num_flights,1):
        if sum(tix_tempdata.iloc[i,10:15].dropna()) == 0:
            id_leg1.append(np.nan)
            id_leg2.append(np.nan)
        else:
            id_leg1.append(id_leg1_temp[0])
            id_leg1_temp.pop(0)

            id_leg2.append(id_leg2_temp[0])
            id_leg2_temp.pop(0)

    # Clean up temporary vectors
    del id_leg1_temp, id_leg2_temp

    # Add flight ids to tempdata
    tix_tempdata.insert(1, 'id_leg1', pd.Series(id_leg1))
    tix_tempdata.insert(9, 'id_leg2', pd.Series(id_leg2))
    
    
    #--------------------------    
    # Concatenate to master
    tix_data = pd.concat([tix_data, tix_tempdata])
    print('Data saved!')

0
Sleeping...
Retreiving...
All prices successfully extracted
Data saved!
1
Sleeping...
Retreiving...
All prices successfully extracted
Data saved!
2
Sleeping...
Retreiving...
All prices successfully extracted
Data saved!
3
Sleeping...
Retreiving...
All prices successfully extracted
Data saved!
4
Sleeping...
Retreiving...
All prices successfully extracted
Data saved!
5
Sleeping...
Retreiving...
All prices successfully extracted
Data saved!
6
Sleeping...
Retreiving...
All prices successfully extracted
Data saved!
7
Sleeping...
Retreiving...
All prices successfully extracted
Data saved!
8
Sleeping...
Retreiving...
All prices successfully extracted
Data saved!
9
Sleeping...
Retreiving...
All prices successfully extracted
Data saved!
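
The price handling inside the loop does two things: it maps placeholder cells to numbers, then slices the flat list into fare classes by stride. The same logic can be sketched as two small pure-Python functions. `clean_price` and `split_fare_classes` are illustrative names, the sample cell strings are assumed rather than taken from the site, and the length check here tests against the number of fare classes instead of the number of flights.

```python
import math

def clean_price(text):
    """Convert one scraped fare cell to a float.

    '-'        -> NaN (no fare shown for this class)
    'Sold out' -> 0.0 (the convention used in the scraper)
    '1,499.30' -> 1499.3
    """
    text = text.strip()
    if text == '-':
        return math.nan
    if text == 'Sold out':
        return 0.0
    return float(text.replace(',', ''))

def split_fare_classes(prices, classes):
    """Group a flat, flight-by-flight price list into one list per fare class."""
    n = len(classes)
    if len(prices) % n != 0:
        raise ValueError('price count is not a multiple of the number of fare classes')
    # Stride slicing: every n-th value, starting at the class's position
    return {name: prices[i::n] for i, name in enumerate(classes)}

# Assumed sample: two flights, three fare classes each
raw = ['1,499.30', '1,589.30', '1,657.80', '-', 'Sold out', '999.90']
fares = split_fare_classes([clean_price(p) for p in raw],
                           ['lowfare', 'lowfareplus', 'flex'])
print(fares['flex'])  # [1657.8, 999.9]
```

Raising an error instead of breaking out of the loop is a design choice for the sketch; either approach stops the run when the page layout no longer matches the expected fare columns.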


In [185]:
tix_data

| | data_extract_time | id_leg1 | orig_port_code | dest_port_code | depart_date | depart_time | arrive_time | duration_total_min | stops | id_leg2 | stops_loc | stops_time | prices_lowfare | prices_lowfareplus | prices_flex | prices_prem | prices_premflex |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19-05-23 13:05:17 | | LAX | ARN | 2019-05-23 | 15:45 | 11:20 +1 | 635.0 | 0 | | | | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 19-05-23 13:05:17 | | LAX | ARN | 2019-05-23 | 19:30 | 20:20 +1 | 950.0 | 1 | | London | 3h 10m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 19-05-23 13:05:17 | | LAX | ARN | 2019-05-23 | 19:30 | 23:15 +1 | 1125.0 | 1 | | London | 6h 5m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 19-05-23 13:05:17 | | LAX | ARN | 2019-05-23 | 20:00 | 00:40 +2 | 1180.0 | 1 | | Paris | 6h 30m | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 19-05-23 13:05:17 | DY7108 | LAX | ARN | 2019-05-23 | 16:35 | 22:55 +1 | 1280.0 | 1 | D85522 | Barcelona | 5h 0m | 1499.3 | 1589.3 | 1657.8 | 1759.3 | 1917.8 |
| 5 | 19-05-23 13:05:17 | DY7108 | LAX | ARN | 2019-05-23 | 16:35 | 23:45 +1 | 1330.0 | 1 | DY4254 | Barcelona | 7h 40m | 1522.0 | 1612.0 | 1657.8 | 1782.0 | 1917.8 |
| 0 | 19-05-23 13:05:17 | DI7096 | LAX | ARN | 2019-05-24 | 19:30 | 20:20 +1 | 950.0 | 1 | D82856 | London | 3h 10m | | | 1664.1 | | |
| 0 | 19-05-23 13:05:17 | DY7088 | LAX | ARN | 2019-05-25 | 18:00 | 13:35 +1 | 635.0 | 0 | | | | | | 999.9 | | 1189.9 |
| 1 | 19-05-23 13:05:17 | DI7096 | LAX | ARN | 2019-05-25 | 19:30 | 19:45 +1 | 915.0 | 1 | DY4456 | London | 2h 35m | | | 1664.1 | | |
| 2 | 19-05-23 13:05:17 | DY7108 | LAX | ARN | 2019-05-25 | 18:25 | 20:50 +1 | 1045.0 | 1 | D85552 | Barcelona | 2h 55m | 1115.8 | 1205.8 | 1465.8 | 1665.8 | 1665.8 |
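
In the output, `stops_time` is still stored as text such as `3h 10m`, while flight duration has already been converted to minutes. A small regex helper could bring layovers to the same unit. This is a sketch, not part of the notebook: `to_minutes` is a hypothetical name, and it assumes every value follows the `<hours>h <minutes>m` pattern seen in the table.

```python
import re

# Matches strings like '3h 10m' or '6h 5m'
_DURATION_RE = re.compile(r'^\s*(\d+)h\s+(\d+)m\s*$')

def to_minutes(text):
    """Convert a string like '3h 10m' to total minutes.

    Returns None for missing values (such as NaN) or text that does not match.
    """
    if not isinstance(text, str):
        return None
    match = _DURATION_RE.match(text)
    if match is None:
        return None
    hours, minutes = match.groups()
    return int(hours) * 60 + int(minutes)

print(to_minutes('3h 10m'))  # 190
print(to_minutes('6h 5m'))   # 365
print(to_minutes('Direct'))  # None
```

On the scraped frame this could be applied with something like `tix_data['stops_time'].map(to_minutes)`, so that direct flights keep a missing value.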
