# Price Data Scrape

## Where Do We Start?

Through some subject matter knowledge and research, obtaining price data would require some creative approaches. It was difficult to find domains where obtaining pricing data would be readily available for flights in the past -- even searching on some airliners' websites to uncover what a flight ticket would cost yesterday was impossible. This was troublesome at first, until the idea was suggested to inquire future pricing data as a target variable for our supervised learning models. Flight routes could be utilized as an input with a price scraping API, which should be able to collect pricing data on all of the flights we propose. Ideally, we should gather data in exactly one year from the flight's original date in May 2020, with the intention of simulating as much of a similar schedule as possible.

## The Skyscanner API

[RapidAPI.com](https://rapidapi.com) is considered to be the world's largest API marketplace to find, test, and connect to thousands of APIs. A RapidAPI account can readily be made with a Gmail account and a user can almost instantaneously connect to whichever API they wish. One of the APIs readily avilable to use from RapidAPI.com was the [Skyscanner Flight Search API](https://rapidapi.com/search/skyscanner). [Skyscanner.com](https://www.skyscanner.com/about-us), the creators of the affiliated API, is a website that is dedicated to consolidating travel needs into one place. The website mainly conducts a search across as many domains as possible for the best price on a given travel preference. Developer authentication would typically be needed to utilize their sophisticated REST API, but RapidAPI bypasses the developer interaction process by immediately providing a unique API key to a user who wishes to subscribe to a plan -- RapidAPI would act as the hosting website. The Skyscanner API was free to use through RapidAPI, under certain limitaions. [Skyscanner's official documentation for their travel API's](https://skyscanner.github.io/slate/#getting-started) shows the API is able to perform more intensive gridsearching as well as gather live pricing on flight data. Such capabilities were not avilable when utilizing the API through RapidAPI. Furthermore, the official query search limit was set to 100 searches per minute with Skyscanner's official API. When using RapidAPI, a user is limited to 50 searches per minute -- however, even when we satisfied this request limit by only searching 48 queries before sleeping our function for a minute, many restrictioons were placed on searches due to our function exceeding the maximum amount per minute (this should not happen). An adjustment was made so that we could only search 30 queries before sleeping our function for a minute to help cushion the stress on RapiAPI's and Skyscanner's servers. All of these searches were conducted using Python's [`requests` package](https://realpython.com/python-requests/). 

## The Search

The `Browse Quotes` function was heavily used to gather all of the information needed for our study. To utilize the `Browse Quotes` search query, we would need to convert our unix time epoch timestamps to datetime strings. We also would need to incorporate the IATA codes for each airport in our search. Unfortunately, when we conducted our search with the FlightXML API, the returned airport codes were ICAO airport codes --  a different way of classifying airports which was not adequate for our Skyscanner API. The Skyscanner API only accepts IATA codes for recognizing airports. THis lead to a semi-arduous task of searching the IATA code for each airport in our search. Fortunately, there were only a total of 40 airports to manuall search for their IATA codes. Currently existing datasets were searched for on Google which may showcase ICAO and IATA codes together, but none were found. In the future, a programmatic solution is necessary for any further studies. 

Our original target search first incorporated the May 2020 dates an exact year later from their original inquiry -- that is, a flight flown on May 1st 2020 was considered to also fly on May 1st 2021. However, this was not the case. Much of our search was empty as flights we were searching did not fly on the same days a year from then. The next step was to incorporate a search where our API returned the best price for that specific flight in the month of May 2021. Surprisingly, this type of search was also low in data needed for the study. The final alternative to obtain pricing data was to search the best price of each flight on a monthly basis for the rest of the 2020 year and for year 2021. It is still suspected that some flights returned from this search were empty, but many of our searches returned. The reason for empty searches could be that Skyscanner lacks the ability to find flights occurring in the future, the flight was not scheduled in the future yet, or the flight will never exist in the future (the latter of such hypothesis possibly being the least believable). 

This limits the degree of accuracy of our study. The intended and most appropriate searches would not be collected with this API for various reasons. Such complications must be mentioned when noting the accuracy of our models. 

In [53]:
import requests #imports the requests package
import json #imports the json package
import pandas as pd #imports the pandas package
import datetime #imports the datetime package
import numpy as np #imports the numpy package
import time #imports the time package

In [54]:
with open('/Users/ChristopherKuzemka/Documents/GA/dsi_11/projects/capstone/env.json') as f:
    information = json.load(f) #imports the hidden json file with sensitive API information

In [3]:
host_name = information.get('x-rapidapi-host') #gets the host name 
apiKey = information.get('x-rapidapi-key') #gets the api key

# Data Prep

In [5]:
flight_schedules = pd.read_csv('../data/flight_schedules.csv')
flight_schedules

Unnamed: 0.1,Unnamed: 0,ident,actual_ident,departuretime,arrival_time,origin,destination,aircrafttype,meal_service,seats_cabin_first,seats_cabin_business,seats_cabin_coach
0,0,UAL4282,ASQ4282,1588330800,1588340820,CYUL,KORD,E75L,Business: Refreshments / Economy: Food for sale,0,12,58
1,1,ACA7591,AC27591,1588335000,1588343880,CYUL,KORD,E75L,"Business: Breakfast / Economy: Breakfast, Food...",0,12,64
2,2,UAL8371,AC27591,1588335000,1588343880,CYUL,KORD,E75L,Business: Breakfast / Economy: Food for sale,0,12,64
3,3,UAL4245,ASQ4245,1588341060,1588351080,CYUL,KORD,E75L,Business: Refreshments / Economy: Food for sale,0,12,58
4,4,UAL8481,AC27595,1588353300,1588362000,CYUL,KORD,E75L,Business: Meal / Economy: Food for sale,0,12,64
...,...,...,...,...,...,...,...,...,...,...,...,...
5818,5818,UAL464,,1590514200,1590523860,KDEN,KPDX,A319,Business: Snack or brunch / Economy: No meal,0,12,114
5819,5819,SWA378,,1590520500,1590529500,KDEN,KPDX,B738,Economy: No meal,0,0,175
5820,5820,DLH9070,UAL393,1590539460,1590549120,KDEN,KPDX,A319,Business: Snack or brunch / Economy: No meal,0,12,114
5821,5821,UAL393,,1590539460,1590549120,KDEN,KPDX,A319,Business: Snack or brunch / Economy: No meal,0,12,114


In [6]:
flight_combinations = pd.read_csv('../data/flight_combinations.csv') #loads the flight combinations dataframe
flight_combinations['origin'].unique() #shows the unique origins

Unnamed: 0.1,Unnamed: 0,origin,destination,0
0,0,CYHM,KJFK,1
1,1,CYUL,KORD,1
2,2,CYVR,KLAX,1
3,3,CYYZ,KIAH,2
4,4,CYYZ,KJFK,1
5,5,CYYZ,KLAX,1
6,6,CYYZ,KORD,4
7,7,EBBR,KMIA,2
8,8,EBBR,KORD,1
9,9,EBLG,KATL,1


In [7]:
len(flight_combinations['origin'].unique()) #displays the length of the unique origin destinations

array(['CYHM', 'CYUL', 'CYVR', 'CYYZ', 'EBBR', 'EBLG', 'EDDF', 'EDDN',
       'EDDP', 'EGLL', 'EHAM', 'EHBK', 'EIDW', 'ELLX', 'KABQ', 'KABY',
       'KAEX', 'KAFW', 'KAGS', 'KAPF', 'KAST', 'KATL', 'KAUS', 'KBFI',
       'KBIH', 'KBKV', 'KBOI', 'KBOS', 'KBUR', 'KBWI', 'KCVG', 'KCVO',
       'KDEN'], dtype=object)

In [63]:
flight_combinations['destination'].unique() #displays the list of unique destinations

array(['KJFK', 'KORD', 'KLAX', 'KIAH', 'KMIA', 'KATL', 'KPDX'],
      dtype=object)

In [62]:
len(flight_combinations['destination'].unique()) #displays the length of the unique destination destinations

7

Will have to programmatically find a way to create a bunch if IATA codes out of the ICAO codes shown here. There are only 33 unique origins and 7 unique destinations making a total of 40 different airports studied. In total, we have 60 combinations studied. 

## Making the IATA Dictionaries

In [10]:
origin_IATA_dict = {k:[] for k in flight_combinations['origin'].unique()} #creates an empty origin dictionary where the keys are ICAO codes

In [11]:
destination_IATA_dict = {k:[] for k in flight_combinations['destination'].unique()} #creates an empty destination dictionary where the keys are ICAO codes

In [12]:
origin_IATA_list = ['YHM', 'YUL', 'YVR', 'YYZ', 'BRU', 'LGG', 'FRA', 'NUE', 'LEJ', 'LHR', 'AMS', 'MST', 'DUB', 'LUX', 'ABQ', 'ABY', 'AEX', 'AFW', 'AGS', 'APF', 'AST', 'ATL', 'AUS', 'BFI', 'BIH', 'BKV', 'BOI', 'BOS', 'BUR', 'BWI', 'CVG', 'CVO', 'DEN'] #origin IATA codes

In [13]:
destination_IATA_list = ['JFK', 'ORD', 'LAX', 'IAH', 'MIA', 'ATL', 'PDX'] #destination IATA codes

In [14]:
for k,v in enumerate(origin_IATA_dict):
    origin_IATA_dict[v].append(origin_IATA_list[k]) #populates the appropriate IATA to the ICAO dictionary IN THE ORDER BOTH ITEMS WERE CREATED

In [15]:
for k,v in enumerate(destination_IATA_dict):
    destination_IATA_dict[v].append(destination_IATA_list[k]) #populates the appropriate IATA to the ICAO dictionary IN THE ORDER BOTH ITEMS WERE CREATED

## Adding the IATAs to flight combinations dataframe

In [None]:
flight_combinations['destination_IATA'] = flight_combinations['destination'].map(destination_IATA_dict) #maps the destination iata dictionary to the flight combinations dataframe
flight_combinations['destination_IATA'] = [ls[0] for ls in flight_combinations['destination_IATA']] #converts the destination iata cells into single elements rather than lists

In [None]:
flight_combinations['origin_IATA'] = flight_combinations['origin'].map(origin_IATA_dict) #maps the origin iata dictionary to the flight combinations dataframe
flight_combinations['origin_IATA'] = [ls[0] for ls in flight_combinations['origin_IATA'] #converts the origin iata cells into single elements rather than lists

## Generate Monthly Dates

In [56]:
#https://www.w3resource.com/python-exercises/date-time-exercise/python-date-time-exercise-50.php
#Code tested with below

def daterange(date1, date2):
    for n in range(int((date2 - date1).days)+1):
        yield date1 + datetime.timedelta(n)

start_dt = datetime.date(2020, 6, 1)
end_dt = datetime.date(2020, 12, 31)


output_list = []

for dt in daterange(start_dt, end_dt):
    output_list.append(dt.strftime("%Y-%m"))

In [57]:
date_set = set(output_list)

In [58]:
date_list = list(date_set)

## Adding the IATAs to flight combinations dataframe

In [31]:
flight_combinations['destination_IATA'] = flight_combinations['destination'].map(destination_IATA_dict)
flight_combinations['destination_IATA'] = [ls[0] for ls in flight_combinations['destination_IATA']]

In [32]:
flight_combinations['origin_IATA'] = flight_combinations['origin'].map(origin_IATA_dict)
flight_combinations['origin_IATA'] = [ls[0] for ls in flight_combinations['origin_IATA']]

__Quotes Inputs:__

- Country (Required)

- Currency (Required)

- Locale (Required)

- originplace (Required)

- destinationplace (Required)

- outboundpartialdate (Required)

- inboundpartialdate (Optional)

In [51]:
def get_prices_across_year(country, currency, locale, flight_combs, month_range):
    start_time = time.time() #epoch start time
    output_list = []
    count = 0


    for k in range(len(flight_combs['origin_IATA'])):
        for month in range(len(month_range)):
            url = f"https://skyscanner-skyscanner-flight-search-v1.p.rapidapi.com/apiservices/browsequotes/v1.0/{country}/{currency}/{locale}/{flight_combs.loc[k,'origin_IATA']}-sky/{flight_combs.loc[k, 'destination_IATA']}-sky/{month_range[month]}"

            querystring = {"inboundpartialdate":"anytime"}

            headers = {
                'x-rapidapi-host': host_name,
                'x-rapidapi-key': apiKey
            }

            response = requests.get(url = url, headers=headers, params=querystring)
            print(f"Getting the data for: (Origin: {flight_combs.loc[k,'origin_IATA']}), (Destination: {flight_combs.loc[k, 'destination_IATA']}), (For Date: {month_range[month]}) ")
            json_response = response.json()
            output_list.append(json_response)
            count = count + 1
            if count <= 30:
                continue
            else:
                print("Approaching request limit. Sleeping for 60 seconds....")
                print(f"The request count index you are at is: {k}")
                print(f"The the month index you are at is: {month_range[month]}")
                print(f"The size of the data file you are requesting off of is: {len(flight_combs)} ")
                print(f"This search is {(k/len(flight_combs)) * 100} percent complete.")
                print(f"Elapsed time is: {time.time() - start_time} seconds.")
                time.sleep(60)
                print("Conducting another request batch.")
                
                count = 0
    print(f"Elapsed time of process is: {time.time() - start_time}")
    print(f"The start time was: {start_time}")
    print(f"The end time is: {time.time()}")
    return pd.DataFrame(output_list)

In [60]:
test_monthly = get_prices_across_year('US', 'USD', 'en-US', flight_combinations, date_list)

Getting the data for: (Origin: AMS), (Destination: MIA), (For Date: 2020-12) 
Getting the data for: (Origin: AMS), (Destination: MIA), (For Date: 2020-08) 
Getting the data for: (Origin: MST), (Destination: ATL), (For Date: 2020-11) 
Getting the data for: (Origin: MST), (Destination: ATL), (For Date: 2020-07) 
Getting the data for: (Origin: MST), (Destination: ATL), (For Date: 2020-09) 
Getting the data for: (Origin: MST), (Destination: ATL), (For Date: 2020-10) 
Getting the data for: (Origin: MST), (Destination: ATL), (For Date: 2020-06) 
Getting the data for: (Origin: MST), (Destination: ATL), (For Date: 2020-12) 
Getting the data for: (Origin: MST), (Destination: ATL), (For Date: 2020-08) 
Getting the data for: (Origin: DUB), (Destination: JFK), (For Date: 2020-11) 
Getting the data for: (Origin: DUB), (Destination: JFK), (For Date: 2020-07) 
Getting the data for: (Origin: DUB), (Destination: JFK), (For Date: 2020-09) 
Getting the data for: (Origin: DUB), (Destination: JFK), (For Da

In [61]:
test_monthly.to_csv('../data/june2020_to_december2020_monthlyprice.csv')

In [45]:
test_monthly_2 = get_prices_across_year('US', 'USD', 'en-US', flight_combinations, date_list)

his search is 68.33333333333333 percent complete.
Elapsed time is: 1217.755401134491 seconds.
Conducting another request batch.
Getting the data for: (Origin: AFW), (Destination: LAX), (For Date: 2021-06) 
Getting the data for: (Origin: AFW), (Destination: LAX), (For Date: 2021-04) 
Getting the data for: (Origin: AFW), (Destination: LAX), (For Date: 2021-05) 
Getting the data for: (Origin: AFW), (Destination: LAX), (For Date: 2021-02) 
Getting the data for: (Origin: AFW), (Destination: LAX), (For Date: 2021-07) 
Getting the data for: (Origin: AFW), (Destination: LAX), (For Date: 2021-03) 
Getting the data for: (Origin: AFW), (Destination: LAX), (For Date: 2021-12) 
Getting the data for: (Origin: AFW), (Destination: LAX), (For Date: 2021-11) 
Getting the data for: (Origin: AGS), (Destination: ATL), (For Date: 2021-01) 
Getting the data for: (Origin: AGS), (Destination: ATL), (For Date: 2021-10) 
Getting the data for: (Origin: AGS), (Destination: ATL), (For Date: 2021-09) 
Getting the da

In [46]:
test_monthly_2.to_csv('../data/2021_monthly_pricing2.csv')