# Flight Data Scrape

## Where Do We Start?

The most obvious candidate for collecting data was [FlightAware.com](https://flightaware.com). It is the world's largest flight tracking and data platform, which actively collects data directly from various air traffic control systems in many countries, including ground stations and satellites. Its HyperFeed engine works with FlightAware's artificial intelligence network to gather data in real time. It serves as a reliable web-based data source and provides a powerful API known as "FlightXML" that lets customers gather useful and comprehensive information on historical flights. Some notable companies rely on FlightAware for their data, such as *United, tripadvisor, Hawaiian Airlines, and more*. Many other APIs that claim to collect flight data more than likely use the FlightAware API as their underlying source.

With that background, the next step was to gain access to the "FlightXML" API. A basic license for this API is free in that there is no monthly subscription charge. However, all users, whether on the basic license or a more advanced one, are charged pennies per query. Furthermore, basic license users are not allowed to use the FlightXML API for commercial products.

## The FlightXML2 API

The API provided with the basic license is the "FlightXML2" API ("FlightXML3" does exist, but is believed to be exclusive to more advanced license users). The documentation for the FlightXML2 API is found [here](https://flightaware.com/commercial/flightxml/explorer/#op_AirlineFlightInfo). To call the API from our Python scripts, we also needed a package meant to work with SOAP/suds objects. The [suds](https://docs.inductiveautomation.com/display/DOC79/SUDS+-+Library+Overview) package was used to help us correctly use the API and gather different sets of data.

FlightXML2 is an excellent API for gathering a large amount of general flight data. Its query functions do not have a time limit, but they do cap the number of results per search. A typical query returns up to 15 results. For example, searching for the flights currently arriving at John F. Kennedy Airport within a given timeframe will only return the latest 15 flights. An offset parameter lets a user shift the 15-result window by any given integer, and the user may also raise the maximum number of results per query (at the cost of higher charges). For this study, the maximum was not increased and the offset functionality was only used for some of the searches.
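
The result cap and offset described above amount to simple pagination. The sketch below illustrates the idea without calling the real service: `fetch_all`, `fake_fetch_page`, and `FAKE_FLIGHTS` are hypothetical names, and the fake fetcher only stands in for a query that accepts a result count and an offset. Since every page is a billed query, the sketch also caps the number of pages.

```python
from typing import Callable, Dict, List

PAGE_SIZE = 15  # result cap per query, as described above


def fetch_all(fetch_page: Callable[[int, int], List[Dict]],
              page_size: int = PAGE_SIZE,
              max_pages: int = 4) -> List[Dict]:
    """Collect results page by page, stopping early on a short page."""
    results: List[Dict] = []
    for page in range(max_pages):  # max_pages bounds the number of billed queries
        batch = fetch_page(page_size, page * page_size)  # (how many, offset)
        results.extend(batch)
        if len(batch) < page_size:  # a short page means nothing is left
            break
    return results


# Hypothetical stand-in for an API call: 40 fake flights served in slices.
FAKE_FLIGHTS = [{"ident": f"TST{i:03d}"} for i in range(40)]
calls: List[int] = []  # records the offset of every simulated query


def fake_fetch_page(how_many: int, offset: int) -> List[Dict]:
    calls.append(offset)
    return FAKE_FLIGHTS[offset:offset + how_many]


flights = fetch_all(fake_fetch_page)
print(len(flights), calls)  # 40 [0, 15, 30]
```

With a real client, `fake_fetch_page` would be replaced by a thin wrapper around the chosen query function; the stopping rule and the page cap would stay the same.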

The basic license of the API severely limits how much flight history a user can access. Originally, we intended to gather speed and altitude data over time for each flight considered, but the FlightXML2 function used for this task is limited to a two-week timeframe. This would have severely limited the amount of data we could collect and use for feature engineering. Instead, we decided to utilize a different function that can search three months into the past, but this approach ran into a different set of issues...

Let's provide some background context and conduct an experiment. According to [Flights.com](https://www.flights.com/flights/new-york-jfk-to-miami-mia/), "With 3 different airlines operating flights between New York and Miami, there are, on average, 2,197 flights per month. This equates to about 523 flights per week, and 75 flights per day from JFK to MIA." Even if we were limited to the two-week period, we should see enough flights for our study between just these two popular U.S. locations. However, when actually searching through FlightXML2, it was found that only 14 total flights were made in this two-week span. Similar patterns of limited flights were observed across other airports, and these trends persisted even across a three-month span. But why is this so?
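
To put the gap in numbers, here is a quick back-of-the-envelope calculation using only the figures quoted above (the daily average from Flights.com and the 14 flights found); the variable names are illustrative.

```python
FLIGHTS_PER_DAY = 75   # JFK -> MIA daily average quoted from Flights.com
WINDOW_DAYS = 14       # the two-week history limit
OBSERVED = 14          # flights found through FlightXML2 in that window

expected = FLIGHTS_PER_DAY * WINDOW_DAYS   # flights expected in a normal two weeks
share = OBSERVED / expected                # fraction of the expected volume

print(f"expected ~{expected} flights, observed {OBSERVED} ({share:.1%})")
# expected ~1050 flights, observed 14 (1.3%)
```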

The timing of this study plays a tremendous role in the outcomes. These searches were conducted toward the end of May 2020 (between May 22nd and May 29th). The furthest back we could have searched was the end of February or beginning of March 2020. Coincidentally, this is the same point in time when global air travel restrictions were put in place and travel volume began to drop dramatically. By mid-March, [many major airports began to close down due to the Covid-19 pandemic](https://www.businessinsider.com/coronavirus-airports-and-faa-centers-temporarily-closed-for-cleaning-2020-3#chicagos-midway-international-airport-1).

## The Search

With such limitations, we made a plan for gathering as much flight data as we could. Instead of methodically picking popular routes where air traffic "may" exist between our target airports and other supposedly popular airports, we decided to conduct a wider, more random search for flight data. The goal was to get as much data as possible during the pandemic.

- The first step was to search for active flights in the sky bound for each target airport. On a given night (May 27th), 15 different flights were identified as arriving at any given airport. Across the seven destination airports, a total of 60 routes were collected.

- The next step was to search each specific flight's schedule throughout the single month of May 2020. Even though we had the capability to search as far back as three months, we felt that searching months where air traffic was extremely low (and where some of our target airports were closed completely during the peak of the pandemic) would be a waste of money, keeping in mind that each query costs a certain number of pennies. Within this one-month timespan, queries for each specific route were issued at an eight-hour frequency. Our search ran for approximately one hour before returning three dataframes: the flight combinations, the flights currently in the sky, and the flight schedules for that month. From 60 flight combinations, we were able to obtain approximately 5,800 data points of flight history for that month.

5,800 individual flights is extremely low for an analysis across seven destination airports and 60 flight combinations. Many of the queries returned empty, possibly due to flight cancellations. While this small data set may not be ideal for the intended purpose of our study, it is enough to showcase a minimum proof of concept.
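
Because every query is billed, it helps to estimate the query volume before launching a search like the one described above. The sketch below is a rough count, assuming the search window runs from May 1 to May 27, 2020 in eight-hour steps, with 60 routes and one in-flight query per destination airport. It ignores any extra calls such as offset pages, and the per-query price is deliberately left out.

```python
from datetime import datetime, timedelta

start = datetime(2020, 5, 1)     # assumed start of the search window
end = datetime(2020, 5, 27)      # assumed end of the search window
window = timedelta(hours=8)      # eight-hour query frequency
routes = 60                      # origin-destination combinations reported above
airports = 7                     # destination airports searched for in-flight arrivals

windows_per_route = (end - start) // window     # whole 8-hour windows in the range
schedule_queries = routes * windows_per_route   # one schedule query per route per window
total_queries = schedule_queries + airports     # plus one in-flight query per airport

print(windows_per_route, schedule_queries, total_queries)  # 78 4680 4687
```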

The Python notebook below showcases how the API was utilized.

In [2]:
import sys #imports the sys package
from suds import null, WebFault #imports the null and WebFault methods and classes from suds
from suds.client import Client #imports the Client class from suds
import logging #imports logging package
import json #imports json package
import pandas as pd #imports the pandas package
import datetime #imports datetime package
import numpy as np #imports the numpy package
import time #imports the time package

In [3]:
with open('/Users/ChristopherKuzemka/Documents/GA/dsi_11/projects/capstone/env.json') as f: #opens the json file containing sensitive information
    information = json.load(f) #loads the json dictionary

In [4]:
information.keys() #displays the available keys from the json file

dict_keys(['FA_API_KEY', 'FA_USERNAME', 'x-rapidapi-host', 'x-rapidapi-key'])

In [5]:
username = information.get('FA_USERNAME') #gets the value for the username of the json file
apiKey = information.get('FA_API_KEY') #gets the API key from the json file
url = 'http://flightxml.flightaware.com/soap/FlightXML2/wsdl' #xml url used as a host for the API

In [6]:
logging.basicConfig(level=logging.INFO) #instantiates logging of information gathered through the url
api = Client(url, username=username, password=apiKey) #sets the suds Client to always send the username and API key with every query

The `make_unix_lists` function is needed to work with the API functions, which only interpret timestamps in unix epoch time.

In [7]:
def make_unix_lists(start_date, end_date, frequency):
    created_range = pd.date_range(start = start_date, end = end_date, freq = frequency) #creates a daterange series
    list_created_range = list(created_range) #converts such range into a list
    unix_floats = [date.to_pydatetime().timestamp() for date in list_created_range] #transforms the daterange list into unix epoch timestamps represented as floats
    unix_ints = [int(i) for i in unix_floats] #makes the above list as a list of integers
    #return unix_ints

    #We are doing the below to accommodate a for loop format in another function
    start_ints = unix_ints[:-1] #creates a list of all the start dates without last element
    end_ints = unix_ints[1:] #creates a list of all the end dates without first element 

    return start_ints, end_ints
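
One detail worth noting about the function above: calling `.timestamp()` on a naive Python `datetime` interprets it in the machine's local timezone, so the resulting epoch values can shift depending on where the notebook runs. The following is a minimal standard-library sketch of the same windowing idea that pins naive inputs to UTC instead. `make_unix_windows` is a hypothetical name, not part of the original notebook, and treating inputs as UTC is an assumption made here for reproducibility.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def make_unix_windows(start: datetime, end: datetime,
                      step: timedelta) -> Tuple[List[int], List[int]]:
    """Return parallel lists of window start/end epochs (seconds)."""
    # Assumption: naive inputs are treated as UTC so results do not
    # depend on the local timezone of the machine.
    if start.tzinfo is None:
        start = start.replace(tzinfo=timezone.utc)
    if end.tzinfo is None:
        end = end.replace(tzinfo=timezone.utc)

    edges: List[int] = []
    current = start
    while current <= end:  # collect every window boundary, end inclusive
        edges.append(int(current.timestamp()))
        current += step

    # Each window runs from one boundary to the next.
    return edges[:-1], edges[1:]


starts, ends = make_unix_windows(datetime(2020, 5, 1), datetime(2020, 5, 2),
                                 timedelta(hours=8))
print(list(zip(starts, ends)))
```

One day split into eight-hour steps yields three consecutive windows, each starting where the previous one ends.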

The `get_flights` function below takes a start date, an end date, a time frequency, and a list of destination airports. For these destinations, the current flights are explored using FlightXML2's `SearchBirdseyeInFlight()` function to gather flight combinations. The function also uses the `make_unix_lists` function shown above to convert the start and end dates to unix epoch time. Finally, `get_flights` combines the unix timeframes and flight combinations to create a schedule dataframe. This is accomplished using the `AirlineFlightSchedules()` function from FlightXML2.

In [715]:
def get_flights(start_input, end_input, frequency_input, destination_input):
    start_time = time.time() #epoch time in seconds
    print(f'This search began at this epoch time: {start_time}')

    ## Unix Epoch Time Function
    start_list, end_list = make_unix_lists(start_input, end_input, frequency_input) #utilizes the unix epoch creation list function
    
    ## Create an empty list to add searches to
    get_flights_list = []
    schedules = []
    modified_schedules = []    


    ## Skeleton df for get_flights_df
    get_flights_cols = ['faFlightID', 'ident', 'prefix', 'type', 'suffix', 'origin', 'destination', 'timeout', 'timestamp', 'departureTime', 'firstPositionTime', 'arrivalTime', 'longitude', 'latitude', 'lowLongitude', 'lowLatitude', 'highLongitude', 'highLatitude', 'groundspeed', 'altitude', 'heading', 'altitudeStatus', 'updateType', 'altitudeChange', 'waypoints']
    get_flights_df = pd.DataFrame(columns = get_flights_cols)

    ## Creates the skeleton dataframe for schedules_df
    schedules_cols = ['ident', 'actual_ident', 'departuretime', 'arrival_time', 'origin', 'destination', 'aircrafttype', 'meal_service', 'seats_cabin_first', 'seats_cabin_business', 'seats_cabin_coach'] #column names for the scheduling df
    schedules_df = pd.DataFrame(columns = schedules_cols) #creates the empty schedules dataframe skeleton

    ## THE POINT OF THE BELOW TO GET FLIGHT ORIGIN-DESTINATION COMBINATIONS
    print("Grabbing In-Flight Information...")
    #Grabbing flights from the list of destinations
    for destination in destination_input: #for a destination labeled in the destination input
        add_dest = '{= dest ' + destination + '}' #wraps the destination in the query format the search expects
        search_flight = api.service.SearchBirdseyeInFlight(query = add_dest, howMany = 15) #conducts a search on the current flights en route
        search_dict = Client.dict(search_flight)
        print("Destination Query Searched!")
        for retrieved_query in range(len(search_dict['aircraft'])):
            get_flights_list.append(search_dict['aircraft'][retrieved_query])
            
    
    print("Creating In-Flight Dataframe")
    print(f'The elapsed time is: {time.time() - start_time} seconds.')
    ## Creates the dataframe and concats to the get_flights_df
    for list_item in range(len(get_flights_list)):
        preliminary_flight_df = pd.DataFrame(get_flights_list[list_item]).T #uses a transposed version to cleanly extract the data
        preliminary_flight_df.rename(columns = dict(enumerate(get_flights_cols)), inplace = True) #renames the positional columns (0, 1, 2, ...) to match the get_flights_df so concatenation is simple
        get_flights_df = pd.concat([get_flights_df, preliminary_flight_df]) #overwrites the get_flights_df to create a larger dataframe
        print("Concatenated In-Flight Dataframe!")

    print('Cleaning The In-Flight Dataframe...')
    ## Cleans the get_flights_df        
    get_flights_df.drop(index = 0, inplace = True) #drops the 0 index rows in the repetitive 0,1 pattern of the dataframe
    get_flights_df.reset_index(inplace = True) #resets the index, which moves the old index into a new 'index' column
    get_flights_df.drop(columns = 'index', inplace = True) #drops the extra 'index' column


    
    ## Makes a unique combination 
    unique_orig_dest_combs = get_flights_df.groupby(['origin', 'destination']).size().reset_index()

    print("Grabbing Airline Schedule Information...")
    print(f'The elapsed time so far is: {time.time() - start_time} seconds.')
    ## Gets airline schedule and flight data
    for index, origin_destination_combination in unique_orig_dest_combs.iterrows():
        print(f'Searching Airline Schedules For Origin-Destination Combination: {origin_destination_combination["origin"], origin_destination_combination["destination"]}')
        print(f'The elapsed time so far is: {time.time() - start_time} seconds.') 
        for time_epoch in range(len(start_list)): #for the entirety of the unix epoch time list
            airline_flight_schedules = api.service.AirlineFlightSchedules(startDate = start_list[time_epoch], endDate = end_list[time_epoch], origin = origin_destination_combination["origin"], destination = origin_destination_combination["destination"], howMany = 15) #gets the flight schedules information
            airline_flight_dict = Client.dict(airline_flight_schedules) #converts the above variable to a dictionary
            schedules.append(airline_flight_dict) #appends results to the instantiated output list
            print("Airline Schedule Searched!")

    print("Cleaning Dirty Schedules Scrape...")
    ## Cleans the dirty collection for schedules
    for item in range(len(schedules)):
        try: #some responses are empty and have no 'data' key
            modified_schedules.append(schedules[item]['data']) #keeps only the responses where flights were actually flown in the timeframe studied
        except KeyError: #when the 'data' key is missing...
            continue #...move on and ignore the nonexistent flight data

    
    print('Making The Schedules Dataframe...')
    print(f'The elapsed time so far is: {time.time() - start_time} seconds.')
    ## Creates a dataframe out of the modified_schedules list <-- FIX THIS LATER
    for l in range(len(modified_schedules)):
        for m in range(len(modified_schedules[l])):
            sched_df = pd.DataFrame(modified_schedules[l][m]).T #uses a transposed version to cleanly extract the data
            sched_df.rename(columns = dict(enumerate(schedules_cols)), inplace = True) #renames the positional columns (0, 1, 2, ...) to match the schedules_df so concatenation is simple
            schedules_df = pd.concat([schedules_df, sched_df]) #overwrites the schedules_df to create a larger dataframe
            print("Schedules Dataframe Concatenated!")

    print("Cleaning Schedules Dataframe...")
    ## Cleans the schedules_df
    schedules_df.drop(index = 0, inplace = True) #drops the 0 index rows in the repetitive 0,1 pattern of the dataframe
    schedules_df.reset_index(inplace = True) #resets the index, which moves the old index into a new 'index' column
    schedules_df.drop(columns = 'index', inplace = True) #drops the extra 'index' column
    

    print(f'The elapsed time for this search is: {time.time() - start_time} seconds.')
    print(f'The current epoch time that this search has finished is: {time.time()}')
    return get_flights_df, schedules_df, unique_orig_dest_combs
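
The cleaning steps in the function above (skipping empty responses, then finding the unique origin-destination pairs) can also be expressed without try/except or repeated concatenation. The sketch below shows one possible approach on made-up data, assuming each response and each flight record has already been converted to a plain dictionary. The sample idents are hypothetical, and this is an illustration rather than the notebook's method.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical responses shaped like the dictionaries collected above:
# some carry a 'data' list of flights, others are empty.
responses: List[Dict] = [
    {"data": [
        {"ident": "TST100", "origin": "KJFK", "destination": "KMIA"},
        {"ident": "TST200", "origin": "KJFK", "destination": "KMIA"},
    ]},
    {},  # a time window with no flights
    {"data": [
        {"ident": "TST300", "origin": "KLAX", "destination": "KORD"},
    ]},
]

# Flatten every populated response into one list of flight records;
# responses without a 'data' key contribute nothing.
records = [flight for resp in responses for flight in resp.get("data", [])]

# Count flights per origin-destination pair, similar in spirit to
# groupby(['origin', 'destination']).size().
route_counts = Counter((r["origin"], r["destination"]) for r in records)

print(len(records), dict(route_counts))
```

A list of dictionaries like `records` can be passed to `pd.DataFrame(records)` in a single call, which avoids growing a dataframe one row at a time.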

In [716]:
destination_list = ['KLAX', 'KJFK', 'KMIA', 'KORD', 'KPDX', 'KIAH', 'KATL']
flights_df, flights_scheds, flight_combs = get_flights('5/1/2020','5/27/2020','8H', destination_list)

[...]
Schedules Dataframe Concatenated!
[...]

In [723]:
flights_df.to_csv('../data/current_flights.csv')

In [724]:
flights_scheds.to_csv('../data/flight_schedules.csv')

In [726]:
flight_combs.to_csv('../data/flight_combinations.csv')