# Notes

## Flight paths:
- **Houston, TX** to **Los Angeles, CA** (IAH - LAX)
- **New York City, NY** to **Miami, FL** (JFK - MIA)
- **Portland, OR** to **Chicago, IL** (PDX - ORD)

## Number of Total Routes
- At least 1,000 flights per route for now (all times in GMT)

    - __Times:__

        - 0000 hours to 0600 hours

        - 0601 hours to 1200 hours

        - 1201 hours to 1800 hours

        - 1801 hours to 2359 hours

    - __Times of the Year:__

        - Try to get every month

    - __Times of the Week:__

        - Try to get every day
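
To check coverage across the four daily windows, each departure time can be assigned to a window. The following is a minimal sketch, not part of the original scrape: the window labels and the `time_window` helper are hypothetical names, and the last window is assumed to end at 23:59.

```python
from datetime import time

# Window boundaries follow the notes; the labels are illustrative names only.
WINDOWS = [
    ("overnight", time(0, 0), time(6, 0)),
    ("morning", time(6, 1), time(12, 0)),
    ("afternoon", time(12, 1), time(18, 0)),
    ("evening", time(18, 1), time(23, 59)),  # assumed end of day
]


def time_window(departure: time) -> str:
    """Return the label of the window containing `departure` (minute resolution)."""
    minute = departure.replace(second=0, microsecond=0)  # ignore seconds
    for label, start, end in WINDOWS:
        if start <= minute <= end:
            return label
    raise ValueError(f"no window covers {departure}")


print(time_window(time(6, 1)))
```

Counting flights per label would then show which windows still need more data.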
    
## Notes about Data Considerations

- See what the typical flight times are for your paths. You may be limited here.

- Keep working with the other API to see what you can grab.

- Grab future flights too!

- Do some division to work out how many flights you can grab across how many time zones.


## Features to Scrape
- want aircraft type
- want airline flight info
- want airline flight 
- want aircraft type struct
- want flight struct

One query returns 15 results. Example: if you request all flights from airport Alpha to airport Bravo and the search comes back with 5,000 flights, the pricing estimate works out as follows: 5000 / 15 ≈ 333 queries, and 333 × $0.0079 ≈ $2.63 (Class 2).
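
The same estimate can be sketched in code. The two constants come from the notes above; the `estimate_cost` helper is a hypothetical name, and rounding the last partial page up to a full query is an assumption about billing, so it gives 334 queries rather than the 333 used above.

```python
import math

RESULTS_PER_QUERY = 15    # from the notes: one query returns 15 results
PRICE_PER_QUERY = 0.0079  # from the notes: Class 2 price per query, in USD


def estimate_cost(total_results, per_query=RESULTS_PER_QUERY, price=PRICE_PER_QUERY):
    """Return (number of queries, estimated cost in USD rounded to cents)."""
    # Assumption: a partially filled final page is billed as a full query.
    queries = math.ceil(total_results / per_query)
    return queries, round(queries * price, 2)


print(estimate_cost(5000))
```

Rounding up gives a slightly more conservative budget than the back-of-the-envelope figure.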

In [2]:
import sys
from suds import null, WebFault
from suds.client import Client
import logging
import json
import pandas as pd
import datetime
import numpy as np
import time

In [3]:
with open('/Users/ChristopherKuzemka/Documents/GA/dsi_11/projects/capstone/env.json') as f:
    information = json.load(f)

In [4]:
information.keys()

dict_keys(['FA_API_KEY', 'FA_USERNAME', 'x-rapidapi-host', 'x-rapidapi-key'])

In [5]:
username = information.get('FA_USERNAME')
apiKey = information.get('FA_API_KEY')
url = 'http://flightxml.flightaware.com/soap/FlightXML2/wsdl'

In [6]:
logging.basicConfig(level=logging.INFO)
api = Client(url, username=username, password=apiKey)

# JFK - MIA

From [here](https://www.flights.com/flights/new-york-jfk-to-miami-mia/): "With 3 different airlines operating flights between New York and Miami, there are, on average, 2,197 flights per month. This equates to about 523 flights per week, and 75 flights per day from JFK to MIA." The three airlines are:
- American Airlines (Flight AA 2572)

- British Airways (Flight BA 1687) 

- Malaysia Airlines (Flight MH 9446).

In [7]:
def make_unix_lists(start_date, end_date, frequency):
    created_range = pd.date_range(start = start_date, end = end_date, freq = frequency) #creates a date range at the given frequency
    list_created_range = list(created_range) #converts the range into a list
    unix_floats = [date.to_pydatetime().timestamp() for date in list_created_range] #transforms the dates into unix epoch timestamps represented as floats (naive datetimes are read as local time)
    unix_ints = [int(i) for i in unix_floats] #converts the floats to integers

    #Split into start/end lists so another function can loop over consecutive windows
    start_ints = unix_ints[:-1] #all the start dates (every element except the last)
    end_ints = unix_ints[1:] #all the end dates (every element except the first)

    return start_ints, end_ints
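
To see how the start and end lists pair up into consecutive search windows, here is a standard-library sketch of the same idea. It is not the function above: `epoch_windows` is a hypothetical name, it steps with a `timedelta` instead of a pandas frequency string, and it assumes UTC-aware datetimes so the result does not depend on the machine's local time zone.

```python
from datetime import datetime, timedelta, timezone


def epoch_windows(start, end, step):
    """Return (start_ints, end_ints) for consecutive windows from start to end."""
    boundaries = []
    current = start
    while current <= end:
        boundaries.append(int(current.timestamp()))  # epoch seconds as an int
        current += step
    # Every boundary except the last starts a window;
    # every boundary except the first ends one.
    return boundaries[:-1], boundaries[1:]


starts, ends = epoch_windows(
    datetime(2020, 5, 1, tzinfo=timezone.utc),
    datetime(2020, 5, 2, tzinfo=timezone.utc),
    timedelta(hours=8),
)
for window_start, window_end in zip(starts, ends):
    print(window_start, window_end)
```

Each printed pair is one `startDate`/`endDate` window, and each window begins where the previous one ended.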

In [715]:
def get_flights(start_input, end_input, frequency_input, destination_input):
    start_time = time.time() #epoch time in seconds
    print(f'This search began at this epoch time: {start_time}')

    ## Unix Epoch Time Function
    start_list, end_list = make_unix_lists(start_input, end_input, frequency_input) #utilizes the unix epoch creation list function
    
    ## Create an empty list to add searches to
    get_flights_list = []
    schedules = []
    modified_schedules = []    


    ## Skeleton df for get_flights_df
    get_flights_cols = ['faFlightID', 'ident', 'prefix', 'type', 'suffix', 'origin', 'destination', 'timeout', 'timestamp', 'departureTime', 'firstPositionTime', 'arrivalTime', 'longitude', 'latitude', 'lowLongitude', 'lowLatitude', 'highLongitude', 'highLatitude', 'groundspeed', 'altitude', 'heading', 'altitudeStatus', 'updateType', 'altitudeChange', 'waypoints']
    get_flights_df = pd.DataFrame(columns = get_flights_cols)

    ## Creates the skeleton dataframe for schedules_df
    schedules_cols = ['ident', 'actual_ident', 'departuretime', 'arrival_time', 'origin', 'destination', 'aircrafttype', 'meal_service', 'seats_cabin_first', 'seats_cabin_business', 'seats_cabin_coach'] #column names for the scheduling df
    schedules_df = pd.DataFrame(columns = schedules_cols) #creates the empty schedules dataframe skeleton

    ## THE POINT OF THE BELOW TO GET FLIGHT ORIGIN-DESTINATION COMBINATIONS
    print("Grabbing In-Flight Information...")
    #Grabbing flights from the list of destinations
    for destination in destination_input: #for a destination labeled in the destination input
        add_dest = '{= dest ' + destination + '}' #formats the destination as a query string
        search_flight = api.service.SearchBirdseyeInFlight(query = add_dest, howMany = 15) #conducts a search on the flights currently en route
        search_dict = Client.dict(search_flight)
        print("Destination Query Searched!")
        for retrieved_query in range(len(search_dict['aircraft'])):
            get_flights_list.append(search_dict['aircraft'][retrieved_query])
            
    
    print("Creating In-Flight Dataframe")
    print(f'The elapsed time is: {time.time() - start_time} seconds.')
    ## Creates the dataframe and concats to the get_flights_df
    for list_item in range(len(get_flights_list)):
        preliminary_flight_df = pd.DataFrame(get_flights_list[list_item]).T #uses a transposed version to cleanly extract the data
        preliminary_flight_df.rename(columns = dict(enumerate(get_flights_cols)), inplace = True) #renames the positional columns 0..24 to match get_flights_df so concatenation is simple
        get_flights_df = pd.concat([get_flights_df, preliminary_flight_df]) #overwrites the get_flights_df to create a larger dataframe
        print("Concatenated In-Flight Dataframe!")

    print('Cleaning The In-Flight Dataframe...')
    ## Cleans the get_flights_df        
    get_flights_df.drop(index = 0, inplace = True) #drops the 0 index rows in the repetitive 0,1 pattern of the dataframe
    get_flights_df.reset_index(inplace = True) #resets the index, which moves the old index into a new 'index' column
    get_flights_df.drop(columns = 'index', inplace = True) #drops the extra 'index' column


    
    ## Makes a unique combination 
    unique_orig_dest_combs = get_flights_df.groupby(['origin', 'destination']).size().reset_index()

    print("Grabbing Airline Schedule Information...")
    print(f'The elapsed time so far is: {time.time() - start_time} seconds.')
    ## Gets airline schedule and flight data
    for index, origin_destination_combination in unique_orig_dest_combs.iterrows():
        print(f'Searching Airline Schedules For Origin-Destination Combination: {origin_destination_combination["origin"], origin_destination_combination["destination"]}')
        print(f'The elapsed time so far is: {time.time() - start_time} seconds.') 
        for time_epoch in range(len(start_list)): #for every window in the unix epoch time lists
            airline_flight_schedules = api.service.AirlineFlightSchedules(startDate = start_list[time_epoch], endDate = end_list[time_epoch], origin = origin_destination_combination["origin"], destination = origin_destination_combination["destination"], howMany = 15) #gets the flight schedules information
            airline_flight_dict = Client.dict(airline_flight_schedules) #converts the above variable to a dictionary
            schedules.append(airline_flight_dict) #appends results to the instantiated output list
            print("Airline Schedule Searched!")

    print("Cleaning Dirty Schedules Scrape...")
    ## Cleans the dirty collection for schedules
    for item in range(len(schedules)):
        try: #some responses are empty and have no 'data' key
            modified_schedules.append(schedules[item]['data']) #keeps only the responses where flights were actually scheduled in the timeframe studied
        except (KeyError, TypeError): #when a response has no flight data...
            continue #...move on and ignore it

    
    print('Making The Schedules Dataframe...')
    print(f'The elapsed time so far is: {time.time() - start_time} seconds.')
    ## Creates a dataframe out of the modified_schedules list <-- FIX THIS LATER
    for l in range(len(modified_schedules)):
        for m in range(len(modified_schedules[l])):
            sched_df = pd.DataFrame(modified_schedules[l][m]).T #uses a transposed version to cleanly extract the data
            sched_df.rename(columns = dict(enumerate(schedules_cols)), inplace = True) #renames the positional columns 0..10 to match schedules_df so concatenation is simple
            schedules_df = pd.concat([schedules_df, sched_df]) #overwrites the schedules_df to create a larger dataframe
            print("Schedules Dataframe Concatenated!")

    print("Cleaning Schedules Dataframe...")
    ## Cleans the schedules_df
    schedules_df.drop(index = 0, inplace = True) #drops the 0 index rows in the repetitive 0,1 pattern of the dataframe
    schedules_df.reset_index(inplace = True) #resets the index, which moves the old index into a new 'index' column
    schedules_df.drop(columns = 'index', inplace = True) #drops the extra 'index' column
    

    print(f'The elapsed time for this search is: {time.time() - start_time} seconds.')
    print(f'The current epoch time that this search has finished is: {time.time()}')
    return get_flights_df, schedules_df, unique_orig_dest_combs
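
The schedule-cleaning step in the function above (keep only responses that carry a `'data'` entry, then flatten them) can be sketched without a `try`/`except`. This is an illustration only: plain dictionaries stand in for the converted API responses, the sample values are made up, and `collect_rows` is a hypothetical helper name.

```python
def collect_rows(responses):
    """Flatten the 'data' lists of all responses, skipping empty responses."""
    rows = []
    for response in responses:
        # A missing or empty 'data' entry contributes no rows.
        rows.extend(response.get("data") or [])
    return rows


# Made-up stand-ins for converted schedule responses.
sample_responses = [
    {"data": [{"ident": "AAL1", "origin": "KJFK", "destination": "KMIA"}]},
    {},  # a window in which no flights were returned
    {"data": [{"ident": "AAL2", "origin": "KJFK", "destination": "KMIA"}]},
]

print(len(collect_rows(sample_responses)))
```

Collecting all rows first also makes it possible to build the dataframe in one step rather than concatenating row by row.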

In [716]:
destination_list = ['KLAX', 'KJFK', 'KMIA', 'KORD', 'KPDX', 'KIAH', 'KATL']
flights_df, flights_scheds, flight_combs = get_flights('5/1/2020','5/27/2020','8H', destination_list)

Schedules Dataframe Concatenated!
...

In [7]:
test = api.service.AllAirports()

In [8]:
test

[output truncated: a long list of airport codes, e.g. "LKPS", "NJ38", "YBUR", "2VA3", ...]


In [723]:
flights_df.to_csv('../data/current_flights.csv')

In [724]:
flights_scheds.to_csv('../data/flight_schedules.csv')

In [726]:
flight_combs.to_csv('../data/flight_combinations.csv')