Libraries

In [2]:
import numpy as np
import requests
import json
import time
import os
# API key and secret are stored in a local module that is listed in .gitignore
import ae_keys
import pandas as pd
from datetime import date, timedelta

In [16]:
# os.chdir('C:\\Users\\ellio\\Desktop\\Python\\scripts\\ML_flight_cancellation_prediction\\1- API_scrape_dataset_build')
os.getcwd()

'C:\\Users\\ellio\\Desktop\\Python\\scripts\\ML_flight_cancellation_prediction\\1- API_scrape_dataset_build'

### Solution 1 - the FlightAware API

##### Subscription required
I have set up a developer account at:
https://uk.flightaware.com/commercial/flightxml/ 

This is a pay-for-service account; however, it shouldn't be too expensive to develop a POC dataset for initial modelling.

Drawbacks: 
- times are given in UNIX seconds since 1970: annoying but ultimately not a big deal
- there are a lot of endpoints and a lot of information here, so the solution may take the form of several queries and joins.

Positives:
- there is a very active user community to keep the developers honest here: https://discussions.flightaware.com/ 
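The epoch-seconds drawback is easy to work around with the standard library. A minimal sketch, assuming the timestamps should be read as UTC; the helper names `epoch_to_utc` and `utc_to_epoch` are made up for illustration, and the two values are the `startDate` and `endDate` used in a query further down:

```python
from datetime import datetime, timezone

def epoch_to_utc(seconds):
    """Convert Unix epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

def utc_to_epoch(moment):
    """Convert a timezone-aware datetime back to whole epoch seconds."""
    return int(moment.timestamp())

# Values taken from a query payload later in this notebook
window_start = epoch_to_utc(1608854400)
window_end = epoch_to_utc(1609027200)
print(window_start.isoformat(), window_end.isoformat())
```

Working in UTC keeps the conversion independent of the local timezone of the machine, which `time.mktime` is not.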

In [None]:
# Define the values we will need for the API

# imported from the local keys file, which is listed in .gitignore
import keys

username = keys.username
apiKey = keys.apiKey

#get today values
today = date.today()
str_today = str(today)
pattern = '%Y-%m-%d'
epoch_today = int(time.mktime(time.strptime(str_today, pattern)))

#tomorrow values
tomorrow = date.today() + timedelta(days=1)  # `date` and `timedelta` are imported directly, so no `datetime.` prefix
str_tomorrow = str(tomorrow)
epoch_tomorrow = int(time.mktime(time.strptime(str_tomorrow, pattern)))

Set maximum query payload

In [None]:
fxmlUrl = "https://flightxml.flightaware.com/json/FlightXML2/" # Our URL

endpoint = 'SetMaximumResultSize' # once set, this maximum stays in force for the account until it is changed again

payload = {'max_size': 150} # set number

response = requests.get(fxmlUrl + endpoint, params=payload, auth=(username, apiKey))

if response.status_code == 200:
    print("Max size response set")
else:
    print("Error executing request")

Set up the API call with `howMany` set to 2 so as not to spend too much $$$

In [None]:
# Set up the API call - parameters into the payload and glue the URL request together
fxmlUrl = "https://flightxml.flightaware.com/json/FlightXML2/" # Our URL
endpoint = 'SearchBirdseyeInFlight' # Our endpoint, full documentation: https://flightaware.com/commercial/flightxml/explorer/#op_SearchBirdseyeInFlight 

payload = {'startDate': 1608654400, 'endDate': 1609027200, 'true': 'cancelled', 'orig': 'EGLL', 'howMany': 2}

payload_ls = []

response = requests.get(fxmlUrl + endpoint, params=payload, auth=(username, apiKey))

if response.status_code == 200:
    payload_ls.append(response.json())
    cancelled_flights = pd.json_normalize(payload_ls[0]['SearchBirdseyeInFlightResult']['aircraft'])
    cancelled_flights['cancelled'] = 1
    
    print("Request is good and dataframe returned")
else:
    print("Error executing request")

There seem to be a lot of extra small flights included in the payload. This needs to be rectified: enquire whether the query can be limited to heavy or commercial flights only.

In [None]:
cancelled_flights

Annoyingly, the cancelled flights don't give dates outside our parameters, so we will have to take their flight IDs and requery the API. This is going to be expensive if we need to create a large dataset, so we might cap the whole thing at 1500 rows and build a shell model.
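To get a feel for what a 1500-row cap means in API calls, here is a rough sketch. It only counts requests, assuming each request returns up to the maximum result size of 150 set earlier; it does not model how FlightAware actually bills those requests, and `requests_needed` is a made-up helper name:

```python
import math

def requests_needed(target_rows, rows_per_request):
    """Number of API calls required to collect target_rows."""
    if rows_per_request <= 0:
        raise ValueError("rows_per_request must be positive")
    return math.ceil(target_rows / rows_per_request)

# 1500-row budget at up to 150 rows per call
calls = requests_needed(1500, 150)
print(calls)
```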

In [None]:
# API call to get us a dataframe of the cancelled flights

fxmlUrl = "https://flightxml.flightaware.com/json/FlightXML2/" # Our URL

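endpoint = 'FlightInfoEx' # NOTE: the parsing below reads a 'DepartedResult' key, which suggests this response came from the Departed endpoint; check which endpoint is intended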

payload = {'startDate': 1608854400, 'endDate': 1609027200, 'airport': 'EGLL', 'howMany': 150}

payload_ls = []

response = requests.get(fxmlUrl + endpoint, params=payload, auth=(username, apiKey))

if response.status_code == 200:
    payload_ls.append(response.json())
    fs_df = pd.json_normalize(payload_ls[0]['DepartedResult']['departures'])
    print("Request is good and dataframe returned")
else:
    print("Error executing request")

In [None]:
fs_df.head(4)

Need the clean names of the airlines. I have scraped the details from an IATA table on Wikipedia (see 1.1). I can use the first 3 characters of the 'ident' field as an identifier for this information.
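A sketch of that join with pandas. The rows and the column names `airline_code` and `airline_name` are invented for illustration and do not come from the API or the scraped table. Three-letter prefixes such as `BAW` are ICAO airline designators, so this assumes the scraped table carries a three-letter code column:

```python
import pandas as pd

# Illustrative rows only; real 'ident' values come from the API response.
flights = pd.DataFrame({"ident": ["BAW117", "DLH903", "GABCD"]})

# Hypothetical lookup table standing in for the scraped airline codes.
airlines = pd.DataFrame({
    "airline_code": ["BAW", "DLH"],
    "airline_name": ["British Airways", "Lufthansa"],
})

# First 3 characters of the ident act as the join key.
flights["airline_code"] = flights["ident"].str[:3]

# Left join keeps every flight; idents without a match get NaN for the name.
named = flights.merge(airlines, on="airline_code", how="left")
print(named)
```

The left join is deliberate: idents that are not airline flights (for example private registrations) stay in the frame with a missing name, which makes them easy to count or drop later.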


# Problem encountered:

The FlightAware API isn't returning the cancelled flights from the requested location. Furthermore, the dates of the flights aren't included.

### Options:
- find a unique identifier and join with a dataset that has the dates included. 
- find another source of cancelled flights. 


### I have found a new API that may have what we need
https://aviation-edge.com/

This API has flight cancellations at the individual flight level and can also be filtered by airport. This looks promising.

##### Subscription required
I have set up a developer account. This is a pay-for-service account; however, it shouldn't be too expensive to develop a POC dataset for initial modelling.

### Limitations:

- The maximum number of days returned is 30. __*** Update ***__: the 30-day query was returning errors, so each query was reduced to 14 days.
- There is only historical data from May 2020, which may leave us with a small dataset.
- There is limited documentation compared with FlightAware; it doesn't say anywhere whether arrival refers to where the plane has come from or its final arrival.

# Fix - new API:
 - create date intervals using a date_from and date_to list (originally 30 days, reduced to 14 after the errors noted above). Iterate over all of them and concatenate the results into a dataframe
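One way to build those intervals, as a sketch rather than the exact logic used in the query cell: a generator that yields non-overlapping `(date_from, date_to)` pairs and clips the last one to the end date. The function name `date_windows` is made up, and here `days=14` means 14 calendar days inclusive:

```python
from datetime import date, timedelta

def date_windows(start, end, days=14):
    """Yield (date_from, date_to) pairs covering start..end inclusive, without overlap."""
    one_day = timedelta(days=1)
    current = start
    while current <= end:
        # Last day of this window, clipped so it never runs past `end`
        window_end = min(current + timedelta(days=days) - one_day, end)
        yield current, window_end
        current = window_end + one_day

windows = list(date_windows(date(2020, 5, 15), date(2020, 12, 25)))
for date_from, date_to in windows[:3]:
    print(date_from, date_to)
```

Generating the windows up front makes it possible to check for gaps or overlaps before any paid request is sent.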

In [None]:
# Set new API parameters, rename values in case we decide to run and use both

url = 'http://aviation-edge.com/v2/public/flightsHistory?key=' # Our URL

key = ae_keys.api_key # import keys from local script - added to gitignore

payload = '&code=LHR&type=departure' # London Heathrow and departures only

start = date(2020, 5, 15) # start date

end = date(2020, 12, 25) # end date - not used within the query but as a stopping point for the while loop

payload_ls = [] #list to add each payload response to

DictionaryOfDataFrames = {} # to add each dataframe to

group = 0 # to index the dataframe

while start < end:
    date_to = start + timedelta(days=14)
    request_url = url + str(key) + payload + '&date_from=' + str(start) + '&date_to=' + str(date_to)
    print(request_url) # log the URL actually requested so payloads can be checked if something goes awry
    # something did: EDA showed the payload was missing between 19 Oct and 4 November, so the query was rerun
    response = requests.get(request_url)
    payload_ls.append(response.json()) #adds json to list
    DictionaryOfDataFrames["group " + str(start)] = pd.json_normalize(payload_ls[group]) # keyed by the window's start date
    group += 1
    start += timedelta(days=15) # step past date_to so the same day is not queried twice

#concatenate to one dataframe
flight_schedule_dataframe = pd.concat(DictionaryOfDataFrames.values(), ignore_index=True)

In [12]:
flight_schedule_dataframe.head(3)

Unnamed: 0,type,status,departure.iataCode,departure.icaoCode,departure.terminal,departure.scheduledTime,arrival.iataCode,arrival.icaoCode,arrival.terminal,arrival.scheduledTime,...,departure.actualTime,departure.estimatedRunway,departure.actualRunway,arrival.estimatedTime,arrival.baggage,arrival.delay,arrival.gate,arrival.actualTime,arrival.estimatedRunway,arrival.actualRunway
0,departure,unknown,lhr,egll,2,2020-05-15t06:40:00.000,arn,essa,5,2020-05-15t10:05:00.000,...,,,,,,,,,,
1,departure,unknown,lhr,egll,2,2020-05-15t06:40:00.000,arn,essa,5,2020-05-15t10:05:00.000,...,,,,,,,,,,
2,departure,unknown,lhr,egll,2,2020-05-15t06:45:00.000,cph,ekch,3,2020-05-15t09:35:00.000,...,,,,,,,,,,


In [13]:

# rerunning the same query (to try to recover the missing 2-week gap found in EDA) returned 139,292 rows; logged a support ticket with the API owner, including the API URLs
flight_schedule_dataframe.shape

(139292, 35)

In [14]:
flight_schedule_dataframe['departure.scheduledTime'] = pd.to_datetime(flight_schedule_dataframe['departure.scheduledTime'])

display(flight_schedule_dataframe['departure.scheduledTime'].min())
display(flight_schedule_dataframe['departure.scheduledTime'].max())

Timestamp('2020-05-15 06:40:00')

Timestamp('2020-12-25 22:30:00')
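The min and max only confirm the ends of the range; a hole in the middle, like the 19 Oct to 4 November gap found in EDA, would not show up here. A possible check, sketched on made-up timestamps and assuming the column has already been converted with `pd.to_datetime` (`missing_days` is a hypothetical helper):

```python
import pandas as pd

def missing_days(timestamps):
    """Return calendar days between the first and last timestamp that have no rows."""
    days = pd.DatetimeIndex(timestamps.dt.normalize().unique())
    expected = pd.date_range(days.min(), days.max(), freq="D")
    return expected.difference(days)

# Illustrative timestamps only, with 17 and 18 May deliberately absent.
sample = pd.Series(pd.to_datetime([
    "2020-05-15 06:40", "2020-05-16 09:00", "2020-05-19 12:30",
]))
gaps = missing_days(sample)
print(gaps)
```

On the real data this would be called as `missing_days(flight_schedule_dataframe['departure.scheduledTime'])`; an empty result means every day in the range has at least one row.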

139,292 rows: not bad, and definitely enough to do basic modelling on. Save as a parquet file and move on to 2-preprocessing.

In [15]:
flight_schedule_dataframe.to_parquet('C:\\Users\\ellio\\Desktop\\Python\\scripts\\ML_flight_cancellation_prediction\\datasets\\1.1-full_departures_3012.parquet.gzip',
              compression='gzip')