# A machine learning model to predict flight cancellations.

### Goals
- the ability to predict flight cancellations at least 1 week ahead of schedule. 

### Prior studies on the topic
https://srcole.github.io/assets/flight_delay/report.pdf - study which used classification for >15 mins late

### Scope: 
As this is a proof of concept (POC), covering every airport is out of scope. This project will be a binary classification of flights leaving Heathrow (1 = cancelled, 0 = not cancelled).


### Steps:
1. Create a dataset with our positive outcome variable: cancelled flights from Heathrow  
    a. Combine it with another dataset with our negative outcome variable: non-cancelled flights
2. Exploratory data analysis

3. feature engineering (see below for ongoing ideas if time permits)

4. Model selection

5. Model performance assessment
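
Step 1 can be sketched in plain Python as follows. The records and field values here are made up for illustration; in the notebook the same idea would be applied to the dataframes returned by the API.

```python
# Hypothetical records; real rows would come from the API responses.
cancelled = [{"ident": "AAA100", "destination": "KORD"}]
departed = [
    {"ident": "AAA200", "destination": "KDFW"},
    {"ident": "BBB300", "destination": "HECA"},
]

def label_rows(rows, cancelled_flag):
    """Return copies of the rows with a binary 'cancelled' target added."""
    return [{**row, "cancelled": cancelled_flag} for row in rows]

# Positive class (1 = cancelled) followed by the negative class (0 = not cancelled).
dataset = label_rows(cancelled, 1) + label_rows(departed, 0)

positives = sum(row["cancelled"] for row in dataset)
print(f"{len(dataset)} rows, {positives} cancelled")
```

Counting the positives at this point gives an early view of how imbalanced the two classes are.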


### Feature ideas:

1. Weather predicted the day/week before - numerical score, encoded score

2. Sentiment analysis __from__ the Heathrow leadership board's tweets.

3. Sentiment analysis __about__ Heathrow

4. BA union website: crawler for mentions of #strikes

5. Do # of daily covid infections affect flight cancellations? https://data.london.gov.uk/api/table/s8c9t_j4fs2

### Modelling:
1. Depending on time/scope - we will examine whether the dataset can be used for logistic regression. 
2. Assess other models for appropriateness

Libraries

In [2]:
import numpy as np
import requests
import json
import datetime
import time
import os
import keys # saved my API key and secret and added to gitignore
import ft_keys
import pandas as pd # pd.json_normalize is used below, so the deprecated pandas.io.json import is not needed

In [3]:
os.chdir('C:\\Users\\ellio\\Desktop\\Python\\scripts\\ML_flight_cancellation_prediction\\1- API_scrape_dataset_build')
os.getcwd()

'C:\\Users\\ellio\\Desktop\\Python\\scripts\\ML_flight_cancellation_prediction\\1- API_scrape_dataset_build'

### Solution 1 - the flightaware API

##### Subscription required
I have set up a developer account at:
https://uk.flightaware.com/commercial/flightxml/ 

This is a paid service; however, it shouldn't be too expensive to build a POC dataset for initial modelling.

Drawbacks: 
- Times are given in UNIX seconds (since 1970): annoying, but ultimately not a big deal.
- There is a lot of information and many endpoints here, so the solution may take the form of several queries and joins.

Positives:
- there is a very active user community to keep the developers honest here: https://discussions.flightaware.com/ 
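
The UNIX-seconds drawback noted above is easy to work around. The sketch below uses only the standard library and assumes midnight UTC is the intended boundary for each date; the helper names `to_epoch` and `from_epoch` are illustrative, not part of any API.

```python
from datetime import datetime, timezone

def to_epoch(date_string, pattern="%Y-%m-%d"):
    """Convert a date string to UNIX seconds, treating it as midnight UTC."""
    moment = datetime.strptime(date_string, pattern).replace(tzinfo=timezone.utc)
    return int(moment.timestamp())

def from_epoch(seconds):
    """Convert UNIX seconds back to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

print(to_epoch("2020-12-25"))              # 1608854400
print(from_epoch(1609077724).isoformat())  # 2020-12-27T14:02:04+00:00
```

Pinning the conversion to UTC keeps the result independent of the machine's local timezone, which `time.mktime` does not.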

In [145]:
# Define the values we will need for the API

#imported from the keys file, added to .gitignore
username = keys.username 
apiKey = keys.apiKey

#get today's values
today = datetime.date.today() # datetime is imported as a module, so date must be qualified
str_today = str(today)
pattern = '%Y-%m-%d'
epoch_today = int(time.mktime(time.strptime(str_today, pattern))) # local midnight in UNIX seconds

#tomorrow's values
tomorrow = today + datetime.timedelta(days = 1)
str_tomorrow = str(tomorrow)
epoch_tomorrow = int(time.mktime(time.strptime(str_tomorrow, pattern)))

Set maximum query payload

In [104]:
fxmlUrl = "https://flightxml.flightaware.com/json/FlightXML2/" # Our URL

endpoint = 'SetMaximumResultSize' # once set, this limit applies to the account until it is changed again

payload = {'max_size': 150} # set number

response = requests.get(fxmlUrl + endpoint, params=payload, auth=(username, apiKey))

if response.status_code == 200:
    print("Max size response set")
else:
    print("Error executing request")

Max size response set


Set up the API call with `howMany` set to 2 so as not to spend too much $$$

In [134]:
# Set up the API call - parameters into the payload and glue the URL request together
fxmlUrl = "https://flightxml.flightaware.com/json/FlightXML2/" # Our URL
endpoint = 'SearchBirdseyeInFlight' # Our endpoint, full documentation: https://flightaware.com/commercial/flightxml/explorer/#op_SearchBirdseyeInFlight 

payload = {'startDate': 1608654400, 'endDate': 1609027200, 'true': 'cancelled', 'orig': 'EGLL', 'howMany': 2}

payload_ls = []

response = requests.get(fxmlUrl + endpoint, params=payload, auth=(username, apiKey))

if response.status_code == 200:
    payload_ls.append(response.json())
    cancelled_flights = pd.json_normalize(payload_ls[0]['SearchBirdseyeInFlightResult']['aircraft'])
    cancelled_flights['cancelled'] = 1
    
    print("Request is good and dataframe returned")
else:
    print("Error executing request")

Request is good and dataframe returned


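The payload seems to include a lot of extra small flights, and the two rows returned below originate at KCLT and TJSJ rather than EGLL. This needs to be rectified: enquire whether results can be restricted to heavy or commercial flights only, and check the endpoint documentation linked above, since the filters may need to be passed in a different form.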

In [135]:
cancelled_flights

Unnamed: 0,faFlightID,ident,prefix,type,suffix,origin,destination,timeout,timestamp,departureTime,...,highLongitude,highLatitude,groundspeed,altitude,heading,altitudeStatus,updateType,altitudeChange,waypoints,cancelled
0,JIA5373-1608615956-airline-0252,JIA5373,,CRJ7,,KCLT,KCRW,0,0,0,...,-200.0,-200.0,0,0,0,,,,35.21 -80.95 35.35 -81 35.4 -81.02 35.53 -81.0...,1
1,KAP2551-1608618316-airline-0333,KAP2551,,CNA,,TJSJ,TISX,0,0,0,...,-200.0,-200.0,0,0,0,,,,,1


Annoyingly, the cancelled flights don't come with dates (`timestamp` and `departureTime` are 0 above), so we will have to take their flight IDs and re-query the API. This is going to be expensive if we need to create a large dataset, so we might cap the whole thing at 1,500 rows and build a shell model.

In [None]:
# API call to get details (including dates) for the cancelled flights found above

fxmlUrl = "https://flightxml.flightaware.com/json/FlightXML2/" # Our URL

endpoint = 'FlightInfoEx'

# Assumption, not yet run: FlightInfoEx looks up one flight by ident or faFlightID and
# returns it under 'FlightInfoExResult' -> 'flights'. Check the FlightXML2 docs first.
payload_ls = []

for flight_id in cancelled_flights['faFlightID']:
    payload = {'ident': flight_id, 'howMany': 1}
    response = requests.get(fxmlUrl + endpoint, params=payload, auth=(username, apiKey))

    if response.status_code == 200:
        payload_ls.append(pd.json_normalize(response.json()['FlightInfoExResult']['flights']))
    else:
        print("Error executing request for " + flight_id)

if payload_ls:
    cancelled_info_df = pd.concat(payload_ls, ignore_index=True)
    print("Requests are good and dataframe returned")

In [105]:
# Get all departed flights:

fxmlUrl = "https://flightxml.flightaware.com/json/FlightXML2/" # Our URL

endpoint = 'Departed'

payload = {'startDate': 1608854400, 'endDate': 1609027200, 'airport': 'EGLL', 'howMany': 150}

payload_ls = []

response = requests.get(fxmlUrl + endpoint, params=payload, auth=(username, apiKey))

if response.status_code == 200:
    payload_ls.append(response.json())
    fs_df = pd.json_normalize(payload_ls[0]['DepartedResult']['departures'])
    print("Request is good and dataframe returned")
else:
    print("Error executing request")

Request is good and dataframe returned


In [118]:
fs_df.head(4)

Unnamed: 0,ident,aircrafttype,actualdeparturetime,estimatedarrivaltime,actualarrivaltime,origin,destination,originName,originCity,destinationName,destinationCity
0,BAW297,B789,1609077724,1609107840,0,EGLL,KORD,London Heathrow,"London, England",Chicago O'Hare Intl,"Chicago, IL"
1,EIN885,,1609077630,0,0,EGLL,,London Heathrow,"London, England",,
2,BAW193,B789,1609077486,1609111980,0,EGLL,KDFW,London Heathrow,"London, England",Dallas-Fort Worth Intl,"Dallas-Fort Worth, TX"
3,MSR5531,A332,1609077406,1609093906,0,EGLL,HECA,London Heathrow,"London, England",Cairo Int'l,Cairo


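We need the clean names of the airlines. I have scraped the details from an IATA table on Wikipedia (see 1.1). The first three characters of the `ident` field (e.g. `BAW`) can be used as the identifier for this information. Note that these three-letter prefixes are ICAO designators (IATA codes have two characters), so the join should use the ICAO column.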
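
The prefix lookup described above can be sketched as follows. The `AIRLINES` dictionary is a tiny hand-written stand-in for the scraped table, and `airline_code` is an illustrative helper; idents that do not start with three letters followed by a digit (for example aircraft registrations) are left unmatched.

```python
import re

# Tiny hand-written lookup for illustration; the real table is the scraped one.
AIRLINES = {"BAW": "British Airways", "EIN": "Aer Lingus", "MSR": "Egyptair"}

def airline_code(ident):
    """Return the leading three-letter code of an ident, or None if absent."""
    match = re.match(r"[A-Z]{3}(?=\d)", ident)
    return match.group(0) if match else None

for ident in ["BAW297", "EIN885", "MSR5531", "GABCD"]:
    code = airline_code(ident)
    print(ident, code, AIRLINES.get(code, "unknown"))
```

Returning `None` for unmatched idents makes it easy to count how many flights fall outside the airline table before joining.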


# Problem encountered:

The FlightAware API isn't returning the cancelled flights from the requested location. Furthermore, the dates of the flights aren't included.

### Options:
- Find a unique identifier and join with a dataset that has the dates included.
- Find another source of cancelled flights.


### I have found a new API that may have what we need
https://aviation-edge.com/premium-api/

This API has flight cancellations at the individual flight level and can be filtered by airport, which looks promising.

##### Subscription required
I have set up a developer account. This is a paid service; however, it shouldn't be too expensive to build a POC dataset for initial modelling.

### Limitations:

- The maximum number of days returned per request is 30.
- Historical data only goes back to May 2020, which may leave us with a small dataset.

## Fix:
 - Create 30-day intervals using a `date_from` and a `date_to` list. Iterate over all of them and concatenate the results into one dataframe.
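
Rather than typing the interval lists by hand, they could be generated. This is a minimal sketch using only the standard library; `date_windows` is an illustrative helper, and it assumes the 30-day limit counts both the start and end date, which is the conservative reading.

```python
from datetime import date, timedelta

def date_windows(start, end, max_days=30):
    """Split [start, end] into consecutive inclusive windows of at most max_days days."""
    windows = []
    window_start = start
    while window_start <= end:
        window_end = min(window_start + timedelta(days=max_days - 1), end)
        windows.append((window_start.isoformat(), window_end.isoformat()))
        window_start = window_end + timedelta(days=1)
    return windows

windows = date_windows(date(2020, 5, 15), date(2020, 12, 12))
date_from = [w[0] for w in windows]
date_to = [w[1] for w in windows]
print(len(windows), windows[0], windows[-1])
```

Each window starts the day after the previous one ends, so no date is requested twice and none is skipped.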

In [225]:
# Set new API parameters, rename values in case we decide to run and use both

url = 'https://aviation-edge.com/v2/public/flightsHistory?key=' # Our URL

key = ft_keys.ft_apikey # import keys from local script - added to gitignore

payload = '&code=LHR&type=departure' # London Heathrow and departures only

# start and end date of each request window (the API returns at most 30 days per request)
date_from = ['2020-05-15', '2020-06-13', '2020-07-13', '2020-08-13', '2020-09-13', '2020-10-13', '2020-11-13']

date_to = ['2020-06-12', '2020-07-12', '2020-08-12', '2020-09-12', '2020-10-12', '2020-11-12', '2020-12-12']

DictionaryOfDataFrames = {} # one dataframe per date window

for i in range(len(date_from)):
    response = requests.get(url + str(key) + payload + '&date_from=' + date_from[i] + '&date_to=' + date_to[i])

    if response.status_code == 200:
        # normalise this response directly so a failed request cannot misalign the indices
        DictionaryOfDataFrames["group" + str(i)] = pd.json_normalize(response.json())

        print("Request is good and dataframe " + str(i) + " returned")

    else:
        print("Error executing request")

#concatenate to one dataframe
flight_schedule_dataframe = pd.concat(DictionaryOfDataFrames.values(), ignore_index=True)

Request is good and dataframe 0 returned
Request is good and dataframe 1 returned
Request is good and dataframe 2 returned
Request is good and dataframe 3 returned
Request is good and dataframe 4 returned
Request is good and dataframe 5 returned
Request is good and dataframe 6 returned


In [226]:
flight_schedule_dataframe.shape

(19063, 37)

~19k rows: not bad, and definitely enough to do basic modelling on. Save as a parquet file and move on to EDA.

In [None]:
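# save the concatenated dataframe built above (to_parquet needs pyarrow or fastparquet installed)
flight_schedule_dataframe.to_parquet('1.1-.parquet.gzip',
                                     compression='gzip')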