# Determining Airline Prices
By: Christopher Kuzemka : [Github](https://git.generalassemb.ly)

## Problem Statement

Aviation is one of the largest industries in our global market today. Commercial aviation has made it possible for people to connect with each other in ways that may have been unimaginable over a century ago. However, a great deal of planning goes into the FAA standards that modern planes must meet and the routes they must fly to make such connections possible.

Consider the case where a startup airline, known as "Kruze", wants to establish itself as a top competitor against existing airlines. Part of this startup process focuses on understanding the costs that come into play when managing flights. Our job as data scientists is to help Kruze determine the minimum price the airline must charge its passengers, on a ticket-class basis, in order to at least break even. To do this, we are going to use existing flight routes (velocity and altitude data), existing data on jet fuel pricing, and existing flight ticket prices (as the prediction target) to help us create a supervised learning model.
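The break-even idea can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name `break_even_prices`, the per-class price multipliers, and every number in it are hypothetical and do not come from the study's data. It assumes each ticket class is priced as a fixed multiple of a base fare and solves for the base fare at which revenue equals the cost of one flight.

```python
def break_even_prices(flight_cost, seats, weights):
    """Return the per-class ticket price at which revenue equals cost.

    flight_cost: total cost of operating one flight, in dollars
    seats:       seats sold per ticket class, e.g. {"economy": 150}
    weights:     relative price of each class (hypothetical multipliers)
    """
    # Revenue = base * sum(seats[c] * weights[c]); set revenue == cost, solve for base.
    weighted_seats = sum(seats[c] * weights[c] for c in seats)
    if weighted_seats <= 0:
        raise ValueError("at least one weighted seat is required")
    base = flight_cost / weighted_seats
    return {c: base * weights[c] for c in seats}


# Illustrative numbers only; none of these come from the study's data.
seats = {"economy": 150, "first": 20}
weights = {"economy": 1.0, "first": 2.5}
prices = break_even_prices(20_000.0, seats, weights)

# Selling every seat at these prices recovers exactly the flight cost.
revenue = sum(prices[c] * seats[c] for c in seats)
print(prices, revenue)
```

Any price above these values yields a profit under the same assumptions; any price below them yields a loss.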

To start, we will approach the project as a minimum proof of concept. Accordingly, we will impose some limitations on our study and reduce the potential for scope creep by:

- conducting an idealized thermal jet propulsion cycle for feature engineering purposes (focusing on an open Brayton cycle in particular)
- analyzing domestic U.S. flight route data, choosing up to 3 routes of varying sizes and including their reverse flight paths as data inputs as well:
    - **Houston, TX** to **Los Angeles, CA** (IAH - LAX)
    - **New York City, NY** to **Miami, FL** (JFK - MIA)
    - **Portland, OR** to **Chicago, IL** (PDX - ORD)
- assuming air to be treated as an ideal gas
- assuming operating engine conditions to be steady state
- assuming kinetic energy and potential energy to be negligible in our system, except at the inlet and exit conditions of the jet engine itself
- assuming atmospheric temperature, pressure, and air density to be an averaged value between 0 and 15,000 meters altitude
- assuming data incorporating head or tail wind effects to be negligible
- assuming passenger weight to be negligible
- assuming external costs from the study (including food/maintenance/crew salary) to be negligible
- using price data from future flights as opposed to previous flights as previous flight pricing is not readily available
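To illustrate the idealized Brayton cycle named in the list above, here is a minimal sketch of its textbook thermal efficiency under the cold-air-standard assumption (ideal gas, constant specific heats). The ratio of specific heats `gamma = 1.4` and the pressure ratio of 10 are assumptions chosen for illustration and are not values taken from this study.

```python
GAMMA_AIR = 1.4  # ratio of specific heats for air (cold-air-standard assumption)


def brayton_efficiency(pressure_ratio, gamma=GAMMA_AIR):
    """Thermal efficiency of an ideal Brayton cycle: 1 - r_p^(-(gamma-1)/gamma)."""
    if pressure_ratio < 1.0:
        raise ValueError("compressor pressure ratio must be at least 1")
    return 1.0 - pressure_ratio ** (-(gamma - 1.0) / gamma)


def compressor_exit_temperature(t_inlet_k, pressure_ratio, gamma=GAMMA_AIR):
    """Temperature (K) after isentropic compression of an ideal gas."""
    return t_inlet_k * pressure_ratio ** ((gamma - 1.0) / gamma)


# Hypothetical operating point: pressure ratio of 10, inlet air at 250 K.
eta = brayton_efficiency(10.0)
t2 = compressor_exit_temperature(250.0, 10.0)
print(f"ideal thermal efficiency: {eta:.3f}, compressor exit temperature: {t2:.1f} K")
```

In the ideal cycle the efficiency depends only on the pressure ratio and `gamma`, which is part of why it is convenient for feature engineering; real engines fall short of this figure.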


All of the assumptions listed above are set to allow us to achieve (or attempt to achieve) our goal within a limited time frame, as Kruze requires an answer from us quickly! With this in mind, we will discuss how such assumptions can contribute to error throughout our study, and remind ourselves that integrating the neglected features in future work may be very beneficial in achieving a stronger prediction. Conducting an idealized thermal engine analysis will help us understand the average power output of a given plane's engines throughout the different phases of its flight. Routes chosen across a variety of times and seasons will also help us determine how such elements play a role in pricing. Finally, some plane specifications (including aircraft type, number of seats, and type/number of engines) will allow us to consider any extra technical factors for ticket pricing.
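One of the assumptions averages atmospheric temperature, pressure, and air density between 0 and 15,000 meters. The source does not say which atmosphere model is used, so the sketch below assumes the International Standard Atmosphere (a constant lapse rate up to 11 km, isothermal above it) and takes a simple mean over evenly spaced altitudes. The function names and the sampling step are hypothetical.

```python
import math

T0 = 288.15         # ISA sea-level temperature, K
P0 = 101325.0       # ISA sea-level pressure, Pa
LAPSE = 0.0065      # tropospheric temperature lapse rate, K/m
H_TROPO = 11000.0   # top of the troposphere in the ISA model, m
G = 9.80665         # standard gravity, m/s^2
R_AIR = 287.05      # specific gas constant of dry air, J/(kg K)


def isa_state(h):
    """Return (temperature K, pressure Pa, density kg/m^3) at altitude h (0 to 20 km)."""
    h_low = min(h, H_TROPO)
    t = T0 - LAPSE * h_low
    p = P0 * (t / T0) ** (G / (R_AIR * LAPSE))
    if h > H_TROPO:
        # Isothermal layer: pressure decays exponentially with altitude.
        p *= math.exp(-G * (h - H_TROPO) / (R_AIR * t))
    rho = p / (R_AIR * t)  # ideal gas law
    return t, p, rho


def altitude_average(h_max=15000.0, steps=1500):
    """Mean temperature, pressure, and density over evenly spaced altitudes."""
    samples = [isa_state(h_max * i / steps) for i in range(steps + 1)]
    return tuple(sum(s[k] for s in samples) / len(samples) for k in range(3))


mean_t, mean_p, mean_rho = altitude_average()
print(f"mean T = {mean_t:.1f} K, mean p = {mean_p:.0f} Pa, mean rho = {mean_rho:.3f} kg/m^3")
```

Comparing these averages with the values at a typical cruise altitude gives a rough sense of how much error the averaging assumption may introduce.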

As we are working with what is considered to be a continuous variable, we will analyze common price trends using supervised regression models, such as Linear Regression, SVR, AdaBoost Regression, Gradient Boosting Regression, and KNN Regression. (Logistic Regression and Naive Bayes are classifiers, so they would only apply if price were binned into discrete classes.) We will ultimately use the Mean Absolute Error of our predictions to gauge how well the selected model predicts price, and discuss what issues may arise from the limitations of this study.
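As a reminder of what the Mean Absolute Error measures, here is a minimal sketch in plain Python. The prices and predictions are made up for illustration, and the helper name `mae` is hypothetical; in practice a library implementation would normally be used instead.

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: the average of |actual - predicted|."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up ticket prices in dollars (not from the study's data).
actual = [120.0, 250.0, 310.0, 180.0]
model_predictions = [130.0, 240.0, 300.0, 200.0]

# Baseline that always predicts the mean price; a useful model should beat it.
mean_price = sum(actual) / len(actual)
baseline_predictions = [mean_price] * len(actual)

model_mae = mae(actual, model_predictions)
baseline_mae = mae(actual, baseline_predictions)
print(f"model MAE: ${model_mae:.2f}, baseline MAE: ${baseline_mae:.2f}")
```

Because the MAE is expressed in the same units as the target, it reads directly as the typical dollar error per ticket.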



## Executive Summary

## Table of Contents
[1.00 Data Loading](#1.00-Data-Loading)

[2.00 Data Cleaning and Analysis](#2.00-Data-Cleaning-and-Moderate-Analysis)

- [2.01 Quick Check](#2.01-Quick-Check)

- [2.02 Data Documentation Exploration](#2.02-Data-Documentation-Exploration)

- [2.03 Cleaning](#2.03-Cleaning)

- [2.04 Exploratory Data Analysis and Visualization](#2.04-Exploratory-Data-Analysis-and-Visualization)

[3.00 Machine Learning Modeling and Visualization](#3.00-Machine-Learning-Modeling-and-Visulalization)

- [3.01 Model Preparation](#3.01-Model-Preparation)

- [3.02 Modeling](#3.02-Modeling)

- [3.03 Model Selection](#3.03-Model-Selection)

- [3.04 Model Evaluation](#3.04-Model-Evaluation)

[4.00 Conclusions](#4.00-Conclusions)

[5.00 Sources and References](#5.00-Sources-and-References)

## Data Dictionary

# 1.00 Data Loading

In [2]:
import pandas as pd #imports the pandas package
import numpy as np #imports the numpy package
import matplotlib.pyplot as plt #imports the matplotlib plotting package
import seaborn as sns #imports the seaborn package

import json #imports the json package

## 1.01 Flight Tracking Data

In [3]:
current_flights = pd.read_csv('../data/current_flights.csv') #reads the current_flights csv
flight_combinations = pd.read_csv('../data/flight_combinations.csv') #reads the flight_combinations csv
flight_schedules = pd.read_csv('../data/flight_schedules.csv') #reads the flight_schedules csv

## 1.02 Pricing Data

In [4]:
monthly_pricing_2021 = pd.read_csv('../data/2021_monthly_pricing.csv') #reads the 2021_monthly_pricing csv
may_pricing_per_flight = pd.read_csv('../data/may_pricing_per_flight.csv') #reads the may_pricing_per_flight csv
may2020_to_june2021_monthlyprice = pd.read_csv('../data/may2020_to_june2021_monthlyprice.csv') #reads the may2020_to_june2021_monthlyprice csv

## 1.03 Additional Relevant Data

In [9]:
tsa_checkpoint_travel = pd.read_excel('../data/tsa_checkpoint_travel.xlsx', sheet_name = 'Sheet1', index_col = None, usecols = 'A:C') #reads the tsa_checkpoint_travel xlsx
tsa_confirmed_cases = pd.read_excel('../data/tsa_confirmed_cases.xlsx', sheet_name = 'Sheet1', index_col = None, usecols = 'A:E') #reads the tsa_confirmed_cases xlsx

# 2.00 Data Cleaning and Analysis

## 2.01 Quick Check

In [None]:
def quick_check(dataframe):
    """Prints the head, tail, shape, null presence, and info summary of a dataframe."""
    print("-------------------------------------------------------------------------------------------------")
    print("The head of your input dataframe is:")
    print(" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -")
    print(dataframe.head()) #checks the head of the dataframe
    print("-------------------------------------------------------------------------------------------------")
    print("The tail of your input dataframe is:")
    print(" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -")
    print(dataframe.tail()) #checks the tail of the dataframe
    print("-------------------------------------------------------------------------------------------------")
    print(f"The shape of the dataframe is {dataframe.shape[0]} rows and {dataframe.shape[1]} columns.") #checks the shape of the dataframe
    print("-------------------------------------------------------------------------------------------------")
    print("The below shows whether there exist nulls in our dataframe or not:")
    print(" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -")
    print(dataframe.isnull().any()) #checks each column of the dataframe for nulls
    print("-------------------------------------------------------------------------------------------------")
    print("The below shows the useful information to be aware of when exploring this input dataframe:")
    print(" - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -")
    dataframe.info() #prints column dtypes and non-null counts; info() prints directly and returns None, so it is not wrapped in print()

The above function is created to conveniently conduct a quick check on a dataframe for the reader/user. Through it, we will be able to see the __head__, __tail__, __shape__, __null presence__, and __important dataframe information__.

### Current Flights Data

In [42]:
quick_check(current_flights) #performs a quick check on the current_flights dataframe

-------------------------------------------------------------------------------------------------
The head of your input dataframe is:
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
   Unnamed: 0                      faFlightID   ident prefix  type  suffix  \
0           0       DAL333-1590465975-fa-0008  DAL333    NaN  A321     NaN   
1           1  KLM601-1590468354-airline-0005  KLM601    NaN  B77W     NaN   
2           2       VIR607-1590664542-ed-0002  VIR607    NaN  B789     NaN   
3           3       DAL702-1590465982-fa-0006  DAL702    NaN  A321     NaN   
4           4  ACA572-1590468353-airline-0278  ACA572    NaN  A319     NaN   

  origin destination  timeout   timestamp  ...  lowLatitude  highLongitude  \
0   KATL        KLAX        0  1590716390  ...     32.94676      -84.44664   
1   EHAM        KLAX        0  1590711509  ...     33.95142        4.71741   
2   EGLL        KLAX        0  1590711368  ...    

__Key takeaways from the above output:__

- The dataframe is wide: pandas wraps its columns across several blocks of printed output and marks each continuation with a `\` symbol.

- There is an `Unnamed: 0` column in our dataframe which is not necessary to include. We will remove this in our cleaning.

- Our dataframe contains nulls. 

- Most of the values in our dataframe are numerical. 
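The removal of the `Unnamed: 0` column mentioned above could look like the following sketch. It uses a tiny stand-in dataframe built from a few values visible in the output rather than the real `current_flights` file, and the variable names `sample`, `cleaned`, and `null_fraction` are hypothetical.

```python
import pandas as pd

# Tiny stand-in for current_flights, using two rows visible in the output above.
sample = pd.DataFrame({
    "Unnamed: 0": [0, 1],
    "ident": ["DAL333", "KLM601"],
    "prefix": [None, None],   # entirely null, as in the printed head
    "type": ["A321", "B77W"],
})

# Drop the leftover index column that was written out when the csv was saved.
cleaned = sample.drop(columns=["Unnamed: 0"])

# Fraction of nulls per column (0.0 = no nulls, 1.0 = all nulls).
null_fraction = cleaned.isnull().mean()
print(cleaned.columns.tolist())
print(null_fraction)
```

An alternative is to pass `index_col=0` to `pd.read_csv`, which treats that first column as the index at load time so it never appears as a data column.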

### Flight Combinations

In [41]:
quick_check(flight_combinations)

-------------------------------------------------------------------------------------------------
The head of your input dataframe is:
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
   Unnamed: 0 origin destination  0
0           0   CYHM        KJFK  1
1           1   CYUL        KORD  1
2           2   CYVR        KLAX  1
3           3   CYYZ        KIAH  2
4           4   CYYZ        KJFK  1
-------------------------------------------------------------------------------------------------
The tail of your input dataframe is:
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    Unnamed: 0 origin destination  0
55          55   KBUR        KPDX  1
56          56   KBWI        KPDX  1
57          57   KCVG        KPDX  1
58          58   KCVO        KPDX  2
59          59   KDEN        KPDX  3
----------------------------------------------------------------------------------

In [None]:
print(current_flights.isnull().mean().sort_values(ascending = False)) #shows the fraction of nulls in each column (0 to 1), largest first

## Conclusions and Future Work

For future work, consider incorporating weather data, randomized passenger weight data, the dynamic change in fuel/mass ratio throughout a flight, passenger demographic data, and more routes, as well as turning the study into an interactive UI tool rather than just a study.