# Predicting Flight Ticket Prices for JetBlue in New York

**Notebook Authors**:
- Chris Kuzemka
- Irene Chau
- Ali Nazim
- Yurica Xu
- Aqib Rahim

## Imports & Notebook Functions

In [5]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## Data Builder

In [9]:
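# Cleaned working dataframe produced by the project's cleaning script
base_df = pd.read_csv("../data/cleaned_jetblue_df.csv")
# Uncleaned JetBlue subset filtered from the Kaggle itineraries data
jetblue_df = pd.read_csv("../data/jetblue_df.csv")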

# Table of Contents

This section will list the sections of this notebook, with the relevant authors noted for each.

## Data Overview

| Variable Name | Expected Data Type | Description |
|--------------|-------------------|-------------|
| searchDate | datetime | Date when the flight search was performed |
| searchDayOfWeek | int/string | Day of the week when search was performed |
| route | string | Flight route code |
| flightDate | datetime | Scheduled date of the flight |
| flightDayOfWeek | int/string | Day of the week of the flight |
| startingAirport | string | Departure airport code |
| destinationAirport | string | Arrival airport code |
| elapsedDays | int | Number of days between search date and flight date |
| isBasicEconomy | boolean | Whether the ticket is basic economy class |
| isRefundable | boolean | Whether the ticket is refundable |
| isNonStop | boolean | Whether the flight is non-stop |
| baseFare | float | Base ticket price before taxes and fees |
| totalFare | float | Total price including all taxes and fees |
| seatsRemaining | int | Number of seats available |
| totalTravelDistance | float | Total flight distance |
| segmentsDepartureTimeRaw | datetime | Raw departure time for flight segments |
| segmentsArrivalTimeRaw | datetime | Raw arrival time for flight segments |
| segmentsArrivalAirportCode | string | Airport code for segment arrival |
| segmentsDepartureAirportCode | string | Airport code for segment departure |
| segmentsAirlineName | string | Name of the airline |
| segmentsDurationInSeconds | int | Duration of flight segments in seconds |
| segmentsCabinCode | string | Cabin class code |
| departureTime | datetime | Flight departure time |
| arrivalTime | datetime | Flight arrival time |
| departureCategory | string | Category of departure time (e.g., morning/afternoon) |
| arrivalCategory | string | Category of arrival time (e.g., morning/afternoon) |
| isHolidaySearchDate | boolean | Whether the search date is a holiday |
| isHolidayFlightDate | boolean | Whether the flight date is a holiday |
| nearHolidaySearchDate | boolean | Whether the search date is near a holiday |
| nearHolidayFlightDate | boolean | Whether the flight date is near a holiday |
| searchDateInt | int | Integer representation of search date |
| flightDateInt | int | Integer representation of flight date |
| daysLeft | int | Days remaining until flight departure |
| numStops | int | Number of stops in the flight route |
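The expected data types in the table can be checked against what `pd.read_csv` actually infers. The following is a minimal sketch, not the project's cleaning script: `EXPECTED_KINDS` and `coerce_to_expected` are hypothetical names, only a handful of columns from the table are covered, and the sample values are made up for illustration.

```python
import pandas as pd

# Hypothetical subset of the table above: column name -> expected kind.
EXPECTED_KINDS = {
    "searchDate": "datetime",
    "flightDate": "datetime",
    "elapsedDays": "int",
    "isNonStop": "bool",
    "totalFare": "float",
    "startingAirport": "string",
}

def coerce_to_expected(df, expected=EXPECTED_KINDS):
    """Return a copy of df with the listed columns coerced to their expected kinds."""
    out = df.copy()
    for col, kind in expected.items():
        if col not in out.columns:
            continue  # skip columns that are absent from this frame
        if kind == "datetime":
            out[col] = pd.to_datetime(out[col], errors="coerce")
        elif kind == "int":
            # Nullable integer dtype so missing values do not force floats
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("Int64")
        elif kind == "float":
            out[col] = pd.to_numeric(out[col], errors="coerce").astype("float64")
        elif kind == "bool":
            out[col] = out[col].astype("boolean")
        elif kind == "string":
            out[col] = out[col].astype("string")
    return out

# Made-up rows, only to show the coercion in action
sample = pd.DataFrame({
    "searchDate": ["2022-04-16", "2022-04-17"],
    "flightDate": ["2022-04-20", "2022-05-01"],
    "elapsedDays": [0, 1],
    "isNonStop": [True, False],
    "totalFare": ["248.60", "103.10"],
    "startingAirport": ["JFK", "LGA"],
})
typed = coerce_to_expected(sample)
print(typed.dtypes)
```

Running the same function on `base_df` and comparing `typed.dtypes` to the table would be one way to spot columns that were read as plain `object` strings.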

(discussion of the data's integrity and composition)

# Data Exploration and Cleaning

## Reading Kaggle Itineraries 

**Christopher Kuzemka**

The original project began with a roughly 30GB dataset from Kaggle. To conserve space, the processing notebook used to explore this master dataset is no longer available, as the dataset was largely discarded in favor of computational efficiency. For the purpose of this section, we will still briefly discuss the content of Kaggle's itineraries dataset, referencing the [project's data source](https://www.kaggle.com/datasets/dilwong/flightprices). The source itself states that there are only 5.9M columns (presumably meaning rows), yet our group had to reduce the dataset heavily and we still ended up with a final dataframe of approximately 3.9M rows and a file size of approximately 1GB. The numbers do not add up, and we believe Kaggle (or the author) made an error here.

Regardless, the original 30GB dataframe included itineraries from a variety of airlines, scattered throughout the file. Our quick approach to finding a workable dataset was to filter the itineraries dataframe down to a single airline. The choice came down to three candidate airlines:

- American Airlines (~10-12GB)
- Delta (~4GB)
- JetBlue (~1GB)

Note that the file sizes shown above are approximations, as the original files were later deleted and the notebook associated with them was also removed.
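Since the original processing notebook is gone, the following is only a sketch of how a per-airline subset could be cut from a file too large to load at once. It reads the CSV in chunks and keeps rows whose `segmentsAirlineName` mentions the airline. The function name `filter_airline`, the substring match, and the toy CSV (including its `legId` column) are assumptions for illustration, not the filter the group actually ran.

```python
import io
import pandas as pd

def filter_airline(csv_source, airline, chunksize=1_000_000):
    """Stream a large CSV in chunks and keep rows that mention `airline`."""
    kept = []
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        names = chunk["segmentsAirlineName"].astype("string")
        # Plain substring match; missing airline names count as non-matches
        mask = names.str.contains(airline, regex=False, na=False)
        kept.append(chunk[mask])
    return pd.concat(kept, ignore_index=True)

# Toy stand-in for the itineraries file (values are made up)
toy_csv = io.StringIO(
    "legId,segmentsAirlineName,totalFare\n"
    "a,JetBlue Airways,100.0\n"
    "b,Delta,200.0\n"
    "c,JetBlue Airways||JetBlue Airways,150.0\n"
    "d,,90.0\n"
)
jetblue_only = filter_airline(toy_csv, "JetBlue", chunksize=2)
print(jetblue_only)
```

Because only one chunk is held in memory at a time, peak memory depends on `chunksize` and on the size of the filtered result rather than on the full file.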

In the next section we talk about the chosen dataframe. 

In [12]:
print(f"The shape of our cleaned JetBlue df is: {base_df.shape}")

print(f"The shape of our uncleaned JetBlue df is: {jetblue_df.shape}")

print(f"The size of our cleaned JetBlue df is: {base_df.size}")

print(f"The size of our uncleaned JetBlue df is: {jetblue_df.size}")

The shape of our cleaned JetBlue df is: (3929953, 34)
The shape of our uncleaned JetBlue df is: (6824440, 28)
The size of our cleaned JetBlue df is: 133618402
The size of our uncleaned JetBlue df is: 191084320
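Note that `DataFrame.size` reports the number of cells (rows times columns), not the size on disk or in memory. For the cleaned dataframe, 3,929,953 rows times 34 columns gives 133,618,402 cells, matching the output above. A small sketch on a made-up frame shows the difference between `.size` and an in-memory footprint obtained from `memory_usage`:

```python
import pandas as pd

# Made-up frame, only to contrast cell count with memory footprint
toy = pd.DataFrame({
    "startingAirport": ["JFK", "LGA", "JFK"],
    "totalFare": [98.6, 143.1, 329.0],
})

rows, cols = toy.shape
n_cells = toy.size  # rows * columns
# deep=True also counts the Python string objects held in object columns
mem_bytes = int(toy.memory_usage(deep=True).sum())

print(f"shape: {toy.shape}, cells: {n_cells}, memory: {mem_bytes} bytes")
```

Calling `base_df.memory_usage(deep=True).sum()` would give a figure that can be compared more directly with the file sizes quoted in this section.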


## JetBlue Focus

**Christopher Kuzemka**

JetBlue was the winning dataframe chiefly because the dataset was small enough to process yet still rich enough in rows for our project (again, approximately 3.9M rows after cleaning). We organized our cleaning process in a cleaning script found in our project's `toolkit` folder. The script took the approximately 2GB uncleaned JetBlue dataset and converted it into the cleaned JetBlue data we have today.

The uncleaned data is held in the variable `jetblue_df` and the cleaned data in the variable `base_df` (the name signifies a working dataframe).

## Cleaning

(little nuances) <- nulls, useless features, 

- nulls 

(problems with labeling and constraints) <- made up information or missing information or inconsistent labeling lacking data

(outliers and why they are outliers)
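As a placeholder for the points above, here is a hedged sketch of two common checks: a per-column null report and an interquartile-range (IQR) rule for flagging outliers. The IQR rule is a generic technique swapped in for illustration and is not necessarily what the project's cleaning script does. `null_report`, `iqr_outlier_mask`, and the toy values are hypothetical.

```python
import pandas as pd

def null_report(df):
    """Count and percentage of missing values per column."""
    counts = df.isna().sum()
    pct = (counts / len(df) * 100).round(2)
    return pd.DataFrame({"n_null": counts, "pct_null": pct})

def iqr_outlier_mask(series, k=1.5):
    """True where a value lies more than k * IQR outside the quartiles."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values, only to exercise the two helpers
toy = pd.DataFrame({
    "totalFare": [100.0, 110.0, 120.0, 130.0, 2000.0],
    "totalTravelDistance": [200.0, None, 950.0, 1000.0, 2500.0],
})
report = null_report(toy)
mask = iqr_outlier_mask(toy["totalFare"])
print(report)
print(toy[mask])
```

A flagged fare is not automatically an error. Whether it is dropped, capped, or kept is a modeling decision that belongs in the outlier discussion.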



## Feature Engineering and KPIs

(showoff some of the features created and plotting)
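To make the engineered columns in the Data Overview table concrete, the sketch below derives two of them, `daysLeft` and `numStops`, from raw fields. It rests on two assumptions that the notebook does not confirm: that `daysLeft` is the calendar-day gap between `searchDate` and `flightDate`, and that multi-leg itineraries store their segment codes in one string joined by `||`. The helper name `add_basic_features` and the toy rows are hypothetical.

```python
import re
import pandas as pd

def add_basic_features(df, sep="||"):
    """Add daysLeft and numStops columns (assumed definitions, see lead-in)."""
    out = df.copy()
    search = pd.to_datetime(out["searchDate"])
    flight = pd.to_datetime(out["flightDate"])
    # Assumption: days remaining = flight date minus search date
    out["daysLeft"] = (flight - search).dt.days
    # Assumption: one separator between consecutive segments, so the
    # number of separators equals the number of stops
    out["numStops"] = out["segmentsArrivalAirportCode"].str.count(re.escape(sep))
    return out

# Made-up itineraries: one non-stop, one with a single stop
toy = pd.DataFrame({
    "searchDate": ["2022-04-16", "2022-04-17"],
    "flightDate": ["2022-04-26", "2022-04-18"],
    "segmentsArrivalAirportCode": ["BOS", "BOS||MCO"],
})
feats = add_basic_features(toy)
print(feats[["daysLeft", "numStops"]])
```

Comparing the derived values with the existing `daysLeft` and `numStops` columns in `base_df` would show quickly whether these assumed definitions match the cleaning script.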

# Model Preprocessing

(re-discussion of any points on how the model might be influenced based on what we know)

(model encoding strategies)



# Modeling

(preliminary models)

(discussion of observations and failures)

(showing off different models)

(discussion and showcasing of best models)

# Model Application

(discuss how the model is used in an applicable setting)

(consider discussions on how it can be improved)

(discuss how the scope could be increased to other airports and routes or markets if applicable)