# Final Project: Flight Price Prediction

According to the Bureau of Transportation Statistics, more than 853 million passengers traveled through U.S. airports in 2022, up from 658 million in 2021 and 388 million in 2020. Passenger numbers have risen each year over this period as the global aviation industry has expanded and demand for tourism has accelerated. Even so, the high cost of air travel keeps flying out of reach for many people. We hope to give potential passengers and airlines a clearer picture of market demand and the price of air travel.


In our project, we will answer the question:
     Can we find the cheapest flight price given certain criteria for a flight?

## Part 1: Exploratory Data Analysis

First, we import the libraries needed for data analysis and preprocess our [Kaggle dataset](https://www.kaggle.com/datasets/dilwong/flightprices).

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# include imports for machine learning here

# to create the complex graph network
import networkx as nx
# use pyvis to generate better-looking graphs
# from pyvis.network import Network

# chunk the data: use a predetermined chunk size when the dataset is loaded in
columns = ['flightDate', 'segmentsAirlineCode', 'totalFare', 'startingAirport', 'destinationAirport', 'segmentsAirlineName']


chunksize = 10000
# usecols limits parsing to the columns analyzed here, which keeps each chunk small
flight_price_dataset = pd.read_csv("data/itineraries.csv", usecols=columns, chunksize=chunksize)
# take only the first chunk for the weekday vs. weekend analysis
week_analysis_data = next(flight_price_dataset)
# week_analysis_data = week_analysis_data.sample(frac=1, random_state=42)
# display(week_analysis_data)
week_analysis_data = week_analysis_data[columns]
display(week_analysis_data)
# for data analysis of weekend prices vs. weekday prices
# only focus on a set starting point and destination point, or make a clear distinction in presentation
week_analysis_df = week_analysis_data[['segmentsAirlineCode', 'totalFare', 'flightDate', 'segmentsAirlineName']].copy()




week_analysis_df['flightDate'] = pd.to_datetime(week_analysis_df['flightDate'], format='%Y-%m-%d')
week_analysis_df['Weekday'] = week_analysis_df['flightDate'].dt.day_name()
# flag both Saturday and Sunday as the weekend (1 = weekend, 0 = weekday)
week_analysis_df['Weekend'] = week_analysis_df['Weekday'].isin(['Saturday', 'Sunday']).astype(int)

#make codes to correlate between the airline and the full name


agg_data = week_analysis_df.groupby(['segmentsAirlineCode', 'Weekend'])['totalFare'].mean().reset_index()

# Plot the aggregated data
plt.subplots(figsize=(10, 6))
sns.barplot(data=agg_data, x='segmentsAirlineCode', y='totalFare', hue='Weekend', color='orange')
plt.xlabel("Airline")
plt.xticks(rotation=45)
plt.ylabel("Average Price")
plt.title("Average Flight Price on Weekdays vs. Weekends by Airline")
plt.legend(title='Weekend (1 = yes)', loc='upper right')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.tight_layout()
plt.show()
# this currently renders, but the data needs more preprocessing before the graph is visually appealing
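
The analysis above uses only the first chunk of the file. As a hedged sketch of how the same weekday vs. weekend averages could be accumulated across every chunk, the example below keeps running sums and counts per airline and divides at the end. The helper name `mean_fare_by_weekend`, the tiny in-memory CSV, and its fares are all invented for illustration, and the weekend is assumed to mean Saturday and Sunday.

```python
import io

import pandas as pd

# Hypothetical miniature stand-in for data/itineraries.csv (made-up fares)
SAMPLE_CSV = (
    "flightDate,segmentsAirlineCode,totalFare\n"
    "2022-04-16,AA,200.0\n"  # Saturday
    "2022-04-17,AA,300.0\n"  # Sunday
    "2022-04-18,AA,100.0\n"  # Monday
    "2022-04-19,DL,150.0\n"  # Tuesday
    "2022-04-23,DL,250.0\n"  # Saturday
)


def mean_fare_by_weekend(csv_source, chunksize=10000):
    """Average totalFare per airline and weekend flag, one chunk at a time."""
    totals = None
    reader = pd.read_csv(
        csv_source,
        usecols=["flightDate", "segmentsAirlineCode", "totalFare"],
        parse_dates=["flightDate"],
        chunksize=chunksize,
    )
    for chunk in reader:
        # dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend
        chunk["Weekend"] = (chunk["flightDate"].dt.dayofweek >= 5).astype(int)
        part = chunk.groupby(["segmentsAirlineCode", "Weekend"])["totalFare"].agg(
            ["sum", "count"]
        )
        # Add this chunk's sums and counts to the running totals
        totals = part if totals is None else totals.add(part, fill_value=0)
    mean_fare = (totals["sum"] / totals["count"]).rename("totalFare")
    return mean_fare.reset_index()


# A chunksize of 2 forces several chunks even on this tiny sample
result = mean_fare_by_weekend(io.StringIO(SAMPLE_CSV), chunksize=2)
print(result)
```

Averaging the per-chunk means directly would weight small chunks too heavily, which is why the sketch carries sums and counts instead. The returned frame has the same columns as `agg_data`, so it could be passed to the same bar plot.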

## Part 2: Directed Graph Network

In class, we used graph networks to explore how connections form socially, economically, and through systems in the world.

Travel shows just how connected the world really is, and we want our project to show that as well.

In [None]:
# What do we want the graph to represent?
#   Possible ideas:
#     - network of flight paths: shows how flights are connected by leg
#     - network of distances and prices on each edge: shows how flights are connected by distance
# Plan:
#   1. create a list of airport IATA codes
#   2. grab the distances from the distance column
#   3. create a network based on the flights given in the dataset
#      (for each startingAirport, pair it with its destinationAirport and collect the pairs in a list)
#   4. add a labeled edge for each pair with the distance written on it

# Create a directed graph
G = nx.DiGraph()

# flight_price_dataset is the chunk iterator from Part 1; its first chunk was
# already consumed by next(), so this loop covers the remaining chunks
for chunk in flight_price_dataset:
    # Add an edge for each flight; add_edge also creates any missing airport nodes.
    # totalFare stands in for distance here, so the edge attribute is named 'fare';
    # replace it with the actual distance if available.
    # A later row for the same route overwrites the fare stored by an earlier one.
    for start, stop, fare in zip(chunk['startingAirport'], chunk['destinationAirport'], chunk['totalFare']):
        G.add_edge(start, stop, fare=fare)

# Draw the network graph with edge labels
plt.figure(figsize=(10, 8))
pos = nx.spring_layout(G, seed=42)  # fixed seed so the layout is reproducible
nx.draw(G, pos, with_labels=True, node_color='lightblue', node_size=1500, font_size=10, font_weight='bold')
labels = nx.get_edge_attributes(G, 'fare')
nx.draw_networkx_edge_labels(G, pos, edge_labels=labels)
plt.title('Flight Network Graph with Fares')
plt.show()
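
To connect the graph back to the project question, here is a minimal sketch of one way a fare-weighted network could be queried for a cheap route. It is an illustration under stated assumptions, not the project's final method: the airports and fares in `flights` are invented, each edge stores the mean fare for that route, and summing mean fares along a path is only a rough proxy for what a multi-leg itinerary would actually cost.

```python
import networkx as nx
import pandas as pd

# Hypothetical sample of itineraries (made-up fares, not from the dataset)
flights = pd.DataFrame({
    "startingAirport":    ["ATL", "ATL", "ATL", "BOS", "BOS", "JFK"],
    "destinationAirport": ["BOS", "BOS", "JFK", "JFK", "LAX", "LAX"],
    "totalFare":          [100.0, 140.0, 90.0, 60.0, 400.0, 250.0],
})

# Collapse repeated routes to one row holding the mean fare for that route
route_fares = flights.groupby(
    ["startingAirport", "destinationAirport"], as_index=False
)["totalFare"].mean()

# Build the directed graph straight from the route table
G = nx.from_pandas_edgelist(
    route_fares,
    source="startingAirport",
    target="destinationAirport",
    edge_attr="totalFare",
    create_using=nx.DiGraph(),
)

# Cheapest route by summed mean fare (weighted shortest path)
cheapest_path = nx.shortest_path(G, "ATL", "LAX", weight="totalFare")
cheapest_cost = nx.shortest_path_length(G, "ATL", "LAX", weight="totalFare")
print(cheapest_path, cheapest_cost)
```

Grouping before building the graph avoids the row-by-row loop and makes the edge value independent of row order, at the cost of hiding the spread of fares on each route.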



