
### About Dataset

The dataset contains flight booking options from the website Easemytrip for travel between India's top 6 metro cities. The cleaned dataset has 300261 data points and 11 features. The data was collected over 50 days, from February 11th to March 31st, 2022. It is secondary data, gathered from the Easemytrip website.


### Features

The various features of the cleaned dataset are explained below:
1. *Airline*: The name of the airline company is stored in the airline column. It is a categorical feature having 6 different airlines.
2. *Flight*: Flight stores information regarding the plane's flight code. It is a categorical feature.
3. *Source City*: City from which the flight takes off. It is a categorical feature having 6 unique cities.
4. *Departure Time*: This is a derived categorical feature created by grouping time periods into bins. It stores information about the departure time and has 6 unique time labels.
5. *Stops*: A categorical feature with 3 distinct values that stores the number of stops between the source and destination cities.
6. *Arrival Time*: This is a derived categorical feature created by grouping time intervals into bins. It has six distinct time labels and keeps information about the arrival time.
7. *Destination City*: City where the flight will land. It is a categorical feature having 6 unique cities.
8. *Class*: A categorical feature that contains information on seat class; it has two distinct values: Business and Economy.
9. *Duration*: A continuous feature that displays the overall amount of time it takes to travel between cities in hours.
10. *Days Left*: This is a derived feature calculated by subtracting the booking date from the trip date.
11. *Price*: The target variable, which stores the ticket price.
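
The derived features above (the binned departure/arrival times and the days left) could be produced roughly as in the sketch below. The bin edges, the label names, the helper functions `bin_time_of_day` and `days_left`, and the sample rows are all assumptions made for illustration; the source does not state how the original bins were defined.

```python
import pandas as pd

# Hypothetical bin edges (hour of day) and label names; the real ones are not documented here.
TIME_BINS = [0, 4, 8, 12, 16, 20, 24]
TIME_LABELS = ["Late_Night", "Early_Morning", "Morning", "Afternoon", "Evening", "Night"]


def bin_time_of_day(hours: pd.Series) -> pd.Series:
    """Group hour-of-day values (0-23) into six labelled periods."""
    # right=False makes each bin half-open: [0, 4), [4, 8), ...
    return pd.cut(hours, bins=TIME_BINS, labels=TIME_LABELS, right=False)


def days_left(trip_date: pd.Series, booking_date: pd.Series) -> pd.Series:
    """Whole days between the booking date and the trip date."""
    return (pd.to_datetime(trip_date) - pd.to_datetime(booking_date)).dt.days


# Made-up rows, only to exercise the two helpers.
raw = pd.DataFrame({
    "departure_hour": [2, 6, 9, 14, 18, 22],
    "booking_date": ["2022-02-11"] * 6,
    "trip_date": ["2022-02-12", "2022-02-13", "2022-02-20",
                  "2022-03-01", "2022-03-15", "2022-03-31"],
})
raw["departure_time"] = bin_time_of_day(raw["departure_hour"])
raw["days_left"] = days_left(raw["trip_date"], raw["booking_date"])
print(raw[["departure_hour", "departure_time", "days_left"]])
```

Binning turns a continuous time into a small categorical feature, which is why the cleaned dataset reports only 6 labels for each of the two time columns.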

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("whitegrid")


In [None]:
# forward slash avoids the invalid "\C" escape sequence and works on every OS
df = pd.read_csv("data/Clean_Dataset.csv", index_col=0)
df.head()

In [None]:
# check for missing values
df.isnull().sum()

In [None]:
df.describe(include="all").T

### Let's check distribution

In [None]:
plt.figure(figsize=(7, 4))
sns.histplot(x ='price', data = df, kde = True)
plt.show()

In [None]:
plt.figure(figsize=(7, 4))
sns.histplot(x ='price', data = df[df['class']== 'Economy'], kde = True)
plt.show()

In [None]:
plt.figure(figsize=(7, 4))
sns.histplot(x ='price', data = df[df['class'] != 'Economy'], kde = True)
plt.show()

In [None]:
# visualization of categorical features
categorical_titles = {
    "airline": "Frequency of airline",
    "source_city": "Frequency of Source City",
    "departure_time": "Frequency of Departure Time",
    "stops": "Frequency of Stops",
    "arrival_time": "Frequency of Arrival Time",
    "destination_city": "Frequency of Destination City",
    "class": "Class Frequency",
}

plt.figure(figsize=(17, 20))
# one count plot per categorical column, laid out on a 4 x 2 grid
for i, (column, title) in enumerate(categorical_titles.items(), start=1):
    plt.subplot(4, 2, i)
    sns.countplot(x=column, data=df)
    plt.title(title)

plt.show()

### How does the ticket price vary between Economy and Business class?

To visualize the difference between the two kinds of tickets, I will plot the average Economy and Business prices for each airline.

In [None]:
plt.figure(figsize=(20, 5))
sns.barplot(x='airline', y='price', hue="class",
            data=df.sort_values("price")
            )
plt.show()

📌 Business tickets are only offered by two airlines: Air India and Vistara. There is also a big gap between the two classes: Business tickets cost almost 5 times as much as Economy tickets.
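
To put a number on the gap read off the bar plot, one option is to compute the mean price per airline and class and divide the two. This is a minimal sketch: the helper `class_price_ratio` and the sample rows are made up for illustration, and the column names `airline`, `class` and `price` are assumed to match the cleaned dataset.

```python
import pandas as pd


def class_price_ratio(frame: pd.DataFrame) -> pd.DataFrame:
    """Mean price per airline and class, plus the Business/Economy ratio."""
    means = frame.groupby(["airline", "class"])["price"].mean().unstack("class")
    # Airlines that sell no Business tickets have NaN in that column; drop them.
    means = means.dropna(subset=["Business"])
    means["ratio"] = means["Business"] / means["Economy"]
    return means


# Tiny made-up sample, only to show the shape of the result.
sample = pd.DataFrame({
    "airline": ["Vistara", "Vistara", "Air India", "Air India", "Indigo"],
    "class": ["Business", "Economy", "Business", "Economy", "Economy"],
    "price": [50000, 10000, 40000, 8000, 5000],
})
ratios = class_price_ratio(sample)
print(ratios)
```

On the real data the call would be `class_price_ratio(df)`; the ratios in the sample above come from invented prices and say nothing about the actual dataset.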

### How is the price affected when tickets are bought in just 1 or 2 days before departure?

To visualize how prices change with the number of days left, I will calculate the average price for each value of days left and look for a pattern in the curve.

In [None]:
df_temp = df.groupby(['days_left'])['price'].mean().reset_index()
df_temp.head()

In [None]:
plt.figure(figsize=(15,6))
ax = sns.scatterplot(x="days_left",
                     y="price", data=df_temp)
ax.set_title("Average prices depending on the days left",fontsize=15)
plt.show()
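
To answer the question in the heading more directly, the average price of tickets bought within 1 or 2 days of departure can be compared with the average price of tickets bought earlier. The sketch below assumes the `days_left` and `price` columns described in the feature list; the helper `last_minute_premium`, its `cutoff` parameter and the sample values are hypothetical.

```python
import pandas as pd


def last_minute_premium(frame: pd.DataFrame, cutoff: int = 2) -> float:
    """Ratio of the mean price booked within `cutoff` days to the mean price booked earlier."""
    late = frame.loc[frame["days_left"] <= cutoff, "price"].mean()
    early = frame.loc[frame["days_left"] > cutoff, "price"].mean()
    return late / early


# Made-up values, only to show how the helper is called.
sample = pd.DataFrame({
    "days_left": [1, 2, 10, 20],
    "price": [30000, 26000, 20000, 20000],
})
premium = last_minute_premium(sample)
print(f"Last-minute tickets cost {premium:.2f}x the earlier average")
```

A ratio above 1 would mean last-minute tickets are more expensive on average; running `last_minute_premium(df)` gives the figure for the real data.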

### Does the price change with the duration of the flight?

In [None]:
df_temp = df.groupby(['duration'])['price'].mean().reset_index()

plt.figure(figsize=(15,6))
ax = sns.scatterplot(x="duration", y="price", data=df_temp)
ax.set_title("Average prices depending on the duration",fontsize=15)
plt.show()

📌 The relationship here is clearly not linear. The average price peaks at a duration of 20 hours before dropping again.
However, some outliers seem to distort the overall trend.
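
One way to limit the influence of those outliers is to group durations into whole hours and use the median instead of the mean. This is only a sketch of that idea, not the method used above: the helper `median_price_by_hour`, the `duration_hours` column name and the sample values are assumptions.

```python
import pandas as pd


def median_price_by_hour(frame: pd.DataFrame) -> pd.DataFrame:
    """Median price for each whole hour of flight duration."""
    # Truncate the duration (in hours) to its integer part to form the groups.
    hours = frame["duration"].astype(int).rename("duration_hours")
    return frame.groupby(hours)["price"].median().reset_index()


# Made-up values: one extreme price in the 2-hour group barely moves its median.
sample = pd.DataFrame({
    "duration": [2.1, 2.5, 2.9, 20.2, 20.8],
    "price": [4000, 5000, 90000, 12000, 14000],
})
smoothed = median_price_by_hour(sample)
print(smoothed)
```

The result of `median_price_by_hour(df)` can be passed to the same `sns.scatterplot` call, with `x="duration_hours"`, to check whether the peak is still visible.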

### Does ticket price change based on the departure time and arrival time?

In [None]:
plt.figure(figsize = (18,8))
plt.subplot(1,2,1)
# keep the Axes returned by seaborn so the title is set on this subplot
ax = sns.boxplot(data=df, y="price", x="departure_time",showfliers=False)
ax.set_title("Airline prices based on the departure time",fontsize=15)

plt.subplot(1,2,2)
ax = sns.boxplot(data=df, y="price", x="arrival_time",showfliers=False)
ax.set_title("Airline prices based on the arrival time",fontsize=15)
plt.show()

### Does the number of stops influence the price?

In [None]:
fig, axs = plt.subplots(1, 2, gridspec_kw={'width_ratios': [5, 3]}, figsize=(25, 5))
economy = df.loc[df["class"] == 'Economy'].sort_values("price", ascending=False)
sns.barplot(y="price", x="airline", hue="stops", data=economy, ax=axs[0])
axs[0].set_title("Airline prices based on the number of stops for economy", fontsize=20)

business = df.loc[df["class"] == 'Business'].sort_values("price", ascending=False)
sns.barplot(y="price", x="airline", hue="stops", data=business, ax=axs[1])
axs[1].set_title("Airline prices based on the number of stops for business", fontsize=20)
plt.show()