## Flight Ticket Price Prediction

### Problem Statement:

Anyone who has booked a flight ticket knows how unpredictably prices vary. The cheapest available ticket on a given flight rises and falls in price over time. Airlines usually do this to maximize revenue based on:

1. Time of purchase patterns (making sure last-minute purchases are expensive)

2. Keeping the flight as full as they want it (raising prices on a flight that is filling up in order to reduce sales and hold back inventory for those expensive last-minute purchases)

### Model Building Phase

After collecting/scraping the data, we have around 1948 rows and 9 columns. Before building a machine learning model, we will carry out data pre-processing steps. We will then try different models with different hyperparameters and select the best one.

Size of training dataset: 1948 records

### Features

Airline: The name of the airline.

Source: The source from which the service begins.

Destination: The destination where the service ends.

Dep_Time: The time when the journey starts from the source.

Arrival_Time: Time of arrival at the destination.

Duration: Total duration of the flight.

Total_Stops: Total stops between the source and destination.

Additional_Info: Additional information about the meal on the flight.

Price: The price of the ticket.

### Importing required libraries

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, RandomizedSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, Lasso,Ridge
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score, mean_squared_error

import warnings
warnings.filterwarnings("ignore")

In [None]:
flight_df = pd.read_csv("Flight price Dataset.csv")
flight_df

In [None]:
# Dropping the unwanted column 'Unnamed'
flight_df.drop("Unnamed: 0",axis =1,inplace = True)
flight_df

In [None]:
flight_df.head()

In [None]:
flight_df.shape

In [None]:
flight_df.info()

In [None]:
flight_df.isnull().sum()

In [None]:
flight_df['Additional_Info']=flight_df['Additional_Info'].replace('No info', np.nan)
flight_df

In [None]:
flight_df.isnull().sum()

In [None]:
#To check missing values
sns.heatmap(flight_df.isnull())

In [None]:
# Percentage of missing data in the Additional_Info column
flight_df['Additional_Info'].isnull().sum() * 100 / len(flight_df['Additional_Info'])

Here we can see that almost 78% of the values in the Additional_Info column are missing. So rather than filling the NaN values with a simple imputer, we will drop the Additional_Info column.

In [None]:
flight_df.drop('Additional_Info', axis=1, inplace=True)
flight_df.head()
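
The missing-value percentage computed for a single column can also be obtained for every column at once. The following is a minimal sketch on a made-up frame (`toy_df` and its values are assumptions for illustration, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for flight_df; the values are invented for illustration
toy_df = pd.DataFrame({
    "Airline": ["IndiGo", "Vistara", "AirAsia", "IndiGo"],
    "Additional_Info": [np.nan, "Free Meal", np.nan, np.nan],
})

# isnull() yields booleans; their column-wise mean is the fraction of missing values
missing_pct = toy_df.isnull().mean() * 100
print(missing_pct)
```

Columns whose percentage is very high are candidates for dropping, as done with Additional_Info here.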

### EXPLORATORY DATA ANALYSIS

In [None]:
#Checking the unique values counts in the columns
obj_col = flight_df.select_dtypes(include= "object")
for i in obj_col.columns:
    print(i)
    print(obj_col[i].value_counts(),"\n")

Conclusion:

We have data for multiple airlines; the top 3 airlines are Indigo, AirAsia and Vistara.

The Date column has to be converted to datetime, and the day and month need to be separated out for analysis.

Most flights depart from 4 major cities, i.e. Mumbai, Bangalore, Delhi and Hyderabad. Their destinations are also major cities, i.e. Bangalore, New Delhi, Hyderabad and Chennai.

The arrival time column combines two components: hours and minutes.

Duration is shown in hours and minutes.

Total stops tells us how many stops a flight makes. Most flights are non-stop, followed by flights with 1 stop.

### Creating features by separating hour and minute from Departure Time and Arrival Time

### Departure Time

In [None]:
# Departure time is when a plane leaves the gate. 

# Extracting Hours
flight_df["Dep_hour"] = pd.to_datetime(flight_df["Dep_Time"]).dt.hour

# Extracting Minutes
flight_df["Dep_min"] = pd.to_datetime(flight_df["Dep_Time"]).dt.minute

# Now we can drop Dep_Time as it is of no use
flight_df.drop(["Dep_Time"], axis = 1, inplace = True)

### Arrival Time

In [None]:
# Arrival time is when the plane pulls up to the gate.

# Extracting Hours
flight_df["Arrival_hour"] = pd.to_datetime(flight_df['Arrival_Time']).dt.hour

# Extracting Minutes
flight_df["Arrival_min"] = pd.to_datetime(flight_df['Arrival_Time']).dt.minute

# Now we can drop Arrival_Time as it is of no use
flight_df.drop(["Arrival_Time"], axis = 1, inplace = True)
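
When the time strings follow a known layout, passing an explicit `format` to `pd.to_datetime` avoids format guessing and parses the column only once. The sketch below assumes a 24-hour `HH:MM` layout, which may not match the actual Dep_Time and Arrival_Time values; `toy_df` is a made-up frame.

```python
import pandas as pd

# Toy frame; the "HH:MM" layout is an assumption about the time columns
toy_df = pd.DataFrame({"Dep_Time": ["06:05", "22:40"]})

# Parse once with an explicit format, then reuse the parsed Series
parsed = pd.to_datetime(toy_df["Dep_Time"], format="%H:%M")
toy_df["Dep_hour"] = parsed.dt.hour
toy_df["Dep_min"] = parsed.dt.minute
print(toy_df)
```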

In [None]:
flight_df.head()

### Extracting the hours and min from the Duration column

In [None]:
# Time taken by the plane to reach its destination is called Duration
# It is the difference between Departure Time and Arrival Time
# Converting the Duration column into a list

duration = list(flight_df["Duration"])
for i in range(len(duration)):
    # Pad entries that contain only hours or only minutes
    if len(duration[i].split()) != 2:
        if "h" in duration[i]:
            duration[i] = duration[i].strip() + " 0m"
        else:
            duration[i] = "0h " + duration[i]
duration_hrs = []
duration_min = []

for i in range(len(duration)):
    duration_hrs.append(int(duration[i].split("h")[0]))
    duration_min.append(int(duration[i].split("m")[0].split()[-1]))

In [None]:
flight_df["Duration_hours"] = duration_hrs
flight_df["Duration_Min"] = duration_min
flight_df.drop("Duration",axis = 1,inplace = True)
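
The same hour and minute split can be written without explicit loops. This is a sketch using `Series.str.extract`, assuming duration strings shaped like `"2h 50m"`, `"19h"` or `"45m"`; the `toy` Series is invented for illustration.

```python
import pandas as pd

# Toy durations covering the three shapes handled by the loop version
toy = pd.Series(["2h 50m", "19h", "45m", "1h 5m"])

# Capture the digits before "h" and before "m"; a missing part becomes NaN, then 0
hours = pd.to_numeric(toy.str.extract(r"(\d+)\s*h", expand=False), errors="coerce").fillna(0).astype(int)
minutes = pd.to_numeric(toy.str.extract(r"(\d+)\s*m", expand=False), errors="coerce").fillna(0).astype(int)

# A single numeric feature is sometimes easier for a model than two columns
total_minutes = hours * 60 + minutes
print(pd.DataFrame({"hours": hours, "minutes": minutes, "total": total_minutes}))
```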

In [None]:
flight_df.head(2)

In [None]:
# Replacing Total_Stops labels with stop counts (restricted to that column)
flight_df["Total_Stops"] = flight_df["Total_Stops"].replace({"Non Stop": 0, "1 Stop": 1, "2 Stop(s)": 2})

In [None]:
flight_df.head(2)

In [None]:
def convert_price(flight_df):
    # Remove the thousands separator, e.g. '1,100' becomes '1100'
    flight_df['Price (in ₹)'] = flight_df['Price (in ₹)'].str.replace(',', '')
    # Convert the cleaned strings to integers
    flight_df['Price (in ₹)'] = flight_df['Price (in ₹)'].astype('int64')
    return flight_df

In [None]:
print(convert_price(flight_df))

In [None]:
flight_df.head(2)

In [None]:
flight_df.info()

#### Univariate Analysis:

In [None]:
flight_df["Price (in ₹)"].describe()

In [None]:
#histogram
flight_df['Price (in ₹)'].hist(bins = 20)

In [None]:
#skewness & kurtosis
print("Skewness: %f" % flight_df['Price (in ₹)'].skew())
print("Kurtosis: %f" % flight_df['Price (in ₹)'].kurt())

In [None]:
# For numerical columns
flight_df.describe()

### Handling Categorical Data

In [None]:
plt.figure(figsize=(20,5))
plt.subplot(1,2,1)
sns.countplot(x='Source', data=flight_df)
plt.subplot(1,2,2)
sns.countplot(x='Total_Stops', data=flight_df)
plt.tight_layout()
plt.show()

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(10,5))
sns.countplot(x="Destination", data=flight_df)
plt.title("Destination")
plt.xticks(rotation=90)
plt.show()

In [None]:
sns.set(style="whitegrid")
plt.figure(figsize=(10,5))
sns.countplot(x="Airline", data=flight_df)
plt.title("Airline")
plt.xticks(rotation=90)
plt.show()

In [None]:
# Airline vs Price
sns.catplot(y = "Price (in ₹)", x = "Airline", data = flight_df.sort_values("Price (in ₹)", ascending = False), kind="boxen", height = 4, aspect = 3)
plt.show()

In [None]:
# Source vs Price
sns.catplot(y='Price (in ₹)',x='Source',data=flight_df.sort_values('Price (in ₹)', ascending=False),kind='boxen',height=4,aspect=3)
plt.show()

In [None]:
plt.figure(figsize =(15,5))
flight_df.groupby(["Source","Destination"])["Price (in ₹)"].mean().sort_values(ascending= False).plot(kind = "bar")

In [None]:
plt.figure(figsize=(10,3))
sns.barplot(x = "Total_Stops", y = "Price (in ₹)", data = flight_df)

### Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

flight_df["Airline"] = le.fit_transform(flight_df["Airline"])
flight_df["Source"] = le.fit_transform(flight_df["Source"])
flight_df["Destination"] = le.fit_transform(flight_df["Destination"])
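
Reusing one `LabelEncoder` for several columns means only the last fitted mapping is kept. If the integer codes need to be mapped back to names later, one option is to keep a separate encoder per column. A minimal sketch on made-up data (`toy_df` and the `encoders` dictionary are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame; the values are invented for illustration
toy_df = pd.DataFrame({
    "Airline": ["IndiGo", "Vistara", "AirAsia", "IndiGo"],
    "Source": ["Mumbai", "Delhi", "Mumbai", "Bangalore"],
})

# Keep one fitted encoder per column so each mapping stays available
encoders = {}
for col in ["Airline", "Source"]:
    enc = LabelEncoder()
    toy_df[col] = enc.fit_transform(toy_df[col])
    encoders[col] = enc

# Recover the original labels from the integer codes
airline_names = encoders["Airline"].inverse_transform(toy_df["Airline"])
print(toy_df)
print(list(airline_names))
```

Note that label encoding imposes an arbitrary order on the categories (alphabetical here), which tree-based models tolerate better than linear ones.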

In [None]:
plt.figure(figsize=(6,4))
sns.scatterplot(x ="Price (in ₹)", y = "Duration_hours" , data = flight_df)

### Correlation Map

In [None]:
flight_df.corr()

In [None]:
plt.figure(figsize =(10,6))
sns.heatmap(flight_df.corr(),annot= True, cmap = "afmhot_r")

### Check For Skewness

In [None]:
x=flight_df.drop('Price (in ₹)', axis=1)
y=flight_df['Price (in ₹)']
x

In [None]:
# Checking skewness
x.skew().sort_values(ascending=False)

In [None]:
from sklearn.preprocessing import power_transform
x_new=power_transform(x)
type(x_new)

In [None]:
x.columns

In [None]:
x=pd.DataFrame(x_new, columns=x.columns)
x

In [None]:
# Checking skewness again to see whether it has been reduced
x.skew().sort_values(ascending=False)
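
To see what the power transform does in isolation, the sketch below applies the class-based `PowerTransformer` (the same Yeo-Johnson default that `power_transform` uses) to a synthetic right-skewed column. The exponential sample and the names `toy_x` and `pt` are assumptions for illustration, not the flight data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed feature standing in for a column such as Duration_hours
rng = np.random.default_rng(0)
toy_x = pd.DataFrame({"Duration_hours": rng.exponential(scale=3.0, size=500)})

# Yeo-Johnson handles zero and negative values; standardize=True also
# rescales the output to zero mean and unit variance
pt = PowerTransformer(method="yeo-johnson", standardize=True)
toy_x_new = pd.DataFrame(pt.fit_transform(toy_x), columns=toy_x.columns)

skew_before = toy_x["Duration_hours"].skew()
skew_after = toy_x_new["Duration_hours"].skew()
print("Skewness before:", skew_before)
print("Skewness after:", skew_after)
```

Because the transform standardizes its output by default, a later `StandardScaler` step changes the values very little.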

In [None]:
x.plot(kind='box',subplots=True,layout=(2,5),figsize=(8,8))

### Feature Scaling / Standard Scaler

In [None]:
# Performing Standard scaler
sc = StandardScaler()
X = sc.fit_transform(x)


### Finding Best Random State

In [None]:
maxScore = 0
maxRS = 0

for i in range(1,300):
    x_train,x_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=i)
    lr = LinearRegression()
    lr.fit(x_train,y_train)
    pred_train = lr.predict(x_train)
    pred_test = lr.predict(x_test)
    acc=r2_score(y_test,pred_test)
    if acc>maxScore:
        maxScore=acc
        maxRS=i
print('Best score is',maxScore,'on Random State',maxRS)

In [None]:
model = [LinearRegression(),Lasso(alpha=1.0),Ridge(alpha=1.0),DecisionTreeRegressor(criterion='squared_error'),
         KNeighborsRegressor()]
for i in model:
    X_train1,X_test1,y_train1,y_test1 = train_test_split(X,y, test_size = 0.2, random_state =maxRS)
    i.fit(X_train1,y_train1)
    pred = i.predict(X_test1)
    print('Train Score of', i , 'is:' , i.score(X_train1,y_train1))
    print("r2_score", r2_score(y_test1, pred))
    print("mean_squared_error", mean_squared_error(y_test1, pred))
    print("RMSE", np.sqrt(mean_squared_error(y_test1, pred)),"\n")
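
One possible variation on the evaluation above is to wrap the scaler and the model in a pipeline, so that the scaler is fitted on the training split only and the test split stays unseen during preprocessing. This is a sketch on synthetic data from `make_regression`, not the flight dataset; the names `X_toy`, `y_toy` and `pipe` are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the flight features and price
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# The pipeline fits the scaler on X_tr only, then applies it to X_te at predict time
pipe = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe.fit(X_tr, y_tr)
pred = pipe.predict(X_te)

r2 = r2_score(y_te, pred)
rmse = np.sqrt(mean_squared_error(y_te, pred))
print("r2_score", r2)
print("RMSE", rmse)
```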

### Cross Validation

In [None]:
best_Ada_Boost = AdaBoostRegressor(n_estimators= 50, loss= 'linear', learning_rate =1, random_state=111)

for i in range(2,11):
    cross_score = cross_val_score(best_Ada_Boost,X,y,cv = i,n_jobs = -1) 
    print(i,"mean",cross_score.mean() ,"and STD" , cross_score.std())
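
When `cv` is an integer, `cross_val_score` splits a regression dataset into consecutive, unshuffled folds, which can skew the scores if the rows are ordered (for example by airline or route). A hedged alternative is to pass a shuffled `KFold`. The sketch below uses synthetic data, and the fold count of 5 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data standing in for X and y
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = AdaBoostRegressor(n_estimators=50, loss="linear", learning_rate=1.0, random_state=111)

# Shuffle rows before splitting so each fold sees a mix of the data
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="r2")
print("mean", scores.mean(), "and STD", scores.std())
```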