# Flight Price Prediction

Problem Statement:
Flight ticket prices are hard to predict: the price we see for a flight today may be different when we check the same flight tomorrow, and travellers often complain about this unpredictability. This dataset provides flight ticket prices for various airlines and city pairs between March and June 2019.

Size of training set: 10683 records

Size of test set: 2671 records

FEATURES:
Airline: The name of the airline.

Date_of_Journey: The date of the journey

Source: The source from which the service begins.

Destination: The destination where the service ends.

Route: The route taken by the flight to reach the destination.

Dep_Time: The time when the journey starts from the source.

Arrival_Time: Time of arrival at the destination.

Duration: Total duration of the flight.

Total_Stops: Total stops between the source and destination.

Additional_Info: Additional information about the flight

Price: The price of the ticket

# Importing libraries

In [1]:
## Import all necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# Loading dataset

In [28]:
train=pd.read_csv('Data_Train_1.csv')
sample = pd.read_csv('Sample_submission.csv')
test = pd.read_csv('Test_set_1.csv')

In [29]:
train.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


In [30]:
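# Append the sample-submission Price column so that test has the same columns as train.
# Caution: sample-submission prices may be placeholders rather than true fares.
test = pd.concat([test,sample],axis=1)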

In [31]:
test.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,Jet Airways,6/06/2019,Delhi,Cochin,DEL → BOM → COK,17:30,04:25 07 Jun,10h 55m,1 stop,No info,15998
1,IndiGo,12/05/2019,Kolkata,Banglore,CCU → MAA → BLR,06:20,10:20,4h,1 stop,No info,16612
2,Jet Airways,21/05/2019,Delhi,Cochin,DEL → BOM → COK,19:15,19:00 22 May,23h 45m,1 stop,In-flight meal not included,25572
3,Multiple carriers,21/05/2019,Delhi,Cochin,DEL → BOM → COK,08:00,21:00,13h,1 stop,No info,25778
4,Air Asia,24/06/2019,Banglore,Delhi,BLR → DEL,23:55,02:45 25 Jun,2h 50m,non-stop,No info,16934


In [32]:
train.shape,test.shape,train.shape[0]/(train.shape[0]+test.shape[0])*100

((10683, 11), (2671, 11), 79.99850232140183)

In [33]:
df= pd.concat([train,test])
df.shape

(13354, 11)
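
Stacking train and test makes it easy to apply the same preprocessing to both, but recovering the two parts afterwards by row position depends on knowing exactly how many rows survive steps such as `dropna`. A minimal sketch of one alternative, assuming a hypothetical `is_train` flag column (not part of the original notebook) and toy stand-in frames:

```python
import pandas as pd

# Toy stand-ins for the notebook's train/test frames (values are illustrative).
train = pd.DataFrame({"Airline": ["IndiGo", "Air India", "Jet Airways"],
                      "Price": [3897, 7662, 13882]})
test = pd.DataFrame({"Airline": ["Jet Airways", "IndiGo"],
                     "Price": [15998, 16612]})

# Tag each row with its origin before stacking (is_train is a hypothetical name).
combined = pd.concat(
    [train.assign(is_train=True), test.assign(is_train=False)],
    ignore_index=True,
)

# Shared preprocessing can drop rows without breaking the split.
combined = combined.dropna()

# Recover the two parts from the flag instead of from row positions.
train_part = combined[combined["is_train"]].drop(columns="is_train")
test_part = combined[~combined["is_train"]].drop(columns="is_train")
print(train_part.shape, test_part.shape)
```

With a flag like this, the later split does not need a hard-coded row index.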

In [34]:
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Route,Dep_Time,Arrival_Time,Duration,Total_Stops,Additional_Info,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,BLR → DEL,22:20,01:10 22 Mar,2h 50m,non-stop,No info,3897
1,Air India,1/05/2019,Kolkata,Banglore,CCU → IXR → BBI → BLR,05:50,13:15,7h 25m,2 stops,No info,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,DEL → LKO → BOM → COK,09:25,04:25 10 Jun,19h,2 stops,No info,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,CCU → NAG → BLR,18:05,23:30,5h 25m,1 stop,No info,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,BLR → NAG → DEL,16:50,21:35,4h 45m,1 stop,No info,13302


# Feature Engineering

In [35]:
## Dropping columns that do not seem practical to ask a customer for.
df.drop(labels=['Route','Arrival_Time','Duration','Additional_Info'],axis=1,inplace=True)

In [36]:
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Total_Stops,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,22:20,non-stop,3897
1,Air India,1/05/2019,Kolkata,Banglore,05:50,2 stops,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,09:25,2 stops,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,18:05,1 stop,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,16:50,1 stop,13302


In [37]:
df['Airline'].value_counts()

Jet Airways                          4746
IndiGo                               2564
Air India                            2192
Multiple carriers                    1543
SpiceJet                             1026
Vistara                               608
Air Asia                              405
GoAir                                 240
Multiple carriers Premium economy      16
Jet Airways Business                    8
Vistara Premium economy                 5
Trujet                                  1
Name: Airline, dtype: int64

In [38]:
df['Source'].value_counts(),df['Destination'].value_counts()

(Delhi       5682
 Kolkata     3581
 Banglore    2752
 Mumbai       883
 Chennai      456
 Name: Source, dtype: int64,
 Cochin       5682
 Banglore     3581
 Delhi        1582
 New Delhi    1170
 Hyderabad     883
 Kolkata       456
 Name: Destination, dtype: int64)

In [39]:
df.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Dep_Time           0
Total_Stops        1
Price              0
dtype: int64

In [40]:
print(df.shape)
df.dropna(inplace=True)
print(df.shape)

(13354, 7)
(13353, 7)


In [41]:

df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Total_Stops,Price
0,IndiGo,24/03/2019,Banglore,New Delhi,22:20,non-stop,3897
1,Air India,1/05/2019,Kolkata,Banglore,05:50,2 stops,7662
2,Jet Airways,9/06/2019,Delhi,Cochin,09:25,2 stops,13882
3,IndiGo,12/05/2019,Kolkata,Banglore,18:05,1 stop,6218
4,IndiGo,01/03/2019,Banglore,New Delhi,16:50,1 stop,13302


In [42]:
df['Day']= df['Date_of_Journey'].str.split('/').str[0]
df['Month']= df['Date_of_Journey'].str.split('/').str[1]
df['Year']= df['Date_of_Journey'].str.split('/').str[2]

In [43]:

df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Total_Stops,Price,Day,Month,Year
0,IndiGo,24/03/2019,Banglore,New Delhi,22:20,non-stop,3897,24,3,2019
1,Air India,1/05/2019,Kolkata,Banglore,05:50,2 stops,7662,1,5,2019
2,Jet Airways,9/06/2019,Delhi,Cochin,09:25,2 stops,13882,9,6,2019
3,IndiGo,12/05/2019,Kolkata,Banglore,18:05,1 stop,6218,12,5,2019
4,IndiGo,01/03/2019,Banglore,New Delhi,16:50,1 stop,13302,1,3,2019
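
The string split above yields text columns that still have to be cast to integers. As a hedged alternative sketch (not the notebook's method), `pd.to_datetime` with an explicit day-first format returns integer date parts directly; the sample values are taken from the rows shown above:

```python
import pandas as pd

# A few Date_of_Journey values in the dataset's day/month/year layout.
dates = pd.Series(["24/03/2019", "1/05/2019", "9/06/2019", "01/03/2019"])

# An explicit format avoids any day/month ambiguity.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")

# The .dt accessor exposes the parts as integers, so no astype(int) is needed.
parts = pd.DataFrame({
    "Day": parsed.dt.day,
    "Month": parsed.dt.month,
    "Year": parsed.dt.year,
})
print(parts)
```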


In [44]:
# Rewrite 'non-stop' as '0 stop' so that every value starts with a number
df['Total_Stops']=df['Total_Stops'].str.replace('non-','0 ')

In [45]:
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Total_Stops,Price,Day,Month,Year
0,IndiGo,24/03/2019,Banglore,New Delhi,22:20,0 stop,3897,24,3,2019
1,Air India,1/05/2019,Kolkata,Banglore,05:50,2 stops,7662,1,5,2019
2,Jet Airways,9/06/2019,Delhi,Cochin,09:25,2 stops,13882,9,6,2019
3,IndiGo,12/05/2019,Kolkata,Banglore,18:05,1 stop,6218,12,5,2019
4,IndiGo,01/03/2019,Banglore,New Delhi,16:50,1 stop,13302,1,3,2019


In [46]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13353 entries, 0 to 2670
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   Airline          13353 non-null  object
 1   Date_of_Journey  13353 non-null  object
 2   Source           13353 non-null  object
 3   Destination      13353 non-null  object
 4   Dep_Time         13353 non-null  object
 5   Total_Stops      13353 non-null  object
 6   Price            13353 non-null  int64 
 7   Day              13353 non-null  object
 8   Month            13353 non-null  object
 9   Year             13353 non-null  object
dtypes: int64(1), object(9)
memory usage: 1.1+ MB


In [47]:
df['Stops'] = df['Total_Stops'].str.split().str[0]
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Total_Stops,Price,Day,Month,Year,Stops
0,IndiGo,24/03/2019,Banglore,New Delhi,22:20,0 stop,3897,24,3,2019,0
1,Air India,1/05/2019,Kolkata,Banglore,05:50,2 stops,7662,1,5,2019,2
2,Jet Airways,9/06/2019,Delhi,Cochin,09:25,2 stops,13882,9,6,2019,2
3,IndiGo,12/05/2019,Kolkata,Banglore,18:05,1 stop,6218,12,5,2019,1
4,IndiGo,01/03/2019,Banglore,New Delhi,16:50,1 stop,13302,1,3,2019,1
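
The two steps above (rewriting `non-stop`, then taking the first token) can also be expressed as one chained operation. This is only a sketch on a small hand-made Series, using a regular expression to pull out the leading number:

```python
import pandas as pd

# Sample Total_Stops values in the dataset's format.
total_stops = pd.Series(["non-stop", "2 stops", "1 stop", "4 stops"])

stops = (
    total_stops.replace("non-stop", "0 stops")   # whole-value replacement
    .str.extract(r"(\d+)", expand=False)         # keep only the digits
    .astype(int)
)
print(stops.tolist())
```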


In [48]:
df['Departure_Hour'] = df['Dep_Time'].str.split(':').str[0]
df['Departure_Minute'] = df['Dep_Time'].str.split(':').str[1]

In [49]:
df.head()

Unnamed: 0,Airline,Date_of_Journey,Source,Destination,Dep_Time,Total_Stops,Price,Day,Month,Year,Stops,Departure_Hour,Departure_Minute
0,IndiGo,24/03/2019,Banglore,New Delhi,22:20,0 stop,3897,24,3,2019,0,22,20
1,Air India,1/05/2019,Kolkata,Banglore,05:50,2 stops,7662,1,5,2019,2,5,50
2,Jet Airways,9/06/2019,Delhi,Cochin,09:25,2 stops,13882,9,6,2019,2,9,25
3,IndiGo,12/05/2019,Kolkata,Banglore,18:05,1 stop,6218,12,5,2019,1,18,5
4,IndiGo,01/03/2019,Banglore,New Delhi,16:50,1 stop,13302,1,3,2019,1,16,50
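
Splitting `Dep_Time` and casting the pieces can be combined into a single step with `expand=True`, which returns one column per piece. A minimal sketch on sample values from the table above:

```python
import pandas as pd

# Sample Dep_Time values in HH:MM format.
dep_time = pd.Series(["22:20", "05:50", "09:25", "18:05"])

# expand=True gives a two-column frame; astype(int) converts both at once.
dep_parts = dep_time.str.split(":", expand=True).astype(int)
dep_parts.columns = ["Departure_Hour", "Departure_Minute"]
print(dep_parts)
```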


In [50]:
# Converting the data types of the newly created features
df['Day'] = df['Day'].astype(int)
df['Month'] = df['Month'].astype(int)
df['Year'] = df['Year'].astype(int)
df['Stops'] = df['Stops'].astype(int)
df['Departure_Hour'] = df['Departure_Hour'].astype(int)
df['Departure_Minute'] = df['Departure_Minute'].astype(int)

In [51]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 13353 entries, 0 to 2670
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Airline           13353 non-null  object
 1   Date_of_Journey   13353 non-null  object
 2   Source            13353 non-null  object
 3   Destination       13353 non-null  object
 4   Dep_Time          13353 non-null  object
 5   Total_Stops       13353 non-null  object
 6   Price             13353 non-null  int64 
 7   Day               13353 non-null  int32 
 8   Month             13353 non-null  int32 
 9   Year              13353 non-null  int32 
 10  Stops             13353 non-null  int32 
 11  Departure_Hour    13353 non-null  int32 
 12  Departure_Minute  13353 non-null  int32 
dtypes: int32(6), int64(1), object(6)
memory usage: 1.1+ MB


In [52]:
# Now dropping the parent features since we no longer need them
df.drop(['Date_of_Journey','Dep_Time','Total_Stops'],axis=1,inplace=True)
df.head()

Unnamed: 0,Airline,Source,Destination,Price,Day,Month,Year,Stops,Departure_Hour,Departure_Minute
0,IndiGo,Banglore,New Delhi,3897,24,3,2019,0,22,20
1,Air India,Kolkata,Banglore,7662,1,5,2019,2,5,50
2,Jet Airways,Delhi,Cochin,13882,9,6,2019,2,9,25
3,IndiGo,Kolkata,Banglore,6218,12,5,2019,1,18,5
4,IndiGo,Banglore,New Delhi,13302,1,3,2019,1,16,50


In [53]:
df.Airline.value_counts().index

Index(['Jet Airways', 'IndiGo', 'Air India', 'Multiple carriers', 'SpiceJet',
       'Vistara', 'Air Asia', 'GoAir', 'Multiple carriers Premium economy',
       'Jet Airways Business', 'Vistara Premium economy', 'Trujet'],
      dtype='object')

In [54]:
source_dict = {y:x for x,y in enumerate(df.Source.value_counts().index.sort_values())}
source_dict

{'Banglore': 0, 'Chennai': 1, 'Delhi': 2, 'Kolkata': 3, 'Mumbai': 4}

In [55]:
df.Destination.value_counts().index.sort_values()

Index(['Banglore', 'Cochin', 'Delhi', 'Hyderabad', 'Kolkata', 'New Delhi'], dtype='object')

In [56]:
destination_dict = {'Banglore':0,'Cochin':1,'Delhi':2,'Kolkata': 3,'Hyderabad':4,'New Delhi':5}

In [57]:
from sklearn.preprocessing import LabelEncoder

# Encode airline names as integer labels (LabelEncoder sorts the classes alphabetically)
le=LabelEncoder()
df['Airline_Encoded']= le.fit_transform(df['Airline'].values)

# Build a lookup from airline name to its encoded label
df3 = df[['Airline']].copy()
df3['Encoded']=df['Airline_Encoded']
df3=df3.drop_duplicates('Airline').reset_index().iloc[:,1:]
d5=df3.Airline.values
d6=df3.Encoded.values
airline_dict = dict(zip(d5,d6))

print(airline_dict)

{'IndiGo': 3, 'Air India': 1, 'Jet Airways': 4, 'SpiceJet': 8, 'Multiple carriers': 6, 'GoAir': 2, 'Vistara': 10, 'Air Asia': 0, 'Vistara Premium economy': 11, 'Jet Airways Business': 5, 'Multiple carriers Premium economy': 7, 'Trujet': 9}
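
The same name-to-code lookup can be read straight from the fitted encoder, because `LabelEncoder` stores the sorted class names in `classes_` and the code of each class is its position in that array. A short sketch on a toy list of airlines:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy airline column (a subset of the names in the dataset).
airlines = pd.Series(["IndiGo", "Air India", "Jet Airways", "IndiGo", "SpiceJet"])

le = LabelEncoder()
codes = le.fit_transform(airlines)

# classes_ is sorted, and each class is encoded as its index in classes_.
airline_dict = {name: code for code, name in enumerate(le.classes_)}
print(airline_dict)
```

Keeping this dictionary (or the fitted encoder itself) is what allows new inputs to be encoded consistently at prediction time.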


In [58]:
df['Source_Encoded']=df['Source'].map(source_dict)
df['Destination_Encoded']=df['Destination'].map(destination_dict)


In [59]:
df.head()


Unnamed: 0,Airline,Source,Destination,Price,Day,Month,Year,Stops,Departure_Hour,Departure_Minute,Airline_Encoded,Source_Encoded,Destination_Encoded
0,IndiGo,Banglore,New Delhi,3897,24,3,2019,0,22,20,3,0,5
1,Air India,Kolkata,Banglore,7662,1,5,2019,2,5,50,1,3,0
2,Jet Airways,Delhi,Cochin,13882,9,6,2019,2,9,25,4,2,1
3,IndiGo,Kolkata,Banglore,6218,12,5,2019,1,18,5,3,3,0
4,IndiGo,Banglore,New Delhi,13302,1,3,2019,1,16,50,3,0,5


In [60]:
df = df.drop(['Airline','Source','Destination'],axis=1)
df.head()

Unnamed: 0,Price,Day,Month,Year,Stops,Departure_Hour,Departure_Minute,Airline_Encoded,Source_Encoded,Destination_Encoded
0,3897,24,3,2019,0,22,20,3,0,5
1,7662,1,5,2019,2,5,50,1,3,0
2,13882,9,6,2019,2,9,25,4,2,1
3,6218,12,5,2019,1,18,5,3,3,0
4,13302,1,3,2019,1,16,50,3,0,5
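
Integer codes give airlines and cities an artificial order (for example, Mumbai = 4 is treated as "larger" than Delhi = 2), which a linear model interprets literally. One common alternative, shown here only as a sketch and not as what this notebook does, is one-hot encoding with `pd.get_dummies`:

```python
import pandas as pd

# Toy frame with one nominal column and one numeric column.
toy = pd.DataFrame({"Source": ["Banglore", "Kolkata", "Delhi"],
                    "Stops": [0, 2, 2]})

# Each category becomes its own 0/1 indicator column.
encoded = pd.get_dummies(toy, columns=["Source"], dtype=int)
print(encoded)
```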


# Feature Selection

In [61]:
from sklearn.linear_model import Lasso
from sklearn.feature_selection import SelectFromModel 
from sklearn.model_selection import train_test_split

In [62]:
df.shape

(13353, 10)

In [63]:
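# Split the combined frame back by position.
# Note: the training file has 10683 rows, so a cut at 10600 leaves the last
# training rows in df_test along with the rows from the test file.
df_train = df[0:10600]
df_test = df[10600:]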

In [64]:

X = df_train.drop(['Price'],axis=1)
y = df_train.Price

In [65]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [88]:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

lr=LinearRegression()


In [90]:
# Check how sensitive the score is to the train/test split by varying random_state.
# Note: r2_score is the coefficient of determination (R²); the messages below call it "accuracy".
for i in range(0,100):
    X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=i)
    lr.fit(X_train,y_train)
    pred_train=lr.predict(X_train)
    pred_test=lr.predict(X_test)
    print(f"At random state {i},the training accuracy is :- {r2_score(y_train,pred_train)}")
    print(f"At random state {i},the testing accuracy is :- {r2_score(y_test,pred_test)}")
    print("\n")

At random state 0,the training accuracy is :- 0.4285516360573395
At random state 0,the testing accuracy is :- 0.4138402480905212


At random state 1,the training accuracy is :- 0.42266370304476886
At random state 1,the testing accuracy is :- 0.4383287806098942


At random state 2,the training accuracy is :- 0.4172404163709277
At random state 2,the testing accuracy is :- 0.4607288326762551


At random state 3,the training accuracy is :- 0.4201221580839788
At random state 3,the testing accuracy is :- 0.44932541282796634


At random state 4,the training accuracy is :- 0.41992121848210673
At random state 4,the testing accuracy is :- 0.4519513764403268


At random state 5,the training accuracy is :- 0.43672165939793817
At random state 5,the testing accuracy is :- 0.3908164258875414


At random state 6,the training accuracy is :- 0.4183597801324258
At random state 6,the testing accuracy is :- 0.45570644665358073


At random state 7,the training accuracy is :- 0.4192442691216597
[output truncated]
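
Looping over random states shows how much the R² score moves between splits. K-fold cross-validation gives a similar picture in a single call. The sketch below uses synthetic data from `make_regression` as a stand-in for the flight features, so the scores it prints say nothing about this dataset:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (an assumption, standing in for X and y).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=1.0,
                                 random_state=0)

# Five shuffled folds; each fold is used once as the held-out set.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

# The mean summarises performance; the spread shows split sensitivity.
print(scores.mean(), scores.std())
```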

In [91]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

In [93]:

lr.fit(X_train,y_train)

In [94]:

pred_test=lr.predict(X_test)

In [95]:

print(r2_score(y_test,pred_test))

0.42340064730494487


In [96]:
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score

In [98]:
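from sklearn.linear_model import Lasso

# Grid-search the Lasso regularisation strength (alpha).
# Note: random_state only affects Lasso when selection='random', so with the
# default selection='cyclic' searching over it does not change the fit.
parameters = {'alpha' : [.0001,.001,.01,.1,1,10],'random_state' : list(range(0,10))}
ls=Lasso()
clf=GridSearchCV(ls,parameters)
clf.fit(X_train,y_train)

print(clf.best_params_)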

{'alpha': 0.1, 'random_state': 0}
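
Since `alpha` is the only parameter in this grid that changes a default Lasso fit, the search can be limited to it. A minimal sketch on synthetic data (an assumption; the notebook's `X_train` and `y_train` would take its place), with the scoring metric and fold count stated explicitly:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training features and prices.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=1.0,
                                 random_state=0)

# Search over the regularisation strength only.
param_grid = {"alpha": [0.0001, 0.001, 0.01, 0.1, 1, 10]}
search = GridSearchCV(Lasso(max_iter=10000), param_grid, cv=5, scoring="r2")
search.fit(X_demo, y_demo)

# best_score_ is the mean cross-validated R² of the best alpha.
print(search.best_params_, search.best_score_)
```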


In [100]:
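# Refit Lasso with the best parameters and score it on the held-out split
ls=Lasso(alpha=0.1,random_state=0)
ls.fit(X_train,y_train)
ls.score(X_train,y_train)  # training R²; not displayed because it is not the last expression in the cell
pred_ls=ls.predict(X_test)

lss=r2_score(y_test,pred_ls)
lss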

0.42340315896885417
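
The values reported above are R² scores, that is, the share of the variance in `Price` that the model explains, rather than a classification-style accuracy. The sketch below computes R² by hand as 1 - SS_res / SS_tot and compares it with `r2_score`; the predictions are made-up numbers for illustration only:

```python
import numpy as np
from sklearn.metrics import r2_score

# True prices from the first five training rows; predictions are invented.
y_true = np.array([3897.0, 7662.0, 13882.0, 6218.0, 13302.0])
y_pred = np.array([5000.0, 8000.0, 12000.0, 7000.0, 11000.0])

# Residual sum of squares: error left after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean price.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

An R² near 0.42 therefore means the linear models here explain a little under half of the price variance.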