<a href="https://colab.research.google.com/github/Meghashyamt/Flight-Price-Prediction/blob/master/Flight_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Flight Ticket Price Prediction**

Flight ticket prices can be something hard to guess, today we might see a price, check out the price of the same flight tomorrow, it will be a different story. We might have often heard travelers saying that flight ticket prices are so unpredictable. As data scientists, we are gonna prove that given the right data anything can be predicted. Here you will be provided with prices of flight tickets for various airlines between the months of March and June of 2019 and between various cities.

In [0]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
get_ipython().run_line_magic('matplotlib', 'inline')
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')
from scipy import stats
from scipy.stats import norm, skew
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import make_pipeline

from sklearn import svm
from lightgbm import LGBMRegressor
from sklearn.metrics import accuracy_score,confusion_matrix,classification_report,r2_score


In [0]:
train_df =  pd.read_excel('Data_Train.xlsx')
test_df=pd.read_excel('Test_set.xlsx')

**Append the dataset**

In [0]:
big_df = train_df.append(test_df)


of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.


  sort=sort)


In [0]:
big_df.dtypes

Additional_Info     object
Airline             object
Arrival_Time        object
Date_of_Journey     object
Dep_Time            object
Destination         object
Duration            object
Price              float64
Route               object
Source              object
Total_Stops         object
dtype: object

**Feature Engineering**

In [0]:
big_df['Date'] = big_df['Date_of_Journey'].str.split('/').str[0]
big_df['Month'] = big_df['Date_of_Journey'].str.split('/').str[1]
big_df['Year'] = big_df['Date_of_Journey'].str.split('/').str[2]

In [0]:
big_df['Date'] = big_df['Date'].astype(int)
big_df['Month'] = big_df['Month'].astype(int)
big_df['Year'] = big_df['Year'].astype(int)

In [0]:
big_df.head()

Unnamed: 0,Additional_Info,Airline,Arrival_Time,Date_of_Journey,Dep_Time,Destination,Duration,Price,Route,Source,Total_Stops,Date,Month,Year
0,No info,IndiGo,01:10 22 Mar,24/03/2019,22:20,New Delhi,2h 50m,3897.0,BLR → DEL,Banglore,non-stop,24,3,2019
1,No info,Air India,13:15,1/05/2019,05:50,Banglore,7h 25m,7662.0,CCU → IXR → BBI → BLR,Kolkata,2 stops,1,5,2019
2,No info,Jet Airways,04:25 10 Jun,9/06/2019,09:25,Cochin,19h,13882.0,DEL → LKO → BOM → COK,Delhi,2 stops,9,6,2019
3,No info,IndiGo,23:30,12/05/2019,18:05,Banglore,5h 25m,6218.0,CCU → NAG → BLR,Kolkata,1 stop,12,5,2019
4,No info,IndiGo,21:35,01/03/2019,16:50,New Delhi,4h 45m,13302.0,BLR → NAG → DEL,Banglore,1 stop,1,3,2019


In [0]:
big_df=big_df.drop(['Date_of_Journey'], axis=1)

In [0]:
big_df.head()

Unnamed: 0,Additional_Info,Airline,Arrival_Time,Dep_Time,Destination,Duration,Price,Route,Source,Total_Stops,Date,Month,Year
0,No info,IndiGo,01:10 22 Mar,22:20,New Delhi,2h 50m,3897.0,BLR → DEL,Banglore,non-stop,24,3,2019
1,No info,Air India,13:15,05:50,Banglore,7h 25m,7662.0,CCU → IXR → BBI → BLR,Kolkata,2 stops,1,5,2019
2,No info,Jet Airways,04:25 10 Jun,09:25,Cochin,19h,13882.0,DEL → LKO → BOM → COK,Delhi,2 stops,9,6,2019
3,No info,IndiGo,23:30,18:05,Banglore,5h 25m,6218.0,CCU → NAG → BLR,Kolkata,1 stop,12,5,2019
4,No info,IndiGo,21:35,16:50,New Delhi,4h 45m,13302.0,BLR → NAG → DEL,Banglore,1 stop,1,3,2019


In [0]:
big_df['Arrival_Time'] = big_df['Arrival_Time'] .str.split(' ').str[0]

In [0]:
big_df['Total_Stops']=big_df['Total_Stops'].fillna('1 stop')

In [0]:
big_df.head()

Unnamed: 0,Additional_Info,Airline,Arrival_Time,Dep_Time,Destination,Duration,Price,Route,Source,Total_Stops,Date,Month,Year
0,No info,IndiGo,01:10,22:20,New Delhi,2h 50m,3897.0,BLR → DEL,Banglore,non-stop,24,3,2019
1,No info,Air India,13:15,05:50,Banglore,7h 25m,7662.0,CCU → IXR → BBI → BLR,Kolkata,2 stops,1,5,2019
2,No info,Jet Airways,04:25,09:25,Cochin,19h,13882.0,DEL → LKO → BOM → COK,Delhi,2 stops,9,6,2019
3,No info,IndiGo,23:30,18:05,Banglore,5h 25m,6218.0,CCU → NAG → BLR,Kolkata,1 stop,12,5,2019
4,No info,IndiGo,21:35,16:50,New Delhi,4h 45m,13302.0,BLR → NAG → DEL,Banglore,1 stop,1,3,2019


In [0]:
big_df['Total_Stops']=big_df['Total_Stops'].replace('non-stop','0 stop')

In [0]:
big_df['Stop'] = big_df['Total_Stops'].str.split(' ').str[0]

In [0]:
big_df['Stop'] = big_df['Stop'].astype(int)

In [0]:
big_df=big_df.drop(['Total_Stops'], axis=1)

In [0]:
big_df.head()

Unnamed: 0,Additional_Info,Airline,Arrival_Time,Dep_Time,Destination,Duration,Price,Route,Source,Date,Month,Year,Stop
0,No info,IndiGo,01:10,22:20,New Delhi,2h 50m,3897.0,BLR → DEL,Banglore,24,3,2019,0
1,No info,Air India,13:15,05:50,Banglore,7h 25m,7662.0,CCU → IXR → BBI → BLR,Kolkata,1,5,2019,2
2,No info,Jet Airways,04:25,09:25,Cochin,19h,13882.0,DEL → LKO → BOM → COK,Delhi,9,6,2019,2
3,No info,IndiGo,23:30,18:05,Banglore,5h 25m,6218.0,CCU → NAG → BLR,Kolkata,12,5,2019,1
4,No info,IndiGo,21:35,16:50,New Delhi,4h 45m,13302.0,BLR → NAG → DEL,Banglore,1,3,2019,1


In [0]:
big_df['Arrival_Hour'] = big_df['Arrival_Time'] .str.split(':').str[0]
big_df['Arrival_Minute'] = big_df['Arrival_Time'] .str.split(':').str[1]

big_df['Arrival_Hour'] = big_df['Arrival_Hour'].astype(int)
big_df['Arrival_Minute'] = big_df['Arrival_Minute'].astype(int)
big_df=big_df.drop(['Arrival_Time'], axis=1)

In [0]:
big_df.head()

Unnamed: 0,Additional_Info,Airline,Dep_Time,Destination,Duration,Price,Route,Source,Date,Month,Year,Stop,Arrival_Hour,Arrival_Minute
0,No info,IndiGo,22:20,New Delhi,2h 50m,3897.0,BLR → DEL,Banglore,24,3,2019,0,1,10
1,No info,Air India,05:50,Banglore,7h 25m,7662.0,CCU → IXR → BBI → BLR,Kolkata,1,5,2019,2,13,15
2,No info,Jet Airways,09:25,Cochin,19h,13882.0,DEL → LKO → BOM → COK,Delhi,9,6,2019,2,4,25
3,No info,IndiGo,18:05,Banglore,5h 25m,6218.0,CCU → NAG → BLR,Kolkata,12,5,2019,1,23,30
4,No info,IndiGo,16:50,New Delhi,4h 45m,13302.0,BLR → NAG → DEL,Banglore,1,3,2019,1,21,35


In [0]:
big_df['Dep_Hour'] = big_df['Dep_Time'] .str.split(':').str[0]
big_df['Dep_Minute'] = big_df['Dep_Time'] .str.split(':').str[1]
big_df['Dep_Hour'] = big_df['Dep_Hour'].astype(int)
big_df['Dep_Minute'] = big_df['Dep_Minute'].astype(int)
big_df=big_df.drop(['Dep_Time'], axis=1)

In [0]:
big_df['Route_1'] = big_df['Route'] .str.split('→ ').str[0]
big_df['Route_2'] = big_df['Route'] .str.split('→ ').str[1]
big_df['Route_3'] = big_df['Route'] .str.split('→ ').str[2]
big_df['Route_4'] = big_df['Route'] .str.split('→ ').str[3]
big_df['Route_5'] = big_df['Route'] .str.split('→ ').str[4]

In [0]:
big_df.head()

Unnamed: 0,Additional_Info,Airline,Destination,Duration,Price,Route,Source,Date,Month,Year,Stop,Arrival_Hour,Arrival_Minute,Dep_Hour,Dep_Minute,Route_1,Route_2,Route_3,Route_4,Route_5
0,No info,IndiGo,New Delhi,2h 50m,3897.0,BLR → DEL,Banglore,24,3,2019,0,1,10,22,20,BLR,DEL,,,
1,No info,Air India,Banglore,7h 25m,7662.0,CCU → IXR → BBI → BLR,Kolkata,1,5,2019,2,13,15,5,50,CCU,IXR,BBI,BLR,
2,No info,Jet Airways,Cochin,19h,13882.0,DEL → LKO → BOM → COK,Delhi,9,6,2019,2,4,25,9,25,DEL,LKO,BOM,COK,
3,No info,IndiGo,Banglore,5h 25m,6218.0,CCU → NAG → BLR,Kolkata,12,5,2019,1,23,30,18,5,CCU,NAG,BLR,,
4,No info,IndiGo,New Delhi,4h 45m,13302.0,BLR → NAG → DEL,Banglore,1,3,2019,1,21,35,16,50,BLR,NAG,DEL,,


In [0]:
big_df['Price'].fillna((big_df['Price'].mean()), inplace=True)

In [0]:
big_df['Route_1'].fillna("None",inplace = True)
big_df['Route_2'].fillna("None",inplace = True)
big_df['Route_3'].fillna("None",inplace = True)
big_df['Route_4'].fillna("None",inplace = True)
big_df['Route_5'].fillna("None",inplace = True)


In [0]:
big_df.describe()



Unnamed: 0,Price,Date,Month,Year,Stop,Arrival_Hour,Arrival_Minute,Dep_Hour,Dep_Minute
count,13354.0,13354.0,13354.0,13354.0,13354.0,13354.0,13354.0,13354.0,13354.0
mean,9087.064121,13.389846,4.710574,2019.0,0.826045,13.396061,24.664146,12.513254,24.507264
std,4124.447805,8.43906,1.165622,0.0,0.674608,6.896145,16.559723,5.736273,18.832385
min,1759.0,1.0,3.0,2019.0,0.0,0.0,0.0,0.0,0.0
25%,6135.25,6.0,3.0,2019.0,0.0,8.0,10.0,8.0,5.0
50%,9087.064121,12.0,5.0,2019.0,1.0,14.0,25.0,11.0,25.0
75%,11087.0,21.0,6.0,2019.0,1.0,19.0,35.0,18.0,40.0
max,79512.0,27.0,6.0,2019.0,4.0,23.0,55.0,23.0,55.0


In [0]:
big_df=big_df.drop(['Route'], axis=1)
big_df=big_df.drop(['Duration'], axis=1)

In [0]:
big_df.head()

Unnamed: 0,Additional_Info,Airline,Destination,Price,Source,Date,Month,Year,Stop,Arrival_Hour,Arrival_Minute,Dep_Hour,Dep_Minute,Route_1,Route_2,Route_3,Route_4,Route_5
0,No info,IndiGo,New Delhi,3897.0,Banglore,24,3,2019,0,1,10,22,20,BLR,DEL,,,
1,No info,Air India,Banglore,7662.0,Kolkata,1,5,2019,2,13,15,5,50,CCU,IXR,BBI,BLR,
2,No info,Jet Airways,Cochin,13882.0,Delhi,9,6,2019,2,4,25,9,25,DEL,LKO,BOM,COK,
3,No info,IndiGo,Banglore,6218.0,Kolkata,12,5,2019,1,23,30,18,5,CCU,NAG,BLR,,
4,No info,IndiGo,New Delhi,13302.0,Banglore,1,3,2019,1,21,35,16,50,BLR,NAG,DEL,,


**Converting the Categorical into integer variable**

In [0]:

from sklearn.preprocessing import LabelEncoder

lb_encode = LabelEncoder()
big_df["Additional_Info"] = lb_encode.fit_transform(big_df["Additional_Info"])
big_df["Airline"] = lb_encode.fit_transform(big_df["Airline"])
big_df["Destination"] = lb_encode.fit_transform(big_df["Destination"])
big_df["Source"] = lb_encode.fit_transform(big_df["Source"])
big_df['Route_1']= lb_encode.fit_transform(big_df["Route_1"])
big_df['Route_2']= lb_encode.fit_transform(big_df["Route_2"])
big_df['Route_3']= lb_encode.fit_transform(big_df["Route_3"])
big_df['Route_4']= lb_encode.fit_transform(big_df["Route_4"])
big_df['Route_5']= lb_encode.fit_transform(big_df["Route_5"])

**Missing value validation**

In [0]:
def missing_values_table(df):
        # Total missing values
        mis_val = df.isnull().sum()
        
        # Percentage of missing values
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        # Make a table with the results
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        # Rename the columns
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        # Sort the table by percentage of missing descending
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        # Print some summary information
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns  

In [0]:
missing_values_table(big_df)

Your selected dataframe has 18 columns.
There are 0 columns that have missing values.


Unnamed: 0,Missing Values,% of Total Values


**Split it into test and train**

In [0]:
big_df.shape

(13354, 18)

In [0]:
df_train = big_df[0:10683]
df_test = big_df[10683:]
df_test = df_test.drop(['Price'], axis =1)

In [0]:
X = df_train.drop(['Price'], axis=1)
y = df_train.Price

In [0]:

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

**Model Building **

In [0]:

from IPython.display import Image
from IPython.core.display import HTML 
Image(url = "http://i.imgur.com/QBuDOjs.jpg")

# Linear Regression

In [0]:

#Build our model method
lm = LinearRegression()


In [0]:
#Build our cross validation method
kfolds = KFold(n_splits=50,shuffle=True, random_state=100)

In [0]:
def cv_rmse(model):
    rmse = np.sqrt(-cross_val_score(model, X, y, 
                                   scoring="neg_mean_squared_error", 
                                   cv = kfolds))
    return(rmse)

In [0]:
benchmark_model = make_pipeline(RobustScaler(),
                                lm).fit(X=X_train, y=y_train)
cv_rmse(benchmark_model).mean()


3238.316987636252

In [0]:
lm.fit(X_train,y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [0]:
predict_lm=lm.predict(X_test)

In [0]:
r2_score(y_test,predict_lm)

0.4834156699537322

# XGBoost Regressor

In [0]:
from sklearn.model_selection import GridSearchCV
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4
get_ipython().run_line_magic('matplotlib', 'inline')
import xgboost as xgb
from xgboost import XGBRegressor

In [0]:
def modelfit(alg, dtrain, target, useTrainCV=True, 
             cv_folds=5, early_stopping_rounds=50):
    
    if useTrainCV:
        xgb_param = alg.get_xgb_params()
        xgtrain = xgb.DMatrix(dtrain.values, 
                              label=y.values)
        
        print("\nGetting Cross-validation result..")
        cvresult = xgb.cv(xgb_param, xgtrain, 
                          num_boost_round=alg.get_params()['n_estimators'], 
                          nfold=cv_folds,metrics='rmse', 
                          early_stopping_rounds=early_stopping_rounds,
                          verbose_eval = True)
        alg.set_params(n_estimators=cvresult.shape[0])
    
    #Fit the algorithm on the data
    print("\nFitting algorithm to data...")
    alg.fit(dtrain, target, eval_metric='rmse')
        
    #Predict training set:
    print("\nPredicting from training data...")
    dtrain_predictions = alg.predict(dtrain)
        
    #Print model report:
    print("\nModel Report")
    print("RMSE : %.4g" % np.sqrt(mean_squared_error(target.values,
                                             dtrain_predictions)))

In [0]:

#cv_rmse(xgb_fit).mean()

In [0]:
xgb3 = XGBRegressor(learning_rate =0.1, n_estimators=200, max_depth=10,
                     min_child_weight=5 ,gamma=0, subsample=0.7,max_bin=20,
                     colsample_bytree=0.8,objective= 'reg:linear',
                     nthread=4,scale_pos_weight=1,seed=27, reg_alpha=0.00006)

xgb_fit = xgb3.fit(X_train, y_train)

  if getattr(data, 'base', None) is not None and \




In [0]:
predict_xg=xgb3.predict(X_test)

In [0]:
#accuracy_score(y_test, predict_xg)

In [0]:
cv_rmse(xgb_fit).mean()

  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




  if getattr(data, 'base', None) is not None and \




1275.0105963996864

# **Support Vector Machine**

In [0]:
from sklearn import svm
svr_opt = svm.SVR(C = 100000, gamma = 1e-08)

svr_fit = svr_opt.fit(X_train, y_train)


In [0]:
cv_rmse(svr_fit).mean()


4191.657759309572

In [0]:
df_test_xgb = df_test[['Additional_Info', 'Airline', 'Destination', 'Source', 'Date', 'Month',
       'Year', 'Stop', 'Arrival_Hour', 'Arrival_Minute', 'Dep_Hour',
       'Dep_Minute', 'Route_1', 'Route_2', 'Route_3', 'Route_4', 'Route_5']]
preds_1 = xgb_fit.predict(df_test_xgb)
df_test_xgb['Price'] = preds_1
df_test_xgb.to_csv('flight_price_10.csv')

In [0]:
accuracy_score(y_test, preds_1)

ValueError: ignored

In [0]:
data=pd.read_csv("flight_price_10.csv")

In [0]:
data