# Flight Fare Prediction

## Data Overview
Each data point corresponds to one flight trip from one city to another.

- Airline: The name of the airline.

- Date_of_Journey: The date of the journey.

- Source: The source from which the service begins.

- Destination: The destination where the service ends.

- Route: The route taken by the flight to reach the destination.

- Dep_Time: The time when the journey starts from the source.

- Arrival_Time: Time of arrival at the destination.

- Duration: Total duration of the flight.

- Total_Stops: Total stops between the source and destination.

- Additional_Info: Additional information about the flight.

- Price (target): The price of the ticket.

### Step 1: Importing the Relevant Libraries

In [1]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score

from math import sqrt
from sklearn.linear_model import LinearRegression,Ridge, Lasso
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

**Reading the Datasets**

In [2]:
os.chdir(r"C:\Users\vishw\Pictures\project")
os.getcwd()

airline = pd.read_excel('Data_Train.xlsx')
airline.head()

| | Airline | Date_of_Journey | Source | Destination | Route | Dep_Time | Arrival_Time | Duration | Total_Stops | Additional_Info | Price |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | IndiGo | 24/03/2019 | Banglore | New Delhi | BLR → DEL | 22:20 | 01:10 22 Mar | 2h 50m | non-stop | No info | 3897 |
| 1 | Air India | 1/05/2019 | Kolkata | Banglore | CCU → IXR → BBI → BLR | 05:50 | 13:15 | 7h 25m | 2 stops | No info | 7662 |
| 2 | Jet Airways | 9/06/2019 | Delhi | Cochin | DEL → LKO → BOM → COK | 09:25 | 04:25 10 Jun | 19h | 2 stops | No info | 13882 |
| 3 | IndiGo | 12/05/2019 | Kolkata | Banglore | CCU → NAG → BLR | 18:05 | 23:30 | 5h 25m | 1 stop | No info | 6218 |
| 4 | IndiGo | 01/03/2019 | Banglore | New Delhi | BLR → NAG → DEL | 16:50 | 21:35 | 4h 45m | 1 stop | No info | 13302 |


In [3]:
airline.shape

(10683, 11)

## EDA

In [4]:
# check how many null values each column contains
airline.isnull().sum()

Airline            0
Date_of_Journey    0
Source             0
Destination        0
Route              1
Dep_Time           0
Arrival_Time       0
Duration           0
Total_Stops        1
Additional_Info    0
Price              0
dtype: int64

In [5]:
airline.dropna(axis=0,inplace=True)

In [6]:
airline.columns

Index(['Airline', 'Date_of_Journey', 'Source', 'Destination', 'Route',
       'Dep_Time', 'Arrival_Time', 'Duration', 'Total_Stops',
       'Additional_Info', 'Price'],
      dtype='object')

**1. Dropping the Duplicate Rows:**

In [7]:
airline = airline.drop_duplicates()
airline.shape

(10462, 11)

**2. Splitting Date_of_Journey into day, month and year.**

In [8]:
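airline.sort_values('Date_of_Journey', inplace = True)

# Note: no explicit date format is passed here, so ambiguous values such as
# '01/03/2019' may be read month-first rather than day-first.
airline['year'] = pd.DatetimeIndex(airline['Date_of_Journey']).year
airline['month'] = pd.DatetimeIndex(airline['Date_of_Journey']).month
airline['Day'] = pd.DatetimeIndex(airline['Date_of_Journey']).day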
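
The dates in the preview above appear to be written day-first (for example `24/03/2019`). When no format is given, pandas has to guess the day/month order, which would be consistent with months such as 1, 9 and 12 showing up in the value counts later on. The following is a minimal sketch, assuming every value follows the `DD/MM/YYYY` pattern, of parsing the column with an explicit format. The `dates` series is a small illustrative sample, not the full column.

```python
import pandas as pd

# Sample values copied from the Date_of_Journey column in the preview above.
dates = pd.Series(['24/03/2019', '1/05/2019', '9/06/2019', '01/03/2019'])

# An explicit format removes the day/month ambiguity (assumes DD/MM/YYYY throughout).
parsed = pd.to_datetime(dates, format='%d/%m/%Y')

# Build the same three derived columns as in the cell above.
parts = pd.DataFrame({
    'year': parsed.dt.year,
    'month': parsed.dt.month,
    'Day': parsed.dt.day,
})
print(parts)
```

With this approach `01/03/2019` is read as 1 March rather than 3 January.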

In [9]:
# 'No Info' and 'No info' are the same label, so merge them into one.
airline['Additional_Info'] = airline['Additional_Info'].replace('No Info', 'No info')

# Group the very rare airlines under a single 'Another' label.
airline['Airline'] = airline['Airline'].replace(['Trujet', 'Vistara Premium economy'], 'Another')

**3. Converting Total_Stops into numbers and dropping the rows with NaN.**

In [10]:
airline.dropna(axis = 0, inplace = True) # drop any remaining rows with missing values (precautionary)

In [11]:
# function to convert the stops to number
def convert_into_stops(X):
    if X == '4 stops':
        return 4
    elif X == '3 stops':
        return 3
    elif X == '2 stops':
        return 2
    elif X == '1 stop':
        return 1
    elif X == 'non-stop':  # the data uses the hyphenated label 'non-stop'
        return 0

In [12]:
airline['Total_Stops'] = airline['Total_Stops'].map(convert_into_stops) # calling the function 

In [13]:
# Any label not handled by convert_into_stops maps to NaN;
# fill any NaN left in the frame with 0, then cast the stops to int.
airline.fillna(0, inplace = True)
airline['Total_Stops'] = airline['Total_Stops'].astype(int)
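
The if/elif chain plus `fillna(0)` can hide a mismatched label, because any value the function does not recognise quietly becomes 0 stops. As an alternative, here is a minimal sketch that uses a lookup dictionary and checks for unknown labels before casting. The name `STOPS` and the sample series are illustrative assumptions, and the labels follow the `Total_Stops` values shown in the data preview.

```python
import pandas as pd

# Hypothetical lookup table for the Total_Stops labels seen in the preview.
STOPS = {'non-stop': 0, '1 stop': 1, '2 stops': 2, '3 stops': 3, '4 stops': 4}

# Small illustrative sample, not the real column.
stops = pd.Series(['non-stop', '2 stops', '1 stop', '4 stops'])

# Series.map with a dict returns NaN for labels missing from the dict.
encoded = stops.map(STOPS)

# Fail loudly on unknown labels instead of silently turning them into 0.
assert encoded.notna().all(), "unexpected Total_Stops label"
encoded = encoded.astype(int)
print(encoded.tolist())
```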

**4. Converting Dep_Time into a time-of-day category: mid_night, morning, afternoon or evening.**

In [14]:
def flight_dep_time(X):
    '''
    Takes a departure time string in 'HH:MM' format and returns
    its time-of-day bucket, based on the hour.
    '''
    if int(X[:2]) >= 0 and int(X[:2]) < 6:
        return 'mid_night'
    elif int(X[:2]) >= 6 and int(X[:2]) < 12:
        return 'morning'
    elif int(X[:2]) >= 12 and int(X[:2]) < 18:
        return 'afternoon'
    elif int(X[:2]) >= 18 and int(X[:2]) < 24:
        return 'evening'

In [15]:
# bucket Dep_Time and store the result in a new column, flight_time
airline['flight_time'] = airline['Dep_Time'].apply(flight_dep_time)
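
The same six-hour buckets can also be produced in a vectorised way. This is a minimal sketch, assuming `Dep_Time` is always a zero-padded `HH:MM` string, that uses `pd.cut` with left-closed bins. The sample series is illustrative only.

```python
import pandas as pd

# Sample departure times taken from the preview above.
dep_time = pd.Series(['22:20', '05:50', '09:25', '18:05', '16:50'])

# Extract the hour as an integer (assumes a zero-padded 'HH:MM' format).
hour = dep_time.str.slice(0, 2).astype(int)

# right=False gives the intervals [0, 6), [6, 12), [12, 18), [18, 24).
flight_time = pd.cut(
    hour,
    bins=[0, 6, 12, 18, 24],
    right=False,
    labels=['mid_night', 'morning', 'afternoon', 'evening'],
)
print(flight_time.tolist())
```

The result is a categorical column; convert it with `.astype(str)` if plain strings are needed downstream.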

**5. Converting the flight duration into seconds.**

In [16]:
def convert_into_seconds(X):
    '''
    Takes a flight duration string such as '2h 50m' or '19h'
    and converts it into seconds.
    '''
    # extract the numbers in the string: [hours, minutes] or a single value
    parts = [int(s) for s in re.findall(r'-?\d+\.?\d*', X)]
    hours = parts[0] * 3600
    # a lone number is treated as hours, so a minutes-only value would be misread
    minutes = parts[1] * 60 if len(parts) == 2 else 0
    return hours + minutes

In [17]:
airline['Duration(sec)'] = airline['Duration'].map(convert_into_seconds) # calling the function and solving it
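
The function above assumes that a single number always means hours. If the column ever contains a minutes-only value, it would be counted as hours. Below is a minimal sketch of a parser that reads the `h` and `m` parts separately; `duration_to_seconds` is a hypothetical helper name, and the input format (`'2h 50m'`, `'19h'`, `'50m'`) is an assumption based on the preview.

```python
import re

def duration_to_seconds(text):
    """Convert a duration such as '2h 50m', '19h' or '50m' into seconds."""
    # Look for the hour and minute parts independently; either may be absent.
    hours = re.search(r'(\d+)\s*h', text)
    minutes = re.search(r'(\d+)\s*m', text)
    h = int(hours.group(1)) if hours else 0
    m = int(minutes.group(1)) if minutes else 0
    return h * 3600 + m * 60

for sample in ['2h 50m', '19h', '50m']:
    print(sample, duration_to_seconds(sample))
```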

In [18]:
df = airline.copy()

In [19]:
# dropping the features that are not needed for model building
df.drop(['Date_of_Journey', 'Route', 'Dep_Time', 'Arrival_Time', 'Duration','year','Day'], axis = 1, inplace = True)

In [20]:
df

| | Airline | Source | Destination | Total_Stops | Additional_Info | Price | month | flight_time | Duration(sec) |
|---|---|---|---|---|---|---|---|---|---|
| 8536 | Jet Airways | Banglore | New Delhi | 1 | No info | 25735 | 1 | afternoon | 69900 |
| 10149 | Air India | Banglore | New Delhi | 2 | Change airports | 17461 | 1 | morning | 26100 |
| 5701 | Air India | Banglore | New Delhi | 2 | No info | 25430 | 1 | morning | 138900 |
| 4829 | Jet Airways | Banglore | New Delhi | 1 | No info | 27992 | 1 | mid_night | 52500 |
| 6558 | IndiGo | Banglore | New Delhi | 0 | No info | 11934 | 1 | evening | 10200 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6944 | Air India | Kolkata | Banglore | 2 | No info | 11642 | 9 | afternoon | 45900 |
| 8086 | Jet Airways | Kolkata | Banglore | 1 | No info | 13401 | 9 | evening | 77700 |
| 3683 | IndiGo | Delhi | Cochin | 1 | No info | 6069 | 9 | afternoon | 18000 |
| 3693 | Jet Airways | Kolkata | Banglore | 1 | No info | 14571 | 9 | afternoon | 25500 |


**6. Viewing the Value Counts**

In [21]:
df['Airline'].value_counts()

Jet Airways                          3700
IndiGo                               2043
Air India                            1694
Multiple carriers                    1196
SpiceJet                              815
Vistara                               478
Air Asia                              319
GoAir                                 194
Multiple carriers Premium economy      13
Jet Airways Business                    6
Another                                 4
Name: Airline, dtype: int64

In [22]:
df['Source'].value_counts()

Delhi       4345
Kolkata     2860
Banglore    2179
Mumbai       697
Chennai      381
Name: Source, dtype: int64

In [23]:
df['Destination'].value_counts()

Cochin       4345
Banglore     2860
Delhi        1265
New Delhi     914
Hyderabad     697
Kolkata       381
Name: Destination, dtype: int64

In [24]:
df['Total_Stops'].value_counts()

1    5625
0    3475
2    1318
3      43
4       1
Name: Total_Stops, dtype: int64

In [25]:
df['Additional_Info'].value_counts()

No info                         8185
In-flight meal not included     1926
No check-in baggage included     318
1 Long layover                    19
Change airports                    7
Business class                     4
2 Long layover                     1
1 Short layover                    1
Red-eye flight                     1
Name: Additional_Info, dtype: int64

In [26]:
df['month'].value_counts()

6     2465
3     2169
5     2025
9     1375
1     1058
12     946
4      424
Name: month, dtype: int64

In [27]:
df['flight_time'].value_counts()

morning      4224
evening      2629
afternoon    2563
mid_night    1046
Name: flight_time, dtype: int64

In [28]:
df['Duration(sec)'].value_counts()

10200     544
5400      386
9900      335
10500     332
9300      329
         ... 
104100      1
148800      1
109500      1
140700      1
71400       1
Name: Duration(sec), Length: 367, dtype: int64

### Step 5: Label Encoding the Categorical Variables

In [29]:
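# Label encoding: each category is replaced by its index in alphabetical order.
# The same encoder object is re-fit for every column, so it only keeps the classes of the last one.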
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df['Airline'] = labelencoder.fit_transform(df['Airline'])
df['Source'] = labelencoder.fit_transform(df['Source'])
df['Destination'] = labelencoder.fit_transform(df['Destination'])
df['Additional_Info'] = labelencoder.fit_transform(df['Additional_Info'])
df['flight_time'] = labelencoder.fit_transform(df['flight_time'])
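
Because one encoder is re-fit for each column, the label-to-integer mappings cannot be read back from it afterwards. A minimal sketch, assuming one `LabelEncoder` per column is acceptable, that keeps each encoder and derives the mapping from its `classes_` attribute is shown below. The `toy` frame and the `encoders` and `mappings` names are illustrative, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small illustrative frame using category values from the data above.
toy = pd.DataFrame({
    'flight_time': ['morning', 'evening', 'afternoon', 'mid_night', 'morning'],
    'Source': ['Delhi', 'Kolkata', 'Banglore', 'Mumbai', 'Chennai'],
})

# Fit and keep a separate encoder for every categorical column.
encoders = {}
for col in ['flight_time', 'Source']:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, and a label's position in it is its integer code.
mappings = {
    col: {str(label): code for code, label in enumerate(enc.classes_)}
    for col, enc in encoders.items()
}
print(mappings)
```

Keeping the encoders also makes it possible to call `inverse_transform` later, or to encode new rows consistently at prediction time.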

In [30]:
df

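| | Airline | Source | Destination | Total_Stops | Additional_Info | Price | month | flight_time | Duration(sec) |
|---|---|---|---|---|---|---|---|---|---|
| 8536 | 5 | 0 | 5 | 1 | 7 | 25735 | 1 | 0 | 69900 |
| 10149 | 1 | 0 | 5 | 2 | 4 | 17461 | 1 | 3 | 26100 |
| 5701 | 1 | 0 | 5 | 2 | 7 | 25430 | 1 | 3 | 138900 |
| 4829 | 5 | 0 | 5 | 1 | 7 | 27992 | 1 | 2 | 52500 |
| 6558 | 4 | 0 | 5 | 0 | 7 | 11934 | 1 | 1 | 10200 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6944 | 1 | 3 | 0 | 2 | 7 | 11642 | 9 | 0 | 45900 |
| 8086 | 5 | 3 | 0 | 1 | 7 | 13401 | 9 | 1 | 77700 |
| 3683 | 4 | 2 | 1 | 1 | 7 | 6069 | 9 | 0 | 18000 |
| 3693 | 5 | 3 | 0 | 1 | 7 | 14571 | 9 | 0 | 25500 |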


**7. Encoded Values for Each Feature**

Airlines:
- Air Asia (0)
- Air India (1)
- Another (2)
- GoAir (3)
- IndiGo (4)
- Jet Airways (5)
- Jet Airways Business (6)
- Multiple carriers (7)
- Multiple carriers Premium economy (8)
- SpiceJet (9)
- Vistara (10)

Sources:
- Banglore (0)
- Chennai (1)
- Delhi (2)
- Kolkata (3)
- Mumbai (4)

Destination:
- Banglore(0)
- Cochin(1)
- Delhi(2)
- Hyderabad(3)
- Kolkata(4)
- New Delhi(5)

Total_Stops: (0,1,2,3,4)

Additional_Info:
- 1 Long layover (0)
- 1 Short layover (1)
- 2 Long layover (2)
- Business class (3)
- Change airports (4)
- In-flight meal not included (5)
- No check-in baggage included (6)
- No info (7)
- Red-eye flight (8)

Months: (1,3,4,5,6,9,12)

Flight Time:
- afternoon(0)
- evening(1)
- mid_night(2)
- morning(3)

Duration(sec): between 5400 and 148800

In [31]:
X = df.drop(['Price'], axis = 1)
y = df['Price'] # raw price; Price is right-skewed, but no np.log transform is applied here
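
The comment on the target mentions a log transform for the right-skewed price, but the target is used as-is. If a log-scaled target is wanted, one option is scikit-learn's `TransformedTargetRegressor`, which applies the transform before fitting and inverts it at prediction time. The sketch below is an illustration only: the data is synthetic, and the choice of `np.log1p` / `np.expm1` and the regressor settings are assumptions, not the notebook's actual configuration.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

# Synthetic, right-skewed toy data (illustration only, not the airline dataset).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 1, size=(200, 3))
y_toy = 5000 * np.exp(2.0 * X_toy[:, 0] + rng.normal(0, 0.1, size=200))

# The wrapper fits the forest on log1p(y) and returns expm1(prediction).
log_model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=50, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_model.fit(X_toy, y_toy)

# Predictions come back on the original price scale.
pred = log_model.predict(X_toy[:5])
print(pred)
```

Whether this improves the score on the real data would have to be checked on the validation set.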

In [32]:
# hold out 30% of the data as the validation set
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=22)

In [33]:
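# Despite the name `lr`, this is a random forest, not a linear regression.
# No random_state is set, so the fitted model and its score vary between runs.
lr = RandomForestRegressor()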
model = lr.fit(X_train,y_train)
y_pred = model.predict(X_test)

In [34]:
y_pred

array([ 7439.17      ,  5129.3475    , 13576.60866667, ...,
        9803.53      , 14506.64516667,  9443.42861111])

In [35]:
from sklearn.metrics import r2_score
r2_score(y_test,y_pred)

0.7643558394867086
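
Step 1 imports `mean_squared_error` and `sqrt`, but only R² is reported. An error metric in the same units as the fare is often easier to interpret. The following is a minimal sketch of computing RMSE next to R². The `y_true` values are prices from the data preview, while `y_hat` holds made-up predictions purely for illustration; in the notebook these would be `y_test` and `y_pred`.

```python
from math import sqrt

import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Prices from the first five preview rows, paired with made-up predictions.
y_true = np.array([3897, 7662, 13882, 6218, 13302])
y_hat = np.array([4000, 7500, 13000, 6500, 12500])  # illustrative values only

# RMSE is the square root of the mean squared error, in price units.
rmse = sqrt(mean_squared_error(y_true, y_hat))
r2 = r2_score(y_true, y_hat)
print(f"RMSE: {rmse:.2f}  R2: {r2:.4f}")
```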

In [36]:
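# predict the fare for one label-encoded flight; feature order:
# Airline, Source, Destination, Total_Stops, Additional_Info, month, flight_time, Duration(sec)
model.predict([[5,2,1,2,5,3,1,88500]])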

array([7439.17])

In [37]:
#Saving our model into a file
import pickle

with open("Pickle_file.pkl", 'wb') as file:
    pickle.dump(lr,file)

In [38]:
with open("Pickle_file.pkl", 'rb') as file:
    pickle_LR_Model = pickle.load(file)

In [40]:
ypredict = pickle_LR_Model.predict(X_test)

In [42]:
# feature order: Airline, Source, Destination, Total_Stops, Additional_Info, month, flight_time, Duration(sec)
pickle_LR_Model.predict([[5,2,1,2,5,3,1,88500]])

array([7439.17])