<a href="https://colab.research.google.com/github/elshikh555/Data-Analysis-/blob/main/Flight_Price_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [14]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import datetime as dt
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor
import pickle

Reading training data

In [None]:
train_data = pd.read_excel('Flight Dataset/Data_Train.xlsx')
train_data.head()

Checking values in the Destination column

In [None]:
train_data['Destination'].value_counts()

Merging Delhi and New Delhi

In [None]:
def newd(x):
    if x=='New Delhi':
        return 'Delhi'
    else:
        return x

train_data['Destination'] = train_data['Destination'].apply(newd)

Checking info of our train data.

In [None]:
train_data.info()

##Extracting journey day and month from the date

*We will extract the journey day and journey month from the Date_of_Journey column and store them in two new integer columns, as shown below.

*Then we will drop the Date_of_Journey column.

In [None]:
train_data['Journey_day'] = pd.to_datetime(train_data['Date_of_Journey'],format='%d/%m/%Y').dt.day
train_data['Journey_month'] = pd.to_datetime(train_data['Date_of_Journey'],format='%d/%m/%Y').dt.month

train_data.drop('Date_of_Journey',inplace=True,axis=1)

train_data.head()
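The same extraction can be tried on a small frame before running it on the full dataset. This is a minimal sketch: the `toy` frame and its dates are made up for illustration, and only the column name and the `%d/%m/%Y` format come from the cell above. Parsing the column once and reusing the result avoids calling `pd.to_datetime` twice.

```python
import pandas as pd

# Hypothetical toy data; only the column name and date format match the notebook.
toy = pd.DataFrame({'Date_of_Journey': ['24/03/2019', '01/05/2019', '09/06/2019']})

# Parse the dates once, then derive both features from the parsed series.
journey_date = pd.to_datetime(toy['Date_of_Journey'], format='%d/%m/%Y')
toy['Journey_day'] = journey_date.dt.day
toy['Journey_month'] = journey_date.dt.month

# The raw string column is no longer needed once the features exist.
toy = toy.drop(columns='Date_of_Journey')
print(toy)
```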




##Extracting hours and minutes from time.

-We will extract the departure hour and departure minute from the departure time.

-The same will be done for the arrival time.

-After that, we will drop both original columns.

In [None]:
train_data['Dep_hour'] = pd.to_datetime(train_data['Dep_Time']).dt.hour
train_data['Dep_min'] = pd.to_datetime(train_data['Dep_Time']).dt.minute
train_data.drop('Dep_Time',axis=1,inplace=True)

train_data['Arrival_hour'] = pd.to_datetime(train_data['Arrival_Time']).dt.hour
train_data['Arrival_min'] = pd.to_datetime(train_data['Arrival_Time']).dt.minute
train_data.drop('Arrival_Time',axis=1,inplace=True)

train_data.head()

Checking values in the Duration column

In [None]:
train_data['Duration'].value_counts()

##Extracting hours and minutes from the Duration column, then dropping it.
- 1 Bring every duration to the same format. A flight that lasts only 30 minutes appears as ‘30m’, so we rewrite it as ‘0h 30m’; a duration like ‘2h’ becomes ‘2h 0m’.
- 2 Split each duration into two components, hours and minutes.
- 3 Add two columns, ‘Duration_hours’ and ‘Duration_mins’.
- 4 Drop the original Duration column.


In [None]:
duration = list(train_data['Duration'])

for i in range(len(duration)):
    if len(duration[i].split()) != 2:
        if 'h' in duration[i]:
            duration[i] = duration[i] + ' 0m'
        else:
            duration[i] = '0h ' + duration[i]

duration_hour = []
duration_min = []

for i in duration:
    h,m = i.split()
    duration_hour.append(int(h[:-1]))
    duration_min.append(int(m[:-1]))

train_data['Duration_hours'] = duration_hour
train_data['Duration_mins'] = duration_min

train_data.drop('Duration',axis=1,inplace=True)
train_data.head()
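As an alternative to the loop above, the same split can be written as one regular expression. This is only a sketch on a made-up `toy` frame, not the notebook's method: `str.extract` with named groups returns one column per group, and a missing component comes back as NaN, which is then filled with 0.

```python
import pandas as pd

# Hypothetical durations covering the three shapes discussed above.
toy = pd.DataFrame({'Duration': ['2h 50m', '19h', '5m', '7h 25m']})

# Both components are optional; named groups become the output column names.
pattern = r'(?:(?P<hours>\d+)h)?\s*(?:(?P<mins>\d+)m)?'
parts = toy['Duration'].str.extract(pattern)

# Missing hours or minutes are NaN after extraction, so treat them as 0.
toy['Duration_hours'] = pd.to_numeric(parts['hours']).fillna(0).astype(int)
toy['Duration_mins'] = pd.to_numeric(parts['mins']).fillna(0).astype(int)

toy = toy.drop(columns='Duration')
print(toy)
```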

Plotting Airline vs Price

In [None]:
sns.catplot(x='Airline',y='Price',data=train_data.sort_values('Price',ascending=False),kind='boxen',aspect=3,height=6)

Create dummy columns out of the Airline column

In [None]:
airline = train_data[['Airline']]
airline = pd.get_dummies(airline,drop_first=True)

Plotting Source vs Price.

In [None]:
# If we are going from Banglore the prices are slightly higher as compared to other cities
sns.catplot(x='Source',y='Price',data=train_data.sort_values('Price',ascending=False),kind='boxen',aspect=3,height=4)

Create dummy columns out of the Source column

In [None]:
source = train_data[['Source']]
source = pd.get_dummies(source,drop_first=True)
source.head()

Plotting Destination vs Price.

In [None]:
# If we are going to New Delhi the prices are slightly higher as compared to other cities
sns.catplot(x='Destination',y='Price',data=train_data.sort_values('Price',ascending=False),kind='boxen',aspect=3,height=4)

Create dummy columns out of the Destination column

In [None]:
destination = train_data[['Destination']]
destination = pd.get_dummies(destination,drop_first=True)
destination.head()
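To see what `drop_first=True` does in the three encoding cells above, here is a small sketch on made-up data (the `toy` frame is hypothetical). The first category in sorted order gets no column of its own; a row with all dummy columns false stands for that category, which avoids a redundant column.

```python
import pandas as pd

# Hypothetical toy data with three cities.
toy = pd.DataFrame({'Source': ['Delhi', 'Kolkata', 'Mumbai', 'Delhi']})

# 'Delhi' sorts first, so it is the dropped (baseline) category.
dummies = pd.get_dummies(toy[['Source']], drop_first=True)

print(dummies.columns.tolist())
print(dummies)
```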

Dropping columns we will not use.

In [None]:
train_data.drop(['Route','Additional_Info'],inplace=True,axis=1)

Checking values in the Total stops column.

In [None]:
train_data['Total_Stops'].value_counts()

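Converting labels into numbers in the Total_Stops column

In [None]:
# According to the data, price rises with the number of stops, so the labels are encoded as ordered integers.
# Assigning the result back avoids the chained inplace call, which newer pandas versions warn about.
train_data['Total_Stops'] = train_data['Total_Stops'].replace({'non-stop':0,'1 stop':1,'2 stops':2,'3 stops':3,'4 stops':4})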
train_data.head()

Checking the shapes of our 4 data frames

In [None]:
print(airline.shape)
print(source.shape)
print(destination.shape)
print(train_data.shape)

Combine all 4 data frames.

In [None]:
data_train = pd.concat([train_data,airline,source,destination],axis=1)
data_train.drop(['Airline','Source','Destination'],axis=1,inplace=True)
data_train.head()

Separating the input features (X)

In [None]:
X = data_train.drop('Price',axis=1)
X.head()

Separating the target labels (y).

In [None]:
y = data_train['Price']
y.head()

Checking correlations between columns

In [None]:
plt.figure(figsize=(10,10))
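# train_data still holds the text columns Airline, Source and Destination;
# numeric_only=True (pandas 1.5+) restricts the correlation to numeric columns.
sns.heatmap(train_data.corr(numeric_only=True),cmap='viridis',annot=True)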

First try out the ExtraTreesRegressor model for Flight Price Prediction.

In [None]:
reg = ExtraTreesRegressor()
reg.fit(X,y)

print(reg.feature_importances_)

Checking the feature importances given by ExtraTreesRegressor

In [None]:
plt.figure(figsize = (12,8))
feat_importances = pd.Series(reg.feature_importances_, index=X.columns)
feat_importances.nlargest(20).plot(kind='barh')
plt.show()
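For a feel of how these importances behave, the sketch below fits the same estimator on synthetic data where only two of five features carry signal. The data and the `f0`..`f4` column names are made up for illustration. The importances are normalized to sum to 1, so each value reads as a share of the total.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic data: 5 features, only 2 of which influence the target.
X_arr, y_demo = make_regression(n_samples=200, n_features=5,
                                n_informative=2, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f'f{i}' for i in range(5)])

model = ExtraTreesRegressor(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

# Pair each importance with its column name and rank them.
importances = pd.Series(model.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```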

Splitting our data into Training and Testing data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

##Training Random Forest Regressor model for Flight Price Prediction.
-Here we use RandomizedSearchCV, which samples a fixed number of random hyperparameter combinations (n_iter) and keeps the one with the best cross-validated score.

-Below we declare the RandomForestRegressor hyperparameter values we want to sample from.

In [None]:
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]

# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for regressors it meant all features, i.e. 1.0)
max_features = [1.0, 'sqrt']

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5, 30, num = 6)]

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15, 100]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 5, 10]

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}


# Random search of parameters, using 5-fold cross-validation, sampling 10 different combinations (n_iter = 10)
rf_random = RandomizedSearchCV(estimator = RandomForestRegressor(), param_distributions = random_grid,
                               scoring='neg_mean_squared_error', n_iter = 10, cv = 5,
                               verbose=1, random_state=42, n_jobs = 1)
rf_random.fit(X_train,y_train)
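The effect of `n_iter` is easier to see on a small problem. The following is a minimal sketch on synthetic data with a deliberately tiny, made-up search space: the full grid has 27 combinations, but only 4 are sampled and cross-validated.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data, used only to keep the example fast.
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)

# Hypothetical small search space: 3 * 3 * 3 = 27 possible combinations.
param_distributions = {
    'n_estimators': [10, 20, 30],
    'max_depth': [3, 5, None],
    'min_samples_leaf': [1, 2, 5],
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_distributions,
    n_iter=4,                          # sample 4 of the 27 combinations
    cv=3,                              # 3-fold cross-validation per combination
    scoring='neg_mean_squared_error',  # higher (closer to 0) is better
    random_state=0,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print('best cross-validated MSE:', -search.best_score_)
```

Because the scoring is a negated error, `best_score_` is negative; flipping the sign gives the mean squared error of the best combination.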

Checking the best parameters we got using Randomized Search CV.

In [None]:
rf_random.best_params_

Taking Predictions

In [None]:
# Flight Price Prediction
prediction = rf_random.predict(X_test)

Plotting the residuals.

In [None]:
plt.figure(figsize = (8,8))
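# distplot is deprecated in recent seaborn releases; histplot with kde=True draws a comparable plot
sns.histplot(y_test-prediction, kde=True)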
plt.show()

Plotting y_test vs predictions.

In [None]:
plt.figure(figsize = (8,8))
plt.scatter(y_test, prediction, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()

Printing metrics.

In [None]:
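from sklearn import metrics

# The predictions are stored in `prediction` (see the "Taking Predictions" cell)
print('r2 score: ', metrics.r2_score(y_test,prediction))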
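R² alone does not say how far off the predicted prices are in absolute terms. As a sketch, the snippet below computes MAE and RMSE alongside R² on a few made-up prices; `y_true` and `y_hat` are illustrative values, not results from this notebook.

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted prices, for illustration only.
y_true = np.array([3000.0, 5000.0, 7000.0, 9000.0])
y_hat = np.array([3200.0, 4800.0, 7500.0, 8500.0])

mae = metrics.mean_absolute_error(y_true, y_hat)            # average absolute error
rmse = np.sqrt(metrics.mean_squared_error(y_true, y_hat))   # penalizes large errors more
r2 = metrics.r2_score(y_true, y_hat)                        # share of variance explained

print('MAE: ', mae)
print('RMSE:', rmse)
print('R2:  ', r2)
```

MAE and RMSE are in the same units as the price, which makes them easier to interpret than R² when judging whether the model is accurate enough.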