<center><h2 style='font-family:monospace;'>FLIGHT FARE PREDICTION USING ML 🛫🛫</h2></center>
<center>Dataset Link <br><a href='https://www.kaggle.com/nikhilmittal/flight-fare-prediction-mh/'>Flight Fare Prediction MH</a></center>

<h3>Problem Statement</h3>
<p style='font-family:Verdana;'>
Flight ticket prices can be hard to guess: the price we see today may be different when we check the same flight tomorrow. Travelers often say that flight ticket prices are unpredictable. As data scientists, we will try to show that, given the right data, fares can be predicted reasonably well. The dataset provides flight ticket prices for various airlines, between various cities, for journeys between March and June 2019. Size of training set: 10683 records
</p>

Size of test set: 2671 records

#### Columns

* Airline: The name of the airline.
* Date_of_Journey: The date of the journey
* Source: The source from which the service begins.
* Destination: The destination where the service ends.
* Route: The route taken by the flight to reach the destination.
* Dep_Time: The time when the journey starts from the source.
* Arrival_Time: Time of arrival at the destination.
* Duration: Total duration of the flight.
* Total_Stops: Total stops between the source and destination.
* Additional_Info: Additional information about the flight

* Price: The price of the ticket (the target to predict).

> Predict The Flight Fare Based On User Ticket Details.

## Workflow To Be Followed

<p style='font-family:Verdana;'>
Step 1: Loading The Dataset <br>
Step 2: Performing EDA <br>
Step 3: Feature Engineering <br>
Step 4: Model Training & Evaluation <br>
Step 5: Testing The Model On New Data. <br>

</p>    

## STEP 1: Loading Dataset

In [None]:
## Basic Libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

import seaborn as sns
dark = sns.color_palette('dark')
bright = sns.color_palette('bright')
deep = sns.color_palette('deep')
pastel = sns.color_palette('pastel')

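# "seaborn-dark" was renamed "seaborn-v0_8-dark" in Matplotlib 3.6; use whichever is available
plt.style.use("seaborn-v0_8-dark" if "seaborn-v0_8-dark" in plt.style.available else "seaborn-dark")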

import plotly.graph_objects as go
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

In [None]:
%pip install openpyxl

In [None]:
# Load The Dataset
df =pd.read_excel("../input/flight-fare-prediction-mh/Data_Train.xlsx")
df.head()

In [None]:
df.shape

In [None]:
df_test=pd.read_excel("../input/flight-fare-prediction-mh/Test_set.xlsx")


In [None]:
df_test.head()

In [None]:
df_test.shape

## STEP 2: EDA

#### Basic EDA

In [None]:
df.info()

In [None]:
# A Random Sample From The Data
df.sample(8)

In [None]:
## Missing Values
df.isna().sum()

Great, we have very few null values.

### Analysis

Price

In [None]:
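# distplot is deprecated in recent seaborn releases; histplot with a KDE curve replaces it
sns.histplot(df['Price'], kde=True)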

In [None]:
## Number of Flights Priced Higher Than 40000
df[df.Price>40000]['Price'].count()

In [None]:
## Number of Flights Priced Lower Than 2000
df[df.Price<2000]['Price'].count()

plotting all nominal categorical columns

In [None]:
nominal_categorical_columns = [feature for feature in df.columns if df[feature].dtype=='O' and df[feature].nunique()<15]
nominal_categorical_columns

In [None]:
for col in nominal_categorical_columns:
    plt.figure(figsize=(30,7))
    ax = sns.countplot(x=col, data=df)
    ax.bar_label(ax.containers[0])
    labels = (df[col].value_counts() / len(df))*100
    plt.title(col)
    plt.xlabel(f'{labels}')
    plt.show()

High-cardinality categorical columns (more than 15 unique values; despite the variable name, these are not ordinal)

In [None]:
ordinal_categorical_columns = [feature for feature in df.columns if df[feature].dtype=='O' and df[feature].nunique()>15]
ordinal_categorical_columns

Date_of_Journey

In [None]:
## Number of Flights In Each Month
df['temp'] = pd.to_datetime(df['Date_of_Journey'], format='%d/%m/%Y')
ax = sns.countplot(x=df['temp'].dt.month)
labels = ['March','April','May','June']
ax.bar_label(ax.containers[0])
ax.set_xticklabels(labels);
df.drop('temp',axis=1,inplace=True)

In [None]:
## Number of Flights By Day of The Month
df['Date_of_Journey'] = pd.to_datetime(df['Date_of_Journey'], format='%d/%m/%Y')
ax = sns.countplot(x=df['Date_of_Journey'].dt.day)
ax.bar_label(ax.containers[0]);

Dep_Time

In [None]:
df['Dep_Time'].head()

In [None]:
## The Average Price of Flights Based on Their Departure Time
df['Dep_Time'] = pd.to_datetime(df['Dep_Time'], format='%H:%M')
df.groupby(df['Dep_Time'].dt.hour)['Price'].mean().plot.bar(color=deep)

Airline

In [None]:
df.Airline.value_counts()

In [None]:
sns.catplot(y = "Price", x = "Airline", data =df.sort_values("Price", ascending = False), kind="boxen", height = 10, aspect = 3)
plt.show()

The median price is almost the same for every airline except Jet Airways.

#### Date_of_journey

In [None]:
df['Date_of_Journey'].unique(),df['Date_of_Journey'].nunique()

In [None]:
## Average Flight Price Value On Different Date of Journey
df.groupby('Date_of_Journey')['Price'].mean().plot.bar(figsize=(20,10));

In [None]:
## Maximum Flight Fare
max_fare = df[df['Price']==df['Price'].max()][['Date_of_Journey','Price','Airline','Route','Duration']]
max_fare

In [None]:
## Minimum Flight Fare
min_fare = df[df['Price']==df['Price'].min()][['Date_of_Journey','Price','Airline','Route','Duration']]
min_fare

#### Source

In [None]:
## All Unique Source Values
df.Source.unique()

In [None]:
## Number of Flights From Each Source
ax = sns.countplot(data=df,x='Source')
ax.bar_label(ax.containers[0]);

In [None]:
df.columns

In [None]:
df.info()

## STEP 3: FEATURE ENGINEERING

##### Date_of_Journey, Dep_Time, Arrival_Time

In [None]:
## Convert and Extract Day & Month
def extract_day_and_month(data,col):
    # Operate on the frame passed in, not on the global df
    data[col]=pd.to_datetime(data[col])
    data[col+'_Date'] = data[col].dt.day
    data[col+'_Month'] = data[col].dt.month

## Convert and Extract Hour & Minute
def extract_hour_and_minute(data,col):
    data[col]=pd.to_datetime(data[col])
    data[col+'_Hour']=data[col].dt.hour
    data[col+'_Min']=data[col].dt.minute

In [None]:
## Date_of_Journey
extract_day_and_month(df,'Date_of_Journey')

## Arrival Time
extract_hour_and_minute(df,'Arrival_Time')

## Dep Time
extract_hour_and_minute(df,'Dep_Time')

df.drop(['Date_of_Journey','Arrival_Time','Dep_Time'], axis=1, inplace=True)

In [None]:
# ## One More Way (Without Converting Into DateTime)

# # Date_of_Journey
# df["Day"]=(df['Date_of_Journey'].apply(lambda x:x.split("/")[0])).astype(int)
# df["Month"]=(df['Date_of_Journey'].apply(lambda x:x.split("/")[1])).astype(int)

# # Arrival_Time
# arrival_time = df['Arrival_Time'].apply(lambda x:x.split(' ')[0])
# df['Arrival_hour'] = arrival_time.apply(lambda x:x.split(':')[0]).astype(int)
# df['Arrival_minute'] = arrival_time.apply(lambda x:x.split(':')[1]).astype(int)

# # Dep_Time
# df['Dep_hour'] = df['Dep_Time'].apply(lambda x:x.split(":")[0]).astype(int)
# df['Dep_minute'] = df['Dep_Time'].apply(lambda x:x.split(":")[1]).astype(int)

In [None]:
df.head(3)

In [None]:
df.info()

##### Route

In [None]:
df.drop('Route',axis=1,inplace=True)
df.head()

##### Duration

In [None]:
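def handle_single_duration_data(df,col):
    # Pad values such as "19h" or "45m" so every duration has both parts ("19h 0m", "0h 45m").
    # .loc avoids chained assignment, which pandas may silently apply to a copy.
    for i in df.index:
        if 'h' not in df.loc[i, col]:
            df.loc[i, col] = "0h "+str(df.loc[i, col])
        elif 'm' not in df.loc[i, col]:
            df.loc[i, col] = str(df.loc[i, col])+" 0m"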



def extract_hour_from_duration(val):
    return val.split(' ')[0][0:-1]

def extract_minutes_from_duration(val):
    return val.split(' ')[1][0:-1]

In [None]:
handle_single_duration_data(df,'Duration')

In [None]:
df['Duration_Hour'] = df['Duration'].apply(extract_hour_from_duration)
df['Duration_Minute'] = df['Duration'].apply(extract_minutes_from_duration)

In [None]:
df['Duration_Hour'] = df['Duration_Hour'].astype(int)
df['Duration_Minute'] = df['Duration_Minute'].astype(int)

In [None]:
df.drop('Duration',axis=1,inplace=True)
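
The padding and splitting steps above can also be sketched as a single parser. This is a minimal alternative, not the method used in this notebook: `parse_duration` is a hypothetical helper, the sample values are made up, and it assumes every duration looks like `2h 50m`, `19h`, or `45m`.

```python
import re
import pandas as pd

def parse_duration(text):
    """Return (hours, minutes) for strings such as '2h 50m', '19h' or '45m'."""
    hours = re.search(r'(\d+)\s*h', text)
    minutes = re.search(r'(\d+)\s*m', text)
    # A missing part counts as zero
    return (int(hours.group(1)) if hours else 0,
            int(minutes.group(1)) if minutes else 0)

# Toy values standing in for the Duration column
durations = pd.Series(['2h 50m', '19h', '45m'])
parsed = pd.DataFrame(durations.apply(parse_duration).tolist(),
                      columns=['Duration_Hour', 'Duration_Minute'])
print(parsed)
```

Parsing each value directly avoids mutating the column in place before splitting it.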

#### Total_Stops

In [None]:
df['Total_Stops'].head()

In [None]:
df['Total_Stops'].unique()

In [None]:
df.Total_Stops.isna().sum()

In [None]:
# Drop the rows with null values
df.dropna(inplace=True)

In [None]:
df.Total_Stops.isna().sum()

In [None]:
df['Total_Stops']=df['Total_Stops'].map({
    'non-stop':0,
    '1 stop':1,
    '2 stops':2,
    '3 stops':3,
    '4 stops':4}).astype(int)

In [None]:
df.head()

#### Categorical Columns

##### Additional_Info

In [None]:
df['Additional_Info'].unique(),df['Additional_Info'].nunique()

In [None]:
df['Additional_Info'] = df['Additional_Info'].str.replace('No Info','No info')

In [None]:
df['Additional_Info'].unique(),df['Additional_Info'].nunique()

In [None]:
Additional_Info = df[['Additional_Info']]
Additional_Info = pd.get_dummies(Additional_Info,drop_first=True)
Additional_Info

##### Airline

In [None]:
df['Airline'].head()

In [None]:
df['Airline'].unique()

In [None]:
Airline = df[["Airline"]]
Airline = pd.get_dummies(Airline,drop_first=True)
Airline

##### Source

In [None]:
df['Source'].head()

In [None]:
df['Source'].unique()

In [None]:
Source = df[["Source"]]
Source = pd.get_dummies(Source,drop_first=True)
Source

##### Destination

In [None]:
df['Destination'].head()

In [None]:
df['Destination'].unique()

Airline, Source, and Destination are nominal categorical columns, so we convert them to numeric columns using one-hot encoding (`pd.get_dummies`) rather than label encoding.

In [None]:
Destination = df[["Destination"]]
Destination = pd.get_dummies(Destination,drop_first=True)
Destination
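
To make the encoding step concrete, here is a small sketch of what `pd.get_dummies(..., drop_first=True)` produces. The toy `Destination` values are made up for illustration; the dropped first category acts as the all-zeros baseline.

```python
import pandas as pd

# Toy data, not taken from the dataset
toy = pd.DataFrame({'Destination': ['Cochin', 'Delhi', 'Kolkata', 'Delhi']})

# One indicator column per category, minus the first one (the baseline)
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

A row for the baseline category (`Cochin` here) is represented by zeros in every indicator column.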

In [None]:
### Dropping Columns
df.drop(['Destination','Airline','Source','Additional_Info'],axis=1,inplace=True)

Price

In [None]:
df['Price'] = df['Price'].astype(int)

In [None]:
final_df=pd.concat([df,Airline,Source,Destination,Additional_Info],axis=1)


In [None]:
final_df.head()

In [None]:
final_df.info()

In this notebook I plan to use tree-based ensemble models, which are relatively robust to outliers, so no outlier treatment is applied.

#### Feature Selection

In [None]:
final_df.shape

In [None]:
X = final_df.drop('Price',axis=1)
X.head()

In [None]:
y = final_df['Price'].astype(int)
y

##### 1. Correlation Matrix

In [None]:
plt.figure(figsize = (30,30))
sns.heatmap(final_df.corr(), annot = True, cmap = "RdYlGn")

plt.show()

The correlation matrix gives us a rough idea. Now let's try another approach.

In [None]:
from sklearn.feature_selection import mutual_info_regression
# Price is a continuous target, so the regression variant of mutual information applies
feature_imp = pd.DataFrame(mutual_info_regression(X,y),index=X.columns,columns=['Importance'])

In [None]:
feature_imp.sort_values(by='Importance',ascending=False)

The feature importances show that the `Additional_Info` columns carry less information than the other features, so we remove them from the dataset completely. Some `Airline` columns also score zero; we look at those in the next step.

In [None]:
### Removing All Additional info Columns
final_df.drop(list(final_df.filter(regex = 'Additional_Info')), axis = 1, inplace = True)

In [None]:
final_df

In [None]:
X = final_df.drop('Price',axis=1)
y = final_df['Price']

feature_imp = pd.DataFrame(mutual_info_regression(X,y),index=X.columns,columns=['Feature_Importance'])
feature_imp.sort_values(by='Feature_Importance',ascending=False)
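
`Price` is a continuous target, and scikit-learn provides `mutual_info_regression` for that case (`mutual_info_classif` treats every distinct price as a separate class). A minimal sketch on synthetic data, assuming one informative feature and one pure-noise feature; the names `toy_X` and `toy_y` are hypothetical:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500
toy_X = pd.DataFrame({
    'informative': rng.normal(size=n),
    'noise': rng.normal(size=n),
})
# The target depends only on the informative column (plus a little noise)
toy_y = 3 * toy_X['informative'] + rng.normal(scale=0.1, size=n)

mi = pd.Series(mutual_info_regression(toy_X, toy_y, random_state=0),
               index=toy_X.columns)
print(mi.sort_values(ascending=False))
```

The informative column should receive a clearly higher score than the noise column, which is the pattern to look for in the real feature table.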

In [None]:
# Important feature using ExtraTreesRegressor

from sklearn.ensemble import ExtraTreesRegressor
selection = ExtraTreesRegressor()
selection.fit(X, y)
print(selection.feature_importances_)
plt.figure(figsize = (12,8))
feat_importances = pd.Series(selection.feature_importances_, index=X.columns)
feat_importances.nlargest(30).plot(kind='barh')
plt.show()

This time the results are slightly different: for example, Jet_Airways_Business has a somewhat higher importance. The final decision will be made after modeling.

## STEP 4: Model Training

1. Split The Data Into Training and Validation Sets

2. Scaling The Data if Required By The Model

3. Loading & Fitting The Model On The Training Data

4. Predict Y_test using X_test

5. Calculate MAE, R2 Score, & RMSE Score For Evaluating The Model

6. Plot The Prediction Graph
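
For reference, the three metrics named in step 5 can be computed by hand. This is a small worked sketch with made-up numbers, not results from this dataset:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                    # mean absolute error
rmse = np.sqrt(np.mean(errors ** 2))             # root mean squared error
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(mae, rmse, r2)
```

MAE and RMSE are in the same units as the price, while R2 is unitless, which is why it is convenient for comparing models.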

In [None]:
# Step 1
## Splitting the dataset
from sklearn.model_selection import train_test_split
X = final_df.drop('Price',axis=1)
y = final_df['Price']
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.20,random_state=42)


In [None]:
X_train.shape

In [None]:
X_test.shape

In [None]:
## St. 5 Evaluating Model
from sklearn.metrics import r2_score,mean_absolute_error,mean_squared_error
r2_scores = {}

def fit_and_evaluate(prediction_model):
    print(f'###### MACHINE LEARNING MODEL : {prediction_model}')
    
    model= prediction_model.fit(X_train,y_train)
    print("Training score: {}".format(model.score(X_train,y_train)))
    

    predictions = model.predict(X_test)
    print("Predictions:\n",predictions)
    
    print('\n')
    
    r2score=r2_score(y_test,predictions) 
    print("r2 score is: {}".format(r2score))
    r2_scores[f'{prediction_model}'] = r2score
          
    print('MAE:{}'.format(mean_absolute_error(y_test,predictions)))
    print('MSE:{}'.format(mean_squared_error(y_test,predictions)))
    print('RMSE:{}'.format(np.sqrt(mean_squared_error(y_test,predictions))))
     
    sns.histplot(y_test-predictions, kde=True)  # distplot is deprecated in recent seaborn releases

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor

In [None]:
## Fitting Each Model With Base Parameters One By One
fit_and_evaluate(KNeighborsRegressor())

In [None]:
fit_and_evaluate(DecisionTreeRegressor())

In [None]:
fit_and_evaluate(RandomForestRegressor())

In [None]:
fit_and_evaluate(GradientBoostingRegressor())

In [None]:
## Comparing Different R2 Scores
plt.figure(figsize=(12,6))
# The stored values are R2 scores, so label them as such rather than as accuracy
scores = pd.DataFrame(r2_scores.items(),columns=['Model', 'R2_Score'])
ax = sns.barplot(data=scores.sort_values("R2_Score", ascending = False),x='Model',y='R2_Score')
ax.bar_label(ax.containers[0]);

The best model is `RandomForestRegressor()`. Let's try to improve its R2 score with hyperparameter tuning.

#### Hyper Parameter Tuning
1. Randomized Search CV

In [None]:
from sklearn.model_selection import RandomizedSearchCV
## PARAMETERS
# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 500, num = 20)]

# Number of features to consider at every split
# ('auto' was removed in scikit-learn 1.3; for regressors it meant all features, i.e. 1.0)
max_features = [1.0,'sqrt']

# Maximum number of levels in tree
max_depth =  [int(x) for x in np.linspace(start = 10, stop = 25, num = 8)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 7,15]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4,10]

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the param grid
param_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [None]:
param_grid

In [None]:
rfr=RandomForestRegressor()
rfr_tuned=RandomizedSearchCV(estimator=rfr,
                             param_distributions=param_grid,
                             cv=5,
                             verbose=2, ## Print Amount of Message (The Higher The Number The More Message Gets Printed)
                             n_jobs=-1,
                             scoring='neg_mean_squared_error',
                             n_iter = 15,
                            random_state=42)

In [None]:
## Fitting or Training The Model
rfr_tuned.fit(X_train,y_train)

In [None]:
## Best Parameters
rfr_tuned.best_params_
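
Because the search above uses `scoring='neg_mean_squared_error'`, its `best_score_` is a negated MSE (closer to zero is better). A minimal sketch on synthetic data showing how to turn it into a cross-validated RMSE; the tiny parameter space and the `make_regression` data are assumptions for illustration only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={'n_estimators': [10, 20, 30], 'max_depth': [3, 5, None]},
    n_iter=3,
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=0,
)
search.fit(X_toy, y_toy)

cv_rmse = np.sqrt(-search.best_score_)  # undo the sign flip, then take the root
print(search.best_params_, cv_rmse)
```

The same conversion applied to `rfr_tuned.best_score_` would give a cross-validated RMSE to compare against the held-out RMSE.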

In [None]:
## Predictions On The Held-Out Split
pred = rfr_tuned.predict(X_test)

In [None]:
sns.displot(y_test-pred)

In [None]:
plt.figure(figsize = (8,8))
plt.scatter(y_test, pred, alpha = 0.5)
plt.xlabel("y_test")
plt.ylabel("y_pred")
plt.show()

In [None]:
r2_score(y_test,pred)

In [None]:
print('MAE:', mean_absolute_error(y_test, pred))
print('MSE:', mean_squared_error(y_test, pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, pred)))

### Hyper Parameter Tuning Details
```
First Iteration With Some Parameter  : 0.848079685690615
MAE: 1086.59 MSE: 2925328.586 RMSE: 1710.35

Second Iteration With Change In Parameter(change in n_estimator range) : 0.8504176918461204
MAE: 1163.69 MSE: 2880308.69 RMSE: 1697.14

Third Iteration With Some Other Changes In Parameter (increased n_estimator range from 10 to 20) : 0.8624978847736444 
MAE: 1073.21 MSE: 2647696.39 RMSE:1627.17
```

## Test Data

In [None]:
df_test.head()

In [None]:
df_test.columns

In [None]:
df_test.isna().sum()

In [None]:
##  Dropping Columns (Route,Additional_Info)
df_test.drop(['Route','Additional_Info'],axis=1,inplace=True)

## Handling Categorical Columns(Airline, Total_Stops,Source,Destination)
## Airline
Airline = df_test[["Airline"]]
Airline = pd.get_dummies(Airline,drop_first=True)

## Source
Source = df_test[["Source"]]
Source = pd.get_dummies(Source,drop_first=True)

## Destination
Destination = df_test[["Destination"]]
Destination = pd.get_dummies(Destination,drop_first=True)

## Total_Stops
df_test['Total_Stops']=df_test['Total_Stops'].map({
    'non-stop':0,
    '1 stop':1,
    '2 stops':2,
    '3 stops':3,
    '4 stops':4}).astype(int)

## Handling Datetime Object Columns (Arrival_Time, Dep_Time)

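## Convert and Extract Day & Month
def extract_day_and_month(data,col):
    # Dates are day-first (dd/mm/yyyy); use the same format and column suffix as in training
    data[col]=pd.to_datetime(data[col], format='%d/%m/%Y')
    data[col+'_Date'] = data[col].dt.day
    data[col+'_Month'] = data[col].dt.month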
    
## Convert and Extract Hour & Minute
def extract_hour_and_minute(data,col):
    data[col]=pd.to_datetime(data[col])
    data[col+'_Hour']=data[col].dt.hour
    data[col+'_Min']=data[col].dt.minute

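def handle_single_duration_data(df,col):
    # Pad values such as "19h" or "45m" so every duration has both parts ("19h 0m", "0h 45m").
    # .loc avoids chained assignment, which pandas may silently apply to a copy.
    for i in df.index:
        if 'h' not in df.loc[i, col]:
            df.loc[i, col] = "0h "+str(df.loc[i, col])
        elif 'm' not in df.loc[i, col]:
            df.loc[i, col] = str(df.loc[i, col])+" 0m"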



def extract_hour_from_duration(val):
    return val.split(' ')[0][0:-1]

def extract_minutes_from_duration(val):
    return val.split(' ')[1][0:-1]


## Date_of_Journey
extract_day_and_month(df_test,'Date_of_Journey')

## Arrival Time
extract_hour_and_minute(df_test,'Arrival_Time')

## Dep Time
extract_hour_and_minute(df_test,'Dep_Time')

## Duration
handle_single_duration_data(df_test,'Duration')
# Cast to int, as was done for the training data
df_test['Duration_Hour'] = df_test['Duration'].apply(extract_hour_from_duration).astype(int)
df_test['Duration_Minute'] = df_test['Duration'].apply(extract_minutes_from_duration).astype(int)



df_test.drop(['Date_of_Journey','Arrival_Time','Dep_Time','Duration','Airline','Destination','Source'], axis=1, inplace=True)
df_test = pd.concat([df_test,Airline,Source,Destination],axis=1)
df_test.head()

In [None]:
X_train.shape,df_test.shape

There is a problem: `X_train` and the new test data have a different number of columns (29 vs. 28), so a column must be missing from the test data.

In [None]:
Airline.T

`Airline` has only 10 categories in the test data. It is short by one category, `Airline_Trujet`, so we add that column with 0 in every row.

In [None]:
# No test flight is operated by Trujet, so the indicator is 0 in every row
df_test['Airline_Trujet'] = 0
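
The fix above works for a single missing column. A more general sketch, assuming toy train and test frames (hypothetical data), is to reindex the encoded test set against the training columns, which adds any missing dummy columns as zeros and restores the training column order:

```python
import pandas as pd

train_toy = pd.get_dummies(pd.DataFrame({'Airline': ['IndiGo', 'SpiceJet', 'Trujet']}), dtype=int)
test_toy = pd.get_dummies(pd.DataFrame({'Airline': ['SpiceJet', 'IndiGo']}), dtype=int)

# Columns missing from the test set are created and filled with 0;
# columns that never appeared in training would be dropped
aligned = test_toy.reindex(columns=train_toy.columns, fill_value=0)
print(aligned)
```

This only gives sensible results when the column names on both sides follow the same naming scheme, so it is worth comparing the two column lists before relying on it.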

In [None]:
X_train.shape,df_test.shape

In [None]:
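# Use the training feature names and column order before predicting
df_test = df_test.rename(columns={'Date_of_Journey_Day': 'Date_of_Journey_Date'})
test_predictions = rfr_tuned.predict(df_test[X_train.columns])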

In [None]:
test_predictions = test_predictions.astype(int)

In [None]:
test_predictions_df = pd.DataFrame({'Price': test_predictions})
test_predictions_df

In [None]:
test_predictions_df.to_csv('Test_Set_Submissions.csv',index=False)

### VOTE

* Give the notebook an upvote 🙌 if you found it useful.

### CONNECT WITH ME

[LinkedIN](https://www.linkedin.com/in/abhayparashar31/) | [Medium](https://medium.com/@abhayparashar31) | [Twitter](https://twitter.com/abhayparashar31) | [Github](https://github.com/Abhayparashar31)

#### HOPE TO SEE YOU IN MY NEXT KAGGLE NOTEBOOK 😀