<a href="https://colab.research.google.com/github/aayusharma01/ML-Project/blob/main/Classification_Airline_Passenger_Referral_Prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - Airline Passenger Referral Prediction



##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Member Name**    - Aayush Sharma

# **Project Summary -**


The data comprises airline reviews from 2006 to 2019 for popular airlines around the world, with both multiple-choice and free-text questions. It was scraped in Spring 2019. The main objective is to predict whether passengers will refer the airline to their friends. This project addresses that objective by combining data analysis, visualization, feature engineering, ensemble techniques, and machine learning algorithms with parameter tuning to give a more precise and consistent prediction of customer referrals.

# **GitHub Link -**

https://github.com/aayusharma01/ML-Project

# **Problem Statement**



The primary objective of this project is to build a machine learning model that can accurately predict whether passengers are likely to recommend the airline to their friends.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many charts as you want. Make sure that, for each and every chart, the following format is answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 15 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following the "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





6. You may add more ml algorithms for model creation. Make sure for each and every algorithm, the following format should be answered.


*   Explain the ML Model used and its performance using Evaluation metric Score Chart.


*   Cross- Validation & Hyperparameter Tuning

*   Have you seen any improvement? Note down the improvement with the updated Evaluation metric Score Chart.

*   Explain each evaluation metric's indication towards business and the business impact of the ML model used.




















# ***Let's Begin !***

## ***1. Know Your Data***



In [None]:
dataset_loc = '/content/data_airline_reviews.xlsx'

### Import Libraries

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline

import lightgbm

import warnings
warnings.filterwarnings('ignore')


In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC


from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
import time
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.naive_bayes import MultinomialNB

In [None]:
# Importing  metrics for evaluation for our models
from sklearn import metrics
from sklearn.metrics import classification_report,confusion_matrix
from sklearn.metrics import accuracy_score,precision_score
from sklearn.metrics import recall_score,f1_score,roc_curve, roc_auc_score

### Dataset Loading

In [None]:
from google.colab import drive
drive.mount('/content/drive')

### Dataset First View

In [None]:
df = pd.read_excel(dataset_loc)

In [None]:
first_view = df.head()
first_view

In [None]:
df.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
rows_col_count = df.shape
rows_col_count

### Dataset Information

In [None]:
# Dataset Info (df.info() prints its summary and returns None, so nothing is assigned)
df.info()

In [None]:
#copy the dataframe so that original dataframe won't be affected
df1 =df.copy()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count

duplicate_counts = df1.duplicated().sum()

# Display the count of duplicates
print(f"Number of duplicates:{duplicate_counts}")

In [None]:
#Dropping Duplicate Values
df1.drop_duplicates(inplace = True)

In [None]:
#Count of remaining rows and Columns
df1.shape

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
missi_value = df1.isnull().sum().sort_values(ascending=False)[:10]
missi_value

In [None]:
# Visualizing the missing values
# Plot a bar chart
plt.figure(figsize=(10, 6),facecolor='skyblue')
missi_value.plot(kind='bar', color='pink')

# Set labels and title
plt.title("Missing Values")
plt.xlabel("Columns")
plt.ylabel("Missing Value Count")

### What did you know about your dataset?

The airline passenger referral dataset contains, among other fields, the aircraft name, the complete passenger review, the date flown, the type of seat (cabin), and ratings such as cabin service and seat comfort. After a brief data analysis, several machine learning algorithms are applied to it to give airline owners a prediction of whether passengers will recommend a particular airline to their friends.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
df1.columns

In [None]:
# Dataset Describe
df1.info()

In [None]:
df1.describe().T

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
for column in df1.columns:
  # Collect and print the distinct values of the current column
  uniq_valu = df1[column].unique()
  print(f"Unique values for {column}:")
  print(uniq_valu)

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#Dropping missing values
df1 = df1.dropna(subset=['cabin_service','recommended','ground_service','overall','value_for_money','seat_comfort'])

#Dropping unnecessary columns that are of no use in data Visualization as well as ML model
df1 =df1.drop(['author','review_date','route','date_flown','customer_review','aircraft'],axis = 1)
#Splitting the Numeric column
low_null = ['overall','seat_comfort','cabin_service','value_for_money']
high_null = ['food_bev','entertainment',]


#Imputation technique using Quantile-1 value
def impute_by_q1_values(df1,column):
  # Work on the frame passed in (df1), not the original df, so the cleaned copy is the one imputed
  Q1=np.percentile(df1[column].dropna(),25)
  df1[column] = df1[column].fillna(Q1)


#Looping the null value column
for col in low_null:
  impute_by_q1_values(df1,col)


#Imputation technique using Median Imputation
def median_imputation(df1,column):
  # Assign the result back instead of calling fillna(inplace=True) on a column selection
  df1[column] = df1[column].fillna(df1[column].median())

#Looping the null value column
for col in high_null:
  median_imputation(df1,col)

#Mode to be used in place of null values in the cabin column
df1['cabin'] = df1['cabin'].fillna(df1['cabin'].mode().values[0])



In [None]:
df1.head(1)

In [None]:
# Now no Nan value is present in any of the columns

In [None]:
#Now to check how many missing values are in each column
missi_value = df1.isnull().sum().sort_values(ascending=False)
missi_value
#Our Data is now well maintained and clean ready for visualization and ML algorithms

### What all manipulations have you done and insights you found?

*   First, rows with missing values in the key columns used for visualization and the ML model (cabin_service, recommended, ground_service, overall, value_for_money, seat_comfort) are removed.
*   Columns that are of no use for visualization or the ML model (author, review_date, route, date_flown, customer_review, aircraft) are dropped.
*   The numeric columns that need imputation are split into two lists, low_null and high_null.
*   The function impute_by_q1_values fills missing values with the Q1 value (the 25th percentile) and is applied in a loop to each column in low_null.
*   The function median_imputation fills missing values with the column median and is applied in a loop to each column in high_null.
*   Finally, missing values in the cabin column are replaced with the mode.
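The two numeric imputation strategies described above can be compared on a small example. This is a minimal sketch on a toy Series with made-up values (not taken from the airline data); the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy ratings column with gaps (hypothetical values, not from the dataset)
ratings = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 3.0])

# First quartile (25th percentile) and median of the observed values
q1 = ratings.quantile(0.25)
median = ratings.median()

# Assign the filled result to a new name rather than filling in place
filled_q1 = ratings.fillna(q1)
filled_median = ratings.fillna(median)

print(f"Q1 = {q1}, median = {median}")
print(filled_q1.tolist())
print(filled_median.tolist())
```

Q1 imputation pulls the filled values toward the lower end of the scale, while the median keeps them at the centre, so the choice affects the column mean after imputation.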

## ***4. Data Vizualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### Chart - 1

Which type of Cabin has more value for money?

In [None]:
# Chart - 1 visualization code
#setting the figure size and plotting the graph
plt.figure(figsize=(8, 6))
sns.countplot(data=df1, x='cabin', hue='value_for_money')


##### 1. Why did you pick the specific chart?

Airlines need to know which type of cabin is worth investing in, according to passenger reviews.

##### 2. What is/are the insight(s) found from the chart?

In Economy class, passengers give both the highest and the lowest ratings most often, whereas passengers travelling in Business class generally believe it is worth the money.

#### Chart - 2
Which airline has the most recommendations?

In [None]:
# Chart - 2 visualization code
# Count the reviews per airline (every review is counted, not only those with recommended = yes).
recommendation_airlines =df1['airline'].value_counts()

plt.figure(figsize=(10,5),facecolor="skyblue")
recommendation_airlines[:10].plot(kind='bar',color = 'pink')
plt.xlabel('Airline',fontsize=12)
plt.ylabel('Number of reviews',fontsize=12)
plt.title('Top 10 airlines by number of reviews',fontsize=15)
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()


##### 1. Why did you pick the specific chart?

A bar chart makes it easy to compare the top 10 airlines side by side.

##### 2. What is/are the insight(s) found from the chart?

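Spirit Airlines appears most often, followed by British Airways and then China Southern Airlines. Note that the chart counts all reviews for each airline, so it reflects review volume rather than the share of positive recommendations.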

#### Chart - 3

Average rating given by passenger across all types of cabin

In [None]:
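# Chart - 3 visualization code
# Keep the aggregate in its own variable so df1 still holds the row-level data for later charts
cabin_avg = df1.groupby('cabin')[['food_bev','entertainment']].mean().reset_index()
cabin_avg


In [None]:
# DataFrame.plot creates its own figure, so the size is passed through figsize
cabin_avg.plot(x="cabin", y=["food_bev", "entertainment"], kind="bar", color=["skyblue", "pink"], figsize=(10,7))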

##### 1. Why did you pick the specific chart?

This graph shows the average food and beverage and entertainment ratings given by passengers in each type of cabin. It helps airlines decide where to invest and helps fellow passengers choose their cabin for future journeys.

##### 2. What is/are the insight(s) found from the chart?

Business class and First class passengers give higher ratings for food and beverages and for entertainment, while Economy passengers give lower ratings.
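The per-cabin averages behind this chart come from a group-by followed by a mean. The sketch below uses a toy frame with hypothetical values; only the column names mirror the ones used in the notebook. It stores the aggregate under a separate name so the row-level frame remains available for later charts.

```python
import pandas as pd

# Toy frame with hypothetical values; column names mirror the notebook's
reviews = pd.DataFrame({
    "cabin": ["Economy Class", "Economy Class", "Business Class", "Business Class"],
    "food_bev": [2.0, 3.0, 4.0, 5.0],
    "entertainment": [1.0, 3.0, 4.0, 4.0],
})

# Store the aggregate under a new name so the row-level frame is not overwritten
cabin_avg = reviews.groupby("cabin")[["food_bev", "entertainment"]].mean().reset_index()

print(cabin_avg)
print(reviews.shape)  # the original rows are untouched
```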

#### Chart - 4
Types of travellers across different Cabins.

In [None]:
# Chart - 4 visualization code
plt.figure(figsize=(8, 6))
sns.countplot(x='traveller_type', hue='cabin', data=df1)
plt.show()

##### 1. Why did you pick the specific chart?

The chart clearly shows the distribution of traveller types across the different cabins.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that Business travellers use the Business class cabin more than any other traveller type, while Solo Leisure travellers pick the Economy cabin most often among all traveller types.

#### Chart - 5
Which airlines have made the most trips?

In [None]:
# Chart - 5 visualization code
airlines_trips =df1['airline'].value_counts()

plt.figure(figsize=(20,5),facecolor = 'skyblue')
airlines_trips[:10].plot(kind='bar',color = 'pink')
plt.xlabel('Airline Type',fontsize=12)
plt.ylabel('Counts',fontsize=12)
plt.title('Highest trip by Airline ',fontsize=15)
plt.xticks(rotation=45)
plt.show()

##### 1. Why did you pick the specific chart?

To check the market demand of top 10 Airlines that are in competition

##### 2. What is/are the insight(s) found from the chart?

Spirit Airlines made the most trips (measured by the number of reviews), followed by British Airways and China Southern Airlines.

#### Chart - 6
Which cabin type has the higher service rating?

In [None]:
# Chart - 6 visualization code
sns.set_style("whitegrid")

# Define a custom color palette
custom_palette = ["lightblue", "lightgreen", "lightcoral"]

plt.figure(figsize=(10, 5))
# Take every column from df1 so the rows line up (the original mixed df and df1)
sns.boxplot(data=df1, x='cabin', y='cabin_service', hue='recommended', palette=custom_palette)
plt.show()


##### 1. Why did you pick the specific chart?

I want to show the relationship between the cabin service rating and recommendation across the different cabin types.

##### 2. What is/are the insight(s) found from the chart?

When cabin service is given the full rating of 5 out of 5, a recommendation is most likely.

First class passengers are the least likely to give a recommendation.




#### Chart - 7

Which airline has the highest cabin service rating?

In [None]:
# Chart - 7 visualization code
airline_mean_ratings = df1.groupby('airline')['cabin_service'].mean().reset_index()

# Select the top 10 airlines based on the mean rating
top_10_airlines = airline_mean_ratings.nlargest(10, 'cabin_service')

top_10_airlines

In [None]:
plt.figure(figsize=(10, 5))
sns.set(style="whitegrid")
sns.pointplot(x="airline", y="cabin_service", data=top_10_airlines, ci=None, join=False)

plt.xticks(rotation=45)
plt.title("Mean of Cabin Service Ratings by Airline (Top 10)")
plt.xlabel("Airline")
plt.ylabel("Mean of Cabin Service Ratings")
plt.tight_layout()

plt.show()

##### 1. Why did you pick the specific chart?

To identify the airlines with the highest average service rating, irrespective of cabin type.

##### 2. What is/are the insight(s) found from the chart?



From the point plot, Garuda Indonesia has an average rating close to 4.6, which is the highest, followed by ANA All Nippon Airways.

#### Chart - 8

Which Airline is most value for money.

In [None]:
# Chart - 8 visualization code
airline_mean_ratings = df1.groupby('airline')['value_for_money'].mean().reset_index()
top_10_airlines = airline_mean_ratings.nlargest(15, 'value_for_money')
top_10_airlines

In [None]:
plt.figure(figsize=(10, 5))
sns.set(style="darkgrid")  # Set the background style
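# A violin needs a distribution, so plot the row-level ratings of the selected airlines
# rather than the single mean per airline held in top_10_airlines
top_airline_rows = df1[df1['airline'].isin(top_10_airlines['airline'])]
sns.violinplot(data=top_airline_rows, x="airline", y="value_for_money", palette="Set3")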

plt.xticks(rotation=90)
plt.title("Distribution of Value for Money Ratings by top 15 Airline")
plt.xlabel("Airline")
plt.ylabel("Value for Money Rating")
plt.tight_layout()


plt.show()

##### 1. Why did you pick the specific chart?

To find out which airline offers high value for money.

##### 2. What is/are the insight(s) found from the chart?

The chart shows that EVA Air offers the highest value for money, with an average rating of 4.4, followed by China Southern Airlines.

#### Chart - 9
Type of travellers doing most trips.

In [None]:
# Chart - 9 visualization code
plt.figure(figsize=(10, 5))
sns.set(style="whitegrid")  # Set the background style

# Create a count plot for the categorical 'traveller_type' column (a KDE curve is not meaningful for categories)
sns.countplot(data=df1, x="traveller_type", color="blue")

plt.title("Distribution of Traveller Types")
plt.xlabel("Traveller Type")
plt.ylabel("Frequency")
plt.xticks(rotation=45)
plt.tight_layout()

plt.show()


##### 1. Why did you pick the specific chart?

To find out which type of travellers made highest trips

##### 2. What is/are the insight(s) found from the chart?

Solo trips are clearly the most common, which suggests that people are likely to prefer travelling solo.

#### Chart - 10 - Correlation Heatmap

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(10,5))
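# numeric_only=True restricts the correlation to numeric columns (text columns would raise an error in recent pandas)
sns.heatmap(df1.corr(numeric_only=True), annot=True)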

## ***6. Feature Engineering & Data Pre-processing***

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation
# Forward-fill the missing traveller types (fillna(method="ffill") is deprecated in recent pandas)
df1['traveller_type'] = df1['traveller_type'].ffill()


In [None]:
missi_value = df1.isnull().sum().sort_values(ascending=False)
missi_value

### 3. Categorical Encoding

In [None]:
# Encode your categorical columns
#converting recommended column
# Map yes/no to 1/0 and assign the result back to the column
df1['recommended'] = df1['recommended'].map({'yes':1,'no':0})

In [None]:
df1.head(5)

#### What all categorical encoding techniques have you used & why did you use those techniques?

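Since recommended is our target column, its yes/no labels are converted to 1/0 so that ML algorithms can be applied. The remaining categorical features are one-hot encoded later with pd.get_dummies.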
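The two encodings this notebook relies on, a binary mapping for the target and one-hot encoding with `pd.get_dummies` for the remaining categorical features, can be illustrated on a toy frame. The values below are hypothetical and chosen only to show the shape of the result.

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({
    "recommended": ["yes", "no", "yes"],
    "cabin": ["Economy Class", "Business Class", "Economy Class"],
})

# Binary target: explicit yes/no -> 1/0 mapping, assigned back to the column
toy["recommended"] = toy["recommended"].map({"yes": 1, "no": 0})

# Nominal feature: one indicator column per category
encoded = pd.get_dummies(toy, columns=["cabin"])

print(encoded)
```

A binary mapping keeps the target in a single column, while one-hot encoding avoids implying an order between cabin categories.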

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

In [None]:
# Manipulate Features to minimize feature correlation and create new features
#Correlation plot
plt.figure(figsize=(10,7))
# numeric_only=True restricts the correlation to numeric columns
sns.heatmap(df1.corr(numeric_only=True), annot=True)


In [None]:
#Removing Multicollinearity features

def mult_col(X):

   # Calculating VIF
   vif = pd.DataFrame()
   vif["variables"] = X.columns
   vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

   return(vif)

In [None]:
mult_col(df1[[i for i in df1.describe().columns if i not in ['recommended','value_for_money']]])

In [None]:

#Dropping Airline column as it has no further use
df1.drop(["airline"], axis = 1, inplace = True)


In [None]:
#Drop overall column as it has highest correlation value than others.
df1.drop(["overall"], axis = 1, inplace = True)


##### What all feature selection methods have you used  and why?

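A correlation heatmap and the variance inflation factor (VIF) were used to check for multicollinearity among the numeric features. The airline column was dropped because it has no further use, and the overall column was dropped because it had the highest correlation with the other features.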
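The VIF check in the cells above measures how well each feature can be predicted from the others: VIF = 1 / (1 - R²), where R² comes from regressing that feature on the remaining ones. The sketch below computes this directly with NumPy on synthetic data. The helper `vif_scores` is hypothetical, and it adds an intercept itself, so its numbers need not match the statsmodels call used in the notebook.

```python
import numpy as np

def vif_scores(X):
    """Return the variance inflation factor of each column of a 2-D array."""
    X = np.asarray(X, dtype=float)
    scores = []
    for i in range(X.shape[1]):
        target = X[:, i]
        # Regress column i on the remaining columns plus an intercept
        others = np.delete(X, i, axis=1)
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        scores.append(1.0 / (1.0 - r_squared))
    return scores

# Synthetic features: c is nearly a copy of a, b is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.1 * rng.normal(size=200)

scores = vif_scores(np.column_stack([a, b, c]))
print(scores)
```

Columns that are close to copies of each other get a large VIF, while an independent column stays near 1, which is the pattern used to decide which feature to drop.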

##### Which all features you found important and why?

Answer Here.

### 5. Data Transformation

### 6. Data Scaling

In [None]:
#Defining the dependent and independent variables

#separating the dependent and independent variables
y = df1['recommended']
x = df1.drop(columns = 'recommended')

x.columns

In [None]:
x = pd.get_dummies(x)

x.shape

In [None]:
x.head(3)

In [None]:

print("The Percentage of No labels of Target Variable is",np.round(y.value_counts()[0]/len(y)*100))
print("The Percentage of Yes labels of Target Variable is",np.round(y.value_counts()[1]/len(y)*100))

### 7. Dimensionality Reduction

##### Do you think that dimensionality reduction is needed? Explain Why?

Answer Here.

In [None]:
# Dimensionality Reduction (If needed)

##### Which dimensionality reduction technique have you used and why? (If dimensionality reduction done on dataset.)

Answer Here.

### 8. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.

x_train, x_test, y_train, y_test = train_test_split( x,y , test_size = 0.2, random_state = 42)
print(x_train.shape)
print(x_test.shape)



In [None]:
print(y_train.shape)
print(y_test.shape)

### 9. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

The percentages of both labels ('yes', 'no') are approximately equal, so no class-imbalance handling technique is needed.
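A quick way to back a balance statement like this with numbers is to look at the class shares directly. The sketch below uses a hypothetical encoded target, not the notebook's `y`, to show the pattern.

```python
import pandas as pd

# Hypothetical encoded target (1 = recommended, 0 = not recommended)
y_toy = pd.Series([1, 0, 0, 1, 0, 1, 0, 0])

# normalize=True returns shares instead of counts
shares = y_toy.value_counts(normalize=True).mul(100).round(1)
print(shares)
```

If the shares were far apart, options such as a stratified split or resampling would be worth considering; with roughly equal shares, plain accuracy remains a reasonable headline metric.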

## ***7. ML Model Implementation***

### ML Model - 1

**Logistic Regression**

In [None]:
# ML Model - 1 Implementation

# Fit the Algorithm
#logistic regression fitting
log_reg = LogisticRegression(fit_intercept=True, max_iter=10000)
log_reg.fit(x_train, y_train)
# Predict on the model

In [None]:
log_reg.coef_

In [None]:
log_reg.intercept_

In [None]:

log_reg.score(x_test,y_test)

In [None]:
#94% accuracy with Logistic Regression

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
y_pred = log_reg.predict(x_test)
#report of logistic regression
report_lR = classification_report(y_test, y_pred)
print(report_lR)


In [None]:
#confusion matrix of logistic regression
confuse_matrix_lr = confusion_matrix( y_test,y_pred)
#plotting confusion matrix (fmt="d" shows the cell counts as integers)
sns.heatmap(confuse_matrix_lr, annot=True, fmt = "d")

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 1 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Evaluate the fitted logistic regression settings with 10-fold cross-validation on the training data
from sklearn.model_selection import cross_val_score

scores = cross_val_score(log_reg, x_train, y_train, cv=10)
print('Cross-Validation Accuracy Scores', scores)

In [None]:

scores = pd.Series(scores)
scores.min(), scores.mean(), scores.max()

### ML Model - 2



**Decision Tree**

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
#Initializing Decision Tree Model object
tree_classify=DecisionTreeClassifier()
#Training a model with x and y
tree_classify.fit(x_train,y_train)

In [None]:
print("Training Accuracy of Decision Tree Model is",tree_classify.score(x_train,y_train))
print("Testing Accuracy of Decision Tree Model is",tree_classify.score(x_test,y_test))

In [None]:

y_pred = tree_classify.predict(x_test)


#report of decision tree
report_dec_tree = classification_report(y_test, y_pred)
print(report_dec_tree)


#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
#setting the parameters and scoring metric
parameters = {"criterion":["gini","entropy"],"max_depth":[5,7],"min_samples_split":[5,7],"min_samples_leaf":[2,3]}
scoring_=['f1','recall','precision','accuracy']


#performing hyperparameter tuning using gridsearchcv

#setting an estimator,and crossvalidation
tree_cv = GridSearchCV(estimator=tree_classify, param_grid=parameters, scoring=scoring_, cv=5,refit='accuracy')

#Fitting x and y to gridsearchcv model using an estimator Decision tree classifier
tree_cv.fit(x_train, y_train)

In [None]:
#calling an best params
tree_cv.best_params_


In [None]:
#calling our best score
tree_cv.best_score_

In [None]:
#93% accuracy of Decision Tree with the help of hyperparameter tuning.

##### Which hyperparameter optimization technique have you used and why?

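We used GridSearchCV. It evaluates every combination of values in the parameter grid with 5-fold cross-validation and keeps the combination with the best score, which suits the small grid used here.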
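To make the mechanics of grid search concrete, the sketch below runs it on synthetic data generated with `make_classification`. The data, the reduced grid, and the variable names are assumptions for illustration and do not reproduce the notebook's results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the review features (not the airline data)
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=42)

# A reduced grid: 2 x 2 = 4 candidate settings
param_grid = {"max_depth": [3, 5], "min_samples_leaf": [2, 3]}

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=42),
    param_grid=param_grid,
    scoring="accuracy",
    cv=5,
)
search.fit(X_toy, y_toy)

# Each of the 4 candidates is scored on 5 folds; the best one is refit on all data
print(len(search.cv_results_["params"]))
print(search.best_params_, round(search.best_score_, 3))
```

The cost grows with the product of the grid sizes times the number of folds, which is why a small grid is practical here while a larger search space might call for a randomized search instead.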

### ML Model - 3

**Random Forest**

In [None]:
# ML Model - 3 Implementation

# Fit the Algorithm
random_forest = RandomForestClassifier()
random_forest.fit(x_train,y_train)


In [None]:
random_forest.score(x_test,y_test)

In [None]:
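#report of random forest (predict with the forest so the report does not reuse the decision tree predictions)
y_pred = random_forest.predict(x_test)
report_ran_forest = classification_report(y_test, y_pred)
print(report_ran_forest)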

In [None]:
#92% accuracy with Random Forest

#### 2. Cross- Validation & Hyperparameter Tuning

In [None]:
# ML Model - 3 Implementation with hyperparameter optimization techniques (i.e., GridSearch CV, RandomSearch CV, Bayesian Optimization etc.)

# Fit the Algorithm
random_forest_gridcv = GridSearchCV(estimator=random_forest,
                       param_grid = parameters,
                       cv = 5, verbose=2)


random_forest_gridcv.fit(x_train,y_train)
# Predict on the model

In [None]:
random_forest_gridcv.best_params_

### ML Model - 4


**Support Vector Machine**

In [None]:
support_vector = SVC(kernel='linear')
support_vector.fit(x_train, y_train)

In [None]:
#score for support vector machine
support_vector.score(x_test, y_test)

In [None]:
#94% accuracy with support vector machine

In [None]:
y_pred = support_vector.predict(x_test)


#confusion matrix
support_vector_con_mat = confusion_matrix( y_test,y_pred)
support_vector_con_mat

In [None]:
#Creating a function to return all Models Accuracy Score

def accuracy_of_each_model(model,X_train,X_test):

  #predicting a train datas
  y_train_preds=model.predict(X_train)

  #predicting a test datas
  y_test_preds=model.predict(X_test)

  #storing all training scores
  train_scores=[]

  #storing all test scores
  test_scores=[]
  metrics=['Accuracy_Score','Precision_Score','Recall_Score','Roc_Auc_Score']

  # Get the accuracy scores
  train_accuracy_score = accuracy_score(y_train,y_train_preds)
  test_accuracy_score = accuracy_score(y_test,y_test_preds)

  train_scores.append(train_accuracy_score)
  test_scores.append(test_accuracy_score)

  # Get the precision scores
  train_precision_score = precision_score(y_train,y_train_preds)
  test_precision_score = precision_score(y_test,y_test_preds)

  train_scores.append(train_precision_score)
  test_scores.append(test_precision_score)

  # Get the recall scores
  train_recall_score =recall_score(y_train,y_train_preds)
  test_recall_score =recall_score(y_test,y_test_preds)

  train_scores.append(train_recall_score)
  test_scores.append(test_recall_score)

  # Get the roc_auc scores
  train_roc_auc_score=roc_auc_score(y_train,y_train_preds)
  test_roc_auc_score =roc_auc_score(y_test,y_test_preds)

  train_scores.append(train_roc_auc_score)
  test_scores.append(test_roc_auc_score)

  return train_scores,test_scores,metrics




# Use the grid-search objects for the models labelled "After Hyperparameter Tuning"
models=[log_reg,tree_cv,random_forest_gridcv,support_vector]
name=['Logistic Regression Model','Decision Tree Model After Hyperparameter Tuning','Random Forest Model After Hyperparameter Tuning','Support Vector Machine',]


for model_ in range(len(models)):
  train_score_,test_score_,metrics_=accuracy_of_each_model(models[model_],x_train,x_test)
  print("-*-*-"*3+f"{name[model_]}"+"-*-*-"*4)
  print("")
  print(pd.DataFrame(data={'Metrics':metrics_,'Train_Score':train_score_,'Test_Score':test_score_}))
  print("")

### 2. Which ML model did you choose from the above created models as your final prediction model and why?

We used four ML models to predict whether a passenger will recommend an airline to friends, colleagues, and family. Based on the accuracy scores, Logistic Regression gives the highest prediction accuracy among them.
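Accuracy is only one of the scores reported by the comparison loop above. As a reminder of how accuracy, precision, and recall relate to the confusion matrix, here is a small sketch with hypothetical counts; the helper `scores_from_counts` is illustrative and not part of the notebook.

```python
def scores_from_counts(tp, fp, fn, tn):
    """Derive accuracy, precision and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all predictions that are correct
        "precision": tp / (tp + fp),     # share of predicted "yes" that are truly "yes"
        "recall": tp / (tp + fn),        # share of true "yes" that the model finds
    }

# Hypothetical counts for illustration only
result = scores_from_counts(tp=40, fp=10, fn=5, tn=45)
print(result)
```

Two models with the same accuracy can differ in recall, which matters when missing a likely recommender is more costly than a false alarm.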

In [None]:
feature_importance = random_forest.feature_importances_

#examine the feature importance values

for feature_name, importance in zip(x_train.columns, feature_importance):
    print(f"{feature_name}: {importance}")

### 3. Explain the model which you have used and the feature importance using any model explainability tool?

These importance scores provide insight into which features had the most impact on the model's predictions. **value_for_money** and **ground_service** are the most influential features, with importance scores of 0.3010 and 0.2403, respectively. Features like **seat_comfort** and **cabin_service** also have significant importance.
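Importance scores are easier to read when paired with their column names and sorted. The sketch below does this for a random forest trained on synthetic data; the feature names and the data are hypothetical stand-ins, so the printed values are unrelated to the scores quoted above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (not the airline features)
X_arr, y_toy = make_classification(
    n_samples=300, n_features=5, n_informative=2, random_state=0
)
X_toy = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(5)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_toy, y_toy)

# Pair each importance with its column name and sort from most to least important
ranked = pd.Series(forest.feature_importances_, index=X_toy.columns).sort_values(ascending=False)
print(ranked)
```

Impurity-based importances sum to 1 across features, so each value can be read as a share of the forest's total split gain.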

# **Conclusion**

I used four models for this classification problem:

*   Logistic Regression
*   Decision Tree
*   Random Forest
*   Support Vector Machine

I performed hyperparameter tuning with GridSearchCV for the Decision Tree and Random Forest models to increase accuracy and reduce overfitting.

Based on knowledge of the business and the problem use case, Recall is given first priority among the classification metrics, Accuracy second priority, and ROC AUC third priority.

All four classifiers give a test accuracy of roughly 92% or more: about 94% for Logistic Regression and the Support Vector Machine, about 93% for the tuned Decision Tree, and about 92% for Random Forest.

Comparing the evaluation metrics, Logistic Regression and the Support Vector Machine work best among the experimented models for the given dataset, by a very small margin.

According to the Random Forest feature importances, the most important features are value for money and ground service, which contribute most to the prediction of whether a passenger will recommend a particular airline to friends.

The classifier models developed can be used to predict passenger referrals, giving airlines the ability to identify influential passengers who can help bring in more revenue.

As a result, in order to grow their business, our client must provide excellent cabin service, ground service, food and beverages, entertainment, and seat comfort.


### ***Hurrah! You have successfully completed your Machine Learning Capstone Project !!!***