<a href="https://colab.research.google.com/github/Vishita7/Machine-Learning/blob/main/Airline_Passenger_Satisfaction_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Project - Airline Passenger Satisfaction Project
####Created By Vishita Yadav

###Introduction
Customer satisfaction matters because it affects whether a customer is likely to reuse a service and to recommend it to others. Passengers are an airline's most important group of clients, since fares are one of its main sources of revenue. Airlines therefore look to improve customer satisfaction and business performance by identifying, reporting on, and planning around the factors that influence satisfaction levels. Analyzing passenger feedback helps an airline understand what passengers need and where it can improve. In this project we analyze which features contribute most to customer satisfaction with the airline. Understanding what drives satisfaction lets the business target improvements and support future growth.

The objectives of this data analysis are:

1. Perform EDA between passenger satisfaction and all other features
2. Select the best predictive models for predicting passengers' satisfaction
3. Evaluate the highly correlated factors through the best-performing models

Using an extensive passenger satisfaction dataset, we aim to evaluate the factors that affect overall satisfaction and to understand how predictable the overall level of customer satisfaction is from the key data. In the best model, the following features contributed the most to satisfaction levels:
1. Flight distance
2. Inflight Wi-Fi services
3. Leg room service
4. Inflight entertainment
5. Seat comfort






#Importing Modules

In [None]:
! pip install https://github.com/pandas-profiling/pandas-profiling/archive/master.zip
! pip install markupsafe==2.0.1
! pip show pandas-profiling


##Mounting the Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost as xgb

# Preprocessing and feature selection
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.feature_selection import SelectKBest, chi2

# Modules used in modeling
# Note: plot_roc_curve and plot_confusion_matrix were removed in scikit-learn 1.2;
# this notebook assumes an older release that still provides them.
from sklearn.metrics import (classification_report, confusion_matrix, accuracy_score,
                             f1_score, recall_score, precision_score, r2_score,
                             mean_squared_error, roc_auc_score,
                             plot_roc_curve, plot_confusion_matrix)

from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

In [None]:
def FitStatCalculation(ModelName, ModelVariable, y_test, y_pred, x_test):
  """Print fit statistics for a fitted classifier and plot its confusion matrix and ROC curve."""
  acc = accuracy_score(y_test, y_pred)
  mse = mean_squared_error(y_test, y_pred)
  print('Accuracy of '+ModelName+': %.4f' % acc)
  print('Misclassification Rate of '+ModelName+': %.4f' % (1-acc))
  print('R-square of '+ModelName+': %.4f' % r2_score(y_test, y_pred))
  print('Mean Square Error for '+ModelName+': %.4f' % mse)
  print('Root Mean Square Error for '+ModelName+': %.4f' % np.sqrt(mse))
  print(confusion_matrix(y_test, y_pred))
  print(classification_report(y_test, y_pred))
  # Both plots are drawn from the features passed in as x_test
  plot_confusion_matrix(ModelVariable, x_test, y_test, cmap=plt.cm.pink, normalize='all')
  plot_roc_curve(ModelVariable, x_test, y_test)
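
A note on the regression-style statistics printed by this helper: once satisfaction is label-encoded as 0 and 1 (done later in this notebook), every squared error is either 0 or 1, so the mean squared error coincides with the misclassification rate. The minimal sketch below illustrates this in plain Python; the labels `y_true` and `y_hat` are made up for illustration and are not taken from the airline data.

```python
# Illustrative labels only (hypothetical, not taken from the airline data)
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_hat = [0, 1, 0, 0, 1, 1, 1, 1]

n = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_hat)) / n
misclassification_rate = 1 - accuracy
# With 0/1 labels, (t - p) ** 2 is 1 for a wrong prediction and 0 otherwise
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / n

print(f"accuracy = {accuracy:.4f}")
print(f"misclassification rate = {misclassification_rate:.4f}")
print(f"mean squared error = {mse:.4f}")
```

Because the two numbers carry the same information for a binary target, accuracy together with the classification report is usually the more informative summary.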

##Loading the data into Train, test datasets

In [None]:
train_data = pd.read_csv("/content/drive/Shareddrives/Data Science Using Python/data/train.csv")
test_data = pd.read_csv("/content/drive/Shareddrives/Data Science Using Python/data/test.csv")
train_data.head()

| | Unnamed: 0 | id | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 70172 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | ... | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.0 | neutral or dissatisfied |
| 1 | 1 | 5047 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | ... | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.0 | neutral or dissatisfied |
| 2 | 2 | 110028 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | ... | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.0 | satisfied |
| 3 | 3 | 24026 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | ... | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.0 | neutral or dissatisfied |
| 4 | 4 | 119299 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | ... | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.0 | satisfied |


#Data Transformation

In [None]:
# Drop unnecessary columns
train_data.drop(['Unnamed: 0', 'id'], axis=1,inplace=True)
test_data.drop(['Unnamed: 0', 'id'], axis=1,inplace=True)
train_data.head()

| | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Male | Loyal Customer | 13 | Personal Travel | Eco Plus | 460 | 3 | 4 | 3 | 1 | ... | 5 | 4 | 3 | 4 | 4 | 5 | 5 | 25 | 18.0 | neutral or dissatisfied |
| 1 | Male | disloyal Customer | 25 | Business travel | Business | 235 | 3 | 2 | 3 | 3 | ... | 1 | 1 | 5 | 3 | 1 | 4 | 1 | 1 | 6.0 | neutral or dissatisfied |
| 2 | Female | Loyal Customer | 26 | Business travel | Business | 1142 | 2 | 2 | 2 | 2 | ... | 5 | 4 | 3 | 4 | 4 | 4 | 5 | 0 | 0.0 | satisfied |
| 3 | Female | Loyal Customer | 25 | Business travel | Business | 562 | 2 | 5 | 5 | 5 | ... | 2 | 2 | 5 | 3 | 1 | 4 | 2 | 11 | 9.0 | neutral or dissatisfied |
| 4 | Male | Loyal Customer | 61 | Business travel | Business | 214 | 3 | 3 | 3 | 3 | ... | 3 | 3 | 4 | 4 | 3 | 3 | 3 | 0 | 0.0 | satisfied |


##Checking for Null values

In [None]:
train_data.isnull().sum()

Gender                                 0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Inflight wifi service                  0
Departure/Arrival time convenient      0
Ease of Online booking                 0
Gate location                          0
Food and drink                         0
Online boarding                        0
Seat comfort                           0
Inflight entertainment                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Inflight service                       0
Cleanliness                            0
Departure Delay in Minutes             0
Arrival Delay in Minutes             310
satisfaction                           0
dtype: int64

##Mean value Imputation

In [None]:
# Impute missing arrival delays with the training-set mean, applied to both sets
arrival_delay_mean = train_data['Arrival Delay in Minutes'].mean()
train_data['Arrival Delay in Minutes'] = train_data['Arrival Delay in Minutes'].fillna(arrival_delay_mean)
test_data['Arrival Delay in Minutes'] = test_data['Arrival Delay in Minutes'].fillna(arrival_delay_mean)
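
To make the imputation step concrete, here is a minimal sketch on toy data. It shows one common variant in which the mean is computed on the training set only and reused for the test set, so no test-set statistic enters the preprocessing. The frames `toy_train` and `toy_test` and their values are hypothetical stand-ins, not rows from the airline data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for train_data and test_data
col = "Arrival Delay in Minutes"
toy_train = pd.DataFrame({col: [0.0, 10.0, np.nan, 20.0]})
toy_test = pd.DataFrame({col: [5.0, np.nan]})

# mean() skips NaN by default: (0 + 10 + 20) / 3 = 10.0
train_mean = toy_train[col].mean()

# Fill the gaps in both sets with the training-set mean
toy_train[col] = toy_train[col].fillna(train_mean)
toy_test[col] = toy_test[col].fillna(train_mean)

print(toy_train[col].tolist())
print(toy_test[col].tolist())
```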

In [None]:
train_data.isnull().sum()

##Outliers

In [None]:
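# Quartiles and interquartile range (IQR) of the numeric columns only
Q1 = train_data.quantile(.25, numeric_only=True)
Q3 = train_data.quantile(.75, numeric_only=True)
IQR = Q3-Q1
print(IQR)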

Age                                    24.0
Flight Distance                      1329.0
Inflight wifi service                   2.0
Departure/Arrival time convenient       2.0
Ease of Online booking                  2.0
Gate location                           2.0
Food and drink                          2.0
Online boarding                         2.0
Seat comfort                            3.0
Inflight entertainment                  2.0
On-board service                        2.0
Leg room service                        2.0
Baggage handling                        2.0
Checkin service                         1.0
Inflight service                        2.0
Cleanliness                             2.0
Departure Delay in Minutes             12.0
Arrival Delay in Minutes               13.0
dtype: float64


(103904, 23)

In [None]:
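# Removal of outliers from the dataset
# Flag a value as an outlier (True) if it lies outside the 1.5 * IQR fences
num_cols = Q1.index  # numeric columns only
drop = ((train_data[num_cols] < (Q1 - 1.5 * IQR))
         | (train_data[num_cols] > (Q3 + 1.5 * IQR)))
# Keep only the rows with no outlier in any numeric column
train_data = train_data[~drop.any(axis=1)]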
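
The 1.5 × IQR rule used here can be checked by hand on a single column. The sketch below applies the same fences to a small made-up series; the values are hypothetical and chosen so that one point is an obvious outlier.

```python
import pandas as pd

# Hypothetical values; 100 is an obvious outlier
s = pd.Series([1, 2, 3, 4, 5, 100])

q1 = s.quantile(0.25)   # 2.25 with pandas' default linear interpolation
q3 = s.quantile(0.75)   # 4.75
iqr = q3 - q1           # 2.5
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# A value is an outlier if it falls outside [lower, upper]
is_outlier = (s < lower) | (s > upper)
kept = s[~is_outlier]

print(lower, upper)
print(kept.tolist())
```

Note that when the rule is applied to every numeric column at once, a row is dropped if any single column is flagged, which is why the row count can fall substantially.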



In [None]:
train_data.describe()

| | Age | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | Ease of Online booking | Gate location | Food and drink | Online boarding | Seat comfort | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 | 74931.0 |
| mean | 39.469832 | 1147.115827 | 2.752399 | 3.099652 | 2.763516 | 2.963353 | 3.237792 | 3.313515 | 3.499446 | 3.401169 | 3.472021 | 3.39443 | 3.706623 | 3.63764 | 3.729831 | 3.341301 | 3.463413 | 3.534633 |
| std | 15.101783 | 929.339061 | 1.34682 | 1.537766 | 1.404885 | 1.284912 | 1.32046 | 1.354357 | 1.314857 | 1.330237 | 1.263877 | 1.303823 | 1.172029 | 0.98332 | 1.159913 | 1.300071 | 6.740086 | 6.766589 |
| min | 7.0 | 31.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 27.0 | 409.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 2.0 | 3.0 | 2.0 | 3.0 | 2.0 | 3.0 | 3.0 | 3.0 | 2.0 | 0.0 | 0.0 |
| 50% | 40.0 | 836.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 3.0 | 0.0 | 0.0 |
| 75% | 51.0 | 1709.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 5.0 | 5.0 | 4.0 | 4.0 | 5.0 | 4.0 | 5.0 | 4.0 | 4.0 | 4.0 |
| max | 85.0 | 3736.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 30.0 | 32.0 |


In [None]:
train_data.info()

In [None]:
%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

#Data Visualization

In [None]:
import pandas_profiling as pp
profile = pp.ProfileReport(train_data, title = "Exploratory Data Analysis")

In [None]:
profile.to_notebook_iframe()

In [None]:
train_data.head()

#Using LabelEncoder to encode categorical fields

In [None]:
# Segregating numerical and categorical columns
All_columns = train_data.columns
Numerical_Columns = train_data.select_dtypes(include='number').columns
# Keep the original column order for the categorical fields
Categorical_Columns = [c for c in All_columns if c not in Numerical_Columns]
Categorical_Columns

In [None]:
# Categorical columns' unique values before label encoding
for i in Categorical_Columns:
  print("train_data["+i+"]", train_data[i].unique())
  print("test_data["+i+"]", test_data[i].unique())

In [None]:
# Label encoding categorical variables for ease of modeling
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for i in Categorical_Columns:
  # Fit on the training column, then reuse the same mapping for the test column
  train_data[i] = le.fit_transform(train_data[i].values)
  test_data[i] = le.transform(test_data[i].values)
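
As a small illustration of what label encoding does, the sketch below encodes a toy categorical column. `LabelEncoder` assigns integer codes in sorted order of the labels, and learning the mapping on the training column and then reusing it guarantees that the same label gets the same code in both sets. The lists `train_col` and `test_col` are hypothetical examples.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy columns standing in for a categorical field
train_col = ["Eco", "Business", "Eco Plus", "Business"]
test_col = ["Eco Plus", "Eco"]

le = LabelEncoder()
train_codes = le.fit_transform(train_col)  # learn the mapping from the training data
test_codes = le.transform(test_col)        # reuse the same mapping

# classes_ holds the labels in sorted order; the position is the integer code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
print(train_codes.tolist(), test_codes.tolist())
```

The integer codes imply an ordering (here Business < Eco < Eco Plus) that is only an artifact of alphabetical sorting, which is worth keeping in mind for distance-based or linear models.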

In [None]:
# Categorical columns' unique values after label encoding
for i in Categorical_Columns:
  print("train_data["+i+"]", train_data[i].unique())
  print("test_data["+i+"]", test_data[i].unique())
# train_data.head()
# test_data.head()

In [None]:
#Saving the Cleaned File for reference
train_data.to_csv("/content/drive/Shareddrives/Data Science Using Python/data/airline_cleaned_data.csv")

##Correlation Matrix

In [None]:
# Correlation between the first 22 columns (the features, excluding satisfaction)
df = train_data.iloc[:,:22].copy()
corr = df.corr()
plt.figure(figsize=(20,11))
sns.heatmap(corr, cmap="Greens", annot=True)
plt.show()
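
The heatmap shows how the features relate to one another. If the aim is also to see which features move with the target, one option is to include the encoded satisfaction column and rank the features by absolute correlation with it. The sketch below does this on a made-up frame; the column values are hypothetical and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset
toy = pd.DataFrame({
    "Seat comfort": [1, 2, 3, 4, 5, 5],
    "Gate location": [3, 1, 4, 2, 5, 1],
    "satisfaction": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation of each feature with the target, target row removed
corr_with_target = toy.corr()["satisfaction"].drop("satisfaction")

# Rank by strength of association regardless of sign
ranked = corr_with_target.abs().sort_values(ascending=False)
print(ranked)
```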

#Feature selection using Chi-Square Test

In [None]:
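# The first 22 columns are the features; satisfaction is the target
x_train = train_data.iloc[:,:22]
y_train = train_data['satisfaction']  # 1-D target, as scikit-learn estimators expect
x_test = test_data.iloc[:,:22]
y_test = test_data['satisfaction']
x_train.head()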

In [None]:
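# Scale the features using the training-set minimum and maximum
mms = MinMaxScaler()
x_train_std = pd.DataFrame(mms.fit_transform(x_train), columns=x_train.columns)
# Apply the scaler fitted on the training data; do not refit on the test data
x_test_std = pd.DataFrame(mms.transform(x_test), columns=x_test.columns)
x_train_std.head()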
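
Min-max scaling maps each feature with x' = (x - min) / (max - min). The sketch below spells this out with NumPy for a single made-up feature, using bounds learned from the training values only; the arrays and the helper `min_max_scale` are hypothetical and written for illustration.

```python
import numpy as np

# Hypothetical single feature (for example, a distance column)
train_feature = np.array([100.0, 300.0, 500.0])
test_feature = np.array([200.0, 700.0])

# Bounds are learned from the training values only
lo, hi = train_feature.min(), train_feature.max()

def min_max_scale(x, lo, hi):
    """Apply x' = (x - lo) / (hi - lo) using previously learned bounds."""
    return (x - lo) / (hi - lo)

train_scaled = min_max_scale(train_feature, lo, hi)
test_scaled = min_max_scale(test_feature, lo, hi)

print(train_scaled)
print(test_scaled)
```

Test values that lie beyond the training range end up outside [0, 1], as with the second test value here. That is expected when the scaler is fitted on the training data alone.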

In [None]:
# Keep the 10 features with the highest chi-square score against the target
selector = SelectKBest(chi2, k=10)
selector.fit(x_train_std, y_train)
BestColumns = x_train_std.columns[selector.get_support(indices=True)].tolist()
x_train_std = x_train_std[BestColumns]
x_test_std = x_test_std[BestColumns]
print("List of highly effective columns", BestColumns)
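
To show how `SelectKBest` with `chi2` ranks features, the sketch below scores two made-up non-negative features against a binary target: one that lines up with the target and one that is constant. The names `signal` and `constant` and the data are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical features: "signal" follows the target, "constant" carries no information
feature_names = ["signal", "constant"]
X = np.array([[1.0, 0.5],
              [1.0, 0.5],
              [0.0, 0.5],
              [0.0, 0.5]])
y = np.array([1, 1, 0, 0])

# chi2 requires non-negative feature values, which min-max scaling provides
selector = SelectKBest(chi2, k=1).fit(X, y)
selected = [feature_names[i] for i in selector.get_support(indices=True)]

print(dict(zip(feature_names, selector.scores_)))
print(selected)
```

The constant feature receives a score of zero because its per-class totals match what independence would predict, so only the informative feature is kept.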

#Modeling

##Decision Tree

In [None]:
# Decision tree
Dtree = DecisionTreeClassifier(criterion='entropy', random_state=1)
Dtree.fit(x_train_std, y_train)
y_pred = Dtree.predict(x_test_std)
# Fit Statistics for DecisionTree
FitStatCalculation('Decision Tree',Dtree, y_test, y_pred, x_test_std)


## Random Forest

In [None]:
Rnd_clf = RandomForestClassifier(n_estimators=239, max_leaf_nodes = 16, n_jobs = -1)
Rnd_clf.fit(x_train_std, y_train)
y_pred = Rnd_clf.predict(x_test_std)

# Fit Statistics for RandomForest
FitStatCalculation('Random Forest',Rnd_clf, y_test, y_pred, x_test_std)

##Gradient Boosting

In [None]:
from sklearn.ensemble import GradientBoostingClassifier
gb_Cls = GradientBoostingClassifier(n_estimators=200, max_features=9, max_depth=8, random_state=0)
gb_Cls.fit(x_train_std, y_train)
y_pred = gb_Cls.predict(x_test_std)

# Fit Statistics for Gradient Boosting
FitStatCalculation('Gradient Boosting',gb_Cls, y_test, y_pred, x_test_std)

In [None]:
# XGBoost has no max_features parameter, so only the tree count and depth are set
xgb_Cls = xgb.XGBClassifier(n_estimators=200, max_depth=8, random_state=0)
xgb_Cls.fit(x_train_std, y_train)
y_pred = xgb_Cls.predict(x_test_std)

# Fit Statistics for Extreme Gradient Boosting
FitStatCalculation('Extreme Gradient Boosting',xgb_Cls, y_test, y_pred, x_test_std)

xgb.plot_importance(booster=xgb_Cls)
plt.show()
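
The importance plot ranks features by how heavily the boosted trees rely on them. As a rough illustration of the same idea using scikit-learn only, the sketch below reads `feature_importances_` from a small forest fitted on made-up data. A random forest is swapped in for XGBoost here purely to keep the example light, and the feature names `signal` and `noise` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical data: "signal" equals the target, "noise" alternates independently of it
y = np.array([0] * 10 + [1] * 10)
signal = y.astype(float)
noise = np.array([0.0, 1.0] * 10)
X = np.column_stack([signal, noise])
feature_names = ["signal", "noise"]

# max_features=None lets every split consider both features (an illustrative choice)
forest = RandomForestClassifier(n_estimators=20, max_features=None, random_state=0)
forest.fit(X, y)

# Sort features from most to least important
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```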

## Logistic Regression

In [None]:
# Training ML logistic regression model on training data set
lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(x_train_std, y_train)
# Validating the ML on test data set and finding the accuracy score
y_pred = lr.predict(x_test_std)

# Fit Statistics for Logistic Regression
FitStatCalculation('Logistic Regression',lr, y_test, y_pred, x_test_std)

## SVM:  Support Vector Machines

In [None]:
# Training ML on SVC of training data set
svm = SVC(kernel='linear', C=1.0, random_state=0)
svm.fit(x_train_std, y_train)
# Validating the ML on test data set and finding its accuracy
y_pred = svm.predict(x_test_std)

# Fit Statistics for SVC
FitStatCalculation('SVC',svm, y_test, y_pred, x_test_std)

## Perceptron

In [None]:
ppn = Perceptron(max_iter=40, eta0=0.1, random_state=0)
ppn.fit(x_train_std, y_train)

y_pred = ppn.predict(x_test_std)

# Fit Statistics for Perceptron
FitStatCalculation('Perceptron',ppn, y_test, y_pred, x_test_std)

## Naive Bayes

In [None]:
# Training the Naive Bayes model on the Training set
GNB = GaussianNB()
GNB.fit(x_train_std, y_train)

# Predicting the Test set results
y_pred = GNB.predict(x_test_std)

FitStatCalculation('Naive Bayes',GNB, y_test, y_pred, x_test_std)

##K-Nearest Neighbors


In [None]:
# Selecting the K value based on the accuracy score
accuracy_score_K=[]
for i in range(1,17):
  knn_clf=KNeighborsClassifier(n_neighbors=i)
  knn_clf.fit(x_train_std,y_train)
  pred= knn_clf.predict(x_test_std)
  accuracy_score_K.append(accuracy_score(y_test,pred))
MaxK=max(accuracy_score_K)
# List index 0 corresponds to K=1, so add 1 to recover K
print(f'Accuracy is highest for K value {accuracy_score_K.index(MaxK)+1} i.e., {round(MaxK,3)}')
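
Because the list of accuracies is zero-indexed while K starts at 1, it can be less error-prone to pair each accuracy with its K explicitly. The sketch below shows that pattern in plain Python; the accuracy values are placeholders invented for illustration, not results from this notebook.

```python
# Placeholder accuracies for K = 1..4 (hypothetical, not results from this notebook)
k_values = [1, 2, 3, 4]
accuracies = [0.80, 0.85, 0.90, 0.88]

# Pair each K with its accuracy and pick the pair with the highest accuracy
best_k, best_acc = max(zip(k_values, accuracies), key=lambda pair: pair[1])

# The raw list position of the best accuracy is one less than K
best_index = accuracies.index(best_acc)

print(f"best K = {best_k}, accuracy = {best_acc:.3f}, list index = {best_index}")
```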

In [None]:
# KNN model for K=10, and Classification Report
knn_clf=KNeighborsClassifier(n_neighbors=10)
knn_clf.fit(x_train_std,y_train)
y_pred= knn_clf.predict(x_test_std)
print('with k=10')
print('\n')
FitStatCalculation('KNN', knn_clf, y_test, y_pred, x_test_std)

# Model Comparison in terms of Accuracy

In [None]:
ModelNames = [ "Logistic", "Linear_SVM", "Naive_Bayes", "Xtreme Gradient Boosting", "Gradient Boosting",  "Random Forest", "Decision Tree", "Perceptron", "KNN"]
Classifiers = [lr, svm, GNB,xgb_Cls, gb_Cls, Rnd_clf, Dtree, ppn, knn_clf]
scores = []
for name, clf in zip(ModelNames, Classifiers):
    score = round(clf.score(x_test_std, y_test),3)
    scores.append(score)
Comparison = pd.DataFrame()
Comparison['ModelName'] = ModelNames
Comparison['AccuracyScore'] = scores
Comparison
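
A comparison table like this is easier to read when sorted by score. The sketch below shows one way to do that with pandas; the model names and accuracy values are placeholders invented for illustration and do not correspond to any result from this notebook.

```python
import pandas as pd

# Placeholder names and scores (hypothetical, not results from this notebook)
comparison = pd.DataFrame({
    "ModelName": ["Model A", "Model B", "Model C"],
    "AccuracyScore": [0.91, 0.95, 0.87],
})

# Sort from highest to lowest accuracy and renumber the rows
ranked = comparison.sort_values("AccuracyScore", ascending=False).reset_index(drop=True)
best_model = ranked.loc[0, "ModelName"]

print(ranked)
print("Best model:", best_model)
```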