<a href="https://colab.research.google.com/github/federicoding/Airline_Satisfaction/blob/main/Airline_Satisfaction_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Training for a Satisfaction Classification model

Based on Airline flight satisfaction survey data from [Kaggle](https://www.kaggle.com/teejmahal20/airline-passenger-satisfaction) and totaly modified by [Tan pengshi Alvin](https://towardsdatascience.com/predicting-satisfaction-of-airline-passengers-with-classification-76f1516e1d16) as published in his blog. More than an inspiration, it's a step-by-step procedure for modeling. And I'll get practical knowledge from it.

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV, Lasso, lars_path
from sklearn.naive_bayes import MultinomialNB, GaussianNB, BernoulliNB
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, precision_score, recall_score, precision_recall_curve, f1_score,roc_auc_score, roc_curve, log_loss, classification_report

from ipywidgets import interactive
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

In [19]:
url='https://raw.githubusercontent.com/federicoding/Airline_Satisfaction/main/Airline_Dataset.csv'
df = pd.read_csv(filepath_or_buffer=url,
                 sep=';')

#Data Preparation
As stated from the blog post, the original data contains around 130k lines (from passengers on US airline flights). In total, 21 feature columns are present and one target binary column. Out of those features, 14 are survey entries where passengers rate the flight experience (with a scale of 1 to 5). Some survey entries have a value of 0, which is infered as unfilled survey questions. Those survey entries, and some other NaN values, are removed resulting in a data set of about 70 000 entries to build the model. It is to be noted that some columns and other entries have been renamed for further clarity.

This is the path of data cleaning.

All in all, **the following changes can be summarized**:



*   Column names are renamed
*   Elements in Feature 'Customer Type' and 'Class' have been renamed
*   Rows with Null values have been removed
*   Rows with scored of 0 in the survey of satisfaction (target column if you want) have been removed (probably due to customers not filling it)
*   Departure Delay end Arrival Delar are combined
*   Satisfaction target is relabelled as 0 and 1


In [7]:
df

Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,Satisfaction
0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,3,1,3,1,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0,0.0,satisfied
3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0,0.0,satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129875,78463,Male,disloyal Customer,34,Business travel,Business,526,3,3,3,1,4,3,4,4,3,2,4,4,5,4,0,0.0,neutral or dissatisfied
129876,71167,Male,Loyal Customer,23,Business travel,Business,646,4,4,4,4,4,4,4,4,4,5,5,5,5,4,0,0.0,satisfied
129877,37675,Female,Loyal Customer,17,Personal Travel,Eco,828,2,5,1,5,2,1,2,2,4,3,4,5,4,2,0,0.0,neutral or dissatisfied
129878,90086,Male,Loyal Customer,14,Business travel,Business,1127,3,3,3,3,4,4,4,4,3,2,5,4,5,4,0,0.0,satisfied


Let's have a rapid look at the data

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 129880 entries, 0 to 129879
Data columns (total 24 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   id                                 129880 non-null  int64  
 1   Gender                             129880 non-null  object 
 2   Customer Type                      129880 non-null  object 
 3   Age                                129880 non-null  int64  
 4   Type of Travel                     129880 non-null  object 
 5   Class                              129880 non-null  object 
 6   Flight Distance                    129880 non-null  int64  
 7   Inflight wifi service              129880 non-null  int64  
 8   Departure/Arrival time convenient  129880 non-null  int64  
 9   Ease of Online booking             129880 non-null  int64  
 10  Gate location                      129880 non-null  int64  
 11  Food and drink                     1298

Renaming Customer Types
(*NDLR* les titres originaux ne regardent qu'eux...)

In [20]:
df['Customer Type'] = df['Customer Type'].map({'Loyal Customer':'Returning Customer', 'Disloyal Customer':'First-time Customer'})

In [21]:
df = df.dropna(axis=0)

In [22]:
df['Departure Delay in Minutes'] = df['Departure Delay in Minutes'].astype('float')

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105773 entries, 0 to 129879
Data columns (total 24 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   id                                 105773 non-null  int64  
 1   Gender                             105773 non-null  object 
 2   Customer Type                      105773 non-null  object 
 3   Age                                105773 non-null  int64  
 4   Type of Travel                     105773 non-null  object 
 5   Class                              105773 non-null  object 
 6   Flight Distance                    105773 non-null  int64  
 7   Inflight wifi service              105773 non-null  int64  
 8   Departure/Arrival time convenient  105773 non-null  int64  
 9   Ease of Online booking             105773 non-null  int64  
 10  Gate location                      105773 non-null  int64  
 11  Food and drink                     1057

In [13]:
df.describe()

Unnamed: 0,id,Age,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes
count,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0,105773.0
mean,64934.203729,41.463625,1297.02252,2.732247,3.206934,2.768854,2.974095,3.240657,3.3738,3.539268,3.425127,3.416136,3.380296,3.617908,3.324979,3.629244,3.336872,14.569181,15.004973
std,37469.289618,15.135105,1048.84933,1.334159,1.472698,1.415342,1.30992,1.314042,1.323639,1.278098,1.313002,1.28722,1.313633,1.198901,1.261288,1.191813,1.290972,38.044553,38.552532
min,2.0,7.0,31.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
25%,32512.0,31.0,440.0,2.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,3.0,3.0,3.0,2.0,0.0,0.0
50%,65033.0,43.0,925.0,3.0,4.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,4.0,3.0,4.0,3.0,0.0,0.0
75%,97330.0,53.0,1986.0,4.0,4.0,4.0,4.0,4.0,4.0,5.0,5.0,4.0,4.0,5.0,4.0,5.0,4.0,12.0,13.0
max,129880.0,85.0,4983.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,1592.0,1584.0


Renaming columns

In [23]:
df = df.rename(columns={'Leg room service':'Leg room'})

In [24]:
from string import capwords
df.columns = [capwords(i) for i in df.columns]
df = df.rename(columns={'Departure/arrival Time Convenient':'Departure/Arrival Time Convenience'})

In [25]:
df

Unnamed: 0,Id,Gender,Customer Type,Age,Type Of Travel,Class,Flight Distance,Inflight Wifi Service,Departure/Arrival Time Convenience,Ease Of Online Booking,Gate Location,Food And Drink,Online Boarding,Seat Comfort,Inflight Entertainment,On-board Service,Leg Room,Baggage Handling,Checkin Service,Inflight Service,Cleanliness,Departure Delay In Minutes,Arrival Delay In Minutes,Satisfaction
0,70172,Male,Returning Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,25.0,18.0,neutral or dissatisfied
2,110028,Female,Returning Customer,26,Business travel,Business,1142,2,2,2,2,5,5,5,5,4,3,4,4,4,5,0.0,0.0,satisfied
3,24026,Female,Returning Customer,25,Business travel,Business,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,11.0,9.0,neutral or dissatisfied
4,119299,Male,Returning Customer,61,Business travel,Business,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0.0,0.0,satisfied
5,111157,Female,Returning Customer,26,Personal Travel,Eco,1180,3,4,2,1,1,2,1,1,3,4,4,4,4,1,0.0,0.0,neutral or dissatisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129873,120654,Male,Returning Customer,52,Business travel,Business,280,3,3,3,3,3,4,4,4,4,4,4,3,4,3,0.0,0.0,satisfied
129876,71167,Male,Returning Customer,23,Business travel,Business,646,4,4,4,4,4,4,4,4,4,5,5,5,5,4,0.0,0.0,satisfied
129877,37675,Female,Returning Customer,17,Personal Travel,Eco,828,2,5,1,5,2,1,2,2,4,3,4,5,4,2,0.0,0.0,neutral or dissatisfied
129878,90086,Male,Returning Customer,14,Business travel,Business,1127,3,3,3,3,4,4,4,4,3,2,5,4,5,4,0.0,0.0,satisfied


Removing a bunch of **0** valued entries.

In [26]:
df = df[(df['Inflight Wifi Service']!=0)&(df['Departure/Arrival Time Convenience']!=0)&(df['Ease Of Online Booking']!=0)&(df['Gate Location'])&(df['Food And Drink']!=0)&(df['Online Boarding']!=0)&(df['Seat Comfort']!=0)&(df['Inflight Entertainment']!=0)&(df['On-board Service']!=0)&(df['Leg Room']!=0)&(df['Baggage Handling']!=0)&(df['Checkin Service']!=0)&(df['Inflight Service']!=0)&(df['Cleanliness']!=0)]

Relabeling of target and merging of delays

In [28]:
df['Satisfaction'] = df['Satisfaction'].map({'satisfied':1,'neutral or dissatisfied':0})
df = df.reset_index()
df = df.drop('index',axis=1)
df['Total Delay'] = df['Departure Delay In Minutes'] + df['Arrival Delay In Minutes']

Creating a new dataset object

In [29]:
DF = df.copy()
df = df.drop('Id',axis=1)

Re-indexing and dropping unecessary columns

In [30]:
df = df.reindex(columns=['Satisfaction']+list(df.columns)[:-2]+['Total Delay'])
df = df.drop(['Departure Delay In Minutes','Arrival Delay In Minutes'],axis=1)

Checking the Satisfaction (target) column

In [None]:
df['Satisfaction'].value_counts(normalize=True)

Relabeling some feature values

In [31]:
df['Class'] = df['Class'].map({'Eco':'Economy','Eco Plus':'Economy','Business':'Business'})

In [32]:
df

Unnamed: 0,Satisfaction,Gender,Customer Type,Age,Type Of Travel,Class,Flight Distance,Inflight Wifi Service,Departure/Arrival Time Convenience,Ease Of Online Booking,Gate Location,Food And Drink,Online Boarding,Seat Comfort,Inflight Entertainment,On-board Service,Leg Room,Baggage Handling,Checkin Service,Inflight Service,Cleanliness,Total Delay
0,0,Male,Returning Customer,13,Personal Travel,Economy,460,3,4,3,1,5,3,5,5,4,3,4,4,5,5,43.0
1,0,Female,Returning Customer,25,Business travel,Business,562,2,5,5,5,2,2,2,2,2,5,3,1,4,2,20.0
2,1,Male,Returning Customer,61,Business travel,Business,214,3,3,3,3,4,5,5,3,3,4,4,3,3,3,0.0
3,0,Female,Returning Customer,26,Personal Travel,Economy,1180,3,4,2,1,1,2,1,1,3,4,4,4,4,1,0.0
4,0,Male,Returning Customer,47,Personal Travel,Economy,1276,2,4,2,3,2,2,2,2,3,3,4,3,5,2,32.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
58434,0,Female,Returning Customer,48,Business travel,Economy,283,3,1,1,1,4,3,3,3,3,3,3,4,3,1,66.0
58435,1,Male,Returning Customer,52,Business travel,Business,280,3,3,3,3,3,4,4,4,4,4,4,3,4,3,0.0
58436,0,Female,Returning Customer,17,Personal Travel,Economy,828,2,5,1,5,2,1,2,2,4,3,4,5,4,2,0.0
58437,1,Male,Returning Customer,14,Business travel,Business,1127,3,3,3,3,4,4,4,4,3,2,5,4,5,4,0.0
