
# Identifying the Most Influential Factors in Airline Passenger Satisfaction & Classifying Customers

#### This dataset was downloaded from kaggle.com
#### By Daniel Payan

In this notebook, we develop machine learning models to identify which factors have the most influence on airline customer satisfaction, and we build a prediction model that outputs whether or not a passenger would be satisfied with a trip. <br> 
This Machine Learning development is **part 2** of this project, with **part 1** being located in my [Data-Cleaning-and-EDA repo](https://github.com/danielpayan13/Project-Portfolio/tree/main/Data-Cleaning-and-EDA) under the same project name.<br> 
(Part 1 covers the manipulation, analysis, and visualization of the data as EDA)

This notebook uses one CSV file, ***airline_passenger_satisfaction.csv***, which is loaded into a pandas DataFrame. For machine learning and statistical work we use scikit-learn and NumPy.

In [36]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score

In [19]:
#Read data into pandas DataFrame
airline_df = pd.read_csv('Data/airline_passenger_satisfaction.csv')
#The dataset has 129880 records with 24 attributes
airline_df.shape

(129880, 24)

### First let's display the dataset to get an overall look at our data.

In [20]:
airline_df

Unnamed: 0,ID,Gender,Age,Customer Type,Type of Travel,Class,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,...,On-board Service,Seat Comfort,Leg Room Service,Cleanliness,Food and Drink,In-flight Service,In-flight Wifi Service,In-flight Entertainment,Baggage Handling,Satisfaction
0,1,Male,48,First-time,Business,Business,821,2,5.0,3,...,3,5,2,5,5,5,3,5,5,Neutral or Dissatisfied
1,2,Female,35,Returning,Business,Business,821,26,39.0,2,...,5,4,5,5,3,5,2,5,5,Satisfied
2,3,Male,41,Returning,Business,Business,853,0,0.0,4,...,3,5,3,5,5,3,4,3,3,Satisfied
3,4,Male,50,Returning,Business,Business,1905,0,0.0,2,...,5,5,5,4,4,5,2,5,5,Satisfied
4,5,Female,49,Returning,Business,Business,3470,0,1.0,3,...,3,4,4,5,4,3,3,3,3,Satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129875,129876,Male,28,Returning,Personal,Economy Plus,447,2,3.0,4,...,5,1,4,4,4,5,4,4,4,Neutral or Dissatisfied
129876,129877,Male,41,Returning,Personal,Economy Plus,308,0,0.0,5,...,5,2,5,2,2,4,3,2,5,Neutral or Dissatisfied
129877,129878,Male,42,Returning,Personal,Economy Plus,337,6,14.0,5,...,3,3,4,3,3,4,2,3,5,Neutral or Dissatisfied
129878,129879,Male,50,Returning,Personal,Economy Plus,337,31,22.0,4,...,4,4,5,3,3,4,5,3,5,Satisfied


#### Recall from part 1 that we have:
> 5 categorical (object) attributes <br> & <br> 19 numerical (float/int) attributes
#### We also have a few missing values in the 'Arrival Delay' attribute, which we fill with the attribute's mean value (15.09)

In [21]:
airline_df = airline_df.fillna({'Arrival Delay':15.09})
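
The constant 15.09 in the cell above is the column mean carried over from part 1. As a minimal sketch, the fill value could instead be computed from the data so it stays correct if the file changes. The small frame `toy_df` and its values are made up for illustration and are not taken from the airline dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are made up for illustration)
toy_df = pd.DataFrame({'Arrival Delay': [5.0, 39.0, np.nan, 0.0]})

# mean() skips missing values by default, so this is the mean of the observed delays
arrival_mean = toy_df['Arrival Delay'].mean()

# Fill only the 'Arrival Delay' column with its own mean
toy_df = toy_df.fillna({'Arrival Delay': arrival_mean})
```

On the real data the same two lines would apply to `airline_df`, assuming the mean should be taken over all non-missing rows.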

#### Now we'll use pandas get_dummies to make all of these categorical attributes into binary columns based on each unique value
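EX: Gender has two unique values in this dataset (Male/Female), so get_dummies will return a Gender_Male column (1 if the passenger is male, 0 otherwise) and a Gender_Female column (1 if the passenger is female, 0 otherwise). For Satisfaction we keep only the Satisfaction_Satisfied column, which becomes our binary target.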
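
A minimal sketch of this encoding on a toy frame (the three rows are made up for illustration). Passing `dtype=int` is an assumption made here so the output is 0/1 as shown in this notebook, since recent pandas versions return True/False by default.

```python
import pandas as pd

# Toy stand-in with a single categorical column (rows are made up)
toy_df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male']})

# One new column per unique value, named <original column>_<value>
# dtype=int forces 0/1 output instead of booleans
encoded = pd.get_dummies(toy_df, columns=['Gender'], dtype=int)
column_names = list(encoded.columns)
```

The original `Gender` column is replaced by `Gender_Female` and `Gender_Male`.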

In [22]:
#Before modeling, one-hot encode the categorical attributes
categ_list = ['Gender','Customer Type','Type of Travel','Class','Satisfaction']
airline_df = pd.get_dummies(airline_df,columns=categ_list)
#Satisfaction is binary, so keep only Satisfaction_Satisfied (1 = satisfied) as the target
airline_df.drop('Satisfaction_Neutral or Dissatisfied',axis=1,inplace=True)
airline_df

Unnamed: 0,ID,Age,Flight Distance,Departure Delay,Arrival Delay,Departure and Arrival Time Convenience,Ease of Online Booking,Check-in Service,Online Boarding,Gate Location,...,Gender_Female,Gender_Male,Customer Type_First-time,Customer Type_Returning,Type of Travel_Business,Type of Travel_Personal,Class_Business,Class_Economy,Class_Economy Plus,Satisfaction_Satisfied
0,1,48,821,2,5.0,3,3,4,3,3,...,0,1,1,0,1,0,1,0,0,0
1,2,35,821,26,39.0,2,2,3,5,2,...,1,0,0,1,1,0,1,0,0,1
2,3,41,853,0,0.0,4,4,4,5,4,...,0,1,0,1,1,0,1,0,0,1
3,4,50,1905,0,0.0,2,2,3,4,2,...,0,1,0,1,1,0,1,0,0,1
4,5,49,3470,0,1.0,3,3,3,5,3,...,1,0,0,1,1,0,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129875,129876,28,447,2,3.0,4,4,4,4,2,...,0,1,0,1,0,1,0,0,1,0
129876,129877,41,308,0,0.0,5,3,5,3,4,...,0,1,0,1,0,1,0,0,1,0
129877,129878,42,337,6,14.0,5,2,4,2,1,...,0,1,0,1,0,1,0,0,1,0
129878,129879,50,337,31,22.0,4,4,3,4,1,...,0,1,0,1,0,1,0,0,1,1


#### Data Normalization

In [23]:
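#Assign the target: 1 = Satisfied, 0 = Neutral or Dissatisfied
y = airline_df['Satisfaction_Satisfied'].values
airline_df.drop('Satisfaction_Satisfied',axis=1,inplace=True)
#All remaining columns, including the ID row identifier, are used as features
x = airline_df.values

#Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
scaler.fit(x)
x_scaled = scaler.transform(x)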

In [24]:
#Now split the data using an 80%:20% train:test ratio
#random_state=13 is used for consistent results and reproducibility
xtrain,xtest,ytrain,ytest = train_test_split(x_scaled,y,test_size=0.2,random_state=13)
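
The cells above fit the scaler on all rows before splitting, so the test rows contribute to the scaling statistics. Tree-based models are largely insensitive to feature scaling, so the effect here is probably small. As a general habit, though, one alternative is to split first and fit the scaler on the training rows only. The sketch below uses synthetic data, and the names `x_demo`, `y_demo`, and `scaler_demo` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the airline dataset)
rng = np.random.default_rng(13)
x_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

# Split first, so the test rows never influence the scaler's statistics
xtr, xte, ytr, yte = train_test_split(x_demo, y_demo, test_size=0.2, random_state=13)

scaler_demo = StandardScaler()
xtr_scaled = scaler_demo.fit_transform(xtr)  # learn mean/std from training rows only
xte_scaled = scaler_demo.transform(xte)      # reuse those statistics on the test rows
```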

### Feature Importance/Selection

#### Random Forest

In [25]:
from sklearn.ensemble import RandomForestClassifier

# Candidate numbers of trees to try
n_trees = [50, 100, 200, 500, 1000]

# Loop over the number of trees and train a random forest for each value
best_accuracy = 0
optimal_n_trees = None
for n in n_trees:
    # Train a random forest with n trees
    rfc = RandomForestClassifier(n_estimators=n, random_state=13)
    rfc.fit(xtrain, ytrain)

    # Evaluate the model on the held-out test set
    ypred = rfc.predict(xtest)
    accuracy = accuracy_score(ytest, ypred)

    # Check if the current model is the best so far
    if accuracy > best_accuracy:
        best_accuracy = accuracy
        optimal_n_trees = n

print('The optimal number of trees is:', optimal_n_trees)
print('With an accuracy score of: ', best_accuracy)

The optimal number of trees is: 1000
With an accuracy score of:  0.9655451185709886
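
The loop above picks the tree count using the test set, so the reported accuracy may be slightly optimistic. It also leaves `rfc` as the last model trained, which happens to be the best one here (1000 trees). A minimal sketch of one alternative, on synthetic data with hypothetical names, chooses the tree count on a separate validation split and keeps the best model explicitly:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical; not the airline dataset)
x_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=13)

# Hold out a test set, then carve a validation set out of the remainder
x_rest, x_test, y_rest, y_test = train_test_split(x_demo, y_demo, test_size=0.2, random_state=13)
x_fit, x_val, y_fit, y_val = train_test_split(x_rest, y_rest, test_size=0.25, random_state=13)

best_val_accuracy, best_n, best_model = -1.0, None, None
for n in [10, 25, 50]:  # small values to keep the sketch fast
    model = RandomForestClassifier(n_estimators=n, random_state=13)
    model.fit(x_fit, y_fit)
    val_accuracy = accuracy_score(y_val, model.predict(x_val))
    # Keep the model itself, not just its score, so later cells use the winner
    if val_accuracy > best_val_accuracy:
        best_val_accuracy, best_n, best_model = val_accuracy, n, model

# The test set is touched once, only after the tree count has been chosen
test_accuracy = accuracy_score(y_test, best_model.predict(x_test))
```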


In [30]:
importances = rfc.feature_importances_
sorted_indices = np.argsort(importances)[::-1]
 
feat_labels = airline_df.columns
 
for f in range(xtrain.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[sorted_indices[f]],
                            importances[sorted_indices[f]]))

 1) Online Boarding                0.154686
 2) In-flight Wifi Service         0.136819
 3) Class_Business                 0.066825
 4) Type of Travel_Personal        0.059928
 5) Type of Travel_Business        0.059030
 6) In-flight Entertainment        0.053416
 7) Seat Comfort                   0.046978
 8) Class_Economy                  0.040921
 9) Ease of Online Booking         0.036469
10) Leg Room Service               0.029163
11) ID                             0.028328
12) Customer Type_First-time       0.028134
13) Flight Distance                0.026572
14) On-board Service               0.025431
15) Age                            0.025430
16) Customer Type_Returning        0.025251
17) Baggage Handling               0.024645
18) Check-in Service               0.024524
19) Cleanliness                    0.024300
20) In-flight Service              0.021681
21) Gate Location                  0.013079
22) Departure and Arrival Time Convenience 0.012614
23) Arrival Delay       

#### Now we'll try Random Forest again after adjusting for overfitting

In [32]:
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
#Recursive Feature Elimination will be used to select a subset of the most important features
selector = RFE(RandomForestClassifier(n_estimators=100, random_state=13),n_features_to_select=10)
xtrain = selector.fit_transform(xtrain,ytrain)
xtest = selector.transform(xtest)

rfc = RandomForestClassifier(n_estimators = 600, min_samples_leaf = 5, max_features = 10, random_state=13)
rfc.fit(xtrain,ytrain)

#Now we will use cross-validation to estimate the model's generalization performance
cv_scores = cross_val_score(rfc,xtrain,ytrain,cv=5)
print('These are the cross-validation scores: ', cv_scores)
print('These are the average scores: ', cv_scores.mean())

#Evaluate the refit model on the held-out test set
ypred = rfc.predict(xtest)
accuracy = accuracy_score(ytest,ypred)
print('This is the updated accuracy score: ', accuracy)

These are the cross-validation scores:  [0.94153313 0.93970454 0.93917521 0.93956018 0.93768046]
These are the average scores:  0.9395307028580143
This is the updated accuracy score:  0.9401755466584539


In [33]:
importances = rfc.feature_importances_
sorted_indices = np.argsort(importances)[::-1]
 
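#NOTE: xtrain now holds only the 10 columns kept by RFE, so indexing airline_df.columns
#by position labels them with the first 10 column names; the kept names are
#airline_df.columns[selector.get_support()]
feat_labels = airline_df.columns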
 
for f in range(xtrain.shape[1]):
    print("%2d) %-*s %f" % (f + 1, 30,
                            feat_labels[sorted_indices[f]],
                            importances[sorted_indices[f]]))

 1) Arrival Delay                  0.397990
 2) Departure and Arrival Time Convenience 0.209769
 3) Check-in Service               0.150059
 4) Ease of Online Booking         0.062244
 5) ID                             0.046594
 6) Flight Distance                0.041910
 7) Age                            0.038574
 8) Online Boarding                0.027175
 9) Departure Delay                0.024331
10) Gate Location                  0.001354
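
The ten names in the listing above are exactly the first ten columns of `airline_df`. After RFE, `xtrain` has only 10 columns, so its positions no longer line up with `airline_df.columns`. RFE exposes which original columns it kept through `get_support()`. A minimal sketch of labeling importances after selection, on synthetic data with hypothetical column names `f0` to `f5`:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in data with hypothetical column names
x_demo, y_demo = make_classification(n_samples=200, n_features=6, n_informative=3, random_state=13)
demo_columns = pd.Index(['f0', 'f1', 'f2', 'f3', 'f4', 'f5'])

selector_demo = RFE(RandomForestClassifier(n_estimators=20, random_state=13), n_features_to_select=3)
x_selected = selector_demo.fit_transform(x_demo, y_demo)

# get_support() returns a boolean mask over the ORIGINAL columns
selected_labels = demo_columns[selector_demo.get_support()]

model = RandomForestClassifier(n_estimators=20, random_state=13).fit(x_selected, y_demo)

# Pair each importance with the name of the column it was computed from
ranked = sorted(zip(selected_labels, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for rank, (label, importance) in enumerate(ranked, start=1):
    print("%2d) %-30s %f" % (rank, label, importance))
```

In this notebook the equivalent change would be `feat_labels = airline_df.columns[selector.get_support()]`. The ranking it produces is not shown here because that cell has not been re-run.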


### Results:
#### After feature selection and constraining the model to limit overfitting, we are left with the 10 features kept by RFE

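In the listing above, the **most influential feature on airline customer satisfaction is labeled Arrival Delay**, with an importance score of 0.398 (a random forest feature importance, not a coefficient). Because those labels are read by position from `airline_df.columns` after RFE reduced the data to 10 columns, the names should be checked against `selector.get_support()` before drawing conclusions about individual features.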

This random forest model also predicts customer satisfaction with about **94%** accuracy on the held-out test set (0.940), in line with the 5-fold cross-validation average of 0.940.