# Data Preparation

In [37]:
import pandas as pd 

This study is based on a survey of 103,904 airline passengers. Little information is available on the background of the dataset or on how the data was collected. The dataset was downloaded from Kaggle and can be found at the following [link](https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/data). The main goal of the project is to determine which features are most strongly correlated with customer satisfaction and to build a model that predicts the satisfaction of new customers. Since the response variable, `satisfaction`, is binary, this is a classification problem. There are 23 features in total, most of which are categorical in nature. According to the creators of the dataset, it has already been preprocessed for classification. However, the dataset will be validated and reprocessed in this notebook.

## Cleaning and Transforming the Train Data Set

In [38]:
train = pd.read_csv('../Data/Raw/train.csv', index_col = 0)
train.head()

Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied


In [39]:
# Checking for missing values:
print(train.info())

<class 'pandas.core.frame.DataFrame'>
Index: 103904 entries, 0 to 103903
Data columns (total 24 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   id                                 103904 non-null  int64  
 1   Gender                             103904 non-null  object 
 2   Customer Type                      103904 non-null  object 
 3   Age                                103904 non-null  int64  
 4   Type of Travel                     103904 non-null  object 
 5   Class                              103904 non-null  object 
 6   Flight Distance                    103904 non-null  int64  
 7   Inflight wifi service              103904 non-null  int64  
 8   Departure/Arrival time convenient  103904 non-null  int64  
 9   Ease of Online booking             103904 non-null  int64  
 10  Gate location                      103904 non-null  int64  
 11  Food and drink                     103904 no

There appear to be no missing values in the dataset. However, according to the dataset description, a rating of 0 in the survey variables means 'not applicable', so these entries will be treated as missing.
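As a rough illustration of how many 'not applicable' answers hide behind the clean `info()` output, the zeros can be counted per column. The sketch below uses a small made-up frame (`toy` is a hypothetical stand-in, with column names borrowed from the dataset) rather than the real `train` data:

```python
import pandas as pd

# Toy frame standing in for the survey data; the values are invented
# for illustration only.
toy = pd.DataFrame({
    'Inflight wifi service': [3, 0, 5, 2],
    'Gate location': [1, 4, 0, 0],
    'Baggage handling': [4, 3, 5, 1],
})

# Count the answers rated 0 ('not applicable') in each rating column:
zero_counts = (toy == 0).sum()
print(zero_counts)
```

Running the same comparison on the rating columns of `train` would presumably show which features are affected before any conversion takes place.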

In [40]:
# Checking for duplicate entries:
print(train.duplicated().any())

False


There are no duplicate entries in the dataset. 

Each feature of the dataset will be checked, validated and formatted individually, below:

In [41]:
# Checking whether the `id` column has only unique values:
len(train['id'].unique())

103904

In [42]:
# Unique column values:
train['Gender'].unique()
# Convert to categorical:
train['Gender'] = train['Gender'].astype('category')
train['Gender'].unique()

['Male', 'Female']
Categories (2, object): ['Female', 'Male']

In [43]:
# Unique column values:
train['Customer Type'].unique()
# Format the column:
train['Customer Type'] = train['Customer Type'].str.capitalize()
# Convert to categorical:
train['Customer Type'] = train['Customer Type'].astype('category')
train['Customer Type'].unique()

['Loyal customer', 'Disloyal customer']
Categories (2, object): ['Disloyal customer', 'Loyal customer']

In [44]:
# Unique column values:
train['Age'].unique()
# Validating data:
print('Minimum Age:', train['Age'].min())
print('Maximum Age:', train['Age'].max())

Minimum Age: 7
Maximum Age: 85


In [45]:
# Unique column values:
train['Type of Travel'].unique()
# Capitalise the info:
train['Type of Travel'] = train['Type of Travel'].str.capitalize()
# Convert to categorical:
train['Type of Travel'] = train['Type of Travel'].astype('category')
train['Type of Travel'].unique()

['Personal travel', 'Business travel']
Categories (2, object): ['Business travel', 'Personal travel']

In [46]:
# Unique column values:
train['Class'].unique()
train['Class'] = train['Class'].str.replace('Eco', '1')
train['Class'] = train['Class'].str.replace('1 Plus', '2')
train['Class'] = train['Class'].str.replace('Business', '3')
# Convert to categorical:
train['Class'] = train['Class'].astype('category')
train['Class'] = train['Class'].cat.set_categories(new_categories = ['1', '2', '3'], ordered = True)
train['Class'].unique()

['2', '3', '1']
Categories (3, object): ['1' < '2' < '3']
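The chained `str.replace` calls above depend on their order (`'Eco Plus'` first becomes `'1 Plus'`, then `'2'`). A possible alternative, shown here only as a sketch on made-up values, is an explicit dictionary passed to `Series.map`; the names `classes` and `class_codes` are hypothetical:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Made-up sample of travel classes:
classes = pd.Series(['Eco Plus', 'Business', 'Eco', 'Business'])

# Explicit, order-independent mapping from label to code:
class_codes = {'Eco': '1', 'Eco Plus': '2', 'Business': '3'}
class_dtype = CategoricalDtype(categories=['1', '2', '3'], ordered=True)

# Map the labels, then cast to the ordered categorical dtype:
encoded = classes.map(class_codes).astype(class_dtype)
print(encoded)
```

With a mapping like this, any label missing from the dictionary becomes NaN, which can make unexpected values easier to spot.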

In [47]:
# Unique column values:
train['Flight Distance'].unique()
# Validating data:
print('Minimum Flight Distance:', train['Flight Distance'].min())
print('Maximum Flight Distance:', train['Flight Distance'].max())

Minimum Flight Distance: 31
Maximum Flight Distance: 4983


In [48]:
# Unique column values:
train['Inflight wifi service'].unique()
# Convert to ordered categorical:
train['Inflight wifi service'] = train['Inflight wifi service'].astype('category')
train['Inflight wifi service'] = train['Inflight wifi service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Inflight wifi service'].unique()

[3, 2, 4, 1, 5, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
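The `NaN` in the output comes from the rating 0: `set_categories` keeps only the listed categories, and any value outside them is set to missing. A minimal sketch on invented ratings:

```python
import pandas as pd

# Invented ratings, including one 0 ('not applicable'):
ratings = pd.Series([3, 0, 5, 1]).astype('category')

# Restricting the categories to 1-5 turns the 0 into NaN:
ratings = ratings.cat.set_categories([1, 2, 3, 4, 5], ordered=True)
print(ratings.isna().sum())
```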

In [49]:
# Unique column values:
train['Departure/Arrival time convenient'].unique()
# Convert to ordered categorical:
train['Departure/Arrival time convenient'] = train['Departure/Arrival time convenient'].astype('category')
train['Departure/Arrival time convenient'] = train['Departure/Arrival time convenient'].cat.set_categories(new_categories = [1, 2, 3, 4, 5],
                                                                                                           ordered = True)
train['Departure/Arrival time convenient'].unique()

[4, 2, 5, 3, 1, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [50]:
# Unique column values:
train['Ease of Online booking'].unique()
# Convert to ordered categorical:
train['Ease of Online booking'] = train['Ease of Online booking'].astype('category')
train['Ease of Online booking'] = train['Ease of Online booking'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], 
                                                                                     ordered = True)
train['Ease of Online booking'].unique()

[3, 2, 5, 4, 1, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [51]:
# Unique column values:
train['Gate location'].unique()
# Convert to ordered categorical:
train['Gate location'] = train['Gate location'].astype('category')
train['Gate location'] = train['Gate location'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Gate location'].unique()

[1, 3, 2, 5, 4, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [52]:
# Unique column values:
train['Food and drink'].unique()
# Convert to ordered categorical:
train['Food and drink'] = train['Food and drink'].astype('category')
train['Food and drink'] = train['Food and drink'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Food and drink'].unique()

[5, 1, 2, 4, 3, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [53]:
# Unique column values:
train['Online boarding'].unique()
# Convert to ordered categorical:
train['Online boarding'] = train['Online boarding'].astype('category')
train['Online boarding'] = train['Online boarding'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Online boarding'].unique()

[3, 5, 2, 1, 4, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [54]:
# Unique column values:
train['Seat comfort'].unique()
# Convert to ordered categorical:
train['Seat comfort'] = train['Seat comfort'].astype('category')
train['Seat comfort'] = train['Seat comfort'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Seat comfort'].unique()


[5, 1, 2, 3, 4, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [55]:
# Unique column values:
train['Inflight entertainment'].unique()
# Convert to ordered categorical:
train['Inflight entertainment'] = train['Inflight entertainment'].astype('category')
train['Inflight entertainment'] = train['Inflight entertainment'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Inflight entertainment'].unique()

[5, 1, 2, 3, 4, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [56]:
# Unique column values:
train['On-board service'].unique()
# Convert to ordered categorical:
train['On-board service'] = train['On-board service'].astype('category')
train['On-board service'] = train['On-board service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['On-board service'].unique()


[4, 1, 2, 3, 5, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [57]:
# Unique column values:
train['Leg room service'].unique()
# Convert to ordered categorical:
train['Leg room service'] = train['Leg room service'].astype('category')
train['Leg room service'] = train['Leg room service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Leg room service'].unique()

[3, 5, 4, 2, 1, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [58]:
# Unique column values:
train['Baggage handling'].unique()
# Convert to ordered categorical:
train['Baggage handling'] = train['Baggage handling'].astype('category')
train['Baggage handling'] = train['Baggage handling'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Baggage handling'].unique()


[4, 3, 5, 1, 2]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [59]:
# Unique column values:
train['Checkin service'].unique()
# Convert to ordered categorical:
train['Checkin service'] = train['Checkin service'].astype('category')
train['Checkin service'] = train['Checkin service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Checkin service'].unique()

[4, 1, 3, 5, 2, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [60]:
# Unique column values:
train['Inflight service'].unique()
# Convert to ordered categorical:
train['Inflight service'] = train['Inflight service'].astype('category')
train['Inflight service'] = train['Inflight service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Inflight service'].unique()

[5, 4, 3, 1, 2, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [61]:
# Unique column values:
train['Cleanliness'].unique()
# Convert to ordered categorical:
train['Cleanliness'] = train['Cleanliness'].astype('category')
train['Cleanliness'] = train['Cleanliness'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Cleanliness'].unique()

[5, 1, 2, 3, 4, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
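The fourteen rating columns all receive the same conversion, so the repeated cells could be collapsed into one loop. The following is a sketch, not the notebook's own code: `to_ordered_ratings` and `rating_dtype` are hypothetical names, and the data is made up.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Shared ordered dtype for the 1-5 survey scale:
rating_dtype = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

def to_ordered_ratings(df, columns):
    """Return a copy of df with the given columns cast to the 1-5 ordered scale.

    Values outside 1-5 (such as 0) become NaN."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype(rating_dtype)
    return out

# Made-up example with two of the rating columns:
toy = pd.DataFrame({'Seat comfort': [5, 0, 3], 'Cleanliness': [1, 2, 0]})
converted = to_ordered_ratings(toy, ['Seat comfort', 'Cleanliness'])
print(converted.dtypes)
```

The same helper could then be applied to both the train and the test set, which would keep the two in step.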

In [62]:
# Unique column values:
print(train['Departure Delay in Minutes'].unique()[0:100])
# Check column dtype:
print('Dtype:', train['Departure Delay in Minutes'].dtype)
# Validating data:
print('Minimum Departure Delay in Minutes:', train['Departure Delay in Minutes'].min())
print('Maximum Departure Delay in Minutes:',train['Departure Delay in Minutes'].max())

[ 25   1   0  11   9   4  28  43  49   7  17  52  54  27  18  19   3 109
  23   8  14  10  51  39  13  30  64  20  45  44  31  81  35  67  22  40
  91  21  15  29 105  12 162  24 141   6  34   2  97  16  99  37  66  53
  36 209  60 149  26   5  59  62 113  38  83 176  46  73 199  56  93  70
  80  96  57  95  74 172  63 175 143  48  47 101 118  76 220  33  55 232
 170 173 124 112  94 243 128  92 270  90]
Dtype: int64
Minimum Departure Delay in Minutes: 0
Maximum Departure Delay in Minutes: 1592


In [63]:
# Unique column values:
print(train['Arrival Delay in Minutes'].unique()[0:100])
# Check column dtype:
print('Dtype:', train['Arrival Delay in Minutes'].dtype)
# Validating data:
print('Minimum Arrival Delay in Minutes:', train['Arrival Delay in Minutes'].min())
print('Maximum Arrival Delay in Minutes:', train['Arrival Delay in Minutes'].max())

[ 18.   6.   0.   9.  23.   8.  35.  51.  10.   5.   4.  29.  44.  28.
  12. 120.  24.   1.  20.  31.  15.  48.  26.  49.   2.  37.  50.   3.
  19.  72.  11.  34.  62.  27.  52.  13.  82.  30.  16.   7. 122. 179.
 125.  17.  nan  89. 101.  14.  61.  32.  33.  41. 191. 138.  53.  22.
  57.  65.  76. 107.  92. 164.  21.  40.  55. 185.  63.  77.  86.  91.
 100.  54.  36.  70. 139.  67. 163. 128. 180.  93. 121.  45. 105. 126.
  56.  73. 212.  88. 241. 172. 175. 111.  99.  25.  42. 226.  46. 131.
 260.  69.]
Dtype: float64
Minimum Arrival Delay in Minutes: 0.0
Maximum Arrival Delay in Minutes: 1584.0


In [64]:
# Unique column values:
print(train['satisfaction'].unique())

# Format column name:
train = train.rename(columns = {'satisfaction': 'Satisfaction'})
train['Satisfaction'].unique()

# Convert to ordered categorical:
train['Satisfaction'] = train['Satisfaction'].astype('category')
train['Satisfaction'] = train['Satisfaction'].cat.set_categories(new_categories = ['satisfied', 'neutral or dissatisfied'], ordered = True)

['neutral or dissatisfied' 'satisfied']


In [65]:
# Format column name:
train = train.rename(columns = {'Inflight wifi service': 'Inflight Wifi Service', 'Departure/Arrival time convenient': 'Departure/Arrival Time Convenient',
                                'Ease of Online booking': 'Ease of Online Booking', 'Gate location': 'Gate Location', 
                                'Food and drink': 'Food and Drink', 'Online boarding': 'Online Boarding', 'Seat comfort': 'Seat Comfort',
                                'Inflight entertainment': 'Inflight Entertainment', 'On-board service': 'On-board Service', 'Leg room service': 'Leg Room Service',
                                'Baggage handling': 'Baggage Handling', 'Checkin service': 'Checkin Service', 'Inflight service': 'Inflight Service'})


Notice how, when cleaning the data and setting correct ordered category values, a large number of NaNs were created. This will be further analysed below:

In [66]:
# Checking for missing values:
train.isna().sum()

id                                      0
Gender                                  0
Customer Type                           0
Age                                     0
Type of Travel                          0
Class                                   0
Flight Distance                         0
Inflight Wifi Service                3103
Departure/Arrival Time Convenient    5300
Ease of Online Booking               4487
Gate Location                           1
Food and Drink                        107
Online Boarding                      2428
Seat Comfort                            1
Inflight Entertainment                 14
On-board Service                        3
Leg Room Service                      472
Baggage Handling                        0
Checkin Service                         1
Inflight Service                        3
Cleanliness                            12
Departure Delay in Minutes              0
Arrival Delay in Minutes              310
Satisfaction                      

Rows containing NaNs will be dropped for variables in which at most 5% of the values are missing:

In [67]:
# Establishing a threshold in accordance with the dataset size for the deletion of NaNs:
threshold = len(train)*0.05
print(round(threshold))

5195


In [68]:
# Identifying the columns for which NaNs can be dropped:
columns_to_drop = train.columns[train.isna().sum() <= threshold]
print(columns_to_drop)

Index(['id', 'Gender', 'Customer Type', 'Age', 'Type of Travel', 'Class',
       'Flight Distance', 'Inflight Wifi Service', 'Ease of Online Booking',
       'Gate Location', 'Food and Drink', 'Online Boarding', 'Seat Comfort',
       'Inflight Entertainment', 'On-board Service', 'Leg Room Service',
       'Baggage Handling', 'Checkin Service', 'Inflight Service',
       'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes',
       'Satisfaction'],
      dtype='object')


In [69]:
# Dropping the missing values and checking the new data shape:
train.dropna(subset = columns_to_drop, inplace = True)
train.shape

(98860, 24)
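The threshold logic can be checked on a small example. In this sketch (made-up data, hypothetical names `toy`, `below` and `cleaned`), column `a` falls under the 5% threshold and column `b` does not, so only rows with a missing `a` are dropped:

```python
import numpy as np
import pandas as pd

# 40 made-up rows: 'a' has 1 missing value (2.5%), 'b' has 4 (10%).
toy = pd.DataFrame({
    'a': [1.0, np.nan] + [1.0] * 38,
    'b': [np.nan] * 4 + [2.0] * 36,
})

threshold = len(toy) * 0.05  # 2.0 rows

# Columns whose missing count is within the threshold:
below = toy.columns[toy.isna().sum() <= threshold]

# Drop rows that are missing in any of those columns:
cleaned = toy.dropna(subset=below)
print(list(below), cleaned.shape)
```

Note that the missing values in `b` are left in place, except where they happen to share a row with a missing `a`; they still need separate treatment.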

One feature, `Departure/Arrival Time Convenient`, contains more missing values than the established threshold. Since the feature is categorical, the mode (most frequent category) will be used to impute the missing values.

In [70]:
# Computing the mode:
print('Mode:', train['Departure/Arrival Time Convenient'].mode())

# Converting the feature to a string dtype for imputation:
train['Departure/Arrival Time Convenient'] = train['Departure/Arrival Time Convenient'].astype('string')

# Filling missing values with the most frequent category:
train['Departure/Arrival Time Convenient'] = train['Departure/Arrival Time Convenient'].fillna('4.0')

# Converting the feature to categorical:
train['Departure/Arrival Time Convenient'] = train['Departure/Arrival Time Convenient'].astype('category')

# Reformatting the categories:
update_cats = {'4.0': '4', '2.0': '2', '5.0': '5', '3.0': '3', '1.0': '1'}
train['Departure/Arrival Time Convenient'] = train['Departure/Arrival Time Convenient'].replace(update_cats)

# Setting new categories:
train['Departure/Arrival Time Convenient'] = train['Departure/Arrival Time Convenient'].cat.set_categories(new_categories = ['1', '2', '3', '4', '5'], ordered = True)
train['Departure/Arrival Time Convenient'].unique()


Mode: 0    4
Name: Departure/Arrival Time Convenient, dtype: category
Categories (5, int64): [1 < 2 < 3 < 4 < 5]



['4', '2', '5', '3', '1']
Categories (5, object): ['1' < '2' < '3' < '4' < '5']
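The round trip through the string dtype may not be strictly necessary: `fillna` accepts a value on a categorical column as long as that value is one of its categories. A sketch on invented ratings, where `rating_dtype`, `s` and `filled` are hypothetical names:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

rating_dtype = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

# Invented ratings; the 0 becomes NaN on conversion:
s = pd.Series([4, 0, 2, 4, 5]).astype(rating_dtype)

# Take the most frequent category and fill the gaps with it:
mode_value = s.mode()[0]
filled = s.fillna(mode_value)
print(filled)
```

Unlike the cell above, this keeps the categories as integers rather than the strings `'1'` to `'5'`, so it is an alternative under that assumption and not a drop-in replacement.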

## Cleaning and Transforming the Test Data Set

The same process is performed on the test set.

In [71]:
# Importing the data set:
test = pd.read_csv('../Data/Raw/test.csv', index_col = 0)

In [72]:
# Checking for missing values:
print(test.info())

<class 'pandas.core.frame.DataFrame'>
Index: 25976 entries, 0 to 25975
Data columns (total 24 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   id                                 25976 non-null  int64  
 1   Gender                             25976 non-null  object 
 2   Customer Type                      25976 non-null  object 
 3   Age                                25976 non-null  int64  
 4   Type of Travel                     25976 non-null  object 
 5   Class                              25976 non-null  object 
 6   Flight Distance                    25976 non-null  int64  
 7   Inflight wifi service              25976 non-null  int64  
 8   Departure/Arrival time convenient  25976 non-null  int64  
 9   Ease of Online booking             25976 non-null  int64  
 10  Gate location                      25976 non-null  int64  
 11  Food and drink                     25976 non-null  int64  


There appear to be no missing values in this dataset either.

In [73]:
# Checking for duplicate entries:
print(test.duplicated().any())

False


There are no duplicate entries in the test set.

In [74]:
# Checking the unique column values:
len(test['id'].unique())

25976

In [75]:
# Changing the 'Gender' column to categorical:
test['Gender'] = test['Gender'].astype('category')

In [76]:
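# Formatting the 'Customer Type' column to be capitalised:
test['Customer Type'] = test['Customer Type'].str.capitalize()
# Converting to categorical, as was done for the train set:
test['Customer Type'] = test['Customer Type'].astype('category')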

In [77]:
# Validating the 'Age' column:
print('Minimum Age:', test['Age'].min())
print('Maximum Age:',test['Age'].max())

Minimum Age: 7
Maximum Age: 85


In [78]:
# Capitalising the 'Type of Travel' column:
test['Type of Travel'] = test['Type of Travel'].str.capitalize()
# Converting to categorical:
test['Type of Travel'] = test['Type of Travel'].astype('category')

In [79]:
# Replacing the categories to numerical values:
test['Class'] = test['Class'].str.replace('Eco', '1')
test['Class'] = test['Class'].str.replace('1 Plus', '2')
test['Class'] = test['Class'].str.replace('Business', '3')

# Converting to ordered categorical variable:
test['Class'] = test['Class'].astype('category')
test['Class'] = test['Class'].cat.set_categories(new_categories = ['1', '2', '3'], ordered = True)

In [80]:
# Validating data:
print('Minimum flight distance:', test['Flight Distance'].min())
print('Maximum flight distance:', test['Flight Distance'].max())

Minimum flight distance: 31
Maximum flight distance: 4983


In [81]:
# Converting to an ordered categorical variable:
test['Inflight wifi service'] = test['Inflight wifi service'].astype('category')
test['Inflight wifi service'] = test['Inflight wifi service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [82]:
# Converting to an ordered categorical variable:
test['Departure/Arrival time convenient'] = test['Departure/Arrival time convenient'].astype('category')
test['Departure/Arrival time convenient'] = test['Departure/Arrival time convenient'].cat.set_categories(new_categories = [1, 2, 3, 4, 5],
                                                                                                           ordered = True)

In [83]:
# Converting to an ordered categorical variable:
test['Ease of Online booking'] = test['Ease of Online booking'].astype('category')
test['Ease of Online booking'] = test['Ease of Online booking'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], 
                                                                                     ordered = True)

In [84]:
# Converting to an ordered categorical variable:
test['Gate location'] = test['Gate location'].astype('category')
test['Gate location'] = test['Gate location'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [85]:
# Converting to an ordered categorical variable:
test['Food and drink'] = test['Food and drink'].astype('category')
test['Food and drink'] = test['Food and drink'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [87]:
# Converting to an ordered categorical variable:
test['Online boarding'] = test['Online boarding'].astype('category')
test['Online boarding'] = test['Online boarding'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [88]:
# Converting to an ordered categorical variable:
test['Seat comfort'] = test['Seat comfort'].astype('category')
test['Seat comfort'] = test['Seat comfort'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [89]:
# Converting to an ordered categorical variable:
test['Inflight entertainment'] = test['Inflight entertainment'].astype('category')
test['Inflight entertainment'] = test['Inflight entertainment'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [90]:
# Converting to an ordered categorical variable:
test['On-board service'] = test['On-board service'].astype('category')
test['On-board service'] = test['On-board service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [91]:
# Converting to an ordered categorical variable:
test['Leg room service'] = test['Leg room service'].astype('category')
test['Leg room service'] = test['Leg room service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [92]:
# Converting to an ordered categorical variable:
test['Baggage handling'] = test['Baggage handling'].astype('category')
test['Baggage handling'] = test['Baggage handling'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [93]:
# Converting to an ordered categorical variable:
test['Checkin service'] = test['Checkin service'].astype('category')
test['Checkin service'] = test['Checkin service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [94]:
# Converting to an ordered categorical variable:
test['Inflight service'] = test['Inflight service'].astype('category')
test['Inflight service'] = test['Inflight service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [95]:
# Converting to an ordered categorical variable:
test['Cleanliness'] = test['Cleanliness'].astype('category')
test['Cleanliness'] = test['Cleanliness'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

In [96]:
# Validating the 'Departure Delay' data:
print('Minimum departure delay in minutes:', test['Departure Delay in Minutes'].min())
print('Maximum departure delay in minutes:', test['Departure Delay in Minutes'].max())

Minimum departure delay in minutes: 0
Maximum departure delay in minutes: 1128


In [97]:
# Validating the 'Arrival Delay' data:
print('Minimum arrival delay in minutes:', test['Arrival Delay in Minutes'].min())
print('Maximum arrival delay in minutes:', test['Arrival Delay in Minutes'].max())

Minimum arrival delay in minutes: 0.0
Maximum arrival delay in minutes: 1115.0


In [98]:
# Formatting 'Satisfaction' column name:
test = test.rename(columns = {'satisfaction': 'Satisfaction'})

# Converting to ordered categorical:
test['Satisfaction'] = test['Satisfaction'].astype('category')
test['Satisfaction'] = test['Satisfaction'].cat.set_categories(new_categories = ['satisfied', 'neutral or dissatisfied'], ordered = True)

In [99]:
# Formatting column names:
test = test.rename(columns = {'Inflight wifi service': 'Inflight Wifi Service', 'Departure/Arrival time convenient': 'Departure/Arrival Time Convenient',
                                'Ease of Online booking': 'Ease of Online Booking', 'Gate location': 'Gate Location', 
                                'Food and drink': 'Food and Drink', 'Online boarding': 'Online Boarding', 'Seat comfort': 'Seat Comfort',
                                'Inflight entertainment': 'Inflight Entertainment', 'On-board service': 'On-board Service', 'Leg room service': 'Leg Room Service',
                                'Baggage handling': 'Baggage Handling', 'Checkin service': 'Checkin Service', 'Inflight service': 'Inflight Service'})

In [100]:
# Establishing a threshold for the acceptable amount of missing values given the data set size:
threshold = len(test)*0.05
round(threshold)

1299

In [101]:
# Defining the variables that meet the threshold requirement:
columns_to_drop = test.columns[test.isna().sum() <= threshold]
columns_to_drop

Index(['id', 'Gender', 'Customer Type', 'Age', 'Type of Travel', 'Class',
       'Flight Distance', 'Inflight Wifi Service', 'Ease of Online Booking',
       'Gate Location', 'Food and Drink', 'Online Boarding', 'Seat Comfort',
       'Inflight Entertainment', 'On-board Service', 'Leg Room Service',
       'Baggage Handling', 'Checkin Service', 'Inflight Service',
       'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes',
       'Satisfaction'],
      dtype='object')

In [102]:
# Dropping the missing values of the respective columns:
test.dropna(subset = columns_to_drop, inplace = True)

# Checking the eventual data shape:
print(test.shape)

(24641, 24)


In [103]:
# Imputing the mode of 'Departure/Arrival Time Convenient':
test['Departure/Arrival Time Convenient'].mode()
test['Departure/Arrival Time Convenient'] = test['Departure/Arrival Time Convenient'].astype('string')
test['Departure/Arrival Time Convenient'] = test['Departure/Arrival Time Convenient'].fillna('4.0')

# Converting to categorical and changing/formatting the categories to ordinal:
test['Departure/Arrival Time Convenient'] = test['Departure/Arrival Time Convenient'].astype('category')
update_cats = {'4.0': '4', '2.0': '2', '5.0': '5', '3.0': '3', '1.0': '1'}
test['Departure/Arrival Time Convenient'] = test['Departure/Arrival Time Convenient'].replace(update_cats)
test['Departure/Arrival Time Convenient'] = test['Departure/Arrival Time Convenient'].cat.set_categories(new_categories = ['1', '2', '3', '4', '5'], ordered = True)



In [104]:
# Saving the first preprocessed dataset to csv:
# train.to_csv('../Data/Preprocessed_1/train_preprocessed_1.csv', index = False)
# test.to_csv('../Data/Preprocessed_1/test_preprocessed_1.csv', index = False)

In [105]:
# Importing the preprocessed data:
train = pd.read_csv('../Data/Preprocessed_1/train_preprocessed_1.csv')
test = pd.read_csv('../Data/Preprocessed_1/test_preprocessed_1.csv')
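One caveat when reloading from CSV: the format stores only values, so the ordered categorical dtypes set earlier are not preserved and the columns come back as plain integers or strings. The sketch below demonstrates this with an in-memory buffer instead of the real files; `rating_dtype`, `df`, `reloaded` and `restored` are hypothetical names.

```python
import io

import pandas as pd
from pandas.api.types import CategoricalDtype

rating_dtype = CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

# Made-up frame with one ordered rating column:
df = pd.DataFrame({'Seat Comfort': pd.Series([5, 1, 3]).astype(rating_dtype)})

# Write to an in-memory CSV and read it back:
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)
print(reloaded['Seat Comfort'].dtype)

# Restore the ordered dtype after loading:
restored = reloaded.astype({'Seat Comfort': rating_dtype})
print(restored['Seat Comfort'].dtype)
```

If the downstream notebooks rely on the ordering, re-applying the dtypes after `read_csv` (or saving in a format that keeps dtypes) would be one way to handle it.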


Both datasets are now ready. The next step is to perform an Exploratory Data Analysis on the train set. Please go to the `exploratory_data_analysis.ipynb` notebook. 