# Data Preparation

In [1522]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

This study is based on a survey of 103904 airline passengers. Little information is available on the background of the dataset or on how the data was collected. The dataset was downloaded from Kaggle and can be found at the following [link](https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/data). The main goal of the project is to determine which features are most strongly correlated with customer satisfaction and to build a model that predicts new customers' satisfaction levels. Since the response variable, `satisfaction`, is binary, this is a classification problem, which is a supervised learning task. There are 23 features in total, most of which are categorical. According to the creators of the dataset, it has already been preprocessed for classification. However, the dataset will be validated and reprocessed in this notebook.

## Cleaning and Transforming the Train Data Set

In [1523]:
train = pd.read_csv('../Data/Raw/train.csv', index_col = 0)
train.head()

Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied


In [1524]:
# Checking for missing values:
print(train.info())

<class 'pandas.core.frame.DataFrame'>
Index: 103904 entries, 0 to 103903
Data columns (total 24 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   id                                 103904 non-null  int64  
 1   Gender                             103904 non-null  object 
 2   Customer Type                      103904 non-null  object 
 3   Age                                103904 non-null  int64  
 4   Type of Travel                     103904 non-null  object 
 5   Class                              103904 non-null  object 
 6   Flight Distance                    103904 non-null  int64  
 7   Inflight wifi service              103904 non-null  int64  
 8   Departure/Arrival time convenient  103904 non-null  int64  
 9   Ease of Online booking             103904 non-null  int64  
 10  Gate location                      103904 non-null  int64  
 11  Food and drink                     103904 no

There appear to be no missing values in the dataset. However, according to the data description, a value of '0' in a categorical rating variable means 'not applicable', so these values will be treated as missing.
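
Because a rating of 0 stands for 'not applicable', counting the zeros per rating column before any conversion gives an early estimate of how many values will later become missing. The following is a minimal sketch on a toy frame; the names `toy` and `rating_cols` and the values are hypothetical and not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for the survey data (hypothetical values).
toy = pd.DataFrame({
    'Inflight wifi service': [3, 0, 5, 2],
    'Gate location': [1, 4, 0, 0],
})
rating_cols = ['Inflight wifi service', 'Gate location']

# Count the zeros ('not applicable') in each rating column.
zero_counts = (toy[rating_cols] == 0).sum()
print(zero_counts)
```

On the real data, the same expression could be applied to the fourteen rating columns of `train`.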

In [1525]:
# Checking for duplicate entries:
print(f'Data is duplicated:', train.duplicated().any())

Data is duplicated: False


In [1526]:
# Checking the shape of the dataset:
print(f'Data shape:', train.shape)

Data shape: (103904, 24)


There are no duplicate entries in the dataset. 

Each feature of the dataset will be checked, validated and formatted individually, below:

In [1527]:
# Checking whether the `id` column has only unique values:
print(f'Length of unique ID values:', len(train['id'].unique()))

Length of unique ID values: 103904


In [1528]:
# Unique column values:
train['Gender'].unique()
# Converting to categorical:
train['Gender'] = train['Gender'].astype('category')
train['Gender'].unique()

['Male', 'Female']
Categories (2, object): ['Female', 'Male']

In [1529]:
# Unique column values:
train['Customer Type'].unique()
# Formatting the column:
train['Customer Type'] = train['Customer Type'].str.capitalize()
# Converting to categorical:
train['Customer Type'] = train['Customer Type'].astype('category')
train['Customer Type'].unique()

['Loyal customer', 'Disloyal customer']
Categories (2, object): ['Disloyal customer', 'Loyal customer']

In [1530]:
# Unique column values:
train['Age'].unique()
# Validating the data:
print('Minimum Age:', train['Age'].min())
print('Maximum Age:', train['Age'].max())

Minimum Age: 7
Maximum Age: 85


In [1531]:
# Unique column values:
train['Type of Travel'].unique()
# Capitalising the info:
train['Type of Travel'] = train['Type of Travel'].str.capitalize()
# Converting to categorical:
train['Type of Travel'] = train['Type of Travel'].astype('category')
train['Type of Travel'].unique()

['Personal travel', 'Business travel']
Categories (2, object): ['Business travel', 'Personal travel']

In [1532]:
# Unique column values:
train['Class'].unique()
train['Class'] = train['Class'].str.replace('Eco', '1')
train['Class'] = train['Class'].str.replace('1 Plus', '2')
train['Class'] = train['Class'].str.replace('Business', '3')
# Converting to categorical:
train['Class'] = train['Class'].astype('category')
train['Class'] = train['Class'].cat.set_categories(new_categories = ['1', '2', '3'], ordered = True)
train['Class'].unique()

['2', '3', '1']
Categories (3, object): ['1' < '2' < '3']
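
The chained `str.replace` calls above depend on their order ('Eco Plus' first becomes '1 Plus', which is then turned into '2'). As an alternative sketch, not the approach used in this notebook, an explicit dictionary mapping produces the same encoding without that dependency; `travel_class` and `class_codes` are hypothetical names.

```python
import pandas as pd

# Toy column with the three class labels seen in the data.
travel_class = pd.Series(['Eco Plus', 'Business', 'Eco'])

# Explicit label-to-code mapping; it does not depend on replacement order.
class_codes = {'Eco': '1', 'Eco Plus': '2', 'Business': '3'}
encoded = travel_class.map(class_codes)

# Ordered categorical: '1' < '2' < '3'.
class_dtype = pd.CategoricalDtype(categories=['1', '2', '3'], ordered=True)
encoded = encoded.astype(class_dtype)
print(encoded.tolist())
```

A label missing from the dictionary would be mapped to NaN, which makes unexpected class names easy to spot.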

In [1533]:
# Unique column values:
train['Flight Distance'].unique()
# Validating the data:
print('Minimum Flight Distance:', train['Flight Distance'].min())
print('Maximum Flight Distance:', train['Flight Distance'].max())

Minimum Flight Distance: 31
Maximum Flight Distance: 4983


In [1534]:
# Unique column values:
train['Inflight wifi service'].unique()
# Converting to ordered categorical:
train['Inflight wifi service'] = train['Inflight wifi service'].astype('category')
train['Inflight wifi service'] = train['Inflight wifi service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Inflight wifi service'].unique()

[3, 2, 4, 1, 5, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
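
The `NaN` in the output above appears because `set_categories` replaces every value that is not in the new category list with a missing value, which is how the '0' ('not applicable') ratings are turned into NaNs. A minimal sketch on a toy Series (the values are hypothetical):

```python
import pandas as pd

# Toy ratings containing one 0 ('not applicable').
ratings = pd.Series([3, 0, 5, 1]).astype('category')

# Values outside the listed categories (here the 0) become NaN.
ratings = ratings.cat.set_categories([1, 2, 3, 4, 5], ordered=True)
print(ratings.isna().sum())
```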

In [1535]:
# Unique column values:
train['Departure/Arrival time convenient'].unique()

# Converting to ordered categorical:
train['Departure/Arrival time convenient'] = train['Departure/Arrival time convenient'].astype('category')

# Rearranging category order:
train['Departure/Arrival time convenient'] = train['Departure/Arrival time convenient'].cat.set_categories(new_categories = [1, 2, 3, 4, 5],
                                                                                                           ordered = True)
train['Departure/Arrival time convenient'].unique()


[4, 2, 5, 3, 1, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1536]:
# Unique column values:
train['Ease of Online booking'].unique()
# Converting to ordered categorical:
train['Ease of Online booking'] = train['Ease of Online booking'].astype('category')
train['Ease of Online booking'] = train['Ease of Online booking'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], 
                                                                                     ordered = True)
train['Ease of Online booking'].unique()

[3, 2, 5, 4, 1, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1537]:
# Unique column values:
train['Gate location'].unique()
# Converting to ordered categorical:
train['Gate location'] = train['Gate location'].astype('category')
train['Gate location'] = train['Gate location'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Gate location'].unique()

[1, 3, 2, 5, 4, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1538]:
# Unique column values:
train['Food and drink'].unique()
# Converting to ordered categorical:
train['Food and drink'] = train['Food and drink'].astype('category')
train['Food and drink'] = train['Food and drink'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Food and drink'].unique()

[5, 1, 2, 4, 3, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1539]:
# Unique column values:
train['Online boarding'].unique()
# Converting to ordered categorical:
train['Online boarding'] = train['Online boarding'].astype('category')
train['Online boarding'] = train['Online boarding'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Online boarding'].unique()

[3, 5, 2, 1, 4, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1540]:
# Unique column values:
train['Seat comfort'].unique()
# Converting to ordered categorical:
train['Seat comfort'] = train['Seat comfort'].astype('category')
train['Seat comfort'] = train['Seat comfort'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Seat comfort'].unique()


[5, 1, 2, 3, 4, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1541]:
# Unique column values:
train['Inflight entertainment'].unique()
# Converting to ordered categorical:
train['Inflight entertainment'] = train['Inflight entertainment'].astype('category')
train['Inflight entertainment'] = train['Inflight entertainment'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Inflight entertainment'].unique()

[5, 1, 2, 3, 4, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1542]:
# Unique column values:
train['On-board service'].unique()
# Converting to ordered categorical:
train['On-board service'] = train['On-board service'].astype('category')
train['On-board service'] = train['On-board service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['On-board service'].unique()


[4, 1, 2, 3, 5, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1543]:
# Unique column values:
train['Leg room service'].unique()
# Converting to ordered categorical:
train['Leg room service'] = train['Leg room service'].astype('category')
train['Leg room service'] = train['Leg room service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Leg room service'].unique()

[3, 5, 4, 2, 1, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1544]:
# Unique column values:
train['Baggage handling'].unique()
# Converting to ordered categorical:
train['Baggage handling'] = train['Baggage handling'].astype('category')
train['Baggage handling'] = train['Baggage handling'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Baggage handling'].unique()


[4, 3, 5, 1, 2]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1545]:
# Unique column values:
train['Checkin service'].unique()
# Converting to ordered categorical:
train['Checkin service'] = train['Checkin service'].astype('category')
train['Checkin service'] = train['Checkin service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Checkin service'].unique()

[4, 1, 3, 5, 2, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1546]:
# Unique column values:
train['Inflight service'].unique()
# Converting to ordered categorical:
train['Inflight service'] = train['Inflight service'].astype('category')
train['Inflight service'] = train['Inflight service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Inflight service'].unique()

[5, 4, 3, 1, 2, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1547]:
# Unique column values:
train['Cleanliness'].unique()
# Converting to ordered categorical:
train['Cleanliness'] = train['Cleanliness'].astype('category')
train['Cleanliness'] = train['Cleanliness'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)
train['Cleanliness'].unique()

[5, 1, 2, 3, 4, NaN]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]

In [1548]:
# Unique column values:
print(train['Departure Delay in Minutes'].unique()[0:100])
# Checking column dtype:
print('Dtype:', train['Departure Delay in Minutes'].dtype)
# Validating data:
print('Minimum Departure Delay in Minutes:', train['Departure Delay in Minutes'].min())
print('Maximum Departure Delay in Minutes:',train['Departure Delay in Minutes'].max())

[ 25   1   0  11   9   4  28  43  49   7  17  52  54  27  18  19   3 109
  23   8  14  10  51  39  13  30  64  20  45  44  31  81  35  67  22  40
  91  21  15  29 105  12 162  24 141   6  34   2  97  16  99  37  66  53
  36 209  60 149  26   5  59  62 113  38  83 176  46  73 199  56  93  70
  80  96  57  95  74 172  63 175 143  48  47 101 118  76 220  33  55 232
 170 173 124 112  94 243 128  92 270  90]
Dtype: int64
Minimum Departure Delay in Minutes: 0
Maximum Departure Delay in Minutes: 1592


In [1549]:
# Unique column values:
print(train['Arrival Delay in Minutes'].unique()[0:100])
# Checking column dtype:
print('Dtype:', train['Arrival Delay in Minutes'].dtype)
# Validating data:
print('Minimum Arrival Delay in Minutes:', train['Arrival Delay in Minutes'].min())
print('Maximum Arrival Delay in Minutes:', train['Arrival Delay in Minutes'].max())

[ 18.   6.   0.   9.  23.   8.  35.  51.  10.   5.   4.  29.  44.  28.
  12. 120.  24.   1.  20.  31.  15.  48.  26.  49.   2.  37.  50.   3.
  19.  72.  11.  34.  62.  27.  52.  13.  82.  30.  16.   7. 122. 179.
 125.  17.  nan  89. 101.  14.  61.  32.  33.  41. 191. 138.  53.  22.
  57.  65.  76. 107.  92. 164.  21.  40.  55. 185.  63.  77.  86.  91.
 100.  54.  36.  70. 139.  67. 163. 128. 180.  93. 121.  45. 105. 126.
  56.  73. 212.  88. 241. 172. 175. 111.  99.  25.  42. 226.  46. 131.
 260.  69.]
Dtype: float64
Minimum Arrival Delay in Minutes: 0.0
Maximum Arrival Delay in Minutes: 1584.0


In [1550]:
# Unique column values:
print(train['satisfaction'].unique())

# Formatting column name:
train = train.rename(columns = {'satisfaction': 'Satisfaction'})
train['Satisfaction'].unique()

# Converting to ordered categorical:
train['Satisfaction'] = train['Satisfaction'].astype('category')
train['Satisfaction'] = train['Satisfaction'].cat.set_categories(new_categories = ['satisfied', 'neutral or dissatisfied'], ordered = True)

['neutral or dissatisfied' 'satisfied']


In [1551]:
# Formatting column names:
train = train.rename(columns = {'Inflight wifi service': 'Inflight Wifi Service', 'Departure/Arrival time convenient': 'Departure/Arrival Time Convenient',
                                'Ease of Online booking': 'Ease of Online Booking', 'Gate location': 'Gate Location', 
                                'Food and drink': 'Food and Drink', 'Online boarding': 'Online Boarding', 'Seat comfort': 'Seat Comfort',
                                'Inflight entertainment': 'Inflight Entertainment', 'On-board service': 'On-board Service', 'Leg room service': 'Leg Room Service',
                                'Baggage handling': 'Baggage Handling', 'Checkin service': 'Checkin Service', 'Inflight service': 'Inflight Service'})


Note that restricting the ordered categories to the values 1 to 5 turned every '0' ('not applicable') rating into NaN, which created a large number of missing values. These are analysed further below:

In [1552]:
# Checking for missing values:
train.isna().sum()

id                                      0
Gender                                  0
Customer Type                           0
Age                                     0
Type of Travel                          0
Class                                   0
Flight Distance                         0
Inflight Wifi Service                3103
Departure/Arrival Time Convenient    5300
Ease of Online Booking               4487
Gate Location                           1
Food and Drink                        107
Online Boarding                      2428
Seat Comfort                            1
Inflight Entertainment                 14
On-board Service                        3
Leg Room Service                      472
Baggage Handling                        0
Checkin Service                         1
Inflight Service                        3
Cleanliness                            12
Departure Delay in Minutes              0
Arrival Delay in Minutes              310
Satisfaction                      

Rows containing NaNs will be dropped for variables in which at most 5% of the values are missing:

In [1553]:
# Establishing a threshold in accordance with the dataset size for the deletion of NaNs:
threshold = len(train)*0.05
print(round(threshold))

5195


In [1554]:
# Identifying the columns for which NaNs can be dropped:
columns_to_drop = train.columns[train.isna().sum() <= threshold]
print(columns_to_drop)

Index(['id', 'Gender', 'Customer Type', 'Age', 'Type of Travel', 'Class',
       'Flight Distance', 'Inflight Wifi Service', 'Ease of Online Booking',
       'Gate Location', 'Food and Drink', 'Online Boarding', 'Seat Comfort',
       'Inflight Entertainment', 'On-board Service', 'Leg Room Service',
       'Baggage Handling', 'Checkin Service', 'Inflight Service',
       'Cleanliness', 'Departure Delay in Minutes', 'Arrival Delay in Minutes',
       'Satisfaction'],
      dtype='object')


For most variables, the rows with missing values can simply be dropped. Many sophisticated methods exist for missing value imputation. However, a very basic approach is taken here: for the remaining variables, missing values are imputed with the feature's mode.

In [1555]:
# Identifying the columns for which NaNs should be imputed:
cols_to_impute = train.columns[train.isna().sum() > threshold]
print(cols_to_impute)

Index(['Departure/Arrival Time Convenient'], dtype='object')


In [1556]:
# Iterate through each column to impute NaNs with the mode:
for col in cols_to_impute:
    mode_value = train[col].mode()
    if not mode_value.empty:
        train[col] = train[col].fillna(mode_value.iloc[0])
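
The two branches described above (dropping rows for sparsely missing columns, imputing the mode for the rest) could fit together as in the following sketch. It runs on a toy frame and uses a hypothetical 20% cut-off, chosen only so that the small example reaches both branches; the names `toy`, `cols_to_dropna` and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

# Toy frame: 'a' has 1 NaN in 10 rows (10%), 'b' has 4 NaNs (40%).
toy = pd.DataFrame({
    'a': [1, 2, np.nan, 4, 5, 6, 7, 8, 9, 10],
    'b': [1, 1, 2, np.nan, np.nan, np.nan, np.nan, 1, 2, 1],
})
threshold = len(toy) * 0.2  # hypothetical cut-off for this toy example

na_counts = toy.isna().sum()
cols_to_dropna = na_counts[(na_counts > 0) & (na_counts <= threshold)].index
cols_to_impute = na_counts[na_counts > threshold].index

# Drop the rows with NaNs in the sparsely missing columns.
cleaned = toy.dropna(subset=list(cols_to_dropna))

# Impute the remaining NaNs with each column's mode.
fill_values = {col: cleaned[col].mode().iloc[0] for col in cols_to_impute}
cleaned = cleaned.fillna(fill_values)
print(cleaned.isna().sum())
```

Dropping before imputing means the mode is computed only on the rows that are kept; the opposite order is equally possible and is a design choice.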

## Function for Data Preprocessing

Since the preprocessing above involved many steps, a function is defined to streamline the process for the test data set and for any further data that becomes available.

In [1565]:
def data_preprocessing(data):

    # Checking the shape of the dataset:
    print(f'Data shape:', data.shape)

    # Checking for duplicate entries:
    print(f'\nData is duplicated:', data.duplicated().any())

    # Checking whether the `id` column has only unique values:
    print(f'\nLength of unique ID values:', len(data['id'].unique()))

    # Convert 'Gender' to categorical:
    data['Gender'] = data['Gender'].astype('category')

    # Format 'Customer Type' column:
    data['Customer Type'] = data['Customer Type'].str.capitalize()
    data['Customer Type'] = data['Customer Type'].astype('category')

    # Validating the 'Age' column:
    print('\nMinimum Age:', data['Age'].min())
    print('Maximum Age:', data['Age'].max())

    # Format the 'Type of Travel' column:
    data['Type of Travel'] = data['Type of Travel'].str.capitalize()
    data['Type of Travel'] = data['Type of Travel'].astype('category')

    # Format the 'Class' column:
    data['Class'] = data['Class'].str.replace('Eco', '1')
    data['Class'] = data['Class'].str.replace('1 Plus', '2')
    data['Class'] = data['Class'].str.replace('Business', '3')
    data['Class'] = data['Class'].astype('category')
    data['Class'] = data['Class'].cat.set_categories(new_categories = ['1', '2', '3'], ordered = True)

    # Validate the 'Flight Distance' column:
    print('\nMinimum Flight Distance:', data['Flight Distance'].min())
    print('Maximum Flight Distance:', data['Flight Distance'].max())

    # Format the 'Inflight wifi service' column:
    data['Inflight wifi service'] = data['Inflight wifi service'].astype('category')
    data['Inflight wifi service'] = data['Inflight wifi service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Format the 'Departure/Arrival time convenient' column:
    data['Departure/Arrival time convenient'] = data['Departure/Arrival time convenient'].astype('category')
    data['Departure/Arrival time convenient'] = data['Departure/Arrival time convenient'].cat.set_categories(new_categories = [1, 2, 3, 4, 5],
                                                                                                           ordered = True)
    
    # Format the 'Ease of Online booking' column:
    data['Ease of Online booking'] = data['Ease of Online booking'].astype('category')
    data['Ease of Online booking'] = data['Ease of Online booking'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], 
                                                                                     ordered = True)
    
    # Format the 'Gate location' column:
    data['Gate location'] = data['Gate location'].astype('category')
    data['Gate location'] = data['Gate location'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Format the 'Food and drink' column:
    data['Food and drink'] = data['Food and drink'].astype('category')
    data['Food and drink'] = data['Food and drink'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Format the 'Online boarding' column:
    data['Online boarding'] = data['Online boarding'].astype('category')
    data['Online boarding'] = data['Online boarding'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Format the 'Seat comfort' column:
    data['Seat comfort'] = data['Seat comfort'].astype('category')
    data['Seat comfort'] = data['Seat comfort'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Format the 'Inflight entertainment' column:
    data['Inflight entertainment'] = data['Inflight entertainment'].astype('category')
    data['Inflight entertainment'] = data['Inflight entertainment'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Format the 'On-board service' column:
    data['On-board service'] = data['On-board service'].astype('category')
    data['On-board service'] = data['On-board service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Format the 'Leg room service' column:
    data['Leg room service'] = data['Leg room service'].astype('category')
    data['Leg room service'] = data['Leg room service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Format the 'Baggage handling' column:
    data['Baggage handling'] = data['Baggage handling'].astype('category')
    data['Baggage handling'] = data['Baggage handling'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Format the 'Checkin service' column:
    data['Checkin service'] = data['Checkin service'].astype('category')
    data['Checkin service'] = data['Checkin service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Format the 'Inflight service' column:
    data['Inflight service'] = data['Inflight service'].astype('category')
    data['Inflight service'] = data['Inflight service'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Format the 'Cleanliness' column:
    data['Cleanliness'] = data['Cleanliness'].astype('category')
    data['Cleanliness'] = data['Cleanliness'].cat.set_categories(new_categories = [1, 2, 3, 4, 5], ordered = True)

    # Validate the 'Departure Delay in Minutes' column:
    print('\nMinimum Departure Delay in Minutes:', data['Departure Delay in Minutes'].min())
    print('Maximum Departure Delay in Minutes:',data['Departure Delay in Minutes'].max())

    # Validate the 'Arrival Delay in Minutes' column:
    print('\nMinimum Arrival Delay in Minutes:', data['Arrival Delay in Minutes'].min())
    print('Maximum Arrival Delay in Minutes:', data['Arrival Delay in Minutes'].max())

    # Format the 'satisfaction' column:
    data = data.rename(columns = {'satisfaction': 'Satisfaction'})
    data['Satisfaction'] = data['Satisfaction'].astype('category')
    data['Satisfaction'] = data['Satisfaction'].cat.set_categories(new_categories = ['satisfied', 'neutral or dissatisfied'], ordered = True)

    # Format other column name:
    data = data.rename(columns = {'Inflight wifi service': 'Inflight Wifi Service', 'Departure/Arrival time convenient': 'Departure/Arrival Time Convenient',
                                'Ease of Online booking': 'Ease of Online Booking', 'Gate location': 'Gate Location', 
                                'Food and drink': 'Food and Drink', 'Online boarding': 'Online Boarding', 'Seat comfort': 'Seat Comfort',
                                'Inflight entertainment': 'Inflight Entertainment', 'On-board service': 'On-board Service', 'Leg room service': 'Leg Room Service',
                                'Baggage handling': 'Baggage Handling', 'Checkin service': 'Checkin Service', 'Inflight service': 'Inflight Service'})
    
    def resolve_nans(df):
        threshold = len(df) * 0.05

        # Columns with few NaNs (at or below the threshold): their NaN rows are dropped
        columns_to_drop = df.columns[df.isna().sum() <= threshold]

        # Columns with many NaNs (above the threshold): their NaNs are imputed
        cols_to_impute = df.columns[df.isna().sum() > threshold]

        # Drop the rows (not the columns) with NaNs in the low-missingness columns
        df = df.dropna(subset=list(columns_to_drop))

        # Impute the remaining NaNs in each high-missingness column with its mode:
        for col in cols_to_impute:
            mode_value = df[col].mode()
            if not mode_value.empty:  # Check if the mode exists
                df = df.fillna({col: mode_value.iloc[0]})

        return df
    
    # Apply the NaN handling before returning (the helper was previously never called):
    return resolve_nans(data)
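
The function above repeats the same two-line conversion for each of the fourteen rating columns. As a sketch of one possible simplification, the conversion could be written once and applied in a loop. The names `RATING_COLS`, `RATING_DTYPE` and `convert_ratings` are hypothetical, and only two columns are listed to keep the example short.

```python
import pandas as pd

# Hypothetical subset of the 1-5 rating columns.
RATING_COLS = ['Seat comfort', 'Cleanliness']
RATING_DTYPE = pd.CategoricalDtype(categories=[1, 2, 3, 4, 5], ordered=True)

def convert_ratings(df, cols=RATING_COLS):
    """Return a copy of df with the given columns as ordered 1-5 categoricals.

    Values outside 1-5 (such as 0, 'not applicable') become NaN.
    """
    out = df.copy()
    for col in cols:
        out[col] = out[col].astype(RATING_DTYPE)
    return out

toy = pd.DataFrame({'Seat comfort': [5, 0, 2], 'Cleanliness': [1, 3, 4]})
converted = convert_ratings(toy)
print(converted.dtypes)
```

Working on a copy leaves the caller's frame unchanged, which differs from the function above, where the column assignments modify the input frame.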




## Cleaning and Transforming the Test Data Set

The same process is performed on the test set.

In [1566]:
# Importing the data set:
test = pd.read_csv('../Data/Raw/test.csv', index_col = 0)

In [1567]:
# Using the created function to process the test set:
test = data_preprocessing(test)

Data shape: (25976, 24)

Data is duplicated: False

Length of unique ID values: 25976

Minimum Age: 7
Maximum Age: 85

Minimum Flight Distance: 31
Maximum Flight Distance: 4983

Minimum Departure Delay in Minutes: 0
Maximum Departure Delay in Minutes: 1128

Minimum Arrival Delay in Minutes: 0.0
Maximum Arrival Delay in Minutes: 1115.0


In [1568]:
# Saving the first preprocessed dataset to pickle to preserve data type information:
# train.to_pickle('../Data/Preprocessed_1/train_preprocessed_1.pkl')
# test.to_pickle('../Data/Preprocessed_1/test_preprocessed_1.pkl')

In [1569]:
# Importing the preprocessed data:
train = pd.read_pickle('../Data/Preprocessed_1/train_preprocessed_1.pkl')
test = pd.read_pickle('../Data/Preprocessed_1/test_preprocessed_1.pkl')
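
The reason for saving to pickle rather than CSV can be illustrated with a small round trip: a CSV file stores only text, so an ordered categorical column comes back with an inferred dtype, while a pickle restores it unchanged. This is a minimal sketch on a toy frame written to a temporary directory; the file names are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Toy frame with an ordered categorical column, as produced by the preprocessing.
toy = pd.DataFrame({
    'Class': pd.Categorical(['1', '3', '2'], categories=['1', '2', '3'], ordered=True)
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'toy.csv')
    pkl_path = os.path.join(tmp, 'toy.pkl')
    toy.to_csv(csv_path, index=False)
    toy.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print('CSV round trip dtype:   ', from_csv['Class'].dtype)
print('Pickle round trip dtype:', from_pkl['Class'].dtype)
```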


Both datasets are now ready. The next step is to perform an Exploratory Data Analysis on the train set. Please go to the `exploratory_data_analysis.ipynb` notebook. 