# ***Causal Learning Tutorial - Airline Passenger Satisfaction Classification***


'''
Author:
        
        KIM, JoengYoong, jeongyoong@ccnets.org
        
    COPYRIGHT (c) 2024. CCNets. All Rights reserved.
'''

<p align="center">
    <img src="https://www.travelandleisure.com/thmb/h97kSvljd2QYH2nUy3Y9ZNgO_pw=/1500x0/filters:no_upscale():max_bytes(150000):strip_icc()/plane-data-BUSYROUTES1217-f4f84b08d47f4951b11c148cee2c3dea.jpg" width=600>
</p>

<br>

Data Source: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/data

<hr>

CCNet result: https://wandb.ai/ccnets/causal-learning

<blockquote>
Accuracy: <mark>0.95</mark>

</blockquote>

Benchmark: https://www.kaggle.com/datasets/teejmahal20/airline-passenger-satisfaction/code?datasetId=522275&sortBy=voteCount
<blockquote>
Accuracy:

- Random Forest: 0.96
- LightGBM: 0.96
- Catboost: 0.96
- XGBoost: 0.94

</blockquote>






## 1. Load Libraries

In [1]:
import os
import sys
import warnings
warnings.filterwarnings("ignore")

path_append = "../../"
sys.path.append(path_append)  # Add the directory two levels up to the import path.

In [2]:
import torch
import pandas as pd

## 2. Load Dataset

In [3]:
df_train = pd.read_csv(path_append + '../data/Airline Customer Satisfaction/train.csv')
df_test = pd.read_csv(path_append + '../data/Airline Customer Satisfaction/test.csv')
df_train.head()
df_test.head()

|  | Unnamed: 0 | id | Gender | Customer Type | Age | Type of Travel | Class | Flight Distance | Inflight wifi service | Departure/Arrival time convenient | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Departure Delay in Minutes | Arrival Delay in Minutes | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 19556 | Female | Loyal Customer | 52 | Business travel | Eco | 160 | 5 | 4 | ... | 5 | 5 | 5 | 5 | 2 | 5 | 5 | 50 | 44.0 | satisfied |
| 1 | 1 | 90035 | Female | Loyal Customer | 36 | Business travel | Business | 2863 | 1 | 1 | ... | 4 | 4 | 4 | 4 | 3 | 4 | 5 | 0 | 0.0 | satisfied |
| 2 | 2 | 12360 | Male | disloyal Customer | 20 | Business travel | Eco | 192 | 2 | 0 | ... | 2 | 4 | 1 | 3 | 2 | 2 | 2 | 0 | 0.0 | neutral or dissatisfied |
| 3 | 3 | 77959 | Male | Loyal Customer | 44 | Business travel | Business | 3377 | 0 | 0 | ... | 1 | 1 | 1 | 1 | 3 | 1 | 4 | 0 | 6.0 | satisfied |
| 4 | 4 | 36875 | Female | Loyal Customer | 49 | Business travel | Eco | 1182 | 2 | 3 | ... | 2 | 2 | 2 | 2 | 4 | 2 | 4 | 0 | 20.0 | satisfied |


In [4]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         103904 non-null  int64  
 1   id                                 103904 non-null  int64  
 2   Gender                             103904 non-null  object 
 3   Customer Type                      103904 non-null  object 
 4   Age                                103904 non-null  int64  
 5   Type of Travel                     103904 non-null  object 
 6   Class                              103904 non-null  object 
 7   Flight Distance                    103904 non-null  int64  
 8   Inflight wifi service              103904 non-null  int64  
 9   Departure/Arrival time convenient  103904 non-null  int64  
 10  Ease of Online booking             103904 non-null  int64  
 11  Gate location                      1039

In [5]:
# Count the samples per label
df_train[['satisfaction']].value_counts()

satisfaction           
neutral or dissatisfied    58879
satisfied                  45025
Name: count, dtype: int64
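
The label counts above show a moderate class imbalance. As a rough reference point for the accuracy figures reported later, the sketch below computes the class shares and the accuracy of a trivial classifier that always predicts the most frequent label. The counts are copied from the output above; the variable names are illustrative only.

```python
# Label counts copied from the value_counts() output above.
counts = {"neutral or dissatisfied": 58879, "satisfied": 45025}

total = sum(counts.values())
proportions = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent label.
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

for label, share in proportions.items():
    print(f"{label}: {share:.4f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

Any trained model should clear this baseline (roughly 57%) by a wide margin before its accuracy is considered meaningful.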

## 3. Preprocessing

In [6]:
from preprocessing.data_frame import auto_preprocess_dataframe

target_columns = ['satisfaction']
drop_columns = ['Unnamed: 0','id']

# Record the original lengths so the combined dataframe can be split again later
df_train_length = len(df_train)
df_test_length = len(df_test)

# Concatenate the dataframes
df = pd.concat([df_train, df_test], axis=0)

# Process the combined dataframe
df, description_train = auto_preprocess_dataframe(df, target_columns, drop_columns=drop_columns)

# Split the dataframe back into training and test sets
df_train = df.iloc[:df_train_length]
df_test = df.iloc[df_train_length:]

num_features = description_train['num_features']
num_classes = description_train['num_classes']

print(f"Number of features after scaling: {num_features}")
print(f"Number of classes after scaling: {num_classes}")
description_train

Dropped columns: Unnamed: 0, id
Filled NaN values in column 'Arrival Delay in Minutes' with random values.
Column 'Class' has 3 unique values.
Column 'Customer Type' has 2 unique values.
Column 'Gender' has 2 unique values.
Column 'Type of Travel' has 2 unique values.
Column 'satisfaction' has 2 unique values.


|  | Min | Max | Mean | Std | Null Count | Scaled | Encoded |
|---|---|---|---|---|---|---|---|
| Age | -2.577411 | 3.622118 | -1.066252e-16 | 1.201704 | 0 | Minmax |  |
| Flight Distance | -0.611278 | 3.11203 | 0.2603883 | 0.749964 | 0 | Robust |  |
| Inflight wifi service | -2.728696 | 2.271304 | 7.297998000000001e-17 | 1.32934 | 0 |  |  |
| Departure/Arrival time convenient | -3.057599 | 1.942401 | -3.632587e-17 | 1.526741 | 0 |  |  |
| Ease of Online booking | -2.756876 | 2.243124 | -3.015485e-16 | 1.40174 | 0 |  |  |
| Gate location | -2.976925 | 2.023075 | -1.912579e-16 | 1.27852 | 0 |  |  |
| Food and drink | -3.204774 | 1.795226 | 8.643806e-18 | 1.329933 | 0 |  |  |
| Online boarding | -3.252633 | 1.747367 | 7.730189000000001e-17 | 1.350719 | 0 |  |  |
| Seat comfort | -3.441361 | 1.558639 | -4.814272e-17 | 1.319289 | 0 |  |  |
| Inflight entertainment | -3.358077 | 1.641923 | 5.886541000000001e-17 | 1.334049 | 0 |  |  |


Number of features after scaling: 24
Number of classes after scaling: 2


{'num_features': 24,
 'num_classes': 2,
 'encoded_columns': Index(['Class', 'Customer Type', 'Gender', 'Type of Travel',
        'ccnets_processed_satisfaction'],
       dtype='object'),
 'one_hot_encoded_columns': Index(['Customer Type', 'Gender', 'Type of Travel', 'Class'], dtype='object'),
 'encoded_datatime_columns': Index([], dtype='object'),
 'scalers': {'Age': 'minmax',
  'Arrival Delay in Minutes': 'robust',
  'Baggage handling': 'none',
  'Checkin service': 'none',
  'Cleanliness': 'none',
  'Departure Delay in Minutes': 'robust',
  'Departure/Arrival time convenient': 'none',
  'Ease of Online booking': 'none',
  'Flight Distance': 'robust',
  'Food and drink': 'none',
  'Gate location': 'none',
  'Inflight entertainment': 'none',
  'Inflight service': 'none',
  'Inflight wifi service': 'none',
  'Leg room service': 'none',
  'On-board service': 'none',
  'Online boarding': 'none',
  'Seat comfort': 'none'}}
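
One way to account for the 24 features reported above is sketched below. The numeric column count and the unique-value counts come from the preprocessing output; the encoding rule (a binary column becomes a single column, a three-valued column becomes three one-hot columns) is an assumption inferred from the totals, not something confirmed by `auto_preprocess_dataframe` itself.

```python
# Figures taken from the preprocessing output above.
numeric_columns = 18  # number of entries in description_train['scalers']
categorical_unique_values = {
    "Class": 3,
    "Customer Type": 2,
    "Gender": 2,
    "Type of Travel": 2,
}

def encoded_width(unique_values: int) -> int:
    # ASSUMPTION (not confirmed by the library): a binary column is kept as a
    # single 0/1 column, while a column with k > 2 values expands to k columns.
    return 1 if unique_values == 2 else unique_values

encoded_columns = sum(encoded_width(k) for k in categorical_unique_values.values())
total_features = numeric_columns + encoded_columns
print(total_features)  # 24 under the assumption above
```

Together with the single target column, this also matches the 25 columns listed by `df_train.info()` after preprocessing.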

In [7]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Age                                103904 non-null  float64
 1   Flight Distance                    103904 non-null  float64
 2   Inflight wifi service              103904 non-null  float64
 3   Departure/Arrival time convenient  103904 non-null  float64
 4   Ease of Online booking             103904 non-null  float64
 5   Gate location                      103904 non-null  float64
 6   Food and drink                     103904 non-null  float64
 7   Online boarding                    103904 non-null  float64
 8   Seat comfort                       103904 non-null  float64
 9   Inflight entertainment             103904 non-null  float64
 10  On-board service                   103904 non-null  float64
 11  Leg room service                   103904 no

In [8]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
Index: 25976 entries, 0 to 25975
Data columns (total 25 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Age                                25976 non-null  float64
 1   Flight Distance                    25976 non-null  float64
 2   Inflight wifi service              25976 non-null  float64
 3   Departure/Arrival time convenient  25976 non-null  float64
 4   Ease of Online booking             25976 non-null  float64
 5   Gate location                      25976 non-null  float64
 6   Food and drink                     25976 non-null  float64
 7   Online boarding                    25976 non-null  float64
 8   Seat comfort                       25976 non-null  float64
 9   Inflight entertainment             25976 non-null  float64
 10  On-board service                   25976 non-null  float64
 11  Leg room service                   25976 non-null  float64


In [9]:
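# Custom Dataset class wrapping the preprocessed feature and label arrays
class AirlineDataset(torch.utils.data.Dataset):
    def __init__(self, x, y):
        self.x = x  # feature array, shape (num_samples, num_features)
        self.y = y  # label array, shape (num_samples, 1)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, index):
        # Convert one row to tensors: float features and an integer label
        vals = torch.tensor(self.x[index], dtype=torch.float32)
        label = torch.tensor(self.y[index], dtype=torch.long)
        return vals, label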

In [10]:
from sklearn.model_selection import train_test_split

# The last column holds the encoded target; all other columns are features
X_train, y_train = df_train.iloc[:, :-1].values, df_train.iloc[:, -1:].values
X_test, y_test = df_test.iloc[:, :-1].values, df_test.iloc[:, -1:].values

# Split the training data into train and eval sets with 80:20 ratio
X_train_split, X_eval, y_train_split, y_eval = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# Create datasets
trainset = AirlineDataset(X_train_split, y_train_split)
evalset = AirlineDataset(X_eval, y_eval)
testset = AirlineDataset(X_test, y_test)

print(f"Labeled Trainset Shape: {len(trainset)}, {trainset.x.shape[1]}")
print(f"Labeled Evalset Shape: {len(evalset)}, {evalset.x.shape[1]}")
print(f"Labeled Testset Shape: {len(testset)}, {testset.x.shape[1]}")


Labeled Trainset Shape: 83123, 24
Labeled Evalset Shape: 20781, 24
Labeled Testset Shape: 25976, 24
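
The split sizes printed above can be checked by hand. The sketch below reproduces them from the row count and the `test_size` fraction, and also derives the 1298 iterations per epoch that appear in the training progress bar. The batch size of 64 is taken from the TrainConfig table printed during training; dropping the last incomplete batch is an assumption that happens to match the logged count.

```python
import math

n_samples = 103904  # rows in df_train
test_size = 0.2     # fraction passed to train_test_split
batch_size = 64     # from the TrainConfig table printed during training

# scikit-learn rounds the held-out size up and gives the remainder to training.
n_eval = math.ceil(n_samples * test_size)
n_train = n_samples - n_eval

# ASSUMPTION: the trainer drops the final incomplete batch (floor division).
iterations_per_epoch = n_train // batch_size

print(n_train, n_eval, iterations_per_epoch)
```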


In [11]:
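num_features = trainset.x.shape[1]
# Note: this is the width of the label array (1 for a single binary target),
# not the number of distinct classes (2) reported by the preprocessing step.
num_classes = trainset.y.shape[1]

num_features, num_classes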

(24, 1)

## 4. Training

In [12]:
from tools.config.data_config import DataConfig
from tools.config.ml_config import MLConfig
from causal_learning import CausalLearning

# Set the data configuration
data_config = DataConfig(dataset_name = 'airline_satisfaction', task_type='binary_classification', obs_shape=[num_features], label_size=num_classes)

# Set the ML parameters
ml_config = MLConfig(model_name = 'tabnet')
ml_config.training.error_function = 'mae'
ml_config.training.num_epoch = 5
ml_config.model.num_layers = 4

# Set the device to GPU if available, else CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

# Initialize CausalLearning with the ML and data configurations, the device, and the use_print / use_wandb logging flags
causal_learning = CausalLearning(ml_config, data_config, device, use_print=True, use_wandb=False)

In [13]:
causal_learning.train(trainset, evalset)

Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.
wandb: Currently logged in as: junhopark (ccnets). Use `wandb login --relogin` to force relogin

wandb: Adding directory to artifact (.\..\saved\airline_satisfaction\causal-learning)... Done. 0.0s


Trainer Name: causal_trainer

ModelConfig Parameters:

| d_model | dropout | model_name | num_layers | use_seq_input |
|---|---|---|---|---|
| 256 | 0.05 | tabnet | 4 | False |

TrainConfig Parameters:

| batch_size | error_function | max_seq_len | min_seq_len | num_epoch |
|---|---|---|---|---|
| 64 | mae |  |  | 5 |

OptimConfig Parameters:

| clip_grad_range | decay_rate_100k | learning_rate | max_grad_norm | scheduler_type |
|---|---|---|---|---|
|  | 0.05 | 0.001 | 1.0 | exponential |

DataConfig Parameters:

| dataset_name | task_type | obs_shape | label_size | explain_size | show_image_indices |
|---|---|---|---|---|---|
| airline_satisfaction | binary_classification | [24] | 1 |  |  |
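
The OptimConfig values (`learning_rate` 0.001, `decay_rate_100k` 0.05, `scheduler_type` exponential) can be related to the "Unified LR" lines in the training log. The sketch below assumes the rate is multiplied by `decay_rate_100k` once per 100,000 optimizer steps; this formula is inferred from the logged numbers, not taken from the library source, and `lr_at` is a hypothetical helper name.

```python
# Values from the OptimConfig table.
learning_rate = 0.001
decay_rate_100k = 0.05

def lr_at(step: int) -> float:
    """ASSUMED schedule: decay by a factor of decay_rate_100k per 100,000 steps."""
    return learning_rate * decay_rate_100k ** (step / 100_000)

# "Unified LR" values logged at iterations 100, 200 and 300 of the first epoch.
# Under the assumed formula they correspond to optimizer steps 101, 201 and 301.
logged = {
    101: 0.000996978883189373,
    201: 0.0009939966705585644,
    301: 0.0009910233784699313,
}
for step, value in logged.items():
    print(step, lr_at(step), abs(lr_at(step) - value) < 1e-9)
```

With only 5 epochs of 1298 iterations each, the rate decays by about 15% over the whole run, so the schedule has a mild effect in this tutorial.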








Epochs:   0%|          | 0/5 [00:00<?, ?it/s]

Iterations:   0%|          | 0/1298 [00:00<?, ?it/s]

[0/5][100/1298][Time 7.94]
Unified LR across all optimizers: 0.000996978883189373
CCNet:  Three Tabnet
Inf: 0.1314	Gen: 0.8262	Rec: 0.8147	E: 0.1429	R: 0.1200	P: 1.5094

accuracy: 0.6016
precision: 0.3008
recall: 0.5000
f1_score: 0.3756
roc_auc: 0.5000

accuracy: 0.5898
precision: 0.2949
recall: 0.5000
f1_score: 0.3710
roc_auc: 0.5000

[0/5][200/1298][Time 7.54]
Unified LR across all optimizers: 0.0009939966705585644
CCNet:  Three Tabnet
Inf: 0.0724	Gen: 0.7284	Rec: 0.7200	E: 0.0808	R: 0.0641	P: 1.3760

accuracy: 0.5625
precision: 0.2812
recall: 0.5000
f1_score: 0.3600
roc_auc: 0.5000

accuracy: 0.5742
precision: 0.2871
recall: 0.5000
f1_score: 0.3648
roc_auc: 0.5000

[0/5][300/1298][Time 7.54]
Unified LR across all optimizers: 0.0009910233784699313
CCNet:  Three Tabnet
Inf: 0.0665	Gen: 0.6558	Rec: 0.6477	E: 0.0746	R: 0.0585	P: 1.2369

accuracy: 0.5586
precision: 0.2793
recall: 0.5000
f1_score: 0.3584
roc_auc: 0.5000

accuracy: 0.5586
precision: 0.2793
recall: 0.5000
f1_score: 0.3584
r

Iterations:   0%|          | 0/1298 [00:00<?, ?it/s]

[1/5][2/1298][Time 7.72]
Unified LR across all optimizers: 0.0009617752563394297
CCNet:  Three Tabnet
Inf: 0.0253	Gen: 0.4134	Rec: 0.4106	E: 0.0281	R: 0.0226	P: 0.7987

accuracy: 0.9297
precision: 0.9317
recall: 0.9265
f1_score: 0.9286
roc_auc: 0.9265

accuracy: 0.9258
precision: 0.9214
recall: 0.9252
f1_score: 0.9232
roc_auc: 0.9252

[1/5][102/1298][Time 7.65]
Unified LR across all optimizers: 0.0009588983465414223
CCNet:  Three Tabnet
Inf: 0.0235	Gen: 0.4049	Rec: 0.4021	E: 0.0263	R: 0.0208	P: 0.7835

accuracy: 0.9531
precision: 0.9495
recall: 0.9526
f1_score: 0.9509
roc_auc: 0.9526

accuracy: 0.9336
precision: 0.9324
recall: 0.9331
f1_score: 0.9327
roc_auc: 0.9331

[1/5][202/1298][Time 7.65]
Unified LR across all optimizers: 0.0009560300422985396
CCNet:  Three Tabnet
Inf: 0.0230	Gen: 0.3897	Rec: 0.3872	E: 0.0256	R: 0.0205	P: 0.7538

accuracy: 0.9492
precision: 0.9531
recall: 0.9417
f1_score: 0.9466
roc_auc: 0.9417

accuracy: 0.9219
precision: 0.9246
recall: 0.9196
f1_score: 0.9212
ro

Iterations:   0%|          | 0/1298 [00:00<?, ?it/s]

[2/5][4/1298][Time 7.73]
Unified LR across all optimizers: 0.0009250393549941983
CCNet:  Three Tabnet
Inf: 0.0138	Gen: 0.3059	Rec: 0.3051	E: 0.0146	R: 0.0129	P: 0.5973

accuracy: 0.9297
precision: 0.9271
recall: 0.9237
f1_score: 0.9253
roc_auc: 0.9237

accuracy: 0.9453
precision: 0.9462
recall: 0.9403
f1_score: 0.9430
roc_auc: 0.9403

[2/5][104/1298][Time 7.66]
Unified LR across all optimizers: 0.0009222723314443782
CCNet:  Three Tabnet
Inf: 0.0133	Gen: 0.3009	Rec: 0.2999	E: 0.0142	R: 0.0124	P: 0.5875

accuracy: 0.9609
precision: 0.9606
recall: 0.9606
f1_score: 0.9606
roc_auc: 0.9606

accuracy: 0.9570
precision: 0.9593
recall: 0.9497
f1_score: 0.9541
roc_auc: 0.9497

[2/5][204/1298][Time 7.51]
Unified LR across all optimizers: 0.0009195135847524926
CCNet:  Three Tabnet
Inf: 0.0128	Gen: 0.2947	Rec: 0.2939	E: 0.0136	R: 0.0120	P: 0.5758

accuracy: 0.9492
precision: 0.9480
recall: 0.9489
f1_score: 0.9485
roc_auc: 0.9489

accuracy: 0.9180
precision: 0.9183
recall: 0.9163
f1_score: 0.9172
ro

Iterations:   0%|          | 0/1298 [00:00<?, ?it/s]

[3/5][6/1298][Time 7.71]
Unified LR across all optimizers: 0.000889706615602604
CCNet:  Three Tabnet
Inf: 0.0102	Gen: 0.2525	Rec: 0.2518	E: 0.0109	R: 0.0095	P: 0.4940

accuracy: 0.9570
precision: 0.9584
recall: 0.9560
f1_score: 0.9568
roc_auc: 0.9560

accuracy: 0.9492
precision: 0.9511
recall: 0.9459
f1_score: 0.9481
roc_auc: 0.9459

[3/5][106/1298][Time 7.46]
Unified LR across all optimizers: 0.0008870452810934158
CCNet:  Three Tabnet
Inf: 0.0094	Gen: 0.2503	Rec: 0.2497	E: 0.0101	R: 0.0088	P: 0.4905

accuracy: 0.9492
precision: 0.9496
recall: 0.9502
f1_score: 0.9492
roc_auc: 0.9502

accuracy: 0.9453
precision: 0.9453
recall: 0.9435
f1_score: 0.9443
roc_auc: 0.9435

[3/5][206/1298][Time 7.53]
Unified LR across all optimizers: 0.0008843919072998678
CCNet:  Three Tabnet
Inf: 0.0092	Gen: 0.2448	Rec: 0.2440	E: 0.0100	R: 0.0085	P: 0.4795

accuracy: 0.9531
precision: 0.9521
recall: 0.9521
f1_score: 0.9521
roc_auc: 0.9521

accuracy: 0.9492
precision: 0.9478
recall: 0.9488
f1_score: 0.9483
roc

Iterations:   0%|          | 0/1298 [00:00<?, ?it/s]

[4/5][8/1298][Time 7.84]
Unified LR across all optimizers: 0.0008557234430874619
CCNet:  Three Tabnet
Inf: 0.0084	Gen: 0.2346	Rec: 0.2339	E: 0.0091	R: 0.0077	P: 0.4600

accuracy: 0.9531
precision: 0.9523
recall: 0.9490
f1_score: 0.9506
roc_auc: 0.9490

accuracy: 0.9453
precision: 0.9517
recall: 0.9382
f1_score: 0.9433
roc_auc: 0.9382

[4/5][108/1298][Time 7.63]
Unified LR across all optimizers: 0.0008531637607276009
CCNet:  Three Tabnet
Inf: 0.0078	Gen: 0.2293	Rec: 0.2286	E: 0.0085	R: 0.0071	P: 0.4500

accuracy: 0.9258
precision: 0.9270
recall: 0.9221
f1_score: 0.9242
roc_auc: 0.9221

accuracy: 0.9648
precision: 0.9678
recall: 0.9627
f1_score: 0.9645
roc_auc: 0.9627

[4/5][208/1298][Time 7.64]
Unified LR across all optimizers: 0.0008506117350164344
CCNet:  Three Tabnet
Inf: 0.0083	Gen: 0.2308	Rec: 0.2301	E: 0.0090	R: 0.0075	P: 0.4527

accuracy: 0.9141
precision: 0.9207
recall: 0.9056
f1_score: 0.9109
roc_auc: 0.9056

accuracy: 0.9453
precision: 0.9452
recall: 0.9431
f1_score: 0.9441
ro
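
The earliest metric blocks in the log (for example accuracy 0.6016, precision 0.3008, recall 0.5000, f1_score 0.3756) are what macro-averaged metrics look like when a model predicts the same class for every sample. The sketch below reproduces those numbers from a confusion matrix. The batch of 256 samples with 154 in the majority class is a hypothetical example chosen to match the log, and the macro averaging is an inference from the logged values rather than a documented property of the trainer.

```python
def macro_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Accuracy plus macro-averaged precision, recall and F1 for two classes."""
    def safe_div(a, b):
        return a / b if b else 0.0  # undefined ratios are counted as 0

    # Per-class scores: the positive class first, then the negative class.
    precisions = [safe_div(tp, tp + fp), safe_div(tn, tn + fn)]
    recalls = [safe_div(tp, tp + fn), safe_div(tn, tn + fp)]
    f1s = [safe_div(2 * p * r, p + r) for p, r in zip(precisions, recalls)]
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": sum(precisions) / 2,
        "recall": sum(recalls) / 2,
        "f1_score": sum(f1s) / 2,
    }

# Hypothetical batch: 256 samples, 154 of them negative, and a model that
# predicts "negative" for every sample.
constant = macro_metrics(tp=0, fp=0, fn=102, tn=154)
print({name: round(value, 4) for name, value in constant.items()})
# {'accuracy': 0.6016, 'precision': 0.3008, 'recall': 0.5, 'f1_score': 0.3756}
```

The logged `roc_auc` equals `recall` in every block, which is the expected result if the AUC is computed from hard 0/1 predictions: in that case it reduces to the average of the two per-class recalls.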

In [14]:
causal_learning.test(testset)

{'accuracy': 0.9499537944793701,
 'precision': 0.9499812144237108,
 'recall': 0.9483292461887934,
 'f1_score': 0.9491063278060328,
 'roc_auc': 0.9483292461887934}