# Welcome to my Notebook
# Airline Passenger Satisfaction Dataset

![](https://i.pinimg.com/originals/b9/b8/1a/b9b81ab0e549a0ef6bbd9616e32031d5.gif)

# About Dataset
The dataset contains a total of 22 input features, and our goal is to predict whether passengers are satisfied or not based on the features below.

> Detailed descriptions of the input features are given below:
1. Gender: male or female
2. Customer type: regular or non-regular airline customer
3. Age: the actual age of the passenger
4. Type of travel: the purpose of the passenger's flight (personal or business travel)
5. Class: business, economy, economy plus
6. Flight distance
7. Inflight wifi service: satisfaction level with Wi-Fi service on board (0: not applicable; 1-5)
8. Departure/Arrival time convenient: departure/arrival time satisfaction level (0: not rated; 1-5)
9. Ease of Online booking: online booking satisfaction rate (0: not rated; 1-5)
10. Gate location: level of satisfaction with the gate location (0: not rated; 1-5)
11. Food and drink: food and drink satisfaction level (0: not rated; 1-5)
12. Online boarding: satisfaction level with online boarding (0: not rated; 1-5)
13. Seat comfort: seat satisfaction level (0: not rated; 1-5)
14. Inflight entertainment: satisfaction with inflight entertainment (0: not rated; 1-5)
15. On-board service: level of satisfaction with on-board service (0: not rated; 1-5)
16. Leg room service: level of satisfaction with leg room service (0: not rated; 1-5)
17. Baggage handling: level of satisfaction with baggage handling (0: not rated; 1-5)
18. Checkin service: level of satisfaction with checkin service (0: not rated; 1-5)
19. Inflight service: level of satisfaction with inflight service (0: not rated; 1-5)
20. Cleanliness: level of satisfaction with cleanliness (0: not rated; 1-5)
21. Departure delay in minutes
22. Arrival delay in minutes
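
Since a rating of 0 means "not rated" or "not applicable" rather than a very low score, it can help to look at the share of zeros and to average only the real 1-5 answers. The snippet below is a minimal sketch on a made-up toy frame; the values and the two chosen columns are assumptions for illustration, not taken from the actual dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for two survey columns (values are invented for illustration)
ratings = pd.DataFrame({
    "Inflight wifi service": [0, 3, 5, 4],
    "Seat comfort": [2, 0, 0, 4],
})

# Share of "not rated" (0) answers per column
not_rated_share = (ratings == 0).mean()

# Mean rating computed only over actual 1-5 answers (0 treated as missing)
mean_rating = ratings.replace(0, np.nan).mean()

print(not_rated_share)
print(mean_rating)
```

Whether to treat 0 as missing or keep it as its own level is a modelling choice; the rest of this notebook keeps the zeros as they are.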

> This dataset contains a survey on air passenger satisfaction. The following classification problem is set:

> Predict which of the two satisfaction levels a passenger belongs to:

1. Satisfied
2. Neutral or dissatisfied


In [None]:
pip install "numpy>=1.16.5,<1.23.0"

In [None]:
# Import all the necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, LabelEncoder, OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score,roc_auc_score,precision_score, recall_score, f1_score,ConfusionMatrixDisplay,classification_report


import warnings 
warnings.filterwarnings("ignore")

In [None]:
# Read the training and testing data
train_data=pd.read_csv("/kaggle/input/airline-passenger-satisfaction/train.csv")
test_data=pd.read_csv("/kaggle/input/airline-passenger-satisfaction/test.csv")

In [None]:
train_data.shape

In [None]:
train_data.head()

In [None]:
# Let's drop the "Unnamed: 0" and "id" columns from the training data; they have no impact on the target variable
train_data.drop(["Unnamed: 0","id"], axis=1, inplace=True)

In [None]:
# Let's check the shape of the train data again
train_data.shape

In [None]:
test_data.shape

In [None]:
# Let's drop the "Unnamed: 0" and "id" columns from the testing data; they have no impact on the target variable
test_data.drop(["Unnamed: 0","id"], axis=1, inplace=True)

In [None]:
# Let's check the shape of the test data again
test_data.shape

In [None]:
# Let's see the datatypes of each feature
train_data.info()

In [None]:
# Let's look at the summary statistics of the numerical features
train_data.describe()

In [None]:
# Count the duplicate rows in the dataset
train_data.duplicated().sum()

In [None]:
# Check whether there are any null values in the training data
train_data.isna().sum()

> Here we notice that the feature Arrival Delay in Minutes has 310 missing values. Let's handle these missing values.

In [None]:
# Check whether there are any null values in the testing data
test_data.isna().sum()

> Here we notice that the feature Arrival Delay in Minutes has 83 missing values. Let's handle these missing values.

# Handle the missing values with the median of Arrival Delay in Minutes using SimpleImputer

In [None]:
# Fit the imputer on the training data only, then reuse the training median for the test data
imputer=SimpleImputer(missing_values=np.nan, strategy="median")
train_data["Arrival Delay in Minutes"]= imputer.fit_transform(train_data[["Arrival Delay in Minutes"]]).ravel()
test_data["Arrival Delay in Minutes"]= imputer.transform(test_data[["Arrival Delay in Minutes"]]).ravel()
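
To make the train/test handling of the imputer concrete, here is a small sketch on invented toy frames (`toy_train` and `toy_test` are hypothetical names, and the delay values are made up). The median is learned from the training column only and then applied to the test column, so no information from the test set leaks into preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

col = "Arrival Delay in Minutes"

# Toy data, invented for illustration only
toy_train = pd.DataFrame({col: [0.0, 10.0, np.nan, 30.0, 200.0]})
toy_test = pd.DataFrame({col: [np.nan, 5.0]})

imputer = SimpleImputer(strategy="median")

# Learn the median from the training column and fill its gaps
toy_train[col] = imputer.fit_transform(toy_train[[col]]).ravel()

# Reuse the training median for the test column (transform only, no refit)
toy_test[col] = imputer.transform(toy_test[[col]]).ravel()

print(imputer.statistics_)  # the median learned from the training data
print(toy_test)
```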

# Let's check whether the Dataset is balanced

In [None]:
train_data["satisfaction"].value_counts()
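
Raw counts are easier to judge as proportions. The sketch below uses a tiny invented target series (`toy_target` is a hypothetical stand-in for `train_data["satisfaction"]`) to show how the class shares and a simple majority-to-minority ratio could be computed.

```python
import pandas as pd

# Invented labels standing in for the real target column
toy_target = pd.Series(
    ["satisfied", "neutral or dissatisfied", "neutral or dissatisfied",
     "satisfied", "neutral or dissatisfied"],
    name="satisfaction",
)

# Share of each class instead of raw counts
class_share = toy_target.value_counts(normalize=True)

# Ratio between the largest and the smallest class
imbalance_ratio = class_share.max() / class_share.min()

print(class_share)
print(round(imbalance_ratio, 2))
```

A ratio close to 1 suggests a balanced target; how large a ratio has to be before resampling or class weights are worth trying is a judgment call.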

# Getting the List of Numerical and Categorical Features

In [None]:
def get_num_cat_columns(dataframe):
    categorical_cols=dataframe.select_dtypes(include="object").columns
    numerical_cols=dataframe.select_dtypes(exclude="object").columns
    
    return categorical_cols, numerical_cols

In [None]:
categorical_cols,numerical_cols=get_num_cat_columns(train_data)

In [None]:
categorical_cols

In [None]:
numerical_cols

# Exploratory Data Analysis

# Univariate Analysis

In [None]:
# Plot the countplot of categorical features
for col in categorical_cols:
    plt.figure(figsize=(6,3), dpi=100)
    sns.countplot(x=train_data[col], palette="muted")
    label=col
    plt.xlabel(label)
    plt.ylabel("count")
    plt.title(label)

In [None]:
# Plot the data distribution of the numerical columns
for col in numerical_cols:
    plt.figure(figsize=(7,7))
    # palette only applies when a hue variable is set, so it is omitted here
    sns.histplot(train_data[col], kde=True, bins=15)
    label=col
    plt.xlabel(label)
    plt.ylabel("count")
    plt.title(label)

# Label Encoding -- Convert Categorical Features into Numerical Features

In [None]:
le= LabelEncoder()
train_data["Gender"]= le.fit_transform(train_data["Gender"])
train_data["Customer Type"]= le.fit_transform(train_data["Customer Type"])
train_data["Type of Travel"]= le.fit_transform(train_data["Type of Travel"])
train_data["Class"]= le.fit_transform(train_data["Class"])
train_data["satisfaction"]= le.fit_transform(train_data["satisfaction"])


# Let's make the Correlation Matrix

In [None]:
corr_matrix= train_data.corr()
corr_matrix

In [None]:
plt.figure(figsize=(12,12))
sns.heatmap(corr_matrix, annot=True, cmap="Greens", fmt=".1f")
plt.show()

> Here we can see that Departure Delay in Minutes and Arrival Delay in Minutes show a correlation of 1.0 in the heatmap (the values are rounded to one decimal by `fmt=".1f"`). Both features convey nearly the same information about the flight delay, so let's remove one of them from the dataset.
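
Reading strongly correlated pairs off a large heatmap is error-prone, especially when the annotations are rounded. As a hedged sketch, the snippet below builds synthetic delay columns (the data, the column names and the 0.9 threshold are all assumptions for illustration) and lists the pairs whose absolute correlation exceeds the threshold, without rounding.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: arrival delay is departure delay plus some noise
departure = rng.exponential(scale=15, size=500)
arrival = departure + rng.normal(scale=4, size=500)
unrelated = rng.normal(size=500)

toy = pd.DataFrame({
    "Departure Delay": departure,
    "Arrival Delay": arrival,
    "Unrelated": unrelated,
})

corr = toy.corr()

# Keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# List the pairs whose absolute correlation exceeds the chosen threshold
pairs = upper.stack()
high_pairs = pairs[pairs.abs() > 0.9]

print(high_pairs)
```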

# Let's Perform the Bivariate Analysis

In [None]:
# Plotly creates its own figure, so no plt.figure() call is needed
fig= px.box(train_data, x="satisfaction", y="Age", title="Age Vs Satisfaction", color="satisfaction")
fig.show()

In [None]:
# Plotly creates its own figure, so no plt.figure() call is needed
fig = px.box(train_data, x="satisfaction", y="Flight Distance", title="Flight_Distance Vs Satisfaction", color="satisfaction")
fig.show()

In [None]:
# Plotly creates its own figure, so no plt.figure() call is needed
fig = px.box(train_data, x="satisfaction", y="Departure Delay in Minutes", title="Departure Delay in Minutes Vs Satisfaction", color="satisfaction")
fig.show()

# Let's read the Dataset again for Modelling

In [None]:
# Read the training and testing data
train_data=pd.read_csv("/kaggle/input/airline-passenger-satisfaction/train.csv")
test_data=pd.read_csv("/kaggle/input/airline-passenger-satisfaction/test.csv")

In [None]:
# Let's drop the "Unnamed: 0" and "id" columns from the training data; they have no impact on the target variable
train_data.drop(["Unnamed: 0","id"], axis=1, inplace=True)

In [None]:
# Let's drop the "Unnamed: 0" and "id" columns from the testing data; they have no impact on the target variable
test_data.drop(["Unnamed: 0","id"], axis=1, inplace=True)

In [None]:
train_data.shape, test_data.shape

# Above we noticed that Departure Delay in Minutes and Arrival Delay in Minutes are almost perfectly correlated
> Let's drop one of them from the Dataset

In [None]:
train_data.drop("Arrival Delay in Minutes", axis=1, inplace=True)
test_data.drop("Arrival Delay in Minutes", axis=1, inplace=True)

# Separate the Features and the Target for the Train and Test Sets

In [None]:
x_train=train_data.drop(["satisfaction"], axis=1)
y_train=train_data["satisfaction"]
x_test=test_data.drop(["satisfaction"], axis=1)
y_test=test_data["satisfaction"]

In [None]:
x_train.shape, y_train.shape, x_test.shape, y_test.shape

# Let's Detect the Outliers in the Dataset

In [None]:
def Winsorization_Method(columns, x_train, y_train , a, b):
    # Note: despite the name, this function trims (drops) the rows that fall outside
    # the [a, b] percentile range; strict winsorization would clip them instead.
    outlier_mask= pd.Series(False, index=x_train.index)

    for col in columns:
        lower= np.percentile(x_train[col], a)
        upper= np.percentile(x_train[col], b)

        # Flag a row as soon as any column lies outside its percentile range
        outlier_mask |= (x_train[col] < lower) | (x_train[col] > upper)

    ratio= round(outlier_mask.mean()*100, 2)    # percentage of rows flagged as outliers

    # Remove the flagged rows from the features and the target
    x_train= x_train[~outlier_mask]
    y_train= y_train[~outlier_mask]

    return ratio, x_train, y_train

In [None]:
ratio, x_train, y_train= Winsorization_Method(x_train.select_dtypes(exclude="object").columns, x_train, y_train , a=1, b=99)

In [None]:
x_train.shape, y_train.shape

In [None]:
print(f"Percentage of rows flagged as outliers and removed: {ratio}%")
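
The function above removes the rows that fall outside the chosen percentiles, which is usually called trimming. Winsorization in the strict sense keeps every row and clips extreme values to the percentile bounds. The following sketch contrasts the two on a made-up series; the values and the 10th/90th percentile bounds are assumptions chosen only to keep the example small.

```python
import numpy as np
import pandas as pd

# Invented values with one obvious extreme
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000], name="Flight Distance")

lower, upper = np.percentile(values, [10, 90])

# Trimming: drop the rows outside the percentile range
trimmed = values[(values >= lower) & (values <= upper)]

# Winsorizing: keep all rows, but clip extremes to the bounds
winsorized = values.clip(lower=lower, upper=upper)

print(len(values), len(trimmed), len(winsorized))
print(winsorized.max())
```

Trimming shrinks the training set, while clipping preserves its size; which one suits this dataset better would need to be checked against the model scores.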

In [None]:
# getting categorical and numerical columns
categorical_cols, numerical_cols= get_num_cat_columns(x_train)

# Data Preprocessing

# Data Preprocessing for Training Data

In [None]:
# One-Hot encode non-numeric columns
# Note: on scikit-learn >= 1.2 the argument is called sparse_output=False instead of sparse=False
ohe= OneHotEncoder(handle_unknown="ignore", sparse=False)
x_train_encoded=pd.DataFrame(ohe.fit_transform(x_train[categorical_cols]))
x_train_encoded.columns= ohe.get_feature_names_out(categorical_cols)

# Label Encode the target class
le= LabelEncoder()
y_train=le.fit_transform(y_train)

# Apply RobustScaler for feature scaling
scaler= RobustScaler()
x_train_scaled= pd.DataFrame(scaler.fit_transform(x_train[numerical_cols]))
x_train_scaled.columns=numerical_cols

# Concatenate the encoded and scaled features
x_train_processed=pd.concat([x_train_encoded,x_train_scaled], axis=1)
x_train_processed

# Data Preprocessing for Testing Data

In [None]:
# One-Hot encode non-numeric columns
x_test_encoded=pd.DataFrame(ohe.transform(x_test[categorical_cols]))
x_test_encoded.columns= ohe.get_feature_names_out(categorical_cols)

# Label Encode the target class
y_test=le.transform(y_test)

# Apply the RobustScaler fitted on the training data
x_test_scaled= pd.DataFrame(scaler.transform(x_test[numerical_cols]))
x_test_scaled.columns=numerical_cols

# Concatenate the encoded and scaled features
x_test_processed=pd.concat([x_test_encoded,x_test_scaled], axis=1)
x_test_processed

# Let's do the Modelling

In [None]:
def modelling(x_train, x_test, y_train, y_test):
    
    cv_result = []
    best_estimators = []
    recall_scores = []
    precision_scores = []
    roc_auc_scores = []
    f1_scores = []
    
    
    
    dt=DecisionTreeClassifier(random_state=42)
    rf=RandomForestClassifier(random_state=42)
    classifiers=[dt, rf]

    dt_param_grid = {"min_samples_split" : range(10,500,20),
                     "max_depth": range(1,20,2)}


    rf_param_grid = {"max_features": [1,3,10],
                     "min_samples_split":[2,3,10],
                     "min_samples_leaf":[1,3,10],
                     "n_estimators":[100,300],
                     "criterion":["gini"]}
    

   

    classifier_parameters = [dt_param_grid,
                            rf_param_grid
                             ]
    
    for i in range(len(classifiers)):
        model= GridSearchCV(classifiers[i], classifier_parameters[i], cv=5, scoring ="accuracy", n_jobs = -1)
        model.fit(x_train, y_train)
        y_pred= model.predict(x_test)
        
        cv_result.append(model.best_score_)
        # scikit-learn metrics expect (y_true, y_pred); swapping them exchanges precision and recall
        roc_auc_scores.append(roc_auc_score(y_test, y_pred))
        recall_scores.append(recall_score(y_test, y_pred))
        precision_scores.append(precision_score(y_test, y_pred))
        f1_scores.append(f1_score(y_test, y_pred))
        best_estimators.append(model.best_estimator_)
        
        
        print(f"Model:{classifiers[i]}")
        print(f"CV Accuracy:{round(cv_result[i]*100,2)}")
        print(f"ROC AUC:{roc_auc_scores[i]}")
        print(f"Recall:{recall_scores[i]}")
        print(f"Precision:{precision_scores[i]}")
        print(f"F1-Score:{f1_scores[i]}")
        print(f"Best Estimator:{model.best_estimator_}")
        print("Classification Report")
        print("---------------------")
        print(classification_report(y_test,y_pred,digits=3))
        print("Confusion_Matrix")
        print("---------------------")
        ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
        plt.show()
    
    
        print("---------------------------------------------------------------------------------------------------------------")

                         
                         
    model_names = ['DecisionTreeClassifier','RandomForestClassifier']
    result_df = pd.DataFrame({'Recall':recall_scores, 'Precision':precision_scores, 'F1_Score':f1_scores,'AUC_Score':roc_auc_scores, 'Accuracy': cv_result,},index=model_names)
    result_df=result_df.sort_values(by="AUC_Score", ascending=False)
    return result_df


In [None]:
result_df= modelling(x_train_processed, x_test_processed, y_train, y_test)

In [None]:
result_df
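
Two details of the metric calls are easy to get wrong: the argument order and the input to ROC AUC. scikit-learn metrics take the true labels first and the predictions second, and `roc_auc_score` is generally more informative when it receives predicted probabilities rather than hard 0/1 labels. The sketch below illustrates both points on synthetic data from `make_classification`; the data and all variable names are assumptions, not results from the airline dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, mildly imbalanced binary problem
X, y = make_classification(n_samples=400, n_features=8, weights=[0.7, 0.3], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=42)

clf = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)
y_pred = clf.predict(X_te)
y_score = clf.predict_proba(X_te)[:, 1]  # probability of the positive class

# Correct order: true labels first, predictions second
recall = recall_score(y_te, y_pred)
precision = precision_score(y_te, y_pred)

# Swapping the arguments silently turns recall into precision
swapped_recall = recall_score(y_pred, y_te)

# ROC AUC computed from probabilities instead of hard labels
auc_from_scores = roc_auc_score(y_te, y_score)

print(recall, precision, swapped_recall, auc_from_scores)
```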

# Result Comparison

In [None]:
result_df.plot(kind="barh", figsize=(10, 7), grid=True).legend(bbox_to_anchor=(1.2,1));

# Feature Importance 

In [None]:
rf= RandomForestClassifier(random_state=42).fit(x_train_processed,y_train)
importances=rf.feature_importances_

# Print the features whose importance is greater than 0.05
for name, importance in zip(x_train_processed.columns, importances):
    if importance > 0.05:
        print(f"{name} : {round(importance,3)}")
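
A ranked view of the importances can be easier to read than a filtered printout. As a sketch under the assumption of synthetic data (the `feature_i` names and the sample sizes are made up), the importances can be wrapped in a pandas Series, sorted, and then filtered with the same 0.05 threshold.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names
X, y = make_classification(n_samples=300, n_features=6, n_informative=3, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Importances indexed by feature name, highest first
ranked = pd.Series(forest.feature_importances_, index=X.columns).sort_values(ascending=False)

# Keep only the features above the threshold used in this notebook
top = ranked[ranked > 0.05]

print(ranked)
print(top)
```

The impurity-based importances of a random forest sum to 1, so the 0.05 cut-off corresponds to features that account for more than 5% of the total.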

# Visualise the Decision Tree with max_depth=3
> (plot_tree returns the annotations for the plot; to keep them out of the notebook output, I assigned the returned value to an underscore.)

In [None]:
plt.figure(figsize=(35,30))
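# feature_names is passed as a plain list, which plot_tree accepts across scikit-learn versions
_= plot_tree(DecisionTreeClassifier(max_depth=3, random_state=42).fit(x_train_processed,y_train),feature_names=list(x_train_processed.columns), class_names=['Dissatisfied (0)','Satisfied (1)'], 
                                    label='all', filled=True, rounded=True)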
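
When the plotted tree is hard to read at notebook resolution, a text dump of the same splits can be a useful companion. The sketch below uses `sklearn.tree.export_text` on synthetic data; the data and the `feature_i` names are assumptions for illustration, and the printed rules are not those of the airline model.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with hypothetical feature names
X, y = make_classification(n_samples=300, n_features=5, n_informative=3, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Plain-text version of the split rules, one line per node
rules = export_text(tree, feature_names=feature_names)

print(rules)
```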


# Conclusion
1. Here we can see that RandomForestClassifier performs best in terms of **Recall (0.97)**, with the best estimator using max_features=10, min_samples_split=10, n_estimators=300.

2. After that, we explored the feature importances of RandomForestClassifier and kept the features whose importance is greater than 0.05. We noticed that the following features have the most impact on the target variable:
* Type of Travel_Business travel
* Type of Travel_Personal Travel 
* Class_Business
* Inflight wifi service
* Online boarding 
* Inflight entertainment 

3. We also visualized how the decision tree is built, using max_depth=3 for easy interpretation.

**Huge thanks** for your upvote on my Kaggle notebook! Your appreciation fuels my passion. I welcome any suggestions you might have—your insights are immensely valuable. Let's keep growing and improving together in the world of data science.

![](https://gifdb.com/images/high/thank-you-cute-hamster-holding-card-a6zrrnjabk559ndi.webp)