The dataset you'll be working with contains an airline passenger satisfaction survey. There are two primary questions of interest with this dataset:

Can you predict passenger satisfaction?

What factors are associated with passenger satisfaction?

# Project Specs

While the focus of our course is advanced machine learning, your work on this project will be assessed on at least the following things:

- Your data preparation/pre-processing
- Your exploratory data analysis (EDA)
- Your modeling/ML efforts, comparisons, interpretations, and conclusions
- Your project narrative

You should also comment (probably after the bulk of your work) on the data itself, possible sources for these data, and the possible issues that should be considered when working with this kind of survey data.

In particular, I would encourage you to make specific criticisms of the dataset and survey as you understand them, and possibly even make suggestions for how to improve them if they were to be administered again.

Who can you generalize your work and results to? How helpful is all of this to an airline?

Your submission should be built and written with non-experts as the target audience. All of your code should still be included, but do your best to narrate your work in accessible ways.

Again, submit an HTML, ipynb, or Colab link. Be sure to rerun your entire notebook fresh before submitting!

In [5]:
# import libraries
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
import numpy as np


# Data

In [19]:
# load in data
data1 = pd.read_csv("data1.csv")
data2 = pd.read_csv("data2.csv")

# concatenate data
data = pd.concat([data1, data2], ignore_index=True)
data = data.dropna()
df = data.drop(columns = ["Unnamed: 0"])
df.head()

Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,19556,Female,Loyal Customer,52,Business travel,Eco,160,5,4,3,...,5,5,5,5,2,5,5,50,44.0,satisfied
1,90035,Female,Loyal Customer,36,Business travel,Business,2863,1,1,3,...,4,4,4,4,3,4,5,0,0.0,satisfied
2,12360,Male,disloyal Customer,20,Business travel,Eco,192,2,0,2,...,2,4,1,3,2,2,2,0,0.0,neutral or dissatisfied
3,77959,Male,Loyal Customer,44,Business travel,Business,3377,0,0,0,...,1,1,1,1,3,1,4,0,6.0,satisfied
4,36875,Female,Loyal Customer,49,Business travel,Eco,1182,2,3,4,...,2,2,2,2,4,2,4,0,20.0,satisfied
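
The cell above calls `dropna()`, which silently discards any row with a missing value. A quick check along these lines could show how much data that removes. The toy frame and its values are hypothetical stand-ins, not the survey data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the concatenated survey data.
toy = pd.DataFrame({
    "Age": [52, 36, 20, 44],
    "Arrival Delay in Minutes": [44.0, None, 0.0, 6.0],
    "satisfaction": ["satisfied", "satisfied",
                     "neutral or dissatisfied", "satisfied"],
})

# Count missing values per column before dropping anything.
missing_per_column = toy.isna().sum()

# Compare row counts before and after dropna() to see how much data is lost.
rows_before = len(toy)
rows_after = len(toy.dropna())
rows_dropped = rows_before - rows_after

print(missing_per_column)
print(f"Rows dropped: {rows_dropped} of {rows_before}")
```

Running the same three steps on `data` before the `dropna()` call would show whether the lost rows are few enough to ignore or whether imputation deserves a look.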


Since this dataset comes from a passenger satisfaction survey, it's important to recognize that the responses are subjective and prone to survey bias. In particular, the data likely suffers from self-selection (voluntary response), a form of convenience sampling, because passengers are presumably not required to complete the survey. This introduces selection bias: individuals who choose to respond to satisfaction surveys typically have stronger opinions than those who do not, and may be more likely to report positive experiences. As a result, the responses may not represent the broader population of airline passengers, and the results should be generalized beyond the respondents only with caution. In addition, the data includes little demographic information beyond age and gender, which limits how well we can draw conclusions about passenger behavior and identify differences between customer segments.

This data can help airlines capture satisfaction trends and identify areas that may need to be improved, such as service or comfort. If airlines wanted to reduce this bias, they could survey a random sample of passengers or offer an incentive to increase response rates. Collecting more demographic data would also help strengthen the analysis.
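
To make the selection-bias argument concrete, here is a small simulation. Every number in it is a made-up assumption (a 50% true satisfaction rate, and satisfied passengers responding three times as often as dissatisfied ones), so it only illustrates the mechanism and says nothing about the real survey.

```python
import random

random.seed(0)

# Hypothetical population: about 50% of passengers are truly satisfied.
TRUE_SATISFIED_RATE = 0.5
# Hypothetical response rates: satisfied passengers respond more often.
RESPONSE_RATE = {True: 0.30, False: 0.10}

population = [random.random() < TRUE_SATISFIED_RATE for _ in range(100_000)]

# Voluntary survey: each passenger decides whether to respond.
voluntary = [s for s in population if random.random() < RESPONSE_RATE[s]]

# Random sample: respondents are chosen regardless of their opinion.
random_sample = random.sample(population, k=5_000)


def satisfied_share(responses):
    """Fraction of responses that are 'satisfied' (True)."""
    return sum(responses) / len(responses)


print(f"True rate:        {satisfied_share(population):.3f}")
print(f"Voluntary survey: {satisfied_share(voluntary):.3f}")
print(f"Random sample:    {satisfied_share(random_sample):.3f}")
```

Under these assumptions the voluntary survey should land near 0.75 (since 0.5 × 0.30 / (0.5 × 0.30 + 0.5 × 0.10) = 0.75), while the random sample stays close to the true rate. Reversing the response rates would bias the estimate downward instead.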

# Running Models

In [20]:

def evaluate_model(X, y, model_type="Naive Bayes", test_size=0.2, random_state=42):
    # Split data into train and test
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Identify categorical columns
    cat_cols = X.select_dtypes(include='object').columns.tolist()

    # Column transformer for preprocessing
    ct = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('standardize', StandardScaler(), make_column_selector(dtype_include=np.number))
    ])

    # List of classifiers
    classifiers = {
        "Neural Network (10)": MLPClassifier(hidden_layer_sizes=(10,), activation='relu', max_iter=500, random_state=random_state),
        "Neural Network (50)": MLPClassifier(hidden_layer_sizes=(50,), activation='relu', max_iter=500, random_state=random_state),
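        # Pass random_state to the randomized ensembles so repeated runs give the same results
        "Logistic Regression + Bagging": BaggingClassifier(estimator=LogisticRegression(max_iter=1000), n_estimators=100, random_state=random_state),
        "KNN + Bagging": BaggingClassifier(estimator=KNeighborsClassifier(n_neighbors=5), n_estimators=100, random_state=random_state),
        "Decision Tree + Bagging": BaggingClassifier(estimator=DecisionTreeClassifier(), n_estimators=100, random_state=random_state),
        "Random Forest": RandomForestClassifier(n_estimators=100, random_state=random_state),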
        "Stacking (LR, DT, KNN)": StackingClassifier(estimators=[
            ('lr', LogisticRegression(max_iter=1000)), 
            ('dt', DecisionTreeClassifier()), 
            ('knn', KNeighborsClassifier())
        ]),
        "SVM (RBF Kernel)": SVC(kernel='rbf', C=1.0, probability=True),
        "QDA": QuadraticDiscriminantAnalysis(),
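        # 'logloss' is the binary log-loss metric; use_label_encoder is omitted because recent XGBoost releases no longer use it.
        # Note: XGBClassifier may require integer class labels (0, 1), so the string labels in y may need encoding first.
        "XGBoost": XGBClassifier(eval_metric='logloss', random_state=random_state),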
        "LightGBM": LGBMClassifier(random_state=random_state),
        "Naive Bayes": GaussianNB()
    }

    # Select classifier
    if model_type not in classifiers:
        raise ValueError(f"Model '{model_type}' not recognized.")

    clf = classifiers[model_type]
    pipeline = Pipeline([("preprocess", ct), ("model", clf)])

    # Grid search for selected models
    param_grids = {
        "SVM (RBF Kernel)": {"model__C": [0.1, 1, 10]},
        "XGBoost": {"model__n_estimators": [50, 100], "model__max_depth": [3, 5]},
        "LightGBM": {"model__n_estimators": [50, 100], "model__num_leaves": [31, 50]},
    }

    if model_type in param_grids:
        pipeline = GridSearchCV(pipeline, param_grids[model_type], cv=3, scoring='f1_weighted')

    # Fit model
    pipeline.fit(X_train, y_train)

    # Predict
    y_pred = pipeline.predict(X_test)

    # Evaluate
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')

    print(f"\nResults for {model_type}:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))
    print("Confusion Matrix:")
    print(confusion_matrix(y_test, y_pred))

    if isinstance(pipeline, GridSearchCV):
        print("\nBest parameters from GridSearchCV:")
        print(pipeline.best_params_)
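
`evaluate_model` scores each model on a single 80/20 split, so the reported numbers depend on which rows land in the test set. `cross_val_score` (imported at the top but not used) offers one way to gauge that variability. The sketch below is a minimal illustration on synthetic data. The toy columns, the target rule, and the choice of logistic regression are assumptions made for the example, not part of the original analysis.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: one categorical and one numeric feature.
rng = np.random.default_rng(42)
n = 200
toy_X = pd.DataFrame({
    "Class": rng.choice(["Eco", "Business"], size=n),
    "Flight Distance": rng.integers(100, 4000, size=n),
})
# Toy target: mostly follows the travel class, with about 10% of labels flipped.
flipped = rng.random(n) < 0.1
toy_y = np.where(
    (toy_X["Class"] == "Business") ^ flipped,
    "satisfied",
    "neutral or dissatisfied",
)

# Same preprocessing idea as evaluate_model: one-hot encode and standardize.
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Class"]),
    ("num", StandardScaler(), ["Flight Distance"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Five-fold cross-validation; the preprocessing is refit inside each fold.
scores = cross_val_score(pipeline, toy_X, toy_y, cv=5, scoring="f1_weighted")
print(f"Weighted F1 per fold: {np.round(scores, 3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```

Because the preprocessing sits inside the pipeline, each fold fits the encoder and scaler on its own training portion only, which avoids leaking information from the held-out rows.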


In [21]:
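# Features: every column except the target
X = df.drop(columns=["satisfaction"])
# Target as a 1-D Series; a one-column DataFrame triggers scikit-learn's column-vector warning
y = df["satisfaction"]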


In [22]:
evaluate_model(X, y, "Neural Network (10)")




Results for Neural Network (10):
Accuracy: 0.9543
F1 Score: 0.9542

Classification Report:
                         precision    recall  f1-score   support

neutral or dissatisfied       0.95      0.97      0.96     14456
              satisfied       0.96      0.93      0.95     11442

               accuracy                           0.95     25898
              macro avg       0.96      0.95      0.95     25898
           weighted avg       0.95      0.95      0.95     25898

Confusion Matrix:
[[14061   395]
 [  789 10653]]
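
For readers less familiar with these metrics, the numbers in the report can be recomputed by hand from the confusion matrix. The sketch below uses the matrix printed above and assumes scikit-learn's convention: rows are true classes, columns are predicted classes, in alphabetical label order. The majority-class baseline is an extra reference point added here, not something the original output reports.

```python
# Confusion matrix reported above for "Neural Network (10)".
# Rows are true classes, columns are predicted classes, in the order
# ["neutral or dissatisfied", "satisfied"].
cm = [[14061, 395],
      [789, 10653]]

total = sum(sum(row) for row in cm)

# Accuracy: correct predictions (the diagonal) over all predictions.
accuracy = (cm[0][0] + cm[1][1]) / total

# Precision for "satisfied": of those predicted satisfied, how many were.
precision_satisfied = cm[1][1] / (cm[0][1] + cm[1][1])

# Recall for "satisfied": of those truly satisfied, how many were found.
recall_satisfied = cm[1][1] / (cm[1][0] + cm[1][1])

# Baseline: always predict the majority class ("neutral or dissatisfied").
majority_baseline = sum(cm[0]) / total

print(f"Accuracy:              {accuracy:.4f}")
print(f"Precision (satisfied): {precision_satisfied:.2f}")
print(f"Recall (satisfied):    {recall_satisfied:.2f}")
print(f"Majority baseline:     {majority_baseline:.4f}")
```

The baseline matters for interpretation: a model that always guessed "neutral or dissatisfied" would already be right about 56% of the time on this test set, so the reported accuracy should be judged against that figure rather than against 50%.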


In [None]:
model_list = [
    "Neural Network (10)",
    "Neural Network (50)",
    "Logistic Regression + Bagging",
    "KNN + Bagging",
    "Decision Tree + Bagging",
    "Random Forest",
    "Stacking (LR, DT, KNN)",
    "SVM (RBF Kernel)",
    "QDA",
    "XGBoost",
    "LightGBM",
    "Naive Bayes"
]

# Iterate through the list and evaluate each model
for model_name in model_list:
    print("="*60)
    evaluate_model(X, y, model_type=model_name)







Results for Neural Network (10):
Accuracy: 0.9543
F1 Score: 0.9542

Classification Report:
                         precision    recall  f1-score   support

neutral or dissatisfied       0.95      0.97      0.96     14456
              satisfied       0.96      0.93      0.95     11442

               accuracy                           0.95     25898
              macro avg       0.96      0.95      0.95     25898
           weighted avg       0.95      0.95      0.95     25898

Confusion Matrix:
[[14061   395]
 [  789 10653]]





Results for Neural Network (50):
Accuracy: 0.9610
F1 Score: 0.9609

Classification Report:
                         precision    recall  f1-score   support

neutral or dissatisfied       0.96      0.97      0.97     14456
              satisfied       0.97      0.94      0.96     11442

               accuracy                           0.96     25898
              macro avg       0.96      0.96      0.96     25898
           weighted avg       0.96      0.96      0.96     25898

Confusion Matrix:
[[14089   367]
 [  644 10798]]





Results for Logistic Regression + Bagging:
Accuracy: 0.8723
F1 Score: 0.8719

Classification Report:
                         precision    recall  f1-score   support

neutral or dissatisfied       0.87      0.91      0.89     14456
              satisfied       0.87      0.83      0.85     11442

               accuracy                           0.87     25898
              macro avg       0.87      0.87      0.87     25898
           weighted avg       0.87      0.87      0.87     25898

Confusion Matrix:
[[13091  1365]
 [ 1941  9501]]





Results for KNN + Bagging:
Accuracy: 0.9288
F1 Score: 0.9285

Classification Report:
                         precision    recall  f1-score   support

neutral or dissatisfied       0.92      0.96      0.94     14456
              satisfied       0.95      0.89      0.92     11442

               accuracy                           0.93     25898
              macro avg       0.93      0.92      0.93     25898
           weighted avg       0.93      0.93      0.93     25898

Confusion Matrix:
[[13903   553]
 [ 1291 10151]]





Results for Decision Tree + Bagging:
Accuracy: 0.9634
F1 Score: 0.9633

Classification Report:
                         precision    recall  f1-score   support

neutral or dissatisfied       0.96      0.98      0.97     14456
              satisfied       0.97      0.94      0.96     11442

               accuracy                           0.96     25898
              macro avg       0.96      0.96      0.96     25898
           weighted avg       0.96      0.96      0.96     25898

Confusion Matrix:
[[14148   308]
 [  641 10801]]





Results for Random Forest:
Accuracy: 0.9649
F1 Score: 0.9649

Classification Report:
                         precision    recall  f1-score   support

neutral or dissatisfied       0.96      0.98      0.97     14456
              satisfied       0.98      0.94      0.96     11442

               accuracy                           0.96     25898
              macro avg       0.97      0.96      0.96     25898
           weighted avg       0.97      0.96      0.96     25898

Confusion Matrix:
[[14189   267]
 [  641 10801]]
