In [3]:
import pandas as pd

# Load the new dataset
df_new = pd.read_csv('/content/test.csv')

# Display the first 5 rows of the DataFrame
display(df_new.head())

,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,19556,Female,Loyal Customer,52,Business travel,Eco,160,5,4,...,5,5,5,5,2,5,5,50,44.0,satisfied
1,1,90035,Female,Loyal Customer,36,Business travel,Business,2863,1,1,...,4,4,4,4,3,4,5,0,0.0,satisfied
2,2,12360,Male,disloyal Customer,20,Business travel,Eco,192,2,0,...,2,4,1,3,2,2,2,0,0.0,neutral or dissatisfied
3,3,77959,Male,Loyal Customer,44,Business travel,Business,3377,0,0,...,1,1,1,1,3,1,4,0,6.0,satisfied
4,4,36875,Female,Loyal Customer,49,Business travel,Eco,1182,2,3,...,2,2,2,2,4,2,4,0,20.0,satisfied


Now, let's check the data types and look for any missing values in the new dataset.

In [4]:
df_new.info()
print("\nMissing values per column:")
print(df_new.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25976 entries, 0 to 25975
Data columns (total 25 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0                         25976 non-null  int64  
 1   id                                 25976 non-null  int64  
 2   Gender                             25976 non-null  object 
 3   Customer Type                      25976 non-null  object 
 4   Age                                25976 non-null  int64  
 5   Type of Travel                     25976 non-null  object 
 6   Class                              25976 non-null  object 
 7   Flight Distance                    25976 non-null  int64  
 8   Inflight wifi service              25976 non-null  int64  
 9   Departure/Arrival time convenient  25976 non-null  int64  
 10  Ease of Online booking             25976 non-null  int64  
 11  Gate location                      25976 non-null  int

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/content/gender_submission.csv')

# Display the first 5 rows of the DataFrame
display(df.head())

,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


Now, let's check the data types and look for any missing values.

In [2]:
df.info()
print("\nMissing values per column:")
print(df.isnull().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  418 non-null    int64
 1   Survived     418 non-null    int64
dtypes: int64(2)
memory usage: 6.7 KB

Missing values per column:
PassengerId    0
Survived       0
dtype: int64


# Task
The next step is to handle the missing values in the 'Arrival Delay in Minutes' column in `df_new` by imputing them, likely using the median of that column.

## Handle Missing Values

### Subtask:
Address the 83 missing values in the 'Arrival Delay in Minutes' column, likely by imputation (e.g., using the median) to ensure the dataset is complete for model training.


**Reasoning**:
First, I'll calculate the median of the 'Arrival Delay in Minutes' column. Then, I'll fill the missing values in that column with the calculated median using the `fillna()` method. Finally, I'll print the count of missing values to confirm the imputation.



In [5]:
median_arrival_delay = df_new['Arrival Delay in Minutes'].median()
df_new['Arrival Delay in Minutes'].fillna(median_arrival_delay, inplace=True)

print(f"Median 'Arrival Delay in Minutes' used for imputation: {median_arrival_delay}")
print("\nMissing values after imputation:")
print(df_new.isnull().sum())

Median 'Arrival Delay in Minutes' used for imputation: 0.0

Missing values after imputation:
Unnamed: 0                           0
id                                   0
Gender                               0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Inflight wifi service                0
Departure/Arrival time convenient    0
Ease of Online booking               0
Gate location                        0
Food and drink                       0
Online boarding                      0
Seat comfort                         0
Inflight entertainment               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Inflight service                     0
Cleanliness                          0
Departure Delay in Minutes           0
Arrival De

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df_new['Arrival Delay in Minutes'].fillna(median_arrival_delay, inplace=True)
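The two alternatives named in the warning can be compared on a small made-up frame. This is only a sketch: the `toy` DataFrame and its values are hypothetical and stand in for `df_new`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing delay value (not the real dataset).
col = "Arrival Delay in Minutes"
toy = pd.DataFrame({col: [44.0, 0.0, np.nan, 6.0, 20.0]})

# median() skips NaN by default, so it is computed over the four known values.
median_delay = toy[col].median()

# Form 1: assign the filled Series back to the column.
filled_assign = toy.copy()
filled_assign[col] = filled_assign[col].fillna(median_delay)

# Form 2: call fillna on the DataFrame with a {column: value} mapping.
filled_mapping = toy.fillna({col: median_delay})

print(median_delay)
print(filled_assign[col].isna().sum(), filled_mapping[col].isna().sum())
```

Both forms operate on the DataFrame itself rather than on an intermediate column object, which is what the warning asks for.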


**Reasoning**:
The previous code raised a FutureWarning because `fillna(..., inplace=True)` was called on a column selected from the DataFrame (chained assignment), so the fill is applied to an intermediate object rather than reliably to `df_new` itself. To address this, I will assign the result of `fillna()` back to the column, which is the approach the warning recommends for future pandas versions.



In [6]:
median_arrival_delay = df_new['Arrival Delay in Minutes'].median()
df_new['Arrival Delay in Minutes'] = df_new['Arrival Delay in Minutes'].fillna(median_arrival_delay)

print(f"Median 'Arrival Delay in Minutes' used for imputation: {median_arrival_delay}")
print("\nMissing values after imputation:")
print(df_new.isnull().sum())

Median 'Arrival Delay in Minutes' used for imputation: 0.0

Missing values after imputation:
Unnamed: 0                           0
id                                   0
Gender                               0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Inflight wifi service                0
Departure/Arrival time convenient    0
Ease of Online booking               0
Gate location                        0
Food and drink                       0
Online boarding                      0
Seat comfort                         0
Inflight entertainment               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Inflight service                     0
Cleanliness                          0
Departure Delay in Minutes           0
Arrival De
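The median above is computed over every row of `df_new`, before any train/test split. A possible refinement, which is not part of this notebook's workflow, is to learn the median from the training rows only and reuse it on the test rows. The sketch below illustrates that idea with scikit-learn's `SimpleImputer` on small made-up arrays; `train_delay` and `test_delay` are hypothetical names and values.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical delay values; each array is a single feature column.
train_delay = np.array([[0.0], [10.0], [np.nan], [20.0]])
test_delay = np.array([[np.nan], [5.0]])

imputer = SimpleImputer(strategy="median")

# fit_transform learns the median from the training rows only.
train_filled = imputer.fit_transform(train_delay)

# transform reuses that training median; the test rows never influence it.
test_filled = imputer.transform(test_delay)

print(imputer.statistics_)
print(test_filled.ravel())
```

With a median of 0.0 on this dataset the practical difference is probably small, but the pattern keeps test data out of the preprocessing statistics.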

## Encode Categorical Features

### Subtask:
Convert categorical columns ('Gender', 'Customer Type', 'Type of Travel', 'Class', and 'satisfaction') into numerical representations using appropriate encoding techniques like one-hot encoding or label encoding, suitable for machine learning algorithms.


**Reasoning**:
The first step is to label encode the 'satisfaction' column as instructed, mapping 'satisfied' to 1 and 'neutral or dissatisfied' to 0.



In [7]:
df_new['satisfaction'] = df_new['satisfaction'].map({'satisfied': 1, 'neutral or dissatisfied': 0})
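One thing to keep in mind with `Series.map()` and a dictionary is that any label missing from the dictionary becomes `NaN` without raising an error. A quick check along the following lines can catch that; the `labels` Series here is hypothetical and deliberately includes one malformed value.

```python
import pandas as pd

# Hypothetical labels; the last one has different casing and a trailing space.
labels = pd.Series(["satisfied", "neutral or dissatisfied", "Satisfied "])
mapping = {"satisfied": 1, "neutral or dissatisfied": 0}

encoded = labels.map(mapping)

# Labels that were not found in the mapping show up as NaN after map().
unmapped = labels[encoded.isna()]
print(unmapped.tolist())
```

An empty `unmapped` result would indicate that every label was covered by the mapping.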

**Reasoning**:
Now that the 'satisfaction' column has been label encoded, I will apply one-hot encoding to the remaining specified categorical columns ('Gender', 'Customer Type', 'Type of Travel', 'Class') using `pd.get_dummies()` to convert them into numerical representations.



In [8]:
categorical_cols = ['Gender', 'Customer Type', 'Type of Travel', 'Class']
df_new = pd.get_dummies(df_new, columns=categorical_cols, drop_first=True)

print("DataFrame after one-hot encoding:")
display(df_new.head())
print("\nDataFrame info after encoding:")
df_new.info()

DataFrame after one-hot encoding:


,Unnamed: 0,id,Age,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,...,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Gender_Male,Customer Type_disloyal Customer,Type of Travel_Personal Travel,Class_Eco,Class_Eco Plus
0,0,19556,52,160,5,4,3,4,3,4,...,5,5,50,44.0,1,False,False,False,True,False
1,1,90035,36,2863,1,1,3,1,5,4,...,4,5,0,0.0,1,False,False,False,False,False
2,2,12360,20,192,2,0,2,4,2,2,...,2,2,0,0.0,0,True,True,False,True,False
3,3,77959,44,3377,0,0,0,2,3,4,...,1,4,0,6.0,1,True,False,False,False,False
4,4,36875,49,1182,2,3,4,3,4,1,...,2,4,0,20.0,1,False,False,False,True,False



DataFrame info after encoding:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 25976 entries, 0 to 25975
Data columns (total 26 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Unnamed: 0                         25976 non-null  int64  
 1   id                                 25976 non-null  int64  
 2   Age                                25976 non-null  int64  
 3   Flight Distance                    25976 non-null  int64  
 4   Inflight wifi service              25976 non-null  int64  
 5   Departure/Arrival time convenient  25976 non-null  int64  
 6   Ease of Online booking             25976 non-null  int64  
 7   Gate location                      25976 non-null  int64  
 8   Food and drink                     25976 non-null  int64  
 9   Online boarding                    25976 non-null  int64  
 10  Seat comfort                       25976 non-null  int64  
 11  Inflight entertainment
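To make the effect of `drop_first=True` easier to see, here is a minimal sketch on a made-up three-row frame. The `toy` data is hypothetical, and `dtype=int` is an optional choice added here to get 0/1 integers instead of the boolean columns shown above.

```python
import pandas as pd

# Hypothetical rows covering all three travel classes.
toy = pd.DataFrame({
    "Age": [52, 36, 20],
    "Gender": ["Female", "Female", "Male"],
    "Class": ["Eco", "Business", "Eco Plus"],
})

# Categories are ordered alphabetically, and drop_first removes the first one
# per column ('Female' and 'Business' here), which becomes the implicit baseline.
encoded = pd.get_dummies(toy, columns=["Gender", "Class"], drop_first=True, dtype=int)

print(list(encoded.columns))
print(encoded)
```

A row with zeros in both `Class_Eco` and `Class_Eco Plus` therefore represents the dropped baseline class, `Business`.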

## Drop Unnecessary Columns

### Subtask:
Remove the 'Unnamed: 0' and 'id' columns: both are row identifiers and do not contribute to the predictive power of the models.


**Reasoning**:
I will drop the specified 'Unnamed: 0' and 'id' columns from the `df_new` DataFrame, as they are identifiers and not useful for modeling, then display the head of the DataFrame to verify the removal.



In [9]:
columns_to_drop = ['Unnamed: 0', 'id']
df_new.drop(columns=columns_to_drop, inplace=True)

print("DataFrame after dropping 'Unnamed: 0' and 'id' columns:")
display(df_new.head())

DataFrame after dropping 'Unnamed: 0' and 'id' columns:


,Age,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,...,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction,Gender_Male,Customer Type_disloyal Customer,Type of Travel_Personal Travel,Class_Eco,Class_Eco Plus
0,52,160,5,4,3,4,3,4,3,5,...,5,5,50,44.0,1,False,False,False,True,False
1,36,2863,1,1,3,1,5,4,5,4,...,4,5,0,0.0,1,False,False,False,False,False
2,20,192,2,0,2,4,2,2,2,2,...,2,2,0,0.0,0,True,True,False,True,False
3,44,3377,0,0,0,2,3,4,4,1,...,1,4,0,6.0,1,True,False,False,False,False
4,49,1182,2,3,4,3,4,1,2,2,...,2,4,0,20.0,1,False,False,False,True,False
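A column named 'Unnamed: 0' usually appears when a CSV was saved together with its row index. As an alternative to dropping it afterwards, the index can be consumed at load time with `index_col=0`. The sketch below assumes that is how the column arose and uses a small in-memory CSV (`csv_text`, hypothetical) rather than the real file.

```python
import io

import pandas as pd

# Hypothetical CSV text whose first, unnamed column is a saved row index.
csv_text = ",id,Age\n0,19556,52\n1,90035,36\n"

# Default read: the unnamed first column is loaded as 'Unnamed: 0'.
default_read = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the first column becomes the index instead of a feature.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(default_read.columns))
print(list(indexed_read.columns))
```

The 'id' column would still need to be dropped explicitly, as done above.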


## Define Features and Target

### Subtask:
Separate the preprocessed dataset into features (X) and the target variable (y), where 'satisfaction' will be the target variable.


**Reasoning**:
I will separate the preprocessed dataset `df_new` into features (X) and target variable (y) by dropping the 'satisfaction' column to create X and selecting only the 'satisfaction' column to create y, then display the first few rows of both to verify the separation.



In [10]:
X = df_new.drop('satisfaction', axis=1)
y = df_new['satisfaction']

print("Features (X) head:")
display(X.head())
print("\nTarget (y) head:")
display(y.head())

Features (X) head:


,Age,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,Food and drink,Online boarding,Seat comfort,Inflight entertainment,...,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,Gender_Male,Customer Type_disloyal Customer,Type of Travel_Personal Travel,Class_Eco,Class_Eco Plus
0,52,160,5,4,3,4,3,4,3,5,...,2,5,5,50,44.0,False,False,False,True,False
1,36,2863,1,1,3,1,5,4,5,4,...,3,4,5,0,0.0,False,False,False,False,False
2,20,192,2,0,2,4,2,2,2,2,...,2,2,2,0,0.0,True,True,False,True,False
3,44,3377,0,0,0,2,3,4,4,1,...,3,1,4,0,6.0,True,False,False,False,False
4,49,1182,2,3,4,3,4,1,2,2,...,4,2,4,0,20.0,False,False,False,True,False



Target (y) head:


,satisfaction
0,1
1,1
2,0
3,1
4,1


## Split Data

### Subtask:
Divide the features (X) and target (y) into training and testing sets to properly evaluate the performance of the classification models.


**Reasoning**:
I will import the `train_test_split` function, then split the `X` and `y` data into training and testing sets, and finally print the shapes of the resulting sets to verify the split.



In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")
print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (20780, 23)
Shape of X_test: (5196, 23)
Shape of y_train: (20780,)
Shape of y_test: (5196,)
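The split above is purely random. If the two satisfaction classes are not evenly balanced, one optional variation is to pass `stratify=y` so both subsets keep the same class proportions. The sketch below shows this on synthetic data; `X_demo`, `y_demo`, and the 44/56 class ratio are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows with 44% positive labels (assumed ratio).
rng = np.random.default_rng(42)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series([1] * 440 + [0] * 560)

# stratify=y_demo preserves the class ratio in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Positive share in train: {y_tr.mean():.3f}")
print(f"Positive share in test:  {y_te.mean():.3f}")
```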


## Train and Evaluate Decision Tree

### Subtask:
Initialize and train a Decision Tree Classifier on the training data, then assess its performance on the test set using accuracy, precision, recall, and F1-score.


**Reasoning**:
I will import the necessary modules, `DecisionTreeClassifier` for model creation and the evaluation metrics (`accuracy_score`, `precision_score`, `recall_score`, `f1_score`) from `sklearn.metrics`, to prepare for training and evaluating the model.



In [12]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print("DecisionTreeClassifier and evaluation metrics imported successfully.")

DecisionTreeClassifier and evaluation metrics imported successfully.


**Reasoning**:
I will initialize a Decision Tree Classifier, train it using the training data, make predictions on the test data, and then calculate and print the accuracy, precision, recall, and F1-score to evaluate its performance.



In [13]:
dt_classifier = DecisionTreeClassifier(random_state=42)
dt_classifier.fit(X_train, y_train)

y_pred_dt = dt_classifier.predict(X_test)

accuracy_dt = accuracy_score(y_test, y_pred_dt)
precision_dt = precision_score(y_test, y_pred_dt)
recall_dt = recall_score(y_test, y_pred_dt)
f1_dt = f1_score(y_test, y_pred_dt)

print("Decision Tree Classifier Performance:")
print(f"Accuracy: {accuracy_dt:.4f}")
print(f"Precision: {precision_dt:.4f}")
print(f"Recall: {recall_dt:.4f}")
print(f"F1-Score: {f1_dt:.4f}")

Decision Tree Classifier Performance:
Accuracy: 0.9286
Precision: 0.9080
Recall: 0.9294
F1-Score: 0.9186
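For readers who want to see how these four metrics relate to each other, the following sketch derives them from confusion-matrix counts on a tiny hand-made example and compares the result with scikit-learn's functions. The `y_true` and `y_pred` lists are hypothetical and unrelated to the model above.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical labels: 1 = satisfied, 0 = neutral or dissatisfied.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]

# For binary labels, ravel() yields counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(tn, fp, fn, tp)                     # 3 1 1 3
print(accuracy, precision, recall, f1)    # all four equal 0.75 here
print(precision == precision_score(y_true, y_pred))
```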


## Train and Evaluate Random Forest

### Subtask:
Initialize and train a Random Forest Classifier on the training data, then assess its performance on the test set using the same classification metrics (accuracy, precision, recall, and F1-score).


**Reasoning**:
I will import the `RandomForestClassifier` from `sklearn.ensemble` to prepare for training the model.



In [14]:
from sklearn.ensemble import RandomForestClassifier

print("RandomForestClassifier imported successfully.")

RandomForestClassifier imported successfully.


**Reasoning**:
I will initialize a Random Forest Classifier, train it using the training data, make predictions on the test data, and then calculate and print the accuracy, precision, recall, and F1-score to evaluate its performance, similar to how the Decision Tree Classifier was evaluated.



In [15]:
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)

y_pred_rf = rf_classifier.predict(X_test)

accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)
f1_rf = f1_score(y_test, y_pred_rf)

print("Random Forest Classifier Performance:")
print(f"Accuracy: {accuracy_rf:.4f}")
print(f"Precision: {precision_rf:.4f}")
print(f"Recall: {recall_rf:.4f}")
print(f"F1-Score: {f1_rf:.4f}")

Random Forest Classifier Performance:
Accuracy: 0.9519
Precision: 0.9521
Recall: 0.9360
F1-Score: 0.9440
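Both sets of scores come from a single 80/20 split. One way to check whether the gap between the two models holds across different splits is k-fold cross-validation. The sketch below is an optional illustration on synthetic data from `make_classification`, not on the airline dataset; the sample size, `n_estimators=50`, and the five folds are arbitrary choices made to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for X and y.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=42),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# 5-fold cross-validated F1 for each model; each fold serves once as test data.
cv_f1 = {
    name: cross_val_score(model, X_demo, y_demo, cv=5, scoring="f1")
    for name, model in models.items()
}

for name, scores in cv_f1.items():
    print(f"{name}: mean F1 = {scores.mean():.4f} (std {scores.std():.4f})")
```

In the notebook itself, `X_demo` and `y_demo` would be replaced by `X` and `y`.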


## Final Task

### Subtask:
Summarize the performance of both Decision Tree and Random Forest classifiers, comparing their results and providing insights into which model performed better for this dataset.


## Summary:

### Q&A
The Random Forest Classifier performed better than the Decision Tree Classifier for this dataset.

### Data Analysis Key Findings

*   **Missing Value Imputation:** The 83 missing values in the 'Arrival Delay in Minutes' column were successfully imputed using the median value, which was 0.0.
*   **Categorical Feature Encoding:**
    *   The 'satisfaction' column was label encoded, mapping 'satisfied' to 1 and 'neutral or dissatisfied' to 0.
    *   Other categorical columns ('Gender', 'Customer Type', 'Type of Travel', 'Class') were one-hot encoded, dropping the first category to avoid multicollinearity.
*   **Irrelevant Column Removal:** The 'Unnamed: 0' and 'id' columns were dropped from the dataset as they are identifiers and not predictive features.
*   **Data Split:** The dataset was split into training and testing sets with an 80/20 ratio, resulting in 20,780 samples for training and 5,196 for testing.
*   **Decision Tree Classifier Performance:**
    *   Accuracy: 0.9286
    *   Precision: 0.9080
    *   Recall: 0.9294
    *   F1-Score: 0.9186
*   **Random Forest Classifier Performance:**
    *   Accuracy: 0.9519
    *   Precision: 0.9521
    *   Recall: 0.9360
    *   F1-Score: 0.9440
*   **Model Comparison:** The Random Forest Classifier consistently outperformed the Decision Tree Classifier across all evaluated metrics, showing higher accuracy (0.9519 vs. 0.9286), precision (0.9521 vs. 0.9080), recall (0.9360 vs. 0.9294), and F1-Score (0.9440 vs. 0.9186).

### Insights or Next Steps

*   On this single 80/20 split, the Random Forest scored higher than the single Decision Tree on all four metrics, with the largest gain in precision (0.9521 vs. 0.9080), making it the preferred model for predicting customer satisfaction on this dataset.
*   Further model optimization could involve hyperparameter tuning for the Random Forest Classifier to potentially achieve even better performance, or exploring other ensemble methods.
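
The hyperparameter tuning mentioned above could be approached with a grid search. The following is a minimal sketch under stated assumptions: it runs on synthetic data, and the grid values (`n_estimators`, `max_depth`), the three folds, and the F1 scoring are illustrative choices rather than recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Small illustrative grid; a real search would likely cover more values.
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,            # 3-fold cross-validation for each parameter combination
    scoring="f1",
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best cross-validated F1: {search.best_score_:.4f}")
```

Running the search on the training set only and keeping the test set for a final check avoids tuning against the data used for evaluation.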
