For this activity, you work as a consultant for an airline. The airline wants to know whether a better in-flight entertainment experience leads to higher customer satisfaction. They would like you to construct and evaluate a model that predicts whether a future customer would be satisfied with their services, given previous customer feedback about their flight experience.

The data for this activity covers a sample of 129,880 customers. It includes data points such as class, flight distance, and in-flight entertainment, among others. Your goal is to use a binomial logistic regression model to help the airline model and better understand this data.

**Import packages**

Import relevant Python packages. Use train_test_split, LogisticRegression, and various imports from sklearn.metrics to build, visualize, and evaluate the model.

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report

**Load the dataset**
The dataset Invistico_Airline.csv is loaded into a pandas DataFrame named df_original. As shown in this cell, the dataset has already been loaded for you. You do not need to download the .csv file or provide more code to access the dataset and proceed with this lab. Continue with this activity by completing the following instructions.

In [16]:
df_original = pd.read_csv("Invistico_Airline.csv")

**Output the first 10 rows**
Output the first 10 rows of data.

In [17]:
print("First 10 rows of the dataset:")
print(df_original.head(10))
print("\n")

First 10 rows of the dataset:
  satisfaction   Customer Type  Age   Type of Travel     Class  \
0    satisfied  Loyal Customer   65  Personal Travel       Eco   
1    satisfied  Loyal Customer   47  Personal Travel  Business   
2    satisfied  Loyal Customer   15  Personal Travel       Eco   
3    satisfied  Loyal Customer   60  Personal Travel       Eco   
4    satisfied  Loyal Customer   70  Personal Travel       Eco   
5    satisfied  Loyal Customer   30  Personal Travel       Eco   
6    satisfied  Loyal Customer   66  Personal Travel       Eco   
7    satisfied  Loyal Customer   10  Personal Travel       Eco   
8    satisfied  Loyal Customer   56  Personal Travel  Business   
9    satisfied  Loyal Customer   22  Personal Travel       Eco   

   Flight Distance  Seat comfort  Departure/Arrival time convenient  \
0              265             0                                  0   
1             2464             0                                  0   
2             2138            

**Explore the data**

Check the data type of each column. Note that logistic regression models expect numeric data.

In [18]:
print("Data types of each column:")
print(df_original.dtypes)
print("\n")

Data types of each column:
satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes    

**Check the number of satisfied customers in the dataset**

To predict customer satisfaction, check how many customers in the dataset are satisfied before modeling.

In [19]:
print("Customer satisfaction distribution:")
print(df_original['satisfaction'].value_counts())
print("\n")

Customer satisfaction distribution:
satisfaction
satisfied       71087
dissatisfied    58793
Name: count, dtype: int64
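
The raw counts are easier to compare with later accuracy figures when expressed as proportions. The following is a minimal sketch that recomputes them from the counts printed above. The variable names are illustrative and not part of the original notebook, and the baseline refers to the full dataset rather than the test split.

```python
# Counts copied from the value_counts() output above.
satisfied = 71087
dissatisfied = 58793
total = satisfied + dissatisfied

# Share of each class in the full dataset.
share_satisfied = satisfied / total
share_dissatisfied = dissatisfied / total

# A model that always predicts the majority class would reach this accuracy
# on the full dataset, so it serves as a simple baseline (an illustrative
# comparison, not a step from the original activity).
baseline_accuracy = max(share_satisfied, share_dissatisfied)

print(f"Satisfied: {share_satisfied:.4f}")
print(f"Dissatisfied: {share_dissatisfied:.4f}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

The classes are fairly balanced, so accuracy is a reasonable metric to report alongside precision and recall.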




**Check for missing values**

Logistic regression cannot be fit on data that contains missing values (scikit-learn's LogisticRegression raises an error on NaN input). Check how many missing values each column of the data contains.

In [20]:
print("Missing values in each column:")
print(df_original.isnull().sum())
print("\n")

Missing values in each column:
satisfaction                           0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Seat comfort                           0
Departure/Arrival time convenient      0
Food and drink                         0
Gate location                          0
Inflight wifi service                  0
Inflight entertainment                 0
Online support                         0
Ease of Online booking                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Cleanliness                            0
Online boarding                        0
Departure Delay in Minutes             0
Arrival Delay in Minutes             393
dtype: int64




**Drop the rows with missing values**

Drop the rows with missing values and save the resulting pandas DataFrame in a variable named df_subset.

In [21]:
# Drop the rows with missing values and save as df_subset
df_subset = df_original.dropna()

print(f"Original dataset shape: {df_original.shape}")
print(f"Dataset shape after dropping missing values: {df_subset.shape}")
print("\n")

Original dataset shape: (129880, 22)
Dataset shape after dropping missing values: (129487, 22)




**Prepare the data**



Convert the Inflight entertainment column to type float.

In [22]:
df_subset['Inflight entertainment'] = df_subset['Inflight entertainment'].astype(float)
print("Data types after conversion:")
print(df_subset.dtypes)
print("\n")

Data types after conversion:
satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment               float64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes  

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_subset['Inflight entertainment'] = df_subset['Inflight entertainment'].astype(float)
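
The message above is pandas' SettingWithCopyWarning. The conversion still took effect, as the float64 entry in the dtypes output shows, but pandas could not tell whether the assignment targeted an independent DataFrame. One common way to avoid the warning is to take an explicit copy after dropping rows. The sketch below illustrates this on a small, hypothetical stand-in for df_original; the names df_toy and df_clean are assumptions made for the example.

```python
import pandas as pd

# Hypothetical toy frame standing in for df_original.
df_toy = pd.DataFrame({
    "Inflight entertainment": [4, 2, 5, 3],
    "Arrival Delay in Minutes": [0.0, None, 12.0, 3.0],
})

# Take an explicit copy after dropping rows so later assignments
# clearly target a new, independent DataFrame.
df_clean = df_toy.dropna().copy()
df_clean["Inflight entertainment"] = df_clean["Inflight entertainment"].astype(float)

print(df_clean.dtypes)
print(df_toy.dtypes)  # the original frame is left unchanged
```

Applied to this activity, the same idea would mean calling .copy() on the result of df_original.dropna() before converting the column.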


**Convert the categorical column satisfaction into numeric**

Convert the categorical column satisfaction into numeric through one-hot encoding.

In [23]:
df_subset = pd.get_dummies(df_subset, columns=['satisfaction'], drop_first=True)
print("After one-hot encoding satisfaction column:")
print(df_subset.head())
print("\n")

After one-hot encoding satisfaction column:
    Customer Type  Age   Type of Travel     Class  Flight Distance  \
0  Loyal Customer   65  Personal Travel       Eco              265   
1  Loyal Customer   47  Personal Travel  Business             2464   
2  Loyal Customer   15  Personal Travel       Eco             2138   
3  Loyal Customer   60  Personal Travel       Eco              623   
4  Loyal Customer   70  Personal Travel       Eco              354   

   Seat comfort  Departure/Arrival time convenient  Food and drink  \
0             0                                  0               0   
1             0                                  0               0   
2             0                                  0               0   
3             0                                  0               0   
4             0                                  0               0   

   Gate location  Inflight wifi service  ...  Ease of Online booking  \
0              2                      2  .

**Output the first 10 rows of df_subset**

To examine what one-hot encoding did to the DataFrame, output the first 10 rows of df_subset.

In [24]:
print("First 10 rows after preprocessing:")
print(df_subset.head(10))
print("\n")

First 10 rows after preprocessing:
    Customer Type  Age   Type of Travel     Class  Flight Distance  \
0  Loyal Customer   65  Personal Travel       Eco              265   
1  Loyal Customer   47  Personal Travel  Business             2464   
2  Loyal Customer   15  Personal Travel       Eco             2138   
3  Loyal Customer   60  Personal Travel       Eco              623   
4  Loyal Customer   70  Personal Travel       Eco              354   
5  Loyal Customer   30  Personal Travel       Eco             1894   
6  Loyal Customer   66  Personal Travel       Eco              227   
7  Loyal Customer   10  Personal Travel       Eco             1812   
8  Loyal Customer   56  Personal Travel  Business               73   
9  Loyal Customer   22  Personal Travel       Eco             1556   

   Seat comfort  Departure/Arrival time convenient  Food and drink  \
0             0                                  0               0   
1             0                                  0    
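
Because the printed output is wide, the new column is hard to spot. The following sketch shows the effect of the same get_dummies call on a small, hypothetical DataFrame (the name toy and its values are made up for illustration).

```python
import pandas as pd

# Hypothetical toy data with the same two labels as the satisfaction column.
toy = pd.DataFrame({
    "satisfaction": ["satisfied", "dissatisfied", "satisfied"],
    "Age": [65, 47, 15],
})

# drop_first=True drops the first category in sorted order ("dissatisfied"),
# leaving a single indicator column for "satisfied".
encoded = pd.get_dummies(toy, columns=["satisfaction"], drop_first=True)

print(encoded.columns.tolist())
print(encoded["satisfaction_satisfied"].astype(int).tolist())
```

The satisfaction column is replaced by a single satisfaction_satisfied indicator, which is the column used as the target later in this activity.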

**Create the training and testing data**

Create X (the feature) and y (the target) with only the necessary variables. Then put 70% of the data into a training set and the remaining 30% into a testing set.

In [25]:
X = df_subset[['Inflight entertainment']]  # Feature matrix
y = df_subset['satisfaction_satisfied']    # Target variable

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print(f"Training set size: {X_train.shape[0]}")
print(f"Testing set size: {X_test.shape[0]}")
print("\n")

Training set size: 90640
Testing set size: 38847
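
The reported sizes can be checked independently of the dataset. The sketch below applies the same split settings to a stand-in array with 129,487 entries, the row count of df_subset after dropping missing values. The names rows, train_rows, and test_rows are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in array with the same number of rows as df_subset (129,487).
rows = np.arange(129487)

# Same split settings as the notebook cell above.
train_rows, test_rows = train_test_split(rows, test_size=0.3, random_state=42)

print(len(train_rows), len(test_rows))
print(f"Test share: {len(test_rows) / len(rows):.4f}")
```

Thirty percent of 129,487 is 38,846.1, so the test count of 38,847 is consistent with the fractional row being rounded up into the test set.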




**Fit a LogisticRegression model to the data**

Build a logistic regression model and fit the model to the training data.

In [26]:
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
print("Model trained successfully!")
print("\n")

Model trained successfully!




**Obtain parameter estimates**

Output the two parameters of the fitted model: the coefficient and the intercept.

In [27]:
# Obtain parameter estimates
print("Model coefficients:")
print(f"Coefficient: {model.coef_[0][0]}")
print(f"Intercept: {model.intercept_[0]}")
print("\n")

Model coefficients:
Coefficient: 0.9975288285811373
Intercept: -3.193590542302588




In [28]:
print("Model Intercept:")
print(f"Intercept: {model.intercept_[0]}")

Model Intercept:
Intercept: -3.193590542302588
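
To make the two parameters concrete, the following sketch plugs them into the logistic function, P(satisfied) = 1 / (1 + exp(-(intercept + coefficient * rating))). The function name predicted_probability is hypothetical, and the rating range of 0 through 5 is an assumption about the survey scale.

```python
import math

# Parameter estimates copied from the output above.
coefficient = 0.9975288285811373
intercept = -3.193590542302588

def predicted_probability(rating):
    """Return P(satisfied) for a given in-flight entertainment rating."""
    log_odds = intercept + coefficient * rating
    return 1.0 / (1.0 + math.exp(-log_odds))

# Each one-point increase in the rating multiplies the odds by exp(coefficient).
odds_ratio = math.exp(coefficient)
print(f"Odds ratio per rating point: {odds_ratio:.3f}")

# Assumed rating scale: 0 through 5.
for rating in range(0, 6):
    print(rating, round(predicted_probability(rating), 4))
```

A positive coefficient means that higher in-flight entertainment ratings are associated with a higher predicted probability of satisfaction.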


**Predict the outcome for the test dataset**

Now that you've completed your regression, review and analyze your results. First, input the holdout dataset into the predict function to get the predicted labels from the model. Save these predictions as a variable called y_pred.

In [36]:
# Predict the outcome for the test dataset
y_pred = model.predict(X_test)

**Print out y_pred** 

In order to examine the predictions, print out y_pred.

In [30]:
# Print out y_pred
print("First 20 predictions:")
print(y_pred[:20])
print("\n")

First 20 predictions:
[ True False False  True  True False  True False  True  True False False
  True  True False  True False False False False]




**Use the predict_proba and predict functions on X_test**

In [31]:
# Use the predict_proba and predict functions on X_test
y_pred_proba = model.predict_proba(X_test)
print("Prediction probabilities (first 10 rows):")
print(y_pred_proba[:10])
print("\n")

print("Class predictions (first 10 rows):")
print(y_pred[:10])
print("\n")

Prediction probabilities (first 10 rows):
[[0.14257646 0.85742354]
 [0.55008251 0.44991749]
 [0.89989529 0.10010471]
 [0.31076939 0.68923061]
 [0.31076939 0.68923061]
 [0.55008251 0.44991749]
 [0.14257646 0.85742354]
 [0.76826369 0.23173631]
 [0.31076939 0.68923061]
 [0.31076939 0.68923061]]


Class predictions (first 10 rows):
[ True False False  True  True False  True False  True  True]
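
The two outputs are related: for a binary model, the predicted class corresponds to whichever probability is larger, which is equivalent to a 0.5 threshold on the second column. The sketch below checks this against the values printed above; the variable names are illustrative.

```python
import numpy as np

# Probabilities copied from the predict_proba output above.
# Column 0 is P(not satisfied); column 1 is P(satisfied).
proba = np.array([
    [0.14257646, 0.85742354],
    [0.55008251, 0.44991749],
    [0.89989529, 0.10010471],
    [0.31076939, 0.68923061],
    [0.31076939, 0.68923061],
    [0.55008251, 0.44991749],
    [0.14257646, 0.85742354],
    [0.76826369, 0.23173631],
    [0.31076939, 0.68923061],
    [0.31076939, 0.68923061],
])

# Label a row as satisfied when P(satisfied) exceeds 0.5.
manual_pred = proba[:, 1] > 0.5

# Class predictions copied from the predict output above.
printed_pred = np.array([True, False, False, True, True,
                         False, True, False, True, True])

print(manual_pred)
print(np.array_equal(manual_pred, printed_pred))
```

Because the model has a single feature with a handful of distinct ratings, only a few distinct probability values appear in the output.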




**Analyze the results**

Print out the model's accuracy, precision, recall, and F1 score.

In [32]:
# Analyze the results
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Model Performance Metrics:")
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1 Score: {f1:.4f}")
print("\n")

Model Performance Metrics:
Accuracy: 0.8015
Precision: 0.8161
Recall: 0.8215
F1 Score: 0.8188




**Produce a confusion matrix**

Data professionals often like to know the types of errors made by an algorithm. To obtain this information, produce a confusion matrix.

In [34]:
# Produce a confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(cm)
print("\n")

Confusion Matrix:
[[13714  3925]
 [ 3785 17423]]
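
The confusion matrix contains everything needed to reproduce the four metrics reported earlier. The following sketch recomputes them by hand from the printed counts. The variable names are illustrative, and the layout assumes scikit-learn's convention of true labels in rows and predicted labels in columns.

```python
import numpy as np

# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels.
cm_printed = np.array([[13714, 3925],
                       [3785, 17423]])

# Unpack in row-major order: true negatives, false positives,
# false negatives, true positives.
tn, fp, fn, tp = cm_printed.ravel()

accuracy_from_cm = (tp + tn) / cm_printed.sum()
precision_from_cm = tp / (tp + fp)
recall_from_cm = tp / (tp + fn)
f1_from_cm = 2 * precision_from_cm * recall_from_cm / (precision_from_cm + recall_from_cm)

print(f"Accuracy: {accuracy_from_cm:.4f}")
print(f"Precision: {precision_from_cm:.4f}")
print(f"Recall: {recall_from_cm:.4f}")
print(f"F1 Score: {f1_from_cm:.4f}")
```

The model makes 3,925 false positive errors (dissatisfied customers predicted as satisfied) and 3,785 false negative errors (satisfied customers predicted as dissatisfied), so the two error types are roughly balanced.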


