You work as a consultant for an airline. The airline is interested in knowing if a better in-flight entertainment experience leads to higher customer satisfaction. They would like you to construct and evaluate a model that predicts whether a future customer would be satisfied with their services given previous customer feedback about their flight experience.

The data for this activity comes from a sample of 129,880 customers. It includes fields such as class, flight distance, and in-flight entertainment, among others. Your goal is to use a binomial logistic regression model to help the airline model and better understand this data.

In [1]:
# Standard operational package imports.
import numpy as np
import pandas as pd

# Important imports for preprocessing, modeling, and evaluation.
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import sklearn.metrics as metrics

# Visualization package imports.
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df_original = pd.read_csv(r"E:\tableau google\Invistico_Airline.csv")

FileNotFoundError: [Errno 2] No such file or directory: 'E:\\tableau google\\Invistico_Airline.csv'

In [None]:
df_original.head(10)

# Explore the data

In [None]:
df_original.dtypes

Before modeling, check how many customers in the dataset are satisfied, since satisfaction is the outcome the model will predict.

In [None]:
# Pass dropna=False so that NaN values are counted as well.
df_original['satisfaction'].value_counts(dropna=False)

# Check for missing values

In [None]:
df_original.isnull().sum()

***Should you remove rows where the `Arrival Delay in Minutes` column has missing values, even though the airline is more interested in the `Inflight entertainment` column?***

For this activity, the airline is specifically interested in knowing if a better in-flight entertainment experience leads to higher customer satisfaction. The `Arrival Delay in Minutes` column won't be included in the binomial logistic regression model; however, the airline might become interested in this column in the future.

For now, the missing values should be removed for two reasons:

* There are only 393 missing values out of the total of 129,880, so these are a small percentage of the total.
* This column might impact the relationship between entertainment and satisfaction.
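
As a quick check of the first reason, the share of affected rows can be computed directly from the two counts quoted above. This is a minimal sketch; the variable names are illustrative and are not part of the original notebook.

```python
# Counts quoted in the text above.
n_missing = 393
n_total = 129_880

# Share of the sample with a missing value, as a percentage.
pct_missing = n_missing / n_total * 100
print(f"{pct_missing:.2f}% of the sample has a missing value")
```

The result is roughly 0.3 percent, so dropping these rows removes very little of the sample.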

In [None]:
# Drop rows with missing values, then reset the index so it is contiguous over the remaining rows.

df_subset = df_original.dropna(axis=0).reset_index(drop=True)

# Prepare the data

If you want to plot the model with `sns.regplot` later in the notebook, the independent variable `Inflight entertainment` cannot be of type int and the dependent variable `satisfaction` cannot be of type object.

Convert the `Inflight entertainment` column to type float.

In [None]:
df_subset = df_subset.astype({"Inflight entertainment": float})

In [None]:
# Convert the categorical column `satisfaction` to numeric through one-hot encoding.

df_subset['satisfaction'] = OneHotEncoder(drop='first').fit_transform(df_subset[['satisfaction']]).toarray()
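
To make the encoding concrete, the sketch below imitates in plain Python what `OneHotEncoder(drop='first')` does for a two-category column: the categories are sorted, the indicator column for the first one is dropped, and the remaining column is 1.0 wherever the second category appears. It is a stand-in for illustration, not the scikit-learn implementation, and the label strings `"dissatisfied"` and `"satisfied"` are assumed; check them against the `value_counts` output earlier in the notebook.

```python
def encode_drop_first(values):
    """Imitate one-hot encoding with the first category dropped (binary case only)."""
    categories = sorted(set(values))  # sorted order decides which category is dropped
    if len(categories) != 2:
        raise ValueError("this sketch only handles exactly two categories")
    kept = categories[1]  # the first category is dropped, the second is kept
    return [1.0 if v == kept else 0.0 for v in values]

# Hypothetical labels, assumed for illustration only.
labels = ["satisfied", "dissatisfied", "satisfied", "satisfied"]
encoded = encode_drop_first(labels)
print(encoded)
```

Under that assumption, a value of 1.0 in the encoded column means the customer was satisfied, so "satisfied" is the positive class for the model and for the precision and recall scores.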

In [None]:
df_subset.head(10)

# Create the training and testing data

Put 70% of the data into a training set and the remaining 30% into a testing set. Create `X` (a DataFrame) and `y` (a Series) containing only the necessary variables.

In [None]:
X = df_subset[["Inflight entertainment"]]
y = df_subset["satisfaction"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Fit a LogisticRegression model to the data

Build a logistic regression model and fit the model to the training data. 

In [None]:
clf = LogisticRegression().fit(X_train,y_train)

# Obtain parameter estimates

In [None]:
print("Model coefficient:", clf.coef_)
print("Model intercept:", clf.intercept_)
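
The two estimates define the fitted curve: the model's predicted probability of satisfaction is the logistic function applied to `intercept + coefficient * rating`. The sketch below evaluates that formula by hand. The coefficient and intercept are placeholders chosen for illustration, not the values this notebook produces, and the 0 to 5 rating range is assumed; after fitting, substitute `clf.coef_[0][0]` and `clf.intercept_[0]`.

```python
import math

def satisfaction_probability(rating, coef, intercept):
    """Logistic function applied to the linear predictor."""
    log_odds = intercept + coef * rating
    return 1.0 / (1.0 + math.exp(-log_odds))

# Placeholder parameters (hypothetical, NOT fitted values).
coef, intercept = 1.0, -3.0

# Assumed rating scale of 0 through 5.
probabilities = [satisfaction_probability(r, coef, intercept) for r in range(6)]
for rating, p in enumerate(probabilities):
    print(f"rating {rating}: P(satisfied) = {p:.3f}")

# Each one-point increase in the rating multiplies the odds of satisfaction by exp(coef).
odds_ratio = math.exp(coef)
print(f"odds ratio per rating point: {odds_ratio:.3f}")
```

A positive coefficient therefore means the predicted probability of satisfaction rises with the entertainment rating, which is the S-shaped curve that `sns.regplot(..., logistic=True)` draws.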

# Create a plot of your model

In [None]:
sns.regplot(x="Inflight entertainment", y="satisfaction", data=df_subset, logistic=True, ci=None)

# Predict the outcome for the test dataset

In [None]:
y_pred = clf.predict(X_test)
print(y_pred)

# Use the `predict_proba` and `predict` functions on `X_test`

In [None]:
clf.predict_proba(X_test)
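
`predict_proba` returns one row per customer with two columns, the estimated probability of class 0 and of class 1. For a binary model, `predict` corresponds to choosing class 1 when its probability exceeds 0.5. A plain-Python sketch of that relationship, using made-up probability rows rather than real model output:

```python
# Hypothetical rows in the layout [P(class 0), P(class 1)]; not real model output.
proba_rows = [
    [0.85, 0.15],
    [0.40, 0.60],
    [0.08, 0.92],
]

# Choose class 1 when its probability exceeds 0.5, otherwise class 0.
predicted = [1.0 if p1 > 0.5 else 0.0 for _, p1 in proba_rows]
print(predicted)
```

Keeping the probabilities around is useful if the airline later wants a threshold other than 0.5, for example to flag customers at risk of dissatisfaction more aggressively.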

In [None]:
print("Accuracy:", "%.6f" % metrics.accuracy_score(y_test, y_pred))
print("Precision:", "%.6f" % metrics.precision_score(y_test, y_pred))
print("Recall:", "%.6f" % metrics.recall_score(y_test, y_pred))
print("F1 Score:", "%.6f" % metrics.f1_score(y_test, y_pred))
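
All four scores can be derived from the counts in a confusion matrix, which helps when reading them side by side. The sketch below recomputes them from hypothetical counts; the numbers are invented for illustration and are not this model's results, and the function name is not part of scikit-learn.

```python
def classification_scores(tp, fp, fn, tn):
    """Return accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the predicted positives, how many were right
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Hypothetical counts, for illustration only.
accuracy, precision, recall, f1 = classification_scores(tp=80, fp=20, fn=10, tn=90)
print(f"Accuracy:  {accuracy:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall:    {recall:.3f}")
print(f"F1 Score:  {f1:.3f}")
```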

# Produce a confusion matrix

In [None]:
cm = metrics.confusion_matrix(y_test, y_pred, labels = clf.classes_)
disp = metrics.ConfusionMatrixDisplay(confusion_matrix = cm,display_labels = clf.classes_)
disp.plot()

Two of the quadrants contain fewer than 4,000 observations each, which is relatively low. These are the off-diagonal cells of the matrix: the false positives and false negatives.

Additionally, the other two quadrants (the true positives and true negatives) each contain more than 13,000 observations.

Using more than one independent variable in the model training process could improve model performance, because other variables, such as `Departure Delay in Minutes`, could plausibly influence customer satisfaction.

# Conclusion

* Logistic regression correctly predicted satisfaction 80.2 percent of the time.
* The confusion matrix is useful, as it shows similar numbers of true positives and true negatives.
* Customers who rated in-flight entertainment highly were more likely to be satisfied. Improving in-flight entertainment should lead to better customer satisfaction.
* The model's 80.2 percent accuracy is an improvement over the dataset's customer satisfaction rate of 54.7 percent.
* The success of the model suggests that the airline should invest more in model development to examine whether adding more independent variables leads to better results. Building this model could not only be useful in predicting whether a customer would be satisfied but could also lead to a better understanding of which independent variables lead to happier customers.
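
The comparison with 54.7 percent above works as a majority-class baseline: a model that always predicted "satisfied" would be right about as often as satisfied customers occur in the data. A minimal sketch of that comparison, using only the two figures quoted in the conclusion (the variable names are illustrative):

```python
# Figures quoted in the conclusion above, in percent.
model_accuracy = 80.2
baseline_accuracy = 54.7  # accuracy of always predicting "satisfied"

# Gain of the fitted model over the baseline, in percentage points.
improvement = round(model_accuracy - baseline_accuracy, 1)
print(f"Improvement over baseline: {improvement} percentage points")
```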