# Airline Satisfaction Prediction
We have a dataset from an airline company with various features, and we aim to predict customer satisfaction based on them.

## I) Import data & first look

In [3]:
import os

# Load the repository path from the text file
with open(".path_repo.txt", "r") as f:
    path_repo = f.read().strip()

os.chdir(path_repo)

In [4]:
import pandas as pd

airline_df = pd.read_csv('data/airline_satisfaction.csv')
airline_df

,Unnamed: 0,id,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,70172,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,1,5047,Male,disloyal Customer,25,Business travel,Business,235,3,2,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,2,110028,Female,Loyal Customer,26,Business travel,Business,1142,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,3,24026,Female,Loyal Customer,25,Business travel,Business,562,2,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,4,119299,Male,Loyal Customer,61,Business travel,Business,214,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,103899,94171,Female,disloyal Customer,23,Business travel,Eco,192,2,1,...,2,3,1,4,2,3,2,3,0.0,neutral or dissatisfied
103900,103900,73097,Male,Loyal Customer,49,Business travel,Business,2347,4,4,...,5,5,5,5,5,5,4,0,0.0,satisfied
103901,103901,68825,Male,disloyal Customer,30,Business travel,Business,1995,1,1,...,4,3,2,4,5,5,4,7,14.0,neutral or dissatisfied
103902,103902,54173,Female,disloyal Customer,22,Business travel,Eco,1000,1,1,...,1,4,5,1,5,4,1,0,0.0,neutral or dissatisfied


In [5]:
airline_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 103904 entries, 0 to 103903
Data columns (total 25 columns):
 #   Column                             Non-Null Count   Dtype  
---  ------                             --------------   -----  
 0   Unnamed: 0                         103904 non-null  int64  
 1   id                                 103904 non-null  int64  
 2   Gender                             103904 non-null  object 
 3   Customer Type                      103904 non-null  object 
 4   Age                                103904 non-null  int64  
 5   Type of Travel                     103904 non-null  object 
 6   Class                              103904 non-null  object 
 7   Flight Distance                    103904 non-null  int64  
 8   Inflight wifi service              103904 non-null  int64  
 9   Departure/Arrival time convenient  103904 non-null  int64  
 10  Ease of Online booking             103904 non-null  int64  
 11  Gate location                      1039

Our target and several of our features are stored as strings, so they cannot be fed to a model yet, and the `Arrival Delay in Minutes` column has missing values. Let's clean that up!
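
A quick way to see which columns have missing values is to count them per column. The sketch below uses a small made-up frame (`toy_df`, hypothetical values) standing in for `airline_df`; the same two lines should work on the real data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for airline_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "Departure Delay in Minutes": [25, 1, 0, 11],
    "Arrival Delay in Minutes": [18.0, np.nan, 0.0, 9.0],
    "Age": [13, 25, 26, 25],
})

# Count missing values per column and keep only the columns that have any.
missing_counts = toy_df.isna().sum()
columns_with_missing = missing_counts[missing_counts > 0]
print(columns_with_missing)
```

On the toy frame only `Arrival Delay in Minutes` is reported, with one missing value.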

## II) Cleaning

In [6]:
# Drop the leftover index column and the customer id, which carry no predictive information
airline_df.drop(columns=['Unnamed: 0', 'id'], inplace=True)
airline_df

,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Inflight wifi service,Departure/Arrival time convenient,Ease of Online booking,Gate location,...,Inflight entertainment,On-board service,Leg room service,Baggage handling,Checkin service,Inflight service,Cleanliness,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,Male,Loyal Customer,13,Personal Travel,Eco Plus,460,3,4,3,1,...,5,4,3,4,4,5,5,25,18.0,neutral or dissatisfied
1,Male,disloyal Customer,25,Business travel,Business,235,3,2,3,3,...,1,1,5,3,1,4,1,1,6.0,neutral or dissatisfied
2,Female,Loyal Customer,26,Business travel,Business,1142,2,2,2,2,...,5,4,3,4,4,4,5,0,0.0,satisfied
3,Female,Loyal Customer,25,Business travel,Business,562,2,5,5,5,...,2,2,5,3,1,4,2,11,9.0,neutral or dissatisfied
4,Male,Loyal Customer,61,Business travel,Business,214,3,3,3,3,...,3,3,4,4,3,3,3,0,0.0,satisfied
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103899,Female,disloyal Customer,23,Business travel,Eco,192,2,1,2,3,...,2,3,1,4,2,3,2,3,0.0,neutral or dissatisfied
103900,Male,Loyal Customer,49,Business travel,Business,2347,4,4,4,4,...,5,5,5,5,5,5,4,0,0.0,satisfied
103901,Male,disloyal Customer,30,Business travel,Business,1995,1,1,1,3,...,4,3,2,4,5,5,4,7,14.0,neutral or dissatisfied
103902,Female,disloyal Customer,22,Business travel,Eco,1000,1,1,1,5,...,1,4,5,1,5,4,1,0,0.0,neutral or dissatisfied


In [7]:
# Assign the result back to the column: calling fillna(..., inplace=True) on a
# column selection raises a pandas FutureWarning and will stop working in pandas 3.0
airline_df['Arrival Delay in Minutes'] = airline_df['Arrival Delay in Minutes'].fillna(
    airline_df['Arrival Delay in Minutes'].median()
)
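
The cell above computes the median on the whole dataset, before the train/test split. A stricter variant, sketched here on made-up values, learns the median from the train part only and reuses it on the test part. `SimpleImputer` is swapped in for illustration and is not what this notebook uses; `train_part` and `test_part` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical train/test parts with missing arrival delays.
train_part = pd.DataFrame({"Arrival Delay in Minutes": [18.0, np.nan, 0.0, 9.0, 6.0]})
test_part = pd.DataFrame({"Arrival Delay in Minutes": [np.nan, 14.0]})

# Learn the median from the train part only, then apply it to both parts.
imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train_part)
test_filled = imputer.transform(test_part)

print(imputer.statistics_)  # median learned from the non-missing train values
```

Here the learned median is 7.5 (the middle of 6 and 9), and the missing test value is filled with that same number.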


## III) Target encoding & train test split

In [8]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

airline_target = encoder.fit_transform(airline_df['satisfaction'])
airline_target = pd.Series(airline_target, name='satisfaction')
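
It helps to know which label became which integer. `LabelEncoder` sorts the classes alphabetically, so the mapping can be read from its `classes_` attribute. A minimal sketch on a made-up target (`toy_target` and `toy_encoder` are illustrative names):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up target using the two labels found in the dataset.
toy_target = pd.Series(
    ["neutral or dissatisfied", "satisfied", "satisfied", "neutral or dissatisfied"],
    name="satisfaction",
)

toy_encoder = LabelEncoder()
encoded = toy_encoder.fit_transform(toy_target)

# classes_ is sorted, and the position of each label is its integer code.
label_to_code = {label: code for code, label in enumerate(toy_encoder.classes_)}
print(label_to_code)  # {'neutral or dissatisfied': 0, 'satisfied': 1}

# inverse_transform maps the codes back to the original labels.
decoded = toy_encoder.inverse_transform(encoded)
```

So in this setup `1` means "satisfied" and `0` means "neutral or dissatisfied".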

In [9]:
from sklearn.model_selection import train_test_split

X = airline_df.drop(columns='satisfaction')

X_train, X_test, y_train, y_test = train_test_split(X, airline_target, test_size=.2, random_state=1)


## IV) Preprocessing

The next step is to preprocess our data.
Let's deal with the categorical features in our dataset first.

Several of these contain string labels that must be converted to numeric values before a model can use them.

We encode the columns `['Gender', 'Customer Type', 'Type of Travel', 'Class']` with sklearn's `OneHotEncoder`, producing the dataframes `X_train_cat` and `X_test_cat`.

In [10]:
from sklearn.preprocessing import OneHotEncoder

# Select the categorical (string) columns from the train and test sets
X_train_cat = X_train.select_dtypes(include=['object'])
X_test_cat = X_test.select_dtypes(include=['object'])

# Define the OneHotEncoder and fit it on the train set only
ohe = OneHotEncoder(drop='if_binary', sparse_output=False)
ohe.fit(X_train_cat)

# Transform both sets; the resulting dataframes get a fresh 0..n-1 index
X_train_cat = pd.DataFrame(ohe.transform(X_train_cat), columns=ohe.get_feature_names_out())
X_test_cat = pd.DataFrame(ohe.transform(X_test_cat), columns=ohe.get_feature_names_out())


Next, we address the numeric features in our dataset and create dataframes that contain the standardized numeric features (zero mean and unit variance on the train set).

In [11]:
from sklearn.preprocessing import StandardScaler

# Extract only the numeric columns
X_train_num = X_train.select_dtypes(["int", "float"])
X_test_num = X_test.select_dtypes(["int", "float"])

# Fit the scaler on the train set only, then apply the same scaling to the test set
sc = StandardScaler()
X_train_num = sc.fit_transform(X_train_num)
X_test_num = sc.transform(X_test_num)

# Rebuild two dataframes with the original column names
X_train_num = pd.DataFrame(X_train_num, columns=sc.get_feature_names_out())
X_test_num = pd.DataFrame(X_test_num, columns=sc.get_feature_names_out())
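
As a sanity check, the scaled train columns should have a mean of about 0 and a standard deviation of about 1, while the test set is scaled with the train statistics. A minimal sketch on made-up numbers (`toy_train_num` and `toy_test_num` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up numeric features.
toy_train_num = pd.DataFrame({"Age": [20, 30, 40], "Flight Distance": [100, 200, 600]})
toy_test_num = pd.DataFrame({"Age": [50], "Flight Distance": [300]})

toy_scaler = StandardScaler()
train_scaled = toy_scaler.fit_transform(toy_train_num)  # learns mean and std from train
test_scaled = toy_scaler.transform(toy_test_num)        # reuses the train mean and std

train_means = train_scaled.mean(axis=0)
train_stds = train_scaled.std(axis=0)
print(train_means, train_stds)
print(test_scaled)
```

The test values are not centered on 0 themselves, which is expected: they are expressed relative to the train distribution.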

Now that we have processed our numeric and categorical features, let's combine them into two variables, `X_train_preprocessed` and `X_test_preprocessed`, that contain **all** of our **preprocessed features** for the train and test data.

In [12]:
# Both parts were rebuilt with a fresh 0..n-1 index, so concat aligns them row by row
X_train_preprocessed = pd.concat((X_train_num, X_train_cat), axis=1)
X_test_preprocessed = pd.concat((X_test_num, X_test_cat), axis=1)

## V) Model fitting & scoring

We will start with a very simple classifier: a logistic regression model fitted on our train data. We then evaluate its accuracy score on our test data!

In [13]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

model.fit(X_train_preprocessed, y_train)

# Compute the accuracy on the test set
model.score(X_test_preprocessed, y_test)

# Equivalent, using accuracy_score from sklearn:
# from sklearn.metrics import accuracy_score
# y_pred = model.predict(X_test_preprocessed)
# accuracy_score(y_test, y_pred)

0.8762330975410231
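
For a classifier, `score` and `accuracy_score` return the same number, and a confusion matrix shows where the errors fall. The sketch below checks this on synthetic data from `make_classification`, not on the airline dataset; all the `demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary classification data, used only for illustration.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=1)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

demo_model = LogisticRegression().fit(Xd_train, yd_train)
yd_pred = demo_model.predict(Xd_test)

# Two ways to get the same accuracy.
acc_from_score = demo_model.score(Xd_test, yd_test)
acc_from_metric = accuracy_score(yd_test, yd_pred)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(yd_test, yd_pred)
print(acc_from_score, acc_from_metric)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so its sum divided by the total is the accuracy again.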

88% accuracy isn't bad, but maybe we can do better...

Let's try using a more complex model like a random forest classifier.

In [14]:
from sklearn.ensemble import RandomForestClassifier

forest_model = RandomForestClassifier()

forest_model.fit(X_train_preprocessed, y_train)

forest_model.score(X_test_preprocessed, y_test)

0.961214571002358

96% accuracy is a substantial improvement on the simple model we ran earlier!

This means that our model predicts correctly about 96 times out of 100 on our test set. We can expect a similar accuracy on future predictions only if the new data resembles the data used here, so this largely depends on data quality. We must also remain cautious about potential, and quite likely, model **overfitting**, which we can look for by comparing the accuracy on the train and test sets.
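
One simple way to look for overfitting is to compare train accuracy with test accuracy and with cross-validated accuracy. The sketch below does this on synthetic data, not on the airline dataset; the `n_estimators` and `random_state` values and all `demo` names are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic data with some label noise (flip_y), used only for illustration.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.1, random_state=1
)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

demo_forest = RandomForestClassifier(n_estimators=50, random_state=1)
demo_forest.fit(Xd_train, yd_train)

# A large gap between these two numbers is a sign of overfitting.
train_acc = demo_forest.score(Xd_train, yd_train)
test_acc = demo_forest.score(Xd_test, yd_test)

# 5-fold cross-validation on the train set gives a less split-dependent estimate.
cv_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=1), Xd_train, yd_train, cv=5
)
print(train_acc, test_acc, cv_scores.mean())
```

If the test and cross-validated accuracies stay close to each other, the score is less likely to be an artifact of one lucky split.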

## VI) Conclusion

This was a very simple example of how we can predict future customer satisfaction.

However, for better results, we typically **start with the minimum number of parameters** in a first attempt, to create a **'dumb model'**. Then we add the other parameters one by one and check whether each one significantly improves the model. This approach aims for a model that performs well with as few parameters as possible, because **more parameters mean more complexity and a higher probability of overfitting**.
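
Reading "parameters" as input features (an assumption on our part), the add-one-at-a-time idea can be sketched with sklearn's `SequentialFeatureSelector` in forward mode. The example runs on synthetic data, and the choice of keeping 3 features is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which only 3 carry signal.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=1
)

# Start from no features and greedily add the one that most improves
# the cross-validated score, until 3 features are selected.
selector = SequentialFeatureSelector(
    LogisticRegression(), n_features_to_select=3, direction="forward", cv=3
)
selector.fit(X_demo, y_demo)

selected_mask = selector.get_support()   # boolean mask over the 8 features
X_selected = selector.transform(X_demo)  # keeps only the selected columns
print(selected_mask)
```

The reduced matrix `X_selected` can then be used to fit the final model in place of the full feature set.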

# Thank you for reading!
# Goodbye, Vincent.