# Part 1: Airlines Customer Satisfaction Classification

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

In [2]:
#Libraries
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import Normalizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In this example, you'll see how to use a machine learning algorithm to classify a customer's flight experience as 'Satisfied' or 'Dissatisfied'. To do this we need data, so let's read it and start our journey.

In [3]:
data = pd.read_csv('Part1_Invistico_Airline.csv')
data.head()

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,Female,Loyal Customer,65,Personal Travel,Eco,360,4,4,4,4,...,5,5,5,5,3,5,3,0,0.0,satisfied
1,Male,Loyal Customer,13,Personal Travel,Eco,2991,1,5,1,4,...,5,4,2,4,3,4,5,0,5.0,dissatisfied
2,Female,Loyal Customer,58,Business travel,Business,1903,3,3,3,3,...,3,3,3,3,1,3,4,0,0.0,dissatisfied
3,Female,disloyal Customer,27,Business travel,Business,2813,4,4,4,3,...,3,2,4,3,2,4,3,0,0.0,dissatisfied
4,Female,Loyal Customer,34,Business travel,Business,2864,3,5,4,4,...,1,1,1,3,4,1,1,21,19.0,dissatisfied


We have 103905 customer experience records. Each record holds some information about the customer (age, gender, etc.) and the flight (departure time, seat comfort, etc.). We will use this data to fit a logistic regression model that classifies customer satisfaction. But before we do this, we need to learn more about our data. Let's start by checking for missing values.

In [4]:
data.isna().sum()

Gender                                 0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Seat comfort                           0
Departure/Arrival time convenient      0
Food and drink                         0
Gate location                          0
Inflight wifi service                  0
Inflight entertainment                 0
Online support                         0
Ease of Online booking                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Cleanliness                            0
Online boarding                        0
Departure Delay in Minutes             0
Arrival Delay in Minutes             313
satisfaction                           0
dtype: int64

The pandas library has a method named isna, which flags every missing (null) value in the data. If you chain it with the sum method, you get the total number of null values in each column. The results of this code show 313 null values in the 'Arrival Delay in Minutes' column. There are many ways to handle null values. In this example, we will fill them with the median of the column and then check the null counts again.

In [5]:
data['Arrival Delay in Minutes'] = data['Arrival Delay in Minutes'].fillna(data['Arrival Delay in Minutes'].median())
data.isna().sum()

Gender                               0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
satisfaction                         0
dtype: int64
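
The count-then-fill pattern used above can be tried on a small frame. This is a minimal sketch with made-up values (the `toy` frame is hypothetical, not part of the airline data); it only illustrates that `median` skips missing values by default before `fillna` writes the result back.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with two missing arrival delays.
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 10, 5, 0],
    "Arrival Delay in Minutes": [0.0, np.nan, 8.0, np.nan],
})

# isna() marks missing cells; sum() counts them per column.
missing_before = toy.isna().sum()

# median() ignores NaN by default, so it is computed from 0.0 and 8.0 only.
median_delay = toy["Arrival Delay in Minutes"].median()
toy["Arrival Delay in Minutes"] = toy["Arrival Delay in Minutes"].fillna(median_delay)

missing_after = toy.isna().sum()
print(missing_before, median_delay, missing_after, sep="\n")
```

The median is a common choice for delay-like columns because a few very long delays pull the mean upward much more than the median.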

If you look at our data, some columns hold numeric values and others hold categorical values. We have to apply some more preprocessing to prepare this data for our logistic regression algorithm. To do this, we first need to know the data type of each column. Let's create two helper functions for this and run them to see their outputs.

In [6]:
def object_cols(df):
    return list(df.select_dtypes(include='object').columns)

def numerical_cols(df):
    return list(df.select_dtypes(exclude='object').columns)

In [7]:
obj_col = object_cols(data)
num_col = numerical_cols(data)

In [8]:
obj_col

['Gender', 'Customer Type', 'Type of Travel', 'Class', 'satisfaction']

In [9]:
num_col

['Age',
 'Flight Distance',
 'Seat comfort',
 'Departure/Arrival time convenient',
 'Food and drink',
 'Gate location',
 'Inflight wifi service',
 'Inflight entertainment',
 'Online support',
 'Ease of Online booking',
 'On-board service',
 'Leg room service',
 'Baggage handling',
 'Checkin service',
 'Cleanliness',
 'Online boarding',
 'Departure Delay in Minutes',
 'Arrival Delay in Minutes']
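
As a possible variation on the two helpers above, `select_dtypes` also accepts the generic `"number"` selector, so the split can be written without naming the `object` dtype. The sketch below uses a hypothetical four-column frame, not the real data.

```python
import pandas as pd

# Toy frame (hypothetical rows) mixing text and numeric columns.
toy = pd.DataFrame({
    "Gender": ["Female", "Male"],
    "Age": [65, 13],
    "Class": ["Eco", "Business"],
    "Flight Distance": [360, 2991],
})

# "number" matches every integer and float dtype.
num_cols = list(toy.select_dtypes(include="number").columns)
# Everything that is not numeric is treated as categorical here.
cat_cols = list(toy.select_dtypes(exclude="number").columns)

print(num_cols)
print(cat_cols)
```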

Now we have two different lists. The first, obj_col, holds the names of the categorical columns. To work with categorical variables we need to encode them as numbers, so let's use sklearn's LabelEncoder class, which maps each distinct value in a column to an integer code. The second list, num_col, holds the names of the columns with numerical values. We can rescale these values with sklearn's Normalizer class. Note that Normalizer rescales each row (each customer) to unit norm rather than rescaling each column, which is why 'Flight Distance', by far the largest value in a row, ends up close to 1 in the output further down. Once we apply these two preprocessing steps to our data, we will be ready for logistic regression.

In [10]:
le = LabelEncoder()
norm = Normalizer()

In [11]:
for col in obj_col:
    data[col] = le.fit_transform(data[col])
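
The loop above reuses one encoder, so after it finishes `le` only remembers the mapping of the last column. If you want to read the codes back later, one option is to keep a separate encoder per column. The sketch below assumes a hypothetical two-column frame; the `encoders` dictionary is an illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (hypothetical rows) with two categorical columns.
toy = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco Plus", "Eco"],
    "satisfaction": ["satisfied", "dissatisfied", "dissatisfied", "satisfied"],
})

# Keep one fitted encoder per column so every mapping stays available.
encoders = {}
for col in toy.columns:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ is sorted, so the code of a label is its position in classes_.
label_map = {str(label): code
             for code, label in enumerate(encoders["satisfaction"].classes_)}
print(label_map)

# inverse_transform turns codes back into the original labels.
decoded = list(encoders["satisfaction"].inverse_transform([1, 0]))
print(decoded)
```

Because the codes follow sorted order, 'dissatisfied' becomes 0 and 'satisfied' becomes 1, which matches the last column of the encoded table shown further down.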

In [12]:
data.isna().sum()

Gender                               0
Customer Type                        0
Age                                  0
Type of Travel                       0
Class                                0
Flight Distance                      0
Seat comfort                         0
Departure/Arrival time convenient    0
Food and drink                       0
Gate location                        0
Inflight wifi service                0
Inflight entertainment               0
Online support                       0
Ease of Online booking               0
On-board service                     0
Leg room service                     0
Baggage handling                     0
Checkin service                      0
Cleanliness                          0
Online boarding                      0
Departure Delay in Minutes           0
Arrival Delay in Minutes             0
satisfaction                         0
dtype: int64

In [13]:
data[num_col] = norm.fit_transform(data[num_col])

In [14]:
data.head()

Unnamed: 0,Gender,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,satisfaction
0,0,0,0.177518,1,1,0.983177,0.010924,0.010924,0.010924,0.010924,...,0.013655,0.013655,0.013655,0.013655,0.008193,0.013655,0.008193,0.0,0.0,1
1,1,0,0.004346,1,1,0.999978,0.000334,0.001672,0.000334,0.001337,...,0.001672,0.001337,0.000669,0.001337,0.001003,0.001337,0.001672,0.0,0.001672,0
2,0,0,0.030463,0,0,0.999514,0.001576,0.001576,0.001576,0.001576,...,0.001576,0.001576,0.001576,0.001576,0.000525,0.001576,0.002101,0.0,0.0,0
3,0,1,0.009598,0,0,0.999944,0.001422,0.001422,0.001422,0.001066,...,0.001066,0.000711,0.001422,0.001066,0.000711,0.001422,0.001066,0.0,0.0,0
4,0,0,0.01187,0,0,0.999873,0.001047,0.001746,0.001396,0.001396,...,0.000349,0.000349,0.000349,0.001047,0.001396,0.000349,0.000349,0.007331,0.006633,0
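
To see why 'Flight Distance' is close to 1 in every row of the table above, it helps to compare Normalizer with a column-wise scaler. The sketch below uses three hypothetical rows and swaps in StandardScaler purely for comparison; it is not what the notebook itself does.

```python
import numpy as np
from sklearn.preprocessing import Normalizer, StandardScaler

# Hypothetical rows: [Age, Flight Distance, Seat comfort]
X = np.array([
    [65.0, 360.0, 4.0],
    [13.0, 2991.0, 1.0],
    [58.0, 1903.0, 3.0],
])

# Normalizer divides each ROW by its own L2 norm.
row_scaled = Normalizer().fit_transform(X)
# StandardScaler shifts and scales each COLUMN to mean 0 and std 1.
col_scaled = StandardScaler().fit_transform(X)

row_norms = np.linalg.norm(row_scaled, axis=1)
col_means = col_scaled.mean(axis=0)
col_stds = col_scaled.std(axis=0)

print(row_scaled.round(4))
print(col_scaled.round(4))
```

With row-wise normalization, the largest entry in a row dominates its norm, so the smaller features are squeezed toward 0. Column-wise scaling keeps each feature comparable across customers, which is usually what a linear model needs.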


We are all set. So far we have read data about customers' flight experiences and applied some preprocessing steps to make it ready for logistic regression. Now we are ready to create a logistic regression model. Let's first split our data into train and test sets using the train_test_split function of sklearn.

In [15]:
X_data = data.drop(['satisfaction'], axis = 1)
y_data = data['satisfaction']

In [16]:
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, test_size=0.33, random_state=42)
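
A small sketch of what the split does, on hypothetical toy arrays: `test_size=0.33` sets the share of rows held out, and a fixed `random_state` makes the split repeatable. The `stratify=y` argument is an optional extra that the notebook does not use; it keeps the class balance the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 100 rows, two features, balanced 0/1 labels.
X = np.arange(200).reshape(100, 2)
y = np.array([0, 1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

# Same arguments and same random_state give the same split again.
X_train2, X_test2, _, _ = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

print(len(X_train), len(X_test))
print(y_test.mean())
```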

Everything is now in place to classify with logistic regression. First, let's create a logistic regression instance. Then we fit it on the training split. Once our model is ready, we call the predict method on the test split to make classifications and check some of the results.

In [17]:
log_reg = LogisticRegression()

In [18]:
fit_model = log_reg.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
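
The message above says the solver stopped at its iteration limit and suggests two remedies: raise `max_iter` or scale the data. The sketch below tries both on synthetic data, under the assumption that column-wise scaling with StandardScaler inside a pipeline is acceptable. The data and the `pipe` name are illustrative and not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is blown up to mimic a column such as
# flight distance that lives on a much larger scale than the others.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X[:, 0] *= 1000.0

# Scale every column first, then fit with a more generous iteration limit.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)

# n_iter_ reports how many iterations the solver actually needed.
clf = pipe.named_steps["logisticregression"]
iterations_used = int(clf.n_iter_.max())
train_accuracy = pipe.score(X, y)

print(iterations_used, train_accuracy)
```

A fit that stops at the limit still returns a usable model, but its coefficients may not be the best the solver could find, so it is worth checking `n_iter_` against `max_iter`.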


In [19]:
preds = fit_model.predict(X_test)

In [20]:
preds

array([1, 0, 1, ..., 1, 0, 1])

As you can see, our model classified each test point as 1 or 0. To do so, the model computes a probability for each class and outputs the class with the higher probability. To see these probabilities for any data point, we can call the predict_proba method of our logistic regression model. Once we have both the predicted classes and the related probabilities, let's create a dataframe to see everything combined.

In [21]:
probs = fit_model.predict_proba(X_test)

In [22]:
probs

array([[0.15093146, 0.84906854],
       [0.64654981, 0.35345019],
       [0.39264986, 0.60735014],
       ...,
       [0.15368394, 0.84631606],
       [0.68039437, 0.31960563],
       [0.15101003, 0.84898997]])
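
The link between the probabilities above and the predicted classes can be checked directly. This is a minimal sketch on synthetic data (not the airline set): each row of `predict_proba` sums to 1, and picking the column with the larger value reproduces `predict`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem used only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X, y)

# One row per sample, one column per class, in the order of model.classes_.
probs = model.predict_proba(X)

# Choosing the class with the highest probability gives the prediction.
preds_from_probs = model.classes_[probs.argmax(axis=1)]

print(probs[:3])
print(preds_from_probs[:3], model.predict(X)[:3])
```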

In [23]:
model_results = pd.DataFrame([preds, y_test, [elem[0] for elem in probs], [elem[1] for elem in probs]])
model_results = model_results.T
model_results.rename(columns = {0 : 'PredictedClass', 1 : 'TrueClass', 2 : 'ClassProb:0', 3 : 'ClassProb:1'}, inplace = True)
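
An alternative way to build the same kind of table is to pass a dictionary of named columns, which avoids the transpose and rename steps and keeps the class columns as integers. The arrays below are hypothetical stand-ins for `preds`, `y_test` and `probs`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the notebook's preds, y_test and probs.
preds = np.array([1, 0, 1])
y_true = pd.Series([1, 0, 0], index=[17, 4, 92])  # shuffled index, as after a split
probs = np.array([[0.15, 0.85], [0.65, 0.35], [0.39, 0.61]])

results = pd.DataFrame({
    "PredictedClass": preds,
    # to_numpy() drops the shuffled index so rows line up by position.
    "TrueClass": y_true.to_numpy(),
    "ClassProb:0": probs[:, 0],
    "ClassProb:1": probs[:, 1],
})
print(results)
```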

We now have prediction results for all of the data points in our X_test dataframe. A classification algorithm will produce a prediction for whatever input you give it, so the predictions alone do not tell you how good the model is. How can you evaluate its success rate? Let's check the dataframe we have created. Its first two columns hold the predicted and true class values. For any data point, if the predicted class and the true class are the same, we call it a true (correct) classification. If we divide the count of true classifications by the number of all test values, we get the "accurate classification ratio". Let's do it!

In [24]:
model_results

Unnamed: 0,PredictedClass,TrueClass,ClassProb:0,ClassProb:1
0,1.0,1.0,0.150931,0.849069
1,0.0,0.0,0.646550,0.353450
2,1.0,1.0,0.392650,0.607350
3,1.0,1.0,0.338033,0.661967
4,0.0,0.0,0.777370,0.222630
...,...,...,...,...
34284,1.0,1.0,0.347187,0.652813
34285,0.0,1.0,0.762796,0.237204
34286,1.0,0.0,0.153684,0.846316
34287,0.0,0.0,0.680394,0.319606


In [25]:
true_count = 0
for pred, real in zip(model_results['PredictedClass'], model_results['TrueClass']):
    if pred == real:
        true_count = true_count + 1
print("Number of True Classifications = {0} ".format(true_count))
print("Accurate Classification Ratio = {0} ".format(true_count / len(y_test)))

Number of True Classifications = 25819 
Accurate Classification Ratio = 0.7529820058911021 
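
The same ratio can be computed without an explicit loop. The sketch below uses five hypothetical labels to show that a vectorized comparison and sklearn's `accuracy_score` agree with the counting approach above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels; three of the five agree.
y_true = np.array([1, 0, 1, 1, 0])
y_pred = np.array([1, 0, 1, 0, 1])

# Comparing the arrays gives booleans; their mean is the share of matches.
manual = (y_true == y_pred).mean()

# accuracy_score computes the same fraction.
library = accuracy_score(y_true, y_pred)

print(manual, library)
```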


The ratio shown as "Accurate Classification Ratio" is also known as the accuracy score. It simply tells you the fraction of test points your classification algorithm got right, here about 75%. So far we have only one model, so we cannot say that this algorithm with these (default) parameters is the best one. Let's create another logistic regression model with different parameters and check the accuracy score again to choose the better of the two.

In [26]:
second_model = LogisticRegression(penalty='l2', solver='newton-cg', C=10, class_weight='balanced')
second_fit = second_model.fit(X_train, y_train)

In [27]:
second_fit.score(X_test, y_test)

0.767097319840182
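
A single train/test split gives only one accuracy estimate per model, so a small gap between two models may partly reflect the split itself. One hedged way to compare candidates is cross-validation, sketched below on synthetic data. The `candidates` dictionary and its names are illustrative, and the settings only loosely mirror the two models above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic binary problem used only for illustration.
X, y = make_classification(n_samples=600, n_features=6, random_state=42)

# Two candidate configurations (hypothetical names).
candidates = {
    "default": LogisticRegression(max_iter=1000),
    "balanced_newton": LogisticRegression(
        solver="newton-cg", C=10, class_weight="balanced", max_iter=1000),
}

# Average accuracy over five folds for each candidate.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()

best_name = max(mean_scores, key=mean_scores.get)
print(mean_scores, best_name)
```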