# Context and Defining Problem Statement

The dataset contains airline passengers and their feedback on their flight experience.

Each row is one passenger. Apart from the customers' feedback across various attributes (14 in total), such as food, online_support and cleanliness, we have data about each customer's age, loyalty to the airline, gender and class.

The target column is a binary variable that tells us whether the customer is satisfied or neutral/dissatisfied.

The task is to analyze the reasons for customers' satisfaction or dissatisfaction.

Finally, we build a model to predict customer satisfaction using all or some of the available data.

# Steps
1. Data loading and preprocessing
2. Exploratory Data Analysis
3. Model building and evaluation
4. Model Tuning
5. Dimensionality Reduction

# Data loading and preprocessing

### 1. Import Pandas, Numpy, pyplot and seaborn

In [2]:
#Import necessary libraries
# %matplotlib inline
import numpy as np
import pandas as pd
# import matplotlib.pyplot as plt
# import seaborn as sns
import warnings
warnings.filterwarnings("ignore")  # Not always recommended, but just so our notebook looks clean for this activity

### 2. Import the dataframes that are needed
- Import "Flight_data_Train.csv" and "Surveydata_Train.csv"

In [6]:
df1 = pd.read_csv("../data/Flight_data_Train.csv")  # Read the data regarding customer attributes
df2 = pd.read_csv("../data/Surveydata_Train.csv")   # Feedback data from customers

In [7]:
df1.columns 

Index(['ID', 'Gender', 'CustomerType', 'Age', 'TypeTravel', 'Class',
       'Flight_Distance', 'DepartureDelayin_Mins', 'ArrivalDelayin_Mins'],
      dtype='object')

In [8]:
df2.columns 

Index(['Id', 'Satisfaction', 'Seat_comfort',
       'Departure.Arrival.time_convenient', 'Food_drink', 'Gate_location',
       'Inflightwifi_service', 'Inflight_entertainment', 'Online_support',
       'Ease_of_Onlinebooking', 'Onboard_service', 'Leg_room_service',
       'Baggage_handling', 'Checkin_service', 'Cleanliness',
       'Online_boarding'],
      dtype='object')

In [3]:
df1.head()

Unnamed: 0,ID,Gender,CustomerType,Age,TypeTravel,Class,Flight_Distance,DepartureDelayin_Mins,ArrivalDelayin_Mins
0,149965,Female,Loyal Customer,65,Personal Travel,Eco,265,0,0.0
1,149966,Female,Loyal Customer,15,Personal Travel,Eco,2138,0,0.0
2,149967,Female,Loyal Customer,60,Personal Travel,Eco,623,0,0.0
3,149968,Female,Loyal Customer,70,Personal Travel,Eco,354,0,0.0
4,149969,Male,Loyal Customer,30,,Eco,1894,0,0.0


In [4]:
df2.head()

Unnamed: 0,Id,Satisfaction,Seat_comfort,Departure.Arrival.time_convenient,Food_drink,Gate_location,Inflightwifi_service,Inflight_entertainment,Online_support,Ease_of_Onlinebooking,Onboard_service,Leg_room_service,Baggage_handling,Checkin_service,Cleanliness,Online_boarding
0,198671,neutral or dissatisfied,poor,acceptable,acceptable,manageable,poor,need improvement,poor,poor,acceptable,acceptable,poor,need improvement,need improvement,poor
1,193378,satisfied,excellent,need improvement,excellent,Convinient,acceptable,excellent,acceptable,acceptable,good,acceptable,excellent,acceptable,excellent,acceptable
2,174522,satisfied,good,good,good,manageable,acceptable,excellent,excellent,need improvement,need improvement,good,need improvement,excellent,need improvement,excellent
3,191830,satisfied,good,good,good,manageable,poor,good,poor,poor,poor,good,poor,acceptable,acceptable,poor
4,221497,satisfied,good,good,,Convinient,good,good,good,good,good,good,good,excellent,good,good


### 3. Join the two dataframes using the 'Id' column as the primary key
- Rename the ID column of one dataframe so that the key column has the same name in both, or set it as the index on each side as done in the cell below

In [5]:
# Using pandas' join method, which joins on the index
# c = a.join(b)  # joins two dataframes on their indexes

df = df2.set_index("Id").join(df1.set_index("ID"))


print(df.shape)
df.head()  # the combined dataframe

(90917, 23)


Id,Satisfaction,Seat_comfort,Departure.Arrival.time_convenient,Food_drink,Gate_location,Inflightwifi_service,Inflight_entertainment,Online_support,Ease_of_Onlinebooking,Onboard_service,...,Cleanliness,Online_boarding,Gender,CustomerType,Age,TypeTravel,Class,Flight_Distance,DepartureDelayin_Mins,ArrivalDelayin_Mins
198671,neutral or dissatisfied,poor,acceptable,acceptable,manageable,poor,need improvement,poor,poor,acceptable,...,need improvement,poor,Male,Loyal Customer,30,Business travel,Business,1354,11,8.0
193378,satisfied,excellent,need improvement,excellent,Convinient,acceptable,excellent,acceptable,acceptable,good,...,excellent,acceptable,Female,disloyal Customer,20,,Business,1439,6,0.0
174522,satisfied,good,good,good,manageable,acceptable,excellent,excellent,need improvement,need improvement,...,need improvement,excellent,Female,,55,Personal Travel,Eco Plus,976,4,0.0
191830,satisfied,good,good,good,manageable,poor,good,poor,poor,poor,...,acceptable,poor,Male,disloyal Customer,24,Business travel,Eco,2291,0,0.0
221497,satisfied,good,good,,Convinient,good,good,good,good,good,...,good,good,Male,Loyal Customer,32,Business travel,Business,3974,0,0.0
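
The same join can also be written with `merge`, which lets pandas check that the key is unique on both sides. The following is a minimal sketch on toy stand-in tables (the values and the names `flight`, `survey` and `merged` are illustrative assumptions, not the real data):

```python
import pandas as pd

# Toy stand-ins for the flight and survey tables (illustrative values only)
flight = pd.DataFrame({"ID": [1, 2, 3], "Age": [65, 15, 60]})
survey = pd.DataFrame({
    "Id": [3, 1, 2],
    "Satisfaction": ["satisfied", "satisfied", "neutral or dissatisfied"],
})

# Align the key name, then merge; validate="one_to_one" raises an error
# if either table contains duplicate keys
merged = survey.merge(
    flight.rename(columns={"ID": "Id"}),
    on="Id",
    how="left",
    validate="one_to_one",
).set_index("Id")

print(merged.shape)
```

A left merge keeps every survey row, so the row count of the result should equal the row count of the survey table.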


### 4. Print the number of missing values in each of the columns

In [6]:
df.isna().sum()

Satisfaction                            0
Seat_comfort                            0
Departure.Arrival.time_convenient    8244
Food_drink                           8181
Gate_location                           0
Inflightwifi_service                    0
Inflight_entertainment                  0
Online_support                          0
Ease_of_Onlinebooking                   0
Onboard_service                      7179
Leg_room_service                        0
Baggage_handling                        0
Checkin_service                         0
Cleanliness                             0
Online_boarding                         0
Gender                                  0
CustomerType                         9099
Age                                     0
TypeTravel                           9088
Class                                   0
Flight_Distance                         0
DepartureDelayin_Mins                   0
ArrivalDelayin_Mins                   284
dtype: int64
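
Raw counts are easier to judge as a share of all rows. For instance, 9,099 missing `CustomerType` values out of 90,917 rows is roughly 10%. A minimal sketch of such a summary, using a toy dataframe (the name `toy` and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy dataframe with a few gaps (illustrative values only)
toy = pd.DataFrame({
    "Food_drink": ["good", None, "poor", None],
    "ArrivalDelayin_Mins": [0.0, 8.0, np.nan, 2.0],
    "Age": [30, 20, 55, 24],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": (toy.isna().mean() * 100).round(2),
})

# Keep only columns that actually have gaps, worst first
missing = missing[missing["n_missing"] > 0].sort_values("pct_missing", ascending=False)
print(missing)
```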

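### 5. Fill missing values with the mean and check the shape of the dataframe before and after handling the remaining nulls
- Fill null values in ArrivalDelayin_Mins with the mean
- After that, either drop all the rows with null values or, as in the solution below, fill them with a placeholder so that no rows are lost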

In [7]:
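# Since there are very few null values in the 'ArrivalDelayin_Mins' column, let's impute them with the mean.
# Assigning the result back avoids relying on an in-place fill through attribute access.
df["ArrivalDelayin_Mins"] = df["ArrivalDelayin_Mins"].fillna(df["ArrivalDelayin_Mins"].mean())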

In [8]:
# # Alternate way - 1

# df.ArrivalDelayin_Mins.replace({np.nan  : df.ArrivalDelayin_Mins.mean()}, inplace = True)

In [9]:
# # Alternate way - 2

# from sklearn.impute import SimpleImputer

# imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean') 

# df['ArrivalDelayin_Mins'] = imputer.fit_transform(df['ArrivalDelayin_Mins'].values.reshape(-1,1))

# # SimpleImputer expects a 2-D array and raises an error for a 1-D column, hence the reshape(-1, 1)


In [8]:
df.ArrivalDelayin_Mins.isna().sum()

0

In [10]:
# The remaining missing values are filled with the placeholder "not_captured".
# Depending on the problem and business context, they could also be imputed differently or dropped.
df.fillna("not_captured", inplace = True) 


In [11]:
print(df.shape)

(90917, 23)


# Exploratory Data Analysis

### 6. Print correlation

In [12]:
cor = df.corr(numeric_only=True)  # Correlation of the numerical variables only (numeric_only needs pandas >= 1.5)

In [14]:
cor[cor >0.2]

Unnamed: 0,Age,Flight_Distance,DepartureDelayin_Mins,ArrivalDelayin_Mins
Age,1.0,,,
Flight_Distance,,1.0,,
DepartureDelayin_Mins,,,1.0,0.96112
ArrivalDelayin_Mins,,,0.96112,1.0


In [15]:
to_drop = ['DepartureDelayin_Mins']

- The only strong correlation is between arrival delay and departure delay (about 0.96), so 'DepartureDelayin_Mins' is marked to be dropped
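
With more numeric columns, highly correlated pairs could be picked out programmatically rather than by eye. The sketch below runs on synthetic data; the 0.9 threshold and the names `toy`, `upper` and `to_drop_sketch` are assumptions for illustration. Note that it flags the later column of each pair, whereas the notebook chose to drop the departure delay; which column of a pair to drop is a judgment call.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dep = rng.exponential(15, size=500)

# Synthetic data: arrival delay is nearly a copy of departure delay
toy = pd.DataFrame({
    "DepartureDelayin_Mins": dep,
    "ArrivalDelayin_Mins": dep + rng.normal(0, 2, size=500),
    "Age": rng.integers(18, 80, size=500),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is excluded
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

threshold = 0.9  # hypothetical cut-off
to_drop_sketch = [col for col in upper.columns if (upper[col] > threshold).any()]
print(to_drop_sketch)
```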

### Feedback columns
* 'Seat_comfort'
* 'Departure.Arrival.time_convenient'
* 'Food_drink'
* 'Gate_location'
* 'Inflightwifi_service'
* 'Inflight_entertainment'
* 'Online_support'
* 'Ease_of_Onlinebooking'
* 'Onboard_service'
* 'Leg_room_service'
* 'Baggage_handling'
* 'Checkin_service'
* 'Cleanliness'
* 'Online_boarding'

### 7. Manually encode these variables (listed above) so that they follow an order based on their meaning
### Example: awful = 1, unpleasant = 2, decent = 3, good = 4, great = 5

In [18]:
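# Manual label encoding
# This is a somewhat subjective task, so encode in whatever way you find appropriate.
# Note: df.replace acts on every column, so 'not_captured' also becomes 2 in
# 'CustomerType' and 'TypeTravel' (visible in the outputs that follow).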

df.replace({'extremely poor' : 0, 'poor' : 1, 'need improvement' : 2, 'acceptable' : 3, 
            'good' : 4, 'excellent' : 5, 'not_captured' : 2}, inplace = True)  

df.replace({'very inconvinient' : 0, 'Inconvinient' : 1, 'need improvement' : 2, 'manageable' : 3,
            'Convinient' : 4, 'very convinient' : 5}, inplace = True)

In [19]:
df.head()

Id,Satisfaction,Seat_comfort,Departure.Arrival.time_convenient,Food_drink,Gate_location,Inflightwifi_service,Inflight_entertainment,Online_support,Ease_of_Onlinebooking,Onboard_service,...,Cleanliness,Online_boarding,Gender,CustomerType,Age,TypeTravel,Class,Flight_Distance,DepartureDelayin_Mins,ArrivalDelayin_Mins
198671,neutral or dissatisfied,1,3,3,3,1,2,1,1,3,...,2,1,Male,Loyal Customer,30,Business travel,Business,1354,11,8.0
193378,satisfied,5,2,5,4,3,5,3,3,4,...,5,3,Female,disloyal Customer,20,2,Business,1439,6,0.0
174522,satisfied,4,4,4,3,3,5,5,2,2,...,2,5,Female,2,55,Personal Travel,Eco Plus,976,4,0.0
191830,satisfied,4,4,4,3,1,4,1,1,1,...,3,1,Male,disloyal Customer,24,Business travel,Eco,2291,0,0.0
221497,satisfied,4,4,2,4,4,4,4,4,4,...,4,4,Male,Loyal Customer,32,Business travel,Business,3974,0,0.0


In [20]:
df['Departure.Arrival.time_convenient'].value_counts()

2    22783
4    18840
5    17079
3    14806
1    13210
0     4199
Name: Departure.Arrival.time_convenient, dtype: int64
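
A dataframe-wide `replace` touches every column. One way to restrict the encoding to the feedback columns is to map each of them individually. This is a minimal sketch on toy data; `rating_map`, `feedback_cols` and `toy` are illustrative names, and the mapping simply mirrors the one used in the cell above:

```python
import pandas as pd

# Same ordinal scale as in the notebook; 'not_captured' is treated as a 2
rating_map = {"extremely poor": 0, "poor": 1, "need improvement": 2,
              "acceptable": 3, "good": 4, "excellent": 5, "not_captured": 2}

toy = pd.DataFrame({
    "Seat_comfort": ["poor", "excellent", "not_captured"],
    "Food_drink": ["good", "not_captured", "acceptable"],
    "CustomerType": ["Loyal Customer", "not_captured", "disloyal Customer"],
})

feedback_cols = ["Seat_comfort", "Food_drink"]  # only these columns get encoded

# Series.map turns any value missing from the dictionary into NaN,
# which makes unexpected labels easy to spot
toy[feedback_cols] = toy[feedback_cols].apply(lambda s: s.map(rating_map))
print(toy)
```

With this approach the placeholder in `CustomerType` stays a string and can be handled separately during categorical encoding.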

### 9. Draw all the insights that you can from the plots

**Green and orange bars are the counts of satisfied and dissatisfied customers respectively. We look for ratings where the lengths of the stacked bars differ visibly.**

- From the plots above,
    - Seat comfort can drive high satisfaction. Hardly anyone who rated seat_comfort highly was dissatisfied.
    - Inflight_entertainment shows a similar pattern. Here, poor entertainment seems to have caused far more dissatisfaction than bad seating.
    - A clear difference is also visible for ease_of_online_booking.
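
The counts behind stacked bars of this kind can be tabulated with `pd.crosstab`. The sketch below uses a small toy dataframe (values are illustrative assumptions) and also computes the share of each satisfaction level per rating, which is often easier to compare than raw counts:

```python
import pandas as pd

# Toy ratings and outcomes (illustrative values only)
toy = pd.DataFrame({
    "Seat_comfort": [1, 1, 2, 4, 5, 5, 5, 3],
    "Satisfaction": [
        "neutral or dissatisfied", "neutral or dissatisfied",
        "neutral or dissatisfied", "satisfied", "satisfied",
        "satisfied", "satisfied", "neutral or dissatisfied",
    ],
})

# Rows: rating level, columns: satisfaction level
counts = pd.crosstab(toy["Seat_comfort"], toy["Satisfaction"])

# Row-normalised version: share of each outcome within a rating level
shares = pd.crosstab(toy["Seat_comfort"], toy["Satisfaction"], normalize="index")

print(counts)
print(shares.round(2))
# With matplotlib installed, counts.plot(kind="bar", stacked=True) draws the stacked bars
```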

### 10. Print the average feedback score
- When the Satisfaction column equals 'satisfied'
- When the Satisfaction column is not equal to 'satisfied'

In [21]:
df

Id,Satisfaction,Seat_comfort,Departure.Arrival.time_convenient,Food_drink,Gate_location,Inflightwifi_service,Inflight_entertainment,Online_support,Ease_of_Onlinebooking,Onboard_service,...,Cleanliness,Online_boarding,Gender,CustomerType,Age,TypeTravel,Class,Flight_Distance,DepartureDelayin_Mins,ArrivalDelayin_Mins
198671,neutral or dissatisfied,1,3,3,3,1,2,1,1,3,...,2,1,Male,Loyal Customer,30,Business travel,Business,1354,11,8.0
193378,satisfied,5,2,5,4,3,5,3,3,4,...,5,3,Female,disloyal Customer,20,2,Business,1439,6,0.0
174522,satisfied,4,4,4,3,3,5,5,2,2,...,2,5,Female,2,55,Personal Travel,Eco Plus,976,4,0.0
191830,satisfied,4,4,4,3,1,4,1,1,1,...,3,1,Male,disloyal Customer,24,Business travel,Eco,2291,0,0.0
221497,satisfied,4,4,2,4,4,4,4,4,4,...,4,4,Male,Loyal Customer,32,Business travel,Business,3974,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152951,satisfied,0,4,0,4,2,5,4,4,4,...,4,3,Female,2,46,Personal Travel,Eco,1143,0,0.0
220134,satisfied,1,1,1,1,4,5,5,4,4,...,4,4,Female,Loyal Customer,45,2,Business,3585,1,0.0
174403,satisfied,4,4,4,3,4,4,5,1,1,...,1,5,Female,Loyal Customer,67,2,Eco,2563,3,2.0
224128,satisfied,4,1,5,5,4,4,4,4,4,...,3,4,Female,Loyal Customer,37,Business travel,Eco,1027,0,0.0


In [22]:
Feedback_cols = ['Seat_comfort', 'Departure.Arrival.time_convenient',
                 'Food_drink', 'Gate_location', 'Inflightwifi_service',
                 'Inflight_entertainment', 'Online_support', 'Ease_of_Onlinebooking', 
                 'Onboard_service', 'Leg_room_service', 'Baggage_handling', 'Checkin_service',
                 'Cleanliness', 'Online_boarding']

In [23]:
df.groupby('Satisfaction').mean(numeric_only=True) # Average rating of individual feedback attributes across satisfaction levels


Satisfaction,Seat_comfort,Departure.Arrival.time_convenient,Food_drink,Gate_location,Inflightwifi_service,Inflight_entertainment,Online_support,Ease_of_Onlinebooking,Onboard_service,Leg_room_service,Baggage_handling,Checkin_service,Cleanliness,Online_boarding,Age,Flight_Distance,DepartureDelayin_Mins,ArrivalDelayin_Mins
neutral or dissatisfied,2.469239,2.928127,2.603606,3.007848,2.923219,2.613058,2.962071,2.858563,2.900209,3.053164,3.368209,2.970673,3.381159,2.87353,37.493221,2026.647512,17.793177,18.432402
satisfied,3.144511,2.882559,2.91421,2.975985,3.523121,4.021543,3.979864,3.985953,3.723297,3.845803,3.969695,3.646852,3.978115,3.748598,41.063222,1944.396194,12.11722,12.268821


### 11. Draw any insights that you can from the above values

- Observe the extreme values (low side for 'dissatisfaction' and high side for 'satisfaction')
- Bad seats are a strong cause of customer dissatisfaction
- Time convenience doesn't seem to matter much: the two averages are nearly identical (2.93 vs 2.88)
- Dissatisfied customers had some bad experiences with food, but average food seems to satisfy most people
- Gate location appears irrelevant: the two averages are nearly identical (3.01 vs 2.98)
- Wifi is quite a factor. On average, good wifi went along with customer satisfaction
- An easy online booking facility seems to be very important for customer satisfaction
- In-flight entertainment seems to be a deal breaker: it shows the largest gap between the two groups (2.61 vs 4.02)
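
These observations can be made more systematic by ranking the gap between the two group means. The sketch below copies four of the averages from the table above into a small dataframe (the names `means` and `gap` are illustrative); on the real data the same subtraction could be applied to the full `groupby` result.

```python
import pandas as pd

# Group means for four attributes, taken from the output table above
means = pd.DataFrame(
    {
        "Seat_comfort": [2.469239, 3.144511],
        "Gate_location": [3.007848, 2.975985],
        "Inflight_entertainment": [2.613058, 4.021543],
        "Ease_of_Onlinebooking": [2.858563, 3.985953],
    },
    index=["neutral or dissatisfied", "satisfied"],
)

# Positive gap: satisfied customers rated the attribute higher on average
gap = (means.loc["satisfied"] - means.loc["neutral or dissatisfied"]).sort_values(ascending=False)
print(gap.round(3))
```

A gap close to zero, as for gate location, suggests the attribute does little to separate the two groups, although a difference in means alone does not establish cause.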

### 12. Achieve the following
- Print the share of people who rated "Inflight_entertainment" above 3 (more than just acceptable) and yet were dissatisfied overall
- Print the share of people who rated "Inflight_entertainment" above 3 and were satisfied overall

In [24]:
df.Inflight_entertainment.value_counts()

4    29373
5    20786
3    16995
2    13527
1     8198
0     2038
Name: Inflight_entertainment, dtype: int64

In [25]:
# Share of customers who were well entertained (rating above 3) but dissatisfied in the end
entertained_and_dissatisfied = df[(df.Inflight_entertainment > 3) & (df.Satisfaction != 'satisfied')]
print(100 * entertained_and_dissatisfied.shape[0]/df.shape[0], 'percent')

10.11581992366664 percent


In [24]:
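# Share of customers who rated the entertainment 3 or lower and were still satisfied with the flight.
# Note: the filter is `<= 3`; use `> 3` instead to count well-entertained customers who were satisfied.
not_entertained_but_satisfied = df[(df.Inflight_entertainment <= 3) & (df.Satisfaction == 'satisfied')]
print(100 * not_entertained_but_satisfied.shape[0]/df.shape[0], 'percent')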

9.67805800895322 percent
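
All four combinations of entertainment rating and overall satisfaction can be read from a single normalised cross-tabulation, which avoids mixing up the filter conditions. This is a minimal sketch on toy data; the labels "rated 4-5" and "rated 0-3" and the variable names are assumptions for illustration.

```python
import pandas as pd

# Toy ratings and outcomes (illustrative values only)
toy = pd.DataFrame({
    "Inflight_entertainment": [5, 4, 2, 1, 4, 3, 5, 0],
    "Satisfaction": [
        "satisfied", "neutral or dissatisfied", "satisfied", "neutral or dissatisfied",
        "satisfied", "neutral or dissatisfied", "satisfied", "neutral or dissatisfied",
    ],
})

# Split ratings into "more than acceptable" (above 3) and the rest
entertainment_level = (
    toy["Inflight_entertainment"].gt(3).map({True: "rated 4-5", False: "rated 0-3"})
)

# normalize="all" divides every cell by the total row count; times 100 gives percentages
shares = pd.crosstab(entertainment_level, toy["Satisfaction"], normalize="all") * 100
print(shares)
```

Each cell is a percentage of all customers, so the four cells add up to 100.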


### 13. Create a new column which is the mean of 'Ease_of_Onlinebooking', 'Online_boarding', 'Online_support' and name it "avg_feedback_of_online_services". 

DIY: If online services have bad ratings, what are the average ratings of the other feedback attributes? 
And how does that affect the final satisfaction of customers?

In [26]:
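online_df = df.loc[:, ['Ease_of_Onlinebooking', 'Online_boarding', 'Online_support']]
# Note: this takes the row-wise median of the three ratings, not the mean that the task asks for;
# replace .median with .mean to get a true average
online_df['avg_feedback_of_online_services'] = online_df.median(axis = 1)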

online_df['avg_feedback_of_online_services'].value_counts()


4.0    31581
5.0    21334
3.0    17028
2.0    12840
1.0     8125
0.0        9
Name: avg_feedback_of_online_services, dtype: int64

- You might find that **a lot of things had to go well to satisfy the customers when the online services had a bad rating**
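
The row-wise mean and median of the three online ratings can differ noticeably when one rating is an outlier. A small sketch on toy ratings (the column `median_feedback_of_online_services` and the values are assumptions for illustration):

```python
import pandas as pd

online_cols = ["Ease_of_Onlinebooking", "Online_boarding", "Online_support"]

# Toy ratings (illustrative values only)
toy = pd.DataFrame({
    "Ease_of_Onlinebooking": [1, 4, 5],
    "Online_boarding": [1, 4, 2],
    "Online_support": [4, 5, 2],
})

# axis=1 aggregates across the three columns within each row
toy["avg_feedback_of_online_services"] = toy[online_cols].mean(axis=1)
toy["median_feedback_of_online_services"] = toy[online_cols].median(axis=1)
print(toy)
```

The median of three integer ratings is always one of the ratings, which is why a median-based column only takes whole-number values, while the mean can land in between.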

# Model building and evaluation

### 14. Encode the columns "Gender", "CustomerType", "TypeTravel", "Class", "Satisfaction" 
- Use manual encoding or other type of encoding

In [27]:
# Number of classes in each of the categorical attributes
for i in df.columns:
    if df[i].dtype == 'O':
        print(i, '->', len(df[i].value_counts()))

Satisfaction -> 2
Gender -> 2
CustomerType -> 3
TypeTravel -> 3
Class -> 3


In [27]:
df.Class.value_counts()

Business    43535
Eco         40758
Eco Plus     6624
Name: Class, dtype: int64

In [28]:
# # Manual encoding
# df.replace({'Loyal Customer' : 1, 'disloyal Customer' : 0,
#                'Business travel' : 1, 'Personal Travel' : 0,
#               'Female' : 0, 'Male' : 1,
#                'satisfied' : 1, 'neutral or dissatisfied' : 0, 'Eco Plus': 0 , 'Eco': 1, 'Business': 2}, inplace = True)
               

In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90917 entries, 198671 to 186039
Data columns (total 23 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   Satisfaction                       90917 non-null  object 
 1   Seat_comfort                       90917 non-null  int64  
 2   Departure.Arrival.time_convenient  90917 non-null  int64  
 3   Food_drink                         90917 non-null  int64  
 4   Gate_location                      90917 non-null  int64  
 5   Inflightwifi_service               90917 non-null  int64  
 6   Inflight_entertainment             90917 non-null  int64  
 7   Online_support                     90917 non-null  int64  
 8   Ease_of_Onlinebooking              90917 non-null  int64  
 9   Onboard_service                    90917 non-null  int64  
 10  Leg_room_service                   90917 non-null  int64  
 11  Baggage_handling                   90917 non-nul

In [28]:
df[['Gender','CustomerType','TypeTravel','Class']]

Id,Gender,CustomerType,TypeTravel,Class
198671,Male,Loyal Customer,Business travel,Business
193378,Female,disloyal Customer,2,Business
174522,Female,2,Personal Travel,Eco Plus
191830,Male,disloyal Customer,Business travel,Eco
221497,Male,Loyal Customer,Business travel,Business
...,...,...,...,...
152951,Female,2,Personal Travel,Eco
220134,Female,Loyal Customer,2,Business
174403,Female,Loyal Customer,2,Eco
224128,Female,Loyal Customer,Business travel,Eco


In [29]:
# One-hot encoding; drop_first=True drops the first level of each variable to avoid redundant columns
df_coded = pd.get_dummies(df[['Gender','CustomerType','TypeTravel','Class']],drop_first=True)

In [30]:
df_coded

Id,Gender_Male,CustomerType_Loyal Customer,CustomerType_disloyal Customer,TypeTravel_Business travel,TypeTravel_Personal Travel,Class_Eco,Class_Eco Plus
198671,1,1,0,1,0,0,0
193378,0,0,1,0,0,0,0
174522,0,0,0,0,1,0,1
191830,1,0,1,1,0,1,0
221497,1,1,0,1,0,0,0
...,...,...,...,...,...,...,...
152951,0,0,0,0,1,1,0
220134,0,1,0,0,0,0,0
174403,0,1,0,0,0,1,0
224128,0,1,0,1,0,1,0


In [33]:
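# # Alternate method - 1: target encoding with the category_encoders package
# import category_encoders as ce
# cols = ["Gender", "CustomerType", "TypeTravel", "Class"]

# # TargetEncoder needs a numeric target, so the target itself is not part of cols
# y_binary = (df["Satisfaction"] == "satisfied").astype(int)
# encoder = ce.TargetEncoder(cols=cols)  # a separate name avoids shadowing the module alias
# df[cols] = encoder.fit_transform(df[cols], y_binary)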


                             
# Alternate - 2 - Onehotencoding
# You can learn more about encoding here - https://kiwidamien.github.io/encoding-categorical-variables.html





In [31]:
df_coded.columns

Index(['Gender_Male', 'CustomerType_Loyal Customer',
       'CustomerType_disloyal Customer', 'TypeTravel_Business travel',
       'TypeTravel_Personal Travel', 'Class_Eco', 'Class_Eco Plus'],
      dtype='object')

In [32]:
df_coded.head()

Id,Gender_Male,CustomerType_Loyal Customer,CustomerType_disloyal Customer,TypeTravel_Business travel,TypeTravel_Personal Travel,Class_Eco,Class_Eco Plus
198671,1,1,0,1,0,0,0
193378,0,0,1,0,0,0,0
174522,0,0,0,0,1,0,1
191830,1,0,1,1,0,1,0
221497,1,1,0,1,0,0,0


In [33]:
df.drop(columns=['Gender','CustomerType','TypeTravel','Class'],inplace=True)

In [34]:
df = pd.concat([df,df_coded],axis=1)

In [35]:
df

Id,Satisfaction,Seat_comfort,Departure.Arrival.time_convenient,Food_drink,Gate_location,Inflightwifi_service,Inflight_entertainment,Online_support,Ease_of_Onlinebooking,Onboard_service,...,Flight_Distance,DepartureDelayin_Mins,ArrivalDelayin_Mins,Gender_Male,CustomerType_Loyal Customer,CustomerType_disloyal Customer,TypeTravel_Business travel,TypeTravel_Personal Travel,Class_Eco,Class_Eco Plus
198671,neutral or dissatisfied,1,3,3,3,1,2,1,1,3,...,1354,11,8.0,1,1,0,1,0,0,0
193378,satisfied,5,2,5,4,3,5,3,3,4,...,1439,6,0.0,0,0,1,0,0,0,0
174522,satisfied,4,4,4,3,3,5,5,2,2,...,976,4,0.0,0,0,0,0,1,0,1
191830,satisfied,4,4,4,3,1,4,1,1,1,...,2291,0,0.0,1,0,1,1,0,1,0
221497,satisfied,4,4,2,4,4,4,4,4,4,...,3974,0,0.0,1,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152951,satisfied,0,4,0,4,2,5,4,4,4,...,1143,0,0.0,0,0,0,0,1,1,0
220134,satisfied,1,1,1,1,4,5,5,4,4,...,3585,1,0.0,0,1,0,0,0,0,0
174403,satisfied,4,4,4,3,4,4,5,1,1,...,2563,3,2.0,0,1,0,0,0,1,0
224128,satisfied,4,1,5,5,4,4,4,4,4,...,1027,0,0.0,0,1,0,1,0,1,0


# Scaling

In [39]:
# #MinMax Scaling - scales the data set such that all feature values are in the range [0, 1].
# #StandardScaler - removes the mean and scales the data to unit variance


# You can learn about other scalers here -
# https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html


In [36]:
df = df.reset_index(drop=True)
cols_to_scale = ['Age', 'Flight_Distance','DepartureDelayin_Mins', 'ArrivalDelayin_Mins']
df[cols_to_scale]

Unnamed: 0,Age,Flight_Distance,DepartureDelayin_Mins,ArrivalDelayin_Mins
0,30,1354,11,8.0
1,20,1439,6,0.0
2,55,976,4,0.0
3,24,2291,0,0.0
4,32,3974,0,0.0
...,...,...,...,...
90912,46,1143,0,0.0
90913,45,3585,1,0.0
90914,67,2563,3,2.0
90915,37,1027,0,0.0


In [37]:
# Here we use StandardScaler to scale our data.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Which columns to scale? Here only the continuous numerical columns are scaled.
# The one-hot encoded columns and the target are left alone, since they only take values like 0 and 1.
# Depending on the business problem and the results, you may also scale the ordinal feedback
# columns (see the commented-out list that follows).
# cols_to_scale = ['Seat_comfort', 'Departure.Arrival.time_convenient',
#        'Food_drink', 'Gate_location', 'Inflightwifi_service',
#        'Inflight_entertainment', 'Online_support', 'Ease_of_Onlinebooking',
#        'Onboard_service', 'Leg_room_service', 'Baggage_handling',
#        'Checkin_service', 'Cleanliness', 'Online_boarding',
#        'Age', 'Flight_Distance','DepartureDelayin_Mins', 'ArrivalDelayin_Mins']
cols_to_scale = ['Age', 'Flight_Distance','DepartureDelayin_Mins', 'ArrivalDelayin_Mins']
df[cols_to_scale] = scaler.fit_transform(df[cols_to_scale].to_numpy())








In [38]:
df.describe()

Unnamed: 0,Seat_comfort,Departure.Arrival.time_convenient,Food_drink,Gate_location,Inflightwifi_service,Inflight_entertainment,Online_support,Ease_of_Onlinebooking,Onboard_service,Leg_room_service,...,Flight_Distance,DepartureDelayin_Mins,ArrivalDelayin_Mins,Gender_Male,CustomerType_Loyal Customer,CustomerType_disloyal Customer,TypeTravel_Business travel,TypeTravel_Personal Travel,Class_Eco,Class_Eco Plus
count,90917.0,90917.0,90917.0,90917.0,90917.0,90917.0,90917.0,90917.0,90917.0,90917.0,...,90917.0,90917.0,90917.0,90917.0,90917.0,90917.0,90917.0,90917.0,90917.0,90917.0
mean,2.838831,2.903186,2.773607,2.990409,3.251559,3.383955,3.519133,3.47561,3.350704,3.486994,...,9.558100000000001e-17,6.252232e-19,-4.84548e-18,0.491998,0.735803,0.164117,0.621237,0.278804,0.448299,0.072858
std,1.393582,1.482137,1.397892,1.307902,1.320115,1.342158,1.307794,1.304658,1.280816,1.291758,...,1.000005,1.000005,1.000005,0.499939,0.440907,0.370383,0.485082,0.448413,0.497323,0.259904
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,-1.88126,-0.3798023,-0.3863514,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,2.0,2.0,2.0,2.0,2.0,3.0,2.0,2.0,2.0,...,-0.6054198,-0.3798023,-0.3863514,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,3.0,3.0,3.0,3.0,3.0,4.0,4.0,4.0,4.0,4.0,...,-0.05320492,-0.3798023,-0.3863514,0.0,1.0,0.0,1.0,0.0,0.0,0.0
75%,4.0,4.0,4.0,4.0,4.0,4.0,5.0,5.0,4.0,5.0,...,0.5457583,-0.06947658,-0.05282384,1.0,1.0,0.0,1.0,1.0,1.0,0.0
max,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,5.0,...,4.838815,40.79008,40.2527,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [39]:
df.head()

Unnamed: 0,Satisfaction,Seat_comfort,Departure.Arrival.time_convenient,Food_drink,Gate_location,Inflightwifi_service,Inflight_entertainment,Online_support,Ease_of_Onlinebooking,Onboard_service,...,Flight_Distance,DepartureDelayin_Mins,ArrivalDelayin_Mins,Gender_Male,CustomerType_Loyal Customer,CustomerType_disloyal Customer,TypeTravel_Business travel,TypeTravel_Personal Travel,Class_Eco,Class_Eco Plus
0,neutral or dissatisfied,1,3,3,3,1,2,1,1,3,...,-0.611263,-0.095337,-0.181104,1,1,0,1,0,0,0
1,satisfied,5,2,5,4,3,5,3,3,4,...,-0.52848,-0.224639,-0.386351,0,0,1,0,0,0,0
2,satisfied,4,4,4,3,3,5,5,2,2,...,-0.979407,-0.27636,-0.386351,0,0,0,0,1,0,1
3,satisfied,4,4,4,3,1,4,1,1,1,...,0.301303,-0.379802,-0.386351,1,0,1,1,0,1,0
4,satisfied,4,4,2,4,4,4,4,4,4,...,1.940417,-0.379802,-0.386351,1,1,0,1,0,0,0
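
The cell above fits the scaler on the full dataset before the train/test split. A common alternative, sketched here under the assumption that keeping test-set statistics out of the preprocessing matters for the evaluation, is to fit the scaler on the training rows only and reuse it for the test rows. The data below is synthetic and the names `toy`, `train` and `test` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in for two of the continuous columns
toy = pd.DataFrame({
    "Age": rng.integers(18, 80, size=200).astype(float),
    "Flight_Distance": rng.integers(100, 4000, size=200).astype(float),
})
cols = ["Age", "Flight_Distance"]

train, test = train_test_split(toy, random_state=1)
train, test = train.copy(), test.copy()  # work on copies, not on views of toy

scaler = StandardScaler()
train[cols] = scaler.fit_transform(train[cols])  # learn mean and std from the training rows
test[cols] = scaler.transform(test[cols])        # apply the same mean and std to the test rows

print(train[cols].mean().round(6))
```

The training columns end up with mean 0 and unit variance; the test columns will be close to that but not exact, which is expected.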


In [44]:
# df = df_coded.copy()

### 15. Separate the column "Satisfaction" from the rest of the columns
- Create X and y

In [40]:
# We drop the highly correlated feature found earlier, as such features might affect our models.
to_drop.append('Satisfaction')

In [41]:
to_drop

['DepartureDelayin_Mins', 'Satisfaction']

In [42]:
X = df.drop(columns= to_drop)  # Separating the target and the rest
#X = df.drop(columns= ['Satisfaction'])
y = df.Satisfaction

In [48]:
X

Unnamed: 0,Seat_comfort,Departure.Arrival.time_convenient,Food_drink,Gate_location,Inflightwifi_service,Inflight_entertainment,Online_support,Ease_of_Onlinebooking,Onboard_service,Leg_room_service,...,Age,Flight_Distance,ArrivalDelayin_Mins,Gender_Male,CustomerType_Loyal Customer,CustomerType_disloyal Customer,TypeTravel_Business travel,TypeTravel_Personal Travel,Class_Eco,Class_Eco Plus
0,1,3,3,3,1,2,1,1,3,3,...,-0.624412,-0.611263,-0.181104,1,1,0,1,0,0,0
1,5,2,5,4,3,5,3,3,4,3,...,-1.285363,-0.528480,-0.386351,0,0,1,0,0,0,0
2,4,4,4,3,3,5,5,2,2,4,...,1.027966,-0.979407,-0.386351,0,0,0,0,1,0,1
3,4,4,4,3,1,4,1,1,1,4,...,-1.020982,0.301303,-0.386351,1,0,1,1,0,1,0
4,4,4,2,4,4,4,4,4,4,4,...,-0.492221,1.940417,-0.386351,1,1,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
90912,0,4,0,4,2,5,4,4,4,5,...,0.433110,-0.816761,-0.386351,0,0,0,0,1,1,0
90913,1,1,1,1,4,5,5,4,4,4,...,0.367015,1.561561,-0.386351,0,1,0,0,0,0,0
90914,4,4,4,3,4,4,5,1,1,4,...,1.821108,0.566211,-0.335039,0,1,0,0,0,1,0
90915,4,1,5,5,4,4,4,4,4,3,...,-0.161746,-0.929736,-0.386351,0,1,0,1,0,1,0


### 16. Create train and test datasets
- Use train_test_split

In [43]:
from sklearn.model_selection import train_test_split # Splitting the data for training and testing our model

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 1, stratify = y)

In [44]:
y_train.value_counts(normalize=True)

satisfied                  0.547318
neutral or dissatisfied    0.452682
Name: Satisfaction, dtype: float64

In [46]:
y_test.value_counts(normalize=True)

satisfied                  0.547338
neutral or dissatisfied    0.452662
Name: Satisfaction, dtype: float64

In [47]:
X_train.dtypes

Seat_comfort                           int64
Departure.Arrival.time_convenient      int64
Food_drink                             int64
Gate_location                          int64
Inflightwifi_service                   int64
Inflight_entertainment                 int64
Online_support                         int64
Ease_of_Onlinebooking                  int64
Onboard_service                        int64
Leg_room_service                       int64
Baggage_handling                       int64
Checkin_service                        int64
Cleanliness                            int64
Online_boarding                        int64
Age                                  float64
Flight_Distance                      float64
ArrivalDelayin_Mins                  float64
Gender_Male                            uint8
CustomerType_Loyal Customer            uint8
CustomerType_disloyal Customer         uint8
TypeTravel_Business travel             uint8
TypeTravel_Personal Travel             uint8
Class_Eco 

### 17. Print accuracy
- Print accuracy on test data using below models
- Logistic regression model trained using all the attributes
- Logistic regression model trained using only the feedback columns
- Decision tree model trained using all the attributes
- Random forest model trained using all the attributes

### Logistic Regression

In [48]:
# Logistic regression with all the attributes
from sklearn.linear_model import LogisticRegression #importing logistic regression

lr = LogisticRegression()

lr.fit(X_train, y_train)

pred = lr.predict(X_test)  # Predictions from logistic regression
logreg_score = lr.score(X_test, y_test)
logreg_score

0.834007919049714

Predicting customer satisfaction solely based on the feedback

In [49]:
#Logistic Regression with only feedback columns
X_train, X_test, y_train, y_test = train_test_split(X.loc[:,Feedback_cols], y, random_state = 1, stratify = y)

lr = LogisticRegression()

lr.fit(X_train, y_train)

pred = lr.predict(X_test)

logreg_feedback_score = lr.score(X_test, y_test)

print(f'Number of features used = {len(X_train.columns)}')
print(f'Accuracy in predicting customer satisfaction solely based on the feedback = {logreg_feedback_score}')

Number of features used = 14
Accuracy in predicting customer satisfaction solely based on the feedback = 0.8048834139903212


### Decision Tree

In [50]:
from sklearn.tree import DecisionTreeClassifier

X_train, X_test, y_train, y_test = train_test_split(X,y, random_state = 1, stratify = y)

dt = DecisionTreeClassifier()

dt.fit(X_train, y_train)

DT_score = dt.score(X_test, y_test)
pred = dt.predict(X_test)

print(f"Decision tree accuracy score: {DT_score}")

Decision tree accuracy score: 0.9258688957325121


### Random Forest

In [51]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth=10)

rf.fit(X_train, y_train)

RF_score = rf.score(X_test, y_test)

print(f'Random Forest accuracy score = {RF_score}')

Random Forest accuracy score = 0.9264408271007479


### 18. Print feature importance
- Print feature importance of Random Forest



In [52]:
#Decision Tree 
pd.Series(dt.feature_importances_, X_train.columns ).sort_values(ascending= False)

Inflight_entertainment               0.391439
Seat_comfort                         0.175919
Ease_of_Onlinebooking                0.057857
Flight_Distance                      0.048028
Checkin_service                      0.025652
Age                                  0.025647
Online_boarding                      0.023692
Inflightwifi_service                 0.023201
ArrivalDelayin_Mins                  0.022665
Gate_location                        0.022570
Leg_room_service                     0.019935
Cleanliness                          0.016569
TypeTravel_Personal Travel           0.016154
Gender_Male                          0.016014
Online_support                       0.016013
Baggage_handling                     0.015471
Departure.Arrival.time_convenient    0.014112
Class_Eco                            0.013513
CustomerType_disloyal Customer       0.013195
Onboard_service                      0.010966
Food_drink                           0.010751
CustomerType_Loyal Customer       

In [53]:
#RandomForest
pd.Series(rf.feature_importances_, X_train.columns).sort_values(ascending= False)

Inflight_entertainment               0.254149
Seat_comfort                         0.141897
Online_support                       0.082489
Ease_of_Onlinebooking                0.080846
Leg_room_service                     0.045429
Online_boarding                      0.041453
Onboard_service                      0.039103
Food_drink                           0.038055
CustomerType_disloyal Customer       0.035802
Gender_Male                          0.034299
Class_Eco                            0.025505
Cleanliness                          0.024321
Checkin_service                      0.018305
CustomerType_Loyal Customer          0.018302
Baggage_handling                     0.017363
TypeTravel_Personal Travel           0.017001
Flight_Distance                      0.015948
Inflightwifi_service                 0.015207
TypeTravel_Business travel           0.015029
Departure.Arrival.time_convenient    0.012851
Age                                  0.010190
Gate_location                     
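
The importances printed above are impurity-based, which means they are derived from the training data and can favour columns with many distinct values. A hedged cross-check is permutation importance on held-out data. The sketch below runs on a synthetic dataset (the `X_demo`, `y_demo` and `f0`..`f5` names are hypothetical stand-ins, not the notebook's `X`, `y`), so the numbers it prints say nothing about the airline data.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the airline data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=600, n_features=6, n_informative=3,
                                     n_redundant=0, random_state=1)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1, stratify=y_demo)

rf_demo = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=1)
rf_demo.fit(X_tr, y_tr)

# Impurity-based importances: computed from the training data, they sum to 1
impurity_imp = pd.Series(rf_demo.feature_importances_, index=feature_names)

# Permutation importances: mean drop in held-out accuracy when one column is shuffled
perm = permutation_importance(rf_demo, X_te, y_te, n_repeats=5, random_state=1)
perm_imp = pd.Series(perm.importances_mean, index=feature_names)

comparison = pd.DataFrame({"impurity": impurity_imp, "permutation": perm_imp})
print(comparison.sort_values("impurity", ascending=False))
```

If the two columns rank the features very differently, the impurity-based ranking deserves a second look before drawing business conclusions from it.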

# Model Tuning

### 19. Print cross validation score
- Decision tree model trained using all the attributes
- Random Forest model trained using all the attributes
- Fine tuned (using Grid Search or Random Search) Random Forest model

**Display all the scores above with their respective models in a single dataframe**



Cross Validation Score

In [54]:
from sklearn.model_selection import cross_val_score,cross_val_predict
#For Decision Tree dt
CV_DT_score = cross_val_score(dt, X_train, y_train, cv = 5).mean()
print(f'Cross validation score of Decision tree = {CV_DT_score}')

Cross validation score of Decision tree = 0.9262909379231168


In [55]:
#Random Forest rf
CV_RF_score = cross_val_score(rf, X_train, y_train, cv = 5).mean()
print(f'Cross validation score of Random forest = {CV_RF_score}')

Cross validation score of Random forest = 0.9243257995371861
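
The two mean scores above differ by roughly 0.002, so it helps to look at the spread across folds before declaring a winner. The following is a minimal sketch on synthetic data (the `X_demo`, `y_demo` names and the shuffled `StratifiedKFold` are assumptions for illustration, not what the cells above used).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=1)

# Explicit, seeded splitter so the folds are reproducible
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
fold_scores = cross_val_score(DecisionTreeClassifier(random_state=1), X_demo, y_demo, cv=cv)

# Report the per-fold spread alongside the mean
print(f"mean = {fold_scores.mean():.4f}, std = {fold_scores.std():.4f}")
```

A gap between two models that is smaller than the fold-to-fold standard deviation is weak evidence that one model is better.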


### Parameter Tuning Using GridSearch

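We tune only the Random Forest. Its mean CV score (0.9243) is in fact marginally below the Decision Tree's (0.9263), but the forest above was capped at `max_depth=10` while the tree was grown without a depth limit, so tuning may help the forest more.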

In [58]:
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV

#### RandomCV Search

In [57]:
3*2*4  # size of the full grid: 3 max_depth x 2 max_features x 4 min_samples_leaf values

24

In [59]:
rf_test = RandomForestClassifier(max_depth=10)
import time
start = time.time()
rf_test.fit(X_train, y_train)
end = time.time()
print((end-start)/60," minutes")

0.1002457062403361  minutes


In [60]:
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
import time
start = time.time()
parameters = {'max_depth': [30, 40, 70],
 'max_features': ['log2', 'sqrt'],
 'min_samples_leaf': [1, 2, 4, 8],
 'n_estimators': [100]}

# Sample 6 of the 24 possible combinations instead of trying them all
clf = RandomizedSearchCV(RandomForestClassifier(), parameters, cv = 5, verbose = 2, n_jobs= -1,n_iter=6)
clf.fit(X_train, y_train)

end = time.time()
print((end-start)/60," minutes")
# Best parameters
clf.best_params_

Fitting 5 folds for each of 6 candidates, totalling 30 fits
1.0691234509150187  minutes


{'n_estimators': 100,
 'min_samples_leaf': 1,
 'max_features': 'log2',
 'max_depth': 40}

In [62]:
# rf = RandomForestClassifier(bootstrap= True,
#  max_depth= 20,
#  max_features= 'sqrt',
#  min_samples_leaf= 1,
#  n_estimators= 100)

# rf.fit(X_train, y_train)

RS_RF_score = clf.score(X_test, y_test)

print(f'Random Forest accuracy score with best param = {RS_RF_score}')   

Random Forest accuracy score with best param = 0.9528816542014958
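
Note that `clf.score(X_test, y_test)` is the test accuracy of the refit best model, not a cross-validation score. The mean CV score of the winning candidate is stored separately on the search object. The sketch below shows the distinction on synthetic data; the small parameter space and the `X_demo`, `y_demo`, `search` names are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1, stratify=y_demo)

# Deliberately tiny search space: 9 combinations, 4 of them sampled
param_space = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 2, 4]}
search = RandomizedSearchCV(RandomForestClassifier(n_estimators=20, random_state=1),
                            param_space, n_iter=4, cv=3, random_state=1)
search.fit(X_tr, y_tr)

cv_score = search.best_score_          # mean CV accuracy of the best candidate (training data only)
test_score = search.score(X_te, y_te)  # accuracy of the refit best model on held-out data
print(f"best CV score = {cv_score:.4f}, test score = {test_score:.4f}")
```

When a task asks for the cross-validation score of a tuned model, `best_score_` is the value to report.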


In [63]:
from sklearn.model_selection import GridSearchCV,RandomizedSearchCV
import time
start = time.time()
# parameters = {'max_depth': [10, 20, 30, 40, 50],
#  'max_features': ['log2', 'sqrt'],
#  'min_samples_leaf': [1, 2, 4, 8],
#  'n_estimators': [100]}

parameters = {'max_depth': [30, 40, 70],
 'max_features': ['log2', 'sqrt'],
 'min_samples_leaf': [1, 2, 4, 8],
 'n_estimators': [100]}

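clf_gs = GridSearchCV(RandomForestClassifier(), parameters, cv = 5, verbose = 2, n_jobs= -1)
# Fit on the training split only. The recorded run below called clf_gs.fit(X, y), which
# exposes the test rows to the search and explains the 1.0 test accuracy reported later.
clf_gs.fit(X_train, y_train)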

end = time.time()
print((end-start)/60," minutes")

# Best parameters
clf_gs.best_params_

Fitting 5 folds for each of 24 candidates, totalling 120 fits
4.310887455940247  minutes


{'max_depth': 30,
 'max_features': 'log2',
 'min_samples_leaf': 1,
 'n_estimators': 100}

In [67]:
# rf = RandomForestClassifier(bootstrap= True,
#  max_depth= 50,
#  max_features= 'sqrt',
#  min_samples_leaf= 4,
#  n_estimators= 100)

# rf.fit(X_train, y_train)

GS_RF_score = clf_gs.score(X_test, y_test)

print(f'Random Forest accuracy with Grid Search score with best param = {GS_RF_score}')   

Random Forest accuracy with Grid Search score with best param = 1.0
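
A test accuracy of exactly 1.0 usually signals leakage rather than a perfect model. The 1.0 above comes from a run in which the search saw the full `X, y`, test rows included. The effect can be reproduced with a small sketch on noisy synthetic data (the `X_demo`, `y_demo`, `leaky` and `honest` names are hypothetical).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# flip_y adds label noise, so a near-perfect held-out score is implausible here
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, flip_y=0.2, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1, stratify=y_demo)

# Same model twice: one is trained on everything, one only on the training split
leaky = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_demo, y_demo)
honest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X_tr, y_tr)

leaky_score = leaky.score(X_te, y_te)    # test rows were memorised during training
honest_score = honest.score(X_te, y_te)  # test rows are genuinely unseen
print(f"leaky = {leaky_score:.4f}, honest = {honest_score:.4f}")
```

Only the second number estimates how the model would behave on new passengers.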


In [68]:
data = {'Technique' : ['Logistic Regression', "LR with only feedback columns ", 'Decision tree',
                       'Random forest', 'DT CV','RF CV','Tuned RS RF CV','Tuned GS RF CV'],
       'Score' : [logreg_score, logreg_feedback_score, DT_score, RF_score, CV_DT_score, CV_RF_score, RS_RF_score, GS_RF_score] }

result = pd.DataFrame(data)

In [69]:
result

Unnamed: 0,Technique,Score
0,Logistic Regression,0.834008
1,LR with only feedback columns,0.804883
2,Decision tree,0.925869
3,Random forest,0.926441
4,DT CV,0.926291
5,RF CV,0.924326
6,Tuned RS RF CV,0.952882
7,Tuned GS RF CV,1.0


# Dimensionality Reduction

### 20. Perform the following tasks

- Use PCA to reduce the number of dimensions such that the components capture 95% of the variance in the data
- Train Logistic Regression, Decision Tree and Random Forest using the principal components
- Calculate the accuracy scores for each of the models
- Calculate the cross validation scores for each of the above models trained using principal components

#### PCA is sensitive to feature scale, so the data must be scaled first; we already did this earlier


In [101]:
from sklearn.decomposition import PCA
#pca = PCA(10)# Initialize PCA object
pca = PCA(.95)
pca.fit(X_train)  # Fit the PCA object with the train data

In [102]:
X_train_pca = pca.transform(X_train)  # PCs for the train data
X_test_pca = pca.transform(X_test)    # PCs for the test data

X_train_pca.shape, X_test_pca.shape

((68187, 17), (22730, 17))

In [103]:
pca.explained_variance_

array([6.56249167, 4.67016428, 3.25158065, 1.84906655, 1.52855529,
       1.34024597, 1.09697412, 1.01838954, 0.94021169, 0.88092802,
       0.79598787, 0.73467932, 0.60038807, 0.58050698, 0.51886151,
       0.48277017, 0.42963188])
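
The array above holds raw variances per component. To check the 95% target directly, the cumulative `explained_variance_ratio_` is easier to read. Below is a minimal sketch on synthetic, standardised data (the `X_demo`, `pca_demo` names are assumptions; the notebook's own `pca` object could be inspected the same way).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical), standardised before PCA
X_demo, _ = make_classification(n_samples=500, n_features=12, n_informative=6, random_state=1)
X_scaled = StandardScaler().fit_transform(X_demo)

# A float in (0, 1) asks PCA to keep just enough components to reach that variance share
pca_demo = PCA(n_components=0.95).fit(X_scaled)

# Running total of the variance share captured by the first k components
cumulative = np.cumsum(pca_demo.explained_variance_ratio_)
print(f"components kept = {pca_demo.n_components_}")
print(np.round(cumulative, 3))
```

The last entry of `cumulative` is the share of variance actually retained, which should sit at or just above the requested 0.95.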

In [104]:
lr = LogisticRegression()
lr.fit(X_train_pca, y_train)
logreg_score_pca = lr.score(X_test_pca, y_test)


dt = DecisionTreeClassifier()
dt.fit(X_train_pca, y_train)
DT_score_pca = dt.score(X_test_pca, y_test)

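# 'sqrt' is what 'auto' meant for classifiers; recent scikit-learn releases no longer accept 'auto'
rf = RandomForestClassifier(max_depth = 20, max_features ='sqrt', min_samples_leaf = 5, n_estimators = 100)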
rf.fit(X_train_pca, y_train)
RF_score_pca = rf.score(X_test_pca, y_test)

In [105]:
print("Log reg")
lr = LogisticRegression()
logreg_score_pca_cv = cross_val_score(lr,X_train_pca, y_train , cv = 5,n_jobs=-1).mean()
print("Decision Tree")
dt = DecisionTreeClassifier()
DT_score_pca_cv = cross_val_score(dt, X_train_pca, y_train, cv = 5,n_jobs=-1).mean()
print("RF")
rf = RandomForestClassifier(max_depth = 20, max_features ='sqrt', min_samples_leaf = 4, n_estimators = 100)
RF_score_pca_cv = cross_val_score(rf, X_train_pca, y_train, cv = 5,n_jobs=-1).mean()

Log reg
Decision Tree
RF


In [106]:
result = pd.DataFrame({'Algorithm' : ['Logistic Regression', 'Decision Tree', 'Random Forest'],
                      'Accuracy_score': [logreg_score_pca, DT_score_pca, RF_score_pca],
                      'Cross_val_score' : [logreg_score_pca_cv, DT_score_pca_cv, RF_score_pca_cv]})
result

Unnamed: 0,Algorithm,Accuracy_score,Cross_val_score
0,Logistic Regression,0.809327,0.810682
1,Decision Tree,0.854597,0.85025
2,Random Forest,0.906115,0.904161


# Pipeline - Automate and Simplify the process

In [75]:
#pip install --upgrade category_encoders

In [83]:
import pandas as pd
import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler,LabelEncoder, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier   # classifier used in the pipelines below
from sklearn.decomposition import PCA                 # optional PCA step in the pipelines below
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from category_encoders import OrdinalEncoder



numeric_features = ['Age', 'Flight_Distance', 'ArrivalDelayin_Mins']

feedback_features = ['Seat_comfort', 'Departure.Arrival.time_convenient', 'Food_drink',
       'Gate_location', 'Inflightwifi_service', 'Inflight_entertainment',
       'Online_support', 'Ease_of_Onlinebooking', 'Onboard_service',
       'Leg_room_service', 'Baggage_handling', 'Checkin_service',
       'Cleanliness', 'Online_boarding']

other_cat_cols =  ['Gender', 'CustomerType', 'TypeTravel', 'Class']


#TRANSFORMERS



numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])


feedback_feature_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not_captured')),
    ('label_encoder', OrdinalEncoder())
    #('scaler', StandardScaler())
])


other_cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='not_captured')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])


preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('feed_col', feedback_feature_transformer, feedback_features),
        ('other_cat_col', other_cat_transformer, other_cat_cols )
    ])
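
One caveat on the feedback columns: when an ordinal encoder is given no explicit mapping, the integer codes need not follow the rating order. The sketch below swaps in scikit-learn's own `OrdinalEncoder` with an explicit `categories` list to pin the order. The worst-to-best ordering in `rating_order` is an assumption based on the rating labels in the raw data, and `Gate_location` uses different labels, so it would need its own list.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder as SkOrdinalEncoder

# Assumed worst-to-best order of the rating labels (hypothetical, verify against the data)
rating_order = ["extremely poor", "poor", "need improvement",
                "acceptable", "good", "excellent"]

# Tiny toy frame with two feedback-style columns
toy = pd.DataFrame({"Seat_comfort": ["good", "poor", "excellent"],
                    "Food_drink": ["acceptable", "extremely poor", "good"]})

# One category list per column, all sharing the same order
encoder = SkOrdinalEncoder(categories=[rating_order] * toy.shape[1])
encoded = encoder.fit_transform(toy)
print(encoded)
```

With the order fixed, a larger code always means a better rating, which is the property tree splits and linear coefficients implicitly rely on.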


In [84]:
#Adding into Pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      #('PCA', PCA(n_components = 10)),
                      ('classifier', RandomForestClassifier(max_depth= 20,max_features= 'sqrt',min_samples_leaf= 4,n_estimators= 100))])


In [85]:
#Taking the raw data
data = df2.set_index("Id").join(df1.set_index("ID"))
data

Id,Satisfaction,Seat_comfort,Departure.Arrival.time_convenient,Food_drink,Gate_location,Inflightwifi_service,Inflight_entertainment,Online_support,Ease_of_Onlinebooking,Onboard_service,...,Cleanliness,Online_boarding,Gender,CustomerType,Age,TypeTravel,Class,Flight_Distance,DepartureDelayin_Mins,ArrivalDelayin_Mins
198671,neutral or dissatisfied,poor,acceptable,acceptable,manageable,poor,need improvement,poor,poor,acceptable,...,need improvement,poor,Male,Loyal Customer,30,Business travel,Business,1354,11,8.0
193378,satisfied,excellent,need improvement,excellent,Convinient,acceptable,excellent,acceptable,acceptable,good,...,excellent,acceptable,Female,disloyal Customer,20,,Business,1439,6,0.0
174522,satisfied,good,good,good,manageable,acceptable,excellent,excellent,need improvement,need improvement,...,need improvement,excellent,Female,,55,Personal Travel,Eco Plus,976,4,0.0
191830,satisfied,good,good,good,manageable,poor,good,poor,poor,poor,...,acceptable,poor,Male,disloyal Customer,24,Business travel,Eco,2291,0,0.0
221497,satisfied,good,good,,Convinient,good,good,good,good,good,...,good,good,Male,Loyal Customer,32,Business travel,Business,3974,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
152951,satisfied,extremely poor,good,extremely poor,Convinient,need improvement,excellent,good,good,good,...,good,acceptable,Female,,46,Personal Travel,Eco,1143,0,0.0
220134,satisfied,poor,poor,poor,Inconvinient,good,excellent,excellent,good,good,...,good,good,Female,Loyal Customer,45,,Business,3585,1,0.0
174403,satisfied,good,good,good,manageable,good,good,excellent,poor,poor,...,poor,excellent,Female,Loyal Customer,67,,Eco,2563,3,2.0
224128,satisfied,good,poor,excellent,very convinient,good,good,good,good,good,...,acceptable,good,Female,Loyal Customer,37,Business travel,Eco,1027,0,0.0


In [86]:
#Getting X and y
X1 = data.drop(['Satisfaction', 'DepartureDelayin_Mins'], axis = 1)
y1 = data['Satisfaction']

In [87]:
#Data split (stratify on y1, the target that belongs to X1)
X_trains, X_tests, y_trains, y_tests = train_test_split(X1,y1, random_state = 1, stratify = y1)

In [88]:
X_trains.shape,y_trains.shape

((68187, 21), (68187,))

In [89]:
#Fitting Pipeline 
clf.fit(X_trains, y_trains)

In [90]:
#Getting score 
clf.score(X_tests, y_tests)

0.9476462824461065

In [92]:
clf.get_params()

{'memory': None,
 'steps': [('preprocessor',
   ColumnTransformer(transformers=[('num',
                                    Pipeline(steps=[('imputer', SimpleImputer()),
                                                    ('scaler', StandardScaler())]),
                                    ['Age', 'Flight_Distance',
                                     'ArrivalDelayin_Mins']),
                                   ('feed_col',
                                    Pipeline(steps=[('imputer',
                                                     SimpleImputer(fill_value='not_captured',
                                                                   strategy='constant')),
                                                    ('label_encoder',
                                                     OrdinalEncoder())]),
                                    ['Seat_comfort',
                                     'Departure.Arrival.time_convenient...
                                     'Inflight_entert

In [93]:
#Adding into Pipeline
clf_1 = Pipeline(steps=[('preprocessor', preprocessor),
                      ('PCA', PCA(n_components = 10)),
                      ('classifier', RandomForestClassifier(max_depth= 20,max_features= 'sqrt',min_samples_leaf= 4,n_estimators= 100))])


In [94]:
#Getting score 
#Fitting Pipeline 
clf_1.fit(X_trains, y_trains)
clf_1.score(X_tests, y_tests)

0.8904971403431589

#### Let's do grid search with pipelines

In [95]:
clf_tuned = Pipeline(steps=[('preprocessor', preprocessor),
                      # ('PCA', PCA(n_components = 10)),
                      ('classifier', RandomForestClassifier(max_depth= 20,max_features= 'sqrt',min_samples_leaf= 4,n_estimators= 100))])


In [96]:
import time
start = time.time()

# parameters = {'max_depth': [30, 40, 70],
#  'max_features': ['log2', 'sqrt'],
#  'min_samples_leaf': [1, 2, 4, 8],
#  'n_estimators': [100]}

parameters = [
    {
        'preprocessor__num__imputer__strategy': (['mean','median']),
        'classifier__max_depth': (30, 40),
        'classifier__max_features': (['log2','sqrt'])
    }
]


grid_search = GridSearchCV(clf_tuned, parameters,n_jobs=-1,cv = 5, verbose = 2,)
grid_search.fit(X_trains, y_trains)

end = time.time()
print((end-start)/60," minutes")
#Getting score 
grid_search.score(X_tests, y_tests)


Fitting 5 folds for each of 8 candidates, totalling 40 fits
1.3485207319259644  minutes


0.9491860976682798

In [97]:
grid_search.best_params_

{'classifier__max_depth': 40,
 'classifier__max_features': 'sqrt',
 'preprocessor__num__imputer__strategy': 'median'}
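
The keys in `best_params_` follow scikit-learn's `<step name>__<parameter>` convention, with one more `__` segment for each level of nesting. A minimal sketch of the same convention on a two-step pipeline follows; the `pipe`, `param_grid` and `X_demo` names are hypothetical, and logistic regression is used only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)

pipe = Pipeline(steps=[("scaler", StandardScaler()),
                       ("classifier", LogisticRegression(max_iter=1000))])

# "<step name>__<parameter>" reaches into a step of the pipeline
param_grid = {"scaler__with_mean": [True, False],
              "classifier__C": [0.1, 1.0, 10.0]}

# Every key must be one of the names listed by get_params()
assert set(param_grid) <= set(pipe.get_params())

search = GridSearchCV(pipe, param_grid, cv=3).fit(X_demo, y_demo)
print(search.best_params_)
```

Checking the keys against `get_params()` before launching a long search catches misspelt step or parameter names early.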

In [100]:

start = time.time()
parameters = [
    {
        'preprocessor__num__imputer__strategy': (['mean','median']),
        'classifier__max_depth': (30, 40,50,60,80,100),
        'classifier__max_features': (['log2','sqrt'])
    }
]
Random_search = RandomizedSearchCV(clf_tuned, parameters,n_jobs=-1,cv = 5, verbose = 3,n_iter=1)
Random_search.fit(X_trains, y_trains)

end = time.time()
print((end-start)/60," minutes")
#Getting score 
Random_search.score(X_tests, y_tests)


Fitting 5 folds for each of 1 candidates, totalling 5 fits
0.33193029165267945  minutes


0.9449626044874615
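
With `n_iter=1` the random search evaluates a single combination, so the score above reflects one draw from the space rather than a tuned model. The sketch below uses `ParameterGrid` and `ParameterSampler` to show how large the space is and what one draw looks like; the `random_state=0` seed is an assumption added for reproducibility.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same search space as the cell above
space = {"preprocessor__num__imputer__strategy": ["mean", "median"],
         "classifier__max_depth": [30, 40, 50, 60, 80, 100],
         "classifier__max_features": ["log2", "sqrt"]}

# Number of combinations an exhaustive grid search would try
n_total = len(ParameterGrid(space))

# The combinations a randomized search with n_iter=1 would evaluate
sampled = list(ParameterSampler(space, n_iter=1, random_state=0))

print(f"{len(sampled)} of {n_total} combinations sampled")
print(sampled[0])
```

Raising `n_iter` trades run time for coverage of the space.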

In [None]:
# Best parameters found by the random search
Random_search.best_params_

# Conclusion:

- Given a dataset, we have seen how to perform EDA on it
- The data was neither entirely continuous nor entirely categorical, so we adapted the analysis to draw insights
- We used ordinary bar charts, but interpreted them in a way specific to this problem
- Beyond EDA, we have seen how to preprocess data and train supervised models on it
- Finally, we put all the steps in one place and built a pipeline using scikit-learn's `Pipeline` class

# Explore:

- You can cluster customers into segments to get more insight into their behaviour.
- You can create new features and select the best ones to improve your model further.

and more ----