# Airline Customer Satisfaction Capstone
## EDA
## Cleaning The Dataset

#### What problems does an airline face?

1: There are 5000 airlines currently operating. This means that, relative to other industries, there has been little consolidation, so competition for passengers is fierce.
2: Flyers have significant choice on which airline to fly with.
3: COVID restrictions and labour shortages have made operating airlines even more challenging.

#### Why do airlines need to know about customer satisfaction?

1: There is limited capital to invest.
2: Airlines want to know what to invest in to retain existing customers whilst attracting new ones.
3: Airlines want to know which customers to focus on.


#### What does this project aim to do?

Whether it's deciding what refreshments to offer or how much to invest in the online booking process, this project mimics the type of analysis a data scientist would conduct to answer these kinds of questions.

#### Where is the data from?

This data is sourced from John.D on Kaggle: https://www.kaggle.com/datasets/johndddddd/customer-satisfaction.

The data is reportedly from an airline survey, though the exact source is not shared.

This means the data suffers from the general disadvantages of survey data:

1: Valuable data is missing from individuals who did not fill in the survey, and their responses could contradict our findings.

2: The data records what individuals believe their satisfaction to be, which may not represent their true satisfaction.

In [1]:
# Below I am importing the relevant libraries and packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

import random
np.random.seed(123)
random.seed(123)

from sklearn.model_selection import train_test_split

# Filter warnings
import warnings
warnings.filterwarnings('ignore')

import os
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer

In [2]:
# Below I am setting df to our data
os.getcwd()
df = pd.read_excel("satisfaction_2015.xlsx")

FileNotFoundError: [Errno 2] No such file or directory: 'satisfaction_2015.xlsx'
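
The `FileNotFoundError` above suggests the workbook was not in the working directory when the cell ran. A minimal sketch of a guard that fails with a more descriptive message is shown below. The helper name `resolve_data_path` and its `search_dir` parameter are hypothetical and not part of the original notebook.

```python
from pathlib import Path


def resolve_data_path(filename, search_dir=None):
    """Return the absolute path to filename, or raise a descriptive error.

    Hypothetical helper: it is not part of the original notebook.
    """
    # Default to the current working directory, which is where read_excel looks
    base = Path(search_dir) if search_dir is not None else Path.cwd()
    candidate = base / filename
    if not candidate.is_file():
        raise FileNotFoundError(
            f"Could not find '{filename}' in '{base}'. "
            "Download it from Kaggle and place it in that folder."
        )
    return candidate.resolve()


# Usage (commented out because it needs the real workbook on disk):
# df = pd.read_excel(resolve_data_path("satisfaction_2015.xlsx"))
```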

In [None]:
# Below I am looking at the first 2 rows of the dataframe
df.head(2)

In [None]:
# Below I am looking at the shape of the dataframe:
df.shape

### Removing Missing/Null Values

In [None]:
# Below I am identifying whether any columns have null values
df_nulls = df.isna().any()
df_nulls.value_counts()

In [None]:
# Below I am identifying the % of null values in each column
df.isnull().sum()/len(df)
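
To make the per-column null share easier to read, the result can be sorted and filtered to the columns that actually contain nulls. The sketch below uses a small made-up dataframe (`toy`) in place of the real survey data.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real survey; the values are made up
toy = pd.DataFrame({
    "Age": [25, 40, 31, 58],
    "Arrival Delay in Minutes": [0.0, np.nan, 12.0, 3.0],
})

# isnull().mean() gives the same result as isnull().sum() / len(toy)
null_share = toy.isnull().mean().sort_values(ascending=False)

# Keep only the columns that contain at least one null value
columns_with_nulls = null_share[null_share > 0]
print(columns_with_nulls)
```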

We will see in a later notebook that the 'Arrival Delay in Minutes' column suffers from multicollinearity. It will therefore be dropped in the future, but for now it is left as is in our dataset.
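
One common first look at multicollinearity is a pairwise correlation matrix. The sketch below assumes, for illustration only, that the arrival delay is correlated with the departure delay (the notebook does not state which column is involved) and uses made-up values.

```python
import pandas as pd

# Made-up values; the pairing with departure delay is an assumption
toy = pd.DataFrame({
    "Departure Delay in Minutes": [0, 5, 20, 60, 120],
    "Arrival Delay in Minutes": [0, 3, 25, 58, 115],
    "Age": [34, 51, 27, 45, 39],
})

# Pearson correlation between every pair of numeric columns
corr = toy.corr()
delay_corr = corr.loc["Departure Delay in Minutes", "Arrival Delay in Minutes"]
print(round(delay_corr, 3))
```

A correlation close to 1 between two features is a hint that one of them carries little extra information for a linear model.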

In [None]:
# Below I am identifying the rows which have null values
df[df['Arrival Delay in Minutes'].isnull()]

If the missing values were something we could look up (like latitude, longitude etc.) then we could try to replace them with the correct values. That is not the case here, so our best course of action is to delete these rows. Luckily they account for a small % of the overall dataset.
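
Before deleting rows it can help to measure how much of the dataset would be lost. A minimal sketch on made-up data, restricting the drop to the one column known to contain nulls:

```python
import numpy as np
import pandas as pd

# Made-up data with a single missing arrival delay
toy = pd.DataFrame({
    "Age": [25, 40, 31, 58, 47],
    "Arrival Delay in Minutes": [0.0, np.nan, 12.0, 3.0, 7.0],
})

rows_before = len(toy)

# subset= limits the null check to the named column
toy_clean = toy.dropna(subset=["Arrival Delay in Minutes"])

share_dropped = (rows_before - len(toy_clean)) / rows_before
print(f"Dropped {share_dropped:.1%} of rows")
```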

In [None]:
# Below I am dropping the rows with null values in the 'Arrival Delay in Minutes' column
# (passing both how= and thresh= raises a TypeError in recent pandas versions)
df = df.dropna(subset=['Arrival Delay in Minutes'])

In [None]:
# Below I am sanity checking that the null values have been removed
df.isnull().sum()/len(df)

Below I am checking for any rows with null values

In [None]:
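def nans(df):
    """Return the rows of df that contain at least one null value."""
    return df[df.isnull().any(axis=1)]

In [None]:
# Below I am counting the rows that still contain null values
row_null = nans(df)
print(f"There are {row_null.shape[0]} rows with missing values.")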

### Removing Duplicated Values

In [None]:
# Below I am checking for any duplicated rows
df.duplicated().sum()
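
If the count above were greater than zero, the duplicated rows could be removed with `drop_duplicates()`. A minimal sketch on made-up data:

```python
import pandas as pd

# Made-up data in which the third row repeats the first
toy = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Age": [25, 40, 25],
})

# duplicated() flags every repeat of an earlier row
n_duplicates = toy.duplicated().sum()

# Keep the first occurrence of each row and tidy the index
toy_deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(toy_deduped))
```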

### Analysing Column Data Types

We need to understand what each column is showing. We can look at the datatype of each column:

In [None]:
# Below I am using the built-in 'info()' method to identify the datatype of each column
df.info()

The id of a customer has no relevance to whether a customer is satisfied or not, since the id is assigned at random. Therefore we can remove it from our dataframe.

In [None]:
# Below I am checking that every value in 'Arrival Delay in Minutes' is a whole number
(df['Arrival Delay in Minutes'] % 1 == 0).all()

If this returns True, every value in 'Arrival Delay in Minutes' is a whole number, which means the column can be stored as an int data type without losing information.
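
A whole-number sum on its own does not guarantee that each value is whole, so a per-value check is the safer test before casting to int. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up delays: two half-minute values that add up to a whole number
delays = pd.Series([0.5, 0.5, 1.0])

# The sum is whole even though two of the values are not
sum_is_whole = float(delays.sum()).is_integer()

# Checking each value separately catches the fractional entries
all_values_whole = bool((delays % 1 == 0).all())
print(sum_is_whole, all_values_whole)
```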

In [None]:
# Below I am changing the 'Arrival Delay in Minutes' column to the int datatype
df['Arrival Delay in Minutes'] = df['Arrival Delay in Minutes'].astype(int)

### Renaming and Dropping Columns

First I need to get a better understanding of what each column represents.

In [None]:
# Below I am printing the value counts of each column to see which values it takes
columns_to_inspect = [
    'satisfaction_v2', 'Gender', 'Customer Type', 'Age', 'Type of Travel',
    'Class', 'Flight Distance', 'Inflight wifi service',
    'Departure/Arrival time convenient', 'Ease of Online booking',
    'Gate location', 'Food and drink', 'Online boarding', 'Seat comfort',
    'Inflight entertainment', 'On-board service', 'Leg room service',
    'Baggage handling', 'Checkin service', 'Inflight service', 'Cleanliness',
    'Departure Delay in Minutes', 'Arrival Delay in Minutes',
]
separator = "-------------------------------------------------------------------"

for column in columns_to_inspect:
    print(separator)
    print(f"The unique values for '{column}' are:")
    print("\n")
    print(df[column].value_counts())
print(separator)

#### Data Description v1

Categorical Columns:
- `satisfaction_v2` --> This is our target feature. It shows whether or not a customer was satisfied
- `Gender` --> This shows whether the customer was male or female
- `Customer Type` --> This shows the loyalty type of the customer
- `Type of Travel` --> This shows whether the customer travelled for business or personal reasons
- `Class` --> This is the class the customer flew in (Eco, Eco Plus, Business)

Numerical Columns With No Range Limit:
- `Age` --> This is the age of the customer
- `Flight Distance` --> This is the flight distance in miles
- `Departure Delay in Minutes` --> This is the number of minutes the flight was delayed at departure
- `Arrival Delay in Minutes` --> This is the number of minutes the flight was delayed on arrival at the destination

Numerical Columns Rating 0 - 5:
- `Inflight wifi service` --> This is the satisfaction rating of the inflight wifi
- `Departure/Arrival time convenient` --> This is the satisfaction rating of how convenient the departure/arrival time was
- `Ease of Online booking` --> This is the satisfaction rating of the online booking process
- `Gate location` --> This is the satisfaction rating of the gate location
- `Food and drink` --> This is the satisfaction rating of the food and drink offered in flight
- `Online boarding` --> This is the satisfaction rating of the online boarding
- `Seat comfort` --> This is the satisfaction rating of the inflight seats
- `Inflight entertainment` --> This is the satisfaction rating of the inflight entertainment
- `On-board service` --> This is the satisfaction rating of the onboard service
- `Baggage handling` --> This is the satisfaction rating of the baggage handling
- `Checkin service` --> This is the satisfaction rating of the checkin service
- `Leg room service` --> This is the satisfaction rating of the inflight leg room
- `Inflight service` --> This is the satisfaction rating of the inflight service
- `Cleanliness` --> This is the satisfaction rating of how clean the aeroplane was
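
The 0 - 5 range described above can be verified programmatically rather than by reading the value counts. A minimal sketch on made-up data, using two of the rating columns as examples:

```python
import pandas as pd

# Made-up ratings; Age is included to show it is left out of the check
toy = pd.DataFrame({
    "Seat comfort": [0, 3, 5],
    "Cleanliness": [4, 2, 5],
    "Age": [25, 40, 61],
})

rating_columns = ["Seat comfort", "Cleanliness"]
valid_ratings = [0, 1, 2, 3, 4, 5]

# For each rating column, True if every value is one of the valid ratings
in_range = toy[rating_columns].isin(valid_ratings).all()
print(in_range)
```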

These column names need to be optimised so they best describe what they are showing.

In [None]:
# Below I am renaming all of the columns in a single call
column_renames = {
    'satisfaction_v2': 'satisfaction_target',
    'Gender': 'customer_gender',
    'Customer Type': 'customer_loyalty_type',
    'Age': 'customer_age',
    'Type of Travel': 'customer_travel_type',
    'Class': 'customer_class_type',
    'Flight Distance': 'flight_distance',
    'Inflight wifi service': 'flight_wifi_satisfaction_rating',
    'Departure/Arrival time convenient': 'departure/arrival_time_satisfaction_rating',
    'Ease of Online booking': 'online_booking_satisfaction_rating',
    'Gate location': 'gate_location_satisfaction_rating',
    'Food and drink': 'food/drink_satisfaction_rating',
    'Online boarding': 'online_boarding_satisfaction_rating',
    'Seat comfort': 'seat_comfort_satisfaction_rating',
    'Inflight entertainment': 'inflight_entertainment_satisfaction_rating',
    'On-board service': 'onboard_service_satisfaction_rating',
    'Leg room service': 'leg_room_n_satisfaction_rating',
    'Baggage handling': 'baggage_handling_satisfaction_rating',
    'Checkin service': 'checkin_service_satisfaction_rating',
    'Inflight service': 'inflight_service_satisfaction_rating',
    'Cleanliness': 'cleanliness_satisfaction_rating',
    'Departure Delay in Minutes': 'departure_delay_in_minutes',
    'Arrival Delay in Minutes': 'arrival_delay_in_minutes',
}
df = df.rename(columns=column_renames)

In [None]:
# Below I am sanity checking that the column names have been changed correctly
df.head(2)
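
By default `rename` silently skips any key that does not match an existing column, so a misspelt column name would go unnoticed when eyeballing `head()`. A minimal sketch of an explicit check on made-up data, where the misspelt key `Clas` is deliberate:

```python
import pandas as pd

# Made-up data; "Clas" is a deliberately misspelt key
toy = pd.DataFrame({"Gender": ["Male"], "Age": [25]})
mapping = {
    "Gender": "customer_gender",
    "Age": "customer_age",
    "Clas": "customer_class_type",
}

# Keys that do not match any existing column would be ignored by rename
missing_keys = [old for old in mapping if old not in toy.columns]

toy = toy.rename(columns=mapping)
print(missing_keys, list(toy.columns))
```

Passing `errors="raise"` to `rename` is an alternative that makes pandas raise a `KeyError` for unmatched keys.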

In [None]:
# Below I am saving the cleaned dataframe to a CSV file for use in later notebooks
df.to_csv('cleaned_airlines.csv')
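
Because `to_csv` writes the index by default, reading the file back without `index_col=0` produces an extra `Unnamed: 0` column. The sketch below shows the round trip on made-up data, using an in-memory buffer in place of a file on disk.

```python
import io

import pandas as pd

# Made-up data with a non-contiguous index, as left behind by dropping rows
toy = pd.DataFrame({"customer_age": [25, 40]}, index=[3, 7])

buffer = io.StringIO()
toy.to_csv(buffer)  # the index is written as the first, unnamed column

# Reading without index_col turns the saved index into a data column
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reading with index_col=0 restores the original index
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns), list(restored.columns))
```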