# Building Machine Learning Models

## Part 1: Exploring the data and completing the cleaning process

Part 1 conducts basic exploratory data analysis (EDA), data cleaning, and other manipulations to prepare the data for modeling. It covers two steps:

1. Importing packages and loading data
2. Exploring the data and completing the cleaning process

### 1. Imports

#### 1.1. Import packages

Import the relevant Python packages. For now, import only `pandas`; other data-manipulation packages can be added later if they turn out to be necessary.

In [12]:
import pandas as pd

#### 1.2. Load the dataset

Use `pandas` to load `Galactico_Airline.csv`. The resulting `DataFrame` is saved in a variable named `df_original`.

In [16]:
df_original = pd.read_csv('../../../data/Galactico_Airline.csv')
df_original

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
129875,satisfied,disloyal Customer,29,Personal Travel,Eco,1731,5,5,5,3,...,2,2,3,3,4,4,4,2,0,0.0
129876,dissatisfied,disloyal Customer,63,Personal Travel,Business,2087,2,3,2,4,...,1,3,2,3,3,1,2,1,174,172.0
129877,dissatisfied,disloyal Customer,69,Personal Travel,Eco,2320,3,0,3,3,...,2,4,4,3,4,2,3,2,155,163.0
129878,dissatisfied,disloyal Customer,66,Personal Travel,Eco,2450,3,2,3,2,...,2,3,3,2,3,2,1,2,193,205.0
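
The `...` column in the preview above marks where pandas elides columns to fit the display. A minimal sketch of how to show every column, assuming a small stand-in frame (`demo` is hypothetical, not the airline data) and using `pd.option_context` so the setting is only temporary:

```python
import pandas as pd

# Hypothetical stand-in for df_original: 30 columns, wide enough to be elided.
demo = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

# Temporarily lift the column limit; the options revert after the with-block.
with pd.option_context("display.max_columns", None, "display.width", 1000):
    full_text = repr(demo)

print(full_text)
```

In a notebook, calling `pd.set_option("display.max_columns", None)` once would apply the same setting to all later cells.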


### 2. Data exploration, cleaning and preparation

After loading the dataset, prepare the data for decision tree classifiers. This includes:
*   Exploring the data
*   Checking for missing values
*   Dropping the rows with missing values
*   Encoding the data
*   Saving the data to a `CSV` file (it serves as the base dataset for the later modeling parts)

#### 2.1. Explore the data

Check the type of each column. Note that most `ML` models expect `numeric` data.

In [4]:
df_original.dtypes

satisfaction                          object
Customer Type                         object
Age                                    int64
Type of Travel                        object
Class                                 object
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
dtype: object

Review the `object` type columns:

`Customer Type`, `Type of Travel`, `Class` and `satisfaction`

As mentioned above, these columns must be encoded as `numeric` data before they can be passed to `ML` models.

Note that `satisfaction` is the `target` of this project.

In [5]:
print('**Features**')
print('  Customer Type:\t', df_original['Customer Type'].unique())
print('  Type of Travel:\t', df_original['Type of Travel'].unique())
print('  Class:\t\t', df_original['Class'].unique())
print('**Target**')
print('  Satisfaction:\t\t', df_original['satisfaction'].unique())

**Features**
  Customer Type:	 ['Loyal Customer' 'disloyal Customer']
  Type of Travel:	 ['Personal Travel' 'Business travel']
  Class:		 ['Eco' 'Business' 'Eco Plus']
**Target**
  Satisfaction:		 ['satisfied' 'dissatisfied']
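
The same check can be done without naming each column by hand. A minimal sketch, assuming a hypothetical miniature frame `demo` built from the labels printed above, that selects the non-numeric columns with `select_dtypes` and collects their distinct values:

```python
import pandas as pd

# Hypothetical miniature of the airline data (labels taken from the output above).
demo = pd.DataFrame({
    "satisfaction": ["satisfied", "dissatisfied", "satisfied"],
    "Customer Type": ["Loyal Customer", "disloyal Customer", "Loyal Customer"],
    "Class": ["Eco", "Business", "Eco Plus"],
    "Age": [65, 47, 15],
})

# Select every non-numeric column instead of listing them one by one.
object_cols = demo.select_dtypes(exclude="number").columns.tolist()

# Map each such column to its distinct values, in order of first appearance.
unique_values = {col: demo[col].unique().tolist() for col in object_cols}

for col, values in unique_values.items():
    print(f"{col}: {values}")
```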


#### 2.2. Check for missing values

Many `scikit-learn` estimators raise an error when the input contains missing values, so check each column for missing values before modeling.

In [6]:
df_original.isna().sum()

satisfaction                           0
Customer Type                          0
Age                                    0
Type of Travel                         0
Class                                  0
Flight Distance                        0
Seat comfort                           0
Departure/Arrival time convenient      0
Food and drink                         0
Gate location                          0
Inflight wifi service                  0
Inflight entertainment                 0
Online support                         0
Ease of Online booking                 0
On-board service                       0
Leg room service                       0
Baggage handling                       0
Checkin service                        0
Cleanliness                            0
Online boarding                        0
Departure Delay in Minutes             0
Arrival Delay in Minutes             393
dtype: int64
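
Only `Arrival Delay in Minutes` has missing values: 393 of 129,880 rows, or roughly 0.3%. A minimal sketch, assuming a hypothetical toy frame `demo` rather than the real data, of how to list only the affected columns and compute the share of affected rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one arrival delay is missing.
demo = pd.DataFrame({
    "Departure Delay in Minutes": [0, 310, 0, 12],
    "Arrival Delay in Minutes": [0.0, 305.0, np.nan, 9.0],
})

# Missing-value count per column, keeping only columns that have any.
missing_counts = demo.isna().sum()
missing_counts = missing_counts[missing_counts > 0]

# Share of rows that contain at least one missing value.
rows_with_missing = demo.isna().any(axis=1)
missing_row_share = rows_with_missing.mean()

print(missing_counts)
print(f"{missing_row_share:.1%} of rows contain a missing value")
```

A small share like this is one reason why dropping the affected rows, rather than imputing them, can be a reasonable choice.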

#### 2.2.1. Drop the rows with missing values (Cleaning)

Drop the rows with missing values and save the resulting pandas DataFrame in a variable named `df`.

In [7]:
print('before drop:\t', df_original.shape)
df = df_original.dropna(axis=0).reset_index(drop=True)
print('after drop:\t', df.shape)

before drop:	 (129880, 22)
after drop:	 (129487, 22)
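
The shapes differ by 393 rows, which matches the number of missing arrival delays. A minimal sketch of the same sanity check, assuming a hypothetical toy frame `demo`; it also uses `subset=` (a variation, not what the cell above does) to restrict the drop to one named column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing arrival delay.
demo = pd.DataFrame({
    "Age": [65, 47, 15, 60],
    "Arrival Delay in Minutes": [0.0, np.nan, 0.0, 5.0],
})

# Drop only rows where the named column is missing, then renumber the index.
cleaned = demo.dropna(subset=["Arrival Delay in Minutes"]).reset_index(drop=True)

# Sanity check: the number of removed rows equals the number of missing values.
n_dropped = len(demo) - len(cleaned)
assert n_dropped == demo["Arrival Delay in Minutes"].isna().sum()

print(cleaned)
```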


In [8]:
df.head()

Unnamed: 0,satisfaction,Customer Type,Age,Type of Travel,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,...,Online support,Ease of Online booking,On-board service,Leg room service,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes
0,satisfied,Loyal Customer,65,Personal Travel,Eco,265,0,0,0,2,...,2,3,3,0,3,5,3,2,0,0.0
1,satisfied,Loyal Customer,47,Personal Travel,Business,2464,0,0,0,3,...,2,3,4,4,4,2,3,2,310,305.0
2,satisfied,Loyal Customer,15,Personal Travel,Eco,2138,0,0,0,3,...,2,2,3,3,4,4,4,2,0,0.0
3,satisfied,Loyal Customer,60,Personal Travel,Eco,623,0,0,0,3,...,3,1,1,0,1,4,1,3,0,0.0
4,satisfied,Loyal Customer,70,Personal Travel,Eco,354,0,0,0,3,...,4,2,2,0,2,4,2,5,0,0.0


#### 2.3. Encode the data

As reviewed above, four columns (`satisfaction`, `Customer Type`, `Type of Travel`, `Class`) have the pandas dtype `object`.

`ML` models need numeric columns.
- Convert the ordinal `Class` column to numeric with `map()`.
- Convert the binary `satisfaction` target to numeric with `map()`.
- Convert nominal categorical columns such as `Customer Type` and `Type of Travel` to numeric with `get_dummies()`.

In [9]:
df['Class'] = df['Class'].map({'Business':3, 'Eco Plus':2, 'Eco':1})
df['satisfaction'] = df['satisfaction'].map({'satisfied':1, 'dissatisfied':0})
df = pd.get_dummies(df, columns=['Customer Type', 'Type of Travel'])
df.head()

Unnamed: 0,satisfaction,Age,Class,Flight Distance,Seat comfort,Departure/Arrival time convenient,Food and drink,Gate location,Inflight wifi service,Inflight entertainment,...,Baggage handling,Checkin service,Cleanliness,Online boarding,Departure Delay in Minutes,Arrival Delay in Minutes,Customer Type_Loyal Customer,Customer Type_disloyal Customer,Type of Travel_Business travel,Type of Travel_Personal Travel
0,1,65,1,265,0,0,0,2,2,4,...,3,5,3,2,0,0.0,1,0,0,1
1,1,47,3,2464,0,0,0,3,0,2,...,4,2,3,2,310,305.0,1,0,0,1
2,1,15,1,2138,0,0,0,3,2,0,...,4,4,4,2,0,0.0,1,0,0,1
3,1,60,1,623,0,0,0,3,3,4,...,1,4,1,3,0,0.0,1,0,0,1
4,1,70,1,354,0,0,0,3,4,3,...,2,4,2,5,0,0.0,1,0,0,1


In [10]:
df.dtypes

satisfaction                           int64
Age                                    int64
Class                                  int64
Flight Distance                        int64
Seat comfort                           int64
Departure/Arrival time convenient      int64
Food and drink                         int64
Gate location                          int64
Inflight wifi service                  int64
Inflight entertainment                 int64
Online support                         int64
Ease of Online booking                 int64
On-board service                       int64
Leg room service                       int64
Baggage handling                       int64
Checkin service                        int64
Cleanliness                            int64
Online boarding                        int64
Departure Delay in Minutes             int64
Arrival Delay in Minutes             float64
Customer Type_Loyal Customer           uint8
Customer Type_disloyal Customer        uint8
Type of Travel_Business travel         uint8
Type of Travel_Personal Travel         uint8
dtype: object
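
Two details of the encoding step are worth checking. First, `map()` turns any label that is missing from the dictionary into `NaN`. Second, the dummy-column dtype depends on the pandas version: the `uint8` columns above come from an older release, while pandas 2.0 and later return `bool` unless `dtype` is passed. A minimal sketch, assuming a hypothetical toy frame `demo`; `drop_first=True` is an optional variation, not what the cell above does:

```python
import pandas as pd

# Hypothetical toy frame using labels from the airline data.
demo = pd.DataFrame({
    "Class": ["Eco", "Business", "Eco Plus"],
    "Type of Travel": ["Personal Travel", "Business travel", "Personal Travel"],
})

# Ordinal encoding: a higher number means a higher travel class.
class_order = {"Eco": 1, "Eco Plus": 2, "Business": 3}
demo["Class"] = demo["Class"].map(class_order)

# map() yields NaN for labels absent from the dictionary, so verify none slipped through.
assert demo["Class"].notna().all(), "unmapped Class label found"

# dtype=int requests integer 0/1 columns regardless of pandas version;
# drop_first=True keeps a single column for a two-level feature.
encoded = pd.get_dummies(demo, columns=["Type of Travel"], drop_first=True, dtype=int)

print(encoded)
```

Keeping both dummy columns, as the notebook does, is harmless for decision trees; dropping one only removes redundant information.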

#### 2.4. Save the cleaned data for the next modeling parts (Preparation)

Save the prepared `DataFrame` to a `CSV` file.

In [11]:
df.to_csv('Galactico_Airline_prepared.csv')
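
By default, `to_csv` also writes the row index as an unnamed first column, which comes back as `Unnamed: 0` when the file is reloaded with a plain `read_csv`. A minimal sketch of the difference, assuming a hypothetical toy frame `demo` and in-memory buffers instead of real files:

```python
import io

import pandas as pd

# Hypothetical toy frame standing in for the prepared data.
demo = pd.DataFrame({"satisfaction": [1, 0], "Age": [65, 47]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
demo.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False: only the data columns are written.
without_index = io.StringIO()
demo.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())  # ['Unnamed: 0', 'satisfaction', 'Age']
print(reloaded.columns.tolist())             # ['satisfaction', 'Age']
```

If the file is saved with the index, as in the cell above, the later parts can read it back with `pd.read_csv(..., index_col=0)` to avoid the extra column.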

#### References

The Nuts and Bolts of Machine Learning: Build a decision tree model (Coursera)
https://www.coursera.org/learn/the-nuts-and-bolts-of-machine-learning/ungradedLab/KPS5v/exemplar-build-a-decision-tree