## Session 13 - Machine Learning for Business and Analytics
### Materials - Dataset: Titanic

**Content Materials**

1. EDA, Data Preprocessing and Feature Engineering
2. Interpretable Machine Learning
    - Regression-based Interpretation
    - Decision Tree Plot
    - Decision Boundary of SVC and k-Means
    - Dendrogram of Hierarchical Clustering
3. Explainable Machine Learning
    - Partial Dependence
    - Feature Importance
    - SHapley Additive exPlanations (SHAP)
    - Local Interpretable Model-agnostic Explanations (LIME)

# 1. Library

In [1]:
import pandas as pd
import numpy as np
import string

import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="darkgrid")

from sklearn.preprocessing import LabelEncoder

import warnings
warnings.filterwarnings('ignore')

# 2. Data

In [2]:
df = pd.read_csv(r'C:\Users\dheof\Documents\Csv_Files\data_titanic.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


# 3. EDA

In [3]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

## 3.1 AGE

In [4]:
df['Age'].nunique()

88

In [5]:
print(f"Unique Value of feature Age is: {df['Age'].nunique()}")
df["Age"].describe()

Unique Value of feature Age is: 88


count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [6]:
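# Fill missing ages with the median age of each (Sex, Pclass) group
# transform() keeps the original row index, so the result can be assigned back directly
df["Age"] = df.groupby(["Sex", "Pclass"])["Age"].transform(lambda x: x.fillna(x.median()))
# Bin Age into 10 quantile-based groups
df["Age"] = pd.qcut(df["Age"], 10)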
df["Age"].head()

0    (20.0, 22.0]
1    (34.0, 40.0]
2    (25.0, 26.0]
3    (34.0, 40.0]
4    (34.0, 40.0]
Name: Age, dtype: category
Categories (10, interval[float64]): [(0.419, 16.0] < (16.0, 20.0] < (20.0, 22.0] < (22.0, 25.0] ... (30.0, 34.0] < (34.0, 40.0] < (40.0, 47.0] < (47.0, 80.0]]

In [7]:
# Number of passengers in each age group
df.groupby(["Age"])["PassengerId"].count()

Age
(0.419, 16.0]    100
(16.0, 20.0]      79
(20.0, 22.0]      94
(22.0, 25.0]     164
(25.0, 26.0]      18
(26.0, 30.0]     101
(30.0, 34.0]      69
(34.0, 40.0]     116
(40.0, 47.0]      61
(47.0, 80.0]      89
Name: PassengerId, dtype: int64

## 3.2 EMBARKED

In [8]:
# Let's look at the value distribution
df["Embarked"].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [9]:
# Fill the missing values in Embarked with "S", its mode (the most frequent value)
df["Embarked"] = df["Embarked"].fillna("S")
df["Embarked"].head()

0    S
1    C
2    S
3    S
4    S
Name: Embarked, dtype: object
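
As a small variation on the cell above, the mode can be computed instead of hard-coded. This is a sketch on a toy Series with made-up values; `most_common` and `filled` are names introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy port-of-embarkation values with two gaps (made-up data)
embarked = pd.Series(["S", "C", np.nan, "S", "Q", np.nan])

# mode() ignores NaN by default and returns a Series; take its first entry
most_common = embarked.mode()[0]
filled = embarked.fillna(most_common)

print(most_common)
print(filled.tolist())
```

Computing the mode keeps the imputation correct if the underlying data changes, at the cost of being slightly less explicit than writing `"S"`.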

## 3.3 FARE

In [10]:
print(f"Unique Value of feature Age is: {df['Age'].nunique()}")
df["Fare"].describe()

Unique Value of feature Age is: 10


count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

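`Fare` is heavily right-skewed: its standard deviation (49.69) exceeds its mean (32.20), and its maximum (512.33) lies far above the 75th percentile (31.00). Note that the print statement above reports the unique count of the already-binned `Age` (10), not of `Fare`. As with `Age`, it is better to transform this feature into categorical data, so we bin `Fare` into 10 quantile-based groups.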

In [11]:
df["Fare"] = pd.qcut(df["Fare"], 10)
df["Fare"].head()

0      (-0.001, 7.55]
1    (39.688, 77.958]
2       (7.854, 8.05]
3    (39.688, 77.958]
4       (7.854, 8.05]
Name: Fare, dtype: category
Categories (10, interval[float64]): [(-0.001, 7.55] < (7.55, 7.854] < (7.854, 8.05] < (8.05, 10.5] ... (21.679, 27.0] < (27.0, 39.688] < (39.688, 77.958] < (77.958, 512.329]]
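
To see why quantile-based binning suits a skewed feature such as `Fare`, the sketch below compares `pd.qcut` with equal-width `pd.cut` on synthetic right-skewed data. The log-normal sample is an assumption made for illustration; it is not the Titanic fare column.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (assumed log-normal, for illustration only)
rng = np.random.default_rng(0)
fares = pd.Series(rng.lognormal(mean=2.5, sigma=1.0, size=1000))

# qcut places bin edges at quantiles, so the bins hold similar numbers of rows
quantile_counts = pd.qcut(fares, 10).value_counts().sort_index()

# cut uses equal-width bins, so most rows pile up in the lowest bin
width_counts = pd.cut(fares, 10).value_counts().sort_index()

print(quantile_counts)
print(width_counts)
```

With equal-width bins a few extreme values stretch the range and leave most bins nearly empty, whereas quantile bins stay balanced.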

## 3.4 FAMILY SIZE

- `SibSp`: Number of Siblings/Spouses Aboard
- `Parch`: Number of Parents/Children Aboard

We create a new feature, `Family_Size`, as the sum of `SibSp` and `Parch` plus one for the passenger. We then derive `Family_Size_Grouped` from the family size and drop the numerical feature afterwards.

In [13]:
df["Family_Size"] = df["SibSp"] + df["Parch"] + 1
family_map = {1: "Alone", 2: "Small", 3: "Small", 4: "Small", 5: "Medium", 6: "Medium", 7: "Large", 8: "Large", 11: "Large"}
df["Family_Size_Grouped"] = df["Family_Size"].map(family_map)
# The numerical feature is no longer needed once the grouped version exists
df.drop(["Family_Size"], inplace=True, axis=1)
df["Family_Size_Grouped"].head()

0    Small
1    Small
2    Alone
3    Small
4    Alone
Name: Family_Size_Grouped, dtype: object
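
The dictionary-based grouping above yields `NaN` for any family size missing from the map. One alternative, sketched here under the assumption that the group boundaries are 1, 2 to 4, 5 to 6, and 7 or more (as the map implies), is `pd.cut` with explicit bin edges. The size 9 in the toy data is hypothetical.

```python
import pandas as pd

# Toy family sizes; 9 is a hypothetical value that the dictionary map does not cover
family_size = pd.Series([1, 2, 4, 5, 6, 7, 9, 11])

# Right-closed bins: (0, 1] Alone, (1, 4] Small, (4, 6] Medium, (6, inf) Large
grouped = pd.cut(
    family_size,
    bins=[0, 1, 4, 6, float("inf")],
    labels=["Alone", "Small", "Medium", "Large"],
)
print(grouped.tolist())
```

Note that `pd.cut` returns an ordered categorical rather than plain strings, which may or may not be desirable before one-hot encoding.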

## 3.5 CABIN

`Cabin` has the most missing values in the dataset. Rather than discarding it, we create a new feature named `Deck`, which takes the first letter of the `Cabin` value as the deck name; for example, `C85` gives deck `C`. Missing `Cabin` values are replaced by `M`, which stands for missing.

In [15]:
# Take the first letter of Cabin as the deck; missing cabins become deck "M"
df["Deck"] = df["Cabin"].apply(lambda s: s[0] if pd.notnull(s) else "M")

# Reassign deck "T" to deck "A"
idx = df[df["Deck"] == "T"].index
df.loc[idx, "Deck"] = "A"

# Merge decks A-G into three groups
df["Deck"] = df["Deck"].replace(["A", "B", "C"], "ABC")
df["Deck"] = df["Deck"].replace(["D", "E"], "DE")
df["Deck"] = df["Deck"].replace(["F", "G"], "FG")

# Remove Cabin since it is no longer used
df.drop(["Cabin"], inplace=True, axis=1)

df["Deck"].head()

0      M
1    ABC
2      M
3    ABC
4      M
Name: Deck, dtype: object

In [20]:
df['Deck'].describe()

count     891
unique      4
top         M
freq      687
Name: Deck, dtype: object
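
The same deck extraction and grouping can be written with vectorized string access and a single lookup table. This is a sketch on toy cabin values; the `deck_groups` dictionary is introduced here and mirrors the replacements above, including sending `T` to the `ABC` group.

```python
import numpy as np
import pandas as pd

# Toy cabin values, including one missing entry
cabin = pd.Series(["C85", np.nan, "E46", "T", "G6", "F G73"])

# Hypothetical lookup table mirroring the grouping above (T is treated as A)
deck_groups = {
    "A": "ABC", "B": "ABC", "C": "ABC", "T": "ABC",
    "D": "DE", "E": "DE",
    "F": "FG", "G": "FG",
    "M": "M",
}

# .str[0] takes the first letter and leaves NaN as NaN, which then becomes "M"
deck = cabin.str[0].fillna("M").map(deck_groups)
print(deck.tolist())
```

Keeping the grouping in one dictionary makes it easier to audit than a chain of `replace` calls, although any letter absent from the dictionary would map to `NaN`.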

## 3.6 TICKET AND PCLASS

In [21]:
df['Ticket'].value_counts()

347082          7
1601            7
CA. 2343        7
3101295         6
347088          6
               ..
112053          1
4134            1
2003            1
F.C.C. 13528    1
349256          1
Name: Ticket, Length: 681, dtype: int64

The `Ticket` feature has many unique values (681), so we replace it with `Ticket_Frequency`, the number of rows that share each ticket value.

In [22]:
df["Ticket_Frequency"] = df.groupby("Ticket")["Ticket"].transform("count")
df.drop(["Ticket"], inplace=True, axis=1)

# Cast Pclass to string so that it is treated as categorical and one-hot encoded later
df["Pclass"] = df["Pclass"].astype("str")

df["Ticket_Frequency"].head()

0    1
1    1
2    1
3    2
4    1
Name: Ticket_Frequency, dtype: int64
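
The frequency encoding above relies on `transform("count")` broadcasting each group's size back to every row. A minimal sketch on toy ticket labels (made up for illustration) shows the effect, together with an equivalent formulation using `value_counts`.

```python
import pandas as pd

# Toy ticket labels (made-up values)
toy = pd.DataFrame({"Ticket": ["A1", "B2", "A1", "C3", "A1", "B2"]})

# Each row receives the number of rows sharing its ticket
toy["Ticket_Frequency"] = toy.groupby("Ticket")["Ticket"].transform("count")

# Equivalent result: look up each ticket in the table of counts
via_value_counts = toy["Ticket"].map(toy["Ticket"].value_counts())

print(toy)
```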

## 3.7 NAME

We extract each passenger's title from the name and use it to infer marital status, creating `Title` and `Is_Married` as new features.

In [25]:
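# The title sits between the comma and the first period, e.g. "Braund, Mr. Owen Harris" -> "Mr"
df["Title"] = df["Name"].str.split(", ", expand=True)[1].str.split(".", expand=True)[0]
df["Is_Married"] = "no"
# Assign through a single .loc call to avoid chained assignment
df.loc[df["Title"] == "Mrs", "Is_Married"] = "yes"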

df.drop(["Name"], inplace=True, axis=1)

df[["Title","Is_Married"]].head()

Unnamed: 0,Title,Is_Married
0,Mr,no
1,Mrs,yes
2,Miss,no
3,Mrs,yes
4,Mr,no
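
As an alternative to the double `split`, the title can be pulled out with one regular expression. The sketch below uses three names from the preview at the top of the notebook; the pattern itself is an assumption about the name format (surname, comma, title, period).

```python
import numpy as np
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Th...",
    "Heikkinen, Miss. Laina",
])

# Capture everything between the comma and the first period that follows it
title = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Same rule as above: only the title "Mrs" is treated as married
is_married = pd.Series(np.where(title == "Mrs", "yes", "no"), index=names.index)

print(title.tolist())
print(is_married.tolist())
```

A name that does not follow the assumed format would yield `NaN` for the title, which makes malformed entries easy to spot.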


## 3.8 CLEANUP

In [26]:
# PassengerId is no longer needed, so we drop it
df.drop(["PassengerId"], inplace=True, axis=1)

In [27]:
df.head(1)

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Family_Size_Grouped,Deck,Ticket_Frequency,Title,Is_Married
0,0,3,male,"(20.0, 22.0]",1,0,"(-0.001, 7.55]",S,Small,M,1,Mr,no


One-hot encoding a high-cardinality feature produces many dummy variables and slows down training, so we would like to encode such features as ordinal data instead.

In [28]:
for i in df.columns:
    total_unique_values = len(df[i].unique())
    print(f"Unique value of {i} is {total_unique_values}")

Unique value of Survived is 2
Unique value of Pclass is 3
Unique value of Sex is 2
Unique value of Age is 10
Unique value of SibSp is 7
Unique value of Parch is 7
Unique value of Fare is 10
Unique value of Embarked is 3
Unique value of Family_Size_Grouped is 4
Unique value of Deck is 4
Unique value of Ticket_Frequency is 7
Unique value of Title is 17
Unique value of Is_Married is 2


`Age`, `Fare`, and `Title` have the most unique values (10, 10, and 17). `Age` and `Fare` are binned numerical features whose bins have a natural order (bin 0 < bin 1 < ...), so we encode them as ordinal integers. `Title` has no such order, so we leave it to be one-hot encoded along with the other categorical features.

In [29]:
high_cardinality_data = ["Fare","Age"]

for feature in high_cardinality_data:        
    df[feature] = LabelEncoder().fit_transform(df[feature])

df_encoded = pd.get_dummies(df)
df_encoded.head()

Unnamed: 0,Survived,Age,SibSp,Parch,Fare,Ticket_Frequency,Pclass_1,Pclass_2,Pclass_3,Sex_female,...,Title_Mlle,Title_Mme,Title_Mr,Title_Mrs,Title_Ms,Title_Rev,Title_Sir,Title_the Countess,Is_Married_no,Is_Married_yes
0,0,2,1,0,0,1,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
1,1,7,1,0,8,1,1,0,0,1,...,0,0,0,1,0,0,0,0,0,1
2,1,4,0,0,2,1,0,0,1,1,...,0,0,0,0,0,0,0,0,1,0
3,1,7,1,0,8,2,1,0,0,1,...,0,0,0,1,0,0,0,0,0,1
4,0,7,0,0,2,1,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
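
Because `pd.qcut` returns an ordered categorical, the ordinal codes can also be read directly from `.cat.codes`, which follows the bin order by construction. This is a sketch on toy values (made up for illustration), offered as an alternative to `LabelEncoder` rather than as the notebook's method.

```python
import pandas as pd

# Toy numeric values (made-up data)
values = pd.Series([5.0, 80.0, 12.0, 300.0, 7.5, 40.0])

# Three quantile bins; the categories are ordered from lowest to highest interval
binned = pd.qcut(values, 3)

# Integer code of each row's bin: 0 for the lowest bin, 2 for the highest
codes = binned.cat.codes
print(codes.tolist())
```

Here a larger code always corresponds to a higher bin, which is the property an ordinal encoding needs.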


# 4. MODEL

Now let's move on to the main dish. We do not focus much on hyperparameter tuning or model selection, since our objective is interpretable and explainable AI, so we follow the usual stages of building a machine learning model.

## 4.1 Set X and Y features

In [30]:
# Separate the X features from the y target
x = df_encoded.drop(["Survived"],axis=1)
y = df_encoded["Survived"].values
# Store the column names
feature_names = x.columns
target_names = "Survived"

## 4.2 PCA to Transform Features into 2D Vectors

In [32]:
from sklearn.decomposition import PCA

# Use PCA to project all X features onto 2D vectors
# The 2D projection will be used for plotting 2D contours
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
principalDf = pd.DataFrame(data = principalComponents, columns = ["PC1","PC2"])
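
Before plotting on two principal components, it can help to check how much variance the projection retains. The sketch below does this on synthetic data (a random matrix with one correlated column, assumed purely for illustration) using the fitted model's `explained_variance_ratio_` attribute.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic feature matrix: 200 rows, 5 columns, with column 1 tied to column 0
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
X_toy[:, 1] = 2.0 * X_toy[:, 0] + 0.1 * rng.normal(size=200)

pca_toy = PCA(n_components=2)
pcs = pca_toy.fit_transform(X_toy)

# Share of total variance captured by each component, largest first
ratio = pca_toy.explained_variance_ratio_
print(ratio, ratio.sum())
```

If the two components capture only a small share of the variance, a 2D contour plot should be read as a rough picture rather than a faithful view of the full feature space.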

## 4.3 Train Test Split

In [33]:
from sklearn.model_selection import train_test_split

# 70:30 train/test split
# Original (non-PCA) dataset
x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=0.3,
    random_state=123)

# Principal Component Dataset
x_train_pc, x_test_pc, y_train_pc, y_test_pc = train_test_split(
    principalDf,
    y,
    test_size=0.3,
    random_state=123)
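
The two calls above only stay comparable if both splits select the same rows. With equal lengths and the same `random_state` that is the case, which the sketch below checks on toy frames (made-up values; `features` and `projected` are hypothetical stand-ins for `x` and `principalDf`).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-ins for the original features and their 2D projection
features = pd.DataFrame({"a": range(20), "b": range(20, 40)})
projected = pd.DataFrame({"PC1": np.arange(20) * 0.1, "PC2": np.arange(20) * -0.1})
labels = np.array([0, 1] * 10)

f_train, f_test, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=123)
p_train, p_test, y_tr_pc, y_te_pc = train_test_split(
    projected, labels, test_size=0.3, random_state=123)

# The same seed and the same number of rows give the same shuffled row order
same_rows = f_train.index.equals(p_train.index) and f_test.index.equals(p_test.index)
print(same_rows)
```

Passing both frames to a single `train_test_split` call would give the same guarantee without relying on matching seeds.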
