# **Machine Learning Week 8 - Cases**
---
> Introduction to Machine Learning <br>
> Sekolah Data, Pacmann

# GOAL

- We want to cluster the bank client data
- The aim is to **understand the characteristics of the bank's clients** with respect to the **campaign** that was run

---
# Dataset Information

- The data relate to direct marketing campaigns of a Portuguese banking institution. 
- The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed or not. 
- The dataset `bank.csv` is ordered by date (from May 2008 to November 2010). 
- The **exercise goal** is to discover interesting things about the data.

**Variables**

<u>Numeric</u>
- `age`
- `balance`: average yearly balance, in euros
- `duration`: last contact duration, in seconds
- `campaign`: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- `previous`: number of contacts performed before this campaign and for this client

<u>Categoric</u>
- `job` : type of job (categorical) 
- `marital` : marital status (categorical)
- `education` (categorical)
- `default`: has credit in default? (binary: "yes","no")
- `housing`: has housing loan? (binary: "yes","no")
- `loan`: has personal loan? (binary: "yes","no")
- `contact`: contact communication type (categorical) 
- `day`: last contact day of the month 
- `month`: last contact month of year (categorical)
- `poutcome`: outcome of the previous marketing campaign (categorical)

Source :  S. Moro, R. Laureano and P. Cortez. Using Data Mining for Bank Direct Marketing: An Application of the CRISP-DM Methodology. <br>
  In P. Novais et al. (Eds.), Proceedings of the European Simulation and Modelling Conference - ESM'2011, pp. 117-121, Guimarães, Portugal, October, 2011. EUROSIS.

---
# Import Data

In [1]:
import numpy as np
import pandas as pd

In [2]:
def importData(path, col_to_drop):
    # Read data
    data = pd.read_csv(path)
    print(f"Initial data                   : {data.shape}, (#observations, #features)")

    # Drop columns
    data = data.drop(columns = col_to_drop)
    print(f"Data after dropping columns    : {data.shape}, (#observations, #features)")

    # Drop duplicates
    print(f"Found {data.duplicated().sum()} duplicate rows")
    data = data.drop_duplicates()
    print(f"Data after dropping duplicates : {data.shape}, (#observations, #features)")

    return data


In [3]:
filepath = "dataset/w8-2-bank.csv"
col_to_drop = "Unnamed: 0"

data = importData(path = filepath,
                  col_to_drop = col_to_drop)

Initial data                   : (45211, 17), (#observations, #features)
Data after dropping columns    : (45211, 16), (#observations, #features)
Found 0 duplicate rows
Data after dropping duplicates : (45211, 16), (#observations, #features)


In [4]:
data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown


---
# Data Preprocessing

## Train-Test Split

- We do not separate input and output, because we will analyze the structure of the data

In [5]:
from sklearn.model_selection import train_test_split

In [6]:
data_train, data_test = train_test_split(data,
                                         test_size = 0.25,
                                         random_state = 123)

In [7]:
print(data_train.shape)
print(data_test.shape)

(33908, 16)
(11303, 16)


In [8]:
data_train.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
26999,32,unemployed,single,secondary,no,2706,no,no,cellular,21,nov,462,3,-1,0,unknown
16168,37,admin.,married,secondary,no,1396,yes,no,cellular,22,jul,199,2,-1,0,unknown
12338,22,blue-collar,married,secondary,no,-295,yes,no,unknown,26,jun,150,2,-1,0,unknown
6074,36,blue-collar,married,secondary,no,-870,yes,no,unknown,26,may,102,2,-1,0,unknown
7385,50,admin.,married,primary,no,429,no,no,unknown,29,may,60,2,-1,0,unknown


In [9]:
data_test.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
7281,56,technician,married,secondary,no,589,yes,no,unknown,29,may,535,2,-1,0,unknown
19469,37,management,married,tertiary,no,649,no,no,cellular,7,aug,64,2,-1,0,unknown
31637,27,unemployed,single,secondary,no,1972,no,no,cellular,6,apr,97,1,-1,0,unknown
22484,43,management,married,tertiary,no,1,no,no,cellular,22,aug,239,4,-1,0,unknown
35919,58,retired,divorced,secondary,no,-808,yes,no,cellular,8,may,75,4,-1,0,unknown


## Numerical & Categorical Split

- Check the number of unique values in each column

In [10]:
for col in data_train.columns:
    print(f"col: {col}, #unique: {len(data_train[col].unique())}")

col: age, #unique: 76
col: job, #unique: 12
col: marital, #unique: 3
col: education, #unique: 4
col: default, #unique: 2
col: balance, #unique: 6464
col: housing, #unique: 2
col: loan, #unique: 2
col: contact, #unique: 3
col: day, #unique: 31
col: month, #unique: 12
col: duration, #unique: 1481
col: campaign, #unique: 45
col: pdays, #unique: 521
col: previous, #unique: 37
col: poutcome, #unique: 4


- We treat `day` and `month` as numeric in this exercise, because they have a relatively large number of unique values.

In [11]:
num_col = ["age", "day", "month", "balance", 
           "duration", "campaign", "pdays", "previous"]
cat_col = list(set(data_train.columns) - set(num_col))

print(num_col)
print(cat_col)

['age', 'day', 'month', 'balance', 'duration', 'campaign', 'pdays', 'previous']
['education', 'loan', 'contact', 'marital', 'poutcome', 'default', 'job', 'housing']


In [12]:
def splitNumCat(data, num_col, cat_col):
    data_num = data[num_col]
    data_cat = data[cat_col]

    return data_num, data_cat


In [13]:
data_train_num, data_train_cat = splitNumCat(data = data_train,
                                             num_col = num_col,
                                             cat_col = cat_col)

In [14]:
data_train_num.head()

Unnamed: 0,age,day,month,balance,duration,campaign,pdays,previous
26999,32,21,nov,2706,462,3,-1,0
16168,37,22,jul,1396,199,2,-1,0
12338,22,26,jun,-295,150,2,-1,0
6074,36,26,may,-870,102,2,-1,0
7385,50,29,may,429,60,2,-1,0


- The `month` feature needs to be transformed into numbers

In [15]:
data_train_cat.head()

Unnamed: 0,education,loan,contact,marital,poutcome,default,job,housing
26999,secondary,no,cellular,single,unknown,no,unemployed,no
16168,secondary,no,cellular,married,unknown,no,admin.,yes
12338,secondary,no,unknown,married,unknown,no,blue-collar,yes
6074,secondary,no,unknown,married,unknown,no,blue-collar,yes
7385,primary,no,unknown,married,unknown,no,admin.,no


## Handling Data - Impute & Standardize

**Transform** - fitur `month`

In [16]:
def transformMonth(data):
    # Map month abbreviations to month numbers (jan -> 1, ..., dec -> 12)
    month_list = ["jan", "feb", "mar", "apr", "may", "jun",
                  "jul", "aug", "sep", "oct", "nov", "dec"]
    month_map = {month: i + 1 for i, month in enumerate(month_list)}

    # Work on a copy so the caller's slice is not modified in place,
    # which avoids pandas' SettingWithCopyWarning
    data = data.copy()
    data["month"] = data["month"].map(month_map)

    return data


In [17]:
data_train_num = transformMonth(data = data_train_num)


In [18]:
data_train_num.head()

Unnamed: 0,age,day,month,balance,duration,campaign,pdays,previous
26999,32,21,11,2706,462,3,-1,0
16168,37,22,7,1396,199,2,-1,0
12338,22,26,6,-295,150,2,-1,0
6074,36,26,5,-870,102,2,-1,0
7385,50,29,5,429,60,2,-1,0


**Missing Values** - Numerical

In [19]:
# Check for missing values
data_train_num.isna().any()

age         False
day         False
month       False
balance     False
duration    False
campaign    False
pdays       False
previous    False
dtype: bool

In [20]:
# Create an imputer, in case the test data needs one
from sklearn.impute import SimpleImputer

def imputerNumeric(data, imputer = None):
    if imputer is None:
        # Create and fit the imputer
        imputer = SimpleImputer(missing_values = np.nan,
                                strategy = "median")
        imputer.fit(data)

    # Transform data
    data_imputed = imputer.transform(data)
    data_imputed = pd.DataFrame(data = data_imputed,
                                columns = data.columns,
                                index = data.index)
    
    return data_imputed, imputer


In [21]:
data_train_num_imputed, num_imputer = imputerNumeric(data = data_train_num)

In [22]:
data_train_num_imputed.head()

Unnamed: 0,age,day,month,balance,duration,campaign,pdays,previous
26999,32.0,21.0,11.0,2706.0,462.0,3.0,-1.0,0.0
16168,37.0,22.0,7.0,1396.0,199.0,2.0,-1.0,0.0
12338,22.0,26.0,6.0,-295.0,150.0,2.0,-1.0,0.0
6074,36.0,26.0,5.0,-870.0,102.0,2.0,-1.0,0.0
7385,50.0,29.0,5.0,429.0,60.0,2.0,-1.0,0.0
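
The training data has no missing values, so the imputer changes nothing here. To see why it is still fit on the training data and kept for later, here is a minimal sketch on made-up data (the toy `train` and `test` frames are assumptions for illustration, not part of the bank dataset). The median learned from the training set fills the gap in the test set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): the training column has a median of 30
train = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
test = pd.DataFrame({"age": [25.0, np.nan]})

# Fit on the training data only
imputer = SimpleImputer(missing_values=np.nan, strategy="median")
imputer.fit(train)

# The missing test value is replaced by the training median
test_imputed = pd.DataFrame(imputer.transform(test),
                            columns=test.columns,
                            index=test.index)
print(test_imputed)
```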


**Standardizing** - Numerical

In [23]:
from sklearn.preprocessing import StandardScaler

# Fit the scaler
def fitStandardize(data):
    scaler = StandardScaler()
    scaler.fit(data)

    return scaler

# Transform data with the fitted scaler
def transformStandardize(data, scaler):
    data_scaled = scaler.transform(data)
    data_scaled = pd.DataFrame(data = data_scaled,
                               columns = data.columns,
                               index = data.index)
    
    return data_scaled


In [24]:
# Fit the scaler on the training data
num_scaler = fitStandardize(data = data_train_num_imputed)

# Transform data
data_train_num_clean = transformStandardize(data = data_train_num_imputed,
                                            scaler = num_scaler)

In [25]:
data_train_num_clean.head()

Unnamed: 0,age,day,month,balance,duration,campaign,pdays,previous
26999,-0.844906,0.626379,2.024176,0.440092,0.779886,0.077638,-0.411528,-0.242719
16168,-0.37353,0.746689,0.358937,0.006577,-0.231883,-0.246209,-0.411528,-0.242719
12338,-1.787657,1.227931,-0.057373,-0.553021,-0.420387,-0.246209,-0.411528,-0.242719
6074,-0.467806,1.227931,-0.473683,-0.743304,-0.605045,-0.246209,-0.411528,-0.242719
7385,0.852046,1.588862,-0.473683,-0.31343,-0.76662,-0.246209,-0.411528,-0.242719
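
As a rough check of what standardization does, the sketch below compares `StandardScaler` with a manual z-score on a small made-up array (the array `X` is an assumption for illustration). Each column ends up with mean 0 and unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

# Manual z-score; StandardScaler uses the population std (ddof = 0)
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
```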


## Handling Data 2 - PCA (Dimensionality Reduction)

**Goal**: represent data in fewer dimensions

In [26]:
# Import package PCA - Sklearn
from sklearn.decomposition import PCA

In [27]:
# Define PCA with random state
pca_obj = PCA(random_state = 123)

In [28]:
# Fit to data_train_num_clean
pca_obj.fit(data_train_num_clean)

*get the principal components*

In [29]:
# Show PCA Component
pca_component = pca_obj.components_

# Turn to dataframe
pca_component = pd.DataFrame(data = pca_component,
                             columns = data_train_num_clean.columns)
pca_component

Unnamed: 0,age,day,month,balance,duration,campaign,pdays,previous
0,-0.069772,-0.300405,-0.271369,-0.03182,0.059091,-0.266862,0.639239,0.588984
1,0.442293,0.297602,0.485049,0.480265,-0.128529,0.257602,0.220141,0.3443
2,-0.36048,0.410899,-0.142899,-0.39058,-0.419306,0.538353,0.159648,0.19265
3,-0.374714,0.419159,0.148437,0.015518,0.801606,0.045387,0.056493,0.117457
4,0.690436,0.027622,-0.11762,-0.601584,0.321183,0.200227,0.007981,0.059021
5,-0.138927,-0.066414,0.746403,-0.503089,-0.152494,-0.366869,0.026452,0.086753
6,0.180314,0.686298,-0.260422,-0.022065,-0.185509,-0.624547,0.033671,-0.050689
7,0.025212,0.021885,0.099308,-0.007061,0.027794,0.076148,0.715776,-0.685614


*get the explained variance*

In [30]:
# Explained variance
pca_obj.explained_variance_

array([1.53187704, 1.19523565, 1.15590508, 0.98118651, 0.89585695,
       0.87547251, 0.81436845, 0.55033375])

In [31]:
# Explained variance ratio
pca_obj.explained_variance_ratio_

array([0.19147898, 0.14940005, 0.14448387, 0.1226447 , 0.11197882,
       0.10943084, 0.10179305, 0.06878969])

As we can see,
- PC 1 is the first row of the `pca_component` dataframe
- PC 1 explains about 19.1% of the variation in the data
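
The two outputs above are linked: each explained variance ratio is that component's variance divided by the total. A small sketch using the printed values (copied by hand from the output, so only as precise as the printout):

```python
import numpy as np

# Explained variance per component, copied from the output above
explained_variance = np.array([1.53187704, 1.19523565, 1.15590508, 0.98118651,
                               0.89585695, 0.87547251, 0.81436845, 0.55033375])

# Each ratio is the component's variance divided by the total variance
ratio = explained_variance / explained_variance.sum()
print(np.round(ratio * 100, 2))
```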

*transform the data with the principal components*

In [32]:
# Transform data
data_train_num_pca = pca_obj.transform(data_train_num_clean)

# Convert the result to a dataframe
col_names = [f"PC_{i+1}" for i in range(data_train_num_pca.shape[1])]
data_train_num_pca = pd.DataFrame(data = data_train_num_pca,
                                  columns = col_names,
                                  index = data_train_num_clean.index)

data_train_num_pca.head()

Unnamed: 0,PC_1,PC_2,PC_3,PC_4,PC_5,PC_6,PC_7,PC_8
26999,-1.073176,0.7515,-0.296867,1.463372,-0.820467,1.185871,-0.454034,0.089753
16168,-0.649881,0.026484,0.239826,0.25752,-0.424834,0.36065,0.54671,-0.110821
12338,-0.57614,-0.902214,1.304431,0.767544,-1.062838,0.524688,0.777732,-0.178573
6074,-0.560112,-0.588036,1.039891,0.060206,-0.047435,0.154479,1.16259,-0.190428
7385,-0.783853,0.330361,0.612266,-0.405923,0.563307,-0.244481,1.668772,-0.15678


*How many principal components?*

- Choose the number that retains a given percentage of the variance in the data

In [33]:
# If all components are used, the explained variance is
sum(pca_obj.explained_variance_ratio_)

np.float64(0.9999999999999999)

In [34]:
# If n components are chosen, the explained variance is
for i in range(1, len(pca_obj.explained_variance_ratio_) + 1):
    sum_of_variance_n = sum(pca_obj.explained_variance_ratio_[:i]) * 100
    print(f"n_component: {i}, %variance explained: {sum_of_variance_n:.2f} %")

n_component: 1, %variance explained: 19.15 %
n_component: 2, %variance explained: 34.09 %
n_component: 3, %variance explained: 48.54 %
n_component: 4, %variance explained: 60.80 %
n_component: 5, %variance explained: 72.00 %
n_component: 6, %variance explained: 82.94 %
n_component: 7, %variance explained: 93.12 %
n_component: 8, %variance explained: 100.00 %


- If you want to retain at least 90% of the variance, choose 7 components
- The number of components can be treated as part of the experimentation
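
The 90% rule can also be applied programmatically. A minimal sketch, assuming the ratios printed earlier (copied by hand) and a hypothetical `threshold` variable:

```python
import numpy as np

# Explained variance ratios, copied from the output above
ratio = np.array([0.19147898, 0.14940005, 0.14448387, 0.1226447,
                  0.11197882, 0.10943084, 0.10179305, 0.06878969])

threshold = 0.90  # target share of variance to retain (assumed)
cumulative = np.cumsum(ratio)

# First position where the cumulative share reaches the threshold
n_components = int(np.argmax(cumulative >= threshold)) + 1
print(n_components)
```

scikit-learn's `PCA` also accepts a fraction between 0 and 1 for `n_components` (for example `0.90`) and then selects the number of components needed to reach that share of variance.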

*Create a user-defined function for PCA*

In [35]:
def fitPCA(data, n_components = None):
    # Fit a full PCA first to inspect the explained variance
    pca_obj = PCA(random_state = 123)
    pca_obj.fit(data)

    # Show the cumulative explained variance
    print("Explained variance using n_components:")
    cum_variance = np.cumsum(pca_obj.explained_variance_ratio_) * 100
    for i, sum_of_variance_n in enumerate(cum_variance, start = 1):
        print(f"n_component: {i}, %variance explained: {sum_of_variance_n:.2f} %")

    print()

    # Choose n_components; only prompt when it is not passed in,
    # because input() fails in non-interactive runs
    if n_components is None:
        n_components = int(input("n_components : "))

    # Refit PCA with the chosen number of components
    pca_obj = PCA(n_components = n_components,
                  random_state = 123)
    pca_obj.fit(data)

    # Extract the components as a dataframe
    pca_component = pd.DataFrame(data = pca_obj.components_,
                                 columns = data.columns)

    return pca_component, pca_obj


In [37]:
# Assumption: 7 components, following the 90% variance criterion above
pca_component, pca_obj = fitPCA(data = data_train_num_clean,
                                n_components = 7)

Explained variance using n_components:
n_component: 1, %variance explained: 19.15 %
n_component: 2, %variance explained: 34.09 %
n_component: 3, %variance explained: 48.54 %
n_component: 4, %variance explained: 60.80 %
n_component: 5, %variance explained: 72.00 %
n_component: 6, %variance explained: 82.94 %
n_component: 7, %variance explained: 93.12 %
n_component: 8, %variance explained: 100.00 %

In [None]:
pca_component

In [None]:
# Create the data transformation function
def transformPCA(data, pca_obj):
    # Transform data
    data_pca = pca_obj.transform(data)

    cols = [f"PC_{i+1}" for i in range(data_pca.shape[1])]
    data_pca = pd.DataFrame(data = data_pca,
                            columns = cols,
                            index = data.index)
    
    return data_pca


In [None]:
data_train_num_pca = transformPCA(data = data_train_num_clean,
                                  pca_obj = pca_obj)

In [None]:
# Check the PCA-transformed data
data_train_num_pca.head()

In [None]:
# Check the components
pca_component

*create a bi-plot*

In [None]:
import matplotlib.pyplot as plt

In [None]:
transformed_data = data_train_num_clean @ pca_component[:2].T
transformed_data.head()

In [None]:
fig, ax = plt.subplots(nrows = 1, ncols = 1, figsize = (10, 7))

# Clients whose previous campaign outcome was a success
ax.scatter(transformed_data[0][data_train_cat["poutcome"]=="success"], 
           transformed_data[1][data_train_cat["poutcome"]=="success"], 
           marker=".", 
           c="red", #s=10,
           alpha=.2,
           label = "SUCCESS")

# Clients whose previous campaign outcome was a failure
# (filter on "failure" so the points match the "FAILED" label)
ax.scatter(transformed_data[0][data_train_cat["poutcome"]=="failure"], 
           transformed_data[1][data_train_cat["poutcome"]=="failure"], 
           marker=".", 
           c="blue", #s=10,
           alpha=.2,
           label = "FAILED")

for col in pca_component.columns:
    # Loadings of this feature on PC 1 and PC 2, scaled by 5 for visibility
    loading = np.array(pca_component[col].loc[0:1])*5.
    x_coords = [0, loading[0]]
    y_coords = [0, loading[1]]

    # Draw a line from the origin to (PC 1 loading, PC 2 loading)
    ax.plot(x_coords, y_coords, marker="o", label=col)

ax.set_ylabel("Second Principal Component")
ax.set_xlabel("First Principal Component")
ax.set_xlim([-2.0, 5])
ax.set_ylim([-2.0, 5])
plt.grid()
plt.legend()
plt.show()

How do we interpret it?
- PC_1 puts large weights on `pdays` and `previous`, while the weights for `duration`, `balance` and `age` are small
- This suggests that `pdays` and `previous` are correlated with each other
- Both weights have the same sign, so the larger `previous` is, the larger `pdays` tends to be
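
The bi-plot projects the standardized data onto the first two components with a matrix product. A minimal sketch on synthetic data (the array `raw` and the column names are assumptions for illustration) suggests why this matches `pca.transform`: standardized data is already centered, so the mean that PCA subtracts is effectively zero.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized numeric data (hypothetical)
rng = np.random.default_rng(123)
raw = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))
X = pd.DataFrame(StandardScaler().fit_transform(raw),
                 columns=["a", "b", "c", "d"])

pca = PCA(n_components=2, random_state=123).fit(X)

# Manual projection onto the first two loading vectors
manual = X.to_numpy() @ pca.components_.T

# Because X is already centered, this matches pca.transform
print(np.allclose(manual, pca.transform(X)))
```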

---
# Modeling Clustering - Data Full

- **Goal**: form separate groups of observations with similar characteristics, and assign each observation to a cluster
- **THE CLUSTERING TASK IS SUBJECTIVE**

In [None]:
from sklearn.cluster import KMeans

*create the clustering object*

In [None]:
# Create the k-means object
kmeans_obj = KMeans(n_clusters = 3,
                    random_state = 123)

In [None]:
# Fit the k-means object
kmeans_obj.fit(data_train_num_clean)

*predict clustering*

In [None]:
# Predict Cluster
kmeans_obj.predict(data_train_num_clean)

In [None]:
# Reshape predicted cluster to dataframe
cluster_result = kmeans_obj.predict(data_train_num_clean)
cluster_result = pd.DataFrame(data = cluster_result,
                              columns = ["cluster"],
                              index = data_train_num_clean.index)

In [None]:
cluster_result.head()

*check the cluster proportions*

In [None]:
cluster_result["cluster"].value_counts(normalize = True)

- 2 clusters have a share above 43% of the data

*check the centroids* as the representation of each cluster

In [None]:
# Check centroid
kmeans_obj.cluster_centers_

In [None]:
# Convert to a dataframe
centroids = kmeans_obj.cluster_centers_
centroids = pd.DataFrame(data = centroids,
                         columns = data_train_num_clean.columns)

centroids

- Of course, the values above cannot be interpreted directly
- because they are in standardized form
- We have to convert them back to the original scale, before standardization

*inverse transform of the standardizer*

In [None]:
centroid_real = num_scaler.inverse_transform(centroids)
centroid_real = pd.DataFrame(data = centroid_real,
                             columns = data_train_num_clean.columns)

centroid_real

*so what does it mean?* - You have to translate it yourself
- Cluster 1 (0) is the **group** that
    - was contacted early in the month
    - has been contacted 2x **during** the campaign
    - had never been contacted **before** the campaign
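
To make the inverse transform concrete, here is a minimal sketch on made-up numbers (the array `X` and the centroid values are assumptions for illustration). `inverse_transform` undoes the z-score by multiplying with the stored standard deviation and adding the stored mean.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): e.g. day of month and number of contacts
X = np.array([[5.0, 1.0],
              [15.0, 2.0],
              [25.0, 6.0]])
scaler = StandardScaler().fit(X)

# A centroid expressed in standardized units
centroid_scaled = np.array([[-1.0, 0.5]])

# inverse_transform undoes the z-score: x = z * std + mean
centroid_real = scaler.inverse_transform(centroid_scaled)
centroid_manual = centroid_scaled * scaler.scale_ + scaler.mean_

print(centroid_real)
```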

*BEST K?*

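Score: the negative of the within-cluster sum-of-squares, where each point $x_{i}$ is compared with its nearest centroid $\mu_{j}$

$$
\text{score} = - \sum_{i=1}^{n} \min_{j} ||x_{i} - \mu_{j}||^{2}
$$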
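
A minimal sketch of this quantity on synthetic data (the three made-up groups are an assumption for illustration): the within-cluster sum of squares computed by hand should agree with the negated `score` and with the fitted `inertia_`.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (hypothetical): three loose groups in 2-D
rng = np.random.default_rng(123)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 0], [0, 4])])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=123).fit(X)

# Squared distance from every point to every centroid, keep the nearest
sq_dist = ((X[:, None, :] - kmeans.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
wcss_manual = sq_dist.min(axis=1).sum()

print(wcss_manual, -kmeans.score(X), kmeans.inertia_)
```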

In [None]:
# Show the score (negated to get the within-cluster sum of squares)
-kmeans_obj.score(data_train_num_clean)

*try varying the number of clusters*

In [None]:
score_list = []
k_list = np.arange(2, 11, 1)

for k in k_list:
    # Create the object
    kmeans_obj_k = KMeans(n_clusters = k,
                          max_iter = 50,
                          random_state = 123)
    
    # Fit data
    kmeans_obj_k.fit(data_train_num_clean)

    # update score
    score_k = -kmeans_obj_k.score(data_train_num_clean)
    score_list.append(score_k)


In [None]:
score_list

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 7))

ax.plot(k_list, score_list, "r", marker="o")

ax.set_xlabel("number of clusters")
ax.set_ylabel("within-cluster sum-of-squares")
plt.show()

- The more clusters, the lower the score.
- But the more clusters, the more complex the result is to interpret.
- We take 9 as the best number of clusters, because the change in error at 10 clusters becomes small
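
Reading the elbow from a plot is a judgment call. One possible aid, not part of the original analysis, is to print the relative drop in the within-cluster sum of squares between consecutive values of k. The sketch below uses synthetic data (an assumption for illustration) and `inertia_`, which holds the same quantity for the data the model was fit on.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (hypothetical): three loose groups in 2-D
rng = np.random.default_rng(123)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 0], [0, 4])])

k_values = np.arange(2, 11)
wcss = np.array([KMeans(n_clusters=k, n_init=10, random_state=123).fit(X).inertia_
                 for k in k_values])

# Relative drop in WCSS when moving from k to k + 1
relative_drop = -np.diff(wcss) / wcss[:-1]

for k, drop in zip(k_values[:-1], relative_drop):
    print(f"k: {k} -> {k + 1}, WCSS drop: {drop:.1%}")
```

A small drop after some k hints that extra clusters add little, but the final choice still depends on how interpretable the clusters are.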

In [None]:
# Create the object
kmeans_obj_best = KMeans(n_clusters = 9,
                         random_state = 123)

# Fit object
kmeans_obj_best.fit(data_train_num_clean)

**Show the Centroids**

In [None]:
# Put the centroids into a dataframe
centroids_best = kmeans_obj_best.cluster_centers_
centroids_best = pd.DataFrame(data = centroids_best,
                              columns = data_train_num_clean.columns)

# Inverse transform
centroid_real_best = num_scaler.inverse_transform(centroids_best)
centroid_real_best = pd.DataFrame(data = centroid_real_best,
                                  columns = data_train_num_clean.columns)

centroid_real_best

*Try interpreting the centroids above.*

**Predict Cluster**

In [None]:
cluster_best = kmeans_obj_best.predict(data_train_num_clean)

cluster_best = pd.DataFrame(data = cluster_best,
                            columns = ["cluster"],
                            index= data_train_num_clean.index)
cluster_best.head()

---
# Modeling Clustering - Data PCA

*Vary the number of clusters*

In [None]:
score_list = []
k_list = np.arange(2, 11, 1)

for k in k_list:
    # Create the object
    kmeans_obj_k = KMeans(n_clusters = k,
                          max_iter = 50,
                          random_state = 123)
    
    # Fit data
    kmeans_obj_k.fit(data_train_num_pca)

    # update score
    score_k = -kmeans_obj_k.score(data_train_num_pca)
    score_list.append(score_k)


In [None]:
score_list

In [None]:
fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(10, 7))

ax.plot(k_list, score_list, "r", marker="o")

ax.set_xlabel("number of clusters")
ax.set_ylabel("within-cluster sum-of-squares")
plt.show()

- The more clusters, the lower the score.
- But the more clusters, the more complex the result is to interpret.
- We take 7 as the best number of clusters, because the change in error for the following numbers of clusters becomes small

In [None]:
# Create the object
kmeans_obj_pca_best = KMeans(n_clusters = 7,
                             random_state = 123)

# Fit object
kmeans_obj_pca_best.fit(data_train_num_pca)

**Predict Cluster**

In [None]:
cluster_pca_best = kmeans_obj_pca_best.predict(data_train_num_pca)

cluster_pca_best = pd.DataFrame(data = cluster_pca_best,
                                columns = ["cluster"],
                                index = data_train_num_pca.index)
cluster_pca_best.head()

*Centroid PCA*

In [None]:
# Get the centroids
centroid_pca_best = kmeans_obj_pca_best.cluster_centers_
centroid_pca_best = pd.DataFrame(data = centroid_pca_best,
                                 columns = data_train_num_pca.columns)
centroid_pca_best

In [None]:
# Inverse-transform the PCA centroids
# so that they can be interpreted
centroid_pca_best_inv = pca_obj.inverse_transform(centroid_pca_best)
centroid_pca_best_inv = pd.DataFrame(centroid_pca_best_inv,
                                     columns = data_train_num_clean.columns)
centroid_pca_best_inv

In [None]:
# Inverse-transform the standardized centroids
# so that they can be interpreted
centroid_pca_best_real = num_scaler.inverse_transform(centroid_pca_best_inv)
centroid_pca_best_real = pd.DataFrame(centroid_pca_best_real,
                                      columns = data_train_num_clean.columns)

centroid_pca_best_real

**Now the data can be interpreted**
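
The two inverse transforms above are applied in the reverse order of the preprocessing: first undo PCA, then undo standardization. A minimal sketch on synthetic data (the array `X` and the helper `round_trip` are assumptions for illustration) suggests that the round trip is exact only when all components are kept. With fewer components, the recovered values are an approximation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical) standing in for the numeric features
rng = np.random.default_rng(123)
X = rng.normal(size=(300, 5)) * np.array([1.0, 10.0, 100.0, 5.0, 0.1])

scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)

def round_trip(n_components):
    # Scale -> PCA -> inverse PCA -> inverse scale, then report the largest error
    pca = PCA(n_components=n_components, random_state=123).fit(X_scaled)
    X_back = scaler.inverse_transform(pca.inverse_transform(pca.transform(X_scaled)))
    return np.abs(X_back - X).max()

print(round_trip(5))  # all components kept
print(round_trip(3))  # two components dropped
```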

---
# Clustering Test Data

## Preprocessing Test Data

In [None]:
def transformTestData(data, num_col, cat_col, num_imputer, num_scaler):
    # 1. Split num-cat data
    data_num, _ = splitNumCat(data = data,
                              num_col = num_col,
                              cat_col = cat_col)
    
    # 2. Handling Data
    # 2.1 transform month
    data_num = transformMonth(data = data_num)

    # 2.2 impute data (use the data passed to this function, not the training data)
    data_num_imputed, _= imputerNumeric(data = data_num,
                                        imputer = num_imputer)
    
    # 2.3 Standardization
    data_num_scaled = transformStandardize(data = data_num_imputed,
                                           scaler = num_scaler)
    
    return data_num_scaled
    

In [None]:
data_test_clean = transformTestData(data = data_test,
                                    num_col = num_col,
                                    cat_col = cat_col,
                                    num_imputer = num_imputer,
                                    num_scaler = num_scaler)

In [None]:
data_test_clean.head()

*Transform PCA*

In [None]:
data_test_clean_pca = transformPCA(data = data_test_clean,
                                   pca_obj = pca_obj)

In [None]:
data_test_clean_pca.head()
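
The test data must pass through the same fitted imputer, scaler and PCA as the training data. As an alternative to chaining the helper functions by hand, scikit-learn's `Pipeline` can bundle these steps. The sketch below is not the approach used in this notebook; the synthetic frames, column names and parameter values are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic numeric data (hypothetical) split into train and test parts
rng = np.random.default_rng(123)
cols = ["f1", "f2", "f3", "f4"]
train = pd.DataFrame(rng.normal(size=(200, 4)), columns=cols)
test = pd.DataFrame(rng.normal(size=(50, 4)), columns=cols)
test.iloc[0, 0] = np.nan  # a missing value that appears only in the test set

# Every step is fit on the training data only
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=3, random_state=123)),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=123)),
])
pipe.fit(train)

# predict reuses the fitted imputer, scaler and PCA before assigning clusters
test_cluster = pd.Series(pipe.predict(test), index=test.index, name="cluster")
print(test_cluster.value_counts())
```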

## Predict Test Data

*predict test data - FULL*

In [None]:
kmeans_obj_best.predict(data_test_clean)

*predict test data - PCA*

In [None]:
kmeans_obj_pca_best.predict(data_test_clean_pca)

*Is Cross Validation Necessary?*