# **Machine Learning Workflow**
---
> Introduction to Machine Learning <br>
> Sekolah Data, Pacmann

## **Machine Learning with Scikit-Learn is Easy**
---

<center>
<img src="https://img.ifunny.co/images/1f58ab4c0a13ce4b916aa56838792ea21938849394e9a0b309fffbe69e9dce21_1.jpg">
</center>


## **Machine Learning Workflow** (Simplified)
---

### 1. <font color='blue'> Importing Data to Python</font> 
    * Drop Duplicates 
### 2. <font color='blue'> Data Preprocessing:</font> 
    * Input-Output Split, Train-Test Split
    * Imputation, Processing Categorical, Normalization 
### 3. <font color='blue'> Training Machine Learning:</font> 
    * Choose Score to optimize and Hyperparameter Space

## **Bank Analysis**
---

- Task : Classification
- Objective : Predict which bank clients will subscribe to a term deposit

<br>

<center>
<img src="https://keralagbank.com/public/images/inner/personal/term-deposit.png">
</center>

### **Data description:**

**Bank Client Data**:

- `age` (numeric)
- `job` : type of job (categorical: "admin.", "unknown", "unemployed", "management", "housemaid", "entrepreneur", "student", "blue-collar", "self-employed", "retired", "technician", "services")
- `marital` : marital status (categorical: "married", "divorced", "single"; note: "divorced" means divorced or widowed)
- `education` (categorical: "unknown", "secondary", "primary", "tertiary")
- `default`: has credit in default? (binary: "yes", "no")
- `balance`: average yearly balance, in euros (numeric)
- `housing`: has housing loan? (binary: "yes", "no")
- `loan`: has personal loan? (binary: "yes", "no")

<br>

**Communication details of the last campaign contact**
- `contact`: contact communication type (categorical: "unknown", "telephone", "cellular")
- `day`: last contact day of the month (numeric)
- `month`: last contact month of year (categorical: "jan", "feb", "mar", ..., "nov", "dec")
- `duration`: last contact duration, in seconds (numeric)

<br>

**Other attributes/features**
- `campaign`: number of contacts performed during this campaign and for this client (numeric, includes last contact)
- `pdays`: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
- `previous`: number of contacts performed before this campaign and for this client (numeric)
- `poutcome`: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

<br>

**Output variable (desired target)**
- `y` - has the client subscribed a term deposit? (binary: "yes","no")

## <b><font color='blue'>1.  Importing Data to Python</font></b>
---

You can import data from a variety of formats:
- .csv
- Excel
- .txt
- SQL
- .dat

**Import the data-processing libraries**

Typically:
- pandas
- NumPy
In [1]:
# import data-processing libraries
import pandas as pd
import numpy as np


**Load data**

- Use `pd.read_csv()` when the file is a .csv
- Load the data from `bank-data.csv`

In [2]:
# load data
path = 'bank-data.csv'

bank = pd.read_csv(path)

bank

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown,no
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown,no
2,,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown,no
3,,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown,no
4,33.0,unknown,single,unknown,no,1.0,no,no,,5.0,may,198.0,1.0,-1.0,0.0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51.0,technician,married,tertiary,no,825.0,no,no,,17.0,nov,977.0,3.0,-1.0,0.0,unknown,yes
45207,71.0,retired,divorced,primary,no,1729.0,no,no,cellular,17.0,nov,456.0,2.0,-1.0,0.0,unknown,yes
45208,72.0,,married,,no,5715.0,no,,cellular,,nov,1127.0,5.0,184.0,,success,yes
45209,57.0,blue-collar,married,secondary,no,668.0,no,no,telephone,17.0,nov,508.0,4.0,-1.0,,unknown,no


In [3]:
bank.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown,no
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown,no
2,,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown,no
3,,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown,no
4,33.0,unknown,single,unknown,no,1.0,no,no,,5.0,may,198.0,1.0,-1.0,0.0,unknown,no


**Check the number of observations**

In [4]:
# check the dimensions of the data
row, columns = bank.shape

print(f"Number of rows: {row}, number of columns: {columns}")

Number of rows: 45211, number of columns: 17


**Check for and drop duplicate rows**

- check with `.duplicated()`

In [5]:
# count the duplicate rows (.sum() counts the True values)
data_duplicated = bank.duplicated().sum()

data_duplicated

np.int64(0)

In [6]:
# drop duplicate rows
bank = bank.drop_duplicates()

In [7]:
# check whether any rows were dropped
bank.shape

(45211, 17)
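
The bank data happens to contain no duplicates, so the effect of these calls is not visible above. A minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration) shows what `.duplicated()` and `.drop_duplicates()` do when a repeated row exists:

```python
import pandas as pd

# Hypothetical toy data: row 2 repeats row 0 exactly.
toy = pd.DataFrame({"age": [30, 41, 30],
                    "job": ["admin.", "services", "admin."]})

# .duplicated() marks every row that repeats an earlier row.
has_duplicates = toy.duplicated().any()  # is there at least one duplicate?
n_duplicates = toy.duplicated().sum()    # how many duplicate rows are there?

# .drop_duplicates() keeps the first occurrence of each row.
toy_unique = toy.drop_duplicates()

print(has_duplicates, n_duplicates, toy_unique.shape)
```

Note that `.any()` only answers yes or no, while `.sum()` gives the number of rows that would be dropped.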

**Wrap everything in a function**

1. Import the data
2. Check the **number of observations** and the **number of columns**
3. Drop duplicates
4. Check the **number of observations** and the **number of columns** after dropping
5. Return the de-duplicated data

In [8]:
# the commands that will go into the function
bank = pd.read_csv('bank-data.csv')
print("Original data                : ", bank.shape, "- (#observations, #columns)")

bank = bank.drop_duplicates()
print("Data after dropping          : ", bank.shape, "- (#observations, #columns)")

Original data                :  (45211, 17) - (#observations, #columns)
Data after dropping          :  (45211, 17) - (#observations, #columns)


In [9]:
def importData(filename):
    """
    Import the data and drop duplicate rows
    :param filename: <string> input file name (.csv format)
    :return df: <pandas dataframe> sample data
    """

    # read data
    df = pd.read_csv(filename)
    print("Original data        : ", df.shape, "- (#observations, #columns)")

    # drop duplicates
    df = df.drop_duplicates()
    print("Data after dropping  : ", df.shape, "- (#observations, #columns)")

    return df

# `filename` is the function's parameter.
# When the function is called with filename = "bank-data.csv",
# every use of `filename` inside the function
# refers to "bank-data.csv"

In [10]:
# input file
file_bank = 'bank-data.csv'

# call the function
bank = importData(filename = file_bank)

Original data        :  (45211, 17) - (#observations, #columns)
Data after dropping  :  (45211, 17) - (#observations, #columns)


In [11]:
bank.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown,no
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown,no
2,,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown,no
3,,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown,no
4,33.0,unknown,single,unknown,no,1.0,no,no,,5.0,may,198.0,1.0,-1.0,0.0,unknown,no


## <b><font color='blue'> 2. Data Preprocessing:</font></b>
---
    * Input-Output Split, Train-Test Split
    * Processing Categorical
    * Imputation, Normalization, Drop Duplicates

### **Input-Output Split**

- The `y` column is the output variable of the marketing data
- All other columns are inputs

**Create the output data**

In [12]:
bank.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown,no
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown,no
2,,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown,no
3,,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown,no
4,33.0,unknown,single,unknown,no,1.0,no,no,,5.0,may,198.0,1.0,-1.0,0.0,unknown,no


In [13]:
output_data = bank['y']

# build the target data:
# select the column named 'y' and call it output_data

In [14]:
output_data.head()

0    no
1    no
2    no
3    no
4    no
Name: y, dtype: object

**Create the input data**

- DATA = INPUT + OUTPUT
- DATA - OUTPUT = INPUT
- So if we drop the OUTPUT variable from the data, only the INPUT variables remain.

In [15]:
input_data = bank.drop(['y'],
                       axis = 1)

In [16]:
input_data.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown
2,,entrepreneur,married,secondary,no,2.0,yes,yes,unknown,5.0,may,76.0,1.0,-1.0,0.0,unknown
3,,blue-collar,married,unknown,no,1506.0,yes,no,unknown,5.0,may,92.0,1.0,-1.0,0.0,unknown
4,33.0,unknown,single,unknown,no,1.0,no,no,,5.0,may,198.0,1.0,-1.0,0.0,unknown


**Wrap everything in a function**
1. create output_data
2. create input_data
3. return input_data and output_data

In [17]:
# the commands that will go into the function
output_data = bank['y']
input_data  = bank.drop('y', 
                        axis=1)

In [18]:
def InputOutput(data,
                output_column_name):
    """
    Split the data into input and output
    :param data: <pandas dataframe> the full sample data
    :param output_column_name: <string> name of the output column
    :return input_data: <pandas dataframe> input data
    :return output_data: <pandas series> output data
    """
    output_data = data[output_column_name]
    input_data  = data.drop(output_column_name,
                            axis = 1)
    return input_data, output_data

# `data` and `output_column_name` are the function's parameters.
# When the function is called with data = bank,
# every use of `data` inside the function refers to bank

In [19]:
# keep the order of the returned values: input first, then output
x, y = InputOutput(data = bank,
                   output_column_name= "y")

**Always sanity check!**

In [20]:
x.head(2)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58.0,management,married,tertiary,no,2143.0,yes,no,unknown,,,261.0,1.0,-1.0,0.0,unknown
1,,technician,single,secondary,no,29.0,yes,no,unknown,5.0,may,151.0,1.0,-1.0,0.0,unknown


In [21]:
y.head(2)

0    no
1    no
Name: y, dtype: object

### **Train-Test Split**

- **Why?**
  - We do not want to overfit the training data
  - The test data stands in for future, unseen data
  - We train the ML model on the training data, using cross-validation (CV)
  - We then evaluate it on the test data

In [22]:
# Import the train-test splitting function from sklearn (scikit-learn)
from sklearn.model_selection import train_test_split 

**Train Test Split Function**
1. `X` is the input
2. `y` is the output (target)
3. `test_size` is the proportion of the whole dataset assigned to the test set. For example, `test_size = 0.2` puts 20% of the data in the test set.
4. `random_state` is the seed for the random shuffling. Use the same value on every run to get the same split, e.g. `random_state = 123`.
5. Output:
   - `X_train` = input of the training data
   - `X_test` = input of the test data
   - `y_train` = output of the training data
   - `y_test` = output of the test data
6. The outputs come in the order `X_train, X_test, y_train, y_test`. Do not swap them.

> Read more: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
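
The roles of `test_size` and `random_state` can be sketched on a small hypothetical dataset (`toy_x` and `toy_y` are made up for illustration). The sketch also shows the optional `stratify` argument, which this notebook does not use but which keeps the class proportions the same in both splits:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 16 "no" and 4 "yes".
toy_x = pd.DataFrame({"feature": np.arange(20)})
toy_y = pd.Series(["no"] * 16 + ["yes"] * 4, name="y")

# The same random_state gives the same split on every call.
split_a = train_test_split(toy_x, toy_y, test_size=0.25, random_state=123)
split_b = train_test_split(toy_x, toy_y, test_size=0.25, random_state=123)

x_tr, x_te, y_tr, y_te = split_a  # order: X_train, X_test, y_train, y_test
same_split = x_te.index.equals(split_b[1].index)
print(x_tr.shape, x_te.shape, same_split)

# Optional (not used in this notebook): stratify on the target so the
# test set keeps the 80/20 class ratio of the full data.
_, _, y_tr_s, y_te_s = train_test_split(toy_x, toy_y, test_size=0.25,
                                        random_state=123, stratify=toy_y)
print(y_te_s.value_counts())
```

With an imbalanced target such as `y` in the bank data, stratifying is one way to make sure the minority class is represented in the test set in the same proportion as in the training set.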

In [23]:
# Train test split
x_train, x_test, y_train, y_test = train_test_split(x, y,
                                                    test_size = 0.25,
                                                    random_state= 12)

In [24]:
# Sanity check the result of the split
print(x_train.shape)
print(x_test.shape)

(33908, 16)
(11303, 16)


In [25]:
# Ratio
x_test.shape[0] / x.shape[0]

# The result is 0.25, matching our test_size

0.25000552962774547

**Congratulations!** You now have training and test data

> From here on, **focus** only on the **training** data

### **Data Imputation**

- The process of filling in missing values (NaN)
- There are 2 cases to handle:
  - Numerical Imputation
  - Categorical Imputation

**Check for missing values in the input variables**

In [26]:
x_train.isnull().sum()

# Output: variable name and its number of missing values.
# A count above 0 means the column has missing data

# Each column has roughly 2,500-2,700 missing values

age          2626
job          2650
marital      2650
education    2542
default      2689
balance      2574
housing      2660
loan         2668
contact      2695
day          2617
month        2602
duration     2701
campaign     2614
pdays        2634
previous     2638
poutcome     2629
dtype: int64

**Separate the categorical and numerical data**

In [27]:
x_train.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
37156,35.0,management,single,tertiary,no,2749.0,no,no,cellular,13.0,may,127.0,1.0,-1.0,0.0,unknown
20494,30.0,management,,,no,443.0,yes,,cellular,12.0,,80.0,2.0,-1.0,0.0,unknown
35272,39.0,management,,tertiary,no,4239.0,yes,no,cellular,7.0,may,40.0,1.0,-1.0,0.0,unknown
22260,49.0,services,,,no,400.0,no,no,cellular,21.0,aug,151.0,3.0,-1.0,0.0,unknown
2728,28.0,technician,single,secondary,no,468.0,yes,no,unknown,13.0,may,152.0,3.0,-1.0,0.0,unknown


Categorical data:
- job
- marital
- education
- default
- housing
- loan
- contact
- month
- poutcome

The rest are numerical

**Numerical Imputation**

In [28]:
x_train.columns

Index(['age', 'job', 'marital', 'education', 'default', 'balance', 'housing',
       'loan', 'contact', 'day', 'month', 'duration', 'campaign', 'pdays',
       'previous', 'poutcome'],
      dtype='object')

In [29]:
# list the numeric columns
numeric_columns = ['age', 'balance', 'day', 'duration', 
                   'campaign', 'pdays', 'previous']

In [30]:
# select the numeric columns of the dataframe
x_train_numerical = x_train[numeric_columns]

In [31]:
x_train_numerical.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous
37156,35.0,2749.0,13.0,127.0,1.0,-1.0,0.0
20494,30.0,443.0,12.0,80.0,2.0,-1.0,0.0
35272,39.0,4239.0,7.0,40.0,1.0,-1.0,0.0
22260,49.0,400.0,21.0,151.0,3.0,-1.0,0.0
2728,28.0,468.0,13.0,152.0,3.0,-1.0,0.0


**Check whether any numeric data is missing**

In [32]:
x_train_numerical.isnull().any()

# Every numerical variable has missing values

age         True
balance     True
day         True
duration    True
campaign    True
pdays       True
previous    True
dtype: bool

**Use the imputer from sklearn to impute the numeric data only**

In [33]:
from sklearn.impute import SimpleImputer

In [34]:
imputer = SimpleImputer(missing_values= np.nan, 
                        strategy= 'median')


# instantiate SimpleImputer as `imputer`; do not forget the parentheses ()
# missing_values is the marker used for missing values in the data.
#   - it can be NaN, 999, or "KOSONG" ("empty")
# strategy = 'median' is the imputation strategy:
# a missing value is replaced with the median of its column
# Another strategy is: mean

- `fit`: lets the imputer learn the mean or median of each column
- `transform`: fills the missing values with that median or mean
- the output of `transform` is a NumPy array, so we wrap it in a pandas dataframe
- name the columns of `x_train_numerical_imputed` after those of `x_train_numerical`.
   - WHY? Because the column names are lost after imputation
- set the index of `x_train_numerical_imputed` to that of `x_train_numerical`.
   - WHY? Because the index is lost after imputation
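
The `fit`/`transform` steps and the loss of column names and index can be seen on a small hypothetical frame (`toy_num` and its values are made up for illustration). This is a sketch that assumes scikit-learn's default output configuration, where `transform` returns a NumPy array:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with a non-default index and one gap per column.
toy_num = pd.DataFrame({"age": [20.0, np.nan, 40.0],
                        "balance": [100.0, 300.0, np.nan]},
                       index=[10, 11, 12])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy="median")
toy_imputer.fit(toy_num)          # learns one median per column
print(toy_imputer.statistics_)    # the learned medians

toy_array = toy_imputer.transform(toy_num)
print(type(toy_array))            # a NumPy array: no column names, no index

# Re-attach the column names and the index of the original frame.
toy_imputed = pd.DataFrame(toy_array,
                           columns=toy_num.columns,
                           index=toy_num.index)
print(toy_imputed)
```

Keeping the original index matters later, because `pd.concat(..., axis=1)` aligns rows by index when the numeric and categorical parts are joined.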

In [35]:
# the commands that will go into the function

# fit imputer
imputer.fit(x_train_numerical)

# Transform
imputed_data = imputer.transform(x_train_numerical)
x_train_numerical_imputed = pd.DataFrame(imputed_data)

x_train_numerical_imputed.columns = x_train_numerical.columns
x_train_numerical_imputed.index = x_train_numerical.index

In [36]:
x_train_numerical_imputed.isnull().any().sum()

np.int64(0)

**Let's wrap it in a function**

In [37]:
from sklearn.impute import SimpleImputer

def numericalImpputation(data, numeric_columns):
    """
    Impute missing values in the numeric columns
    :param data: <pandas dataframe> sample input data
    :param numeric_columns: <list> names of the numeric columns
    :return numerical_data_imputed: <pandas dataframe> imputed numeric data
    :return imputer_numerical: fitted numerical imputer
    """

    # select the numeric data
    numerical_data = data[numeric_columns]

    # create and fit the imputer
    imputer_numerical = SimpleImputer(missing_values = np.nan,
                                      strategy = "median")
    imputer_numerical.fit(numerical_data)

    # transform
    imputed_data = imputer_numerical.transform(numerical_data)

    # rebuild the dataframe with the original column names and index
    numerical_data_imputed = pd.DataFrame(
        imputed_data,
        columns = numeric_columns,
        index = data.index)

    return numerical_data_imputed, imputer_numerical

In [38]:
numeric_columns = ["age", "balance", "day", "duration", 
                    "campaign", "pdays", "previous"]

# impute the numeric data
x_train_numerical, imputer_numerical = numericalImpputation(data = x_train,
                                                            numeric_columns = numeric_columns)

In [39]:
x_train_numerical.isnull().sum()

age         0
balance     0
day         0
duration    0
campaign    0
pdays       0
previous    0
dtype: int64

**Categorical Imputation**

In [40]:
# Get the list of categorical column names
# You can write them out directly, or derive the list when there are many

x_train_columns = list(x_train.columns)
categorical_columns = list(set(x_train_columns).difference(set(numeric_columns)))

In [41]:
categorical_columns

['loan',
 'marital',
 'month',
 'job',
 'default',
 'housing',
 'poutcome',
 'education',
 'contact']

In [42]:
# check the missing values again
categorical_data = x_train[categorical_columns]
categorical_data.isnull().sum()

loan         2668
marital      2650
month        2602
job          2650
default      2689
housing      2660
poutcome     2629
education    2542
contact      2695
dtype: int64

In [43]:
# fill the categorical columns with "KOSONG" ("empty")
categorical_data = x_train[categorical_columns]
categorical_data    = categorical_data.fillna(value="KOSONG")

In [44]:
categorical_data.isnull().sum()

loan         0
marital      0
month        0
job          0
default      0
housing      0
poutcome     0
education    0
contact      0
dtype: int64

**Let's wrap it in a function**

In [45]:
def categoricalImputation(data, categorical_columns):
    """
    Impute missing values in the categorical columns
    :param data: <pandas dataframe> sample input data
    :param categorical_columns: <list> names of the categorical columns
    :return categorical_data: <pandas dataframe> imputed categorical data
    """

    # select the data
    categorical_data = data[categorical_columns]

    # impute
    categorical_data = categorical_data.fillna(value="KOSONG")

    return categorical_data


In [46]:
x_train_categorical = categoricalImputation(data = x_train,
                                            categorical_columns= categorical_columns)

In [47]:
x_train_categorical.isnull().sum()

loan         0
marital      0
month        0
job          0
default      0
housing      0
poutcome     0
education    0
contact      0
dtype: int64

### **Preprocessing Categorical Variables**

- Categorical data cannot be fed to the model unless it is converted to numbers
- Solution: One Hot Encoding (OHE)

In [48]:
categorical_ohe = pd.get_dummies(x_train_categorical)

In [49]:
categorical_ohe.head(2)

Unnamed: 0,loan_KOSONG,loan_no,loan_yes,marital_KOSONG,marital_divorced,marital_married,marital_single,month_KOSONG,month_apr,month_aug,...,poutcome_unknown,education_KOSONG,education_primary,education_secondary,education_tertiary,education_unknown,contact_KOSONG,contact_cellular,contact_telephone,contact_unknown
37156,False,True,False,False,False,False,True,False,False,False,...,True,False,False,False,True,False,False,True,False,False
20494,True,False,False,True,False,False,False,True,False,False,...,True,True,False,False,False,False,False,True,False,False


**Let's turn it into a function**

In [50]:
def extractCategorical(data, categorical_columns):
    """
    Extract the categorical data with One Hot Encoding
    :param data: <pandas dataframe> sample data
    :param categorical_columns: <list> names of the categorical columns
    :return categorical_ohe: <pandas dataframe> sample data with OHE
    """

    data_categorical = categoricalImputation(data = data,
                                             categorical_columns = categorical_columns)
    # store the result in a local variable so the function
    # does not silently return the global `categorical_ohe`
    categorical_ohe = pd.get_dummies(data_categorical)

    return categorical_ohe

In [51]:
x_train_categorical_ohe = extractCategorical(data = x_train, 
                                             categorical_columns= categorical_columns)

In [52]:
x_train_categorical_ohe.head()

Unnamed: 0,loan_KOSONG,loan_no,loan_yes,marital_KOSONG,marital_divorced,marital_married,marital_single,month_KOSONG,month_apr,month_aug,...,poutcome_unknown,education_KOSONG,education_primary,education_secondary,education_tertiary,education_unknown,contact_KOSONG,contact_cellular,contact_telephone,contact_unknown
37156,False,True,False,False,False,False,True,False,False,False,...,True,False,False,False,True,False,False,True,False,False
20494,True,False,False,True,False,False,False,True,False,False,...,True,True,False,False,False,False,False,True,False,False
35272,False,True,False,True,False,False,False,False,False,False,...,True,False,False,False,True,False,False,True,False,False
22260,False,True,False,True,False,False,False,False,False,True,...,True,True,False,False,False,False,False,True,False,False
2728,False,True,False,False,False,False,True,False,False,False,...,True,False,False,True,False,False,False,False,False,True


In [53]:
# save the OHE columns so they can be applied to the test data
# this keeps the shape consistent
ohe_columns = x_train_categorical_ohe.columns

In [54]:
ohe_columns

Index(['loan_KOSONG', 'loan_no', 'loan_yes', 'marital_KOSONG',
       'marital_divorced', 'marital_married', 'marital_single', 'month_KOSONG',
       'month_apr', 'month_aug', 'month_dec', 'month_feb', 'month_jan',
       'month_jul', 'month_jun', 'month_mar', 'month_may', 'month_nov',
       'month_oct', 'month_sep', 'job_KOSONG', 'job_admin.', 'job_blue-collar',
       'job_entrepreneur', 'job_housemaid', 'job_management', 'job_retired',
       'job_self-employed', 'job_services', 'job_student', 'job_technician',
       'job_unemployed', 'job_unknown', 'default_KOSONG', 'default_no',
       'default_yes', 'housing_KOSONG', 'housing_no', 'housing_yes',
       'poutcome_KOSONG', 'poutcome_failure', 'poutcome_other',
       'poutcome_success', 'poutcome_unknown', 'education_KOSONG',
       'education_primary', 'education_secondary', 'education_tertiary',
       'education_unknown', 'contact_KOSONG', 'contact_cellular',
       'contact_telephone', 'contact_unknown'],
      dtype='object')
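
Why the saved OHE columns matter can be sketched with hypothetical toy data (`toy_train_cat` and `toy_test_cat` are made up for illustration): when a category is absent from the test data, `pd.get_dummies` produces fewer columns, and `reindex` restores the training layout.

```python
import pandas as pd

# Hypothetical toy data: the test set lacks "telephone" and "unknown".
toy_train_cat = pd.DataFrame({"contact": ["cellular", "telephone", "unknown"]})
toy_test_cat = pd.DataFrame({"contact": ["cellular", "cellular"]})

toy_train_ohe = pd.get_dummies(toy_train_cat)
toy_ohe_columns = toy_train_ohe.columns      # saved from the training data

toy_test_ohe = pd.get_dummies(toy_test_cat)  # only contact_cellular exists
print(toy_test_ohe.shape)

# reindex returns a NEW dataframe, so the result must be assigned.
# Columns missing from the test data are created and filled with False.
toy_test_ohe = toy_test_ohe.reindex(columns=toy_ohe_columns, fill_value=False)
print(toy_test_ohe)
```

Without `fill_value`, the columns created by `reindex` would contain NaN, which most scikit-learn estimators reject.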

### **Join data Numerical dan Categorical**

- The numeric and categorical data must be joined back together
- Join them with `pd.concat`

In [55]:
x_train_concat = pd.concat([x_train_numerical,
                            x_train_categorical_ohe],
                           axis = 1)

In [56]:
x_train_concat.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,loan_KOSONG,loan_no,loan_yes,...,poutcome_unknown,education_KOSONG,education_primary,education_secondary,education_tertiary,education_unknown,contact_KOSONG,contact_cellular,contact_telephone,contact_unknown
37156,35.0,2749.0,13.0,127.0,1.0,-1.0,0.0,False,True,False,...,True,False,False,False,True,False,False,True,False,False
20494,30.0,443.0,12.0,80.0,2.0,-1.0,0.0,True,False,False,...,True,True,False,False,False,False,False,True,False,False
35272,39.0,4239.0,7.0,40.0,1.0,-1.0,0.0,False,True,False,...,True,False,False,False,True,False,False,True,False,False
22260,49.0,400.0,21.0,151.0,3.0,-1.0,0.0,False,True,False,...,True,True,False,False,False,False,False,True,False,False
2728,28.0,468.0,13.0,152.0,3.0,-1.0,0.0,False,True,False,...,True,False,False,True,False,False,False,False,False,True


In [57]:
x_train_concat.isnull().any()

age                    False
balance                False
day                    False
duration               False
campaign               False
pdays                  False
previous               False
loan_KOSONG            False
loan_no                False
loan_yes               False
marital_KOSONG         False
marital_divorced       False
marital_married        False
marital_single         False
month_KOSONG           False
month_apr              False
month_aug              False
month_dec              False
month_feb              False
month_jan              False
month_jul              False
month_jun              False
month_mar              False
month_may              False
month_nov              False
month_oct              False
month_sep              False
job_KOSONG             False
job_admin.             False
job_blue-collar        False
job_entrepreneur       False
job_housemaid          False
job_management         False
job_retired            False
job_self-emplo

### **Standardizing Variables**

- Puts the input variables on the same scale
- `fit`: lets the standardizer learn the mean and standard deviation of each column
- `transform`: replaces the values with their standardized versions
- the output of `transform` is a NumPy array, so we wrap it back into a pandas dataframe
- the fitted standardizer is returned as well, because it will be reused on the test data
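
A minimal sketch of these points on hypothetical toy data (`toy_train` and `toy_test` are made up for illustration): the scaler is fitted on the training rows only, and the same fitted object then transforms the test rows.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data on very different scales.
toy_train = pd.DataFrame({"age": [20.0, 30.0, 40.0],
                          "balance": [0.0, 1000.0, 2000.0]})
toy_test = pd.DataFrame({"age": [30.0], "balance": [3000.0]})

toy_scaler = StandardScaler()
toy_scaler.fit(toy_train)   # learns the mean and std of each training column

toy_train_scaled = pd.DataFrame(toy_scaler.transform(toy_train),
                                columns=toy_train.columns,
                                index=toy_train.index)

# Reuse the fitted scaler: the test row is scaled with the TRAINING statistics.
toy_test_scaled = pd.DataFrame(toy_scaler.transform(toy_test),
                               columns=toy_test.columns,
                               index=toy_test.index)

print(toy_train_scaled.mean().round(6))  # each column is centered at 0
print(toy_test_scaled)
```

Fitting a second scaler on the test data would use different means and standard deviations, so the model would see test features on a different scale than the one it was trained on.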

In [58]:
from sklearn.preprocessing import StandardScaler

# create the function
def standardizerData(data):
    """
    Standardize the data
    :param data: <pandas dataframe> sample data
    :return standardized_data: <pandas dataframe> standardized sample data
    :return standardizer: fitted standardizer
    """
    data_columns = data.columns  # keep the column names
    data_index = data.index  # keep the index

    # create and fit the standardizer
    standardizer = StandardScaler()
    standardizer.fit(data)

    # transform data
    standardized_data_raw = standardizer.transform(data)
    standardized_data = pd.DataFrame(standardized_data_raw)
    standardized_data.columns = data_columns
    standardized_data.index = data_index

    return standardized_data, standardizer
 

In [59]:
x_train_clean, standardizer = standardizerData(data = x_train_concat)

In [60]:
x_train_clean.head()

Unnamed: 0,age,balance,day,duration,campaign,pdays,previous,loan_KOSONG,loan_no,loan_yes,...,poutcome_unknown,education_KOSONG,education_primary,education_secondary,education_tertiary,education_unknown,contact_KOSONG,contact_cellular,contact_telephone,contact_unknown
37156,-0.568886,0.502047,-0.354434,-0.504073,-0.569761,-0.390255,-0.292138,-0.292238,0.543266,-0.418762,...,0.56727,-0.284681,-0.402775,-0.949599,1.632949,-0.200227,-0.29384,0.82585,-0.252548,-0.602062
20494,-1.058043,-0.288093,-0.47921,-0.693995,-0.236736,-0.390255,-0.292138,3.421863,-1.84072,-0.418762,...,0.56727,3.512706,-0.402775,-0.949599,-0.612389,-0.200227,-0.29384,0.82585,-0.252548,-0.602062
35272,-0.177561,1.012588,-1.103089,-0.855631,-0.569761,-0.390255,-0.292138,-0.292238,0.543266,-0.418762,...,0.56727,-0.284681,-0.402775,-0.949599,1.632949,-0.200227,-0.29384,0.82585,-0.252548,-0.602062
22260,0.800752,-0.302827,0.643772,-0.407091,0.096289,-0.390255,-0.292138,-0.292238,0.543266,-0.418762,...,0.56727,3.512706,-0.402775,-0.949599,-0.612389,-0.200227,-0.29384,0.82585,-0.252548,-0.602062
2728,-1.253705,-0.279527,-0.354434,-0.40305,0.096289,-0.390255,-0.292138,-0.292238,0.543266,-0.418762,...,0.56727,-0.284681,-0.402775,1.053076,-0.612389,-0.200227,-0.29384,-1.210874,-0.252548,1.660959


## <b><font color='blue'> 3. Training Machine Learning:</font></b>
---
    * Choose Score to optimize and Hyperparameter Space
    * Cross-Validation: Random vs Grid Search CV
    * We have to beat the benchmark

### **Benchmark / Baseline**

- The baseline for the evaluation later on
- Because this is classification, we can take it from the proportion of the largest target class
- In other words, predict "no" for every marketing response without any modeling

In [61]:
y_train.value_counts(normalize= True)

# baseline accuracy = 88%

y
no     0.882624
yes    0.117376
Name: proportion, dtype: float64
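
The same majority-class baseline can also be expressed as a model. A minimal sketch, assuming scikit-learn's `DummyClassifier` and a hypothetical toy target (`toy_y`, with made-up proportions close to the bank data):

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical toy target: 88 "no" and 12 "yes".
toy_y = pd.Series(["no"] * 88 + ["yes"] * 12, name="y")
toy_x = np.zeros((len(toy_y), 1))  # the features are ignored by the dummy model

# Always predicts the most frequent class seen during fit.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(toy_x, toy_y)

baseline_accuracy = baseline.score(toy_x, toy_y)
majority_share = toy_y.value_counts(normalize=True).iloc[0]
print(baseline_accuracy, majority_share)
```

Both numbers agree, which is why the share of the largest class can serve as the accuracy any real model has to beat.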

### **1. Import Model**

- For example, we use 3 ML models for classification:
    - K-nearest neighbor (K-NN)
    - Logistic Regression
    - Random Forest

In [62]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

### **2. Fitting Model**

- Fit/train each model as described in its documentation

In [63]:
# K-nearest neighbor model
# knn = KNeighborsClassifier()
# knn.fit(x_train_clean, y_train)

In [64]:
# logistic regression model
loreg = LogisticRegression(random_state= 123)
loreg.fit(x_train_clean, y_train)

In [65]:
random_forest = RandomForestClassifier(random_state= 123)
random_forest.fit(x_train_clean, y_train)

In [66]:
# Random Forest Classifier 1
# Let's change a random forest hyperparameter --> n_estimators
# Its meaning and purpose are explained in the Random Forest class
# Set n_estimators = 500

#random_forest_1 = RandomForestClassifier(random_state = 123,
#                                         n_estimators = 500)
#random_forest_1.fit(x_train_clean, y_train)

### **3. Prediction**

- Time to make predictions

In [67]:
# Logistic Regression predictions
loreg.predict(x_train_clean)

array(['no', 'no', 'no', ..., 'no', 'no', 'yes'],
      shape=(33908,), dtype=object)

In [68]:
predicted_loreg = pd.DataFrame(loreg.predict(x_train_clean))
predicted_loreg

Unnamed: 0,0
0,no
1,no
2,no
3,no
4,no
...,...
33903,no
33904,no
33905,no
33906,no


In [69]:
#predicted_knn = pd.DataFrame(knn.predict(x_train_clean))
#predicted_knn.head()

In [70]:
predicted_rf = pd.DataFrame(random_forest.predict(x_train_clean))
predicted_rf.head()

Unnamed: 0,0
0,no
1,no
2,no
3,no
4,no


In [71]:
#predicted_rf_1 = pd.DataFrame(random_forest_1.predict(x_train_clean))
#predicted_rf_1.head()

### **4. Check model performance on the training data**

In [72]:
# .iloc[0] selects the first (largest) proportion by position
benchmark = y_train.value_counts(normalize=True).iloc[0]
benchmark

np.float64(0.8826235696590775)

In [73]:
# KNN accuracy
#knn.score(x_train_clean, y_train)

In [74]:
# Logistic Regression accuracy
loreg.score(x_train_clean, y_train)

0.900554441429751

In [75]:
# Random Forest accuracy
random_forest.score(x_train_clean, y_train)

1.0
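
A score of 1.0 on the training data only shows that the model fits rows it has already seen. The cross-validation mentioned in the train-test split section gives a less optimistic estimate from the training data alone. A minimal sketch on synthetic data (`toy_x`, `toy_y` and `toy_forest` are hypothetical and not part of the bank dataset):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic, imbalanced stand-in for the cleaned training data.
toy_x, toy_y = make_classification(n_samples=300, n_features=8,
                                   weights=[0.88, 0.12], random_state=123)

toy_forest = RandomForestClassifier(n_estimators=50, random_state=123)

# 5-fold CV: each fold is scored by a model trained on the other four folds.
cv_scores = cross_val_score(toy_forest, toy_x, toy_y, cv=5, scoring="accuracy")
print("CV accuracy per fold :", cv_scores)
print("Mean CV accuracy     :", cv_scores.mean())

# Accuracy on the same rows the model was trained on, for comparison.
train_score = toy_forest.fit(toy_x, toy_y).score(toy_x, toy_y)
print("Training accuracy    :", train_score)
```

The mean CV accuracy is the number to compare against the benchmark when choosing between models, before the test data is touched.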

In [76]:
# Random Forest 1 accuracy
#random_forest_1.score(x_train_clean, y_train)

### **5. Save the models to pickle files**

In [77]:
import joblib

# save the loreg model in the same folder as the notebook
# under the name loreg.pkl
joblib.dump(loreg, 'loreg.pkl')

# joblib.dump(knn, "knn.pkl")
joblib.dump(random_forest, "random_forest.pkl")
# joblib.dump(random_forest_1, "random_forest_1.pkl")

['random_forest.pkl']
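
A saved model is read back with `joblib.load`. A minimal round-trip sketch on synthetic data (`toy_model` and the temporary file name are hypothetical, chosen so the sketch does not overwrite the files saved above):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data.
toy_x, toy_y = make_classification(n_samples=100, n_features=5, random_state=123)
toy_model = LogisticRegression(random_state=123).fit(toy_x, toy_y)

# Save to a temporary folder, then load the model back from disk.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, model_path)
    restored_model = joblib.load(model_path)

# The restored model gives the same predictions as the original.
same_predictions = (restored_model.predict(toy_x) == toy_model.predict(toy_x)).all()
print(same_predictions)
```

Pickle files should be loaded with the same scikit-learn version that saved them, and only from sources you trust.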

### **6. Test Prediction**

1. Prepare the test dataset
2. Apply the same preprocessing that was applied to the training dataset
3. Use the `imputer_numerical` and `standardizer` that were fitted on the training dataset

In [78]:
def extractTest(data,
                numeric_columns, categorical_columns, ohe_columns,
                imputer_numerical, standardizer):
    """
    Extract and clean the test data
    :param data: <pandas dataframe> test sample data
    :param numeric_columns: <list> numeric columns
    :param categorical_columns: <list> categorical columns
    :param ohe_columns: <list> one-hot-encoded columns from the training data
    :param imputer_numerical: <sklearn method> fitted numeric imputer
    :param standardizer: <sklearn method> fitted standardizer
    :return cleaned_data: <pandas dataframe> final data
    """
    # select the data
    numerical_data = data[numeric_columns]
    categorical_data = data[categorical_columns]

    # process the numeric data
    numerical_data = pd.DataFrame(imputer_numerical.transform(numerical_data))
    numerical_data.columns = numeric_columns
    numerical_data.index = data.index

    # process the categorical data
    categorical_data = categorical_data.fillna(value="KOSONG")
    categorical_data.index = data.index
    categorical_data = pd.get_dummies(categorical_data)
    # reindex returns a new dataframe, so assign the result;
    # OHE columns absent from the test data are filled with 0
    categorical_data = categorical_data.reindex(index = categorical_data.index,
                                                columns = ohe_columns,
                                                fill_value = 0)

    # join the data
    concat_data  = pd.concat([numerical_data, categorical_data],
                             axis=1)
    cleaned_data = pd.DataFrame(standardizer.transform(concat_data))
    cleaned_data.columns = concat_data.columns
    cleaned_data.index = concat_data.index  # keep the index aligned with y_test

    return cleaned_data

In [79]:
def testPrediction(x_test, y_test, classifier, compute_score):
    """
    Get predictions from a model
    :param x_test: <pandas dataframe> input
    :param y_test: <pandas series> output/target
    :param classifier: <sklearn method> classification model
    :param compute_score: <bool> True: compute and print the score, False: skip it
    :return test_predict: <list> predictions for the input data
    :return score: <float> model accuracy (None when compute_score is False)
    """
    score = None  # defined up front so the return works when compute_score is False
    if compute_score:
        score = classifier.score(x_test, y_test)
        print(f"Accuracy : {score:.4f}")

    test_predict = classifier.predict(x_test)

    return test_predict, score

In [80]:
x_test_clean = extractTest(data = x_test,
                           numeric_columns = numeric_columns,
                           categorical_columns = categorical_columns,
                           ohe_columns = ohe_columns,
                           imputer_numerical = imputer_numerical,
                           standardizer = standardizer)

In [81]:
x_test_clean.shape

(11303, 60)

In [82]:
loreg_test_predict, score = testPrediction(x_test= x_test_clean,
                                           y_test= y_test,
                                           classifier= loreg, 
                                           compute_score= True)

Accuracy : 0.9010


In [None]:
# K-nearest neighbor performance
#knn_test_predict, score = testPrediction(x_test = x_test_clean,
#                                         y_test = y_test,
#                                         classifier = knn,
#                                         compute_score = True)

In [None]:
rf_test_predict, score = testPrediction(x_test= x_test_clean, 
                                        y_test= y_test,
                                        classifier= random_forest, 
                                        compute_score= True)

Accuracy : 0.8994


In [None]:
# Random Forest 1 performance
# (commented out because random_forest_1 is not trained above)
#rf_1_test_predict, score = testPrediction(x_test = x_test_clean,
#                                          y_test = y_test,
#                                          classifier = random_forest_1,
#                                          compute_score = True)
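
The rule followed throughout this notebook, fit every preprocessing step on the training data and only apply it to the test data, can also be bundled into a single object. The sketch below is an alternative to the hand-written helper functions, not the approach used above: it assumes scikit-learn's `Pipeline` and `ColumnTransformer`, uses hypothetical toy data, and standardizes only the numeric column, whereas the notebook also standardizes the OHE columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with missing values in both column types.
toy_train = pd.DataFrame({
    "age": [25.0, np.nan, 47.0, 52.0, 33.0, 61.0],
    "contact": ["cellular", "telephone", np.nan, "cellular", "unknown", "telephone"],
})
toy_y_train = pd.Series(["no", "no", "yes", "no", "yes", "no"])
toy_test = pd.DataFrame({"age": [np.nan, 40.0], "contact": ["cellular", np.nan]})

# Numeric branch: median imputation, then standardization.
numeric_steps = Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())])

# Categorical branch: fill with "KOSONG", then one-hot encode.
# handle_unknown="ignore" encodes categories unseen in training as all zeros.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="KOSONG")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([("num", numeric_steps, ["age"]),
                                ("cat", categorical_steps, ["contact"])])

toy_pipeline = Pipeline([("preprocess", preprocess),
                         ("model", LogisticRegression(random_state=123))])

toy_pipeline.fit(toy_train, toy_y_train)        # fits every step on training data
toy_predictions = toy_pipeline.predict(toy_test)  # reuses the fitted steps
print(toy_predictions)
```

Because the fitted imputers, encoder, scaler and model live in one object, a single `joblib.dump(toy_pipeline, ...)` would save the whole workflow.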