Machine Learning Week 3

Machine Learning Workflow (Simplified)

1. Importing Data to Python

* Drop Duplicates 

2. Data Preprocessing:

* Input-Output Split, Train-Test Split
* Imputation, Processing Categorical, Normalization 

3. Training Machine Learning:

* Choose Score to optimize and Hyperparameter Space

Client Credit Card Analysis

    Task : Classification
    Objective : Predict churn

Data description:

"Churn Rate" is a business term describing the rate at which customers leave or cease paying for a product or service. It's a critical figure in many businesses, as it's often the case that acquiring new customers is a lot more costly than retaining existing ones (in some cases, 5 to 20 times more expensive).

Predicting churn is particularly important for businesses with subscription models such as cell phone, cable, or merchant credit card processing plans.

Features: there are 21 variables (20 candidate inputs and 1 output):

    State
    Account Length
    Area Code
    Phone
    Int'l Plan
    VMail Plan
    VMail Message
    Day Mins
    Day Calls
    Day Charge
    Eve Mins
    Eve Calls
    Eve Charge
    Night Mins
    Night Calls
    Night Charge
    Intl Mins
    Intl Calls
    Intl Charge
    CustServ Calls
    Churn? : Output



In [1]:
# Import the library for working with tabular data structures
import pandas as pd

# Import the library for numerical computing
import numpy as np

In [2]:
# Load the data
# Store it under the name data
data = pd.read_csv("dataset/w3-3-churn.csv")

In [3]:
# Show the first five rows
data.head()

Unnamed: 0,State,Account Length,Area Code,Phone,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


Check the number of observations

In [4]:
data.shape

# Output
# (number of observations, number of columns/features)

(3333, 21)

Check for and drop duplicate rows

    Check with .duplicated()



In [5]:
# Check for duplicate rows
duplicate_status = data.duplicated()
duplicate_status

0       False
1       False
2       False
3       False
4       False
        ...  
3328    False
3329    False
3330    False
3331    False
3332    False
Length: 3333, dtype: bool

In [6]:
# Count the duplicate rows
duplicate_status.sum()

# False counts as 0 --> the row is not a duplicate
# True counts as 1 --> the row is a duplicate
# If any duplicates exist, the sum is > 0

np.int64(0)

In [7]:
data = data.drop_duplicates()

# Nothing is dropped because there are no duplicates

In [8]:
data.shape

# Always sanity check!
# Re-check the number of observations

(3333, 21)
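
Because the churn data has no duplicates, the cells above never show a row being removed. As a minimal sketch of what .duplicated() and .drop_duplicates() do when duplicates are present, here is a made-up frame (the values are illustrative and not taken from the dataset):

```python
import pandas as pd

# Toy data (hypothetical): the third row repeats the first one exactly
toy = pd.DataFrame({"State": ["KS", "OH", "KS"],
                    "Day Mins": [265.1, 161.6, 265.1]})

# duplicated() flags every row that repeats an earlier row
duplicate_status = toy.duplicated()
n_duplicates = duplicate_status.sum()

# drop_duplicates() keeps the first occurrence and removes the repeats
toy_clean = toy.drop_duplicates()

print(n_duplicates)     # number of flagged rows
print(toy_clean.shape)  # shape after dropping
```

Only rows that match on every column count as duplicates here; a row that differs in a single value is kept.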

Wrap everything in a function

   1. Import the data
   2. Check the number of observations and columns
   3. Drop duplicates
   4. Check the number of observations and columns after dropping
   5. Return the deduplicated data



In [10]:
# We want a function that runs the following commands
data = pd.read_csv("dataset/w3-3-churn.csv")
print("Original data      : ", data.shape, "- (#observations, #columns)")

data = data.drop_duplicates()
print("Data after dropping: ", data.shape, "- (#observations, #columns)")

Original data      :  (3333, 21) - (#observations, #columns)
Data after dropping:  (3333, 21) - (#observations, #columns)


In [11]:
def importData(filename, dropped_column):
    """
    Import the data and remove duplicates
    :param filename: <string> input file name (.csv format)
    :param dropped_column: <string> name of the feature to drop
    :return df: <pandas dataframe> sample data
    """

    # read data
    df = pd.read_csv(filename)
    print("Original data      : ", df.shape, "- (#observations, #columns)")

    # drop column
    df = df.drop(dropped_column, axis=1)

    # drop duplicates
    df = df.drop_duplicates()
    print("Data after dropping: ", df.shape, "- (#observations, #columns)")

    return df


In [12]:
# input
file_credit = "dataset/w3-3-churn.csv"

# call the function
data = importData(filename = file_credit,
                  dropped_column = "Phone")

Original data      :  (3333, 21) - (#observations, #columns)
Data after dropping:  (3333, 20) - (#observations, #columns)


In [13]:
data.head()

Unnamed: 0,State,Account Length,Area Code,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


 2. Data Preprocessing:

* Input-Output Split, Train-Test Split
* Processing Categorical
* Imputation, Normalization, Drop Duplicates

Input-Output Split

    The "Churn?" column is the output variable
    All other columns are inputs

Create the input & output data


In [14]:
data.head()

Unnamed: 0,State,Account Length,Area Code,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn?
0,KS,128,415,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False.
1,OH,107,415,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False.
2,NJ,137,415,no,no,0,243.4,114,41.38,121.2,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False.
3,OH,84,408,yes,no,0,299.4,71,50.9,61.9,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False.
4,OK,75,415,yes,no,0,166.7,113,28.34,148.3,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False.


In [15]:
def extractInputOutput(data,
                       output_column_name):
    """
    Split the data into input and output
    :param data: <pandas dataframe> all samples
    :param output_column_name: <string or list> name(s) of the output column
    :return input_data: <pandas dataframe> input data
    :return output_data: <pandas series, or dataframe if a list is passed> output data
    """
    # build the output
    output_data = data[output_column_name]
    
    # build the input
    input_data = data.drop(output_column_name,
                           axis = 1)
    
    return input_data, output_data


In [16]:
# Make sure the returned values are unpacked in the right order
output_column_name = ["Churn?"]

X, y = extractInputOutput(data = data,
                          output_column_name = output_column_name)

Always sanity check!

In [17]:
X.head(2)

Unnamed: 0,State,Account Length,Area Code,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
0,KS,128,415,no,yes,25,265.1,110,45.07,197.4,99,16.78,244.7,91,11.01,10.0,3,2.7,1
1,OH,107,415,no,yes,26,161.6,123,27.47,195.5,103,16.62,254.4,103,11.45,13.7,3,3.7,1


In [18]:
y.head(2)

Unnamed: 0,Churn?
0,False.
1,False.
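
Note that y above is displayed as a one-column table because the column name was passed inside a list. A minimal sketch of the difference, on a made-up frame (the values are illustrative):

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"Day Mins": [265.1, 161.6, 243.4],
                    "Churn?": ["False.", "False.", "True."]})

# A list of names returns a DataFrame with one column
y_frame = toy[["Churn?"]]

# A single name returns a Series
y_series = toy["Churn?"]

print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```

This is why the shapes reported later for y_train and y_test have a second dimension of 1.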


Train-Test Split

    Why?
        Because we don't want to overfit the training data
        The test data stands in for future (unseen) data
        We train the ML model on the training data, with CV (cross-validation)
        Then we evaluate it on the test data



In [19]:
# Import the train-test splitting function from sklearn (scikit-learn)
from sklearn.model_selection import train_test_split

Train Test Split Function

1.    X is the input
2.    y is the output (target)
3.    test_size is the proportion of the whole dataset assigned to the test set. For example, test_size = 0.25 means the test set holds 25% of the data.
4.    random_state is the seed for the random shuffling. Use the same value to get the same split on every run, e.g. random_state = 123.
5.    Output:
    *    X_train = input of the training data
    *    X_test = input of the test data
    *    y_train = output of the training data
    *    y_test = output of the test data
6.    Output order: X_train, X_test, y_train, y_test. Do not swap them.

    Read more: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
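
As a small sketch of how test_size and random_state behave, the following uses made-up arrays (X_toy and y_toy are illustrative names, not part of the notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 observations, 2 features
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.arange(20)

# 25% of 20 observations go to the test set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy,
                                          test_size=0.25,
                                          random_state=123)

# Repeating the call with the same random_state gives the same split
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy,
                                              test_size=0.25,
                                              random_state=123)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```

With a different random_state the rows assigned to each set would generally differ, which makes results hard to compare between runs.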



In [20]:
# Train test split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.25,
                                                    random_state = 123)

In [21]:
# Sanity check the split results
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

(2499, 19)
(2499, 1)
(834, 19)
(834, 1)


In [22]:
# Ratio
X_test.shape[0] / X.shape[0]

# Result is about 0.25 - matches our test_size

0.2502250225022502

Congratulations! - You now have train & test data

    From here on, focus only on the training data



Data Imputation

    The process of filling in missing values (NaN)
    Two cases to consider:
        Numerical Imputation
        Categorical Imputation

Check for missing values in the input variables


In [23]:
X_train.isnull().sum()

# No missing values

State             0
Account Length    0
Area Code         0
Int'l Plan        0
VMail Plan        0
VMail Message     0
Day Mins          0
Day Calls         0
Day Charge        0
Eve Mins          0
Eve Calls         0
Eve Charge        0
Night Mins        0
Night Calls       0
Night Charge      0
Intl Mins         0
Intl Calls        0
Intl Charge       0
CustServ Calls    0
dtype: int64
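
This dataset has no missing values, so no imputation is actually performed in this notebook. For a dataset that does have gaps, one possible approach is scikit-learn's SimpleImputer, fitted on the training data only. The sketch below uses made-up frames; the choice of median for numbers and most frequent value for categories is an assumption, not something prescribed by the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical) with missing values
train = pd.DataFrame({"mins": [10.0, np.nan, 30.0, 20.0],
                      "plan": ["yes", "no", np.nan, "no"]})
test = pd.DataFrame({"mins": [np.nan, 5.0],
                     "plan": [np.nan, "yes"]})

# Numerical imputation: fill with the training median
num_imputer = SimpleImputer(strategy="median")
num_imputer.fit(train[["mins"]])

# Categorical imputation: fill with the most frequent training category
cat_imputer = SimpleImputer(strategy="most_frequent")
cat_imputer.fit(train[["plan"]])

# Apply the fitted imputers to the test data (no re-fitting)
test_filled = test.copy()
test_filled["mins"] = num_imputer.transform(test[["mins"]]).ravel()
test_filled["plan"] = cat_imputer.transform(test[["plan"]]).ravel()

print(test_filled)
```

Fitting on the training data and only transforming the test data keeps information from the test set out of the preprocessing.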

Separate the categorical & numerical data

In [24]:
X_train.head()

Unnamed: 0,State,Account Length,Area Code,Int'l Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
1066,KS,117,510,no,yes,25,216.0,140,36.72,224.1,69,19.05,267.9,112,12.06,11.8,4,3.19,0
1553,CO,86,415,no,no,0,217.8,93,37.03,214.7,95,18.25,228.7,70,10.29,11.3,7,3.05,0
2628,TN,37,415,no,no,0,221.0,126,37.57,204.5,110,17.38,118.0,98,5.31,6.8,3,1.84,4
882,FL,130,415,no,no,0,162.8,113,27.68,290.3,111,24.68,114.9,140,5.17,7.2,3,1.94,1
984,NV,77,415,no,no,0,142.3,112,24.19,306.3,111,26.04,196.5,82,8.84,9.9,1,2.67,1


In [25]:
X_train.columns

Index(['State', 'Account Length', 'Area Code', "Int'l Plan", 'VMail Plan',
       'VMail Message', 'Day Mins', 'Day Calls', 'Day Charge', 'Eve Mins',
       'Eve Calls', 'Eve Charge', 'Night Mins', 'Night Calls', 'Night Charge',
       'Intl Mins', 'Intl Calls', 'Intl Charge', 'CustServ Calls'],
      dtype='object')

In [26]:
# _get_numeric_data() only picks up columns holding integers and floats
# watch out for categorical data stored as integers!!
X_train_numerical = X_train._get_numeric_data() 
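
The leading underscore marks _get_numeric_data() as a private pandas method. A public alternative is select_dtypes; the sketch below shows it on a made-up frame, under the assumption that every numeric-dtype column should be picked up first and integer-coded categories removed afterwards:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"State": ["KS", "OH"],
                    "Area Code": [415, 408],
                    "Day Mins": [265.1, 161.6]})

# Keep only columns with a numeric dtype
numeric_part = toy.select_dtypes(include="number")
print(list(numeric_part.columns))

# Area Code is stored as an integer but is really a category label,
# so it has to be removed by hand
numeric_part = numeric_part.drop(columns=["Area Code"])
print(list(numeric_part.columns))
```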

In [27]:
X_train_numerical.head()

Unnamed: 0,Account Length,Area Code,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
1066,117,510,25,216.0,140,36.72,224.1,69,19.05,267.9,112,12.06,11.8,4,3.19,0
1553,86,415,0,217.8,93,37.03,214.7,95,18.25,228.7,70,10.29,11.3,7,3.05,0
2628,37,415,0,221.0,126,37.57,204.5,110,17.38,118.0,98,5.31,6.8,3,1.84,4
882,130,415,0,162.8,113,27.68,290.3,111,24.68,114.9,140,5.17,7.2,3,1.94,1
984,77,415,0,142.3,112,24.19,306.3,111,26.04,196.5,82,8.84,9.9,1,2.67,1


In [28]:
# drop unexpected numerical column if any
num_categorical = ["Area Code"]
X_train_numerical = X_train_numerical.drop(num_categorical, axis = 1)
X_train_numerical.head()

Unnamed: 0,Account Length,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
1066,117,25,216.0,140,36.72,224.1,69,19.05,267.9,112,12.06,11.8,4,3.19,0
1553,86,0,217.8,93,37.03,214.7,95,18.25,228.7,70,10.29,11.3,7,3.05,0
2628,37,0,221.0,126,37.57,204.5,110,17.38,118.0,98,5.31,6.8,3,1.84,4
882,130,0,162.8,113,27.68,290.3,111,24.68,114.9,140,5.17,7.2,3,1.94,1
984,77,0,142.3,112,24.19,306.3,111,26.04,196.5,82,8.84,9.9,1,2.67,1


Check whether any numerical data is missing

In [29]:
X_train_numerical.isnull().any()

# No numerical variable has missing values (all False)

Account Length    False
VMail Message     False
Day Mins          False
Day Calls         False
Day Charge        False
Eve Mins          False
Eve Calls         False
Eve Charge        False
Night Mins        False
Night Calls       False
Night Charge      False
Intl Mins         False
Intl Calls        False
Intl Charge       False
CustServ Calls    False
dtype: bool

In [30]:
numerical_column = list(X_train_numerical.columns.values)
numerical_column

['Account Length',
 'VMail Message',
 'Day Mins',
 'Day Calls',
 'Day Charge',
 'Eve Mins',
 'Eve Calls',
 'Eve Charge',
 'Night Mins',
 'Night Calls',
 'Night Charge',
 'Intl Mins',
 'Intl Calls',
 'Intl Charge',
 'CustServ Calls']

Categorical Imputation

In [31]:
X_train_categorical = X_train.drop(list(X_train_numerical.columns.values), axis=1)

In [32]:
X_train_categorical.head()

Unnamed: 0,State,Area Code,Int'l Plan,VMail Plan
1066,KS,510,no,yes
1553,CO,415,no,no
2628,TN,415,no,no
882,FL,415,no,no
984,NV,415,no,no


In [33]:
categorical_column = list(X_train_categorical.columns.values)

In [34]:
X_train_categorical.isnull().any()

# No categorical variable has missing values (all False)

State         False
Area Code     False
Int'l Plan    False
VMail Plan    False
dtype: bool

Preprocessing Categorical Variables

    We cannot feed categorical data to the model unless it is converted to numerical form
    Solution: One Hot Encoding (OHE)



In [35]:
categorical_ohe = pd.get_dummies(X_train_categorical)

In [36]:
categorical_ohe.head(2)

Unnamed: 0,Area Code,State_AK,State_AL,State_AR,State_AZ,State_CA,State_CO,State_CT,State_DC,State_DE,...,State_VA,State_VT,State_WA,State_WI,State_WV,State_WY,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
1066,510,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,True
1553,415,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,True,False
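
In the output above, Area Code is still a single numeric column: by default pd.get_dummies only encodes string-like columns and passes integers through. If Area Code is meant to be treated as a category, one option is to name the columns explicitly. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"Area Code": [415, 510, 415],
                    "Int'l Plan": ["no", "yes", "no"]})

# Default behaviour: the integer Area Code column is left unchanged
default_ohe = pd.get_dummies(toy)

# Naming the columns explicitly encodes them regardless of dtype
explicit_ohe = pd.get_dummies(toy, columns=["Area Code", "Int'l Plan"])

print(list(default_ohe.columns))
print(list(explicit_ohe.columns))
```

The rest of this notebook keeps the default behaviour, so the reported shapes (71 columns) correspond to Area Code being left as a number.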


Let's turn this into a function

In [37]:
def extractCategorical(data, categorical_column):
    """
    Extract the categorical data with One Hot Encoding
    :param data: <pandas dataframe> sample data
    :param categorical_column: <list> list of categorical columns
    :return categorical_ohe: <pandas dataframe> sample data with OHE applied
    """
    data_categorical = data[categorical_column]
    categorical_ohe = pd.get_dummies(data_categorical)

    return categorical_ohe

In [38]:
X_train_categorical_ohe = extractCategorical(data = X_train,
                                             categorical_column = categorical_column)

In [39]:
X_train_categorical_ohe.head()

Unnamed: 0,Area Code,State_AK,State_AL,State_AR,State_AZ,State_CA,State_CO,State_CT,State_DC,State_DE,...,State_VA,State_VT,State_WA,State_WI,State_WV,State_WY,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
1066,510,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,True
1553,415,False,False,False,False,False,True,False,False,False,...,False,False,False,False,False,False,True,False,True,False
2628,415,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,True,False
882,415,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,True,False
984,415,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,True,False


In [40]:
# Save the OHE columns so they can be applied to the test data
# This keeps the shape consistent
ohe_columns = X_train_categorical_ohe.columns

In [41]:
ohe_columns

Index(['Area Code', 'State_AK', 'State_AL', 'State_AR', 'State_AZ', 'State_CA',
       'State_CO', 'State_CT', 'State_DC', 'State_DE', 'State_FL', 'State_GA',
       'State_HI', 'State_IA', 'State_ID', 'State_IL', 'State_IN', 'State_KS',
       'State_KY', 'State_LA', 'State_MA', 'State_MD', 'State_ME', 'State_MI',
       'State_MN', 'State_MO', 'State_MS', 'State_MT', 'State_NC', 'State_ND',
       'State_NE', 'State_NH', 'State_NJ', 'State_NM', 'State_NV', 'State_NY',
       'State_OH', 'State_OK', 'State_OR', 'State_PA', 'State_RI', 'State_SC',
       'State_SD', 'State_TN', 'State_TX', 'State_UT', 'State_VA', 'State_VT',
       'State_WA', 'State_WI', 'State_WV', 'State_WY', "Int'l Plan_no",
       "Int'l Plan_yes", 'VMail Plan_no', 'VMail Plan_yes'],
      dtype='object')
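
Saving the training columns matters when the test data does not contain exactly the same categories. The sketch below, on made-up frames, shows one way to align a test frame to the saved columns with reindex; filling absent categories with 0 is an assumption about the desired behaviour:

```python
import pandas as pd

# Toy data (hypothetical): the test set lacks KS and NJ and has an unseen TX
train_cat = pd.DataFrame({"State": ["KS", "OH", "NJ"]})
test_cat = pd.DataFrame({"State": ["OH", "TX"]})

# Columns learned from the training data
train_ohe = pd.get_dummies(train_cat)
ohe_columns = train_ohe.columns

# Encode the test data, then force it onto the training columns:
# missing categories become 0, unseen categories are dropped
test_ohe = pd.get_dummies(test_cat)
test_aligned = test_ohe.reindex(columns=ohe_columns, fill_value=0)

print(list(test_ohe.columns))
print(list(test_aligned.columns))
```

reindex returns a new frame rather than modifying the original, so its result has to be assigned to a variable.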

Join the Numerical and Categorical data

    The numerical & categorical data must be combined again
    Combine them with pd.concat



In [42]:
X_train_concat = pd.concat([X_train_numerical,
                            X_train_categorical_ohe],
                           axis = 1)

In [43]:
X_train_concat.head()

Unnamed: 0,Account Length,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,...,State_VA,State_VT,State_WA,State_WI,State_WV,State_WY,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
1066,117,25,216.0,140,36.72,224.1,69,19.05,267.9,112,...,False,False,False,False,False,False,True,False,False,True
1553,86,0,217.8,93,37.03,214.7,95,18.25,228.7,70,...,False,False,False,False,False,False,True,False,True,False
2628,37,0,221.0,126,37.57,204.5,110,17.38,118.0,98,...,False,False,False,False,False,False,True,False,True,False
882,130,0,162.8,113,27.68,290.3,111,24.68,114.9,140,...,False,False,False,False,False,False,True,False,True,False
984,77,0,142.3,112,24.19,306.3,111,26.04,196.5,82,...,False,False,False,False,False,False,True,False,True,False


In [44]:
X_train_concat.shape

(2499, 71)

In [45]:
X_train_concat.isnull().sum()

Account Length    0
VMail Message     0
Day Mins          0
Day Calls         0
Day Charge        0
                 ..
State_WY          0
Int'l Plan_no     0
Int'l Plan_yes    0
VMail Plan_no     0
VMail Plan_yes    0
Length: 71, dtype: int64

Standardizing Variables

    * Puts the input variables on the same scale
    * fit: the standardizer learns the mean and standard deviation of each column
    * transform: replaces the data with the standardized values
    * the function wraps the output of transform in a pandas dataframe
    * the standardizer is returned because it will be reused on the test data
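
As a small numerical illustration of fit and transform, the sketch below standardizes a made-up column and checks that transform applies z = (x - mean) / std using the statistics learned from the training data (the array names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one feature
train_toy = np.array([[1.0], [2.0], [3.0]])
test_toy = np.array([[4.0]])

# fit learns the mean and standard deviation from the training data
scaler = StandardScaler()
scaler.fit(train_toy)

# transform applies the learned statistics to any data
train_scaled = scaler.transform(train_toy)
test_scaled = scaler.transform(test_toy)

# Same computation by hand, using the fitted attributes
manual = (test_toy - scaler.mean_) / scaler.scale_

print(train_scaled.mean(), train_scaled.std())
print(np.allclose(test_scaled, manual))
```

The test value is scaled with the training mean and standard deviation, which is why the fitted scaler has to be kept.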



In [46]:
from sklearn.preprocessing import StandardScaler

# Build the function
def standardizerData(data):
    """
    Standardize the data
    :param data: <pandas dataframe> sample data
    :return standardized_data: <pandas dataframe> standardized sample data
    :return standardizer: fitted scaler used to standardize data
    """
    data_columns = data.columns  # keep the column names
    data_index = data.index  # keep the index

    # create and fit the standardizer
    standardizer = StandardScaler()
    standardizer.fit(data)

    # transform data
    standardized_data_raw = standardizer.transform(data)
    standardized_data = pd.DataFrame(standardized_data_raw)
    standardized_data.columns = data_columns
    standardized_data.index = data_index

    return standardized_data, standardizer

In [47]:
X_train_clean, standardizer = standardizerData(data = X_train_concat)

In [48]:
X_train_clean.head()

Unnamed: 0,Account Length,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,...,State_VA,State_VT,State_WA,State_WI,State_WV,State_WY,Int'l Plan_no,Int'l Plan_yes,VMail Plan_no,VMail Plan_yes
1066,0.422132,1.249874,0.665609,1.957851,0.665555,0.451895,-1.551391,0.452184,1.316807,0.612775,...,-0.138449,-0.147201,-0.124261,-0.152779,-0.175899,-0.1555,0.325947,-0.325947,-1.633892,1.633892
1553,-0.360628,-0.586231,0.698711,-0.377054,0.699089,0.265705,-0.244133,0.265761,0.543769,-1.551279,...,-0.138449,-0.147201,-0.124261,-0.152779,-0.175899,-0.1555,0.325947,-0.325947,0.612036,-0.612036
2628,-1.597894,-0.586231,0.757558,1.262347,0.757504,0.063668,0.510055,0.063026,-1.639273,-0.108576,...,-0.138449,-0.147201,-0.124261,-0.152779,-0.175899,-0.1555,0.325947,-0.325947,0.612036,-0.612036
882,0.750387,-0.586231,-0.312731,0.616522,-0.312356,1.76315,0.560334,1.764138,-1.700406,2.055478,...,-0.138449,-0.147201,-0.124261,-0.152779,-0.175899,-0.1555,0.325947,-0.325947,0.612036,-0.612036
984,-0.587881,-0.586231,-0.689724,0.566844,-0.68989,2.08007,0.560334,2.081057,-0.091226,-0.932977,...,-0.138449,-0.147201,-0.124261,-0.152779,-0.175899,-0.1555,0.325947,-0.325947,0.612036,-0.612036


 3. Training Machine Learning:

* Choose Score to optimize and Hyperparameter Space
* Cross-Validation: Random vs Grid Search CV
* We have to beat the benchmark

Benchmark / Baseline

    A baseline for later evaluation
    Because this is classification, we can take it from the proportion of the largest target class
    In other words, predict "False." (no churn) for every customer without any modeling



In [49]:
y_train.value_counts(normalize = True)

# baseline accuracy = 85%

Churn?
False.    0.85114
True.     0.14886
Name: proportion, dtype: float64
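
The same majority-class baseline can also be expressed as a model. The sketch below uses scikit-learn's DummyClassifier on made-up labels with an 85/15 class split; the notebook itself does not use this class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels (hypothetical): 85 non-churners and 15 churners
y_toy = np.array(["False."] * 85 + ["True."] * 15)

# The features are ignored by this strategy, so zeros are enough
X_toy = np.zeros((100, 1))

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_toy, y_toy)

baseline_accuracy = baseline.score(X_toy, y_toy)
print(baseline_accuracy)
```

Any trained model should score above this value to be worth using.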

1. Import Model

    Say we use one ML model for classification:
        Decision Tree



In [50]:
# Import from sklearn
from sklearn.tree import DecisionTreeClassifier

2. Fitting Model

    Fitting/training the model follows the model's documentation



In [51]:
# Model Decision Tree
decTree = DecisionTreeClassifier(random_state = 123)

In [52]:
# Fitting model
decTree.fit(X_train_clean, y_train)

In [53]:
# Model score (accuracy on the training data)
decTree.score(X_train_clean, y_train)

1.0
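
A training accuracy of 1.0 from an unrestricted tree usually indicates memorization rather than real skill. A minimal sketch of comparing the training score with a cross-validated score, on synthetic data generated with make_classification (not the churn data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical) with some label noise
X_syn, y_syn = make_classification(n_samples=500, n_features=10,
                                   flip_y=0.1, random_state=123)

# Score on the same data the tree was trained on
tree = DecisionTreeClassifier(random_state=123)
tree.fit(X_syn, y_syn)
train_score = tree.score(X_syn, y_syn)

# Score on held-out folds via 5-fold cross-validation
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=123),
                            X_syn, y_syn, cv=5, scoring="accuracy")

print(train_score)
print(cv_scores.mean())
```

The gap between the two numbers is a rough measure of how much the tree overfits, which motivates limiting max_depth in the experiments that follow.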

Run experiments

In [54]:
# Import cross-validation
from sklearn.model_selection import GridSearchCV

In [55]:
# Define the parameters for the experiment
decTree_param = {"max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]}

#decTree_param = {"max_depth": [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12],
#                 "criterion": ["gini", "entropy", "log_loss"]}

In [56]:
# Set up the experiment plan
random_decTree = GridSearchCV(estimator = DecisionTreeClassifier(random_state=123),
                              param_grid = decTree_param,
                              cv = 5,
                              scoring = "accuracy") 

In [57]:
# Fit the experiment
random_decTree.fit(X_train_clean, y_train)

In [58]:
# Evaluate the model (accuracy of the refit best model on the training data)
random_decTree.score(X_train_clean, y_train)

0.961984793917567

In [59]:
# Best parameters
random_decTree.best_params_

{'max_depth': 6}
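
The score computed above is measured on the training data, so it is not the cross-validated accuracy that the search used to pick max_depth. That value is stored in best_score_, and the per-candidate means are in cv_results_. A minimal sketch on synthetic data (the grid and data are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (hypothetical)
X_syn, y_syn = make_classification(n_samples=300, n_features=8,
                                   random_state=123)

search = GridSearchCV(estimator=DecisionTreeClassifier(random_state=123),
                      param_grid={"max_depth": [2, 4, 6]},
                      cv=5,
                      scoring="accuracy")
search.fit(X_syn, y_syn)

# Mean cross-validated accuracy of each candidate, and of the winner
mean_scores = search.cv_results_["mean_test_score"]
print(mean_scores)
print(search.best_params_, search.best_score_)

# With the default refit=True, the winner is already refit on all the data
refit_depth = search.best_estimator_.get_params()["max_depth"]
print(refit_depth)
```

Because of the default refit, best_estimator_ can be used directly for prediction; building and fitting a new model with the best parameters, as done in the next cells, gives an equivalent tree.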

Build a model with the best parameters & use all the training data

In [60]:
# Build the model
best_decTree = DecisionTreeClassifier(max_depth = random_decTree.best_params_["max_depth"],
                                      random_state = 123)

In [61]:
# Fit model
best_decTree.fit(X_train_clean, y_train)

3. Test Prediction

    1. Prepare the test dataset
    2. Apply the same preprocessing that was applied to the train dataset
    3. Use the OHE columns and standardizer that were fitted on the train dataset
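
An alternative way to guarantee that train and test receive identical preprocessing is to bundle the steps into a scikit-learn Pipeline. This swaps in ColumnTransformer and OneHotEncoder, which the notebook does not use, so treat the sketch below as one possible design on made-up data rather than the method followed here:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical values)
train = pd.DataFrame({
    "Day Mins": [100.0, 250.0, 120.0, 300.0, 90.0, 280.0],
    "Int'l Plan": ["no", "yes", "no", "yes", "no", "no"],
})
y_toy = ["False.", "True.", "False.", "True.", "False.", "True."]

# The test data contains a category never seen in training
test = pd.DataFrame({"Day Mins": [110.0, 290.0],
                     "Int'l Plan": ["no", "maybe"]})

# Scale numeric columns, one-hot encode categorical ones;
# unseen categories are encoded as all zeros instead of raising an error
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Day Mins"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Int'l Plan"]),
])

model = Pipeline([
    ("prep", preprocess),
    ("tree", DecisionTreeClassifier(max_depth=2, random_state=123)),
])

# fit learns the preprocessing and the tree from the training data only
model.fit(train, y_toy)

# predict re-applies the fitted preprocessing to the test data
pred = model.predict(test)
print(pred)
```

With this design there is no separate list of columns or fitted scaler to carry around, since the pipeline stores them internally.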



In [62]:
def extractTest(data,
                numerical_column, categorical_column, ohe_column,
                standardizer):
    """
    Extract & clean the test data
    :param data: <pandas dataframe> test sample data
    :param numerical_column: <list> numerical columns
    :param categorical_column: <list> categorical columns
    :param ohe_column: <list> one-hot-encoded columns from the training categorical data
    :param standardizer: <sklearn StandardScaler> scaler fitted on the training data
    :return cleaned_data: <pandas dataframe> final data
    """
    # Select the columns
    numerical_data = data[numerical_column]
    categorical_data = data[categorical_column]

    # Process the categorical data
    categorical_data = pd.get_dummies(categorical_data)
    # reindex returns a new frame, so assign it back; categories absent
    # from the test data become all-zero columns instead of NaN
    categorical_data = categorical_data.reindex(columns = ohe_column,
                                                fill_value = 0)

    # Combine the data
    concat_data = pd.concat([numerical_data, categorical_data],
                             axis = 1)
    # Standardize with the fitted scaler, keeping column names and index
    cleaned_data = pd.DataFrame(standardizer.transform(concat_data),
                                columns = concat_data.columns,
                                index = concat_data.index)

    return cleaned_data


In [63]:
X_test_clean = extractTest(data = X_test,
                           numerical_column = numerical_column,
                           categorical_column = categorical_column,
                           ohe_column = ohe_columns,
                           standardizer = standardizer)

In [64]:
X_test_clean.shape

(834, 71)

In [65]:
# Evaluate on the test data
best_decTree.score(X_test_clean, y_test)

0.9448441247002398
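
The test accuracy is above the 85% baseline. Since churners are the minority class, it can also help to look at how many of them are actually caught. The sketch below uses made-up labels, not the notebook's predictions, to show how accuracy, the confusion matrix, and recall on the churn class relate:

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Toy labels (hypothetical): 8 non-churners and 2 churners
y_true = ["False."] * 8 + ["True."] * 2

# Toy predictions (hypothetical): one of the two churners is missed
y_pred = ["False."] * 8 + ["True.", "False."]

acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=["False.", "True."])

# Share of actual churners that were predicted as churners
churn_recall = recall_score(y_true, y_pred, pos_label="True.")

print(acc)
print(cm)
print(churn_recall)
```

In this toy case the accuracy is 0.9 while only half of the churners are detected, which is why the choice of score to optimize matters for imbalanced targets.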