# The C4.5 Algorithm
C4.5 is an extension of the ID3 algorithm. Where ID3 chooses splits by information gain, C4.5 uses the **gain ratio**, which reduces bias when selecting the best split attribute.
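
For reference, the quantities involved can be written out as below. The notation is introduced here for illustration and is an assumption, not the notebook's own: $S$ is the data set, $A$ an attribute, $S_v$ the subset of $S$ where $A$ takes value $v$, and $p_i$ the proportion of class $i$. These are the same quantities the notebook's `Gain` function computes for a two-way split.

$$
\begin{aligned}
\mathrm{Entropy}(S) &= -\sum_i p_i \log_2 p_i \\
\mathrm{Gain}(S, A) &= \mathrm{Entropy}(S) - \sum_v \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \\
\mathrm{SplitInfo}(S, A) &= -\sum_v \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|} \\
\mathrm{GainRatio}(S, A) &= \frac{\mathrm{Gain}(S, A)}{\mathrm{SplitInfo}(S, A)}
\end{aligned}
$$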

The algorithm works with both **categorical** and **numeric** data. For a numeric attribute, C4.5 builds candidate thresholds and partitions the data into intervals, which turns the numeric values into categorical ones (**numeric-to-categorical conversion**).

C4.5 can also handle missing values: they are marked with "?" and left out of the entropy and information gain calculations.
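
A minimal sketch of that idea, following only the description above: rows whose attribute value is "?" are dropped before entropy and information gain are computed. The function names and the toy columns `warna` and `layak` are hypothetical and do not come from the notebook's data, and a full C4.5 implementation may treat missing values in more detail.

```python
import math
from collections import Counter

def class_entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain_skip_missing(values, labels, missing="?"):
    """Information gain of a categorical attribute, ignoring rows marked as missing."""
    # Keep only the rows whose attribute value is known
    known = [(v, y) for v, y in zip(values, labels) if v != missing]
    known_labels = [y for _, y in known]
    gain = class_entropy(known_labels)
    # Subtract the size-weighted entropy of each attribute value
    for value in set(v for v, _ in known):
        subset = [y for v, y in known if v == value]
        gain -= len(subset) / len(known) * class_entropy(subset)
    return gain

# Hypothetical toy data: the third row has a missing attribute value
warna = ["Hitam", "Putih", "?", "Hitam", "Putih"]
layak = ["Ya", "Tidak", "Ya", "Ya", "Tidak"]
print(info_gain_skip_missing(warna, layak))
```

Here the four known rows split perfectly by `warna`, so the gain equals the entropy of the known labels.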

The algorithm also prunes the decision tree produced during induction: branches that overfit are removed and replaced with more general leaf nodes.

# How to Convert Numeric Values into Categorical Ones

A numeric variable must first be converted to a categorical one. The procedure resembles building a frequency table with several classes. Here, however, the number of classes and the class boundaries are not chosen with Sturges' rule but by the best gain ratio: the boundaries and number of classes that give the highest gain ratio are selected.
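
The selection rule can be sketched compactly for the two-class case: score every candidate threshold by its gain ratio and keep the best one. This is an illustrative sketch using only the standard library. The helper names (`entropy2`, `gain_ratio`, `best_threshold`) and the toy data are assumptions and are not part of the notebook, which builds the same calculation step by step with pandas further down.

```python
import math

def entropy2(counts):
    """Shannon entropy (base 2) of a list of non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split value <= threshold vs. value > threshold."""
    classes = sorted(set(labels))
    pairs = list(zip(values, labels))
    lower = [sum(1 for v, y in pairs if v <= threshold and y == c) for c in classes]
    upper = [sum(1 for v, y in pairs if v > threshold and y == c) for c in classes]
    n = len(values)
    parent = [lo + up for lo, up in zip(lower, upper)]
    gain = (entropy2(parent)
            - sum(lower) / n * entropy2(lower)
            - sum(upper) / n * entropy2(upper))
    split_info = entropy2([sum(lower), sum(upper)])
    return gain / split_info if split_info > 0 else 0.0

def best_threshold(values, labels):
    """Return (threshold, gain ratio) for the candidate with the highest gain ratio."""
    # The maximum is skipped because it would leave the upper side empty
    candidates = sorted(set(values))[:-1]
    if not candidates:
        return None
    scored = [(t, gain_ratio(values, labels, t)) for t in candidates]
    return max(scored, key=lambda pair: pair[1])

# Hypothetical toy data, not the notebook's data set
nilai = [1, 2, 3, 10, 11, 12]
kelas = ["Tidak", "Tidak", "Tidak", "Ya", "Ya", "Ya"]
print(best_threshold(nilai, kelas))
```

On this toy data the threshold 3 separates the two classes perfectly, so it receives the highest gain ratio.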

# Import Data

In [1]:
import pandas as pd
import math
import numpy
from scipy.stats import entropy
import matplotlib.pyplot as plt

In [3]:
df = pd.read_excel('Hp.xlsx',sheet_name='Sheet2')
df

Unnamed: 0,Handphone,Baterai,Kamera,Harga,Layak Direkomendasikan
0,H1,26,8,1.2,Ya
1,H2,27,13,15.0,Ya
2,H3,28,5,6.0,Ya
3,H4,25,2,5.0,Tidak
4,H5,23,10,1.0,Ya
5,H6,20,7,3.5,Ya
6,H7,22,7,10.0,Ya
7,H8,24,8,2.0,Ya
8,H9,21,3,4.0,Tidak
9,H10,16,13,0.8,Ya


In [10]:
## Get the unique values of each numeric variable
Baterai = list(set(df.Baterai))
Kamera = list(set(df.Kamera))
Harga = list(set(df.Harga))

In [11]:
Harga

[0.8, 1.2, 1.0, 3.5, 2.0, 5.0, 6.0, 4.0, 3.0, 10.0, 12.0, 14.0, 15.0]

# Converting the Baterai Variable from Numeric to Categorical

In [17]:
print("Min value:",min(Baterai),"Max value:",max(Baterai))

Min value: 12 Max value: 28


In [28]:
n_class = 2
batas_atas = max(Baterai)
batas_bawah = min(Baterai)
batas = list(range(batas_bawah,batas_atas))

In [29]:
batas

[12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27]

In [30]:
df

Unnamed: 0,Handphone,Baterai,Kamera,Harga,Layak Direkomendasikan
0,H1,26,8,1.2,Ya
1,H2,27,13,15.0,Ya
2,H3,28,5,6.0,Ya
3,H4,25,2,5.0,Tidak
4,H5,23,10,1.0,Ya
5,H6,20,7,3.5,Ya
6,H7,22,7,10.0,Ya
7,H8,24,8,2.0,Ya
8,H9,21,3,4.0,Tidak
9,H10,16,13,0.8,Ya


In [75]:
LowerYa = []
LowerNo =[]
UpperYa=[]
UpperNo=[]
kk=[]
for k in batas:
    LowerYa.append( len(df[(df['Baterai'] <= k) & (df['Layak Direkomendasikan']=='Ya')]))
    LowerNo.append( len(df[(df['Baterai'] <= k) & (df['Layak Direkomendasikan']=='Tidak')]))
    UpperYa.append( len(df[(df['Baterai'] > k) & (df['Layak Direkomendasikan']=='Ya')]))
    UpperNo.append( len(df[(df['Baterai'] > k) & (df['Layak Direkomendasikan']=='Tidak')]))
    kk.append(k)

In [102]:
tabel_count = pd.DataFrame([kk,LowerYa,LowerNo,UpperYa,UpperNo]).transpose()
tabel_count.columns =['batas','LowerYa','LowerNo','UpperYa','UpperNo']

In [114]:
tabel_count

Unnamed: 0,batas,LowerYa,LowerNo,UpperYa,UpperNo
0,12,0,1,8,5
1,13,0,1,8,5
2,14,0,2,8,4
3,15,0,3,8,3
4,16,1,3,7,3
5,17,1,3,7,3
6,18,1,4,7,2
7,19,1,4,7,2
8,20,2,4,6,2
9,21,2,5,6,1


In [178]:
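## Gain ratio of a binary split; ls = [LowerYa, LowerNo, UpperYa, UpperNo]
def Gain(ls):
    LowerYa = ls[0]
    LowerNo = ls[1]
    UpperYa = ls[2]
    UpperNo = ls[3]

    # Class totals over both partitions
    Ya = UpperYa + LowerYa
    No = UpperNo + LowerNo

    # Entropy of the whole data set
    S = entropy([Ya/(Ya + No), No/(Ya + No)], base=2)

    # Entropy of each partition (both partitions must be non-empty here)
    S_bawah = entropy([LowerYa/(LowerYa + LowerNo), LowerNo/(LowerYa + LowerNo)], base=2)
    S_atas = entropy([UpperYa/(UpperYa+UpperNo), UpperNo/(UpperYa+UpperNo)], base=2)

    # Information gain: parent entropy minus the size-weighted partition entropies
    gain = S - ((LowerYa+LowerNo)/(Ya+No))*S_bawah - ((UpperYa+UpperNo)/(Ya+No))*S_atas

    # Split information: entropy of the partition sizes
    split = entropy([(LowerYa+LowerNo)/(Ya+No), (UpperYa+UpperNo)/(Ya+No)], base=2)

    gain_ratio = gain/split

    return gain_ratio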

In [133]:
Gain([0,1,8,5])

0.24957764220035264
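
As a check, the result of `Gain([0,1,8,5])` can be reproduced by hand. The lower partition holds 1 row (0 Ya, 1 Tidak) and the upper partition 13 rows (8 Ya, 5 Tidak), for 14 rows in total. The notation $H(\cdot)$ for base-2 entropy is introduced here for illustration, and the intermediate values are rounded.

$$
\begin{aligned}
S &= H\!\left(\tfrac{8}{14}, \tfrac{6}{14}\right) \approx 0.9852 \\
S_{\text{bawah}} &= H(0, 1) = 0 \\
S_{\text{atas}} &= H\!\left(\tfrac{8}{13}, \tfrac{5}{13}\right) \approx 0.9612 \\
\text{gain} &= 0.9852 - \tfrac{1}{14}\cdot 0 - \tfrac{13}{14}\cdot 0.9612 \approx 0.0927 \\
\text{split} &= H\!\left(\tfrac{1}{14}, \tfrac{13}{14}\right) \approx 0.3712 \\
\text{gain ratio} &= \frac{0.0927}{0.3712} \approx 0.2496
\end{aligned}
$$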

In [180]:
def GainRatio(df):
    ls = df[['LowerYa','LowerNo','UpperYa','UpperNo']].values.tolist()
    gain_ratio = []
    for k in range(len(ls)):
        gain_ratio.append(Gain(ls[k]))
        
    df_gain = pd.DataFrame(gain_ratio)
    gr = pd.concat([df,df_gain],ignore_index=True,axis=1)
    gr.columns = ['batas', 'LowerYa', 'LowerNo', 'UpperYa', 'UpperNo','GainRatio']
    return gr

In [181]:
GainRatio(tabel_count)

Unnamed: 0,batas,LowerYa,LowerNo,UpperYa,UpperNo,GainRatio
0,12,0,1,8,5,0.249578
1,13,0,1,8,5,0.249578
2,14,0,2,8,4,0.334843
3,15,0,3,8,3,0.428263
4,16,1,3,7,3,0.143596
5,17,1,3,7,3,0.143596
6,18,1,4,7,2,0.251118
7,19,1,4,7,2,0.251118
8,20,2,4,6,2,0.130006
9,21,2,5,6,1,0.257831


In [182]:
## From the table above, the highest (best) gain ratio is about 0.43, so the chosen threshold is 15

# Converting the Kamera Variable from Numeric to Categorical

In [195]:
# Count the Ya/Tidak rows on each side of every candidate threshold
def getTableCal(df,var):
    batas = list(set(df[var]))
    LowerYa = []
    LowerNo =[]
    UpperYa=[]
    UpperNo=[]
    kk=[]
    for k in batas:
        LowerYa.append( len(df[(df[var] <= k) & (df['Layak Direkomendasikan']=='Ya')]))
        LowerNo.append( len(df[(df[var] <= k) & (df['Layak Direkomendasikan']=='Tidak')]))
        UpperYa.append( len(df[(df[var] > k) & (df['Layak Direkomendasikan']=='Ya')]))
        UpperNo.append( len(df[(df[var] > k) & (df['Layak Direkomendasikan']=='Tidak')]))
        kk.append(k)
        
    tabel_count = pd.DataFrame([kk,LowerYa,LowerNo,UpperYa,UpperNo]).transpose()
    tabel_count.columns =['batas','LowerYa','LowerNo','UpperYa','UpperNo']
    
    return tabel_count

In [208]:
## Gain ratio of a binary split, with guards for empty partitions (avoids division by zero)
def Gain(ls):
    LowerYa = ls[0]
    LowerNo = ls[1]
    UpperYa = ls[2]
    UpperNo = ls[3]
    
    Ya = UpperYa + LowerYa
    No = UpperNo + LowerNo
    
    Upper = UpperYa + UpperNo
    Lower = LowerYa + LowerNo
    
    S = entropy([Ya/(Ya + No), No/(Ya + No)], base=2)
    
    if Lower > 0:
        S_bawah = entropy([LowerYa/(LowerYa + LowerNo), LowerNo/(LowerYa + LowerNo)], base=2)
    else:
        S_bawah = 0
        
    if Upper > 0 :
        S_atas = entropy([UpperYa/(UpperYa+UpperNo), UpperNo/(UpperYa+UpperNo)], base=2)
    else:
        S_atas = 0
    
    gain = S - ((LowerYa+LowerNo)/(Ya+No))*S_bawah - ((UpperYa+UpperNo)/(Ya+No))*S_atas
    
    split = entropy([(LowerYa+LowerNo)/(Ya+No), (UpperYa+UpperNo)/(Ya+No)], base=2)
    
    if split > 0 :
        gain_ratio = gain/split
    else:
        gain_ratio = 0
        
    return gain_ratio

In [214]:
def GainRatio(tabel_count):
    ls = tabel_count[['LowerYa','LowerNo','UpperYa','UpperNo']].values.tolist()
    gain_ratio = []
    for k in range(len(ls)):
        gain_ratio.append(Gain(ls[k]))
        
    df_gain = pd.DataFrame(gain_ratio)
    gr = pd.concat([tabel_count,df_gain],ignore_index=True,axis=1)
    gr.columns = ['batas', 'LowerYa', 'LowerNo', 'UpperYa', 'UpperNo','GainRatio']
    return gr

In [215]:
tabel_count_kamera = getTableCal(df,'Kamera')
GainRatio(tabel_count_kamera)

Unnamed: 0,batas,LowerYa,LowerNo,UpperYa,UpperNo,GainRatio
0,2,0,1,8,5,0.249578
1,3,0,3,8,3,0.428263
2,5,1,5,7,1,0.401977
3,7,3,5,5,1,0.163674
4,8,5,5,3,1,0.045357
5,10,6,6,2,0,0.21648
6,13,8,6,0,0,0.0


In [216]:
## From the table above, the best split is at threshold 3
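
Once a threshold has been chosen, the numeric column can be replaced by a two-valued categorical one. The sketch below applies the thresholds found so far (15 for Baterai, 3 for Kamera) to the rows displayed earlier. The helper `to_category` and the label strings such as `"<=3"` are assumptions made for illustration; the notebook does not show this step.

```python
def to_category(value, threshold):
    """Label a numeric value by the side of the threshold it falls on."""
    return f"<={threshold}" if value <= threshold else f">{threshold}"

# Values copied from the ten rows displayed earlier in the notebook
baterai = [26, 27, 28, 25, 23, 20, 22, 24, 21, 16]
kamera = [8, 13, 5, 2, 10, 7, 7, 8, 3, 13]

baterai_cat = [to_category(v, 15) for v in baterai]
kamera_cat = [to_category(v, 3) for v in kamera]

print(baterai_cat)
print(kamera_cat)
```

In the DataFrame itself, the same idea could be expressed as a boolean comparison on the column, for example `df['Kamera'] <= 3`.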

# Converting the Harga Variable from Numeric to Categorical

In [217]:
tabel_count_harga = getTableCal(df,'Harga')
GainRatio(tabel_count_harga)

Unnamed: 0,batas,LowerYa,LowerNo,UpperYa,UpperNo,GainRatio
0,0.8,1.0,0.0,7.0,6.0,0.163305
1,1.2,3.0,0.0,5.0,6.0,0.27242
2,1.0,2.0,0.0,6.0,6.0,0.21648
3,3.5,5.0,1.0,3.0,5.0,0.163674
4,2.0,4.0,0.0,4.0,6.0,0.33795
5,5.0,5.0,4.0,3.0,2.0,0.001425
6,6.0,6.0,4.0,2.0,2.0,0.006926
7,4.0,5.0,2.0,3.0,4.0,0.061054
8,3.0,4.0,1.0,4.0,5.0,0.096009
9,10.0,7.0,4.0,1.0,2.0,0.060608
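
The notebook does not state which threshold is chosen for Harga. As a sketch of how the row with the highest gain ratio could be picked, the snippet below takes the maximum over the ten rows displayed above. The list `harga_rows` is a hand-copied assumption, and it covers only the displayed rows, so the result applies to those rows alone.

```python
# (batas, GainRatio) pairs copied from the rows displayed above
harga_rows = [
    (0.8, 0.163305), (1.2, 0.27242), (1.0, 0.21648), (3.5, 0.163674),
    (2.0, 0.33795), (5.0, 0.001425), (6.0, 0.006926), (4.0, 0.061054),
    (3.0, 0.096009), (10.0, 0.060608),
]

# Pick the pair with the largest gain ratio
best_batas, best_ratio = max(harga_rows, key=lambda row: row[1])
print(best_batas, best_ratio)
```

Among these rows the maximum falls at `batas` 2.0. On the DataFrame returned by `GainRatio`, calling `idxmax()` on the `GainRatio` column would locate the same row.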


# Building the Model
The steps above produce the result shown below.

<img src="C4.5.jpeg"/>