# Decision Tree, Feature Engineering, and Grid Search
*Version:* 2025-11-03 03:29:48

This notebook **focuses on manual calculations** with small numbers so that each step is easy to follow, then briefly verifies the results with `scikit-learn`.

**Contents:**
1) Decision Tree: Gini/Entropy/Gain (classification) + MSE (regression), computed **manually**
2) Feature Engineering: one-hot encoding, binning, cyclical time features, interactions, computed **manually**
3) Lightweight Grid Search: mean & standard deviation computed **manually**, followed by a small `GridSearchCV` example


## 1) Decision Tree: Impurity & Gain (Manual Calculation)
### 1.1 Gini & Entropy (formulas)
- Gini:  $\text{Gini}(S) = 1 - \sum_k p_k^2$
- Entropy:  $H(S) = -\sum_k p_k \log_2 p_k$
- Gain:  $\text{Gain} = \text{Impurity}(S) - \sum_j \frac{|S_j|}{|S|}\,\text{Impurity}(S_j)$
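
The three formulas above can be sketched as small Python helpers that operate on class counts. The function names (`gini`, `entropy`, `gain`) and the example counts are illustrative assumptions, not part of any library.

```python
import math

def gini(counts):
    """Gini impurity from a list of class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def entropy(counts):
    """Entropy in bits; classes with a zero count contribute nothing."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain(parent, children, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * impurity(ch) for ch in children)
    return impurity(parent) - weighted

# Hypothetical counts: a balanced two-class node split into two pure children
print(gini([5, 5]), entropy([5, 5]), gain([5, 5], [[5, 0], [0, 5]]))
```

Passing `impurity=entropy` to `gain` gives the entropy-based information gain instead of the Gini-based one.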

### 1.2 Manual Example (Classification, Gini)
Mini dataset (10 rows): Spam=4, Ham=6. The split is on `has_gratis`.

**Steps:**
1) Initial Gini:  $1-(0.4^2+0.6^2)=0.48$  
2) Left (4 rows: 3 Spam, 1 Ham):  $1-(0.75^2+0.25^2)=0.375$  
3) Right (6 rows: 1 Spam, 5 Ham):  $1-(\tfrac{1}{6})^2-(\tfrac{5}{6})^2 \approx 0.2778$  
4) Gini after the split (size-weighted average):  $\tfrac{4}{10}\cdot 0.375 + \tfrac{6}{10}\cdot 0.2778 \approx 0.3167$  
5) **Gain**:  $0.48 - 0.3167 = 0.1633$

In [None]:
# Verify Gini and Gain programmatically
p_spam, p_ham = 4/10, 6/10
gini_root = 1 - (p_spam**2 + p_ham**2)
gini_L = 1 - (0.75**2 + 0.25**2)
gini_R = 1 - ((1/6)**2 + (5/6)**2)
gini_after = (4/10)*gini_L + (6/10)*gini_R
gain = gini_root - gini_after
print({'gini_root': round(gini_root,4), 'gini_L': round(gini_L,4), 'gini_R': round(gini_R,4),
       'gini_after': round(gini_after,4), 'gain': round(gain,4)})

### 1.3 Bonus: Entropy (optional)
Initial entropy with p(spam)=0.4, p(ham)=0.6:  $H=-(0.4\log_2 0.4 + 0.6\log_2 0.6) \approx 0.97095$.

In [None]:
import math
p = [0.4, 0.6]  # class probabilities: p(spam), p(ham)
# Entropy in bits: H = -sum(p_k * log2(p_k))
H = -sum(pi * math.log2(pi) for pi in p)
print(round(H, 5))
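
For comparison with the Gini result in 1.2, the same split can also be scored with entropy as the impurity measure. This is a minimal sketch: the helper name `entropy_bits` is illustrative, and the class counts are the ones from the example in 1.2.

```python
import math

def entropy_bits(counts):
    """Entropy in bits from class counts; zero counts are skipped."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

root = [4, 6]                  # [Spam, Ham] before the split
left, right = [3, 1], [1, 5]   # [Spam, Ham] in each child
n = sum(root)

h_root = entropy_bits(root)
# Size-weighted entropy of the two children
h_after = sum(left) / n * entropy_bits(left) + sum(right) / n * entropy_bits(right)
info_gain = h_root - h_after
print(round(h_root, 4), round(h_after, 4), round(info_gain, 4))
```

The absolute gain values of Gini and entropy are on different scales, so they should only be compared across candidate splits under the same measure.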

### 1.4 Manual Example (Regression, MSE)
Initial node $y=\{2,3,4,10\}$: $\bar{y}=4.75$,  $\text{MSE}(S)=9.6875$.
Split: $S_L=\{2,3,4\}$ (MSE=$2/3$), $S_R=\{10\}$ (MSE=0).  
MSE after the split (weighted by node size): $\tfrac{3}{4}\cdot\tfrac{2}{3}+\tfrac{1}{4}\cdot 0=0.5$ → reduction $=9.6875-0.5=9.1875$.

In [None]:
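import numpy as np
y = np.array([2, 3, 4, 10], float)
# MSE of the parent node around its own mean
mu = y.mean(); mse_root = np.mean((y - mu)**2)
# Left child {2, 3, 4}; the right child {10} holds a single value, so its MSE is 0
yL = np.array([2, 3, 4], float); muL = yL.mean(); mseL = np.mean((yL - muL)**2)
mseR = 0.0
# Weighted MSE after the split: (|S_L|/|S|)*MSE_L + (|S_R|/|S|)*MSE_R
mse_after = (3/4)*mseL + (1/4)*mseR
print({'mse_root':mse_root,'mseL':mseL,'mse_after':mse_after,'reduction':mse_root-mse_after})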
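
The split above isolates the value 10 from the rest. As a rough sketch of why that position is preferred, one can scan every split position and compare the weighted MSE. The helper names `mse` and `weighted_mse_after` are illustrative, and the code assumes the samples are already ordered by the feature being split on.

```python
import numpy as np

def mse(values):
    """Mean squared error around the node mean (population variance)."""
    values = np.asarray(values, dtype=float)
    return float(np.mean((values - values.mean()) ** 2))

def weighted_mse_after(y, i):
    """Weighted MSE when the first i samples go left and the rest go right."""
    left, right = y[:i], y[i:]
    n = len(y)
    return len(left) / n * mse(left) + len(right) / n * mse(right)

# Assumption: samples are already sorted by the splitting feature
y = [2, 3, 4, 10]
candidates = {i: weighted_mse_after(y, i) for i in range(1, len(y))}
best_i = min(candidates, key=candidates.get)
print(candidates)
print("best split position:", best_i, "reduction:", mse(y) - candidates[best_i])
```

The position with the lowest weighted MSE is the one with the largest reduction, which matches the manual choice of $S_L=\{2,3,4\}$ and $S_R=\{10\}$.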

## 2) Feature Engineering: Manual Transformations
### 2.1 One-Hot Encoding
`channel ∈ {email, chat, web}`; with the column order (email, chat, web), `channel=chat` → $[0,1,0]$.
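
A minimal sketch of the encoding in plain Python. The function name `one_hot` and the fixed column order are assumptions made for illustration.

```python
CATEGORIES = ["email", "chat", "web"]  # fixed column order (assumed)

def one_hot(value, categories=CATEGORIES):
    """Return a 0/1 list with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

print(one_hot("chat"))
```

Raising on an unknown category is one possible design choice; an alternative would be to return an all-zero vector.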

### 2.2 Binning (numeric → categorical)
`age ∈ {18,22,25,31,35,49}` → bins: Young ($<25$), Adult ($25 \leq \text{age} < 40$), Senior ($\geq 40$).
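
The binning rule can be sketched with the standard-library `bisect` module. The names `EDGES`, `LABELS`, and `age_bin` are illustrative; the boundaries are the ones stated above.

```python
from bisect import bisect_right

EDGES = [25, 40]                        # bin boundaries from the text
LABELS = ["Young", "Adult", "Senior"]   # <25, 25 to <40, >=40

def age_bin(age):
    """Map an age to its bin label; a boundary value falls into the upper bin."""
    return LABELS[bisect_right(EDGES, age)]

ages = [18, 22, 25, 31, 35, 49]
print({age: age_bin(age) for age in ages})
```

Using `bisect_right` makes the intervals closed on the left, so 25 lands in Adult and 40 lands in Senior, consistent with the definitions above.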

### 2.3 Cyclical time (hour 0 ≈ hour 24)
$$
\sin\_\text{hour} = \sin\!\left(2\pi\cdot \frac{\mathrm{hour}}{24}\right),\quad
\cos\_\text{hour} = \cos\!\left(2\pi\cdot \frac{\mathrm{hour}}{24}\right)
$$

In [None]:
import math, pandas as pd
hours=[0,6,12,18,23]
rows=[{'hour':h,'sin_hour':round(math.sin(2*math.pi*h/24),4),'cos_hour':round(math.cos(2*math.pi*h/24),4)} for h in hours]
pd.DataFrame(rows)
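
To illustrate why the sine/cosine pair is useful, the sketch below compares the raw hour difference with the Euclidean distance between encoded hours. The helper names `encode_hour` and `distance` are illustrative.

```python
import math

def encode_hour(hour):
    """Map an hour in [0, 24) to a point on the unit circle."""
    angle = 2 * math.pi * hour / 24
    return (math.sin(angle), math.cos(angle))

def distance(h1, h2):
    """Euclidean distance between two encoded hours."""
    return math.dist(encode_hour(h1), encode_hour(h2))

# Raw difference vs. distance after encoding
print("23 vs 0:", abs(23 - 0), round(distance(23, 0), 4))
print("12 vs 0:", abs(12 - 0), round(distance(12, 0), 4))
```

In raw form, hours 23 and 0 look far apart even though they are one hour from each other; after encoding they are as close as hours 0 and 1, while hours 12 and 0 sit on opposite sides of the circle.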

### 2.4 Feature interactions
$x_{1\times2}=x_1\cdot x_2$ (example: $(2,5) \Rightarrow 10$).

In [None]:
import pandas as pd
df = pd.DataFrame({'x1': [2, 1.5, 0.5], 'x2': [5, 4, 10]})
df['x1x2'] = df['x1'] * df['x2']  # interaction feature: element-wise product
df

### 2.5 TF–IDF in brief
$\text{tfidf}(t,d)=\text{tf}(t,d)\cdot\log\!\left(\dfrac{|D|}{1+\text{df}(t)}\right)$, where $\log$ is the natural logarithm. Example: $|D|=3$ and `gratis` appears in 1 document ⇒ $\text{idf}=\log(3/2)\approx 0.4055$; if $\text{tf}=2$ ⇒ $\text{tfidf}\approx 0.811$.
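
The formula can be checked with a few lines of plain Python. The three tokenized documents below are hypothetical, chosen only so that `gratis` occurs twice in exactly one document, and the function name `tfidf` is illustrative.

```python
import math

# Hypothetical tokenized corpus: 'gratis' occurs twice, in one document only
docs = [
    ["promo", "gratis", "gratis", "hari", "ini"],
    ["rapat", "tim", "besok"],
    ["laporan", "bulanan", "tim"],
]

def tfidf(term, doc, corpus):
    tf = doc.count(term)                       # raw term count in this document
    df = sum(1 for d in corpus if term in d)   # documents containing the term
    idf = math.log(len(corpus) / (1 + df))     # smoothed idf, natural log
    return tf * idf

print(round(tfidf("gratis", docs[0], docs), 4))
```

Library implementations such as scikit-learn's `TfidfVectorizer` use a different smoothing and normalize the resulting vectors by default, so their values will generally not match this hand calculation.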

## 3) Grid Search (Manual): Mean & Std over k Folds
Compute the mean and standard deviation of the 5-fold scores for two parameter settings, then compare them.

In [None]:
import numpy as np
sA=np.array([0.78,0.80,0.82,0.79,0.81])
sB=np.array([0.80,0.81,0.83,0.80,0.82])
def summary(name,arr): return {'name':name,'mean':round(arr.mean(),3),'std_pop':round(arr.std(ddof=0),4)}
summary('Param A',sA), summary('Param B',sB)
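
The comparison step itself can be sketched without NumPy. Here the setting with the highest mean score is selected, with the population standard deviation reported as a rough indication of stability. That selection rule and the names `scores`, `summary_rows`, and `best_name` are assumptions made for illustration.

```python
import statistics

scores = {
    "Param A": [0.78, 0.80, 0.82, 0.79, 0.81],
    "Param B": [0.80, 0.81, 0.83, 0.80, 0.82],
}

# (mean, population std) per parameter setting
summary_rows = {
    name: (statistics.fmean(vals), statistics.pstdev(vals))
    for name, vals in scores.items()
}
# Select the setting with the highest mean fold score
best_name = max(summary_rows, key=lambda name: summary_rows[name][0])

for name, (mean, std) in summary_rows.items():
    print(name, round(mean, 3), round(std, 4))
print("Selected:", best_name)
```

`statistics.pstdev` corresponds to `arr.std(ddof=0)` in the cell above; `statistics.stdev` would give the sample standard deviation (`ddof=1`) instead.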

### 3.1 Quick check with `GridSearchCV` (Iris + DecisionTree)
Grid: `max_depth ∈ {3,5,7}`, `min_samples_leaf ∈ {1,3}`.

In [None]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
X,y=load_iris(return_X_y=True)
Xtr,Xte,ytr,yte=train_test_split(X,y,test_size=0.2,stratify=y,random_state=42)
dt=DecisionTreeClassifier(random_state=42)
cv=StratifiedKFold(n_splits=5,shuffle=True,random_state=42)
grid=GridSearchCV(dt,{'max_depth':[3,5,7],'min_samples_leaf':[1,3]},cv=cv,scoring='accuracy',n_jobs=-1)
grid.fit(Xtr,ytr)
best=grid.best_estimator_
print('Best params:',grid.best_params_)
print('CV best mean:',round(grid.best_score_,3))
print('Test acc:',round(accuracy_score(yte,best.predict(Xte)),3))
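
To connect this back to the manual mean and standard deviation, the per-fold scores stored in `cv_results_` can be averaged by hand and compared with the values `GridSearchCV` reports. This is a sketch under simplifying assumptions: it fits on the full Iris data without a hold-out split to stay short, and the variable names (`search`, `folds`, `manual_mean`, `manual_std`) are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
param_grid = {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 3]}
search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid,
                      cv=cv, scoring="accuracy")
search.fit(X, y)

res = search.cv_results_
# One row per parameter combination, one column per fold
folds = np.column_stack([res[f"split{i}_test_score"] for i in range(5)])
manual_mean = folds.mean(axis=1)
manual_std = folds.std(axis=1, ddof=0)

for params, m, s in zip(res["params"], manual_mean, manual_std):
    print(params, round(m, 3), round(s, 4))
```

The hand-computed columns should agree with `res["mean_test_score"]` and `res["std_test_score"]`, which can be confirmed with `np.allclose`.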

## 4) Summary
- DT: computing Gini/Gain & MSE **manually** shows *why* a split is chosen.
- FE: one-hot encoding, binning, cyclical time features, and interactions can all be computed by hand.
- Lightweight grid search: mean & std across k folds computed **manually**; `GridSearchCV` for quick verification.