# Chapter 3: Classification

**Goal:** Understand classification (binary and multiclass) and its evaluation metrics.

---

## 1. What Is Classification?

- **Classification**: mapping an instance to one of several discrete classes.
- Examples: spam vs ham, disease positive vs negative, digits `0–9`.

---

## 2. Binary Classification

- **Binary**: only two classes (positive/negative).
- Common models: Logistic Regression, SVM, k-NN, Decision Tree, and others.

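### 2.1 Probability & Threshold
- The model predicts the _probability_ of the positive class.
- Choose a _threshold_ (for example 0.5) to turn that probability into a decision: predict positive when the probability is at or above the threshold.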
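
To make the role of the threshold concrete, here is a minimal sketch in plain Python. The probability values and the helper name `apply_threshold` are made up for illustration and are not taken from any dataset or library.

```python
# Hypothetical predicted probabilities of the positive class (illustrative values only).
probabilities = [0.05, 0.30, 0.45, 0.55, 0.80, 0.95]


def apply_threshold(probs, threshold):
    """Return 1 (positive) where the probability reaches the threshold, else 0."""
    return [1 if p >= threshold else 0 for p in probs]


default_preds = apply_threshold(probabilities, 0.5)
strict_preds = apply_threshold(probabilities, 0.9)    # fewer positives predicted
lenient_preds = apply_threshold(probabilities, 0.25)  # more positives predicted

print(default_preds)  # [0, 0, 0, 1, 1, 1]
print(strict_preds)   # [0, 0, 0, 0, 0, 1]
print(lenient_preds)  # [0, 1, 1, 1, 1, 1]
```

Raising the threshold makes the model more conservative about predicting the positive class, and lowering it does the opposite. This is why the choice of threshold usually trades false positives against false negatives.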

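### 2.2 Key Metrics
- **Confusion Matrix**: counts of each outcome
  - TP (true positive), TN (true negative), FP (false positive), FN (false negative)
- **Accuracy** = (TP + TN) / total
- **Precision** = TP / (TP + FP)
- **Recall** = TP / (TP + FN)
- **F1-score** = 2 × (Precision × Recall) / (Precision + Recall)
- **ROC curve** & **AUC**: the ROC curve plots the true positive rate against the false positive rate across all thresholds; AUC is the area under that curve.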
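
As a worked example of the formulas above, the sketch below computes each metric directly from the four confusion-matrix counts. The counts are hypothetical values chosen so the arithmetic is easy to follow.

```python
# Hypothetical confusion-matrix counts (illustrative values only).
tp, tn, fp, fn = 40, 45, 5, 10

total = tp + tn + fp + fn

accuracy = (tp + tn) / total          # share of all predictions that are correct
precision = tp / (tp + fp)            # share of predicted positives that are truly positive
recall = tp / (tp + fn)               # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy : {accuracy:.3f}")   # 0.850
print(f"Precision: {precision:.3f}")  # 0.889
print(f"Recall   : {recall:.3f}")     # 0.800
print(f"F1-score : {f1:.3f}")         # 0.842
```

Note that this sketch assumes the denominators are nonzero. With real data, a model that never predicts the positive class gives TP + FP = 0, and precision then needs special handling.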

---

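## 3. Multiclass Classification

- More than two classes, for example digits `0–9`.
- Strategies:
  - **One-vs-Rest (OvR)**: train one binary classifier per class (that class vs all others)
  - **Softmax (multinomial)**: a single model outputs a probability for every class
- Metrics: **accuracy**, **confusion matrix**.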
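
The two strategies above can be sketched in a few lines of plain Python. The scores are hypothetical and the `softmax` helper is written here for illustration only; it is not a library function.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Softmax: one model produces a raw score per class (hypothetical values).
scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
predicted_class = probs.index(max(probs))

print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(predicted_class)               # 0

# One-vs-Rest: each class has its own binary "this class vs the rest" score
# (hypothetical values); the prediction is the class with the highest score.
ovr_scores = {0: 0.2, 1: 0.7, 2: 0.4}
ovr_prediction = max(ovr_scores, key=ovr_scores.get)

print(ovr_prediction)  # 1
```

In both cases the final prediction is the class with the highest value. The difference is that softmax probabilities are normalized to sum to 1, while OvR scores come from independent binary classifiers and need not sum to anything in particular.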

---

## 4. Binary Classification Example: Digits 0 vs 1 from the scikit-learn digits dataset

In [1]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score,
    f1_score, roc_curve, auc
)
import matplotlib.pyplot as plt

# Load only digits 0 and 1
digits = load_digits(n_class=2)
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train logistic regression
clf = LogisticRegression(solver='liblinear')
clf.fit(X_train, y_train)
y_proba = clf.predict_proba(X_test)[:,1]
y_pred = (y_proba >= 0.5).astype(int)

# Compute metrics
acc    = accuracy_score(y_test, y_pred)
prec   = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1     = f1_score(y_test, y_pred)
cm     = confusion_matrix(y_test, y_pred)

print(f"Accuracy : {acc:.3f}")
print(f"Precision: {prec:.3f}")
print(f"Recall   : {recall:.3f}")
print(f"F1‑score : {f1:.3f}")
print("Confusion Matrix:\n", cm)

# ROC Curve
fpr, tpr, _ = roc_curve(y_test, y_proba)
roc_auc = auc(fpr, tpr)
plt.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
plt.plot([0,1], [0,1], 'k--')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve (0 vs 1)")
plt.legend()
plt.show()

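**Explanation of the Binary Example**  
- We keep only digits 0 and 1 to get a binary problem.
- `predict_proba` returns probabilities; column 1 is the probability of class 1, and we apply a threshold of 0.5 to it.
- Inspect the metrics: accuracy, precision (how many predicted positives are correct), recall (how many actual positives are found), F1, and ROC AUC.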
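
One way to build intuition for ROC AUC is its ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way in plain Python; the helper name `pairwise_auc` and the four-sample data are hypothetical, and this is not how scikit-learn computes the value internally.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and scores (illustrative values only).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc_value = pairwise_auc(y_true, y_score)
print(auc_value)  # 0.75
```

Of the four positive/negative pairs, only one (0.35 vs 0.4) is ranked the wrong way, which gives 3/4 = 0.75. An AUC of 0.5 corresponds to random ranking (the dashed diagonal in the plot) and 1.0 to perfect separation.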

---

## 5. Multiclass Classification Example: All digits 0–9

In [2]:
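# Imports repeated here so this cell also runs on its own
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns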

# Load all digits
digits = load_digits()
X, y = digits.data, digits.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train the random forest
rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)
y_pred_mc = rfc.predict(X_test)

# Accuracy and confusion matrix
acc_mc = accuracy_score(y_test, y_pred_mc)
cm_mc  = confusion_matrix(y_test, y_pred_mc)

print(f"Multiclass Accuracy: {acc_mc:.3f}")

# Plot confusion matrix
plt.figure(figsize=(8,6))
sns.heatmap(cm_mc, annot=True, fmt='d', cmap='Blues')
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix: Multiclass (0–9)")
plt.show()

Accuracy: 0.9658


**Explanation of the Multiclass Example**  
- We use `RandomForestClassifier` to classify the 10 classes.
- Main evaluation: **accuracy** and the **confusion matrix**.
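
A multiclass confusion matrix also lets you read off precision and recall per class. The sketch below does this for a small hypothetical 3-class matrix, using the same layout as the heatmap (rows are actual classes, columns are predicted classes). The numbers are invented for illustration and are not results from the digits model.

```python
# Hypothetical 3-class confusion matrix: rows = actual class, columns = predicted class.
cm = [
    [8, 1, 1],  # actual class 0
    [2, 7, 1],  # actual class 1
    [0, 2, 8],  # actual class 2
]
n_classes = len(cm)

# Recall for class i: correct predictions divided by the row total (all actual i).
recall_per_class = [cm[i][i] / sum(cm[i]) for i in range(n_classes)]

# Precision for class i: correct predictions divided by the column total (all predicted i).
precision_per_class = [
    cm[i][i] / sum(cm[r][i] for r in range(n_classes)) for i in range(n_classes)
]

# Accuracy: diagonal total divided by the grand total.
accuracy = sum(cm[i][i] for i in range(n_classes)) / sum(sum(row) for row in cm)

# Macro average: unweighted mean over classes.
macro_recall = sum(recall_per_class) / n_classes

print(recall_per_class)     # [0.8, 0.7, 0.8]
print(precision_per_class)  # [0.8, 0.7, 0.8]
print(f"{accuracy:.3f}")    # 0.767
```

Per-class values show which classes the model struggles with, something a single accuracy figure hides. Here class 1 is the weakest.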

---

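## Chapter 3 Summary

1. **Classification** maps data to discrete classes.
2. **Binary** vs **Multiclass**: the metrics differ slightly. Precision, recall, and ROC/AUC are defined relative to a positive class, so in the multiclass case they must be computed per class or averaged.
3. Always check the confusion matrix, precision/recall, and ROC/AUC, not accuracy alone.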