# Linear Regression • Logistic Regression • Light Regularization
*Version:* 2025-11-03 03:42:43

This notebook is **beginner-friendly**: it starts from the core concepts and formulas, then moves to practice on public datasets.
All transformations run inside a **Pipeline** to avoid **data leakage**.

**Contents:**
1) Linear Regression → MSE, normal equation (mini demo), Ridge (L2)
2) Logistic Regression → sigmoid, log-loss, evaluation (F1/ROC)
3) Light Regularization → L2 (Ridge/Logistic), L1 (optional), small Grid Search


## A. Linear Regression: Intuition & Formulas
Predict a numeric target $y$ from features $\mathbf{x}$:

$$\hat{y} = \mathbf{w}^\top \mathbf{x} + b$$

**Loss (MSE):**

$$J(\mathbf{w},b) = \frac{1}{n}\sum_{i=1}^{n}\big(y_i - (\mathbf{w}^\top \mathbf{x}_i + b)\big)^2$$
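
As a quick numerical check of the MSE formula, here is a minimal sketch that evaluates $J$ for two candidate parameter settings. The tiny dataset and the candidate values of `w` and `b` are made up for illustration; the helper name `mse` is also just a choice for this sketch.

```python
import numpy as np

def mse(w, b, X, y):
    """Mean squared error J(w, b) for the linear model y_hat = X @ w + b."""
    residuals = y - (X @ w + b)
    return float(np.mean(residuals ** 2))

# Illustrative data (an assumption, not a real dataset): y = 2*x + 1 exactly
X_toy = np.array([[0.0], [1.0], [2.0]])
y_toy = np.array([1.0, 3.0, 5.0])

j_true = mse(np.array([2.0]), 1.0, X_toy, y_toy)  # parameters that generated the data
j_off = mse(np.array([1.0]), 0.0, X_toy, y_toy)   # a deliberately wrong guess
print(j_true, j_off)
```

The generating parameters give a loss of zero, while the wrong guess leaves residuals of 1, 2 and 3, so its loss is $(1+4+9)/3 = 14/3$.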

**Normal equation (no regularization), with a column of ones appended to $X$ so that $b$ becomes the last entry of $\mathbf{w}$:**

$$\mathbf{\hat{w}} = (X^\top X)^{-1}X^\top \mathbf{y}$$

**Ridge (L2):**

$$\mathbf{\hat{w}}_{\mathrm{ridge}} = (X^\top X + \lambda I)^{-1}X^\top \mathbf{y}$$
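
To make the effect of $\lambda$ concrete, the sketch below applies the Ridge formula literally with NumPy and prints the norm of the solution for a few values of $\lambda$. The data, the helper name `ridge_closed_form`, and the list of $\lambda$ values are assumptions for illustration. Note that this closed form corresponds to minimizing the *sum* of squared errors plus $\lambda\lVert\mathbf{w}\rVert^2$; with the $1/n$-scaled MSE above, $\lambda$ would be rescaled by $n$. No intercept column is used here, so nothing but the weights is penalized.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y, following the Ridge formula literally."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# Illustrative data (assumption): 4 samples, 2 features, no intercept column
X_toy = np.array([[1.0, 2.0],
                  [2.0, 0.0],
                  [3.0, 1.0],
                  [0.0, 1.0]])
y_toy = np.array([4.0, 1.0, 5.0, 2.0])

lams = [0.0, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_closed_form(X_toy, y_toy, lam))) for lam in lams]
for lam, nrm in zip(lams, norms):
    print(f"lambda={lam:>6}: ||w|| = {nrm:.4f}")

# With lam = 0 the formula reduces to ordinary least squares
w_ols, *_ = np.linalg.lstsq(X_toy, y_toy, rcond=None)
matches_ols = np.allclose(ridge_closed_form(X_toy, y_toy, 0.0), w_ols)
print(matches_ols)
```

The printed norms shrink as $\lambda$ grows, which is the "weights get smaller" behaviour that L2 regularization is meant to produce.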

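### Mini demo: normal equation (2 features, tiny dataset)
We build 3 samples so the small computation can be verified by hand. With 3 samples and 3 parameters (two weights plus the bias), the fitted plane passes through every point exactly.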

In [None]:
import numpy as np
# X with bias column handled separately (we'll augment later) for clarity
X = np.array([[1.0, 2.0],
              [2.0, 0.0],
              [3.0, 1.0]])
y = np.array([4.0, 1.0, 5.0])

# add bias term by augmenting ones column
X_aug = np.c_[X, np.ones(len(X))]
w_hat = np.linalg.pinv(X_aug.T @ X_aug) @ (X_aug.T @ y)
w_hat  # [w1, w2, b]
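
One way to verify the result is to compare it against a least-squares solver and against the solution of the 3×3 system worked out by hand. This is a minimal sketch that rebuilds the same arrays so it can run on its own; the names `w_normal`, `w_lstsq` and `w_hand` are introduced here for illustration.

```python
import numpy as np

X_demo = np.array([[1.0, 2.0],
                   [2.0, 0.0],
                   [3.0, 1.0]])
y_demo = np.array([4.0, 1.0, 5.0])
X_aug_demo = np.c_[X_demo, np.ones(len(X_demo))]

# Normal-equation solution, computed the same way as in the cell above
w_normal = np.linalg.pinv(X_aug_demo.T @ X_aug_demo) @ (X_aug_demo.T @ y_demo)

# Least-squares solver: avoids forming X^T X explicitly
w_lstsq, *_ = np.linalg.lstsq(X_aug_demo, y_demo, rcond=None)

# Hand-derived solution of the three equations: w1 = 5/3, w2 = 7/3, b = -7/3
w_hand = np.array([5 / 3, 7 / 3, -7 / 3])

residuals = X_aug_demo @ w_normal - y_demo
print(w_normal)
print(np.allclose(w_normal, w_lstsq), np.allclose(w_normal, w_hand))
print(residuals)  # numerically zero: the fit is exact
```

On larger or ill-conditioned problems, `np.linalg.lstsq` is usually the safer choice because it does not square the condition number the way forming $X^\top X$ does.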

### Practice: Ridge Regression (Diabetes dataset, public in scikit-learn)
We use **Ridge (L2)** with a **StandardScaler** inside a **Pipeline**.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

ridge = make_pipeline(StandardScaler(), Ridge(alpha=1.0, random_state=42))
ridge.fit(X_tr, y_tr)
y_pr = ridge.predict(X_te)
# RMSE via np.sqrt: the `squared=False` argument is deprecated in newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_te, y_pr))
r2 = r2_score(y_te, y_pr)
print({'RMSE': round(rmse,3), 'R2': round(r2,3)})
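
To see concretely why the Pipeline avoids leakage, the sketch below checks that the scaler inside a fitted pipeline learned its statistics from the training split only, not from the full dataset. It rebuilds the same split under different variable names (`X_all`, `X_train`, and so on, chosen here so nothing in the notebook is overwritten).

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

X_all, y_all = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42)

pipe_check = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
pipe_check.fit(X_train, y_train)

# The scaler's learned means should equal the training-split means...
scaler = pipe_check.named_steps['standardscaler']
uses_train_only = np.allclose(scaler.mean_, X_train.mean(axis=0))
# ...and should differ from the means of the full dataset (train + test)
differs_from_full = not np.allclose(scaler.mean_, X_all.mean(axis=0))
print(uses_train_only, differs_from_full)
```

Calling `StandardScaler().fit(X_all)` before splitting would instead let test-set statistics influence training, which is exactly the leakage the Pipeline is there to prevent.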

#### Coefficients (after scaling)
We plot the model coefficients to see which features have the largest influence.

In [None]:
import matplotlib.pyplot as plt

# Extract the trained Ridge step; coef_ refers to the standardized features
ridge_model = ridge.named_steps['ridge']
coefs = ridge_model.coef_
plt.figure()
plt.bar(range(len(coefs)), coefs)
plt.title('Ridge coefficients (after scaling)')
plt.xlabel('Feature index')
plt.ylabel('Coefficient')
plt.show()
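
Feature indices are hard to read on their own, so it can help to pair each coefficient with its feature name. The sketch below refits the same kind of pipeline on its own copy of the data and ranks the coefficients by absolute value; the variable names (`data`, `model_named`, `ranked`) are choices for this example.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

data = load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42)
model_named = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_train, y_train)

# Pair each coefficient with its feature name, largest magnitude first
named_coefs = model_named.named_steps['ridge'].coef_
ranked = sorted(zip(data.feature_names, named_coefs),
                key=lambda pair: abs(pair[1]), reverse=True)
for name, coef in ranked:
    print(f"{name:>4}: {coef:+.2f}")
```

Because the features were standardized, the magnitudes are comparable across features; without scaling, a large coefficient could simply reflect a feature measured in small units.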

## B. Logistic Regression: Intuition & Formulas
Probability of class 1:

$$p(y{=}1\mid \mathbf{x}) = \sigma(z) = \frac{1}{1+e^{-z}},\quad z=\mathbf{w}^\top\mathbf{x}+b$$

**Log-loss (binary cross-entropy):**

$$J(\mathbf{w},b)=\frac{1}{n}\sum_{i=1}^{n}\big(-y_i\log p_i-(1-y_i)\log(1-p_i)\big)$$
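
The two formulas above can be sketched in a few lines of NumPy. The example compares the log-loss of an uninformed model ($z=0$, so $p=0.5$), a confidently correct one, and a confidently wrong one. The labels and scores are made up, and the clipping constant `eps` is an assumption added only to avoid `log(0)`.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss_binary(y, p, eps=1e-12):
    """Binary cross-entropy averaged over samples; eps guards against log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-y * np.log(p) - (1 - y) * np.log(1 - p)))

y_true = np.array([1, 0, 1, 0])
z_uninformed = np.zeros(4)                       # p = 0.5 for every sample
z_right = np.array([4.0, -4.0, 4.0, -4.0])       # confident and correct
z_wrong = -z_right                               # confident and wrong

loss_uninformed = log_loss_binary(y_true, sigmoid(z_uninformed))
loss_right = log_loss_binary(y_true, sigmoid(z_right))
loss_wrong = log_loss_binary(y_true, sigmoid(z_wrong))
print(loss_uninformed, loss_right, loss_wrong)
```

The uninformed model scores $\log 2 \approx 0.693$, and the confidently wrong model is penalized far more heavily than the confidently correct one is rewarded, which is why log-loss discourages overconfident mistakes.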

### Practice: Logistic Regression (Breast Cancer dataset, public in scikit-learn)
We start with the default **L2** penalty (parameter `C=1.0`) and a **StandardScaler** inside a **Pipeline**.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
import matplotlib.pyplot as plt

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000, C=1.0, random_state=42))
logreg.fit(X_tr, y_tr)
y_pr = logreg.predict(X_te)
print(classification_report(y_te, y_pr))

cm = confusion_matrix(y_te, y_pr)
plt.figure()
plt.imshow(cm)
plt.title('Confusion Matrix — Logistic')
plt.xlabel('Predicted'); plt.ylabel('True')
# Write each count into its cell of the matrix
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        plt.text(j, i, str(cm[i, j]), ha='center', va='center')
plt.colorbar(); plt.show()

# ROC-AUC
y_score = logreg.predict_proba(X_te)[:,1]
auc = roc_auc_score(y_te, y_score)
fpr, tpr, _ = roc_curve(y_te, y_score)
plt.figure()
plt.plot(fpr, tpr)
plt.plot([0,1],[0,1], linestyle='--')
plt.title(f'ROC Curve (AUC={auc:.3f})')
plt.xlabel('FPR'); plt.ylabel('TPR')
plt.show()
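
The ROC curve sweeps over every possible decision threshold, whereas `predict` effectively cuts the predicted probability at 0.5. The sketch below applies a few thresholds by hand to show the precision/recall trade-off. The thresholds 0.3, 0.5 and 0.7 are arbitrary illustrative choices, and the data is reloaded under separate names so the example stands alone.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import precision_score, recall_score

X_bc, y_bc = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X_bc, y_bc, test_size=0.2, random_state=42, stratify=y_bc)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000))
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]  # probability of class 1

results = []
for thr in [0.3, 0.5, 0.7]:
    y_hat = (proba >= thr).astype(int)   # manual thresholding instead of predict()
    prec = precision_score(y_test, y_hat)
    rec = recall_score(y_test, y_hat)
    results.append((thr, prec, rec))
    print(f"threshold={thr:.1f}  precision={prec:.3f}  recall={rec:.3f}")
```

Raising the threshold can only remove predicted positives, so recall never increases; precision typically rises in exchange, though that is not guaranteed.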

## C. Light Regularization: L2 (Ridge/LR) & L1 (optional)
Why regularize? To control model **complexity** and reduce **overfitting**.
- **L2 (Ridge/LR)**: shrinks all weights toward zero smoothly, keeping them small without setting them exactly to zero.
- **L1 (Lasso)**: pushes some weights to exactly **0** (feature selection).
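
The difference between the two penalties can be seen by counting exact zeros among the fitted coefficients. This is a minimal sketch on the Diabetes data; `alpha=10.0` is an illustrative value, not a tuned one, and the `alpha` of `Ridge` and `Lasso` are not on the same scale in scikit-learn, so only the sparsity pattern is being compared here.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge, Lasso
from sklearn.pipeline import make_pipeline

X_d, y_d = load_diabetes(return_X_y=True)

# Fit on all rows: this only inspects coefficients, it does not evaluate the model
ridge_pipe = make_pipeline(StandardScaler(), Ridge(alpha=10.0)).fit(X_d, y_d)
lasso_pipe = make_pipeline(StandardScaler(), Lasso(alpha=10.0)).fit(X_d, y_d)

ridge_coef = ridge_pipe.named_steps['ridge'].coef_
lasso_coef = lasso_pipe.named_steps['lasso'].coef_

n_zero_ridge = int(np.sum(ridge_coef == 0.0))
n_zero_lasso = int(np.sum(lasso_coef == 0.0))
print('Ridge:', np.round(ridge_coef, 2), 'zeros =', n_zero_ridge)
print('Lasso:', np.round(lasso_coef, 2), 'zeros =', n_zero_lasso)
```

Features whose Lasso coefficient is exactly zero are effectively dropped from the model, which is what "feature selection" means in this context.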

### Small Ridge comparison (alpha), CV=5
We compare `alpha ∈ {0.1, 1.0, 10.0}` with **cross_val_score**.

In [None]:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
import pandas as pd

# Reload Diabetes under separate names: X, y now hold the Breast Cancer data from section B,
# and Ridge is a regression model, so it must be scored on the regression dataset.
X_reg, y_reg = load_diabetes(return_X_y=True)

alphas = [0.1, 1.0, 10.0]
cv = KFold(n_splits=5, shuffle=True, random_state=42)
rows = []
for a in alphas:
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=a, random_state=42))
    # Scores are negated RMSE values, so flip the sign of the mean
    scores = cross_val_score(pipe, X_reg, y_reg, cv=cv, scoring='neg_root_mean_squared_error')
    rows.append({'alpha': a, 'rmse_mean': round(-scores.mean(),3), 'rmse_std': round(scores.std(),3)})
df_ridge = pd.DataFrame(rows)
df_ridge

### Small Logistic comparison (C), Stratified CV=5 (scoring=F1)
Remember: `C = 1/λ` → **small C = strong regularization**.

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import make_scorer, f1_score
import pandas as pd

C_list = [0.1, 1.0, 10.0]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rows = []
for C in C_list:
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000, C=C, random_state=42))
    scores = cross_val_score(pipe, X, y, cv=cv, scoring=make_scorer(f1_score))
    rows.append({'C': C, 'f1_mean': round(scores.mean(),3), 'f1_std': round(scores.std(),3)})
df_lr = pd.DataFrame(rows)
df_lr
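
The relation `C = 1/λ` can also be observed directly on the fitted weights: the smaller `C` is, the smaller the coefficient norm should be. The sketch below is a rough check under that assumption; it fits on all rows because it only inspects the weights, and the names `X_bc`, `c_val` and `coef_norms` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

X_bc, y_bc = load_breast_cancer(return_X_y=True)

coef_norms = {}
for c_val in [0.01, 0.1, 1.0, 10.0]:
    pipe_c = make_pipeline(StandardScaler(),
                           LogisticRegression(max_iter=5000, C=c_val))
    pipe_c.fit(X_bc, y_bc)
    weights = pipe_c.named_steps['logisticregression'].coef_
    coef_norms[c_val] = float(np.linalg.norm(weights))
    print(f"C={c_val:>5}: ||w|| = {coef_norms[c_val]:.3f}")
```

A larger norm is not better or worse by itself; the cross-validated F1 in the table above is what decides which `C` to prefer.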

### Small Grid Search (Logistic): leakage-free via Pipeline
Grid: `C ∈ {0.1, 1.0, 10.0}` with **StratifiedKFold(5)**.

In [None]:
from sklearn.model_selection import GridSearchCV
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000, random_state=42))
param_grid = {'logisticregression__C': [0.1, 1.0, 10.0]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid = GridSearchCV(pipe, param_grid, scoring='f1', cv=cv, n_jobs=-1)
grid.fit(X_tr, y_tr)
print('Best params:', grid.best_params_)
print('CV best mean F1:', round(grid.best_score_,3))
best = grid.best_estimator_
from sklearn.metrics import f1_score
print('Test F1:', round(f1_score(y_te, best.predict(X_te)),3))

### Note on Imbalance: class_weight='balanced'
A quick example of using `class_weight='balanced'` in LogisticRegression, which weights each class inversely to its frequency in the training data.

In [None]:
bal = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2000, class_weight='balanced', random_state=42))
bal.fit(X_tr, y_tr)
from sklearn.metrics import precision_recall_fscore_support
y_pred_bal = bal.predict(X_te)
p, r, f1, _ = precision_recall_fscore_support(y_te, y_pred_bal, average='binary')
print({'precision': round(p,3), 'recall': round(r,3), 'f1': round(f1,3)})

## Summary
- **Linear**: minimize MSE; use **Ridge (L2)** for stability.
- **Logistic**: predicts probabilities; evaluate with **F1/ROC-AUC**.
- **Regularization**: L2 shrinks the weights; **small C** (Logistic) = strong regularization.
- **Pipeline & CV**: essential for fair, leakage-free evaluation.

**Mini Tasks:**
1) Try `alpha ∈ {0.01, 0.1, 1, 10}` for Ridge; compare RMSE (CV=5).
2) Try `C ∈ {0.01, 0.1, 1, 10}` for Logistic; compare F1 (CV=5).
3) Change the Logistic prediction threshold (something other than 0.5) and observe how precision/recall change.