# 4. Model Optimization

This notebook covers hyperparameter optimization and model comparison.

**Contents:**
1. Data Preparation and Feature Engineering
2. LightGBM Hyperparameter Optimization
3. XGBoost Hyperparameter Optimization
4. Model Comparison
5. Final Model Selection

In [None]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')


from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score, RandomizedSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import roc_auc_score, classification_report, confusion_matrix, roc_curve


import lightgbm as lgb
import xgboost as xgb


RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)


## 1. Data Loading and Feature Engineering

In [None]:

DATA_PATH = Path('../data/raw/')
df = pd.read_csv(DATA_PATH / 'bank.csv')


TARGET = 'deposit'
y = (df[TARGET] == 'yes').astype(int)
X = df.drop(columns=[TARGET])

print(f'Data loaded: {df.shape[0]:,} rows, {df.shape[1]} columns')

# === FEATURE ENGINEERING (from notebook 03) ===
X_fe = X.copy()

# 1. Drop duration
X_fe = X_fe.drop(columns=['duration'])

# 2. Age groups
X_fe['age_group'] = pd.cut(
    X_fe['age'], 
    bins=[0, 30, 40, 50, 60, 100],
    labels=['18-30', '31-40', '41-50', '51-60', '60+']
)

# 3. Balance categories
X_fe['balance_category'] = pd.cut(
    X_fe['balance'],
    bins=[-np.inf, 0, 100, 500, 2000, np.inf],
    labels=['Negatif', 'Dusuk', 'Orta', 'Yuksek', 'Cok Yuksek']
)

# 4. Never contacted flag
X_fe['never_contacted'] = (X_fe['pdays'] == -1).astype(int)

# 5. Seasonality
month_map = {
    'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
    'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12
}
X_fe['month_numeric'] = X_fe['month'].map(month_map)
X_fe['quarter'] = ((X_fe['month_numeric'] - 1) // 3) + 1
X_fe['is_year_end'] = (X_fe['month_numeric'].isin([11, 12])).astype(int)
X_fe['is_year_start'] = (X_fe['month_numeric'].isin([1, 2])).astype(int)

# 6. Campaign metrics
X_fe['total_contacts'] = X_fe['campaign'] + X_fe['previous']
X_fe['over_contacted'] = (X_fe['campaign'] > 5).astype(int)

# 7. Interaction features
X_fe['age_balance_interaction'] = X_fe['age'] * (X_fe['balance'] / 1000)
X_fe['age_campaign_interaction'] = X_fe['age'] * X_fe['campaign']

# 8. Ratio
X_fe['balance_per_age'] = X_fe['balance'] / (X_fe['age'] + 1)

# Label Encoding
cat_cols = X_fe.select_dtypes(include=['object', 'category']).columns.tolist()
for col in cat_cols:
    le = LabelEncoder()
    X_fe[col] = le.fit_transform(X_fe[col].astype(str))

print(f'Feature engineering complete: {X_fe.shape[1]} features')

## 2. Train-Test Split and Baseline

In [None]:
# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(
    X_fe, y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

print(f'Train: {X_train.shape[0]:,} samples')
print(f'Test:  {X_test.shape[0]:,} samples')

# Baseline score
baseline_model = lgb.LGBMClassifier(n_estimators=100, learning_rate=0.1, random_state=RANDOM_STATE, verbose=-1)
baseline_scores = cross_val_score(baseline_model, X_train, y_train, cv=cv, scoring='roc_auc')
print(f'\nBaseline CV AUC: {baseline_scores.mean():.4f} (+/- {baseline_scores.std():.4f})')

## 3. LightGBM Hyperparameter Optimization

In [None]:
print("=" * 50)
print("LightGBM Hyperparameter Optimization")
print("=" * 50)

# Extended parameter grid + regularization
lgb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.03, 0.05, 0.1],
    'num_leaves': [31, 50, 70],
    'max_depth': [5, 7, 10],
    'subsample': [0.8, 0.9],
    'colsample_bytree': [0.8, 0.9],
    'reg_alpha': [0, 0.1],      # L1 regularization
    'reg_lambda': [0, 1],       # L2 regularization
}

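# In LightGBM, subsample only takes effect when subsample_freq > 0, so bagging is enabled at every iteration
lgb_model = lgb.LGBMClassifier(random_state=RANDOM_STATE, subsample_freq=1, verbose=-1, n_jobs=-1)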

from sklearn.model_selection import RandomizedSearchCV

lgb_grid = RandomizedSearchCV(
    lgb_model,
    lgb_param_grid,
    n_iter=50,  # 50 random combinations
    cv=3,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
    random_state=RANDOM_STATE
)

lgb_grid.fit(X_train, y_train)

print(f'\nBest parameters: {lgb_grid.best_params_}')
print(f'Best CV AUC: {lgb_grid.best_score_:.4f}')

## 4. XGBoost Hyperparameter Optimization

In [None]:
print("=" * 50)
print("XGBoost Hyperparameter Optimization")
print("=" * 50)

# Extended parameter grid + regularization
xgb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.03, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 0.9],
    'colsample_bytree': [0.8, 0.9],
    'reg_alpha': [0, 0.1],      # L1 regularization
    'reg_lambda': [1, 2],       # L2 regularization
}

xgb_model = xgb.XGBClassifier(
    random_state=RANDOM_STATE, 
    eval_metric='logloss',
    n_jobs=-1
)

# Randomized search
from sklearn.model_selection import RandomizedSearchCV

xgb_grid = RandomizedSearchCV(
    xgb_model,
    xgb_param_grid,
    n_iter=50,  # 50 random combinations
    cv=3,
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
    random_state=RANDOM_STATE
)

xgb_grid.fit(X_train, y_train)

print(f'\nBest parameters: {xgb_grid.best_params_}')
print(f'Best CV AUC: {xgb_grid.best_score_:.4f}')

## 5. Model Comparison

In [None]:
print("=" * 50)
print("MODEL COMPARISON")
print("=" * 50)

# Best models
best_lgb = lgb_grid.best_estimator_
best_xgb = xgb_grid.best_estimator_

# Test set predictions
lgb_pred_proba = best_lgb.predict_proba(X_test)[:, 1]
xgb_pred_proba = best_xgb.predict_proba(X_test)[:, 1]

# Test AUC
lgb_test_auc = roc_auc_score(y_test, lgb_pred_proba)
xgb_test_auc = roc_auc_score(y_test, xgb_pred_proba)

# Results table
results = pd.DataFrame({
    'Model': ['LightGBM', 'XGBoost'],
    'CV AUC': [lgb_grid.best_score_, xgb_grid.best_score_],
    'Test AUC': [lgb_test_auc, xgb_test_auc]
}).sort_values('Test AUC', ascending=False)

print(results.to_string(index=False))

# Best model
best_model_name = results.iloc[0]['Model']
print(f'\nBest model: {best_model_name}')

## 6. Final Model and Visualization

In [None]:
# Select the best model
if best_model_name == 'LightGBM':
    final_model = best_lgb
    final_pred_proba = lgb_pred_proba
else:
    final_model = best_xgb
    final_pred_proba = xgb_pred_proba

final_pred = (final_pred_proba >= 0.5).astype(int)

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# ROC Curve
ax1 = axes[0]
fpr, tpr, _ = roc_curve(y_test, final_pred_proba)
ax1.plot(fpr, tpr, 'b-', linewidth=2, label=f'{best_model_name} (AUC={roc_auc_score(y_test, final_pred_proba):.4f})')
ax1.plot([0, 1], [0, 1], 'k--', linewidth=1)
ax1.set_xlabel('False Positive Rate')
ax1.set_ylabel('True Positive Rate')
ax1.set_title('ROC Curve', fontsize=14, fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Confusion Matrix
ax2 = axes[1]
cm = confusion_matrix(y_test, final_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax2,
            xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
ax2.set_xlabel('Predicted')
ax2.set_ylabel('Actual')
ax2.set_title('Confusion Matrix', fontsize=14, fontweight='bold')

plt.tight_layout()
plt.savefig('../docs/model_performance.png', dpi=150, bbox_inches='tight')
plt.show()

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, final_pred, target_names=['No', 'Yes']))

## 7. Results and Summary

In [None]:
print("=" * 60)
print("MODEL OPTIMIZATION SUMMARY")
print("=" * 60)

print(f"""
📊 RESULTS
{'─' * 40}
Baseline CV AUC:        {baseline_scores.mean():.4f}
LightGBM CV AUC:        {lgb_grid.best_score_:.4f}
XGBoost CV AUC:         {xgb_grid.best_score_:.4f}

Selected Model:         {best_model_name}
Test AUC:               {roc_auc_score(y_test, final_pred_proba):.4f}

Best Parameters:
""")

if best_model_name == 'LightGBM':
    for k, v in lgb_grid.best_params_.items():
        print(f"   {k}: {v}")
else:
    for k, v in xgb_grid.best_params_.items():
        print(f"   {k}: {v}")



## 8. Documentation and Notes

### What Did We Do in This Notebook?

In this notebook we compared two gradient boosting algorithms (LightGBM and XGBoost) and tuned their hyperparameters.

**Steps:**
1. We used the feature set from the previous notebook
2. We dropped the `duration` column because this information is not available in production
3. We tried 50 different parameter combinations per model with RandomizedSearchCV
4. We added regularization parameters (L1, L2)
5. We evaluated both models on the test set
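
To put the budget of 50 sampled combinations in step 3 into perspective, the sketch below counts how many combinations each discrete grid contains and what share a 50-draw random search covers. The grids are copied from the tuning cells; `grid_size` and `coverage` are helper names introduced here for illustration only.

```python
from math import prod

# Search spaces copied from the tuning cells of this notebook.
lgb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.03, 0.05, 0.1],
    'num_leaves': [31, 50, 70],
    'max_depth': [5, 7, 10],
    'subsample': [0.8, 0.9],
    'colsample_bytree': [0.8, 0.9],
    'reg_alpha': [0, 0.1],
    'reg_lambda': [0, 1],
}
xgb_param_grid = {
    'n_estimators': [100, 200, 300],
    'learning_rate': [0.03, 0.05, 0.1],
    'max_depth': [3, 5, 7],
    'subsample': [0.8, 0.9],
    'colsample_bytree': [0.8, 0.9],
    'reg_alpha': [0, 0.1],
    'reg_lambda': [1, 2],
}

N_ITER = 50  # same budget as the RandomizedSearchCV calls


def grid_size(param_grid):
    """Number of distinct combinations in a discrete parameter grid (hypothetical helper)."""
    return prod(len(values) for values in param_grid.values())


coverage = {}
for name, grid in [('LightGBM', lgb_param_grid), ('XGBoost', xgb_param_grid)]:
    total = grid_size(grid)
    coverage[name] = (total, N_ITER / total)
    print(f'{name}: {total:,} combinations, {N_ITER / total:.1%} sampled')
```

The LightGBM grid has 1,296 combinations and the XGBoost grid 432, so 50 draws cover roughly 3.9% and 11.6% of them respectively. The search is therefore a sample of each space, not an exhaustive sweep.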

### Results and Comments

The AUC of ~0.79-0.80 obtained without `duration` is a realistic and good result for this dataset. A drop in performance after removing the `duration` feature is expected, because that variable was very strongly correlated with the target.

Hyperparameter optimization did not bring a noticeable improvement over the baseline. This suggests that the model was already performing well and that we are close to the limits of what this dataset can offer.
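
One way to make "no noticeable improvement" concrete is to compare the mean AUC gain with the fold-to-fold spread of the scores. The following is a minimal sketch under stated assumptions: `gain_vs_fold_noise` is a hypothetical helper, and the example numbers are placeholders chosen for illustration, not results from this notebook.

```python
import numpy as np


def gain_vs_fold_noise(baseline_scores, tuned_scores):
    """Return (mean gain, pooled fold std, gain / pooled std) for two sets of fold scores."""
    baseline = np.asarray(baseline_scores, dtype=float)
    tuned = np.asarray(tuned_scores, dtype=float)
    gain = tuned.mean() - baseline.mean()
    # Pooled sample standard deviation of the per-fold scores.
    pooled_std = np.sqrt((baseline.var(ddof=1) + tuned.var(ddof=1)) / 2)
    ratio = gain / pooled_std if pooled_std > 0 else np.nan
    return gain, pooled_std, ratio


# Placeholder numbers for illustration only; NOT results from this notebook.
example_baseline = [0.790, 0.800, 0.780, 0.795, 0.785]
example_tuned = [0.792, 0.803, 0.781, 0.797, 0.786]

gain, spread, ratio = gain_vs_fold_noise(example_baseline, example_tuned)
print(f'gain={gain:.4f}, fold spread={spread:.4f}, ratio={ratio:.2f}')
```

A ratio well below 1 means the gain is smaller than the variation between folds. In this notebook the baseline is scored with the 5-fold `cv` object while the searches use `cv=3`, so the CV numbers are not strictly like-for-like. Re-scoring the tuned estimator with `cross_val_score(best_lgb, X_train, y_train, cv=cv, scoring='roc_auc')` would give fold scores that can be passed to the helper alongside `baseline_scores`.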

### Key Decisions

- **Duration removed:** In a real-world scenario, we cannot know the call duration before the customer answers the call. Removing this feature from the model was therefore the right decision.

- **Regularization:** L1 and L2 regularization parameters were added to prevent overfitting. However, they did not make a big difference in this case.

- **Model selection:** LightGBM and XGBoost performed very similarly. Both models are suitable for production.
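
When two models score this close, a paired bootstrap on the test set is one way to check whether the AUC gap is distinguishable from sampling noise. The sketch below is an assumption-laden illustration: `auc_score` and `bootstrap_auc_diff` are hypothetical helpers written in plain NumPy, and the data are synthetic stand-ins for `y_test`, `lgb_pred_proba` and `xgb_pred_proba`.

```python
import numpy as np


def auc_score(y_true, scores):
    """ROC AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))


def bootstrap_auc_diff(y_true, scores_a, scores_b, n_boot=500, seed=42):
    """Percentile 95% interval for AUC(a) - AUC(b), resampling rows jointly for both models."""
    y_true = np.asarray(y_true)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    if y_true.min() == y_true.max():
        raise ValueError('y_true must contain both classes')
    rng = np.random.default_rng(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, n, size=n)
        if y_true[idx].min() == y_true[idx].max():
            continue  # resample contained a single class, AUC is undefined
        diffs.append(auc_score(y_true[idx], scores_a[idx])
                     - auc_score(y_true[idx], scores_b[idx]))
    low, high = np.percentile(diffs, [2.5, 97.5])
    return low, high


# Synthetic stand-ins for y_test, lgb_pred_proba and xgb_pred_proba.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=400)
signal = y_demo + rng.normal(0, 1.0, size=400)
scores_a = signal + rng.normal(0, 0.1, size=400)
scores_b = signal + rng.normal(0, 0.1, size=400)

low, high = bootstrap_auc_diff(y_demo, scores_a, scores_b)
print(f'95% bootstrap interval for the AUC difference: [{low:.4f}, {high:.4f}]')
```

If the interval computed on the real test predictions contains zero, the data do not separate the two models, which would be consistent with treating either one as acceptable. Inside the notebook, `roc_auc_score` from scikit-learn could replace `auc_score`.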

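As a rough intuition for why the L1 and L2 settings tried here changed little, the sketch below evaluates the standard second-order leaf-value formula from the gradient boosting literature, with L1 acting as a soft threshold on the gradient sum and L2 added to the Hessian sum. `leaf_weight` is a hypothetical helper, the gradient and Hessian sums are made-up inputs, and the real libraries apply further details (learning rate, minimum child weight and similar), so this is illustrative only.

```python
def leaf_weight(grad_sum, hess_sum, reg_alpha=0.0, reg_lambda=1.0):
    """Second-order optimal leaf value with L1 soft-thresholding and L2 shrinkage."""
    # L1: shrink the gradient sum toward zero by reg_alpha (soft threshold).
    if grad_sum > reg_alpha:
        shrunk = grad_sum - reg_alpha
    elif grad_sum < -reg_alpha:
        shrunk = grad_sum + reg_alpha
    else:
        shrunk = 0.0
    # L2: enlarge the denominator, which pulls the leaf value toward zero.
    return -shrunk / (hess_sum + reg_lambda)


# Made-up gradient/Hessian sums for a leaf holding on the order of 100 samples.
GRAD_SUM, HESS_SUM = -10.0, 25.0

for alpha, lam in [(0.0, 0.0), (0.1, 0.0), (0.0, 1.0), (0.1, 1.0)]:
    value = leaf_weight(GRAD_SUM, HESS_SUM, reg_alpha=alpha, reg_lambda=lam)
    print(f'reg_alpha={alpha}, reg_lambda={lam}: leaf value = {value:.4f}')
```

With sums of this size, penalties in the range searched above (`reg_alpha` up to 0.1, `reg_lambda` up to 2) move the leaf value by only a few percent, which is one plausible reason the regularization grid had little visible effect.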
