
# Binary Classification with GUIDE

This notebook demonstrates **GuideGradientBoostingClassifier** on the **Breast Cancer** dataset.

Highlights:
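- **Log Loss Optimization:** The ensemble is trained to minimize log loss, so `predict_proba` returns class probabilities rather than only hard labels.
- **Unbiased Selection:** GUIDE chooses the split variable with a statistical test before searching for a split point, so features are not favored simply because they have many distinct values.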
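
To make the idea of unbiased selection concrete, here is a minimal sketch of a GUIDE-style variable selection step. It is a simplified illustration, not pyguide's actual implementation: the helper `chi2_pvalue`, the choice of four quantile bins, and the synthetic data are all assumptions made for this example. Each feature is reduced to a few bins, cross-tabulated against the sign of the residuals, and scored with a chi-square test. The feature with the smallest p-value is chosen.

```python
import numpy as np
from scipy.stats import chi2_contingency


def chi2_pvalue(feature, residual_sign, n_bins=4):
    """p-value of a chi-square test between a binned feature and residual signs."""
    # Bin at the interior sample quantiles, so every feature is reduced to at
    # most n_bins groups no matter how many distinct values it has.
    edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, feature, side="right")

    # Contingency table: rows are feature bins, columns are residual signs.
    table = np.zeros((n_bins, 2))
    np.add.at(table, (bins, residual_sign), 1)

    # Drop empty rows/columns (e.g. a binary feature fills only two bins).
    table = table[table.sum(axis=1) > 0]
    table = table[:, table.sum(axis=0) > 0]
    if table.shape[0] < 2 or table.shape[1] < 2:
        return 1.0
    return chi2_contingency(table, correction=False)[1]


# Synthetic data (hypothetical): one informative feature and two noise features.
rng = np.random.default_rng(0)
n = 500
X_demo = np.column_stack([
    rng.normal(size=n),          # informative, continuous
    rng.normal(size=n),          # noise, continuous (many distinct values)
    rng.integers(0, 2, size=n),  # noise, binary (two distinct values)
])
y_demo = (X_demo[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(float)

# Residuals of a constant model that predicts the base rate.
residuals = y_demo - y_demo.mean()
residual_sign = (residuals > 0).astype(int)

p_values = np.array(
    [chi2_pvalue(X_demo[:, j], residual_sign) for j in range(X_demo.shape[1])]
)
best = int(np.argmin(p_values))
print("p-values:", p_values)
print("selected feature index:", best)
```

Because every feature is tested on the same small contingency table, the continuous noise feature gets no advantage over the binary one merely by offering more candidate split points.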


In [None]:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

from pyguide import GuideGradientBoostingClassifier

# Load Data
data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target
feature_names = X.columns

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training samples: {len(X_train)}")
print(f"Class distribution: {np.bincount(y_train)}")


## Train Gradient Boosting Model

In [None]:

clf = GuideGradientBoostingClassifier(
    n_estimators=50,
    learning_rate=0.1,
    max_depth=2,
    subsample=0.8,
    random_state=42
)
clf.fit(X_train, y_train)


## Evaluation

In [None]:

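# Predict class probabilities and hard labels.
# Column 1 of y_prob is taken as P(y = 1); in this dataset, class 1 is "benign".
y_prob = clf.predict_proba(X_test)
y_pred = clf.predict(X_test)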

acc = accuracy_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_prob[:, 1])
ll = log_loss(y_test, y_prob)

print(f"Accuracy: {acc:.4f}")
print(f"ROC AUC:  {auc:.4f}")
print(f"Log Loss: {ll:.4f}")
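
As a sanity check on what the log loss number means, the following sketch computes binary log loss by hand and compares it with `sklearn.metrics.log_loss`. The labels and probabilities are made-up values for illustration only, not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import log_loss

# Hypothetical labels and predicted probabilities of the positive class.
y_true = np.array([1, 0, 1, 1])
p_pos = np.array([0.9, 0.2, 0.6, 0.4])

# Binary log loss: mean negative log-likelihood of the true labels.
manual = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))
sk = log_loss(y_true, p_pos)

print(f"manual:  {manual:.4f}")
print(f"sklearn: {sk:.4f}")
```

The last sample (true label 1, predicted 0.4) contributes the largest term, which shows how log loss penalizes probability assigned to the wrong class even when accuracy alone would only count it as one error.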



## Feature Importance

Standard GBMs usually report impurity-based importance. With GUIDE, we can instead look at the structure of the underlying trees.
`GuideGradientBoostingClassifier` does not yet aggregate importance scores automatically, so for now we can inspect individual trees, or use permutation importance once it is supported for ensembles.
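
One possible workaround in the meantime is scikit-learn's model-agnostic `permutation_importance`, sketched below. Whether it works directly with `GuideGradientBoostingClassifier` depends on how closely that class follows the scikit-learn estimator API, which is an assumption not verified here. To keep the sketch runnable on its own, it uses scikit-learn's `GradientBoostingClassifier` as a stand-in model.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data_pi = load_breast_cancer()
X_tr, X_te, y_tr, y_te = train_test_split(
    data_pi.data, data_pi.target, test_size=0.2, random_state=42
)

# Stand-in model (assumption): scikit-learn's own GBM, used only so this
# sketch runs without pyguide. Swap in the fitted GUIDE classifier if it
# turns out to be compatible.
stand_in = GradientBoostingClassifier(
    n_estimators=50, max_depth=2, random_state=42
).fit(X_tr, y_tr)

# Shuffle one feature at a time on the held-out set and measure the AUC drop.
result = permutation_importance(
    stand_in, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=42
)

# Show the five features whose permutation hurts AUC the most.
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    name = data_pi.feature_names[i]
    mean, std = result.importances_mean[i], result.importances_std[i]
    print(f"{name:<25} {mean:.4f} +/- {std:.4f}")
```

Computing the importances on the test split rather than the training split measures how much the model relies on each feature for generalization. Many features in this dataset are strongly correlated, so individual scores can look small because the model can recover the shuffled information from a correlated feature.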
