# Predictive Analytics: Issue Priority Prediction (Breast Cancer Dataset)

This notebook demonstrates a complete workflow to **import**, **clean**, **label**, **split**, and **model** the Kaggle *Breast Cancer Wisconsin (Diagnostic)* dataset to predict **issue priority** (high / medium / low).

> ⚠️ Note: The original dataset has a **binary diagnosis** (malignant vs. benign).  
> For this exercise, we derive a 3-class **priority** label using a simple, transparent rule. We first train a small risk model **only on the training set** to estimate malignancy probability. We then **bin** those probabilities into Low / Medium / High at the tertile boundaries (roughly the 33rd and 67th percentiles) **computed on the training data**. Finally, we train a **Random Forest** on the training set to predict these priority labels and evaluate it on a held-out test set.


## 0. Environment & Reproducibility

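- Uses Python with `numpy`, `pandas`, and `scikit-learn`.  
- Random seeds are fixed (`random_state=42`) for the train/test split, the cross-validation folds, and the Random Forest.  
- The notebook tries to download from **Kaggle** if your Kaggle API credentials are configured (`~/.kaggle/kaggle.json`); otherwise it **falls back** to `sklearn.datasets.load_breast_cancer` (the same UCI dataset, with differently named columns).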

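As a small aid to reproducibility, the sketch below records the library versions and a shared seed. The `SEED` constant and the `versions` dict are illustrative names that do not appear elsewhere in this notebook; the value 42 simply matches the `random_state` used in the cells that follow.

```python
import sys

import numpy as np
import pandas as pd
import sklearn

# Hypothetical shared seed; 42 matches the random_state values used later.
SEED = 42

# Record the versions this run used so results can be compared across machines.
versions = {
    "python": sys.version.split()[0],
    "numpy": np.__version__,
    "pandas": pd.__version__,
    "scikit-learn": sklearn.__version__,
}
for name, version in versions.items():
    print(f"{name}: {version}")
```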

In [None]:
# 1. Data Import (Kaggle if available, else sklearn fallback)

import os
import zipfile
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Toggle this to force fallback without touching Kaggle:
FORCE_FALLBACK = False

def load_data():
    # If Kaggle credentials exist and not forcing fallback, try Kaggle
    kaggle_creds = os.path.expanduser("~/.kaggle/kaggle.json")
    if (not FORCE_FALLBACK) and os.path.exists(kaggle_creds):
        try:
            # This dataset is a mirror of the UCI Wisconsin (Diagnostic) data:
            # https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
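            # Create the target directory before the CLI writes into it
            os.makedirs("./data", exist_ok=True)
            print("Attempting Kaggle download...")
            status = os.system("kaggle datasets download -d uciml/breast-cancer-wisconsin-data -p ./data -q")
            # os.system does not raise on failure, so check the exit status explicitly
            if status != 0:
                raise RuntimeError(f"kaggle CLI exited with status {status}")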
            # Unzip if needed
            for fname in os.listdir("./data"):
                if fname.endswith(".zip"):
                    with zipfile.ZipFile(os.path.join("./data", fname), "r") as zf:
                        zf.extractall("./data")
            # Look for CSV
            candidates = [f for f in os.listdir("./data") if f.lower().endswith(".csv")]
            if candidates:
                csv_path = os.path.join("./data", candidates[0])
                df = pd.read_csv(csv_path)
                print(f"Loaded Kaggle CSV: {csv_path}")
                return df
        except Exception as e:
            print("Kaggle download failed, falling back to sklearn:", e)

    # Fallback to sklearn (UCI) dataset (same content, different packaging)
    print("Falling back to sklearn.load_breast_cancer")
    sk = load_breast_cancer(as_frame=True)
    df = sk.frame.copy()
    df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
    # Align naming to Kaggle where possible:
    # Kaggle has 'diagnosis' as 'M'/'B'
    df['diagnosis'] = df['target'].map({0: 'M', 1: 'B'})
    return df

df = load_data()
df.head()

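A quick check after loading can catch packaging differences early. The sketch below runs against the sklearn fallback only (it does not touch Kaggle) and assumes the standard UCI counts of 569 rows, 30 features, 212 malignant and 357 benign cases. The names `fallback` and `counts` are illustrative.

```python
from sklearn.datasets import load_breast_cancer

# Load the sklearn copy of the dataset as a DataFrame (features + 'target').
fallback = load_breast_cancer(as_frame=True).frame.copy()

# sklearn encodes malignant as 0 and benign as 1, hence this mapping.
fallback["diagnosis"] = fallback["target"].map({0: "M", 1: "B"})

counts = fallback["diagnosis"].value_counts()
print(fallback.shape)  # (rows, columns): 30 features + target + diagnosis
print(counts.to_dict())
print("nulls:", int(fallback.isna().sum().sum()))
```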

## 2. Clean & Label

- **Cleaning**: standardize column names, drop obvious identifiers if present (e.g., `id`), check nulls.  
- **Labeling**: derive **priority** ∈ {`low`, `medium`, `high`} from a *risk proxy* computed *only on the training set*:
  1. Split into **train/test** first (to avoid leakage).
  2. On **train only**, fit a simple logistic regression and use 5-fold cross-validation to obtain out-of-fold malignancy probabilities.
  3. Compute the tertile thresholds on these train probabilities: the 1/3 and 2/3 quantiles, written `p33` and `p66` (roughly the **33rd** and **67th** percentiles).
  4. Map probability → priority: `low` (p ≤ p33), `medium` (p33 < p ≤ p66), `high` (p > p66).
  5. Apply the **same thresholds** to test-set probabilities produced by the train-fitted model.

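The mapping in steps 3 to 5 can be sketched in vectorized form. This is an illustration on synthetic probabilities, not the notebook's own cell: `train_proba`, `test_proba`, and `bin_priority` are hypothetical names. `np.searchsorted` with `side="left"` is used so that a value equal to a threshold falls into the lower bin, matching the `≤` rule above.

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic stand-ins for model probabilities (assumption: not real model output).
train_proba = rng.beta(0.5, 0.5, size=300)
test_proba = rng.beta(0.5, 0.5, size=100)

# Thresholds come from the TRAIN probabilities only.
p33, p66 = np.quantile(train_proba, [1 / 3, 2 / 3])

LABELS = np.array(["low", "medium", "high"])

def bin_priority(proba, thresholds):
    """Map probabilities to priority labels; p <= threshold goes to the lower bin."""
    idx = np.searchsorted(thresholds, proba, side="left")
    return LABELS[idx]

train_pri = bin_priority(train_proba, [p33, p66])
test_pri = bin_priority(test_proba, [p33, p66])  # same thresholds reused on test

print(dict(zip(*np.unique(train_pri, return_counts=True))))
print(dict(zip(*np.unique(test_pri, return_counts=True))))
```

By construction the training labels split into near-equal thirds, while the test labels need not, because the thresholds are never recomputed on test data.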

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Basic cleaning
df.columns = [c.strip().lower().replace(' ', '_') for c in df.columns]
df = df.copy()

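# Drop an 'id' column if present in Kaggle CSV
if 'id' in df.columns:
    df = df.drop(columns=['id'])

# Drop columns that are entirely empty; the Kaggle CSV commonly carries a
# trailing all-NaN 'Unnamed: 32' column that would break model fitting
df = df.dropna(axis=1, how='all')

# Check nulls: report any remaining missing values
print("Remaining null values:", int(df.isna().sum().sum()))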

# Ensure 'diagnosis' column exists (Kaggle has 'diagnosis' B/M)
if 'diagnosis' not in df.columns and 'target' in df.columns:
    df['diagnosis'] = df['target'].map({0:'M', 1:'B'})

# Split before deriving priority
features = [c for c in df.columns if c not in ('diagnosis','target')]
X = df[features]
y_diag = df['diagnosis']  # 'M' or 'B'

X_train, X_test, y_train_diag, y_test_diag = train_test_split(
    X, y_diag, test_size=0.2, random_state=42, stratify=y_diag
)

# Risk proxy model (train only)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

logit = LogisticRegression(max_iter=500)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold probs on TRAIN
oof_proba = cross_val_predict(
    logit, X_train_scaled, (y_train_diag == 'M').astype(int),
    cv=cv, method='predict_proba'
)[:, 1]

# Fit on all TRAIN, then score TEST
logit.fit(X_train_scaled, (y_train_diag == 'M').astype(int))
test_proba = logit.predict_proba(X_test_scaled)[:, 1]

# Compute thresholds on TRAIN
p33 = np.quantile(oof_proba, 1/3)
p66 = np.quantile(oof_proba, 2/3)

def to_priority(p):
    if p <= p33:
        return 'low'
    elif p <= p66:
        return 'medium'
    else:
        return 'high'

y_train_pri = pd.Series([to_priority(p) for p in oof_proba], index=X_train.index)
y_test_pri = pd.Series([to_priority(p) for p in test_proba], index=X_test.index)

# Quick sanity check: class proportions of each split, side by side
pd.concat(
    [y_train_pri.value_counts(normalize=True).rename('train_class_mix'),
     y_test_pri.value_counts(normalize=True).rename('test_class_mix')],
    axis=1
)

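One possible refinement, not part of the original cell: the cell above fits the scaler on the full training split before cross-validation, so each fold's held-out rows influence the scaling statistics. Wrapping the scaler and the classifier in a pipeline refits both inside every fold. The sketch below uses the sklearn copy of the data, and `risk_model`, `y_malignant`, `oof_proba_pipe`, and `test_proba_pipe` are illustrative names.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer(as_frame=True)
X = data.data
y_malignant = (data.target == 0).astype(int)  # sklearn: 0 = malignant

X_train, X_test, y_train, y_test = train_test_split(
    X, y_malignant, test_size=0.2, random_state=42, stratify=y_malignant
)

# Scaler + classifier in one estimator, so scaling is refit within each CV fold.
risk_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Out-of-fold malignancy probabilities on TRAIN.
oof_proba_pipe = cross_val_predict(
    risk_model, X_train, y_train, cv=cv, method="predict_proba"
)[:, 1]

# Fit on all of TRAIN, then score TEST with the same pipeline.
risk_model.fit(X_train, y_train)
test_proba_pipe = risk_model.predict_proba(X_test)[:, 1]

print(oof_proba_pipe.shape, test_proba_pipe.shape)
```

On a dataset of this size the effect on the thresholds is likely small, but the pipeline form keeps the out-of-fold estimates free of information from the held-out fold.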

## 3. Model: Random Forest

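We fit a **RandomForestClassifier** on the **training** split to predict the 3-class priority, then evaluate it on the **test** split. The forest is trained on the raw (unscaled) features, since tree-based models are insensitive to feature scaling.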


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, classification_report

rf = RandomForestClassifier(
    n_estimators=400,
    random_state=42,
    class_weight='balanced_subsample'
)
rf.fit(X_train, y_train_pri)
y_pred = rf.predict(X_test)

acc = accuracy_score(y_test_pri, y_pred)
f1_macro = f1_score(y_test_pri, y_pred, average='macro')

print("Accuracy:", acc)
print("Macro F1:", f1_macro)
print("\nClassification report:\n", classification_report(y_test_pri, y_pred, digits=4))


## 4. Results (Accuracy & F1)

Below are the key results captured from a test run:

**Key Results (on test set)**  
- Accuracy: **0.9035**  
- Macro F1-score: **0.9057**

<details>
<summary>Classification report</summary>

```
              precision    recall  f1-score   support

        high     0.9412    0.9697    0.9552        33
         low     0.9500    0.8636    0.9048        44
      medium     0.8250    0.8919    0.8571        37

    accuracy                         0.9035       114
   macro avg     0.9054    0.9084    0.9057       114
weighted avg     0.9069    0.9035    0.9039       114

```
</details>

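As a worked check of how the summary rows relate to the per-class rows: the macro average is the unweighted mean of the per-class F1 scores, while the weighted average scales each class by its support. The sketch below recomputes both from the rounded values in the report above; the variable names are illustrative.

```python
# Per-class F1 and support, copied from the classification report above.
f1 = {"high": 0.9552, "low": 0.9048, "medium": 0.8571}
support = {"high": 33, "low": 44, "medium": 37}

total = sum(support.values())

# Macro F1: every class counts equally.
macro_f1 = sum(f1.values()) / len(f1)

# Weighted F1: classes count in proportion to their support.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / total

print(total)                  # 114
print(round(macro_f1, 4))     # 0.9057
print(round(weighted_f1, 4))  # 0.9039
```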

## 5. Notes & Next Steps

- The 3-class **priority** is a derived target for demonstration purposes (the source data is binary).  
- Consider **calibrated** models and domain-approved thresholds for real triage systems.  
- Try alternative models (XGBoost, LightGBM) and perform **hyperparameter tuning**.  
- Add **confusion matrix** and **feature importance** plots for more insight.

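The last suggestion can be sketched in tabular form, leaving plotting aside to avoid extra dependencies. To keep the example short and self-contained, it uses the binary diagnosis from the sklearn copy of the data as a stand-in target (an assumption made here for brevity); with this notebook's variables you would pass `y_test_pri` and `y_pred` instead. The names `cm` and `importances` are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = load_breast_cancer(as_frame=True)
X, y = data.data, data.target.map({0: "M", 1: "B"})  # stand-in binary target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

# Rows are true classes, columns are predicted classes, in rf.classes_ order.
cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred, labels=rf.classes_),
    index=[f"true_{c}" for c in rf.classes_],
    columns=[f"pred_{c}" for c in rf.classes_],
)
print(cm)

# Impurity-based importances; they sum to 1 across features.
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```

Impurity-based importances can favor correlated or high-variance features, so they are best read as a rough ranking rather than a causal statement.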