# Lab: Binary Classification & Metrics from Scratch
 
## Objectives
1.  Train a simple **Logistic Regression** model using `scikit-learn`.
2.  Understand the difference between `.predict()` and `.predict_proba()`.
3.  **Implement evaluation metrics from scratch** (without using `sklearn.metrics` functions) to understand the mathematics behind them.
4.  Visualize the Confusion Matrix, ROC Curve, and Precision-Recall Curve.




## 1. Setup and Data Loading 
We will use the **Breast Cancer Wisconsin dataset**, a classic binary classification dataset.
* **Input (X):** Features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass.
* **Target (y):** Diagnosis (0 = Malignant, 1 = Benign). Because class 1 is benign, "positive" in the metrics below means *benign*.

*Note: We are using standard libraries for data handling and plotting, but we will avoid `sklearn.metrics`.*


In [None]:
!pip install pandas numpy matplotlib seaborn scikit-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

sns.set_style("whitegrid")

data = load_breast_cancer()
X = data.data
y = data.target

print(f"Feature shape: {X.shape}")
print(f"Target shape: {y.shape}")
print(f"Class distribution: {np.bincount(y)} (0: Malignant, 1: Benign)")

## 2. Data Splitting and Training

To evaluate a model properly, we must train it on one set of data and test it on unseen data.

**Task:**
1.  Import `train_test_split` from `sklearn.model_selection`.
2.  Split `X` and `y` into training and testing sets. Use `test_size=0.2` and `random_state=42`.
3.  Initialize a `LogisticRegression` model (use `max_iter=10000` so the solver has enough iterations to converge on these unscaled features).
4.  Fit the model on the training data.
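
To make the idea of a held-out set concrete, here is a minimal NumPy sketch of a shuffled 80/20 split. It is only an illustration: the helper name `simple_split` and the toy arrays are hypothetical, and this is not a description of how `train_test_split` works internally.

```python
import numpy as np

def simple_split(X, y, test_size=0.2, seed=42):
    """Shuffle row indices, then hold out the first `test_size` fraction as the test set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data (hypothetical): 10 samples, 2 features
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

X_tr, X_te, y_tr, y_te = simple_split(X_toy, y_toy)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)
```

Every row ends up in exactly one of the two sets, so the model never sees the test rows during fitting.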

In [None]:
from sklearn.model_selection import train_test_split

# 2. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Initialize model
model = LogisticRegression(max_iter=10000)

# 4. Fit model
model.fit(X_train, y_train)

## 3. Predictions
 
Scikit-learn classifiers typically provide two prediction methods:
* `.predict(X)`: Returns the hard class labels (0 or 1).
* `.predict_proba(X)`: Returns the estimated probability of each class, one column per class.
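
The relationship between the two outputs can be illustrated without a model. The sketch below uses a hypothetical array `proba_toy` laid out like the output of `.predict_proba()` (one row per sample, one column per class) and derives hard labels from it. For a binary problem, picking the most probable class gives the same labels as thresholding the class-1 column at 0.5.

```python
import numpy as np

# Hypothetical probabilities: column 0 = P(class 0), column 1 = P(class 1)
proba_toy = np.array([
    [0.90, 0.10],
    [0.20, 0.80],
    [0.45, 0.55],
    [0.70, 0.30],
])

# Each row is a probability distribution over the two classes
assert np.allclose(proba_toy.sum(axis=1), 1.0)

labels_argmax = proba_toy.argmax(axis=1)             # most probable class per row
labels_thresh = (proba_toy[:, 1] > 0.5).astype(int)  # threshold the class-1 probability

print(labels_argmax)  # [0 1 1 0]
print(labels_thresh)  # [0 1 1 0]
```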
 
**Task:** Generate predictions for the **Test Set**.

In [None]:
# Get hard predictions (0 or 1)
y_pred = model.predict(X_test)

# Get probabilities
# Note: predict_proba returns one column per class, ordered as in model.classes_ ([0, 1] here).
# Column 1 is the probability of class 1, which is usually the one we want.
y_prob = model.predict_proba(X_test)[:, 1]

In [None]:
print(f"First 5 predictions: {y_pred[:5]}")
print(f"First 5 probabilities (of class 1): {y_prob[:5]}")

## 4. The Confusion Matrix (From Scratch)

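A confusion matrix counts how often each actual class was predicted as each class. In the layout below, rows are the actual class (0, then 1) and columns are the predicted class (0, then 1).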

$$
\begin{bmatrix}
TN & FP \\
 FN & TP
 \end{bmatrix}
 $$
 
 * **True Positive (TP):** Model predicted 1, Actual was 1.
 * **True Negative (TN):** Model predicted 0, Actual was 0.
 * **False Positive (FP):** Model predicted 1, Actual was 0 (Type I Error).
 * **False Negative (FN):** Model predicted 0, Actual was 1 (Type II Error).
 
**Task:** Calculate TP, TN, FP, and FN using `y_test` and `y_pred`. Do **not** use `sklearn.metrics.confusion_matrix`.
 
*Hint: You can use boolean masking or numpy summation. E.g., `((y_test == 1) & (y_pred == 1)).sum()`*
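
As a small illustration of the boolean-mask hint, the sketch below counts the four cells for two hypothetical toy arrays (`y_true_toy` and `y_hat_toy`, not the lab data). The same pattern should carry over to `y_test` and `y_pred`.

```python
import numpy as np

# Hypothetical actual labels and predictions
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_hat_toy = np.array([1, 0, 0, 1, 1, 0, 1, 0])

# Each comparison yields a boolean array; `&` keeps positions where both hold
tp_toy = int(((y_true_toy == 1) & (y_hat_toy == 1)).sum())
tn_toy = int(((y_true_toy == 0) & (y_hat_toy == 0)).sum())
fp_toy = int(((y_true_toy == 0) & (y_hat_toy == 1)).sum())
fn_toy = int(((y_true_toy == 1) & (y_hat_toy == 0)).sum())

print(tp_toy, tn_toy, fp_toy, fn_toy)  # 3 3 1 1

# Sanity check: the four cells account for every sample
assert tp_toy + tn_toy + fp_toy + fn_toy == len(y_true_toy)
```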


In [None]:
# Placeholders: replace each 0 with your own calculation
TP = 0
TN = 0
FP = 0
FN = 0

In [None]:
print(f"TP: {TP}, TN: {TN}, FP: {FP}, FN: {FN}")

### Visualization: Confusion Matrix
Run the cell below to visualize your matrix.

In [None]:
def plot_custom_confusion_matrix(tp, tn, fp, fn):
    matrix = np.array([[tn, fp], [fn, tp]])
    plt.figure(figsize=(6, 5))
    sns.heatmap(matrix, annot=True, fmt='d', cmap='Blues', cbar=False,
                xticklabels=['Predicted 0', 'Predicted 1'],
                yticklabels=['Actual 0', 'Actual 1'])
    plt.title('Confusion Matrix')
    plt.show()

plot_custom_confusion_matrix(TP, TN, FP, FN)

## 5. Classification Metrics (From Scratch)
 
Now we will calculate the core metrics using the variables (TP, TN, FP, FN) you calculated above.
 
### Definitions
 
 1.  **Accuracy:** How often is the classifier correct overall?
     $$ Accuracy = \frac{TP + TN}{TP + TN + FP + FN} $$
 
 2.  **Recall (Sensitivity):** Out of all actual positives, how many did we identify?
     $$ Recall = \frac{TP}{TP + FN} $$
 
 3.  **Precision:** Out of all predicted positives, how many were actually positive?
     $$ Precision = \frac{TP}{TP + FP} $$
 
 4.  **F1 Score:** The harmonic mean of Precision and Recall.
     $$ F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall} $$
 
 **Task:** Implement these formulas.
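
The formulas can be checked on hypothetical counts before applying them to your own. The sketch below assumes made-up values (`tp_ex`, `tn_ex`, `fp_ex`, `fn_ex`) and a hypothetical helper `safe_div` that returns 0.0 when a denominator is zero, for example when a model never predicts the positive class. Returning 0.0 in that case is one common convention, not the only one.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0

# Hypothetical confusion-matrix counts (100 samples in total)
tp_ex, tn_ex, fp_ex, fn_ex = 40, 45, 5, 10

acc_ex = safe_div(tp_ex + tn_ex, tp_ex + tn_ex + fp_ex + fn_ex)
rec_ex = safe_div(tp_ex, tp_ex + fn_ex)
prec_ex = safe_div(tp_ex, tp_ex + fp_ex)
f1_ex = safe_div(2 * prec_ex * rec_ex, prec_ex + rec_ex)

print(f"{acc_ex:.4f} {prec_ex:.4f} {rec_ex:.4f} {f1_ex:.4f}")
# 0.8500 0.8889 0.8000 0.8421
```

Because F1 is a harmonic mean, it lies between precision and recall and sits closer to the smaller of the two.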

In [None]:
# Placeholders: replace each 0 with the corresponding formula from the definitions above
accuracy = 0
recall = 0
precision = 0
f1_score = 0

In [None]:
print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 Score:  {f1_score:.4f}")

## 6. Thresholds and Curves
 
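The predictions (`y_pred`) you generated earlier used a default threshold of **0.5** on the class-1 probability:
if $Probability > 0.5$, predict 1; otherwise, predict 0.
 
However, we can change this threshold to trade precision against recall. Raising it yields fewer false positives (usually higher precision) but more false negatives (lower recall); lowering it does the reverse.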
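
The effect is easiest to see on a handful of made-up numbers. The sketch below assumes hypothetical labels `y_true_demo` and class-1 probabilities `p_demo` (not the lab data) and reports precision and recall at three thresholds.

```python
import numpy as np

# Hypothetical labels and class-1 probabilities (not the lab data)
y_true_demo = np.array([0, 0, 1, 0, 1, 0, 1, 1])
p_demo = np.array([0.05, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.95])

results = []
for thresh in (0.30, 0.50, 0.75):
    pred = (p_demo >= thresh).astype(int)
    tp = int(((y_true_demo == 1) & (pred == 1)).sum())
    fp = int(((y_true_demo == 0) & (pred == 1)).sum())
    fn = int(((y_true_demo == 1) & (pred == 0)).sum())
    prec = tp / (tp + fp)  # safe here: each threshold predicts at least one positive
    rec = tp / (tp + fn)
    results.append((thresh, prec, rec))
    print(f"threshold={thresh:.2f}  precision={prec:.3f}  recall={rec:.3f}")

# threshold=0.30  precision=0.667  recall=1.000
# threshold=0.50  precision=0.750  recall=0.750
# threshold=0.75  precision=1.000  recall=0.500
```

On this toy data precision rises steadily as the threshold increases; on real data recall can only fall or stay flat, while precision may fluctuate.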
 
### Task: Threshold Helper Function
Write a function that takes the actual labels (`y_test`), the probabilities (`y_prob`), and a `threshold`, in that order. It should return the **True Positive Rate (TPR)** and **False Positive Rate (FPR)** at that threshold.
 
$$ TPR (Recall) = \frac{TP}{TP + FN} $$
$$ FPR = \frac{FP}{FP + TN} $$
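
As a worked example with hypothetical counts $TP = 40$, $FN = 10$, $FP = 5$, $TN = 45$ (not results from this lab):

$$ TPR = \frac{40}{40 + 10} = 0.8 $$
$$ FPR = \frac{5}{5 + 45} = 0.1 $$

So 80% of the actual positives are caught, while 10% of the actual negatives are wrongly flagged as positive.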

In [None]:
def calculate_tpr_fpr(y_true, y_probs, threshold):
    
    # 1. Create temporary predictions based on threshold
    # temp_pred = (y_probs >= threshold).astype(int)
    
    # 2. Calculate TP, FN, FP, TN manually
    # tp = ...
    # fn = ...
    # fp = ...
    # tn = ...
    
    # 3. Calculate Rates
    # tpr = ...
    # fpr = ...
    
    return 0.0, 0.0 # return tpr, fpr

# Test your function with threshold 0.5 (Should match your previous Recall)
# t, f = calculate_tpr_fpr(y_test, y_prob, 0.5)
# print(f"Threshold 0.5 -> TPR: {t}, FPR: {f}")

### Visualization: ROC and PR Curves

The following code uses your `calculate_tpr_fpr` function to generate the curves. You do not need to write code here; just run it to verify your logic.

1.  **ROC Curve (Receiver Operating Characteristic):** Plots TPR against FPR across thresholds. Ideally, it hugs the top-left corner.
2.  **Precision-Recall Curve:** Plots Precision against Recall across thresholds. Ideally, it hugs the top-right corner.
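
The extreme thresholds determine where the ROC curve starts and ends. The sketch below assumes hypothetical toy data (`y_true_roc`, `p_roc`) and a hypothetical helper `rates_at`, which computes each rate as the fraction of actual positives (or negatives) that are predicted as 1. This is an equivalent way of writing the TPR and FPR formulas, shown here only for illustration.

```python
import numpy as np

# Hypothetical labels and class-1 probabilities (not the lab data)
y_true_roc = np.array([0, 0, 1, 0, 1, 0, 1, 1])
p_roc = np.array([0.05, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.95])

def rates_at(threshold):
    """Return (TPR, FPR): the fraction of positives / negatives predicted as 1."""
    pred = p_roc >= threshold
    tpr = pred[y_true_roc == 1].mean()
    fpr = pred[y_true_roc == 0].mean()
    return float(tpr), float(fpr)

print(rates_at(0.0))  # (1.0, 1.0): everything is predicted positive
print(rates_at(0.5))  # (0.75, 0.25)
print(rates_at(1.0))  # (0.0, 0.0): no probability reaches 1.0 here
```

Sweeping the threshold from 0 to 1 therefore traces the curve from the top-right corner (1, 1) toward the bottom-left corner (0, 0).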

In [None]:
thresholds = np.linspace(0, 1, 101)

tprs = []
fprs = []
precisions = []
recalls = []

# Loop through thresholds and calculate metrics using YOUR function
for thresh in thresholds:
    tpr, fpr = calculate_tpr_fpr(y_test, y_prob, thresh)

    # calculate_tpr_fpr does not return precision, so recompute TP and FP here
    temp_pred = (y_prob >= thresh).astype(int)
    tp = ((y_test == 1) & (temp_pred == 1)).sum()
    fp = ((y_test == 0) & (temp_pred == 1)).sum()
    # With no predicted positives, precision is 0/0; use 1.0 by convention.
    # Named `prec` so the `precision` value from Section 5 is not overwritten.
    prec = tp / (tp + fp) if (tp + fp) > 0 else 1.0

    tprs.append(tpr)
    fprs.append(fpr)
    precisions.append(prec)
    recalls.append(tpr)  # Recall is the same as TPR


In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# ROC Curve
axes[0].plot(fprs, tprs, color='darkorange', lw=2, label='ROC curve')
axes[0].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[0].set_xlim([0.0, 1.0])
axes[0].set_ylim([0.0, 1.05])
axes[0].set_xlabel('False Positive Rate')
axes[0].set_ylabel('True Positive Rate')
axes[0].set_title('Receiver Operating Characteristic (ROC)')
axes[0].legend(loc="lower right")

# Precision-Recall Curve
axes[1].plot(recalls, precisions, color='blue', lw=2, label='PR curve')
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel('Recall')
axes[1].set_ylabel('Precision')
axes[1].set_title('Precision-Recall Curve')
axes[1].legend(loc="lower left")

plt.show()