# Detecting Rare Medical Conditions with Machine Learning

Time estimate: **30** minutes

## Objectives

After completing this lab, you will be able to:

 - Identify and analyze class imbalance in medical datasets
 - Apply Synthetic Minority Oversampling Technique (SMOTE) to handle imbalanced data
 - Evaluate models using Precision, Recall, and Precision-Recall curves
 - Describe why accuracy is misleading for rare conditions
 - Build effective classifiers for detecting rare medical conditions

## What you will do in this lab

In medical diagnostics, detecting rare conditions presents unique challenges. When a disease affects only 1-2% of patients, traditional machine learning approaches often fail. A model that simply predicts "healthy" for everyone would achieve 98-99% accuracy while being completely useless in practice.

This lab addresses the real-world challenge of building reliable diagnostic tools for rare conditions. You will work with a highly imbalanced medical dataset where malignant cases are significantly less common than benign cases, mirroring actual clinical scenarios.

You will:

- Analyze a highly imbalanced medical dataset with rare positive cases
- Train baseline machine learning models and identify their limitations
- Apply Synthetic Minority Oversampling Technique (SMOTE) to balance the training data
- Compare model performance using appropriate metrics for imbalanced data
- Learn to interpret Precision-Recall curves for clinical decision-making

## Overview

Class imbalance is one of the most common challenges in medical machine learning. When positive cases (patients with a condition) represent less than 5% of the dataset, standard classification algorithms become biased toward the majority class. This creates a dangerous situation in medical diagnostics: models may achieve impressive accuracy scores while failing to detect the very cases they were designed to identify.

The problem stems from how most machine learning algorithms optimize their objective functions. They minimize overall error rate, which in an imbalanced dataset means correctly classifying the abundant negative cases. The algorithm "learns" that predicting "negative" most of the time yields good performance metrics, even if this strategy misses all positive cases.

This lab introduces you to techniques specifically designed for imbalanced classification problems. You will use SMOTE, a resampling technique that creates synthetic examples of the minority class by interpolating between existing positive cases. Unlike simple duplication, interpolation produces new points rather than exact copies, which can help the model learn a broader decision region for the positive class and reduce overfitting to individual examples.

Beyond resampling techniques, you will learn to evaluate models using metrics appropriate for imbalanced data. Precision-Recall curves, F1-scores, and careful analysis of confusion matrices provide insights that accuracy alone cannot reveal. These evaluation methods help data scientists and medical professionals make informed decisions about model deployment, threshold selection, and the trade-offs between false positives and false negatives in clinical settings.

## About the dataset

This lab uses a synthetic medical dataset designed to simulate the challenge of detecting rare conditions in clinical practice.

### Dataset overview

The dataset contains medical measurements from 1,000 patients, where malignant cases represent approximately 1% of the total samples. This extreme imbalance (99:1 ratio) reflects realistic scenarios in medical screening and diagnostics, where most patients do not have the condition being tested.

The synthetic nature of this dataset allows for safe learning and experimentation without privacy concerns, while maintaining the statistical properties and challenges found in real medical data. Each patient is represented by 10 numerical features derived from medical measurements, along with a binary target indicating the diagnosis.
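
For offline experimentation, a dataset with a similar shape and imbalance can be generated with scikit-learn's `make_classification`. The sketch below is a stand-in under assumed settings (`n_informative`, `n_redundant`, `flip_y`, and the `demo_df` name are illustrative choices); it is not the procedure used to create the lab's CSV file.

```python
import pandas as pd
from sklearn.datasets import make_classification

# Assumption: these generator settings are illustrative and are not the
# settings used to build the lab's CSV file.
X_demo, y_demo = make_classification(
    n_samples=1000,        # 1,000 simulated patients
    n_features=10,         # 10 numerical measurements
    n_informative=4,
    n_redundant=2,
    weights=[0.99, 0.01],  # roughly 99% benign, 1% malignant
    flip_y=0.0,            # no random label noise
    random_state=42,
)

# Use the same column naming scheme as the lab dataset
columns = [f"feature_{i}" for i in range(1, 11)]
demo_df = pd.DataFrame(X_demo, columns=columns)
demo_df["target"] = y_demo

print(demo_df.shape)
print(demo_df["target"].value_counts())
```

The generated features will not match the lab's values, so results obtained on `demo_df` are not comparable with results on the provided CSV.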

### Column descriptions

1. **feature_1** - Normalized medical measurement (continuous value)
2. **feature_2** - Normalized medical measurement (continuous value)
3. **feature_3** - Normalized medical measurement (continuous value)
4. **feature_4** - Normalized medical measurement (continuous value)
5. **feature_5** - Normalized medical measurement (continuous value)
6. **feature_6** - Normalized medical measurement (continuous value)
7. **feature_7** - Normalized medical measurement (continuous value)
8. **feature_8** - Normalized medical measurement (continuous value)
9. **feature_9** - Normalized medical measurement (continuous value)
10. **feature_10** - Normalized medical measurement (continuous value)
11. **target** - Diagnosis (0 = Benign/Healthy, 1 = Malignant/Condition Present)

## Setup

### Installing required libraries

The following libraries are required to run this lab.

In [None]:
# Install the libraries required for this lab
!pip install -q imbalanced-learn
!pip install -q pandas
!pip install -q numpy
!pip install -q matplotlib
!pip install -q seaborn
!pip install -q scikit-learn

In [None]:
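# Optional: suppress warnings to keep the notebook output readable
import warnings

def warn(*args, **kwargs):
    """No-op replacement that silences calls to warnings.warn."""
    pass

warnings.warn = warn  # override warn() so later calls produce no output
warnings.filterwarnings('ignore')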

### Importing required libraries

In [None]:
# Import libraries for data manipulation and numerical operations
import pandas as pd  # for data manipulation and analysis
import numpy as np  # for numerical operations

# Import visualization libraries
import matplotlib.pyplot as plt  # for creating plots
import seaborn as sns  # for statistical visualizations

# Import tools for splitting data into training and test sets
from sklearn.model_selection import train_test_split  # for data splitting

# Import resampling techniques to handle imbalanced data
from imblearn.over_sampling import SMOTE  # Synthetic Minority Over-sampling Technique

# Import the classifier (machine learning model)
from sklearn.ensemble import RandomForestClassifier  # ensemble learning model

# Import evaluation metrics
from sklearn.metrics import (
    classification_report,  # detailed classification metrics
    confusion_matrix,  # matrix showing prediction outcomes
    precision_recall_curve,  # precision-recall trade-off visualization
    average_precision_score,  # summary metric for precision-recall
    roc_auc_score,  # area under ROC curve
    roc_curve,  # receiver operating characteristic curve
    accuracy_score  # simple accuracy metric
)

print("All libraries imported successfully!")
print("Ready to begin rare condition detection analysis.")

In [None]:
# Set random seed for reproducibility (ensures consistent results across runs)
np.random.seed(42)

# Configure plot style for better visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

print("Environment configured successfully!")

## Step 1: Load and explore the dataset

The first step in any machine learning project is to load and explore the data. Understanding the structure, size, and basic characteristics of your dataset is crucial before applying any algorithms. In this step, you will load the imbalanced medical dataset and examine its basic properties.

In [None]:
# Load the imbalanced dataset from CSV file
df = pd.read_csv("https://advanced-machine-learning-for-medical-data-8e1579.gitlab.io/labs/lab3/imbalanced_dataset.csv")

# Display the first few rows to understand the data structure
print("First 5 rows of the dataset:")
print(df.head())

# Display dataset dimensions
print("\nDataset shape:", df.shape)
print(f"This means there are {df.shape[0]} samples (patients) and {df.shape[1]-1} features (measurements)")

## Step 2: Analyze class distribution

Understanding the class distribution is critical when working with medical datasets. In rare condition detection, the ratio of positive to negative cases (class imbalance) directly impacts model performance and evaluation strategy.

An imbalance ratio of 10:1 or higher is often treated as severe and usually calls for special techniques. Medical datasets often have ratios of 100:1 or even 1000:1, making this analysis essential for choosing appropriate modeling approaches.

In [None]:
# Count the number of cases for each class
class_counts = df['target'].value_counts()
print("Class distribution (counts):")
print(class_counts)

# Calculate the percentage of each class
class_percentages = df['target'].value_counts(normalize=True) * 100
print("\nClass distribution (percentages):")
for class_label, percentage in class_percentages.items():
    class_name = "Benign" if class_label == 0 else "Malignant"
    print(f"  {class_name} (Class {class_label}): {percentage:.2f}%")

# Calculate and display the imbalance ratio
imbalance_ratio = class_counts[0] / class_counts[1]
print(f"\nImbalance ratio: {imbalance_ratio:.2f}:1")
print(f"This means there are approximately {round(imbalance_ratio)} benign cases for every 1 malignant case.")
print("\n This severe imbalance will cause problems for standard machine learning algorithms!")

In [None]:
# Visualize the class distribution with a bar chart
plt.figure(figsize=(8, 5))
class_counts.plot(kind='bar', color=['green', 'red'])
plt.title('Distribution of Diagnoses', fontsize=14, fontweight='bold')
plt.xlabel('Diagnosis (0 = Benign, 1 = Malignant)', fontsize=12)
plt.ylabel('Number of Cases', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)

# Add percentage labels on bars for clarity
for i, (count, pct) in enumerate(zip(class_counts, class_percentages)):
    plt.text(i, count + 10, f'{pct:.1f}%', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

## Step 3: Prepare the data for machine learning

Before training a model, you must split the data into features (X) and the target variable (y), then divide it into training and testing sets. The training set is used to teach the model, while the test set evaluates how well the model generalizes to unseen data.

The `stratify` parameter ensures both sets maintain the same class distribution as the original dataset. This is especially important for imbalanced data, as it prevents the test set from accidentally having zero positive cases.
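
The effect of `stratify` can be seen on a toy label vector. The following is a minimal sketch with made-up data (`X_toy` and `y_toy` are placeholders, not the lab dataset) that compares a stratified split with several unstratified ones.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 990 benign (0) and 10 malignant (1), as in a 99:1 dataset
y_toy = np.array([0] * 990 + [1] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

# Stratified split: the 1% malignant rate is preserved in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)
print("Stratified -> train positives:", y_tr_strat.sum(),
      "| test positives:", y_te_strat.sum())

# Unstratified splits: the number of test positives varies with the seed
for seed in range(5):
    _, _, _, y_te_plain = train_test_split(
        X_toy, y_toy, test_size=0.2, random_state=seed
    )
    print(f"Unstratified (seed={seed}) -> test positives:", y_te_plain.sum())
```

With 10 positives and a 20% test split, stratification places 8 positives in the training set and 2 in the test set. Without it, the count in the test set depends on the random seed.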

In [None]:
# Separate features (X) from the target variable (y)
X = df.drop(columns=['target'])  # All measurement columns
y = df['target']  # Only the diagnosis column

# Split into training set (80%) and test set (20%)
# stratify=y ensures both sets have similar class distributions
# random_state=42 ensures reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42, 
    stratify=y
)

print("Data split completed!")
print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")

print("\nTraining set class distribution:")
print(y_train.value_counts())

print("\nTest set class distribution:")
print(y_test.value_counts())

print("\n✓ Both sets maintain the severe class imbalance from the original dataset.")

## Step 4: Train a baseline model (without resampling)

Before applying any techniques to handle class imbalance, it's important to establish a baseline. This baseline model is trained on the original imbalanced data without any special handling. By comparing this baseline to models trained with resampling techniques, you can measure the effectiveness of your imbalance-handling strategies.

Random Forest is an ensemble learning method that combines multiple decision trees. It's chosen here because it generally performs well on various tasks and serves as a strong baseline classifier.

In [None]:
# Create a Random Forest classifier
# n_estimators=100 means it will create 100 decision trees
# random_state=42 ensures reproducibility
baseline_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model on the imbalanced training data
baseline_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_baseline = baseline_model.predict(X_test)

# Get probability scores (used for precision-recall curves)
# [:, 1] selects probabilities for the positive class (malignant)
y_pred_proba_baseline = baseline_model.predict_proba(X_test)[:, 1]

print("✓ Baseline model trained successfully!")
print("Next: Evaluate this model and see why accuracy can be misleading.")

## Step 5: Evaluate the baseline model

Now comes the critical analysis: evaluating how well the baseline model performs. This step reveals a fundamental problem with imbalanced datasets: traditional metrics like accuracy can be highly misleading.

A model that achieves 99% accuracy sounds excellent, but if it never predicts the positive class (malignant), it's completely useless for medical diagnosis. This is why you must examine multiple metrics, especially Precision, Recall, and the confusion matrix.
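
The arithmetic behind this claim can be checked in a few lines. The sketch below uses hypothetical counts (198 benign and 2 malignant test patients) and a "model" that labels every patient as benign.

```python
# Hypothetical screening test set: 198 benign (0) and 2 malignant (1) patients
y_true = [0] * 198 + [1] * 2

# A "model" that labels every patient as benign
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of malignant cases that were detected
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(y_true)
recall = true_positives / actual_positives

print(f"Accuracy: {accuracy:.3f}")  # 0.990
print(f"Recall:   {recall:.3f}")    # 0.000
```

Accuracy is 99% even though not a single malignant case is detected, which is why recall and the confusion matrix are examined next.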

In [None]:
# Calculate accuracy
baseline_accuracy = accuracy_score(y_test, y_pred_baseline)

print("=" * 60)
print("BASELINE MODEL EVALUATION (No Resampling)")
print("=" * 60)
print(f"\nAccuracy: {baseline_accuracy:.3f} ({baseline_accuracy*100:.1f}%)")
print("\n  WARNING: High accuracy can be misleading with imbalanced data!")
print("   A model that predicts 'benign' for ALL cases would have ~99% accuracy.")
print("   Let's look at more detailed metrics...")

In [None]:
# Display detailed classification report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred_baseline, target_names=['Benign', 'Malignant'], zero_division=0))

print("\nKey metrics explained:")
print("  • Precision: Of cases predicted as malignant, what % are actually malignant?")
print("  • Recall: Of actual malignant cases, what % were detected?")
print("  • F1-Score: Harmonic mean of precision and recall")
print("\n  For rare condition detection, RECALL is often most critical!")
print("  Missing a malignant case (false negative) can be life-threatening.")

In [None]:
# Create and display confusion matrix
cm_baseline = confusion_matrix(y_test, y_pred_baseline)
print("\nConfusion Matrix:")
print(cm_baseline)

# Extract confusion matrix components
tn, fp, fn, tp = cm_baseline.ravel()

print("\nConfusion Matrix Breakdown:")
print(f"  True Negatives (TN): {tn}")
print("    → Correctly identified benign cases ✓")
print(f"\n  False Positives (FP): {fp}")
print("    → Benign cases incorrectly predicted as malignant")
print("    → Causes unnecessary worry and follow-up tests")
print(f"\n  False Negatives (FN): {fn}")
print("    → Malignant cases incorrectly predicted as benign")
print("    → MOST DANGEROUS ERROR - missed diagnoses!")
print(f"\n  True Positives (TP): {tp}")
print("    → Correctly identified malignant cases ✓")

In [None]:
# Visualize confusion matrix
plt.figure(figsize=(7, 5))
sns.heatmap(cm_baseline, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Benign', 'Malignant'],
            yticklabels=['Benign', 'Malignant'],
            cbar=False)
plt.title('Confusion Matrix - Baseline Model', fontsize=14, fontweight='bold')
plt.ylabel('Actual Diagnosis', fontsize=12)
plt.xlabel('Predicted Diagnosis', fontsize=12)
plt.tight_layout()
plt.show()

print("\n Notice: The bottom-left cell (False Negatives) shows missed malignant cases.")
print("  This is the error you want to minimize most in medical diagnosis!")

## Step 6: Apply SMOTE resampling

Now you will apply SMOTE to address the class imbalance. SMOTE is more sophisticated than simple duplication because it creates synthetic examples by interpolating between existing minority class samples.

**How SMOTE works:**
1. For each minority class sample, SMOTE finds its k nearest neighbors
2. It randomly selects one of these neighbors
3. It creates a new synthetic sample along the line connecting the two samples
4. This process repeats until the classes are balanced

This approach helps the model learn the broader characteristics of the minority class without simply memorizing specific examples.
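
The steps above can be sketched in plain NumPy. This is a simplified illustration that creates a single synthetic point, not imbalanced-learn's actual implementation; the sample values and the helper name `synthesize_one` are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical minority-class samples with two features each
minority = np.array([
    [1.0, 2.0],
    [1.5, 1.8],
    [2.0, 2.5],
    [1.2, 2.2],
])

def synthesize_one(samples, index, k, rng):
    """Create one synthetic point from samples[index] and a random near neighbor."""
    base = samples[index]
    # Step 1: distances from the base sample to all minority samples
    distances = np.linalg.norm(samples - base, axis=1)
    # Skip position 0 of the sort order, which is the base sample itself
    neighbor_ids = np.argsort(distances)[1:k + 1]
    # Step 2: pick one of the k nearest neighbors at random
    neighbor = samples[rng.choice(neighbor_ids)]
    # Step 3: interpolate at a random position on the segment base -> neighbor
    gap = rng.random()
    return base + gap * (neighbor - base)

new_point = synthesize_one(minority, index=0, k=2, rng=rng)
print("Synthetic sample:", new_point)
```

Because the new point lies on a line segment between two real minority samples, it always falls inside the region already covered by the minority class. Step 4 of the list would repeat this call until both classes have the same number of samples.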

In [None]:
# Apply SMOTE to the training data
# SMOTE creates synthetic minority class examples by interpolating between existing ones
# Note: the default k_neighbors=5 requires at least 6 minority samples in the training set
smote = SMOTE(random_state=42)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("SMOTE Resampling Applied")
print("=" * 40)

print("\nOriginal training set class distribution:")
print(pd.Series(y_train).value_counts())

print("\nAfter SMOTE class distribution:")
print(pd.Series(y_train_smote).value_counts())

print("\n✓ Classes are now balanced!")
print(f"  Training set size increased from {len(y_train)} to {len(y_train_smote)} samples")
print(f"  Added {len(y_train_smote) - len(y_train)} synthetic malignant cases")

print("\n Important: SMOTE is applied only to the TRAINING set.")
print(" The TEST set remains unchanged to provide unbiased evaluation.")

## Step 7: Train model with SMOTE-resampled data

With the balanced training data created by SMOTE, you can now train a new model. This model should learn to recognize both classes more effectively because it sees equal numbers of positive and negative examples during training.

Remember: the test set remains unchanged. You want to evaluate how well the model performs on real, unbalanced data that reflects actual clinical conditions.

In [None]:
# Train model with SMOTE-resampled data
model_smote = RandomForestClassifier(n_estimators=100, random_state=42)
model_smote.fit(X_train_smote, y_train_smote)

# Make predictions on the test set
y_pred_smote = model_smote.predict(X_test)

# Get probability scores for precision-recall curves
y_pred_proba_smote = model_smote.predict_proba(X_test)[:, 1]

print("Model trained with SMOTE-resampled data!")
print("Next: Compare this model's performance to the baseline.")

## Step 8: Compare model performance

Now comes the moment of truth: comparing the baseline model (trained on imbalanced data) with the SMOTE model (trained on balanced data). You'll calculate Precision, Recall, F1-score, and Accuracy for both models.

**Metric definitions:**
- **Accuracy** = (TP + TN) / Total: overall correctness, but misleading for imbalanced data
- **Precision** = TP / (TP + FP): of predicted positives, how many are correct?
- **Recall** = TP / (TP + FN): of actual positives, how many were detected?
- **F1-score** = 2 × (Precision × Recall) / (Precision + Recall): harmonic mean balancing both
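
As a worked example of these formulas, the sketch below plugs in hypothetical confusion-matrix counts for a 200-patient test set. The counts are invented for illustration and are not results from this lab.

```python
# Hypothetical confusion-matrix counts for a 200-patient test set
TP, FP, FN, TN = 1, 3, 1, 195

accuracy = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall = TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.3f}")   # 0.980
print(f"Precision: {precision:.3f}")  # 0.250
print(f"Recall:    {recall:.3f}")     # 0.500
print(f"F1-score:  {f1:.3f}")         # 0.333
```

Accuracy stays at 98% although only half of the malignant cases are found and three out of four positive predictions are false alarms.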

In [None]:
# Define helper function to manually compute metrics
def compute_metrics(y_true, y_pred):
    """Compute precision, recall, F1-score, and accuracy from predictions."""
    # Convert to numpy arrays for easier computation
    y_true = np.array(y_true)
    y_pred = np.array(y_pred)
    
    # Calculate confusion matrix elements
    TP = np.sum((y_true == 1) & (y_pred == 1))  # True Positives
    TN = np.sum((y_true == 0) & (y_pred == 0))  # True Negatives
    FP = np.sum((y_true == 0) & (y_pred == 1))  # False Positives
    FN = np.sum((y_true == 1) & (y_pred == 0))  # False Negatives
    
    # Apply metric formulas
    accuracy = (TP + TN) / (TP + TN + FP + FN)
    precision = TP / (TP + FP) if (TP + FP) > 0 else 0.0
    recall = TP / (TP + FN) if (TP + FN) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0
    
    return accuracy, precision, recall, f1

In [None]:
# Compute metrics for both models
baseline_metrics = compute_metrics(y_test, y_pred_baseline)
smote_metrics = compute_metrics(y_test, y_pred_smote)

# Create comparison table
comparison_data = {
    'Model': ['Baseline (No Resampling)', 'SMOTE'],
    'Accuracy': [baseline_metrics[0], smote_metrics[0]],
    'Precision': [baseline_metrics[1], smote_metrics[1]],
    'Recall': [baseline_metrics[2], smote_metrics[2]],
    'F1-Score': [baseline_metrics[3], smote_metrics[3]]
}

comparison_df = pd.DataFrame(comparison_data)

# Display formatted comparison
print("\n" + "=" * 80)
print("MODEL PERFORMANCE COMPARISON")
print("=" * 80)
print(comparison_df.to_string(index=False))
print("=" * 80)

print("\nKey Observations:")
print(f"  • Baseline recall: {baseline_metrics[2]:.3f} → SMOTE recall: {smote_metrics[2]:.3f}")
print(f"    Change in recall: {(smote_metrics[2] - baseline_metrics[2]):+.3f}")
print("\n  • SMOTE typically improves recall (detection of malignant cases)")
print("  • Recall is often the most important metric for rare condition detection")
print("  • Accuracy may decrease slightly, but more of the cases that matter are caught.")

## Step 9: Visualize confusion matrices

Visual comparison of confusion matrices provides immediate insight into how each model performs. Pay special attention to the false negative (bottom-left) cell, which represents missed malignant cases, the most critical error in medical diagnosis.

In [None]:
# Create confusion matrices for both models
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

models_data = [
    ('Baseline (No Resampling)', y_pred_baseline),
    ('SMOTE', y_pred_smote)
]

for idx, (title, predictions) in enumerate(models_data):
    ax = axes[idx]
    cm = confusion_matrix(y_test, predictions)
    
    # Create heatmap
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=['Benign', 'Malignant'],
                yticklabels=['Benign', 'Malignant'],
                cbar=False)
    
    ax.set_title(title, fontsize=13, fontweight='bold')
    ax.set_ylabel('Actual Diagnosis', fontsize=11)
    ax.set_xlabel('Predicted Diagnosis', fontsize=11)

plt.tight_layout()
plt.show()

print("\nKey Observation:")
print("   Compare the bottom-left cell (False Negatives) in each matrix.")
print("   This represents missed malignant cases, the most dangerous type of error!")
print("   SMOTE typically reduces false negatives, improving patient safety.")

## Step 10: Plot Precision-Recall curves

The **Precision-Recall curve** is one of the most useful evaluation tools for imbalanced datasets. It shows the trade-off between Precision and Recall at different classification thresholds.

**Why Precision-Recall curves are crucial:**
- They focus on the minority class (malignant cases), which is what matters most
- They reveal performance across all possible decision thresholds
- They're more informative than ROC curves when classes are highly imbalanced
- They help clinicians understand trade-offs between catching all cases vs. minimizing false alarms

**Interpretation:**
- **X-axis (Recall)**: What percentage of malignant cases does the model detect?
- **Y-axis (Precision)**: Of cases predicted as malignant, what percentage are correct?
- **Curves closer to top-right**: Better overall performance
- **Average Precision (AP)**: Summary metric; higher is better (1.0 is perfect)
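
Each point on the curve corresponds to one classification threshold, so the curve can also be used to pick a threshold. The following is a minimal sketch, assuming a clinical requirement that no malignant case may be missed (`target_recall = 1.0`); the labels and scores are invented for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and predicted malignancy probabilities for 10 patients
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60, 0.70, 0.90])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# precision and recall have one more entry than thresholds; the final
# (precision=1, recall=0) point has no threshold, so it is dropped here
target_recall = 1.0  # assumed clinical requirement: miss no malignant case
eligible = np.where(recall[:-1] >= target_recall)[0]

# thresholds are sorted in increasing order, so the last eligible index is
# the strictest threshold that still meets the recall target
best = eligible[-1]
chosen_threshold = thresholds[best]

# Apply the chosen threshold instead of the default 0.5
y_pred = (y_score >= chosen_threshold).astype(int)
print(f"Chosen threshold: {chosen_threshold:.2f}")
print(f"Precision at threshold: {precision[best]:.2f}")
print(f"Recall at threshold:    {recall[best]:.2f}")
```

Lowering the threshold below the default of 0.5 catches every malignant case in this toy data at the cost of one false alarm. In practice the threshold should be chosen on a validation set rather than on the test set.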

In [None]:
# Calculate and plot precision-recall curves
plt.figure(figsize=(12, 8))

# Store model predictions for plotting
all_predictions = [
    ('Baseline', y_pred_proba_baseline, 'blue'),
    ('SMOTE', y_pred_proba_smote, 'green')
]

# Plot precision-recall curve for each model
for name, y_pred_proba, color in all_predictions:
    precision, recall, _ = precision_recall_curve(y_test, y_pred_proba)
    avg_precision = average_precision_score(y_test, y_pred_proba)
    
    plt.plot(recall, precision, 
             label=f'{name} (AP = {avg_precision:.3f})', 
             color=color, linewidth=2.5)

# Add reference line for random classifier
no_skill = len(y_test[y_test == 1]) / len(y_test)
plt.plot([0, 1], [no_skill, no_skill], 
         linestyle='--', color='gray', 
         label=f'Random Classifier (AP = {no_skill:.3f})', linewidth=2)

plt.xlabel('Recall (Sensitivity)', fontsize=13, fontweight='bold')
plt.ylabel('Precision', fontsize=13, fontweight='bold')
plt.title('Precision-Recall Curves: Model Comparison', fontsize=15, fontweight='bold')
plt.legend(loc='best', fontsize=11)
plt.grid(alpha=0.3)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.tight_layout()
plt.show()

print("\nHow to Read This Chart:")
print("  • X-axis (Recall): What percentage of malignant cases does the model detect?")
print("  • Y-axis (Precision): Of the cases predicted as malignant, what percentage are correct?")
print("  • AP (Average Precision): Summary metric; higher is better (1.0 is perfect)")
print("  • Curves closer to the top-right corner indicate better performance")
print("\nClinical Interpretation:")
print("  • High Recall (right side): Catch most malignant cases (fewer missed diagnoses)")
print("  • High Precision (top): Fewer false alarms (don't worry patients unnecessarily)")
print("  • The challenge: Improving one often decreases the other!")
print("  • The optimal threshold depends on the clinical context and consequences of errors.")

## Step 11: Plot ROC curves (for comparison)

While Precision-Recall curves are more appropriate for imbalanced data, Receiver Operating Characteristic (ROC) curves are also commonly used in medical diagnostics. This step plots ROC curves for comparison, helping you understand why Precision-Recall curves are preferred for rare conditions.

**Key difference:**
- **ROC curves** use False Positive Rate (FPR), which includes True Negatives in the denominator
- When negative cases vastly outnumber positive cases, FPR stays small even if the model raises many false alarms
- This can make ROC curves appear overly optimistic for imbalanced datasets
- **Precision-Recall curves** focus exclusively on the positive class, providing more realistic assessment
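
A small numeric example makes the difference concrete. The counts below are hypothetical results at a single threshold for 10 malignant and 990 benign patients.

```python
# Hypothetical screening results at one threshold: 10 malignant, 990 benign
TP, FN = 8, 2      # malignant cases caught / missed
FP, TN = 40, 950   # benign cases flagged / correctly cleared

tpr = TP / (TP + FN)        # recall, the y-axis of the ROC curve
fpr = FP / (FP + TN)        # the x-axis of the ROC curve
precision = TP / (TP + FP)  # the y-axis of the Precision-Recall curve

print(f"TPR (recall): {tpr:.3f}")        # 0.800
print(f"FPR:          {fpr:.3f}")        # 0.040
print(f"Precision:    {precision:.3f}")  # 0.167
```

On the ROC plot this point (FPR 0.04, TPR 0.80) sits near the top-left corner and looks strong, yet only about one in six flagged patients is actually malignant. The Precision-Recall view exposes that cost directly.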

In [None]:
# Plot ROC curves for comparison
plt.figure(figsize=(12, 8))

# Plot ROC curve for each model
for name, y_pred_proba, color in all_predictions:
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    auc_score = roc_auc_score(y_test, y_pred_proba)
    
    plt.plot(fpr, tpr, 
             label=f'{name} (AUC = {auc_score:.3f})', 
             color=color, linewidth=2.5)

# Add diagonal reference line (random classifier)
plt.plot([0, 1], [0, 1], 
         linestyle='--', color='gray', 
         label='Random Classifier (AUC = 0.500)', linewidth=2)

plt.xlabel('False Positive Rate', fontsize=13, fontweight='bold')
plt.ylabel('True Positive Rate (Recall)', fontsize=13, fontweight='bold')
plt.title('ROC Curves: Model Comparison', fontsize=15, fontweight='bold')
plt.legend(loc='best', fontsize=11)
plt.grid(alpha=0.3)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.tight_layout()
plt.show()

print("\nROC vs Precision-Recall Curves:")
print("  • ROC curves can appear overly optimistic with imbalanced data")
print("  • They give equal weight to both classes")
print("  • False Positive Rate includes True Negatives, which dominate in imbalanced data")
print("  • Precision-Recall curves focus on the minority class (malignant)")
print("  • For rare condition detection, Precision-Recall curves are more informative!")

# Exercises

Now it's time to apply what you've learned! These exercises will help reinforce your understanding of rare condition detection and imbalanced classification.

## Exercise 1: Load and analyze a new dataset

Load the **imbalanced_dataset2.csv** file (https://advanced-machine-learning-for-medical-data-8e1579.gitlab.io/labs/lab3/imbalanced_dataset2.csv) and analyze its class distribution. Calculate the imbalance ratio and visualize the class distribution with a bar chart.

In [None]:
# your code goes here


<details>
    <summary>Click here for a hint</summary>
    
Use `pd.read_csv()` to load the data, then use `value_counts()` to examine the target column distribution. Calculate the imbalance ratio by dividing the count of class 0 by the count of class 1. Refer to Step 1 and Step 2 for the complete approach.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Load the new dataset
df2 = pd.read_csv("https://advanced-machine-learning-for-medical-data-8e1579.gitlab.io/labs/lab3/imbalanced_dataset2.csv")

# Display first few rows
print("First 5 rows of the new dataset:")
print(df2.head())

# Analyze class distribution
class_counts2 = df2['target'].value_counts()
print("\nClass distribution:")
print(class_counts2)

# Calculate percentages
class_percentages2 = df2['target'].value_counts(normalize=True) * 100
print("\nClass distribution (percentages):")
print(class_percentages2)

# Calculate imbalance ratio
imbalance_ratio2 = class_counts2[0] / class_counts2[1]
print(f"\nImbalance ratio: {imbalance_ratio2:.2f}:1")

# Visualize
plt.figure(figsize=(8, 5))
class_counts2.plot(kind='bar', color=['green', 'red'])
plt.title('Distribution of Diagnoses - Dataset 2', fontsize=14, fontweight='bold')
plt.xlabel('Diagnosis (0 = Benign, 1 = Malignant)', fontsize=12)
plt.ylabel('Number of Cases', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)

for i, (count, pct) in enumerate(zip(class_counts2, class_percentages2)):
    plt.text(i, count + 10, f'{pct:.1f}%', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()
```

</details>

## Exercise 2: Train a baseline model on the new dataset

Using the dataset from Exercise 1, split the data into training and test sets (80/20 split with stratification), train a Random Forest baseline model, and evaluate it using a confusion matrix and classification report.

In [None]:
# your code goes here


<details>
    <summary>Click here for a hint</summary>
    
Follow Steps 3, 4, and 5 from the lab. Separate features from the target, use `train_test_split()` with `stratify=y`, create a `RandomForestClassifier`, fit it on the training data, and evaluate with `confusion_matrix()` and `classification_report()`.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Prepare the data
X2 = df2.drop(columns=['target'])
y2 = df2['target']

# Split the data
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X2, y2, test_size=0.2, random_state=42, stratify=y2
)

print(f"Training set size: {X_train2.shape[0]} samples")
print(f"Test set size: {X_test2.shape[0]} samples")

# Train baseline model
baseline_model2 = RandomForestClassifier(n_estimators=100, random_state=42)
baseline_model2.fit(X_train2, y_train2)

# Make predictions
y_pred_baseline2 = baseline_model2.predict(X_test2)
y_pred_proba_baseline2 = baseline_model2.predict_proba(X_test2)[:, 1]

# Evaluate
print("\nBaseline Model Evaluation:")
print(f"Accuracy: {accuracy_score(y_test2, y_pred_baseline2):.3f}")
print("\nClassification Report:")
print(classification_report(y_test2, y_pred_baseline2, target_names=['Benign', 'Malignant'], zero_division=0))

# Confusion matrix
cm2 = confusion_matrix(y_test2, y_pred_baseline2)
plt.figure(figsize=(7, 5))
sns.heatmap(cm2, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Benign', 'Malignant'],
            yticklabels=['Benign', 'Malignant'],
            cbar=False)
plt.title('Confusion Matrix - Baseline Model (Dataset 2)', fontsize=14, fontweight='bold')
plt.ylabel('Actual', fontsize=12)
plt.xlabel('Predicted', fontsize=12)
plt.tight_layout()
plt.show()
```

</details>

## Exercise 3: Apply SMOTE and compare results

Apply SMOTE to the training data from Exercise 2, train a new Random Forest model, and create a comparison table showing Accuracy, Precision, Recall, and F1-score for both the baseline and SMOTE models. Visualize the confusion matrices side-by-side.

In [None]:
# your code goes here


<details>
    <summary>Click here for a hint</summary>
    
Follow Steps 6, 7, 8, and 9. Use `SMOTE()` to resample the training data, train a new model on the balanced data, compute metrics using the `compute_metrics()` function, and create comparison visualizations using matplotlib and seaborn.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Apply SMOTE
smote2 = SMOTE(random_state=42)
X_train_smote2, y_train_smote2 = smote2.fit_resample(X_train2, y_train2)

print("After SMOTE:")
print(pd.Series(y_train_smote2).value_counts())

# Train SMOTE model
model_smote2 = RandomForestClassifier(n_estimators=100, random_state=42)
model_smote2.fit(X_train_smote2, y_train_smote2)

# Predictions
y_pred_smote2 = model_smote2.predict(X_test2)
y_pred_proba_smote2 = model_smote2.predict_proba(X_test2)[:, 1]

# Compute metrics
baseline_metrics2 = compute_metrics(y_test2, y_pred_baseline2)
smote_metrics2 = compute_metrics(y_test2, y_pred_smote2)

# Create comparison table
comparison_data2 = {
    'Model': ['Baseline', 'SMOTE'],
    'Accuracy': [baseline_metrics2[0], smote_metrics2[0]],
    'Precision': [baseline_metrics2[1], smote_metrics2[1]],
    'Recall': [baseline_metrics2[2], smote_metrics2[2]],
    'F1-Score': [baseline_metrics2[3], smote_metrics2[3]]
}

comparison_df2 = pd.DataFrame(comparison_data2)
print("\n" + "=" * 80)
print("MODEL COMPARISON (Dataset 2)")
print("=" * 80)
print(comparison_df2.to_string(index=False))
print("=" * 80)

# Visualize confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(14, 6))

models_data2 = [
    ('Baseline', y_pred_baseline2),
    ('SMOTE', y_pred_smote2)
]

for idx, (title, predictions) in enumerate(models_data2):
    ax = axes[idx]
    cm = confusion_matrix(y_test2, predictions)
    
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=ax,
                xticklabels=['Benign', 'Malignant'],
                yticklabels=['Benign', 'Malignant'],
                cbar=False)
    
    ax.set_title(f'{title} (Dataset 2)', fontsize=13, fontweight='bold')
    ax.set_ylabel('Actual', fontsize=11)
    ax.set_xlabel('Predicted', fontsize=11)

plt.tight_layout()
plt.show()

print("\nCompare the Recall column: SMOTE typically improves detection of malignant cases.")
```

</details>

## Exercise 4: Plot Precision-Recall curves

Create Precision-Recall curves for both the baseline and SMOTE models from Exercise 3. Compare the Average Precision scores and interpret which model performs better for rare condition detection.

In [None]:
# your code goes here


<details>
    <summary>Click here for a hint</summary>
    
Refer to Step 10. Use `precision_recall_curve()` and `average_precision_score()` from sklearn.metrics. Plot recall on the x-axis and precision on the y-axis. Include a reference line for a random classifier.

</details>

<details>
    <summary>Click here for solution</summary>

```python
# Plot precision-recall curves
plt.figure(figsize=(12, 8))

# Prepare predictions
all_predictions2 = [
    ('Baseline', y_pred_proba_baseline2, 'blue'),
    ('SMOTE', y_pred_proba_smote2, 'green')
]

# Plot each model
for name, y_pred_proba, color in all_predictions2:
    precision, recall, _ = precision_recall_curve(y_test2, y_pred_proba)
    avg_precision = average_precision_score(y_test2, y_pred_proba)
    
    plt.plot(recall, precision,
             label=f'{name} (AP = {avg_precision:.3f})',
             color=color, linewidth=2.5)

# Add reference line
no_skill2 = len(y_test2[y_test2 == 1]) / len(y_test2)
plt.plot([0, 1], [no_skill2, no_skill2],
         linestyle='--', color='gray',
         label=f'Random Classifier (AP = {no_skill2:.3f})', linewidth=2)

plt.xlabel('Recall (Sensitivity)', fontsize=13, fontweight='bold')
plt.ylabel('Precision', fontsize=13, fontweight='bold')
plt.title('Precision-Recall Curves - Dataset 2', fontsize=15, fontweight='bold')
plt.legend(loc='best', fontsize=11)
plt.grid(alpha=0.3)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.tight_layout()
plt.show()

print("\nThe model with higher Average Precision (AP) and curve closer to top-right performs better.")
print("SMOTE typically improves performance on the minority class (malignant cases).")
```

</details>

# Congratulations!

You have successfully completed this lab on detecting rare medical conditions with machine learning. You now understand how to identify class imbalance, apply SMOTE resampling, evaluate models using appropriate metrics, and interpret Precision-Recall curves for clinical decision-making. These skills are essential for building reliable diagnostic tools in real-world medical applications.

## Authors

Ramesh Sannareddy

Copyright © 2025 SkillUp. All rights reserved.