<h2 style="font-size:32px; font-family:Georgia; font-weight:bold;">
  Phase 2 - Baseline Model (Leaky Setup)
</h2>

<p style="font-size:15px; font-family:Arial;">
This notebook demonstrates a deliberately flawed modeling pipeline to highlight the impact of <b>patient-level data leakage</b> in medical imaging tasks. A naïve image-level split is applied, and model performance is evaluated under conditions known to produce misleading results. The goal is to document how leakage inflates metrics and to motivate stricter data handling in future phases.
</p>

<h3 style="font-size:20px; font-family:Georgia; font-weight:bold;">
Objectives
</h3>

<ul style="font-size:15px; font-family:Arial;">
  <li>Apply a naïve/random train–test split at the image level</li>
  <li>Build a baseline convolutional neural network (CNN)</li>
  <li>Use raw tensors for image and label access</li>
  <li>Train the model on a dataset with known leakage</li>
  <li>Evaluate model performance on the test set</li>
  <li>Log key metrics: <b>Accuracy</b>, <b>Loss</b>, <b>ROC-AUC</b> (multi-class)</li>
  <li>Save model weights and evaluation results</li>
  <li>Document why this setup is leaky and why results appear overly optimistic</li>
</ul>

<hr>
<h3 style="font-size:20px; font-family:Georgia; font-weight:bold;">
Execution Summary
</h3>

<ul style="font-size:15px; font-family:Arial;">
  <li>Dataset loaded from Deep Lake with <code>images</code>, <code>labels</code>, and <code>person_num</code> tensors</li>
  <li>Label distribution visualized and confirmed across three classes</li>
  <li>Sample chest X-rays displayed for Normal and Pneumonia classes</li>
  <li>Patient ID distribution analyzed; 61% of patients had multiple images</li>
  <li>Image-level split performed using <code>train_test_split</code> with stratification</li>
  <li>Leakage verified: <b>435 patients</b> appeared in both train and test sets</li>
  <li>Custom PyTorch dataset class implemented with grayscale conversion, pixel scaling to [0, 1], and resizing to 224×224</li>
  <li>Simple CNN defined and trained for 5 epochs</li>
  <li>Training loss logged per batch and averaged per epoch</li>
  <li>Final evaluation yielded:
    <ul>
      <li><b>Test Accuracy:</b> 0.784</li>
      <li><b>Test ROC-AUC (macro, ovr):</b> 0.913</li>
      <li><b>Test F1-Score (macro):</b> 0.744</li>
    </ul>
  </li>
</ul>
<hr>
<h3 style="font-size:20px; font-family:Georgia; font-weight:bold;">
Conclusion
</h3>

<p style="font-size:15px; font-family:Arial;">
This phase intentionally demonstrates how image-level splits can leak patient-specific information across partitions. The resulting performance metrics are artificially inflated and do not reflect true generalization. These findings reinforce the need for patient-level separation and motivate the corrected pipeline introduced in Phase 3.
</p>

In [42]:
import deeplake
import numpy as np
import matplotlib.pyplot as plt

import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as T
from torch.utils.data import Dataset, DataLoader
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

<h2 style="font-size:28px; font-family:Georgia; font-weight:bold;">
Notebook Initialization and Dataset Loading
</h2>

<p style="font-size:15px; font-family:Arial;">
Essential libraries were imported to support data loading, preprocessing, modeling, and evaluation. Random seeds were set for reproducibility across NumPy and PyTorch operations.
</p>

<ul style="font-size:15px; font-family:Arial;">
  <li><b>Deep Lake</b> used for dataset access</li>
  <li><b>NumPy</b> and <b>Matplotlib</b> for numerical analysis and visualization</li>
  <li><b>PyTorch</b> for model construction and training</li>
  <li><b>scikit-learn</b> for data splitting and metric evaluation</li>
</ul>

<p style="font-size:15px; font-family:Arial;">
The Chest X‑Ray dataset was loaded from the Activeloop hub via Deep Lake in read‑only mode. Available tensors include:
</p>

<ul style="font-size:15px; font-family:Arial;">
  <li><code>images</code> — raw chest X‑ray data</li>
  <li><code>labels</code> — diagnostic class annotations</li>
  <li><code>person_num</code> — patient identifiers</li>
</ul>

<p style="font-size:15px; font-family:Arial;">
Initial inspection confirmed successful dataset access and verified the presence of all required components for downstream analysis.
</p>


In [26]:
np.random.seed(42)
torch.manual_seed(42)

<torch._C.Generator at 0x18efdaf9cf0>

In [27]:
ds = deeplake.load("hub://activeloop/chest-xray-train")

print(ds)
print("Tensors:", ds.tensors.keys())

Opening dataset in read-only mode as you don't have write permissions.

This dataset can be visualized in Jupyter Notebook by ds.visualize() or at https://app.activeloop.ai/activeloop/chest-xray-train

hub://activeloop/chest-xray-train loaded successfully.

Dataset(path='hub://activeloop/chest-xray-train', read_only=True, tensors=['images', 'labels', 'person_num'])
Tensors: dict_keys(['images', 'labels', 'person_num'])




In [28]:
images = ds.images
labels = ds.labels
person_ids = ds.person_num


In [29]:
labels_list = [int(l[0]) for l in labels.numpy(aslist=True)]

print("Total samples:", len(labels_list))
print("Unique labels:", set(labels_list))

Total samples: 5216
Unique labels: {0, 1, 2}


<h2 style="font-size:28px; font-family:Georgia; font-weight:bold;">
Image-Level Split and Patient ID Filtering
</h2>

<p style="font-size:15px; font-family:Arial;">
A stratified image-level split was performed using <code>train_test_split</code>, resulting in <b>4172 training samples</b> and <b>1044 test samples</b>. This split was based solely on image labels and did not account for patient identity.
</p>

<p style="font-size:15px; font-family:Arial;">
Patient identifiers were extracted from the dataset to assess split integrity. Out of <b>5216 total samples</b>, <b>1341 entries</b> were missing patient IDs, leaving <b>3875 valid samples</b> for leakage analysis.
</p>

<ul style="font-size:15px; font-family:Arial;">
  <li><b>Missing ID ratio:</b> 25.71%</li>
  <li><b>Valid samples used for leakage analysis:</b> 3875</li>
</ul>

<p style="font-size:15px; font-family:Arial;">
This filtering step identifies the samples with a known patient identity, which are the only ones that can meaningfully be checked for overlap across the train and test partitions.
</p>


In [30]:
indices = np.arange(len(labels_list))

train_idx, test_idx = train_test_split(
    indices,
    test_size=0.2,
    random_state=42,
    stratify=labels_list
)

print("Train size:", len(train_idx))
print("Test size:", len(test_idx))

Train size: 4172
Test size: 1044
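
<p style="font-size:15px; font-family:Arial;">
One way to confirm that stratification preserved class balance is to compare label proportions in the two partitions. The sketch below is illustrative only: the helper <code>class_proportions</code> and the toy data are hypothetical stand-ins for <code>labels_list</code>, <code>train_idx</code>, and <code>test_idx</code>, and are not part of the original notebook.
</p>

```python
from collections import Counter

def class_proportions(labels, indices):
    """Return {label: fraction} for the samples selected by `indices`."""
    counts = Counter(labels[i] for i in indices)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}

# Hypothetical toy data: three classes with five samples each.
toy_labels = [0] * 5 + [1] * 5 + [2] * 5
toy_test = [4, 9, 14]                                   # one sample per class
toy_train = [i for i in range(len(toy_labels)) if i not in toy_test]

print("Train:", class_proportions(toy_labels, toy_train))
print("Test: ", class_proportions(toy_labels, toy_test))
```

<p style="font-size:15px; font-family:Arial;">
Matching proportions show only that the labels are balanced across partitions; they say nothing about whether patients are separated.
</p>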


In [31]:
person_ids_raw = ds.person_num.numpy(aslist=True)

clean_person_ids = []
missing_count = 0

for pid in person_ids_raw:
    if pid is None or len(pid) == 0:
        clean_person_ids.append(None)
        missing_count += 1
    else:
        clean_person_ids.append(int(pid[0]))

print("Total samples:", len(clean_person_ids))
print("Missing patient IDs:", missing_count)
print("Samples with valid patient IDs:", len(clean_person_ids) - missing_count)


Total samples: 5216
Missing patient IDs: 1341
Samples with valid patient IDs: 3875


In [32]:
missing_ratio = missing_count / len(clean_person_ids)
print(f"Missing ID ratio: {missing_ratio:.2%}")

Missing ID ratio: 25.71%


In [33]:
valid_indices = [
    i for i, pid in enumerate(clean_person_ids)
    if pid is not None
]

valid_person_ids = [clean_person_ids[i] for i in valid_indices]

print("Valid samples used for leakage analysis:", len(valid_indices))

Valid samples used for leakage analysis: 3875
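
<p style="font-size:15px; font-family:Arial;">
The share of patients with more than one image, as reported in the Execution Summary, can be derived by counting images per patient ID. The following is a minimal sketch under the assumption that the IDs are available as a flat list such as <code>valid_person_ids</code>; the helper name and the toy IDs are hypothetical.
</p>

```python
from collections import Counter

def multi_image_fraction(person_ids):
    """Fraction of distinct patients that contribute more than one image."""
    images_per_patient = Counter(person_ids)
    if not images_per_patient:
        return 0.0
    multi = sum(1 for n in images_per_patient.values() if n > 1)
    return multi / len(images_per_patient)

# Hypothetical IDs: patients 1 and 3 have several images, 2 and 4 have one each.
toy_person_ids = [1, 1, 2, 3, 3, 3, 4]
print(f"Patients with multiple images: {multi_image_fraction(toy_person_ids):.0%}")
```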


<h2 style="font-size:28px; font-family:Georgia; font-weight:bold;">
Patient Leakage Confirmation
</h2>

<p style="font-size:15px; font-family:Arial;">
Patient identifiers were extracted from both training and test sets using image-level indices. The intersection of these sets was computed to identify patients appearing in both partitions.
</p>

<ul style="font-size:15px; font-family:Arial;">
  <li><b>Overlapping patients:</b> 435</li>
</ul>

<p style="font-size:15px; font-family:Arial;">
This overlap confirms the presence of <b>patient-level leakage</b>, where multiple images from the same individual are distributed across training and evaluation sets. Such leakage compromises the validity of performance metrics and inflates model accuracy by exposing patient-specific features during training.
</p>

In [34]:
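# Collect the patient IDs seen in each partition.
# Caveat: samples with a missing ID map to None, so None counts as one
# "patient" in the overlap if it occurs in both partitions.
train_patients = set(clean_person_ids[i] for i in train_idx)
test_patients = set(clean_person_ids[i] for i in test_idx)

overlap = train_patients.intersection(test_patients)

print("Overlapping patients:", len(overlap))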

Overlapping patients: 435
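
<p style="font-size:15px; font-family:Arial;">
Because roughly a quarter of the samples have no patient ID, a stricter overlap check may want to ignore the missing-ID placeholder. The sketch below shows one way to do that; <code>patient_overlap</code> and the toy data are hypothetical, and it assumes missing IDs are encoded as <code>None</code>, as in <code>clean_person_ids</code>.
</p>

```python
def patient_overlap(person_ids, train_indices, test_indices):
    """Return patient IDs present in both partitions, ignoring missing IDs (None)."""
    train_patients = {person_ids[i] for i in train_indices if person_ids[i] is not None}
    test_patients = {person_ids[i] for i in test_indices if person_ids[i] is not None}
    return train_patients & test_patients

# Hypothetical data: patient 7 is split across partitions; None marks a missing ID.
toy_ids = [7, 7, 8, None, 9, None]
leaked = patient_overlap(toy_ids, train_indices=[0, 2, 3], test_indices=[1, 4, 5])
print("Overlapping patients:", sorted(leaked))
```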


<h2 style="font-size:28px; font-family:Georgia; font-weight:bold;">
Model Training and Runtime Notes
</h2>

<p style="font-size:15px; font-family:Arial;">
The baseline convolutional neural network (CNN) was trained for <b>5 epochs</b> using a batch size of 8 and a dataset of <b>4172 training samples</b>. Training was performed on a leaky image-level split, and progress was logged every 20 batches to monitor convergence.
</p>

<h3 style="font-size:20px; font-family:Georgia; font-weight:bold;">
Training Duration
</h3>

<p style="font-size:15px; font-family:Arial;">
⚠️ <b>Important:</b> This training loop took approximately <b>3–4 hours</b> to complete on standard hardware. Users running this notebook should expect a similar runtime unless using accelerated hardware (e.g., a CUDA-capable GPU).
</p>

<h3 style="font-size:20px; font-family:Georgia; font-weight:bold;">
Epoch-Level Summary
</h3>

<ul style="font-size:15px; font-family:Arial;">
  <li><b>Epoch 1 — Mean Loss:</b> 0.6174</li>
  <li><b>Epoch 2 — Mean Loss:</b> 0.4471</li>
  <li><b>Epoch 3 — Mean Loss:</b> 0.4059</li>
  <li><b>Epoch 4 — Mean Loss:</b> 0.3597</li>
  <li><b>Epoch 5 — Mean Loss:</b> 0.3222</li>
</ul>

<p style="font-size:15px; font-family:Arial;">
Loss values decreased steadily across epochs, indicating successful optimization. However, due to patient-level leakage in the data split, these results are not reliable indicators of generalization performance.
</p>

<h3 style="font-size:20px; font-family:Georgia; font-weight:bold;">
Observations
</h3>

<ul style="font-size:15px; font-family:Arial;">
  <li>Loss values varied significantly across batches, with occasional spikes and dips</li>
  <li>Some batches showed very low loss (<code>&lt; 0.1</code>), which is consistent with, though not proof of, memorization of patient-specific features</li>
  <li>Later epochs showed tighter clustering of loss values, but because of the leakage this cannot be read as evidence of generalization</li>
</ul>

<p style="font-size:15px; font-family:Arial;">
These results reinforce the importance of patient-level partitioning and motivate the corrected pipeline introduced in Phase 3.
</p>


In [35]:
class ChestXrayDataset(torch.utils.data.Dataset):
    """Serves (image, label) pairs as single-channel 224x224 float tensors."""

    def __init__(self, indices, transform=None):
        self.indices = indices
        self.transform = transform  # stored but not applied in this baseline

    def __len__(self):
        return len(self.indices)

    def __getitem__(self, idx):
        i = int(self.indices[idx])
        image = images[i].numpy()
        label = int(labels_list[i])
        image = torch.tensor(image, dtype=torch.float32)

        # Force channel consistency: the handled cases all yield (1, H, W).
        if image.ndim == 2:
            image = image.unsqueeze(0)               # (H, W) -> (1, H, W)
        elif image.ndim == 3 and image.shape[2] == 3:
            image = image.permute(2, 0, 1)           # (H, W, 3) -> (3, H, W)
            image = image.mean(dim=0, keepdim=True)  # average RGB -> (1, H, W)
        elif image.ndim == 3:
            image = image.permute(2, 0, 1)           # (H, W, C) -> (C, H, W); assumes C == 1

        # Scale to [0, 1], assuming 8-bit pixel values.
        image = image / 255.0

        # Resize to 224x224; interpolate expects a leading batch dimension.
        image = torch.nn.functional.interpolate(
            image.unsqueeze(0),     # (1, 1, H, W)
            size=(224, 224),
            mode="bilinear",
            align_corners=False
        ).squeeze(0)                # (1, 224, 224)

        label = torch.tensor(label, dtype=torch.long)
        return image, label

In [36]:
train_ds = ChestXrayDataset(train_idx)
test_ds  = ChestXrayDataset(test_idx)

train_loader = DataLoader(train_ds, batch_size=8, shuffle=True)
test_loader  = DataLoader(test_ds, batch_size=8, shuffle=False)

In [37]:
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1),  
            nn.ReLU(),
            nn.MaxPool2d(2),

            nn.Conv2d(16, 32, 3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )

        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 56 * 56, 64),
            nn.ReLU(),
            nn.Linear(64, 3)  
        )

    def forward(self, x):
        return self.fc(self.conv(x))
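
<p style="font-size:15px; font-family:Arial;">
The input size of the first linear layer, <code>32 * 56 * 56</code>, follows from the 224×224 input passing through two padded 3×3 convolutions (which keep the spatial size) and two 2×2 max-pooling layers (which halve it). The sketch below traces that arithmetic with the standard output-size formula; <code>conv_out</code> is a hypothetical helper, not a PyTorch function.
</p>

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 224                                      # height/width after resizing
size = conv_out(size, kernel=3, padding=1)      # Conv2d(1, 16, 3, padding=1)  -> 224
size = conv_out(size, kernel=2, stride=2)       # MaxPool2d(2)                 -> 112
size = conv_out(size, kernel=3, padding=1)      # Conv2d(16, 32, 3, padding=1) -> 112
size = conv_out(size, kernel=2, stride=2)       # MaxPool2d(2)                 -> 56

flat_features = 32 * size * size                # 32 output channels from the last conv
print(size, flat_features)
```

<p style="font-size:15px; font-family:Arial;">
If the resize target in the dataset class changes, the linear layer's input size has to be recomputed the same way.
</p>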

In [38]:
device = "cuda" if torch.cuda.is_available() else "cpu"

model = SimpleCNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [39]:
epochs = 5

for epoch in range(epochs):
    model.train()
    losses = []

    for batch_idx, (x, y) in enumerate(train_loader):
        x, y = x.to(device), y.to(device)

        optimizer.zero_grad()
        outputs = model(x)
        loss = criterion(outputs, y)
        loss.backward()
        optimizer.step()

        losses.append(loss.item())

        if batch_idx % 20 == 0:
            print(
                f"Epoch {epoch+1}/{epochs} | "
                f"Batch {batch_idx}/{len(train_loader)} | "
                f"Loss {loss.item():.4f}"
            )

    print(f"Epoch {epoch+1}/{epochs} - Mean Loss: {np.mean(losses):.4f}")

Epoch 1/5 | Batch 0/522 | Loss 1.0984
Epoch 1/5 | Batch 20/522 | Loss 1.1894
Epoch 1/5 | Batch 40/522 | Loss 0.9637
Epoch 1/5 | Batch 60/522 | Loss 0.7732
Epoch 1/5 | Batch 80/522 | Loss 0.5571
Epoch 1/5 | Batch 100/522 | Loss 0.7887
Epoch 1/5 | Batch 120/522 | Loss 0.6915
Epoch 1/5 | Batch 140/522 | Loss 0.9749
Epoch 1/5 | Batch 160/522 | Loss 0.4621
Epoch 1/5 | Batch 180/522 | Loss 0.5920
Epoch 1/5 | Batch 200/522 | Loss 0.7517
Epoch 1/5 | Batch 220/522 | Loss 0.4904
Epoch 1/5 | Batch 240/522 | Loss 0.3602
Epoch 1/5 | Batch 260/522 | Loss 0.5826
Epoch 1/5 | Batch 280/522 | Loss 0.3903
Epoch 1/5 | Batch 300/522 | Loss 0.7965
Epoch 1/5 | Batch 320/522 | Loss 0.5576
Epoch 1/5 | Batch 340/522 | Loss 0.4705
Epoch 1/5 | Batch 360/522 | Loss 0.2406
Epoch 1/5 | Batch 380/522 | Loss 0.4456
Epoch 1/5 | Batch 400/522 | Loss 0.5609
Epoch 1/5 | Batch 420/522 | Loss 0.3928
Epoch 1/5 | Batch 440/522 | Loss 0.3898
Epoch 1/5 | Batch 460/522 | Loss 0.5231
Epoch 1/5 | Batch 480/522 | Loss 0.6893
...

<h2 style="font-size:22px; font-family:Georgia; font-weight:bold;">
 Leaky CNN - Phase 2 Evaluation
</h2>

<p style="font-size:15px; font-family:Arial;">
This convolutional neural network (CNN) was trained under a flawed setup where patient-level leakage was present. The train–test split was performed at the image level, resulting in <b>435 patients</b> appearing in both training and test sets. This allowed the model to learn patient-specific features, artificially boosting performance.
</p>

<h3 style="font-size:20px; font-family:Georgia; font-weight:bold;">
Evaluation Metrics
</h3>

<ul style="font-size:15px; font-family:Arial;">
  <li><b>Test Accuracy:</b> 0.784</li>
  <li><b>Test ROC-AUC (macro, ovr):</b> 0.913</li>
  <li><b>Test F1-Score (macro):</b> 0.744</li>
</ul>

<h3 style="font-size:20px; font-family:Georgia; font-weight:bold;">
Interpretation
</h3>

<p style="font-size:15px; font-family:Arial;">
While the metrics appear strong, they are misleading. The presence of patient-level leakage means the model likely memorized features from individuals it saw during training, which reappeared in the test set. As a result, these scores do not reflect true generalization and should not be used to assess clinical reliability.
</p>

<p style="font-size:15px; font-family:Arial;">
This evaluation serves as a cautionary baseline and reinforces the importance of group-aware splitting and leak-free pipelines in medical machine learning workflows.
</p>
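
<p style="font-size:15px; font-family:Arial;">
For contrast, a group-aware split assigns whole patients, rather than individual images, to a partition. The following is a minimal hand-rolled sketch of that idea and not the Phase 3 implementation: <code>patient_level_split</code> and the toy IDs are hypothetical, and treating each missing-ID sample as its own patient is an assumption made here for illustration.
</p>

```python
import random

def patient_level_split(person_ids, test_fraction=0.2, seed=42):
    """Split sample indices so that every patient lands in exactly one partition.

    Assumption: a sample with a missing ID (None) is treated as its own patient.
    """
    groups = {}
    for idx, pid in enumerate(person_ids):
        key = ("unknown", idx) if pid is None else ("patient", pid)
        groups.setdefault(key, []).append(idx)

    keys = list(groups)
    random.Random(seed).shuffle(keys)

    target = test_fraction * len(person_ids)
    train_part, test_part = [], []
    for key in keys:
        # Fill the test partition first, one whole patient at a time.
        bucket = test_part if len(test_part) < target else train_part
        bucket.extend(groups[key])
    return sorted(train_part), sorted(test_part)

# Hypothetical patient IDs; None marks a missing ID.
toy_ids = [1, 1, 2, 2, 2, 3, None, 4, 4, 5]
toy_train, toy_test = patient_level_split(toy_ids)
train_patients = {toy_ids[i] for i in toy_train} - {None}
test_patients = {toy_ids[i] for i in toy_test} - {None}
print("Overlapping patients:", len(train_patients & test_patients))
```

<p style="font-size:15px; font-family:Arial;">
Because groups are assigned whole, the test partition can end up slightly larger than the requested fraction. scikit-learn's <code>GroupShuffleSplit</code> provides a similar group-wise split and is likely the more practical choice in a real pipeline.
</p>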

In [40]:
model.eval()
y_true, y_pred, y_prob = [], [], []

with torch.no_grad():
    for x, y in test_loader:
        x = x.to(device)
        outputs = model(x)                         
        probs = torch.softmax(outputs, dim=1)      

        y_true.extend(y.numpy())                   
        y_pred.extend(outputs.argmax(dim=1).cpu().numpy())
        y_prob.extend(probs.cpu().numpy())         

acc = accuracy_score(y_true, y_pred)

y_true = np.array(y_true)
y_prob = np.array(y_prob)                          

auc = roc_auc_score(
    y_true,
    y_prob,
    multi_class="ovr",      
    average="macro"         
)
print(f"Test Accuracy: {acc:.3f}")
print(f"Test ROC-AUC (macro, ovr): {auc:.3f}")

Test Accuracy: 0.784
Test ROC-AUC (macro, ovr): 0.913
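
<p style="font-size:15px; font-family:Arial;">
To make the "macro, ovr" label concrete: each class is scored against all others as a binary problem, and the per-class AUCs are averaged with equal weight. The sketch below follows that definition in plain Python on hypothetical toy predictions. It illustrates the metric only and does not reproduce scikit-learn's implementation; <code>binary_auc</code> and <code>macro_ovr_auc</code> are hypothetical helper names.
</p>

```python
def binary_auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(y_true, y_prob, n_classes):
    """Unweighted mean of per-class one-vs-rest AUCs."""
    aucs = []
    for c in range(n_classes):
        binary_true = [1 if t == c else 0 for t in y_true]
        class_scores = [row[c] for row in y_prob]
        aucs.append(binary_auc(binary_true, class_scores))
    return sum(aucs) / n_classes

# Hypothetical softmax outputs for four samples and three classes.
toy_true = [0, 1, 2, 1]
toy_prob = [
    [0.7, 0.2, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
    [0.5, 0.2, 0.3],
]
print(f"Macro OvR AUC: {macro_ovr_auc(toy_true, toy_prob, n_classes=3):.3f}")
```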


In [43]:
f1 = f1_score(y_true, y_pred, average='macro')
print(f"Test F1-Score: {f1:.3f}")

Test F1-Score: 0.744
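
<p style="font-size:15px; font-family:Arial;">
Macro F1 averages the per-class F1 scores with equal weight, so a weak minority class pulls the score down more than it pulls down accuracy. The following plain-Python sketch shows the computation on hypothetical toy predictions; <code>macro_f1</code> is an illustrative helper, not the scikit-learn function used above.
</p>

```python
def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores (0 for a class with no TP, FP, or FN)."""
    scores = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / n_classes

# Hypothetical labels and predictions for six samples.
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]
print(f"Macro F1: {macro_f1(toy_true, toy_pred, n_classes=3):.3f}")
```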
