# Histopathologic Cancer Detection

## 1. Problem Description

In this project, we work on the Kaggle challenge titled *Histopathologic Cancer Detection*. The goal is to identify the presence of metastatic tissue in small histopathologic image patches derived from larger digital pathology scans.

This is a **binary image classification** problem:
- `1`: Tumor present (metastatic cancer tissue in center region)
- `0`: No tumor tissue in the center

The challenge has real-world implications: pathologists often scan large digital slides for regions of cancerous tissue, and automating this process can accelerate diagnosis and reduce human error.

### Dataset Description

The dataset is derived from the **PatchCamelyon (PCam)** benchmark dataset, a popular and approachable dataset designed for evaluating metastasis detection models. This Kaggle version:
- **Removes duplicates** that existed due to probabilistic sampling in the original PCam.
- Maintains the same data split and label definition as PCam.

From the authors of PCam:
> "[PCam] packs the clinically-relevant task of metastasis detection into a straight-forward binary image classification task, akin to CIFAR-10 and MNIST. Models can easily be trained on a single GPU in a couple hours, and achieve competitive scores in the Camelyon16 tasks of tumor detection and whole-slide image diagnosis. Furthermore, the balance between task-difficulty and tractability makes it a prime suspect for fundamental machine learning research on topics as active learning, model uncertainty, and explainability."

### Technical Details

- **Image Size**: 96×96 RGB
- **Training Images**: 220,025
- **Label**: Binary (0 or 1)
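- **Focus**: The label depends only on the **center 32×32 region**: an image is labeled `1` if that region contains at least **one pixel of tumor** tissue, and tumor tissue in the outer border does not affect the label. The border is included so that fully convolutional models that avoid zero-padding behave consistently when applied to whole-slide images.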
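
Because the label is determined by the central 32×32 pixels, it can help to inspect that region on its own. The following is a minimal sketch using NumPy. The helper name `center_crop` and the all-zeros placeholder patch are illustrative assumptions, not part of the competition code.

```python
import numpy as np

def center_crop(patch, size=32):
    """Return the central size x size region of an (H, W, ...) image array."""
    h, w = patch.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return patch[top:top + size, left:left + size]

# Placeholder 96x96 RGB patch; a real one could come from np.asarray(Image.open(path)).
patch = np.zeros((96, 96, 3), dtype=np.uint8)
center = center_crop(patch)
print(center.shape)  # (32, 32, 3)
```

For a 96×96 patch the crop covers rows and columns 32 through 63.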

### Evaluation Metric

Submissions are evaluated using the **Area Under the ROC Curve (AUC ROC)** between predicted probabilities and true binary labels.
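
ROC AUC can be read as the probability that a randomly chosen positive patch receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition on made-up labels and scores. The function name `roc_auc` is hypothetical, and the pairwise loop is quadratic, so it is meant for illustration only; `sklearn.metrics.roc_auc_score` computes the same quantity efficiently.

```python
def roc_auc(y_true, y_score):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up example: three of the four positive/negative pairs are ordered correctly.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75
```

Since only the ranking of scores matters, the metric is unaffected by any monotonic rescaling of the predicted probabilities.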

### Submission Format

For each `id` in the test set, your model should predict a **probability** that the center 32×32 region contains tumor tissue. The CSV submission should follow this format:

```
id,label
0b2ea2a822ad23fdb1b5dd26653da899fbd2c0d5,0.03
95596b92e5066c5c52466c90b69ff089b39f2737,0.81
248e6738860e2ebcf6258cdc1f32f299e0c76914,0.47
```
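
A file in this format can be produced with Python's standard `csv` module. The sketch below writes to an in-memory buffer so it runs anywhere. The function name `write_submission` is hypothetical, the ids are taken from the example above, and the probabilities are made up.

```python
import csv
import io

def write_submission(ids, probs, out):
    """Write an id,label header followed by one row per test image."""
    if len(ids) != len(probs):
        raise ValueError("ids and probs must have the same length")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["id", "label"])
    for img_id, p in zip(ids, probs):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range for {img_id}: {p}")
        writer.writerow([img_id, f"{p:.4f}"])

ids = [
    "0b2ea2a822ad23fdb1b5dd26653da899fbd2c0d5",
    "95596b92e5066c5c52466c90b69ff089b39f2737",
]
probs = [0.03, 0.81]  # made-up predictions

buf = io.StringIO()
write_submission(ids, probs, buf)
print(buf.getvalue())
```

To write a real file, pass a handle opened with `open("submission.csv", "w", newline="")` instead of the buffer.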


## 2. Exploratory Data Analysis (EDA)
Here, we inspect label distributions and sample images to understand the dataset.

```python
# Load libraries
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from PIL import Image
from sklearn.model_selection import train_test_split

# Load labels
labels_df = pd.read_csv('../input/histopathologic-cancer-detection/train_labels.csv')
print(labels_df.shape)
labels_df.head()
```

```python
# Plot label distribution
sns.countplot(x='label', data=labels_df)
plt.title('Label Distribution')
plt.show()
```
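
The count plot shows the class balance visually; the exact counts and fractions can be read off with `value_counts`. The sketch below uses a small made-up frame, `toy_df`, as a stand-in for `labels_df`, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for labels_df; the real frame is loaded from train_labels.csv.
toy_df = pd.DataFrame({"id": list("abcde"), "label": [0, 0, 0, 1, 1]})

counts = toy_df["label"].value_counts().sort_index()
fractions = toy_df["label"].value_counts(normalize=True).sort_index()

# One row per class, with its absolute count and its share of all rows.
summary = pd.DataFrame({"count": counts, "fraction": fractions})
print(summary)
```

Replacing `toy_df` with `labels_df` gives the same table for the training labels.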

```python
# Show some images
def show_images(df, label, n=5):
    """Plot n randomly sampled training patches that carry the given label."""
    subset = df[df['label'] == label].sample(n)
    fig, axs = plt.subplots(1, n, figsize=(15, 5))
    for i, img_id in enumerate(subset['id']):
        img = Image.open(f"../input/histopathologic-cancer-detection/train/{img_id}.tif")
        axs[i].imshow(img)
        axs[i].axis('off')
    plt.suptitle(f'Sample images - label {label}')
    plt.show()

show_images(labels_df, 0)  # patches without tumor tissue in the center
show_images(labels_df, 1)  # patches with tumor tissue in the center
```

## 3. Model Architecture
We define CNN architectures and explore transfer learning options.

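```python
# Example PyTorch CNN (can switch to Keras as needed)
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleCNN(nn.Module):
    """Two conv/pool stages followed by two fully connected layers."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 32, 3)    # 96x96 -> 94x94 (no padding)
        self.pool = nn.MaxPool2d(2, 2)      # halves height and width
        self.conv2 = nn.Conv2d(32, 64, 3)   # 47x47 -> 45x45
        self.fc1 = nn.Linear(64 * 22 * 22, 128)  # 22x22 after the second pool
        self.fc2 = nn.Linear(128, 1)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        # Flatten per sample; unlike view(-1, ...), this never changes the batch size
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        # Sigmoid yields a probability, so train with nn.BCELoss (not BCEWithLogitsLoss)
        x = torch.sigmoid(self.fc2(x))
        return x
```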
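
The `64*22*22` input size of `fc1` follows from the layer arithmetic: each unpadded 3×3 convolution trims two pixels from the height and width, and each 2×2 max pool halves them, rounding down. The sketch below traces a 96×96 input through the network. The helpers `conv_out` and `pool_out` are hypothetical names used only for this check.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output side length of a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Output side length of a max pool that rounds down."""
    return (size - kernel) // stride + 1

size = 96
size = conv_out(size, 3)   # conv1: 96 -> 94
size = pool_out(size)      # pool:  94 -> 47
size = conv_out(size, 3)   # conv2: 47 -> 45
size = pool_out(size)      # pool:  45 -> 22

flat_features = 64 * size * size  # 64 channels out of conv2
print(size, flat_features)  # 22 30976
```

If the input resolution, kernel sizes, or padding change, the `fc1` input size has to be recomputed the same way.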

## 4. Results and Analysis
Here we train models, evaluate them using metrics, and perform hyperparameter tuning.

```python
# Training loop placeholder (PyTorch/Keras depending on your stack)
# Save best model and evaluate using metrics like AUC, accuracy, F1
# Example metric plotting code

def plot_metrics(history):
    """Plot loss curves; history is a dict with 'train_loss' and 'val_loss' lists."""
    plt.plot(history['train_loss'], label='train_loss')
    plt.plot(history['val_loss'], label='val_loss')
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Loss over Epochs')
    plt.legend()
    plt.show()
```
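
One way to produce the `history` dict that `plot_metrics` expects is a framework-agnostic epoch loop that also tracks the best validation loss. This is only a sketch, not the training code for this project: `run_training` is a hypothetical helper, `train_epoch` and `validate` are assumed callables that take the epoch index and return a float loss, and the loss values are made up.

```python
def run_training(train_epoch, validate, epochs):
    """Run the epoch loop and return (history, best_epoch)."""
    history = {"train_loss": [], "val_loss": []}
    best_val, best_epoch = float("inf"), None
    for epoch in range(epochs):
        history["train_loss"].append(train_epoch(epoch))
        val_loss = validate(epoch)
        history["val_loss"].append(val_loss)
        if val_loss < best_val:
            best_val, best_epoch = val_loss, epoch
            # A real loop would checkpoint here, e.g. in PyTorch:
            # torch.save(model.state_dict(), path)
    return history, best_epoch

# Made-up losses standing in for real training and validation passes.
fake_train = [0.70, 0.55, 0.45, 0.40]
fake_val = [0.68, 0.58, 0.60, 0.62]

history, best_epoch = run_training(
    lambda e: fake_train[e], lambda e: fake_val[e], epochs=4
)
print(best_epoch)  # 1
```

In this toy run the training loss keeps falling while the validation loss rises after epoch 1, which is the overfitting pattern the loss plot is meant to reveal.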

## 5. Conclusion
Summarize what worked best, what didn’t, and what you would try in future iterations.

## 6. GitHub and Kaggle Leaderboard
- [GitHub Repository](https://github.com/abdallahmohammed2025/CNN_Histopathologic_Cancer_Detection#)
- Kaggle Leaderboard Screenshot (attach or embed below):

![Kaggle Screenshot](kaggle_leaderboard.png)