# Histopathologic Cancer Detection

This notebook follows the mini-project requirements for the Kaggle competition **Histopathologic Cancer Detection**. It contains: problem description, EDA, model building (transfer learning), training, evaluation, and instructions to create a Kaggle submission.

Notes:
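- The dataset is available on Kaggle. Download it with the Kaggle API, for example `kaggle competitions download -c histopathologic-cancer-detection -p data`, then unzip the archive into `data/`.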
- The notebook includes placeholders and safe checks so it can be opened even when the dataset isn't yet present.

## 1 — Setup and dependencies
Install the Python packages and configure Kaggle credentials if you haven't already. In persistent environments (for example, remote kernels) the install cell only needs to run once.

In [1]:
# Package install (uncomment and run if needed)
# !pip install -q -U pip
# !pip install -q kaggle tensorflow==2.12.0 matplotlib pandas scikit-learn opencv-python scikit-image

# Note: TensorFlow can be large — use an existing environment if available.
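
The download step from the notes can also be scripted. The following is a minimal sketch, assuming the Kaggle CLI is installed, that credentials are available either in `~/.kaggle/kaggle.json` or through the `KAGGLE_USERNAME` environment variable, and that you have accepted the competition rules on Kaggle. The helper names `kaggle_download_command` and `download_and_extract` are hypothetical and not part of the notebook.

```python
import os
import subprocess
import zipfile
from pathlib import Path

COMPETITION = "histopathologic-cancer-detection"  # competition slug on Kaggle


def kaggle_download_command(dest):
    """Build the Kaggle CLI call that downloads the competition archive into `dest`."""
    return ["kaggle", "competitions", "download", "-c", COMPETITION, "-p", str(dest)]


def download_and_extract(dest=Path("data")):
    """Download the archive into `dest` and unpack every zip found there."""
    creds = Path.home() / ".kaggle" / "kaggle.json"
    # Assumption: credentials come from kaggle.json or from environment variables
    if not creds.exists() and "KAGGLE_USERNAME" not in os.environ:
        raise FileNotFoundError(f"Kaggle credentials not found at {creds}")
    dest.mkdir(parents=True, exist_ok=True)
    subprocess.run(kaggle_download_command(dest), check=True)
    for archive in dest.glob("*.zip"):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)


# Show the command without running it; the download is large
print(" ".join(kaggle_download_command(Path("data"))))
# download_and_extract()  # uncomment to actually download and unpack
```

Keeping the command construction separate from the call that runs it makes it easy to inspect what will be executed before starting the download.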

In [2]:
# Imports
import os
from pathlib import Path
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image
import tensorflow as tf
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve

print('tf version:', tf.__version__)

## 2 — Data overview & EDA
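This section inspects the dataset files and shows the class balance and a few example images. It expects the Kaggle layout under `data/`: `train_labels.csv` (columns `id` and `label`) and a `train/` folder of image patches. The Kaggle originals are `.tif` files, so adjust the extension used in the code if your copy differs.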

In [None]:
DATA_DIR = Path('data')
TRAIN_DIR = DATA_DIR / 'train'
LABELS_CSV = DATA_DIR / 'train_labels.csv'

print('Looking for data at:', DATA_DIR)
print('Train dir exists?', TRAIN_DIR.exists())
print('Labels csv exists?', LABELS_CSV.exists())

In [None]:
# Load labels if available and show class balance + basic EDA
if LABELS_CSV.exists():
    labels = pd.read_csv(LABELS_CSV)
    print('Labels head:')
    print(labels.head())
    print('\nClass distribution:')
    print(labels['label'].value_counts())

    # Plot class balance
    try:
        ax = labels['label'].value_counts().sort_index().plot(kind='bar', color=['C0','C1'])
        ax.set_xlabel('label')
        ax.set_ylabel('count')
        ax.set_title('Class balance')
        plt.show()
    except Exception as e:
        print('Could not plot class balance:', e)

    # Quick pixel statistics on a small sample (safe & fast)
    if TRAIN_DIR.exists():
        sample_ids = labels['id'].sample(min(200, len(labels)), random_state=42).tolist()
        means = []
        stds = []
        for sid in sample_ids:
            # Kaggle ships the patches as .tif files
            p = TRAIN_DIR / f'{sid}.tif'
            try:
                im = np.array(Image.open(p).convert('RGB')) / 255.0
                means.append(im.mean())
                stds.append(im.std())
            except OSError:
                # Missing or unreadable file: skip it
                continue
        if means:
            print(f'Pixel mean (sample): {np.mean(means):.4f}, std: {np.mean(stds):.4f}')
            plt.hist(means, bins=30, alpha=0.6, label='means')
            plt.hist(stds, bins=30, alpha=0.6, label='stds')
            plt.legend()
            plt.title('Sample pixel means and stds')
            plt.show()
        else:
            print('No sample images could be read; check the file extension and TRAIN_DIR')
    else:
        print('Train folder not available; skipping pixel stats')
else:
    print('Labels CSV not found. Download the competition data into data/ first (see the notes at the top).')

In [None]:
# Show a few example images (if train folder exists)
from itertools import islice

def show_examples(n=6):
    if not TRAIN_DIR.exists():
        print('Train folder not found; skipping image preview')
        return
    # Kaggle ships the patches as .tif; islice stops after n paths instead of listing the whole folder
    img_paths = list(islice(TRAIN_DIR.glob('*.tif'), n))
    if not img_paths:
        print('No TIFF images found in train folder')
        return
    cols = 3
    rows = int(np.ceil(len(img_paths)/cols))
    plt.figure(figsize=(cols*3, rows*3))
    for i, p in enumerate(img_paths):
        img = Image.open(p)
        ax = plt.subplot(rows, cols, i+1)
        plt.imshow(img)
        plt.axis('off')
    plt.show()

show_examples(6)

## 3 — Data pipeline and augmentation
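We build the input pipeline with `tf.data`: file paths and labels are split into training and validation sets, images are decoded and resized in parallel, and batches are prefetched. For larger experiments, consider adding caching as well.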

In [None]:
BATCH_SIZE = 64
IMG_SIZE = (96, 96)  # original Kaggle patches are 96x96
AUTOTUNE = tf.data.AUTOTUNE

def make_datasets_from_labels(labels_df, train_dir, img_size=IMG_SIZE, batch_size=BATCH_SIZE, val_split=0.15, seed=123):
    # labels_df should have columns 'id' and 'label' where 'id' is filename without extension
    # Shuffle once in pandas and split BEFORE building the datasets. Calling
    # take()/skip() on a dataset that reshuffles every epoch would leak
    # validation samples into training.
    df = labels_df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    # Kaggle ships the patches as .tif files
    df['filename'] = df['id'].apply(lambda x: str(train_dir / f'{x}.tif'))
    val_count = int(len(df) * val_split)

    def _read_rgb(path):
        # Core TensorFlow has no TIFF decoder, so decode with PIL
        with Image.open(path.decode('utf-8')) as im:
            return np.asarray(im.convert('RGB'), dtype=np.uint8)

    def _load(path, label):
        image = tf.numpy_function(_read_rgb, [path], tf.uint8)
        image.set_shape([None, None, 3])
        # Keep pixels in [0, 255]: Keras EfficientNet rescales its inputs internally
        image = tf.image.resize(image, img_size)
        return image, label

    def _build(part, shuffle):
        ds = tf.data.Dataset.from_tensor_slices(
            (part['filename'].tolist(), part['label'].astype(np.int32).tolist()))
        if shuffle:
            ds = ds.shuffle(1024, seed=seed)  # reshuffles the training set every epoch
        return ds.map(_load, num_parallel_calls=AUTOTUNE).batch(batch_size).prefetch(AUTOTUNE)

    val_ds = _build(df.iloc[:val_count], shuffle=False)
    train_ds = _build(df.iloc[val_count:], shuffle=True)
    return train_ds, val_ds

# Example: only run if the labels and the image folder are available
if 'labels' in globals() and TRAIN_DIR.exists():
    train_ds, val_ds = make_datasets_from_labels(labels, TRAIN_DIR)
    print('train_ds and val_ds created')
else:
    print('labels or train folder not available; skip dataset creation')
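
The pipeline above does not apply any augmentation yet. Tissue patches have no canonical orientation, so one cheap option is to apply random flips and 90-degree rotations, which also leave the centre of the patch in place. The following is a minimal NumPy sketch under that assumption; `random_dihedral` is a hypothetical helper, not part of the notebook. Inside a `tf.data` pipeline the same idea could be expressed with `tf.image.random_flip_left_right`, `tf.image.random_flip_up_down` and `tf.image.rot90`.

```python
import numpy as np


def random_dihedral(image, rng):
    """Return one of the 8 flip/rotation variants of an (H, W, C) image."""
    k = int(rng.integers(0, 4))              # number of 90-degree rotations
    out = np.rot90(image, k=k, axes=(0, 1))  # rotate in the spatial plane only
    if rng.integers(0, 2) == 1:              # flip horizontally half of the time
        out = np.flip(out, axis=1)
    return np.ascontiguousarray(out)


rng = np.random.default_rng(0)
# Synthetic stand-in for a 96x96 RGB patch (assumption: real patches have this shape)
patch = rng.integers(0, 256, size=(96, 96, 3), dtype=np.uint8)
augmented = random_dihedral(patch, rng)
print(augmented.shape, augmented.dtype)
```

These transforms only rearrange pixels, so they do not change the intensity statistics computed in the EDA section.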

## 4 — Model (Transfer learning)
We use EfficientNetB0 pre-trained on ImageNet as a feature extractor and train a new sigmoid head for binary classification. The base is frozen by default; pass `base_trainable=True` to `build_model` to fine-tune it.

In [None]:
def build_model(input_shape=(*IMG_SIZE, 3), base_trainable=False):
    base = tf.keras.applications.EfficientNetB0(include_top=False, input_shape=input_shape, weights='imagenet')
    base.trainable = base_trainable
    inputs = tf.keras.Input(shape=input_shape)
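    # For Keras EfficientNet this call is a pass-through: the model rescales raw [0, 255] pixels itself
    x = tf.keras.applications.efficientnet.preprocess_input(inputs)
    x = base(x, training=False)  # keep BatchNorm layers in inference mode, even when fine-tuning
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.Dropout(0.3)(x)
    outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)
    model = tf.keras.Model(inputs, outputs)
    # Track AUC alongside accuracy, since the competition is scored on ROC AUC
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss='binary_crossentropy',
                  metrics=['accuracy', tf.keras.metrics.AUC(name='auc')])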
    return model

model = build_model()
model.summary()

## 5 — Training
Train with early stopping and model checkpointing, both monitoring validation loss. The dataset is large, so start with a few epochs and check the validation ROC AUC in Section 6 before committing to longer runs.

In [None]:
EPOCHS = 10
if 'train_ds' in globals():
    callbacks = [
        tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True),
        tf.keras.callbacks.ModelCheckpoint('best_model.h5', save_best_only=True, monitor='val_loss')
    ]
    history = model.fit(train_ds, validation_data=val_ds, epochs=EPOCHS, callbacks=callbacks)
else:
    print('Training datasets not available — run data download and dataset creation first')

## 6 — Evaluation and Metrics
Compute ROC AUC, the competition metric, on the validation set.

In [None]:
if 'val_ds' in globals():
    y_true = []
    y_prob = []
    for x_batch, y_batch in val_ds.unbatch().batch(1024):
        preds = model.predict(x_batch, verbose=0).ravel()
        y_prob.extend(preds.tolist())
        y_true.extend(y_batch.numpy().tolist())
    auc = roc_auc_score(y_true, y_prob)
    print('Validation ROC AUC:', auc)
else:
    print('Validation dataset not available — skipping evaluation')
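
A single AUC number hides the trade-off between sensitivity and false positives, so it helps to plot the curves as well. The following is a minimal sketch using `roc_curve` and `precision_recall_curve` from scikit-learn, which the notebook already imports. The helper `plot_roc_pr` is hypothetical, and the demo scores are synthetic so that the sketch runs without a trained model.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, precision_recall_curve, roc_auc_score


def plot_roc_pr(y_true, y_prob):
    """Plot ROC and precision-recall curves side by side and return the figure."""
    fpr, tpr, _ = roc_curve(y_true, y_prob)
    precision, recall, _ = precision_recall_curve(y_true, y_prob)
    auc = roc_auc_score(y_true, y_prob)

    fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
    ax_roc.plot(fpr, tpr, label=f'AUC = {auc:.3f}')
    ax_roc.plot([0, 1], [0, 1], linestyle='--', color='gray')  # chance level
    ax_roc.set_xlabel('False positive rate')
    ax_roc.set_ylabel('True positive rate')
    ax_roc.set_title('ROC curve')
    ax_roc.legend()

    ax_pr.plot(recall, precision)
    ax_pr.set_xlabel('Recall')
    ax_pr.set_ylabel('Precision')
    ax_pr.set_title('Precision-recall curve')
    fig.tight_layout()
    return fig


# Synthetic labels and scores (assumption: stand-ins for y_true and y_prob)
rng = np.random.default_rng(0)
demo_true = rng.integers(0, 2, size=500)
demo_prob = np.clip(0.35 * demo_true + rng.normal(0.4, 0.2, size=500), 0, 1)
fig = plot_roc_pr(demo_true, demo_prob)
```

With the variables from the evaluation cell, the call would be `plot_roc_pr(y_true, y_prob)`. Outside a notebook, add `plt.show()` to display the figure.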

## 7 — Prepare Kaggle submission (example)
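This section checks for the test folder and `sample_submission.csv`; the inference step itself is left as a placeholder. The submission file needs two columns: `id` and `label`, where `label` is the predicted probability for each test image.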

In [None]:
TEST_DIR = DATA_DIR / 'test'
SAMPLE_SUB = DATA_DIR / 'sample_submission.csv'
# Fallback to repo root if not present under data/
if not SAMPLE_SUB.exists():
    SAMPLE_SUB = Path('sample_submission.csv')

if TEST_DIR.exists() and SAMPLE_SUB.exists():
    try:
        sample = pd.read_csv(SAMPLE_SUB)
        print('Sample submission loaded, rows =', len(sample))
    except Exception as e:
        print('Could not read sample submission:', e)
    print('Test dir exists; placeholder for inference when model & test images are present')
else:
    print('Test folder or sample submission not found; skipping submission creation')
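
The placeholder above stops before any file is written. The following is a minimal sketch of the last step, assuming you already have one predicted probability per test id (for example from `model.predict` over the test images). The helper `build_submission` and the demo ids are hypothetical; the demo data stands in for `sample_submission.csv` and the model outputs.

```python
import pandas as pd


def build_submission(sample, probs_by_id):
    """Fill the 'label' column from a dict of predictions, keeping the sample's row order."""
    missing = [i for i in sample['id'] if i not in probs_by_id]
    if missing:
        raise ValueError(f'{len(missing)} test ids have no prediction')
    out = sample[['id']].copy()
    out['label'] = out['id'].map(probs_by_id).astype(float)
    return out


# Tiny synthetic stand-ins (assumption: the real sample file has columns id and label)
demo_sample = pd.DataFrame({'id': ['a1', 'b2', 'c3'], 'label': [0, 0, 0]})
demo_probs = {'c3': 0.91, 'a1': 0.07, 'b2': 0.45}

submission = build_submission(demo_sample, demo_probs)
print(submission.to_csv(index=False))
# submission.to_csv('submission.csv', index=False)  # write the file to upload
```

Mapping predictions by id, rather than relying on file order, guards against a mismatch between the order in which test images are read and the order of rows in the sample file.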

## 8 — Discussion & Next steps
- Try stronger data augmentation (random rotations, color jitter).
- Experiment with model ensembles or deeper models.
- Use stratified k-fold CV and mixup/cutmix for improved generalization.
- Track experiments with Weights & Biases or TensorBoard.
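
As an illustration of the stratified k-fold suggestion, here is a minimal sketch using scikit-learn's `StratifiedKFold`. The labels are synthetic stand-ins for `labels['label']`, and the 40% positive rate is an arbitrary assumption for the demo, not a measured property of the dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced labels (assumption: stand-ins for labels['label'])
rng = np.random.default_rng(0)
demo_labels = (rng.random(1000) < 0.4).astype(int)
demo_ids = np.arange(len(demo_labels))

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
folds = []
for fold, (train_idx, val_idx) in enumerate(skf.split(demo_ids, demo_labels)):
    folds.append((train_idx, val_idx))
    print(f'fold {fold}: train={len(train_idx)}, val={len(val_idx)}, '
          f'val positive rate={demo_labels[val_idx].mean():.3f}')
```

Each fold keeps roughly the same positive rate as the full set, so the validation AUC is comparable across folds. In the notebook, the index arrays could be used to select rows of `labels` before building the datasets.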

---
**Deliverable checklist**:
- Notebook with EDA, model, training code, evaluation (this document).
- `requirements.txt` provided in the repository.
- README with basic run instructions.