# Dataset Information & Automatic Download

This notebook uses sample histopathology images from open datasets. If the images are not already present, they are downloaded automatically to the `data/` folder.

**Datasets Used:**
- **CAMELYON16**: Lymph node sections for metastasis detection
- **Sample WSI files**: Representative whole slide images for demonstration
- **Test images**: Smaller patches for quick processing

**Data Sources:**
- CAMELYON16 Challenge: https://camelyon16.grand-challenge.org/
- OpenSlide Test Data: https://openslide.cs.cmu.edu/demo/
- Kaggle Histopathologic Cancer Detection: Sample patches
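
The download-if-missing behaviour described above can be sketched with the standard library alone. This is a minimal illustration, not the notebook's actual helper: the function name `download_if_missing` and its arguments are hypothetical, and the repository's `shared.utils` module may handle this differently.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: Path) -> bool:
    """Fetch `url` into `dest` unless `dest` already exists.

    Returns True when a download happened, False when the file was already there.
    (Hypothetical helper for illustration only.)
    """
    dest = Path(dest)
    if dest.exists():
        return False  # already cached, nothing to do

    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a half-written file under the final name.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return True
```

Calling it twice with the same `dest` downloads only once, which is what keeps reruns of a notebook cheap.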

In [None]:
# Centralized data setup: persistent across notebooks (Colab Drive aware)
import sys, os
from pathlib import Path

try:
    from shared import utils as u
except ImportError:
    # Bootstrap for Colab: clone the repo so `shared` is available
    repo_url = "https://github.com/anand-indx/dp-t25.git"
    dest = "/content/dp-t25"
    if 'google.colab' in sys.modules:
        if not os.path.exists(dest):
            import subprocess
            # check=True surfaces a failed clone here instead of as a later ImportError
            subprocess.run(['git', 'clone', '--depth', '1', repo_url, dest], check=True)
        # Add the repo to the path even when it was cloned in an earlier session
        sys.path.insert(0, dest)
    else:
        # Fallback: assume the notebook sits two levels below the project root
        sys.path.insert(0, str(Path.cwd().parents[1]))
    from shared import utils as u

print("🚀 Initializing dataset setup (persistent, offline-friendly)…")
DATA_DIR = u.get_data_dir()
created = u.ensure_image_processing_samples(DATA_DIR)
print(f"📁 Data directory: {DATA_DIR}")
print(f"🎯 Prepared {len(created)} new files")

## Getting Started: Loading and Viewing a Digital Slide

This notebook covers the essential first step in any computational pathology project: accessing and displaying a whole-slide image (WSI).

**Our Goals:**
1.  Set up the environment by installing and importing key libraries.
2.  Use the `openslide-python` library to open a WSI file.
3.  Examine the slide's metadata, such as its different magnification levels.
4.  Extract a small tile (a patch) from the slide.
5.  Display the extracted tile using `matplotlib`.

### 1. Environment Setup

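First, we need to install the necessary Python packages. We'll use `openslide-python` for handling WSI files, `Pillow` for image objects, `matplotlib` for plotting, and `numpy` for the final array check. Note that `openslide-python` is only a binding: it also needs the native OpenSlide library, which you can install through your system package manager or, with recent releases, through the `openslide-bin` package on PyPI.

Run this command in your terminal to install them:
```bash
pip install openslide-python Pillow matplotlib numpy
```
After installation, we can import them into our notebook.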
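
Before running the import cell, it can help to confirm that the packages are visible to the notebook kernel, since `pip` may have installed them into a different environment. The following is a minimal sketch using only the standard library; `missing_packages` is a hypothetical helper name, and the list uses the import names (`openslide`, `PIL`, `matplotlib`) rather than the pip names.

```python
import importlib.util


def missing_packages(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names differ from pip names: Pillow is imported as `PIL`.
missing = missing_packages(["openslide", "PIL", "matplotlib"])
if missing:
    print(f"Not importable from this kernel: {missing}")
else:
    print("All required packages are importable.")
```

Finding the `openslide` module does not guarantee that the native OpenSlide library loads; that only becomes visible when `import openslide` actually runs.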

In [None]:
import openslide
from PIL import Image
import matplotlib.pyplot as plt
import numpy as np
import os
from pathlib import Path

# --- Configuration ---
# Use the detected writable data directory from the setup cell
wsi_filepath = str(DATA_DIR / 'CMU-1-Small-Region.svs')

print("Libraries imported and file path configured.")
print(f"Target WSI file: {wsi_filepath}")

# Check whether the file exists; if not, point to the sample patch images
if not os.path.exists(wsi_filepath):
    print("⚠️  WSI file not found; the OpenSlide cells below will be skipped.")
    sample_patches = [str(DATA_DIR / 'sample_patch_1.jpg'), str(DATA_DIR / 'sample_patch_2.jpg')]
    print(f"Sample patches available instead: {sample_patches}")

### 2. Opening the WSI File

Let's use OpenSlide to create a slide object. Opening a slide is memory-efficient because OpenSlide reads pixel data on demand rather than loading the entire gigapixel image at once. We'll include a check to ensure the file exists before trying to open it.

In [None]:
slide_handle = None
if not os.path.exists(wsi_filepath):
    print(f"ERROR: File not found at {wsi_filepath}")
    print("Please ensure you have downloaded the data and the path is correct.")
else:
    try:
        slide_handle = openslide.OpenSlide(wsi_filepath)
        print(f"Successfully opened: {os.path.basename(wsi_filepath)}")
    except openslide.OpenSlideError as e:
        print(f"Failed to open slide. Details: {e}")

### 3. Exploring Slide Properties

WSIs are stored as image pyramids with multiple layers, each at a different resolution. Level 0 is the highest-resolution layer. Let's see what levels are available in our file.
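
To make the pyramid idea concrete, here is a small sketch of how level dimensions follow from the downsample factors, and how one might pick a level for a target downsample. It uses plain Python with made-up example numbers (the `level0_size` and `downsamples` values are assumptions, not read from our file). OpenSlide provides `get_best_level_for_downsample` for the selection step; the hypothetical `best_level_for` below only mirrors the idea.

```python
def level_size(level0_size, downsample):
    """Approximate (width, height) of a level given its downsample factor."""
    width, height = level0_size
    return (int(width / downsample), int(height / downsample))


def best_level_for(downsamples, target):
    """Index of the most downsampled level that does not exceed `target`.

    Assumes `downsamples` is sorted ascending and starts at 1.0 (level 0).
    """
    best = 0
    for index, factor in enumerate(downsamples):
        if factor <= target:
            best = index
    return best


# Made-up example pyramid: three levels at 1x, 4x and 16x.
level0_size = (40000, 30000)
downsamples = [1.0, 4.0, 16.0]

for index, factor in enumerate(downsamples):
    print(f"Level {index}: {level_size(level0_size, factor)} at {factor:.0f}x")

# For a target of 8x, level 1 (4x) is the coarsest level that stays at or below it.
print(best_level_for(downsamples, target=8.0))
```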

In [None]:
if slide_handle:
    print(f"Total levels: {slide_handle.level_count}")
    for level_index, dims in enumerate(slide_handle.level_dimensions):
        downsample_factor = slide_handle.level_downsamples[level_index]
        print(f"- Level {level_index}: Dimensions={dims}, Downsample={downsample_factor:.2f}x")

### 4. Extracting a Region of Interest

A full-resolution WSI is usually too large to process in one piece, so we extract smaller tiles instead. The `read_region` method takes the top-left `location` `(x, y)`, always expressed in the level 0 coordinate system, the `level` to read from, and the `size` `(width, height)` of the tile in pixels at that level.
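
Because the location is always given in level 0 coordinates while the tile size is given in pixels at the chosen level, the two are easy to mix up when reading from a lower-resolution level. The sketch below shows the arithmetic in plain Python; the helper names `to_level0`, `level0_extent`, and `fits_inside`, as well as the example numbers, are illustrative assumptions and not part of OpenSlide.

```python
def to_level0(point_at_level, downsample):
    """Convert an (x, y) pixel position at some level to level 0 coordinates."""
    x, y = point_at_level
    return (int(x * downsample), int(y * downsample))


def level0_extent(size_at_level, downsample):
    """Level 0 area (width, height) covered by a tile of `size_at_level`."""
    width, height = size_at_level
    return (int(width * downsample), int(height * downsample))


def fits_inside(location0, extent0, level0_size):
    """True when the tile's level 0 footprint lies fully within the slide."""
    (x, y), (w, h), (slide_w, slide_h) = location0, extent0, level0_size
    return x >= 0 and y >= 0 and x + w <= slide_w and y + h <= slide_h


# Example: a 256x256 tile whose top-left is pixel (100, 50) at a 4x level.
location0 = to_level0((100, 50), downsample=4.0)     # pass this as `location`
extent0 = level0_extent((256, 256), downsample=4.0)  # area covered at level 0
print(location0, extent0)
print(fits_inside(location0, extent0, level0_size=(2000, 3000)))
```

A bounds check like `fits_inside` is worth running before reading, since a tile that lies outside the slide contains no tissue.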

In [None]:
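# The pyramid level to read from
level_to_read = 0

# The desired tile size (width, height)
tile_dimensions = (256, 256)

tile_image = None
if slide_handle:
    # Top-left coordinate for our tile (in Level 0 coordinates). A fixed value such
    # as (50000, 35000) can fall outside a small sample slide, which yields an empty
    # tile, so we center the tile on the slide instead.
    slide_width, slide_height = slide_handle.dimensions
    read_location = (
        max(0, (slide_width - tile_dimensions[0]) // 2),
        max(0, (slide_height - tile_dimensions[1]) // 2),
    )
    tile_image = slide_handle.read_region(read_location, level_to_read, tile_dimensions)
    print(f"Extracted a {tile_image.size} tile from level {level_to_read} at {read_location}.")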

### 5. Visualizing the Extracted Tile

`read_region` returns a standard Pillow `Image` object in RGBA mode, which we can plot directly with `matplotlib`.
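
Dropping the alpha channel with `convert('RGB')` is fine for fully opaque tiles. If a tile contains transparent pixels (for example, where a region extends past the slide edge), those typically end up black, and a common alternative is to blend the tile onto a white background. The per-pixel arithmetic is sketched below in plain Python, assuming straight (non-premultiplied) alpha; `composite_over_white` is a hypothetical helper, and in practice the same blend would be applied to the whole image with Pillow.

```python
def composite_over_white(rgba_pixel):
    """Blend one straight-alpha RGBA pixel (0-255 per channel) onto white."""
    r, g, b, a = rgba_pixel
    alpha = a / 255
    return tuple(round(c * alpha + 255 * (1 - alpha)) for c in (r, g, b))


print(composite_over_white((200, 100, 50, 255)))  # opaque pixel is unchanged
print(composite_over_white((0, 0, 0, 0)))         # fully transparent pixel becomes white
```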

In [None]:
if tile_image:
    # read_region returns RGBA; convert to RGB for cleaner display
    rgb_tile = tile_image.convert('RGB')
    
    plt.figure(figsize=(6, 6))
    plt.imshow(rgb_tile)
    plt.title(f"Tile from Level {level_to_read} at {read_location}")
    plt.xlabel('X-axis (pixels)')
    plt.ylabel('Y-axis (pixels)')
    plt.grid(False)
    plt.show()

## ✅ Final Check

To complete this task, we need to verify that our extracted tile has the correct dimensions. The cell below converts the `Pillow` image to a `NumPy` array and asserts its shape.
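
One detail worth keeping in mind for this check: Pillow reports an image's size as `(width, height)`, whereas a NumPy array built from it has shape `(height, width, channels)`. The small sketch below spells out that mapping in plain Python; `expected_array_shape` is a hypothetical helper, not part of either library.

```python
def expected_array_shape(pil_size, mode="RGB"):
    """Map a Pillow-style (width, height) size to a NumPy-style array shape."""
    channels = {"L": 1, "RGB": 3, "RGBA": 4}[mode]
    width, height = pil_size
    if channels == 1:
        return (height, width)  # grayscale arrays have no channel axis
    return (height, width, channels)


print(expected_array_shape((256, 256)))          # square tile: the swap is invisible
print(expected_array_shape((512, 256), "RGBA"))  # non-square tile: height comes first
```

With a square 256x256 tile the order does not matter, but the distinction becomes important as soon as tiles are rectangular.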

In [None]:
if 'rgb_tile' in locals() and rgb_tile is not None:
    image_as_array = np.array(rgb_tile)
    
    # The shape should be (height, width, channels)
    expected_shape = (256, 256, 3)
    assert image_as_array.shape == expected_shape, f"Shape mismatch! Expected {expected_shape}, but got {image_as_array.shape}"
    
    print("SUCCESS: The extracted image tile has the correct shape.")
    print(f"Shape: {image_as_array.shape}")
else:
    print("Skipping test: The image tile was not loaded correctly in previous steps.")