# 0.3. Normalize and Crop Training Data

Because the training images are large, we normalize them and crop them to smaller volumes. This reduces the computational load and makes the data easier to work with.

Normalization scales the pixel values to a range that is suitable for training machine learning models. Cropping focuses each training sample on a region of interest, which can improve model performance by reducing noise and irrelevant information.
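
As a rough illustration of what percentile normalization does, here is a minimal NumPy sketch. It is an approximation written for this explanation, not the `csbdeep` implementation used later in this notebook; the function name `percentile_normalize`, the `eps` value, and the synthetic stack are assumptions.

```python
import numpy as np


def percentile_normalize(image, p_low=1.0, p_high=99.8, eps=1e-20):
    """Rescale so the p_low percentile maps to about 0 and the p_high percentile to about 1."""
    image = image.astype(np.float32)
    low = np.percentile(image, p_low)
    high = np.percentile(image, p_high)
    # eps guards against division by zero for constant images
    return ((image - low) / (high - low + eps)).astype(np.float32)


# Synthetic 3D stack (z, x, y) standing in for a real image
rng = np.random.default_rng(0)
stack = rng.integers(0, 4096, size=(8, 32, 32)).astype(np.uint16)
normalized = percentile_normalize(stack)

# Values are not clipped: pixels outside the percentile range
# end up below 0 or above 1
print(normalized.dtype, float(normalized.min()), float(normalized.max()))
```

Using percentiles instead of the raw minimum and maximum makes the scaling less sensitive to a few very dark or very bright outlier pixels.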

## 0.3.1. Load Python Libraries

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
from csbdeep.utils import normalize
from tifffile import imread, imwrite, imshow


## 0.3.2. Load Custom Functions

In [None]:
def get_crop_coordinates_3d(
    label_image: np.ndarray,
    label_value: int = 1,
    crop_size_xy: int = 128,
    padding_z: int = 2,
    min_crop_z: int = 12,
) -> tuple:
    """
    Calculate the coordinates for a crop of specified size around the given label value in a 3D label image.
    The z-dimension crop is dynamically determined based on the label's z-coordinates, with a minimum size enforced.

    Parameters:
        label_image (numpy.ndarray): The 3D label image.
        label_value (int): The label value to center the crop around.
        crop_size_xy (int): The size of the crop in the x and y dimensions.
        padding_z (int): Extra slices to include above and below the label in the z-dimension.
        min_crop_z (int): The minimum size of the crop in the z-dimension.

    Returns:
        tuple: (start_z, start_x, start_y, end_z, end_x, end_y) coordinates for the crop.
    """
    # Find the coordinates of the label with the specified value
    coords = np.argwhere(label_image == label_value)

    if coords.size == 0:
        raise ValueError(f"Label value {label_value} not found in the image.")

    # Calculate the z-range of the label
    z_min = coords[:, 0].min()
    z_max = coords[:, 0].max()

    # Add padding to the z-range
    start_z = max(z_min - padding_z, 0)
    end_z = min(
        z_max + padding_z + 1, label_image.shape[0]
    )  # +1 to include the upper bound

    # Ensure the z crop is at least `min_crop_z`
    current_crop_z = end_z - start_z
    if current_crop_z < min_crop_z:
        extra = min_crop_z - current_crop_z
        start_z = max(start_z - extra // 2, 0)
        end_z = min(start_z + min_crop_z, label_image.shape[0])
        start_z = max(0, end_z - min_crop_z)  # Adjust if crop exceeds bounds

    # Calculate the center of the label in the x and y dimensions
    center_x = coords[:, 1].mean().astype(int)
    center_y = coords[:, 2].mean().astype(int)

    # Calculate crop boundaries for the x and y dimensions
    half_crop_xy = crop_size_xy // 2

    # X-dimension
    start_x = max(center_x - half_crop_xy, 0)
    end_x = min(start_x + crop_size_xy, label_image.shape[1])
    start_x = max(0, end_x - crop_size_xy)  # Adjust if crop exceeds bounds

    # Y-dimension
    start_y = max(center_y - half_crop_xy, 0)
    end_y = min(start_y + crop_size_xy, label_image.shape[2])
    start_y = max(0, end_y - crop_size_xy)  # Adjust if crop exceeds bounds

    return start_z, start_x, start_y, end_z, end_x, end_y
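

The x and y boundaries in the function above follow the same three-step pattern: center the window, clamp it to the image, then shift it back if it ran past the upper edge. The sketch below isolates that pattern for a single axis so the edge cases are easier to follow. The helper name `clamp_window` is hypothetical and is not used elsewhere in this notebook.

```python
def clamp_window(center: int, size: int, length: int) -> tuple:
    """Return (start, end) of a window of `size` around `center`, kept inside [0, length]."""
    start = max(center - size // 2, 0)
    end = min(start + size, length)
    start = max(0, end - size)  # shift back if the window ran past the upper edge
    return start, end


print(clamp_window(center=100, size=128, length=512))  # (36, 164): fits without clamping
print(clamp_window(center=10, size=128, length=512))   # (0, 128): pushed away from the lower edge
print(clamp_window(center=500, size=128, length=512))  # (384, 512): pushed away from the upper edge
print(clamp_window(center=40, size=128, length=100))   # (0, 100): axis shorter than the window
```

As long as the axis is at least `size` pixels long, the window keeps its full size near the edges and the label simply ends up off-center. Only an axis shorter than `size` produces a smaller crop.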


## 0.3.3. Code to Normalize and Crop Training Data

### User-defined parameters:

Directories for images and masks, both the original and watershed labels.

The path to the images and masks should be specified in the variables `img_directory`, `og_mask_directory`, and `watershed_label_directory`. These directories should contain the respective image and mask files that you want to process.

The original masks in `og_mask_directory` serve as the reference for the cropping coordinates. This reduces the number of near-identical crops: a cluster of touching watershed labels, for example, is cropped once rather than once per label. The watershed labels in `watershed_label_directory` are used to generate the cropped labels.
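
To make the effect of the reference mask concrete, here is a toy sketch with synthetic arrays. It assumes that the original mask labels a cluster as one object while the watershed splits it into two touching labels; the array shapes and values are made up for illustration.

```python
import numpy as np

# Hypothetical toy volume (z, x, y): one connected object in the original mask ...
og_mask = np.zeros((4, 6, 10), dtype=np.uint16)
og_mask[1:3, 1:5, 1:9] = 1

# ... that a watershed has split into two touching labels
watershed = np.zeros_like(og_mask)
watershed[1:3, 1:5, 1:5] = 1
watershed[1:3, 1:5, 5:9] = 2

# One crop is generated per non-background label in the reference mask
crops_from_og_mask = np.unique(og_mask[og_mask > 0]).size
crops_from_watershed = np.unique(watershed[watershed > 0]).size
print(crops_from_og_mask, crops_from_watershed)  # 1 2
```

With the original mask as reference, the cluster is cropped once and the crop contains both watershed labels. Without it, each of the two labels would produce its own, largely overlapping crop.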

In [None]:
# Specify the paths to the images and watershed labels in `img_directory` and `watershed_label_directory`.
img_directory = "directory/to/images"

watershed_label_directory = "directory/to/watershed/labels"

# OPTIONAL: If you have an original mask directory, specify it here.
# This reduces the number of similar crops that would result from cropping each label in a cluster individually.
og_mask_directory = "directory/to/original/masks"  # Set to None if not used

# Provide the directories to store the cropped images and labels
# They will be created if they do not exist
cropped_img_directory = "directory/to/cropped/images"
cropped_lbl_directory = "directory/to/cropped/labels"

# Size of the crop in the x and y dimensions
# Should be at least twice the patch size used for training the model
# This is to ensure that an edge crop still has a sufficient size for training
# For example, if the patch size is 128, a crop size of 256 is recommended.
crop_size_xy = 256

# Minimum size of the crop in the z-dimension
min_crop_z = 30

# Extra slices to include above and below the label in the z-dimension
# If the padded extent is still smaller than min_crop_z, the crop is extended to min_crop_z
padding_z = 2
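

To see how `padding_z` and `min_crop_z` interact, the sketch below replays the z-range logic of `get_crop_coordinates_3d` for a few label extents. The helper name `z_window` and the example numbers are illustrative assumptions. Only the arithmetic mirrors the function defined earlier.

```python
def z_window(z_min, z_max, depth, padding_z=2, min_crop_z=30):
    """Return (start_z, end_z) for a label spanning slices z_min..z_max in a stack of `depth` slices."""
    start = max(z_min - padding_z, 0)
    end = min(z_max + padding_z + 1, depth)  # +1 to include the upper bound
    if end - start < min_crop_z:
        extra = min_crop_z - (end - start)
        start = max(start - extra // 2, 0)
        end = min(start + min_crop_z, depth)
        start = max(0, end - min_crop_z)  # shift back if the crop exceeds the stack
    return start, end


print(z_window(20, 25, depth=60))  # (8, 38): thin label, crop grown to min_crop_z
print(z_window(5, 45, depth=60))   # (3, 48): tall label, only padding is applied
print(z_window(2, 6, depth=20))    # (0, 20): stack shallower than min_crop_z, whole stack is used
```

In the first case the padded extent is only 10 slices, so the crop is extended to 30 slices. In the second, the padded extent already exceeds `min_crop_z`, so the padding alone determines the crop.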

### Code to run:

Important: Make sure to run the code in the cells below in order, as they depend on each other.

- Corresponding images, masks, and labels must have the same file names.
- The normalization is done on the whole image stack, not on individual slices. To reproduce it, use the `normalize` function from `csbdeep.utils` with `axis` set to `(0, 1, 2)` for 3D images or `(0, 1)` for 2D images.
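
The difference between whole-stack and per-slice normalization can be illustrated with plain NumPy. This is a sketch that mimics percentile normalization with `np.percentile` so it runs without `csbdeep`; the helper `normalize_over` and the synthetic stack are assumptions, not part of the pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic (z, x, y) stack whose slices get brighter with depth
brightness = np.arange(1, 5, dtype=np.float32)[:, None, None]
stack = rng.random((4, 16, 16)).astype(np.float32) * brightness


def normalize_over(image, axis, p_low=1.0, p_high=99.8, eps=1e-20):
    """Percentile-normalize `image`, computing the percentiles over `axis`."""
    low = np.percentile(image, p_low, axis=axis, keepdims=True)
    high = np.percentile(image, p_high, axis=axis, keepdims=True)
    return ((image - low) / (high - low + eps)).astype(np.float32)


whole_stack = normalize_over(stack, axis=(0, 1, 2))  # one percentile pair for the volume
per_slice = normalize_over(stack, axis=(1, 2))       # one percentile pair per z-slice

# Whole-stack normalization keeps the brightness differences between slices;
# per-slice normalization removes them
print(whole_stack.max(axis=(1, 2)))
print(per_slice.max(axis=(1, 2)))
```

With `axis=(0, 1, 2)`, the relative intensities between slices are preserved, which is the behavior used in the cell below.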

In [None]:
# Get the list of files in the specified image directory
img_dir_list = sorted(os.listdir(img_directory))

# Iterate through the file list and process each image
for file in img_dir_list:
    # only process files with .tif or .tiff extensions
    if file.endswith((".tif", ".tiff")):
        img = imread(os.path.join(img_directory, file))
        lbl = imread(os.path.join(watershed_label_directory, file))

        # Handle different cases for og_mask_directory
        try:
            # Check if variable exists and is a valid directory path
            if og_mask_directory and os.path.isdir(og_mask_directory):
                mask = imread(os.path.join(og_mask_directory, file))
            else:
                mask = np.copy(lbl)
        except (NameError, TypeError):
            # Variable doesn't exist or is None
            mask = np.copy(lbl)

        # Normalize the whole stack using the 1st and 99.8th percentiles
        img = normalize(img, 1, 99.8, axis=(0, 1, 2))

        # Ensure the directories for cropped images and labels exist
        os.makedirs(cropped_img_directory, exist_ok=True)
        os.makedirs(cropped_lbl_directory, exist_ok=True)

        # Check if the mask has any labels
        if mask.max() > 0:
            # Iterate through each label present in the mask, skipping the background (label 0)
            # np.unique avoids a ValueError when the label values are not contiguous
            for idx in np.unique(mask[mask > 0]):
                # Get the 3D crop coordinates for the current label
                crop_coords = get_crop_coordinates_3d(
                    mask,
                    label_value=idx,
                    crop_size_xy=crop_size_xy,
                    min_crop_z=min_crop_z,
                    padding_z=padding_z,
                )

                # Crop the image and label using the calculated (z, x, y) coordinates
                img_crop = img[
                    crop_coords[0] : crop_coords[3],
                    crop_coords[1] : crop_coords[4],
                    crop_coords[2] : crop_coords[5],
                ]
                lbl_crop = lbl[
                    crop_coords[0] : crop_coords[3],
                    crop_coords[1] : crop_coords[4],
                    crop_coords[2] : crop_coords[5],
                ]

                # Save the cropped image and label
                imwrite(
                    os.path.join(
                        cropped_img_directory,
                        file.split(".tif")[0] + "_" + str(idx) + ".tif",
                    ),
                    img_crop,
                )
                imwrite(
                    os.path.join(
                        cropped_lbl_directory,
                        file.split(".tif")[0] + "_" + str(idx) + ".tif",
                    ),
                    lbl_crop,
                )

            print(f"DONE: {file}")

print("All images processed and saved in the specified directories.")


DONE: 08082024_rLabel_014.2_TAMRA_sense_P00002_C4scaled_oriScale.tif
DONE: 08082024_rLabel_014.2_TAMRA_sense_P00006_C4scaled_oriScale.tif
DONE: 08082024_rLabel_014.2_TAMRA_sense_P00009_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00013_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00017_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00020_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00023_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00027_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00031_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00033_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00045_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00049_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00051_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00052_C4scaled_oriScale.tif
DONE: 21082024_rLabel_024_TAMRA_sense_P00055_C4scaled_or