<a href="https://colab.research.google.com/github/aandrijana/Image-Colorization-Project/blob/main/final_data_preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Preparation

For training, every luminance (L) image needs a matching chrominance (AB) pair. Our dataset has 25,000 grayscale L images and three sets of AB pairs (ab1, ab2, ab3): two with 10,000 pairs each and one with 5,000.

Due to Google Colab's memory constraints, we're initially using a subset: the first 10,000 grayscale images paired with ab1. This helps prevent crashes but may result in lower model accuracy than training on the full dataset.
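If memory is the limiting factor, one option is to memory-map the `.npy` file so that only the requested slice is copied into RAM. The sketch below is an illustration, not what this notebook does: it uses a small synthetic file (`demo_gray.npy` is a hypothetical name) in place of `gray_scale.npy`, and relies on the `mmap_mode` argument of `np.load`.

```python
import os
import tempfile

import numpy as np

# Stand-in for gray_scale.npy: a small synthetic array saved to a temp file.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo_gray.npy")  # hypothetical file name
np.save(demo_path, np.zeros((50, 8, 8), dtype=np.uint8))

# mmap_mode="r" maps the file read-only instead of loading it all into RAM.
l_mapped = np.load(demo_path, mmap_mode="r")

# Only the sliced rows are copied into memory here.
l_subset = np.array(l_mapped[:10])
print(l_mapped.shape, l_subset.shape)
```

The same pattern would apply to the AB files, assuming they are plain `.npy` arrays as the loading code suggests.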

In [None]:
import os
import tensorflow as tf
from tensorflow.keras.layers import *
from tensorflow.keras.models import Model
import matplotlib.pyplot as plt
import numpy as np
import cv2
import time

In [None]:
import numpy as np
l_channel = np.load("image_colorization_data/l/gray_scale.npy")[:10000]
ab = np.load("image_colorization_data/ab/ab/ab1.npy")
print("Gray image shape:", l_channel.shape)
print("AB image shape:", ab.shape)

Gray image shape: (10000, 224, 224)
AB image shape: (10000, 224, 224, 2)


### Resize to 128x128

In [None]:
import numpy as np
import cv2

def resize_l_ab(l_array, ab_array, target_shape=(128, 128)):
    """Resize paired L and AB arrays; target_shape is (width, height), as cv2.resize expects."""
    resized_l = []
    resized_ab = []

    for l_img, ab_img in zip(l_array, ab_array):
        # Resizing L channel
        l_resized = cv2.resize(l_img, target_shape, interpolation=cv2.INTER_AREA)

        # Resizing A and B channels separately
        a_resized = cv2.resize(ab_img[:, :, 0], target_shape, interpolation=cv2.INTER_AREA)
        b_resized = cv2.resize(ab_img[:, :, 1], target_shape, interpolation=cv2.INTER_AREA)
        ab_resized = np.stack((a_resized, b_resized), axis=-1)

        resized_l.append(l_resized)
        resized_ab.append(ab_resized)

    return np.array(resized_l), np.array(resized_ab)


In [None]:
l_channel, ab = resize_l_ab(l_channel, ab)
print("Gray image shape:", l_channel.shape)
print("AB image shape:", ab.shape)

Gray image shape: (10000, 128, 128)
AB image shape: (10000, 128, 128, 2)


We resized the input images from 224×224 to 128×128 pixels, a size we adopted following the [pix2pix paper](https://arxiv.org/pdf/1611.07004v3). Each resized image holds roughly a third as many pixels, which lowers the memory footprint, helps prevent out-of-memory errors, and makes training more efficient on systems with limited RAM.
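The size of the saving can be estimated with simple arithmetic. The sketch below assumes one byte per value (the printed 0 to 255 ranges suggest `uint8` storage, but that is an assumption) and three channels per image (one L plus two AB); `footprint_bytes` is a hypothetical helper introduced only for this illustration.

```python
def footprint_bytes(n_images, side, channels=3, bytes_per_value=1):
    """Bytes needed for n_images square images with the given channel count."""
    return n_images * side * side * channels * bytes_per_value


before = footprint_bytes(10_000, 224)
after = footprint_bytes(10_000, 128)

print(f"224x224: {before / 1e9:.2f} GB")
print(f"128x128: {after / 1e9:.2f} GB")
print(f"reduction factor: {before / after:.2f}x")
```

Under these assumptions the subset shrinks from about 1.5 GB to about 0.5 GB. Casting to `float32` later multiplies both figures by four, so the relative saving stays the same.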

### Filter Outliers

In [None]:
# Remove over/under-exposed images (L channel)
mean_brightness = np.mean(l_channel, axis=(1, 2))
# Keep images whose mean brightness lies in [50, 170]
brightness_mask = (mean_brightness >= 50) & (mean_brightness <= 170)

# Remove low-colorfulness images (AB channels)
colorfulness = np.std(ab, axis=(1, 2, 3))
# Keep images whose AB standard deviation exceeds 10 (drops bland/grayscale images)
colorfulness_mask = colorfulness > 10

# Combine both masks so the second filter does not overwrite the first
valid_indices = np.where(brightness_mask & colorfulness_mask)[0]
l_filtered = l_channel[valid_indices]
ab_filtered = ab[valid_indices]

Using the **colorfulness and brightness distribution plots** from the data exploration phase, we performed **outlier filtering**. We remove images whose mean brightness falls outside [50, 170] (over- or under-exposed) and images whose AB standard deviation is 10 or lower (nearly colorless). Such images can hurt training, so removing them gives a more representative training dataset.
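A quick way to see how much each criterion removes is to build one boolean mask per criterion and count the survivors. The following is a minimal sketch on synthetic data: the arrays `l_demo` and `ab_demo`, their shapes, and their values are made up for illustration, while the thresholds are the ones used above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the L and AB arrays (shapes and values are made up).
l_demo = rng.integers(0, 256, size=(200, 16, 16)).astype(np.uint8)
ab_demo = rng.integers(20, 227, size=(200, 16, 16, 2)).astype(np.uint8)
l_demo[:20] = 10       # 20 very dark images
ab_demo[20:50] = 128   # 30 images with no color variation

mean_brightness = l_demo.mean(axis=(1, 2))
colorfulness = ab_demo.std(axis=(1, 2, 3))

brightness_ok = (mean_brightness >= 50) & (mean_brightness <= 170)
colorful_ok = colorfulness > 10
keep = brightness_ok & colorful_ok  # an image must pass both criteria

print("pass brightness:  ", int(brightness_ok.sum()))
print("pass colorfulness:", int(colorful_ok.sum()))
print("pass both:        ", int(keep.sum()))
```

Keeping the two masks separate makes it easy to tune one threshold without losing track of the other.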

### Splitting the data into training, validation and test sets

In [None]:
# L has shape (N, H, W); add a trailing channel axis so it becomes (N, H, W, 1)
if l_filtered.ndim == 3:
    l_filtered = l_filtered[..., np.newaxis]

In [None]:
from sklearn.model_selection import train_test_split

# There are no class labels to stratify on, so we use a plain random split with a fixed seed (42).
# First hold out 10% for testing, then 10% of the remainder for validation.
l_train, l_test, ab_train, ab_test = train_test_split(l_filtered, ab_filtered, test_size=0.1, random_state=42)
l_train, l_val, ab_train, ab_val = train_test_split(l_train, ab_train, test_size=0.1, random_state=42)
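Because the validation set is carved out of what remains after the test split, the effective proportions are about 81% training, 9% validation, and 10% test. The sketch below works through the arithmetic; `split_sizes` is a hypothetical helper, the sample count of 10,000 is only illustrative (the real count after filtering is smaller), and rounding the held-out size up is an assumption about how the split is computed.

```python
import math


def split_sizes(n_samples, test_fraction=0.1, val_fraction=0.1):
    """Sizes after holding out a test set, then a validation set from the rest."""
    n_test = math.ceil(n_samples * test_fraction)   # assumed: held-out size rounds up
    n_remaining = n_samples - n_test
    n_val = math.ceil(n_remaining * val_fraction)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(10_000)
print(n_train, n_val, n_test)
```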

### Normalization to [-1,1]

To prepare the LAB image data for our pix2pix model, we perform a crucial normalization step. The pix2pix architecture, like many deep learning models, generally performs best when input data is scaled to a specific range, typically [-1, 1].

In [None]:
print("L range:", np.min(l_train), "to", np.max(l_train))
print("A range:", np.min(ab_train[:,:,:,0]), "to", np.max(ab_train[:,:,:,0]))
print("B range:", np.min(ab_train[:,:,:,1]), "to", np.max(ab_train[:,:,:,1]))

L range: 0 to 255
A range: 42 to 226
B range: 20 to 223


In [None]:
L_IN_MIN, L_IN_MAX = 0.0, 255.0
A_IN_MIN, A_IN_MAX = 42.0, 226.0
B_IN_MIN, B_IN_MAX = 20.0, 223.0

In [None]:
def normalize_data(l_channel, ab_channels):
    """
    Casts data to float32 and normalizes from the CUSTOM source ranges to [-1, 1].
    """
    # Cast to float32 first: neural networks compute with floating-point numbers,
    # so this step is essential.
    l_channel = tf.cast(l_channel, tf.float32)
    ab_channels = tf.cast(ab_channels, tf.float32)

    # Separating A and B channels from the (h, w, 2) tensor
    # We use slicing to keep the final dimension, which makes concatenation easy
    a_channel = ab_channels[..., 0:1]
    b_channel = ab_channels[..., 1:2]

    # Generic formula for mapping [min, max] to [-1, 1] is: 2 * (x - min) / (max - min) - 1
    l_norm = 2 * (l_channel - L_IN_MIN) / (L_IN_MAX - L_IN_MIN) - 1
    a_norm = 2 * (a_channel - A_IN_MIN) / (A_IN_MAX - A_IN_MIN) - 1
    b_norm = 2 * (b_channel - B_IN_MIN) / (B_IN_MAX - B_IN_MIN) - 1

    # Re-combining the normalized A and B channels
    ab_norm = tf.concat([a_norm, b_norm], axis=-1)

    return l_norm, ab_norm

l_train, ab_train = normalize_data(l_train, ab_train)
l_val, ab_val = normalize_data(l_val, ab_val)
l_test, ab_test = normalize_data(l_test, ab_test)

The normalization formula used is <br> $$
2 \cdot \frac{x - \min}{\max - \min} - 1
$$ <br>
which maps each channel's input range to [-1, 1]. The same transformation, with the same constants, is applied to the training, validation, and test sets. Because the A and B ranges were measured on the training split only, a validation or test value outside those ranges would land slightly outside [-1, 1].
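To turn model outputs back into displayable LAB values, the mapping has to be inverted. The sketch below shows the forward formula and its inverse in NumPy for the A channel, using the range constants defined above; `to_unit_range` and `from_unit_range` are hypothetical helper names, not functions from this notebook.

```python
import numpy as np


def to_unit_range(x, lo, hi):
    """Map values from [lo, hi] to [-1, 1]."""
    return 2.0 * (x - lo) / (hi - lo) - 1.0


def from_unit_range(y, lo, hi):
    """Inverse map from [-1, 1] back to [lo, hi]."""
    return (y + 1.0) / 2.0 * (hi - lo) + lo


A_IN_MIN, A_IN_MAX = 42.0, 226.0
a_values = np.array([42.0, 134.0, 226.0])  # minimum, midpoint, maximum

a_norm = to_unit_range(a_values, A_IN_MIN, A_IN_MAX)
a_back = from_unit_range(a_norm, A_IN_MIN, A_IN_MAX)

print(a_norm)  # the endpoints map to -1 and 1, the midpoint to 0
print(a_back)  # round trip recovers the original values
```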

### Data Augmentation

To improve generalization and prevent overfitting, we implement data augmentation via random horizontal flipping. There's a 50% chance of applying a horizontal flip to both the luminance (L) channel and its corresponding chrominance (AB) channels simultaneously. This ensures the spatial relationship between the grayscale input and its color information remains consistent, effectively expanding our training data.

In [None]:
def augment(l_channel, ab_channels):
    """Applies identical random horizontal flip to both L and AB channels."""
    if tf.random.uniform(()) > 0.5:
        l_channel = tf.image.flip_left_right(l_channel)
        ab_channels = tf.image.flip_left_right(ab_channels)
    return l_channel, ab_channels
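The `augment` function above makes one random draw per call, so when it receives a whole batch, every image in that batch shares the same flip decision. If an independent decision per image is wanted, the idea can be sketched in NumPy as follows. This is an illustration under assumed `(N, H, W, C)` shapes, and `paired_random_flip` is a hypothetical helper rather than part of the TensorFlow pipeline.

```python
import numpy as np


def paired_random_flip(l_batch, ab_batch, rng, p=0.5):
    """Flip each (L, AB) pair left-right with probability p, one draw per pair."""
    flip = rng.random(len(l_batch)) < p
    l_out = l_batch.copy()
    ab_out = ab_batch.copy()
    # Axis 2 is the width axis for (N, H, W, C) arrays.
    l_out[flip] = l_batch[flip][:, :, ::-1, :]
    ab_out[flip] = ab_batch[flip][:, :, ::-1, :]
    return l_out, ab_out, flip


rng = np.random.default_rng(0)
l_demo = rng.random((8, 4, 4, 1))
ab_demo = rng.random((8, 4, 4, 2))

l_aug, ab_aug, flipped = paired_random_flip(l_demo, ab_demo, rng)
print("flipped images:", np.flatnonzero(flipped))
```

The same boolean mask indexes both arrays, which is what keeps each L image aligned with its AB channels.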

## TensorFlow Dataset

In [None]:
DRIVE_BASE_PATH = "/content/drive/MyDrive/ImageColorization"
MODELS_DIR = os.path.join(DRIVE_BASE_PATH, "saved_models")
PROGRESS_DIR = os.path.join(DRIVE_BASE_PATH, "training_progress")

os.makedirs(MODELS_DIR, exist_ok=True)
os.makedirs(PROGRESS_DIR, exist_ok=True)

def create_tf_dataset(l_data, ab_data, batch_size=32, shuffle=True,augment_data=False):
    dataset = tf.data.Dataset.from_tensor_slices((l_data, ab_data))
    if shuffle:
        dataset = dataset.shuffle(buffer_size=len(l_data))
    dataset = dataset.batch(batch_size)
    # Apply augmentation AFTER batching and ONLY if specified
    if augment_data:
        dataset = dataset.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    # Prefetch for performance
    dataset = dataset.prefetch(tf.data.AUTOTUNE)
    return dataset

BATCH_SIZE = 32
train_dataset = create_tf_dataset(l_train, ab_train, BATCH_SIZE)
val_dataset = create_tf_dataset(l_val, ab_val, BATCH_SIZE, shuffle=False)
test_dataset = create_tf_dataset(l_test, ab_test, BATCH_SIZE, shuffle=False)

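* Shuffling: For the training set, we shuffle the data before batching so that batches are composed differently each epoch, which helps generalization. Shuffling is omitted for the validation and test sets.
* Batching: Data is grouped into batches of 32, which keeps the GPU busy and gives smoother gradient updates than single-sample steps.
* Augmentation: When `augment_data=True`, the random horizontal flip is mapped over the dataset after batching, so one random draw decides the flip for a whole batch. The flag defaults to `False` and the calls above do not set it, so `train_dataset` is built without augmentation.
* Prefetching: The next batch is prepared in the background while the current batch is being processed, which reduces idle time between training steps.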
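The shuffle-then-batch order described above can be illustrated without TensorFlow. The sketch below is a plain NumPy stand-in, not the `tf.data` implementation: `shuffled_batches` is a hypothetical helper and the sample count of 100 is arbitrary. It also shows that the final batch can be smaller than the rest, which is how `Dataset.batch` behaves unless `drop_remainder=True` is passed.

```python
import numpy as np


def shuffled_batches(n_samples, batch_size, rng):
    """Yield index batches: shuffle once, then cut into consecutive chunks."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]


rng = np.random.default_rng(0)
batches = list(shuffled_batches(n_samples=100, batch_size=32, rng=rng))
batch_lengths = [len(b) for b in batches]

print(batch_lengths)  # [32, 32, 32, 4]
```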