# Creating a Validation Split for the Colorization Dataset  
## Scripted Train/Validation Split with Backup and Verification

This notebook creates a **train/validation split** for the colorization dataset used in the main experiments.

Given an original training set with:

- Color images in `train_color/`
- Corresponding grayscale images in `train_black/`

the script will:

1. **Load and pair** all training images.
2. **Split** them into a new training set and a validation set according to a fixed fraction (for example, $5\%$ for validation).
3. **Back up** the original training folders so that no data is lost.
4. **Create new directories** for:
   - Updated training data (`train_color/`, `train_black/`)
   - Validation data (`val_color/`, `val_black/`)
5. **Copy files** into their new locations based on the split.
6. **Verify** that:
   - The split sizes match expectations.
   - The test folders are still present and their file counts are reported.
   - Original training data is safely backed up.

This notebook is part of the **data preparation** stage of the project and ensures that all subsequent training and evaluation notebooks use a **consistent and reproducible train/validation partition**.

In [1]:
# Import necessary libraries
import glob
import shutil
from pathlib import Path

from sklearn.model_selection import train_test_split

print("Libraries imported successfully!")

Libraries imported successfully!


## Configuration for Validation Split

This section defines the **main parameters** that control how the validation split is created.

Key configuration items:

- `DATA_DIR`  
  Root directory of the dataset. It is assumed to contain:
  - `train_color/` and `train_black/` with the current training data.
  - `test_color/` and `test_black/` with the test data (left unchanged).

- `VALIDATION_SPLIT`  
  Fraction of the original training set that will be moved into the **validation** set.  
  For example, `0.05` corresponds to a 5 percent validation split.

- `RANDOM_SEED`  
  Seed used by the splitting function to ensure that the **same subset** of images is selected for validation every time this notebook is run.

Folder definitions:

- **Original folders** (to be renamed as backup):
  - `ORIGINAL_TRAIN_COLOR` (existing `train_color/`)
  - `ORIGINAL_TRAIN_BLACK` (existing `train_black/`)

- **Backup folders** (where the original data will be preserved):
  - `BACKUP_TRAIN_COLOR` (`train_color_original/`)
  - `BACKUP_TRAIN_BLACK` (`train_black_original/`)

- **New folders** (that will hold the new split):
  - `NEW_TRAIN_COLOR` (`train_color/` after split)
  - `NEW_TRAIN_BLACK` (`train_black/` after split)
  - `VAL_COLOR` (`val_color/`)
  - `VAL_BLACK` (`val_black/`)

Running this configuration cell prints a short summary of the active settings and paths, documenting the **data preparation setup** used to create the validation split.


In [2]:
# Configuration
DATA_DIR = Path("../data/colorize_dataset/data")
VALIDATION_SPLIT = 0.05  # 5% for validation
RANDOM_SEED = 42

# Original folders (will be renamed to backup)
ORIGINAL_TRAIN_COLOR = DATA_DIR / "train_color"
ORIGINAL_TRAIN_BLACK = DATA_DIR / "train_black"

# Backup folders
BACKUP_TRAIN_COLOR = DATA_DIR / "train_color_original"
BACKUP_TRAIN_BLACK = DATA_DIR / "train_black_original"

# New folders
NEW_TRAIN_COLOR = DATA_DIR / "train_color"
NEW_TRAIN_BLACK = DATA_DIR / "train_black"
VAL_COLOR = DATA_DIR / "val_color"
VAL_BLACK = DATA_DIR / "val_black"

print(f"Data directory: {DATA_DIR}")
print(f"Validation split: {VALIDATION_SPLIT * 100}%")
print(f"Random seed: {RANDOM_SEED}")

Data directory: ..\data\colorize_dataset\data
Validation split: 5.0%
Random seed: 42
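
Before any folders are renamed, it can help to confirm that `DATA_DIR` has the layout described above. The following is a minimal sketch of such a check. `missing_subdirs` and `EXPECTED_SUBDIRS` are hypothetical names that are not part of this notebook, and the demo runs on a throwaway temporary directory rather than the real dataset.

```python
import tempfile
from pathlib import Path

# Folder names the notebook expects under DATA_DIR before the split.
EXPECTED_SUBDIRS = ["train_color", "train_black", "test_color", "test_black"]


def missing_subdirs(data_dir, names=EXPECTED_SUBDIRS):
    """Return the expected subfolder names that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_dir()]


# Demo on a temporary directory that deliberately lacks test_black/.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["train_color", "train_black", "test_color"]:
        (Path(tmp) / name).mkdir()
    demo_missing = missing_subdirs(tmp)

print(demo_missing)
```

In the notebook itself, the same helper could be called as `missing_subdirs(DATA_DIR)`, stopping early if the returned list is not empty.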


## Scan and Pair Original Training Images

This section **collects and pairs** all original training images before performing the split.

### File Discovery

From the dataset root (`DATA_DIR`), the script:

- Scans `train_color/` for all `*.jpg` color images.
- Scans `train_black/` for the corresponding `*.jpg` grayscale images.

Files with other extensions are ignored. The result is two sorted lists of file paths (as strings):

- `color_images` – paths to color images.
- `black_images` – paths to grayscale images.

### Pairing and Consistency Check

To ensure a valid one-to-one mapping:

- Both file lists are sorted, so images with the same filename end up at the same position.
- The script asserts that:
  - The number of color and black images is identical.
  - The filenames in the two lists match position by position, so each color image has a grayscale image with the same name.

If there is any mismatch (different counts or missing pairs), an assertion fails and the notebook stops before the dataset is modified.

At the end of this section, you can interpret the printed counts as a **summary of the available training data** prior to creating the validation split.


In [3]:
# Get all image files from original training folders
color_images = sorted(glob.glob(str(ORIGINAL_TRAIN_COLOR / "*.jpg")))
black_images = sorted(glob.glob(str(ORIGINAL_TRAIN_BLACK / "*.jpg")))

print(f"Found {len(color_images)} color images")
print(f"Found {len(black_images)} black images")

# Verify they match
assert len(color_images) == len(black_images), "Mismatch between color and black images!"

# Extract filenames to ensure pairing
color_names = [Path(p).name for p in color_images]
black_names = [Path(p).name for p in black_images]
assert color_names == black_names, "Image names don't match between color and black folders!"

print("✓ All images are properly paired")

Found 5000 color images
Found 5000 black images
✓ All images are properly paired
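
When the pairing assertion fails, it reports only that a mismatch exists, not which files are affected. The sketch below shows one way to list the unpaired filenames with set differences. `find_unpaired` is a hypothetical helper, and the demo uses empty placeholder files in a temporary directory.

```python
import tempfile
from pathlib import Path


def find_unpaired(color_dir, black_dir, pattern="*.jpg"):
    """Return filenames that are present in only one of the two folders."""
    color_names = {p.name for p in Path(color_dir).glob(pattern)}
    black_names = {p.name for p in Path(black_dir).glob(pattern)}
    only_color = sorted(color_names - black_names)
    only_black = sorted(black_names - color_names)
    return only_color, only_black


# Demo: c.jpg has no grayscale partner, d.jpg has no color partner.
with tempfile.TemporaryDirectory() as tmp:
    color_dir = Path(tmp) / "train_color"
    black_dir = Path(tmp) / "train_black"
    color_dir.mkdir()
    black_dir.mkdir()
    for name in ["a.jpg", "b.jpg", "c.jpg"]:
        (color_dir / name).touch()
    for name in ["a.jpg", "b.jpg", "d.jpg"]:
        (black_dir / name).touch()
    only_color, only_black = find_unpaired(color_dir, black_dir)

print("Only in color:", only_color)
print("Only in black:", only_black)
```

Two empty lists mean every image is paired, which is the condition the assertions in the cell above enforce.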


## Create Train/Validation Split

This section performs the actual **train/validation split** of the original training set.

### Splitting Strategy

Using the lists of paired training images:

- The sample indices (`range(len(color_images))`) are passed to scikit-learn's `train_test_split`, which shuffles them before splitting.
- `random_state=RANDOM_SEED` fixes the shuffle, so the same indices are selected on every run as long as the sorted file list is unchanged.
- A fraction `VALIDATION_SPLIT` of the indices (for example, 0.05 for 5 percent) is assigned to the **validation set**.
- The remaining indices are assigned to the **new training set**.

Conceptually, if there are $N$ training samples and the validation fraction is $p$:

- Number of validation samples (scikit-learn rounds a fractional `test_size` up):  
  $$N_{val} = \lceil p \cdot N \rceil$$
- Number of training samples after the split:  
  $$N_{train} = N - N_{val}$$
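
As a worked check of these sizes, the sketch below computes both counts for a given $N$ and $p$. It assumes that `train_test_split` rounds a fractional validation size up to the next whole sample. `split_sizes` is a hypothetical helper that does not call scikit-learn.

```python
import math


def split_sizes(n_samples, val_fraction):
    """Return (n_train, n_val), rounding the validation count up."""
    n_val = math.ceil(val_fraction * n_samples)
    return n_samples - n_val, n_val


sizes_5000 = split_sizes(5000, 0.05)  # the dataset size used in this notebook
sizes_5010 = split_sizes(5010, 0.05)  # 250.5 validation samples rounds up to 251
print(sizes_5000)
print(sizes_5010)
```

For 5000 images the rounding direction makes no difference, since 5 percent is exactly 250. It only matters when $p \cdot N$ is not a whole number.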

### Outputs of This Step

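The script constructs four lists of file paths, index-aligned so that position `i` in a color list and in the matching grayscale list refers to the same image:

- `train_color_files`, `train_black_files` – color and grayscale paths assigned to the new training set.
- `val_color_files`, `val_black_files` – color and grayscale paths assigned to the validation set.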

These lists define **which files will be copied** into the new `train_*` and `val_*` directories in the next step.


In [4]:
# Split the data
train_indices, val_indices = train_test_split(
    range(len(color_images)), test_size=VALIDATION_SPLIT, random_state=RANDOM_SEED
)

print(f"Training set: {len(train_indices)} images ({(1 - VALIDATION_SPLIT) * 100:.1f}%)")
print(f"Validation set: {len(val_indices)} images ({VALIDATION_SPLIT * 100:.1f}%)")

train_color_files = [color_images[i] for i in train_indices]
train_black_files = [black_images[i] for i in train_indices]
val_color_files = [color_images[i] for i in val_indices]
val_black_files = [black_images[i] for i in val_indices]

Training set: 4750 images (95.0%)
Validation set: 250 images (5.0%)


## Backup Original Data and Create New Split Directories

This section performs the **filesystem changes** required to implement the new train/validation split in a safe and reversible way.

### 1. Backup Original Training Data

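To avoid any data loss, the script first **backs up** the original training folders by renaming them with `shutil.move`:

- `train_color/` is renamed to `train_color_original/`.
- `train_black/` is renamed to `train_black_original/`.

If a backup folder already exists, the rename is skipped so that an existing backup is never overwritten.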

After this step:

- The original training data is preserved and can be restored if needed.
- The names `train_color/` and `train_black/` become available for the new split.
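
Restoring the original layout amounts to removing the split folder and renaming the backup to its old name. The sketch below shows one way to do this. `restore_backup` is a hypothetical helper that is not part of this notebook, and it deletes the current contents of the target folder, so it should be used with care. The demo runs on a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def restore_backup(backup_dir, target_dir):
    """Replace target_dir with backup_dir (destructive for target_dir)."""
    backup_dir, target_dir = Path(backup_dir), Path(target_dir)
    if not backup_dir.is_dir():
        raise FileNotFoundError(f"No backup found at {backup_dir}")
    if target_dir.exists():
        shutil.rmtree(target_dir)  # discards the current split folder
    shutil.move(str(backup_dir), str(target_dir))


# Demo: a backup with 3 files replaces a split folder with 2 files.
with tempfile.TemporaryDirectory() as tmp:
    backup = Path(tmp) / "train_color_original"
    target = Path(tmp) / "train_color"
    backup.mkdir()
    target.mkdir()
    for name in ["a.jpg", "b.jpg", "c.jpg"]:
        (backup / name).touch()
    for name in ["a.jpg", "b.jpg"]:
        (target / name).touch()
    restore_backup(backup, target)
    restored_names = sorted(p.name for p in target.iterdir())
    backup_still_exists = backup.exists()

print(restored_names, backup_still_exists)
```

A full restore would apply this to both `train_color/` and `train_black/`. The `val_color/` and `val_black/` folders would also have to be removed separately.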

### 2. Create New Train and Validation Directories

The script then creates a clean directory structure:

- New training directories:
  - `train_color/`
  - `train_black/`
- New validation directories:
  - `val_color/`
  - `val_black/`

These directories are initially empty and will be populated according to the split lists defined earlier:

- `train_color_files` and `train_black_files` for training
- `val_color_files` and `val_black_files` for validation

### 3. Copy Files According to the Split

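For each path in `train_color_files` and `train_black_files`:

- The color image is copied into `train_color/`.
- The grayscale image with the same filename is copied into `train_black/`.

For each path in `val_color_files` and `val_black_files`:

- The color image is copied into `val_color/`.
- The grayscale image with the same filename is copied into `val_black/`.

Because the original folders have already been renamed, each file is read from the corresponding `_original` backup folder. `shutil.copy2` is used so that file metadata such as timestamps is copied along with the contents.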

This ensures that:

- Training and validation sets remain **perfectly paired** (color and grayscale).
- The original dataset remains available under the `_original` backup directories.

After this step, the dataset directory is fully reorganized into a **backed-up original** and a **new, clearly separated train/validation structure**.


In [5]:
# Backup original folders (rename)
print("\nCreating backups...")

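# Skip if either backup exists, so an existing backup is never overwritten or nested into
if BACKUP_TRAIN_COLOR.exists() or BACKUP_TRAIN_BLACK.exists():
    print(f"Backup already exists: {BACKUP_TRAIN_COLOR.name} or {BACKUP_TRAIN_BLACK.name}")
    print("  Skipping backup creation. Restore the originals from the backup folders before re-running.")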
else:
    shutil.move(str(ORIGINAL_TRAIN_COLOR), str(BACKUP_TRAIN_COLOR))
    shutil.move(str(ORIGINAL_TRAIN_BLACK), str(BACKUP_TRAIN_BLACK))
    print(f"  ✓ Backed up train_color → {BACKUP_TRAIN_COLOR.name}")
    print(f"  ✓ Backed up train_black → {BACKUP_TRAIN_BLACK.name}")


Creating backups...
  ✓ Backed up train_color → train_color_original
  ✓ Backed up train_black → train_black_original
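
The backup step can also be written as an all-or-nothing operation that looks at every folder before renaming any of them. The sketch below is one possible version. `backup_pairs` is a hypothetical helper, and the demo uses empty folders in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def backup_pairs(pairs):
    """Rename each (source, backup) folder pair, or do nothing if any backup exists.

    Returns True if the folders were renamed, False if the step was skipped.
    """
    pairs = [(Path(src), Path(dst)) for src, dst in pairs]
    if any(dst.exists() for _, dst in pairs):
        return False  # never overwrite or nest into an existing backup
    missing = [src for src, _ in pairs if not src.is_dir()]
    if missing:
        raise FileNotFoundError(f"Missing source folders: {missing}")
    for src, dst in pairs:
        shutil.move(str(src), str(dst))
    return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "train_color").mkdir()
    (root / "train_black").mkdir()
    pairs = [
        (root / "train_color", root / "train_color_original"),
        (root / "train_black", root / "train_black_original"),
    ]
    first_run = backup_pairs(pairs)
    second_run = backup_pairs(pairs)  # backups now exist, so this call is skipped
    remaining = sorted(p.name for p in root.iterdir())

print(first_run, second_run, remaining)
```

Checking all destinations first matters because `shutil.move` places the source *inside* the destination when the destination is an existing directory, which would produce a nested folder instead of a clean backup.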


In [6]:
# Create new directories
print("\nCreating new directories...")
NEW_TRAIN_COLOR.mkdir(exist_ok=True)
NEW_TRAIN_BLACK.mkdir(exist_ok=True)
VAL_COLOR.mkdir(exist_ok=True)
VAL_BLACK.mkdir(exist_ok=True)

print(f"  ✓ Created {NEW_TRAIN_COLOR.name}")
print(f"  ✓ Created {NEW_TRAIN_BLACK.name}")
print(f"  ✓ Created {VAL_COLOR.name}")
print(f"  ✓ Created {VAL_BLACK.name}")


Creating new directories...
  ✓ Created train_color
  ✓ Created train_black
  ✓ Created val_color
  ✓ Created val_black


In [7]:
# Copy files to new training folders
print("\nCopying training files...")
for src in train_color_files:
    dst = NEW_TRAIN_COLOR / Path(src).name
    # Use the backup folder as source if original was already moved
    if BACKUP_TRAIN_COLOR.exists():
        src = BACKUP_TRAIN_COLOR / Path(src).name
    shutil.copy2(src, dst)

for src in train_black_files:
    dst = NEW_TRAIN_BLACK / Path(src).name
    if BACKUP_TRAIN_BLACK.exists():
        src = BACKUP_TRAIN_BLACK / Path(src).name
    shutil.copy2(src, dst)

print(f"  ✓ Copied {len(train_color_files)} images to train_color")
print(f"  ✓ Copied {len(train_black_files)} images to train_black")


Copying training files...
  ✓ Copied 4750 images to train_color
  ✓ Copied 4750 images to train_black


In [8]:
# Copy files to validation folders
print("\nCopying validation files...")
for src in val_color_files:
    dst = VAL_COLOR / Path(src).name
    if BACKUP_TRAIN_COLOR.exists():
        src = BACKUP_TRAIN_COLOR / Path(src).name
    shutil.copy2(src, dst)

for src in val_black_files:
    dst = VAL_BLACK / Path(src).name
    if BACKUP_TRAIN_BLACK.exists():
        src = BACKUP_TRAIN_BLACK / Path(src).name
    shutil.copy2(src, dst)

print(f"  ✓ Copied {len(val_color_files)} images to val_color")
print(f"  ✓ Copied {len(val_black_files)} images to val_black")


Copying validation files...
  ✓ Copied 250 images to val_color
  ✓ Copied 250 images to val_black
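
The four copy loops above share the same pattern: take the filename from a recorded path, look it up in a source folder, and copy it to a destination folder. The sketch below shows how that pattern could be factored into one function. `copy_by_name` is a hypothetical helper, and the demo uses small placeholder files in a temporary directory.

```python
import shutil
import tempfile
from pathlib import Path


def copy_by_name(file_paths, src_dir, dst_dir):
    """Copy each file, looked up by name in src_dir, into dst_dir. Return the count."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    copied = 0
    for path in file_paths:
        name = Path(path).name
        shutil.copy2(src_dir / name, dst_dir / name)  # copy2 also copies metadata
        copied += 1
    return copied


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    backup = root / "train_color_original"
    backup.mkdir()
    for name in ["a.jpg", "b.jpg", "c.jpg"]:
        (backup / name).write_bytes(b"placeholder")
    # Paths as recorded before the rename, still pointing at the old folder name.
    val_files = [str(root / "train_color" / "b.jpg")]
    n_copied = copy_by_name(val_files, backup, root / "val_color")
    val_names = sorted(p.name for p in (root / "val_color").iterdir())

print(n_copied, val_names)
```

With such a helper, each of the four loops would reduce to a single call, for example `copy_by_name(val_color_files, BACKUP_TRAIN_COLOR, VAL_COLOR)`.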


## Verify Split and Summarize Results

This final section **verifies** that the new train/validation split was created correctly and prints a concise **summary** of the dataset after restructuring.

### Consistency Checks

The cell prints the number of `.jpg` files in each folder so that you can confirm:

- That the number of files in:
  - `train_color/` matches `train_black/`
  - `val_color/` matches `val_black/`
- That the total number of images is preserved:
  - New training count + validation count equals the original training count.
- That the test folders (`test_color/`, `test_black/`) still exist and hold equal numbers of images.

The cell only reports these counts and does not raise an error on a mismatch, so compare them against the split sizes printed earlier before using the data for training.
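
The checks listed above can also be enforced with assertions on filenames rather than on counts alone. The sketch below is one way to do that. `names_in` and `check_split` are hypothetical helpers, and the demo builds a tiny folder layout in a temporary directory.

```python
import tempfile
from pathlib import Path


def names_in(folder, pattern="*.jpg"):
    """Return the set of filenames in folder that match pattern."""
    return {p.name for p in Path(folder).glob(pattern)}


def check_split(train_color, train_black, val_color, val_black, original_color):
    """Raise AssertionError if the split folders are inconsistent."""
    train_c, train_b = names_in(train_color), names_in(train_black)
    val_c, val_b = names_in(val_color), names_in(val_black)
    assert train_c == train_b, "train_color and train_black differ"
    assert val_c == val_b, "val_color and val_black differ"
    assert not (train_c & val_c), "an image appears in both train and validation"
    assert train_c | val_c == names_in(original_color), "split does not cover the original set"
    return len(train_c), len(val_c)


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Keys are listed in the same order as the parameters of check_split.
    layout = {
        "train_color": ["a.jpg", "b.jpg", "c.jpg"],
        "train_black": ["a.jpg", "b.jpg", "c.jpg"],
        "val_color": ["d.jpg"],
        "val_black": ["d.jpg"],
        "train_color_original": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    }
    for folder, names in layout.items():
        (root / folder).mkdir()
        for name in names:
            (root / folder / name).touch()
    counts = check_split(*(root / folder for folder in layout))

print(counts)
```

Comparing filename sets also catches a case that equal counts would miss: the same image ending up in both the training and the validation folder.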

### Summary of the New Split

The script prints a short summary including:

- Number of training images after the split.
- Number of validation images.
- Location of backup folders containing the original training data:
  - `train_color_original/`
  - `train_black_original/`

This provides a clear **record of the data preparation step**, documenting:

- Exactly how many images are now in each subset.
- Where to find the original, unmodified training data.
- Confirmation that the train/validation split is internally consistent.

You can reference this summary directly in your project report as the **description of your data split procedure**.


In [9]:
# Verify the split
print("\nVerification:")
print(f"  train_color: {len(list(NEW_TRAIN_COLOR.glob('*.jpg')))} images")
print(f"  train_black: {len(list(NEW_TRAIN_BLACK.glob('*.jpg')))} images")
print(f"  val_color: {len(list(VAL_COLOR.glob('*.jpg')))} images")
print(f"  val_black: {len(list(VAL_BLACK.glob('*.jpg')))} images")

# Check test set still exists
test_color = DATA_DIR / "test_color"
test_black = DATA_DIR / "test_black"
if test_color.exists() and test_black.exists():
    print(f"  test_color: {len(list(test_color.glob('*.jpg')))} images")
    print(f"  test_black: {len(list(test_black.glob('*.jpg')))} images")

print("\nValidation split created successfully!")
print("\nNote: Original data is backed up in:")
print(f"   - {BACKUP_TRAIN_COLOR.name}")
print(f"   - {BACKUP_TRAIN_BLACK.name}")


Verification:
  train_color: 4750 images
  train_black: 4750 images
  val_color: 250 images
  val_black: 250 images
  test_color: 739 images
  test_black: 739 images

Validation split created successfully!

Note: Original data is backed up in:
   - train_color_original
   - train_black_original
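
Equal file counts show that the test folders still exist, but not that their contents are untouched. One stricter option, sketched below, is to record a fingerprint of each test folder before the split and compare it afterwards. `folder_fingerprint` is a hypothetical helper, and hashing filenames plus contents with SHA-256 is an illustrative choice, not something this notebook does.

```python
import hashlib
import tempfile
from pathlib import Path


def folder_fingerprint(folder, pattern="*.jpg"):
    """Return one SHA-256 digest over the sorted filenames and file contents."""
    digest = hashlib.sha256()
    for path in sorted(Path(folder).glob(pattern)):
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    test_dir = Path(tmp) / "test_color"
    test_dir.mkdir()
    (test_dir / "x.jpg").write_bytes(b"pixels-1")
    (test_dir / "y.jpg").write_bytes(b"pixels-2")

    before = folder_fingerprint(test_dir)
    unchanged = folder_fingerprint(test_dir) == before

    (test_dir / "y.jpg").write_bytes(b"pixels-3")  # simulate an accidental edit
    changed = folder_fingerprint(test_dir) != before

print(unchanged, changed)
```

In practice the fingerprint would be taken once before the backup step and once after verification, and the two values compared.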
