# Dataset Generation — Pixel Coordinate Regression (TFRecord, 150,000 samples)

This notebook/script generates a synthetic dataset for the task:

**Input:** 50×50 grayscale image with exactly one bright pixel (=255), all others 0  
**Output label:** (x, y) coordinate of the bright pixel

We store the dataset in **TFRecord shards** for efficient loading with TensorFlow.


## Why synthetic data?

The task is fully specified by construction, so it does not require real-world images.

Synthetic generation is ideal because:
- Labels are **exact and error-free**
- Dataset size can be increased easily (here: 150,000 samples)
- Positions can be sampled **uniformly**, so every coordinate is equally likely (about 60 samples per position on average: 150,000 samples over 2,500 positions)
- Enables fast experimentation and reproducible results
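
As a rough illustration of the uniform-sampling point, the sketch below draws 150,000 coordinates and counts how often each of the 2,500 positions occurs. It is not part of the generator and does not reproduce the exact coordinates written to disk (the generator draws them shard by shard); the variable names are illustrative only.

```python
import numpy as np

n_samples, img_size = 150_000, 50
rng = np.random.default_rng(42)

# Draw x and y independently and uniformly from [0, img_size - 1].
xs = rng.integers(0, img_size, size=n_samples)
ys = rng.integers(0, img_size, size=n_samples)

# Flatten each (x, y) pair to one position id (row-major) and count occurrences.
flat = ys * img_size + xs
counts = np.bincount(flat, minlength=img_size * img_size)

expected = n_samples / (img_size * img_size)  # 60.0 samples per position on average
print(f"expected per position: {expected:.1f}")
print(f"observed min/max: {counts.min()}/{counts.max()}")
```

The observed counts scatter around the expected value; uniform sampling makes positions equally likely, not exactly equally frequent.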


## Cell 1 — Imports and dependencies

This cell imports all required libraries for dataset generation:

- `numpy` for random coordinate generation and split indices
- `tensorflow` for TFRecord writing and tensor serialization
- `json` for saving dataset metadata, `pathlib` for output paths, and `typing` for type hints

> If TensorFlow is not installed, install it before running:
> `pip install tensorflow`


In [1]:
import json
from pathlib import Path
from typing import Dict

import numpy as np
import tensorflow as tf


## Cell 2 — Core dataset generation utilities (TFRecord + splits + metadata)

This cell defines the full dataset generation pipeline:

### What this cell contains
1. **`make_splits()`**
   - Creates reproducible shuffled indices for:
     - train (80%)
     - validation (10%)
     - test (10%)

2. **TFRecord helper functions**
   - Create `tf.train.Feature` objects for storing bytes and integer values.

3. **`write_tfrecord_dataset()`**
   - Generates `n_samples` images of size `img_size × img_size`
   - For each sample:
     - Picks a random coordinate `(x, y)`, with `x` and `y` each drawn uniformly from `[0, img_size - 1]` (`[0, 49]` for the default size)
     - Creates an image with exactly one bright pixel: `image[y, x] = bright_value`
     - Serializes the image using `tf.io.serialize_tensor`
     - Writes data into sharded TFRecord files

4. **Outputs created**
   - TFRecord shard files: `data-xxxxx-of-yyyyy.tfrecord`
   - Split index files: `split_train_idx.npy`, `split_val_idx.npy`, `split_test_idx.npy` (disjoint, shuffled sample indices in `[0, n_samples - 1]`)
   - Metadata file: `meta.json` (image size, bright value, compression, coordinate convention)
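
The split files hold sample indices, while the records themselves live in shards. Assuming an index refers to a record's position in write order (shard 0 first, records in the order they were written), which the notebook does not state explicitly, a global index maps to a shard and an offset by integer division. `locate_record` is a hypothetical helper, not part of the notebook.

```python
from typing import Tuple


def locate_record(idx: int, shard_size: int = 15_000, n_samples: int = 150_000) -> Tuple[str, int]:
    """Map a global sample index to (shard file name, offset inside the shard).

    Assumes records are numbered in write order, starting at shard 0.
    """
    if not 0 <= idx < n_samples:
        raise IndexError(f"index {idx} out of range [0, {n_samples - 1}]")
    n_shards = (n_samples + shard_size - 1) // shard_size  # ceiling division
    shard_id, offset = divmod(idx, shard_size)
    return f"data-{shard_id:05d}-of-{n_shards:05d}.tfrecord", offset


print(locate_record(0))        # ('data-00000-of-00010.tfrecord', 0)
print(locate_record(15_000))   # ('data-00001-of-00010.tfrecord', 0)
print(locate_record(149_999))  # ('data-00009-of-00010.tfrecord', 14999)
```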

### Coordinate convention
We store labels as `(x, y)` where:
- `x` = column index
- `y` = row index
- and the bright pixel is placed using: `image[y, x] = bright_value`
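
The convention is easy to get backwards, because NumPy indexes arrays as `[row, column]`, that is `[y, x]`. A minimal sketch (the example coordinates are arbitrary) that places a pixel and recovers the `(x, y)` label from the image:

```python
import numpy as np

img_size, bright_value = 50, 255
x, y = 7, 31  # x = column, y = row (arbitrary example values)

img = np.zeros((img_size, img_size), dtype=np.uint8)
img[y, x] = bright_value  # row index first, then column index

# np.argmax returns the flat index of the brightest pixel;
# np.unravel_index converts it back to (row, col) = (y, x).
row, col = np.unravel_index(np.argmax(img), img.shape)
recovered = (int(col), int(row))  # back to the (x, y) label order

print(recovered)  # (7, 31)
```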


In [4]:
def make_splits(
    n_samples: int,
    seed: int = 42,
    train_ratio: float = 0.80,
    val_ratio: float = 0.10
) -> Dict[str, np.ndarray]:
    if train_ratio + val_ratio >= 1.0:
        raise ValueError("train_ratio + val_ratio must be < 1.0")

    rng = np.random.default_rng(seed)
    idx = np.arange(n_samples, dtype=np.int64)
    rng.shuffle(idx)

    n_train = int(n_samples * train_ratio)
    n_val = int(n_samples * val_ratio)

    return {
        "train": idx[:n_train],
        "val": idx[n_train:n_train + n_val],
        "test": idx[n_train + n_val:]
    }


def _bytes_feature(value: bytes) -> tf.train.Feature:
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))


def _int64_feature(value: int) -> tf.train.Feature:
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


def write_tfrecord_dataset(
    out_dir: str = "Pixel_dataset",
    n_samples: int = 150_000,
    img_size: int = 50,
    bright_value: int = 255,
    shard_size: int = 15_000,
    seed: int = 42,
    compression: str = "GZIP",
) -> None:
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    n_shards = (n_samples + shard_size - 1) // shard_size
    options = tf.io.TFRecordOptions(compression_type=compression) if compression else None

    rng = np.random.default_rng(seed)
    total_written = 0

    for shard_id in range(n_shards):
        start = shard_id * shard_size
        end = min(start + shard_size, n_samples)
        batch_n = end - start

        # Generate labels with NumPy (simple + fast)
        xs = rng.integers(0, img_size, size=batch_n, dtype=np.int16)
        ys = rng.integers(0, img_size, size=batch_n, dtype=np.int16)

        shard_name = f"data-{shard_id:05d}-of-{n_shards:05d}.tfrecord"
        shard_file = out_path / shard_name

        with tf.io.TFRecordWriter(str(shard_file), options=options) as writer:
            for i in range(batch_n):
                # Build one image (uint8)
                img = np.zeros((img_size, img_size), dtype=np.uint8)
                img[int(ys[i]), int(xs[i])] = np.uint8(bright_value)

                # Serialize safely (TensorFlow-native)
                img_tensor = tf.convert_to_tensor(img, dtype=tf.uint8)
                img_bytes = tf.io.serialize_tensor(img_tensor).numpy()

                example = tf.train.Example(
                    features=tf.train.Features(
                        feature={
                            "image_tensor": _bytes_feature(img_bytes),
                            "x": _int64_feature(int(xs[i])),
                            "y": _int64_feature(int(ys[i])),
                        }
                    )
                )
                writer.write(example.SerializeToString())

        total_written += batch_n
        print(f"Wrote shard {shard_id + 1}/{n_shards}: {batch_n} samples (total {total_written}/{n_samples})")

    # Save split indices
    splits = make_splits(n_samples=n_samples, seed=seed, train_ratio=0.80, val_ratio=0.10)
    np.save(out_path / "split_train_idx.npy", splits["train"])
    np.save(out_path / "split_val_idx.npy", splits["val"])
    np.save(out_path / "split_test_idx.npy", splits["test"])

    # Save metadata (the schema string reflects the actual img_size, not a fixed 50)
    meta = {
        "n_samples": n_samples,
        "img_size": img_size,
        "bright_value": bright_value,
        "shard_size": shard_size,
        "n_shards": n_shards,
        "compression": compression,
        "coord_convention": "labels are (x, y) where image[y, x] = bright_value",
        "feature_schema": {
            "image_tensor": f"tf.io.serialize_tensor(uint8[{img_size},{img_size}])",
            "x": "int64",
            "y": "int64"
        }
    }
    (out_path / "meta.json").write_text(json.dumps(meta, indent=2), encoding="utf-8")

    print("\n✅ Dataset generation complete:", out_path.resolve())


## Cell 3 — Generate the dataset (150,000 samples)

This cell runs the dataset generator with the chosen configuration:

- Output folder: `Pixel_dataset`
- Total samples: `150,000`
- Image size: `50 × 50`
- Bright pixel value: `255`
- Sharding: `15,000` samples per TFRecord (10 shards)
- Seed: `42` (reproducible)
- Compression: `GZIP` (smaller files on disk, at the cost of extra CPU time to compress and decompress)

After this cell finishes, the dataset files will be created inside `Pixel_dataset/`.
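
The shard count and file names follow from the configuration by ceiling division. The sketch below reproduces that arithmetic without writing any files; `shard_sizes` is a hypothetical helper used only for illustration.

```python
from typing import List


def shard_sizes(n_samples: int, shard_size: int) -> List[int]:
    """Return the number of samples in each shard (hypothetical helper)."""
    n_shards = (n_samples + shard_size - 1) // shard_size  # ceiling division
    return [min(shard_size, n_samples - i * shard_size) for i in range(n_shards)]


sizes = shard_sizes(150_000, 15_000)
names = [f"data-{i:05d}-of-{len(sizes):05d}.tfrecord" for i in range(len(sizes))]

print(len(sizes), sizes[0], sizes[-1])  # 10 15000 15000
print(names[0], names[-1])  # data-00000-of-00010.tfrecord data-00009-of-00010.tfrecord
print(shard_sizes(100, 30))  # [30, 30, 30, 10]
```

When `n_samples` is not a multiple of `shard_size`, only the last shard is smaller, as the final line shows.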


In [None]:
write_tfrecord_dataset(
    out_dir="Pixel_dataset",
    n_samples=150_000,
    img_size=50,
    bright_value=255,
    shard_size=15_000,
    seed=42,
    compression="GZIP",
)

Wrote shard 1/10: 15000 samples (total 15000/150000)
Wrote shard 2/10: 15000 samples (total 30000/150000)
Wrote shard 3/10: 15000 samples (total 45000/150000)
Wrote shard 4/10: 15000 samples (total 60000/150000)
Wrote shard 5/10: 15000 samples (total 75000/150000)
Wrote shard 6/10: 15000 samples (total 90000/150000)
Wrote shard 7/10: 15000 samples (total 105000/150000)
Wrote shard 8/10: 15000 samples (total 120000/150000)
Wrote shard 9/10: 15000 samples (total 135000/150000)
Wrote shard 10/10: 15000 samples (total 150000/150000)

✅ Dataset generation complete: C:\Users\adity\Desktop\DeepEdge\Pixel_dataset


## Cell 4 — Sanity check (verify dataset integrity)

This cell validates the generated dataset by reading TFRecords back and checking:

1. TFRecord decoding works correctly (`tf.io.parse_tensor`)
2. The label `(x, y)` matches the bright pixel location:
   - `image[y, x] == bright_value`
3. Both conditions hold for the first `n_checks` records (default: 30), read in sorted shard order

If all checks pass, it prints a success message:
✅ Sanity check passed for 30 samples

This ensures the dataset is correct before using it in the training notebook.
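
The check described above confirms that the labelled pixel is bright, but not that every other pixel is zero. A stricter per-image check could look like the sketch below. `check_single_bright_pixel` is a hypothetical helper that operates on a NumPy array (for example `img.numpy()` from a parsed record) and is not part of the notebook; it assumes a nonzero `bright` value.

```python
import numpy as np


def check_single_bright_pixel(img: np.ndarray, x: int, y: int, bright: int = 255) -> None:
    """Raise AssertionError unless img has exactly one nonzero pixel, equal to bright, at (x, y)."""
    if img[y, x] != bright:
        raise AssertionError(f"img[y,x]={img[y, x]}, expected {bright}")
    n_nonzero = int(np.count_nonzero(img))
    if n_nonzero != 1:
        raise AssertionError(f"expected exactly 1 nonzero pixel, found {n_nonzero}")


good = np.zeros((50, 50), dtype=np.uint8)
good[31, 7] = 255
check_single_bright_pixel(good, x=7, y=31)  # passes silently

bad = good.copy()
bad[0, 0] = 255  # a second bright pixel
try:
    check_single_bright_pixel(bad, x=7, y=31)
except AssertionError as err:
    print("rejected:", err)
```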


In [6]:
import glob
from pathlib import Path
import json
import tensorflow as tf


def parse_example(example_proto, img_size: int):
    feature_spec = {
        "image_tensor": tf.io.FixedLenFeature([], tf.string),
        "x": tf.io.FixedLenFeature([], tf.int64),
        "y": tf.io.FixedLenFeature([], tf.int64),
    }
    ex = tf.io.parse_single_example(example_proto, feature_spec)

    img = tf.io.parse_tensor(ex["image_tensor"], out_type=tf.uint8)
    img = tf.reshape(img, (img_size, img_size))

    x = tf.cast(ex["x"], tf.int32)
    y = tf.cast(ex["y"], tf.int32)
    return img, x, y


def sanity_check(dataset_dir: str, n_checks: int = 30):
    meta = json.loads(Path(dataset_dir, "meta.json").read_text(encoding="utf-8"))
    img_size = meta["img_size"]
    bright = meta["bright_value"]
    compression = meta["compression"]

    files = sorted(glob.glob(str(Path(dataset_dir, "*.tfrecord"))))
    ds = tf.data.TFRecordDataset(
        files,
        compression_type=compression if compression else None
    ).map(lambda r: parse_example(r, img_size))

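    checked = 0  # number of records actually read and verified
    for i, (img, x, y) in enumerate(ds.take(n_checks), start=1):
        val = int(img[int(y), int(x)].numpy())
        if val != bright:
            raise AssertionError(f"Mismatch at sample {i}: img[y,x]={val}, expected {bright}")
        checked = i

    # Fail loudly instead of reporting success when nothing was read.
    if checked == 0:
        raise AssertionError(f"No records found in {dataset_dir}")
    print(f"✅ Sanity check passed for {checked} samples")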


sanity_check("Pixel_dataset", n_checks=30)


✅ Sanity check passed for 30 samples
