# Neuro-Fuzzy Computing - Project - Fall 2025
## Galaxy Zoo — Training

In this notebook, we train and evaluate on the **training** portion of the Galaxy Zoo dataset.

Dataset location (Drive): `MyDrive/galaxy-zoo-the-galaxy-challenge/`, which contains the files `images_training_rev1.zip` and `training_solutions_rev1.zip`.

#### Inspecting the initial dataset location

In [1]:
from google.colab import drive
from pathlib import Path

# Mount Google Drive
drive.mount("/content/drive")

# Directory of the original dataset (no working-directory change is needed)
data_dir = Path("/content/drive/MyDrive/galaxy-zoo-the-galaxy-challenge")

# List files and folders inside the directory
for item in data_dir.iterdir():
    print(item)

Mounted at /content/drive
/content/drive/MyDrive/galaxy-zoo-the-galaxy-challenge/images_training_rev1.zip
/content/drive/MyDrive/galaxy-zoo-the-galaxy-challenge/training_solutions_rev1.zip


#### Extracting the dataset in its original form
The dataset is extracted to `/content`, where it remains only while the session is connected/active.

In [2]:
import zipfile
from pathlib import Path

# Dataset directory
data_dir = Path("/content/drive/MyDrive/galaxy-zoo-the-galaxy-challenge")

# Directory where contents should be extracted, create folder if it doesn't exist
extract_dir = Path("/content")
extract_dir.mkdir(parents=True, exist_ok=True)

# ZIP files
zip_files = [
    data_dir / "images_training_rev1.zip",
    data_dir / "training_solutions_rev1.zip"
]

# Function to safely unzip a file
def safe_unzip(zip_path: Path, extract_to: Path):
    print(f"Unzipping {zip_path.name}...")
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        for file_info in zip_ref.infolist():
            extracted_path = extract_to / file_info.filename
            if not extracted_path.exists():  # Skip if already extracted
                zip_ref.extract(file_info, extract_to)
    print(f"Finished unzipping {zip_path.name}")

# Unzip each file
for zip_path in zip_files:
    safe_unzip(zip_path, extract_dir)

print("All files unzipped successfully.")

Unzipping images_training_rev1.zip...
Finished unzipping images_training_rev1.zip
Unzipping training_solutions_rev1.zip...
Finished unzipping training_solutions_rev1.zip
All files unzipped successfully.
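
The skip-if-present behaviour of `safe_unzip` can be checked on a throwaway archive. The following is a minimal sketch using only the standard library; the archive name `demo.zip`, its members `a.txt` and `b.txt`, and the helper `extract_missing` are hypothetical stand-ins, not part of the notebook.

```python
import tempfile
import zipfile
from pathlib import Path

def extract_missing(zip_path: Path, extract_to: Path) -> list[str]:
    """Extract only members that are not on disk yet; return their names."""
    extracted = []
    with zipfile.ZipFile(zip_path, "r") as zf:
        for info in zf.infolist():
            if not (extract_to / info.filename).exists():
                zf.extract(info, extract_to)
                extracted.append(info.filename)
    return extracted

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    archive = tmp / "demo.zip"  # hypothetical archive, created only for this demo
    with zipfile.ZipFile(archive, "w") as zf:
        zf.writestr("a.txt", "alpha")
        zf.writestr("b.txt", "beta")

    out_dir = tmp / "out"
    out_dir.mkdir()
    first_run = extract_missing(archive, out_dir)   # both members are extracted
    second_run = extract_missing(archive, out_dir)  # nothing is left to extract

print(first_run, second_run)
```

An existence check like this cannot detect a file that was only partially written by an interrupted extraction; deleting the target folder forces a clean re-extract.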


### Dataset inspection and preprocessing

In the following cells, we perform the essential preprocessing steps:

1. **Define data paths and parameters**
   - Point to the extracted training images folder: `images_training_rev1/`
   - Point to the label file: `training_solutions_rev1.csv`

In [3]:
from google.colab import drive
from pathlib import Path
import pandas as pd

drive.mount("/content/drive")

DATA_ROOT = Path("/content")

img_dir = DATA_ROOT / "images_training_rev1"
csv_path = DATA_ROOT / "training_solutions_rev1.csv"

print("img_dir exists:", img_dir.is_dir())
print("csv_path exists:", csv_path.is_file())

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
img_dir exists: True
csv_path exists: True


2. **Check label-to-image consistency**
   - Confirm the number of label rows matches the number of extracted images
   - If there is a mismatch, we report example GalaxyIDs whose image files are missing

In [4]:
# If this cell takes minutes to run, Colab is probably struggling to list the image folder (most likely due to its size). If that happens, restart the session.
solutions_df = pd.read_csv(csv_path)

# Galaxy IDs are the values of the "GalaxyID" column
ids = solutions_df["GalaxyID"].astype(str).tolist()

# Each ID must match a file name "<GalaxyID>.jpg" inside the folder "images_training_rev1"
train_image_names = sorted([p.name for p in img_dir.glob("*.jpg")])

if len(ids) != len(train_image_names):
    name_set = set(train_image_names)
    # Scan every ID (not only the first 50) so that missing files are actually found
    missing = [gid for gid in ids if f"{gid}.jpg" not in name_set]
    raise ValueError(f"Label/image count mismatch: labels={len(ids)} images={len(train_image_names)}. Example missing IDs: {missing[:10]}")

3. **Prepare inputs for a TensorFlow dataset**

In this cell, we:
   - Create `paths` (image filepaths) and `labels` (soft targets) from `training_solutions_rev1.csv`
   - Define `load_image(path, y)`, which will be used later with `tf.data.Dataset.map(...)` to load/parse images **on demand**

In [5]:
!pip -q install tensorflow

In [6]:
import tensorflow as tf
import pandas as pd

# Assumes `solutions_df` and `img_dir` are defined in the cells above
target_cols = [c for c in solutions_df.columns if c != "GalaxyID"]

paths  = (solutions_df["GalaxyID"].astype(int).astype(str) + ".jpg").apply(lambda fn: str(img_dir / fn)).to_numpy()
labels = solutions_df[target_cols].to_numpy(dtype="float32")

size = (424, 424)

def load_image(path, y):
    img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
    img = tf.image.convert_image_dtype(img, tf.float32)  # [0,1]
    img = tf.image.resize(img, size)  # "Guard rail": ensuring that images are of size 424x424
    return img, y

### Train/Val/Test split (80/10/10)

We create our own 80/10/10 train/val/test split from the training set.

We start from a dataset of `(filepath, target_vector)` pairs and shuffle **once** with a fixed seed (`seed=42`, `reshuffle_each_iteration=False`) to get a reproducible, **fixed** random ordering.

We then split by slicing the shuffled dataset:
  - **Train:** first 80% of samples
  - **Validation:** next 10%
  - **Test:** final 10%

Lastly, we apply `load_image` **after** the split, so each subset loads and decodes its images lazily and independently.

In [7]:
import tensorflow as tf

n_total = len(paths)
n_train = int(0.8 * n_total)
n_val   = int(0.1 * n_total)
n_test  = n_total - n_train - n_val

print("Dataset size:", n_total)
print("Train:", n_train)
print("Val:", n_val)
print("Test:", n_test)

# Shuffle ONCE (fixed order for reproducibility)
base_ds = tf.data.Dataset.from_tensor_slices((paths, labels)).shuffle(buffer_size=n_total, seed=42, reshuffle_each_iteration=False)

# Split (no images loaded yet)
train_ds = base_ds.take(n_train)
val_ds   = base_ds.skip(n_train).take(n_val)
test_ds  = base_ds.skip(n_train + n_val)

Dataset size: 61578
Train: 49262
Val: 6157
Test: 6159
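
The split sizes above follow from integer truncation, with the test split absorbing the rounding remainder. The sketch below reproduces that arithmetic and illustrates why a single fixed ordering matters: slicing one ordering always gives disjoint subsets, whereas a dataset that reshuffled on every iteration would hand `take` and `skip` a different ordering each time and could leak samples between splits. `random.Random(42)` is only a stand-in for the TensorFlow shuffle and does not reproduce its ordering.

```python
import random

n_total = 61578                      # dataset size reported above
n_train = int(0.8 * n_total)
n_val = int(0.1 * n_total)
n_test = n_total - n_train - n_val   # the remainder absorbs the rounding
print(n_train, n_val, n_test)

# Stand-in for the single fixed-seed shuffle (not the TensorFlow shuffler)
order = list(range(n_total))
random.Random(42).shuffle(order)

# Slice the one fixed ordering, mirroring take / skip
train_idx = set(order[:n_train])
val_idx = set(order[n_train:n_train + n_val])
test_idx = set(order[n_train + n_val:])

# The three slices are pairwise disjoint and together cover every index
disjoint = not (train_idx & val_idx or train_idx & test_idx or val_idx & test_idx)
covers_all = len(train_idx | val_idx | test_idx) == n_total
print(disjoint, covers_all)
```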


### Image loading pipeline (lazy + batched)

To build the dataset, we:
   - Convert the split datasets from `(filepath, target_vector)` into `(image_tensor, target_tensor)` using `map(load_image)`
     - Images are read/decoded **on demand** with `tf.io.read_file` + `tf.io.decode_jpeg`
     - Converted to `float32` in **[0, 1]** (and resized to 424×424 as a safety step)
   - Optimize input throughput:
     - **Train:** shuffle (per epoch) → batch → prefetch
     - **Val/Test:** batch → prefetch

Each dataset element is a **batch** `(image_tensor, target_tensor)` where:
  - `image_tensor` has shape **(batch_size, 424, 424, 3)** (channels-last) in **[0, 1]**
  - `target_tensor` has shape **(batch_size, 37)**


In [8]:
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE
BATCH_SIZE = 32

# Allow non-deterministic ordering for speed (especially with parallel map)
options = tf.data.Options()
options.deterministic = False  # `experimental_deterministic` is the deprecated name of this option

train_img_ds = (train_ds.shuffle(10_000, reshuffle_each_iteration=True)
                .map(load_image, num_parallel_calls=AUTOTUNE)
                .batch(BATCH_SIZE, drop_remainder=True)
                .prefetch(AUTOTUNE).with_options(options))
val_img_ds = (val_ds.map(load_image, num_parallel_calls=AUTOTUNE)
              .batch(BATCH_SIZE, drop_remainder=True)
              .prefetch(AUTOTUNE).with_options(options))
test_img_ds = (test_ds.map(load_image, num_parallel_calls=AUTOTUNE)
               .batch(BATCH_SIZE, drop_remainder=True)
               .prefetch(AUTOTUNE).with_options(options))

xb, yb = next(iter(train_img_ds))
print("train batch x:", xb.shape, xb.dtype, "y:", yb.shape, yb.dtype)

train batch x: (32, 424, 424, 3) <dtype: 'float32'> y: (32, 37) <dtype: 'float32'>
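
Because every pipeline uses `drop_remainder=True`, the last partial batch of each split is discarded, so a few validation and test samples never contribute to the reported metrics. A small sketch of the arithmetic, assuming the split sizes printed earlier and `BATCH_SIZE = 32`:

```python
BATCH_SIZE = 32
split_sizes = {"train": 49262, "val": 6157, "test": 6159}  # sizes printed earlier

dropped_per_split = {}
for name, n in split_sizes.items():
    full_batches, dropped = divmod(n, BATCH_SIZE)
    dropped_per_split[name] = dropped
    print(f"{name}: {full_batches} full batches, {dropped} samples dropped")

# Number of optimizer steps in one pass over the training pipeline
steps_per_epoch = split_sizes["train"] // BATCH_SIZE
print("steps per epoch:", steps_per_epoch)
```

Static batch shapes avoid recompilation of XLA-compiled steps, which is the usual reason to accept this trade-off; the cost is that 15 test samples are left out of the test RMSE.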


### Building our CNN model

In [9]:
import tensorflow as tf

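# Sequential CNN: takes (424, 424, 3) inputs and outputs 37 values, one per target column.
# Note: the final Dense layer has no activation, so the outputs are unbounded regression values, not probabilities.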
model = tf.keras.Sequential()

# Input
model.add(tf.keras.layers.Input(shape=(424, 424, 3)))

# Block 1 (3 -> 32)
model.add(tf.keras.layers.Conv2D(32, kernel_size=3, use_bias=True, padding="same"))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.ReLU())
#model.add(tf.keras.layers.Conv2D(32, kernel_size=3, use_bias=True, padding="same"))
#model.add(tf.keras.layers.BatchNormalization())
#model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.MaxPooling2D(pool_size=2))
model.add(tf.keras.layers.SpatialDropout2D(0.05))

# Block 2 (32 -> 64)
model.add(tf.keras.layers.Conv2D(64, kernel_size=3, use_bias=True, padding="same"))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.ReLU())
#model.add(tf.keras.layers.Conv2D(64, kernel_size=3, use_bias=True, padding="same"))
#model.add(tf.keras.layers.BatchNormalization())
#model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.MaxPooling2D(pool_size=2))
model.add(tf.keras.layers.SpatialDropout2D(0.10))

# Block 3 (64 -> 128)
model.add(tf.keras.layers.Conv2D(128, kernel_size=3, use_bias=True, padding="same"))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.ReLU())
#model.add(tf.keras.layers.Conv2D(128, kernel_size=3, use_bias=True, padding="same"))
#model.add(tf.keras.layers.BatchNormalization())
#model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.MaxPooling2D(pool_size=2))
model.add(tf.keras.layers.SpatialDropout2D(0.15))

# Block 4 (128 -> 256)
model.add(tf.keras.layers.Conv2D(256, kernel_size=3, use_bias=True, padding="same"))
model.add(tf.keras.layers.BatchNormalization())
model.add(tf.keras.layers.ReLU())
#model.add(tf.keras.layers.Conv2D(256, kernel_size=3, use_bias=True, padding="same"))
#model.add(tf.keras.layers.BatchNormalization())
#model.add(tf.keras.layers.ReLU())
model.add(tf.keras.layers.MaxPooling2D(pool_size=2))
model.add(tf.keras.layers.SpatialDropout2D(0.20))

# Pooling layer
model.add(tf.keras.layers.GlobalAveragePooling2D())

# 37-dim output right after pooling:
model.add(tf.keras.layers.Dense(37, use_bias=True))

model.summary()
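
The layer sizes can be sanity-checked by hand against `model.summary()`. The sketch below assumes the architecture exactly as written above (one 3×3 convolution with bias per block, batch normalization, 2×2 max pooling with the default `"valid"` padding); the helper names `conv_params` and `bn_params` are hypothetical.

```python
def conv_params(c_in, c_out, k=3):
    # k*k*c_in weights per output channel, plus one bias per output channel
    return k * k * c_in * c_out + c_out

def bn_params(c):
    # gamma, beta (trainable) and moving mean, moving variance (non-trainable)
    return 4 * c

side, c_in, total = 424, 3, 0
for c_out in (32, 64, 128, 256):
    total += conv_params(c_in, c_out) + bn_params(c_out)
    side //= 2  # MaxPooling2D(2) halves each spatial side, rounding down
    print(f"after block: {side}x{side}x{c_out}")
    c_in = c_out

# GlobalAveragePooling2D leaves one value per channel, then Dense(37) with bias
total += c_in * 37 + 37
print("total parameters:", total)
```

Under these assumptions the feature map shrinks 424 → 212 → 106 → 53 → 26 and the model has 399,845 parameters, 960 of which (the batch-normalization moving statistics) are non-trainable.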

### Optimizer, loss function and model compilation

Older version (constant learning rate), kept for reference:
```
import tensorflow as tf

# AdamW optimizer (decoupled weight decay)
optimizer = tf.keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4)

# Model compilation, MSE loss and RMSE metric for reporting
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.RootMeanSquaredError(name="rmse")],
    jit_compile=True,  # enable XLA for built-in Keras paths too (safe even if using custom loop)
)
```

In [10]:
import tensorflow as tf

# Estimate total steps (batches) for full training run
steps_per_epoch = tf.data.experimental.cardinality(train_img_ds).numpy()
total_steps = int(steps_per_epoch * 30)  # 30 = max epochs passed to train_loop; early stopping may end the run sooner

lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,
    decay_steps=total_steps,
    alpha=1e-2,  # final LR = alpha * initial (here: 1e-5)
)

# AdamW optimizer (decoupled weight decay)
optimizer = tf.keras.optimizers.AdamW(learning_rate=lr_schedule, weight_decay=1e-4)

# Model compilation, MSE loss and RMSE metric for reporting
model.compile(
    optimizer=optimizer,
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.RootMeanSquaredError(name="rmse")],
    jit_compile=True,
)
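
For intuition, the schedule (without warmup) evaluates `lr(step) = initial * ((1 - alpha) * 0.5 * (1 + cos(pi * step / decay_steps)) + alpha)`. The following pure-Python sketch re-implements that formula; `cosine_decay_lr` is a hypothetical helper, and `steps_per_epoch = 1539` assumes 49262 training samples, batch size 32, and `drop_remainder=True`.

```python
import math

def cosine_decay_lr(step, initial_lr=1e-3, decay_steps=46170, alpha=1e-2):
    """Cosine decay without warmup; decay_steps = 1539 steps * 30 epochs."""
    step = min(step, decay_steps)  # the rate stays at its floor after decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)

steps_per_epoch = 1539  # assumed: 49262 // 32
for epoch in (1, 2, 10, 30):
    print(epoch, f"{cosine_decay_lr(epoch * steps_per_epoch):.6g}")
```

The learning rate decays from 1e-3 towards the floor `alpha * initial = 1e-5`, which is reached only if all 30 epochs run.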

### Defining the training loop

In [11]:
import time
import gc
import tensorflow as tf

# helper to read current LR (works for constant LR or schedules)
def get_current_lr(optimizer: tf.keras.optimizers.Optimizer):
    lr = optimizer.learning_rate
    # If lr is a schedule, call it with optimizer.iterations
    if isinstance(lr, tf.keras.optimizers.schedules.LearningRateSchedule):
        return float(lr(optimizer.iterations).numpy())
    # Otherwise it's a scalar/tensor/variable
    return float(tf.convert_to_tensor(lr).numpy())

def train_loop(model, train_ds, val_ds, epochs=30, patience=3, min_delta=1e-3):
    best_val = float("inf")
    patience_ctr = 0
    best_weights = None

    # Helper: get metric value by name
    def metric_value(name: str):
        for m in model.metrics:
            r = m.result()
            if isinstance(r, dict):
                # compiled metrics here
                if name in r:
                    return float(r[name].numpy())
            else:
                if m.name == name:
                    return float(r.numpy())
        # If not found, give a helpful error listing available keys/names
        available = []
        for m in model.metrics:
            r = m.result()
            if isinstance(r, dict):
                available.extend(list(r.keys()))
            else:
                available.append(m.name)
        raise ValueError(f"Metric '{name}' not found. Available: {available}")

    @tf.function(jit_compile=True)
    def train_step(xb, yb):
        with tf.GradientTape() as tape:
            preds = model(xb, training=True)
            loss = model.compute_loss(x=xb, y=yb, y_pred=preds, sample_weight=None, training=True)

        grads = tape.gradient(loss, model.trainable_variables)
        model.optimizer.apply_gradients(zip(grads, model.trainable_variables))

        for m in model.metrics:
            m.update_state(yb, preds)

        return loss

    @tf.function(jit_compile=True)
    def val_step(xb, yb):
        preds = model(xb, training=False)
        for m in model.metrics:
            m.update_state(yb, preds)

    for epoch in range(1, epochs + 1):
        t0 = time.time()

        # Train
        model.reset_metrics()
        for xb, yb in train_ds:
            train_step(xb, yb)
        train_rmse_val = metric_value("rmse")

        # Evaluation
        model.reset_metrics()
        for xb, yb in val_ds:
            val_step(xb, yb)
        val_rmse_val = metric_value("rmse")

        lr_val = get_current_lr(model.optimizer)
        dt = time.time() - t0

        # Early stopping mechanism
        improved = (best_val - val_rmse_val) > min_delta
        if improved:
            best_val = val_rmse_val
            patience_ctr = 0
            best_weights = model.get_weights()
        else:
            patience_ctr += 1

        print(
            f"Epoch {epoch:02d}/{epochs} | "
            f"lr={lr_val:.6g} | "
            f"train_RMSE={train_rmse_val:.6f} | "
            f"eval_RMSE={val_rmse_val:.6f} | "
            f"patience={patience_ctr}/{patience} | "
            f"time={dt:.2f}s"
        )

        if patience_ctr >= patience:
            break

    if best_weights is not None:
        model.set_weights(best_weights)

    return epoch  # number of epochs actually run (the restored weights come from the best epoch)

### Training our model

In [12]:
epochs_ran = train_loop(model, train_img_ds, val_img_ds, epochs=30, patience=3, min_delta=1e-3)

Epoch 01/30 | lr=0.000997288 | train_RMSE=0.180896 | eval_RMSE=0.155054 | patience=0/3 | time=453.44s
Epoch 02/30 | lr=0.000989183 | train_RMSE=0.156555 | eval_RMSE=0.156158 | patience=1/3 | time=382.40s
Epoch 03/30 | lr=0.000975773 | train_RMSE=0.154362 | eval_RMSE=0.151570 | patience=0/3 | time=382.68s
Epoch 04/30 | lr=0.000957205 | train_RMSE=0.152671 | eval_RMSE=0.150386 | patience=0/3 | time=392.31s
Epoch 05/30 | lr=0.000933683 | train_RMSE=0.150076 | eval_RMSE=0.146223 | patience=0/3 | time=381.67s
Epoch 06/30 | lr=0.000905463 | train_RMSE=0.146168 | eval_RMSE=0.140713 | patience=0/3 | time=392.43s
Epoch 07/30 | lr=0.000872857 | train_RMSE=0.142763 | eval_RMSE=0.137427 | patience=0/3 | time=381.36s
Epoch 08/30 | lr=0.00083622 | train_RMSE=0.139282 | eval_RMSE=0.132269 | patience=0/3 | time=391.98s
Epoch 09/30 | lr=0.000795954 | train_RMSE=0.134472 | eval_RMSE=0.128412 | patience=0/3 | time=381.08s
Epoch 10/30 | lr=0.0007525 | train_RMSE=0.130806 | eval_RMSE=0.125739 | patience=0/
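
The patience counter in the log follows directly from the early-stopping rule: an epoch counts as an improvement only if it beats the best validation RMSE so far by more than `min_delta`. The sketch below replays that rule on the validation values logged above; `patience_trace` is a hypothetical helper, and the second list is made-up data illustrating a plateau.

```python
def patience_trace(val_scores, patience=3, min_delta=1e-3):
    """Replay the early-stopping rule and return the patience counter per epoch."""
    best, counter, trace = float("inf"), 0, []
    for score in val_scores:
        if best - score > min_delta:  # improvement must exceed min_delta
            best, counter = score, 0
        else:
            counter += 1
        trace.append(counter)
        if counter >= patience:       # stop once patience is exhausted
            break
    return trace

logged_val_rmse = [0.155054, 0.156158, 0.151570, 0.150386, 0.146223,
                   0.140713, 0.137427, 0.132269, 0.128412, 0.125739]
print(patience_trace(logged_val_rmse))

# Hypothetical plateau: gains smaller than min_delta still consume patience
print(patience_trace([0.1200, 0.1195, 0.1193, 0.1192]))
```

Note that `best` is only updated on a qualifying improvement, so a slow drift of sub-threshold gains is measured against the last accepted best value rather than the previous epoch.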

### Evaluate on held-out test split

In [13]:
import tensorflow as tf

test_rmse_metric = tf.keras.metrics.RootMeanSquaredError()

for xb, yb in test_img_ds:
    preds = model(xb, training=False)
    test_rmse_metric.update_state(yb, preds)

test_rmse = float(test_rmse_metric.result().numpy())
print("Test RMSE:", test_rmse)

Test RMSE: 0.110501728951931
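
`RootMeanSquaredError` is a streaming metric: each `update_state` call adds to a running sum of squared errors and a running element count, and `result()` takes the square root of their ratio once at the end. The pure-Python sketch below, using made-up per-batch numbers, shows that this differs from averaging per-batch RMSE values.

```python
import math

# Hypothetical batches: (sum of squared errors, number of elements)
batches = [(0.8, 64), (1.5, 64), (0.1, 16)]

# Streaming RMSE: accumulate totals, take the square root once at the end
total_sq = sum(s for s, _ in batches)
total_n = sum(n for _, n in batches)
streaming_rmse = math.sqrt(total_sq / total_n)

# Naive alternative: average the per-batch RMSEs (not the same quantity)
mean_of_batch_rmse = sum(math.sqrt(s / n) for s, n in batches) / len(batches)

print(f"streaming: {streaming_rmse:.4f}, mean of per-batch: {mean_of_batch_rmse:.4f}")
```

The streaming value is the RMSE over all evaluated elements, which is why the loop above can update the metric batch by batch and read the result only once.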


### Append results to `results.csv`

In [14]:
import pandas as pd
from datetime import datetime
from pathlib import Path
from zoneinfo import ZoneInfo

results_path = Path("/content/drive/MyDrive/galaxy-zoo-the-galaxy-challenge/results.csv")

# timestamp format: MMDDhhmmss in Europe/Athens
timestamp = datetime.now(ZoneInfo("Europe/Athens")).strftime("%m%d%H%M%S")

new_row = pd.DataFrame([{"timestamp": timestamp, "RMSE": test_rmse, "epochs": epochs_ran}])

if results_path.is_file():
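    # Read timestamps as strings; otherwise pandas parses them as integers and drops the leading zero
    prev = pd.read_csv(results_path, dtype={"timestamp": str})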
    out = pd.concat([prev, new_row], ignore_index=True)
else:
    out = new_row

out = out[["timestamp", "RMSE", "epochs"]]  # enforce column order
out.to_csv(results_path, index=False)

print("Appended to:", results_path)
print(out.tail())

Appended to: /content/drive/MyDrive/galaxy-zoo-the-galaxy-challenge/results.csv
    timestamp      RMSE  epochs
0  0203152547  0.110502      27
