# CV Course Project: Assignment - Image Recognition

---

**Group:**
- Alumno 1: San Millan Rodrigues, Nadine (n.srodrigues@alumnos.upm.es)
- Alumno 2: Sukhorukova, Anastasia (anastasia.s@alumnos.upm.es)
- Alumno 3: Reyes Castro, Didier Yamil (didier.reyes.castro@alumnos.upm.es)

**Course:** Computer Vision (CV) - 2025/26

**Institution:** Polytechnic University of Madrid (UPM)

**Date:** January 2026

**Project slides (group members only, editor-rights link):** https://docs.google.com/presentation/d/1YELVYRoQ6xEikMO35GdRnY7-QVEacHZRYHQPR3g47eI/edit?usp=sharing

---

## Goals

The goals of the assignment are:
* Develop proficiency in using Tensorflow/Keras for training Neural Nets (NNs).
* Put into practice the acquired knowledge to optimize the parameters and architecture of a feedforward Neural Net (ffNN), in the context of an image recognition problem.
* Put into practice NNs specially conceived for analysing images. Design and optimize the parameters of a Convolutional Neural Net (CNN) to deal with the previous task.
* Train popular architectures from scratch (e.g., GoogLeNet, VGG, ResNet, ...), and compare the results with the ones provided by their pre-trained versions using transfer learning.

Download the classification data set "xview_recognition" from the following link: [https://drive.upm.es/s/2DDPE2zHw5dbM3G](https://drive.upm.es/s/2DDPE2zHw5dbM3G)

## 0 Setup and Data Loading

### 0.1 Install and Import Required Libraries

If using Google Colab, uncomment the following line to install the required packages.

In [2]:
# !pip install tensorflow numpy rasterio scikit-learn matplotlib keras
# for MacOS:
# pip install tensorflow-macos tensorflow-metal numpy rasterio scikit-learn matplotlib jupyterlab
# pip install notebook ipykernel

If using Conda, create a new environment and install the required packages:
```bash
conda create -n cv_project python tensorflow numpy rasterio scikit-learn matplotlib jupyterlab
conda activate cv_project
```

Loading the necessary libraries.

In [2]:
# Python libraries
import uuid
import warnings
import json
import os
import math

# External libraries
import numpy as np
import rasterio
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

import tensorflow as tf
from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Flatten
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import TerminateOnNaN, EarlyStopping, ReduceLROnPlateau, ModelCheckpoint


Check if GPU is available for training the models.

In [3]:
print(tf.config.list_physical_devices('GPU'))

[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


### 0.2 Load the Dataset & Image Loader

Before loading the dataset, set the path to the folder containing the images.

In [4]:
# IMAGES_PATH = '/mnt/c/Users/didie/Desktop/main/MUIA/CV/Labs/Image_Recognition/xview_recognition/'
IMAGES_PATH = '/Users/anastasia/Desktop/MUIA/CompVis/Project/Data/xview_recognition'

In [5]:
class GenericObject:
    """
    Generic object data.
    """
    def __init__(self):
        self.id = uuid.uuid4()
        self.bb = (-1, -1, -1, -1)
        self.category= -1
        self.score = -1

class GenericImage:
    """
    Generic image data.
    """
    def __init__(self, filename):
        self.filename = filename
        self.tile = np.array([-1, -1, -1, -1])  # (pt_x, pt_y, pt_x+width, pt_y+height)
        self.objects = list([])

    def add_object(self, obj: GenericObject):
        self.objects.append(obj)

In [6]:
categories = {0: 'Cargo plane', 1: 'Small car', 2: 'Bus', 3: 'Truck', 4: 'Motorboat', 5: 'Fishing vessel', 6: 'Dump truck', 7: 'Excavator', 8: 'Building', 9: 'Helipad', 10: 'Storage tank', 11: 'Shipping container', 12: 'Pylon'}

In [7]:
#def load_geoimage(filename):
#    warnings.filterwarnings('ignore', category=rasterio.errors.NotGeoreferencedWarning)
#    src_raster = rasterio.open(IMAGES_PATH + filename, 'r')
#    # RasterIO to OpenCV (see inconsistencies between libjpeg and libjpeg-turbo)
#    input_type = src_raster.profile['dtype']
#    input_channels = src_raster.count
#    img = np.zeros((src_raster.height, src_raster.width, src_raster.count), dtype=input_type)
#    for band in range(input_channels):
#        img[:, :, band] = src_raster.read(band+1)
#    return img

def load_geoimage(filename):
    warnings.filterwarnings('ignore', category=rasterio.errors.NotGeoreferencedWarning)
    
    full_path = os.path.join(IMAGES_PATH, filename)
    
    if not os.path.exists(full_path):
        raise FileNotFoundError(f"Image not found: {full_path}")
    
    # Open with a context manager so the file handle is closed after reading
    with rasterio.open(full_path, 'r') as src_raster:
        input_type = src_raster.profile['dtype']
        input_channels = src_raster.count
        img = np.zeros((src_raster.height, src_raster.width, input_channels), dtype=input_type)
        # Copy each band (rasterio bands are 1-indexed) into the HxWxC array
        for band in range(input_channels):
            img[:, :, band] = src_raster.read(band + 1)

    return img

In [8]:
# Load database
json_file = os.path.join(IMAGES_PATH, 'xview_ann_train.json')
with open(json_file) as ifs:  # the context manager closes the file on exit
    json_data = json.load(ifs)
print(json_data.keys())

dict_keys(['info', 'images', 'annotations', 'categories'])


In [9]:
counts = dict.fromkeys(categories.values(), 0)
anns_dataset = []
for json_img, json_ann in zip(json_data['images'].values(), json_data['annotations'].values()):
    image = GenericImage(json_img['filename'])
    image.tile = np.array([0, 0, json_img['width'], json_img['height']])
    obj = GenericObject()
    obj.bb = (int(json_ann['bbox'][0]), int(json_ann['bbox'][1]), int(json_ann['bbox'][2]), int(json_ann['bbox'][3]))
    obj.category = json_ann['category_id']
    # Count the number of samples per category
    counts[obj.category] += 1
    image.add_object(obj)
    anns_dataset.append(image)
print(counts)

{'Cargo plane': 635, 'Small car': 3324, 'Bus': 1768, 'Truck': 2210, 'Motorboat': 1069, 'Fishing vessel': 706, 'Dump truck': 1236, 'Excavator': 789, 'Building': 3594, 'Helipad': 111, 'Storage tank': 1469, 'Shipping container': 1523, 'Pylon': 312}


In [10]:
def generator_images(objs, batch_size, do_shuffle=False):
    while True:
        if do_shuffle:
            np.random.shuffle(objs)
        groups = [objs[i:i+batch_size] for i in range(0, len(objs), batch_size)]
        for group in groups:
            images, labels = [], []
            for (filename, obj) in group:
                # Load image
                images.append(load_geoimage(filename))
                probabilities = np.zeros(len(categories))
                probabilities[list(categories.values()).index(obj.category)] = 1
                labels.append(probabilities)
            images = np.array(images).astype(np.float32)
            labels = np.array(labels).astype(np.float32)
            yield images, labels
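
The generator above rebuilds `list(categories.values())` and searches it for every sample. The sketch below shows an equivalent one-hot encoding that builds the name-to-index table once. `CATEGORY_TO_INDEX` and `one_hot` are hypothetical names introduced for illustration, and the category dictionary is shortened to three entries.

```python
import numpy as np

# Shortened copy of the notebook's category dictionary (illustration only)
categories = {0: 'Cargo plane', 1: 'Small car', 2: 'Bus'}

# Hypothetical lookup table: category name -> class index, built once
CATEGORY_TO_INDEX = {name: idx for idx, name in categories.items()}

def one_hot(category_name, num_classes=len(categories)):
    """Return a float32 one-hot vector for a category name."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[CATEGORY_TO_INDEX[category_name]] = 1.0
    return vec

print(one_hot('Bus'))  # [0. 0. 1.]
```

The result is the same vector the generator produces; only the lookup cost per sample changes.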

### 0.3 Setup

In [12]:
# Fix random seed for reproducibility
RANDOM_SEED = 42
def set_seed(seed_value):
    np.random.seed(seed_value)
    tf.random.set_seed(seed_value)
set_seed(RANDOM_SEED)

# Number of categories for classification
NUM_CATEGORIES = len(categories)

### 0.4 Possible Architectures / Parameters to Tune

- Normalisation of the input data (rescaling pixels from [0, 255] to [0, 1])
- Adam tuning: learning rate, beta 1, beta 2, epsilon, amsgrad
- Increase number of epochs
- Adding more Dense layers
- Change the activation functions of Dense layers: ReLU, leaky ReLU, ELU (which to choose for each layer?)
- Change loss function from Crossentropy to Focal Loss
- Weight Initialisation (HE initialisation)
- Use BatchNorm layers (but where? in all layers?)
- Dropout?, Early stopping?
- Increase batch size (it could generalise worse, but it parallelises better); batch size is also related to the learning rate: learning*N/batch
- Other advanced techniques seen in class: SGD with warm restarts, SAM
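
To make the "Focal Loss" item above concrete, here is a minimal NumPy sketch of the categorical focal loss, assuming one-hot targets and softmax outputs. The function name, the default `gamma=2.0` and the omission of the per-class `alpha` weighting are choices made for this illustration, not settings taken from the notebook.

```python
import numpy as np

def categorical_focal_loss(y_true, y_pred, gamma=2.0, eps=1e-7):
    """Mean focal loss over a batch of one-hot targets and softmax outputs."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    cross_entropy = -y_true * np.log(y_pred)   # non-zero only at the true class
    modulating = (1.0 - y_pred) ** gamma       # down-weights confident predictions
    return float(np.mean(np.sum(modulating * cross_entropy, axis=1)))

y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y_pred = np.array([[0.9, 0.05, 0.05], [0.2, 0.6, 0.2]])

ce = categorical_focal_loss(y_true, y_pred, gamma=0.0)  # gamma=0 is plain cross-entropy
fl = categorical_focal_loss(y_true, y_pred, gamma=2.0)
print(fl < ce)  # True
```

With `gamma=0` the modulating factor is 1 and the loss reduces to categorical cross-entropy. Larger values of `gamma` shrink the contribution of samples that are already classified with high confidence.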

## 1 Model 0: Simple (and Bad) ffNN > Multinomial Logistic Regression Model

The following architecture is used as a starting point:
- Validation split: 10% (NO stratification)
- Input layer: 150,528 neurons (224x224x3) > Activation function: ReLU
- Output layer: 13 neurons (the categories to classify) > Activation function: Softmax
- Optimiser: Adam with learning_rate=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-8, amsgrad=True, clipnorm=1.0
- Loss function: Categorical Crossentropy
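
The section title calls this model a multinomial logistic regression. A small sketch of why, assuming the pixel values are non-negative (the [0, 255] range mentioned in section 0.4): ReLU then leaves the flattened input unchanged, so the network is a single linear layer followed by softmax. `dense_params` is a hypothetical helper for counting parameters.

```python
import numpy as np

def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units

n_features = 224 * 224 * 3
print(n_features)                    # 150528
print(dense_params(n_features, 13))  # 1956877

# ReLU on non-negative inputs is the identity
pixels = np.array([0.0, 17.0, 128.0, 255.0], dtype=np.float32)
relu = np.maximum(pixels, 0.0)
print(np.array_equal(relu, pixels))  # True
```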

### 1.1 Setup

In [13]:
anns_train, anns_valid = train_test_split(anns_dataset, test_size=0.1, random_state=RANDOM_SEED, shuffle=True)
print('Number of training images: ' + str(len(anns_train)))
print('Number of validation images: ' + str(len(anns_valid)))

Number of training images: 16871
Number of validation images: 1875
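
The split above is not stratified, so rare classes such as 'Helipad' may end up under- or over-represented in the validation set. scikit-learn's `train_test_split` accepts a `stratify` argument for this purpose. The NumPy-only sketch below illustrates the idea on hypothetical labels; `stratified_split` is not part of the notebook.

```python
import numpy as np

def stratified_split(labels, test_size=0.1, seed=42):
    """Return (train_idx, valid_idx) with about test_size of each class in valid."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, valid_idx = [], []
    for cls in np.unique(labels):
        # Shuffle the positions of this class, then cut off the validation share
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_valid = max(1, int(round(test_size * len(idx))))
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return np.array(train_idx), np.array(valid_idx)

# Hypothetical imbalanced labels: 20 rare samples and 180 common ones
labels = ['Helipad'] * 20 + ['Building'] * 180
train_idx, valid_idx = stratified_split(labels)
print(len(train_idx), len(valid_idx))  # 180 20
```

Both classes keep the same 10% share in the validation indices, which a plain random split does not guarantee.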


In [14]:
print('Compiling the model...')

model = Sequential([
    Input(shape=(224, 224, 3)), # 224x224 images with 3 channels (RGB)
    Flatten(), # Flatten the 3D tensor to 1D
    Activation('relu'),
    Dense(NUM_CATEGORIES, activation='softmax') # Output layer with softmax activation
])

model.summary()

opt = Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-8, amsgrad=True, clipnorm=1.0)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

Compiling the model...


2025-10-03 18:43:57.252841: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M3
2025-10-03 18:43:57.253095: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 24.00 GB
2025-10-03 18:43:57.253109: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 8.00 GB
2025-10-03 18:43:57.253532: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-10-03 18:43:57.253570: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


The following Keras callbacks monitor the training process:
- **ModelCheckpoint:** Save the best model based on validation accuracy.
- **ReduceLROnPlateau:** Reduce learning rate when validation accuracy stops improving.
- **EarlyStopping:** Stop training when validation accuracy stops improving.
- **TerminateOnNaN:** Stop training if NaN loss is encountered.
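
A rough way to see how the patience values interact with the number of epochs is to replay the plateau logic on a list of validation accuracies. The sketch below is a simplified pure-Python imitation, not the Keras implementation: it ignores `min_delta` and `cooldown`, and the accuracy values are invented for the example. The defaults mirror the values used in this notebook (`factor=0.1`, patience 10 for the learning rate, patience 40 for early stopping).

```python
def simulate_callbacks(val_acc, lr=1e-3, factor=0.1, lr_patience=10, stop_patience=40):
    """Simplified plateau logic: returns (stop_epoch or None, final learning rate)."""
    best = float("-inf")
    lr_wait = stop_wait = 0
    for epoch, acc in enumerate(val_acc, start=1):
        if acc > best:
            # Improvement: remember it and reset both counters
            best = acc
            lr_wait = stop_wait = 0
            continue
        lr_wait += 1
        stop_wait += 1
        if lr_wait >= lr_patience:
            lr *= factor      # reduce the learning rate on a plateau
            lr_wait = 0
        if stop_wait >= stop_patience:
            return epoch, lr  # early stopping would fire here
    return None, lr

# Hypothetical run: accuracy improves once, then stays flat for 19 more epochs
stop_epoch, final_lr = simulate_callbacks([0.30] + [0.29] * 19)
print(stop_epoch)  # None
```

With only 20 epochs, an early-stopping patience of 40 can never be reached, and the learning rate can be reduced at most once.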

In [15]:
# Callbacks
model_checkpoint = ModelCheckpoint('model.keras', monitor='val_accuracy', verbose=1, save_best_only=True)
reduce_lr = ReduceLROnPlateau('val_accuracy', factor=0.1, patience=10, verbose=1)
early_stop = EarlyStopping('val_accuracy', patience=40, verbose=1)
terminate = TerminateOnNaN()
callbacks = [model_checkpoint, reduce_lr, early_stop, terminate]

Turning the annotated `GenericImage` objects into batch generators for training and validation.

In [16]:
# Generate the list of objects from annotations
objs_train = [(ann.filename, obj) for ann in anns_train for obj in ann.objects]
objs_valid = [(ann.filename, obj) for ann in anns_valid for obj in ann.objects]

# Generators
BATCH_SIZE = 16
train_generator = generator_images(objs_train, BATCH_SIZE, do_shuffle=True)
valid_generator = generator_images(objs_valid, BATCH_SIZE, do_shuffle=False)

### 1.2 Training

In [1]:
print('Training model...')

EPOCHS = 20
train_steps = math.ceil(len(objs_train)/BATCH_SIZE)
valid_steps = math.ceil(len(objs_valid)/BATCH_SIZE)

h = model.fit(train_generator, 
              steps_per_epoch=train_steps, 
              validation_data=valid_generator, 
              validation_steps=valid_steps, 
              epochs=EPOCHS, 
              callbacks=callbacks, 
              verbose=1)

# Best validation model
best_idx = int(np.argmax(h.history['val_accuracy']))
best_value = np.max(h.history['val_accuracy'])
print('Best validation model: epoch ' + str(best_idx+1), ' - val_accuracy ' + str(best_value))

Training model...


NameError: name 'math' is not defined

- Training time: ~ 108 minutes
- GPU: NVIDIA GeForce RTX 3050 Ti Laptop GPU
- Best validation model: Epoch 15 - Validation accuracy: 0.3941
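
Because the generators loop forever, Keras needs explicit step counts per epoch. The arithmetic from the training cell can be checked in isolation with the dataset sizes printed in section 1.1. Note that `math` must be imported in the running kernel; the `NameError` shown above is what the cell raises otherwise.

```python
import math

BATCH_SIZE = 16
n_train, n_valid = 16871, 1875  # sizes printed by the split in section 1.1

# ceil keeps the last, smaller batch instead of dropping it
train_steps = math.ceil(n_train / BATCH_SIZE)
valid_steps = math.ceil(n_valid / BATCH_SIZE)
print(train_steps, valid_steps)  # 1055 118

# Size of the final partial batch in each set
print(n_train % BATCH_SIZE, n_valid % BATCH_SIZE)  # 7 3
```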

### 1.3 Validation
Compute validation metrics.

In [None]:
def draw_confusion_matrix(cm, categories):
    # Draw confusion matrix
    fig = plt.figure(figsize=[6.4*pow(len(categories), 0.5), 4.8*pow(len(categories), 0.5)])
    ax = fig.add_subplot(111)
    cm = cm.astype('float') / np.maximum(cm.sum(axis=1)[:, np.newaxis], np.finfo(np.float64).eps)
    im = ax.imshow(cm, interpolation='nearest', cmap=plt.colormaps['Blues'])
    ax.figure.colorbar(im, ax=ax)
    ax.set(xticks=np.arange(cm.shape[1]), yticks=np.arange(cm.shape[0]), xticklabels=list(categories.values()), yticklabels=list(categories.values()), ylabel='Annotation', xlabel='Prediction')
    # Rotate the tick labels and set their alignment
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
    # Loop over data dimensions and create text annotations
    thresh = cm.max() / 2.0
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], '.2f'), ha="center", va="center", color="white" if cm[i, j] > thresh else "black", fontsize=int(20-pow(len(categories), 0.5)))
    fig.tight_layout()
    plt.show()

In [None]:
model.load_weights('model.keras')
y_true, y_pred = [], []
for ann in anns_valid:
    # Load image
    image = load_geoimage(ann.filename)
    for obj_pred in ann.objects:
        # Generate prediction
        warped_image = np.expand_dims(image, 0)
        predictions = model.predict(warped_image, verbose=0)
        # Save prediction
        pred_category = list(categories.values())[np.argmax(predictions)]
        pred_score = np.max(predictions)
        y_true.append(obj_pred.category)
        y_pred.append(pred_category)

In [None]:
# Compute the confusion matrix
cm = confusion_matrix(y_true, y_pred, labels=list(categories.values()))
draw_confusion_matrix(cm, categories)
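
`draw_confusion_matrix` divides every row by its sum, so each cell shows the fraction of a true class that was assigned to each prediction. A small sketch with an invented 3x3 matrix, including a class with no samples to show why the division is guarded:

```python
import numpy as np

# Invented counts: rows are annotations, columns are predictions
cm = np.array([[8, 2, 0],
               [1, 4, 5],
               [0, 0, 0]])  # the last class has no samples

row_sums = cm.sum(axis=1, keepdims=True)
# Guard against division by zero for empty rows, as in the notebook
cm_norm = cm / np.maximum(row_sums, np.finfo(np.float64).eps)
print(cm_norm[0])  # [0.8 0.2 0. ]
```

The diagonal of the normalised matrix is the per-class recall, and an empty row stays all zeros instead of producing NaN values.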

In [None]:
# Compute the accuracy
correct_samples_class = np.diag(cm).astype(float)
total_samples_class = np.sum(cm, axis=1).astype(float)
total_predicts_class = np.sum(cm, axis=0).astype(float)
print('Mean Accuracy: %.3f%%' % (np.sum(correct_samples_class) / np.sum(total_samples_class) * 100))
acc = correct_samples_class / np.maximum(total_samples_class, np.finfo(np.float64).eps)
print('Mean Recall: %.3f%%' % (acc.mean() * 100))
acc = correct_samples_class / np.maximum(total_predicts_class, np.finfo(np.float64).eps)
print('Mean Precision: %.3f%%' % (acc.mean() * 100))
for idx in range(len(categories)):
    # True/False Positives (TP/FP) refer to the number of predicted positives that were correct/incorrect.
    # True/False Negatives (TN/FN) refer to the number of predicted negatives that were correct/incorrect.
    tp = cm[idx, idx]
    fp = sum(cm[:, idx]) - tp
    fn = sum(cm[idx, :]) - tp
    tn = sum(np.delete(sum(cm) - cm[idx, :], idx))
    # True Positive Rate: proportion of real positive cases that were correctly predicted as positive.
    recall = tp / np.maximum(tp+fn, np.finfo(np.float64).eps)
    # Precision: proportion of predicted positive cases that were truly real positives.
    precision = tp / np.maximum(tp+fp, np.finfo(np.float64).eps)
    # True Negative Rate: proportion of real negative cases that were correctly predicted as negative.
    specificity = tn / np.maximum(tn+fp, np.finfo(np.float64).eps)
    # Dice coefficient refers to two times the intersection of two sets divided by the sum of their areas.
    # Dice = 2 |A∩B| / (|A|+|B|) = 2 TP / (2 TP + FP + FN)
    f1_score = 2 * ((precision * recall) / np.maximum(precision+recall, np.finfo(np.float64).eps))
    print('> %s: Recall: %.3f%% Precision: %.3f%% Specificity: %.3f%% Dice: %.3f%%' % (list(categories.values())[idx], recall*100, precision*100, specificity*100, f1_score*100))
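
As a sanity check of the formulas above, the same quantities can be computed by hand on a small matrix. The 3x3 matrix below is invented and `per_class_metrics` is a hypothetical helper that computes TN as "everything except TP, FP and FN", which gives the same value as the `np.delete` expression in the cell.

```python
import numpy as np

def per_class_metrics(cm, idx):
    """Recall, precision, specificity and Dice for class idx of a confusion matrix."""
    eps = np.finfo(np.float64).eps
    tp = cm[idx, idx]
    fp = cm[:, idx].sum() - tp     # predicted as idx, but another class
    fn = cm[idx, :].sum() - tp     # class idx, but predicted as something else
    tn = cm.sum() - tp - fp - fn   # everything that does not involve idx
    recall = tp / max(tp + fn, eps)
    precision = tp / max(tp + fp, eps)
    specificity = tn / max(tn + fp, eps)
    dice = 2 * tp / max(2 * tp + fp + fn, eps)
    return recall, precision, specificity, dice

cm = np.array([[5, 1, 0],
               [2, 6, 2],
               [0, 1, 3]])
recall, precision, specificity, dice = per_class_metrics(cm, 1)
print(recall, precision, specificity)  # 0.6 0.75 0.8
```

For class 1 this gives TP=6, FP=2, FN=4 and TN=8, so the Dice value is 12/18, identical to the F1 score computed from precision and recall.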

### 1.4 Testing
Generate predictions for the test set, trying to improve on the results provided in the competition.

In [None]:
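# Root folder that contains 'xview_test'; filenames are stored relative to it
TEST_ROOT = '../PROJECT/xview_recognition'
anns = []
for (dirpath, dirnames, filenames) in os.walk(os.path.join(TEST_ROOT, 'xview_test')):
    for filename in filenames:
        # Path relative to TEST_ROOT, avoiding a hard-coded character offset
        image = GenericImage(os.path.relpath(dirpath, TEST_ROOT) + '/' + filename)
        image.tile = np.array([0, 0, 224, 224])
        obj = GenericObject()
        obj.bb = (0, 0, 224, 224)
        # The containing folder name is used as the category label
        obj.category = dirpath[dirpath.rfind('/')+1:]
        image.add_object(obj)
        anns.append(image)
print('Number of testing images: ' + str(len(anns)))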

In [None]:
model.load_weights('model.keras')
predictions_data = {"images": {}, "annotations": {}}
for idx, ann in enumerate(anns):
    image_data = {"image_id": ann.filename.split('/')[-1], "filename": ann.filename, "width": int(ann.tile[2]), "height": int(ann.tile[3])}
    predictions_data["images"][idx] = image_data
    # Load image
    image = load_geoimage(ann.filename)
    for obj_pred in ann.objects:
        # Generate prediction
        warped_image = np.expand_dims(image, 0)
        predictions = model.predict(warped_image, verbose=0)
        # Save prediction
        pred_category = list(categories.values())[np.argmax(predictions)]
        pred_score = np.max(predictions)
        annotation_data = {"image_id": ann.filename.split('/')[-1], "category_id": pred_category, "bbox": [int(x) for x in obj_pred.bb]}
        predictions_data["annotations"][idx] = annotation_data

In [None]:
with open("prediction.json", "w") as outfile:
    json.dump(predictions_data, outfile)
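
One detail of the prediction file is worth knowing when reading it back: the dictionaries are indexed with integer keys, and JSON object keys are always strings. A minimal sketch with a single invented entry (the file name `example.tif` is hypothetical):

```python
import json

predictions_data = {"images": {}, "annotations": {}}
# Hypothetical entry, following the structure built in the cell above
predictions_data["images"][0] = {"image_id": "example.tif",
                                 "filename": "xview_test/example.tif",
                                 "width": 224, "height": 224}
predictions_data["annotations"][0] = {"image_id": "example.tif",
                                      "category_id": "Building",
                                      "bbox": [0, 0, 224, 224]}

# Round trip through JSON, as writing and re-reading prediction.json would do
restored = json.loads(json.dumps(predictions_data))
print(list(restored["images"].keys()))  # ['0']
```

Any code that reloads `prediction.json` therefore has to look up entries with `'0'`, `'1'`, and so on, not with integers.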

## 2 Model 1: Scaled Inputs and a Tuned ffNN

This model tries to improve the accuracy of the original model by scaling the data and tuning the ffNN.

In [None]:
import os
import uuid
import math
import json
import warnings
import numpy as np
import rasterio
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Flatten, Dense, BatchNormalization, Dropout, LeakyReLU
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.initializers import HeNormal
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping, TerminateOnNaN

# -----------------------------
# SETUP
# -----------------------------
IMAGES_PATH = '/Users/anastasia/Desktop/MUIA/CompVis/Project/Data/xview_recognition'
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

# -----------------------------
# Categories
# -----------------------------
categories = {
    0: 'Cargo plane', 1: 'Small car', 2: 'Bus', 3: 'Truck',
    4: 'Motorboat', 5: 'Fishing vessel', 6: 'Dump truck', 7: 'Excavator',
    8: 'Building', 9: 'Helipad', 10: 'Storage tank', 11: 'Shipping container',
    12: 'Pylon'
}
NUM_CATEGORIES = len(categories)
category_to_index = {v: k for k, v in categories.items()}
index_to_category = {k: v for k, v in categories.items()}

# -----------------------------
# Data classes
# -----------------------------
class GenericObject:
    def __init__(self):
        self.id = uuid.uuid4()
        self.bb = (-1, -1, -1, -1)
        self.category = -1
        self.score = -1

class GenericImage:
    def __init__(self, filename):
        self.filename = filename
        self.tile = np.array([-1, -1, -1, -1])
        self.objects = []

    def add_object(self, obj: GenericObject):
        self.objects.append(obj)

# -----------------------------
# Load images
# -----------------------------
def load_geoimage(filename):
    warnings.filterwarnings('ignore', category=rasterio.errors.NotGeoreferencedWarning)
    full_path = os.path.join(IMAGES_PATH, filename)
    if not os.path.exists(full_path):
        raise FileNotFoundError(f"Image not found: {full_path}")

    # The context manager closes the raster file once the bands are read
    with rasterio.open(full_path, 'r') as src_raster:
        img = np.zeros((src_raster.height, src_raster.width, src_raster.count), dtype=np.float32)
        for band in range(src_raster.count):
            img[:, :, band] = src_raster.read(band + 1)
    return img

# -----------------------------
# Load annotations
# -----------------------------
json_file = os.path.join(IMAGES_PATH, 'xview_ann_train.json')
with open(json_file) as ifs:
    json_data = json.load(ifs)

anns_dataset = []
for json_img, json_ann in zip(json_data['images'].values(), json_data['annotations'].values()):
    image = GenericImage(json_img['filename'])
    obj = GenericObject()
    obj.bb = tuple(map(int, json_ann['bbox']))
    obj.category = json_ann['category_id']  # can be string or int
    image.add_object(obj)
    anns_dataset.append(image)

# -----------------------------
# Split dataset
# -----------------------------
anns_train, anns_valid = train_test_split(
    anns_dataset, test_size=0.1, random_state=RANDOM_SEED, shuffle=True
)
print('Training images:', len(anns_train), 'Validation images:', len(anns_valid))

# Flatten annotations
objs_train = [(ann.filename, obj) for ann in anns_train for obj in ann.objects]
objs_valid = [(ann.filename, obj) for ann in anns_valid for obj in ann.objects]

# -----------------------------
# Compute class weights (fixed for string categories)
# -----------------------------
def compute_class_weights(objs):
    counts = np.zeros(NUM_CATEGORIES, dtype=np.int64)
    for _, obj in objs:
        if isinstance(obj.category, str):
            cat_idx = category_to_index[obj.category]
        else:
            cat_idx = int(obj.category)
        counts[cat_idx] += 1
    counts = np.maximum(counts, 1)  # avoid division by zero
    class_weights = {i: float(np.sum(counts)) / (len(counts) * counts[i]) for i in range(len(counts))}
    return class_weights, counts

class_weights, class_counts = compute_class_weights(objs_train)
print("Class counts:", class_counts)
print("Class weights:", class_weights)

# -----------------------------
# Generator (TF-friendly, memory safe)
# -----------------------------
DOWNSAMPLE_SIZE = (64, 64)
BATCH_SIZE = 4

def generator_images(objs, batch_size, do_shuffle=False, target_size=DOWNSAMPLE_SIZE, yield_sample_weight=True):
    while True:
        if do_shuffle:
            np.random.shuffle(objs)
        for i in range(0, len(objs), batch_size):
            group = objs[i:i+batch_size]
            images, labels, sample_weights = [], [], []
            for filename, obj in group:
                img = load_geoimage(filename).astype(np.float32)

                # Ensure 3 channels
                if img.ndim == 2:
                    img = np.stack([img]*3, axis=-1)
                elif img.shape[-1] > 3:
                    img = img[..., :3]
                elif img.shape[-1] < 3:
                    pad_ch = 3 - img.shape[-1]
                    pad_shape = list(img.shape)
                    pad_shape[-1] = pad_ch
                    img = np.concatenate([img, np.zeros(pad_shape, dtype=img.dtype)], axis=-1)

                # TF resize and normalize
                img_tf = tf.convert_to_tensor(img, dtype=tf.float32)
                img_tf = tf.image.resize(img_tf, target_size) / 255.0
                images.append(img_tf)

                # One-hot label
                prob = np.zeros(NUM_CATEGORIES, dtype=np.float32)
                cat_idx = category_to_index[obj.category] if isinstance(obj.category, str) else int(obj.category)
                prob[cat_idx] = 1.0
                labels.append(prob)

                # Sample weight
                if yield_sample_weight:
                    sample_weights.append(float(class_weights[cat_idx]))

            images = tf.stack(images)
            labels = tf.convert_to_tensor(labels, dtype=tf.float32)
            if yield_sample_weight:
                sample_weights = tf.convert_to_tensor(sample_weights, dtype=tf.float32)
                yield images, labels, sample_weights
            else:
                yield images, labels

train_generator = generator_images(objs_train, BATCH_SIZE, do_shuffle=True)
valid_generator = generator_images(objs_valid, BATCH_SIZE, do_shuffle=False)

# -----------------------------
# Metal GPU memory growth
# -----------------------------
gpus = tf.config.list_physical_devices("GPU")
for gpu in gpus:
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
    except Exception:
        pass

# -----------------------------
# FFNN model
# -----------------------------
model = Sequential([
    Input(shape=(DOWNSAMPLE_SIZE[0], DOWNSAMPLE_SIZE[1], 3)),
    Flatten(),

    Dense(256, kernel_initializer=HeNormal()),
    BatchNormalization(),
    LeakyReLU(negative_slope=0.1),
    Dropout(0.4),

    Dense(128, kernel_initializer=HeNormal()),
    BatchNormalization(),
    LeakyReLU(negative_slope=0.1),
    Dropout(0.3),

    Dense(NUM_CATEGORIES, activation='softmax')
])

model.summary()

# -----------------------------
# Compile
# -----------------------------
opt = Adam(learning_rate=1e-3, beta_1=0.9, beta_2=0.999, epsilon=1e-8, amsgrad=True, clipnorm=1.0)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

# -----------------------------
# Callbacks
# -----------------------------
model_checkpoint = ModelCheckpoint('model.keras', monitor='val_accuracy', save_best_only=True, verbose=1)
reduce_lr = ReduceLROnPlateau('val_accuracy', factor=0.1, patience=10, verbose=1)
early_stop = EarlyStopping('val_accuracy', patience=40, verbose=1)
terminate = TerminateOnNaN()
callbacks = [model_checkpoint, reduce_lr, early_stop, terminate]

# -----------------------------
# Training
# -----------------------------
EPOCHS = 20
train_steps = math.ceil(len(objs_train)/BATCH_SIZE)
valid_steps = math.ceil(len(objs_valid)/BATCH_SIZE)

history = model.fit(
    train_generator,
    steps_per_epoch=train_steps,
    validation_data=valid_generator,
    validation_steps=valid_steps,
    epochs=EPOCHS,
    callbacks=callbacks,
    verbose=1
)

best_idx = int(np.argmax(history.history.get('val_accuracy', [0])))
best_value = np.max(history.history.get('val_accuracy', [0]))
print(f'Best validation model: epoch {best_idx+1} - val_accuracy {best_value:.4f}')

Training images: 16871 Validation images: 1875
Class counts: [ 592 2973 1557 1981  963  632 1118  726 3248  103 1320 1386  272]
Class weights: {0: 2.1921777546777546, 1: 0.4365184092732024, 2: 0.8335062496912208, 3: 0.6551081427406515, 4: 1.3476315999680486, 5: 2.0534323271665045, 6: 1.1607953763588825, 7: 1.7875609239245602, 8: 0.399559492231906, 9: 12.599701269604182, 10: 0.9831585081585081, 11: 0.9363414363414363, 12: 4.771210407239819}


2025-10-03 19:44:40.854882: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M3
2025-10-03 19:44:40.854902: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 24.00 GB
2025-10-03 19:44:40.854905: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 8.00 GB
2025-10-03 19:44:40.854919: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-10-03 19:44:40.854927: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


Epoch 1/20


2025-10-03 19:44:41.319490: I tensorflow/core/grappler/optimizers/custom_graph_optimizer_registry.cc:117] Plugin optimizer for device_type GPU is enabled.


Here are the key changes we tried:

1. Downsampling to 64×64 (`DOWNSAMPLE_SIZE = (64, 64)`): this keeps coarse spatial information while reducing the flattened input from 150,528 to 12,288 features, which cuts the parameter explosion that caused our crashes. Setting `DOWNSAMPLE_SIZE = (224, 224)` keeps the original resolution, at the cost of much larger memory use and possible out-of-memory (OOM) errors.

2. He initialisation + BatchNorm + LeakyReLU: intended to stabilise training and improve gradient flow through the dense layers.

3. Dropout: reduces overfitting (the classes are imbalanced and the dataset is limited).

4. `sample_weight`: per-sample weights computed from the inverse class frequency give rarer classes more importance during training. This is a simple mitigation for the class imbalance.

5. Memory growth (specific to the M3 GPU): helps the TF Metal backend avoid grabbing all GPU memory immediately, which reduces crashes.
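
The class weights printed by the cell above follow the formula `total / (num_classes * count)`. The sketch below recomputes them from the printed class counts, which makes the effect on the rarest class ('Helipad', index 9) easy to see.

```python
import numpy as np

# Training-set class counts as printed by the cell above
class_counts = np.array([592, 2973, 1557, 1981, 963, 632, 1118,
                         726, 3248, 103, 1320, 1386, 272])

# Inverse-frequency weights: total / (num_classes * count)
weights = class_counts.sum() / (len(class_counts) * class_counts)

print(round(float(weights[9]), 4))  # 12.5997
print(round(float(weights[8]), 4))  # 0.3996
```

With this formula the weighted number of samples equals the unweighted total (16,871), so the overall loss scale stays comparable while each class contributes equally.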

The run above crashed, so the cell below tries an even smaller and safer configuration:

In [None]:
import os
import uuid
import json
import warnings
import numpy as np
import rasterio
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Flatten, Dense, BatchNormalization, Dropout, LeakyReLU
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.initializers import HeNormal
from tensorflow.keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping, TerminateOnNaN

# -----------------------------
# SETUP
# -----------------------------
IMAGES_PATH = '/Users/anastasia/Desktop/MUIA/CompVis/Project/Data/xview_recognition'
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)

# -----------------------------
# Categories
# -----------------------------
categories = {
    0: 'Cargo plane', 1: 'Small car', 2: 'Bus', 3: 'Truck',
    4: 'Motorboat', 5: 'Fishing vessel', 6: 'Dump truck', 7: 'Excavator',
    8: 'Building', 9: 'Helipad', 10: 'Storage tank', 11: 'Shipping container',
    12: 'Pylon'
}
NUM_CATEGORIES = len(categories)
category_to_index = {v: k for k, v in categories.items()}

# -----------------------------
# Data classes
# -----------------------------
class GenericObject:
    def __init__(self):
        self.id = uuid.uuid4()
        self.bb = (-1, -1, -1, -1)
        self.category = -1
        self.score = -1

class GenericImage:
    def __init__(self, filename):
        self.filename = filename
        self.tile = np.array([-1, -1, -1, -1])
        self.objects = []

    def add_object(self, obj: GenericObject):
        self.objects.append(obj)

# -----------------------------
# Load image
# -----------------------------
def load_geoimage(filename):
    warnings.filterwarnings('ignore', category=rasterio.errors.NotGeoreferencedWarning)
    full_path = os.path.join(IMAGES_PATH, filename)
    if not os.path.exists(full_path):
        raise FileNotFoundError(f"Image not found: {full_path}")

    # The context manager closes the raster file once the bands are read
    with rasterio.open(full_path, 'r') as src_raster:
        img = np.zeros((src_raster.height, src_raster.width, src_raster.count), dtype=np.float32)
        for band in range(src_raster.count):
            img[:, :, band] = src_raster.read(band + 1)
    return img

# -----------------------------
# Load annotations
# -----------------------------
json_file = os.path.join(IMAGES_PATH, 'xview_ann_train.json')
with open(json_file) as ifs:
    json_data = json.load(ifs)

anns_dataset = []
for json_img, json_ann in zip(json_data['images'].values(), json_data['annotations'].values()):
    image = GenericImage(json_img['filename'])
    obj = GenericObject()
    obj.bb = tuple(map(int, json_ann['bbox']))
    obj.category = json_ann['category_id']
    image.add_object(obj)
    anns_dataset.append(image)

# -----------------------------
# Split dataset
# -----------------------------
anns_train, anns_valid = train_test_split(
    anns_dataset, test_size=0.1, random_state=RANDOM_SEED, shuffle=True
)
objs_train = [(ann.filename, obj) for ann in anns_train for obj in ann.objects]
objs_valid = [(ann.filename, obj) for ann in anns_valid for obj in ann.objects]

# -----------------------------
# Compute class weights
# -----------------------------
def compute_class_weights(objs):
    counts = np.zeros(NUM_CATEGORIES, dtype=np.int64)
    for _, obj in objs:
        if isinstance(obj.category, str):
            cat_idx = category_to_index[obj.category]
        else:
            cat_idx = int(obj.category)
        counts[cat_idx] += 1
    counts = np.maximum(counts, 1)
    class_weights = {i: float(np.sum(counts)) / (len(counts) * counts[i]) for i in range(len(counts))}
    return class_weights, counts

class_weights, class_counts = compute_class_weights(objs_train)
print("Class counts:", class_counts)
print("Class weights:", class_weights)

# -----------------------------
# Generator
# -----------------------------
DOWNSAMPLE_SIZE = (16, 16)  # very small to save memory
BATCH_SIZE = 2  # tiny batch size to avoid memory spike

def generator_images(objs, batch_size=BATCH_SIZE, target_size=DOWNSAMPLE_SIZE):
    while True:
        np.random.shuffle(objs)
        for i in range(0, len(objs), batch_size):
            group = objs[i:i+batch_size]
            images, labels, sample_weights = [], [], []
            for filename, obj in group:
                img = load_geoimage(filename).astype(np.float32)
                # Resize
                img_tf = tf.image.resize(img, target_size)
                # Ensure 3 channels
                if img_tf.shape[-1] < 3:
                    pad_ch = 3 - img_tf.shape[-1]
                    img_tf = tf.concat([img_tf, tf.zeros((*img_tf.shape[:-1], pad_ch), dtype=tf.float32)], axis=-1)
                elif img_tf.shape[-1] > 3:
                    img_tf = img_tf[..., :3]
                img_tf = img_tf / 255.0
                images.append(img_tf)

                # Resolve the class index once (categories may be names or integer ids)
                cat_idx = category_to_index[obj.category] if isinstance(obj.category, str) else int(obj.category)

                # One-hot label
                prob = np.zeros(NUM_CATEGORIES, dtype=np.float32)
                prob[cat_idx] = 1.0
                labels.append(prob)

                # Per-sample weight taken from the class weights
                sample_weights.append(float(class_weights[cat_idx]))

            yield tf.stack(images), tf.convert_to_tensor(labels, dtype=tf.float32), tf.convert_to_tensor(sample_weights, dtype=tf.float32)

train_generator = generator_images(objs_train)
valid_generator = generator_images(objs_valid)

# -----------------------------
# FFNN model (feedforward, Dense layers with dropout)
# -----------------------------
model = Sequential([
    Input(shape=(DOWNSAMPLE_SIZE[0], DOWNSAMPLE_SIZE[1], 3)),
    Flatten(),
    Dense(32, kernel_initializer=HeNormal()),
    LeakyReLU(0.1),
    Dropout(0.2),
    Dense(16, kernel_initializer=HeNormal()),
    LeakyReLU(0.1),
    Dropout(0.1),
    Dense(NUM_CATEGORIES, activation='softmax')
])

model.summary()

# -----------------------------
# Compile
# -----------------------------
opt = Adam(learning_rate=1e-3)
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

# -----------------------------
# Callbacks
# -----------------------------
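callbacks = [
    # Keep only the weights of the epoch with the best validation accuracy
    ModelCheckpoint('ffnn_model.keras', monitor='val_accuracy', save_best_only=True, verbose=1),
    # Halve the learning rate after 5 epochs without improvement
    ReduceLROnPlateau(monitor='val_accuracy', factor=0.5, patience=5, verbose=1),
    # Note: with EPOCHS = 10, patience=10 means early stopping cannot trigger in this run
    EarlyStopping(monitor='val_accuracy', patience=10, verbose=1),
    TerminateOnNaN()
]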

# -----------------------------
# Training
# -----------------------------
EPOCHS = 10
train_steps = max(1, len(objs_train)//BATCH_SIZE)
valid_steps = max(1, len(objs_valid)//BATCH_SIZE)

history = model.fit(
    train_generator,
    steps_per_epoch=train_steps,
    validation_data=valid_generator,
    validation_steps=valid_steps,
    epochs=EPOCHS,
    callbacks=callbacks,
    verbose=1
)

# -----------------------------
# Best validation
# -----------------------------
best_idx = int(np.argmax(history.history.get('val_accuracy', [0])))
best_value = np.max(history.history.get('val_accuracy', [0]))
print(f'Best validation model: epoch {best_idx+1} - val_accuracy {best_value:.4f}')


Class counts: [ 592 2973 1557 1981  963  632 1118  726 3248  103 1320 1386  272]
Class weights: {0: 2.1921777546777546, 1: 0.4365184092732024, 2: 0.8335062496912208, 3: 0.6551081427406515, 4: 1.3476315999680486, 5: 2.0534323271665045, 6: 1.1607953763588825, 7: 1.7875609239245602, 8: 0.399559492231906, 9: 12.599701269604182, 10: 0.9831585081585081, 11: 0.9363414363414363, 12: 4.771210407239819}
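
The printed weights follow the balanced heuristic used in `compute_class_weights`: each weight is the total number of training objects divided by the number of classes times the count of that class. As a quick cross-check, the sketch below recomputes the weights from the class counts printed above. It assumes only NumPy and hard-codes the counts from this output, so it is a verification aid rather than part of the training pipeline.

```python
import numpy as np

# Class counts copied from the training output above
counts = np.array([592, 2973, 1557, 1981, 963, 632, 1118,
                   726, 3248, 103, 1320, 1386, 272])

# Balanced weighting: total samples / (number of classes * samples in class)
weights = counts.sum() / (len(counts) * counts)

print("Total training objects:", counts.sum())
print("Weight of rarest class (index 9):", round(float(weights[9]), 4))
print("Weight of most frequent class (index 8):", round(float(weights[8]), 4))

# With this scheme, the weighted sample count equals the raw sample count,
# so the overall scale of the loss is preserved.
print("Weighted total:", float((weights * counts).sum()))
```

The rarest class (103 objects) gets a weight about 30 times larger than the most frequent one (3248 objects), which is why misclassifying a rare-class example contributes much more to the loss.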


2025-10-03 19:49:35.501921: I metal_plugin/src/device/metal_device.cc:1154] Metal device set to: Apple M3
2025-10-03 19:49:35.501941: I metal_plugin/src/device/metal_device.cc:296] systemMemory: 24.00 GB
2025-10-03 19:49:35.501945: I metal_plugin/src/device/metal_device.cc:313] maxCacheSize: 8.00 GB
2025-10-03 19:49:35.501959: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:305] Could not identify NUMA node of platform GPU ID 0, defaulting to 0. Your kernel may not have been built with NUMA support.
2025-10-03 19:49:35.501966: I tensorflow/core/common_runtime/pluggable_device/pluggable_device_factory.cc:271] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 0 MB memory) -> physical PluggableDevice (device: 0, name: METAL, pci bus id: <undefined>)


Epoch 1/10
