# Final Training & Evaluation
* In this notebook we train our final model over multiple iterations.
* We reuse what we learned from the experiments in the previous notebooks.
* The general steps to train the final model are:
    * Step 1: Create a custom model class and an optimized loss function to train the model. This loss function calculates the loss only once, instead of 4 times as in the previous version.
    * Step 2: Train a model to detect 2 objects on the canvas.
    * Step 3: Select the best model and train it to detect 3 objects on the canvas, then 4 objects, and so on.


In [2]:
import pandas as pd
import numpy as np
from pathlib import Path
import tensorflow as tf
import matplotlib.pyplot as plt
from keras.datasets import mnist
import matplotlib.patches as patches



## validate tensorflow 
print("Num GPUs Available: ", len(tf.config.list_physical_devices('GPU')))

2025-11-05 09:30:47.781107: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:467] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1762363847.798377  480501 cuda_dnn.cc:8579] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1762363847.803878  480501 cuda_blas.cc:1407] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
W0000 00:00:1762363847.823376  480501 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1762363847.823402  480501 computation_placer.cc:177] computation placer already registered. Please check linkage and avoid linking the same target more than once.
W0000 00:00:1762363847.823403  480501 computation_placer.cc:177] computation placer alr

Num GPUs Available:  1


## Constants

In [3]:
data_dir = Path("..","data")
models_dir = Path("..","models")

In [4]:
import os
import sys
# Build an absolute path from this notebook's parent directory
module_path = os.path.abspath(os.path.join('..'))

# Add to sys.path if not already present
if module_path not in sys.path:
    sys.path.append(module_path)
    
from src import graph_compatible_data_generator,object_detection_model,training_utils

## logic to auto reload scripts without restarting the kernel
%load_ext autoreload
%autoreload 2

I0000 00:00:1762363851.767740  480501 gpu_device.cc:2019] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5899 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 SUPER, pci bus id: 0000:2e:00.0, compute capability: 7.5


## Initialize Pipeline

In [5]:
(x_train, y_train), (x_test, y_test) = mnist.load_data()

X_tensor = tf.convert_to_tensor(x_train, dtype=tf.float32)
# X_tensor = tf.reshape(X_tensor, shape=(-1, 28, 28, 1))
y_tensor = tf.convert_to_tensor(y_train, dtype=tf.float32)

batch_size = 32
raw_dataset = tf.data.Dataset.from_tensor_slices((X_tensor, y_tensor))

# create a generator for 2 digits
data_gen_2_digits = graph_compatible_data_generator.create_data_generator(2)

processed_dataset_2 = raw_dataset.map(
    data_gen_2_digits).batch(batch_size=batch_size).prefetch(tf.data.AUTOTUNE)

--- Loading and caching MNIST data... ---
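
The number of steps Keras reports per epoch follows from the dataset size and the batch size. A minimal sketch of that arithmetic, assuming the full MNIST training split of 60,000 images is used and that `.batch()` keeps a final partial batch (its default behaviour); `num_train_images` and `steps_per_epoch` are names introduced here for illustration:

```python
import math

num_train_images = 60_000   # size of the MNIST training split
batch_size = 32             # same value as in the cell above

# Dataset.map is one-to-one, so the generator yields one canvas per source image.
steps_per_epoch = math.ceil(num_train_images / batch_size)
print(steps_per_epoch)
```

This should agree with the `1875/1875` step counter in the training logs further down.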


## Initialize Model

In [6]:
inputs = tf.keras.Input(shape=(100,100,1),batch_size=batch_size ,name="input_layer")

x = tf.keras.layers.Rescaling(scale=1./255, name="rescaling")(inputs)

x = tf.keras.layers.Conv2D(filters=8, kernel_size=5, padding='same', activation='relu')(x)
x = tf.keras.layers.Conv2D(filters=8, kernel_size=5, padding='same', activation='relu')(x)
x = tf.keras.layers.MaxPooling2D()(x)

x = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding='same', activation='relu')(x)
x = tf.keras.layers.Conv2D(filters=8, kernel_size=3, padding='same', activation='relu')(x)
x = tf.keras.layers.MaxPooling2D()(x)

x = tf.keras.layers.Conv2D(filters=16, kernel_size=3, padding='same', activation='relu')(x)
x = tf.keras.layers.Conv2D(filters=16, kernel_size=3, padding='same', activation='relu')(x)
x = tf.keras.layers.MaxPooling2D()(x)

x = tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding='same', activation='relu')(x)
x = tf.keras.layers.Conv2D(filters=32, kernel_size=3, padding='same', activation='relu')(x)
x = tf.keras.layers.MaxPooling2D()(x)

outputs = tf.keras.layers.Conv2D(filters=45, kernel_size=1, padding='same', activation='linear')(x)

# Define the final model by specifying its inputs and outputs
model = tf.keras.Model(inputs=inputs, outputs=outputs)

model.summary()
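
The output shape of this network can be worked out by hand, which is a useful cross-check against `model.summary()`. The sketch below assumes the Keras defaults for `MaxPooling2D` (a 2×2 window with stride 2 and `'valid'` padding), so each pool halves the spatial size and rounds down; `conv_params` and `blocks` are names introduced here for illustration.

```python
def conv_params(kernel, in_ch, out_ch):
    """Weights plus biases of one Conv2D layer."""
    return kernel * kernel * in_ch * out_ch + out_ch

# (kernel_size, filters) for each Conv2D, grouped by block; each block ends in a pool.
blocks = [
    [(5, 8), (5, 8)],
    [(3, 8), (3, 8)],
    [(3, 16), (3, 16)],
    [(3, 32), (3, 32)],
]

size, channels, total_params = 100, 1, 0
for block in blocks:
    for kernel, filters in block:
        total_params += conv_params(kernel, channels, filters)
        channels = filters      # padding='same' keeps the spatial size unchanged
    size //= 2                  # 2x2 max pool, stride 2, 'valid' padding

# Final 1x1 convolution producing 45 channels per grid cell.
total_params += conv_params(1, channels, 45)
channels = 45

print((size, size, channels), total_params)
```

Under those assumptions the head predicts on a 6×6 grid with 45 values per cell, for 21,845 trainable parameters in total. How the 45 values split into box, objectness, and class terms is defined in `src/object_detection_model` and is not shown in this notebook.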

In [7]:
# Define the callbacks
checkpoint_filepath = '../models/experiment_0_digits_2_{epoch:02d}_{loss:.2f}.keras'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    monitor='loss',
    mode='min',
    save_best_only=True,
    save_freq="epoch",
    verbose=1,
    )
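
The `{epoch:02d}` and `{loss:.2f}` placeholders in `checkpoint_filepath` are filled in with `str.format`-style substitution, and with `save_best_only=True` a file is written only when the monitored loss improves. The sketch below shows how the template expands and how the best checkpoint could be picked from the file names afterwards. `parse_checkpoint` and `best_checkpoint` are hypothetical helpers, not part of `src`, and the file names in `example` are made up for illustration.

```python
import re
from pathlib import Path

template = "../models/experiment_0_digits_2_{epoch:02d}_{loss:.2f}.keras"

# The same substitution the callback performs when it saves a checkpoint.
print(template.format(epoch=2, loss=0.34986))

# Matches the trailing "_<epoch>_<loss>.keras" part of a checkpoint name.
NAME_RE = re.compile(r"_(\d+)_(\d+\.\d+)\.keras$")

def parse_checkpoint(path):
    """Return (epoch, loss) recovered from a checkpoint file name."""
    match = NAME_RE.search(Path(path).name)
    if match is None:
        raise ValueError(f"unexpected checkpoint name: {path}")
    return int(match.group(1)), float(match.group(2))

def best_checkpoint(paths):
    """Lowest loss wins; ties (names keep only two decimals) go to the later epoch."""
    def sort_key(path):
        epoch, loss = parse_checkpoint(path)
        return (loss, -epoch)
    return min(paths, key=sort_key)

# Made-up names of the shape the template produces.
example = [
    "../models/experiment_0_digits_2_01_1.25.keras",
    "../models/experiment_0_digits_2_02_0.50.keras",
]
print(best_checkpoint(example))
```

Because the name stores the loss rounded to two decimals, two epochs can appear tied; the tie-break toward the later epoch is a design choice of this sketch.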

In [8]:
custom_model = object_detection_model.ObjectDetectionModel(model)


custom_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001,clipnorm=1.0))

custom_model.build(input_shape=(None, 100, 100, 1))

In [9]:
## Fit the model
epochs=20

## commenting out to avoid accidental run
# history = custom_model.fit(
#   processed_dataset_2,
#   epochs=epochs,
#   callbacks=[model_checkpoint_callback]
# )


* The best model had a bbox loss of `0.0312`, a class loss of `0.2417`, and a total loss of `0.32`.
* The model performance is decent, but the bounding boxes are still not perfect.
* We need to experiment with post-processing in the evaluate worksheet and see whether training the model for longer gives a better result.

#### Experiment 1

In [10]:
# Define the callbacks
checkpoint_filepath = '../models/experiment_1_digits_2_{epoch:02d}_{loss:.2f}.keras'
model_checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_filepath,
    monitor='loss',
    mode='min',
    save_best_only=True,
    save_freq="epoch",
    verbose=1,
    )

custom_model = object_detection_model.ObjectDetectionModel(model)


custom_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001,clipnorm=1.0))

custom_model.build(input_shape=(None, 100, 100, 1))
## Fit the model
epochs=30

history = custom_model.fit(
  processed_dataset_2,
  epochs=epochs,
  callbacks=[model_checkpoint_callback]
)


Epoch 1/30


I0000 00:00:1762363879.421292  493412 service.cc:152] XLA service 0x5647646f90c0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
I0000 00:00:1762363879.421371  493412 service.cc:160]   StreamExecutor device (0): NVIDIA GeForce RTX 2080 SUPER, Compute Capability 7.5
2025-11-05 09:31:19.564377: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:269] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
I0000 00:00:1762363880.398898  493412 cuda_dnn.cc:529] Loaded cuDNN version 90300
2025-11-05 09:31:21.137712: I external/local_xla/xla/service/gpu/autotuning/conv_algorithm_picker.cc:549] Omitted potentially buggy algorithm eng14{k25=0} for conv %cudnn-conv-bias-activation.29 = (f32[32,8,50,50]{3,2,1,0}, u8[0]{0}) custom-call(f32[32,8,50,50]{3,2,1,0} %bitcast.10382, f32[8,8,3,3]{3,2,1,0} %bitcast.8218, f32[8]{0} %bitcast.10476), window={size=3x3 pad=1_1x1_1}, dim_labels=bf01_oi01->bf01, custom_call_targe

   3/1875 ━━━━━━━━━━━━━━━━━━━━ 1:03 34ms/step - bbox_loss: 0.0741 - class_loss: 2.3029 - loss: 3.6834 - obj_less_loss: 0.6945 - obj_loss: 0.6859

I0000 00:00:1762363885.138078  493412 device_compiler.h:188] Compiled cluster using XLA!  This line is logged at most once for the lifetime of the process.


1875/1875 ━━━━━━━━━━━━━━━━━━━━ 0s 122ms/step - bbox_loss: 0.0646 - class_loss: 1.2435 - loss: 1.4976 - obj_less_loss: 0.1895 - obj_loss: 0.0646
Epoch 1: loss improved from inf to 0.87468, saving model to ../models/experiment_1_digits_2_01_0.87.keras
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 253s 125ms/step - bbox_loss: 0.0646 - class_loss: 1.2432 - loss: 1.4973 - obj_less_loss: 0.1895 - obj_loss: 0.0645
Epoch 2/30
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 0s 123ms/step - bbox_loss: 0.0346 - class_loss: 0.2872 - loss: 0.3847 - obj_less_loss: 0.0678 - obj_loss: 0.0297
Epoch 2: loss improved from 0.87468 to 0.34986, saving model to ../models/experiment_1_digits_2_02_0.35.keras
1875/1875 ━━━━━━━━━━━━━━━━━━━━ 240s 125ms/step - bbox_loss: 0.0346 - class_loss: 0.2872 - loss: 0.3847 - obj_less_loss: 0.0678 - obj_loss: 0.0297
Epoch 3/30
1875/1875 ━━━━━━━━━━━━━━━━━━━━

* The class loss and the overall loss dropped significantly compared with the previous experiment, but the bbox loss did not improve; in fact, it increased slightly.
```text
Experiment 0
bbox loss  : 0.0312
class loss : 0.2417
total loss : 0.32

---

Experiment 1
bbox loss  : 0.0352
class loss : 0.0760
total loss : 0.1396
```
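
To put the comparison above in relative terms, the following sketch computes the percentage change of each loss between the two experiments. The numbers are the ones reported in this notebook; `relative_change` is a helper introduced here for illustration.

```python
# Best-epoch losses as reported in this notebook.
experiment_0 = {"bbox": 0.0312, "class": 0.2417, "total": 0.32}
experiment_1 = {"bbox": 0.0352, "class": 0.0760, "total": 0.1396}

def relative_change(before, after):
    """Signed change in percent; negative means the loss went down."""
    return (after - before) / before * 100.0

changes = {
    name: relative_change(experiment_0[name], experiment_1[name])
    for name in experiment_0
}

for name, pct in changes.items():
    print(f"{name:>5} loss: {pct:+.1f}%")
```

This works out to an increase of roughly 13% in bbox loss, against drops of roughly 69% in class loss and 56% in total loss.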