# CLIP4CAD-GFA Training Notebook

This notebook provides a persistent training environment that keeps the ~204 GB training set in memory between runs, so only the first run pays the loading cost.

## Key Benefits
- **Load data ONCE** (~5 min) and keep it in RAM
- **Fast iteration** - Restart training instantly
- **No memory overflow** - Using `num_workers=0` since data is in RAM

## Usage
1. Run Cells 1-5 once to load data and initialize model (~5 min)
2. Run Cell 6 to start/restart training (instant!)
3. Use Cells 7-9 for resuming, validation, or monitoring

## Cell 1: Imports & Setup

In [1]:
import sys
from pathlib import Path

# Add project root to path
sys.path.insert(0, 'D:/Defect_Det/MMCAD/MMCAD')

import torch
from omegaconf import OmegaConf
from clip4cad.models.clip4cad_gfa import CLIP4CAD_GFA
from clip4cad.data.gfa_dataset import GFAMappedDataset
from clip4cad.training.gfa_trainer import GFATrainer

print("✓ Imports loaded successfully")
print(f"✓ PyTorch version: {torch.__version__}")
print(f"✓ CUDA available: {torch.cuda.is_available()}")

✓ Imports loaded successfully
✓ PyTorch version: 2.5.1+cu121
✓ CUDA available: True


## Cell 2: Configuration

Configure paths, hyperparameters, and output directory.

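**CRITICAL**: Keep `num_workers=0`. The training data already sits in RAM, so worker processes add no I/O benefit, and each worker may end up with its own copy of the dataset (on Windows, workers are spawned and receive a pickled copy), which can exhaust memory.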
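
A rough estimate shows why worker processes are risky with a dataset this large. The sketch below assumes the worst case, in which the main process and every DataLoader worker each hold a full copy of the in-memory data; `estimated_ram_gb` is a hypothetical helper, not part of `clip4cad`, and the 203.6 GB figure is the size reported by the loading log in Cell 3.

```python
def estimated_ram_gb(dataset_gb: float, num_workers: int) -> float:
    """Worst-case RAM if the main process and each worker hold a full copy."""
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")
    return dataset_gb * (1 + num_workers)


dataset_gb = 203.6  # size reported by the Cell 3 loading log
for workers in (0, 2, 4):
    print(f"num_workers={workers}: ~{estimated_ram_gb(dataset_gb, workers):.1f} GB")
```

Real usage may be lower when the operating system shares pages between processes, so treat this as an upper bound rather than a measurement.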

In [None]:
# Paths
data_root = Path("d:/Defect_Det/MMCAD/data")
pc_file = "c:/Users/User/Desktop/pc_embeddings_full.h5"
brep_file = "c:/Users/User/Desktop/brep_features.h5"
text_file = "c:/Users/User/Desktop/text_embeddings.h5"
output_dir = Path("outputs/gfa_111k")
output_dir.mkdir(parents=True, exist_ok=True)

# Load config
config = OmegaConf.load("d:/Defect_Det/MMCAD/MMCAD/configs/model/clip4cad_gfa.yaml")

# Override training settings
config.training.batch_size = 512
config.training.num_workers = 0  # CRITICAL: No workers since data in RAM!
config.training.num_epochs_stage1 = 15
config.training.num_epochs_stage2 = 20

# Save config
OmegaConf.save(config, output_dir / "config.yaml")

print("✓ Configuration loaded")
print(f"  - Batch size: {config.training.batch_size}")
print(f"  - Num workers: {config.training.num_workers}")
print(f"  - Stage 1 epochs: {config.training.num_epochs_stage1}")
print(f"  - Stage 2 epochs: {config.training.num_epochs_stage2}")
print(f"  - Output dir: {output_dir}")

✓ Configuration loaded
  - Batch size: 512
  - Num workers: 0
  - Stage 1 epochs: 15
  - Stage 2 epochs: 20
  - Output dir: outputs\gfa_111k
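
With these overrides the run has 15 + 20 = 35 epochs in total. Here is a minimal sketch of how an epoch number maps to a stage, assuming epochs are 1-indexed and Stage 2 starts right after Stage 1 ends. The `stage_for_epoch` helper is hypothetical and is not the trainer's API.

```python
def stage_for_epoch(epoch: int, stage1_epochs: int = 15, stage2_epochs: int = 20) -> int:
    """Return 1 or 2 for a 1-indexed epoch number (assumed convention)."""
    total_epochs = stage1_epochs + stage2_epochs
    if not 1 <= epoch <= total_epochs:
        raise ValueError(f"epoch must be in 1..{total_epochs}, got {epoch}")
    return 1 if epoch <= stage1_epochs else 2


print([stage_for_epoch(e) for e in (1, 15, 16, 35)])  # [1, 1, 2, 2]
```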


## Cell 3: Load Datasets (RUN ONCE, ~5 min)

**This cell loads 204GB of data into RAM**  
**⚠️ Takes ~5 minutes, but only run ONCE!**

After running this cell:
- Training data stays in memory
- Can restart training instantly by re-running Cell 6
- Don't re-run this cell unless you restart the kernel
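
One way to make an accidental re-run harmless is to load only when the object does not exist yet. This is a sketch, not part of the notebook: `load_once` is a hypothetical helper, and `fake_loader` stands in for the expensive `GFAMappedDataset(...)` call.

```python
from typing import Any, Callable, Dict


def load_once(namespace: Dict[str, Any], name: str, loader: Callable[[], Any]) -> Any:
    """Call `loader` only if `name` is not already present in `namespace`."""
    if name not in namespace:
        namespace[name] = loader()
    return namespace[name]


calls = []


def fake_loader():
    # Stand-in for the slow dataset load; records how often it runs.
    calls.append(1)
    return list(range(5))


namespace: Dict[str, Any] = {}
first = load_once(namespace, "train_dataset", fake_loader)
second = load_once(namespace, "train_dataset", fake_loader)
print(len(calls), first is second)  # 1 True
```

In a notebook, passing `globals()` as the namespace would keep the loaded object alive for as long as the kernel runs.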

In [4]:
print("Loading datasets (this takes ~5 min, but only once!)...")
print("="*60)

# Train dataset - LOAD TO RAM
print("\n[1/2] Loading TRAIN dataset to RAM...")
train_dataset = GFAMappedDataset(
    data_root=str(data_root),
    split="train",
    pc_file=pc_file,
    text_file=text_file,
    brep_file=brep_file,
    num_rotations=1,
    load_to_memory=True,  # Load 111k samples to RAM
    use_live_text=False,
)
print(f"✓ Train dataset loaded: {len(train_dataset):,} samples in RAM")

# Val dataset - KEEP ON DISK (saves RAM)
print("\n[2/2] Loading VAL dataset (on disk)...")
val_dataset = GFAMappedDataset(
    data_root=str(data_root),
    split="val",
    pc_file=pc_file,
    text_file=text_file,
    brep_file=brep_file,
    num_rotations=1,
    load_to_memory=False,  # Val stays on disk (saves RAM)
    use_live_text=False,
)
print(f"✓ Val dataset loaded: {len(val_dataset):,} samples on disk")

print("\n" + "="*60)
print("✓ DATASETS READY!")
print(f"  - Train: {len(train_dataset):,} samples in RAM")
print(f"  - Val: {len(val_dataset):,} samples on disk")
print("\n⚠️  Do NOT re-run this cell unless you restart the kernel!")
print("="*60)

Loading datasets (this takes ~5 min, but only once!)...

[1/2] Loading TRAIN dataset to RAM...
  Loading train data to memory (B-Rep + PC + Text)...
    Loading B-Rep (3GB)...
    Loading PC (50GB)...
    Loading Text from: c:\Users\User\Desktop\text_splits\train_text_embeddings.h5
    ✓ Text loaded: 174.6GB in 361.4s
    ⚠️  Pre-split has 111000 samples, dataset expected 133105
    Using first 111000 samples to match pre-split file
  ✓ Loaded 111000 samples: 203.6GB in RAM (B-Rep + PC + Text)
GFAMappedDataset: train with 111000 samples (in memory)
✓ Train dataset loaded: 111,000 samples in RAM

[2/2] Loading VAL dataset (on disk)...
GFAMappedDataset: val with 16638 samples
✓ Val dataset loaded: 16,638 samples on disk

✓ DATASETS READY!
  - Train: 111,000 samples in RAM
  - Val: 16,638 samples on disk

⚠️  Do NOT re-run this cell unless you restart the kernel!
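
These sample counts tie in with the progress bars in the Cell 6 training log: 111,000 training samples at batch size 512 give 216 full batches (110,592 samples), and 16,638 validation samples give 33 batches. The sketch below reproduces that arithmetic. That the training loader drops the last partial batch while validation keeps it is inferred from those numbers; the notebook does not state it.

```python
import math


def batches_per_epoch(num_samples: int, batch_size: int, drop_last: bool) -> int:
    """Number of batches a loader yields per pass over the data."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


train_batches = batches_per_epoch(111_000, 512, drop_last=True)
val_batches = batches_per_epoch(16_638, 512, drop_last=False)
print(train_batches, train_batches * 512, val_batches)  # 216 110592 33
```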


## Cell 4: Initialize Model

Create the CLIP4CAD-GFA model and move it to GPU.

In [9]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Create model and move it to the selected device
model = CLIP4CAD_GFA(config).to(device)

print("\n✓ Model initialized")
print(f"  - Total parameters: {model.count_parameters():,}")
print(f"  - Trainable parameters: {model.count_parameters(trainable_only=True):,}")

Using device: cuda

✓ Model initialized
  - Total parameters: 3,168,771
  - Trainable parameters: 3,168,771


## Cell 5: Create Trainer

Initialize the GFA trainer with datasets, model, and configuration.

In [38]:
trainer = GFATrainer(
    model=model,
    train_dataset=train_dataset,
    val_dataset=val_dataset,
    config=config.training,
    device=device,
    output_dir=str(output_dir),
)

print("✓ Trainer initialized")
print(f"  - Output directory: {output_dir}")
print(f"  - Device: {device}")
print("\n✓ Ready to train! Run Cell 6 to start training.")

✓ Trainer initialized
  - Output directory: outputs\gfa_111k
  - Device: cuda

✓ Ready to train! Run Cell 6 to start training.


## Cell 5.5: Load from Best Checkpoint (Optional)

In [39]:
# Resume from the best checkpoint saved by an earlier run
checkpoint_path = Path("d:/Defect_Det/MMCAD/MMCAD/notebooks/outputs/gfa_111k/checkpoint_best.pt")
if checkpoint_path.exists():
    print(f"Loading checkpoint: {checkpoint_path}")
    checkpoint = torch.load(checkpoint_path, map_location=device)

    # Load model state
    model.load_state_dict(checkpoint['model_state_dict'])
    trainer.resume(str(checkpoint_path))
    print(f"✓ Will continue from epoch {trainer.current_epoch + 1}")
    
else:
    print(f"No checkpoint found at {checkpoint_path}")

Loading checkpoint: d:\Defect_Det\MMCAD\MMCAD\notebooks\outputs\gfa_111k\checkpoint_best.pt


Resuming from: d:\Defect_Det\MMCAD\MMCAD\notebooks\outputs\gfa_111k\checkpoint_best.pt


Resumed at epoch 15, stage 1
✓ Will continue from epoch 16


## Cell 6: Train (Re-run this cell to restart training)

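**Start or continue training**

This cell can be run multiple times:
- Data stays in RAM (no reload needed!)
- Training continues from the trainer's current state; in the run logged below it picks up at epoch 16 because a checkpoint was loaded in the previous cell
- To restart from scratch, re-run Cells 4-5 first so the model and trainer are re-created
- To resume from a specific epoch checkpoint, use Cell 7 instead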

In [40]:
# Optional: Resume from checkpoint
# Uncomment the lines below to resume from a specific checkpoint
# checkpoint_path = output_dir / "checkpoint_epoch5.pt"
# if checkpoint_path.exists():
#     trainer.resume(str(checkpoint_path))
#     print(f"✓ Resumed from {checkpoint_path}")

print("Starting training...")
print("="*60)
trainer.train()
print("="*60)
print("✓ Training completed!")

Starting training...
CLIP4CAD-GFA Training
Total epochs: 35
Stage 1: 15 epochs (grounding)
Stage 2: 20 epochs (alignment)
Batch size: 512
Learning rate: 3e-05
Checkpoint every: 5 epochs
Trainable parameters: 3,168,771
Resuming from epoch 16


✓ Saved Stage 1 final checkpoint: outputs\gfa_111k\checkpoint_stage1_final.pt

Transitioning to Stage 2

Mining Hard Negatives


Extracting embeddings: 100%|██████████| 216/216 [00:56<00:00,  3.85it/s]


Mining hard negatives for 110592 samples...
  Computing geometric kNN...
  Filtering by text similarity...


  Filtering: 100%|██████████| 110592/110592 [00:04<00:00, 25521.34it/s]


  Found hard negatives for 105570 samples
  Average negatives per sample: 6.8
Saved hard negatives to outputs\gfa_111k\hard_negatives\hard_negatives_epoch15.json

Learning rate reduced to: 0.00e+00



Epoch 16 (Stage 2):   0%|          | 1/216 [00:01<05:33,  1.55s/it, loss=1.708, G=1.672, L=0.027, C=0.006, D=0.018, cf=0.000, lr=4.6e-08]


  [Conf] min=0.007 max=0.993 mean=0.324 active=5.9/12


Epoch 16 (Stage 2):  47%|████▋     | 101/216 [01:09<01:20,  1.42it/s, loss=1.644, G=1.604, L=0.028, C=0.006, D=0.016, cf=0.000, lr=4.7e-06]


  [Conf] min=0.007 max=0.993 mean=0.310 active=5.8/12


Epoch 16 (Stage 2):  93%|█████████▎| 201/216 [02:17<00:10,  1.42it/s, loss=1.664, G=1.623, L=0.029, C=0.006, D=0.018, cf=0.000, lr=9.3e-06]


  [Conf] min=0.007 max=0.993 mean=0.333 active=6.0/12


Epoch 16 (Stage 2): 100%|██████████| 216/216 [02:28<00:00,  1.46it/s, loss=1.629, G=1.587, L=0.028, C=0.006, D=0.019, cf=0.000, lr=1.0e-05]



Epoch 16 Summary:
  Train Loss: 1.6370
    Global: 1.5979
    Local: 0.0265
    Consistency: 0.0059
    Diversity: 0.0180
    Conf Floor: 0.0010


Epoch 17 (Stage 2):   0%|          | 1/216 [00:00<02:46,  1.29it/s, loss=1.663, G=1.620, L=0.028, C=0.006, D=0.019, cf=0.000, lr=1.0e-05]


  [Conf] min=0.007 max=0.993 mean=0.334 active=6.1/12


Epoch 17 (Stage 2):  47%|████▋     | 101/216 [01:08<01:14,  1.54it/s, loss=1.555, G=1.513, L=0.029, C=0.006, D=0.016, cf=0.001, lr=1.5e-05]


  [Conf] min=0.007 max=0.993 mean=0.299 active=5.8/12


Epoch 17 (Stage 2):  93%|█████████▎| 201/216 [02:15<00:10,  1.46it/s, loss=1.557, G=1.512, L=0.031, C=0.006, D=0.017, cf=0.000, lr=1.9e-05]


  [Conf] min=0.007 max=0.993 mean=0.314 active=5.8/12


Epoch 17 (Stage 2): 100%|██████████| 216/216 [02:25<00:00,  1.49it/s, loss=1.533, G=1.487, L=0.033, C=0.006, D=0.021, cf=0.000, lr=2.0e-05]



Epoch 17 Summary:
  Train Loss: 1.5831
    Global: 1.5383
    Local: 0.0305
    Consistency: 0.0059
    Diversity: 0.0185
    Conf Floor: 0.0017


Epoch 18 (Stage 2):   0%|          | 1/216 [00:00<02:47,  1.28it/s, loss=1.691, G=1.634, L=0.035, C=0.006, D=0.016, cf=0.018, lr=2.0e-05]


  [Conf] min=0.007 max=0.993 mean=0.282 active=5.6/12


Epoch 18 (Stage 2):  47%|████▋     | 101/216 [01:08<01:16,  1.51it/s, loss=1.534, G=1.487, L=0.033, C=0.006, D=0.019, cf=0.000, lr=2.5e-05]


  [Conf] min=0.007 max=0.993 mean=0.327 active=6.0/12


Epoch 18 (Stage 2):  93%|█████████▎| 201/216 [02:16<00:10,  1.47it/s, loss=1.443, G=1.392, L=0.033, C=0.006, D=0.021, cf=0.000, lr=2.9e-05]


  [Conf] min=0.007 max=0.993 mean=0.365 active=6.4/12


Epoch 18 (Stage 2): 100%|██████████| 216/216 [02:26<00:00,  1.47it/s, loss=1.429, G=1.379, L=0.036, C=0.006, D=0.020, cf=0.000, lr=3.0e-05]



Epoch 18 Summary:
  Train Loss: 1.5264
    Global: 1.4768
    Local: 0.0348
    Consistency: 0.0057
    Diversity: 0.0188
    Conf Floor: 0.0016


Epoch 19 (Stage 2):   0%|          | 1/216 [00:00<02:52,  1.25it/s, loss=1.546, G=1.493, L=0.036, C=0.006, D=0.023, cf=0.000, lr=3.0e-05]


  [Conf] min=0.007 max=0.993 mean=0.397 active=6.8/12


Epoch 19 (Stage 2):  47%|████▋     | 101/216 [01:08<01:19,  1.44it/s, loss=1.417, G=1.365, L=0.035, C=0.005, D=0.022, cf=0.000, lr=3.0e-05]


  [Conf] min=0.007 max=0.993 mean=0.383 active=6.6/12


Epoch 19 (Stage 2):  93%|█████████▎| 201/216 [02:16<00:10,  1.46it/s, loss=1.412, G=1.361, L=0.035, C=0.005, D=0.019, cf=0.000, lr=3.0e-05]


  [Conf] min=0.007 max=0.993 mean=0.318 active=5.8/12


Epoch 19 (Stage 2): 100%|██████████| 216/216 [02:26<00:00,  1.47it/s, loss=1.399, G=1.349, L=0.036, C=0.005, D=0.016, cf=0.000, lr=3.0e-05]



Epoch 19 Summary:
  Train Loss: 1.4313
    Global: 1.3793
    Local: 0.0358
    Consistency: 0.0055
    Diversity: 0.0191
    Conf Floor: 0.0012


Epoch 20 (Stage 2):   0%|          | 1/216 [00:00<02:43,  1.31it/s, loss=1.381, G=1.324, L=0.035, C=0.005, D=0.016, cf=0.014, lr=3.0e-05]


  [Conf] min=0.007 max=0.993 mean=0.286 active=5.5/12


Epoch 20 (Stage 2):  47%|████▋     | 101/216 [01:08<01:16,  1.50it/s, loss=1.329, G=1.279, L=0.033, C=0.005, D=0.017, cf=0.000, lr=3.0e-05]


  [Conf] min=0.007 max=0.993 mean=0.302 active=5.9/12


Epoch 20 (Stage 2):  93%|█████████▎| 201/216 [02:15<00:10,  1.49it/s, loss=1.288, G=1.225, L=0.034, C=0.005, D=0.015, cf=0.026, lr=3.0e-05]


  [Conf] min=0.007 max=0.993 mean=0.274 active=5.5/12


Epoch 20 (Stage 2): 100%|██████████| 216/216 [02:25<00:00,  1.48it/s, loss=1.334, G=1.274, L=0.033, C=0.005, D=0.015, cf=0.017, lr=3.0e-05]
Validation: 100%|██████████| 33/33 [03:57<00:00,  7.19s/it]



Epoch 20 Summary:
  Train Loss: 1.3495
    Global: 1.2965
    Local: 0.0352
    Consistency: 0.0052
    Diversity: 0.0187
    Conf Floor: 0.0015
  Val Loss: 1.6807


Epoch 21 (Stage 2):   0%|          | 1/216 [00:01<05:36,  1.57s/it, loss=1.238, G=1.184, L=0.036, C=0.005, D=0.018, cf=0.000, lr=3.0e-05]


  [Conf] min=0.007 max=0.993 mean=0.301 active=5.7/12


Epoch 21 (Stage 2):  47%|████▋     | 101/216 [01:09<01:16,  1.50it/s, loss=1.326, G=1.274, L=0.034, C=0.005, D=0.018, cf=0.000, lr=3.0e-05]


  [Conf] min=0.007 max=0.993 mean=0.325 active=6.1/12


Epoch 21 (Stage 2):  93%|█████████▎| 201/216 [02:16<00:10,  1.47it/s, loss=1.233, G=1.179, L=0.036, C=0.005, D=0.018, cf=0.000, lr=2.9e-05]


  [Conf] min=0.007 max=0.993 mean=0.307 active=5.7/12


Epoch 21 (Stage 2): 100%|██████████| 216/216 [02:26<00:00,  1.48it/s, loss=1.255, G=1.204, L=0.032, C=0.005, D=0.018, cf=0.000, lr=2.9e-05]



Epoch 21 Summary:
  Train Loss: 1.2699
    Global: 1.2161
    Local: 0.0338
    Consistency: 0.0049
    Diversity: 0.0188
    Conf Floor: 0.0022


Epoch 22 (Stage 2):   0%|          | 1/216 [00:00<02:37,  1.37it/s, loss=1.217, G=1.163, L=0.032, C=0.005, D=0.021, cf=0.000, lr=2.9e-05]


  [Conf] min=0.007 max=0.993 mean=0.344 active=6.1/12


Epoch 22 (Stage 2):  47%|████▋     | 101/216 [01:07<01:18,  1.46it/s, loss=1.231, G=1.178, L=0.033, C=0.005, D=0.022, cf=0.000, lr=2.9e-05]


  [Conf] min=0.007 max=0.993 mean=0.351 active=6.2/12


Epoch 22 (Stage 2):  93%|█████████▎| 201/216 [02:15<00:09,  1.52it/s, loss=1.277, G=1.225, L=0.032, C=0.004, D=0.019, cf=0.000, lr=2.9e-05]


  [Conf] min=0.007 max=0.993 mean=0.356 active=6.3/12


Epoch 22 (Stage 2): 100%|██████████| 216/216 [02:25<00:00,  1.49it/s, loss=1.137, G=1.084, L=0.033, C=0.004, D=0.019, cf=0.000, lr=2.9e-05]



Epoch 22 Summary:
  Train Loss: 1.2137
    Global: 1.1602
    Local: 0.0326
    Consistency: 0.0046
    Diversity: 0.0188
    Conf Floor: 0.0017


Epoch 23 (Stage 2):   0%|          | 1/216 [00:00<02:31,  1.42it/s, loss=1.209, G=1.156, L=0.031, C=0.004, D=0.021, cf=0.000, lr=2.9e-05]


  [Conf] min=0.007 max=0.993 mean=0.345 active=6.1/12


Epoch 23 (Stage 2):  47%|████▋     | 101/216 [01:07<01:17,  1.47it/s, loss=1.161, G=1.095, L=0.031, C=0.005, D=0.013, cf=0.029, lr=2.9e-05]


  [Conf] min=0.007 max=0.993 mean=0.271 active=5.4/12


Epoch 23 (Stage 2):  93%|█████████▎| 201/216 [02:14<00:10,  1.40it/s, loss=1.129, G=1.078, L=0.029, C=0.004, D=0.017, cf=0.001, lr=2.8e-05]


  [Conf] min=0.007 max=0.993 mean=0.299 active=5.6/12


Epoch 23 (Stage 2): 100%|██████████| 216/216 [02:24<00:00,  1.49it/s, loss=1.081, G=1.029, L=0.030, C=0.004, D=0.017, cf=0.000, lr=2.8e-05]



Epoch 23 Summary:
  Train Loss: 1.1468
    Global: 1.0937
    Local: 0.0308
    Consistency: 0.0042
    Diversity: 0.0187
    Conf Floor: 0.0018


Epoch 24 (Stage 2):   0%|          | 1/216 [00:00<02:50,  1.26it/s, loss=1.090, G=1.038, L=0.029, C=0.004, D=0.020, cf=0.000, lr=2.8e-05]


  [Conf] min=0.007 max=0.993 mean=0.323 active=5.9/12


Epoch 24 (Stage 2):  47%|████▋     | 101/216 [01:07<01:14,  1.55it/s, loss=1.031, G=0.978, L=0.028, C=0.004, D=0.020, cf=0.000, lr=2.8e-05]


  [Conf] min=0.007 max=0.993 mean=0.329 active=5.8/12


Epoch 24 (Stage 2):  93%|█████████▎| 201/216 [02:14<00:10,  1.50it/s, loss=1.067, G=1.016, L=0.029, C=0.004, D=0.020, cf=0.000, lr=2.8e-05]


  [Conf] min=0.007 max=0.993 mean=0.340 active=6.0/12


Epoch 24 (Stage 2): 100%|██████████| 216/216 [02:24<00:00,  1.50it/s, loss=1.058, G=1.008, L=0.027, C=0.004, D=0.018, cf=0.000, lr=2.7e-05]



Epoch 24 Summary:
  Train Loss: 1.0922
    Global: 1.0402
    Local: 0.0288
    Consistency: 0.0039
    Diversity: 0.0188
    Conf Floor: 0.0017


Epoch 25 (Stage 2):   0%|          | 1/216 [00:00<02:57,  1.21it/s, loss=1.053, G=1.002, L=0.028, C=0.004, D=0.019, cf=0.000, lr=2.7e-05]


  [Conf] min=0.007 max=0.993 mean=0.335 active=5.9/12


Epoch 25 (Stage 2):  47%|████▋     | 101/216 [01:08<01:14,  1.54it/s, loss=1.032, G=0.981, L=0.026, C=0.004, D=0.021, cf=0.000, lr=2.7e-05]


  [Conf] min=0.007 max=0.993 mean=0.366 active=6.3/12


Epoch 25 (Stage 2):  93%|█████████▎| 201/216 [02:16<00:09,  1.51it/s, loss=1.124, G=1.073, L=0.029, C=0.003, D=0.019, cf=0.000, lr=2.7e-05]


  [Conf] min=0.007 max=0.993 mean=0.317 active=5.7/12


Epoch 25 (Stage 2): 100%|██████████| 216/216 [02:26<00:00,  1.48it/s, loss=0.996, G=0.947, L=0.025, C=0.003, D=0.019, cf=0.000, lr=2.7e-05]
Validation: 100%|██████████| 33/33 [03:32<00:00,  6.43s/it]



Epoch 25 Summary:
  Train Loss: 1.0519
    Global: 1.0013
    Local: 0.0270
    Consistency: 0.0035
    Diversity: 0.0182
    Conf Floor: 0.0020
  Val Loss: 1.5924


Epoch 26 (Stage 2):   0%|          | 1/216 [00:01<05:38,  1.58s/it, loss=1.065, G=1.016, L=0.026, C=0.003, D=0.017, cf=0.004, lr=2.7e-05]


  [Conf] min=0.007 max=0.993 mean=0.296 active=5.4/12


Epoch 26 (Stage 2):  47%|████▋     | 101/216 [01:09<01:16,  1.51it/s, loss=1.034, G=0.987, L=0.026, C=0.003, D=0.018, cf=0.000, lr=2.6e-05]


  [Conf] min=0.007 max=0.993 mean=0.304 active=5.5/12


Epoch 26 (Stage 2):  93%|█████████▎| 201/216 [02:17<00:10,  1.43it/s, loss=0.910, G=0.863, L=0.023, C=0.003, D=0.019, cf=0.000, lr=2.6e-05]


  [Conf] min=0.007 max=0.993 mean=0.337 active=5.8/12


Epoch 26 (Stage 2): 100%|██████████| 216/216 [02:27<00:00,  1.46it/s, loss=1.017, G=0.969, L=0.025, C=0.003, D=0.021, cf=0.000, lr=2.6e-05]



Epoch 26 Summary:
  Train Loss: 1.0098
    Global: 0.9605
    Local: 0.0255
    Consistency: 0.0032
    Diversity: 0.0186
    Conf Floor: 0.0015


Epoch 27 (Stage 2):   0%|          | 1/216 [00:00<02:41,  1.33it/s, loss=0.964, G=0.916, L=0.025, C=0.003, D=0.019, cf=0.000, lr=2.6e-05]


  [Conf] min=0.007 max=0.993 mean=0.351 active=6.0/12


Epoch 27 (Stage 2):  47%|████▋     | 101/216 [01:10<01:17,  1.48it/s, loss=0.980, G=0.933, L=0.025, C=0.003, D=0.015, cf=0.000, lr=2.5e-05]


  [Conf] min=0.007 max=0.993 mean=0.302 active=5.5/12


Epoch 27 (Stage 2):  93%|█████████▎| 201/216 [02:19<00:09,  1.50it/s, loss=0.979, G=0.927, L=0.022, C=0.003, D=0.017, cf=0.012, lr=2.5e-05]


  [Conf] min=0.007 max=0.993 mean=0.288 active=5.3/12


Epoch 27 (Stage 2): 100%|██████████| 216/216 [02:29<00:00,  1.45it/s, loss=0.951, G=0.904, L=0.022, C=0.003, D=0.020, cf=0.000, lr=2.5e-05]



Epoch 27 Summary:
  Train Loss: 0.9723
    Global: 0.9236
    Local: 0.0241
    Consistency: 0.0029
    Diversity: 0.0183
    Conf Floor: 0.0018


Epoch 28 (Stage 2):   0%|          | 1/216 [00:00<02:36,  1.37it/s, loss=0.925, G=0.879, L=0.023, C=0.003, D=0.019, cf=0.000, lr=2.5e-05]


  [Conf] min=0.007 max=0.993 mean=0.332 active=5.7/12


Epoch 28 (Stage 2):  47%|████▋     | 101/216 [01:07<01:16,  1.51it/s, loss=0.957, G=0.911, L=0.022, C=0.003, D=0.017, cf=0.000, lr=2.4e-05]


  [Conf] min=0.007 max=0.993 mean=0.312 active=5.5/12


Epoch 28 (Stage 2):  93%|█████████▎| 201/216 [02:15<00:10,  1.46it/s, loss=0.947, G=0.901, L=0.023, C=0.002, D=0.017, cf=0.000, lr=2.3e-05]


  [Conf] min=0.007 max=0.993 mean=0.306 active=5.5/12


Epoch 28 (Stage 2): 100%|██████████| 216/216 [02:25<00:00,  1.48it/s, loss=0.935, G=0.888, L=0.021, C=0.002, D=0.020, cf=0.000, lr=2.3e-05]



Epoch 28 Summary:
  Train Loss: 0.9413
    Global: 0.8941
    Local: 0.0225
    Consistency: 0.0026
    Diversity: 0.0182
    Conf Floor: 0.0017


Epoch 29 (Stage 2):   0%|          | 1/216 [00:00<02:42,  1.33it/s, loss=0.901, G=0.853, L=0.023, C=0.002, D=0.018, cf=0.000, lr=2.3e-05]


  [Conf] min=0.007 max=0.993 mean=0.339 active=5.8/12


Epoch 29 (Stage 2):  47%|████▋     | 101/216 [01:08<01:21,  1.42it/s, loss=0.925, G=0.878, L=0.023, C=0.002, D=0.018, cf=0.000, lr=2.3e-05]


  [Conf] min=0.007 max=0.993 mean=0.328 active=5.8/12


Epoch 29 (Stage 2):  93%|█████████▎| 201/216 [02:15<00:10,  1.47it/s, loss=0.878, G=0.833, L=0.021, C=0.002, D=0.018, cf=0.000, lr=2.2e-05]


  [Conf] min=0.007 max=0.993 mean=0.321 active=5.6/12


Epoch 29 (Stage 2): 100%|██████████| 216/216 [02:25<00:00,  1.48it/s, loss=0.872, G=0.827, L=0.021, C=0.002, D=0.015, cf=0.005, lr=2.2e-05]



Epoch 29 Summary:
  Train Loss: 0.9088
    Global: 0.8622
    Local: 0.0213
    Consistency: 0.0023
    Diversity: 0.0181
    Conf Floor: 0.0020


Epoch 30 (Stage 2):   0%|          | 1/216 [00:00<02:48,  1.27it/s, loss=0.914, G=0.868, L=0.021, C=0.002, D=0.016, cf=0.004, lr=2.2e-05]


  [Conf] min=0.007 max=0.993 mean=0.296 active=5.3/12


Epoch 30 (Stage 2):  47%|████▋     | 101/216 [01:08<01:17,  1.48it/s, loss=0.875, G=0.831, L=0.021, C=0.002, D=0.016, cf=0.000, lr=2.1e-05]


  [Conf] min=0.007 max=0.993 mean=0.302 active=5.4/12


Epoch 30 (Stage 2):  93%|█████████▎| 201/216 [02:14<00:10,  1.48it/s, loss=0.898, G=0.850, L=0.020, C=0.003, D=0.015, cf=0.007, lr=2.1e-05]


  [Conf] min=0.007 max=0.993 mean=0.293 active=5.3/12


Epoch 30 (Stage 2): 100%|██████████| 216/216 [02:24<00:00,  1.49it/s, loss=0.859, G=0.814, L=0.020, C=0.002, D=0.018, cf=0.000, lr=2.1e-05]
Validation: 100%|██████████| 33/33 [03:32<00:00,  6.45s/it]



Epoch 30 Summary:
  Train Loss: 0.8863
    Global: 0.8405
    Local: 0.0203
    Consistency: 0.0021
    Diversity: 0.0181
    Conf Floor: 0.0013
  Val Loss: 1.5995


Epoch 31 (Stage 2):   0%|          | 1/216 [00:01<05:01,  1.40s/it, loss=0.865, G=0.822, L=0.020, C=0.002, D=0.017, cf=0.000, lr=2.1e-05]


  [Conf] min=0.007 max=0.993 mean=0.330 active=5.6/12


Epoch 31 (Stage 2):  47%|████▋     | 101/216 [01:08<01:17,  1.48it/s, loss=0.914, G=0.873, L=0.021, C=0.002, D=0.017, cf=0.000, lr=2.0e-05]


  [Conf] min=0.007 max=0.993 mean=0.302 active=5.3/12


Epoch 31 (Stage 2):  93%|█████████▎| 201/216 [02:15<00:10,  1.49it/s, loss=0.855, G=0.811, L=0.018, C=0.002, D=0.016, cf=0.003, lr=1.9e-05]


  [Conf] min=0.007 max=0.993 mean=0.297 active=5.3/12


Epoch 31 (Stage 2): 100%|██████████| 216/216 [02:25<00:00,  1.49it/s, loss=0.873, G=0.832, L=0.017, C=0.002, D=0.018, cf=0.000, lr=1.9e-05]



Epoch 31 Summary:
  Train Loss: 0.8564
    Global: 0.8118
    Local: 0.0190
    Consistency: 0.0019
    Diversity: 0.0180
    Conf Floor: 0.0017


Epoch 32 (Stage 2):   0%|          | 1/216 [00:00<02:38,  1.36it/s, loss=0.773, G=0.729, L=0.018, C=0.002, D=0.020, cf=0.000, lr=1.9e-05]


  [Conf] min=0.007 max=0.993 mean=0.361 active=5.9/12


Epoch 32 (Stage 2):  47%|████▋     | 101/216 [01:07<01:15,  1.51it/s, loss=0.843, G=0.798, L=0.019, C=0.002, D=0.022, cf=0.000, lr=1.9e-05]


  [Conf] min=0.007 max=0.993 mean=0.365 active=5.9/12


Epoch 32 (Stage 2):  93%|█████████▎| 201/216 [02:15<00:10,  1.44it/s, loss=0.788, G=0.746, L=0.017, C=0.002, D=0.018, cf=0.000, lr=1.8e-05]


  [Conf] min=0.007 max=0.993 mean=0.329 active=5.6/12


Epoch 32 (Stage 2): 100%|██████████| 216/216 [02:25<00:00,  1.48it/s, loss=0.802, G=0.760, L=0.017, C=0.002, D=0.020, cf=0.000, lr=1.8e-05]



Epoch 32 Summary:
  Train Loss: 0.8354
    Global: 0.7916
    Local: 0.0181
    Consistency: 0.0017
    Diversity: 0.0179
    Conf Floor: 0.0014


Epoch 33 (Stage 2):   0%|          | 1/216 [00:00<02:39,  1.35it/s, loss=0.789, G=0.747, L=0.018, C=0.002, D=0.016, cf=0.002, lr=1.8e-05]


  [Conf] min=0.007 max=0.993 mean=0.298 active=5.3/12


Epoch 33 (Stage 2):  47%|████▋     | 101/216 [01:07<01:15,  1.52it/s, loss=0.826, G=0.782, L=0.018, C=0.002, D=0.019, cf=0.000, lr=1.7e-05]


  [Conf] min=0.007 max=0.993 mean=0.344 active=5.8/12


Epoch 33 (Stage 2):  93%|█████████▎| 201/216 [02:12<00:09,  1.52it/s, loss=0.815, G=0.773, L=0.017, C=0.001, D=0.017, cf=0.000, lr=1.7e-05]


  [Conf] min=0.007 max=0.993 mean=0.301 active=5.2/12


Epoch 33 (Stage 2): 100%|██████████| 216/216 [02:22<00:00,  1.51it/s, loss=0.748, G=0.705, L=0.017, C=0.001, D=0.018, cf=0.000, lr=1.6e-05]



Epoch 33 Summary:
  Train Loss: 0.8158
    Global: 0.7727
    Local: 0.0172
    Consistency: 0.0015
    Diversity: 0.0179
    Conf Floor: 0.0019


Epoch 34 (Stage 2):   0%|          | 1/216 [00:00<02:24,  1.49it/s, loss=0.767, G=0.724, L=0.016, C=0.001, D=0.022, cf=0.000, lr=1.6e-05]


  [Conf] min=0.007 max=0.993 mean=0.378 active=5.9/12


Epoch 34 (Stage 2):  47%|████▋     | 101/216 [01:07<01:18,  1.46it/s, loss=0.775, G=0.734, L=0.016, C=0.001, D=0.021, cf=0.000, lr=1.6e-05]


  [Conf] min=0.007 max=0.993 mean=0.355 active=5.8/12


Epoch 34 (Stage 2):  93%|█████████▎| 201/216 [02:14<00:09,  1.51it/s, loss=0.799, G=0.751, L=0.017, C=0.001, D=0.017, cf=0.015, lr=1.5e-05]


  [Conf] min=0.007 max=0.993 mean=0.285 active=5.1/12


Epoch 34 (Stage 2): 100%|██████████| 216/216 [02:24<00:00,  1.50it/s, loss=0.737, G=0.695, L=0.015, C=0.001, D=0.018, cf=0.000, lr=1.5e-05]



Epoch 34 Summary:
  Train Loss: 0.8013
    Global: 0.7589
    Local: 0.0164
    Consistency: 0.0014
    Diversity: 0.0178
    Conf Floor: 0.0017


Epoch 35 (Stage 2):   0%|          | 1/216 [00:00<02:48,  1.28it/s, loss=0.787, G=0.747, L=0.015, C=0.001, D=0.018, cf=0.000, lr=1.5e-05]


  [Conf] min=0.007 max=0.993 mean=0.323 active=5.4/12


Epoch 35 (Stage 2):  47%|████▋     | 101/216 [01:07<01:15,  1.52it/s, loss=0.817, G=0.777, L=0.016, C=0.001, D=0.019, cf=0.000, lr=1.4e-05]


  [Conf] min=0.007 max=0.993 mean=0.327 active=5.6/12


Epoch 35 (Stage 2):  93%|█████████▎| 201/216 [02:15<00:10,  1.50it/s, loss=0.726, G=0.685, L=0.016, C=0.001, D=0.017, cf=0.000, lr=1.4e-05]


  [Conf] min=0.007 max=0.993 mean=0.324 active=5.4/12


Epoch 35 (Stage 2): 100%|██████████| 216/216 [02:25<00:00,  1.49it/s, loss=0.782, G=0.742, L=0.015, C=0.001, D=0.017, cf=0.000, lr=1.4e-05]
Validation: 100%|██████████| 33/33 [03:39<00:00,  6.65s/it]



Epoch 35 Summary:
  Train Loss: 0.7807
    Global: 0.7388
    Local: 0.0155
    Consistency: 0.0012
    Diversity: 0.0174
    Conf Floor: 0.0019
  Val Loss: 1.5772
Final model saved: outputs\gfa_111k\clip4cad_gfa_final.pt

Training Complete!
✓ Training completed!
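
The `lr` values in the progress bars are consistent with a linear warmup to 3e-05 over the first three Stage 2 epochs (648 steps), followed by a cosine decay. The sketch below reproduces the logged values under that assumption. The warmup length, the 35-epoch decay horizon, and the `lr_at_step` helper are inferred from the log or hypothetical; none of them is read from `GFATrainer`.

```python
import math

STEPS_PER_EPOCH = 216               # batches per epoch, from the progress bars
PEAK_LR = 3e-5                      # "Learning rate: 3e-05" in the log header
WARMUP_STEPS = 3 * STEPS_PER_EPOCH  # assumed: warmup spans epochs 16-18
TOTAL_STEPS = 35 * STEPS_PER_EPOCH  # assumed: cosine horizon of 35 epochs


def lr_at_step(step: int) -> float:
    """Learning rate after `step` optimizer steps counted from the start of Stage 2."""
    if step <= WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


for epoch_end in (16, 18, 25, 30, 35):
    step = (epoch_end - 15) * STEPS_PER_EPOCH
    print(f"end of epoch {epoch_end}: lr ~ {lr_at_step(step):.1e}")
```

If the assumption holds, the learning rate never decays to zero within this run: at the end of epoch 35 it is still close to half the peak value.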


## Cell 7: Resume Training from Checkpoint (Optional)

Resume training from a specific checkpoint epoch.
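
When choosing `checkpoint_epoch`, note that sorting checkpoint filenames as strings puts `checkpoint_epoch10.pt` before `checkpoint_epoch5.pt`. Here is a small sketch of a numeric sort, assuming the `checkpoint_epoch{N}.pt` naming used in this cell. `checkpoints_by_epoch` is a hypothetical helper, and the demo runs against empty placeholder files in a temporary directory.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Tuple

_PATTERN = re.compile(r"checkpoint_epoch(\d+)\.pt$")


def checkpoints_by_epoch(directory: Path) -> List[Tuple[int, Path]]:
    """Return (epoch, path) pairs sorted by epoch number, not by filename."""
    found = []
    for path in directory.glob("checkpoint_epoch*.pt"):
        match = _PATTERN.search(path.name)
        if match:
            found.append((int(match.group(1)), path))
    return sorted(found)


with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)
    for epoch in (5, 10, 20):
        (tmp_dir / f"checkpoint_epoch{epoch}.pt").touch()  # empty placeholder files
    epochs = [epoch for epoch, _ in checkpoints_by_epoch(tmp_dir)]

print(epochs)  # [5, 10, 20]
```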

In [None]:
# Specify the checkpoint to resume from
checkpoint_epoch = 10  # Change this to the epoch you want to resume from
checkpoint_path = output_dir / f"checkpoint_epoch{checkpoint_epoch}.pt"

if checkpoint_path.exists():
    print(f"Resuming from {checkpoint_path}...")
    trainer.resume(str(checkpoint_path))
    print(f"✓ Checkpoint loaded from epoch {checkpoint_epoch}")
    
    print("\nRestarting training...")
    print("="*60)
    trainer.train()
    print("="*60)
    print("✓ Training completed!")
else:
    print(f"❌ Checkpoint not found: {checkpoint_path}")
    print("\nAvailable checkpoints:")
    checkpoints = sorted(output_dir.glob("checkpoint_epoch*.pt"))
    if checkpoints:
        for ckpt in checkpoints:
            print(f"  - {ckpt.name}")
    else:
        print("  - No checkpoints found")

## Cell 8: Quick Validation (Optional)

Run validation without training to check model performance.
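
For context when reading a fresh validation number, the Cell 6 log already contains four validation points. The sketch below only tabulates those logged values, copied from the epoch summaries, and picks the epoch with the lowest validation loss; `logged`, `gaps`, and `best_epoch` are illustrative names.

```python
# (train_loss, val_loss) copied from the epoch summaries logged in Cell 6
logged = {
    20: (1.3495, 1.6807),
    25: (1.0519, 1.5924),
    30: (0.8863, 1.5995),
    35: (0.7807, 1.5772),
}

# Gap between validation and training loss at each validated epoch
gaps = {epoch: val - train for epoch, (train, val) in logged.items()}
best_epoch = min(logged, key=lambda epoch: logged[epoch][1])

for epoch in sorted(logged):
    train_loss, val_loss = logged[epoch]
    print(f"epoch {epoch}: train={train_loss:.4f} val={val_loss:.4f} gap={gaps[epoch]:.4f}")
print(f"lowest val loss at epoch {best_epoch}")
```

The gap grows from about 0.33 at epoch 20 to about 0.80 at epoch 35 while validation loss improves only slightly, which may be worth checking for overfitting.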

In [None]:
print("Running validation...")
val_metrics = trainer.validate()

print("\n✓ Validation Results:")
print("="*60)
for key, value in val_metrics.items():
    if isinstance(value, float):
        print(f"  {key}: {value:.4f}")
    else:
        print(f"  {key}: {value}")
print("="*60)

## Cell 9: Check RAM Usage (Optional)

Monitor memory usage to ensure no memory leaks.
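
A single reading cannot show a leak; what matters is whether process memory keeps growing across training restarts. Below is a minimal sketch of that comparison, assuming you note the `Process RAM` value after each run. `leaked_gb` is a hypothetical helper, and the readings in the demo are placeholder numbers for illustration, not measurements from this notebook.

```python
from typing import Sequence


def leaked_gb(readings_gb: Sequence[float], tolerance_gb: float = 1.0) -> float:
    """Growth from the first to the last reading, or 0.0 if within tolerance."""
    if len(readings_gb) < 2:
        return 0.0
    growth = readings_gb[-1] - readings_gb[0]
    return growth if growth > tolerance_gb else 0.0


# Placeholder values for illustration only, not measurements.
stable = [210.0, 210.4, 210.2]
growing = [210.0, 214.5, 219.0]
print(leaked_gb(stable), leaked_gb(growing))  # 0.0 9.0
```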

In [None]:
import psutil
import gc

# Force garbage collection
gc.collect()
torch.cuda.empty_cache()

# Check system RAM
process = psutil.Process()
ram_gb = process.memory_info().rss / 1e9
total_ram = psutil.virtual_memory().total / 1e9
available_ram = psutil.virtual_memory().available / 1e9

print("Memory Usage:")
print("="*60)
print(f"  Process RAM: {ram_gb:.1f} GB")
print(f"  Total System RAM: {total_ram:.1f} GB")
print(f"  Available RAM: {available_ram:.1f} GB")
print(f"  RAM Usage: {(total_ram - available_ram) / total_ram * 100:.1f}%")

# Check GPU memory if available
if torch.cuda.is_available():
    gpu_allocated = torch.cuda.memory_allocated() / 1e9
    gpu_reserved = torch.cuda.memory_reserved() / 1e9
    gpu_total = torch.cuda.get_device_properties(0).total_memory / 1e9
    
    print("\nGPU Memory:")
    print(f"  Allocated: {gpu_allocated:.1f} GB")
    print(f"  Reserved: {gpu_reserved:.1f} GB")
    print(f"  Total: {gpu_total:.1f} GB")
    print(f"  Usage: {gpu_allocated / gpu_total * 100:.1f}%")

print("="*60)

## Cell 10: Reload Code Changes (Optional)

In [36]:
# Reload project modules to pick up code changes (no data reload needed!)
import importlib

# Reload the model module, then re-import the class so the name points at the new code
from clip4cad.models import clip4cad_gfa
importlib.reload(clip4cad_gfa)
from clip4cad.models.clip4cad_gfa import CLIP4CAD_GFA

# Same for the trainer module
from clip4cad.training import gfa_trainer
importlib.reload(gfa_trainer)
from clip4cad.training.gfa_trainer import GFATrainer

# Same for the hard negative mining module
from clip4cad.training import hard_negative_mining
importlib.reload(hard_negative_mining)
print("✓ Model code reloaded with confidence clamp fix")

✓ Model code reloaded with confidence clamp fix
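
`importlib.reload` rebinds the names inside the module, but objects created earlier keep pointing at the old class. The existing `model` and `trainer` therefore still run the pre-fix code until Cells 4 and 5 are re-run. The self-contained sketch below demonstrates the effect with a throwaway module written to a temporary directory; `demo_model_mod` and its `Model` class are made up for the demo.

```python
import importlib
import sys
import tempfile
from pathlib import Path

sys.dont_write_bytecode = True  # keep the demo free of stale .pyc files

OLD_SOURCE = "class Model:\n    def version(self):\n        return 'old'\n"
NEW_SOURCE = "class Model:\n    def version(self):\n        return 'new code'\n"

with tempfile.TemporaryDirectory() as tmp:
    module_path = Path(tmp) / "demo_model_mod.py"
    module_path.write_text(OLD_SOURCE)
    sys.path.insert(0, tmp)
    try:
        import demo_model_mod

        stale = demo_model_mod.Model()      # created before the code change

        module_path.write_text(NEW_SOURCE)  # simulate editing the source file
        importlib.invalidate_caches()
        importlib.reload(demo_model_mod)

        fresh = demo_model_mod.Model()      # created after the reload
        stale_version = stale.version()
        fresh_version = fresh.version()
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("demo_model_mod", None)

print(stale_version, "|", fresh_version)  # old | new code
```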
