## ENV SETUP

1. Install uv (or set up the environment your own way)
2. Run `uv sync`
3. Run `source .venv/bin/activate`

You're good to go.

# Instructions

The task: create the best CadQuery code generator model.

1. Load the dataset (147K image/CadQuery code pairs).
2. Create a baseline model and evaluate it with the given metrics.
3. Enhance the baseline model in any way you like and evaluate it again.
4. Explain your choices and possible bottlenecks.
5. Show what enhancements you would have made if you had more time.

You can do *WHATEVER* you want. Be creative; the result is not what matters most.
You can create new model architectures, reuse ones you have used in the past, fine-tune, etc.

If you are GPU poor, there are solutions. The absolute value of the metrics matters less than the relative improvement between the baseline and the enhanced model.

In [None]:
from datasets import load_dataset
ds = load_dataset("CADCODER/GenCAD-Code", num_proc=16, split=["train", "test"])

## Evaluation Metrics

1. The Valid Syntax Rate metric assesses the validity of the code by executing it and checking whether any errors are raised.
2. Best IOU assesses the similarity between the meshes generated by two pieces of code.
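
To make the two metrics concrete, here is a minimal sketch of the ideas behind them. It is not the code in `metrics/`: the names `valid_rate` and `voxel_iou` are hypothetical, and the IoU part assumes the shapes have already been voxelized into sets of integer cells, whereas the provided metric works on meshes.

```python
def valid_rate(codes: dict[str, str]) -> float:
    """Fraction of snippets that execute without raising (hypothetical helper)."""
    if not codes:
        return 0.0
    ok = 0
    for name, src in codes.items():
        try:
            # Fresh namespace per snippet so snippets cannot affect each other.
            exec(compile(src, name, "exec"), {})
            ok += 1
        except Exception:
            # Syntax errors and runtime errors both count as invalid.
            pass
    return ok / len(codes)


def voxel_iou(a: set, b: set) -> float:
    """Intersection over union of two sets of occupied voxel cells."""
    union = a | b
    if not union:
        # Assumption: two empty shapes are scored 0.0 rather than 1.0.
        return 0.0
    return len(a & b) / len(union)
```

Note that `exec` runs arbitrary code, so a sketch like this should only be used on generated code inside a sandboxed environment.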

In [None]:
from metrics.valid_syntax_rate import evaluate_syntax_rate_simple
from metrics.best_iou import get_iou_best

In [None]:
## Example usage of the metrics
sample_code = """
height = 60.0
width = 80.0
thickness = 10.0
diameter = 22.0

# make the base
result = (
    cq.Workplane("XY")
    .box(height, width, thickness)
)
"""

sample_code_2 = """
 height = 60.0
 width = 80.0
 thickness = 10.0
 diameter = 22.0
 padding = 12.0

 # make the base
 result = (
     cq.Workplane("XY")
     .box(height, width, thickness)
     .faces(">Z")
     .workplane()
     .hole(diameter)
     .faces(">Z")
     .workplane()
     .rect(height - padding, width - padding, forConstruction=True)
     .vertices()
     .cboreHole(2.4, 4.4, 2.1)
 )
"""

codes = {
    "sample_code": sample_code,
    "sample_code_2": sample_code_2,
}
vsr = evaluate_syntax_rate_simple(codes)
print("Valid Syntax Rate:", vsr)
iou = get_iou_best(sample_code, sample_code_2)
print("IOU:", iou)

## Have Fun

# Solution
I recently lost access to my university HPC, so all code was run on a MacBook Air with an M1 chip.<br>
Even a full validation run takes 9 hours.<br>
Therefore, all the code below has only been tested to run without errors; full results could not be obtained.<br>
I also tried Google Colab later, but `!uv sync` alone took so long that there was not enough time left to run the rest.<br>

In [None]:
from datasets import load_dataset
from metrics.valid_syntax_rate import evaluate_syntax_rate_simple
from metrics.best_iou import get_iou_best
from pprint import pprint
from torchvision import transforms
import torch
from torch.nn import GELU
from torch.nn import CrossEntropyLoss
from torch.optim import AdamW
from torch.utils.data import DataLoader
from torch.utils.data._utils.collate import default_collate
from transformers import CLIPModel, CLIPProcessor, AutoTokenizer, AutoModelForCausalLM
from PIL import Image
from tqdm import tqdm

In [None]:
def get_device():
    if torch.cuda.is_available():
        return torch.device("cuda")
    elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
        return torch.device("mps")
    else:
        return torch.device("cpu")

device = get_device()

# Checking Dataset

In [None]:
ds = load_dataset("CADCODER/GenCAD-Code", num_proc=16, split=["train", "test", "validation"])

train = ds[0]
test = ds[1]
val = ds[2]

example = train[0]
pprint(example)

print("Prompt:", example['prompt'])
print("CADQuery Code:\n", example['cadquery'])
print("ID:", example['deepcad_id'])

example['image'].show()

# Baseline Model

## Model Definition
Inspired by [CAD-Coder: An Open-Source Vision-Language Model for Computer-Aided Design Code Generation](https://arxiv.org/html/2505.14646v1?utm_source=chatgpt.com), but with lighter models swapped in for the vision encoder and the decoder.

In [None]:
# Vision Encoder (CLIP ViT-B/32), frozen
vision_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)

# Decoder (CodeGen-350M)
tokenizer = AutoTokenizer.from_pretrained("Salesforce/codegen-350M-multi")
decoder = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-350M-multi").to(device)

# MLP Projector (fine-tuned with rest)
class MLPProjector(torch.nn.Module):
    def __init__(self, input_dim=512, output_dim=decoder.config.n_embd):  # 512→1024 for CodeGen
        super().__init__()
        self.proj = torch.nn.Sequential(
            torch.nn.Linear(input_dim, output_dim),
            torch.nn.GELU(),
            torch.nn.Linear(output_dim, output_dim)
        )

    def forward(self, x):
        return self.proj(x)

## Generation

In [None]:
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
projector = MLPProjector().to(device)

vision_model.eval()
projector.eval()
decoder.eval()

for m in [vision_model, projector, decoder]:
    for p in m.parameters():
        p.requires_grad = False

generated_codes = []  
ground_truths = [] 

for example in tqdm(val, desc="Running inference"):
    image = example["image"].convert("RGB")
    prompt = """
    Generate CadQuery code for this shape, using the following structure as the reference:\n
    height = 60.0
    width = 80.0
    thickness = 10.0
    diameter = 22.0
    padding = 12.0

    # make the base
    result = (
        cq.Workplane("XY")
        .box(height, width, thickness)
        .faces(">Z")
        .workplane()
        .hole(diameter)
        .faces(">Z")
        .workplane()
        .rect(height - padding, width - padding, forConstruction=True)
        .vertices()
        .cboreHole(2.4, 4.4, 2.1)
    )
    """

    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        vision_embed = vision_model.get_image_features(**inputs)        
        projected_embed = projector(vision_embed).unsqueeze(1) 

    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    prompt_embeds = decoder.get_input_embeddings()(input_ids)         
    inputs_embeds = torch.cat([projected_embed, prompt_embeds], dim=1)

    generated_ids = decoder.generate(
        inputs_embeds=inputs_embeds,
        max_new_tokens=500,
        do_sample=True,
        top_p=0.95,
        temperature=0.7
    )

    output_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    generated_codes.append(output_text)

    if "cadquery" in example:
        ground_truths.append(example["cadquery"])

## Evaluation

In [None]:
codes = {}
total_IoU = 0
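# enumerate yields (index, (pred, truth)), so the pair must be unpacked in parentheses
for i, (pred, truth) in enumerate(zip(generated_codes, ground_truths)):
    key = f'code[{i}]'  # unique key per sample
    codes[key] = pred
    total_IoU += get_iou_best(pred, truth)

# Compute the valid syntax rate once, after all predictions are collected
vsr = evaluate_syntax_rate_simple(codes)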

print("Valid Syntax Rate:", vsr)
print("IOU:", total_IoU/len(generated_codes))
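
Since many generated snippets may be invalid, the averaging step may need to tolerate failures. The sketch below is one possible way to do that, under the assumption (not verified against `metrics/best_iou.py`) that the IoU function may raise or return `None` for code that cannot be executed. `mean_iou` and its `iou_fn` parameter are hypothetical names; failed samples are counted as an IoU of 0.

```python
from typing import Callable, Optional, Sequence


def mean_iou(
    preds: Sequence[str],
    truths: Sequence[str],
    iou_fn: Callable[[str, str], Optional[float]],
) -> float:
    """Average IoU over all samples, scoring failed samples as 0.0."""
    if len(preds) != len(truths):
        raise ValueError("preds and truths must have the same length")
    if not preds:
        return 0.0
    total = 0.0
    for pred, truth in zip(preds, truths):
        try:
            score = iou_fn(pred, truth)
        except Exception:
            score = None  # assumption: invalid code may make the metric raise
        total += score if score is not None else 0.0
    return total / len(preds)
```

In the notebook this could be called as `mean_iou(generated_codes, ground_truths, get_iou_best)`.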

# Improvement 1: Fine-tuning

In [None]:
transform = transforms.Compose([
    transforms.ToTensor(),
])

def custom_collate(batch):
    for example in batch:
        example["image"] = transform(example["image"])
    return default_collate(batch)  # now safe

train_loader = DataLoader(train, batch_size=1, shuffle=True, collate_fn=custom_collate)

In [None]:
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
projector = MLPProjector().to(device)

def preprocess_batch(batch):
    image = batch["image"]
    cadquery_code = batch["cadquery"]
    prompt = "Generate CadQuery code for this shape:\n"

    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.squeeze(0).to(device)
    target_ids = tokenizer(cadquery_code, return_tensors="pt").input_ids.squeeze(0).to(device)

    inputs = processor(images=image, return_tensors="pt")
    inputs["pixel_values"] = inputs["pixel_values"].to(device)
    return inputs, prompt_ids, target_ids

vision_model.eval()
for param in vision_model.parameters():
    param.requires_grad = False

projector.train()
decoder.train()

optimizer = AdamW(list(projector.parameters()) + list(decoder.parameters()), lr=2e-5)

for epoch in range(50):  
    loader = tqdm(train_loader, desc=f"Epoch {epoch+1}", leave=False)

    for step, batch in enumerate(loader):
        inputs, prompt_ids, target_ids = preprocess_batch(batch)
        
        with torch.no_grad():
            image_embed = vision_model.get_image_features(**inputs) 

        projected_embeds = projector(image_embed).unsqueeze(1)
        prompt_embeds = decoder.get_input_embeddings()(prompt_ids.unsqueeze(0))
        code_embeds = decoder.get_input_embeddings()(target_ids.unsqueeze(0))

        inputs_embeds = torch.cat([projected_embeds, prompt_embeds, code_embeds], dim=1) 

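        # Mask the image token and the prompt tokens with -100 so the loss covers only the target code.
        # The prefix length is computed from the tensors rather than hardcoded, so it stays correct if the prompt changes.
        prefix_len = projected_embeds.size(1) + prompt_embeds.size(1)
        prefix_mask = torch.full((1, prefix_len), -100, dtype=torch.long, device=target_ids.device)
        labels = torch.cat([prefix_mask, target_ids.unsqueeze(0)], dim=1).to(device)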

        outputs = decoder(
            inputs_embeds=inputs_embeds,
            labels=labels
        )

        loss = outputs.loss

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        loader.set_postfix(loss=f"{loss.item()}")
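
The label construction in the training loop can be illustrated on plain lists. This is a minimal sketch with a hypothetical helper `build_labels`: positions belonging to the image token and the prompt get `-100`, the default ignore index of PyTorch's cross-entropy loss, so only the target code tokens contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(num_image_tokens: int, prompt_ids: list, target_ids: list) -> list:
    """Return labels aligned with [image tokens | prompt tokens | target tokens]."""
    prefix_len = num_image_tokens + len(prompt_ids)
    # Prefix positions are ignored by the loss; target positions keep their token ids.
    return [IGNORE_INDEX] * prefix_len + list(target_ids)


labels = build_labels(num_image_tokens=1, prompt_ids=[5, 6, 7], target_ids=[8, 9])
print(labels)
```

The length of the result always equals the length of the concatenated input sequence, which is what the decoder expects when `labels` is passed alongside `inputs_embeds`.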

# Other Ideas
1. Augment the images by changing the view angle and use views from multiple angles to improve the result. Both late fusion and early fusion can be tried.
2. In practice, where we do not have the ground-truth generation code, multiple views can be synthesized using NeRF-family neural networks.
3. Estimate the depth map of the component using depth estimation models, then build a point cloud from it. Rotate the component to align it with the x, y, and z axes, and then extract key features such as height, width, and length to inform the decoder. This can be done by adding the information to the prompt or by embedding it into feature tokens.
4. Attach a second agent (like ChatGPT) to correct syntax mistakes made by the code generator.
5. Randomly mask some parts of the image (but not key parts like a hole), then train the model as usual. This can be generalized to standard image augmentation, such as rotation, brightness changes, etc.
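
Idea 4 could be wired up roughly as follows. This is only a sketch: `generate` and `repair` are hypothetical callables standing in for the code generator and the second agent, and the check uses Python's built-in `compile`, which catches syntax errors only, not CadQuery runtime errors.

```python
from typing import Callable, Optional


def first_syntax_error(code: str) -> Optional[str]:
    """Return a short description of the first syntax error, or None if the code parses."""
    try:
        compile(code, "<generated>", "exec")
    except SyntaxError as exc:
        return f"{exc.msg} (line {exc.lineno})"
    return None


def generate_with_repair(
    generate: Callable[[], str],
    repair: Callable[[str, str], str],
    max_rounds: int = 3,
) -> str:
    """Generate code, then ask the repair agent to fix it until it parses or rounds run out."""
    code = generate()
    for _ in range(max_rounds):
        error = first_syntax_error(code)
        if error is None:
            break
        # The repair agent receives both the broken code and the error message.
        code = repair(code, error)
    return code
```

Capping the number of rounds keeps the cost bounded when the second agent cannot fix the code.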