# Silvaco TCAD Code Generation — Colab Training Notebook
This end-to-end notebook fine-tunes a sub-1B Qwen model (`Qwen/Qwen2-0.5B` by default) with QLoRA on the Silvaco dataset, saves the adapter and tokenizer artifacts, and runs a quick generation smoke test so you can plug the results back into `src/generate.py`. Run the cells in order.

**Workflow Overview**
1. Verify GPU and install dependencies
2. Upload your zipped project/data bundle (generated via `prepare_for_colab.sh` or manual zip)
3. Locate the processed Hugging Face dataset + define training hyperparameters
4. Fine-tune Qwen/Qwen2-0.5B with QLoRA
5. Save the adapter + tokenizer, export a download zip
6. Reload the checkpoint for an immediate Silvaco code generation sanity check

In [None]:
#@title ⛽️ Check GPU availability
!nvidia-smi
import torch, platform
print(f'Torch version: {torch.__version__}')
print(f'CUDA available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    device_name = torch.cuda.get_device_name(0)
    capability = torch.cuda.get_device_capability(0)
    print(f'GPU: {device_name} (capability {capability[0]}.{capability[1]})')
print(f'Python: {platform.python_version()}')

In [None]:
#@title 📦 Install project dependencies
!pip install -q -U \
  transformers==4.40.2 \
  accelerate==0.30.1 \
  datasets==2.19.0 \
  peft==0.11.1 \
  bitsandbytes==0.43.1 \
  sentence-transformers==3.0.1 \
  evaluate==0.4.2 \
  einops==0.7.0
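
Colab images ship with their own versions of several of these libraries, so it can be worth confirming that the pins took effect before training. The sketch below is one possible check built on the standard library's `importlib.metadata`. The `PINS` dictionary (a subset of the packages above) and the `installed_version` and `check_pins` helpers are illustrative names, not part of the project.

```python
from importlib import metadata

# Hypothetical subset of the pins from the install cell above.
PINS = {
    'transformers': '4.40.2',
    'accelerate': '0.30.1',
    'datasets': '2.19.0',
    'peft': '0.11.1',
    'bitsandbytes': '0.43.1',
}

def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def check_pins(pins, lookup=installed_version):
    """Map each mismatched package to (expected, found); an empty dict means all match."""
    mismatches = {}
    for package, expected in pins.items():
        found = lookup(package)
        if found != expected:
            mismatches[package] = (expected, found)
    return mismatches

for package, (expected, found) in check_pins(PINS).items():
    print(f'{package}: expected {expected}, found {found}')
```

If a mismatch is reported, restarting the runtime after the install cell and re-running the check may help, since a library imported earlier in the session can mask the newly installed version.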

## 1. Upload Your Project/Data Bundle
Upload the archive produced by `prepare_for_colab.sh` (for example, `silvaco_training_data.zip`), or zip the repo manually. The archive must contain at least:
- `data/processed/processed_dataset/` (Hugging Face dataset from `src/preprocess.py`)
- `src/` (so you can reference helper scripts later)
- optionally, `embeddings/` or `model/` artifacts if you want to test RAG on Colab

Re-running the upload cell wipes the existing `/content/project` folder to avoid stale files.
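
The layout requirements above can be checked locally before uploading, which avoids a wasted upload when a folder is missing. The following is a minimal sketch under the assumption that a required directory may sit either at the archive root or under a wrapping top-level folder. `REQUIRED_DIRS`, `missing_dirs`, and `check_archive` are hypothetical names introduced for illustration.

```python
import zipfile

# Directories the notebook lists as required inside the archive.
REQUIRED_DIRS = ('data/processed/processed_dataset/', 'src/')

def missing_dirs(member_names, required=REQUIRED_DIRS):
    """Return the required directories that no archive member lives under.

    A directory counts as present if a member name starts with it or contains
    it right after a '/', which covers archives with a wrapping root folder.
    """
    missing = []
    for required_dir in required:
        found = any(
            name.startswith(required_dir) or f'/{required_dir}' in name
            for name in member_names
        )
        if not found:
            missing.append(required_dir)
    return missing

def check_archive(zip_path):
    """Open a zip file on disk and report which required directories are absent."""
    with zipfile.ZipFile(zip_path) as archive:
        return missing_dirs(archive.namelist())
```

Calling `check_archive('silvaco_training_data.zip')` should return an empty list when both required directories are present.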

In [None]:
#@title 📁 Upload and extract your zip file
import shutil, zipfile, os
from pathlib import Path
from google.colab import files
project_dir = Path('/content/project')
if project_dir.exists():
    shutil.rmtree(project_dir)
project_dir.mkdir(parents=True, exist_ok=True)
print('Select your project zip (e.g., silvaco_training_data.zip)...')
uploaded = files.upload()
for filename, data in uploaded.items():
    local_path = Path(filename)
    with open(local_path, 'wb') as f:
        f.write(data)
    if local_path.suffix == '.zip':
        print(f'Extracting {local_path} ...')
        with zipfile.ZipFile(local_path, 'r') as zip_ref:
            zip_ref.extractall(project_dir)
print('Extraction complete. Listing top-level contents:')
for path in project_dir.iterdir():
    print(' -', path)

## 2. Locate Project Root and Dataset
The next cell detects the directory that contains both `src/` and `data/`, falling back to `/content/project` if no such directory is found. If your archive uses a different layout, adjust `PROJECT_ROOT` and `DATASET_PATH` accordingly.

In [None]:
#@title 🔍 Detect important directories
from pathlib import Path
project_dir = Path('/content/project')
def find_project_root(base: Path) -> Path:
    candidates = []
    for src_dir in base.rglob('src'):
        root = src_dir.parent
        if (root / 'data').exists():
            candidates.append(root)
    if candidates:
        return max(candidates, key=lambda p: len(str(p)))
    return base
PROJECT_ROOT = find_project_root(project_dir)
DATASET_PATH = PROJECT_ROOT / 'data' / 'processed' / 'processed_dataset'
print(f'PROJECT_ROOT -> {PROJECT_ROOT}')
print(f'DATASET_PATH -> {DATASET_PATH}')
if not DATASET_PATH.exists():
    raise FileNotFoundError('Processed dataset not found. Run src/preprocess.py locally and upload the resulting data/processed/processed_dataset folder.')

## 3. Load the Tokenized Dataset
We reuse the Hugging Face dataset saved by `src/preprocess.py`. This avoids re-tokenizing large Silvaco files on Colab.

In [None]:
#@title 📚 Load dataset and show stats
from datasets import load_from_disk
dataset = load_from_disk(str(DATASET_PATH))
dataset_splits = list(dataset.keys())
print('Available splits:', dataset_splits)
train_dataset = dataset['train']
eval_dataset = dataset['validation'] if 'validation' in dataset else None
print(f'Train samples: {len(train_dataset)}')
if eval_dataset is not None:
    print(f'Validation samples: {len(eval_dataset)}')
else:
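    print('Validation split missing -> holding out 5% of train for evaluation')
    # Reassign the train split as well so the held-out samples are not also trained on.
    split = train_dataset.train_test_split(test_size=0.05, seed=42)
    train_dataset, eval_dataset = split['train'], split['test']
    print(f'Train samples after split: {len(train_dataset)}; validation samples: {len(eval_dataset)}')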
print('Columns:', train_dataset.column_names)
print('Example keys in a sample:', train_dataset[0].keys())
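
Because the dataset arrives pre-tokenized, it may be useful to see how sample lengths compare with the sequence limit you plan to train with. The sketch below summarizes a list of token counts. `length_summary` is a hypothetical helper, and the assumption that the dataset exposes an `input_ids` column should be checked against the column names printed above.

```python
def length_summary(lengths, max_len=2048):
    """Summarize token counts and count how many samples exceed max_len."""
    lengths = list(lengths)
    if not lengths:
        return {'count': 0, 'max': 0, 'mean': 0.0, 'over_limit': 0}
    return {
        'count': len(lengths),
        'max': max(lengths),
        'mean': sum(lengths) / len(lengths),
        'over_limit': sum(1 for n in lengths if n > max_len),
    }

# Made-up token counts for illustration only.
print(length_summary([120, 980, 3100], max_len=2048))
```

With the loaded dataset, and assuming an `input_ids` column, this could be called as `length_summary(len(ids) for ids in train_dataset['input_ids'])`.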

## 4. Configure QLoRA Training
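Tweak the hyperparameters below to trade off speed vs. quality. Smaller micro-batch sizes reduce memory pressure on 16 GB GPUs such as the T4; raise `GRAD_ACCUM` in proportion to keep the effective batch size (`MICRO_BATCH_SIZE` × `GRAD_ACCUM`) unchanged.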
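
To see how the batch settings interact, the sketch below estimates the effective batch size and the number of optimizer steps. It is an approximation: `training_schedule` is a hypothetical helper, the dataset size of 1000 is made up, and the `Trainer` may round step counts slightly differently.

```python
import math

def training_schedule(num_samples, micro_batch_size, grad_accum,
                      num_epochs=1, warmup_ratio=0.03, num_devices=1):
    """Estimate optimizer steps; the Trainer's own rounding may differ slightly."""
    effective_batch = micro_batch_size * grad_accum * num_devices
    steps_per_epoch = max(1, math.ceil(num_samples / effective_batch))
    total_steps = steps_per_epoch * num_epochs
    return {
        'effective_batch': effective_batch,
        'steps_per_epoch': steps_per_epoch,
        'total_steps': total_steps,
        'warmup_steps': math.ceil(total_steps * warmup_ratio),
    }

# 1000 is a made-up dataset size for illustration only.
print(training_schedule(num_samples=1000, micro_batch_size=2, grad_accum=16))
```

Halving the micro-batch size while doubling the accumulation steps leaves the effective batch unchanged, which is the usual way to fit a run into less GPU memory at the cost of wall-clock time.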

In [None]:
#@title ⚙️ Hyperparameters
MODEL_NAME = 'Qwen/Qwen2-0.5B'  #@param {type:'string'}
OUTPUT_DIR = '/content/model_output'  #@param {type:'string'}
MICRO_BATCH_SIZE = 2  #@param {type:'integer'}
GRAD_ACCUM = 16  #@param {type:'integer'}
NUM_EPOCHS = 1  #@param {type:'integer'}
LEARNING_RATE = 0.0002  #@param {type:'number'}
MAX_SEQ_LENGTH = 2048  #@param {type:'integer'}
LORA_R = 16  #@param {type:'integer'}
LORA_ALPHA = 32  #@param {type:'integer'}
LORA_DROPOUT = 0.05  #@param {type:'number'}
print('Configuration:')
print(f'  Model: {MODEL_NAME}')
print(f'  Output dir: {OUTPUT_DIR}')
print(f'  Micro batch: {MICRO_BATCH_SIZE}, Grad accum: {GRAD_ACCUM}')
print(f'  Epochs: {NUM_EPOCHS}, Learning rate: {LEARNING_RATE}')
print(f'  LoRA r/alpha/dropout: {LORA_R}/{LORA_ALPHA}/{LORA_DROPOUT}')

In [None]:
#@title 🧠 Load tokenizer + base model with 4-bit quantization
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
dtype = torch.bfloat16 if torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=dtype,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type='nf4'
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=LORA_R,
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias='none',
    task_type='CAUSAL_LM',
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj']
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
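
As a cross-check on the trainable-parameter count printed above, the number of LoRA weights can be estimated by hand: each adapted linear layer with shape `(in, out)` adds `r * (in + out)` parameters. The sketch below applies that formula. The layer shapes are assumptions about the Qwen2-0.5B architecture (hidden size 896, 14 query heads and 2 key/value heads of dimension 64, 24 decoder layers) and should be verified against `model.config`; `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(r, shapes, num_layers):
    """Count LoRA weights: each (in, out) layer adds an r x in and an out x r matrix."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in shapes)
    return per_layer * num_layers

# Assumed attention projection shapes for Qwen2-0.5B; confirm with model.config.
QWEN2_0_5B_ATTENTION = {
    'q_proj': (896, 896),
    'k_proj': (896, 128),
    'v_proj': (896, 128),
    'o_proj': (896, 896),
}

estimate = lora_param_count(16, QWEN2_0_5B_ATTENTION.values(), num_layers=24)
print(f'Estimated LoRA parameters: {estimate:,}')
```

If the assumed shapes are right, the estimate should be close to the trainable count reported by `print_trainable_parameters()`; a large gap usually means the target modules or the shapes differ from what is assumed here.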

In [None]:
#@title 🏃‍♂️ Prepare Trainer
from transformers import DataCollatorForLanguageModeling, TrainingArguments, Trainer
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] >= 8
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=NUM_EPOCHS,
    per_device_train_batch_size=MICRO_BATCH_SIZE,
    gradient_accumulation_steps=GRAD_ACCUM,
    gradient_checkpointing=True,
    learning_rate=LEARNING_RATE,
    warmup_ratio=0.03,
    lr_scheduler_type='cosine',
    logging_steps=10,
    save_strategy='epoch',
    evaluation_strategy='epoch',
    bf16=use_bf16,
    fp16=not use_bf16,
    max_grad_norm=1.0,
    report_to='none',
    optim='paged_adamw_32bit'
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=data_collator,
)
print('Trainer ready!')

In [None]:
#@title 🚀 Launch training
train_result = trainer.train()
trainer.save_state()
metrics = train_result.metrics
print('Training metrics:', metrics)
eval_metrics = trainer.evaluate()
print('Evaluation metrics:', eval_metrics)

## 5. Save, Zip, and Download the Adapter + Tokenizer

In [None]:
#@title 💾 Persist artifacts
from pathlib import Path
adapter_dir = Path(OUTPUT_DIR) / 'adapter_model'
tokenizer_dir = Path(OUTPUT_DIR) / 'tokenizer'
adapter_dir.mkdir(parents=True, exist_ok=True)
tokenizer_dir.mkdir(parents=True, exist_ok=True)
trainer.model.save_pretrained(str(adapter_dir))
tokenizer.save_pretrained(str(tokenizer_dir))
trainer.save_metrics('train', metrics)
trainer.save_metrics('eval', eval_metrics)
print(f'Adapter saved to {adapter_dir}')
print(f'Tokenizer saved to {tokenizer_dir}')

In [None]:
#@title 📦 Create download zip
import shutil, os
from google.colab import files
zip_path = '/content/silvaco_trained_model.zip'
if os.path.exists(zip_path):
    os.remove(zip_path)
shutil.make_archive('/content/silvaco_trained_model', 'zip', OUTPUT_DIR)
print(f'Created {zip_path}')
files.download(zip_path)

## 6. Reload Adapter for a Quick Generation Test
The next cell checks that the adapter you just trained loads and generates Silvaco code without leaving Colab. For full RAG + benchmarking, move the adapter/tokenizer back into your local repo and run `src/generate.py` / `benchmark/eval.py`.

In [None]:
#@title 🧪 Smoke test the fine-tuned model
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
from pathlib import Path
adapter_dir = Path(OUTPUT_DIR) / 'adapter_model'
tokenizer_dir = Path(OUTPUT_DIR) / 'tokenizer'
assert adapter_dir.exists(), 'Adapter folder missing'
smoke_tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir, trust_remote_code=True)
if smoke_tokenizer.pad_token is None:
    smoke_tokenizer.pad_token = smoke_tokenizer.eos_token
base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    trust_remote_code=True,
    torch_dtype=dtype,
    device_map='auto'
)
ft_model = PeftModel.from_pretrained(base_model, adapter_dir)
ft_model.eval()
description = 'Design an NMOS transistor with 0.5um gate length, 10um width, and include DC sweep from 0-3V gate voltage.'
prompt = (
    'You are a semiconductor TCAD code generator.\n\n'
    'Write a Silvaco ATLAS .in file based on the following device description:\n'
    f'{description}\n\n'
    'Use correct Silvaco syntax with mesh, regions, electrodes, materials, doping, models, and solve steps.\n\n'
    'Silvaco code:\n'
)
inputs = smoke_tokenizer(prompt, return_tensors='pt').to(ft_model.device)
with torch.no_grad():
    output = ft_model.generate(
        **inputs,
        max_new_tokens=512,
        temperature=0.2,
        top_p=0.95,
        do_sample=True,
        pad_token_id=smoke_tokenizer.pad_token_id
    )
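# Decode only the newly generated tokens; slicing the decoded string by len(prompt)
# can misalign if decoding does not reproduce the prompt text exactly.
prompt_length = inputs['input_ids'].shape[1]
generated_code = smoke_tokenizer.decode(output[0][prompt_length:], skip_special_tokens=True)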
print(generated_code[:2000])
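
Reading the output by eye is the main point of the smoke test, but a rough automated check can flag obviously incomplete decks. The sketch below looks for the statement types the prompt asks for at the start of each line. It is a heuristic only: `EXPECTED_STATEMENTS` and `missing_statements` are hypothetical names, abbreviated statement keywords are not recognized, and passing the check says nothing about whether the deck actually runs.

```python
import re

# Statement keywords taken from the sections the prompt asks for.
EXPECTED_STATEMENTS = ('mesh', 'region', 'electrode', 'doping', 'models', 'solve')

def missing_statements(code, expected=EXPECTED_STATEMENTS):
    """Return expected keywords that never appear as the first word of a line."""
    found = set()
    for line in code.splitlines():
        # Lines starting with '#' (comments) do not match and are skipped.
        match = re.match(r'\s*([A-Za-z_.]+)', line)
        if match:
            found.add(match.group(1).lower())
    return [keyword for keyword in expected if keyword not in found]
```

After the smoke test has run, `missing_statements(generated_code)` returns the list of statement types that did not show up, with an empty list meaning every expected keyword was seen at least once.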

## 7. Next Steps
1. Download `silvaco_trained_model.zip` and extract it into your repo’s `model/` directory.
2. Run `python src/generate.py --input "..." --adapter_path model/adapter_model` locally (with optional `--rag`) to verify outputs.
3. Execute `python benchmark/eval.py` to recompute metrics using the new adapter.
4. Update the technical report + presentation with the refreshed benchmark results.