# 🏠 HouseBrain LLM Training

**Train your custom architectural AI on Google Colab (Free GPU)**

This notebook walks you through fine-tuning the HouseBrain LLM with QLoRA on the `deepseek-ai/deepseek-coder-6.7b-base` model.
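
QLoRA keeps the quantized base weights frozen and trains only small low-rank adapter matrices. The sketch below shows why that is cheap: for a single weight matrix, a rank-`r` adapter adds `r * (d_out + d_in)` trainable parameters. The 4096 x 4096 layer size is an assumption chosen for illustration, not a value read from the model, and `lora_trainable_params` is a hypothetical helper, not part of the HouseBrain code.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """Parameters added by one LoRA adapter: B is (d_out x r), A is (r x d_in)."""
    return r * (d_out + d_in)

# Hypothetical layer size for illustration; the real model's dimensions may differ.
d_out, d_in, rank = 4096, 4096, 16

full = d_out * d_in
adapter = lora_trainable_params(d_out, d_in, rank)
print(f"Full matrix: {full:,} parameters")
print(f"LoRA adapter (r={rank}): {adapter:,} parameters ({adapter / full:.2%} of full)")
```

Under these assumptions the adapter is well under 1% of the size of the matrix it adapts, which is what makes fine-tuning feasible on a single free-tier GPU.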

---

## 📋 Prerequisites

1. **Enable GPU**: Runtime → Change runtime type → GPU (T4)
2. **Upload Dataset**: You'll need a HouseBrain dataset zip file
3. **Patience**: Training takes 2-4 hours

## 🎯 What You'll Get

- **Trained Model**: Ready-to-use HouseBrain LLM
- **Expected Performance**: 70-85% architectural compliance
- **Cost**: Completely free (Google Colab)

---

## 🚀 Step 1: Setup Environment

In [None]:
# Install required dependencies
!pip install torch transformers datasets accelerate peft bitsandbytes wandb tqdm fastapi uvicorn pydantic orjson svgwrite trimesh python-dotenv

print("✅ Dependencies installed successfully!")

In [None]:
# Clone the HouseBrain repository
!git clone https://github.com/Vinay-O/HouseBrainLLM.git
%cd HouseBrainLLM

print("✅ Repository cloned successfully!")

## 📁 Step 2: Upload Dataset

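Upload your HouseBrain dataset zip file (generated locally with `generate_dataset.py`). The archive should unpack to a directory whose name starts with `housebrain_dataset`, because the extraction cell looks for that prefix.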

In [None]:
# Upload your dataset zip file
from google.colab import files
uploaded = files.upload()

print(f"📦 Uploaded files: {list(uploaded.keys())}")

In [None]:
# Extract the dataset
import zipfile
import os

for filename in uploaded.keys():
    if filename.endswith('.zip'):
        print(f"📂 Extracting {filename}...")
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall('.')
        print(f"✅ Extracted {filename}")

# List available datasets
datasets = [d for d in os.listdir('.') if d.startswith('housebrain_dataset') and os.path.isdir(d)]
print(f"\n📊 Available datasets: {datasets}")
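
If the zip unpacks to an unexpected directory name, the list above is empty and the problem only surfaces later, during training setup. A minimal sketch of a stricter lookup that fails early with a readable message is shown here. `find_datasets` and `require_dataset` are hypothetical helpers, not part of the HouseBrain repository.

```python
import os

def find_datasets(base_dir: str = ".", prefix: str = "housebrain_dataset") -> list[str]:
    """Return sorted names of directories under base_dir that start with prefix."""
    return sorted(
        name for name in os.listdir(base_dir)
        if name.startswith(prefix) and os.path.isdir(os.path.join(base_dir, name))
    )

def require_dataset(base_dir: str = ".") -> str:
    """Return the first matching dataset directory or raise a descriptive error."""
    found = find_datasets(base_dir)
    if not found:
        raise FileNotFoundError(
            f"No 'housebrain_dataset*' directory found in {os.path.abspath(base_dir)}; "
            "check that the zip was uploaded and extracted."
        )
    return found[0]
```

Calling `require_dataset()` right after extraction would stop the notebook at the step where the mistake happened, rather than a few cells later.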

## ⚙️ Step 3: Configure Training

Set up your training configuration

In [None]:
# Import training modules
import sys
sys.path.append('src')

from housebrain.finetune import FineTuningConfig, HouseBrainFineTuner
import torch

print("✅ Training modules imported successfully!")

In [None]:
# Check GPU availability
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"🚀 GPU: {gpu_name} ({gpu_memory:.1f}GB VRAM)")
else:
    print("⚠️  No GPU detected. Training will be very slow on CPU.")
    print("   Please enable GPU: Runtime → Change runtime type → GPU")
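
The VRAM figure printed above matters because it decides whether the model fits at all. As a rough back-of-the-envelope sketch, the weights alone can be estimated from the parameter count and the bits stored per parameter. The 6.7e9 parameter count is taken from the model name and is an assumption. The estimate ignores activations, adapter weights, optimizer state, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 6.7e9  # nominal size taken from the model name (assumption)

fp16_gb = weight_memory_gb(NUM_PARAMS, 16)
int4_gb = weight_memory_gb(NUM_PARAMS, 4)
print(f"fp16 weights:  ~{fp16_gb:.2f} GB")
print(f"4-bit weights: ~{int4_gb:.2f} GB")
```

On a T4 with about 16 GB of VRAM, half-precision weights would leave little room for anything else, which is why the configuration in this notebook enables 4-bit loading.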

In [None]:
# Training configuration
# Use the first extracted dataset; otherwise fall back to the default directory name
dataset_name = datasets[0] if datasets else "housebrain_dataset_v5_50k"

config = FineTuningConfig(
    model_name="deepseek-ai/deepseek-coder-6.7b-base",
    dataset_path=dataset_name,
    output_dir="models/housebrain-colab-trained",
    max_length=1024,
    batch_size=2,  # Adjust based on GPU memory
    num_epochs=3,
    learning_rate=2e-4,
    use_4bit=True,  # Load the base model in 4-bit (the "Q" in QLoRA); requires a CUDA GPU
    fp16=True,      # Half-precision mixed training; requires a CUDA GPU
    warmup_steps=100,
    logging_steps=50,
    save_steps=500,
    eval_steps=500,
    gradient_accumulation_steps=4,
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.1,
)

print("📋 Training Configuration:")
print(f"   Model: {config.model_name}")
print(f"   Dataset: {config.dataset_path}")
print(f"   Output: {config.output_dir}")
print(f"   Epochs: {config.num_epochs}")
print(f"   Batch Size: {config.batch_size}")
print(f"   Learning Rate: {config.learning_rate}")
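
To see how these settings interact, the arithmetic below works out the effective batch size, the number of optimizer steps, and the LoRA scaling factor. It is only a sketch: the 50,000-sample dataset size is a guess based on the `_50k` suffix of the fallback dataset name, and it assumes a single GPU with every sample used for training.

```python
import math

batch_size = 2
gradient_accumulation_steps = 4
num_epochs = 3
warmup_steps = 100
lora_r, lora_alpha = 16, 32

# Hypothetical dataset size, guessed from the "_50k" suffix of the fallback dataset name.
num_samples = 50_000

# Gradients from several small batches are accumulated before each optimizer step.
effective_batch = batch_size * gradient_accumulation_steps
steps_per_epoch = math.ceil(num_samples / effective_batch)
total_steps = steps_per_epoch * num_epochs

print(f"Effective batch size: {effective_batch}")
print(f"Optimizer steps per epoch: {steps_per_epoch}")
print(f"Total optimizer steps: {total_steps}")
print(f"Warmup share: {warmup_steps / total_steps:.2%}")
print(f"LoRA scaling (alpha / r): {lora_alpha / lora_r}")
```

If you lower `batch_size` to fit in memory, raising `gradient_accumulation_steps` by the same factor keeps the effective batch size unchanged.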

## 🚀 Step 4: Start Training

Training takes 2-4 hours. Keep the browser tab open and the notebook active, because Colab disconnects idle sessions.

In [None]:
# Initialize trainer
print("🔧 Setting up trainer...")
trainer = HouseBrainFineTuner(config)
print("✅ Trainer initialized successfully!")

In [None]:
# Start training
print("🎯 Starting training...")
print("⏰ This will take 2-4 hours. Keep the notebook active!")
print("📊 Monitor progress below:")

try:
    trainer.train()
    print("\n🎉 Training completed successfully!")
except Exception as e:
    print(f"\n❌ Training failed: {e}")
    print("💡 Try reducing batch_size or using a smaller model")
    raise  # Re-raise so "Run all" stops here instead of saving a failed run
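
If the session disconnects partway through, any checkpoints already written are the only progress that survives. The sketch below locates the most recent one. It assumes the fine-tuner writes `checkpoint-<step>` directories under `output_dir`, which is the Hugging Face `Trainer` convention; whether `HouseBrainFineTuner` follows it is not confirmed here. `latest_checkpoint` is a hypothetical helper.

```python
import os
import re
from typing import Optional

def latest_checkpoint(output_dir: str) -> Optional[str]:
    """Return the path of the highest-numbered 'checkpoint-<step>' directory, or None."""
    if not os.path.isdir(output_dir):
        return None
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for name in os.listdir(output_dir):
        match = pattern.match(name)
        path = os.path.join(output_dir, name)
        # Compare step numbers numerically; sorting names as strings
        # would rank "checkpoint-500" above "checkpoint-1000".
        if match and os.path.isdir(path) and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), path
    return best_path
```

For example, `latest_checkpoint("models/housebrain-colab-trained")` returns `None` when nothing has been saved yet.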

## 💾 Step 5: Save Model

Save your trained model for download

In [None]:
# Save the trained model
print("💾 Saving model...")
trainer.save_model()
print("✅ Model saved successfully!")

In [None]:
# Create zip archive for download
import zipfile
import os

model_dir = config.output_dir
zip_path = "housebrain-model.zip"

print(f"📦 Creating zip archive: {zip_path}")

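with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Use `filenames` rather than `files` to avoid shadowing google.colab's `files` module
    for root, _dirs, filenames in os.walk(model_dir):
        for filename in filenames:
            file_path = os.path.join(root, filename)
            # Store paths relative to the model directory
            arcname = os.path.relpath(file_path, model_dir)
            zipf.write(file_path, arcname)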

print(f"✅ Zip archive created: {zip_path}")
print(f"📁 Archive size: {os.path.getsize(zip_path) / 1e6:.1f} MB")
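
Before downloading a large archive, it can be worth checking that it is readable. A minimal sketch using the standard library's `ZipFile.testzip()`, which reads every member and verifies its CRC, is shown here. `verify_archive` is a hypothetical helper name.

```python
import zipfile

def verify_archive(zip_path: str) -> int:
    """Check every member's CRC and return the number of entries in the archive."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        bad_member = zf.testzip()  # name of the first corrupt member, or None
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member in archive: {bad_member}")
        return len(zf.namelist())
```

Calling `verify_archive(zip_path)` after the cell above returns the entry count, and an empty archive (count of 0) is a sign that the model directory was never populated.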

In [None]:
# Download the trained model
from google.colab import files

print("⬇️  Downloading trained model...")
files.download(zip_path)
print("✅ Download started. Check your browser's downloads.")