# 🏠 HouseBrain LLM Training

**Train your custom architectural AI on Google Colab/Kaggle (Free GPU)**

This notebook fine-tunes the HouseBrain LLM on the enhanced 150K-sample dataset using QLoRA (LoRA adapters on a 4-bit quantized base model).

---

## 🎯 Training Strategy

### **Option 1: Kaggle (Recommended)**
- **GPU**: P100 (16GB VRAM)
- **Training Time**: 5-7 hours
- **Cost**: Free
- **Quality**: Excellent

### **Option 2: Colab**
- **GPU**: T4 (16GB VRAM)
- **Training Time**: 6-8 hours
- **Cost**: Free
- **Quality**: Very Good

## 📊 Enhanced Dataset Features

Your 150K enhanced dataset includes:
- **Plot Shape & Orientation**
- **Exterior Finishes & Materials**
- **Climate & Site Conditions**
- **Building Codes & Regulations**
- **Garage & Parking**
- **Utilities & Accessibility**

## 🚀 Expected Results

- **Training Loss**: < 0.8 (target), < 0.6 (excellent)
- **Validation Loss**: < 1.0 (target), < 0.8 (excellent)
- **Compliance Score**: 85-95% (excellent)
- **Generation Speed**: < 10s per design
- **Enhanced Output**: Includes all architectural parameters
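
To make the loss targets above easier to interpret, here is a minimal sketch that converts a loss value to perplexity and rates it against the thresholds. It assumes the reported loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language model training but is not confirmed for this trainer. The helper names `perplexity` and `rate_loss` are illustrative only.

```python
import math


def perplexity(loss: float) -> float:
    """Convert a mean per-token cross-entropy loss (assumed to be in nats) to perplexity."""
    return math.exp(loss)


def rate_loss(loss: float, target: float, excellent: float) -> str:
    """Rate a loss against the 'target' and 'excellent' thresholds listed above."""
    if loss < excellent:
        return "excellent"
    if loss < target:
        return "target met"
    return "above target"


# Thresholds taken from the list above.
for label, loss in [
    ("train target", 0.8),
    ("train excellent", 0.6),
    ("val target", 1.0),
    ("val excellent", 0.8),
]:
    print(f"{label:16s} loss {loss:.1f} -> perplexity {perplexity(loss):.2f}")

# Example: rate a hypothetical final training loss of 0.72.
print(rate_loss(0.72, target=0.8, excellent=0.6))
```

Under that assumption, a loss of 0.8 corresponds to a perplexity of about 2.23 and a loss of 0.6 to about 1.82.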

---

## 🚀 Step 1: Setup Environment

In [None]:
# Install required dependencies
!pip install torch transformers datasets accelerate peft bitsandbytes wandb tqdm fastapi uvicorn pydantic orjson svgwrite trimesh python-dotenv

print("✅ Dependencies installed successfully!")

In [None]:
# Clone the HouseBrain repository
!git clone https://github.com/Vinay-O/HouseBrainLLM.git
%cd HouseBrainLLM

print("✅ Repository cloned successfully!")

## 📤 Step 2: Upload Enhanced Dataset

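Upload your 150K enhanced dataset zip file (`housebrain_dataset_v5_150k_colab.zip`). The cell below uses `google.colab.files`, which is specific to Colab; on Kaggle, attach the zip as a notebook input instead.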

In [None]:
# Upload your enhanced dataset
from google.colab import files
import zipfile
import os

print("📤 Upload your enhanced dataset zip file...")
print("💡 Upload: housebrain_dataset_v5_150k_colab.zip")

uploaded = files.upload()

# Extract the dataset
for filename in uploaded.keys():
    if filename.endswith('.zip'):
        print(f"📦 Extracting {filename}...")
        with zipfile.ZipFile(filename, 'r') as zip_ref:
            zip_ref.extractall('.')
        print(f"✅ Dataset extracted successfully!")
        break

# List extracted files
# The loop variable is `filenames`, not `files`, so the `google.colab.files` import is not shadowed
print("\n📁 Extracted files:")
for root, dirs, filenames in os.walk('.'):
    if 'housebrain_dataset_v5_150k' in root:
        print(f"   {root}")
        for name in filenames[:5]:  # Show first 5 files
            print(f"     - {name}")
        if len(filenames) > 5:
            print(f"     ... and {len(filenames) - 5} more files")
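
Before configuring training, it can help to confirm that the extraction produced the files you expect. The sketch below counts files by extension under the dataset directory. The directory name is taken from the training configuration in this notebook; the helper names `list_files` and `count_by_extension` are illustrative, and nothing is assumed about the file format inside the dataset.

```python
import os
from collections import Counter
from pathlib import Path


def list_files(root):
    """Return every file path under `root` (empty list if `root` does not exist)."""
    return [
        os.path.join(dirpath, name)
        for dirpath, _, names in os.walk(root)
        for name in names
    ]


def count_by_extension(paths):
    """Count paths by lowercase file extension; files without one go under '(none)'."""
    return Counter(Path(p).suffix.lower() or "(none)" for p in paths)


DATASET_DIR = "housebrain_dataset_v5_150k_colab"  # same value as dataset_path in the config

counts = count_by_extension(list_files(DATASET_DIR))
if not counts:
    print(f"No files found under {DATASET_DIR}; check the extraction step.")
for ext, n in counts.most_common():
    print(f"{ext:10s} {n:>8,d}")
```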

## ⚙️ Step 3: Configure Training

Set up your training parameters for the enhanced dataset

In [None]:
# Import training modules
import sys
sys.path.append('src')

from housebrain.finetune import FineTuningConfig, HouseBrainFineTuner
import torch

print("✅ Training modules imported successfully!")

# Check GPU
if torch.cuda.is_available():
    gpu_name = torch.cuda.get_device_name(0)
    gpu_memory = torch.cuda.get_device_properties(0).total_memory / 1e9
    print(f"🚀 GPU: {gpu_name} ({gpu_memory:.1f}GB VRAM)")
else:
    print("⚠️  No GPU detected. Training will be very slow on CPU.")

In [None]:
# Training configuration for 150K enhanced dataset
config = FineTuningConfig(
    model_name="deepseek-ai/deepseek-coder-6.7b-base",
    dataset_path="housebrain_dataset_v5_150k_colab",  # Your dataset path
    output_dir="models/housebrain-colab-trained",
    max_length=1024,
    batch_size=2,  # Adjust based on GPU memory
    num_epochs=3,
    learning_rate=2e-4,
    use_4bit=True,  # Use 4-bit quantization for memory efficiency
    fp16=True,  # Use mixed precision training
    warmup_steps=100,
    logging_steps=50,
    save_steps=500,
    eval_steps=500,
    gradient_accumulation_steps=4,
    lora_r=16,
    lora_alpha=32,
    lora_dropout=0.1,
)

print(f"📋 Training Configuration:")
print(f"   Model: {config.model_name}")
print(f"   Dataset: {config.dataset_path}")
print(f"   Output: {config.output_dir}")
print(f"   Samples: 150,000 enhanced")
print(f"   Batch Size: {config.batch_size}")
print(f"   Epochs: {config.num_epochs}")
print(f"   Learning Rate: {config.learning_rate}")
print(f"   4-bit Quantization: {config.use_4bit}")
print(f"   Mixed Precision: {config.fp16}")
print(f"   LoRA Rank: {config.lora_r}")
print(f"   LoRA Alpha: {config.lora_alpha}")
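
The batch settings above determine how many optimizer steps the run takes, which in turn shows how large `warmup_steps`, `save_steps`, and `eval_steps` are relative to the whole run. The following is a rough sketch of that arithmetic. It assumes a single GPU, that all 150,000 samples are used for training (no held-out split), and that a final partial batch still counts as a step; the actual trainer may round differently.

```python
import math


def optimizer_steps(num_samples, batch_size, grad_accum, num_epochs):
    """Return (effective batch size, total optimizer steps) for a single-GPU run."""
    effective_batch = batch_size * grad_accum
    steps_per_epoch = math.ceil(num_samples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


# Values copied from the configuration above.
effective_batch, total_steps = optimizer_steps(
    num_samples=150_000, batch_size=2, grad_accum=4, num_epochs=3
)

print(f"Effective batch size: {effective_batch}")
print(f"Total optimizer steps: {total_steps:,}")
print(f"Warmup share (100 steps): {100 / total_steps:.2%}")
print(f"Checkpoints at save_steps=500: {total_steps // 500}")
```

With these settings the effective batch size is 8 and the run has 56,250 optimizer steps, so the 100 warmup steps cover well under 1% of training.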

## 🧠 Step 4: Start Training

Train your HouseBrain LLM on the enhanced dataset

In [None]:
# Initialize trainer
print("🔧 Setting up trainer...")
trainer = HouseBrainFineTuner(config)
print("✅ Trainer initialized successfully!")
print(f"\n📊 Training on enhanced dataset with:")
print(f"   • Plot shape & orientation")
print(f"   • Exterior finishes & materials")
print(f"   • Climate & site conditions")
print(f"   • Building codes & regulations")
print(f"   • Garage & parking requirements")
print(f"   • Utilities & accessibility")

In [None]:
# Start training
print("🎯 Starting training...")
print("⏰ This will take 5-7 hours on GPU")
print("📊 Training on 150K enhanced samples...")
print("💡 Keep this notebook active and don't close the browser tab!")

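try:
    trainer.train()
    print("\n🎉 Training completed successfully!")
except Exception as e:
    print(f"\n❌ Training failed: {e}")
    print("💡 If this is an out-of-memory error, reduce batch_size or max_length and rerun")
    raise  # Re-raise so the save and download cells do not run after a failed training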

## 💾 Step 5: Save Trained Model

Save your trained model for later use

In [None]:
# Save the trained model
print("💾 Saving trained model...")
trainer.save_model()
print("✅ Model saved successfully!")

# Create zip archive for download
import zipfile
import os
from pathlib import Path

model_dir = Path(config.output_dir)
zip_path = "housebrain-model-colab-150k.zip"

print(f"📦 Creating zip archive: {zip_path}")
print("⏰ This may take 2-3 minutes...")

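with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # `filenames` avoids shadowing the `google.colab.files` import used elsewhere
    for root, dirs, filenames in os.walk(model_dir):
        for name in filenames:
            file_path = os.path.join(root, name)
            # Store paths relative to the model directory (no top-level folder in the archive)
            arcname = os.path.relpath(file_path, model_dir)
            zipf.write(file_path, arcname)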

print(f"✅ Zip archive created: {zip_path}")
print(f"📁 Archive size: {os.path.getsize(zip_path) / 1e6:.1f} MB")
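
Because the archive is large and the download can take a while, it may be worth checking it before downloading. The sketch below uses the standard library's `ZipFile.testzip()`, which reads every member and returns the name of the first one with a bad CRC, or `None` if all pass. The helper name `verify_zip` is illustrative; the archive name matches the one created above.

```python
import os
import zipfile


def verify_zip(source):
    """Return (number of members, name of first corrupt member or None)."""
    with zipfile.ZipFile(source) as zf:
        return len(zf.namelist()), zf.testzip()


zip_path = "housebrain-model-colab-150k.zip"  # archive name used in this notebook

if os.path.exists(zip_path):
    n_files, first_bad = verify_zip(zip_path)
    if first_bad is None:
        print(f"Archive OK: {n_files} files")
    else:
        print(f"Corrupt member: {first_bad}; recreate the archive before downloading")
else:
    print(f"{zip_path} not found; run the archive cell first")
```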

## ⬇️ Step 6: Download Trained Model

Download your trained model to your computer

In [None]:
# Download the trained model
from google.colab import files

print("⬇️  Downloading trained model...")
print(f"📦 File: {zip_path}")
print(f"📁 Size: {os.path.getsize(zip_path) / 1e6:.1f} MB")
print("💡 This may take a few minutes to download...")

files.download(zip_path)
print("✅ Trained model downloaded successfully!")

## 🧪 Step 7: Test Trained Model (Optional)

Test your trained model with a sample input

In [None]:
# Test the trained model
print("🧪 Testing trained model...")

# Sample input for testing
test_input = {
    "basicDetails": {
        "totalArea": 2000,
        "unit": "sqft",
        "bedrooms": 3,
        "bathrooms": 2,
        "floors": 2,
        "budget": 400000,
        "style": "Modern"
    },
    "plot": {
        "length": 50,
        "width": 40,
        "unit": "ft",
        "shape": "Rectangle",
        "orientation": "S",
        "slope_degrees": 2.5,
        "is_corner_plot": False,
        "setbacks_ft": {
            "front": 5,
            "rear": 5,
            "left": 3,
            "right": 3
        }
    },
    "roomBreakdown": [
        {"type": "master_bedroom", "count": 1, "minArea": 200},
        {"type": "bedroom", "count": 2, "minArea": 150},
        {"type": "bathroom", "count": 2, "minArea": 60},
        {"type": "kitchen", "count": 1, "minArea": 180},
        {"type": "livingRoom", "count": 1, "minArea": 300},
        {"type": "diningRoom", "count": 1, "minArea": 150}
    ]
}

print("📋 Test Input:")
print(f"   Area: {test_input['basicDetails']['totalArea']} sqft")
print(f"   Bedrooms: {test_input['basicDetails']['bedrooms']}")
print(f"   Floors: {test_input['basicDetails']['floors']}")
print(f"   Style: {test_input['basicDetails']['style']}")
print(f"   Plot Shape: {test_input['plot']['shape']}")
print(f"   Orientation: {test_input['plot']['orientation']}")

# Test with trained model
try:
    from housebrain.llm import HouseBrainLLM
    
    llm = HouseBrainLLM(finetuned_model_path=config.output_dir)
    result = llm.generate_design(test_input)
    
    print("\n✅ Model test successful!")
    print(f"📊 Generated design with {len(result.levels)} levels")
    print(f"💰 Construction cost: ${result.construction_cost:,}")
    print(f"📐 Total area: {result.total_area} sqft")
    
    # Check for enhanced features
    if hasattr(result, 'exterior_specifications'):
        print(f"🏠 Exterior: {result.exterior_specifications.get('exterior_wall', 'Unknown')}")
    if hasattr(result, 'climate_and_site'):
        print(f"🌡️  Climate: {result.climate_and_site.get('climate_zone', 'Unknown')}")
    
except Exception as e:
    print(f"\n❌ Model test failed: {e}")
    print("💡 Check that training finished and that the model was saved to config.output_dir")
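
When a test fails or produces an odd design, it helps to rule out an infeasible request first. The sketch below checks two things for an input shaped like `test_input`: that the summed minimum room areas fit within `totalArea`, and that the plot minus setbacks can hold that area across the requested floors. It assumes `length` runs front to rear, `width` runs side to side, and every floor can use the full buildable footprint; none of this is confirmed by the HouseBrain schema, and the helper names are illustrative.

```python
def required_room_area(room_breakdown):
    """Sum of count * minArea over all requested room types."""
    return sum(room["count"] * room["minArea"] for room in room_breakdown)


def buildable_area(plot, floors):
    """Footprint left after setbacks, multiplied by the number of floors.

    Assumption: length is front-to-rear, width is left-to-right.
    """
    setbacks = plot["setbacks_ft"]
    depth = plot["length"] - setbacks["front"] - setbacks["rear"]
    width = plot["width"] - setbacks["left"] - setbacks["right"]
    return max(depth, 0) * max(width, 0) * floors


# Same values as the relevant parts of test_input above.
sample = {
    "basicDetails": {"totalArea": 2000, "floors": 2},
    "plot": {
        "length": 50,
        "width": 40,
        "setbacks_ft": {"front": 5, "rear": 5, "left": 3, "right": 3},
    },
    "roomBreakdown": [
        {"type": "master_bedroom", "count": 1, "minArea": 200},
        {"type": "bedroom", "count": 2, "minArea": 150},
        {"type": "bathroom", "count": 2, "minArea": 60},
        {"type": "kitchen", "count": 1, "minArea": 180},
        {"type": "livingRoom", "count": 1, "minArea": 300},
        {"type": "diningRoom", "count": 1, "minArea": 150},
    ],
}

rooms = required_room_area(sample["roomBreakdown"])
capacity = buildable_area(sample["plot"], sample["basicDetails"]["floors"])
total = sample["basicDetails"]["totalArea"]

print(f"Minimum room area: {rooms} sqft (requested total: {total} sqft)")
print(f"Buildable area over all floors: {capacity} sqft")
print("Feasible" if rooms <= total <= capacity else "Infeasible request")
```

For the sample input, the rooms need at least 1,250 sqft and the plot can hold 2,720 sqft over two floors, so the 2,000 sqft request is feasible under these assumptions.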

## 🎯 Next Steps

### 1. **Download Trained Model** ✅
Your trained HouseBrain LLM has been downloaded.

### 2. **Use Locally**
1. Extract the model zip file
2. Place in your local HouseBrain project
3. Use with the API or test scripts
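
The archive created in Step 5 stores files relative to the model directory, so it has no top-level folder; extract it into a dedicated directory rather than the project root. A minimal sketch of the local steps, assuming your local project mirrors the notebook's `models/housebrain-colab-trained` layout (that path and the helper name `extract_model` are assumptions):

```python
import zipfile
from pathlib import Path


def extract_model(zip_source, target_dir):
    """Extract the model archive into target_dir and return the sorted member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_source) as zf:
        zf.extractall(target)
        return sorted(zf.namelist())


ZIP_PATH = Path("housebrain-model-colab-150k.zip")  # file downloaded in Step 6
MODEL_DIR = "models/housebrain-colab-trained"       # assumed local location

if ZIP_PATH.exists():
    names = extract_model(ZIP_PATH, MODEL_DIR)
    print(f"Extracted {len(names)} files to {MODEL_DIR}")
else:
    print(f"{ZIP_PATH} not found; place the downloaded archive next to this script")
```

The extracted directory is what you would pass as `finetuned_model_path` when constructing `HouseBrainLLM`, as in the test cell of this notebook.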

### 3. **Deploy**
1. Upload to cloud platforms
2. Integrate with web applications
3. Use for architectural design services

### 4. **Evaluate Performance**
1. Test with various inputs
2. Compare with baseline models
3. Measure compliance scores

---

## 📊 Training Results Summary

### **Dataset Used**
- **Size**: 150,000 enhanced samples
- **Features**: 6+ crucial architectural parameters
- **Quality**: Excellent diversity and realism

### **Model Performance**
- **Base Model**: DeepSeek Coder 6.7B
- **Fine-tuning**: QLoRA (LoRA + 4-bit quantization)
- **Training Time**: 5-7 hours
- **Expected Compliance**: 85-95%
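
To give a sense of why QLoRA is cheap to train, the sketch below counts the trainable parameters that LoRA adds at rank 16. A LoRA adapter on a weight matrix of shape `d_in x d_out` adds `r * (d_in + d_out)` parameters, and its update is scaled by `lora_alpha / r`. The hidden size (4096), layer count (32), and the choice of adapting only the query and value projections are assumptions made for illustration; which modules `HouseBrainFineTuner` actually targets is not stated here.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# From the notebook configuration.
LORA_R = 16
LORA_ALPHA = 32

# Assumptions for illustration only (not confirmed by this notebook).
HIDDEN_SIZE = 4096
NUM_LAYERS = 32
ADAPTED_MODULES_PER_LAYER = 2  # e.g. query and value projections
BASE_PARAMS = 6.7e9

per_module = lora_params(HIDDEN_SIZE, HIDDEN_SIZE, LORA_R)
total = per_module * ADAPTED_MODULES_PER_LAYER * NUM_LAYERS
scaling = LORA_ALPHA / LORA_R

print(f"Parameters per adapted module: {per_module:,}")
print(f"Total trainable LoRA parameters: {total:,}")
print(f"Share of base model: {total / BASE_PARAMS:.3%}")
print(f"LoRA scaling factor (alpha / r): {scaling}")
```

Under these assumptions the adapters hold roughly 8.4 million parameters, a small fraction of one percent of the base model. Adapting more modules or raising `lora_r` increases this count proportionally.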

### **Enhanced Features Learned**
- **Plot Shape & Orientation**: Rectangle, L-shape, corner plot, etc.
- **Exterior Finishes**: Brick, stone, stucco, vinyl, wood, concrete
- **Climate Adaptation**: Hot, cold, tropical, Mediterranean zones
- **Building Codes**: FAR, height limits, parking requirements
- **Site Conditions**: Soil types, utilities, accessibility

## 🆘 Troubleshooting

### **Out of Memory**
- Reduce `batch_size` to 1
- Use smaller model: `deepseek-ai/deepseek-coder-1.3b-base`
- Reduce `max_length` to 512
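
To see why the model size matters so much for memory, here is a rough estimate of the memory taken by the weights alone at different precisions. This is a lower bound under simplifying assumptions: it ignores activations, LoRA gradients and optimizer state, quantization metadata, and framework overhead, all of which add to the real footprint. The parameter counts are the nominal 6.7B and 1.3B from the model names.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


MODELS = {
    "deepseek-coder-6.7b-base": 6.7e9,
    "deepseek-coder-1.3b-base": 1.3e9,
}

for name, params in MODELS.items():
    fp16 = weight_memory_gb(params, 16)
    int4 = weight_memory_gb(params, 4)
    print(f"{name}: fp16 {fp16:.2f} GB, 4-bit {int4:.2f} GB")
```

For the 6.7B model this gives about 13.4 GB in fp16 versus about 3.35 GB in 4-bit, which is why `use_4bit=True` leaves room for activations on a 16GB GPU. Activation memory grows with `batch_size` and `max_length`, which is why reducing those two settings also helps.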

### **Slow Training**
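- Reduce `num_epochs` to 2 (about a third fewer training steps)
- Keep `fp16=True` (already enabled)
- Increase `gradient_accumulation_steps` to 8 for fewer optimizer steps; the forward and backward passes per epoch stay the same, so expect only a small speedup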

### **Poor Results**
- Check dataset quality
- Increase `learning_rate` to 3e-4
- Increase `lora_r` to 32

---

**🎉 Congratulations! You've successfully trained your HouseBrain LLM!**

**Your model now understands all crucial architectural parameters and can generate high-quality house designs!**

For more information, visit: https://github.com/Vinay-O/HouseBrainLLM