# 🏗️ HouseBrain Dataset Generation

**Generate massive training datasets on Google Colab (Free CPU/GPU)**

This notebook generates large HouseBrain datasets for training your custom LLM.

---

## 📋 Strategy

1. **Generate Dataset on Colab** (this notebook)
2. **Download Dataset** to your computer
3. **Upload to Kaggle** for training
4. **Train Model on Kaggle** (separate notebook)

## 🎯 What You'll Get

- **Large Dataset**: 50K-100K+ samples
- **High Quality**: Realistic architectural parameters
- **Fast Generation**: Roughly 30-60 minutes for 50K samples on a Colab CPU with `fast_mode=True`
- **Free**: No cost involved

---

## 🚀 Step 1: Setup Environment

In [None]:
# Install required dependencies
!pip install torch transformers datasets accelerate peft bitsandbytes wandb tqdm fastapi uvicorn pydantic orjson svgwrite trimesh python-dotenv

print("✅ Dependencies installed successfully!")

In [None]:
# Clone the HouseBrain repository
!git clone https://github.com/Vinay-O/HouseBrainLLM.git
%cd HouseBrainLLM

print("✅ Repository cloned successfully!")

## ⚙️ Step 2: Configure Dataset Generation

Set the parameters that control dataset generation.

In [None]:
# Import dataset generation modules
import sys
sys.path.append('.')

from generate_dataset import DatasetConfig, HouseBrainDatasetGenerator
import os

print("✅ Dataset generation modules imported successfully!")

In [None]:
# Dataset generation configuration
# Adjust these parameters based on your needs

config = DatasetConfig(
    num_samples=50000,  # Number of samples to generate
    output_dir="housebrain_dataset_v5_50k_colab",  # Output directory
    train_ratio=0.9,  # Fraction of samples used for training; the rest go to validation
    min_plot_size=1000,  # Minimum plot area (sqft)
    max_plot_size=10000,  # Maximum plot area (sqft)
    min_bedrooms=1,  # Minimum bedrooms
    max_bedrooms=6,  # Maximum bedrooms
    min_floors=1,  # Minimum floors
    max_floors=4,  # Maximum floors
    min_budget=100000,  # Minimum budget
    max_budget=2000000,  # Maximum budget
    fast_mode=True,  # Skip layout solving for speed
)

print("📋 Dataset Configuration:")
print(f"   Samples: {config.num_samples:,}")
print(f"   Output: {config.output_dir}")
print(f"   Train Ratio: {config.train_ratio}")
print(f"   Plot Size: {config.min_plot_size:,} - {config.max_plot_size:,} sqft")
print(f"   Bedrooms: {config.min_bedrooms} - {config.max_bedrooms}")
print(f"   Floors: {config.min_floors} - {config.max_floors}")
print(f"   Budget: ${config.min_budget:,} - ${config.max_budget:,}")
print(f"   Fast Mode: {config.fast_mode}")
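
A rough sketch of how `num_samples` and `train_ratio` translate into split sizes. The helper `split_counts` is hypothetical and not part of the HouseBrain repository; it assumes the training share is rounded to the nearest whole sample, which may differ from what `generate_dataset` actually does.

```python
from typing import Tuple


def split_counts(num_samples: int, train_ratio: float) -> Tuple[int, int]:
    """Return (train, validation) sample counts for a given split ratio."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    # Assumption: the training share is rounded to the nearest whole sample
    train = round(num_samples * train_ratio)
    return train, num_samples - train


# Values taken from the configuration cell above
train_count, val_count = split_counts(50_000, 0.9)
print(f"train={train_count:,} validation={val_count:,}")
```

With the configuration above, this gives 45,000 training and 5,000 validation samples.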

## 🏗️ Step 3: Generate Dataset

Generation takes roughly 30-60 minutes for the 50,000 samples configured above; the exact time depends on `num_samples`.
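
One way to sanity-check this estimate before committing to a full run is to time a small pilot batch and extrapolate. The sketch below assumes generation time grows linearly with the sample count; `estimate_minutes` is a hypothetical helper, and the pilot figures are placeholders rather than measurements.

```python
def estimate_minutes(pilot_samples: int, pilot_seconds: float, target_samples: int) -> float:
    """Extrapolate total runtime in minutes from a timed pilot run (linear scaling assumed)."""
    if pilot_samples <= 0:
        raise ValueError("pilot_samples must be positive")
    seconds_per_sample = pilot_seconds / pilot_samples
    return seconds_per_sample * target_samples / 60.0


# Placeholder numbers, not measurements: a 500-sample pilot that took 27 seconds
projected = estimate_minutes(pilot_samples=500, pilot_seconds=27.0, target_samples=50_000)
print(f"Projected runtime: {projected:.0f} minutes")
```

To get a real pilot time, generate a small dataset first (for example by lowering `num_samples`) and time that cell with `%%time`.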

In [None]:
# Initialize dataset generator
print("🔧 Setting up dataset generator...")
generator = HouseBrainDatasetGenerator(config)
print("✅ Dataset generator initialized successfully!")

In [None]:
# Generate the dataset
print("🎯 Starting dataset generation...")
print(f"⏰ This will take 30-60 minutes for {config.num_samples:,} samples.")
print("📊 Monitor progress below:")

try:
    output_dir = generator.generate_dataset()
    print("\n🎉 Dataset generation completed successfully!")
    print(f"📁 Output directory: {output_dir}")
except Exception as e:
    print(f"\n❌ Dataset generation failed: {e}")
    print("💡 Try reducing num_samples, and keep fast_mode=True")
    raise  # Stop here so later cells do not package an incomplete dataset

## 📦 Step 4: Create Zip Archive

Package the output directory as a single zip file so it is easy to download and then upload to Kaggle.

In [None]:
# Create zip archive
import zipfile
import os
from pathlib import Path

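output_dir = Path(config.output_dir)
zip_path = f"{config.output_dir}.zip"

# Stop early if Step 3 did not produce the output directory
if not output_dir.is_dir():
    raise FileNotFoundError(f"Output directory not found: {output_dir}")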

print(f"📦 Creating zip archive: {zip_path}")

with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for root, dirs, files in os.walk(output_dir):
        for file in files:
            file_path = os.path.join(root, file)
            arcname = os.path.relpath(file_path, output_dir.parent)
            zipf.write(file_path, arcname)

print(f"✅ Zip archive created: {zip_path}")
print(f"📁 Archive size: {os.path.getsize(zip_path) / 1e6:.1f} MB")
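
Before downloading, it can be worth confirming that the archive is readable. The sketch below uses the standard-library `zipfile` module; `summarize_archive` is a hypothetical helper name, and the demo builds a throwaway archive with made-up member names so the block runs on its own.

```python
import tempfile
import zipfile
from pathlib import Path


def summarize_archive(archive_path):
    """Return (file_count, uncompressed_bytes) after a CRC check of every member."""
    with zipfile.ZipFile(archive_path) as zf:
        bad_member = zf.testzip()  # Returns None when all members pass the CRC check
        if bad_member is not None:
            raise zipfile.BadZipFile(f"Corrupt member: {bad_member}")
        members = [info for info in zf.infolist() if not info.is_dir()]
        return len(members), sum(info.file_size for info in members)


# Demo on a throwaway archive; the member names are illustrative only
with tempfile.TemporaryDirectory() as tmp:
    demo_zip = Path(tmp) / "demo.zip"
    with zipfile.ZipFile(demo_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("train/sample_0.json", "{}")
        zf.writestr("validation/sample_0.json", "{}")
    file_count, total_bytes = summarize_archive(demo_zip)
    print(f"{file_count} files, {total_bytes} bytes uncompressed")
```

In the notebook, calling `summarize_archive(zip_path)` on the real archive should report a non-zero file count; zero files would suggest the output directory was empty when it was zipped.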

## ⬇️ Step 5: Download Dataset

Download the zip archive to your computer.

In [None]:
# Download the dataset
from google.colab import files

print("⬇️  Starting browser download...")
files.download(zip_path)
print("✅ Download triggered. Check your browser's download list.")
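
Large downloads and re-uploads occasionally get truncated. A minimal sketch for recording a SHA-256 checksum of the archive so the copy on your computer can be compared against it; `sha256_of_file` is a hypothetical helper, and the demo hashes a temporary file so the block runs on its own.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large archives never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a temporary file; in the notebook, pass zip_path instead
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"housebrain")
    demo_path = tmp.name
try:
    demo_digest = sha256_of_file(demo_path)
    print(demo_digest)
finally:
    os.remove(demo_path)
```

In the notebook, call `sha256_of_file(zip_path)` and compare the result with a checksum computed on your computer, for example with `sha256sum` on Linux or `shasum -a 256` on macOS.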