# 🏗️ HouseBrain Enhanced Dataset Generation

**Generate 150K training samples with crucial architectural parameters on Google Colab (Free CPU)**

This notebook generates a 150K-sample HouseBrain dataset with enhanced architectural features, ready for training your custom LLM.

---

## 🆕 Enhanced Features

### **Plot & Site Parameters**
- **Plot Shape**: Rectangle, L-shape, irregular, corner plot, square
- **Orientation**: 8 compass directions
- **Slope**: 0-15 degrees
- **Corner Plot**: Special setback considerations

### **Exterior Finishes & Materials**
- **Exterior Wall**: Brick, stone, stucco, vinyl, wood, concrete, fiber cement
- **Roofing**: Asphalt shingles, metal, tile, slate, flat roof, wood shingles
- **Windows**: Single-hung, double-hung, casement, picture, sliding, bay
- **Doors**: Wood, steel, fiberglass, sliding glass, French
- **Garage**: Attached, detached, carport, none

### **Climate & Site Conditions**
- **Climate Zone**: Hot dry, hot humid, cold, temperate, tropical, Mediterranean
- **Seismic Zone**: Low, medium, high earthquake risk
- **Soil Type**: Clay, sandy, rocky, loamy, silty
- **Utilities**: City/well water, city/septic sewer, solar ready

### **Building Codes & Regulations**
- **Floor Area Ratio (FAR)**: 0.2-0.8
- **Height Restrictions**: 25-35 feet
- **Parking Requirements**: 1-3 spaces
- **Fire Safety**: Sprinklers, fire exits, fire walls
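To make the ranges above concrete, here is a minimal sketch of how one sample's site parameters could be drawn. The value lists and numeric ranges come from the bullets above. The field names, the `sample_site_parameters` helper, and the use of uniform sampling are assumptions for illustration and may not match what `generate_dataset.py` actually does.

```python
import random

# Value lists copied from the feature summary above. The key names and the
# dictionary layout are illustrative assumptions, not the repository's schema.
PLOT_SHAPES = ["rectangle", "l_shape", "irregular", "corner_plot", "square"]
ORIENTATIONS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]
CLIMATE_ZONES = ["hot_dry", "hot_humid", "cold", "temperate", "tropical", "mediterranean"]
SOIL_TYPES = ["clay", "sandy", "rocky", "loamy", "silty"]


def sample_site_parameters(rng: random.Random) -> dict:
    """Draw one hypothetical set of site parameters within the listed ranges."""
    return {
        "plot_shape": rng.choice(PLOT_SHAPES),
        "orientation": rng.choice(ORIENTATIONS),
        "slope_degrees": round(rng.uniform(0, 15), 1),
        "climate_zone": rng.choice(CLIMATE_ZONES),
        "soil_type": rng.choice(SOIL_TYPES),
        "floor_area_ratio": round(rng.uniform(0.2, 0.8), 2),
        "max_height_ft": rng.randint(25, 35),
        "parking_spaces": rng.randint(1, 3),
    }


# A seeded generator keeps the draw reproducible between runs.
sample = sample_site_parameters(random.Random(42))
print(sample)
```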

## 📋 Strategy

1. **Generate 150K samples on Colab** (this notebook)
2. **Download dataset** to your computer
3. **Upload to Kaggle** for training
4. **Train model on Kaggle** (separate notebook)

## 🎯 What You'll Get

- **150K Samples**: Massive training dataset
- **Enhanced Quality**: 6+ crucial architectural parameters
- **Fast Generation**: Runs on Colab's free CPU runtime with `fast_mode` enabled
- **Free**: No cost involved
- **Generation Time**: 90-120 minutes

---

## 🚀 Step 1: Setup Environment

In [None]:
# Install required dependencies
!pip install torch transformers datasets accelerate peft bitsandbytes wandb tqdm fastapi uvicorn pydantic orjson svgwrite trimesh python-dotenv

print("✅ Dependencies installed successfully!")

In [None]:
# Clone the HouseBrain repository
!git clone https://github.com/Vinay-O/HouseBrainLLM.git
%cd HouseBrainLLM

print("✅ Repository cloned successfully!")

## ⚙️ Step 2: Configure Enhanced Dataset Generation

Set up the dataset generation parameters, including all enhanced features.

In [None]:
# Import dataset generation modules
import sys
sys.path.append('.')

from generate_dataset import DatasetConfig, HouseBrainDatasetGenerator
import os

print("✅ Enhanced dataset generation modules imported successfully!")
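If the import above fails, the usual cause is that the notebook is not running inside the cloned repository. A small check like the following can confirm that before importing. The `module_available` helper is a hypothetical name; `importlib.util.find_spec` is the standard-library call it relies on.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module `name` can be found on the import path."""
    return importlib.util.find_spec(name) is not None


# generate_dataset is only found when the working directory (or sys.path)
# contains the cloned HouseBrainLLM repository.
if not module_available("generate_dataset"):
    print("generate_dataset not found; check that `%cd HouseBrainLLM` succeeded.")
```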

In [None]:
# Enhanced dataset generation configuration for 150K samples
# This includes all crucial architectural parameters

config = DatasetConfig(
    num_samples=150000,  # 150K samples for maximum free tier value
    output_dir="housebrain_dataset_v5_150k_colab",  # Output directory
    train_ratio=0.9,  # Train/validation split
    min_plot_size=1000,  # Minimum plot area (sqft)
    max_plot_size=10000,  # Maximum plot area (sqft)
    min_bedrooms=1,  # Minimum bedrooms
    max_bedrooms=6,  # Maximum bedrooms
    min_floors=1,  # Minimum floors
    max_floors=4,  # Maximum floors
    min_budget=100000,  # Minimum budget
    max_budget=2000000,  # Maximum budget
    fast_mode=True,  # Skip layout solving for speed
    # Enhanced styles
    styles=[
        "Modern", "Contemporary", "Traditional", "Colonial", "Mediterranean",
        "Craftsman", "Victorian", "Minimalist", "Scandinavian", "Industrial",
        "Tropical", "Rustic", "Art Deco", "Mid-Century Modern", "Gothic"
    ],
    # Enhanced regions
    regions=[
        "US_Northeast", "US_Southeast", "US_Midwest", "US_Southwest", "US_West",
        "EU_UK", "EU_Germany", "EU_France", "EU_Italy", "EU_Spain",
        "Asia_India", "Asia_China", "Asia_Japan", "Asia_Singapore", "Asia_Australia"
    ]
)

print(f"📋 Enhanced Dataset Configuration:")
print(f"   Samples: {config.num_samples:,}")
print(f"   Output: {config.output_dir}")
print(f"   Train Ratio: {config.train_ratio}")
print(f"   Plot Size: {config.min_plot_size:,} - {config.max_plot_size:,} sqft")
print(f"   Bedrooms: {config.min_bedrooms} - {config.max_bedrooms}")
print(f"   Floors: {config.min_floors} - {config.max_floors}")
print(f"   Budget: ${config.min_budget:,} - ${config.max_budget:,}")
print(f"   Fast Mode: {config.fast_mode}")
print(f"   Styles: {len(config.styles)} architectural styles")
print(f"   Regions: {len(config.regions)} global regions")
print(f"\n🎯 Enhanced Features:")
print(f"   • Plot shape & orientation")
print(f"   • Exterior finishes & materials")
print(f"   • Climate & site conditions")
print(f"   • Building codes & regulations")
print(f"   • Garage & parking requirements")
print(f"   • Utilities & accessibility")
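As a quick sanity check on the configuration, the sketch below computes the split sizes implied by `train_ratio` and flags any range whose minimum exceeds its maximum. Both helpers (`split_counts`, `inverted_ranges`) are hypothetical and not part of the repository; the exact rounding the generator uses for the split is an assumption.

```python
def split_counts(num_samples: int, train_ratio: float) -> tuple[int, int]:
    """Return (train, validation) sizes, rounding the train share to an integer."""
    train = round(num_samples * train_ratio)
    return train, num_samples - train


def inverted_ranges(ranges: dict[str, tuple[float, float]]) -> list[str]:
    """Return the names of any (min, max) pairs where min is greater than max."""
    return [name for name, (low, high) in ranges.items() if low > high]


# Values copied from the configuration cell above.
train_count, val_count = split_counts(150_000, 0.9)
problems = inverted_ranges({
    "plot_size": (1_000, 10_000),
    "bedrooms": (1, 6),
    "floors": (1, 4),
    "budget": (100_000, 2_000_000),
})

print(f"Train: {train_count:,}  Validation: {val_count:,}")
print(f"Inverted ranges: {problems or 'none'}")
```

With the values above this gives 135,000 training and 15,000 validation samples.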

## 🏗️ Step 3: Generate Enhanced Dataset

This will take 90-120 minutes for 150K samples with all enhanced features.
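The time estimate translates into a throughput you can compare against the progress output. The arithmetic below only restates the 90-120 minute figure; the `required_rate` helper is a hypothetical name, and the actual speed on your Colab session may differ.

```python
def required_rate(num_samples: int, minutes: float) -> float:
    """Samples per second needed to finish `num_samples` in `minutes`."""
    return num_samples / (minutes * 60)


NUM_SAMPLES = 150_000
for minutes in (90, 120):
    print(f"{minutes} min -> {required_rate(NUM_SAMPLES, minutes):.1f} samples/s")
```

So a run that finishes in the stated window needs to sustain roughly 21-28 samples per second.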

In [None]:
# Initialize enhanced dataset generator
print("🔧 Setting up enhanced dataset generator...")
generator = HouseBrainDatasetGenerator(config)
print("✅ Enhanced dataset generator initialized successfully!")
print(f"\n📊 Generator includes:")
print(f"   • {len(generator.plot_shapes)} plot shapes")
print(f"   • {len(generator.exterior_materials)} exterior materials")
print(f"   • {len(generator.roofing_materials)} roofing materials")
print(f"   • {len(generator.climate_zones)} climate zones")
print(f"   • {len(generator.soil_types)} soil types")
print(f"   • {len(generator.garage_types)} garage types")

In [None]:
# Generate the enhanced dataset
print("🎯 Starting enhanced dataset generation...")
print(f"⏰ This will take 90-120 minutes for {config.num_samples:,} samples.")
print("📊 Monitor progress below:")
print("💡 Keep this notebook active and don't close the browser tab!")

try:
    output_dir = generator.generate_dataset()
    print(f"\n🎉 Enhanced dataset generation completed successfully!")
    print(f"📁 Output directory: {output_dir}")
except Exception as e:
    print(f"\n❌ Dataset generation failed: {e}")
    print("💡 Try reducing num_samples or check your internet connection")
    raise  # Stop here so later cells do not zip a missing or partial dataset

## 📦 Step 4: Create Zip Archive

Create a zip file for easy download and upload to Kaggle.

In [None]:
# Create zip archive
import zipfile
import os
from pathlib import Path

output_dir = Path(config.output_dir)
zip_path = f"{config.output_dir}.zip"

print(f"📦 Creating zip archive: {zip_path}")
print("⏰ This may take 5-10 minutes...")

with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    for root, dirs, files in os.walk(output_dir):
        for file in files:
            file_path = os.path.join(root, file)
            arcname = os.path.relpath(file_path, output_dir.parent)
            zipf.write(file_path, arcname)

print(f"✅ Zip archive created: {zip_path}")
print(f"📁 Archive size: {os.path.getsize(zip_path) / 1e6:.1f} MB")

# Show dataset info
dataset_info_path = output_dir / "dataset_info.json"
if dataset_info_path.exists():
    import json
    with open(dataset_info_path, 'r') as f:
        info = json.load(f)
    print(f"\n📊 Enhanced Dataset Info:")
    print(f"   Name: {info.get('name', 'Unknown')}")
    print(f"   Version: {info.get('version', 'Unknown')}")
    print(f"   Total Samples: {info.get('num_samples', 0):,}")
    print(f"   Train Samples: {info.get('train_samples', 0):,}")
    print(f"   Validation Samples: {info.get('val_samples', 0):,}")
    print(f"   Enhanced Features: {len(info.get('enhanced_features', []))}")
    print(f"\n🎯 Enhanced Features:")
    for feature in info.get('enhanced_features', []):
        print(f"   • {feature}")
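Before downloading, it can be worth confirming that the archive is readable and contains every generated file. The sketch below uses the standard `zipfile.ZipFile.testzip()` check plus a file count. `verify_archive` is a hypothetical helper, and the demo builds a tiny stand-in directory so the block runs on its own.

```python
import os
import tempfile
import zipfile
from pathlib import Path


def verify_archive(zip_path: str, source_dir: Path) -> bool:
    """Check that the archive is readable and holds every file under source_dir."""
    expected = sum(len(files) for _, _, files in os.walk(source_dir))
    with zipfile.ZipFile(zip_path) as zipf:
        corrupt = zipf.testzip()  # first corrupt member name, or None if all pass
        archived = sum(1 for info in zipf.infolist() if not info.is_dir())
    return corrupt is None and archived == expected


# Demo on a throwaway directory that mimics the dataset layout (an assumption).
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp) / "demo_dataset"
    (demo_dir / "train").mkdir(parents=True)
    (demo_dir / "train" / "sample_0.json").write_text("{}")
    (demo_dir / "dataset_info.json").write_text("{}")

    demo_zip = str(Path(tmp) / "demo_dataset.zip")
    with zipfile.ZipFile(demo_zip, "w", zipfile.ZIP_DEFLATED) as zipf:
        for root, _, files in os.walk(demo_dir):
            for name in files:
                path = os.path.join(root, name)
                zipf.write(path, os.path.relpath(path, demo_dir.parent))

    archive_ok = verify_archive(demo_zip, demo_dir)

print(f"Archive OK: {archive_ok}")
```

In this notebook you could call `verify_archive(zip_path, output_dir)` right after the archive is written.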

## ⬇️ Step 5: Download Enhanced Dataset

Download the enhanced dataset to your computer.

In [None]:
# Download the enhanced dataset
from google.colab import files

print("⬇️  Downloading enhanced dataset...")
print(f"📦 File: {zip_path}")
print(f"📁 Size: {os.path.getsize(zip_path) / 1e6:.1f} MB")
print("💡 This may take a few minutes to download...")

files.download(zip_path)
print("✅ Download started! Check your browser's downloads.")
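Large browser downloads occasionally arrive truncated. One way to catch that is to record a checksum in Colab and compare it with the file on your computer. This is a minimal sketch using the standard `hashlib` module; `sha256_of_file` is a hypothetical helper, and the demo hashes a small temporary file instead of the real archive.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large archives never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file; in the notebook you would pass zip_path.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_checksum = sha256_of_file(tmp.name)
os.remove(tmp.name)

print(demo_checksum)
```

Run `sha256_of_file(zip_path)` before downloading, then compare the result with `sha256sum` (Linux) or `shasum -a 256` (macOS) on the downloaded file.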

## 🎯 Next Steps

### 1. **Download Enhanced Dataset** ✅
The enhanced dataset has been downloaded to your computer.

### 2. **Upload to Kaggle**
1. Go to [Kaggle](https://www.kaggle.com/)
2. Create a new dataset
3. Upload the zip file: `housebrain_dataset_v5_150k_colab.zip`
4. Make it public or private as needed

### 3. **Train on Kaggle**
Use the updated `colab_training.ipynb` notebook for training.

### 4. **Alternative: Train on Colab**
If you want to train on Colab instead:
1. Use the `colab_training.ipynb` notebook
2. Upload the dataset zip file
3. Follow the training instructions

---

## 📊 Enhanced Dataset Statistics

Your generated enhanced dataset includes:

### **Plot & Site Parameters**
- **Plot Shapes**: Rectangle, L-shape, irregular, corner plot, square
- **Orientations**: 8 compass directions
- **Slopes**: 0-15 degrees
- **Setbacks**: Front, rear, left, right (corner plot variations)

### **Exterior Finishes & Materials**
- **Exterior Walls**: 7 material types
- **Roofing**: 6 material types
- **Windows**: 6 window types
- **Doors**: 5 door types
- **Garage**: 4 garage configurations

### **Climate & Site Conditions**
- **Climate Zones**: 6 climate types
- **Seismic Zones**: 3 risk levels
- **Soil Types**: 5 soil types
- **Utilities**: Water, sewer, electricity, gas, solar

### **Building Codes & Regulations**
- **Floor Area Ratio**: 0.2-0.8
- **Height Restrictions**: 25-35 feet
- **Parking Requirements**: 1-3 spaces
- **Fire Safety**: Sprinklers, exits, walls
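Floor Area Ratio is the total floor area across all storeys divided by the plot area, so it caps how much can be built on a given plot. The sketch below applies that definition to the extremes of the configured plot sizes. `max_floor_area` is a hypothetical helper, and whether the generator pairs these particular values is an assumption.

```python
def max_floor_area(plot_area_sqft: float, far: float) -> float:
    """Total floor area (summed over all storeys) permitted by a given FAR."""
    return plot_area_sqft * far


# Smallest plot at the lowest FAR, largest plot at the highest FAR.
for plot_area, far in [(1_000, 0.2), (10_000, 0.8)]:
    allowed = max_floor_area(plot_area, far)
    print(f"{plot_area:,} sqft plot at FAR {far}: up to {allowed:,.0f} sqft of floor area")
```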

## 🆘 Troubleshooting

### **Out of Memory**
- Reduce `num_samples` to 100K
- Use `fast_mode=True` (already enabled)
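If reducing `num_samples` is not enough, another option is to generate the dataset in several smaller runs, each with its own output directory. The sketch below only plans the batches; `plan_batches` and the `_partNN` directory naming are hypothetical, and you would still need to create one `DatasetConfig` per batch. Whether separate runs produce non-overlapping samples depends on how the generator seeds its randomness, which is not confirmed here.

```python
def plan_batches(total: int, batch_size: int, base_dir: str) -> list[tuple[int, str]]:
    """Split `total` samples into batches, pairing each with an output directory."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batches = []
    remaining = total
    index = 0
    while remaining > 0:
        count = min(batch_size, remaining)
        batches.append((count, f"{base_dir}_part{index:02d}"))
        remaining -= count
        index += 1
    return batches


batches = plan_batches(150_000, 50_000, "housebrain_dataset_v5_150k_colab")
for count, out_dir in batches:
    print(f"{count:,} samples -> {out_dir}")
```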

### **Slow Generation**
- Use `fast_mode=True` (already enabled)
- Reduce `num_samples`

### **Poor Quality**
- All enhanced parameters are included automatically
- `fast_mode=True` skips layout solving; if samples look unrealistic, try `fast_mode=False` with a smaller `num_samples` and expect a longer run

---

**🎉 Congratulations! You've successfully generated a massive 150K enhanced HouseBrain dataset!**

**This dataset includes all crucial architectural parameters and is ready for training your custom LLM!**

For more information, visit: https://github.com/Vinay-O/HouseBrainLLM