# HouseBrain Data Factory 2.0: The Architect's Assembly Line (Parallel Mode)

This notebook is the control center for generating a high-quality, architecturally sound dataset for training the HouseBrain LLM. It is designed for **large-scale, parallel data generation.**

**Our Strategy:**
1.  **Generate a Master Prompt File (Once)**: Run Cell 4 a single time to generate a large (e.g., 30,000) list of prompts and save it to your Google Drive.
2.  **Run Multiple Notebooks in Parallel**: You can open this same notebook using different Google accounts.
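3.  **Process Random Batches**: Each notebook instance reads the master prompt list, selects a random batch of prompts to process, and saves the results to a central dataset folder on your Drive.
4.  **Avoid Duplicate Work**: Before processing a prompt, the script checks whether a plan for it already exists and skips it if so. Output file names are derived from a hash of the prompt text, so multiple instances can contribute to the same dataset. If two instances pick the same unfinished prompt at the same time, both write to the same file name, so the dataset never holds two entries for one prompt.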
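
The duplicate check in step 4 depends on every prompt mapping to a deterministic file name. The sketch below mirrors the SHA-1 naming used in Cell 3; the helper names `plan_path_for` and `pending_prompts` are illustrative and not part of the repository, and a temporary directory stands in for the shared Drive folder.

```python
import hashlib
import tempfile
from pathlib import Path


def plan_path_for(prompt: str, output_dir: Path) -> Path:
    """Map a prompt to a stable file name using the first 16 hex chars of its SHA-1."""
    digest = hashlib.sha1(prompt.encode()).hexdigest()[:16]
    return output_dir / f"plan_{digest}.json"


def pending_prompts(prompts: list[str], output_dir: Path) -> list[str]:
    """Return only the prompts that have no saved plan yet."""
    return [p for p in prompts if not plan_path_for(p, output_dir).exists()]


# Demonstration in a throwaway directory (stands in for the shared Drive folder).
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    prompts = ["A 2-bedroom cottage", "A 4-bedroom villa"]
    # Pretend another notebook instance already finished the first prompt.
    plan_path_for(prompts[0], out).write_text("{}")
    remaining = pending_prompts(prompts, out)
    print(remaining)  # → ['A 4-bedroom villa']
```

Because the file name depends only on the prompt text, any instance can tell whether a prompt is done without coordinating with the others.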

## Instructions
1.  **Set Your GitHub PAT**: In Cell 1, you will be prompted to enter a GitHub Personal Access Token to clone the repository.
2.  **(First Time Only) Run Cell 4**: After Cell 1 has mounted Drive and cloned the repository, run Cell 4 to create your master `platinum_prompts.txt` file in Google Drive. You only need to do this once.
3.  **Run the Factory (Cell 3)**: Run Cells 1, 2, and 3. In Cell 3 you can configure how many prompts the current notebook instance processes.
4.  **Repeat**: Open this notebook with other accounts, mount the same Google Drive, and run Cells 1, 2, and 3 there to generate more data in parallel.
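
Random selection does not by itself guarantee that two parallel instances pick disjoint prompts. One possible alternative, which this notebook does not use, is to assign each prompt to a fixed shard based on its hash. In the sketch below, `shard_of`, `prompts_for_instance`, and the instance index and count are all hypothetical.

```python
import hashlib


def shard_of(prompt: str, num_instances: int) -> int:
    """Assign a prompt to one of num_instances shards based on its SHA-1 digest."""
    digest = hashlib.sha1(prompt.encode()).hexdigest()
    return int(digest, 16) % num_instances


def prompts_for_instance(prompts, instance_index, num_instances):
    """Keep only the prompts that belong to this instance's shard."""
    return [p for p in prompts if shard_of(p, num_instances) == instance_index]


prompts = [f"House prompt {i}" for i in range(10)]
# Each of three hypothetical notebook instances computes its own shard.
shards = [prompts_for_instance(prompts, i, 3) for i in range(3)]
# Every prompt lands in exactly one shard, so no two instances share work.
print(sum(len(s) for s in shards))  # → 10
```

The trade-off is that each instance must be told its index and the total instance count, which the random-batch approach avoids.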


In [None]:
# @title ## 1. Setup Environment
# @markdown Mount Google Drive and clone the repository using a secure token.
from google.colab import drive
import os
import getpass
import subprocess

# Mount Google Drive
drive.mount('/content/drive')
print("✅ Google Drive mounted.")

# --- GitHub Setup ---
#@markdown Enter your GitHub Personal Access Token (PAT) with repo access.
GITHUB_TOKEN = getpass.getpass('Enter your GitHub PAT: ')
REPO_URL = f"https://{GITHUB_TOKEN}@github.com/Vinay-O/HouseBrainLLM.git"
REPO_DIR = "/content/HouseBrainLLM"

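# Clone the repository (or update an existing clone)
try:
    if os.path.exists(REPO_DIR):
        print("Repository already exists. Pulling latest changes...")
        subprocess.run(["git", "-C", REPO_DIR, "pull"], check=True)
    else:
        print("Cloning repository...")
        subprocess.run(["git", "clone", REPO_URL, REPO_DIR], check=True)
except subprocess.CalledProcessError as e:
    # Raise a sanitized error so the token embedded in REPO_URL is not echoed in the traceback
    raise RuntimeError(f"git failed with exit code {e.returncode}") from None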

print("✅ Repository is ready.")

# --- Install Dependencies ---
#@markdown Install necessary Python packages from the new requirements file.
requirements_path = os.path.join(REPO_DIR, "requirements.txt")
if os.path.exists(requirements_path):
    print("Installing dependencies from requirements.txt...")
    !pip install -q -r {requirements_path}
    print("✅ Dependencies installed.")
else:
    print("⚠️ requirements.txt not found. Installing default packages.")
    !pip install -q pydantic

print("✅ Environment setup complete.")



In [None]:
# @title ## 2. Configure and Start Ollama Server
# @markdown This cell will download and start the Ollama server, then pull the specified model.
# @markdown **NOTE:** A powerful model like `deepseek-r1:32b` is now recommended for higher quality results. It will be slower but more reliable.

MODEL_NAME = "deepseek-r1:32b" # @param ["deepseek-r1:32b", "llama3:70b-instruct", "qwen2:72b-instruct", "mixtral:instruct"]

# Download and start Ollama
!curl -fsSL https://ollama.com/install.sh | sh
import threading
import subprocess
import time

def run_ollama():
    try:
        subprocess.run("ollama serve", shell=True, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(f"Ollama server failed: {e.stderr}")

print("🚀 Starting Ollama server in the background...")
ollama_thread = threading.Thread(target=run_ollama)
ollama_thread.daemon = True
ollama_thread.start()

# Wait for the server to be ready
print("⏳ Waiting for Ollama server to initialize...")
time.sleep(15) # Increased wait time for stability

# Pull the model
print(f"📦 Pulling model: {MODEL_NAME}. This may take a while...")
try:
    process = subprocess.run(
        f"ollama pull {MODEL_NAME}",
        shell=True, check=True, capture_output=True, text=True, timeout=900
    )
    print(f"✅ Model {MODEL_NAME} is ready.")
except subprocess.CalledProcessError as e:
    print(f"Error pulling model: {e.stderr}")
    print("This might happen if the model name is incorrect or the Ollama server is not ready.")
except subprocess.TimeoutExpired:
    print("Timed out while pulling the model. The model might be very large or the connection slow.")


# Verify Ollama is running
!ollama list
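
The fixed `time.sleep(15)` in Cell 2 assumes the server is ready after 15 seconds. A polling loop is one possible alternative. The sketch below is not part of the notebook: `wait_for_server` is a hypothetical helper, and it assumes the server answers HTTP 200 on the URL it is given once it is up.

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll url until it answers with HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server not accepting connections yet; try again shortly
        time.sleep(interval_s)
    return False


# Possible usage in Cell 2, in place of the fixed sleep:
# if not wait_for_server("http://localhost:11434/api/tags"):
#     raise RuntimeError("Ollama server did not become ready in time.")
```

Polling returns as soon as the server responds and fails loudly when it never does, rather than continuing to the model pull against a server that is not running.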



In [None]:
# @title ## 3. Run the Data Factory (Parallel Mode)
# @markdown This cell is designed for large-scale, parallel data generation.
# @markdown It reads prompts from a central file in your Google Drive, processes a random batch, and saves to a central dataset folder.
# @markdown You can run this notebook on multiple accounts simultaneously to accelerate data creation.

import os
import sys
import json
import textwrap
import logging
from pathlib import Path
import urllib.request
import urllib.error
from inspect import getsource
from pydantic import BaseModel, ValidationError
import random
import hashlib
import time

# --- Configuration ---
#@markdown The central location in your Google Drive for the master prompt file.
DRIVE_PROMPT_FILE = "/content/drive/MyDrive/housebrain_prompts/platinum_prompts.txt" #@param {type:"string"}

#@markdown The central location in your Google Drive to save the final dataset.
DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/housebrain_platinum_dataset" #@param {type:"string"}

#@markdown The maximum number of prompts this Colab instance should process in this run. Prompts that already have a saved plan are skipped but still count toward this number.
NUM_PLANS_TO_GENERATE = 100 #@param {type:"integer"}

#@markdown The Ollama model to use for generation (should match the model from Cell 2).
MODEL_NAME = "deepseek-r1:32b" # @param ["deepseek-r1:32b", "llama3:70b-instruct", "qwen2:72b-instruct", "mixtral:instruct"]
# --- End Configuration ---


# --- Setup Paths and Logging ---
REPO_DIR = "/content/HouseBrainLLM"
if REPO_DIR not in sys.path:
    sys.path.insert(0, REPO_DIR)
os.chdir(REPO_DIR)

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

# --- Import Schema from Cloned Repo ---
try:
    from src.housebrain.schema import HouseOutput, RoomType
    print("✅ Successfully imported HouseBrain schema.")
except ImportError as e:
    print(f"❌ Failed to import HouseBrain schema: {e}")
    print("Please ensure the repository was cloned correctly in Step 1.")
    # Stop execution if schema fails
    raise e

# --- Self-Contained Generation Logic (with A+ Prompts) ---
VALID_ROOM_TYPES = [e.value for e in RoomType]
VALID_WINDOW_TYPES = ["fixed", "casement", "sliding", "bay"]
VALID_DOOR_TYPES = ["interior", "exterior", "sliding", "pocket"]

STAGE_1_PROMPT_TEMPLATE = """You are an expert AI architect. Your task is to generate ONLY the high-level geometric layout for a house based on a user's prompt.

**CRITICAL INSTRUCTIONS:**
1.  Focus ONLY on `levels` and `rooms`.
2.  Rooms MUST have an `id`, `type`, and non-overlapping `bounds`. **BOUNDS MUST NOT OVERLAP.**
3.  The `type` for each room MUST be one of the following valid options: `{valid_room_types}`. Do NOT invent new types or use shorthands.
4.  **Size Constraint**: Rooms must have realistic dimensions. For example, an `entrance` must be at least 40 sqft, a `bathroom` at least 40 sqft, and a `bedroom` at least 120 sqft.
5.  DO NOT include `doors` or `windows` in this stage.
6.  Your output MUST be a single, valid JSON object with a root "levels" key.

**Golden Example of a perfect room structure:**
```json
{{
  "id": "living_room_0",
  "type": "living_room",
  "bounds": {{"x": 10, "y": 10, "width": 20, "height": 15}}
}}
```
---
**User Prompt:**
{user_prompt}
---
Now, generate the JSON for the house layout, adhering strictly to the instructions provided."""

STAGE_2_PROMPT_TEMPLATE = """You are an expert AI architect. Your task is to add `doors` and `windows` to a pre-existing house layout.

**CRITICAL INSTRUCTIONS:**
1.  Use ONLY the official `RoomType` enum values for all room `type` fields. Valid types are: `{valid_room_types}`. Do not use shorthands like "living".
2.  The `type` for each window MUST be one of the following valid options: `{valid_window_types}`.
3.  The `type` for each door MUST be one of the following valid options: `{valid_door_types}`. Do NOT confuse window and door types.
4.  DO NOT change the existing `id`, `type`, or `bounds` of the rooms.
5.  A `Door` object MUST have `room1` and `room2` fields (NOT `room1_id`). A `Window` MUST have a `room_id`.
6.  Your final output must be a single JSON object containing ONLY the `levels` key.

**Expert Design Hints:**
-   Place doors to create a logical and efficient flow between connected rooms.
-   Place windows on exterior walls to maximize natural light and capture views where appropriate.

**Golden Example (Pay close attention to structure):**
```json
"rooms": [
   {{
     "id": "living_room_0",
     "type": "living_room",
     "bounds": {{ "x": 10, "y": 10, "width": 20, "height": 15 }},
     "doors": [
       {{
         "position": {{ "x": 20, "y": 25 }},
         "width": 3.0,
         "type": "interior",
         "room1": "living_room_0",
         "room2": "dining_room_0"
       }}
     ],
     "windows": [ {{ "position": {{...}}, "width": 8.0, "height": 5.0, "type": "sliding", "room_id": "living_room_0" }} ]
   }}
]
```
---
**Existing House Layout (Do not change this part):**
```json
{existing_layout}
```
---
**Original User Prompt:**
{user_prompt}
---
Now, add the doors and windows to the layout, following the format of the Golden Example exactly."""

JSON_REPAIR_PROMPT = """The following text is not a valid JSON object. Please fix any syntax errors (like missing commas, brackets, or quotes) and return ONLY the corrected, valid JSON object. Do not add any commentary.

**Broken Text:**
{broken_json}
"""

def call_ollama_colab(model_name: str, prompt: str, max_retries=3):
    """A more robust implementation of the Ollama API call for Colab with retries."""
    url = "http://localhost:11434/api/generate"
    data = {"model": model_name, "prompt": prompt, "stream": False, "format": "json"}
    encoded_data = json.dumps(data).encode('utf-8')
    req = urllib.request.Request(url, data=encoded_data, headers={'Content-Type': 'application/json'})
    
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(req, timeout=900) as response:
                if response.status == 200:
                    response_data = json.loads(response.read().decode('utf-8'))
                    return response_data.get("response", "")
                logger.error(f"Attempt {attempt + 1}/{max_retries} failed. Unexpected HTTP status: {response.status}")
        except urllib.error.HTTPError as e:
            error_content = e.read().decode('utf-8')
            logger.error(f"Attempt {attempt + 1}/{max_retries} failed. HTTP Error: {e.code} {e.reason} - {error_content}")
        except Exception as e:
            logger.error(f"Attempt {attempt + 1}/{max_retries} failed. Error calling Ollama: {e}")
        # Pause before the next attempt after any kind of failure; give up after the last one
        if attempt < max_retries - 1:
            time.sleep(5)
        else:
            logger.error("Max retries reached. Failing.")
    return None

def repair_and_correct_plan(raw_json_str: str, model_name: str) -> dict | None:
    """Attempts to parse, repair, and programmatically correct a JSON plan. Returns None if no valid JSON object can be recovered."""
    plan_dict = None
    # --- Step 1: Initial Parse Attempt ---
    try:
        plan_dict = json.loads(raw_json_str)
    except json.JSONDecodeError:
        print("Initial JSON parse failed. Attempting to repair with LLM...")
        repair_prompt = JSON_REPAIR_PROMPT.format(broken_json=raw_json_str)
        repaired_str = call_ollama_colab(model_name, repair_prompt)
        if not repaired_str:
            print("❌ JSON repair failed.")
            return None
        try:
            plan_dict = json.loads(repaired_str)
            print("✅ JSON successfully repaired and parsed.")
        except json.JSONDecodeError:
            print("❌ JSON parse failed even after repair.")
            return None

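    # Reject empty results and non-object JSON (e.g. a bare list), which the correction step cannot handle
    if not plan_dict or not isinstance(plan_dict, dict):
        return None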
            
    # --- Step 2: Programmatic Correction ---
    room_type_map = {
        "living": "living_room",
        "dining": "dining_room",
        "master_bedroom": "master_bedroom",
        "bedroom": "bedroom",
        "kitchen": "kitchen",
        "bathroom": "bathroom",
        "study": "study",
        "terrace": "balcony",
        "roof_deck": "balcony",
        "roof": "balcony",
        "walk_in_closet": "storage",
        "private_gym": "study",
    }
    
    if "levels" not in plan_dict or not isinstance(plan_dict["levels"], list):
        plan_dict["levels"] = []

    for level in plan_dict.get("levels", []):
        for room in level.get("rooms", []):
            # Correct room types (entries that already use the canonical name are left alone)
            room_type = room.get("type")
            corrected_type = room_type_map.get(room_type, room_type)
            if corrected_type != room_type:
                print(f"Correcting room type: '{room_type}' -> '{corrected_type}'")
                room["type"] = corrected_type
            
            # Fix missing room_id in windows
            room_id = room.get("id")
            if room_id:
                for window in room.get("windows", []):
                    if "room_id" not in window:
                        print(f"Injecting missing room_id '{room_id}' into window.")
                        window["room_id"] = room_id
            
            # Sanitize doors: rebuild the list, keeping only valid ones
            valid_doors = []
            for door in room.get("doors", []):
                # Rename invented room1_id/room2_id fields first, so that doors
                # using them are repaired rather than dropped by the check below
                if "room1_id" in door:
                    door["room1"] = door.pop("room1_id")
                    print("Correcting door field: 'room1_id' -> 'room1'")
                if "room2_id" in door:
                    door["room2"] = door.pop("room2_id")
                    print("Correcting door field: 'room2_id' -> 'room2'")

                # Drop doors that are still missing required fields
                if "room1" not in door or "room2" not in door:
                    print(f"Sanitizing: Removing door from '{room_id}' because it's missing room1/room2.")
                    continue

                # Fix invalid door types
                if door.get("type") not in VALID_DOOR_TYPES:
                    original_door_type = door.get("type")
                    door["type"] = "interior"
                    print(f"Correcting invalid door type: '{original_door_type}' -> 'interior'")

                valid_doors.append(door)
            room["doors"] = valid_doors

    return plan_dict


# --- Execution ---
print("--- Starting Data Factory Run (Parallel Mode) ---")
Path(DRIVE_OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# 1. Load all available prompts from central file
all_prompts = []
try:
    with open(DRIVE_PROMPT_FILE, 'r', encoding='utf-8') as f:
        all_prompts = [line.strip() for line in f if line.strip()]
    if all_prompts:
        print(f"✅ Found {len(all_prompts)} total prompts in the master list.")
    else:
        print(f"❌ MASTER PROMPT FILE IS EMPTY at '{DRIVE_PROMPT_FILE}'.")
        print("Please run Cell 4 to regenerate it before running this cell.")
except FileNotFoundError:
    print(f"❌ MASTER PROMPT FILE NOT FOUND at '{DRIVE_PROMPT_FILE}'.")
    print("Please run Cell 4 to generate it before running this cell.")

# 2. Select a random batch to process
if all_prompts:
    random.shuffle(all_prompts)
    prompts_to_process = all_prompts[:NUM_PLANS_TO_GENERATE]
    print(f"✅ This run will process a random batch of {len(prompts_to_process)} prompts.")

    for i, prompt_text in enumerate(prompts_to_process):
        print("\n" + "="*50)
        print(f"Processing prompt {i+1}/{len(prompts_to_process)}")
        
        prompt_hash = hashlib.sha1(prompt_text.encode()).hexdigest()[:16]
        run_name = f"plan_{prompt_hash}"
        output_file = Path(DRIVE_OUTPUT_DIR) / f"{run_name}.json"

        if output_file.exists():
            print(f"⏭️ Skipping prompt, output file already exists: {output_file.name}")
            continue

        print(textwrap.shorten(prompt_text, width=100, placeholder="..."))
        print("="*50)

        # --- STAGE 1 ---
        print("Running Stage 1: Layout Generation...")
        stage_1_prompt = STAGE_1_PROMPT_TEMPLATE.format(
            user_prompt=prompt_text,
            valid_room_types=VALID_ROOM_TYPES
        )
        stage_1_response = call_ollama_colab(MODEL_NAME, stage_1_prompt)

        if not stage_1_response:
            print("❌ Stage 1 Failed: No response from model.")
            continue
        
        # --- STAGE 2 ---
        print("Running Stage 2: Adding Openings...")
        stage_2_prompt = STAGE_2_PROMPT_TEMPLATE.format(
            existing_layout=stage_1_response,
            user_prompt=prompt_text,
            valid_room_types=VALID_ROOM_TYPES,
            valid_window_types=VALID_WINDOW_TYPES,
            valid_door_types=VALID_DOOR_TYPES
        )
        stage_2_response = call_ollama_colab(MODEL_NAME, stage_2_prompt)
        
        if not stage_2_response:
            print("❌ Stage 2 Failed: No response from model.")
            continue

        # --- STAGE 2.5: Repair and Correct ---
        print("Running Stage 2.5: Repairing and Correcting Plan...")
        corrected_plan = repair_and_correct_plan(stage_2_response, MODEL_NAME)
        if not corrected_plan:
            print("❌ Stage 2.5 Failed: Could not produce a valid plan.")
            with open(str(output_file).replace('.json', '_failed_repair.txt'), 'w') as f: f.write(stage_2_response)
            continue

        # --- STAGE 3: Finalize & Validate ---
        print("Running Stage 3: Finalizing and Validating...")
        try:
            processed_levels = corrected_plan.get("levels", [])
            for level_idx, level in enumerate(processed_levels):
                level['level_number'] = level_idx

            total_area_sqft = sum(
                r['bounds']['width'] * r['bounds']['height']
                for l in processed_levels
                for r in l.get("rooms", [])
            )
            
            final_plan = {
                "input": {
                    "basicDetails": {
                        "prompt": prompt_text, 
                        "totalArea": total_area_sqft,
                        "unit": "sqft",
                        "floors": len(processed_levels),
                        "bedrooms": 0, # Placeholder
                        "bathrooms": 0, # Placeholder
                        "style": "unknown", # Placeholder
                        "budget": 0 # Placeholder
                    },
                    "plot": {}, "roomBreakdown": []
                },
                "levels": processed_levels,
                "total_area": round(total_area_sqft, 2),
                "construction_cost": 0.0, "materials": {}, "render_paths": {}
            }
            
            HouseOutput.model_validate(final_plan)

            with open(output_file, 'w') as f:
                json.dump(final_plan, f, indent=2)
            print(f"✅ SUCCESS! Saved validated plan to {output_file}")

        except ValidationError as e:
            print(f"❌ Stage 3 Failed: Pydantic validation error - {e}")
            with open(str(output_file).replace('.json', '_failed_validation.txt'), 'w') as f: json.dump(corrected_plan, f, indent=2)
        except Exception as e:
            print(f"❌ Stage 3 Failed: An unexpected error occurred - {e}")
            with open(str(output_file).replace('.json', '_failed_exception.txt'), 'w') as f: json.dump(corrected_plan, f, indent=2)
    
    print("\n🎉 Data Factory run complete!")
else:
    print("No prompts to process.")
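
The Stage 1 prompt tells the model that room bounds must not overlap, but Cell 3 does not check this itself, and whether `HouseOutput` validation covers it is not shown here. A minimal sketch of such a check follows, assuming each room carries the `id` and `bounds` fields from the Golden Example; `bounds_overlap` and `find_overlapping_rooms` are hypothetical helpers, not part of the repository.

```python
def bounds_overlap(a: dict, b: dict) -> bool:
    """True if two axis-aligned rectangles share interior area (touching edges do not count)."""
    return (
        a["x"] < b["x"] + b["width"]
        and b["x"] < a["x"] + a["width"]
        and a["y"] < b["y"] + b["height"]
        and b["y"] < a["y"] + a["height"]
    )


def find_overlapping_rooms(rooms: list[dict]) -> list[tuple[str, str]]:
    """Return the id pairs of all rooms on one level whose bounds overlap."""
    pairs = []
    for i, first in enumerate(rooms):
        for second in rooms[i + 1:]:
            if bounds_overlap(first["bounds"], second["bounds"]):
                pairs.append((first["id"], second["id"]))
    return pairs


rooms = [
    {"id": "living_room_0", "bounds": {"x": 10, "y": 10, "width": 20, "height": 15}},
    # Shares a wall with the living room but does not intrude into it
    {"id": "dining_room_0", "bounds": {"x": 30, "y": 10, "width": 12, "height": 15}},
    # Intrudes into both rooms above
    {"id": "kitchen_0", "bounds": {"x": 25, "y": 20, "width": 10, "height": 10}},
]
print(find_overlapping_rooms(rooms))
# → [('living_room_0', 'kitchen_0'), ('dining_room_0', 'kitchen_0')]
```

A check like this could run per level before Stage 3, with overlapping plans written to a failure file in the same way as validation errors.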



In [None]:
# @title ## 4. (One-Time Setup) Generate Master Prompt File
# @markdown This cell uses the `generate_prompts.py` script to create your master prompt file in Google Drive.
# @markdown **You only need to run this cell once.**
# @markdown Once the file is created, Cell 3 will be able to read from it for all future runs.

import os
from pathlib import Path

# --- Configuration ---
#@markdown The desired location in your Google Drive for the master prompt file. This MUST match the path in Cell 3.
DRIVE_PROMPT_FILE = "/content/drive/MyDrive/housebrain_prompts/platinum_prompts.txt" #@param {type:"string"}

#@markdown The total number of prompts to generate for your master list.
NUM_PROMPTS_TO_GENERATE = 30000 #@param {type:"integer"}
# --- End Configuration ---

# --- Execution ---
REPO_DIR = "/content/HouseBrainLLM"
script_path = os.path.join(REPO_DIR, "scripts/generate_prompts.py")

# Run from the repository root
os.chdir(REPO_DIR)

# Ensure the target directory in Drive exists
Path(DRIVE_PROMPT_FILE).parent.mkdir(parents=True, exist_ok=True)

print(f"Running prompt generation script to create {NUM_PROMPTS_TO_GENERATE} prompts...")
# Quote both paths so that spaces in them do not break the shell command
command = f'python3 "{script_path}" --num-prompts {NUM_PROMPTS_TO_GENERATE} --output-file "{DRIVE_PROMPT_FILE}"'
!{command}

print("\n--- Verification ---")
if Path(DRIVE_PROMPT_FILE).exists():
    print(f"✅ Master prompt file successfully created at: {DRIVE_PROMPT_FILE}")
    print("First 5 prompts in the file:")
    !head -n 5 "{DRIVE_PROMPT_FILE}"
else:
    print(f"❌ ERROR: Master prompt file was not created. Please check for errors above.")



In [None]:
# @title ## 5. (Optional) Download Generated Dataset
# @markdown Run this cell after the data generation is complete to compress and download the entire output folder.

import shutil
import os
from google.colab import files
from datetime import datetime

# Define the source directory in Google Drive. This should match DRIVE_OUTPUT_DIR from Cell 3.
source_dir = "/content/drive/MyDrive/housebrain_platinum_dataset"

# Create a timestamped zip filename
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
zip_filename = f"housebrain_dataset_{timestamp}.zip"
zip_filepath = f"/content/{zip_filename}"

if os.path.exists(source_dir) and os.listdir(source_dir):
    # Create the zip archive
    print(f"Compressing '{source_dir}' into '{zip_filepath}'...")
    shutil.make_archive(zip_filepath.replace('.zip', ''), 'zip', source_dir)
    print("✅ Compression complete.")

    # Provide a download link
    print(f"\nDownloading '{zip_filename}'...")
    files.download(zip_filepath)
else:
    print(f"❌ ERROR: The source directory '{source_dir}' was not found or is empty. Please ensure the Data Factory ran correctly.")

