# HouseBrain Data Factory 2.0: The Architect's Assembly Line (Parallel Mode)

This notebook is the control center for generating a high-quality, architecturally sound dataset for training the HouseBrain LLM. It is designed for **large-scale, parallel data generation.**

**Our Strategy:**
1.  **Generate a Master Prompt File (Once)**: Run Cell 4 a single time to generate a large (e.g., 30,000) list of prompts and save it to your Google Drive.
2.  **Run Multiple Notebooks in Parallel**: You can open this same notebook using different Google accounts.
3.  **Process Random Batches**: Each notebook instance reads the master prompt list, picks a random batch of prompts, and saves the results to a central dataset folder on your Drive.
4.  **Avoid Duplicate Work**: Before generating, the script checks whether a plan for the prompt already exists and skips it if so, which lets multiple instances contribute to the same dataset. The check is not atomic, so two instances that pick the same prompt at the same moment can still both generate it.

## Instructions
1.  **Set Your GitHub PAT**: In Cell 1, you will be prompted to enter a GitHub Personal Access Token to clone the repository.
2.  **(First Time Only) Run Cell 4**: After running Cell 1 (Cell 4 needs the mounted Drive and the cloned repository), run Cell 4 to create your master `platinum_prompts.txt` file in Google Drive. You only need to do this once.
3.  **Run the Factory (Cell 3)**: Run cells 1, 2, and 3. You can configure the number of plans you want the current notebook instance to generate in Cell 3.
4.  **Repeat**: Open this notebook with other accounts, mount the same Google Drive, and run Cell 3 again to generate more data in parallel.
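
The duplicate check in step 4 of the strategy rests on deterministic filenames: Cell 3 hashes each prompt with SHA-1 and keeps the first 16 hex characters, so every instance maps the same prompt to the same `plan_<hash>.json`. A minimal sketch of that idea follows. The helper names `plan_path_for` and `pending_prompts` are hypothetical and not part of the repository, and a temporary directory stands in for the shared Drive folder.

```python
import hashlib
import tempfile
from pathlib import Path


def plan_path_for(prompt: str, output_dir: Path) -> Path:
    """Map a prompt to its output file, mirroring the naming used in Cell 3."""
    prompt_hash = hashlib.sha1(prompt.encode()).hexdigest()[:16]
    return output_dir / f"plan_{prompt_hash}.json"


def pending_prompts(prompts: list[str], output_dir: Path) -> list[str]:
    """Return the prompts that do not have a saved plan yet (hypothetical helper)."""
    return [p for p in prompts if not plan_path_for(p, output_dir).exists()]


# Demo in a throwaway directory standing in for the shared Drive folder.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    prompts = ["2-bedroom cottage", "3-storey townhouse"]
    # Pretend another notebook instance already finished the first prompt.
    plan_path_for(prompts[0], out).write_text("{}")
    remaining = pending_prompts(prompts, out)
    print(remaining)  # ['3-storey townhouse']
```

Filtering before slicing the batch, as sketched here, would let an instance spend its whole quota on prompts that still need a plan. Cell 3 itself slices first and skips afterwards.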



In [None]:
# @title ## 1. Setup Environment
# @markdown Mount Google Drive and clone the repository using a secure token.
from google.colab import drive
import os
import getpass
import subprocess

# Mount Google Drive
drive.mount('/content/drive')
print("✅ Google Drive mounted.")

# --- GitHub Setup ---
#@markdown Enter your GitHub Personal Access Token (PAT) with repo access.
GITHUB_TOKEN = getpass.getpass('Enter your GitHub PAT: ')
REPO_URL = f"https://{GITHUB_TOKEN}@github.com/Vinay-O/HouseBrainLLM.git"
REPO_DIR = "/content/HouseBrainLLM"

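# Clone the repository (or update an existing clone)
try:
    if os.path.exists(REPO_DIR):
        print("Repository already exists. Pulling latest changes...")
        subprocess.run(["git", "-C", REPO_DIR, "pull"], check=True)
    else:
        print("Cloning repository...")
        subprocess.run(["git", "clone", REPO_URL, REPO_DIR], check=True)
except subprocess.CalledProcessError as e:
    # Raise a fresh error without the command line, because the clone URL contains the PAT.
    raise RuntimeError(f"git exited with status {e.returncode}; see the output above.") from None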

print("✅ Repository is ready.")

# --- Install Dependencies ---
#@markdown Install necessary Python packages from the new requirements file.
requirements_path = os.path.join(REPO_DIR, "requirements.txt")
if os.path.exists(requirements_path):
    print("Installing dependencies from requirements.txt...")
    !pip install -q -r {requirements_path}
    print("✅ Dependencies installed.")
else:
    print("⚠️ requirements.txt not found. Installing default packages.")
    !pip install -q pydantic

print("✅ Environment setup complete.")



In [None]:
# @title ## 2. Configure and Start Ollama Server
# @markdown This cell will download and start the Ollama server, then pull the specified model.
# @markdown The process will take a few minutes.

MODEL_NAME = "mixtral:instruct" # @param ["mixtral:instruct", "qwen2:7b", "llama3:8b", "mistral:7b-instruct", "qwen2:72b"]

# Download and start Ollama
!curl -fsSL https://ollama.com/install.sh | sh
import threading
import subprocess
import time

def run_ollama():
    try:
        subprocess.run("ollama serve", shell=True, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(f"Ollama server failed: {e.stderr}")

print("🚀 Starting Ollama server in the background...")
ollama_thread = threading.Thread(target=run_ollama)
ollama_thread.daemon = True
ollama_thread.start()

# Wait for the server to be ready
print("⏳ Waiting for Ollama server to initialize...")
time.sleep(15) # Increased wait time for stability

# Pull the model
print(f"📦 Pulling model: {MODEL_NAME}. This may take a while...")
try:
    process = subprocess.run(
        f"ollama pull {MODEL_NAME}",
        shell=True, check=True, capture_output=True, text=True, timeout=600
    )
    print(f"✅ Model {MODEL_NAME} is ready.")
except subprocess.CalledProcessError as e:
    print(f"Error pulling model: {e.stderr}")
    print("This might happen if the model name is incorrect or the Ollama server is not ready.")
except subprocess.TimeoutExpired:
    print("Timed out while pulling the model. The model might be very large or the connection slow.")


# Verify Ollama is running
!ollama list
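
The fixed `time.sleep(15)` above is a guess at how long the server needs. One alternative is to poll the server until it answers. The sketch below is an illustration rather than part of the notebook: `wait_for_server` is a hypothetical helper, and it assumes the Ollama server answers plain HTTP GET requests on its root URL at the default port 11434 (the port used by the API call in Cell 3).

```python
import time
import urllib.error
import urllib.request


def wait_for_server(url: str, timeout_s: float = 60.0, interval_s: float = 1.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Server is not accepting connections yet; retry after a short pause.
        time.sleep(interval_s)
    return False


# Possible use in place of the fixed sleep (assumed endpoint, see above):
# if not wait_for_server("http://localhost:11434/"):
#     raise RuntimeError("Ollama server did not become ready in time.")
```

Polling returns as soon as the server is up and fails loudly when it never starts, instead of letting the later `ollama pull` fail with a less obvious error.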



In [None]:
# @title ## 3. Run the Data Factory (Parallel Mode)
# @markdown This cell is designed for large-scale, parallel data generation.
# @markdown It reads prompts from a central file in your Google Drive, processes a random batch, and saves to a central dataset folder.
# @markdown You can run this notebook on multiple accounts simultaneously to accelerate data creation.

import os
import sys
import json
import textwrap
import logging
from pathlib import Path
import urllib.request
import urllib.error
from pydantic import ValidationError
import random
import hashlib

# --- Configuration ---
#@markdown The central location in your Google Drive for the master prompt file.
DRIVE_PROMPT_FILE = "/content/drive/MyDrive/housebrain_prompts/platinum_prompts.txt" #@param {type:"string"}

#@markdown The central location in your Google Drive to save the final dataset.
DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/housebrain_platinum_dataset" #@param {type:"string"}

#@markdown The number of plans this specific Colab instance should generate in this run.
NUM_PLANS_TO_GENERATE = 100 #@param {type:"integer"}

#@markdown The Ollama model to use for generation. It must already have been pulled in Cell 2.
MODEL_NAME = "mixtral:instruct" #@param ["mixtral:instruct", "qwen2:7b", "llama3:8b"]
# --- End Configuration ---


# --- Setup Paths and Logging ---
REPO_DIR = "/content/HouseBrainLLM"
if REPO_DIR not in sys.path:
    sys.path.insert(0, REPO_DIR)
os.chdir(REPO_DIR)

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

# --- Import Schema from Cloned Repo ---
try:
    from src.housebrain.schema import HouseOutput, RoomType
    print("✅ Successfully imported HouseBrain schema.")
except ImportError as e:
    print(f"❌ Failed to import HouseBrain schema: {e}")
    print("Please ensure the repository was cloned correctly in Step 1.")
    # Stop execution if schema fails
    raise e

# --- Self-Contained Generation Logic (with A+ Prompts) ---
VALID_ROOM_TYPES = [e.value for e in RoomType]

# Plain strings filled in later with .format(); literal JSON braces are doubled ({{ }}) so .format() leaves them intact.
STAGE_1_PROMPT_TEMPLATE = """You are an expert AI architect. Your task is to generate ONLY the high-level geometric layout for a house based on a user's prompt.

**CRITICAL INSTRUCTIONS:**
1.  Focus ONLY on `levels` and `rooms`.
2.  Rooms MUST have an `id`, `type`, and non-overlapping `bounds`.
3.  The `type` for each room MUST be one of the following valid options: `{valid_room_types}`. Do NOT invent new types.
4.  DO NOT include `doors` or `windows` in this stage.
5.  Your output MUST be a single, valid JSON object with a root "levels" key.

**Golden Example of a perfect room structure:**
```json
{{
  "id": "living_room_0",
  "type": "living_room",
  "bounds": {{"x": 10, "y": 10, "width": 20, "height": 15}}
}}
```
---
**User Prompt:**
{user_prompt}
---
Now, generate the JSON for the house layout, adhering strictly to the instructions provided."""

STAGE_2_PROMPT_TEMPLATE = """You are an expert AI architect. Your task is to add `doors` and `windows` to a pre-existing house layout.

**CRITICAL INSTRUCTIONS:**
1.  Use the official `RoomType` enum for all room `type` fields. Valid types are: `{valid_room_types}`.
2.  DO NOT change the existing `id`, `type`, or `bounds` of the rooms.
3.  `Door` and `Window` objects MUST be complete and valid JSON objects, conforming to the schema.
4.  Your final output must be a single JSON object containing ONLY the `levels` key.

**Expert Design Hints:**
-   Place doors to create a logical and efficient flow between connected rooms.
-   Place windows on exterior walls to maximize natural light and capture views where appropriate.
-   Ensure `Door` objects correctly link two adjacent rooms in the `room1` and `room2` fields.

**Golden Example of a perfect room with openings (Pay close attention to the structure of Door and Window):**
```json
"rooms": [
   {{
     "id": "living_room_0",
     "type": "living_room",
     "bounds": {{ "x": 10, "y": 10, "width": 20, "height": 15 }},
     "doors": [
       {{
         "position": {{ "x": 20, "y": 25 }},
         "width": 3.0,
         "type": "interior",
         "room1": "living_room_0",
         "room2": "dining_room_0"
       }}
     ],
     "windows": [
       {{
         "position": {{ "x": 10, "y": 17.5 }},
         "width": 8.0,
         "height": 5.0,
         "type": "sliding",
         "room_id": "living_room_0"
       }}
     ]
   }}
]
```
---
**Full Schema Reference for Door and Window:**
```python
class Point2D(BaseModel):
    x: float
    y: float

class Door(BaseModel):
    position: Point2D
    width: float = 3.0
    type: str = "interior"
    room1: str
    room2: str

class Window(BaseModel):
    position: Point2D
    width: float
    height: float = 4.0
    type: str = "fixed"
    room_id: str
```
---
**Existing House Layout (Do not change this part):**
```json
{existing_layout}
```
---
**Original User Prompt:**
{user_prompt}
---
Now, add the doors and windows to the layout, following the format of the Golden Example and Schema Reference exactly."""

def call_ollama_colab(model_name: str, prompt: str):
    """A direct implementation of the Ollama API call for Colab."""
    url = "http://localhost:11434/api/generate"
    data = {"model": model_name, "prompt": prompt, "stream": False, "format": "json"}
    encoded_data = json.dumps(data).encode('utf-8')
    req = urllib.request.Request(url, data=encoded_data, headers={'Content-Type': 'application/json'})
    try:
        with urllib.request.urlopen(req, timeout=900) as response:
            if response.status == 200:
                response_data = json.loads(response.read().decode('utf-8'))
                return response_data.get("response", "")
    except urllib.error.HTTPError as e:
        error_content = e.read().decode('utf-8')
        logger.error(f"HTTP Error: {e.code} {e.reason} - {error_content}")
    except Exception as e:
        logger.error(f"Unexpected error calling Ollama: {e}")
    return None

# --- Execution ---
print("--- Starting Data Factory Run (Parallel Mode) ---")
Path(DRIVE_OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# 1. Load all available prompts from central file
all_prompts = []
try:
    with open(DRIVE_PROMPT_FILE, 'r') as f:
        all_prompts = [line.strip() for line in f if line.strip()]
    if all_prompts:
        print(f"✅ Found {len(all_prompts)} total prompts in the master list.")
    else:
        # An existing but empty file is a different problem from a missing one.
        print(f"❌ MASTER PROMPT FILE IS EMPTY: '{DRIVE_PROMPT_FILE}'.")
        print("Please re-run Cell 4 to regenerate it before running this cell.")
except FileNotFoundError:
    print(f"❌ MASTER PROMPT FILE NOT FOUND at '{DRIVE_PROMPT_FILE}'.")
    print("Please run Cell 4 to generate it before running this cell.")

# 2. Select a random batch to process
if all_prompts:
    random.shuffle(all_prompts)
    prompts_to_process = all_prompts[:NUM_PLANS_TO_GENERATE]
    print(f"✅ This run will process a random batch of {len(prompts_to_process)} prompts.")

    for i, prompt_text in enumerate(prompts_to_process):
        print("\n" + "="*50)
        print(f"Processing prompt {i+1}/{len(prompts_to_process)}")
        
        # Create a unique filename from the prompt hash to avoid collisions
        prompt_hash = hashlib.sha1(prompt_text.encode()).hexdigest()[:16]
        run_name = f"plan_{prompt_hash}"
        output_file = Path(DRIVE_OUTPUT_DIR) / f"{run_name}.json"

        if output_file.exists():
            print(f"⏭️ Skipping prompt, output file already exists: {output_file.name}")
            continue

        print(textwrap.shorten(prompt_text, width=100, placeholder="..."))
        print("="*50)

        # --- STAGE 1 ---
        print("Running Stage 1: Layout Generation...")
        stage_1_prompt = STAGE_1_PROMPT_TEMPLATE.format(
            user_prompt=prompt_text,
            valid_room_types=VALID_ROOM_TYPES
        )
        stage_1_response = call_ollama_colab(MODEL_NAME, stage_1_prompt)

        if not stage_1_response:
            print("❌ Stage 1 Failed: No response from model.")
            continue
        
        # --- STAGE 2 ---
        print("Running Stage 2: Adding Openings...")
        stage_2_prompt = STAGE_2_PROMPT_TEMPLATE.format(
            existing_layout=stage_1_response,
            user_prompt=prompt_text,
            valid_room_types=VALID_ROOM_TYPES
        )
        stage_2_response = call_ollama_colab(MODEL_NAME, stage_2_prompt)
        
        if not stage_2_response:
            print("❌ Stage 2 Failed: No response from model.")
            continue

        # --- STAGE 3: Finalize & Validate ---
        print("Running Stage 3: Finalizing and Validating...")
        try:
            layout_with_openings = json.loads(stage_2_response)
            
            total_area_sqft = sum(
                r['bounds']['width'] * r['bounds']['height']
                for l in layout_with_openings.get("levels", [])
                for r in l.get("rooms", [])
            )
            
            final_plan = {
                "input": {
                    "basicDetails": {"prompt": prompt_text, "totalArea": total_area_sqft},
                    "plot": {}, "roomBreakdown": []
                },
                "levels": layout_with_openings.get("levels", []),
                "total_area": round(total_area_sqft, 2),
                "construction_cost": 0.0, "materials": {}, "render_paths": {}
            }
            
            HouseOutput.model_validate(final_plan)

            with open(output_file, 'w') as f:
                json.dump(final_plan, f, indent=2)
            print(f"✅ SUCCESS! Saved validated plan to {output_file}")

        except Exception as e:
            # Report the failure by type, then keep the raw Stage 2 response for debugging.
            if isinstance(e, json.JSONDecodeError):
                print("❌ Stage 3 Failed: Could not decode JSON from Stage 2.")
            elif isinstance(e, ValidationError):
                print(f"❌ Stage 3 Failed: Pydantic validation error - {e}")
            else:
                print(f"❌ Stage 3 Failed: An unexpected error occurred - {e}")
            failed_file = output_file.with_name(f"{run_name}_failed.txt")
            with open(failed_file, 'w') as f:
                f.write(stage_2_response)
    
    print("\n🎉 Data Factory run complete!")
else:
    print("No prompts to process.")
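
The Stage 1 prompt asks the model for non-overlapping `bounds`, but nothing in this cell checks that directly, and whether `HouseOutput` enforces it is not visible here. A minimal sketch of such a check for the rooms of one level follows. It assumes `bounds` are axis-aligned rectangles with `x`, `y`, `width`, and `height`, as in the Golden Example, and that rooms sharing only a wall do not count as overlapping. `rects_overlap` and `find_overlaps` are hypothetical names.

```python
from itertools import combinations


def rects_overlap(a: dict, b: dict) -> bool:
    """True if two axis-aligned bounds share interior area (touching edges do not count)."""
    return (
        a["x"] < b["x"] + b["width"]
        and b["x"] < a["x"] + a["width"]
        and a["y"] < b["y"] + b["height"]
        and b["y"] < a["y"] + a["height"]
    )


def find_overlaps(rooms: list[dict]) -> list[tuple[str, str]]:
    """Return the id pairs of rooms on one level whose bounds overlap."""
    return [
        (r1["id"], r2["id"])
        for r1, r2 in combinations(rooms, 2)
        if rects_overlap(r1["bounds"], r2["bounds"])
    ]


rooms = [
    {"id": "living_room_0", "bounds": {"x": 10, "y": 10, "width": 20, "height": 15}},
    # Starts exactly where the living room ends, so they only share a wall.
    {"id": "dining_room_0", "bounds": {"x": 30, "y": 10, "width": 12, "height": 15}},
    # Cuts into both of the rooms above.
    {"id": "kitchen_0", "bounds": {"x": 25, "y": 20, "width": 10, "height": 10}},
]
overlaps = find_overlaps(rooms)
print(overlaps)  # [('living_room_0', 'kitchen_0'), ('dining_room_0', 'kitchen_0')]
```

A check like this could run on each level before validation, so that plans with overlapping rooms are written to the `_failed.txt` path instead of the dataset.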



In [None]:
# @title ## 4. (One-Time Setup) Generate Master Prompt File
# @markdown This cell uses the `generate_prompts.py` script to create your master prompt file in Google Drive.
# @markdown **You only need to run this cell once.**
# @markdown Once the file is created, Cell 3 will be able to read from it for all future runs.

import os
from pathlib import Path

# --- Configuration ---
#@markdown The desired location in your Google Drive for the master prompt file. This MUST match the path in Cell 3.
DRIVE_PROMPT_FILE = "/content/drive/MyDrive/housebrain_prompts/platinum_prompts.txt" #@param {type:"string"}

#@markdown The total number of prompts to generate for your master list.
NUM_PROMPTS_TO_GENERATE = 30000 #@param {type:"integer"}
# --- End Configuration ---

# --- Execution ---
REPO_DIR = "/content/HouseBrainLLM"
script_path = os.path.join(REPO_DIR, "scripts/generate_prompts.py")

# Ensure the repository is in the correct directory
os.chdir(REPO_DIR)

# Ensure the target directory in Drive exists
Path(DRIVE_PROMPT_FILE).parent.mkdir(parents=True, exist_ok=True)

print(f"Running prompt generation script to create {NUM_PROMPTS_TO_GENERATE} prompts...")
# Quote both paths so that spaces in them do not split the shell command
command = f'python3 "{script_path}" --num-prompts {NUM_PROMPTS_TO_GENERATE} --output-file "{DRIVE_PROMPT_FILE}"'
!{command}

print("\n--- Verification ---")
if Path(DRIVE_PROMPT_FILE).exists():
    print(f"✅ Master prompt file successfully created at: {DRIVE_PROMPT_FILE}")
    print("First 5 prompts in the file:")
    !head -n 5 "{DRIVE_PROMPT_FILE}"
else:
    print("❌ ERROR: Master prompt file was not created. Please check for errors above.")
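
Because Cell 3 names each output file after a hash of the prompt text, two identical lines in the master file can only ever produce one plan. Whether `generate_prompts.py` emits duplicates is not known from this notebook, so the following is only a sketch of how the file's lines could be deduplicated if needed. `unique_prompts` is a hypothetical helper.

```python
def unique_prompts(lines):
    """Strip blank lines and drop repeated prompts while keeping first-seen order."""
    seen = set()
    unique = []
    for line in lines:
        prompt = line.strip()
        if prompt and prompt not in seen:
            seen.add(prompt)
            unique.append(prompt)
    return unique


sample = ["A modern villa\n", "\n", "A compact studio\n", "A modern villa\n"]
deduped = unique_prompts(sample)
print(deduped)  # ['A modern villa', 'A compact studio']
```

Comparing `len(deduped)` with the requested number of prompts gives the real upper bound on how many plans the dataset can contain.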



In [None]:
# @title ## 5. (Optional) Download Generated Dataset
# @markdown Run this cell after the data generation is complete to compress and download the entire output folder.

import shutil
import os
from google.colab import files
from datetime import datetime

# Define the source directory in Google Drive. This should match DRIVE_OUTPUT_DIR from Cell 3.
source_dir = "/content/drive/MyDrive/housebrain_platinum_dataset"

# Create a timestamped zip filename
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
zip_filename = f"housebrain_dataset_{timestamp}.zip"
zip_filepath = f"/content/{zip_filename}"

if os.path.exists(source_dir) and os.listdir(source_dir):
    # Create the zip archive
    print(f"Compressing '{source_dir}' into '{zip_filepath}'...")
    shutil.make_archive(zip_filepath.replace('.zip', ''), 'zip', source_dir)
    print("✅ Compression complete.")

    # Provide a download link
    print(f"\nDownloading '{zip_filename}'...")
    files.download(zip_filepath)
else:
    print(f"❌ ERROR: The source directory '{source_dir}' was not found or is empty. Please ensure the Data Factory ran correctly.")
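
Before downloading, it can help to see how many prompts produced a validated plan and how many left a failure dump. The sketch below counts both, relying on the file naming used in Cell 3 (`plan_<hash>.json` for successes, `plan_<hash>_failed.txt` for failures). `summarize_output_dir` is a hypothetical helper, and the demo uses a temporary directory in place of the Drive folder.

```python
import tempfile
from pathlib import Path


def summarize_output_dir(output_dir: Path) -> dict:
    """Count validated plans and saved failure dumps in a Data Factory output folder."""
    plans = list(output_dir.glob("plan_*.json"))
    failures = list(output_dir.glob("plan_*_failed.txt"))
    total = len(plans) + len(failures)
    return {
        "plans": len(plans),
        "failures": len(failures),
        "success_rate": round(len(plans) / total, 3) if total else None,
    }


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp)
    for name in ("plan_aaaa.json", "plan_bbbb.json", "plan_cccc.json", "plan_dddd_failed.txt"):
        (out / name).write_text("{}")
    summary = summarize_output_dir(out)
    print(summary)  # {'plans': 3, 'failures': 1, 'success_rate': 0.75}
```

A prompt that failed in one run and succeeded in a later one leaves both files behind, so the failure count is an upper bound on prompts that still lack a plan.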




# HouseBrain Data Factory 2.0: The Architect's Assembly Line

This notebook is the control center for generating a high-quality, architecturally sound dataset for training the HouseBrain LLM. It leverages the "Architect's Assembly Line" script, which is a robust, multi-stage pipeline designed to guide an LLM in creating valid house plans.

**Our Strategy:**
1.  **Use a powerful "Generator" model** (e.g., `qwen2:7b`, as recommended) within the pipeline to create draft plans.
2.  **Run the pipeline at scale** on a large list of diverse architectural prompts.
3.  **Save the validated, "Gold Standard" outputs** directly to your Google Drive.
4.  Use this curated dataset in a separate notebook to fine-tune a more specialized "Architect" model (e.g., `Llama 3`).

## Instructions
1.  **Set Your GitHub PAT**: In the "Setup" section, you will be prompted to enter a GitHub Personal Access Token. This is required to clone the private `HouseBrainLLM` repository.
2.  **Define Your Prompts**: In the "Run Data Factory" section, a default list of prompts is provided. You should replace or extend this with a much larger and more diverse list for a full data generation run.
3.  **Run All Cells**: Once configured, select "Runtime" -> "Run all" from the menu. The notebook will set up the environment, download the necessary models, and begin the data generation process, saving the results to your Google Drive.



In [None]:
# @title ## 1. Setup Environment
# @markdown Mount Google Drive and clone the repository using a secure token.
from google.colab import drive
import os
import getpass
import subprocess

# Mount Google Drive
drive.mount('/content/drive')
print("✅ Google Drive mounted.")

# --- GitHub Setup ---
#@markdown Enter your GitHub Personal Access Token (PAT) with repo access.
GITHUB_TOKEN = getpass.getpass('Enter your GitHub PAT: ')
REPO_URL = f"https://{GITHUB_TOKEN}@github.com/Vinay-O/HouseBrainLLM.git"
REPO_DIR = "/content/HouseBrainLLM"

# Clone the repository
if os.path.exists(REPO_DIR):
    print("Repository already exists. Pulling latest changes...")
    subprocess.run(f"cd {REPO_DIR} && git pull", shell=True, check=True)
else:
    print("Cloning repository...")
    subprocess.run(f"git clone {REPO_URL} {REPO_DIR}", shell=True, check=True)

print("✅ Repository is ready.")

# --- Install Dependencies ---
#@markdown Install necessary Python packages.
!pip install -q pydantic GitPython

print("✅ Dependencies installed.")


In [None]:
# @title ## 2. Configure and Start Ollama Server
# @markdown This cell will download and start the Ollama server, then pull the specified model.
# @markdown The process will take a few minutes.

MODEL_NAME = "mixtral:instruct" # @param ["mixtral:instruct", "qwen2:7b", "llama3:8b", "mistral:7b-instruct", "qwen2:72b"]

# Download and start Ollama
!curl -fsSL https://ollama.com/install.sh | sh
import threading
import subprocess
import time

def run_ollama():
    try:
        subprocess.run("ollama serve", shell=True, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(f"Ollama server failed: {e.stderr}")

print("🚀 Starting Ollama server in the background...")
ollama_thread = threading.Thread(target=run_ollama)
ollama_thread.daemon = True
ollama_thread.start()

# Wait for the server to be ready
print("⏳ Waiting for Ollama server to initialize...")
time.sleep(10)

# Pull the model
print(f"📦 Pulling model: {MODEL_NAME}. This may take a while...")
try:
    subprocess.run(f"ollama pull {MODEL_NAME}", shell=True, check=True, capture_output=True, text=True)
    print(f"✅ Model {MODEL_NAME} is ready.")
except subprocess.CalledProcessError as e:
    print(f"Failed to pull model. Trying default tag...")
    base_model = MODEL_NAME.split(':')[0]
    subprocess.run(f"ollama pull {base_model}", shell=True, check=True)
    print(f"✅ Model {base_model} is ready.")

# Verify Ollama is running
!ollama list


In [None]:
# @title ## 3. Run the Data Factory (Parallel Mode)
# @markdown This cell is designed for large-scale, parallel data generation.
# @markdown It reads prompts from a central file in your Google Drive, processes a random batch, and saves to a central dataset folder.
# @markdown You can run this notebook on multiple accounts simultaneously to accelerate data creation.

import os
import sys
import json
import textwrap
import logging
from pathlib import Path
import urllib.request
import urllib.error
from inspect import getsource
from pydantic import BaseModel, ValidationError
import random
import hashlib

# --- Configuration ---
#@markdown The central location in your Google Drive for the master prompt file.
DRIVE_PROMPT_FILE = "/content/drive/MyDrive/housebrain_prompts/platinum_prompts.txt" #@param {type:"string"}

#@markdown The central location in your Google Drive to save the final dataset.
DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/housebrain_platinum_dataset" #@param {type:"string"}

#@markdown The number of plans this specific Colab instance should generate in this run.
NUM_PLANS_TO_GENERATE = 10 #@param {type:"integer"}

#@markdown The Ollama model to use for generation.
MODEL_NAME = "mixtral:instruct" #@param ["mixtral:instruct", "qwen2:7b", "llama3:8b"]
# --- End Configuration ---


# --- Setup Paths and Logging ---
REPO_DIR = "/content/HouseBrainLLM"
if REPO_DIR not in sys.path:
    sys.path.insert(0, REPO_DIR)
os.chdir(REPO_DIR)

logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
logger = logging.getLogger(__name__)

# --- Import Schema from Cloned Repo ---
try:
    from src.housebrain.schema import HouseOutput, RoomType
    print("✅ Successfully imported HouseBrain schema.")
except ImportError as e:
    print(f"❌ Failed to import HouseBrain schema: {e}")
    print("Please ensure the repository was cloned correctly in Step 1.")
    # Stop execution if schema fails
    raise e

# --- Self-Contained Generation Logic (with A+ Prompts) ---
VALID_ROOM_TYPES = [e.value for e in RoomType]

# --- TEMPLATE FIX ---
# Removed the initial 'f' from the string definition to prevent premature formatting.
# All placeholders are now handled by a single, robust .format() call later.
STAGE_1_PROMPT_TEMPLATE = """You are an expert AI architect. Your task is to generate ONLY the high-level geometric layout for a house based on a user's prompt.

**CRITICAL INSTRUCTIONS:**
1.  Focus ONLY on `levels` and `rooms`.
2.  Rooms MUST have an `id`, `type`, and non-overlapping `bounds`.
3.  The `type` for each room MUST be one of the following valid options: `{valid_room_types}`. Do NOT invent new types.
4.  DO NOT include `doors` or `windows` in this stage.
5.  Your output MUST be a single, valid JSON object with a root "levels" key.

**Golden Example of a perfect room structure:**
```json
{{
  "id": "living_room_0",
  "type": "living_room",
  "bounds": {{"x": 10, "y": 10, "width": 20, "height": 15}}
}}
```
---
**User Prompt:**
{user_prompt}
---
Now, generate the JSON for the house layout, adhering strictly to the instructions provided."""

STAGE_2_PROMPT_TEMPLATE = """You are an expert AI architect. Your task is to add `doors` and `windows` to a pre-existing house layout.

**CRITICAL INSTRUCTIONS:**
1.  Use the official `RoomType` enum for all room `type` fields. Valid types are: `{valid_room_types}`.
2.  DO NOT change the existing `id`, `type`, or `bounds` of the rooms.
3.  `Door` and `Window` objects MUST be complete and valid JSON objects, conforming to the schema.
4.  Your final output must be a single JSON object containing ONLY the `levels` key.

**Expert Design Hints:**
-   Place doors to create a logical and efficient flow between connected rooms.
-   Place windows on exterior walls to maximize natural light and capture views where appropriate.
-   Ensure `Door` objects correctly link two adjacent rooms in the `room1` and `room2` fields.

**Golden Example of a perfect room with openings (Pay close attention to the structure of Door and Window):**
```json
"rooms": [
   {{
     "id": "living_room_0",
     "type": "living_room",
     "bounds": {{ "x": 10, "y": 10, "width": 20, "height": 15 }},
     "doors": [
       {{
         "position": {{ "x": 20, "y": 25 }},
         "width": 3.0,
         "type": "interior",
         "room1": "living_room_0",
         "room2": "dining_room_0"
       }}
     ],
     "windows": [
       {{
         "position": {{ "x": 10, "y": 17.5 }},
         "width": 8.0,
         "height": 5.0,
         "type": "sliding",
         "room_id": "living_room_0"
       }}
     ]
   }}
]
```
---
**Full Schema Reference for Door and Window:**
```python
class Point2D(BaseModel):
    x: float
    y: float

class Door(BaseModel):
    position: Point2D
    width: float = 3.0
    type: str = "interior"
    room1: str
    room2: str

class Window(BaseModel):
    position: Point2D
    width: float
    height: float = 4.0
    type: str = "fixed"
    room_id: str
```
---
**Existing House Layout (Do not change this part):**
```json
{existing_layout}
```
---
**Original User Prompt:**
{user_prompt}
---
Now, add the doors and windows to the layout, following the format of the Golden Example and Schema Reference exactly."""


def call_ollama_colab(model_name: str, prompt: str):
    """Send a non-streaming JSON-mode request to the local Ollama API.

    Returns the model's response text, or None if the request fails.
    """
    url = "http://localhost:11434/api/generate"
    data = {"model": model_name, "prompt": prompt, "stream": False, "format": "json"}
    encoded_data = json.dumps(data).encode('utf-8')
    req = urllib.request.Request(url, data=encoded_data, headers={'Content-Type': 'application/json'})
    try:
        with urllib.request.urlopen(req, timeout=900) as response:
            if response.status == 200:
                response_data = json.loads(response.read().decode('utf-8'))
                return response_data.get("response", "")
    except urllib.error.HTTPError as e:
        error_content = e.read().decode('utf-8')
        logger.error(f"HTTP Error: {e.code} {e.reason} - {error_content}")
    except Exception as e:
        logger.error(f"Unexpected error calling Ollama: {e}")
    return None

# --- Execution ---
print("--- Starting Data Factory Run (Parallel Mode) ---")
Path(DRIVE_OUTPUT_DIR).mkdir(parents=True, exist_ok=True)

# 1. Load all available prompts from central file
all_prompts = []
try:
    with open(DRIVE_PROMPT_FILE, 'r') as f:
        all_prompts = [line.strip() for line in f if line.strip()]
    if not all_prompts:
        raise FileNotFoundError
    print(f"✅ Found {len(all_prompts)} total prompts in the master list.")
except FileNotFoundError:
    print(f"❌ MASTER PROMPT FILE NOT FOUND OR EMPTY at '{DRIVE_PROMPT_FILE}'.")
    print("Please run Cell 4 to generate it before running this cell.")

# 2. Select a random batch to process
if all_prompts:
    random.shuffle(all_prompts)
    prompts_to_process = all_prompts[:NUM_PLANS_TO_GENERATE]
    print(f"✅ This run will process a random batch of {len(prompts_to_process)} prompts.")

    for i, prompt_text in enumerate(prompts_to_process):
        print("\n" + "="*50)
        print(f"Processing prompt {i+1}/{len(prompts_to_process)}")
        
        # Create a unique filename from the prompt hash to avoid collisions
        prompt_hash = hashlib.sha1(prompt_text.encode()).hexdigest()[:16]
        run_name = f"plan_{prompt_hash}"
        output_file = Path(DRIVE_OUTPUT_DIR) / f"{run_name}.json"

        if output_file.exists():
            print(f"⏭️ Skipping prompt, output file already exists: {output_file.name}")
            continue

        print(textwrap.shorten(prompt_text, width=100, placeholder="..."))
        print("="*50)

        # --- STAGE 1 ---
        print("Running Stage 1: Layout Generation...")
        # Fill both placeholders in a single .format() call.
        stage_1_prompt = STAGE_1_PROMPT_TEMPLATE.format(
            user_prompt=prompt_text,
            valid_room_types=VALID_ROOM_TYPES
        )
        stage_1_response = call_ollama_colab(MODEL_NAME, stage_1_prompt)

        if not stage_1_response:
            print("❌ Stage 1 Failed: No response from model.")
            continue
        
        # --- STAGE 2 ---
        print("Running Stage 2: Adding Openings...")
        # Fill all three placeholders in a single .format() call.
        stage_2_prompt = STAGE_2_PROMPT_TEMPLATE.format(
            existing_layout=stage_1_response,
            user_prompt=prompt_text,
            valid_room_types=VALID_ROOM_TYPES
        )
        stage_2_response = call_ollama_colab(MODEL_NAME, stage_2_prompt)
        
        if not stage_2_response:
            print("❌ Stage 2 Failed: No response from model.")
            continue

        # --- STAGE 3: Finalize & Validate ---
        print("Running Stage 3: Finalizing and Validating...")
        try:
            layout_with_openings = json.loads(stage_2_response)
            
            total_area_sqft = sum(
                room['bounds']['width'] * room['bounds']['height']
                for level in layout_with_openings.get("levels", [])
                for room in level.get("rooms", [])
            )
            
            final_plan = {
                "input": {
                    "basicDetails": {"prompt": prompt_text, "totalArea": total_area_sqft},
                    "plot": {}, "roomBreakdown": []
                },
                "levels": layout_with_openings.get("levels", []),
                "total_area": round(total_area_sqft, 2),
                "construction_cost": 0.0, "materials": {}, "render_paths": {}
            }
            
            HouseOutput.model_validate(final_plan)

            with open(output_file, 'w') as f:
                json.dump(final_plan, f, indent=2)
            print(f"✅ SUCCESS! Saved validated plan to {output_file}")

        except json.JSONDecodeError:
            print("❌ Stage 3 Failed: Could not decode JSON from Stage 2.")
            # Keep the raw model output next to the dataset for later inspection.
            output_file.with_name(f"{run_name}_failed.txt").write_text(stage_2_response)
        except ValidationError as e:
            print(f"❌ Stage 3 Failed: Pydantic validation error - {e}")
            output_file.with_name(f"{run_name}_failed.txt").write_text(stage_2_response)
        except Exception as e:
            print(f"❌ Stage 3 Failed: An unexpected error occurred - {e}")
            output_file.with_name(f"{run_name}_failed.txt").write_text(stage_2_response)
    
    print("\n🎉 Data Factory run complete!")
else:
    print("No prompts to process.")



In [None]:
# @title ## 4. (Optional) Create `platinum_prompts.txt` for Large-Scale Runs
# @markdown Run this cell **once** to write `platinum_prompts.txt` to the local Colab filesystem (`/content`).
# @markdown Cell 3 reads its prompts from `DRIVE_PROMPT_FILE`, so copy this file to that Drive path (or point `DRIVE_PROMPT_FILE` at it) before running Cell 3.

import random

# --- Components for Prompt Generation ---
STYLES = [
    "Modern", "Contemporary", "Minimalist", "Traditional Kerala-style 'Nalukettu'",
    "Colonial", "Industrial", "Scandinavian", "Bohemian", "Farmhouse", "Chettinad-style",
    "Eco-friendly", "Brutalist", "Art Deco", "Mediterranean"
]
SIZES_BHK = ["1BHK", "2BHK", "3BHK", "4BHK", "5BHK", "6BHK", "studio apartment"]
SIZES_SQFT = [
    "800 sqft", "1000 sqft", "1200 sqft", "1500 sqft", "1800 sqft",
    "2000 sqft", "2500 sqft", "3000 sqft", "4000 sqft", "5000 sqft"
]
FLOORS = [
    "single-story", "two-story", "G+1", "G+2", "duplex", "triplex",
    "split-level", "penthouse"
]
STRUCTURE_TYPES = ["house", "villa", "apartment", "bungalow", "farmhouse", "townhouse", "cottage"]
PLOT_SIZES = [
    "30x40 feet", "30x50 feet", "40x60 feet", "50x80 feet", "60x90 feet",
    "80x100 feet", "100x100 feet", "corner", "irregular"
]
FEATURES = [
    "with an open-plan kitchen and living area", "with a swimming pool", "with a home theater",
    "with a large garden", "with a central courtyard", "with a dedicated home office",
    "with a private gym", "featuring floor-to-ceiling windows", "with a rooftop terrace",
    "with a two-car garage", "with servant's quarters", "with a library", "with a spacious balcony for each bedroom"
]
CONSTRAINTS = [
    "and be Vastu-compliant", "with a North-facing entrance", "with a West-facing plot",
    "on a tight budget", "for a luxury segment", "designed for a family of four",
    "with a focus on natural light and ventilation", "for a joint family", "as a bachelor pad",
    "to be wheelchair accessible"
]

def generate_prompt():
    prompt_parts = []
    style = random.choice(STYLES)
    num_floors = random.choice(FLOORS)
    bhk = random.choice(SIZES_BHK)
    structure = random.choice(STRUCTURE_TYPES)
    prompt_parts.append(f"Design a {style}, {num_floors} {bhk} {structure}")
    if random.random() < 0.7:
        plot = random.choice(PLOT_SIZES)
        prompt_parts.append(f"for a {plot} plot")
    else:
        area = random.choice(SIZES_SQFT)
        prompt_parts.append(f"with a total area of {area}")
    num_features = random.randint(1, 3)
    selected_features = random.sample(FEATURES, num_features)
    prompt_parts.extend(selected_features)
    if random.random() < 0.6:
        num_constraints = random.randint(1, 2)
        selected_constraints = random.sample(CONSTRAINTS, num_constraints)
        prompt_parts.extend(selected_constraints)
    # The parts are clauses of one sentence, so join them with spaces, not periods.
    return " ".join(prompt_parts) + "."

NUM_PROMPTS = 10000
OUTPUT_FILE = "/content/platinum_prompts.txt"

print(f"Generating {NUM_PROMPTS} prompts and saving to {OUTPUT_FILE}...")
with open(OUTPUT_FILE, 'w') as f:
    for _ in range(NUM_PROMPTS):
        prompt = generate_prompt()
        f.write(f"{prompt}\n")  # real newline: Cell 3 reads one prompt per line
print(f"✅ Successfully generated and saved {NUM_PROMPTS} prompts.")

# Display the first 5 prompts to verify
!head -n 5 {OUTPUT_FILE}
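
Since each prompt is drawn independently, the generated list can contain repeats, and Cell 3 names output files by prompt hash, so a repeated prompt yields no new plan. The following is a minimal sketch of seeded, de-duplicated generation. The tiny vocabularies and the `unique_prompts` helper are illustrative assumptions, not the notebook's actual lists or API.

```python
import random

# Tiny stand-in vocabularies (hypothetical; the cell above uses larger lists).
STYLES = ["Modern", "Colonial"]
SIZES = ["2BHK", "3BHK"]
STRUCTURES = ["house", "villa"]


def generate_prompt(rng: random.Random) -> str:
    """Build one prompt from randomly chosen components."""
    return f"Design a {rng.choice(STYLES)} {rng.choice(SIZES)} {rng.choice(STRUCTURES)}."


def unique_prompts(count: int, seed: int, max_attempts: int = 10_000) -> list[str]:
    """Draw prompts until `count` distinct ones are collected or attempts run out."""
    rng = random.Random(seed)  # a fixed seed makes the list reproducible
    seen: set[str] = set()
    ordered: list[str] = []
    for _ in range(max_attempts):
        if len(ordered) == count:
            break
        candidate = generate_prompt(rng)
        if candidate not in seen:
            seen.add(candidate)
            ordered.append(candidate)
    return ordered


prompts = unique_prompts(count=5, seed=42)
print(len(prompts), len(set(prompts)))
```

The attempt cap keeps the loop from spinning forever when `count` exceeds the number of distinct prompts the vocabularies can produce; in that case the function simply returns fewer prompts.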



In [None]:
# @title ## 5. (Optional) Download Generated Dataset
# @markdown Run this cell after the data generation is complete to compress and download the entire output folder.

import shutil
import os
from google.colab import files

# Define the source directory in Google Drive and the target zip file path.
# This must match DRIVE_OUTPUT_DIR from Cell 3.
source_dir = "/content/drive/MyDrive/housebrain_platinum_dataset"
zip_filename = "housebrain_dataset.zip"
zip_filepath = f"/content/{zip_filename}"

if os.path.exists(source_dir):
    # Create the zip archive
    print(f"Compressing '{source_dir}' into '{zip_filepath}'...")
    shutil.make_archive(zip_filepath.replace('.zip', ''), 'zip', source_dir)
    print("✅ Compression complete.")

    # Provide a download link
    print(f"\nDownloading '{zip_filename}'...")
    files.download(zip_filepath)
else:
    print(f"ERROR: The source directory '{source_dir}' was not found. Please ensure the Data Factory ran correctly.")
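
The download cell strips `.zip` before calling `shutil.make_archive` because that function appends the extension itself and returns the final archive path. A small sketch of that behavior, using a temporary directory and a made-up file name (`plan_example.json`) in place of the real dataset folder:

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for the dataset folder, holding one placeholder file.
    source_dir = Path(tmp) / "dataset"
    source_dir.mkdir()
    (source_dir / "plan_example.json").write_text("{}")

    # Pass the base name WITHOUT ".zip"; make_archive adds the extension
    # and returns the full path of the archive it wrote.
    base_name = str(Path(tmp) / "dataset_backup")
    archive_path = shutil.make_archive(base_name, "zip", str(source_dir))

    # Read the member list back while the temporary directory still exists.
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = archive.namelist()

print(Path(archive_path).name, archived_names)
```

Using the returned path, rather than rebuilding the file name by hand, avoids a mismatch between the archive that was written and the one passed to `files.download`.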



In [None]:
# @title ## 2. Configure and Start Ollama Server
# @markdown This cell will download and start the Ollama server, then pull the specified model.
# @markdown The process will take a few minutes.

MODEL_NAME = "mixtral:instruct" # @param ["mixtral:instruct", "qwen2:7b", "llama3:8b", "mistral:7b-instruct", "qwen2:72b"]

# Download and start Ollama
!curl -fsSL https://ollama.com/install.sh | sh
import threading
import subprocess
import time

def run_ollama():
    try:
        subprocess.run("ollama serve", shell=True, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(f"Ollama server failed: {e.stderr}")

print("🚀 Starting Ollama server in the background...")
ollama_thread = threading.Thread(target=run_ollama)
ollama_thread.daemon = True
ollama_thread.start()

# Wait for the server to be ready
print("⏳ Waiting for Ollama server to initialize...")
time.sleep(10)

# Pull the model
print(f"📦 Pulling model: {MODEL_NAME}. This may take a while...")
try:
    subprocess.run(f"ollama pull {MODEL_NAME}", shell=True, check=True, capture_output=True, text=True)
    print(f"✅ Model {MODEL_NAME} is ready.")
except subprocess.CalledProcessError:
    print("Failed to pull model. Trying default tag...")
    base_model = MODEL_NAME.split(':')[0]
    subprocess.run(f"ollama pull {base_model}", shell=True, check=True)
    print(f"✅ Model {base_model} is ready.")

# Verify Ollama is running
!ollama list
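
The fixed `time.sleep(10)` above can be too short on a slow VM and needlessly long on a fast one. One alternative is to poll the server until it answers. The sketch below is an assumption-laden illustration: `wait_for_http` is a hypothetical helper, and the demo polls a throwaway local HTTP server rather than Ollama so that it runs anywhere.

```python
import http.server
import threading
import time
import urllib.error
import urllib.request

# Opener that ignores proxy settings, since the target is a local server.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def wait_for_http(url: str, timeout_s: float = 60.0, interval_s: float = 0.5) -> bool:
    """Poll `url` until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with _opener.open(url, timeout=2) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server is not accepting connections yet
        time.sleep(interval_s)
    return False


class _OkHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in server for the demo; always answers 200."""

    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        pass  # keep the demo output quiet


server = http.server.HTTPServer(("127.0.0.1", 0), _OkHandler)  # port 0 = any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
ready = wait_for_http(f"http://127.0.0.1:{server.server_port}/", timeout_s=5)
server.shutdown()
server.server_close()
print("ready:", ready)
```

In this notebook the call would presumably be `wait_for_http("http://localhost:11434/")`, on the assumption that the Ollama root endpoint returns 200 once the server is up; verify that against your Ollama version before relying on it.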


In [None]:
# @title ## 4. (One-Time Setup) Generate Master Prompt File
# @markdown This cell uses the `generate_prompts.py` script to create your master prompt file in Google Drive.
# @markdown **You only need to run this cell once.**
# @markdown Once the file is created, Cell 3 will be able to read from it for all future runs.

import os
from pathlib import Path

# --- Configuration ---
#@markdown The desired location in your Google Drive for the master prompt file. This MUST match the path in Cell 3.
DRIVE_PROMPT_FILE = "/content/drive/MyDrive/housebrain_prompts/platinum_prompts.txt" #@param {type:"string"}

#@markdown The total number of prompts to generate for your master list.
NUM_PROMPTS_TO_GENERATE = 30000 #@param {type:"integer"}
# --- End Configuration ---

# --- Execution ---
REPO_DIR = "/content/HouseBrainLLM"
script_path = os.path.join(REPO_DIR, "scripts/generate_prompts.py")

# Run from the repository root so the script's relative paths resolve
os.chdir(REPO_DIR)

# Ensure the target directory in Drive exists
Path(DRIVE_PROMPT_FILE).parent.mkdir(parents=True, exist_ok=True)

print(f"Running prompt generation script to create {NUM_PROMPTS_TO_GENERATE} prompts...")
# Quote both paths so that spaces in them survive the shell
command = f'python3 "{script_path}" --num-prompts {NUM_PROMPTS_TO_GENERATE} --output-file "{DRIVE_PROMPT_FILE}"'
!{command}

print("\n--- Verification ---")
if Path(DRIVE_PROMPT_FILE).exists():
    print(f"✅ Master prompt file successfully created at: {DRIVE_PROMPT_FILE}")
    print("First 5 prompts in the file:")
    !head -n 5 "{DRIVE_PROMPT_FILE}"
else:
    print("❌ ERROR: Master prompt file was not created. Please check for errors above.")
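
Quoting paths inside a shell string works, but passing the arguments as a list to `subprocess.run` sidesteps shell parsing altogether. The sketch below assumes nothing about `generate_prompts.py` beyond the two flags used above; a one-line echo script stands in for it so the example is runnable, and the output path is a made-up one containing spaces.

```python
import subprocess
import sys

# Hypothetical stand-in for scripts/generate_prompts.py: it only echoes its arguments.
echo_script = "import sys; print('|'.join(sys.argv[1:]))"
output_file = "/content/drive/My Drive/prompts with spaces.txt"

result = subprocess.run(
    [sys.executable, "-c", echo_script,
     "--num-prompts", "30000",
     "--output-file", output_file],
    capture_output=True,
    text=True,
    check=True,  # raise CalledProcessError on a non-zero exit code
)

# Each list element arrives as exactly one argument, spaces included.
received = result.stdout.strip().split("|")
print(received)
```

With `check=True`, a failing script raises an exception instead of letting the cell continue to the verification step, which makes failures harder to miss.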

