# HouseBrain Data Factory 2.0: The Architect's Assembly Line

This notebook is the control center for generating a high-quality, architecturally sound dataset for training the HouseBrain LLM. It runs the "Architect's Assembly Line" script, a multi-stage pipeline that guides an LLM in creating valid house plans.

**Our Strategy:**
1.  **Use a powerful "Generator" model** (e.g., `qwen2:7b`, as recommended) within the pipeline to create draft plans.
2.  **Run the pipeline at scale** on a large list of diverse architectural prompts.
3.  **Save the validated, "Gold Standard" outputs** directly to your Google Drive.
4.  **Fine-tune a more specialized "Architect" model** (e.g., `Llama 3`) on this curated dataset in a separate notebook.

## Instructions
1.  **Set Your GitHub PAT**: In the "Setup" section, you will be prompted to enter a GitHub Personal Access Token. This is required to clone the private `HouseBrainLLM` repository.
2.  **Define Your Prompts**: In the "Run Data Factory" section, a default list of prompts is provided. You should replace or extend this with a much larger and more diverse list for a full data generation run.
3.  **Run All Cells**: Once configured, select "Runtime" -> "Run all" from the menu. The notebook will set up the environment, download the necessary models, and begin the data generation process, saving the results to your Google Drive.
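
One way to build the larger, more diverse prompt list called for in step 2 is to combine a few attribute pools instead of writing every prompt by hand. The sketch below is only an illustration: `build_prompts`, the attribute values, and the sentence template are assumptions and are not part of the `HouseBrainLLM` repository.

```python
import itertools
import random

# Hypothetical attribute pools; replace with values that suit your dataset.
STYLES = ["modern", "traditional", "minimalist"]
LAYOUTS = ["2BHK", "3BHK", "4BHK"]
PLOTS = ["30x40", "40x60", "50x80"]
FACINGS = ["North", "East", "South", "West"]

def build_prompts(limit=None, seed=0):
    """Return prompts built from combinations of the attribute pools."""
    combos = list(itertools.product(STYLES, LAYOUTS, PLOTS, FACINGS))
    # Shuffle with a fixed seed so a truncated list is varied but reproducible.
    random.Random(seed).shuffle(combos)
    if limit is not None:
        combos = combos[:limit]
    return [
        f"A {style} {layout} house for a {plot} feet plot with a {facing}-facing entrance."
        for style, layout, plot, facing in combos
    ]

generated_prompts = build_prompts(limit=20)
print(len(generated_prompts))
print(generated_prompts[0])
```

The generated list could then be appended to the default `prompts` list in the "Run Data Factory" section, for example with `prompts.extend(generated_prompts)`.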

In [None]:
# @title ## 1. Setup Environment
# @markdown Mount Google Drive and clone the repository using a secure token.
from google.colab import drive
import os
import getpass
import subprocess

# Mount Google Drive
drive.mount('/content/drive')
print("✅ Google Drive mounted.")

# --- GitHub Setup ---
#@markdown Enter your GitHub Personal Access Token (PAT) with repo access.
GITHUB_TOKEN = getpass.getpass('Enter your GitHub PAT: ')
REPO_URL = f"https://{GITHUB_TOKEN}@github.com/Vinay-O/HouseBrainLLM.git"
REPO_DIR = "/content/HouseBrainLLM"

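# Clone the repository, or update it if an earlier run already cloned it.
# Argument lists (no shell=True) avoid shell parsing of the token-bearing URL.
if os.path.exists(REPO_DIR):
    print("Repository already exists. Pulling latest changes...")
    subprocess.run(["git", "-C", REPO_DIR, "pull"], check=True)
else:
    print("Cloning repository...")
    result = subprocess.run(["git", "clone", REPO_URL, REPO_DIR])
    if result.returncode != 0:
        # check=True would echo the full command, including the token, in the traceback.
        raise RuntimeError("git clone failed; check the PAT and repository access.")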

print("✅ Repository is ready.")

# --- Install Dependencies ---
#@markdown Install necessary Python packages.
!pip install -q pydantic GitPython

print("✅ Dependencies installed.")

In [None]:
# @title ## 2. Configure and Start Ollama Server
# @markdown This cell will download and start the Ollama server, then pull the specified model.
# @markdown The process will take a few minutes.

MODEL_NAME = "mixtral:instruct" # @param ["mixtral:instruct", "qwen2:7b", "llama3:8b", "mistral:7b-instruct", "qwen2:72b"]

# Download and start Ollama
!curl -fsSL https://ollama.com/install.sh | sh
import threading
import subprocess
import time

def run_ollama():
    try:
        subprocess.run("ollama serve", shell=True, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(f"Ollama server failed: {e.stderr}")

print("🚀 Starting Ollama server in the background...")
ollama_thread = threading.Thread(target=run_ollama)
ollama_thread.daemon = True
ollama_thread.start()

# Wait for the server to be ready
print("⏳ Waiting for Ollama server to initialize...")
time.sleep(10)

# Pull the model
print(f"📦 Pulling model: {MODEL_NAME}. This may take a while...")
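try:
    subprocess.run(f"ollama pull {MODEL_NAME}", shell=True, check=True, capture_output=True, text=True)
    print(f"✅ Model {MODEL_NAME} is ready.")
except subprocess.CalledProcessError as e:
    # Show why the tagged pull failed before falling back to the default tag.
    print(f"Failed to pull {MODEL_NAME}: {e.stderr.strip()}")
    print("Trying the default tag...")
    base_model = MODEL_NAME.split(':')[0]
    subprocess.run(f"ollama pull {base_model}", shell=True, check=True)
    # Later cells pass MODEL_NAME to the pipeline, so point it at the model actually pulled.
    MODEL_NAME = base_model
    print(f"✅ Model {MODEL_NAME} is ready.")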

# Verify Ollama is running
!ollama list
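
The fixed `time.sleep(10)` used above can be too short on a slow runtime and wasteful on a fast one. A minimal alternative is to poll the server until it answers. This sketch assumes Ollama is listening on its default address, `http://localhost:11434`; `wait_for_server` is a hypothetical helper, not part of Ollama or of this repository.

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url="http://localhost:11434", timeout=60.0, interval=1.0):
    """Poll `url` until it returns HTTP 200 or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Server is not accepting connections yet; retry after a short pause.
            pass
        time.sleep(interval)
    return False
```

In this notebook, a call such as `wait_for_server()` could stand in for the sleep, with the cell raising an error before `ollama pull` if it returns `False`.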

In [None]:
# @title ## 3. Run the Data Factory
# @markdown Execute the Architect's Assembly Line for each prompt.
# @markdown You should replace the `prompts` list with your own extensive list for a full run.

import os
import subprocess  # used below to launch the pipeline script
from datetime import datetime
import textwrap

# --- Configuration ---
#@markdown Define the list of prompts to be processed.
prompts = [
    "A modern, single-story 3BHK house for a 50x80 feet plot. It must feature an open-plan kitchen and living area, a dedicated home office, and be Vastu-compliant with a North-facing entrance.",
    "A luxurious two-story 5BHK villa for a 100x100 feet plot, west-facing, with a swimming pool, home theater, and a large garden.",
    "A compact, budget-friendly 2BHK apartment design for a family of four, with a total area of 1200 sqft.",
    "A traditional Kerala-style 'Nalukettu' house with a central courtyard for a 60x90 feet south-facing plot.",
    "Design a G+2 building on a 30x60 feet plot. The ground floor should be for parking. The first and second floors should be identical 2BHK units.",
]

#@markdown Specify the output directory in your Google Drive.
DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/housebrain_dataset"

# --- Execution ---
os.chdir(REPO_DIR)
script_path = "scripts/run_complete_assembly_line.py"
run_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
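output_dir = os.path.join(DRIVE_OUTPUT_DIR, f"run_{run_timestamp}")
# Create the run directory so the "ready" message below is accurate.
os.makedirs(output_dir, exist_ok=True)

print(f"Output directory is ready at: {output_dir}")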
print(f"Found {len(prompts)} prompts to process.")
print("="*50)

for i, prompt in enumerate(prompts):
    print(f"Processing prompt {i+1}/{len(prompts)}")
    prompt_short = textwrap.shorten(prompt, width=100, placeholder="...")
    print(f"PROMPT: {prompt_short}")
    print("="*50)

    # Create a unique name for the run based on the prompt
    run_name = f"prompt_{i+1:04d}"

    command = [
        "python3",
        script_path,
        "--prompt", prompt,
        "--output-dir", output_dir,
        "--run-name", run_name,
        "--model", MODEL_NAME,
        "--max-retries", "5" # Be more resilient in Colab
    ]

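    # Report failures instead of silently moving on to the next prompt.
    result = subprocess.run(command)
    if result.returncode != 0:
        print(f"⚠️ Pipeline exited with code {result.returncode} for {run_name}.")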
    print("\n" + "-"*50 + "\n")

print("🎉 Data Factory run complete! Check your Google Drive for the generated files.")
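
Because the loop above only prints progress, it can help to keep a machine-readable record of which prompts succeeded so that failed ones can be retried later. The following is a sketch under stated assumptions: `append_manifest_entry`, `load_failed_runs`, and the `manifest.jsonl` file name are hypothetical, and a zero exit code is taken as the success signal (how the assembly-line script actually reports failure is not shown in this notebook).

```python
import json
import os
import tempfile

def append_manifest_entry(manifest_path, run_name, prompt, returncode):
    """Append one JSON line describing a finished pipeline run."""
    entry = {
        "run_name": run_name,
        "prompt": prompt,
        "returncode": returncode,
        "succeeded": returncode == 0,
    }
    with open(manifest_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry

def load_failed_runs(manifest_path):
    """Return the run names whose recorded exit code was non-zero."""
    with open(manifest_path, encoding="utf-8") as handle:
        entries = [json.loads(line) for line in handle if line.strip()]
    return [e["run_name"] for e in entries if not e["succeeded"]]

# Demonstration with a temporary directory standing in for the Drive folder.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_manifest = os.path.join(demo_dir, "manifest.jsonl")
    append_manifest_entry(demo_manifest, "prompt_0001", "A compact 2BHK apartment.", 0)
    append_manifest_entry(demo_manifest, "prompt_0002", "A G+2 building.", 1)
    failed = load_failed_runs(demo_manifest)
    print(failed)
```

Inside the loop, the call would follow the `subprocess.run(command)` step, passing its `returncode` and a manifest path under `output_dir`.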