# HouseBrain: Automated Data Factory 🏭

This notebook is a dedicated environment for generating a large, high-quality dataset of architectural plans. It implements a hybrid-model `Generate -> Analyze -> Repair` pipeline.

**Workflow:**
1.  **Setup:** Mounts Google Drive, clones the repository, and installs Ollama.
2.  **Model Provisioning:** Pulls two models: a fast model for initial generation (`llama3:8b`) and a more capable model for repairs (`qwen3:30b`).
3.  **Prompt Loading:** Reads a list of design prompts from a text file.
4.  **Automated Curation:** For each prompt, runs the orchestration script, which:
    a.  **Generates** a draft using the fast `llama3:8b`.
    b.  **Analyzes** the draft for errors.
    c.  **Repairs** the draft using the more capable `qwen3:30b` if needed.
5.  **Output:** Saves the final, validated "Gold Standard" JSON files directly to your Google Drive.
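
The control flow of steps 4a to 4c can be sketched as follows. This only illustrates the shape of the loop and is not the code in `scripts/automated_curation.py`: the `curate` function and the `generate`, `analyze`, and `repair` callables are hypothetical stand-ins, and the retry limit mirrors the `--max-retries` flag passed later in this notebook.

```python
from typing import Callable, List, Optional


def curate(
    prompt: str,
    generate: Callable[[str], dict],
    analyze: Callable[[dict], List[str]],
    repair: Callable[[dict, List[str]], dict],
    max_retries: int = 3,
) -> Optional[dict]:
    """Return a plan that passes analysis, or None if repairs run out."""
    draft = generate(prompt)            # fast model produces the first draft
    for _ in range(max_retries):
        errors = analyze(draft)         # an empty list means the draft is valid
        if not errors:
            return draft
        draft = repair(draft, errors)   # stronger model fixes the reported errors
    # One last check after the final repair attempt.
    return draft if not analyze(draft) else None
```

Passing the three stages as callables keeps the loop independent of any particular model, so the same skeleton works whichever pair of models is provisioned.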


## 1. Setup Environment


In [None]:
# Mount Google Drive to persist our dataset
from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Clone the repository
# IMPORTANT: Replace <YOUR_GITHUB_PAT> with your GitHub Personal Access Token and YOUR_USERNAME with your GitHub username
GITHUB_PAT = "<YOUR_GITHUB_PAT>"
REPOSITORY_URL = f"https://{GITHUB_PAT}@github.com/YOUR_USERNAME/housebrain_v1_1.git"

!git clone {REPOSITORY_URL}
%cd housebrain_v1_1


In [None]:
# Install and start Ollama in the background
!echo "Installing Ollama..."
!curl -fsSL https://ollama.com/install.sh | sh

import subprocess

# Run `ollama serve` in the background. Its output goes to a log file rather
# than an unread PIPE, which could fill up and block the server.
ollama_log = open("/content/ollama_serve.log", "w")
process = subprocess.Popen(
    ["ollama", "serve"],
    stdout=ollama_log,
    stderr=subprocess.STDOUT,
)
print("Ollama server started in the background.")

!sleep 5 # Give the server a moment to start


### 1.1. Health Check: Verify Ollama Server is Running


In [None]:
# This command lists the models available to the running Ollama server.
# Before you pull any models, the list will be empty (header row only).
# A connection error here means the server did not start.
!ollama list
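
A fixed `sleep` can be too short on a slow runtime. As an alternative, the sketch below polls until the server answers. It assumes Ollama's default address `http://localhost:11434` (adjust it if `OLLAMA_HOST` is set); `ollama_is_up` and `wait_until_ready` are hypothetical helper names, not part of Ollama.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def ollama_is_up(url: str = "http://localhost:11434") -> bool:
    """Return True if an HTTP GET to the server root succeeds."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool] = ollama_is_up,
                     timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Call `probe` repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

Calling `wait_until_ready()` returns `False` if the server never responds within the timeout, which is a clearer failure signal than a later model pull erroring out.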


## 2. Provision LLM Models
This step will take a while, especially for the 30B model.


In [None]:
# Pull the fast model for generation
!echo "Pulling Llama 3 8B model (for generation)..."
!ollama pull llama3:8b


### 2.1. Verify 8B Model Installation


In [None]:
# Running 'ollama list' again.
# You should now see 'llama3:8b' in the list of available models.
!ollama list


In [None]:
# Pull the powerful model for repair
# This will download ~19GB.
!echo "Pulling Qwen3 30B model (for repair)..."
!ollama pull qwen3:30b


### 2.2. Verify 30B Model Installation


In [None]:
# Running 'ollama list' a final time.
# You should now see both 'llama3:8b' and 'qwen3:30b' in the list.
!ollama list
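
To make this check fail loudly rather than relying on reading the table by eye, the output can be parsed. The sketch below assumes `ollama list` prints a header row followed by one model per line, with the model name in the first column. `missing_models` is a hypothetical helper, and the sample table is illustrative rather than captured output.

```python
from typing import Iterable, List


def missing_models(list_output: str, required: Iterable[str]) -> List[str]:
    """Return the required model names absent from `ollama list` output."""
    lines = [ln for ln in list_output.splitlines() if ln.strip()]
    # Skip the header row; the first whitespace-separated field is the name.
    installed = {ln.split()[0] for ln in lines[1:]}
    return [name for name in required if name not in installed]


# Illustrative table (IDs and dates are made up for the example).
sample = """NAME         ID      SIZE     MODIFIED
llama3:8b    abc123  4.7 GB   2 minutes ago
"""
print(missing_models(sample, ["llama3:8b", "qwen3:30b"]))
```

In the notebook, the real output can be captured with `subprocess.run(["ollama", "list"], capture_output=True, text=True).stdout` and passed to the same function.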


## 3. Create Prompt List

Create a file named `prompts.txt` in the root of your repository. Each line in this file should be a unique prompt for a house plan. The more varied and detailed the prompts, the better the dataset will be.


In [None]:
# Create a sample prompts.txt file. 
# YOU SHOULD REPLACE THIS WITH YOUR OWN LARGER FILE.
prompts = [
    "A modern, single-story 3BHK house for a 50x80 feet plot. It must feature an open-plan kitchen and living area, a dedicated home office, and be Vastu-compliant with a North-facing entrance.",
    "A luxurious two-story 5BHK villa for a 100x100 feet plot, west-facing, with a swimming pool, home theater, and a large garden.",
    "A compact, budget-friendly 2BHK apartment design for a family of four, with a total area of 1200 sqft.",
    "A traditional Kerala-style 'Nalukettu' house with a central courtyard for a 60x90 feet south-facing plot.",
    "Design a G+2 building on a 30x60 feet plot. The ground floor should be for parking. The first and second floors should be identical 2BHK units.",
]

with open('prompts.txt', 'w') as f:
    for prompt in prompts:
        f.write(prompt + '\n')
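
Since each line should hold one unique prompt, it can help to normalize the list before writing it. A minimal sketch, assuming duplicates are prompts that match after whitespace cleanup and case-folding (`unique_prompts` is a hypothetical helper):

```python
from typing import Iterable, List


def unique_prompts(prompts: Iterable[str]) -> List[str]:
    """Drop blanks and case-insensitive duplicates, keeping first-seen order."""
    seen = set()
    result = []
    for raw in prompts:
        # Collapse internal whitespace so each prompt stays on one line.
        text = " ".join(raw.split())
        key = text.casefold()
        if text and key not in seen:
            seen.add(key)
            result.append(text)
    return result
```

Collapsing whitespace also removes embedded newlines, which would otherwise split one prompt into two when the file is read back line by line.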


## 4. Run the Automated Curation Factory

This final step will loop through your `prompts.txt` file and run the full pipeline for each one. Validated files will be saved to your Google Drive.

This process can run for a very long time. Ensure your Colab session does not time out.


In [None]:
import subprocess
import os

prompts_file = 'prompts.txt'
# This path should point to a folder in your Google Drive
output_directory = '/content/drive/MyDrive/housebrain_automated_dataset'

# Ensure the output directory exists
os.makedirs(output_directory, exist_ok=True)

with open(prompts_file, 'r') as f:
    prompts = [line.strip() for line in f if line.strip()]

for i, prompt in enumerate(prompts):
    print(f"""
    =================================================
    Processing prompt {i+1}/{len(prompts)}
    PROMPT: {prompt[:100]}...
    =================================================
    """)
    
    command = [
        'python',
        'scripts/automated_curation.py',
        '--prompt', prompt,
        '--output-dir', output_directory,
        '--model', 'llama3:8b',
        '--repair-model', 'qwen3:30b',
        '--max-retries', '3'
    ]
    
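    # Stream the script's output line by line so progress is visible in Colab
    with subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1) as p:
        for line in p.stdout:
            print(line, end='')
    # The context manager waits for the process, so returncode is set here
    if p.returncode != 0:
        print(f"Prompt {i+1} failed with exit code {p.returncode}; continuing.")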

print("\n\n🎉 Data Factory run complete! Check your Google Drive for the generated files.")
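
Because a run can outlast a Colab session, it is worth making the loop resumable. One possible approach, sketched below, records a hash of each finished prompt in a ledger file and skips those prompts on restart. The ledger file and the helpers `prompt_key`, `load_done`, and `mark_done` are assumptions for illustration; the sketch does not rely on how the curation script names its output files.

```python
import hashlib
from pathlib import Path
from typing import Set


def prompt_key(prompt: str) -> str:
    """Stable short identifier for a prompt."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16]


def load_done(ledger: Path) -> Set[str]:
    """Read the keys of prompts that already finished."""
    if not ledger.exists():
        return set()
    return set(ledger.read_text(encoding="utf-8").split())


def mark_done(ledger: Path, prompt: str) -> None:
    """Append a finished prompt's key to the ledger."""
    with ledger.open("a", encoding="utf-8") as f:
        f.write(prompt_key(prompt) + "\n")
```

In the loop, a prompt whose key is already in `load_done(ledger)` would be skipped, and `mark_done` would be called once its subprocess exits successfully. Keeping the ledger inside the Drive output directory lets it survive a runtime reset.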
