# HouseBrain: Automated Data Factory (Qwen3-30B Strategy) 🏭

This notebook is a dedicated environment for generating a dataset of architectural plans at scale. A single model (`qwen3:30b`) handles every stage of the `Generate -> Analyze -> Repair` pipeline.

**Workflow:**
1.  **Setup:** Mounts Google Drive, clones the repository using a secure token prompt, and installs dependencies.
2.  **Ollama Server:** Installs and runs the Ollama server in the background.
3.  **Model Provisioning:** Pulls the `qwen3:30b` model (a download of roughly 19 GB).
4.  **Prompt Loading:** Creates a `prompts.txt` file to serve as the workload for the factory.
5.  **Automated Curation:** For each prompt, runs the orchestration script `scripts/automated_curation.py`, which uses `qwen3:30b` to generate a draft, analyze it, and repair it if necessary.
6.  **Output:** Saves the final, validated "Gold Standard" JSON files directly to your Google Drive. This dataset can then be used to fine-tune any other model, such as Llama 3.
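
The control flow of step 5 can be pictured with a small sketch. This is not the code in `scripts/automated_curation.py`, whose internals this notebook does not show; the `curate` function and its `generate`, `analyze`, and `repair` callables are hypothetical stand-ins, and the retry semantics are an assumption.

```python
from typing import Callable, List, Optional


def curate(prompt: str,
           generate: Callable[[str], str],
           analyze: Callable[[str], List[str]],
           repair: Callable[[str, List[str]], str],
           max_retries: int = 3) -> Optional[str]:
    """Hypothetical Generate -> Analyze -> Repair loop.

    Returns a draft that passes analysis, or None if it still has
    problems after `max_retries` repair attempts.
    """
    draft = generate(prompt)
    for _ in range(max_retries):
        problems = analyze(draft)      # an empty list means the draft is valid
        if not problems:
            return draft
        draft = repair(draft, problems)
    # Check the result of the final repair attempt
    return draft if not analyze(draft) else None
```

In the real pipeline all three callables would be backed by `qwen3:30b`; here they are plain functions so the loop can be exercised without a model.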


## Step 1: Setup Environment


In [None]:
# Mount Google Drive to persist our dataset
from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Securely provide your GitHub token to clone the private repository
from getpass import getpass
import os

# Prompt for the GitHub token
github_token = getpass('Enter your GitHub Personal Access Token (PAT): ')
os.environ['GITHUB_TOKEN'] = github_token

# Clone the repository using the token
# Replace 'Vinay-O/HouseBrainLLM' with your own GitHub username and repository if it's different.
!git clone https://{os.environ.get('GITHUB_TOKEN')}@github.com/Vinay-O/HouseBrainLLM.git
%cd HouseBrainLLM

print("\n✅ Repository cloned successfully.")


In [None]:
# Install necessary Python packages
!pip install -q requests
print("✅ Dependencies installed.")


## Step 2: Install and Start Ollama Server


In [None]:
# Install and start Ollama in the background
!echo "Installing Ollama..."
!curl -fsSL https://ollama.com/install.sh | sh

import subprocess

# Run `ollama serve` in the background. Its output goes to a log file rather
# than to unread pipes, which could fill up and stall the server.
ollama_log = open('/content/ollama_server.log', 'w')
process = subprocess.Popen(['ollama', 'serve'], stdout=ollama_log, stderr=subprocess.STDOUT)
print("Ollama server starting in the background...")

!sleep 5 # Give the server a moment to start up.
print("✅ Ollama server should be running (log: /content/ollama_server.log).")
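
A fixed sleep can be too short on a slow runtime. As a possible alternative, the sketch below polls the server until it answers. It assumes Ollama is listening on its default address `http://localhost:11434`; `ollama_is_up` and `wait_until_ready` are illustrative helper names, not part of the repository.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def ollama_is_up(url: str = "http://localhost:11434") -> bool:
    """Return True if an HTTP GET on the server root succeeds.

    The URL is an assumption: Ollama's default host and port.
    """
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_ready(probe: Callable[[], bool] = ollama_is_up,
                     timeout: float = 60.0,
                     interval: float = 1.0) -> bool:
    """Call `probe` every `interval` seconds until it returns True or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False
```

Calling `wait_until_ready()` after starting the server and raising an error when it returns `False` would stop the notebook early instead of failing later during the model pull.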


In [None]:
# Health Check: Verify Ollama Server is Running
print("--- Ollama Health Check ---")
!ollama list


## Step 3: Provision the Qwen3-30B Model


In [None]:
# Pull the Qwen3 30B model. This will download ~19GB.
!echo "Pulling Qwen3 30B model..."
!ollama pull qwen3:30b
print("\n--- Verification ---")
# You should see 'qwen3:30b' in the list below
!ollama list
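
The visual check above can also be done programmatically. The sketch below assumes the server's `/api/tags` endpoint on the default port, which returns a JSON object with a `models` list whose entries carry a `name` field; `fetch_tags` and `installed_models` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Set


def fetch_tags(base_url: str = "http://localhost:11434") -> str:
    """Fetch the raw JSON listing of locally available models (default URL assumed)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return resp.read().decode("utf-8")


def installed_models(tags_json: str) -> Set[str]:
    """Extract model names from an /api/tags response body."""
    payload = json.loads(tags_json)
    return {model["name"] for model in payload.get("models", [])}
```

With the server running, `'qwen3:30b' in installed_models(fetch_tags())` gives a boolean that a later cell could assert on before starting a long run.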


## Step 4: Create Prompt List


In [None]:
# You can replace this list with the contents of 'formatted_prompts.txt'
# or create a much larger list to generate more data.
prompts = [
    "Design a modern, single-story 3BHK house for a 50x80 feet plot. It must feature an open-plan kitchen and living area, a dedicated home office, and be Vastu-compliant with a North-facing entrance.",
    "A luxurious two-story 5BHK villa for a 100x100 feet plot, west-facing, with a swimming pool, home theater, and a large garden.",
    "A compact, budget-friendly 2BHK apartment design for a family of four, with a total area of 1200 sqft.",
    "A traditional Kerala-style 'Nalukettu' house with a central courtyard for a 60x90 feet south-facing plot.",
    "Design a G+2 building on a 30x60 feet plot. The ground floor should be for parking. The first and second floors should be identical 2BHK units.",
    "Create a Vastu-compliant 4BHK duplex house plan for an east-facing 40x60 feet plot, including a pooja room and a small garden.",
    "A minimalist 1BHK studio apartment layout for a young professional, maximizing space in a 600 sqft area.",
    "Design a sprawling farmhouse on a 1-acre plot with 4 bedrooms, a large verandah, servant's quarters, and space for organic farming.",
    "A G+1 6BHK joint family home for a 50x100 feet plot, with separate kitchen and living areas on each floor but connected by an internal staircase.",
    "A contemporary 3BHK house with a budget of 50 lakhs for a 30x50 feet plot, prioritizing natural light and ventilation.",
]

with open('prompts.txt', 'w') as f:
    for prompt in prompts:
        f.write(prompt + '\n')

print(f"✅ Created prompts.txt with {len(prompts)} prompts.")
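
Because `prompts.txt` holds one prompt per line, a prompt that itself contains a line break would be read back as two prompts in Step 5. The sketch below shows one way to guard against that when building a larger list; `normalize_prompts` is a hypothetical helper, and dropping exact duplicates is an added assumption about what is wanted.

```python
from typing import Iterable, List


def normalize_prompts(prompts: Iterable[str]) -> List[str]:
    """Collapse each prompt onto one line and drop blanks and exact duplicates."""
    seen = set()
    cleaned = []
    for prompt in prompts:
        # split() with no arguments splits on any whitespace, including newlines
        one_line = " ".join(prompt.split())
        if one_line and one_line not in seen:
            seen.add(one_line)
            cleaned.append(one_line)
    return cleaned
```

Passing the list through `normalize_prompts` before writing the file keeps the order of first occurrence, so the run order stays predictable.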


## Step 5: Run the Automated Curation Factory


In [None]:
import subprocess
import os

prompts_file = 'prompts.txt'
# This path points to a folder in your Google Drive where the final data will be saved.
output_directory = '/content/drive/MyDrive/housebrain_qwen3_dataset'

# Ensure the output directory exists
os.makedirs(output_directory, exist_ok=True)
print(f"Output directory is ready at: {output_directory}")

with open(prompts_file, 'r') as f:
    prompts = [line.strip() for line in f if line.strip()]

print(f"Found {len(prompts)} prompts to process.")

for i, prompt in enumerate(prompts):
    print(f"""
    =================================================
    Processing prompt {i+1}/{len(prompts)}
    PROMPT: {prompt[:100]}...
    =================================================
    """)
    
    command = [
        'python',
        'scripts/automated_curation.py',
        '--prompt', prompt,
        '--output-dir', output_directory,
        '--model', 'qwen3:30b', # Use Qwen3 for both generation
        '--repair-model', 'qwen3:30b', # and repair
        '--max-retries', '3'
    ]
    
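    # Use Popen to stream the script's output to the Colab console in real time
    with subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True, bufsize=1) as p:
        for line in p.stdout:
            print(line, end='', flush=True)
    # The with-block waits for the process to exit, so returncode is set here
    if p.returncode != 0:
        print(f"⚠️ Curation script exited with code {p.returncode} for prompt {i+1}.")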

print("\n\n🎉 Data Factory run complete! Check your Google Drive for the generated files.")
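
For longer prompt lists it can help to know which prompts failed once the run ends. The sketch below wraps the streaming pattern in a function that returns the exit code, plus a small summary helper. `run_and_stream`, `failed_indices`, and `PYTHON` are illustrative names, and treating a non-zero exit code as failure is an assumption about `scripts/automated_curation.py`.

```python
import subprocess
import sys
from typing import List

# The interpreter running this kernel; using it avoids picking up a different
# `python` from PATH.
PYTHON = sys.executable


def run_and_stream(command: List[str]) -> int:
    """Run a command, echo its combined stdout/stderr line by line, return the exit code."""
    with subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
                          text=True, bufsize=1) as p:
        for line in p.stdout:
            print(line, end="", flush=True)
    # Leaving the with-block waits for the process, so returncode is set
    return p.returncode


def failed_indices(exit_codes: List[int]) -> List[int]:
    """Return 1-based positions of prompts whose command exited non-zero."""
    return [i + 1 for i, code in enumerate(exit_codes) if code != 0]
```

In the loop, appending `run_and_stream(command)` to a list for each prompt and printing `failed_indices(...)` at the end would give a short list of prompts to retry.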
