# HouseBrain Data Factory 3.0: The Diamond Series

This notebook generates the **Diamond Dataset**, which is meant to teach our fine-tuned model to handle **complex, conflicting, and unconventional** architectural challenges.

**Our Strategy:**
1.  **Use a specialized script** to generate a smaller, more focused list of 2,500 "Diamond-tier" prompts.
2.  **Use our best "Journeyman" model** (fine-tuned on the Platinum dataset) as the generator to create draft plans. Until that model is available, the notebook defaults to a strong base model such as Mixtral.
3.  **Save the validated outputs** to a new `housebrain_diamond_dataset` folder in your Google Drive.
4.  **Fine-tune a second time** on this dataset to elevate our model from a "Journeyman" to a "Master Architect."

## Instructions
1.  **Set Your GitHub PAT**: Have a GitHub Personal Access Token (PAT) with repo access ready; the first cell prompts for it.
2.  **Run All Cells**: The notebook will set up the environment, generate the complex prompts, and begin the data generation process.


In [None]:
# @title ## 1. Setup Environment
# @markdown Mount Google Drive and clone the repository using a secure token.
from google.colab import drive
import os
import getpass
import subprocess

# Mount Google Drive
drive.mount('/content/drive')
print("✅ Google Drive mounted.")

# --- GitHub Setup ---
#@markdown Enter your GitHub Personal Access Token (PAT) with repo access.
GITHUB_TOKEN = getpass.getpass('Enter your GitHub PAT: ')
REPO_URL = f"https://{GITHUB_TOKEN}@github.com/Vinay-O/HouseBrainLLM.git"
REPO_DIR = "/content/HouseBrainLLM"

# Clone the repository, or update it if it is already present.
# Argument lists (instead of shell strings) keep the token out of shell parsing.
if os.path.exists(REPO_DIR):
    print("Repository already exists. Pulling latest changes...")
    subprocess.run(["git", "pull"], cwd=REPO_DIR, check=True)
else:
    print("Cloning repository...")
    subprocess.run(["git", "clone", REPO_URL, REPO_DIR], check=True)

print("✅ Repository is ready.")

# --- Install Dependencies ---
#@markdown Install necessary Python packages.
!pip install -q pydantic GitPython

print("✅ Dependencies installed.")
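
Because the clone URL embeds the PAT, any error message or log line that echoes the URL can expose the token. The following is a minimal sketch of a redaction helper, assuming you want to scrub text before printing it. `redact_token` is a hypothetical name and is not part of the repository.

```python
def redact_token(text: str, token: str, mask: str = "***") -> str:
    """Return text with every occurrence of token replaced by mask."""
    # An empty token would match between every character, so leave text as-is.
    if not token:
        return text
    return text.replace(token, mask)


# Example with a made-up token (not a real credential).
example_url = "https://ghp_example123@github.com/Vinay-O/HouseBrainLLM.git"
print(redact_token(example_url, "ghp_example123"))
```

One way to use it is to wrap the `git` calls in a `try`/`except subprocess.CalledProcessError` and print `redact_token(str(e), GITHUB_TOKEN)` instead of the raw exception.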


In [None]:
# @title ## 2. Configure and Start Ollama Server
# @markdown This cell will download and start the Ollama server, then pull the specified model.
# @markdown **Important:** The Diamond run should ideally use our fine-tuned "Journeyman" model. Until it is available, we continue to use a powerful base model such as Mixtral.

MODEL_NAME = "mixtral:instruct" # @param ["mixtral:instruct", "qwen2:7b", "llama3:8b", "mistral:7b-instruct"]

# Download and start Ollama
!curl -fsSL https://ollama.com/install.sh | sh
import threading
import subprocess
import time

def run_ollama():
    try:
        subprocess.run("ollama serve", shell=True, check=True, capture_output=True, text=True)
    except subprocess.CalledProcessError as e:
        print(f"Ollama server failed: {e.stderr}")

print("🚀 Starting Ollama server in the background...")
ollama_thread = threading.Thread(target=run_ollama)
ollama_thread.daemon = True
ollama_thread.start()

# Wait for the server to be ready
print("⏳ Waiting for Ollama server to initialize...")
time.sleep(10)

# Pull the model
print(f"📦 Pulling model: {MODEL_NAME}. This may take a while...")
try:
    subprocess.run(f"ollama pull {MODEL_NAME}", shell=True, check=True, capture_output=True, text=True)
    print(f"✅ Model {MODEL_NAME} is ready.")
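except subprocess.CalledProcessError as e:
    # Show why the pull failed before falling back to the model's default tag
    print(f"Failed to pull {MODEL_NAME}: {e.stderr.strip()}")
    print("Trying default tag...")
    base_model = MODEL_NAME.split(':')[0]
    subprocess.run(f"ollama pull {base_model}", shell=True, check=True)
    MODEL_NAME = base_model  # later cells must use the tag that was actually pulled
    print(f"✅ Model {base_model} is ready.")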

# Verify Ollama is running
!ollama list
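
The fixed `time.sleep(10)` above can be too short on a slow runtime and needlessly long on a fast one. As an alternative, here is a minimal polling sketch. `wait_until` and `ollama_is_up` are hypothetical helper names, and the URL assumes Ollama is listening on its default port, 11434.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def wait_until(probe: Callable[[], bool], timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Call probe repeatedly until it returns True or timeout seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def ollama_is_up(url: str = "http://localhost:11434") -> bool:
    """Return True if an HTTP GET to url succeeds (assumes the default Ollama port)."""
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused or timed out: the server is not ready yet.
        return False
```

With these helpers, the wait step could become `wait_until(ollama_is_up, timeout=60)`, and the cell could stop early with a clear message when it returns `False`.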


In [None]:
# @title ## 3. Generate Diamond Prompts & Run the Factory
# @markdown This cell first generates 2,500 complex prompts and then runs the assembly line for each one.

import os
from datetime import datetime
import textwrap
import subprocess

# --- 1. Generate Diamond Prompts ---
os.chdir(REPO_DIR)
prompt_script_path = "scripts/generate_diamond_prompts.py"
prompt_output_file = "/content/diamond_prompts.txt"
num_diamond_prompts = 2500

print(f"--- Generating {num_diamond_prompts} Diamond-tier prompts ---")
prompt_command = [
    "python3",
    prompt_script_path,
    "--num-prompts", str(num_diamond_prompts),
    "--output-file", prompt_output_file
]
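# check=True stops the cell if the prompt script fails, so the success
# message below is only printed when the prompts were actually written.
subprocess.run(prompt_command, check=True)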
print(f"✅ Diamond prompts saved to {prompt_output_file}")
print("-" * 50)


# --- 2. Load Prompts ---
print(f"Loading prompts from {prompt_output_file}...")
try:
    with open(prompt_output_file, 'r') as f:
        prompts = [line.strip() for line in f if line.strip()]
    print(f"✅ Successfully loaded {len(prompts)} prompts.")
except FileNotFoundError:
    print(f"ERROR: Prompt file not found at {prompt_output_file}.")
    prompts = []

# --- 3. Run the Data Factory ---
#@markdown Specify the output directory in your Google Drive for the Diamond dataset.
DRIVE_OUTPUT_DIR = "/content/drive/MyDrive/housebrain_diamond_dataset"

if prompts:
    assembly_line_script_path = "scripts/run_complete_assembly_line.py"
    run_timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_dir = os.path.join(DRIVE_OUTPUT_DIR, f"run_{run_timestamp}")
    os.makedirs(output_dir, exist_ok=True)  # create the run folder so the message below is accurate

    print(f"\nOutput directory is ready at: {output_dir}")
    print(f"Found {len(prompts)} prompts to process.")
    print("="*50)

    for i, prompt in enumerate(prompts):
        print(f"Processing prompt {i+1}/{len(prompts)}")
        prompt_short = textwrap.shorten(prompt, width=100, placeholder="...")
        print(f"PROMPT: {prompt_short}")
        print("="*50)

        run_name = f"prompt_{i+1:04d}"
        command = [
            "python3", assembly_line_script_path,
            "--prompt", prompt,
            "--output-dir", output_dir,
            "--run-name", run_name,
            "--model", MODEL_NAME,
            "--max-retries", "5"
        ]
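        # One bad prompt should not abort the whole run, but failures must be visible
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"⚠️ {run_name} exited with status {result.returncode}.")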
        print("\n" + "-"*50 + "\n")

    print("🎉 Diamond Data Factory run complete! Check your Google Drive for the generated files.")
else:
    print("No prompts to process. Please check your configuration.")
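
A run over 2,500 prompts can outlast a Colab session, so it may help to skip prompts that already have output. The sketch below rests on an assumption this notebook does not confirm: that `run_complete_assembly_line.py` leaves a folder named after `--run-name` inside `--output-dir`. `pending_runs` is a hypothetical helper. Resuming also requires reusing the same `output_dir` rather than a fresh timestamped one.

```python
import os
from typing import List, Tuple


def pending_runs(prompts: List[str], output_dir: str) -> List[Tuple[str, str]]:
    """Return (run_name, prompt) pairs whose run folder does not exist yet."""
    pending = []
    for i, prompt in enumerate(prompts):
        run_name = f"prompt_{i+1:04d}"  # same naming scheme as the loop above
        # Assumption: a finished run leaves a folder named after run_name.
        if not os.path.isdir(os.path.join(output_dir, run_name)):
            pending.append((run_name, prompt))
    return pending
```

The main loop could then iterate over `pending_runs(prompts, output_dir)` instead of `enumerate(prompts)`, keeping the original run names stable across restarts.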



In [None]:
# @title ## 4. (Optional) Download Generated Diamond Dataset
# @markdown Run this cell after the data generation is complete to compress and download the entire output folder.

import shutil
import os
from google.colab import files

# Define the source directory in Google Drive and the target zip file path
source_dir = "/content/drive/MyDrive/housebrain_diamond_dataset"
zip_filename = "housebrain_diamond_dataset.zip"
zip_filepath = f"/content/{zip_filename}"

if os.path.exists(source_dir):
    print(f"Compressing '{source_dir}' into '{zip_filepath}'...")
    # make_archive appends ".zip" itself, so pass the path without its extension
    shutil.make_archive(os.path.splitext(zip_filepath)[0], 'zip', source_dir)
    print("✅ Compression complete.")

    # Provide a download link
    print(f"\nDownloading '{zip_filename}'...")
    files.download(zip_filepath)
else:
    print(f"ERROR: The source directory '{source_dir}' was not found. Please ensure the Data Factory ran correctly.")

