# HouseBrain V3: Data Curation & Training with Llama 3

This notebook provides a complete workflow for generating and curating training data with the `llama3` model, then fine-tuning a specialist model for architectural design. It uses our **"Gold Standard" dataset** to teach the model our specific schema and architectural nuances.

**Workflow:**
1.  **Generate & Curate Data (Steps 1-3.5):** Use the interactive testing cells to generate new drafts with `llama3` and manually repair them to perfection.
2.  **Fine-Tune (Steps 4-6):** Once you have a high-quality, curated dataset, use the second half of the notebook to fine-tune a specialist model.

**GPU Requirement:** An A100 or H100 GPU (available on Colab Pro+) is recommended for fine-tuning. A T4 is sufficient for data generation.
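
Before choosing which half of the notebook to run, it can help to confirm which GPU the runtime was assigned. The sketch below is one way to do that. The helper names `detect_gpu_name` and `suitable_for_finetuning` are illustrative and not part of the repository, and the check is a simple name match against the recommendation above, not a memory measurement.

```python
import shutil
import subprocess


def detect_gpu_name():
    """Return the first GPU name reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True, check=False,
    )
    lines = [ln.strip() for ln in result.stdout.splitlines() if ln.strip()]
    return lines[0] if result.returncode == 0 and lines else None


def suitable_for_finetuning(gpu_name):
    """True when the name mentions an A100 or H100 (the GPUs recommended above)."""
    if not gpu_name:
        return False
    return any(tag in gpu_name.upper() for tag in ("A100", "H100"))


gpu = detect_gpu_name()
print(f"Detected GPU: {gpu or 'none'}")
print("Recommended for fine-tuning:", suitable_for_finetuning(gpu))
```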


## Step 1: Environment Setup

This step clones the project repository from GitHub and installs all the necessary Python packages for fine-tuning, including `transformers`, `peft`, `trl`, and `bitsandbytes` for memory-efficient 4-bit training.


In [None]:
# 1a. Provide your GitHub token
# To clone the private repository, you need a GitHub Personal Access Token (PAT)
# with repo access. Create one here: https://github.com/settings/tokens
from getpass import getpass
import os

# Fall back to a placeholder if the token cannot be read interactively
try:
    github_token = getpass('Enter your GitHub token: ')
    os.environ['GITHUB_TOKEN'] = github_token
except Exception:
    print("Could not read token interactively; replace the placeholder below with your token and re-run this cell.")
    os.environ['GITHUB_TOKEN'] = "your_github_token_here"

# 1b. Clone the repository using the token
# Make sure the repository name is correct
!git clone https://{os.environ.get('GITHUB_TOKEN')}@github.com/Vinay-O/HouseBrainLLM.git
%cd HouseBrainLLM

# 1c. Install dependencies
!pip install -q -U transformers datasets accelerate peft trl bitsandbytes


## Step 2: Authenticate with Hugging Face

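To download gated models such as `meta-llama/Meta-Llama-3-8B-Instruct` from Hugging Face, you need to be authenticated, and your account must have been granted access on the model's page.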

1.  Create a Hugging Face account if you don't have one.
2.  Generate an Access Token with "read" permissions here: https://huggingface.co/settings/tokens
3.  Run the cell below and paste your token when prompted.
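
The login cell falls back to a placeholder string when it cannot read a token, and logging in with that placeholder will simply fail. As a minimal sketch, the check below skips the login in that case and otherwise uses `huggingface_hub.login` instead of the CLI. The helper names `token_is_usable` and `login_if_possible` are hypothetical and not part of the repository.

```python
PLACEHOLDER = "your_hf_token_here"  # same placeholder string the login cell falls back to


def token_is_usable(token):
    """Reject missing or blank tokens and the notebook's placeholder value."""
    return bool(token) and token.strip() != "" and token.strip() != PLACEHOLDER


def login_if_possible(token):
    """Log in through huggingface_hub when the token looks real; return True if attempted."""
    if not token_is_usable(token):
        print("No usable Hugging Face token found; skipping login.")
        return False
    # Imported lazily so the check above works even without the package installed
    from huggingface_hub import login
    login(token=token)
    return True
```

In the notebook this could be called as `login_if_possible(os.environ.get("HF_TOKEN"))` after the token prompt.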


In [None]:
from getpass import getpass
import os

# Prompt for Hugging Face token and login
try:
    hf_token = getpass('Enter your Hugging Face token: ')
    os.environ['HF_TOKEN'] = hf_token
except Exception:
    print("Could not read token interactively; replace the placeholder below with your token and re-run this cell.")
    os.environ['HF_TOKEN'] = "your_hf_token_here"

!huggingface-cli login --token $HF_TOKEN


## Step 3: Generate New Drafts with Ollama & Llama 3

This section is for creating new **Gold** or **Platinum** standard examples. It will set up an Ollama server within the Colab environment, download a powerful base model (`llama3`), and use it to generate raw drafts based on expert prompts.

**Workflow:**
1.  Run the cells below to set up the server and download the model.
2.  Use the **Interactive Curation** section (Step 3.5) to generate, analyze, and repair individual drafts.
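
Once the server is running and the model has been pulled, drafts can also be requested directly from Ollama's local REST API. The sketch below posts to the `/api/generate` endpoint with streaming disabled and JSON output requested. It is an illustration only and makes no claim about how `scripts/generate_draft_from_prompt.py` talks to the server; the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"  # default local Ollama endpoint


def build_generate_payload(prompt, model="llama3"):
    """Build a non-streaming request body that asks Ollama for JSON-formatted output."""
    return {"model": model, "prompt": prompt, "stream": False, "format": "json"}


def generate(prompt, model="llama3", timeout=300):
    """POST the prompt to the local Ollama server and return the generated text."""
    body = json.dumps(build_generate_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

Calling `generate("Generate a valid JSON object containing a single key 'status' with the value 'ok'.")` mirrors the health check in the cells below, provided the server is up.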


In [None]:
# Install Ollama
!if ! command -v ollama &> /dev/null; then curl -fsSL https://ollama.com/install.sh | sh; fi

# Start Ollama as a background process
import os
import time
import requests
from IPython import get_ipython

# Set environment variable to bind to all interfaces
os.environ['OLLAMA_HOST'] = '0.0.0.0'

# Start the server as a raw background process
# This is more robust in non-systemd environments like Colab
get_ipython().system_raw('ollama serve > ollama.log 2>&1 &')

# Wait for Ollama to be ready
print("⏳ Waiting for Ollama server to start...")
time.sleep(5) # Initial wait
for i in range(60): # Wait up to 60 seconds
    try:
        # A short timeout keeps a hung connection from blocking the loop
        response = requests.get("http://127.0.0.1:11434", timeout=2)
        if response.status_code == 200:
            print("✅ Ollama server is running!")
            break
    except requests.exceptions.RequestException:
        pass # Keep trying while the server starts up
    time.sleep(1)
else:
    print("❌ Ollama server failed to start. Check the logs for errors.")
    !cat ollama.log

# Download the model for draft generation
!ollama pull llama3


In [None]:
# --- HEALTH CHECK ---
# First, let's run a very simple prompt to confirm the model is loaded and responding.
# This should be very fast.
HEALTH_CHECK_PROMPT = "Generate a valid JSON object containing a single key 'status' with the value 'ok'."
!python scripts/generate_draft_from_prompt.py --model "llama3" --scenario "{HEALTH_CHECK_PROMPT}" --output-file "data/training/health_check_output.json"

print("="*50)
print("✅ Health check prompt sent. Checking for output...")
!cat data/training/health_check_output.json
print("\n" + "="*50)
print("If you see a valid JSON object above, the model is working. You can now proceed to the next cell to generate the full drafts.")


In [None]:
# --- DEBUGGING: VIEW RAW MODEL OUTPUT ---
# The cell above may show a "No such file or directory" error if the model's
# response was not pure JSON. This is expected behavior.
# The script saves the full, raw response to a .raw_error.txt file.
# Let's print the content of that file to see what the model *actually* said.

!cat data/training/health_check_output.json.raw_error.txt


### Step 3.5: Interactive Testing & Curation (Generate -> Analyze -> Repair)

This is the most important part of building a high-quality dataset. Use this section to test a specific prompt, see the model's raw output, and then manually repair it to create a "Gold Standard" file.

**Your Workflow:**
1.  **Modify the `TEST_SCENARIO`** in the cell below to the prompt you want to test.
2.  **Run the cell.** It will generate a draft and save it to `data/training/curation_test_draft.json`.
3.  **Inspect the raw output** printed below the cell. It will likely contain errors or be incomplete.
4.  **Copy the JSON** part of the raw output.
5.  **Paste it into the text cell** at the very bottom of this notebook.
6.  **Manually edit and correct the JSON** until it is a perfect, schema-compliant `HouseOutput` object.
7.  **Save the corrected file** to `data/training/gold_standard/` with a descriptive name.

Repeat this process to build up your Gold Standard dataset.
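
Step 4 of the workflow, copying the JSON part out of the raw output, can be partly automated. The sketch below scans the raw text for the first span that parses as a JSON object, which helps when the model wraps its answer in commentary. The function name `extract_first_json_object` is hypothetical, and the result still needs manual review against the `HouseOutput` schema.

```python
import json


def extract_first_json_object(raw_text):
    """Return the first parseable JSON object embedded in raw_text, or None."""
    decoder = json.JSONDecoder()
    start = raw_text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(raw_text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not a valid object at this brace; try the next one
        start = raw_text.find("{", start + 1)
    return None


raw = 'Sure! Here is the plan: {"doors": null, "note": "draft"} Let me know.'
print(extract_first_json_object(raw))
```

If the function returns `None`, the draft has no complete JSON object and has to be repaired by hand from the raw text.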


In [None]:
# --- 1. Define Your Test Prompt ---
TEST_SCENARIO = "A modern, single-story 3BHK house for a 50x80 feet plot. It must feature an open-plan kitchen and living area, a dedicated home office, and be Vastu-compliant with a North-facing entrance."


# --- 2. Generate the Draft ---
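# This cell expects NEW_PROMPT_TEMPLATE, the schema-aware prompt template with a
# {scenario} placeholder, to be defined already. Run the cell that defines it first.
if "NEW_PROMPT_TEMPLATE" not in globals():
    raise NameError("NEW_PROMPT_TEMPLATE is not defined; run the cell that defines it before this one.")
final_test_prompt = NEW_PROMPT_TEMPLATE.format(scenario=TEST_SCENARIO)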
with open("test_prompt.txt", "w") as f:
    f.write(final_test_prompt)

!python scripts/generate_draft_from_prompt.py --model "llama3" --prompt-file "test_prompt.txt" --output-file "data/training/curation_test_draft.json"


# --- 3. View the Raw Output for Curation ---
print("\n" + "="*80)
print("RAW MODEL OUTPUT (COPY THE JSON FROM HERE TO REPAIR IT):")
print("="*80)
!cat data/training/curation_test_draft.json.raw_error.txt


In [None]:
# ------------------------------------------------------------------
# STEP 4.1: Sanitize Gold Standard Data
# ------------------------------------------------------------------
# Some gold standard JSON files use `null` for list fields (e.g. `"doors": null`)
# while others use an empty list (`"doors": []`). The mismatched types can cause
# the `datasets` library to fail when it infers a schema during loading. This
# script scans all gold standard files and enforces `[]` for consistency.

!python scripts/sanitize_gold_data.py
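
The normalization described above can be sketched as a small recursive function. This is an illustration of the idea, not the contents of `scripts/sanitize_gold_data.py`. The set of list-typed field names is an assumption: only `doors` comes from this notebook, while `windows`, `rooms`, and `name` are made up for the example.

```python
def nulls_to_empty_lists(node, list_fields):
    """Recursively replace None with [] for keys named in list_fields."""
    if isinstance(node, dict):
        return {
            key: [] if (value is None and key in list_fields)
            else nulls_to_empty_lists(value, list_fields)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [nulls_to_empty_lists(item, list_fields) for item in node]
    return node


LIST_FIELDS = {"doors", "windows"}  # hypothetical; use the list-typed fields of your schema
sample = {"rooms": [{"name": "Kitchen", "doors": None, "windows": []}]}
print(nulls_to_empty_lists(sample, LIST_FIELDS))
```

Restricting the replacement to known list fields keeps legitimately nullable scalar fields untouched.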


## Step 4: Prepare Base Training Data

This step runs our preparation script. It will process the 20 raw Gold Standard JSON files (plus any new ones you've generated and perfected) and create a new `gold_standard_finetune_ready` directory containing the data in the simple `{"prompt": "...", "output": "..."}` format required by the training script.
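
Because the later formatting step parses each `output` value with `json.loads`, it is worth checking the prepared records before training. The sketch below validates a single `{"prompt": ..., "output": ...}` record. The function name `validate_record` is hypothetical, and it assumes `output` is stored as a JSON-encoded string.

```python
import json


def validate_record(record):
    """Return a list of problems found in one {"prompt", "output"} record."""
    problems = []
    for key in ("prompt", "output"):
        if not isinstance(record.get(key), str) or not record[key].strip():
            problems.append(f"'{key}' must be a non-empty string")
    if not problems:
        try:
            json.loads(record["output"])
        except json.JSONDecodeError as exc:
            problems.append(f"'output' is not valid JSON: {exc.msg}")
    return problems


print(validate_record({"prompt": "A 2BHK house", "output": "{\"doors\": []}"}))
```

An empty list means the record passed. Any other result names the field to fix before moving on to Step 5.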


In [None]:
!python scripts/prepare_gold_standard_data.py
!echo "✅ Data preparation complete. Number of files in the new directory:"
!ls data/training/gold_standard_finetune_ready | wc -l


## Step 5: Format Data for Fine-Tuning

This step is crucial. The `meta-llama/Meta-Llama-3-8B-Instruct` model requires a specific chat template for instruction fine-tuning. We will load the data prepared in the previous step and reformat it into the required structure, then save it to a new directory for the trainer to use.

**Llama 3 Prompt Template:**
```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{completion}<|eot_id|>
```


In [None]:
import json
from pathlib import Path
from datasets import load_dataset

# Define the Llama 3 prompt template
LLAMA3_TEMPLATE = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{completion}<|eot_id|>"""

# Load the dataset prepared by the previous script
source_dir = "data/training/gold_standard_finetune_ready"
dataset = load_dataset("json", data_files=[str(f) for f in Path(source_dir).glob("*.json")])['train']

def format_for_llama3(entry):
    """Applies the Llama 3 prompt format to a dataset entry."""
    formatted_text = LLAMA3_TEMPLATE.format(
        prompt=entry['prompt'],
        completion=json.dumps(json.loads(entry['output']), indent=2) # Ensure completion is a formatted string
    )
    return {"text": formatted_text}

# Apply the formatting
formatted_dataset = dataset.map(format_for_llama3)

# Save the newly formatted dataset
output_dir = Path("data/training/gold_standard_finetune_llama3_ready")
output_dir.mkdir(parents=True, exist_ok=True)

# Save as a single JSONL file, which is efficient for the trainer
formatted_dataset.to_json(output_dir / "data.jsonl", orient="records", lines=True)

print(f"✅ Successfully formatted and saved dataset for Llama-style fine-tuning at {output_dir}")
print("Example of formatted data:")
print(formatted_dataset[0]['text'])


## Step 6: Run the Fine-Tuning Script

This is the core of the process. We execute the `run_finetuning.py` script, which will:

1.  **Load** our prepared Gold Standard examples.
2.  **Download** the base `meta-llama/Meta-Llama-3-8B-Instruct` model from Hugging Face.
3.  **Configure** 4-bit quantization and LoRA for efficient training.
4.  **Fine-tune** the model on our data.
5.  **Save** the final, specialized `housebrain-llama3-8b-v1.0` model to the `models/` directory.
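
For readers unfamiliar with step 3, the sketch below shows what a 4-bit quantization and LoRA configuration typically looks like with `transformers` and `peft`. Every value here (rank, alpha, dropout, target modules, NF4 quantization) is an assumption chosen for illustration. The actual settings live in `scripts/run_finetuning.py` and may differ.

```python
# Illustrative values only; not taken from scripts/run_finetuning.py
QUANT_KWARGS = dict(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)
LORA_KWARGS = dict(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama attention projections
    task_type="CAUSAL_LM",
)


def build_configs():
    """Create the quantization and LoRA config objects (needs torch, transformers, peft)."""
    import torch
    from transformers import BitsAndBytesConfig
    from peft import LoraConfig

    quant_config = BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.bfloat16, **QUANT_KWARGS)
    lora_config = LoraConfig(**LORA_KWARGS)
    return quant_config, lora_config
```

The quantization config would be passed to `AutoModelForCausalLM.from_pretrained(..., quantization_config=quant_config)`, and the LoRA config to the trainer or to `peft.get_peft_model`.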

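We use a high number of epochs (200 in the command below) because our dataset is very high-quality but small, so the model needs many passes to learn the schema thoroughly. With so few examples this also risks overfitting, so check the model's output on prompts that are not in the training set.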
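
To get a feel for the size of the run, the number of optimizer steps can be estimated from the dataset size, batch size, and epoch count. The sketch below assumes a single GPU and no gradient accumulation unless stated. The figure of 20 examples comes from the Gold Standard file count in Step 4, so adjust it if you have added more.

```python
import math


def total_optimizer_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Optimizer steps for a full run on a single device."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


# 20 examples / batch size 2 = 10 steps per epoch, times 200 epochs
print(total_optimizer_steps(num_examples=20, batch_size=2, epochs=200))  # 2000
```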


In [None]:
!python scripts/run_finetuning.py \
    --dataset-path "data/training/gold_standard_finetune_llama3_ready" \
    --base-model "meta-llama/Meta-Llama-3-8B-Instruct" \
    --output-path "models/housebrain-llama3-8b-v1.0" \
    --epochs 200 \
    --batch-size 2 \
    --learning-rate 2e-5


## Step 7: Next Steps

Once training is complete, the new model is saved in the `models/housebrain-llama3-8b-v1.0` directory. 

You can now use this specialized model in the "Interactive Testing & Curation" section (by changing the model ID) to generate a large, high-quality "Silver Standard" dataset. This is the path to a truly production-ready system.


### PASTE AND REPAIR YOUR JSON HERE

_Double-click this cell to edit. Paste the raw JSON output from the interactive test above and manually correct it until it is a perfect `HouseOutput` object. Once it's perfect, save it as a new file in the `data/training/gold_standard` directory._

```json
{
  "paste_your_raw_json_here": "delete this line and paste the model's output"
}
```


In [None]:
# Optional: the same fine-tuning run with a Qwen2 base model (left disabled).
# !python scripts/run_finetuning.py \
#     --dataset-path "data/training/gold_standard_finetune_ready" \
#     --base-model "Qwen/Qwen2-7B-Instruct" \
#     --output-path "models/housebrain-qwen2-7b-v0.1" \
#     --epochs 200 \
#     --batch-size 2 \
#     --learning-rate 2e-5
