# HouseBrain V2: Fine-Tuning DeepSeek-R1-Distill-Llama-8B on Colab

This notebook provides a complete workflow for fine-tuning the `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` model for architectural design. It uses our **"Gold Standard" dataset** to teach the model our specific schema and architectural nuances.

**GPU Requirement:** An A100 GPU (available on Colab Pro+) is recommended for fine-tuning this model.
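
Before going further, it can help to confirm which GPU the runtime was actually assigned. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH (as it normally is on Colab GPU runtimes); `describe_gpu` is a hypothetical helper name and not part of the project.

```python
import shutil
import subprocess


def describe_gpu():
    """Return the 'name, total memory' text reported by nvidia-smi, or None."""
    if shutil.which("nvidia-smi") is None:
        return None  # No NVIDIA tools found, most likely a CPU-only runtime
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None


gpu = describe_gpu()
print(gpu if gpu else "No GPU detected. Switch the Colab runtime type to a GPU.")
```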


## Step 1: Environment Setup

This step clones the project repository from GitHub and installs all the necessary Python packages for fine-tuning, including `transformers`, `peft`, `trl`, and `bitsandbytes` for memory-efficient 4-bit training.
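
After the install cell has run, a quick way to confirm that every package is present is to query the installed versions. This is a small sketch using only the standard library; `installed_versions` is a hypothetical helper, and the package list simply mirrors the `pip install` line in this step.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


required = ["transformers", "datasets", "accelerate", "peft", "trl", "bitsandbytes"]
for name, ver in installed_versions(required).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```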


In [None]:
# Step 1: Provide your GitHub token
# To clone the private repository, you need a GitHub Personal Access Token (PAT)
# with repo access. Create one here: https://github.com/settings/tokens
from getpass import getpass
import os

# Use a placeholder if you're not running this interactively
try:
    github_token = getpass('Enter your GitHub token: ')
    os.environ['GITHUB_TOKEN'] = github_token
except Exception:
    print("Could not read token, please paste it directly into the next cell")
    os.environ['GITHUB_TOKEN'] = "your_github_token_here"

# Step 2: Clone the repository using the token
# Make sure the repository name is correct
!git clone https://{os.environ.get('GITHUB_TOKEN')}@github.com/Vinay-O/HouseBrainLLM.git
%cd HouseBrainLLM

# Step 3: Install dependencies
!pip install -q -U transformers datasets accelerate peft trl bitsandbytes


## Step 2: Authenticate with Hugging Face

To download powerful models from Hugging Face, you need to be authenticated. 

1.  Create a Hugging Face account if you don't have one.
2.  Generate an Access Token with "read" permissions here: https://huggingface.co/settings/tokens
3.  Run the cell below and paste your token when prompted.
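
As an alternative to the CLI, the same login can be done from Python with `huggingface_hub.login`. The sketch below is one possible shape for it; `login_to_hub` is a hypothetical wrapper that only adds an empty-token check before calling the library.

```python
def login_to_hub(token):
    """Log in to the Hugging Face Hub, rejecting empty tokens early."""
    token = (token or "").strip()
    if not token:
        raise ValueError("Hugging Face token is empty")
    # Imported lazily so the check above works even before the package is installed.
    from huggingface_hub import login

    login(token=token)


# Example usage once HF_TOKEN has been set:
# import os
# login_to_hub(os.environ["HF_TOKEN"])
```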


In [None]:
from getpass import getpass
import os

# Prompt for Hugging Face token and login
try:
    hf_token = getpass('Enter your Hugging Face token: ')
    os.environ['HF_TOKEN'] = hf_token
except Exception:
    print("Could not read token, please paste it directly into the next cell")
    os.environ['HF_TOKEN'] = "your_hf_token_here"

!huggingface-cli login --token $HF_TOKEN


## Step 3: (Optional) Generate New Drafts with Ollama

This section is for creating new **Gold** or **Platinum** standard examples. It will set up an Ollama server within the Colab environment, download a powerful base model (`deepseek-r1:8b`), and use it to generate raw drafts based on expert prompts.

**Workflow:**
1.  Run the cells below to generate the raw `.json` draft files.
2.  Download the generated files from the Colab file browser (under `data/training/gold_standard/` or `data/training/platinum_standard/`).
3.  Use the AI assistant's "Analyze and Repair" process to perfect the drafts locally.
4.  Upload the final, corrected `.json` files back to the appropriate directory before proceeding to the next step.
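
For orientation, draft generation ultimately comes down to sending a prompt to the local Ollama server. The sketch below shows a direct call to Ollama's `/api/generate` endpoint using only the standard library. It is an illustration, not the implementation of `scripts/generate_draft_from_prompt.py`; the helper names and the timeout value are assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_request(model, prompt):
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model, prompt, timeout=600):
    """Send the prompt to the local Ollama server and return the raw text reply."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Example (requires a running Ollama server with the model already pulled):
# print(generate("deepseek-r1:8b", "Reply with a JSON object describing one wall."))
```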


In [None]:
# Install Ollama
!if ! command -v ollama &> /dev/null; then curl -fsSL https://ollama.com/install.sh | sh; fi

# Start Ollama as a background process
import os
import time
import requests
from IPython import get_ipython

# Set environment variable to bind to all interfaces
os.environ['OLLAMA_HOST'] = '0.0.0.0'

# Start the server as a raw background process
# This is more robust in non-systemd environments like Colab
get_ipython().system_raw('ollama serve > ollama.log 2>&1 &')

# Wait for Ollama to be ready
print("⏳ Waiting for Ollama server to start...")
time.sleep(5) # Initial wait
for i in range(60): # Wait up to 60 seconds
    try:
        response = requests.get("http://127.0.0.1:11434")
        if response.status_code == 200:
            print("✅ Ollama server is running!")
            break
    except requests.exceptions.ConnectionError:
        pass # Keep trying while the server starts up
    time.sleep(1)
else:
    print("❌ Ollama server failed to start. Check the logs for errors.")
    !cat ollama.log

# Download the model for draft generation
# Note: Ollama may not host this exact model, so we pull a close equivalent for generation.
!ollama pull deepseek-r1:8b


In [None]:
# --- HEALTH CHECK ---
# First, let's run a very simple prompt to confirm the model is loaded and responding.
# This should be very fast.
HEALTH_CHECK_PROMPT = "Generate a valid JSON object for a single wall with id 'test-wall', level_id 'ground_floor', and a simple rectangular polygon."
!python scripts/generate_draft_from_prompt.py --model "deepseek-r1:8b" --scenario "{HEALTH_CHECK_PROMPT}" --output-file "data/training/health_check_output.json"

print("="*50)
print("✅ Health check prompt sent. Checking for output...")
!cat data/training/health_check_output.json
print("\n" + "="*50)
print("If you see a valid JSON object above, the model is working. You can now proceed to the next cell to generate the full drafts.")


In [None]:
# --- DEBUGGING: VIEW RAW MODEL OUTPUT ---
# The cell above may show a "No such file or directory" error if the model's
# response was not pure JSON. This is expected behavior.
# The script saves the full, raw response to a .raw_error.txt file.
# Let's print the content of that file to see what the model *actually* said.

!cat data/training/health_check_output.json.raw_error.txt
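
When the raw response is not pure JSON, the usual cause is extra text around the object: reasoning models such as DeepSeek-R1 emit a `<think>...</think>` block before the answer. The following is a best-effort sketch for recovering the JSON object from such a reply. `extract_json` is a hypothetical helper, not part of the project scripts, and the sample reply is made up for illustration.

```python
import json
import re


def extract_json(raw):
    """Best-effort recovery of a JSON object from a raw model reply."""
    # Drop any <think>...</think> reasoning block that precedes the answer.
    text = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL)
    # Keep everything from the first '{' to the last '}'.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start : end + 1])


raw_reply = (
    "<think>The user wants a {wall} object.</think>\n"
    "Here is the wall:\n"
    '{"id": "test-wall", "level_id": "ground_floor"}'
)
print(extract_json(raw_reply))
```

This only handles a single top-level object; a reply with several objects or with truncated JSON will still raise an error and needs manual repair.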


In [None]:
# This Python block defines the new, schema-aware prompt for data generation.
# By embedding the schema directly in the prompt, we guide the LLM to produce
# output that is already compliant with our professional rendering pipeline.

schema_definition = """
from typing import List, Dict, Union, Literal
from pydantic import BaseModel, Field
from enum import Enum

class RoomType(str, Enum):
    LIVING_ROOM = "living_room"
    DINING_ROOM = "dining_room"
    KITCHEN = "kitchen"
    MASTER_BEDROOM = "master_bedroom"
    BEDROOM = "bedroom"
    BATHROOM = "bathroom"
    HALF_BATH = "half_bath"
    FAMILY_ROOM = "family_room"
    STUDY = "study"
    GARAGE = "garage"
    UTILITY = "utility"
    STORAGE = "storage"
    STAIRWELL = "stairwell"
    CORRIDOR = "corridor"
    ENTRANCE = "entrance"
    COURTYARD = "courtyard"
    VERANDAH = "verandah"
    BALCONY = "balcony"
    PARKING = "parking"
    GARDEN = "garden"
    SITOUT = "sitout"
    POOJA_ROOM = "pooja_room"

class ArchitecturalStyle(str, Enum):
    MODERN_CONTEMPORARY = "Modern Contemporary"
    TRADITIONAL = "Traditional"
    MINIMALIST = "Minimalist"

class Point2D(BaseModel):
    x: float = Field(..., description="X coordinate in feet")
    y: float = Field(..., description="Y coordinate in feet")

class Rectangle(BaseModel):
    x: float = Field(..., description="X coordinate of bottom-left corner in feet")
    y: float = Field(..., description="Y coordinate of bottom-left corner in feet")
    width: float = Field(..., description="Width in feet")
    height: float = Field(..., description="Height in feet")

class Door(BaseModel):
    position: Point2D
    width: float = Field(default=3.0, description="Door width in feet")
    type: Literal["interior", "exterior", "sliding", "pocket"] = "interior"
    room1: str = Field(..., description="First room ID")
    room2: str = Field(..., description="Second room ID")

class Window(BaseModel):
    position: Point2D
    width: float = Field(..., description="Window width in feet")
    height: float = Field(default=4.0, description="Window height in feet")
    type: Literal["fixed", "casement", "sliding", "bay"] = "fixed"
    room_id: str = Field(..., description="Room ID")

class Room(BaseModel):
    id: str = Field(..., description="Unique room identifier (e.g., 'living_room_0')")
    type: RoomType
    bounds: Rectangle
    doors: List[Door] = Field(default_factory=list)
    windows: List[Window] = Field(default_factory=list)

class Stair(BaseModel):
    position: Point2D
    width: float = Field(default=3.5)
    length: float = Field(default=12.0)
    type: Literal["straight", "L_shaped", "U_shaped"] = "straight"
    floor_from: int
    floor_to: int

class Level(BaseModel):
    level_number: int = Field(..., description="Floor level (0 = ground floor)")
    rooms: List[Room]
    stairs: List[Stair] = Field(default_factory=list)
    height: float = Field(default=10.0, description="Floor to ceiling height in feet")

class HouseInput(BaseModel):
    basicDetails: Dict[str, Union[int, float, str]]
    plot: Dict[str, Union[int, float, str, Dict]]
    roomBreakdown: List[Dict[str, Union[str, int, List[str]]]]

class HouseOutput(BaseModel):
    input: HouseInput
    levels: List[Level]
    total_area: float = Field(..., description="Total built area in sqft")
    construction_cost: float = Field(..., description="Estimated construction cost in local currency")
"""

# This is the new prompt template that instructs the model to act as an expert
# and use the provided schema to generate a valid JSON object.
# We build the string piece by piece to avoid linter issues in the notebook.
prompt_header = """You are an expert Indian architect specializing in Vastu-compliant residential design.
Your task is to generate a complete, valid, and architecturally sound JSON object
that strictly adheres to the Pydantic schema provided below.

The JSON object must be a single, complete `HouseOutput` object.
Do not add any text or explanation before or after the JSON object.

**Pydantic Schema Definition:**
```python
"""
prompt_footer = """
```

**User Design Request:**
"{scenario}"

**Your Output (JSON object conforming to HouseOutput schema only):**
"""

NEW_PROMPT_TEMPLATE = prompt_header + schema_definition + prompt_footer


# Create the Platinum and Gold Standard directories if they don't exist
!mkdir -p data/training/platinum_standard
!mkdir -p data/training/gold_standard

# --- GENERATE GOLD STANDARD DRAFT #21 (NEW SCHEMA) ---
GOLD_PROMPT = "Design a luxurious 4BHK G+1 duplex for a 40x60 feet west-facing plot in a gated community in Bangalore. The design must be Vastu-compliant and include a home office on the ground floor, a private family lounge on the first floor, and balconies for every bedroom. The client desires a contemporary architectural style with large windows for ample natural light."
# Format the final prompt
final_gold_prompt = NEW_PROMPT_TEMPLATE.format(scenario=GOLD_PROMPT)
# Write the prompt to a file to avoid command line length issues
with open("gold_prompt.txt", "w") as f:
    f.write(final_gold_prompt)

!python scripts/generate_draft_from_prompt.py --model "deepseek-r1:8b" --prompt-file "gold_prompt.txt" --output-file "data/training/gold_standard/gold_standard_21_draft.json"


print("\n" + "="*50 + "\n")

# --- GENERATE PLATINUM STANDARD DRAFT #01 (NEW SCHEMA) ---
PLATINUM_PROMPT = "Design a one-of-a-kind, 'biophilic' 3BHK luxury retreat on a 50x80 feet plot overlooking the backwaters of Kerala. The design must seamlessly integrate indoor and outdoor spaces, featuring a central open-to-sky courtyard with a water body, extensive use of natural materials like laterite stone and teak wood, and a cantilevered infinity pool on the first floor. Prioritize sustainability with rainwater harvesting and solar panel provisions. The architectural style should be a modern interpretation of traditional Kerala design."
# Format the final prompt
final_platinum_prompt = NEW_PROMPT_TEMPLATE.format(scenario=PLATINUM_PROMPT)
# Write the prompt to a file
with open("platinum_prompt.txt", "w") as f:
    f.write(final_platinum_prompt)

!python scripts/generate_draft_from_prompt.py --model "deepseek-r1:8b" --prompt-file "platinum_prompt.txt" --output-file "data/training/platinum_standard/platinum_standard_01_draft.json"
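
Once a draft has been generated, a quick structural check can flag obvious schema violations before the manual repair pass. The sketch below uses only the standard library and covers a small part of the schema embedded in the prompt: required top-level fields, room types, duplicate room IDs, and bounds. The positive-bounds rule and the helper names are assumptions; full validation would load the draft into the Pydantic models themselves.

```python
ROOM_TYPES = {
    "living_room", "dining_room", "kitchen", "master_bedroom", "bedroom",
    "bathroom", "half_bath", "family_room", "study", "garage", "utility",
    "storage", "stairwell", "corridor", "entrance", "courtyard", "verandah",
    "balcony", "parking", "garden", "sitout", "pooja_room",
}
REQUIRED_TOP_LEVEL = ("input", "levels", "total_area", "construction_cost")


def _positive(value):
    """True if value is a number greater than zero."""
    return isinstance(value, (int, float)) and value > 0


def check_house_output(house):
    """Return a list of human-readable problems found in a parsed draft."""
    problems = [
        f"missing top-level field: {key}"
        for key in REQUIRED_TOP_LEVEL
        if key not in house
    ]
    seen_ids = set()
    for level in house.get("levels") or []:
        for room in level.get("rooms") or []:
            room_id = room.get("id")
            if room_id in seen_ids:
                problems.append(f"duplicate room id: {room_id}")
            seen_ids.add(room_id)
            if room.get("type") not in ROOM_TYPES:
                problems.append(f"unknown room type: {room.get('type')}")
            bounds = room.get("bounds") or {}
            if not (_positive(bounds.get("width")) and _positive(bounds.get("height"))):
                problems.append(f"non-positive bounds for room: {room_id}")
    return problems


# Made-up draft with several deliberate mistakes.
draft = {
    "input": {},
    "levels": [{"level_number": 0, "rooms": [
        {"id": "kitchen_0", "type": "kitchen",
         "bounds": {"x": 0, "y": 0, "width": 10, "height": 12}},
        {"id": "kitchen_0", "type": "home_office",
         "bounds": {"x": 10, "y": 0, "width": 0, "height": 12}},
    ]}],
    "total_area": 120.0,
}
for problem in check_house_output(draft):
    print(problem)
```

Note that `home_office` is rejected because the schema has no such room type; a home office would be expressed as `study`.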



In [None]:
# ------------------------------------------------------------------
# STEP 4.1: (NEW) Sanitize Gold Standard Data
# ------------------------------------------------------------------
# This step fixes a common data inconsistency issue where some JSON files
# might use `null` for list fields (like `doors`: null) while others use
# an empty list (`doors`: []). This mismatch can cause the `datasets`
# library to fail during loading. This script scans all gold standard
# files and enforces `[]` for consistency.

!python scripts/sanitize_gold_data.py
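
The normalization described in the comments above can be sketched as a small recursive function. This is an illustration of the idea, not the contents of `scripts/sanitize_gold_data.py`; the set of list field names is an assumption taken from the list-typed fields in the schema.

```python
# Field names that the schema declares as lists (assumed set).
LIST_FIELDS = {"doors", "windows", "stairs", "rooms", "levels", "roomBreakdown"}


def replace_null_lists(node):
    """Recursively replace null values of known list fields with empty lists."""
    if isinstance(node, dict):
        return {
            key: [] if (value is None and key in LIST_FIELDS) else replace_null_lists(value)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [replace_null_lists(item) for item in node]
    return node


sample = {
    "levels": [
        {"rooms": [{"id": "bedroom_0", "doors": None, "windows": []}], "stairs": None}
    ],
    "total_area": None,  # not a list field, so it is left untouched
}
print(replace_null_lists(sample))
```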


## Step 4: Prepare Base Training Data

This step runs our preparation script. It will process the 20 raw Gold Standard JSON files (plus any new ones you've generated and perfected) and create a new `gold_standard_finetune_ready` directory containing the data in the simple `{"prompt": "...", "output": "..."}` format required by the training script.
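
A prepared record can be sanity-checked before training. The sketch below assumes each record is a dictionary with string `prompt` and `output` fields, where `output` holds a JSON-encoded design; `validate_record` is a hypothetical helper and the sample records are made up.

```python
import json


def validate_record(record):
    """Return a list of problems with one {"prompt", "output"} training record."""
    problems = []
    for key in ("prompt", "output"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' must be a non-empty string")
    if not problems:
        try:
            json.loads(record["output"])
        except json.JSONDecodeError as err:
            problems.append(f"'output' is not valid JSON: {err.msg}")
    return problems


good = {"prompt": "Design a 2BHK house.", "output": '{"levels": []}'}
bad = {"prompt": "Design a 2BHK house.", "output": "{levels: []"}
print(validate_record(good))
print(validate_record(bad))
```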


In [None]:
!python scripts/prepare_gold_standard_data.py
!echo "\n✅ Data preparation complete. Verifying the new directory:"
!ls -1 data/training/gold_standard_finetune_ready | wc -l


## Step 5: Format Data for Fine-Tuning

This step is crucial. `DeepSeek-R1-Distill-Llama-8B` is built on a Llama base model, so this notebook formats each example with a Llama-3-style prompt for instruction fine-tuning. We load the data prepared in the previous step, reformat it into the structure below, and save it to a new directory for the trainer to use.

**Llama 3 Prompt Template:**
```
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{completion}<|eot_id|>
```


In [None]:
import json
from pathlib import Path
from datasets import load_dataset

# Define the Llama 3 prompt template
LLAMA3_TEMPLATE = """<|begin_of_text|><|start_header_id|>user<|end_header_id|>

{prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

{completion}<|eot_id|>"""

# Load the dataset prepared by the previous script
source_dir = "data/training/gold_standard_finetune_ready"
dataset = load_dataset("json", data_files=[str(f) for f in Path(source_dir).glob("*.json")])['train']

def format_for_llama3(entry):
    """Applies the Llama 3 prompt format to a dataset entry."""
    formatted_text = LLAMA3_TEMPLATE.format(
        prompt=entry['prompt'],
        completion=json.dumps(json.loads(entry['output']), indent=2) # Ensure completion is a formatted string
    )
    return {"text": formatted_text}

# Apply the formatting
formatted_dataset = dataset.map(format_for_llama3)

# Save the newly formatted dataset
output_dir = Path("data/training/gold_standard_finetune_llama3_ready")
output_dir.mkdir(parents=True, exist_ok=True)

# Save as a single JSONL file, which is efficient for the trainer
formatted_dataset.to_json(output_dir / "data.jsonl", orient="records", lines=True)

print(f"✅ Successfully formatted and saved dataset for Llama-style fine-tuning at {output_dir}")
print("Example of formatted data:")
print(formatted_dataset[0]['text'])



## Step 6: Run the Fine-Tuning Script

This is the core of the process. We execute the `run_finetuning.py` script, which will:

1.  **Load** our prepared Gold Standard examples.
2.  **Download** the base `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` model from Hugging Face.
3.  **Configure** 4-bit quantization and LoRA for efficient training.
4.  **Fine-tune** the model on our data.
5.  **Save** the final, specialized `housebrain-deepseek-r1-distill-llama-8b-v0.1` model to the `models/` directory.

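We use a high number of epochs (200) because the dataset is high-quality but small, so the model needs many passes over it to learn the schema thoroughly. With so few examples, this many epochs risks memorizing the training set; monitor the training loss and lower `--epochs` if it plateaus early.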
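
To make the "4-bit quantization and LoRA" step more concrete, here is a sketch of how such a configuration is commonly expressed with `transformers` and `peft`. All values (rank, alpha, dropout, target modules, compute dtype) are illustrative assumptions; `scripts/run_finetuning.py` may use different settings.

```python
# Illustrative values only; scripts/run_finetuning.py may use different ones.
QUANT_KWARGS = {
    "load_in_4bit": True,               # store the base model weights in 4-bit
    "bnb_4bit_quant_type": "nf4",       # NormalFloat4 quantization
    "bnb_4bit_use_double_quant": True,  # also quantize the quantization constants
}
LORA_KWARGS = {
    "r": 16,                # rank of the low-rank adapter matrices
    "lora_alpha": 32,       # scaling factor applied to the adapter output
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    # Attention projection layers as named in Hugging Face Llama models.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
}


def build_configs():
    """Create the quantization and LoRA config objects from the settings above."""
    # Imported lazily because these libraries are only needed on the GPU runtime.
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16, **QUANT_KWARGS
    )
    lora_config = LoraConfig(**LORA_KWARGS)
    return quant_config, lora_config
```

The quantization config would typically be passed as `quantization_config` when loading the base model, and the LoRA config handed to the trainer or to `peft.get_peft_model`.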


In [None]:
!python scripts/run_finetuning.py \
    --dataset-path "data/training/gold_standard_finetune_llama3_ready" \
    --base-model "deepseek-ai/DeepSeek-R1-Distill-Llama-8B" \
    --output-path "models/housebrain-deepseek-r1-distill-llama-8b-v0.1" \
    --epochs 200 \
    --batch-size 2 \
    --learning-rate 2e-5
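
To get a feel for what these flags imply, the number of optimizer steps can be estimated with simple arithmetic. The sketch assumes the 20 Gold Standard examples mentioned in Step 4, a single GPU, and no gradient accumulation; adjust the inputs if you have added examples.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs):
    """Estimate total optimizer steps for a run without gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


# 20 examples / batch size 2 = 10 steps per epoch; 10 * 200 epochs = 2000 steps.
print(optimizer_steps(num_examples=20, batch_size=2, epochs=200))
```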


## Step 7: Next Steps

Once training is complete, the new model is saved in the `models/housebrain-deepseek-r1-distill-llama-8b-v0.1` directory. 

You can now use this specialized model in your `generate_validated_silver_data.py` script (by changing the model ID) to generate a large, high-quality dataset of thousands of examples. This is the path to a truly production-ready system.


## Step 8 (Optional): A/B Test with an Alternative Model

Now that you have a fine-tuned DeepSeek-Llama model, you can run an experiment to compare it against the original, non-distilled Llama 3 model. You can use the `train_on_colab.ipynb` notebook to fine-tune Llama 3 on the same Gold Standard dataset.

Once both are trained, you will have two expert models: `housebrain-deepseek-r1-distill-llama-8b-v0.1` and `housebrain-llama3-8b-v0.1`. You can then evaluate them head-to-head on a new set of prompts to see which one produces better architectural designs. This data-driven comparison helps us select the strongest foundation for our production system.
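
One simple metric for such a head-to-head comparison is the fraction of outputs that parse as a JSON object. The sketch below computes it for each model; the output strings are made-up placeholders rather than real model results, and `valid_json_rate` is a hypothetical helper. A fuller evaluation would also validate each object against the schema and review the designs themselves.

```python
import json


def valid_json_rate(outputs):
    """Fraction of raw model outputs that parse as a JSON object."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            if isinstance(json.loads(text), dict):
                valid += 1
        except json.JSONDecodeError:
            pass  # Not parseable, counts as invalid
    return valid / len(outputs)


# Placeholder outputs for the same three prompts (not real results).
results = {
    "housebrain-deepseek-r1-distill-llama-8b-v0.1": ['{"levels": []}', '{"levels": []}', "not json"],
    "housebrain-llama3-8b-v0.1": ['{"levels": []}', "[1, 2]", "oops"],
}
for name, outputs in results.items():
    print(f"{name}: {valid_json_rate(outputs):.0%} valid JSON objects")
```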


In [None]:
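# Example: the same script pointed at a different base model (Qwen2-7B-Instruct).
# Note that it reads the Step 4 data, not the Llama-3-formatted data from Step 5.
# Uncomment to run.
# !python scripts/run_finetuning.py \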
#     --dataset-path "data/training/gold_standard_finetune_ready" \
#     --base-model "Qwen/Qwen2-7B-Instruct" \
#     --output-path "models/housebrain-qwen2-7b-v0.1" \
#     --epochs 200 \
#     --batch-size 2 \
#     --learning-rate 2e-5
