# HouseBrain V2: Fine-Tuning DeepSeek Coder on Google Colab

This notebook provides a complete workflow for fine-tuning the `deepseek-ai/deepseek-coder-6.7b-instruct` model for architectural design. It uses our **"Gold Standard" dataset** to teach the model our specific schema and architectural nuances.

**GPU Requirement:** An A100 GPU (available on Colab Pro+) is recommended for fine-tuning this model.
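
Before going further, it can help to confirm which GPU the runtime actually has. The following is a minimal sketch, assuming the standard `nvidia-smi` tool is on the path (as it normally is on Colab GPU runtimes); the helper name `detect_gpus` is hypothetical and not part of the project.

```python
import shutil
import subprocess


def detect_gpus():
    """Return one 'name, total memory' string per GPU, or [] if none can be queried."""
    if shutil.which("nvidia-smi") is None:
        return []  # NVIDIA driver tools are not installed on this machine
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = detect_gpus()
if not gpus:
    print("No NVIDIA GPU detected. In Colab, change the runtime type to a GPU runtime.")
elif not any("A100" in gpu for gpu in gpus):
    print(f"Detected {gpus}; an A100 is recommended for this notebook.")
else:
    print(f"Detected {gpus}")
```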


## Step 1: Environment Setup

This step clones the project repository from GitHub and installs all the necessary Python packages for fine-tuning, including `transformers`, `peft`, `trl`, and `bitsandbytes` for memory-efficient 4-bit training.


In [None]:
# Step 1: Provide your GitHub token
# To clone the private repository, you need a GitHub Personal Access Token (PAT)
# with repo access. Create one here: https://github.com/settings/tokens
from getpass import getpass
import os

# Use a placeholder if you're not running this interactively
try:
    github_token = getpass('Enter your GitHub token: ')
    os.environ['GITHUB_TOKEN'] = github_token
except Exception:
    print("Could not read token, please paste it directly into the next cell")
    os.environ['GITHUB_TOKEN'] = "your_github_token_here"

# Step 2: Clone the repository using the token
# Make sure the repository name is correct
!git clone https://{os.environ.get('GITHUB_TOKEN')}@github.com/Vinay-O/HouseBrainLLM.git
%cd HouseBrainLLM

# Step 3: Install dependencies
!pip install -q -U transformers datasets accelerate peft trl bitsandbytes


## Step 3: (Optional) Generate New High-Quality Drafts

This section is for creating new **Gold** or **Platinum** standard examples. It sets up an Ollama server inside the Colab environment, pulls the `deepseek-coder:6.7b-instruct` model, and uses it to generate raw drafts from expert prompts.

**Workflow:**
1.  Run the cells below to generate the raw `.json` draft files.
2.  Download the generated files from the Colab file browser (under `data/training/gold_standard/` or `data/training/platinum_standard/`).
3.  Use the AI assistant's "Analyze and Repair" process to perfect the drafts locally.
4.  Upload the final, corrected `.json` files back to the appropriate directory before proceeding to the next step.


In [None]:
# Install Ollama
!if ! command -v ollama &> /dev/null; then curl -fsSL https://ollama.com/install.sh | sh; fi

# Start Ollama as a background process
import os
import time
import requests
from IPython import get_ipython

# Set environment variable to bind to all interfaces
os.environ['OLLAMA_HOST'] = '0.0.0.0'

# Start the server as a raw background process
# This is more robust in non-systemd environments like Colab
get_ipython().system_raw('ollama serve > ollama.log 2>&1 &')

# Wait for Ollama to be ready
print("⏳ Waiting for Ollama server to start...")
time.sleep(5) # Initial wait
for i in range(60): # Wait up to 60 seconds
    try:
        response = requests.get("http://127.0.0.1:11434")
        if response.status_code == 200:
            print("✅ Ollama server is running!")
            break
    except requests.exceptions.ConnectionError:
        pass # Keep trying while the server starts up
    time.sleep(1)
else:
    print("❌ Ollama server failed to start. Check the logs for errors.")
    !cat ollama.log

# Download the model for draft generation
!ollama pull deepseek-coder:6.7b-instruct
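
For orientation, the sketch below shows one way a script could request a draft from the local Ollama server through its REST endpoint `/api/generate`. It is an illustration only: the internals of `scripts/generate_draft_from_prompt.py` are not shown in this notebook and may differ, and the helper names `build_generate_request` and `generate_draft` are hypothetical.

```python
import json
import urllib.request

# Default local Ollama endpoint; matches the port polled in the cell above.
OLLAMA_URL = "http://127.0.0.1:11434/api/generate"


def build_generate_request(model, prompt):
    """Build a non-streaming generation request that asks Ollama for JSON output."""
    payload = {"model": model, "prompt": prompt, "stream": False, "format": "json"}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate_draft(model, prompt, timeout=600):
    """Send the request and return the generated text (requires a running server)."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


# Example call, left commented out because it needs the server started above:
# draft_text = generate_draft("deepseek-coder:6.7b-instruct", "Design a 2BHK house ...")
```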


In [None]:
# Create the Gold and Platinum Standard directories if they don't exist
!mkdir -p data/training/gold_standard data/training/platinum_standard

# --- GENERATE GOLD STANDARD DRAFT #21 ---
GOLD_PROMPT = "Design a luxurious 4BHK G+1 duplex for a 40x60 feet west-facing plot in a gated community in Bangalore. The design must be Vastu-compliant and include a home office on the ground floor, a private family lounge on the first floor, and balconies for every bedroom. The client desires a contemporary architectural style with large windows for ample natural light."
!python scripts/generate_draft_from_prompt.py --model "deepseek-coder:6.7b-instruct" --scenario "{GOLD_PROMPT}" --output-file "data/training/gold_standard/gold_standard_21_draft.json"

print("\n" + "="*50 + "\n")

# --- GENERATE PLATINUM STANDARD DRAFT #01 ---
PLATINUM_PROMPT = "Design a one-of-a-kind, 'biophilic' 3BHK luxury retreat on a 50x80 feet plot overlooking the backwaters of Kerala. The design must seamlessly integrate indoor and outdoor spaces, featuring a central open-to-sky courtyard with a water body, extensive use of natural materials like laterite stone and teak wood, and a cantilevered infinity pool on the first floor. Prioritize sustainability with rainwater harvesting and solar panel provisions. The architectural style should be a modern interpretation of traditional Kerala design."
!python scripts/generate_draft_from_prompt.py --model "deepseek-coder:6.7b-instruct" --scenario "{PLATINUM_PROMPT}" --output-file "data/training/platinum_standard/platinum_standard_01_draft.json"
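
Before downloading the drafts for repair, a quick check of which files already parse as JSON can save a round trip. This is a rough sketch, assuming drafts follow the `*_draft.json` naming used in the commands above; `find_unparseable_drafts` is a hypothetical helper, not a project script.

```python
import json
from pathlib import Path


def find_unparseable_drafts(directory):
    """Return (file name, error message) pairs for *_draft.json files that are not valid JSON."""
    folder = Path(directory)
    if not folder.is_dir():
        return []  # nothing generated in this location yet
    problems = []
    for path in sorted(folder.glob("*_draft.json")):
        try:
            json.loads(path.read_text(encoding="utf-8"))
        except (json.JSONDecodeError, UnicodeDecodeError) as error:
            problems.append((path.name, str(error)))
    return problems


for folder in ("data/training/gold_standard", "data/training/platinum_standard"):
    for name, error in find_unparseable_drafts(folder):
        print(f"{folder}/{name}: {error}")
```

Files reported here are the ones most likely to need attention during the "Analyze and Repair" pass; a file that parses can still be architecturally wrong.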


## Step 4: Prepare the Gold Standard Data

This step runs our preparation script. It will process the 20 raw Gold Standard JSON files (plus any new ones you've generated and perfected) and create a new `gold_standard_finetune_ready` directory containing the data in the simple `{"prompt": "...", "output": "..."}` format required by the training script.


In [None]:
!python scripts/prepare_gold_standard_data.py
print("\n✅ Data preparation complete. Number of files in the new directory:")
!ls -1 data/training/gold_standard_finetune_ready | wc -l
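
Counting files confirms that the script ran, but not that each record has the expected shape. The sketch below checks a single record against the `{"prompt": "...", "output": "..."}` format described above, assuming (as the formatting cell in the next step does) that `output` holds a JSON-encoded string. The function name `record_problems` and the sample records are made up for illustration.

```python
import json


def record_problems(record):
    """Return a list of human-readable problems found in one fine-tuning record."""
    problems = []
    for key in ("prompt", "output"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"'{key}' is missing or is not a non-empty string")
    output = record.get("output")
    if isinstance(output, str) and output.strip():
        try:
            json.loads(output)
        except json.JSONDecodeError as error:
            problems.append(f"'output' is not valid JSON: {error.msg}")
    return problems


# Made-up records, only to exercise the check.
good = {"prompt": "Design a 2BHK house.", "output": "{\"levels\": []}"}
bad = {"prompt": "Design a 2BHK house.", "output": "{levels: []}"}
print(record_problems(good))  # []
print(record_problems(bad))
```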


## Step 5: Prepare Data for DeepSeek Fine-Tuning

The `deepseek-coder` instruct model expects a specific instruction/response prompt layout for fine-tuning. We load the data prepared in the previous step, wrap each example in that layout together with our own system line, and save the result to a new directory for the trainer to use.

**DeepSeek Prompt Template (with the HouseBrain system line used in the cell below):**
```
You are an expert Indian architect AI, utilizing the DeepSeek Coder model. Your task is to generate a complete, valid, and architecturally sound house design in JSON format.
### Instruction:
{prompt}
### Response:
{completion}
```


In [None]:
import json
from pathlib import Path
from datasets import load_dataset

# Define the DeepSeek prompt template
DEEPSEEK_TEMPLATE = """You are an expert Indian architect AI, utilizing the DeepSeek Coder model. Your task is to generate a complete, valid, and architecturally sound house design in JSON format.
### Instruction:
{prompt}
### Response:
{completion}"""

# Load the dataset prepared by the previous script
source_dir = "data/training/gold_standard_finetune_ready"
dataset = load_dataset("json", data_files=[str(f) for f in Path(source_dir).glob("*.json")])['train']

def format_for_deepseek(entry):
    """Applies the DeepSeek prompt format to a dataset entry."""
    formatted_text = DEEPSEEK_TEMPLATE.format(
        prompt=entry['prompt'],
        completion=json.dumps(json.loads(entry['output']), indent=2) # Ensure completion is a formatted string
    )
    return {"text": formatted_text}

# Apply the formatting
formatted_dataset = dataset.map(format_for_deepseek)

# Save the newly formatted dataset
output_dir = Path("data/training/gold_standard_finetune_deepseek_ready")
output_dir.mkdir(parents=True, exist_ok=True)

# Save as a single JSONL file, which is efficient for the trainer
formatted_dataset.to_json(output_dir / "data.jsonl", orient="records", lines=True)

print(f"✅ Successfully formatted and saved dataset for DeepSeek at {output_dir}")
print("Example of formatted data:")
print(formatted_dataset[0]['text'])
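
At inference time the fine-tuned model should see the same layout it was trained on, stopping right after the `### Response:` line so that it writes the completion itself. The sketch below illustrates that relationship; the system line is copied from `DEEPSEEK_TEMPLATE` in the cell above, while the two function names are hypothetical.

```python
# System line copied from DEEPSEEK_TEMPLATE in the formatting cell.
SYSTEM_LINE = (
    "You are an expert Indian architect AI, utilizing the DeepSeek Coder model. "
    "Your task is to generate a complete, valid, and architecturally sound "
    "house design in JSON format."
)


def build_training_text(prompt, completion):
    """Full training example: system line, instruction, and the expected response."""
    return f"{SYSTEM_LINE}\n### Instruction:\n{prompt}\n### Response:\n{completion}"


def build_inference_prompt(prompt):
    """Same layout without the response, so the model generates it."""
    return f"{SYSTEM_LINE}\n### Instruction:\n{prompt}\n### Response:\n"


example_prompt = "Design a 2BHK house for a 30x40 feet plot."
print(build_inference_prompt(example_prompt))
```

Keeping both builders next to each other makes it harder for the training and inference layouts to drift apart.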


## Step 6: Authenticate with Hugging Face

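Hugging Face authentication is required to download gated models such as Llama 3 (used in the optional comparison in Step 9). Logging in before downloading `deepseek-ai/deepseek-coder-6.7b-instruct` is harmless, so you can run this step either way.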

1.  Create a Hugging Face account if you don't have one.
2.  Generate an Access Token with "read" permissions here: https://huggingface.co/settings/tokens
3.  Run the cell below and paste your token when prompted.


In [None]:
from getpass import getpass
import os

# Prompt for Hugging Face token and login
try:
    hf_token = getpass('Enter your Hugging Face token: ')
    os.environ['HF_TOKEN'] = hf_token
except Exception:
    print("Could not read token, please paste it directly into the next cell")
    os.environ['HF_TOKEN'] = "your_hf_token_here"

!huggingface-cli login --token $HF_TOKEN


## Step 7: Run the Fine-Tuning Script

This is the core of the process. We execute the `run_finetuning.py` script, which will:

1.  **Load** our prepared Gold Standard examples.
2.  **Download** the base `deepseek-ai/deepseek-coder-6.7b-instruct` model from Hugging Face.
3.  **Configure** 4-bit quantization and LoRA for efficient training.
4.  **Fine-tune** the model on our data.
5.  **Save** the final, specialized `housebrain-deepseek-coder-6.7b-v0.1` model to the `models/` directory.
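
To give a feel for why 4-bit quantization and LoRA make this run fit on a single GPU, here is a back-of-the-envelope calculation. It is a rough sketch: it counts weight storage only (no activations, optimizer state, or quantization metadata), and the 4096 x 4096 layer shape and LoRA rank 16 are illustrative assumptions, not values read from `run_finetuning.py`.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10^9 bytes) at a given precision."""
    return num_params * bits_per_param / 8 / 1e9


def lora_added_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors, A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


NUM_PARAMS = 6.7e9  # nominal size of the 6.7B model

print(f"16-bit weights: {weight_memory_gb(NUM_PARAMS, 16):.2f} GB")
print(f" 4-bit weights: {weight_memory_gb(NUM_PARAMS, 4):.2f} GB")

# Assumed shapes for illustration only.
full = 4096 * 4096
added = lora_added_params(4096, 4096, rank=16)
print(f"LoRA trains {added:,} of {full:,} parameters per matrix ({added / full:.2%})")
```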

We use a high number of epochs (e.g., 200) because our dataset is high-quality but small; the model needs many passes over it to learn the schema thoroughly.
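
To put the epoch count in perspective, the number of optimizer updates can be estimated from the dataset size, batch size, and epochs. This sketch assumes a single device, that the last partial batch is kept, and no gradient accumulation unless stated; `optimizer_steps` is a hypothetical helper.

```python
import math


def optimizer_steps(num_examples, batch_size, epochs, grad_accum=1):
    """Estimate total optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_examples / batch_size)
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return steps_per_epoch * epochs


# 20 Gold Standard examples, batch size 2, 200 epochs (values from this notebook).
print(optimizer_steps(20, 2, 200))  # 2000
```

With 20 examples, each epoch is only 10 updates, so 200 epochs amounts to about 2,000 updates in total.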


In [None]:
!python scripts/run_finetuning.py \
    --dataset-path "data/training/gold_standard_finetune_deepseek_ready" \
    --base-model "deepseek-ai/deepseek-coder-6.7b-instruct" \
    --output-path "models/housebrain-deepseek-coder-6.7b-v0.1" \
    --epochs 200 \
    --batch-size 2 \
    --learning-rate 2e-5


## Step 8: Next Steps - Using Your Fine-Tuned Model

Once training is complete, the new model is saved in the `models/housebrain-deepseek-coder-6.7b-v0.1` directory. 

You can now use this specialized model in your `generate_validated_silver_data.py` script (by changing the model ID) to generate a much larger dataset of thousands of validated examples, which is the next step toward a production-ready system.
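
When generating data at scale, model responses sometimes wrap the JSON in extra text, so a tolerant extraction step is useful before validation. The following is a minimal sketch of that idea and is not taken from `generate_validated_silver_data.py`; the helper name `extract_json_object` is hypothetical.

```python
import json


def extract_json_object(text):
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it.
            parsed, _ = decoder.raw_decode(text, start)
            return parsed
        except json.JSONDecodeError:
            start = text.find("{", start + 1)
    return None


response = 'Here is the design:\n{"levels": [{"name": "Ground"}]}\nLet me know if you need changes.'
print(extract_json_object(response))
```

Responses for which this returns `None` can be discarded or sent back for regeneration before any schema validation runs.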


## Step 9 (Optional): A/B Test with an Alternative Model (Llama 3)

Now that you have a fine-tuned DeepSeek Coder model, you can run an experiment to compare it against another powerful base model like Llama 3. You can use the original `train_on_colab.ipynb` notebook to fine-tune Llama 3 on the same Gold Standard dataset.

Once both are trained, you will have two specialized models: `housebrain-deepseek-coder-6.7b-v0.1` and `housebrain-llama3-8b-v0.1`. You can then evaluate them head-to-head on a new set of prompts to see which one produces better architectural designs. This data-driven comparison lets us choose the foundation for our production system based on evidence rather than assumption.


In [None]:
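# The same script can also fine-tune other base models. The example below
# uses Qwen2 as a further alternative; uncomment it to run.
# !python scripts/run_finetuning.py \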
#     --dataset-path "data/training/gold_standard_finetune_ready" \
#     --base-model "Qwen/Qwen2-7B-Instruct" \
#     --output-path "models/housebrain-qwen2-7b-v0.1" \
#     --epochs 200 \
#     --batch-size 2 \
#     --learning-rate 2e-5
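
One simple starting point for the head-to-head evaluation is the share of responses that parse as JSON and contain the top-level keys you expect. The sketch below shows that metric only; the key name `levels`, the labels `model_a` and `model_b`, and the sample strings are placeholders invented to exercise the function, not real model outputs or results.

```python
import json


def validity_rate(outputs, required_keys=()):
    """Fraction of outputs that are JSON objects containing every required key."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict) and all(key in parsed for key in required_keys):
            valid += 1
    return valid / len(outputs)


# Placeholder strings standing in for responses collected from each model.
outputs_by_model = {
    "model_a": ['{"levels": []}', "not json"],
    "model_b": ['{"levels": []}', '{"rooms": []}', '{"levels": [1]}'],
}

for name, outputs in outputs_by_model.items():
    print(f"{name}: {validity_rate(outputs, required_keys=('levels',)):.0%} valid")
```

A structural metric like this only filters out broken responses; judging architectural quality still needs schema validation and expert review on top.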
