# HouseBrain Model Fine-Tuning on Google Colab

This notebook is a step-by-step guide to fine-tuning the HouseBrain model on Google Colab's GPU resources. Training on Colab is the recommended approach because it avoids local hardware limitations (RAM, GPU, CUDA drivers).


## Step 1: Set Up the Environment

First, we need to connect to a Colab runtime (a T4 GPU is recommended and free). Then, we will clone the project repository from GitHub and install the required Python libraries.
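
Before cloning anything, it can help to confirm that the runtime actually has a GPU attached. The following is a minimal sketch that assumes the standard `nvidia-smi` tool is on the PATH of GPU runtimes; the helper name `list_gpus` is illustrative and not part of the repository.

```python
import shutil
import subprocess


def list_gpus() -> list:
    """Return 'name, memory' strings reported by nvidia-smi, or [] if no GPU is visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # CPU-only runtime: the tool is not installed
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []  # Tool present but no usable driver or device
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


gpus = list_gpus()
if gpus:
    print("GPU(s) detected:", gpus)
else:
    print("No GPU detected. Select a GPU under Runtime > Change runtime type.")
```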


In [None]:
# @title Step 1: Set Up the Environment
# -----------------
# IMPORTANT: PASTE YOUR GITHUB TOKEN HERE
# -----------------
import os
GITHUB_TOKEN = "" # PASTE YOUR GITHUB TOKEN HERE
os.environ['GITHUB_TOKEN'] = GITHUB_TOKEN

# Clone the repository using your token for private access
!git clone https://$GITHUB_TOKEN@github.com/Vinay-O/HouseBrainLLM.git housebrain_v1_1
%cd housebrain_v1_1

# Install the necessary libraries
# We let the libraries automatically install the correct, compatible version of PyTorch
!pip install --upgrade transformers peft trl accelerate datasets bitsandbytes sentencepiece jsonschema pydantic


## Step 2: Authenticate with Hugging Face

To download the base model from Hugging Face, you need to provide an access token. You can get a token from your Hugging Face account settings.

When you run the cell below, a login box will appear. Paste your token there.
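
If you prefer not to paste the token into a prompt on every run, a non-interactive variant is possible. The sketch below assumes the token has been stored in an environment variable named `HF_TOKEN`; the helper names `resolve_hf_token` and `login_from_env` are hypothetical. It relies only on the `token` argument of `huggingface_hub.login`.

```python
import os
from typing import Optional


def resolve_hf_token(env_var: str = "HF_TOKEN") -> Optional[str]:
    """Return the stripped token from the environment, or None if unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


def login_from_env() -> bool:
    """Log in with the environment token; return False if no token is available."""
    token = resolve_hf_token()
    if token is None:
        return False
    # Imported lazily so the helper can be defined even where the library is missing.
    from huggingface_hub import login

    login(token=token)
    return True


# Usage in a notebook cell: fall back to the interactive prompt when no token is set.
# if not login_from_env():
#     from huggingface_hub import login
#     login()
```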


In [None]:
from huggingface_hub import login

login()


## Step 3: Generate High-Quality "Silver Standard" Data

Before training, we will use the Colab GPU to generate a larger, high-quality dataset (this step was written with an A100 in mind; a T4 will be slower). The `generate_silver_standard_data.py` script uses a generate-and-refine loop to create architecturally sound examples automatically. We will generate 100 new examples.

**Note:** This step will take some time as it involves hundreds of LLM calls, but it is crucial for model quality.
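
The internals of `generate_silver_standard_data.py` are not shown in this notebook. As a rough illustration of what a generate-and-refine loop generally looks like, and of why 100 examples can cost several hundred LLM calls, here is a generic sketch. All names (`generate_and_refine`, `fake_generate`, `fake_validate`) are hypothetical, and the toy generator stands in for a real model call.

```python
import json
from typing import Callable, List, Optional


def generate_and_refine(
    prompt: str,
    generate: Callable[[str], str],
    validate: Callable[[str], List[str]],
    max_rounds: int = 3,
) -> Optional[str]:
    """Request a draft, then feed validation errors back until the draft passes."""
    current_prompt = prompt
    for _ in range(max_rounds):
        draft = generate(current_prompt)
        errors = validate(draft)
        if not errors:
            return draft
        # Build a follow-up prompt that shows the model its own mistakes.
        current_prompt = (
            f"{prompt}\n\nYour previous answer:\n{draft}\n\n"
            "Fix these problems:\n- " + "\n- ".join(errors)
        )
    return None  # Give up after max_rounds failed attempts


# Toy stand-ins so the sketch runs without an LLM.
def fake_generate(prompt: str) -> str:
    return '{"rooms": 3}' if "Fix these problems" in prompt else "not json"


def fake_validate(text: str) -> List[str]:
    try:
        json.loads(text)
        return []
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]


result = generate_and_refine("Design a house.", fake_generate, fake_validate)
print(result)
```

In this toy run the first draft fails validation and the second passes, so one example costs two generator calls.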


In [None]:
# Install Ollama in the Colab environment if it's not already present
!if ! command -v ollama &> /dev/null; then curl -fsSL https://ollama.com/install.sh | sh; fi

import os
import subprocess
import time
import requests

# Start the Ollama server as a background process
with open("ollama_server.log", "w") as log_file:
    ollama_process = subprocess.Popen(["ollama", "serve"], stdout=log_file, stderr=subprocess.STDOUT)

print("🚀 Starting Ollama server in the background...")
time.sleep(5) # Give it a moment to initialize

# --- Health Check Loop ---
# Wait for the Ollama server to be ready by polling the API endpoint
max_wait_time = 180  # 3 minutes
start_time = time.time()
server_ready = False
print("... Waiting for Ollama server to become available...")
while time.time() - start_time < max_wait_time:
    try:
        response = requests.get("http://localhost:11434", timeout=5)
        if response.status_code == 200:
            server_ready = True
            print("✅ Ollama server is up and running!")
            break
    except requests.exceptions.RequestException:
        pass  # Server is not accepting connections yet
    time.sleep(5)  # Wait 5 seconds before retrying (also after a non-200 response)
else:
    # A while loop's else branch runs only when the loop ends without `break`, i.e. on timeout
    print("❌ Timed out waiting for Ollama server to start.")
    # server_ready stays False, so the block below prints the server logs instead of proceeding.

# --- Model Download and Verification ---
if server_ready:
    print("\n⏳ Downloading the deepseek-coder model (approx. 4-5 GB)...")
    !ollama pull deepseek-coder:6.7b-instruct
    print("✅ Model download complete.")

    print("\n📋 Verifying installed models...")
    !ollama list
    print("------------------------------------\n")

    print("⏳ Starting the Silver Standard data generation process...")
    !python scripts/generate_silver_standard_data.py --num-examples 100
else:
    print("🔴 Ollama server failed to start. Cannot proceed with data generation.")
    print("📜 Server logs:")
    !cat ollama_server.log



## Step 4: Prepare All Datasets for Fine-Tuning

The fine-tuning script requires the `output` field in our JSON examples to be a string. The `prepare_data_for_finetuning.py` script handles this conversion. We will run it on both our original "Gold" dataset and our newly generated "Silver" dataset.
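
To make the conversion concrete, here is a minimal sketch of turning a structured `output` field into a string. It is an illustration only, not the code in `prepare_data_for_finetuning.py`; the helper name `stringify_output` and the sample fields other than `output` are assumptions.

```python
import json


def stringify_output(example: dict) -> dict:
    """Return a copy of the example whose 'output' field is a JSON string."""
    converted = dict(example)  # Shallow copy so the caller's dict is untouched
    if "output" in converted and not isinstance(converted["output"], str):
        converted["output"] = json.dumps(converted["output"], ensure_ascii=False)
    return converted


# Hypothetical example record; real records may use different fields.
raw_example = {
    "prompt": "Design a two-bedroom house.",
    "output": {"levels": [{"rooms": ["bedroom", "bedroom", "kitchen"]}]},
}

ready = stringify_output(raw_example)
print(type(ready["output"]).__name__)
print(ready["output"])
```

Because the string is plain JSON, `json.loads(ready["output"])` recovers the original structure when the model's answers need to be parsed later.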


In [None]:
# Prepare the Gold Standard dataset
!python scripts/prepare_data_for_finetuning.py \
    --input-dir data/training/gold_standard \
    --output-dir data/training/gold_standard_finetune_ready

# Prepare the newly generated Silver Standard dataset
!python scripts/prepare_data_for_finetuning.py \
    --input-dir data/training/silver_standard \
    --output-dir data/training/silver_standard_finetune_ready


## Step 5: Run the Fine-Tuning Script

Now we are ready to fine-tune the model on our combined dataset. For a Colab Pro+ A100 environment, we can use a larger batch size and sequence length to accelerate training and improve performance.

We will point the training script to both the `gold_standard_finetune_ready` and `silver_standard_finetune_ready` directories.
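
For a rough sense of how long the run below will be, the number of optimizer steps can be estimated from the dataset size. This sketch assumes the 10 Gold Standard examples mentioned in this notebook, that all 100 requested Silver examples were generated and kept, and that the script uses no gradient accumulation; none of these are confirmed by the script itself.

```python
import math

gold_examples = 10     # Gold Standard set size mentioned in this notebook
silver_examples = 100  # Requested via --num-examples 100; assumes every example is kept
batch_size = 4         # --batch_size
epochs = 15            # --epochs

total_examples = gold_examples + silver_examples
steps_per_epoch = math.ceil(total_examples / batch_size)  # Last batch may be partial
total_steps = steps_per_epoch * epochs

print(f"examples: {total_examples}")
print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}")
```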


In [None]:
!python scripts/run_finetuning.py \
    --model_id deepseek-ai/deepseek-coder-6.7b-instruct \
    --dataset_path data/training/gold_standard_finetune_ready data/training/silver_standard_finetune_ready \
    --output_dir models/housebrain-v1.0-silver \
    --epochs 15 \
    --batch_size 4 \
    --learning_rate 0.0002 \
    --use_4bit


## Step 6: (Optional) Download the Trained Model

After training is complete, the new model adapter will be saved in the `models/housebrain-v1.0-silver` directory. You can zip it and download it to your local machine for future use.


In [None]:
!zip -r housebrain-v1.0-silver-adapter.zip models/housebrain-v1.0-silver

from google.colab import files
files.download('housebrain-v1.0-silver-adapter.zip')


## Alternative to Step 5: Fine-Tune on the Gold Standard Data Only

Now we are ready to run the fine-tuning script. We will use 4-bit quantization (`--use_4bit`) to ensure the model fits comfortably within the Colab GPU's memory. The script will train the model on our 10 Gold Standard examples for 10 epochs and save the resulting LoRA adapter to the `models/housebrain-v0.1` directory.
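
A back-of-the-envelope calculation shows why 4-bit quantization matters here. The sketch below counts weight storage only, taking the parameter count from the "6.7b" in the model name; it ignores quantization metadata, activations, LoRA parameters, and optimizer state, so real usage will be higher. For comparison, a T4 has 16 GB of memory.

```python
PARAMS = 6.7e9  # Parameter count implied by the "6.7b" in the model name


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


fp16_gb = weight_memory_gb(PARAMS, 16)
int4_gb = weight_memory_gb(PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.2f} GB")
print(f" 4-bit weights: {int4_gb:.2f} GB")
```

Under these assumptions the 16-bit weights alone would take most of a T4's memory, while the 4-bit weights leave room for activations and the adapter.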


In [None]:
!python scripts/run_finetuning.py \
    --model_id deepseek-ai/deepseek-coder-6.7b-instruct \
    --dataset_path data/training/gold_standard_finetune_ready \
    --output_dir models/housebrain-v0.1 \
    --epochs 10 \
    --batch_size 1 \
    --learning_rate 0.0002 \
    --use_4bit


## Alternative to Step 6: (Optional) Download the Gold-Only Adapter

After training is complete, the new model adapter will be saved in the `models/housebrain-v0.1` directory inside the Colab environment. If you want to save it permanently, you can zip it and download it to your local machine.


In [None]:
!zip -r housebrain-v0.1-adapter.zip models/housebrain-v0.1

from google.colab import files
files.download('housebrain-v0.1-adapter.zip')
