# HouseBrain: Llama 3 Fine-Tuning 🧠

This notebook fine-tunes the `meta-llama/Meta-Llama-3-8B-Instruct` model on the high-quality architectural dataset generated by the `data_generation_colab.ipynb` notebook.

### Workflow:
1.  **Setup:** Mounts Google Drive, clones the repository, and installs all necessary training libraries.
2.  **Authentication:** Logs into Hugging Face to download the Llama 3 model.
3.  **Data Loading:** Loads the `gold_standard` JSON files from the assembly line's output directory in your Google Drive.
4.  **Data Preparation:** Formats the JSON data into the specific instruction format required for fine-tuning.
5.  **Model Loading:** Loads the Llama 3 model and its tokenizer in 4-bit precision for memory efficiency.
6.  **Training:** Runs the fine-tuning process using the `SFTTrainer` and LoRA.
7.  **Save Adapter:** Saves the resulting trained LoRA adapter to your Google Drive for future use.


## Step 1: Setup Environment


In [None]:
# Mount Google Drive so the notebook can read the dataset and save the trained adapter
from google.colab import drive
drive.mount('/content/drive')


In [None]:
# Securely provide your GitHub token to clone the private repository
from getpass import getpass
import os

# Prompt for the GitHub token
github_token = getpass('Enter your GitHub Personal Access Token (PAT): ')
os.environ['GITHUB_TOKEN'] = github_token

# Clean up any previous clones
!rm -rf HouseBrainLLM

# Clone the repository using the token
# Replace 'Vinay-O/HouseBrainLLM' with your own GitHub username and repository if it's different.
!git clone https://{os.environ.get('GITHUB_TOKEN')}@github.com/Vinay-O/HouseBrainLLM.git
%cd HouseBrainLLM

print("\n✅ Repository cloned successfully.")


In [None]:
# Install necessary Python packages for training
!pip install -q -U transformers peft accelerate bitsandbytes trl datasets

print("✅ Training dependencies installed.")


## Step 2: Authentication & Configuration


In [None]:
# Log in to Hugging Face to download the Llama 3 model
# You'll need a Hugging Face account and a User Access Token with 'read' permissions.
# Get a token here: https://huggingface.co/settings/tokens
from huggingface_hub import notebook_login

notebook_login()


In [None]:
# --- Configuration ---

# The model we want to fine-tune
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

# The path to the dataset generated by the assembly line notebook
# MAKE SURE THIS PATH IS CORRECT
dataset_path = "/content/drive/MyDrive/housebrain_final_dataset/gold_standard"

# Where to save the final trained model adapter
new_adapter_path = "/content/drive/MyDrive/housebrain_llama3_adapter"

print("Configuration is set.")


## Step 3: Load and Prepare the Dataset


In [None]:
import json
from datasets import Dataset
import glob

# Find all the generated JSON files
json_files = glob.glob(f"{dataset_path}/**/*.json", recursive=True)

def format_data_for_training(file_path):
    """Reads a JSON file and formats it into the required Llama 3 instruction format."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        # Extract the original prompt from the 'input' block
        prompt = data.get("input", {}).get("basicDetails", {}).get("prompt", "")
        if not prompt:
            return None
        
        # The full JSON data becomes the 'answer'
        answer = json.dumps(data, indent=2)
        
        # This is the chat format Llama 3 Instruct was trained on, so we match it precisely.
        # The tokenizer prepends <|begin_of_text|> automatically; each turn is closed with <|eot_id|>.
        formatted_text = f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>"
        formatted_text += f"<|start_header_id|>assistant<|end_header_id|>\n\n{answer}<|eot_id|>"
        
        return {"text": formatted_text}
    except Exception as e:
        print(f"Error processing file {file_path}: {e}")
        return None

# Process all files and create a dataset
data_list = [format_data_for_training(f) for f in json_files]
data_list = [item for item in data_list if item is not None]  # Drop files that failed or had no prompt

if not data_list:
    raise ValueError("No valid data found! Please ensure the dataset_path is correct and contains valid JSON files.")

dataset = Dataset.from_list(data_list)

print(f"✅ Successfully loaded and formatted {len(dataset)} examples.")
print("\n--- Example ---\n")
print(dataset[0]['text'])
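
The template used in the cell above can be exercised on an in-memory record before touching any files on Drive. The sketch below is only an illustration: `build_llama3_text` and the sample record are hypothetical names introduced here, and the record merely mimics the `input.basicDetails.prompt` layout that the loader expects.

```python
import json


def build_llama3_text(prompt: str, answer: str) -> str:
    """Wrap one user/assistant exchange in Llama 3 Instruct header tokens."""
    user_turn = f"<|start_header_id|>user<|end_header_id|>\n\n{prompt}<|eot_id|>"
    assistant_turn = f"<|start_header_id|>assistant<|end_header_id|>\n\n{answer}<|eot_id|>"
    return user_turn + assistant_turn


# Hypothetical record shaped like one gold_standard JSON file
sample = {"input": {"basicDetails": {"prompt": "Design a 2-bedroom house."}}}

prompt = sample["input"]["basicDetails"]["prompt"]
text = build_llama3_text(prompt, json.dumps(sample, indent=2))

# Each example should contain exactly two turns, each closed by <|eot_id|>
print(text.count("<|eot_id|>"))
print(text)
```

Checking the turn count on a small sample like this is a cheap way to catch a malformed template before a long training run.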


## Step 4: Load Model and Tokenizer


In [None]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

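# Pick a compute dtype the GPU supports: bfloat16 on Ampere or newer
# (compute capability >= 8), float16 otherwise (e.g. a T4)
use_bf16 = torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 8
compute_dtype = torch.bfloat16 if use_bf16 else torch.float16

# Configure quantization to load the model in 4-bit precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
)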

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token # Set padding token
tokenizer.padding_side = 'right' # Avoid issues with fp16 training

print("✅ Model and tokenizer loaded successfully.")
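
To see why 4-bit loading matters on a Colab GPU, the rough arithmetic below compares weight storage at 16 bits and 4 bits per parameter. It is a back-of-the-envelope sketch: the parameter count of about 8.03 billion is an approximation, and `weight_gib` is a hypothetical helper. Real usage will be higher, since some layers are not quantized and activations, LoRA weights, and optimizer state also need memory.

```python
# Approximate parameter count for Llama 3 8B (an assumption, not read from the model)
PARAMS = 8.03e9


def weight_gib(params: float, bits_per_param: float) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return params * bits_per_param / 8 / 2**30


fp16_gib = weight_gib(PARAMS, 16)
nf4_gib = weight_gib(PARAMS, 4)

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{nf4_gib:.1f} GiB")
```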


## Step 5: Configure LoRA and Training


In [None]:
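from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# LoRA configuration
lora_config = LoraConfig(
    r=16,  # Rank of the low-rank update matrices
    lora_alpha=32,  # Scaling factor; updates are scaled by lora_alpha / r
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Attention projections only
    bias="none",
    task_type="CAUSAL_LM"
)

# Prepare the quantized model for training: freezes the base weights,
# upcasts non-quantized parameters to fp32, and enables gradient checkpointing
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters to the model
model = get_peft_model(model, lora_config)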

print("LoRA configured.")
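
The number of trainable parameters implied by this LoRA configuration can be estimated by hand. The sketch below assumes the published Llama 3 8B dimensions (hidden size 4096, 32 layers, and 8 key/value heads of size 128, so `k_proj` and `v_proj` output 1024 features); those figures are assumptions taken from the model's public config, not from this notebook, and `lora_params` is a hypothetical helper.

```python
# Assumed Llama 3 8B dimensions (from the public model config)
HIDDEN = 4096
KV_DIM = 1024  # 8 key/value heads x 128 head dimension
LAYERS = 32
RANK = 16

# (in_features, out_features) of each targeted projection
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """LoRA adds A (rank x in_features) and B (out_features x rank) per module."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, RANK) for i, o in shapes.values())
total = per_layer * LAYERS

print(f"Trainable LoRA parameters: {total:,}")
```

On the wrapped model itself, `model.print_trainable_parameters()` reports the actual figure, which is the number to trust if the two disagree.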


In [None]:
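import torch
from transformers import TrainingArguments
from trl import SFTTrainer

# Match mixed precision to the GPU: bf16 on Ampere or newer (compute capability >= 8), fp16 otherwise
use_bf16 = torch.cuda.get_device_capability()[0] >= 8

# Training arguments
training_args = TrainingArguments(
    output_dir=new_adapter_path,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # Effective batch size of 4 sequences per optimizer step
    learning_rate=2e-4,
    logging_steps=10,
    num_train_epochs=3, # Adjust as needed
    save_strategy="epoch",
    bf16=use_bf16,
    fp16=not use_bf16,
)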

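# Create the trainer
# The model already carries the LoRA adapters from get_peft_model, so peft_config is not passed again.
# Note: recent TRL releases expect dataset_text_field, max_seq_length, and packing on trl.SFTConfig
# rather than as SFTTrainer keyword arguments; adjust if your installed version rejects them.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=4096, # Adjust based on your VRAM
    args=training_args,
    packing=True,
)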

print("Trainer is ready.")
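
The batch settings above determine how many optimizer steps a run takes, which helps when choosing `logging_steps` or estimating runtime. The sketch below is approximate: `optimizer_steps` is a hypothetical helper, the count of 1000 packed sequences is made up for illustration, and the trainer's own rounding may differ slightly when the dataset size is not a multiple of the effective batch size.

```python
import math


def optimizer_steps(num_sequences: int, per_device_batch: int, grad_accum: int,
                    epochs: int, num_devices: int = 1) -> tuple[int, int]:
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_sequences / effective_batch)
    return effective_batch, steps_per_epoch * epochs


# Hypothetical dataset of 1000 packed sequences with the settings used above
effective_batch, total_steps = optimizer_steps(
    num_sequences=1000, per_device_batch=1, grad_accum=4, epochs=3
)

print(f"Effective batch size: {effective_batch}")
print(f"Approximate optimizer steps: {total_steps}")
```

Because `packing=True` concatenates examples into fixed-length sequences, the relevant count is the number of packed sequences, not the number of JSON files.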


## Step 6: Start Fine-Tuning



In [None]:
print("Starting training...")

trainer.train()

print("✅ Training complete!")


## Step 7: Save the Final Adapter


In [None]:
print(f"Saving final model adapter to {new_adapter_path}")

trainer.save_model(new_adapter_path)

print("🎉 All done! Your fine-tuned adapter is saved in your Google Drive.")
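
After saving, it can be worth confirming that the adapter files actually reached Drive. The check below is a minimal sketch: `adapter_files_present` is a hypothetical helper, and it assumes PEFT's usual output names, `adapter_config.json` plus either `adapter_model.safetensors` or the older `adapter_model.bin`.

```python
from pathlib import Path


def adapter_files_present(adapter_dir: str) -> bool:
    """Return True if the directory holds an adapter config and adapter weights."""
    root = Path(adapter_dir)
    has_config = (root / "adapter_config.json").is_file()
    # Newer PEFT versions write safetensors; older ones wrote a .bin file
    has_weights = any(
        (root / name).is_file()
        for name in ("adapter_model.safetensors", "adapter_model.bin")
    )
    return has_config and has_weights


# Same location as new_adapter_path in the configuration cell
print(adapter_files_present("/content/drive/MyDrive/housebrain_llama3_adapter"))
```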


