# Fine-tuning an LLM on WordPress Posts

This document contains the code to fine-tune Gemma on WordPress posts so that it can serve as a conversational chatbot assistant.

**Database: Weaviate**

**Collection: Post**

**Schema**

```
class Post(WeaviateCollection):
    id:str
    postId:str
    postTitle:str
    postExcerpt:str
    postContent:str
    postDate:datetime
    postAuthor:str
    postCategories:typing.Optional[str]
    postTags:typing.Optional[str]
    postUrl:typing.Optional[str]
    postSequence:typing.Optional[int]=1
    isDeleted:typing.Optional[bool]=False

    def get_embedding(self):
        return [0.12345] * 1536
```

## 1.0 Steps to create embeddings

We will generate embeddings externally using Gemma's tokenizer and store them in Weaviate. This not only enables semantic search over the database, it also keeps the vectors precomputed and stored, so they do not have to be regenerated each time a request is made.
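
The text above does not say how per-token model outputs are reduced to one vector per post. The following is a minimal sketch, assuming masked mean pooling (a common choice, not something this notebook confirms). `mean_pool` is a hypothetical helper, and the tiny 3-dimensional vectors stand in for real model outputs.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the vectors of real tokens (mask == 1), skipping padding (mask == 0)."""
    if len(token_vectors) != len(attention_mask):
        raise ValueError("token_vectors and attention_mask must have the same length")
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three token vectors, the last one is padding.
hidden_states = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
post_vector = mean_pool(hidden_states, mask)
print(post_vector)  # [2.0, 3.0, 4.0]
```

The pooled vector is what would be stored alongside the post in Weaviate, so that a query only has to embed the query text.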

1. The following parameters will be included in the embedding:

    - postContent: Main content of the post
    - postTitle: Title of the post
    - postAuthor: Author of the post
    - postDate: Publish date of the post
    - postCategories: Categories that the post belongs to
    - postTags: Tags related to the post

    

2. The system will use the following types of embeddings:

    - **Fine-tuning:** When fine-tuning the model, the embedding contains the post content with some metadata. The text needs to be conversational.
    - **Query:** When querying the LLM, the embedded text differs from the text used for fine-tuning. It contains instructions, the user prompt, and a context, which are combined into a single text and then embedded.
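
The two kinds of text described above can be sketched as two small builder functions. This is only an illustration: the function names, the field labels, and the layout of the combined text are assumptions, and the exact (conversational) wording of the fine-tuning text is left open by this notebook.

```python
from datetime import datetime
from typing import Optional


def build_finetune_text(
    title: str,
    content: str,
    author: str,
    date: datetime,
    categories: Optional[str] = None,
    tags: Optional[str] = None,
) -> str:
    """Combine post content and metadata into one text for the fine-tuning embedding."""
    lines = [
        f"Title: {title}",
        f"Author: {author}",
        f"Published: {date.date().isoformat()}",
    ]
    # Categories and tags are optional in the Post schema, so skip them when empty.
    if categories:
        lines.append(f"Categories: {categories}")
    if tags:
        lines.append(f"Tags: {tags}")
    lines.append(f"Content: {content}")
    return "\n".join(lines)


def build_query_text(instructions: str, user_prompt: str, context: str) -> str:
    """Combine instructions, context, and the user prompt for the query embedding."""
    return f"{instructions}\n\nContext:\n{context}\n\nQuestion: {user_prompt}"


finetune_text = build_finetune_text(
    title="Hello World",
    content="This is the first post.",
    author="admin",
    date=datetime(2024, 1, 15),
    tags="intro",
)
print(finetune_text)
```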




In [1]:
import os
import asyncio

import weaviate
import typing
from weaviate.connect import ConnectionParams
from weaviate.classes.init import AdditionalConfig, Timeout, Auth
from weaviate.classes.config import DataType, Configure, Property
from weaviate.classes.query import Filter
from abc import ABC, abstractmethod
import strawberry
from datetime import datetime

from dotenv import load_dotenv

import requests
from bs4 import BeautifulSoup
import base64
import json
import httpx
import re

In [2]:
# Add HF_HOME and HF_TOKEN to env file
load_dotenv()

True

## 2.0 Weaviate configuration

In [3]:
weaviate_host = "localhost"
weaviate_port = 50050
weaviate_grpc_port = 50051
weaviate_scheme = "http"
weaviate_user = "admin@vip3rtech6069.com"
weaviate_key = "admin123"

weaviate_connection_params = ConnectionParams.from_params(
    http_host=weaviate_host,
    http_port=weaviate_port,
    http_secure=weaviate_scheme == "https",
    grpc_host=weaviate_host,
    grpc_port=weaviate_grpc_port,
    grpc_secure=weaviate_scheme == "https"
)
weaviate_auth_secret = Auth.api_key(weaviate_key)
weaviate_additional_config = AdditionalConfig(
    timeout=Timeout(init=30, query=60, insert=120),  # Values in seconds
)
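
The connection values above are hardcoded in the notebook. Since `load_dotenv()` is already used, one option is to read them from the environment instead. The sketch below is an assumption, not part of the original setup: the `WEAVIATE_*` variable names and the `WeaviateSettings` class are hypothetical, and the defaults mirror the values used above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WeaviateSettings:
    host: str
    port: int
    grpc_port: int
    scheme: str
    api_key: str

    @property
    def secure(self) -> bool:
        # Same rule as the notebook: TLS is used only when the scheme is https.
        return self.scheme == "https"


def load_weaviate_settings() -> WeaviateSettings:
    """Read connection settings from the environment, falling back to local defaults."""
    return WeaviateSettings(
        host=os.getenv("WEAVIATE_HOST", "localhost"),
        port=int(os.getenv("WEAVIATE_PORT", "50050")),
        grpc_port=int(os.getenv("WEAVIATE_GRPC_PORT", "50051")),
        scheme=os.getenv("WEAVIATE_SCHEME", "http"),
        api_key=os.getenv("WEAVIATE_API_KEY", ""),
    )


settings = load_weaviate_settings()
print(settings.host, settings.port, settings.secure)
```

The resulting fields can then be passed to `ConnectionParams.from_params` and `Auth.api_key` in place of the literals.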

In [4]:
async_client = weaviate.WeaviateAsyncClient(
    connection_params=weaviate_connection_params,
    auth_client_secret=weaviate_auth_secret,
    additional_config=weaviate_additional_config,
    skip_init_checks=True
)

await async_client.connect()

In [5]:
await async_client.is_ready()

True

## 3.0 WordPress configuration

In [6]:
class WeaviateCollection(ABC):

    @staticmethod
    @abstractmethod
    def from_dict(data:dict)->typing.Type[typing.Any]:
        pass

    @staticmethod
    @abstractmethod
    def to_dict(data:typing.Any)->dict:
        pass

    @abstractmethod
    def get_embedding(self)->typing.List[float]:
        pass

    @staticmethod
    @abstractmethod
    def get_field_mapping()->dict:
        pass
    

@strawberry.type
class Post(WeaviateCollection):
    id:str
    postId:str
    postTitle:str
    postExcerpt:str
    postContent:str
    postDate:datetime
    postAuthor:str
    postCategories:typing.Optional[str]
    postTags:typing.Optional[str]
    postUrl:typing.Optional[str]
    postSequence:typing.Optional[int]=1
    isDeleted:typing.Optional[bool]=False

    @staticmethod
    def to_dict(post):
        return {
            "id": post.id,
            "postId": post.postId,
            "postTitle": post.postTitle,
            "postExcerpt": post.postExcerpt,
            "postContent": post.postContent,
            "postDate": post.postDate,
            "postAuthor": post.postAuthor,
            "postCategories": post.postCategories,
            "postTags": post.postTags,
            "postUrl": post.postUrl,
            "postSequence": post.postSequence,
            "isDeleted": post.isDeleted
        }
    
    @staticmethod
    def from_dict(data:dict):
        if not data or not isinstance(data, dict):
            return None

        return Post(
            id=data.get("id", ""),
            postId=data.get("postId", ""),
            postTitle=data.get("postTitle", ""),
            postExcerpt=data.get("postExcerpt", ""),
            postContent=data.get("postContent", ""),
            postDate=data.get("postDate", ""),
            postAuthor=data.get("postAuthor", ""),
            postCategories=data.get("postCategories", ""),
            postTags=data.get("postTags", ""),
            postUrl=data.get("postUrl", ""),
            postSequence=data.get("postSequence", 1),
            isDeleted=data.get("isDeleted", False)
        )
    
    @staticmethod
    def get_field_mapping()->dict:
        return {
            'postId': 'id',
            'postTitle': 'title',
            'postExcerpt': 'excerpt',
            'postContent': 'content',
            'postDate': 'date_gmt',
            'postAuthor': 'author',
            'postCategories': 'categories',
            'postTags': 'tags',
            'postUrl': 'link',
            'postSequence': 'sequence',
            'isDeleted': 'isDeleted'
        }
    
    def get_embedding(self):
        return []
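
The field mapping above pairs each `Post` field with the matching key in a WordPress REST API post object. Below is a minimal sketch of applying such a mapping. It assumes the standard REST response shape, in which `title`, `excerpt`, and `content` are wrapped as `{"rendered": "<html>"}`. `map_wp_post` and `strip_html` are hypothetical helpers, the mapping is a subset of the one above, and the sample post is made up.

```python
import html
import re
from typing import Any, Dict

# Subset of Post.get_field_mapping(): Post field -> WordPress REST field.
FIELD_MAPPING = {
    "postId": "id",
    "postTitle": "title",
    "postExcerpt": "excerpt",
    "postContent": "content",
    "postDate": "date_gmt",
    "postAuthor": "author",
    "postUrl": "link",
}


def strip_html(markup: str) -> str:
    """Crude tag stripper for illustration; BeautifulSoup is the more robust choice."""
    text = re.sub(r"<[^>]+>", " ", markup)
    # Unescape entities after removing tags, then collapse whitespace.
    return re.sub(r"\s+", " ", html.unescape(text)).strip()


def map_wp_post(wp_post: Dict[str, Any]) -> Dict[str, Any]:
    """Translate a WordPress REST post object into Post-style fields."""
    mapped = {}
    for post_field, wp_field in FIELD_MAPPING.items():
        value = wp_post.get(wp_field)
        # Rendered fields arrive as {"rendered": "<html>"}; unwrap and clean them.
        if isinstance(value, dict) and "rendered" in value:
            value = strip_html(value["rendered"])
        mapped[post_field] = value
    return mapped


wp_post = {
    "id": 42,
    "title": {"rendered": "Hello &amp; Welcome"},
    "excerpt": {"rendered": "<p>Short intro.</p>"},
    "content": {"rendered": "<p>First post.</p><p>More text.</p>"},
    "date_gmt": "2024-01-15T10:00:00",
    "author": 1,
    "link": "https://example.com/hello",
}
print(map_wp_post(wp_post))
```

Note that `author`, `categories`, and `tags` come back from the REST API as numeric IDs, so resolving them to names would need additional requests.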


## 4.0 LLM Configuration

**Model: Gemma-2-2b**

**Source: [https://huggingface.co/google/gemma-2-2b](https://huggingface.co/google/gemma-2-2b)**

**Source: [Fine-tuning Llama2](https://github.dev/krishnaik06/Finetuning-LLM/blob/main/Fine_tune_Llama_2.ipynb)**

**Quantization: 4bit**

**Tokenizer: Gemma-2-2b**

**Technique: [LoRA](https://huggingface.co/docs/peft/package_reference/lora)**

In [None]:
!pip install accelerate bitsandbytes peft scikit-learn scipy trl transformers

In [None]:
# For fine-tuning, use PyTorch with CUDA. For the backend application, use the build without CUDA to reduce unnecessary package size
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

In [7]:
import torch
torch.cuda.is_available()

True

In [8]:
import gc
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoModel,
    AutoTokenizer,
    BitsAndBytesConfig,
    HfArgumentParser,
    TrainingArguments,
    pipeline,
    logging,
)
from peft import LoraConfig, PeftModel
from trl import SFTTrainer
import json


In [9]:
def get_gpu_details():
    if torch.cuda.is_available():
        print(f"GPU Name: {torch.cuda.get_device_name(0)}")
        print(f"Memory Allocated: {torch.cuda.memory_allocated()} bytes")
        print(f"Memory Cached: {torch.cuda.memory_reserved()} bytes")

get_gpu_details()

GPU Name: NVIDIA GeForce RTX 2060 SUPER
Memory Allocated: 0 bytes
Memory Cached: 0 bytes


In [10]:
def clear_cuda_memory():
    if torch.cuda.is_available():
        print("Clearing GPU memory...")
        gc.collect()  # Drop unreachable Python objects first so their tensors can be freed
        torch.cuda.empty_cache()  # Releases cached blocks that are no longer in use
        torch.cuda.reset_peak_memory_stats()  # Resets memory stats tracking
        torch.cuda.synchronize()  # Ensures all streams are synced (optional)
        print("GPU memory cleared.")
    else:
        print("CUDA is not available. No GPU memory to clear.")

In [None]:
# Define the model and tokenizer
model_name = "google/gemma-2b-it"

# Load the tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Check if CUDA (GPU) is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Move the model to the GPU
model.to(device)

# Create a sample input
text = "The GPU is being tested with transformers!"
inputs = tokenizer(text, return_tensors="pt").to(device)

# Perform a forward pass
with torch.no_grad():
    outputs = model(**inputs)

# Verify that the output tensor is on the GPU
if torch.cuda.is_available():
    # Check the hidden states only; other output fields (such as the key/value cache) may not be plain tensors
    is_on_gpu = outputs.last_hidden_state.device.type == 'cuda'
    if is_on_gpu:
        print("Success: Transformers library is using the GPU.")
    else:
        print("Warning: Transformers library is not using the GPU.")
else:
    print("CUDA is not available. Running on CPU.")

# Optional: Print GPU details
get_gpu_details()


In [None]:
clear_cuda_memory()

In [None]:
# The model that you want to train from the Hugging Face hub
model_name = "google/gemma-2b-it" #"NousResearch/Llama-2-7b-chat-hf"

# Fine-tuned model name
new_model = f"{model_name}-chat-finetune" #"Llama-2-7b-chat-finetune"

################################################################################
# QLoRA parameters
################################################################################

# LoRA attention dimension
lora_r = 64

# Alpha parameter for LoRA scaling
lora_alpha = 16

# Dropout probability for LoRA layers
lora_dropout = 0.1

################################################################################
# bitsandbytes parameters
################################################################################

# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

################################################################################
# TrainingArguments parameters
################################################################################

# Output directory where the model predictions and checkpoints will be stored
output_dir = "./results"

# Number of training epochs
num_train_epochs = 1

# Enable fp16/bf16 training (set bf16 to True with an A100)
fp16 = False
bf16 = False

# Batch size per GPU for training
per_device_train_batch_size = 4

# Batch size per GPU for evaluation
per_device_eval_batch_size = 4

# Number of update steps to accumulate the gradients for
gradient_accumulation_steps = 1

# Enable gradient checkpointing
gradient_checkpointing = True

# Maximum gradient norm (gradient clipping)
max_grad_norm = 0.3

# Initial learning rate (AdamW optimizer)
learning_rate = 2e-4

# Weight decay to apply to all layers except bias/LayerNorm weights
weight_decay = 0.001

# Optimizer to use
optim = "paged_adamw_32bit"

# Learning rate schedule
lr_scheduler_type = "cosine"

# Number of training steps (overrides num_train_epochs)
max_steps = -1

# Ratio of steps for a linear warmup (from 0 to learning rate)
warmup_ratio = 0.03

# Group sequences into batches with same length
# Saves memory and speeds up training considerably
group_by_length = True

# Save checkpoint every X update steps
save_steps = 0

# Log every X update steps
logging_steps = 25

################################################################################
# SFT parameters
################################################################################

# Maximum sequence length to use
max_seq_length = None

# Pack multiple short examples in the same input sequence to increase efficiency
packing = False

# Load the entire model on the GPU 0
device_map = 'cuda' # {"": 0}
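
To make the LoRA settings above concrete, here is a small arithmetic sketch. It assumes the standard LoRA formulation, in which a frozen weight `W` of shape `d x k` receives the update `(lora_alpha / lora_r) * B @ A`, with `B` of shape `d x r` and `A` of shape `r x k`. The 2048 x 2048 matrix size is a hypothetical value chosen only for illustration.

```python
lora_r = 64
lora_alpha = 16

# Scaling factor applied to the low-rank update B @ A.
lora_scaling = lora_alpha / lora_r


def lora_trainable_params(d: int, k: int, r: int) -> int:
    """Number of parameters in the two low-rank factors for one d x k weight matrix."""
    return r * (d + k)


# Hypothetical 2048 x 2048 projection, used only to make the numbers concrete.
d = k = 2048
full = d * k
low_rank = lora_trainable_params(d, k, lora_r)
print(f"scaling={lora_scaling}, full={full}, lora={low_rank}, ratio={low_rank / full:.4f}")
# scaling=0.25, full=4194304, lora=262144, ratio=0.0625
```

With these values the adapter trains about 6% of the parameters of the matrix it is attached to, and the update is scaled down by a factor of 0.25.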

In [None]:
# Load tokenizer and model with QLoRA configuration
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)
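
A rough back-of-the-envelope estimate shows why 4-bit loading matters on a consumer GPU. The sketch below counts weight storage only. The parameter count of about 2.5 billion is an assumption for a "2B"-class model, and the result should be read as a lower bound, since layers that are not quantized, quantization constants, activations, adapters, and optimizer state all add to it.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


# Assumed parameter count; check the model card for the real figure.
assumed_params = 2.5e9
for bits in (16, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gib(assumed_params, bits):.2f} GiB")
```

Under these assumptions the 4-bit weights need a quarter of the memory of the 16-bit weights.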

In [None]:
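# Load base model
# HF_TOKEN is read from the environment (loaded from the env file by load_dotenv above)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map=device_map,
    token=os.getenv("HF_TOKEN"),
)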
model.config.use_cache = False
model.config.pretraining_tp = 1

In [None]:
# Load Gemma tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right" # Fix weird overflow issue with fp16 training
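
The effect of `padding_side = "right"` with the EOS token reused as the pad token can be illustrated without the tokenizer. This is a plain-Python sketch of the idea, not the tokenizer's implementation; `pad_right` and the token ids are hypothetical.

```python
from typing import Dict, List


def pad_right(sequences: List[List[int]], pad_id: int) -> Dict[str, List[List[int]]]:
    """Right-pad token-id sequences to a common length and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(seq + [pad_id] * padding)
        # 1 marks real tokens, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids; 1 plays the role of the EOS id reused for padding.
batch = pad_right([[5, 6, 7], [8]], pad_id=1)
print(batch["input_ids"])       # [[5, 6, 7], [8, 1, 1]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```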

## 5.0 Predict output from model

In [None]:
# Generate output from a raw text prompt using the Hugging Face text-generation pipeline

text = "What is the origin of 'Hello World!'"
encoded_text = tokenizer.encode(text, return_tensors='pt')
print(type(encoded_text))
print(encoded_text.to(model.device))
pipe = pipeline(task='text-generation', model=model, tokenizer=tokenizer, max_length=200)
result = pipe(text)
print(result[0])

In [None]:
# Generate output by constructing prompt from raw text query
text = "What is the origin of 'Hello World!'"
chat = [
    { "role": "user", "content": text },
]
prompt = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)
print(prompt)
inputs = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(input_ids=inputs.to(model.device), max_new_tokens=150)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
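
For reference, the prompt produced by `apply_chat_template` follows Gemma's turn-based format. The function below is a hand-written approximation of that format for illustration only; the tokenizer's own template remains authoritative, and `format_gemma_chat` is a hypothetical name. Because the formatted prompt already starts with the `<bos>` token, the cell above encodes it with `add_special_tokens=False` to avoid adding it twice.

```python
from typing import Dict, List


def format_gemma_chat(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Approximate Gemma's chat format: one <start_of_turn>...<end_of_turn> block per message."""
    parts = ["<bos>"]
    for message in messages:
        # Gemma names the assistant role "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<start_of_turn>{role}\n{message['content'].strip()}<end_of_turn>\n")
    if add_generation_prompt:
        # Open a model turn so that generation continues as the assistant's reply.
        parts.append("<start_of_turn>model\n")
    return "".join(parts)


prompt = format_gemma_chat([{"role": "user", "content": "What is the origin of 'Hello World!'"}])
print(prompt)
```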

## 6.0 Prepare dataset for fine-tuning