## Retrieval-Augmented Generation ([RAG](https://arxiv.org/pdf/2005.11401))

### IDEA:

- Separate knowledge from intelligence.
- LLMs can be instruction-tuned once and then supplied with new or ever-changing knowledge that may not be present in their training data.
- Large pre-trained language models store factual knowledge in their parameters.
- These models achieve state-of-the-art results when fine-tuned on downstream NLP tasks.
- However, their ability to access and manipulate knowledge is limited, affecting performance on knowledge-intensive tasks.
- Providing provenance for their decisions and updating their world knowledge remain open research challenges.

![rag](./img/rag.jpg)
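
The idea above can be sketched in a few lines without any model at all. The snippet below is a toy illustration, not the method from the paper: the knowledge base, the word-overlap retriever, and the names `retrieve` and `build_prompt` are all made up for this example. It shows that new knowledge can be added by editing the document list alone, while the part that would call the LLM stays unchanged.

```python
import re
from typing import List, Set

# Toy knowledge base (made-up facts); in a real system these are document chunks.
KNOWLEDGE = [
    "The warranty period for product A is two years.",
    "Product B is available in three colours.",
    "Support is reachable on weekdays from 9 to 17.",
]

def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, docs: List[str], k: int = 2) -> List[str]:
    """Rank documents by Jaccard word overlap with the question (toy retriever)."""
    q = tokenize(question)

    def score(doc: str) -> float:
        d = tokenize(doc)
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    return sorted(docs, key=score, reverse=True)[:k]

def build_prompt(question: str, docs: List[str]) -> str:
    """Prepend the retrieved documents to the question."""
    context = "\n".join(f"- {d}" for d in docs)
    return f"Reference information:\n{context}\n\nQuestion: {question}"

question = "How long is the warranty for product A?"
top_docs = retrieve(question, KNOWLEDGE, k=1)
prompt = build_prompt(question, top_docs)
print(prompt)  # this string would be sent to the LLM

# Updating knowledge needs no retraining: just add a document.
KNOWLEDGE.append("Product C was released in 2024.")
print(retrieve("When was product C released?", KNOWLEDGE, k=1))
```

A real system would replace the word-overlap score with dense embeddings and a vector index, as the notebook code below does.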

## Code example

In [1]:
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer
from peft import PeftModel
import torch
from threading import Thread
import sys
from rag_utils import initialize_rag
from typing import List
from langchain.docstore.document import Document

In [2]:
def format_context(retrieved_docs: List[Document]) -> str:
    """Format retrieved documents into a context string."""
    context = "Reference information:\n"
    for doc in retrieved_docs:
        content = doc.page_content
        source = doc.metadata.get("source", "Unknown")
        header = doc.metadata.get("header", "")
        
        context += f"\n--- From {source}"
        if header:
            context += f" ({header})"
        context += f" ---\n{content}\n"
    
    context += "\nBased on the above information, please answer: "
    return context

def generate_response_streaming(prompt: str, model, tokenizer, vector_store):
    """Generate a streaming response using RAG and the fine-tuned model."""
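    if not prompt:
        # This function is a generator, so the greeting must be yielded;
        # a plain `return <value>` would never reach the caller's for-loop.
        yield "Hi, I am an assistant for Candulor GmbH. I can help you with questions about their products. What do you need help with?"
        return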
    
    # Retrieve the k=5 most similar documents from the vector store
    retrieved_docs = vector_store.similarity_search(prompt, k=5)
    
    # Format context
    context = format_context(retrieved_docs)
    
    # Combine context and prompt
    full_prompt = context + prompt
        
    messages = [
        {
            "role": "system", 
            "content": "You are a helpful AI assistant for Candulor GmbH. Answer questions based on the given reference information. If the information provided doesn't contain the answer, say you don't know."
        },
        {"role": "user", "content": full_prompt}
    ]
    
    text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    
    # Create streamer; skip_prompt=True keeps the input prompt out of the streamed output
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    
    # Run generation in separate thread
    generation_kwargs = dict(
        **model_inputs,
        streamer=streamer,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9,
    )
    
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    
    # Yield tokens as they're generated
    for new_text in streamer:
        yield new_text
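
The thread-plus-streamer pattern used in `generate_response_streaming` is a producer/consumer setup: generation runs in a background thread and pushes text pieces into a buffer, while the caller iterates over that buffer. The following sketch reproduces the pattern with the standard library only. `fake_generate` is a hypothetical stand-in for `model.generate`, and the queue plays the role that the streamer plays in the real code; this shows the concept and is not a copy of the library's internals.

```python
import queue
import threading
from typing import Iterator, List

_END = object()  # sentinel that marks the end of the stream

def fake_generate(words: List[str], out_queue: queue.Queue) -> None:
    """Stand-in for model.generate: pushes text pieces into the queue."""
    for word in words:
        out_queue.put(word + " ")
    out_queue.put(_END)

def stream(words: List[str]) -> Iterator[str]:
    """Run the producer in a background thread and yield text as it arrives."""
    out_queue: queue.Queue = queue.Queue()
    worker = threading.Thread(target=fake_generate, args=(words, out_queue))
    worker.start()
    while True:
        item = out_queue.get()  # blocks until the producer emits something
        if item is _END:
            break
        yield item
    worker.join()  # wait for the background thread to finish

for piece in stream(["Retrieval", "augmented", "generation"]):
    print(piece, end="", flush=True)
print()
```

Because the consumer blocks on the queue, the first words can be printed while later ones are still being produced, which is what makes the chatbot feel responsive.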

In [3]:
print("Initializing vector store...")
vector_store = initialize_rag(
    markdown_dir="./dataset/md/",
    faiss_index_file="/data/horse/ws/s9650707-llm_secrets/datasets/unarxive/faiss_index",
    load_index_from_file=True,
    store_index_to_file=False,
    max_files=None
)


Initializing vector store...
Creating Vector Store...
Loading the vector store from file
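
`rag_utils` is not shown in this notebook, so how `initialize_rag` prepares its documents is unknown. Since `format_context` reads `source` and `header` from each document's metadata, one plausible preprocessing step is to split every markdown file at its headings. The sketch below illustrates that idea with plain dictionaries; the function name `split_markdown_by_header` and the sample text are hypothetical and not part of `rag_utils`.

```python
import re
from typing import Dict, List

def _emit(chunks: List[Dict], buffer: List[str], header: str, source: str) -> None:
    """Append the buffered lines as one chunk, skipping empty sections."""
    content = "\n".join(buffer).strip()
    if content:
        chunks.append({
            "page_content": content,
            "metadata": {"source": source, "header": header},
        })

def split_markdown_by_header(text: str, source: str) -> List[Dict]:
    """Split markdown text into one chunk per heading section.

    Simplification: lines starting with '#' inside fenced code blocks
    are also treated as headings.
    """
    chunks: List[Dict] = []
    header = ""
    buffer: List[str] = []
    for line in text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            _emit(chunks, buffer, header, source)  # close the previous section
            header = match.group(1).strip()
            buffer = []
        else:
            buffer.append(line)
    _emit(chunks, buffer, header, source)  # close the last section
    return chunks

sample = "# Products\nWe sell A and B.\n\n## Warranty\nTwo years.\n"
for chunk in split_markdown_by_header(sample, "products.md"):
    print(chunk["metadata"]["header"], "->", chunk["page_content"])
```

Each chunk would then be embedded and stored in the vector index, so that a retrieved passage can be reported together with its file and section.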


In [None]:
def rag(vector_store):

    print("\nLoading model...")
    # Set up device
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    print(f"Using device: {device}")

    # Load the base model and tokenizer from the Hugging Face Hub
    model_path = "./finetuned"  # LoRA adapter directory, used only if the PeftModel line below is enabled
    base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
    tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")

    # Optionally load LoRA weights on top of the base model (disabled here)
    #model = PeftModel.from_pretrained(base_model, model_path)
    model = base_model
    model.to(device)

    print("\nStarting ...")

    # Interactive RAG loop
    while True:
        prompt = input("Ask your question. Type 'quit' to exit.\nYou: ").strip()
        if prompt == "quit":
            break
        if not prompt:
            # Empty input: ask again instead of exiting the loop
            print("Please enter a question.")
            continue
        # Generate and stream response
        print("\nGenerating response...\n")
        for token in generate_response_streaming(prompt, model, tokenizer, vector_store):
            print(token, end="", flush=True)
        print("\n")

rag(vector_store)



Loading model...
Using device: cuda

Starting ...


## Task 1

- Study the code.
- Add your own dataset.
- Use your own local LLM and run it on the HPC system.
- You may not use a fine-tuned model.
- Change the code accordingly.

## Task 2

- What can we do to improve the results?
- Write your own implementation of RAG.
- You can use your own template and your own dataset.
- End goal: build a chatbot that is tailored to one specific purpose.
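
One possible starting point for your own implementation is a small in-memory store that offers the same `similarity_search(query, k)` call the notebook code relies on, so the retrieval step can be developed and tested without FAISS or an embedding model. Everything in this sketch is an assumption for illustration: `SimpleDocument` and `TinyVectorStore` are hypothetical names, the "embedding" is a plain bag-of-words count, and the sample documents are made up.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Iterable, List

@dataclass
class SimpleDocument:
    """Hypothetical stand-in exposing page_content and metadata attributes."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class TinyVectorStore:
    """Brute-force store: compares the query against every document."""

    def __init__(self, docs: Iterable[SimpleDocument]):
        self._docs = list(docs)
        self._vectors = [embed(d.page_content) for d in self._docs]

    def similarity_search(self, query: str, k: int = 5) -> List[SimpleDocument]:
        q = embed(query)
        ranked = sorted(
            zip(self._docs, self._vectors),
            key=lambda pair: cosine(q, pair[1]),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:k]]

store = TinyVectorStore([
    SimpleDocument("Opening hours are 9 to 17 on weekdays.", {"source": "faq.md"}),
    SimpleDocument("Returns are accepted within 30 days.", {"source": "returns.md"}),
])
hits = store.similarity_search("When are returns accepted?", k=1)
print(hits[0].metadata["source"])  # prints: returns.md
```

Swapping `embed` for a sentence-embedding model and the brute-force loop for an index are the natural next steps once the rest of the chatbot works end to end.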