## Retrieval-Augmented Generation ([RAG](https://arxiv.org/pdf/2005.11401))

### IDEA:

- Separate knowledge from intelligence.
- An LLM can be instruction-tuned once and then supplied with new or constantly changing knowledge that may not be present in its training data.
- Large pre-trained language models store factual knowledge in their parameters.
- These models achieve state-of-the-art results when fine-tuned on downstream NLP tasks.
- However, their ability to access and manipulate knowledge is limited, affecting performance on knowledge-intensive tasks.
- Providing provenance for their decisions and updating their world knowledge remain open research challenges.

![rag](./img/rag.jpg)
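
The separation described above can be sketched with a toy example: the knowledge lives in a document store that can be extended at any time, and the model only sees what the retriever places in the prompt. This is a minimal illustration, not the method from the paper. `ToyKnowledgeStore`, `build_prompt`, and the bag-of-words cosine similarity are hypothetical stand-ins for a real embedding model and vector index, and the seminar-room sentence is made up.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyKnowledgeStore:
    """Hypothetical stand-in for a vector store: knowledge lives here, not in the model."""

    def __init__(self) -> None:
        self.docs: list[str] = []

    def add(self, text: str) -> None:
        """Add new knowledge; no model retraining is involved."""
        self.docs.append(text)

    def retrieve(self, query: str, k: int = 2) -> list[tuple[str, float]]:
        """Return the k documents most similar to the query, best first."""
        q = Counter(tokenize(query))
        scored = [(doc, cosine(q, Counter(tokenize(doc)))) for doc in self.docs]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


def build_prompt(query: str, store: ToyKnowledgeStore, k: int = 2) -> str:
    """Put the retrieved text in front of the question; an LLM would receive this string."""
    context = "\n".join(f"- {doc}" for doc, _score in store.retrieve(query, k))
    return f"Reference information:\n{context}\n\nQuestion: {query}"


store = ToyKnowledgeStore()
store.add("SpiNNaker2 is a digital neuromorphic hardware system.")
store.add("Retrieval-augmented generation combines a retriever with a generator.")
print(build_prompt("What is SpiNNaker2?", store, k=1))

# New knowledge (a made-up sentence) becomes retrievable immediately.
store.add("The seminar room for the toy example is B12.")
print(build_prompt("Which seminar room is used?", store, k=1))
```

The second prompt picks up the newly added sentence even though nothing about the "model" changed, which is the point of keeping knowledge outside the parameters.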

## Code example

In [1]:
print("Hello World!")

Hello World!


In [2]:
#print("Loading pipeline...")
#from llama_pipeline import get_llama_pipeline
#from peft import PeftModel
#print("Loading torch...")
#import torch
print("Loading rag_requests...")
#from rag_utils import initialize_rag
from rag_request import rag_request
print("Loading llama_requests...")
from llama_request import llama_request

print("Loading others...")
from threading import Thread
import sys
import os
import shutil
from typing import List
from langchain.docstore.document import Document
print("Loading done...")

Loading rag_requests...
Loading llama_requests...
Loading others...
Loading done...


In [8]:
def format_context(retrieved_docs: List[tuple]) -> str:
    """Format retrieved documents into a context string.

    Each item is a (doc, score) pair; doc is a dict with "page_content" and "metadata" keys.
    """
    context = "Reference information:\n"
    for doc, score in retrieved_docs:
        content = doc["page_content"]
        source = doc["metadata"].get("source", "Unknown")
        header = doc["metadata"].get("header", "")
        
        context += f"\n--- From {source}"
        if header:
            context += f" ({header})"
        context += f" ---\n{content}\n"
    
    context += "\nBased on the above information, please answer: "
    return context

def generate_response(prompt: str):
    """Generate a response using RAG and the model reached via llama_request (streaming is currently disabled)."""
    if not prompt:
        return "Hi I am an assistant for Candulor GmbH. I can help you with questions about their products. What do you need help with?"
    
    # Retrieve the 10 most relevant documents (k=10)
    retrieved_docs = rag_request(prompt, k=10, port=8001)
    
    # Format context
    context = format_context(retrieved_docs)
    
    # Combine context and prompt
    full_prompt = context + prompt
        
    messages = [
        {
            "role": "system", 
            #"content": "You are a helpful AI assistant for Candulor GmbH. Answer questions based on the given reference information. If the information provided doesn't contain the answer, say you don't know."
            "content": """You are a very competent and helpful scholarly AI assistant.
            You are an expert in most scholarly disciplines.
            Your task is to answer questions about scientific topics based on reference information
            from scientific papers that have been uploaded to arXiv.
            You will be given several potentially relevant sections from several papers, listed
            with their file name (doc_00000000.md etc., which you should ignore), and their title.
            If the references contain no useful information, answer 'I don't know, bro!'
            If the question is about medicine, please answer 'I am not a doctor!'
            Here are the reference sections:
            """
        },
        {"role": "user", "content": full_prompt}
    ]
    
    """text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)"""

    #generator = pipeline(model=model, tokenizer=tokenizer, task="text-generation")

    #chat = pipeline(messages, do_sample=True, max_new_tokens=512, temperature=0.7, top_p=0.9)[0] # [0] because we have only one chat

    chat = llama_request(messages, port=8000)
    
    return chat
    
    
    ## Create streamer
    #streamer = TextIteratorStreamer(tokenizer, skip_special_tokens=True, skip_prompt=True)
    
    # Run generation in separate thread
    #generation_kwargs = dict(
    #    **model_inputs,
    #    streamer=streamer,
    #    max_new_tokens=512,
    #    do_sample=True,
    #    temperature=0.7,
    #    top_p=0.9,
    #)
    
    #thread = Thread(target=model.generate, kwargs=generation_kwargs)
    #thread.start()
    
    # Yield tokens as they're generated
    #for new_text in streamer:
    #    yield new_text
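
One way to try this flow without the two services on ports 8000 and 8001 is to separate prompt assembly from the network calls and pass the retriever and generator in as functions. The sketch below assumes the shapes used in the cell above: the retriever returns `(doc, score)` pairs with `page_content` and `metadata`, and the generator takes a list of chat messages. `build_messages`, `answer`, `fake_retrieve`, and `fake_generate` are hypothetical names introduced for illustration only.

```python
from typing import Any, Callable, Dict, List, Tuple

Doc = Dict[str, Any]  # assumed shape: {"page_content": str, "metadata": dict}
Retriever = Callable[[str], List[Tuple[Doc, float]]]
Generator = Callable[[List[Dict[str, str]]], str]


def build_messages(prompt: str, retrieved: List[Tuple[Doc, float]],
                   system_prompt: str) -> List[Dict[str, str]]:
    """Assemble the chat messages from the question and the retrieved documents."""
    parts = ["Reference information:"]
    for doc, _score in retrieved:
        source = doc["metadata"].get("source", "Unknown")
        parts.append(f"--- From {source} ---\n{doc['page_content']}")
    parts.append(f"Based on the above information, please answer: {prompt}")
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "\n".join(parts)},
    ]


def answer(prompt: str, retrieve: Retriever, generate: Generator,
           system_prompt: str = "You are a helpful assistant.") -> str:
    """Run retrieval, build the messages, and hand them to the generator."""
    return generate(build_messages(prompt, retrieve(prompt), system_prompt))


# Stubs that stand in for the retrieval and LLM services during testing.
def fake_retrieve(prompt: str) -> List[Tuple[Doc, float]]:
    doc = {
        "page_content": "SpiNNaker2 is a digital neuromorphic hardware system.",
        "metadata": {"source": "doc_00000000.md"},
    }
    return [(doc, 0.9)]


def fake_generate(messages: List[Dict[str, str]]) -> str:
    return f"(stub) received {len(messages)} messages"


print(answer("What is SpiNNaker2?", fake_retrieve, fake_generate))
```

With the real helpers, the same function could presumably be called as `answer(prompt, lambda p: rag_request(p, k=10, port=8001), lambda m: llama_request(m, port=8000))`, provided their return values match the shapes assumed here.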

In [4]:
#pipeline = get_llama_pipeline()

In [9]:
def rag():
    print("\nStarting ...")

    # make interactive rag
    while True:
        prompt = input("Ask your question. type 'quit' to exit. \nYou: ")
        if prompt == "quit":
            break
        if not prompt:
            # Empty input: ask again instead of exiting
            print("Please enter a question, or type 'quit' to exit.")
            continue
        # Generate the response, print it, and keep the loop running for the next question
        print("\nGenerating response...\n")
        print(generate_response(prompt))
        #for token in generate_response_streaming(prompt, model, tokenizer, vector_store):
        #    print(token, end="", flush=True)
        #print("\n")

rag()


Starting ...


Ask your question. type 'quit' to exit. 
You:  What is SpiNNaker2?



Generating response...

sending request
sending request
SpiNNaker2 is a digital neuromorphic hardware system. It is a processing element (PE) architecture for hybrid digital neuromorphic computing, which is part of the second-generation SpiNNaker system (SpiNNaker 2).


## Task 1

- Study the code.
- Add your own dataset.
- Use your own local LLM and run it on the HPC system.
- You may not use a fine-tuned model.
- Change the code accordingly.
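
For the dataset step, the documents need to end up in the shape that `format_context` reads: `page_content` plus `metadata` with `source` and `header`. The following is a minimal sketch that splits a markdown file into one chunk per heading section. `split_markdown` is a hypothetical helper, and how the retrieval service behind `rag_request` ingests and embeds these chunks is not shown in this notebook, so that part is left open.

```python
import re
from typing import Dict, List

HEADING = re.compile(r"^#{1,6}\s+(.*)$")


def split_markdown(text: str, source: str) -> List[Dict]:
    """Split markdown text into one chunk per heading section."""
    chunks: List[Dict] = []
    header = ""            # text before the first heading gets an empty header
    body: List[str] = []

    def flush() -> None:
        # Store the lines collected so far as one chunk, skipping empty sections.
        content = "\n".join(body).strip()
        if content:
            chunks.append({
                "page_content": content,
                "metadata": {"source": source, "header": header},
            })

    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            header = match.group(1).strip()
            body = []
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Intro\nRAG separates knowledge from the model.\n\n## Setup\nRun the retriever first.\n"
docs = split_markdown(sample, source="notes.md")
for doc in docs:
    print(doc["metadata"], "->", doc["page_content"])
```

This simple splitter treats every line starting with `#` as a heading, so it would also split inside fenced code blocks that contain `#` comments; a real dataset may need a more careful parser.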

## Task 2

- What can we do to improve the pipeline?
- Write your own implementation of RAG.
- You can use your own template and your own dataset.
- End goal: build a chatbot that is tailored to one specific purpose.
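
As one possible starting point for the improvement question: the retrieval score is returned alongside each document but is not used when the context is built. A sketch of a filter that drops weak matches, removes duplicates, and caps the context size is shown below. `select_docs`, `min_score`, and `max_chars` are hypothetical, and the code assumes that a higher score means a better match; if the retriever returns distances instead, the comparison and sort order would need to be reversed.

```python
from typing import Any, Dict, List, Tuple

Doc = Dict[str, Any]  # assumed shape: {"page_content": str, "metadata": dict}


def select_docs(retrieved: List[Tuple[Doc, float]], min_score: float = 0.5,
                max_chars: int = 4000) -> List[Tuple[Doc, float]]:
    """Keep the best-scoring, non-duplicate documents that fit into max_chars."""
    selected: List[Tuple[Doc, float]] = []
    seen = set()
    used = 0
    # Assumption: higher score = more relevant.
    for doc, score in sorted(retrieved, key=lambda pair: pair[1], reverse=True):
        text = doc["page_content"]
        if score < min_score or text in seen:
            continue  # skip weak matches and exact duplicates
        if used + len(text) > max_chars:
            break     # stop once the context budget is exhausted
        selected.append((doc, score))
        seen.add(text)
        used += len(text)
    return selected


candidates = [
    ({"page_content": "alpha", "metadata": {}}, 0.9),
    ({"page_content": "alpha", "metadata": {}}, 0.8),  # duplicate text
    ({"page_content": "beta", "metadata": {}}, 0.3),   # below the threshold
    ({"page_content": "gamma", "metadata": {}}, 0.7),
]
kept = select_docs(candidates)
print([(doc["page_content"], score) for doc, score in kept])
```

The result could then be passed to `format_context` in place of the raw retrieval output.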

In [None]:
print("Test")