# RAG Prototype

A simple prototype of a Retrieval-Augmented Generation (RAG) system that retrieves relevant passages from text documents and supplies them as context when prompting an LLM to generate a response.

* Built on LangChain, using the Gemma 3 (12B) model.
* Requires [Ollama to be set up locally](https://ollama.com/) to run.
* Example documents are articles from the SCP Foundation collaborative writing project.
* Reference material [can be found on YouTube](https://www.youtube.com/watch?v=2TJxpyO3ei4).

## Loading Documents

In [137]:
from bs4 import BeautifulSoup
import markdownify

In [138]:
dir_path = "./datasets/scp_html/"
file_names = ["SCP-1000 - SCP Foundation.html",
             "SCP-1001 - SCP Foundation.html",
             "SCP-1002 - SCP Foundation.html"]
documents = []
for a_file in file_names: 
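    with open(dir_path + a_file, encoding="utf-8") as myfile:
        # Name the parser explicitly so bs4 does not have to guess one (and warn about it).
        soup = BeautifulSoup(myfile, "html.parser")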
    content = soup.find(id="main-content")
    # Pass the element's HTML: .text strips the tags, leaving markdownify nothing to convert.
    content_md = markdownify.markdownify(str(content))
    documents.append([a_file,content_md])
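
To illustrate why the loader keeps only the element with `id="main-content"`, here is a minimal, dependency-free sketch that uses the standard library's `html.parser` instead of BeautifulSoup. The class name `ElementTextExtractor` and the sample page are hypothetical, and the sketch assumes well-formed HTML. It is not how BeautifulSoup works internally; it only shows the effect of the selection, which is that navigation and footer text never reach the embeddings.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementTextExtractor(HTMLParser):
    """Collect the text inside the first element whose id equals target_id."""

    def __init__(self, target_id: str):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # nesting depth inside the target element (0 = outside)
        self.done = False    # set once the target element has been closed
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.done and dict(attrs).get("id") == self.target_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


# Hypothetical page: only the middle <div> should survive.
page = (
    '<html><body><div id="header">Site navigation</div>'
    '<div id="main-content"><h1>Item #</h1><p>Some <b>bold</b> text.<br>More.</p></div>'
    '<div id="footer">Footer links</div></body></html>'
)
extractor = ElementTextExtractor("main-content")
extractor.feed(page)
main_text = "".join(extractor.parts)
print(main_text)
```

Note that in the loader itself `soup.find(...)` returns `None` when a page has no such element, so a real pipeline would probably want to check for that before converting.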

## Converting to Embeddings

In [154]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_community.embeddings.ollama import OllamaEmbeddings
from langchain_community.llms.ollama import Ollama
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import ChatPromptTemplate

### Chunking

Since we'll be creating our RAG with LangChain, we first wrap each text document in a LangChain `Document`, then split the documents into overlapping chunks and label every chunk with a per-document index.

In [147]:
CHROMA_PATH = "chroma"
PROMPT_TEMPLATE = """
Answer the question based on only the following context:
{context}

---
Answer the question based on the above context: {question}
"""

In [140]:
documents_lang = [Document(page_content=text,metadata={"name":name}) for name, text in documents]

In [141]:
def split_document(documents: list[Document]):
    # Split each Document into chunks of up to 800 characters, with an 80-character overlap.
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=800,
        chunk_overlap=80,
        length_function=len,
        is_separator_regex=False,
    )
    return text_splitter.split_documents(documents)

def add_chunk_index(chunks):
    # Tag each chunk with "<document name>:<index>"; assumes chunks of one document are adjacent.
    last_doc_name = None
    current_chunk_idx = 0
    for chunk in chunks:
        name = chunk.metadata.get("name")
        if last_doc_name == name:
            current_chunk_idx += 1
        else:
            current_chunk_idx = 0
        chunk.metadata["id"] = f'{name}:{current_chunk_idx}'
        last_doc_name = name
    return chunks

        
chunks = split_document(documents_lang)
chunks = add_chunk_index(chunks)
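
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch that cuts a string into fixed windows. The function `sliding_window_chunks` is a hypothetical stand-in, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter prefers to break on separators such as paragraph breaks, newlines and spaces, so its chunks are usually shorter than the limit and the shared region is not always exactly 80 characters.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Each window starts chunk_size - chunk_overlap characters after the
    previous one, so neighbouring chunks share chunk_overlap characters.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for piece in pieces:
    print(piece)
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.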

### Adding Embeddings to Database

In [142]:
def get_embedding_function():
    embeddings = OllamaEmbeddings(model="nomic-embed-text")
    return embeddings

In [144]:
def add_to_chroma(chunks: list[Document]):
    # Load the existing database.
    db = Chroma(
        persist_directory=CHROMA_PATH, embedding_function=get_embedding_function()
    )

    # Add or Update the documents.
    existing_items = db.get(include=[])  # IDs are always included by default
    existing_ids = set(existing_items["ids"])
    print(f"Number of existing documents in DB: {len(existing_ids)}")

    # Only add documents that don't exist in the DB.
    new_chunks = []
    for chunk in chunks:
        if chunk.metadata["id"] not in existing_ids:
            new_chunks.append(chunk)

    if new_chunks:
        print(f"👉 Adding new documents: {len(new_chunks)}")
        new_chunk_ids = [chunk.metadata["id"] for chunk in new_chunks]
        db.add_documents(new_chunks, ids=new_chunk_ids)
    else:
        print("✅ No new documents to add")

add_to_chroma(chunks)

{'ids': ['SCP-1000 - SCP Foundation.html:0', 'SCP-1000 - SCP Foundation.html:1', 'SCP-1000 - SCP Foundation.html:2', 'SCP-1000 - SCP Foundation.html:3', 'SCP-1000 - SCP Foundation.html:4', 'SCP-1000 - SCP Foundation.html:5', 'SCP-1000 - SCP Foundation.html:6', 'SCP-1000 - SCP Foundation.html:7', 'SCP-1000 - SCP Foundation.html:8', 'SCP-1000 - SCP Foundation.html:9', 'SCP-1000 - SCP Foundation.html:10', 'SCP-1000 - SCP Foundation.html:11', 'SCP-1000 - SCP Foundation.html:12', 'SCP-1000 - SCP Foundation.html:13', 'SCP-1000 - SCP Foundation.html:14', 'SCP-1000 - SCP Foundation.html:15', 'SCP-1000 - SCP Foundation.html:16', 'SCP-1000 - SCP Foundation.html:17', 'SCP-1000 - SCP Foundation.html:18', 'SCP-1000 - SCP Foundation.html:19', 'SCP-1000 - SCP Foundation.html:20', 'SCP-1000 - SCP Foundation.html:21', 'SCP-1000 - SCP Foundation.html:22', 'SCP-1000 - SCP Foundation.html:23', 'SCP-1000 - SCP Foundation.html:24', 'SCP-1001 - SCP Foundation.html:0', 'SCP-1001 - SCP Foundation.html:1', 'SCP
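
The ID check in `add_to_chroma` is what makes re-running the cell safe. The sketch below reproduces that logic with a plain dictionary as a hypothetical stand-in for the persisted Chroma collection; `add_new_chunks` and the sample IDs are made up for illustration.

```python
def add_new_chunks(store: dict[str, str], chunks: list[tuple[str, str]]) -> int:
    """Insert (chunk_id, text) pairs whose id is not yet in store; return how many were added."""
    existing_ids = set(store)
    new_chunks = [(cid, text) for cid, text in chunks if cid not in existing_ids]
    for cid, text in new_chunks:
        store[cid] = text
    return len(new_chunks)


store: dict[str, str] = {}  # hypothetical stand-in for the Chroma collection
demo_chunks = [
    ("a.html:0", "first chunk"),
    ("a.html:1", "second chunk"),
    ("b.html:0", "other document"),
]

first_run = add_new_chunks(store, demo_chunks)    # everything is new
second_run = add_new_chunks(store, demo_chunks)   # nothing is new
third_run = add_new_chunks(store, [("a.html:0", "first chunk, now edited")])
print(first_run, second_run, third_run)  # 3 0 0
print(store["a.html:0"])                 # first chunk
```

The third call shows a limitation of comparing IDs only: if a source document is edited but its chunk ID stays the same, the stored text is not refreshed. Clearing the database directory or including a content hash in the ID would be possible ways around that.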

### Query Testing

In [156]:
def query_rag(query_text: str):
    # Prepare the DB.
    embedding_function = get_embedding_function()
    db = Chroma(persist_directory=CHROMA_PATH, embedding_function=embedding_function)

    # Search the DB.
    results = db.similarity_search_with_score(query_text, k=5)

    context_text = "\n\n---\n\n".join([doc.page_content for doc, _score in results])
    prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
    prompt = prompt_template.format(context=context_text, question=query_text)
    # print(prompt)

    model = Ollama(model="gemma3:12b")
    response_text = model.invoke(prompt)

    sources = [doc.metadata.get("id", None) for doc, _score in results]
    formatted_response = f"Response: {response_text}\nSources: {sources}"
    print(formatted_response)
    return response_text
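
The retrieval and prompt-assembly steps in `query_rag` can be sketched without a database or a model. The example below assumes a distance-style score where a smaller value means a closer match, which, as far as we can tell, is how LangChain's Chroma wrapper reports scores from `similarity_search_with_score`. The two-dimensional vectors, `toy_index`, `top_k` and `DEMO_TEMPLATE` are all hypothetical; real embeddings from `nomic-embed-text` have far more dimensions.

```python
def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, index, k):
    """Return the k (chunk_id, text, distance) entries closest to query_vec, nearest first."""
    scored = [
        (chunk_id, text, squared_l2(query_vec, vec))
        for chunk_id, (vec, text) in index.items()
    ]
    scored.sort(key=lambda entry: entry[2])
    return scored[:k]


# Hypothetical index: chunk id -> (embedding, chunk text).
toy_index = {
    "doc:0": ([1.0, 0.0], "Chunk about topic A."),
    "doc:1": ([0.0, 1.0], "Chunk about topic B."),
    "doc:2": ([0.9, 0.1], "Another chunk about topic A."),
}

# Simplified template with the same two placeholders as PROMPT_TEMPLATE.
DEMO_TEMPLATE = "Context:\n{context}\n\n---\nQuestion: {question}"

question = "What is topic A?"
query_vec = [1.0, 0.0]  # pretend this is the embedding of the question

results = top_k(query_vec, toy_index, k=2)
context_text = "\n\n---\n\n".join(text for _id, text, _dist in results)
prompt = DEMO_TEMPLATE.format(context=context_text, question=question)
sources = [chunk_id for chunk_id, _text, _dist in results]

print(prompt)
print(sources)  # ['doc:0', 'doc:2']
```

Only the retrieved chunks reach the prompt, so the choice of `k` trades context coverage against prompt length.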

In [160]:
query_rag("Describe what SCP-1001 is, especially including species and behavior.")

Response: SCP-1001 is a plant-like organism contained at Bio Site-103. It exhibits unusual behavior and preferences, and its intelligence is highly debatable. 

Here's a breakdown of what the context describes:

*   **Appearance & Size:** It is described as a plant with a caudex and leaves. It is contained within a soil core 9 meters in diameter and 4 meters deep.
*   **Prey Preferences:** It prefers to consume intelligent animals, especially those capable of tool use or building structures. Humans are its preferred prey, but it will also attack primates, dogs, parrots, pigs, beavers, ants, and nest-building birds, even though some of these are much smaller than its typical prey size and result in a net energy loss.
*   **Hunting Tactics:** It uses two primary hunting tactics: burying its leaves to ambush prey (requiring a minimum size of 40 kg) and sophisticated audio mimicry to lure prey, including recreating and combining sounds to mimic known voices.
*   **Bone Arrangement:** It does