## Why this matters for your business
- The local setup keeps data private (no cloud)
- No accumulating token costs
- Works with internal documents e.g. PDFs
- Auditable answers: the exact text chunk used to generate the answer can be identified
- Offline / air-gapped mode is possible

## Brief tech stack overview
- The official [Ollama](https://ollama.com) Docker image is used for managing and running the language models.
- We deploy Meta's [Llama3.2](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) large language model and Nomic's [nomic-embed-text](https://www.nomic.ai/news/nomic-embed-text-v1) embedding model.
- [Weaviate](https://weaviate.io) is used as the vector store.
- [LangChain](https://www.langchain.com) is used for orchestration of the RAG agent.
- The agent is deployed locally with the [LangGraph](https://www.langchain.com/langgraph) server.
- Finally, the [Agent Chat UI](https://docs.langchain.com/oss/python/langchain/ui) app serves as the web UI.

The flow of information is shown below. The user enters a query in the UI, which passes it to the agent. The agent can retrieve relevant information from the vector store and uses the LLM to generate the answer.

![Architecture overview](static/architecture.svg)

## Deployment of models
First, we build and run the Ollama and Weaviate Docker images. We can then fetch the Llama and Nomic models. All of this is achieved with the following make commands:

In [None]:
! make setup > /dev/null 2>&1
! make run > /dev/null 2>&1
! make fetch_models > /dev/null 2>&1

We now have the `Llama3.2` model deployed locally.

In the absence of internal business documents for this demo, we will demonstrate the principles using the Napoleonic Wars Wikipedia page. To start, we test the model's knowledge of the Napoleonic Wars by asking it the following question:

> Under the Napoleonic wars, which countries took part in the fifth coalition against France?

The core members of the Fifth Coalition were the Austrian Empire and the United Kingdom. As LLMs have randomness built in, we repeat the query five times. From the five answers below, we see that the model either states upfront that it does not know the answer or generates lists that include more or different countries than the actual members of the coalition.

A larger model such as OpenAI's GPT-5 would likely have answered correctly, but that is beside the point. We are considering the case of internal documents, which neither GPT-5 nor our local Llama model knows about.

In [None]:
from backend.rag import query_raw_model

q = "Under the Napoleonic wars, which countries took part in the fifth coalition against France?"

responses = [
    query_raw_model(q) for i in range(5)
]

for i, response in enumerate(responses):
    print(f'Response {i+1}:')
    print(response)
    print("-" * 100)


Response 1:
The Fifth Coalition was formed in 1809 and consisted of Austria, Britain, Russia, Sweden, and the Ottoman Empire.
----------------------------------------------------------------------------------------------------
Response 2:
I cannot verify which country took part in the Fifth Coalition of the Napoleonic Wars.
----------------------------------------------------------------------------------------------------
Response 3:
The fifth coalition against France consisted of Britain and Russia.
----------------------------------------------------------------------------------------------------
Response 4:
The Fifth Coalition was formed in 1809, during the Napoleonic Wars. The countries that took part in this coalition were:

1. Austria
2. Russia
3. Sweden-Norway
4. Great Britain (United Kingdom)
5. Ottoman Empire
----------------------------------------------------------------------------------------------------
Response 5:
The Fifth Coalition was formed against Napoleon's Frenc

## Weaviate vector store

We will now set up the vector database containing the internal documents (or, in our case, the Wikipedia page on the Napoleonic Wars).

The script `populate_db.py` fetches the Napoleonic Wars Wikipedia page. Each section of the page is split into chunks using an instance of the `RecursiveCharacterTextSplitter` from LangChain. This ensures that chunk divisions follow the section layout of the page, which in turn should improve the quality of the chunks. Each chunk is then stored in the Weaviate database together with the relevant section title and an embedding vector.

In [17]:
from scripts import populate_db
populate_db.run()

Imported 155 chunks into the Napoleonic Wars collection
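
To make the chunking step above more concrete, here is a minimal, simplified sketch of recursive character splitting applied per section. It is not LangChain's implementation (there is no chunk overlap, for example) and it is not the code in `populate_db.py`. The function names `split_recursive` and `chunk_sections` and the sample sections are hypothetical and only serve as illustration.

```python
from typing import Dict, List, Sequence

# Separators are tried from coarse to fine: paragraph, line, word.
DEFAULT_SEPARATORS = ("\n\n", "\n", " ")


def split_recursive(
    text: str, chunk_size: int, separators: Sequence[str] = DEFAULT_SEPARATORS
) -> List[str]:
    """Split text into pieces of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring parts while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # The part alone is too long: retry with a finer separator.
            chunks.extend(split_recursive(part, chunk_size, finer))
    if current.strip():
        chunks.append(current)
    return chunks


def chunk_sections(sections: Dict[str, str], chunk_size: int) -> List[dict]:
    """Chunk each section separately so every chunk keeps its section title."""
    records = []
    for title, body in sections.items():
        for chunk in split_recursive(body, chunk_size):
            records.append({"title": title, "text": chunk})
    return records


if __name__ == "__main__":
    demo = {"A": "one two three\n\nfour five six", "B": "short"}
    for record in chunk_sections(demo, chunk_size=15):
        print(record)
```

Because each section is split on its own, a chunk never straddles two sections, so it can carry a single section title as metadata.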


With the chunks added to the database, let's just confirm that the chunks look as expected. For the first chunk we print out the section title, the first 100 characters of the text chunk and the first five elements of the embedding vector. Everything looks fine:

In [2]:
import weaviate

client = weaviate.connect_to_local()
collection = client.collections.get('napoleonic_wars')

for item in collection.iterator(include_vector=True):
    print(f'Section title: {item.properties["title"]}')
    print(f'Chunk text: {item.properties["text"][0:100]} ...')
    print(f'Embedding vector: {item.vector["default"][0:5]} ...')
    break

client.close()

Section title: Invasion of Russia, 1812
Chunk text: The central issue for both Emperor Napoleon I and Tsar Alexander I was control over Poland. Each wan ...
Embedding vector: [-0.00635263929143548, 0.03718530014157295, -0.15025673806667328, -0.012791609391570091, 0.0685529112815857] ...


## RAG

With the vector database set up, we are now ready to build our RAG agent. The code for the agent can be found in [`backend/rag.py`](https://github.com/crs17/local_rag_demo/blob/main/backend/rag.py). Let's import it and give it a spin!

In [2]:
from backend import rag

my_RAGAgent = rag.LocalRAGAgent()

First let's verify that the vector store is able to find relevant chunks:

In [19]:
docs = my_RAGAgent.vector_store.similarity_search_with_score("Under the Napoleonic wars, which countries took part in the fifth coalition against France?")
print(f'Found {len(docs)} chunks:')
for doc, score in docs:
    print(f'Section title: {doc.metadata["title"]}')
    print(f'{doc.page_content[:100]} ...')
    print(f'Score: {score}')
    print('-' * 100)



Found 4 chunks:
Section title: War of the Fifth Coalition, 1809
The Fifth Coalition (1809) of Britain and Austria against France formed as Britain engaged in the Pe ...
Score: 1.0
----------------------------------------------------------------------------------------------------
Section title: War of the Fifth Coalition, 1809
On land, the Fifth Coalition attempted few extensive military endeavours. One, the Walcheren Expedit ...
Score: 0.7424584031105042
----------------------------------------------------------------------------------------------------
Section title: War of the Fifth Coalition, 1809
in French territory, many breaches of the Continental System occurred and the French Continental Sys ...
Score: 0.7252911329269409
----------------------------------------------------------------------------------------------------
Section title: War of the Fifth Coalition, 1809
the kingdoms of Denmark–Norway
the Kingdom of Spain (under Joseph Bonaparte, Napoleon's elder brothe ...
Score:

This looks good! We can see that the first chunk - which has the highest score - reveals the answer we are looking for.
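
For intuition about what a similarity search does, a common way to compare a query embedding with chunk embeddings is cosine similarity. The sketch below ranks a few toy two-dimensional vectors. The helper names, titles, and vectors are made up for illustration, and the scores returned by `similarity_search_with_score` above may be computed or normalized differently.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float], chunks: List[dict], k: int = 4) -> List[Tuple[str, float]]:
    """Return the k chunk titles most similar to the query, best first."""
    scored = [(c["title"], cosine_similarity(query_vec, c["vector"])) for c in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


if __name__ == "__main__":
    # Toy data: real embeddings have hundreds of dimensions.
    toy_chunks = [
        {"title": "chunk A", "vector": [1.0, 0.0]},
        {"title": "chunk B", "vector": [1.0, 1.0]},
        {"title": "chunk C", "vector": [0.0, 1.0]},
    ]
    for title, score in top_k([1.0, 0.0], toy_chunks, k=2):
        print(f"{title}: {score:.3f}")
```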

Let's now ask the Agent the same question to see if it does better than the raw model did above:

In [4]:
query = "Under the Napoleonic wars, which countries took part in the fifth coalition against France?"

for event in my_RAGAgent.stream(query):
    event["messages"][-1].pretty_print()




Under the Napoleonic wars, which countries took part in the fifth coalition against France?
Tool Calls:
  retrieve_context (7669cd28-30c3-475f-b084-1130d8573b48)
 Call ID: 7669cd28-30c3-475f-b084-1130d8573b48
  Args:
    query: Fifth Coalition against France
Name: retrieve_context

Source: {'title': 'War of the Fifth Coalition, 1809'}
Content: The Fifth Coalition (1809) of Britain and Austria against France formed as Britain engaged in the Peninsular War in Spain and Portugal. The sea became a major theatre of war against Napoleon's allies. Austria, previously an ally of France, took the opportunity to attempt to restore its imperial territories in Germany as held prior to Austerlitz. During the time of the Fifth Coalition, the Royal Navy won a succession of victories in the French colonies. On land the major battles included Battles

Source: {'title': 'War of the Fifth Coalition, 1809'}
Content: On land, the Fifth Coalition attempted few extensive military endeavours. One, the Walche

In the output above we first see our initial question under "Human Message". Under "Ai Message" we see that the agent asked the vector store for information on "Fifth Coalition against France", and under "Tool Message" we see that the vector store responded with two chunks (the first of which contains the answer we are looking for). Finally, under the last "Ai Message", the agent answered the question perfectly. Success!
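
The message sequence described above (human question, AI tool call, tool result, AI answer) can be mimicked with a toy loop. This is only a sketch of the pattern, not the code in `backend/rag.py`: `fake_llm` is a hard-coded stand-in for the language model, `retrieve_context` here is a naive keyword lookup rather than a vector search, and the `Message` class is hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Two short passages taken from the chunks shown earlier in this post.
DOCS = [
    {
        "title": "War of the Fifth Coalition, 1809",
        "text": "The Fifth Coalition (1809) of Britain and Austria against France "
                "formed as Britain engaged in the Peninsular War in Spain and Portugal.",
    },
    {
        "title": "Invasion of Russia, 1812",
        "text": "The central issue for both Emperor Napoleon I and Tsar Alexander I "
                "was control over Poland.",
    },
]


@dataclass
class Message:
    role: str                          # "human", "ai" or "tool"
    content: str = ""
    tool_query: Optional[str] = None   # set when the AI requests a retrieval


def retrieve_context(query: str) -> str:
    """Naive stand-in for the vector store: pick the doc sharing most words."""
    words = set(query.lower().split())
    best = max(DOCS, key=lambda d: len(words & set(d["text"].lower().split())))
    return f"Source: {best['title']}\nContent: {best['text']}"


def fake_llm(messages: List[Message]) -> Message:
    """Stub model: ask for retrieval first, then answer from the tool output."""
    last = messages[-1]
    if last.role == "human":
        return Message(role="ai", tool_query=last.content)
    return Message(role="ai", content=f"Based on the retrieved context:\n{last.content}")


def run_agent(question: str) -> List[Message]:
    """Loop until the model replies without requesting a tool call."""
    messages = [Message(role="human", content=question)]
    while True:
        reply = fake_llm(messages)
        messages.append(reply)
        if reply.tool_query is None:
            return messages
        messages.append(Message(role="tool", content=retrieve_context(reply.tool_query)))


if __name__ == "__main__":
    for message in run_agent("Which countries took part in the fifth coalition against France?"):
        print(f"[{message.role}] {message.tool_query or message.content}")
```

In the real agent, the model itself decides whether to call the tool and how to phrase the retrieval query. The stub hard-codes that decision to keep the example self-contained.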

## Web page chat UI
As the last step we will set up the web page chat interface. We can do that by running `make langgraph-run` for the backend and `make ui-setup && make ui-run` for the frontend. In the screenshot below we see that the agent works with the chat UI and is able to handle follow-up questions:

![chat UI interface](static/chat_ui.png)


## How to take this to production
While this demo works, there are several things to consider before deploying this approach in a production environment:
- An ingestion ELT pipeline might be needed depending on the nature of the source documents. For PDF files an extraction mechanism is needed - perhaps OCR.
- Again depending on the source documents, a more advanced chunking mechanism such as semantic chunking could be considered.
- Better vector store search could be achieved with a hybrid BM25/embedding approach
- Reranking could be deployed for more precise chunk selection
- Vector store filtering based on e.g. topics or clients
- Monitoring using e.g. Langfuse
- Authentication on both UI and backend
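
As an illustration of the hybrid search point in the list above, one common way to combine a keyword (BM25) ranking with an embedding ranking is reciprocal rank fusion (RRF). The sketch below is one possible fusion approach, not something used in this demo; the function name, the chunk ids, and the two input rankings are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk ids into a single ranking.

    Each list contributes 1 / (k + rank) per chunk, with rank starting at 1.
    The constant k dampens the influence of the very top positions.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


if __name__ == "__main__":
    bm25_ranking = ["c1", "c2", "c3"]        # hypothetical keyword search result
    embedding_ranking = ["c2", "c3", "c1"]   # hypothetical vector search result
    print(reciprocal_rank_fusion([bm25_ranking, embedding_ranking]))
```

RRF only needs the rank positions from each retriever, so the BM25 and embedding scores do not have to be on a comparable scale.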

## TLDR
We have demonstrated how to set up a local RAG agent for making internal documents chattable. We saw how the raw model struggled to answer questions on the Napoleonic Wars and how adding a RAG agent improved the quality of the answers. While this works, there is still a host of things to consider before deploying this in production.

## How I can help
Please reach out on [LinkedIn](https://www.linkedin.com/in/chresten-søndergaard/) if I can help with local RAG deployment or other ML/AI endeavors.