# 🦙 Local RAG with LangChain and LlamaCpp

This notebook demonstrates **Retrieval Augmented Generation** (RAG) for faithfully resolving concepts from an OWL ontology. It keeps conversation memory and runs entirely locally, using open source components:
* [LangChain](https://python.langchain.com)
* [FastEmbed embeddings](https://github.com/qdrant/fastembed)
* [Qdrant vectorstore](https://github.com/qdrant/qdrant)
* [LlamaCpp inference library](https://github.com/ggerganov/llama.cpp)
* [Mixtral 8x7B LLM](https://mistral.ai/news/mixtral-of-experts/)

See LangChain docs:
* [RAG with memory](https://python.langchain.com/docs/expression_language/cookbook/retrieval)
* [RAG streaming](https://python.langchain.com/docs/use_cases/question_answering/streaming)

Download the Mixtral 8x7B model in GGUF format (~15 GB) into the `tests/data/` folder:

```bash
wget https://huggingface.co/TheBloke/Mixtral-8x7B-Instruct-v0.1-GGUF/resolve/main/mixtral-8x7b-instruct-v0.1.Q2_K.gguf
```

> Make sure to pick a model that is already fine-tuned for chat (such models usually have `instruct` or `chat` in their name).
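
An interrupted download, or an HTML error page saved under the model's name, only surfaces later as a confusing load failure. The sketch below is one quick sanity check: GGUF files begin with the four ASCII bytes `GGUF`, so reading them is enough to spot a wrong file. The helper name `looks_like_gguf` is hypothetical, and the path is the `model_path` used later in this notebook.

```python
from pathlib import Path

GGUF_MAGIC = b"GGUF"  # the first four bytes of a GGUF file


def looks_like_gguf(path: Path) -> bool:
    """Return True if the file exists and starts with the GGUF magic bytes."""
    if not path.is_file():
        return False
    with path.open("rb") as f:
        return f.read(4) == GGUF_MAGIC


# Same relative path as the `model_path` passed to LlamaCpp further down
model_file = Path("./data/mixtral-8x7b-instruct-v0.1.Q2_K.gguf")
status = "ok" if looks_like_gguf(model_file) else "missing or not a GGUF file"
print(f"{model_file}: {status}")
```

This only checks the header, not the file's integrity. Comparing the file size or a checksum against the values listed on the model page would be a stronger check.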

## 📦️ Install and import dependencies

In [1]:
import sys
!{sys.executable} -m pip install langchain langchain-community llama-cpp-python fastembed qdrant-client

import json
from IPython.display import JSON
from operator import itemgetter
from langchain.memory import ConversationBufferMemory
from langchain.prompts.prompt import PromptTemplate
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import format_document
from langchain_community.llms import LlamaCpp
from langchain_community.vectorstores import Qdrant
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.messages import AIMessage, HumanMessage, get_buffer_string
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_rdf import OntologyLoader



## 🌀 Initialize local vectorstore and LLM

The `BAAI/bge-small-en-v1.5` embedding model used below produces 384-dimensional vectors:

```
flag_embeddings_size = 384
```

In [3]:
flag_embeddings = FastEmbedEmbeddings(model_name="BAAI/bge-small-en-v1.5", max_length=512)
loader = OntologyLoader("https://semanticscience.org/ontology/sio.owl", format="xml")
docs = loader.load()

# Split the documents into chunks if necessary
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
# vectorstore = FAISS.from_documents(documents=docs, embedding=flag_embeddings)
vectorstore = Qdrant.from_documents(
    splits,
    flag_embeddings,
    location=":memory:",
    # path="./data/qdrant",
    collection_name="ontologies",
    # Run Qdrant as a service for production use:
    # url="http://localhost:6333",
    # prefer_grpc=True,
)
retriever = vectorstore.as_retriever()

llm = LlamaCpp(
    model_path="./data/mixtral-8x7b-instruct-v0.1.Q2_K.gguf",
    temperature=0.01,
    max_tokens=2000,
    top_p=1,
    n_threads=8,
    n_ctx=2048,
    f16_kv=True,
    # n_gpu_layers=40,  # Change this value based on your model and your GPU VRAM pool.
    # n_batch=512,  # Should be between 1 and n_ctx, consider the amount of VRAM in your GPU.
)


llama_model_loader: loaded meta data with 26 key-value pairs and 995 tensors from ./data/mixtral-8x7b-instruct-v0.1.Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mixtral-8x7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
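
The retriever created above returns the stored chunks whose embeddings lie closest to the embedding of the query. The following is a minimal sketch of that idea in plain Python, assuming cosine similarity as the distance measure. The three-dimensional vectors are made up for illustration (the real embeddings have 384 dimensions), and `top_k` is a hypothetical helper, not part of Qdrant or LangChain.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=2):
    """Return the k labels whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in indexed.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [label for _, label in scored[:k]]


# Made-up toy vectors standing in for real embeddings
toy_index = {
    "protein": [0.9, 0.1, 0.0],
    "protein complex": [0.8, 0.3, 0.1],
    "temperature": [0.0, 0.2, 0.9],
}
query = [1.0, 0.0, 0.0]  # pretend embedding of the user question
print(top_k(query, toy_index, k=2))
```

A real vector store does the same ranking over every indexed chunk, usually with an approximate index rather than this exhaustive scan.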

## 🧠 Initialize prompts and memory

In [10]:
memory = ConversationBufferMemory(
    return_messages=True, output_key="answer", input_key="question"
)
# This adds a "chat_history" key to the input object
loaded_memory = RunnablePassthrough.assign(
    chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("history"),
)

# QUESTION PROMPT
_template = """Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:"""
CONDENSE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

# ANSWER PROMPT
template = """Briefly answer the question based only on the following context:
{context}

Question: {question}
"""
ANSWER_PROMPT = ChatPromptTemplate.from_template(template)

# Format how the ontology concepts are passed as context to the LLM
DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(
    template="Concept URI: {uri} | Type: {type} | Predicate: {predicate} | Label: {page_content} | Ontology: {ontology}"
)
def _combine_documents(
    docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    # print("doc_strings!", doc_strings)
    return document_separator.join(doc_strings)
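
To make the context format concrete, here is a plain-Python stand-in for what `_combine_documents` produces. It is only a sketch: `ToyDocument`, `format_toy_document` and `combine_toy_documents` are hypothetical names that imitate the behaviour with `str.format`, and they are not LangChain's implementation. The sample metadata is copied from the retrieval output shown later in this notebook.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


DOCUMENT_TEMPLATE = (
    "Concept URI: {uri} | Type: {type} | Predicate: {predicate} | "
    "Label: {page_content} | Ontology: {ontology}"
)


def format_toy_document(doc, template=DOCUMENT_TEMPLATE):
    # Metadata and page_content are merged into one namespace for the template;
    # keys the template does not mention (label, _id, ...) are simply ignored.
    values = {**doc.metadata, "page_content": doc.page_content}
    return template.format(**values)


def combine_toy_documents(docs, separator="\n\n"):
    return separator.join(format_toy_document(doc) for doc in docs)


protein = ToyDocument(
    page_content="protein",
    metadata={
        "label": "protein",
        "uri": "http://semanticscience.org/resource/SIO_010043",
        "type": "http://www.w3.org/2002/07/owl#Class",
        "predicate": "http://www.w3.org/2000/01/rdf-schema#label",
        "ontology": "https://semanticscience.org/ontology/sio.owl",
    },
)
print(combine_toy_documents([protein]))
```

Putting the URI first in each line keeps the identifier the model is expected to quote next to the label it matched.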


## ⛓️ Define the chain

In [11]:
# Calculate the standalone question
standalone_question = {
    "standalone_question": {
        "question": lambda x: x["question"],
        "chat_history": lambda x: get_buffer_string(x["chat_history"]),
    }
    | CONDENSE_QUESTION_PROMPT
    | llm
    | StrOutputParser(),
}
# Retrieve the documents using the standalone question
retrieved_documents = {
    "docs": itemgetter("standalone_question") | retriever,
    "question": lambda x: x["standalone_question"],
}
# Construct the inputs for the final prompt using retrieved documents
final_inputs = {
    "context": lambda x: _combine_documents(x["docs"]),
    "question": itemgetter("question"),
}
# Generate the answer using the documents and answer prompt
answer = {
    "answer": final_inputs | ANSWER_PROMPT | llm,
    "docs": itemgetter("docs"),
}
# And now we put it all together!
final_chain = loaded_memory | standalone_question | retrieved_documents | answer
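
The data flow through the four stages can be hard to see in the piped form. The sketch below replays it as ordinary function calls, with stubs in place of the model and the retriever. Everything here is hypothetical scaffolding (`render_history`, `run_chain`, `stub_llm`, `stub_retriever`, and the `temperature` entry in `CORPUS`); it mirrors the order of the steps, not LangChain's internals.

```python
def render_history(turns):
    """Render (human, ai) turns as a plain 'Human: ... / AI: ...' transcript."""
    lines = []
    for human, ai in turns:
        lines += [f"Human: {human}", f"AI: {ai}"]
    return "\n".join(lines)


def run_chain(question, turns, llm, retriever):
    # 1. loaded_memory: attach the chat history to the input
    state = {"question": question, "chat_history": render_history(turns)}
    # 2. standalone_question: condense history + follow-up into one question
    condense_prompt = (
        f"Chat History:\n{state['chat_history']}\n"
        f"Follow Up Input: {state['question']}\nStandalone question:"
    )
    standalone = llm(condense_prompt)
    # 3. retrieved_documents: search with the standalone question
    docs = retriever(standalone)
    # 4. answer: build the context and ask the model again
    context = "\n\n".join(docs)
    answer_prompt = f"Answer based only on this context:\n{context}\n\nQuestion: {standalone}\n"
    return {"answer": llm(answer_prompt), "docs": docs}


CORPUS = [
    "protein | http://semanticscience.org/resource/SIO_010043",
    "protein complex | http://semanticscience.org/resource/SIO_010497",
    "temperature | (hypothetical entry)",
]


def stub_llm(prompt):
    # Stub: fixed rewrite for the condense step, first context line otherwise
    if prompt.endswith("Standalone question:"):
        return "What is the concept URI for protein?"
    return prompt.splitlines()[1]


def stub_retriever(query):
    # Stub: keep entries whose label occurs verbatim in the query
    return [doc for doc in CORPUS if doc.split(" | ")[0] in query.lower()]


result = run_chain("And its URI?", [("What is a protein?", "A biopolymer.")], stub_llm, stub_retriever)
print(result["answer"])
```

Note that the model is called twice per question, once to rewrite the question and once to answer it, which matches the two timing reports in the output further down.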

## 🗨️ Ask a question

In [12]:
inputs = {"question": "What is the concept URI for protein?"}
output = {"answer": ""}
for chunk in final_chain.stream(inputs):
    if "docs" in chunk:
        output["docs"] = [doc.dict() for doc in chunk["docs"]]
        print(json.dumps(output["docs"], indent=2))
    if "answer" in chunk:
        output["answer"] += chunk["answer"]
        print(chunk["answer"], end="", flush=True)

JSON(output)

Llama.generate: prefix-match hit

llama_print_timings:        load time =     387.25 ms
llama_print_timings:      sample time =       6.53 ms /    13 runs   (    0.50 ms per token,  1990.20 tokens per second)
llama_print_timings: prompt eval time =    2966.63 ms /    52 tokens (   57.05 ms per token,    17.53 tokens per second)
llama_print_timings:        eval time =    1224.94 ms /    12 runs   (  102.08 ms per token,     9.80 tokens per second)
llama_print_timings:       total time =    4265.97 ms /    64 tokens
Llama.generate: prefix-match hit


[
  {
    "page_content": "protein complex",
    "metadata": {
      "label": "protein complex",
      "uri": "http://semanticscience.org/resource/SIO_010497",
      "type": "http://www.w3.org/2002/07/owl#Class",
      "predicate": "http://www.w3.org/2000/01/rdf-schema#label",
      "ontology": "https://semanticscience.org/ontology/sio.owl",
      "_id": "16ea616098154a1d8a40fd4399245513",
      "_collection_name": "ontologies"
    },
    "type": "Document"
  },
  {
    "page_content": "protein",
    "metadata": {
      "label": "protein",
      "uri": "http://semanticscience.org/resource/SIO_010043",
      "type": "http://www.w3.org/2002/07/owl#Class",
      "predicate": "http://www.w3.org/2000/01/rdf-schema#label",
      "ontology": "https://semanticscience.org/ontology/sio.owl",
      "_id": "62f06842d51241dbb74deaf4caaa0cab",
      "_collection_name": "ontologies"
    },
    "type": "Document"
  },
  {
    "page_content": "protein-protein association",
    "metadata": {
      "labe


llama_print_timings:        load time =     387.25 ms
llama_print_timings:      sample time =      12.55 ms /    34 runs   (    0.37 ms per token,  2708.08 tokens per second)
llama_print_timings: prompt eval time =   21947.56 ms /   444 tokens (   49.43 ms per token,    20.23 tokens per second)
llama_print_timings:        eval time =    3922.41 ms /    33 runs   (  118.86 ms per token,     8.41 tokens per second)
llama_print_timings:       total time =   26154.51 ms /   477 tokens
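
The timings above report a 444-token prompt for the answer step, while the LLM was configured with `n_ctx=2048` and `max_tokens=2000`. Prompt and completion share the same context window, so the requested 2000 tokens cannot all be generated for that prompt. The small calculation below makes this explicit. `generation_budget` is a hypothetical helper, and how the inference library reacts when the window fills up is not shown in this notebook.

```python
def generation_budget(n_ctx, prompt_tokens, max_tokens):
    """Tokens that can be generated before the context window is full."""
    if prompt_tokens >= n_ctx:
        raise ValueError("the prompt alone does not fit in the context window")
    return min(max_tokens, n_ctx - prompt_tokens)


# Values taken from the LlamaCpp configuration and the timing logs above
print(generation_budget(n_ctx=2048, prompt_tokens=444, max_tokens=2000))  # 1604
print(generation_budget(n_ctx=2048, prompt_tokens=52, max_tokens=2000))   # 1996
```

Both answers in this run were short (12 and 33 generated tokens), so the limit was not reached. The model metadata reports a trained context length of 32768, so raising `n_ctx` is an option when more retrieved context is needed and memory allows.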


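
The streaming loop above relies on the chain emitting partial dictionaries: one chunk carries the retrieved `docs`, and the following chunks each carry a piece of the `answer`. The sketch below reproduces that accumulation pattern with a fake stream so it can run without a model. `fake_stream` and `collect` are hypothetical names, and the streamed text is made up for illustration.

```python
def fake_stream():
    """Hypothetical stream: one chunk with the documents, then answer pieces."""
    yield {"docs": ["protein", "protein complex"]}
    for piece in ["The URI ", "is ", "http://semanticscience.org/resource/SIO_010043"]:
        yield {"answer": piece}


def collect(stream):
    """Merge partial chunks into one result, concatenating the answer text."""
    output = {"answer": ""}
    for chunk in stream:
        if "docs" in chunk:
            output["docs"] = chunk["docs"]
        if "answer" in chunk:
            output["answer"] += chunk["answer"]
    return output


result = collect(fake_stream())
print(result["answer"])
```

The chain as shown only reads from memory. For a follow-up question to see this exchange, the collected answer would need to be written back, for example with `memory.save_context(inputs, {"answer": output["answer"]})` after the loop.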