# Retrieval Augmented Generation (RAG) with Wasm and Qdrant

In the ever-evolving landscape of AI, the consistency and reliability of Large Language Models (LLMs) remain a challenge. While these models capture statistical relationships between words, they often fail to provide accurate factual responses. Because their internal knowledge may not be accurate, outputs can range from spot-on to nonsensical. Retrieval Augmented Generation (RAG) is a framework designed to improve the accuracy of LLMs by grounding them in external knowledge bases. In this example, we'll demonstrate a streamlined implementation of the RAG pipeline using only Qdrant and a self-hosted LLM behind an HTTP API (WasmEdge or TGI). Because FastEmbed computes the embeddings for us, we can bypass the overhead of additional frameworks.
    
This example assumes you understand the architecture necessary to carry out RAG. If this is new to you, please look at some introductory readings:
* [Retrieval-Augmented Generation: To add knowledge](https://eugeneyan.com/writing/llm-patterns/#retrieval-augmented-generation-to-add-knowledge)

## Prerequisites

Let's start setting up all the pieces to implement the RAG pipeline. We will only use the Qdrant client with FastEmbed and plain HTTP requests to the LLM server, without any orchestration frameworks.

### Preparing the environment

We need just a few dependencies to implement the whole application, so let's start by installing them.

In [110]:
# To run in Google Colab, install these libraries and then restart the session
!pip install cohere tiktoken protobuf==3.20.3 typing-extensions==4.5.0 qdrant-client fastembed openai


[Qdrant](https://qdrant.tech) will act as a knowledge base providing the context information for the prompts we'll be sending to the LLM. There are various ways of running Qdrant. The Docker command below is one option and is left commented out, because this example uses the client's in-memory mode instead.

In [111]:
# Docker
# !docker run -p "6333:6333" -p "6334:6334" --name "rag-openai-qdrant" --rm -d qdrant/qdrant:latest

### Creating the collection

A Qdrant [collection](https://qdrant.tech/documentation/concepts/collections/) is the basic unit of organizing your data. Each collection is a named set of points (vectors with a payload) among which you can search. After creating the client, here in in-memory mode, we can check whether we already have some collections and then create the one we need.
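To make the idea of "a named set of points among which you can search" concrete, here is a toy sketch in plain Python. It is not how Qdrant is implemented, and `Point`, `ToyCollection`, and `cosine` are hypothetical names used only for illustration. It assumes cosine similarity as the distance measure.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Point:
    """A single entry in the collection: a vector plus an arbitrary payload."""
    id: str
    vector: list[float]
    payload: dict


@dataclass
class ToyCollection:
    """Hypothetical in-memory stand-in for a collection (not Qdrant's API)."""
    name: str
    points: list[Point] = field(default_factory=list)

    def upsert(self, point: Point) -> None:
        # Replace a point that has the same id, otherwise append it.
        self.points = [p for p in self.points if p.id != point.id]
        self.points.append(point)

    def search(self, query: list[float], limit: int = 10) -> list[tuple[float, Point]]:
        # Score every point by cosine similarity and return the best ones first.
        scored = [(cosine(query, p.vector), p) for p in self.points]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:limit]


def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity; returns 0.0 when either vector has zero length.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


collection = ToyCollection(name="knowledge-base")
collection.upsert(Point("a", [1.0, 0.0], {"document": "first fact"}))
collection.upsert(Point("b", [0.0, 1.0], {"document": "second fact"}))

best_score, best_point = collection.search([0.9, 0.1], limit=1)[0]
print(best_point.payload["document"])
```

A real collection adds indexing, persistence, and filtering on top of this, but the contract is the same: store vectors with payloads, then ask for the nearest ones.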

In [None]:
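import qdrant_client

# In-memory mode: no server needed, and the data is lost when the process exits
client = qdrant_client.QdrantClient(":memory:")

# List existing collections (none yet for a fresh in-memory client)
client.get_collections()

# Choose the FastEmbed model; a multilingual one suits the Thai texts used below
client.set_model("intfloat/multilingual-e5-large")

# (Re)create the collection with vector parameters that match the chosen model
client.recreate_collection(
    collection_name="knowledge-base",
    vectors_config=client.get_fastembed_vector_params(),
)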

### Building the knowledge base

Qdrant will use vector embeddings of our facts to enrich the original prompt with some context. Thus, we need to store the vector embeddings and the texts used to generate them. All our facts will have a JSON payload with a single attribute and look as follows:

```json
{
    "document": "Binary Quantization is a method of reducing the memory usage even up to 40 times!"
}
```

This structure is required by [FastEmbed](https://qdrant.github.io/fastembed/), a library that simplifies managing the vectors, as you don't have to calculate them on your own. It's also possible to use an existing collection. However, all the code snippets assume this data structure, so adjust them if you work with a different schema.

FastEmbed will automatically create the collection if it doesn't exist. With that in mind, we are set to add our documents to a collection, which we'll call `knowledge-base`.

In [115]:
documents = [
    "ต๊อบ ชอบกินมะพร้าว",  # "Tob likes to eat coconut"
    "ต๊อบ ชอบกินถั่ว",  # "Tob likes to eat beans"
]

In [116]:
# metadata = [{"text": item} for item in documents]

client.add(
    collection_name="knowledge-base",
    documents=documents,
    # metadata=metadata,
)

['de66bb58888e4f7db61e8561e88d5f89', '920df6cbc1c0466ca9cfdb816a3fd51d']

## Retrieval Augmented Generation

RAG changes the way we interact with Large Language Models. We're converting a knowledge-oriented task, in which the model may create a counterfactual answer, into a language-oriented task. The latter expects the model to extract meaningful information and generate an answer. LLMs, when implemented correctly, are supposed to be carrying out language-oriented tasks.

The task starts with the original prompt sent by the user. The same prompt is then vectorized and used as a search query for the most relevant facts. Those facts are combined with the original prompt to build a longer prompt containing more information.
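The three steps just described (vectorize the prompt, search for relevant facts, combine them with the prompt) can be sketched without any external service. This is only an illustration under loud assumptions: `embed` is a word-count stand-in for a real embedding model, `FACTS` is a made-up two-item knowledge base, and no LLM is called.

```python
import math
from collections import Counter

# Made-up facts standing in for the knowledge base.
FACTS = [
    "Qdrant stores vectors together with a payload.",
    "WasmEdge can run GGUF models behind an HTTP API.",
]


def embed(text: str) -> Counter:
    # Stand-in embedding: counts of lowercase words. A real pipeline would
    # call an embedding model here instead.
    return Counter(text.lower().replace(".", " ").replace("?", " ").split())


def similarity(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, limit: int = 1) -> list[str]:
    # Steps 1 and 2: vectorize the question and rank the stored facts against it.
    query = embed(question)
    ranked = sorted(FACTS, key=lambda fact: similarity(query, embed(fact)), reverse=True)
    return ranked[:limit]


def augment(question: str, facts: list[str]) -> str:
    # Step 3: combine the retrieved facts with the original question.
    context = "\n".join(facts)
    return f"Question: {question}\n\nContext:\n{context}\n\nAnswer:"


question = "What does Qdrant store?"
augmented_prompt = augment(question, retrieve(question))
print(augmented_prompt)
```

The rest of this example follows the same shape, with Qdrant and FastEmbed doing the retrieval and a hosted model producing the final answer.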

But let's start simply by asking our question directly.

In [117]:
# Thai for "What does Tob like to eat?"
prompt = """
ต๊อบชอบกินอะไร?
"""

We will use `TGI` (Text Generation Inference) via Docker on WSL2 with an RTX 4090:

In [118]:
# !model=SeaLLMs/SeaLLM-7B-Chat
# !volume=$PWD/data

# !docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:1.3 --model-id $model 

Alternatively, run the model with `WasmEdge` (see https://wasmedge.org/docs/start/overview):

In [119]:
# !curl -LO https://huggingface.co/parinzee/SeaLLM-7B-Chat-GGUF/resolve/main/seallm-7b-chat.q4_k_m.gguf
# !curl -LO https://code.flows.network/webhook/iwYN1SdN3AmPgR5ao5Gt/llama-api-server.wasm
# !curl -LO https://code.flows.network/webhook/iwYN1SdN3AmPgR5ao5Gt/llama-chat.wasm
# !curl -sSf https://raw.githubusercontent.com/WasmEdge/WasmEdge/master/utils/install.sh | bash -s -- --plugins wasmedge_rustls wasi_nn-ggml
# !source $HOME/.wasmedge/env
# !wasmedge --dir .:. --nn-preload default:GGML:AUTO:seallm-7b-chat.q4_k_m.gguf llama-api-server.wasm

In [120]:
!curl -X POST http://0.0.0.0:8080/v1/chat/completions -H 'accept:application/json' -H 'Content-Type: application/json' -d '{"messages":[{"role":"system", "content":"You are a helpful AI assistant"}, {"role":"user", "content":"กทม ย่อมาจากอะไร"}], "model":"openthaigpt-1.0.0-beta-13b-chat"}'


{"id":"9bc7c7aa-598e-4d39-836b-ffa161fbd2c3","object":"chat.completion","created":1704545558,"model":"openthaigpt-1.0.0-beta-13b-chat","choices":[{"index":0,"message":{"role":"assistant","content":"กรุงเทพฯ"},"finish_reason":"stop"}],"usage":{"prompt_tokens":35,"completion_tokens":2,"total_tokens":37}}
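The response body above carries the answer under `choices[0].message.content` and the token counts under `usage`. As a small sketch, the helper below parses that exact body with the standard library. The name `extract_answer` is ours and not part of any SDK.

```python
import json

# Response body copied from the cell output above.
raw = '{"id":"9bc7c7aa-598e-4d39-836b-ffa161fbd2c3","object":"chat.completion","created":1704545558,"model":"openthaigpt-1.0.0-beta-13b-chat","choices":[{"index":0,"message":{"role":"assistant","content":"กรุงเทพฯ"},"finish_reason":"stop"}],"usage":{"prompt_tokens":35,"completion_tokens":2,"total_tokens":37}}'


def extract_answer(body: str) -> tuple[str, int]:
    """Return the assistant's reply and the total token count."""
    data = json.loads(body)
    choices = data.get("choices") or []
    if not choices:
        # Fail with a clear message instead of an IndexError further down.
        raise ValueError("response contains no choices")
    answer = choices[0]["message"]["content"]
    total_tokens = data.get("usage", {}).get("total_tokens", 0)
    return answer, total_tokens


answer, total_tokens = extract_answer(raw)
print(answer, total_tokens)
```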

In [121]:
# !curl 127.0.0.1:8080/generate \
#     -X POST \
#     -d '{"inputs":"แอดต๊อบชอบกินอะไร?","parameters":{"max_new_tokens":20}}' \
#     -H 'Content-Type: application/json'
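The commented curl call above sends a JSON body with `inputs` and `parameters` to the `/generate` endpoint. A rough Python equivalent using only the standard library might look like the sketch below. `build_generate_request` is a hypothetical helper name, and the base URL is taken from the curl command. The request is only built here, because sending it needs a running server.

```python
import json
import urllib.request


def build_generate_request(prompt: str, max_new_tokens: int = 20,
                           base_url: str = "http://127.0.0.1:8080") -> urllib.request.Request:
    # Mirror the JSON body of the curl command: the prompt goes under
    # "inputs" and generation settings under "parameters".
    body = json.dumps(
        {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_generate_request("แอดต๊อบชอบกินอะไร?")
print(request.full_url, request.get_method())

# Sending is left commented out because it needs a running server:
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(json.loads(response.read()))
```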

### Extending the prompt

A model asked directly has no way of knowing what ต๊อบ likes to eat, because that fact exists only in our knowledge base. At best it can guess, and a guess may sound credible while being wrong. To improve the results, we can enrich the original prompt with the facts most relevant to the question. Let's use the semantic knowledge base to augment the prompt with them!

In [122]:
results = client.query(
    collection_name="knowledge-base",
    query_text=prompt,
    limit=10,
)
results

[QueryResponse(id='de66bb58888e4f7db61e8561e88d5f89', embedding=None, metadata={'document': 'ต๊อบ ชอบกินมะพร้าว'}, document='ต๊อบ ชอบกินมะพร้าว', score=0.9020121864698284),
 QueryResponse(id='920df6cbc1c0466ca9cfdb816a3fd51d', embedding=None, metadata={'document': 'ต๊อบ ชอบกินถั่ว'}, document='ต๊อบ ชอบกินถั่ว', score=0.8977321732302992)]

In [123]:
[result.metadata['document'] for result in results]

['ต๊อบ ชอบกินมะพร้าว', 'ต๊อบ ชอบกินถั่ว']

We used the original prompt to perform a semantic search over the set of stored facts. Now we can use these facts to augment the prompt and create more context.

In [124]:
context = "\n".join(r.document for r in results)
print(context)

ต๊อบ ชอบกินมะพร้าว
ต๊อบ ชอบกินถั่ว


Finally, let's build a metaprompt: the combination of instructions for the LLM, the original question, and the results from our semantic search, which steers the LLM toward the provided context.

By doing this, we effectively convert the knowledge-oriented task into a language task and hopefully reduce the chances of hallucinations. It also should make the response sound more relevant.

In [125]:
metaprompt = f"""
You will reply in Thai.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "ไม่รู้".

Question: {prompt.strip()}

Context:
{context.strip()}

Answer:
"""

# Look at the full metaprompt
print(metaprompt)


You will reply in Thai.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "ไม่รู้".

Question: ต๊อบชอบกินอะไร?

Context:
ต๊อบ ชอบกินมะพร้าว
ต๊อบ ชอบกินถั่ว

Answer:



Our current prompt is much longer, and we also used a couple of strategies to make the responses even better:

1. The LLM is instructed to reply in Thai.
2. We provide more context to answer the question.
3. If the context contains no meaningful information, the model shouldn't make up an answer.

Let's find out if that works as expected.

In [126]:
import requests

def get_completion(context, prompt):
    metaprompt = f"""You will reply in Thai.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "ไม่รู้".
If it out of context below, do not pretend you know it, but answer "ไม่รู้ๆๆ".

Context:
{context.strip()}"""
    
    url = 'http://0.0.0.0:8080/v1/chat/completions'
    headers = {
        'accept': 'application/json',
        'Content-Type': 'application/json'
    }

    print(metaprompt)

    data = {
        'messages': [
            {'role': 'system', 'content': metaprompt},
            {'role': 'user', 'content': prompt.strip()}
        ],
        'model': 'openthaigpt-1.0.0-beta-13b-chat'
    }

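    # Send the request and fail loudly on HTTP errors
    response = requests.post(url, headers=headers, json=data)
    response.raise_for_status()

    # The answer sits under choices[0].message.content in the response body
    completion_json = response.json()
    completion = completion_json['choices'][0]['message']['content']

    return completion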

In [127]:
get_completion(context, prompt)

You will reply in Thai.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "ไม่รู้".
If it out of context below, do not pretend you know it, but answer "ไม่รู้ๆๆ".

Context:
ต๊อบ ชอบกินมะพร้าว
ต๊อบ ชอบกินถั่ว


'ไม่รู้'

### Testing out the RAG pipeline

With the retrieved context in the prompt, the model can ground its answer in our facts instead of its internal knowledge. Let's wrap the RAG pipeline in a function, so we can call it more easily for different prompts.

In [128]:
async def rag(question: str, n_points: int = 10) -> str:
    # Retrieve the facts most similar to the question
    results = client.query(
        collection_name="knowledge-base",
        query_text=question,
        limit=n_points,
    )

    context = "\n".join(r.document for r in results)

    # Pass the question itself, not the global `prompt` defined earlier
    return get_completion(context, question)

Now it's easier to ask a broad range of questions.

In [129]:
await rag("ต๊อบ ชอบกินอะไร?")

You will reply in Thai.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "ไม่รู้".
If it out of context below, do not pretend you know it, but answer "ไม่รู้ๆๆ".

Context:
ต๊อบ ชอบกินมะพร้าว
ต๊อบ ชอบกินถั่ว


'ต็อบชอบกินมะพร้าวและถั่ว'

In [130]:
await rag("หญิง ชอบกินอะไร?")

You will reply in Thai.
Answer the following question using the provided context.
If you can't find the answer, do not pretend you know it, but answer "ไม่รู้".
If it out of context below, do not pretend you know it, but answer "ไม่รู้ๆๆ".

Context:
ต๊อบ ชอบกินมะพร้าว
ต๊อบ ชอบกินถั่ว


'ไม่รู้'

Our model can now:

1. Take advantage of the knowledge in our vector datastore.
2. Say that it doesn't know when the provided context doesn't contain the answer.

We have just shown a useful mechanism to mitigate the risks of hallucinations in Large Language Models.
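Note that the search returned the two facts about ต๊อบ even for the question about หญิง, because a nearest-neighbor search always returns the closest points it has. One optional safeguard, sketched here as an assumption rather than something the example above does, is to drop low-scoring hits before building the context. `Hit` is a hypothetical stand-in for a search result, and the cut-off values are arbitrary and would need tuning on real data.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical stand-in for a search result with a text and a score."""
    document: str
    score: float


def build_context(hits: list[Hit], min_score: float) -> str:
    # Keep only hits at or above the cut-off, best first, one per line.
    kept = sorted((h for h in hits if h.score >= min_score), key=lambda h: h.score, reverse=True)
    return "\n".join(h.document for h in kept)


# Texts and scores taken from the query output shown earlier.
hits = [
    Hit("ต๊อบ ชอบกินถั่ว", 0.8977321732302992),
    Hit("ต๊อบ ชอบกินมะพร้าว", 0.9020121864698284),
]

print(build_context(hits, min_score=0.85))
print(repr(build_context(hits, min_score=0.95)))
```

An empty context gives the model nothing to lean on, which pairs naturally with the instruction to answer "ไม่รู้" when the answer cannot be found.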

### Cleaning up the environment

If you wish to continue playing with the RAG application we created, skip the code below. However, it's always good to clean up the environment, so nothing is left dangling. If you started Qdrant in Docker, here is how to remove the container.

In [131]:
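# !docker kill rag-openai-qdrant
# The container was started with --rm, so it is removed automatically once it stops;
# the line below is only needed if you started it without --rm
# !docker rm rag-openai-qdrant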