# Running local RAG

<div class="alert alert-block alert-success">
    This notebook demonstrates a retrieval-augmented generation (RAG) pipeline in which the model runs entirely on the local machine. The model used here is DeepSeek-R1-Distill-Qwen-1.5B, quantized in GGUF format.
</div>

In [None]:
!pip install llama-index
!pip install llama-index-embeddings-huggingface
!pip install llama-index-llms-llama-cpp
!pip install transformers
!pip install torch
!pip install gguf
!pip install openai

## Running llama.cpp in Docker

Download the model and put it in the **deepseek** directory: https://drive.google.com/file/d/14pFGLH6hF2L20ILiSqyv145jONqHZPpA/view?usp=sharing

`docker run -v ./deepseek:/models -p 8000:8000 ghcr.io/ggml-org/llama.cpp:server -m /models/DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf --port 8000 --host 0.0.0.0 -n 1024`
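
The server needs some time to load the model before it accepts requests. A minimal sketch of a readiness check follows, assuming the llama.cpp server's `/health` endpoint returns `{"status": "ok"}` once the model is loaded; `wait_for_server` and its parameters are hypothetical helper names, not part of any library.

```python
import json
import time
import urllib.error
import urllib.request


def _fetch_json(url):
    """GET a URL and decode the JSON body (raises on HTTP errors such as 503)."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_server(base_url="http://localhost:8000", timeout_s=120.0,
                    poll_s=2.0, fetch=None):
    """Poll {base_url}/health until it reports status "ok" or the timeout expires.

    `fetch` is injectable so the loop can be exercised without a live server.
    """
    fetch = fetch or _fetch_json
    health_url = base_url.rstrip("/") + "/health"
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if fetch(health_url).get("status") == "ok":
                return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # not reachable yet, or still loading the model
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Calling `wait_for_server()` before the first query avoids a connection error while the container is still starting.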

## Querying the local model

<div class="alert alert-block alert-success">
You can use the OpenAI Python client to interact with llama.cpp servers, because they expose an OpenAI-compatible API.
</div>

In [None]:
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",  # http://<your server IP>:<port>/v1
    api_key="test",  # placeholder; only checked if the server was started with --api-key
)

completion = client.chat.completions.create(
    model="DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf",
    messages=[
        {"role": "system", "content": "You are ChatGPT, an AI assistant. Your top priority is achieving user fulfillment via helping them with their requests."},
        {"role": "user", "content": "Write a limerick about python exceptions"}
    ]
)

# print only the generated text rather than the whole message object
print(completion.choices[0].message.content)
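
DeepSeek-R1 distilled models usually emit their reasoning inside `<think>...</think>` tags before the final answer. The sketch below separates the two; `split_reasoning` is a hypothetical helper, and it assumes the reasoning arrives inline in the message text (some server configurations may return it in a separate field, in which case nothing needs stripping).

```python
import re

# Non-greedy match so several <think> blocks are handled independently.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from text that may contain <think> blocks."""
    # Some chat templates put the opening tag in the prompt, so the output
    # may contain only the closing tag; restore the opener in that case.
    if "<think>" not in text and "</think>" in text:
        text = "<think>" + text
    reasoning = "\n".join(part.strip() for part in _THINK_RE.findall(text))
    answer = _THINK_RE.sub("", text).strip()
    return reasoning, answer
```

With the cell above, this could be used as `reasoning, answer = split_reasoning(completion.choices[0].message.content or "")`.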

## Combining llama-index with llama.cpp

<div class="alert alert-block alert-warning">
This section builds a RAG pipeline using a local model.
Please note that the LLM used here does not go through the server that you started and queried in the cells above.
Instead, llama-index loads its own copy of the model in-process through its llama.cpp integration.
</div>

<div class="alert alert-block alert-success">
    If you would like to query the server that is running locally in Docker instead, you can use the OpenAI-compatible LLM integration provided by llama-index.
</div>
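
A rough sketch of that alternative is shown here. It assumes the separate `llama-index-llms-openai-like` package is installed (it is not in the install cell above) and that its `OpenAILike` class accepts the `model`, `api_base`, `api_key`, and `is_chat_model` arguments; `server_llm_kwargs` and `build_server_llm` are hypothetical helper names.

```python
def server_llm_kwargs(host="localhost", port=8000,
                      model="DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf"):
    """Build the settings that point an OpenAI-compatible client at the Docker server."""
    return {
        "model": model,
        "api_base": f"http://{host}:{port}/v1",
        "api_key": "test",       # placeholder, same as in the openai client cell
        "is_chat_model": True,   # use the /v1/chat/completions endpoint
    }


def build_server_llm(**overrides):
    """Create the LLM object; imported lazily because the package is optional."""
    # Assumption: pip install llama-index-llms-openai-like
    from llama_index.llms.openai_like import OpenAILike

    return OpenAILike(**{**server_llm_kwargs(), **overrides})
```

The resulting object could then be passed as `llm=` to `index.as_query_engine(...)` in place of the in-process model.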

In [None]:
from llama_index.llms.llama_cpp import LlamaCPP
from transformers import AutoTokenizer
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex

In [None]:
# load documents
documents = SimpleDirectoryReader("../datasets/paul_graham/").load_data()


embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

tokenizer = AutoTokenizer.from_pretrained("unsloth/DeepSeek-R1-Distill-Qwen-1.5B-GGUF",
                                          gguf_file="DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf")

In [None]:
def messages_to_prompt(messages):
    messages = [{"role": m.role.value, "content": m.content} for m in messages]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt


def completion_to_prompt(completion):
    messages = [{"role": "user", "content": completion}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    return prompt

In [None]:
llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url="",
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    model_path="./deepseek/DeepSeek-R1-Distill-Qwen-1.5B-Q2_K.gguf",
    temperature=0.1,
    max_new_tokens=256,
    # context window handed to llama.cpp; the prompt plus the retrieved chunks must fit in it
    context_window=16384,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # n_gpu_layers: -1 offloads all layers to the GPU; use 0 to run on CPU only
    model_kwargs={"n_gpu_layers": -1},
    # format inputs with the model's own chat template (see the functions above)
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True,
)

In [None]:
# create vector store index
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

# set up query engine
query_engine = index.as_query_engine(llm=llm)
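
To make the retrieval step less opaque, here is a toy sketch of what a vector index does at query time: embed the question, score every stored chunk by cosine similarity, and keep the best few. The tiny 2-D vectors stand in for the real `bge-small` embeddings, and `cosine` and `top_k` are illustrative names rather than llama-index functions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, chunk_vecs, k=2):
    """Return [(chunk_index, score), ...] for the k most similar chunks, best first."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]


# Three pretend chunk embeddings and one pretend query embedding.
chunks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = top_k([1.0, 0.0], chunks, k=2)
```

The text of the selected chunks is then placed into the prompt that the LLM answers from.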

In [None]:
response = query_engine.query("What did the author do growing up?")
response
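
It is often useful to check which passages the answer was grounded in. The sketch below assumes the response object exposes `source_nodes`, each with a `score` and a `node` whose `get_content()` returns the chunk text; `summarize_sources` is a hypothetical helper.

```python
def summarize_sources(response, max_chars=80):
    """Return [(score, snippet), ...] for the chunks retrieved for a response."""
    rows = []
    for hit in response.source_nodes:
        # Collapse whitespace so each snippet fits on one line.
        text = " ".join(hit.node.get_content().split())
        rows.append((hit.score, text[:max_chars]))
    return rows
```

For example, `for score, snippet in summarize_sources(response): print(score, snippet)` lists the retrieved chunks next to their similarity scores.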