<a href="https://colab.research.google.com/github/edcalderin/LLM_Tech/blob/master/llama_cpp_embeddings_llm_for_a_full_rag_stack.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Llama.cpp one man band! Embeddings + LLM for a full RAG stack

https://generativeai.pub/llama-cpp-one-man-band-embeddings-llm-for-a-full-rag-stack-435be8e05b2b

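The application pairs a small language model (`Qwen3-4B`, as a Q8_0-quantized GGUF file) with `all-MiniLM-L6-v2` embeddings from the `sentence-transformers` family. Both models are loaded through llama.cpp, so a single runtime covers the embedding and generation halves of the RAG stack.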

In [1]:
!pip install -q tiktoken PyMuPDF langchain-community llama-cpp-python faiss-cpu


In [3]:
import tiktoken
from langchain_text_splitters import TokenTextSplitter, CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import LlamaCppEmbeddings
from llama_cpp import Llama

In [13]:
!wget -nc https://huggingface.co/unsloth/Qwen3-4B-128K-GGUF/resolve/main/Qwen3-4B-128K-Q8_0.gguf
!wget -nc https://huggingface.co/leliuga/all-MiniLM-L6-v2-GGUF/resolve/main/all-MiniLM-L6-v2.F16.gguf
!wget -nc https://raw.githubusercontent.com/edcalderin/salesorder-sqlchatbot/refs/heads/master/README.md

File ‘Qwen3-4B-128K-Q8_0.gguf’ already there; not retrieving.


In [14]:
encoding = tiktoken.get_encoding("r50k_base")

embeddings = LlamaCppEmbeddings(model_path="/content/all-MiniLM-L6-v2.F16.gguf")

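# Sampling options (temperature, top_p, max_tokens, repeat_penalty, stop) are
# generation-time settings rather than model-loading settings, so they are
# passed to create_chat_completion() later instead of to the constructor.
model = Llama(
    model_path="/content/Qwen3-4B-128K-Q8_0.gguf",
    n_gpu_layers=0,  # CPU-only inference
    n_ctx=8192,      # context window, in tokens
    verbose=False,
)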

llama_model_loader: loaded meta data with 23 key-value pairs and 101 tensors from /content/all-MiniLM-L6-v2.F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = bert
llama_model_loader: - kv   1:                               general.name str              = all-MiniLM-L6-v2
llama_model_loader: - kv   2:                           bert.block_count u32              = 6
llama_model_loader: - kv   3:                        bert.context_length u32              = 512
llama_model_loader: - kv   4:                      bert.embedding_length u32              = 384
llama_model_loader: - kv   5:                   bert.feed_forward_length u32              = 1536
llama_model_loader: - kv   6:                  bert.attention.head_count u32              = 12
llama_model_loader: - kv   7:          bert.attention.layer_norm_epsilon f

In [15]:
# Load the README file

loader = TextLoader("/content/README.md")

# Create documents and split into chunks
documents = loader.load()
text_splitter = TokenTextSplitter(chunk_size=150, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
print(len(texts))

# Create vectorstore
vectorstore = FAISS.from_documents(texts, embeddings)

init: embeddings required but some input tokens were not marked as outputs -> overriding


6


init: embeddings required but some input tokens were not marked as outputs -> overriding
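The splitter above cuts the README into windows of 150 tokens with `chunk_overlap=0`, so a sentence that straddles a boundary is divided between two chunks. The sketch below illustrates the windowing idea and what a non-zero overlap changes. It is a simplified stand-in, not LangChain's implementation: `split_tokens` is a hypothetical helper, and plain integers play the role of token IDs.

```python
def split_tokens(tokens, chunk_size=150, chunk_overlap=0):
    """Slide a fixed-size window over a token list (illustrative helper)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunks.append(tokens[start:end])
        if end == len(tokens):  # the last window reached the end of the text
            break
        start += step
    return chunks


# Integers stand in for token IDs; a real pipeline would encode text first.
tokens = list(range(400))
no_overlap = split_tokens(tokens, chunk_size=150, chunk_overlap=0)
with_overlap = split_tokens(tokens, chunk_size=150, chunk_overlap=30)

print([len(chunk) for chunk in no_overlap])
print([len(chunk) for chunk in with_overlap])
```

With an overlap, the tail of each chunk is repeated at the head of the next one, which costs a few extra chunks but keeps boundary sentences retrievable as a whole.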


In [16]:
# Default search type is similarity search
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

# Switch to Maximal Marginal Relevance (this replaces the retriever above)
retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs={"k": 3})

print("", *retriever.invoke("What is the author email?"), sep=f"\n\n{'='*100}\n\n")

init: embeddings required but some input tokens were not marked as outputs -> overriding





page_content=' Enjoyed this content?
Explore more of my work on [Medium](https://medium.com/@erickcalderin) 

I regularly share insights, tutorials, and reflections on tech, AI, and more. Your feedback and thoughts are always welcome!
' metadata={'source': '/content/README.md'}


page_content=' ```bash
    python -m mysql_ingestion.initialize_data
    ```

4. Go to:

```bash
http://localhost:8501/
```

### Lint
Style the code with Ruff:

```bash
ruff format .
ruff check . --fix
```
### Remove the containers

```bash
docker-compose down
```

## Contact
**LinkedIn:** https://www.linkedin.com/in/erick-calderin-5bb6963b/  
**e-mail:** edcm.erick@gmail.com

##' metadata={'source': '/content/README.md'}


page_content=' prompt engineering, chaining logic, and interaction with the LLM.

* MySQL: Hosts the structured database, which includes employee records and related information.

* Streamlit: Provides a responsive and intuitive front-end for real-time query input and response display
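The retriever in this cell uses Maximal Marginal Relevance (MMR) rather than plain similarity search. As a rough illustration of the idea, the sketch below greedily picks documents that are relevant to the query while penalizing similarity to documents already picked. It is not the LangChain or FAISS implementation: `cosine`, `mmr_select`, and the 2-D toy vectors are assumptions made for the example, and `lambda_mult` simply borrows the name of the LangChain option that balances relevance against diversity.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=3, lambda_mult=0.5):
    """Return indices of k documents chosen greedily by an MMR score."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:

        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Similarity to the closest document that was already selected
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: documents 0 and 1 are near-duplicates, 2 and 3 point elsewhere.
query = [1.0, 0.2]
docs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8], [0.0, 1.0]]

print(mmr_select(query, docs, k=2, lambda_mult=1.0))  # relevance only
print(mmr_select(query, docs, k=2, lambda_mult=0.5))  # relevance vs. diversity
```

With `lambda_mult=1.0` the score reduces to plain similarity, so the two near-duplicates are returned together. Lowering it trades some relevance for variety, which can help when several chunks of the same README say almost the same thing.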

## Chain construction

In [37]:
from dataclasses import dataclass
from typing import Any

@dataclass(kw_only=True, frozen=True)
class QwenQnA:

    vectorstore: Any
    model: Any
    search_kwargs: dict


    def _get_messages(self, context: str, question: str)->list[dict]:
        return [
            {
                "role": "system",
                "content": "Answer the question based on the context."
            },
            {
                "role": "user",
                "content": f"Question: {question}\n\nContext: {context}"
            }
        ]


    def _get_context(self, question: str)->str:
        documents: list = self._get_retriever().invoke(question)
        return "\n\n".join([doc.page_content for doc in documents])

    def _get_retriever(self):
        return self.vectorstore.as_retriever(
            search_type="mmr",
            search_kwargs=self.search_kwargs
        )

    def invoke(self, question: str)->str:
        context: str = self._get_context(question)

        output = self.model.create_chat_completion(
            messages=self._get_messages(context, question),
            max_tokens=500,
            # Qwen uses the ChatML template, whose end-of-turn marker is <|im_end|>
            stop=["<|im_end|>"],
            temperature=0.1,
            repeat_penalty=1.4,
        )

        return output["choices"][0]["message"]["content"]
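Since a single CPU-only call to the real model takes minutes, it can help to check the retrieve, join, prompt, and generate wiring with stand-in objects first. The sketch below is one way to do that. Everything in it is hypothetical scaffolding: `StubDoc`, `StubRetriever`, `StubVectorstore`, `StubModel`, and the `answer` function (which mirrors the steps of `QwenQnA.invoke`) exist only for this example, and the stubs implement only the methods the chain calls.

```python
from dataclasses import dataclass


@dataclass
class StubDoc:
    page_content: str


class StubRetriever:
    def __init__(self, docs, k):
        self.docs, self.k = docs, k

    def invoke(self, question):
        # No real search: return the first k documents
        return self.docs[: self.k]


class StubVectorstore:
    def __init__(self, texts):
        self.docs = [StubDoc(text) for text in texts]

    def as_retriever(self, search_type="similarity", search_kwargs=None):
        k = (search_kwargs or {}).get("k", 4)
        return StubRetriever(self.docs, k)


class StubModel:
    def __init__(self):
        self.calls = []  # records every request for later inspection

    def create_chat_completion(self, messages, **kwargs):
        self.calls.append({"messages": messages, **kwargs})
        return {"choices": [{"message": {"role": "assistant", "content": "stub answer"}}]}


def answer(vectorstore, model, question, search_kwargs):
    """Same steps as the chain: retrieve, join the context, prompt, generate."""
    retriever = vectorstore.as_retriever(search_type="mmr", search_kwargs=search_kwargs)
    context = "\n\n".join(doc.page_content for doc in retriever.invoke(question))
    messages = [
        {"role": "system", "content": "Answer the question based on the context."},
        {"role": "user", "content": f"Question: {question}\n\nContext: {context}"},
    ]
    output = model.create_chat_completion(messages=messages, max_tokens=500, temperature=0.1)
    return output["choices"][0]["message"]["content"]


store = StubVectorstore(["Style the code with Ruff.", "ruff check . --fix", "docker-compose down"])
stub_model = StubModel()
reply = answer(store, stub_model, "What are the ruff rules?", {"k": 2})

print(reply)
print(stub_model.calls[0]["messages"][1]["content"])
```

Inspecting `stub_model.calls` shows exactly which prompt the model would receive, so prompt or retrieval changes can be checked in milliseconds before paying for a real generation.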

In [38]:
%%time

QwenQnA(
    vectorstore=vectorstore,
    model=model,
    search_kwargs={"k": 3}
).invoke("What are the ruff rules?")

init: embeddings required but some input tokens were not marked as outputs -> overriding


CPU times: user 6min 41s, sys: 586 ms, total: 6min 41s
Wall time: 5min 58s


'<think>\nOkay, the user is asking about "the ruff rules" based on the provided context. Let me look through the content again.\n\nThe context mentions a section titled "Lint" where it says to style code with Ruff using two commands: `ruff format .` and `ruff check --fix`. Then there\'s also mention of database schema injection and prompt templates with LangChain. But the question is specifically about Ruff rules. \n\nWait, in the context under the Lint section, they use Ruff for formatting and checking code. The commands given are to run Ruff format on the current directory and then check with fixes enabled. However, the user\'s query refers to "the ruff rules," which might be a misunderstanding or confusion between the command-line tools and some other concept.\n\nLooking back at the context, there isn\'t any explicit mention of specific rules that Ruff enforces. The text talks about using Ruff for code styling and checking with fixes, but it doesn\'t list out what those rules are. S
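The returned string starts with a `<think>` block: the model writes out its reasoning before the answer, and that reasoning counts against the `max_tokens=500` budget. A minimal post-processing sketch for separating the final answer is shown below. `strip_thinking` is a hypothetical helper, and the sample string is made up for the demonstration rather than taken from a model run.

```python
import re

# Matches a complete reasoning block, including trailing whitespace
THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", flags=re.DOTALL)


def strip_thinking(text: str) -> str:
    """Drop <think>...</think> blocks and return only the remaining answer."""
    cleaned = THINK_BLOCK.sub("", text)
    # If generation stopped inside the reasoning, there is no closing tag:
    # everything after the opening tag is unfinished reasoning, not an answer.
    if "<think>" in cleaned:
        cleaned = cleaned.split("<think>", 1)[0]
    return cleaned.strip()


# Made-up sample in the same shape as the output above
raw = (
    "<think>\nThe context lists two Ruff commands.\n</think>\n\n"
    "The README only shows `ruff format .` and `ruff check . --fix`."
)
print(strip_thinking(raw))
```

An empty result signals that the token budget ran out before the model finished reasoning, in which case raising `max_tokens` is one option. The Qwen3 model card also describes a `/no_think` switch that can be appended to the user message; whether it behaves as expected with this GGUF build was not checked here.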