# 📚 Chapter 11.2: Using llama.cpp for Applications

## Introduction

**llama.cpp** is the foundational C/C++ library that powers many local LLM applications. The `llama-cpp-python` package provides Python bindings for it, so you can run large language models (LLMs) locally from Python while keeping the speed of the native backend.

### Why Use llama-cpp-python?

| Feature | Benefit |
|---------|--------|
| **Performance** | Highly optimized C++ backend with CPU/GPU support |
| **OpenAI Compatible** | Returns responses in OpenAI API format |
| **Flexibility** | Fine-grained control over model parameters |
| **GGUF Format** | Native support for quantized models |
| **Multi-Modal** | Supports vision models (LLaVA, etc.) |
| **GPU Acceleration** | CUDA, Metal, ROCm, and Vulkan support |

### What We'll Cover

1. 🛠️ Installation and Setup
2. 🚀 Loading Models
3. 💬 Text Completion vs Chat Completion
4. 🎛️ Generation Parameters
5. 📊 Streaming Responses
6. 🧩 Text Embeddings
7. 🏗️ Building Practical Applications

---

## 1. 🛠️ Installation and Setup

Install `llama-cpp-python` with pip. The default build runs on the CPU only; for GPU acceleration, compile the package with the flag for your backend.

### Basic Installation (CPU only)

In [2]:
# Basic CPU installation (uncomment to run inside a notebook)
# !pip install llama-cpp-python -q

### Installation with GPU Support

For GPU acceleration, you need to compile with specific flags:

```bash
# NVIDIA CUDA
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# Apple Metal (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python --force-reinstall --no-cache-dir

# AMD ROCm
CMAKE_ARGS="-DGGML_HIPBLAS=on" pip install llama-cpp-python --force-reinstall --no-cache-dir
```

In [3]:
# Verify installation
import llama_cpp
from importlib.metadata import version

print(f"llama-cpp-python version: {version('llama-cpp-python')}")
print("✅ llama-cpp-python imported successfully!")

llama-cpp-python version: 0.3.16
✅ llama-cpp-python imported successfully!


## 2. 🚀 Loading Models

llama-cpp-python uses GGUF format models. You can load models from:
- Local file path
- Hugging Face Hub (automatic download)

### Available Models on Hugging Face

| Model | Size | Description |
|-------|------|-------------|
| `TheBloke/Mistral-7B-Instruct-v0.2-GGUF` | ~4GB | High-quality Mistral model |
| `TheBloke/Llama-2-7B-Chat-GGUF` | ~4GB | Meta's Llama 2 Chat |
| `microsoft/Phi-3-mini-4k-instruct-gguf` | ~2GB | Microsoft's efficient Phi-3 |
| `TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF` | ~600MB | Tiny but capable model |

In [12]:
from llama_cpp import Llama

# Download and load model from Hugging Face Hub
# This will download the model on first run (~4GB)
print("Loading model from Hugging Face Hub...")
print("This may take a few minutes on first run...")

llm = Llama.from_pretrained(
    repo_id="TheBloke/Mistral-7B-Instruct-v0.2-GGUF",
    filename="mistral-7b-instruct-v0.2.Q4_K_M.gguf",
    n_ctx=4096,  # Mistral supports 32k context, use what you need
    n_threads=4,  # Number of CPU threads
    verbose=False  # Set to True for debug info
)

print("✅ Model loaded successfully!")

Loading model from Hugging Face Hub...
This may take a few minutes on first run...


./mistral-7b-instruct-v0.2.Q4_K_M.gguf:   0%|          | 0.00/4.37G [00:00<?, ?B/s]

llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized


✅ Model loaded successfully!


### Loading from Local Path

If you have a GGUF model file locally:

In [13]:
# Example: Loading from local path (commented to avoid errors if file doesn't exist)
# llm = Llama(
#     model_path="./models/my-model.gguf",
#     n_ctx=2048,          # Context window
#     n_gpu_layers=-1,     # Offload all layers to the GPU (needs a GPU-enabled build)
#     n_threads=4,         # CPU threads for generation
#     seed=42,             # For reproducibility
#     verbose=False
# )

print("💡 Key initialization parameters:")
print("   - model_path: Path to GGUF model file")
print("   - n_ctx: Context window size (tokens)")
print("   - n_gpu_layers: Number of layers to offload to GPU (-1 for all)")
print("   - n_threads: CPU threads for computation")
print("   - seed: Random seed for reproducibility")

💡 Key initialization parameters:
   - model_path: Path to GGUF model file
   - n_ctx: Context window size (tokens)
   - n_gpu_layers: Number of layers to offload to GPU (-1 for all)
   - n_threads: CPU threads for computation
   - seed: Random seed for reproducibility


## 3. 💬 Text Completion vs Chat Completion

llama-cpp-python provides two main ways to generate text:

| Method | Description | Use Case |
|--------|-------------|----------|
| `__call__()` / `create_completion()` | Raw text completion | Text continuation, code completion |
| `create_chat_completion()` | Chat-style with messages | Chatbots, Q&A systems |

### 3.1 Text Completion

Text completion treats your input as a prompt to continue. The result is a dictionary in OpenAI-compatible format.
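To make that format concrete, here is a minimal sketch that pulls the generated text, stop reason, and token count out of a completion response. The `sample_response` dictionary is hand-written for illustration and shows only the fields used in this chapter; `summarize_completion` is a hypothetical helper, not part of llama-cpp-python.

```python
from typing import Any, Dict


def summarize_completion(response: Dict[str, Any]) -> Dict[str, Any]:
    """Extract commonly used fields from an OpenAI-style completion dict."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["text"].strip(),
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


# Hand-written stand-in for the dict returned by llm(prompt, ...)
sample_response = {
    "object": "text_completion",
    "choices": [
        {"index": 0, "text": " weight loss and better mood.", "finish_reason": "stop"}
    ],
    "usage": {"prompt_tokens": 6, "completion_tokens": 7, "total_tokens": 13},
}

summary = summarize_completion(sample_response)
print(summary)
```

With a real model, the same helper could be called on the value returned by `llm(...)`.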

In [14]:
# Text completion example
print("📝 Text Completion Demo\n")
print("=" * 50)

prompt = "The benefits of exercise include"

output = llm(
    prompt,
    max_tokens=100,
    stop=["\n\n"],  # Stop at double newline
    echo=False      # Don't include prompt in output
)

print(f"Prompt: {prompt}\n")
print(f"Completion: {output['choices'][0]['text']}")
print(f"\n📊 Token usage: {output['usage']}")

📝 Text Completion Demo

Prompt: The benefits of exercise include

Completion:  weight loss, improved cardiovascular health, better mood, and increased energy levels. However, many people struggle to find the motivation to get started or to stick with an exercise routine. Here are some tips to help you get moving and make exercise a regular part of your life:

📊 Token usage: {'prompt_tokens': 6, 'completion_tokens': 58, 'total_tokens': 64}


### 3.2 Chat Completion

Chat completion uses a list of messages with roles (system, user, assistant) for conversational AI.

In [15]:
# Chat completion example
print("💬 Chat Completion Demo\n")
print("=" * 50)

messages = [
    {
        "role": "system",
        "content": "You are a helpful cooking assistant. Give concise recipes."
    },
    {
        "role": "user",
        "content": "How do I make scrambled eggs?"
    }
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=200,
    temperature=0.7
)

print(f"User: {messages[1]['content']}\n")
print(f"Assistant: {response['choices'][0]['message']['content']}")

💬 Chat Completion Demo

User: How do I make scrambled eggs?

Assistant:  Making scrambled eggs is a simple and quick process. Here's a step-by-step guide:

Ingredients:
- 2 to 4 eggs
- Salt, to taste
- Pepper, to taste
- 2 tablespoons of butter or oil
- Optional: milk or cream, shredded cheese, herbs, or vegetables

Instructions:
1. Crack the eggs into a bowl. Add a pinch of salt and pepper to taste. If desired, you can also add milk or cream to make the eggs softer.
2. Whisk the eggs with a fork until the yolks and whites are fully combined. Set aside.
3. Heat a non-stick frying pan over medium-low heat. Add the butter or oil and let it melt.
4. Pour the beaten eggs into the pan. Let them cook undisturbed for a few seconds until the edges start


In [16]:
# Multi-turn conversation
print("🔄 Multi-Turn Conversation Demo\n")
print("=" * 50)

conversation = [
    {"role": "system", "content": "You are a helpful math tutor."},
    {"role": "user", "content": "What is the Pythagorean theorem?"},
]

# First response
response1 = llm.create_chat_completion(messages=conversation, max_tokens=150)
assistant_reply = response1['choices'][0]['message']['content']
print("User: What is the Pythagorean theorem?\n")
print(f"Assistant: {assistant_reply}\n")
print("-" * 50)

# Add assistant response to conversation history
conversation.append({"role": "assistant", "content": assistant_reply})
conversation.append({"role": "user", "content": "Can you give me an example with numbers?"})

# Follow-up
response2 = llm.create_chat_completion(messages=conversation, max_tokens=150)
print("User: Can you give me an example with numbers?\n")
print(f"Assistant: {response2['choices'][0]['message']['content']}")

🔄 Multi-Turn Conversation Demo

User: What is the Pythagorean theorem?

Assistant:  The Pythagorean theorem is a mathematical relationship between the sides of a right-angled triangle. It states that the square of the length of the hypotenuse (the side opposite the right angle) is equal to the sum of the squares of the lengths of the other two sides. In mathematical notation, if a and b are the lengths of the legs (the two shorter sides), and c is the length of the hypotenuse, then the theorem can be written as:

a² + b² = c²

This theorem has been known since ancient times and is named after the Greek mathematician Pythagoras, who is credited with its discovery. It is one of the

--------------------------------------------------
User: Can you give me an example with numbers?

Assistant:  Certainly! Let's consider a right-angled triangle with legs of length 3 and 4 units. We can use the Pythagorean theorem to find the length of the hypotenuse, c:

a² = 3² = 9
b² = 4² = 16

N

## 4. 🎛️ Generation Parameters

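Control the model's output with these key parameters. The defaults listed are those of text completion and can change between releases; `create_chat_completion()` uses different defaults for some of them (for example, a lower `temperature` and no `max_tokens` cap), so pass values explicitly when they matter.

| Parameter | Description | Default | Typical range |
|-----------|-------------|---------|---------------|
| `max_tokens` | Maximum tokens to generate | 16 | 1 to `n_ctx` |
| `temperature` | Randomness (higher = more varied output) | 0.8 | 0.0-2.0 |
| `top_k` | Sample only from the k most likely tokens | 40 | 1-100 |
| `top_p` | Nucleus sampling threshold | 0.95 | 0.0-1.0 |
| `repeat_penalty` | Penalize repeated tokens | 1.1 | 1.0-2.0 |
| `stop` | Stop sequences | None | list of strings |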
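To build intuition for `temperature`, `top_k`, and `top_p`, the following is a simplified sketch of what these settings do to a next-token distribution. It is a toy illustration under stated assumptions: the logits are made up, the function names are hypothetical, and llama.cpp's real sampler chain differs in ordering and detail.

```python
import math
from typing import Dict


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits divided by temperature (temperature must be > 0)."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_k_top_p(probs: Dict[str, float], top_k: int, top_p: float) -> Dict[str, float]:
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p mass."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}  # renormalize the survivors


# Made-up logits for four candidate tokens
toy_logits = {"coffee": 2.0, "tea": 1.0, "juice": 0.5, "soup": -1.0}
cold = apply_temperature(toy_logits, 0.1)   # sharply peaked
warm = apply_temperature(toy_logits, 1.2)   # flatter
filtered = top_k_top_p(warm, top_k=4, top_p=0.9)
print(round(cold["coffee"], 3), round(warm["coffee"], 3), sorted(filtered))
```

Lower temperatures concentrate probability on the top token, while `top_k` and `top_p` remove the unlikely tail before a token is drawn.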

In [17]:
# Temperature comparison
print("🌡️ Temperature Comparison Demo\n")
print("=" * 50)

prompt = "Write a creative tagline for a coffee shop:"

messages = [{"role": "user", "content": prompt}]

# Low temperature - more focused
response_low = llm.create_chat_completion(
    messages=messages, max_tokens=30, temperature=0.1
)
print("Low Temperature (0.1) - Focused:")
print(f"  {response_low['choices'][0]['message']['content']}\n")

# Medium temperature
response_med = llm.create_chat_completion(
    messages=messages, max_tokens=30, temperature=0.7
)
print("Medium Temperature (0.7) - Balanced:")
print(f"  {response_med['choices'][0]['message']['content']}\n")

# High temperature - more creative
response_high = llm.create_chat_completion(
    messages=messages, max_tokens=30, temperature=1.2
)
print("High Temperature (1.2) - Creative:")
print(f"  {response_high['choices'][0]['message']['content']}")

🌡️ Temperature Comparison Demo

Low Temperature (0.1) - Focused:
   "Sip. Savor. Connect: Unleash the Power of a Single Cup at Our Cozy Coffee Sanctuary" 



Medium Temperature (0.7) - Balanced:
   "Savor the Moment: Where Every Sip is a Story, Every Brew is a Masterpiece, and Every Visit Becomes a Treas

High Temperature (1.2) - Creative:
   "Wake Up, Sip In, and Unravel: Where Every Sip Tells a Story at Our Cozy Coffee Sanctuary


In [18]:
# Using stop sequences
print("🛑 Stop Sequences Demo\n")
print("=" * 50)

# Generate a list and stop when numbering ends
output = llm(
    "List the first 3 planets:\n1.",
    max_tokens=100,
    stop=["4.", "\n\n"],  # Stop before 4th item or double newline
    echo=True
)

print(output['choices'][0]['text'])

🛑 Stop Sequences Demo

List the first 3 planets:
1. Mercury
2. Venus
3. Earth
These are the first three planets from the Sun in our solar system. Mercury is the closest planet to the Sun, and Venus is the second closest. Earth is the third planet from the Sun.
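In this run neither stop sequence appears in the generated text, so the model continued past the third item until it finished on its own. Conceptually, a stop sequence cuts the text at its first occurrence and leaves the sequence itself out. The sketch below mimics that after the fact with a hypothetical `truncate_at_stop` helper and made-up strings; the library applies stop sequences during generation, not as a post-processing step.

```python
from typing import List


def truncate_at_stop(text: str, stop: List[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence (sequence excluded)."""
    cut = len(text)
    for sequence in stop:
        index = text.find(sequence)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


# Made-up generation in which the model does start a fourth item
generated = " Mercury\n2. Venus\n3. Earth\n4. Mars\n\nThat is all."
print(repr(truncate_at_stop(generated, ["4.", "\n\n"])))

# Made-up generation with no stop sequence in it: nothing is removed
no_match = " Mercury\n2. Venus\n3. Earth\nThese are the first three planets."
print(truncate_at_stop(no_match, ["4.", "\n\n"]) == no_match)
```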


## 5. 📊 Streaming Responses

Streaming allows you to receive tokens as they are generated, providing a better user experience for longer responses.

In [19]:
# Streaming text completion
print("🌊 Streaming Text Completion Demo\n")
print("=" * 50)
print("Generating: ", end="")

stream = llm(
    "Explain quantum computing in one paragraph:",
    max_tokens=150,
    stream=True
)

for chunk in stream:
    text = chunk['choices'][0]['text']
    print(text, end='', flush=True)

print("\n\n✅ Generation complete!")

🌊 Streaming Text Completion Demo

Generating: 

Quantum computing is a type of computing that uses quantum bits, or qubits, instead of classical bits to process information. Qubits can exist in a superposition of states, meaning they can represent multiple values at once, allowing for exponentially greater computational power compared to classical computers. Quantum algorithms, such as Shor's algorithm for factorization and Grover's algorithm for searching unsorted databases, can solve problems that are intractable for classical computers. However, quantum computing is still in its infancy, and building and maintaining stable qubits is a significant challenge. Nonetheless, the potential applications for quantum computers in fields such as cryptography, optimization, and machine learning make it a promising area

✅ Generation complete!


In [20]:
# Streaming chat completion
print("🌊 Streaming Chat Completion Demo\n")
print("=" * 50)
print("Assistant: ", end="")

messages = [
    {"role": "user", "content": "Tell me a short joke about programming."}
]

stream = llm.create_chat_completion(
    messages=messages,
    max_tokens=100,
    stream=True
)

for chunk in stream:
    delta = chunk['choices'][0]['delta']
    if 'content' in delta:
        print(delta['content'], end='', flush=True)

print("\n\n✅ Generation complete!")

🌊 Streaming Chat Completion Demo

A:  Why did the Java developer quit his job? Because he didn't feel appreciated. But seriously, why did he quit? Because there was no 'app-reciation'! 🤪 #Java #ProgrammingJokes #Appreciation #AppreciationDay

✅ Generation complete!
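When you also need the complete reply after streaming (for example, to store it in a conversation history), the chunks can be accumulated as they arrive. The sketch below assumes the chunk shape used in the cell above, where text arrives under `delta['content']`; `fake_stream` is a hand-written stand-in for a real stream and `collect_stream` is a hypothetical helper.

```python
from typing import Any, Dict, Iterable, Iterator


def collect_stream(chunks: Iterable[Dict[str, Any]]) -> str:
    """Join the 'content' pieces of streamed chat chunks into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        piece = delta.get("content")
        if piece:  # skips role-only chunks and the empty final chunk
            parts.append(piece)
    return "".join(parts)


def fake_stream() -> Iterator[Dict[str, Any]]:
    """Stand-in for llm.create_chat_completion(..., stream=True)."""
    yield {"choices": [{"delta": {"role": "assistant"}, "finish_reason": None}]}
    for piece in ["Why ", "do programmers ", "prefer dark mode?"]:
        yield {"choices": [{"delta": {"content": piece}, "finish_reason": None}]}
    yield {"choices": [{"delta": {}, "finish_reason": "stop"}]}


full_text = collect_stream(fake_stream())
print(full_text)
```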


In [21]:
# Early stopping with streaming
print("🛑 Early Stopping with Streaming Demo\n")
print("=" * 50)

collected_text = ""
max_chars = 200

print(f"Generating (max {max_chars} chars): ", end="")

stream = llm(
    "Write a story about a robot:",
    max_tokens=300,
    stream=True
)

for chunk in stream:
    text = chunk['choices'][0]['text']
    collected_text += text
    print(text, end='', flush=True)
    
    if len(collected_text) >= max_chars:
        print("... [STOPPED]")
        break

print(f"\n\n📊 Total characters collected: {len(collected_text)}")

🛑 Early Stopping with Streaming Demo

Generating (max 200 chars):  a robot who, upon completing a task, gains sentience and realizes he is alive.

Once upon a time in the not-so-distant future, there was a research facility nestled deep in the heart of the Siberian wild... [STOPPED]


📊 Total characters collected: 204


## 6. 🧩 Text Embeddings

llama-cpp-python can also generate text embeddings for semantic search and RAG applications.

> **Note:** You must initialize the model with `embedding=True` to use embedding features.

In [26]:
# Use a small, purpose-built embedding model
embed_llm = Llama.from_pretrained(
    repo_id="nomic-ai/nomic-embed-text-v1.5-GGUF",
    filename="nomic-embed-text-v1.5.Q4_K_M.gguf",
    embedding=True,
    n_ctx=2048,
    verbose=False
)
# Only ~84 MB to download, and it produces 768-dimensional embeddings

./nomic-embed-text-v1.5.Q4_K_M.gguf:   0%|          | 0.00/84.1M [00:00<?, ?B/s]

In [27]:
import numpy as np
from numpy.linalg import norm

def cosine_similarity(vec1, vec2):
    """Compute cosine similarity between two vectors."""
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    return np.dot(vec1, vec2) / (norm(vec1) * norm(vec2))

# Generate embeddings
print("📊 Text Embedding Demo\n")
print("=" * 50)

texts = [
    "Machine learning is a subset of artificial intelligence.",
    "AI and ML are transforming technology industries.",
    "The weather is sunny today."
]

embeddings = []
for text in texts:
    result = embed_llm.create_embedding(text)
    embeddings.append(result['data'][0]['embedding'])

print(f"Number of texts: {len(texts)}")
print(f"Embedding dimension: {len(embeddings[0])}")
print(f"\nFirst 5 values: {embeddings[0][:5]}")

init: embeddings required but some input tokens were not marked as outputs -> overriding
init: embeddings required but some input tokens were not marked as outputs -> overriding
init: embeddings required but some input tokens were not marked as outputs -> overriding


📊 Text Embedding Demo

Number of texts: 3
Embedding dimension: 768

First 5 values: [0.4224206805229187, 1.4838993549346924, -3.341722011566162, -0.27156567573547363, 0.9094995260238647]


In [28]:
# Semantic similarity
print("🔍 Semantic Similarity Demo\n")
print("=" * 50)

sim_1_2 = cosine_similarity(embeddings[0], embeddings[1])
sim_1_3 = cosine_similarity(embeddings[0], embeddings[2])
sim_2_3 = cosine_similarity(embeddings[1], embeddings[2])

print("Text pairs and their similarities:\n")
print(f"1. '{texts[0][:40]}...'")
print(f"2. '{texts[1][:40]}...'")
print(f"3. '{texts[2][:40]}...'\n")

print(f"Similarity (1, 2): {sim_1_2:.4f} {'✅ Related!' if sim_1_2 > 0.5 else ''}")
print(f"Similarity (1, 3): {sim_1_3:.4f} {'❌ Different' if sim_1_3 < 0.3 else ''}")
print(f"Similarity (2, 3): {sim_2_3:.4f} {'❌ Different' if sim_2_3 < 0.3 else ''}")

🔍 Semantic Similarity Demo

Text pairs and their similarities:

1. 'Machine learning is a subset of artifici...'
2. 'AI and ML are transforming technology in...'
3. 'The weather is sunny today....'

Similarity (1, 2): 0.7631 ✅ Related!
Similarity (1, 3): 0.4964 
Similarity (2, 3): 0.5067 
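Notice that the two unrelated pairs score around 0.5, so the fixed 0.3 cut-off labels neither of them. Absolute thresholds tend to depend on the embedding model, so comparing candidates relative to one another is often more robust. The sketch below ranks candidates by cosine similarity in plain Python; the 3-dimensional vectors are toy values standing in for real embeddings, and `rank_by_similarity` is a hypothetical helper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: Sequence[float], candidates: Sequence[Sequence[float]]
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine(query, vec)) for i, vec in enumerate(candidates)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors, not real embeddings
query_vec = [1.0, 0.2, 0.0]
candidate_vecs = [[0.9, 0.3, 0.1], [0.1, 0.1, 1.0], [0.5, 0.5, 0.5]]

ranking = rank_by_similarity(query_vec, candidate_vecs)
print([index for index, _ in ranking])
```

The best match is then simply `ranking[0]`, whatever the absolute scale of the scores.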


## 7. 🏗️ Building Practical Applications

Let's build some practical applications using llama-cpp-python!

### Application 1: Conversational Assistant

In [29]:
class ConversationalAssistant:
    """
    A conversational AI assistant with memory.
    """
    
    def __init__(self, llm, system_prompt="You are a helpful assistant."):
        self.llm = llm
        self.system_prompt = system_prompt
        self.conversation_history = [
            {"role": "system", "content": system_prompt}
        ]
    
    def chat(self, user_message, max_tokens=200, stream=False):
        """
        Send a message and get a response.
        
        Args:
            user_message: The user's message
            max_tokens: Maximum response length
            stream: Whether to stream the response
            
        Returns:
            The assistant's response
        """
        # Add user message to history
        self.conversation_history.append({
            "role": "user",
            "content": user_message
        })
        
        if stream:
            # Streaming response
            response_stream = self.llm.create_chat_completion(
                messages=self.conversation_history,
                max_tokens=max_tokens,
                stream=True
            )
            
            full_response = ""
            for chunk in response_stream:
                delta = chunk['choices'][0]['delta']
                if 'content' in delta:
                    full_response += delta['content']
                    print(delta['content'], end='', flush=True)
            print()  # Newline after streaming
            
            response = full_response
        else:
            # Non-streaming response
            result = self.llm.create_chat_completion(
                messages=self.conversation_history,
                max_tokens=max_tokens
            )
            response = result['choices'][0]['message']['content']
        
        # Add assistant response to history
        self.conversation_history.append({
            "role": "assistant",
            "content": response
        })
        
        return response
    
    def clear_history(self):
        """Reset conversation history."""
        self.conversation_history = [
            {"role": "system", "content": self.system_prompt}
        ]
    
    def get_history(self):
        """Get conversation history."""
        return self.conversation_history

# Demo
print("🤖 Conversational Assistant Demo\n")
print("=" * 50)

assistant = ConversationalAssistant(
    llm,
    system_prompt="You are a helpful coding tutor. Give concise explanations."
)

# First message
print("User: What is a Python list?\n")
print("Assistant: ", end="")
response1 = assistant.chat("What is a Python list?", stream=True)

print("\n" + "-" * 50)

# Follow-up (model remembers context)
print("\nUser: How do I add items to it?\n")
print("Assistant: ", end="")
response2 = assistant.chat("How do I add items to it?", stream=True)

🤖 Conversational Assistant Demo

User: What is a Python list?

Assistant:  A Python list is a collection of ordered and mutable items. In other words, a list is a data structure that can hold a sequence of values of any data type, including integers, floats, strings, lists, tuples, and dictionaries. The order of the items in a list is important, and the items can be changed or modified after the list has been created.

Here's an example of creating a list in Python:

```python
numbers = [1, 2, 3, 4, 5]
colors = ['red', 'green', 'blue']
mixed = ['apple', 2, 'banana', 4.5, 'orange']
```

You can access the items in a list by their index. The first item has index 0, the second item has index 1, and so on. For example, to access the third item in the

--------------------------------------------------

User: How do I add items to it?

Assistant:  To add an item to a list in Python, you can use the `append()` method or the `extend()` method, depending on whether you want to add a single 
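Because the assistant appends every turn to its history, a long conversation will eventually exceed the model's context window (`n_ctx`). One possible mitigation is to trim old turns before each request. The sketch below is a minimal illustration under stated assumptions: `trim_history` is a hypothetical helper, and it uses a character budget as a crude proxy for tokens; counting tokens with the model's tokenizer would be more accurate.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(history: List[Message], max_chars: int) -> List[Message]:
    """Keep the system message plus the newest messages that fit the budget."""
    system = [m for m in history[:1] if m["role"] == "system"]
    rest = history[len(system):]
    used = sum(len(m["content"]) for m in system)
    kept: List[Message] = []
    for message in reversed(rest):  # walk from newest to oldest
        cost = len(message["content"])
        if kept and used + cost > max_chars:
            break  # budget exhausted (the newest message is always kept)
        kept.append(message)
        used += cost
    kept.reverse()
    # Drop a leading assistant turn so the kept history starts with the user
    while kept and kept[0]["role"] != "user":
        kept.pop(0)
    return system + kept


history = [
    {"role": "system", "content": "You are a helpful coding tutor."},
    {"role": "user", "content": "What is a Python list?"},
    {"role": "assistant", "content": "A list is an ordered, mutable collection."},
    {"role": "user", "content": "How do I add items to it?"},
]

trimmed = trim_history(history, max_chars=100)
print([m["role"] for m in trimmed])
```

In the `ConversationalAssistant` class, such a helper could be applied to `self.conversation_history` just before calling `create_chat_completion()`.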

### Application 2: Simple RAG System

In [30]:
class SimpleRAG:
    """
    A simple Retrieval-Augmented Generation system.
    """
    
    def __init__(self, llm, embed_llm):
        self.llm = llm
        self.embed_llm = embed_llm
        self.documents = []
        self.embeddings = []
    
    def add_document(self, text, title="Untitled"):
        """Add a document to the knowledge base."""
        result = self.embed_llm.create_embedding(text)
        embedding = result['data'][0]['embedding']
        
        self.documents.append({"title": title, "text": text})
        self.embeddings.append(embedding)
        print(f"✅ Added: '{title}'")
    
    def retrieve(self, query, top_k=2):
        """Retrieve most relevant documents for a query."""
        query_result = self.embed_llm.create_embedding(query)
        query_embedding = query_result['data'][0]['embedding']
        
        similarities = [
            cosine_similarity(query_embedding, emb)
            for emb in self.embeddings
        ]
        
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        
        return [
            {**self.documents[i], "score": similarities[i]}
            for i in top_indices
        ]
    
    def answer(self, question, max_tokens=200):
        """Answer a question using retrieved documents."""
        # Retrieve relevant documents
        relevant = self.retrieve(question)
        
        # Build context
        context = "\n\n".join(
            f"[{doc['title']}]: {doc['text']}"
            for doc in relevant
        )
        
        # Create prompt
        messages = [
            {
                "role": "system",
                "content": "Answer questions based only on the provided context. Be concise."
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
        
        response = self.llm.create_chat_completion(
            messages=messages,
            max_tokens=max_tokens
        )
        
        return response['choices'][0]['message']['content'], relevant

# Demo
print("📚 Simple RAG Demo\n")
print("=" * 50)

rag = SimpleRAG(llm, embed_llm)

# Add documents
rag.add_document(
    "Python was created by Guido van Rossum and first released in 1991. "
    "It emphasizes code readability and simplicity.",
    title="Python History"
)

rag.add_document(
    "JavaScript was created by Brendan Eich in 1995 for Netscape Navigator. "
    "It is the primary language for web browsers.",
    title="JavaScript History"
)

rag.add_document(
    "Machine learning is a subset of AI that enables computers to learn "
    "from data without being explicitly programmed.",
    title="ML Definition"
)

print("\n" + "=" * 50)

# Ask question
question = "Who created Python and when?"
print(f"\nQ: {question}\n")

answer, sources = rag.answer(question)
print(f"A: {answer}\n")
print(f"📖 Sources: {[s['title'] for s in sources]}")

init: embeddings required but some input tokens were not marked as outputs -> overriding


📚 Simple RAG Demo



init: embeddings required but some input tokens were not marked as outputs -> overriding
init: embeddings required but some input tokens were not marked as outputs -> overriding
init: embeddings required but some input tokens were not marked as outputs -> overriding


✅ Added: 'Python History'
✅ Added: 'JavaScript History'
✅ Added: 'ML Definition'


Q: Who created Python and when?

A:  Python was created by Guido van Rossum and first released in 1991.

📖 Sources: ['Python History', 'JavaScript History']


### Application 3: Code Explainer

In [31]:
class CodeExplainer:
    """
    A tool to explain code snippets.
    """
    
    def __init__(self, llm):
        self.llm = llm
    
    def explain(self, code, language="python", detail_level="simple"):
        """
        Explain a code snippet.
        
        Args:
            code: The code to explain
            language: Programming language
            detail_level: 'simple', 'detailed', or 'line-by-line'
            
        Returns:
            Explanation of the code
        """
        instructions = {
            "simple": "Provide a brief, high-level explanation.",
            "detailed": "Provide a comprehensive explanation covering all aspects.",
            "line-by-line": "Explain each line of the code."
        }
        
        instruction = instructions.get(detail_level, instructions["simple"])
        
        messages = [
            {
                "role": "system",
                "content": f"You are a {language} expert. {instruction}"
            },
            {
                "role": "user",
                "content": f"Explain this {language} code:\n\n```{language}\n{code}\n```"
            }
        ]
        
        response = self.llm.create_chat_completion(
            messages=messages,
            max_tokens=300
        )
        
        return response['choices'][0]['message']['content']

# Demo
print("💻 Code Explainer Demo\n")
print("=" * 50)

explainer = CodeExplainer(llm)

code = '''def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)'''

print(f"Code:\n{code}\n")
print("=" * 50)
print("\nExplanation:")
explanation = explainer.explain(code, detail_level="simple")
print(explanation)

💻 Code Explainer Demo

Code:
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)


Explanation:
 This Python code defines a function named `fibonacci` that takes an integer `n` as an argument. The function is designed to calculate the `n`th number in the Fibonacci sequence.

The Fibonacci sequence is a series of numbers in which each number is the sum of the two preceding ones, starting from 0 and 1. So the sequence goes: 0, 1, 1, 2, 3, 5, 8, 13, 21, and so on.

The function starts with an `if` statement that checks if the input `n` is less than or equal to 1. If it is, the function simply returns the value of `n` because the first two numbers in the Fibonacci sequence are 0 and 1.

If the input `n` is greater than 1, the function calls itself recursively twice: once with the argument `n-1` and once with the argument `n-2`. The function then returns the sum of the results of these two recursive calls. This is how the function calculates the `

## 8. 💡 Best Practices and Tips

### 🧠 Memory Management

| Practice | Description |
|----------|-------------|
| Load once, reuse | Load models once and reuse for multiple queries |
| Clean up | Use `del model` to free memory when done |
| Quantization | Choose appropriate level (`Q4_K_M` is a good balance) |
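To see why the quantization level matters for memory, a back-of-the-envelope estimate multiplies the parameter count by the bits stored per weight. The figures below are assumptions for illustration (roughly 7.24 billion parameters for a 7B model and about 4.8 bits per weight for `Q4_K_M`), and the hypothetical `approx_model_size_gb` helper ignores metadata and runtime overhead such as the KV cache.

```python
def approx_model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough weight size in GB (10**9 bytes), ignoring metadata and runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e9


# Assumed figures, for illustration only
params_7b = 7.24e9
print(f"Q4_K_M (~4.8 bits/weight): {approx_model_size_gb(params_7b, 4.8):.2f} GB")
print(f"16-bit weights:            {approx_model_size_gb(params_7b, 16):.2f} GB")
```

The quantized estimate lands close to the size of the file downloaded earlier in this chapter, while 16-bit weights would need more than three times as much space.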

### ⚡ Performance Optimization

| Setting | Recommendation |
|---------|---------------|
| `n_gpu_layers` | Set to `-1` to offload all layers to the GPU (needs a GPU-enabled build and enough VRAM) |
| `n_threads` | Match to your physical CPU core count |
| `n_ctx` | Smaller values use less memory but limit how much text fits in the context |
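As a starting point for `n_threads`, the core count can be read from the standard library. Note that `os.cpu_count()` reports logical cores, which on CPUs with simultaneous multithreading is often twice the physical count; the halving below is a common heuristic and an assumption, not a rule from llama-cpp-python. `suggest_n_threads` is a hypothetical helper.

```python
import os


def suggest_n_threads(assume_smt: bool = True) -> int:
    """Suggest a thread count from the logical CPU count, never below 1."""
    logical = os.cpu_count() or 1  # cpu_count() can return None
    # With SMT/hyper-threading, physical cores are often half the logical count
    guess = logical // 2 if assume_smt else logical
    return max(1, guess)


print(f"Suggested n_threads: {suggest_n_threads()}")
```

The result could be passed as `n_threads=suggest_n_threads()` when constructing `Llama`, then tuned by measuring actual generation speed.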

### 🎯 Output Quality

| Task Type | Temperature | Use Case |
|-----------|-------------|----------|
| Factual | 0.1 - 0.4 | Q&A, coding, data extraction |
| Balanced | 0.5 - 0.7 | General conversation |
| Creative | 0.8 - 1.2 | Storytelling, brainstorming |

> **Tip:** Always use appropriate `stop` sequences to prevent unwanted generation.

### 🔧 Debugging Tips

- Set `verbose=True` to see model info and selected chat format
- Check `finish_reason` in responses (`stop`, `length`, etc.)
- Monitor `usage` field for token counts
- Use `echo=True` in completions to include the prompt in the returned text
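Several demo replies in this chapter end mid-sentence, which is the typical sign of hitting `max_tokens`. A small check on `finish_reason` makes that explicit. The sketch below uses hand-written response dictionaries and a hypothetical `describe_finish` helper; it assumes the `finish_reason` values `stop` and `length` mentioned in the tips above.

```python
from typing import Any, Dict


def describe_finish(response: Dict[str, Any]) -> str:
    """Turn the first choice's finish_reason into a short diagnostic message."""
    reason = response["choices"][0].get("finish_reason")
    if reason == "length":
        return "truncated: hit the max_tokens limit; consider raising max_tokens"
    if reason == "stop":
        return "complete: the model finished or reached a stop sequence"
    return f"other: finish_reason={reason!r}"


# Hand-written stand-ins for real responses
cut_off = {"choices": [{"message": {"content": "Let them cook until"}, "finish_reason": "length"}]}
finished = {"choices": [{"message": {"content": "Done."}, "finish_reason": "stop"}]}

print(describe_finish(cut_off))
print(describe_finish(finished))
```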

In [32]:
# Cleanup
print("🧹 Cleaning up...")
del llm
del embed_llm
print("✅ Models unloaded!")

🧹 Cleaning up...
✅ Models unloaded!


## 📋 Summary

### Key Concepts

1. **llama-cpp-python** provides Python bindings for the efficient llama.cpp C++ library
2. **GGUF format** enables quantized models for efficient local inference
3. **Text completion** vs **Chat completion** serve different use cases
4. **Streaming** provides real-time token generation
5. **Embeddings** enable semantic search and RAG applications

### Applications Built

- 🤖 Conversational Assistant with memory
- 📚 Simple RAG System with document retrieval
- 💻 Code Explainer with customizable detail levels

### Next Steps

- Try larger models for better quality
- Enable GPU acceleration for faster inference
- Explore function calling for tool use
- Build an OpenAI-compatible API server

---

## 📚 Resources

- [llama-cpp-python Documentation](https://llama-cpp-python.readthedocs.io/)
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [Hugging Face GGUF Models](https://huggingface.co/models?library=gguf)
- [GGUF Format Specification](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)