# Intro to AI Workflows Using Small Language Models from Hugging Face
## Using llama-cpp-python Package

This notebook is adapted from the [GPT4All_SmallLM_Demo.ipynb](GPT4All_SmallLM_Demo.ipynb) notebook and demonstrates how to work with small language models using the **llama-cpp-python** package instead of GPT4All.

### What is llama-cpp-python?

[llama-cpp-python](https://github.com/abetlen/llama-cpp-python) provides Python bindings for the [llama.cpp](https://github.com/ggml-org/llama.cpp) library, which is a high-performance C++ implementation for running Large Language Models (LLMs) locally. It's one of the most popular libraries for running GGUF models on consumer hardware.

### Key Features of llama-cpp-python:

1. **Efficient Local Inference**: Runs models entirely on CPU (with optional GPU acceleration)
2. **GGUF Model Support**: Works with quantized models in the GGUF format from Hugging Face
3. **Low Memory Footprint**: Supports 4-bit, 8-bit, and other quantization levels
4. **OpenAI-Compatible API**: Includes a built-in server that mimics OpenAI's API
5. **Active Development**: Regular updates to support new model architectures
6. **Cross-Platform**: Works on Windows, macOS, and Linux

### Why use llama-cpp-python over GPT4All?

| Feature | llama-cpp-python | GPT4All |
|---------|-----------------|--------|
| Model Support | Any GGUF model from Hugging Face | Curated list of ~30 models |
| Updates | Very frequent (follows llama.cpp) | Less frequent |
| API Style | Direct Python + OpenAI-compatible server | Custom Python API |
| Flexibility | More control over inference parameters | Simpler, more abstracted |
| Community | Large, active community | Smaller, focused community |

**Fun Fact**: GPT4All actually uses llama.cpp as its backend! So this notebook gives you more direct access to the underlying inference engine.

### Resources

- llama-cpp-python [GitHub Repository](https://github.com/abetlen/llama-cpp-python)
- llama-cpp-python [Documentation](https://llama-cpp-python.readthedocs.io/)
- llama.cpp [GitHub Repository](https://github.com/ggml-org/llama.cpp)
- Hugging Face [GGUF Models](https://huggingface.co/models?search=gguf)

### Attribution

Notebook originally developed based on work by Greg Merritt <[gmerritt@berkeley.edu](mailto:gmerritt@berkeley.edu)> and adapted by Eric Van Dusen. This llama-cpp-python version was created as an alternative implementation.

## 1. Environment setup

### Installing llama-cpp-python

Installing llama-cpp-python is usually a single `pip` command, but the right approach depends on your hardware:

1. **CPU-only installation** (what we'll use): `pip install llama-cpp-python`
2. **GPU acceleration (CUDA)**: Requires building from source with CUDA support
3. **Metal acceleration (macOS)**: Automatic on Apple Silicon

**Note**: The first time you run inference, the model needs to be loaded into memory. This may take a moment depending on the model size and your hardware.

### Steps:
1. Ensure that your Python environment has llama-cpp-python installed
2. Define the model path where your `.gguf` model files are stored
3. Load a model using the `Llama` class

_This notebook assumes that at least one 'Small model' file ending in `.gguf` has already been downloaded into a directory (see `GPT4All_Download_gguf.ipynb` for more)._

In [2]:
# Ensure that your python environment has llama-cpp-python capability
# Note: This uses the installation pattern specified for this notebook
try:
    from llama_cpp import Llama
except ImportError:
    # Install on first use, then retry the import
    %pip install llama-cpp-python
    from llama_cpp import Llama

### Understanding the Llama Class

The `Llama` class is the main interface for loading and interacting with GGUF models. Key parameters include:

| Parameter | Description | Default |
|-----------|-------------|--------|
| `model_path` | Full path to the .gguf model file | Required |
| `n_ctx` | Context window size (max tokens model can see) | 512 |
| `n_threads` | Number of CPU threads to use | Auto |
| `n_gpu_layers` | Layers to offload to GPU (0 = CPU only) | 0 |
| `verbose` | Print loading information | True |
| `chat_format` | Chat template format (e.g., "chatml", "llama-2") | Auto-detect |

### Let's check out our local filesystem path and whether we have files downloaded

We need to locate where our `.gguf` model files are stored. Below are examples for different environments.

### Approach 1 - if a Shared Hub is being used 

In [None]:
# This only worked for FA 25 workshop on Cal ICOR Hub
#!ls /home/jovyan/shared_readwrite

In [None]:
# On Cal-ICOR workshop hub (JupyterCon Nov 2025)
!ls /home/jovyan/shared/

### Approach 2 - if a local machine is being used

In [None]:
# This is my local path to a directory called shared-rw
!ls shared-rw

In [None]:
# Or the full path (this is on my laptop)
!ls /Users/ericvandusen/Documents/GitHub/shared/

### 1.1 Pick your environment - Local vs Hub - and set the Path

In [None]:
# set the model path parameter depending on where you are computing
model_directory = "/home/jovyan/shared/"

In [None]:
# set the model path parameter depending on where you are computing
#model_directory = "/Users/ericvandusen/Documents/GitHub/shared/"
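
Before loading a model, it can help to confirm that the chosen directory actually contains a `.gguf` file. The sketch below uses only the standard library; the helper name `list_gguf_files` is hypothetical and not part of llama-cpp-python.

```python
from pathlib import Path


def list_gguf_files(directory):
    """Return the sorted names of .gguf files directly inside `directory`."""
    folder = Path(directory).expanduser()
    if not folder.is_dir():
        return []  # Missing directory: report no models rather than raising
    return sorted(p.name for p in folder.glob("*.gguf"))


# Example usage; replace the path with your own model directory.
print(list_gguf_files("/home/jovyan/shared/"))
```

An empty list means either the path is wrong or no model has been downloaded yet.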

### 1.2 Loading the Downloaded Model with llama-cpp-python

In this step, we create a local instance of the model using the `Llama` class.  

**Key differences from GPT4All:**
- We provide the **full path** to the model file (not just the filename)
- We can specify `n_ctx` (context window size) directly
- We have fine-grained control over threading with `n_threads`
- The `chat_format` parameter helps the model understand conversation structure

**About the model:**
`qwen2-1_5b-instruct-q4_0.gguf` is a **1.5 billion-parameter Qwen2 model** that has been **quantized** to 4-bit weights (the `q4_0` in the filename) to reduce its size and memory usage. The `.gguf` extension indicates that the model is stored in the **GGUF format**, the standard format for llama.cpp inference.

**Note:**  
Loading the model may take a few seconds. You'll see verbose output showing the model configuration being loaded.

In [5]:
import os

# Define the model filename
model_name = "qwen2-1_5b-instruct-q4_0.gguf"

# Create the full path to the model
model_path = os.path.join(model_directory, model_name)

# Load the model using llama-cpp-python
# n_ctx: context window size (how many tokens the model can "see" at once)
# n_threads: number of CPU threads (None = auto-detect)
# verbose: whether to print loading information
# chat_format: the chat template format for this model family
model = Llama(
    model_path=model_path,
    n_ctx=2048,          # Context window size
    n_threads=None,      # Auto-detect optimal thread count
    verbose=True,        # Print model loading info
    chat_format="chatml" # Qwen uses ChatML format
)

print(f"\n✓ Model loaded successfully: {model_name}")

llama_model_load_from_file_impl: using device Metal (Apple M4) - 12124 MiB free
llama_model_loader: loaded meta data with 26 key-value pairs and 338 tensors from /Users/ericvandusen/Documents/GitHub/shared/qwen2-1_5b-instruct-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = qwen2-1_5b-instruct
llama_model_loader: - kv   2:                          qwen2.block_count u32              = 28
llama_model_loader: - kv   3:                       qwen2.context_length u32              = 32768
llama_model_loader: - kv   4:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv   5:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv   6:             


✓ Model loaded successfully: qwen2-1_5b-instruct-q4_0.gguf


Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | MATMUL_INT8 = 1 | DOTPROD = 1 | SME = 1 | ACCELERATE = 1 | REPACK = 1 | 
Model metadata: {'quantize.imatrix.entries_count': '196', 'general.quantization_version': '2', 'tokenizer.ggml.add_bos_token': 'false', 'tokenizer.chat_template': "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 'quantize.imatrix.file': '../Qwen2/gguf/qwen2-1_5b-imatrix/imatrix.dat', 'tokenizer.ggml.bos_token_id': '151643', 'tokenizer.ggml.padding_token_id': '151643', 'tokenizer.ggml.eos_token_id': '151645', 'tokenizer.ggml.pre': 'qwen2', 'quantize.imatrix.chunks_count': '1937', 'tokenizer.ggml.model': 'gpt2', 'general.file_type': '2', 'qwen2.attention.layer_n

## 2. Call the model with a simple user message

### Using create_chat_completion()

In llama-cpp-python, we use the `create_chat_completion()` method to generate responses. This method follows the OpenAI Chat Completions API format, making it easy to switch between local models and cloud APIs.

**Message Structure:**
```python
messages = [
    {"role": "system", "content": "System instructions here"},
    {"role": "user", "content": "User message here"},
    {"role": "assistant", "content": "Previous assistant response (optional)"}
]
```

**This may take a few moments to process.**

You may run this multiple times, and will likely get different results. Feel free to change the `user_message`!

In [6]:
user_message = "Who pays for tariffs on foreign manufactured goods? Consumer or Producer?"  # You can change this prompt

# Create the messages list (OpenAI-compatible format)
messages = [
    {"role": "user", "content": user_message}
]

# Generate a response using create_chat_completion
response = model.create_chat_completion(
    messages=messages
)

# Extract and print the response
print("Response:")
print(response["choices"][0]["message"]["content"])

llama_perf_context_print:        load time =    2301.48 ms
llama_perf_context_print: prompt eval time =    2299.94 ms /    26 tokens (   88.46 ms per token,    11.30 tokens per second)
llama_perf_context_print:        eval time =    1369.38 ms /    76 runs   (   18.02 ms per token,    55.50 tokens per second)
llama_perf_context_print:       total time =    3698.54 ms /   102 tokens
llama_perf_context_print:    graphs reused =         72


Response:
Tariffs on foreign manufactured goods are typically paid by the producer, not the consumer. Tariffs are imposed by governments to protect domestic industries from foreign competition and to ensure that domestic producers can compete fairly. When a tariff is applied to a product, the producer of that product is responsible for paying the tariff, which is then passed on to the consumer in the form of higher prices.


### Understanding the Response Object

The `create_chat_completion()` method returns a dictionary that follows the OpenAI API response format:

```python
{
    "id": "chatcmpl-...",
    "object": "chat.completion",
    "created": 1234567890,
    "model": "model_name",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "The actual response text..."
            },
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 10,
        "completion_tokens": 50,
        "total_tokens": 60
    }
}
```

To get the text content, we access: `response["choices"][0]["message"]["content"]`
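
The same dictionary also carries the stop reason and token counts. The sketch below works on a hand-written stand-in for a response, so it runs without a model; the helper name `summarize_response` is hypothetical, and the values are illustrative only.

```python
# A hand-written stand-in for a real response; the values are illustrative only.
example_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Tariffs are taxes on imports."},
            "finish_reason": "stop",
        }
    ],
    "usage": {"prompt_tokens": 10, "completion_tokens": 50, "total_tokens": 60},
}


def summarize_response(response):
    """Pull the reply text, stop reason, and token count out of a response dict."""
    choice = response["choices"][0]
    usage = response.get("usage", {})
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice.get("finish_reason"),
        "total_tokens": usage.get("total_tokens"),
    }


summary = summarize_response(example_response)
print(summary["text"])
print(summary["finish_reason"], summary["total_tokens"])
```

Passing the `response` returned by `create_chat_completion()` in place of `example_response` should work the same way, assuming it follows the structure shown above.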

## 3. Passing additional arguments to control generation

The `create_chat_completion()` method accepts many parameters to control the generation. The defaults below reflect recent llama-cpp-python releases and can change between versions, so check the signature of your installed version:

| Parameter | Description | Default |
|-----------|-------------|--------|
| `messages` | List of message dictionaries | Required |
| `max_tokens` | Maximum number of tokens to generate | `None` (no cap; generation runs until the model stops or the context window is full) |
| `temperature` | Controls randomness (0 = deterministic, higher = more random) | 0.2 |
| `top_p` | Nucleus sampling (keep the smallest set of tokens whose cumulative probability reaches `top_p`) | 0.95 |
| `top_k` | Only consider the `top_k` most likely tokens | 40 |
| `repeat_penalty` | Penalize repeated tokens (1.0 = no penalty) | 1.0 |
| `stream` | If True, returns a generator for streaming responses | False |
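
To make `top_k` and `top_p` more concrete, here is a simplified sketch of the filtering idea in plain Python. It is an illustration under stated assumptions, not llama.cpp's actual sampler: the function name `filter_top_k_top_p` and the token probabilities are made up for the example.

```python
def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix reaching top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break  # Enough probability mass collected
    total = sum(p for _, p in kept)
    # Renormalize so the surviving probabilities sum to 1
    return {token: p / total for token, p in kept}


# Hypothetical next-token probabilities
candidates = {"prices": 0.5, "costs": 0.3, "taxes": 0.15, "bananas": 0.05}
print(filter_top_k_top_p(candidates, top_k=3, top_p=0.7))
```

Lower values of either parameter shrink the candidate pool, which tends to make output more predictable.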

### 3a. Using the `max_tokens` argument to cap the length of the response

Generation will stop abruptly once it reaches the maximum number of tokens, even if the response is mid-sentence.

In [7]:
response_size_limit_in_tokens = 60  # You can change this parameter

user_message = "What is the economic outcome of tariffs on foreign manufactured goods?"

messages = [
    {"role": "user", "content": user_message}
]

response = model.create_chat_completion(
    messages=messages,
    max_tokens=response_size_limit_in_tokens
)

print("Response:")
print(response["choices"][0]["message"]["content"])

Llama.generate: 8 prefix-match hit, remaining 17 prompt tokens to eval
llama_perf_context_print:        load time =    2301.48 ms
llama_perf_context_print: prompt eval time =     120.28 ms /    17 tokens (    7.08 ms per token,   141.33 tokens per second)
llama_perf_context_print:        eval time =     723.83 ms /    59 runs   (   12.27 ms per token,    81.51 tokens per second)
llama_perf_context_print:       total time =     864.62 ms /    76 tokens
llama_perf_context_print:    graphs reused =         56


Response:
The economic outcome of tariffs on foreign manufactured goods is complex and can vary depending on several factors, including the nature of the tariffs, the industries affected, and the overall economic conditions of the country. Here are some potential economic outcomes:

1. **Increased Revenue**: Tariffs can generate revenue for the government
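
The response above stops partway through a list because it hit the token cap. A program can detect this by inspecting `finish_reason`. The sketch below assumes llama-cpp-python follows the OpenAI convention of reporting `"length"` when the cap is reached and `"stop"` when the model finishes on its own; the helper `was_truncated` and both sample dictionaries are hypothetical.

```python
def was_truncated(response):
    """Return True if generation ended because it reached the token cap."""
    return response["choices"][0].get("finish_reason") == "length"


# Hand-written stand-ins for real responses (assumed shape, illustrative values)
capped = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "1. **Increased Revenue**: Tariffs can"},
            "finish_reason": "length",
        }
    ]
}
complete = {
    "choices": [
        {
            "message": {"role": "assistant", "content": "Tariffs raise import prices."},
            "finish_reason": "stop",
        }
    ]
}

print(was_truncated(capped), was_truncated(complete))
```

When a response is truncated, one option is to raise `max_tokens` and call the model again.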


### 3b. The `temperature` argument

LLMs generate text one token at a time (a token is a word or a piece of a word). At each step, the model produces a probability distribution over possible next tokens. The **temperature** parameter controls how this distribution is sampled:

- **`temperature = 0`** ("cold"): Always picks the most likely token → deterministic output
- **`temperature = 0.5-0.7`** ("warm"): Balanced creativity and coherence
- **`temperature = 1.0`** ("hot"): High variety but may be less coherent
- **`temperature > 1.0`** ("very hot"): Very random, often incoherent
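
The effect of temperature can be sketched in a few lines of plain Python: the model's raw scores (logits) are divided by the temperature before being turned into probabilities. The scores and the function name `next_token_probabilities` below are made up for illustration, and treating temperature 0 as "always pick the top score" is a common convention assumed here.

```python
import math


def next_token_probabilities(logits, temperature):
    """Convert raw scores into sampling probabilities at a given temperature."""
    if temperature <= 0:
        # Treat temperature 0 as greedy decoding: all probability on the top score
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # Subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.5]
for t in (0.0, 0.25, 1.0, 2.0):
    probs = next_token_probabilities(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature rises, the probabilities move closer together, so less likely tokens are chosen more often.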

**Let's run the same prompt three times with `temperature = 0`; we expect identical outputs:**

In [8]:
response_size_limit_in_tokens = 30
number_of_responses = 3
temperature = 0.0  # You can change this parameter

user_message = "How will tariffs affect the prices of foreign manufactured goods"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    
    messages = [
        {"role": "user", "content": user_message}
    ]
    
    response = model.create_chat_completion(
        messages=messages,
        max_tokens=response_size_limit_in_tokens,
        temperature=temperature
    )
    
    print(f"{response['choices'][0]['message']['content']}\n")

Response 1:


Llama.generate: 8 prefix-match hit, remaining 15 prompt tokens to eval
llama_perf_context_print:        load time =    2301.48 ms
llama_perf_context_print: prompt eval time =     278.49 ms /    15 tokens (   18.57 ms per token,    53.86 tokens per second)
llama_perf_context_print:        eval time =     383.66 ms /    29 runs   (   13.23 ms per token,    75.59 tokens per second)
llama_perf_context_print:       total time =     672.90 ms /    44 tokens
llama_perf_context_print:    graphs reused =         27
Llama.generate: 22 prefix-match hit, remaining 1 prompt tokens to eval


Tariffs can have a significant impact on the prices of foreign manufactured goods. When a country imposes tariffs on imported goods, it means that the country is

Response 2:


llama_perf_context_print:        load time =    2301.48 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =     393.07 ms /    30 runs   (   13.10 ms per token,    76.32 tokens per second)
llama_perf_context_print:       total time =     402.25 ms /    31 tokens
llama_perf_context_print:    graphs reused =         28
Llama.generate: 22 prefix-match hit, remaining 1 prompt tokens to eval


Tariffs can have a significant impact on the prices of foreign manufactured goods. When a country imposes tariffs on imported goods, it means that the country is

Response 3:


llama_perf_context_print:        load time =    2301.48 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =     393.77 ms /    30 runs   (   13.13 ms per token,    76.19 tokens per second)
llama_perf_context_print:       total time =     403.09 ms /    31 tokens
llama_perf_context_print:    graphs reused =         28


Tariffs can have a significant impact on the prices of foreign manufactured goods. When a country imposes tariffs on imported goods, it means that the country is



**Let's repeat with a slightly "hotter" temperature of `temperature = 0.25`; we expect outputs to begin diverging:**

In [9]:
response_size_limit_in_tokens = 30
number_of_responses = 3
temperature = 0.25

user_message = "How will tariffs affect elections"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    
    messages = [
        {"role": "user", "content": user_message}
    ]
    
    response = model.create_chat_completion(
        messages=messages,
        max_tokens=response_size_limit_in_tokens,
        temperature=temperature
    )
    
    print(f"{response['choices'][0]['message']['content']}\n")

Response 1:


Llama.generate: 12 prefix-match hit, remaining 6 prompt tokens to eval
llama_perf_context_print:        load time =    2301.48 ms
llama_perf_context_print: prompt eval time =     410.04 ms /     6 tokens (   68.34 ms per token,    14.63 tokens per second)
llama_perf_context_print:        eval time =     411.59 ms /    29 runs   (   14.19 ms per token,    70.46 tokens per second)
llama_perf_context_print:       total time =     832.59 ms /    35 tokens
llama_perf_context_print:    graphs reused =         27
Llama.generate: 17 prefix-match hit, remaining 1 prompt tokens to eval


Tariffs can have a significant impact on the political landscape, particularly in countries where trade plays a significant role in their economy. Here are some ways in

Response 2:


llama_perf_context_print:        load time =    2301.48 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =     429.85 ms /    30 runs   (   14.33 ms per token,    69.79 tokens per second)
llama_perf_context_print:       total time =     439.86 ms /    31 tokens
llama_perf_context_print:    graphs reused =         28
Llama.generate: 17 prefix-match hit, remaining 1 prompt tokens to eval


Tariffs can have a significant impact on elections, as they can affect the cost of goods and services, which can influence voter preferences. If tariffs increase

Response 3:


llama_perf_context_print:        load time =    2301.48 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =     391.54 ms /    30 runs   (   13.05 ms per token,    76.62 tokens per second)
llama_perf_context_print:       total time =     402.55 ms /    31 tokens
llama_perf_context_print:    graphs reused =         28


Tariffs can have a significant impact on the economy and, in turn, on the outcome of elections. Here are some ways in which tariffs might affect



**A "hot" temperature of `temperature = 1` will result in high variety but may lead to less satisfactory responses:**

In [None]:
response_size_limit_in_tokens = 30
number_of_responses = 5
temperature = 1

user_message = "How will tariffs affect elections"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    
    messages = [
        {"role": "user", "content": user_message}
    ]
    
    response = model.create_chat_completion(
        messages=messages,
        max_tokens=response_size_limit_in_tokens,
        temperature=temperature
    )
    
    print(f"{response['choices'][0]['message']['content']}\n")

## 4. Include a hidden "system message" at the start of the conversation

A **system message** sets the context and personality for the assistant. It's placed at the beginning of the messages list and influences how the model responds to all subsequent user messages.

In llama-cpp-python, we simply include a message with `"role": "system"` as the first item in the messages list:

```python
messages = [
    {"role": "system", "content": "Your system instructions here..."},
    {"role": "user", "content": "User's question"}
]
```

**Note:** System messages are never guaranteed to remain secret; models can sometimes be prompted to reveal their instructions.

In [None]:
response_size_limit_in_tokens = 100

system_message = """
You are a hard working economics student at UC Berkeley. 
You think that there may be some truth to the things you learn in economics classes.
You wish that the people in the government understood economics.
You think that memes and poems and pop songs are a good way to communicate
Answer in rap lyrics always
"""

user_message = "How will tariffs affect inflation"

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_message}
]

response = model.create_chat_completion(
    messages=messages,
    max_tokens=response_size_limit_in_tokens
)

print("Response:")
print(response["choices"][0]["message"]["content"])

## 5. "Few-shot" learning: include conversation history to set the tone

**Few-shot learning** (more precisely, few-shot prompting) means providing example prompt/response pairs to establish a pattern. No model weights are updated; the examples sit in the context, and the model will statistically tend to follow the pattern when generating new responses.

In llama-cpp-python, you do this by including multiple `user`/`assistant` message pairs before the actual user query:

```python
messages = [
    {"role": "system", "content": "System prompt"},
    {"role": "user", "content": "Example question 1"},
    {"role": "assistant", "content": "Example answer 1"},
    {"role": "user", "content": "Example question 2"},
    {"role": "assistant", "content": "Example answer 2"},
    {"role": "user", "content": "Actual user question"}  # Real question
]
```

The model learns the response style from the examples and applies it to the new question.
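
When the examples live in a list of question/answer pairs, the messages list can be assembled programmatically. This is a minimal sketch; `build_few_shot_messages` is a hypothetical helper, not part of llama-cpp-python.

```python
def build_few_shot_messages(system_prompt, examples, question):
    """Assemble a messages list: system prompt, example pairs, then the real question."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_question, example_answer in examples:
        messages.append({"role": "user", "content": example_question})
        messages.append({"role": "assistant", "content": example_answer})
    messages.append({"role": "user", "content": question})
    return messages


examples = [
    ("What is a tariff?", "A tax imposed by a government on imported goods."),
    ("Can tariffs backfire?", "Yes, they can trigger retaliation and trade wars."),
]
few_shot = build_few_shot_messages(
    "You are an economics tutor.", examples, "What is a quota?"
)
print([m["role"] for m in few_shot])
```

The resulting list can be passed as `messages` to `create_chat_completion()`.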

### 5a. A "Few-shot" example

In this example, we establish a pattern of concise, informative responses about economics. The model should follow this established conversational style.

**Note:** We're using the native message format, which llama-cpp-python automatically converts to the appropriate chat template (ChatML for Qwen).

In [None]:
response_size_limit_in_tokens = 200

# System message sets the overall behavior
system_message = """
You are an economics tutor with a focus on international trade.
Answer concisely and clearly, using accessible language.
"""

# Few-shot examples establish the response pattern
messages = [
    {"role": "system", "content": system_message},
    # Example 1
    {"role": "user", "content": "What is a tariff?"},
    {"role": "assistant", "content": "A tariff is a tax imposed by a government on imported goods, often used to protect domestic industries."},
    # Example 2
    {"role": "user", "content": "How do tariffs affect consumer prices?"},
    {"role": "assistant", "content": "Tariffs typically raise the price of imported goods, making them more expensive for consumers."},
    # Example 3
    {"role": "user", "content": "Can tariffs backfire?"},
    {"role": "assistant", "content": "Yes, they can lead to trade wars, hurt exporters, and reduce overall economic efficiency."},
    # Example 4
    {"role": "user", "content": "How do other countries respond to tariffs?"},
    {"role": "assistant", "content": "They often retaliate with their own tariffs, targeting key export sectors."},
    # The actual user question
    {"role": "user", "content": "What is an example of a real-world tariff dispute?"}
]

# Generate response
response = model.create_chat_completion(
    messages=messages,
    max_tokens=response_size_limit_in_tokens,
    temperature=0.8
)

print("Response:")
print(response["choices"][0]["message"]["content"])

### 5b. The importance of proper chat formatting

One advantage of llama-cpp-python is that it **automatically handles chat templating** when you set the `chat_format` parameter (we set it to `"chatml"` for Qwen models).

Behind the scenes, your messages are rendered into a single prompt string in which special tokens mark each turn:
```
<|im_start|>system
You are an economics tutor...<|im_end|>
<|im_start|>user
What is a tariff?<|im_end|>
<|im_start|>assistant
```

If you used the wrong format (or no format), the model might:
- Continue writing the script rather than responding as an assistant
- Produce incoherent output
- Not follow the conversation structure

**The `chat_format` parameter ensures proper formatting automatically!**
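
To see what this conversion amounts to, here is a small sketch that renders a messages list as a ChatML string by hand. It mirrors the `tokenizer.chat_template` shown in the model metadata earlier, but it is only an illustration: `render_chatml` is a hypothetical function, not the code llama-cpp-python runs, and it skips details such as inserting a default system message.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render a messages list as a ChatML prompt string (illustration only)."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [
        {"role": "system", "content": "You are an economics tutor."},
        {"role": "user", "content": "What is a tariff?"},
    ]
)
print(prompt)
```

The trailing open `assistant` turn is what cues the model to answer rather than continue the user's text.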

### 5c. A note about "hallucinations"

It's popular to use the word "hallucinations" to talk about model output that is very different from what we wanted, or when the output does not seem to make sense.

However, an LLM does not perceive; it assigns probabilities to possible next tokens based on patterns in its training data, and the output is sampled from those probabilities. The term "hallucination" can be misleading because:

1. **The model isn't failing** — it's doing exactly what it's designed to do
2. **It's a statistical process** — sometimes low-probability completions happen
3. **Training data limitations** — the model can only draw from what it learned

Understanding this helps us have realistic expectations and design better prompts.

### 5d. Building a chatbot application

If you wanted to build an extended conversation experience, you would:

1. **Maintain a message history list** that grows with each exchange
2. **Append new user messages** to the history
3. **Append assistant responses** to the history after generation
4. **Pass the entire history** to each new call

```python
# Example chatbot loop structure
messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    user_input = input("You: ")
    if user_input.strip().lower() in {"quit", "exit"}:
        break  # Give the user a way to end the conversation
    messages.append({"role": "user", "content": user_input})

    response = model.create_chat_completion(messages=messages)
    assistant_message = response["choices"][0]["message"]["content"]

    messages.append({"role": "assistant", "content": assistant_message})
    print(f"Assistant: {assistant_message}")
```

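**Important:** The LLM itself has no "memory"; it's your application that stores and manages the conversation history. Each call is given the entire conversation from the beginning, although llama-cpp-python can reuse cached computation for an unchanged prefix (the `prefix-match hit` lines in the logs earlier in this notebook).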
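
Because the history grows with every exchange, a long chat can eventually exceed the context window (`n_ctx`). One possible approach, sketched below, is to drop the oldest exchanges while keeping the system message. Both helpers are hypothetical, and the four-characters-per-token estimate is a crude assumption; a real application would count tokens with the model's tokenizer.

```python
def estimate_tokens(text):
    """Very rough token estimate: about four characters per token (an assumption)."""
    return max(1, len(text) // 4)


def trim_history(messages, max_tokens):
    """Drop the oldest non-system messages until the estimate fits max_tokens."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs):
        return sum(estimate_tokens(m["content"]) for m in msgs)

    # Always keep at least the most recent message
    while len(rest) > 1 and total(system + rest) > max_tokens:
        rest.pop(0)
    return system + rest


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 40},
]
trimmed = trim_history(history, max_tokens=50)
print([m["role"] for m in trimmed])
```

Dropping old turns means the model no longer sees them, so this trades recall of early context for staying within the window.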

## 6. Bonus: Streaming responses

llama-cpp-python supports **streaming**, which lets you see tokens as they're generated (like ChatGPT). This is useful for:
- Better user experience (immediate feedback)
- Long responses (no waiting for complete generation)
- Real-time applications

Set `stream=True` to enable streaming:

In [None]:
user_message = "Explain the concept of comparative advantage in international trade."

messages = [
    {"role": "user", "content": user_message}
]

print("Response (streaming):")

# With streaming, we get a generator that yields chunks
stream = model.create_chat_completion(
    messages=messages,
    max_tokens=150,
    stream=True  # Enable streaming
)

# Print each chunk as it arrives
for chunk in stream:
    delta = chunk["choices"][0]["delta"]
    if "content" in delta:
        print(delta["content"], end="", flush=True)

print()  # New line at the end
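
If you also need the full text after streaming finishes (for example, to append it to a chat history), you can collect the pieces as they arrive. The sketch below runs on hand-written chunks that imitate the streamed shape used in the loop above; `collect_stream` is a hypothetical helper, and real chunks carry additional fields.

```python
def collect_stream(chunks):
    """Join the text pieces from streamed chunks into one string."""
    pieces = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        # Some chunks carry only a role or an empty delta, so default to ""
        pieces.append(delta.get("content") or "")
    return "".join(pieces)


# Hand-written chunks imitating the streamed shape (illustrative values)
example_chunks = [
    {"choices": [{"delta": {"role": "assistant"}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": "Comparative "}, "finish_reason": None}]},
    {"choices": [{"delta": {"content": "advantage"}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
print(collect_stream(example_chunks))
```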

## Summary

In this notebook, you learned how to:

1. **Install and import** llama-cpp-python
2. **Load a GGUF model** using the `Llama` class
3. **Generate responses** using `create_chat_completion()`
4. **Control generation** with parameters like `max_tokens`, `temperature`, `top_p`
5. **Use system messages** to set assistant behavior
6. **Implement few-shot learning** with example conversations
7. **Stream responses** for real-time output

### Key Advantages of llama-cpp-python:
- **OpenAI-compatible API** — easy to switch between local and cloud models
- **Any GGUF model** — not limited to a curated list
- **Active development** — regular updates for new model architectures
- **Fine-grained control** — access to low-level inference parameters

### Next Steps:
- Try different GGUF models from Hugging Face
- Experiment with different temperature and sampling settings
- Build a simple chatbot application
- Explore GPU acceleration for faster inference