## 0. Inspired by [ds-modules/ollama-demo](https://github.com/ds-modules/ollama-demo)

This notebook uses the [GPT4All Python SDK](https://docs.gpt4all.io/gpt4all_python/home.html). It was taught in Spring 2025 in Econ 148, running on [the UC Berkeley DataHub](https://datahub.berkeley.edu/). Here it is adapted for a more general case: either a local machine or a hub-type deployment.

Notebook developed by Greg Merritt <[gmerritt@berkeley.edu](mailto:gmerritt@berkeley.edu)> and inspired by [ds-modules/ollama-demo](https://github.com/ds-modules/ollama-demo). Adapted by Eric Van Dusen.

## 1. Environment setup

1. Ensure that your Python environment has the `gpt4all` package installed.
2. Define the `model` object to which this notebook's code will send conversations and prompts.
3. This notebook assumes that at least one small-model `.gguf` file has already been downloaded into a directory, e.g. using `GPT4All_Download_gguf.ipynb`.

_Do not worry about "Failed to load libllamamodel-mainline-cuda..." errors; this happens when the environment, like ours, does not have GPU support._

In [None]:
# Ensure that your python environment has gpt4all capability
try:
    from gpt4all import GPT4All
except ImportError:
    %pip install gpt4all
    from gpt4all import GPT4All

## Let's check our local filesystem path and whether the model files have been downloaded

There are two approaches below: one for a shared hub and one for a local machine. Commands that do not apply to the current setup are commented out with `#`.


### Approach 1 -  if a Shared Hub is being used 

In [None]:
# This only worked for SP 25 instruction on Berkeley DataHub
#!ls /home/jovyan/_shared/econ148-readwrite

In [None]:
# On Cal-ICOR workshop hub 
!ls /home/jovyan/shared/

In [1]:
# check in the root directory
!ls /

Applications Users        cores        home         sbin         var
Library      Volumes      dev          opt          tmp
System       bin          etc          private      usr


### Approach 2 -  if a local machine is being used

In [None]:
#This is my local path to a directory called shared-rw
!ls shared-rw

In [2]:
# or the full path ( this is on my laptop) 
!ls /Users/ericvandusen/Documents/GitHub/SmallLM-SP25/shared-rw

DeepSeek-R1-Distill-Qwen-1.5B-Q4_0.gguf orca-mini-3b-gguf2-q4_0.gguf
course                                  qwen2-1_5b-instruct-q4_0.gguf
gemma-2b-it.Q4_0.gguf


## What is a `.gguf` file?
A `.gguf` file stores a language model, usually in quantized form, for efficient inference on local devices. "GGUF" is commonly expanded as "GPT-Generated Unified Format"; the format is designed to work with open-source LLM frameworks such as llama.cpp and GPT4All. These files contain the model's weights and configuration in a compact, portable way, enabling fast loading and execution without requiring large computational resources or cloud access.
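As a quick sanity check on a downloaded file, one can peek at its first few bytes. The sketch below assumes the commonly documented GGUF layout, in which a file begins with the four magic bytes `GGUF` followed by a little-endian 32-bit version number; the helper names `read_gguf_header` and `looks_like_gguf` are made up for this illustration and are not part of GPT4All.

```python
import struct


def read_gguf_header(file_path):
    """Return (magic, version) read from the first 8 bytes of a file."""
    with open(file_path, "rb") as f:
        header = f.read(8)
    if len(header) < 8:
        raise ValueError("File is too short to be a GGUF file")
    # Assumed layout: 4 magic bytes, then a little-endian unsigned 32-bit version
    magic, version = struct.unpack("<4sI", header)
    return magic, version


def looks_like_gguf(file_path):
    """True if the file starts with the b'GGUF' magic bytes."""
    try:
        magic, _ = read_gguf_header(file_path)
    except (OSError, ValueError):
        return False
    return magic == b"GGUF"


# Hypothetical usage, once you know where your model files live:
# looks_like_gguf("shared-rw/qwen2-1_5b-instruct-q4_0.gguf")
```

A check like this only confirms that the file header looks right; it does not verify that the download is complete or that the model will load.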



## Pick your approach and set the path

In [None]:
# set the model path parameter depending on where you are computing
path="/home/jovyan/shared/"

In [None]:
# set the model path parameter depending on where you are computing
#path="/Users/ericvandusen/Documents/GitHub/SmallLM-SP25/shared-rw"
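Instead of commenting and uncommenting path assignments by hand, one option is to let the notebook pick the first directory that actually contains a `.gguf` file. This is only a sketch: the helper `find_model_dir` and the variable `detected_path` are hypothetical names introduced here, and the candidate list simply reuses the locations mentioned above.

```python
from pathlib import Path


def find_model_dir(candidate_dirs):
    """Return the first candidate directory containing a .gguf file, else None."""
    for candidate in candidate_dirs:
        directory = Path(candidate).expanduser()
        if directory.is_dir() and any(directory.glob("*.gguf")):
            return str(directory)
    return None


candidate_dirs = [
    "/home/jovyan/shared/",  # shared hub location used above
    "shared-rw",             # local directory relative to this notebook
]
detected_path = find_model_dir(candidate_dirs)
print(f"Detected model directory: {detected_path}")
```

If `detected_path` is `None`, no candidate held a model file, and you would still need to set `path` manually.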

In [None]:
# This calls in the model that we have downloaded already 
model = GPT4All(
    model_name="qwen2-1_5b-instruct-q4_0.gguf",
    model_path=path,
    allow_download=False,
    verbose=True
)
# If you see a pink error box, do not worry: it appears because we don't have a GPU and CUDA set up

In [None]:
#model = GPT4All(
#    model_name="gemma-2b-it.Q4_0.gguf",
#    model_path=path,
#    allow_download=False,
#    verbose=True
#)


In [None]:
# This extension will report the run time for each cell of code in this notebook.
#try:
#    import autotime
#    %load_ext autotime
#except ImportError:
#    %pip install ipython-autotime
#    import autotime
#    %load_ext autotime

## 2. Call the model with a GPT4All chat session containing a simple user message
This code pretends that a person submitted a message (prompt) to your application; your application then takes this `user_message` and passes it to the LLM `model` for response generation. The `response` is printed.

This may take a few moments to process.

You may run this multiple times, and will likely get different results. Feel free to replace `user_message` with a prompt of your own!

In [None]:
user_message = "Who pays for tariffs on foreign manufactured goods? Consumer or Producer?" # You can change this prompt 

with model.chat_session():
    print(f"Response:")
    response = model.generate(
        prompt = user_message
    )
    print(f"{response}")


## 3. Passing additional arguments to the chat session model call

We can pass more than just a prompt to the `model.generate()` call inside a GPT4All chat session. The complete list of arguments is shown here:

* `prompt`: The prompt for the model to complete.
* `max_tokens`: The maximum number of tokens to generate.
* `temp`: The model temperature. Larger values increase creativity but decrease factuality.
* `top_k`: Randomly sample from the top_k most likely tokens at each generation step. Set this to 1 for greedy decoding.
* `top_p`: Randomly sample at each generation step from the top most likely tokens whose probabilities add up to top_p.
* `min_p`: Randomly sample at each generation step from the top most likely tokens whose probabilities are at least min_p.
* `repeat_penalty`: Penalize the model for repetition. Higher values result in less repetition.
* `repeat_last_n`: How far in the model's generation history to apply the repeat penalty.
* `n_batch`: Number of prompt tokens processed in parallel. Larger values decrease latency but increase resource requirements.
* `n_predict`: Equivalent to max_tokens, exists for backwards compatibility.
* `streaming`: If True, this method will instead return a generator that yields tokens as the model generates them.
* `callback`: A function with arguments token_id:int and response:str, which receives the tokens from the model as they are generated and stops the generation by returning False.
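To make `top_k` and `top_p` more concrete, here is a toy sketch of how such filters narrow the pool of candidate tokens before one is sampled. It follows the plain-language descriptions in the list above and is not GPT4All's actual implementation; the function names and the probabilities in `toy_probs` are made up for illustration.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(prob for _, prob in ranked)
    return {token: prob / total for token, prob in ranked}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose probabilities reach p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


# Made-up next-token probabilities, for illustration only
toy_probs = {"consumers": 0.5, "importers": 0.3, "exporters": 0.15, "nobody": 0.05}

print(top_k_filter(toy_probs, 1))     # only the single most likely token survives
print(top_p_filter(toy_probs, 0.75))  # the two most likely tokens survive
```

With `k = 1` the filter leaves a single candidate, which is why setting `top_k` to 1 amounts to greedy decoding.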

### 3a. Using the `max_tokens` argument to cap the length of the response

A GPT4All chat completion generation will stop generating words (tokens) abruptly once it's generated (at most) the specified maximum number of tokens assigned to the optional `max_tokens` parameter. The response may cut off mid-sentence, even if the response

😉

In [None]:
response_size_limit_in_tokens = 60  # You can change this parameter 

user_message = "What is the economic outcome of tariffs on foreign manufactured goods?"

with model.chat_session():
    print(f"Response:")
    response = model.generate(
        prompt = user_message,
        max_tokens = response_size_limit_in_tokens
    )
    print(f"{response}")
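If an abrupt cut-off is undesirable, the `callback` argument listed in section 3 offers another way to end a response: the callback receives each token as it is generated and can stop generation by returning `False`. The sketch below is one possible approach, assuming the callback's `response` argument holds the newest piece of generated text; `make_sentence_stop_callback` and its `min_tokens` parameter are hypothetical names introduced for this example.

```python
def make_sentence_stop_callback(min_tokens=20):
    """Build a callback that stops generation at the first sentence end
    seen after at least `min_tokens` tokens have been generated."""
    tokens_seen = 0

    def callback(token_id, response):
        nonlocal tokens_seen
        tokens_seen += 1
        # Does the newest piece of text close a sentence?
        ends_sentence = response.strip().endswith((".", "!", "?"))
        # Returning False asks the model to stop; True lets it continue
        return not (tokens_seen >= min_tokens and ends_sentence)

    return callback


# Hypothetical usage with the `model` loaded earlier in this notebook:
# with model.chat_session():
#     response = model.generate(
#         prompt=user_message,
#         max_tokens=200,
#         callback=make_sentence_stop_callback(min_tokens=40),
#     )
```

Keeping `max_tokens` in place as an upper bound is still sensible, since the model may go a long time without producing sentence-ending punctuation.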

### 3b. The `temp`erature argument.

LLMs generate one token ("word") at a time as they complete the chat you give them. As the LLM completes the chat, there is a single statistically most-likely token to "come next" at each step. However, a model will generally also have additional -- but less-likely -- tokens as candidate alternatives at each step. Which should it choose?

The value of the _`temp`erature_ argument will affect the likelihood that the model may randomly generate a less-probable token at each chat completion step.

A temperature of `0` -- "cold," if you like -- will constrain the model to always pick the most-likely token ("word") at each chat completion step.
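The effect of temperature can be sketched with a few lines of arithmetic. A sampler typically divides each candidate token's raw score by the temperature before converting the scores to probabilities, so low temperatures sharpen the distribution around the top token and high temperatures flatten it. The code below is a toy illustration of that idea with made-up scores, not the code GPT4All runs internally; `apply_temperature` and `toy_logits` are names invented for this example.

```python
import math


def apply_temperature(logits, temp):
    """Convert raw scores to probabilities, sharpened or flattened by `temp`."""
    if temp <= 0:
        # Treat temp = 0 as greedy decoding: all probability on the top score
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [score / temp for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for temp in (0.0, 0.15, 1.0):
    probs = apply_temperature(toy_logits, temp)
    print(f"temp={temp}: {[round(p, 3) for p in probs]}")
```

Running it shows the top token's share shrinking as the temperature rises, which is why hotter settings produce more varied responses.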

#### Let's run the same chat completion three times, but with ``temp = 0``; we expect that each of the three runs will give precisely the same output, choosing the model's most-statistically-likely next token at each step of the generation:

In [None]:
response_size_limit_in_tokens = 30 
number_of_responses = 3 
temperature = 0.0  # You can change this parameter 

user_message = "How will tariffs affect the prices of foreign manufactured goods"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    with model.chat_session():
        response = model.generate(
            prompt = user_message,
            max_tokens = response_size_limit_in_tokens,
            temp = temperature
        )
    print(f"{response}\n")


#### Let's repeat that, but with a slightly "hotter" temperature of ``temp = 0.15``; we expect the outputs to begin to diverge from one another:

In [None]:
response_size_limit_in_tokens = 30
number_of_responses = 5
temperature = .15

user_message = "How will tariffs affect the prices of foreign manufactured goods"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    with model.chat_session():
        response = model.generate(
            prompt = user_message,
            max_tokens = response_size_limit_in_tokens,
            temp = temperature
        )
    print(f"{response}\n")

#### A "very hot" temperature of ``temp = 1`` will result in a high variety of responses, but may lead to "very unlikely" responses that may be less satisfactory:

In [None]:
response_size_limit_in_tokens = 30
number_of_responses = 5
temperature = 1

user_message = "How will tariffs affect the prices of foreign manufactured goods"

for i in range(number_of_responses):
    print(f"Response {i + 1}:")
    with model.chat_session():
        response = model.generate(
            prompt = user_message,
            max_tokens = response_size_limit_in_tokens,
            temp = temperature
        )
    print(f"{response}\n")

## 4. Include a hidden "system message" at the start of the conversation, before the user prompt
If chatbots were thinking entities, we developers might like to give them "instructions" regarding what we want them to do for users. However, chatbots just call LLMs to advance a conversation.

A "system message" is often thought of as instructions given to a chatbot. Functionally, it serves as a "conversation starter" to which the LLM does not respond directly; it is effectively "prepended" to the first user prompt in the conversation.

So, when you set a system message in your application, every conversation that your chatbot app gives to the LLM for advancing a conversation always has this "system message" quietly inserted at the very beginning of the conversation -- whether the user likes it or not!

(Note that these "system messages" are never guaranteed to remain secret, no matter how cleverly you may try to craft them; models can be prompted to reveal the contents of their system message.)
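To picture what "prepended" means, the sketch below assembles the text a model might receive when a system message is present. It assumes ChatML-style `<|im_start|>` / `<|im_end|>` tags of the kind used elsewhere in this notebook; the helpers `with_system_message` and `render_chatml` are hypothetical and only illustrate the ordering, not how GPT4All builds its prompt internally.

```python
def with_system_message(system_message, user_prompt):
    """Place the hidden system message ahead of the user's first prompt."""
    return [
        {"role": "system", "content": system_message.strip()},
        {"role": "user", "content": user_prompt.strip()},
    ]


def render_chatml(messages):
    """Join role/content messages into one ChatML-style string."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n" for m in messages
    )


demo_messages = with_system_message(
    "Answer in haiku always", "How will tariffs affect inflation?"
)
print(render_chatml(demo_messages))
```

The user only ever typed the second block; the first was inserted by the application.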

In [None]:
response_size_limit_in_tokens = 100

system_message = """
You are a hard working economics student at UC Berkeley. 
You think that there may be some truth to the things you learn in economics classes.
You wish that the people in the government understood economics.
You think that memes and poems and pop songs are a good way to communicate
Answer in haiku always
"""

user_message = "How will tariffs affect inflation "


with model.chat_session(system_prompt=system_message):
    print(f"Response:")
    response = model.generate(
        prompt = user_message,
        max_tokens = response_size_limit_in_tokens
    )
    print(f"{response}")


## 5. "Few-shot" learning: include a pre-made conversation history to set the tone of subsequent response generations

Another way to guide a language model is to provide a "few shots", a sequence of sample prompt/response (or user/assistant) dialogue pairs that establish a pattern to the conversation; our model will statistically tend to follow the established conversation pattern when it responds to a new prompt from a user.

The "Few shot" label is commonly used for this technique, but, in truth, this is simply a "pre-loaded" initial conversation in which both sample prompts *and* sample responses were written beforehand by the developer; when the real user engages in a new conversation via your application, they do not know that their first prompt is *appended* by your application to this hidden, pre-written conversation.

### 5a. A "Few-shot" example
In this example, we include such a fake conversation history, intended to help set the tone of responses. This conversation history consists of pairs of prompts/responses (`user:`/`assistant:`), but the `user:` lines were not written by a user, and the `assistant:` lines were not generated by the LLM! These were drafted by the developer, and are included to establish a baseline conversational style.

Here the developer made some choices about how the economics tutor should respond to questions. The sample responses are brief, each a single plain-language sentence. Hopefully the next response generated will fit this pattern -- although this is never guaranteed!

* **Note 1:** `response_size_limit_in_tokens` has been set to 200, but we'll hope that the model follows the conversational history example and keeps responses brief.
* **Note 2:** We use a `template` appropriate to the model being used (`qwen2.5`) to give semantic structure to the conversation; more on this in the example to follow.

In [None]:
# qwen2.5 template
prompt_template = """
<|im_start|>user
{0}
<|im_end|>
<|im_start|>assistant
{1}
<|im_end|>
"""

# Define the system message and chat history
system_message = """
You are an economics tutor with a focus on international trade.
Answer concisely and clearly, using accessible language.
"""

chat_history = [
    {"role": "user", "content": "What is a tariff?"},
    {"role": "assistant", "content": "A tariff is a tax imposed by a government on imported goods, often used to protect domestic industries."},
    {"role": "user", "content": "How do tariffs affect consumer prices?"},
    {"role": "assistant", "content": "Tariffs typically raise the price of imported goods, making them more expensive for consumers."},
    {"role": "user", "content": "Can tariffs backfire?"},
    {"role": "assistant", "content": "Yes, they can lead to trade wars, hurt exporters, and reduce overall economic efficiency."},
    {"role": "user", "content": "How do other countries respond to tariffs?"},
    {"role": "assistant", "content": "They often retaliate with their own tariffs, targeting key export sectors."}
]

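# Generous cap (see Note 1); we hope the few-shot examples keep the reply brief
response_size_limit_in_tokens = 200

new_user_message = "What is an example of a real-world tariff dispute?"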

# Append the new user message
chat_history.append({"role": "user", "content": new_user_message})

# Format the conversation history
formatted_prompt = ""
for message in chat_history:
    formatted_prompt += f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n"

print(f"Formatted prompt:\n{formatted_prompt}")

# Combine with model session
with model.chat_session(system_prompt=system_message, prompt_template=prompt_template):
    print("Response:")
    response = model.generate(
        prompt=formatted_prompt,
        max_tokens=response_size_limit_in_tokens,
        temp=0.8
    )

# Output the assistant's reply
print(response)


### 5b. Why we need to conform to the model's conversation template: a counter-example
Above, we wrapped the conversation history elements in tags according to the template syntax published with this model. Different models will use different template syntax. (Some model-running frameworks & supporting SDKs help abstract this away so you may not have to worry about it too much in some applications.)

What if we make a bogus, over-simplified template that just packages the full `user:` and `assistant:` conversation history into one big lump? It's as if the user's initial prompt was one single blob of text, a scripted dialogue, without any special distinctions of the elements to indicate to the model that they are conversation history prompt/response pairs.

When we give an LLM this blob of a script, it may try to simply continue the _script_, as a playwright writing a continuing dialogue between two actors, rather than take the role of the "assistant" and "speak the next line" of the dialogue! (Run several times to get varied results.)

**Note:** The way we lump the history into one blob is to give a bogus template (`{0}`) that serves to lump the full conversation history into one element that appears to be one single user prompt. *The prompt value is exactly the same as the proper templated example above, but we give the model different parsing instructions via this reductive template!*
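The difference between the two templates can be seen by filling them in by hand. The sketch below uses Python's `str.format` purely to illustrate placeholder substitution, on the assumption that `{0}` stands for the user's text and `{1}` for the assistant's reply; it is not a reproduction of what the SDK does internally, and `fill_template` is a name made up for this example.

```python
chatml_template = (
    "<|im_start|>user\n{0}\n<|im_end|>\n"
    "<|im_start|>assistant\n{1}\n<|im_end|>\n"
)
bogus_template = "{0}"


def fill_template(template, user_text, assistant_text=""):
    """Substitute the user and assistant text into a template's placeholders."""
    return template.format(user_text, assistant_text)


sample_text = "What is a tariff?"

print("Structured template:")
print(fill_template(chatml_template, sample_text))
print("Bogus template:")
print(fill_template(bogus_template, sample_text))
```

The structured template marks who is speaking and where each turn ends; the bogus one passes the text through with no role markers at all.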

In [None]:
# Bogus, over-simplified template: passes the whole prompt through as one block
prompt_template = "{0}"

# Define the system message and chat history
system_message = """
You are an economics tutor who specializes in international trade.
Keep answers concise and informative. Provide real-world context when possible.
"""

chat_history = [
    {"role": "user", "content": "What is a tariff?"},
    {"role": "assistant", "content": "A tariff is a tax on imported goods, usually used to protect domestic industries or raise government revenue."},
    {"role": "user", "content": "How do tariffs impact consumers?"},
    {"role": "assistant", "content": "They usually raise prices on imported goods, which can lead to higher costs for consumers."},
    {"role": "user", "content": "Why do countries use tariffs?"},
    {"role": "assistant", "content": "To shield domestic producers from foreign competition, or as leverage in trade negotiations."},
    {"role": "user", "content": "Do tariffs always work?"},
    {"role": "assistant", "content": "Not always. They can provoke retaliation, distort markets, and reduce overall trade efficiency."}
]

new_user_message = "Can you give an example of a recent tariff conflict?"

# Append the new user message to the chat history
chat_history.append({"role": "user", "content": new_user_message})

# Format the conversation history for the model
formatted_prompt = ""
for message in chat_history:
    formatted_prompt += f"<|im_start|>{message['role']}\n{message['content']}\n<|im_end|>\n"

print(f"Formatted prompt:\n{formatted_prompt}")

# Combine the system prompt and history
with model.chat_session(system_prompt=system_message, prompt_template=prompt_template):
    
    # Generate the assistant's response
    print("Response:")
    response = model.generate(
        prompt=formatted_prompt,
        max_tokens=200,
        temp=0.8
    )

# Print the final response
print(response)

### 5c. A note about "hallucinations"

It's popular to use the word "hallucinations" to talk about model output that is very different from what we wanted, or when the output does not seem to make sense.

However, an LLM does not perceive; it merely continues a conversation. Can it _literally hallucinate?_

In such situations, the model is not crashing or failing or broken or sending errors; it is working exactly as it's designed to work.

What's certain about such situations is that there is a disconnect between a model's output and our hopes / expectations for its output. The more we can understand about models' behaviors, the less we may be surprised by their output, even if that output is not what we were hoping the model would generate.

Model responses to 5b. are likely something that nobody would ever want. However, the model is working as designed.

### 5d. Can you imagine how you might code a chatbot application?

If you wanted to develop an application that provided the user with an extended conversation experience, your application would capture the history of user prompts and model responses; for every new user prompt, your application would bundle the (growing) conversation history in precisely the way done above for the "few-shot" example. The pieces and the syntax are the same, but the history of prompts & responses would be dynamically generated by your app's user and the LLM, and the conversation history would be managed by your application.

This is important: the LLM itself has no "memory" and can never store a conversation. It takes an application to store and manage conversations. In many contemporary examples, each new user input to an extended-conversation chatbot app results in a wholesale from-the-beginning processing of the historical conversation. There are frameworks that let your app cache the "tokenized" version of your conversation history, so that the LLM does not have to freshly encode the history with each subsequent prompt, but these are not ubiquitous.
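A minimal sketch of that bookkeeping is shown below. The `Conversation` class is a hypothetical illustration, not part of GPT4All: it stores the history, re-sends the whole thing on every turn, and records each reply. To keep the sketch runnable without a model file, the model call is passed in as a function, and `fake_generate` stands in for a real call such as `model.generate`.

```python
class Conversation:
    """Stores the growing message history on behalf of a stateless model."""

    def __init__(self, system_message=""):
        self.messages = []
        if system_message:
            self.messages.append({"role": "system", "content": system_message})

    def render(self):
        """Format the whole history with the same tags as the few-shot example."""
        return "".join(
            f"<|im_start|>{m['role']}\n{m['content']}\n<|im_end|>\n"
            for m in self.messages
        )

    def ask(self, user_prompt, generate_fn):
        """Add a user turn, send the full history to the model, store the reply."""
        self.messages.append({"role": "user", "content": user_prompt})
        reply = generate_fn(self.render())
        self.messages.append({"role": "assistant", "content": reply})
        return reply


def fake_generate(prompt_text):
    """Stand-in for a real model call, so the sketch runs without a model."""
    return f"(reply to a prompt of {len(prompt_text)} characters)"


chat = Conversation(system_message="You are an economics tutor.")
chat.ask("What is a tariff?", fake_generate)
chat.ask("Who pays for it?", fake_generate)
print(chat.render())
```

Note that the second call sends a longer prompt than the first: the application, not the model, is what carries the conversation forward.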