# General notebook for assignments
---

Part of the [Masterclass: Large Language Models for Data Science](https://github.com/avvorstenbosch/Masterclass-LLMs-for-Data-Science)
![](https://raw.githubusercontent.com/avvorstenbosch/Masterclass-LLMs-for-Data-Science/refs/heads/main/slides/day_1/figures/pair-programming-with-llms.webp)
by Alex van Vorstenbosch
---

This is a general notebook for making assignments, which you can use to interact with open-source LLMs instead of closed-source models via APIs.

Google Colab provides both CPU and GPU compute resources. On the free tier, users get access to an Nvidia T4 (2018), which has:
* 16 [GiB](https://www.techtarget.com/searchstorage/definition/gibibyte-GiB) VRAM
* 2560 CUDA cores

Access is limited in time, but overall these limits are very generous.
If required, upgrading to a subscription will provide more compute credits.

You can run open-source LLMs of up to **~7B parameters**. These days LLMs are typically run in half precision (FP16: 2 bytes, i.e. 16 bits, per parameter) rather than full precision (FP32: 4 bytes per parameter). This means a 7B model needs about 14 GB of VRAM for its weights alone. On top of that you need to take the context window into account: because of the [KV-cache](https://arxiv.org/pdf/2412.19442), VRAM usage scales linearly with the size of the context window. For a **~7B** model in FP16, the context window will be limited to **~8192 tokens**.
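The arithmetic above can be sketched in a few lines. This is a rough estimate only: the layer count, number of KV heads and head dimension below are assumptions that roughly match a Llama-3.1-8B-style model with grouped-query attention, and other models will give different numbers.

```python
GIB = 1024**3

def weight_bytes(n_params, bits_per_param):
    """Memory needed for the model weights alone."""
    return n_params * bits_per_param / 8

def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Memory for the KV-cache: one key and one value vector per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value

# 7B parameters in FP16 (16 bits per parameter)
weights_fp16 = weight_bytes(7e9, 16)

# Assumed architecture values (illustrative, not read from a real model file)
kv_8k = kv_cache_bytes(n_ctx=8192, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"weights (FP16): {weights_fp16 / 1e9:.1f} GB")
print(f"KV-cache at 8192 tokens: {kv_8k / GIB:.2f} GiB")
```

Doubling `n_ctx` doubles the KV-cache estimate, which is the linear scaling mentioned above.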

Luckily, researchers figured out that we can quantize the model weights to even smaller sizes, such as Q4 (4 bits per parameter), while retaining much of the original model performance. Going below Q4 is typically not recommended. Quantization allows us to use models beyond 7B, but introduces a trade-off between speed/memory usage and quality. As a rule of thumb: for output quality it is better to use a bigger quantized model than a smaller model without quantization. For online use I would always recommend using a quant, if only to save on download times. Q6 shows very little degradation in quality, but is a factor 2.67 smaller than FP16.
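To make the trade-off concrete, here is a small sketch that compares approximate weight sizes at different precisions. It uses the nominal bits per parameter from the text; real GGUF files are usually slightly larger because the quantization formats store some extra scaling data.

```python
def model_size_gb(n_params, bits_per_param):
    """Approximate size of the weights in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Nominal precisions; the names are labels for this sketch only
precisions = [("FP16", 16), ("Q8", 8), ("Q6", 6), ("Q4", 4)]

for name, bits in precisions:
    print(f"8B model at {name}: {model_size_gb(8e9, bits):.1f} GB")

# Size reduction of Q6 relative to FP16
q6_factor = 16 / 6
print(f"FP16 -> Q6 saves a factor {q6_factor:.2f}")

# A bigger quantized model can be smaller than a smaller unquantized one
print(f"24B at Q4: {model_size_gb(24e9, 4):.1f} GB vs 8B at FP16: {model_size_gb(8e9, 16):.1f} GB")
```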

Here are a few models you can try on Colab. Please note that new models come out all the time, so some of these suggestions might be outdated by the time you read this!

**Some model suggestions:**
* tiny: 3.8B - [Microsoft Phi-3.5 quantized](https://huggingface.co/bartowski/Phi-3.5-mini-instruct-GGUF)

* medium: 8B - [Meta Llama-3.1 8B-Instruct quantized](https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF)

* large: 24B - [Mistral small-24B-2501 quantized](https://huggingface.co/bartowski/Mistral-Small-24B-Instruct-2501-GGUF)

* SFT reasoning - [DeepSeek R1-Distill-Qwen-14B quantized](https://huggingface.co/bartowski/DeepSeek-R1-Distill-Qwen-14B-GGUF)

---






# 1. Setup

First we will need to install the necessary Python packages.
Luckily for us, Google Colab comes with most of the libraries and the required CUDA software pre-installed.

## 1.1 Runtime
---

We will want to use a GPU to run the examples in this notebook. In Google Colab, go to the menu bar:


**Menu bar > Runtime > Change runtime type > T4 GPU**
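After switching the runtime, you may want to confirm that a GPU is actually attached. The sketch below calls the `nvidia-smi` command-line tool, assuming it is on the path as it normally is on a Colab GPU runtime; `gpu_summary` is a hypothetical helper name and not part of the course repo.

```python
import shutil
import subprocess

def gpu_summary():
    """Return the GPU name and total memory reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip() or None

summary = gpu_summary()
print(summary if summary else "No GPU detected: check Runtime > Change runtime type")
```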

---

## 1.2 Install packages
Run the cell below to install `llama-cpp-python` which allows fast inference on GPU and CPU with GGUF quantized models.



In [3]:
# %% capture
!pip install --no-cache-dir llama-cpp-python==0.2.90 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122

Looking in indexes: https://pypi.org/simple, https://abetlen.github.io/llama-cpp-python/whl/cu122
Collecting llama-cpp-python==0.2.90
  Downloading https://github.com/abetlen/llama-cpp-python/releases/download/v0.2.90-cu122/llama_cpp_python-0.2.90-cp311-cp311-linux_x86_64.whl (443.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 443.8/443.8 MB 255.4 MB/s eta 0:00:00
Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.90)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 236.1 MB/s eta 0:00:00
Installing collected packages: diskcache, llama-cpp-python
Successfully installed diskcache-5.6.3 llama-cpp-python-0.2.90
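If you want to double-check which version ended up installed (for example after restarting the runtime), a minimal check using only the standard library could look like this. `installed_version` is a hypothetical helper name for this sketch.

```python
from importlib import metadata

def installed_version(package):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

# Prints the installed version, or None when the install cell has not been run yet
print(installed_version("llama-cpp-python"))
```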


## (Optional) 1.3 Clear GPU VRAM

In [None]:
### Uncomment and run this cell if you need to clear the GPU memory!
# import gc
# import torch
# del llm

# gc.collect()
# torch.cuda.empty_cache()

## 1.4 Load helper functions from github

In the repo I have included two helper functions for talking to LLMs:

1. generate_response\
   Generate a response given a set of chat messages, with optional streaming behavior.

2. interactive_chat\
   Allows the user to engage in an interactive chat session with the model (streaming by default).


In [4]:
!curl -o helper_functions.py https://raw.githubusercontent.com/avvorstenbosch/Masterclass-LLMs-for-Data-Science/refs/heads/course_2025/exercises/day_1/helper_functions/helper_functions.py

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7391  100  7391    0     0  12537      0 --:--:-- --:--:-- --:--:-- 12527


In [5]:
from helper_functions import *


In [6]:
help(generate_response)

Help on function generate_response in module helper_functions:

generate_response(llm, messages, max_tokens=128, temperature=0.8, top_p=0.95, top_k=40, repeat_penalty=1.1, json_mode=False, stream=False, **kwargs)
    Generate a response from a Llama_CPP model given a list of messages.
    
    Parameters
    ----------
    llm : Llama_cpp
        The Llama_cpp llm model to use for generation.
    messages : list of dict
        A list of message dictionaries, where each dictionary should have
        the keys {"role", "content"}. Example:
        [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a joke."}
        ]
    max_tokens : int, optional
        Maximum number of tokens to generate (default: 128).
    temperature : float, optional
        Sampling temperature. Higher values produce more random outputs
        (default: 0.8).
    top_p : float, optional
        Nucleus sampling probability threshold (def

In [7]:
help(interactive_chat)

Help on function interactive_chat in module helper_functions:

interactive_chat(llm, system_prompt='You are a helpful assistant.', max_tokens=512, temperature=0.8, top_p=0.95, top_k=40, repeat_penalty=1.1, stop_str='exit', **kwargs)
    Start an interactive chat session with the Llama model in the console.
    The model's responses are streamed in real time.
    
    Parameters
    ----------
    llm : Llama
        The Llama instance to use for the interactive chat.
    system_prompt : str, optional
        The system prompt that sets the context or persona of the assistant
        (default: "You are a helpful assistant.").
    max_tokens : int, optional
        Maximum number of tokens to generate in each response (default: 512).
    temperature : float, optional
        Sampling temperature. Higher values produce more random outputs
        (default: 0.8).
    top_p : float, optional
        Nucleus sampling probability threshold (default: 0.95).
    top_k : int, optional
        Th
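To give an idea of what such a helper does under the hood, here is a minimal sketch built on the `create_chat_completion` method of `llama-cpp-python`. This is not the code from `helper_functions.py`: the function name `generate_response_sketch` and the `EchoModel` stand-in are assumptions made for illustration, and the stand-in only exists so the example runs without loading a model.

```python
def generate_response_sketch(llm, messages, max_tokens=128, temperature=0.8, stream=False, **kwargs):
    """Return the assistant's reply as a string; print tokens as they arrive when streaming."""
    if not stream:
        result = llm.create_chat_completion(
            messages=messages, max_tokens=max_tokens, temperature=temperature, stream=False, **kwargs
        )
        return result["choices"][0]["message"]["content"]

    pieces = []
    for chunk in llm.create_chat_completion(
        messages=messages, max_tokens=max_tokens, temperature=temperature, stream=True, **kwargs
    ):
        # Streaming chunks carry a "delta"; some chunks have no "content" key
        text = chunk["choices"][0]["delta"].get("content")
        if text:
            print(text, end="", flush=True)
            pieces.append(text)
    print()
    return "".join(pieces)


class EchoModel:
    """Hypothetical stand-in for a loaded Llama object, so the sketch runs without a GPU."""

    def create_chat_completion(self, messages, stream=False, **kwargs):
        text = "echo: " + messages[-1]["content"]
        if stream:
            return iter({"choices": [{"delta": {"content": word + " "}}]} for word in text.split())
        return {"choices": [{"message": {"role": "assistant", "content": text}}]}


reply = generate_response_sketch(EchoModel(), [{"role": "user", "content": "Tell me a joke."}])
print(reply)
```

With a real model, you would pass the `llm` object loaded in section 2 instead of `EchoModel()`.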

# 2. Loading our model

## 2.1 Setting your Huggingface token as a Colab Secret

Some repositories on `huggingface 🤗` are gated, which means you need to request access to be able to download the models. In order to access these models via code, make sure to add the `HF_TOKEN` to your Colab secrets.

For a quick guide on how to do this, [click here](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75).

If no `HF_TOKEN` is set you will receive a warning, even when downloading models that are not gated. In that case you can safely ignore it.
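If you want to check whether your secret is visible to the notebook, a minimal sketch is shown below. It uses `google.colab.userdata`, which is only available inside Colab, and falls back to an environment variable elsewhere; `get_hf_token` is a hypothetical helper name, and the broad `except` is a simplification for this sketch.

```python
import os

def get_hf_token():
    """Return the HF_TOKEN from Colab secrets, or from the environment outside Colab."""
    try:
        from google.colab import userdata  # only importable inside Google Colab
    except ImportError:
        return os.environ.get("HF_TOKEN")
    try:
        return userdata.get("HF_TOKEN")
    except Exception:
        # Secret missing, or notebook access to the secret not granted
        return None

token = get_hf_token()
print("HF_TOKEN found" if token else "No HF_TOKEN set; public models can still be downloaded")
```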


## 2.2 Model selection and download

Select a model of your choosing on Hugging Face and pass it to the `Llama.from_pretrained` function below.

By default we use the `Meta-Llama-3.1-8B Q6 quant`, as this strikes a nice balance between quality and speed.

In case you want a smaller and faster model, you can select `Microsoft Phi-3.5` using:

```
llm = Llama.from_pretrained(
    # Huggingface repo name
    repo_id="bartowski/Phi-3.5-mini-instruct-GGUF",
    # select the quant file within the repo you want '*' is a wildcard selector
    filename="*Q6_K.gguf",
    n_gpu_layers=-1,
    n_ctx=65536, # this is 100 A4 pages of context window!
    verbose=False
)
```
If instead you want to use one of the more powerful models you can still run on Colab, consider using `Mistral-Small-24B-Instruct-2501-GGUF`:

```
llm = Llama.from_pretrained(
    # Huggingface repo name
    repo_id="bartowski/Mistral-Small-24B-Instruct-2501-GGUF",
    # select the quant file within the repo you want '*' is a wildcard selector
    filename="*Q4_0.gguf",
    n_gpu_layers=-1,
    n_ctx=4096, # this is 6 A4 pages of context window
    verbose=False
)
```
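The `filename` argument in the examples above relies on a wildcard. Assuming it is matched with shell-style globbing, the selection can be illustrated with Python's `fnmatch` on a made-up file listing. The list below is illustrative and not fetched from the repository; `select_files` is a hypothetical helper name.

```python
from fnmatch import fnmatch

# Illustrative listing of files in a GGUF repo (not fetched from Hugging Face)
repo_files = [
    "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf",
    "Meta-Llama-3.1-8B-Instruct-Q6_K.gguf",
    "Meta-Llama-3.1-8B-Instruct-Q6_K_L.gguf",
    "README.md",
]

def select_files(files, pattern):
    """Return the files whose names match a shell-style wildcard pattern."""
    return [name for name in files if fnmatch(name, pattern)]

matches = select_files(repo_files, "*Q6_K.gguf")
print(matches)
```

Note that the pattern has to end in `Q6_K.gguf`, so a variant such as `Q6_K_L` is not selected. If a pattern matches several files, it is safer to make it more specific.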

By default this notebook uses a medium-sized Meta model called `Meta-Llama-3.1-8B`. Please note that the download will take anywhere between 2 and 8 minutes.




In [8]:
from llama_cpp.llama import Llama

# Load your LLM
llm = Llama.from_pretrained(
    # Huggingface repo name
    repo_id="bartowski/Meta-Llama-3.1-8B-Instruct-GGUF",
    # select the quant file within the repo you want '*' is a wildcard selector
    filename="*Q6_K.gguf",
    n_gpu_layers=-1,
    n_ctx=32768, # this is 50 A4 pages of context window!
    verbose=False
)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Meta-Llama-3.1-8B-Instruct-Q6_K.gguf:   0%|          | 0.00/6.60G [00:00<?, ?B/s]

# 3. Running models

## 3.1 Non-interactive responses
Best for tasks that only require a single response, with no back-and-forth interaction,\
e.g. generating summaries, translations, etc.

In [10]:
# Llama was already imported when loading the model; repeated so this cell runs on its own
from llama_cpp.llama import Llama

# Specify the system message
system_role = """
  You are a helpful assistant. Your task is to write short rhymes about the user input topic.
"""

# Provide your specific input
user_question = """
  A cat sleeping on the computer keyboard is kneading with its paws, accidentally talking to an llm who is responding to the random keystrokes of the kitten.
"""

messages = [
    {"role": "system", "content": system_role},
    {"role": "user", "content": user_question}
]

generate_response(llm, messages)


"In silicon halls, a sight so rare,\nA kitty slept, with paws in air.\nKneading keys, with gentle touch,\nA conversation started, by random clutch.\n\nThe LLn's response, came swift and bright,\nTo 'meow', 'mew' and 'purrs', in digital light.\nIt asked the cat, of its feline delight,\nBut got a jumbled mix, of keyboard fright.\n\nThe kitten purred, as keys did clack,\nAnd the LLn typed on, with digital track.\nA strange dialogue, for all to see,\nBetween a cat and AI, wild and carefree"

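When you want to run the same task on several inputs, it can help to build the message list with a small function. The sketch below is one possible way to do that; `build_messages` is a hypothetical helper and not part of the repo, and the calls that need a loaded model are left as comments.

```python
from textwrap import dedent

def build_messages(system_role, user_input):
    """Build a chat message list, stripping the indentation of triple-quoted strings."""
    return [
        {"role": "system", "content": dedent(system_role).strip()},
        {"role": "user", "content": dedent(user_input).strip()},
    ]

system_role = """
  You are a helpful assistant. Your task is to write short rhymes about the user input topic.
"""

topics = ["a rainy Monday", "a bug that only appears in production"]

for topic in topics:
    messages = build_messages(system_role, topic)
    print(messages[1]["content"])
    # With a loaded model you could then call:
    # response = generate_response(llm, messages)
    # print(response)  # print() renders the "\n" characters as line breaks
```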
## 3.2 Interactive mode
Use this if you want to have a functional chat with an LLM.
It is a very basic 'ChatGPT'-style interface that reads your input from the keyboard and prints responses via streaming.

*Type `exit` to leave the chat*

In [None]:
interactive_chat(llm, system_prompt="You are chad gippity, a helpful assistant.")


=== Interactive Chat ===
Type 'exit' (without quotes) to exit.

🧑‍💻 user: hello, who are you?
✨ llm: Hello! I'm Chad Gippity, your friendly and helpful assistant. It's nice to meet you. I'm here to assist with any questions or topics you'd like to discuss. What brings you here today? Would you like some advice, information on a specific subject, or maybe just have a chat? Let me know how I can help!
──────────────────────────────────────────────────
🧑‍💻 user: exit

🚪 Exiting chat.
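The core idea behind a chat session like the one above is that every turn is appended to a growing message history, which is sent to the model again on each turn. A minimal sketch of that loop is shown below. It is not the implementation of `interactive_chat`: `chat_loop` and `fake_reply` are hypothetical names, and the inputs are scripted so the example runs without a keyboard or a model.

```python
def chat_loop(reply_fn, user_inputs, system_prompt="You are a helpful assistant.", stop_str="exit"):
    """Run a chat over scripted inputs and return the full message history."""
    history = [{"role": "system", "content": system_prompt}]
    for user_text in user_inputs:
        if user_text.strip().lower() == stop_str:
            break
        history.append({"role": "user", "content": user_text})
        # The model sees the whole history, which is how it remembers earlier turns
        answer = reply_fn(history)
        history.append({"role": "assistant", "content": answer})
    return history

def fake_reply(history):
    """Stand-in for a model call; echoes the last user message."""
    return f"(reply to: {history[-1]['content']})"

history = chat_loop(fake_reply, ["hello, who are you?", "exit", "never reached"])
for message in history:
    print(f"{message['role']}: {message['content']}")
```

Because the history keeps growing, long conversations eventually fill the context window set by `n_ctx`.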
