# General notebook for assignments
---

Part of the [Masterclass: Large Language Models for Data Science](https://github.com/avvorstenbosch/Masterclass-LLMs-for-Data-Science)
![](https://raw.githubusercontent.com/avvorstenbosch/Masterclass-LLMs-for-Data-Science/refs/heads/main/slides/day_1/figures/pair-programming-with-llms.webp)
by Alex van Vorstenbosch
---

This is a general notebook for the assignments. You can use it to interact with open-source LLMs instead of closed-source models accessed via APIs.

Google Colab provides both CPU and GPU compute resources. On the free tier, users get access to an Nvidia T4 (2018), which has:
* 16 [GiB](https://www.techtarget.com/searchstorage/definition/gibibyte-GiB) VRAM
* 2560 CUDA cores

Access is time-limited, but overall these limits are generous.
If required, upgrading to a subscription will provide more compute credits.

You can run open-source LLMs of up to **~7B parameters**. These days LLMs are typically run in half precision (FP16: 2 bytes/16 bits per parameter) rather than full precision (FP32: 4 bytes/32 bits per parameter). This means the weights of a 7B model use 14 GB of VRAM. On top of that, you need to account for the context window you are using: VRAM usage scales linearly with the size of the context window because of the [KV-cache](https://arxiv.org/pdf/2412.19442). For a **~7B** model, the context window will be limited to **~8192 tokens**.
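
The memory arithmetic above can be sketched in a few lines. This is a rough estimate, not a measurement: the layer count, number of KV heads, and head dimension used below are illustrative assumptions for a 7B-class model and are not taken from any specific model card.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


def kv_cache_memory_gb(n_tokens: int, n_layers: int, n_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB.

    Per token, every layer stores one key and one value vector
    (hence the factor 2) for each KV head.
    """
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token_bytes / 1e9


weights_gb = weight_memory_gb(7e9, bits_per_param=16)  # 7B parameters in FP16

# Assumed (hypothetical) architecture values, for illustration only
cache_gb = kv_cache_memory_gb(8192, n_layers=32, n_kv_heads=8, head_dim=128)

print(f"weights:  {weights_gb:.1f} GB")
print(f"KV-cache: {cache_gb:.2f} GB for 8192 tokens")
print(f"total:    {weights_gb + cache_gb:.2f} GB")
```

Doubling `n_tokens` doubles the cache estimate, which is the linear scaling described above. The real limit also depends on activation buffers and runtime overhead, which this sketch ignores.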

Luckily, researchers figured out that we can quantize the model weights to even smaller sizes such as Q4 (4 bits per parameter) while retaining much of the original model performance. Going below Q4 is typically not recommended. Quantization allows us to use models beyond 7B, but introduces a trade-off between speed/memory usage and quality. As a rule of thumb: for output quality it is better to use a bigger quantized model than a smaller model without quantization. For online use I would always recommend using a quant, even if just to save on download time. Q6 shows very little degradation in quality, but is a factor 2.67 smaller than FP16.
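
As a quick check of these numbers, the sketch below computes nominal weight sizes at different bit widths. It assumes exactly 4, 6, or 16 bits per parameter; real GGUF files are somewhat larger because quantized formats also store scaling factors alongside the weights.

```python
def nominal_size_gb(n_params: float, bits_per_param: int) -> float:
    """Nominal weight size in GB (10^9 bytes), ignoring format overhead."""
    return n_params * bits_per_param / 8 / 1e9


# Size reduction of Q6 relative to FP16
reduction = 16 / 6
print(f"Q6 vs FP16: factor {reduction:.2f} smaller")

# Nominal sizes for the medium (12B) and large (24B) model suggestions
for name, n_params in [("12B", 12e9), ("24B", 24e9)]:
    for label, bits in [("FP16", 16), ("Q6", 6), ("Q4", 4)]:
        print(f"{name} at {label}: {nominal_size_gb(n_params, bits):.1f} GB")
```

Under these assumptions a 24B model at Q4 needs roughly the same memory as a 6B model at FP16, which is why quantization makes the larger suggestions feasible on a T4.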

Here are a few models you can try on Colab. Please note that new models come out all the time, so some of these suggestions might be outdated by the time you read this!

**Some model suggestions:**
* tiny: 3.8B - [Microsoft Phi-4-mini quantized](https://huggingface.co/unsloth/Phi-4-mini-instruct-GGUF)

* medium: 12B - [Google Gemma-3 12B-Instruction Tuned quantized](https://huggingface.co/unsloth/gemma-3-12b-it-GGUF)

* large: 24B - [Mistral small-3.2-24B-2506 quantized](https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF)

* SFT reasoning - [Deepseek R1-0528-Qwen-8B quantized](https://huggingface.co/unsloth/DeepSeek-R1-0528-Qwen3-8B-GGUF)

---






# 1. Setup

First we need to install the necessary Python packages.
Luckily for us, Google Colab comes with most of the libraries and the required CUDA software pre-installed.

## 1.1 Runtime
---

We will want to use a GPU to run the examples in this notebook. In Google Colab, go to the menu bar:


**Menu bar > Runtime > Change runtime type > T4 GPU**
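
To confirm that a GPU is actually attached after changing the runtime, you can query `nvidia-smi`. The helper below is a minimal sketch; the function name `gpu_memory_report` is my own and not part of the course helpers.

```python
import shutil
import subprocess


def gpu_memory_report():
    """Return GPU name and memory usage as text, or None if no GPU tooling is found."""
    if shutil.which("nvidia-smi") is None:
        return None  # nvidia-smi is missing, so most likely no GPU runtime is selected
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,memory.used,memory.total",
         "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout.strip() if result.returncode == 0 else None


report = gpu_memory_report()
print(report or "No GPU detected: check Runtime > Change runtime type.")
```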

---

## 1.2 Install packages
Run the cell below to install `llama-cpp-python`, which allows fast inference on GPU and CPU with GGUF quantized models.



In [None]:
# %% capture
!pip install --no-cache-dir llama-cpp-python==0.3.16 --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu122

## (Optional) 1.3 Clear GPU VRAM

In [None]:
# ## Uncomment and run this cell if you need to clear the GPU memory!
# import gc
# import torch

# # Sometimes a second run of this cell is needed to clear the memory.
# # In that case, comment out the `del llm` line below, because `llm` no longer exists after the first run.
# del llm

# gc.collect()
# torch.cuda.empty_cache()

## 1.4 Load helper functions from github

In the repo I have included two helper functions for talking to LLMs:

1. generate_response\
   Generate a response given a set of chat messages, with optional streaming behavior.

2. interactive_chat\
   Allows the user to engage in an interactive chat session with the model (streaming by default).


In [None]:
!curl -o helper_functions.py https://raw.githubusercontent.com/avvorstenbosch/Masterclass-LLMs-for-Data-Science/refs/heads/main/exercises/day_1/helper_functions/helper_functions.py

In [None]:
from helper_functions import *


In [None]:
help(generate_response)

In [None]:
help(interactive_chat)

# 2. Loading our model

## 2.1 Setting your Huggingface token as a Colab Secret

Some repositories on `huggingface 🤗` are gated, which means you need to request access before you can download the models. To access these models via code, make sure to add your `HF_TOKEN` to your Colab secrets.

To find a quick guide for how to do this, [click here](https://medium.com/@parthdasawant/how-to-use-secrets-in-google-colab-450c38e3ec75).

If no `HF_TOKEN` is set you will receive a warning, even when downloading models that are not gated. You can safely ignore it.
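
If you want to check from code whether the token is available, a minimal sketch could look like this. It assumes the secret is named `HF_TOKEN` and has notebook access enabled; the helper name `get_hf_token` is hypothetical. Outside Colab it falls back to an environment variable of the same name.

```python
import os


def get_hf_token(name="HF_TOKEN"):
    """Return the token from Colab secrets, else from the environment, else None."""
    try:
        # Only importable inside Google Colab
        from google.colab import userdata
        return userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing or not shared with this notebook
        return os.environ.get(name)


token = get_hf_token()
print("HF_TOKEN found" if token else "No HF_TOKEN set (fine for non-gated models)")
```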


## 2.2 Model selection and download

Select a Hugging Face model of your choosing in the `Llama.from_pretrained` call below.

By default we use the `Google Gemma 3 12B GGUF` model, as this strikes a nice balance between quality and speed.

In case you want a smaller and faster model, you can select `Microsoft Phi-4-mini` using:

```
llm = Llama.from_pretrained(
    # Huggingface repo name
    repo_id="unsloth/Phi-4-mini-instruct-GGUF",
    # select the quant file within the repo you want '*' is a wildcard selector
    filename="*Q6_K.gguf",
    n_gpu_layers=-1,
    n_ctx=32518, # this is 50 A4 pages of context window!
    verbose=False
)
```
If instead you want to use the largest model suggested here, consider using `Mistral-Small-3.2-24B-Instruct-2506-GGUF`:

```
llm = Llama.from_pretrained(
    # Huggingface repo name
    repo_id="unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF",
    # select the quant file within the repo you want '*' is a wildcard selector
    filename="*Q4_0.gguf",
    n_gpu_layers=-1,
    n_ctx=4096, # this is 6 A4 pages of context window
    verbose=False
)
```
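
The `filename` argument in these examples is a wildcard pattern. The sketch below shows how such a pattern selects one quant file, assuming shell-style glob matching as implemented by Python's `fnmatch`. The file names are hypothetical and only illustrate the naming scheme; check the repository's file list for the real ones.

```python
from fnmatch import fnmatch

# Hypothetical file listing of a GGUF repository
repo_files = [
    "gemma-3-12b-it-Q4_0.gguf",
    "gemma-3-12b-it-Q4_K_M.gguf",
    "gemma-3-12b-it-Q6_K.gguf",
    "gemma-3-12b-it-Q8_0.gguf",
]


def match_files(pattern, files):
    """Return all files whose name matches the glob pattern."""
    return [f for f in files if fnmatch(f, pattern)]


print(match_files("*Q6_K.gguf", repo_files))  # exactly one match: unambiguous
print(match_files("*Q4*", repo_files))        # two matches: too broad a pattern
```

A pattern that matches more than one file is ambiguous, so it is safest to include the full quant suffix, as the examples above do.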

By default this notebook uses a medium-sized Google model called `gemma-3-12b-it-GGUF`. Please note that the download will take anywhere between 2 and 8 minutes.




In [None]:
from llama_cpp.llama import Llama

# Load your LLM
llm = Llama.from_pretrained(
    # Huggingface repo name
    repo_id="unsloth/gemma-3-12b-it-GGUF",
    # select the quant file within the repo you want '*' is a wildcard selector
    filename="*Q6_K.gguf",
    n_gpu_layers=-1,
    n_ctx=32518, # this is 50 A4 pages of context window!
    verbose=False
)

# 3. Running models

## 3.1 Non-interactive responses
Best for tasks that only require a single response, without back-and-forth interaction,\
e.g. generating summaries or translations.

In [None]:
# Define the conversation (system message + user input) and generate a single response.
# `llm` was loaded in section 2.2, so no additional import is needed here.

# specify the system message
system_role = """
  You are a helpful assistant. Your task is to write short rhymes about the user input topic.
"""

# Provide your specific input
user_question = """
  A cat sleeping on the computer keyboard is kneading with its paws, accidentally talking to an LLM that is responding to the random keystrokes of the kitten.
"""

messages = [
    {"role": "system", "content": system_role},
    {"role": "user", "content": user_question}
]

generate_response(llm, messages)
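
The `messages` list used above follows the common chat format: a list of dictionaries with a `role` and a `content` key. If you reuse this pattern often, a small builder keeps prompts tidy. This is a sketch; `build_messages` is a hypothetical helper and not part of `helper_functions.py`.

```python
def build_messages(system_role, user_question, history=None):
    """Assemble a chat message list: system message, optional earlier turns, new user input."""
    messages = [{"role": "system", "content": system_role.strip()}]
    messages.extend(history or [])  # earlier user/assistant turns, if any
    messages.append({"role": "user", "content": user_question.strip()})
    return messages


messages = build_messages(
    "You are a helpful assistant that translates English to Dutch.",
    "  Good morning!  ",
)
print(messages)

# In this notebook you would then call:
# generate_response(llm, messages)
```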


## 3.2 Interactive mode
Use this if you want to have a functional chat with an LLM.
This is a very basic 'ChatGPT'-style interface that reads your input from the keyboard and prints responses via streaming.

*Type `exit` to leave the chat*

In [None]:
interactive_chat(llm, system_prompt="You are chad gippity, a helpful assistant.")
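
To illustrate what an interactive session does conceptually, here is a minimal sketch of a chat loop that keeps the conversation history and stops on `exit`. It is not the actual implementation of `interactive_chat`. The model is replaced by a stand-in `echo_reply` function so the example runs without a GPU, and all names are hypothetical.

```python
def chat_loop(reply_fn, user_inputs, system_prompt):
    """Run a chat over a sequence of user inputs and return the full message history."""
    history = [{"role": "system", "content": system_prompt}]
    for text in user_inputs:
        if text.strip().lower() == "exit":
            break  # same convention as above: typing `exit` ends the chat
        history.append({"role": "user", "content": text})
        answer = reply_fn(history)  # the model sees the whole history on every turn
        history.append({"role": "assistant", "content": answer})
    return history


def echo_reply(history):
    """Stand-in for a model call: repeats the last user message."""
    return "You said: " + history[-1]["content"]


history = chat_loop(echo_reply, ["hello", "exit", "never reached"],
                    system_prompt="You are a helpful assistant.")
for message in history:
    print(f"{message['role']}: {message['content']}")
```

With a loaded model, `reply_fn` could wrap a call such as `llm.create_chat_completion(messages=history)["choices"][0]["message"]["content"]`. Because the full history is sent on every turn, long chats consume more of the context window set by `n_ctx`.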