# Week 4 Workalong


This week we are going to look at running a local LLM in the Colab environment. What we are going to do is a bit similar to what happens when you use ChatGPT in your browser: similar, but not quite the same. We'll use different models downloaded from [HuggingFace](), which is a repository of LLMs and similar tools. We'll use the library [Llama cpp]() to interact with the model.

For homework this week I'll ask you to pick a model and recreate a lot of what we are doing here as a way of testing the capabilities of different LLMs.

We will need to be patient with this week's notebooks. The files we are working with are pretty large, and downloading and running them will take more time than we've needed before.


The impact LLMs are having on how we do work is still being understood. Take this study from [Microsoft](https://www.404media.co/microsoft-study-finds-ai-makes-human-cognition-atrophied-and-unprepared-3/)...


https://colab.research.google.com/github/R3gm/InsightSolver-Colab/blob/main/LLM_Inference_with_llama_cpp_python__Llama_2_13b_chat.ipynb#scrollTo=R76uxL293jTc


# Switch to GPU Runtime

Since Google Colab gives us free access to a notebook with GPU power, let's switch to that instead.
1. Click the dropdown arrow underneath your picture
1. _Change Runtime Type_
1. _T4 GPU_
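
Once the runtime has been switched, it can be handy to confirm from Python that a GPU is actually attached. The cell below is a minimal sketch, not part of the original workalong: the helper name `gpu_available` is made up for illustration, and it assumes the `nvidia-smi` tool is on the path whenever a GPU runtime is active.

```python
import shutil
import subprocess


def gpu_available():
    """Return True if nvidia-smi exists and lists at least one GPU."""
    # On a CPU-only runtime the nvidia-smi tool is usually missing entirely.
    if shutil.which("nvidia-smi") is None:
        return False
    # "nvidia-smi -L" prints one line per GPU, e.g. "GPU 0: Tesla T4 (...)".
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    return result.returncode == 0 and "GPU" in result.stdout


print("GPU runtime detected:", gpu_available())
```

If this prints `False`, go back through the three steps above and check that the runtime type was saved.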


# Installing Extra Libraries

As our Week 4 Warm Up activity showed us, sometimes we need to use _pip_ to install extra libraries before we __import__ them. That is what is happening in the next two cells.

In [None]:
#Run shell command to see details about GPU connected to this Colab runtime
!nvidia-smi

Tue Feb 25 19:52:34 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   47C    P8              9W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
# We will need llama-cpp-python and huggingface_hub before we begin.
# llama-cpp-python is not among the core libraries in Colab, so we build it here
# This will take 4 minutes or so to run
# llama install details from: https://github.com/abetlen/llama-cpp-python/issues/1366


!pip install scikit-build-core==0.9.0
!CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=61" FORCE_CMAKE=1 pip install llama-cpp-python==0.2.62 --force-reinstall --upgrade --no-cache-dir --verbose --no-build-isolation


Collecting scikit-build-core==0.9.0
  Downloading scikit_build_core-0.9.0-py3-none-any.whl.metadata (19 kB)
Collecting pathspec>=0.10.1 (from scikit-build-core==0.9.0)
  Downloading pathspec-0.12.1-py3-none-any.whl.metadata (21 kB)
Downloading scikit_build_core-0.9.0-py3-none-any.whl (151 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 151.4/151.4 kB 6.6 MB/s eta 0:00:00
Downloading pathspec-0.12.1-py3-none-any.whl (31 kB)
Installing collected packages: pathspec, scikit-build-core
Successfully installed pathspec-0.12.1 scikit-build-core-0.9.0
Using pip 24.1.2 from /usr/local/lib/python3.11/dist-packages/pip (python 3.11)
Collecting llama-cpp-python==0.2.62
  Downloading llama_cpp_python-0.2.62.tar.gz (37.5 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 37.5/37.5 MB 7.6 MB/s eta 0:00:00
  Running command Preparing metadata (pyproject.toml)
  *** scikit-build-core 0.9.0 using CMake 3.31.4 (meta
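
Because the build takes several minutes, it is worth checking that the install actually finished before moving on. The cell below is a small sketch using Python's standard `importlib.metadata` module; the helper name `installed_version` is an assumption made for this example and is not part of either library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Print what is installed for the two libraries this notebook relies on.
for name in ("llama-cpp-python", "huggingface_hub"):
    print(name, "->", installed_version(name))
```

A value of `None` for either package suggests the install cell should be run again.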


# Import our Libraries

Now that we have the extra parts installed, let's pull in those libraries along with everything else we need.

In [None]:
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

import os

In [None]:
# @title What Model to Use {"run":"auto","vertical-output":true}
model_name = "Qwen/Qwen2-7B-Instruct-GGUF" # @param {"type":"string"}
model_basename = "qwen2-7b-instruct-q4_k_m.gguf" # @param {"type":"string"}

print("Model choice updated!")
print(" ")
print("> ",model_name)
print("> ",model_basename)

Model choice updated!
 
>  Qwen/Qwen2-7B-Instruct-GGUF
>  qwen2-7b-instruct-q4_k_m.gguf
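
When you pick your own model for the homework, you will need the exact `.gguf` filename from the repository. One possible way to find it is sketched below. It assumes the `list_repo_files` function from `huggingface_hub`; the helper names `gguf_files` and `list_gguf_files` and the example file listing are hypothetical, made up for illustration.

```python
def gguf_files(filenames):
    """Keep only the GGUF model files from a list of repository filenames."""
    return sorted(name for name in filenames if name.lower().endswith(".gguf"))


def list_gguf_files(repo_id):
    """List the GGUF files in a Hugging Face repository (needs network access)."""
    # Imported here so the filter above can be tried without the library.
    from huggingface_hub import list_repo_files
    return gguf_files(list_repo_files(repo_id))


# Made-up file listing, for illustration only.
example = ["README.md", "model-q4_k_m.gguf", "model-q8_0.gguf", "LICENSE"]
print(gguf_files(example))

# In Colab, with the form values set above:
# print(list_gguf_files(model_name))
```

Any filename printed by `list_gguf_files` can be pasted into the `model_basename` field.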


In [None]:
# This cell will take about 2-3 minutes to run, depending
# on how large your model is and how good your bandwidth is
model_path = hf_hub_download(repo_id=model_name, filename=model_basename)
llm = Llama(model_path=model_path)
print("Reading!")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
llama_model_loader: loaded meta data with 26 key-value pairs and 339 tensors from /root/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct-GGUF/snapshots/c3024c6fff0a02d52119ecee024bbb93d4b4b8e4/qwen2-7b-instruct-q4_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.name str              = qwen2-7b-instruct
llama_model_loader: - kv  

Reading!


AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
Model metadata: {'quantize.imatrix.entries_count': '196', 'quantize.imatrix.dataset': '../sft_2406.txt', 'quantize.imatrix.chunks_count': '1937', 'quantize.imatrix.file': '../Qwen2/gguf/qwen2-7b-imatrix/imatrix.dat', 'tokenizer.ggml.add_bos_token': 'false', 'tokenizer.ggml.bos_token_id': '151643', 'general.architecture': 'qwen2', 'qwen2.block_count': '28', 'qwen2.context_length': '32768', 'tokenizer.chat_template': "{% for message in messages %}{% if loop.first and messages[0]['role'] != 'system' %}{{ '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n' }}{% endif %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}", 'qwen2.attention.hea
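
The loading cell above passes only `model_path` to `Llama`. In llama-cpp-python the `n_gpu_layers` argument defaults to 0, which keeps every layer on the CPU, so the T4 may go unused. The sketch below shows one way to collect the constructor arguments for a GPU run. The helper `build_llama_kwargs` and the placeholder path are assumptions for illustration, and whether offloading actually works depends on the CUDA build in the install cell having succeeded.

```python
def build_llama_kwargs(model_path, use_gpu=True, n_ctx=2048):
    """Collect keyword arguments for llama_cpp.Llama (illustrative helper)."""
    return {
        "model_path": model_path,
        # -1 asks llama-cpp-python to offload all layers; 0 keeps them on the CPU.
        "n_gpu_layers": -1 if use_gpu else 0,
        # Context window size in tokens; larger values use more memory.
        "n_ctx": n_ctx,
        # Silence the long loader log shown above.
        "verbose": False,
    }


kwargs = build_llama_kwargs("/path/to/model.gguf")  # placeholder path
print(kwargs)

# In the notebook, after the download cell has run:
# llm = Llama(**build_llama_kwargs(model_path))
```

If generation still feels slow, rerun `!nvidia-smi` while the model is loaded and check whether the memory usage has gone up.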


# Ask a question

To start, we'll just ask a simple question.

In [None]:
prompt = "Write a python script that will load a CSV file into a pandas dataframe"

prompt_template=f'''SYSTEM: You are a helpful, respectful and honest assistant. Always answer as helpfully as possible.

USER: {prompt}

ASSISTANT:
'''
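
Note that `prompt_template` is defined here but the next cell sends a different, hard-coded prompt. If you want to send the templated prompt instead, something like the sketch below could work. The helper `build_prompt` is made up for this example, and the `max_tokens` and `stop` values in the commented call are guesses that you would tune for your model.

```python
def build_prompt(user_prompt, system_prompt="You are a helpful, respectful and honest assistant."):
    """Wrap a question in the SYSTEM / USER / ASSISTANT layout used above."""
    return f"SYSTEM: {system_prompt}\n\nUSER: {user_prompt}\n\nASSISTANT:\n"


full_prompt = build_prompt("Write a python script that will load a CSV file into a pandas dataframe")
print(full_prompt)

# In the notebook, once llm is loaded:
# output = llm(full_prompt, max_tokens=256, stop=["USER:"])
# print(output["choices"][0]["text"])
```

Stopping on `"USER:"` is one way to keep the model from inventing a follow-up question for itself.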

In [None]:
output = llm(
    "Q: Name the planets in the solar system? A: ",  # Prompt
    max_tokens=32,  # Generate up to 32 tokens; set to None to generate up to the end of the context window
    stop=["Q:", "\n"],  # Stop generating just before the model would generate a new question
    echo=True,  # Echo the prompt back in the output
)  # Generate a completion; can also call create_completion
print(output)


llama_print_timings:        load time =    5469.80 ms
llama_print_timings:      sample time =      20.69 ms /     9 runs   (    2.30 ms per token,   435.01 tokens per second)
llama_print_timings: prompt eval time =    5469.64 ms /    13 tokens (  420.74 ms per token,     2.38 tokens per second)
llama_print_timings:        eval time =    4709.80 ms /     8 runs   (  588.73 ms per token,     1.70 tokens per second)
llama_print_timings:       total time =   10359.57 ms /    21 tokens


{'id': 'cmpl-53d98e1c-67aa-4023-9786-ed2246abe4c3', 'object': 'text_completion', 'created': 1740513716, 'model': '/root/.cache/huggingface/hub/models--Qwen--Qwen2-7B-Instruct-GGUF/snapshots/c3024c6fff0a02d52119ecee024bbb93d4b4b8e4/qwen2-7b-instruct-q4_k_m.gguf', 'choices': [{'text': 'Q: Name the planets in the solar system? A:  The answer to your question is as follows:', 'index': 0, 'logprobs': None, 'finish_reason': 'stop'}], 'usage': {'prompt_tokens': 13, 'completion_tokens': 9, 'total_tokens': 22}}
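
The printed result is a plain Python dictionary, so the answer can be pulled out by key. The sketch below uses a trimmed copy of the output above rather than a live model. The helper `completion_text` is made up for illustration: it removes the echoed prompt (present because of `echo=True`) and returns only the generated text. The answer here is cut short most likely because `"\n"` is in the `stop` list, so generation ended at the first line break.

```python
def completion_text(output, prompt=""):
    """Return the generated text from a completion dict, without the echoed prompt."""
    text = output["choices"][0]["text"]
    # With echo=True the prompt is repeated at the start of the text.
    if prompt and text.startswith(prompt):
        text = text[len(prompt):]
    return text.strip()


# Trimmed copy of the dictionary printed above.
sample_output = {
    "choices": [{
        "text": "Q: Name the planets in the solar system? A:  The answer to your question is as follows:",
        "index": 0,
        "logprobs": None,
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 13, "completion_tokens": 9, "total_tokens": 22},
}

prompt = "Q: Name the planets in the solar system? A: "
print(completion_text(sample_output, prompt))
print("finish_reason:", sample_output["choices"][0]["finish_reason"])
print("completion tokens:", sample_output["usage"]["completion_tokens"])
```

Removing `"\n"` from `stop` and raising `max_tokens` would be a reasonable next experiment if you want the model to actually list the planets.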


# All things told

I hope this introduction to LLMs has opened your eyes to the fact that there are many, many different LLMs out there.

As a next step, I would encourage you to install something like [Anaconda](), a Python distribution that includes Jupyter Notebook and runs on your local computer. Then you can really leverage the horsepower you have at hand and not rely on the free version of Colab. If that doesn't work, you might want to try [Colab Pro](https://colab.research.google.com/signup?utm_source=footer&utm_medium=link&utm_campaign=footer_links). In Canada, if you are affiliated with DRAC, you can use a hosted version of Jupyter Notebooks called [Syzygy](https://syzygy.ca/).
