# Another way to call LLMs: directly on your box using llama.cpp

This notebook runs inference for models directly on your local box, using the open source C++ library llama.cpp.

In [None]:
# First, install llama-cpp-python, the Python bindings for llama.cpp. If the install fails, try searching for the error message, or contact me

!pip install llama-cpp-python

In [None]:
# If this import fails, some investigating may be required.

from llama_cpp import Llama

## First, download a model locally

The llama.cpp library uses its own model format called GGUF.

Here are all the HuggingFace models that can be downloaded as a GGUF file:
https://huggingface.co/models?library=gguf

For this notebook, I downloaded 3 models to try. For each of these models, click download, and move the file from your downloads folder into the `model_cache` folder in this directory (which is .gitignored).

First, this 4-bit quantized build of Microsoft's Phi-3 mini:
https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf

Then, this 4-bit quantized build of Qwen2 7B Instruct:
https://huggingface.co/Qwen/Qwen2-7B-Instruct-GGUF/blob/main/qwen2-7b-instruct-q4_k_m.gguf

Finally, I chose a 4-bit quantized build of the 3B StarCoder2 model for some coding inference:
https://huggingface.co/second-state/StarCoder2-3B-GGUF/blob/main/starcoder2-3b-Q4_K_M.gguf
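
If you would rather script the download than click through the browser, here is a minimal sketch using only the standard library. It assumes that swapping `/blob/` for `/resolve/` in a Hugging Face file URL gives the direct-download link, and the helper names `to_download_url` and `download_model` are made up for this example. The download loop is left commented out because the three files add up to several GB.

```python
from pathlib import Path
from urllib.request import urlretrieve

# The three file pages listed above
MODEL_PAGES = [
    "https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-q4.gguf",
    "https://huggingface.co/Qwen/Qwen2-7B-Instruct-GGUF/blob/main/qwen2-7b-instruct-q4_k_m.gguf",
    "https://huggingface.co/second-state/StarCoder2-3B-GGUF/blob/main/starcoder2-3b-Q4_K_M.gguf",
]


def to_download_url(page_url: str) -> str:
    """Turn a file-page URL (/blob/) into a direct-download URL (/resolve/)."""
    return page_url.replace("/blob/", "/resolve/", 1)


def download_model(page_url: str, cache_dir: str = "model_cache") -> Path:
    """Download one GGUF file into cache_dir, skipping files already present."""
    target = Path(cache_dir) / page_url.rsplit("/", 1)[-1]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(to_download_url(page_url), str(target))
    return target


# Uncomment to fetch all three models:
# for url in MODEL_PAGES:
#     download_model(url)
```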

In [None]:
# Here is where my GGUF files are located

phi3_model_path = "model_cache/Phi-3-mini-4k-instruct-q4.gguf"
qwen2_model_path = "model_cache/qwen2-7b-instruct-q4_k_m.gguf"
starcoder2_model_path = "model_cache/starcoder2-3b-Q4_K_M.gguf"
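
Before loading anything, it can save some confusion to confirm the files really are where these paths say. The sketch below is one way to do that; `missing_models` is a hypothetical helper, and the paths are repeated so the cell runs on its own.

```python
from pathlib import Path


def missing_models(paths):
    """Return the paths that do not point at an existing file."""
    return [p for p in paths if not Path(p).is_file()]


model_paths = [
    "model_cache/Phi-3-mini-4k-instruct-q4.gguf",
    "model_cache/qwen2-7b-instruct-q4_k_m.gguf",
    "model_cache/starcoder2-3b-Q4_K_M.gguf",
]

missing = missing_models(model_paths)
if missing:
    print("Missing GGUF files:", ", ".join(missing))
else:
    print("All model files found")
```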

In [None]:
# Now use llama.cpp to load the models for Phi-3, Qwen2 and StarCoder2

phi3 = Llama(model_path=phi3_model_path, n_ctx=300)
qwen2 = Llama(model_path=qwen2_model_path, n_ctx=300)
starcoder2 = Llama(model_path=starcoder2_model_path, n_ctx=300)
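
The `n_ctx` argument sets the context window in tokens, and that window is shared by the prompt and the generated text. With `n_ctx=300` and `max_tokens=200`, the prompt has roughly 100 tokens to work with before the output risks being cut short. A small sketch of that arithmetic follows; `fits_in_context` is a hypothetical helper, and the token counts in the example calls are made-up numbers rather than measurements.

```python
def fits_in_context(n_prompt_tokens: int, max_tokens: int, n_ctx: int) -> bool:
    """True if the prompt plus the requested output fits in the context window."""
    return n_prompt_tokens + max_tokens <= n_ctx


# With a loaded model, the prompt length can be measured with its tokenizer:
#   n_prompt_tokens = len(phi3.tokenize(prompt.encode("utf-8")))

print(fits_in_context(40, 200, 300))   # a short prompt leaves room for 200 tokens
print(fits_in_context(150, 200, 300))  # a longer prompt does not
```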

In [None]:
# phi3 tell us a joke! Remember this is a tiny model!
# The prompt has some special characters in it - we'll cover this shortly

prompt = """<|user|>
Tell a light joke for a room full of data scientists<|end|>
<|assistant|>"""

response = phi3(prompt, max_tokens=200, temperature=1, echo=True, stream=True)
for chunk in response:
    print(chunk["choices"][0]["text"], end='')

In [None]:
# qwen2 tell us a joke!

prompt = """<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Tell a light joke for a room full of data scientists<|im_end|>
<|im_start|>assistant
"""

response = qwen2(prompt, max_tokens=200, temperature=1, echo=True, stream=True)
for chunk in response:
    print(chunk["choices"][0]["text"], end='')

In [None]:
# OK enough with the jokes - starcoder2 please write a function for us

prompt = "def hello_world():\n"
response = starcoder2(prompt, max_tokens=100, temperature=1, echo=True, stream=True)
for chunk in response:
    print(chunk["choices"][0]["text"], end='')
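
The loops above print each piece of text as it arrives and then discard it. If you want to keep the generated text, for example to save the function StarCoder2 wrote, the chunks can be joined into one string. In this sketch `collect_text` is a hypothetical helper, and `fake_stream` is a hand-written stand-in with the same shape as the streamed chunks, not real model output.

```python
def collect_text(chunks):
    """Join the text of every streamed chunk into a single string."""
    return "".join(chunk["choices"][0]["text"] for chunk in chunks)


# Hand-written stand-in for a streamed response (not real model output)
fake_stream = [
    {"choices": [{"text": "def hello_world():\n"}]},
    {"choices": [{"text": "    print('Hello, world!')\n"}]},
]

code = collect_text(fake_stream)
print(code)
```

With a loaded model, the same helper can be passed the streamed response directly, e.g. `collect_text(starcoder2(prompt, max_tokens=100, stream=True))`.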

## Finally: try the other approach to direct inference, on a cloud box with a GPU
### Using Hugging Face Hub and Transformers library

Visit this Google Colab notebook:
https://colab.research.google.com/drive/1CRgX6RVqnWZDexXLACbq91pX2I7O7Swu?usp=sharing
