# Simple inference with Phi-3-mini-4k-instruct-gguf on CPU
This notebook shows how to use `Phi-3-mini-4k-instruct-gguf` for basic inference on a CPU.

More information is available in the [Hugging Face model card](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf).

## CPU specs

In [1]:
! lscpu | head

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen 9 7900X 12-Core Processor
CPU family:                         25
Model:                              97


In [2]:
### Set the number of cores to use

In [3]:
CPU_CORES = 12
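
The value 12 matches the physical core count of the Ryzen 9 7900X shown above; `lscpu` reports 24 logical CPUs. A minimal sketch for deriving a starting value on another machine, assuming two hardware threads per core (that ratio is an assumption, and `suggest_thread_count` is a hypothetical helper that is not part of the original notebook):

```python
import os


def suggest_thread_count(logical_cpus=None, threads_per_core=2):
    """Estimate a physical core count from the logical CPU count.

    threads_per_core=2 is an assumption (typical SMT); adjust it for your CPU.
    """
    if logical_cpus is None:
        # os.cpu_count() reports logical CPUs and may return None.
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus // threads_per_core)


# The 24 logical CPUs reported by lscpu above map to 12 threads.
print(suggest_thread_count(24))
```

Treat the result as a starting point only; the best `n_threads` value is worth measuring on the target system.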

In [4]:
import os

os.getcwd()
model_shard_dir='/home/ubuntu/jupyterhub/llms/model-cache/Phi-3-mini-4k-instruct-gguf'
model_shard_dir

'/home/ubuntu/jupyterhub/llms/model-cache/Phi-3-mini-4k-instruct-gguf'

In [12]:
import shutil

shutil.rmtree(model_shard_dir
              , ignore_errors=True)

In [13]:
! ls -ltrh model-cache

total 0


## Clone the git repo
First clone the git repo without downloading the LFS objects by setting `GIT_LFS_SKIP_SMUDGE=1`.

In [14]:
!GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf model-cache/Phi-3-mini-4k-instruct-gguf 

Cloning into 'model-cache/Phi-3-mini-4k-instruct-gguf'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 15 (delta 1), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (15/15), 10.58 KiB | 10.58 MiB/s, done.


Examine the artifacts

Notice that the gguf files are only 135 bytes each: they are Git LFS pointer files, not the actual model weights.

In [15]:
! ls -ltrh model-cache/Phi-3-mini-4k-instruct-gguf 

total 48K
-rw-r--r-- 1 ubuntu ubuntu 1.8K May 22 17:09 NOTICE.md
-rw-r--r-- 1 ubuntu ubuntu  251 May 22 17:09 Modelfile_q4
-rw-r--r-- 1 ubuntu ubuntu  253 May 22 17:09 Modelfile_fp16
-rw-r--r-- 1 ubuntu ubuntu 1.1K May 22 17:09 LICENSE
-rw-r--r-- 1 ubuntu ubuntu  444 May 22 17:09 CODE_OF_CONDUCT.md
-rw-r--r-- 1 ubuntu ubuntu 2.6K May 22 17:09 SECURITY.md
-rw-r--r-- 1 ubuntu ubuntu  14K May 22 17:09 README.md
-rw-r--r-- 1 ubuntu ubuntu  135 May 22 17:09 Phi-3-mini-4k-instruct-q4.gguf
-rw-r--r-- 1 ubuntu ubuntu  135 May 22 17:09 Phi-3-mini-4k-instruct-fp16.gguf
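
With `GIT_LFS_SKIP_SMUDGE=1`, the clone leaves small text pointer files in place of the weights. A minimal sketch for telling pointers from real files, assuming the standard Git LFS pointer format whose first line starts with `version https://git-lfs.github.com/spec/v1`; `is_lfs_pointer` and `report_gguf_files` are hypothetical helper names:

```python
from pathlib import Path

# First line of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer, not real data."""
    with open(path, "rb") as handle:
        head = handle.read(len(LFS_POINTER_PREFIX))
    return head == LFS_POINTER_PREFIX


def report_gguf_files(directory):
    """Print each .gguf file in the directory with its size and pointer status."""
    for gguf in sorted(Path(directory).glob("*.gguf")):
        kind = "LFS pointer" if is_lfs_pointer(gguf) else "model weights"
        print(f"{gguf.name}: {gguf.stat().st_size} bytes ({kind})")
```

Calling `report_gguf_files("model-cache/Phi-3-mini-4k-instruct-gguf")` right after the clone should label both files as pointers; after a `git-lfs pull` the pulled file should be labeled as model weights.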


## Download the model Phi-3-mini-4k-instruct-q4.gguf
First we test the __Phi-3-mini-4k-instruct-q4.gguf__ model

The model card lists this as `medium, balanced quality - recommended`

Now download the artifact Phi-3-mini-4k-instruct-q4.gguf

### Model aspects
1. File: Phi-3-mini-4k-instruct-q4.gguf
2. Quantization method: Q4_K_M	
3. Bits: 4
4. File size: 2.2GB
5. Memory required: 4.7 GB	
6. Use case: medium, balanced quality - recommended
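
Since the artifact is a multi-gigabyte download, it can help to confirm there is enough free disk space first. A minimal sketch using only the standard library; `has_free_space` is a hypothetical helper, and the 2.2 GB figure is the file size listed above with an assumed 20% safety margin:

```python
import shutil


def has_free_space(path, required_bytes):
    """Return True if the filesystem holding `path` has enough free space."""
    return shutil.disk_usage(path).free >= required_bytes


# 2.2 GB is the listed file size; the 20% margin is an assumption.
Q4_REQUIRED_BYTES = int(2.2e9 * 1.2)
print(has_free_space(".", Q4_REQUIRED_BYTES))
```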

In [16]:
! cd model-cache/Phi-3-mini-4k-instruct-gguf  && git-lfs pull --include=Phi-3-mini-4k-instruct-q4.gguf

Downloading LFS objects: 100% (1/1), 2.4 GB | 109 MB/s                          

In [17]:
! ls -ltrh model-cache/Phi-3-mini-4k-instruct-gguf

total 2.3G
-rw-r--r-- 1 ubuntu ubuntu 1.8K May 22 17:09 NOTICE.md
-rw-r--r-- 1 ubuntu ubuntu  251 May 22 17:09 Modelfile_q4
-rw-r--r-- 1 ubuntu ubuntu  253 May 22 17:09 Modelfile_fp16
-rw-r--r-- 1 ubuntu ubuntu 1.1K May 22 17:09 LICENSE
-rw-r--r-- 1 ubuntu ubuntu  444 May 22 17:09 CODE_OF_CONDUCT.md
-rw-r--r-- 1 ubuntu ubuntu 2.6K May 22 17:09 SECURITY.md
-rw-r--r-- 1 ubuntu ubuntu  14K May 22 17:09 README.md
-rw-r--r-- 1 ubuntu ubuntu  135 May 22 17:09 Phi-3-mini-4k-instruct-fp16.gguf
-rw-r--r-- 1 ubuntu ubuntu 2.3G May 22 17:09 Phi-3-mini-4k-instruct-q4.gguf


## Model inferencing with Phi-3-mini-4k-instruct-q4.gguf
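`Phi-3-mini-4k-instruct-gguf` requires the `llama-cpp-python` package to be installed.

You can install it with the command below. The `-DLLAMA_CUBLAS=on` flag builds the package with CUDA support; for CPU-only inference, as used in this notebook, a plain `pip install llama-cpp-python` is sufficient.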
```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
```

In [6]:
from llama_cpp import Llama

llm = Llama(
  model_path=f"{model_shard_dir}/Phi-3-mini-4k-instruct-q4.gguf",  # path to GGUF file
  n_ctx=4096,     # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=CPU_CORES,    # The number of CPU threads to use
  n_gpu_layers=0, # Set to 0 to disable GPU offload, model will infer on CPU
)

llama_model_loader: loaded meta data with 24 key-value pairs and 195 tensors from /home/ubuntu/jupyterhub/llms/model-cache/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   4:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi3.block_count u32              = 32
llama_model_loader: - kv   6:                  phi3.attention.head_count u32              = 32
llama_model_loader

In [7]:
! free -h

               total        used        free      shared  buff/cache   available
Mem:            61Gi       7.5Gi        28Gi       1.6Gi        25Gi        52Gi
Swap:          8.0Gi          0B       8.0Gi
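
Beyond the weights, the context window needs memory for a key/value cache, which is why longer sequence lengths cost more. A rough estimate from the metadata printed by the loader (context length 4096, embedding length 3072, 32 blocks), assuming 16-bit cache entries and as many key/value heads as attention heads; the key/value head count is not visible in the truncated log, so both points are assumptions:

```python
def kv_cache_bytes(n_ctx, n_embd, n_layer, bytes_per_value=2):
    """Estimate KV cache size: one key and one value vector per layer per token."""
    return 2 * n_layer * n_ctx * n_embd * bytes_per_value


# Values taken from the loader metadata above; bytes_per_value=2 assumes fp16.
size = kv_cache_bytes(n_ctx=4096, n_embd=3072, n_layer=32)
print(f"{size / 2**30:.2f} GiB")  # prints "1.50 GiB"
```

Halving `n_ctx` halves this estimate, so a smaller context is an easy way to reduce memory use when long prompts are not needed.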


In [8]:
prompt = "What is an embedding model?"

# Simple inference example
output = llm(
  f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
  max_tokens=256,  # Generate up to 256 tokens
  stop=["<|end|>"], 
  echo=True,  # Whether to echo the prompt
)

print(output['choices'][0]['text'])


llama_print_timings:        load time =     119.17 ms
llama_print_timings:      sample time =      27.23 ms /   256 runs   (    0.11 ms per token,  9402.78 tokens per second)
llama_print_timings: prompt eval time =     119.08 ms /    12 tokens (    9.92 ms per token,   100.78 tokens per second)
llama_print_timings:        eval time =   12511.34 ms /   255 runs   (   49.06 ms per token,    20.38 tokens per second)
llama_print_timings:       total time =   12975.24 ms /   267 tokens


<|user|>
What is an embedding model?<|end|>
<|assistant|> An embedding model, in the context of machine learning and natural language processing (NLP), refers to a mathematical representation where words, phrases, sentences, or even entire documents are mapped to vectors of real numbers. Each unique word or phrase from the vocabulary gets associated with one vector in this high-dimensional space. The primary purpose of embedding models is to capture semantic meaning and contextual relationships between terms in such a way that similar meanings have similar representations.

There are various types of embedding models, but some of the most popular include:

1. Word Embeddings (e.g., Word2Vec, GloVe): These models focus on representing individual words. They learn to predict a word given its context or vice versa and produce dense vector representations for each unique word in the training corpus.

2. Sentence/Document Embeddings: Unlike word embeddings which are limited to single tokens
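
Because `echo=True` returns the prompt together with the completion, the printed text includes the chat markers. A minimal sketch of two hypothetical helpers (not part of the original notebook) that build the prompt in the format used above and strip the echoed prompt from the result:

```python
def format_phi3_prompt(user_message):
    """Wrap a user message in the chat markers used in the cells above."""
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>"


def extract_answer(completion_text):
    """Return only the assistant's reply from text generated with echo=True."""
    marker = "<|assistant|>"
    _, found, answer = completion_text.partition(marker)
    # If the marker is absent (for example with echo=False), keep the text as is.
    return answer.strip() if found else completion_text.strip()


# Stand-in for output['choices'][0]['text']; the reply here is made up.
echoed = format_phi3_prompt("What is an embedding model?") + " It maps text to vectors."
print(extract_answer(echoed))
```

In the notebook, `extract_answer(output['choices'][0]['text'])` would return just the generated answer. Setting `echo=False` avoids the echoed prompt altogether.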

In [10]:
prompt = "What are closures in functional programming?"

# Simple inference example
output = llm(
  f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
  max_tokens=256,  # Generate up to 256 tokens
  stop=["<|end|>"], 
  echo=True,  # Whether to echo the prompt
)

print(output['choices'][0]['text'])

Llama.generate: prefix-match hit

llama_print_timings:        load time =     119.17 ms
llama_print_timings:      sample time =      26.84 ms /   256 runs   (    0.10 ms per token,  9539.07 tokens per second)
llama_print_timings: prompt eval time =       0.00 ms /     0 tokens (    -nan ms per token,     -nan tokens per second)
llama_print_timings:        eval time =   12494.35 ms /   256 runs   (   48.81 ms per token,    20.49 tokens per second)
llama_print_timings:       total time =   12865.25 ms /   256 tokens


<|user|>
What are closures in functional programming?<|end|>
<|assistant|> Closures are a fundamental concept in functional programming and computer science at large. A closure is an expression that captures the lexical bindings of free variables. In other words, it's a function combined with a referencing environment—this environment comprises any non-local variables used by the function. When you create a function within another function (nested), and if this inner function references variables from its outer scope(s) that are not parameters to the inner function, then the outer function is said to have 'closed over' these variables, thus creating a closure.


### Key Characteristics of Closures:

1. **Environment Capture**: The enclosing function captures any local variables from its scope at the time of its creation. This means that when you return this inner function as a value or pass it to another part of your program, these captured variables are still accessible and retain the

## Download the model Phi-3-mini-4k-instruct-fp16.gguf
Next we test the __Phi-3-mini-4k-instruct-fp16.gguf__ model

The model card lists this as `minimal quality loss`

Now download the artifact [Phi-3-mini-4k-instruct-fp16.gguf](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-fp16.gguf)

### Model aspects
1. File: Phi-3-mini-4k-instruct-fp16.gguf
2. Quantization method: None	
3. Bits: 16
4. File size: 7.2 GB
5. Memory required: 9.3 GiB
6. Use case: minimal quality loss

In [9]:
! cd model-cache/Phi-3-mini-4k-instruct-gguf  && git-lfs pull --include=Phi-3-mini-4k-instruct-fp16.gguf

Downloading LFS objects: 100% (1/1), 7.6 GB | 111 MB/s                          
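
The two download sizes can be turned into an approximate storage cost per weight. A minimal sketch, assuming roughly 3.8 billion parameters for Phi-3-mini (a figure not stated in this notebook) and using the sizes reported by `git-lfs` (2.4 GB for q4, 7.6 GB for fp16); `bits_per_weight` is a hypothetical helper:

```python
# Assumption: about 3.8 billion parameters (not stated in this notebook).
PARAMETERS = 3.8e9


def bits_per_weight(gigabytes, parameters=PARAMETERS):
    """Convert a file size in decimal gigabytes to average bits per parameter."""
    return gigabytes * 1e9 * 8 / parameters


# Download sizes as reported by git-lfs for each artifact.
for name, gigabytes in {"q4": 2.4, "fp16": 7.6}.items():
    print(f"{name}: {bits_per_weight(gigabytes):.1f} bits per weight")
```

Under these assumptions the fp16 file comes out at about 16 bits per weight, as expected for an unquantized half-precision model, while the q4 file lands somewhat above 4 bits because the file also holds metadata and some tensors stored at higher precision.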

In [10]:
! ls -ltrh model-cache/Phi-3-mini-4k-instruct-gguf

total 9.4G
-rw-r--r-- 1 ubuntu ubuntu 1.8K May 22 17:09 NOTICE.md
-rw-r--r-- 1 ubuntu ubuntu  251 May 22 17:09 Modelfile_q4
-rw-r--r-- 1 ubuntu ubuntu  253 May 22 17:09 Modelfile_fp16
-rw-r--r-- 1 ubuntu ubuntu 1.1K May 22 17:09 LICENSE
-rw-r--r-- 1 ubuntu ubuntu  444 May 22 17:09 CODE_OF_CONDUCT.md
-rw-r--r-- 1 ubuntu ubuntu 2.6K May 22 17:09 SECURITY.md
-rw-r--r-- 1 ubuntu ubuntu  14K May 22 17:09 README.md
-rw-r--r-- 1 ubuntu ubuntu 2.3G May 22 17:09 Phi-3-mini-4k-instruct-q4.gguf
-rw-r--r-- 1 ubuntu ubuntu 7.2G May 22 17:16 Phi-3-mini-4k-instruct-fp16.gguf


## Model inferencing with Phi-3-mini-4k-instruct-fp16.gguf

In [11]:
from llama_cpp import Llama

llm = Llama(
  model_path=f"{model_shard_dir}/Phi-3-mini-4k-instruct-fp16.gguf",  # path to GGUF file
  n_ctx=4096,  # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=CPU_CORES, # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=0, # Disable GPU
)

llama_model_loader: loaded meta data with 23 key-value pairs and 195 tensors from /home/ubuntu/jupyterhub/llms/model-cache/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   4:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi3.block_count u32              = 32
llama_model_loader: - kv   6:                  phi3.attention.head_count u32              = 32
llama_model_load

In [12]:
! free -h

               total        used        free      shared  buff/cache   available
Mem:            61Gi       7.5Gi        28Gi       1.6Gi        25Gi        52Gi
Swap:          8.0Gi          0B       8.0Gi


In [13]:
prompt = "What is an embedding model?"

# Simple inference example
output = llm(
  f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
  max_tokens=256,  # Generate up to 256 tokens
  stop=["<|end|>"], 
  echo=True,  # Whether to echo the prompt
)

print(output['choices'][0]['text'])


llama_print_timings:        load time =     229.30 ms
llama_print_timings:      sample time =      17.72 ms /   168 runs   (    0.11 ms per token,  9481.88 tokens per second)
llama_print_timings: prompt eval time =     228.98 ms /    12 tokens (   19.08 ms per token,    52.41 tokens per second)
llama_print_timings:        eval time =   23630.10 ms /   167 runs   (  141.50 ms per token,     7.07 tokens per second)
llama_print_timings:       total time =   24105.49 ms /   179 tokens


<|user|>
What is an embedding model?<|end|>
<|assistant|> An embedding model is a representation technique in machine learning and natural language processing (NLP) that converts categorical data, such as words or phrases, into continuous vectors of real numbers. These numerical representations capture the semantic relationships between items, enabling algorithms to process and analyze textual information more effectively. Embeddings are learned from data by mapping similar elements closer together in the embedding space and dissimilar ones farther apart. This transformation often improves performance on various tasks such as sentiment analysis, language translation, or recommendation systems. Popular examples of embedding models include Word2Vec, GloVe, and fastText for words and sentences, as well as sentence-BERT and transformer-based architectures like BERT (Bidirectional Encoder Representations from Transformers) for more complex NLP applications.


In [14]:
prompt = "What are closures in functional programming?"

# Simple inference example
output = llm(
  f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
  max_tokens=256,  # Generate up to 256 tokens
  stop=["<|end|>"], 
  echo=True,  # Whether to echo the prompt
)

print(output['choices'][0]['text'])

Llama.generate: prefix-match hit

llama_print_timings:        load time =     229.30 ms
llama_print_timings:      sample time =      26.81 ms /   256 runs   (    0.10 ms per token,  9549.03 tokens per second)
llama_print_timings: prompt eval time =     161.86 ms /    10 tokens (   16.19 ms per token,    61.78 tokens per second)
llama_print_timings:        eval time =   36061.15 ms /   255 runs   (  141.42 ms per token,     7.07 tokens per second)
llama_print_timings:       total time =   36602.09 ms /   265 tokens


<|user|>
What are closures in functional programming?<|end|>
<|assistant|> Closures, in the context of functional programming, refer to a powerful concept where a function has access to its surrounding lexical scope even after the outer function has finished executing. This means that functions can capture and retain references to variables from their defining environments. Here's how closures work and why they are important:

1. Functional Encapsulation: Closures allow you to encapsulate state within a function, which helps in creating private data or methods. The captured variables (also called free variables) can be accessed by the enclosed function even when it's executed outside of its defining scope. This enables functional programming techniques like higher-order functions, where functions are treated as first-class citizens and passed around as arguments to other functions.

2. Functional Abstraction: Closures allow for abstractions in functional programming by hiding the imple
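
The `llama_print_timings` lines make it possible to compare the two models directly. A minimal sketch that recomputes generation throughput from the eval times and run counts logged for the "embedding model" prompt; `tokens_per_second` is a hypothetical helper:

```python
def tokens_per_second(eval_ms, runs):
    """Generation throughput from an eval time in milliseconds and a run count."""
    return runs / (eval_ms / 1000.0)


# Eval time and runs copied from the timing logs for the first prompt.
q4 = tokens_per_second(eval_ms=12511.34, runs=255)
fp16 = tokens_per_second(eval_ms=23630.10, runs=167)

print(f"q4:   {q4:.2f} tokens/s")
print(f"fp16: {fp16:.2f} tokens/s")
print(f"q4 is {q4 / fp16:.1f}x faster")
```

On this machine the q4 model generates tokens close to three times faster than the fp16 model, at the cost of some quantization loss.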

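## Free the GPU memory
Reset the CUDA device. Because `n_gpu_layers=0` kept inference on the CPU, the `nvidia-smi` output below shows almost no GPU memory in use.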

In [16]:
from numba import cuda

device = cuda.get_current_device()
device.reset()

In [17]:
! nvidia-smi

Wed May 22 17:19:48 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   36C    P0             63W /  450W |       7MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

## Clean up the disk

Next, remove the downloaded artifacts to free disk space.

In [9]:
! rm -rf model-cache/Phi-3-mini-4k-instruct-gguf