# Using the Phi-3-mini-4k-instruct-gguf for simple inference
This notebook shows how to use Phi-3-mini-4k-instruct-gguf for basic inference.

It requires a GPU. I tested it on an NVIDIA GeForce RTX 4090 with 24 GB of VRAM.

[HuggingFace Model Card](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf)

## CPU specs

In [29]:
! lscpu | head

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen 9 7900X 12-Core Processor
CPU family:                         25
Model:                              97


## GPU specs

In [31]:
import torch

print(torch.cuda.is_available())
print(torch.cuda.device_count())
print(torch.cuda.current_device())
print(torch.cuda.get_device_name(torch.cuda.current_device()))

True
1
0
NVIDIA GeForce RTX 4090


In [3]:
import os

os.getcwd()
model_shard_dir='/home/ubuntu/jupyterhub/llms/model-cache/Phi-3-mini-4k-instruct-gguf'
model_shard_dir

'/home/ubuntu/jupyterhub/llms/model-cache/Phi-3-mini-4k-instruct-gguf'

In [33]:
import shutil

shutil.rmtree(model_shard_dir, ignore_errors=True)

In [35]:
! ls -ltrh model-cache

total 0


# Clone the git repo
First, clone the git repo without downloading the LFS objects by setting `GIT_LFS_SKIP_SMUDGE=1`.

In [36]:
!GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf model-cache/Phi-3-mini-4k-instruct-gguf 

Cloning into 'model-cache/Phi-3-mini-4k-instruct-gguf'...
remote: Enumerating objects: 15, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (12/12), done.
remote: Total 15 (delta 1), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (15/15), 10.58 KiB | 10.58 MiB/s, done.


Examine the artifacts.

Notice that the `.gguf` files are only 135 bytes each: they are Git LFS pointer files, not the model weights.

In [11]:
! ls -ltrh model-cache/Phi-3-mini-4k-instruct-gguf 

total 48K
-rw-r--r-- 1 ubuntu ubuntu 1.8K May 22 16:23 NOTICE.md
-rw-r--r-- 1 ubuntu ubuntu  251 May 22 16:23 Modelfile_q4
-rw-r--r-- 1 ubuntu ubuntu  253 May 22 16:23 Modelfile_fp16
-rw-r--r-- 1 ubuntu ubuntu 1.1K May 22 16:23 LICENSE
-rw-r--r-- 1 ubuntu ubuntu  444 May 22 16:23 CODE_OF_CONDUCT.md
-rw-r--r-- 1 ubuntu ubuntu 2.6K May 22 16:23 SECURITY.md
-rw-r--r-- 1 ubuntu ubuntu  14K May 22 16:23 README.md
-rw-r--r-- 1 ubuntu ubuntu  135 May 22 16:23 Phi-3-mini-4k-instruct-q4.gguf
-rw-r--r-- 1 ubuntu ubuntu  135 May 22 16:23 Phi-3-mini-4k-instruct-fp16.gguf


## Download the model Phi-3-mini-4k-instruct-q4.gguf
First we test the __Phi-3-mini-4k-instruct-q4.gguf__ model

The model card lists this as `medium, balanced quality - recommended`

Now download the artifact Phi-3-mini-4k-instruct-q4.gguf

### Model aspects
1. File: Phi-3-mini-4k-instruct-q4.gguf
2. Quantization method: Q4_K_M
3. Bits: 4
4. File size: 2.2 GB
5. Memory required: 4.7 GB
6. Use case: medium, balanced quality - recommended
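As a rough sanity check on the numbers above, the listed file size can be turned into an effective bits-per-weight figure. This is only a sketch: the parameter count of 3.8 billion is an assumption that does not appear in this notebook, and the 2.2 GB figure is read as decimal gigabytes.

```python
def effective_bits_per_weight(file_size_bytes: float, n_params: float) -> float:
    """Average number of bits stored per model parameter."""
    return file_size_bytes * 8 / n_params


N_PARAMS = 3.8e9         # assumption: parameter count, not stated in this notebook
FILE_SIZE_BYTES = 2.2e9  # the listed file size of 2.2 GB, read as decimal GB

bpw = effective_bits_per_weight(FILE_SIZE_BYTES, N_PARAMS)
print(f"{bpw:.2f} bits per weight")
```

Under these assumptions the result lands somewhat above the nominal 4 bits, which would be expected if the quantized format also stores scaling metadata or keeps some tensors at higher precision.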

In [37]:
! cd model-cache/Phi-3-mini-4k-instruct-gguf  && git-lfs pull --include=Phi-3-mini-4k-instruct-q4.gguf

Downloading LFS objects: 100% (1/1), 2.4 GB | 109 MB/s                          

In [39]:
! ls -ltrh model-cache/Phi-3-mini-4k-instruct-gguf

total 2.3G
-rw-r--r-- 1 ubuntu ubuntu 1.8K May 22 16:58 NOTICE.md
-rw-r--r-- 1 ubuntu ubuntu  251 May 22 16:58 Modelfile_q4
-rw-r--r-- 1 ubuntu ubuntu  253 May 22 16:58 Modelfile_fp16
-rw-r--r-- 1 ubuntu ubuntu 1.1K May 22 16:58 LICENSE
-rw-r--r-- 1 ubuntu ubuntu  444 May 22 16:58 CODE_OF_CONDUCT.md
-rw-r--r-- 1 ubuntu ubuntu 2.6K May 22 16:58 SECURITY.md
-rw-r--r-- 1 ubuntu ubuntu  14K May 22 16:58 README.md
-rw-r--r-- 1 ubuntu ubuntu  135 May 22 16:58 Phi-3-mini-4k-instruct-fp16.gguf
-rw-r--r-- 1 ubuntu ubuntu 2.3G May 22 16:59 Phi-3-mini-4k-instruct-q4.gguf
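Once the real file is on disk, its integrity could be checked by hashing it and comparing the result with the `oid sha256:` value from the LFS pointer. The sketch below hashes in chunks so a multi-gigabyte file never has to fit in memory. The helper name `sha256_of_file` is hypothetical, and the demo runs on a small temporary file rather than on the model.

```python
import hashlib
import os
import tempfile


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a small temporary file instead of the multi-gigabyte model
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello gguf")
    demo_path = tmp.name

demo_digest = sha256_of_file(demo_path)
os.remove(demo_path)
print(demo_digest)
```

For the model itself, the path would be something like `f"{model_shard_dir}/Phi-3-mini-4k-instruct-q4.gguf"`, and the expected digest would have to be taken from the pointer file as it existed before the pull.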


## Model inferencing with Phi-3-mini-4k-instruct-q4.gguf
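`Phi-3-mini-4k-instruct-gguf` requires the `llama-cpp-python` package.

You can install it with CUDA support as follows (newer `llama-cpp-python` releases use `-DGGML_CUDA=on` in place of `-DLLAMA_CUBLAS=on`):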
```bash
CMAKE_ARGS="-DLLAMA_CUBLAS=on" pip install llama-cpp-python
```

In [4]:
from llama_cpp import Llama

llm = Llama(
  model_path=f"{model_shard_dir}/Phi-3-mini-4k-instruct-q4.gguf",  # path to GGUF file
  n_ctx=4096,  # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=8, # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=35, # The number of layers to offload to GPU, if you have GPU acceleration available. Set to 0 if no GPU acceleration is available on your system.
)

llama_model_loader: loaded meta data with 24 key-value pairs and 195 tensors from /home/ubuntu/jupyterhub/llms/model-cache/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-q4.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   4:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi3.block_count u32              = 32
llama_model_loader: - kv   6:                  phi3.attention.head_count u32              = 32
llama_model_loader
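The metadata above reports `phi3.block_count = 32`, while the cell requests `n_gpu_layers=35`, so the request exceeds the number of transformer blocks. The sketch below shows how such a request might resolve. It assumes that the output layer counts as one extra offloadable layer and that a negative value means "offload everything"; both are my reading of llama.cpp's behavior rather than something this notebook demonstrates, and `resolve_gpu_layers` is a hypothetical helper, not part of `llama-cpp-python`.

```python
def resolve_gpu_layers(requested: int, block_count: int) -> int:
    """Estimate how many layers end up on the GPU for a given request."""
    # Assumption: transformer blocks plus one output layer can be offloaded
    offloadable = block_count + 1
    if requested < 0:
        # Assumption: a negative request means "offload all layers"
        return offloadable
    return min(requested, offloadable)


BLOCK_COUNT = 32  # phi3.block_count from the loader output above

for requested in (0, 10, 35, -1):
    print(requested, "->", resolve_gpu_layers(requested, BLOCK_COUNT))
```

Under these assumptions, any value at or above the total, including the 35 used here, places the whole model on the GPU.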

In [5]:
! nvidia-smi

Wed May 22 17:00:46 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   40C    P2             59W /  450W |    4459MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
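If you want the memory figure as a number rather than reading it off the table, a small regular expression is enough. This is a sketch that works on the text layout shown above; `parse_gpu_memory` is a hypothetical helper, and it only looks at the first `used MiB / total MiB` pair it finds, so it assumes a single GPU.

```python
import re
from typing import Tuple


def parse_gpu_memory(smi_output: str) -> Tuple[int, int]:
    """Extract (used_mib, total_mib) from the first memory field in nvidia-smi text."""
    match = re.search(r"(\d+)MiB\s*/\s*(\d+)MiB", smi_output)
    if match is None:
        raise ValueError("no memory usage field found")
    return int(match.group(1)), int(match.group(2))


# The GPU row copied from the nvidia-smi output above
row = "|  0%   40C    P2             59W /  450W |    4459MiB /  24564MiB |      0%      Default |"

used, total = parse_gpu_memory(row)
print(f"{used} MiB used of {total} MiB ({used / total:.1%})")
```

The 4459 MiB in use after loading is in the same range as the 4.7 GB "memory required" figure listed for the q4 model.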
                                                

In [6]:
prompt = "What is an embedding model?"

# Simple inference example
output = llm(
  f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
  max_tokens=256,  # Generate up to 256 tokens
  stop=["<|end|>"], 
  echo=True,  # Whether to echo the prompt
)

print(output['choices'][0]['text'])


llama_print_timings:        load time =     130.02 ms
llama_print_timings:      sample time =      25.53 ms /   256 runs   (    0.10 ms per token, 10029.38 tokens per second)
llama_print_timings: prompt eval time =     129.57 ms /    12 tokens (   10.80 ms per token,    92.61 tokens per second)
llama_print_timings:        eval time =    1113.32 ms /   255 runs   (    4.37 ms per token,   229.05 tokens per second)
llama_print_timings:       total time =    1524.88 ms /   267 tokens


<|user|>
What is an embedding model?<|end|>
<|assistant|> An embedding model, in the context of machine learning and natural language processing (NLP), refers to a mathematical representation where words or phrases from the vocabulary are transformed into vectors of real numbers. Each unique word or phrase corresponds to a specific vector in a high-dimensional space, with each dimension capturing some aspect of the word's meaning based on its usage context within large corpora.


This transformation allows machines to process and understand human language by leveraging similarities between words (such as synonyms) because they tend to be closer together in this vector space. Embeddings can also capture relationships, analogies, and other linguistic patterns. Common types of embedding models include:


- Word2Vec: Uses neural networks to produce dense word vectors by predicting a context given a targeted word (continuous bag-of-words model) or vice versa.

- GloVe (Global Vectors for Wo
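The prompt string in the cell above follows a fixed `<|user|>` / `<|end|>` / `<|assistant|>` pattern, and `echo=True` means the prompt comes back at the start of the generated text. A minimal sketch of two helpers that wrap both details is shown below; the names `format_phi3_prompt` and `strip_echo` are hypothetical and not part of `llama-cpp-python`.

```python
PHI3_STOP = ["<|end|>"]  # same stop sequence as in the inference cell


def format_phi3_prompt(user_message: str) -> str:
    """Wrap a single user message in the chat markers used in this notebook."""
    return f"<|user|>\n{user_message}<|end|>\n<|assistant|>"


def strip_echo(completion_text: str, prompt_text: str) -> str:
    """Drop the echoed prompt from the front of a completion, if present."""
    if completion_text.startswith(prompt_text):
        return completion_text[len(prompt_text):].lstrip()
    return completion_text


prompt_text = format_phi3_prompt("What is an embedding model?")
print(prompt_text)

# Simulated completion text, standing in for output['choices'][0]['text']
simulated = prompt_text + " An embedding model, in the context of machine learning..."
print(strip_echo(simulated, prompt_text))
```

With these in place, the call would look like `llm(format_phi3_prompt(prompt), max_tokens=256, stop=PHI3_STOP, echo=True)`. Setting `echo=False` avoids the need for `strip_echo` altogether.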

In [19]:
prompt = "What are closures in functional programming?"

# Simple inference example
output = llm(
  f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
  max_tokens=256,  # Generate up to 256 tokens
  stop=["<|end|>"], 
  echo=True,  # Whether to echo the prompt
)

print(output['choices'][0]['text'])

Llama.generate: prefix-match hit

llama_print_timings:        load time =    4017.01 ms
llama_print_timings:      sample time =      24.10 ms /   256 runs   (    0.09 ms per token, 10621.53 tokens per second)
llama_print_timings: prompt eval time =      62.33 ms /    10 tokens (    6.23 ms per token,   160.45 tokens per second)
llama_print_timings:        eval time =    1104.35 ms /   255 runs   (    4.33 ms per token,   230.91 tokens per second)
llama_print_timings:       total time =    1437.72 ms /   265 tokens


<|user|>
What are closures in functional programming?<|end|>
<|assistant|> Closures in functional programming refer to the concept where a function captures the lexical scope in which it was defined. This means that a closure can access variables from its outer (enclosing) scope, even after the outer functions have finished executing. This is an important feature because it allows for powerful and flexible abstractions, as well as encapsulation of state without relying on global variables or object properties.


Here's a basic example to illustrate closures in JavaScript:

```javascript
function makeCounter() {
    let count = 0; // This is the lexical scope accessible by the closure (the inner function).
    
    return function incrementCounter() {
        // Closure capturing 'count' from its outer environment.
        console.log(count);
        count++;
    };
}

const counter = makeCounter();  // The `incrementCounter` function retains access to the `count` variable defined in it
```

## Download the model Phi-3-mini-4k-instruct-fp16.gguf
Next we test the __Phi-3-mini-4k-instruct-fp16.gguf__ model

The model card lists this as `minimal quality loss`

Now download the artifact [Phi-3-mini-4k-instruct-fp16.gguf](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-gguf/blob/main/Phi-3-mini-4k-instruct-fp16.gguf)

### Model aspects
1. File: Phi-3-mini-4k-instruct-fp16.gguf
2. Quantization method: None
3. Bits: NA
4. File size: 7.2 GB
5. Memory required: 9.3 GiB
6. Use case: minimal quality loss

In [9]:
! cd model-cache/Phi-3-mini-4k-instruct-gguf  && git-lfs pull --include=Phi-3-mini-4k-instruct-fp16.gguf

Downloading LFS objects: 100% (1/1), 7.6 GB | 108 MB/s                          

In [8]:
! ls -ltrh model-cache/Phi-3-mini-4k-instruct-gguf

total 2.3G
-rw-r--r-- 1 ubuntu ubuntu 1.8K May 22 16:23 NOTICE.md
-rw-r--r-- 1 ubuntu ubuntu  251 May 22 16:23 Modelfile_q4
-rw-r--r-- 1 ubuntu ubuntu  253 May 22 16:23 Modelfile_fp16
-rw-r--r-- 1 ubuntu ubuntu 1.1K May 22 16:23 LICENSE
-rw-r--r-- 1 ubuntu ubuntu  444 May 22 16:23 CODE_OF_CONDUCT.md
-rw-r--r-- 1 ubuntu ubuntu 2.6K May 22 16:23 SECURITY.md
-rw-r--r-- 1 ubuntu ubuntu  14K May 22 16:23 README.md
-rw-r--r-- 1 ubuntu ubuntu  135 May 22 16:23 Phi-3-mini-4k-instruct-fp16.gguf
-rw-r--r-- 1 ubuntu ubuntu 2.3G May 22 16:23 Phi-3-mini-4k-instruct-q4.gguf


## Model inferencing with Phi-3-mini-4k-instruct-fp16.gguf

In [11]:
from llama_cpp import Llama

llm = Llama(
  model_path=f"{model_shard_dir}/Phi-3-mini-4k-instruct-fp16.gguf",  # path to GGUF file
  n_ctx=4096,  # The max sequence length to use - note that longer sequence lengths require much more resources
  n_threads=8, # The number of CPU threads to use, tailor to your system and the resulting performance
  n_gpu_layers=35, # The number of layers to offload to GPU, if you have GPU acceleration available. Set to 0 if no GPU acceleration is available on your system.
)

llama_model_loader: loaded meta data with 23 key-value pairs and 195 tensors from /home/ubuntu/jupyterhub/llms/model-cache/Phi-3-mini-4k-instruct-gguf/Phi-3-mini-4k-instruct-fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:                               general.name str              = Phi3
llama_model_loader: - kv   2:                        phi3.context_length u32              = 4096
llama_model_loader: - kv   3:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv   4:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv   5:                           phi3.block_count u32              = 32
llama_model_loader: - kv   6:                  phi3.attention.head_count u32              = 32
llama_model_load

In [19]:
! nvidia-smi

Wed May 22 16:53:25 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   38C    P8             25W /  450W |    9347MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [20]:
prompt = "What is an embedding model?"

# Simple inference example
output = llm(
  f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
  max_tokens=256,  # Generate up to 256 tokens
  stop=["<|end|>"], 
  echo=True,  # Whether to echo the prompt
)

print(output['choices'][0]['text'])


llama_print_timings:        load time =     136.41 ms
llama_print_timings:      sample time =      24.90 ms /   256 runs   (    0.10 ms per token, 10282.78 tokens per second)
llama_print_timings: prompt eval time =     136.31 ms /    12 tokens (   11.36 ms per token,    88.03 tokens per second)
llama_print_timings:        eval time =    2456.00 ms /   255 runs   (    9.63 ms per token,   103.83 tokens per second)
llama_print_timings:       total time =    2866.99 ms /   267 tokens


<|user|>
What is an embedding model?<|end|>
<|assistant|> An embedding model is a mathematical representation used in machine learning and natural language processing (NLP) to convert large, sparse data sets into dense vectors of fixed size. These vectors are known as "embeddings" and they capture the essential properties or relationships within the original data by placing similar items close together in the vector space.

Embedding models typically work on one-hot encoded categorical data (like words from a vocabulary) or discrete variables, transforming them into continuous vectors. This process enables complex computations that can reveal patterns and relationships between elements, which were not initially apparent due to their high-dimensional and sparse nature.

In NLP, embedding models are essential for tasks such as:

1. Word Representation: Words or phrases are represented by dense vectors (word embeddings), capturing semantic meanings based on contextual information from lar
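The `llama_print_timings` lines make it easy to compare the two model files. The sketch below parses the `prompt eval` and `eval` lines recorded in this notebook for the "What is an embedding model?" prompt and computes generation throughput for each. The regular expression is written against the log layout shown here and may need adjusting for other `llama.cpp` versions; `parse_timings` and `tokens_per_second` are hypothetical helper names.

```python
import re

TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<stage>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)


def parse_timings(log: str) -> dict:
    """Map each stage name to a (milliseconds, item_count_or_None) tuple."""
    stages = {}
    for m in TIMING_RE.finditer(log):
        count = int(m.group("count")) if m.group("count") else None
        stages[m.group("stage")] = (float(m.group("ms")), count)
    return stages


def tokens_per_second(stages: dict, stage: str = "eval") -> float:
    """Throughput of one stage, in tokens per second."""
    ms, count = stages[stage]
    return count / (ms / 1000.0)


# Timing lines copied from the q4 and fp16 runs in this notebook
q4_log = """
llama_print_timings: prompt eval time =     129.57 ms /    12 tokens
llama_print_timings:        eval time =    1113.32 ms /   255 runs
"""
fp16_log = """
llama_print_timings: prompt eval time =     136.31 ms /    12 tokens
llama_print_timings:        eval time =    2456.00 ms /   255 runs
"""

q4 = parse_timings(q4_log)
fp16 = parse_timings(fp16_log)
speedup = tokens_per_second(q4) / tokens_per_second(fp16)
print(f"q4: {tokens_per_second(q4):.1f} tok/s, fp16: {tokens_per_second(fp16):.1f} tok/s")
print(f"q4 generates about {speedup:.2f}x faster than fp16 on this run")
```

On these logs, prompt evaluation takes about the same time for both files, while generation with the q4 file is a little over twice as fast.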

In [21]:
prompt = "What are closures in functional programming?"

# Simple inference example
output = llm(
  f"<|user|>\n{prompt}<|end|>\n<|assistant|>",
  max_tokens=256,  # Generate up to 256 tokens
  stop=["<|end|>"], 
  echo=True,  # Whether to echo the prompt
)

print(output['choices'][0]['text'])

Llama.generate: prefix-match hit

llama_print_timings:        load time =     136.41 ms
llama_print_timings:      sample time =      24.34 ms /   256 runs   (    0.10 ms per token, 10518.96 tokens per second)
llama_print_timings: prompt eval time =      64.32 ms /    10 tokens (    6.43 ms per token,   155.46 tokens per second)
llama_print_timings:        eval time =    2456.95 ms /   255 runs   (    9.64 ms per token,   103.79 tokens per second)
llama_print_timings:       total time =    2796.77 ms /   265 tokens


<|user|>
What are closures in functional programming?<|end|>
<|assistant|> Closures are an important concept in functional programming, and they play a critical role in managing state and creating functions with encapsulated behavior. A closure is created when a function captures variables from its surrounding environment (lexical scope), effectively "closing over" these variables to form a new scope that exists even after the outer function has finished executing. This allows the inner function to remember and access those captured variables, regardless of where it's called later in your program.

Here's an example using JavaScript to illustrate closures:

```javascript
function createCounter() {
  let count = 0; // A variable defined within the outer scope (createCounter function)

  return function () {
    count += 1; // The inner function has access and can modify 'count' even though it is not directly accessible from outside
    console.log(count);
  };
}

const counter = createC
```

## Free the GPU memory
Reset the CUDA device to release the GPU memory held by the loaded model.

In [7]:
from numba import cuda

device = cuda.get_current_device()
device.reset()

In [8]:
! nvidia-smi

Wed May 22 17:01:23 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0 Off |                  Off |
|  0%   40C    P5             27W /  450W |       5MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

# Clean up the disk

Next, remove the downloaded artifacts to free disk space.

In [9]:
! rm -rf model-cache/Phi-3-mini-4k-instruct-gguf