<a href="https://colab.research.google.com/github/bhuvana-ak/uplimit_projects/blob/main/uplimit_open_source_llm_week_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Install Llama.cpp and associated libraries

In [1]:
!pip install outlines

Collecting outlines
  Downloading outlines-0.1.1-py3-none-any.whl.metadata (15 kB)
Collecting interegular (from outlines)
  Downloading interegular-0.3.3-py37-none-any.whl.metadata (3.0 kB)
Collecting lark (from outlines)
  Downloading lark-1.2.2-py3-none-any.whl.metadata (1.8 kB)
Collecting diskcache (from outlines)
  Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)
Collecting datasets (from outlines)
  Downloading datasets-3.1.0-py3-none-any.whl.metadata (20 kB)
Collecting pycountry (from outlines)
  Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)
Collecting airportsdata (from outlines)
  Downloading airportsdata-20241001-py3-none-any.whl.metadata (8.9 kB)
Collecting outlines-core==0.1.14 (from outlines)
  Downloading outlines_core-0.1.14-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.7 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets->outlines)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datas

In [2]:
!git clone https://github.com/ggerganov/llama.cpp
%cd llama.cpp
!git pull
!mkdir -p build
%cd build
!cmake .. -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS
!cmake --build . --config Release
%cd ..
!pip install -r requirements.txt

Cloning into 'llama.cpp'...
remote: Enumerating objects: 36330, done.
remote: Counting objects: 100% (145/145), done.
remote: Compressing objects: 100% (105/105), done.
remote: Total 36330 (delta 61), reused 96 (delta 38), pack-reused 36185 (from 1)
Receiving objects: 100% (36330/36330), 60.75 MiB | 16.98 MiB/s, done.
Resolving deltas: 100% (26373/26373), done.
/content/llama.cpp
Already up to date.
/content/llama.cpp/build
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Found Git: /usr/bin/git (found version "2.34.1")
-- 

### Clone fine-tuned model from last week's task

In [3]:
# Configure Git LFS to handle large files properly
!git lfs install

# Raise Git's HTTP POST buffer to about 1 GB (a Git setting, not an LFS-specific one)
!git config --global http.postBuffer 1048576000

# Clone with LFS files
!GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/bhuvana-ak7/OrpoLlama-3.2-1B-V1
%cd OrpoLlama-3.2-1B-V1
!git lfs pull

Updated git hooks.
Git LFS initialized.
Cloning into 'OrpoLlama-3.2-1B-V1'...
remote: Enumerating objects: 31, done.
remote: Counting objects: 100% (28/28), done.
remote: Compressing objects: 100% (28/28), done.
remote: Total 31 (delta 9), reused 0 (delta 0), pack-reused 3 (from 1)
Unpacking objects: 100% (31/31), 13.85 KiB | 1.73 MiB/s, done.
/content/llama.cpp/OrpoLlama-3.2-1B-V1


In [4]:
!mv /content/llama.cpp/OrpoLlama-3.2-1B-V1 /content/llama.cpp/models/OrpoLlama-3.2-1B-V1

### Run llama.cpp's convert_hf_to_gguf Python utility to convert your model to GGUF format

In [5]:
!python /content/llama.cpp/convert_hf_to_gguf.py /content/llama.cpp/models/OrpoLlama-3.2-1B-V1/

INFO:hf-to-gguf:Loading model: OrpoLlama-3.2-1B-V1
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:rope_freqs.weight,           torch.float32 --> F32, shape = {32}
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:token_embd.weight,           torch.float16 --> F16, shape = {2048, 128258}
INFO:hf-to-gguf:blk.0.attn_norm.weight,      torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.ffn_down.weight,       torch.float16 --> F16, shape = {8192, 2048}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,       torch.float16 --> F16, shape = {2048, 8192}
INFO:hf-to-gguf:blk.0.ffn_up.weight,         torch.float16 --> F16, shape = {2048, 8192}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,       torch.float16 --> F32, shape = {2048}
INFO:hf-to-gguf:blk.0.attn_k.weight,         torch.float16 --> F16, shape = {2048, 512}
INFO:hf-to-gguf:blk.0.attn_output.weight,    torch.float16 --> F16, shape = {2048, 2048}
IN
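As a quick sanity check on the converted file, the fixed-size GGUF header can be read with the standard library. This is a minimal sketch, assuming the GGUF v2+ header layout (4-byte magic `GGUF`, little-endian `uint32` version, `uint64` tensor count, `uint64` metadata key-value count). The helper name `read_gguf_header` is hypothetical, and the demo writes a synthetic header to a temporary file rather than touching the real model.

```python
import os
import struct
import tempfile

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) from a GGUF file."""
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    if len(header) < HEADER_SIZE or header[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # Little-endian: uint32 version, uint64 tensor count, uint64 KV count.
    return struct.unpack("<IQQ", header[4:HEADER_SIZE])


# Demo on a synthetic header (version 3, 147 tensors, 30 key-value pairs).
with tempfile.NamedTemporaryFile(suffix=".gguf", delete=False) as tmp:
    tmp.write(GGUF_MAGIC + struct.pack("<IQQ", 3, 147, 30))
    demo_path = tmp.name

demo_header = read_gguf_header(demo_path)
os.remove(demo_path)
print(demo_header)
```

On the Colab machine, the same function could be pointed at `/content/llama.cpp/models/OrpoLlama-3.2-1B-V1/Llama-3.2-1B-V1-F16.gguf` to confirm that conversion produced a readable file.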

### Quantize your model to 4-bit

Q4_K_M is one of llama.cpp's "k-quant" formats. The name breaks down as follows:

- q4: most weights are stored in 4 bits instead of the original 16 or 32 bits
- k: uses the k-quant scheme, which quantizes weights in small blocks grouped into super-blocks, with a scale and a minimum stored per block
- m: the "medium" variant of the format, which keeps some of the more sensitive tensors at higher precision (the "s" variant is smaller and slightly less accurate)

This is generally considered a good balance between model size reduction and quality:

- Reduces model size to roughly a third of the F16 original (in this notebook, from 2.48 GB to 808 MB)
- Keeps quality close to that of higher-bit quantizations
- Is usually more accurate than the older q4_0 and q4_1 formats at a similar size
- Needs less memory, and therefore less memory bandwidth, than 8-bit quantization, which often makes inference faster
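To make the block-wise idea concrete, here is a simplified sketch of 4-bit quantization with one scale and one minimum per block. It is not the actual Q4_K memory layout (which, as far as I know, also quantizes the per-block scales and minimums and groups blocks into super-blocks). The block size and the function names are assumptions chosen for illustration.

```python
import random

BLOCK_SIZE = 32  # weights per block in this sketch (assumed)
LEVELS = 15      # 4 bits give integer codes 0..15


def quantize_block(weights):
    """Map one block of floats to 4-bit codes plus a per-block scale and minimum."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / LEVELS or 1.0  # fall back to 1.0 for a constant block
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Reconstruct approximate floats from the codes."""
    return [lo + c * scale for c in codes]


random.seed(0)
block = [random.gauss(0.0, 0.02) for _ in range(BLOCK_SIZE)]
codes, scale, lo = quantize_block(block)
restored = dequantize_block(codes, scale, lo)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.5f} max_error={max_error:.5f}")
```

Each weight is rounded to the nearest of 16 evenly spaced levels, so the reconstruction error stays within half a step (`scale / 2`). Storing a scale and minimum per block is what lets the 16 levels adapt to the local range of the weights.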

In [6]:
# !/content/llama.cpp/build/bin/llama-quantize /content/llama.cpp/models/TinyLlama-1.1B-Chat-v1.0/TinyLlama-1.1B-Chat-v1.0-F16.gguf q4_k_m
!/content/llama.cpp/build/bin/llama-quantize /content/llama.cpp/models/OrpoLlama-3.2-1B-V1/Llama-3.2-1B-V1-F16.gguf q4_k_m

main: build = 4019 (08828a6d)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: quantizing '/content/llama.cpp/models/OrpoLlama-3.2-1B-V1/Llama-3.2-1B-V1-F16.gguf' to '/content/llama.cpp/models/OrpoLlama-3.2-1B-V1/ggml-model-Q4_K_M.gguf' as Q4_K_M
llama_model_loader: loaded meta data with 30 key-value pairs and 147 tensors from /content/llama.cpp/models/OrpoLlama-3.2-1B-V1/Llama-3.2-1B-V1-F16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B
llama_model_loader: - kv   3:                            general.version str              = V1
llama_model_loader: - kv   4:              

## Try the quantized model in a downstream task

In [7]:
!pip install llama-cpp-python==0.2.85

Collecting llama-cpp-python==0.2.85
  Downloading llama_cpp_python-0.2.85.tar.gz (49.3 MB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.85-cp310-cp310-linux_x86_64.whl size=2872206 sha256=de82c1fabc7e799cc0709c4d2bf444c81be6fa5ea6cda97ca52fc0dfc3ccd8dd
  Stored in directory: /root/.cache/pip/wheels/3f/e8/4e/29a754f9175ef52b6481cd75e3af4de38bf6dfa9c2972f75d4
Successfully built llama-cpp-python
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.2.85


In [8]:
from llama_cpp import Llama

# Load the GGUF model
model_path = "/content/llama.cpp/models/OrpoLlama-3.2-1B-V1/ggml-model-Q4_K_M.gguf"
llm = Llama(model_path=model_path, n_ctx=512, n_batch=128)

llama_model_loader: loaded meta data with 30 key-value pairs and 147 tensors from /content/llama.cpp/models/OrpoLlama-3.2-1B-V1/ggml-model-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Llama 3.2 1B
llama_model_loader: - kv   3:                            general.version str              = V1
llama_model_loader: - kv   4:                       general.organization str              = Meta Llama
llama_model_loader: - kv   5:                           general.basename str              = Llama-3.2
llama_model_loader: - kv   6:                         general.size_label str              = 1B
llama_model_loader: - kv   7

In [12]:
input_text = "I am a Barbie girl in "
output = llm(input_text, max_tokens=300, echo=True)

print(output['choices'][0]['text'])

Llama.generate: prefix-match hit

llama_print_timings:        load time =      79.59 ms
llama_print_timings:      sample time =      29.94 ms /   300 runs   (    0.10 ms per token, 10020.04 tokens per second)
llama_print_timings: prompt eval time =      35.72 ms /     2 tokens (   17.86 ms per token,    55.99 tokens per second)
llama_print_timings:        eval time =    7759.71 ms /   299 runs   (   25.95 ms per token,    38.53 tokens per second)
llama_print_timings:       total time =    8182.82 ms /   301 tokens


I am a Barbie girl in 2021-10-09.Question: Find the sum of the series 1 + 1/3 + 1/9 + 1/27 + 1/63 + 1/189 + … A) 0 B) 1 C) 2 D) 3 E) 4 5.1 + 1/3 + 1/9 + 1/27 + 1/63 + 1/189 =? Solution: We have to find the sum of the series as we have to add up the first n terms of the series. Sum of the first 100 natural numbers = 1 + 2 + 3 + … + 100 = 5050. A) 0 B) 1 C) 2 D) 3 E) 4
Question: Find the sum of the series 1 + 1/3 + 1/9 + 1/27 + 1/63 + 1/189 + … A) 0 B) 1 C) 2 D) 3 E) 4 6.1 + 1/3 + 1/9 + 1/27 + 1/63 + 1/189 + 1/486 = ? Solution: We have to find the sum of the series as we have to add up the first n terms of the series. Sum of the first 100 natural numbers = 1 +


## Evaluate the quantized model with the EleutherAI LM Evaluation Harness

In [13]:
%%bash
git clone https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .

Obtaining file:///content/llama.cpp/models/OrpoLlama-3.2-1B-V1/lm-evaluation-harness
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Checking if build backend supports build_editable: started
  Checking if build backend supports build_editable: finished with status 'done'
  Getting requirements to build editable: started
  Getting requirements to build editable: finished with status 'done'
  Preparing editable metadata (pyproject.toml): started
  Preparing editable metadata (pyproject.toml): finished with status 'done'
Collecting evaluate (from lm_eval==0.4.5)
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Collecting jsonlines (from lm_eval==0.4.5)
  Downloading jsonlines-4.0.0-py3-none-any.whl.metadata (1.6 kB)
Collecting pybind11>=2.6.2 (from lm_eval==0.4.5)
  Downloading pybind11-2.13.6-py3-none-any.whl.metadata (9.5 kB)
Collecting pytablewriter (from lm_eval==0.4.5)
  Downloading pytablewriter-1.2.0-py3-none-a

Cloning into 'lm-evaluation-harness'...


In [15]:
!pip install torch==1.13.1+cu117 torchvision==0.14.1+cu117 torchaudio==0.13.1 --index-url https://download.pytorch.org/whl/cu117

Looking in indexes: https://download.pytorch.org/whl/cu117
Collecting torch==1.13.1+cu117
  Downloading https://download.pytorch.org/whl/cu117/torch-1.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl (1801.8 MB)
Collecting torchvision==0.14.1+cu117
  Downloading https://download.pytorch.org/whl/cu117/torchvision-0.14.1%2Bcu117-cp310-cp310-linux_x86_64.whl (24.3 MB)
Collecting torchaudio==0.13.1
  Downloading https://download.pytorch.org/whl/cu117/torchaudio-0.13.1%2Bcu117-cp310-cp310-linux_x86_64.whl (4.2 MB)
Installing collected packages: torch, torchvision, torchaudio
  Attempting uninstall: torch
    Found existing installa

In [31]:
%%bash
lm_eval --model hf \
    --model_args pretrained=bhuvana-ak7/OrpoLlama-3.2-1B-V1_q4_k_m,dtype="float" \
    --tasks hellaswag \
    --device cuda \
    --batch_size auto:4 \
    --output_path hellaswag_test

Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
Passed argument batch_size = auto:4.0. Detecting largest batch size
Determined largest batch size: 64
hf (pretrained=bhuvana-ak7/OrpoLlama-3.2-1B-V1_q4_k_m,dtype=float), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (64,64,64,64,64)
|  Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|---------|------:|------|-----:|--------|---|-----:|---|-----:|
|hellaswag|      1|none  |     0|acc     |↑  |0.4772|±  |0.0050|
|         |       |none  |     0|acc_norm|↑  |0.6366|±  |0.0048|
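The two rows in the table differ only in how the preferred ending is chosen. The sketch below illustrates the idea under stated assumptions: `acc` picks the ending with the highest total log-likelihood, while `acc_norm` divides each log-likelihood by the length of the ending text first, so longer endings are not penalized just for being long. The scores and endings are made up, and the item count of 10042 for the HellaSwag validation split is an assumption used to check the reported standard errors.

```python
import math


def pick(loglikelihoods, endings, normalize=False):
    """Return the index of the ending the model prefers."""
    scores = [
        ll / len(ending) if normalize else ll
        for ll, ending in zip(loglikelihoods, endings)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


def binomial_stderr(p, n):
    """Standard error of an accuracy p measured on n independent items."""
    return math.sqrt(p * (1 - p) / n)


# Hypothetical item: the longer ending has a lower total log-likelihood.
endings = ["sits.", "picks up the ball and throws it to the dog."]
loglikelihoods = [-6.0, -22.0]
acc_choice = pick(loglikelihoods, endings)
acc_norm_choice = pick(loglikelihoods, endings, normalize=True)
print(acc_choice, acc_norm_choice)

N_ITEMS = 10042  # assumed size of the HellaSwag validation split
print(round(binomial_stderr(0.4772, N_ITEMS), 4))
print(round(binomial_stderr(0.6366, N_ITEMS), 4))
```

With these standard errors, an approximate 95% interval for `acc_norm` is 0.6366 ± 1.96 × 0.0048, or about 0.627 to 0.646.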



2024-11-03 19:06:43.167494: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-11-03 19:06:43.187248: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:485] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-11-03 19:06:43.210090: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:8454] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-11-03 19:06:43.217143: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1452] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-11-03 19:06:43.232830: I tensorflow/core/platform/cpu_feature_guar

In [19]:
!/content/llama.cpp/build/bin/llama-cli -m /content/llama.cpp/models/OrpoLlama-3.2-1B-V1/ggml-model-Q4_K_M.gguf -p "The quick brown fox jumps over the lazy dog" -n 64


## Push the quantized model to Hugging Face Hub

In [25]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from google.colab import userdata
from huggingface_hub import create_repo, upload_file, login
import os



In [26]:
# Get HF token from Colab's userdata and login
hf_token = userdata.get('hf_token')
login(token=hf_token)

# Step 1: Create a repository on the Hugging Face Hub with your username
username = "bhuvana-ak7"  # Your Hugging Face username
repo_name = "OrpoLlama-3.2-1B-V1_q4_k_m" #OrpoLlama-3.2-1B-V1/ggml-model-Q4_K_M.gguf
repo_id = f"{username}/{repo_name}"

# Create the repository
try:
    create_repo(
        repo_id=repo_id,
        repo_type="model",
        token=hf_token,
        exist_ok=True,
        private=True  # Set to False if you want a public repository
    )
    print(f"Repository {repo_id} created or already exists")
except Exception as e:
    print(f"Error creating repository: {str(e)}")

# Step 2: Define the local directory containing the GGUF model files
local_model_dir = "/content/llama.cpp/models/OrpoLlama-3.2-1B-V1"

# Step 3: Get list of all files in the directory
files = os.listdir(local_model_dir)

# Step 4: Upload each file
for file in files:
    # Get the full local path of the file
    local_path = os.path.join(local_model_dir, file)

    # Skip if it's a directory
    if os.path.isdir(local_path):
        continue

    # Define the path in the repo where the file will be stored
    repo_path = file

    print(f"Uploading {file} to repository...")
    try:
        # Upload the file to the hub
        upload_file(
            path_or_fileobj=local_path,
            path_in_repo=repo_path,
            repo_id=repo_id,
            repo_type="model",
            token=hf_token
        )
        print(f"Successfully uploaded {file}")
    except Exception as e:
        print(f"Error uploading {file}: {str(e)}")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
Repository bhuvana-ak7/OrpoLlama-3.2-1B-V1_q4_k_m created or already exists
Uploading generation_config.json to repository...
Successfully uploaded generation_config.json
Uploading README.md to repository...
Successfully uploaded README.md
Uploading ggml-model-Q4_K_M.gguf to repository...


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


ggml-model-Q4_K_M.gguf:   0%|          | 0.00/808M [00:00<?, ?B/s]

Successfully uploaded ggml-model-Q4_K_M.gguf
Uploading adapter_model.safetensors to repository...


adapter_model.safetensors:   0%|          | 0.00/45.1M [00:00<?, ?B/s]

Successfully uploaded adapter_model.safetensors
Uploading tokenizer.json to repository...


tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

Successfully uploaded tokenizer.json
Uploading .gitattributes to repository...
Successfully uploaded .gitattributes
Uploading Llama-3.2-1B-V1-F16.gguf to repository...


Llama-3.2-1B-V1-F16.gguf:   0%|          | 0.00/2.48G [00:00<?, ?B/s]

Successfully uploaded Llama-3.2-1B-V1-F16.gguf
Uploading adapter_config.json to repository...
Successfully uploaded adapter_config.json
Uploading special_tokens_map.json to repository...
Successfully uploaded special_tokens_map.json
Uploading model.safetensors to repository...


model.safetensors:   0%|          | 0.00/2.47G [00:00<?, ?B/s]

Successfully uploaded model.safetensors
Uploading config.json to repository...
Successfully uploaded config.json
Uploading tokenizer_config.json to repository...
Successfully uploaded tokenizer_config.json
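The loop above uploads every file in the directory, including the 2.47 GB `model.safetensors` and the F16 GGUF. If only the quantized model is meant to live in this repository, the file list could be filtered first. The sketch below shows the filtering step with the standard library; `select_for_upload` and the patterns are hypothetical, and the file names are copied from the upload log.

```python
from fnmatch import fnmatchcase


def select_for_upload(filenames, patterns):
    """Keep only the file names that match at least one glob pattern."""
    return [
        name for name in filenames
        if any(fnmatchcase(name, pattern) for pattern in patterns)
    ]


# File names as they appear in the upload log.
files = [
    "generation_config.json", "README.md", "ggml-model-Q4_K_M.gguf",
    "adapter_model.safetensors", "tokenizer.json", ".gitattributes",
    "Llama-3.2-1B-V1-F16.gguf", "adapter_config.json",
    "special_tokens_map.json", "model.safetensors", "config.json",
    "tokenizer_config.json",
]

selected = select_for_upload(files, patterns=("*Q4_K_M.gguf", "README.md"))
print(selected)
```

The filtered list can then drive the same `upload_file` loop. `huggingface_hub.upload_folder` also takes an `allow_patterns` argument that should achieve a similar effect in one call.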


## Try to generate structured text using the Outlines Library

In [38]:
! pip install email-validator

Collecting email-validator
  Downloading email_validator-2.2.0-py3-none-any.whl.metadata (25 kB)
Collecting dnspython>=2.0.0 (from email-validator)
  Downloading dnspython-2.7.0-py3-none-any.whl.metadata (5.8 kB)
Downloading email_validator-2.2.0-py3-none-any.whl (33 kB)
Downloading dnspython-2.7.0-py3-none-any.whl (313 kB)
Installing collected packages: dnspython, email-validator
Successfully installed dnspython-2.7.0 email-validator-2.2.0


In [7]:
from pydantic import BaseModel, EmailStr, HttpUrl, field_validator
from outlines import models, generate, types
from llama_cpp import Llama
import llama_cpp

# Load the GGUF Model
repo_name = 'bhuvana-ak7/OrpoLlama-3.2-1B-V1_q4_k_m'
filename = 'ggml-model-Q4_K_M.gguf'

# # Initialize the Llama model from llama-cpp-python
# llm = Llama(model_path=llm_path) # Initialize the Llama model with the file path

# Initialize the Outlines LlamaCpp model, passing the Llama object
model = models.llamacpp(repo_id=repo_name,
    filename=filename,
    tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained("meta-llama/Llama-3.2-1B"))


locale = types.locale("us")

class Client(BaseModel):
    name: str
    phone_number: locale.PhoneNumber
    zip_code: locale.ZipCode
    location: str
    Company: str
    #website: HttpUrl
    #email: EmailStr # Added email field with EmailStr type

# Add a validator for the website field
    # @field_validator('website')
    # def ensure_valid_url(cls, value):
    #     """
    #     This validator ensures the website field is a valid URL,
    #     even if it's missing the schema (http or https).
    #     """
    #     if not isinstance(value, str):
    #         raise ValueError('Website must be a string')

    #     # Add the schema if it's missing
    #     if not value.startswith(('http://', 'https://')):
    #         value = 'https://' + value

    #     return value

generator = generate.json(model, Client)

In [9]:
result = generator(
    "Create a client profile with the fields person name, phone_number, zip_code, location and company"
)

print(result)

name='John' phone_number='077-000-0001' zip_code='23501' location='Parkersville, FL' Company="John's Deli"
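Constrained generation guarantees that the output has the requested shape, not that the values are real. The sketch below checks the generated phone number in two ways: against a simple `ddd-ddd-dddd` format, and against the North American Numbering Plan rule that the area code and the exchange must each start with a digit from 2 to 9. The regular expression here is an assumption for illustration and is not the pattern Outlines uses internally.

```python
import re

# Assumed format check; not the pattern used by Outlines.
FORMAT_RE = re.compile(r"\d{3}-\d{3}-\d{4}")


def looks_like_us_phone(number):
    """True if the string has the shape ddd-ddd-dddd."""
    return FORMAT_RE.fullmatch(number) is not None


def is_plausible_nanp(number):
    """True if the area code and exchange also start with a digit 2-9."""
    if not looks_like_us_phone(number):
        return False
    area, exchange, _line = number.split("-")
    return area[0] in "23456789" and exchange[0] in "23456789"


phone = "077-000-0001"  # value from the generated client profile
print(looks_like_us_phone(phone))
print(is_plausible_nanp(phone))
```

The generated number passes the format check but fails the plausibility check, so a post-generation validation step is still worth having when the data feeds a real system.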
