# Apply Quantization on my Llama-2 7B Fine-tuned Python Code Generator using GPTQ with AutoGPTQ and Hugging Face

In my previous [notebook](https://github.com/edumunozsala/llama-2-7B-4bit-python-coder/blob/main/Llama-2-finetune-qlora-python-coder.ipynb), I showed you how to instruction fine-tune the new Llama 2 model, just released by Meta AI, to build a Python code generator in a few lines of code. This time we will describe how to quantize this model using GPTQ quantization, now that it is integrated with transformers.

Last week, Hugging Face announced the compatibility of its transformers library with the AutoGPTQ library, which allows us to quantize a large language model to 2, 3, or 4 bits using the GPTQ methodology.

This notebook contains three sections:
- Load the original model, fine-tuned with QLoRA in 4 bits, and evaluate the performance when running on a T4 instance in Colab.
- Quantize this original model to 4 bits using GPTQ with transformers and AutoGPTQ and save it to the Hugging Face Hub.
- Load the quantized model and evaluate the performance in a T4.

## Load the Llama-2 7B model in 4-bit without AutoGPTQ

### Installing the libraries

In [None]:
! pip install "transformers==4.31.0" "bitsandbytes==0.40.2" "accelerate==0.21.0"

We load the model in 4 bits with bitsandbytes quantization.

In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the entire model on the GPU 0
device_map = {"": 0}

# Hugging Face repo of the fine-tuned model
hf_model_repo='edumunozsala/llama-2-7b-int4-python-code-20k'

# Get the tokenizer
tokenizer = AutoTokenizer.from_pretrained(hf_model_repo)
# Load the model
model = AutoModelForCausalLM.from_pretrained(hf_model_repo, load_in_4bit=True, torch_dtype=torch.float16,
                                             device_map=device_map)


Let's check the memory usage.

In [None]:
!nvidia-smi

## Test Inference performance

Install the datasets library.

In [None]:
!pip install "datasets==2.13.0"

In [None]:
from datasets import load_dataset
from random import randrange
import time

### Load the dataset with the instruction set

First, we load the dataset from the Hugging Face Hub. The dataset is [python_code_instructions_18k_alpaca](https://huggingface.co/datasets/iamtarun/python_code_instructions_18k_alpaca).

The dataset contains problem descriptions and the corresponding Python code. It is derived from sahil2801/code_instructions_120k, with an added prompt column in Alpaca style.


In [None]:
# The instruction dataset to use
dataset_name = "iamtarun/python_code_instructions_18k_alpaca"
# Load dataset from the hub
dataset = load_dataset(dataset_name, split="train[:10%]")
# Show dataset size
print(f"dataset size: {len(dataset)}")
# Show an example
print(dataset[randrange(len(dataset))])


In [None]:
# Check the dataset structure
print(dataset)

We run inference only on the examples whose instruction is longer than 500 characters:

In [None]:
# Sample the dataset
#sample_dataset= dataset[randrange(0,100,5)]
# Keep only the examples whose instruction is longer than 500 characters
responses_test = dataset.filter(lambda x: len(x['instruction'])>500)

print(len(responses_test))

Now we loop over the test examples, running inference on each one and collecting the execution time and the generated result.

In [None]:
# Process the test examples
times = []
solutions = []

for example in responses_test:
  start = time.time()
  prompt = f"""### Instruction:
  Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

  ### Task:
  {example["instruction"]}

  ### Input:
  {example["input"]}

  ### Response:
  """
  # Tokenize the input
  input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
  # Run the model to infer an output
  outputs = model.generate(input_ids=input_ids, max_new_tokens=200, do_sample=True, top_p=0.9,temperature=0.3)

  # Collect the output and the inference time (elapsed = end - start)
  end = time.time()
  print(end-start)
  times.append(end-start)
  solutions.append(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):])

print(times)
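Because the prompt template in the loop is indented together with the loop body, every line after the first carries two leading spaces, unlike the flush-left prompt used for the single example later in this notebook. A minimal sketch of a helper that keeps the template in one place is shown below; the name `build_prompt` and its parameters are hypothetical, not part of the original notebook.

```python
def build_prompt(instruction: str, input_text: str = "") -> str:
    """Return the flush-left prompt template used for single examples in this notebook.

    `build_prompt` is a hypothetical helper introduced for illustration only.
    """
    return (
        "### Instruction:\n"
        "Use the Task below and the Input given to write the Response, "
        "which is a programming code that can solve the Task.\n"
        "\n"
        "### Task:\n"
        f"{instruction}\n"
        "\n"
        "### Input:\n"
        f"{input_text}\n"
        "\n"
        "### Response:\n"
    )


prompt = build_prompt(
    "Write a Python function to display the first and last elements of a list."
)
print(prompt)
```

Inside the loop, `build_prompt(example["instruction"], example["input"])` would replace the inline f-string.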

As a summary, we show the max, min and mean execution time

In [None]:
import numpy as np

In [None]:
print('Min time: ',np.min(times))
print('Max time: ',np.max(times))
print('Mean time: ',np.mean(times))

On average, **every inference takes about 17 - 19 seconds**.
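Since the same measurement is repeated later for the GPTQ model, the timing logic can be factored out. The sketch below is one possible way to do it, using only the standard library; `timed` and `summarize` are hypothetical names, and the workload is a stand-in for the `model.generate` call.

```python
import statistics
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds), where elapsed = end - start."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def summarize(durations):
    """Return the min, max and mean of a list of durations in seconds."""
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": statistics.mean(durations),
    }


# Stand-in workload; in the notebook this would be the model.generate call.
durations = []
for n in (10_000, 20_000, 30_000):
    _, elapsed = timed(sum, range(n))
    durations.append(elapsed)

print(summarize(durations))
```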

## Quantize transformers model using auto-gptq, 🤗 transformers and optimum

The GPTQ algorithm calibrates the quantized weights of the model by running inference on it with a calibration dataset. The detailed quantization algorithm is described in [the original paper](https://arxiv.org/pdf/2210.17323.pdf).

For quantizing a model using auto-gptq, we need to pass a dataset to the quantizer. This can be achieved either by passing a supported default dataset among `['wikitext2','c4','c4-new','ptb','ptb-new']` or a list of strings that will be used as a dataset.
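As an illustration of the second option, the sketch below builds a list of strings from instruction records and passes it as the `dataset` argument of `GPTQConfig`. The helper names `build_calibration_texts` and `make_gptq_config`, the sample records, and the choice of 128 samples are assumptions made for this example; this notebook itself uses the default `"c4"` dataset.

```python
from itertools import islice


def build_calibration_texts(records, max_samples=128):
    """Turn instruction records (dicts) into plain strings usable as calibration samples."""
    texts = []
    for record in islice(records, max_samples):
        parts = [record["instruction"], record.get("input", ""), record.get("output", "")]
        # Skip empty fields so samples do not contain blank lines
        texts.append("\n".join(p for p in parts if p))
    return texts


def make_gptq_config(tokenizer, texts):
    """Build a GPTQConfig that calibrates on a custom list of strings."""
    # Imported lazily so the helper above can run without transformers installed
    from transformers import GPTQConfig

    return GPTQConfig(bits=4, group_size=128, dataset=texts, desc_act=False, tokenizer=tokenizer)


# Made-up records with the same columns as the instruction dataset
sample = [
    {"instruction": "Reverse a string.", "input": "hello", "output": "print('hello'[::-1])"},
    {"instruction": "Add two numbers.", "input": "", "output": "print(1 + 2)"},
]
calibration_texts = build_calibration_texts(sample)
print(calibration_texts)
```

Calibrating on text that resembles the model's target domain is a design choice to evaluate, not something this notebook measures.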

Installing the libraries

In [None]:
!pip install -q -U transformers peft accelerate optimum

The Hugging Face blog post recommends installing an explicit version from its wheel index, but the simple `pip install auto-gptq` will probably work as well.

In [None]:
!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/
#!pip install auto-gptq

Looking in indexes: https://pypi.org/simple, https://huggingface.github.io/autogptq-index/whl/cu117/
Collecting auto-gptq
  Downloading https://huggingface.github.io/autogptq-index/whl/cu117/auto-gptq/auto_gptq-0.4.2%2Bcu117-cp310-cp310-linux_x86_64.whl (1.6 MB)
Collecting rouge (from auto-gptq)
  Downloading rouge-1.0.1-py3-none-any.whl (13 kB)
Installing collected packages: rouge, auto-gptq
Successfully installed auto-gptq-0.4.2+cu117 rouge-1.0.1


In this example, we will quantize the model to 4-bit precision using the `"c4"` dataset and the tokenizer from our model. Supported precisions are `[2, 3, 4, 8]`. The remaining parameters keep their default values.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
import torch

# Set the model to load
hf_model_repo='edumunozsala/llama-2-7b-int4-python-code-20k'
# Load the tokenizer
tokenizer = AutoTokenizer.from_pretrained(hf_model_repo, use_fast=True)
# Set quantization configuration
quantization_config = GPTQConfig(
     bits=4,
     group_size=128,
     dataset="c4",
     desc_act=False,
     tokenizer=tokenizer
)
# Load the model from HF
quant_model = AutoModelForCausalLM.from_pretrained(hf_model_repo, quantization_config=quantization_config, device_map='auto')

Downloading (…)okenizer_config.json:   0%|          | 0.00/695 [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.84M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/434 [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/627 [00:00<?, ?B/s]

Downloading (…)model.bin.index.json:   0%|          | 0.00/26.8k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)l-00001-of-00002.bin:   0%|          | 0.00/9.98G [00:00<?, ?B/s]

Downloading (…)l-00002-of-00002.bin:   0%|          | 0.00/3.50G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Downloading (…)neration_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/2.38k [00:00<?, ?B/s]

Repo card metadata block was not found. Setting CardData to empty.


Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/319M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Quantizing model.layers blocks :   0%|          | 0/32 [00:00<?, ?it/s]

Quantizing layers inside the block:   0%|          | 0/7 [00:00<?, ?it/s]

The parameters defined:

- **bits**: Precision of the quantization. Supported values are 2, 3, 4, and 8.
- **group_size**: The number of weights that share the same quantization parameters. 128 is the recommended value.
- **dataset**: The dataset used for calibration. I would leave `"c4"`, which seems to yield reasonable results. Other datasets are supported according to the documentation.
- **desc_act**: Whether to quantize columns in order of decreasing activation size. Setting it to False can significantly speed up inference but the perplexity may become slightly worse. Also known as act-order.
- **tokenizer**: The tokenizer of Llama 2 7B that will be applied to c4.
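To get a feel for what `bits=4` and `group_size=128` mean for storage, here is a rough back-of-the-envelope estimate. It is only a sketch under stated assumptions: the layer shapes are those of Llama 2 7B (hidden size 4096, 32 layers and vocabulary 32000 appear in the model printout later in this notebook; the intermediate size 11008 is assumed from the model's published configuration), each group stores one fp16 scale and one packed zero-point, and the embeddings, `lm_head` and norms stay in fp16. The function name `estimate_gptq_bytes` is hypothetical.

```python
def estimate_gptq_bytes(bits=4, group_size=128, hidden=4096, intermediate=11008,
                        layers=32, vocab=32000):
    """Rough size estimate of a GPTQ checkpoint, under the assumptions in the text."""
    attn = 4 * hidden * hidden          # q, k, v and o projections
    mlp = 3 * hidden * intermediate     # gate, up and down projections
    quantized_weights = layers * (attn + mlp)

    # Assumed to stay in fp16: embeddings, lm_head and the RMSNorm weights
    unquantized_weights = 2 * vocab * hidden + (2 * layers + 1) * hidden

    groups = quantized_weights // group_size
    packed = quantized_weights * bits // 8   # packed low-bit weights
    scales = groups * 2                      # one fp16 scale per group (assumption)
    zeros = groups * bits // 8               # one packed zero-point per group (assumption)
    fp16_rest = unquantized_weights * 2

    return packed + scales + zeros + fp16_rest


total_bytes = estimate_gptq_bytes()
print(f"Estimated checkpoint size: {total_bytes / 1e9:.2f} GB")
```

With these assumptions the estimate lands close to the 3.90G reported for `pytorch_model.bin` when the model is pushed to the Hub later in this notebook.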

Let's check the memory usage.

In [None]:
!nvidia-smi

Mon Sep  4 07:17:10 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  NVIDIA A100-SXM...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   33C    P0    49W / 400W |  27249MiB / 40960MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

You can make sure the model has been correctly quantized by checking the attributes of the linear layers: they should contain `qweight` and `qzeros` attributes of dtype `torch.int32`.

In [None]:
quant_model.model.layers[0].self_attn.q_proj.__dict__

Now, we run a simple example

In [None]:
# Create an instruction
instruction="Write a Python function to display the first and last elements of a list."
input=""

prompt = f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
{instruction}

### Input:
{input}

### Response:
"""
# Tokenize the input
input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# Run the quantized model to infer an output
outputs = quant_model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.5)

# Print the result
print(f"Prompt:\n{prompt}\n")
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")

## Share quantized models on 🤗 Hub

After quantizing the model, it can be used out-of-the-box for inference, or you can push the quantized weights to the 🤗 Hub to share your quantized model with the community.

In [None]:
!pip install python-dotenv
!pip install huggingface_hub

Collecting python-dotenv
  Downloading python_dotenv-1.0.0-py3-none-any.whl (19 kB)
Installing collected packages: python-dotenv
Successfully installed python-dotenv-1.0.0


You can log in to Hugging Face Hub interactively

In [None]:
from huggingface_hub import notebook_login
# Log in to HF Hub
notebook_login()


Or you can provide a `.env` file containing the Hugging Face token:

In [None]:
from huggingface_hub import login
from dotenv import load_dotenv
import os

# Load the environment variables
load_dotenv()
# Login to the Hugging Face Hub
login(token=os.getenv("HF_HUB_TOKEN"))

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Now we can upload our quantized model to the Hugging Face Hub.

In [None]:
quant_model.push_to_hub("edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k")
tokenizer.push_to_hub("edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k")

pytorch_model.bin:   0%|          | 0.00/3.90G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k/commit/ed458634c5162e5d755532fb10177864b3fa3b5c', commit_message='Upload tokenizer', commit_description='', oid='ed458634c5162e5d755532fb10177864b3fa3b5c', pr_url=None, pr_revision=None, pr_num=None)

## Load the GPTQ model from the Hugging Face Hub and run an example

As usual, we import the libraries and load the tokenizer and quantized model

Installing the libraries

In [None]:
!pip install -q -U transformers peft accelerate optimum

In [None]:
#!pip install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu117/
!pip install auto-gptq



Load the model. This time we do not need to specify any quantization mode because the model is already quantized: models that have been quantized with the auto-gptq library can be loaded out of the box from the 🤗 Hub using the `from_pretrained` method.


In [None]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "edumunozsala/llama-2-7b-int4-GPTQ-python-code-20k"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")


Downloading pytorch_model.bin:   0%|          | 0.00/3.90G [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/132 [00:00<?, ?B/s]

Once the tokenizer and model have been loaded, let's generate some text. Before that, we can inspect the model to make sure it was loaded as a quantized model: the printout should show `QuantLinear()` layers.

In [None]:
print(model)

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 4096, padding_idx=0)
    (layers): ModuleList(
      (0-31): 32 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (rotary_emb): LlamaRotaryEmbedding()
          (k_proj): QuantLinear()
          (o_proj): QuantLinear()
          (q_proj): QuantLinear()
          (v_proj): QuantLinear()
        )
        (mlp): LlamaMLP(
          (act_fn): SiLUActivation()
          (down_proj): QuantLinear()
          (gate_proj): QuantLinear()
          (up_proj): QuantLinear()
        )
        (input_layernorm): LlamaRMSNorm()
        (post_attention_layernorm): LlamaRMSNorm()
      )
    )
    (norm): LlamaRMSNorm()
  )
  (lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
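Rather than reading the whole printout, the same check can be done by counting module class names. The sketch below is a hypothetical helper (`count_module_types` is not part of transformers) that works with any object exposing `named_modules()`, as PyTorch modules do. The tiny fake model only makes the example runnable without a GPU.

```python
from collections import Counter


def count_module_types(model):
    """Count modules by class name for any object exposing named_modules()."""
    return Counter(type(module).__name__ for _, module in model.named_modules())


# Stand-in classes so the sketch runs without loading the real model
class QuantLinear:
    pass


class Linear:
    pass


class FakeModel:
    def named_modules(self):
        yield "lm_head", Linear()
        for i in range(2):
            yield f"model.layers.{i}.self_attn.q_proj", QuantLinear()


counts = count_module_types(FakeModel())
print(counts["QuantLinear"], counts["Linear"])
```

On the loaded model, `count_module_types(model)["QuantLinear"]` would be expected to report 7 quantized projections for each of the 32 decoder layers, going by the printout above.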


Let's check the GPU VRAM.

In [None]:
!nvidia-smi

Mon Sep  4 17:30:38 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    29W /  70W |   5053MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

Then we run an inference to test it

In [None]:
instruction="Write a Python function to display the first and last elements of a list."
input=""

prompt = f"""### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
{instruction}

### Input:
{input}

### Response:
"""

input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=128, do_sample=True, top_p=0.9,temperature=0.3)

print(f"Prompt:\n{prompt}\n")
print(f"Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")


Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


Prompt:
### Instruction:
Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

### Task:
Write a Python function to display the first and last elements of a list.

### Input:


### Response:


Generated instruction:
def first_last(list):
    first = list[0]
    last = list[-1]
    return first, last



Everything looks fine.

## Test Inference performance

In [None]:
!pip install "datasets==2.13.0"

Collecting datasets==2.13.0
  Downloading datasets-2.13.0-py3-none-any.whl (485 kB)
Collecting dill<0.3.7,>=0.3.0 (from datasets==2.13.0)
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)
INFO: pip is looking at multiple versions of multiprocess to determine which version is compatible with other requirements. This could take a while.
Collecting multiprocess (from datasets==2.13.0)
  Downloading multiprocess-0.70.14-py310-none-any.whl (134 kB)
Installing collected packages: dill, multiprocess, datasets
  Attempting uninstall: dill
    Found existing installation: dill 0.3.7
    Uninstalling dill-0.3.7:
      Successfully uninsta

In [None]:
from datasets import load_dataset
from random import randrange
import time

### Load the dataset with the instruction set

In [None]:
# The instruction dataset to use
dataset_name = "iamtarun/python_code_instructions_18k_alpaca"
# Load dataset from the hub
dataset = load_dataset(dataset_name, split="train[:10%]")
# Show dataset size
print(f"dataset size: {len(dataset)}")
# Show an example
print(dataset[randrange(len(dataset))])


Downloading readme:   0%|          | 0.00/905 [00:00<?, ?B/s]

Downloading data files:   0%|          | 0/1 [00:00<?, ?it/s]

Downloading data:   0%|          | 0.00/11.4M [00:00<?, ?B/s]

Extracting data files:   0%|          | 0/1 [00:00<?, ?it/s]

Generating train split:   0%|          | 0/18612 [00:00<?, ? examples/s]

dataset size: 1861
{'instruction': 'Write a Python script for recognizing facial features using OpenCV and Haar features.', 'input': '<Sample_image.jpg>', 'output': '#Import necessary libraries\nimport cv2\nimport numpy as np\n\n#Load the classifier for frontal face detection\nhaar_face_cascade = cv2.CascadeClassifier("haarcascade_frontal_face.xml")\n\n#Create a VideoCapture object\ncapture = cv2.VideoCapture("Sample_image.jpg")\nimg = capture.read()\n\n#Convert the image into grayscale\ngray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)\n\n#Detect facial features\nfaces = haar_face_cascade.detectMultiScale(gray, 1.3, 5)\n\n#Draw a rectangle around the faces\nfor (x,y,w,h) in faces:\n    cv2.rectangle(img, (x,y), (x+w, y+h), (255,0,0), 2)\n\n#Write the image into a file\ncv2.imwrite(\'face_detect.jpg\',img)', 'prompt': 'Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWrite a Python script for recognizing facial fe

In [None]:
# Check the dataset structure
print(dataset)

Dataset({
    features: ['instruction', 'input', 'output', 'prompt'],
    num_rows: 1861
})


We run inference only on the examples whose instruction is longer than 500 characters:

In [None]:
# Sample the dataset
#sample_dataset= dataset[randrange(0,100,5)]
# Keep only the examples whose instruction is longer than 500 characters
responses_test = dataset.filter(lambda x: len(x['instruction'])>500)

print(len(responses_test))

Filter:   0%|          | 0/1861 [00:00<?, ? examples/s]

12


Now, we run this simple evaluation process

In [None]:
# Process the test examples
times = []
solutions = []

for example in responses_test:
  start = time.time()
  prompt = f"""### Instruction:
  Use the Task below and the Input given to write the Response, which is a programming code that can solve the Task.

  ### Task:
  {example["instruction"]}

  ### Input:
  {example["input"]}

  ### Response:
  """
  # Tokenize the input
  input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
  # Run the model to infer an output
  outputs = model.generate(input_ids=input_ids, max_new_tokens=200, do_sample=True, top_p=0.9,temperature=0.3)

  # Collect the output and the inference time (elapsed = end - start)
  end = time.time()
  print(end-start)
  times.append(end-start)
  solutions.append(tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):])

print(times)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.


9.108977317810059
7.926240921020508
7.896298408508301
6.436765909194946
7.823150396347046
8.08249545097351
8.073349237442017
8.109416961669922
8.120927333831787
8.119935035705566
6.791100978851318
8.088157653808594
[9.108977317810059, 7.926240921020508, 7.896298408508301, 6.436765909194946, 7.823150396347046, 8.08249545097351, 8.073349237442017, 8.109416961669922, 8.120927333831787, 8.119935035705566, 6.791100978851318, 8.088157653808594]


Show the metrics

In [None]:
import numpy as np

In [None]:
print('Min time: ',np.min(times))
print('Max time: ',np.max(times))
print('Mean time: ',np.mean(times))

Min time:  6.436765909194946
Max time:  9.108977317810059
Mean time:  7.881401300430298
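To put the two measurements side by side, the short calculation below compares the twelve durations recorded above, taken as positive magnitudes in seconds, with the 17 to 19 seconds per inference observed for the bitsandbytes 4-bit model. It is a rough comparison only: the sample is small and generation uses sampling, so the number of generated tokens can vary between runs.

```python
import statistics

# Durations recorded above for the GPTQ model on a T4, in seconds
gptq_times = [
    9.108977317810059, 7.926240921020508, 7.896298408508301, 6.436765909194946,
    7.823150396347046, 8.08249545097351, 8.073349237442017, 8.109416961669922,
    8.120927333831787, 8.119935035705566, 6.791100978851318, 8.088157653808594,
]
# Range reported earlier for the bitsandbytes 4-bit model, in seconds
baseline_range = (17.0, 19.0)

mean_gptq = statistics.mean(gptq_times)
speedup_low = baseline_range[0] / mean_gptq
speedup_high = baseline_range[1] / mean_gptq

print(f"Mean GPTQ latency: {mean_gptq:.2f} s")
print(f"Speedup over bitsandbytes 4-bit: {speedup_low:.1f}x to {speedup_high:.1f}x")
```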
