# Compare LLMs

In this tutorial, we load a quantized version of an LLM and then compare the outputs of several LLMs.

Quantization reduces the precision of a model's weights and activations, which shrinks its memory footprint. See the [Transformers quantization documentation](https://huggingface.co/docs/transformers/main/main_classes/quantization) for more information.
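To make the idea concrete, here is a minimal sketch of one simple scheme, absmax quantization to 8-bit integers, written in plain Python. It is an illustration only: the function names and the sample weights are made up for this example, and this is not the NF4 scheme that bitsandbytes uses later in this notebook.

```python
def quantize_absmax(weights, bits=8):
    """Map floats to signed integers using one shared scale (absmax scheme)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8 bits
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the integers and the scale."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.12, -0.53, 0.98, -1.27, 0.0]
q, scale = quantize_absmax(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q)
print("scale:", scale)
print("max round-trip error:", max_error)
```

Each weight is now stored as a small integer plus one shared scale, and the round-trip error is bounded by half the scale. That is the trade-off quantization makes: less memory in exchange for a small loss of precision.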

In [None]:
!pip install 'transformers[torch]'
!pip install datasets zstandard evaluate
!pip install accelerate -U
!pip install bitsandbytes

In [None]:
import torch
from transformers import (
  AutoTokenizer,
  AutoModelForCausalLM,
  BitsAndBytesConfig,
  pipeline
)

In [None]:
# Loading the unquantized model is expected to fail with an out-of-memory error
# if your machine has limited RAM.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# [Quantization](https://huggingface.co/docs/accelerate/main/en/usage_guides/quantization#bitsandbytes-integration)

Quantization reduces the memory required to store the model weights by lowering their precision from 32-bit floating point to a lower-precision format such as BFLOAT16 or INT8. The configuration below goes one step further and loads the weights in 4-bit precision.
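As a rough back-of-the-envelope check, the sketch below estimates how much memory the weights alone would need at different precisions. It assumes a round figure of 7 billion parameters and ignores activations, the KV cache, and the small overhead of quantization constants, so treat the numbers as approximations.

```python
# Bits used to store one weight at each precision.
BITS_PER_WEIGHT = {"float32": 32, "bfloat16": 16, "int8": 8, "4-bit": 4}


def weight_memory_gib(num_params, bits):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits / 8 / 1024**3


num_params = 7_000_000_000  # assumed round figure for a "7B" model

for name, bits in BITS_PER_WEIGHT.items():
    print(f"{name:>9}: {weight_memory_gib(num_params, bits):5.1f} GiB")
```

Under these assumptions, the 4-bit weights take one eighth of the memory of the 32-bit weights, which is why the quantized model can load where the full-precision one fails.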

In [None]:
# Activate 4-bit precision base model loading
use_4bit = True

# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"

# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"

# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

In [None]:
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)

# Check whether the GPU supports bfloat16 (compute capability 8.0 or higher)
if compute_dtype == torch.float16 and use_4bit and torch.cuda.is_available():
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print('Your GPU supports bfloat16: consider setting bnb_4bit_compute_dtype = "bfloat16"')
        print("=" * 80)

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, quantization_config=bnb_config)  # You may want to use bfloat16 and/or move to GPU here

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
`low_cpu_mem_usage` was None, now set to True since model is quantized.



In [None]:
messages = [
    {
        "role": "user",
        "content": "You are a friendly chatbot who always responds in the style of a pirate. How many burgers can a human eat in one sitting?",
    },
]
tokenized_chat = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
)
print(tokenizer.decode(tokenized_chat[0]))

<s> [INST] You are a friendly chatbot who always responds in the style of a pirate. How many burgers can a human eat in one sitting? [/INST]


In [None]:
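# Move the input ids to the same device as the (quantized) model before generating
outputs = model.generate(tokenized_chat.to(model.device), max_new_tokens=128)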
print(tokenizer.decode(outputs[0]))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


<s> [INST] You are a friendly chatbot who always responds in the style of a pirate. How many burgers can a human eat in one sitting? [/INST] Arr, matey, that be dependin' on the human in question and the size of the burgers. Some scurvy dogs might manage three or four, but most would be pushin' it with two. Ye be advisin' to take it easy, or risk a swollen belly and a groggy feeling. Arrr!</s>
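The decoded output above repeats the prompt before the model's answer. When comparing models it is usually easier to look at the reply alone. The helper below is a small sketch that assumes the Mistral-style `[/INST]` and `</s>` markers seen in the output above; `extract_reply` is a hypothetical name, and other models use different markers.

```python
def extract_reply(decoded, end_of_prompt="[/INST]", eos="</s>"):
    """Return only the text generated after the last end-of-prompt marker."""
    # rpartition returns the whole string as the last element
    # when the marker is not found, so unmarked text passes through.
    _, _, reply = decoded.rpartition(end_of_prompt)
    return reply.replace(eos, "").strip()


# Shortened example string in the same shape as the decoded output above.
decoded = "<s> [INST] How many burgers can a human eat? [/INST] Arr, matey, two at most.</s>"
print(extract_reply(decoded))
```

An alternative that does not depend on text markers is to slice off the prompt tokens before decoding, for example `tokenizer.decode(outputs[0][tokenized_chat.shape[-1]:], skip_special_tokens=True)`.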


# To do

Select two models and compare their responses using two different [prompting](https://www.promptingguide.ai/techniques) techniques.

**Tip**: 7B-parameter models may be easier to load.

Some example models:
1. meta-llama/Llama-2-7b-chat-hf
2. HuggingFaceH4/zephyr-7b-beta
3. google/gemma-7b
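
One possible way to organize the comparison is sketched below. It builds two prompt variants (zero-shot and zero-shot chain-of-thought) and collects one response per model and technique. The names `build_prompts`, `compare`, and `fake_generate` are hypothetical. `fake_generate` is a stand-in so the sketch runs without downloading anything; in the notebook you would replace it with a function that loads the quantized model and calls `model.generate` as shown earlier.

```python
from itertools import product


def build_prompts(question):
    """Return the same question phrased with two prompting techniques."""
    return {
        "zero-shot": question,
        "zero-shot chain-of-thought": f"{question}\nLet's think step by step.",
    }


def compare(model_names, prompts, generate_fn):
    """Collect one response for every (model, technique) pair."""
    results = []
    for model_name, (technique, prompt) in product(model_names, prompts.items()):
        results.append(
            {
                "model": model_name,
                "technique": technique,
                "response": generate_fn(model_name, prompt),
            }
        )
    return results


def fake_generate(model_name, prompt):
    """Placeholder generator (assumption): swap in real model inference."""
    return f"[{model_name}] reply to a {len(prompt)}-character prompt"


models = ["meta-llama/Llama-2-7b-chat-hf", "HuggingFaceH4/zephyr-7b-beta"]
prompts = build_prompts("How many burgers can a human eat in one sitting?")
rows = compare(models, prompts, fake_generate)

for row in rows:
    print(f"{row['model']} | {row['technique']} | {row['response']}")
```

Keeping the generation step behind a single function makes it easy to load one model at a time, which helps when GPU memory is limited.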