In [1]:
!pip install transformers accelerate peft bitsandbytes -q

In [2]:
import os

import torch
from huggingface_hub import login
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = 'mistralai/Mistral-7B-Instruct-v0.2'
# Read the access token from the environment instead of hardcoding it in the notebook.
login(token=os.environ["HF_TOKEN"])

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Token has not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


In [3]:
def load_quantized_model(model_name: str):
    """
    Load a causal language model with 4-bit bitsandbytes quantization.

    :param model_name: Name or path of the model to be loaded.
    :return: Loaded quantized model.
    """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        # Double (nested) quantization: also quantize the quantization constants,
        # saving roughly 0.4 bits per parameter.
        bnb_4bit_use_double_quant=True,
        # 4-bit NormalFloat (NF4); the library default is "fp4".
        bnb_4bit_quant_type="nf4",
        # Weights are stored in 4 bits but dequantized to this dtype for matrix
        # multiplication; a 16-bit dtype is faster than float32.
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.bfloat16,
        quantization_config=bnb_config,
    )
    return model

#### `load_in_4bit=True` loads Mistral-7B with its weights stored in 4-bit precision instead of the usual 16 or 32 bits. Only the weights are compressed; activations stay in the compute dtype. This significantly reduces the memory footprint of the model.

#### Weights stored in 4 bits take about 8 times less memory than 32-bit full-precision weights, and about 4 times less than 16-bit weights. Inference speed may also change, though whether it improves depends on the GPU and the kernels used. If you need the highest possible accuracy, use the full-precision model. On almost all consumer GPUs, including those on Google Colab, quantization is the practical way to run a 7B or 13B parameter model.
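
#### The memory savings above can be checked with simple arithmetic. This is a rough sketch that counts the weights only (no activations, KV cache, or quantization constants); the parameter count `N_PARAMS` is an assumed approximate figure for Mistral-7B, and `weight_memory_gib` is a hypothetical helper, not part of any library.

```python
def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30


# Assumption: approximate parameter count for Mistral-7B.
N_PARAMS = 7.24e9

for label, bits in [("float32", 32), ("bfloat16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(N_PARAMS, bits):5.1f} GiB")
```

#### The ratio between the 32-bit and 4-bit rows is 8, and between the 16-bit and 4-bit rows it is 4, regardless of the parameter count.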

#### `bnb_4bit_use_double_quant=True` enables double (or nested) quantization, which applies a second quantization to the quantization constants produced by the initial one. It saves about 0.4 bits per parameter.
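
#### Where the roughly 0.4 bits per parameter comes from can be sketched as follows. The block sizes and constant widths below (one 32-bit constant per 64 weights, re-quantized to 8 bits in groups of 256) are assumptions taken from the QLoRA paper's description, not values read from this notebook, and both helper functions are hypothetical.

```python
def constant_overhead_bits(block_size: int = 64, const_bits: int = 32) -> float:
    """Extra bits per weight when each block stores one full-precision constant."""
    return const_bits / block_size


def double_quant_overhead_bits(
    block_size: int = 64,
    const_bits: int = 8,
    second_block: int = 256,
    second_const_bits: int = 32,
) -> float:
    """Extra bits per weight when the constants are themselves quantized."""
    first_level = const_bits / block_size
    second_level = second_const_bits / (block_size * second_block)
    return first_level + second_level


single = constant_overhead_bits()
double = double_quant_overhead_bits()
saving = single - double
print(f"single: {single:.3f}  double: {double:.3f}  saving: {saving:.3f} bits/param")
```

#### Under these assumptions the saving is about 0.37 bits per parameter, which rounds to the 0.4 quoted above.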

#### `bnb_4bit_quant_type="nf4"` specifies the type of 4-bit quantization to be used. Here, `nf4` refers to 4-bit NormalFloat (NF4), a data type designed for normally distributed weights. It is not the default: if this argument is omitted, `fp4` is used.
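
#### To illustrate the idea behind a NormalFloat-style data type, here is a simplified sketch of block-wise quantization against a 16-level codebook built from quantiles of a standard normal distribution. The codebook is an approximation built for illustration and is not the exact NF4 table used by bitsandbytes; all function names are hypothetical.

```python
from statistics import NormalDist


def make_codebook(n_levels: int = 16) -> list[float]:
    """Build levels from evenly spaced normal quantiles, rescaled to [-1, 1]."""
    nd = NormalDist()
    probs = [(i + 0.5) / n_levels for i in range(n_levels)]
    quantiles = [nd.inv_cdf(p) for p in probs]
    scale = max(abs(q) for q in quantiles)
    return [q / scale for q in quantiles]


def quantize_block(block: list[float], codebook: list[float]):
    """Scale a block by its absolute maximum and map each value to the nearest level."""
    absmax = max(abs(x) for x in block) or 1.0
    codes = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - x / absmax))
        for x in block
    ]
    return codes, absmax


def dequantize_block(codes: list[int], absmax: float, codebook: list[float]) -> list[float]:
    """Recover approximate values from 4-bit codes and the block constant."""
    return [codebook[c] * absmax for c in codes]


codebook = make_codebook()
block = [0.5, -2.0, 0.1, 1.0]
codes, absmax = quantize_block(block, codebook)
restored = dequantize_block(codes, absmax, codebook)
print(codes, absmax)
print([round(v, 3) for v in restored])
```

#### Each weight is stored as a 4-bit code (an index from 0 to 15), and each block keeps one scaling constant, which is the constant that double quantization compresses further.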

#### As used here, 4-bit quantization with bitsandbytes requires a GPU; it is not possible to quantize the model in 4-bit on a CPU. Pretty much any CUDA-capable GPU can be used, as long as CUDA is installed.

#### `bnb_4bit_compute_dtype=torch.bfloat16` sets the datatype used during computation. The compute dtype can be `float16`, `bfloat16`, or `float32`. This setting is needed because, while bitsandbytes stores the weights in 4 bits, the computation is performed in 16 or 32 bits. Matrix multiplication is faster when a 16-bit compute dtype is used.

#### And just to reiterate, the computation isn't carried out in 4-bit. Only the weights are compressed to that format; they are dequantized on the fly, and the computation is performed in the compute dtype configured above.
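
#### The separation between 4-bit storage and higher-precision compute can be sketched in plain Python. Two 4-bit codes are packed into each byte for storage, then unpacked and dequantized to floats before the dot product. The uniform codebook, the single `absmax` constant, and all function names are illustrative assumptions, not the bitsandbytes implementation.

```python
# Illustrative 16-level codebook, evenly spaced in [-1, 1] (not the NF4 table).
CODEBOOK = [-1 + 2 * i / 15 for i in range(16)]


def pack_nibbles(codes: list[int]) -> bytes:
    """Store two 4-bit codes (0..15) in each byte."""
    codes = list(codes)
    if len(codes) % 2:
        codes.append(0)  # pad to an even length
    return bytes((hi << 4) | lo for hi, lo in zip(codes[0::2], codes[1::2]))


def unpack_nibbles(packed: bytes, n: int) -> list[int]:
    """Recover the first n 4-bit codes from packed bytes."""
    out = []
    for b in packed:
        out.append(b >> 4)
        out.append(b & 0x0F)
    return out[:n]


def linear_forward(x: list[float], packed: bytes, n: int, absmax: float) -> float:
    """Dequantize the stored weights to floats, then compute the dot product in floats."""
    codes = unpack_nibbles(packed, n)
    weights = [CODEBOOK[c] * absmax for c in codes]
    return sum(xi * wi for xi, wi in zip(x, weights))


codes = [0, 15, 7, 8, 3]
packed = pack_nibbles(codes)
print(len(packed), unpack_nibbles(packed, len(codes)))
print(linear_forward([1.0, 1.0], pack_nibbles([0, 15]), 2, absmax=2.0))
```

#### The arithmetic in `linear_forward` never touches the 4-bit codes directly; they serve only as a compact storage format.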

In [4]:
def initialize_tokenizer(model_name: str):
    """
    Initialize the tokenizer with the specified model_name.

    :param model_name: Name or path of the model for tokenizer initialization.
    :return: Initialized tokenizer.
    """
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.bos_token_id = 1  # Set beginning of sentence token id
    return tokenizer

In [5]:
torch.backends.cuda.enable_mem_efficient_sdp(False)
torch.backends.cuda.enable_flash_sdp(False)

model = load_quantized_model(model_name)
tokenizer = initialize_tokenizer(model_name)

# Define stop token ids
stop_token_ids = [0]

`low_cpu_mem_usage` was None, now set to True since model is quantized.



In [6]:
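def inference(text: str, max_tokens: int = 200):
    """Wrap the prompt in Mistral's [INST] tags, sample a completion, and return the decoded text."""
    text = '[INST] ' + text.strip() + ' [/INST]'
    # Tokenize and move the inputs to the same device as the model.
    model_input = tokenizer(text, return_tensors="pt", add_special_tokens=False).to(model.device)
    generated_ids = model.generate(**model_input, max_new_tokens=max_tokens, do_sample=True)
    # The decoded string contains the prompt followed by the generated reply.
    decoded = tokenizer.batch_decode(generated_ids)
    return decoded[0]


import warnings

warnings.filterwarnings('ignore')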
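
#### Because `inference` returns the prompt together with the completion, it can be handy to strip the prompt afterwards. A minimal sketch, assuming the `[INST] ... [/INST]` wrapping used above and the `<s>` / `</s>` special tokens; `extract_reply` is a hypothetical helper, not part of transformers.

```python
def extract_reply(decoded: str) -> str:
    """Return the text after the last [/INST] marker, without special tokens."""
    reply = decoded.rsplit("[/INST]", 1)[-1]
    for special in ("<s>", "</s>"):
        reply = reply.replace(special, "")
    return reply.strip()


print(extract_reply("[INST] Say hi [/INST] Hello there.</s>"))
```

#### If the marker is absent, the function returns the whole string with special tokens removed, so it is safe to apply to any decoded output.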

In [7]:
print(inference("Tell me all you know about Parameter-Efficient Fine-tuning?", 500))

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[INST] Tell me all you know about Parameter-Efficient Fine-tuning? [/INST] Parameter-Efficient Fine-tuning (PEFT) is a method used for adapting pre-trained models to new tasks with minimal additional computational cost. The main idea behind PEFT is to keep the weights of the pre-trained model frozen and only update a small subset of model parameters, known as adaptation token or projection heads.

In PEFT, the model is fine-tuned by adding a few new layers on top of the pre-trained model. These new layers can be called mismatch embeddings, promoter, or adapters. They are typically lightweight and can be trained using a small dataset. By updating only these new layers, the model can learn task-specific features while preserving the general knowledge learned from the pre-training stage.

 PEFT has several advantages, some of them are:

1. Reduced computational cost as fewer parameters need to be updated compared to full fine-tuning.
2. Improved generalization ability as the model maintai