Large portions of this code are taken from the following sources:
- [Hugging Face PEFT LoRA tutorials](https://huggingface.co/docs/peft/en/developer_guides/lora)
- [Hugging Face PEFT quantization tutorials](https://huggingface.co/docs/peft/en/developer_guides/quantization)

Please check out the original sources for more information and other amazing tutorials. I strongly recommend the [Hugging Face NLP Course](https://huggingface.co/learn/nlp-course/chapter0/1?fw=pt) for a good overview of the Hugging Face `transformers` library.

# Quantizing a model

In [1]:
import torch
from transformers import BitsAndBytesConfig



Configure BitsAndBytes to quantize a model. Here is an explanation of the parameters:

`load_in_4bit` - Load the model in 4-bit precision.

`bnb_4bit_quant_type` - The 4-bit data type to use. Here we use `"nf4"` (4-bit NormalFloat), which is a type of quantile quantization. The weights in each block are normalized to the range $[-1, 1]$ and each one is mapped to one of 16 levels. For more details see the [QLoRA paper](https://arxiv.org/abs/2305.14314).
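To make the idea of quantile quantization concrete, here is a minimal sketch in plain Python. It is an illustration only: the 16 levels are built from evenly spaced quantiles of a standard normal distribution, which is an assumption made for this example and not the exact NF4 codebook used by `bitsandbytes`. The function names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(n_levels=16):
    """Build n_levels values from quantiles of N(0, 1), rescaled to [-1, 1].

    Illustrative assumption: this is NOT the exact NF4 codebook.
    """
    nd = NormalDist()
    # Offset by half a step to avoid the infinite quantiles at p=0 and p=1.
    probs = [(i + 0.5) / n_levels for i in range(n_levels)]
    raw = [nd.inv_cdf(p) for p in probs]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]


def quantize_block(weights, levels):
    """Normalize a block by its absolute maximum and pick the nearest level."""
    absmax = max(abs(w) for w in weights) or 1.0  # guard against an all-zero block
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in weights
    ]
    return codes, absmax  # absmax is the quantization constant for this block


def dequantize_block(codes, absmax, levels):
    """Map 4-bit codes back to approximate weights."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
block = [0.5, -1.2, 0.03, 2.4]
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
print(codes, absmax)
print(restored)
```

Each weight is stored as an integer code between 0 and 15 (4 bits), plus one `absmax` constant per block that is needed to undo the normalization.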

`bnb_4bit_use_double_quant` - After the weights are quantized to 4 bits using a technique like NF4, the quantization constants (the absolute max value of each quantization block, usually stored in FP32) must still be kept so that the weights can be dequantized during computation. For large models with many quantization blocks, storing these constants adds non-trivial memory overhead. Double quantization addresses this by performing a second round of quantization, this time on the 32-bit quantization constants themselves.
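A rough back-of-the-envelope calculation shows how much the constants cost. The sketch below assumes the settings reported in the QLoRA paper (64 weights per block, constants re-quantized to 8 bits in groups of 256, with one FP32 constant per group); the actual defaults in your installed `bitsandbytes` version may differ. The function names and the 7-billion parameter count are placeholders for illustration.

```python
def constant_overhead_bits(block_size=64, const_bits=32):
    """Extra bits per weight when each block stores one FP32 constant."""
    return const_bits / block_size


def double_quant_overhead_bits(block_size=64, second_block_size=256,
                               quantized_const_bits=8, second_const_bits=32):
    """Extra bits per weight when the constants are themselves quantized."""
    # Each block now stores an 8-bit constant instead of a 32-bit one...
    first_level = quantized_const_bits / block_size
    # ...plus one FP32 constant shared by every group of second_block_size blocks.
    second_level = second_const_bits / (block_size * second_block_size)
    return first_level + second_level


plain = constant_overhead_bits()
double = double_quant_overhead_bits()
print(f"without double quantization: {plain} bits per weight")
print(f"with double quantization:    {double:.3f} bits per weight")

# Approximate saving for a model with 7 billion weights (illustrative figure).
n_params = 7_000_000_000
saved_mb = (plain - double) * n_params / 8 / 1e6
print(f"approximate saving: {saved_mb:.0f} MB")
```

Under these assumptions the overhead drops from 0.5 bits to about 0.127 bits per weight.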

`bnb_4bit_compute_dtype` - The data type used for computation after the 4-bit weights are dequantized. Here we use `torch.bfloat16`. This is similar to standard 16-bit half precision (`float16`), but the bits are split differently between the exponent and the mantissa (8 and 7 bits, versus 5 and 10 for `float16`; both have 1 bit for the sign). The larger exponent lets `bfloat16` represent a much wider range of values at the cost of some precision, which makes overflow less likely during computation.
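The effect of the different bit split can be checked with a few lines of arithmetic. This sketch uses the standard IEEE-style layout (biased exponent, all-ones exponent reserved for infinity and NaN) and does not need `torch`. The helper name `max_finite` is made up for this example.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value for a binary float with the given bit split."""
    bias = 2 ** (exponent_bits - 1) - 1
    # The all-ones exponent is reserved for inf/NaN, so the largest usable one is 2^e - 2.
    max_exponent = (2 ** exponent_bits - 2) - bias
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent


fp16_max = max_finite(exponent_bits=5, mantissa_bits=10)
bf16_max = max_finite(exponent_bits=8, mantissa_bits=7)

print(f"float16  max: {fp16_max}")       # 65504.0
print(f"bfloat16 max: {bf16_max:.3e}")   # about 3.39e+38

# Relative spacing between adjacent values near 1.0 (smaller is more precise).
print(f"float16  step: {2 ** -10}")
print(f"bfloat16 step: {2 ** -7}")
```

So `bfloat16` trades three bits of precision for roughly the same range as `float32`.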

In [5]:
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

Here we use a file called `.env` to store the environment variables. This is good practice because it avoids hardcoding sensitive information in the code. However, if you are running this code in Colab, you can just do:
```python
from huggingface_hub import login

HF_TOKEN = 'hf_...'
login(token=HF_TOKEN)
```

and NOT run the cell below.

In [6]:
import os

from dotenv import load_dotenv
from huggingface_hub import login

# Load the variables defined in .env (including HF_TOKEN) into the environment
load_dotenv()
hf_token = os.getenv('HF_TOKEN')

login(token=hf_token)

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/rkd/.cache/huggingface/token
Login successful


Load the model. Note that this will not work if you're running on a CPU. You need to be connected to a GPU either on Colab or through a local GPU. If running in Colab, and you get an error, try restarting the runtime and running the code again.

In [None]:
from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers

tokenizer = AutoTokenizer.from_pretrained("meta-llama/CodeLlama-7b-Instruct-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/CodeLlama-7b-Instruct-hf", quantization_config=config)

In [None]:
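def chat(prompt):
    # Tokenize the prompt and move the tensors to the same device as the model
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Sample a completion; max_length counts the prompt tokens as well
    output = model.generate(**inputs,
                            max_length=512,
                            do_sample=True,
                            temperature=0.1,
                            top_k=10, top_p=0.95,
                            num_return_sequences=1,
                            eos_token_id=tokenizer.eos_token_id)

    # Decode the first (and only) returned sequence back into text
    return tokenizer.decode(output[0], skip_special_tokens=True)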

In [None]:
print(chat("What is a good machine learning library for Python?"))

To prepare the quantized model for parameter-efficient fine-tuning (PEFT) with LoRA:

In [None]:
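from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the quantized model for training
model = prepare_model_for_kbit_training(model)

# Named lora_config so it does not overwrite the BitsAndBytesConfig stored in `config`
lora_config = LoraConfig(
    r=16,  # rank of the low-rank update matrices
    lora_alpha=8,  # scaling factor; the update is scaled by lora_alpha / r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Wrap the model so that only the LoRA adapter weights are trainable
model = get_peft_model(model, lora_config)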
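To get a feel for how small the adapter is, the number of trainable parameters implied by the LoRA configuration above can be estimated by hand. The sketch below assumes the usual 7B Llama architecture (hidden size 4096, 32 decoder layers, square attention projections); these values are assumptions and are not read from the loaded model. The helper `lora_params` is hypothetical.

```python
def lora_params(d_in, d_out, r):
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Assumed architecture values for a 7B Llama-style model (not read from the model).
hidden_size = 4096
num_layers = 32

# Values taken from the LoRA configuration above.
r, lora_alpha = 16, 8
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

per_module = lora_params(hidden_size, hidden_size, r)
total = per_module * len(target_modules) * num_layers
scaling = lora_alpha / r  # factor applied to the low-rank update

print(f"parameters per adapted module: {per_module:,}")
print(f"total trainable parameters:    {total:,}")
print(f"update scaling (alpha / r):    {scaling}")
```

Under these assumptions the adapter adds about 16.8 million trainable parameters, a small fraction of the roughly 7 billion frozen weights. On the real model, calling `model.print_trainable_parameters()` should report the actual count.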

The model can now be trained in the usual way, which is trivial and left as an exercise to the reader...