# Instruction-tune Llama 2

Reference: Philipp Schmid https://www.philschmid.de/instruction-tune-llama-2

First things first: I launched a `g5.2xlarge` EC2 instance as Philipp described and installed Miniconda.

```bash
$ wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$ sh Miniconda3-latest-Linux-x86_64.sh
```

In case the connection to the notebook is interrupted, redirect the notebook output to a log file so it survives a reconnect. See https://stackoverflow.com/questions/47969937/reconnecting-remote-jupyter-notebook-and-get-current-cell-output

In [1]:
!pwd

/home/ec2-user/projects/finetune-llama-2


In [2]:
import sys
import logging

nblog = open("nb.log", "a+")
sys.stdout.echo = nblog
sys.stderr.echo = nblog

get_ipython().log.handlers[0].stream = nblog
get_ipython().log.setLevel(logging.INFO)

%autosave 5

Autosaving every 5 seconds
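
After reconnecting to the notebook server, the output captured in `nb.log` can be read back from any cell or terminal. A minimal sketch, assuming the log file name used above; `tail_log` is a hypothetical helper, not part of Jupyter:

```python
from collections import deque
from pathlib import Path


def tail_log(path, n=20):
    """Return the last n lines of a text log file (hypothetical helper)."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as f:
        # A deque with maxlen keeps only the most recent n lines in memory
        return [line.rstrip("\n") for line in deque(f, maxlen=n)]


log_path = Path("nb.log")  # file name assumed from the cell above
if log_path.exists():
    print("\n".join(tail_log(log_path)))
```

Because the file is opened in append mode in the notebook, the log keeps growing across kernel restarts, so reading only the tail is usually enough.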


## Install dependencies

In [None]:
!pip install "transformers==4.31.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1" --upgrade

## Dataset

Use the Databricks Dolly dataset `databricks/databricks-dolly-15k`.

Let's first load the dataset from the hub.

In [3]:
from datasets import load_dataset

# Load the dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

  from .autonotebook import tqdm as notebook_tqdm
Found cached dataset json (/home/ec2-user/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4)


Now take a look at the dataset. Each record is a JSON object with four fields, for example:

```js
{
    'instruction': 'I am trying to book a flight from Singapore to Sydney, what shall I do if the flight is too expensive?', 
    'context': '', 
    'response': 'You will have the option to choose from local Asian low-cost airlines such as Scoot, Jetstar, or AirAsia which would provide cheaper flights options.', 
    'category': 'general_qa'
}
```

In [3]:
from random import randrange

print(f'dataset size: {len(dataset)}')
print(dataset[randrange(len(dataset))])

dataset size: 15011
{'instruction': 'I am trying to book a flight from Singapore to Sydney, what shall I do if the flight is too expensive?', 'context': '', 'response': 'You will have the option to choose from local Asian low-cost airlines such as Scoot, Jetstar, or AirAsia which would provide cheaper flights options.', 'category': 'general_qa'}


Let's define a function to convert the data into a collection of tasks described by instructions.

In [4]:
def format_instructions(sample):
    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the Input using an LLM.

### Input:
{sample['response']}

### Response:
{sample['instruction']}
"""

Test the `format_instructions` function with a random sample in the dataset.

In [4]:
from random import randrange

sample_idx = randrange(len(dataset))
print(dataset[sample_idx])
print(format_instructions(dataset[sample_idx]))

{'instruction': 'Write a paragraph about AI governance.', 'context': '', 'response': 'The AI arms race is heating up, and breakthroughs are happening at an accelerating pace.\n\nThe release of ChatGPT by OpenAI represents a profound leap forward in how humans interface with machines, showcasing the startling progress in large language models. Meanwhile generative AI capabilities such as Dall-E, Stable Diffusion, and Midjourney are able to generate highly realistic and detailed images from text descriptions, demonstrating a level of creativity and imagination that was once thought to be exclusively human.\n\nHumans seem fundamentally wired to continuously advance technology and improve our knowledge and capabilities. Also, the human brain tends to think linearly, causing us to underestimate the exponential progress of technology. Companies and nations are incentivized by market forces and geopolitical game theory to pursue better intelligence through the advancement of AI.\n\nThe Future
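
Because the sample output above is long, it may help to see the template applied to the short record shown earlier. A minimal, self-contained sketch that repeats `format_instructions` and applies it to that record. Note that the template swaps the fields on purpose: the dataset's response becomes the input, and the model learns to produce the instruction.

```python
def format_instructions(sample):
    # Same template as above: the dataset's response is the input,
    # and the dataset's instruction is the target the model must produce.
    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the Input using an LLM.

### Input:
{sample['response']}

### Response:
{sample['instruction']}
"""


sample = {
    "instruction": "I am trying to book a flight from Singapore to Sydney, what shall I do if the flight is too expensive?",
    "context": "",
    "response": "You will have the option to choose from local Asian low-cost airlines such as Scoot, Jetstar, or AirAsia which would provide cheaper flights options.",
    "category": "general_qa",
}

prompt = format_instructions(sample)
print(prompt)
```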

## Instruction-tune Llama 2

### Install dependencies

To speed up training we want Flash Attention, which needs an NVIDIA Ampere GPU (that's why we got ourselves a `g5.2xlarge` EC2 instance, which has an NVIDIA A10G).

First, confirm that we have the GPU.

In [8]:
!python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/home/ec2-user/miniconda3/lib/python3.11/site-packages/torch/cuda/__init__.py", line 381, in get_device_capability
    prop = get_device_properties(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ec2-user/miniconda3/lib/python3.11/site-packages/torch/cuda/__init__.py", line 395, in get_device_properties
    _lazy_init()  # will define _get_device_properties
    ^^^^^^^^^^^^
  File "/home/ec2-user/miniconda3/lib/python3.11/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx


The driver is missing. I should have used an AMI with the driver prepackaged...
But let's try installing it by following the instructions here: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html (Spoiler alert: this doesn't give you the full CUDA toolkit).

If you need to set a password for `ec2-user`, first switch to root with `$ sudo su`, then run `sudo passwd ec2-user` and type the new password twice. After the password is set, switch back to `ec2-user` with `$ su ec2-user`.

To download the driver installer from AWS (it is stored in an S3 bucket), attach an IAM role with S3 access rights to the EC2 instance.

After installation, checking the driver and GPU details should show this:

```bash
$ nvidia-smi -q | head

==============NVSMI LOG==============

Timestamp                                 : Mon Jul 31 14:13:24 2023
Driver Version                            : 535.54.03
CUDA Version                              : 12.2

Attached GPUs                             : 1
GPU 00000000:00:1E.0
    Product Name                          : NVIDIA A10G
```

Now the GPU check passed with no complaint.

In [1]:
!python -c "import torch; assert torch.cuda.get_device_capability()[0] >= 8, 'Hardware not supported for Flash Attention'"
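
The assertion checks the major compute capability reported by `torch.cuda.get_device_capability()`, which returns a `(major, minor)` tuple; Ampere GPUs report major version 8. A small sketch of the same check as a pure function, so it can be read without a GPU. The helper name and the example capability values (8.6 for the A10G, 7.5 for a T4) are my own additions for illustration:

```python
def supports_flash_attention(capability):
    """Mirror the notebook's check: major compute capability must be >= 8."""
    major, _minor = capability
    return major >= 8


# Example values, assumed from NVIDIA's published compute capabilities
print(supports_flash_attention((8, 6)))  # A10G (Ampere)
print(supports_flash_attention((7, 5)))  # T4 (Turing)
```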

Installing `ninja` and `packaging` also worked.

In [2]:
!pip install ninja packaging

Collecting ninja
  Downloading ninja-1.11.1-py2.py3-none-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (145 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m146.0/146.0 kB[0m [31m6.6 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: ninja
Successfully installed ninja-1.11.1


But installing `Flash Attention` still failed.

In [None]:
!MAX_JOBS=4 pip install flash-attn --no-build-isolation

It seems that the driver didn't come with `nvcc`. I needed to explicitly install `cuda` with `conda`. (Spoiler alert: the default newest version didn't work!)

https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html

```bash
$ conda install cuda -c nvidia
```

In [None]:
!MAX_JOBS=4 pip install flash-attn --no-build-isolation

Then I got this.
```bash
The detected CUDA version (12.2) mismatches the version that was used to compile
      PyTorch (11.7). Please make sure to use the same CUDA versions.
```
So let me try installing CUDA 11.7 to match PyTorch:

```bash
# uninstall the current CUDA 12.2 
$ conda remove cuda
# install 11.7
$ conda install cuda -c nvidia/label/cuda-11.7.0
```
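
The build fails when the toolkit's `nvcc` and the CUDA version PyTorch was compiled with disagree, so it can save time to compare the two before starting a long build. A hedged sketch: it assumes `nvcc --version` prints a line containing `release X.Y`, and that the PyTorch side comes from `torch.version.cuda`. The helper names are hypothetical.

```python
import re
import shutil
import subprocess


def parse_nvcc_release(text):
    """Extract 'X.Y' from nvcc --version output, or None if not found."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


def installed_nvcc_release():
    """Run nvcc if it is on PATH and return its release string, else None."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True)
    return parse_nvcc_release(out.stdout)


def versions_match(nvcc_release, torch_cuda):
    """Compare major.minor of the toolkit with the PyTorch build's CUDA version."""
    if nvcc_release is None or torch_cuda is None:
        return False
    return nvcc_release.split(".")[:2] == torch_cuda.split(".")[:2]


# In the notebook, the second argument would be torch.version.cuda
print(versions_match("12.2", "11.7"))  # the combination from the error above
print(versions_match("11.7", "11.7"))
```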

After this, I still needed to install `typing-extensions`:

```bash
$ pip install typing-extensions
```

Then it finally started to build 
```bash
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... \
```

This took a very long time, but it succeeded in the end:

```bash
Building wheels for collected packages: flash-attn
  Building wheel for flash-attn (setup.py) ... done
  Created wheel for flash-attn: filename=flash_attn-2.0.2-cp311-cp311-linux_x86_64.whl size=59345049 sha256=b36680a8becd4d33cd6d89a066357a904474c4e63c6f6be322bfd67e808e87b1
  Stored in directory: /tmp/pip-ephem-wheel-cache-r2b8rdz7/wheels/6d/b2/9f/b63c6c7f984571c7c8cb2ee8a069461bd355d9265d098dce26
Successfully built flash-attn
Installing collected packages: einops, flash-attn
Successfully installed einops-0.6.1 flash-attn-2.0.2
```


### Instruction-tune!

Now it's time to instruction-tune Llama 2 7B! Start by getting [llama_patch.py](https://github.com/philschmid/deep-learning-pytorch-huggingface/blob/main/training/utils/llama_patch.py) from Philipp Schmid's GitHub repository. Save it in a `utils` folder next to the notebook so it can be imported.

First, set up everything and get the model.

I also needed to install SciPy:

```bash
$ conda install scipy
```

In [5]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

# use Flash Attention if possible
use_flash_attention = False
if torch.cuda.get_device_capability()[0] >= 8:
    from utils.llama_patch import replace_attn_with_flash_attn
    print("Using flash attention")
    replace_attn_with_flash_attn()
    use_flash_attention = True

# Hugging Face model id
model_id = "NousResearch/Llama-2-7b-hf" # non-gated
# model_id = "meta-llama/Llama-2-7b-hf" # gated

# BitsAndBytesConfig for 4-bit (NF4) quantization with double quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, use_cache=False, device_map="auto")
model.config.pretraining_tp = 1

# Validate that the model is using flash attention, by comparing doc strings
if use_flash_attention:
    from utils.llama_patch import forward
    assert model.model.layers[0].self_attn.forward.__doc__ == forward.__doc__, "Model is not using flash attention"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Using flash attention


Loading checkpoint shards: 100%|██████████| 2/2 [01:36<00:00, 48.38s/it]
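
Loading the weights in 4-bit is what lets a 7B model fit comfortably on the single A10G. A back-of-the-envelope estimate of the weight memory alone, assuming a round 7 billion parameters and ignoring quantization constants, activations, and optimizer state (so real usage is higher):

```python
PARAMS = 7e9  # assumed round parameter count for Llama 2 7B
GIB = 1024 ** 3


def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / GIB


fp16_gib = weight_memory_gib(PARAMS, 16)
nf4_gib = weight_memory_gib(PARAMS, 4)
print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f"4-bit weights:  {nf4_gib:.1f} GiB")
```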


Then create the config for PEFT and prepare the model for it.

In [6]:
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

# LoRA config based on QLoRA paper
peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM",
)

# prepare model for training
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)
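
To get a feel for how small the trainable part is, the LoRA parameter count can be estimated by hand. This sketch assumes PEFT's default target modules for Llama (`q_proj` and `v_proj`), a hidden size of 4096, and 32 decoder layers. None of these values are set explicitly in the config above, so treat them as assumptions.

```python
HIDDEN_SIZE = 4096   # assumed Llama 2 7B hidden size
NUM_LAYERS = 32      # assumed number of decoder layers
TARGET_MODULES = 2   # assumed defaults: q_proj and v_proj
R = 64               # LoRA rank from peft_config
LORA_ALPHA = 16      # from peft_config

# Each adapted linear layer (d_in -> d_out) gets two low-rank matrices:
# A with shape (r, d_in) and B with shape (d_out, r).
params_per_module = R * (HIDDEN_SIZE + HIDDEN_SIZE)
trainable_params = params_per_module * TARGET_MODULES * NUM_LAYERS

# LoRA scales the low-rank update by alpha / r
scaling = LORA_ALPHA / R

print(f"trainable LoRA parameters: {trainable_params:,}")
print(f"LoRA scaling factor: {scaling}")
```

In the notebook, `model.print_trainable_parameters()` reports the actual count for comparison.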

Define hyperparameters for training.

In [11]:
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama2-7-int4-dolly",
    num_train_epochs=3,
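    # NOTE: this sets the evaluation batch size only; the training batch size
    # is controlled by per_device_train_batch_size
    per_device_eval_batch_size=6 if use_flash_attention else 4,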
    # per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=10,
    logging_strategy="epoch", # log each epoch
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    disable_tqdm=True # disable progress bar since packing makes the number incorrect
)

Create the `SFTTrainer` to start training!

In [8]:
# Workaround for Hugging Face progress table updates, from
# https://stackoverflow.com/questions/47969937/reconnecting-remote-jupyter-notebook-and-get-current-cell-output
# Configure a callback (log_callback = PrinterCallback()), register it with
# trainer.add_callback(log_callback), and set logging_strategy="epoch" in
# TrainingArguments. Thanks to @Mercury's solution, the output is then
# redirected to the nb.log file.


from transformers import TrainerCallback

class PrinterCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero:
            print(logs)

log_callback = PrinterCallback()

In [12]:
from trl import SFTTrainer

# max_seq_length = 2048 
max_seq_length = 1024 # reduce max sequence length because of CUDA out of memory error
# max_seq_length = 512 # reduce max sequence length because of CUDA out of memory error

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_instructions,
    args=args
)

# add log callback defined above
trainer.add_callback(log_callback)
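
With `packing=True`, short samples are concatenated and cut into fixed-length sequences, which is why step counts no longer map to dataset rows. The following is a simplified illustration of the idea only, not TRL's actual implementation; `pack_sequences` and the token ids are made up for the example:

```python
def pack_sequences(tokenized_samples, seq_length, eos_token_id):
    """Concatenate samples separated by EOS, then split into equal chunks."""
    buffer = []
    for tokens in tokenized_samples:
        buffer.extend(tokens)
        buffer.append(eos_token_id)  # mark the boundary between samples
    # Keep only full chunks; the incomplete tail is dropped in this sketch
    n_chunks = len(buffer) // seq_length
    return [buffer[i * seq_length:(i + 1) * seq_length] for i in range(n_chunks)]


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # three "tokenized" samples
packed = pack_sequences(samples, seq_length=4, eos_token_id=0)
print(packed)
```

Three samples of different lengths become three sequences of exactly four tokens, and one packed sequence can span two original samples.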


Train and save the model.

In [13]:
import datetime

print(f"Start training at: {datetime.datetime.now()}")

trainer.train() # there won't be a progress bar

print(f"Finished training and start saving at: {datetime.datetime.now()}")

trainer.save_model()

print(f"Finished saving at: {datetime.datetime.now()}")

Start training at: 2023-08-01 05:51:44.596799


The first time it gave me an error: `'AcceleratorState' object has no attribute 'distributed_type'`. Upgrading `accelerate` solved the problem (Kernel restart needed).

```bash
pip install git+https://github.com/huggingface/accelerate
```

#### CUDA out of memory error

The second time it gave me a CUDA out-of-memory error:

```
OutOfMemoryError: CUDA out of memory. Tried to allocate 1.95 GiB (GPU 0; 21.99 GiB total capacity; 17.27 GiB already allocated; 1.78 GiB free; 19.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

To resolve it, I halved `SFTTrainer`'s `max_seq_length` from 2048 to 1024 and set `per_device_eval_batch_size` to 4. Note that `per_device_eval_batch_size` only affects evaluation; the training batch size is set by `per_device_train_batch_size`. Finally, the training began.
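
Activation memory grows with the number of tokens processed in one forward pass, so sequence length and batch size both act on the same quantity. A rough sketch of that arithmetic; the batch sizes below are illustrative values, not measurements from this run:

```python
def tokens_per_forward_pass(batch_size, seq_length):
    """Tokens held in memory during one forward/backward pass on one GPU."""
    return batch_size * seq_length


def tokens_per_optimizer_step(batch_size, seq_length, grad_accum_steps):
    """Tokens contributing to one weight update with gradient accumulation."""
    return tokens_per_forward_pass(batch_size, seq_length) * grad_accum_steps


before = tokens_per_forward_pass(batch_size=6, seq_length=2048)
after = tokens_per_forward_pass(batch_size=4, seq_length=1024)
print(before, after, before / after)
print(tokens_per_optimizer_step(4, 1024, grad_accum_steps=2))
```

Gradient accumulation raises the number of tokens per weight update without raising the peak per-pass token count, which is why it is a common way to keep the effective batch size up after shrinking the per-device batch.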

See also https://stackoverflow.com/questions/15197286/how-can-i-flush-gpu-memory-using-cuda-physical-reset-is-unavailable

To release cached GPU memory that is no longer referenced:

```python
torch.cuda.empty_cache()
```
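
Before and after calling `empty_cache()`, it can be useful to see how much memory PyTorch has allocated versus reserved, the two figures quoted in the error message. A small sketch; `gpu_memory_report` is a hypothetical helper, and it imports `torch` lazily so the formatting function works without PyTorch installed:

```python
GIB = 1024 ** 3


def format_gib(num_bytes):
    """Format a byte count as GiB with two decimals."""
    return f"{num_bytes / GIB:.2f} GiB"


def gpu_memory_report(device=0):
    """Return allocated/reserved memory for one GPU, or None without CUDA."""
    import torch  # imported here so this snippet loads without PyTorch

    if not torch.cuda.is_available():
        return None
    return {
        "allocated": format_gib(torch.cuda.memory_allocated(device)),
        "reserved": format_gib(torch.cuda.memory_reserved(device)),
    }


print(format_gib(2 * GIB))
```

Memory still referenced by live tensors (for example a `model` or `trainer` variable in the notebook) is not freed by `empty_cache()`; those references have to be deleted first.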

If all else fails and you want to kill the processes that are holding GPU memory, first list them:

```bash
$ nvidia-smi
```
This will show all the processes.

```
Tue Aug  1 05:44:18 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.03              Driver Version: 535.54.03    CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A10G                    Off | 00000000:00:1E.0 Off |                    0 |
|  0%   39C    P0              59W / 300W |      4MiB / 23028MiB |      4%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
```

Kill each of them by PID, e.g.:

```bash
$ sudo kill -9 47676
```
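
Instead of reading PIDs off the table, `nvidia-smi` can print the compute processes as CSV via `--query-compute-apps`. A sketch that parses that output; the helper names are hypothetical, and the sample line is made up for illustration (the PID reuses the example above):

```python
import shutil
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, used_memory' CSV rows into (pid, MiB or None) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        pid, used_mib = (field.strip() for field in line.split(","))
        # Memory may be reported as a non-numeric placeholder on some setups
        rows.append((int(pid), int(used_mib) if used_mib.isdigit() else None))
    return rows


def list_gpu_processes():
    """Query nvidia-smi for processes holding GPU memory (empty if unavailable)."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True,
    )
    return parse_compute_apps(out.stdout)


print(parse_compute_apps("47676, 19890\n"))  # illustrative sample line
```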

## Test the model and run inference

In [None]:
if use_flash_attention:
    # unpatch flash attention
    from utils.llama_patch import unplace_flash_attn_with_attn
    unplace_flash_attn_with_attn()

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

args.output_dir = "llama2-7-int4-dolly"  # must match output_dir in TrainingArguments

# load the base model with the trained LoRA adapter, and the tokenizer
model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
)
tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
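
At inference time the prompt uses the same template as training but stops after `### Response:`, leaving the instruction for the model to generate. A minimal sketch of building such a prompt; `build_inference_prompt` is a hypothetical helper and the input text is made up. The generation step itself is only described below, since it needs the trained adapter and a GPU.

```python
def build_inference_prompt(text):
    """Training template without the target, so the model completes it."""
    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the Input using an LLM.

### Input:
{text}

### Response:
"""


prompt = build_inference_prompt("Scoot, Jetstar, and AirAsia offer cheaper flights.")
print(prompt)
```

The resulting string would then be tokenized with `tokenizer(prompt, return_tensors="pt")` and passed to `model.generate(...)`; sampling settings are left out here because they are not part of this notebook.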
