From https://github.com/facebookresearch/llama-recipes/

specifically from commit 43771602c9d7808c888eb5995ccce4bc8beafb1f as of 1/12/2023

# Model

Download official Llama2 from Meta, specifically the 7b and 13b variants.

Use https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/convert_llama_weights_to_hf.py to convert the weights to Hugging Face (hf) format.

Write the converted models to `../model-7b` and `../model-13b`, relative to this notebook.

# Dataset

Download the `samsum` dataset to `../dataset`, relative to this notebook

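```
import os
from datasets import load_dataset

os.makedirs('../dataset', exist_ok=True)  # make sure the output folder exists

for split in ['train', 'validation']:
    # write each split as JSON Lines
    load_dataset("samsum", split=split).to_json(f'../dataset/samsun_{split}.jsonl')
```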
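The runs below train on only a 1% slice of each split (the `split_slice` option). As a rough sanity check, the sketch below computes how many examples that leaves. It assumes the published SAMSum split sizes (14,732 train, 818 validation) and that the slice is a percentage of each split rounded down. How `split_slice` is actually applied inside `finetuning.py` is not shown here, so `slice_size` is a hypothetical helper.

```python
# Hypothetical helper: how many examples a percentage slice keeps.
# Assumption: slice size = split size * percentage, rounded down.
SAMSUM_SPLIT_SIZES = {"train": 14732, "validation": 818}  # assumed, from the SAMSum dataset card


def slice_size(n_examples: int, split_slice: str) -> int:
    """Return the number of examples kept for a slice such as '1%'."""
    percent = int(split_slice.rstrip("%"))
    return n_examples * percent // 100


sizes = {split: slice_size(n, "1%") for split, n in SAMSUM_SPLIT_SIZES.items()}
print(sizes)
```

Under these assumptions the slice keeps 147 training and 8 validation examples, which agrees with the `Training Set Length` and `Validation Set Length` lines in the run output.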

In [1]:
from finetuning import main

In [2]:
config = {
    'model_name': '../model-7b',
    'dataset_cache_dir': '../dataset',
    'split_slice': '1%', # use only split_slice of each split, to prevent oom on local
    'use_peft': True,
}

main(**config)

Loading checkpoint shards:   0%|          | 0/3 [00:00<?, ?it/s]

--> Model ../model-7b

--> ../model-7b has 262.41024 Million params

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199


Map:   0%|          | 0/147 [00:00<?, ? examples/s]


--> Training Set Length = 147


Map:   0%|          | 0/8 [00:00<?, ? examples/s]




--> Validation Set Length = 8


Training Epoch: 1/3, step 146/147 completed (loss: 0.14629891514778137): : 37it [04:04,  6.62s/it]


Max CUDA memory allocated was 5 GB
Max CUDA memory reserved was 5 GB
Peak active CUDA memory was 5 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 5 GB


evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████| 8/8 [00:06<00:00,  1.26it/s]


 eval_ppl=tensor(3.2692, device='cuda:0') eval_epoch_loss=tensor(1.1845, device='cuda:0')
best eval loss on epoch 1 is 1.1845430135726929
Epoch 1: train_perplexity=1.4029, train_epoch_loss=0.3385, epoch time 245.15665548799916s


Training Epoch: 2/3, step 146/147 completed (loss: 0.2730576992034912): : 37it [04:19,  7.01s/it]


Max CUDA memory allocated was 5 GB
Max CUDA memory reserved was 5 GB
Peak active CUDA memory was 5 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 6 GB


evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████| 8/8 [00:06<00:00,  1.20it/s]


 eval_ppl=tensor(3.2262, device='cuda:0') eval_epoch_loss=tensor(1.1713, device='cuda:0')
best eval loss on epoch 2 is 1.1713013648986816
Epoch 2: train_perplexity=1.3081, train_epoch_loss=0.2686, epoch time 259.4338902799991s


Training Epoch: 3/3, step 146/147 completed (loss: 0.33959704637527466): : 37it [04:24,  7.16s/it]


Max CUDA memory allocated was 5 GB
Max CUDA memory reserved was 6 GB
Peak active CUDA memory was 5 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 6 GB


evaluating Epoch: 100%|███████████████████████████████████████████████████████████████████| 8/8 [00:06<00:00,  1.20it/s]

 eval_ppl=tensor(3.4685, device='cuda:0') eval_epoch_loss=tensor(1.2437, device='cuda:0')
Epoch 3: train_perplexity=1.2570, train_epoch_loss=0.2287, epoch time 265.19161016s
Key: avg_train_prep, Value: 1.3226686716079712
Key: avg_train_loss, Value: 0.2786160707473755
Key: avg_eval_prep, Value: 3.321303129196167
Key: avg_eval_loss, Value: 1.1757153272628784
Key: avg_epoch_time, Value: 256.59405197599943
Key: avg_checkpoint_time, Value: 2.54233297406851e-06
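The summary values above can be cross-checked against the per-epoch lines. The sketch below takes the eval losses printed in the log (the third only to the four digits shown), confirms that perplexity is `exp(loss)`, and compares two ways of averaging the eval loss. The "running best" reading is an inference from the numbers, not something verified in the training code.

```python
import math
from itertools import accumulate

# Per-epoch eval losses and perplexities copied from the log above.
eval_losses = [1.1845430135726929, 1.1713013648986816, 1.2437]
reported_ppl = [3.2692, 3.2262, 3.4685]
reported_avg_eval_loss = 1.1757153272628784

# Perplexity is the exponential of the mean cross-entropy loss.
computed_ppl = [math.exp(loss) for loss in eval_losses]
for ppl, ref in zip(computed_ppl, reported_ppl):
    print(f"exp(loss) = {ppl:.4f}, logged eval_ppl = {ref}")

# Plain mean of the three eval losses.
plain_mean = sum(eval_losses) / len(eval_losses)

# Mean of the best eval loss seen so far at each epoch (assumed interpretation).
running_best = list(accumulate(eval_losses, min))
running_best_mean = sum(running_best) / len(running_best)

print(f"plain mean        = {plain_mean:.4f}")
print(f"running-best mean = {running_best_mean:.4f}")
print(f"logged avg        = {reported_avg_eval_loss:.4f}")
```

The logged `avg_eval_loss` agrees with the running-best mean rather than the plain mean, so it likely understates the eval loss of the last epoch, where the loss went up. `avg_eval_prep`, by contrast, matches the plain mean of the three logged perplexities.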





In [5]:
# alternatively use cli
!python3 finetuning.py \
    --model_name ../model-7b \
    --dataset_cache_dir ../dataset \
    --split_slice 1% \
    --use_peft

Loading checkpoint shards: 100%|██████████████████| 3/3 [00:29<00:00,  9.99s/it]
--> Model ../model-7b

--> ../model-7b has 262.41024 Million params

trainable params: 4,194,304 || all params: 6,742,609,920 || trainable%: 0.06220594176090199
--> Training Set Length = 147
--> Validation Set Length = 8
Training Epoch: 1/3, step 146/147 completed (loss: 0.14629891514778137): : 37it [03:49,  6.20s/it]
Max CUDA memory allocated was 5 GB
Max CUDA memory reserved was 5 GB
Peak active CUDA memory was 5 GB
Cuda Malloc retires : 0
CPU Total Peak Memory consumed during the train (max): 2 GB
evaluating Epoch: 100%|███████████████████████████| 8/8 [00:06<00:00,  1.28it/s]
 eval_ppl=tensor(3.2692, device='cuda:0') eval_epoch_loss=tensor(1.1845, device='cuda:0')
best eval loss on epoch 1 is 1.1845430135726929
Epoch 1: train_perplexity=1.4029, train_epoch_loss=0.3385, epoch time 229.48925341500035s
Training Epoch: 2/3, step 146/147 completed (loss: 0.2730576992034912): : 37it [04:12,  6.8

# Alternative runs

## Single V100 (32GB) GPU

LoRA PEFT of quantized 7b (`load_in_4bit=True`) in half precision works, matching the local run above

```
python3 finetuning.py \
    --split_slice 1% \
    --use_peft \
    --quantization True \
    --use_fp16 True
```
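The parameter counts printed in the run logs can be reproduced by hand, which is a useful check that LoRA is attached where expected. The sketch below assumes the standard Llama-2-7b dimensions (hidden size 4096, 32 layers, MLP size 11008, vocabulary 32000) and LoRA rank 8 on `q_proj` and `v_proj` only. The LoRA settings are an assumption, since the config used for these runs is not shown here.

```python
# Assumed Llama-2-7b dimensions.
HIDDEN, LAYERS, MLP, VOCAB = 4096, 32, 11008, 32000
# Assumed LoRA settings: rank 8 on two HIDDEN x HIDDEN projections per layer.
LORA_RANK = 8
LORA_TARGETS_PER_LAYER = 2

embed = VOCAB * HIDDEN
lm_head = VOCAB * HIDDEN
attn = 4 * HIDDEN * HIDDEN      # q, k, v, o projections
mlp = 3 * HIDDEN * MLP          # gate, up, down projections
norms_per_layer = 2 * HIDDEN    # two RMSNorm weight vectors per layer
final_norm = HIDDEN

base_params = embed + lm_head + final_norm + LAYERS * (attn + mlp + norms_per_layer)

# Each adapted projection gets A (rank x HIDDEN) and B (HIDDEN x rank).
lora_params = LAYERS * LORA_TARGETS_PER_LAYER * LORA_RANK * (HIDDEN + HIDDEN)

# Everything that is not a linear layer inside the transformer blocks.
non_block_linear_params = embed + lm_head + final_norm + LAYERS * norms_per_layer

all_params = base_params + lora_params
print(f"trainable params: {lora_params:,} || all params: {all_params:,}")
print(f"trainable%: {100 * lora_params / all_params}")
print(f"embeddings + lm_head + norms: {non_block_linear_params / 1e6} Million")
```

With these assumptions the counts come out to 4,194,304 trainable and 6,742,609,920 total parameters, matching the log. The 262.41024 Million figure equals the embeddings, `lm_head`, and norm weights taken together, which hints that the `has ... Million params` line leaves out the linear layers of the transformer blocks. That reading is an inference from the arithmetic, not something confirmed in the code.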

Full finetuning of 7b in full/half precision results in CUDA OOM

```
python3 finetuning.py \
    --split_slice 1% \
    --quantization False \
    --use_fp16 False
```
```
python3 finetuning.py \
    --split_slice 1% \
    --quantization False \
    --use_fp16 True
```
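A back-of-envelope estimate shows why full finetuning does not fit on one 32GB card. The sketch below counts only the bytes for weights, gradients, and two optimizer moments per parameter. It assumes an Adam-style optimizer and ignores activations and framework overhead, so it is a lower bound. Whether `--use_fp16` actually stores weights in 16 bits is not established here, so two scenarios are shown.

```python
N_PARAMS = 6_738_415_616  # Llama-2-7b parameter count without adapters (assumed)
GPU_MEMORY_GB = 32        # single V100


def training_state_gb(n_params, weight_bytes, grad_bytes, optim_bytes, num_gpus=1):
    """Lower bound in GB per GPU for weights + gradients + optimizer state.

    Assumes the state is split evenly across num_gpus and ignores activations.
    """
    total_bytes = n_params * (weight_bytes + grad_bytes + optim_bytes)
    return total_bytes / num_gpus / 1e9


# Everything in 32-bit: 4 + 4 + 8 bytes per parameter.
fp32_gb = training_state_gb(N_PARAMS, weight_bytes=4, grad_bytes=4, optim_bytes=8)
# Optimistic case where everything is kept in 16-bit: 2 + 2 + 4 bytes per parameter.
fp16_gb = training_state_gb(N_PARAMS, weight_bytes=2, grad_bytes=2, optim_bytes=4)
# 32-bit state split evenly over 4 GPUs.
sharded_gb = training_state_gb(N_PARAMS, 4, 4, 8, num_gpus=4)

for name, gb in [("fp32", fp32_gb), ("all 16-bit", fp16_gb), ("fp32 over 4 GPUs", sharded_gb)]:
    print(f"{name}: {gb:.1f} GB needed vs {GPU_MEMORY_GB} GB available")
```

Even the optimistic 16-bit case needs well over 32 GB for the training state alone. Splitting the 32-bit state evenly over four GPUs leaves roughly 27 GB per card before any activations, which is in line with the OOM seen in the multi-GPU full finetuning run further down.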

Full finetuning of quantized 7b (`load_in_4bit=True`) in full precision works, but training loss does not decrease

```
python3 finetuning.py \
    --split_slice 1% \
    --quantization True \
    --use_fp16 False 
```

Full finetuning of quantized 7b (`load_in_4bit=True`) in half precision results in

`AssertionError: No inf checks were recorded for this optimizer.`

Possibly related PyTorch AMP notes: https://pytorch.org/docs/stable/notes/amp_examples.html#gradient-accumulation


```
python3 finetuning.py \
    --split_slice 1% \
    --quantization True \
    --use_fp16 True 
```

## Single node multi V100 (32GB) GPU

FSDP shards the model (and data) across GPUs.

~~There was a weird issue of the device not being set properly for FSDP, resulting in `Inconsistent compute_device and device_id`, possibly fixed with `export CUDA_VISIBLE_DEVICES=0,1,2,3`.~~ The error is caused by setting `quantization` to `True`, which leaves `device_map` as `None`; this is possibly a bug in the code. Do not use quantization with FSDP; it is not supported anyway.

Full finetuning of 7b in half precision results in CUDA OOM

```
torchrun \
    --nnodes 1 \
    --nproc_per_node 4 \
    finetuning.py \
    --split_slice 1% \
    --enable_fsdp \
    --quantization False \
    --use_fp16 True
```

LoRA PEFT of 7b in half precision works! Note that quantization is not supported for FSDP.

`eval_ppl` and `eval_loss` decrease similarly to the single-GPU case as well.

```
torchrun \
    --nnodes 1 \
    --nproc_per_node 4 \
    finetuning.py \
    --split_slice 1% \
    --enable_fsdp \
    --use_peft \
    --quantization False \
    --use_fp16 True
```
