# Instruction-tune Llama 2

Reference: Philipp Schmid https://www.philschmid.de/instruction-tune-llama-2

## Install dependencies

In [1]:
!pip install "transformers==4.31.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.40.2" "trl==0.4.7" "safetensors>=0.3.1" --upgrade
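
After installing, it can help to confirm that the environment picked up the pinned versions. The following is a minimal sketch, not part of the original notebook: the `PINNED` dictionary and the `check_versions` helper are illustrative names, and the check relies only on the standard-library `importlib.metadata` module.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the pip command above (safetensors is omitted
# because it is only given a lower bound, not an exact pin).
PINNED = {
    "transformers": "4.31.0",
    "datasets": "2.13.0",
    "peft": "0.4.0",
    "accelerate": "0.21.0",
    "bitsandbytes": "0.40.2",
    "trl": "0.4.7",
}


def check_versions(pinned):
    """Return {name: (expected, installed, matches)} for each pinned package.

    `installed` is None when the package is not installed at all.
    """
    report = {}
    for name, expected in pinned.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        report[name] = (expected, installed, installed == expected)
    return report


for name, (expected, installed, ok) in check_versions(PINNED).items():
    status = "ok" if ok else "MISMATCH"
    print(f"{name}: expected {expected}, found {installed} [{status}]")
```

A mismatch usually means the kernel needs a restart after the install, or that another environment is active.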


## Dataset

We use the Databricks Dolly dataset, `databricks/databricks-dolly-15k`.

Let's first load the dataset from the hub.

In [2]:
from datasets import load_dataset

# Load the dataset from the hub
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

Downloading readme: 100%|██████████| 8.20k/8.20k [00:00<00:00, 21.1MB/s]


Downloading and preparing dataset json/databricks--databricks-dolly-15k to /home/ec2-user/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4...


Downloading data: 100%|██████████| 13.1M/13.1M [00:00<00:00, 151MB/s]
Downloading data files: 100%|██████████| 1/1 [00:00<00:00,  1.78it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1282.66it/s]
                                                        

Dataset json downloaded and prepared to /home/ec2-user/.cache/huggingface/datasets/databricks___json/databricks--databricks-dolly-15k-7427aa6e57c34282/0.0.0/e347ab1c932092252e717ff3f949105a4dd28b27e842dd53157d2f72e276c2e4. Subsequent calls will reuse this data.




Now take a look at the dataset. Each record is a JSON object with four fields (`instruction`, `context`, `response`, and `category`), for example:

```js
{
    'instruction': 'I am trying to book a flight from Singapore to Sydney, what shall I do if the flight is too expensive?', 
    'context': '', 
    'response': 'You will have the option to choose from local Asian low-cost airlines such as Scoot, Jetstar, or AirAsia which would provide cheaper flights options.', 
    'category': 'general_qa'
}
```

In [3]:
from random import randrange

print(f'dataset size: {len(dataset)}')
print(dataset[randrange(len(dataset))])

dataset size: 15011
{'instruction': 'I am trying to book a flight from Singapore to Sydney, what shall I do if the flight is too expensive?', 'context': '', 'response': 'You will have the option to choose from local Asian low-cost airlines such as Scoot, Jetstar, or AirAsia which would provide cheaper flights options.', 'category': 'general_qa'}
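
Since every record carries a `category` field, a quick tally gives a feel for how the tasks are distributed. This is a small sketch rather than part of the original notebook: `category_counts` is an illustrative helper, and `toy_records` is a made-up stand-in so the block runs on its own. In the notebook you would pass `dataset` instead, assuming that iterating over it yields one dict per record.

```python
from collections import Counter


def category_counts(records):
    """Count how many records fall into each `category` value."""
    return Counter(record["category"] for record in records)


# Hypothetical stand-in records; in the notebook, pass `dataset` instead.
toy_records = [
    {"instruction": "a", "context": "", "response": "x", "category": "general_qa"},
    {"instruction": "b", "context": "c", "response": "y", "category": "closed_qa"},
    {"instruction": "c", "context": "", "response": "z", "category": "general_qa"},
]

# Print categories from most to least frequent.
for category, count in category_counts(toy_records).most_common():
    print(f"{category}: {count}")
```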


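Let's define a function that formats each sample as a prompt. Note that the task is reversed: the sample's `response` is placed in the Input section, and the sample's `instruction` becomes the text the model should generate.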

In [4]:
def format_instructions(sample):
    # Build a prompt whose Input is the sample's response and whose Response is its instruction.
    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the Input using an LLM.

### Input:
{sample['response']}

### Response:
{sample['instruction']}
"""

Test the `format_instructions` function with a random sample from the dataset.

In [6]:
from random import randrange

sample_idx = randrange(len(dataset))
print(dataset[sample_idx])
print(format_instructions(dataset[sample_idx]))

{'instruction': 'When would a railway be considered a heritage railway?', 'context': "A heritage railway or heritage railroad (US usage) is a railway operated as living history to re-create or preserve railway scenes of the past. Heritage railways are often old railway lines preserved in a state depicting a period (or periods) in the history of rail transport. The British Office of Rail and Road defines heritage railways as follows:...'lines of local interest', museum railways or tourist railways that have retained or assumed the character and appearance and operating practices of railways of former times. Several lines that operate in isolation provide genuine transport facilities, providing community links. Most lines constitute tourist or educational attractions in their own right. Much of the rolling stock and other equipment used on these systems is original and is of historic value in its own right. Many systems aim to replicate both the look and operating practices of historic f
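
Because the cell above picks a random sample, its output changes on every run. For a reproducible check, the same function can be applied to a fixed record. The sketch below repeats `format_instructions` so that it runs on its own and uses the flight-booking record shown earlier; the variable names `sample` and `prompt` are illustrative.

```python
def format_instructions(sample):
    # Same template as above: the response is the Input, the instruction is the target.
    return f"""### Instruction:
Use the Input below to create an instruction, which could have been used to generate the Input using an LLM.

### Input:
{sample['response']}

### Response:
{sample['instruction']}
"""


# Fixed record taken from the example shown earlier in this notebook.
sample = {
    "instruction": "I am trying to book a flight from Singapore to Sydney, what shall I do if the flight is too expensive?",
    "context": "",
    "response": "You will have the option to choose from local Asian low-cost airlines such as Scoot, Jetstar, or AirAsia which would provide cheaper flights options.",
    "category": "general_qa",
}

prompt = format_instructions(sample)
print(prompt)

# The template never reads `context`, so that field does not reach the prompt.
assert prompt.endswith(sample["instruction"] + "\n")
```

The final assertion spells out the reversed layout: the prompt ends with the original instruction, which is what the model is trained to produce.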

## Instruction-tune Llama 2