## Preparing a dataset for LLM fine-tuning

**Note**: The final fine-tuning should be done on a GPU, but the rest of the groundwork, including dataset preparation, can be done locally.

#### Installing requirements

In [2]:
!pip install -U bitsandbytes transformers peft accelerate datasets scipy ipywidgets matplotlib huggingface_hub wandb



### Preparing dataset

In [1]:
import json

**The Alpaca dataset can be downloaded at** https://github.com/Instruction-Tuning-with-GPT-4/GPT-4-LLM/blob/main/data/alpaca_gpt4_data.json

In [2]:
with open('alpaca_gpt4_data.json','r') as f:
    alpaca = json.load(f)

In [3]:
len(alpaca)

52002

In [7]:
alpaca[420]

{'instruction': 'Provide me a list of diseases that can be caused by smoking.',
 'input': '',
 'output': 'Sure! Here is a list of some of the diseases that can be caused by smoking: \n\n1. Lung cancer\n2. Chronic obstructive pulmonary disease (COPD)\n3. Heart disease\n4. Stroke\n5. Emphysema\n6. Atherosclerosis\n7. Peripheral artery disease\n8. Esophageal cancer\n9. Throat cancer\n10. Oral cancer\n11. Bladder cancer\n12. Pancreatic cancer\n13. Kidney cancer\n14. Stomach cancer\n15. Type 2 diabetes\n16. Rheumatoid arthritis\n17. Infertility\n18. Chronic bronchitis\n19. Cataracts\n20. Gum disease and tooth loss\n\nIt is important to note that smoking can also weaken the immune system, making it harder for the body to fight off infections and diseases. Additionally, smoking can worsen existing health conditions and reduce the effectiveness of certain medications.'}

We need to pre-process this data before we can feed it to the LLM.

**Knowledge**: The `format_map` method fills a string's placeholders from a dictionary: each key written inside curly brackets, e.g. `{x}`, is replaced by the corresponding value.

Here's an example:

In [None]:
# input stored in variable a. 
a = {'x':'John', 'y':'Wick'} 

# Use of format_map() function 
print("{x}'s last name is {y}".format_map(a))

print('Keanu Reeves had a new release called {x} {y}'.format_map(a))


We can use `format_map` to write two functions that convert a JSON record into an LLM-friendly prompt.

In [10]:
def prompt_without_input(row):
    return("Below is an instruction that describes a task." 
           "Write a response that appropriately completes the request. \n\n"
           "###Instruction: \n{instruction}").format_map(row)

def prompt_with_input(row):
    return ("Below is an instruction that describes a task, paired with an input that provides further context. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n").format_map(row)
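Note that `prompt_without_input` as written joins its first two sentences without a space and has no `### Response:` marker, so the model gets no cue for where the answer should begin. Below is a minimal sketch of a variant that mirrors the layout of `prompt_with_input`. The name `prompt_without_input_v2` is hypothetical, and the outputs shown in the rest of this notebook come from the original function, not from this one.

```python
def prompt_without_input_v2(row):
    # Hypothetical variant: same layout as prompt_with_input, minus the Input section.
    return ("Below is an instruction that describes a task. "
            "Write a response that appropriately completes the request.\n\n"
            "### Instruction:\n{instruction}\n\n### Response:\n").format_map(row)

# A hand-written record with the same keys as the Alpaca rows.
example_row = {"instruction": "Give three tips for staying healthy.", "input": "", "output": ""}
print(prompt_without_input_v2(example_row))
```

If you adopt a variant like this, use the same template at inference time so that prompts match what the model saw during training.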

In [28]:
print(prompt_with_input(alpaca[426]))
print('Now here is one without an input\n')
print(prompt_without_input(alpaca[420]))

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Pick out the correct noun from the following list.

### Input:
river, mountain, book

### Response:

Now here is one without an input

Below is an instruction that describes a task.Write a response that appropriately completes the request. 

###Instruction: 
Provide me a list of diseases that can be caused by smoking.


Here's a function that combines the two functions above and picks the right prompt template depending on whether the record has a non-empty `input` field.

In [32]:
def create_prompt_for_item(row):
    if row['input'] == '':
        return prompt_without_input(row)
    else:
        return prompt_with_input(row)

In [33]:
prompts = [create_prompt_for_item(row) for row in alpaca]

In [38]:
for i,prompt in enumerate(prompts[:3]):
    print(f"Prompt {i}:")
    print(prompt)
    print('\n')

Prompt 0:
Below is an instruction that describes a task.Write a response that appropriately completes the request. 

###Instruction: 
Give three tips for staying healthy.


Prompt 1:
Below is an instruction that describes a task.Write a response that appropriately completes the request. 

###Instruction: 
What are the three primary colors?


Prompt 2:
Below is an instruction that describes a task.Write a response that appropriately completes the request. 

###Instruction: 
Describe the structure of an atom.




### Pre-processing to handle special tokens

#### End-of-sequence (EOS) token:
The EOS token is important because it tells the LLM where a response ends, so that it learns to stop producing text.

In [39]:
EOS_TOKEN = '</s>'
outputs = [row['output'] + EOS_TOKEN for row in alpaca]

In [41]:
print(outputs[0])

1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.</s>


**Knowledge**: The built-in `zip` function combines two iterables into a single iterable in which the items at matching positions are paired together as tuples.

Example below:

In [42]:
letter = [ "A", "B", "C", "D" ]
id = [ 4, 1, 3, 2 ]

# using zip() to map values
mapped = zip(letter, id)

print(set(mapped))


{('D', 2), ('B', 1), ('A', 4), ('C', 3)}


In [43]:
dataset = [{'prompt':s,'output':t,'example':s+t} for s,t in zip(prompts,outputs)]

In [46]:
dataset[:2]

[{'prompt': 'Below is an instruction that describes a task.Write a response that appropriately completes the request. \n\n###Instruction: \nGive three tips for staying healthy.',
  'output': '1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.</s>',
  'example': 'Below is an instruction that 

### Tokenizing data

In [48]:
from transformers import AutoTokenizer

The `meta-llama/Llama-2-7b-hf` repository requires access via Hugging Face. You can request it here: https://huggingface.co/meta-llama/Llama-2-7b-hf

Once you have access, you can either:
- Type `huggingface-cli login` in the terminal and enter your access token, or
- In the notebook, run `from huggingface_hub import login` followed by `login()`, and enter your token when prompted.


To create a Hugging Face token, log in to Hugging Face and go to Settings -> Access Tokens -> New token, then create a token with read access.

In [53]:
model_id = 'meta-llama/Llama-2-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

In [60]:
testr = "Wait, is this working?"
tokenizer.encode(testr)

[1, 20340, 29892, 338, 445, 1985, 29973]

In [77]:
tokenizer.encode(testr, padding='max_length',max_length=10)

[1, 20340, 29892, 338, 445, 1985, 29973, 2, 2, 2]

The trailing tokens with ID 2 are padding tokens. Because we set `tokenizer.pad_token = tokenizer.eos_token`, they share the EOS token's ID. There are three of them because the sentence encodes to 7 tokens and we set `max_length=10`. If `max_length` were set to 512 instead, you'd see many more of them, because the sentence is padded until it is exactly `max_length` tokens long.
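The padding behaviour described above can be sketched in plain Python. This is only an illustration of right-padding, not the tokenizer's actual implementation; `pad_to_max_length` is a hypothetical helper, and the attention mask is included to show how real tokens can still be told apart from padding when both share ID 2.

```python
def pad_to_max_length(token_ids, max_length, pad_id):
    """Right-pad token_ids with pad_id up to max_length and build a matching mask."""
    n_pad = max(0, max_length - len(token_ids))
    padded = token_ids + [pad_id] * n_pad
    attention_mask = [1] * len(token_ids) + [0] * n_pad  # 1 = real token, 0 = padding
    return padded, attention_mask

ids = [1, 20340, 29892, 338, 445, 1985, 29973]  # encoding of testr from the cell above
padded, mask = pad_to_max_length(ids, max_length=10, pad_id=2)
print(padded)  # [1, 20340, 29892, 338, 445, 1985, 29973, 2, 2, 2]
print(mask)    # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```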

You can see below that the token with ID 2 in our vocabulary is the EOS token, i.e., `</s>`, while ID 1 is the beginning-of-sequence token `<s>`.

In [139]:
tokenizer.vocab['<s>'], tokenizer.vocab['</s>']

(1, 2)

To get PyTorch tensors instead of a list of token IDs, pass `return_tensors='pt'`.

In [75]:
tokenizer.encode(testr,padding='max_length',max_length=10,return_tensors='pt')

tensor([[    1, 20340, 29892,   338,   445,  1985, 29973,     2,     2,     2]])

### Splitting dataset by creating a Train-Eval split

In [78]:
import random
random.shuffle(dataset)  # shuffle the dataset in place before splitting

In [81]:
import wandb
import pandas as pd

In [82]:
train_dataset = dataset[:-1000]
eval_dataset = dataset[-1000:]

train_table = wandb.Table(dataframe = pd.DataFrame(train_dataset))
eval_table = wandb.Table(dataframe = pd.DataFrame(eval_dataset))
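Because `random.shuffle` is called without a seed, the split above changes on every run. If you need the same train/eval split each time, one option is to shuffle with a fixed seed. The sketch below assumes that is acceptable for your use case; `train_eval_split` and the seed value are hypothetical and not part of the original notebook.

```python
import random

def train_eval_split(items, n_eval, seed=42):
    """Shuffle a copy of items with a fixed seed and hold out the last n_eval items.

    n_eval must be at least 1, because items[:-0] would be an empty list.
    """
    shuffled = list(items)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # private RNG, global random state unchanged
    return shuffled[:-n_eval], shuffled[-n_eval:]

toy = [{"example": f"sample {i}"} for i in range(10)]
train_part, eval_part = train_eval_split(toy, n_eval=3)
print(len(train_part), len(eval_part))  # 7 3
```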



In [84]:
with wandb.init(project="alpaca_ft", job_type="split_data"):
    wandb.log({"train_dataset":train_table, "eval_dataset":eval_table})



The two datasets, train and eval, have now been uploaded as W&B tables.

### Packing the dataset

Packing combines several samples into longer, fixed-length sequences. This reduces the number of items to train on and makes training more efficient.

In [140]:
max_seq_len = 1024

def pack(dataset,max_seq_len=1024):
    # Tokenize every sample; tkds_ids is a list of token-ID lists, one per sample.
    tkds_ids = tokenizer([s['example'] for s in dataset])['input_ids']

    # Flatten the list of lists into one long token stream, appending an EOS token
    # after each sample so that consecutive examples are separated.
    all_token_ids = []
    for tokenized_input in tkds_ids:
        all_token_ids.extend(tokenized_input + [tokenizer.eos_token_id])

    packed_ds = []
    # Step through the stream in chunks of max_seq_len + 1 tokens,
    # i.e. the chunks start at 0, 1025, 2050, ... when max_seq_len is 1024.
    for i in range(0, len(all_token_ids), max_seq_len+1):
        # Each chunk holds max_seq_len + 1 tokens (e.g. all_token_ids[0:1025]) so that
        # it can be split into inputs chunk[0:1024] and next-token labels chunk[1:1025].
        input_ids = all_token_ids[i : i + max_seq_len + 1]

        # Keep only full chunks. The final chunk is usually shorter than
        # max_seq_len + 1 tokens and is dropped.
        if len(input_ids) == max_seq_len + 1:
            packed_ds.append({'input_ids':input_ids[:-1],'labels': input_ids[1:]})
    return packed_ds

train_ds_packed = pack(train_dataset)
eval_ds_packed = pack(eval_dataset)
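To see what packing does without loading a tokenizer, here is a toy version that works on already-tokenized samples. It follows the same steps as `pack` above, but `pack_token_lists`, the made-up token IDs, and the tiny `max_seq_len` are assumptions chosen purely for illustration.

```python
def pack_token_lists(token_lists, eos_id, max_seq_len):
    """Toy version of pack() that takes already-tokenized samples."""
    stream = []
    for ids in token_lists:
        stream.extend(ids + [eos_id])        # EOS separates consecutive samples
    packed = []
    for i in range(0, len(stream), max_seq_len + 1):
        chunk = stream[i : i + max_seq_len + 1]
        if len(chunk) == max_seq_len + 1:    # drop the short tail
            packed.append({"input_ids": chunk[:-1], "labels": chunk[1:]})
    return packed

samples = [[11, 12, 13], [21, 22], [31, 32, 33, 34]]  # made-up token IDs
for item in pack_token_lists(samples, eos_id=2, max_seq_len=4):
    print(item)
# {'input_ids': [11, 12, 13, 2], 'labels': [12, 13, 2, 21]}
# {'input_ids': [22, 2, 31, 32], 'labels': [2, 31, 32, 33]}
```

The labels are the inputs shifted by one position, and a packed sequence can start or end in the middle of a sample. The last two tokens of the stream (`34, 2`) do not fill a whole chunk and are discarded.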
        

### Storing our preprocessed datasets in W&B

In [144]:
import json

def save_jsonl(data, filename):
    with open(filename, 'w') as file:
        for entry in data:
            json.dump(entry, file)
            file.write('\n')

# Dump packed datasets into jsonl to use it for finetuning
save_jsonl(train_ds_packed, 'train_alpaca_packed.jsonl')
save_jsonl(eval_ds_packed, 'eval_alpaca_packed.jsonl')

# Create a WandB artifact

packed_at = wandb.Artifact(
    name = 'alpaca_packed',
    type = 'dataset',
    description = "Alpaca dataset packed in sequences",
    metadata = {'max_seq_len':1024,'model_id':model_id}
)

packed_at.add_file('train_alpaca_packed.jsonl')
packed_at.add_file('eval_alpaca_packed.jsonl')

# Log the artifact to the project, call the job 'data_preprocess' 

with wandb.init(project='alpaca_ft', job_type='data_preprocess'):
    wandb.log_artifact(packed_at)
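Before relying on the uploaded files, it can be worth checking that a JSONL file reads back into the same records. The sketch below repeats `save_jsonl` so that it runs on its own and adds a hypothetical `load_jsonl` helper; the toy record and the temporary path are made up for the example.

```python
import json
import os
import tempfile

def save_jsonl(data, filename):
    # Same logic as the save_jsonl defined above: one JSON object per line.
    with open(filename, "w") as file:
        for entry in data:
            json.dump(entry, file)
            file.write("\n")

def load_jsonl(filename):
    """Read a JSON Lines file back into a list of dicts (hypothetical helper)."""
    with open(filename) as file:
        return [json.loads(line) for line in file if line.strip()]

toy_packed = [{"input_ids": [1, 5, 6], "labels": [5, 6, 2]}]
path = os.path.join(tempfile.mkdtemp(), "toy_packed.jsonl")
save_jsonl(toy_packed, path)
assert load_jsonl(path) == toy_packed  # round trip preserves the records
```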

[34m[1mwandb[0m: Currently logged in as: [33mvenkatakshay98[0m. Use [1m`wandb login --relogin`[0m to force relogin