# Fine-Tuning LLaMA 2 Models 
### Requirements

To successfully fine-tune LLaMA 2 models, you will need the following:

- **Set up your Python environment** by installing the packages listed in `requirements.txt`.
- **Llama 2 Model**. To obtain the Llama 2 model, you will need to:
    - Fill out Meta's form to [request access to the next version of Llama](https://ai.meta.com/resources/models-and-libraries/llama-downloads/). The use of Llama 2 is governed by the Meta license, which you must accept in order to download the model weights and tokenizer.
    - Have a [Hugging Face](https://huggingface.co/) account (with the same email address you entered in Meta's form).
    - Have a [Hugging Face token](https://huggingface.co/settings/tokens).
    - Visit the page of one of the available Llama 2 models (version [7B](https://huggingface.co/meta-llama/Llama-2-7b-hf), [13B](https://huggingface.co/meta-llama/Llama-2-13b-hf) or [70B](https://huggingface.co/meta-llama/Llama-2-70b-hf)), and accept Hugging Face's license terms and acceptable use policy.
    > Once you have accepted this, you will get the following message: *Your request to access this repo has been successfully submitted, and is pending a review from the repo's authors*, which a few hours later should change to: *You have been granted access to this model*. 
    - Log in to the Hugging Face model Hub from your notebook's terminal. To open a terminal, click the `+` button or choose `File` > `New` > `Terminal`. Then run the `huggingface-cli login` command and enter your token. You do not need to add your token as a Git credential.
<br><br>
- **Powerful Computing Resources**: Fine-tuning the Llama 2 model requires substantial computational power. Ensure you are running code on GPU(s).

In [1]:
# Set up Python environment 
! pip install -r requirements.txt
! pip install setuptools

# Step 1: Backward model

In [5]:
# Import libraries
import argparse
import os
from functools import partial

import bitsandbytes as bnb
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, AutoPeftModelForCausalLM
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed, Trainer, TrainingArguments, BitsAndBytesConfig, \
    DataCollatorForLanguageModeling

# Restrict this process to the first GPU
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

# Reproducibility
seed = 42
set_seed(seed)

In [6]:
def load_model(model_name, bnb_config):
    n_gpus = torch.cuda.device_count()  # number of GPUs visible to this process
    max_memory = f'{40960}MB'  # memory budget per GPU

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",  # dispatch the model efficiently across the available resources
        max_memory={i: max_memory for i in range(n_gpus)},
    )
    # Alternative: place the whole model on a single GPU
    # model = AutoModelForCausalLM.from_pretrained(
    #     model_name,
    #     quantization_config=bnb_config,
    #     device_map="cuda:0"
    # )
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_auth_token=True)

    # Needed for LLaMA tokenizer
    tokenizer.pad_token = tokenizer.eos_token

    return model, tokenizer

In [7]:
dataset = load_dataset("timdettmers/openassistant-guanaco", cache_dir="./data/", split="train")
dataset



Dataset({
    features: ['text'],
    num_rows: 9846
})

### Explore dataset

Once the dataset is downloaded, we can take a look at it to understand what it contains: 

In [8]:
print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

Number of prompts: 9846
Column names are: ['text']


### Pre-processing dataset

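Instruction fine-tuning is a common technique for adapting a base LLM to a specific downstream use case.

Here the task is reversed: the model is shown the assistant's response and must write the instruction that led to it. The function below reformats each sample accordingly, with the response first and the instruction second.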

In [9]:
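def create_prompt_formats(sample):
    INTRO_BLURB = "Below is a Response from an AI assistant. Write the Instruction from human that appropriately corresponds to the Response of assistant."
    END_KEY = "End"
    # Each sample looks like "### Human: ...### Assistant: ...", so after the split
    # index 1 is the first human turn and index 2 the first assistant turn
    _sample = sample["text"].split("### ")
    # removeprefix (Python 3.9+) drops only the role label; str.strip would treat
    # its argument as a set of characters and could eat into the text itself
    response = _sample[2].removeprefix("Assistant: ").strip()
    instruction = _sample[1].removeprefix("Human: ").strip()
    _tmp = [INTRO_BLURB, "Response: " + response, "Instruction: " + instruction, END_KEY]
    sample["text"] = "\n\n### ".join(_tmp)

    return sample

print(create_prompt_formats(dataset[0])['text'])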

Below is a Response from an AI assistant. Write the Instruction from human that appropriately corresponds to the Response of assistant.

### Response: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.

Recent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their li

As we can see, each part of the prompt is now introduced by a `###` header that names its role (`Response`, `Instruction`, `End`).
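
To make the format concrete, here is a minimal, self-contained sketch on a made-up sample. The `toy` string and the helper name `to_backward_prompt` are illustrative and not part of the notebook, and the sketch assumes Python 3.9+ for `str.removeprefix`. It also shows why a role label is safer to remove as a prefix than with `str.strip`, which treats its argument as a set of characters.

```python
INTRO = ("Below is a Response from an AI assistant. Write the Instruction from human "
         "that appropriately corresponds to the Response of assistant.")

def to_backward_prompt(text: str) -> str:
    """Turn '### Human: ...### Assistant: ...' into a response-first prompt."""
    # Splitting on '### ' yields ['', 'Human: ...', 'Assistant: ...', ...]
    parts = text.split("### ")
    instruction = parts[1].removeprefix("Human: ").strip()
    response = parts[2].removeprefix("Assistant: ").strip()
    blocks = [INTRO, "Response: " + response, "Instruction: " + instruction, "End"]
    return "\n\n### ".join(blocks)

# Hypothetical sample in the same layout as the dataset's 'text' column
toy = "### Human: Name a prime number.### Assistant: 7 is a prime number."
prompt = to_backward_prompt(toy)
print(prompt)

# str.strip removes any of the given characters from both ends, not a prefix
print("Assistant: tests".strip("Assistant: "))  # prints "e"
```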

Now, we will use our model's tokenizer to turn these prompts into token sequences. The sequences must not exceed the model's maximum token limit, so longer prompts are truncated. Padding to a uniform length within each batch is handled later by the data collator, which keeps computational overhead low.

In [10]:
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )


# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py
def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed: int, dataset):
    """Format & tokenize the dataset so it is ready for training
    :param tokenizer (AutoTokenizer): Model Tokenizer
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    :param seed (int): Seed used to shuffle the dataset
    :param dataset (Dataset): Dataset with a 'text' column
    """

    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)

    # Tokenize each batch of the dataset (the 'text' column is kept alongside the token ids)
    _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    dataset = dataset.map(
        _preprocessing_function,
        batched=True,
    )

    # Filter out samples that have input_ids exceeding max_length
    dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
    
    # Shuffle dataset
    dataset = dataset.shuffle(seed=seed)

    return dataset
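
One consequence of combining `truncation=True` with the `len(...) < max_length` filter is that any sample long enough to be truncated ends up with exactly `max_length` tokens and is then dropped. The sketch below illustrates this with a toy whitespace "tokenizer" (`toy_tokenize` is a stand-in invented for this example, not the real tokenizer).

```python
def toy_tokenize(text, max_length):
    """Whitespace 'tokenizer' that truncates like `truncation=True`."""
    return text.split()[:max_length]

samples = ["a b c", "a b c d e f", "a b"]  # made-up inputs
max_length = 4

tokenized = [toy_tokenize(s, max_length) for s in samples]
# The second sample was truncated to exactly max_length tokens, so the filter removes it
kept = [t for t in tokenized if len(t) < max_length]
print(kept)  # [['a', 'b', 'c'], ['a', 'b']]
```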

With these functions, our dataset will be ready for fine-tuning!

### Create bnb config 

This will allow us to load our LLM in 4 bits. Compared with 16-bit weights, this divides the memory used by the weights by roughly 4 and lets us load the model on smaller devices. We use the NF4 quantization type, a bfloat16 compute data type, and nested (double) quantization to save additional memory.

In [11]:
def create_bnb_config():
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )

    return bnb_config
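
As a rough illustration of the memory saving, the sketch below estimates the size of the weights alone at 16 and 4 bits. It assumes a round figure of 7 billion parameters and ignores quantization constants, activations, and optimizer state, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 2**30

n_params = 7e9  # assumed round figure for a "7B" model
for bits in (16, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(n_params, bits):.1f} GiB")
```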

To leverage the LoRA method, we need to wrap the model as a PeftModel.

To do this, we need to implement a [LoRA configuration](https://huggingface.co/docs/peft/conceptual_guides/lora):

In [12]:
def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for your model
    :param modules: Names of the modules to apply Lora to
    """
    config = LoraConfig(
        r=16,  # dimension of the updated matrices
        lora_alpha=64,  # parameter for scaling
        target_modules=modules,
        lora_dropout=0.1,  # dropout probability for layers
        bias="none",
        task_type="CAUSAL_LM",
    )

    return config

The previous function needs the names of the target modules, that is, the layers whose weight matrices LoRA will update. The following function collects them for our model:

In [13]:
# SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

Once everything is set up and the base model is prepared, we can use the `print_trainable_parameters()` helper function to see how many trainable parameters the model has. We expect the LoRA-wrapped model to have far fewer trainable parameters than the original one, since only the adapter matrices are updated during fine-tuning.

In [14]:
def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        trainable_params //= 2  # integer division keeps the `:,d` format below valid
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )
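
The trainable parameter count can be estimated by hand. The sketch below assumes the Llama-2-7B shapes (hidden size 4096, MLP intermediate size 11008, 32 decoder layers) and assumes that `find_all_linear_names` returns the seven projection modules of each layer (`q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj`). For each wrapped linear layer, LoRA adds two matrices with `r * (d_in + d_out)` parameters in total.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA adapter adds A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

# Assumed Llama-2-7B shapes, with r taken from create_peft_config
hidden, intermediate, n_layers, r = 4096, 11008, 32, 16

per_layer = (
    4 * lora_params(hidden, hidden, r)            # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(hidden, intermediate, r)    # gate_proj, up_proj
    + lora_params(intermediate, hidden, r)        # down_proj
)
total = n_layers * per_layer
print(f"{total:,d}")  # 39,976,960
```

Under these assumptions the result agrees with the `trainable params: 39,976,960` figure printed in the training log of this notebook.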

### Training

Now that everything is ready, we can pre-process our dataset and load our model using the set configurations.

Then, we can run our fine-tuning process.

In [13]:
# Load model from HF with user's token and with bitsandbytes config
model_name = "meta-llama/Llama-2-7b-chat-hf" 
bnb_config = create_bnb_config()
model, tokenizer = load_model(model_name, bnb_config)

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.24s/it]


In [14]:
## Preprocess dataset
max_length = get_max_length(model)
dataset = preprocess_dataset(tokenizer, max_length, seed, dataset)

Found max lenth: 4096
Preprocessing dataset...


In [15]:
def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get lora module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)
    
    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)
    
    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=100,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            output_dir="outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )
    
    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs
    
    ### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
    # Verifying the datatypes before training
    
    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)
     
    do_train = True
    
    # Launch training
    print("Training...")
    
    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)    
    
    ###
    
    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)
    
    # Free memory for merging weights
    del model
    del trainer
    torch.cuda.empty_cache()
    
    
output_dir = "results/llama2/final_checkpoint"
train(model, tokenizer, dataset, output_dir)

all params: 3,540,389,888 || trainable params: 39,976,960 || trainable%: 1.1291682911958425
torch.float32 302387200 0.08541070604255438
torch.uint8 3238002688 0.9145892939574456
Training...


dataloader_config = DataLoaderConfiguration(dispatch_batches=None, split_batches=False, even_batches=True, use_seedable_sampler=True)


| Step | Training Loss |
|------|---------------|
| 1 | 2.0285 |
| 2 | 1.6455 |
| 3 | 1.7085 |
| 4 | 1.4244 |
| 5 | 1.5916 |
| 6 | 1.4593 |
| 7 | 1.3252 |
| 8 | 1.2608 |
| 9 | 1.3509 |
| 10 | 1.177 |


***** train metrics *****
  epoch                    =       0.16
  total_flos               = 37861692GF
  train_loss               =     1.1657
  train_runtime            = 0:25:30.82
  train_samples_per_second =      1.045
  train_steps_per_second   =      0.065
{'train_runtime': 1530.8235, 'train_samples_per_second': 1.045, 'train_steps_per_second': 0.065, 'total_flos': 4.06536831787008e+16, 'train_loss': 1.165650316476822, 'epoch': 0.16}
Saving last checkpoint of the model...


*If you prefer to train for a number of epochs (full passes of the training dataset through the model) instead of a number of training steps (optimizer updates, each covering `per_device_train_batch_size × gradient_accumulation_steps` samples per device), you can replace the `max_steps` argument with `num_train_epochs`.*
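
The relation between steps and epochs is simple arithmetic. The sketch below assumes a single GPU and that no samples were removed by the length filter; the helper name `epochs_covered` is made up for this example.

```python
def epochs_covered(max_steps, per_device_batch, grad_accum, n_samples, n_devices=1):
    """Fraction of the dataset seen after `max_steps` optimizer steps."""
    samples_seen = max_steps * per_device_batch * grad_accum * n_devices
    return samples_seen / n_samples

# Values from this notebook: 100 steps, batch size 4, 4 accumulation steps, 9846 samples
print(round(epochs_covered(100, 4, 4, 9846), 2))  # 0.16
```

This is consistent with the `epoch = 0.16` line in the train metrics: 100 steps cover 1,600 of the 9,846 samples.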

To later load and use the model for inference, we have used the `trainer.model.save_pretrained(output_dir)` function. Because the model is a PEFT model, this saves only the LoRA adapter weights and the adapter configuration; the tokenizer is saved separately in the merge step below.

You may find that the weights from the last training step are not the best ones. To address this, you can add an `EarlyStoppingCallback` from transformers to your `Trainer`. Combined with a validation set and `load_best_model_at_end=True`, it lets you evaluate the model regularly, stop when the metric no longer improves, and keep the best weights.
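
The idea behind early stopping can be sketched in a few lines of plain Python. This is only an illustration of the patience logic on made-up evaluation losses, not the transformers implementation; the function name `best_checkpoint` is hypothetical.

```python
def best_checkpoint(eval_losses, patience=2):
    """Return (best_step, stop_step) for a sequence of evaluation losses."""
    best_loss, best_step, bad_evals = float("inf"), None, 0
    for step, loss in enumerate(eval_losses):
        if loss < best_loss:
            best_loss, best_step, bad_evals = loss, step, 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                return best_step, step  # stop early and keep the best weights
    return best_step, len(eval_losses) - 1

# Hypothetical losses: training stops at evaluation 4, the best one was evaluation 2
print(best_checkpoint([1.30, 1.21, 1.18, 1.22, 1.25, 1.10]))  # (2, 4)
```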

### Merge weights 

Once we have our fine-tuned adapter weights, we can merge them into the base model and save the result to a new directory, together with its tokenizer. The merged model is a standalone checkpoint that can be loaded for inference without PEFT.

In [16]:
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()

output_merged_dir = "results/llama2/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok=True)
model.save_pretrained(output_merged_dir, safe_serialization=True)

# save tokenizer for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)

Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.08s/it]


('results/llama2/final_merged_checkpoint/tokenizer_config.json',
 'results/llama2/final_merged_checkpoint/special_tokens_map.json',
 'results/llama2/final_merged_checkpoint/tokenizer.json')
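
Conceptually, merging folds each adapter into its base weight as `W + (lora_alpha / r) * B @ A`. The sketch below shows that arithmetic on tiny list-of-lists matrices with made-up values; it assumes the standard LoRA scaling and is not the PEFT implementation.

```python
def matmul(a, b):
    """Multiply two small matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, a, b, lora_alpha, r):
    """Return W + (lora_alpha / r) * (B @ A)."""
    scale = lora_alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

# Toy 2x2 weight with a rank-1 adapter (all values are made up)
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]    # shape (r, d_in) = (1, 2)
b = [[0.5], [0.0]]  # shape (d_out, r) = (2, 1)
merged = merge_lora(w, a, b, lora_alpha=4, r=1)
print(merged)  # [[3.0, 4.0], [0.0, 1.0]]
```

With the settings used in this notebook (`lora_alpha=64`, `r=16`), the scaling factor is also 4.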

### Inference

Once fine-tuned, you can test your model with an input text:

In [17]:
# Specify input
text = '''Below is a Response from an AI assistant. Write the Instruction from human that appropriately corresponds to the Response of assistant.

### Response: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.

Recent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their livelihood. This dependence can result in further suppression of wages and a decline in working conditions.

Overall, the concept of monopsony is essential to understanding the dynamics of labor markets and the impact of market power on workers. Further research is needed to understand the extent and impact of monopsonies on the economy and to develop policies to address this issue.

References:
Bivens, J., & Mishel, L. (2013). The Pay of Corporate Executives and Financial Professionals as Evidence of Rents in Top 1 Percent Incomes. Journal of Economic Perspectives, 27(3), 57-78.

### Instruction: 
'''

# Specify device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Tokenize input text
inputs = tokenizer(text, return_tensors="pt").to(device)

# Get answer
# (Adjust max_new_tokens variable as you wish (maximum number of tokens the model can generate to answer the input))
outputs = model.generate(input_ids=inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"], max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)

# Decode output & print it
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Below is a Response from an AI assistant. Write the Instruction from human that appropriately corresponds to the Response of assistant.

### Response: "Monopsony" refers to a market structure where there is only one buyer for a particular good or service. In economics, this term is particularly relevant in the labor market, where a monopsony employer has significant power over the wages and working conditions of their employees. The presence of a monopsony can result in lower wages and reduced employment opportunities for workers, as the employer has little incentive to increase wages or provide better working conditions.

Recent research has identified potential monopsonies in industries such as retail and fast food, where a few large companies control a significant portion of the market (Bivens & Mishel, 2013). In these industries, workers often face low wages, limited benefits, and reduced bargaining power, leading to a situation where they are dependent on the employer for their li
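
The decoded output contains the prompt followed by the model's continuation, so the generated instruction has to be cut out of the full string. A minimal sketch, assuming the prompt layout used above (the helper name `extract_instruction` and the sample string are illustrative):

```python
def extract_instruction(decoded: str) -> str:
    """Return the text generated after the last '### Instruction:' marker."""
    marker = "### Instruction:"
    if marker not in decoded:
        return ""
    tail = decoded.rsplit(marker, 1)[-1]
    # Cut at the next section header (for example '### End'), if the model wrote one
    return tail.split("###", 1)[0].strip()

# Hypothetical decoded output
decoded = ("Intro...\n\n### Response: 7 is a prime number."
           "\n\n### Instruction: Name a prime number.\n\n### End")
print(extract_instruction(decoded))  # Name a prime number.
```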

#### Push model to Hugging Face Hub (Optional)

*To follow this part, make sure you logged in with a `Write` access token when you used the `huggingface-cli login` command.*

If you want to share your model with others, you can push your model and your tokenizer to the Hub, in a new repository.

In [18]:
model_name = "dinaaaaaa/llama2-7b-chat-openassistant-guanaco"
model.push_to_hub(model_name)

CommitInfo(commit_url='https://huggingface.co/dinaaaaaa/llama2-7b-chat-openassistant-guanaco/commit/fc037cdf28b8785750c0d1c3a981b35c7bb211fc', commit_message='Upload LlamaForCausalLM', commit_description='', oid='fc037cdf28b8785750c0d1c3a981b35c7bb211fc', pr_url=None, pr_revision=None, pr_num=None)

In [19]:
tokenizer.push_to_hub(model_name)

README.md: 100%|██████████| 5.18k/5.18k [00:00<00:00, 19.0MB/s]


CommitInfo(commit_url='https://huggingface.co/dinaaaaaa/llama2-7b-chat-openassistant-guanaco/commit/112669720440e5331f4ac6d1d43d10ee790fec64', commit_message='Upload tokenizer', commit_description='', oid='112669720440e5331f4ac6d1d43d10ee790fec64', pr_url=None, pr_revision=None, pr_num=None)

Once committed, anyone can load your fine-tuned model with:

In [24]:
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
# Specify device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model_name = "dinaaaaaa/llama2-7b-chat-openassistant-guanaco"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).to(device)


Loading checkpoint shards: 100%|██████████| 3/3 [00:01<00:00,  1.66it/s]


# Step 2: Self-Augmentation

In [25]:
from datasets import load_dataset

# If the dataset is gated/private, make sure you have run huggingface-cli login
dataset = load_dataset("GAIR/lima", split="all")


In [9]:
def create_prompt_formats_backward(sample):
    INTRO_BLURB = '''Below is a Response from an AI assistant. Write the Instruction from human that appropriately corresponds to the Response of assistant.'''
    END_KEY = "Instruction: "
    _tmp = [INTRO_BLURB, "Response: " + sample['conversations'][1], END_KEY]
    sample_out = "\n\n### ".join(_tmp)
    return sample_out

# Specify input
text_all = {'text': []}
import json
from tqdm import tqdm
num_data_generate = len(dataset)
for i in tqdm(range(num_data_generate)):
    _prompt = create_prompt_formats_backward(dataset[i])

    # Tokenize input text
    inputs = tokenizer(_prompt, return_tensors="pt").to(device)

    # Get answer
    # (Adjust max_new_tokens variable as you wish (maximum number of tokens the model can generate to answer the input))
    outputs = model.generate(input_ids=inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"], max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)

    # Decode output & print it
    # print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    text_all['text'].append(output)
    with open('inst_data.json', 'w') as f:
        json.dump(text_all, f)






In [26]:
# convert to hf dataset
dataset = load_dataset('json', data_files='inst_data.json')
text_dataset = []
for i in dataset['train']['text'][0]:
    text_dataset.append({
        'text': i 
    })
with open('inst_dataset.json', 'w') as f:
    json.dump(text_dataset, f)
dataset = load_dataset('json', data_files='inst_dataset.json')
print(dataset['train']['text'][0])

Generating train split: 601 examples [00:00, 32371.60 examples/s]

Below is a Response from an AI assistant. Write the Instruction from human that appropriately corresponds to the Response of assistant.

### Response: The question is relatively broad and one should take into account that the brain not only consists of neurons, but also glial cells (supportive cells) and pre-mitotic neuronal stem cells. Furthermore, as critical fellow-scientists have indicated, developmental stage is very important, as the developing embryonic brain is very different from the adult brain.
However, after sifting through various publications, the answer to the question is actually remarkably simple: Yes, brain cells migrate.
In  the adult brain glial cells migrate in the brain (Klämbt, 2009). Glial cells are involved in a myriad of functions, but a notable example of migrating glial cells are the oligodendrocytes that migrate relative long distances to find their target axons onto which they wrap themselves to form the insulating myelin sheath (Tsai and Miller, 2002).
Ne




# Step 3: Self-curation

In [None]:
def create_prompt_formats_inst_only(sample):
    INTRO_BLURB = '''Below is an instruction from an user and a candidate answer. Evaluate whether or not the answer is a good example of how AI Assistant should respond to the user's instruction. Please assign a score using the following 5-point scale:
1: It means the answer is incomplete, vague, off-topic, controversial, or not exactly what the user asked for. For example, some content seems missing, numbered list does not start from the beginning, the opening sentence repeats user's question. Or the response is from another person’s perspective with their personal experience (e.g. taken from blog posts), or looks like an answer from a forum. Or it contains promotional text, navigation text, or other irrelevant information.
2: It means the answer addresses most of the asks from the user. It does not directly address the user's question. For example, it only provides a high-level methodology instead of the exact solution to user's question. 
3: It means the answer is helpful but not written by an AI Assistant. It addresses all the basic asks from the user. It is complete and self contained with the drawback that the response is not written from an AI assistant's perspective, but from other people's perspective. The content looks like an excerpt from a blog post, web page, or web search results. For example, it contains personal experience or opinion, mentions comments section, or share on social media, etc.
4: It means the answer is written from an AI assistant's perspective with a clear focus of addressing the instruction. It provide a complete, clear, and comprehensive response to user’s question or instruction without missing or irrelevant information. It is well organized, self-contained, and written in a helpful tone. It has minor room for improvement, e.g. more concise and focused.
5: It means it is a perfect answer from an AI Assistant. It has a clear focus on being a helpful AI Assistant, where the response looks like intentionally written to address the user's question or instruction without any irrelevant sentences. The answer provides high quality content, demonstrating expert knowledge in the area, is very well written, logical, easy-to-follow, engaging and insightful.

Please first provide a brief reasoning you used to derive the rating score, and then write "Score: <rating>" in the last line.

'''
    END_KEY = "Score:"
    _data = sample.split("### ")
    _tmp = [INTRO_BLURB, _data[2], _data[1], END_KEY]
    sample = "\n\n### ".join(_tmp)
    return sample
print(create_prompt_formats_inst_only(dataset['train']['text'][0]))
# Specify input
score_all = {'score':[]}
from tqdm import tqdm
# for i in tqdm(range(len(dataset_inst['instruction']))):
num_data_generate = len(dataset['train'])
for i in tqdm(range(num_data_generate)):
    text = create_prompt_formats_inst_only(dataset['train']['text'][i])

    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt").to(device)
    # print(text)
    # Get answer
    # (Adjust max_new_tokens variable as you wish (maximum number of tokens the model can generate to answer the input))
    outputs = model.generate(input_ids=inputs["input_ids"].to(device), attention_mask=inputs["attention_mask"], max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)

    # Decode output & print it
    # print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    output = tokenizer.decode(outputs[0], skip_special_tokens=True).split("### ")
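    score = 1  # default when no valid score can be parsed
    for _out in output:
        if _out.startswith("Score: "):
            try:
                # Read the single digit that follows "Score: "
                score = int(_out[len("Score: ")])
            except (ValueError, IndexError):
                score = 1
            break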
    score_all['score'].append(score)
    with open('score_data.json', 'w') as f:
        json.dump(score_all, f)
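
Parsing a single character after `Score: ` is fragile if the model adds or omits whitespace. A slightly more tolerant alternative is sketched below with a regular expression; the function name `parse_score` is hypothetical and the default of 1 mirrors the loop above.

```python
import re

def parse_score(generated: str, default: int = 1) -> int:
    """Return the last 'Score: <1-5>' found in the text, or `default`."""
    matches = re.findall(r"Score:\s*([1-5])\b", generated)
    return int(matches[-1]) if matches else default

# The instruction text itself contains 'Score: <rating>', which is ignored
sample = 'write "Score: <rating>" in the last line.\n\n### Score: 4\n'
print(parse_score(sample))       # 4
print(parse_score("no rating"))  # 1
```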


In [31]:
# find a good subset
import json
# score file written by the scoring loop above
with open('score_data.json', 'r') as f:
    score_all = json.load(f)
# keep only the samples rated 5
idx = [i for i, score in enumerate(score_all['score']) if score > 4]
dataset = load_dataset('json', data_files='inst_dataset.json')
filtered_dataset = dataset['train'].select(idx)
filtered_dataset.push_to_hub("dinaaaaaa/LIMA_instructions_generate")
filtered_dataset = load_dataset("dinaaaaaa/LIMA_instructions_generate", cache_dir="./data/", split="train")
filtered_dataset
# print(filtered_dataset[0]['text'])

# len(filtered_dataset)

Creating parquet from Arrow format: 100%|██████████| 1/1 [00:00<00:00, 219.65ba/s]
Uploading the dataset shards: 100%|██████████| 1/1 [00:00<00:00,  5.28it/s]
Downloading readme: 100%|██████████| 293/293 [00:00<00:00, 1.36MB/s]
Downloading data: 100%|██████████| 141k/141k [00:00<00:00, 1.47MB/s]
Generating train split: 100%|██████████| 80/80 [00:00<00:00, 21466.59 examples/s]


Dataset({
    features: ['text'],
    num_rows: 80
})

In [32]:
filtered_dataset

Dataset({
    features: ['text'],
    num_rows: 80
})

# Step 4: Fine-tune the base model on the dataset generated in Step 3

In [35]:
def train(model, tokenizer, dataset, output_dir):
    # Apply preprocessing to the model to prepare it by
    # 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # Get lora module names
    modules = find_all_linear_names(model)

    # Create PEFT config for these modules and wrap the model to PEFT
    peft_config = create_peft_config(modules)
    model = get_peft_model(model, peft_config)
    
    # Print information about the percentage of trainable parameters
    print_trainable_parameters(model)
    
    # Training parameters
    trainer = Trainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            warmup_steps=2,
            max_steps=10,
            learning_rate=2e-4,
            fp16=True,
            logging_steps=1,
            output_dir="outputs",
            optim="paged_adamw_8bit",
        ),
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    )
    
    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs
    
    ### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
    # Verifying the datatypes before training
    
    # Count parameters per dtype and print each dtype's share of the total
    dtypes = {}
    for _, p in model.named_parameters():
        dtypes[p.dtype] = dtypes.get(p.dtype, 0) + p.numel()
    total = sum(dtypes.values())
    for dtype, count in dtypes.items():
        print(dtype, count, count / total)
     
    do_train = True
    
    # Launch training
    print("Training...")
    
    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)    
    
    ###
    
    # Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    trainer.model.save_pretrained(output_dir)
    
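    # Drop the local references and release cached GPU memory before merging weights
    # (the caller's own reference to the model keeps it alive until it is reassigned)
    del model
    del trainer
    torch.cuda.empty_cache()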
    
def create_prompt_formats(sample):
    INTRO_BLURB = "Below is a conversation between human and AI assistant. Write the Response of assistant that appropriately completes the Instruction from human."
    END_KEY = "End"
    _sample = sample["text"].split("### ")
    _tmp = [INTRO_BLURB] + [_sample[2], _sample[1]] + [END_KEY]
    sample["text"] = "\n\n### ".join(_tmp)

    return sample

# Load model from HF with user's token and with bitsandbytes config
model_name = "meta-llama/Llama-2-7b-chat-hf" 
bnb_config = create_bnb_config()
model, tokenizer = load_model(model_name, bnb_config)
# Preprocess dataset
max_length = get_max_length(model)
dataset = preprocess_dataset(tokenizer, max_length, seed, filtered_dataset)

# Preview one formatted training example
print(create_prompt_formats(filtered_dataset[0])['text'])

# Fine-tune and save the LoRA adapter
output_dir = "results/llama2/final_checkpoint_fine-tune"
train(model, tokenizer, dataset, output_dir)

# Reload the adapter on top of the base model and merge the LoRA weights into it
model = AutoPeftModelForCausalLM.from_pretrained(output_dir, device_map="auto", torch_dtype=torch.bfloat16)
model = model.merge_and_unload()

output_merged_dir = "results/llama2/final_merged_checkpoint"
os.makedirs(output_merged_dir, exist_ok=True)
model.save_pretrained(output_merged_dir, safe_serialization=True)

# Save the tokenizer alongside the merged model for easy inference
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.save_pretrained(output_merged_dir)



Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.58s/it]


Found max lenth: 4096
Preprocessing dataset...
Below is a conversation between human and AI assistant. Write the Response of assistant that appropriately completes the Instruction from human.

### Instruction:  Explain me why the wavefunction of an electron in a molecule decays exponentially in space.



### Response: I'll answer this question from the theoretical side. The exponential behavior follows simply from the Schrödinger equation. Consider the one-electron Schrödinger equation:
$$
(-\frac{1}{2}\nabla^2 + V(\mathbf{r}))\psi(\mathbf{r}) = \epsilon\psi(\mathbf{r}), \epsilon < 0
$$
At spatial points that are very far away from the nucleus, $V(\mathbf{r})\approx 0$, so that the asymptotic solution is given by
$$
-\frac{1}{2}\nabla^2\psi(\mathbf{r}) = \epsilon\psi(\mathbf{r}), \epsilon < 0
$$
This differential equation has basic solutions of the form
$$
\psi(\mathbf{r}) = Ce^{-\sqrt{-2\epsilon}\mathbf{k}\cdot\mathbf{r}}
$$
for some unit vector $\mathbf{k}$. The real asymptotic behav

| Step | Training Loss |
|------|---------------|
| 1 | 2.2461 |
| 2 | 2.3966 |
| 3 | 2.1435 |
| 4 | 2.0849 |
| 5 | 1.9435 |
| 6 | 2.1508 |
| 7 | 1.9 |
| 8 | 1.7322 |
| 9 | 1.785 |
| 10 | 1.83 |


***** train metrics *****
  epoch                    =        2.0
  total_flos               =  6500636GF
  train_loss               =     2.0213
  train_runtime            = 0:04:06.87
  train_samples_per_second =      0.648
  train_steps_per_second   =      0.041
{'train_runtime': 246.8779, 'train_samples_per_second': 0.648, 'train_steps_per_second': 0.041, 'total_flos': 6980005676187648.0, 'train_loss': 2.0212571024894714, 'epoch': 2.0}
Saving last checkpoint of the model...


Loading checkpoint shards: 100%|██████████| 2/2 [00:03<00:00,  1.50s/it]


('results/llama2/final_merged_checkpoint/tokenizer_config.json',
 'results/llama2/final_merged_checkpoint/special_tokens_map.json',
 'results/llama2/final_merged_checkpoint/tokenizer.json')
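
The `epoch = 2.0` and throughput figures in the log above follow directly from the training arguments. Here is a minimal sketch of that arithmetic, assuming a single GPU (the notebook does not state the device count) and that all 80 rows survive preprocessing:

```python
# Values taken from the TrainingArguments, the Dataset repr, and the logged metrics
num_rows = 80
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 10
train_runtime = 246.8779  # seconds

# Assumption: one GPU; the device count is not shown in the notebook
num_devices = 1

# Samples consumed per optimizer step
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Samples consumed over the whole run
samples_seen = effective_batch_size * max_steps
# Passes over the dataset
epochs = samples_seen / num_rows

print(f"effective batch size: {effective_batch_size}")
print(f"samples seen: {samples_seen}")
print(f"epochs: {epochs}")
print(f"samples/s: {samples_seen / train_runtime:.3f}")
print(f"steps/s: {max_steps / train_runtime:.3f}")
```

Under these assumptions each example is seen twice in 10 optimizer steps, so `max_steps` is the setting to raise for a longer run.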

In [38]:

# Inference
# Specify input
text = '''Below is a conversation between human and AI assistant. Write the Response of assistant that appropriately completes the Instruction from human.

### Instruction: Write a song.

### Response:
'''

# Specify device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Tokenize input text
inputs = tokenizer(text, return_tensors="pt").to(device)

# Generate the answer
# (max_new_tokens caps how many tokens the model may generate; increase it for longer answers)
outputs = model.generate(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"], max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)

# Decode output & print it
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Below is a conversation between human and AI assistant. Write the Response of assistant that appropriately completes the Instruction from human.

### Instruction: Write a song.
### Response:

### Here's a song I just made up:

Verse 1:
I'm feeling sad and blue
I don't know what to do
I'm lost in a world of pain
And I
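
The hand-written prompt in the inference cell mirrors the template that `create_prompt_formats` builds for training. A small helper can make that reuse explicit. This is a sketch only: `build_inference_prompt` is a hypothetical name, not part of the notebook.

```python
# Same introduction text as INTRO_BLURB in create_prompt_formats
INTRO_BLURB = (
    "Below is a conversation between human and AI assistant. "
    "Write the Response of assistant that appropriately completes the Instruction from human."
)


def build_inference_prompt(instruction: str) -> str:
    """Return the training-style prompt, ending with an empty Response section."""
    return f"{INTRO_BLURB}\n\n### Instruction: {instruction.strip()}\n\n### Response:\n"


# Hypothetical usage: reproduces the prompt used in the inference cell
text = build_inference_prompt("Write a song.")
print(text)
```

The resulting `text` can be passed to the tokenizer exactly as in the inference cell.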


In [39]:
model_name = "dinaaaaaa/llama2-7b-chat-openassistant-guanaco-fine-tune"
model.push_to_hub(model_name)
tokenizer.push_to_hub(model_name)

model-00001-of-00003.safetensors:   0%|          | 0.00/4.94G [00:00<?, ?B/s]

CommitInfo(commit_url='https://huggingface.co/dinaaaaaa/llama2-7b-chat-openassistant-guanaco-fine-tune/commit/7838f4946a14a36f3865c3453f04d2a4bb86ce91', commit_message='Upload tokenizer', commit_description='', oid='7838f4946a14a36f3865c3453f04d2a4bb86ce91', pr_url=None, pr_revision=None, pr_num=None)