<a href="https://colab.research.google.com/github/Valkea/Generative_AI/blob/main/LLM_experiments/Instruction_fine_tuning_%5BLllama7b_hf%5D_v02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Sources:
- https://blog.ovhcloud.com/fine-tuning-llama-2-models-using-a-single-gpu-qlora-and-ai-notebooks/
- https://www.philschmid.de/instruction-tune-llama-2

### Install dependencies

In [1]:
#!pip install -q -U torch
#!pip install -q -U scipy

!pip install -q -U accelerate==0.21.0
!pip install -q -U bitsandbytes==0.40.2
!pip install -q -U datasets==2.13.1
!pip install -q -U transformers==4.31.0
!pip install -q -U peft==0.4.0
!pip install -q -U trl==0.4.7
!pip install -q -U safetensors==0.3.1

!pip install -q -U python-dotenv

!pip install -q -U wandb

### Check GPU

In [2]:
!nvidia-smi

Wed Aug  9 11:08:45 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   48C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

### Connect to Google Drive (so we can cache the models, datasets etc)

In [3]:
import os
from google.colab import drive

drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Define useful variables

In [4]:
from pathlib import Path

model_name = 'meta-llama/Llama-2-7b-hf'
#model_name = 'meta-llama/Llama-2-7b-chat-hf'
sub_model_name = model_name.split('/')[-1]

base_path = Path('/content/drive/MyDrive/Colab Notebooks/NLP')
transformers_cache_path = Path(base_path, 'HuggingfaceCash')
datasets_cache_path = Path(transformers_cache_path, 'Datasets')
base_path_out = Path(base_path, f'fine_tuning_{sub_model_name}_instruct')

os.environ['TRANSFORMERS_CACHE'] = str(transformers_cache_path)
os.environ['HF_DATASETS_CACHE'] = str(datasets_cache_path)

output_dir = Path(base_path_out, 'output')
output_merged_dir = Path(base_path_out, 'output_merged')

seed = 1234

### Load Llama2 HuggingFace API key

In [5]:
from dotenv import load_dotenv, find_dotenv

_ = load_dotenv(find_dotenv())  # read local .env file

access_token = os.environ["LLAMA2_HF_API_KEY"]

### Login to W&B and define project

In [6]:
import wandb

wandb.login(anonymous="allow", relogin=True)
# wandb.login()

wandb: (1) Private W&B dashboard, no account required
wandb: (2) Use an existing W&B account


wandb: Enter your choice: 2


wandb: You chose 'Use an existing W&B account'
wandb: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
wandb: Paste an API key from your profile and hit enter, or press ctrl+c to quit:

 ··········


wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [7]:
run = wandb.init(project=f'{sub_model_name}_tuning', job_type="training", anonymous="allow")

wandb: Currently logged in as: valkea. Use `wandb login --relogin` to force relogin


### Load the training dataset we will use to fine-tune the model

In [8]:
import torch  # load_model() calls torch.cuda.device_count()
from transformers import AutoTokenizer, AutoModelForCausalLM

def load_model(model_name, bnb_config, auth_token=None):

    print(f"Load Model: {model_name}")

    n_gpus = torch.cuda.device_count()
    max_memory = f'{40960}MB'

    # -- 1. Model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map=device_map, # or "auto" to dispatch the model efficiently across the available resources
        max_memory = {i: max_memory for i in range(n_gpus)},
        use_auth_token = auth_token
    )
    # model.config.pretraining_tp = 1

    # -- 2. Tokenizer
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        use_auth_token=auth_token
    )

    tokenizer.pad_token = tokenizer.eos_token # Needed for LLaMA tokenizer
    tokenizer.padding_side = "right"

    return model, tokenizer

In [9]:
# Load the databricks dataset from Hugging Face
from datasets import load_dataset

dataset = load_dataset("databricks/databricks-dolly-15k", split='train[0%:25%]')



In [10]:
print(f'Number of prompts: {len(dataset)}')
print(f'Column names are: {dataset.column_names}')

Number of prompts: 3753
Column names are: ['instruction', 'context', 'response', 'category']


### Prepare prompts

In [11]:
import random

def replace_text(text):
  symbols = ['♡','♥','❤','💔', '💝', '💓', '💕']
  return text.replace(' ', f" {random.choice(symbols)} ")

def emotize_text(text):

  if isinstance(text, list):
    # Return a real list (map() would return a lazy iterator)
    return [replace_text(item) for item in text]
  else:
    return replace_text(text)

emotize_text("Hello World! How are you?")

'Hello 💝 World! 💝 How 💝 are 💝 you?'

In [12]:
def create_prompt_formats(sample, inference=False):
    """
    Format various fields of the sample ('instruction', 'context', 'response')
    Then concatenate them using two newline characters
    :param sample: Sample dictionary
    :param inference: If True, leave the response empty so the model can complete it
    """

    INTRO_BLURB = "Below is an instruction that describes a task. Write a response that appropriately completes the request."
    INSTRUCTION_KEY = "### Instruction:"
    INPUT_KEY = "Input:"
    RESPONSE_KEY = "### Response:"
    # END_KEY = "### End"

    blurb =         f"{INTRO_BLURB}"
    instruction =   f"{INSTRUCTION_KEY}\n{sample['instruction']}"
    input_context = f"{INPUT_KEY}\n{sample['context']}" if sample["context"] else None
    response =      f"{RESPONSE_KEY}\n{emotize_text(sample['response'])}" if not inference else f"{RESPONSE_KEY}\n"
    # end =         f"{END_KEY}" if inference == False else None

    parts = [part for part in [blurb, instruction, input_context, response] if part]

    formatted_prompt = "\n\n".join(parts)

    sample["text"] = formatted_prompt

    return sample

    #if inference == False:
    #  return formatted_prompt
    #else:
    #  return sample

#### Let's test the format function on a sample

In [13]:
from random import randrange

print(create_prompt_formats(dataset[randrange(len(dataset))]))

{'instruction': 'Where is the Hawkeye Creek Bridge located', 'context': 'Hawkeye Creek Bridge is a historic structure located in a rural area northeast of Mediapolis, Iowa, United States. The Des Moines County Board of Supervisors contracted with Clinton Bridge and Iron Works on September 23, 1909, to design and build this bridge. It is an 80-foot (24 m) span that carries traffic of a gravel road over Hawkeye Creek. The structure is a single rigid-connected Pratt through truss that is supported by concrete abutments. It basically remains in an unaltered condition. The bridge was listed on the National Register of Historic Places in 1998.', 'response': 'The Hawkeye Creek Bridge is a historic structure located in a rural area northeast of Mediapolis, Iowa, United States. The Des Moines County Board of Supervisors contracted with Clinton Bridge and Iron Works on September 23, 1909, to design and build this bridge. \n\nIt is an 80-foot (24 m) span that carries traffic of a gravel road over

---

### Let's tokenize the dataset

The goal is to create input sequences of uniform length that do not exceed the model's maximum token limit. Uniform sequences suit fine-tuning because they maximize efficiency and minimize computational overhead.
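
The idea can be illustrated without a real tokenizer. The sketch below is a toy, not the notebook's pipeline: `toy_tokenize` splits on whitespace (the Llama tokenizer uses subword pieces instead), and `pad_or_truncate` and `PAD_ID` are hypothetical names introduced only for this example. Padding is applied on the right, matching `tokenizer.padding_side = "right"` used in this notebook.

```python
PAD_ID = 0  # hypothetical padding id, chosen for this toy example only

def toy_tokenize(text, vocab):
    """Map each whitespace-separated word to an integer id (toy stand-in for a real tokenizer)."""
    return [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]

def pad_or_truncate(ids, max_length, pad_id=PAD_ID):
    """Cut sequences longer than max_length and right-pad shorter ones."""
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

vocab = {}
texts = [
    "When did Virgin Australia start operating?",   # 6 words: fits exactly
    "What is audit in finance?",                    # 5 words: gets padded
    "Where is the Hawkeye Creek Bridge located",    # 7 words: gets truncated
]
batch = [pad_or_truncate(toy_tokenize(t, vocab), max_length=6) for t in texts]

for row in batch:
    print(len(row), row)
```

Every row ends up with the same length, so the batch can be stacked into a single tensor.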

In [14]:
# SOURCE https://github.com/databrickslabs/dolly/blob/master/training/trainer.py

from functools import partial

def get_max_length(model):
    conf = model.config
    max_length = None
    for length_setting in ["n_positions", "max_position_embeddings", "seq_length"]:
        max_length = getattr(conf, length_setting, None)
        if max_length:
            print(f"Found max lenth: {max_length}")
            break
    if not max_length:
        max_length = 1024
        print(f"Using default max length: {max_length}")
    return max_length


def preprocess_batch(batch, tokenizer, max_length):
    """
    Tokenizing a batch
    """
    return tokenizer(
        batch["text"],
        max_length=max_length,
        truncation=True,
    )

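def preprocess_dataset(tokenizer: AutoTokenizer, max_length: int, seed: int, dataset):
    """Format the dataset so it is ready for training
    :param tokenizer (AutoTokenizer): Model tokenizer (unused while tokenization is left to the SFTTrainer)
    :param max_length (int): Maximum number of tokens to emit from tokenizer
    :param seed (int): Shuffle seed (unused while the shuffle step below is commented out)
    :param dataset (datasets.Dataset): Dataset to format
    """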

    # Add prompt to each sample
    print("Preprocessing dataset...")
    dataset = dataset.map(create_prompt_formats)#, batched=True)

    return dataset

    # /!\ Not needed if we pass dataset_text_field="text" to the SFTTrainer
    #
    # # Apply preprocessing to each batch of the dataset & and remove 'instruction', 'context', 'response', 'category' fields
    # _preprocessing_function = partial(preprocess_batch, max_length=max_length, tokenizer=tokenizer)
    # dataset = dataset.map(
    #     _preprocessing_function,
    #     batched=True,
    #     remove_columns=["instruction", "context", "response", "text", "category"],
    # )
    #
    # # Filter out samples that have input_ids exceeding max_length
    # dataset = dataset.filter(lambda sample: len(sample["input_ids"]) < max_length)
    #
    # # Shuffle dataset
    # dataset = dataset.shuffle(seed=seed)
    #
    # return dataset

# Optimization for fine tuning on a single GPU

To reduce the GPU memory required for fine-tuning, we will use **LoRA** *(no **QLoRA** or **Flash Attention** in this notebook)*

> **LoRA** *(Low-Rank Adaptation of Large Language Models)* is a novel technique introduced by Microsoft researchers to deal with the problem of fine-tuning large-language models.
>
> Powerful models with billions of parameters, such as GPT-3, are prohibitively expensive to fine-tune in order to adapt them to particular tasks or domains.
>
> LoRA proposes to freeze pre-trained model weights and inject trainable layers (rank-decomposition matrices) in each transformer block.
>
> This greatly reduces the number of trainable parameters and GPU memory requirements since gradients don't need to be computed for most model weights.
>
> The researchers found that by focusing on the Transformer attention blocks of large-language models, fine-tuning quality with LoRA was on par with full model fine-tuning while being much faster and requiring less compute.
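
To make the parameter saving concrete, here is a minimal sketch that counts trainable weights for a single square projection matrix. It assumes a hidden size of 4096 (the value for Llama-2-7b; check `model.config.hidden_size` on your own model) and the rank `r = 64` used later in this notebook. The helper names are hypothetical.

```python
def full_finetune_params(d_in, d_out):
    """Trainable weights when the whole d_out x d_in matrix is updated."""
    return d_in * d_out

def lora_params(d_in, d_out, r):
    """Trainable weights of the two low-rank factors: A (r x d_in) and B (d_out x r)."""
    return r * d_in + d_out * r

d = 4096   # assumed hidden size (Llama-2-7b)
r = 64     # LoRA rank used in this notebook's LoraConfig

full = full_finetune_params(d, d)
lora = lora_params(d, d, r)
print(f"full: {full:,d} | LoRA: {lora:,d} | ratio: {100 * lora / full:.3f}%")
```

The ratio shrinks further with a smaller rank, since the LoRA count grows linearly in `r` while the full count does not depend on it.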

In [15]:
# Load the entire model on the GPU 0
device_map = {"": 0} # was device_map="auto",

### Define BitsAndBytesConfig

In [16]:
import torch
from transformers import BitsAndBytesConfig

def create_bnb_config():

    bnb_config = BitsAndBytesConfig(

        # Activate 4-bit precision base model loading
        load_in_4bit = True,

        # Compute dtype for 4-bit base models
        bnb_4bit_compute_dtype = "float16", # was torch.bfloat16

        # Quantization type (fp4 or nf4)
        bnb_4bit_quant_type = "nf4",

        # Activate nested quantization for 4-bit base models (double quantization)
        bnb_4bit_use_double_quant = False,
    )

    return bnb_config
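
As a rough sanity check on why 4-bit loading matters on a 15 GiB T4, the sketch below estimates the memory taken by the weights alone at different precisions. The 7 billion figure is a round approximation of the model size, and `weight_memory_gib` is a hypothetical helper; the estimate ignores quantization constants, activations, gradients, optimizer state and the LoRA adapters.

```python
def weight_memory_gib(n_params, bits_per_param):
    """Approximate memory for the weights alone, in GiB."""
    return n_params * bits_per_param / 8 / 1024**3

n_params = 7_000_000_000  # rough parameter count for a "7B" model (assumption)

for label, bits in [("float32", 32), ("float16", 16), ("4-bit", 4)]:
    print(f"{label:>8}: {weight_memory_gib(n_params, bits):5.1f} GiB")
```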

### Define LoRA config or PEFT

In [17]:
# SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
import bitsandbytes as bnb

def find_all_linear_names(model):
    cls = bnb.nn.Linear4bit #if args.bits == 4 else (bnb.nn.Linear8bitLt if args.bits == 8 else torch.nn.Linear)
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    return list(lora_module_names)

In [18]:
from peft import LoraConfig

def create_peft_config(modules):
    """
    Create Parameter-Efficient Fine-Tuning config for your model
    :param modules: Names of the modules to apply Lora to
    """

    config = LoraConfig(

        # LoRA attention dimension / dimension of the updated matrices
        r = 64, # was 16

        # Alpha parameter for LoRA scaling
        lora_alpha = 16, # was 64

        # Dropout probability for LoRA layers
        lora_dropout = 0.1,

        target_modules=modules, # required?
        bias="none",
        task_type="CAUSAL_LM",
    )

    return config

### Define a function to print the trainable parameters

In [19]:
def print_trainable_parameters(model, use_4bit=False):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        num_params = param.numel()
        # if using DS Zero 3 and the weights are initialized empty
        if num_params == 0 and hasattr(param, "ds_numel"):
            num_params = param.ds_numel

        all_param += num_params
        if param.requires_grad:
            trainable_params += num_params
    if use_4bit:
        # Integer division keeps the count an int, as required by the ':,d' format below
        trainable_params //= 2
    print(
        f"all params: {all_param:,d} || trainable params: {trainable_params:,d} || trainable%: {100 * trainable_params / all_param}"
    )

# Prepare model for training
### Initialize model and tokenizer

In [20]:
# Load model from HF with user's token and with bitsandbytes config

bnb_config = create_bnb_config()

model, tokenizer = load_model(model_name, bnb_config, auth_token=access_token)

Load Model: meta-llama/Llama-2-7b-hf




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



### Preprocess the dataset

In [21]:
max_length = get_max_length(model)

dataset = preprocess_dataset(tokenizer, max_length, seed, dataset)
print(dataset[:1])
print(dataset.shape)

Found max lenth: 4096
Preprocessing dataset...


Map:   0%|          | 0/3753 [00:00<?, ? examples/s]

{'instruction': ['When did Virgin Australia start operating?'], 'context': ["Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."], 'response': ['Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.'], 'category': ['closed_qa'], 'text': ["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhen did Virgin Australia start operating?\n\nInput:\nVirgin Australia, the trading name of Virgin Australia Airlines Pty

### Train the model

In [22]:
from peft import prepare_model_for_kbit_training, get_peft_model
from trl import SFTTrainer
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

def train(model, tokenizer, dataset, output_dir, max_seq_length=None, training_args=None, format_function=None):
    # Apply preprocessing to the model to prepare it by:

    # -- 1 - Enabling gradient checkpointing to reduce memory usage during fine-tuning
    model.gradient_checkpointing_enable()

    # -- 2 - Using the prepare_model_for_kbit_training method from PEFT
    model = prepare_model_for_kbit_training(model)

    # -- 3 - Wrap model with PEFT
    modules = find_all_linear_names(model) # Get lora module names
    peft_config = create_peft_config(modules) # Create PEFT config for these modules
    model = get_peft_model(model, peft_config) # and wrap the model to PEFT
    # print_trainable_parameters(model)

    # -- 4 - Define the trainer

    trainer = SFTTrainer( # SFTTrainer subclasses Trainer and also accepts a PEFT config, so it can run LoRA fine-tuning.
        model=model,
        train_dataset=dataset,
        peft_config=peft_config,
        dataset_text_field="text",
        # formatting_func=format_function,
        max_seq_length=max_seq_length,
        tokenizer=tokenizer,
        packing=False, # was True, # Pack multiple short examples in the same input sequence to increase efficiency
        args=training_args,
    )

    #trainer = Trainer(
    #    model=model,
    #    train_dataset=dataset,
    #    args=training_args,
    #    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False)
    #)

    model.config.use_cache = False  # re-enable for inference to speed up predictions for similar inputs

    ### SOURCE https://github.com/artidoro/qlora/blob/main/qlora.py
    # -- 5 - Verifying the datatypes before training

    dtypes = {}
    for _, p in model.named_parameters():
        dtype = p.dtype
        if dtype not in dtypes: dtypes[dtype] = 0
        dtypes[dtype] += p.numel()
    total = 0
    for k, v in dtypes.items(): total+= v
    for k, v in dtypes.items():
        print(k, v, v/total)

    do_train = True

    # -- 6 - Launch training
    print("Training...")

    if do_train:
        train_result = trainer.train()
        metrics = train_result.metrics
        trainer.log_metrics("train", metrics)
        trainer.save_metrics("train", metrics)
        trainer.save_state()
        print(metrics)

    # -- 7 - Saving model
    print("Saving last checkpoint of the model...")
    os.makedirs(output_dir, exist_ok=True)
    # trainer.model.save_pretrained(output_dir)
    trainer.save_model(output_dir)

    # -- 8 - Free memory for merging weights
    # del model
    del trainer
    torch.cuda.empty_cache()

In [23]:
################################################################################
# TrainingArguments parameters
################################################################################

training_args = TrainingArguments(

    # -- Output directory where the model predictions and checkpoints will be stored
    output_dir=output_dir,

    # -- Number of training epochs OR number of training steps
    # max_steps=50,
    num_train_epochs=1,

    # -- Enable fp16/bf16 training (set bf16 to True with an A100)
    fp16 = False, # was True
    bf16 = False, # was not here

    # -- Batch size per GPU for training
    per_device_train_batch_size = 4, # was 1

    # -- Batch size per GPU for evaluation
    per_device_eval_batch_size = 4, # was 1

    # -- Number of update steps to accumulate the gradients for
    gradient_accumulation_steps = 1, # was 4

    # -- Enable gradient checkpointing
    gradient_checkpointing = True, # was not here

    # -- Maximum gradient norm (gradient clipping)
    max_grad_norm = 0.3, # was not here

    # -- Initial learning rate (AdamW optimizer)
    learning_rate = 2e-4,

    # -- Weight decay to apply to all layers except bias/LayerNorm weights
    weight_decay = 0.001, # was not here

    # -- Optimizer to use
    optim = "paged_adamw_32bit", # was "paged_adamw_8bit"

    # -- Learning rate schedule
    lr_scheduler_type = "cosine", # was not here

    # -- Ratio of steps for a linear warmup (from 0 to learning rate)
    warmup_ratio = 0.03, # was not here
    # warmup_steps=2,

    # -- Group sequences into batches with same length / Saves memory and speeds up training considerably
    group_by_length = True, # was not here

    # -- Save a checkpoint every X update steps
    save_steps = 0, # was not here

    # -- Log every X update steps
    logging_steps = 25, # was 1

    # -- Use the Weights & Biases tracker
    report_to="wandb",
)
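
To see what these settings imply, the following sketch derives the number of optimizer steps and the warmup length for the 3753 samples loaded above, then evaluates a cosine schedule with linear warmup. It is a hand-written approximation of how the step count and schedule are usually computed, not the `transformers` implementation; `cosine_lr` is a hypothetical helper.

```python
import math

def cosine_lr(step, total_steps, warmup_steps, base_lr):
    """Linear warmup from 0 to base_lr, then cosine decay towards 0."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * max(0.0, 0.5 * (1.0 + math.cos(math.pi * progress)))

# Values taken from this notebook
num_samples = 3753
per_device_train_batch_size = 4
gradient_accumulation_steps = 1
num_train_epochs = 1
learning_rate = 2e-4
warmup_ratio = 0.03

batches_per_epoch = math.ceil(num_samples / per_device_train_batch_size)
total_steps = (batches_per_epoch // gradient_accumulation_steps) * num_train_epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, cosine_lr(step, total_steps, warmup_steps, learning_rate))
```

With a batch size of 4 and no gradient accumulation, the effective batch size is 4, so one epoch takes 939 optimizer steps and the first 29 of them are warmup.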

In [24]:
dataset[:1]

{'instruction': ['When did Virgin Australia start operating?'],
 'context': ["Virgin Australia, the trading name of Virgin Australia Airlines Pty Ltd, is an Australian-based airline. It is the largest airline by fleet size to use the Virgin brand. It commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route. It suddenly found itself as a major airline in Australia's domestic market after the collapse of Ansett Australia in September 2001. The airline has since grown to directly serve 32 cities in Australia, from hubs in Brisbane, Melbourne and Sydney."],
 'response': ['Virgin Australia commenced services on 31 August 2000 as Virgin Blue, with two aircraft on a single route.'],
 'category': ['closed_qa'],
 'text': ["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nWhen did Virgin Australia start operating?\n\nInput:\nVirgin Australia, the trading name of Virgin Australia Airlines

In [None]:
max_length = None
train(model, tokenizer, dataset, output_dir, max_length, training_args, create_prompt_formats)



Map:   0%|          | 0/3753 [00:00<?, ? examples/s]

torch.float32 422318080 0.11537734170515189
torch.uint8 3238002688 0.8846226582948481
Training...


You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
25,0.7933
50,0.639
75,0.6816
100,0.6363
125,0.6123
150,0.6248


### Try a few prompts

In [None]:
import random

dataset_eval = load_dataset("databricks/databricks-dolly-15k", split="train")

table = wandb.Table(columns=["prompt", "context", "generation"])
sample_ids = random.sample(range(len(dataset_eval)), 10)

for sample_id in sample_ids:
  sample = dataset_eval[sample_id]  # ids were drawn from dataset_eval, not from the 25% training split
  sample = create_prompt_formats(sample, True)
  prompt = sample['text']
  print(prompt)

  input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
  output = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.7)
  output_text = tokenizer.decode(output[0], skip_special_tokens=True)

  print(output_text, end="\n************************************\n")

  table.add_data(sample['text'], sample['context'], output_text)

wandb.log({'tiny_generations': table})

### Close W&B logging

In [None]:
wandb.finish()

# Merge weights
You may need to restart the Colab instance to actually free all the memory.

### Empty VRAM

In [None]:
del model
# del pipe
# del trainer
import gc
import torch

gc.collect()
torch.cuda.empty_cache()  # release cached GPU memory held by PyTorch

### Load model

In [6]:
import torch
from peft import AutoPeftModelForCausalLM

# load base LLM model and tokenizer

print("output_dir:", output_dir)

model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    # low_cpu_mem_usage=True,
    device_map=device_map, # "auto",
    torch_dtype=torch.bfloat16,
    use_auth_token = access_token
)

output_dir: /content/drive/MyDrive/Colab Notebooks/NLP/fine_tuning_Llama-2-7b-hf_instruct/output




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

### Save model & Tokenizer

In [7]:
merged_model = model.merge_and_unload()
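
Conceptually, merging folds the low-rank update into the frozen weight, so the adapter no longer has to be applied separately at inference time. The toy sketch below checks this on a 2x2 matrix with plain Python lists. All values are made up for illustration, the helpers `matmul` and `add` are hypothetical, and the `lora_alpha / r` scaling follows the LoRA formulation rather than being read from this notebook's trained adapter.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b, scale=1.0):
    """Return a + scale * b element-wise."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Toy frozen weight with a rank-1 adapter (illustrative values only)
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5], [-1.0]]          # d_out x r
A = [[2.0, 0.0]]             # r x d_in
lora_alpha, r = 16, 1
scaling = lora_alpha / r

x = [[1.0], [1.0]]           # a single input column vector

# Adapter kept separate: W x + scaling * B (A x)
separate = add(matmul(W, x), matmul(B, matmul(A, x)), scale=scaling)

# Adapter folded into the weight: (W + scaling * B A) x
W_merged = add(W, matmul(B, A), scale=scaling)
merged = matmul(W_merged, x)

print(separate, merged)
```

Both paths give the same output, which is why the merged model can be saved and loaded like an ordinary checkpoint.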

In [8]:
os.makedirs(output_merged_dir, exist_ok=True)
merged_model.save_pretrained(output_merged_dir, safe_serialization=True)

In [9]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        # output_dir,
        use_auth_token = access_token
)

tokenizer.pad_token = tokenizer.eos_token # Needed for LLaMA tokenizer
tokenizer.padding_side = "right"

tokenizer.save_pretrained(output_merged_dir)



('/content/drive/MyDrive/Colab Notebooks/NLP/fine_tuning_Llama-2-7b-hf_instruct/output_merged/tokenizer_config.json',
 '/content/drive/MyDrive/Colab Notebooks/NLP/fine_tuning_Llama-2-7b-hf_instruct/output_merged/special_tokens_map.json',
 '/content/drive/MyDrive/Colab Notebooks/NLP/fine_tuning_Llama-2-7b-hf_instruct/output_merged/tokenizer.json')

# Inference

### Load model

In [8]:
bnb_config = create_bnb_config()

model, tokenizer = load_model(output_merged_dir, bnb_config, auth_token=access_token)

Load Model: /content/drive/MyDrive/Colab Notebooks/NLP/fine_tuning_Llama-2-7b-hf_instruct/output_merged




Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]



### Load dataset and randomly select a sample

In [12]:
from random import randrange
from datasets import load_dataset

# Load dataset from the hub and get a sample
dataset = load_dataset("databricks/databricks-dolly-15k", split="train")
sample = dataset[randrange(len(dataset))]
sample = create_prompt_formats(sample, True)
prompt = sample['text']
print(prompt)



Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
What is audit in finance?

### Response:



### Generate an answer for the sampled prompt

In [13]:
input_ids = tokenizer(prompt, return_tensors="pt", truncation=True).input_ids.cuda()
# with torch.inference_mode():
outputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.9)

print(f"\n***** Prompt:\n{sample['instruction']}\n")
print(f"\n***** Generated Response:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}")
print(f"\n***** Ground truth:\n{sample['response']}")

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.



***** Prompt:
What is audit in finance?


***** Generated Response:
Audit is a comprehensive examination of a person's or a company's financial statements and financial information.  It's the process of obtaining an opinion on the validity and accuracy of a set of financial information.  It's usually conducted to ensure that the financial statements are true and that they present a fair and accurate picture of the financial activities of a person or a company.  Audits are performed to provide an independent evaluation of a company's operations and

***** Ground truth:
An audit is an independent examination of an organization's records and financial statements (report and accounts) to make sure that: 

- the financial statements show a fair reflection of the financial position at the accounting date;
- the income and spending is shown accurately;
- the financial statements meet any legal conditions; and
- the financial statements are drawn up clearly.
