## Read me section

In [1]:
### Saurabh - th July 2023 - QLORA notebook 
#### saurabhmangal@google.com

## V2 - New updates on 5th & 6th July - made tuning work too.

## Finetune Falcon-7b on a Google Colab


Welcome to this Google Colab notebook, which shows how to fine-tune the recent Falcon-7b model on a single Google Colab instance and turn it into a chatbot.

We will leverage the PEFT library from the Hugging Face ecosystem, as well as QLoRA for more memory-efficient fine-tuning.

In [2]:
### Even though the notebook contains little code, it still requires fixing environment variables
### The transformers package may fail to install or import
### HF model links may get updated
### You may need to install some packages manually; however, I have tried to include all pip installs

## Setup

Run the cells below to set up and install the required libraries. For our experiment we will need `accelerate`, `peft`, `transformers`, `datasets` and TRL to leverage the recent [`SFTTrainer`](https://huggingface.co/docs/trl/main/en/sft_trainer). We will use `bitsandbytes` to [quantize the base model into 4bit](https://huggingface.co/blog/4bit-transformers-bitsandbytes). We will also install `einops`, as it is required to load Falcon models.

In [3]:
# !pip install -q -U trl transformers accelerate git+https://github.com/huggingface/peft.git
# !pip install -q datasets bitsandbytes==0.41.1 einops wandb
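
Since the notes above warn that the environment may need manual fixing, a quick check of which packages are actually installed can save a failed run later. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical, and the package list is simply the one named in the setup text.

```python
from importlib import metadata

# Package names taken from the setup text above.
REQUIRED = ["accelerate", "peft", "transformers", "datasets", "trl", "bitsandbytes", "einops"]


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` can then be installed with the pip commands above.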

## Dataset

For our experiment, we will use the Guanaco dataset, which is a clean subset of the OpenAssistant dataset adapted to train general purpose chatbots.

The dataset can be found [here](https://huggingface.co/datasets/timdettmers/openassistant-guanaco)

In [4]:
from datasets import load_dataset

dataset_name = "timdettmers/openassistant-guanaco"
dataset = load_dataset(dataset_name, split="train")

Repo card metadata block was not found. Setting CardData to empty.
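
To get a feel for what the trainer will see, it can help to split one record into its conversation turns. This is a rough sketch that assumes each record's `text` field follows a `### Human: ... ### Assistant: ...` layout; the sample string and the helper `split_turns` are made up for illustration and are not taken from the dataset.

```python
import re

# Hypothetical sample in the assumed "### Human: ... ### Assistant: ..." layout.
sample = "### Human: What is LoRA?### Assistant: A low-rank adapter method.### Human: Thanks!"


def split_turns(text):
    """Split a Guanaco-style string into (speaker, message) pairs."""
    pattern = r"### (Human|Assistant):\s*(.*?)(?=### (?:Human|Assistant):|$)"
    return [
        (speaker, message.strip())
        for speaker, message in re.findall(pattern, text, flags=re.S)
    ]


for speaker, message in split_turns(sample):
    print(f"{speaker}: {message}")
```

With the real dataset loaded, the same helper could be applied to `dataset[0]["text"]`, provided the layout assumption holds.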


## Loading the model

In this section we will load the [Falcon 7B model](https://huggingface.co/tiiuae/falcon-7b), quantize it in 4bit and attach LoRA adapters on it. Let's get started!

In [5]:
import torch
torch.cuda.is_available()

True

In [6]:
from bitsandbytes.cextension import CUDASetup
import torch
lib = CUDASetup.get_instance().lib
# lib.cadam32bit_g32

In [7]:
CUDASetup.get_instance().generate_instructions()
CUDASetup.get_instance().print_log_stack()


False

The following directories listed in your path were found to be non-existent: {PosixPath('/usr/local/share/ca-certificates,/etc/ssl/certs,/etc/ca-certificates/update.d')}
The following directories listed in your path were found to be non-existent: {PosixPath('/opt/c2d/post_start.sh')}
The following directories listed in your path were found to be non-existent: {PosixPath('/proxy/%PORT%')}
The following directories listed in your path were found to be non-existent: {PosixPath('/root/google_vm_config.lock')}
The following directories listed in your path were found to be non-existent: {PosixPath('/proxy/%PORT%')}
The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//matplotlib_inline.backend_inline')}
DEBUG: Possible options found for libcudart.so: {PosixPath('/usr/local/cuda/lib64/libcudart.so.11.0'), PosixPath('/usr/local/cuda/lib64/libcudart.so')}
CUDA SETUP: PyTorch settings found: CUDA_VERSION=117, Highest Compute Capabi


python -m bitsandbytes


  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)
  warn(msg)


In [8]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# model_name = "ybelkada/falcon-7b-sharded-bf16"
model_name = "tiiuae/falcon-7b"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]
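
To see why 4-bit loading matters on a single GPU, here is a back-of-the-envelope estimate of the memory needed for the weights alone. It is only a sketch: the round figure of 7 billion parameters is an assumption, and the estimate ignores quantization constants, activations, optimizer state, and any layers that are kept in higher precision.

```python
def weight_memory_gib(num_params, bits_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3


NUM_PARAMS = 7e9  # assumption: round parameter count for a "7B" model

fp16_gib = weight_memory_gib(NUM_PARAMS, 16)
nf4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"16-bit weights: {fp16_gib:.1f} GiB")
print(f" 4-bit weights: {nf4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take a quarter of the space of 16-bit weights, which is what makes a single Colab GPU feasible.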

Let's also load the tokenizer below

In [9]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

Below we will create the configuration for the LoRA model. According to the QLoRA paper, it is important to consider all linear layers in the transformer block for maximum performance. Therefore we will add the `dense`, `dense_h_to_4h` and `dense_4h_to_h` layers to the target modules, in addition to the mixed query-key-value layer.

In [10]:
from peft import LoraConfig

lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "query_key_value",
        "dense",
        "dense_h_to_4h",
        "dense_4h_to_h",
    ]
)
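
To give a sense of how many trainable parameters this configuration adds, the sketch below counts the LoRA weights per layer. Each adapted linear layer gets two matrices of shapes `(in, r)` and `(r, out)`, so it contributes `r * (in + out)` parameters. The `query_key_value` and `dense` shapes and the layer count come from the model printout later in this notebook; the two MLP shapes are an assumption (a 4x expansion of the hidden size), since the printout is truncated before reaching them.

```python
LORA_R = 64
LORA_ALPHA = 16
NUM_LAYERS = 32
HIDDEN = 4544

# (in_features, out_features) for each targeted linear layer.
target_shapes = {
    "query_key_value": (HIDDEN, 4672),
    "dense": (HIDDEN, HIDDEN),
    "dense_h_to_4h": (HIDDEN, 4 * HIDDEN),  # assumption: 4x MLP expansion
    "dense_4h_to_h": (4 * HIDDEN, HIDDEN),  # assumption: 4x MLP expansion
}


def lora_params(in_features, out_features, r):
    """Parameters in one LoRA pair: A is (in, r) and B is (r, out)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_R) for i, o in target_shapes.values())
total = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / LORA_R  # factor applied to the low-rank update

print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
print(f"LoRA scaling (alpha / r):  {scaling}")
```

Under these assumptions the adapters add roughly 130 million trainable parameters, a small fraction of the frozen base model.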

## Loading the trainer

Here we will use the [`SFTTrainer` from the TRL library](https://huggingface.co/docs/trl/main/en/sft_trainer), which provides a wrapper around the transformers `Trainer` to easily fine-tune models on instruction-based datasets using PEFT adapters. Let's first load the training arguments below.

In [11]:
from transformers import TrainingArguments

output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim = "paged_adamw_32bit"
save_steps = 10
logging_steps = 10
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 20
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)

Then finally pass everything to the trainer.

In [12]:
from trl import SFTTrainer

max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)



Map:   0%|          | 0/9846 [00:00<?, ? examples/s]

We will also pre-process the model by upcasting the layer norms to float32 for more stable training.

In [13]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        module = module.to(torch.float32)
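
The loop above selects modules purely by checking whether the substring `"norm"` appears in the module name, so it is worth checking which names would match. The sketch below applies the same test to a handful of hypothetical module names; `input_layernorm` and `query_key_value` appear in the model printout later in this notebook, while `mlp`, `ln_f` and `lm_head` are assumed names used for illustration only.

```python
# Hypothetical module names; only some are confirmed by the model printout.
module_names = [
    "transformer.word_embeddings",
    "transformer.h.0.input_layernorm",
    "transformer.h.0.self_attention.query_key_value",
    "transformer.h.0.mlp.dense_h_to_4h",  # assumed name
    "transformer.ln_f",                   # assumed name of a final layer norm
    "lm_head",                            # assumed name
]

# Same substring test as the upcasting loop.
matched = [name for name in module_names if "norm" in name]
skipped = [name for name in module_names if "norm" not in name]

print("would be upcast:", matched)
print("left unchanged: ", skipped)
```

If a layer norm is registered under a name without `"norm"` in it (such as the assumed `ln_f`), the substring test would skip it, so printing the matched names on the real model is a cheap sanity check.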

## Train the model

Now let's train the model! Simply call `trainer.train()`

In [14]:
import wandb

# Disable Weights & Biases so this run is not logged to a wandb project.
wandb.init(mode="disabled")



In [15]:
trainer.train()

You're using a PreTrainedTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
10,1.321
20,1.254


TrainOutput(global_step=20, training_loss=1.2875325202941894, metrics={'train_runtime': 182.8591, 'train_samples_per_second': 1.75, 'train_steps_per_second': 0.109, 'total_flos': 3325741668243456.0, 'train_loss': 1.2875325202941894, 'epoch': 0.03})
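
The reported metrics can be reproduced from the training arguments, which is a useful way to confirm how batch size and gradient accumulation interact. This sketch assumes a single GPU; the numbers are copied from the cells and outputs above.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 20
num_examples = 9846        # from the "Map" progress bar above
train_runtime = 182.8591   # seconds, from TrainOutput

# Assumption: one GPU, so each optimizer step consumes batch * accumulation samples.
samples_per_step = per_device_train_batch_size * gradient_accumulation_steps
samples_seen = samples_per_step * max_steps

epoch_fraction = samples_seen / num_examples
samples_per_second = samples_seen / train_runtime
steps_per_second = max_steps / train_runtime

print(f"samples per optimizer step: {samples_per_step}")
print(f"samples seen in total:      {samples_seen}")
print(f"fraction of one epoch:      {epoch_fraction:.2f}")
print(f"samples per second:         {samples_per_second:.2f}")
print(f"steps per second:           {steps_per_second:.3f}")
```

The computed values match the `epoch`, `train_samples_per_second` and `train_steps_per_second` fields in the output, which shows that this short run covers only about 3% of the dataset.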

During training, the model should converge nicely as follows:

![image](https://huggingface.co/datasets/trl-internal-testing/example-images/resolve/main/images/loss-falcon-7b.png)

The `SFTTrainer` also takes care of properly saving only the adapters during training instead of saving the entire model.

## Inference # 2 

In [16]:
# import torch
# from transformers import BitsAndBytesConfig

# quantization_config = BitsAndBytesConfig(
#     load_in_4bit=True,
#     bnb_4bit_compute_dtype=torch.float16,
#     bnb_4bit_quant_type="nf4",
#     bnb_4bit_use_double_quant=True,
# )

In [17]:
# # My version with smaller chunks on safetensors for low RAM environments
# # model_name # = "vilsonrodrigues/falcon-7b-instruct-sharded"
# model_name = "tiiuae/falcon-7b"



In [18]:
#pip install xformers

In [19]:
# model_4bit = AutoModelForCausalLM.from_pretrained(
#         model_name, 
#         device_map="auto",
#         quantization_config=quantization_config,
#         trust_remote_code=True)

# tokenizer = AutoTokenizer.from_pretrained(model_name)


In [20]:
print(model)


RWForCausalLM(
  (transformer): RWModel(
    (word_embeddings): Embedding(65024, 4544)
    (h): ModuleList(
      (0-31): 32 x DecoderLayer(
        (input_layernorm): LayerNorm((4544,), eps=1e-05, elementwise_affine=True)
        (self_attention): Attention(
          (maybe_rotary): RotaryEmbedding()
          (query_key_value): Linear4bit(
            in_features=4544, out_features=4672, bias=False
            (lora_dropout): ModuleDict(
              (default): Dropout(p=0.1, inplace=False)
            )
            (lora_A): ModuleDict(
              (default): Linear(in_features=4544, out_features=64, bias=False)
            )
            (lora_B): ModuleDict(
              (default): Linear(in_features=64, out_features=4672, bias=False)
            )
            (lora_embedding_A): ParameterDict()
            (lora_embedding_B): ParameterDict()
          )
          (dense): Linear4bit(
            in_features=4544, out_features=4544, bias=False
            (lora_dropout): Modul

In [21]:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

pipeline = pipeline(
        "text-generation",
        model=model, #model_4bit,
        tokenizer=tokenizer,
        use_cache=False, #True
        device_map="auto",
        max_length=296,
        do_sample=True,
        top_k=10,
        num_return_sequences=1,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.eos_token_id,
)

Xformers is not installed correctly. If you want to use memory_efficient_attention to accelerate training use the following command to install Xformers
pip install xformers.
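
The `do_sample=True` and `top_k=10` settings above mean that each new token is drawn at random from the ten highest-scoring candidates. The following toy sketch illustrates the idea on a five-entry list of made-up logits; it is a simplified illustration, not the implementation used by transformers, and the helper `top_k_sample` is hypothetical.

```python
import math
import random


def top_k_sample(logits, k, rng):
    """Sample an index from the k highest logits, weighted by softmax."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top)
    # Subtract the maximum before exponentiating for numerical stability.
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


rng = random.Random(0)
logits = [2.0, 0.5, -1.0, 3.0, 0.0]  # made-up scores for a 5-token vocabulary
samples = [top_k_sample(logits, k=2, rng=rng) for _ in range(1000)]

print("indices ever sampled:", sorted(set(samples)))
```

With `k=2`, only the two highest-scoring indices can ever be returned, which is why a small `top_k` keeps generations on topic while still allowing some variety.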


### Test

In [22]:
pipeline("""Girafatron is obsessed with giraffes, the most glorious animal 
    on the face of this Earth. Giraftron believes all other animals 
    are irrelevant when compared to the glorious majesty of the 
    giraffe.\nDaniel: Hello, Girafatron!\nGirafatron:""")




[{'generated_text': "Girafatron is obsessed with giraffes, the most glorious animal \n    on the face of this Earth. Giraftron believes all other animals \n    are irrelevant when compared to the glorious majesty of the \n    giraffe.\nDaniel: Hello, Girafatron!\nGirafatron: Hello, Daniel! How are you today?\nDaniel: Pretty great, how about you?\nGirafatron: I am very well, thank you - the weather is nice and the \n    giraffes are being very cooperative this week, so I am pleased. \n\nDaniel: That's great to hear - you sound like you're very enthusiastic \n    about the giraffes! What is it about them that makes you so passionate \n    about them?\n\nGirafatron: Of all the creatures in the world I consider the giraffe \n    to be the most glorious. It is a magnificent animal with a long \n    neck and a graceful demeanor. It is a symbol of beauty and grace and \n    elegance. It represents everything I hold dear and aspire to be, and \n    that makes me very passionate about it!\n\nDa

In [23]:
prompt_test= "Write your question"

pipeline(prompt_test)

[{'generated_text': 'Write your question and we will answer.\nWe’ve made the 3D Printing process as easy as possible. Here are a few pointers to help you get started!\n1\nChoose your 3D printer\n2\nOrder your filament\n3\nPrepare your printer\nYou have three options available when ordering your filament, depending on what you need from your 3D print:\n- PLA\n- PLA +\n- Nylon\nPLA is a material that has been used extensively in 3D printing. It is a biodegradable plastic and can be used as the main material or a support material in 3D printing.\nPLA + is a filament made of 3D printed PLA material combined with a small amount of glass fiber or chopped carbon fiber to improve its strength, heat resistance, and tensile strength. It is ideal for high-strength 3D prints.\nNylon is another material commonly used in 3D printing. It is known for its strength, toughness, and ability to withstand high temperatures. Nylon 3D prints are often used for prototypes, functional models, and end-use parts

### Integrate with LangChain


In [24]:
# Workaround for an encoding error seen in Colab: force UTF-8 as the preferred encoding
import locale
locale.getpreferredencoding = lambda: "UTF-8"


In [25]:
from langchain import HuggingFacePipeline
from langchain import PromptTemplate, LLMChain

llm = HuggingFacePipeline(pipeline=pipeline)

template = """Question: {question}
Answer: Let's think step by step."""

prompt = PromptTemplate(
    template=template, 
    input_variables= ["question"]
)

llm_chain = LLMChain(prompt=prompt, llm=llm)

llm_chain("Write your question")

{'question': 'Write your question',
 'text': " 1. You need to make sure the person is actually dead. You'll want to do this by asking if they have a pulse on their neck as this is the main way for people to check for pulse. If they do have a pulse, and it is strong, you should call an ambulance. If they don't have a pulse, you can start with chest compressions until the ambulance arrives. 2. You need to check for breathing. To do this, place your ear next to their mouth and feel for air. Make sure you cover their nose with your mouth so that they can't breathe in while they breathe out. If they are not breathing, you'll need to check if they are choking on anything and try to dislodge it as best you can, but make sure you don't push any food or other objects into their windpipe. If there is no obstruction in their airway, then you need to start with chest compressions (which are easier to do if they are wearing light clothing). 3. If they are breathing normally, you can check for a pul
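
To make clear what the chain actually sends to the model, the sketch below fills the same template with plain `str.format`. It mimics the f-string-style substitution that `PromptTemplate` performs, without needing LangChain installed; the helper `render` is hypothetical.

```python
template = """Question: {question}
Answer: Let's think step by step."""


def render(template_text, **variables):
    """Substitute the named variables into the template."""
    return template_text.format(**variables)


prompt_text = render(template, question="How to prepare eggs?")
print(prompt_text)
```

Because the base model only continues text, everything after `Answer: Let's think step by step.` in the output is a free continuation rather than a guaranteed answer to the question.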

In [26]:
prompt_test = """ Please do these actions step by step : 

1. Create a Python function named shortest that finds the shortest path between 2 nodes in a graph.

2. list current directory.

3. save shortest function to a file named shortest.py.

4. list current directory again and show the difference in files.

5. create test cases for the shortest function


          """

llm_chain(prompt_test)

{'question': ' Please do these actions step by step : \n\n1. Create a Python function named shortest that finds the shortest path between 2 nodes in a graph.\n\n2. list current directory.\n\n3. save shortest function to a file named shortest.py.\n\n4. list current directory again and show the difference in files.\n\n5. create test cases for the shortest function\n\n\n          ',
 'text': '\n\n1. Create a Python function named shortest that finds the shortest path between two nodes in a graph.\n\nA function is a group of statements that can be used to perform a specific job. For example, if you wanted to find the shortest path between two given nodes in a graph, you would write a function to do that job.\nThere are a few different ways to write a function, so I\'ll start with the one I use most often. First, you need to define a function. This is just a list of statements that tell Python what the code will do. Here are the basic steps:\nWrite the function name, followed by a colon, fo

### Tests


In [27]:

llm_chain("How to prepare eggs?")


{'question': 'How to prepare eggs?',
 'text': ' The first thing to do is separate the white and the yolk.\n- For this, we put the egg in the palm of our hands and break it.\n- We put the white in the bowl and with the tip of a knife we will separate the yolk, which is more difficult.\n- We wash the shell with a damp sponge.\n- Then, if we are not going to eat the egg in 10 minutes from its preparation, we will put a spoon of vinegar or lemon to the water and immerse the egg in it, for a few minutes.\n- We remove the shell from the egg by piercing it with a spoon.\n- We cut the egg in half, we remove the yolk and place them on two dishes. If we do not want any yolk, we remove it with a spoon. If we want a whole egg, we will put the yolk on the white and place the dishes on the table for the rest of the family members to take them or we will prepare omelette, fried eggs, etc.\n- If we are not going to eat the egg in the next 10 minutes, we will leave in a glass with water and 3% hydrogen

In [28]:

template2 = """Question: \n {question}. Answer: """

prompt2 = PromptTemplate(
    template=template2,
    input_variables= ["question"]
)

In [29]:
llm_chain_2 = LLMChain(prompt=prompt2, llm=llm)
     

result_explanation = llm_chain_2("Explain antibiotics")
     

result_explanation['text']


'/n Antibiotics are chemical substances which are produced by microorganisms. Microorganisms are microscopic, living organisms that cannot be seen by the naked eye. Antibiotics are substances which are able to prevent and treat infections. Antibiotics work by killing bacteria and other microorganisms. Antibiotics are effective against only microorganisms and not against human cells. Antibiotics are used to cure and prevent a range of bacterial infections, including: - Bacterial infections of the upper respiratory tract, such as strep throat and bronchitis. - Bacterial infections of the skin, such as cellulitis and impetigo. - Bacterial infections of the urinary tract and genital tract, such as cystitis and gonorrhea. - Bacterial infections affecting the mouth and teeth, such as oral thrush and periodontitis. Antibiotics should never be used as a general cure for colds and flu symptoms. Antibiotics are effective only against bacterial infections. Antibiotic resistant strains of bacteria

### Conversation


In [30]:


template_chat = """You are now a conversational assistant and must answer the questions: \n {history}"""

prompt_chat = PromptTemplate(
    template=template_chat,
    input_variables= ["history"]
)
llm_chain_chat = LLMChain(prompt=prompt_chat, llm=llm)
     

prompt_conversation1 = """
The following is a conversation with an AI research assistant. The assistant tone is technical and scientific.
Human: Hello, who are you?
AI: Greeting! I am an AI research assistant. How can I help you today?
Human: Can you tell me about the creation of blackholes?
AI:
"""
     

llm_chain_chat(prompt_conversation1)

{'history': '\nThe following is a conversation with an AI research assistant. The assistant tone is technical and scientific.\nHuman: Hello, who are you?\nAI: Greeting! I am an AI research assistant. How can I help you today?\nHuman: Can you tell me about the creation of blackholes?\nAI:\n',
 'text': 'Blackholes are a fascinating phenomenon, which have intrigued the scientific community for centuries. They have been a topic of intense scientific study and debate, and have yielded a wealth of fascinating discoveries about the nature of space time and gravity.\nThe formation of blackholes is one of the most fascinating phenomena in science, and a complex subject that requires a deep understanding of physics and mathematics. However, despite all the efforts of scientists to understand blackholes, we do not yet have a definitive answer to the question of how they form.\nIn this blog post, I shall explore some of the theories about the formation of blackholes, and discuss some of the challe
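
For a multi-turn chat, the `history` string has to be rebuilt after every exchange. Here is a minimal sketch of one way to do that, keeping the turns in a list and ending the string with an open `AI:` line for the model to complete. The helper `format_history` is hypothetical and is not part of LangChain.

```python
def format_history(turns, next_speaker="AI"):
    """Join (speaker, message) turns and end with an open line for the next speaker."""
    lines = [f"{speaker}: {message}" for speaker, message in turns]
    lines.append(f"{next_speaker}:")
    return "\n".join(lines)


turns = [
    ("Human", "Hello, who are you?"),
    ("AI", "Greeting! I am an AI research assistant. How can I help you today?"),
    ("Human", "Can you tell me about the creation of blackholes?"),
]

history = format_history(turns)
print(history)
```

After each model reply, the reply can be appended to `turns` as an `("AI", ...)` pair, followed by the next human message, before formatting the history again.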

In [31]:
Stop

NameError: name 'Stop' is not defined

## DONE -- End of Session