<a href="https://colab.research.google.com/github/ffreemt/app1/blob/master/Adithya_mzwc_Mixtral_SFTTrainer_qlora_fine_tuning.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

- https://freedium.cfd/https://generativeai.pub/a-beginners-guide-to-fine-tuning-mixtral-instruct-model-7f6a30aacf61

  A Beginner's Guide to Fine-Tuning Mixtral Instruct Model
  

# MIXTRAL 8x7B - Mixture of Experts

This notebook will not run on the free T4 GPU in Google Colab; you will need an A100.

### Install Required Packages

In [None]:
from IPython.display import clear_output

In [7]:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q datasets scipy
!pip install -q trl
!pip install flash-attn --no-build-isolation

clear_output(wait=True)
"done"

'done'

### Loading the Base Model

Load the model in `4bit`, with double quantization, with `bfloat16` as the compute dtype.

In this case we are using the instruct-tuned model instead of the base model; fine-tuning a base model would need a lot more data!
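
As a rough sense of scale, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count of about 46.7 billion is an assumption (the commonly quoted total for Mixtral 8x7B, not a number taken from this notebook), and the estimate ignores quantization constants, activations, and the CUDA context.

```python
# Back-of-the-envelope weight memory for a model of a given size.
# ASSUMPTION: ~46.7e9 total parameters for Mixtral 8x7B (not from this notebook).
ASSUMED_PARAMS = 46.7e9

def weight_gigabytes(num_params: float, bits_per_param: float) -> float:
    """Return the storage needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

for name, bits in [("float32", 32), ("bfloat16", 16), ("nf4", 4)]:
    print(f"{name:>8}: {weight_gigabytes(ASSUMED_PARAMS, bits):6.1f} GB")
```

Under these assumptions, 16-bit weights come to roughly 93 GB, which is in line with the "need about 100G disk" comment in the loading cell, while 4-bit weights come to roughly 23 GB.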

## Load dataset for finetuning

### Let's Load the Dataset

For this tutorial, we will fine-tune Mixtral 8x7B for code generation.

We will be using this [dataset](https://huggingface.co/datasets/TokenBender/code_instructions_122k_alpaca_style) which is curated by [TokenBender (e/xperiments)](https://twitter.com/4evaBehindSOTA) and is an excellent data source for fine-tuning models for code generation. It follows the alpaca style of instructions, which is an excellent starting point for this task. The dataset structure should resemble the following:

```json
{
  "instruction": "Create a function to calculate the sum of a sequence of integers.",
  "input": "[1, 2, 3, 4, 5]",
  "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum"
}
```

In [8]:
model_id = "mistralai/Mixtral-8x7B-v0.1"

In [5]:
import torch
torch.cuda.is_bf16_supported(), torch.float16

(False, torch.float16)

In [6]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
)
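
To see what double quantization buys, here is a small arithmetic sketch of the per-parameter overhead of the quantization constants. The block sizes (64 weights per block, 256 constants per second-level block) and bit widths follow the figures described in the QLoRA paper and are assumptions here, not values read from `nf4_config`; the parameter count is likewise an assumed ~46.7 billion.

```python
# Overhead of quantization constants, in bits per model parameter.
# ASSUMPTIONS (figures from the QLoRA paper, not read from this notebook):
BLOCK_SIZE = 64          # weights sharing one absmax constant
SECOND_BLOCK_SIZE = 256  # first-level constants sharing one second-level constant
ASSUMED_PARAMS = 46.7e9  # rough Mixtral 8x7B parameter count

# Without double quantization: one 32-bit constant per block of 64 weights.
plain_bits = 32 / BLOCK_SIZE

# With double quantization: constants are stored in 8 bits, plus one 32-bit
# second-level constant per 256 first-level constants.
double_bits = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * SECOND_BLOCK_SIZE)

saved_gb = (plain_bits - double_bits) * ASSUMED_PARAMS / 8 / 1e9
print(f"plain: {plain_bits:.3f} bits/param, double: {double_bits:.3f} bits/param")
print(f"approximate saving: {saved_gb:.2f} GB")
```

Under these assumptions the saving is on the order of 2 GB, which can matter when the model only just fits on the GPU.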

In [None]:
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    quantization_config=nf4_config,
    use_cache=False,
    attn_implementation="flash_attention_2"

)

# need about 100G disk


# Define tokenizer

In [13]:
from transformers import  AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"


Let's examine how well the model does at this task currently:

In [None]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs,
                                 max_new_tokens=512,
                                 do_sample=True,
                                 pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0].replace(prompt, "")

In [None]:
prompt="""[INST]Use the provided input to create an instruction that could have been used to generate the response with an LLM. \nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.[/INST]"""

generate_response(prompt, model)

In [None]:
print(model)

MixtralForCausalLM(
  (model): MixtralModel(
    (embed_tokens): Embedding(32000, 4096)
    (layers): ModuleList(
      (0-31): 32 x MixtralDecoderLayer(
        (self_attn): MixtralFlashAttention2(
          (q_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (k_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=4096, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=4096, out_features=4096, bias=False)
          (rotary_emb): MixtralRotaryEmbedding()
        )
        (block_sparse_moe): MixtralSparseMoeBlock(
          (gate): Linear4bit(in_features=4096, out_features=8, bias=False)
          (experts): ModuleList(
            (0-7): 8 x MixtralBLockSparseTop2MLP(
              (w1): Linear4bit(in_features=4096, out_features=14336, bias=False)
              (w2): Linear4bit(in_features=14336, out_features=4096, bias=False)
              (w3): Linear4bit(in_features=4096, ou

In [2]:
try:
  import datasets
except ModuleNotFoundError:
  !pip install -q datasets
  from IPython.display import clear_output
  clear_output()

In [3]:
from datasets import load_dataset

dataset = load_dataset("TokenBender/code_instructions_122k_alpaca_style", split="train")
dataset


Dataset({
    features: ['output', 'instruction', 'text', 'input'],
    num_rows: 121959
})

In [4]:
df = dataset.to_pandas()
df.head(10)

Unnamed: 0,output,instruction,text,input
0,# Python code\ndef sum_sequence(sequence):\n ...,Create a function to calculate the sum of a se...,Below is an instruction that describes a task....,"[1, 2, 3, 4, 5]"
1,"def add_strings(str1, str2):\n """"""This func...",Develop a function that will add two strings,Below is an instruction that describes a task....,"str1 = ""Hello ""\nstr2 = ""world"""
2,#include <map>\n#include <string>\n\nclass Gro...,Design a data structure in C++ to store inform...,Below is an instruction that describes a task....,
3,def bubble_sort(arr):\n n = len(arr)\n \n ...,Implement a sorting algorithm to sort a given ...,Below is an instruction that describes a task....,"[3, 1, 4, 5, 9, 0]"
4,import UIKit\n\nclass ExpenseViewController: U...,Design a Swift application for tracking expens...,Below is an instruction that describes a task....,Not applicable
5,<?php\n$timestamp = $_GET['timestamp'];\n\nif(...,Create a REST API to convert a UNIX timestamp ...,Below is an instruction that describes a task....,Not Applicable
6,import requests\nimport re\n\ndef crawl_websit...,Generate a Python code for crawling a website ...,Below is an instruction that describes a task....,website: www.example.com \ndata to crawl: phon...
7,"[x*x for x in [1, 2, 3, 5, 8, 13]]",Create a Python list comprehension to get the ...,Below is an instruction that describes a task....,
8,SELECT * FROM products ORDER BY price DESC LIM...,Create a MySQL query to find the most expensiv...,Below is an instruction that describes a task....,
9,public class Library {\n \n // map of books in...,Create a data structure in Java for storing an...,Below is an instruction that describes a task....,Not applicable


Instruction fine-tuning: prepare the dataset in a "prompt" format that the model can learn from:
1. the function `generate_prompt` takes the instruction, the optional input, and the output, and builds a prompt
2. shuffle the dataset
3. tokenize the dataset

### Formatting the Dataset

Now, let's format the dataset in the required [Mistral-7B-Instruct-v0.1 format](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.1).

> Many tutorials and blogs skip over this part, but I feel this is a really important step.

We'll put each instruction and input pair between `[INST]` and `[/INST]`, with the output after that, like this:

```
<s>[INST] What is your favorite condiment? [/INST]
Well, I'm quite partial to a good squeeze of fresh lemon juice. It adds just the right amount of zesty flavor to whatever I'm cooking up in the kitchen!</s>
```

You can use the following code to build the formatted prompt for each example and add it to the dataset as a `prompt` column:

In [5]:
def generate_prompt(data_point):
    """Generate the training text from a task instruction, optional context input, and answer.

    :param data_point: dict: Data point
    :return: str: formatted prompt
    """
    prefix_text = 'Below is an instruction that describes a task. Write a response that ' \
               'appropriately completes the request.\n\n'
    # Samples with additional context info.
    if data_point['input']:
        text = f"""<s>[INST]{prefix_text} {data_point["instruction"]} here are the inputs {data_point["input"]} [/INST]{data_point["output"]}</s>"""
    # Samples without additional context.
    else:
        text = f"""<s>[INST]{prefix_text} {data_point["instruction"]} [/INST]{data_point["output"]}</s>"""
    return text

# add the "prompt" column in the dataset
text_column = [generate_prompt(data_point) for data_point in dataset]
dataset = dataset.add_column("prompt", text_column)
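
Because a malformed tag is easy to miss, a quick sanity check of the template can help. The following is a minimal sketch using only the standard library; the helper name `looks_well_formed` and the sample strings are hypothetical and not part of the original notebook.

```python
import re

# Hypothetical helper: checks the overall shape of a Mistral-style training prompt.
PROMPT_PATTERN = re.compile(r"^<s>\[INST\].+?\[/INST\].+</s>$", re.DOTALL)

def looks_well_formed(prompt: str) -> bool:
    """True if the prompt is `<s>[INST]...[/INST]...</s>` with exactly one tag pair."""
    return (
        PROMPT_PATTERN.match(prompt) is not None
        and prompt.count("[INST]") == 1
        and prompt.count("[/INST]") == 1
    )

good = "<s>[INST]Add two numbers. here are the inputs 1, 2 [/INST]print(1 + 2)</s>"
bad = "<s>[INST]Add two numbers.[\\INST]print(1 + 2)</s>"  # backslash instead of slash

print(looks_well_formed(good))  # True
print(looks_well_formed(bad))   # False
```

Running such a check over a handful of rows of the `prompt` column before training is a cheap way to catch template mistakes.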

In [26]:
dataset

DatasetDict({
    train: Dataset({
        features: ['output', 'instruction', 'text', 'input', 'prompt', 'input_ids', 'attention_mask'],
        num_rows: 97567
    })
    test: Dataset({
        features: ['output', 'instruction', 'text', 'input', 'prompt', 'input_ids', 'attention_mask'],
        num_rows: 24392
    })
})

In [19]:
tokenizer(dataset['train'][0]['prompt'])

{'input_ids': [1, 1, 733, 16289, 28793, 20548, 336, 349, 396, 13126, 369, 13966, 264, 3638, 28723, 12018, 264, 2899, 369, 6582, 1999, 2691, 274, 272, 2159, 28723, 13, 13, 12018, 2696, 354, 264, 2007, 369, 12652, 264, 5509, 345, 7349, 28739, 2818, 356, 272, 2188, 28742, 28713, 1141, 28723, 1236, 460, 272, 14391, 1141, 327, 345, 14964, 28739, 733, 28748, 16289, 28793, 7841, 5509, 13, 13, 1270, 2231, 28730, 7349, 28730, 5527, 28730, 266, 28730, 861, 28732, 861, 1329, 28705, 13, 2287, 7908, 327, 1141, 13, 2287, 354, 613, 297, 2819, 28732, 2004, 28732, 861, 24770, 13, 5390, 7908, 2679, 5509, 28723, 23817, 28732, 1427, 28723, 25436, 28710, 28730, 895, 1532, 28731, 13, 2287, 604, 7908, 13, 13, 7349, 327, 2231, 28730, 7349, 28730, 5527, 28730, 266, 28730, 861, 28732, 861, 28731, 2], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 

In [14]:
dataset = dataset.shuffle(seed=1234)  # Shuffle dataset here
dataset = dataset.map(lambda samples: tokenizer(samples["prompt"]), batched=True)


In [25]:
dataset['train'].to_pandas()

Unnamed: 0,output,instruction,text,input,prompt,input_ids,attention_mask
0,import random\n\ndef create_password_based_on_...,Write code for a program that creates a random...,Below is an instruction that describes a task....,"name = ""John""",<s>[INST]Below is an instruction that describe...,"[1, 1, 733, 16289, 28793, 20548, 336, 349, 396...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
1,import math\n\ndef find_closest_point(input_co...,Write a service in Python that can be used to ...,Below is an instruction that describes a task....,"Input coordinates: (2, 3)\n\nSet of coordinate...",<s>[INST]Below is an instruction that describe...,"[1, 1, 733, 16289, 28793, 20548, 336, 349, 396...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
2,def is_int_positive_negative_zero(num):\n i...,Write a code snippet to determine if an intege...,Below is an instruction that describes a task....,,<s>[INST]Below is an instruction that describe...,"[1, 1, 733, 16289, 28793, 20548, 336, 349, 396...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
3,def is_palindrome(word):\n word_list = list...,Create a Python script to check if a given wor...,Below is an instruction that describes a task....,madam,<s>[INST]Below is an instruction that describe...,"[1, 1, 733, 16289, 28793, 20548, 336, 349, 396...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
4,from flask import Flask \n\napp = Flask(__name...,Write a code to create a web server using Flas...,Below is an instruction that describes a task....,,<s>[INST]Below is an instruction that describe...,"[1, 1, 733, 16289, 28793, 20548, 336, 349, 396...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
...,...,...,...,...,...,...,...
97562,"<?php\n\n// The text document.\n$text = ""The c...",Create a PHP script to count the number of occ...,Below is an instruction that describes a task....,The text document contains the following words...,<s>[INST]Below is an instruction that describe...,"[1, 1, 733, 16289, 28793, 20548, 336, 349, 396...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
97563,SELECT COUNT(*) as TotalNumberOfRows FROM <TAB...,Design a SQL query to get the total number of ...,Below is an instruction that describes a task....,,<s>[INST]Below is an instruction that describe...,"[1, 1, 733, 16289, 28793, 20548, 336, 349, 396...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
97564,const request = require('request');\nconst url...,Create a Node.js script that takes in a URL pa...,Below is an instruction that describes a task....,https://example.com,<s>[INST]Below is an instruction that describe...,"[1, 1, 733, 16289, 28793, 20548, 336, 349, 396...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."
97565,bool isPalindrome(char str[]) \n{ \n // Sta...,You are given a string and need to implement a...,Below is an instruction that describes a task....,,<s>[INST]Below is an instruction that describe...,"[1, 1, 733, 16289, 28793, 20548, 336, 349, 396...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ..."


In [10]:
dataset = dataset.train_test_split(test_size=0.2)
train_data = dataset["train"]
test_data = dataset["test"]
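
The sizes of the two splits can be reproduced by hand. The sketch assumes the test split is rounded up to a whole example and the train split gets the remainder, which matches the row counts shown in this notebook (121,959 rows in total).

```python
import math

TOTAL_ROWS = 121_959   # num_rows reported when the dataset was loaded
TEST_FRACTION = 0.2    # test_size passed to train_test_split

# ASSUMPTION: the test size is rounded up and the rest goes to train.
test_rows = math.ceil(TOTAL_ROWS * TEST_FRACTION)
train_rows = TOTAL_ROWS - test_rows

print(train_rows, test_rows)  # 97567 24392
```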

In [20]:
train_data

Dataset({
    features: ['output', 'instruction', 'text', 'input', 'prompt'],
    num_rows: 97567
})

In [None]:
train_data["input_ids"][:10]

### After Formatting, We should get something like this

```json
{
  "text": "<s>[INST] Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] [/INST]\n# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum</s>",
  "instruction": "Create a function to calculate the sum of a sequence of integers",
  "input": "[1, 2, 3, 4, 5]",
  "output": "# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum",
  "prompt": "<s>[INST] Create a function to calculate the sum of a sequence of integers. here are the inputs [1, 2, 3, 4, 5] [/INST]\n# Python code def sum_sequence(sequence): sum = 0 for num in sequence: sum += num return sum</s>"
}
```

While using SFT (**[Supervised Fine-tuning Trainer](https://huggingface.co/docs/trl/main/en/sft_trainer)**) for fine-tuning, we will pass only the `prompt` column of the dataset to the trainer.

In [None]:
print(test_data)

Dataset({
    features: ['instruction', 'output', 'input', 'text', 'prompt', 'input_ids', 'attention_mask'],
    num_rows: 24392
})


### Setting up the Training
We will be using the Hugging Face ecosystem and the `peft` library!

In [None]:
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
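    # Names that match no module are skipped. In the architecture printed earlier the
    # expert layers are named w1, w2 and w3, so gate_proj, up_proj and down_proj
    # appear to match nothing in this model.
    target_modules=[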
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
        "lm_head",
    ],
    task_type="CAUSAL_LM"
)

We need to prepare the model for 4-bit training, so we will use the `prepare_model_for_kbit_training` function from `peft`.



In [None]:
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, peft_config)

In [None]:
def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

In [None]:
print_trainable_parameters(model)

trainable params: 56836096 || all params: 23539437568 || trainable%: 0.24145052674182907
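
The trainable-parameter count can be checked by hand. A LoRA adapter of rank r on a linear layer with `in_features` inputs and `out_features` outputs adds r × (in_features + out_features) weights. The sketch below assumes that only `q_proj`, `k_proj`, `v_proj`, `o_proj` and `lm_head` actually receive adapters (the printed architecture names its expert layers `w1`, `w2`, `w3`, so `gate_proj`, `up_proj` and `down_proj` match nothing), and that `lm_head` maps 4096 features to the 32000-entry vocabulary.

```python
RANK = 64        # r in LoraConfig
NUM_LAYERS = 32  # decoder layers in the printed model

# (in_features, out_features) of the attention projections, from print(model)
attention_shapes = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
}
# ASSUMPTION: lm_head is a 4096 -> 32000 linear layer (vocabulary size 32000).
lm_head_shape = (4096, 32000)

def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Weights in the two low-rank matrices A (rank x in) and B (out x rank)."""
    return rank * (in_features + out_features)

per_layer = sum(lora_params(i, o, RANK) for i, o in attention_shapes.values())
total = NUM_LAYERS * per_layer + lora_params(*lm_head_shape, RANK)
print(total)  # 56836096
```

The result agrees with the `trainable params` figure printed above, which supports the assumption about which modules were matched. The `all params` figure is lower than the model's nominal size, most likely because 4-bit weights are stored packed, so `numel()` undercounts them.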


### Model after Adding LoRA Config

In [None]:
print(model)

### Hyperparameters for training
These parameters depend on how long you want to train.
The most important ones to consider:

`num_train_epochs`/`max_steps`: how many passes over the data (or how many optimizer steps) to run. Be careful: with too many, you will overfit!

`learning_rate`: the optimizer step size, which controls the speed of convergence


In [None]:
if torch.cuda.device_count() > 1: # If more than 1 GPU
    print(torch.cuda.device_count())
    model.is_parallelizable = True
    model.model_parallel = True

In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
  output_dir = "Mixtral_Alpace_v3",
  #num_train_epochs=5,
  max_steps = 100, # comment out this line if you want to train in epochs
  per_device_train_batch_size = 32,
  warmup_ratio = 0.03, # fraction of total steps used for warmup (warmup_steps expects a step count)
  logging_steps=10,
  save_strategy="epoch",
  #evaluation_strategy="epoch",
  evaluation_strategy="steps",
  eval_steps=10, # comment out this line if you want to evaluate at the end of each epoch
  learning_rate=2.5e-5,
  bf16=True,
  # lr_scheduler_type='constant',
)
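
To put `max_steps` in perspective, the sketch below estimates how much data the run covers. It assumes a single GPU, the default gradient accumulation of 1, and one dataset row per sequence; with packing enabled in the trainer the true coverage will differ, so treat this as an order-of-magnitude check.

```python
MAX_STEPS = 100          # max_steps in TrainingArguments
BATCH_SIZE = 32          # per_device_train_batch_size
TRAIN_ROWS = 97_567      # rows in the train split
# ASSUMPTIONS: one GPU, gradient_accumulation_steps = 1, no packing.
NUM_GPUS = 1
GRAD_ACCUM = 1

effective_batch = BATCH_SIZE * NUM_GPUS * GRAD_ACCUM
sequences_seen = MAX_STEPS * effective_batch
fraction_of_epoch = sequences_seen / TRAIN_ROWS
steps_per_epoch = -(-TRAIN_ROWS // effective_batch)  # ceiling division

print(f"sequences seen: {sequences_seen}")
print(f"fraction of one epoch: {fraction_of_epoch:.1%}")
print(f"steps for a full epoch: {steps_per_epoch}")
```

Under these assumptions, 100 steps touch only a few percent of the training split, so overfitting is unlikely at this setting; a full epoch would take roughly three thousand steps.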

Setting up the trainer.

`max_seq_length`: the maximum number of tokens per training sequence (the context window used during training)
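
As an illustration of what `max_seq_length` means when packing is enabled, here is a simplified pure-Python sketch: tokenized examples are concatenated into one stream and cut into fixed-length blocks. It is not TRL's implementation; the function name `pack_sequences` and the toy token ids are hypothetical, and dropping the incomplete final block is an assumption.

```python
from typing import Iterable

def pack_sequences(examples: Iterable[list[int]], max_seq_length: int) -> list[list[int]]:
    """Concatenate token-id lists and split the stream into equal-length blocks."""
    stream: list[int] = []
    for ids in examples:
        stream.extend(ids)
    # ASSUMPTION: the incomplete tail of the stream is discarded.
    usable = len(stream) - len(stream) % max_seq_length
    return [stream[i:i + max_seq_length] for i in range(0, usable, max_seq_length)]

# Toy token ids; 1 and 2 stand in for the BOS and EOS tokens.
toy_examples = [[1, 10, 11, 2], [1, 20, 2], [1, 30, 31, 32, 2]]
print(pack_sequences(toy_examples, max_seq_length=4))
# [[1, 10, 11, 2], [1, 20, 2, 1], [30, 31, 32, 2]]
```

Packing avoids wasting compute on padding tokens, at the cost of some examples being split across two blocks.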


In [None]:
from trl import SFTTrainer

max_seq_length = 1024

trainer = SFTTrainer(
  model=model,
  peft_config=peft_config,
  max_seq_length=max_seq_length,
  tokenizer=tokenizer,
  packing=True,
  args=args,
  dataset_text_field="prompt",
  train_dataset=train_data,
  eval_dataset=test_data,
)




In [None]:
trainer.train()

The input hidden states seems to be silently casted in float32, this might be related to the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in torch.bfloat16.


Step,Training Loss,Validation Loss


In [None]:
trainer.save_model("Mixtral_Alpace_v2")

# Save Model and Push to Hub

In [None]:
# !pip install huggingface-hub -qU

In [None]:
# from huggingface_hub import notebook_login

# notebook_login()

In [None]:
# trainer.push_to_hub("Promptengineering/mistral-instruct-generation")

In [None]:
merged_model = model.merge_and_unload()

In [None]:
def generate_response(prompt, model):
  encoded_input = tokenizer(prompt,  return_tensors="pt", add_special_tokens=True)
  model_inputs = encoded_input.to('cuda')

  generated_ids = model.generate(**model_inputs,
                                 max_new_tokens=150,
                                 do_sample=True,
                                 pad_token_id=tokenizer.eos_token_id)

  decoded_output = tokenizer.batch_decode(generated_ids)

  return decoded_output[0]

In [None]:
prompt = "[INST]Use the provided input to create an instruction that could have been used to generate the response with an LLM.\nThere are more than 12,000 species of grass. The most common is Kentucky Bluegrass, because it grows quickly, easily, and is soft to the touch. Rygrass is shiny and bright green colored. Fescues are dark green and shiny. Bermuda grass is harder but can grow in drier soil.[/INST]"


In [None]:
generate_response(prompt, merged_model)

In [None]:
250*32

8000