**Model prep**


In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

# We have to check which Torch version for Xformers (2.3 -> 0.0.27)
from torch import __version__; from packaging.version import Version as V
xformers = "xformers==0.0.27" if V(__version__) < V("2.4.0") else "xformers"
!pip install --no-deps {xformers} trl peft accelerate bitsandbytes triton

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Import the model
* We support Llama, Mistral, Phi-3, Gemma, Yi, DeepSeek, Qwen, TinyLlama, Vicuna, Open Hermes etc
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Llama-3 15 trillion tokens **2x faster**! See our [Llama-3 notebook](https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing)

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",      # Llama-3.1 15 trillion tokens model 2x faster!
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",    # We also uploaded 4bit for 405b!
    "unsloth/Mistral-Nemo-Base-2407-bnb-4bit", # New Mistral 12b 2x faster!
    "unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    "unsloth/mistral-7b-v0.3-bnb-4bit",        # Mistral v3 2x faster!
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",           # Phi-3.5 2x faster!
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",            # Gemma 2x faster!
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Phi-3.5-mini-instruct",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!





### Use LoRA Optimization & Acceleration
We now add LoRA adapters so we only need to update 1 to 10% of all parameters!
<br>

LoRA (Low-Rank Adaptation)<br>
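Reduces the memory footprint and compute cost of finetuning by training small low-rank matrices instead of the full weights.
* model = The base model that receives the adapters
* r = Rank of the low-rank matrices. Any integer > 0 (commonly 8, 16, 32, 64, 128). A larger r means more trainable parameters and compute, and usually more capacity to fit the data
* target_modules = Names of the layers that get LoRA adapters
* lora_alpha = Scaling factor for the low-rank update. The update is multiplied by lora_alpha / r
* lora_dropout = Dropout rate on the LoRA path (dropout helps prevent over-fitting)
* bias = Which bias terms are trained ("none" trains no bias terms)
* use_gradient_checkpointing = Recomputes activations in the backward pass to reduce memory consumption
* random_state = Random seed, for reproducibility
* use_rslora = Whether to use rank-stabilized LoRA (a LoRA variant that scales the update by lora_alpha / sqrt(r))
* loftq_config = Optional LoftQ configuration, a quantization-aware way to initialize the adapters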

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
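
The size of the adapter follows from `r` and the shapes of the targeted layers: a linear layer with `d_in` inputs and `d_out` outputs gains `r * (d_in + d_out)` trainable parameters. The sketch below works through that arithmetic. The layer dimensions (hidden size 3072, MLP size 8192, no grouped-query attention) are assumptions about Phi-3.5-mini that are not stated anywhere in this notebook; with them, the total agrees with the trainable-parameter count printed in the training log.

```python
# Sketch: count the LoRA parameters added by the configuration above.
# ASSUMPTION: the layer shapes are taken to be those of Phi-3.5-mini;
# they are not stated in this notebook.
HIDDEN = 3072   # assumed hidden size
MLP = 8192      # assumed MLP (intermediate) size
LAYERS = 32     # matches "patched 32 layers" in the output above
R = 16          # rank passed to get_peft_model

# (input features, output features) of every targeted projection
module_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_params(d_in, d_out, r):
    """Parameters in the two low-rank matrices: A (r x d_in) and B (d_out x r)."""
    return r * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, R) for d_in, d_out in module_shapes.values())
total = per_layer * LAYERS
print(f"{per_layer:,} LoRA parameters per layer, {total:,} in total")
```

Doubling `r` doubles this count, which is one way to see why a larger rank costs more memory and compute.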


<a name="Data"></a>
### Data Prep
We now use the `Phi-3` format for conversation style finetunes. Our own training set is converted to ShareGPT style, following the layout of [Open Assistant conversations](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style). Phi-3 renders multi-turn conversations like below:

```
<|user|>
Hi!<|end|>
<|assistant|>
Hello! How are you?<|end|>
<|user|>
I'm doing great! And you?<|end|>

```
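
To make the layout above concrete, here is a minimal sketch that renders a list of role/content messages by hand. The helper `render_phi3` is hypothetical and only mimics the turns shown above; the tokenizer's real chat template is applied later with `apply_chat_template` and may differ in details such as extra special tokens.

```python
# Sketch only: reproduces the turn layout shown above by hand.
# render_phi3 is a hypothetical helper, not part of Unsloth or transformers.
def render_phi3(messages):
    """Render [{"role": ..., "content": ...}, ...] as Phi-3 style turns."""
    parts = []
    for message in messages:
        # Each turn is a role tag, a newline, the text, then the <|end|> marker.
        parts.append(f"<|{message['role']}|>\n{message['content']}<|end|>\n")
    return "".join(parts)

messages = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How are you?"},
    {"role": "user", "content": "I'm doing great! And you?"},
]
print(render_phi3(messages))
```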

**[NOTE]** To train only on completions (ignoring the user's input) read Unsloth's docs [here](https://github.com/unslothai/unsloth/wiki#train-on-completions--responses-only-do-not-train-on-inputs).

We use our `get_chat_template` function to get the correct chat template. We support `zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old` and our own optimized `unsloth` template.

Note ShareGPT uses `{"from": "human", "value" : "Hi"}` and not `{"role": "user", "content" : "Hi"}`, so we use `mapping` to map it.
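
As a rough illustration of what that mapping does, the sketch below converts one ShareGPT-style conversation into role/content messages in plain Python. The helper `sharegpt_to_messages` and the `ROLE_MAP` table are hypothetical names used only for this example; in the notebook the conversion is handled by the `mapping` argument of `get_chat_template`.

```python
# Sketch: translate ShareGPT keys and speaker names into role/content form.
# ROLE_MAP and sharegpt_to_messages are hypothetical names for illustration.
ROLE_MAP = {"human": "user", "gpt": "assistant"}

def sharegpt_to_messages(conversation):
    """Convert [{"from": ..., "value": ...}, ...] to [{"role": ..., "content": ...}, ...]."""
    return [
        {
            # Unknown speakers are passed through unchanged.
            "role": ROLE_MAP.get(turn["from"], turn["from"]),
            "content": turn["value"],
        }
        for turn in conversation
    ]

example = [
    {"from": "human", "value": "Hi"},
    {"from": "gpt", "value": "Hello!"},
]
print(sharegpt_to_messages(example))
```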

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

**Aligning our Dataset to Standard Format**<br>

Format reference: [philschmid/guanaco-sharegpt-style](https://huggingface.co/datasets/philschmid/guanaco-sharegpt-style)

In [None]:
from datasets import Dataset
from datasets import load_dataset
import json

# Change to your training set path
TRAINING_SET_PATH = "/content/ds.json"
RAW_TRAINING_SET_PATH = "/content/drive/MyDrive/SSE Group Project - Green Foundation/Training Set/combined_trainingset.json"

# # load dataset in dictionary
# dataset = load_dataset("json", data_files=TRAINING_SET_PATH, split = "train")
# dataset = dataset.to_dict()
# print("The dataset's type changed to: ", type(dataset))

# # get the queries list
# queries_lst = dataset["queries"][0]

# Load json file
with open(RAW_TRAINING_SET_PATH, 'r', encoding='utf-8') as f:
    raw_trainingset = json.load(f)

# Get queries list: flatten {group: [qa_dict, ...]} into one list of QA dicts
def get_all_data_to_lst(combined_dict):
  res = []
  for q_lst in combined_dict.values():
    res.extend(q_lst)
  return res
queries_lst = get_all_data_to_lst(raw_trainingset)
print(queries_lst[0])

{'query': 'Does the application/framework use content delivery networks (CDNs) to minimize recomputation or fetching of static data?', 'context': 'Our web platform serves static assets such as images, CSS, and JavaScript files using a network of global CDNs to ensure fast delivery to users around the world.', 'explanation': 'The platform uses CDNs to deliver static assets globally, reducing load times and server strain, which aligns with the green practice.', 'judgement': 'Yes'}


In [None]:
# Format the queries list
formatted_queries = []
for qa_dict in queries_lst:
  formatted_queries.append([])
  formatted_queries[-1].append( {
      "from": "human",
      "value": "Using this as context '{context}', Answer this question: '{query}'".format( context=qa_dict["context"], query=qa_dict["query"])
      } )
  formatted_queries[-1].append( {
      "from": "gpt",
      "value": "Judgement: {judge}, Explanation: {exp}".format( judge=qa_dict["judgement"], exp=qa_dict["explanation"])
      } )


print("There are ", len(formatted_queries), " QA queries in training set.")
print("Current structure of each QA query: \n", formatted_queries[0])

# Convert dict back to Dataset type
processed_dataset = {"conversations": formatted_queries}
dataset = Dataset.from_dict(processed_dataset)
print("The dataset's type now: ", type(dataset))

There are  1558  QA queries in training set.
Current structure of each QA query: 
 [{'from': 'human', 'value': "Using this as context 'Our web platform serves static assets such as images, CSS, and JavaScript files using a network of global CDNs to ensure fast delivery to users around the world.', Answer this question: 'Does the application/framework use content delivery networks (CDNs) to minimize recomputation or fetching of static data?'"}, {'from': 'gpt', 'value': 'Judgement: Yes, Explanation: The platform uses CDNs to deliver static assets globally, reducing load times and server strain, which aligns with the green practice.'}]
The dataset's type now:  <class 'datasets.arrow_dataset.Dataset'>
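
Because every training answer follows the fixed pattern `Judgement: ..., Explanation: ...`, a reply from the finetuned model can in principle be split back into its two fields. The sketch below shows one way to do that with a regular expression. The helper `parse_response` is hypothetical, and it assumes the model reproduces the training pattern, which is not guaranteed.

```python
import re

# Sketch: split a reply of the form "Judgement: X, Explanation: Y" into fields.
# parse_response is a hypothetical helper; it assumes the model keeps the
# training pattern, which may not always hold.
RESPONSE_PATTERN = re.compile(
    r"Judgement:\s*(?P<judgement>.+?),\s*Explanation:\s*(?P<explanation>.*)",
    re.DOTALL,
)

def parse_response(text):
    """Return {"judgement": ..., "explanation": ...}, or None if the pattern is absent."""
    match = RESPONSE_PATTERN.search(text)
    if match is None:
        return None
    return {
        "judgement": match.group("judgement").strip(),
        "explanation": match.group("explanation").strip(),
    }

reply = "Judgement: Yes, Explanation: The platform uses CDNs to deliver static assets globally."
print(parse_response(reply))
```

Returning `None` for replies that do not match lets the caller count or inspect malformed outputs instead of failing on them.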


In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "phi-3", # Supports zephyr, chatml, mistral, llama, alpaca, vicuna, vicuna_old, unsloth
    mapping = {"role" : "from", "content" : "value", "user" : "human", "assistant" : "gpt"}, # ShareGPT style
)

def formatting_prompts_func(examples):
    # Render each ShareGPT conversation into a single Phi-3 formatted training string
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
    return { "text" : texts, }


dataset = dataset.map(formatting_prompts_func, batched = True,)


**Sample structure of dataset**

In [None]:
idx = 5
print("Type: ", type(dataset))
print()
print("-- conversation --")
print(dataset[idx]['conversations'])
print()
print("-- text --")
print(dataset[idx]['text'])

Type:  <class 'datasets.arrow_dataset.Dataset'>

-- conversation --
[{'from': 'human', 'value': "Using this as context 'Our website does not leverage CDNs and instead relies on the primary server to handle all static content delivery.', Answer this question: 'Does the application/framework use content delivery networks (CDNs) to minimize recomputation or fetching of static data?'"}, {'from': 'gpt', 'value': 'Judgement: No, Explanation: The reliance on the primary server for all static content delivery suggests that CDNs are not utilized, which can affect performance and scalability.'}]

-- text --
<|user|>
Using this as context 'Our website does not leverage CDNs and instead relies on the primary server to handle all static content delivery.', Answer this question: 'Does the application/framework use content delivery networks (CDNs) to minimize recomputation or fetching of static data?'<|end|>
<|assistant|>
Judgement: No, Explanation: The reliance on the primary server for all static c

<a name="Train"></a>
### Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We train for 3 full epochs (`num_train_epochs=3`). For a quick trial run, uncomment `max_steps = 60` instead; when set, it overrides `num_train_epochs` and caps training at 60 steps. We also support TRL's `DPOTrainer`!

In [None]:
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 8,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs=3,
        #max_steps = 60,
        learning_rate = 1e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 10,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 42,
        output_dir = "/content/drive/MyDrive/Model_checkpoints",
        save_strategy = "epoch"
    ),
)


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
2.285 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 1,558 | Num Epochs = 3
O^O/ \_/ \    Batch size per device = 8 | Gradient Accumulation steps = 4
\        /    Total batch size = 32 | Total steps = 144
 "-____-"     Number of trainable parameters = 29,884,416


Step,Training Loss
10,3.1299
20,1.9359
30,1.3739
40,1.1068
50,1.0189
60,0.9539
70,0.9322
80,0.9045
90,0.8903
100,0.8804
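
The batch size and step count in the training banner can be reproduced from the `TrainingArguments`. The sketch below is a hedged reconstruction: the helper `total_optimizer_steps` is hypothetical, and its rounding (a trailing partial accumulation group does not count as its own step) is an assumption chosen because it matches the logged total.

```python
import math

# Sketch: reproduce "Total batch size" and "Total steps" from the banner.
# total_optimizer_steps is a hypothetical helper; the rounding rules are an
# assumption that happens to match the logged values.
def total_optimizer_steps(num_examples, per_device_batch, grad_accum, epochs, num_gpus=1):
    # Mini-batches the dataloader yields in one epoch (last one may be partial).
    batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
    # One optimizer update every `grad_accum` mini-batches.
    updates_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return math.ceil(epochs * updates_per_epoch)

effective_batch = 8 * 4 * 1   # per-device batch * accumulation steps * GPUs
steps = total_optimizer_steps(num_examples=1558, per_device_batch=8, grad_accum=4, epochs=3)
print(f"Effective batch size = {effective_batch}, total steps = {steps}")
```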


In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

500.2925 seconds used for training.
8.34 minutes used for training.
Peak reserved memory = 3.342 GB.
Peak reserved memory for training = 1.057 GB.
Peak reserved memory % of max memory = 22.661 %.
Peak reserved memory for training % of max memory = 7.167 %.


### GGUF / llama.cpp Conversion
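To save to `GGUF`, Unsloth clones `llama.cpp` to run the conversion. Calling `save_pretrained_gguf` without a `quantization_method` saves to `q8_0`; this notebook passes `q4_k_m` explicitly to get a smaller file. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.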

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `Ollama` locally.
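
One possible way to load the exported file into Ollama is through a small `Modelfile`. The sketch below writes one from Python. The helper `build_modelfile`, the GGUF path, and the choice of `<|end|>` as a stop token are all assumptions for illustration; check where `save_pretrained_gguf` actually placed the file, and whether your Ollama version needs an explicit `TEMPLATE`, before relying on it.

```python
from pathlib import Path

# Sketch: write a minimal Ollama Modelfile for the exported GGUF.
# build_modelfile is a hypothetical helper; the path and stop token below
# are assumptions, not values confirmed by this notebook.
def build_modelfile(gguf_path, stop_token="<|end|>"):
    """Return Modelfile text that points Ollama at a local GGUF file."""
    return f'FROM {gguf_path}\nPARAMETER stop "{stop_token}"\n'

# ASSUMPTION: adjust this to the real location of the exported file.
modelfile_text = build_modelfile("./model/model-unsloth-Q4_K_M.gguf")
Path("Modelfile").write_text(modelfile_text, encoding="utf-8")
print(modelfile_text)

# Afterwards, in a shell (the model name is your choice):
#   ollama create my-finetune -f Modelfile
#   ollama run my-finetune
```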
