<a href="https://colab.research.google.com/github/drdholu/OptiMate/blob/fine-tune/nb/Llama3.2_(1B_and_3B)-Conversational.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Installation

In [3]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.


model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "./optim8-model",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.3.19: Fast Llama patching. Transformers: 4.50.0.dev0.
   \\   /|    NVIDIA GeForce RTX 3050 6GB Laptop GPU. Num GPUs = 1. Max memory: 6.0 GB. Platform: Windows.
O^O/ \_/ \    Torch: 2.6.0+cu126. CUDA: 8.6. CUDA Toolkit: 12.6. Triton: 3.2.0
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


Loading checkpoint shards: 100%|██████████| 2/2 [00:39<00:00, 19.85s/it]


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [4]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
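
To make the "1 to 10%" figure concrete, here is a rough sketch of where the LoRA parameter count comes from. Each adapted linear layer with `in` inputs and `out` outputs gains two low-rank matrices holding `r * (in + out)` parameters in total. The layer shapes below are assumptions taken from the public Llama 3.2 3B configuration (hidden size 3072, MLP size 8192, 8 key/value heads of dimension 128, 28 layers); they are not read from the loaded checkpoint. Under those assumptions the total agrees with the `Trainable parameters = 24,313,856` line printed in the training log.

```python
# Sketch: count LoRA parameters for r = 16 on the seven target modules.
# All shapes are ASSUMED from the public Llama 3.2 3B config, not read from the model.
R = 16
HIDDEN = 3072        # assumed hidden size
KV_DIM = 8 * 128     # assumed: 8 key/value heads x head dimension 128
MLP = 8192           # assumed MLP (intermediate) size
NUM_LAYERS = 28      # assumed number of decoder layers

# (in_features, out_features) for each adapted projection
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """Parameters in A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in TARGET_SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable LoRA parameters")
```

Doubling `r` doubles this count, which is one way to reason about the suggested values 8 to 128.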

<a name="Data"></a>
## Data Prep

In [6]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)

# def formatting_prompts_func(examples):
#     convos = examples["conversations"]
#     texts = [tokenizer.apply_chat_template(convo, tokenize = False, add_generation_prompt = False) for convo in convos]
#     return { "text" : texts, }
# pass

from datasets import load_dataset
# ds = load_dataset("google/code_x_glue_cc_code_refinement", "small")
# ds = load_dataset("ayeshgk/code_x_glue_cc_code_refinement_annotated")
ds2 = load_dataset("google/code_x_glue_cc_code_to_code_trans")
# dsGo = load_dataset("google/code_x_glue_ct_code_to_text", "go")
# dsJava = load_dataset("google/code_x_glue_ct_code_to_text", "java")
# dsJs = load_dataset("google/code_x_glue_ct_code_to_text", "javascript")

Generating train split: 100%|██████████| 10300/10300 [00:00<00:00, 32232.46 examples/s]
Generating validation split: 100%|██████████| 500/500 [00:00<00:00, 83359.25 examples/s]
Generating test split: 100%|██████████| 1000/1000 [00:00<00:00, 121419.18 examples/s]


In [12]:
print(ds2['train'][1]['java'])
print(ds2['train'][1]['cs'])

public UpdateJourneyStateResult updateJourneyState(UpdateJourneyStateRequest request) {request = beforeClientExecution(request);return executeUpdateJourneyState(request);}

public virtual UpdateJourneyStateResponse UpdateJourneyState(UpdateJourneyStateRequest request){var options = new InvokeOptions();options.RequestMarshaller = UpdateJourneyStateRequestMarshaller.Instance;options.ResponseUnmarshaller = UpdateJourneyStateResponseUnmarshaller.Instance;return Invoke<UpdateJourneyStateResponse>(request, options);}



In [None]:
# # for go dataset
# from unsloth.chat_templates import get_chat_template

# tokenizer = get_chat_template(
#     tokenizer,
#     chat_template = "llama-3.1",
# )

# from datasets import load_dataset
# dsGo = load_dataset("google/code_x_glue_ct_code_to_text", "go")

In [None]:
# dsGo['train']

Dataset({
    features: ['id', 'repo', 'path', 'func_name', 'original_string', 'language', 'code', 'code_tokens', 'docstring', 'docstring_tokens', 'sha', 'url'],
    num_rows: 167288
})

In [None]:
# dsGo['train'][1]

In [None]:
# ds['validation'][1]['buggy']

'<BUG>public TYPE_1 < TYPE_2 > METHOD_1 ( TYPE_3 VAR_1 , java.lang.String VAR_2 ) {return METHOD_1 ( VAR_1 . toString ( ) , VAR_2 ) ;</BUG> }'

### Standardize data

In [None]:
# for ds dataset

# def format_code_dataset(examples):
#     texts = []
#     for i in range(len(examples['id'])):
#         conversation = [
#             {"role": "system", "content": "You are a code repair assistant. Focus only on fixing the code within <BUG> tags. Do not add additional methods or variations."},
#             {"role": "user", "content": f"Fix this code:\n{examples['buggy'][i]}"},
#             {"role": "assistant", "content": f"{examples['fixed'][i]}"}
#         ]
        
#         text = tokenizer.apply_chat_template(
#             conversation,
#             tokenize=False,
#             add_generation_prompt=False
#         )
#         texts.append(text)
    
#     return {"text": texts}

# # Apply the formatting function to your dataset
# dataset = ds.map(format_code_dataset, batched=True, remove_columns=["buggy", "fixed"])
# print(dataset['train'][0]['text'])

Map: 100%|██████████| 46680/46680 [00:01<00:00, 34954.82 examples/s]
Map: 100%|██████████| 5835/5835 [00:00<00:00, 35488.30 examples/s]
Map: 100%|██████████| 5835/5835 [00:00<00:00, 34978.56 examples/s]

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

You are a code repair assistant. Focus only on fixing the code within <BUG> tags. Do not add additional methods or variations.<|eot_id|><|start_header_id|>user<|end_header_id|>

Fix this code:
public java.lang.String METHOD_1 ( ) { <BUG> return new TYPE_1 ( STRING_1 ) . format ( VAR_1 [ ( ( VAR_1 . length ) - 1 ) ] . getTime ( ) ) ;</BUG> }<|eot_id|><|start_header_id|>assistant<|end_header_id|>

public java.lang.String METHOD_1 ( ) { return new TYPE_1 ( STRING_1 ) . format ( VAR_1 [ ( ( type ) - 1 ) ] . getTime ( ) ) ; } 
<|eot_id|>





In [18]:
# for ds2 dataset

# ds2['train'][1]['java']
# ds2['train'][1]['cs']

def format_code_dataset(examples):
    texts = []
    for i in range(len(examples['id'])):
        conversation = [
            {"role": "system", "content": "You are a code optimizer assistant. Understand the given java code and then understand how it is modified to be made better. Do not make variations."},
            {"role": "user", "content": f"Fix this code:\n{examples['java'][i]}"},
            {"role": "assistant", "content": f"This is the fixed code: {examples['cs'][i]}"}
        ]
        
        text = tokenizer.apply_chat_template(
            conversation,
            tokenize=False,
            add_generation_prompt=False
        )
        texts.append(text)
    
    return {"text": texts}

# Apply the formatting function to your dataset
dataset = ds2.map(format_code_dataset, batched=True, remove_columns=['java', 'cs'])
print(dataset)
print(dataset['train'][1]['text'])

DatasetDict({
    train: Dataset({
        features: ['id', 'text'],
        num_rows: 10300
    })
    validation: Dataset({
        features: ['id', 'text'],
        num_rows: 500
    })
    test: Dataset({
        features: ['id', 'text'],
        num_rows: 1000
    })
})
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 July 2024

You are a code optimizer assistant. Understand the given java code and then understand how it is modified to be made better. Do not make variations.<|eot_id|><|start_header_id|>user<|end_header_id|>

Fix this code:
public UpdateJourneyStateResult updateJourneyState(UpdateJourneyStateRequest request) {request = beforeClientExecution(request);return executeUpdateJourneyState(request);}
<|eot_id|><|start_header_id|>assistant<|end_header_id|>

This is the fixed code: public virtual UpdateJourneyStateResponse UpdateJourneyState(UpdateJourneyStateRequest request){var options = new InvokeOptions();o
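
With `batched=True`, `Dataset.map` passes the function a dict that maps each column name to a list of values, which is why `format_code_dataset` loops over indices. Below is a minimal sketch of that calling convention. `toy_batch` and `format_batch` are hypothetical stand-ins, and the tokenizer and chat template are left out entirely.

```python
# Sketch of the batched calling convention: one dict of column -> list of values.
# `format_batch` is a hypothetical stand-in that skips the chat template.
def format_batch(examples):
    texts = []
    for i in range(len(examples["id"])):
        texts.append(f"USER: {examples['java'][i]}\nASSISTANT: {examples['cs'][i]}")
    # Returning a dict of lists adds (or overwrites) the "text" column
    return {"text": texts}


toy_batch = {
    "id": [0, 1],
    "java": ["int f(){return 1;}", "void g(){}"],
    "cs": ["int F(){return 1;}", "void G(){}"],
}
result = format_batch(toy_batch)
print(result["text"][0])
```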

We look at how the conversations are structured for item 5:

In [19]:
dataset['train'][5]

{'id': 5,
 'text': '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are a code optimizer assistant. Understand the given java code and then understand how it is modified to be made better. Do not make variations.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nFix this code:\npublic CreateBranchCommand setStartPoint(RevCommit startPoint) {checkCallable();this.startCommit = startPoint;this.startPoint = null;return this;}\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThis is the fixed code: public virtual NGit.Api.CreateBranchCommand SetStartPoint(RevCommit startPoint){CheckCallable();this.startCommit = startPoint;this.startPoint = null;return this;}\n<|eot_id|>'}

And we see how the chat template transformed these conversations.

**[Notice]** Llama 3.1 Instruct's default chat template adds `"Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024"` to the system message, so do not be alarmed!

In [20]:
dataset['train'][5]['text']

'<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are a code optimizer assistant. Understand the given java code and then understand how it is modified to be made better. Do not make variations.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nFix this code:\npublic CreateBranchCommand setStartPoint(RevCommit startPoint) {checkCallable();this.startCommit = startPoint;this.startPoint = null;return this;}\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThis is the fixed code: public virtual NGit.Api.CreateBranchCommand SetStartPoint(RevCommit startPoint){CheckCallable();this.startCommit = startPoint;this.startPoint = null;return this;}\n<|eot_id|>'
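
For intuition, the string above can be reproduced with a few lines of plain Python. This is a simplified sketch that mirrors the printed output. It is not the Jinja template that `get_chat_template` installs, and the function name `render_llama31` is made up for illustration.

```python
# Simplified sketch of the layout seen in the output above.
# NOT the real Jinja chat template; `render_llama31` is a hypothetical helper.
DATE_HEADER = "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n"


def render_llama31(messages, add_generation_prompt=False):
    system = ""
    if messages and messages[0]["role"] == "system":
        system = messages[0]["content"]
        messages = messages[1:]
    # The system block is always emitted, even when no system message is given
    text = "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
    text += DATE_HEADER + system + "<|eot_id|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += message["content"] + "<|eot_id|>"
    if add_generation_prompt:
        # Leave an open assistant header so the model writes the reply
        text += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return text


demo = render_llama31(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
    ]
)
print(demo)
```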

<a name="Train"></a>
## Train the model
Now let's use Hugging Face TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but for a full run you can keep `num_train_epochs=1` and remove the `max_steps=60` cap. We also support TRL's `DPOTrainer`!

In [21]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset['train'],
    eval_dataset= dataset['validation'],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 1,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)



Unsloth: Tokenizing ["text"]: 100%|██████████| 10300/10300 [00:01<00:00, 5998.46 examples/s]
Unsloth: Tokenizing ["text"]: 100%|██████████| 500/500 [00:00<00:00, 6793.32 examples/s]


In [23]:
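# Generation parameters used by test_model() further down
generation_config = {
    "max_new_tokens": 200,
    "do_sample": False,      # Greedy decoding
    "temperature": 0.7,      # Ignored while do_sample is False
    "top_p": 0.9,            # Ignored while do_sample is False
    "early_stopping": True,  # Only affects beam search (num_beams > 1)
    "num_beams": 1,
}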

# # Update training configuration
# training_args = TrainingArguments(
#     per_device_train_batch_size=4,
#     gradient_accumulation_steps=4,
#     max_steps=1000,
#     learning_rate=2e-5,
#     lr_scheduler_type="cosine",
#     warmup_ratio=0.1,
#     # Prevent overgeneration
#     # length_penalty=1.0,
#     # max_length=200
# )

We also use Unsloth's `train_on_responses_only` function to compute the loss only on the assistant outputs and ignore the user's inputs.

In [24]:
from unsloth.chat_templates import train_on_responses_only
trainer = train_on_responses_only(
    trainer,
    instruction_part = "<|start_header_id|>user<|end_header_id|>\n\n",
    response_part = "<|start_header_id|>assistant<|end_header_id|>\n\n",
)

Map (num_proc=12): 100%|██████████| 10300/10300 [00:51<00:00, 201.07 examples/s]
Map (num_proc=12): 100%|██████████| 500/500 [00:26<00:00, 19.07 examples/s]


We verify masking is actually done:

In [25]:
tokenizer.decode(trainer.train_dataset[5]["input_ids"])

'<|begin_of_text|><|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\nYou are a code optimizer assistant. Understand the given java code and then understand how it is modified to be made better. Do not make variations.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nFix this code:\npublic CreateBranchCommand setStartPoint(RevCommit startPoint) {checkCallable();this.startCommit = startPoint;this.startPoint = null;return this;}\n<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nThis is the fixed code: public virtual NGit.Api.CreateBranchCommand SetStartPoint(RevCommit startPoint){CheckCallable();this.startCommit = startPoint;this.startPoint = null;return this;}\n<|eot_id|>'

In [26]:
space = tokenizer(" ", add_special_tokens = False).input_ids[0]
tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

# import re

# output_text = tokenizer.decode([space if x == -100 else x for x in trainer.train_dataset[5]["labels"]])

# # Remove excessive spaces
# cleaned_output = re.sub(' +', ' ', output_text).strip()

# print(cleaned_output)

'                                                                                                    This is the fixed code: public virtual NGit.Api.CreateBranchCommand SetStartPoint(RevCommit startPoint){CheckCallable();this.startCommit = startPoint;this.startPoint = null;return this;}\n<|eot_id|>'

We can see the System and Instruction prompts are successfully masked!
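
The masking idea can be sketched in plain Python: every label before a response marker is set to `-100` (the index the loss ignores), and only the tokens between a response marker and the next instruction marker keep their ids. This is an illustration under simplifying assumptions, not Unsloth's implementation. The token ids are toy integers and the helper names are hypothetical.

```python
# Sketch of response-only masking on toy token ids (hypothetical helpers).
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss


def find_sub(seq, sub, start=0):
    """Index of the first occurrence of `sub` in `seq` at or after `start`, else -1."""
    for i in range(start, len(seq) - len(sub) + 1):
        if seq[i:i + len(sub)] == sub:
            return i
    return -1


def mask_non_responses(input_ids, instruction_ids, response_ids):
    labels = [IGNORE_INDEX] * len(input_ids)
    pos = 0
    while True:
        start = find_sub(input_ids, response_ids, pos)
        if start == -1:
            break
        begin = start + len(response_ids)        # first token of the reply
        end = find_sub(input_ids, instruction_ids, begin)
        if end == -1:
            end = len(input_ids)                 # last reply runs to the end
        labels[begin:end] = input_ids[begin:end]
        pos = end
    return labels


# Toy sequence: [10, 11] marks a user turn, [20, 21] marks an assistant turn
toy_ids = [1, 10, 11, 5, 6, 20, 21, 7, 8, 9, 10, 11, 4, 20, 21, 3, 2]
toy_labels = mask_non_responses(toy_ids, [10, 11], [20, 21])
print(toy_labels)
```

Only the two assistant replies (`7, 8, 9` and `3, 2`) keep their ids, which corresponds to the run of blanks followed by the assistant text in the decoded labels above.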

In [27]:
# @title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = NVIDIA GeForce RTX 3050 6GB Laptop GPU. Max memory = 6.0 GB.
4.52 GB of memory reserved.


In [28]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 10,300 | Num Epochs = 1 | Total steps = 60
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 24,313,856/1,827,777,536 (1.33% trained)


Unsloth: Will smartly offload gradients to save VRAM!


Step,Training Loss
1,1.5166
2,1.9319
3,1.865
4,1.4346
5,1.8859
6,1.3359
7,0.9267
8,0.6201
9,0.5244
10,0.5845
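
The banner above reports a total batch size of 8 and 60 total steps. A small sketch of the arithmetic shows how much of the 10,300-example training split such a short run covers. All numbers are copied from the log and the `TrainingArguments`; the variable names are only illustrative.

```python
# Sketch: how much data a 60-step run sees (numbers copied from the log above).
num_examples = 10_300
per_device_batch = 2
grad_accum_steps = 4
num_gpus = 1
max_steps = 60

# One optimizer step consumes this many examples
effective_batch = per_device_batch * grad_accum_steps * num_gpus
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_examples

print(f"effective batch size = {effective_batch}")
print(f"examples seen        = {examples_seen}")
print(f"fraction of an epoch = {fraction_of_epoch:.2%}")
```

So the quick run trains on under 5% of the split, which is worth keeping in mind when reading the evaluation numbers.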


In [29]:
trainer_stats2 = trainer.evaluate()

Unsloth: Not an error, but LlamaForCausalLM does not accept `num_items_in_batch`.
Using gradient accumulation will be very slightly less accurate.
Read more on gradient accumulation issues here: https://unsloth.ai/blog/gradient


In [30]:
#old
print(trainer_stats2)

{'eval_loss': 0.25819337368011475, 'eval_runtime': 101.2733, 'eval_samples_per_second': 4.937, 'eval_steps_per_second': 0.622}


## Checking eval

In [31]:
print("Training Stats:", trainer_stats.metrics)
print("Evaluation Stats:", trainer_stats2)

Training Stats: {'train_runtime': 297.0947, 'train_samples_per_second': 1.616, 'train_steps_per_second': 0.202, 'total_flos': 1625457721196544.0, 'train_loss': 0.42884220331907275}
Evaluation Stats: {'eval_loss': 0.25819337368011475, 'eval_runtime': 101.2733, 'eval_samples_per_second': 4.937, 'eval_steps_per_second': 0.622}
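
One way to read these numbers, as a hedged sketch: if `eval_loss` is the mean per-token cross-entropy in nats (an assumption about how the loss is averaged), its exponential is the perplexity on the evaluated tokens. The reported throughput can also be cross-checked against the 60 steps of 8 examples each.

```python
# Sketch: derive perplexity and cross-check throughput from the printed stats.
import math

eval_loss = 0.25819337368011475   # copied from the evaluation output
train_runtime_s = 297.0947        # copied from the training output
examples_seen = 60 * 8            # max_steps x total batch size

# ASSUMPTION: eval_loss is mean cross-entropy per token, in nats
perplexity = math.exp(eval_loss)
samples_per_second = examples_seen / train_runtime_s

print(f"perplexity         = {perplexity:.3f}")
print(f"samples per second = {samples_per_second:.3f}")
```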


In [None]:
# ds['test'][2]['buggy']

'<BUG>private void METHOD_1 ( java.lang.Class VAR_1 ) {</BUG> android.content.Intent intent = new android.content.Intent ( this , VAR_1 ) ; METHOD_2 ( intent ) ; }'

In [18]:
# Test the model on a few examples
test_cases = [
    "private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; <BUG> return new TYPE_1 ( VAR_2 , VAR_2 ) ;</BUG> }",
    "public TYPE_1 METHOD_1 ( ) { TYPE_1 output = VAR_1 [ VAR_2 ] ; <BUG> if ( ( VAR_2 ) > 0 ) {</BUG> VAR_2 = ( VAR_2 ) - 1 ; } else { } return output ; }",
    "<BUG>private void METHOD_1 ( java.lang.Class VAR_1 ) {</BUG> android.content.Intent intent = new android.content.Intent ( this , VAR_1 ) ; METHOD_2 ( intent ) ; }",
    "public void METHOD_1 ( ) { <BUG> for ( TYPE_1 VAR_1 : VAR_2 ) VAR_1 . METHOD_2 ( ) ;</BUG> METHOD_3 ( ) ; <BUG> if ( ( VAR_3 ) != null ) VAR_3 . METHOD_1 ( ) ;</BUG> }"
    # Add 2-3 more test cases
]

def test_model(model, tokenizer, test_case):
    # Add system prompt to focus the model
    prompt = f"""You are a code repair assistant. Focus only on fixing the code within <BUG> tags. Do not add additional methods or variations.
                Fix this code:
                {test_case}
            """
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True).to("cuda")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            **generation_config
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

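# Test with stricter evaluation
# Requires `ds` (the code_refinement dataset, commented out in Data Prep) to be loaded
for i in range(3):
    test_case = ds['test'][i]['buggy']
    expected = ds['test'][i]['fixed']
    output = test_model(model, tokenizer, test_case)  # Generate from the same input that is printed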
    
    print(f"\nTest Case {i+1}:")
    print("Input:", test_case)
    print("Expected:", expected)
    print("Output:", output)


Test Case 1:
Input: private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; <BUG> return new TYPE_1 ( VAR_2 , VAR_2 ) ;</BUG> }
Expected: private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; return new TYPE_1 ( VAR_2 , VAR_2 , this , VAR_1 ) ; } 

Output: You are a code repair assistant. Focus only on fixing the code within <BUG> tags. Do not add additional methods or variations.
                Fix this code:
                private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; <BUG> return new TYPE_1 ( VAR_2, VAR_2 ) ;</BUG> }
             private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; return new TYPE_1 ( VAR_2, VAR_2 ) ; } 
             private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; return new TYPE_1 ( VAR_2, VAR_2 ) ; } 
             private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; return new

In [32]:
# @title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(
    f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training."
)
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

297.0947 seconds used for training.
4.95 minutes used for training.
Peak reserved memory = 5.463 GB.
Peak reserved memory for training = 0.943 GB.
Peak reserved memory % of max memory = 91.05 %.
Peak reserved memory for training % of max memory = 15.717 %.


<a name="Inference"></a>
## Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Unsloth_Studio.ipynb)**

We use `min_p = 0.1` and `temperature = 1.5`. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
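
As a rough illustration of what these two settings do, the sketch below applies temperature scaling and then min-p filtering to a toy list of logits. It is a simplified plain-Python stand-in, not the code path `model.generate` uses, and `min_p_filter` is a hypothetical name. The idea is that a high temperature flattens the distribution, while min-p discards tokens whose probability falls below `min_p` times the top token's probability.

```python
# Sketch of temperature + min-p filtering on toy logits (hypothetical helper).
import math


def min_p_filter(logits, temperature=1.5, min_p=0.1):
    # Temperature scaling: values > 1 flatten the distribution
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]   # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep tokens with probability at least min_p x the top probability
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]


toy_logits = [5.0, 4.0, 0.0, -3.0]
filtered = min_p_filter(toy_logits)
print([round(p, 3) for p in filtered])
```

In this toy case the two unlikely tokens are removed before sampling, while the two plausible ones keep a fairly even share of the probability mass.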

In [21]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Fix this: private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; return new TYPE_1 ( VAR_2 , VAR_2 ) ; }",},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.5, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


['<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nFix this: private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; return new TYPE_1 ( VAR_2, VAR_2 ) ; }<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nprivate TYPE_1 getType ( TYPE_2 VAR_1 ) { return new TYPE_1 ( VAR_2, VAR_2 ) ; } \n<|eot_id|>']

You can also use a `TextStreamer` for continuous inference, so you can see the generation token by token instead of waiting for the whole output!

In [23]:
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Fix this: private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; return new TYPE_1 ( VAR_2 , VAR_2 ) ; }",},
]

inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer, skip_prompt = True)
_ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
                   use_cache = True, temperature = 1.5, min_p = 0.1)

private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; return new TYPE_1 ( VAR_2, VAR_1 ) ; } 
<|eot_id|>


<a name="Save"></a>
## Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!
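
To see why adapters and a merged model are different artifacts, here is a minimal sketch of the standard LoRA merge, `W' = W + (alpha / r) * B @ A`, on tiny hand-written matrices. It assumes plain LoRA scaling (no rank-stabilized LoRA) and uses a hypothetical `merge_lora` helper. It is not what `save_pretrained_merged` runs internally.

```python
# Sketch of merging a LoRA update into a base weight (toy matrices, hypothetical helper).
def matmul(a, b):
    """Multiply an (m x k) matrix by a (k x n) matrix, both as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def merge_lora(weight, lora_a, lora_b, alpha, r):
    scale = alpha / r                    # standard LoRA scaling
    delta = matmul(lora_b, lora_a)       # (out x r) @ (r x in) -> (out x in)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


weight = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]]   # base weight: out = 2, in = 3
lora_a = [[0.5, 0.0, 1.0]]                     # A: r = 1, in = 3
lora_b = [[2.0], [0.0]]                        # B: out = 2, r = 1
merged = merge_lora(weight, lora_a, lora_b, alpha=1.0, r=1)
print(merged)
```

With `lora_alpha = 16` and `r = 16` as configured earlier, the scale `alpha / r` is likewise 1.0. Saving only `A` and `B` keeps the files small, while a merged save writes the full `W'` matrices.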

In [34]:
# Save both model and tokenizer to local directory
model.save_pretrained("optim8")  # Saves the LoRA adapters to ./optim8
tokenizer.save_pretrained("optim8")

# For inference mode (optional but recommended for better performance)
FastLanguageModel.for_inference(model)

# If you want to save as merged model in different formats:
# 16-bit merged version (recommended)
model.save_pretrained_merged("./optim8-model", tokenizer, save_method="merged_16bit")

# OR 4-bit merged version (smaller size)
# model.save_pretrained_merged("local_model", tokenizer, save_method="merged_4bit")

# OR GGUF format
# model.save_pretrained_gguf("local_model", tokenizer)

Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 0.0 out of 11.71 RAM for saving.
Unsloth: Saving model... This might take 5 minutes ...


100%|██████████| 28/28 [00:30<00:00,  1.10s/it]


Unsloth: Saving tokenizer... Done.
Done.


In [None]:
# model.save_pretrained("lora_model")  # Local saving
# tokenizer.save_pretrained("lora_model")
# # model.push_to_hub("your_name/lora_model", token = "...") # Online saving
# # tokenizer.push_to_hub("your_name/lora_model", token = "...") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
# if False:
#     from unsloth import FastLanguageModel
#     model, tokenizer = FastLanguageModel.from_pretrained(
#         model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
#         max_seq_length = max_seq_length,
#         dtype = dtype,
#         load_in_4bit = load_in_4bit,
#     )
#     FastLanguageModel.for_inference(model) # Enable native 2x faster inference

# messages = [
#     {"role": "user", "content": "Fix this: private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1 ) ; return new TYPE_1 ( VAR_2 , VAR_2 ) ; }"},
# ]
# inputs = tokenizer.apply_chat_template(
#     messages,
#     tokenize = True,
#     add_generation_prompt = True, # Must add for generation
#     return_tensors = "pt",
# ).to("cuda")

# from transformers import TextStreamer
# text_streamer = TextStreamer(tokenizer, skip_prompt = True)
# _ = model.generate(input_ids = inputs, streamer = text_streamer, max_new_tokens = 128,
#                    use_cache = True, temperature = 1.5, min_p = 0.1)

private TYPE_1 getType ( TYPE_2 VAR_1 ) { TYPE_3 VAR_2 = new TYPE_3 ( STRING_1, VAR_1 ) ; return new TYPE_1 ( VAR_2, VAR_2 ) ; } 
<|eot_id|>


You can also use Hugging Face PEFT's `AutoPeftModelForCausalLM`. Only use this if you do not have `unsloth` installed. It can be hopelessly slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
# if False:
#     # I highly do NOT suggest - use Unsloth if possible
#     from peft import AutoPeftModelForCausalLM
#     from transformers import AutoTokenizer
#     model = AutoPeftModelForCausalLM.from_pretrained(
#         "lora_model", # YOUR MODEL YOU USED FOR TRAINING
#         load_in_4bit = load_in_4bit,
#     )
#     tokenizer = AutoTokenizer.from_pretrained("lora_model")

### Saving to float16 for vLLM

We also support saving to `float16` directly. Select `merged_16bit` for float16 or `merged_4bit` for int4. We also allow `lora` adapters as a fallback. Use `push_to_hub_merged` to upload to your Hugging Face account! You can go to https://huggingface.co/settings/tokens for your personal tokens.

In [None]:
# # Merge to 16bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# # Merge to 4bit
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# # Just LoRA adapters
# if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
# if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")

### GGUF / llama.cpp Conversion
To save to `GGUF` / `llama.cpp`, we support it natively now! We clone `llama.cpp` and save to `q8_0` by default. All quantization methods such as `q4_k_m` are allowed. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to HF.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.
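
For a feel of what an 8-bit block format such as `q8_0` does, here is a simplified sketch: weights are split into blocks of 32, and each block stores one scale plus 32 signed 8-bit integers. This is an illustration only. It keeps the scale as a Python float and does no byte packing, so it is not the GGUF on-disk layout, and the function names are hypothetical.

```python
# Simplified sketch of block-wise 8-bit quantization in the spirit of q8_0.
import math

BLOCK_SIZE = 32


def quantize_blocks(values):
    blocks = []
    for start in range(0, len(values), BLOCK_SIZE):
        block = values[start:start + BLOCK_SIZE]
        amax = max(abs(v) for v in block)
        scale = amax / 127 if amax > 0 else 0.0   # map the largest value to +/-127
        quants = [round(v / scale) if scale else 0 for v in block]
        blocks.append((scale, quants))
    return blocks


def dequantize_blocks(blocks):
    return [scale * q for scale, quants in blocks for q in quants]


weights = [3.0 * math.sin(i) for i in range(64)]   # toy "weights"
blocks = quantize_blocks(weights)
restored = dequantize_blocks(blocks)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(blocks)} blocks, max round-trip error = {max_error:.5f}")
```

The round-trip error per value is bounded by half of that block's scale. The `q4_k_m` and `q5_k_m` variants use fewer bits per weight and a more elaborate layout, so they trade more accuracy for smaller files.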

[**NEW**] To finetune and auto export to Ollama, try our [Ollama notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)

In [None]:
# # Save to 8bit Q8_0
# if False: model.save_pretrained_gguf("model", tokenizer,)
# # Remember to go to https://huggingface.co/settings/tokens for a token!
# # And change hf to your username!
# if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# # Save to 16bit GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# # Save to q4_k_m GGUF
# if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
# if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")

# # Save to multiple GGUF options - much faster if you want multiple!
# if False:
#     model.push_to_hub_gguf(
#         "hf/model", # Change hf to your username!
#         tokenizer,
#         quantization_method = ["q4_k_m", "q8_0", "q5_k_m",],
#         token = "", # Get a token at https://huggingface.co/settings/tokens
#     )

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in llama.cpp or a UI-based system like Jan or Open WebUI. You can install Jan [here](https://github.com/janhq/jan) and Open WebUI [here](https://github.com/open-webui/open-webui).

And we're done! If you have any questions about Unsloth, find a bug, need help, or want to keep up with the latest LLM news and join projects, feel free to join our [Discord](https://discord.gg/unsloth)!

Some other links:
1. Train your own reasoning model - Llama GRPO notebook [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.1_(8B)-GRPO.ipynb)
2. Saving finetunes to Ollama. [Free notebook](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3_(8B)-Ollama.ipynb)
3. Llama 3.2 Vision finetuning - Radiography use case. [Free Colab](https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Llama3.2_(11B)-Vision.ipynb)
4. See notebooks for DPO, ORPO, Continued pretraining, conversational finetuning and more on our [documentation](https://docs.unsloth.ai/get-started/unsloth-notebooks)!

<div class="align-center">
  <a href="https://unsloth.ai"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord.png" width="145"></a>
  <a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a>

  Join Discord if you need help + ⭐️ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐️
</div>
