<a href="https://colab.research.google.com/github/dhirajdj30/Devians-llama/blob/main/Fine_Tuning_LLama_3_8B.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


First, we check the GPU's CUDA compute capability and install the dependencies that match it, which avoids version conflicts.

In [3]:
%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install triton
if major_version >= 8:
    !pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes
else:
    !pip install --no-deps xformers trl peft accelerate bitsandbytes
pass
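The branch in the cell above keys on the GPU's CUDA compute capability: cards with a major version of 8 or higher (Ampere and newer) also get `flash-attn`, while older cards such as a T4 (7.5) do not. A minimal sketch of the same decision as a plain function, so it can be checked without a GPU. `packages_for` is a hypothetical helper, not part of Unsloth, and the package lists are copied from the cell above.

```python
def packages_for(major_version: int) -> list[str]:
    """Return the extra packages the install cell would pick for a GPU."""
    common = ["xformers", "trl", "peft", "accelerate", "bitsandbytes"]
    if major_version >= 8:
        # Compute capability 8.x and newer: also install flash-attn
        # together with the helpers needed to build it.
        return ["packaging", "ninja", "einops", "flash-attn"] + common
    # Older GPUs, for example a Tesla T4 with compute capability 7.5.
    return common


print(packages_for(7))
print(packages_for(8))
```

In the notebook itself, `major_version` comes from `torch.cuda.get_device_capability()`.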

Next, we load a 4-bit quantized language model. Unsloth publishes several pre-quantized checkpoints, including Llama 3 8B (trained on 15 trillion tokens); 4-bit quantization reduces the memory needed to hold the weights.


In [4]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! Llama 3 is up to 8k
dtype = None
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit",
    "unsloth/gemma-7b-it-bnb-4bit",
    "unsloth/gemma-2b-bnb-4bit",
    "unsloth/gemma-2b-it-bnb-4bit",
    "unsloth/llama-3-8b-bnb-4bit",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", # Llama-3 70b also works (just change the model name)
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth 2024.8: Fast Llama patching. Transformers = 4.44.2.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.4.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.27.post2. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.70G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/198 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/50.6k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/350 [00:00<?, ?B/s]
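To see why 4-bit loading matters on a T4, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes a round 8 billion parameters and ignores activations, optimizer state, and quantization metadata, so treat the numbers as a sketch rather than a measurement.

```python
GIB = 1024 ** 3  # the notebook also reports memory in units of 1024^3 bytes


def weight_memory_gib(n_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights, in GiB."""
    return n_params * bits_per_param / 8 / GIB


n_params = 8e9  # assumption: round figure for an "8B" model

for label, bits in [("16-bit", 16), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gib(n_params, bits):.1f} GiB")
# 16-bit: 14.9 GiB
# 4-bit: 3.7 GiB
```

At 16 bits the weights alone would already exceed the 14.748 GB reported for the T4 above. The actual 4-bit download is larger than this estimate (5.70 GB), most likely because not every tensor is quantized and the quantized blocks carry extra scaling data.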



---



Next, we integrate LoRA adapters into our model, which allows us to efficiently update just a fraction of the model's parameters, enhancing training speed and reducing computational load.

In [5]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)

Unsloth 2024.8 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
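A LoRA adapter of rank `r` on a linear layer with `d_in` inputs and `d_out` outputs adds two small matrices with `r * (d_in + d_out)` parameters in total. The sketch below counts the adapter parameters for the seven target modules above. The layer shapes are assumptions taken from the published Llama 3 8B configuration (they are not stated in this notebook): hidden size 4096, MLP size 14336, 32 layers, and 8 key/value heads of 128 dimensions.

```python
# Assumed Llama 3 8B shapes (not stated in this notebook).
HIDDEN = 4096
INTERMEDIATE = 14336
KV_DIM = 8 * 128  # grouped-query attention: 8 key/value heads x 128 dims
NUM_LAYERS = 32
RANK = 16  # matches r = 16 in the cell above

# (d_in, d_out) for each target module in one transformer layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters in the A (r x d_in) and B (d_out x r) matrices."""
    return r * (d_in + d_out)


per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in shapes.values())
total = per_layer * NUM_LAYERS
print(f"{total:,}")  # 41,943,040
```

Under these assumptions the total matches the trainable-parameter count that the trainer prints later in the notebook.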


<a name="Data"></a>
### Data Prep
The commented-out block below loads the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), a filtered version of the original 52K-example [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). The active code instead loads a custom JSON file whose records have the same `instruction`, `input`, and `output` fields. You can replace this code section with your own data prep.

Then, we define a prompt template with instruction, input, and response sections and apply it to every record. An EOS token is appended to each example so the model learns where a response ends.


In [6]:
# import json
# from datasets import load_dataset

# # Define the Alpaca prompt format
# alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

# ### Instruction:
# {}

# ### Input:
# {}

# ### Response:
# {}"""

# EOS_TOKEN = tokenizer.eos_token  # Ensure the end of sequence token is included

# # Function to format the prompts
# def formatting_prompts_func(examples):
#     instructions = examples["instruction"]
#     inputs       = examples["input"]
#     outputs      = examples["output"]
#     texts = []
#     for instruction, input, output in zip(instructions, inputs, outputs):
#         text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
#         texts.append(text)
#     return { "text" : texts, }

# # Load the Alpaca dataset
# dataset = load_dataset("yahma/alpaca-cleaned", split="train")


import json
from datasets import Dataset  # Import the Dataset class from the datasets library
from transformers import AutoTokenizer

# Define the Alpaca prompt format
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

# Initialize tokenizer (replace 'gpt2' with your specific model if needed)
# tokenizer = AutoTokenizer.from_pretrained('gpt2')
EOS_TOKEN = tokenizer.eos_token  # Ensure the end of sequence token is included

# Function to format the prompts
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs = examples["input"]
    outputs = examples["output"]
    texts = []
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

# Load your JSON data
with open('/content/new_formatted_data.json', 'r', encoding='utf-8') as f:
    json_data = json.load(f)

# Create a dataset from the JSON data
dataset = Dataset.from_dict({
    "instruction": [entry["instruction"] for entry in json_data],
    "input": [entry["input"] for entry in json_data],
    "output": [entry["output"] for entry in json_data]
})

# Format the dataset using the function
formatted_dataset = dataset.map(formatting_prompts_func, batched=True)

# Show a sample of formatted data
print(formatted_dataset["text"][:2])



Map:   0%|          | 0/2069 [00:00<?, ? examples/s]

["Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nUnderstand the given product information and answer the question asked by acting as a human\nProduct Title: Apple MacBook Air Apple M1 - (16 GB/256 GB SSD/Mac OS Big Sur) Z124J001KD\nBrand: Apple\nHighlights: Stylish & Portable Thin and Light Laptop, 13.3 inch Quad LED Backlit IPS Display (227 PPI, 400 nits Brightness, Wide Colour (P3), True Tone Technology), Light Laptop without Optical Disk Drive\nSpecifications: {'General': {'Sales Package': ['MacBook Air, 30 W USB-C Power Adapter, USB-C Charge Cable (2m), User Guide, Warranty Documents'], 'Model Number': ['Z124J001KD'], 'Part Number': ['Z12400095'], 'Series': ['MacBook Air'], 'Color': ['Space Grey'], 'Type': ['Thin and Light Laptop'], 'Suitable For': ['Processing & Multitasking'], 'Battery Backup': ['Upto 15 Hours'], 'Power Supply': ['30 W AC Adapter'], 
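With `batched=True`, `Dataset.map` passes the formatting function a dictionary of lists (one list per column) rather than one record at a time. The sketch below runs the same formatting logic on a hand-written batch so the resulting text can be inspected without loading a tokenizer. The sample record and the `EOS_TOKEN` string are stand-ins chosen for illustration; in the notebook the EOS token comes from `tokenizer.eos_token`.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = "<|end_of_text|>"  # stand-in; the notebook uses tokenizer.eos_token


def format_batch(examples: dict) -> dict:
    """Turn a batch (dict of column lists) into a list of training texts."""
    texts = []
    for instruction, input_text, output in zip(
        examples["instruction"], examples["input"], examples["output"]
    ):
        texts.append(ALPACA_PROMPT.format(instruction, input_text, output) + EOS_TOKEN)
    return {"text": texts}


# Hypothetical one-record batch in the same shape Dataset.map would pass in.
batch = {
    "instruction": ["Answer the question about the product."],
    "input": ["What is the OS of this machine"],
    "output": ["Mac OS Big Sur"],
}

texts = format_batch(batch)["text"]
print(texts[0])
```

Every formatted example ends with the EOS token; without it, a fine-tuned model tends to keep generating past the end of its answer.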

<a name="Train"></a>
### Train the model
- We run only 15 steps (`max_steps = 15`) to speed things up. For a full run, set `num_train_epochs` and remove `max_steps`; when both are set, `max_steps` takes precedence.
- At this stage, we configure the training run: batch size, gradient accumulation, learning rate, optimizer, and numeric precision.

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = formatted_dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 15, # Short demo run; increase this to make the model learn "better"
        num_train_epochs = 4, # Ignored while max_steps is set (max_steps takes precedence)
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

Map (num_proc=2):   0%|          | 0/2069 [00:00<?, ? examples/s]

max_steps is given, it will override any value given in num_train_epochs
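The message above means the run length is governed by `max_steps`, not by `num_train_epochs`. A small worked calculation, using only the values from the training cell and the dataset size shown in the progress bar, shows how little of the data a 15-step run actually covers. The one-epoch step count uses simple ceiling division; the trainer's own rounding may differ slightly.

```python
import math

per_device_batch = 2   # per_device_train_batch_size
grad_accum = 4         # gradient_accumulation_steps
num_gpus = 1
max_steps = 15
num_examples = 2069    # dataset size from the Map progress bar

# One optimizer step consumes this many examples.
effective_batch = per_device_batch * grad_accum * num_gpus
examples_seen = effective_batch * max_steps
fraction_of_epoch = examples_seen / num_examples
steps_for_one_epoch = math.ceil(num_examples / effective_batch)

print(f"Effective batch size: {effective_batch}")        # 8
print(f"Examples seen: {examples_seen}")                  # 120
print(f"Fraction of one epoch: {fraction_of_epoch:.1%}")  # 5.8%
print(f"Steps for one epoch: {steps_for_one_epoch}")      # 259
```

So a full pass over the data needs on the order of 259 optimizer steps with these settings, compared with the 15 used here.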


In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.748 GB.
5.668 GB of memory reserved.


In [9]:
# We're now kicking off the actual training of our model, which will spit out some statistics showing us how well it learns
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 2,069 | Num Epochs = 1
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 15
 "-____-"     Number of trainable parameters = 41,943,040


| Step | Training Loss |
|------|---------------|
| 1    | 1.2595        |
| 2    | 1.4426        |
| 3    | 1.2478        |
| 4    | 1.2516        |
| 5    | 1.2665        |
| 6    | 1.3098        |
| 7    | 1.0819        |
| 8    | 1.0361        |
| 9    | 0.9093        |
| 10   | 0.8065        |
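The training arguments combine `warmup_steps = 5` with `lr_scheduler_type = "linear"`, so the learning rate ramps up for the first five steps and then decays linearly towards zero at step 15. The function below is a sketch of that shape using the notebook's values; it is not the trainer's code, and the exact step indexing inside the trainer may be off by one relative to this.

```python
def linear_lr(step: int, base_lr: float = 2e-4,
              warmup_steps: int = 5, total_steps: int = 15) -> float:
    """Linear warmup followed by linear decay to zero."""
    if step < warmup_steps:
        # Warmup: scale from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: scale from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in range(0, 16, 5):
    print(step, linear_lr(step))
```

With only 15 steps, a third of the run is spent in warmup, which is one reason the loss in the table above only starts dropping clearly after step 6.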


In [11]:
# #@title Show final memory and time stats
# used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
# used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
# used_percentage = round(used_memory         /max_memory*100, 3)
# lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
# print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
# print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
# print(f"Peak reserved memory = {used_memory} GB.")
# print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
# print(f"Peak reserved memory % of max memory = {used_percentage} %.")
# print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

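import torch

# Caution: this reads reserved memory *after* training has already run, so it
# is not a pre-training baseline. The "for training" figures below therefore
# come out as zero. Reuse the `start_gpu_memory` value recorded before
# `trainer.train()` to get a real delta.
start_gpu_memory = round(torch.cuda.memory_reserved() / 1024 / 1024 / 1024, 3)  # Currently reserved GPU memory in GB

# Total memory of GPU 0 in GB
max_memory = round(torch.cuda.get_device_properties(0).total_memory / 1024 / 1024 / 1024, 3)

# Caution: placeholder that overwrites the real result of `trainer.train()`.
# The 1800.0 s runtime is an example value, not a measurement. Use
# `trainer_stats.metrics['train_runtime']` (as in the commented-out code
# above) to report the actual runtime.
trainer_stats = {
    'metrics': {
        'train_runtime': 1800.0  # Example training time in seconds
    }
}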

# Calculate GPU memory usage statistics after training
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)  # Peak memory reserved during training in GB
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)  # Memory used specifically for LoRA training
used_percentage = round(used_memory / max_memory * 100, 3)  # Percentage of peak memory usage relative to total memory
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)  # Percentage of memory used for training relative to total memory

# Print memory and time stats
print(f"{trainer_stats['metrics']['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats['metrics']['train_runtime'] / 60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")


1800.0 seconds used for training.
30.0 minutes used for training.
Peak reserved memory = 10.154 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 68.85 %.
Peak reserved memory for training % of max memory = 0.0 %.
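The 0.0 GB training figure above is a consequence of taking the baseline after training instead of before it. As a sketch of what the report looks like with a pre-training baseline, the function below redoes the arithmetic with the 5.668 GB reading from the earlier memory cell, the 10.154 GB peak, and the 14.748 GB total. It assumes those readings belong to the same session; `memory_report` is a hypothetical helper.

```python
def memory_report(start_gb: float, peak_gb: float, total_gb: float) -> dict:
    """Summarise peak GPU memory and the share attributable to training."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "training_gb": training_gb,
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_pct": round(training_gb / total_gb * 100, 3),
    }


report = memory_report(start_gb=5.668, peak_gb=10.154, total_gb=14.748)
print(report)
# {'training_gb': 4.486, 'peak_pct': 68.85, 'training_pct': 30.418}
```

Under that assumption, training itself reserved about 4.5 GB on top of the loaded model.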


<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [16]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Understand the given product information and answer the question asked by acting as a human\nProduct Title: Apple MacBook Air Apple M1 - (16 GB/256 GB SSD/Mac OS Big Sur) Z124J001KD\nBrand: Apple\nHighlights: Stylish & Portable Thin and Light Laptop, 13.3 inch Quad LED Backlit IPS Display (227 PPI, 400 nits Brightness, Wide Colour (P3), True Tone Technology), Light Laptop without Optical Disk Drive\nSpecifications: {'General': {'Sales Package': ['MacBook Air, 30 W USB-C Power Adapter, USB-C Charge Cable (2m), User Guide, Warranty Documents'], 'Model Number': ['Z124J001KD'], 'Part Number': ['Z12400095'], 'Series': ['MacBook Air'], 'Color': ['Space Grey'], 'Type': ['Thin and Light Laptop'], 'Suitable For': ['Processing & Multitasking'], 'Battery Backup': ['Upto 15 Hours'], 'Power Supply': ['30 W AC Adapter'], 'MS Office Provided': ['No']}, 'Processor And Memory Features': {'Processor Brand': ['Apple'], 'Processor Name': ['M1'], 'SSD': ['Yes'], 'SSD Capacity': ['256 GB'], 'RAM': ['16 GB'], 'RAM Type': ['DDR4'], 'Processor Variant': ['Apple M1 Chip'], 'Expandable Memory': ['Upto 16 GB'], 'Graphic Processor': ['NA'], 'Number of Cores': ['8'], 'Storage Type': ['SSD']}, 'Operating System': {'Operating System': ['Mac OS Big Sur'], 'System Architecture': ['NA']}, 'Port And Slot Features': {'Mic In': ['Yes']}, 'Display And Audio Features': {'Touchscreen': ['No'], 'Screen Size': ['33.78 cm (13.3 inch)'], 'Screen Resolution': ['2560 x 1600 Pixels'], 'Screen Type': ['Quad LED Backlit IPS Display (227 PPI, 400 nits Brightness, Wide Colour (P3), True Tone Technology)'], 'Speakers': ['Built-in Speakers'], 'Internal Mic': ['Three-mic Array with Directional Beamforming'], 'Sound Properties': ['Stereo Speakers, Wide Stereo Sound, Support for Dolby Atmos Playback']}, 'Connectivity Features': {'Wireless LAN': ['IEEE 802.11ax (Wi-Fi 6)'], 'Bluetooth': ['v5.0']}, 'Dimensions': {'Dimensions': ['304.1 x 212.4 x 10.9'], 'Weight': ['1.29 Kg']}, 'Additional Features': {'Disk Drive': 
['Not Available'], 'Web Camera': ['720p FaceTime HD Webcam'], 'Keyboard': ['Backlit Magic Keyboard'], 'Backlit Keyboard': ['Yes'], 'Pointer Device': ['Force Touch Trackpad'], 'Included Software': ['Built-in Apps: iMovie, Siri, GarageBand, Pages, Numbers, Photos, Keynote, Safari, Mail, FaceTime, Messages, Maps, Stocks, Home, Voice Memos, Notes, Calendar, Contacts, Reminders, Photo Booth, Preview, Books, App Store, Time Machine, TV, Music, Podcasts, Find My, QuickTime Player'], 'Additional Features': ['49.9 WHr Li-polymer Battery']}, 'Warranty': {'Warranty Summary': ['1 Year Limited Warranty'], 'Warranty Service Type': ['Onsite'], 'Covered in Warranty': ['Manufacturing Defects'], 'Not Covered in Warranty': ['Physical Damage'], 'Domestic Warranty': ['1 Year']}}", # instruction
        "What is the OS of this machine", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True)
tokenizer.batch_decode(outputs)

["<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nUnderstand the given product information and answer the question asked by acting as a human\nProduct Title: Apple MacBook Air Apple M1 - (16 GB/256 GB SSD/Mac OS Big Sur) Z124J001KD\nBrand: Apple\nHighlights: Stylish & Portable Thin and Light Laptop, 13.3 inch Quad LED Backlit IPS Display (227 PPI, 400 nits Brightness, Wide Colour (P3), True Tone Technology), Light Laptop without Optical Disk Drive\nSpecifications: {'General': {'Sales Package': ['MacBook Air, 30 W USB-C Power Adapter, USB-C Charge Cable (2m), User Guide, Warranty Documents'], 'Model Number': ['Z124J001KD'], 'Part Number': ['Z12400095'], 'Series': ['MacBook Air'], 'Color': ['Space Grey'], 'Type': ['Thin and Light Laptop'], 'Suitable For': ['Processing & Multitasking'], 'Battery Backup': ['Upto 15 Hours'], 'Power Supply': ['30
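`batch_decode` returns the whole sequence, so the decoded string starts with the full prompt and the answer only appears after the `### Response:` marker. A small post-processing helper can pull out just the answer. The helper name and the sample string below are illustrative assumptions, and the EOS string is a stand-in for `tokenizer.eos_token`.

```python
RESPONSE_MARKER = "### Response:"


def extract_response(decoded: str, eos_token: str = "<|end_of_text|>") -> str:
    """Return only the text after the response marker, without the EOS token."""
    _, found, answer = decoded.partition(RESPONSE_MARKER)
    if not found:
        # Marker missing: nothing that can safely be called a response.
        return ""
    return answer.replace(eos_token, "").strip()


# Hypothetical decoded output, shortened for readability.
decoded = (
    "<|begin_of_text|>Below is an instruction ...\n\n"
    "### Instruction:\n...\n\n### Input:\nWhat is the OS of this machine\n\n"
    "### Response:\nMac OS Big Sur<|end_of_text|>"
)
print(extract_response(decoded))  # Mac OS Big Sur
```

An alternative that avoids string matching is to slice the generated token IDs by the prompt length before decoding, for example `outputs[:, inputs["input_ids"].shape[1]:]`.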

You can also use a `TextStreamer` for continuous inference, so you can watch the generation token by token instead of waiting for the full output.

In [19]:
FastLanguageModel.for_inference(model)
inputs = tokenizer(
[
    alpaca_prompt.format(
        "Understand the given product information and answer the question asked by acting as a human\nProduct Title: Apple MacBook Air Apple M1 - (16 GB/256 GB SSD/Mac OS Big Sur) Z124J001KD\nBrand: Apple\nHighlights: Stylish & Portable Thin and Light Laptop, 13.3 inch Quad LED Backlit IPS Display (227 PPI, 400 nits Brightness, Wide Colour (P3), True Tone Technology), Light Laptop without Optical Disk Drive\nSpecifications: {'General': {'Sales Package': ['MacBook Air, 30 W USB-C Power Adapter, USB-C Charge Cable (2m), User Guide, Warranty Documents'], 'Model Number': ['Z124J001KD'], 'Part Number': ['Z12400095'], 'Series': ['MacBook Air'], 'Color': ['Space Grey'], 'Type': ['Thin and Light Laptop'], 'Suitable For': ['Processing & Multitasking'], 'Battery Backup': ['Upto 15 Hours'], 'Power Supply': ['30 W AC Adapter'], 'MS Office Provided': ['No']}, 'Processor And Memory Features': {'Processor Brand': ['Apple'], 'Processor Name': ['M1'], 'SSD': ['Yes'], 'SSD Capacity': ['256 GB'], 'RAM': ['16 GB'], 'RAM Type': ['DDR4'], 'Processor Variant': ['Apple M1 Chip'], 'Expandable Memory': ['Upto 16 GB'], 'Graphic Processor': ['NA'], 'Number of Cores': ['8'], 'Storage Type': ['SSD']}, 'Operating System': {'Operating System': ['Mac OS Big Sur'], 'System Architecture': ['NA']}, 'Port And Slot Features': {'Mic In': ['Yes']}, 'Display And Audio Features': {'Touchscreen': ['No'], 'Screen Size': ['33.78 cm (13.3 inch)'], 'Screen Resolution': ['2560 x 1600 Pixels'], 'Screen Type': ['Quad LED Backlit IPS Display (227 PPI, 400 nits Brightness, Wide Colour (P3), True Tone Technology)'], 'Speakers': ['Built-in Speakers'], 'Internal Mic': ['Three-mic Array with Directional Beamforming'], 'Sound Properties': ['Stereo Speakers, Wide Stereo Sound, Support for Dolby Atmos Playback']}, 'Connectivity Features': {'Wireless LAN': ['IEEE 802.11ax (Wi-Fi 6)'], 'Bluetooth': ['v5.0']}, 'Dimensions': {'Dimensions': ['304.1 x 212.4 x 10.9'], 'Weight': ['1.29 Kg']}, 'Additional Features': {'Disk Drive': 
['Not Available'], 'Web Camera': ['720p FaceTime HD Webcam'], 'Keyboard': ['Backlit Magic Keyboard'], 'Backlit Keyboard': ['Yes'], 'Pointer Device': ['Force Touch Trackpad'], 'Included Software': ['Built-in Apps: iMovie, Siri, GarageBand, Pages, Numbers, Photos, Keynote, Safari, Mail, FaceTime, Messages, Maps, Stocks, Home, Voice Memos, Notes, Calendar, Contacts, Reminders, Photo Booth, Preview, Books, App Store, Time Machine, TV, Music, Podcasts, Find My, QuickTime Player'], 'Additional Features': ['49.9 WHr Li-polymer Battery']}, 'Warranty': {'Warranty Summary': ['1 Year Limited Warranty'], 'Warranty Service Type': ['Onsite'], 'Covered in Warranty': ['Manufacturing Defects'], 'Not Covered in Warranty': ['Physical Damage'], 'Domestic Warranty': ['1 Year']}}", # instruction
        "How many cores of CPU does it habe", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Understand the given product information and answer the question asked by acting as a human
Product Title: Apple MacBook Air Apple M1 - (16 GB/256 GB SSD/Mac OS Big Sur) Z124J001KD
Brand: Apple
Highlights: Stylish & Portable Thin and Light Laptop, 13.3 inch Quad LED Backlit IPS Display (227 PPI, 400 nits Brightness, Wide Colour (P3), True Tone Technology), Light Laptop without Optical Disk Drive
Specifications: {'General': {'Sales Package': ['MacBook Air, 30 W USB-C Power Adapter, USB-C Charge Cable (2m), User Guide, Warranty Documents'], 'Model Number': ['Z124J001KD'], 'Part Number': ['Z12400095'], 'Series': ['MacBook Air'], 'Color': ['Space Grey'], 'Type': ['Thin and Light Laptop'], 'Suitable For': ['Processing & Multitasking'], 'Battery Backup': ['Upto 15 Hours'], 'Power Supply': ['30 W AC Ada
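As the output above shows, the streamer echoes the entire prompt before any generated text. `TextStreamer` accepts `skip_prompt=True` to suppress that echo. The toy class below illustrates the idea only: it is not the `transformers` implementation (which works on token IDs and handles decoding), and `ToyStreamer` and `fake_generate` are hypothetical names. It relies on the streamer receiving the prompt in its first `put` call and one piece per generated token afterwards.

```python
class ToyStreamer:
    """Toy stand-in for a text streamer: prints pieces as they arrive."""

    def __init__(self, skip_prompt: bool = False):
        self.skip_prompt = skip_prompt
        self._first_call = True
        self.pieces = []

    def put(self, text: str) -> None:
        is_prompt = self._first_call
        self._first_call = False
        if self.skip_prompt and is_prompt:
            return  # drop the echoed prompt
        self.pieces.append(text)
        print(text, end="", flush=True)

    def end(self) -> None:
        print()  # finish the line once generation stops


def fake_generate(prompt: str, new_tokens: list, streamer: ToyStreamer) -> None:
    """Mimic generation: send the prompt first, then each new token."""
    streamer.put(prompt)
    for token in new_tokens:
        streamer.put(token)
    streamer.end()


streamer = ToyStreamer(skip_prompt=True)
fake_generate("### Instruction: ...\n### Response:\n", ["8", " cores"], streamer)
```

In the notebook, the equivalent change would be `TextStreamer(tokenizer, skip_prompt=True)`.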

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Hugging Face's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [None]:
model.save_pretrained("lora_model") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving

To load the LoRA adapters we just saved for inference, change `False` to `True`:

In [None]:
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model)

# alpaca_prompt = You MUST run cells from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.


["Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nWhat is a famous tall tower in Paris?\n\n### Input:\n\n\n### Response:\nOne of the most famous tall towers in Paris is the Eiffel Tower. It is a wrought iron tower located on the Champ de Mars in Paris, France. It was built in 1889 as the entrance to the 1889 World's Fair, and it was designed by the French engineers Gustave Eiff"]

You can also use `AutoPeftModelForCausalLM` from Hugging Face's `peft` library. Only use this if you do not have `unsloth` installed. It can be very slow, since `4bit` model downloading is not supported, and Unsloth's **inference is 2x faster**.

In [None]:
if False:
    # Not recommended: use Unsloth if possible
    from peft import AutoPeftModelForCausalLM
    from transformers import AutoTokenizer
    model = AutoPeftModelForCausalLM.from_pretrained(
        "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        load_in_4bit = load_in_4bit,
    )
    tokenizer = AutoTokenizer.from_pretrained("lora_model")

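Next, we can merge the LoRA adapters into the base model and save the result in 16-bit or 4-bit precision, or save only the adapters. Each option can be written locally (`save_pretrained_merged`) or uploaded to the Hugging Face Hub (`push_to_hub_merged`). Every line is disabled with `if False`; change it to `if True` to run the one you need.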

In [None]:
# Merge to 16bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_16bit", token = "")

# Merge to 4bit
if False: model.save_pretrained_merged("model", tokenizer, save_method = "merged_4bit",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "merged_4bit", token = "")

# Just LoRA adapters
if False: model.save_pretrained_merged("model", tokenizer, save_method = "lora",)
if False: model.push_to_hub_merged("hf/model", tokenizer, save_method = "lora", token = "")
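Merging folds the adapter into the base weights so that inference no longer needs a separate LoRA path. In the standard LoRA formulation the merged weight is `W + (lora_alpha / r) * B @ A`. The sketch below checks on tiny hand-picked matrices that the merged weight gives the same output as the base layer plus the adapter. It uses plain Python lists and made-up numbers; it is an illustration of the math, not Unsloth's merge code.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale):
    """Return a + scale * b, element by element."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy sizes: d_in = d_out = 2, rank r = 1. All values are made up.
W = [[1.0, 2.0], [3.0, 4.0]]   # base weight, d_out x d_in
A = [[0.5, -1.0]]              # LoRA A, r x d_in
B = [[2.0], [1.0]]             # LoRA B, d_out x r
r, lora_alpha = 1, 1
scale = lora_alpha / r         # the notebook's 16 / 16 also gives 1.0

x = [[1.0], [2.0]]             # one input column vector

# Unmerged: base output plus the scaled adapter output.
unmerged_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged: fold the adapter into the weight once, then do a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(W_merged, x)

print(unmerged_out, merged_out)  # [[2.0], [9.5]] [[2.0], [9.5]]
```

Because the two paths are mathematically identical, a merged 16-bit model can be served by any runtime that knows nothing about LoRA.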

We can also export the model to GGUF with different quantization methods (`q8_0` by default, `f16`, or `q4_k_m`), which trade file size against precision, and either save it locally or upload it to the Hub.

In [None]:
# Save to 8bit Q8_0
if False: model.save_pretrained_gguf("model", tokenizer,)
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
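The quantization methods above differ in how many bits they spend per weight. As a rough illustration of the idea behind an 8-bit block format such as Q8_0, the toy code below quantizes one block of 32 values to signed 8-bit integers with a single shared scale, then reconstructs them. It assumes the commonly described layout (blocks of 32 weights, one scale per block) and is not llama.cpp code; the input values are made up.

```python
def quantize_block(values):
    """Quantize one block to int8 range with a single shared scale."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax else 1.0
    quantized = [round(v / scale) for v in values]
    return scale, quantized


def dequantize_block(scale, quantized):
    """Reconstruct approximate float values from the quantized block."""
    return [scale * q for q in quantized]


# Made-up block of 32 weights between -0.32 and 0.30.
block = [0.02 * (i - 16) for i in range(32)]

scale, quantized = quantize_block(block)
restored = dequantize_block(scale, quantized)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(f"scale = {scale:.6f}, max error = {max_error:.6f}")
```

The reconstruction error per weight is bounded by half the scale, which is why 8-bit exports stay close to the original model while 4-bit formats such as `q4_k_m` give up more precision for a smaller file.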

Now, use the `model-unsloth.gguf` file or `model-unsloth-Q4_K_M.gguf` file in `llama.cpp` or a UI-based system like `GPT4All`. You can install GPT4All by going [here](https://gpt4all.io/index.html).

And we're done! If you have any questions on Unsloth, join their [Discord](https://discord.gg/u54VK8m8tk) channel!