In [None]:
# MINE
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Installing Dependencies**

In [None]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl==0.15.2 triton cut_cross_entropy unsloth_zoo
!pip install sentencepiece protobuf "datasets>=3.4.1" huggingface_hub hf_transfer
!pip install --no-deps unsloth

## **Loading the Model**

In [None]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally
dtype = None # None for auto detection. Float16 for Tesla T4.
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


    PyTorch 2.6.0+cu124 with CUDA 1204 (you have 2.8.0+cu126)
    Python  3.12.9 (you have 3.12.11)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details


🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.8.9: Fast Llama patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/5.96G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/235 [00:00<?, ?B/s]

tokenizer_config.json: 0.00B [00:00, ?B/s]

special_tokens_map.json:   0%|          | 0.00/459 [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

## **Add LoRA Adapters**

In [None]:
model = FastLanguageModel.get_peft_model(
    model,
    # Choose any number >  0 ! Suggested 8, 16, 32, 64, 128
    r = 8,# changed from 16 to 8
    target_modules = ["q_proj","k_proj","v_proj", "o_proj",
                      "gate_proj", "up_proj","down_proj"],
    lora_alpha = 8, # changed from 16 to 8
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 30,
    use_rslora = False,
    loftq_config = None # And LoftQ
)

Unsloth 2025.8.9 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.
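As a sanity check on the adapter size, the sketch below counts the LoRA parameters by hand. It assumes the public Llama 3.1 8B shapes (hidden size 4096, 8 key/value heads of dimension 128, MLP width 14336, 32 decoder layers), which are not stated anywhere in this notebook, and `lora_param_count` is a hypothetical helper name. Each adapted linear layer of shape in → out adds `r * (in + out)` weights (one `r x in` matrix and one `out x r` matrix).

```python
# Assumed Llama 3.1 8B dimensions (from the public model config, not from this notebook).
HIDDEN = 4096
KV_DIM = 8 * 128      # 8 key/value heads, head dimension 128
MLP = 14336
NUM_LAYERS = 32

# (in_features, out_features) of every target module in one decoder layer.
TARGET_SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_param_count(r, shapes=TARGET_SHAPES, num_layers=NUM_LAYERS):
    """Total LoRA weights: r * (in + out) per module, summed over all layers."""
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers

r, lora_alpha = 8, 8
print(lora_param_count(r))   # 20971520
print(lora_alpha / r)        # 1.0, the standard LoRA scaling factor (use_rslora = False)
```

With `r = 8` the count is 20,971,520, which agrees with the "Trainable parameters" figure Unsloth prints when training starts. Doubling `r` doubles the count.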


## **Data Preparation**

Always append the **EOS_TOKEN** to every training example. Without it the model never learns when to stop and may keep generating indefinitely.

In [None]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


EOS_TOKEN = tokenizer.eos_token # EOS must be added
def format_prompt(examples):
  # Build one Alpaca-formatted training string per row of the batch
  instructions = examples['instruction']
  inputs = examples['input']
  outputs = examples['output']
  texts = []
  # `input_text` avoids shadowing the built-in `input`
  for instruction, input_text, output in zip(instructions, inputs, outputs):
    text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
    texts.append(text)
  return {"text": texts}

In [None]:
from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Before formatting
dataset

README.md: 0.00B [00:00, ?B/s]

alpaca_data_cleaned.json:   0%|          | 0.00/44.3M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/51760 [00:00<?, ? examples/s]

Dataset({
    features: ['output', 'input', 'instruction'],
    num_rows: 51760
})

In [None]:
# After formatting
dataset = dataset.map(format_prompt, batched=True)
dataset

Map:   0%|          | 0/51760 [00:00<?, ? examples/s]

Dataset({
    features: ['output', 'input', 'instruction', 'text'],
    num_rows: 51760
})

In [None]:
print(dataset['text'][0])

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Input:


### Response:
1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.

2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.

3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.<|end

In [None]:
print(dataset['text'][22]) # notice the EOS token "<|end_of_text|>" at the end

Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Based on the information provided, rewrite the sentence by changing its tense from past to future.

### Input:
She played the piano beautifully for hours and then stopped as it was midnight.

### Response:
She will play the piano beautifully for hours and then stop as it will be midnight.<|end_of_text|>


## **Setting up Weights and Biases for Logging**

In [None]:
import wandb

In [None]:
wandb.login()

True

In [None]:
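# Does not work as written: the first positional parameter of wandb.login() is
# `anonymous`, so the API key has to be passed by keyword (key=...).
# The ValidationError below comes from passing the token positionally.
# from google.colab import userdata
# from wandb import login
# wandb_token = userdata.get('WEIGHTS_BAISES_API_KEY')
# if wandb_token:
#    login(key=wandb_token)  # was login(wandb_token)
#    print("Successfully logged in to Weights & Biases!")
# else:
#    print("Token is not set. Please save the token first.")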

ValidationError: 1 validation error for Settings
anonymous
  Input should be 'must', 'allow' or 'never' [type=literal_error, input_value='051c561e5d27cfe919931312f02c797bbcbc04df', input_type=str]
    For further information visit https://errors.pydantic.dev/2.11/v/literal_error

In [None]:
import os
# set the wandb project where this run will be logged
os.environ["WANDB_PROJECT"]="Fine-Tune-Llama-3.1-8B-instruct-model-unsloth"

# save your trained model checkpoint to wandb
# os.environ["WANDB_LOG_MODEL"]="true" # throws an error, must use 'checkpoint' or 'end'
os.environ["WANDB_LOG_MODEL"]="checkpoint"

# turn off watch to log faster
os.environ["WANDB_WATCH"]="false"

In [None]:
from transformers import TrainingArguments
training_args = TrainingArguments(
    per_device_train_batch_size=1, # Reduced from 2 to 1 to save memory
    gradient_accumulation_steps=1, # Reduced from 4 to 2 to 1 to save memory
    warmup_steps=5,
    max_steps=100,
    # num_train_epochs=100,
    learning_rate=2e-4,
    fp16= not torch.cuda.is_bf16_supported(),
    bf16 = torch.cuda.is_bf16_supported(),
    logging_steps = 5,
    # This needs the eval_dataset to be used
    # eval_strategy="steps",
    save_strategy="steps",
    save_steps=5,
    optim = "adamw_8bit",
    weight_decay = 0.01,
    lr_scheduler_type = "linear",
    seed = 30,
    run_name="Fine_Tune_Llama_3.1_8B_instruct_model-unsloth",
    output_dir="outputs",
    report_to = ["wandb"], # reporting to Weights and biases project
)
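To make the `warmup_steps`, `max_steps`, `learning_rate` and `lr_scheduler_type = "linear"` settings concrete, here is a small sketch of the learning rate at a given step under linear warmup followed by linear decay. It illustrates the shape of the schedule only; `linear_lr` is a hypothetical helper, not the library code.

```python
def linear_lr(step, base_lr=2e-4, warmup_steps=5, total_steps=100):
    """Learning rate after `step` optimizer updates: linear warmup, then linear decay to 0."""
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return base_lr * factor

# Print the rate at a few points of the 100-step run.
for step in (0, 1, 5, 50, 100):
    print(step, linear_lr(step))
```

The rate climbs from 0 to `2e-4` over the first 5 steps and then falls linearly back to 0 at step 100, so with only 100 steps most updates happen well below the peak rate.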

In [None]:
from trl import SFTTrainer
trainer = SFTTrainer(
    model = model,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    tokenizer = tokenizer,
    dataset_num_proc = 2,
    packing = False,
    args = training_args,
)

Unsloth: Tokenizing ["text"]:   0%|          | 0/51760 [00:00<?, ? examples/s]

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

GPU = Tesla T4. Max memory = 14.741 GB.
6.881 GB of memory reserved.


In [None]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 51,760 | Num Epochs = 1 | Total steps = 100
O^O/ \_/ \    Batch size per device = 1 | Gradient accumulation steps = 1
\        /    Data Parallel GPUs = 1 | Total batch size (1 x 1 x 1) = 1
 "-____-"     Trainable parameters = 20,971,520 of 8,051,232,768 (0.26% trained)


Step,Training Loss
5,2.1791
10,1.5376
15,1.0981
20,1.1003
25,1.0429
30,0.9723
35,0.8908
40,0.8929
45,0.8587
50,0.9091


wandb: Adding directory to artifact (./outputs/checkpoint-5)... Done. 0.4s


Unsloth: Will smartly offload gradients to save VRAM!


wandb: Adding directory to artifact (./outputs/checkpoint-10)... Done. 0.6s
wandb: Adding directory to artifact (./outputs/checkpoint-15)... Done. 0.4s
wandb: Adding directory to artifact (./outputs/checkpoint-20)... Done. 6.0s
wandb: Adding directory to artifact (./outputs/checkpoint-25)... Done. 0.6s
wandb: Adding directory to artifact (./outputs/checkpoint-30)... Done. 0.6s
wandb: Adding directory to artifact (./outputs/checkpoint-35)... Done. 0.4s
wandb: Adding directory to artifact (./outputs/checkpoint-40)... Done. 4.5s
wandb: Adding directory to artifact (./outputs/checkpoint-45)... Done. 0.8s
wandb: Adding directory to artifact (./outputs/checkpoint-50)... Done. 0.4s
wandb: Adding directory to artifact (./outputs/checkpoint-55)... Done. 1.2s
wandb: Adding directory to artifact (./outputs/checkpoint-60)... Done. 9.3s
wandb: A

In [None]:
trainer_stats.metrics

{'train_runtime': 206.9943,
 'train_samples_per_second': 0.483,
 'train_steps_per_second': 0.483,
 'total_flos': 830949250351104.0,
 'train_loss': 1.0321527242660522}
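The reported metrics follow directly from the training configuration. The sketch below redoes the arithmetic with the numbers shown in this notebook (the variable names are illustrative), which also shows how little of the dataset a 100-step run touches.

```python
# Values copied from the training arguments and the metrics above.
num_examples = 51_760
max_steps = 100
per_device_batch = 1
grad_accum = 1
num_gpus = 1
save_steps = 5
train_runtime_s = 206.9943

effective_batch = per_device_batch * grad_accum * num_gpus   # examples per optimizer step
examples_seen = max_steps * effective_batch                  # examples used in the whole run
coverage = examples_seen / num_examples                      # fraction of the dataset
steps_per_second = max_steps / train_runtime_s
num_checkpoints = max_steps // save_steps                    # one checkpoint every save_steps

print(effective_batch)               # 1
print(f"{coverage:.2%}")             # 0.19%
print(round(steps_per_second, 3))    # 0.483
print(num_checkpoints)               # 20
```

Because the effective batch size is 1, steps per second and samples per second coincide, and the run sees about 0.19% of the 51,760 examples.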

In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

206.9943 seconds used for training.
3.45 minutes used for training.
Peak reserved memory = 6.881 GB.
Peak reserved memory for training = 0.0 GB.
Peak reserved memory % of max memory = 46.679 %.
Peak reserved memory for training % of max memory = 0.0 %.
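The 0.0 GB "for training" figure deserves a second look. `torch.cuda.max_memory_reserved()` returns a high-water mark, so if the baseline cell is (re-)run after memory has already peaked, the baseline and the final reading are identical and their difference is zero. That is one plausible explanation here, not a confirmed one. The sketch below reproduces the printed percentages from the three reported numbers; `memory_report` is a hypothetical helper.

```python
def memory_report(start_gb, peak_gb, total_gb):
    """Recompute the memory statistics printed above from three readings in GB."""
    training_gb = round(peak_gb - start_gb, 3)
    return {
        "peak_pct": round(peak_gb / total_gb * 100, 3),
        "training_gb": training_gb,
        "training_pct": round(training_gb / total_gb * 100, 3),
    }

# Baseline 6.881 GB, peak 6.881 GB, Tesla T4 total 14.741 GB (values from the outputs above).
report = memory_report(6.881, 6.881, 14.741)
print(report)
```

To measure the training overhead alone, one option is to call `torch.cuda.reset_peak_memory_stats()` and record the baseline immediately before `trainer.train()`.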


## **Inference**

In [None]:
FastLanguageModel.for_inference(model) # switch to inference mode; `model` still carries the LoRA adapters trained above
inputs = tokenizer(
    [
    alpaca_prompt.format(
        "List all metals in Africa?", # instruction
            "", # input
                "", # model generates response
                )
    ],
    return_tensors="pt",).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=200,use_cache=True)
# print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
print(tokenizer.batch_decode(outputs)[0])

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
List all metals in Africa?

### Input:


### Response:
Africa is home to a variety of metals, including gold, copper, platinum, and diamonds. Some of the most important metal-producing countries in Africa are South Africa, which is the world's largest producer of platinum and palladium, and the Democratic Republic of Congo, which is a major producer of copper, cobalt, and diamonds. Other African countries that produce significant amounts of metals include Angola, Botswana, Ghana, Mali, and Zambia.<|end_of_text|>


You can also use a `TextStreamer` to stream the output, so you see the generation token by token instead of waiting for the full response.

In [None]:
from transformers import TextStreamer

FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [
    alpaca_prompt.format(
        "List all metals in Africa?", # instruction
            "Gold, Silver, Bronze,", # input
                "", # model generates response
                )
    ],
    return_tensors="pt",).to("cuda")

streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer= streamer, max_new_tokens=200)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
List all metals in Africa?

### Input:
Gold, Silver, Bronze,

### Response:
Gold, Silver, Bronze, Platinum, Copper, Aluminum, Iron, Lead, Nickel, Zinc, Tin, Titanium, Manganese, Uranium, Chromium, Vanadium, Cobalt, Tungsten, Molybdenum, Niobium, Zirconium, Rhenium, Antimony, Beryllium, Cerium, Dysprosium, Erbium, Europium, Gadolinium, Hafnium, Holmium, Lanthanum, Neodymium, Osmium, Palladium, Platinum, Praseodymium, Radium, Rhenium, Ruthenium, Samarium, Scandium, Terbium, Thorium, Thulium, Tungsten, Ytterbium, Yttrium.<|end_of_text|>


In [None]:
from transformers import TextStreamer

FastLanguageModel.for_inference(model)
inputs = tokenizer(
    [
    alpaca_prompt.format(
        "Give a brief summary about the universe", # instruction
            "The universe is verse and big", # input
                "", # model generates response
                )
    ],
    return_tensors="pt",).to("cuda")

streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer= streamer, max_new_tokens=100)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Give a brief summary about the universe

### Input:
The universe is verse and big

### Response:
The universe is a vast and immense expanse of space that contains billions of stars, galaxies, and other celestial objects. It is believed to have originated from a massive explosion known as the Big Bang, which occurred approximately 13.8 billion years ago. The universe is constantly expanding, and its age is estimated to be around 13.8 billion years old. The universe is composed of a wide variety of objects, including stars, galaxies, planets, and asteroids, as well as dark matter and


**A little tweaking**

In [None]:
# from transformers import TextStreamer

# FastLanguageModel.for_inference(model)
# inputs = tokenizer(
#     [
#     alpaca_prompt.format(
#         "List all metals", # instruction
#             "Gold, Silver, Bronze", # input
#                 "" # model generates response
#                 )
#     ],
#     return_tensors="pt",).to("cuda")

# # streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
# # outputs = model.generate(**inputs, streamer=streamer, max_new_tokens=200,use_cache=True)
# # print(outputs)
# streamer = TextStreamer(tokenizer)
# _ = model.generate(**inputs,streamer=streamer, max_new_tokens=200)
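The decoded strings printed in this section contain the whole prompt plus special tokens. The commented-out `TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)` line handles that while streaming. For text returned by `tokenizer.batch_decode`, a small string helper is one option. The sketch below assumes the Alpaca template used in this notebook; `extract_response` is a hypothetical helper, not part of Unsloth or Transformers.

```python
RESPONSE_MARKER = "### Response:\n"
# Special tokens as they appear in the decoded outputs of this notebook.
SPECIAL_TOKENS = ("<|begin_of_text|>", "<|end_of_text|>")

def extract_response(decoded):
    """Return only the text after the last response marker, without special tokens."""
    _, _, response = decoded.rpartition(RESPONSE_MARKER)
    for token in SPECIAL_TOKENS:
        response = response.replace(token, "")
    return response.strip()

decoded = (
    "<|begin_of_text|>Below is an instruction that describes a task.\n\n"
    "### Instruction:\nList all metals\n\n"
    "### Input:\nGold, Silver, Bronze,\n\n"
    "### Response:\nGold, Silver, Bronze, Iron<|end_of_text|>"
)
print(extract_response(decoded))  # Gold, Silver, Bronze, Iron
```

If the marker is missing, `rpartition` leaves the whole string in place, so the helper degrades to stripping special tokens only.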

## **Saving, Loading Finetuned models**

You could save the model locally or push it to the Hub.

In [None]:
# import os
# import sys

# google_colab = "google.colab" in sys.modules and not os.environ.get("VERTEX_PRODUCT")

# if google_colab:
#     # Use secret if running in Google Colab
#     from google.colab import userdata
#     os.environ["HF_TOKEN"] = userdata.get("HF_TOKEN")
# else:
#     # Store Hugging Face data under `/content` if running in Colab Enterprise
#     if os.environ.get("VERTEX_PRODUCT") == "COLAB_ENTERPRISE":
#         os.environ["HF_HOME"] = "/content/hf"
#     # Authenticate with Hugging Face
#     from huggingface_hub import get_token
#     if get_token() is None:
#         from huggingface_hub import notebook_login
#         notebook_login()

In [None]:
from google.colab import userdata
from huggingface_hub import login
hf_token = userdata.get('HF_TOKEN')
if hf_token:
   login(hf_token)
   print("Successfully logged in to Hugging Face!")
else:
   print("Token is not set. Please save the token first.")

Successfully logged in to Hugging Face!


In [None]:
# model.save_pretrained("Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-model") # Local saving
# tokenizer.save_pretrained("Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-model") # Local saving
# Pushing to Hugging Face: first create the model repo on Hugging Face,
# copy the repo name, paste it below, and then run this cell.
model.push_to_hub("DannyAI/Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-model",token=hf_token)
tokenizer.push_to_hub("DannyAI/Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-model",token=hf_token)

README.md:   0%|          | 0.00/24.0 [00:00<?, ?B/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  ...p9naf6w2t/adapter_model.safetensors:   1%|          |  556kB / 83.9MB            

Saved model to https://huggingface.co/DannyAI/Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-model


README.md:   0%|          | 0.00/42.0 [00:00<?, ?B/s]

Processing Files (0 / 0)                : |          |  0.00B /  0.00B            

New Data Upload                         : |          |  0.00B /  0.00B            

  /tmp/tmpp71jdvg8/tokenizer.json       : 100%|##########| 17.2MB / 17.2MB            

To load the LoRA adapters we just pushed for inference, run the cell below. The `if True:` guard is already enabled; set it to `False` to skip the reload.

In [None]:
if True:
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "DannyAI/Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-model",
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
        token = hf_token
    )
    FastLanguageModel.for_inference(model) # Enable native 2x faster inference


==((====))==  Unsloth 2025.8.9: Fast Llama patching. Transformers: 4.55.2.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.8.0+cu126. CUDA: 7.5. CUDA Toolkit: 12.6. Triton: 3.4.0
\        /    Bfloat16 = FALSE. FA [Xformers = None. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


adapter_model.safetensors:   0%|          | 0.00/83.9M [00:00<?, ?B/s]

In [None]:
# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "Based on the information provided, rewrite the sentence by changing its tense from past to future.?", # instruction
        "She played the piano beautifully for hours and then stopped as it was midnight.", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)

print(tokenizer.batch_decode(outputs)[0])

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Based on the information provided, rewrite the sentence by changing its tense from past to future.?

### Input:
She played the piano beautifully for hours and then stopped as it was midnight.

### Response:
She will play the piano beautifully for hours and then stop as it is midnight.<|end_of_text|>


In [None]:
inputs = tokenizer(
    [
    alpaca_prompt.format(
        "List all metals in Africa?", # instruction
            "", # input
                "", # model generates response
                )
    ],
    return_tensors="pt",).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=200,use_cache=True)
print(tokenizer.batch_decode(outputs)[0])

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
List all metals in Africa?

### Input:


### Response:
There are many metals found in Africa, and the list is quite extensive. However, some of the most important metals in Africa include gold, platinum, copper, iron, cobalt, manganese, chromium, and nickel. These metals are found in different parts of the continent, with some countries producing more of one metal than others. For example, South Africa is known for its gold and platinum production, while the Democratic Republic of Congo is a major producer of cobalt and copper. Other African countries that produce significant amounts of these metals include Ghana, Zimbabwe, and Nigeria.<|end_of_text|>


In [None]:
# @title streamer
inputs = tokenizer(
    [
    alpaca_prompt.format(
        "List all metals in Africa?", # instruction
            "Gold, Silver, Bronze,", # input
                "", # model generates response
                )
    ],
    return_tensors="pt",).to("cuda")

streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer= streamer, max_new_tokens=200)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
List all metals in Africa?

### Input:
Gold, Silver, Bronze,

### Response:
Africa is home to a variety of metals, including gold, silver, copper, iron, tin, lead, and platinum. Some of the major metal-producing countries in Africa include South Africa, Ghana, Democratic Republic of the Congo, Zambia, and Angola. These countries have a long history of mining and metal production, and their mineral resources are still being exploited today.<|end_of_text|>


In [None]:
# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
    [
    alpaca_prompt.format(
        "List all metals", # instruction
            "Gold, Silver, Bronze,", # input
                "" # model generates response
                )
    ],
    return_tensors="pt",).to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)

print(tokenizer.batch_decode(outputs)[0])

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
List all metals

### Input:
Gold, Silver, Bronze,

### Response:
Gold, Silver, Bronze, Iron, Copper, Nickel, Zinc, Lead, Aluminum, Tin, Titanium, Manganese, Chromium, Vanadium, Niobium, Tungsten, Molybdenum, Rhenium, Osmium, Iridium, Platinum, Palladium, Rhodium,


In [None]:
# alpaca_prompt = You MUST copy from above!

inputs = tokenizer(
[
    alpaca_prompt.format(
        "What is a famous tall tower in Paris?", # instruction
        "", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)

print(tokenizer.batch_decode(outputs)[0])

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
What is a famous tall tower in Paris?

### Input:


### Response:
The Eiffel Tower is a famous tall tower in Paris, France. It is one of the most recognizable landmarks in the world, and is known for its iconic structure and stunning views of the city. The tower was built in 1889 for the World's Fair, and stands at a height of 324 meters


In [None]:
# @title streamer
inputs = tokenizer(
    [
    alpaca_prompt.format(
        "Give a brief summary about the universe", # instruction
            "The universe is verse and big", # input
                "", # model generates response
                )
    ],
    return_tensors="pt",).to("cuda")

streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer= streamer, max_new_tokens=100)

<|begin_of_text|>Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Give a brief summary about the universe

### Input:
The universe is verse and big

### Response:
The universe is a vast and infinite expanse of space that contains billions of galaxies, each with their own unique stars, planets, and other celestial bodies. It is believed to have originated from the Big Bang, a cosmic event that occurred approximately 13.8 billion years ago. The universe is constantly expanding and evolving, with new stars and planets forming and old ones dying out over time. It is a place of great mystery and wonder, with many unanswered questions about its origins, composition, and future


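The fine-tuned model is most likely overfitting. Note that the piano prompt is row 22 of the training set, so a near-verbatim answer there says little about how well the model generalizes.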

### GGUF / llama.cpp Conversion
Unsloth can export to `GGUF` / `llama.cpp` natively: it clones `llama.cpp` and saves to `q8_0` by default. Other methods such as `q4_k_m` are also supported. Use `save_pretrained_gguf` for local saving and `push_to_hub_gguf` for uploading to Hugging Face.

Some supported quant methods (full list on our [Wiki page](https://github.com/unslothai/unsloth/wiki#gguf-quantization-options)):
* `q8_0` - Fast conversion. High resource use, but generally acceptable.
* `q4_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K.
* `q5_k_m` - Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K.

In [None]:
# Saving to 8bit
# if False: model.save_pretrained_gguf("Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-modelgguf",tokenizer)
# if True: model.push_to_hub_gguf("DannyAI/Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-model-gguf", tokenizer, token=hf_token)


# # Save to 16bit GGUF
# if False: model.save_pretrained_gguf("Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-modelgguf", tokenizer, quantization_method = "f16")
# if False: model.push_to_hub_gguf("DannyAI/Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-model-gguf", tokenizer, quantization_method = "f16", token=hf_token)

# # Save to q4_k_m GGUF
# if False: model.save_pretrained_gguf("Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-modelgguf", tokenizer, quantization_method = "q4_k_m")
# if False: model.push_to_hub_gguf("DannyAI/Fine-Tune-Llama-3.1-8B-instruct-model-unsloth-lora-model-gguf", tokenizer, quantization_method = "q4_k_m", token=hf_token)

In [None]:
# # Downgrade protobuf to a compatible version
# !pip install protobuf==3.20.3

[Video-Link](https://www.youtube.com/watch?v=rpAtVIZB72U&list=PLVEEucA9MYhPxf2WmsTSwVljDbH6aQaJB&index=6)