To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
<a href="https://unsloth.ai/"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
<a href="https://discord.gg/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
<a href="https://docs.unsloth.ai/"><img src="https://github.com/unslothai/unsloth/blob/main/images/documentation%20green%20button.png?raw=true" width="125"></a> Join Discord if you need help + ⭐ <i>Star us on <a href="https://github.com/unslothai/unsloth">GitHub</a> </i> ⭐
</div>

To install Unsloth on your own computer, follow the installation instructions in our docs [here](https://docs.unsloth.ai/get-started/installing-+-updating).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save)


### News

**Read our [Gemma 3 blog](https://unsloth.ai/blog/gemma3) for what's new in Unsloth and our [Reasoning blog](https://unsloth.ai/blog/r1-reasoning) on how to train reasoning models.**

Visit our docs for all our [model uploads](https://docs.unsloth.ai/get-started/all-our-models) and [notebooks](https://docs.unsloth.ai/get-started/unsloth-notebooks).


### Installation

In [1]:
%%capture
import os
if "COLAB_" not in "".join(os.environ.keys()):
    !pip install unsloth
else:
    # Do this only in Colab notebooks! Otherwise use pip install unsloth
    !pip install --no-deps bitsandbytes accelerate xformers==0.0.29.post3 peft trl triton cut_cross_entropy unsloth_zoo
    !pip install sentencepiece protobuf datasets huggingface_hub hf_transfer
    !pip install --no-deps unsloth

### Unsloth

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
==((====))==  Unsloth 2025.3.17: Fast Llama patching. Transformers: 4.49.0.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


model.safetensors:   0%|          | 0.00/2.24G [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/234 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/54.7k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/17.2M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/454 [00:00<?, ?B/s]

## Add LoRA adapters so we only update a small fraction of all parameters (about 0.05% with `r = 1`)

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 1,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 2,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2025.3.17 patched 28 layers with 28 QKV layers, 28 O layers and 28 MLP layers.
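
As a rough sanity check on how few parameters are trained, the sketch below counts the LoRA weights by hand. It assumes each adapted linear layer with `in_features` inputs and `out_features` outputs gains `r * (in_features + out_features)` weights (the `lora_A` and `lora_B` matrices), and it takes the layer shapes from the Llama-3.2-3B module printout further down in this notebook. The helper name `lora_param_count` is ours for illustration and is not part of Unsloth or PEFT.

```python
# (in_features, out_features) of each adapted projection in one decoder layer,
# read off the Llama-3.2-3B module printout in this notebook.
PROJECTION_SHAPES = {
    "q_proj": (3072, 3072),
    "k_proj": (3072, 1024),
    "v_proj": (3072, 1024),
    "o_proj": (3072, 3072),
    "gate_proj": (3072, 8192),
    "up_proj": (3072, 8192),
    "down_proj": (8192, 3072),
}
NUM_LAYERS = 28  # "patched 28 layers" in the log above


def lora_param_count(rank, shapes=PROJECTION_SHAPES, num_layers=NUM_LAYERS):
    """Count LoRA weights: lora_A is (rank x in), lora_B is (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * num_layers


total = lora_param_count(rank=1)
print(f"{total:,}")  # 1,519,616
```

With `r = 1` this agrees with the trainable parameter count that the trainer banner reports, and under this assumption the count grows linearly with `r`.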


## Import the Dataset

In [4]:
from datasets import load_dataset

# Load AG News dataset
orig_dataset = load_dataset('ag_news')

# Define category mapping
label_mapping = {
    0: 'World',
    1: 'Sports',
    2: 'Business',
    3: 'Science/Technology'
}

README.md:   0%|          | 0.00/8.07k [00:00<?, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/18.6M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/1.23M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/120000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/7600 [00:00<?, ? examples/s]

In [5]:
def format_agnews_to_chat_label(example):
    """
    Converts AG News text classification dataset into instruction-response format.
    Ensures that the model is instructed to output ONLY one of: "World", "Sports", "Business", or "Science/Technology".
    """
    user_text = example['text']
    category = label_mapping[example['label']]  # Convert label ID to category name

    formatted_example = {
        'conversations': [
            {
                'role': 'user',
                'content': (
                    "Classify this news article into one of the following categories:\n"
                    "- World\n"
                    "- Sports\n"
                    "- Business\n"
                    "- Science/Technology\n\n"
                    f"Article: {user_text}\n\n"
                    "Respond with only one word that best fits: World, Sports, Business, or Science/Technology."
                ),
            },
            {'role': 'assistant', 'content': category},  # Expected single-word response
        ]
    }
    return formatted_example

dataset = orig_dataset.map(format_agnews_to_chat_label)
print("done")

Map:   0%|          | 0/120000 [00:00<?, ? examples/s]

Map:   0%|          | 0/7600 [00:00<?, ? examples/s]

done


In [6]:
print(dataset['train'][0])

{'text': "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.", 'label': 2, 'conversations': [{'content': "Classify this news article into one of the following categories:\n- World\n- Sports\n- Business\n- Science/Technology\n\nArticle: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.\n\nRespond with only one word that best fits: World, Sports, Business, or Science/Technology.", 'role': 'user'}, {'content': 'Business', 'role': 'assistant'}]}


In [7]:
from unsloth.chat_templates import standardize_sharegpt, get_chat_template

# Work with the training split
train_dataset = dataset["train"]  # Extract the "train" split

# Standardize dataset for Unsloth fine-tuning
train_dataset = standardize_sharegpt(train_dataset)

# Get correct chat template
tokenizer = get_chat_template(
    tokenizer,
    chat_template='llama-3.1',  # LLaMA 3.2 uses this template
)

# Define prompt formatting function
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [tokenizer.apply_chat_template(convo, tokenize=False, add_generation_prompt=False) for convo in convos]
    return {"text": texts}

# Convert dataset to chat format
train_dataset_mapped = train_dataset.map(formatting_prompts_func, batched=True)

# Check first example
print(train_dataset_mapped[0])

Unsloth: Standardizing formats (num_proc=2):   0%|          | 0/120000 [00:00<?, ? examples/s]

Map:   0%|          | 0/120000 [00:00<?, ? examples/s]

{'text': "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nClassify this news article into one of the following categories:\n- World\n- Sports\n- Business\n- Science/Technology\n\nArticle: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.\n\nRespond with only one word that best fits: World, Sports, Business, or Science/Technology.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nBusiness<|eot_id|>", 'label': 2, 'conversations': [{'content': "Classify this news article into one of the following categories:\n- World\n- Sports\n- Business\n- Science/Technology\n\nArticle: Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.\n\nRespond with only one word that bes
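
To make the printed `text` field easier to read, here is a hand-written approximation of what the `llama-3.1` chat template produces for one conversation. It is only a sketch for inspection, reconstructed from the output above: the real template is a Jinja template stored on the tokenizer, and the helper `render_llama31` and the constant `SYSTEM_PREAMBLE` are hypothetical names, not Unsloth APIs.

```python
# Default system preamble exactly as it appears in the printed example above.
SYSTEM_PREAMBLE = "Cutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n"


def render_llama31(conversation):
    """Approximate the rendered training text for a list of {'role', 'content'} turns."""
    parts = ["<|begin_of_text|>"]
    parts.append(
        "<|start_header_id|>system<|end_header_id|>\n\n" + SYSTEM_PREAMBLE + "<|eot_id|>"
    )
    for turn in conversation:
        # Every turn is a role header, a blank line, the content, then <|eot_id|>.
        parts.append(
            f"<|start_header_id|>{turn['role']}<|end_header_id|>\n\n"
            f"{turn['content']}<|eot_id|>"
        )
    return "".join(parts)


example = [
    {"role": "user", "content": "Classify this news article ..."},
    {"role": "assistant", "content": "Business"},
]
rendered = render_llama31(example)
print(rendered)
```

The assistant turn ends with `<|eot_id|>`, so during fine-tuning the model also learns to stop right after the category name.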

## Train the model

In [8]:
from trl import SFTTrainer
from transformers import TrainingArguments, DataCollatorForSeq2Seq
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_dataset_mapped,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    data_collator = DataCollatorForSeq2Seq(tokenizer = tokenizer),
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        # num_train_epochs = 1, # Set this for 1 full training run.
        max_steps = 80,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to = "none", # Use this for WandB etc
    ),
)

Unsloth: We found double BOS tokens - we shall remove one automatically.


Unsloth: Tokenizing ["text"] (num_proc=2):   0%|          | 0/120000 [00:00<?, ? examples/s]

## Start the training process

In [9]:
trainer_stats = trainer.train()

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs used = 1
   \\   /|    Num examples = 120,000 | Num Epochs = 1 | Total steps = 80
O^O/ \_/ \    Batch size per device = 2 | Gradient accumulation steps = 4
\        /    Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
 "-____-"     Trainable parameters = 1,519,616/3,000,000,000 (0.05% trained)


Step,Training Loss
1,3.7373
2,3.6831
3,3.7484
4,3.6575
5,3.5715
6,3.588
7,3.2856
8,3.158
9,3.0642
10,2.9929


Unsloth: Will smartly offload gradients to save VRAM!
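
The banner above reports a total batch size of 8 and 80 total steps. The short calculation below, using only the values set in `TrainingArguments` and the dataset size shown in the banner, works out how much of the training split this run actually sees.

```python
per_device_batch_size = 2
gradient_accumulation_steps = 4
num_gpus = 1
max_steps = 80
num_train_examples = 120_000

# One optimizer step consumes this many examples.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
# Examples consumed over the whole run.
examples_seen = effective_batch_size * max_steps
fraction_of_epoch = examples_seen / num_train_examples

print(effective_batch_size)        # 8
print(examples_seen)               # 640
print(f"{fraction_of_epoch:.2%}")  # 0.53%
```

So with `max_steps = 80` the model is fine-tuned on 640 of the 120,000 training examples; setting `num_train_epochs = 1` instead would cover the full split.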


In [10]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

messages = [
    {"role": "user", "content": "Classify this news article in one of the categories"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize = True,
    add_generation_prompt = True, # Must add for generation
    return_tensors = "pt",
).to("cuda")

outputs = model.generate(input_ids = inputs, max_new_tokens = 64, use_cache = True,
                         temperature = 1.2, min_p = 0.1)
tokenizer.batch_decode(outputs)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


["<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nCutting Knowledge Date: December 2023\nToday Date: 26 July 2024\n\n<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nClassify this news article in one of the categories<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\nI don't see a news article. Can you provide one, and I'll do my best to classify it into a category such as World, Sports, Business, etc.<|eot_id|>"]

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

**[NEW] Try 2x faster inference in a free Colab for Llama-3.1 8b Instruct [here](https://colab.research.google.com/drive/1T-YBVfnphoVc8E2E854qF3jdia2Ll2W2?usp=sharing)**

We use `min_p = 0.1` and `temperature = 1.2` in the generation cell above. Read this [Tweet](https://x.com/menhguin/status/1826132708508213629) for more information on why.
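
For intuition, here is a minimal pure-Python sketch of min-p filtering as we understand it: after temperature scaling, tokens whose probability is below `min_p` times the probability of the most likely token are dropped, and the rest are renormalized. This is an illustration only, not the Transformers implementation; `softmax` and `min_p_filter` are helper names we made up.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature flattens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def min_p_filter(probs, min_p):
    """Zero out tokens with probability below min_p * max(probs), then renormalize."""
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]


# Toy distribution: the threshold is 0.1 * 0.6 = 0.06, so both 0.05 tokens are dropped.
filtered = min_p_filter([0.6, 0.3, 0.05, 0.05], min_p=0.1)
print(filtered)

# Same idea starting from logits, with a temperature above 1.
warm = min_p_filter(softmax([2.0, 1.0, 0.0, -3.0], temperature=1.2), min_p=0.1)
print(warm)
```

Because the cutoff scales with the top probability, a fairly high temperature can be used without sampling very unlikely tokens.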

In [11]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference




PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): LlamaForCausalLM(
      (model): LlamaModel(
        (embed_tokens): Embedding(128256, 3072, padding_idx=128004)
        (layers): ModuleList(
          (0-27): 28 x LlamaDecoderLayer(
            (self_attn): LlamaAttention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=3072, out_features=3072, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=3072, out_features=1, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=1, out_features=3072, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.

In [12]:
import re
# Label Mapping (Ensure numeric labels match)
label_mapping = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Science/Technology"
}

def classify_text_unsloth(text):
    """
    Classifies a news article using the fine-tuned LLaMA 3.2 model.
    """
    messages = [
        {"role": "user", "content": (
            "Classify this news article into one of the following categories:\n"
                    "- World\n"
                    "- Sports\n"
                    "- Business\n"
                    "- Science/Technology\n\n"
                    f"Article: {text}\n\n"
                    "Respond with only one word that best fits: World, Sports, Business, or Science/Technology."
        )}
    ]

    # Tokenize input for inference
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Needed for text generation
        return_tensors="pt",
    ).to("cuda")

    # Generate classification result
    outputs = model.generate(input_ids=inputs, max_new_tokens=3, use_cache=True)

    # Decode generated output
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].strip()

    #print("response:", response)  # Debugging print

    # Extract only the last valid word (category)
    match = re.findall(r"(World|Sports|Business|Science/Technology)\b", response, re.IGNORECASE)
    if match:
        return match[-1]  # Return the last matched category

    print(f"Invalid response: {response}")  # Debugging print
    return "Invalid"
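
One caveat with the parsing above: for decoder-only models, `generate` returns the prompt tokens followed by the new tokens, so the decoded `response` also contains the category list from the prompt. The regex therefore always finds a match, and if the generated part is not a valid category the last match comes from the prompt. The sketch below illustrates the effect on plain strings; `extract_category` is a hypothetical helper, and the slicing mentioned in the comment is an assumption about how one might adapt the function, not something this notebook ran.

```python
import re

CATEGORIES = ("World", "Sports", "Business", "Science/Technology")
PATTERN = re.compile(r"(World|Sports|Business|Science/Technology)\b", re.IGNORECASE)
CANONICAL = {name.lower(): name for name in CATEGORIES}


def extract_category(text):
    """Return the last category named in `text` (canonical casing), or 'Invalid'."""
    matches = PATTERN.findall(text)
    if not matches:
        return "Invalid"
    return CANONICAL[matches[-1].lower()]


prompt = (
    "Respond with only one word that best fits: "
    "World, Sports, Business, or Science/Technology."
)
generated = "Bus"  # a truncated, invalid answer

# Searching prompt + answer silently falls back to the prompt's last category.
print(extract_category(prompt + "assistant\n\n" + generated))  # Science/Technology

# Searching only the generated part flags the answer as invalid.
# In the notebook this would roughly mean decoding outputs[:, inputs.shape[1]:].
print(extract_category(generated))  # Invalid
```

Mapping matches back to canonical casing also avoids a mismatch with the label strings when the model answers in lower case.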

## Compute sample outputs

In [13]:
import random

# Extract test dataset texts and labels
test_texts = orig_dataset["test"]["text"]  # Get news article texts
true_labels = orig_dataset["test"]["label"]  # Get true labels (numeric)

# Select a few random test samples
num_samples = 5  # Number of samples to test
random_indices = random.sample(range(len(test_texts)), num_samples)

# Test and print results
for idx in random_indices:
    text = test_texts[idx]  # Get the test news article
    true_label = label_mapping[true_labels[idx]]  # Convert numeric label to category name
    predicted_label = classify_text_unsloth(text)  # Get model prediction

    print("=" * 80)
    print(f"📰 **News Article:** {text}\n")
    print(f"✅ **Expected Category:** {true_label}")
    print(f"🤖 **Predicted Category:** {predicted_label}")
    print("=" * 80 + "\n")

📰 **News Article:** Obesity Raises Risk for 9 Different Types of Cancer By LAURAN NEERGAARD    WASHINGTON (AP) -- Heart disease and diabetes get all the attention, but expanding waistlines increase the risk for at least nine types of cancer, too. And with the obesity epidemic showing no signs of waning, specialists say they need to better understand how fat cells fuels cancer growth so they might fight back...

✅ **Expected Category:** Science/Technology
🤖 **Predicted Category:** Science/Technology

📰 **News Article:** Kenyan activist plants tree to mark Nobel prize Crying with delight, Kenyan environmentalist Wangari Maathai planted a tree to celebrate winning the Nobel Peace Prize on Friday and vowed to use the money 

✅ **Expected Category:** World
🤖 **Predicted Category:** World

📰 **News Article:** Semiconductor Manufacturing to Boost Capacity by Half (Update2) Semiconductor Manufacturing International Corp., China #39;s biggest supplier of made-to-order chips, said its factory ca

## Compute Accuracy

In [14]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm
from datasets import load_dataset

# Load AG News dataset
dataset = load_dataset("ag_news")

# Extract test dataset texts and numeric labels
test_texts = dataset["test"]["text"]
true_labels_numeric = np.array(dataset["test"]["label"])  # Numeric labels (0,1,2,3)

# Convert numeric labels to text labels
true_labels_text = np.array([label_mapping[label] for label in true_labels_numeric])  # Convert to category names

# Run classification on test dataset
predicted_labels = []
for text in tqdm(test_texts):
    predicted_labels.append(classify_text_unsloth(text))  # Get text label prediction

predicted_labels = np.array(predicted_labels)  # Convert to NumPy array

# Filter valid indices (Ignore invalid responses)
valid_indices = predicted_labels != "Invalid"
filtered_true_labels = true_labels_text[valid_indices]
filtered_predicted_labels = predicted_labels[valid_indices]

# Compute accuracy
accuracy = accuracy_score(filtered_true_labels, filtered_predicted_labels)

# Compute F1 score (macro-averaged to balance all classes equally)
f1 = f1_score(filtered_true_labels, filtered_predicted_labels, average="macro")

print(f"\nClassification Accuracy: {accuracy:.2%} on {len(filtered_true_labels)} samples")
print(f"F1 Score (Macro-Averaged): {f1:.4f}")


100%|██████████| 7600/7600 [27:09<00:00,  4.66it/s]



Classification Accuracy: 82.08% on 7600 samples
F1 Score (Macro-Averaged): 0.8206
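
To show what the two metrics measure, and what filtering out `"Invalid"` predictions does to them, here is a small pure-Python sketch on made-up toy labels. The functions `accuracy` and `macro_f1` are simplified stand-ins for the scikit-learn calls above, written only for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1, with F1 = 2*TP / (2*TP + FP + FN)."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


labels = ["World", "Sports", "Business"]
y_true = ["World", "World", "Sports", "Sports", "Business", "Business"]
y_pred = ["World", "Sports", "Sports", "Sports", "Business", "Invalid"]

# Option 1: count the invalid answer as an error.
acc_all = accuracy(y_true, y_pred)
f1_all = macro_f1(y_true, y_pred, labels)

# Option 2: drop invalid answers first, as the evaluation cell above does.
kept = [(t, p) for t, p in zip(y_true, y_pred) if p != "Invalid"]
kept_true = [t for t, _ in kept]
kept_pred = [p for _, p in kept]
acc_kept = accuracy(kept_true, kept_pred)
f1_kept = macro_f1(kept_true, kept_pred, labels)

print(f"all samples:   accuracy={acc_all:.3f}, macro F1={f1_all:.3f}")
print(f"valid samples: accuracy={acc_kept:.3f}, macro F1={f1_kept:.3f}")
```

On this toy data both scores are higher after filtering, so it is worth reporting the number of evaluated samples alongside the metrics, as the cell above does.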


## Testing the original model's performance

In [None]:
orig_mode, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Llama-3.2-3B-Instruct-bnb-4bit", # or choose "unsloth/Llama-3.2-1B-Instruct"
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth 2025.3.14: Fast Llama patching. Transformers: 4.48.3.
   \\   /|    Tesla T4. Num GPUs = 1. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!


In [None]:
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "llama-3.1",
)
FastLanguageModel.for_inference(orig_mode) # Enable native 2x faster inference




LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 3072, padding_idx=128004)
    (layers): ModuleList(
      (0-27): 28 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (k_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (v_proj): Linear4bit(in_features=3072, out_features=1024, bias=False)
          (o_proj): Linear4bit(in_features=3072, out_features=3072, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (up_proj): Linear4bit(in_features=3072, out_features=8192, bias=False)
          (down_proj): Linear4bit(in_features=8192, out_features=3072, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((3072,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm

In [None]:
import re
# Label Mapping (Ensure numeric labels match)
label_mapping = {
    0: "World",
    1: "Sports",
    2: "Business",
    3: "Science/Technology"
}

def classify_text_unsloth_original(text):
    """
    Classifies a news article using the original (not fine-tuned) LLaMA 3.2 model.
    """
    messages = [
        {"role": "user", "content": (
            "Classify this news article into one of the following categories:\n"
                    "- World\n"
                    "- Sports\n"
                    "- Business\n"
                    "- Science/Technology\n\n"
                    f"Article: {text}\n\n"
                    "Respond with only one word that best fits: World, Sports, Business, or Science/Technology."
        )}
    ]

    # Tokenize input for inference
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,  # Needed for text generation
        return_tensors="pt",
    ).to("cuda")

    # Generate classification result
    outputs = orig_mode.generate(input_ids=inputs, max_new_tokens=3, use_cache=True)

    # Decode generated output
    response = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0].strip()

    #print("response:", response)  # Debugging print

    # Extract only the last valid word (category)
    match = re.findall(r"(World|Sports|Business|Science/Technology)\b", response, re.IGNORECASE)
    if match:
        return match[-1]  # Return the last matched category

    print(f"Invalid response: {response}")  # Debugging print
    return "Invalid"

In [None]:
import random

# Extract test dataset texts and labels
test_texts = orig_dataset["test"]["text"]  # Get news article texts
true_labels = orig_dataset["test"]["label"]  # Get true labels (numeric)

# Select a few random test samples
num_samples = 5  # Number of samples to test
random_indices = random.sample(range(len(test_texts)), num_samples)

# Test and print results
for idx in random_indices:
    text = test_texts[idx]  # Get the test news article
    true_label = label_mapping[true_labels[idx]]  # Convert numeric label to category name
    predicted_label = classify_text_unsloth_original(text)  # Get model prediction

    print("=" * 80)
    print(f"📰 **News Article:** {text}\n")
    print(f"✅ **Expected Category:** {true_label}")
    print(f"🤖 **Predicted Category:** {predicted_label}")
    print("=" * 80 + "\n")

📰 **News Article:** ** For the latest news, please refresh this page regularly ** What #39;s up? I see the whole world has their eyes on the Oscar vs. Bernard fight. My thought is Oscar is coming off with an upset.

✅ **Expected Category:** Sports
🤖 **Predicted Category:** Sports

📰 **News Article:** Abductor kills self in Moscow region hostage freeing operation MOSCOW, November 25 (Itar-Tass) - A special task force unit freed both women who were held hostage by two armed army deserters in the Moscow region.

✅ **Expected Category:** World
🤖 **Predicted Category:** World

📰 **News Article:** China's Lenovo to buy IBM's PC business TOKYO - China's Lenovo Group Ltd. signed a definitive agreement on Wednesday to acquire IBM Corp.'s personal computing division. Lenovo will pay US\$1.25 billion in cash for the business, which is expected to transform it into the world's number three PC maker, the companies announced.

✅ **Expected Category:** Science/Technology
🤖 **Predicted Category:** Bus

In [None]:
import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from tqdm import tqdm
from datasets import load_dataset

# Load AG News dataset
dataset = load_dataset("ag_news")

# Extract test dataset texts and numeric labels
test_texts = dataset["test"]["text"]
true_labels_numeric = np.array(dataset["test"]["label"])  # Numeric labels (0,1,2,3)

# Convert numeric labels to text labels
true_labels_text = np.array([label_mapping[label] for label in true_labels_numeric])  # Convert to category names

# Run classification on test dataset
predicted_labels = []
for text in tqdm(test_texts):
    predicted_labels.append(classify_text_unsloth_original(text))  # Get text label prediction

predicted_labels = np.array(predicted_labels)  # Convert to NumPy array

# Filter valid indices (Ignore invalid responses)
valid_indices = predicted_labels != "Invalid"
filtered_true_labels = true_labels_text[valid_indices]
filtered_predicted_labels = predicted_labels[valid_indices]

# Compute accuracy
accuracy = accuracy_score(filtered_true_labels, filtered_predicted_labels)

# Compute F1 score (macro-averaged to balance all classes equally)
f1 = f1_score(filtered_true_labels, filtered_predicted_labels, average="macro")

print(f"\nClassification Accuracy: {accuracy:.2%} on {len(filtered_true_labels)} samples")
print(f"F1 Score (Macro-Averaged): {f1:.4f}")


100%|██████████| 7600/7600 [21:38<00:00,  5.85it/s]


Classification Accuracy: 75.00% on 7600 samples
F1 Score (Macro-Averaged): 0.7140
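
The short calculation below puts the two evaluation runs side by side, using only the numbers printed in this notebook. Note that the two runs were logged with different Unsloth and Transformers versions, so treat the comparison as indicative rather than exact.

```python
finetuned = {"accuracy": 0.8208, "macro_f1": 0.8206}
original = {"accuracy": 0.7500, "macro_f1": 0.7140}

# Absolute gains.
accuracy_gain_points = (finetuned["accuracy"] - original["accuracy"]) * 100
f1_gain = finetuned["macro_f1"] - original["macro_f1"]

# Share of the original model's errors removed by fine-tuning.
original_error = 1 - original["accuracy"]
finetuned_error = 1 - finetuned["accuracy"]
relative_error_reduction = (original_error - finetuned_error) / original_error

print(f"{accuracy_gain_points:.2f}")      # 7.08
print(f"{f1_gain:.4f}")                   # 0.1066
print(f"{relative_error_reduction:.1%}")  # 28.3%
```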



