## Fine-Tune SmolLM-135M-Instruct on Anime Understanding Dataset

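This notebook fine-tunes the SmolLM-135M-Instruct model on the anime-understanding-dataset using full fine-tuning (all weights are updated, no adapters). It is designed to run in Google Colab and is commented throughout. A second run later in the notebook repeats the same pipeline with SmolLM-1.7B-Instruct loaded in bfloat16.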

In [None]:
# ------------------------ Install Required Packages ------------------------
!pip install transformers datasets evaluate nltk rouge_score polars

Successfully installed evaluate-0.4.4 rouge_score-0.1.2


In [None]:
# ------------------------ Import Libraries and Set Up Environment ------------------------
import torch
import numpy as np
import evaluate
import nltk
from datasets import Dataset, concatenate_datasets
import polars as pl
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments,
    DataCollatorForLanguageModeling
)
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

nltk.download("punkt")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# ------------------------ Load Model and Tokenizer ------------------------
model_name = "HuggingFaceTB/SmolLM-135M-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float32,  # Full precision (float32) rather than float16
    device_map=None
)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
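
Loading in float32 and updating every weight has a memory cost that is easy to estimate before training. The sketch below is a rough back-of-the-envelope calculation, not a measurement: it assumes an Adam-style optimizer that keeps two moment estimates per weight in the same dtype as the weights, uses the nominal 135M parameter count from the model name, and ignores activations and framework overhead. The function name `estimate_memory_gb` is hypothetical.

```python
def estimate_memory_gb(num_params: float, bytes_per_param: int) -> dict:
    """Rough memory estimate for full fine-tuning with an Adam-style optimizer."""
    weights = num_params * bytes_per_param
    gradients = weights             # one gradient value per weight
    optimizer_states = 2 * weights  # assumption: two moment estimates per weight
    total = weights + gradients + optimizer_states
    gib = 1024 ** 3
    return {"weights_gb": weights / gib, "training_total_gb": total / gib}


# Nominal 135M parameters stored as float32 (4 bytes each)
small = estimate_memory_gb(135e6, 4)
print(small)
```

Under these assumptions the weights alone take about half a gigabyte and the training state about four times that, which is why the small model fits comfortably on a free Colab GPU in full precision.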



In [None]:
# ------------------------ Load Dataset ------------------------
anime_configs = [
    "chainsawman", "kurokonobasuke", "onepunch", "hellsing", "frieren", "aot",
    "naruto", "dr_stone", "gundam_00", "darling-in-the-franxx",
    "berserk", "evangelion", "onepiece"
]

train_splits, val_splits = [], []
for cfg in anime_configs:
    df_train = pl.read_ndjson(f"hf://datasets/theblackcat102/anime-understanding-dataset/{cfg}_dev.jsonl").to_pandas()
    df_val = pl.read_ndjson(f"hf://datasets/theblackcat102/anime-understanding-dataset/{cfg}_val.jsonl").to_pandas()
    train_splits.append(Dataset.from_pandas(df_train))
    val_splits.append(Dataset.from_pandas(df_val))

train_dataset = concatenate_datasets(train_splits)
val_dataset = concatenate_datasets(val_splits)



In [None]:
# ------------------------ Format and Tokenize ------------------------
def format_and_tokenize(example):
    question = example['question']
    prompt = (f"Question: {question}\nOptions:\n"
              f"A. {example['A']}\nB. {example['B']}\n"
              f"C. {example['C']}\nD. {example['D']}\nAnswer:")
    correct = example[example['answer']]
    full_text = prompt + " " + correct

    # Tokenize the full text, truncating and padding to a fixed length.
    # Plain Python lists are returned, which is what datasets.map expects.
    tokens = tokenizer(
        full_text,
        truncation=True,
        max_length=512,
        padding="max_length",  # Explicitly pad to max_length
    )

    # Labels mirror input_ids, with padded positions set to -100 so the loss ignores them.
    # The attention mask is used instead of pad_token_id because the pad token may be
    # the same as the EOS token.
    # Note: DataCollatorForLanguageModeling(mlm=False) rebuilds labels from input_ids
    # at batch time, so these labels mainly document the intent.
    tokens["labels"] = [
        tok if mask == 1 else -100
        for tok, mask in zip(tokens["input_ids"], tokens["attention_mask"])
    ]

    return tokens

train_dataset = train_dataset.map(format_and_tokenize, remove_columns=train_dataset.column_names)
val_dataset = val_dataset.map(format_and_tokenize, remove_columns=val_dataset.column_names)

Map:   0%|          | 0/65 [00:00<?, ? examples/s]

Map:   0%|          | 0/130 [00:00<?, ? examples/s]
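
Padded positions carry no training signal, so their labels are set to -100, the index that PyTorch's cross-entropy loss ignores by default. When the pad token shares an id with the EOS token, masking by token id would also hide a genuine EOS, whereas masking by attention mask does not. A minimal sketch with hypothetical token ids (the helper name `mask_padding_labels` is not part of any library):

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        tok if keep == 1 else ignore_index
        for tok, keep in zip(input_ids, attention_mask)
    ]


# Hypothetical ids: 2 plays the role of both EOS and PAD.
input_ids = [11, 12, 13, 2, 2, 2]
attention_mask = [1, 1, 1, 1, 0, 0]  # the first 2 is a real EOS, the rest is padding

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)  # [11, 12, 13, 2, -100, -100]
```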

In [None]:
# ------------------------ Collator & Metrics ------------------------
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred

    # Convert predictions to flat lists
    predictions = np.argmax(predictions, axis=-1) if predictions.ndim == 3 else predictions

    # Replace -100 in labels to pad_token_id so decoding works correctly
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)

    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

    rouge = evaluate.load("rouge")
    bleu_scores = []
    perplexities = []

    scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
    smoothing = SmoothingFunction().method1

    for ref, pred in zip(decoded_labels, decoded_preds):
        bleu = sentence_bleu([ref.split()], pred.split(), smoothing_function=smoothing)
        bleu_scores.append(bleu)

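        # Perplexity: re-score the decoded prediction text with the model.
        # This measures how likely the model finds its own output, not the reference.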
        inputs = tokenizer(pred, return_tensors="pt").input_ids.to(device)
        with torch.no_grad():
            loss = model(inputs, labels=inputs).loss
            ppl = torch.exp(loss).item()
            perplexities.append(ppl)

    result = rouge.compute(predictions=decoded_preds, references=decoded_labels)
    result["bleu"] = np.mean(bleu_scores)
    result["perplexity"] = np.mean(perplexities)
    return result
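
To make the ROUGE-1 number easier to interpret, here is a simplified sketch of the unigram-overlap F-measure it reports. It is an illustration only: it assumes lowercase whitespace tokenization and no stemming, so its values will not exactly match the `rouge_score` package, and the function name `rouge1_f1` is hypothetical.

```python
from collections import Counter


def rouge1_f1(reference: str, prediction: str) -> float:
    """Unigram-overlap F1 between a reference and a prediction (simplified ROUGE-1)."""
    ref_counts = Counter(reference.lower().split())
    pred_counts = Counter(prediction.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts
    overlap = sum((ref_counts & pred_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


partial = rouge1_f1(
    "Because he still dreams of starting a new life with her",
    "he dreams of a new life",
)
empty = rouge1_f1("any reference", "")
print(partial, empty)
```

An empty prediction always scores 0, which is worth keeping in mind when a generation loop returns empty strings.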

In [None]:
# ------------------------ Training ------------------------
training_args = TrainingArguments(
    output_dir="./anime_qa_full_finetune",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=2,
    num_train_epochs=3,
    learning_rate=3e-5,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=1,
    report_to="none",
    remove_unused_columns=False,
    fp16=False  # Full precision
)

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)



In [None]:
trainer.train()

Epoch,Training Loss,Validation Loss,Rouge1,Rouge2,Rougel,Rougelsum,Bleu,Perplexity
1,2.6468,2.325262,0.54668,0.264781,0.491445,0.541963,0.165536,37.743707
2,2.0885,2.193658,0.551345,0.271115,0.500329,0.546596,0.177434,39.754212
3,2.0058,2.158518,0.556078,0.276091,0.50624,0.551397,0.184617,40.732653



TrainOutput(global_step=51, training_loss=2.243518927518059, metrics={'train_runtime': 96.5361, 'train_samples_per_second': 2.02, 'train_steps_per_second': 0.528, 'total_flos': 63620118282240.0, 'train_loss': 2.243518927518059, 'epoch': 3.0})
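
The `global_step=51` above can be cross-checked from the training arguments. The sketch below assumes a single device and assumes that the final partial gradient-accumulation window of each epoch still counts as one optimizer step; that rounding reproduces the reported step counts here, but it may differ between `transformers` versions. The helper name `optimizer_steps` is hypothetical.

```python
import math


def optimizer_steps(num_examples, per_device_batch_size, grad_accum_steps,
                    epochs, num_devices=1):
    """Estimate the number of optimizer updates for a training run."""
    batches_per_epoch = math.ceil(num_examples / (per_device_batch_size * num_devices))
    # Assumption: a trailing partial accumulation window is rounded up to one step
    steps_per_epoch = math.ceil(batches_per_epoch / grad_accum_steps)
    return steps_per_epoch * epochs


# 65 training examples, batch size 2, accumulation 2, 3 epochs
first_run = optimizer_steps(65, 2, 2, 3)
# Same data with batch size 4, accumulation 2, 15 epochs (the later 1.7B configuration)
second_run = optimizer_steps(65, 4, 2, 15)
print(first_run, second_run)  # 51 135
```

The effective batch size is the per-device batch size times the accumulation steps, so both configurations update the weights only 17 or 9 times per epoch on this small dataset.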

In [None]:
trainer.evaluate()

{'eval_loss': 2.158517599105835,
 'eval_rouge1': 0.5560781557114293,
 'eval_rouge2': 0.27609107238545777,
 'eval_rougeL': 0.5062397585989867,
 'eval_rougeLsum': 0.5513971451679178,
 'eval_bleu': 0.18461674992444066,
 'eval_perplexity': 40.73265292094304,
 'eval_runtime': 18.6341,
 'eval_samples_per_second': 6.976,
 'eval_steps_per_second': 3.488,
 'epoch': 3.0}
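
Perplexity is conventionally the exponential of the mean token-level negative log-likelihood, so `eval_loss` can be converted directly. This is a minimal sketch under the assumption that `eval_loss` approximates that mean over non-padded tokens; the function name is hypothetical.

```python
import math


def perplexity_from_loss(mean_nll: float) -> float:
    """Convert a mean negative log-likelihood (in nats) into perplexity."""
    return math.exp(mean_nll)


eval_loss = 2.158517599105835  # value reported by trainer.evaluate() above
ppl = perplexity_from_loss(eval_loss)
print(f"{ppl:.2f}")
```

The result is well below the `eval_perplexity` of about 40.7 reported above, because that metric is computed differently: `compute_metrics` re-scores the decoded predictions with the model instead of exponentiating the evaluation loss.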

In [None]:
#### Save the fine-tuned model
trainer.save_model("./smollm_finetuned/final_model")
tokenizer.save_pretrained("./smollm_finetuned/final_model")
print("Fine-tuning completed and model saved")

Fine-tuning completed and model saved


In [None]:
### Step 9: Test the Fine-Tuned Model

#### Load fine-tuned model and tokenizer
fine_tuned_model = AutoModelForCausalLM.from_pretrained("./smollm_finetuned/final_model", torch_dtype=torch.bfloat16).to(device)
fine_tuned_tokenizer = AutoTokenizer.from_pretrained("./smollm_finetuned/final_model")

In [None]:
#### Test input
test_input = "Instruction: Who is the main character in Chainsaw Man?\n\nResponse:"
inputs = fine_tuned_tokenizer(test_input, return_tensors="pt").to(device)

#### Generate response
outputs = fine_tuned_model.generate(
    **inputs,  # passes input_ids together with the attention_mask
    max_new_tokens=40,  # the model's generation config already set this, overriding max_length=100
    num_return_sequences=1,
    do_sample=True,
    temperature=0.7,
    pad_token_id=fine_tuned_tokenizer.pad_token_id
)


In [None]:
#### Decode and print response
response = fine_tuned_tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Model response:")
print(response)

Model response:
Instruction: Who is the main character in Chainsaw Man?

Response: Chainsaw Man, a young man on a hunt for a rare, magical artifact.

**Scene 4: The Gathering**

(Cut to the gathering scene, where the characters
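
The test prompt above uses an `Instruction:` / `Response:` layout, while the training examples were formatted as `Question:` / `Options:` / `Answer:`. Prompting in the training layout is more likely to elicit the behaviour the model was tuned for. A small sketch of a prompt builder that mirrors the training format; the function name `build_prompt` and the answer options are hypothetical and not taken from the dataset.

```python
def build_prompt(question: str, options: dict) -> str:
    """Format a multiple-choice question the same way as the training examples."""
    lines = [f"Question: {question}", "Options:"]
    lines += [f"{key}. {options[key]}" for key in ("A", "B", "C", "D")]
    lines.append("Answer:")
    return "\n".join(lines)


prompt = build_prompt(
    "Who is the main character in Chainsaw Man?",
    # Hypothetical options, written for this illustration only
    {"A": "Denji", "B": "Makima", "C": "Power", "D": "Aki Hayakawa"},
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of `test_input`.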


In [None]:
# ----------- Full Fine-Tuning for Anime QA Dataset with Perplexity, BLEU, and ROUGE Evaluation -----------
# ------------ Install Dependencies --------------
!pip install -q transformers datasets evaluate rouge_score nltk polars



In [None]:
# ------------ Imports -------------------
import torch
import numpy as np
import math
import polars as pl
from datasets import Dataset, concatenate_datasets
from transformers import (
    AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments,
    DataCollatorForLanguageModeling
)
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
# -------- Setup Device and Model ---------------
device = "cuda" if torch.cuda.is_available() else "cpu"
model_name = "HuggingFaceTB/SmolLM-1.7B-Instruct"

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token = tokenizer.eos_token

# Load the model in bfloat16 for full fine-tuning
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    # device_map="auto",
    torch_dtype=torch.bfloat16  # use torch.float16 for older GPUs
)

In [None]:
# -------- Load Dataset and Preprocess ----------
anime_configs = [
    "chainsawman", "kurokonobasuke", "onepunch", "hellsing", "frieren", "aot",
    "naruto", "dr_stone", "gundam_00", "darling-in-the-franxx",
    "berserk", "evangelion", "onepiece"
]

train_splits, val_splits = [], []
for cfg in anime_configs:
    df_train = pl.read_ndjson(f"hf://datasets/theblackcat102/anime-understanding-dataset/{cfg}_dev.jsonl").to_pandas()
    df_val = pl.read_ndjson(f"hf://datasets/theblackcat102/anime-understanding-dataset/{cfg}_val.jsonl").to_pandas()
    train_splits.append(Dataset.from_pandas(df_train))
    val_splits.append(Dataset.from_pandas(df_val))

train_dataset = concatenate_datasets(train_splits)
val_dataset = concatenate_datasets(val_splits)



In [None]:
def format_and_tokenize(example):
    question = example['question']
    prompt = (f"Question: {question}\nOptions:\n"
              f"A. {example['A']}\nB. {example['B']}\n"
              f"C. {example['C']}\nD. {example['D']}\nAnswer:")
    correct = example[example['answer']]
    full_text = prompt + " " + correct

    tokens = tokenizer(
        full_text,
        truncation=True,
        padding="max_length",   # ✅ PAD to max_length
        max_length=512,
        return_attention_mask=True,
    )
    tokens["labels"] = tokens["input_ids"].copy()  # Make sure labels are same size
    return tokens


train_dataset = train_dataset.map(format_and_tokenize, remove_columns=train_dataset.column_names)
val_dataset = val_dataset.map(format_and_tokenize, remove_columns=val_dataset.column_names)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

Map:   0%|          | 0/65 [00:00<?, ? examples/s]

Map:   0%|          | 0/130 [00:00<?, ? examples/s]

NameError: name 'DataCollatorForLanguageModeling' is not defined

In [None]:
# -------- TrainingArguments and Trainer --------
args = TrainingArguments(
    output_dir="./anime_qa_full_finetune",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=2,
    num_train_epochs=15,
    learning_rate=2e-5,
    logging_steps=10,
    eval_strategy="epoch",
    save_strategy="epoch",
    save_total_limit=2,
    report_to="none",
    bf16=True,
    remove_unused_columns=True # Changed to True
)

In [None]:
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    data_collator=collator
)

In [None]:
# -------- Train -------------------------------
trainer.train()

Epoch,Training Loss,Validation Loss
1,No log,2.369227
2,2.380700,2.31589
3,2.244700,2.283035
4,2.271500,2.255913
5,2.280800,2.236579
6,2.198100,2.219742
7,2.132200,2.20795
8,2.218400,2.199729
9,2.179100,2.191886
10,2.087200,2.189082


TrainOutput(global_step=135, training_loss=2.1803527690746165, metrics={'train_runtime': 544.9662, 'train_samples_per_second': 1.789, 'train_steps_per_second': 0.248, 'total_flos': 4824407841177600.0, 'train_loss': 2.1803527690746165, 'epoch': 15.0})

In [None]:
# -------- Save Model --------------------------
model.save_pretrained("./anime_qa_full_finetuned_model")
tokenizer.save_pretrained("./anime_qa_full_finetuned_model")

In [None]:
# -------- Evaluation Metrics ------------------
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
smoothing = SmoothingFunction().method1
model.eval()

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(49152, 2048, padding_idx=2)
    (layers): ModuleList(
      (0-23): 24 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (v_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
 

In [None]:
predictions = []
references = []
rouge1s, rougeLs, bleus, perplexities = [], [], [], []

for example in val_dataset.select(range(15)):
    # NOTE: input_ids holds the full padded sequence (prompt + answer + padding) and no
    # attention mask is passed, so the model is asked to continue text that already
    # contains the answer. This likely explains the empty outputs recorded below.
    input_ids = torch.tensor([example['input_ids']]).to(device)
    label_ids = example['labels']
    with torch.no_grad():
        generated_ids = model.generate(input_ids, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id)
    # Keep only the newly generated tokens; the reference is the full prompt + answer text
    output = tokenizer.decode(generated_ids[0][len(input_ids[0]):], skip_special_tokens=True).strip()
    reference = tokenizer.decode([i for i in label_ids if i != -100], skip_special_tokens=True).strip()

    predictions.append(output)
    references.append([reference])

    # ROUGE & BLEU (BLEU is computed once and reused for the printout)
    r = scorer.score(reference, output)
    bleu = sentence_bleu([reference.split()], output.split(), smoothing_function=smoothing)
    rouge1s.append(r['rouge1'].fmeasure)
    rougeLs.append(r['rougeL'].fmeasure)
    bleus.append(bleu)

    # Perplexity of the reference text under the fine-tuned model
    ref_ids = tokenizer(reference, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        loss = model(ref_ids, labels=ref_ids).loss
    ppl = torch.exp(loss).item()
    perplexities.append(ppl)

    print(f"Perplexity: {ppl}")
    print(f"BLEU: {bleu}")
    print(f"ROUGE-1: {r['rouge1'].fmeasure}")
    print(f"ROUGE-L: {r['rougeL'].fmeasure}")
    print(f"Output: {output}")
    print(f"Reference: {reference}")

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


Perplexity: 9.92021369934082
BLEU: 0
ROUGE-1: 0.0
ROUGE-L: 0
Output: 
Reference: Question: Why does Denji refuse to kill Reze?
Options:
A. Because he still dreams of starting a new life with her
B. Because he is afraid of her powers
C. Because Makima ordered him not to
D. Because he believes she will turn good
Answer: Because he still dreams of starting a new life with her
Perplexity: 13.313000679016113
BLEU: 0
ROUGE-1: 0.0
ROUGE-L: 0
Output: 
Reference: Question: What is Denji's relationship with Asa Mitaka/Yoru?
Options:
A. Enemies from the beginning
B. Mutual indifference
C. Strong bond with complicated layers due to Yoru's intentions
D. Merely acquaintances with no personal connection
Answer: Strong bond with complicated layers due to Yoru's intentions
Perplexity: 15.46336841583252
BLEU: 0
ROUGE-1: 0.0
ROUGE-L: 0
Output: 
Reference: Question: What mission was Reze sent to Japan to accomplish?
Options:
A. To enroll in a school
B. To become a chef at the café Crossroads
C. To steal t

In [None]:
print("\n--- Evaluation Report ---")
print(f"Average ROUGE-1: {np.mean(rouge1s):.4f}")
print(f"Average ROUGE-L: {np.mean(rougeLs):.4f}")
print(f"Average BLEU:     {np.mean(bleus):.4f}")
print(f"Average Perplexity: {np.mean(perplexities):.2f}")


--- Evaluation Report ---
Average ROUGE-1: 0.0000
Average ROUGE-L: 0.0000
Average BLEU:     0.0000
Average Perplexity: 10.15
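
The all-zero ROUGE and BLEU scores follow from the empty outputs: the generation input already contains the answer, and the reference is the whole prompt plus answer. A generation-based evaluation needs the prompt alone as input and the answer alone as reference. The sketch below shows one way to separate the two from the decoded text, assuming the `Answer:` marker from the training format appears last in each example; the helper name `split_prompt_and_answer` is hypothetical.

```python
def split_prompt_and_answer(full_text: str, marker: str = "Answer:"):
    """Split a formatted example into (prompt ending with the marker, gold answer)."""
    head, sep, tail = full_text.rpartition(marker)
    if not sep:
        raise ValueError(f"marker {marker!r} not found in example")
    return head + sep, tail.strip()


example_text = (
    "Question: Why does Denji refuse to kill Reze?\n"
    "Options:\n"
    "A. Because he still dreams of starting a new life with her\n"
    "B. Because he is afraid of her powers\n"
    "C. Because Makima ordered him not to\n"
    "D. Because he believes she will turn good\n"
    "Answer: Because he still dreams of starting a new life with her"
)

prompt, gold = split_prompt_and_answer(example_text)
print(repr(prompt[-8:]), "|", gold)
```

The prompt would then be tokenized without padding and passed to `model.generate` together with its attention mask, and the decoded continuation compared against `gold`.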
