
# Fine-tuning a Small Language Model (SLM) on Text Data  


**Model Used:** google/flan-t5-small  
**Dataset Used:** SAMSum (Dialogue Summarization)  
**Technique:** Full fine-tuning with the Hugging Face `Trainer`  

---
## Objective
The objective of this lab is to fine‑tune a Small Language Model (SLM) on a text dataset and evaluate its performance using suitable metrics.


In [1]:

!pip install -q transformers datasets accelerate peft evaluate rouge_score bitsandbytes




In [11]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, TrainingArguments, Trainer, DataCollatorForSeq2Seq
import evaluate


## Step 1: Load Dataset
We use the SAMSum dataset which contains conversations and their summaries.


In [3]:

dataset = load_dataset("knkarthick/samsum")
dataset





DatasetDict({
    train: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 14731
    })
    validation: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 818
    })
    test: Dataset({
        features: ['id', 'dialogue', 'summary'],
        num_rows: 819
    })
})


## Step 2: Load Model and Tokenizer
We use `google/flan-t5-small`, an encoder-decoder model with roughly 77M parameters that is small enough to fine‑tune in Google Colab.


In [4]:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"  # Colab-friendly SLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)




## Step 3: Preprocessing the Dataset
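Each dialogue is tokenized and truncated to 512 tokens to form the model input. Each summary is tokenized and truncated to 128 tokens and stored as `labels`.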


In [5]:

def preprocess_function(examples):
    inputs = examples["dialogue"]
    targets = examples["summary"]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(targets, max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = dataset.map(preprocess_function, batched=True)
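
To make the data flow concrete, the sketch below mimics what a preprocessing function receives and returns when `map` is called with `batched=True`: a dict of lists goes in and a dict of lists comes out. `toy_tokenize`, `toy_preprocess`, and the tiny length limits are hypothetical stand-ins for the real tokenizer, used only so that the example runs without downloading a model.

```python
def toy_tokenize(texts, max_length):
    """Hypothetical whitespace tokenizer: one integer id per word, then truncation."""
    vocab = {}
    batch_ids = []
    for text in texts:
        # Assign the next free id to each unseen word (ids start at 1).
        ids = [vocab.setdefault(word, len(vocab) + 1) for word in text.split()]
        batch_ids.append(ids[:max_length])  # same effect as truncation=True
    return {"input_ids": batch_ids}


def toy_preprocess(examples, max_input=8, max_target=4):
    """Same structure as preprocess_function, with small limits for illustration."""
    model_inputs = toy_tokenize(examples["dialogue"], max_input)
    labels = toy_tokenize(examples["summary"], max_target)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs


# A batch is a dict whose values are lists with one entry per example.
batch = {
    "dialogue": [
        "Amanda: I baked cookies. Do you want some? Jerry: Sure!",
        "Bob: Lunch? Ann: Yes.",
    ],
    "summary": [
        "Amanda baked cookies for Jerry.",
        "They will have lunch.",
    ],
}

out = toy_preprocess(batch)
print(sorted(out.keys()))
print([len(ids) for ids in out["input_ids"]])
print([len(ids) for ids in out["labels"]])
```

The first dialogue has ten words and is cut to eight, which is the same kind of information loss that occurs when a real dialogue exceeds 512 tokens.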




## Step 4: Training Setup
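Define the training arguments and the trainer. To keep the run short on Colab, training uses the first 1,000 training examples and evaluation uses the first 200 validation examples, for a single epoch.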


In [12]:
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=1,
    weight_decay=0.01,
    save_total_limit=1,
    logging_steps=50,
    push_to_hub=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"].select(range(1000)),  # smaller subset for Colab
    eval_dataset=tokenized_datasets["validation"].select(range(200)),
    data_collator=data_collator
)
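
`DataCollatorForSeq2Seq` pads each batch to the length of its longest example, and by default it pads `labels` with -100 so that padded positions are ignored by the cross-entropy loss. The sketch below reproduces only that padding idea in plain Python. `pad_labels` is a hypothetical helper written for illustration, not a function from the library.

```python
def pad_labels(label_lists, pad_value=-100):
    """Right-pad every label sequence to the longest one in the batch.

    -100 is the default label padding value in DataCollatorForSeq2Seq and
    the default ignore index of PyTorch's cross-entropy loss.
    """
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_lists]


# Two summaries of different lengths (token ids are made up for illustration).
padded = pad_labels([[5, 6, 7], [8]])
print(padded)
```

Padding per batch rather than to a fixed 128 tokens keeps short summaries from wasting computation.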


## Step 5: Fine-tuning the Model


In [13]:

trainer.train()




| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1     | 8.830825      | 8.549427        |





TrainOutput(global_step=250, training_loss=9.01034130859375, metrics={'train_runtime': 1593.3509, 'train_samples_per_second': 0.628, 'train_steps_per_second': 0.157, 'total_flos': 95344401678336.0, 'train_loss': 9.01034130859375, 'epoch': 1.0})
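
The figures in `TrainOutput` can be cross-checked with simple arithmetic. The sketch below assumes a single device and no gradient accumulation (neither is set in the training arguments shown earlier), so one optimizer step consumes one batch of 4 examples.

```python
import math

train_examples = 1000      # size of the training subset passed to the Trainer
batch_size = 4             # per_device_train_batch_size
epochs = 1                 # num_train_epochs
train_runtime = 1593.3509  # seconds, copied from TrainOutput

# Assumption: one device and no gradient accumulation.
steps = math.ceil(train_examples / batch_size) * epochs
steps_per_second = steps / train_runtime
samples_per_second = train_examples * epochs / train_runtime

print(steps)
print(round(steps_per_second, 3))
print(round(samples_per_second, 3))
```

These values agree with `global_step=250`, `train_steps_per_second=0.157`, and `train_samples_per_second=0.628` in the output above.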


## Step 6: Evaluation
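The cell below loads the ROUGE metric, but `trainer.evaluate()` is called without a `compute_metrics` function, so it reports only the evaluation loss and runtime statistics. Obtaining ROUGE scores would additionally require generating summaries and comparing them with the reference summaries.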


In [14]:

rouge = evaluate.load("rouge")
metrics = trainer.evaluate()
metrics


{'eval_loss': 8.549427032470703,
 'eval_runtime': 79.8399,
 'eval_samples_per_second': 2.505,
 'eval_steps_per_second': 0.626,
 'epoch': 1.0}
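
To show what a ROUGE score measures, here is a minimal sketch of ROUGE-1 F1 (unigram overlap) in plain Python. It is a simplified illustration, not the `rouge_score` implementation: `rouge1_f1` is a hypothetical helper that uses lowercase whitespace tokenization and no stemming, and the two strings are made-up examples.

```python
from collections import Counter


def rouge1_f1(prediction, reference):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


score = rouge1_f1("amanda baked cookies", "amanda baked cookies for jerry")
print(f"{score:.2f}")
```

Here all three predicted words appear in the reference (precision 1.0), but only three of the five reference words are covered (recall 0.6). In the notebook, the loaded metric can be applied with `rouge.compute(predictions=..., references=...)` once summaries have been generated and decoded to strings.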


## Step 7: Observations

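- After one epoch on 1,000 training examples, the logged training loss is 8.83 and the validation loss is 8.55. Losses this high suggest that the model has not yet learned to summarize dialogues well.
- The last logged training loss (8.83) is below the average over the run (9.01), which suggests that the loss is decreasing. A single epoch is too short to confirm a trend.
- The ROUGE metric was loaded but not computed, so summarization quality has not been measured. ROUGE scores on generated summaries would provide that quantitative evaluation.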
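
For context on the reported evaluation loss, it can be converted to perplexity and compared with the loss of a uniform guess over the vocabulary. This sketch assumes that `eval_loss` is the mean per-token cross-entropy in nats and that the model's vocabulary has 32,128 entries. Both are assumptions, not values shown in the notebook output.

```python
import math

eval_loss = 8.549427032470703  # copied from the trainer.evaluate() output
vocab_size = 32128             # assumption: output vocabulary size of the model

# Perplexity is the exponential of the mean per-token cross-entropy.
perplexity = math.exp(eval_loss)

# A model that guesses uniformly over the vocabulary has loss ln(vocab_size).
uniform_loss = math.log(vocab_size)

print(f"perplexity: {perplexity:.0f}")
print(f"uniform-guess loss: {uniform_loss:.2f}")
```

Under these assumptions the evaluation loss is less than two nats below the uniform-guess baseline, so the reported numbers are worth investigating before drawing conclusions about summary quality.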

---
## Conclusion
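Fine‑tuning a Small Language Model on a domain dataset can improve its ability to perform targeted NLP tasks such as summarization. This lab sets up the full pipeline (data loading, tokenization, training, and evaluation), but the single short run reported here ends with a high validation loss, so a longer run and ROUGE scores on generated summaries are needed to confirm an improvement.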
