In [1]:
!pip install --quiet transformers datasets

In [2]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset

Here we import the necessary libraries: `torch` (tensor operations), `transformers` (to load pretrained models such as `distilgpt2`), and `datasets` (to load standard NLP datasets).

In [3]:
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)




`AutoTokenizer` converts text into token IDs that the model can understand, and `AutoModelForCausalLM` loads a causal language model (`distilgpt2`) suitable for text generation.
`Trainer` and `TrainingArguments` provide a high-level API for fine-tuning the model.



In [4]:
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
train_data = dataset["train"]
val_data = dataset["validation"]
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.pad_token_id

We used the Hugging Face `datasets` library to load the wikitext-2-raw-v1 dataset, which ships with predefined splits. We use two of them:
- train_data: used to fine-tune the model.
- val_data: held out for evaluation.

Since GPT models like `distilgpt2` do not have a default padding token, we set
`tokenizer.pad_token` to the `eos_token` and set `model.config.pad_token_id` to the same ID so that padding is handled consistently during training.

Defining a pad token lets the tokenizer pad sequences to a common length; the `attention_mask` it returns is what tells the model to ignore the padded positions.
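
To make the role of the pad token concrete, here is a minimal pure-Python sketch of what padding to a fixed length produces. The helper `pad_batch` and the short token-ID lists are hypothetical illustrations, not part of the `transformers` API; the only value taken from the real tokenizer is 50256, the GPT-2 `eos_token_id` that is reused as the pad ID.

```python
PAD_ID = 50256  # GPT-2 eos_token_id, reused as the pad token in this notebook

def pad_batch(sequences, max_length, pad_id=PAD_ID):
    """Truncate or right-pad each sequence and build a matching attention mask."""
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = seq[:max_length]                                # truncation
        n_pad = max_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)              # right padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)   # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token IDs, chosen only for illustration.
ids, mask = pad_batch([[10, 11, 12], [20]], max_length=4)
print(ids)   # [[10, 11, 12, 50256], [20, 50256, 50256, 50256]]
print(mask)  # [[1, 1, 1, 0], [1, 0, 0, 0]]
```

The zeros in the mask, not the pad ID itself, are what mark a position as padding for the model.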


In [5]:
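def tokenize_function(example):
    # Truncate or pad every text to exactly 128 tokens.
    tokens = tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)
    # Causal LM: the labels are the inputs themselves; the model shifts them internally.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

# With batched=True, example["text"] is a list of strings (many rows per call).
tokenized_train = dataset["train"].map(tokenize_function, batched=True)
tokenized_val = dataset["validation"].map(tokenize_function, batched=True)

# Return PyTorch tensors for the columns the model consumes.
tokenized_train.set_format("torch", columns=["input_ids", "attention_mask", "labels"])
tokenized_val.set_format("torch", columns=["input_ids", "attention_mask", "labels"])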


[Output: `map()` processed 36718 training examples and 3760 validation examples.]

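We defined a `tokenize_function` that:
- Uses our tokenizer to convert the text into token IDs.
- Applies truncation and padding so that all sequences have the same length (128 tokens).
- Copies the `input_ids` into a `labels` field, since causal language modeling uses the same sequence as both input and target: the model shifts the labels by one position internally so that each position predicts the next token. Because the labels are a plain copy, padded positions also contribute to the loss.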

We applied this function to both the training and validation sets using the `map()` function from Hugging Face Datasets, which efficiently processes the entire dataset in batches.
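
Because the pad token shares its ID with the end-of-text token, a plain copy of `input_ids` also asks the model to predict padding. A common variation, sketched below in pure Python, replaces the label at padded positions with -100, the index that PyTorch's cross-entropy loss ignores by default. The helper names and token IDs are hypothetical, and this is not the code used in the notebook. The second helper lists the (current token, target token) pairs that remain after the one-position shift.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids into labels, but ignore positions where the mask is 0."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

def next_token_pairs(input_ids, labels):
    """Pairs (current token, token to predict) that contribute to the loss."""
    return [(cur, nxt) for cur, nxt in zip(input_ids[:-1], labels[1:])
            if nxt != IGNORE_INDEX]

# Hypothetical 5-token row: three real tokens followed by two pad tokens.
input_ids = [10, 11, 12, 50256, 50256]
attention_mask = [1, 1, 1, 0, 0]
labels = mask_padding_labels(input_ids, attention_mask)
print(labels)                               # [10, 11, 12, -100, -100]
print(next_token_pairs(input_ids, labels))  # [(10, 11), (11, 12)]
```

With this masking, a row that consists only of padding (for example an empty line in the dataset) contributes no targets at all, so such rows are usually filtered out before training.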


In [8]:
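from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./finetuned_distilgpt2",  # Checkpoints are written here
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    save_steps=500,                       # Save a checkpoint every 500 steps
    logging_steps=100,                    # Log the training loss every 100 steps
    fp16=True,                            # Mixed precision; needs a supported GPU
    report_to="none"                      # No external experiment tracker
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,           # Used only when evaluation is run, e.g. trainer.evaluate()
)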

trainer.train()


Step,Training Loss


KeyboardInterrupt: 

We used Hugging Face's `Trainer` API to simplify the training process.

The `TrainingArguments` class defines the key training settings:
- `output_dir`: Folder in which checkpoints of the fine-tuned model are saved.
- `num_train_epochs`: We set 1 epoch due to time and resource constraints.
- `per_device_train_batch_size`: A batch size of 2 was used to fit in limited GPU memory.
- `fp16=True`: Enables mixed-precision (16-bit floating point) training, which speeds up training and reduces memory use on supported GPUs.
- `save_steps` and `logging_steps`: Save a checkpoint every 500 steps and log the training loss every 100 steps.
- `report_to="none"`: Disables reporting to experiment trackers such as Weights & Biases.

We then passed these arguments into the `Trainer`, along with:
- Our model (`distilgpt2`),
- The tokenized training and validation datasets.
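
What `save_steps` and `logging_steps` mean in practice depends on how many optimizer steps one epoch contains. The following is a rough calculation, assuming the 36,718 rows of the wikitext-2-raw-v1 training split, a single device, and no gradient accumulation (the `TrainingArguments` default). The variable names are chosen for this illustration only.

```python
import math

num_examples = 36718   # rows in the wikitext-2-raw-v1 training split
batch_size = 2         # per_device_train_batch_size, assuming a single device
save_steps = 500
logging_steps = 100

# The last batch is kept even if it is incomplete, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / batch_size)
periodic_checkpoints = steps_per_epoch // save_steps
periodic_log_entries = steps_per_epoch // logging_steps

print(steps_per_epoch)        # 18359
print(periodic_checkpoints)   # 36
print(periodic_log_entries)   # 183
```

Under these assumptions, one epoch is a little over 18,000 steps, which is why a full run at batch size 2 takes a noticeable amount of time.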

In [10]:
prompt = "Once upon a time in a world of artificial intelligence,"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_length=50,           # Total length in tokens, prompt included
    do_sample=True,
    top_k=50,
    top_p=0.95,
    temperature=1.0,         # 1.0 leaves the distribution unscaled
    repetition_penalty=1.2,  # Penalize repeats
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,  # Set explicitly to avoid the pad-token notice
)
print("📝 Generated Text:\n")
print(tokenizer.decode(output[0], skip_special_tokens=True))



📝 Generated Text:

Once upon a time in a world of artificial intelligence, it would be reasonable to expect that "the information being sought at home was not immediately accessible by public access," according


After training, we evaluated the model by generating text based on a custom prompt:
> "Once upon a time in a world of artificial intelligence,"

The `tokenizer` converts the prompt into tokens and `model.generate()` creates a continuation.

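We set the following generation parameters:
- `max_length=50`: Caps the total sequence (prompt plus continuation) at 50 tokens.
- `do_sample=True`: Samples the next token at random from the predicted distribution instead of always picking the most likely one.
- `top_k=50` and `top_p=0.95`: Restrict sampling to the 50 most likely tokens (top-k) and, within those, to the smallest set whose cumulative probability reaches 0.95 (nucleus sampling).
- `temperature=1.0`: Leaves the predicted distribution unscaled; values above 1.0 flatten it (more varied output) and values below 1.0 sharpen it.
- `repetition_penalty=1.2`: Discourages the model from repeating tokens it has already produced.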
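
The interaction of temperature, top-k, and top-p can be sketched in pure Python on a toy distribution. This is a simplified illustration, not the `transformers` implementation: the function `filter_candidates` and the logits are hypothetical, and the repetition penalty is left out.

```python
import math

def filter_candidates(logits, top_k=50, top_p=0.95, temperature=1.0):
    """Return {token_index: probability} for the tokens that survive filtering."""
    # Temperature: divide logits before the softmax (1.0 changes nothing).
    scaled = [x / temperature for x in logits]

    # Top-k: keep the k highest-scoring token indices, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]

    # Softmax over the surviving tokens only.
    peak = max(scaled[i] for i in kept)
    weights = {i: math.exp(scaled[i] - peak) for i in kept}
    total = sum(weights.values())
    probs = {i: w / total for i, w in weights.items()}

    # Top-p (nucleus): keep the shortest best-first prefix whose
    # cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Renormalize so the remaining probabilities sum to 1.
    mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / mass for i in nucleus}

# Toy logits for a 4-token vocabulary (hypothetical values).
candidates = filter_candidates([2.0, 1.0, 0.0, -1.0], top_k=3, top_p=0.9)
print(sorted(candidates))  # [0, 1]
```

The next token would then be drawn from the surviving candidates, for example with `random.choices(list(candidates), weights=list(candidates.values()))`.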

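Finally, we decoded the output tokens back to text and printed the result.
GPT-2 has no dedicated padding token, so `generate()` falls back to `eos_token_id` (50256) for padding; passing `pad_token_id` explicitly to `generate()` avoids the notice about this fallback.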

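The result was a grammatical continuation in a formal, expository register, consistent with the style of the training dataset. Because the total length is capped at 50 tokens, the sample is cut off mid-sentence.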
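
Reading samples is a qualitative check. A common quantitative complement for language models is perplexity, the exponential of the mean cross-entropy loss on held-out text. The sketch below assumes a loss value such as the `eval_loss` entry returned by `trainer.evaluate()`; the number used is a placeholder, not a result from this notebook.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is exp(mean per-token cross-entropy, measured in nats)."""
    return math.exp(mean_cross_entropy)

placeholder_loss = 3.0  # hypothetical value, NOT measured in this notebook
print(round(perplexity(placeholder_loss), 2))  # 20.09
```

Lower perplexity means the model assigns higher probability to the held-out text; a loss of 0 would correspond to a perplexity of 1.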


We fine-tuned the `distilgpt2` model on the wikitext-2-raw-v1 dataset from Hugging Face. The dataset was tokenized with a maximum sequence length of 128, and we trained for 1 epoch using Hugging Face's `Trainer` API with a batch size of 2 and 16-bit floating point precision (fp16), saving checkpoints locally. After training, we used `generate()` to produce text from custom prompts and inspected the output qualitatively.