In [None]:
!nvidia-smi

Fri Jan  2 04:56:25 2026       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  Tesla T4                       Off |   00000000:00:04.0 Off |                    0 |
| N/A   44C    P8             10W /   70W |       0MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [None]:
!pip install transformers datasets torch accelerate



In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel, Trainer, TrainingArguments
from datasets import load_dataset
import torch

In [None]:
%%writefile dataset.txt
Artificial intelligence is revolutionizing technology and society.
Machine learning enables systems to improve through experience.
Deep learning uses neural networks with multiple layers.
Natural language processing helps machines understand human language.
Generative models can create realistic text content.

AI systems are widely used in healthcare and finance.
Neural networks learn complex representations from data.
Supervised learning relies on labeled datasets.
Unsupervised learning finds hidden patterns in data.
Reinforcement learning is based on reward-driven behavior.

Transformers have improved natural language understanding.
Attention mechanisms allow models to focus on relevant information.
Large language models can generate coherent paragraphs.
Tokenization converts text into numerical representations.
Padding ensures equal sequence lengths for batch processing.

Text generation models predict the next word in a sequence.
GPT-2 is a transformer-based language model.
Pretraining allows models to learn from large corpora.
Fine-tuning adapts models to specific tasks.
Sampling strategies affect text diversity and quality.

Top-k sampling limits predictions to high-probability tokens.
Top-p sampling selects tokens based on cumulative probability.
Temperature controls randomness in text generation.
Lower temperature produces more deterministic outputs.
Higher temperature increases creativity.

AI ethics focuses on fairness and transparency.
Bias in data can affect model predictions.
Responsible AI development is essential.
Automation can improve productivity.
Future AI systems will become more adaptive.



Writing dataset.txt


In [None]:
# Load the plain-text file; the "text" builder produces a single "text" column.
dataset = load_dataset("text", data_files={"train": "dataset.txt"})
print(dataset)

DatasetDict({
    train: Dataset({
        features: ['text'],
        num_rows: 37
    })
})
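
The row count is higher than the 30 sentences in the file, which suggests the `text` loader yields one row per line, blank separator lines included. A minimal sketch of stripping those lines in plain Python follows; `drop_blank_lines` is a hypothetical helper for illustration, not part of `datasets`.

```python
def drop_blank_lines(text):
    """Return the non-empty lines of `text`, with surrounding whitespace removed."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Toy input with one blank separator line, as in dataset.txt.
sample = "Deep learning uses neural networks.\n\nAI systems are widely used.\n"
print(drop_blank_lines(sample))
```

The cleaned lines could be written back to `dataset.txt` before calling `load_dataset`, or the same predicate could be applied afterwards with `dataset.filter`.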


In [None]:
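tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# GPT-2 ships without a padding token, so reuse the end-of-sequence token.
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
# No new tokens were added, so the embedding matrix keeps its original size.
model.resize_token_embeddings(len(tokenizer))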

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


Embedding(50257, 768)

In [None]:
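def tokenize_function(examples):
    # Pad or truncate every line to exactly 128 tokens.
    tokens = tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )
    # Causal LM objective: labels are the input IDs (the model shifts them internally).
    # Padded positions are included in the labels here.
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens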
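
Because the pad token is the EOS token and the labels are a straight copy of the input IDs, the loss is also computed on padded positions. A common alternative is to set those labels to `-100`, the value PyTorch's cross-entropy loss ignores by default. The sketch below shows the idea on plain Python lists; `mask_padding_labels` is a hypothetical helper, not something the notebook uses.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def mask_padding_labels(input_ids, attention_mask, ignore_index=IGNORE_INDEX):
    """Copy input_ids into labels, replacing padded positions with ignore_index."""
    return [
        [tok if keep == 1 else ignore_index for tok, keep in zip(ids, mask)]
        for ids, mask in zip(input_ids, attention_mask)
    ]


# Toy batch: two sequences of length 4 (50256 is GPT-2's EOS/pad ID).
input_ids = [[8001, 9542, 50256, 50256], [10, 20, 30, 40]]
attention_mask = [[1, 1, 0, 0], [1, 1, 1, 1]]
print(mask_padding_labels(input_ids, attention_mask))
```

One trade-off to keep in mind: with full masking the model may no longer receive a signal to stop after one sentence, so generations could run longer than those shown in this notebook.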


In [None]:
tokenized_dataset = dataset.map(
    tokenize_function,
    batched=True,
    remove_columns=["text"]
)

In [None]:
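training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    learning_rate=5e-5,
    save_steps=500,     # exceeds the 57 total steps, so no intermediate checkpoint is saved
    logging_steps=100,  # also exceeds 57 steps, so no intermediate loss is logged
    fp16=True,          # mixed precision; needs a CUDA GPU such as the T4 above
    report_to="none",
)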
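
The step count for this run can be estimated from the dataset size, batch size, and epoch count. This is a rough sketch that assumes a single device, no gradient accumulation, and that the last partial batch is kept; `total_optimizer_steps` is a hypothetical helper.

```python
import math


def total_optimizer_steps(num_rows, batch_size, epochs):
    """Estimate optimizer steps when the last partial batch of each epoch is kept."""
    batches_per_epoch = math.ceil(num_rows / batch_size)
    return batches_per_epoch * epochs


# 37 rows, batch size 2, 3 epochs: ceil(37 / 2) = 19 batches per epoch.
print(total_optimizer_steps(num_rows=37, batch_size=2, epochs=3))  # 57
```

Comparing this number with `logging_steps` and `save_steps` shows whether any intermediate log line or checkpoint will actually be produced.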

In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"]
)

In [None]:
trainer.train()

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.



TrainOutput(global_step=57, training_loss=1.9780122188099645, metrics={'train_runtime': 32.2346, 'train_samples_per_second': 3.443, 'train_steps_per_second': 1.768, 'total_flos': 7250853888000.0, 'train_loss': 1.9780122188099645, 'epoch': 3.0})
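
The reported metrics can be cross-checked with a few lines of arithmetic. The sketch below recomputes the throughput figures and converts the mean loss to a perplexity. The numbers are copied from the `TrainOutput` above. Since the labels include padded positions, the perplexity is only a rough indicator.

```python
import math

# Values copied from the TrainOutput above.
train_loss = 1.9780122188099645
train_runtime = 32.2346  # seconds
global_step = 57
num_rows, epochs = 37, 3

perplexity = math.exp(train_loss)  # exp of the mean cross-entropy
steps_per_second = global_step / train_runtime
samples_per_second = num_rows * epochs / train_runtime

print(f"perplexity         ~ {perplexity:.2f}")
print(f"steps per second   ~ {steps_per_second:.3f}")
print(f"samples per second ~ {samples_per_second:.3f}")
```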

In [None]:
prompts = [
    "Artificial intelligence",
    "Machine learning",
    "Deep learning"
]
model.eval()  # switch off dropout, which is still active after training
for prompt in prompts:
    print(f"\n===== PROMPT: {prompt} =====")
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        padding=True
    )
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=60,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        top_p=0.9,
        num_return_sequences=3,
        pad_token_id=tokenizer.eos_token_id
    )
    for i, output in enumerate(outputs):
        print(f"----- OUTPUT {i+1} -----")
        print(tokenizer.decode(output, skip_special_tokens=True))



===== PROMPT: Artificial intelligence =====
----- OUTPUT 1 -----
Artificial intelligence can learn.
----- OUTPUT 2 -----
Artificial intelligence can predict complex behaviors.
----- OUTPUT 3 -----
Artificial intelligence can learn.

===== PROMPT: Machine learning =====
----- OUTPUT 1 -----
Machine learning is an innovative methodology.
----- OUTPUT 2 -----
Machine learning algorithms can improve decision-based algorithms.
----- OUTPUT 3 -----
Machine learning will be used to optimize applications.

===== PROMPT: Deep learning =====
----- OUTPUT 1 -----
Deep learning models the processes that inform human behavior.
----- OUTPUT 2 -----
Deep learning algorithms can detect complex information.
----- OUTPUT 3 -----
Deep learning and learning can become integrated.
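
To make the `temperature`, `top_k`, and `top_p` arguments concrete, here is a simplified pure-Python sketch of how the three filters can be combined for a single decoding step. It illustrates the idea and is not the `transformers` implementation; `filter_probs` and `sample_token` are hypothetical names.

```python
import math
import random


def filter_probs(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Return {token_index: probability} after temperature, top-k and top-p filtering."""
    # Temperature: values below 1 sharpen the distribution, values above 1 flatten it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]

    # Top-k: keep only the k highest-scoring tokens.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    total = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / total
        if cumulative >= top_p:
            break

    # Renormalize over the surviving tokens.
    mass = sum(weights[i] for i in kept)
    return {i: weights[i] / mass for i in kept}


def sample_token(logits, rng=random, **kwargs):
    """Draw one token index from the filtered distribution."""
    probs = filter_probs(logits, **kwargs)
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]


toy_logits = [2.0, 1.0, 0.0, -1.0]  # made-up scores for a 4-token vocabulary
print(filter_probs(toy_logits, temperature=1.0, top_k=3, top_p=0.8))
print(sample_token(toy_logits))
```

Lowering `top_k` or `top_p` shrinks the candidate set, which is one plausible reason the same output can appear more than once among the three sampled sequences.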


In [None]:
# Save the model and tokenizer to the same directory so they can be reloaded together.
model.save_pretrained("gpt2-finetuned")
tokenizer.save_pretrained("gpt2-finetuned")

('gpt2-finetuned/tokenizer_config.json',
 'gpt2-finetuned/special_tokens_map.json',
 'gpt2-finetuned/vocab.json',
 'gpt2-finetuned/merges.txt',
 'gpt2-finetuned/added_tokens.json')
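
Reloading with `from_pretrained` is simplest when the weights, config, and tokenizer files all sit in one directory. The sketch below checks a directory for a set of expected files using only the standard library. The file list is an assumption based on the names that appear in this notebook and may differ between `transformers` versions; `missing_files` is a hypothetical helper.

```python
from pathlib import Path

# Assumed file names; the exact set depends on the transformers version.
EXPECTED_FILES = (
    "config.json",
    "model.safetensors",
    "vocab.json",
    "merges.txt",
    "tokenizer_config.json",
)


def missing_files(directory, expected=EXPECTED_FILES):
    """Return the expected file names that are not present in `directory`."""
    root = Path(directory)
    return [name for name in expected if not (root / name).is_file()]


# An empty list means every expected file was found.
print(missing_files("gpt2-finetuned"))
```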

In [None]:
from google.colab import files
files.download("dataset.txt")


In [None]:
from google.colab import drive
drive.mount('/content/drive')
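
Once Drive is mounted, the saved model directory can be copied there so it survives the Colab session. This is a minimal sketch using only the standard library. The destination path under `/content/drive/MyDrive` is an assumption about the usual mount layout, and `backup_directory` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def backup_directory(src, dst):
    """Copy `src` into `dst` (creating or updating it) and return the file count in `dst`."""
    src, dst = Path(src), Path(dst)
    if not src.is_dir():
        raise FileNotFoundError(f"{src} is not a directory")
    shutil.copytree(src, dst, dirs_exist_ok=True)  # dirs_exist_ok needs Python 3.8+
    return sum(1 for p in dst.rglob("*") if p.is_file())


# Example call (assumed paths; uncomment inside Colab after mounting Drive):
# backup_directory("gpt2-finetuned", "/content/drive/MyDrive/gpt2-finetuned")
```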