<a href="https://colab.research.google.com/github/chetanvyavhare9579/100-Days-of-Code/blob/main/Text_generation_GPT2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
!pip install transformers datasets accelerate




In [3]:
import torch
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    TextDataset,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments
)

# Check if a GPU is available and set the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")


Using device: cuda


In [11]:
# Create a sample text file with custom data
custom_text = """
This is the first line of my custom data.
This is the second line, and it contains more information.
We can add as many lines as needed for training the model.
Ensure the data is relevant to the text you want the model to generate.
This is additional text to make the file longer.
We need enough content to fill at least one block of size 128.
Adding more lines here to reach the required length.
This should provide enough data for the TextDataset.
Let's add a few more sentences just to be safe.
The model needs a sufficient amount of text to learn from.
This is the final line of additional text for now.
"""

with open("custom_data.txt", "w") as f:
    f.write(custom_text)

print("Created custom_data.txt with sample text.")

Created custom_data.txt with sample text.


In [6]:
# --- Configuration Parameters ---
FILE_PATH = "custom_data.txt"  # Must match the name of the training text file
BLOCK_SIZE = 128               # Tokens per training example; the file needs at least this many tokens

# 1. Load the pre-trained GPT-2 tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# 2. Set the padding token (GPT-2 defines none, so reuse the end-of-sequence token)
tokenizer.pad_token = tokenizer.eos_token

# 3. Create the TextDataset object
# This tokenizes the whole file and slices it into consecutive blocks of BLOCK_SIZE tokens;
# leftover tokens that do not fill a complete block are dropped.
# Note: TextDataset is deprecated in transformers in favor of the datasets library.
print(f"Loading and tokenizing data from: {FILE_PATH}")
train_dataset = TextDataset(
    tokenizer=tokenizer,
    file_path=FILE_PATH,
    block_size=BLOCK_SIZE
)

# 4. Create the Data Collator
# This prepares batches of tokens for the model, specifically for causal language modeling (text generation).
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False # mlm=False means Causal Language Modeling (standard for GPT-2)
)

print(f"Dataset successfully created with {len(train_dataset)} blocks.")

Loading and tokenizing data from: custom_data.txt
Dataset successfully created with 0 blocks.
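The run above reports 0 blocks, which means the Trainer has nothing to learn from. TextDataset cuts the tokenized file into consecutive, non-overlapping chunks of `BLOCK_SIZE` tokens and drops any remainder, so a file shorter than 128 tokens produces no examples at all. The arithmetic can be sketched as follows; `count_blocks` is a hypothetical helper name, and the sketch assumes the tokenizer adds no special tokens per block (the case for the GPT-2 tokenizer).

```python
def count_blocks(num_tokens: int, block_size: int) -> int:
    """Number of full, non-overlapping blocks in a token stream.

    Mirrors the slicing TextDataset performs: trailing tokens that do not
    fill a complete block are dropped rather than padded.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return num_tokens // block_size


# Illustrative token counts (not measured from custom_data.txt)
for n in (100, 128, 300):
    print(n, "tokens ->", count_blocks(n, 128), "block(s)")
```

To check the real figure in the notebook, compare `len(tokenizer.encode(text))` for the file's contents against `BLOCK_SIZE` before building the dataset. If it falls short, add more text or lower `BLOCK_SIZE`.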




In [7]:
# Action 4: Load the Pre-trained GPT-2 Model
print("Loading pre-trained GPT-2 model...")
# GPT2LMHeadModel is the model with a head specifically for Language Modeling (next token prediction)
model = GPT2LMHeadModel.from_pretrained('gpt2')
model.to(device) # Move model to GPU/CPU
print("Model loaded successfully.")

Loading pre-trained GPT-2 model...


Model loaded successfully.


In [8]:
# Action 5: Define Training Arguments (Hyperparameters)
OUTPUT_DIR = "./fine_tuned_gpt2_model"

print("Defining training arguments...")
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    overwrite_output_dir=True,
    num_train_epochs=3,                    # (1) Number of times to pass through the dataset (3 is a good start)
    per_device_train_batch_size=4,         # (2) Keep this low (4 or 8) to avoid running out of GPU memory
    save_steps=500,                        # (3) Save a checkpoint every 500 steps
    save_total_limit=2,                    # (4) Keep only the last 2 checkpoints
    prediction_loss_only=True,
    logging_dir='./logs',
    learning_rate=5e-5                     # Standard learning rate for fine-tuning
)
print(f"Training will save checkpoints to: {OUTPUT_DIR}")

Defining training arguments...
Training will save checkpoints to: ./fine_tuned_gpt2_model
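Whether `save_steps=500` ever triggers depends on how many optimizer steps the run contains. A rough estimate, assuming a single device, no gradient accumulation, and the last partial batch being kept (the defaults), is sketched below. `total_training_steps` is a hypothetical helper, not part of transformers.

```python
import math


def total_training_steps(num_blocks: int, batch_size: int, epochs: int) -> int:
    """Estimated optimizer steps: batches per epoch times number of epochs."""
    return math.ceil(num_blocks / batch_size) * epochs


SAVE_STEPS = 500  # same value as save_steps in the cell above

# Illustrative dataset sizes, using the batch size and epoch count from above
for blocks in (0, 1, 1000):
    steps = total_training_steps(blocks, batch_size=4, epochs=3)
    print(f"{blocks} blocks -> {steps} steps, "
          f"periodic checkpoints: {steps // SAVE_STEPS}")
```

With a very small dataset the run finishes long before step 500, so the only saved weights are the ones written explicitly at the end.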


In [9]:
# Action 6: Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator, # Created in the data-preparation cell
    train_dataset=train_dataset, # Created in the data-preparation cell
)
print("Trainer initialized. Ready to start fine-tuning.")

Trainer initialized. Ready to start fine-tuning.
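The cells shown here go from initializing the Trainer straight to saving the model, with no visible call to `trainer.train()`, and the dataset above was reported as empty. A minimal guard for the training step might look like the sketch below. `train_or_fail` is a hypothetical helper; it only assumes that the trainer exposes a `train()` method, as the transformers `Trainer` does, and that the dataset supports `len()`.

```python
def train_or_fail(trainer, dataset):
    """Hypothetical guard: train only if the dataset has at least one example."""
    if len(dataset) == 0:
        raise ValueError(
            "Training dataset is empty; add more text or lower the block size."
        )
    # Runs the fine-tuning loop and returns whatever the trainer reports.
    return trainer.train()


# Intended use in this notebook (assumes the objects from the cells above):
# train_or_fail(trainer, train_dataset)
```

Without a training step of this kind, the weights saved in the next cell are the unchanged pre-trained GPT-2 weights.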


In [13]:
# Action 8: Save the Model and Tokenizer
OUTPUT_DIR = "./fine_tuned_gpt2_model" # Using the same output directory from Action 5

print(f"Saving final model to {OUTPUT_DIR}")
trainer.save_model(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR) # Save the tokenizer used for training
print("Model successfully saved and ready for generation.")

Saving final model to ./fine_tuned_gpt2_model
Model successfully saved and ready for generation.


In [14]:
# Action 9: Load the saved fine-tuned model (optional but good practice)
MODEL_PATH = "./fine_tuned_gpt2_model"

print(f"Loading fine-tuned model from: {MODEL_PATH}")
# Note: Tokenizer is already loaded, but we load the model again.
model = GPT2LMHeadModel.from_pretrained(MODEL_PATH)
model.to(device)
model.eval() # Set the model to evaluation mode
print("Fine-tuned model loaded for text generation.")

Loading fine-tuned model from: ./fine_tuned_gpt2_model
Fine-tuned model loaded for text generation.


In [15]:
# Action 10: Define Prompt and Generation Parameters

# 1. Define your prompt in a dramatic style
prompt = "Hark, the air doth chill and grow quite still, for the"

# 2. Encode the prompt; calling the tokenizer returns input_ids and an attention_mask
inputs = tokenizer(prompt, return_tensors='pt').to(device)

# 3. Generate text using sampling methods for creative output
print(f"\n--- Generating text based on prompt: '{prompt}' ---")

output = model.generate(
    **inputs,                # Passes input_ids and attention_mask together
    max_length=150,          # Total length in tokens, prompt included
    num_return_sequences=1,
    do_sample=True,          # Sample from the distribution instead of greedy decoding
    temperature=0.85,        # Below 1.0 sharpens the distribution; higher values add randomness
    top_k=50,                # Sample only from the 50 most probable tokens
    top_p=0.95,              # Nucleus sampling: keep the most probable tokens covering 95% of the mass
    no_repeat_ngram_size=2,  # Forbid any two-token sequence from appearing twice
    pad_token_id=tokenizer.eos_token_id
)

# 4. Decode and Print the Result
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print("\nGenerated Output:")
print("==================================================")
print(generated_text)
print("==================================================")


--- Generating text based on prompt: 'Hark, the air doth chill and grow quite still, for the' ---


Generated Output:
Hark, the air doth chill and grow quite still, for the stars to be so far apart, and so distant from the sea, that it are not at all pleasant to behold them all at once, but only for a few moments; and yet when the time has come, it will be of no avail to speak, save that the world may be glad. For the more quiet they are, especially the days, or at any rate the times of the year, when they may not be in a state of rest.

14. So we have all here told you, from your own experience, as far as you know; there are many things in the heavens, where the wind and the water of heaven, by the grace
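The generation call combines three controls: `temperature`, `top_k`, and `top_p`. How they reshape the next-token distribution can be sketched in plain Python. This is a simplified illustration, not the transformers implementation: `sampling_distribution` is a hypothetical name, the logits are made up, and boundary handling in the library may differ slightly.

```python
import math


def sampling_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Temperature: divide logits before the softmax. Values below 1.0
    # widen the gaps between logits and sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank token indices from most to least probable.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose renormalized cumulative
    # probability reaches top_p.
    mass = sum(probs[i] for i in ranked)
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i] / mass
        if cumulative >= top_p:
            break

    # Renormalize the surviving tokens so they sum to 1.
    kept_mass = sum(probs[i] for i in kept)
    return {i: probs[i] / kept_mass for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
for t in (0.7, 0.85, 1.5):
    print(t, sampling_distribution(toy_logits, temperature=t, top_k=3, top_p=0.95))
```

Lowering the temperature moves probability toward the top-ranked token, which is why the second generation further down uses 0.7 for more focused text.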


In [16]:
# Action 11: Run a new generation with tuned parameters
print("--- Running Second Generation with Tuned Parameters ---")

# New prompt that mimics a character speaking, like Hamlet or Horatio
prompt_new = "Prithee, tell me, good Horatio, what vile rumour"

inputs = tokenizer(prompt_new, return_tensors='pt').to(device)

output = model.generate(
    **inputs,                # Passes input_ids and attention_mask together
    max_length=120,          # Slightly shorter total length (prompt included)
    num_return_sequences=1,
    do_sample=True,
    temperature=0.7,         # Lower temperature concentrates probability on likely tokens
    top_k=50,
    top_p=0.95,
    no_repeat_ngram_size=2,
    pad_token_id=tokenizer.eos_token_id
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"\nGenerated Output for Prompt: '{prompt_new}'")
print("==================================================")
print(generated_text)
print("==================================================")

--- Running Second Generation with Tuned Parameters ---

Generated Output for Prompt: 'Prithee, tell me, good Horatio, what vile rumour'
Prithee, tell me, good Horatio, what vile rumour do you have against me?

BH: Well, I don't think so.
 (He turns and walks away. She is surprised to see a man with a hat in his hand)
…
. . .
,
: . I've heard of a 'witch' who was 'herded' into her house… And if you are not familiar with this, then I suggest you take a look at the 'Formal Marriage' and see if there is any kind of evidence of any such

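Both generation calls set `no_repeat_ngram_size=2`. The idea is that, at each step, any token that would complete a two-token sequence already present in the output is excluded. A minimal sketch of that check follows; `banned_next_tokens` is a hypothetical helper, and word strings stand in for token IDs to keep the example readable.

```python
def banned_next_tokens(tokens, ngram_size):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if ngram_size <= 0 or len(tokens) < ngram_size:
        return set()

    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(tokens[len(tokens) - ngram_size + 1:])

    banned = set()
    for start in range(len(tokens) - ngram_size + 1):
        ngram = tuple(tokens[start:start + ngram_size])
        # If an earlier n-gram begins with the same prefix, its final token
        # would create a repeat and must be excluded.
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


history = ["the", "wind", "and", "the"]
print(banned_next_tokens(history, 2))  # "the wind" already occurred
```

A small n-gram size such as 2 is a strict constraint: it blocks loops, but it also prevents legitimate repeats of common word pairs in longer outputs.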