<a href="https://colab.research.google.com/github/akhilesh22210374/gen_ai/blob/main/GenAI_Lab_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Install necessary libraries
!pip install transformers datasets

from transformers import GPT2Tokenizer, GPT2LMHeadModel, DataCollatorForLanguageModeling, Trainer, TrainingArguments
from datasets import load_dataset

# Load pre-trained GPT-2 model and tokenizer
model_name = "gpt2"  # You can also use "gpt2-medium" or "gpt2-large" for larger models
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# GPT-2 ships without a padding token; add one and resize the embeddings to match the new vocabulary size
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': '[PAD]'})
    model.resize_token_embeddings(len(tokenizer))

# Prepare the dataset. The 'text' loader used below treats each line of the file as one example,
# so put one story (or passage) per line.
dataset_file = "your_dataset.txt"

# Write a small sample file for demonstration.
# Note: this overwrites dataset_file; delete this block when pointing at your own data.
with open(dataset_file, "w", encoding="utf-8") as f:
    f.write("Once upon a time, in a land far away...\n")
    f.write("A brave knight set out on a quest.\n")
    f.write("The dragon roared, shaking the mountains.\n")
    f.write("The princess waited patiently in the tower.\n")
    f.write("A wise wizard offered his guidance.\n")

# Load the dataset
dataset = load_dataset('text', data_files=dataset_file)

# Tokenize the dataset
def tokenize_function(examples):
    # Pad or truncate every example to 128 tokens; adjust max_length if needed
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized_datasets = dataset.map(tokenize_function, batched=True, num_proc=4, remove_columns=["text"])

# Prepare data collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Set training arguments
training_args = TrainingArguments(
    output_dir="./gpt2-finetuned",
    overwrite_output_dir=True,
    num_train_epochs=3, # Adjust the number of training epochs
    per_device_train_batch_size=4, # Adjust the batch size based on your resources
    save_steps=10_000,
    save_total_limit=2,
    logging_steps=500,
    report_to="none"
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"]
)

# Fine-tune the model
trainer.train()

# Save the fine-tuned model
trainer.save_model("./gpt2-finetuned")
tokenizer.save_pretrained("./gpt2-finetuned")

# Example story generation function (uses the global `model` and `tokenizer`)
def generate_story(prompt, max_length=200):
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    # Generate text; no_repeat_ngram_size=2 keeps any token bigram from appearing twice
    output = model.generate(input_ids, max_length=max_length, num_return_sequences=1, no_repeat_ngram_size=2)
    # Decode the generated token IDs back into text
    generated_story = tokenizer.decode(output[0], skip_special_tokens=True)
    return generated_story

# Load the fine-tuned model (if needed, after training)
model = GPT2LMHeadModel.from_pretrained("./gpt2-finetuned")
tokenizer = GPT2Tokenizer.from_pretrained("./gpt2-finetuned")

# Example usage
prompt = "The old house stood alone on the hill, overlooking the town. "
story = generate_story(prompt)
story




The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`
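The message above is printed by the `resize_token_embeddings` call that follows adding `[PAD]`. As a rough illustration of the idea it describes (not the library's actual code, which may differ in details such as how the covariance is scaled), the NumPy sketch below draws one new embedding row from a normal distribution fitted to the existing rows. The matrix sizes and the random "old" embeddings are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the pre-trained embedding matrix: 50 tokens, 4 dimensions.
old_embeddings = rng.normal(size=(50, 4))

# Fit a multivariate normal to the existing rows.
mean = old_embeddings.mean(axis=0)
cov = np.cov(old_embeddings, rowvar=False)

# Draw one row for the newly added token and append it to the matrix.
new_row = rng.multivariate_normal(mean, cov, size=1)
resized = np.vstack([old_embeddings, new_row])

print(resized.shape)  # (51, 4)
```

The existing rows are left untouched; only the appended row is sampled.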


Step,Training Loss
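The `Step` / `Training Loss` table above has no rows. A likely reason is that training ended long before the first logging step. The arithmetic below uses the values from the code cell and assumes a single device and the default `gradient_accumulation_steps` of 1; both are assumptions about the run, not something the output confirms.

```python
import math

num_examples = 5                  # lines in the sample file
per_device_train_batch_size = 4   # from TrainingArguments in the code cell
num_train_epochs = 3
logging_steps = 500
num_devices = 1                   # assumption: a single GPU or CPU
gradient_accumulation_steps = 1   # assumption: left at the default

batches_per_epoch = math.ceil(num_examples / (per_device_train_batch_size * num_devices))
steps_per_epoch = math.ceil(batches_per_epoch / gradient_accumulation_steps)
total_steps = steps_per_epoch * num_train_epochs

# Steps at which the Trainer would report a training loss
logged_steps = [s for s in range(1, total_steps + 1) if s % logging_steps == 0]

print(total_steps, logged_steps)  # 6 []
```

Under these assumptions, setting `logging_steps=1` should produce one row per optimizer step on a dataset this small.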


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
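The warnings above appear because `generate_story` passes only `input_ids` to `model.generate`. One way to avoid them, sketched below, is to call the tokenizer directly so it returns an attention mask, and to pass an explicit `pad_token_id`. The function name `generate_story_with_mask` is hypothetical, and the model and tokenizer are passed in as arguments rather than read from globals.

```python
def generate_story_with_mask(model, tokenizer, prompt, max_length=200):
    # Calling the tokenizer (instead of tokenizer.encode) also returns the attention mask.
    inputs = tokenizer(prompt, return_tensors="pt")

    # Use the tokenizer's pad token if it has one; otherwise fall back to EOS.
    pad_id = tokenizer.pad_token_id
    if pad_id is None:
        pad_id = tokenizer.eos_token_id

    output = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_length=max_length,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        pad_token_id=pad_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

With the objects from the code cell, it would be called as `generate_story_with_mask(model, tokenizer, prompt)`. For a single unpadded prompt the mask is all ones, so the generated text is not expected to change; the benefit is mainly that the call is explicit and extends safely to padded batches.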


'The old house stood alone on the hill, overlooking the town. \xa0The house was a small house, with a large window. The house had a wooden door, and a door that led to the house. It was the only one in the village.\nThe village was surrounded by a forest. There were no trees, but there were many trees. A large tree stood on top of the forest, which was covered with snow. \xa0\n"The forest is full of trees."\nA young man walked up to me. He was wearing a white robe. His hair was long and he wore a black robe with white sleeves.\xa0\nI looked at him. I was surprised. This man was an old man. In the past, he had been a nobleman. But now, his name was Nao. Nai was his father. And he was my father\'s son. So, I asked him, "What is your name?"\nHe said,\xa0 "Nao'
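The output above keeps circling the same words ("house", "trees", "robe"), which is what `no_repeat_ngram_size=2` permits: it only stops an exact pair of consecutive tokens from appearing twice. The pure-Python sketch below illustrates that rule on a list of tokens. It is a simplified model of the idea, not the library's implementation, and `banned_next_tokens` is a hypothetical helper name.

```python
def banned_next_tokens(generated, ngram_size):
    """Return the tokens that would repeat an n-gram already present in `generated`."""
    if ngram_size < 1 or len(generated) < ngram_size:
        return set()
    prefix_len = ngram_size - 1
    # The last (n - 1) tokens are the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - prefix_len:])
    banned = set()
    # Scan every n-gram seen so far; if it starts with the same prefix, ban its last token.
    for i in range(len(generated) - ngram_size + 1):
        if tuple(generated[i:i + prefix_len]) == prefix:
            banned.add(generated[i + prefix_len])
    return banned


# "the forest" has already appeared, so after a second "the" the token "forest" is banned.
print(banned_next_tokens(["the", "forest", "is", "the"], 2))  # {'forest'}
```

Individual words can still recur freely under this rule, so sampling options such as `do_sample`, `top_k`, or `top_p` in `model.generate` may be worth trying if the repetition is a problem.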