<a href="https://colab.research.google.com/github/deepthidornala/DL-Assignment-2/blob/main/Question_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:

!pip install transformers datasets torch pandas

import os
os.environ["WANDB_DISABLED"] = "true"

import torch
from transformers import (
    GPT2Tokenizer,
    GPT2LMHeadModel,
    Trainer,
    TrainingArguments,
    DataCollatorForLanguageModeling,
    pipeline
)
from datasets import load_dataset
import pandas as pd

with open("/content/bruno-mars.txt", "r", encoding="utf-8") as f:
    lyrics = f.read()

lyrics = "\n".join([line.strip() for line in lyrics.split("\n") if line.strip()])

with open("processed_lyrics.txt", "w", encoding="utf-8") as f:
    f.write(lyrics)

dataset = load_dataset("text", data_files={"train": "processed_lyrics.txt"})

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token


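def tokenize_function(examples):
    # Return plain Python lists: datasets.map stores them as Arrow columns,
    # and the data collator builds the tensors at batch time.
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )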

tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])


model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

training_args = TrainingArguments(
    output_dir="./gpt2-bruno-mars",
    overwrite_output_dir=True,
    num_train_epochs=5,
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    learning_rate=5e-5,
    warmup_steps=500,
    logging_dir="./logs",
    report_to="none",
    logging_steps=500
)


data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
)

print("Starting training...")
trainer.train()
print("Training completed!")

trainer.save_model("./gpt2-bruno-mars")
tokenizer.save_pretrained("./gpt2-bruno-mars")

generator = pipeline(
    "text-generation",
    model="./gpt2-bruno-mars",
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1
)

prompts = [
    "Tonight I'm gonna give you all my love",
    "Girl, you're amazing just the way you are",
    "Don't believe me just watch",
    "I would catch a grenade for you"
]

print("\n🎤 Bruno Mars Style Lyrics Generation 🎤\n")
for prompt in prompts:
    print(f"Prompt: '{prompt}'")
    outputs = generator(
        prompt,
        max_length=150,
        num_return_sequences=1,
        temperature=0.9,
        top_k=50,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )
    print("Generated Lyrics:")
    print(outputs[0]["generated_text"])
    print("\n" + "="*80 + "\n")




Map:   0%|          | 0/3270 [00:00<?, ? examples/s]


Starting training...


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step    Training Loss
 500    3.5056
1000    2.6818
1500    2.1173
2000    1.6452
2500    1.4757
3000    1.1871
3500    1.1052
4000    1.0001


Training completed!
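
The training log above can be sanity-checked with a little arithmetic. The sketch below is not part of the original notebook. It assumes a single device and no gradient accumulation (the `TrainingArguments` above do not set it), and the helper names `total_training_steps` and `perplexity` are hypothetical. It estimates how many optimizer steps the run should take from the 3270 examples reported by the `Map` progress line, and converts the last logged loss into a perplexity. That loss is measured on the training text itself, so the perplexity says nothing about held-out lyrics.

```python
import math


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps for a run, assuming one device, no gradient
    accumulation, and that the final partial batch of each epoch is kept."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean token-level cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


# Values taken from the cell and its output above.
steps = total_training_steps(num_examples=3270, batch_size=4, epochs=5)
print(steps)                          # 4090
print(round(500 / steps, 3))          # 0.122, the share of steps spent in warmup
print(round(perplexity(3.5056), 2))   # training perplexity at step 500
print(round(perplexity(1.0001), 2))   # 2.72, training perplexity at step 4000
```

Under these assumptions the run ends at step 4090, which is consistent with the last logged row at step 4000 when `logging_steps=500`.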


Device set to use cuda:0
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.



🎤 Bruno Mars Style Lyrics Generation 🎤

Prompt: 'Tonight I'm gonna give you all my love'
Generated Lyrics:
Tonight I'm gonna give you all my love so don't feel like picking up my phone, picking up my phone, picking up my phone right now baby It's so hard to find a love like mine that I can understand, that you feel so right at this moment. You feel you're at the heart of this whole unfolding unfolding, oh baby There's something so beautiful about you, that you're lost in the night. It's so hard to find love like mine that I can understand, that you feel so right at this moment. Oh yeah yeah yeah yeah, yeah, yeah, yeah, yeah, yeah, yeah Oh yeah yeah, yeah, yeah, yeah, yeah You don't know how long you're holding my hand, your


Prompt: 'Girl, you're amazing just the way you are'
Generated Lyrics:
Girl, you're amazing just the way you are. You make me feel like, I've been locked out of heaven. Never had much faith in love or miracles. Now I finally see, you don't have to be a fool to bel