<a href="https://colab.research.google.com/github/geekypathak21/taylor_swift_lyrics_generator/blob/main/taylor_swift_lyrics_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this notebook, I fine-tune a GPT-2 model from the Hugging Face model hub. The fine-tuning data is the Taylor Swift album dataset from Kaggle, which covers 42 albums.

The idea is to fine-tune GPT-2 on the lyrics from these albums so that it can write new song lyrics.

## **What we are going to do:**

- load the dataset from Kaggle
- prepare the dataset and build a ``TextDataset``
- load the pre-trained GPT-2 model and tokenizer
- initialize ``Trainer`` with ``TrainingArguments``
- train and save the model
- test the model

In [1]:
!pip install transformers==4.28.0
!pip install accelerate

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers==4.28.0
  Downloading transformers-4.28.0-py3-none-any.whl (7.0 MB)
Collecting huggingface-hub<1.0,>=0.11.0 (from transformers==4.28.0)
  Downloading huggingface_hub-0.14.1-py3-none-any.whl (224 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers==4.28.0)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.14.1 tokenizers-0.13.3 transfor

In [2]:
!nvidia-smi

Sun May 21 14:04:36 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   49C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

# Load the dataset from Kaggle

As mentioned in the introduction, we use the "Taylor Swift All Lyrics (42 albums)" dataset from Kaggle. It contains almost all (if not all) of Taylor Swift's song lyrics, currently 42 albums. The lyrics are stored as plain text (`.txt`) files to give the user complete flexibility 😊. The dataset also contains cover art for all of these albums.

Each album has its own directory.

The dataset also provides a CSV file listing the albums and, in the 'Tabular' directory, one CSV file for each album.

In [3]:
#upload files to your colab environment
from google.colab import files
uploaded = files.upload()

Saving archive.zip to archive.zip


After uploading the archive, we use `unzip` to extract all the lyrics files.

In [4]:
!unzip 'archive.zip'

Archive:  archive.zip
  inflating: data/Albums.csv         
  inflating: data/Albums/1989/1989_Booklet_.txt  
  inflating: data/Albums/1989/AllYouHadtoDoWasStay.txt  
  inflating: data/Albums/1989/BadBlood.txt  
  inflating: data/Albums/1989/BlankSpace.txt  
  inflating: data/Albums/1989/Clean.txt  
  inflating: data/Albums/1989/HowYouGetTheGirl.txt  
  inflating: data/Albums/1989/IKnowPlaces.txt  
  inflating: data/Albums/1989/IWishYouWould.txt  
  inflating: data/Albums/1989/OutOfTheWoods.txt  
  inflating: data/Albums/1989/ShakeItOff.txt  
  inflating: data/Albums/1989/Style.txt  
  inflating: data/Albums/1989/ThisLove.txt  
  inflating: data/Albums/1989/WelcometoNewYork.txt  
  inflating: data/Albums/1989/WildestDreams.txt  
  inflating: data/Albums/AllTooWell_10MinuteVersion__TheShortFilm__EP/AllTooWell_10MinuteVersion__TheShortFilm_.txt  
  inflating: data/Albums/Anti_Hero_Remixes_/Anti_Hero.txt  
  inflating: data/Albums/Anti_Hero_Remixes_/Anti_Hero_JaydaGRemix_.txt  
  inflatin
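Before preprocessing, it can help to check what was actually extracted. The following is a small sketch that is not part of the original notebook: it counts the `.txt` files per album directory. The helper name `count_lyrics_per_album` is made up for illustration, and the snippet assumes the archive unpacked into `data/Albums` as in the output above.

```python
import os
from collections import Counter

def count_lyrics_per_album(albums_dir):
    """Return a Counter mapping album directory (relative path) -> number of .txt files."""
    counts = Counter()
    for root, _, files in os.walk(albums_dir):
        n_txt = sum(1 for name in files if name.endswith(".txt"))
        if n_txt:
            counts[os.path.relpath(root, albums_dir)] += n_txt
    return counts

# Only runs where the archive has been extracted (assumed location: data/Albums).
if os.path.isdir("data/Albums"):
    for album, n in count_lyrics_per_album("data/Albums").most_common(5):
        print(f"{album}: {n} files")
```

Note that this counts every `.txt` file, so booklet files such as `1989_Booklet_.txt` are included alongside the songs.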

# Prepare the dataset and build a ``TextDataset``

The next step is to extract the lyrics from all albums and build a `TextDataset`. The `TextDataset` is a custom implementation of the [PyTorch `Dataset` class](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html#dataset-class) provided by the transformers library. If you want to know more about `Dataset` in PyTorch, you can check out this [YouTube video](https://www.youtube.com/watch?v=PXOzkkB5eH0&ab_channel=PythonEngineer).

First, we extract the lyrics from the album files, split them into a `train` and a `test` set, and write them to `train_dataset.txt` and `test_dataset.txt`.

In [5]:
import os
import re
from sklearn.model_selection import train_test_split

def preprocess_string(input_string):
    if (len(input_string.split('\n', 1)) < 2):
      return ""
    # Remove the first line
    input_string = input_string.split('\n', 1)[1]

    # Remove square brackets and their contents (e.g., [Verse 1])
    input_string = re.sub(r'\[[^\]]*\]', '', input_string)

    # Replace newline characters with a blank space
    input_string = input_string.replace('\n', ' ')

    # Remove extra whitespace
    input_string = ' '.join(input_string.split())

    return input_string

def find_text_files(directory):
    text_files = []
    for root, _, files in os.walk(directory):
        for file in files:
            if file.endswith(".txt"):
                file_path = os.path.join(root, file)
                with open(file_path, 'r') as f:
                    text = preprocess_string(f.read())
                    text_files.append(text)
    return text_files

data = find_text_files('data/Albums')

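def build_text_files(texts, dest_path):
    # Concatenate all songs into one string, separated by two spaces
    data = ''
    for text in texts:
        song = str(text).strip()
        song = re.sub(r"\s", " ", song)
        data += song + "  "
    # The context manager flushes and closes the file after writing
    with open(dest_path, 'w') as f:
        f.write(data)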

train, test = train_test_split(data,test_size=0.15) 


build_text_files(train,'train_dataset.txt')
build_text_files(test,'test_dataset.txt')

print("Train dataset length: "+str(len(train)))
print("Test dataset length: "+ str(len(test)))

Train dataset length: 416
Test dataset length: 74
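To make the cleaning steps concrete, here is the logic of `preprocess_string` applied to a made-up lyric file. The sample text is invented, not taken from the dataset, and the function body is repeated in condensed form only so that the snippet runs on its own.

```python
import re

def preprocess_string(input_string):
    # Same behavior as the cell above, condensed so this snippet is standalone.
    parts = input_string.split('\n', 1)
    if len(parts) < 2:
        return ""                                # single-line file: nothing after the title
    body = re.sub(r'\[[^\]]*\]', '', parts[1])   # drop tags such as [Chorus]
    return ' '.join(body.split())                # collapse newlines and repeated spaces

# Invented sample: a title line, section tags, blank lines and extra spaces.
sample = "Made-Up Song Lyrics\n[Verse 1]\nFirst line here\n\n[Chorus]\nSecond   line\n"
cleaned = preprocess_string(sample)
print(cleaned)  # First line here Second line
```

A file that consists of a single line returns an empty string, and `find_text_files` still appends that empty string to `data`, so a few of the 490 entries may carry no lyrics.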


The next step is to download the tokenizer. We use the tokenizer of the pre-trained `gpt2` model on [Hugging Face](https://huggingface.co/gpt2).

In [6]:
from transformers import GPT2Tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

train_path = 'train_dataset.txt'
test_path = 'test_dataset.txt'

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

In [7]:
from transformers import TextDataset,DataCollatorForLanguageModeling

def load_dataset(train_path,test_path,tokenizer):
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=128)
     
    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=128)   
    
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,test_dataset,data_collator

train_dataset,test_dataset,data_collator = load_dataset(train_path,test_path,tokenizer)
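`block_size=128` controls how the tokenized file is cut into training examples. As a rough, library-free sketch of what I understand `TextDataset` to do (this is not the library's actual code): the whole file is tokenized into one long list of ids, which is then sliced into consecutive blocks of `block_size` tokens, and a trailing partial block is dropped. The function name `chunk_into_blocks` and the toy ids are illustrative.

```python
def chunk_into_blocks(token_ids, block_size):
    """Slice a flat list of token ids into consecutive, non-overlapping blocks.

    A final block shorter than block_size is discarded.
    """
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

toy_ids = list(range(10))  # stand-in for real GPT-2 token ids
blocks = chunk_into_blocks(toy_ids, block_size=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Because all songs are concatenated into one file before tokenization, a single block can span the end of one song and the start of the next.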



# Initialize `Trainer` with `TrainingArguments` and GPT-2 model

The [Trainer](https://huggingface.co/transformers/main_classes/trainer.html#transformers.Trainer) class provides an API for feature-complete training. It is used in most of the [example scripts](https://huggingface.co/transformers/examples.html) from Hugging Face. Before we can instantiate our `Trainer`, we need to download our GPT-2 model and create [TrainingArguments](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments), which expose all the points of customization during training. In the `TrainingArguments`, we define the hyperparameters for the training process, such as `learning_rate`, `num_train_epochs`, or `per_device_train_batch_size`. You can find a complete list [here](https://huggingface.co/transformers/main_classes/trainer.html#trainingarguments).

In [8]:
from transformers import Trainer, TrainingArguments, GPT2Config, GPT2LMHeadModel

# Load the default GPT-2 config; nothing is customized here apart from output_hidden_states
configuration = GPT2Config.from_pretrained('gpt2', output_hidden_states=False)

# instantiate the model
model = GPT2LMHeadModel.from_pretrained("gpt2", config=configuration)


training_args = TrainingArguments(
    output_dir="./gpt2_taylor_swift", # the output directory
    overwrite_output_dir=True, # overwrite the content of the output directory
    num_train_epochs=3, # number of training epochs
    per_device_train_batch_size=32, # batch size for training
    per_device_eval_batch_size=64,  # batch size for evaluation
    eval_steps=400, # update steps between two evaluations (only used when evaluation_strategy="steps")
    save_steps=800, # save a checkpoint every 800 update steps
    warmup_steps=500, # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True, # during evaluation, return only the loss
    )


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)

Downloading pytorch_model.bin:   0%|          | 0.00/548M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]
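One detail worth checking is how `warmup_steps=500` interacts with the length of the run. Assuming the `Trainer` defaults of a linear schedule with warmup and a peak learning rate of 5e-5 (neither is set explicitly in the cell above), the learning-rate multiplier can be sketched as below. The function `linear_warmup_factor` is an illustrative re-implementation, not a library call.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Multiplier applied to the peak learning rate at a given update step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 during warmup.
        return step / max(1, warmup_steps)
    # Linear decay from 1 down to 0 after warmup.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

peak_lr = 5e-5      # assumed default learning rate
total_steps = 183   # global_step reported by trainer.train() in this notebook
final_factor = linear_warmup_factor(total_steps, warmup_steps=500, total_steps=total_steps)
print(f"{final_factor:.3f}")            # 0.366
print(f"{peak_lr * final_factor:.2e}")  # 1.83e-05
```

Under these assumptions the run ends while the schedule is still warming up, so the learning rate never reaches its peak. A smaller `warmup_steps` value may be worth trying on a dataset of this size.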

# Train and save the model

To train the model, we simply call `trainer.train()`.

In [9]:
trainer.train()





TrainOutput(global_step=183, training_loss=3.8230707554217895, metrics={'train_runtime': 211.1714, 'train_samples_per_second': 27.319, 'train_steps_per_second': 0.867, 'total_flos': 376848433152000.0, 'train_loss': 3.8230707554217895, 'epoch': 3.0})
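The `global_step=183` in the output can be cross-checked with a little arithmetic. The sketch below assumes a single GPU and no gradient accumulation, which matches the arguments used above. The number of training blocks is not printed by the notebook, so it is estimated here from the reported throughput and runtime.

```python
import math

train_runtime = 211.1714     # seconds, from TrainOutput
samples_per_second = 27.319  # from TrainOutput
epochs = 3
batch_size = 32

# Estimated number of 128-token blocks in the training set (an inference, not a logged value).
num_examples = round(train_runtime * samples_per_second / epochs)

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(num_examples, steps_per_epoch, total_steps)  # 1923 61 183
```

Since 183 is below both `save_steps=800` and the default logging interval of 500 steps, no intermediate checkpoint or training-loss log line is produced during this run.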

After training is done, you can save the model by calling `save_model()`. This saves the trained model to the `output_dir` set in our `TrainingArguments`.

In [10]:
trainer.save_model()
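As far as I can tell, `save_model()` writes the model weights and config, but because no tokenizer was passed to the `Trainer` above, the tokenizer files are not written alongside them. That would explain why the pipeline in the next section loads `tokenizer='gpt2'` separately. One possible convenience, sketched here with a hypothetical helper name, is to save both into the same directory.

```python
def save_model_and_tokenizer(trainer, tokenizer, output_dir):
    """Save the fine-tuned model and its tokenizer into one directory.

    `trainer` is expected to expose save_model(output_dir) and `tokenizer`
    save_pretrained(output_dir), as the transformers objects in this notebook do.
    """
    trainer.save_model(output_dir)
    tokenizer.save_pretrained(output_dir)
    return output_dir

# In the notebook this would be called as:
# save_model_and_tokenizer(trainer, tokenizer, "./gpt2_taylor_swift")
```

With the tokenizer stored next to the model, the directory can be loaded on its own without naming the base `gpt2` checkpoint again.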

# Test the model

To test the model, we use another [highlight of the transformers library](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) called `pipeline`. [Pipelines](https://huggingface.co/transformers/main_classes/pipelines.html?highlight=pipelines) are objects that offer a simple API dedicated to several tasks, `text-generation` among them.

In [11]:
from transformers import pipeline

chef = pipeline('text-generation',model='./gpt2_taylor_swift', tokenizer='gpt2')


Downloading (…)/main/tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

In [12]:
chef('Nice to meet you, where you been?')

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': "Nice to meet you, where you been? You know me, I always knew you'd be here in the snow at your feet, but you stayed a few minutes before I got here And I know you were there, and yet it never felt so"}]
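The call above relies on the pipeline's default generation settings. To get longer or more varied lyrics, sampling parameters can be passed through the pipeline call. The wrapper below is a sketch: `generate_lyrics` is a made-up helper, and the parameter values are starting points rather than tuned settings.

```python
def generate_lyrics(generator, prompt, max_length=100, num_return_sequences=3):
    """Return a list of generated strings for `prompt`.

    `generator` is a text-generation pipeline such as `chef` above.
    """
    outputs = generator(
        prompt,
        max_length=max_length,  # total length in tokens, prompt included
        do_sample=True,         # sample from the distribution instead of taking the top token
        top_k=50,               # consider only the 50 most likely next tokens
        top_p=0.95,             # nucleus sampling threshold
        num_return_sequences=num_return_sequences,
    )
    return [out["generated_text"] for out in outputs]

# In the notebook:
# for text in generate_lyrics(chef, "Nice to meet you, where you been?"):
#     print(text, "\n")
```

Lower `top_p` or `top_k` values tend to give safer, more repetitive text, while higher values give more surprising lines.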