<a href="https://colab.research.google.com/github/bmaribeiro/clinical_specialty_classification/blob/main/taylor_swift_lyrics_generator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 0. Data and Execution Time Setup

**Step 0.** Before running any cell in this notebook, set the runtime type to GPU: click "Runtime" --> "Change runtime type" --> select a GPU.

**Step 1.** To install the Kaggle API client, run the following command:

In [1]:
!pip install kaggle



**Step 2.** Next, you need an API key. You can get one by going to your Kaggle account settings and clicking “Create New API Token”. This downloads a file called “kaggle.json” to your computer.

**Step 3.** Upload the "kaggle.json" file to your Google Colab session by clicking the folder icon in the left sidebar of Colab and selecting “Upload”. Make sure the file ends up in the root working directory ("/content"), since the cells in the next step look for it there.
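Before moving on, it can help to confirm that the token actually landed where the Kaggle client will look for it. The snippet below is only a sketch: `check_kaggle_token` is a hypothetical helper (not part of the Kaggle package), and it assumes the token file holds the usual `username` and `key` fields.

```python
import json
import os


def check_kaggle_token(config_dir: str = "/content") -> bool:
    """Return True if kaggle.json exists in config_dir and holds both expected fields."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path, "r", encoding="utf-8") as token_file:
            token = json.load(token_file)
    except (json.JSONDecodeError, OSError):
        return False
    # Assumption: a Kaggle API token stores a "username" and a "key" entry.
    return isinstance(token, dict) and all(field in token for field in ("username", "key"))


print("Token found:", check_kaggle_token("/content"))
```

If this prints `False`, re-upload the file or move it into "/content" before running the download cells.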

**Step 4.** Run the following cells to download the "Taylor Swift All Lyrics" dataset:

In [2]:
import os
os.environ["KAGGLE_CONFIG_DIR"] = "/content"

In [3]:
!kaggle datasets download -d ishikajohari/taylor-swift-all-lyrics-30-albums

Downloading taylor-swift-all-lyrics-30-albums.zip to /content
 49% 9.00M/18.3M [00:00<00:00, 25.8MB/s]
100% 18.3M/18.3M [00:00<00:00, 46.9MB/s]


**Step 5.** Unzip the data

In [4]:
!unzip taylor-swift-all-lyrics-30-albums.zip

Archive:  taylor-swift-all-lyrics-30-albums.zip
  inflating: data/Albums.csv         
  inflating: data/Albums/1989/1989_Booklet_.txt  
  inflating: data/Albums/1989/AllYouHadtoDoWasStay.txt  
  inflating: data/Albums/1989/BadBlood.txt  
  inflating: data/Albums/1989/BlankSpace.txt  
  inflating: data/Albums/1989/Clean.txt  
  inflating: data/Albums/1989/HowYouGetTheGirl.txt  
  inflating: data/Albums/1989/IKnowPlaces.txt  
  inflating: data/Albums/1989/IWishYouWould.txt  
  inflating: data/Albums/1989/OutOfTheWoods.txt  
  inflating: data/Albums/1989/ShakeItOff.txt  
  inflating: data/Albums/1989/Style.txt  
  inflating: data/Albums/1989/ThisLove.txt  
  inflating: data/Albums/1989/WelcometoNewYork.txt  
  inflating: data/Albums/1989/WildestDreams.txt  
  inflating: data/Albums/AllTooWell_10MinuteVersion__TheShortFilm__EP/AllTooWell_10MinuteVersion__TheShortFilm_.txt  
  inflating: data/Albums/Anti_Hero_Remixes_/Anti_Hero.txt  
  inflating: data/Albums/Anti_Hero_Remixes_/Anti_Hero_Jay

# 1. Install Dependencies

In [5]:
!pip install loguru
!pip install transformers==4.30
!pip install -q -U trl accelerate git+https://github.com/huggingface/peft.git
!pip install -q datasets bitsandbytes einops wandb
!pip install diffusers==0.20.2

Collecting loguru
  Downloading loguru-0.7.2-py3-none-any.whl (62 kB)
Installing collected packages: loguru
Successfully installed loguru-0.7.2
Collecting transformers==4.30
  Downloading transformers-4.30.0-py3-none-any.whl (7.2 MB)
Collecting huggingface-hub<1.0,>=0.14.1 (from transformers==4.30)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl (301 kB)
Collecting tokenizers!=0.11.

# 2. Import Working Libraries

In [6]:
import os
import re
import pandas as pd
from nltk import word_tokenize
from loguru import logger
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
from tqdm import tqdm
from huggingface_hub import notebook_login
from datasets import load_dataset

# 3. Huggingface Login

To use certain versions of Llama 2, you need to be granted access and authenticate with a Hugging Face API token. Log in from an account that is authorized to access them by running the following cell:

In [7]:
notebook_login()


# 4. Utils

This section contains helper functions that will be useful later on.

In [8]:
def flatten_list(list_of_lists):
    flattened_list = []
    for sublist in list_of_lists:
        flattened_list.extend(sublist)
    return flattened_list

# 5. Loading Data

In [9]:
class OriginalDatasetLoader():
    def __init__(
            self,
            data_raw_dir: str="data",
            albums_dir: str="Albums"
    ):
        self.albums_dir = os.path.join(data_raw_dir, albums_dir)
        self.songs_dict = {}
        self.songs_df = pd.DataFrame()

    def _read_txt(self, data_dir: str, data_file_name: str):
        with open(os.path.join(data_dir, data_file_name), "r") as txt_file:
            file_content = txt_file.read()

        return file_content

    def _get_album_dir(self, album: str):
        return os.path.join(self.albums_dir, album)

    def _get_albums_list(self):
        return os.listdir(self.albums_dir)

    def _get_album_songs_list(self, album_dir: str):
        return os.listdir(album_dir)

    def _set_songs_dict(self):
        for album in self._get_albums_list():
            album_dir = self._get_album_dir(album)
            song_titles_list = self._get_album_songs_list(album_dir)
            self.songs_dict[album] = {song_title[0:-4]: self._read_txt(album_dir, song_title) for song_title in song_titles_list}

    def get_songs_dict(self):
        self._set_songs_dict()
        return self.songs_dict

    def _set_songs_df(self):
        for album in self._get_albums_list():
            album_dir = self._get_album_dir(album)
            song_titles_list = self._get_album_songs_list(album_dir)
            curr_album_df = pd.DataFrame.from_dict({"title": [song_title[0:-4] for song_title in song_titles_list], "lyrics": [self._read_txt(album_dir, song_title) for song_title in song_titles_list], "album": [album for _ in range(len(song_titles_list))]})
            if self.songs_df.empty:
                self.songs_df = curr_album_df
            else:
                self.songs_df = pd.concat([self.songs_df, curr_album_df], axis=0)

        self.songs_df.reset_index(inplace=True, drop=True)

    def get_songs_df(self):
        self._set_songs_df()
        return self.songs_df

In [10]:
lyrics_dataset_generator = OriginalDatasetLoader()
songs_df = lyrics_dataset_generator.get_songs_df()
songs_df.head()

| | title | lyrics | album |
|---|---|---|---|
| 0 | Breathe_TaylorsVersion_ | 33 ContributorsTranslationsTürkçeEspañolСрпски... | Fearless_TaylorsVersion_ |
| 1 | Fifteen_TaylorsVersion_ | 51 ContributorsTranslationsTürkçeEspañolСрпски... | Fearless_TaylorsVersion_ |
| 2 | Forever_Always_PianoVersion__TaylorsVersion_ | 27 ContributorsTranslationsTürkçeСрпскиPortugu... | Fearless_TaylorsVersion_ |
| 3 | WeWereHappy_TaylorsVersion__FromtheVault_ | 72 ContributorsTranslationsTürkçeEspañolСрпски... | Fearless_TaylorsVersion_ |
| 4 | Fearless_TaylorsVersion_ | 42 ContributorsTranslationsTürkçeEspañolСрпски... | Fearless_TaylorsVersion_ |


# 6. Parsing and Preprocessing

In [11]:
class Parser():
    def __init__(
        self,
        lyrics_dataset: pd.DataFrame,
        delimitation_pattern: str=r"Lyrics([\s\S]*?)Embed"
    ):
        self.dataset = lyrics_dataset
        self.delimitation_pattern = delimitation_pattern

    def _get_clean_content(self, input_string: str):
        match = re.search(self.delimitation_pattern, input_string)
        # Return an empty string when the delimiters are missing; empty entries are dropped later
        return match.group(1) if match else ""

    def _remove_end_digits(self, input_string: str):
        return re.sub(r'\d+$', '', input_string)

    def remove_irrelevant_content(self):
        self.dataset.lyrics = self.dataset.lyrics.apply(self._get_clean_content)
        self.dataset.lyrics = self.dataset.lyrics.apply(self._remove_end_digits)

    def parse_data(self):
        self.remove_irrelevant_content()

    def get_parsed_data(self):
        self.parse_data()
        return self.dataset

In [12]:
parser = Parser(songs_df)
parsed_data = parser.get_parsed_data()
parsed_data.head()

| | title | lyrics | album |
|---|---|---|---|
| 0 | Breathe_TaylorsVersion_ | [Verse 1: Taylor Swift]\nI see your face in my... | Fearless_TaylorsVersion_ |
| 1 | Fifteen_TaylorsVersion_ | [Verse 1]\nYou take a deep breath and you walk... | Fearless_TaylorsVersion_ |
| 2 | Forever_Always_PianoVersion__TaylorsVersion_ | [Verse 1]\nOnce upon a time\nI believe it was ... | Fearless_TaylorsVersion_ |
| 3 | WeWereHappy_TaylorsVersion__FromtheVault_ | [Verse 1]\nWe used to walk along the streets\n... | Fearless_TaylorsVersion_ |
| 4 | Fearless_TaylorsVersion_ | [Verse 1]\nThere's something 'bout the way\nTh... | Fearless_TaylorsVersion_ |
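To make the parsing step easier to follow, here is a small standalone sketch of what the two regular expressions do. The `raw_text` string is a made-up sample that only imitates the scraped layout (header text, the word "Lyrics", the song body, a trailing counter, then "Embed"), and `extract_lyrics` is a hypothetical helper that mirrors the `Parser` logic rather than replacing it.

```python
import re


def extract_lyrics(raw_text: str) -> str:
    """Keep the text between "Lyrics" and "Embed", then drop trailing digits."""
    match = re.search(r"Lyrics([\s\S]*?)Embed", raw_text)
    if match is None:
        return ""
    # Remove the counter that sits right before "Embed" in this sample layout.
    return re.sub(r"\d+$", "", match.group(1))


# Hypothetical sample, not an actual row from the dataset.
raw_text = "33 ContributorsSome Song Lyrics[Verse 1]\nFirst line of the song\n12Embed"
print(repr(extract_lyrics(raw_text)))
# '[Verse 1]\nFirst line of the song\n'
```

Note that the non-greedy `[\s\S]*?` stops at the first "Embed" after the first "Lyrics", so a title that itself contains the word "Lyrics" would shift the start of the match.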


In [13]:
class PreProcessor():
    def __init__(
        self,
        lyrics_dataset: pd.DataFrame,
    ):
        self.dataset = lyrics_dataset

    def _remove_structure_markers(self, input_string: str):
        return re.sub(r'\[[^\]]*\]\n', '', input_string)

    def _remove_special_characters(self, input_string: str):
        return re.sub(r'[^a-zA-Z0-9\s]', '', input_string)

    def _lowercase_text(self, input_string: str):
        return input_string.lower()

    def _remove_new_line_characters(self, input_string: str):
        return input_string.replace("\n", " ")

    def _remove_horizontal_tabs(self, input_string: str):
        return input_string.replace("\t", " ")

    def _remove_specific_characters(self, input_string: str):
        new_string = input_string.replace("\u205f", " ")
        return new_string.replace("\u2005", " ")

    def drop_empty_entries(self):
        self.dataset = self.dataset.loc[self.dataset["lyrics"] != ''].reset_index(drop=True)
        self.dataset = self.dataset.loc[self.dataset["lyrics"] != ' '].reset_index(drop=True)

    def drop_duplicated_songs(self):
        self.dataset.drop_duplicates(subset='lyrics', keep='first', inplace=True)
        self.dataset.reset_index(inplace=True, drop=True)

        # TODO: Check whether we should drop title duplicates too

    def remove_noise(self):
        self.dataset.lyrics = self.dataset.lyrics.apply(self._remove_structure_markers)
        self.dataset.lyrics = self.dataset.lyrics.apply(self._remove_special_characters)
        self.dataset.lyrics = self.dataset.lyrics.apply(self._remove_specific_characters)

    def standardize_text(self, keep_new_lines):
        self.dataset.lyrics = self.dataset.lyrics.apply(self._lowercase_text)
        self.dataset.lyrics = self.dataset.lyrics.apply(self._remove_horizontal_tabs)
        # NOTE: as written, a truthy `keep_new_lines` replaces "\n" with spaces,
        # while the default (False) leaves new lines in place.
        if keep_new_lines:
            self.dataset.lyrics = self.dataset.lyrics.apply(self._remove_new_line_characters)

    def preprocess_data(self, keep_new_lines):
        # Remove Noisy Info
        self.remove_noise()

        # Drop Duplicates
        self.drop_duplicated_songs()

        # Drop Empty Entries
        self.drop_empty_entries()

        # Normalize
        self.standardize_text(keep_new_lines)

    def get_preprocessed_data(self, keep_new_lines=False):
        self.preprocess_data(keep_new_lines)
        return self.dataset

In [14]:
preprocessor = PreProcessor(parsed_data)
preprocessed_data = preprocessor.get_preprocessed_data()
preprocessed_data.head()

| | title | lyrics | album |
|---|---|---|---|
| 0 | Breathe_TaylorsVersion_ | i see your face in my mind as i drive away\nca... | Fearless_TaylorsVersion_ |
| 1 | Fifteen_TaylorsVersion_ | you take a deep breath and you walk through th... | Fearless_TaylorsVersion_ |
| 2 | Forever_Always_PianoVersion__TaylorsVersion_ | once upon a time\ni believe it was a tuesday w... | Fearless_TaylorsVersion_ |
| 3 | WeWereHappy_TaylorsVersion__FromtheVault_ | we used to walk along the streets\nwhen the po... | Fearless_TaylorsVersion_ |
| 4 | Fearless_TaylorsVersion_ | theres something bout the way\nthe street look... | Fearless_TaylorsVersion_ |
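The effect of the noise-removal and lowercasing steps can be seen on a single string. This is a minimal sketch with an invented sample; `clean_lyrics` is a hypothetical function that chains the same regular expressions used in `PreProcessor`.

```python
import re


def clean_lyrics(text: str) -> str:
    """Apply the same cleanup order as the class: markers, punctuation, lowercase."""
    text = re.sub(r"\[[^\]]*\]\n", "", text)    # drop markers such as "[Chorus]\n"
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # drop everything but letters, digits, whitespace
    return text.lower()


# Hypothetical sample, not taken from the dataset.
sample = "[Chorus]\nDon't stop, it's 2 a.m.!\nWe're fine\n"
print(repr(clean_lyrics(sample)))
# 'dont stop its 2 am\nwere fine\n'
```

One side effect worth keeping in mind: stripping apostrophes merges contractions into other words ("we're" becomes "were"), which slightly blurs the vocabulary seen by the models.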


# 7. Splitting Data

In [15]:
class Splitter():
    def __init__(
            self,
            test_album_name: str = "Lover"
    ):
        self.test_album = test_album_name

    def train_test_split(self, dataset: pd.DataFrame):
        test_df = dataset.loc[dataset.album == self.test_album].reset_index(drop=True)
        train_df = dataset.loc[dataset.album != self.test_album].reset_index(drop=True)

        return train_df, test_df

In [16]:
splitter = Splitter()
train_dataset, test_dataset = splitter.train_test_split(preprocessed_data)
logger.info(train_dataset.head())
logger.info(test_dataset.head())

2023-10-10 20:21:24.891 | INFO     | __main__:<cell line: 3>:3 -                                           title  \
0                       Breathe_TaylorsVersion_   
1                       Fifteen_TaylorsVersion_   
2  Forever_Always_PianoVersion__TaylorsVersion_   
3     WeWereHappy_TaylorsVersion__FromtheVault_   
4                      Fearless_TaylorsVersion_   

                                              lyrics                     album  
0  i see your face in my mind as i drive away\nca...  Fearless_TaylorsVersion_  
1  you take a deep breath and you walk through th...  Fearless_TaylorsVersion_  
2  once upon a time\ni believe it was a tuesday w...  Fearless_TaylorsVersion_  
3  we used to walk along the streets\nwhen the po...  Fearless_TaylorsVersion_  
4  theres something bout the way\nthe street look...  Fearless_TaylorsVersion_  
2023-10-10 20:21:24.898 | INFO     | __main__:<cell line: 4

# 8. Lyrics Generators

## 8.1. Solution 1: Training an LSTM from scratch

### 8.1.1. Generating Text Generation Samples

In [12]:
class TextGenerationSamplesGenerator():
    def __init__(
            self,
            vocab_path: str = None,
            padding_token: str = "[PAD]",
            bos_token: str = "[BOS]",
            eos_token: str = "[EOS]",
            unk_token: str = "[UNK]"
    ):
        self.vocab_path = vocab_path
        self.padding_token = padding_token
        self.bos_token = bos_token
        self.eos_token = eos_token
        self.unk_token = unk_token

    def _tokenize_text(self, input_string: str):
        return input_string.split(" ")

    def tokenize_lyrics(self, dataset: pd.DataFrame):
        return dataset.lyrics.apply(self._tokenize_text)

    def _get_unique_tokens(self, lyrics_list: pd.Series):
        all_tokens = flatten_list(lyrics_list)
        logger.info(f"Number of words in the dataset: {len(all_tokens)}")
        return sorted(list(set(all_tokens)))

    def _add_bos_eos_tokens(self, token_list: list):
        token_list.insert(0, self.bos_token)
        token_list.append(self.eos_token)
        return token_list

    def _add_special_tokens(self, token_list: list):
        token_list = self._add_bos_eos_tokens(token_list)
        token_list.insert(0, self.padding_token)
        token_list.append(self.unk_token)
        return token_list

    def _build_mapping_dicts(self, tokens: list):
        tokens = self._add_special_tokens(tokens)
        int_tokens = {idx: token for idx, token in enumerate(tokens)}
        tokens_int = {token: idx for idx, token in enumerate(tokens)}
        return int_tokens, tokens_int

    def build_vocab(self, train_dataset: pd.DataFrame):
        tokenized_dataset = self.tokenize_lyrics(train_dataset)
        unique_tokens = self._get_unique_tokens(tokenized_dataset)
        logger.info(f"Number of unique words in the dataset: {len(unique_tokens)}")
        int_tokens_mapping, tokens_int_mapping = self._build_mapping_dicts(unique_tokens)

        return int_tokens_mapping, tokens_int_mapping

    def _encode_text_with_padding(self, tokens: list, vocab: dict, max_length: int):
        encoded_text = []

        # Encode each token using the vocabulary
        for token in tokens:
            if token in vocab:
                encoded_text.append(vocab[token])
            else:
                # Map out-of-vocabulary tokens to the id of the unknown token ([UNK] by default)
                encoded_text.append(vocab[self.unk_token])

        # Add padding tokens to achieve the desired length
        if len(encoded_text) < max_length:
            pad_length = max_length - len(encoded_text)
            encoded_text.extend([vocab[self.padding_token]] * pad_length)
        elif len(encoded_text) > max_length:
            encoded_text = encoded_text[:max_length]

        return encoded_text

    def _get_song_samples(self, lyrics: str, vocab: dict, seq_length: int):
        song_tokens = self._tokenize_text(lyrics)
        song_tokens = self._add_bos_eos_tokens(song_tokens)

        encoded_X, encoded_y = [], []
        for idx in range(0, len(song_tokens)-seq_length, 1):
            input_sequence = self._encode_text_with_padding(song_tokens[idx:idx+seq_length], vocab, seq_length)
            token_to_predict = song_tokens[idx+seq_length]
            if token_to_predict in vocab:
                output_sequence = vocab[token_to_predict]
            else:
                output_sequence = vocab[self.unk_token]
            encoded_X.append(input_sequence)
            encoded_y.append(output_sequence)

        return encoded_X, encoded_y

    def _get_encoded_samples(self, dataset: pd.DataFrame, vocab: dict, seq_length: int = 100):
        data_X, data_y = [], []
        for song in dataset.lyrics:
            song_Xs, song_ys = self._get_song_samples(song, vocab, seq_length)
            data_X.extend(song_Xs)
            data_y.extend(song_ys)

        return data_X, data_y

    def _one_hot_encode_labels(self, data_y, num_classes=None):
        if num_classes is None:
            num_classes = np.max(data_y) + 1

        one_hot_encoded = np.zeros((len(data_y), num_classes))
        one_hot_encoded[np.arange(len(data_y)), data_y] = 1

        return one_hot_encoded

    def get_samples(self, train_dataset: pd.DataFrame, test_dataset: pd.DataFrame):
        int_tokens_dict, tokens_int_dict = self.build_vocab(train_dataset)

        train_X, train_y = self._get_encoded_samples(train_dataset, tokens_int_dict, 100)
        test_X, test_y = self._get_encoded_samples(test_dataset, tokens_int_dict, 100)

        return train_X, self._one_hot_encode_labels(train_y), test_X, self._one_hot_encode_labels(test_y), tokens_int_dict
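The core of the sample generation is a sliding window over each song: every run of `seq_length` tokens becomes an input, and the token that follows it becomes the label. The sketch below shows that idea on a toy token list; `sliding_window_samples` is a hypothetical helper and works on raw tokens rather than vocabulary ids.

```python
def sliding_window_samples(tokens, seq_length):
    """Pair every window of seq_length tokens with the token that follows it."""
    samples = []
    for start in range(len(tokens) - seq_length):
        context = tokens[start:start + seq_length]
        target = tokens[start + seq_length]
        samples.append((context, target))
    return samples


tokens = ["[BOS]", "once", "upon", "a", "time", "[EOS]"]
for context, target in sliding_window_samples(tokens, seq_length=3):
    print(context, "->", target)
# ['[BOS]', 'once', 'upon'] -> a
# ['once', 'upon', 'a'] -> time
# ['upon', 'a', 'time'] -> [EOS]
```

With this scheme every window is already full length, so a song with `seq_length` tokens or fewer yields no samples at all.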

In [13]:
samples_generator = TextGenerationSamplesGenerator()
train_X, train_y, test_X, test_y, vocab = samples_generator.get_samples(train_dataset, test_dataset)

2023-10-10 19:07:35.552 | INFO     | __main__:_get_unique_tokens:24 - Number of words in the dataset: 174022
2023-10-10 19:07:35.569 | INFO     | __main__:build_vocab:47 - Number of unique words in the dataset: 10666


In [14]:
device = "cuda" if torch.cuda.is_available() else "cpu"

In [29]:
class LSTMTextGenerator(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super(LSTMTextGenerator, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size-1) # Because train_X does not contain any [UNK]

    def forward(self, x, hidden):
        x = self.embedding(x)
        x, hidden = self.lstm(x, hidden)
        x = self.fc(x[:, -1, :])
        return x, hidden

def train_text_generator(model, train_loader, vocab_size, criterion, optimizer, num_epochs):
    model = model.to(device)
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0.0
        hidden = None

        for inputs, targets in tqdm(train_loader):
            optimizer.zero_grad()
            inputs = inputs.to(device)
            targets = targets.to(device)
            outputs, hidden = model(inputs, hidden)

            # Detach the hidden state so gradients do not flow back across batches
            hidden = (hidden[0].detach(), hidden[1].detach())

            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch [{epoch + 1}/{num_epochs}] Loss: {total_loss / len(train_loader)}")

def generate_text(model, seed_text, max_length, vocab):
    model = model.to(device)
    model.eval()
    generated_text = seed_text

    with torch.no_grad():
        for _ in range(max_length):
            inputs = torch.tensor([vocab[token] if token in vocab else vocab["[UNK]"] for token in seed_text.split()], dtype=torch.long)
            inputs = inputs.to(device)
            inputs = inputs.unsqueeze(0)

            outputs, _ = model(inputs, None)
            predicted_word_idx = torch.argmax(outputs).item()

            predicted_word = next((word for word, idx in vocab.items() if idx == predicted_word_idx), None)
            if predicted_word is None:
                break

            generated_text += " " + predicted_word
            seed_text += " " + predicted_word

    return generated_text

In [None]:
vocab_size = len(vocab)
embedding_dim = 128
hidden_dim = 256
batch_size = 64
num_epochs = 30

In [17]:
model = LSTMTextGenerator(vocab_size, embedding_dim, hidden_dim)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

train_X = torch.tensor(train_X, dtype=torch.int64)
train_y = torch.tensor(train_y, dtype=torch.float32)

In [18]:
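# Use a separate name so the `train_dataset` DataFrame from the split is not overwritten
train_tensor_dataset = TensorDataset(train_X, train_y)

In [19]:
train_loader = DataLoader(train_tensor_dataset, batch_size=batch_size, shuffle=True, drop_last=True)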

In [None]:
train_text_generator(model, train_loader, len(vocab), criterion, optimizer, num_epochs)

In [None]:
seed_text = "once upon a time"  # lowercase, since the vocabulary was built from lowercased lyrics
max_length = 500
generated_text = generate_text(model, seed_text, max_length, vocab)
print(generated_text)

## 8.2. Solution 2: Zero-Shot Learning with Llama 2

### 8.2.0. Importing Working Libraries

In [None]:
from transformers import AutoTokenizer
import transformers
import torch


In [21]:
model = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model)
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
    max_length=1000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)


In [None]:
prompt = "### Instruction: Complete the verses provided as input to write a new song, as if you were the famous singer and songwriter Taylor Swift. ### Input: Once upon a time\nIn a small town downriver\n"

In [None]:
sequences = pipeline(
    prompt,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    max_length=256,
)

In [None]:
sequences[0]['generated_text']

## 8.3. Solution 3: Fine-Tuning Llama 2 for Lyrics Generation

### 8.3.0. Importing Working Libraries

In [17]:
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTTrainer
from transformers import TrainingArguments
from accelerate.utils import BnbQuantizationConfig


  warn("The installed version of bitsandbytes was compiled without GPU support. "


/usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cpu.so: undefined symbol: cadam32bit_grad_fp32


### 8.3.1. Generating Training Data

Fine-tuning Llama requires a dataset in which every example follows a fixed prompt structure (instruction, input, response). Here's the code used to build it:

In [20]:
class LyricsGenerationDatasetGenerator():
    def __init__(
        self,
        general_instruction: str = "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ",
        instruction: str = "### Instruction: Complete the verses provided as input to write a new song, as if you were the famous singer and songwriter Taylor Swift. ",
        input_template: str = "### Input: ",
        output_template: str = "### Response: "
    ):
        self.general_instruction = general_instruction
        self.instruction = instruction
        self.input_template = input_template
        self.output_template = output_template

    def _create_input(self, complete_lyrics: str):
        verses = self._get_two_first_verses(complete_lyrics)
        return self.input_template + verses

    def _create_output(self, complete_lyrics: str):
        return self.output_template + complete_lyrics

    def _get_two_first_verses(self, lyrics: str):
        # Use the first two lines as the prompt; slicing avoids an IndexError on one-line lyrics
        lines = lyrics.split("\n")
        return "\n".join(lines[:2]) + "\n"

    def generate_prompt(self, lyrics: str):
        input_text = self._create_input(lyrics)
        output_text = self._create_output(lyrics)
        return self.general_instruction + self.instruction + input_text + " " + output_text

    def create_dataset(self, songs_df: pd.DataFrame):
        fine_tuning_dataset = pd.DataFrame()
        fine_tuning_dataset["input"] = songs_df.lyrics.apply(self._create_input)
        fine_tuning_dataset["output"] = songs_df.lyrics.apply(self._create_output)
        fine_tuning_dataset["text"] = songs_df.lyrics.apply(self.generate_prompt)

        return fine_tuning_dataset

In [21]:
preprocessor.get_preprocessed_data(True)
train_set_w_new_lines, test_set_w_new_lines = splitter.train_test_split(preprocessed_data)
fine_tuning_dataset_creator = LyricsGenerationDatasetGenerator()
train_prompting_dataset = fine_tuning_dataset_creator.create_dataset(train_set_w_new_lines)

In [22]:
train_prompting_dataset.head()

| | input | output | text |
|---|---|---|---|
| 0 | ### Input: i see your face in my mind as i dri... | ### Response: i see your face in my mind as i ... | Below is an instruction that describes a task,... |
| 1 | ### Input: you take a deep breath and you walk... | ### Response: you take a deep breath and you w... | Below is an instruction that describes a task,... |
| 2 | ### Input: once upon a time\ni believe it was ... | ### Response: once upon a time\ni believe it w... | Below is an instruction that describes a task,... |
| 3 | ### Input: we used to walk along the streets\n... | ### Response: we used to walk along the street... | Below is an instruction that describes a task,... |
| 4 | ### Input: theres something bout the way\nthe ... | ### Response: theres something bout the way\nt... | Below is an instruction that describes a task,... |


The resulting dataset was uploaded to the Hugging Face Hub and can be loaded with the command in the following cell:

In [23]:
dataset = load_dataset("brunomaribeiro/ts_lyricsgenerationdataset")


### 8.3.2. Fine-Tuning Llama 2

In [None]:
model_name = "TinyPixel/Llama-2-7B-bf16-sharded"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    trust_remote_code=True
)
model.config.use_cache = False

In [None]:
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

In [None]:
lora_alpha = 16
lora_dropout = 0.1
lora_r = 64

peft_config = LoraConfig(
    lora_alpha=lora_alpha,
    lora_dropout=lora_dropout,
    r=lora_r,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj","v_proj"]
)
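As a rough sanity check on the LoRA settings, the number of trainable parameters can be worked out by hand: each adapted weight matrix gets two low-rank factors, one of shape `r x d_in` and one of shape `d_out x r`. The sketch below assumes the standard Llama 2 7B dimensions (hidden size 4096, 32 decoder layers, square `q_proj` and `v_proj` matrices); `lora_trainable_params` is a hypothetical helper, not a PEFT function.

```python
def lora_trainable_params(d_in, d_out, rank, num_layers, modules_per_layer):
    """Count LoRA parameters: each adapted matrix adds rank * (d_in + d_out) weights."""
    per_module = rank * (d_in + d_out)
    return per_module * modules_per_layer * num_layers


# Assumed model dimensions: hidden size 4096, 32 layers, q_proj and v_proj adapted.
total = lora_trainable_params(4096, 4096, rank=64, num_layers=32, modules_per_layer=2)
print(total)      # 33554432

# The low-rank update is scaled by lora_alpha / r before being added to the frozen weight.
scaling = 16 / 64
print(scaling)    # 0.25
```

Under these assumptions the adapters add about 33.6 million trainable weights, a small fraction of the roughly 7 billion frozen base parameters.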

In [None]:
output_dir = "./results"
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
optim_name = "paged_adamw_32bit"  # named so it does not shadow the `torch.optim as optim` import
save_steps = 10
logging_steps = 1
learning_rate = 2e-4
max_grad_norm = 0.3
max_steps = 120
warmup_ratio = 0.03
lr_scheduler_type = "constant"

training_arguments = TrainingArguments(
    output_dir=output_dir,
    per_device_train_batch_size=per_device_train_batch_size,
    gradient_accumulation_steps=gradient_accumulation_steps,
    optim=optim_name,
    save_steps=save_steps,
    logging_steps=logging_steps,
    learning_rate=learning_rate,
    fp16=True,
    max_grad_norm=max_grad_norm,
    max_steps=max_steps,
    warmup_ratio=warmup_ratio,
    group_by_length=True,
    lr_scheduler_type=lr_scheduler_type,
)
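The batch-related settings above interact: gradients from several small batches are accumulated before each optimizer step, so the effective batch size is larger than the per-device value. A quick back-of-the-envelope calculation, assuming a single GPU (the `num_devices` value is an assumption, not something set in the notebook):

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 120
num_devices = 1  # assumption: a single Colab GPU

# Sequences contributing to each optimizer update
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total sequences processed over the whole run (max_steps counts optimizer updates)
sequences_seen = effective_batch_size * max_steps

print(effective_batch_size, sequences_seen)  # 16 1920
```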

In [None]:
max_seq_length = 512

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=training_arguments,
)

In [None]:
for name, module in trainer.model.named_modules():
    if "norm" in name:
        # nn.Module.to() casts the module in place, so no reassignment is needed
        module.to(torch.float32)

In [None]:
trainer.train()

### 8.3.3. Inference

In [None]:
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    # torch_dtype=torch.bfloat16,
    torch_dtype=torch.float16,
    trust_remote_code=True,
    device_map="auto",
)


In [None]:
sequences = pipeline(
    ["Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request. ### Instruction: Complete the verses provided as input to write a new song, as if you were the famous singer and songwriter Taylor Swift. ### Input: Once upon a time\nIn a small town downriver\n"],
    max_length=500,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

In [None]:
sequences[0]["generated_text"]

# 9. Performance Evaluation

Evaluating the performance of the developed techniques requires more than a simple eyeball test. Automated text generation metrics such as perplexity or BLEU should be applied. Each approach should be evaluated by computing these metrics for every (target_text, generated_text) pair produced by running inference on the held-out test set.
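Perplexity is not implemented in this notebook, but the definition is short: it is the exponential of the average negative log-probability that a model assigns to each token of a text. The sketch below works on a plain list of natural-log probabilities; how those are obtained from a given model is left out, and `perplexity` is a hypothetical helper.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean of the per-token natural-log probabilities)."""
    if not token_log_probs:
        raise ValueError("At least one token log-probability is required.")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# A model that gives every token probability 0.25 is as uncertain as a
# uniform choice among 4 options, so its perplexity is approximately 4.
log_probs = [math.log(0.25)] * 5
print(perplexity(log_probs))
```

Lower values mean the model finds the reference lyrics less surprising.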

## 9.0. Import Working Libraries

In [None]:
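import nltk
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer  # requires `pip install rouge-score`
from nltk.translate.meteor_score import meteor_score

# Resources used by word_tokenize and METEOR (newer NLTK releases may also need "punkt_tab")
nltk.download("punkt")
nltk.download("wordnet")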

## 9.1. Evaluation Metrics

### 9.1.1. BLEU score

In [None]:
def calculate_bleu(target_text, generated_text):
    # Tokenize the target and generated texts
    target_tokens = nltk.word_tokenize(target_text.lower())
    generated_tokens = nltk.word_tokenize(generated_text.lower())

    # Calculate BLEU score
    bleu_score = sentence_bleu([target_tokens], generated_tokens)
    return bleu_score

### 9.1.2. ROUGE

In [None]:
def calculate_rouge_n(target_text, generated_text, n=2):
    # rouge_score names its n-gram metrics "rouge1", "rouge2", ...
    rouge_type = f"rouge{n}"
    scorer = rouge_scorer.RougeScorer([rouge_type])
    scores = scorer.score(target_text, generated_text)
    return scores[rouge_type].fmeasure
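For intuition about what the ROUGE-N F-measure captures, here is a simplified, dependency-free sketch. It is not the `rouge_score` implementation: it assumes plain whitespace tokenization with no stemming, and `rouge_n_f1` and `ngram_counts` are hypothetical helpers.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams in a token list (empty if the list is shorter than n)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(target_text, generated_text, n=2):
    """F-measure of n-gram overlap between a target and a generated text."""
    target = ngram_counts(target_text.lower().split(), n)
    generated = ngram_counts(generated_text.lower().split(), n)
    overlap = sum((target & generated).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(generated.values())
    recall = overlap / sum(target.values())
    return 2 * precision * recall / (precision + recall)


# 3 of the 5 bigrams match on each side, so the score is approximately 0.6.
print(rouge_n_f1("the cat sat on the mat", "the cat sat on a mat"))
```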

### 9.1.3. METEOR

In [None]:
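def calculate_meteor(target_text, generated_text):
    # Recent NLTK versions expect pre-tokenized references and hypothesis
    target_tokens = nltk.word_tokenize(target_text.lower())
    generated_tokens = nltk.word_tokenize(generated_text.lower())
    return meteor_score([target_tokens], generated_tokens)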

# 10. Final Considerations

Unfortunately, I ran out of GPU resources halfway through developing the solutions. This cost me a lot of time and prevented me from testing the final version of the notebook. I also could not finish my plan for this task, which consisted of:

1. Building solid loading and pre-processing modules (complete);
2. Carrying out a quick exploratory data analysis (I kept looking at and learning from the data, but I could not report any of my findings);
3. Coming up with an appropriate splitting strategy (I wanted to revisit this once I had finished, although I considered that isolating all the songs from a single album would be a good test for the different solutions);
4. Testing three different approaches to the problem: traditional, prompt-based, and fine-tuning (although I present the code for all three, I could not run enough tests. Nevertheless, I believe they should work. Based on an eyeball test, the Llama 2 based approaches performed much better);
5. Performing a thorough performance evaluation, including both human-based and automated evaluation techniques (I present some of the metrics I intended to use, but I could not even run inference loops against the test set);
6. Presenting good practices and clean code throughout the notebook, including typing and appropriate commenting (although I started this way, by the end of the notebook everything got a bit messy due to time constraints).