## Step 1. Prepare the training data

### Load dataset

In [3]:
# !pip install datasets

In [18]:
from datasets import load_dataset
dataset = load_dataset("huggingartists/ed-sheeran")
from pprint import pprint

In [12]:
import pandas as pd
df = pd.DataFrame(data=dataset)
df['text'] = df.train.apply(lambda row: row.get("text"))


In [11]:
df.head()

| | train | text |
|---|---|---|
| 0 | {'text': 'Shape of You Lyrics The club isnt th... | Shape of You Lyrics\nThe club isnt the best pl... |
| 1 | {'text': 'Perfect Lyrics I found a love for me... | Perfect Lyrics\nI found a love for me\nOh darl... |
| 2 | {'text': 'Love Yourself Lyrics For all the tim... | Love Yourself Lyrics\nFor all the times that y... |
| 3 | {'text': 'River Lyrics Ive been a liar, been a... | River Lyrics\nIve been a liar, been a thief\nB... |
| 4 | {'text': 'Castle on the Hill Lyrics When I was... | Castle on the Hill Lyrics\nWhen I was six year... |


In [19]:
pprint(df['text'][1])

('Perfect Lyrics\n'
 'I found a love for me\n'
 'Oh darling, just dive right in and follow my lead\n'
 'Well, I found a girl, beautiful and sweet\n'
 'Oh, I never knew you were the someone waiting for me\n'
 'Cause we were just kids when we fell in love\n'
 'Not knowing what it was\n'
 'I will not give you up this time\n'
 'But darling, just kiss me slow, your heart is all I own\n'
 'And in your eyes, youre holding mine\n'
 'Baby, Im dancing in the dark with you between my arms\n'
 'Barefoot on the grass, listening to our favourite song\n'
 'When you said you looked a mess, I whispered underneath my breath\n'
 'But you heard it, darling, you look perfect tonight\n'
 'Well I found a woman, stronger than anyone I know\n'
 'She shares my dreams, I hope that someday Ill share her home\n'
 'I found a love, to carry more than just my secrets\n'
 'To carry love, to carry children of our own\n'
 'We are still kids, but were so in love\n'
 'Fighting against all odds\n'
 'I know well be alright 

In [20]:
lyrics_index = df['text'][1].index("Lyrics")

In [30]:
title = df['text'][1][:lyrics_index].strip()

In [31]:
lyrics = df['text'][1][lyrics_index + len("Lyrics"):].strip()

In [32]:
dict_ = {'Title': title, 'Lyrics': lyrics}

In [33]:
dict_

{'Title': 'Perfect',
 'Lyrics': 'I found a love for me\nOh darling, just dive right in and follow my lead\nWell, I found a girl, beautiful and sweet\nOh, I never knew you were the someone waiting for me\nCause we were just kids when we fell in love\nNot knowing what it was\nI will not give you up this time\nBut darling, just kiss me slow, your heart is all I own\nAnd in your eyes, youre holding mine\nBaby, Im dancing in the dark with you between my arms\nBarefoot on the grass, listening to our favourite song\nWhen you said you looked a mess, I whispered underneath my breath\nBut you heard it, darling, you look perfect tonight\nWell I found a woman, stronger than anyone I know\nShe shares my dreams, I hope that someday Ill share her home\nI found a love, to carry more than just my secrets\nTo carry love, to carry children of our own\nWe are still kids, but were so in love\nFighting against all odds\nI know well be alright this time\nDarling, just hold my hand\nBe my girl, Ill be your ma

In [38]:
df['text'].str.strip()

0      Shape of You Lyrics\nThe club isnt the best pl...
1      Perfect Lyrics\nI found a love for me\nOh darl...
2      Love Yourself Lyrics\nFor all the times that y...
3      River Lyrics\nIve been a liar, been a thief\nB...
4      Castle on the Hill Lyrics\nWhen I was six year...
                             ...                        
918                                                     
919                                                     
920                                                     
921                                                     
922                                                     
Name: text, Length: 923, dtype: object

In [48]:
import pandas as pd
df = pd.DataFrame(data=dataset)
df['text'] = df.train.apply(lambda row: row.get("text"))

def get_title_lyrics(text):
    lyrics_start = "Lyrics"
    if lyrics_start in text:
        lyrics_index = text.index(lyrics_start)
        title = text[:lyrics_index].strip()
        lyrics = text[lyrics_index + len(lyrics_start):].strip()
        
    else:
        title = None
        lyrics = None
    return {'Title': title, 'Lyrics': lyrics}
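
A quick way to sanity-check the splitting logic is to run it on hand-written strings. The sketch below is an equivalent variant that uses `str.partition` instead of `index`; the sample strings are made up for illustration and are not taken from the dataset.

```python
def split_title_lyrics(text):
    """Split a 'Title Lyrics <body>' string; return None values if the marker is missing."""
    marker = "Lyrics"
    if marker not in text:
        return {"Title": None, "Lyrics": None}
    # partition splits on the FIRST occurrence of the marker, like text.index(marker)
    title, _, lyrics = text.partition(marker)
    return {"Title": title.strip(), "Lyrics": lyrics.strip()}


# Illustrative inputs (hypothetical, not rows from the dataset)
with_marker = split_title_lyrics("Perfect Lyrics\nI found a love for me")
without_marker = split_title_lyrics("")

print(with_marker)
print(without_marker)
```

Because the split happens at the first occurrence of the marker, a title that itself contains the word "Lyrics" would be cut short; rows without the marker end up as missing values and can be dropped afterwards.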


In [49]:
df[['Title', 'Lyrics']] = df['text'].apply(get_title_lyrics).apply(pd.Series)

In [52]:
df.isna().sum()

train      0
text       0
Title     44
Lyrics    44
dtype: int64

In [53]:
df.dropna(inplace=True)

In [54]:
df.isna().sum()

train     0
text      0
Title     0
Lyrics    0
dtype: int64

In [55]:
df.shape

(879, 4)

In [82]:
df.to_csv('ed_sheeran.csv')

### Encode the text and create the train/validation sets

Because a language model works with tokens, we convert the raw lyrics into a sequence of integers (token ids). We use the GPT-2 byte-pair-encoding (BPE) tokenizer, so each token is a subword piece rather than a whole word, and each token is represented by a unique integer id.

In [56]:
import os
import tiktoken
import numpy as np
import pandas as pd

In [62]:
df['Lyrics']

0      The club isnt the best place to find a lover\n...
1      I found a love for me\nOh darling, just dive r...
2      For all the times that you rained on my parade...
3      Ive been a liar, been a thief\nBeen a lover, b...
4      When I was six years old I broke my leg\nI was...
                             ...                        
908    Well do it all\nEverything\nOn our own\nWe don...
909    For all the times that you rained on my parade...
913    My lovers got humor\nShes the giggle at a fune...
915    White lips, pale face\nBreathing in the snowfl...
916    When your legs dont work like they used to bef...
Name: Lyrics, Length: 879, dtype: object

In [72]:
data = df['Lyrics'].str.cat(sep="\n")
n = len(data)

In [73]:
train_data = data[: int(n*0.9)]
val_data = data[int(n * 0.9) :]

In [75]:
# encode with tiktoken gpt2 bpe
enc = tiktoken.get_encoding("gpt2")
train_ids = enc.encode_ordinary(train_data)
val_ids = enc.encode_ordinary(val_data)

In [79]:
# export to bin files
train_ids = np.array(train_ids, dtype=np.uint16)
val_ids = np.array(val_ids, dtype=np.uint16)

In [83]:
# write the token ids to the current working directory
# (__file__ is not defined inside a notebook, so relative paths are used)
train_ids.tofile("train.bin")
val_ids.tofile("val.bin")
# train has 433,585 tokens
# val has 48,662 tokens
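
The `uint16` dtype is sufficient here because the GPT-2 vocabulary has 50,257 entries, so every token id is below 65,536. A minimal sketch of the write/read round trip, assuming a temporary file and made-up token ids (both are illustrative and not the notebook's data):

```python
import os
import tempfile

import numpy as np

# Hypothetical token ids; 50256 is the largest id in the GPT-2 vocabulary.
ids = [464, 3430, 318, 50256]
arr = np.array(ids, dtype=np.uint16)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.bin")
    arr.tofile(path)                                # raw bytes, no header or shape info
    size_bytes = os.path.getsize(path)              # 2 bytes per token
    restored = np.fromfile(path, dtype=np.uint16)   # dtype must match the one used to write

print(size_bytes, restored.tolist())
```

Since the file carries no metadata, whatever reads `train.bin` and `val.bin` later has to be told the dtype explicitly.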

## Step 2. Define the model

Create model.py with the GPT class definition:

- Initialize the transformer components (embeddings, blocks, etc.).
- Define the forward pass: process the input through the embeddings and transformer blocks.
- Configure the optimizer: separate the parameters that receive weight decay from those that do not.
- For each epoch and batch, perform a forward pass, calculate the loss, back-propagate, and update the parameters.
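
The optimizer grouping and a single training step can be sketched as follows. This is not the actual model.py: `TinyLM` is a hypothetical toy stand-in for the GPT class, the hyper-parameter values are placeholders, and the rule "tensors with two or more dimensions get weight decay" is an assumed (though common) convention, since the text does not say how the parameters are separated.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyLM(nn.Module):
    """Toy stand-in for the GPT class: embedding -> LayerNorm -> linear head."""

    def __init__(self, vocab_size=16, n_embd=8):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        self.ln = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.head(self.ln(self.emb(idx)))  # shape (batch, time, vocab)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1)
            )
        return logits, loss


def configure_optimizer(model, weight_decay=0.1, lr=1e-3):
    # Assumed convention: matrices (dim >= 2) are decayed,
    # biases and LayerNorm parameters (dim < 2) are not.
    params = [p for p in model.parameters() if p.requires_grad]
    decay = [p for p in params if p.dim() >= 2]
    no_decay = [p for p in params if p.dim() < 2]
    groups = [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
    return torch.optim.AdamW(groups, lr=lr)


torch.manual_seed(0)
model = TinyLM()
optimizer = configure_optimizer(model)

x = torch.randint(0, 16, (4, 5))  # random token ids: batch of 4, length 5
y = torch.randint(0, 16, (4, 5))  # random targets, for illustration only

_, loss = model(x, y)                  # forward pass and loss
optimizer.zero_grad(set_to_none=True)
loss.backward()                        # back-propagate
optimizer.step()                       # update parameters
```

In the toy model, the embedding and linear weight matrices land in the decayed group, while the linear bias and both LayerNorm parameters land in the non-decayed group.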

Then we will create train.py to initialize the model, run the training loop, and generate text.

## Step 3. Train the babyGPT model

In this section, we will actually train a baby GPT model. Let's create a new file called config/train_edsheeran.py to define the hyper-parameters: