## Summary
This notebook turns a dataset of song lyrics from raw text into a structured multi-task learning dataset, suitable for teaching a model both phoneme recognition and couplet generation. Starting from a .csv file, it cleans the lyrics, splits each song into individual lines, and pairs adjacent lines into couplets, keeping only those whose last words rhyme. The resulting dataset serves two purposes: it maps graphemes to phonemes (and vice versa) for each line, and it maps the first line of a couplet to the second. The intent is for the model to learn not only to predict the next line of a couplet but also to pick up the rhyme and rhythm of the lyrics from their phonemic patterns.

In [1]:
%matplotlib inline

In [3]:
import pandas as pd

lyrics_df = pd.read_csv("data/rap_lyrics 2.csv")
lyrics_df = lyrics_df[["title","artist", "lyrics"]]
lyrics_df.rename({"lyrics": "lyrics_g"}, axis=1, inplace=True)
lyrics_df.head()
lyrics_df.describe()


Unnamed: 0,title,artist,lyrics_g
count,20986,20986,20986
unique,18068,170,20807
top,Intro,Post Malone,"Around the world, around the world:Around the ..."
freq,27,326,3


## Clean Lyrics

In [4]:
# strip the trailing "<digits>Embed" residue from the end of each lyric
import re
pattern = r"\d+Embed$"

# regex=True is required: newer pandas versions treat the pattern as a literal string by default
lyrics_df["lyrics_g"] = lyrics_df["lyrics_g"].str.replace(pattern, "", regex=True)
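To make the effect of the pattern concrete, here is a minimal sketch that applies the same regular expression with the standard `re` module. The two sample strings are made up for illustration and are not rows from the dataset.

```python
import re

# Same pattern as in the cell above: one or more digits followed by "Embed" at the very end.
pattern = r"\d+Embed$"

# Hypothetical sample strings, not taken from the dataset.
samples = [
    "so why can't i turn off the radio?12Embed",
    "no residue on this line",
]

cleaned = [re.sub(pattern, "", s) for s in samples]
print(cleaned)
```

Note that the pattern requires at least one digit before "Embed", so a bare trailing "Embed" would be left in place.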


In [5]:
#convert to lowercase
lyrics_df["lyrics_g"] = lyrics_df["lyrics_g"].str.lower()

In [6]:
#drop duplicates
lyrics_df.drop_duplicates(inplace=True)

In [7]:
#drop lyrics that aren't in English
from langdetect import detect

def detect_lyric(x):
    # langdetect raises on empty or undetectable text; treat those rows as unknown
    try:
        return detect(x)
    except Exception:
        return None

lyrics_df["language"] = lyrics_df["lyrics_g"].apply(detect_lyric)
lyrics_df= lyrics_df[lyrics_df.language == "en"]

In [8]:
lyrics_df = lyrics_df[lyrics_df["artist"] != "Anuel AA"]

In [9]:
#drop all empty lyrics
lyrics_df = lyrics_df[lyrics_df["lyrics_g"]!=""]

In [10]:
lyrics = lyrics_df["lyrics_g"].to_list()
print(lyrics[0])

mmm, mmm, yeah:do, do, do, do, do, do, do-do:ooh, yeah::gotta change my answering machine:now that i'm alone:'cause right now it says that we:can't come to the phone:and i know it makes no sense:'cause you walked out the door:but it's the only way i hear your voice anymore::(it's ridiculous):it's been months:and for some reason i just (can't get over us):and i'm stronger than this, yeah (enough is enough):no more walking 'round with my head down (yeah):i'm so over being blue:cryin' over you::and i'm so sick of love songs, so tired of tears:so done with wishin' you were still here:said i'm so sick of love songs, so sad and slow:so why can't i turn off the radio?:see ne-yo liveget tickets as low as $38you might also like:gotta fix that calendar i have:that's marked july 15th:because since there's no more you:there's no more anniversary:i'm so fed up with my thoughts of you:and your memory:and how every song reminds me of what used to be:that's the reason::i'm so sick of love songs, so ti

In [11]:
lyrics = list(map(lambda x: x.replace("::", ":"),lyrics))
print(lyrics[0])

mmm, mmm, yeah:do, do, do, do, do, do, do-do:ooh, yeah:gotta change my answering machine:now that i'm alone:'cause right now it says that we:can't come to the phone:and i know it makes no sense:'cause you walked out the door:but it's the only way i hear your voice anymore:(it's ridiculous):it's been months:and for some reason i just (can't get over us):and i'm stronger than this, yeah (enough is enough):no more walking 'round with my head down (yeah):i'm so over being blue:cryin' over you:and i'm so sick of love songs, so tired of tears:so done with wishin' you were still here:said i'm so sick of love songs, so sad and slow:so why can't i turn off the radio?:see ne-yo liveget tickets as low as $38you might also like:gotta fix that calendar i have:that's marked july 15th:because since there's no more you:there's no more anniversary:i'm so fed up with my thoughts of you:and your memory:and how every song reminds me of what used to be:that's the reason:i'm so sick of love songs, so tired 

In [12]:
verses = [lyric.split(":") for lyric in lyrics]

In [13]:
import itertools

verses = list(itertools.chain(*verses))
display(verses)

['mmm, mmm, yeah',
 'do, do, do, do, do, do, do-do',
 'ooh, yeah',
 'gotta change my answering machine',
 "now that i'm alone",
 "'cause right now it says that we",
 "can't come to the phone",
 'and i know it makes no sense',
 "'cause you walked out the door",
 "but it's the only way i hear your voice anymore",
 "(it's ridiculous)",
 "it's been months",
 "and for some reason i just (can't get over us)",
 "and i'm stronger than this, yeah (enough is enough)",
 "no more walking 'round with my head down (yeah)",
 "i'm so over being blue",
 "cryin' over you",
 "and i'm so sick of love songs, so tired of tears",
 "so done with wishin' you were still here",
 "said i'm so sick of love songs, so sad and slow",
 "so why can't i turn off the radio?",
 'see ne-yo liveget tickets as low as $38you might also like',
 'gotta fix that calendar i have',
 "that's marked july 15th",
 "because since there's no more you",
 "there's no more anniversary",
 "i'm so fed up with my thoughts of you",
 'and your 
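The splitting and flattening steps above can be sketched on toy data as follows. The lyrics here are shortened, hypothetical strings, and dropping empty lines during the split is an assumption of this sketch; the notebook instead replaces `"::"` with `":"` before splitting.

```python
from itertools import chain

# Hypothetical, shortened lyrics using ":" as the line separator, as in the printed sample.
toy_lyrics = [
    "mmm, mmm, yeah:ooh, yeah::gotta change my answering machine",
    "i'm so over being blue:cryin' over you:",
]

def split_lines(lyric):
    # Splitting on ":" turns "::" into an empty string, which is filtered out here.
    return [line.strip() for line in lyric.split(":") if line.strip()]

# chain.from_iterable flattens the per-song lists into one list of lines.
toy_verses = list(chain.from_iterable(split_lines(lyric) for lyric in toy_lyrics))
print(toy_verses)
```

Filtering empty strings also covers runs of three or more colons and a trailing colon, which a single `"::"` replacement would not fully remove.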

In [14]:
def create_couplets(verses: list):
    couplets = []
    for i in range(1,len(verses)):
        couplet = verses[i-1] + "\n" + verses[i]
        couplets.append(couplet)
        
    return couplets

        
couplets = create_couplets(verses)
#display(couplets)

couplets_df = pd.DataFrame(couplets, columns=["couplets_g"])
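Because `verses` is a single flat list, `create_couplets` also pairs the last line of one song with the first line of the next. If that is not wanted, one possible variant pairs lines within each song only. This is a sketch under that assumption; `couplets_per_song` and the toy data are hypothetical names, not part of the notebook.

```python
def couplets_per_song(songs):
    """Pair each line with the next one, without crossing song boundaries."""
    couplets = []
    for lines in songs:
        # zip(lines, lines[1:]) yields (line i, line i+1) for every adjacent pair.
        for first, second in zip(lines, lines[1:]):
            couplets.append(first + "\n" + second)
    return couplets

# Each inner list stands for the lines of one song (illustrative data).
toy_songs = [
    ["i'm so over being blue", "cryin' over you"],
    ["blew up, had to change my norm",
     "when i go back to my hood the police storm",
     "ooh, yeah"],
]
toy_couplets = couplets_per_song(toy_songs)
print(len(toy_couplets))
```

To use it in place of the flat version, the per-song lists from `lyric.split(":")` would be passed in before they are flattened.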

In [13]:
#! /Users/austinpaxton/anaconda3/envs/lyric_generation_capstone/bin/pip install phyme

In [14]:
#! /Users/austinpaxton/anaconda3/envs/lyric_generation_capstone/bin/python3.10 -m pip install phyme

https://github.com/jameswenzel/Phyme

In [15]:
import re

def get_last_words(couplet):
    last_words = []
    lines = couplet.split("\n")
    for line in lines:
        line_words = line.split(" ")
        last_word = re.sub(r"[^a-zA-Z]+", "",line_words[-1]) #remove everything that is not a letter from last word
        last_words.append(last_word)
    return last_words

        
couplets_df["last_words"] = couplets_df["couplets_g"].apply(get_last_words)
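As a quick check of what `get_last_words` returns, the sketch below repeats the same logic in a self-contained form and runs it on two couplets. The second couplet is assembled by hand to show how punctuation and parentheses are stripped.

```python
import re

def get_last_words(couplet):
    """Return the last word of each line, stripped of non-letter characters."""
    last_words = []
    for line in couplet.split("\n"):
        last_token = line.split(" ")[-1]
        last_words.append(re.sub(r"[^a-zA-Z]+", "", last_token))
    return last_words

print(get_last_words("i'm so over being blue\ncryin' over you"))
print(get_last_words("so why can't i turn off the radio?\nstill here (oh)"))
```

One edge case worth knowing: a line that ends in a space produces an empty last token, so its "last word" is the empty string.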

In [16]:
import itertools
import re

from Phyme import Phyme

ph = Phyme()


def check_perfect_rhyme(words: list, ph: Phyme):
    """Return True/False for a perfect rhyme, or a "... NOT FOUND" string if Phyme does not know the first word."""
    if len(words[0]) == 1:
        return False
    try:
        # get rhymes for first word in list and reformat output dictionary into a list
        rhymes = ph.get_perfect_rhymes(words[0]).values()
        rhymes = list(itertools.chain(*rhymes))
        pattern = r"\(\d\)"  # strip variant markers such as "(2)"
        rhymes = [re.sub(pattern, "", rhyme) for rhyme in rhymes]
        return words[1] in rhymes
    except KeyError as ke:
        return f"{ke} NOT FOUND"

couplets_df["rhyme"] = couplets_df.apply(lambda row: check_perfect_rhyme(row["last_words"], ph), axis=1)

In [None]:
display(couplets_df)
couplets_df.groupby("rhyme").count()
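For readers unfamiliar with what a perfect rhyme check involves, here is a small illustration that does not use Phyme at all. It assumes the common definition that two words rhyme perfectly when their phonemes match from the last primary-stressed vowel onward. The pronunciation table is hand-entered ARPAbet for five words and is purely illustrative; this is not how Phyme is implemented.

```python
# Tiny hand-entered pronunciation table in ARPAbet with stress digits (illustrative only).
TOY_PRONUNCIATIONS = {
    "door": ["D", "AO1", "R"],
    "anymore": ["EH2", "N", "IY0", "M", "AO1", "R"],
    "blue": ["B", "L", "UW1"],
    "you": ["Y", "UW1"],
    "phone": ["F", "OW1", "N"],
}

def rhyme_part(phones):
    """Return the phonemes from the last primary-stressed vowel to the end."""
    for i in range(len(phones) - 1, -1, -1):
        if phones[i].endswith("1"):
            return phones[i:]
    return phones  # no primary stress marked: fall back to the whole word

def toy_perfect_rhyme(word_a, word_b):
    """True if both words are in the toy table and share the same rhyme part."""
    if word_a not in TOY_PRONUNCIATIONS or word_b not in TOY_PRONUNCIATIONS:
        return False
    return rhyme_part(TOY_PRONUNCIATIONS[word_a]) == rhyme_part(TOY_PRONUNCIATIONS[word_b])

print(toy_perfect_rhyme("door", "anymore"))
print(toy_perfect_rhyme("blue", "you"))
print(toy_perfect_rhyme("door", "phone"))
```

Unlike `check_perfect_rhyme`, this toy version returns `False` for unknown words instead of a "NOT FOUND" string, which keeps the result column strictly boolean.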

In [20]:
rhyme_couplets_df = couplets_df[couplets_df["rhyme"] ==True]
display(rhyme_couplets_df)

Unnamed: 0,couplets_g,last_words,rhyme
8,'cause you walked out the door\nbut it's the o...,"[door, anymore]",True
15,i'm so over being blue\ncryin' over you,"[blue, you]",True
41,so done with wishin' she were still here (oh)\...,"[oh, oh]",True
53,so why can't i turn off the radio?\nwhy can't ...,"[radio, radio]",True
54,why can't i turn off the radio?\nwhy can't i t...,"[radio, radio]",True
...,...,...,...
1339445,"blew up, had to change my norm\nwhen i go back...","[norm, storm]",True
1339447,"when you are gone, if you get right 'til you d...","[leave, leave]",True
1339450,"before rap, be a black man outside livin' cozy...","[cozy, rosie]",True
1339456,"my inner-self a stepper, i'm from ryer, i know...","[pain, wayne]",True


In [28]:
# # create columns for line 1 and line 2 of couplet
# rhyme_couplets_df["line1"], rhyme_couplets_df["line2"] = rhyme_couplets_df.apply(lambda x: x["couplets_g"].split("\n"),axis=1)
# display(rhyme_couplets_df)
                                

ValueError: too many values to unpack (expected 2)

In [30]:
# # create columns for line 1 and line 2 of couplet

# def split_couplet(couplet):
#     lines = couplet.split("\n")
#     if len(lines)==2:
#         return lines[0],lines[1]
#     else:
#         return None, None
    

# rhyme_couplets_df[["line1_g","line2_g"]] = rhyme_couplets_df["couplets_g"].apply(split_couplet)
# display(rhyme_couplets_df)

ValueError: Columns must be same length as key
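The two commented-out attempts above appear to fail because `apply` returns a single Series of lists or tuples rather than two columns. One alternative, sketched here on a toy frame, is pandas' `Series.str.split` with `expand=True`, which returns one column per piece. The frame `toy_df` is a hypothetical stand-in for `rhyme_couplets_df`.

```python
import pandas as pd

# Toy stand-in for rhyme_couplets_df; the two rows are copied from the output above.
toy_df = pd.DataFrame({
    "couplets_g": [
        "i'm so over being blue\ncryin' over you",
        "blew up, had to change my norm\nwhen i go back to my hood the police storm",
    ]
})

# n=1 splits on the first newline only, so the result has at most two columns (0 and 1).
split_lines = toy_df["couplets_g"].str.split("\n", n=1, expand=True)
toy_df["line1_g"] = split_lines[0]
toy_df["line2_g"] = split_lines[1]
print(toy_df[["line1_g", "line2_g"]])
```

This avoids the intermediate `couplet_split` column, at the cost of silently keeping any extra newline inside the second line.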

In [23]:
def split_couplet(couplet):
    lines = couplet.split("\n")
    return lines
    

# work on an explicit copy so the column assignments below do not raise SettingWithCopyWarning
rhyme_couplets_df = rhyme_couplets_df.copy()
rhyme_couplets_df["couplet_split"] = rhyme_couplets_df["couplets_g"].apply(split_couplet)
rhyme_couplets_df["line1_g"] = rhyme_couplets_df["couplet_split"].apply(lambda x: x[0])
rhyme_couplets_df["line2_g"] = rhyme_couplets_df["couplet_split"].apply(lambda x: x[1])
display(rhyme_couplets_df)

Unnamed: 0,couplets_g,last_words,rhyme,couplet_split,line1_g,line2_g
8,'cause you walked out the door\nbut it's the o...,"[door, anymore]",True,"['cause you walked out the door, but it's the ...",'cause you walked out the door,but it's the only way i hear your voice anymore
15,i'm so over being blue\ncryin' over you,"[blue, you]",True,"[i'm so over being blue, cryin' over you]",i'm so over being blue,cryin' over you
41,so done with wishin' she were still here (oh)\...,"[oh, oh]",True,[so done with wishin' she were still here (oh)...,so done with wishin' she were still here (oh),"said i'm so sick of love songs, so sad and slo..."
53,so why can't i turn off the radio?\nwhy can't ...,"[radio, radio]",True,"[so why can't i turn off the radio?, why can't...",so why can't i turn off the radio?,why can't i turn off the radio?
54,why can't i turn off the radio?\nwhy can't i t...,"[radio, radio]",True,"[why can't i turn off the radio?, why can't i ...",why can't i turn off the radio?,why can't i turn off the radio?
...,...,...,...,...,...,...
1339445,"blew up, had to change my norm\nwhen i go back...","[norm, storm]",True,"[blew up, had to change my norm, when i go bac...","blew up, had to change my norm",when i go back to my hood the police storm
1339447,"when you are gone, if you get right 'til you d...","[leave, leave]",True,"[when you are gone, if you get right 'til you ...","when you are gone, if you get right 'til you d...",my momma still gon' play the trenches like she...
1339450,"before rap, be a black man outside livin' cozy...","[cozy, rosie]",True,"[before rap, be a black man outside livin' coz...","before rap, be a black man outside livin' cozy","deadys out the oz, my auntie named rosie"
1339456,"my inner-self a stepper, i'm from ryer, i know...","[pain, wayne]",True,"[my inner-self a stepper, i'm from ryer, i kno...","my inner-self a stepper, i'm from ryer, i know...","double-cup of red, i feel like i'm higher than..."


In [25]:
#remove boring couplets where line 1 is same as line 2
rhyme_couplets_df = rhyme_couplets_df[rhyme_couplets_df["line1_g"]!= rhyme_couplets_df["line2_g"]]
display(rhyme_couplets_df)

Unnamed: 0,couplets_g,last_words,rhyme,couplet_split,line1_g,line2_g
8,'cause you walked out the door\nbut it's the o...,"[door, anymore]",True,"['cause you walked out the door, but it's the ...",'cause you walked out the door,but it's the only way i hear your voice anymore
15,i'm so over being blue\ncryin' over you,"[blue, you]",True,"[i'm so over being blue, cryin' over you]",i'm so over being blue,cryin' over you
41,so done with wishin' she were still here (oh)\...,"[oh, oh]",True,[so done with wishin' she were still here (oh)...,so done with wishin' she were still here (oh),"said i'm so sick of love songs, so sad and slo..."
53,so why can't i turn off the radio?\nwhy can't ...,"[radio, radio]",True,"[so why can't i turn off the radio?, why can't...",so why can't i turn off the radio?,why can't i turn off the radio?
57,"you, yeah, you, you\nyou, yeah, you, you, you","[you, you]",True,"[you, yeah, you, you, you, yeah, you, you, you]","you, yeah, you, you","you, yeah, you, you, you"
...,...,...,...,...,...,...
1339445,"blew up, had to change my norm\nwhen i go back...","[norm, storm]",True,"[blew up, had to change my norm, when i go bac...","blew up, had to change my norm",when i go back to my hood the police storm
1339447,"when you are gone, if you get right 'til you d...","[leave, leave]",True,"[when you are gone, if you get right 'til you ...","when you are gone, if you get right 'til you d...",my momma still gon' play the trenches like she...
1339450,"before rap, be a black man outside livin' cozy...","[cozy, rosie]",True,"[before rap, be a black man outside livin' coz...","before rap, be a black man outside livin' cozy","deadys out the oz, my auntie named rosie"
1339456,"my inner-self a stepper, i'm from ryer, i know...","[pain, wayne]",True,"[my inner-self a stepper, i'm from ryer, i kno...","my inner-self a stepper, i'm from ryer, i know...","double-cup of red, i feel like i'm higher than..."


In [27]:
# write to csv so it can be phonemized
rhyme_couplets_df.to_csv("data/rhyme_couplets.csv",index=False)

## Reload Phonemized CSV and Assemble Tasks for Training

In [19]:

import pandas as pd
couplets_gp_df = pd.read_csv("data/rhyme_couplets_f-phonemized_07-30-23.csv")

display(couplets_gp_df)

Unnamed: 0,couplets_g,last_words,rhyme,couplet_split,line1_g,line2_g,line1_p,line2_p
0,'cause you walked out the door\nbut it's the o...,"['door', 'anymore']",True,"[""'cause you walked out the door"", ""but it's t...",'cause you walked out the door,but it's the only way i hear your voice anymore,k-aa-z y-uw w-ao-k-t aw-t dh-ax d-ao-r,b-ah-t ih-t-s dh-ax ow-n|l-iy w-ey ay hh-ih-r ...
1,i'm so over being blue\ncryin' over you,"['blue', 'you']",True,"[""i'm so over being blue"", ""cryin' over you""]",i'm so over being blue,cryin' over you,ay-m s-ow ow|v-er b-iy|ax-ng b-l-uw,k-r-ih|ax-n ow|v-er y-uw
2,so done with wishin' she were still here (oh)\...,"['oh', 'oh']",True,"[""so done with wishin' she were still here (oh...",so done with wishin' she were still here (oh),"said i'm so sick of love songs, so sad and slo...",s-ow d-ah-n w-ih-dh w-ih|sh-ax-n sh-iy w-er s-...,s-eh-d ay-m s-ow s-ih-k ah-v l-ah-v s-ao-ng-z ...
3,so why can't i turn off the radio?\nwhy can't ...,"['radio', 'radio']",True,"[""so why can't i turn off the radio?"", ""why ca...",so why can't i turn off the radio?,why can't i turn off the radio?,s-ow w-ay k-ae-n-t ay t-er-n ao-f dh-ax r-ey|d...,w-ay k-ae-n-t ay t-er-n ao-f dh-ax r-ey|d-iy|ow
4,"you, yeah, you, you\nyou, yeah, you, you, you","['you', 'you']",True,"['you, yeah, you, you', 'you, yeah, you, you, ...","you, yeah, you, you","you, yeah, you, you, you",y-uw y-ae y-uw y-uw,y-uw y-ae y-uw y-uw y-uw
...,...,...,...,...,...,...,...,...
190725,"blew up, had to change my norm\nwhen i go back...","['norm', 'storm']",True,"['blew up, had to change my norm', 'when i go ...","blew up, had to change my norm",when i go back to my hood the police storm,b-l-uw ah-p hh-ae-d t-ax ch-ey-n-jh m-ay n-ao-r-m,w-eh-n ay g-ow b-ae-k t-ax m-ay hh-uh-d dh-ax ...
190726,"when you are gone, if you get right 'til you d...","['leave', 'leave']",True,"[""when you are gone, if you get right 'til you...","when you are gone, if you get right 'til you d...",my momma still gon' play the trenches like she...,w-eh-n y-uw aa-r g-ao-n ih-f y-uw g-eh-t r-ay-...,m-ay m-aa|m-ax s-t-ih-l g-aa-n p-l-ey dh-ax t-...
190727,"before rap, be a black man outside livin' cozy...","['cozy', 'rosie']",True,"[""before rap, be a black man outside livin' co...","before rap, be a black man outside livin' cozy","deadys out the oz, my auntie named rosie",b-iy|f-ao-r r-ae-p b-iy ax b-l-ae-k m-ae-n aw-...,d-eh|d-ax-s aw-t dh-ax aa-z m-ay ae-n|t-iy n-e...
190728,"my inner-self a stepper, i'm from ryer, i know...","['pain', 'wayne']",True,"[""my inner-self a stepper, i'm from ryer, i kn...","my inner-self a stepper, i'm from ryer, i know...","double-cup of red, i feel like i'm higher than...",m-ay ih|n-er s-eh-l-f ax s-t-eh|p-er ay-m f-r-...,d-ah|b-ax-l k-ah-p ah-v r-eh-d ay f-iy-l l-ay-...


In [24]:
#check for lines that failed to phonemize
# missing values load from CSV as NaN, which `== None` never matches, so use isna()
couplets_gp_df[couplets_gp_df["line1_p"].isna() | couplets_gp_df["line2_p"].isna()]

Unnamed: 0,couplets_g,last_words,rhyme,couplet_split,line1_g,line2_g,line1_p,line2_p


In [26]:
import random

# create tasks for multi-task learning
# ~ line 1 grapheme =1G->2G= line 2 grapheme ~
# ~ line 1 phoneme =1P->2P= line 2 phoneme ~
# [ line 1 grapheme =1G->1P= line 1 phoneme ]
# [ line 1 phoneme =1P->1G= line 1 grapheme ]
# [ line 2 grapheme =2G->2P= line 2 phoneme ]
# [ line 2 phoneme =2P->2G= line 2 grapheme ]


line1_g = couplets_gp_df["line1_g"].to_list()
line2_g = couplets_gp_df["line2_g"].to_list()
line1_p = couplets_gp_df["line1_p"].to_list()
line2_p = couplets_gp_df["line2_p"].to_list()

tasks = []

for g1, g2, p1, p2 in zip(line1_g, line2_g, line1_p, line2_p):
    tasks.append(f"~ {g1} =1G->2G= {g2} ~")
    tasks.append(f"~ {p1} =1P->2P= {p2} ~")
    tasks.append(f"[ {g1} =1G->1P= {p1} ]")
    tasks.append(f"[ {p1} =1P->1G= {g1} ]")
    tasks.append(f"[ {g2} =2G->2P= {p2} ]")
    tasks.append(f"[ {p2} =2P->2G= {g2} ]")
    
random.shuffle(tasks)

display(len(tasks))
train_test_split = round(0.99*len(tasks))

couplets_train = tasks[:train_test_split]
couplets_test = tasks[train_test_split:]

with open("data/train_couplets.txt", "w") as f:
    for task in couplets_train:
        f.write(task+"\n")
        
with open("data/test_couplets.txt", "w") as f:
    for task in couplets_test:
        f.write(task+"\n")
    

1144380
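The cell above shuffles individual task strings before splitting, so the six tasks derived from one couplet can end up on both sides of the train/test boundary. If that overlap is a concern, one option is to split by couplet first and only then expand each side into tasks. The sketch below assumes that design; `build_tasks`, `split_by_couplet`, and the fixed seed are hypothetical additions, while the task format is the one used in the notebook.

```python
import random

def build_tasks(g1, g2, p1, p2):
    """Return the six task strings for one couplet, in the notebook's format."""
    return [
        f"~ {g1} =1G->2G= {g2} ~",
        f"~ {p1} =1P->2P= {p2} ~",
        f"[ {g1} =1G->1P= {p1} ]",
        f"[ {p1} =1P->1G= {g1} ]",
        f"[ {g2} =2G->2P= {p2} ]",
        f"[ {p2} =2P->2G= {g2} ]",
    ]

def split_by_couplet(rows, train_fraction=0.99, seed=0):
    """Shuffle couplets, split them, then expand each side into tasks."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rows = list(rows)
    rng.shuffle(rows)
    cut = round(train_fraction * len(rows))
    train = [task for row in rows[:cut] for task in build_tasks(*row)]
    test = [task for row in rows[cut:] for task in build_tasks(*row)]
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Two couplets copied from the phonemized table shown earlier: (line1_g, line2_g, line1_p, line2_p).
toy_rows = [
    ("i'm so over being blue", "cryin' over you",
     "ay-m s-ow ow|v-er b-iy|ax-ng b-l-uw", "k-r-ih|ax-n ow|v-er y-uw"),
    ("you, yeah, you, you", "you, yeah, you, you, you",
     "y-uw y-ae y-uw y-uw", "y-uw y-ae y-uw y-uw y-uw"),
]
train_tasks, test_tasks = split_by_couplet(toy_rows, train_fraction=0.5)
print(len(train_tasks), len(test_tasks))
```

With the real data, the rows could be produced by zipping the four column lists (`line1_g`, `line2_g`, `line1_p`, `line2_p`) together.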