# 3. Data Pre-Processing

In this notebook, the data is processed so that our models can be run on it. Processing steps already carried out in the previous notebooks include:
*  The language of each song was determined using <mark>pycld2</mark> and non-English songs were filtered out.
*  Songs without any lyrics were filtered out.

The end result of this notebook is a **lyrics.txt** file containing the lyrics of all remaining songs pasted together, which is then used for modelling.


In [4]:
# imports
import pandas as pd
import numpy as np
import re
from datetime import date
from pprint import pprint  # used below to inspect individual lyrics

In [6]:
songs = pd.read_csv("./data/songs_features.csv")
songs.head()

Unnamed: 0,title,artist,url,lyrics,lang,lang_acc,sp_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo
0,Easy On Me,['Adele'],https://genius.com/Adele-easy-on-me-lyrics,\r\nThere ain't no gold in this river\r\nThat ...,en,99,0gplL1WMoJ6iYaPgMCL0gX,0.604,0.366,5.0,-7.519,1.0,0.0282,0.578,0.0,0.133,0.13,141.981
1,Stay,"['The Kid LAROI', ' Justin Bieber']",https://genius.com/The-kid-laroi-and-justin-bi...,\r\nI do the same thing I told you that I neve...,en,99,5HCyWlXZPP0y6Gqq8TgA20,0.591,0.764,1.0,-5.484,1.0,0.0483,0.0383,0.0,0.103,0.478,169.928
2,Industry Baby,"['Lil Nas X', ' Jack Harlow']",https://genius.com/Lil-nas-x-and-jack-harlow-i...,"\r\n(D-D-Daytrip took it to ten, hey)\r\nBaby ...",en,99,27NovPIUIRrOZoCHxABJwK,0.736,0.704,3.0,-7.409,0.0,0.0615,0.0203,0.0,0.0501,0.894,149.995
3,Fancy Like,['Walker Hayes'],https://genius.com/Walker-hayes-fancy-like-lyrics,"\r\nAyy\r\nMy girl is bangin', she's so low ma...",en,99,58UKC45GPNTflCN6nwCUeF,0.647,0.765,1.0,-6.459,1.0,0.06,0.111,0.0,0.315,0.855,79.994
4,Bad Habits,['Ed Sheeran'],https://genius.com/Ed-sheeran-bad-habits-lyrics,"\r\n(One, two, three, four)\r\nOoh, ooh\r\n\r\...",en,99,3rmo8F54jFF8OgYsqTxm5d,0.807,0.893,11.0,-3.745,0.0,0.0347,0.0451,2.8e-05,0.366,0.537,126.011


In [13]:
songs.tail()

Unnamed: 0,title,artist,url,lyrics,lang,sp_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,lang_acc
15880,I Still Have Dreams,['Richie Furay'],https://genius.com/Richie-furay-i-still-have-d...,You can find me\nWhen you need me\nI'll be aro...,en,6KtR2UZMKlUhJd5gUlVJJ3,0.538,0.329,11.0,-8.483,1.0,0.0299,0.726,0.0,0.118,0.314,129.814,99.0
15881,Love Pains,['Yvonne Elliman'],https://genius.com/Yvonne-elliman-love-pains-l...,"Midnight, I watch you as you're sleeping\nYou ...",en,2ewHI38ZzJ3zld4jIJLp4p,0.642,0.56,9.0,-14.564,1.0,0.0311,0.0461,5e-06,0.0625,0.756,117.612,99.0
15882,Since You Been Gone,['Rainbow'],https://genius.com/Rainbow-since-you-been-gone...,"\nI get the same old dreams, same time every n...",en,6xq5DxZWGgdStAxGAil0yw,0.733,0.726,7.0,-6.514,1.0,0.0371,0.379,3e-06,0.471,0.915,120.814,99.0
15883,You Decorated My Life,['Kenny Rogers'],https://genius.com/Kenny-rogers-you-decorated-...,"All my life was a paper once plain, pure and w...",en,6SFeCisbYSqaQwL9hucoF2,0.264,0.307,7.0,-12.36,1.0,0.0341,0.662,0.0,0.101,0.153,172.546,99.0
15884,Message In A Bottle,['The Police'],https://genius.com/The-police-message-in-a-bot...,"\n\nJust a castaway, an island lost at sea, oh...",en,1oYYd2gnWZYrt89EBXdFiO,0.577,0.808,1.0,-7.04,0.0,0.039,0.0338,1.3e-05,0.221,0.869,151.008,99.0


In [14]:
len(songs)

15885

In [17]:
pprint(songs.lyrics[0])

('\n'
 "There ain't no gold in this river\n"
 "That I've been washin' my hands in forever\n"
 'I know there is hope in these waters\n'
 "But I can't bring myself to swim\n"
 'When I am drowning in this silence\n'
 'Baby, let me in\n'
 '\n'
 'Go easy on me, baby\n'
 'I was still a child\n'
 "Didn't get the chance to\n"
 'Feel the world around me\n'
 'I had no time to choose what I chose to do\n'
 'So go easy on me\n'
 '\n'
 "There ain't no room for things to change\n"
 'When we are both so deeply stuck in our ways\n'
 "You can't deny how hard I have tried\n"
 'I changed who I was to put you both first\n'
 'But now I give up\n'
 '\n'
 'Go easy on mе, baby\n'
 'I was still a child\n'
 "Didn't get the chance to\n"
 'Feel thе world around me\n'
 'Had no time to choose what I chose to do\n'
 'So go easy on me\n'
 'I had good intentions\n'
 'And the highest hopes\n'
 'But I know right now\n'
 "It probably doesn't even show\n"
 '\n'
 'Go easy on me, baby\n'
 'I was still a child\n'
 "I didn't 

## Look at unwanted patterns and characters appearing in lyrics

In [18]:
# Remove the pattern "123Embed" that appears at the end of the lyrics, with 123 being an arbitrary number
# (the pattern is not anchored, so it is removed wherever it occurs)
songs["lyrics"] = songs["lyrics"].apply(lambda x: re.sub(r"[0-9]*Embed", "", x))
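As a quick check of what this substitution does, the sketch below applies the same pattern to two made-up strings (the sample texts are assumptions, not taken from the dataset). It also compares it with an end-anchored variant, which is a hypothetical alternative and not what this notebook uses.

```python
import re

# Hypothetical samples; the second one contains "Embed" inside a normal word.
sample_a = "So go easy on me\n123Embed"
sample_b = "Embedded in my mind\n5Embed"

def strip_embed(text: str) -> str:
    """Pattern from the notebook: removes the marker anywhere in the text."""
    return re.sub(r"[0-9]*Embed", "", text)

def strip_embed_anchored(text: str) -> str:
    """Assumed alternative: only removes the marker at the very end."""
    return re.sub(r"[0-9]*Embed\s*$", "", text)

for sample in (sample_a, sample_b):
    print(repr(strip_embed(sample)), repr(strip_embed_anchored(sample)))
```

Both versions give the same result for the first sample. For the second, the unanchored pattern also cuts into the word "Embedded", which is the trade-off to keep in mind.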

In [19]:
# Sometimes a song was not found and an obscure Genius page containing a tracklist was scraped instead,
# so the tracklist ends up in the lyrics (see e.g. songs.lyrics[42]). Some of these are filtered out here using regex, others are caught later.
pprint(songs.lyrics[42])

('1. Drake - Knife Talk (with 21 Savage & Project Pat)\n'
 '2. Baby Keem - family\u2005ties\u2005(with Kendrick Lamar)\n'
 '3.\u2005Key Glock - Ambition For Cash\n'
 '4.\u2005Young Thug - Bubbly (with Drake & Travis Scott)\n'
 '5. Meek Mill - Sharing Locations (feat. Lil Baby & Lil Durk)\n'
 '6. Gunna & Future - Too Easy\n'
 '7. NLE Choppa - Jumpin (feat. Polo G)\n'
 '8. Meek Mill - Hot (feat. Moneybagg Yo)\n'
 '9. DaBaby - ROOF\n'
 '10. Juice WRLD - Already Dead\n'
 '11. Money Man - LLC (feat. Moneybagg Yo)\n'
 '12. Travis Scott - MAFIA\n'
 '13. Joyner Lucas & J. Cole - Your Heart\n'
 '14. Culture Jam, Gunna & Polo G - Waves\n'
 '15. Travis Scott - ESCAPE PLAN\n'
 '16. Drake - Fair Trade (with Travis Scott)\n'
 '17. Baby Keem - lost souls (with Brent Faiyaz)\n'
 '18. Polo G - Bad Man (Smooth Criminal)\n'
 '19. Lil Tjay - Not In The Mood (feat. Fivio Foreign & Kay Flock)\n'
 '20. Playboi Carti - Sky\n'
 '21. Cordae - Super\n'
 '22. Drake - Way 2 Sexy (with Future & Young Thug)\n'
 '23.

In [20]:
# search pattern: "1." followed by whitespace indicates a numbered tracklist
is_numbered_list = songs["lyrics"].apply(lambda x: re.search(r"1\.\s", x) is not None)
print(is_numbered_list.sum())

# filter out
songs = songs[~is_numbered_list]

114


In [21]:
# Some "lyrics" contain tracklists with a pattern like "ArtistA - Song1 \n ArtistB - Song2" and so on.
# There may be better regex patterns for finding them, but requiring the pattern on at least 10
# consecutive lines works well enough. It also matches songs that naturally contain many " - " in
# the lyrics, but these are unusual songs anyway.
is_tracklist = songs["lyrics"].apply(lambda x: re.search(r"(.+ - .+\n){10,}", x) is not None)
print(is_tracklist.sum())

# filter out
songs = songs[~is_tracklist]

108
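The two tracklist filters above can be sketched on toy data as follows. The example strings and the helper name `looks_like_tracklist` are assumptions made for illustration. Plain lists are used instead of a DataFrame so the snippet runs without pandas.

```python
import re

# Toy data (assumption): an ordinary lyric, a numbered tracklist,
# and an "Artist - Title" tracklist with ten consecutive entries.
ordinary = "Go easy on me, baby\nI was still a child\n"
numbered = "1. Drake - Knife Talk\n2. Baby Keem - family ties\n"
dashed = "".join(f"Artist{i} - Song{i}\n" for i in range(10))

def looks_like_tracklist(text: str) -> bool:
    """Return True if either tracklist pattern from the notebook matches."""
    has_numbering = re.search(r"1\.\s", text) is not None
    has_dash_rows = re.search(r"(.+ - .+\n){10,}", text) is not None
    return has_numbering or has_dash_rows

flags = [looks_like_tracklist(t) for t in (ordinary, numbered, dashed)]
print(flags)
```

Note that `1\.\s` is not anchored to the start of a line, so it would also match something like "21. " in the middle of a text.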


In [22]:
# substitute special characters -> use the regex pattern in the cell below to search for songs that contain unusual characters
songs["lyrics"] = songs["lyrics"].apply(lambda x: re.sub(u"\u2005|\u205f|\u202f|\u200a|\ufeff|\u202a|\u202c", " ", x)) # special space character, substitute with space
songs["lyrics"] = songs["lyrics"].apply(lambda x: re.sub("…", "...", x)) # replace special dots with three dots
songs["lyrics"] = songs["lyrics"].apply(lambda x: re.sub(u"\u2060", "-", x)) # replace "word-joiner" with -
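A minimal sketch of the same three substitutions applied to a single string, assuming a made-up sample. The function name `normalize` is hypothetical. It uses one character class for the special characters and `str.replace` for the two single-character cases.

```python
import re

# Same code points that the notebook replaces with a regular space.
SPECIAL_SPACES = re.compile("[\u2005\u205f\u202f\u200a\ufeff\u202a\u202c]")

def normalize(text: str) -> str:
    text = SPECIAL_SPACES.sub(" ", text)     # special spaces -> plain space
    text = text.replace("\u2026", "...")     # ellipsis character -> three dots
    text = text.replace("\u2060", "-")       # word joiner -> hyphen
    return text

sample = "family\u2005ties\u2026 word\u2060joiner"
print(normalize(sample))
```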

In [23]:
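# filter out songs with weird characters
# note: inside the character class, "\(-—" is parsed as a range from "(" to "—",
# so many more characters are allowed than the ones listed explicitly
# songs["lyrics"][~songs["lyrics"].apply(lambda x: re.search(r"[^a-zA-Z0-9 ,‚'’‘′\n\)\(-—\?!\"“”&:ñé$%]", x)).isnull()]
songs = songs[songs["lyrics"].apply(lambda x: re.search(r"[^a-zA-Z0-9 ,‚'’‘′\n\)\(-—\?!\"“”&:ñé$%]", x)).isnull()]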
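To see which characters this filter actually flags, the sketch below tests a few single characters against the pattern copied from the cell above. The probe characters are chosen for illustration only.

```python
import re

# Character class copied from the notebook cell.
weird = re.compile(r"[^a-zA-Z0-9 ,‚'’‘′\n\)\(-—\?!\"“”&:ñé$%]")

# "." and ";" are not listed explicitly but lie between "(" (U+0028) and
# "—" (U+2014); "\u0435" is the Cyrillic letter that looks like a Latin "e".
probes = [".", ";", "\u0435", "#", "\t", "\u20ac"]

results = {ch: weird.search(ch) is not None for ch in probes}
for ch, flagged in results.items():
    print(repr(ch), "flagged" if flagged else "allowed")
```

Under this pattern, periods, semicolons and even Cyrillic look-alike letters pass, while characters such as "#", tabs or "€" cause a song to be dropped.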

In [34]:
len(songs)

15496

In [32]:
songs = songs.reset_index(drop = True)

In [None]:
#pprint(songs.loc[8528, "lyrics"])
#set(songs.loc[8528, "lyrics"])

Look into random lyrics to see if there are some common unwanted patterns that should be handled.<br>
**Note:** The newline character at the beginning of most of the songs is handled in the next notebook

In [40]:
# look at random lyrics
# note: if this throws an error, just run it again
rndm = np.random.randint(0, len(songs))
print(rndm, songs.title[rndm], "by", songs.artist[rndm])

12162 My Obsession by ['Icehouse']


In [41]:
pprint(songs.lyrics[rndm])

('\n'
 'I used to be the one who made you feel\n'
 'So safe and strong\n'
 'I could always make it right\n'
 'When everything was going wrong\n'
 "I don't know why it seems\n"
 "So different now I'm on my own\n"
 "And I don't know what it is\n"
 "That scares me when I'm all alone\n"
 '\n'
 "I can't believe that everyone I know\n"
 'Would lie to me\n'
 "When they all tell me that I'm not the man\n"
 'I used to be\n'
 "Don't want to hear about the things\n"
 'That I already know\n'
 "You've got to say it isn't so\n"
 "Oh no, it's the...\n"
 'The ghost of you that gets me every time (My obsession)\n'
 "Just won't let go until it brings me down (My obsession)\n"
 "I try to hide it but there's only one (My obsession)\n"
 'And my obsession is you\n'
 'The little things you used to do and say (My obsession)\n'
 'Break every moment of my night and day (My obsession)\n'
 "Believe me darling, oh you know it's true (My obsession)\n"
 'My obsession is you\n'
 '\n'
 'Yeah, my friends all turn away\

In [42]:
# prepend a start tag to each song's lyrics
songs["lyrics"] = songs["lyrics"].apply(lambda x: "<START> " + x)

## Output lyrics.txt file

In [43]:
# export lyrics without any separation between them (quotechar is set to a space so that no quote marks are written)
songs["lyrics"].to_csv("./data/lyrics.txt",
                    header = False, index = False, quotechar = " ")
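As an alternative sketch (not the method used in this notebook), the lyrics could also be written with plain file I/O, which avoids the CSV quoting rules altogether. Python's csv writer doubles quote characters inside quoted fields by default, so using a space as `quotechar` may change the spacing within the lyrics. The two sample lyrics and the temporary output path below are assumptions. With the real data, the list would come from `songs["lyrics"]`.

```python
from pathlib import Path
import tempfile

# Stand-in for songs["lyrics"].tolist() (hypothetical sample data).
lyrics = [
    "<START> \nFirst song, line one\n",
    "<START> \nSecond song, line one\n",
]

# Written to a temporary directory here instead of ./data/lyrics.txt.
out_path = Path(tempfile.mkdtemp()) / "lyrics.txt"
out_path.write_text("\n".join(lyrics), encoding="utf-8")

text = out_path.read_text(encoding="utf-8")
print(text.count("<START>"))
```

The output of this sketch is not byte-identical to what `to_csv` produces, so downstream notebooks would need to be checked before switching.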