### In this notebook we clean the text of the generated lyrics, which will then be used to learn word representations with a Word2Vec CBOW model

In [1]:
import pandas as pd
import re
import gensim

Import data with the language information

In [2]:
tracksLyricFeatures_complete = pd.read_csv('./tracksLyricFeatures/tracksLyricFeaturesLang.csv', encoding='utf8')
tracksLyricFeatures_complete.head()

Unnamed: 0,id,track,trackArtist,genre,lyrics,top_lang_identified,top_lang_identified_prob
0,2,Food,AWOL,Hip-Hop,"The Waking Blind LyricsThe waking blind, embra...",__label__en,0.932648
1,3,Electric Ave,AWOL,Hip-Hop,Oversikt over norske artister Lyrics*&aute&rom...,__label__en,0.253449
2,5,This World,AWOL,Hip-Hop,"E.T. Lyrics[Verse One: Mickey Factz]\nSee, I s...",__label__en,0.949876
3,10,Freeway,Kurt Vile,Pop,"Freeway LyricsI got a freeway in mind, let go ...",__label__en,0.960411
4,134,Street Music,AWOL,Hip-Hop,"Goin’ Down Lyrics[Intro: DJ Drama, Fabolous & ...",__label__en,0.948774


In [3]:
tracksLyricFeatures_complete.shape

(25000, 7)

##### Outline of the approach
* Do very basic cleaning using regular expressions
    * remove section markers such as '[chorus]' and '[verse]'
    * remove the heading at the start of each lyric, which has the format **SONG_NAME Lyrics**
* Tokenize the cleaned text with gensim.utils.simple_preprocess
* Use the tokenized data to build the text corpus on which the Word2Vec dictionary will be trained
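
The cleaning steps in the outline can be sketched on a single string as follows. This is a minimal illustration using only the standard library; the helper name `clean_lyrics` and the sample lyric are made up for the example, and the three patterns mirror the ones applied to the DataFrame later in the notebook.

```python
import re

def clean_lyrics(raw: str) -> str:
    """Apply the three regex passes from the outline to one lyric string."""
    # Drop the leading "SONG_NAME Lyrics" heading (first line only, non-greedy).
    text = re.sub(r"^.*?Lyrics", "", raw)
    # Drop section markers such as [Chorus] or [Verse 1].
    text = re.sub(r"\[.*?\]", "", text)
    # Drop parenthesised asides such as (yeah).
    text = re.sub(r"\(.*?\)", "", text)
    return text

# Hypothetical sample in the same shape as the scraped lyrics.
sample = "Freeway Lyrics[Verse 1]\nI got a freeway in mind (yeah)\n[Chorus]\nLet go of my head"
cleaned = clean_lyrics(sample)
print(repr(cleaned))
```

Removing a marker leaves the surrounding whitespace and newlines in place, which is harmless here because the tokenizer ignores them.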

In [4]:
tracksLyricFeatures_complete.iloc[0,4]

"The Waking Blind LyricsThe waking blind, embraced the dark\nThey made us fight for the answers\nThey made us they made us fight fight fight for the answers\nThey made us fight\n\nTime is up the lines are drawn in sand\nIt's like a ritual cleaning\nFaith against them all\nSome kind of wicked pressure growing below\nBound to breach the surface\nGloves are off\nThe blinders tighten\nMisguided hands\nLeading the awoken\nFood for thought\nOnce your brothers now\nThe wretches the dregs the scum and the bastards\nEyes up\nGrab your pitchforks just before the dawn\nThey’re out of time and out of touch and out of luck\nWith the air that we breath and the ground beneath their feet\n\nA mask of virtue\nYou’ll hide behind as you feed off the lies\nBowing down to pressure with no resolve\nNever again\nCan we allow this fear to hold us\nFaith against them all\nDesigned deny and deprive\n\nThe waking blind, embraced the dark\nThey made us fight for the answers\n\nThe coming whipping winds\nThe winds

Keep only the songs for which English was predicted as the top language with a probability of at least 80%

In [5]:
# Medium dataset - keep only English songs with prob >= 80% (~10900 songs)
trackLyricsFeaturesEnglish = tracksLyricFeatures_complete[(tracksLyricFeatures_complete['top_lang_identified']=="__label__en") & (tracksLyricFeatures_complete['top_lang_identified_prob']>=0.8)]

In [6]:
trackLyricsFeaturesEnglish.shape

(10910, 7)

Doing basic cleaning using regular expressions to remove noise from the lyrics

In [7]:
trackLyricsFeaturesEnglish['regex_cleaned_lyrics'] = [re.sub(r'^.*?Lyrics', '', str(lyric)) for lyric in trackLyricsFeaturesEnglish['lyrics']]
trackLyricsFeaturesEnglish['regex_cleaned_lyrics'] = [re.sub(r"[\[].*?[\]]", "", str(lyric)) for lyric in trackLyricsFeaturesEnglish['regex_cleaned_lyrics']]
trackLyricsFeaturesEnglish['regex_cleaned_lyrics'] = [re.sub(r"(\().*?(\))", "", str(lyric)) for lyric in trackLyricsFeaturesEnglish['regex_cleaned_lyrics']]
trackLyricsFeaturesEnglish.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  trackLyricsFeaturesEnglish['regex_cleaned_lyrics'] = [re.sub(r'^.*?Lyrics', '', str(lyric)) for lyric in trackLyricsFeaturesEnglish['lyrics']]

Unnamed: 0,id,track,trackArtist,genre,lyrics,top_lang_identified,top_lang_identified_prob,regex_cleaned_lyrics
0,2,Food,AWOL,Hip-Hop,"The Waking Blind LyricsThe waking blind, embra...",__label__en,0.932648,"The waking blind, embraced the dark\nThey made..."
2,5,This World,AWOL,Hip-Hop,"E.T. Lyrics[Verse One: Mickey Factz]\nSee, I s...",__label__en,0.949876,"\nSee, I seen so many faces\nI've been to many..."
3,10,Freeway,Kurt Vile,Pop,"Freeway LyricsI got a freeway in mind, let go ...",__label__en,0.960411,"I got a freeway in mind, let go of my head\nWa..."
4,134,Street Music,AWOL,Hip-Hop,"Goin’ Down Lyrics[Intro: DJ Drama, Fabolous & ...",__label__en,0.948774,"\nY'all ready?\nThis what it's all about, righ..."
5,136,Peel Back The Mountain Sky,Abominog,Rock,The Rape of Lucrece LyricsThe Argument\nLucius...,__label__en,0.950529,"The Argument\nLucius Tarquinius, for his exces..."


The output of the above cell shows that the basic cleaning worked: the song-title heading and the bracketed section markers have been removed
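
The SettingWithCopyWarning printed above appears because the new column is assigned to a DataFrame that was produced by filtering another one. A common way to avoid it, sketched here on a toy table with made-up values, is to take an explicit `.copy()` of the filtered rows before adding columns. The variable names `df` and `english` are placeholders, and chaining `Series.str.replace` is simply an alternative to the list comprehensions, not the notebook's own method.

```python
import pandas as pd

# Toy stand-in for the language-tagged lyrics table (values are illustrative).
df = pd.DataFrame({
    "lyrics": ["Song A Lyrics[Verse 1]\nla la (hey)", "Song B LyricsBonjour"],
    "top_lang_identified": ["__label__en", "__label__fr"],
    "top_lang_identified_prob": [0.93, 0.88],
})

mask = (df["top_lang_identified"] == "__label__en") & (df["top_lang_identified_prob"] >= 0.8)

# .copy() returns an independent DataFrame, so adding a column is unambiguous.
english = df[mask].copy()

# Same three regex passes as in the notebook, applied column-wise.
english["regex_cleaned_lyrics"] = (
    english["lyrics"]
    .astype(str)
    .str.replace(r"^.*?Lyrics", "", regex=True)
    .str.replace(r"\[.*?\]", "", regex=True)
    .str.replace(r"\(.*?\)", "", regex=True)
)
print(english["regex_cleaned_lyrics"].tolist())
```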

The next step is to tokenize the documents (in this case, the lyrics) using gensim.utils.simple_preprocess

In [8]:
simple_processed_lyrics = trackLyricsFeaturesEnglish.regex_cleaned_lyrics.apply(gensim.utils.simple_preprocess)

print(type(simple_processed_lyrics))

<class 'pandas.core.series.Series'>


Lyrics data was successfully tokenized for each song

In [9]:
simple_processed_lyrics

0        [the, waking, blind, embraced, the, dark, they...
2        [see, seen, so, many, faces, ve, been, to, man...
3        [got, freeway, in, mind, let, go, of, my, head...
4        [all, ready, this, what, it, all, about, right...
5        [the, argument, lucius, tarquinius, for, his, ...
                               ...                        
24984    [the, mabbot, street, entrance, of, nighttown,...
24985    [the, mabbot, street, entrance, of, nighttown,...
24989    [oh, shit, guess, we, re, starting, the, mic, ...
24990    [the, american, academy, of, emergency, medici...
24993    [do, ti, la, so, fa, mi, re, do, yu, mi, guara...
Name: regex_cleaned_lyrics, Length: 10910, dtype: object
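
The tokens above contain fragments such as `ve` where the lyrics had "I've", and the single-letter word "I" is missing. This is consistent with how `simple_preprocess` behaves with its default settings: it lowercases the text, splits on anything that is not a letter, and discards tokens shorter than 2 or longer than 15 characters. The function below is only a rough standard-library approximation of that behaviour, written to make the effect visible; it is not gensim's implementation, and `approx_simple_preprocess` is a hypothetical name.

```python
import re

# Runs of word characters that are not digits (letters and underscores).
_TOKEN = re.compile(r"[^\W\d]+")

def approx_simple_preprocess(text, min_len=2, max_len=15):
    """Approximate gensim's simple_preprocess defaults (illustration only)."""
    tokens = _TOKEN.findall(text.lower())
    # Keep tokens within the length bounds; skip ones starting with an underscore.
    return [t for t in tokens if min_len <= len(t) <= max_len and not t.startswith("_")]

tokens = approx_simple_preprocess("See, I seen so many faces\nI've been to many")
print(tokens)
```

The apostrophe splits "I've" into "i" and "ve", and the length filter then removes "i", which matches the second row of the output above.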

In [10]:
simple_processed_lyrics.name = "tokenized_lyrics"
trackLyricsFeaturesEnglish = pd.concat([trackLyricsFeaturesEnglish, simple_processed_lyrics], axis = 1)
trackLyricsFeaturesEnglish.head()

Unnamed: 0,id,track,trackArtist,genre,lyrics,top_lang_identified,top_lang_identified_prob,regex_cleaned_lyrics,tokenized_lyrics
0,2,Food,AWOL,Hip-Hop,"The Waking Blind LyricsThe waking blind, embra...",__label__en,0.932648,"The waking blind, embraced the dark\nThey made...","[the, waking, blind, embraced, the, dark, they..."
2,5,This World,AWOL,Hip-Hop,"E.T. Lyrics[Verse One: Mickey Factz]\nSee, I s...",__label__en,0.949876,"\nSee, I seen so many faces\nI've been to many...","[see, seen, so, many, faces, ve, been, to, man..."
3,10,Freeway,Kurt Vile,Pop,"Freeway LyricsI got a freeway in mind, let go ...",__label__en,0.960411,"I got a freeway in mind, let go of my head\nWa...","[got, freeway, in, mind, let, go, of, my, head..."
4,134,Street Music,AWOL,Hip-Hop,"Goin’ Down Lyrics[Intro: DJ Drama, Fabolous & ...",__label__en,0.948774,"\nY'all ready?\nThis what it's all about, righ...","[all, ready, this, what, it, all, about, right..."
5,136,Peel Back The Mountain Sky,Abominog,Rock,The Rape of Lucrece LyricsThe Argument\nLucius...,__label__en,0.950529,"The Argument\nLucius Tarquinius, for his exces...","[the, argument, lucius, tarquinius, for, his, ..."


Saving the processed data to be used for dictionary model creation (the save in this cell is commented out; the final data is written at the end of the notebook)

In [11]:
# trackLyricsFeaturesEnglish.to_csv('./tracksLyricFeatures/tracksLyricFeaturesTokenzised.csv', encoding="utf8", index=False)

In [12]:
type(simple_processed_lyrics[0])

list

In [13]:
type(trackLyricsFeaturesEnglish.iloc[0,8])

list


Number of songs per genre in the data

In [14]:
trackLyricsFeaturesEnglish["genre"].value_counts()

Rock                   4411
Electronic             2096
Hip-Hop                 832
Experimental            809
Folk                    806
Pop                     607
Instrumental            583
Old-Time / Historic     207
International           153
Jazz                     98
Country                  87
Spoken                   63
Soul-RnB                 60
Classical                56
Blues                    39
Easy Listening            3
Name: genre, dtype: int64

There is great class imbalance in the data: Rock alone has 4411 songs, while Easy Listening has only 3

We need to drop the training samples for the minority genres, i.e. Old-Time / Historic and every genre below it in the above list

Also, to reduce class imbalance, we will drop 2500 and 500 training samples from the Rock and Electronic genres respectively, choosing the songs with the shortest tokenized lyrics

In [15]:
label = trackLyricsFeaturesEnglish["genre"]
classes_to_retain = ["Rock", "Electronic","Hip-Hop","Experimental","Folk","Pop","Instrumental"]
classes_to_drop = ["Old-Time / Historic", "International", "Jazz", "Country", "Spoken", "Soul-RnB", "Classical", "Blues", "Easy Listening"]

trackLyricsFeaturesEnglishTopGenre = pd.DataFrame()

# Retaining only classes with majority
for genre in classes_to_retain:
    data = trackLyricsFeaturesEnglish[trackLyricsFeaturesEnglish["genre"]==genre]
    trackLyricsFeaturesEnglishTopGenre = pd.concat([data,trackLyricsFeaturesEnglishTopGenre])

In [16]:
trackLyricsFeaturesEnglishTopGenre.head()

Unnamed: 0,id,track,trackArtist,genre,lyrics,top_lang_identified,top_lang_identified_prob,regex_cleaned_lyrics,tokenized_lyrics
1492,10230,Function Boys,Lucky Dragons,Instrumental,"Arabian Nights,Vol. 2 (Chap. 3) LyricsTale of ...",__label__en,0.976436,Tale of King Omar Bin Al-Nu'uman and His Sons ...,"[tale, of, king, omar, bin, al, nu, uman, and,..."
1493,10231,Feminist Short,Rock Paper Scissors,Instrumental,Pale Fire: A Poem in Four Cantos LyricsCANTO 1...,__label__en,0.95601,CANTO 1\nI was the shadow of the waxwing slain...,"[canto, was, the, shadow, of, the, waxwing, sl..."
1494,10234,Headache,Mr. California,Instrumental,"My President Lyrics[Intro: Young Jeezy]\nYeah,...",__label__en,0.936297,"\nYeah, be the realest shit I never wrote\nI a...","[yeah, be, the, realest, shit, never, wrote, a..."
1498,10246,Mayan Church,Field Recorder,Instrumental,Ulysses (Chap. 12 - Cyclops) LyricsI was just ...,__label__en,0.966509,I was just passing the time of day with old Tr...,"[was, just, passing, the, time, of, day, with,..."
1501,10277,Many Hollys,Lucky Dragons,Instrumental,Ulysses (Chap. 15 - Circe) LyricsThe Mabbot st...,__label__en,0.930705,"The Mabbot street entrance of nighttown, befor...","[the, mabbot, street, entrance, of, nighttown,..."


Only majority classes were retained

Note - We will also need to reduce training samples for Rock and Electronic classes

In [17]:
trackLyricsFeaturesEnglishTopGenre["genre"].value_counts()

Rock            4411
Electronic      2096
Hip-Hop          832
Experimental     809
Folk             806
Pop              607
Instrumental     583
Name: genre, dtype: int64
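
The retention loop above can also be written as a single boolean mask with `Series.isin`. The sketch below uses a toy DataFrame with made-up rows; `songs` and `top_genre` are placeholder names. Unlike the concat loop, this variant keeps the rows in their original order, so a `head()` call would show different rows than the notebook output.

```python
import pandas as pd

classes_to_retain = ["Rock", "Electronic", "Hip-Hop", "Experimental", "Folk", "Pop", "Instrumental"]

# Toy data: two genres to keep and two to drop.
songs = pd.DataFrame({
    "track": ["a", "b", "c", "d"],
    "genre": ["Rock", "Jazz", "Pop", "Blues"],
})

# Keep only rows whose genre is in the retained list; copy to get an independent frame.
top_genre = songs[songs["genre"].isin(classes_to_retain)].copy()
print(top_genre["genre"].value_counts())
```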

Function to get token count for each song's lyrics

In [18]:
def token_list_len(lyrics_token_list):
    if lyrics_token_list:
        return len(lyrics_token_list)
    else:
        return 0

Find the token count for each song

In [19]:
trackLyricsFeaturesEnglishTopGenre["token_cnt"] = trackLyricsFeaturesEnglishTopGenre["tokenized_lyrics"].apply(token_list_len)
trackLyricsFeaturesEnglishTopGenre.head()

Unnamed: 0,id,track,trackArtist,genre,lyrics,top_lang_identified,top_lang_identified_prob,regex_cleaned_lyrics,tokenized_lyrics,token_cnt
1492,10230,Function Boys,Lucky Dragons,Instrumental,"Arabian Nights,Vol. 2 (Chap. 3) LyricsTale of ...",__label__en,0.976436,Tale of King Omar Bin Al-Nu'uman and His Sons ...,"[tale, of, king, omar, bin, al, nu, uman, and,...",92596
1493,10231,Feminist Short,Rock Paper Scissors,Instrumental,Pale Fire: A Poem in Four Cantos LyricsCANTO 1...,__label__en,0.95601,CANTO 1\nI was the shadow of the waxwing slain...,"[canto, was, the, shadow, of, the, waxwing, sl...",7111
1494,10234,Headache,Mr. California,Instrumental,"My President Lyrics[Intro: Young Jeezy]\nYeah,...",__label__en,0.936297,"\nYeah, be the realest shit I never wrote\nI a...","[yeah, be, the, realest, shit, never, wrote, a...",946
1498,10246,Mayan Church,Field Recorder,Instrumental,Ulysses (Chap. 12 - Cyclops) LyricsI was just ...,__label__en,0.966509,I was just passing the time of day with old Tr...,"[was, just, passing, the, time, of, day, with,...",20033
1501,10277,Many Hollys,Lucky Dragons,Instrumental,Ulysses (Chap. 15 - Circe) LyricsThe Mabbot st...,__label__en,0.930705,"The Mabbot street entrance of nighttown, befor...","[the, mabbot, street, entrance, of, nighttown,...",21950


Determine the amount of reduction for the majority classes of Rock and Electronic

In [20]:
classes_to_reduce = ["Rock","Electronic"]
classes_unchanged = ["Hip-Hop","Experimental","Folk","Pop","Instrumental"]
rows_to_reduce = {"Rock": 2500, "Electronic":500}

Sort the songs by genre and, within each genre, in ascending order of token count

We'll be removing the songs with the fewest tokens in their lyrics

In [21]:
trackLyricsFeaturesEnglishTopGenreSorted = trackLyricsFeaturesEnglishTopGenre.sort_values(by=["genre","token_cnt"], ignore_index=True)
trackLyricsFeaturesEnglishTopGenreSorted.head()

Unnamed: 0,id,track,trackArtist,genre,lyrics,top_lang_identified,top_lang_identified_prob,regex_cleaned_lyrics,tokenized_lyrics,token_cnt
0,13709,Member,Goto80,Electronic,Member LyricsNo transcribedYou might also like...,__label__en,0.971269,No transcribedYou might also likeEmbed,"[no, transcribedyou, might, also, likeembed]",5
1,29628,Robot Wars,Binärpilot,Electronic,Robot Wars LyricsNow begins the robot wars\nKi...,__label__en,0.904678,Now begins the robot wars\nKill all humans wit...,"[now, begins, the, robot, wars, kill, all, hum...",14
2,3557,My Name Is Robert,Dan Deacon,Electronic,My Name Is Robert LyricsWhere is your granfath...,__label__en,0.981228,Where is your granfather?\nI have no granfathe...,"[where, is, your, granfather, have, no, granfa...",27
3,40373,Upghostery,Liar,Electronic,Upghostery LyricsLet's pretend\nThat this hous...,__label__en,0.948391,Let's pretend\nThat this house is ours\nThat y...,"[let, pretend, that, this, house, is, ours, th...",30
4,91941,Night Walk,Dirty Beaches,Electronic,"Night Walk LyricsTrail,\nall across the night\...",__label__en,0.873355,"Trail,\nall across the night\nwith the radio o...","[trail, all, across, the, night, with, the, ra...",42


Retain the songs with greater token count for Rock and Electronic genres and all songs for the other genres

In [22]:
trackLyricsFeaturesEnglishTopGenreReduced = pd.DataFrame()

for genre in classes_to_reduce:
    data = trackLyricsFeaturesEnglishTopGenreSorted[trackLyricsFeaturesEnglishTopGenreSorted["genre"]==genre]
    data = data[rows_to_reduce[genre]:]
    trackLyricsFeaturesEnglishTopGenreReduced = pd.concat([data,trackLyricsFeaturesEnglishTopGenreReduced])

for genre in classes_unchanged:
    data = trackLyricsFeaturesEnglishTopGenreSorted[trackLyricsFeaturesEnglishTopGenreSorted["genre"]==genre]
    trackLyricsFeaturesEnglishTopGenreReduced = pd.concat([data,trackLyricsFeaturesEnglishTopGenreReduced])

trackLyricsFeaturesEnglishTopGenreReduced = trackLyricsFeaturesEnglishTopGenreReduced.sort_values(by=["genre","token_cnt"], ignore_index=True)

trackLyricsFeaturesEnglishTopGenreReduced.head()
        

Unnamed: 0,id,track,trackArtist,genre,lyrics,top_lang_identified,top_lang_identified_prob,regex_cleaned_lyrics,tokenized_lyrics,token_cnt
0,127936,Armageddon,Synapsis,Electronic,Elektro Lyrics[Verse One]\nRight from the intr...,__label__en,0.910224,\nRight from the intro\nThe God opens your ear...,"[right, from, the, intro, the, god, opens, you...",637
1,52632,Boom,Jason Shaw,Electronic,Cali Shit Lyrics[Verse 1: J-Easie]\nStay true ...,__label__en,0.927561,\nStay true to my sound I'm just the same\nI s...,"[stay, true, to, my, sound, just, the, same, s...",638
2,99309,Trapped in a Single Celled Organism,Ample Mammal,Electronic,"Liquid Meets Land Lyrics[Andrew Bagadounts, Il...",__label__en,0.904999,"\nFluid, raging flood\nWrote my name in blood ...","[fluid, raging, flood, wrote, my, name, in, bl...",638
3,97992,Through Your Chest,Ant The Symbol,Electronic,Dungeons and Dragons Lyrics[Verse 1: Slug]\nKi...,__label__en,0.929616,\nKinetic responses were heard frequent in the...,"[kinetic, responses, were, heard, frequent, in...",640
4,112381,Blank Letter to God,statusq,Electronic,Automatic Systematic Lyrics()\nAutomatic syste...,__label__en,0.854843,\nAutomatic systematic\nI used to know the liv...,"[automatic, systematic, used, to, know, the, l...",640
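
The two loops above (sort, slice off the shortest songs for some genres, keep the rest) can be folded into one helper. This is a sketch under the assumption that the frame has `genre` and `token_cnt` columns as in the notebook; `drop_shortest` and the toy data are hypothetical.

```python
import pandas as pd

def drop_shortest(df, rows_to_drop, genre_col="genre", count_col="token_cnt"):
    """Per genre, drop the n rows with the smallest token count (n defaults to 0)."""
    kept = []
    for genre, group in df.groupby(genre_col, sort=True):
        ordered = group.sort_values(count_col)
        n_drop = rows_to_drop.get(genre, 0)
        kept.append(ordered.iloc[n_drop:])
    return pd.concat(kept, ignore_index=True)

# Toy data: three Rock songs and two Pop songs.
toy = pd.DataFrame({
    "genre": ["Rock", "Rock", "Rock", "Pop", "Pop"],
    "token_cnt": [300, 5, 120, 40, 10],
})

# Drop the two shortest Rock songs; Pop is left untouched.
reduced = drop_shortest(toy, {"Rock": 2})
print(reduced)
```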


The value counts below show that the class imbalance is now lower

In [23]:
trackLyricsFeaturesEnglishTopGenreReduced["genre"].value_counts()

Rock            1911
Electronic      1596
Hip-Hop          832
Experimental     809
Folk             806
Pop              607
Instrumental     583
Name: genre, dtype: int64
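
As a quick arithmetic check on the counts above, the sketch below recomputes the per-genre totals from the earlier value counts and the `rows_to_reduce` mapping, and compares the ratio of the largest to the smallest class before and after. The ratio is just one simple way to summarise imbalance, chosen here for illustration.

```python
# Per-genre counts before the reduction, taken from the earlier value_counts output.
before = {
    "Rock": 4411, "Electronic": 2096, "Hip-Hop": 832, "Experimental": 809,
    "Folk": 806, "Pop": 607, "Instrumental": 583,
}
rows_to_reduce = {"Rock": 2500, "Electronic": 500}

# Subtract the dropped rows; genres not listed are unchanged.
after = {genre: n - rows_to_reduce.get(genre, 0) for genre, n in before.items()}

def imbalance_ratio(counts):
    """Size of the largest class divided by the size of the smallest."""
    return max(counts.values()) / min(counts.values())

print(after["Rock"], after["Electronic"])
print(round(imbalance_ratio(before), 2), round(imbalance_ratio(after), 2))
```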

Save the data

In [26]:
trackLyricsFeaturesEnglishTopGenreReduced.iloc[:,:-1].to_csv('./tracksLyricFeatures/tracksLyricFeaturesTokenzised.csv', encoding="utf8", index=False)

This data will now be used to train the Word2Vec CBOW dictionary in the next notebook
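
One caveat when reloading the saved file: a CSV stores each token list as its string representation, so the `tokenized_lyrics` column comes back as text such as `"['the', 'waking', 'blind']"` rather than as a list. The sketch below shows the round trip with the standard-library `csv` module and an in-memory buffer, and parses the string back with `ast.literal_eval`. The row contents are made up for the example.

```python
import ast
import csv
import io

# One illustrative row with a list-valued column.
rows = [{"genre": "Rock", "tokenized_lyrics": ["the", "waking", "blind"]}]

# Write to an in-memory CSV; the list is stored as its string representation.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["genre", "tokenized_lyrics"])
writer.writeheader()
writer.writerows(rows)

# Read it back: the column is now a plain string.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
raw = loaded[0]["tokenized_lyrics"]

# literal_eval safely parses the string back into a Python list.
tokens = ast.literal_eval(raw)
print(type(raw).__name__, tokens)
```

With pandas, the same parsing step would typically be applied to the whole column, for example with `.apply(ast.literal_eval)` after `pd.read_csv`.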