## NLP-based prediction of song genres from lyrics: I. Data wrangling

To what extent can song lyrics predict a song's genre? In this project I build an NLP-based model from several publicly available datasets, drawing inspiration from similar work (e.g. [Bajwa et al](https://github.com/etarakci/music-genre-prediction), [Sianipar et al](https://medium.com/better-programming/predicting-a-songs-genre-using-natural-language-processing-7b354ed5bd80), [Kovachev et al](https://towardsdatascience.com/how-we-used-nltk-and-nlp-to-predict-a-songs-genre-from-its-lyrics-54e338ded537), and [Ram and Salz](http://cs229.stanford.edu/proj2017/final-reports/5241796.pdf)).

As an initial step, the following datasets are imported and examined:

1. "genre_lyrics_data.csv", by Bajwa et al, available on [GitHub](https://github.com/etarakci/music-genre-prediction/tree/master/data), including 6,733 lyrics and 90 genres.

2. "tcc_ceds_music.csv", by Moura et al, available via [Mendeley Data](https://data.mendeley.com/datasets/3t9vbwxgr5/3), including 28,372 lyrics and 7 genres.

3. "spotify_songs.csv", by Muhammad Nakhaee, available via [Kaggle](https://www.kaggle.com/imuhammad/audio-features-and-lyrics-of-spotify-songs?select=spotify_songs.csv), including 18,454 lyrics and 6 genres.

4. "original_cleaned_lyrics.csv", by Yalamanchili et al, available via [GitHub](https://github.com/hiteshyalamanchili/SongGenreClassification/tree/master/dataset), including 227,449 lyrics and 11 genres. This is a processed version of the "380,000+ lyrics from MetroLyrics" dataset, which is no longer available (likely removed due to copyright infringement) and which served as the starting point for many related projects.

Ultimately, datasets #2, #3 and #4 are selected as the source data for this project. The first dataset is discarded because it is relatively small (6,733 lyrics) and spreads those lyrics over 90 genres.

In [1]:
import pandas as pd
import numpy as np
import nltk
import string
import fasttext
import contractions
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

In [2]:
#Dataset downloaded from https://github.com/etarakci/music-genre-prediction/tree/master/data
lyrics = pd.read_csv('data/genre_lyrics_data.csv', index_col=0)
lyrics.head()

Unnamed: 0,genre_name,lyrics_cleaned
0,acoustic,well you done done me and you bet i felt it i ...
1,acoustic,head under water and they tell me to breathe e...
2,acoustic,will you count me in ive been awake for a whil...
3,acoustic,do you hear me im talking to you across the wa...
4,acoustic,the strands in your eyes that color them wonde...


In [3]:
lyrics.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6733 entries, 0 to 6732
Data columns (total 2 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   genre_name      6733 non-null   object
 1   lyrics_cleaned  6733 non-null   object
dtypes: object(2)
memory usage: 157.8+ KB


In [4]:
lyrics.genre_name.nunique()

90

In [5]:
#Dataset downloaded from DOI: 10.17632/3t9vbwxgr5.3#file-d2c58546-d8be-4d57-af14-b61337c927b2
lyrics_1 = pd.read_csv('data/tcc_ceds_music.csv',index_col=0)
lyrics_1.head()

Unnamed: 0,artist_name,track_name,release_date,genre,lyrics,len,dating,violence,world/life,night/time,...,sadness,feelings,danceability,loudness,acousticness,instrumentalness,valence,energy,topic,age
0,mukesh,mohabbat bhi jhoothi,1950,pop,hold time feel break feel untrue convince spea...,95,0.000598,0.063746,0.000598,0.000598,...,0.380299,0.117175,0.357739,0.454119,0.997992,0.901822,0.339448,0.13711,sadness,1.0
4,frankie laine,i believe,1950,pop,believe drop rain fall grow believe darkest ni...,51,0.035537,0.096777,0.443435,0.001284,...,0.001284,0.001284,0.331745,0.64754,0.954819,2e-06,0.325021,0.26324,world/life,1.0
6,johnnie ray,cry,1950,pop,sweetheart send letter goodbye secret feel bet...,24,0.00277,0.00277,0.00277,0.00277,...,0.00277,0.225422,0.456298,0.585288,0.840361,0.0,0.351814,0.139112,music,1.0
10,pérez prado,patricia,1950,pop,kiss lips want stroll charm mambo chacha merin...,54,0.048249,0.001548,0.001548,0.001548,...,0.225889,0.001548,0.686992,0.744404,0.083935,0.199393,0.77535,0.743736,romantic,1.0
12,giorgos papadopoulos,apopse eida oneiro,1950,pop,till darling till matter know till dream live ...,48,0.00135,0.00135,0.417772,0.00135,...,0.0688,0.00135,0.291671,0.646489,0.975904,0.000246,0.597073,0.394375,romantic,1.0


In [6]:
lyrics_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 28372 entries, 0 to 82451
Data columns (total 30 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   artist_name               28372 non-null  object 
 1   track_name                28372 non-null  object 
 2   release_date              28372 non-null  int64  
 3   genre                     28372 non-null  object 
 4   lyrics                    28372 non-null  object 
 5   len                       28372 non-null  int64  
 6   dating                    28372 non-null  float64
 7   violence                  28372 non-null  float64
 8   world/life                28372 non-null  float64
 9   night/time                28372 non-null  float64
 10  shake the audience        28372 non-null  float64
 11  family/gospel             28372 non-null  float64
 12  romantic                  28372 non-null  float64
 13  communication             28372 non-null  float64
 14  obscen

In [7]:
lyrics_1.genre.value_counts()

pop        7042
country    5445
blues      4604
rock       4034
jazz       3845
reggae     2498
hip hop     904
Name: genre, dtype: int64

In [8]:
#Dataset downloaded from https://www.kaggle.com/imuhammad/audio-features-and-lyrics-of-spotify-songs?select=spotify_songs.csv
lyrics_2 = pd.read_csv('data/spotify_songs.csv', index_col=0)
lyrics_2.head()

track_id,track_name,track_artist,lyrics,track_popularity,track_album_id,track_album_name,track_album_release_date,playlist_name,playlist_id,playlist_genre,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,language
0017A6SJgTbfQVU2EtsPNo,Pangarap,Barbie's Cradle,Minsan pa Nang ako'y napalingon Hindi ko alam ...,41,1srJQ0njEQgd8w4XSqI4JQ,Trip,2001-01-01,Pinoy Classic Rock,37i9dQZF1DWYDQ8wBxd7xt,rock,...,-10.068,1,0.0236,0.279,0.0117,0.0887,0.566,97.091,235440,tl
004s3t0ONYlzxII9PLgU6z,I Feel Alive,Steady Rollin,"The trees, are singing in the wind The sky blu...",28,3z04Lb9Dsilqw68SHt6jLB,Love & Loss,2017-11-21,Hard Rock Workout,3YouF0u7waJnolytf9JCXf,rock,...,-4.739,1,0.0442,0.0117,0.00994,0.347,0.404,135.225,373512,en
00chLpzhgVjxs1zKC9UScL,Poison,Bell Biv DeVoe,"NA Yeah, Spyderman and Freeze in full effect U...",0,6oZ6brjB8x3GoeSYdwJdPc,Gold,2005-01-01,"Back in the day - R&B, New Jack Swing, Swingbe...",3a9y4eeCJRmG9p4YKfqYIx,r&b,...,-7.504,0,0.216,0.00432,0.00723,0.489,0.65,111.904,262467,en
00cqd6ZsSkLZqGMlQCR0Zo,Baby It's Cold Outside (feat. Christina Aguilera),CeeLo Green,I really can't stay Baby it's cold outside I'v...,41,3ssspRe42CXkhPxdc12xcp,CeeLo's Magic Moment,2012-10-29,Christmas Soul,6FZYc2BvF7tColxO8PBShV,r&b,...,-5.819,0,0.0341,0.689,0.0,0.0664,0.405,118.593,243067,en
00emjlCv9azBN0fzuuyLqy,Dumb Litty,KARD,Get up out of my business You don't keep me fr...,65,7h5X3xhh3peIK9Y0qI5hbK,KARD 2nd Digital Single ‘Dumb Litty’,2019-09-22,K-Party Dance Mix,37i9dQZF1DX4RDXswvP6Mj,pop,...,-1.993,1,0.0409,0.037,0.0,0.138,0.24,130.018,193160,en


In [9]:
lyrics_2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 18454 entries, 0017A6SJgTbfQVU2EtsPNo to 7zzZmpw8L66ZPjH1M6qmOs
Data columns (total 24 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   track_name                18454 non-null  object 
 1   track_artist              18454 non-null  object 
 2   lyrics                    18194 non-null  object 
 3   track_popularity          18454 non-null  int64  
 4   track_album_id            18454 non-null  object 
 5   track_album_name          18454 non-null  object 
 6   track_album_release_date  18454 non-null  object 
 7   playlist_name             18454 non-null  object 
 8   playlist_id               18454 non-null  object 
 9   playlist_genre            18454 non-null  object 
 10  playlist_subgenre         18454 non-null  object 
 11  danceability              18454 non-null  float64
 12  energy                    18454 non-null  float64
 13  key                       18

In [10]:
lyrics_2.playlist_genre.value_counts()

pop      3993
rock     3521
rap      3391
r&b      3326
latin    2178
edm      2045
Name: playlist_genre, dtype: int64

In [11]:
#Dataset downloaded from https://github.com/hiteshyalamanchili/SongGenreClassification/tree/master/dataset
lyrics_3 = pd.read_csv('data/original_cleaned_lyrics.csv',index_col=0)
lyrics_3.head()

Unnamed: 0,index,song,year,artist,genre,lyrics
0,0,ego-remix,2009,beyonce-knowles,Pop,Oh baby how you doing You know I'm gonna cut r...
1,1,then-tell-me,2009,beyonce-knowles,Pop,playin everything so easy it's like you seem s...
2,2,honesty,2009,beyonce-knowles,Pop,If you search For tenderness It isn't hard to ...
3,3,you-are-my-rock,2009,beyonce-knowles,Pop,Oh oh oh I oh oh oh I If I wrote a book about ...
4,4,black-culture,2009,beyonce-knowles,Pop,Party the people the people the party it's pop...


In [12]:
lyrics_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 227449 entries, 0 to 362236
Data columns (total 6 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   index   227449 non-null  int64 
 1   song    227448 non-null  object
 2   year    227449 non-null  int64 
 3   artist  227449 non-null  object
 4   genre   227449 non-null  object
 5   lyrics  227449 non-null  object
dtypes: int64(2), object(4)
memory usage: 12.1+ MB


In [13]:
lyrics_3.genre.value_counts()

Rock          104137
Pop            36439
Hip-Hop        23215
Metal          22420
Country        14182
Jazz            7520
Electronic      7231
Other           3989
R&B             3354
Indie           2970
Folk            1992
Name: genre, dtype: int64

In [14]:
#Joining all datasets

#First, only essential features are kept in each table and renamed for consistency. 
lyrics_1 = lyrics_1[['lyrics','genre']].reset_index(drop=True)
lyrics_2 = lyrics_2[['lyrics','playlist_genre']].reset_index(drop=True).rename(columns={"playlist_genre":"genre"})
lyrics_3 = lyrics_3[['lyrics','genre']]

#Concatenation
songs = pd.concat([lyrics_1, lyrics_2, lyrics_3]).reset_index(drop=True)
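
The column harmonization in the cell above can be tried on toy data before running it on the full tables. The sketch below is only an illustration: the two small frames and their rows are hypothetical stand-ins for the real datasets, and `ignore_index=True` is used as a shorthand for calling `reset_index(drop=True)` after the concatenation.

```python
import pandas as pd

# Hypothetical toy tables (not rows from the real datasets).
table_a = pd.DataFrame({"lyrics": ["la la la"], "genre": ["pop"], "len": [3]})
table_b = pd.DataFrame({"lyrics": ["yo yo"], "playlist_genre": ["rap"], "tempo": [90.0]})

# Keep only the two shared fields and give them identical names.
table_a = table_a[["lyrics", "genre"]]
table_b = table_b[["lyrics", "playlist_genre"]].rename(columns={"playlist_genre": "genre"})

# ignore_index=True renumbers the rows from 0 in the combined frame.
combined = pd.concat([table_a, table_b], ignore_index=True)
print(combined)
```

Renaming before concatenating matters: if the genre columns kept different names, `pd.concat` would produce two half-empty columns instead of one.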

In [15]:
#Some data cleaning operations

#Drop any records with missing values
songs = songs.dropna()

#Standardize genre names
songs.replace({"genre": {"rock": "Rock", "Indie": "Rock", "pop": "Pop", "rap": "Hip-Hop", "hip hop": "Hip-Hop",
               "country": "Country", "edm": "Electronic", "jazz": "Jazz", "r&b": "R&B"}}, inplace=True)

songs.genre = [genre.capitalize() for genre in songs.genre]

#Drop rows where the genre is equal to "Other"
songs = songs[songs.genre != 'Other']

#Drop full duplicates (i.e. both lyrics and genre match)
songs.drop_duplicates(inplace=True)

#Drop lyrics duplicates with different genres
songs.drop_duplicates(subset="lyrics", keep=False, inplace=True)
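
The two duplicate-removal passes above behave differently, which is easiest to see on a toy frame. This is a minimal sketch with hypothetical rows chosen to trigger both cases; it is not data from the project.

```python
import pandas as pd

# Hypothetical toy rows, chosen only to exercise both kinds of duplicate.
toy = pd.DataFrame({
    "lyrics": ["a b c", "a b c", "d e f", "d e f", "g h i"],
    "genre":  ["Rock",  "Rock",  "Pop",   "Jazz",  "Folk"],
})

# Pass 1: exact duplicates (same lyrics AND same genre) collapse to one row.
step1 = toy.drop_duplicates()

# Pass 2: lyrics that still occur more than once carry conflicting genres;
# keep=False drops every copy rather than picking one label arbitrarily.
step2 = step1.drop_duplicates(subset="lyrics", keep=False)

print(step2)
```

Here "a b c" survives once with its single label, while both "d e f" rows are removed because their genres disagree.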

In [16]:
songs.info()
songs.genre.value_counts()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 251337 entries, 0 to 274274
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   lyrics  251337 non-null  object
 1   genre   251337 non-null  object
dtypes: object(2)
memory usage: 5.8+ MB


Rock          107781
Pop            43469
Hip-hop        24953
Metal          21779
Country        18786
Jazz            9956
Electronic      8087
R&b             5774
Blues           4604
Reggae          2498
Folk            1853
Latin           1797
Name: genre, dtype: int64
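
The labels "Hip-hop" and "R&b" in the counts above follow from the `capitalize()` step in the cleaning cell, which lower-cases everything after the first character. A short plain-Python illustration (the list of labels is just a sample):

```python
# A sample of labels as they stand after the replace() mapping in the cleaning cell.
labels = ["Hip-Hop", "R&B", "Rock", "blues", "reggae", "latin"]

# str.capitalize() upper-cases the first character and lower-cases all the others.
normalized = [label.capitalize() for label in labels]
print(normalized)
# ['Hip-hop', 'R&b', 'Rock', 'Blues', 'Reggae', 'Latin']
```

This is why later cells refer to the genre as 'Hip-hop' rather than 'Hip-Hop'.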

In [17]:
#Some more data preparation (adapted from https://github.com/hiteshyalamanchili/SongGenreClassification/blob/master/models/Data%20Pre-Processing.ipynb):

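# Removing punctuation (regex=True is needed in pandas >= 2.0, where patterns are treated as literal strings by default)
songs.lyrics = songs.lyrics.str.replace(r"[-?.,/#!$%^&*;:{}=_~()]", ' ', regex=True)

# Removing identifiers such as [Chorus] or [Verse]
songs.lyrics = songs.lyrics.str.replace(r"\[(.*?)\]", ' ', regex=True)
# Removing apostrophes that touch a space (i.e. leading or trailing quotes)
songs.lyrics = songs.lyrics.str.replace(r"' | '", ' ', regex=True)
# Removing repeat markers such as x2 or x10
songs.lyrics = songs.lyrics.str.replace(r'x[0-9]+', ' ', regex=True)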

# Removing songs tagged as instrumental in lyrics field
songs = songs[songs.lyrics.str.strip().str.lower() != 'instrumental']

# Remove songs with corrupted/non-ASCII characters in lyrics
songs = songs[~songs.lyrics.str.contains(r'[^\x00-\x7F]+')]
songs.head()


Unnamed: 0,lyrics,genre
0,hold time feel break feel untrue convince spea...,Pop
1,believe drop rain fall grow believe darkest ni...,Pop
2,sweetheart send letter goodbye secret feel bet...,Pop
3,kiss lips want stroll charm mambo chacha merin...,Pop
4,till darling till matter know till dream live ...,Pop
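
To see what the four substitutions in the cell above do to a single lyric, they can be replayed with the standard `re` module. This is a sketch under assumptions: `clean_lyrics` is a hypothetical helper name, the sample string is made up, and the final whitespace collapse is added here only to make the result easier to read (the notebook does not do it at this stage).

```python
import re

def clean_lyrics(text: str) -> str:
    """Replay the notebook's four substitutions on one string (hypothetical helper)."""
    text = re.sub(r"[-?.,/#!$%^&*;:{}=_~()]", " ", text)  # punctuation
    text = re.sub(r"\[(.*?)\]", " ", text)                # [Chorus], [Verse], ...
    text = re.sub(r"' | '", " ", text)                    # apostrophes next to a space
    text = re.sub(r"x[0-9]+", " ", text)                  # repeat markers such as x2
    # Not part of the original cell: collapse runs of spaces for display.
    return " ".join(text.split())

sample = "[Chorus] Hold on, hold on! x2"
cleaned = clean_lyrics(sample)
print(cleaned)
# Hold on hold on
```

Note that the `x[0-9]+` pattern is not anchored to word boundaries, so it also removes an "x" followed by digits in the middle of a longer token.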


In [18]:
#Some steps adapted from: https://towardsdatascience.com/preprocessing-text-data-using-python-576206753c28

#Expanding contractions word by word
songs.lyrics = songs.lyrics.apply(lambda x: [contractions.fix(word) for word in x.split()])
#An expanded contraction (e.g. "I'm" -> "I am") is a single multi-word list item at this point, so the items are re-joined into one string
songs.lyrics = [' '.join(map(str, token_list)) for token_list in songs.lyrics]
songs.head()

Unnamed: 0,lyrics,genre
0,hold time feel break feel untrue convince spea...,Pop
1,believe drop rain fall grow believe darkest ni...,Pop
2,sweetheart send letter goodbye secret feel bet...,Pop
3,kiss lips want stroll charm mambo chacha merin...,Pop
4,till darling till matter know till dream live ...,Pop


In [20]:
# Removing empty list lyrics and other variations of empty content
songs = songs[songs.lyrics != "[]"]
songs = songs[songs.lyrics != "['instrumental']"]
songs = songs[songs.lyrics != "['lyric', 'instrumental']"]
songs.describe()

Unnamed: 0,lyrics,genre
count,246328.0,246328
unique,246033.0,12
top,,Rock
freq,13.0,107423


In [64]:
# Detecting language and removing instances of non-English lyrics
import langid
songs['language'] = songs.lyrics.apply(lambda x: langid.classify(x)[0])
songs = songs[songs.language=='en']

In [72]:
songs.genre.value_counts()

Rock          103494
Pop            40301
Hip-hop        22815
Metal          20686
Country        18580
Jazz            9489
Electronic      7448
R&b             5190
Blues           4352
Reggae          2370
Folk            1615
Latin            521
Name: genre, dtype: int64

The value counts above reveal a classic class-imbalance problem. For practical purposes, we focus only on the top five genres: Rock, Pop, Hip-hop, Metal, and Country. The size of the smallest of these (Country, 18,580 songs) serves as the value of n when downsampling the other four.

In [73]:
n = songs.genre.value_counts()['Country']

def downsample_genre(genre):
    '''Returns a random sample of n songs of the given genre (n is defined above)'''
    genre_songs = songs[songs.genre == genre]
    return genre_songs.sample(n)

songs = pd.concat([downsample_genre('Rock'), downsample_genre('Pop'), 
               downsample_genre('Hip-hop'), downsample_genre('Metal'), songs[songs.genre=='Country']])

In [74]:
songs.genre.value_counts()

Metal      18580
Rock       18580
Country    18580
Pop        18580
Hip-hop    18580
Name: genre, dtype: int64

The counts above confirm that the new dataframe is balanced and contains only the five selected genres.
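
The same downsampling idea can be written generically with `groupby`. The sketch below is a variant under stated assumptions, not the code used in this notebook: the toy frame is hypothetical, and `random_state=0` is an addition (the original cell samples without a seed) that makes the draw repeatable.

```python
import pandas as pd

# Hypothetical imbalanced toy frame: 6 Rock, 4 Pop, 2 Country rows.
toy = pd.DataFrame({
    "lyrics": [f"song {i}" for i in range(12)],
    "genre": ["Rock"] * 6 + ["Pop"] * 4 + ["Country"] * 2,
})

# The size of the smallest class sets the per-genre sample size.
n = int(toy.genre.value_counts().min())

# Sample n rows from each genre; random_state (an assumption) fixes the draw.
balanced = pd.concat(
    [group.sample(n=n, random_state=0) for _, group in toy.groupby("genre")]
)

print(balanced.genre.value_counts())
```

Fixing a seed would let a rerun of the notebook reproduce the same balanced subset, at the cost of always discarding the same songs.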

In [69]:
#Tokenizing
songs.lyrics = songs.lyrics.apply(word_tokenize)

#Converting all tokens to lower case
songs.lyrics = songs.lyrics.apply(lambda x: [word.lower() for word in x])

songs.head()

Unnamed: 0,lyrics,genre,language
0,"[hold, time, feel, break, feel, untrue, convin...",Pop,en
1,"[believe, drop, rain, fall, grow, believe, dar...",Pop,en
2,"[sweetheart, send, letter, goodbye, secret, fe...",Pop,en
3,"[kiss, lips, want, stroll, charm, mambo, chach...",Pop,en
4,"[till, darling, till, matter, know, till, drea...",Pop,en


In [75]:
#Removing stopwords
stop_words = set(stopwords.words('english'))
songs.lyrics = songs.lyrics.apply(lambda x: [word for word in x if word not in stop_words])

#Lemmatizing, step 1: Applying part of speech (POS) tags
songs.lyrics = songs.lyrics.apply(nltk.tag.pos_tag)
songs.head()

Unnamed: 0,lyrics,genre,language
179592,"[(2, CD), (``, ``), (``, ``), (``, ``), (``, `...",Rock,en
159660,"[(fingers, NNS), (white, JJ), (like, IN), (whi...",Rock,en
60718,"[(need, NN), (know, VBP), (much, RB), (want, V...",Rock,en
173393,"[(ask, NN), (feel, NN), (love, IN), (real, JJ)...",Rock,en
84381,"[(want, JJ), (played, VBN), (background, NN), ...",Rock,en


In [76]:
#Lemmatizing, step 2: Converting POS tags to wordnet format to be used with WordNetLemmatizer()
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN
    
songs.lyrics = songs.lyrics.apply(lambda x: [(word, get_wordnet_pos(pos_tag)) for (word, pos_tag) in x])
songs.head()

Unnamed: 0,lyrics,genre,language
179592,"[(2, n), (``, n), (``, n), (``, n), (``, n), (...",Rock,en
159660,"[(fingers, n), (white, a), (like, n), (whisper...",Rock,en
60718,"[(need, n), (know, v), (much, r), (want, v), (...",Rock,en
173393,"[(ask, n), (feel, n), (love, n), (real, a), (k...",Rock,en
84381,"[(want, a), (played, v), (background, n), (cou...",Rock,en
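
The tag conversion in step 2 can be checked without NLTK installed, since the WordNet POS constants are plain one-letter strings (`wordnet.ADJ` is `'a'`, `wordnet.VERB` is `'v'`, `wordnet.NOUN` is `'n'`, `wordnet.ADV` is `'r'`). The sketch below uses a hypothetical dictionary-based helper, `to_wordnet_pos`, that follows the same first-letter rule as `get_wordnet_pos`; the tagged pairs are copied from the outputs above.

```python
# First letter of the Penn Treebank tag -> WordNet POS letter.
PREFIX_TO_WORDNET = {"J": "a", "V": "v", "N": "n", "R": "r"}

def to_wordnet_pos(penn_tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS letter, defaulting to noun."""
    return PREFIX_TO_WORDNET.get(penn_tag[:1], "n")

tagged = [("fingers", "NNS"), ("white", "JJ"), ("like", "IN"),
          ("know", "VBP"), ("much", "RB")]
converted = [(word, to_wordnet_pos(tag)) for word, tag in tagged]
print(converted)
# [('fingers', 'n'), ('white', 'a'), ('like', 'n'), ('know', 'v'), ('much', 'r')]
```

Tags with no WordNet counterpart, such as `IN` or `CD`, fall back to noun, which is also the lemmatizer's default part of speech.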


In [77]:
#Lemmatizing, step 3: Applying WordNetLemmatizer()
wnl = WordNetLemmatizer()
songs.lyrics = songs.lyrics.apply(lambda x: [wnl.lemmatize(word, tag) for word, tag in x])
songs.head()

Unnamed: 0,lyrics,genre,language
179592,"[2, ``, ``, ``, ``, ``, ``]",Rock,en
159660,"[finger, white, like, whisper, lips, consume, ...",Rock,en
60718,"[need, know, much, want, broken, die, need, kn...",Rock,en
173393,"[ask, feel, love, real, know, make, look, frow...",Rock,en
84381,"[want, play, background, couple, drink, wine, ...",Rock,en


In [78]:
songs.to_csv('data/lemmatized_lyrics.csv', index=False)
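
One thing to keep in mind when loading this file later: CSV has no list type, so the token lists are written as their string representation. The sketch below shows one possible way to restore them, assuming `ast.literal_eval` is used as a converter; the toy frame and the in-memory buffer are illustrative stand-ins for the real file.

```python
import ast
import io

import pandas as pd

# Hypothetical two-row frame shaped like the saved data.
toy = pd.DataFrame({
    "lyrics": [["hold", "time"], ["believe", "drop"]],
    "genre": ["Pop", "Pop"],
})

# Write to an in-memory buffer instead of disk for the purposes of this sketch.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Without a converter, the lyrics column comes back as plain strings.
buffer.seek(0)
as_text = pd.read_csv(buffer)

# ast.literal_eval parses each list literal back into a Python list.
buffer.seek(0)
restored = pd.read_csv(buffer, converters={"lyrics": ast.literal_eval})

print(type(as_text.lyrics[0]), type(restored.lyrics[0]))
```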