## 1. Importing Python Libraries

We start by importing the essential Python libraries.

In [1]:
### IMPORTING LIBRARIES
import numpy as np
import pandas as pd
import nltk
import re
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.naive_bayes import MultinomialNB
nltk.download('wordnet', quiet = True) #WordNetLemmatizer needs the WordNet corpus
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to C:\Users\Shekhar
[nltk_data]     Lamba\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## 2. Data Importation

Now, we shall import the dataframe containing the lyrics for all the songs performed by The Beatles and display its first 10 rows.

In [2]:
### IMPORTING THE DATA
beatles_lyrics = pd.read_csv('beatles_lyrics_data.csv')
beatles_lyrics.head(10)

Unnamed: 0,title,lyrics
0,12-Bar Original,"One, two, three, four! Embed"
1,"12-Bar Original (Take 2, Edited)",
2,1822!,"This is a Dorsey Burnette number, brother of ..."
3,A Beginning,
4,A Beginning (Take 4) / Don’t Pass Me By (Take 7),"This is the introduction to Ringo's ""Don't Pa..."
5,Across the Universe,Words are flowing out like endless rain into ...
6,Across the Universe (Take 2),Words are flowing out like endless rain into ...
7,Across the Universe (Take 6),Words are flowing out like endless rain into ...
8,Act Naturally,They're gonna put me in the movies They're go...
9,A Day in the Life,"I read the news today—oh, boy About a lucky m..."


Next, we import the dataframe containing the list of all the unique songs recorded by The Beatles and display the first 10 rows.

In [3]:
beatles_songs = pd.read_csv('beatles_song_list.csv')
beatles_songs.head(10)

Unnamed: 0,Song,Core catalogue release(s),Songwriter(s),Lead vocal(s)[d],Year
0,Across the Universe,Let It BePast Masters,LennonMcCartney,Lennon,1969
1,Act Naturally,Help!,Johnny RussellVoni Morrison,Starr,1965
2,All I've Got to Do,With the Beatles,LennonMcCartney,Lennon,1963
3,All My Loving,With the Beatles,LennonMcCartney,McCartney,1963
4,All Together Now,Yellow Submarine,LennonMcCartney,McCartney(with Lennon),1969
5,All You Need Is Love,Magical Mystery Tour,LennonMcCartney,Lennon,1967
6,And I Love Her,A Hard Day's Night,LennonMcCartney,McCartney,1964
7,And Your Bird Can Sing,Revolver,LennonMcCartney,Lennon,1966
8,Anna (Go to Him),Please Please Me,Arthur Alexander,Lennon,1963
9,Another Girl,Help!,LennonMcCartney,McCartney,1965


Next, we want to merge these two dataframes. _beatles_songs_ lists each song title once, while _beatles_lyrics_ holds lyrics for the official studio versions as well as for alternate versions of the same songs. We therefore merge them so that we keep lyrics only for the songs that appear in _beatles_songs_. This drops the alternate versions and leaves each song appearing once.

## 3. Cleaning the Song Titles

Before we go any further, let us write a function to clean the titles in both dataframes. Since we will merge on the song titles, they must share the same format in both. The function _clean_title_ converts a title to lower case, removes all punctuation, and then strips any leading or trailing white space.

In [4]:
### CREATING A FUNCTION TO CLEAN SONG TITLES
def clean_title(text):
    text = text.lower() #turns to lower case
    text = re.sub(r'[^\w\s]+', '', text) #removes punctuation
    text = text.strip() #removes leading and trailing white space
    return text
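
To make the behaviour concrete, here is a small self-contained check of the function on a few sample titles. The variant `clean_title_collapsed` is a hypothetical addition, not part of this notebook: it also squeezes runs of interior white space, which `strip()` alone leaves in place when a symbol such as `/` is removed from between two spaces.

```python
import re

def clean_title(text):
    text = text.lower()                    # lower case
    text = re.sub(r'[^\w\s]+', '', text)   # drop punctuation
    text = text.strip()                    # trim leading/trailing white space
    return text

def clean_title_collapsed(text):
    # Hypothetical variant: same steps, then squeeze repeated white space.
    return re.sub(r'\s+', ' ', clean_title(text))

examples = ["12-Bar Original", "Anna (Go to Him)", "A / B"]
for title in examples:
    print(repr(clean_title(title)), '|', repr(clean_title_collapsed(title)))
```

Whether the extra step matters depends on the data: it only helps if the two sources punctuate the same title differently.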

Next, we apply _clean_title_ to the columns that contain song titles in both the dataframes.

In [5]:
### APPLYING THE FUNCTION TO TITLES
beatles_lyrics['title'] = beatles_lyrics['title'].apply(clean_title)
beatles_songs['title'] = beatles_songs['Song'].apply(clean_title)

Let us check to see if it worked in the _beatles_lyrics_ dataframe.

In [6]:
beatles_lyrics.head(10)

Unnamed: 0,title,lyrics
0,12bar original,"One, two, three, four! Embed"
1,12bar original take 2 edited,
2,1822,"This is a Dorsey Burnette number, brother of ..."
3,a beginning,
4,a beginning take 4 dont pass me by take 7,"This is the introduction to Ringo's ""Don't Pa..."
5,across the universe,Words are flowing out like endless rain into ...
6,across the universe take 2,Words are flowing out like endless rain into ...
7,across the universe take 6,Words are flowing out like endless rain into ...
8,act naturally,They're gonna put me in the movies They're go...
9,a day in the life,"I read the news today—oh, boy About a lucky m..."


We see that the song titles are all in lower case, without punctuation and without leading or trailing white space.

We again check the same in _beatles_songs_ dataframe.

In [7]:
beatles_songs.head(10)

Unnamed: 0,Song,Core catalogue release(s),Songwriter(s),Lead vocal(s)[d],Year,title
0,Across the Universe,Let It BePast Masters,LennonMcCartney,Lennon,1969,across the universe
1,Act Naturally,Help!,Johnny RussellVoni Morrison,Starr,1965,act naturally
2,All I've Got to Do,With the Beatles,LennonMcCartney,Lennon,1963,all ive got to do
3,All My Loving,With the Beatles,LennonMcCartney,McCartney,1963,all my loving
4,All Together Now,Yellow Submarine,LennonMcCartney,McCartney(with Lennon),1969,all together now
5,All You Need Is Love,Magical Mystery Tour,LennonMcCartney,Lennon,1967,all you need is love
6,And I Love Her,A Hard Day's Night,LennonMcCartney,McCartney,1964,and i love her
7,And Your Bird Can Sing,Revolver,LennonMcCartney,Lennon,1966,and your bird can sing
8,Anna (Go to Him),Please Please Me,Arthur Alexander,Lennon,1963,anna go to him
9,Another Girl,Help!,LennonMcCartney,McCartney,1965,another girl


We see the same clean outcome in the new _title_ column.

## 4. Missing Value Treatment

Next, we shall deal with the missing values in the _beatles_lyrics_ dataframe by simply getting rid of all the rows that contain missing values.

In [8]:
### DELETING MISSING VALUES
beatles_lyrics.dropna(inplace = True)
beatles_lyrics.head(10)

Unnamed: 0,title,lyrics
0,12bar original,"One, two, three, four! Embed"
2,1822,"This is a Dorsey Burnette number, brother of ..."
4,a beginning take 4 dont pass me by take 7,"This is the introduction to Ringo's ""Don't Pa..."
5,across the universe,Words are flowing out like endless rain into ...
6,across the universe take 2,Words are flowing out like endless rain into ...
7,across the universe take 6,Words are flowing out like endless rain into ...
8,act naturally,They're gonna put me in the movies They're go...
9,a day in the life,"I read the news today—oh, boy About a lucky m..."
10,a day in the life take 1 with hums,"Geoff Emerick: ""In the Life Of"", Take 1 John:..."
11,a day in the life take 2,"Take 2! John: 1, 2, 3, 4.... I read the news ..."


So, we've removed all the rows with missing values from _beatles_lyrics_ and are now ready to merge the two datasets.

## 5. Merging the Two Dataframes

As mentioned earlier, we shall merge the two dataframes using the columns that contain the clean song titles.

In [9]:
### MERGING THE DATAFRAMES USING INNER JOIN
df = pd.merge(left = beatles_songs, right = beatles_lyrics, how = 'inner', left_on = 'title', right_on = 'title')
df.head(10)

Unnamed: 0,Song,Core catalogue release(s),Songwriter(s),Lead vocal(s)[d],Year,title,lyrics
0,Across the Universe,Let It BePast Masters,LennonMcCartney,Lennon,1969,across the universe,Words are flowing out like endless rain into ...
1,Act Naturally,Help!,Johnny RussellVoni Morrison,Starr,1965,act naturally,They're gonna put me in the movies They're go...
2,All I've Got to Do,With the Beatles,LennonMcCartney,Lennon,1963,all ive got to do,"Whenever I want you around, yeah All I got to..."
3,All My Loving,With the Beatles,LennonMcCartney,McCartney,1963,all my loving,Close your eyes and I'll kiss you Tomorrow I'...
4,All Together Now,Yellow Submarine,LennonMcCartney,McCartney(with Lennon),1969,all together now,"One, two, three, four Can I have a little mor..."
5,All You Need Is Love,Magical Mystery Tour,LennonMcCartney,Lennon,1967,all you need is love,"Love, love, love Love, love, love Love, love,..."
6,And I Love Her,A Hard Day's Night,LennonMcCartney,McCartney,1964,and i love her,I give her all my love That's all I do And if...
7,And Your Bird Can Sing,Revolver,LennonMcCartney,Lennon,1966,and your bird can sing,You tell me that you've got everything you wa...
8,Anna (Go to Him),Please Please Me,Arthur Alexander,Lennon,1963,anna go to him,"Anna You come and ask me, girl To set you fre..."
9,Another Girl,Help!,LennonMcCartney,McCartney,1965,another girl,For I have got another girl Another girl You...


We've created a new dataset that contains the song details present in _beatles_songs_ and the song lyrics present in _beatles_lyrics_. 

Now, let us see the number of songs this dataframe possesses.

In [10]:
df.shape

(211, 7)

Of the 213 official studio versions of songs, we have data for 211 of them. We can try to find and add the lyrics for the two missing songs but 211 is still good enough for this current project.
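
One way to track down the unmatched songs is a left join with the `indicator` option of `pd.merge`, which labels each row by where it was found. The sketch below uses two tiny made-up frames (`songs` and `lyrics` are illustrative stand-ins, not the notebook's data).

```python
import pandas as pd

# Toy stand-ins for beatles_songs and beatles_lyrics (made-up rows).
songs = pd.DataFrame({'title': ['act naturally', 'all my loving', 'another girl']})
lyrics = pd.DataFrame({'title': ['act naturally', 'another girl', 'another girl take 2'],
                       'lyrics': ['lyrics a', 'lyrics b', 'lyrics c']})

# A left join keeps every song; the _merge column records whether lyrics were found.
merged = pd.merge(songs, lyrics, how='left', on='title', indicator=True)
missing = merged.loc[merged['_merge'] == 'left_only', 'title'].tolist()
print(missing)
```

Applied to the real dataframes, the same pattern would list the cleaned titles that have no counterpart in _beatles_lyrics_.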

## 6. Cleaning the Song Lyrics

Now, let us start cleaning the song lyrics. We write a function _preprocess_ that takes the lyrics of one song, converts them to lower case, removes all punctuation, drops all digits, and strips leading and trailing white space. It then lemmatizes every word that is not a stop word and joins the results to return a clean version of the lyrics.

In [11]:
### CREATING A FUNCTION TO CLEAN SONG LYRICS
stop_words = stopwords.words('english')
lemmatizer = WordNetLemmatizer()
def preprocess(text):
    clean_words = []
    text = text.lower() #turns to lower case
    text = re.sub(r'[^\w\s]+', '', text) #removes punctuation
    text = re.sub(r'\d+', '', text) #removes numerical characters (raw string avoids an invalid-escape warning)
    text = text.strip() #removes leading and trailing white space
    for word in text.split():
        if word not in stop_words:
            word = lemmatizer.lemmatize(word)
            clean_words.append(word)
    clean_words = ' '.join(clean_words)
    return clean_words

Now let us apply this function to the lyrics of each song.

In [12]:
### APPLYING THE FUNCTION TO LYRICS
df['lyrics'] = df['lyrics'].apply(preprocess)
df.head(10)

Unnamed: 0,Song,Core catalogue release(s),Songwriter(s),Lead vocal(s)[d],Year,title,lyrics
0,Across the Universe,Let It BePast Masters,LennonMcCartney,Lennon,1969,across the universe,word flowing like endless rain paper cup slith...
1,Act Naturally,Help!,Johnny RussellVoni Morrison,Starr,1965,act naturally,theyre gonna put movie theyre gonna make big s...
2,All I've Got to Do,With the Beatles,LennonMcCartney,Lennon,1963,all ive got to do,whenever want around yeah got call phone youll...
3,All My Loving,With the Beatles,LennonMcCartney,McCartney,1963,all my loving,close eye ill kiss tomorrow ill miss remember ...
4,All Together Now,Yellow Submarine,LennonMcCartney,McCartney(with Lennon),1969,all together now,one two three four little five six seven eight...
5,All You Need Is Love,Magical Mystery Tour,LennonMcCartney,Lennon,1967,all you need is love,love love love love love love love love love t...
6,And I Love Her,A Hard Day's Night,LennonMcCartney,McCartney,1964,and i love her,give love thats saw love youd love love give e...
7,And Your Bird Can Sing,Revolver,LennonMcCartney,Lennon,1966,and your bird can sing,tell youve got everything want bird sing dont ...
8,Anna (Go to Him),Please Please Me,Arthur Alexander,Lennon,1963,anna go to him,anna come ask girl set free girl say love set ...
9,Another Girl,Help!,LennonMcCartney,McCartney,1965,another girl,got another girl another girl youre making say...


We see that all the lyrics have been cleaned.
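
Note that tokens such as `theyre`, `youll` and `dont` survive in the output above. A likely reason is the order of the steps: punctuation is removed before the stop-word check, so a contraction loses its apostrophe and no longer matches a stop-word entry such as `you've`. The sketch below illustrates the effect with a tiny hypothetical stop-word set (`SAMPLE_STOP_WORDS` is a stand-in, not the NLTK list), and compares it with filtering first.

```python
import re

# Hypothetical stand-in for a few entries of an English stop-word list.
SAMPLE_STOP_WORDS = {"you", "me", "that", "you've"}

def strip_then_filter(text):
    # Order used in this notebook: remove punctuation, then drop stop words.
    text = re.sub(r"[^\w\s]+", "", text.lower())
    return [w for w in text.split() if w not in SAMPLE_STOP_WORDS]

def filter_then_strip(text):
    # Alternative order: drop stop words first, then remove punctuation.
    kept = [w for w in text.lower().split() if w not in SAMPLE_STOP_WORDS]
    return re.sub(r"[^\w\s]+", "", " ".join(kept)).split()

line = "You tell me that you've got everything you want"
print(strip_then_filter(line))
print(filter_then_strip(line))
```

The second order is only a sketch: a word with punctuation attached, such as `me,`, would slip past the filter, so a proper tokenizer would be the more robust choice.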

## 7. Creating a Document Term Matrix

Next, we shall create a Document Term Matrix (DTM) in which each row represents the lyrics of one song and each column represents a word appearing anywhere in the lyrics. The _(i,j)_-th entry of the DTM is the number of times the _j_-th word appears in the lyrics of the _i_-th song. To accomplish this, we write a function _get_count_ that applies _CountVectorizer_ to the lyrics of every song and returns the DTM as a dataframe. Two parameters matter here:

1. ngram_range: the minimum and maximum number of consecutive words that form a feature (exposed here as the function arguments _m_ and _n_)
2. max_features: the maximum number of features to keep, chosen by frequency across all the lyrics (fixed at 5000 here)

In [13]:
### CREATING A FUNCTION THAT RETURNS A DOCUMENT TERM MATRIX
def get_count(dataframe, m, n):
    vectorizer = CountVectorizer(ngram_range = (m, n), lowercase = False, max_features = 5000)
    count_set = vectorizer.fit_transform(dataframe).toarray() #dense array of counts
    count_df = pd.DataFrame(count_set, columns = vectorizer.get_feature_names_out()) #get_feature_names() was removed in scikit-learn 1.2
    return count_df
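
To see what such a matrix looks like, here is a hand-rolled illustration in plain Python. It is not how _CountVectorizer_ is implemented; `ngrams` and `toy_dtm` are hypothetical helpers, and the two "songs" are made up. It only mirrors the idea of counting runs of _m_ to _n_ consecutive words per document.

```python
from collections import Counter

def ngrams(words, m, n):
    # All runs of m..n consecutive words, joined by spaces.
    return [' '.join(words[i:i + size])
            for size in range(m, n + 1)
            for i in range(len(words) - size + 1)]

def toy_dtm(docs, m=1, n=1):
    # One Counter per document, then one column per distinct term.
    counts = [Counter(ngrams(doc.split(), m, n)) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix

docs = ['love love love', 'give love']
vocab, matrix = toy_dtm(docs)           # single words only
vocab2, matrix2 = toy_dtm(docs, 1, 2)   # single words and word pairs
print(vocab, matrix)
print(vocab2, matrix2)
```

Widening the range from (1, 1) to (1, 2) adds a column for every pair of adjacent words, which is why the number of features grows quickly.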

Now, let us apply this function to the lyrics of each song by taking only single words as features. We then separate the DTM as X and the song titles as y.

In [14]:
### CREATING A DTM AND RESPONSE VARIABLE
X_lyrics = df['lyrics']
X = get_count(X_lyrics, 1, 1)
y = df['Song']

## 8. Creating a Multinomial Naive Bayes Model

We now construct a Multinomial Naive Bayes model that will predict the song title from a line of lyrics we feed into it. We will set alpha as 1.

In [15]:
### CREATING A MULTINOMIAL NAIVE BAYES CLASSIFIER
clf_nb = MultinomialNB(alpha = 1)
clf_nb.fit(X, y)

MultinomialNB(alpha=1)
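
For reference, _alpha_ is the additive smoothing parameter. In the standard multinomial Naive Bayes formulation, the smoothed estimate of the probability of word $w$ in class $c$ (here, a song) is

$$\hat{\theta}_{cw} = \frac{N_{cw} + \alpha}{N_{c} + \alpha n}$$

where $N_{cw}$ is the number of times word $w$ appears in the training lyrics of class $c$, $N_{c}$ is the total count of all words in that class, and $n$ is the number of features (the columns of X). With $\alpha = 1$, a word that never appears in a song still receives a small non-zero probability, so a single unseen word does not rule that song out.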

## 9. Cleaning the Lyrics to be Predicted

Now, let us write a line of a song that we hope to get the title for.

In [16]:
### LYRICS THAT WE WANT TO FEED INTO THE MODEL
my_lyrics = 'sun is up the sky is blue its beautiful and so are you'

Before feeding this line to the model, it needs to be converted into a DTM. For this, we shall create a function that will do the following:

1. Make a new dataframe _df2_ by dropping all the columns of _df_ except _'Song'_ and _'lyrics'_.
2. Clean the line that we want to feed to our model using the _preprocess_ function.
3. Make a new row with columns _'Song'_ and _'lyrics'_, holding NaN and the cleaned lyrics from Step 2 as their respective values.
4. Append this row to _df2_.
5. Create a DTM from the lyrics in _df2_ using the _get_count_ function.
6. Compare the columns of this new DTM with the columns of the DTM we used for training (X) and keep only those that are in X.
7. Finally, return the last row of this new DTM since it has the values for the line we want to feed into the model.

In [17]:
### CREATING A FUNCTION THAT CLEANS THE INPUT LYRICS AND CONVERTS THEM INTO DTM FORM
def get_clean_lyrics(lyrics):
    df2 = df.drop(['Core catalogue release(s)', 'Songwriter(s)', 'Lead vocal(s)[d]', 'Year', 'title'], axis = 1) #1
    my_lyrics_cleaned = preprocess(lyrics) #2 (uses the argument rather than the global my_lyrics)
    df3 = pd.DataFrame([{'Song': np.nan, 'lyrics': my_lyrics_cleaned}]) #3
    df_new = pd.concat([df2, df3], ignore_index = True) #4 (DataFrame.append was removed in pandas 2.0)
    X_new_lyrics = df_new['lyrics']
    X_new = get_count(X_new_lyrics, 1, 1) #5
    X_new = X_new.reindex(columns = X.columns, fill_value = 0) #6 keeps only the columns of X, in the same order
    my_song = X_new.tail(1) #7
    return my_song

Let us check to see if we got the desired result.

In [18]:
### APPLYING THE ABOVE FUNCTION TO THE INPUT LYRICS
my_song = get_clean_lyrics(my_lyrics)
my_song

Unnamed: 0,aaaah,aaah,aah,able,aboard,accident,ache,acorn,across,act,...,youembed,youll,young,younger,youre,yourselfembed,youve,zapped,zoo,zu
211,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 10. Predicting the Song Title

First, let us create a function _get_song_details_ that will return the song title along with the other song details.

In [19]:
### CREATING A FUNCTION THAT RETURNS SONG DETAILS
def get_song_details(song_title):
    song = beatles_songs[beatles_songs['Song'] == song_title].iloc[0] #looks up the song's row once
    print('Song title: %s' % song_title)
    print('Album: %s' % song['Core catalogue release(s)'])
    print('Songwriter(s): %s' % song['Songwriter(s)'])
    print('Lead vocal(s): %s' % song['Lead vocal(s)[d]'])
    print('Year: %d' % song['Year']) #a single value, not a one-element Series

Now, we try to predict the song title from the lyrics using the Multinomial Naive Bayes model.

In [20]:
### PREDICTING THE SONG TITLE
predicted_song = clf_nb.predict(my_song)[0]
get_song_details(predicted_song)

Song title: Dear Prudence
Album: The Beatles ("White Album")
Songwriter(s): LennonMcCartney
Lead vocal(s): Lennon
Year: 1968


The model predicts _Dear Prudence_, which is correct: the line we fed in comes from that song.
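
A single prediction says little about how close the runners-up were. As a hedged sketch, `predict_proba` can be used to rank all candidate songs for a line. The three "songs" below are made up for illustration and every variable name is an assumption, not part of the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up training data: one short "lyric" per song title.
lyrics = ['sun sky blue beautiful', 'love love love need', 'yellow submarine sea']
titles = ['Song A', 'Song B', 'Song C']

vectorizer = CountVectorizer()
clf = MultinomialNB(alpha=1).fit(vectorizer.fit_transform(lyrics), titles)

# Class probabilities for one input line, sorted from most to least likely.
probs = clf.predict_proba(vectorizer.transform(['sky blue sea']))[0]
ranked = [clf.classes_[i] for i in np.argsort(probs)[::-1]]
print(ranked)
```

Printing the top few entries of such a ranking for a real line of lyrics would show whether the winning song stands out clearly or only narrowly.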