## Lyrics-Based Song Recommendation

**The idea** is to match a song to a piece of text (the user's context) based on valency, category and text similarity, all computed from the song lyrics.<br />
**Valency:** pos, neg, neu<br />
**Categories:** adventure, hobbies, humor, mystery, romance<br />
**Similarity metrics:** cosine, WordNet word similarity or others
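
As a rough illustration of the cosine metric mentioned above, the sketch below compares a context sentence with two lyric fragments using plain word counts. The helper names `bag_of_words` and `cosine_similarity` and the example sentences are made up for this illustration; a real pipeline would probably also remove stopwords and lemmatize first.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical context and lyric fragments, for illustration only.
context = "I need a day at the beach"
lyric_a = "we spent the day at the beach"
lyric_b = "words can't say how much I care"

sim_a = cosine_similarity(bag_of_words(context), bag_of_words(lyric_a))
sim_b = cosine_similarity(bag_of_words(context), bag_of_words(lyric_b))
print(sim_a > sim_b)  # lyric_a shares more words with the context
```

Counts are kept sparse in a `Counter`, so only words that occur in both texts contribute to the dot product.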

### 1. Import and preprocess data

In [102]:
import nltk, re, pprint
import pandas as pd
import numpy as np

In [103]:
data = pd.read_csv(r"D:\ML\Datasets\labeled_lyrics_cleaned.csv")  # raw string so backslashes are not treated as escapes
data.head()

| | Unnamed: 0 | artist | seq | song | label |
|---|---|---|---|---|---|
| 0 | 0 | Elijah Blake | No, no\r\nI ain't ever trapped out the bando\r... | Everyday | 0.63 |
| 1 | 1 | Elijah Blake | The drinks go down and smoke goes up, I feel m... | Live Till We Die | 0.63 |
| 2 | 2 | Elijah Blake | She don't live on planet Earth no more\r\nShe ... | The Otherside | 0.24 |
| 3 | 3 | Elijah Blake | Trippin' off that Grigio, mobbin', lights low\... | Pinot | 0.54 |
| 4 | 4 | Elijah Blake | I see a midnight panther, so gallant and so br... | Shadows & Diamonds | 0.37 |


In [104]:
# Drop, rename columns, remove duplicates and reset index

data.drop(labels="Unnamed: 0", axis=1, inplace=True)
data.rename(columns={"seq": "lyrics", "label": "valency"}, inplace=True)

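# Sort descending so that, among rows sharing a song title, the highest valency comes first
data.sort_values(by=['song', 'valency'], ascending=False, inplace=True)
data = data.drop_duplicates(subset='lyrics')  # keeps the first row of each duplicated lyric
pd.set_option('display.float_format', lambda x: '%.2f' % x)  # display floats with 2 decimal places
data = data.reset_index()  # old row labels are kept in a new 'index' column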
data.head()

| | index | artist | lyrics | song | valency |
|---|---|---|---|---|---|
| 0 | 54911 | Simon & Milo | Hello, this is Stacy, the computer\nGood morni... | www.nevergetoveryou | 0.68 |
| 1 | 82479 | Hippo Campus | See how the western kids\r\nHave silicon insid... | western kids | 0.52 |
| 2 | 82478 | Hippo Campus | Wisconsin pines, collaborating with the day gl... | way it goes | 0.52 |
| 3 | 82477 | Hippo Campus | I see meaning where you don't, where you don't... | vines | 0.66 |
| 4 | 82476 | Hippo Campus | My thoughts are a battlefield of sub-surreal a... | vacation | 0.55 |


### 2. Sentence segmentation (test)

In [105]:
from nltk import tokenize
from nltk.corpus import brown, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.corpus import webtext

In [106]:
import string # only needed for the optional punctuation-stripping line below

pattern = r'\[.*?\]' # remove square brackets and contents

song = data['lyrics'][3334].replace("\n", " ").replace("\r", "").replace("\t", "").replace("  ", " ").strip()
#song = song.translate(str.maketrans('', '', string.punctuation)) # removes punctuation
song = re.sub(pattern, '', song)
song

'If you Come close And hold me tight You feel The heart that beats For you  And if You dear Could read my mind Oh you Would know my love Is true   Words can`t say how much I love you Words can`t say how much I care I need you `n I need your love Like I need to breathe the air  Faith And trust Give both a try So you Will see that is The key  I swear True love Will never die So please Believe in you And me '

In [107]:
from nnsplit import NNSplit

splitter = NNSplit.load("en")

In [108]:
splits = splitter.split([song.lower()])[0]
song_sents = [str(x).strip() for x in splits]
song_sents_sorted = sorted(set(song_sents))

song_sents_sorted

['and if you dear could read my mind oh you would know my love is true',
 'faith and trust give both a try so you will see that is the key',
 'i need to breathe the air',
 'i need you `n',
 'i need your love like',
 'i swear true love will never die so please believe in you and me',
 'if you come close and hold me tight you feel the heart that beats for you',
 'words can`t say how much i love you words can`t say how much i care']
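
`sorted(set(song_sents))` removes repeated sentences but also reorders them alphabetically, so the original lyric order is lost. If that order matters for later steps, one option is an order-preserving de-duplication. The sketch below uses only the standard library; the function name `unique_in_order` and the sample list are made up for illustration.

```python
def unique_in_order(sentences):
    """Drop repeated sentences while keeping first-occurrence order."""
    # dicts preserve insertion order (Python 3.7+), so the keys are the
    # distinct sentences in the order they first appeared.
    return list(dict.fromkeys(sentences))


# Hypothetical splitter output containing a repeated line.
sents = ["i need you", "words can't say", "i need you", "i swear"]
unique = unique_in_order(sents)
print(unique)
```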

### 3. Part-of-speech tagging (test)

In [109]:
from nltk import pos_tag, word_tokenize
pos_tag(word_tokenize(song_sents[0])) 

[('if', 'IN'),
 ('you', 'PRP'),
 ('come', 'VBP'),
 ('close', 'RB'),
 ('and', 'CC'),
 ('hold', 'VB'),
 ('me', 'PRP'),
 ('tight', 'JJ'),
 ('you', 'PRP'),
 ('feel', 'VBP'),
 ('the', 'DT'),
 ('heart', 'NN'),
 ('that', 'WDT'),
 ('beats', 'VBZ'),
 ('for', 'IN'),
 ('you', 'PRP')]

### 4. Attempt sentiment classification using VADER:

In [110]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import tokenize

**Analyse valency and extract compound score:**

In [111]:
def valency(text):
    """Return a (label, compound score) pair for the text using VADER."""
    sid = SentimentIntensityAnalyzer()
    compound_score = sid.polarity_scores(text)['compound']  # always in [-1, 1]
    # Scores within [-0.3, 0.3] are treated as neutral
    if compound_score > 0.3:
        label = 'positive'
    elif compound_score < -0.3:
        label = 'negative'
    else:
        label = 'neutral'
    return label, compound_score

**Load song lyrics:**

In [112]:
lyrics = data.at[201, 'lyrics'].strip()
#text = tokenize.sent_tokenize(lyrics)[0]
print(lyrics[:100])

Every time I turn my back I get the feeling that
I'm 'bout to take a shot to the skully with a bat


In [113]:
valency(lyrics)

('negative', -0.8814)

**User situation test:**

In [114]:
s1 = "Today is finally my day off! The weather is amazing and I'm going to the beach"
s2 = "Today is finally my day off! The weather is [] and I'm going to the beach"
valency(s1), valency(s2)

(('positive', 0.6239), ('neutral', 0.0))

**Conclusion:**

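As the example above shows, accuracy is not great: removing the single word "amazing" drops the compound score from 0.6239 to 0.0, even though the rest of the sentence is still clearly upbeat. A different classifier is needed, possibly one trained on the NLTK movie reviews corpus.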
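
One possible reading of "a different classifier" is a bag-of-words Naive Bayes model trained on labelled text. The sketch below is a minimal pure-Python version with add-one smoothing, not the notebook's actual method. The four training sentences are made-up stand-ins for a real labelled corpus, and the names `simple_tokens`, `train` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def simple_tokens(text):
    """Very rough tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def train(labelled_texts):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in labelled_texts:
        doc_counts[label] += 1
        word_counts[label].update(simple_tokens(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for word in simple_tokens(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothing avoids log(0)
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Made-up training data, for illustration only.
training = [
    ("what a wonderful sunny day", "pos"),
    ("i love the beach and the sun", "pos"),
    ("this is a terrible gloomy day", "neg"),
    ("i hate the rain and the cold", "neg"),
]
model = train(training)
prediction = classify("a sunny day at the beach", model)
print(prediction)
```

Unlike a fixed lexicon, a model like this learns from context words such as "beach" or "sunny", provided they occur in the training data.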

### 5. Train classifier to assign one of the Brown corpus categories to an arbitrary text (test):

In [115]:
from nltk.corpus import brown, stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

categories = ['adventure', 'hobbies', 'humor', 'mystery', 'romance']
#cfd = nltk.ConditionalFreqDist((genre, word) for genre in brown.categories() for word in brown.words(categories=genre))

In [116]:
stop_words = set(stopwords.words('english'))  # build once instead of once per word
fdist1 = nltk.FreqDist([lemmatizer.lemmatize(word) for word in brown.words(categories='humor')
                        if word.isalnum() and word.lower() not in stop_words])

In [117]:
fdist1.most_common(10)

[('said', 87),
 ('one', 65),
 ('would', 56),
 ('time', 50),
 ('thing', 40),
 ('even', 38),
 ('like', 34),
 ('could', 30),
 ('way', 29),
 ('year', 29)]

In [118]:
stop_words = set(stopwords.words('english'))  # build once instead of once per word
fdist2 = nltk.FreqDist([lemmatizer.lemmatize(word) for word in brown.words(categories='mystery')
                        if word.isalnum() and word.lower() not in stop_words])

In [119]:
fdist2.most_common(10)

[('said', 202),
 ('would', 186),
 ('one', 175),
 ('back', 157),
 ('could', 141),
 ('like', 136),
 ('man', 106),
 ('get', 101),
 ('know', 93),
 ('time', 87)]
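
The two top-10 lists overlap heavily ('said', 'one', 'would', 'time', 'like' and 'could' appear in both), so raw counts alone say little about what separates the categories. One possible next step, sketched below under assumptions, is to rank words by how much more frequent they are in one category than in another. The function name `distinctive_words` and the counts are made up for illustration; since `nltk.FreqDist` is a `Counter` subclass, `fdist1` and `fdist2` could be passed in the same way.

```python
import math
from collections import Counter


def distinctive_words(target, other, n=3):
    """Rank words by the log-ratio of smoothed relative frequencies."""
    vocab = set(target) | set(other)
    target_total = sum(target.values())
    other_total = sum(other.values())

    def log_ratio(word):
        # add-one smoothing keeps words missing from one side finite
        p_target = (target[word] + 1) / (target_total + len(vocab))
        p_other = (other[word] + 1) / (other_total + len(vocab))
        return math.log(p_target / p_other)

    return sorted(vocab, key=log_ratio, reverse=True)[:n]


# Made-up counts, for illustration only.
humor = Counter({"said": 9, "one": 6, "joke": 5, "laugh": 4})
mystery = Counter({"said": 20, "one": 17, "gun": 10, "police": 8})

top_humor = distinctive_words(humor, mystery, n=2)
top_mystery = distinctive_words(mystery, humor, n=2)
print(top_humor, top_mystery)
```

Words shared by both categories, such as "said", end up with a log-ratio near zero and drop out of the ranking, which is the behaviour a category classifier would need from its features.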