**The idea** is to match a song to a piece of text (the user's context) by comparing the song's lyrics with that text on three criteria: valency, category and text similarity.<br />
**Valency:** positive, negative, neutral<br />
**Categories:** adventure, hobbies, humor, mystery, romance<br />
**Similarity metrics:** cosine similarity, TF-IDF, WordNet word similarity or others
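
As a rough illustration of the similarity criterion, the sketch below scores a context string against two lyric snippets using TF-IDF weights and cosine similarity, in plain Python. The function names, the whitespace tokenizer, the smoothed IDF variant and the sample strings are all assumptions made for this example, not part of the notebook's pipeline.

```python
import math
from collections import Counter


def tokenize_simple(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(documents):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [tokenize_simple(doc) for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Relative term frequency times a smoothed IDF (one common variant).
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(term, 0.0) for term, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical context and lyric snippets, for illustration only.
context = "sunny day at the beach"
lyrics_samples = [
    "walking on the beach on a sunny day",
    "cold rain falls in the city tonight",
]
vectors = tfidf_vectors([context] + lyrics_samples)
scores = [cosine_similarity(vectors[0], vec) for vec in vectors[1:]]
```

The first snippet shares several words with the context and therefore scores higher than the second, which shares only "the". On the real lyrics a proper tokenizer and stop-word removal would likely be needed.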

In [1]:
import nltk, re, pprint
import pandas as pd
import numpy as np

#### EDA and Data Preprocessing:

Import the dataset:

In [2]:
data = pd.read_csv(r"D:\ML\Datasets\labeled_lyrics_cleaned.csv")  # raw string: backslashes are not treated as escapes

In [3]:
data.head()

Unnamed: 0.1,Unnamed: 0,artist,seq,song,label
0,0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626
1,1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",Live Till We Die,0.63
2,2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,The Otherside,0.24
3,3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",Pinot,0.536
4,4,Elijah Blake,"I see a midnight panther, so gallant and so br...",Shadows & Diamonds,0.371


Drop the redundant index column and rename columns:

In [4]:
#data = data.round(decimals=2)
data.drop(labels="Unnamed: 0", axis=1, inplace=True)
data.rename(columns={"seq": "lyrics", "label": "valency"}, inplace=True)

In [5]:
data.head()

Unnamed: 0,artist,lyrics,song,valency
0,Elijah Blake,"No, no\r\nI ain't ever trapped out the bando\r...",Everyday,0.626
1,Elijah Blake,"The drinks go down and smoke goes up, I feel m...",Live Till We Die,0.63
2,Elijah Blake,She don't live on planet Earth no more\r\nShe ...,The Otherside,0.24
3,Elijah Blake,"Trippin' off that Grigio, mobbin', lights low\...",Pinot,0.536
4,Elijah Blake,"I see a midnight panther, so gallant and so br...",Shadows & Diamonds,0.371


Check summary statistics:

In [6]:
data.describe(include='all')

Unnamed: 0,artist,lyrics,song,valency
count,158353,158353,158353,158353.0
unique,14691,135991,99031,
top,Elvis Presley,"Somewhere over the rainbow, way up high\r\nThe...",Have Yourself a Merry Little Christmas,
freq,821,167,162,
mean,,,,0.491052
std,,,,0.249619
min,,,,0.0
25%,,,,0.286
50%,,,,0.483
75%,,,,0.691


Remove cover songs and format decimal places for summary statistics display:

In [7]:
#data = data.drop_duplicates(subset=['lyrics', 'song'])
data.sort_values(by=['song', 'valency'], ascending=False, inplace=True) # within each song title, highest valency first, so that row is the one kept
data = data.drop_duplicates(subset='lyrics')
pd.set_option('display.float_format', lambda x: '%.2f' % x) # round everything to 2 decimal places
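
The sort-then-deduplicate step can be checked on a toy frame. The sketch below uses made-up rows (the artists, titles and scores are hypothetical) and relies on `drop_duplicates` keeping the first occurrence by default.

```python
import pandas as pd

# Hypothetical toy rows: the same lyrics recorded by two artists.
toy = pd.DataFrame({
    "artist": ["A", "B", "C"],
    "lyrics": ["la la la", "la la la", "do re mi"],
    "song": ["Tune", "Tune", "Scale"],
    "valency": [0.40, 0.90, 0.55],
})

# Sort so that, within each song title, the highest valency comes first;
# drop_duplicates then keeps the first occurrence of each lyrics text.
deduped = (
    toy.sort_values(by=["song", "valency"], ascending=False)
       .drop_duplicates(subset="lyrics")
)
```

Here the row by artist "B" (valency 0.90) survives and the one by "A" is dropped. Note that when identical lyrics appear under different song titles, the surviving row is the one whose title sorts last, which is not necessarily the one with the highest valency.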

In [8]:
data.describe(include='all')

Unnamed: 0,artist,lyrics,song,valency
count,135991,135991,135991,135991.0
unique,10777,135991,95714,
top,Elvis Presley,In my deepest mood\r\nHear my call for you O' ...,Intro,
freq,753,1,127,
mean,,,,0.5
std,,,,0.25
min,,,,0.0
25%,,,,0.3
50%,,,,0.5
75%,,,,0.7


Reset the index after sorting and de-duplication, then inspect the result:

In [9]:
data = data.reset_index() 
data

Unnamed: 0,index,artist,lyrics,song,valency
0,54911,Simon & Milo,"Hello, this is Stacy, the computer\nGood morni...",www.nevergetoveryou,0.68
1,82479,Hippo Campus,See how the western kids\r\nHave silicon insid...,western kids,0.52
2,82478,Hippo Campus,"Wisconsin pines, collaborating with the day gl...",way it goes,0.52
3,82477,Hippo Campus,"I see meaning where you don't, where you don't...",vines,0.66
4,82476,Hippo Campus,My thoughts are a battlefield of sub-surreal a...,vacation,0.55
...,...,...,...,...,...
135986,109667,The Beach Boys,"Hi, this is Al this scene takes place at a typ...","""Cassius"" Love Vs. ""Sonny"" Wilson",0.49
135987,55096,Simple Minds,"Cry cry cry\r\nCry like a baby\r\n""see"" Moon ""...","""C"" Moon Cry Like a Baby",0.77
135988,41838,The Blues Brothers,Caught a ride into South Dakota\r\nWith two gi...,"""B"" Movie Box Car Blues",0.50
135989,81217,The Gaslight Anthem,Have you seen my hands?\nJust look at 'em shak...,"""45""",0.42


### Attempting sentiment classification using VADER:

In [10]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk import tokenize

#### Analysing valency and extracting compound score:

In [11]:
# Requires the VADER lexicon: nltk.download('vader_lexicon')
sid = SentimentIntensityAnalyzer()  # build once instead of reloading the lexicon on every call

def valency(text):
    """Return (label, compound_score) for a text, based on VADER's compound score."""
    compound_score = sid.polarity_scores(text)['compound']  # normalised to [-1, 1]
    if compound_score > 0.3:
        label = 'positive'
    elif compound_score < -0.3:
        label = 'negative'
    else:
        label = 'neutral'
    return label, compound_score
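
The labelling rule itself can be separated from the scoring so that the cut-off is easy to tune and to test without loading VADER. This is a minimal sketch; `label_from_compound` and its `threshold` parameter are hypothetical names, with the default taken from the 0.3 cut-off used above.

```python
def label_from_compound(compound_score, threshold=0.3):
    """Map a compound score in [-1, 1] to 'positive', 'negative' or 'neutral'."""
    if compound_score > threshold:
        return "positive"
    if compound_score < -threshold:
        return "negative"
    # Scores within [-threshold, threshold], boundaries included, are neutral.
    return "neutral"
```

Scores exactly at the threshold fall into the neutral band, matching the strict comparisons in the function above.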

Loading song lyrics:

In [12]:
lyrics = data.at[201, 'lyrics'].strip()
#text = tokenize.sent_tokenize(lyrics)[0]
print(lyrics[:100])

Every time I turn my back I get the feeling that
I'm 'bout to take a shot to the skully with a bat


In [13]:
valency(lyrics)

('negative', -0.8814)

User situation test:

In [14]:
s1 = "Today is finally my day off! The weather is amazing and I'm going to the beach"
s2 = "Today is finally my day off! The weather is [] and I'm going to the beach"
valency(s1), valency(s2)

(('positive', 0.6239), ('neutral', 0.0))

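As the example above shows, accuracy is poor: removing the single word "amazing" drops the score from 0.6239 (positive) to 0.0 (neutral), even though the rest of the sentence is unchanged. A different classifier is needed, possibly one trained on the NLTK movie reviews corpus.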
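
A trained classifier could look roughly like the sketch below, which fits NLTK's `NaiveBayesClassifier` on bag-of-words features. The four training sentences are a hand-made stand-in for a real labelled corpus such as movie reviews, and `bag_of_words` is a hypothetical helper, so this only shows the shape of the approach, not a usable model.

```python
import nltk


def bag_of_words(text):
    """Turn a text into the {feature_name: True} dict NLTK classifiers expect."""
    return {f"contains({word})": True for word in text.lower().split()}


# Tiny hand-made training set, a stand-in for a real labelled corpus.
train = [
    ("what an amazing wonderful day", "pos"),
    ("i love this great weather", "pos"),
    ("this is a terrible awful mess", "neg"),
    ("i hate this horrible rain", "neg"),
]
train_set = [(bag_of_words(text), label) for text, label in train]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify a new sentence with the same feature extractor.
prediction = classifier.classify(bag_of_words("amazing weather today"))
```

With a real corpus, the same two steps apply: extract a feature dict per document, then train on the `(features, label)` pairs and evaluate on held-out documents.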

If the maximum score is required instead:

In [15]:
#max_value = max(ss.values())
#max_value
#max_key = [k for k, v in ss.items() if v == max_value][0]
#max_key
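
A runnable variant of the idea, assuming a score dict with the four keys that `polarity_scores` returns (`neg`, `neu`, `pos`, `compound`). The values below are made up for illustration, and `compound` is excluded because it is on a different scale from the three proportions.

```python
# Hypothetical VADER-style output for one text (values are made up).
scores = {"neg": 0.05, "neu": 0.60, "pos": 0.35, "compound": 0.62}

# Compare only the three proportions; 'compound' ranges over [-1, 1].
proportions = {key: value for key, value in scores.items() if key != "compound"}
max_key = max(proportions, key=proportions.get)
max_value = proportions[max_key]
```

In this made-up example the neutral proportion wins, which illustrates why picking the maximum of the three proportions can give a different label than thresholding the compound score.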

### Training a classifier to assign one of the Brown corpus categories to an arbitrary text:

In [16]:
from nltk.corpus import brown, stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

categories = ['adventure', 'hobbies', 'humor', 'mystery', 'romance']
#cfd = nltk.ConditionalFreqDist((genre, word) for genre in brown.categories() for word in brown.words(categories=genre))

In [17]:
fdist1 = nltk.FreqDist([lemmatizer.lemmatize(word) for word in brown.words(categories='humor')
                        if word.isalnum() and word.lower() not in stopwords.words('english')])

In [18]:
fdist1.most_common(10)

[('said', 87),
 ('one', 65),
 ('would', 56),
 ('time', 50),
 ('thing', 40),
 ('even', 38),
 ('like', 34),
 ('could', 30),
 ('way', 29),
 ('year', 29)]

In [19]:
fdist2 = nltk.FreqDist([lemmatizer.lemmatize(word) for word in brown.words(categories='mystery')
                        if word.isalnum() and word.lower() not in stopwords.words('english')])

In [20]:
fdist2.most_common(10)

[('said', 202),
 ('would', 186),
 ('one', 175),
 ('back', 157),
 ('could', 141),
 ('like', 136),
 ('man', 106),
 ('get', 101),
 ('know', 93),
 ('time', 87)]
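
The two top-10 lists overlap heavily, which suggests that raw frequency alone will not separate the categories well. The short sketch below compares the lists printed above; only the words are copied from those outputs, and the variable names are made up for this example.

```python
# Top-10 words copied from the two outputs above (counts omitted).
humor_top = ['said', 'one', 'would', 'time', 'thing',
             'even', 'like', 'could', 'way', 'year']
mystery_top = ['said', 'would', 'one', 'back', 'could',
               'like', 'man', 'get', 'know', 'time']

# Words common to both lists carry little category information.
shared = [word for word in humor_top if word in mystery_top]
humor_only = [word for word in humor_top if word not in mystery_top]
mystery_only = [word for word in mystery_top if word not in humor_top]
```

Six of the ten words are shared, so features for the category classifier would probably benefit from a larger vocabulary or a weighting that discounts words common to all categories.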

In [21]:
dataset = []

# Pair each document's words with its category, for the selected categories only
for category in categories:
    for fileid in brown.fileids(categories=category):
        dataset.append((brown.words(fileids=fileid), category))
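
Before a classifier can be trained, each `(words, category)` pair has to be turned into a feature dict. One possible shape is sketched below on toy data; `build_vocabulary`, `document_features`, the vocabulary size and the two toy documents are assumptions for illustration, not the notebook's method.

```python
from collections import Counter


def build_vocabulary(labeled_docs, size=2000):
    """Collect the `size` most frequent lowercase alphanumeric words."""
    counts = Counter(
        word.lower()
        for words, _category in labeled_docs
        for word in words
        if word.isalnum()
    )
    return [word for word, _count in counts.most_common(size)]


def document_features(words, vocabulary):
    """Mark which vocabulary words occur in one document."""
    present = {word.lower() for word in words}
    return {f"contains({word})": word in present for word in vocabulary}


# Hypothetical stand-in for the (words, category) pairs built above.
toy_dataset = [
    (["The", "detective", "found", "a", "clue", "."], "mystery"),
    (["She", "kissed", "him", "under", "the", "moon", "."], "romance"),
]
vocabulary = build_vocabulary(toy_dataset, size=5)
featuresets = [
    (document_features(words, vocabulary), category)
    for words, category in toy_dataset
]
```

The resulting `featuresets` list has the `(feature_dict, label)` form that NLTK classifiers accept for training; the same `document_features` call would later be applied to a song's lyrics.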

In [22]:
#dataset

### Extracting features from lyrics:

In [23]:
#tokens = [tokenize.sent_tokenize(x) for x in data['lyrics']]

In [24]:
#text = tokens[0][8]
#text
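
Lyrics in this dataset are separated by line breaks (`\r\n` or `\n`) rather than sentence punctuation, so splitting on lines may be a more natural unit than sentence tokenization. The sketch below is one possible approach; `lyric_lines` and the sample string are hypothetical.

```python
def lyric_lines(lyrics):
    """Split lyrics on any line break and drop empty lines."""
    return [line.strip() for line in lyrics.splitlines() if line.strip()]


# Hypothetical sample mixing Windows and Unix line endings.
sample = "First line\r\nSecond line\n\nThird line"
lines = lyric_lines(sample)
```

`str.splitlines` handles both kinds of line ending, so no separate normalisation step is needed before splitting.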