# Song Genre Classification
## Part 2: Feature Extraction
- <code>Syntactic information:</code>
    - **mean tree depth:** mean depth of the longest path from root to any leaf
- <code>Surface information:</code>
    - **length:** lyrics' length in words
    - **mean line lenth:** mean length of each line in a song
- <code>Semantic information:</code>
    - **swear words:** 0/1 value if song lyrics contain profanities
    - **NER:** persentage of named entities in song lyrics
    - **type-token ratio:** unique words/all words
    - **n-fram ratio:** unique n-grams/all n-grams, where $n \leq 3$
    - **slang:** percentage of slang words in song lyrics
    - **pronouns:** egocenticity of the text (*1-to-2* person, *1sg-to-other* person)
    - **pos:** persentage of NOUNs, VERBs, ADJectives and PRONouns in a song (actions vs. feelings)

In [1]:
import pandas as pd
import numpy as np
from conllu import parse
from ufal.udpipe import Model, Pipeline
from tqdm.auto import tqdm
import matplotlib.pyplot as plt

In [45]:
data = pd.read_csv('./data/final_data.csv')

In [4]:
data

Unnamed: 0,subset,author,song,lyrics,genre,tokenized,length
0,tr,bob dylan,Black Cross (Live),This is the story of Hezekiah Jones...\n\nHeze...,Folk,This is the story of Hezekiah Jones ... \n\n H...,144
1,tr,isaiah rashad,unk174,"My niggas die for it, she got that pussy juice...",Rap,"My niggas die for it , she got that pussy juic...",107
2,tr,jamie lawson,a darkness,"there 's a darkness in between us , a darkness...",Pop,"there 's a darkness in between us , a darkness...",363
3,tr,aap rock,unk83,Tell me why these little niggas talking like t...,Rap,Tell me why these little niggas talking like t...,110
4,tr,"pointer sisters, the",shut up and dance,"hey you , you wanna dance ? , all i want to do...",Pop,"hey you , you wanna dance ? , all i want to do...",508
...,...,...,...,...,...,...,...
20578,va,roy rogers & dale evans,Cool Water,Cool Water\nAll day I faced the barren waste\n...,Blues,Cool Water \n All day I faced the barren waste...,72
20579,va,roy rogers & dale evans,Home on the Range,"Oh, give me a home where the buffalo roam\nWhe...",Blues,"Oh , give me a home where the buffalo roam \n ...",61
20580,va,roy rogers & dale evans,Remember Whose Birthday It Is / Happy Birthday...,Happy birthday to you ...\nThis is for you\nI ...,Blues,Happy birthday to you ... \n This is for you \...,22
20581,va,roy rogers & dale evans,Happy Trails,"Happy trails to you, until we meet again\nHapp...",Blues,"Happy trails to you , until we meet again \n H...",49


## 1. Tree depth

In [None]:
UDPIPE_MODEL_FN = "model_ru.udpipe"
!wget -O {UDPIPE_MODEL_FN} https://github.com/jwijffels/udpipe.models.ud.2.0/blob/master/inst/udpipe-ud-2.0-170801/english-ud-2.0-170801.udpipe?raw=true

In [None]:
model = Model.load(UDPIPE_MODEL_FN)

In [None]:
pipeline = Pipeline(model, 'generic_tokenizer', '', '', '')
example = "If I were a sailboat I would sail you to the shore."
text_analysis_str = pipeline.process(example)

In [None]:
!pip install udapi

In [None]:
from udapi.block.read.conllu import Conllu
from udapi.block.write.textmodetrees import TextModeTrees
from io import StringIO


def tree_depth(tree):
    depth = 0
    stack = [tree]
    while (len(stack)):
        curr_node = stack[0]
        stack.pop(0)
        if curr_node.children:
            depth += 1
        for node in range(len(curr_node.children) -1, -1, -1):
            stack.insert(0, curr_node.children[node])
    return depth


def get_depth(sentence):
    tree = Conllu(filehandle=StringIO(sentence)).read_tree()
    return tree_depth(tree)

In [None]:
def mean_depth(song):
    song = song.replace('\\', '')
    all_sentences = [pipeline.process(sent) for sent in tokenize.sent_tokenize(song)]
    all_depths = [get_depth(i) for i in all_sentences]
    return np.mean(all_depths)

In [None]:
depths = []

In [None]:
for line in tqdm(data['lyrics'], total=45437):
    if type(line) != str:
        depths.append(None)
        continue       
    try:
        depths.append(mean_depth(line))
    except:
        depths.append(None)

In [None]:
data['tree_depth'] = depths

In [13]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,tree_depth
0,tr,bob dylan,Black Cross (Live),Folk,This is the story of Hezekiah Jones...\n\nHeze...,This is the story of Hezekiah Jones ... \n\n H...,144,8.0
1,tr,isaiah rashad,unk174,Rap,"My niggas die for it, she got that pussy juice...","My niggas die for it , she got that pussy juic...",107,44.0
2,tr,jamie lawson,a darkness,Pop,"there 's a darkness in between us , a darkness...","there 's a darkness in between us , a darkness...",363,117.0
3,tr,aap rock,unk83,Rap,Tell me why these little niggas talking like t...,Tell me why these little niggas talking like t...,110,19.0
4,tr,"pointer sisters, the",shut up and dance,Pop,"hey you , you wanna dance ? , all i want to do...","hey you , you wanna dance ? , all i want to do...",508,15.538462


## 2. POS tags

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter


pos_counts = []
for text in tqdm(data['lyrics']):
    pos_count = Counter([j for i,j in pos_tag(word_tokenize(text))])
    vector = [pos_count['NN'], pos_count['VB'], pos_count['JJ'], pos_count['PRP']]
    pos_counts.append(np.array(vector))

In [None]:
data['pos_count'] = pos_counts

In [None]:
def split_pos(data):
    columns = data.columns.tolist()
    columns.remove('pos_count')
    new = []
    for ind, sent in tqdm(data.iterrows(), total=len(data)):
        song = []
        for i in columns:
            song.append(sent[i])
        song.extend(sent['pos_count'].strip('[]').split())   
        new.append(song)
    columns.extend(['NOUN', 'VERB', 'ADJ', 'PRON'])
    return pd.DataFrame(new, columns=columns)

In [None]:
data = split_pos(data)

In [14]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,tree_depth,NOUN,VERB,ADJ,PRON
0,tr,bob dylan,Black Cross (Live),Folk,This is the story of Hezekiah Jones...\n\nHeze...,This is the story of Hezekiah Jones ... \n\n H...,144,8.0,12,2,8,19
1,tr,isaiah rashad,unk174,Rap,"My niggas die for it, she got that pussy juice...","My niggas die for it , she got that pussy juic...",107,44.0,21,8,5,16
2,tr,jamie lawson,a darkness,Pop,"there 's a darkness in between us , a darkness...","there 's a darkness in between us , a darkness...",363,117.0,61,21,14,27
3,tr,aap rock,unk83,Rap,Tell me why these little niggas talking like t...,Tell me why these little niggas talking like t...,110,19.0,12,6,3,20
4,tr,"pointer sisters, the",shut up and dance,Pop,"hey you , you wanna dance ? , all i want to do...","hey you , you wanna dance ? , all i want to do...",508,15.538462,74,39,29,23


## 3. NER

In [59]:
import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
ne_ratio = []
for text in tqdm(data['lyrics']):
    doc = nlp(text)
    names = len(doc.ents) / len([n.lemma_ for n in doc if not n.is_punct])
    ne_ratio.append(names)

In [None]:
data['ne_ratio'] = ne_ratio

In [15]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,tree_depth,NOUN,VERB,ADJ,PRON,ne_ratio
0,tr,bob dylan,Black Cross (Live),Folk,This is the story of Hezekiah Jones...\n\nHeze...,This is the story of Hezekiah Jones ... \n\n H...,144,8.0,12,2,8,19,0.037879
1,tr,isaiah rashad,unk174,Rap,"My niggas die for it, she got that pussy juice...","My niggas die for it , she got that pussy juic...",107,44.0,21,8,5,16,0.018349
2,tr,jamie lawson,a darkness,Pop,"there 's a darkness in between us , a darkness...","there 's a darkness in between us , a darkness...",363,117.0,61,21,14,27,0.00339
3,tr,aap rock,unk83,Rap,Tell me why these little niggas talking like t...,Tell me why these little niggas talking like t...,110,19.0,12,6,3,20,0.039604
4,tr,"pointer sisters, the",shut up and dance,Pop,"hey you , you wanna dance ? , all i want to do...","hey you , you wanna dance ? , all i want to do...",508,15.538462,74,39,29,23,0.044855


## 4-5. Type-token & N-gram ratio

In [62]:
from nltk.stem import WordNetLemmatizer
from nltk import ngrams
from string import punctuation

wnl = WordNetLemmatizer()

In [63]:
def count_token_ratio(text):
    length = len(text)
    unique_length = len(set(text))
    ratio = unique_length/length
    return ratio

In [64]:
def count_ngram(text):
    bigrams = list(ngrams(text, 2))
    trigrams = list(ngrams(text, 3))
    tokens = text + bigrams + trigrams
    unique = len(set(tokens))
    length = len(tokens)
    return unique/length

In [76]:
ratio = []
ngrams_ratio = []

In [77]:
for ind, song in tqdm(data.iterrows(), total=len(data)):
    text = [wnl.lemmatize(w) for w in song['tokenized'].split() if w not in punctuation]
    ratio.append(count_token_ratio(text))
    ngrams_ratio.append(count_ngram(text))

HBox(children=(FloatProgress(value=0.0, max=20582.0), HTML(value='')))




In [82]:
data['type_token'] = ratio
data['ngram_ratio'] = ngrams_ratio

In [16]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,tree_depth,NOUN,VERB,ADJ,PRON,ne_ratio,type_token,ngram_ratio
0,tr,bob dylan,Black Cross (Live),Folk,This is the story of Hezekiah Jones...\n\nHeze...,This is the story of Hezekiah Jones ... \n\n H...,144,8.0,12,2,8,19,0.037879,0.566929,0.809524
1,tr,isaiah rashad,unk174,Rap,"My niggas die for it, she got that pussy juice...","My niggas die for it , she got that pussy juic...",107,44.0,21,8,5,16,0.018349,0.5,0.649485
2,tr,jamie lawson,a darkness,Pop,"there 's a darkness in between us , a darkness...","there 's a darkness in between us , a darkness...",363,117.0,61,21,14,27,0.00339,0.39322,0.603175
3,tr,aap rock,unk83,Rap,Tell me why these little niggas talking like t...,Tell me why these little niggas talking like t...,110,19.0,12,6,3,20,0.039604,0.741935,0.898551
4,tr,"pointer sisters, the",shut up and dance,Pop,"hey you , you wanna dance ? , all i want to do...","hey you , you wanna dance ? , all i want to do...",508,15.538462,74,39,29,23,0.044855,0.30343,0.489418


## 6. Slang

In [87]:
from nltk.corpus import words
nltk.download('words')

[nltk_data] Downloading package words to /Users/katya/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [89]:
slang = []

In [None]:
for ind, song in tqdm(data.iterrows(), total=len(data)):
    word_list = set([w for w in song['tokenized'] if w not in punctuation])
    length = len(word_list)
    counter = len(set.intersection(word_list, words.words()))
    slang.append(counter/length)

HBox(children=(FloatProgress(value=0.0, max=20582.0), HTML(value='')))

In [None]:
data['slang'] = slang

In [17]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,tree_depth,NOUN,VERB,ADJ,PRON,ne_ratio,type_token,ngram_ratio,slang
0,tr,bob dylan,Black Cross (Live),Folk,This is the story of Hezekiah Jones...\n\nHeze...,This is the story of Hezekiah Jones ... \n\n H...,144,8.0,12,2,8,19,0.037879,0.566929,0.809524,0.941176
1,tr,isaiah rashad,unk174,Rap,"My niggas die for it, she got that pussy juice...","My niggas die for it , she got that pussy juic...",107,44.0,21,8,5,16,0.018349,0.5,0.649485,0.911765
2,tr,jamie lawson,a darkness,Pop,"there 's a darkness in between us , a darkness...","there 's a darkness in between us , a darkness...",363,117.0,61,21,14,27,0.00339,0.39322,0.603175,0.956522
3,tr,aap rock,unk83,Rap,Tell me why these little niggas talking like t...,Tell me why these little niggas talking like t...,110,19.0,12,6,3,20,0.039604,0.741935,0.898551,0.878049
4,tr,"pointer sisters, the",shut up and dance,Pop,"hey you , you wanna dance ? , all i want to do...","hey you , you wanna dance ? , all i want to do...",508,15.538462,74,39,29,23,0.044855,0.30343,0.489418,0.96


## 7-8. Pronouns

In [92]:
self_to_nonself_referensing = []
first_to_second_person = []

In [96]:
for text in tqdm(data['tokenized'], total=len(data)):
    pronouns = Counter()
    for word in text.split():
        word = word.lower()
        if word == 'i' or word == 'me' or word == 'my' or word == 'mine':
            pronouns['first_person_singular'] += 1
        if word == 'we' or word == 'us' or word == 'our' or word == 'ours':
            pronouns['first_person_plural'] += 1
        if word == 'you' or word == 'your' or word == 'yours':
            pronouns['second_person'] += 1
        if word == 'she' or word == 'her' or word == 'hers':
            pronouns['third_person'] += 1
        if word == 'he' or word == 'his' or word == 'him':
            pronouns['third_person'] += 1
        if word == 'it' or word == 'its':
            pronouns['third_person'] += 1
        if word == 'they' or word == 'their' or word == 'theirs' or word == 'them':
            pronouns['third_person'] += 1
    first_person_all = pronouns['first_person_singular'] + pronouns['first_person_plural']
    notfirst_all = pronouns['second_person'] + pronouns['third_person']
    if first_person_all is not 0 and notfirst_all is not 0:
        self_to_nonself = first_person_all / notfirst_all
    else:
        self_to_nonself = 0
    self_to_nonself_referensing.append(round(self_to_nonself, 3))
    if pronouns['first_person_singular'] is not 0 and pronouns['second_person'] is not 0:
        first_to_second = pronouns['first_person_singular'] / pronouns['second_person']
    else:
        first_to_second = 0
    first_to_second_person.append(round(first_to_second, 3))

HBox(children=(FloatProgress(value=0.0, max=20582.0), HTML(value='')))




In [97]:
data['pron_self'] = self_to_nonself_referensing
data['pron_first_second'] = first_to_second_person

In [18]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,tree_depth,NOUN,VERB,ADJ,PRON,ne_ratio,type_token,ngram_ratio,slang,pron_self,pron_first_second
0,tr,bob dylan,Black Cross (Live),Folk,This is the story of Hezekiah Jones...\n\nHeze...,This is the story of Hezekiah Jones ... \n\n H...,144,8.0,12,2,8,19,0.037879,0.566929,0.809524,0.941176,0.053,0.0
1,tr,isaiah rashad,unk174,Rap,"My niggas die for it, she got that pussy juice...","My niggas die for it , she got that pussy juic...",107,44.0,21,8,5,16,0.018349,0.5,0.649485,0.911765,0.8,4.0
2,tr,jamie lawson,a darkness,Pop,"there 's a darkness in between us , a darkness...","there 's a darkness in between us , a darkness...",363,117.0,61,21,14,27,0.00339,0.39322,0.603175,0.956522,1.0,1.0
3,tr,aap rock,unk83,Rap,Tell me why these little niggas talking like t...,Tell me why these little niggas talking like t...,110,19.0,12,6,3,20,0.039604,0.741935,0.898551,0.878049,0.727,2.667
4,tr,"pointer sisters, the",shut up and dance,Pop,"hey you , you wanna dance ? , all i want to do...","hey you , you wanna dance ? , all i want to do...",508,15.538462,74,39,29,23,0.044855,0.30343,0.489418,0.96,1.294,2.0


## 9-10. Length features
- lyrics' length in words
- mean line length of each song

In [103]:
from statistics import mean
tt = str.maketrans(dict.fromkeys(punctuation))

In [104]:
length_in_words = []

In [105]:
for text in tqdm(data['tokenized']):
    text_without_punc = text.translate(tt)
    best_text = text_without_punc.replace(' \n', '').split(' ')
    for thing in best_text:
        if thing == '':
            best_text.remove(thing)
    word_count = len(best_text)
    length_in_words.append(word_count)

HBox(children=(FloatProgress(value=0.0, max=20582.0), HTML(value='')))




In [107]:
data['words_length'] = length_in_words

In [110]:
string_mean_length = []

In [111]:
for text in tqdm(data['tokenized']):
    string_length = []
    text.replace(' \' ', '')
    if ' \n' in text:
        splited_text = text.split(' \n ')
    else:
        splited_text = text.split(' , ')
    for string in splited_text:
        string_without_punc = string.translate(tt)
        string_without_punc.replace('\' ', '')
        splited_string = string_without_punc.split(' ')
    for thing in splited_string:
        if thing == '':
            splited_string.remove(thing)
        word_count = len(splited_string)
        string_length.append(word_count)
    string_mean_length.append(round(mean(string_length), 3))

HBox(children=(FloatProgress(value=0.0, max=20582.0), HTML(value='')))




In [112]:
data['lines_length'] = string_mean_length

In [19]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,tree_depth,NOUN,VERB,ADJ,PRON,ne_ratio,type_token,ngram_ratio,slang,pron_self,pron_first_second,words_length,lines_length
0,tr,bob dylan,Black Cross (Live),Folk,This is the story of Hezekiah Jones...\n\nHeze...,This is the story of Hezekiah Jones ... \n\n H...,144,8.0,12,2,8,19,0.037879,0.566929,0.809524,0.941176,0.053,0.0,124,11.273
1,tr,isaiah rashad,unk174,Rap,"My niggas die for it, she got that pussy juice...","My niggas die for it , she got that pussy juic...",107,44.0,21,8,5,16,0.018349,0.5,0.649485,0.911765,0.8,4.0,98,8.167
2,tr,jamie lawson,a darkness,Pop,"there 's a darkness in between us , a darkness...","there 's a darkness in between us , a darkness...",363,117.0,61,21,14,27,0.00339,0.39322,0.603175,0.956522,1.0,1.0,295,4.47
3,tr,aap rock,unk83,Rap,Tell me why these little niggas talking like t...,Tell me why these little niggas talking like t...,110,19.0,12,6,3,20,0.039604,0.741935,0.898551,0.878049,0.727,2.667,93,10.333
4,tr,"pointer sisters, the",shut up and dance,Pop,"hey you , you wanna dance ? , all i want to do...","hey you , you wanna dance ? , all i want to do...",508,15.538462,74,39,29,23,0.044855,0.30343,0.489418,0.96,1.294,2.0,396,3.907


## 11. Swear Words

In [50]:
! pip install better_profanity

Collecting better_profanity
  Downloading better_profanity-0.7.0-py3-none-any.whl (46 kB)
[K     |████████████████████████████████| 46 kB 485 kB/s eta 0:00:01
[?25hInstalling collected packages: better-profanity
Successfully installed better-profanity-0.7.0


In [51]:
from better_profanity import profanity

In [None]:
swears = []
for text in tqdm(data['lyrics']):
    swears.append(profanity.contains_profanity(text))
data['swear_words'] = swears

In [20]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,tree_depth,NOUN,VERB,...,PRON,ne_ratio,type_token,ngram_ratio,slang,pron_self,pron_first_second,words_length,lines_length,swear_words
0,tr,bob dylan,Black Cross (Live),Folk,This is the story of Hezekiah Jones...\n\nHeze...,This is the story of Hezekiah Jones ... \n\n H...,144,8.0,12,2,...,19,0.037879,0.566929,0.809524,0.941176,0.053,0.0,124,11.273,1
1,tr,isaiah rashad,unk174,Rap,"My niggas die for it, she got that pussy juice...","My niggas die for it , she got that pussy juic...",107,44.0,21,8,...,16,0.018349,0.5,0.649485,0.911765,0.8,4.0,98,8.167,1
2,tr,jamie lawson,a darkness,Pop,"there 's a darkness in between us , a darkness...","there 's a darkness in between us , a darkness...",363,117.0,61,21,...,27,0.00339,0.39322,0.603175,0.956522,1.0,1.0,295,4.47,0
3,tr,aap rock,unk83,Rap,Tell me why these little niggas talking like t...,Tell me why these little niggas talking like t...,110,19.0,12,6,...,20,0.039604,0.741935,0.898551,0.878049,0.727,2.667,93,10.333,1
4,tr,"pointer sisters, the",shut up and dance,Pop,"hey you , you wanna dance ? , all i want to do...","hey you , you wanna dance ? , all i want to do...",508,15.538462,74,39,...,23,0.044855,0.30343,0.489418,0.96,1.294,2.0,396,3.907,0


## POS Tags update
(convert absolute values to percentages)

In [23]:
data['NOUN'] = data['NOUN'] / data['words_length']
data['VERB'] = data['VERB'] / data['words_length']
data ['ADJ'] = data['ADJ'] / data['words_length']
data['PRON'] = data['PRON'] / data['words_length']

In [25]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,tree_depth,NOUN,VERB,...,PRON,ne_ratio,type_token,ngram_ratio,slang,pron_self,pron_first_second,words_length,lines_length,swear_words
0,tr,bob dylan,Black Cross (Live),Folk,This is the story of Hezekiah Jones...\n\nHeze...,This is the story of Hezekiah Jones ... \n\n H...,144,8.0,0.096774,0.016129,...,0.153226,0.037879,0.566929,0.809524,0.941176,0.053,0.0,124,11.273,1
1,tr,isaiah rashad,unk174,Rap,"My niggas die for it, she got that pussy juice...","My niggas die for it , she got that pussy juic...",107,44.0,0.214286,0.081633,...,0.163265,0.018349,0.5,0.649485,0.911765,0.8,4.0,98,8.167,1
2,tr,jamie lawson,a darkness,Pop,"there 's a darkness in between us , a darkness...","there 's a darkness in between us , a darkness...",363,117.0,0.20678,0.071186,...,0.091525,0.00339,0.39322,0.603175,0.956522,1.0,1.0,295,4.47,0
3,tr,aap rock,unk83,Rap,Tell me why these little niggas talking like t...,Tell me why these little niggas talking like t...,110,19.0,0.129032,0.064516,...,0.215054,0.039604,0.741935,0.898551,0.878049,0.727,2.667,93,10.333,1
4,tr,"pointer sisters, the",shut up and dance,Pop,"hey you , you wanna dance ? , all i want to do...","hey you , you wanna dance ? , all i want to do...",508,15.538462,0.186869,0.098485,...,0.058081,0.044855,0.30343,0.489418,0.96,1.294,2.0,396,3.907,0


In [27]:
data.to_csv('./data/data.csv', index=False)