# Song Genre Classification
## Part 2: Feature Extraction
- <code>Syntactic information:</code>
    - **mean tree depth:** mean depth of the longest path from root to any leaf
- <code>Surface information:</code>
    - **length in words:** lyrics' length in words
    - **mean line lenth:** mean length of each line in a song
- <code>Semantic information:</code>
    - **swear words:** if the swear words are present in the lyrics
    - **NER:** persentage of named entities in the lyrics
    - **type-token ratio:** ratio of unique words to all words
    - **n-fram ratio:** ratio of unique n-grams to all n-grams, where $n \leq 3$
    - **slang:** percentage of slang words in song lyrics
    - **pronouns:** ratio of 1 to 2 person pronouns, 1sg (I) to other - egocenticity of the text (*1-to-2* person, *1sg-to-other* person)
    - **pos:** persentage of NOUNs, VERBs, ADJectives and PRONouns in a song (actions vs. feelings)

In [2]:
import pandas as pd
import numpy as np
from conllu import parse
from ufal.udpipe import Model, Pipeline
from tqdm.auto import tqdm
import matplotlib.pyplot as plt

In [3]:
data = pd.read_csv('./data/final_data.csv')

In [21]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length
0,tr,chance the rappe,unk228,Rap,Another weekend full of blunts and brews\nToo ...,Another weekend full of blunts and brews \n To...,166
1,tr,megan & liz,love war,Country,"i do n't believe in wizards or witches , but b...","i do n't believe in wizards or witches , but b...",210
2,tr,jamiroquai,"if i like it, i do it",Pop,"if i like it i just do it , say that we have a...","if i like it i just do it , say that we have a...",332
3,tr,drake,unk172,Rap,Done sayin' I'm done playin'\nLast time was on...,Done sayin ' I 'm done playin ' \n Last time w...,271
4,tr,j cole,unk152,Rap,This next three bars is dedicated to the retar...,This next three bars is dedicated to the retar...,121


## 1. Tree depth

In [None]:
UDPIPE_MODEL_FN = "model_ru.udpipe"
!wget -O {UDPIPE_MODEL_FN} https://github.com/jwijffels/udpipe.models.ud.2.0/blob/master/inst/udpipe-ud-2.0-170801/english-ud-2.0-170801.udpipe?raw=true

In [None]:
model = Model.load(UDPIPE_MODEL_FN)

In [None]:
pipeline = Pipeline(model, 'generic_tokenizer', '', '', '')
example = "If I were a sailboat I would sail you to the shore."
text_analysis_str = pipeline.process(example)

In [None]:
!pip install udapi

In [None]:
from udapi.block.read.conllu import Conllu
from udapi.block.write.textmodetrees import TextModeTrees
from io import StringIO


def tree_depth(tree):
    depth = 0
    stack = [tree]
    while (len(stack)):
        curr_node = stack[0]
        stack.pop(0)
        if curr_node.children:
            depth += 1
        for node in range(len(curr_node.children) -1, -1, -1):
            stack.insert(0, curr_node.children[node])
    return depth


def get_depth(sentence):
    tree = Conllu(filehandle=StringIO(sentence)).read_tree()
    return tree_depth(tree)

In [None]:
def mean_depth(song):
    song = song.replace('\\', '')
    all_sentences = [pipeline.process(sent) for sent in tokenize.sent_tokenize(song)]
    all_depths = [get_depth(i) for i in all_sentences]
    return np.mean(all_depths)

In [None]:
depths = []

In [None]:
for line in tqdm(data['lyrics'], total=45437):
    if type(line) != str:
        depths.append(None)
        continue       
    try:
        depths.append(mean_depth(line))
    except:
        depths.append(None)

In [None]:
data['mean_depth'] = depths

In [None]:
data

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,mean_depth
0,tr,chance the rappe,unk228,Rap,Another weekend full of blunts and brews\nToo ...,Another weekend full of blunts and brews \n To...,166,68.0
1,tr,megan & liz,love war,Country,"i do n't believe in wizards or witches , but b...","i do n't believe in wizards or witches , but b...",210,13.2
2,tr,jamiroquai,"if i like it, i do it",Pop,"if i like it i just do it , say that we have a...","if i like it i just do it , say that we have a...",332,29.75
3,tr,drake,unk172,Rap,Done sayin' I'm done playin'\nLast time was on...,Done sayin ' I 'm done playin ' \n Last time w...,271,33.666667
4,tr,j cole,unk152,Rap,This next three bars is dedicated to the retar...,This next three bars is dedicated to the retar...,121,48.0


## 2. POS tags

In [None]:
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter


pos_counts = []
for text in tqdm(data['lyrics']):
    pos_count = Counter([j for i,j in pos_tag(word_tokenize(text))])
    vector = [pos_count['NN'], pos_count['VB'], pos_count['JJ'], pos_count['PRP']]
    pos_counts.append(np.array(vector))

In [None]:
data['pos_count'] = pos_counts

In [None]:
def split_pos(data):
    columns = data.columns.tolist()
    columns.remove('pos_count')
    new = []
    for ind, sent in tqdm(data.iterrows(), total=len(data)):
        song = []
        for i in columns:
            song.append(sent[i])
        song.extend(sent['pos_count'].strip('[]').split())   
        new.append(song)
    columns.extend(['NOUN', 'VERB', 'ADJ', 'PRON'])
    return pd.DataFrame(new, columns=columns)

In [None]:
data = split_pos(data)

In [None]:
data

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,mean_depth,NOUN,VERB,ADJ,PRON
0,tr,chance the rappe,unk228,Rap,Another weekend full of blunts and brews\nToo ...,Another weekend full of blunts and brews \n To...,166,68.0,22,5,6,10
1,tr,megan & liz,love war,Country,"i do n't believe in wizards or witches , but b...","i do n't believe in wizards or witches , but b...",210,13.2,15,14,12,18
2,tr,jamiroquai,"if i like it, i do it",Pop,"if i like it i just do it , say that we have a...","if i like it i just do it , say that we have a...",332,29.75,41,27,16,31
3,tr,drake,unk172,Rap,Done sayin' I'm done playin'\nLast time was on...,Done sayin ' I 'm done playin ' \n Last time w...,271,33.666667,24,12,10,34
4,tr,j cole,unk152,Rap,This next three bars is dedicated to the retar...,This next three bars is dedicated to the retar...,121,48.0,5,4,9,16


## 3. NER

In [None]:
import spacy
from spacy import displacy
import en_core_web_sm
nlp = en_core_web_sm.load()

In [None]:
ne_ratio = []
for text in tqdm(data['lyrics']):
    doc = nlp(text)
    names = len(doc.ents) / len([n.lemma_ for n in doc if not n.is_punct])
    ne_ratio.append(names)

In [None]:
data['ne_ratio'] = ne_ratio

In [None]:
data

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,mean_depth,NOUN,VERB,ADJ,PRON,ne_ratio
0,tr,chance the rappe,unk228,Rap,Another weekend full of blunts and brews\nToo ...,Another weekend full of blunts and brews \n To...,166,68.0,22,5,6,10,0.060606
1,tr,megan & liz,love war,Country,"i do n't believe in wizards or witches , but b...","i do n't believe in wizards or witches , but b...",210,13.2,15,14,12,18,0.0
2,tr,jamiroquai,"if i like it, i do it",Pop,"if i like it i just do it , say that we have a...","if i like it i just do it , say that we have a...",332,29.75,41,27,16,31,0.0
3,tr,drake,unk172,Rap,Done sayin' I'm done playin'\nLast time was on...,Done sayin ' I 'm done playin ' \n Last time w...,271,33.666667,24,12,10,34,0.050193
4,tr,j cole,unk152,Rap,This next three bars is dedicated to the retar...,This next three bars is dedicated to the retar...,121,48.0,5,4,9,16,0.05303


## 4-5. Type-token & N-gram ratio

In [None]:
from nltk.stem import WordNetLemmatizer
from nltk import ngrams
from string import punctuation

wnl = WordNetLemmatizer()

In [None]:
def count_token_ratio(text):
    length = len(text)
    unique_length = len(set(text))
    ratio = unique_length/length
    return ratio

In [None]:
def count_ngram(text):
    bigrams = list(ngrams(text, 2))
    trigrams = list(ngrams(text, 3))
    tokens = text + bigrams + trigrams
    unique = len(set(tokens))
    length = len(tokens)
    return unique/length

In [None]:
ratio = []
ngrams_ratio = []

In [None]:
for ind, song in tqdm(data.iterrows(), total=len(data)):
    text = [wnl.lemmatize(w) for w in song['tokenized'].split() if w not in punctuation]
    ratio.append(count_token_ratio(text))
    ngrams_ratio.append(count_ngram(text))

HBox(children=(FloatProgress(value=0.0, max=20582.0), HTML(value='')))




In [None]:
data['type_token'] = ratio
data['ngram_ratio'] = ngrams_ratio

In [None]:
data

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,mean_depth,NOUN,VERB,ADJ,PRON,ne_ratio,type_token,ngram_ratio
0,tr,chance the rappe,unk228,Rap,Another weekend full of blunts and brews\nToo ...,Another weekend full of blunts and brews \n To...,166,68.0,22,5,6,10,0.060606,0.721088,0.892694
1,tr,megan & liz,love war,Country,"i do n't believe in wizards or witches , but b...","i do n't believe in wizards or witches , but b...",210,13.2,15,14,12,18,0.0,0.488095,0.732535
2,tr,jamiroquai,"if i like it, i do it",Pop,"if i like it i just do it , say that we have a...","if i like it i just do it , say that we have a...",332,29.75,41,27,16,31,0.0,0.485915,0.729093
3,tr,drake,unk172,Rap,Done sayin' I'm done playin'\nLast time was on...,Done sayin ' I 'm done playin ' \n Last time w...,271,33.666667,24,12,10,34,0.050193,0.589958,0.826331
4,tr,j cole,unk152,Rap,This next three bars is dedicated to the retar...,This next three bars is dedicated to the retar...,121,48.0,5,4,9,16,0.05303,0.737288,0.903134


## 6. Slang

In [None]:
from nltk.corpus import words
nltk.download('words')

[nltk_data] Downloading package words to /Users/katya/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [None]:
slang = []

In [None]:
for ind, song in tqdm(data.iterrows(), total=len(data)):
    word_list = set([w for w in song['tokenized'] if w not in punctuation])
    length = len(word_list)
    counter = len(set.intersection(word_list, words.words()))
    slang.append(counter/length)

HBox(children=(FloatProgress(value=0.0, max=20582.0), HTML(value='')))

In [None]:
data['slang'] = slang

In [None]:
data

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,mean_depth,NOUN,VERB,ADJ,PRON,ne_ratio,type_token,ngram_ratio,slang
0,tr,chance the rappe,unk228,Rap,Another weekend full of blunts and brews\nToo ...,Another weekend full of blunts and brews \n To...,166,68.0,22,5,6,10,0.060606,0.721088,0.892694,0.906977
1,tr,megan & liz,love war,Country,"i do n't believe in wizards or witches , but b...","i do n't believe in wizards or witches , but b...",210,13.2,15,14,12,18,0.0,0.488095,0.732535,0.96
2,tr,jamiroquai,"if i like it, i do it",Pop,"if i like it i just do it , say that we have a...","if i like it i just do it , say that we have a...",332,29.75,41,27,16,31,0.0,0.485915,0.729093,0.96
3,tr,drake,unk172,Rap,Done sayin' I'm done playin'\nLast time was on...,Done sayin ' I 'm done playin ' \n Last time w...,271,33.666667,24,12,10,34,0.050193,0.589958,0.826331,0.878049
4,tr,j cole,unk152,Rap,This next three bars is dedicated to the retar...,This next three bars is dedicated to the retar...,121,48.0,5,4,9,16,0.05303,0.737288,0.903134,0.941176


## 7-8. Pronouns

In [None]:
self_to_nonself_referensing = []
first_to_second_person = []

In [None]:
for text in tqdm(data['tokenized'], total=len(data)):
    pronouns = Counter()
    for word in text.split():
        word = word.lower()
        if word == 'i' or word == 'me' or word == 'my' or word == 'mine':
            pronouns['first_person_singular'] += 1
        if word == 'we' or word == 'us' or word == 'our' or word == 'ours':
            pronouns['first_person_plural'] += 1
        if word == 'you' or word == 'your' or word == 'yours':
            pronouns['second_person'] += 1
        if word == 'she' or word == 'her' or word == 'hers':
            pronouns['third_person'] += 1
        if word == 'he' or word == 'his' or word == 'him':
            pronouns['third_person'] += 1
        if word == 'it' or word == 'its':
            pronouns['third_person'] += 1
        if word == 'they' or word == 'their' or word == 'theirs' or word == 'them':
            pronouns['third_person'] += 1
    first_person_all = pronouns['first_person_singular'] + pronouns['first_person_plural']
    notfirst_all = pronouns['second_person'] + pronouns['third_person']
    if first_person_all is not 0 and notfirst_all is not 0:
        self_to_nonself = first_person_all / notfirst_all
    else:
        self_to_nonself = 0
    self_to_nonself_referensing.append(round(self_to_nonself, 3))
    if pronouns['first_person_singular'] is not 0 and pronouns['second_person'] is not 0:
        first_to_second = pronouns['first_person_singular'] / pronouns['second_person']
    else:
        first_to_second = 0
    first_to_second_person.append(round(first_to_second, 3))

HBox(children=(FloatProgress(value=0.0, max=20582.0), HTML(value='')))




In [None]:
data['pronouns_self_to_nonself'] = self_to_nonself_referensing
data['pronouns_first_to_second'] = first_to_second_person

In [None]:
data

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,mean_depth,NOUN,VERB,ADJ,PRON,ne_ratio,type_token,ngram_ratio,slang,pronouns_self_to_nonself,pronouns_first_to_second
0,tr,chance the rappe,unk228,Rap,Another weekend full of blunts and brews\nToo ...,Another weekend full of blunts and brews \n To...,166,68.0,22,5,6,10,0.060606,0.721088,0.892694,0.906977,2.25,4.0
1,tr,megan & liz,love war,Country,"i do n't believe in wizards or witches , but b...","i do n't believe in wizards or witches , but b...",210,13.2,15,14,12,18,0.0,0.488095,0.732535,0.96,1.615,2.0
2,tr,jamiroquai,"if i like it, i do it",Pop,"if i like it i just do it , say that we have a...","if i like it i just do it , say that we have a...",332,29.75,41,27,16,31,0.0,0.485915,0.729093,0.96,1.207,3.444
3,tr,drake,unk172,Rap,Done sayin' I'm done playin'\nLast time was on...,Done sayin ' I 'm done playin ' \n Last time w...,271,33.666667,24,12,10,34,0.050193,0.589958,0.826331,0.878049,1.353,11.5
4,tr,j cole,unk152,Rap,This next three bars is dedicated to the retar...,This next three bars is dedicated to the retar...,121,48.0,5,4,9,16,0.05303,0.737288,0.903134,0.941176,2.2,11.0


## 9. Length

In [None]:
from statistics import mean
tt = str.maketrans(dict.fromkeys(punctuation))

In [None]:
length_in_words = []

In [None]:
for text in tqdm(data['tokenized']):
    text_without_punc = text.translate(tt)
    best_text = text_without_punc.replace(' \n', '').split(' ')
    for thing in best_text:
        if thing == '':
            best_text.remove(thing)
    word_count = len(best_text)
    length_in_words.append(word_count)

HBox(children=(FloatProgress(value=0.0, max=20582.0), HTML(value='')))




In [None]:
data['length_in_words'] = length_in_words

In [None]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,mean_depth,NOUN,VERB,ADJ,PRON,ne_ratio,type_token,ngram_ratio,slang,pronouns_self_to_nonself,pronouns_first_to_second,length_in_words
0,tr,chance the rappe,unk228,Rap,Another weekend full of blunts and brews\nToo ...,Another weekend full of blunts and brews \n To...,166,68.0,22,5,6,10,0.060606,0.721088,0.892694,0.906977,2.25,4.0,147
1,tr,megan & liz,love war,Country,"i do n't believe in wizards or witches , but b...","i do n't believe in wizards or witches , but b...",210,13.2,15,14,12,18,0.0,0.488095,0.732535,0.96,1.615,2.0,171
2,tr,jamiroquai,"if i like it, i do it",Pop,"if i like it i just do it , say that we have a...","if i like it i just do it , say that we have a...",332,29.75,41,27,16,31,0.0,0.485915,0.729093,0.96,1.207,3.444,285
3,tr,drake,unk172,Rap,Done sayin' I'm done playin'\nLast time was on...,Done sayin ' I 'm done playin ' \n Last time w...,271,33.666667,24,12,10,34,0.050193,0.589958,0.826331,0.878049,1.353,11.5,243
4,tr,j cole,unk152,Rap,This next three bars is dedicated to the retar...,This next three bars is dedicated to the retar...,121,48.0,5,4,9,16,0.05303,0.737288,0.903134,0.941176,2.2,11.0,118


## 10. Mean line length

In [None]:
string_mean_length = []

In [None]:
for text in tqdm(data['tokenized']):
    string_length = []
    text.replace(' \' ', '')
    if ' \n' in text:
        splited_text = text.split(' \n ')
    else:
        splited_text = text.split(' , ')
    for string in splited_text:
        string_without_punc = string.translate(tt)
        string_without_punc.replace('\' ', '')
        splited_string = string_without_punc.split(' ')
    for thing in splited_string:
        if thing == '':
            splited_string.remove(thing)
        word_count = len(splited_string)
        string_length.append(word_count)
    string_mean_length.append(round(mean(string_length), 3))

HBox(children=(FloatProgress(value=0.0, max=20582.0), HTML(value='')))




In [None]:
data['string_mean_length'] = string_mean_length

In [None]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,mean_depth,NOUN,VERB,ADJ,PRON,ne_ratio,type_token,ngram_ratio,slang,pronouns_self_to_nonself,pronouns_first_to_second,length_in_words,string_mean_length
0,tr,chance the rappe,unk228,Rap,Another weekend full of blunts and brews\nToo ...,Another weekend full of blunts and brews \n To...,166,68.0,22,5,6,10,0.060606,0.721088,0.892694,0.906977,2.25,4.0,147,9.778
1,tr,megan & liz,love war,Country,"i do n't believe in wizards or witches , but b...","i do n't believe in wizards or witches , but b...",210,13.2,15,14,12,18,0.0,0.488095,0.732535,0.96,1.615,2.0,171,10.3
2,tr,jamiroquai,"if i like it, i do it",Pop,"if i like it i just do it , say that we have a...","if i like it i just do it , say that we have a...",332,29.75,41,27,16,31,0.0,0.485915,0.729093,0.96,1.207,3.444,285,13.385
3,tr,drake,unk172,Rap,Done sayin' I'm done playin'\nLast time was on...,Done sayin ' I 'm done playin ' \n Last time w...,271,33.666667,24,12,10,34,0.050193,0.589958,0.826331,0.878049,1.353,11.5,243,18.556
4,tr,j cole,unk152,Rap,This next three bars is dedicated to the retar...,This next three bars is dedicated to the retar...,121,48.0,5,4,9,16,0.05303,0.737288,0.903134,0.941176,2.2,11.0,118,12.917


## 11. Swear Words

In [5]:
! pip install better_profanity

Collecting better_profanity
[?25l  Downloading https://files.pythonhosted.org/packages/f3/dd/0b074d89e903cc771721cde2c4bf3d8c9d114b5bd791af5c62bcf5fb9459/better_profanity-0.7.0-py3-none-any.whl (46kB)
[K     |███████                         | 10kB 15.1MB/s eta 0:00:01[K     |██████████████▏                 | 20kB 10.5MB/s eta 0:00:01[K     |█████████████████████▎          | 30kB 7.7MB/s eta 0:00:01[K     |████████████████████████████▍   | 40kB 7.2MB/s eta 0:00:01[K     |████████████████████████████████| 51kB 2.4MB/s 
[?25hInstalling collected packages: better-profanity
Successfully installed better-profanity-0.7.0


In [6]:
from better_profanity import profanity

In [9]:
swears = []
for text in tqdm(data['lyrics']):
    swears.append(profanity.contains_profanity(text))
data['swear_words'] = swears

HBox(children=(FloatProgress(value=0.0, max=20582.0), HTML(value='')))




In [17]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,mean_depth,NOUN,VERB,ADJ,PRON,ne_ratio,type_token,ngram_ratio,slang,pronouns_self_to_nonself,pronouns_first_to_second,length_in_words,string_mean_length,swear_words
0,tr,chance the rappe,unk228,Rap,Another weekend full of blunts and brews\nToo ...,Another weekend full of blunts and brews \n To...,166,68.0,22,5,6,10,0.060606,0.721088,0.892694,0.906977,2.25,4.0,147,7.737,True
1,tr,megan & liz,love war,Country,"i do n't believe in wizards or witches , but b...","i do n't believe in wizards or witches , but b...",210,13.2,15,14,12,18,0.0,0.488095,0.732535,0.96,1.615,2.0,171,5.25,True
2,tr,jamiroquai,"if i like it, i do it",Pop,"if i like it i just do it , say that we have a...","if i like it i just do it , say that we have a...",332,29.75,41,27,16,31,0.0,0.485915,0.729093,0.96,1.207,3.444,285,8.353,False
3,tr,drake,unk172,Rap,Done sayin' I'm done playin'\nLast time was on...,Done sayin ' I 'm done playin ' \n Last time w...,271,33.666667,24,12,10,34,0.050193,0.589958,0.826331,0.878049,1.353,11.5,243,12.15,True
4,tr,j cole,unk152,Rap,This next three bars is dedicated to the retar...,This next three bars is dedicated to the retar...,121,48.0,5,4,9,16,0.05303,0.737288,0.903134,0.941176,2.2,11.0,118,7.867,True


### POS Tags update
(convert absolute values to percentages)

In [18]:
data['NOUN'] = data['NOUN'] / data['length_in_words']
data['VERB'] = data['VERB'] / data['length_in_words']
data ['ADJ'] = data['ADJ'] / data['length_in_words']
data['PRON'] = data['PRON'] / data['length_in_words']

In [19]:
data.head()

Unnamed: 0,subset,author,song,genre,lyrics,tokenized,length,mean_depth,NOUN,VERB,ADJ,PRON,ne_ratio,type_token,ngram_ratio,slang,pronouns_self_to_nonself,pronouns_first_to_second,length_in_words,string_mean_length,swear_words
0,tr,chance the rappe,unk228,Rap,Another weekend full of blunts and brews\nToo ...,Another weekend full of blunts and brews \n To...,166,68.0,0.14966,0.034014,0.040816,0.068027,0.060606,0.721088,0.892694,0.906977,2.25,4.0,147,7.737,True
1,tr,megan & liz,love war,Country,"i do n't believe in wizards or witches , but b...","i do n't believe in wizards or witches , but b...",210,13.2,0.087719,0.081871,0.070175,0.105263,0.0,0.488095,0.732535,0.96,1.615,2.0,171,5.25,True
2,tr,jamiroquai,"if i like it, i do it",Pop,"if i like it i just do it , say that we have a...","if i like it i just do it , say that we have a...",332,29.75,0.14386,0.094737,0.05614,0.108772,0.0,0.485915,0.729093,0.96,1.207,3.444,285,8.353,False
3,tr,drake,unk172,Rap,Done sayin' I'm done playin'\nLast time was on...,Done sayin ' I 'm done playin ' \n Last time w...,271,33.666667,0.098765,0.049383,0.041152,0.139918,0.050193,0.589958,0.826331,0.878049,1.353,11.5,243,12.15,True
4,tr,j cole,unk152,Rap,This next three bars is dedicated to the retar...,This next three bars is dedicated to the retar...,121,48.0,0.042373,0.033898,0.076271,0.135593,0.05303,0.737288,0.903134,0.941176,2.2,11.0,118,7.867,True
