# Preliminary Analysis

Out of the scraped lyrics data, we ended up with approximately 420 usable data points, since a significant number of entries had a zero complexity score. This was mostly due to instrumental songs being scraped or to lyrics being missing. After cleaning up the data, we began to explore its features.

In [1]:
import pandas as pd
dfd = pd.read_csv("cleaned_lyrics_data.csv")
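
The cleaning step described above is not shown in this notebook. As a minimal sketch, assuming the raw data has `lyrics` and `complexity` columns like the cleaned file does, the filter could look like this. The toy rows and the names `raw`, `has_lyrics`, and `cleaned` are made up for illustration.

```python
import pandas as pd

# Toy stand-in for the raw scraped data. The column names follow the
# notebook; the rows are invented for illustration only.
raw = pd.DataFrame({
    "song": ["A", "B", "C", "D"],
    "lyrics": ["rot and ruin", None, "", "void of flesh"],
    "complexity": [3.2, 0.0, 0.0, 4.1],
})

# Keep only rows with non-empty lyrics and a non-zero complexity score.
has_lyrics = raw["lyrics"].fillna("").str.strip() != ""
cleaned = raw[has_lyrics & (raw["complexity"] > 0)].reset_index(drop=True)

print(len(cleaned), "usable rows out of", len(raw))
```

Filtering on both conditions guards against rows where lyrics exist but the score is still zero, and against rows where a score was stored without lyrics.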

## WordCount and Complexity/Swear Ratio Analysis

The first subject of interest is the frequency of occurrence of words across all collected lyrics. To get a meaningful word count specific to technical death metal, we had to exclude commonly occurring stop words from the count. We used the stopwords list from https://algs4.cs.princeton.edu/35applications/stopwords.txt and added a few entries of our own (`i`, the empty string, and `-`) that kept appearing and were clearly not informative. We then plotted both the relationship between our complexity metric and the swear words ratio and the top 10 most commonly occurring words to get some initial insight into our dataset.

In [2]:
import operator
# Read one stop word per line; the context manager closes the file for us.
with open("stopwords.txt") as stopwords_file:
    stopwords = [word.strip() for word in stopwords_file]
    
stopwords.append('i')
stopwords.append('')
stopwords.append('-')

def word_count():
    wordcount = {}
    for lyrics in dfd["lyrics"]:
        parsed = str(lyrics).split(" ")
        for word in parsed:
            if word.lower() not in stopwords:
                count = wordcount.get(word.lower(), 0)
                wordcount[word.lower()] = count + 1
    return wordcount

wordcount = sorted(word_count().items(), key=operator.itemgetter(1), reverse = True)
wcx, wcy = zip(*wordcount[:10])
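
The counting loop above can also be expressed with `collections.Counter` and a set of stop words, which makes each membership test constant-time. This is only a sketch of an equivalent approach: `count_words`, `sample_lyrics`, and `sample_stopwords` are hypothetical names, and the sample strings are invented.

```python
from collections import Counter

# Hypothetical miniature inputs standing in for dfd["lyrics"] and the
# stop word list loaded from stopwords.txt.
sample_lyrics = ["Death to the flesh", "the flesh decays - death remains"]
sample_stopwords = {"to", "the", "-", "", "i"}

def count_words(lyrics_iterable, stopwords):
    """Count lowercased, space-separated words that are not stop words."""
    counts = Counter()
    for lyrics in lyrics_iterable:
        words = (w.lower() for w in str(lyrics).split(" "))
        counts.update(w for w in words if w not in stopwords)
    return counts

# most_common(n) returns (word, count) pairs sorted by descending count.
top_words = count_words(sample_lyrics, sample_stopwords).most_common(2)
print(top_words)
```

Like the original loop, this splits on single spaces only, so punctuation attached to a word ("flesh," versus "flesh") is still counted as a separate token.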

Here we make an interactive graph of our two preliminary explorations.

In [3]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.io import push_notebook
from bokeh.palettes import Spectral10, RdYlBu10, PiYG10
from ipywidgets import interact
output_notebook()

s = figure(plot_height = 800, plot_width = 800, title = "Complexity vs. Swear Words Ratio")
s.circle("complexity", "swear_words_ratio", source = dfd)
s.xaxis.axis_label = "Lyrical Complexity"
s.yaxis.axis_label = "Swear Words Ratio"

v = figure(plot_height = 800, plot_width = 800, x_range = list(wcx), title = "Top 10 Words by Appearance")
v.vbar(wcx, 0.5, wcy, color = RdYlBu10)

def update1(Graph):
    if Graph == "Complexity vs. Swear Words":
        show(s, notebook_handle = True)
    if Graph == "Most Common Words":
        show(v, notebook_handle = True)        
    push_notebook()

interact(update1, Graph=['Complexity vs. Swear Words', 'Most Common Words'])

<function __main__.update1>

## Visualization 1 Analysis:

Based on our plots, the data looks relatively scattered, and it is difficult to make a statement directly relating lyrical complexity to the ratio of swear words. There appears to be a very slight positive correlation between complexity and swear words at the far end of the data set, but the data there also becomes much sparser and more variable. Within the central cluster of the data set, there is no clear pattern between swear words ratio and lyrical complexity.
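
The visual impression of a weak relationship could be backed by a correlation coefficient. The sketch below is an assumption about how one might check this, not something computed in this notebook. It uses a made-up `toy` frame with the same column names as `dfd`, and the values carry no meaning for the real data.

```python
import pandas as pd

# Invented values; only the column names match the notebook's data frame.
toy = pd.DataFrame({
    "complexity": [1.0, 2.0, 3.0, 4.0, 5.0],
    "swear_words_ratio": [0.02, 0.01, 0.03, 0.02, 0.04],
})

# Pearson correlation measures linear association between the two columns.
pearson = toy["complexity"].corr(toy["swear_words_ratio"])

# Spearman correlation is the Pearson correlation of the ranks, which is
# less sensitive to the sparse, highly variable points at the extremes.
spearman = toy["complexity"].rank().corr(toy["swear_words_ratio"].rank())

print(f"Pearson: {pearson:.3f}, Spearman: {spearman:.3f}")
```

On the real data, the same two lines applied to `dfd` would give a single number to compare against the scatter plot. A value near zero would be consistent with the lack of a clear pattern described above.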

Out of the top 10 most commonly occurring words, there are no swear words. Most words follow themes of death and mortality, which we have colorfully illustrated with a bright color palette.

# Highest Scoring Songs

Of particular interest was whether there was any overlap between songs with high complexity scores and songs with high swear ratios. We thus plotted the 10 most lyrically complex songs and the 10 songs with the highest swear words ratio in an interactive plot.

In [4]:
top10comp = dfd.nlargest(10, "complexity")
compx = top10comp["song"]
compy = top10comp["complexity"]

top10swears = dfd.nlargest(10, "swear_words_ratio")
swearsx = top10swears["song"]
swearsy = top10swears["swear_words_ratio"]

t = figure(plot_height = 800, plot_width = 1500, x_range = list(top10comp["song"]), title = "Top 10 Complexity Scores by Song Name")
pbars = t.vbar(compx, 0.5, compy, color = Spectral10)
q = figure(plot_height = 800, plot_width = 1500, x_range = list(top10swears["song"]), title = "Top 10 Swear Ratio Scores by Song Name")
qbars = q.vbar(swearsx, 0.5, swearsy, color = PiYG10)


def update2(Graph):
    if Graph == "Complexity":
        show(t, notebook_handle = True)
    if Graph == "Swear Words Ratio":
        show(q, notebook_handle = True)        
    push_notebook()

interact(update2, Graph=['Complexity', 'Swear Words Ratio'])

<function __main__.update2>

## Visualization 2 Analysis:

Surprisingly or not, the songs that scored higher on the complexity metric had much tamer, more thoughtful titles than the songs that scored higher on the swear words ratio metric. Lyrically complex songs tended to be more poetically titled, with a more philosophical bent ("The Resonant Frequency of Flesh", for example). Songs with a high proportion of swear words were titled much more aggressively and carried a darker tone ("Scum Fuck the Weak", for example).
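
The question of overlap between the two top-10 lists can also be answered directly with a set intersection rather than by comparing the plots by eye. This is a hedged sketch on invented data: `toy`, `n`, and the song labels are placeholders, and only the column names and the use of `nlargest` follow the notebook.

```python
import pandas as pd

# Invented rows; the column names mirror dfd.
toy = pd.DataFrame({
    "song": ["A", "B", "C", "D", "E"],
    "complexity": [5.0, 4.0, 3.0, 2.0, 1.0],
    "swear_words_ratio": [0.00, 0.05, 0.01, 0.04, 0.03],
})

n = 2  # the notebook uses 10; 2 keeps the toy example readable
top_complex = set(toy.nlargest(n, "complexity")["song"])
top_swears = set(toy.nlargest(n, "swear_words_ratio")["song"])

# Songs that appear in both top-n lists.
overlap = top_complex & top_swears
print(sorted(overlap))
```

Note that comparing by song title treats two different songs with the same name as one; comparing on an (artist, song) pair would avoid that if it matters.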

# Band Names and Complexity:

The last visualization we created attempted to observe the distribution in complexity between artists, which meant averaging the complexity scores of all their songs contained in the data set.

In [7]:
import numpy as np
from bokeh.models import HoverTool

# Arithmetic mean of the complexity scores of each artist's songs.
# sort=False keeps artists in their order of first appearance in dfd.
dfn = (dfd.groupby('artist', sort=False)['complexity']
          .mean()
          .reset_index()
          .rename(columns={'artist': 'bandname'}))
dfn['index'] = np.arange(len(dfn.index))

r = figure(plot_height= 800, plot_width = 800, tools = ["hover"], title = "Avg Complexity by Band (Hover for Details)")
r.circle('index', "complexity", size = 20, source = dfn, color = "aquamarine")

r.select_one(HoverTool).tooltips = [
    ('Band Name', '@bandname'),
    ('Complexity', '@complexity')
]

# Arithmetic mean of the swear words ratio of each artist's songs.
# sort=False keeps artists in their order of first appearance in dfd.
dfm = (dfd.groupby('artist', sort=False)['swear_words_ratio']
          .mean()
          .reset_index()
          .rename(columns={'artist': 'bandname'}))
dfm['index'] = np.arange(len(dfm.index))

l = figure(plot_height= 800, plot_width = 800, tools = ["hover"], title = "Avg Swearing by Band (Hover for Details)")
l.circle('index', "swear_words_ratio", size = 20, source = dfm, color = "Navy")

l.select_one(HoverTool).tooltips = [
    ('Band Name', '@bandname'),
    ('Swear Words Ratio', '@swear_words_ratio')
]

def update3(Graph):
    if Graph == "Average Complexity By Band Name":
        show(r, notebook_handle = True)
    if Graph == "Average Swear Ratio By Band Name":
        show(l, notebook_handle = True)        
    push_notebook()

interact(update3, Graph=['Average Complexity By Band Name', 'Average Swear Ratio By Band Name'])

<function __main__.update3>

## Visualization 3 Analysis:

From this scatter plot, we can see that bands with more literary or poetic names, such as "Aeon", "Extol", and "Arsis", have relatively high lyrical complexity compared to most bands with more aggressive, less poetic-sounding names, such as "Dying Fetus", "Decapitated", "Monstrosity", and "Aborted". A number of other bands with wordy, poetic names, such as "Obscura" and "Nocturnus", have similar average complexity scores. The only band in this higher-scoring group with a non-poetic name is "Death", which is perhaps still less aggressive than something like "Aborted". However, a high average lyrical complexity does not imply a low average swearing ratio: "Extol" has the highest swearing ratio despite also having one of the highest average complexity scores.
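
To compare a band's average complexity and average swearing ratio side by side, both averages can be computed in a single grouped aggregation, together with the number of songs behind each average. The following is a sketch under assumptions: the `toy` frame, the artist labels, and the output column names `avg_complexity`, `avg_swear_ratio`, and `n_songs` are all made up for illustration.

```python
import pandas as pd

# Invented rows; the column names mirror dfd.
toy = pd.DataFrame({
    "artist": ["X", "X", "X", "Y", "Y"],
    "complexity": [2.0, 4.0, 6.0, 1.0, 3.0],
    "swear_words_ratio": [0.01, 0.03, 0.02, 0.00, 0.04],
})

# Named aggregation: one output column per (input column, function) pair.
per_band = toy.groupby("artist").agg(
    avg_complexity=("complexity", "mean"),
    avg_swear_ratio=("swear_words_ratio", "mean"),
    n_songs=("complexity", "count"),
)

# Rank bands by average complexity, highest first.
print(per_band.sort_values("avg_complexity", ascending=False))
```

Keeping the song count next to the averages helps when reading the plots, since a band represented by only one or two songs can land at either extreme by chance.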