# Midterm

Authors: Cassie Corey, Jay Zou

Tasks:
1. Read in (parse, tokenize, ...) the text (30 points)
2. Visualize the text using different interactive Bokeh visualizations (10 points of each different type of interactive visualizations - max 30).
3. Cluster the text and visualize using interactive Bokeh visualizations (10 points for each different type of interactive visualizations - max 30).
4. Explain what you've seen (10 points)

### Introduction

Our text mining and visualizations are based on the [Heavy Metal Text Mining](https://paulvanderlaken.com/2017/09/27/text-mining-pythonic-heavy-metal/) example. This example looks at multiple characteristics of the lyrics, some of which include: TFIDF, cosine distances between word distributions, emotional arcs, swearwords, and lyric generation. The characteristics were visualized using various scatter plots, graphs, and trees. A brief explanation of the algorithms used and the output observed is given separately for each visualization/technique.

### Required Libraries

These can all be installed with `pip install <package>`
- BeautifulSoup
- Sklearn
- Textstat
- Markovify

### Gathering Data

Since this example does not provide a dataset of lyrics, we collected lyrics ourselves by scraping [MetroLyrics](https://www.metrolyrics.com).

First we chose a music genre: Tech Death Metal. We got our list of bands from [Wikipedia's list of Technical Death Metal Bands](https://en.wikipedia.org/wiki/List_of_technical_death_metal_bands).

In [188]:
import requests
from bs4 import BeautifulSoup

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_technical_death_metal_bands"

req = requests.get(WIKI_URL)
soup = BeautifulSoup(req.content, 'lxml')
table_cells = soup.findAll("td")

artists = []
for cell in table_cells:
    link = cell.find('a',href=True)
    if link is not None:
        if '[' not in link.text:
            artists.append(link.text.replace('(band)','').strip())

# It'll be convenient to have a lowercase version for URLs and indexing.
artists_L = [a.lower() for a in artists]

The next cell contains some useful methods that we'll need for getting urls and lyrics from urls.

In [214]:
from bs4 import BeautifulSoup
from time import sleep, time
import random, requests

BASE_URL = "http://www.metrolyrics.com/"

def get_song_urls(artists):
    art_song_dict = {}
    for artist in artists:
        url = BASE_URL + artist.replace(' ','-') + "-lyrics.html"
        sleep(random.randint(0,10))
        response = requests.get(url)
        if response.status_code != 404: # Not all artists might be on MetroLyrics
            soup = BeautifulSoup(response.content, 'lxml')
            links = [a['href'] for a in soup.find_all('a',href=True)]
            song_list = []
            for link in links:
                if "lyrics-" + artist.replace(' ','-') in link:
                    song_list.append(link)
            art_song_dict[artist] = song_list
    return art_song_dict

def get_lyrics(song_url):
    sleep(random.randint(0,10))
    response = requests.get(song_url)
    soup = BeautifulSoup(response.content, 'lxml')
    verses = soup.find_all("p",{"class":"verse"})
    lyrics = ''
    for verse in verses:
        lyrics += verse.text + ' '
    return lyrics

def song_from_url(song_url):
    return song_url[27:].split('lyrics')[0].replace('-',' ').strip()

_WARNING:_ THE FOLLOW CELL MAY TAKE UP TO __5 MINUTES__ TO RUN

This cell fetches urls for songs from each artist. We then use these urls to fetch the lyrics for each song.

In [215]:
t0 = time()
print('Fetching song urls...',end='')
art_song_dict = get_song_urls(artists_L)
print('Done in {:02f}s'.format(time()-t0))

Fetching song urls...

KeyboardInterrupt: 

In [217]:
import pandas as pd

# Initialize a dataframe to hold the lyrics
lyrics_df = pd.DataFrame(columns=['artist','song','lyrics'])

_WARNING:_ THE FOLLOWING CELL WILL RUN FOR A __REALLY LONG TIME__, like until the network connection times out.

This cell fetches lyrics from the song URLs. It is OK to interrupt this cell at any time if you have other business to do. As long as you save your work in the cell that follows, you can come back to this cell and it won't waste time on lyrics it has already gathered.

That said, you do still run the risk of interrupting it while it's in the middle of writing lyrics for a song. So you may get some partially complete lyrics. But you can manually check that if you're really concerned.

In [None]:
t0 = time()
for artist in art_song_dict:
    print("Fetching lyrics for: ",artist)
    for song_url in art_song_dict[artist]:
        song = song_from_url(song_url)
        if song not in lyrics_df.song.values:
            lyrics = get_lyrics(song_url)
            lyrics_df = lyrics_df.append({'artist':artist,
                                          'song':song,
                                          'lyrics':lyrics},ignore_index=True)
print('Done in {:02f}s'.format(time()-t0))

Fetching lyrics for:  gojira
Fetching lyrics for:  monstrosity
Fetching lyrics for:  as they sleep
Fetching lyrics for:  oceano
Fetching lyrics for:  born of osiris
Fetching lyrics for:  origin
Fetching lyrics for:  extol
Fetching lyrics for:  dying fetus
Fetching lyrics for:  rings of saturn
Fetching lyrics for:  nile
Fetching lyrics for:  in battle
Fetching lyrics for:  suffocation
Fetching lyrics for:  pestilence
Fetching lyrics for:  cryptopsy
Fetching lyrics for:  obscura
Fetching lyrics for:  meshuggah
Fetching lyrics for:  death
Fetching lyrics for:  nocturnus
Fetching lyrics for:  becoming the archetype


In [213]:
# Save the lyrics data.
lyrics_df.to_csv('lyrics_line.csv',index=False)

There were some lyrics that weren't available on MetroLyrics. The following is an attempt to get the missing lyrics from another site: Genius.com. It mostly doesn't work.

In [182]:
def get_genius_lyrics(song,artist):
    url = "http://genius.com/{}-{}-lyrics".format(artist.replace(' ','-'),song.replace(' ','-'))
    print(url)
    response = requests.get(url)
    if response.status_code != 404:
        soup = BeautifulSoup(response.content,'lxml')
        lyrics = soup.find("div",{"class":"lyrics"})
        text = lyrics.find("p").text
        print(text)
        return text
    return ''

In [212]:
missing_songs = lyrics_df[lyrics_df.lyrics==''].song
print("{} songs missing!".format(len(missing_songs)))

t0 = time()
for idx,song in enumerate(missing_songs):
    artist = lyrics_df.iloc[idx].artist
    lyrics = get_genius_lyrics(song,artist)
    if lyrics != '':
        lyrics_df.iloc[idx].lyrics = lyrics
print("Done in {:02f}".format(time()-t0))

# Save our work.
lyrics_df.to_csv('lyrics.csv',index=False)

166 songs missing!
http://genius.com/gojira-1990-quatrillions-de-tonnes-lyrics
http://genius.com/gojira-dawn-lyrics
[Instrumental]
http://genius.com/gojira-torii-lyrics
[Instrumental]
http://genius.com/gojira-terra-incognita-lyrics
[Instrumental]
http://genius.com/gojira-wisdom-lyrics
http://genius.com/gojira-connected-lyrics
[Instrumental]
http://genius.com/gojira-where-dragons-fall-lyrics
http://genius.com/gojira-burden-of-evil-lyrics
http://genius.com/gojira-ceremonial-void-lyrics
http://genius.com/gojira-darkest-dream-lyrics
http://genius.com/gojira-horror-infinity-lyrics
http://genius.com/gojira-immense-malignancy-lyrics
http://genius.com/gojira-imperial-doom-lyrics
http://genius.com/gojira-the-third-reich-lyrics
http://genius.com/gojira-to-the-republic-lyrics
http://genius.com/gojira-the-darkest-ages-lyrics
http://genius.com/gojira-god-of-war-lyrics
http://genius.com/gojira-attila-lyrics
http://genius.com/gojira-poseidon-lyrics
http://genius.com/gojira-oracle-of-the-dead-lyrics
h

http://genius.com/oceano-lotus-eater-lyrics
http://genius.com/oceano-silent-lyrics
http://genius.com/born-of-osiris-fidelio-lyrics
http://genius.com/born-of-osiris-dreamless-lyrics
http://genius.com/born-of-osiris-les-silence-lyrics
http://genius.com/born-of-osiris-fractal-point-lyrics
Done in 40.545631


## Loading Data

In case you didn't use the cells above to gather it.

In [2]:
import pandas as pd

# Lyrics dataframe
lyrics_df = pd.read_csv('lyrics.csv')
lyrics_df.sample(10)

Unnamed: 0.1,Unnamed: 0,artist,song,lyrics
474,474,suffocation,rapture of revocation,Death lies within thyself Eager to release its...
990,990,psycroptic,the labyrinth,Tumbling deep into a darkened nightmare Uncons...
841,841,revocation,only the spineless survive,Devolved wriggling monstrosities roam through ...
930,930,cephalic carnage,friend of mine,"Two years ago, a friend of mine told me to wri..."
27,27,gojira,1990 quatrillions de tonnes,
374,374,nile,kheftiu asar butchiu,Kheftin Asar Butbiu Enemies of Osiris Who Are ...
193,193,origin,thrall fulcrum apex,Trials and degenerations of an upjumped demigo...
1170,1170,cynic,thinking being,Coinage of my brain A bodiless creation ecstac...
1028,1028,arsis,failing winds of hopeless greed,"So, the sight has finally left us with dreams ..."
869,869,aborted,die verzweiflung,Ich bin das Ende aller Dinge Lautlos nähernt a...


## TFIDF

Term frequency inverse document frequency (TFIDF) is a good way to visualize which words are the most descriptive of a certain corpus. We can use it to get an idea of the most descriptive words in the genre as a whole. It can also be used to distinguish between bands or distinguish which songs are the most descriptive of a band.

TFIDF treats text as a Bag of Words which means that order doesn't matter and punctuation is ignored. This is good for lyrics because punctuation is sort of a free-for-all. There may be a lot of incomplete sentences or repeated words.

In [54]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tfidf_vectorizer = TfidfVectorizer(stop_words='english',max_df=0.7)
tfidf_vectorizer.fit_transform(lyrics_df.dropna().lyrics)

count_vectorizer = CountVectorizer(stop_words='english',max_df=0.7)
count_vectorizer.fit_transform(lyrics_df.dropna().lyrics)

<1125x15845 sparse matrix of type '<class 'numpy.int64'>'
	with 75297 stored elements in Compressed Sparse Row format>

In [7]:
len(tfidf_vectorizer.get_feature_names())

15845

In [55]:
len(count_vectorizer.get_feature_names())

15845

In [56]:
term_frequency = zip(count_vectorizer.get_feature_names(),
                     np.asarray(X.sum(axis=0)).ravel())

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=0.7, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words='english',
        strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

## Swear Words and Complexity

The original author did not provide a full dataset of lyrics (or even the code to scrape it). He did, however, provide a list of naughty words. So, we used that to explore naughty words in the lyrics we collected.

He compared lyric complexity to the number of swearwords used and found a positive correlation. We do the same experiment below.

In [42]:
# Read in the swear words
with open('swear_words.txt','r') as f:
    swear_words = f.read().splitlines()

The original author used the SMOG measure of complexity. It estimates the reading grade level of text. However, this calculation relies on counting the number of sentences in a piece of text. Since lyrics are structured a bit differently from normal text, we might need to try a few different ways of dealing with the lack of punctuation.

In [43]:
from textstat.textstat import textstat

# From pythonic-metal github
def count_swear_word_ratio(text):
    counter = 0
    for swear_word in swear_words:
        counter += text.count(swear_word)
    number_of_words = textstat.lexicon_count(text)
    return counter/number_of_words

In [47]:
lyrics_df['swear_words_ratio'] = 0
lyrics_df['complexity'] = 0

for i,lyrics in enumerate(lyrics_df.dropna().lyrics):
    if len(lyrics)>0:
        # Calculate complexity
        complexity = textstat.smog_index(lyrics)
        lyrics_df.iloc[i,lyrics_df.columns.get_loc('complexity')] = complexity
        # Calculate swear words ratio
        swr = count_swear_word_ratio(lyrics)
        lyrics_df.iloc[i,lyrics_df.columns.get_loc('swear_words_ratio')] = swr

lyrics_df.sample(5)

Unnamed: 0.1,Unnamed: 0,artist,song,lyrics,complexity,swear_words_ratio
557,557,pestilence,reduce to ashes,Dark middleagess centuries of pain The appear...,8.8,0.010582
495,495,suffocation,jesus wept,,7.2,0.032787
1254,1254,cannibal corpse,priests of sodom,The blackened city calls out Enter the temple ...,0.0,0.0
1158,1158,cynic,the lions roar,Bury the bells Between two mountains The big t...,0.0,0.0
635,635,meshuggah,sane,"Come And Hear My Twisted Lies, The Way I Bend ...",0.0,0.020833


In [49]:
# Save it if you want
lyrics_df.to_csv("lyrics_complexity_swear_words.csv",index=False)

### Visualizations

[JAYS EXPLANATION]

In [144]:
import pandas as pd
dfd = pd.read_csv("cleaned_lyrics_data.csv")
dfd.sample(5)

Unnamed: 0,artist,song,lyrics,complexity,swear_words_ratio
348,cephalic carnage,warm hand on a cold night a tale of onesomes,[Instrumental],9.7,0.035533
280,becoming the archetype,nights sorrow,(Instrumental),8.0,0.05
71,extol,paradigms,The worship of creation Seeming endless But it...,6.1,0.062762
264,nocturnus,arctic crypt,Locked inside the ice below Forgotten long ago...,10.9,0.014458
109,nile,slaves of xul,,5.6,0.02139


In [145]:
s = figure(plot_height = 800, plot_width = 800, title = "Complexity vs Swear Words Ratio")
s.circle("complexity", "swear_words_ratio", source = dfd)
show(s)

In [146]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.io import push_notebook
from bokeh.palettes import Spectral10
from ipywidgets import interact
output_notebook()

top10comp = dfd.nlargest(10, "complexity")
compx = top10comp["song"]
compy = top10comp["complexity"]

top10swears = dfd.nlargest(10, "swear_words_ratio")
swearsx = top10swears["song"]
swearsy = top10swears["swear_words_ratio"]

p = figure(plot_height = 800, plot_width = 1500, x_range = list(top10comp["song"]), title = "Top 10 Complexity Scores by Song Name")
pbars = p.vbar(compx, 0.5, compy, color = Spectral10)
q = figure(plot_height = 800, plot_width = 1500, x_range = list(top10swears["song"]), title = "Top 10 Swear Ratios Scores by Song Name")
qbars = q.vbar(swearsx, 0.5, swearsy, color = Spectral10)


def update(Graph):
    if Graph == "Complexity":
        show(p, notebook_handle = True)
    if Graph == "Swear Words Ratio":
        show(q, notebook_handle = True)        
    push_notebook()

interact(update, Graph=['Complexity', 'Swear Words Ratio'])

A Jupyter Widget

<function __main__.update>

In [152]:
import numpy as np
from bokeh.models import HoverTool

combined = list(zip(dfd['artist'], dfd['complexity']))
avgcomp = {}
for item in combined:
    avg = avgcomp.get(item[0], 0)
    avgcomp[item[0]] = (avg + item[1])/2

dfn = pd.DataFrame()
dfn['bandname'] = list(avgcomp.keys())
dfn['complexity'] = list(avgcomp.values())
dfn['index'] = np.arange(len(dfn.index))

r = figure(plot_height= 800, plot_width = 800, tools = ["hover"], title = "Avg Complexity by Band (Hover for Details)")
r.circle('index', "complexity", size = 20, source = dfn, color = "aquamarine")

r.select_one(HoverTool).tooltips = [
    ('Band Name', '@bandname'),
    ('Complexity', '@complexity')
]
show(r)

## Cosine Distance

This measure was used to recognize band similarity and how representative certain songs were for a band. It also allowed the different bands to be clustered. We borrowed some code from the original author's notebook on [GitHub](https://github.com/ijmbarr/pythonic-metal/blob/master/pythonic-metal-part-1-counting.ipynb).

This measure of cosine similarity is based on the term frequency inverse document frequency measures of words in the combined set of lyrics. This is a measure of which words are the most descriptive of a given band.

In [195]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

lyrics_df = pd.read_csv('cleaned_lyrics_data.csv')

tfidf_vectorizer = TfidfVectorizer(stop_words='english',max_df=0.7)

def normalise(vec):
    return vec / np.dot(vec,vec)

def combine_vectors(vectors):
    return normalise(np.sum(vectors,axis=0))

lyrics_df.dropna(inplace=True)
lyrics_df["vectors_unnormalised"] = list(tfidf_vectorizer.fit_transform(lyrics_df.lyrics.values).toarray())
lyrics_df["vectors"] = lyrics_df.vectors_unnormalised.apply(normalise)

band_vectors = (
    lyrics_df
    .groupby("artist")
    .vectors
    .apply(combine_vectors)
)

In [196]:
%matplotlib notebook

import matplotlib.pyplot as plt

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster

Z = linkage(np.stack(list(band_vectors.values)), method='complete', metric="cosine")

n_clusters = fcluster(Z, 0.57, criterion='distance')

plt.figure()
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('sample index')
plt.ylabel('distance')
dendrogram(
    Z,
    leaf_rotation=90.,  # rotates the x axis labels
    labels=band_vectors.index.values
)

plt.title("Clustering Metal Lyrics")
plt.xticks(rotation=90)

plt.show()

<IPython.core.display.Javascript object>

In [234]:
plt.close()

Heatmap of cosine distance between bands. The closer the cosine distance is to 1, the more different two bands are in terms of word frequencies. The closer the cosine distance is to 0, the more similar two bands are.

In [197]:
import pandas as pd

from scipy.spatial.distance import cosine

from bokeh.io import show
from bokeh.models import (
    ColumnDataSource,
    HoverTool,
    LinearColorMapper,
    BasicTicker,
    PrintfTickFormatter,
    ColorBar
)

from bokeh.plotting import figure

# Build the cosine distance matrix
cos_df = pd.DataFrame(columns=band_vectors.index.values,
                      index=band_vectors.index.values)
for i in band_vectors.index.values:
    for j in band_vectors.index.values:
        cos_df.at[i,j] = cosine(band_vectors[i],band_vectors[j])

cos_df.index.name='BandA'
cos_df.columns.name='BandB'
bandsA = list(cos_df.index)
bandsB = list(cos_df.columns)
        
# Stack it because bokeh sucks for heatmaps
df = pd.DataFrame(cos_df.stack(),columns=['cos']).reset_index()
        
mapper = LinearColorMapper(palette='Spectral10',low=0,high=1)

source = ColumnDataSource(df)

TOOLS = "hover,save,pan,box_zoom,reset,wheel_zoom"
p = figure(title="Cosine Distance Between Artists",
           x_range=bandsA,y_range=bandsB,
           tools=TOOLS, toolbar_location='above')

p.grid.grid_line_color=None
p.axis.axis_line_color=None
p.axis.major_tick_line_color=None
p.axis.major_label_text_font_size='5pt'
p.axis.major_label_standoff=0
p.xaxis.major_label_orientation=45

p.rect(x="BandA",y="BandB",width=1,height=1,
       source=source,
       fill_color={'field':'cos','transform':mapper},
       line_color=None)

color_bar = ColorBar(color_mapper=mapper,major_label_text_font_size='5pt',
                     border_line_color=None, location=(0,0))
p.add_layout(color_bar, 'right')
p.select_one(HoverTool).tooltips = [
    ('pair','@BandA and @BandB'),
    ('cos','@cos')
]

show(p, notebook_handle=True)

The purple line running diagonally up the heatmap illustrates that each band is perfectly similar to itself (cosine distance = 0). A few of the bands stand out for being very different from all other bands, such as Obscura and Aeon. The cosine distances between these two bands and all others is always very close to 1.

Below is a sample of Obscura lyrics.

In [233]:
lyrics_df[lyrics_df.artist=='obscura'].lyrics

209    The Sermon of the Seven Suns A funeral of worl...
210    As I walk through time and space Nourished fro...
211    What sudden blaze of majesty Is that which we ...
212    A crown, created with divine will An inﬁnite l...
Name: lyrics, dtype: object

For Aeon, this is likely due to the fact that we were only able to gather one sample of real lyrics from our parsing.

In [231]:
lyrics_df[lyrics_df.artist=='aeon'].lyrics

421     I love you Satan, my father, my pride You sho...
422                       [Music: Z. Nilsson, D. Dlimi] 
Name: lyrics, dtype: object

Becoming the Archetype, Cephalic Carnage, and As They Sleep also stand out in this heatmap for being very similar. Becoming the Archetype and As They Sleep have a cosine distance of only 0.114. Unfortunately this seems to be due to the fact that our method for gathering lyrics wasn't perfect. Neither of these bands has many lyrics to begin with. The only lyrics for As They Sleep are "INSTRUMENTAL" which is also half of the lyrics for Becoming the Archetype.

In [227]:
lyrics_df[lyrics_df.artist=='becoming the archetype'].lyrics

270    There was a time When we all sang the song of ...
271    Deep within the ocean's keep* There lies a cor...
274                                        INSTRUMENTAL 
277                                      (Instrumental) 
278    It hurts to see you live you life revolving ar...
279                                      (Instrumental) 
280                                      (Instrumental) 
288    I bear these scars A constant reminder of my o...
Name: lyrics, dtype: object

In [226]:
lyrics_df[lyrics_df.artist=='as they sleep'].lyrics

46    INSTRUMENTAL 
Name: lyrics, dtype: object

## Lyric Generation

The original author built his own Markov chain class to generate lyrics. We're just going to use the Markovify library by [jsvine](https://github.com/jsvine/markovify).

Markov models work by essentially calculating the likelihood of transitioning between every pair of words in a corpus. They are able to generate sentences by starting with a (sometimes random) seed word and using the previously calculated likelihoods to select the next word in the sentence. The generation stops when the output is the desired length.

We trained a separate markov chain for each artist. The goal is that someone familiar with these bands could potentially guess which band's markov model generated which lyrics.

It's not a Bokeh visualization but it is interactive and it's fun to play with.

In [268]:
import pandas as pd

import markovify

import ipywidgets as widgets
from ipywidgets import interact

lyrics_df = pd.read_csv('lyrics_line.csv')
lyrics_df.dropna(inplace=True)

markov_models = {}

# build mini markov chains for each band
for band in set(lyrics_df.artist.values):
    text = ' '.join(lyrics_df.loc[lyrics_df.artist==band].lyrics.tolist())
    text = text.replace('\n','. ')
    model = markovify.Text(text)
    markov_models[band] = model
    
# n is number of sentences, k is number of characters
def output_lyrics(band,n=20,k=75):
    model = markov_models[band]
    sentences = []
    for i in range(n):
        sentences.append(model.make_short_sentence(k))
    try:
        print('\n'.join(sentences))
    except:
        print('Technical difficulties...try adjusting the sliders or selecting a different band :)')
    
interact(output_lyrics, band=sorted(set(lyrics_df.artist.values)), n=(3,40),k=(50,140))

A Jupyter Widget

<function __main__.output_lyrics>