# Midterm

Authors: Cassie Corey, Jay Zou

Tasks:
1. Read in (parse, tokenize, ...) the text (30 points)
2. Visualize the text using different interactive Bokeh visualizations (10 points of each different type of interactive visualizations - max 30).
3. Cluster the text and visualize using interactive Bokeh visualizations (10 points for each different type of interactive visualizations - max 30).
4. Explain what you've seen (10 points)

### Introduction

Our text mining and visualizations are based on the [Heavy Metal Text Mining](https://paulvanderlaken.com/2017/09/27/text-mining-pythonic-heavy-metal/) example. This example looks at multiple characteristics of the lyrics, some of which include: TFIDF, cosine distances between word distributions, emotional arcs, swearwords, and lyric generation. The characteristics were visualized using various scatter plots, graphs, trees, and word clouds.

### Required Libraries

These can all be installed with `pip install <package>`
- BeautifulSoup
- Sklearn
- Textstat

### Gathering Data

Since this example does not provide a dataset of lyrics, we collected lyrics ourselves by scraping [MetroLyrics](https://www.metrolyrics.com).

First we chose a music genre: Tech Death Metal. We got our list of bands from [Wikipedia's list of Technical Death Metal Bands](https://en.wikipedia.org/wiki/List_of_technical_death_metal_bands).

In [188]:
import requests
from bs4 import BeautifulSoup

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_technical_death_metal_bands"

req = requests.get(WIKI_URL)
soup = BeautifulSoup(req.content, 'lxml')
table_cells = soup.findAll("td")

artists = []
for cell in table_cells:
    link = cell.find('a',href=True)
    if link is not None:
        if '[' not in link.text:
            artists.append(link.text.replace('(band)','').strip())

# It'll be convenient to have a lowercase version for URLs and indexing.
artists_L = [a.lower() for a in artists]

The next cell contains some useful methods that we'll need for getting urls and lyrics from urls.

In [214]:
from bs4 import BeautifulSoup
from time import sleep, time
import random, requests

BASE_URL = "http://www.metrolyrics.com/"

def get_song_urls(artists):
    art_song_dict = {}
    for artist in artists:
        url = BASE_URL + artist.replace(' ','-') + "-lyrics.html"
        sleep(random.randint(0,10))
        response = requests.get(url)
        if response.status_code != 404: # Not all artists might be on MetroLyrics
            soup = BeautifulSoup(response.content, 'lxml')
            links = [a['href'] for a in soup.find_all('a',href=True)]
            song_list = []
            for link in links:
                if "lyrics-" + artist.replace(' ','-') in link:
                    song_list.append(link)
            art_song_dict[artist] = song_list
    return art_song_dict

def get_lyrics(song_url):
    sleep(random.randint(0,10))
    response = requests.get(song_url)
    soup = BeautifulSoup(response.content, 'lxml')
    verses = soup.find_all("p",{"class":"verse"})
    lyrics = ''
    for verse in verses:
        lyrics += verse.text + ' '
#         lyrics += verse.text.replace('\n',' ') + ' '
    return lyrics

def song_from_url(song_url):
    return song_url[27:].split('lyrics')[0].replace('-',' ').strip()

_WARNING:_ THE FOLLOW CELL MAY TAKE UP TO __5 MINUTES__ TO RUN

This cell fetches urls for songs from each artist. We then use these urls to fetch the lyrics for each song.

In [215]:
t0 = time()
print('Fetching song urls...',end='')
art_song_dict = get_song_urls(artists_L)
print('Done in {:02f}s'.format(time()-t0))

Fetching song urls...

KeyboardInterrupt: 

In [217]:
import pandas as pd

# Initialize a dataframe to hold the lyrics
lyrics_df = pd.DataFrame(columns=['artist','song','lyrics'])

_WARNING:_ THE FOLLOWING CELL WILL RUN FOR A __REALLY LONG TIME__, like until the network connection times out.

This cell fetches lyrics from the song URLs. It is OK to interrupt this cell at any time if you have other business to do. As long as you save your work in the cell that follows, you can come back to this cell and it won't waste time on lyrics it has already gathered.

That said, you do still run the risk of interrupting it while it's in the middle of writing lyrics for a song. So you may get some partially complete lyrics. But you can manually check that if you're really concerned.

In [None]:
t0 = time()
for artist in art_song_dict:
    print("Fetching lyrics for: ",artist)
    for song_url in art_song_dict[artist]:
        song = song_from_url(song_url)
        if song not in lyrics_df.song.values:
            lyrics = get_lyrics(song_url)
            lyrics_df = lyrics_df.append({'artist':artist,
                                          'song':song,
                                          'lyrics':lyrics},ignore_index=True)
print('Done in {:02f}s'.format(time()-t0))

Fetching lyrics for:  gojira
Fetching lyrics for:  monstrosity
Fetching lyrics for:  as they sleep
Fetching lyrics for:  oceano
Fetching lyrics for:  born of osiris
Fetching lyrics for:  origin
Fetching lyrics for:  extol
Fetching lyrics for:  dying fetus
Fetching lyrics for:  rings of saturn
Fetching lyrics for:  nile
Fetching lyrics for:  in battle
Fetching lyrics for:  suffocation
Fetching lyrics for:  pestilence
Fetching lyrics for:  cryptopsy
Fetching lyrics for:  obscura
Fetching lyrics for:  meshuggah
Fetching lyrics for:  death
Fetching lyrics for:  nocturnus
Fetching lyrics for:  becoming the archetype


In [213]:
# Save the lyrics data.
lyrics_df.to_csv('lyrics_line.csv')

There were some lyrics that weren't available on MetroLyrics. The following is an attempt to get the missing lyrics from another site: Genius.com.

In [182]:
def get_genius_lyrics(song,artist):
    url = "http://genius.com/{}-{}-lyrics".format(artist.replace(' ','-'),song.replace(' ','-'))
    print(url)
    response = requests.get(url)
    if response.status_code != 404:
        soup = BeautifulSoup(response.content,'lxml')
        lyrics = soup.find("div",{"class":"lyrics"})
        text = lyrics.find("p").text
#         text = lyrics.find("p").text.replace('\n',' ')
        print(text)
        return text
    return ''

In [212]:
missing_songs = lyrics_df[lyrics_df.lyrics==''].song
print("{} songs missing!".format(len(missing_songs)))

t0 = time()
for idx,song in enumerate(missing_songs):
    artist = lyrics_df.iloc[idx].artist
    lyrics = get_genius_lyrics(song,artist)
    if lyrics != '':
        lyrics_df.iloc[idx].lyrics = lyrics
print("Done in {:02f}".format(time()-t0))

# Save our work.
lyrics_df.to_csv('lyrics.csv')

166 songs missing!
http://genius.com/gojira-1990-quatrillions-de-tonnes-lyrics
http://genius.com/gojira-dawn-lyrics
[Instrumental]
http://genius.com/gojira-torii-lyrics
[Instrumental]
http://genius.com/gojira-terra-incognita-lyrics
[Instrumental]
http://genius.com/gojira-wisdom-lyrics
http://genius.com/gojira-connected-lyrics
[Instrumental]
http://genius.com/gojira-where-dragons-fall-lyrics
http://genius.com/gojira-burden-of-evil-lyrics
http://genius.com/gojira-ceremonial-void-lyrics
http://genius.com/gojira-darkest-dream-lyrics
http://genius.com/gojira-horror-infinity-lyrics
http://genius.com/gojira-immense-malignancy-lyrics
http://genius.com/gojira-imperial-doom-lyrics
http://genius.com/gojira-the-third-reich-lyrics
http://genius.com/gojira-to-the-republic-lyrics
http://genius.com/gojira-the-darkest-ages-lyrics
http://genius.com/gojira-god-of-war-lyrics
http://genius.com/gojira-attila-lyrics
http://genius.com/gojira-poseidon-lyrics
http://genius.com/gojira-oracle-of-the-dead-lyrics
h

http://genius.com/oceano-lotus-eater-lyrics
http://genius.com/oceano-silent-lyrics
http://genius.com/born-of-osiris-fidelio-lyrics
http://genius.com/born-of-osiris-dreamless-lyrics
http://genius.com/born-of-osiris-les-silence-lyrics
http://genius.com/born-of-osiris-fractal-point-lyrics
Done in 40.545631


## Loading Data

In case you didn't use the cells above to gather it.

In [2]:
import pandas as pd

# Lyrics dataframe
lyrics_df = pd.read_csv('lyrics.csv')
lyrics_df.sample(10)

Unnamed: 0.1,Unnamed: 0,artist,song,lyrics
474,474,suffocation,rapture of revocation,Death lies within thyself Eager to release its...
990,990,psycroptic,the labyrinth,Tumbling deep into a darkened nightmare Uncons...
841,841,revocation,only the spineless survive,Devolved wriggling monstrosities roam through ...
930,930,cephalic carnage,friend of mine,"Two years ago, a friend of mine told me to wri..."
27,27,gojira,1990 quatrillions de tonnes,
374,374,nile,kheftiu asar butchiu,Kheftin Asar Butbiu Enemies of Osiris Who Are ...
193,193,origin,thrall fulcrum apex,Trials and degenerations of an upjumped demigo...
1170,1170,cynic,thinking being,Coinage of my brain A bodiless creation ecstac...
1028,1028,arsis,failing winds of hopeless greed,"So, the sight has finally left us with dreams ..."
869,869,aborted,die verzweiflung,Ich bin das Ende aller Dinge Lautlos nähernt a...


## TFIDF

Term frequency inverse document frequency (TFIDF) is a good way to visualize which words are the most descriptive of a certain corpus. We can use it to get an idea of the most descriptive words in the genre as a whole. It can also be used to distinguish between bands or distinguish which songs are the most descriptive of a band.

TFIDF treats text as a Bag of Words which means that order doesn't matter and punctuation is ignored. This is good for lyrics because punctuation is sort of a free-for-all. There may be a lot of incomplete sentences or repeated words.

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words='english',max_df=0.7)
vectorizer.fit_transform(lyrics_df.dropna().lyrics)

<1125x15845 sparse matrix of type '<class 'numpy.float64'>'
	with 75297 stored elements in Compressed Sparse Row format>

In [7]:
len(vectorizer.get_feature_names())

15845

## Swear Words

The original author did not provide a full dataset of lyrics (or even the code to scrape it). He did, however, provide a list of naughty words. So, we used that to explore naughty words in the lyrics we collected.

He compared lyric complexity to the number of swearwords used and found a positive correlation. We do the same experiment below.

In [42]:
# Read in the swear words
with open('swear_words.txt','r') as f:
    swear_words = f.read().splitlines()

The original author used the SMOG measure of complexity. It estimates the reading grade level of text. However, this calculation relies on counting the number of sentences in a piece of text. Since lyrics are structured a bit differently from normal text, we might need to try a few different ways of dealing with the lack of punctuation.

In [43]:
from textstat.textstat import textstat

# From pythonic-metal github
def count_swear_word_ratio(text):
    counter = 0
    for swear_word in swear_words:
        counter += text.count(swear_word)
    number_of_words = textstat.lexicon_count(text)
    return counter/number_of_words

In [45]:
lyrics_df['swear_words_ratio'] = 0
lyrics_df['complexity'] = 0

for i,lyrics in enumerate(lyrics_df.dropna().lyrics):
    if len(lyrics)>0:
        # Calculate complexity
        complexity = textstat.smog_index(lyrics)
        lyrics_df.iloc[i,lyrics_df.columns.get_loc('complexity')] = complexity
        # Calculate swear words ratio
        swr = count_swear_word_ratio(lyrics)
        lyrics_df.iloc[i,lyrics_df.columns.get_loc('swear_words_ratio')] = swr

lyrics_df.sample(10)

Unnamed: 0.1,Unnamed: 0,artist,song,lyrics,complexity,swear_word_ratio,swear_words_ratio
986,986,psycroptic,the valley of winds breath and dragons fire,"Gasping for air, descending quickly down the h...",0.0,0,0.013605
1166,1166,cynic,veil of maya,In Maya's grip illusion transforms verity Perc...,0.0,0,0.0
66,66,gojira,the link,"I want to protect myself from others Dream, I ...",5.3,0,0.020661
895,895,aborted,nemesis,"Apathy raised upon childhood, Social skills e...",0.0,0,0.0
1053,1053,augury,alien shores,[Music: Patrick] [Solos: Pat] Wake the world o...,9.7,0,0.042105
1184,1184,cynic,nunc fluens,The space We claim the space The space,0.0,0,0.0
411,411,nile,invocation of the gate of aatankhesenamenti,[Instrumental],7.0,0,0.045082
692,692,death,crystal mountain,Built from Blind faith Passed down from self I...,0.0,0,0.01626
11,11,gojira,pain is a master,Crawling and moaning in the sharp blade of gra...,0.0,0,0.007812
703,703,death,politicians in my eyes,"1, 2, 3 Now! The number one biggest game It's ...",8.8,0,0.010204


In [46]:
lyrics_df.drop('swear_word_ratio',1,inplace=True)
lyrics_df.sample(10)

Unnamed: 0.1,Unnamed: 0,artist,song,lyrics,complexity,swear_words_ratio
595,595,cryptopsy,soar and envision sore vision,Savagely beguiled of courage Not equipped for ...,0.0,0.009174
1120,1120,aeon,hell unleashed,Hell unleashed now we dominate Years of darkne...,0.0,0.014388
1232,1232,decrepit birth,the living doorway,Through the realms of time and space Exists a ...,0.0,0.0
148,148,oceano,empathy for leviathan,I am disgusted at the sight of your breed. I a...,0.0,0.042763
957,957,anata,the drowning,I feel like I'm about to drown It feels like I...,0.0,0.008621
1193,1193,necrophagist,stabwound,,0.0,0.0
1206,1206,necrophagist,symbiotic in theory,"Unable to move from a point of view, driven by...",0.0,0.0
1231,1231,decrepit birth,symbiosis,Eternally connected to the energies of existen...,0.0,0.0
179,179,born of osiris,open arms to damnation,"Day by day, Let's pave the path to escape. Pa...",0.0,0.0
1033,1033,arsis,the cold resistance,Cobwebs reaching from the heavens to the lover...,14.6,0.008403


## Cosine Distance

This measure was used to recognize band similarity and how representative certain songs were for a band. It also allowed the different bands to be clustered. We borrowed some code from the original author's notebook on [GitHub](https://github.com/ijmbarr/pythonic-metal/blob/master/pythonic-metal-part-1-counting.ipynb).

In [10]:
from scipy.spatial.distance import cosine as cs

def important_words(n):
    return sorted(vectorizer.get_feature_names(),key=lambda x:x[1],reverse=True)[:n]

def most_representative_songs(vec, n):
    pass
    

In [12]:
sorted(vectorizer.get_feature_names(),key=lambda x:x[1],reverse=True)

['緑間',
 '段違いのパワーが',
 '加速度つけて',
 '自由自在',
 '黄瀬',
 '未来系の',
 '明日はどこまで',
 '昨日の本気が',
 '更新されてく',
 '完成形は前例ナシ',
 '実感に変わるプレイ',
 '青峰',
 '発展途上がホンモノになる時',
 '絶対キメる',
 '黒子',
 '全員',
 '赤司',
 '過去最強へ',
 '紫原',
 '衝動の底に',
 '全力で',
 '自分のやり方で',
 '自分のスピードで',
 '自分の未知に',
 '圧倒的',
 '圧倒的な',
 '確信的',
 '前人未到の栄光が待ってる',
 'ヨロコビはいつも',
 'プレッシャーは常勝の数',
 'プライドにかけて',
 'コトバ通り無敵になる',
 'ドアの向こうへと急かす',
 '手を伸ばしたら奇跡が動きだす',
 '守るべき',
 '回りだした上昇ループ',
 '振り回されそうに湧き上がるチカラで',
 'きらめきはじめた才能の行方に',
 '秘めてるような',
 '掴みたい自分への挑戦',
 '掴みに行くんだ',
 '積み重ねた勝利を背に',
 '踏み込む世界は輝きの証明',
 '今までとは',
 '止まらない',
 'このチームで',
 'はね返すさ',
 'とてつもないナニか',
 'いつでも',
 'もっと自由になる',
 '勝ちたいオモイで強くなれば',
 '必ずできるさ',
 '必ず行けるさ',
 '眩しさへ進め',
 '負けるワケがない',
 '負ける気がしない',
 '瞬きもせずに目を凝らした先で',
 '開きかけてる',
 '光が集うコートで',
 '光が集う場所へ',
 '向かい風ごと巻き込んで行くんだ',
 '活かしあえる',
 '答えになる',
 '高いレベルでだから',
 'пять',
 'был',
 'выжжены',
 'выси',
 'вытерты',
 'мы',
 'мысли',
 'их',
 'руки',
 'тут',
 'степным',
 'что',
 'это',
 'все',
 'встретимся',
 'приношу',
 'до',
 'может',
 'но',
 'по',
 'поднимет',
 'постучав',
 'собою',
 

## Lyric Generation

The original author built his own Markov chain class to generate lyrics. We're just going to use the Markovify library by [jsvine](https://github.com/jsvine/markovify).

Since it doesn't really make sense to generate lyrics for an entire corpus, we'll just test lyric generation on the bands that we have the most data for.

# TODO: VISUALIZATION OF LYRIC GENERATION

In [None]:
import markovify

