# Midterm

Authors: Cassie Corey, Jay Zou

Tasks:
1. Read in (parse, tokenize, ...) the text (30 points)
2. Visualize the text using different interactive Bokeh visualizations (10 points for each different type of interactive visualizations - max 30).
3. Cluster the text and visualize using interactive Bokeh visualizations (10 points for each different type of interactive visualizations - max 30).
4. Explain what you've seen (10 points)

### Introduction

Our text mining and visualizations are based on the [Heavy Metal Text Mining](https://paulvanderlaken.com/2017/09/27/text-mining-pythonic-heavy-metal/) example. This example looks at multiple characteristics of the lyrics, some of which include: TFIDF, cosine distances between word distributions, emotional arcs, swearwords, and lyric generation. The characteristics were visualized using various scatter plots, graphs, trees, and word clouds.

### Gathering Data

Since this example does not provide a dataset of lyrics, we collected lyrics ourselves by scraping [MetroLyrics](https://www.metrolyrics.com). First we chose a music genre: Tech Death Metal. We got our list of bands from [Wikipedia's list of Technical Death Metal Bands](https://en.wikipedia.org/wiki/List_of_technical_death_metal_bands). The following cell scrapes the band names from that page.

In [188]:
import requests
from bs4 import BeautifulSoup

WIKI_URL = "https://en.wikipedia.org/wiki/List_of_technical_death_metal_bands"

req = requests.get(WIKI_URL)
soup = BeautifulSoup(req.content, 'lxml')
table_cells = soup.find_all("td")

artists = []
for cell in table_cells:
    link = cell.find('a',href=True)
    if link is not None:
        if '[' not in link.text:
            artists.append(link.text.replace('(band)','').strip())

# It'll be convenient to have a lowercase version for URLs and indexing.
artists_L = [a.lower() for a in artists]

The next cell defines helper functions for collecting each artist's song URLs and for extracting the lyrics from a song URL.

In [112]:
from bs4 import BeautifulSoup
from time import sleep, time
import random, requests

BASE_URL = "http://www.metrolyrics.com/"

def get_song_urls(artists):
    art_song_dict = {}
    for artist in artists:
        url = BASE_URL + artist.replace(' ','-') + "-lyrics.html"
        sleep(random.randint(0,10))
        response = requests.get(url)
        if response.status_code != 404: # Not all artists might be on MetroLyrics
            soup = BeautifulSoup(response.content, 'lxml')
            links = [a['href'] for a in soup.find_all('a',href=True)]
            song_list = []
            for link in links:
                if "lyrics-" + artist.replace(' ','-') in link:
                    song_list.append(link)
            art_song_dict[artist] = song_list
    return art_song_dict

def get_lyrics(song_url):
    sleep(random.randint(0,10))
    response = requests.get(song_url)
    soup = BeautifulSoup(response.content, 'lxml')
    verses = soup.find_all("p",{"class":"verse"})
    lyrics = ''
    for verse in verses:
        lyrics += verse.text.replace('\n',' ') + ' '
    return lyrics

def song_from_url(song_url):
    # Strip the site prefix, then keep the song title that precedes 'lyrics'.
    return song_url[len(BASE_URL):].split('lyrics')[0].replace('-',' ').strip()

In [113]:
t0 = time()
print('Fetching song urls...',end='')
art_song_dict = get_song_urls(artists_L)
print('Done in {:.2f}s'.format(time()-t0))

Artist:  7 horns 7 eyes
Artist:  abnormality
Artist:  aborted
Artist:  aeon
Artist:  anata
Artist:  arsis
Artist:  as they sleep
Artist:  atheist
Artist:  augury
Artist:  becoming the archetype
Artist:  beneath the massacre
Artist:  beyond creation
Artist:  born of osiris
Artist:  brain drill
Artist:  cannibal corpse
Artist:  cephalic carnage
Artist:  circle of contempt
Artist:  the contortionist
Artist:  coprofago
Artist:  cryptopsy
Artist:  cynic
Artist:  death
Artist:  decapitated
Artist:  decrepit birth
Artist:  dying fetus
Artist:  extol
Artist:  fallujah
Artist:  the faceless
Artist:  gojira
Artist:  gorod
Artist:  gorguts
Artist:  grimaze
Artist:  the haarp machine
Artist:  in battle
Artist:  into the moat
Artist:  knights of the abyss
Artist:  meshuggah
Artist:  monstrosity
Artist:  necrophagist
Artist:  neuraxis
Artist:  ne obliviscaris
Artist:  nile
Artist:  nocturnus
Artist:  obscura
Artist:  oceano
Artist:  opeth
Artist:  origin
Artist:  pestilence
Artist:  psycroptic
Artis

In [None]:
import pandas as pd

# Initialize a dataframe to hold the lyrics
lyrics_df = pd.DataFrame(columns=['artist','song','lyrics'])

In [136]:
t0 = time()
for artist in art_song_dict:
    print("Fetching lyrics for: ",artist)
    for song_url in art_song_dict[artist]:
        song = song_from_url(song_url)
        if song not in lyrics_df.song.values:
            lyrics = get_lyrics(song_url)
            # DataFrame.append was removed in pandas 2.0; add the row by label instead.
            lyrics_df.loc[len(lyrics_df)] = [artist, song, lyrics]
print('Done in {:.2f}s'.format(time()-t0))

Fetching lyrics for:  gojira
Fetching lyrics for:  monstrosity
Fetching lyrics for:  as they sleep
Fetching lyrics for:  oceano
Fetching lyrics for:  born of osiris
Fetching lyrics for:  origin
Fetching lyrics for:  extol
Fetching lyrics for:  dying fetus
Fetching lyrics for:  rings of saturn
Fetching lyrics for:  nile
Fetching lyrics for:  in battle
Fetching lyrics for:  suffocation
Fetching lyrics for:  pestilence
Fetching lyrics for:  cryptopsy
Fetching lyrics for:  obscura
Fetching lyrics for:  meshuggah
Fetching lyrics for:  death
Fetching lyrics for:  nocturnus
Fetching lyrics for:  becoming the archetype
Fetching lyrics for:  revocation
Fetching lyrics for:  aborted
Fetching lyrics for:  cephalic carnage
Fetching lyrics for:  the contortionist
Fetching lyrics for:  anata
Fetching lyrics for:  psycroptic
Fetching lyrics for:  arsis
Fetching lyrics for:  augury
Fetching lyrics for:  decapitated
Fetching lyrics for:  aeon
Fetching lyrics for:  ne obliviscaris
Fetching lyrics for:  

ConnectionError: HTTPConnectionPool(host='www.metrolyrics.com', port=80): Max retries exceeded with url: /unite-the-dead-lyrics-cannibal-corpse.html (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x1080ca6a0>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known',))

In [139]:
# Save the lyrics DataFrame for future use.
lyrics_df.to_csv('lyrics.csv')

There were some lyrics that weren't available on MetroLyrics. The following is an attempt to get the missing lyrics from another site: Genius.com.

In [182]:
def get_genius_lyrics(song,artist):
    url = "http://genius.com/{}-{}-lyrics".format(artist.replace(' ','-'),song.replace(' ','-'))
    print(url)
    response = requests.get(url)
    if response.status_code != 404:
        soup = BeautifulSoup(response.content,'lxml')
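        lyrics = soup.find("div",{"class":"lyrics"})
        # Treat a page without the expected lyrics block as missing lyrics.
        if lyrics is None or lyrics.find("p") is None:
            return ''
        text = lyrics.find("p").text.replace('\n',' ')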
        print(text)
        return text
    return ''

In [186]:
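# Assign through .loc so the value is written to lyrics_df itself, not to a temporary copy of the row.
lyrics_df.loc[lyrics_df.index[23], 'lyrics'] = get_genius_lyrics('ocean planet','gojira')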

http://genius.com/gojira-ocean-planet-lyrics
I'm in a mental cage I'm locked up Imprisoned I live Deathlike, sickening Strong is your hold On my resignation I don't see the stars My memories are veiled  In fluid dreams I fall I'm restless Walls made of stone Are turned into water now Enlightened demons Are taking me by the hand Approaching me This great eye speaking  Mountainous waves Are breaking on my despair Awaken me but I'm still dreaming And I just plunge Into this sea of light Set open the doors of soul I'm living  Lightning struck me I see the path I was so scared of And fly to the stars Conviction now increasing at last My skin is broken I see the smallest part of me My mind is alive But I'll never bow to this again  Why do they call me there How can I fly All this water I don't feel like I could ever swim to them Whales in the sky I feel they're so close Inside, and yet so far away  Burst into tears, I feel sad My dreams aflame The force is now Lie on a stone Drop this load a

In [187]:
lyrics_df.iloc[23]

artist                                               gojira
song                                           ocean planet
lyrics    I'm in a mental cage I'm locked up Imprisoned ...
Name: 23, dtype: object

In [189]:
missing_songs = lyrics_df[lyrics_df.lyrics==''].song
print("{} songs missing!".format(len(missing_songs)))

t0 = time()
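# Iterate over (row label, song) pairs so each song is matched with its own artist;
# enumerate() would count from 0 and look up the wrong rows.
for idx,song in missing_songs.items():
    artist = lyrics_df.loc[idx,'artist']
    lyrics = get_genius_lyrics(song,artist)
    if lyrics != '':
        lyrics_df.loc[idx,'lyrics'] = lyrics
print("Done in {:.2f}".format(time()-t0))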

# Save our work.
lyrics_df.to_csv('lyrics.csv')

150 songs missing!
http://genius.com/gojira-from-the-sky-lyrics
At the very first sound There was just light And then, a storm Of time and space Just came and struck Created our time In water life We understand It just only began  Forced to look to the sky And wonder why We cannot face the fact that We're all scared now Of mysteries of life There is a mask that soon will fall Before the strong embrace Of love and might Of light in the dark I go for a quest I have to give myself the answer Enter now this place in the wild I can see the glade My feeling now is growing bigger  From the sky From the sky  I do feel like no one can save me I am so alone and yet I cried I called for help, forsaken But now I know The only way is to Understand the living Obey the rule of light And face the fear Inside out!  Lost, I found there a stone Erected in line With one the brightest stars Of all the night sky vault And I took my time Took off the moss Washed away the dust And gave a new lease of life Its

http://genius.com/monstrosity-the-war-ender-lyrics
http://genius.com/monstrosity-i-am-lyrics
http://genius.com/monstrosity-the-sky-bearer-lyrics
http://genius.com/monstrosity-the-time-bender-lyrics
http://genius.com/monstrosity-the-eyes-of-the-storm-lyrics
http://genius.com/monstrosity-the-weapon-breaker-lyrics
http://genius.com/as-they-sleep-no-fall-too-far-lyrics
http://genius.com/as-they-sleep-the-trivial-paroxysm-lyrics
http://genius.com/as-they-sleep-dichotomy-lyrics
http://genius.com/as-they-sleep-breathing-light-lyrics
http://genius.com/as-they-sleep-the-ocean-walker-lyrics
http://genius.com/as-they-sleep-cardiac-rebellion-lyrics
http://genius.com/as-they-sleep-the-sun-eater-lyrics
http://genius.com/as-they-sleep-the-planet-maker-lyrics
http://genius.com/as-they-sleep-enter-the-hall-lyrics
http://genius.com/as-they-sleep-leviathan-awaits-lyrics
http://genius.com/as-they-sleep-pestilence-reigns-lyrics
http://genius.com/oceano-dismantle-the-dictator-lyrics
http://genius.com/oceano

## TFIDF

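Term frequency-inverse document frequency (TFIDF) is a good way to visualize which words are most characteristic of a given text: a word scores highly when it is frequent in that text but rare across the rest of the collection.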

In [None]:
# TODO
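
As a starting point for this section, here is a minimal sketch of a TFIDF computation using only the standard library. It assumes a simple regex tokenizer and the unsmoothed weighting `tf * log(N / df)`; library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed variant by default, so their scores will differ. The names `tokenize`, `tfidf`, and `toy_lyrics` are hypothetical, and the toy lyrics are made up rather than taken from the scraped data.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def tfidf(documents):
    """Return one {term: score} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty lyrics
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical toy lyrics, not taken from the scraped dataset.
toy_lyrics = [
    "the ocean swallows the light",
    "the storm breaks the light",
    "the stone sleeps under the moss",
]
toy_scores = tfidf(toy_lyrics)
```

With this weighting, a word that occurs in every song (here "the") gets a score of zero, while a word unique to one song scores highest there. For the real data, each artist's combined lyrics from `lyrics_df` could serve as one document.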

## Swear Words

The original author did not provide a full dataset of lyrics (or even the code to scrape it). He did, however, provide a list of naughty words. So, we used that to explore naughty words in the lyrics we collected.

In [128]:
# TODO
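
One way to explore the word list is to count its matches per song and sum them per artist. The sketch below assumes a plain set of lowercase words and rows shaped like those in `lyrics_df`. `SWEAR_WORDS` is a hypothetical stand-in for the original author's list, and `count_swear_words`, `swear_rate`, and `toy_rows` are illustrative names with made-up data.

```python
import re
from collections import Counter

# Hypothetical stand-in; the real list comes from the original article.
SWEAR_WORDS = {"damn", "hell"}


def count_swear_words(lyrics, swear_words=SWEAR_WORDS):
    """Count how often each listed word occurs in one song's lyrics."""
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    return Counter(token for token in tokens if token in swear_words)


def swear_rate(lyrics, swear_words=SWEAR_WORDS):
    """Return the fraction of tokens that are on the list (0.0 if empty)."""
    tokens = re.findall(r"[a-z']+", lyrics.lower())
    if not tokens:
        return 0.0
    return sum(token in swear_words for token in tokens) / len(tokens)


# Hypothetical toy rows with the same fields as lyrics_df.
toy_rows = [
    {"artist": "band a", "song": "song one",
     "lyrics": "Damn the hell below, damn it all"},
    {"artist": "band b", "song": "song two",
     "lyrics": "The ocean is silent"},
]

# Total matches per artist, summed over that artist's songs.
per_artist = Counter()
for row in toy_rows:
    per_artist[row["artist"]] += sum(count_swear_words(row["lyrics"]).values())
```

Reporting a rate alongside the raw count keeps artists with long lyrics from dominating the comparison.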

## Cosine Distance

In [None]:
# TODO
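
Cosine distance compares two word distributions by the angle between their count vectors: 0 means the same proportions of words, and 1 means no words in common. The following is a minimal sketch under the assumption that each text is represented by raw word counts; TFIDF weights could be substituted. The function `cosine_distance` and the toy texts are hypothetical.

```python
import math
from collections import Counter


def cosine_distance(counts_a, counts_b):
    """Return 1 - cosine similarity of two {word: count} mappings."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[word] * counts_b[word] for word in shared)
    norm_a = math.sqrt(sum(value * value for value in counts_a.values()))
    norm_b = math.sqrt(sum(value * value for value in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        # Convention chosen for this sketch: an empty text is maximally distant.
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy texts, not taken from the scraped dataset.
toy_a = Counter("the storm of light and the storm of time".split())
toy_b = Counter("the storm of light and the storm of time".split())
toy_c = Counter("whales sleep under stone".split())

same_text_distance = cosine_distance(toy_a, toy_b)
disjoint_distance = cosine_distance(toy_a, toy_c)
```

Identical texts give a distance of approximately 0 (up to floating-point rounding), and texts that share no words give exactly 1.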

## Lyric Generation

In [None]:
# TODO