# pyraps

For our final project, we will attempt to classify rap styles based on song lyrics. We will use a subset of rap music published in the 1990s, when rap styles from different geographical regions were distinct; modern rap, by contrast, has become more of an amalgamation of the main regional styles. The two main geographical regions we will look at are east-coast (New York City) and west-coast (Los Angeles). We may extend this to classify more modern movements such as southern (Atlanta) and midwest (Detroit).

Our project consists of three parts.

1. Data Collection - We will build a database of lyrics from 1990s rap artists and label each song with the rapper's style, as determined by geographical location.
2. Creating features - We will create features to capture the rhythm and rhyme of a song, as well as the particular lyrical content and vocabulary.
3. Training a classifier - Using our features, we will train different types of classifiers and compare results.

## Creating Features

We will use two sets of information as features for our machine learning algorithms: lyrical content (the actual words used and the order in which they occur) and rhyme patterns together with rhythm, which we will analyze using NLTK.

### Lyrical Content

One easy way to generate these features is to take a pretrained neural network that handles text and drop its last layer (the layer that converts the learned features into the network's expected output). The activations of the remaining layers then serve as our features.
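
To make the idea of dropping the last layer concrete, here is a toy sketch in plain Python. The two-layer network, its weights, and the names `LAYERS`, `forward`, and `extract_features` are all made up for illustration; a real pretrained text model would supply its own layers and weights, and nothing here commits the project to a particular one.

```python
def relu(values):
    """Element-wise max(0, x)."""
    return [max(0.0, v) for v in values]


def identity(values):
    return list(values)


def dense(weights, bias, inputs):
    """One fully connected layer: weights is a list of rows, one per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


# Hypothetical stand-in for a pretrained network: (weights, bias, activation).
LAYERS = [
    ([[1.0, -1.0, 0.5], [0.0, 2.0, -1.0]], [0.0, 0.5], relu),  # hidden layer
    ([[1.0, 1.0]], [0.0], identity),                           # output layer
]


def forward(inputs, layers):
    """Run the inputs through the given layers in order."""
    for weights, bias, activation in layers:
        inputs = activation(dense(weights, bias, inputs))
    return inputs


def extract_features(inputs):
    """Use every layer except the last, so we get learned features, not predictions."""
    return forward(inputs, LAYERS[:-1])


features = extract_features([1.0, 0.0, 2.0])
prediction = forward([1.0, 0.0, 2.0], LAYERS)
```

The hidden-layer activations in `features` are what we would feed to our own classifier; `prediction` is the original network's output, which we discard.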

### Rhyme Patterns

Rhyme patterns manifest themselves differently in east-coast and west-coast rap. East-coast rap tends to create intricate, interlacing rhyme patterns, whereas west-coast rap focuses more on creating a vibe than on building intense rhyme structures. We can use this as another feature to train on, which should provide better separation between the two classes.

Let's take a look at a few lines from Nas' "NY State of Mind", a classic east-coast style song.

1. Rappers I <font color='blue'>monkey</font> <font color='red'>flip em</font> with the <font color='blue'>funky</font> <font color='red'>rhythm</font> I be <font color='red'>kickin'</font>
2. <font color='red'>musician</font>, <font color='red'>inflictin</font> <font color='red'>composition</font>
3. <font color='green'>of pain</font> I'm like Scarface <font color='red'>sniffin</font> <font color='green'>cocaine</font>
4. Holdin a <font color='purple'>M-16</font>, see with the pen <font color='purple'>I'm extreme</font>, now

Now let's take a look at a few lines from 2Pac's "California Love", a west-coast style song.

1. Now let me welcome everybody to the wild, wild <font color='red'>west</font>
2. A state that's untouchable like Elliot <font color='red'>Ness</font>
3. The track hits ya eardrum like a slug to ya <font color='red'>chest</font>
4. Pack a <font color='red'>vest</font> for your Jimmy in the city of <font color='red'>sex</font>

We can immediately see a difference in rhyme style between these two styles of rap. East coast tends to have more rhymes in general, with a variety of rhyme patterns interspersed throughout the lines, whereas west coast focuses on simpler last-word rhymes.
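
One rough way to turn this observation into numbers is to count how many rhyming words occur per line and what fraction of them fall on the last word of a line. In the sketch below, the rhyme labels are copied by hand from the colour highlighting in the two excerpts above (`None` marks a word that was not highlighted, and a highlighted pair such as "flip em" counts as one unit). The function name `rhyme_stats` and the data layout are assumptions for illustration, not part of the project code.

```python
# One list per line; each entry is the rhyme colour of a word, or None.
NAS = [
    [None, None, "blue", "red", None, None, "blue", "red", None, None, "red"],
    ["red", "red", "red"],
    ["green", None, None, None, "red", "green"],
    [None, None, "purple", None, None, None, None, "purple", None],
]

TWO_PAC = [
    [None] * 9 + ["red"],
    [None] * 6 + ["red"],
    [None] * 10 + ["red"],
    [None, None, "red"] + [None] * 7 + ["red"],
]


def rhyme_stats(lines):
    """Return (rhyming words per line, fraction of rhyming words that end a line)."""
    total = sum(1 for line in lines for label in line if label is not None)
    line_final = sum(1 for line in lines if line and line[-1] is not None)
    per_line = total / float(len(lines))
    final_fraction = line_final / float(total) if total else 0.0
    return per_line, final_fraction


nas_stats = rhyme_stats(NAS)
pac_stats = rhyme_stats(TWO_PAC)
```

On these eight lines the Nas excerpt has more rhymes per line and a much smaller share of them at line ends, which matches the description above. Eight lines are of course far too few to draw conclusions from.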

How can we do this computationally? We will use the CMU Pronouncing Dictionary (CMUdict), which is available through the NLTK package.
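
Before loading the real dictionary, it may help to see the shape of the data. `cmudict.dict()` maps each lowercase word to a list of pronunciations, and each pronunciation is a list of ARPAbet phonemes in which vowels end in a stress digit (0 for no stress, 1 for primary, 2 for secondary). The sketch below uses a tiny hand-copied stand-in, `MINI_CMUDICT`, so that it runs without NLTK; that name and the two helper functions are illustrative assumptions.

```python
# Hand-copied stand-in with the same shape as nltk.corpus.cmudict.dict().
MINI_CMUDICT = {
    "monkey": [["M", "AH1", "NG", "K", "IY0"]],
    "funky": [["F", "AH1", "NG", "K", "IY0"]],
    "cocaine": [["K", "OW0", "K", "EY1", "N"]],
}


def vowel_positions(pron):
    """Indices of the vowel phonemes; only vowels end in a stress digit."""
    return [i for i, phoneme in enumerate(pron) if phoneme[-1].isdigit()]


def syllable_count(word, lexicon=MINI_CMUDICT):
    """Syllables in the first listed pronunciation, or None for unknown words."""
    prons = lexicon.get(word.lower())
    if not prons:
        return None
    return len(vowel_positions(prons[0]))


print(syllable_count("Monkey"))
print(vowel_positions(MINI_CMUDICT["cocaine"][0]))
```

Words that are missing from the dictionary, such as "M-16", come back as `None` here, which is the case the rest of the notebook has to handle.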

In [1]:
import nltk
import pandas as pd
import scipy as sp
import numpy as np
from nltk.corpus import cmudict

In [2]:
class Pronunciation(object):
    
    CMUDICT = cmudict.dict()
    
    def __init__(self, word):
        self.word = word
        self.word_lower = word.lower()
        if self.word_lower in Pronunciation.CMUDICT:
            self.pron = Pronunciation.CMUDICT[self.word_lower][0]
            self.syllable_loc = [i for i in xrange(len(self.pron)) if self.pron[i][-1].isdigit()]
        else:
            self.pron = None
            self.syllable_loc = None
    def __repr__(self):
        if self.pron:
            pron_repr =  "/".join(self.pron)
        else:
            pron_repr = "?"
        return "%s(%s)" % (self.word,pron_repr)
    
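    def rhyme_group(self):
        # Unknown words have no pronunciation, so check for that first.
        if self.pron is None:
            return "UNKNOWN_GROUP"
        # No vowel in the pronunciation: fall back to the whole thing.
        if not self.syllable_loc:
            return "/".join(self.pron)
        # Otherwise use everything from the last vowel onward.
        return "/".join(self.pron[self.syllable_loc[-1]:])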
    
    def __eq__(self,other):
        return self.word_lower == other.word_lower
    
    def __hash__(self):
        return hash(self.word_lower)
    
        

def tokenize(s):
    tokenizer = nltk.tokenize.RegexpTokenizer(r"\w[\w-]*'?[\w-]*")
    tokenized_lines = [tokenizer.tokenize(line) for line in s.split("\n") if line]
    return tuple([tuple([Pronunciation(token) for token in token_line]) for token_line in tokenized_lines])
    

nas = '''Rappers I monkey flip em with the funky rhythm I be kicking\nmusician, inflicting composition\nof pain '''+\
      '''I'm like Scarface sniffing cocaine\nHolding a M-16, see with the pen I'm extreme, now\n\n'''
token_lines = tokenize(nas)
for line in token_lines:
    print line

(Rappers(R/AE1/P/ER0/Z), I(AY1), monkey(M/AH1/NG/K/IY0), flip(F/L/IH1/P), em(EH1/M), with(W/IH1/DH), the(DH/AH0), funky(F/AH1/NG/K/IY0), rhythm(R/IH1/DH/AH0/M), I(AY1), be(B/IY1), kicking(K/IH1/K/IH0/NG))
(musician(M/Y/UW0/Z/IH1/SH/AH0/N), inflicting(IH0/N/F/L/IH1/K/T/IH0/NG), composition(K/AA2/M/P/AH0/Z/IH1/SH/AH0/N))
(of(AH1/V), pain(P/EY1/N), I'm(AY1/M), like(L/AY1/K), Scarface(S/K/AA1/R/F/EY2/S), sniffing(S/N/IH1/F/IH0/NG), cocaine(K/OW0/K/EY1/N))
(Holding(HH/OW1/L/D/IH0/NG), a(AH0), M-16(?), see(S/IY1), with(W/IH1/DH), the(DH/AH0), pen(P/EH1/N), I'm(AY1/M), extreme(IH0/K/S/T/R/IY1/M), now(N/AW1))


Now we need to define some sort of metric for rhyming words.
We know that monkey(M/AH1/NG/K/IY0) rhymes with funky(F/AH1/NG/K/IY0) and that it is a perfect rhyme. Let's break this down. Monkey has two syllables and thus two vowels, each of which carries a stress marker (the trailing digit). These vowels let us separate the syllables: monkey can be broken down into (M/AH1/NG) and (K/IY0); funky can be broken down into (F/AH1/NG) and (K/IY0). Immediately, we see that the last syllables of the two words rhyme because they are equal. The NG at the end of 'mon' and 'fun' also adds to the rhyme, but the relationship that makes this a strong rhyme is the equivalence of the last syllable.
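
As a hedged sketch of a stricter check than comparing last syllables, one common definition of a perfect rhyme requires the two words to match from the last primary-stressed vowel onward. The pronunciations below are copied from the tokenizer output above; the names `rhyme_tail` and `is_perfect_rhyme` are illustrative and not part of the notebook's code.

```python
MONKEY = ["M", "AH1", "NG", "K", "IY0"]
FUNKY = ["F", "AH1", "NG", "K", "IY0"]
PAIN = ["P", "EY1", "N"]
COCAINE = ["K", "OW0", "K", "EY1", "N"]
KICKING = ["K", "IH1", "K", "IH0", "NG"]
HOLDING = ["HH", "OW1", "L", "D", "IH0", "NG"]


def rhyme_tail(pron):
    """Phonemes from the last primary-stressed vowel (digit 1) to the end."""
    stressed = [i for i, phoneme in enumerate(pron) if phoneme.endswith("1")]
    if stressed:
        return tuple(pron[stressed[-1]:])
    # No primary stress: fall back to the last vowel, or the whole word.
    vowels = [i for i, phoneme in enumerate(pron) if phoneme[-1].isdigit()]
    return tuple(pron[vowels[-1]:]) if vowels else tuple(pron)


def is_perfect_rhyme(pron_a, pron_b):
    """Different words whose pronunciations share the same stressed tail."""
    return pron_a != pron_b and rhyme_tail(pron_a) == rhyme_tail(pron_b)


print(is_perfect_rhyme(MONKEY, FUNKY))
print(is_perfect_rhyme(KICKING, HOLDING))
```

Under this definition monkey/funky and pain/cocaine are perfect rhymes, while kicking/holding, which only share an unstressed "-ing" ending, are not.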

Let's look at a harder example. The pair flip(F/L/IH1/P), em(EH1/M) rhymes with rhythm(R/IH1/DH/AH0/M). To simplify things, let's just look at em(EH1/M) and rhythm(R/IH1/DH/AH0/M). This is a weak rhyme because the vowels of the final syllables differ (EH1 versus AH0) but sound similar. This is another complication we need to take into account.
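
One possible way to account for weak rhymes, sketched under our own assumptions, is to score a pair of words instead of demanding equality: full credit when the final vowel and the consonants after it match, partial credit when only the trailing consonants match. The scores 1.0 and 0.5 and the names `final_syllable` and `rhyme_score` are arbitrary choices for illustration.

```python
EM = ["EH1", "M"]
RHYTHM = ["R", "IH1", "DH", "AH0", "M"]
MONKEY = ["M", "AH1", "NG", "K", "IY0"]
FUNKY = ["F", "AH1", "NG", "K", "IY0"]
PAIN = ["P", "EY1", "N"]


def final_syllable(pron):
    """Return (last vowel without its stress digit, consonants after it)."""
    vowels = [i for i, phoneme in enumerate(pron) if phoneme[-1].isdigit()]
    if not vowels:
        return None
    last = vowels[-1]
    return pron[last][:-1], tuple(pron[last + 1:])


def rhyme_score(pron_a, pron_b):
    """1.0 for matching vowel and ending, 0.5 for matching ending only, else 0.0."""
    syl_a, syl_b = final_syllable(pron_a), final_syllable(pron_b)
    if syl_a is None or syl_b is None:
        return 0.0
    (vowel_a, coda_a), (vowel_b, coda_b) = syl_a, syl_b
    if coda_a != coda_b:
        return 0.0
    if vowel_a == vowel_b:
        return 1.0
    # Different vowels only count when there is a shared consonant ending.
    return 0.5 if coda_a else 0.0


print(rhyme_score(EM, RHYTHM))
print(rhyme_score(MONKEY, FUNKY))
```

With this scoring, em/rhythm gets partial credit because both end in M, monkey/funky gets full credit, and pain/rhythm gets none.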

Let's implement a quick naive rhyme scheme to see all of our strong rhymes...

In [3]:
import collections
def rhyme_groups_naive(tokens):
    groups = collections.defaultdict(set)
    for line in tokens:
        for token in line:
            group = token.rhyme_group()
            groups[group].add(token)
    return dict(groups)

strong_rhyme_groups = rhyme_groups_naive(token_lines)

for (k,v) in strong_rhyme_groups.iteritems():
    if len(v) > 1:
        print k, v

AH0 set([a(AH0), the(DH/AH0)])
EY1/N set([cocaine(K/OW0/K/EY1/N), pain(P/EY1/N)])
IY1 set([be(B/IY1), see(S/IY1)])
IY0 set([funky(F/AH1/NG/K/IY0), monkey(M/AH1/NG/K/IY0)])
IH0/NG set([sniffing(S/N/IH1/F/IH0/NG), kicking(K/IH1/K/IH0/NG), inflicting(IH0/N/F/L/IH1/K/T/IH0/NG), Holding(HH/OW1/L/D/IH0/NG)])
AH0/N set([musician(M/Y/UW0/Z/IH1/SH/AH0/N), composition(K/AA2/M/P/AH0/Z/IH1/SH/AH0/N)])


Let's visualize this to see if it matches our manual rhyme annotation above.

In [4]:
import IPython.display, random

def random_color():
    return "#%03x" % random.randint(0, 0xFFF)

# get rid of solo groups
groups = [[k,v] for (k,v) in strong_rhyme_groups.iteritems() if len(v) > 1 and k != "UNKNOWN_GROUP"]
print groups
# assign colors
for group in groups:
    group[0] = random_color()
# reverse keys and value
color_dict = dict(reduce(lambda x,y: x+y,[[(v_i,k) for v_i in v] for [k,v] in groups]))

html = ""
for token_line in token_lines:
    for token in token_line:
        if token in color_dict:
            html += "<b><font color=%s>%s</font></b> " % (color_dict[token], token.word)
        else:
            html += token.word + " "
    html += "<br>"
        

IPython.display.display_html(html, raw=True)

[[u'AH0', set([a(AH0), the(DH/AH0)])], [u'EY1/N', set([cocaine(K/OW0/K/EY1/N), pain(P/EY1/N)])], [u'IY1', set([be(B/IY1), see(S/IY1)])], [u'IY0', set([funky(F/AH1/NG/K/IY0), monkey(M/AH1/NG/K/IY0)])], [u'IH0/NG', set([sniffing(S/N/IH1/F/IH0/NG), kicking(K/IH1/K/IH0/NG), inflicting(IH0/N/F/L/IH1/K/T/IH0/NG), Holding(HH/OW1/L/D/IH0/NG)])], [u'AH0/N', set([musician(M/Y/UW0/Z/IH1/SH/AH0/N), composition(K/AA2/M/P/AH0/Z/IH1/SH/AH0/N)])]]


## Musixmatch API

In this section we will start using the Musixmatch API to scrape songs and their lyrics. We will import the Python requests library and make calls to the API with the API key that we registered for.

The standard format for the requests will be:

"http://api.musixmatch.com/ws/1.1/method?track_id=?&apikey=?"

where method is one of the API methods, such as "track.lyrics.get", "track.search", "chart.artists.get", and many others.
We need to fill in a track_id for the song and our own apikey.
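
As a small sketch of how such a request URL is put together, the helper below joins the base URL, the method name, and the query parameters. The helper name `build_url` and the placeholder key are assumptions for illustration; no request is actually sent.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode  # Python 2

BASE_URL = "http://api.musixmatch.com/ws/1.1/"


def build_url(method, **params):
    """Build a request URL for an API method from keyword parameters."""
    # Sort the parameters so the resulting URL is deterministic.
    query = urlencode(sorted(params.items()))
    return "%s%s?%s" % (BASE_URL, method, query)


# Placeholder values, not a real track id or key.
example_url = build_url("track.lyrics.get", track_id=123, apikey="YOUR_API_KEY")
print(example_url)
```

In practice the `requests` library does this encoding for us when we pass a `params` dictionary, which is what the code in this notebook relies on.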

## Search Function

The code below queries the Musixmatch database for you. All you need to do is pass in the artist and title, and the function will return the lyrics. Every song in the Musixmatch database has a corresponding track id, and we need that id to request a song's lyrics. So we first lowercase the artist and title and search for the song to find its track id, and then use the track id to fetch the lyrics.
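
The first half of that two-step lookup can be sketched without touching the network. The response dictionary below is a hand-written sample that only mirrors the nesting the notebook's code reads (`message`, `header`, `body`, `track_list`); the track id in it is made up, and `first_track_id` is an illustrative helper name.

```python
# Hand-written sample; real responses contain many more fields.
SAMPLE_RESPONSE = {
    "message": {
        "header": {"status_code": 200},
        "body": {"track_list": [{"track": {"track_id": 12345}}]},
    }
}

EMPTY_RESPONSE = {
    "message": {"header": {"status_code": 200}, "body": {"track_list": []}}
}


def first_track_id(search_response):
    """Return the track id of the first search hit, or None if nothing matched."""
    message = search_response["message"]
    status = message["header"]["status_code"]
    if status != 200:
        raise ValueError("search failed with status code %d" % status)
    track_list = message["body"]["track_list"]
    if not track_list:
        return None
    return track_list[0]["track"]["track_id"]


print(first_track_id(SAMPLE_RESPONSE))
```

Returning `None` for an empty result lets the caller decide whether a missing song is an error or something to skip.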

In [5]:
import requests
from datetime import datetime

class MusixApi:
    def __init__(self, apikey):
        self.apikey = apikey
        self.search_url = "http://api.musixmatch.com/ws/1.1/track.search"
        self.lyrics_get_url = "http://api.musixmatch.com/ws/1.1/track.lyrics.get"
        self.artist_search_url = "http://api.musixmatch.com/ws/1.1/artist.search"
        self.album_get_url = "http://api.musixmatch.com/ws/1.1/artist.albums.get"
        self.album_tracks_get_url = "http://api.musixmatch.com/ws/1.1/album.tracks.get"
        self.track_lyrics_get = "http://api.musixmatch.com/ws/1.1/track.lyrics.get"
        
    def search(self, artist, title):
        '''
        Pass in artist/title and return song lyrics
        Basic search capability
        '''
        
        url = self.search_url
        params = {"q_track": title.lower(),
                  "q_artist": artist.lower(),
                  "f_has_lyrics": 1,
                  "apikey": self.apikey}
        song = requests.get(url, params=params).json()
        status_code = song["message"]["header"]["status_code"]
        if status_code != 200:
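            raise Exception("Received status code %d" % status_code)
        track_list = song['message']['body']['track_list']
        if not track_list:
            # the search succeeded but matched nothing
            raise Exception("No track found for %s - %s" % (artist, title))
        track_id = track_list[0]['track']['track_id']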
        
        url = self.lyrics_get_url
        params = {"track_id": track_id,
                  "apikey": self.apikey}
        lyrics = requests.get(url, params=params).json()
        status_code = lyrics["message"]["header"]["status_code"]
        if status_code != 200:
            raise Exception("Received status code %d" % status_code)
        return lyrics['message']['body']['lyrics']['lyrics_body']
    
    def artist_id(self, artist):
        '''
        This function returns the artist ID for an artist
        
        Input: An artist name
        Output: The artist ID of the first search result
        
        '''
        params = {"q_artist": artist.lower(),
                  "page_size": 5,
                  "apikey": self.apikey}
        url = self.artist_search_url
        artist_json = requests.get(url, params=params).json()
        status_code = artist_json["message"]["header"]["status_code"]
        if status_code != 200:
            raise Exception("Received status code %d" % status_code)
        artist_list = artist_json['message']['body']['artist_list']
        artist_id = artist_list[0]['artist']['artist_id']
        return artist_id
    
    
    def all_albums(self, artist_id):
        '''
        This function returns all the albums for a given artist ID
        
        Input: the ID of an artist
        Output: a list of albums
        '''
        
        rez = []
        url = self.album_get_url
        page_num = 1
        album_length = 100
        while album_length == 100:
            params = {"artist_id": artist_id,
                      "s_release_date": "desc",
                      "page_size": 100,
                      "page": page_num,
                      "g_album_name": 1,
                      "apikey": self.apikey}
            album_json = requests.get(url, params=params).json()
            status_code = album_json["message"]["header"]["status_code"]
            if status_code != 200:
                raise Exception("Received status code %d" % status_code)
            album_list = album_json['message']['body']['album_list']
            rez += [album_result["album"] for album_result in album_list]
            album_length = len(album_list)
            page_num += 1
        return rez
    
    
    def all_lyris_in_album(self, album):
        '''
        Input: An album (a dict as returned by all_albums)
        Output: A tuple (lyrics, number of lyrics found, number of tracks),
                or (None, None, None) if the track lookup fails
        
        '''
        
        
        album_id = album["album_id"]
        url = self.album_tracks_get_url
        song_url = self.track_lyrics_get
        params = {"album_id": album_id,
                  "page": 1,
                  "page_size": 100,
                  "apikey": self.apikey}
        tracks_json = requests.get(url, params=params).json()
        status_code = tracks_json["message"]["header"]["status_code"]
        if status_code != 200:
            print "Album track lookup for %d failed with status_code %d"\
                % (album_id, status_code)
            return (None, None, None)
        
        track_list = tracks_json['message']['body']['track_list']
        final_lyrics = []
        total = len(track_list)
        for track in track_list:
            song_id = track['track']['track_id']
            song_params = {"track_id": song_id,
                          "apikey": self.apikey}
            response = requests.get(song_url, params=song_params).json()
            status_code = response['message']['header']['status_code']
            if status_code == 200:
                final_lyrics.append(response['message']['body']['lyrics']['lyrics_body'])
        return (final_lyrics, len(final_lyrics), total)
        
    def get_all_lyrics_from_artist(self, artist, date_start, date_end):
        '''
        Input: artist name and the range of album dates we want
        Output: List of (album_name, lyrics) from that artist in said date range
        '''
        def in_date_range(date_string, start, end):
            # release dates come as "YYYY-MM-DD", "YYYY-MM" or "YYYY"
            dt = None
            for fmt in ("%Y-%m-%d", "%Y-%m", "%Y"):
                try:
                    dt = datetime.strptime(date_string, fmt)
                    break
                except (ValueError, TypeError):
                    continue
            if dt is None:
                return False
            return start <= dt <= end
        print "*******************************************************"
        print artist
        print "*******************************************************"
        artist_id = self.artist_id(artist)
        print " * artist_id: %d" % artist_id
        albums = self.all_albums(artist_id)
        print " * number albums: %d" % len(albums)
        albums_in_range = [album for album in albums if 
                         in_date_range(album["album_release_date"], date_start, date_end)]
        print " * number albums in date range: %d" % len(albums_in_range)        
        
        all_lyrics = []
        
        for album in albums_in_range:
            (lyrics, success, total) = self.all_lyris_in_album(album)
            if lyrics is None:
                continue
            all_lyrics.append((album["album_name"], lyrics))
            print " * found (%d/%d) lyrics in album %s" % (success, total, album["album_name"])
        return all_lyrics
        
        
        
        

The following code reads an API key that is stored in a file called 'secrets.json'. For security reasons, it is never a good idea to post personal keys publicly.
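
For reference, here is a hedged sketch of what such a file and a slightly more defensive loader might look like. The field name `musixApiKeyAlt` is the one this notebook reads; the helper `load_api_key`, the throwaway demo file, and the fake key are assumptions for illustration.

```python
import json
import os
import tempfile


def load_api_key(path, field="musixApiKeyAlt"):
    """Read one key from a JSON secrets file, failing loudly if it is missing."""
    with open(path, "r") as f:
        secrets = json.load(f)
    if field not in secrets:
        raise KeyError("%s has no %r field" % (path, field))
    return secrets[field]


# Demo with a throwaway file so no real key is involved.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "secrets.json")
with open(demo_path, "w") as f:
    json.dump({"musixApiKeyAlt": "not-a-real-key"}, f)

demo_key = load_api_key(demo_path)
print(demo_key)
```

Keeping the real 'secrets.json' out of version control is what actually protects the key; the loader only makes a missing field easier to diagnose.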

In [14]:
import json
with open("secrets.json", "r") as f:
    music_parser = MusixApi(json.load(f)["musixApiKeyAlt"])

#search("Taylor Swift", "Back To December")
#print music_parser.search("Mobb Deep", "Survival of the Fittest")
#ID = music_parser.artist_id("Jay-Z")
#albums = music_parser.all_albums(ID)
music_parser.get_all_lyrics_from_artist("Coldplay", datetime(2008,1,1), datetime(2009,1,1))

[{u'album_copyright': u'2016 Shydog Productions',
  u'album_coverart_100x100': u'http://s.mxmcdn.net/images-storage/albums/nocover.png',
  u'album_coverart_350x350': u'',
  u'album_coverart_500x500': u'',
  u'album_coverart_800x800': u'',
  u'album_edit_url': u'https://www.musixmatch.com/album/Too-hort/Ain-t-Yo-Bitch-feat-B-O-T-B?utm_source=application&utm_campaign=api&utm_medium=pyraps',
  u'album_id': 22274911,
  u'album_label': u'',
  u'album_mbid': u'',
  u'album_name': u"Ain't Yo Bitch (feat. B.O.T.B)",
  u'album_pline': u'2016 Shydog Productions',
  u'album_rating': 9,
  u'album_release_date': u'2016-02-17',
  u'album_release_type': u'Single',
  u'album_track_count': 1,
  u'album_vanity_id': u'Too-hort/Ain-t-Yo-Bitch-feat-B-O-T-B',
  u'artist_id': 4660,
  u'artist_name': u'Too $hort',
  u'primary_genres': {u'music_genre_list': []},
  u'restricted': 0,
  u'secondary_genres': {u'music_genre_list': []},
  u'updated_time': u'2016-02-22T18:21:51Z'},
 {u'album_copyright': u'2014 David 

Phenomenal! We now have most of the functions we need to start scraping the Musixmatch library for rap lyrics. Now we just need some real data, like a csv file of rapper names. We'll use the rapper names to fetch all the songs each rapper released in a given date range. So if we input a csv file of, say, ['Ice Cube', 'Kanye', ...], then we can return all the rap lyrics for those artists!
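
A minimal sketch of reading such a file, assuming a hypothetical two-column layout (`artist,label`) that is not specified anywhere in this notebook. It is written for Python 3 and uses an in-memory sample in place of a real file; `read_artists` is an illustrative name.

```python
import csv
import io

# Hypothetical file contents; the column names are an assumption.
SAMPLE_CSV = u"artist,label\nNas,east\nMobb Deep,east\nIce Cube,west\n"


def read_artists(handle):
    """Group artist names by their style label."""
    by_label = {}
    for row in csv.DictReader(handle):
        label = row["label"].strip()
        by_label.setdefault(label, []).append(row["artist"].strip())
    return by_label


artists = read_artists(io.StringIO(SAMPLE_CSV))
print(artists)
```

Each list in `artists` could then be passed, one name at a time, to `get_all_lyrics_from_artist` together with the date range of interest.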

## Building a hip-hop lyrics database

After meticulous research, we have compiled a list of hip-hop artists from the 90's that are representative of either East-Coast hip-hop or West-Coast hip-hop. In this section, we will scrape the actual data that we will be using for this project.

In [8]:
class Lyric(object):
    
    @staticmethod
    def _clean(text):
        # drop the footer
        text = "\n".join(text.split("\n\n")[:-1])
        return text
    
    def __init__(self, text, artist, album, label):
        self.artist = artist
        self.album = album
        self.label = label
        self.text = Lyric._clean(text)
        self.tokens = tokenize(self.text)
        
    def __repr__(self):
        return "%s/%s: \"%s...\"" % (self.artist, self.album, self.text[:10])
    
    def __hash__(self):
        return hash(self.artist) + hash(self.album) + hash(self.tokens)
    
    

In [16]:
east_coast_rappers = ["Notorious B.I.G.", "Nas", "Wu-Tang Clan", "Jay-Z", "DMX", "Rakim",
                      "Method Man", "Busta Rhymes", "Run-DMC", "Public Enemy", "Mobb Deep",
                      "KRS-One", "50 Cent", "Big L", "LL Cool J", "Ghostface Killah",
                      "Ol' Dirty Bastard", "Raekwon", "A Tribe Called Quest",
                      "Big Daddy Kane","Gang Starr", "GZA", "Redman", "Mos Def", "Q-Tip"]
west_coast_rappers = ["2Pac", "Ice Cube", "Dr. Dre", "Snoop Dogg", "N.W.A",
                      "Nate Dogg", "Warren G", "MC Ren", "Eazy-E", "Ice-T", "Too $hort", "Kurupt",
                      "The Pharcyde", "E-40"]

In [17]:
with open("lyric_data/east_coast.txt", "w") as f:
    for rapper in east_coast_rappers:
        f.write(rapper+"\n")
with open("lyric_data/west_coast.txt", "w") as f:
    for rapper in west_coast_rappers:
        f.write(rapper+"\n")

In [10]:
date_start = datetime(1990,1,1)
date_end = datetime(1999,12,31)

lyrics = []

for artist in east_coast_rappers:
    for (album_name, album_lyrics) in music_parser.get_all_lyrics_from_artist(artist, date_start, date_end):
        for lyric in album_lyrics:
            lyrics.append(Lyric(lyric,artist,album_name,"east"))
for artist in west_coast_rappers:
    for (album_name, album_lyrics) in music_parser.get_all_lyrics_from_artist(artist, date_start, date_end):
        for lyric in album_lyrics:
            lyrics.append(Lyric(lyric,artist,album_name,"west"))

*******************************************************
Notorious B.I.G.
*******************************************************
 * artist_id: 7567
 * number albums: 47
 * number albums in date range: 21
 * found (22/24) lyrics in album Born Again
 * found (2/8) lyrics in album Cars & Sex / I Got a Story to Tell / Bigge Smalls Is the Wickedest / The Garden Freestyle
 * found (3/4) lyrics in album Sky's the Limit
 * found (2/2) lyrics in album Spit Your Game
 * found (8/8) lyrics in album Mo Money Mo Problems
 * found (4/4) lyrics in album Hypnotize
 * found (23/24) lyrics in album Life After Death
 * found (2/2) lyrics in album Hypnotize
 * found (5/5) lyrics in album Mo Money Mo Problems
 * found (5/5) lyrics in album Big Poppa
 * found (7/7) lyrics in album One More Chance
 * found (4/4) lyrics in album One More Chance
 * found (5/6) lyrics in album Juicy / Unbelievable
 * found (19/19) lyrics in album Ready to Die - The Remaster
 * found (19/19) lyrics in album Ready To Die The Rema

### Filling in slang words
As you may imagine, rap contains a lot of slang words that do not have an entry in CMUdict. We can fill in the gaps by approximating the pronunciations using the [CMU Lextool](http://www.speech.cs.cmu.edu/tools/lextool.html). The following code finds all the unknown words and writes them to a file as input to the lextool. We then need to parse the dict file that the lextool returns and fill in the unknown pronunciations with our approximations.

In [19]:
slang = set()

for lyric in lyrics:
    if len(lyric.tokens) == 0:
        continue
    for p in reduce(lambda x,y: list(x) + list(y), lyric.tokens):
        if p.pron is None:
            slang.add(p.word.lower())

slang = sorted(list(slang))
with open("slang.txt", "w") as f:
    for word in slang:
        f.write(word.encode('utf8') + "\n")
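
The parsing step mentioned above is not implemented yet. Below is a rough sketch of how it could look, assuming the returned dict file has one entry per line in the form `WORD<whitespace>PHONEMES`, with alternate pronunciations marked as `WORD(2)`. That format, the sample entries, and the function name `parse_lextool_dict` are assumptions and should be checked against an actual lextool result.

```python
import re

# Made-up sample in the assumed format; not real lextool output.
SAMPLE_DICT = "HOLDIN\tHH OW L D IH N\nEM\tEH M\nEM(2)\tAH M\n"


def parse_lextool_dict(text):
    """Map each lowercase word to a list of pronunciations (lists of phonemes)."""
    entries = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        head, phonemes = parts
        # Strip an alternate-pronunciation marker such as "(2)".
        word = re.sub(r"\(\d+\)$", "", head).lower()
        entries.setdefault(word, []).append(phonemes.split())
    return entries


slang_prons = parse_lextool_dict(SAMPLE_DICT)
print(slang_prons["em"])
```

The result has the same shape as `cmudict.dict()`, so it could be merged into `Pronunciation.CMUDICT` before tokenizing. If the returned phonemes carry no stress digits, `syllable_loc` will be empty for those words and `rhyme_group` will fall back to the whole pronunciation.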