# pyraps

For our final project, we will attempt to classify rap styles based on song lyrics. We will use a subset of rap music published in the 1990s, when rap styles from different geographical regions were distinct; modern rap music, by contrast, has become more of an amalgamation of the three main rap styles. The two main geographical regions we will look at are east-coast (New York City) and west-coast (Los Angeles) - we may extend this to classify more modern movements such as southern (Atlanta) and midwest (Detroit).

Our project consists of three parts.

1. Data Collection - We will build a database of lyrics from 1990s rap artists and label each song with the rapper's style, determined by geographical location.
2. Creating features - We will create features to capture the rhythm and rhyme of a song, as well as the particular lyrical content and vocabulary.
3. Training a classifier - Using our features, we will train different types of classifiers and compare results.

## Creating Features

We will use two sets of information as features for our machine learning algorithms. The first is lyrical content: the actual words being used and the order in which they occur. The second is rhyme patterns along with rhythmic beats, which we will analyze using NLTK.

### Lyrical Content

One easy way to generate features for this is to take a pretrained neural network that specifically handles text and drop its last layer (the layer that converts the learned features into the network's expected output), so that the learned features themselves can be used as input to our classifiers.

### Rhyme Patterns

Rhyme patterns are pretty interesting in the way they manifest themselves in east-coast versus west-coast rap. East-coast rap tends to create intricate, interlacing rhyme patterns, whereas west-coast rap focuses more on creating a vibe than on building intense rhyme structures. We can use this as another feature to train on, which should provide better separation.

Let's take a look at a couple of lines from Nas' "NY State of Mind", a classic east-coast style song

1. Rappers I <font color='blue'>monkey</font> <font color='red'>flip em</font> with the <font color='blue'>funky</font> <font color='red'>rhythm</font> I be <font color='red'>kickin'</font>
2. <font color='red'>musician</font>, <font color='red'>inflictin</font> <font color='red'>composition</font>
3. <font color='green'>of pain</font> I'm like Scarface <font color='red'>sniffin</font> <font color='green'>cocaine</font>
4. Holdin a <font color='purple'>M-16</font>, see with the pen <font color='purple'>I'm extreme</font>, now

Now let's take a look at a couple of lines from 2Pac's "California Love", a west-coast style song

1. Now let me welcome everybody to the wild, wild <font color='red'>west</font>
2. A state that's untouchable like Elliot <font color='red'>Ness</font>
3. The track hits ya eardrum like a slug to ya <font color='red'>chest</font>
4. Pack a <font color='red'>vest</font> for your Jimmy in the city of <font color='red'>sex</font>

We can immediately see a difference in rhyme style between these two kinds of rap. East-coast rap tends to have more rhymes in general and puts much more emphasis on a variety of rhyme patterns interspersed throughout the lines, whereas west-coast rap focuses on simpler last-word rhymes.
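One way to turn this observation into a number is to measure what share of the rhyming words fall inside a line rather than at its end. The sketch below is only an illustration under some assumptions: the function name `internal_rhyme_ratio` is hypothetical, the rhyme-group labels are entered by hand to mirror the colour highlighting above (they are not computed from pronunciations), and it is written for Python 3.

```python
from collections import Counter

def internal_rhyme_ratio(lines):
    """Fraction of rhyming tokens that are not the last token of their line.

    `lines` is a list of lines; each line is a list of rhyme-group labels,
    with None for words that do not belong to any rhyme group.
    """
    counts = Counter(g for line in lines for g in line if g is not None)
    rhyming = 0
    internal = 0
    for line in lines:
        for pos, group in enumerate(line):
            if group is None or counts[group] < 2:
                continue  # this word does not rhyme with any other word
            rhyming += 1
            if pos != len(line) - 1:
                internal += 1
    return internal / rhyming if rhyming else 0.0

# Hand-labelled toy data, one label per word (illustrative labels only).
east = [
    [None, None, "UNKY", None, "IH", None, None, "UNKY", "IH", None, None, "IH"],
    ["IH", "IH", "IH"],
    [None, "AIN", None, None, None, "IH", "AIN"],
    [None, None, "EEN", None, None, None, None, None, "EEN", None],
]
west = [
    [None] * 9 + ["EST"],
    [None] * 6 + ["EST"],
    [None] * 10 + ["EST"],
    [None, None, "EST"] + [None] * 7 + ["EST"],
]

print(internal_rhyme_ratio(east))  # 10 of 13 rhyming words are internal
print(internal_rhyme_ratio(west))  # 1 of 5 rhyming words is internal
```

On these two excerpts the east-coast verse scores much higher than the west-coast one, which is the kind of separation we would hope a classifier can exploit. Whether the pattern holds across a whole corpus is something the real data would have to show.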

How can we do this computationally? We will use CMU's pronunciation dictionary in the NLTK package.

In [None]:
import nltk
from nltk.corpus import cmudict

In [None]:
CMUDICT = cmudict.dict()

In [None]:
class Pronunciation(object):
    """A word paired with its first CMUdict pronunciation, if it has one."""

    def __init__(self, word):
        self.word = word
        word = word.lower()
        if word in CMUDICT:
            # CMUdict may list several pronunciations; use the first one
            self.pron = CMUDICT[word][0]
            # vowel phonemes end in a stress digit, so each marks a syllable
            self.syllable_loc = [i for i in range(len(self.pron)) if self.pron[i][-1].isdigit()]
        else:
            # there are other NLTK libraries to guess word pronunciations
            self.pron = None
            self.syllable_loc = None

    def __repr__(self):
        if self.pron:
            pron_repr = "/".join(self.pron)
        else:
            pron_repr = "?"
        return "%s(%s)" % (self.word, pron_repr)

    def rhyme_group(self):
        if self.pron is None:
            return "UNKNOWN_GROUP"
        if not self.syllable_loc:
            # no vowel found: fall back to the whole pronunciation
            return "/".join(self.pron)
        # everything from the last vowel to the end of the word
        return "/".join(self.pron[self.syllable_loc[-1]:])

    def __eq__(self, other):
        return self.word.lower() == other.word.lower()

    def __hash__(self):
        return hash(self.word.lower())


def tokenize(s):
    # keep hyphenated words and words with an apostrophe as single tokens
    tokenizer = nltk.tokenize.RegexpTokenizer(r"[\w-]+'?[\w-]*")
    tokenized_lines = [tokenizer.tokenize(line) for line in s.split("\n") if line]
    return [[Pronunciation(token) for token in token_line] for token_line in tokenized_lines]


nas = ('''Rappers I monkey flip em with the funky rhythm I be kicking\nmusician, inflicting composition\nof pain '''
       '''I'm like Scarface sniffing cocaine\nHolding a M-16, see with the pen I'm extreme, now\n\n''')
token_lines = tokenize(nas)
for line in token_lines:
    print(line)

Now we need to define some sort of metric for rhyming words.
We know that monkey(M/AH1/NG/K/IY0) rhymes with funky(F/AH1/NG/K/IY0) and is a perfect rhyme. Let's break this down. Monkey has two syllables and thus two vowels carrying a stress marker. These vowels mark the separation of syllables - monkey can be broken down to (M/AH1/NG) and (K/IY0); funky can be broken down to (F/AH1/NG) and (K/IY0). Immediately, we see that the last syllables rhyme because they are equal; the AH1/NG at the end of 'mon' and 'fun' also adds to the rhyme, but the relationship that makes this a strong rhyme is the equivalence of the last syllable.
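The `rhyme_group` method above keys only on the last vowel, so any two words ending in IY0 land in the same group. A stricter alternative, sketched here as an assumption rather than as part of our pipeline, is to compare everything from the last primary-stressed vowel (stress digit 1) onward, which is the usual definition of a perfect rhyme. The helper name `rhyme_tail` is hypothetical and the pronunciations are typed in by hand in CMUdict notation.

```python
def rhyme_tail(pron):
    """Return the phonemes from the last primary-stressed vowel to the end."""
    for i in range(len(pron) - 1, -1, -1):
        if pron[i].endswith("1"):
            return pron[i:]
    return pron  # no primary stress marked: fall back to the whole word

# Hand-entered pronunciations in CMUdict notation.
monkey = ["M", "AH1", "NG", "K", "IY0"]
funky = ["F", "AH1", "NG", "K", "IY0"]
happy = ["HH", "AE1", "P", "IY0"]

print(rhyme_tail(monkey))                       # ['AH1', 'NG', 'K', 'IY0']
print(rhyme_tail(monkey) == rhyme_tail(funky))  # True
print(rhyme_tail(monkey) == rhyme_tail(happy))  # False
```

Under this stricter key, monkey and funky still match, while happy no longer does even though it shares the final IY0.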

Let's look at a harder example. The pair flip(F/L/IH1/P) em(EH1/M) rhymes with rhythm(R/IH1/DH/AH0/M). To simplify things, let's just look at em(EH1/M) and rhythm(R/IH1/DH/AH0/M). This is a weak rhyme: the final vowels (EH1 and AH0) differ in both quality and stress, but the syllables still sound alike because they share the closing M. This is another complication we need to take into account.

Let's implement a quick naive rhyme scheme to see all of our strong rhymes...
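As a first step toward handling weak rhymes, we could score a pair of words instead of demanding an exact match. The following is a minimal sketch under our own assumptions: the function names are hypothetical, the scores 1.0, 0.5 and 0.0 are arbitrary placeholders, and "weak" is taken to mean only that the consonants after the last vowel agree.

```python
def last_syllable_tail(pron):
    """Phonemes from the last vowel to the end, with stress digits removed."""
    vowel_idx = [i for i, p in enumerate(pron) if p[-1].isdigit()]
    start = vowel_idx[-1] if vowel_idx else 0
    return [p.rstrip("012") for p in pron[start:]]

def rhyme_score(pron_a, pron_b):
    """1.0 for a strong rhyme, 0.5 for a weak rhyme, 0.0 otherwise."""
    tail_a = last_syllable_tail(pron_a)
    tail_b = last_syllable_tail(pron_b)
    if tail_a == tail_b:
        return 1.0  # same vowel and same closing consonants
    if tail_a[1:] and tail_a[1:] == tail_b[1:]:
        return 0.5  # different vowel, same closing consonants
    return 0.0

# Pronunciations quoted from the examples above.
em = ["EH1", "M"]
rhythm = ["R", "IH1", "DH", "AH0", "M"]
flip = ["F", "L", "IH1", "P"]
monkey = ["M", "AH1", "NG", "K", "IY0"]
funky = ["F", "AH1", "NG", "K", "IY0"]

print(rhyme_score(monkey, funky))  # 1.0
print(rhyme_score(em, rhythm))     # 0.5
print(rhyme_score(flip, em))       # 0.0
```

A real metric would probably also want to treat similar-sounding vowels as partial matches, which this sketch does not attempt.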

In [None]:
import collections
def rhyme_groups_naive(tokens):
    groups = collections.defaultdict(set)
    for line in tokens:
        for token in line:
            group = token.rhyme_group()
            groups[group].add(token)
    return dict(groups)

strong_rhyme_groups = rhyme_groups_naive(token_lines)

# only groups with more than one word are actual rhymes
for (k, v) in strong_rhyme_groups.items():
    if len(v) > 1:
        print("%s %s" % (k, v))

Let's visualize this to see if it matches our manual rhyme annotation above

In [None]:
import IPython.display, random

def random_color():
    return "#%03x" % random.randint(0, 0xFFF)

# get rid of solo groups
groups = [[k,v] for (k,v) in strong_rhyme_groups.items() if len(v) > 1 and k != "UNKNOWN_GROUP"]
print(groups)
# assign colors
for group in groups:
    group[0] = random_color()
# reverse keys and values: map every token to the color of its group
color_dict = dict((token, color) for color, tokens in groups for token in tokens)

html = ""
for token_line in token_lines:
    for token in token_line:
        if token in color_dict:
            html += "<b><font color=%s>%s</font></b> " % (color_dict[token], token.word)
        else:
            html += token.word + " "
    html += "<br>"
        

IPython.display.display_html(html, raw=True)

## Musixmatch API

In this section we will start using the musixmatch api to scrape some songs and their respective lyrics. We will import the standard python requests library and make calls to the api with the apikey that we registered for.

The standard format for the requests will be:

"http://api.musixmatch.com/ws/1.1/method?track_id=?&apikey=?"

where method is one of the API methods, such as "track.lyrics.get", "track.search", "chart.artists.get", and many others.
We need to fill in a track_id for the song and our respective apikey.
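To make the request format concrete, here is a small sketch that assembles such a URL without sending anything over the network. The helper `build_url` and the placeholder key are our own illustrative names, not part of the musixmatch API, and the snippet is written for Python 3 (it uses `urllib.parse`).

```python
from urllib.parse import urlencode

BASE_URL = "http://api.musixmatch.com/ws/1.1/"

def build_url(method, apikey, **params):
    """Build a request URL for an API method; the apikey is appended last."""
    query = dict(params, apikey=apikey)
    # urlencode escapes spaces and other special characters in the values
    return BASE_URL + method + "?" + urlencode(query)

# "YOUR_API_KEY" is a placeholder, not a real key.
print(build_url("track.lyrics.get", "YOUR_API_KEY", track_id=15953433))
print(build_url("track.search", "YOUR_API_KEY", q_artist="Mobb Deep"))
```

Letting `urlencode` do the escaping avoids having to join words with `%20` by hand when an artist or title contains spaces.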

In [3]:
import requests

#Small example how to use the musixmatch api to get track lyrics
lyrics = requests.get("http://api.musixmatch.com/ws/1.1/track.lyrics.get?track_id=15953433&apikey=2729a26dbacf9a7c354418c40912423f")
ly = lyrics.json()

print(ly['message']['body']['lyrics']['lyrics_body'])

#JSON FORMAT
'''
{
  "message": {
    "header": {
      "status_code": 200,
      "execute_time": 0.19601988792419
    },
    "body": {
      "lyrics": {
        "lyrics_id": 6471198,
        "restricted": 0,
        "instrumental": 0,
        "lyrics_body": "When I walk on by, girls be looking like damn he fly\r\nI pay to the beat, walking on the street with in my new lafreak, yeah",
        "lyrics_language": "en",
        "script_tracking_url": "http:\/\/tracking.musixmatch.com\/t1.0\/5RIyfJ3c",
        "pixel_tracking_url": "http:\/\/tracking.musixmatch.com\/t1.0\/5RIyfJ/",
        "html_tracking_url": "http:\/\/tracking.musixmatch.com\/t1.0\/5RIyfJ3cC39",
        "lyrics_copyright": "Lyrics powered by www.musiXmatch.com",
        "updated_time": "2011-06-30T13:31:20Z"
      }
    }
  }
}
'''

Now and then, I think of when we were together
Like when you said you felt so happy you could die
Told myself that you were right for me
But felt so lonely in your company
But that was love, and it's an ache I still remember

You can get addicted to a certain kind of sadness
Like resignation to the end, always the end
So when we found that we could not make sense
Well, you said that we would still be friends
But I'll admit that I was glad that it was over

But you didn't have to cut me off
Make out like it never happened and that we were nothing
And I don't even need your love
But you treat me like a stranger and that feels so rough

...

******* This Lyrics is NOT for Commercial use *******
(1409613252154)



## Search Function

The code below scrapes the musixmatch database for you. All you need to do is pass in the correct artist and title, and the function will return the lyrics. Every song in the musixmatch database has a corresponding track id, and the lyrics method needs that id. So we first put the artist and title into the correct format for the api call, use them to search for the track id, and then request the lyrics for that id.
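The track id lookup assumes the search always finds at least one song. A more defensive version of that step might look like the sketch below; the function name `first_track_id` is hypothetical, and the response layout is taken from the JSON format shown earlier together with the `track_list` field that the search call returns.

```python
def first_track_id(search_response):
    """Return the track id of the first search hit, or None if there is none."""
    message = search_response.get("message", {})
    if message.get("header", {}).get("status_code") != 200:
        return None  # the request itself failed
    body = message.get("body")
    if not isinstance(body, dict):
        return None  # unexpected body shape
    track_list = body.get("track_list", [])
    if not track_list:
        return None  # the search succeeded but matched nothing
    return track_list[0]["track"]["track_id"]

# Hand-built responses for illustration; no network call is made.
hit = {"message": {"header": {"status_code": 200},
                   "body": {"track_list": [{"track": {"track_id": 15953433}}]}}}
miss = {"message": {"header": {"status_code": 200},
                    "body": {"track_list": []}}}

print(first_track_id(hit))   # 15953433
print(first_track_id(miss))  # None
```

Returning None lets the caller skip a song that cannot be found instead of stopping the whole scrape with an IndexError.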

In [30]:
class MusixApi:
    def __init__(self, apikey):
        self.apikey = apikey

    def search(self,artist, title):
        '''
        Pass in artist/title and return song lyrics
        Basic search capability
        '''
    
        titleSplit = title.lower().split(' ')
        artistSplit = artist.lower().split(' ')
    
        titleStr = '%20'.join(titleSplit)
        artistStr = '%20'.join(artistSplit)
        
        url = "http://api.musixmatch.com/ws/1.1/track.search?q_track=" + titleStr + "&q_artist=" + artistStr + "&f_has_lyrics=1&apikey=" + self.apikey
        song = requests.get(url).json()
    
        trackID = song['message']['body']['track_list'][0]['track']['track_id']
    
        lyrics = requests.get("http://api.musixmatch.com/ws/1.1/track.lyrics.get?track_id=" + str(trackID) + "&apikey=" + self.apikey)
 
        ly = lyrics.json()

        return ly['message']['body']['lyrics']['lyrics_body']
    
    def artistID(self, artist):
        '''
        This function returns the artist ID for an artist
        
        Input: An artist name
        Output: The ID of the first artist that matches that name
        
        '''
        
        url = "http://api.musixmatch.com/ws/1.1/artist.search?q_artist=" + artist + "&page_size=5&apikey=" + self.apikey
        
        artistJson = requests.get(url).json()
        
        artistList = artistJson['message']['body']['artist_list']
        artistId = artistList[0]['artist']['artist_id']
        return artistId
    
    
    def allAlbums(self, artistID, limit=10):
        '''
        This function returns all the album IDS for a given artist ID
        
        Input: the ID of an artist
        Output: a list of up to `limit` of their most recent album IDs
        '''
        
        rez = []
        url = "http://api.musixmatch.com/ws/1.1/artist.albums.get?artist_id=" + str(artistID) + "&s_release_date=desc&g_album_name=1&apikey=" + self.apikey
        
        albumJson = requests.get(url).json()
        albumList = albumJson['message']['body']['album_list']
        
        albumLength = len(albumList)
        for i in range(min(limit,albumLength)):
            rez.append(albumList[i]['album']['album_id'])
        
        return rez
    
    
    def allLyrics(self, albums):
        '''
        Input: A list of album IDs
        Output: Lyrics for the songs in those albums (first 10 tracks of each album)
        
        '''
        
        finalLyrics = []
        for albumID in albums:
            url = "http://api.musixmatch.com/ws/1.1/album.tracks.get?album_id=" + str(albumID) + "&page=1&page_size=10&apikey=" + self.apikey
            tracksJson = requests.get(url).json()  
            trackList = tracksJson['message']['body']['track_list']
            for track in trackList:
                songId = track['track']['track_id']
                songUrl = "http://api.musixmatch.com/ws/1.1/track.lyrics.get?track_id=" + str(songId) + "&apikey=" + self.apikey
                response = requests.get(songUrl).json()
                if response['message']['header']['status_code'] == 200:
                    finalLyrics.append(response['message']['body']['lyrics']['lyrics_body'])
        
      
        return finalLyrics
        
        #album.tracks.get?album_id=13750844&page=1&page_size=2

In [32]:
import json
with open("secrets.json", "r") as f:
    musicParser = MusixApi(json.load(f)["musixApiKey"])

#search("Taylor Swift", "Back To December")
#print musicParser.search("Mobb Deep", "Survival of the Fittest")
ID = musicParser.artistID("Coldplay")
albums = musicParser.allAlbums(ID)
for lyric in musicParser.allLyrics(albums):
    print(lyric)

Oh, they say people come
Say people go
This particular diamond was extra special
And though you might be gone
and the world may not know
Still I see you, celestial

Like a lion you ran
a goddess you rolled
Like an eagle you circle
in perfect purple
So how come things move on?
How come cars don't slow?
When it feels like the end of my world
When I should, but I can't, let you go?
...

******* This Lyrics is NOT for Commercial use *******
(1409613252154)
We're gonna get it
Gonna get it
Gonna get it, get it together and flower

Fixing up a car to drive in it again
Searching for the water, hoping for the rain
Up and up
Up and up

Down upon the canvas, working meal to meal
Waiting for a chance to pick your orange field
Up and up
Up and up

See a pearl form, a diamond in the rough
See a bird soaring high above the flood
It's in your blood
It's in your blood

Underneath the storm an umbrella is saying
Sitting with the poison takes away the pain
Up and up
Up and up

We're gonna get it, get it 

Phenomenal! We now have most of the functions we need to start scraping the musixmatch library for all our rap lyrics. Next we'll get some real data, such as a csv file of rapper names, and use those names to fetch the songs each rapper has released recently. So if we input a csv file of, say, ['Ice Cube', 'Kanye', ...], then we can return the rap lyrics for those artists!

## Building a hip-hop lyrics database

After meticulous research, we have compiled a list of hip-hop artists from the 90's that are representative of either East-Coast hip-hop or West-Coast hip-hop. In this section, we will scrape the actual data that we will be using for this project.
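As a rough sketch of how the labeled database could be assembled, the snippet below reads an artist list and attaches a region label to every set of lyrics. It rests on several assumptions: the csv layout (an `artist` column and a `region` column), the function names `load_artists` and `build_dataset`, and the stand-in `fake_fetch` are all hypothetical, and it is written for Python 3.

```python
import csv
import io

def load_artists(csv_file):
    """Read (artist, region) pairs from a csv file object with a header row."""
    reader = csv.DictReader(csv_file)
    return [(row["artist"].strip(), row["region"].strip()) for row in reader]

def build_dataset(artists, fetch_lyrics):
    """Label every song returned by fetch_lyrics(artist) with the artist's region."""
    records = []
    for artist, region in artists:
        for lyrics in fetch_lyrics(artist):
            records.append({"artist": artist, "region": region, "lyrics": lyrics})
    return records

# In-memory csv for illustration; the real list would live in a file.
sample_csv = io.StringIO("artist,region\nNas,east\n2Pac,west\n")
artists = load_artists(sample_csv)

def fake_fetch(artist):
    # Stand-in for the real API calls so the sketch runs offline.
    return ["placeholder lyrics for " + artist]

dataset = build_dataset(artists, fake_fetch)
print(len(dataset))
print(dataset[0]["artist"], dataset[0]["region"])
```

In the notebook, `fetch_lyrics` could wrap the existing methods, for example by chaining `artistID`, `allAlbums` and `allLyrics` on the `musicParser` object.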