# Genius API Scraper
----

This is the second iteration of downloading data from [Genius](http://genius.com/) en masse.  The first was made almost two years ago, long before the [Public API](https://docs.genius.com) was available.  The upsides of the API is that it makes it a bit easier to grab all of the tracks for an artist, as well as artist specific info that might prove useful.  The downside is that for some reason, there is no endpoint for the lyrics themselves (which I find bizarre).

----

The first step we will need to collecting information is a function for fetching a link, and selecting certain elements from it.  The best library for HTTP requests is [requests](http://docs.python-requests.org/en/master/), and the fastest (albeit not the most intuitive) method of element selection is with [lxml](http://lxml.de/) using the [XPath](https://www.w3.org/TR/xpath-31/) query language.

In [2]:
import requests as rq
from lxml import html

def print_exception(func, e):
    template = "{0}, An exception of type {1} occured. Arguments:\n{2!r}"
    message = template.format(func, type(e).__name__, e.args)
    print(message)

def xpath_query_url(url, xpath_query, payload=dict(), params=dict()):
    '''Gets url and performs an XPath Query'''
    headers = {'User-Agent': 'Mozilla/5.0 Gecko/20100101 Firefox/35.0'}
    if payload:
        headers.update(payload)
    try:
        response = rq.get(url, headers=headers, params=params)
        #creates an html tree from the data
        tree = html.fromstring(response.text)
        #XPATH query to grab all of the artist urls, then we grab the first
        return tree.xpath(xpath_query)
    except Exception as e:
        print_exception('xpath_query_url', e)
        return ''

### Artist List (Names)
Here we will grab a list of notable rappers collected from [Wikipedia](https://en.wikipedia.org/wiki/List_of_hip_hop_musicians).  This contains old-school (80's and 90's) rappers as well as contemporaries in the new school.

In [3]:
from re import sub
def fetch_artist_names(random_sample=None):
    '''Fetches top artists from wikipedia'''
    base = 'https://en.wikipedia.org/wiki/List_of_hip_hop_musicians'
    query = '//li/a/@title'
    results = xpath_query_url(base, query)

    if results:
        results = [sub(' \([\w ]+\)', '', a) for a in results]

    return results

wiki_artists = fetch_artist_names()
# manual addition
wiki_artists += ['2Pac', 'Joey BADA$$']
wiki_artist_count = len(wiki_artists)
print('Scraped {} potential artists'.format(wiki_artist_count))

Scraped 1176 potential artists


### Artist List (Genius Links)
Next we will search for artist ids by using the non-api artist page search.  From this point on, we're going to be making an awful lot of web requests, and completing them synchronously is pretty much out of the question (unless you happen to be a monk).  

We could use POSIX threads or more processes to accomplish this, but the amount we would need would be too taxing.  Also, since web requests are an I/O bound task, and not CPU bound, taking advantage of multiple cores wouldn't even do us much good.  Basically, [we need concurrency, not parallelism](http://stackoverflow.com/questions/15556718/greenlet-vs-threads).

In [4]:
import gevent
from gevent.pool import Pool
from gevent import monkey
import signal

gevent.monkey.patch_all(thread=False)
gevent.signal(signal.SIGQUIT, gevent.kill)

<gevent.hub.signal at 0x7f41800d03c8>

In [10]:
from functools import partial

def grab_artist_links(artist):
    base = 'http://genius.com/search/artists'
    params = {'q': artist}
    xpath = '//li/a[@class="artist_link"]/@href'
    artist_links = partial(xpath_query_url, 
                           xpath_query=xpath, 
                           url=base)
    return artist, artist_links(params=params)

%time artist_link_lists = Pool(100).map(grab_artist_links, wiki_artists)
# filters out artists we couldn't find links for
# as well as false positives that aren't even artists 
artist_link_lists = [alist for alist in artist_link_lists if alist[1]]
# grabs just the first link from each list
artist_link_lists = [(alist[0], alist[1][0]) for alist in artist_link_lists if alist[1]]

CPU times: user 13.4 s, sys: 2.99 s, total: 16.4 s
Wall time: 14 s


Time to review the artist list to make sure the artist pages we searched for are correct.  We can automate a lot of this by simulating the way that the artist names "should" appear in the url.  From what I can tell, this is by "slugifying" (which is not a standardized procedure) the resultant string.

Some artist names included a '$' in their name, but were not listed that way on wikipedia.  Some are simply different names, like `A-Q` vs `aq-beats` but we can be pretty confidant this was a true positive match.  Some have similar names, with one being much more popular, hence `V.I.C.` being matched as `Vic Mensa`.

In [11]:
from slugify import slugify as slug
[a for a in artist_link_lists if slug(a[0]) not in a[1].lower()]

[('A-Q', 'http://genius.com/artists/Aq-beats'),
 ('ASAP Ferg', 'http://genius.com/artists/A-ap-ferg'),
 ('ASAP Nast', 'http://genius.com/artists/A-ap-nast'),
 ('ASAP Twelvyy', 'http://genius.com/artists/A-ap-twelvyy'),
 ('ASAP Yams', 'http://genius.com/artists/A-ap-yams'),
 ('B.G.', 'http://genius.com/artists/Bg-knocc-out'),
 ('B.G. Knocc Out', 'http://genius.com/artists/Bg-knocc-out'),
 ('B.o.B', 'http://genius.com/artists/Bob'),
 ('Bad Azz', 'http://genius.com/artists/Boosie-badazz'),
 ('Big K.R.I.T.', 'http://genius.com/artists/Big-krit'),
 ('Boss', 'http://genius.com/artists/Nicki-minaj'),
 ('Bow Wow', 'http://genius.com/artists/Shad-moss'),
 ("Cam'ron", 'http://genius.com/artists/Camron'),
 ('Cash Out', 'http://genius.com/artists/Doughboyz-cashout'),
 ('Cashis', 'http://genius.com/artists/Ca-his'),
 ('Casper Nyovest', 'http://genius.com/artists/Cassper-nyovest'),
 ('Currensy', 'http://genius.com/artists/Curren-y'),
 ("D'Angelo", 'http://genius.com/artists/Dangelo'),
 ("D'banj", 'h

The `remove` artists don't exist on genius, or have a negligable presence.  The `switch` artists simply returned the wrong url from our previous search, since we grabbed the first url from the search results each time.
> Without correction, the Will.i.am search result resolves to William Shakespear, lol

In [12]:
remove = ['Boss', 'Dee Barnes', 'K-Rab', 'MC Mong', 'MckeezyK', 
          'P.C.T.', 'Spot', 'Timbe']
switch = {'The D.O.C.': 'http://genius.com/artists/The-doc',
         'Drag-on': 'http://genius.com/artists/Drag-on',
         'Fiend': 'http://genius.com/artists/Fiend',
         'Flame': 'http://genius.com/artists/Flame',
         'Lisa Lopes': 'http://genius.com/artists/Lisa-left-eye-lopes',
         'Paris': 'http://genius.com/artists/Paris-rapper',
         'Rakim': 'http://genius.com/artists/Rakim',
         'Slug': 'http://genius.com/artists/Slug',
         'Solzilla': 'http://genius.com/artists/Sol',
         'V.I.C.': 'http://genius.com/artists/Vic',
         'Will.i.am': 'http://genius.com/artists/william'}

artist_link_lists_corrected = []

for a in artist_link_lists:
    if a[0] in switch:
        item = (a[0], switch[a[0]])
        artist_link_lists_corrected.append(item)
    if a[0] not in remove:
        artist_link_lists_corrected.append(a)

### Artist List (Genius IDs)

This is one of the last non-API requests we'll be making, as there is no offical way to grab the `artist_id` from the API itself. 

Note: The author thinks thinks is another shortcoming of the Genius API.
> Like, seriously Genius?  I'm just supposed to know your internal ID for each artist?  How is that good UX?  You don't even provide this formally, I have to grab it from an iOS app link...

In [13]:
def grab_artist_id(args):
    artist, artist_url = args
    xpath = '//meta[@name="newrelic-resource-path"]/@content'
    artist_id_raw = xpath_query_url(artist_url, xpath)
    
    if artist_id_raw:
        delisted = artist_id_raw[0]
        if delisted and 'artists/' in delisted:
            artist_id = delisted.split('artists/')[1]
            return artist, artist_id
    return artist, None

%time artist_ids = Pool(100).map(grab_artist_id, artist_link_lists_corrected)

# filters out missing entries that couldn't be resolved
artist_ids_filtered = [a for a in artist_ids if all(a)]

filtered_artist_count = len(artist_ids_filtered)
artist_count_diff = wiki_artist_count - filtered_artist_count

feedback = 'Found {} artist_ids, filtered out {} negative results'
print(feedback.format(filtered_artist_count, artist_count_diff))

CPU times: user 22 s, sys: 3.11 s, total: 25.2 s
Wall time: 23.6 s
Found 1110 artist_ids, filtered out 66 negative results


Many of the unresolved results we're erranious links harvested from the wikipedia page based on a sloppy XPath.

----
### Lyrics and Song Metadata

Here we're going to create a barebones Genius API wrapper for convenience sake.  This way we only have to create a session object with our access token once, and can streamline url parameters.

In [14]:
class Genius:
    def __init__(self, token):
        token = token
        headers = {'Authorization': 'Bearer {}'.format(token)}
        self._sesh = rq.Session()
        self._sesh.headers.update(headers)
        self._base = 'http://api.genius.com'
        
    def _key_response(self, response, subkey):
        r = response.get('response', None)
        if r and subkey in r:
            return r[subkey]
        return {}

    def get_artist(self, artist_id):
        url = '{}/artists/{}'.format(self._base, artist_id)
        response = self._sesh.get(url, params={'text_format': 'plain'}).json()
        return self._key_response(response, 'artist')
    
    def get_artist_songs(self, artist_id, page, sort='title', per_page=50):
        url = '{}/artists/{}/songs'.format(self._base, artist_id)
        params = {'sort': sort, 'per_page': per_page, 'page': page}
        response = self._sesh.get(url, params=params).json()
        return response.get('response', None)
    
    def get_song(self, song_id):
        url = '{}/songs/{}'.format(self._base, song_id)
        response = self._sesh.get(url, params={'text_format': 'plain'}).json()
        return self._key_response(response, 'song')

In [15]:
api = Genius('nmovpnwjelwjWJG_jjQHu-Emfhsv0WX0iv_WZxz5LRpzjjAezDXMhW56jhzwC634')
api.get_artist_songs(28906, 1, per_page=1)['songs'][0].keys()

dict_keys(['id', 'song_art_image_thumbnail_url', 'header_image_thumbnail_url', 'api_path', 'lyrics_owner_id', 'annotation_count', 'title', 'primary_artist', 'path', 'header_image_url', 'full_title', 'pyongs_count', 'stats', 'url'])

We grab a single songs-page result to test that access token is accepted.

----

Here we scrape the artist specific information, which contains details we might find useful later, such as Twitter handles.

In [16]:
artist_ids_stripped = [a[1] for a in artist_ids_filtered]
%time artist_info = Pool(100).map(api.get_artist, artist_ids_stripped)
artist_info[0].keys()

CPU times: user 6.8 s, sys: 1.6 s, total: 8.4 s
Wall time: 8.28 s


dict_keys(['twitter_name', 'id', 'user', 'description_annotation', 'api_path', 'is_meme_verified', 'facebook_name', 'description', 'instagram_name', 'followers_count', 'name', 'current_user_metadata', 'alternate_names', 'is_verified', 'header_image_url', 'image_url', 'url'])

Here is the meat of our operation, where we grab all of the information with a classic producer-consumer pattern, chained like a pipeline.

To grab all of the songs tied to the artist, we must not just scrape but "crawl", since we don't know how many pages of results we'll need to fetch ahead of time.  This makes the `artist_worker` function not only a consumer, but also a producer.  We first populate the queue with the urls to the first pages of the results, and if it finds that there is a `next_page` attribute in the page, it refeeds the queue with those `next_page` details.  We then iterate over each of the returned urls in the page, and feed the out queue with each url.

The `song_worker` consumes a url, grabs the song details, and then grabs the lyrics for that song.  We join this data and pass it to the `serial_worker`'s `q_serial` queue to be written to disk.

Thus we can grab all 115k-ish songs for all of our 1000ish rappers.

In [20]:
from gevent.queue import JoinableQueue
import os 
import msgpack
from datetime import datetime

q_artists = JoinableQueue()
q_songs = JoinableQueue()
q_serial = JoinableQueue()
    
def artist_worker():
    while True:
        artist_id, page_n = q_artists.get()
        try:
            song_page = api.get_artist_songs(artist_id, page_n)

            for song in song_page['songs']:
                q_songs.put(song['id'])
            
            next_page = song_page['next_page']
            if next_page:
                q_artists.put((artist_id, next_page))
        except Exception as e:
            print_exception('artist_worker', e)
        finally:
            q_artists.task_done()

def song_worker():
    while True:
        song_id = q_songs.get()
        try:
            song_details = api.get_song(song_id)
            song_lyrics = xpath_query_url(song_details['url'], '//lyrics//text()')
            song_lyrics_clean = [l.replace('\n','') for l in song_lyrics \
                                 if len(l.strip()) > 1 and 'googletag' not in l]
            song_details['lyrics'] = song_lyrics_clean
            q_serial.put(song_details)
        except Exception as e:
            print_exception('song_worker', e)
        finally:
            q_songs.task_done()

def serial_worker():
    while True:
        song = q_serial.get() 
        try:
            artist = song['primary_artist']['url'].split('/')[-1]
            artist_path = 'lyrics-te/{}'.format(artist)
            
            if not os.path.isdir(artist_path):
                os.makedirs(artist_path)     
                
            filepath = '{}/{}'.format(artist_path, song['path'])
            
            with open(filepath, 'wb') as f:
                msgpack.pack(song, f)
        except Exception as e:
            print_exception('serial_worker', e)
        finally:
            q_serial.task_done()

for artist_id in artist_ids_stripped:
    q_artists.put((artist_id, 1))
    
artist_workers = []
for _ in range(50):
    artist_workers.append(gevent.spawn(artist_worker))
    
song_workers = []
for _ in range(50):
    song_workers.append(gevent.spawn(song_worker))
    
serial_workers = []
for _ in range(10):
    serial_workers.append(gevent.spawn(serial_worker))
    
t1 = datetime.now()

q_artists.join()
q_songs.join()
q_serial.join()

gevent.killall(artist_workers)
gevent.killall(song_workers)
gevent.killall(serial_workers)

print('Took {}'.format(datetime.now() - t1))

Traceback (most recent call last):
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/greenlet.py", line 534, in run
    result = self._run(*self.args, **self.kwargs)
  File "<ipython-input-20-ed8d7e2578aa>", line 25, in artist_worker
    q_artists.task_done()
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/queue.py", line 465, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called too many times
<Greenlet at 0x7f415ea236d0: artist_worker> failed with ValueError

Traceback (most recent call last):
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/greenlet.py", line 534, in run
    result = self._run(*self.args, **self.kwargs)
  File "<ipython-input-19-b435aaa830ac>", line 25, in artist_worker
    q_artists.task_done()
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/queue.py", line 465, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called

xpath_query_url, An exception of type ConnectionError occured. Arguments:
(MaxRetryError("HTTPConnectionPool(host='genius.com', port=80): Max retries exceeded with url: /Canibus-bis-vs-rip-original-version-lyrics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f415e6471d0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))",),)
xpath_query_url, An exception of type ConnectionError occured. Arguments:
(MaxRetryError("HTTPConnectionPool(host='genius.com', port=80): Max retries exceeded with url: /Canibus-cingularity-point-original-lyrics (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7f415d538898>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))",),)
xpath_query_url, An exception of type ConnectionError occured. Arguments:
(MaxRetryError("HTTPConnectionPool(host='genius.com', port=80): Max retries exceeded with

Traceback (most recent call last):
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/greenlet.py", line 534, in run
    result = self._run(*self.args, **self.kwargs)
  File "<ipython-input-19-b435aaa830ac>", line 39, in song_worker
    q_songs.task_done()
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/queue.py", line 465, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called too many times
<Greenlet at 0x7f4186c10340: song_worker> failed with ValueError

Traceback (most recent call last):
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/greenlet.py", line 534, in run
    result = self._run(*self.args, **self.kwargs)
  File "<ipython-input-20-ed8d7e2578aa>", line 39, in song_worker
    q_songs.task_done()
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/queue.py", line 465, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called too many 

Took 1:22:57.353715


Traceback (most recent call last):
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/greenlet.py", line 534, in run
    result = self._run(*self.args, **self.kwargs)
  File "<ipython-input-19-b435aaa830ac>", line 39, in song_worker
    q_songs.task_done()
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/queue.py", line 465, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called too many times
<Greenlet at 0x7f415ed385a0: song_worker> failed with ValueError

Traceback (most recent call last):
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/greenlet.py", line 534, in run
    result = self._run(*self.args, **self.kwargs)
  File "<ipython-input-19-b435aaa830ac>", line 39, in song_worker
    q_songs.task_done()
  File "/home/bvb/anaconda3/lib/python3.5/site-packages/gevent/queue.py", line 465, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called too many 

In [None]:
def consolidate_lyrics(q):
    '''Consolidates all of the lyric files by their respective
    primary artist.'''
    while True:
        artist = q.get()
        try:
            songs = {}
            artist_path = 'lyrics/{}'.format(artist)
            for song_name in os.listdir(artist_path):
                song_path = '{}/{}'.format(artist_path, song_name)
                with open(song_path, 'rb') as s_file:
                    song_details = msgpack.unpack(s_file)
                    songs[song_name] = song_details

            if not os.path.isdir('lyrics_consolidated'):
                os.makedirs('lyrics_consolidated')

            with open('lyrics_consolidated/{}.mpk'.format(artist), 'wb') as a_file:
                msgpack.pack(songs, a_file)
        except Exception as e:
            print(artist, artist_path, song_name, song_path)
        finally:
            q.task_done()

q_artists = JoinableQueue()
for artist in os.listdir('lyrics'):
    q_artists.put(artist)

consolidate_workers = []
for _ in range (10):
    work = gevent.spawn(consolidate_lyrics, q_artists)
    consolidate_workers.append(work)
    
q_artists.join()
gevent.killall(consolidate_workers)


In [None]:
for artist in os.listdir('lyrics'):
    file = os.listdir('lyrics/{}'.format())
