In [1]:
import numpy as np
from tqdm import tqdm_notebook as tqdm

## Preprocessing
Here we're going to loop over all of the stored lyrics to itemize the features and producers (hereby referred to as *associates*).  

We will store a tuple of two values for each artist-associate pair:
1. The frequency of the associate within the artist's corpus
2. The relative weight of that frequency (absolute / total associate count)

First, we build a dictionary containing all of the lyric paths, organized by artist.

In [2]:
from collections import defaultdict
import os

def grab_lyric_paths():
    paths = defaultdict(list)
    for artist in os.listdir('lyrics'):
        artist_path = 'lyrics/{}'.format(artist)
        if os.path.isdir(artist_path):
            song_list = os.listdir(artist_path)
            for song in song_list:
                song_path = 'lyrics/{}/{}'.format(artist, song)
                paths[artist].append(song_path)
    return paths
lyric_paths = grab_lyric_paths()

Then we loop over those paths, and count the total number of songs for each artist.

In [22]:
song_counts = dict()
for artist in lyric_paths:
    song_counts[artist] = len(lyric_paths[artist])
top_ten_prolific_artists = sorted(song_counts.items(), key=lambda item: item[1], reverse=True)[:10]
top_ten_prolific_artists

[('Gucci Mane', 827),
 ('Lil Wayne', 703),
 ('Lil B', 670),
 ('Chief Keef', 507),
 ('Snoop Dogg', 504),
 ('The Game', 499),
 ('Chamillionaire', 461),
 ('E-40', 441),
 ('Chris Brown', 411),
 ('Busta Rhymes', 364)]

> Note: Keep in mind that these numbers are biased towards artists that simply have more lyrics on the Genius platform.  These numbers will be pretty indicative of the "real" number of songs, but there are a few outliers, notably [Lil B](https://www.reddit.com/r/ThankYouBasedGod/comments/1wttyi/does_anyone_have_an_official_count_of_how_much/).  We're just gonna gloss over this, since tabulating the true number of songs by The Based God would amount to a wild-goose chase.
> The other issue we face is one of duplicate lyrics posted to 

Let's make a function to itemize all of the producers and featured artists tied to an artist by looping over the artist's corpus and tallying up everyone they've worked with.

> *Note*: In the case of producers, it's not exceptionally common to have them listed on the lyric page, especially unless they are a well known producer.  This is just part of the game.

In [11]:
import json
from collections import Counter, OrderedDict

def process_associates(associates):
    ass_counted = Counter(associates)
    ass_sorted = sorted(ass_counted.items(), key=lambda item: item[1])[::-1]
    max_weight = ass_sorted[0][1] if ass_sorted else 1
    ass_weighted = [(i[0], (i[1], i[1] / max_weight)) for i in ass_sorted]
    processed_associates = OrderedDict(ass_weighted)
    return processed_associates

def itemized_associates():
    artists = defaultdict(dict)

    for artist in tqdm(lyric_paths):
        producers, features = list(), list()

        for song_path in lyric_paths[artist]:
            with open(song_path) as lfile:
                lyric = json.load(lfile)
                producers += lyric['pro'].get('producers', [])
                features += lyric['pro'].get('features', [])

        artists[artist]['producers'] = process_associates(producers)
        artists[artist]['features'] = process_associates(producers)

    return artists

In [12]:
artist_associates = itemized_associates()




Before we proceed, let's grab a quick count of how many features each artist has.

In [None]:
feature_counts = dict()
for artist, associates in artist_associates.items():
    feature_counts[artist] = len(associates['features'])
for artist, lyric_count in top_ten_prolific_artists:
    comparison = 

Now that we have the associates tied to each artist, let's build a graph to connect them all together.  First up we will add each primary artist, then all associates, as nodes on the graph.  We will refer to the combined domain of objects as *entities*.

In [15]:
import networkx as nx
G = nx.Graph()

In [13]:
all_entities = set(artist_associates.keys())
print('{}: primary artists'.format(len(all_entities)))
for associates in artist_associates.values():
    features = set(associates['features'].keys())
    producers = set(associates['producers'].keys())
    comp = features.union(producers)
    all_entities = all_entities.union(comp)
print('{}: entities after adding associates'.format(len(all_entities)))

7812: primary artists
15468: entities after adding associates


In [16]:
G.add_nodes_from(all_entities)

Now we will add the edges between all entities, represented as the relationship between the primary artist and each associate.

In [19]:
def iterate_feature_edges(artists):
    for artist, associates in artists.items():
        comp = dict(associates['producers'])
        comp.update(associates['features'])
        for feature in comp:
            yield (artist, feature)

In [20]:
G.add_edges_from(iterate_feature_edges(artist_associates))

In [21]:
G.neighbors('Lil Wayne')[:10]

['T-Minus',
 '"Star/Pointro" by The Roots',
 'David Banner',
 'Detail',
 'Nascent',
 'Mannie Fresh',
 'Play-N-Skillz',
 'The Runners',
 'Mr. Pyro',
 'Wale']

Now that we have a functioning graph, the first question I'd like to ask is simply which artist has the most connections? Given that we're currently only tracking breadth, this treats the associates as a set, rather than a collection.  We may add the frequency of the associated as weights later on.

In [20]:
n_neighbors = dict()
for artist in artists:
    n_neighbors[artist] = len(G.neighbors(artist))
sorted(n_neighbors.items(), key=lambda item: item[1], reverse=True)[:10]

[('Snoop Dogg', 423),
 ('The Game', 386),
 ('Lil Wayne', 379),
 ('E-40', 356),
 ('Busta Rhymes', 316),
 ('Chris Brown', 300),
 ('Rick Ross', 299),
 ('Gucci Mane', 295),
 ('T.I.', 292),
 ('Kanye West', 287)]