In [9]:
import lyricsgenius as genius
import pandas as pd
import re, os, string
from IPython.display import display, Markdown
from requests.exceptions import Timeout

## Artists
For this project I decided to analyze the lyrical styles and semantics of the most lyrical rappers. To define the 'most lyrical' rappers, I used the output data of a well-known 2019 project by Matt Daniels, "The Largest Vocabulary In Hip Hop" (https://pudding.cool/projects/vocabulary/index.html). In his work, Daniels compiled a list of 160 rappers with the most diverse vocabularies throughout their discographies. I used this list and its lyrical rankings, along with data on Grammy and Billboard awards and nominations, to explore how the diversity of a rapper's lyrics, styles, and semantics could affect their success.

In [10]:
artists

artist,lyrical_rank,recalc,wins,noms,win_rate
Aesop Rock,1,7879,0,0,0.0
Busdriver,2,7324,0,0,0.0
Jedi Mind Tricks,3,6424,0,0,0.0
GZA,4,6390,0,0,0.0
Wu-Tang Clan,5,6196,0,0,0.0
...,...,...,...,...,...
A Boogie wit da Hoodie,156,2738,0,0,0.0
YoungBoy Never Broke Again,157,2724,0,0,0.0
Rich the Kid,158,2709,0,0,0.0
Lil Uzi Vert,159,2556,0,2,0.0


In [7]:
# list of 'most lyrical' rappers, determined through a study done by Matt Daniels in 2019, https://pudding.cool/projects/vocabulary/index.html
most_lyrical = pd.read_csv("data/most_lyrical_artists_2019.csv", usecols=['rapper_clean','recalc']) \
    .sort_values('recalc', ascending=False).reset_index(drop=True).reset_index().rename(columns={'index':'lyrical_rank', 'rapper_clean':'artist'})

# fix rank
most_lyrical.lyrical_rank = most_lyrical.lyrical_rank + 1

# count billboard and grammy wins and nominations by artist

# pull in billboard rapper of the year nominations
noms = pd.read_csv('data/billboard_nominees.csv')
wins = noms.groupby('artist')['winner'].agg('sum').rename('wins')
nominations = noms.groupby('artist')['winner'].agg('count').rename('noms')
bb_artists = pd.DataFrame([wins, nominations]).T

# grammy awards, https://www.kaggle.com/unanimad/grammy-awards/version/2
grammys = pd.read_csv("data/the_grammy_awards.csv").dropna()
grammys = grammys.loc[grammys.year < 2020]

artist_awards_list = []
for artist in most_lyrical.artist:
    winsc = 0
    nomsc = 0
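    # regex=False matches the name literally, so characters like '.' in 'T.I.' are not treated as regex wildcards
    winsc += grammys.loc[(grammys.workers.str.contains(artist, regex=False)) & (grammys.winner == True)]['winner'].count()
    nomsc += grammys.loc[(grammys.workers.str.contains(artist, regex=False))]['winner'].count()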
    if artist in bb_artists.index:
        winsc += bb_artists.loc[artist]['wins']
        nomsc += bb_artists.loc[artist]['noms']
    artist_awards_list.append([artist,winsc,nomsc])

# create dataframe of artists by awards won/nominated
artist_awards = pd.DataFrame(artist_awards_list, columns=['artist','wins','noms'])

# merge the award counts to the artist dataframe
artists = most_lyrical.merge(artist_awards, on='artist').set_index('artist')

artists.loc[artists['noms'] != 0, 'win_rate'] = round(artists['wins']/artists['noms'], 2)
artists['win_rate'] = artists['win_rate'].fillna(0)

# Inspect
display(Markdown('### 10 Most Lyrical:'), artists.head(10))
display(Markdown('### 10 Most Awarded:'), artists.sort_values(by='wins', ascending=False).head(10))
display(Markdown('### 10 Most Nominated:'), artists.sort_values(by='noms', ascending=False).head(10))

### 10 Most Lyrical:

artist,lyrical_rank,recalc,wins,noms,win_rate
Aesop Rock,1,7879,0,0,0.0
Busdriver,2,7324,0,0,0.0
Jedi Mind Tricks,3,6424,0,0,0.0
GZA,4,6390,0,0,0.0
Wu-Tang Clan,5,6196,0,0,0.0
MF DOOM,6,6169,0,0,0.0
RZA,7,6018,0,0,0.0
Immortal Technique,8,5930,0,0,0.0
Canibus,9,5915,0,0,0.0
Ghostface Killah,10,5901,0,0,0.0


### 10 Most Awarded:

artist,lyrical_rank,recalc,wins,noms,win_rate
Jay-Z,75,4275,22,45,0.49
Kanye West,113,3760,22,44,0.5
Russ,127,3491,20,28,0.71
Eminem,59,4480,17,22,0.77
Kendrick Lamar,88,4017,12,21,0.57
Nas,29,4977,12,17,0.71
Eve,120,3642,10,13,0.77
Drake,138,3347,7,28,0.25
Outkast,54,4545,6,9,0.67
Nelly,87,4091,5,11,0.45


### 10 Most Nominated:

artist,lyrical_rank,recalc,wins,noms,win_rate
Jay-Z,75,4275,22,45,0.49
Kanye West,113,3760,22,44,0.5
Drake,138,3347,7,28,0.25
Russ,127,3491,20,28,0.71
Eminem,59,4480,17,22,0.77
Kendrick Lamar,88,4017,12,21,0.57
Nas,29,4977,12,17,0.71
T.I.,82,4171,4,16,0.25
Eve,120,3642,10,13,0.77
Ludacris,52,4572,3,11,0.27
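
The award counts above come from substring matching with `str.contains(artist)`, so a short name such as 'Russ' would also match any credit containing 'Russell', and this may inflate some counts. A minimal sketch of stricter whole-name matching follows; `name_pattern` and the sample credit strings are hypothetical and only illustrate the idea.

```python
import re

def name_pattern(artist):
    """Compile a pattern that matches `artist` only as a whole name."""
    # re.escape treats characters such as '.' in 'T.I.' literally;
    # the lookarounds reject matches embedded in a longer word.
    return re.compile(rf"(?<!\w){re.escape(artist)}(?!\w)")

# hypothetical credit strings, for illustration only
credits = ["Russ, songwriter", "Russell Example, producer", "T.I. featuring Someone"]

russ = name_pattern("Russ")
matches = [c for c in credits if russ.search(c)]
```

The pattern string (`name_pattern(artist).pattern`) could be passed to `grammys.workers.str.contains` in place of the raw artist name.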


## Genius API

Using the Genius API, I can access and download the lyrics for any song by an artist in the genius.com database. Through this I can request the top K most popular songs by each artist in my list, where popularity is determined by the number of visits to that song's lyrics page on the Genius website. As Genius is the leading website for song lyrics, I felt confident that its popularity rankings for each artist would roughly match those found on streaming services. I had to make sure I was pulling enough songs per artist to get a corpus that strongly defined each rapper's style and popular themes. I wanted as much data as possible, but did not want to introduce possible bias by including songs that were not popular. I believed it reasonable to assume the common listener would not be able to name 25 songs by a single artist off the top of their head. So, I decided to pull the top 25 most popular songs per artist. This gave me a substantial number of songs to work with, while hopefully minimizing possible bias from unpopular songs.

In [61]:
# setup Genius object to access the Genius API
#   - Excluding songs with terms like 'Remix', 'Solo', 'video', etc. in the title to deal with duplication
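#   - Reading the access token from an environment variable instead of hardcoding it
#     (the variable name GENIUS_ACCESS_TOKEN is an assumption)
api=genius.Genius(os.environ['GENIUS_ACCESS_TOKEN'],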
                  skip_non_songs=True,
                  excluded_terms=["Remix", "(Solo)", "(Live)", "(A Cappella)", "video version", "Video Remix", "Freestyle"],
                  remove_section_headers=True)
api.timeout=10
api.retries=3

# function to pull lyrics for K songs for artist from Genius
def getSongs(artist, k):
    song_list = []
    retry = 0
    # try up to 3 times, retrying only when the request times out
    while retry < 3:
        try:
            song_list = api.search_artist(artist, max_songs=k, sort='popularity').songs
            break  # success, stop retrying
        except Timeout:
            retry += 1
    return (artist, song_list)

# function to download the lyrics of each song
def downloadLyrics(artist, k, folder):
    print(f'Downloading the top {k} songs by {artist}.')
    
    # return top k songs by artist
    songs = api.search_artist(artist, max_songs=k, sort='popularity').songs
    
    # save each song's lyrics as a .txt file
    for song in songs:
        print(f'writing {artist}_{song.title}.txt')
        
        # remove tag at the end of all the lyrics
        lyrics = song.lyrics
        lyrics = re.sub(r"[0-9]*EmbedShare URLCopyEmbedCopy", "", lyrics)
        
        # filename
        title = song.title.translate(str.maketrans('', '', string.punctuation))
        filename = f'{artist}_{title}.txt'
        
        # save
        if filename in os.listdir(folder):
            print('already downloaded')
        else:
            with open(f'{folder}/{filename}', 'w', encoding="utf-8") as file:
                file.write(lyrics)
    print(f'Top {k} songs by {artist} downloaded.')
    
    return (artist, [song.title for song in songs])

def getLyrics(artists, k):
    
    archive = []
    
    for artist in artists:
    
        # return top k songs by artist
        songs = api.search_artist(artist, max_songs=k, sort='popularity').songs
    
        # collect each song's artist, title, and lyrics
        for song in songs:

            # artist and title
            artist = song.artist
            title = song.title

            # remove tag at the end of all the lyrics
            lyrics = song.lyrics
            lyrics = re.sub(r"[0-9]*EmbedShare URLCopyEmbedCopy", "", lyrics)

            archive.append([artist, title, lyrics])
    
    # create dataframe
    df = pd.DataFrame(archive, columns=['artist', 'title', 'lyrics'])
    
    return df
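
Unlike `getSongs`, `getLyrics` calls `search_artist` directly, so a single timeout would stop the whole run. One option is a small reusable retry wrapper. This is a sketch only: `with_retries` and its parameters are hypothetical names, not part of lyricsgenius.

```python
import time

def with_retries(fetch, attempts=3, delay=0.0, retry_on=(TimeoutError,)):
    """Call fetch() until it succeeds or `attempts` tries are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            time.sleep(delay)  # optional pause before the next try
```

Inside `getLyrics`, the search could then be wrapped as `with_retries(lambda: api.search_artist(artist, max_songs=k, sort='popularity'), retry_on=(Timeout,))`, using the `Timeout` exception imported at the top of the notebook.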

In [66]:
%%time
# pull 25 song objects from Genius for each artist
archive = getLyrics(artists.index, 25)

Searching for songs by Aesop Rock...

Song 1: "None Shall Pass"
Song 2: "Daylight"
Song 3: "Zero Dark Thirty"
Song 4: "Rings"
Song 5: "Coffee"
Song 6: "Gopher Guts"
Song 7: "Kirby"
Song 8: "Dorks"
Song 9: "Mystery Fish"
Song 10: "Shrunk"
Song 11: "Blood Sandwich"
Song 12: "Get Out of the Car"
Song 13: "ZZZ Top"
Song 14: "Cycles to Gehenna"
Song 15: "No Regrets"
Song 16: "9-5'ers Anthem"
Song 17: "Leisureforce"
Song 18: "Night Light"
Song 19: "Pigs"
Song 20: "Lotta Years"
Song 21: "Crows 1"
Song 22: "Catacomb Kids"
Song 23: "The Gates"
Song 24: "Labor"
Song 25: "Big Bang"

Reached user-specified song limit (25).
Done. Found 25 songs.
Searching for songs by Busdriver...

Song 1: "Imaginary Places"
Song 2: "Worlds To Run"
Song 3: "Ego Death"
Song 4: "Me - Time (With The Pulmonary Palimpsest)"
Song 5: "Werner Herzog"
Song 6: "Much"
Song 7: "Somethingness"
Song 8: "Ministry of the Torture Couch"
Song 9: "Motion Lines"
Song 10: "Casting Agents And Cowgirls"
Song 11: "Retirement Ode"
Song 12:

In [76]:
archive.to_csv('data/songs_archive.csv', encoding='utf-8-sig')

In [77]:
archive

,artist,title,lyrics
0,Aesop Rock,None Shall Pass,"I'm— trust me, I'm— trust me, I'm trying to he..."
1,Aesop Rock,Daylight,"Yes, yes, y'all, and you don't stop\nAnd keep ..."
2,Aesop Rock,Zero Dark Thirty,(They did not know how long they had been ther...
3,Aesop Rock,Rings,Used to draw\nHard to admit that I used to dra...
4,Aesop Rock,Coffee,"We don't need no walkie-talkies, nope no walki..."
...,...,...,...
3970,NF,Intro,"I'm lookin' like I'm gonna get it, you prolly ..."
3971,NF,Outcast,"Woke up in the cell, where am I at?\nYeah, it'..."
3972,NF,Time (Edit),Even if we both break down tonight\nAnd you sa...
3973,NF,Dreams,"Yeah, most of my life's full of sad days\nStar..."
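
Since 25 songs were requested per artist, a per-artist count is a quick way to check whether any artist came back with fewer. The sketch below uses hypothetical rows and a hypothetical helper, `artists_below_target`; on the real DataFrame, `archive.artist.value_counts()` should give the same counts.

```python
from collections import Counter

def artists_below_target(rows, target):
    """Return {artist: song_count} for artists with fewer than `target` songs."""
    counts = Counter(artist for artist, _title in rows)
    return {artist: n for artist, n in counts.items() if n < target}

# hypothetical (artist, title) rows, for illustration only
rows = [("Artist A", "Song 1"), ("Artist A", "Song 2"), ("Artist B", "Song 1")]
short = artists_below_target(rows, target=2)
```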
