# Collating Hot 100 Listings with Lyrics from Genius.com

The goal of this notebook is to build a table of title, artist, year, ranking, and lyrics for each entry in the Hot 100 dataset covering the 11 years from 2010 to 2020.

To do so, we will use a lyrics scraping function developed by Mir Adnan Mahmood (with a few modifications by Ritwik Banerji).

### First, we'll import any relevant packages:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import re
import string

from seaborn import set_style
set_style("whitegrid")

from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import requests

### Next, we'll set up this notebook to use the lyrics scraper function.

In [2]:
def get_lyrics(title, artist):

    ## reformat title and artist for URL generation:
    ## spaces need to be converted to hyphens
    ## per Genius.com convention
    title = str(title).replace(' ', '-')
    artist = str(artist).replace(' ', '-')

    url = 'https://genius.com/' + artist + '-' + title + '-lyrics'
    url = url.replace('--', '-')

    ## send a browser-like user-agent with the request
    header = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
    req = Request(url, headers=header)
    html = urlopen(req)
    soup = BeautifulSoup(html, 'html.parser')

    ## this class string is tied to the Genius page layout
    lyrics_div = soup.find('div', {'class':'SongPageGrid-sc-1vi6xda-0 DGVcp Lyrics__Root-sc-1ynbvzw-0 kkHBOZ'})
    if lyrics_div is None:
        ## fail explicitly so that callers can record a placeholder
        ## (previously this path crashed inside re.sub with a TypeError)
        raise ValueError('lyrics container not found at ' + url)

    ## insert ; at end of each line
    lyrics = lyrics_div.get_text('; ')

    # remove markers for chorus, verse, etc.
    lyrics = re.sub(r'\[.*?\]', '', lyrics)
    # strip the ( and ) characters themselves but keep the text
    # inside them, since this is meaningful textual info
    lyrics = re.sub(r'\(|\)', '', lyrics)

    # turn the ; into an \n to enable us to read the output
    lyrics = lyrics.replace('; ', '\n')

    return lyrics

Let's try this function on an example.

Note: several formatting manipulations are necessary to properly convert title and artist information into a URL that will successfully retrieve the right page on the Genius.com website.

In [3]:
title = 'Bad and Boujee'
artist = 'Migos'


## remove , ' ( ) ? ! .
title = title.replace(',','')
title = title.replace('\'','')
title = title.replace('(','')
title = title.replace(')','')
title = title.replace('?','')
title = title.replace('!','')
title = title.replace('.','')

## replace ' & ' with '-'
title = title.replace(' & ', '-')
## replace '&' with '-'
title = title.replace('&', '-')

## then replace spaces with hyphens
title = title.replace(' ','-')
title

test = get_lyrics(title, artist)
print(test)


You know, young rich niggas
You know somethin', we ain't really never had no old money
We got a whole lotta new money though, hah

If Young Metro don't trust you, I'm gon' shoot you

Hey

Raindrop Drip, drop-top Drop-top
Smokin' on cookie in the hotbox Cookie
Fuckin' on your bitch, she a thot, thot Thot
Cookin' up dope in the crockpot Pot
We came from nothin' to somethin', nigga Hey
I don't trust nobody, grip the trigger Nobody
Call up the gang and they come and get ya Gang
Cry me a river, give you a tissue Hey
My bitch is bad and bougie Bad
Cookin' up dope with a Uzi Blaow
My niggas is savage, ruthless Savage
We got 30s and hundred-rounds too Grrah
My bitch is bad and bougie Bad
Cookin' up dope with a Uzi Dope
My niggas is savage, ruthless Hey
We got 30s and hundred-rounds too Glah

Offset, woo, woo, woo, woo, woo
Rackaids on rackaids Racks, got back-ends on back-ends
I'm ridin' around in a coupe Coupe
I take your bih right from you You
Bitch, I'm a dog, roof Grr
Beat the ho walls lo
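
The manual replacements in the cell above could be bundled into a single helper so the same rules apply to every title. This is only a sketch: the name `format_for_url` is hypothetical and is not part of the original scraper, and the character list simply mirrors the one used above.

```python
def format_for_url(text):
    """Hypothetical helper: apply the same cleanup steps as the cell above."""
    # remove , ' ( ) ? ! .
    for ch in [",", "'", "(", ")", "?", "!", "."]:
        text = text.replace(ch, "")
    # replace ' & ' first, then any remaining '&', with a hyphen
    text = text.replace(" & ", "-").replace("&", "-")
    # finally replace spaces with hyphens
    return text.replace(" ", "-")


print(format_for_url("Bad and Boujee"))    # Bad-and-Boujee
print(format_for_url("Hey, Soul Sister"))  # Hey-Soul-Sister
```

Keeping the rules in one place makes it easier to add another character later if a title turns out to produce a bad URL.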

The function has some faults. For example, when artists other than the main artist are named in the transcription (for example, a rapper who is part of a crew), their names may still appear in the scraped "lyrics."

But for now, we will settle for that.
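
A related quirk is visible in the output above: ad-libs that Genius writes in parentheses end up fused onto the end of each line, because only the parenthesis characters are stripped. The sketch below replays the three cleanup steps on a made-up snippet (not real scraped data) so the effect can be seen offline. The last variant, which drops the parenthesized text entirely, is an alternative shown for comparison and is not what the scraper does.

```python
import re

# made-up snippet in the same shape as the get_text('; ') output
raw = "[Chorus]; Line one (Hey); Line two (Yeah)"

# step 1: remove section markers such as [Chorus]
no_tags = re.sub(r'\[.*?\]', '', raw)
# step 2: strip the ( and ) characters but keep the text inside them
no_parens = re.sub(r'\(|\)', '', no_tags)
# step 3: turn the '; ' separators into newlines
cleaned = no_parens.replace('; ', '\n')
print(repr(cleaned))  # '\nLine one Hey\nLine two Yeah'

# alternative (an assumption, not the notebook's behavior):
# drop the parenthesized ad-libs altogether
dropped = re.sub(r'\s*\(.*?\)', '', no_tags).replace('; ', '\n')
print(repr(dropped))  # '\nLine one\nLine two'
```

The leading newline comes from the removed section marker, which matches the `\n` at the start of each scraped string.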

Next, we need to build a DataFrame with the following columns: year, position, artist, title, and lyrics.

In [4]:
## first we need to load the hot100 csv as a pandas dataframe

hot100 = pd.read_csv("hot_100_fmtd.csv")
hot100

Unnamed: 0.1,Unnamed: 0,title,artists,rank,year,first_artist,title_fmtd
0,0,TiK ToK,Ke$ha,1,2010,kesha,tik tok
1,1,Need You Now,Lady Antebellum,2,2010,lady antebellum,need you now
2,2,"Hey, Soul Sister",Train,3,2010,train,hey soul sister
3,3,California Gurls,Katy Perry Featuring Snoop Dogg,4,2010,katy perry,california gurls
4,4,OMG,Usher Featuring will.i.am,5,2010,usher,omg
...,...,...,...,...,...,...,...
1093,1093,More Than My Hometown,Morgan Wallen,96,2020,morgan wallen,more than my hometown
1094,1094,Lovin' On You,Luke Combs,97,2020,luke combs,lovin on you
1095,1095,Said Sum,Moneybagg Yo,98,2020,moneybagg yo,said sum
1096,1096,Slide,H.E.R. Featuring YG,99,2020,her,slide


We'll need to get lyrics automatically by using the "first_artist" and "title_fmtd" columns.
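
To check what URL a given row will produce without making a request, the URL logic from `get_lyrics` can be pulled out on its own. This is a sketch under the assumption that the two columns are already cleaned as shown in the table; `build_genius_url` is a hypothetical name used only for illustration.

```python
def build_genius_url(title, artist):
    """Hypothetical helper mirroring the URL construction in get_lyrics."""
    # spaces become hyphens in both parts
    title = str(title).replace(' ', '-')
    artist = str(artist).replace(' ', '-')
    url = 'https://genius.com/' + artist + '-' + title + '-lyrics'
    # collapse any doubled hyphens left over from the cleanup
    return url.replace('--', '-')


print(build_genius_url('tik tok', 'kesha'))
# https://genius.com/kesha-tik-tok-lyrics
```

Printing a handful of these URLs and opening them by hand is a cheap way to spot formatting problems before scraping the whole dataset.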

In [5]:
## This cell is just a quick test that the artists' names have
## been formatted properly (i.e., the way Genius writes them in URLs)
hot100.iloc[0].first_artist

'kesha'

In [6]:
## This cell tests random entries in the dataset for accuracy.

## choose an arbitrary/random row index that is guaranteed to be in range
## (random.randint(0, 1100) could exceed the number of rows)
import random
hit = random.randrange(len(hot100))
print(hot100.iloc[hit].title, hot100.iloc[hit].artists)

lyrics = get_lyrics(hot100.iloc[hit].title_fmtd, hot100.iloc[hit].first_artist)
print(lyrics)

Congratulations Post Malone Featuring Quavo

Mm-mmm
Yeah, yeah
Mm-mmm
Yeah 
Hey


My momma called, seen you on TV, son
Said shit done changed ever since we was on
I dreamed it all ever since I was young
They said I wouldn't be nothing
Now they always say, "Congratulations" Uh, uh, uh
Worked so hard, forgot how to vacation Uh-huh
They ain't never had the dedication Uh, uh
People hatin', say we changed and look, we made it Uh, uh
Yeah, we made it Uh, uh, uh

They was never friendly, yeah
Now I'm jumping out the Bentley, yeah
And I know I sound dramatic, yeah
But I know I had to have it, yeah
For the money, I'm a savage, yeah
I be itching like a addict, yeah
I'm surrounded, twenty bad bitches, yeah
But they didn't know me last year, yeah
Everyone wanna act like they important Yeah-yeah-yeah, yeah-yeah-yeah
But all that mean nothing when I saw my dough, yuh Yeah-yeah-yeah, yeah-yeah-yeah
Everyone countin' on me, drop the ball, yuh Yeah-yeah-yeah, yeah-yeah-yeah
Everything custom like I'm a

Let's quickly test the lyrics scraping function on the first 8 entries.

In [7]:
## For the get_lyrics function, first argument is
## title, second is the artist.
for hit in range(0,8):
    print(hot100.iloc[hit].title, hot100.iloc[hit].artists)
    try:
        lyrics = get_lyrics(hot100.iloc[hit].title_fmtd, hot100.iloc[hit].first_artist)
    except Exception:
        ## show the placeholder instead of stale or empty lyrics
        lyrics = 'URL-ERROR-LYRICS-NOT-FOUND'
    
    print(hit)
    print(lyrics)

TiK ToK Ke$ha
0

Wake up in the morning feelin' like P. Diddy
 

Hey, what up, girl?

Grab my glasses, I'm out the door, I'm gonna hit this city
 

Let's go

Before I leave, brush my teeth with a bottle of Jack
'Cause when I leave for the night, I ain't coming back

I'm talkin' pedicure on our toes, toes
Tryin' on all our clothes, clothes
Boys blowin' up our phones, phones
Drop-toppin', playin' our favorite CDs
Pullin' up to the parties
Tryna get a little bit tipsy

Don't stop, make it pop
DJ, blow my speakers up
Tonight, I'ma fight
Till we see the sunlight
Tick tock on the clock
But the party don't stop, no
Oh, whoa, whoa, oh
Oh, whoa, whoa, oh
Don't stop, make it pop
DJ, blow my speakers up
Tonight, I'ma fight
Till we see the sunlight
Tick tock on the clock
But the party don't stop, no
Oh, whoa, whoa, oh
Oh, whoa, whoa, oh

Ain't got a care in the world, but got plenty of beer
Ain't got no money in my pocket, but I'm already here
And now the dudes are linin' up 'cause they hear we go

5

Can we pretend that airplanes
In the night sky are like shooting stars?
I could really use a wish right now
Wish right now, wish right now
Can we pretend that airplanes
In the night sky are like shooting stars?
I could really use a wish right now
Wish right now, wish right now

Yeah, I could use a dream or a genie or a wish
To go back to a place much simpler than this
'Cause after all the partyin' and smashin' and crashin'
And all the glitz and the glam and the fashion
And all the pandemonium and all the madness
There comes a time where you fade to the blackness
And when you starin' at that phone in your lap
And you hopin' but them people never call you back
But that's just how the story unfolds
You get another hand soon after you fold
And when your plans unravel in the sand
What would you wish for if you had one chance?
So airplane, airplane, sorry I'm late
I'm on my way, so don't close that gate
If I don't make that, then I'll switch my flight
And I'll be right back at it by the e

Looks pretty good!

However, we still need to handle errors. Each entry in the lyrics list must stay aligned with its song, even when a URL generated by our function yields a 404 Not Found error on the Genius website.
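
The alignment requirement can be illustrated offline with a stand-in for the scraper. In this sketch, `collect_lyrics` and `fake_fetch` are hypothetical names, and the simulated failure stands in for a 404 from Genius (`urllib.error.HTTPError` is a subclass of `URLError`, so it would be caught as well).

```python
from urllib.error import URLError

PLACEHOLDER = 'URL-ERROR-LYRICS-NOT-FOUND'


def collect_lyrics(rows, fetch):
    """Return exactly one entry per (title, artist) pair, in order."""
    results = []
    for title, artist in rows:
        try:
            results.append(fetch(title, artist))
        except (URLError, ValueError, TypeError):
            # a failed lookup still occupies its slot in the list
            results.append(PLACEHOLDER)
    return results


def fake_fetch(title, artist):
    """Stand-in for get_lyrics that fails for one made-up title."""
    if title == 'missing song':
        raise URLError('simulated 404')
    return 'lyrics for ' + title


rows = [('tik tok', 'kesha'), ('missing song', 'nobody'), ('omg', 'usher')]
out = collect_lyrics(rows, fake_fetch)
print(len(out) == len(rows))  # True
print(out[1])                 # URL-ERROR-LYRICS-NOT-FOUND
```

Because every row contributes exactly one entry, the resulting list can be assigned to a DataFrame column without any index bookkeeping.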

In [12]:
## create an empty list of lyrics (a whole song's lyrics is just one long string)

lyrics = []

for hit in range(0,10):
    try:
        song_lyrics = get_lyrics(hot100.iloc[hit].title_fmtd, hot100.iloc[hit].first_artist)
    except Exception:
        ## keep a placeholder so the list stays aligned with the rows
        song_lyrics = 'URL-ERROR-LYRICS-NOT-FOUND'
    lyrics.append(song_lyrics)
lyrics

["\nWake up in the morning feelin' like P. Diddy\n \n\nHey, what up, girl?\n\nGrab my glasses, I'm out the door, I'm gonna hit this city\n \n\nLet's go\n\nBefore I leave, brush my teeth with a bottle of Jack\n'Cause when I leave for the night, I ain't coming back\n\nI'm talkin' pedicure on our toes, toes\nTryin' on all our clothes, clothes\nBoys blowin' up our phones, phones\nDrop-toppin', playin' our favorite CDs\nPullin' up to the parties\nTryna get a little bit tipsy\n\nDon't stop, make it pop\nDJ, blow my speakers up\nTonight, I'ma fight\nTill we see the sunlight\nTick tock on the clock\nBut the party don't stop, no\nOh, whoa, whoa, oh\nOh, whoa, whoa, oh\nDon't stop, make it pop\nDJ, blow my speakers up\nTonight, I'ma fight\nTill we see the sunlight\nTick tock on the clock\nBut the party don't stop, no\nOh, whoa, whoa, oh\nOh, whoa, whoa, oh\n\nAin't got a care in the world, but got plenty of beer\nAin't got no money in my pocket, but I'm already here\nAnd now the dudes are linin'

Nice: it looks like we're ready to start adding lyrics to our DataFrame of Hot 100 data.

In [13]:
## add a column for lyrics

hot100['lyrics'] = ''
hot100

Unnamed: 0.1,Unnamed: 0,title,artists,rank,year,first_artist,title_fmtd,lyrics
0,0,TiK ToK,Ke$ha,1,2010,kesha,tik tok,
1,1,Need You Now,Lady Antebellum,2,2010,lady antebellum,need you now,
2,2,"Hey, Soul Sister",Train,3,2010,train,hey soul sister,
3,3,California Gurls,Katy Perry Featuring Snoop Dogg,4,2010,katy perry,california gurls,
4,4,OMG,Usher Featuring will.i.am,5,2010,usher,omg,
...,...,...,...,...,...,...,...,...
1093,1093,More Than My Hometown,Morgan Wallen,96,2020,morgan wallen,more than my hometown,
1094,1094,Lovin' On You,Luke Combs,97,2020,luke combs,lovin on you,
1095,1095,Said Sum,Moneybagg Yo,98,2020,moneybagg yo,said sum,
1096,1096,Slide,H.E.R. Featuring YG,99,2020,her,slide,


In [14]:
## let's just test this on the first 10 in the whole set

## take the first 10 rows of hot100 as a test set

hot100test = hot100.head(10)
hot100test

Unnamed: 0.1,Unnamed: 0,title,artists,rank,year,first_artist,title_fmtd,lyrics
0,0,TiK ToK,Ke$ha,1,2010,kesha,tik tok,
1,1,Need You Now,Lady Antebellum,2,2010,lady antebellum,need you now,
2,2,"Hey, Soul Sister",Train,3,2010,train,hey soul sister,
3,3,California Gurls,Katy Perry Featuring Snoop Dogg,4,2010,katy perry,california gurls,
4,4,OMG,Usher Featuring will.i.am,5,2010,usher,omg,
5,5,Airplanes,B.o.B Featuring Hayley Williams,6,2010,bob,airplanes,
6,6,Love The Way You Lie,Eminem Featuring Rihanna,7,2010,eminem,love the way you lie,
7,7,Bad Romance,Lady Gaga,8,2010,lady gaga,bad romance,
8,8,Dynamite,Taio Cruz,9,2010,taio cruz,dynamite,
9,9,Break Your Heart,Taio Cruz Featuring Ludacris,10,2010,taio cruz,break your heart,


In [15]:
## assign() returns a new DataFrame, which avoids pandas'
## SettingWithCopyWarning when adding a column to a slice
hot100test = hot100test.assign(lyrics=lyrics)
hot100test


Unnamed: 0.1,Unnamed: 0,title,artists,rank,year,first_artist,title_fmtd,lyrics
0,0,TiK ToK,Ke$ha,1,2010,kesha,tik tok,\nWake up in the morning feelin' like P. Diddy...
1,1,Need You Now,Lady Antebellum,2,2010,lady antebellum,need you now,"\n""Hey, sorry I missed your call, just leave a..."
2,2,"Hey, Soul Sister",Train,3,2010,train,hey soul sister,\nHeyy\nHe-e-e-e-ey\nHe-e-e-e-ey\n\nYour lipst...
3,3,California Gurls,Katy Perry Featuring Snoop Dogg,4,2010,katy perry,california gurls,"\nGreetings, loved ones\nLet's take a journey\..."
4,4,OMG,Usher Featuring will.i.am,5,2010,usher,omg,"\nOh my gosh\nBaby let me\nI did it again, so ..."
5,5,Airplanes,B.o.B Featuring Hayley Williams,6,2010,bob,airplanes,\nCan we pretend that airplanes\nIn the night ...
6,6,Love The Way You Lie,Eminem Featuring Rihanna,7,2010,eminem,love the way you lie,\nJust gonna stand there and watch me burn?\nW...
7,7,Bad Romance,Lady Gaga,8,2010,lady gaga,bad romance,"\nOh-oh-oh-oh-oh, oh-oh-oh-oh, oh-oh-oh\nCaugh..."
8,8,Dynamite,Taio Cruz,9,2010,taio cruz,dynamite,"\nI came to dance, dance, dance, dance Yeah\nI..."
9,9,Break Your Heart,Taio Cruz Featuring Ludacris,10,2010,taio cruz,break your heart,"\nWhoa, oh\nWhoa, oh\nWhoa, oh\nWhoa, oh\n\nNo..."


In [18]:
hot100test.to_csv('hot_100_with_lyrics_test.csv')

OK, nice! The test works well. Now let's go ahead and try the whole 11-year set!

In [25]:
## create an empty list of lyrics (a whole song's lyrics is just one long string)
lyrics = []

## put each song's lyrics into that list 'lyrics' as a single, unbroken string
for hit in range(len(hot100)):
    try:
        song_lyrics = get_lyrics(hot100.iloc[hit].title_fmtd, hot100.iloc[hit].first_artist)
    except Exception:
        ## keep a placeholder so the list stays aligned with the rows
        song_lyrics = 'URL-ERROR-LYRICS-NOT-FOUND'
    lyrics.append(song_lyrics)

In [26]:
len(lyrics)

1098
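
The length only confirms that every row has an entry, not that every lookup succeeded. A quick count of placeholders shows how many songs still need attention. This is a sketch on a made-up list; `summarize_failures` is a hypothetical helper, and it also counts empty or missing entries as failures.

```python
PLACEHOLDER = 'URL-ERROR-LYRICS-NOT-FOUND'


def summarize_failures(lyrics_list, placeholder=PLACEHOLDER):
    """Return the number of failed lookups and their positions."""
    failed = [i for i, text in enumerate(lyrics_list)
              if text == placeholder or not text]
    return len(failed), failed


# made-up example standing in for the real lyrics list
example = ['la la la', PLACEHOLDER, 'oh oh oh', PLACEHOLDER]
count, positions = summarize_failures(example)
print(count, positions)  # 2 [1, 3]
```

The positions match the row positions used with `hot100.iloc`, so the failing titles and artists can be looked up directly.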

In [27]:
hot100['lyrics'] = lyrics

In [28]:
hot100.to_csv('hot_100_with_lyrics.csv')