# Scraping Song Lyrics

The idea behind this notebook is to figure out a way to scrape song lyrics from www.genius.com using just the song's title and the artist's name. To do this, I draw heavily on material in Maaz Khan's post [How to Leverage Spotify API + Genius Lyrics for Data Science Tasks in Python](https://medium.com/swlh/how-to-leverage-spotify-api-genius-lyrics-for-data-science-tasks-in-python-c36cdfb55cf3). I also take elements from Nick Pai's `scrape_song_lyrics` function from his [blog post](https://medium.com/analytics-vidhya/how-to-scrape-song-lyrics-a-gentle-python-tutorial-5b1d4ab351d2).

### Getting the packages ready

For this, I plan to use `BeautifulSoup` to scrape the lyrics, `urllib.request` to fetch the page, and the packages `os` and `re` to manipulate the strings in the lyrics scraped from the web.

In [1]:
from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import requests
import os
import re

### Taking a look at the Genius Website

Before we get around to making a function, let's first take a look at an example. For this, we will try to scrape the lyrics for The Weeknd's hit "Blinding Lights", which took the Number 1 spot on the [Billboard Hot 100 year-end chart for 2020](https://www.billboard.com/charts/year-end/2020/hot-100-songs). Genius uses a very straightforward URL pattern for a song's lyrics page. The URL is usually

    'https://genius.com/' + artist_name + '-' + title_name + '-lyrics'
    
where any spaces are replaced by '-'. In this instance it is

    https://genius.com/The-Weeknd-Blinding-Lights-lyrics
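
The pattern above can be sketched as a small helper. The name `build_genius_url` is made up for this illustration, and it only covers the simple case described here: spaces become hyphens and nothing else is changed.

```python
def build_genius_url(artist, title):
    """Build a Genius lyrics URL following the pattern described above.

    Hypothetical helper: it only swaps spaces for hyphens and does not
    handle punctuation or other special characters.
    """
    artist_name = artist.strip().replace(' ', '-')
    title_name = title.strip().replace(' ', '-')
    return 'https://genius.com/' + artist_name + '-' + title_name + '-lyrics'


print(build_genius_url('The Weeknd', 'Blinding Lights'))
# https://genius.com/The-Weeknd-Blinding-Lights-lyrics
```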

Let's try to use BeautifulSoup to load the html for this page.

In [5]:
url = 'https://genius.com/The-Weeknd-Blinding-Lights-lyrics'
header = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
req = Request(url, headers = header)
html = urlopen(req)
soup = BeautifulSoup(html, 'html.parser')
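
As a side sketch, the header can be inspected on the `Request` object before anything is sent, and the fetch can be wrapped so that a refused request reports its status code instead of stopping the notebook. The names `make_request` and `fetch_html`, the shortened user-agent string, and the 10-second timeout are assumptions made for this example.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Shortened browser-like string, chosen only for this example.
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'


def make_request(url):
    """Attach a browser-like user-agent header to the request."""
    return Request(url, headers={'user-agent': USER_AGENT})


def fetch_html(url):
    """Return the page bytes, or None if the request fails."""
    try:
        with urlopen(make_request(url), timeout=10) as response:
            return response.read()
    except HTTPError as err:
        # HTTPError is a subclass of URLError, so it is caught first.
        print('Server returned HTTP status', err.code)
    except URLError as err:
        print('Could not reach the server:', err.reason)
    return None


# Inspect the header without making a network call.
# urllib stores header names capitalised, hence 'User-agent'.
req = make_request('https://genius.com/The-Weeknd-Blinding-Lights-lyrics')
print(req.get_header('User-agent'))
```

Calling `fetch_html(url)` would then perform the actual download; it is left out here so the sketch runs without network access.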

Note that I have to pass an additional `headers` argument (the `header` dictionary above). To my knowledge, its `user-agent` string makes the request look like it comes from an ordinary browser on a PC; if we didn't use this argument, we'd get an error. Apparently, some websites have safeguards that prevent bots from crawling their servers, so we need to get around this. Now let's see what the HTML looks like.

In [6]:
soup.body.prettify()

'<body>\n <div id="application">\n  <div class="LeaderboardOrMarquee__Sticky-yjd3i4-0 dIgauN Leaderboard-da326u-0 cJGzHp">\n   <div class="LeaderboardOrMarquee__Container-yjd3i4-1 cLZLgM">\n    <div class="DfpAd__Container-sc-1tnbv7f-0 kthwUN" id="div-gpt-ad-desktop_song_combined_leaderboard-desktop_song_combined_leaderboard-1">\n    </div>\n   </div>\n  </div>\n  <div class="StickyNav__Container-sc-1q5ye4s-0 GVXNJ">\n   <div class="StickyNav__Left-sc-1q5ye4s-1 klKTfS">\n    <a class="PageHeaderLogo__Link-sc-13m4mcv-0 dbzhnO" height="13" href="https://genius.com">\n     <svg viewbox="0 0 100 15">\n      <path d="M11.7 2.9s0-.1 0 0c-.8-.8-1.7-1.2-2.8-1.2-1.1 0-2.1.4-2.8 1.1-.2.2-.3.4-.5.6v.1c0 .1.1.1.1.1.4-.2.9-.3 1.4-.3 1.1 0 2.2.5 2.9 1.2h1.6c.1 0 .1-.1.1-.1V2.9c.1 0 0 0 0 0zm-.1 4.6h-1.5c-.8 0-1.4-.6-1.5-1.4.1 0 0-.1 0-.1-.3 0-.6.2-.8.4v.2c-.6 1.8.1 2.4.9 2.4h1.1c.1 0 .1.1.1.1v.4c0 .1.1.1.1.1.6-.1 1.2-.4 1.7-.8V7.6c.1 0 0-.1-.1-.1z">\n      </path>\n      <path d="M11.6 11.9s-.1 0 0 

Yikes! That's a very ugly output! Digging through the code, we find that the lyrics are in the `<div class="song_body-lyrics"></div>` element, so we'll try to find the lyrics there.
One point of note is that the website relies heavily on caching and cookies. As a result, it serves different HTML elements and tags, so the lyrics can end up in different wrappers that require different formatting. **If the code below doesn't work, re-run the entire notebook until it does. I still have to find a way around this issue.**

In [7]:
lyrics_div = soup.find('div', class_='song_body-lyrics')
lyrics = lyrics_div.get_text()
print(lyrics)

AttributeError: 'NoneType' object has no attribute 'get_text'

In [10]:
soup.find('div', class_='song_body-lyrics')


Looks like we need to clean up quite a bit, actually. First, let's get rid of the pesky section labels in the "\[\]" (verse, chorus, etc.). The pattern is taken from Nick Pai's code; note that it also removes anything in parentheses.
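
To see what the pattern does before applying it to the full page, here is a minimal sketch on a short string. The sample text, including the `(Oh)` backing vocal, is made up for the illustration.

```python
import re

# Made-up sample with a section label and a parenthesised backing vocal.
sample = "[Verse 1]\nI look around and\nSin City's cold and empty (Oh)\n"

# Non-greedy match from an opening ( or [ to the nearest closing ) or ].
# '.' does not match newlines, so a bracket spanning two lines is left alone.
cleaned = re.sub(r'[\(\[].*?[\)\]]', '', sample)

print(repr(cleaned))
# "\nI look around and\nSin City's cold and empty \n"
```

The removed pieces leave blank lines and trailing spaces behind, which is why a whitespace clean-up step is still needed afterwards.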

In [5]:
lyrics = re.sub(r'[\(\[].*?[\)\]]', '', lyrics)
print(lyrics)


Blinding Lights Lyrics




Yeah


I've been tryna call
I've been on my own for long enough
Maybe you can show me how to love, maybe
I'm going through withdrawals
You don't even have to do too much
You can turn me on with just a touch, baby


I look around and
Sin City's cold and empty 
No one's around to judge me 
I can't see clearly when you're gone


I said, ooh, I'm blinded by the lights
No, I can't sleep until I feel your touch
I said, ooh, I'm drowning in the night
Oh, when I'm like this, you're the one I trust
Hey, hey, hey


I'm running out of time
'Cause I can see the sun light up the sky
So I hit the road in overdrive, baby, oh


The city's cold and empty 
No one's around to judge me 
I can't see clearly when you're gone


I said, ooh, I'm blinded by the lights
No, I can't sleep until I feel your touch
I said, ooh, I'm drowning in the night
Oh, when I'm like this, you're the one I trust


I'm just calling back to let you know 
I could never say it on the phone 
Will never let

Better, but now we have two things to remove: the title at the top and the "More on Genius" footer at the bottom. So let's do that.

In [6]:
# Removing the song name
lyrics = lyrics.split('Lyrics', 1)[1]
# Removing the More on Genius Tag
lyrics = lyrics.split('More on Genius', 1)[0]
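
The two `split` calls assume that both markers are present; if `'Lyrics'` is missing, indexing with `[1]` raises an `IndexError`. A more forgiving variant, sketched here with `str.partition`, leaves the text untouched when a marker is absent. The helper name `strip_title_and_footer` and the sample string are made up for this example.

```python
def strip_title_and_footer(text, header='Lyrics', footer='More on Genius'):
    """Drop everything up to the first header and from the first footer on."""
    _, found, after = text.partition(header)
    body = after if found else text      # keep the text if header is absent
    body, _, _ = body.partition(footer)  # no footer means nothing is cut
    return body


sample = "Blinding Lights Lyrics\nYeah\nI've been tryna call\nMore on Genius\nAbout"
print(repr(strip_title_and_footer(sample)))
# "\nYeah\nI've been tryna call\n"
```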

In [7]:
print(lyrics)






Yeah


I've been tryna call
I've been on my own for long enough
Maybe you can show me how to love, maybe
I'm going through withdrawals
You don't even have to do too much
You can turn me on with just a touch, baby


I look around and
Sin City's cold and empty 
No one's around to judge me 
I can't see clearly when you're gone


I said, ooh, I'm blinded by the lights
No, I can't sleep until I feel your touch
I said, ooh, I'm drowning in the night
Oh, when I'm like this, you're the one I trust
Hey, hey, hey


I'm running out of time
'Cause I can see the sun light up the sky
So I hit the road in overdrive, baby, oh


The city's cold and empty 
No one's around to judge me 
I can't see clearly when you're gone


I said, ooh, I'm blinded by the lights
No, I can't sleep until I feel your touch
I said, ooh, I'm drowning in the night
Oh, when I'm like this, you're the one I trust


I'm just calling back to let you know 
I could never say it on the phone 
Will never let you go this time 


I 

Great! Now we just need to remove all of that unnecessary whitespace! Again, borrowing from Nick Pai's code:

In [8]:
# Removing empty lines
lyrics = os.linesep.join([s for s in lyrics.splitlines() if s])
# Replacing line breaks with "; " (os.linesep is '\r\n' on Windows and '\n' elsewhere)
lyrics = lyrics.replace(os.linesep,'; ')
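
Because the cell above goes through `os.linesep`, its result depends on the platform's line separator. A platform-independent variant, offered as a sketch rather than as Nick Pai's method, joins the non-empty lines directly with `'; '`. The helper name `flatten_lyrics` and the sample string are made up, and unlike the cell above it also trims trailing spaces on each line.

```python
def flatten_lyrics(text, separator='; '):
    """Join the non-blank lines of text into one line."""
    # splitlines() handles '\n' and '\r\n' alike.
    lines = (line.strip() for line in text.splitlines())
    return separator.join(line for line in lines if line)


sample = "\n\nYeah\n\n\nI've been tryna call \r\nI've been on my own for long enough\n"
flattened = flatten_lyrics(sample)
print(flattened)
# Yeah; I've been tryna call; I've been on my own for long enough
```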

In [9]:
print(lyrics)

Yeah; I've been tryna call; I've been on my own for long enough; Maybe you can show me how to love, maybe; I'm going through withdrawals; You don't even have to do too much; You can turn me on with just a touch, baby; I look around and; Sin City's cold and empty ; No one's around to judge me ; I can't see clearly when you're gone; I said, ooh, I'm blinded by the lights; No, I can't sleep until I feel your touch; I said, ooh, I'm drowning in the night; Oh, when I'm like this, you're the one I trust; Hey, hey, hey; I'm running out of time; 'Cause I can see the sun light up the sky; So I hit the road in overdrive, baby, oh; The city's cold and empty ; No one's around to judge me ; I can't see clearly when you're gone; I said, ooh, I'm blinded by the lights; No, I can't sleep until I feel your touch; I said, ooh, I'm drowning in the night; Oh, when I'm like this, you're the one I trust; I'm just calling back to let you know ; I could never say it on the phone ; Will never let you go this t

Perfect! That looks super clean! Now let's dump all of this into an automated function!

### Creating the function

In [10]:
def get_lyrics(title, artist):
    if " " in title:
        title_name = str(title.replace(' ','-'))
    else:
        title_name = str(title)
        
    if " " in artist:
        artist_name = str(artist.replace(' ','-'))
    else:
        artist_name = str(artist)
     
    lyrics_div = None
    
    while lyrics_div is None:
        url = 'https://genius.com/' + artist_name + '-' + title_name + '-lyrics'
        header = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
        req = Request(url, headers = header)
        html = urlopen(req)
        soup = BeautifulSoup(html, 'html.parser')

        # Getting Song Lyrics Div
        lyrics_div = soup.find('div', class_='song_body-lyrics')
        
    # lyrics_div = soup.find('div', {'class':'SongPageGrid-sc-1vi6xda-0 DGVcp Lyrics__Root-sc-1ynbvzw-0 kkHBOZ'})
    
    lyrics = lyrics_div.get_text()
    
    # Removing stuff in [] (verses, chorus etc)
    lyrics = re.sub(r'[\(\[].*?[\)\]]', '', lyrics)
    # Removing the song name
    lyrics = lyrics.split('Lyrics', 1)[1]
    # Removing the More on Genius Tag
    lyrics = lyrics.split('More on Genius', 1)[0]
    # Removing empty lines
    lyrics = os.linesep.join([s for s in lyrics.splitlines() if s])
    # Replacing line breaks with "; " (os.linesep is '\r\n' on Windows and '\n' elsewhere)
    lyrics = lyrics.replace(os.linesep,'; ')
    
    return lyrics

What does this function do? Well there are a few parts:
- The function takes two arguments: `title` is the song's title name (for the example it was Blinding Lights) and `artist` is the name of the (first) artist that performed the song.
- The function first replaces every space in the inputs with "-". **Note that inputs should only contain alphanumeric characters and spaces (no apostrophes, commas, exclamation marks, etc.).**
- The function then sets `lyrics_div` to `None` as a placeholder for the lyrics div in the HTML. One annoying thing about genius.com is that it uses cookies and caching to generate different versions of the same webpage. You need to open the URL repeatedly to get the right div; otherwise the search doesn't return the right object.
- Then the function loops, opening the URL again and again and parsing the result with BeautifulSoup. The loop continues until `lyrics_div` gets a 'hit', in the sense that BeautifulSoup is able to find the appropriate holder for the lyrics. Note that the loop has no upper bound, so it keeps requesting the page for as long as the div is missing.
- The function then gets the text from `lyrics_div` and then finishes up cleaning the text as above. It then spits out the lyrics.
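
The retry step described above can be sketched with an upper bound on the number of attempts, so a page that never contains the div does not keep the loop running forever. The helper `retry_until_found`, its parameters, and the stand-in fetch function are all hypothetical; in the notebook, `fetch` would be a function that downloads the page and returns the result of `soup.find(...)`.

```python
import time


def retry_until_found(fetch, max_attempts=5, delay_seconds=1.0):
    """Call fetch() until it returns something other than None."""
    for attempt in range(max_attempts):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            time.sleep(delay_seconds)  # short pause between requests
    raise RuntimeError(f'no result after {max_attempts} attempts')


# Stand-in for the real download: fails twice, then succeeds.
responses = iter([None, None, '<div>lyrics</div>'])
found = retry_until_found(lambda: next(responses), delay_seconds=0)
print(found)
# <div>lyrics</div>
```

Raising an error after the last attempt is one design choice among several; returning `None` and letting the caller decide would work as well.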

To try it out, let's check what happens when we use the function to get the lyrics for Blinding Lights by The Weeknd!

In [11]:
print(get_lyrics('Blinding Lights', 'The Weeknd'))

Yeah; I've been tryna call; I've been on my own for long enough; Maybe you can show me how to love, maybe; I'm going through withdrawals; You don't even have to do too much; You can turn me on with just a touch, baby; I look around and; Sin City's cold and empty ; No one's around to judge me ; I can't see clearly when you're gone; I said, ooh, I'm blinded by the lights; No, I can't sleep until I feel your touch; I said, ooh, I'm drowning in the night; Oh, when I'm like this, you're the one I trust; Hey, hey, hey; I'm running out of time; 'Cause I can see the sun light up the sky; So I hit the road in overdrive, baby, oh; The city's cold and empty ; No one's around to judge me ; I can't see clearly when you're gone; I said, ooh, I'm blinded by the lights; No, I can't sleep until I feel your touch; I said, ooh, I'm drowning in the night; Oh, when I'm like this, you're the one I trust; I'm just calling back to let you know ; I could never say it on the phone ; Will never let you go this t

IT WORKS!!!! We can play around with a bunch of songs and artists, so long as the inputs are appropriate and don't contain any punctuation. Let's try some more examples!

**Tik Tok by Kesha**

In [12]:
print(get_lyrics('Tik Tok', 'Kesha'))

Wake up in the morning feelin' like P. Diddy ; Grab my glasses, I'm out the door, I'm gonna hit this city ; Before I leave, brush my teeth with a bottle of Jack; 'Cause when I leave for the night, I ain't coming back; I'm talkin' pedicure on our toes, toes; Tryin' on all our clothes, clothes; Boys blowin' up our phones, phones; Drop-toppin', playin' our favorite CDs; Pullin' up to the parties; Tryna get a little bit tipsy; Don't stop, make it pop; DJ, blow my speakers up; Tonight, I'ma fight; Till we see the sunlight; Tick tock on the clock; But the party don't stop, no; Oh, whoa, whoa, oh; Oh, whoa, whoa, oh; Don't stop, make it pop; DJ, blow my speakers up; Tonight, I'ma fight; Till we see the sunlight; Tick tock on the clock; But the party don't stop, no; Oh, whoa, whoa, oh; Oh, whoa, whoa, oh; Ain't got a care in the world, but got plenty of beer; Ain't got no money in my pocket, but I'm already here; And now the dudes are linin' up 'cause they hear we got swagger; But we kick 'em 

**Hall of Fame by The Script**

In [13]:
print(get_lyrics('Hall of Fame', 'The Script'))

Yeah, you can be the greatest, you can be the best; You can be the King Kong bangin' on your chest; You can beat the world, you can beat the war; You can talk to God, go bangin' on his door; You can throw your hands up, you can beat the clock ; You can move a mountain, you can break rocks; You can be a master, don’t wait for luck; Dedicate yourself and you gon' find yourself; Standin' in the Hall of Fame ; And the world’s gonna know your name ; ‘Cause you burn with the brightest flame ; And the world’s gonna know your name ; And you’ll be on the walls of the Hall of Fame; You can go the distance, you can run the mile; You can walk straight through hell with a smile; You can be the hero, you can get the gold; Breakin' all the records they thought never could be broke, yeah; Do it for your people, do it for your pride; How are you ever gonna know if you never even try?; Do it for your country, do it for your name; ‘Cause there's gon' be a day, when you're; Standin' in the Hall of Fame ; 

**Thnks fr th Mmrs by Fall Out Boy**

In [14]:
print(get_lyrics('Thnks fr th Mmrs', 'Fall Out Boy'))

I'm going to make it bend and break;  Say a prayer, but let the; Good times roll, in case God doesn't show; And I want these words to make things right; But it's the wrongs that make the words come to life; "Who does he think he is?" If that's the worst you've got; Better put your fingers back to the keys...; One night and one more time; Thanks for the memories, even though they weren't so great; "He tastes like you; Only sweeter"; One night, yeah, and one more time; Thanks for the memories, thanks for the memories; "See, he tastes like you; Only sweeter"; Been looking forward to the future; But my eyesight is going bad; And this crystal; Ball...; It's always cloudy except for ; When you look into the past ; One night stand...; One night stand off!; One night and one more time; Thanks for the memories, even though they weren't so great; "He tastes like you; Only sweeter"; One night, yeah, and one more time; Thanks for the memories, thanks for the memories; "See, he tastes like you; Onl

**Thank U, Next by Ariana Grande**

In [15]:
print(get_lyrics('Thank U Next', 'Ariana Grande'))

Thought I'd end up with Sean; But he wasn't a match; Wrote some songs about Ricky; Now I listen and laugh; Even almost got married; And for Pete, I'm so thankful; Wish I could say, "Thank you" to Malcolm; 'Cause he was an angel; One taught me love; One taught me patience; And one taught me pain; Now, I'm so amazing; Say I've loved and I've lost; But that's not what I see; So, look what I got; Look what you taught me; And for that, I say; Thank you, next ; Thank you, next ; Thank you, next; I'm so fuckin' grateful for my ex; Thank you, next ; Thank you, next ; Thank you, next ; I'm so fuckin'—; Spend more time with my friends; I ain't worried 'bout nothin'; Plus, I met someone else; We havin' better discussions; I know they say I move on too fast; But this one gon' last; 'Cause her name is Ari; And I'm so good with that ; She taught me love ; She taught me patience ; How she handles pain ; That shit's amazing ; I've loved and I've lost ; But that's not what I see ; 'Cause look what I've
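
In the call above, the comma in "Thank U, Next" was dropped by hand to satisfy the no-punctuation rule. A small helper could do the same automatically. This is only a sketch: the name `drop_punctuation` is made up, it keeps plain ASCII letters, digits and spaces only (so accented letters are dropped too), and it has not been checked against how Genius itself builds its URLs.

```python
import re


def drop_punctuation(text):
    """Keep ASCII letters, digits and spaces, then collapse repeated spaces."""
    cleaned = re.sub(r'[^A-Za-z0-9 ]', '', text)
    return ' '.join(cleaned.split())


print(drop_punctuation('Thank U, Next'))
# Thank U Next
```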


Created by Mir Adnan Mahmood, PhD. Candidate (Economics), The Ohio State University.