# Data Appendix

In [2]:
import pandas as pd

#### Import Billboard CSV
Read the Billboard chart data, then convert the WeekID column to the datetime type.

In [3]:
charts = pd.read_csv("Hot Stuff.csv")
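# errors='ignore' and infer_datetime_format are deprecated in recent pandas releases;
# the default call parses the column and raises if a value cannot be parsed
charts['WeekID'] = pd.to_datetime(charts['WeekID'])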

#### Make New Dataframe Songs
Unlike the original, this dataframe has no duplicate songs. Each song is represented by a single row: the one with its highest Weeks on Chart value, which is the last week the song was on the chart.

In [4]:
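# Sort on both keys in one call: chained sort_values calls are not guaranteed to
# preserve the earlier ordering, so the Peak Position sort could otherwise be lost.
songs = charts.sort_values(['Weeks on Chart', 'Peak Position'], ascending=[False, True]).drop_duplicates(['SongID'])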
songs.shape

(29389, 10)
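
To illustrate what the sort-then-drop step keeps, here is a minimal sketch on a made-up dataframe. The `toy` data and the `deduped` name are hypothetical and only mirror the three columns involved; this is not the real chart data.

```python
import pandas as pd

# Hypothetical miniature of the chart data: two songs, several chart weeks each
toy = pd.DataFrame({
    "SongID": ["A", "A", "A", "B", "B"],
    "Peak Position": [5, 3, 3, 10, 8],
    "Weeks on Chart": [1, 2, 3, 1, 2],
})

# Longest-running row first, ties broken by the best (lowest) peak,
# then keep only the first row seen for each SongID
deduped = (
    toy.sort_values(["Weeks on Chart", "Peak Position"], ascending=[False, True])
       .drop_duplicates("SongID")
)
print(deduped)
```

Each song ends up with exactly one row, carrying its largest Weeks on Chart value.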

#### Dropping 3 Columns
The url, Instance, and Previous Week Position columns are not needed for the upcoming web scraping. Since I knew the scraping would be data intensive, I wanted to reduce the size of the dataframe in any way that I could.

In [5]:
songs = songs.drop(['url', 'Instance', 'Previous Week Position'], axis=1)
songs.shape


(29389, 7)

#### A String Building Function
This function converts the data in the songs dataframe into a URL in the format used by songlyrics.com. Songlyrics.com redirects close matches to the proper page, but I still tried to get as close to their usual routing format as possible.

In [6]:
## Function that builds URLs for songlyrics.com
def urlBuild(song, artist):
    base = "http://www.songlyrics.com/"
    # Spaces become hyphens in both the artist and the song part of the path
    artistSlug = artist.replace(' ', '-')
    songSlug = song.replace(' ', '-')
    return base + artistSlug + '/' + songSlug + '-lyrics'

In [7]:
# Test of urlBuild function
urlBuild("Big Shot", "Kendrick Lamar & Travis Scott")

'http://www.songlyrics.com/Kendrick-Lamar-&-Travis-Scott/Big-Shot-lyrics'
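
The test output above carries a raw `&` into the URL path. As a possible variation, the sketch below percent-encodes each path segment with the standard library's `urllib.parse.quote`. The name `urlBuildQuoted` is hypothetical, and whether songlyrics.com resolves encoded paths the same way is an assumption that was not checked here.

```python
from urllib.parse import quote

def urlBuildQuoted(song, artist):
    # Hypothetical variant of urlBuild: same layout, but each path
    # segment is percent-encoded so characters like '&' are escaped
    base = "http://www.songlyrics.com/"
    artistSlug = quote(artist.replace(" ", "-"), safe="")
    songSlug = quote(song.replace(" ", "-"), safe="")
    return base + artistSlug + "/" + songSlug + "-lyrics"

print(urlBuildQuoted("Big Shot", "Kendrick Lamar & Travis Scott"))
# http://www.songlyrics.com/Kendrick-Lamar-%26-Travis-Scott/Big-Shot-lyrics
```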

#### The Web Scraping Function
The webGrab function requests the URL built in the last step and parses the returned HTML with BeautifulSoup. It then finds the element that houses the lyrics, cleans the text a bit, and returns the final lyrics. When the page has no lyrics element, webGrab returns "error, no lyrics found" instead.

In [8]:
# Web Scraping function to get and clean lyrics
from bs4 import BeautifulSoup
import requests

failureStr = "error, no lyrics found"
def webGrab(url):
    html = requests.get(url, allow_redirects=True).text
    soup = BeautifulSoup(html, 'html5lib')
    # soup.find returns None when the page has no lyrics container
    lyricsDiv = soup.find(id="songLyricsDiv")
    if lyricsDiv is None:
        return failureStr
    lyrics = lyricsDiv.text
    lyrics = lyrics.replace('\n', ' ')
    lyrics = lyrics.replace(',', '')
    lyrics = lyrics.replace('-', ' ')
    lyrics = lyrics.replace('!', '')
    lyrics = lyrics.replace('(', '')
    lyrics = lyrics.replace(')', '')
    lyrics = lyrics.replace('?', '')
    lyrics = lyrics.replace('/>', '')
    lyrics = lyrics.replace('<br', ' ')
    lyrics = lyrics.lower()
    return lyrics
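
The single-character cleanup in webGrab can also be expressed as one translation table. This is a minimal sketch, not the function used for the scrape: `cleanLyrics` is a hypothetical helper, and it leaves out the two multi-character replacements (`'/>'` and `'<br'`), which would still need `str.replace`.

```python
# Characters webGrab deletes outright, and characters it turns into spaces
DELETE = ",!()?"
TO_SPACE = "\n-"

# Map each deleted character to None and each spaced character to " "
CLEAN_TABLE = str.maketrans(
    {**{c: None for c in DELETE}, **{c: " " for c in TO_SPACE}}
)

def cleanLyrics(text):
    # Hypothetical helper: one translate pass instead of chained replace calls
    return text.translate(CLEAN_TABLE).lower()

print(cleanLyrics("Hello, world!\nIt's a so-called (test)?"))
# hello world it's a so called test
```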

#### Build the URLs and Prepare for Scraping
This quick step adds two new columns: lyricLink for the link to the lyrics page and lyrics for the text returned by webGrab. The for loop uses the urlBuild function from above to build every URL that the web scraping will request.

In [9]:
## BUILD URLS

songs['lyricLink'] = ""
songs['lyrics'] = ""

for index, row in songs.iterrows():
    songs.loc[index, 'lyricLink'] = urlBuild(row.Song, row.Performer)


In [10]:
songs.head()

| | WeekID | Week Position | Song | Performer | SongID | Peak Position | Weeks on Chart | lyricLink | lyrics |
|---|---|---|---|---|---|---|---|---|---|
| 302681 | 2014-05-10 | 49 | Radioactive | Imagine Dragons | RadioactiveImagine Dragons | 3 | 87 | http://www.songlyrics.com/Imagine-Dragons/Radi... | |
| 302673 | 2014-03-22 | 45 | Sail | AWOLNATION | SailAWOLNATION | 17 | 79 | http://www.songlyrics.com/AWOLNATION/Sail-lyrics | |
| 302665 | 2021-05-29 | 23 | Blinding Lights | The Weeknd | Blinding LightsThe Weeknd | 3 | 76 | http://www.songlyrics.com/The-Weeknd/Blinding-... | |
| 278572 | 2009-10-10 | 48 | I'm Yours | Jason Mraz | I'm YoursJason Mraz | 6 | 76 | http://www.songlyrics.com/Jason-Mraz/I'm-Yours... | |
| 278565 | 1998-10-10 | 45 | How Do I Live | LeAnn Rimes | How Do I LiveLeAnn Rimes | 2 | 69 | http://www.songlyrics.com/LeAnn-Rimes/How-Do-I... | |


#### Web Scraping Operation
This is the code chunk that takes a long time (so I did not run it to completion for this appendix).
The for loop iterates over every song in the dataset and uses webGrab to attempt to get the lyrics. Because the operation takes so long, I tracked progress as it ran so I could tell whether I needed to interrupt it and change things. I only had to make some small changes, and the full run took about 5 hours.

In [None]:
# For each song, scrape songlyrics.com for the lyrics

from IPython.display import clear_output
count = 0
failCount = 0
total = songs.shape[0]
redirectE = "Redirect error"
songs['lyricBool'] = False

for index, row in songs.iterrows():
    count = count + 1
    try: 
        songs.loc[index, 'lyrics'] = webGrab(songs.loc[index, 'lyricLink'])
    except Exception:  # unlike a bare except, this lets a keyboard interrupt stop the loop
        songs.loc[index, 'lyricBool'] = False
        failCount = failCount + 1
        songs.loc[index, 'lyrics'] = redirectE
        print("Caught an error")
        print("failure number: ", failCount)
        continue
    if failureStr in songs.loc[index, 'lyrics'] or "sorry we have no" in songs.loc[index, 'lyrics']:
        failCount = failCount + 1
        songs.loc[index, 'lyricBool'] = False
        clear_output(wait=True)
        print("failure number: ", failCount)
    else:
        clear_output(wait=True)
        print("success")
        songs.loc[index, 'lyricBool'] = True
    print("total done: ", count)
    print("progress: ", count / total)

failure number:  27
total done:  76
progress:  0.0025860015652114736
Caught an error
failure number:  28


### For the sake of computing time, I stopped the scraping here after 76 pages.

Final success rate:  0.7016911089183028
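
The appendix does not show how the success rate was computed. One plausible reading is the share of songs whose lyrics were found, sketched below under that assumption. `successRate` is a hypothetical helper, and the flags in the example are made up.

```python
def successRate(flags):
    # flags: one boolean per song, True when lyrics were found
    flags = list(flags)
    if not flags:
        return 0.0  # avoid dividing by zero on an empty run
    return sum(flags) / len(flags)

print(successRate([True, False, True, True]))  # 0.75
```

On the real dataframe, the same quantity would presumably be available as `songs.lyricBool.mean()`, since the mean of a boolean column is the fraction of True values.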

#### Eliminate Songs without Lyrics
Since the lyricBool column from the last step records whether lyrics were found, I cut the dataset down further to contain only the songs for which the web scraping had found lyrics. The resulting dataframe is ready for analysis, so it is exported as a CSV file.

In [12]:
songsWithLyrics = songs.loc[songs.lyricBool == True]

AttributeError: 'DataFrame' object has no attribute 'lyricBool'

In [None]:
songsWithLyrics.to_csv("songsWithLyrics.csv")

#### Counting and Saving the Word Counts
The following code blocks iterate over all the song lyrics found and count each word in a dict. The first block counts every occurrence, including repeats within a song, and the next block saves the result as a JSON file.
The third code block is similar to the first, but it first puts the words of each song in a set, so a word is counted at most once per song. The result is saved as a different JSON file.

In [None]:
## Count most common words with duplicates
songCount = {}

for row in songsWithLyrics.lyrics:
    temp = row.split(' ')
    
    for word in temp:
        if word not in songCount:
            songCount[word] = 1
        else:
            songCount[word] = songCount[word] + 1

sortedCounts = dict(sorted(songCount.items(), key= lambda x:x[1], reverse=True))

In [None]:
import json

with open("totalCount", "w") as file:
    json.dump(sortedCounts, file)

In [None]:
## Count most common words with no duplicates
songCount2 = {}

for row in songsWithLyrics.lyrics:
    temp = row.split(' ')
    
    dupList = list(set(temp))
    for word in dupList: 
        if word not in songCount2:
            songCount2[word] = 1
        else:
            songCount2[word] = songCount2[word] + 1

sortedCountsNoDupes = dict(sorted(songCount2.items(), key= lambda x:x[1], reverse=True))

In [None]:
with open("totalCountNoDupes", "w") as file:
    json.dump(sortedCountsNoDupes, file)
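
Both counts can also be sketched with `collections.Counter` from the standard library. This is an illustrative alternative, not the code that produced the saved files: `lyricsList` is a hypothetical stand-in for `songsWithLyrics.lyrics`, and `split()` without an argument is a small behavior change because it discards the empty strings that `split(' ')` produces for repeated spaces.

```python
from collections import Counter

# Hypothetical stand-in for songsWithLyrics.lyrics
lyricsList = ["la la love", "love me do", "la di da"]

totalCount = Counter()    # every occurrence of a word counts
perSongCount = Counter()  # each word counts at most once per song

for lyric in lyricsList:
    words = lyric.split()  # no argument: also drops empty strings
    totalCount.update(words)
    perSongCount.update(set(words))

# most_common() returns (word, count) pairs sorted by count, highest first
print(totalCount.most_common(2))
print(dict(perSongCount))
```

Since `Counter` is a dict subclass, `dict(totalCount.most_common())` gives a sorted plain dict that `json.dump` can write in the same way as above.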