In [2]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import matplotlib.pyplot as plt
import numpy as np
import os
import re
from   sklearn.feature_extraction.text import TfidfVectorizer
from   sklearn.feature_selection import SelectKBest, mutual_info_classif
from   sklearn.linear_model import LogisticRegression, LinearRegression
from   sklearn.model_selection import cross_val_score
from   sklearn.preprocessing import StandardScaler

# Final Project
#### Table of Contents
0. Project Team
1. Introduction and Hypothesis
2. Data and Methods
        I. Dataset Creation
        II. Metadata + Dataset Information
3. Results
4. Discussion and Conclusions
5. Resources Consulted
6. Bibliography

## 0. Project Team

Estelle Hooper (ehh52) and Gabriella Chu (gc386)

## 1. Introduction and Hypothesis

## 2. Data and Methods

### I. Dataset Creation

To answer our research question, we created our own dataset of the Billboard Hot 100 songs for every week from 2000 through 2021, covering a little over two decades of music **(roughly, 22 years * 52 weeks * 100 songs = 114,400 chart entries)**. Each week, Billboard.com ranks the "Hot 100" songs and lists, for each song, its peak position on the chart to date and the number of weeks it has spent on the chart. Because Billboard archives these charts, we scraped them with Beautiful Soup to obtain the most popular songs of each week, then used each song's title and artist to construct a Genius.com link and scrape its lyrics.

To collect this data, we wrote the following functions:
- `song_data()` scrapes the Billboard Hot 100 chart for a given week, collects each song's metadata and lyrics, and stores everything in a pandas dataframe.
- `append_lyrics()` (helper function) scrapes Genius lyrics using the title and artist from the song metadata.
- `valid_dates()` takes a list of dates and keeps only those that have a real Billboard chart.
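
The body of `valid_dates()` is not shown in this notebook, so the following is only a hypothetical sketch of the filtering step it describes. The `chart_exists` parameter and the `is_saturday` stand-in are assumptions made for illustration; a real check would have to request the chart page for each date, which we avoid here so the example runs offline.

```python
from datetime import datetime
from typing import Callable, Iterable, List


def valid_dates(dates: Iterable[str], chart_exists: Callable[[str], bool]) -> List[str]:
    """Keep only the dates for which chart_exists(date) is true.

    chart_exists is a hypothetical predicate supplied by the caller; it is
    not part of the original notebook.
    """
    return [d for d in dates if chart_exists(d)]


def is_saturday(d: str) -> bool:
    # Offline stand-in for a real check (assumption): Billboard dates its
    # charts by Saturday, so we only test the weekday. Monday is 0, Saturday is 5.
    return datetime.strptime(d, "%Y-%m-%d").weekday() == 5


candidates = ["2000-01-01", "2000-01-05", "2000-01-08"]
kept = valid_dates(candidates, is_saturday)
print(kept)
```

Passing the check in as a function keeps the filtering logic separate from the network request, so the same function could be reused with a predicate that actually fetches the chart page.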

Doing this was extremely time-consuming. Generating the song data (metadata + lyrics) for one week takes only about 1.5 minutes, but because we wanted to cover every weekly chart for 22 years, the total time needed to collect the data is about **28.6 hours**! To finish within a reasonable amount of time, we split the work among 21 notebooks, ran them simultaneously, and concatenated their outputs in this notebook (the resulting file is too big to push to GitHub). We demo these functions with only one week below.
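
The time estimate can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch that uses the same approximations as the text (52 charts per year, 1.5 minutes per chart); the per-notebook figure additionally assumes the work is split evenly, which is our simplification.

```python
YEARS = 22               # 2000 through 2021 inclusive
WEEKS_PER_YEAR = 52      # approximation used in the text
SONGS_PER_CHART = 100
MINUTES_PER_CHART = 1.5  # approximate scrape time for one weekly chart
NOTEBOOKS = 21           # number of notebooks run in parallel

charts = YEARS * WEEKS_PER_YEAR
entries = charts * SONGS_PER_CHART
total_hours = charts * MINUTES_PER_CHART / 60

# Assumes an even split across notebooks (a simplification).
hours_per_notebook = total_hours / NOTEBOOKS

print(f"{charts} charts, {entries:,} entries")
print(f"{total_hours:.1f} hours sequentially, about {hours_per_notebook:.1f} hours per notebook")
```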

In [3]:
def song_data(date=''):
    '''
    The Billboard Hot 100 chart represents the Hot 100 songs for that week.
    
    date: a string, in the form "YYYY-MM-DD". For example, "2022-05-16" represents May 16, 2022. If no date specified, function
          will select today's chart.
    returns: a pandas dataframe containing metadata for Billboard Hot 100 songs of the week of the specified date.
    columns: rank: rank of the week (1-100)
             date: a pandas datetime object. date of the chart as stated on the Billboard website, 
             which uses the Saturday to identify the week (so it is the same week as the user input, but the Saturday
             of that week),
             title: title of the song,
             artist1: main artist,
             artist2: a list of the rest of the artists. np.nan if there are none.
             peak_pos: peak position of the song,
             wks_chart: # of weeks the song has been on the chart
             b_url: url to the billboard chart
    '''
    lsongs=[]
    lartists=[]
    artist1=[]
    artist2=[]
    lpeak_pos=[]
    lwks_chart=[]
    
    URL='https://www.billboard.com/charts/hot-100/'+date

    page=requests.get(URL)
    soup=BeautifulSoup(page.content, 'lxml')
  
    ### get the first song, bc it's in a different div container :(
    song1 = soup.find("h3",id='title-of-a-story', class_="c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 u-font-size-23@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-245 u-max-width-230@tablet-only u-letter-spacing-0028@tablet")
    lsongs.append(song1.text.strip())
    
    ### get the first artist, bc it's in a different div container :(
    artistf=soup.find("span", class_="c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only u-font-size-20@tablet")
    lartists.append(artistf.text.strip())
    
    ### get the first peak position, bc it's in a different div container :(
    nums=soup.find_all('span', class_="c-label a-font-primary-bold-l a-font-primary-m@mobile-max u-font-weight-normal@mobile-max lrv-u-padding-tb-050@mobile-max u-font-size-32@tablet")
    nums1=[]
    for x in nums:
        nums1.append(x.text.strip())
        
    lpeak_pos.append(nums1[1])
    ### get the first weeks on chart, bc it's in a different div container
    lwks_chart.append(nums1[2])
    
    ### get last 99 songs
    songs = soup.find_all("h3", class_="c-title a-no-trucate a-font-primary-bold-s u-letter-spacing-0021 lrv-u-font-size-18@tablet lrv-u-font-size-16 u-line-height-125 u-line-height-normal@mobile-max a-truncate-ellipsis u-max-width-330 u-max-width-230@tablet-only", id="title-of-a-story")
    for song in songs:
        lsongs.append(song.text.strip())
    
    ### get last 99 artists
    artists = soup.find_all("span", class_="c-label a-no-trucate a-font-primary-s lrv-u-font-size-14@mobile-max u-line-height-normal@mobile-max u-letter-spacing-0021 lrv-u-display-block a-truncate-ellipsis-2line u-max-width-330 u-max-width-230@tablet-only")
    for artist in artists:
        lartists.append(artist.text.strip())
        
    ### get last 99 peak position
    all_num=[]
    peak_pos = soup.find_all("span", class_="c-label a-font-primary-m lrv-u-padding-tb-050@mobile-max")
    for num in peak_pos:
        all_num.append(num.text.strip())
    
    x=1
    for peak in all_num:
        if x <= len(all_num)-5:
            lpeak_pos.append(all_num[x])
            x=x+6

    ### get last 99 weeks on chart
    y=2
    for wk in all_num:
        if y <= len(all_num)-4:
            lwks_chart.append(all_num[y])
            y=y+6            
    
    ### get date as listed on the chart, aka the Saturday of the week of the user input
    date=soup.find('h2', id="section-heading")
    cdate=pd.to_datetime(date.text.strip().replace("Week of ",''))
    
    
    ### separate artists into artist1 and artist2
    for a in lartists:
        if ("X &" not in a) and ("X Featuring" not in a) and ("X /" not in a):
            a=a.replace(" X ",",")
        a=a.replace("Featuring",",")
        a=a.replace("&",",")   
        a=a.replace(" / ",",")
        List=a.split(",")
        artists = [i.strip() for i in List]
        artist1.append(artists[0])
        if len(artists)==1:
            artist2.append(np.nan)
        else:
            artist2.append(artists[1:])
    
    metadata=pd.DataFrame()
    metadata['rank']=(range(1,101)) ### get rank position
    metadata['date']=cdate
    metadata['title']=lsongs
    metadata['artist1']=artist1
    metadata['artist2']=artist2
    metadata['peak_pos']=lpeak_pos
    metadata['wks_chart']=lwks_chart
    metadata['b_url']=URL
    
    metadata=append_lyrics(metadata)
    metadata.reset_index(inplace=True, drop=True)

    return metadata
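
The artist-splitting step inside `song_data()` is easier to inspect in isolation. Below is a minimal standalone sketch of the same string logic; the name `split_artists` is hypothetical, and it returns `None` instead of `np.nan` when there are no additional artists so that it needs no imports. The guard around `" X "` presumably protects artist names that end in "X".

```python
def split_artists(credit):
    """Split a Billboard artist credit into (main artist, other artists).

    Mirrors the replace-and-split logic in song_data(); returns None for the
    second element when there are no additional artists.
    """
    # Treat " X " as a separator only when the X is not directly followed by
    # another separator, as in the original condition.
    if "X &" not in credit and "X Featuring" not in credit and "X /" not in credit:
        credit = credit.replace(" X ", ",")
    for separator in ("Featuring", "&", " / "):
        credit = credit.replace(separator, ",")
    names = [name.strip() for name in credit.split(",")]
    return names[0], (names[1:] or None)


print(split_artists("Future Featuring Drake & Tems"))
print(split_artists("Lil Nas X Featuring Billy Ray Cyrus"))
print(split_artists("Tyler, The Creator"))  # a comma inside a name is split too
```

The last call shows one limitation of this approach: any comma in the credit is treated as a separator, so an artist name that itself contains a comma is broken into two.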

In [4]:
def append_lyrics(metadata):
    '''
    a helper function for song_data(). gets the song lyrics for a given song. appends the
    song lyrics for a song from Genius.com to a "lyrics" column.
    
    If the function cannot find the song on the Genius lyrics website, it will drop the entire observation from the dataset.
    
    metadata: a pandas dataframe, created from song_data(). at the least contains
              the title column and the artist1 column.
    returns: a pandas dataframe of the original dataframe with a lyrics column and URL to the
             Genius website the lyrics were taken from.
    '''
    all_lyrics=[]
    all_URL=[]
    title=metadata.title.values
    artist1=metadata.artist1.values
    for x in range(len(title)):
        
        #construct URL

        t=title[x]
        a=artist1[x]
  
        #clean punctuation/formatting of artist and song title for URL
        t=re.sub(r'[^\w\s]', '', t)
        a=re.sub(r'[!$/]', '-', a)
        a=re.sub(r'["\\#%&;\()*\[\]+,.:;<=>?@^_`{|}~]', '', a) #[\\] 
        
        #concat
        URL= "https://www.genius.com/"+a.replace(' ','-')+'-'+t.replace(' ','-')+'-lyrics'
        URL=URL.replace('--','-')
        
        #webscrape
        page=requests.get(URL)
        soup=BeautifulSoup(page.content, 'lxml')
        
        #check if URL works
        if 'Oops! Page not found' not in soup.text.strip():
            lyrics=soup.find_all('div', class_='Lyrics__Container-sc-1ynbvzw-6 jYfhrf')
            Lyrics = [re.sub(r"\[.*?\]",'',i.text.strip()) for i in lyrics]
            LYRICS=" ".join(Lyrics)
            all_lyrics.append(LYRICS)
            all_URL.append(URL)
        
        #drop song from dataset if website does not exist
        else: 
            #print(URL)
            metadata.drop([x], inplace=True)
    
    metadata['lyrics']=all_lyrics
    metadata['g_url']=all_URL
    return metadata

#### A. Code Demo: This Week's Billboard Hot 100 Songs (Metadata + Lyrics)

We demonstrate our function below using this week's chart, which requires no function parameter. As seen below, out of today's Hot 100 songs, we collected the data for 94. This is because our function cannot account for extreme edge cases (such as songs with a large number of artists and artists with unusual punctuation in their names), and because the links for certain songs on Genius.com are named inconsistently.
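
To see why some links fail, it helps to look at the URL construction on its own. The sketch below repeats the cleaning rules from `append_lyrics()` in a standalone function; the name `genius_url` is hypothetical and no request is made, so it only shows which link would be tried, not whether that page exists.

```python
import re


def genius_url(artist, title):
    """Build the Genius.com lyrics URL that append_lyrics() would request."""
    # Titles: drop everything that is not a word character or whitespace.
    title = re.sub(r"[^\w\s]", "", title)
    # Artists: some punctuation becomes a hyphen, the rest is removed.
    artist = re.sub(r"[!$/]", "-", artist)
    artist = re.sub(r'["\\#%&;()*\[\]+,.:<=>?@^_`{|}~]', "", artist)
    url = (
        "https://www.genius.com/"
        + artist.replace(" ", "-")
        + "-"
        + title.replace(" ", "-")
        + "-lyrics"
    )
    # Collapse double hyphens left behind by removed characters.
    return url.replace("--", "-")


print(genius_url("Jack Harlow", "First Class"))
print(genius_url("Jason Mraz", "I'm Yours"))
print(genius_url("Mya", "Case Of The Ex (Whatcha Gonna Do)"))
```

Any song whose Genius page does not follow this `Artist-Title-lyrics` pattern produces a link that does not resolve, and `append_lyrics()` then drops that song.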

In [5]:
%%time
demo=song_data()

CPU times: total: 22.9 s
Wall time: 2min 15s


In [6]:
demo

Unnamed: 0,rank,date,title,artist1,artist2,peak_pos,wks_chart,b_url,lyrics,g_url
0,1,2022-05-21,First Class,Jack Harlow,,1,5,https://www.billboard.com/charts/hot-100/,"MmI been a (G), throw up the (L), sex in the (...",https://www.genius.com/Jack-Harlow-First-Class...
1,2,2022-05-21,As It Was,Harry Styles,,1,6,https://www.billboard.com/charts/hot-100/,"Come on, Harry, we wanna say goodnight to youH...",https://www.genius.com/Harry-Styles-As-It-Was-...
2,3,2022-05-21,Wait For U,Future,"[Drake, Tems]",1,2,https://www.billboard.com/charts/hot-100/,"I will wait for you, for youEarly in the morni...",https://www.genius.com/Future-Wait-For-U-lyrics
3,4,2022-05-21,Moscow Mule,Bad Bunny,,4,1,https://www.billboard.com/charts/hot-100/,"Si yo no te escribo, tú no me escribe', eySi t...",https://www.genius.com/Bad-Bunny-Moscow-Mule-l...
4,5,2022-05-21,Titi Me Pregunto,Bad Bunny,,5,1,https://www.billboard.com/charts/hot-100/,"EyTití me preguntó si tengo mucha' novia', muc...",https://www.genius.com/Bad-Bunny-Titi-Me-Pregu...
...,...,...,...,...,...,...,...,...,...,...
89,96,2022-05-21,Frozen,Lil Baby,,54,2,https://www.billboard.com/charts/hot-100/,Can someone come unthaw my heart? I think it's...,https://www.genius.com/Lil-Baby-Frozen-lyrics
90,97,2022-05-21,Young Harleezy,Jack Harlow,,97,1,https://www.billboard.com/charts/hot-100/,"Young Harleezy, y'all grew up shooting RPG'sI ...",https://www.genius.com/Jack-Harlow-Young-Harle...
91,98,2022-05-21,23,Sam Hunt,,50,20,https://www.billboard.com/charts/hot-100/,La-la-la-la-laYou can marry an architectBuild ...,https://www.genius.com/Sam-Hunt-23-lyrics
92,99,2022-05-21,Poison,Jack Harlow,[Lil Wayne],99,1,https://www.billboard.com/charts/hot-100/,Can you decide?Can you decide?OhNew musicListe...,https://www.genius.com/Jack-Harlow-Poison-lyrics


In 21 separate notebooks, we used pandas's `date_range()` function to generate a list of dates (all the Saturdays in a year, since that is the day Billboard uses to date its charts) and passed each date to the function we created. We then concatenated the resulting dataframes into one dataframe per year and saved each of these to a csv file. The following code blocks are copied and pasted from the notebook used to generate the data for the year 2000.

In [None]:
## DON'T RUN ##

dates=pd.date_range(start='2000-01-01',end='2000-12-31',freq='W-SAT') # the charts are every Saturday
print(len(dates))
print(dates)

In [None]:
## DON'T RUN ##
Dates=[date.strftime('%Y-%m-%d') for date in dates]

In [None]:
## DON'T RUN ##

songs=[]
for date in Dates:
    songs.append(song_data(date))
    #print(date) 

songs2000 = pd.concat(songs)
songs2000.to_csv('songs2000.csv')

### II. Metadata + Dataset Information 

link to dataset: 

- `rank` rank of the song on the chart for that date
- `date` date of the week for the chart
- `title` title of the song
- `artist1` artist of the song
- `artist2` a list of the rest of the featured/additional artists on the song
- `peak_pos` peak position of the song on the chart (up until the observed date)
- `wks_chart` number of weeks the song has been on the chart (up until the observed date)
- `b_url` link to Billboard chart
- `lyrics` **lyrics  of the song (a string), the text data we will be working with**
- `g_url` link to Genius lyrics website
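
One practical note on `artist2`: a list written to a csv file comes back as its string representation (for example `"['Rob Thomas']"`) when the file is read again. The helper below is a minimal sketch of how such a cell could be turned back into a list; the name `parse_artist2` is hypothetical and not part of our pipeline.

```python
import ast
import math


def parse_artist2(cell):
    """Convert a stored artist2 cell back into a list of artist names.

    Missing values (None or NaN) become an empty list.
    """
    if cell is None or (isinstance(cell, float) and math.isnan(cell)):
        return []
    # literal_eval safely evaluates the string form of a Python list.
    return ast.literal_eval(cell)


print(parse_artist2("['Rob Thomas']"))
print(parse_artist2(float("nan")))
```

With the dataframe loaded, this could be applied column-wise as `songs['artist2'].apply(parse_artist2)`.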

In [77]:
songs=pd.read_csv('songs.csv')
songs=songs.iloc[:,1:]

In [78]:
songs

Unnamed: 0,rank,date,title,artist1,artist2,peak_pos,wks_chart,b_url,lyrics,g_url
0,1,2000-01-01,Smooth,Santana,['Rob Thomas'],1,23,https://www.billboard.com/charts/hot-100/2000-...,"Man, it's a hot oneLike seven inches from the ...",https://www.genius.com/Santana-Smooth-lyrics
1,2,2000-01-01,Back At One,Brian McKnight,,2,19,https://www.billboard.com/charts/hot-100/2000-...,It's undeniableThat we should be togetherIt's ...,https://www.genius.com/Brian-McKnight-Back-At-...
2,3,2000-01-01,I Wanna Love You Forever,Jessica Simpson,,3,12,https://www.billboard.com/charts/hot-100/2000-...,"Ooh, ooh, mmmYou set my soul at easeChased dar...",https://www.genius.com/Jessica-Simpson-I-Wanna...
3,4,2000-01-01,My Love Is Your Love,Whitney Houston,,4,18,https://www.billboard.com/charts/hot-100/2000-...,"Clap your hands, y'allIt's alright (Turn me up...",https://www.genius.com/Whitney-Houston-My-Love...
4,5,2000-01-01,I Knew I Loved You,Savage Garden,,4,11,https://www.billboard.com/charts/hot-100/2000-...,"MmmOoh, ohMaybe it's intuitionBut some things ...",https://www.genius.com/Savage-Garden-I-Knew-I-...
...,...,...,...,...,...,...,...,...,...,...
110212,95,2021-12-25,Freedom Was A Highway,Jimmie Allen,['Brad Paisley'],76,10,https://www.billboard.com/charts/hot-100/2021-...,"(Oh-oh, oh-oh, woo)(Oh-oh, oh-oh)Sunset throug...",https://www.genius.com/Jimmie-Allen-Freedom-Wa...
110213,96,2021-12-25,No Love,Summer Walker,['SZA'],13,6,https://www.billboard.com/charts/hot-100/2021-...,"Oh, ooh woahOh-oh, yeahYeah, yeah, yeahIf I ha...",https://www.genius.com/Summer-Walker-No-Love-l...
110214,97,2021-12-25,Bad Man (Smooth Criminal),Polo G,,49,5,https://www.billboard.com/charts/hot-100/2021-...,"Lil Capalot, bitch, haSmooth criminal, Mike Ja...",https://www.genius.com/Polo-G-Bad-Man-Smooth-C...
110215,98,2021-12-25,Feel Alone,Juice WRLD,,98,1,https://www.billboard.com/charts/hot-100/2021-...,"Smokin' this dope, relaxin'I ain't gon' lie, b...",https://www.genius.com/Juice-WRLD-Feel-Alone-l...


In [32]:
print('Number of unique songs on the charts in 2000-2021:', len(songs.g_url.unique()))

Number of unique songs on the charts in 2000-2021: 8938


In [19]:
print('Fraction of unique song titles in 2000-2021:',len(songs.title.unique())/len(songs))

Fraction of unique song titles in 2000-2021: 0.07244798896721921


In [64]:
null=songs[songs['lyrics'].isnull()]
null_titles=list(null.title.unique())

In [67]:
null_titles[0]

'Independent Women Part I'

In [73]:
songs.to_csv('songs.csv')

In [71]:
songs[songs['title']==null_titles[3]]

Unnamed: 0,rank,date,title,artist1,artist2,peak_pos,wks_chart,b_url,lyrics,g_url
2973,72,2000-08-19,Case Of The Ex (Whatcha Gonna Do),Mya,,72,1,https://www.billboard.com/charts/hot-100/2000-...,"Yeah, MýaRedZone, what, what?UhIt's after midn...",https://www.genius.com/Mya-Case-Of-The-Ex-What...
3045,57,2000-08-26,Case Of The Ex (Whatcha Gonna Do),Mya,,57,2,https://www.billboard.com/charts/hot-100/2000-...,"Yeah, MýaRedZone, what, what?UhIt's after midn...",https://www.genius.com/Mya-Case-Of-The-Ex-What...
3127,52,2000-09-02,Case Of The Ex (Whatcha Gonna Do),Mya,,52,3,https://www.billboard.com/charts/hot-100/2000-...,"Yeah, MýaRedZone, what, what?UhIt's after midn...",https://www.genius.com/Mya-Case-Of-The-Ex-What...
3208,47,2000-09-09,Case Of The Ex (Whatcha Gonna Do),Mya,,47,4,https://www.billboard.com/charts/hot-100/2000-...,"Yeah, MýaRedZone, what, what?UhIt's after midn...",https://www.genius.com/Mya-Case-Of-The-Ex-What...
3292,42,2000-09-16,Case Of The Ex (Whatcha Gonna Do),Mya,,42,5,https://www.billboard.com/charts/hot-100/2000-...,"Yeah, MýaRedZone, what, what?UhIt's after midn...",https://www.genius.com/Mya-Case-Of-The-Ex-What...
3369,31,2000-09-23,Case Of The Ex (Whatcha Gonna Do),Mya,,31,6,https://www.billboard.com/charts/hot-100/2000-...,"Yeah, MýaRedZone, what, what?UhIt's after midn...",https://www.genius.com/Mya-Case-Of-The-Ex-What...
3449,24,2000-09-30,Case Of The Ex (Whatcha Gonna Do),Mya,,24,7,https://www.billboard.com/charts/hot-100/2000-...,"Yeah, MýaRedZone, what, what?UhIt's after midn...",https://www.genius.com/Mya-Case-Of-The-Ex-What...
3531,18,2000-10-07,Case Of The Ex (Whatcha Gonna Do),Mya,,18,8,https://www.billboard.com/charts/hot-100/2000-...,"Yeah, MýaRedZone, what, what?UhIt's after midn...",https://www.genius.com/Mya-Case-Of-The-Ex-What...
3618,17,2000-10-14,Case Of The Ex (Whatcha Gonna Do),Mya,,17,9,https://www.billboard.com/charts/hot-100/2000-...,"Yeah, MýaRedZone, what, what?UhIt's after midn...",https://www.genius.com/Mya-Case-Of-The-Ex-What...
3703,17,2000-10-21,Case Of The Ex (Whatcha Gonna Do),Mya,,17,10,https://www.billboard.com/charts/hot-100/2000-...,"Yeah, MýaRedZone, what, what?UhIt's after midn...",https://www.genius.com/Mya-Case-Of-The-Ex-What...


In [36]:
songs[songs['title']=='Blinding Lights']

Unnamed: 0,rank,date,title,artist1,artist2,peak_pos,wks_chart,b_url,lyrics,g_url
100108,11,2019-12-14,Blinding Lights,The Weeknd,,11,1,https://www.billboard.com/charts/hot-100/2019-...,,https://www.genius.com/The-Weeknd-Blinding-Lig...
100249,52,2019-12-21,Blinding Lights,The Weeknd,,11,2,https://www.billboard.com/charts/hot-100/2019-...,,https://www.genius.com/The-Weeknd-Blinding-Lig...
100360,63,2019-12-28,Blinding Lights,The Weeknd,,11,3,https://www.billboard.com/charts/hot-100/2019-...,,https://www.genius.com/The-Weeknd-Blinding-Lig...
100464,72,2020-01-04,Blinding Lights,The Weeknd,,11,4,https://www.billboard.com/charts/hot-100/2020-...,YeahI've been tryna callI've been on my own fo...,https://www.genius.com/The-Weeknd-Blinding-Lig...
100544,59,2020-01-11,Blinding Lights,The Weeknd,,11,5,https://www.billboard.com/charts/hot-100/2020-...,YeahI've been tryna callI've been on my own fo...,https://www.genius.com/The-Weeknd-Blinding-Lig...
...,...,...,...,...,...,...,...,...,...,...
108352,17,2021-08-07,Blinding Lights,The Weeknd,,1,86,https://www.billboard.com/charts/hot-100/2021-...,YeahI've been tryna callI've been on my own fo...,https://www.genius.com/The-Weeknd-Blinding-Lig...
108435,16,2021-08-14,Blinding Lights,The Weeknd,,1,87,https://www.billboard.com/charts/hot-100/2021-...,YeahI've been tryna callI've been on my own fo...,https://www.genius.com/The-Weeknd-Blinding-Lig...
108523,18,2021-08-21,Blinding Lights,The Weeknd,,1,88,https://www.billboard.com/charts/hot-100/2021-...,YeahI've been tryna callI've been on my own fo...,https://www.genius.com/The-Weeknd-Blinding-Lig...
108612,21,2021-08-28,Blinding Lights,The Weeknd,,1,89,https://www.billboard.com/charts/hot-100/2021-...,YeahI've been tryna callI've been on my own fo...,https://www.genius.com/The-Weeknd-Blinding-Lig...


In [35]:
songs.g_url.value_counts()

https://www.genius.com/The-Weeknd-Blinding-Lights-lyrics     90
https://www.genius.com/Imagine-Dragons-Radioactive-lyrics    87
https://www.genius.com/AWOLNATION-Sail-lyrics                79
https://www.genius.com/Jason-Mraz-Im-Yours-lyrics            75
https://www.genius.com/OneRepublic-Counting-Stars-lyrics     68
                                                             ..
https://www.genius.com/Future-Please-Tell-Me-lyrics           1
https://www.genius.com/Avicii-Heaven-lyrics                   1
https://www.genius.com/DJ-Drama-So-Many-Girls-lyrics          1
https://www.genius.com/The-Roots-Break-You-Off-lyrics         1
https://www.genius.com/Juice-WRLD-Feel-Alone-lyrics           1
Name: g_url, Length: 8938, dtype: int64

In [33]:
songs.lyrics.value_counts()

Well you done done me in; you bet I felt itI tried to be chill but you're so hot that I meltedI fell right through the cracksNow I'm trying to get backBefore the cool done run outI'll be giving it my bestestNothing's going to stop me but divine interventionI reckon it's again my turn to win some or learn someBut I won't hesitate no more, no moreIt cannot wait, I'm yoursWell open up your mind and see like meOpen up your plans and damn you're freeLook into your heart and you'll findLove, love, love, loveListen to the music of the moment, people dance and singWe're just one big familyAnd it's our God-forsaken right to beLoved, love, love, love, lovedSo I won't hesitate no more, no moreIt cannot wait, I'm sureThere's no need to complicateOur time is shortThis is our fate, I'm yoursDo you want to come on scootch on over closer, dearAnd I will nibble your earI've been spending way too long checking my tongue in the mirrorAnd bending over backwards just to try to see it clearerBut my breath f

## 3. Results

## 4. Discussion and Conclusions

## 5. Resources Consulted

##### 1. Dataset Creation
Removing text in parentheses in a string (modified for brackets): https://stackoverflow.com/questions/640001/how-can-i-remove-text-within-parentheses-with-a-regex <br>
Python regex `re.sub()` tutorial: https://www.pythontutorial.net/python-regex/python-regex-sub/ <br>
pandas `date_range()` documentation: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html

## 6. Bibliography