# Preliminary Data - Getting the list of songs

### Data Description
In this notebook, I gather the list of songs that appeared on the Billboard Hot 100 Year-End charts from 2010 to 2020. I use BeautifulSoup to scrape each chart and construct a pandas DataFrame with the following features:
- Title
- Artists
- Year
- Rank

The idea is to get a sample of songs that we can potentially use for the Erdos Bootcamp. Let's get started!

### Getting the required Packages ready

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from seaborn import set_style
set_style("whitegrid")

from bs4 import BeautifulSoup
from urllib.request import urlopen, Request
import requests

### Trial - Testing out for year 2010
In this subsection, I first use BeautifulSoup to load the HTML content of the Billboard Hot 100 Year-End chart page for 2010 (https://www.billboard.com/charts/year-end/2010/hot-100-songs). I then construct a trial dataset containing the aforementioned features.

In [None]:
url = 'https://www.billboard.com/charts/year-end/2010/hot-100-songs'
header = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
req10 = Request(url, headers = header)
html10 = urlopen(req10)
soup10 = BeautifulSoup(html10, 'html.parser')
print(soup10.prettify())

<!DOCTYPE doctype html>
<html class="" lang="">
 <head>
  <meta charset="utf-8"/>
  <meta content="IE=edge" http-equiv="X-UA-Compatible"/>
  <meta content="width=device-width, initial-scale=1, user-scalable=no" name="viewport"/>
  <title>
   Hot 100 Songs - Year-End | Billboard
  </title>
  <meta content="Hot 100 Songs - Year-End" name="title" property="title">
   <meta content="See Billboard's rankings of this year's most popular songs, albums, and artists." name="description" property="description">
    <meta content="https://www.billboard.com/assets/1621277882/images/ye-charts/charts-ye-share-fb.jpg?e8c1b95317c641337b87" name="og:image" property="og:image">
     <meta content="https://www.billboard.com/assets/1621277882/images/ye-charts/charts-ye-share-twitter.jpg?e8c1b95317c641337b87" name="twitter:image" property="twitter:image"/>
     <meta content="@billboard" name="twitter:site"/>
     <meta content="Billboard" property="og:site_name">
      <meta content="article" property="og

Based on the HTML code, the relevant data is wrapped in the following elements:
- Title: `<div class="ye-chart-item__title"></div>`
- Artists: `<div class="ye-chart-item__artist"></div>`
- Rank: `<div class="ye-chart-item__rank"></div>`

So we can use the `find_all` method to build a list for each feature.

In [None]:
# Quick check: the second rank element on the page should read 2
print(soup10.find_all('div', {'class':'ye-chart-item__rank'})[1].text)
title10 = [title.text.replace("\n","") for title in soup10.find_all('div', {'class':'ye-chart-item__title'})]
artist10 = [artist.text.replace("\n","") for artist in soup10.find_all('div', {'class':'ye-chart-item__artist'})]
rank10 = [int(rank.text.replace("\n","")) for rank in soup10.find_all('div', {'class':'ye-chart-item__rank'})]
# One year entry per scraped rank, so the columns always have equal length
year10 = [2010] * len(rank10)


2
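The extraction pattern above can also be tried offline. The following is a minimal sketch on a hand-written HTML fragment: the markup is a made-up stand-in that only mimics the three class names listed earlier, not the real Billboard page, and `extract` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics the class names seen on the chart page
sample_html = """
<div class="ye-chart-item__rank">
1
</div>
<div class="ye-chart-item__title">
TiK ToK
</div>
<div class="ye-chart-item__artist">
Ke$ha
</div>
<div class="ye-chart-item__rank">
2
</div>
<div class="ye-chart-item__title">
Need You Now
</div>
<div class="ye-chart-item__artist">
Lady Antebellum
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

def extract(soup, css_class):
    """Return the stripped text of every <div> that carries css_class."""
    return [div.get_text(strip=True) for div in soup.find_all("div", {"class": css_class})]

sample_titles = extract(sample_soup, "ye-chart-item__title")
sample_artists = extract(sample_soup, "ye-chart-item__artist")
sample_ranks = [int(r) for r in extract(sample_soup, "ye-chart-item__rank")]
print(sample_titles, sample_artists, sample_ranks)
```

Using `get_text(strip=True)` here is a small design choice: it removes the surrounding newlines without also deleting any newline that might appear inside a title.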



In [None]:
hot100_2010 = pd.DataFrame({'title':title10, 'artists':artist10, 'rank':rank10, 'year':year10})
hot100_2010.head(100)

Unnamed: 0,title,artists,rank,year
0,TiK ToK,Ke$ha,1,2010
1,Need You Now,Lady Antebellum,2,2010
2,"Hey, Soul Sister",Train,3,2010
3,California Gurls,Katy Perry Featuring Snoop Dogg,4,2010
4,OMG,Usher Featuring will.i.am,5,2010
...,...,...,...,...
95,Life After You,Daughtry,96,2010
96,Smile,Uncle Kracker,97,2010
97,Teach Me How To Dougie,Cali Swag District,98,2010
98,Try Sleeping With A Broken Heart,Alicia Keys,99,2010


Looks like Ke\$ha topped the 2010 chart with "TiK ToK", and Jerrod Niemann came in last with "Lover, Lover". Now let's construct the entire dataset from 2010 to 2020.

### Expanding to all years

In [None]:
# Empty lists to collect each feature across all years
titles = []
artists = []
ranks = []
years = []

# The request header does not change, so define it once outside the loop
header = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}

# Looping over the years 2010 through 2020
for i in range(2010, 2021):
    # Getting the HTML for each year
    url = 'https://www.billboard.com/charts/year-end/' + str(i) + '/hot-100-songs'
    reqi = Request(url, headers = header)
    htmli = urlopen(reqi)
    soupi = BeautifulSoup(htmli, 'html.parser')
    
    # Constructing lists for year i
    titlei = [t.text.replace("\n","") for t in soupi.find_all('div', {'class':'ye-chart-item__title'})]
    artisti = [a.text.replace("\n","") for a in soupi.find_all('div', {'class':'ye-chart-item__artist'})]
    ranki = [int(r.text.replace("\n","")) for r in soupi.find_all('div', {'class':'ye-chart-item__rank'})]
    yeari = [i] * len(ranki)
    
    print("For the year", i, "Lengths are", len(titlei), len(artisti), len(ranki), len(yeari))
    # Appending to master list
    titles.extend(titlei)
    artists.extend(artisti)
    ranks.extend(ranki)
    years.extend(yeari)

For the year 2010 Lengths are 100 100 100 100
For the year 2011 Lengths are 99 99 99 99
For the year 2012 Lengths are 100 100 100 100
For the year 2013 Lengths are 100 100 100 100
For the year 2014 Lengths are 100 100 100 100
For the year 2015 Lengths are 100 100 100 100
For the year 2016 Lengths are 99 99 99 99
For the year 2017 Lengths are 100 100 100 100
For the year 2018 Lengths are 100 100 100 100
For the year 2019 Lengths are 100 100 100 100
For the year 2020 Lengths are 100 100 100 100
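The counts above show 99 entries for 2011 and 2016 instead of 100. One possible way to see which ranks are absent is sketched below on a toy DataFrame; the toy data is made up for illustration, and `missing_ranks` is a hypothetical helper, not part of the original workflow.

```python
import pandas as pd

# Toy stand-in for the scraped data: 2011 lacks rank 2 (made-up example)
toy = pd.DataFrame({
    "rank": [1, 2, 3, 1, 3],
    "year": [2010, 2010, 2010, 2011, 2011],
})

def missing_ranks(df, expected):
    """Map each year to the sorted ranks from `expected` that it lacks."""
    gaps = {}
    for year, group in df.groupby("year"):
        absent = sorted(set(expected) - set(group["rank"]))
        if absent:
            gaps[int(year)] = absent
    return gaps

toy_gaps = missing_ranks(toy, range(1, 4))
print(toy_gaps)
```

On the full dataset, the equivalent call would be `missing_ranks(hot100, range(1, 101))` once `hot100` has been built.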


In [None]:
print(len(titles), len(artists), len(ranks), len(years))
hot100 = pd.DataFrame({'title':titles, 'artists':artists, 'rank':ranks, 'year':years})
hot100.head(1100)

1098 1098 1098 1098


Unnamed: 0,title,artists,rank,year
0,TiK ToK,Ke$ha,1,2010
1,Need You Now,Lady Antebellum,2,2010
2,"Hey, Soul Sister",Train,3,2010
3,California Gurls,Katy Perry Featuring Snoop Dogg,4,2010
4,OMG,Usher Featuring will.i.am,5,2010
...,...,...,...,...
1093,More Than My Hometown,Morgan Wallen,96,2020
1094,Lovin' On You,Luke Combs,97,2020
1095,Said Sum,Moneybagg Yo,98,2020
1096,Slide,H.E.R. Featuring YG,99,2020


In [None]:
hot100.to_csv(r'hot_100.csv', index=False)

In [None]:
hot100['artists'].nunique()

647

## Trying to scrape lyrics off of Genius

This notebook is also a convenient place to work out how to scrape lyrics from Genius. I am going to piggyback off of Maaz Khan's post "How to Leverage Spotify API + Genius Lyrics for Data Science Tasks in Python" (https://medium.com/swlh/how-to-leverage-spotify-api-genius-lyrics-for-data-science-tasks-in-python-c36cdfb55cf3) to get lyrics for the songs from 2020. If this works, the approach can possibly be expanded to the entire dataset.

In [None]:
# First getting the data for 2020
hot100_2020 = hot100[hot100['year'] == 2020].copy()
# Resetting the index
hot100_2020.reset_index(drop = True, inplace=True)

hot100_2020.head()

Unnamed: 0,title,artists,rank,year
0,Blinding Lights,The Weeknd,1,2020
1,Circles,Post Malone,2,2020
2,The Box,Roddy Ricch,3,2020
3,Don't Start Now,Dua Lipa,4,2020
4,Rockstar,DaBaby Featuring Roddy Ricch,5,2020


The first thing we need to do is keep only the first artist's name in the dataset. To do this, we keep only the substring before "Featuring".

In [None]:
# Keep only the text before " Featuring"; names without it are left unchanged
hot100_2020['main_artist'] = hot100_2020['artists'].str.split(' Featuring').str[0]

In [None]:
hot100_2020.to_csv(r'hot_100_2020.csv', index=False)
hot100_2020.head()

Unnamed: 0,title,artists,rank,year,main_artist
0,Blinding Lights,The Weeknd,1,2020,The Weeknd
1,Circles,Post Malone,2,2020,Post Malone
2,The Box,Roddy Ricch,3,2020,Roddy Ricch
3,Don't Start Now,Dua Lipa,4,2020,Dua Lipa
4,Rockstar,DaBaby Featuring Roddy Ricch,5,2020,DaBaby


Now the main thing to do is to write a function that automates the process of getting song lyrics. To do this, I adapt Maaz's code for the `get_lyrics` function, which takes in the title and artist names.
**Note: I need to make sure that there is no punctuation in the artist or title names. I will do this later.**
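As a starting point for the punctuation clean-up mentioned in the note, here is a minimal sketch of a helper that drops punctuation and joins words with hyphens. The name `to_slug` is hypothetical, and the sketch assumes Genius URLs drop punctuation rather than encode it, which has not been verified here.

```python
import re

def to_slug(text):
    """Hypothetical helper: drop punctuation, then join words with hyphens."""
    # Keep only ASCII letters, digits, whitespace, and hyphens
    cleaned = re.sub(r"[^A-Za-z0-9\s-]", "", text)
    # Collapse runs of whitespace or hyphens into a single hyphen
    return re.sub(r"[\s-]+", "-", cleaned.strip())

print(to_slug("Don't Start Now"))   # Dont-Start-Now
print(to_slug("Hey, Soul Sister"))  # Hey-Soul-Sister
```

One limitation of this sketch is that it also removes accented letters, so names containing them would need separate handling.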

In [None]:
def get_lyrics(title, artist):
    # Replace spaces with hyphens to build the URL slug
    title_name = title.replace(' ', '-')
    artist_name = artist.replace(' ', '-')

    url = 'https://genius.com/' + artist_name + '-' + title_name + '-lyrics'
    header = {'user-agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
    req = Request(url, headers = header)
    html = urlopen(req)
    soup = BeautifulSoup(html, 'html.parser')

    # lyrics_div = soup.find('div', {'class':'song_body-lyrics'})
    lyrics_div = soup.find('div', {'class':'SongPageGrid-sc-1vi6xda-0 DGVcp Lyrics__Root-sc-1ynbvzw-0 kkHBOZ'})
    # Return None when the lyrics container is not found on the page
    if lyrics_div is None:
        return None
    return lyrics_div.get_text()

In [None]:
print(get_lyrics('Blinding Lights', 'The Weeknd'))
print(get_lyrics('Rockstar', 'DaBaby'))
print(get_lyrics('The Box', 'Roddy Ricch'))

[Intro]Yeah[Verse 1]I've been tryna callI've been on my own for long enoughMaybe you can show me how to love, maybeI'm going through withdrawalsYou don't even have to do too muchYou can turn me on with just a touch, baby[Pre-Chorus]I look around andSin City's cold and empty (Oh)No one's around to judge me (Oh)I can't see clearly when you're gone[Chorus]I said, ooh, I'm blinded by the lightsNo, I can't sleep until I feel your touchI said, ooh, I'm drowning in the nightOh, when I'm like this, you're the one I trustHey, hey, hey[Verse 2]I'm running out of time'Cause I can see the sun light up the skySo I hit the road in overdrive, baby, oh[Pre-Chorus]The city's cold and empty (Oh)No one's around to judge me (Oh)I can't see clearly when you're gone[Chorus]I said, ooh, I'm blinded by the lightsNo, I can't sleep until I feel your touchI said, ooh, I'm drowning in the nightOh, when I'm like this, you're the one I trust[Bridge]I'm just calling back to let you know (Back to let you know)I could
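In the output above, the lyric lines run together because `get_text()` joins the text pieces with no separator. A possible remedy is to pass a separator, sketched below on a made-up fragment; it assumes the lines are separated by `<br/>` tags in the page markup, which has not been checked against the actual Genius HTML.

```python
from bs4 import BeautifulSoup

# Made-up fragment: lyric lines separated by <br/> tags (an assumption)
snippet = '<div class="lyrics">[Intro]<br/>Yeah<br/>[Verse 1]<br/>First line<br/>Second line</div>'
lyrics_tag = BeautifulSoup(snippet, "html.parser").find("div", {"class": "lyrics"})

joined = lyrics_tag.get_text()                   # text pieces run together
separated = lyrics_tag.get_text(separator="\n")  # one text piece per line
print(separated)
```

If this holds for the real pages, the same `separator="\n"` argument could be used in the `get_text()` call inside `get_lyrics`.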

In [None]:
get_lyrics('The Box', 'Roddy Ricch')

"[Chorus]Pullin' out the coupe at the lotTold 'em fuck 12, fuck SWATBustin' all the bells\u2005out\u2005the boxI just\u2005hit a lick with the boxHad\u2005to put the stick in the box, mmhPour up the whole damn seal, I'ma get lazyI got the mojo deals, we been trappin' like the '80sShe sucked a nigga soul, gotta Cash AppTold 'em wipe a nigga nose, say slatt, slattI won't never sell my soul, and I can back thatAnd I really wanna know where you at, at[Verse 1]I was out back where the stash atCruise the city in a bulletproof Cadillac (Skrrt)'Cause I know these niggas after where the bag at (Yeah)Gotta move smarter, gotta move harderNigga try to get me for my waterI'll lay his ass down, on my son, on my daughterI had the Draco with me, Dwayne CarterLotta niggas out here playin', ain't ballin'I done put my whole arm in the rim, Vince Carter (Yeah)And I know probably get a key for the quarterShawty barely seen in double C's, I bought 'emGot a bitch that's looking like Aaliyah, she a modelI got