# Problem Statement: How have hits changed over the years?

**The goal of this project is to examine how hits have changed over the years in terms of a song's audio features and its lyrics. We focus on how hits have changed since 2008 by comparing three snapshot years: 2008, 2013 and 2018.**

### We will be using Spotify's and Genius' APIs and web scraping methods to gather all of our data
- Scraping from Billboard's Year-End Hot 100 Songs will provide us with songs to use as a proxy for hit songs.
- Scraping from SongFacts will provide us with other songs that are not on the Year-End Hot 100 list for each year, but were released in the same year as each Hot 100 list was compiled.
- Genius' API will provide us with song lyrics for the tracks we will use in our model.
- Spotify's API will provide us with important audio features that we need for our model.

In [1]:
import spotipy                     # Spotify's API package
import spotipy.oauth2 as oauth2    # Spotify's authorization submodule
import lyricsgenius as genius      # Genius' API package
import json                        
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time

### First, let's scrape Billboard's Year-End Hot 100 Songs using BeautifulSoup
- We will be scraping Year-End charts from 2008, 2013 and 2018.

In [6]:
# Instantiating URLs for Year-End Hot 100 Songs for years 2008, 2013 and 2018
url_billboard_2008 = 'https://www.billboard.com/charts/year-end/2008/hot-100-songs'
url__billboard_2013 = 'https://www.billboard.com/charts/year-end/2013/hot-100-songs'
url_billboard_2018 = 'https://www.billboard.com/charts/year-end/2018/hot-100-songs'

# Create a master list of all URLs 
billboard_master_url_list = [url_billboard_2008, url__billboard_2013, url_billboard_2018]

In [3]:
# Create a function that takes a URL and scrapes all track names from that specific year's Year-End Hot 100 list
def get_track_names(url):
    res = requests.get(url)                       # Instantiate a get request for the url
    soup = BeautifulSoup(res.content, 'lxml')     # Instantiate a soup object
    
    soup_track_names = soup.find_all('div', {'class': 'ye-chart-item__title'})     # Soup object query for track names
    track_names = []                                                               # Create an empty list to fill with track names
    
    # Iterate through each <div> tag returned from the query
    for tag in soup_track_names:
        stripped_title = tag.text.strip('\n')                          # Strip '\n' from the track names
        track_names.append(stripped_title)                             # Append the stripped title to the list
        
    return track_names

In [4]:
# Create a function that takes a URL and scrapes all artist names from that specific year's Year-End Hot 100 list
def get_artist_names(url):
    res = requests.get(url)                       # Instantiate a get request for the url
    soup = BeautifulSoup(res.content, 'lxml')     # Instantiate a soup object
    
    soup_artist_names = soup.find_all('div', {'class': 'ye-chart-item__artist'})   # Soup object query for artist names
    artist_names = []                                                              # Create an empty list to fill with artist names
    
    # Iterate through each <div> tag returned from the query
    for tag in soup_artist_names:
        stripped_artist = tag.text.strip('\n')                         # Strip '\n' from the artist names
        artist_names.append(stripped_artist)                           # Append the stripped artist name to the list
        
    return artist_names

In [5]:
# Creating variables for all hit songs and respective artists for years 2008, 2013 and 2018
hit_songs_08 = get_track_names(url_billboard_2008)
hit_artists_08 = get_artist_names(url_billboard_2008)

hit_songs_13 = get_track_names(url__billboard_2013)
hit_artists_13 = get_artist_names(url__billboard_2013)

hit_songs_18 = get_track_names(url_billboard_2018)
hit_artists_18 = get_artist_names(url_billboard_2018)

### Next, let's scrape songs that were not on the Year-End Hot 100 list from SongFacts
- SongFacts is a website that provides expansive lists of songs released in a given year in the US. We cannot be certain that the site captures every release from a given year, but for the purposes of this project the large quantity of songs available is what matters. The scraping process is similar to the one used above for Billboard.

In [8]:
# Instantiating URLs for SongFacts browse pages for years 2008, 2013 and 2018
url_songfacts_2008 = "https://www.songfacts.com/browse/years/2008"
url_songfacts_2013 = "https://www.songfacts.com/browse/years/2013"
url_songfacts_2018 = "https://www.songfacts.com/browse/years/2018"

# Create a master list of all URLs
songfacts_master_url_list = [url_songfacts_2008, url_songfacts_2013, url_songfacts_2018]

# Instantiating a user agent for SongFacts web scrape
headers = {'User-agent': 'danhyunkim'}

**We will create separate functions to scrape all song and artist names for each individual year. We tried creating one function to do this for each year, but ran into issues when trying to scrape multiple pages for each year.**
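
The per-year functions differ only in the year and the number of the last page, so one possible way to fold them together is to generate the page URLs from those two values. The sketch below is only an illustration under two assumptions: that the page URL pattern matches the one used in this notebook, and that each entry reads `'song name - artist name'`. The helper names `songfacts_page_urls` and `split_song_artist` are hypothetical.

```python
def songfacts_page_urls(year, last_page):
    """Return the browse URLs for one year, page 1 through last_page."""
    base = f'https://www.songfacts.com/browse/years/{year}'
    # Page 1 is the bare year URL; later pages append 'page<n>'.
    return [base] + [f'{base}/page{n}' for n in range(2, last_page + 1)]


def split_song_artist(entry):
    """Split 'song name - artist name' on the first ' - ' separator only."""
    song, _, artist = entry.partition(' - ')
    return song.strip(), artist.strip()


# Last page numbers taken from the per-year ranges used in this notebook.
last_pages = {2008: 21, 2013: 29, 2018: 15}
urls_2008 = songfacts_page_urls(2008, last_pages[2008])
print(len(urls_2008), urls_2008[1])
print(split_song_artist("God's Plan - Drake"))
```

Splitting on the first `' - '` keeps commas and hyphenated words inside titles intact, but it would still mis-split a title that itself contains `' - '`, so the result is worth spot-checking.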

In [18]:
# Instantiate empty lists to fill with all song and artist names from 2008
sf_songs_2008 = []
sf_artists_2008 = []

# Define the scraping function for 2008
# Inputs are the url, a song list and an artist list
def get_sf_songs_2008(url, song_list, artist_list):
    res_url = requests.get(url, headers=headers)         # Create a get request for the url
    sf_soup = BeautifulSoup(res_url.content, 'lxml')     # Create a soup object for the url

    # This range is specific to the <li> tags that the below query delivers
    # Each page has 100 songs with their respective artists which starts at index 39 and ends at index 138
    list_items = sf_soup.find_all('li')
    for i in range(39, 139):
        
        # Each <li> tag holds text in this format: 'song name - artist name'
        # Split once on ' - ' to create two separate strings as such: ['song name', 'artist name']
        song_and_artist = list_items[i].text.split(' - ', 1)
        
        # Append the first item of each list created to the song list
        song_list.append(song_and_artist[0].strip())
        
        # Append the second item of each list created to the artist list, stripped of surrounding whitespace
        artist_list.append(song_and_artist[1].strip())
        
    # New for loop to scrape songs and artists from page 2 through 21
    # This range is specific for the year 2008
    for page in range(2, 22):
        # Build the URL for each page from the base url
        url_next_page = 'https://www.songfacts.com/browse/years/2008/page' + str(page)
        
        res_url_next_page = requests.get(url_next_page, headers=headers)         # URL specific get request
        sf_soup_next_page = BeautifulSoup(res_url_next_page.content, 'lxml')     # URL specific soup object
        
        # Same scraping process as above
        list_items = sf_soup_next_page.find_all('li')
        for i in range(39, 139):
            song_and_artist = list_items[i].text.split(' - ', 1)
            song_list.append(song_and_artist[0].strip())
            artist_list.append(song_and_artist[1].strip())
        
        # 5 second break before requesting the next page
        time.sleep(5)


**The intuition behind the 2013 and 2018 scraping functions follows the same logic as the 2008 scraping function so commenting is excluded unless otherwise noted.**

In [19]:
sf_songs_2013 = []
sf_artists_2013 = []

def get_sf_songs_2013(url, song_list, artist_list):
    res_url = requests.get(url, headers=headers)
    sf_soup = BeautifulSoup(res_url.content, 'lxml')

    list_items = sf_soup.find_all('li')
    for i in range(39, 139):
        song_and_artist = list_items[i].text.split(' - ', 1)
        song_list.append(song_and_artist[0].strip())
        artist_list.append(song_and_artist[1].strip())
        
    # New for loop to scrape songs and artists from page 2 through 29
    # This range is specific for the year 2013
    for page in range(2, 30):
        url_next_page = 'https://www.songfacts.com/browse/years/2013/page' + str(page)
        
        res_url_next_page = requests.get(url_next_page, headers=headers)
        sf_soup_next_page = BeautifulSoup(res_url_next_page.content, 'lxml')

        list_items = sf_soup_next_page.find_all('li')
        for i in range(39, 139):
            song_and_artist = list_items[i].text.split(' - ', 1)
            song_list.append(song_and_artist[0].strip())
            artist_list.append(song_and_artist[1].strip())
            
        time.sleep(5)

In [20]:
sf_songs_2018 = []
sf_artists_2018 = []

def get_sf_songs_2018(url, song_list, artist_list):
    res_url = requests.get(url, headers=headers)
    sf_soup = BeautifulSoup(res_url.content, 'lxml')

    list_items = sf_soup.find_all('li')
    for i in range(39, 139):
        song_and_artist = list_items[i].text.split(' - ', 1)
        song_list.append(song_and_artist[0].strip())
        artist_list.append(song_and_artist[1].strip())
        
    # New for loop to scrape songs and artists from page 2 through 15
    # This range is specific for the year 2018
    for page in range(2, 16):
        # Use the 2018 browse pages here (the earlier version pointed at the 2013 pages)
        url_next_page = 'https://www.songfacts.com/browse/years/2018/page' + str(page)
        
        res_url_next_page = requests.get(url_next_page, headers=headers)
        sf_soup_next_page = BeautifulSoup(res_url_next_page.content, 'lxml')

        list_items = sf_soup_next_page.find_all('li')
        for i in range(39, 139):
            song_and_artist = list_items[i].text.split(' - ', 1)
            song_list.append(song_and_artist[0].strip())
            artist_list.append(song_and_artist[1].strip())
            
        time.sleep(5)

In [None]:
get_sf_songs_2008(url_songfacts_2008, sf_songs_2008, sf_artists_2008)

In [None]:
get_sf_songs_2013(url_songfacts_2013, sf_songs_2013, sf_artists_2013)
get_sf_songs_2018(url_songfacts_2018, sf_songs_2018, sf_artists_2018)

In [None]:
sf_songs_2013

In [None]:
sf_songs_2018

In [None]:
lyrics_url = 'https://genius.com/Drake-gods-plan-lyrics'
lyr_res = requests.get(lyrics_url)
lyr_soup = BeautifulSoup(lyr_res.content, 'lxml')

In [None]:
lyr_soup.find_all('p')[0].text.replace('[Chorus 1]', '').replace('[Intro]', '').replace('[Post-Chorus]', '').replace('[Verse 1]', '').replace('[Verse 2]', '').replace('\n', ' ').replace('[Chorus 2]', '').replace('[Bridge]', '').replace("'", '').replace('"', '').replace(',', '').replace('.', '').replace('?', '').replace('[Hook]', '').replace('[Verse 3]', '').replace('[Pre-Chorus]', '').replace('[Chorus]', '').replace('[Outro]', '').replace('(', '').replace(')', '').replace('-', ' ').replace('[Verse 4]', '')


### Now, let's access Genius' API for song lyrics for all of our songs

In [None]:
# Create Genius object
# Replace the placeholder with your own access token; a real token should not be published in a notebook
genius = genius.Genius(client_access_token='YOUR_GENIUS_ACCESS_TOKEN', 
              response_format='plain',         # Format of response is plain text
              timeout=5, sleep_time=0.5, 
              remove_section_headers=True,     # Remove headers such as [Intro], [Verse], [Chorus], etc.
              skip_non_songs=True,             # Skip items that are not songs
              verbose=True)                    # Print search text

In [None]:
song_name = ['thank u next', 'gods plan']
artist_name = ['ariana grande', 'drake']

In [None]:
for i, j in zip(song_name, artist_name):
    print(i, j)

In [None]:
zip(song_name, artist_name)

In [None]:
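lyric_list_Hot100_2008 = []

# Search Genius by title and artist for each 2008 hit scraped earlier
for song, artist in zip(hit_songs_08, hit_artists_08):
    result = genius.search_song(song, artist)
    if result is None:          # Skip songs that Genius could not find
        continue
    song_2008 = result.lyrics.replace('\n', ' ').replace('.', '').replace(',', '').replace('-', ' ').replace("'", '').replace('"', '').replace('?', '')
    lyric_list_Hot100_2008.append(song_2008)
    time.sleep(3)
    
lyric_list_Hot100_2008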

In [None]:
lyric_list_Hot100_2008[2]

In [None]:
song = genius.search_song("God's Plan").lyrics.replace('\n', ' ').replace('.', '').replace(',', '').replace('-', ' ').replace("'", '').replace('"', '').replace('?', '')


In [None]:
song
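
The same chain of `.replace()` calls appears in several cells above, so it may be easier to maintain as one helper. The sketch below is one possible version, not the notebook's own method: the name `clean_lyrics` is hypothetical, it removes any bracketed header rather than a fixed list of them, and it additionally collapses repeated whitespace.

```python
import re

SECTION_HEADER = re.compile(r'\[[^\]]*\]')       # e.g. [Intro], [Verse 1], [Chorus]
PUNCTUATION = re.compile(r'''[.,?"'()]''')       # characters the chained replaces delete


def clean_lyrics(text):
    """Normalize raw lyrics roughly the way the chained .replace() calls do."""
    text = SECTION_HEADER.sub('', text)                  # drop section headers
    text = text.replace('\n', ' ').replace('-', ' ')     # newlines and hyphens become spaces
    text = PUNCTUATION.sub('', text)                     # delete the remaining punctuation
    return ' '.join(text.split())                        # collapse repeated whitespace


# Made-up sample text, used only to show the effect of each step.
sample = '[Verse 1]\nHello, world - it\'s me?\n[Chorus]\n"Sing" (again)'
print(clean_lyrics(sample))
```

With a helper like this, each lyrics cell could call `clean_lyrics(result.lyrics)` instead of repeating the chain.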

### Finally, let's access Spotify's API

In [None]:
# Set up credentials and token for API environment
credentials = oauth2.SpotifyClientCredentials(
    client_id='YOUR_SPOTIFY_CLIENT_ID',            # Client ID provided from developer account page
    client_secret='YOUR_SPOTIFY_CLIENT_SECRET')    # Client Secret provided from developer account page
                                                   # Keep real credentials out of the notebook
token = credentials.get_access_token()

In [None]:
# Create Spotify object
spotify = spotipy.Spotify(auth=token)

**Now that we have all of our hit songs, we need to get their URIs in order to extract the important audio features from Spotify's API. We set up separate `for` loops for each year's top tracks to get their individual URIs. We could have defined a function to extract all the URIs for all of our hit songs at once, but we thought it would be useful to have separate lists of URIs for each year's top tracks for our analysis because we will be comparing the change in features through the years.**

In [None]:
# For loop to get URIs for 2018 hits
billboard_tracks_URIs_2018 = []

# Iterating through each track name in the 2018 hit list we scraped earlier
for track in hit_songs_18:
    
    # Spotify search query indexed to get the URI for each song
    uri = spotify.search(track, limit=1, offset=0, type='track')['tracks']['items'][0]['uri'] 
    # Appending each URI to the empty list above
    billboard_tracks_URIs_2018.append(uri)

**The `for` loop above will be consistent for the other years we are examining.**
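
Because the loop is the same for every year, it can also be wrapped in a function while still keeping one list of URIs per year. The sketch below is a hedged illustration: `first_track_uri`, `collect_uris` and `fake_search` are hypothetical names, and `fake_search` with its example URI only stands in for `spotify.search` so the code runs without credentials.

```python
def first_track_uri(search_result):
    """Return the URI of the first track in a search response, or None if there is none."""
    items = search_result.get('tracks', {}).get('items', [])
    return items[0]['uri'] if items else None


def collect_uris(search, track_names):
    """Look up each track name with `search` and keep only the names that matched."""
    uris = {}
    for name in track_names:
        uri = first_track_uri(search(name))
        if uri is not None:
            uris[name] = uri
    return uris


# Stand-in for spotify.search (hypothetical data, shaped like the response indexed above).
def fake_search(name):
    catalog = {'gods plan': 'spotify:track:example1'}
    hit = catalog.get(name)
    return {'tracks': {'items': [{'uri': hit}] if hit else []}}


tracks_by_year = {2018: ['gods plan', 'unknown song']}
uris_by_year = {year: collect_uris(fake_search, names)
                for year, names in tracks_by_year.items()}
print(uris_by_year)
```

Against the real API, the `search` argument could be something like `lambda name: spotify.search(name, limit=1, offset=0, type='track')`, applied to `hit_songs_18`, `hit_songs_13` and `hit_songs_08` in turn. Returning `None` for an empty result avoids the `IndexError` that `['items'][0]` raises when a search finds nothing.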

In [None]:
# For loop to get URIs for 2013 hits
billboard_tracks_URIs_2013 = []
for track in hit_songs_13:
    uri = spotify.search(track, limit=1, offset=0, type='track')['tracks']['items'][0]['uri']
    billboard_tracks_URIs_2013.append(uri)

In [None]:
# For loop to get URIs for 2008 hits
billboard_tracks_URIs_2008 = []
for track in hit_songs_08:
    uri = spotify.search(track, limit=1, offset=0, type='track')['tracks']['items'][0]['uri']
    billboard_tracks_URIs_2008.append(uri)

In [None]:
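SongFactsURIs = []
# SF_track_and_artist_names is assumed to be a dictionary keyed by track name; it is not built in this notebook
for track in SF_track_and_artist_names.keys():
    try:
        uri = spotify.search(track, limit=1, offset=0, type='track')['tracks']['items'][0]['uri']
    except IndexError:          # No search result for this track, so skip it
        continue
    SongFactsURIs.append(uri)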

**Now that we have the URIs for each song in the Year-End Hot 100 list for each year, we need to retrieve songs from each year that were not on this list. We conducted outside research to find songs that came out in each respective year not in the Hot 100 list and will retrieve their URIs through Spotify's API.**
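
A list of songs released in a given year can also contain that year's hits, so the non-hit set may need to be filtered against the Hot 100 list. The sketch below shows one possible way to do that by comparing normalized titles. The helper names are hypothetical, the titles are illustrative only, and matching on title alone is a simplifying assumption that would also drop a different song sharing a hit's title.

```python
import re


def normalize_title(title):
    """Lowercase a title and keep only letters, digits and single spaces."""
    cleaned = re.sub(r'[^a-z0-9 ]', '', title.lower())
    return ' '.join(cleaned.split())


def exclude_hits(candidates, hits):
    """Return the candidate titles whose normalized form is not among the hit titles."""
    hit_keys = {normalize_title(title) for title in hits}
    return [title for title in candidates if normalize_title(title) not in hit_keys]


# Illustrative titles; in the notebook these would be lists such as sf_songs_2018 and hit_songs_18.
candidates = ["God's Plan", 'Some Other Song', 'thank u, next']
hits = ['Gods Plan', 'Thank U, Next']
print(exclude_hits(candidates, hits))
```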

In [None]:
spotify.search("rockstar", limit=1, offset=0, type='track')['tracks']['items'][0]['name']

In [None]:
artist = spotify.artist_albums(artist_id = 'spotify:artist:2P5sC9cVZDToPxyomzF1UH')
artist['items'][2].keys()

In [None]:
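album_URIs = []
def artist_details(artist_URI):
    artist = spotify.artist_albums(artist_URI)
    for album in artist['items']:          # Collect the URI of each album returned for the artist
        album_URIs.append(album['uri'])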

In [None]:
track_URIs = []
track_titles = []
def album_details(album_URI):
    album_songs = spotify.album_tracks(album_URI)
    # Iterate over the returned items; 'total' can be larger than the number of items in one response
    for item in album_songs['items']:
        print(item['name'])
        track_titles.append(item['name'])
        track_URIs.append(item['uri'])

In [None]:
features = []
for i in range(len(track_URIs)):
    print(i)
    # audio_features returns a list with one dictionary per track, so keep the first element
    features.append(spotify.audio_features(track_URIs[i])[0])
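
Requesting features one track at a time costs one API call per song. To my knowledge, `spotify.audio_features` also accepts a list of tracks, and Spotify's audio-features endpoint takes up to 100 IDs per request, so the calls can be batched. The sketch below illustrates the batching only: `chunked`, `collect_audio_features` and `fake_fetch` are hypothetical names, and `fake_fetch` with its placeholder values stands in for the real API call.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def collect_audio_features(fetch, uris, batch_size=100):
    """Call `fetch` once per batch and flatten the per-track results into one list."""
    features = []
    for batch in chunked(uris, batch_size):
        features.extend(fetch(batch))
    return features


# Stand-in for spotify.audio_features so the sketch runs offline (placeholder values).
def fake_fetch(batch):
    return [{'uri': uri, 'tempo': 120.0} for uri in batch]


demo_uris = [f'spotify:track:demo{n}' for n in range(250)]
demo_features = collect_audio_features(fake_fetch, demo_uris)
print(len(demo_features), len(list(chunked(demo_uris, 100))))
```

With real credentials, `fetch` could be `spotify.audio_features` and `uris` the `track_URIs` list, which would turn 250 single-track requests into 3 batched ones.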

In [None]:
len(features)

In [None]:
df = pd.DataFrame(features, index=track_titles[:100])

In [None]:
df

### Finally, let's scrape Genius for song lyrics