# Problem Statement: How have hits changed over the years?

**The goal of this project is to examine how hit songs have changed over the years in terms of their audio features and their lyrics. We focus on how hits have changed since 2008 by comparing three snapshot years: 2008, 2013 and 2018.**

### We will be using Web API and Web Scraping methods to gather all of our data
- Scraping from Billboard's Year-End Hot 100 Songs will provide us with songs to use as a proxy for hit songs.
- Scraping from SongFacts will provide us with songs that were released in each of those years but did not appear on that year's Year-End Hot 100 list.
- Genius' API will provide us with song lyrics for the tracks we will use in our model.
- Spotify's API will provide us with important audio features that we need for our model.

In [8]:
import spotipy                     # Spotify's API package
import spotipy.oauth2 as oauth2    # Spotify's authorization submodule
import lyricsgenius as genius      # Genius' API package
import json                        
import pandas as pd
from bs4 import BeautifulSoup
import requests
import time
from tqdm import tqdm

### First, let's scrape Billboard's Year-End Hot 100 Songs using BeautifulSoup
- We will be scraping Year-End charts from 2008, 2013 and 2018.

In [2]:
# Instantiating URLs for Year-End Hot 100 Songs for years 2008, 2013 and 2018
url_billboard_2008 = 'https://www.billboard.com/charts/year-end/2008/hot-100-songs'
url_billboard_2013 = 'https://www.billboard.com/charts/year-end/2013/hot-100-songs'
url_billboard_2018 = 'https://www.billboard.com/charts/year-end/2018/hot-100-songs'

# Create a master list of all URLs 
billboard_master_url_list = [url_billboard_2008, url_billboard_2013, url_billboard_2018]

In [3]:
# Create a function that takes a URL and scrapes all track names from that specific year's Year-End Hot 100 list
def get_track_names(url):
    res = requests.get(url)                       # Instantiate a get request for the url
    soup = BeautifulSoup(res.content, 'lxml')     # Instantiate a soup object
    
    soup_track_names = soup.find_all('div', {'class': 'ye-chart-item__title'})     # Soup object query for track names
    track_names = []                                                               # Create an empty list to fill with track names
    
    # Create a for loop to iterate through each HTML element returned from the query
    for i in range(len(soup_track_names)):              
        stripped_title = soup_track_names[i].text.strip('\n')          # Strip '\n' from the track names
        track_names.append(stripped_title)                             # Append the list with stripped title names
        
    return track_names

In [4]:
# Create a function that takes a URL and scrapes all artist names from that specific year's Year-End Hot 100 list
def get_artist_names(url):
    res = requests.get(url)                       # Instantiate a get request for the url
    soup = BeautifulSoup(res.content, 'lxml')     # Instantiate a soup object
    
    soup_artist_names = soup.find_all('div', {'class': 'ye-chart-item__artist'})   # Soup object query for artist names
    artist_names = []                                                              # Create an empty list to fill with artist names
    
    # Create a for loop to iterate through each HTML element returned from the query
    for i in range(len(soup_artist_names)):              
        stripped_artist = soup_artist_names[i].text.strip('\n')        # Strip '\n' from the artist names
        artist_names.append(stripped_artist)                           # Append the list with stripped artist names
        
    return artist_names

In [5]:
# Creating variables for all hit songs and respective artists for years 2008, 2013 and 2018
hit_songs_08 = get_track_names(url_billboard_2008)
hit_artists_08 = get_artist_names(url_billboard_2008)

hit_songs_13 = get_track_names(url_billboard_2013)
hit_artists_13 = get_artist_names(url_billboard_2013)

hit_songs_18 = get_track_names(url_billboard_2018)
hit_artists_18 = get_artist_names(url_billboard_2018)
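
Because track titles and artist names are scraped by two separate queries, it may be worth confirming that the two lists line up before pairing them. The helper below is a minimal sketch of such a check. `pair_tracks_with_artists` is a hypothetical name that is not part of the original notebook, and the sample values are placeholders rather than scraped data.

```python
def pair_tracks_with_artists(track_names, artist_names):
    """Return (track, artist) tuples, failing loudly if the lists differ in length."""
    if len(track_names) != len(artist_names):
        raise ValueError(
            f"Got {len(track_names)} tracks but {len(artist_names)} artists"
        )
    # Strip any surrounding whitespace left over after scraping
    return [(t.strip(), a.strip()) for t, a in zip(track_names, artist_names)]


# Placeholder data standing in for one year's scraped lists
pairs = pair_tracks_with_artists(["Song A\n", " Song B"], ["Artist A", "Artist B "])
print(pairs)
```

In the notebook, the same call could be made with `hit_songs_08` and `hit_artists_08`, and likewise for the other two years.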

### Next, let's scrape songs that were not on the Year-End Hot 100 list from SongFacts
- SongFacts is a website that provides extensive lists of songs that were released in a given year in the US. We cannot be certain that the site has captured every song released in a given year, but for the purposes of this project what matters is the quantity of songs available. The scraping process is similar to the one used for Billboard above.

In [6]:
# Instantiating URLs for SongFacts browse pages for years 2008, 2013 and 2018
url_songfacts_2008 = "https://www.songfacts.com/browse/years/2008"
url_songfacts_2013 = "https://www.songfacts.com/browse/years/2013"
url_songfacts_2018 = "https://www.songfacts.com/browse/years/2018"

# Create a master list of all URLs
songfacts_master_url_list = [url_songfacts_2008, url_songfacts_2013, url_songfacts_2018]

# Instantiating a user agent for SongFacts web scrape
headers = {'User-agent': 'danhyunkim'}

**We will create a separate function to scrape all song and artist names for each individual year. We tried creating one function that handles every year, but ran into issues when trying to scrape multiple pages for each year.**

In [7]:
# Instantiate empty lists to fill with all song and artist names from 2008
sf_songs_2008 = []
sf_artists_2008 = []

# Define the scraping function for 2008
# Inputs are the url, a song list and an artist list
def get_sf_songs_2008(url, song_list, artist_list):
    res_url = requests.get(url, headers=headers)         # Create a get request for the url
    sf_soup = BeautifulSoup(res_url.content, 'lxml')     # Create a soup object for the url

    # This range is specific to the list items ('li' tags) that the query below returns
    # Each page has 100 songs with their respective artists, starting at index 39 and ending at index 138
    for i in range(39, 139):
        
        # Query for the text of each song and artist list item
        # Each item's text has this format: 'song name - artist name'
        # The code replaces the hyphen with a comma and splits on the comma to create two separate strings: ['song name', 'artist name']
        # A song or artist name that itself contains a hyphen or comma is split in the wrong place
        song_and_artist = sf_soup.find_all('li')[i].text.replace('-', ',').split(',')
        
        # Append the first item of each list created to the song list
        sf_songs_2008.append(song_and_artist[0])              
        
        # Append the second item of each list created to the artist list, with all spaces removed
        sf_artists_2008.append(song_and_artist[1].replace(' ', ''))        
        
        # This break exits the loop after its first iteration, so only the first entry on page 1 is kept
        # The remaining pages are handled by the loop below
        break
        
    # New for loop to scrape songs and artists from page 2 through 21
    # This range is specific for the year 2008
    # Using tqdm progress bar to track run time and progress
    for i in tqdm(range(2,22)):
        
        # Base url that will be used to create the URL for each page
        url_next_page = 'https://www.songfacts.com/browse/years/2008/page'+str(i)
        
        res_url_next_page = requests.get(url_next_page, headers=headers)         # URL specific get request
        sf_soup_next_page = BeautifulSoup(res_url_next_page.content, 'lxml')     # URL specific soup object
        
        # Same scraping process as above
        for i in range(39, 139):
            song_and_artist = sf_soup_next_page.find_all('li')[i].text.replace('-', ',').split(',')
            sf_songs_2008.append(song_and_artist[0])
            sf_artists_2008.append(song_and_artist[1])
        
        # 5 second break before rerunning the for loop
        time.sleep(5)
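
Replacing every hyphen with a comma and then splitting on commas truncates any title or artist name that contains either character. A minimal sketch of an alternative is to split only on the last `' - '` separator. This assumes each entry really has the form `'song name - artist name'` with spaces around the hyphen, as the comments above describe. `split_song_and_artist` is a hypothetical helper, and the example strings are placeholders.

```python
def split_song_and_artist(entry):
    """Split 'song name - artist name' on the last ' - ' separator."""
    song, sep, artist = entry.strip().rpartition(" - ")
    if not sep:
        # No separator found: treat the whole entry as the title
        return entry.strip(), ""
    return song.strip(), artist.strip()


# Placeholder entries, including a title with a hyphen and a comma
examples = [
    "Up-Tempo Song, Part 2 - Example Artist",
    "No separator here",
]
for example in examples:
    print(split_song_and_artist(example))
```

An artist name that itself contains `' - '` would still be split in the wrong place, so this is a mitigation rather than a complete parser.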


**The intuition behind the 2013 and 2018 scraping functions follows the same logic as the 2008 scraping function so commenting is excluded unless otherwise noted.**

In [8]:
sf_songs_2013 = []
sf_artists_2013 = []

def get_sf_songs_2013(url, song_list, artist_list):
    res_url = requests.get(url, headers=headers)
    sf_soup = BeautifulSoup(res_url.content, 'lxml')

    for i in range(39, 139):
        song_and_artist = sf_soup.find_all('li')[i].text.replace('-', ',').split(',')
        sf_songs_2013.append(song_and_artist[0])
        sf_artists_2013.append(song_and_artist[1].replace(' ', ''))
            
        break
        
    # New for loop to scrape songs and artists from page 2 through 29
    # This range is specific for the year 2013      
    for i in tqdm(range(2,30)):
        url_next_page = 'https://www.songfacts.com/browse/years/2013/page'+str(i)
        
        res_url_next_page = requests.get(url_next_page, headers=headers)
        sf_soup_next_page = BeautifulSoup(res_url_next_page.content, 'lxml')

        for i in range(39, 139):
            song_and_artist = sf_soup_next_page.find_all('li')[i].text.replace('-', ',').split(',')
            sf_songs_2013.append(song_and_artist[0])
            sf_artists_2013.append(song_and_artist[1])
            
        time.sleep(5)

In [9]:
sf_songs_2018 = []
sf_artists_2018 = []

def get_sf_songs_2018(url, song_list, artist_list):
    res_url = requests.get(url, headers=headers)
    sf_soup = BeautifulSoup(res_url.content, 'lxml')

    for i in range(39, 139):
        song_and_artist = sf_soup.find_all('li')[i].text.replace('-', ',').split(',')
        sf_songs_2018.append(song_and_artist[0])
        sf_artists_2018.append(song_and_artist[1].replace(' ', ''))
            
        break
        
    # New for loop to scrape songs and artists from page 2 through 15
    # This range is specific for the year 2018     
    for i in tqdm(range(2,16)):
        url_next_page = 'https://www.songfacts.com/browse/years/2018/page'+str(i)
        
        res_url_next_page = requests.get(url_next_page, headers=headers)
        sf_soup_next_page = BeautifulSoup(res_url_next_page.content, 'lxml')

        for i in range(39, 139):
            song_and_artist = sf_soup_next_page.find_all('li')[i].text.replace('-', ',').split(',')
            sf_songs_2018.append(song_and_artist[0])
            sf_artists_2018.append(song_and_artist[1])
            
        time.sleep(5)
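
The three scraping functions differ only in the year and in the number of the last page. One way to move toward a single function is to generate the page URLs from those two values, which also helps avoid copy-paste slips in the hard-coded year. The sketch below only builds the URLs and does not send any requests. `songfacts_page_urls` and `LAST_PAGE` are hypothetical names, and the page counts are taken from the ranges used in the functions above.

```python
SONGFACTS_BASE = "https://www.songfacts.com/browse/years"

# Last page number per year, taken from the ranges in the three functions above
LAST_PAGE = {2008: 21, 2013: 29, 2018: 15}


def songfacts_page_urls(year, last_page):
    """Return the browse URLs for one year, with page 1 first."""
    first = f"{SONGFACTS_BASE}/{year}"
    rest = [f"{first}/page{n}" for n in range(2, last_page + 1)]
    return [first] + rest


urls_2018 = songfacts_page_urls(2018, LAST_PAGE[2018])
print(len(urls_2018))
print(urls_2018[0])
print(urls_2018[-1])
```

Each URL could then be passed through the same request, parse and sleep steps that the per-year functions already use.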

In [11]:
# Run function for year 2008 
get_sf_songs_2008(url_songfacts_2008, sf_songs_2008, sf_artists_2008)

In [16]:
# Run function for year 2013 
get_sf_songs_2013(url_songfacts_2013, sf_songs_2013, sf_artists_2013)

In [21]:
# Run function for year 2018 
get_sf_songs_2018(url_songfacts_2018, sf_songs_2018, sf_artists_2018)

In [14]:
# Check the length of song and artist list of 2008
print(len(sf_songs_2008))
print(len(sf_artists_2008))

1992
1992


In [19]:
# Check the length of song and artist list of 2013
print(len(sf_songs_2013))
print(len(sf_artists_2013))

2785
2785


In [22]:
# Check the length of song and artist list of 2018
print(len(sf_songs_2018))
print(len(sf_artists_2018))

1401
1401


**We now have a sample of non-Billboard songs. The number of songs differs from year to year and does not reflect the total number of songs released in each year, because SongFacts limits how many we can obtain.**

### Now, let's access Genius' API for song lyrics for all of our songs

In [9]:
# Create Genius object
genius = genius.Genius(client_access_token='1ZxV3C-PK6E8hlkD3GCk_a8H61dNIE9YFEWgmfAEfQF2BVvsIJTryB8uZw4SslZ4', 
              response_format='plain',         # Format of response is plain text
              timeout=5, sleep_time=0.5, 
              remove_section_headers=True,     # Remove headers such as [Intro], [Verse], [Chorus], etc.
              skip_non_songs=True,             # Skip items that are not songs
              verbose=True)                    # Print search text

In [24]:
# Create a function to get lyrics from Genius' API
# The inputs are a song list, an artist list and an empty lyrics list to append the lyrics to
def get_lyrics(song_list, artist_list, lyrics_list):
    
    # Create a for loop that will iterate through the songs and artists in the song and artist list
    for song, artist in zip(song_list, artist_list):
        
        # Create a try and except statement in case there are certain songs that Genius does not have the lyrics for
        try:  
            # Genius API query which pulls the lyrics of the song given the song name and artist name
            lyrics = genius.search_song(song, artist=artist).lyrics.replace('\n', ' ').replace('.', '').replace(',', '').replace('-', ' ').replace("'", '').replace('"', '').replace('?', '')
            
            # Append the lyrics to the empty lyrics list
            lyrics_list.append(lyrics)
            
            # 3 second break before rerunning the for loop
            time.sleep(3)
        except Exception:
            # Skip songs whose lyrics cannot be retrieved
            pass
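
The chain of `replace` calls in `get_lyrics` can also be written as a single translation table, which keeps the list of stripped characters in one place. The sketch below is meant to mirror the same substitutions (newlines and hyphens become spaces; periods, commas, quotes and question marks are dropped). `clean_lyrics` is a hypothetical helper name, and the sample string is a placeholder.

```python
# Characters that become a space, and characters that are removed outright
_TO_SPACE = "\n-"
_TO_DROP = ".,'\"?"

LYRICS_TABLE = str.maketrans(
    {**{ch: " " for ch in _TO_SPACE}, **{ch: None for ch in _TO_DROP}}
)


def clean_lyrics(raw):
    """Apply the same character substitutions as the chained replace calls."""
    return raw.translate(LYRICS_TABLE)


sample = "Don't stop,\nnon-stop. \"Why?\""
print(clean_lyrics(sample))
```

Adding another character to strip then only requires editing `_TO_SPACE` or `_TO_DROP`.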

In [27]:
# Instantiate an empty list to fill with lyrics of billboard hits for 2008
billboard_lyrics_08 = []

# Run function to get lyrics for 2008 billboard hits
get_lyrics(hit_songs_08, hit_artists_08, billboard_lyrics_08)

# Check length of lyrics list
len(billboard_lyrics_08)

98

In [37]:
# Instantiate an empty list to fill with lyrics of billboard hits for 2013
billboard_lyrics_13 = []

# Run function to get lyrics for 2013 billboard hits
get_lyrics(hit_songs_13, hit_artists_13, billboard_lyrics_13)

len(billboard_lyrics_13)

95

In [42]:
# Instantiate an empty list to fill with lyrics of billboard hits for 2018
billboard_lyrics_18 = []

# Run function to get lyrics for 2018 billboard hits
get_lyrics(hit_songs_18, hit_artists_18, billboard_lyrics_18)

len(billboard_lyrics_18)

99

In [46]:
# Instantiate an empty list to fill with lyrics of songfacts songs for 2008
songfacts_lyrics_08 = []

# Run function to get lyrics for 2008 songfacts songs and append it to empty list
get_lyrics(sf_songs_2008, sf_artists_2008, songfacts_lyrics_08)

len(songfacts_lyrics_08)

1947

In [48]:
# Instantiate an empty list to fill with lyrics of songfacts songs for 2013
songfacts_lyrics_13 = []

# Run function to get lyrics for 2013 songfacts songs and append it to empty list
get_lyrics(sf_songs_2013, sf_artists_2013, songfacts_lyrics_13)

len(songfacts_lyrics_13)

2724

In [158]:
# Instantiate an empty list to fill with lyrics of songfacts songs for 2018
songfacts_lyrics_18 = []

# Run function to get lyrics for 2018 songfacts songs and append it to empty list
get_lyrics(sf_songs_2018, sf_artists_2018, songfacts_lyrics_18)

len(songfacts_lyrics_18)

1382
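
The lyric counts above are lower than the number of songs requested because failed lookups are skipped. As a result, a position in a lyrics list no longer corresponds to the same position in the song list. A minimal sketch of one way to keep them aligned is to store the song and artist alongside the lyrics and record `None` on failure. To keep the example runnable offline, the lookup is passed in as a parameter: `search` is a hypothetical stand-in for a wrapper around the Genius query, and `fake_search` exists only for this illustration.

```python
def collect_lyrics(songs, artists, search):
    """Return one record per song; lyrics is None when the lookup fails."""
    records = []
    for song, artist in zip(songs, artists):
        try:
            lyrics = search(song, artist)
        except Exception:
            lyrics = None
        records.append({"song": song, "artist": artist, "lyrics": lyrics})
    return records


def fake_search(song, artist):
    """Stand-in lookup used only for this illustration."""
    known = {("Song A", "Artist A"): "la la la"}
    return known[(song, artist)]  # raises KeyError for unknown songs


records = collect_lyrics(["Song A", "Song B"], ["Artist A", "Artist B"], fake_search)
found = [r for r in records if r["lyrics"] is not None]
print(len(records), len(found))
```

A list of records like this can be passed directly to `pd.DataFrame`, which would give each lyric a `song` and `artist` column to join on later.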

### Finally, let's access Spotify's API for audio features of our songs
- Spotify provides a set of audio features that describe each song: danceability, energy, musical key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, duration in milliseconds and time signature.

In [315]:
# Set up credentials and token for API environment
credentials = oauth2.SpotifyClientCredentials(
    client_id='24ae571f5509439a800b0bc9f45b9a3d',       # Client ID provided from developer account page
    client_secret='d75b449fae6c460f8482f4d83227b563')   # Client Secret ID provided from developer account page
                                                        # Client Secret ID has since been changed 
token = credentials.get_access_token()

In [316]:
# Create Spotify object
spotify = spotipy.Spotify(auth=token)
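
API keys written directly into a notebook end up in every export of it. A common alternative is to read them from environment variables. The sketch below shows one way this could look. The variable names `SPOTIFY_CLIENT_ID` and `SPOTIFY_CLIENT_SECRET` are hypothetical, and the demo passes a plain dictionary with placeholder values in place of the real environment so that it runs anywhere.

```python
import os


def read_secret(name, env=os.environ):
    """Return a secret from the environment, or raise a clear error if it is missing."""
    value = env.get(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Demo mapping with placeholder values; in practice the default os.environ is used
demo_env = {
    "SPOTIFY_CLIENT_ID": "placeholder-id",
    "SPOTIFY_CLIENT_SECRET": "placeholder-secret",
}
client_id = read_secret("SPOTIFY_CLIENT_ID", env=demo_env)
client_secret = read_secret("SPOTIFY_CLIENT_SECRET", env=demo_env)
print(client_id)
```

The two values could then be passed as `client_id` and `client_secret` to `oauth2.SpotifyClientCredentials`, as in the cell above.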

**Now that we have all of our songs, we need to get their URIs in order to extract the important audio features from Spotify's API. We created a function that will get the URI for each song and use it to obtain its audio features.**

In [312]:
# Create a function that will get audio features for songs from Spotify's API
# The input is a song list
def get_features(song_list):
    
    # Instantiate a features list to return
    features_list_to_return = []
    
    # Create a for loop that will iterate through each song in the song list that we pass in
    # Utilize tqdm package to show progress bar of our for loop since list iterable is long
    for song in tqdm(song_list):
        
        # Create a try and except statement in the case there are songs that do not exist in Spotify's library
        # thus do not have a URI
        try:
            # Spotify query that retrieves a song's URI based on a given track name
            uri = spotify.search(song, limit=1, offset=0, type='track')['tracks']['items'][0]['uri']

            # Spotify API query that will retrieve audio features with a given URI
            features = spotify.audio_features(uri)

            # Append the audio features to the empty features list
            features_list_to_return.append(features)

        # If the song does not exist in the Spotify library we will pass over it
        except Exception:
            pass

        # 2 second pause before continuing on with the loop    
        time.sleep(2)
        
    # Return the features list
    return features_list_to_return
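
`get_features` searches by track title alone, so the first result may be a different song that happens to share the title. Spotify's search syntax accepts `track:` and `artist:` field filters, and a minimal sketch of building such a query is given here. `build_track_query` is a hypothetical helper, and the inputs are placeholders. How well the filters handle unusual punctuation in titles is not verified here.

```python
def build_track_query(song, artist):
    """Build a search string that filters on both track title and artist."""
    return f"track:{song.strip()} artist:{artist.strip()}"


query = build_track_query(" Song A ", "Artist A")
print(query)
```

The resulting string could be passed to `spotify.search` in place of the bare song title, which would require `get_features` to accept the artist list as a second input.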
            

**Utilize the function above to get audio features for all of our songs through Spotify's API.**

In [176]:
billboard_features_2018 = get_features(hit_songs_18)

len(billboard_features_2018)

100

In [219]:
songfacts_features_2018 = get_features(sf_songs_2018)

len(songfacts_features_2018)

1382

In [177]:
billboard_features_2013 = get_features(hit_songs_13)

len(billboard_features_2013)

100

In [314]:
songfacts_features_2013 = get_features(sf_songs_2013)

len(songfacts_features_2013)

1514

In [180]:
billboard_features_2008 = get_features(hit_songs_08)

len(billboard_features_2008)

100

In [318]:
songfacts_features_2008 = get_features(sf_songs_2008)

len(songfacts_features_2008)

1548

# We finally have all of our data. Now let's put the data into DataFrame format for analysis.

**Audio Features DataFrames**

In [297]:
# Create a function that creates a dataframe given a features list
def dataframe(features_list):
    
    # Our features list is a list of dictionaries within a larger list so we need to
    # create a dataframe list to append the individual audio features dataframes for each song
    df_list = []
    
    # Create a for loop that will iterate through each song's audio features
    for i in range(len(features_list)):
        
        # Create dataframe for the ith song
        features_list_df = pd.DataFrame(features_list[i], columns = features_list[0][0].keys())
        
        # Append dataframe to the dataframe list
        df_list.append(features_list_df)
        
    # Concatenate all dataframes in the dataframe list and return the master dataframe
    master_df = pd.concat(df_list)
    return master_df

In [308]:
# This new function is identical to the function above except for the try and except statements
# We discovered that certain entries in the songfacts features list are not in the correct shape
# to be made into a dataframe, so we pass over those songs with the except statement
def dataframe2(features_list):
    df_list = []
    for i in range(len(features_list)):
        try:
            features_list_df = pd.DataFrame(features_list[i], columns = features_list[0][0].keys())
            df_list.append(features_list_df)
        except Exception:
            pass
    master_df = pd.concat(df_list)
    return master_df
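
An alternative to building one small dataframe per song is to flatten the nested list into a plain list of dictionaries first and skip anything that is not a dictionary. The sketch below assumes the malformed entries are missing values such as `None`, which the notebook does not confirm. `flatten_features` is a hypothetical helper, and the sample values are made up for illustration.

```python
def flatten_features(features_list):
    """Flatten [[{...}], [None], ...] into a list of dicts, dropping unusable entries."""
    rows = []
    for per_song in features_list:
        # Treat a missing per-song result as an empty list
        for item in per_song or []:
            if isinstance(item, dict):
                rows.append(item)
    return rows


# Made-up sample with one unusable entry in the middle
sample = [
    [{"danceability": 0.5, "tempo": 120.0}],
    [None],
    [{"danceability": 0.7, "tempo": 98.0}],
]
rows = flatten_features(sample)
print(len(rows))
```

Calling `pd.DataFrame(rows)` on the result would then build the full dataframe in a single step, without a try and except block.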

**Use the dataframe function on each year's list of audio features for our hit songs and our non-hit songs and store them as csv files.**

In [136]:
billboard_features_2018_df = dataframe(billboard_features_2018)
billboard_features_2018_df.to_csv('./data/2018_billboard_features')

In [137]:
billboard_features_2013_df = dataframe(billboard_features_2013)
billboard_features_2013_df.to_csv('./data/2013_billboard_features')

In [226]:
billboard_features_2008_df = dataframe(billboard_features_2008)
billboard_features_2008_df.to_csv('./data/2008_billboard_features')

In [309]:
songfacts_features_18_df = dataframe2(songfacts_features_2018)
songfacts_features_18_df.to_csv('./data/18_songfacts_features')

In [319]:
songfacts_features_2013_df = dataframe2(songfacts_features_2013)
songfacts_features_2013_df.to_csv('./data/2013_songfacts_features')

In [320]:
songfacts_features_2008_df = dataframe2(songfacts_features_2008)
songfacts_features_2008_df.to_csv('./data/2008_songfacts_features')

**Song Lyrics DataFrames**

**Create dataframes for each year's song lyrics for our hit songs and non-hit songs and store them as csv files.**

In [141]:
billboard_lyrics_08_df = pd.DataFrame(billboard_lyrics_08, columns = ['lyrics'])
billboard_lyrics_08_df.to_csv('./data/2008_billboard_lyrics')

In [144]:
billboard_lyrics_13_df = pd.DataFrame(billboard_lyrics_13, columns = ['lyrics'])
billboard_lyrics_13_df.to_csv('./data/2013_billboard_lyrics')

In [146]:
billboard_lyrics_18_df = pd.DataFrame(billboard_lyrics_18, columns = ['lyrics'])
billboard_lyrics_18_df.to_csv('./data/2018_billboard_lyrics')

In [148]:
songfacts_lyrics_08_df = pd.DataFrame(songfacts_lyrics_08, columns = ['lyrics'])
songfacts_lyrics_08_df.to_csv('./data/2008_songfacts_lyrics')

In [149]:
songfacts_lyrics_13_df = pd.DataFrame(songfacts_lyrics_13, columns = ['lyrics'])
songfacts_lyrics_13_df.to_csv('./data/2013_songfacts_lyrics')

In [151]:
songfacts_lyrics_18_df = pd.DataFrame(songfacts_lyrics_18, columns = ['lyrics'])
songfacts_lyrics_18_df.to_csv('./data/2018_songfacts_lyrics')