# Collecting Billboard Hot 100s and Spotify URIs

1. Billboard Hot 100
The Python package billboard.py doesn't support extracting year-end Hot 100 lists, and Billboard now requires a Pro subscription to view them. To work around this, I used Beautiful Soup to scrape each Billboard Year-End Hot 100 from 2000 through 2024 from Wikipedia, then manually filled in the few entries the scraper missed because of irregular HTML structure in some of the chart tables.

2. Spotipy
The second portion of this notebook retrieves the Spotify URI for each track in the dataframe (a Spotify URI is the unique identifier for a track on Spotify, in the form `spotify:track:<id>`). This was done with [spotipy](https://spotipy.readthedocs.io/en/2.25.1/), a Python library for the Spotify Web API. We'll work with it more in the following notebooks.
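
To make the URI format concrete, here is a minimal sketch that turns a track URI into the link that opens the same track in a browser. The helper name `uri_to_url` is hypothetical and not part of spotipy; the example URI is one of the values used later in this notebook.

```python
def uri_to_url(uri: str) -> str:
    """Convert 'spotify:<type>:<id>' into its open.spotify.com link (hypothetical helper)."""
    # a URI has exactly three colon-separated parts; unpacking raises ValueError otherwise
    scheme, kind, spotify_id = uri.split(':')
    if scheme != 'spotify':
        raise ValueError(f'Not a Spotify URI: {uri!r}')
    return f'https://open.spotify.com/{kind}/{spotify_id}'


example_uri = 'spotify:track:4rZB2G955dQMcjlb7e3VNB'
print(uri_to_url(example_uri))
```

The URI is what gets stored in the dataframe; the URL form is only handy for spot-checking a match by hand.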

## Imports

In [1]:
# uncomment to install the required packages
# !pip install billboard.py
# !pip install spotipy
# !pip install lyricsgenius

In [2]:
import numpy as np
import pandas as pd
import billboard

# for web scraping the billboard rankings
import requests
from bs4 import BeautifulSoup
import time

# for spotipy
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
from tqdm.notebook import tqdm
from requests.exceptions import ReadTimeout, ConnectionError
import re

# for genius collaborators
import lyricsgenius

## Billboard Hot 100

In [3]:
# define the range
years = range(2000, 2025)
# create an empty list for the songs
bb_100 = []

# set headers to mimic a browser
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/115.0.0.0 Safari/537.36'
}

# loop through each year's Wikipedia page
for year in years:
    # base url
    url = f'https://en.wikipedia.org/wiki/Billboard_Year-End_Hot_100_singles_of_{year}'
    try:
        response = requests.get(url, headers=headers)
        if response.status_code != 200:
            print(f'Failed {year}: Status {response.status_code}')
            continue

        soup = BeautifulSoup(response.text, 'html.parser')
        table = soup.find('table', {'class': 'wikitable'})
        if not table:
            print(f'No table found for {year}')
            continue

        rows = table.find_all('tr')
        year_songs = 0

        for row in rows[1:]:  # skip header
            cols = row.find_all(['td', 'th'])
            if len(cols) < 3:
                continue

            # get text from any nested tags, joined with single spaces
            # (get_text avoids the deprecated find_all(text=True) call)
            rank = cols[0].get_text(' ', strip=True)
            title = cols[1].get_text(' ', strip=True)
            artist = cols[2].get_text(' ', strip=True)

            if not rank or not title or not artist:
                continue

            bb_100.append({
                'year': year,
                'rank': rank,
                'title': title,
                'artist': artist
            })
            year_songs += 1

        # if fewer than 100 songs scraped, warn
        if year_songs < 100:
            print(f'Warning: {year_songs} songs scraped for {year} (expected 100)')

        print(f'Loaded {year} ({year_songs} songs)')
        time.sleep(1)  # polite delay

    except Exception as e:
        print(f'Failed {year}: {e}')

# convert to DataFrame
bb_100 = pd.DataFrame(bb_100)

# strip stray whitespace from the text columns
bb_100['title'] = bb_100['title'].str.strip()
bb_100['artist'] = bb_100['artist'].str.strip()


Loaded 2000 (99 songs)
Loaded 2001 (100 songs)
Loaded 2002 (100 songs)
Loaded 2003 (100 songs)
Loaded 2004 (100 songs)
Loaded 2005 (100 songs)
Loaded 2006 (100 songs)
Loaded 2007 (100 songs)
Loaded 2008 (98 songs)
Loaded 2009 (100 songs)
Loaded 2010 (100 songs)
Loaded 2011 (100 songs)
Loaded 2012 (99 songs)
Loaded 2013 (99 songs)
Loaded 2014 (100 songs)
Loaded 2015 (99 songs)
Loaded 2016 (98 songs)
Loaded 2017 (100 songs)
Loaded 2018 (100 songs)
Loaded 2019 (100 songs)
Loaded 2020 (100 songs)
Loaded 2021 (100 songs)
Loaded 2022 (100 songs)
Loaded 2023 (100 songs)
Loaded 2024 (99 songs)


### Find Missing

The scraper missed a total of 9 tracks across 7 years. The cell below identifies the missing ranks; I then looked each one up on the corresponding Wikipedia page and added it to the dataframe manually.

In [4]:
# set the rank column to numeric
bb_100['rank'] = pd.to_numeric(bb_100['rank'], errors='coerce')

# loop through each year
for year in bb_100['year'].unique():
    ranks = bb_100[bb_100['year'] == year]['rank'].dropna().astype(int)
    missing = set(range(1, 101)) - set(ranks)
    if missing:
        print(f'Year {year} is missing ranks: {sorted(missing)}')

Year 2000 is missing ranks: [23]
Year 2008 is missing ranks: [10, 17]
Year 2012 is missing ranks: [17]
Year 2013 is missing ranks: [18]
Year 2015 is missing ranks: [10]
Year 2016 is missing ranks: [2, 21]
Year 2024 is missing ranks: [20]


In [5]:
# create a dictionary of the missing rows (year, rank, title, artist)
bb_missing = {
    'year': [2000, 2008, 2008, 2012, 2013, 2015, 2016, 2016, 2024],
    'rank': [23, 10, 17, 17, 18, 10, 2, 21, 20],
    'title': ['I Need to Know', 'Forever', "Don't Stop the Music", 'Whistle', 'Wrecking Ball',
              'The Hills', 'Sorry', 'Heathens', 'Snooze'],
    'artist': ['Marc Anthony', 'Chris Brown', 'Rihanna', 'Flo Rida', 'Miley Cyrus',
               'The Weeknd', 'Justin Bieber', 'Twenty One Pilots', 'SZA'],
}

# make into dataframe
bb_missing = pd.DataFrame(bb_missing)

In [6]:
# combine the two dataframes
bb_all = pd.concat([bb_100, bb_missing], ignore_index=True)

print(bb_all.shape)   # new row count

(2500, 4)
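
A row count of 2,500 does not by itself show that every year has ranks 1 through 100 exactly once. The sketch below is one way to check that, using a small toy dataframe; `check_complete_charts` is a hypothetical helper, and it assumes the same `year` and `rank` column names used above.

```python
import pandas as pd


def check_complete_charts(df, n_ranks=100):
    """Return {year: {'missing': [...], 'duplicated': [...]}} for incomplete years."""
    problems = {}
    expected = set(range(1, n_ranks + 1))
    for year, group in df.groupby('year'):
        # coerce to numbers and ignore anything that is not a valid rank
        ranks = pd.to_numeric(group['rank'], errors='coerce').dropna().astype(int)
        missing = sorted(expected - set(ranks))
        duplicated = sorted(ranks[ranks.duplicated()].unique().tolist())
        if missing or duplicated:
            problems[int(year)] = {'missing': missing, 'duplicated': duplicated}
    return problems


# toy data: 2000 is complete for n_ranks=3, 2001 repeats rank 1 and lacks rank 2
toy = pd.DataFrame({'year': [2000, 2000, 2000, 2001, 2001, 2001],
                    'rank': [1, 2, 3, 1, 1, 3]})
print(check_complete_charts(toy, n_ranks=3))
```

Calling `check_complete_charts(bb_all)` should return an empty dictionary if the manual additions closed every gap.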


### Extract the Main Artists

To find the songs on Spotify, I split each artist credit into the main artist and any featured artists, storing each in its own column.

In [7]:
# copy original DataFrame
bb_clean = bb_all.copy()

# get main artist
def extract_main_artist(artist_name):
    """Return the text before the first 'with', 'feat.', 'featuring', '&', or 'and'."""
    split_pattern = r'\s+(with|feat\.|featuring|&|and)\s+'
    main_artist = re.split(split_pattern, artist_name, flags=re.I)[0]
    return main_artist.strip(' "\'')

# get featured artists
def extract_featured_artists(artist_name):
    """Return everything after the main artist, or None if there is no separator.

    The pattern uses a capturing group, so re.split keeps the separator words
    and the result looks like 'featuring Rob Thomas'.
    """
    split_pattern = r'\s+(with|feat\.|featuring|&|and)\s+'
    parts = re.split(split_pattern, artist_name, flags=re.I)
    if len(parts) > 1:
        return ' '.join(parts[1:]).strip(' "\'')
    return None

# remove special characters, extra spaces, and quotes
def clean_title(title):
    """Remove punctuation and surrounding spaces/quotes for better search"""
    return re.sub(r'[^\w\s]', '', title).strip(' "\'')

# create new columns
bb_clean['main_artist'] = bb_clean['artist'].apply(extract_main_artist)
bb_clean['featured_artists'] = bb_clean['artist'].apply(extract_featured_artists)
bb_clean['title'] = bb_clean['title'].apply(clean_title)
bb_clean['artist'] = bb_clean['artist'].str.strip(' "\'')

# print the first 10
bb_clean[['title', 'artist', 'main_artist', 'featured_artists']].head(10)


| | title | artist | main_artist | featured_artists |
|---|---|---|---|---|
| 0 | Breathe | Faith Hill | Faith Hill | |
| 1 | Smooth | Santana featuring Rob Thomas | Santana | featuring Rob Thomas |
| 2 | Maria Maria | Santana featuring The Product G&B | Santana | featuring The Product G&B |
| 3 | I Wanna Know | Joe | Joe | |
| 4 | Everything You Want | Vertical Horizon | Vertical Horizon | |
| 5 | Say My Name | Destiny's Child | Destiny's Child | |
| 6 | I Knew I Loved You | Savage Garden | Savage Garden | |
| 7 | Amazed | Lonestar | Lonestar | |
| 8 | Bent | Matchbox Twenty | Matchbox Twenty | |
| 9 | He Wasnt Man Enough | Toni Braxton | Toni Braxton | |
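
In the output above, `featured_artists` still contains the separator word ("featuring Rob Thomas"). That happens because `re.split` returns the text matched by a capturing group along with the pieces around it. The sketch below shows a variant with a non-capturing group that returns only the names. It is an illustration and not the approach used in this notebook; `split_artist_credit` is a hypothetical helper.

```python
import re

# (?:...) is a non-capturing group, so re.split drops the separator words
SEPARATORS = re.compile(r'\s+(?:with|feat\.|featuring|&|and)\s+', flags=re.I)


def split_artist_credit(artist_name):
    """Return (main_artist, [featured artists]) without separator words."""
    parts = [p.strip(' "\'') for p in SEPARATORS.split(artist_name)]
    return parts[0], parts[1:]


print(split_artist_credit('Santana featuring Rob Thomas'))
print(split_artist_credit('QB Finest featuring Nas and Bravehearts'))
print(split_artist_credit('Faith Hill'))
```

Both versions share one limitation: a band whose name contains a standalone "and" or "&" is split as if it were a collaboration, so the results are worth a quick review.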


## Spotipy URIs

In [8]:
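# define credentials (the real values are hidden for security)
# assumption: they are read here from environment variables so the cell runs
import os
client_id = os.environ.get('SPOTIPY_CLIENT_ID')
client_secret = os.environ.get('SPOTIPY_CLIENT_SECRET')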

# assign the spotify credentials
auth_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret, requests_timeout=30)
sp = spotipy.Spotify(auth_manager=auth_manager)

In [9]:
# remove special characters from the title
def clean_title(title):
    '''Remove special characters to improve search'''
    return re.sub(r'[^\w\s]', '', title)

# get the spotify track uris
def get_track_uri_retry(title, artist, max_retries=3):
    '''Try to get Spotify URI with retries on timeout/connection errors'''
    for attempt in range(max_retries):
        try:
            query = f'track:{title} artist:{artist}'
            results = sp.search(q=query, type='track', limit=1)
            if results['tracks']['items']:
                return results['tracks']['items'][0]['uri']
            return None
        except (ReadTimeout, ConnectionError) as e:
            # exponential backoff: wait 1s, 2s, 4s, ... before the next attempt
            wait_time = 2 ** attempt
            print(f'{type(e).__name__} for {title} by {artist}, retrying in {wait_time}s...')
            time.sleep(wait_time)
    return None

# if the uri is not found, look under the featured artists
def get_track_uri_flexible(title, main_artist, featured_artists=None, max_retries=3):
    '''
    Try fetching URI first with main_artist.
    If not found and featured_artists exist, try including them.
    '''
    # first attempt: main artist only
    uri = get_track_uri_retry(title, main_artist, max_retries=max_retries)
    if uri is None and featured_artists:
        # second attempt: include featured artists
        combined_artist = f'{main_artist} {featured_artists}'
        uri = get_track_uri_retry(title, combined_artist, max_retries=max_retries)
    return uri

# update dataframe
bb_clean['clean_title'] = bb_clean['title'].apply(clean_title)
if 'spotify_uri' not in bb_clean.columns:
    bb_clean['spotify_uri'] = None

# grab the uris
for i, row in tqdm(bb_clean.iterrows(), total=len(bb_clean), desc='Fetching Spotify URIs'):
    if pd.isna(bb_clean.at[i, 'spotify_uri']):
        uri = get_track_uri_flexible(
            row['clean_title'], 
            row['main_artist'], 
            row.get('featured_artists')
        )
        bb_clean.at[i, 'spotify_uri'] = uri

    # save every 100 tracks
    if i % 100 == 0 and i > 0:
        bb_clean.to_csv('billboard_with_uris_partial.csv', index=False)
        print(f'Saved progress at row {i}...')

# final save
bb_clean.to_csv('billboard_with_uris.csv', index=False)
print(f"Done! Found URIs for {bb_clean['spotify_uri'].notna().sum()} out of {len(bb_clean)} tracks.")

Saved progress at row 100...
Saved progress at row 200...
Saved progress at row 300...
Saved progress at row 400...
Saved progress at row 500...
Saved progress at row 600...
Saved progress at row 700...
Saved progress at row 800...
Saved progress at row 900...
Saved progress at row 1000...
Saved progress at row 1100...
Saved progress at row 1200...
Saved progress at row 1300...
Saved progress at row 1400...
Saved progress at row 1500...
Saved progress at row 1600...
Saved progress at row 1700...
Saved progress at row 1800...
Saved progress at row 1900...
Saved progress at row 2000...
Saved progress at row 2100...
Saved progress at row 2200...
Saved progress at row 2300...
Saved progress at row 2400...
Done! Found URIs for 2487 out of 2500 tracks.
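
The loop above skips rows that already have a `spotify_uri` and writes a partial CSV every 100 rows, which makes it possible to resume after an interruption. A minimal sketch of the reload step follows, assuming rows can be matched on `year` and `rank`. The helper `load_checkpoint` is hypothetical, and the demonstration uses a temporary file and a placeholder URI in place of the real partial CSV.

```python
import os
import tempfile

import pandas as pd


def load_checkpoint(df, path):
    """Copy URIs already saved at `path` into a copy of df, matched on (year, rank)."""
    df = df.copy()
    if 'spotify_uri' not in df.columns:
        df['spotify_uri'] = None
    if not os.path.exists(path):
        return df  # nothing saved yet, start from scratch
    saved = pd.read_csv(path).dropna(subset=['spotify_uri'])
    known = dict(zip(zip(saved['year'], saved['rank']), saved['spotify_uri']))
    keys = zip(df['year'], df['rank'])
    # keep any value already present when the checkpoint has no entry for the row
    df['spotify_uri'] = [known.get(k, old) for k, old in zip(keys, df['spotify_uri'])]
    return df


# toy demonstration: one of two tracks was fetched before the interruption
tracks = pd.DataFrame({'year': [2000, 2000], 'rank': [1, 2],
                       'title': ['Breathe', 'Smooth']})
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'partial.csv')
    pd.DataFrame({'year': [2000], 'rank': [1],
                  'spotify_uri': ['spotify:track:PLACEHOLDER']}).to_csv(path, index=False)
    resumed = load_checkpoint(tracks, path)

print(resumed['spotify_uri'].tolist())
```

After reloading, rerunning the fetch loop only queries the rows that are still empty.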


### Missing Tracks

In [10]:
# get the tracks with missing uris
missing_tracks = bb_clean[bb_clean['spotify_uri'].isna()]
missing_tracks[['title', 'artist']]

| | title | artist |
|---|---|---|
| 90 | 247 | Kevon Edmonds |
| 196 | Oochie Wally | QB Finest featuring Nas and Bravehearts |
| 362 | My Love Is LikeWo | Mýa |
| 412 | FreekaLeek | Petey Pablo |
| 531 | Obsession No Es Amor | Frankie J featuring Baby Bash |
| 591 | NumbEncore | Jay-Z and Linkin Park |
| 698 | For You I Will Confidence | Teddy Geiger |
| 1103 | Fuck You Forget You | CeeLo Green |
| 1235 | Niggas in Paris | Jay-Z and Kanye West |
| 1287 | Cashin Out | Cash Out |


In [11]:
# manually looked-up URIs for the 13 tracks without a match, keyed by DataFrame index
manual_uris = {
    90: 'spotify:track:4rZB2G955dQMcjlb7e3VNB',
    196: 'spotify:track:6oztAZh36mqITxIGJLu22C',
    362: 'spotify:track:3Gamc2D6VSlXpUcmhPUFYt',
    412: 'spotify:track:4MeDnO5yA2Zi6IMlVApRci',
    531: 'spotify:track:4UC4H4vX3bJtcgtKR0ZCFJ',
    591: 'spotify:track:7dyluIqv7QYVTXXZiMWPHW',
    698: 'spotify:track:2ijMU9lYFCvGBzgwJ8Sd8q',
    1103: 'spotify:track:3ydfhgIZIc2j39NLIhpJpq',
    1235: 'spotify:track:1auxYwYrFRqZP7t3s7w4um',
    1287: 'spotify:track:1POAx4NMLOBPVKZUSsBh92',
    2014: 'spotify:track:2wrJq5XKLnmhRXHIAf9xBa',
    2112: 'spotify:track:6Im9k8u9iIzKMrmV7BWtlF',
    2248: 'spotify:track:3ncmoWTwJgx63LwMTyBCXf'
}

# loop through to update the dataframe
for idx, uri in manual_uris.items():
    bb_clean.at[idx, 'spotify_uri'] = uri

# save the updated DataFrame
bb_clean.to_csv('billboard_with_uris_final.csv', index=False)
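
Before moving on, it can help to audit the final column. The sketch below counts missing values and flags malformed or repeated URIs. It assumes track IDs are 22 base-62 characters, which matches the URIs in this notebook; `audit_uris` is a hypothetical helper, shown here on toy data.

```python
import re

import pandas as pd

# assumption: a track URI is 'spotify:track:' followed by 22 base-62 characters
TRACK_URI = re.compile(r'spotify:track:[0-9A-Za-z]{22}')


def audit_uris(df, col='spotify_uri'):
    """Summarise missing, malformed, and repeated URIs in df[col]."""
    uris = df[col]
    present = uris.dropna().astype(str)
    malformed = [i for i, u in present.items() if not TRACK_URI.fullmatch(u)]
    repeated = present[present.duplicated(keep=False)].index.tolist()
    return {'missing': int(uris.isna().sum()),
            'malformed': malformed,
            'repeated': repeated}


toy = pd.DataFrame({'spotify_uri': ['spotify:track:4rZB2G955dQMcjlb7e3VNB',
                                    None,
                                    'spotify:track:short',
                                    'spotify:track:4rZB2G955dQMcjlb7e3VNB']})
print(audit_uris(toy))
```

A repeated URI is not necessarily an error, since a song can appear on the year-end chart in more than one year. Because the search takes only the first result (`limit=1`), though, repeats are a reasonable place to look for mismatches.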