A Look Into The Top 100 Songs on Billboard (Week of August 4th, 2025)

In [1]:
import pandas as pd 
import numpy as np

data = pd.read_csv('../billboard_hot_100_with_lyrics.csv')
data.dtypes

title              object
artist             object
lyrics             object
rank                int64
last_week           int64
peak_rank           int64
weeks_on_chart      int64
release_year      float64
lyric_length        int64
dtype: object

In [2]:
data.head()

Unnamed: 0,title,artist,lyrics,rank,last_week,peak_rank,weeks_on_chart,release_year,lyric_length
0,Ordinary,Alex Warren,44 ContributorsTranslationsItalianoEspañolУкра...,1,1,1,25,2025.0,356
1,Golden,"HUNTR/X: EJAE, Audrey Nuna & REI AMI",53 ContributorsTranslationsRomanizationEnglish...,2,2,2,6,2025.0,345
2,What I Want,Morgan Wallen Featuring Tate McRae,22 ContributorsTranslationsFrançaisItalianoNed...,3,4,1,11,2025.0,433
3,Daisies,Justin Bieber,44 ContributorsTranslationsČeskyItalianoPolski...,4,3,2,3,2025.0,341
4,Just In Case,Morgan Wallen,17 ContributorsTranslationsEspañolJust In Case...,5,5,2,19,2025.0,391


Data Cleaning

In [3]:
import re

def clean_lyrics(lyrics):
    """Strip the scraped page header from a lyrics string and normalize whitespace."""
    if pd.isna(lyrics) or lyrics == '':
        return None
    
    # Make sure we are working with a str so slicing and regex calls behave
    lyrics = str(lyrics)
    
    # Lazily match everything up to and including the first 'lyrics'
    # (removes the contributors/translations header line)
    lyrics_match = re.search(r'.*?lyrics\s*', lyrics, re.IGNORECASE)
    if lyrics_match:
        lyrics = lyrics[lyrics_match.end():]
        
    # Same idea for a 'Read More' blurb; re.DOTALL lets '.' match newlines,
    # so the match can span multiple lines
    read_more_match = re.search(r'.*?read more\s*', lyrics, re.IGNORECASE | re.DOTALL)
    if read_more_match:
        lyrics = lyrics[read_more_match.end():]
        
    lyrics = re.sub(r'[“”]', '"', lyrics)  # Normalize curly double quotes to straight quotes
    lyrics = re.sub(r'\s+', ' ', lyrics)   # Collapse whitespace runs (including newlines) into one space

    lyrics = lyrics.strip()
    
    return lyrics

print(f"First 200 characters of the first song: \n{data['lyrics'][0][0:200]}\n")

data['lyrics'] = data['lyrics'].apply(clean_lyrics)

print(f"First full song: \n{data['lyrics'][0]}")

First 200 characters of the first song: 
44 ContributorsTranslationsItalianoEspañolУкраїнськаDeutschРусский (Russian)PortuguêsČeskyOrdinary Lyrics
They say, "The holy water's watered down
And this town's lost its faith
Our colors will fade e

First full song: 
They say, "The holy water's watered down And this town's lost its faith Our colors will fade eventually" So, if our time is runnin' out Day after day We'll make the mundane our masterpiece Oh, my, my Oh, my, my love I take one look at you You're takin' me out of the ordinary I want you layin' me down 'til we're dead and buried On the edge of your knife, stayin' drunk on your vine The angels up in the clouds are jealous knowin' we found Somethin' so out of the ordinary You got me kissin' thе ground of your sanctuary Shatter me with your touch, oh, Lord, return mе to dust The angels up in the clouds are jealous knowin' we found Hopeless hallelujah On this side of Heaven's gate Oh, my life, how do ya Breathe and take my breath away?
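
As a quick check of what the cleaning steps do, here is a minimal, standard-library-only sketch that applies the same header-stripping and normalization regexes to a made-up string. The sample text and the variable names `raw`, `header`, and `body` are assumptions for illustration and are not taken from the dataset.

```python
import re

# Synthetic stand-in for a scraped lyrics field (hypothetical, not from the dataset)
raw = (
    "12 ContributorsTranslationsEspañolExample Song Lyrics\n"
    "She said, \u201cHello\u201d\n"
    "And   then  goodbye"
)

# Lazy match up to and including the first "lyrics" (case-insensitive)
header = re.search(r'.*?lyrics\s*', raw, re.IGNORECASE)
body = raw[header.end():] if header else raw

body = re.sub(r'[\u201c\u201d]', '"', body)  # curly double quotes -> straight quotes
body = re.sub(r'\s+', ' ', body).strip()     # collapse whitespace runs, including newlines

print(body)
```

One thing to watch for: the slice starts at the end of the match, so a song whose header does not contain the word "lyrics" would lose everything up to the first "lyrics" that appears in its own text.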

In [4]:
# Identify songs that will be removed
songs_to_remove = data[data['lyrics'].isna()]
print("Songs that will be removed (no lyrics after cleaning):")
for _, song in songs_to_remove.iterrows():
    print(f" - '{song['title']}' by {song['artist']} (Rank #{song['rank']})")

# Remove songs without lyrics
print(f"\nSongs before removing null lyrics: {len(data)}")
data = data.dropna(subset=['lyrics'])
print(f"Songs after removing null lyrics: {len(data)}")

Songs that will be removed (no lyrics after cleaning):
 - 'Takedown' by JEONGYEON, JIHYO & CHAEYOUNG Of TWICE (Rank #67)

Songs before removing null lyrics: 100
Songs after removing null lyrics: 99


In [5]:
# Convert title, artist, and lyrics to string (though lyrics should already be string after our cleaning)
data['title'] = data['title'].astype('string')
data['artist'] = data['artist'].astype('string')
data['lyrics'] = data['lyrics'].astype('string')

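# release_year was read as float64; cast it to an integer
# (this requires that no missing values remain in the column)
data['release_year'] = data['release_year'].astype('int64')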

# Add a new column for word count
data['word_count'] = data['lyrics'].apply(lambda x: len(x.split()))

data.dtypes

title             string[python]
artist            string[python]
lyrics            string[python]
rank                       int64
last_week                  int64
peak_rank                  int64
weeks_on_chart             int64
release_year               int64
lyric_length               int64
word_count                 int64
dtype: object

In [6]:
data.isna().sum()

title             0
artist            0
lyrics            0
rank              0
last_week         0
peak_rank         0
weeks_on_chart    0
release_year      0
lyric_length      0
word_count        0
dtype: int64

In [7]:
data.to_csv('../billboard_hot_100_cleaned.csv', index=False)

Sentiment

In [11]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()