**If you lost points on the last checkpoint you can get them back by responding to TA/IA feedback**  

Update/change the relevant sections where you lost those points, make sure you respond on GitHub Issues to your TA/IA to call their attention to the changes you made here.

Please update your Timeline... no battle plan survives contact with the enemy, so make sure we understand how your plans have changed.

# COGS 108 - Data Checkpoint

# Names

- Zoya Hasan
- Arushi Munjal 
- Shruti Yamala
- Siya Randhawa


# Research Question

Using sentiment analysis, which lyrical themes—such as love, empowerment, struggle, anger, hope, celebration, and nostalgia—are most commonly identified in the top songs of Spotify's English-language genres (Pop, Rap, Country, R&B, Indie) from 2013 to 2023?

## Background and Prior Work

Streaming platforms like Spotify have transformed the music industry, impacting not only how music is consumed but also how data on listener preferences, song popularity, and musical trends are accessed. Understanding the factors that drive a song’s popularity can provide insights for artists and producers looking to create resonant music. While many elements influence popularity—including artist reputation and song structure—lyrics play a critical role by directly conveying emotions and themes that listeners connect with. Our project seeks to explore how specific lyrical themes correlate with song popularity across various genres, highlighting trends that resonate most with audiences on platforms like Spotify.

Previous studies have explored the role of sentiment analysis in predicting song popularity and genre classification. For instance, a study published in Ultimatics: Jurnal Teknik Informatika developed a BERT-based model to predict song popularity based on sentiment analysis of English song lyrics, achieving a notable accuracy of 87% through oversampling and data preprocessing techniques. This study found that the sentiment expressed in lyrics, such as positivity or negativity, was a significant factor in popularity. By capturing the sentiment with BERT, they successfully linked lyrical sentiment with popularity trends, underscoring the impact of lyrical emotion on audience engagement and song performance [1].

Another approach was taken by Boonyanit and Dahl at Stanford, who aimed to classify songs into genres based solely on lyrical content, using GloVe embeddings and LSTM models to predict genre with a peak accuracy of 68%. Their work illuminated the capacity of lyrics to signal genre-related characteristics, especially through distinctive words and recurring themes. By focusing on genre classification, this study underscored how lyrical content often aligns with genre conventions, revealing differences in word choice and thematic style across genres like hip-hop, pop, and rock, despite overlaps [2].

Building on these studies, our project diverges by focusing not on predicting popularity or genre independently but on understanding how specific themes within lyrics correlate with popularity across genres like pop, hip-hop, rock, and country. We are not merely classifying songs by sentiment or genre; rather, we are examining genre as a contextual factor in lyrical themes. This will allow us to determine which themes—such as love, nostalgia, or resilience—drive higher engagement in particular genres, offering insights into the preferences of genre-specific audiences. Our findings can aid musicians in tailoring lyrics to align with listener tastes, leveraging data to enhance song impact on streaming platforms.

1. Sentiment Analysis on Song Lyrics for Song Popularity Prediction Using BERT Algorithm, Ultimatics: Jurnal Teknik Informatika, 2023.
2. Music Genre Classification using Song Lyrics, Stanford CS224N Custom Project, 2023.


# Hypothesis


We predict that the presence of specific lyrical themes in Spotify songs, such as heartbreak and love, will correlate more strongly with popularity in genres such as Pop and Indie, while lyrical themes such as empowerment will correlate more strongly with popularity in genres such as Rap.

This hypothesis draws on our general familiarity with popular songs in Pop, Indie, and Rap: well-known songs in these genres often center on themes such as heartbreak, love, and empowerment.
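
One way this hypothesis could be checked is to correlate a binary theme flag with `track_popularity` separately within each genre. The sketch below runs on a small made-up table: the `has_love` column and every value in it are illustrative assumptions, not results from our data, and Pearson correlation on a 0/1 flag is only one possible choice of measure.

```python
import pandas as pd

# Toy data for illustration only: the values are invented, not drawn from the dataset.
# `has_love` is a hypothetical 0/1 flag marking whether a song's lyrics carry the theme.
toy = pd.DataFrame({
    "playlist_genre":   ["pop", "pop", "pop", "pop", "rap", "rap", "rap", "rap"],
    "has_love":         [1, 1, 0, 0, 1, 0, 1, 0],
    "track_popularity": [80, 70, 50, 40, 40, 60, 50, 70],
})

def theme_popularity_corr(df, theme_col, genre_col="playlist_genre", pop_col="track_popularity"):
    """Pearson correlation between a 0/1 theme flag and popularity, computed per genre."""
    return {
        genre: group[theme_col].corr(group[pop_col])
        for genre, group in df.groupby(genre_col)
    }

corr_by_genre = theme_popularity_corr(toy, "has_love")
print(corr_by_genre)
```

In this toy table the flag is positively correlated with popularity in pop and negatively in rap, which is the kind of genre-specific contrast the hypothesis describes. Real data would also need enough songs per genre and theme for the correlations to be meaningful.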

# Data

## Data overview


- Dataset #2
  - Dataset Name: 150K Lyrics Labeled with Spotify Valence
  - Link to the dataset: https://www.kaggle.com/datasets/edenbd/150k-lyrics-labeled-with-spotify-valence/data  
  - Number of observations: 150,000
  - Number of variables: 5


This dataset contains 150,000 song lyrics labeled with Spotify Valence scores, which range from 0 to 1 and indicate the emotional positivity of a song (1 being highly positive). Important variables include artist (artist name), seq (song lyrics), song (song title), and label (valence score). The seq column requires preprocessing, such as tokenization, stopword removal, and text normalization, while label serves as the target for mood prediction. Additional steps include handling missing values, balancing the valence distribution, and potentially engineering features like word sentiment or linguistic complexity.
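
The preprocessing steps named above (text normalization, tokenization, stopword removal) could look roughly like the following. This is a minimal sketch using only the standard library; the stopword list is a tiny hand-picked assumption, and the function name `preprocess_lyrics` is hypothetical.

```python
import re

# A tiny hand-picked stopword list, for illustration only; a real run would
# likely use a fuller list from an NLP library.
STOPWORDS = {"a", "an", "and", "the", "in", "of", "to", "is", "are", "you", "i", "my", "me", "it"}

def preprocess_lyrics(text):
    """Lowercase, strip punctuation, tokenize on whitespace, and drop stopwords."""
    text = text.lower()
    # Replace everything except letters, digits, apostrophes, and whitespace.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    cleaned = []
    for tok in text.split():
        tok = tok.strip("'")  # remove stray quotes around a word, keep "don't"
        if tok and tok not in STOPWORDS:
            cleaned.append(tok)
    return cleaned

tokens = preprocess_lyrics("The trees, are singing in the wind")
print(tokens)
```

In the notebook this could be applied to the `seq` column with `.apply(preprocess_lyrics)` once the dataset is loaded into a DataFrame.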



## Audio Features and Lyrics of Spotify Songs

```python
import pandas as pd

# Load the Spotify songs dataset (audio features and lyrics).
spotify_songs = pd.read_csv('/Users/shruti14/Downloads/spotify_songs.csv')

# Keep only English-language songs.
spotify_songs = spotify_songs[spotify_songs['language'] == 'en']

# Drop audio-feature columns that are not needed for the lyrical analysis.
spotify_songs = spotify_songs.drop(columns=['mode', 'key', 'speechiness', 'loudness',
                                            'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo',
                                            'danceability', 'energy'])

# Reduce the album release date to its year, then keep 2013 through 2023 (inclusive).
spotify_songs['track_album_release_date'] = pd.to_datetime(spotify_songs['track_album_release_date']).dt.year
spotify_songs = spotify_songs[spotify_songs['track_album_release_date'].between(2013, 2023)]
spotify_songs = spotify_songs.reset_index(drop=True)
spotify_songs.head()
```

| | track_id | track_name | track_artist | lyrics | track_popularity | track_album_id | track_album_name | track_album_release_date | playlist_name | playlist_id | playlist_genre | playlist_subgenre | duration_ms | language |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 004s3t0ONYlzxII9PLgU6z | I Feel Alive | Steady Rollin | The trees, are singing in the wind The sky blu... | 28 | 3z04Lb9Dsilqw68SHt6jLB | Love & Loss | 2017 | Hard Rock Workout | 3YouF0u7waJnolytf9JCXf | rock | hard rock | 373512 | en |
| 1 | 00emjlCv9azBN0fzuuyLqy | Dumb Litty | KARD | Get up out of my business You don't keep me fr... | 65 | 7h5X3xhh3peIK9Y0qI5hbK | KARD 2nd Digital Single ‘Dumb Litty’ | 2019 | K-Party Dance Mix | 37i9dQZF1DX4RDXswvP6Mj | pop | dance pop | 193160 | en |
| 2 | 00f9VGHfQhAHMCQ2bSjg3D | Soldier | James TW | Hold your breath, don't look down, keep trying... | 70 | 3GNzXsFbzdwM0WKCZtgeNP | Chapters | 2019 | urban contemporary | 4WiB26kw0INKwbzfb5M6Tv | r&b | urban contemporary | 224720 | en |
| 3 | 00Gu3RMpDW2vO9PjlMVFDL | Hide Away (feat. Envy Monroe) | Blasterjaxx | Don't run away, it's getting colder Our hearts... | 42 | 5pqG85igfoeWcCDIsSi9x7 | Hide Away (feat. Envy Monroe) | 2019 | Big Room EDM - by Spinnin' Records | 7xWdFCrU5Gka6qp1ODrSdK | edm | big room | 188000 | en |
| 4 | 00HIh9mVUQQAycsQiciWsh | Limestone | Magic City Hippies | How many friends are you gonna set on fire? Ho... | 58 | 7mtoEwzZYBqG8JYItxcccG | Hippie Castle EP | 2015 | Indie Poptimism | 1pZWCY50kMUhshcESknir8 | pop | indie poptimism | 209165 | en |
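
A possible next step on this cleaned table is the keyword-based theme extraction mentioned in our timeline. The sketch below is only an illustration of the idea: the keyword lists in `THEME_KEYWORDS` are hypothetical placeholders that cover three of our themes, and the function name `tag_themes` is our own invention.

```python
import re

# Hypothetical keyword lists, chosen only to illustrate the idea; the real
# lists would need to be built and validated against the lyrics.
THEME_KEYWORDS = {
    "love": {"love", "heart", "kiss"},
    "hope": {"hope", "dream", "believe"},
    "anger": {"hate", "rage", "angry"},
}

def tag_themes(lyrics):
    """Return the sorted list of themes whose keywords appear in the lyrics."""
    # Exact word matching: "hearts" would not count as a match for "heart".
    words = set(re.findall(r"[a-z']+", lyrics.lower()))
    return sorted(theme for theme, keys in THEME_KEYWORDS.items() if words & keys)

themes = tag_themes("Hold your breath, I still believe in love")
print(themes)
```

Applied to the table above, this could be run as `spotify_songs['lyrics'].apply(tag_themes)`. Because the matching is exact, stemming or lemmatizing the lyrics first would likely catch more variants of each keyword.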


# Ethics & Privacy

While analyzing the datasets, we identified several ethical issues that matter to the data science process, particularly around privacy and bias. Although the datasets we use, such as Audio Features and Lyrics of Spotify Songs, are openly accessible, they still raise concerns. One of the main issues is possible bias in representation. The Spotify songs dataset, for instance, has over 18,000 tracks, but it primarily features well-known songs and mainstream performers, possibly overlooking the variety of musical genres and up-and-coming musicians. This bias may lead to inaccurate inferences about the themes and emotional components that are common in popular music lyrics.

The terms of use for each dataset should also be taken into account when collecting data. Because the lyrics in the dataset are protected by intellectual property laws, copyright regulations must be followed, and use of the lyrics is restricted to scholarly and non-commercial purposes. We will make sure that our analysis complies with these restrictions and does not improperly reveal copyrighted content.

To address the biases we have identified and to account for hidden factors that could affect our findings, we plan to carry out exploratory data analysis at several levels. We will first look at the datasets' demographic representation to find any notable biases. To make sure our results do not unduly favor popular music, we will then analyze whether characteristics such as lyrical sentiment correspond with popularity across a variety of genres.
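
The representation check described above could start with something as simple as the share of songs per genre. The sketch below uses an invented stand-in table; genre share is only one rough proxy for representation, and the helper name `genre_shares` is hypothetical.

```python
import pandas as pd

# Toy stand-in for the cleaned table; the rows are invented for illustration.
toy = pd.DataFrame({
    "playlist_genre": ["pop", "pop", "pop", "rap", "rock", "r&b"],
    "track_popularity": [65, 70, 58, 42, 28, 70],
})

def genre_shares(df, genre_col="playlist_genre"):
    """Fraction of rows that fall in each genre, largest share first."""
    return df[genre_col].value_counts(normalize=True)

shares = genre_shares(toy)
print(shares)
```

A genre that holds a much larger share than the others would be a signal to report results per genre, or to resample, rather than pooling all songs together.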

To ensure responsible data analysis, we address a number of ethical issues while using the Audio Features and Lyrics of Spotify Songs dataset. We follow fair use rules by concentrating on broad subject trends rather than directly quoting copyrighted lyrics. Furthermore, our study respects the dataset's original use and is solely academic and non-commercial. We will examine more general trends across genres rather than assigning particular themes to specific performers because we acknowledge that lyrics may contain sensitive or personal material. Without disclosing private information, this method preserves the integrity of the analysis. Even if the lyrics are openly accessible, we take care of any personally identifiable information in a responsible manner to prevent invasive or speculative conclusions about the artists based just on the lyrics. By looking at a range of genres and recognizing constraints, we mitigate potential genre biases despite the dataset's emphasis on popular music. Our objective is to advance an inclusive perspective on musical trends and themes.

An additional ethical consideration pertains to the findings' equitable impact. We need to be careful about how our analysis might affect how people view certain musical genres or performers, especially if our results imply that some genres are more popular than others. We aim to promote a more inclusive understanding of music data and its implications by being conscious of potential biases and having an open discussion about them. It should be noted that the goal of our project is to determine a connection between the popularity of songs and their lyrical themes, or messages. Our ultimate objective is to maintain moral principles in our study, favorably advancing our knowledge of music popularity while honoring the rights and talents of all participating artists.


# Team Expectations 

Team Expectation 1: Structure of Meetings and Communication
Weekly meetings will be held on Tuesdays at 6 PM on Zoom, and we will communicate by iMessage. 

Team Expectation 2: Role Assignment and Task Distribution
Tasks will be assigned according to individual strengths and tracked with frequent updates in a shared Google Spreadsheet. If a team member is having trouble with a task, they should notify the others at least three days before the due date so that any required modifications can be made.

Team Expectation 3: Resolving Conflicts and Making Decisions
Major choices will need unanimous approval, but minor ones will be decided by a majority vote. The member with the most appropriate experience will be given preference for each specialized task, and any outstanding conflicts will be settled by week seven if needed.

Team Expectation 4: Accountability and Progress Monitoring
Each group member will report their task status in the shared Google Spreadsheet at least once every week to guarantee steady development. Members will provide a quick update on their accomplished work and any obstacles they are encountering during each weekly meeting.

# Project Timeline Proposal

Specify your team's specific project timeline. An example timeline has been provided. Change the dates, times, names, and details to fit your group's plan.

If you think you will need any special resources or training outside what we have covered in COGS 108 to solve your problem, then your proposal should state these clearly. For example, if you have selected a problem that involves implementing multiple neural networks, please state this so we can make sure you know what you’re doing and so we can point you to resources you will need to implement your project. Note that you are not required to use outside methods.



| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 10/29        | 1 PM         | Review COGS108 expectations and brainstorm initial project ideas on music analysis | Decide on communication methods, finalize the research question on lyrical themes and popularity scores, and outline initial themes to analyze (e.g., love, empowerment, heartbreak). |
| 11/5         | 10 AM        | Conduct background research on available datasets and methods for lyrical analysis | Discuss potential datasets (e.g., Genius for lyrics, Spotify API for popularity scores), consider ethical implications, and draft the project proposal. Explore methods for lyrical theme extraction using keywords and tools (e.g., NLP or regular expressions). |
| 11/13        | 10 AM        | Finalize and submit the project proposal; identify relevant datasets covering 2013-2023 | Plan data wrangling steps for extracting lyrical themes and popularity scores, assign group roles, and set responsibilities for specific tasks. Ensure the proposal details the keyword-based approach for theme extraction and genre-specific popularity analysis.|
| 11/19        | 6 PM         | Import and preprocess data for lyrical themes and popularity scores from Spotify and other sources | Review data wrangling progress and discuss exploratory data analysis (EDA) to understand genre and theme distributions. Assign team members to handle genre-specific data wrangling for more effective preprocessing.|
| 11/27        | 12 PM        | Complete data wrangling and initial EDA; begin analyzing lyrical themes and popularity across genres | Review preliminary analysis results, make adjustments as needed, and perform a mid-project check-in. |
| 12/3         | 12 PM        | Complete final analysis; draft results, conclusions, and discussion sections | Review and edit the full project, finalizing insights on the relationship between lyrical themes and song popularity across genres from 2013 to 2023. Add a genre-specific findings section to note if certain themes (like resilience or joy) align with higher scores in specific genres.|
| 12/11        | Before 11:59 PM | N/A | Submit the final project and complete group project surveys. |